Robotics is nearing its ChatGPT moment
We grew up watching the Jetsons, dreaming about what the future would look like: self-driving flying cars, delivery robots dropping off our groceries, and humanoid companions making our coffee.
The technology to make this future a reality isn’t far away, but the pace of innovation is hard to project. AI is the perfect case study: over the past decade, the advent of transformer-based AI models has radically transformed the way we learn, work, and play. 52% of Americans and 78% of organizations report that they are using AI; three years ago, those figures were in the low single digits.
Today, robotics is nearing its ChatGPT moment. Waymos are already shuttling passengers in American cities, Figure and Agility are producing humanoids for manufacturing deployment, and food delivery robots are unmissable in Miami and LA. Still, robotics lags behind the widespread adoption of text-based language models. What’s keeping us from living like the Jetsons? Data.
Today’s AI models are trained almost entirely on what already exists on the internet: digital artifacts of text, image, and video, and plenty of them. For coding and writing tasks, trillions of tokens of training data have produced incredibly performant models.
Data from our physical world is much scarcer, yet it’s required to train the autonomous robots and vehicles of the future. This is a complicated problem: it’s not just about collecting the data (both internal telemetry such as limb position and force, and external signals such as audio and video), but also benchmarking it and comparing models against one another. The real world is unpredictable and full of edge cases. We need open, interoperable, and live models to help devices navigate the physical world.
In theory, that’s where DePIN (Decentralized Physical Infrastructure Networks) physical data collection comes in. The idea is to use cryptoeconomic incentives to mobilize and reward large, distributed groups of people to collect data to train these models and build the backbone of physical AI. As more data is collected, a flywheel effect takes hold and it becomes easier for teams to confidently source the data essential for next-generation opportunities. Teams can use networks of validators to enforce data quality standards and build structures to reward contributors for accuracy, uniqueness, and quality.
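To make that idea concrete, here is a minimal sketch in Python of how a validator might score a contributed sample and translate accuracy, uniqueness, and quality signals into a token reward. The weights, thresholds, signal names, and reward amounts are entirely hypothetical and are not modeled on any specific network.

```python
# Hypothetical sketch of a validator-side scoring and reward rule for a DePIN
# data network; all weights and amounts are illustrative, not a real protocol.
from dataclasses import dataclass

@dataclass
class Submission:
    device_id: str
    accuracy: float    # 0-1, e.g. agreement with other validators' checks
    uniqueness: float  # 0-1, e.g. distance from data already in the network
    quality: float     # 0-1, e.g. sensor resolution / completeness checks

BASE_REWARD = 10.0  # tokens paid for a fully accepted sample (illustrative)

def score(sub: Submission) -> float:
    """Weighted blend of the three signals; weights here are arbitrary."""
    return 0.5 * sub.accuracy + 0.3 * sub.uniqueness + 0.2 * sub.quality

def reward(sub: Submission, min_score: float = 0.6) -> float:
    """Reject low-quality submissions outright, scale payout otherwise."""
    s = score(sub)
    return 0.0 if s < min_score else BASE_REWARD * s

if __name__ == "__main__":
    sample = Submission("dashcam-042", accuracy=0.9, uniqueness=0.7, quality=0.8)
    print(f"reward: {reward(sample):.2f} tokens")
```

The design choice this sketch illustrates is the one described above: payouts are gated on quality first, then scaled, so contributors are rewarded for useful data rather than raw volume.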
Importantly, this data would lead to community-powered open models, in contrast to the siloed, proprietary research being advanced by well-funded labs. A community-built model with an endless stream of updated data points expands the playing field for developing state-of-the-art models to a wider set of players.
Physical AI requires a new approach to data collection
If you add up all open robotics datasets today (vision, manipulation, driving, simulation), you get roughly 5TB. For comparison, LLM training corpora run to 100TB or more. There are very few open-source, real-time, fine-grained, global maps of road hazards. There’s no universal, timely inventory of sidewalk obstacles, construction zones, or changing factory floors.
Big companies are finding their own way to gather this data: Tesla is paying $48 an hour to “data collection operators” to do things like repeatedly fold laundry to train Optimus robots. But this data is proprietary. And even Tesla has limitations: the datasets big companies collect are often insufficiently granular and limited in their use and reach. For example, it’s hard to get data to train self-driving cars in rural areas. It’s even harder if you have to account for rain, or snow, or the potential for unexpected wildlife (how many Teslas have seen a moose cross the road?).
One tactic being used to generate this data is simulation. While simulated data has advantages, it also has clear downsides:
Collection Type | Description | Drawbacks |
---|---|---|
Synthetic data | Artificially created using generative models or physics-based simulations to mimic real-world data for AI training and evaluation. Sometimes, AI is trained on game environments (yes, video games), where robot actions are simulated and trained on millions of gameplay hours. | Struggles with real-world transfer in many tasks and often doesn't capture the complexity of certain environments. Also can be prone to overfitting of synthetic artifacts. |
Video simulation data | Consists of visual recordings or renderings that are useful for visual perception, navigation, or manipulations. | Videos can't convey tactile feedback and force, so models trained on video miss critical sensory cues. |
Teleoperation data | Collected when humans remotely operate robots or devices, generating demonstrations of real-world actions for training AI. (See this example from PrismaX) | Difficult to scale: requires extensive hardware setups and coordination between human operators and physical systems. |
Simulated data will continue to improve, but it will almost always fall short of what real-world data can provide. For physical AI to scale, we’ll need billions of hours of real-world video, sensor data, and geospatial updates: data that a distributed, crowdsourced approach can feasibly supply.
DePIN in Action: How Blockchain and Decentralization Bridge the Gap
A DePIN-powered approach to this challenge rewards anyone with a relevant device (dashcams, drones, robots) for contributing to these models.
This isn’t the first instance of crypto incentives being used for data collection. Some of the largest open source datasets today were bootstrapped with crypto networks. Grass has generated 1M frames for the cliptagger visual language model, Frodobots generated 2,000 hours of data from tele-operated robots on sidewalks across America, and Reborn has 200,000 monthly active users helping generate data for their models.
As physical AI grows, we’re seeing early examples of distributed data collection emerge across multiple verticals:
Autonomous Vehicles
By the end of 2025, a fleet of connected vehicles could generate 10 exabytes of data globally each month, with a single fully-autonomous car capable of producing as much as 19 terabytes an hour. To safely deploy autonomous vehicles, models need to be trained and refreshed with data spanning different geographies, weather, and driving cultures—a scale that no single company can manage.
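Taking the two figures above at face value, a quick back-of-envelope calculation shows how concentrated that data volume is per vehicle; the constants below are simply the numbers quoted in this paragraph.

```python
# Rough arithmetic using the figures quoted above; all constants are illustrative.
TB_PER_CAR_HOUR = 19              # one fully autonomous car at peak output
EB_PER_MONTH_FLEET = 10           # projected fleet-wide total per month
HOURS_PER_MONTH = 30 * 24

fleet_tb_per_month = EB_PER_MONTH_FLEET * 1_000_000      # 1 EB = 1,000,000 TB
vehicle_hours = fleet_tb_per_month / TB_PER_CAR_HOUR      # ~526,000 vehicle-hours
equivalent_cars = vehicle_hours / HOURS_PER_MONTH         # ~730 cars running 24/7

print(f"{vehicle_hours:,.0f} vehicle-hours/month ~ {equivalent_cars:,.0f} cars driving nonstop")
```

In other words, raw volume accumulates quickly; the harder part is the breadth of geographies, weather, and driving cultures described above.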
Projects like Hivemapper, ROVR, and NATIX address this gap by turning everyday drivers into decentralized mappers. NATIX’s partnership with Grab, Southeast Asia’s leading superapp, is a live demonstration: thousands of drivers contribute real-time road video to continually update mapping data, with over 250,000 drivers collectively mapping 170 million kilometers thus far. Grab is a paying customer, not just a research partner: a sign that DePIN can deliver commercial as well as technical value.
These are often complementary, not competitive data sets. In this case:
- NATIX is focused on a device that pulls data out of existing Tesla cameras
- Hivemapper is installing net new dashcams on cars and trucks
- ROVR is collecting LIDAR data rather than camera footage
Drones and Precision Sensing
Drones are revolutionizing agriculture, logistics, and critical infrastructure monitoring. Their business case depends on accurate, hyperlocal, and constantly refreshed data. Distributed networks like GEODNET and Onocoy turn rural collectives, farmers, and entrepreneurs into contributors to globally accurate, decentralized sensor and location networks. Raad Labs takes a different approach, enabling critical infrastructure providers to request regular, bespoke monitoring of systems like solar arrays and construction sites. This approach lowers agricultural monitoring costs by up to 73% while making the backbone network for delivery drones, ag-bots, and surveillance scalable even in remote areas.
Humanoid Robots and Industrial Automation
The humanoid robot market is exploding. Some predictions have the market scaling from 3,500 robots in 2024 to 1.4 million in 2035. Morgan Stanley projects that the humanoid robot market will reach $5 trillion by 2050. Each robot deployed to a new task needs unique motion, environmental, and handling data.
Startups like Bitrobot, PrismaX, and Reborn are experimenting with decentralized networks where robots or human operators “teach” others by uploading data to open platforms. Current revenue is modest (the total customer base for robotics data remains just a handful of buyers like Google DeepMind, Figure Robotics, and Tesla), though distribution models are becoming more generalizable. Some teams are considering capturing more value by training their own models.
An adjacent category is spatial perception, a subset of projects that are crowdsourcing 3D maps, mostly for robots. Auki Labs, OverTheReality, and MeshMap are building in this space. There’s also the possibility of a broader geospatial perception category (essentially, ChatGPT for earth) that’s less mature but has the potential to crack open markets worth $100B alone.
Game Engines Data
Game engine data is simulated, but we’re including it here because it demonstrates another approach to how DePIN can be used to train physical models: through video games.
There are now decades of research on how to efficiently simulate real-world activity, refined through millions of hours of gameplay. Today, World Models like Genie3 are being used to simulate a robot’s actions before they’re enacted in the real world, often training on millions of gameplay hours. (Or, check out this paper on using Grand Theft Auto to train self-driving cars.)
DePIN incentives allow for many more possibilities and millions more hours of gameplay to be used to help train these models. Games have the advantage of simulating responses to events that we want to be prepared for, but hope we never have to face (terrorist attacks, fires, large earthquakes, etc). Shaga, for example, is an ultra-low-latency peer-to-peer gaming platform that also rewards users for their gameplay. Their output is an action-labeled data corpus, combining controls, frames, and engine events.
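As an illustration only, the sketch below shows what one record in an action-labeled gameplay corpus like the one just described might contain, pairing each frame with the controls and engine events captured at the same instant. The field names and structure are hypothetical, not Shaga’s actual format.

```python
# Hypothetical sketch of one record in an action-labeled gameplay corpus;
# field names and layout are illustrative, not Shaga's actual schema.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class GameplayStep:
    timestamp_ms: int                       # capture time within the session
    frame_ref: str                          # pointer to the encoded video frame
    controls: Dict[str, float]              # e.g. {"steer": -0.2, "throttle": 1.0}
    engine_events: List[str] = field(default_factory=list)  # e.g. ["collision"]

# A training episode is an ordered sequence of these steps, which a world model
# can consume as (observation, action, event) tuples.
episode: List[GameplayStep] = [
    GameplayStep(0, "frame_000001.bin", {"steer": 0.0, "throttle": 0.8}),
    GameplayStep(33, "frame_000002.bin", {"steer": -0.1, "throttle": 0.8}, ["near_miss"]),
]
print(len(episode), "steps in this episode")
```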
Market sizing
Projecting the size of the physical AI and DePIN market is part science, part informed optimism. Here’s where consensus is emerging:
Vertical | Market Size Potential | Notes |
---|---|---|
Autonomous driving | $350 billion | Source for TAM by 2035. |
Drone networks | $83 billion | Source for TAM by 2035. Agriculture, surveying, security |
Humanoid robots | $38 billion by 2035 | Source for TAM by 2035. Morgan Stanley sees $5T by 2050. |
Robotic exoskeletons | $19B | Source for TAM by 2035. Market potential in logistics, construction, and military. |
World models & game simulation | $5.8B today | Source for 2035. Uses gameplay to simulate real world events. |
Physical AI overall | $100–$600B+ by 2035 | Varies by analyst, with MCP modelers seeing multipliers from cross-sector growth |
Cold Water: The Hard Realities for Physical AI and DePIN
Physical AI represents a massive opportunity. But it’s also a nascent industry with serious obstacles to overcome before it becomes the $600B market that some analysts predict.
1. Market Fragmentation & Thin Demand
There are about 10 major customers for robotics data in the world (DeepMind, Tesla, and a handful of Chinese tech conglomerates dominate demand). Contract sizes for this type of data are relatively small compared to the addressable market. External data represents only a small slice of the spend on training these models; most funds go toward in-house, proprietary models that 1) allow these companies to build a sustainable moat, and 2) are tailored specifically to their needs. For many startups, the long-run business case beyond token launches is still unproven.
2. Simulations are cheaper–and might be good enough
The real world operates on a set of rules that are nonfungible: physics. The billions of dollars that have been invested in simulation (particularly gaming R&D) over the past several decades can be applied to more and more real-world problems. AI World Models are efficiently learning computable models of reality from simulated data. The Veo3 AI video engine learned how water splashes by analyzing the compendium of YouTube videos of water splashes. Simulated data has the added benefit that it can safely train models for low-probability, tail-risk scenarios.
Simulated data is also generally much cheaper to collect than its physical-world counterpart. So it’s possible that, in the long term, it will win out over real-world data collection.
3. Data Specificity vs. Generality
It’s uncertain how generalizable crowdsourced robot or vehicle data will ever be; most autonomous systems require highly specialized data configured for their own hardware and software constraints. Open-source platforms could make inroads over time, but for now, centralized giants like Waymo and Tesla train on deeply proprietary, usage-specific datasets.
Projects are starting to recognize the importance of generalizable data. Drone data company Spexi, for example, spent a significant amount of time ensuring their data is generalizable and useful to end buyers. Raad Labs takes the opposite approach and works with drone operators to create bespoke aerial intelligence, tailored to customer demand (so there’s always a buyer).
4. Value Tends to Accrue at the Model Layer
Even in a world where DePIN networks are able to collect meaningful training data for robotics use cases, the majority of the value, as we have seen with LLMs, accrues at the model layer. While web publishers like Reddit, Shutterstock, and Reuters struck one-off licensing deals with AI labs, the value of those deals pales in comparison to the revenues collected by the proprietors of the models. Some teams, like Reborn and PrismaX, have aspirations to eventually train their own open-source robotics models. Bitrobot enables researchers to compete directly in building robotics models. Time will tell whether they can be competitive with their closed-source counterparts.
Conclusion
One of the most exciting aspects of physical AI data collection is the chance to distribute its benefits more widely. After decades of big companies freely collecting data without our permission, blockchain-based physical AI gives people all over the world the opportunity to contribute to these models and get paid for their contributions.
But the inherently distributed nature of this approach presents challenges. The fact also remains that the market size for this data is very limited. Founders building for the physical AI future are entering an exciting space, but we have two pieces of advice:
1. Build where the feedback loop between real-world data collection and revenue is shortest, and design your incentives to reward contributors for data quality and frequency. This market’s total addressable market is relatively small, which means it’s critical to know early where your revenue will come from. The best approach is to find a potential buyer or two for your data and build for them directly, so you can be confident that someone needs what you’re building. It’s also critical to pinpoint what buyers in your vertical care about most, whether that’s cheaper data, higher quality or quantity of data, distribution of data, or something else. This is a competitive market, so defining your advantage is essential.
2. Focus on generalizable data to maximize the long-term value and applicability of your dataset across models, partners, and domains. Founders who invest in collection standards and diverse sampling will create assets that appeal to a broader set of buyers and future use cases.
Special thanks to:
Sal from EV3, Alvaro from Borderless, Rob Sarrow from Volt, Evan Feng from CoinFund, Dylan Bane from Messari, the teams at Shaga, Bitrobot, and NATIX, and Connor Lovely from Proof of Coverage.