A summary of the VLA night at the Web Data Loft.
We brought engineers from Agility Robotics, Tesla, Prometheus, and Distill Labs to Bright Data’s Web Data Loft in San Francisco to discuss one question:
What does it actually take to move from a language model to a robot that works in the real world?
The answer was more grounded than the hype suggests. The bottleneck is not only model architecture. It is the training corpus: what you collect, how you mix it, where it comes from, and whether you can curate it at a scale no manual team can match.
On the panel were Sri and Ahmed from Agility Robotics; Ankur, a robotics ML engineer speaking in a personal capacity; Daniel from Prometheus, formerly of 1X and Waymo; and Jacek, co-founder of Distill Labs. The conversation was moderated by Adam of HackerSquad and the Builders Collective.
Below are the five takeaways that matter if you are building a Vision-Language-Action model, a world model, or the data pipeline behind one.
1. A VLA is a VLM with an action head, and its generalization comes from web-scale pretraining
The panel’s working definition was simple: a VLA starts as a vision-language model trained on internet-scale text and images, on tasks such as captioning, segmentation, and object understanding. You then add an action component and fine-tune it on robotic data.
That distinction matters. The robot data teaches execution. The web-scale pretraining teaches the model what the world is.
This is why a VLA can sometimes pick up an object it was never explicitly trained to pick up. The generalization does not come from a small set of teleoperated robot demonstrations alone. It comes from broad visual and semantic exposure before the robot ever enters the loop.
If your pretraining corpus is narrow, no amount of expensive teleoperation data fully buys back the generalization you skipped.
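To make the "VLM plus action head" structure concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `toy_backbone` is a stand-in for a pretrained VLM, `ActionHead` is a bare linear layer, and the weights are hand-picked rather than learned. It is not any panelist's architecture, only the shape of the idea: the backbone supplies features, and the head trained on robot data turns them into an action.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class ActionHead:
    """Linear map from backbone features to a continuous action vector."""
    weights: List[Vector]  # shape: action_dim x feature_dim
    bias: Vector           # shape: action_dim

    def __call__(self, features: Sequence[float]) -> Vector:
        return [
            sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(self.weights, self.bias)
        ]


@dataclass
class VLA:
    """A pretrained backbone (frozen or fine-tuned) followed by an action head."""
    backbone: Callable[[Sequence[float], str], Vector]
    head: ActionHead

    def act(self, image: Sequence[float], instruction: str) -> Vector:
        features = self.backbone(image, instruction)
        return self.head(features)


def toy_backbone(image: Sequence[float], instruction: str) -> Vector:
    """Hypothetical stand-in for a VLM: returns a 2-d feature vector."""
    mean_pixel = sum(image) / len(image)
    return [mean_pixel, float(len(instruction.split()))]


# Hand-picked weights, purely illustrative (a real head is learned from robot data).
head = ActionHead(weights=[[1.0, 0.0], [0.0, 0.5]], bias=[0.0, 0.1])
policy = VLA(backbone=toy_backbone, head=head)
action = policy.act(image=[0.2, 0.4], instruction="pick up the cup")
```

The design point is the split: swapping in a backbone with broader pretraining changes what the model can recognize without touching the head, while more robot data improves the head's mapping from features to motion.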
“It’s trained on internet-scale data on text and images… then you fine-tune the VLM on robotic data and you get a vision-language-action model. The nice thing is it has better generalization: if you train it to pick up a certain object, you can ask it to pick up a different object, because it has seen similar things.”
— Ankur, robotics ML engineer, speaking in a personal capacity. Watch at 9:59 →
📖 Related reading: What is a Vision-Language Model (VLM)? · Best Robotics AI Libraries · Foundation Models explained
2. Vision, language, and action are moving into one token space
Modern VLAs increasingly look like LLMs in one important way: they predict the next token.
That token might be a word, an image patch, or a joint-space control command. As Jacek, co-founder of Distill Labs, explained, the connection to software agents is direct. An LLM calls API tools. A VLA calls physical tools. The harness changes from “call an endpoint” to “grab the cup,” but the underlying pattern is similar.
The implication is powerful: every modality that can be tokenized can become part of the same training space. Web video, egocentric footage, human demonstrations, teleoperation, and on-policy robot data can all contribute to a shared representation.
The constraint then shifts from “can the model use this?” to “can we source the right examples at the right scale?”
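One way a joint-space command can become a token is uniform binning: clip each continuous value to a range, then map it to one of a fixed number of integer bins. The sketch below assumes 256 bins and a symmetric range of -1.0 to 1.0; both are illustrative choices, and the function names are hypothetical. The panel did not specify a tokenization scheme, so treat this as one plausible option rather than the method any team uses.

```python
from typing import List, Sequence


def action_to_tokens(
    action: Sequence[float], low: float, high: float, n_bins: int = 256
) -> List[int]:
    """Map each continuous action value to an integer bin index in [0, n_bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        # The top edge of the range falls into the last bin.
        tokens.append(min(int(fraction * n_bins), n_bins - 1))
    return tokens


def tokens_to_action(
    tokens: Sequence[int], low: float, high: float, n_bins: int = 256
) -> List[float]:
    """Decode bin indices back to the center value of each bin."""
    width = (high - low) / n_bins
    return [low + (token + 0.5) * width for token in tokens]


tokens = action_to_tokens([-1.0, 0.0, 1.0], low=-1.0, high=1.0)  # [0, 128, 255]
decoded = tokens_to_action(tokens, low=-1.0, high=1.0)
```

Decoding returns bin centers, so the round trip loses at most half a bin width per dimension. More bins reduce that error but enlarge the vocabulary the model has to predict over.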
“You can think about your action space as function calling for LLMs… you break it down like that and it’s not different from what people build for the non-physical world, agents that spin up sub-agents in a harness that exposes tools. Now the harness is more physical. That’s what makes it powerful, because you can rely on web training data to get a pretty good starting point.”
— Jacek, co-founder, Distill Labs. Watch at 15:14 →
📖 Related reading: Tokenization explained · Inside the AI Agent Tech Stack · How to Build AI Agents: Complete Roadmap
3. VLAs and world models need different data, and confusing the two is costly
One of the sharpest distinctions of the night was between VLA training and world-model training.
As Ankur framed it, a VLA is largely an imitation-learning problem. You want clean, successful, high-quality trajectories. Bad demonstrations can hurt.
A world model is different. It needs to predict what happens next given an action, which means it has to understand not only successful outcomes, but also mistakes, edge cases, and failures. If you want to use a world model for planning or as a learned simulator for reinforcement learning, it has to represent the full range of possible futures.
Daniel, an engineer at Prometheus who previously led world-model work at 1X, explained why this is hard. Many current world models are biased toward successful outcomes. When shown a trajectory that is about to fail, they may hallucinate a recovery instead of modeling the mistake. In robotics, that is especially dangerous. The model must be action-controllable precisely at the moments where contact, grasping, and failure are most likely.
The takeaway: “robotics data” is not one generic bucket. Imitation policies and world models require deliberately different corpora.
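In pipeline terms, the distinction can be as simple as applying different filters to the same episode store. The sketch below assumes a hypothetical `Episode` record with a `success` flag; real logs would carry far more metadata, and the filtering rule here is deliberately crude. It shows the imitation set keeping only successful trajectories while the world-model set retains failures.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Episode:
    """Hypothetical episode record; field names are illustrative."""
    episode_id: str
    success: bool


def split_corpora(
    episodes: Sequence[Episode],
) -> Tuple[List[Episode], List[Episode]]:
    """Return (imitation_set, world_model_set) drawn from the same episodes."""
    # Imitation learning copies what it sees, so keep successful trajectories only.
    imitation_set = [e for e in episodes if e.success]
    # A world model must predict failures too, so it keeps every episode.
    world_model_set = list(episodes)
    return imitation_set, world_model_set


def failure_fraction(episodes: Sequence[Episode]) -> float:
    """Share of episodes that ended in failure (0.0 for an empty set)."""
    if not episodes:
        return 0.0
    return sum(not e.success for e in episodes) / len(episodes)


logged = [
    Episode("ep-001", success=True),
    Episode("ep-002", success=False),
    Episode("ep-003", success=True),
]
imitation_set, world_model_set = split_corpora(logged)
```

Tracking `failure_fraction` on the world-model set is a cheap check against the success bias Daniel described: if it drifts toward zero, the model has little evidence for what a mistake looks like.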
“You really want a world model that is very action-controllable… the make-or-break moment when you’re grasping an object. If you get gaps there, that’s a really bad sign.”
— Daniel, Prometheus, formerly 1X. Watch at 35:36 →
📖 Related reading: What is AI Model Training? · AI Hallucination explained · Robotics datasets
4. The data hierarchy is real: web data gives breadth, robot data gives control
Ahmed, an engineer at Agility Robotics, laid out a clear hierarchy of signal.
Teleoperation data contains the strongest control information because it includes the full robot state. Human demonstrations and egocentric video carry less direct control signal. Web video carries the least at the low-level control layer.
But that does not make web data less important. It makes its role different.
Web-scale video teaches semantics, context, task structure, object diversity, and general world knowledge. It helps the model understand what rooms, tools, people, objects, and goals look like across enormous variation. What it does not teach well is the fine-grained physics of a specific robot body executing a specific action.
Ankur gave the clearest analogy: you can watch every Messi or Ronaldo video ever recorded and understand soccer deeply, but you still cannot play without practicing. Web data teaches the game. On-robot data teaches the body.
The practical data-budget insight came from the same exchange: one hour of web data may provide roughly the transferable value of five minutes of teleoperation data. Web data does not replace teleop, but strong web-scale pretraining can reduce how much expensive robot data you need.
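The arithmetic behind that rule of thumb is a 12:1 ratio (60 minutes of web data to 5 minutes of teleop). The sketch below turns it into a back-of-the-envelope converter. The ratio comes from a hedged remark on the panel ("maybe"), not a measured constant, and the function name and default are assumptions for illustration.

```python
# 60 web minutes per 5 teleop minutes, from the panel's rough estimate.
WEB_MINUTES_PER_TELEOP_MINUTE = 60 / 5  # 12.0


def teleop_equivalent_hours(
    web_hours: float, ratio: float = WEB_MINUTES_PER_TELEOP_MINUTE
) -> float:
    """Rough transferable value of web data, expressed in teleop hours."""
    if web_hours < 0:
        raise ValueError("web_hours must be non-negative")
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return web_hours / ratio


equivalent = teleop_equivalent_hours(1200)  # 100.0
```

Under this estimate, 1,200 hours of web video carries about the transferable value of 100 teleop hours. The conversion says nothing about low-level control signal, which only robot data supplies, so it is a budgeting aid and not a substitution rule.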
“We can watch a lot of soccer videos of Messi or Ronaldo, but until we go practice ourselves we can’t really play. The understanding of the task we get from web data. To actually execute it, we need on-robot data… maybe one hour of web data is the same as five minutes of teleop data.”
— Ankur, robotics ML engineer, speaking in a personal capacity. Watch at 1:01:09 →
📖 Related reading: Video data for AI · YouTube Videos Dataset · Audio Datasets for AI · Image Datasets
5. There are no reliable scaling laws yet, so curation speed becomes the advantage
For LLMs, the industry has Kaplan and Chinchilla scaling laws. For VLAs and world models, Daniel was direct: robotics is not there yet.
Teams still cannot reliably predict robot performance as a clean function of web tokens, teleop hours, deployment data, compute, or model size. Part of the challenge is that imitation learning and world modeling use different supervision signals. Another is that the metric that matters is downstream task success, not pretraining loss.
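For readers who have not fitted one, a scaling law is usually a power law: a metric falls as a resource grows, and the curve is a straight line on log-log axes. The sketch below fits y = a * x^(-b) by least squares in log space. The data points are synthetic, generated from a known curve purely to show that the fit recovers it; they are not robot measurements, and no such law has been established for VLAs or world models.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = a * x**(-b) in log-log space. Returns (a, b)."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    variance = sum((lx - mean_x) ** 2 for lx in log_x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free points from y = 0.8 * x**(-0.25). Illustrative only.
teleop_hours = [10.0, 100.0, 1000.0, 10000.0]
failure_rate = [0.8 * h ** -0.25 for h in teleop_hours]

a, b = fit_power_law(teleop_hours, failure_rate)
```

The fit itself is easy. The open problem is that real robot results depend on several data sources at once and on downstream task success, so a single clean exponent like `b` may not exist.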
Daniel also drew a useful contrast with autonomous-vehicle simulation. In self-driving, the simulation often stops when contact happens. In robotics, contact is where the real complexity begins. Grasping, pushing, slipping, deforming, colliding, and recovering are not edge cases. They are the task.
Until better scaling laws emerge, the advantage goes to teams that can find and curate the right examples fastest: specific scenes, task families, object interactions, failures, and contact-rich moments. That is not just a modeling challenge. It is a discovery and data-pipeline challenge.
“Answering scaling laws with respect to flop counts or token counts is now common for LLMs, Kaplan et al., the Chinchilla scaling laws. We’re not really asking those questions to scientifically compare VLAs and world models today… I think the answer is we’re not there yet, and we really should get there.”
— Daniel, Prometheus, formerly 1X and Waymo. Watch at 54:35 →
📖 Related reading: Data Discovery · Best AI Training Data Providers · LLM Training Data
What this means for your robotics data strategy
The panel converged on a clear conclusion:
Web-scale data gives robots a broad understanding of the world. On-robot data teaches them how to act in it. The better your pretraining corpus, the less expensive robot data you need to reach reliable execution.
Acting on that requires three capabilities most teams underestimate:
🌐 Web-scale extraction
Petabyte-scale video, image, and audio collection from the open web, not only frozen academic datasets with outdated taxonomies. See Bright Data’s web-scale data collection infrastructure and custom data solutions.
🔍 Visual discovery beyond keyword search
The most valuable task diversity often appears in scenes that are never described in a title, tag, or caption. Keyword search misses much of the long tail. Explore visual and semantic discovery via the Discover API.
⚖️ Defensible provenance
Text models train on trillions of tokens. VLAs train on trillions of frames. Every frame can carry a licensing and provenance question, and real-world robot deployment raises the stakes. Learn more in our Trust Center and our ethical data collection guidelines.
The models are converging. The differentiator is becoming the corpus: how broad it is, how relevant it is, and whether you can defend where it came from.
Building a VLA or world model?
Talk to our team about discovering and sourcing training video at web scale →
Learn more about Bright Data for AI, explore our video data offering for VLAs, or browse our ready-made datasets for robotics, computer vision, and multimodal training.