Foundation models have driven major progress in robotics, enabling the creation of Vision-Language-Action (VLA) models that can generalize to objects, scenes, and tasks beyond their training data. However, adoption of these models has been limited by their closed nature and by a lack of best practices for deploying and adapting them to new environments.
To address these challenges, researchers from Stanford University, UC Berkeley, Toyota Research Institute, Google DeepMind, and other labs have introduced OpenVLA, an open-source VLA model trained on a diverse collection of real-world robot demonstrations.
According to the researchers, OpenVLA outperforms comparable models on robot manipulation tasks. It can also be efficiently fine-tuned for settings that involve multiple objects and multiple tasks, and it is designed to run on consumer-grade GPUs and to be fine-tuned at minimal cost.
With foundation models becoming the cornerstone of robotics, OpenVLA could make these models more accessible and customizable for a wider range of companies and research labs.
Vision-Language-Action Models for Robotics
Classic robot learning policies struggle to generalize beyond their training data. They are not robust to scene distractors or unfamiliar objects, and they have difficulty executing task instructions that differ even slightly from what they were trained on.
Large Language Models (LLMs) and Vision-Language Models (VLMs) possess this kind of generalization ability thanks to the world knowledge captured in their internet-scale pre-training datasets. Recently, research labs have started using LLMs and VLMs as building blocks for training robot policies.
One popular approach is to use pre-trained LLMs and VLMs as components for task planning and execution in modular systems. Another direction is to train Vision-Language-Action Models (VLAs) end-to-end to directly generate robot control actions. Examples of VLAs include RT-2 and RT-2-X, which have set new standards for generalist robot policies.
However, current VLAs face two main challenges. First, they are closed, with little visibility into their architecture, training procedure, and data mixture. Second, there is a lack of best practices for deploying and adapting VLAs to new robots, environments, and tasks.
"We believe that, in order to lay a solid foundation for future research and development, robot technology needs open-source VLAs that support effective fine-tuning and adaptation, similar to the existing ecosystem around open-source language models," the researchers wrote.
OpenVLA
OpenVLA is a 7-billion-parameter open-source VLA built on the Prismatic-7B Vision-Language Model. It pairs a two-part visual encoder, which extracts features from input images, with a Llama-2 7B language backbone that processes the language instructions.
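To make the architecture concrete, here is a minimal sketch of how such a model could wire a two-part visual encoder into the language backbone: patch features from both encoders are concatenated, projected into the LLM's embedding space, and prepended to the embedded instruction. The class and module names below are illustrative placeholders, not OpenVLA's actual implementation.

```python
# Illustrative sketch of a VLA-style forward pass with a two-part visual encoder.
# All names are placeholders; this is not OpenVLA's real code.
import torch
import torch.nn as nn

class VLASketch(nn.Module):
    def __init__(self, vision_encoder_a, vision_encoder_b, llm, vis_dim, llm_dim):
        super().__init__()
        # Two vision backbones whose patch features will be fused.
        self.vision_encoder_a = vision_encoder_a
        self.vision_encoder_b = vision_encoder_b
        # Projects fused visual features into the LLM's embedding space.
        self.projector = nn.Linear(vis_dim, llm_dim)
        # Llama-2 7B-style causal language model backbone.
        self.llm = llm

    def forward(self, image, instruction_embeds):
        # Each encoder returns patch features of shape (batch, num_patches, dim);
        # concatenate along the feature dimension and project to visual tokens.
        fused = torch.cat([self.vision_encoder_a(image),
                           self.vision_encoder_b(image)], dim=-1)
        visual_tokens = self.projector(fused)
        # Prepend the projected visual tokens to the embedded instruction and let
        # the language model predict the next tokens (which encode the action).
        inputs = torch.cat([visual_tokens, instruction_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```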
To create OpenVLA, the researchers fine-tuned the Prismatic model on a large dataset of 970,000 robot manipulation trajectories from the Open X-Embodiment dataset, covering a wide range of robot embodiments, tasks, and scenes. They also configured the model to output special tokens that can be mapped to robot actions.
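The action-token interface can be illustrated with a short sketch: continuous robot actions are discretized into a fixed number of bins per dimension, and each bin index becomes a token the language model can emit. The bin count and action range below are illustrative assumptions rather than OpenVLA's exact settings.

```python
# Sketch of mapping continuous robot actions to discrete tokens and back.
# NUM_BINS and the normalized range are assumptions for illustration only.
import numpy as np

NUM_BINS = 256          # assumed number of discrete bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Discretize a continuous action vector (e.g. 7-DoF end-effector deltas)
    into integer bin indices that can be emitted as special LLM tokens."""
    clipped = np.clip(action, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1))
    return bins.astype(np.int64)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Map predicted bin indices back to continuous values (bin centers)."""
    return LOW + (tokens.astype(np.float64) / (NUM_BINS - 1)) * (HIGH - LOW)

# Example: a 7-dimensional action (x, y, z, roll, pitch, yaw, gripper).
action = np.array([0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0])
print(tokens_to_action(action_to_tokens(action)))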
OpenVLA receives a natural language instruction (e.g., "wipe the table") along with input images captured by a camera. The model reasons over the instruction and visual input to determine which sequence of action tokens will enable the robot to accomplish the desired task.
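For readers who want to try the released checkpoint, the sketch below shows roughly how a single action could be queried through the Hugging Face transformers library, loosely following the usage pattern published in the OpenVLA repository. The model ID, the predict_action helper, and the unnorm_key argument are assumptions that should be checked against the released code.

```python
# Rough sketch of querying an OpenVLA checkpoint for one action.
# Verify the model ID, predict_action, and unnorm_key against the released repo.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"  # assumed Hugging Face model ID

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("camera_frame.png")  # current camera view of the scene
prompt = "In: What action should the robot take to wipe the table?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# predict_action decodes the action tokens and de-normalizes them for the target robot.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # e.g. a 7-DoF end-effector delta plus a gripper command
```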
According to the researchers, OpenVLA outperforms the previous state-of-the-art VLA, the 55-billion-parameter RT-2-X model, on the WidowX and Google Robot embodiments.
The researchers also experimented with efficient fine-tuning strategies for VLAs across seven manipulation tasks, ranging from picking and placing objects to cleaning a table. Fine-tuned OpenVLA policies outperformed fine-tuned pre-trained policies. Fine-tuning also improved OpenVLA's performance on instructions that require grounding language in multi-object, multi-task behaviors.
"It is worth noting that most prior work achieves strong performance either on narrow single-instruction tasks or diverse multi-instruction tasks, leading to large success rate differences," the researchers wrote. "OpenVLA is the only method that achieves at least 50% success rate on all test tasks, indicating that it can be a strong default option for imitation learning tasks, especially if they involve a diverse range of language instructions."
The researchers also made OpenVLA more accessible and computationally efficient through optimization techniques. They used Low-Rank Adaptation (LoRA) to fine-tune OpenVLA for new tasks on a single A100 GPU in 10-15 hours, an 8x reduction in compute compared to full fine-tuning. Through model quantization, they were able to shrink the OpenVLA model and run it on consumer-grade GPUs without significant performance degradation.
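As an illustration of what such parameter-efficient adaptation looks like in practice, the sketch below loads a checkpoint with 4-bit quantization via bitsandbytes and attaches LoRA adapters with the peft library. The rank, target modules, and other hyperparameters are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of quantized loading plus LoRA adapters; hyperparameters are assumptions.
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model weights in 4-bit precision to fit on smaller GPUs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",              # assumed model ID
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# Attach low-rank adapters so only a small set of extra weights is trained.
lora_config = LoraConfig(
    r=32,                              # assumed LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",       # adapt all linear layers; adjust as needed
)

vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()       # only a small fraction of weights are trainable
```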
Open-Sourcing OpenVLA
The researchers have open-sourced all the models, deployment and fine-tuning notebooks, and the OpenVLA codebase for large-scale VLA training, "hoping that these resources will drive future exploration and adaptation of robot VLAs," they wrote. The codebase supports fine-tuning models on a single GPU and training VLAs with billions of parameters on multi-node GPU clusters. It is also compatible with modern optimization and parallelization techniques.
Going forward, the researchers plan to improve OpenVLA by extending it to support multiple image inputs, proprioceptive inputs, and observation history. They also suggest that using VLMs pre-trained on interleaved image and text data could make it easier to fine-tune VLAs with such flexible inputs.