MIT Develops New Technology to Advance General-Purpose Robot Training

2024-10-29

In the classic animated series "The Jetsons," the robotic maid Rosie effortlessly transitions from cleaning to cooking and taking out the trash, demonstrating the desirable capabilities of a versatile robot. However, in reality, developing a general-purpose robot capable of performing multiple tasks remains a substantial challenge.

Conventional robot training typically requires collecting extensive data for a specific task and training in a controlled environment. This process is both time-consuming and labor-intensive, and the resulting robots often struggle to adapt to new environments or tasks. To address this issue, researchers at the Massachusetts Institute of Technology have developed a new technique that trains robots to perform a wide range of tasks by integrating heterogeneous data from multiple sources.

The researchers' approach involves aligning data from various fields and modalities—such as simulated environments, real-world robot operations, visual sensors, and robot arm position encoders—into a common "language" that a generative AI model can process. This strategy enables the utilization of large datasets to train robots without having to start from scratch each time.

According to Lirui Wang, the first author of the paper and a graduate student in MIT's Department of Electrical Engineering and Computer Science (EECS), the data challenge in robotics goes beyond insufficient quantity: the data also originates from diverse domains, modalities, and robot hardware. Their research therefore focuses on integrating this heterogeneous data to train more versatile robots.

Drawing inspiration from large language models, Lirui Wang and his team developed a novel architecture called the Heterogeneous Pre-trained Transformer (HPT). This architecture processes data from various modalities and domains in a unified way, leveraging extensive pre-training data to enhance a robot's adaptability.

In the HPT framework, the researchers employ a machine learning model known as a transformer to handle visual and proprioceptive inputs. They organize this data into a format that transformers can process, referred to as tokens. As the transformer is scaled up and trained on more data, its performance improves accordingly.
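The idea of mapping different modalities into a shared token space can be illustrated with a minimal sketch. This is not the authors' HPT code; the stem functions, the token dimension, and the mean-pooling "trunk" are illustrative stand-ins for learned, per-embodiment encoders feeding a shared transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

TOKEN_DIM = 64  # shared width of the trunk's token space (illustrative)

def vision_stem(image, w):
    """Slice an image into row strips and project each strip to one token."""
    strips = image.reshape(16, -1)     # 16 strips of 16 RGB pixels (48 values)
    return strips @ w                  # -> (16, TOKEN_DIM) tokens

def proprio_stem(joint_state, w):
    """Project a robot's joint-state vector to a single token."""
    return (joint_state @ w)[None, :]  # -> (1, TOKEN_DIM) token

def shared_trunk(tokens):
    """Stand-in for the shared transformer: here, a simple mean over tokens."""
    return tokens.mean(axis=0)

# Embodiment-specific projection weights (the per-robot "stems").
w_vis = rng.normal(size=(16 * 3, TOKEN_DIM))   # 16 RGB pixels per strip
w_prop = rng.normal(size=(7, TOKEN_DIM))       # a 7-DoF arm

image = rng.normal(size=(16, 16, 3))           # camera observation
joint_state = rng.normal(size=7)               # arm position encoder reading

# Both modalities become tokens of the same width, so one model can read them.
tokens = np.concatenate([vision_stem(image, w_vis),
                         proprio_stem(joint_state, w_prop)])
feature = shared_trunk(tokens)                 # one fused feature vector
print(feature.shape)                           # (64,)
```

The point of the sketch is only that once every modality is projected to the same token width, a single downstream model can consume observations from any robot without modality-specific plumbing.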

To create large-scale datasets for pre-training, the researchers compiled 52 datasets, including human demonstration videos and simulations, encompassing over 200,000 robot trajectories across four major categories. Additionally, they developed an efficient method to convert raw proprioceptive signals from a set of sensors into data that transformers can process.
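One way to picture the proprioceptive conversion step is the sketch below: robots report joint vectors of different lengths, so each raw signal is standardized and packed into a fixed-size token grid. The sizes, the zero-padding scheme, and the function name are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

MAX_TOKENS = 8   # fixed sequence length the transformer expects (illustrative)
TOKEN_DIM = 4    # values packed into each token (illustrative)

def proprio_to_tokens(signal):
    """Normalize a raw proprioceptive vector and pack it into fixed tokens.

    Different robots report different numbers of joint angles and
    velocities, so the vector is standardized, zero-padded to a fixed
    size, and reshaped into a (MAX_TOKENS, TOKEN_DIM) grid that a
    transformer can consume regardless of the source hardware.
    """
    x = np.asarray(signal, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)      # standardize each reading
    padded = np.zeros(MAX_TOKENS * TOKEN_DIM)
    padded[:x.size] = x                        # zero-pad shorter signals
    return padded.reshape(MAX_TOKENS, TOKEN_DIM)

# A 7-DoF arm and a 14-DoF hand yield the same token shape.
tokens_arm = proprio_to_tokens([0.1, -0.4, 1.2, 0.0, 0.3, -0.9, 0.5])
tokens_hand = proprio_to_tokens(np.linspace(-1.0, 1.0, 14))
print(tokens_arm.shape, tokens_hand.shape)     # (8, 4) (8, 4)
```

The design choice worth noting is that heterogeneity is absorbed at the input boundary: after this step, every trajectory looks identical to the model, which is what allows the 52 datasets to be pooled for pre-training.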

In both simulated and real-world tests, HPT significantly enhanced robot performance, achieving more than a 20 percent improvement over training from scratch each time. Even on tasks that differed substantially from the pre-training data, HPT still improved the robot's performance.

David Held, an Associate Professor at Carnegie Mellon University's Robotics Institute, praised the work, noting that it offers a novel approach for training a single policy across diverse robot embodiments. It allows robots to learn from heterogeneous datasets, significantly expanding the scale of data available for training.

In the future, the researchers aim to investigate how data diversity can further enhance HPT's performance and improve its ability to handle unlabeled data. Lirui Wang described their vision of a universal robot "brain" that users could download and use without any additional training. While still in the early stages, the team is committed to advancing this direction and hopes that scaling will yield breakthroughs in robot policies, much as it has for large language models.