Alter3: Humanoid Robots Driven by GPT-4

2024-06-25

Researchers at the University of Tokyo and Alternative Machine have jointly developed a humanoid robot system that translates natural language commands from humans directly into robot actions. The system, called Alter3, is designed to draw on the broad knowledge embedded in large language models (LLMs) such as GPT-4 to perform complex tasks, from taking a selfie to pretending to be a ghost.


This is the latest advance in combining the capabilities of foundation models with robotics. While such systems have not yet matured into scalable commercial solutions, they have driven notable progress in robotics research in recent years and show considerable potential.

How the LLM Controls the Robot

Alter3 uses GPT-4 as its backend model. The model receives natural language instructions describing the actions or situations the robot should respond to.

Using an "agent framework", the LLM plans the series of actions the robot must perform to achieve its goal. In the first stage, the model acts as a planner, determining the steps required to carry out the desired action.
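The planning stage can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `llm` helper and the prompt wording are hypothetical stand-ins for a real GPT-4 chat call.

```python
# Hypothetical sketch of the planner stage: the LLM decomposes a natural
# language instruction into an ordered list of concrete action steps.

PLANNER_PROMPT = """You control the humanoid robot Alter3.
Break the following instruction into short, concrete action steps,
one step per line.

Instruction: {instruction}
"""

def plan_actions(instruction, llm):
    """Ask the LLM to decompose an instruction into ordered steps."""
    reply = llm(PLANNER_PROMPT.format(instruction=instruction))
    # Each non-empty line of the reply becomes one step of the plan.
    return [line.strip() for line in reply.splitlines() if line.strip()]

# Stubbed LLM for demonstration; a real system would call the GPT-4 API here.
def fake_llm(prompt):
    return "Raise right arm\nTilt head\nSmile"

print(plan_actions("take a selfie", fake_llm))
# → ['Raise right arm', 'Tilt head', 'Smile']
```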


The action plan is then passed to a coding agent, which generates the instructions the robot needs to perform each step. Since GPT-4 was not specifically trained on Alter3's programming instructions, the researchers relied on its in-context learning ability to adapt it to the robot's API: the prompt includes a list of instructions and a set of examples demonstrating how each one is used. The model then maps each step to one or more API instructions and sends them to the robot for execution.
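The in-context adaptation described above might look something like this. The API documentation and few-shot examples below are invented placeholders (Alter3's real control API is not published in this article); only the prompt structure, an instruction list plus worked examples followed by the new step, reflects the approach described.

```python
# Hedged sketch of the coding agent: map one plan step to robot API calls
# by showing the LLM the instruction set and a few worked examples.
# set_axis/wait and the axis numbers are hypothetical, not Alter3's real API.

API_DOC = """Available instructions (axis values 0-255):
set_axis(axis_id, value)  # move one of the 43 axes
wait(seconds)             # pause between movements
"""

FEW_SHOT = """Step: Raise right arm
Code: set_axis(21, 200)

Step: Nod
Code: set_axis(3, 180); wait(0.5); set_axis(3, 100)
"""

def generate_code(step, llm):
    """Build an in-context prompt and return the LLM's code for one step."""
    prompt = f"{API_DOC}\n{FEW_SHOT}\nStep: {step}\nCode:"
    return llm(prompt).strip()

# Stubbed LLM standing in for GPT-4.
def fake_llm(prompt):
    return " set_axis(21, 220)"

print(generate_code("Raise right arm higher", fake_llm))
# → set_axis(21, 220)
```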

"Before LLM, we had to control all 43 axes in a certain order to mimic human poses or pretend to perform actions like pouring tea or playing chess," the researchers wrote. "Thanks to LLM, we are now freed from the constraints of repetitive labor."

Learning from Human Feedback

Language is not the most precise medium for describing physical postures. Therefore, the action sequences generated by the model may not fully produce the desired behavior on the robot.

To support corrections, the researchers added a feature that lets humans provide feedback, such as "raise the arm a bit higher". These instructions are sent to another GPT-4 agent, which reasons about the code, makes the necessary adjustments, and returns the revised action sequence to the robot. The refined action plans and code are stored in a database for future use.
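The feedback loop can be sketched as a second LLM agent plus a simple memory store. All names here are illustrative assumptions; the paper's actual storage scheme may differ, and a dict stands in for the database.

```python
# Minimal sketch of the correction loop: a refinement agent revises the
# generated code based on human feedback, and accepted results are cached.

REFINE_PROMPT = """Current robot code:
{code}

Human feedback: {feedback}

Return the corrected code only.
"""

memory = {}  # instruction -> refined code, reused on later requests

def refine(instruction, code, feedback, llm):
    """Ask the LLM to adjust the code, then store the result for reuse."""
    revised = llm(REFINE_PROMPT.format(code=code, feedback=feedback)).strip()
    memory[instruction] = revised  # persist the improved plan
    return revised

# Stubbed LLM standing in for the second GPT-4 agent.
def fake_llm(prompt):
    return "set_axis(21, 240)"

refined = refine("raise arm", "set_axis(21, 200)",
                 "raise the arm a bit higher", fake_llm)
print(refined)                 # → set_axis(21, 240)
print(memory["raise arm"])     # cached for the next identical request
```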


Researchers tested Alter3 on various tasks, including everyday activities such as taking selfies and drinking tea, as well as imitating behaviors such as pretending to be a ghost or a snake. They also tested the model's responsiveness to scenarios that require carefully planned actions.

"LLM training includes a wide range of motion language representations. GPT-4 can accurately map these representations onto Alter3's body," the researchers wrote.

GPT-4's extensive knowledge about human behavior and actions enables the creation of more realistic behavior plans for humanoid robots like Alter3. The researchers' experiments also demonstrated the ability to make the robot mimic emotions such as awkwardness and joy in its actions.

"Even without explicit expressions of emotions in the text, LLM can infer appropriate emotions and reflect them in Alter3's physical responses," the researchers wrote.

More Advanced Models

In robotics research, the use of foundation models is gradually becoming popular. For example, Figure, a company valued at $2.6 billion, uses OpenAI models behind the scenes to understand human instructions and act in the real world. As multimodality becomes the norm for foundation models, robotic systems will be better equipped to reason about their environment and select their actions.

Alter3 belongs to a category of projects that use off-the-shelf foundation models as the reasoning and planning modules in robot control systems. Alter3 does not use a fine-tuned version of GPT-4, and the researchers note that the code can be applied to other humanoid robots.

Other projects, such as RT-2-X and OpenVLA, use purpose-built foundation models to generate robot instructions directly. These models tend to produce more stable results and can scale to more tasks and environments, but they demand specific technical skills and cost more to develop.

An often overlooked aspect of these projects is the foundational challenge of building robots that can perform basic tasks such as grasping objects, keeping their balance, and moving around. "There is a lot of other work to be done at those lower levels that the models haven't touched yet," said AI and robotics research scientist Chris Paxton in an interview earlier this year. "And that work is very challenging. In many ways, it's because the relevant data is not sufficient."