MIT Research Team Proposes Multimodal HiP Framework for Complex Robotic Task Planning

2024-01-09

Your daily to-do list may be quite simple and straightforward: washing dishes, grocery shopping, and other mundane tasks. You are unlikely to write down "pick up the first dirty plate" or "wash that plate with a sponge" because each of these small steps in household chores feels intuitive. While we can go through each step routinely without much thought, robots require a more detailed and complex plan.

The Improbable AI Lab at MIT, a subgroup within the Computer Science and Artificial Intelligence Laboratory (CSAIL), has given these machines a helping hand by proposing a new multimodal framework: Compositional Foundation Models for Hierarchical Planning (HiP), which develops detailed, feasible plans by combining the expertise of three different foundation models. Like OpenAI's GPT-4, the foundation model behind ChatGPT and Bing Chat, these foundation models are trained on massive amounts of data for applications such as generating images, translating text, and robotics.

This research paper was published on the preprint server arXiv.

Unlike RT-2 and other multimodal models that are trained on paired vision, language, and action data, HiP uses three different foundation models, each trained on a different data modality. Each foundation model captures a different part of the decision-making process, and the three work together when it comes time to make decisions. HiP removes the need for paired vision, language, and action data, which is difficult to obtain. It also makes the reasoning process more transparent.

What counts as an everyday household task for humans can be a "long-horizon goal" for robots - an overall objective that requires completing many smaller steps first - and achieving it demands sufficient data for planning, understanding, and execution. While computer vision researchers have tried to build a single foundation model for this problem, pairing language, vision, and action data is expensive. HiP instead represents a different multimodal recipe: a cost-effective triad that bakes linguistic, physical, and environmental intelligence into a robot.

Jim Fan, an AI researcher at NVIDIA who was not involved in the research, said, "Foundation models don't have to be monolithic. This work decomposes the complex task of embodied agent planning into three constituent models: a language reasoner, a visual world model, and an action planner. It makes a difficult decision-making problem more tractable and transparent."

The research team believes the system could help these machines with household chores, such as putting away a book or placing a bowl in the dishwasher. HiP could also assist with multi-step construction and manufacturing tasks, such as stacking and placing different materials in a specific order.

Evaluation of HiP

The CSAIL team tested HiP on three manipulation tasks, where it outperformed comparable frameworks by developing intelligent plans that adapt to new information.

First, the researchers asked it to stack blocks of different colors on top of each other and then place other blocks nearby. The catch: some of the required colors were not available, so the robot had to place white blocks in a bowl of paint to color them. Compared with state-of-the-art task planning systems such as Transformer BC and Action Diffuser, HiP more often adjusted its plans accurately to stack and place each block as needed.

Another test involved arranging items, such as candies and hammers, inside a brown box while ignoring other items. Some of the items that needed to be moved were dirty, so HiP adjusted its plan to first put them in a cleaning bin and then into the brown container. In the third demonstration, the robot was able to ignore unnecessary items to accomplish sub-goals in the kitchen, such as opening the microwave, cleaning the kettle, and turning on the light. Some prompted steps had already been completed, so the robot adaptively skipped those instructions.

A Three-Part Hierarchy

HiP's three-part planning process operates as a hierarchy, and each of its components can be pre-trained on different datasets, including data from outside robotics. At the bottom of the hierarchy is a large language model (LLM), which starts by capturing all the necessary symbolic information and formulating an abstract task plan. Applying the common-sense knowledge it finds on the internet, the model breaks its goal into sub-goals. For example, "make a cup of tea" becomes "fill a pot with water," "boil the pot," and the subsequent actions required.
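To make the idea concrete, here is a minimal sketch of how a planning layer might prompt an off-the-shelf LLM to decompose a goal into sub-goals. The `call_llm` function, prompt wording, and parsing are hypothetical stand-ins for illustration, not HiP's actual interface.

```python
# Minimal sketch of LLM-based task decomposition (hypothetical interface,
# not HiP's released code). A goal string becomes an ordered list of
# sub-goals for the lower layers to refine and execute.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to any pre-trained LLM."""
    # A real system would query a model here; a canned reply keeps the
    # sketch runnable end to end.
    return "1. fill a pot with water\n2. boil the pot\n3. steep the tea\n4. pour into a cup"

def decompose_goal(goal: str) -> list[str]:
    prompt = (
        "Break the following household task into short, concrete sub-goals, "
        f"one per line, in the order they should be done:\nTask: {goal}"
    )
    reply = call_llm(prompt)
    # Strip numbering and blank lines to get a clean sub-goal list.
    return [line.lstrip("0123456789. ").strip()
            for line in reply.splitlines() if line.strip()]

if __name__ == "__main__":
    print(decompose_goal("make a cup of tea"))
```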

"We just wanted to leverage existing pre-trained models and make them successfully connect with each other," said Anurag Ajay, a doctoral student in MIT's Department of Electrical Engineering and Computer Science (EECS) and a member of CSAIL. "We didn't push one model to do everything, but rather combined multiple models that utilize different internet data modalities. When used in conjunction, they help with robot decision-making and could be useful for tasks in homes, factories, and construction sites."

These models also need some form of "eyes" to understand the environment they operate in and execute each sub-goal correctly. The team used a large video diffusion model, which gathers geometric and physical information about the world from internet videos, to augment the initial plan generated by the LLM. The video model then generates an observation trajectory plan, refining the LLM's outline to incorporate the new physical knowledge.
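The sketch below illustrates that step under assumed interfaces: a video model conditioned on the current camera frame and one sub-goal predicts a short trajectory of future observations. The `VideoDiffusionModel` class and its methods are hypothetical placeholders, not HiP's released model.

```python
# Sketch of the observation-planning step (hypothetical placeholder classes).
# Given the current image and a sub-goal from the LLM, a video model predicts
# a short sequence of future frames showing how the scene should evolve.

import numpy as np

class VideoDiffusionModel:
    """Stand-in for a text- and image-conditioned video diffusion model."""

    def predict_frames(self, image: np.ndarray, subgoal: str, horizon: int = 8) -> list[np.ndarray]:
        # A real model would denoise a video conditioned on (image, subgoal);
        # returning copies of the input keeps the sketch runnable.
        return [image.copy() for _ in range(horizon)]

def plan_observations(model: VideoDiffusionModel, image: np.ndarray,
                      subgoals: list[str]) -> list[list[np.ndarray]]:
    """Produce one predicted observation trajectory per sub-goal."""
    return [model.predict_frames(image, sg) for sg in subgoals]

if __name__ == "__main__":
    current_view = np.zeros((64, 64, 3), dtype=np.uint8)  # dummy camera frame
    trajs = plan_observations(VideoDiffusionModel(), current_view,
                              ["fill a pot with water", "boil the pot"])
    print(len(trajs), "trajectories of", len(trajs[0]), "frames each")
```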

This process, called iterative refinement, allows HiP to reason about its ideas, receiving feedback at each stage to generate more realistic outlines. The feedback loop is similar to writing an article, where an author may send a draft to an editor, and after incorporating revisions, the publisher reviews the final changes and finalizes it.
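One plausible reading of this feedback loop, sketched below with assumed placeholder functions: the LLM proposes candidate sub-goal plans, the video model scores how physically plausible each looks, and the planner keeps the best-scoring candidate. The function names and the scoring rule are illustrative assumptions, not HiP's implementation.

```python
# Sketch of iterative refinement via feedback (assumed interfaces, not HiP's code).
# A lower stage scores the proposals of the stage above it: here, a video model's
# plausibility score selects among candidate sub-goal plans from the LLM.

import random

def propose_plans(goal: str, n: int = 3) -> list[list[str]]:
    """Placeholder: ask the LLM for n alternative sub-goal decompositions."""
    base = ["fill a pot with water", "boil the pot", "steep the tea"]
    return [base for _ in range(n)]

def video_plausibility(subgoal: str) -> float:
    """Placeholder: how physically plausible the video model finds a sub-goal."""
    return random.random()

def refine(goal: str, rounds: int = 2) -> list[str]:
    best_plan, best_score = None, float("-inf")
    for _ in range(rounds):
        for plan in propose_plans(goal):
            score = sum(video_plausibility(sg) for sg in plan)
            if score > best_score:
                best_plan, best_score = plan, score
    return best_plan

if __name__ == "__main__":
    print(refine("make a cup of tea"))
```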

At the top of the hierarchy is an egocentric action model, which uses a sequence of first-person images to infer which actions to take based on the surrounding environment. At this stage, the observation plan from the video model is mapped onto the space visible to the robot, helping the machine decide how to execute each task within the long-horizon goal. If the robot uses HiP to make tea, this means it has precisely mapped out the positions of the pot, the sink, and other key visual elements, and can begin completing each sub-goal.
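This last layer can be read as an inverse-dynamics-style step: given consecutive predicted frames, it infers the action that moves the robot from one to the next. The sketch below uses a dummy model and hypothetical names to illustrate that mapping; it is not HiP's actual action model.

```python
# Sketch of the action-inference layer (hypothetical stand-in).
# Given consecutive predicted egocentric frames, an action model infers the
# low-level command that should take the robot from one frame to the next.

import numpy as np

class ActionModel:
    """Stand-in for a model mapping pairs of frames to robot actions."""

    def infer_action(self, frame_t: np.ndarray, frame_next: np.ndarray) -> np.ndarray:
        # A real model would regress an action; a zero 7-DoF arm command
        # keeps the sketch runnable.
        return np.zeros(7)

def execute_trajectory(model: ActionModel, frames: list[np.ndarray]) -> list[np.ndarray]:
    """Convert a predicted observation trajectory into a sequence of actions."""
    return [model.infer_action(a, b) for a, b in zip(frames, frames[1:])]

if __name__ == "__main__":
    frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(4)]
    actions = execute_trajectory(ActionModel(), frames)
    print(f"{len(actions)} actions inferred")
```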

However, this multimodal work is limited by the lack of high-quality video foundation models. Once such models become available, they could interface with HiP's small-scale video model to further improve visual sequence prediction and robot action generation. A higher-quality version would also reduce the video model's current data requirements.

That said, the CSAIL team's approach uses only a small amount of data overall. Moreover, HiP is cheap to train and demonstrates the potential of using off-the-shelf foundation models to accomplish long-horizon tasks.

"What Anurag has shown is how we can combine models trained on different tasks and data modalities into a proof-of-concept for a robot planning model," said Pulkit Agrawal, an assistant professor in MIT's EECS department and director of the Low Probability of Intercept AI Lab. "In the future, HiP could be enhanced with pre-trained models that can handle touch and sound for better planning." The team is also considering applying HiP to solve long-term robot tasks in the real world.