Meta Unveils OpenEQA Benchmark Dataset to Propel Embodied AI Research
Meta AI's research team has released OpenEQA, an open-source benchmark dataset for evaluating artificial intelligence systems on "embodied question answering": how well a system can understand a real-world environment and answer natural language questions about it.
OpenEQA is positioned as a core benchmark for the emerging field of "embodied AI" that Meta is pursuing. The dataset contains over 1,600 questions grounded in more than 180 real-world environments, such as homes and offices. The questions span seven categories and collectively test an AI system's abilities in object and attribute recognition, spatial and functional reasoning, and common-sense knowledge.
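To make that structure concrete, the hypothetical sketch below shows what a single benchmark entry could look like in code. The field names (`question`, `answers`, `category`, `episode_history`), the category label, and the scene identifier are illustrative assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass

# Hypothetical layout for one OpenEQA-style benchmark entry.
# Field names and values are illustrative assumptions, not the published schema.
@dataclass
class EQAItem:
    question: str          # natural language question posed to the agent
    answers: list[str]     # multiple human-written reference answers
    category: str          # one of the seven question categories
    episode_history: str   # pointer to the video / 3D scan of the environment

example = EQAItem(
    question="How many chairs are around the dining table?",
    answers=["four", "4"],
    category="spatial understanding",   # illustrative category name
    episode_history="scans/home_017",   # illustrative scene identifier
)
```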
"In this context, we propose the concept of embodied question answering (EQA), which is not only a practical end application but also an effective means of evaluating the intelligence agent's understanding of the real world. In short, the task of EQA is to deeply analyze the environment in order to answer questions related to it using natural language," the researchers explained in their published paper.
The OpenEQA project sits at the intersection of computer vision, natural language processing, knowledge representation, and robotics, all active areas of AI research. Its long-term vision is intelligent agents that can perceive and interact with the world, communicate naturally with humans, and draw on knowledge to assist people in daily life.
The researchers see two main near-term applications for this kind of embodied intelligence. First, as an AI assistant embedded in augmented-reality glasses or headsets, it could use video and other sensor data to act as a memory of what the camera has seen, helping answer questions like "Where did I put my keys?" Second, mobile robots could autonomously explore an environment to gather information, for example searching the home to answer "Do I still have coffee left?"
To build this challenging benchmark, Meta's researchers first collected videos and 3D scans of real-world environments. They then showed these videos to human annotators and asked them to write questions they might pose to an AI assistant with access to that visual data.
The final collection of 1,636 questions tests a wide range of perception and reasoning abilities. For example, to answer "How many chairs are around the dining table?", the AI must recognize the objects in the scene, understand the spatial concept of "around," and count the relevant chairs. Other questions require basic common-sense knowledge about how objects are used and what their attributes are.
Each question is accompanied by multiple human-generated answers to account for the different ways it can reasonably be answered. To evaluate AI agents, the researchers use a large language model as an automatic judge that scores how closely each AI-generated answer matches the human answers.
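As a rough illustration of this kind of LLM-based scoring, the sketch below prompts a judge model to rate each answer's agreement with the human references on a 1-to-5 scale and averages the normalized result over the benchmark. The `query_llm` helper, the prompt wording, and the 1-to-5 scale are assumptions made for illustration, not the paper's exact protocol.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a call to whichever large language model acts as the judge."""
    raise NotImplementedError

SCORING_PROMPT = (
    "You are grading an answer to a question about a real-world scene.\n"
    "Question: {question}\n"
    "Human reference answers: {references}\n"
    "Candidate answer: {candidate}\n"
    "On a scale of 1 (completely wrong) to 5 (matches the references), "
    "reply with a single number."
)

def llm_match_score(question: str, references: list[str], candidate: str) -> float:
    """Ask the judge model how well the candidate agrees with the human answers."""
    prompt = SCORING_PROMPT.format(
        question=question,
        references="; ".join(references),
        candidate=candidate,
    )
    rating = float(query_llm(prompt).strip().split()[0])  # parse the leading number
    return (rating - 1.0) / 4.0                           # normalize 1-5 to 0-1

def benchmark_score(items, agent_answers) -> float:
    """Average the normalized match score over all question/answer pairs."""
    scores = [
        llm_match_score(item.question, item.answers, answer)
        for item, answer in zip(items, agent_answers)
    ]
    return sum(scores) / len(scores)
```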