x.AI recently released a preview of Grok-1.5 Vision (Grok-1.5V), its first multimodal model. The release is a remarkable milestone for the young company, which was founded only nine months ago. This upgraded version of the Grok large language model demonstrates markedly stronger capabilities in understanding and interacting with the physical world.
Grok-1.5V can process a wide range of visual information, including documents, charts, graphs, and photos. It excels at multidisciplinary reasoning and at understanding spatial relationships in the physical world, outperforming peer models on x.AI's newly launched RealWorldQA benchmark.
In a blog post, the startup showcased various applications of Grok-1.5V. It handles tasks such as writing code from a hand-drawn diagram, calculating calories from a photo of a nutrition label, and even creating bedtime stories from children's drawings. The model can also explain internet memes, convert photographed tables to CSV format (sketched below), and offer suggestions for home-maintenance issues such as rotting wood on a deck. Together, these examples demonstrate the model's versatility and practical utility.
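For readers curious what the table-to-CSV use case might look like in practice, here is a minimal sketch. x.AI has not published an API for Grok-1.5V, so the endpoint, model name, and request schema below are hypothetical placeholders modeled on common multimodal chat APIs, not x.AI's actual interface.

```python
import base64
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"  # placeholder credential


def table_photo_to_csv(image_path: str) -> str:
    """Ask a multimodal model to transcribe a photographed table as CSV."""
    # Encode the image as base64 so it can travel inline in a JSON payload.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "model": "grok-1.5v-preview",  # hypothetical model identifier
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Convert the table in this image to CSV. "
                                "Reply with the CSV only.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    }
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    # Response shape is also an assumption, mirroring common chat APIs.
    return resp.json()["choices"][0]["message"]["content"]


print(table_photo_to_csv("expense_table.jpg"))
```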
x.AI stated in the blog post, "Enhancing our multimodal understanding and generation capabilities is an important step towards building beneficial Artificial General Intelligence (AGI) that can comprehend the universe." The lab is excited to release RealWorldQA to the community and plans to expand the benchmark as it improves its multimodal models.
The launch of RealWorldQA highlights x.AI's determination to advance AI's understanding of the physical world, a crucial step toward practical real-world AI assistants. The benchmark consists of over 760 images with question-answer pairs; although many of the examples are relatively simple for humans, they remain challenging for state-of-the-art models, which underscores the significance of Grok-1.5V's results.
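As a rough illustration of how a benchmark of this shape is typically scored, the sketch below computes exact-match accuracy over image/question/answer triples. The JSONL field names and the scoring rule are assumptions chosen for illustration, not RealWorldQA's documented schema.

```python
import json


def evaluate(model_answer_fn, qa_path: str) -> float:
    """Score a model on image question-answer pairs by exact match.

    `model_answer_fn(image_path, question)` is whatever wrapper you have
    around the model under test. The JSONL fields below are assumptions
    about the distribution format, not the benchmark's actual schema.
    """
    correct = total = 0
    with open(qa_path) as f:
        for line in f:
            # e.g. {"image": "img_001.jpg", "question": "...", "answer": "..."}
            ex = json.loads(line)
            pred = model_answer_fn(ex["image"], ex["question"]).strip().lower()
            gold = ex["answer"].strip().lower()
            correct += pred == gold
            total += 1
    return correct / max(total, 1)


# Example: accuracy = evaluate(my_model_wrapper, "realworldqa.jsonl")
```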
Earlier this week, Meta also released its OpenEQA benchmark, which evaluates AI models' understanding of physical spaces. It includes more than 1,600 questions about real-world environments, testing models' ability to recognize objects, reason spatially, and apply common-sense knowledge. Given Grok-1.5V's strong performance in understanding the physical world, expectations are high for how it would fare on OpenEQA.
x.AI emphasizes that advancing multimodal understanding and generation is essential to building beneficial AGI, and it plans to make significant progress across modalities such as images, audio, and video in the coming months. The company also says Grok-1.5V will soon be available to early testers and existing Grok users.