Apple Develops Multimodal LLM for Image Data Interpretation

2024-03-20

A group of computer scientists and engineers at Apple has developed an LLM that, the company claims, can interpret both image and text data. The team has published a paper on the arXiv preprint server describing its newly developed MM1 family of multimodal models and their test results.

LLMs have drawn enormous attention over the past year for their advanced AI capabilities, and Apple has been notably absent from that conversation. With this new work, the research team makes clear that Apple is not simply bolting on an LLM developed by another company (the company is reportedly in talks with Google to bring Gemini AI technology to Apple devices); instead, it has been building a next-generation LLM of its own, one that can interpret both image and text data.

Multimodal AI works by integrating and processing different types of input, such as visual, auditory, and textual information. This integration gives the system a more complete picture of complex data, yielding more accurate and context-aware interpretations than single-modal AI systems can provide.
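To make that idea concrete, here is a minimal sketch of one common way such integration is done: image features from a vision encoder are projected into the same embedding space as text tokens so a language model can attend over both. This is an illustrative toy, not Apple's MM1 architecture; the class name, dimensions, and projection scheme are assumptions for demonstration.

```python
import torch
import torch.nn as nn

class ToyMultimodalFusion(nn.Module):
    """Illustrative only: maps vision-encoder features into the text
    embedding space so a language model can process both modalities."""

    def __init__(self, image_dim=1024, text_vocab=32000, hidden_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab, hidden_dim)
        # Linear "connector" from vision features to the language model's
        # hidden size, a widely used multimodal design pattern.
        self.image_proj = nn.Linear(image_dim, hidden_dim)

    def forward(self, image_features, text_token_ids):
        # image_features: (batch, num_patches, image_dim) from a vision encoder
        # text_token_ids: (batch, seq_len) token ids for the text prompt
        img_tokens = self.image_proj(image_features)
        txt_tokens = self.text_embed(text_token_ids)
        # Prepend the image "tokens" to the text tokens; the combined
        # sequence would then flow through the transformer layers.
        return torch.cat([img_tokens, txt_tokens], dim=1)

# Example: one image split into 16 patch features plus an 8-token prompt.
fusion = ToyMultimodalFusion()
image_features = torch.randn(1, 16, 1024)
text_token_ids = torch.randint(0, 32000, (1, 8))
combined = fusion(image_features, text_token_ids)
print(combined.shape)  # torch.Size([1, 24, 768])
```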

Apple's research team claims to have made significant progress in applying multimodal AI with MM1, which integrates text and image data to improve image captioning, visual question answering, and query learning. MM1 is what the team describes as a family of multimodal models, with the largest containing up to 30 billion parameters.

The researchers note that these models are trained on datasets composed of image-caption pairs, documents that interleave images and text, and text-only documents. They also claim that their multimodal LLM (MLLM) can count objects, identify parts of objects in images, and draw on common knowledge about everyday objects to provide useful information about an image's contents.
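A rough sketch of how a training pipeline might sample from such a mixture of data sources is shown below. The source names mirror the three data types described above, but the sampling weights are made up for illustration and are not figures from the paper.

```python
import random

# Hypothetical training mixture over the three data types described above.
# The weights are placeholders for illustration, not Apple's actual ratios.
DATA_SOURCES = {
    "image_caption_pairs": 0.45,          # (image, short caption) pairs
    "interleaved_image_text_docs": 0.45,  # documents mixing images and prose
    "text_only_docs": 0.10,               # plain text documents
}

def sample_source(rng=random):
    """Pick which data source the next training example is drawn from."""
    names = list(DATA_SOURCES)
    weights = [DATA_SOURCES[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Draw a small batch and show the mixture in practice.
batch = [sample_source() for _ in range(8)]
print(batch)
```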

The researchers also claim that their MLLM is capable of in-context learning, meaning it does not have to start from scratch with each question; it draws on what it has already learned in the current conversation. The team offers examples of the model's advanced capabilities, one of which involves uploading a photo of a group of friends holding a menu at a bar and asking the model how much it would cost to buy each person a beer, based on the prices listed on the menu.
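In practice, this kind of in-context behavior amounts to replaying earlier turns alongside each new question so the model can reuse what it has already extracted. The tiny sketch below illustrates that idea only; the conversation content, the price, and the helper function are all hypothetical and do not represent an MM1 API.

```python
from typing import List

def build_prompt(history: List[str], new_question: str) -> str:
    """Concatenate prior turns with the new question so the model can
    answer using information it already produced in this conversation."""
    return "\n".join(history + [f"User: {new_question}", "Assistant:"])

# Hypothetical dialogue: the image reference and beer price are invented
# purely to show how earlier context carries forward.
history = [
    "User: [image: six friends at a bar holding a menu]",
    "Assistant: I see six people, and the menu lists beer at $7 each.",
]
prompt = build_prompt(history, "How much would it cost to buy each person a beer?")
print(prompt)  # The model could answer 6 x $7 using the price it already read.
```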