Apple releases Ferret-UI, a new multimodal large language model focused on UI understanding

2024-09-18

Apple's research team recently introduced Ferret-UI, a new multimodal large language model (MLLM) built to understand user interface (UI) elements, their functions, and the interactions users can perform with them. On several elementary UI tasks it even outperforms GPT-4V.

Ferret-UI is designed around three families of tasks on mobile screens: referring, grounding, and reasoning. Together these let it understand what is on a screen and act on that understanding. In referring tasks, the model answers questions about a region the user indicates with a bounding box, a point, or a scribble, for example identifying or classifying a widget, icon, or piece of text. In grounding tasks, Ferret-UI locates elements itself, responding to instructions such as finding a specific widget or listing all the widgets on a screen. On top of these, its reasoning abilities let it describe a screen in detail, infer the overall function of the UI and the purpose of its layout, and hold goal-oriented conversations about it.
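To make the task split concrete, here is a minimal sketch of how a referring query (region in, answer out) differs from a grounding query (instruction in, locations out). Apple has not published a public API for Ferret-UI, so the data structures, field names, and example file paths below are purely illustrative; only the task names follow the paper.

```python
# Illustrative sketch only: no public Ferret-UI API exists.
# Task names (referring, grounding) follow the paper; everything else is assumed.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ReferringQuery:
    """Ask about a region the user indicates (box, point, or scribble)."""
    screenshot_path: str
    region: Tuple[int, int, int, int]   # (x1, y1, x2, y2) bounding box in pixels
    question: str                        # e.g. "What type of widget is this?"

@dataclass
class GroundingQuery:
    """Ask the model to locate elements; the answer would contain coordinates."""
    screenshot_path: str
    instruction: str                     # e.g. "Find the 'Sign in' button."

# One example query per task family described above (hypothetical screenshots).
refer = ReferringQuery("home_screen.png", (40, 880, 200, 960),
                       "What does tapping this icon do?")
ground = GroundingQuery("settings.png", "List all toggle switches on this screen.")
print(refer, ground, sep="\n")
```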

Notably, Ferret-UI uses an "any resolution" (anyres) approach that lets it adapt to different screen aspect ratios while keeping UI element recognition and interaction accurate. By dividing each screen into sub-images alongside the full view, the model captures both the overall context and the fine detail of individual UI elements.
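A rough sketch of that idea: keep the full screenshot for global context and add two aspect-ratio-aware crops for detail. The portrait/landscape split rule follows the paper's description, but the function name and file path are my own placeholders, not Apple's code.

```python
# Minimal sketch of the "any resolution" (anyres) idea described above.
from PIL import Image

def anyres_crops(path: str):
    full = Image.open(path)
    w, h = full.size
    if h >= w:   # portrait screen: split horizontally into top and bottom halves
        subs = [full.crop((0, 0, w, h // 2)), full.crop((0, h // 2, w, h))]
    else:        # landscape screen: split vertically into left and right halves
        subs = [full.crop((0, 0, w // 2, h)), full.crop((w // 2, 0, w, h))]
    # The model would encode the full image plus each sub-image, so coarse layout
    # and fine-grained text/icons are both represented.
    return [full] + subs

images = anyres_crops("screenshot.png")  # placeholder path
print([im.size for im in images])
```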

To train the model, Apple used GPT-3.5 to generate a large and varied training dataset, improving Ferret-UI's accuracy on complex mobile UI tasks. Ferret-UI is currently only a research project, but its potential is hard to ignore: it could let voice assistants such as Siri navigate and operate a phone the way a person does, handle complex voice commands, automate multi-step tasks across applications, or provide more detailed assistance based on what is on screen.
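The data-generation step might look roughly like the sketch below: describe a screen's detected UI elements as text, then ask an LLM to write question-answer pairs about the screen for training. The prompt wording and element format are assumptions; Apple's actual pipeline is not public beyond the paper's description.

```python
# Rough sketch of LLM-based training-data generation, as described above.
# Element format and prompt wording are hypothetical.
import json
from openai import OpenAI  # pip install openai

elements = [
    {"label": "button", "text": "Sign in", "box": [40, 880, 200, 960]},
    {"label": "text field", "text": "Email", "box": [40, 700, 680, 760]},
]

prompt = (
    "You are given the UI elements of a mobile screen as JSON.\n"
    f"{json.dumps(elements)}\n"
    "Write three question-answer pairs that test understanding of this screen, "
    "covering element function, location, and overall purpose."
)

client = OpenAI()  # requires OPENAI_API_KEY in the environment
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```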

In mobile accessibility, Ferret-UI could provide more accurate, context-aware descriptions of application interfaces, help developers automate UI testing, and even power smarter app recommendations. It has limitations, though: it relies on predefined UI element detection, can miss subtle differences in visual design, and still struggles with complex reasoning.

Ferret-UI is further evidence of Apple's focus on building dedicated, efficient AI models that can run directly on devices, in line with its emphasis on user privacy and on-device processing. Apple has kept a relatively low profile in the generative AI era, but its investment in and contributions to AI research should not be underestimated; earlier this year it also released an open-source model family on the Hugging Face platform.