Apple's research team recently introduced Ferret-UI, a new multimodal large language model (MLLM). The model excels at understanding user interface (UI) elements, their functions, and potential user interactions, and it even surpasses GPT-4V on some elementary UI tasks.
Ferret-UI is built around three core task types for mobile screens: referring, grounding, and reasoning. These capabilities let it accurately understand screen content and act on it. For referring, it can identify and classify UI elements such as widgets, icons, and text when a region is indicated through different input formats such as bounding boxes, scribbles, or points. For grounding, Ferret-UI can precisely locate elements on the screen in response to commands such as finding a specific widget or listing all widgets. Finally, its reasoning capabilities allow it to describe a screen's overall function and detailed content, hold goal-oriented dialogues, and infer the purpose of a UI layout.
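To make the difference between referring and grounding concrete, below is a minimal sketch of how such samples might be structured. The field names, coordinate convention, and wording are illustrative assumptions, not the format actually used in the Ferret-UI work.

```python
# Referring: the model is given a region (box, point, or scribble) and must
# describe or classify the UI element inside it.
referring_sample = {
    "image": "screenshot_001.png",
    "region": {"type": "box", "coords": [120, 860, 600, 940]},  # x1, y1, x2, y2
    "question": "What type of UI element is in this region?",
    "answer": "A button labeled 'Sign in'.",
}

# Grounding: the model is given a textual query and must return the location
# of the matching element on the screen.
grounding_sample = {
    "image": "screenshot_001.png",
    "question": "Find the search bar.",
    "answer": {"type": "box", "coords": [40, 120, 680, 180]},
}
```

In both cases the screenshot itself is part of the input; only the direction in which the region information flows differs.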
It is worth noting that Ferret-UI has an "any resolution" (anyres) capability, which allows it to adapt to different screen aspect ratios while maintaining high accuracy in UI element recognition and interaction. By dividing the screen into sub-images, the model can capture both the overall context and the fine details of UI elements.
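The idea behind this splitting can be illustrated with a rough sketch: keep a resized global view for overall context and cut the screenshot into aspect-ratio-aware sub-images for detail. The grid choices, sizes, and function name below are assumptions, not Ferret-UI's exact configuration.

```python
from PIL import Image

def split_anyres(path, base=336):
    """Sketch of an 'any resolution' style split: a downscaled global view
    plus aspect-ratio-aware sub-images. Sizes and grids are illustrative."""
    img = Image.open(path)
    w, h = img.size

    # The downscaled global image preserves the overall screen context.
    global_view = img.resize((base, base))

    # Portrait screens are cut into top/bottom halves, landscape screens into
    # left/right halves, so fine UI details keep a usable resolution.
    if h >= w:
        boxes = [(0, 0, w, h // 2), (0, h // 2, w, h)]
    else:
        boxes = [(0, 0, w // 2, h), (w // 2, 0, w, h)]

    sub_views = [img.crop(box).resize((base, base)) for box in boxes]
    return [global_view] + sub_views
```

In anyres-style approaches, each view is typically encoded separately by the image encoder, and the resulting features are passed to the language model together.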
During the research, Apple used GPT-3.5 to generate a diverse and rich training dataset, improving Ferret-UI's accuracy on complex mobile UI tasks. Although it is currently only a research project, Ferret-UI's technological potential should not be overlooked. In the future, it could be applied to more intelligent voice assistants such as Siri, enabling them to navigate and operate a phone the way a person would, handle complex voice commands, automate multi-step tasks across applications, or provide more detailed assistance based on on-screen content.
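As a purely illustrative sketch of what such LLM-assisted data generation could look like, the snippet below builds a prompt from detected UI elements and asks a GPT-3.5 model for a screen-grounded question-and-answer pair. The prompt wording, element format, and overall pipeline are assumptions; Apple's actual data-generation setup is only described at a high level.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_qa_sample(ui_elements):
    """Illustrative only: prompt an LLM to write a screen-grounded Q&A pair
    from a list of detected UI elements (each with a label and bounding box)."""
    element_list = "\n".join(f"- {e['label']} at {e['box']}" for e in ui_elements)
    prompt = (
        "Here are the UI elements detected on a mobile screen:\n"
        f"{element_list}\n"
        "Write one short question a user might ask about this screen and a "
        "correct answer grounded in the listed elements."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```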
In the field of mobile accessibility, Ferret-UI also has the potential to provide more accurate, context-aware descriptions of application interfaces, assist developers with automated UI testing, and even drive smarter application recommendations. However, Ferret-UI has limitations as well: it relies on predefined UI element detection, which may cause it to overlook subtle differences in design aesthetics, and it still faces challenges with complex reasoning.
Apple's research on Ferret-UI further demonstrates its focus on developing dedicated, efficient AI models that can run directly on devices, in line with its emphasis on user privacy and on-device processing. Although Apple has kept a relatively low profile in the generative AI era, its investment and contributions in AI research should not be underestimated; earlier this year it also released an open-source model family on the Hugging Face platform.