ByteDance Launches New Generation GUI Agent Model UI-TARS

2025-01-26

Recently, ByteDance officially launched UI-TARS, its newly developed next-generation native graphical user interface (GUI) agent model. The model aims to automate interaction with desktop, mobile, and web interfaces through natural language, giving users a more convenient and efficient way to operate their devices.

UI-TARS possesses robust perception, reasoning, action, and memory capabilities. It understands dynamic interface content in real time, accepts multiple input forms such as text and images, and interacts across desktop, mobile, and web platforms. Users can issue natural-language instructions to UI-TARS to plan and execute complex tasks. It also supports multi-step reasoning and error correction, letting it handle intricate interaction scenarios much as a human would.
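
As a rough illustration of how such a loop might be wired together, here is a minimal perceive-reason-act sketch in Python. The `query_model` stub and its thought/action response keys are assumptions made for illustration, not ByteDance's published interface.

```python
import base64
import io

from PIL import ImageGrab  # pip install pillow


def query_model(image_b64: str, instruction: str, history: list[str]) -> dict:
    """Hypothetical call to a UI-TARS-style model; assumed to return a dict
    with the model's reasoning ('thought') and its next step ('action')."""
    raise NotImplementedError("wire this up to your own model endpoint")


def run_task(instruction: str, max_steps: int = 20) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        # Perceive: capture the current screen as the model's visual input.
        shot = ImageGrab.grab()
        buf = io.BytesIO()
        shot.save(buf, format="PNG")
        image_b64 = base64.b64encode(buf.getvalue()).decode()

        # Reason: ask the model for its thought and the next action.
        step = query_model(image_b64, instruction, history)
        if step["action"] == "finished":  # assumed completion signal
            break

        # Act and remember: execute the action (a dispatcher is sketched
        # below), then record the step so later queries keep the context.
        history.append(f"{step['thought']} -> {step['action']}")
```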

For cross-platform operation, UI-TARS offers standardized action definitions while remaining compatible with platform-specific inputs such as shortcuts and gestures. It also features visual recognition and interaction capabilities: it locates interface elements accurately from screenshots and executes actions such as mouse clicks and keyboard input, making it suitable for complex visual tasks.
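
The unified action space described above can be pictured as a small, platform-neutral vocabulary of operations that each platform maps to its own primitives. The sketch below is an illustrative guess at such a schema, with a desktop dispatcher built on the `pyautogui` library; the action names and fields are assumptions, not UI-TARS's actual definitions.

```python
from dataclasses import dataclass, field

import pyautogui  # pip install pyautogui


@dataclass
class Action:
    """One step in a hypothetical platform-neutral action space."""
    kind: str                      # "click", "type", "scroll", or "hotkey"
    x: int | None = None           # pointer coordinates in pixels
    y: int | None = None
    text: str = ""                 # payload for "type"
    keys: list[str] = field(default_factory=list)  # payload for "hotkey"
    amount: int = 0                # payload for "scroll" (negative = down)


def execute_desktop(action: Action) -> None:
    """Map unified actions onto concrete desktop operations."""
    if action.kind == "click":
        pyautogui.click(action.x, action.y)
    elif action.kind == "type":
        pyautogui.write(action.text, interval=0.02)
    elif action.kind == "scroll":
        pyautogui.scroll(action.amount)
    elif action.kind == "hotkey":       # platform-specific shortcut support
        pyautogui.hotkey(*action.keys)  # e.g. Action("hotkey", keys=["ctrl", "c"])
    else:
        raise ValueError(f"unknown action kind: {action.kind}")
```

A mobile back end would map the same records onto taps and swipe gestures instead, which is what keeps the action space cross-platform.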

The memory and context management abilities of UI-TARS are another highlight. It captures task context and retains a record of past interactions, which lets it better support continuous tasks and complex scenarios. In practice, this means that across a series of operations, UI-TARS remembers previous steps and their outcomes, giving users a more coherent and fluid experience.
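
One simple way to picture this is a bounded history of steps that gets folded back into each model query. The structure below is an assumed illustration of that idea, not UI-TARS's internal design.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Step:
    thought: str   # the model's reasoning at this step
    action: str    # the action it chose
    outcome: str   # what the interface looked like afterwards


class InteractionMemory:
    """Keep the most recent steps so follow-up queries stay coherent."""

    def __init__(self, max_steps: int = 10) -> None:
        self.steps: deque[Step] = deque(maxlen=max_steps)

    def add(self, step: Step) -> None:
        self.steps.append(step)

    def as_context(self) -> str:
        """Serialize the history into text prepended to the next prompt."""
        return "\n".join(
            f"step {i}: thought={s.thought} action={s.action} outcome={s.outcome}"
            for i, s in enumerate(self.steps, 1)
        )
```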

For task automation, UI-TARS can complete sequences of operations on its own, such as launching applications, searching for information, and filling out forms, boosting user productivity. It also supports flexible deployment, running either in the cloud or locally to meet diverse user needs.
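
Flexible deployment usually means the same request works against a hosted service or a model served on your own machine. The snippet below assumes an OpenAI-compatible chat endpoint (a common way to serve open models, e.g. via vLLM), where only the base URL changes between local and cloud; the URLs and model name here are placeholders, not official ByteDance endpoints.

```python
import requests

LOCAL_URL = "http://localhost:8000/v1/chat/completions"   # e.g. a local vLLM server
CLOUD_URL = "https://api.example.com/v1/chat/completions"  # placeholder host


def ask_agent(prompt: str, base_url: str = LOCAL_URL, api_key: str = "") -> str:
    """Send one instruction to the model; local or cloud, same payload."""
    resp = requests.post(
        base_url,
        headers={"Authorization": f"Bearer {api_key}"} if api_key else {},
        json={
            "model": "ui-tars",  # assumed model name, for illustration only
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```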

From a technical perspective, UI-TARS is trained on large-scale GUI screenshot datasets, giving it contextual awareness and the ability to describe interface elements precisely. It also employs a visual encoder to extract visual features in real time for multimodal understanding of interfaces. For action modeling, UI-TARS standardizes cross-platform operations into a unified action space and learns precise element localization and interaction from extensive action-trajectory data.
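
In practice, precise element localization comes down to the model emitting coordinates that are then mapped onto the real screen. The parser below is a hedged illustration: the `Action: click(start_box='(x,y)')` output format and the use of normalized coordinates are assumptions about what such a model might emit.

```python
import re

# Assumed output format: Action: click(start_box='(0.32,0.45)')
ACTION_RE = re.compile(
    r"Action:\s*(\w+)\(start_box='\((\d*\.?\d+),\s*(\d*\.?\d+)\)'\)"
)


def parse_action(model_output: str, screen_w: int, screen_h: int):
    """Extract the action name and convert normalized coords to pixels."""
    m = ACTION_RE.search(model_output)
    if m is None:
        raise ValueError(f"no action found in: {model_output!r}")
    name, nx, ny = m.group(1), float(m.group(2)), float(m.group(3))
    return name, round(nx * screen_w), round(ny * screen_h)


print(parse_action("Thought: open the menu.\n"
                   "Action: click(start_box='(0.32,0.45)')", 1920, 1080))
# -> ('click', 614, 486)
```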

Additionally, UI-TARS introduces systematic reasoning mechanisms that support multi-step task decomposition, reflective thinking, and milestone identification, enabling high-level planning and decision-making in complex tasks. To keep improving, it adopts iterative training with online reflection: new interaction trajectories are automatically collected, filtered, and reflected upon, then fed back into training, allowing the model to adapt to unforeseen situations with less manual intervention.
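
The data flywheel described here can be pictured as a filter over freshly collected trajectories: successes feed the next training round as-is, while failures are kept only once a reflection has been attached. The sketch below is a schematic of that idea under assumed data structures, not ByteDance's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    steps: list[str]       # serialized (thought, action) pairs
    succeeded: bool        # did the task reach its goal?
    reflection: str = ""   # model-written critique for failed runs


def build_next_round(trajectories: list[Trajectory]) -> list[Trajectory]:
    """Filter fresh trajectories into training data for the next iteration."""
    keep: list[Trajectory] = []
    for traj in trajectories:
        if traj.succeeded:
            keep.append(traj)      # successful runs train as-is
        elif traj.reflection:
            keep.append(traj)      # failed runs train paired with a critique
    return keep
```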

In summary, UI-TARS, ByteDance's new-generation GUI agent model, excels in perception, reasoning, action, and memory. Its release promises users a more convenient and efficient way to operate software, and it is poised to play a significant role in the field of automated interaction.