UI-TARS: ByteDance's New AI Agent That Controls Computers for Complex Tasks

2025-01-23

ByteDance has recently introduced a new AI agent called UI-TARS that can control computers and execute complex workflows. Similar to Anthropic's Computer Use, the agent understands graphical user interfaces (GUIs), performs reasoning, and autonomously executes actions step by step.

Training and Performance Excellence

UI-TARS was trained on approximately 50 billion tokens and comes in two versions, with 7B and 72B parameters. It achieved state-of-the-art (SOTA) performance across more than ten GUI benchmarks, consistently outperforming OpenAI's GPT-4o, Anthropic's Claude, and Google's Gemini in perception, grounding (GUI element localization), and overall agent capabilities.

Researchers from ByteDance and Tsinghua University noted in their research paper that, through iterative training and reflective adjustment, UI-TARS can continuously learn from its mistakes and adapt to unforeseen situations with minimal human intervention.

Multimodal Input and Step-by-Step Thinking

UI-TARS works across desktop, mobile, and web applications, using multimodal inputs (text, images, and interactions) to understand its visual environment. The interface is intuitive: the left side shows the agent's step-by-step "thinking" process, while the larger right-hand area displays file access, website navigation, and application automation.

For instance, in a demonstration video, the model was asked to find round-trip flights from SEA to NYC and sort the results by price in ascending order. UI-TARS navigated to Delta Air Lines' website, filled in the relevant information, clicked the appropriate dates, and sorted and filtered the results by price, explaining each step in its thinking box.

In another scenario, it was instructed to install the autoDocstring extension in VS Code. It first reported that it needed to open VS Code, waited for the application to initialize, opened the Extensions view, and retried clicks when it encountered minor glitches to ensure success. Once in the Extensions view, it typed 'autoDocstring' into the search box and waited for the installation to complete.
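The retry behavior in this demo maps naturally onto a perceive-act-verify loop. The sketch below is purely illustrative and is not ByteDance's implementation; the agent object and its click, screen_contains, and reflect methods are hypothetical stand-ins.

```python
import time

# Hypothetical sketch of the retry-on-glitch behavior described above.
# None of these classes or methods come from UI-TARS; they only
# illustrate a perceive -> act -> verify loop with retries.

def click_with_retry(agent, target, max_retries=3, delay=1.0):
    """Click a GUI element, retrying if the click does not register."""
    for attempt in range(1, max_retries + 1):
        agent.click(target)                 # issue the action
        time.sleep(delay)                   # give the UI time to respond
        if agent.screen_contains(target.expected_result):
            return True                     # the screen changed as expected
        agent.reflect(f"click on {target} failed, attempt {attempt}")
    return False
```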

Superior Performance Over Competitors

In various benchmarks, UI-TARS consistently outperformed its competitors. For example, it scored 82.8% on VisualWebBench, surpassing GPT-4o's 78.5% and Claude 3.5's 78.2%. It also excelled on the WebSRC and ScreenQA-short benchmarks, as well as in tests measuring GUI element understanding and grounding.

The researchers emphasized that UI-TARS's superior perception and understanding of web and mobile environments form the foundation of its agent capabilities, which are crucial for task execution and decision-making.

Technical Details and Training Strategies

To help UI-TARS act step by step and identify what it sees, the research team trained it on a large-scale dataset of screenshots from various websites, applications, and operating systems, parsed with metadata including element descriptions and types, visual descriptions, bounding boxes, element functions, and text.
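A single record in such a dataset might look like the following sketch. The schema is an assumption for illustration: the paper describes these metadata categories, but not a concrete format or these field names.

```python
from dataclasses import dataclass

@dataclass
class ElementAnnotation:
    """One GUI element parsed from a screenshot (hypothetical schema)."""
    element_type: str        # e.g. "button", "text_field", "checkbox"
    description: str         # natural-language description of the element
    visual_description: str  # appearance: color, shape, icon
    bbox: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    function: str            # what interacting with the element does
    text: str                # visible text content, if any

# Invented example: a "Search flights" button on a booking page
annotation = ElementAnnotation(
    element_type="button",
    description="primary search button below the date pickers",
    visual_description="red rounded rectangle with white label",
    bbox=(412, 690, 598, 742),
    function="submits the flight search form",
    text="Search flights",
)
```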

Furthermore, UI-TARS uses state transition captioning to recognize and describe the differences between two consecutive screenshots and determine whether an action took effect. Set-of-Mark (SoM) prompting allows it to overlay distinct markers on specific image regions.
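Set-of-Mark prompting is commonly implemented by drawing numbered boxes onto a screenshot before passing it to the model, so the model can refer to elements by mark number. Below is a minimal generic sketch using Pillow; it is not UI-TARS's actual code, and the example coordinates are invented.

```python
from PIL import Image, ImageDraw

def overlay_marks(screenshot_path, boxes):
    """Draw numbered Set-of-Mark labels over detected element boxes.

    boxes: list of (x_min, y_min, x_max, y_max) tuples, e.g. from a
    GUI-element detector. Purely illustrative sketch.
    """
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(i), fill="red")  # mark number
    return img

# Example usage (assumes a screenshot.png exists on disk):
marked = overlay_marks("screenshot.png", [(412, 690, 598, 742)])
marked.save("screenshot_som.png")
```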

UI-TARS is also equipped with short-term and long-term memory: short-term memory tracks the task at hand, while long-term memory retains historical interactions to improve subsequent decisions. The research team additionally trained the model to perform both System 1 (fast, intuitive) and System 2 (deliberate) reasoning, enabling multi-step decision-making, "reflective" thinking, milestone recognition, and error correction.
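One common way to structure such a dual memory is sketched below, under the assumption that short-term memory holds the current task's recent steps while long-term memory stores condensed summaries of past episodes. This is a generic pattern, not ByteDance's implementation.

```python
from collections import deque

class AgentMemory:
    """Generic two-tier memory sketch (not UI-TARS's implementation)."""

    def __init__(self, short_term_capacity=10):
        # Short-term: recent (observation, thought, action) steps of the task
        self.short_term = deque(maxlen=short_term_capacity)
        # Long-term: condensed summaries of completed episodes
        self.long_term = []

    def record_step(self, observation, thought, action):
        self.short_term.append((observation, thought, action))

    def finish_episode(self, summary):
        """Condense the finished task into long-term memory and reset."""
        self.long_term.append(summary)
        self.short_term.clear()

    def context(self):
        """Build the context the model sees for its next decision."""
        return {
            "recent_steps": list(self.short_term),
            "past_episodes": self.long_term[-5:],  # a few recent summaries
        }
```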

To ensure the agent maintains consistent goals and can hypothesize, test, and evaluate potential actions through trial and error, the research team introduced two types of training data: error-correction data and post-reflection data. This strategy ensures that UI-TARS not only learns to avoid errors but also adapts dynamically when they occur.
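Conceptually, this kind of data pairs an erroneous step in a trajectory with its reflection and correction. A hypothetical sample, with invented field names, might look like:

```python
# Hypothetical shape of an error-correction / post-reflection sample.
# The field names are invented for illustration; the paper describes
# these data types but not a concrete format.
reflection_sample = {
    "state": "extensions view is open, search box focused",
    "erroneous_action": "click(element=17)  # missed the Install button",
    "reflection": "The click landed on the extension title, not on "
                  "'Install'; the button sits to the right of the title.",
    "corrected_action": "click(element=18)  # the Install button",
}
```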

Future Prospects and Competitive Advantage

UI-TARS demonstrates impressive capabilities, and its evolving use cases will be worth watching in an increasingly competitive AI agent field. The researchers point out that while native agents represent a significant leap forward, the future lies in integrating proactive and lifelong learning: through autonomous learning driven by continuous real-world interaction, UI-TARS is poised for further breakthroughs.

Compared with its competitors, UI-TARS excels in both web and mobile domains. Claude's Computer Use, for example, performs strongly on web-based tasks but struggles significantly in mobile scenarios, indicating that its GUI operation capabilities have not yet transferred well to mobile environments. UI-TARS, in contrast, shows strong capabilities in both.

In summary, UI-TARS, as a newly introduced AI agent, has already demonstrated exceptional performance and broad application prospects. With ongoing technological advances and expanding applications, it is expected to play an even more significant role in the AI agent landscape.