The rapid development of large language models (LLMs) and vision-language models (VLMs) is transforming the way mobile devices are automated, offering unprecedented solutions for complex user tasks. However, traditional step-by-step GUI agent methods, which handle user tasks through dynamic decision-making and reflection, rely heavily on powerful cloud-based models like GPT-4 and Claude, raising concerns about privacy, security, data usage, and cost.
In the past, mobile task automation primarily depended on template-based approaches, such as Siri, Google Assistant, and Cortana, which struggled with complex tasks. As technology advanced, GUI-based automation methods emerged, capable of handling more intricate tasks without relying on third-party APIs or extensive programming. Nevertheless, these methods, particularly script-based GUI agents, still face challenges in knowledge extraction and script execution because of the dynamic nature of mobile applications.
To address these challenges, researchers from the Institute of Artificial Intelligence Industry Research at Tsinghua University have introduced AutoDroid-V2. This mobile task automation tool leverages the coding capabilities of small language models (SLMs) to build robust GUI agents. Unlike traditional step-by-step GUI agents, AutoDroid-V2 uses a script-based approach that generates and executes multi-step scripts from user commands, significantly improving efficiency and performance.
AutoDroid-V2's architecture consists of offline and online phases. In the offline phase, the system builds application documentation by analyzing the app exploration history, providing a foundation for script generation. This documentation integrates AI-guided GUI state compression, automatic element XPath generation, and GUI dependency analysis, so that the scripts generated from it are concise and accurate. In the online phase, when a user submits a task request, a customized local LLM generates a multi-step script, which is then executed by a domain-specific interpreter for reliable and efficient runtime execution.
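To make the online phase concrete, the following is a minimal sketch of how a generated multi-step script might be run by a small interpreter that resolves element names to XPaths collected in the offline documentation. The action names (tap, set_text), the documentation format, and the MockDevice class are illustrative assumptions for this sketch, not AutoDroid-V2's actual interfaces.

```python
# Sketch of the online phase under assumed interfaces: a generated multi-step
# script is executed by a small domain-specific interpreter that looks up
# element XPaths in the documentation built offline.

# Offline-built documentation (assumed format): task-relevant elements -> XPaths.
APP_DOC = {
    "search_box": "//android.widget.EditText[@resource-id='com.example:id/search']",
    "search_button": "//android.widget.Button[@text='Search']",
    "first_result": "(//android.widget.TextView[@resource-id='com.example:id/title'])[1]",
}

# A multi-step script as the local LLM might emit it: one action per step.
GENERATED_SCRIPT = [
    {"action": "tap", "element": "search_box"},
    {"action": "set_text", "element": "search_box", "text": "wireless earbuds"},
    {"action": "tap", "element": "search_button"},
    {"action": "tap", "element": "first_result"},
]


class MockDevice:
    """Stand-in for a real UI driver (e.g., one backed by ADB/uiautomator)."""

    def tap(self, xpath: str) -> None:
        print(f"tap      {xpath}")

    def set_text(self, xpath: str, text: str) -> None:
        print(f"set_text {xpath} <- {text!r}")


def run_script(script, doc, device) -> None:
    """Interpret the script step by step, resolving element names via the doc."""
    for step in script:
        xpath = doc[step["element"]]  # element name -> XPath from offline documentation
        if step["action"] == "tap":
            device.tap(xpath)
        elif step["action"] == "set_text":
            device.set_text(xpath, step["text"])
        else:
            raise ValueError(f"unsupported action: {step['action']}")


if __name__ == "__main__":
    run_script(GENERATED_SCRIPT, APP_DOC, MockDevice())
```

Constraining execution to a fixed action vocabulary plus documented element references is what lets a deterministic interpreter, rather than the LLM itself, handle each runtime step, which is the intuition behind the efficiency gains reported below.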
In experiments covering 226 tasks across 23 mobile applications, AutoDroid-V2 improved the task completion rate by 10.5% to 51.7% over leading baselines such as AutoDroid, SeeClick, CogAgent, and Mind2Web. It also reduced computational requirements substantially, cutting input and output token consumption by 43.5x and 5.8x, respectively, and lowering LLM inference latency by 5.7x to 13.4x. Across different backbone LLMs, AutoDroid-V2 consistently maintained high performance.
The researchers noted that AutoDroid-V2 represents a significant advancement in mobile task automation. By utilizing on-device SLMs and a document-guided, script-based approach, it achieves accuracy comparable to cloud-based solutions while maintaining device-level privacy and security. This result points to a practical new direction for the field of mobile task automation.
Although AutoDroid-V2 performs well on GUI applications with structured text representations, it remains limited on applications that lack such representations, such as Unity-based apps and some web applications. The researchers suggest, however, that integrating VLMs to reconstruct structured GUI representations from visual features could address this limitation and further broaden AutoDroid-V2's applicability.