ScreenSpot-Pro: A New Benchmark for Professional High-Resolution GUI Agents

2025-01-06

In professional settings, graphical user interface (GUI) agents face three principal challenges. First, professional applications are inherently more complex than general-purpose software and demand a deep understanding of intricate layouts. Second, the high resolutions typical of specialized tools make targets proportionally smaller and harder to localize precisely. Third, reliance on supplementary tools and documentation further complicates workflows. These issues highlight the urgent need to improve GUI agent performance in demanding professional contexts.

Existing GUI grounding models and benchmarks fall short of the demands of professional environments. Benchmarks such as ScreenSpot are tailored to low-resolution screenshots and lack the coverage needed to represent diverse real-world scenarios. Models such as OS-Atlas and UGround handle high-resolution inputs inefficiently and fail in particular when icons are dense or targets are small, conditions that are common in professional applications. The absence of multilingual support further limits their applicability in global workflows. These deficiencies underscore the need for a more comprehensive and realistic benchmark to drive progress in the field.

To address these gaps, researchers from the National University of Singapore, East China Normal University, and Hong Kong Baptist University have introduced ScreenSpot-Pro, a benchmark designed specifically for professional high-resolution environments. It comprises 1,581 tasks across 23 applications in industries including development, creative tools, computer-aided design (CAD), scientific platforms, and office suites. Each task pairs a high-resolution, full-screen screenshot with an expert annotation, ensuring precision and authenticity, and instructions are provided in both English and Chinese to broaden the scope of evaluation. A distinctive feature of ScreenSpot-Pro is that annotations are captured from experts' real workflows, yielding authentic, high-quality data and making the benchmark a robust tool for evaluating and developing GUI grounding models.
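To illustrate how a benchmark of this shape is typically consumed, here is a minimal loading sketch. The record fields (img_filename, instruction, instruction_cn, bbox, application, group) and the JSON layout are assumptions made for illustration, not the official release format.

```python
from dataclasses import dataclass
from pathlib import Path
import json

@dataclass
class GroundingTask:
    """One ScreenSpot-Pro-style grounding task (field names are assumed, not official)."""
    image_path: Path                  # full-screen, high-resolution screenshot
    instruction_en: str               # English instruction, e.g. "Open the layer blending menu"
    instruction_cn: str               # Chinese counterpart of the same instruction
    bbox: tuple[int, int, int, int]   # target region as (left, top, right, bottom) in pixels
    application: str                  # e.g. "photoshop", "vscode", "autocad"
    category: str                     # e.g. "creative", "development", "cad"

def load_tasks(annotation_file: Path, image_dir: Path) -> list[GroundingTask]:
    """Load tasks from a JSON annotation file (layout assumed for illustration)."""
    records = json.loads(annotation_file.read_text(encoding="utf-8"))
    return [
        GroundingTask(
            image_path=image_dir / r["img_filename"],
            instruction_en=r["instruction"],
            instruction_cn=r["instruction_cn"],
            bbox=tuple(r["bbox"]),
            application=r["application"],
            category=r["group"],
        )
        for r in records
    ]
```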

The ScreenSpot-Pro dataset captures challenging, realistic scenarios. Because the screenshots are high resolution, target regions occupy only about 0.07% of the total screen area, so models must localize minute GUI elements precisely. Annotations were collected by experienced professionals using the specialized tools themselves, ensuring accuracy, and the bilingual instructions allow performance to be tested in both English and Chinese. The tasks also span varied workflows that capture the nuances of real professional work. Together, these characteristics make the benchmark well suited to evaluating and improving the accuracy and flexibility of GUI agents.
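To make the 0.07% figure concrete, the sketch below computes the fraction of a screenshot covered by a target bounding box; the example numbers are illustrative, not drawn from the dataset.

```python
def relative_target_area(bbox, image_width, image_height):
    """Fraction of the screenshot covered by the target bounding box."""
    left, top, right, bottom = bbox
    target_area = max(0, right - left) * max(0, bottom - top)
    return target_area / (image_width * image_height)

# Example: a 110 x 53 px button on a 3840 x 2160 (4K) screenshot
frac = relative_target_area((100, 50, 210, 103), 3840, 2160)
print(f"{frac:.4%}")  # 0.0703% -- on the order of the ~0.07% figure reported for the dataset
```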

Evaluating current GUI grounding models on ScreenSpot-Pro reveals notable deficiencies in high-resolution professional environments. Among the models evaluated, OS-Atlas-7B achieves the highest accuracy at just 18.9%. Iterative methods such as ReGround, which refine predictions over multiple steps, can raise accuracy to 40.2%. Small components such as icons pose substantial challenges, and bilingual tasks further expose model limitations. These findings emphasize the need for techniques that strengthen contextual understanding and robustness in complex GUI environments.
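The accuracy numbers above follow the usual grounding criterion: a prediction counts as correct when the predicted click point falls inside the target's bounding box. Below is a minimal sketch of that metric together with a generic multi-step refinement loop in the spirit of the iterative methods mentioned; the `predict` callable stands in for any grounding model, and the refinement procedure is an assumed illustration, not the exact ReGround algorithm.

```python
from PIL import Image

def is_correct(pred_xy, bbox):
    """A predicted click point is correct if it lands inside the ground-truth box."""
    x, y = pred_xy
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def accuracy(predictions, bboxes):
    """Fraction of tasks whose predicted point falls inside the annotated target region."""
    hits = sum(is_correct(p, b) for p, b in zip(predictions, bboxes))
    return hits / len(bboxes)

def iterative_ground(predict, screenshot: Image.Image, instruction: str,
                     steps: int = 2, crop_frac: float = 0.5):
    """Generic multi-step refinement (an illustrative sketch, not the paper's method):
    re-run the grounding model on a crop centered on its previous guess so that
    small targets occupy a larger share of the model's input."""
    left, top = 0, 0
    region = screenshot
    x, y = predict(region, instruction)            # initial full-screen guess
    for _ in range(steps):
        w, h = region.size
        cw, ch = int(w * crop_frac), int(h * crop_frac)
        # Center a smaller crop on the current guess, clamped to the image bounds.
        cl = min(max(int(x - cw / 2), 0), w - cw)
        ct = min(max(int(y - ch / 2), 0), h - ch)
        region = region.crop((cl, ct, cl + cw, ct + ch))
        left, top = left + cl, top + ct
        x, y = predict(region, instruction)        # guess in crop coordinates
    return left + x, top + y                       # map back to full-screen coordinates
```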

ScreenSpot-Pro sets a new standard for assessing GUI agents in professional high-resolution environments. By targeting the specific challenges of complex workflows with a diverse, precisely annotated dataset, it drives innovation in GUI grounding and lays the groundwork for smarter, more capable agents that can carry out professional tasks seamlessly, enhancing productivity and innovation across industries.