Zhipu AI has officially launched CogAgent-9B-20241220, a specialized Agent task model built on GLM-4V-9B. The company has also announced that the model's base will be open-sourced so the community can build on and use it, a move intended to promote the growth of the large-model Agent ecosystem.
CogAgent-9B-20241220 is designed specifically for GUI (graphical user interface) interaction scenarios. It requires only screen captures as input, with no need for HTML or other textual representations of the interface. Given a user-defined task and the history of previous operations, it predicts and executes the next GUI action. This makes CogAgent suitable for any device operated through a GUI, including personal computers, smartphones, and in-vehicle systems.
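To make that input contract concrete, here is a minimal sketch of what a single prediction request could carry. The `GUIAgentRequest` container and its field names are assumptions for illustration, not the released interface; the one point it takes from the description above is that the only visual input is a raw screenshot, with no HTML or other textual representation of the page.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List

@dataclass
class GUIAgentRequest:
    """One prediction request: everything the model needs to choose the next action.

    Hypothetical structure for illustration; the actual serving interface may differ.
    """
    task: str                                          # user-defined task, in Chinese or English
    history: List[str] = field(default_factory=list)   # previously executed actions, in order
    screenshot_png: bytes = b""                        # raw screen capture; no HTML/DOM required

request = GUIAgentRequest(
    task="Open the settings app and turn on dark mode",
    history=["CLICK(element='Home')"],                 # illustrative; the real action format may differ
    screenshot_png=Path("current_screen.png").read_bytes(),  # assumes this screenshot file exists
)
```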
Compared to the first version of CogAgent, which was open-sourced in December 2023, the new CogAgent-9B-20241220 model has seen significant improvements in GUI perception, inference accuracy, action space completeness, task versatility, and generalization. Additionally, it supports bilingual (Chinese and English) screen captures and language interactions, expanding its application scenarios.
In its workflow, the CogAgent model receives natural language instructions from users, historical action records, and current GUI screenshots as inputs. It then calculates the most appropriate action based on these inputs and injects the action into the GUI via a client-side application. After the GUI responds and updates the image content, the action is added to the historical record. CogAgent continues to calculate subsequent actions based on the updated history and screenshot until the task is completed.
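Expressed as code, that workflow is essentially a perceive-predict-act loop. The sketch below is a hypothetical outline of it: the helper functions (`capture_screen`, `predict_next_action`, `parse_structured_action`, `execute_on_gui`, `is_task_complete`) are placeholders standing in for the client-side application and the model call, and would have to be implemented against a real deployment and device.

```python
from typing import List

# Hypothetical client-side helpers; a real integration would implement these
# against the target device (PC, smartphone, in-vehicle system) and model endpoint.
def capture_screen() -> bytes: ...                                    # grab the current GUI screenshot
def predict_next_action(task: str, history: List[str], screenshot: bytes) -> str: ...  # call CogAgent
def parse_structured_action(raw_output: str) -> str: ...              # extract the function-call-style action
def execute_on_gui(action: str) -> None: ...                          # inject the action into the GUI
def is_task_complete(action: str) -> bool: ...                        # e.g. the model emits an explicit finish action

def run_agent(task: str, max_steps: int = 20) -> None:
    """Perceive -> predict -> act loop, mirroring the workflow described above."""
    history: List[str] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                     # current GUI screen capture
        raw = predict_next_action(task, history, screenshot)
        action = parse_structured_action(raw)
        if is_task_complete(action):
            break
        execute_on_gui(action)                            # GUI responds and the screen content updates
        history.append(action)                            # executed action joins the historical record
```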
The output of CogAgent includes four components: the thought process (including state and plan), a natural language description of the next action, a structured description of the next action, and a sensitivity judgment of the next action. The structured description is presented in a function call format, making it easy for client-side applications to parse and execute.
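Because the structured description arrives in a function-call format alongside the other three components, a client application can split the output mechanically. The sketch below assumes the components appear as labeled text sections; the labels (`Thought`, `Action`, `Grounded Operation`, `Sensitive`) and the sample action string are illustrative guesses, not the documented output format.

```python
import re
from dataclasses import dataclass

@dataclass
class CogAgentOutput:
    thought: str       # thought process: state assessment and plan
    action: str        # natural-language description of the next action
    operation: str     # structured, function-call-style action for the client to execute
    sensitive: str     # sensitivity judgment for the next action

# Illustrative section labels; the real model output may label its parts differently.
_SECTIONS = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*"
    r"Action:\s*(?P<action>.*?)\s*"
    r"Grounded Operation:\s*(?P<operation>.*?)\s*"
    r"Sensitive:\s*(?P<sensitive>.*)",
    re.DOTALL,
)

def parse_output(raw: str) -> CogAgentOutput:
    match = _SECTIONS.search(raw)
    if match is None:
        raise ValueError("unexpected output format")
    return CogAgentOutput(**match.groupdict())

sample = (
    "Thought: The inbox is open; the next step is to select all messages.\n"
    "Action: Click the 'Select all' checkbox at the top of the list.\n"
    "Grounded Operation: CLICK(box=[120, 88, 152, 120], element='Select all')\n"
    "Sensitive: No"
)
print(parse_output(sample).operation)
```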
In terms of model upgrades, CogAgent-9B-20241220 uses the more powerful GLM-4V-9B as its base model, significantly enhancing image understanding performance. It also optimizes the visual processing module to support high-resolution image inputs and improves model efficiency through parameterized downsampling methods. Zhipu AI has also collected and integrated various datasets, including unsupervised data and GUI instruction fine-tuning datasets, for model training and fine-tuning.
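The announcement does not detail how the parameterized downsampling works. As a rough illustration of the general idea only (a learned projection that merges neighbouring visual patches so that high-resolution screenshots yield fewer visual tokens), here is a generic PyTorch sketch; it is not the actual CogAgent module.

```python
import torch
import torch.nn as nn

class LearnedDownsample(nn.Module):
    """Generic illustration of parameterized downsampling: a strided convolution
    learns how to merge neighbouring visual patches, cutting the visual-token
    count (here by 4x) before the features reach the language model."""
    def __init__(self, dim: int, stride: int = 2):
        super().__init__()
        self.reduce = nn.Conv2d(dim, dim, kernel_size=stride, stride=stride)

    def forward(self, patch_grid: torch.Tensor) -> torch.Tensor:
        # patch_grid: (batch, dim, height, width) feature map from the vision encoder
        return self.reduce(patch_grid)

features = torch.randn(1, 1024, 64, 64)          # e.g. a high-resolution screenshot's patch grid
tokens = LearnedDownsample(dim=1024)(features)   # -> (1, 1024, 32, 32): 4x fewer visual tokens
```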
For pre-training and post-training, CogAgent introduces several optimizations. For example, a GUI Grounding pre-training strategy uses screenshots paired with layout information to establish the correspondence between interface sub-regions and their layout representations. In the post-training phase, the model's understanding of GUI content and functionality is deepened by integrating multi-task GUI-related data and adopting more rigorous training strategies.
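One common way to realize a grounding objective like this is to turn every layout element into text-to-region and region-to-text training pairs. The sketch below illustrates that general idea only; the element structure, prompt wording, and coordinate convention are all assumptions, not Zhipu AI's actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LayoutElement:
    text: str                       # layout/DOM representation of the element
    box: Tuple[int, int, int, int]  # sub-region of the screenshot: (x1, y1, x2, y2)

def grounding_pairs(elements: List[LayoutElement]) -> List[Tuple[str, str]]:
    """Turn each layout element into (prompt, target) pairs so the model learns to
    map interface sub-regions to layout representations and back."""
    pairs = []
    for el in elements:
        pairs.append((f"Where on the screen is: {el.text}?", f"box={list(el.box)}"))
        pairs.append((f"What is at box={list(el.box)}?", el.text))
    return pairs

layout = [LayoutElement(text="button: 'Sign in'", box=(840, 32, 920, 64))]
for prompt, target in grounding_pairs(layout):
    print(prompt, "->", target)
```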
To validate the performance of CogAgent-9B-20241220, Zhipu AI ran tests on multiple datasets. The results show that CogAgent achieved leading scores in GUI localization, single-step operations, Chinese step-wise benchmarks, and multi-step operations. It fell slightly short only on the OSWorld dataset, where it trailed the specialized Claude-3.5-Sonnet and GPT-4o models that rely on external GUI grounding models.
In summary, the launch of CogAgent-9B-20241220 marks a significant advancement for Zhipu AI in the field of GUI interaction Agents. With continued development and utilization by the community, CogAgent is expected to provide intelligent support for a broader range of devices and applications that rely on GUI interactions.