CogVLM2: Zhipu AI Releases Next-Generation Multimodal Large Model

2024-05-21

AI technology company Zhipu AI has officially announced the launch of its latest multimodal large model, CogVLM2. The new generation marks a qualitative leap in key performance indicators, improving significantly over its predecessor CogVLM in processing capability, depth of understanding, and range of applications. CogVLM2 supports text contexts of up to 8K and handles images at resolutions of up to 1344×1344, setting a new benchmark for combined vision and text processing.
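
For a rough sense of what a 1344×1344 input implies for a vision-transformer encoder, the back-of-the-envelope sketch below counts image patches under two assumptions that are not confirmed details of CogVLM2: a patch size of 14, common in the EVA-CLIP family the CogVLM line builds on, and a 2×2 downsampling step before the language model.

```python
# Back-of-the-envelope patch count for a 1344x1344 input.
# Patch size 14 and 2x2 downsampling are assumptions about the
# pipeline, not confirmed details of CogVLM2.
resolution = 1344
patch_size = 14
patches_per_side = resolution // patch_size   # 96
raw_patches = patches_per_side ** 2           # 9216 vision tokens
downsampled = (patches_per_side // 2) ** 2    # 2304 after 2x2 pooling
print(raw_patches, downsampled)
```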

According to Zhipu AI, CogVLM2 achieves a performance improvement of up to 32% on the OCRbench benchmark and 21.9% on TextVQA, demonstrating its strength in document and image understanding. And although CogVLM2 weighs in at only 19B parameters, its results across a range of tests approach, and in some cases surpass, those of the well-known GPT-4V.

CogVLM2's technical architecture has been carefully optimized, pairing a 5-billion-parameter visual encoder with a visual expert module of up to 7 billion parameters. This design couples the visual and language modalities tightly, achieving deep fusion: through fine-grained parameter allocation and interaction between modules, CogVLM2 accurately models the complex relationships between visual and language sequences, significantly improving its handling of visual information while preserving its strengths in language processing.
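
To make the visual-expert idea concrete, here is a minimal PyTorch sketch of the routing pattern described for the CogVLM line of models: within a layer, image-token positions pass through a dedicated set of weights while text tokens use the original language-model weights. The class name, sizes, and layer structure are illustrative assumptions, not CogVLM2's actual code.

```python
import torch
import torch.nn as nn

class VisualExpertFFN(nn.Module):
    """Schematic feed-forward block with a 'visual expert':
    text tokens use the language FFN, image tokens use a parallel
    FFN with its own weights (illustrative only)."""

    def __init__(self, d_model: int = 64, d_ff: int = 128):
        super().__init__()
        self.text_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # Separate parameters dedicated to image tokens.
        self.vision_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, hidden: torch.Tensor, vision_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model)
        # vision_mask: (batch, seq_len) bool, True where the token is an image patch
        out = torch.empty_like(hidden)
        out[~vision_mask] = self.text_ffn(hidden[~vision_mask])
        out[vision_mask] = self.vision_ffn(hidden[vision_mask])
        return out

# Toy usage: 8 text tokens followed by 4 image tokens.
layer = VisualExpertFFN()
hidden = torch.randn(1, 12, 64)
vision_mask = torch.zeros(1, 12, dtype=torch.bool)
vision_mask[:, 8:] = True
print(layer(hidden, vision_mask).shape)  # torch.Size([1, 12, 64])
```

In CogVLM's published design this duplication applies to the attention projections as well as the feed-forward blocks, which is how the expert module alone reaches into the billions of parameters.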

Thanks to its multi-expert module structure, CogVLM2 activates only about 12 billion parameters during inference. This design not only improves inference efficiency markedly, but also keeps CogVLM2 stable and efficient when processing data at scale.
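
The gap between total and activated parameters falls out of that routing: any single token runs through only one of the parallel weight sets, so the per-token active count is smaller than the sum of all weights. The toy accounting below illustrates the mechanism only; it is not CogVLM2's real parameter breakdown.

```python
import torch.nn as nn

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# Two parallel FFNs as in the visual-expert sketch above (toy sizes).
text_ffn = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64))
vision_ffn = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64))

total = count_params(text_ffn) + count_params(vision_ffn)
active = count_params(text_ffn)  # a text token only runs through text_ffn
print(f"total={total}, active per text token={active}")  # active < total
```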

In terms of performance, CogVLM2 does well across multimodal benchmarks, from text- and image-understanding tests such as TextVQA, DocVQA, and ChartQA to complex reasoning and cross-disciplinary tests such as OCRbench, MMMU, MMVet, and MMBench. Both of its released models achieve state-of-the-art results on several benchmarks and remain competitive with closed-source models on the rest.

Zhipu AI's release of CogVLM2 is a clear step forward for AI in multimodal processing. As the technology continues to advance and application scenarios expand, CogVLM2 is expected to open up new possibilities and opportunities for AI.