Zhipu has officially launched its latest image generation model, CogView4, and open-sourced it. The model excels at complex semantic alignment and instruction following, accepts Chinese and English prompts of arbitrary length, generates images at arbitrary resolutions within a supported range, and has some ability to render text inside the generated images. Notably, CogView4 is the first image generation model to be open-sourced under the Apache 2.0 license.
In evaluations, CogView4 performed strongly on DPG-Bench, a benchmark that primarily measures how well text-to-image models handle complex semantic alignment and instruction following. The CogView4-6B version ranked first overall, placing it at the front of open-source text-to-image models.
The CogView4 model adopts a mixed training paradigm that supports text descriptions of arbitrary length and images of arbitrary resolution. For positional encoding of image tokens, it uses two-dimensional rotary position embedding (2D RoPE) to model spatial location, and interpolates the positional encoding to support generation at different resolutions. For diffusion modeling, CogView4 combines a flow-matching scheme with parameterized linear dynamic noise scheduling to meet the signal-to-noise-ratio requirements of images at different resolutions.
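To make the positional-encoding idea concrete, here is a minimal PyTorch sketch of 2D RoPE with position interpolation. The dimension split (half the channels for rows, half for columns), the base frequency, and the interpolation back onto the training grid are illustrative choices, not CogView4's exact implementation.

```python
import torch

def rope_1d_freqs(positions: torch.Tensor, dim: int, base: float = 10000.0):
    # Rotation angles for a standard 1D RoPE over `dim` channels (dim even).
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions.float()[:, None] * inv_freq[None, :]   # (N, dim/2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    # Rotate consecutive channel pairs of x (shape (..., N, dim)).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(q: torch.Tensor, h: int, w: int, train_h: int = 32, train_w: int = 32):
    # q: (batch, heads, h*w, head_dim). Half the channels encode the row
    # coordinate, the other half the column coordinate. Positions at a new
    # resolution are interpolated onto the training grid so the learned
    # frequency range is reused (the "interpolated positional encoding").
    d = q.shape[-1] // 2
    rows = torch.arange(h).repeat_interleave(w) * (train_h / h)
    cols = torch.arange(w).repeat(h) * (train_w / w)
    cos_r, sin_r = rope_1d_freqs(rows, d)
    cos_c, sin_c = rope_1d_freqs(cols, d)
    return torch.cat([apply_rope(q[..., :d], cos_r, sin_r),
                      apply_rope(q[..., d:], cos_c, sin_c)], dim=-1)

q = torch.randn(1, 8, 64 * 64, 64)   # a 64x64 latent grid, head_dim 64
q = rope_2d(q, h=64, w=64)           # positions rescaled to the 32x32 training grid
```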
Architecturally, CogView4 retains the Share-param DiT design of the previous generation while giving the text and image modalities independent adaptive LayerNorm layers for efficient cross-modal adaptation. The model is also trained in multiple stages, including base-resolution training, general-resolution training, high-quality-data fine-tuning, and human-preference alignment, so that generated images are aesthetically strong and match human preferences.
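The sketch below illustrates the idea of shared transformer weights with per-modality adaptive LayerNorm, again in PyTorch. The module names (AdaLayerNorm, SharedBlockWithPerModalityNorm), the use of the timestep embedding as the conditioning vector, and all sizes are assumptions for illustration, not CogView4's actual modules.

```python
import torch
import torch.nn as nn

class AdaLayerNorm(nn.Module):
    # LayerNorm whose scale/shift are predicted from a conditioning vector
    # (e.g. a diffusion timestep embedding), as in adaLN-style DiT blocks.
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale[:, None]) + shift[:, None]

class SharedBlockWithPerModalityNorm(nn.Module):
    # One attention block whose weights are shared across modalities, with
    # *independent* adaLN layers for text tokens and image tokens.
    def __init__(self, dim: int, cond_dim: int, heads: int = 8):
        super().__init__()
        self.text_norm = AdaLayerNorm(dim, cond_dim)
        self.image_norm = AdaLayerNorm(dim, cond_dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, image, cond):
        # Normalize each modality with its own adaLN, then run joint
        # attention over the concatenated sequence with shared weights.
        x = torch.cat([self.text_norm(text, cond),
                       self.image_norm(image, cond)], dim=1)
        out, _ = self.attn(x, x, x)
        n = text.shape[1]
        return text + out[:, :n], image + out[:, n:]

block = SharedBlockWithPerModalityNorm(dim=256, cond_dim=128)
t = torch.randn(2, 77, 256)    # text tokens
i = torch.randn(2, 1024, 256)  # image tokens
t, i = block(t, i, torch.randn(2, 128))
```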
In the training framework, CogView4 moves beyond the conventional fixed token length, allowing a higher token cap while sharply reducing text-token redundancy during training. With training captions averaging 200-300 tokens, this cuts token redundancy by about 50% relative to the conventional fixed 512-token scheme and yields efficiency gains in the model's progressive training stages.
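A quick back-of-the-envelope check of that figure, assuming every caption would otherwise be padded to the fixed 512-token cap (the 250-token average is a midpoint of the reported 200-300 range):

```python
avg_caption_len = 250   # midpoint of the reported 200-300 token average
fixed_cap = 512         # conventional fixed token length
padding = fixed_cap - avg_caption_len              # wasted tokens per caption
print(f"redundancy: {padding / fixed_cap:.0%}")    # -> 51%, i.e. "about 50%"
```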
For text encoding, CogView4 replaces the English-only T5 encoder with the bilingual GLM-4 encoder and trains on Chinese-English text-image pairs, so the model handles prompts in either language. This makes CogView4 better suited to creative work in China's advertising, short-video, and related fields.
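The bilingual capability can be exercised directly from Python. The sketch below assumes the Hugging Face diffusers integration (CogView4Pipeline, available in recent diffusers releases) and uses the standard diffusers text-to-image interface; the prompt and generation parameters are illustrative, and exact arguments may vary slightly across library versions.

```python
import torch
from diffusers import CogView4Pipeline

# Load the open-sourced 6B checkpoint; bfloat16 keeps memory usage modest.
pipe = CogView4Pipeline.from_pretrained(
    "THUDM/CogView4-6B", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# A Chinese prompt, handled natively by the bilingual GLM-4 text encoder.
# ("A Shiba Inu wearing an astronaut helmet, watercolor style.")
prompt = "一只戴着宇航员头盔的柴犬，水彩风格"

image = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]
image.save("cogview4_output.png")
```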
Furthermore, the CogView4-6B model is released under the Apache 2.0 license. Zhipu says it will gradually add ecosystem support such as ControlNet and ComfyUI and release a complete fine-tuning toolkit. The open-source release of CogView4 offers a new option and reference point for research and applications in image generation.