Zhipu AI officially open-sources CogVideoX, the AI video generation model behind "Qingying"

2024-08-06

Recently, Chinese artificial intelligence company Zhipu AI made a major move, officially open-sourcing its self-developed video generation model CogVideoX to developers worldwide. The move aims to accelerate the development of video generation technology and expand its applications in commercial and creative fields. Built on a cutting-edge large-model architecture, CogVideoX not only meets the needs of high-end commercial applications but also achieves significant breakthroughs in performance optimization.


Outstanding performance of the open-source version, unlimited creativity with a single card

It is worth noting that the open-sourced CogVideoX-2B version demonstrates strong performance optimization. At FP16 precision, the model requires only 18GB of VRAM for inference and 40GB for fine-tuning. This means a single NVIDIA RTX 4090 can handle inference, while fine-tuning can be done efficiently on a single NVIDIA A6000. This greatly lowers the technical threshold, enabling more developers and small businesses to get started and participate in the innovation and application of video generation technology.
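As a back-of-the-envelope illustration of those figures, the snippet below checks whether a given GPU's memory covers the quoted FP16 footprints. The function name and thresholds are this sketch's own, derived from the numbers above rather than from any CogVideoX tooling:

```python
# Quoted FP16 footprints for CogVideoX-2B (from the figures above).
INFERENCE_GB = 18   # VRAM needed for inference
FINETUNE_GB = 40    # VRAM needed for fine-tuning

def fits(vram_gb, task="inference"):
    """Return True if a GPU with `vram_gb` of memory covers `task`.

    Illustrative helper only; real memory use also depends on batch
    size, resolution, frame count, and optimizer state.
    """
    need = INFERENCE_GB if task == "inference" else FINETUNE_GB
    return vram_gb >= need

print(fits(24))                    # RTX 4090 (24 GB): inference fits
print(fits(24, task="finetune"))   # but fine-tuning does not
print(fits(48, task="finetune"))   # A6000 (48 GB): fine-tuning fits
```

Running it confirms the article's pairing: a 24GB card clears the inference bar but not the fine-tuning one, while a 48GB card clears both.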


Empowered by 3D VAE, reshaping the benchmark of video generation quality

The core competitiveness of the CogVideoX model lies in its 3D Variational Autoencoder (3D VAE) technology, which uses 3D convolution to compress videos along both the spatial and temporal dimensions, achieving a high compression ratio with excellent reconstruction quality. The architecture comprises an encoder, a decoder, and a latent-space regularizer; temporal causal convolutions ensure that information flows only from past frames to later ones, which keeps the generated video content coherent and consistent. In addition, the model integrates an expert Transformer that deeply analyzes the encoded video data and combines it with textual input to create high-quality, narratively rich video content.
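To make the "temporal causal convolution" idea concrete, here is a minimal NumPy sketch, not CogVideoX's actual code, of a convolution over a video's time axis whose output at frame t depends only on frames up to t:

```python
import numpy as np

def causal_temporal_conv(x, kernel):
    """Causal convolution along the time axis of a video tensor.

    x: array of shape (T, H, W) -- frames of a single-channel video.
    kernel: array of shape (k,) -- temporal filter weights.
    Output frame t depends only on input frames t-k+1 .. t, so no
    information leaks backward from future frames (the causality
    property described above; illustrative sketch only).
    """
    k = len(kernel)
    # Pad only at the START of the time axis ("causal" padding),
    # by repeating the first frame k-1 times.
    xp = np.concatenate([np.repeat(x[:1], k - 1, axis=0), x], axis=0)
    out = np.zeros(x.shape, dtype=float)
    for t in range(x.shape[0]):
        # Weighted sum over the k most recent (past) frames.
        out[t] = np.tensordot(kernel, xp[t:t + k], axes=(0, 0))
    return out

video = np.random.rand(8, 4, 4)  # 8 frames of 4x4 pixels
smoothed = causal_temporal_conv(video, np.array([0.25, 0.25, 0.5]))
```

Because only past frames are padded, editing a frame in the input can change the output from that frame onward but never earlier, which is exactly the causal structure the 3D VAE relies on.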


High-quality data-driven approach to solve video generation pain points

To train a high-performance CogVideoX model, Zhipu AI invested substantial resources in developing an efficient method for selecting high-quality video data. The method filters out low-quality videos with excessive editing or incoherent motion, ensuring the quality and purity of the training data. The team also built a pipeline that generates video captions from image captions, solving the problem that video data often lacks detailed textual descriptions and providing a richer, multidimensional source of information for model learning.
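The filtering step can be sketched as a simple predicate over per-clip statistics. The field names (`cuts_per_minute`, `motion_score`) and thresholds below are illustrative inventions for this sketch, not Zhipu AI's actual pipeline:

```python
def keep_clip(clip, max_cuts_per_minute=10, min_motion_score=0.2):
    """Keep a clip unless it is over-edited or its motion is incoherent.

    Hypothetical filter in the spirit of the data selection described
    above; real pipelines would compute these statistics from pixels.
    """
    if clip["cuts_per_minute"] > max_cuts_per_minute:
        return False  # excessive editing
    if clip["motion_score"] < min_motion_score:
        return False  # near-static or incoherent motion
    return True

candidates = [
    {"id": "a", "cuts_per_minute": 3,  "motion_score": 0.7},   # usable
    {"id": "b", "cuts_per_minute": 25, "motion_score": 0.8},   # over-edited
    {"id": "c", "cuts_per_minute": 2,  "motion_score": 0.05},  # near-static
]
kept = [c["id"] for c in candidates if keep_clip(c)]
print(kept)  # → ['a']
```

Only the first clip survives; the other two are rejected for exactly the failure modes the article names: excessive editing and incoherent (here, near-absent) motion.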

Leading evaluation results, continued exploration ahead

CogVideoX has demonstrated outstanding results across multiple key evaluation metrics, particularly in human motion, scene reconstruction, and dynamic quality, winning wide recognition in the industry. Meanwhile, Zhipu AI has also introduced evaluation tools focused on the dynamic characteristics of video, further refining the dimensions along which such models are assessed.