Alibaba Cloud has open-sourced its advanced video generation model, Wan2.1, which offers strong visual content generation capabilities. The model supports two primary tasks: text-to-video and image-to-video generation. To cater to different needs, Wan2.1 comes in two versions: a professional edition with 14 billion parameters, designed for complex motion generation and physical modeling, and an ultra-fast edition with 1.3 billion parameters, optimized for consumer-grade GPUs with lower memory requirements and well suited to secondary development and academic research.
Technically, Wan2.1 is built on a Causal 3D VAE and a Video Diffusion Transformer. The Causal 3D VAE is designed specifically for video: it compresses spatiotemporal information while enforcing a causality constraint along the time axis, keeping the generated content coherent and logically consistent. The Video Diffusion Transformer combines the strengths of diffusion models and Transformers, generating video by iteratively removing noise while using self-attention to capture long-range dependencies across frames.
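To make the causality constraint concrete, here is a minimal sketch in generic PyTorch (an illustration of the technique, not Wan2.1's actual implementation): a temporally causal 3D convolution pads only the "past" side of the time axis, so the features for frame t never depend on frames after t.

```python
# Illustrative sketch of a temporally causal 3D convolution, the kind of
# building block a Causal 3D VAE rests on. Generic PyTorch, not Wan2.1 code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1                    # pad past frames only
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width). F.pad pads the last
        # dims first: (W_left, W_right, H_top, H_bottom, T_front, T_back).
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

video = torch.randn(1, 3, 16, 64, 64)             # 16 frames at 64x64
print(CausalConv3d(3, 8)(video).shape)            # torch.Size([1, 8, 16, 64, 64])
```

Because all temporal padding sits in front of the clip, early frames can be encoded before later frames exist, which is what makes chunk-by-chunk encoding and decoding of long videos possible.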
For training and inference, Wan2.1 employs several parallelism strategies to accelerate computation. Training combines data parallelism (DP) with Fully Sharded Data Parallel (FSDP), and the diffusion module additionally uses a hybrid of RingAttention and Ulysses sequence parallelism to further improve training efficiency. Inference is accelerated with context parallelism (CP), and the large model is additionally sharded across devices to optimize inference performance.
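As a minimal sketch of the FSDP ingredient (generic PyTorch, not Wan2.1's training code; the model and sizes below are placeholders), wrapping a module in FullyShardedDataParallel shards its parameters, gradients, and optimizer state across ranks:

```python
# Toy FSDP training step; launch with `torchrun --nproc_per_node=N this_file.py`.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")               # ranks come from torchrun
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
    model = nn.TransformerEncoder(layer, num_layers=8).cuda()
    model = FSDP(model)                           # shard params/grads/optimizer state

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(2, 128, 1024, device="cuda")  # (batch, tokens, dim)
    loss = model(x).pow(2).mean()                 # stand-in for a diffusion loss
    loss.backward()
    opt.step()

if __name__ == "__main__":
    main()
```

RingAttention and Ulysses address a different axis: rather than sharding parameters, they split the very long video token sequence across devices inside the attention computation, which is why they are applied specifically to the diffusion module.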
In practical applications, Wan2.1 is versatile. Beyond the core text-to-video and image-to-video tasks, it also supports video editing, text-to-image generation, and video-to-audio generation. The model additionally offers visual-effects and text-rendering capabilities, covering a wide range of creative scenarios.
In terms of performance, Wan2.1 posts strong results on the authoritative VBench benchmark. The 14-billion-parameter professional edition scores 86.22%, significantly outperforming other models at home and abroad such as Sora, Luma, and Pika. The ultra-fast edition can generate 480P video with as little as 8.2 GB of GPU memory, making it compatible with almost all consumer-grade GPUs while remaining highly efficient.
Notably, Wan2.1 is open-sourced under the Apache 2.0 license and supports multiple mainstream frameworks. Code and weights are available on GitHub, HuggingFace, and ModelScope, giving developers a convenient path to use and deploy the model. The release aims to promote further advancement and application of video generation technology.
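As an example of how lightweight the entry path is, the snippet below sketches a text-to-video run through HuggingFace diffusers. The class names (WanPipeline, AutoencoderKLWan) and the checkpoint id (Wan-AI/Wan2.1-T2V-1.3B-Diffusers) reflect the published integration and are assumptions that may shift between releases:

```python
# Hedged quickstart: a short 480P clip from the 1.3B text-to-video model via
# HuggingFace diffusers. Class and checkpoint names are assumptions based on
# the published integration and may change between releases.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
# The VAE is typically kept in float32 for stability; the DiT runs in bfloat16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

frames = pipe(
    prompt="A cat walks on the grass, realistic style.",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "output.mp4", fps=15)
```

On GPUs with less memory, replacing pipe.to("cuda") with pipe.enable_model_cpu_offload() trades some speed for a much smaller memory footprint.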