Open-Sora 2.0 Fully Open-Source: Low-Cost, High-Performance, Ushering in a New Era of Video Generation

2025-03-13

Open-Sora 2.0, a brand-new open-source video generation model, has officially been released. The 11-billion-parameter model was trained on 224 GPUs for roughly $200,000, yet achieves commercial-grade performance comparable to HunyuanVideo and the 30-billion-parameter Step-Video.

Open-Sora 2.0 has demonstrated outstanding performance in authoritative evaluations such as VBench and in user preference tests, with results on par with closed-source models whose training often costs millions of dollars. The release includes not only the model weights and inference code but also fully open-sources the entire distributed training process, significantly enhancing the accessibility and scalability of high-quality video generation technology.

Technically, Open-Sora 2.0 continues the design philosophy of its predecessor, utilizing a 3D autoencoder and a Flow Matching training framework while incorporating a 3D full-attention mechanism and the MMDiT architecture, further improving video generation quality. Additionally, by initializing from the open-source text-to-image model FLUX, it substantially reduces training costs.
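The Flow Matching objective behind this training framework can be sketched in a few lines of NumPy. The toy `model` below is a placeholder for illustration only, not the actual MMDiT network: the network is trained to regress the constant velocity along a straight path between noise and data.

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """One flow-matching training step on a batch of clean samples x1.

    x1: (batch, dim) clean data (e.g. latent video patches).
    The model learns to predict the velocity x1 - x0 along the
    straight interpolation path x_t = (1 - t) * x0 + t * x1.
    """
    x0 = rng.standard_normal(x1.shape)        # Gaussian noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))    # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # linear interpolation
    v_target = x1 - x0                        # constant target velocity
    v_pred = model(xt, t)                     # network's velocity estimate
    return np.mean((v_pred - v_target) ** 2)  # MSE regression loss

# Toy stand-in "model": always predicts zero velocity.
zero_model = lambda xt, t: np.zeros_like(xt)

rng = np.random.default_rng(0)
loss = flow_matching_loss(zero_model, rng.standard_normal((8, 16)), rng)
```

At inference time, generation then amounts to integrating the learned velocity field from noise toward data with an ODE solver, which is what makes this formulation attractive for few-step sampling.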

To lower training expenses, Open-Sora 2.0 implements several innovative methods. First, through a rigorous data filtering mechanism, it ensures high-quality data input, boosting training efficiency. Second, it prioritizes low-resolution training to efficiently learn motion information before transitioning to high-resolution training, thus reducing overall computational costs. Moreover, by focusing on image-to-video tasks first, it accelerates model convergence. Finally, combining ColossalAI with system-level optimizations enables efficient parallel training, significantly improving computational resource utilization.
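The savings from the low-resolution-first schedule can be illustrated with rough token arithmetic. The resolutions, frame count, and patch size below are illustrative assumptions, not the team's published settings:

```python
def attention_tokens(frames, height, width, patch=2, temporal_patch=1):
    """Rough token count a DiT-style model attends over per clip."""
    return (frames // temporal_patch) * (height // patch) * (width // patch)

# Hypothetical two-stage schedule: learn motion cheaply at 256px,
# then refine appearance at 768px.
low = attention_tokens(frames=32, height=256, width=256)
high = attention_tokens(frames=32, height=768, width=768)

# Full-attention cost scales roughly with the token count squared,
# so each low-resolution step here is about 81x cheaper.
ratio = (high / low) ** 2
```

Under these toy numbers the high-resolution stage carries 9x the tokens and roughly 81x the attention cost per step, which is why spending most of the training budget at low resolution, and only finishing at high resolution, cuts the overall bill so sharply.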

The inference stage has also been optimized. The team trained a highly compressed video autoencoder that cuts inference time to under three minutes on a single GPU, a roughly tenfold speedup. To train this highly compressed encoder, the Open-Sora team introduced residual connections in the video downscaling and upscaling modules and employed a distillation-based optimization strategy, enhancing the expressiveness of the autoencoder's feature space.

The release of Open-Sora 2.0 marks a new breakthrough in open-source video generation technology. Not only does the model achieve commercial-grade performance, but it also significantly lowers the cost of high-quality video generation through comprehensive open-sourcing and a range of optimization measures. This achievement is expected to draw more developers to explore video generation technology and collectively drive the field forward.

The open-source repository for Open-Sora 2.0 is now live on GitHub, where users can access the model weights, inference code, and resources covering the entire distributed training process. The Open-Sora team has also provided a demo video showcasing the model's generative capabilities for users to reference and experience.