"Stability AI releases Stable Video Diffusion model, supporting high-definition video AI generation"

2023-12-06

Stability AI has released the code and model weights for Stable Video Diffusion (SVD), a video generation AI model. Conditioned on a single input image, the model generates 25 frames of video at a resolution of 576x1024 pixels.
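For those who want to try the image-to-video mode, a minimal inference sketch using the Hugging Face diffusers library might look like the following; the model id, resolution handling, and default parameters are assumptions based on the public release rather than Stability AI's reference scripts, and actual requirements (notably GPU memory) may vary.

```python
# Minimal image-to-video sketch with diffusers' StableVideoDiffusionPipeline.
# Model id and parameters are assumptions based on the public release.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # 25-frame checkpoint
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# The conditioning image is resized to the model's native 1024x576 resolution.
image = load_image("input.jpg").resize((1024, 576))

# decode_chunk_size trades GPU memory for decoding speed.
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```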

The model is based on Stability's Stable Diffusion text-to-image model, further pre-trained on videos and then fine-tuned on a smaller, high-quality curated dataset. To support this additional training, Stability collected the Large Video Dataset (LVD), which consists of 580 million video clips, equivalent to 212 years of runtime. While the initial release only supports image-to-video generation, Stability AI claims the model can be adapted to a variety of video generation tasks, including text-to-video and multi-view (i.e., 3D object) generation. The company also announced a waitlist for access to a web-based text-to-video interface. The model is currently licensed for research purposes only:

"While we eagerly update our model with the latest advancements and strive to incorporate your insights, we want to emphasize that the model is currently not suitable for real-world or commercial applications. Your insights and feedback on safety and quality are crucial for the final release of this model."

Stability AI's overall strategy for building SVD starts with collecting and annotating a large video dataset. The team first removes motion inconsistencies, such as cuts, from the source videos, along with clips that contain little or no motion. They then combine an image captioning model, a video captioning model, and an LLM to produce three synthetic captions for each clip. They also use CLIP to compute aesthetic scores for selected frames of each clip.
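As an illustration of the motion-filtering step, the sketch below scores clips by their average dense optical-flow magnitude and drops near-static ones; the use of OpenCV, the sampling stride, and the threshold are assumptions chosen for illustration, not Stability AI's actual pipeline.

```python
# Illustrative motion filter: discard clips whose average dense optical-flow
# magnitude is too small. Stride and threshold are hypothetical values.
import cv2
import numpy as np

def mean_flow_magnitude(video_path: str, stride: int = 5, max_pairs: int = 20) -> float:
    """Average optical-flow magnitude over sampled frame pairs of a clip."""
    cap = cv2.VideoCapture(video_path)
    magnitudes, prev_gray, idx = [], None, 0
    while len(magnitudes) < max_pairs:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                magnitudes.append(float(np.linalg.norm(flow, axis=-1).mean()))
            prev_gray = gray
        idx += 1
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0

# Keep a clip only if it shows enough motion (threshold is hypothetical).
keep = mean_flow_magnitude("clip_000123.mp4") > 1.0
```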

After training the base video diffusion model on the large dataset, the researchers fine-tuned separate task-specific models for text-to-video, image-to-video, frame interpolation, and multi-view generation on a smaller curated dataset. They also trained a LoRA camera-control block for the image-to-video model. In human evaluations, output from the image-to-video model was preferred over that of the state-of-the-art commercial products GEN-2 and PikaLabs, and the multi-view generation model surpassed state-of-the-art models such as Zero123 and SyncDreamer.
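Stability AI has not published the camera-control block itself, but the underlying LoRA mechanism it builds on, a small trainable low-rank update added to a frozen weight matrix, can be sketched generically as follows; this is a conceptual example, not Stability AI's implementation.

```python
# Generic LoRA adapter: a low-rank update (B @ A) learned on top of a frozen
# linear layer, so new conditioning (e.g., camera control) can be trained
# without modifying the base model's weights. Conceptual sketch only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrap one attention projection so only the adapter is trainable.
proj = nn.Linear(1024, 1024)
adapted = LoRALinear(proj, rank=8)
out = adapted(torch.randn(2, 16, 1024))
```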

Emad Mostaque, CEO of Stability AI, wrote about the current and future capabilities of this model on X:

"Not only does it have camera control through LoRA, you can do explosions and all sorts of effects... We will have stage setups, storyboarding, scene design, cinematography, and all other scene creation elements and brand new elements..."

In a discussion on Hacker News about SVD, a user pointed out the drawbacks of this approach:

"While I like SD and these video examples are great... it's a flawed approach: they never get the lighting right, and there are lots of inconsistencies everywhere. Any 3D artist or photographer would immediately notice this. However, I bet we'll soon have something better: you describe something, and then you get a complete 3D scene with 3D models, lighting setups, etc. Then this scene gets sent to Blender, and you click a button for actual rendering with correct lighting."

The code for Stable Video Diffusion is available on GitHub, and the model weights can be obtained from Hugging Face.
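Assuming the standard huggingface_hub client, the weights can also be fetched locally with a one-line snapshot download; the repository id below corresponds to the publicly listed 25-frame image-to-video checkpoint.

```python
# Download the released checkpoint files to the local Hugging Face cache.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("stabilityai/stable-video-diffusion-img2vid-xt")
print(local_dir)
```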