Microsoft Unveils DragNUWA: Elevating Video Generation to New Heights

2024-01-10

Artificial intelligence companies are competing to master the art of video generation. In the past few months, several companies in the industry, including Stability AI and Pika Lab, have released models that can generate various types of videos based on text and image prompts. Building on this, Microsoft AI has released a model aimed at providing finer control over video production.

The project, called DragNUWA, complements familiar text- and image-based methods with trajectory-based generation, letting users move objects or entire video frames along specific paths. It offers a simple way to generate videos that are highly controllable from semantic, spatial, and temporal perspectives while maintaining high-quality output.

Microsoft has open-sourced the project's model weights and demos so the community can try it out. It is worth noting, however, that this is still a research project and far from perfect.

What sets Microsoft DragNUWA apart?

Historically, AI-driven video generation has revolved around text, image, or trajectory inputs. These approaches have produced impressive results, but each struggles to deliver fine-grained control over the output.

For example, text and images alone cannot convey the intricate motion details present in a video. Likewise, images and trajectories cannot adequately specify objects that appear later in the video, and language can be ambiguous when describing abstract concepts; it cannot, for instance, distinguish a real-world fish from a painting of a fish.

To address these issues, in August 2023 the Microsoft AI team proposed DragNUWA, an open-domain, diffusion-based video generation model that combines images, text, and trajectories to enable highly controllable video generation from semantic, spatial, and temporal perspectives. Users can precisely define the text, images, and trajectories in the input to control camera movements, such as zoom effects, or object motion in the output video.

For example, a user can upload an image of a boat on water, add the text prompt "a boat sailing in a lake," and mark the boat's trajectory. DragNUWA then generates a video of the boat sailing in the marked direction: the trajectory supplies the motion details, the language describes the objects to come, and the image distinguishes one object from another.
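To make the three input modalities concrete, here is a minimal sketch in Python of how such a request might be assembled. DragNUWA's actual demo is driven through a Gradio UI, so the function and field names below are illustrative assumptions, not the real API; only the idea of pairing an image, a text prompt, and per-object point paths comes from the article.

```python
def make_trajectory(start, end, steps=16):
    """Hypothetical helper: linearly interpolate a drag path
    into a sequence of (x, y) pixel coordinates."""
    (x0, y0), (x1, y1) = start, end
    return [
        (x0 + (x1 - x0) * t / (steps - 1),
         y0 + (y1 - y0) * t / (steps - 1))
        for t in range(steps)
    ]

# A drag from the boat's position toward the right edge of the frame.
boat_path = make_trajectory(start=(120, 200), end=(480, 210))

# Illustrative request combining the three conditioning signals.
request = {
    "image": "boat.png",                 # first frame to animate
    "text": "a boat sailing in a lake",  # semantic prompt
    "trajectories": [boat_path],         # one point path per object
}
```

Per the researchers' description, longer or more curved point sequences would correspond to larger or more complex motions, and supplying several paths would steer several objects at once.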

Released on Hugging Face

In DragNUWA version 1.5, an early release on Hugging Face, Microsoft uses Stability AI's Stable Video Diffusion model to animate an image, or objects within it, along a specified path. Once the technology matures, it could make video generation and editing simple: imagine transforming backgrounds, animating images, and steering motion paths just by drawing a line here or there.

AI enthusiasts are excited about the development, with many calling it a major leap for creative AI. How the research model performs in the real world, however, remains to be seen. In its own testing, Microsoft claims the model achieves accurate camera and object motion across a range of dragging trajectories.

"First, DragNUWA supports complex curved trajectories, enabling the generation of objects moving along specific complex paths. Second, DragNUWA allows for varying trajectory lengths, with longer trajectories resulting in larger motion amplitudes. Finally, DragNUWA can control the trajectories of multiple objects simultaneously. To our knowledge, no existing video generation model has effectively achieved such trajectory controllability, highlighting the tremendous potential of DragNUWA in advancing controllable video generation in future applications," the company's researchers stated in their paper.

This work adds a new direction to the growing body of research on AI video. Just recently, Pika Lab made headlines by opening up its text-to-video interface, which works much like ChatGPT and offers a variety of customization options for generating high-quality short videos.