Meta GenAI Launches Fairy: A Groundbreaking AI Video Editing Framework

2023-12-29

Artificial intelligence is now applied across many areas of life, including video generation and video editing, where it opens up new creative possibilities and makes content generation and manipulation far more seamless. Video editing, however, remains challenging because temporal coherence must be maintained between frames. Traditional methods address this by tracking pixel motion with optical flow or by decomposing the video into a layered representation, but both approaches tend to fail on videos with large amounts of motion or complex dynamics, since accurate pixel tracking remains an unsolved problem in computer vision.

Researchers at Meta GenAI have therefore proposed Fairy, an efficient video-to-video synthesis framework designed for instruction-guided video editing. Given an input video of N frames and a natural-language editing instruction, Fairy produces a new video that follows the instruction while preserving the semantic content of the original. It uses an anchor-based cross-frame attention mechanism to propagate features across frames, which lets it generate a 120-frame video at 512×384 resolution in just 14 seconds, at least 44 times faster than previous state-of-the-art systems.

Fairy also maintains temporal continuity during the editing process. The researchers use a data augmentation strategy that makes the model equivariant to affine transformations. As a result, the system handles variations between the source and target images gracefully, especially in videos with wide-ranging motion or complex dynamics, further enhancing performance.
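The idea behind this augmentation can be sketched as follows: apply the *same* random affine warp to a (source, edited-target) training pair, so the model learns that editing commutes with affine transforms. This is a minimal NumPy illustration, not Fairy's actual pipeline; `warp_affine` and `augment_pair` are hypothetical helpers, and the rotation-only warp and its hyperparameters are illustrative assumptions.

```python
import numpy as np

def warp_affine(img, M):
    """Nearest-neighbour affine warp of an H x W image by a 2x3 matrix M."""
    H, W = img.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    # Inverse-map each output pixel back into the source image.
    inv = np.linalg.inv(np.vstack([M, [0.0, 0.0, 1.0]]))[:2]
    sx, sy = inv @ coords
    sx = np.clip(np.rint(sx).astype(int), 0, W - 1)
    sy = np.clip(np.rint(sy).astype(int), 0, H - 1)
    return img[sy, sx].reshape(img.shape)

def augment_pair(src, tgt, rng, max_deg=15.0):
    """Apply the SAME random rotation to a (source, edited-target) pair,
    so an editing model trained on such pairs is pushed toward
    equivariance to affine transformations."""
    H, W = src.shape[:2]
    th = np.deg2rad(rng.uniform(-max_deg, max_deg))
    c, s = np.cos(th), np.sin(th)
    cx, cy = W / 2.0, H / 2.0
    # Rotation about the image centre, written as a 2x3 affine matrix.
    M = np.array([[c, -s, cx - c * cx + s * cy],
                  [s,  c, cy - s * cx - c * cy]])
    return warp_affine(src, M), warp_affine(tgt, M)
```

A full implementation would also sample scaling, shearing, and translation, but the key point is that the identical transform is applied to both images in the pair.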

The design propagates value features extracted from carefully selected anchor frames via the cross-frame attention mechanism. The resulting attention map acts as a similarity measure that re-weights and aligns feature representations across frames. This significantly reduces feature discrepancies between frames and enhances temporal consistency in the final output.
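The mechanism described above can be sketched in a few lines: every frame's queries attend over keys and values drawn only from the anchor frames, and the softmax attention map (the similarity measure) weights the anchor value features that are propagated into each frame. This is a simplified single-head NumPy sketch under those assumptions, not Fairy's actual implementation; the function name and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def anchor_cross_frame_attention(frame_feats, anchor_ids):
    """Single-head sketch of anchor-based cross-frame attention.

    frame_feats: (T, L, D) array of per-frame token features.
    anchor_ids:  indices of the anchor frames supplying keys/values.
    """
    T, L, D = frame_feats.shape
    # Keys and values come only from the anchor frames.
    anchors = frame_feats[anchor_ids].reshape(-1, D)      # (A*L, D)
    out = np.empty_like(frame_feats)
    for t in range(T):
        q = frame_feats[t]                                # (L, D)
        # Attention map: similarity between this frame and the anchors.
        attn = softmax(q @ anchors.T / np.sqrt(D))        # (L, A*L)
        # Propagate anchor value features into the current frame.
        out[t] = attn @ anchors                           # (L, D)
    return out
```

Because every frame attends to the same anchor features, their representations are pulled toward a shared reference, which is what keeps the edited frames temporally consistent.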

The researchers evaluated the model on 1,000 generated videos and found that Fairy surpasses previous state-of-the-art systems in visual quality, while delivering more than a 44-fold speedup through parallel processing on eight GPUs. The model does have limitations: even with identical text prompts and initialization noise, the generated frames can show slight inconsistencies. These anomalies may be caused by the affine modifications applied to the input or by minor changes occurring across the video sequence.

In conclusion, Meta's Fairy is a transformative leap in video editing and artificial intelligence. With its exceptional temporal consistency and video synthesis, Fairy establishes itself as a benchmark for quality and efficiency in the industry. Users can generate high-resolution videos at extraordinary speeds thanks to its instruction-guided image-editing diffusion model, anchor-based cross-frame attention, and equivariant fine-tuning.