Microsoft launches VASA-1, an AI model that synthesizes realistic talking-face video

2024-04-22

A Microsoft Research team has introduced a new AI model called VASA-1, which can generate highly realistic talking-face videos from just a single image and an audio clip. The videos not only synchronize lip movements accurately with the audio, but also exhibit lifelike facial expressions and natural head movements.
The core of VASA-1 is a diffusion-based model that generates holistic facial dynamics and head motion and operates in a facial latent space. The design rests on two key ideas. First, it abandons the traditional approach of modeling different factors separately and instead generates facial dynamics and head motion together in the learned latent space. Second, the facial latent space is carefully designed and trained on a large video corpus so that it captures subtle details and dynamic variations in facial appearance while effectively separating facial expression, head pose, and identity information.
In experiments, VASA-1 showed significant improvements over state-of-the-art methods in lip-synchronization quality, realism of head movement, and overall video quality. Visually, the generated videos mark a qualitative leap in talking-face synthesis, to the point that real and generated footage can be difficult to tell apart. The model can also handle challenging inputs such as artistic photos, singing audio, and non-English speech, adapting well even though it was not specifically trained on such data.
Beyond output quality, VASA-1 is capable of real-time operation. It can generate 512x512 pixel video at up to 40 frames per second with extremely low latency, making it well suited to real-time applications.
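To put the real-time figures in perspective, the short sketch below works out what 512x512 video at 40 frames per second implies per frame. It uses only the resolution and frame rate quoted above; the function names are illustrative assumptions and have nothing to do with VASA-1's actual code.

```python
# Back-of-the-envelope arithmetic for the quoted figures (512x512 at up to 40 fps).
# The helper names are hypothetical; this is not part of VASA-1.

def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Number of output pixels that must be produced each second."""
    return width * height * fps


if __name__ == "__main__":
    fps = 40
    print(f"Per-frame budget at {fps} fps: {frame_budget_ms(fps)} ms")
    print(f"Pixels per second at 512x512: {pixels_per_second(512, 512, fps)}")
```

At the peak rate of 40 frames per second, each frame has to be produced in about 25 ms, which corresponds to roughly 10.5 million output pixels per second.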
The model also offers optional control over gaze direction, head distance, and emotion in the generated video, giving users more flexibility. The researchers measured the results with a set of newly developed evaluation metrics, which showed VASA-1 outperforming existing methods, particularly in lip-synchronization quality and the naturalness of head movement. While acknowledging the risk of misuse, the researchers emphasize VASA-1's broad potential in fields such as education, accessibility, and healthcare, where the ability to generate realistic talking-face video could bring new convenience and possibilities.