Sound is undoubtedly a core element of high-quality video. Despite the realistic visuals achieved by tools like Google's Veo, OpenAI's Sora, and Runway's Gen-3 Alpha, their clips are silent, and the results often feel flat and lifeless. To make up for this shortcoming, Google DeepMind's latest AI model attempts to bring videos to life by generating synchronized soundtracks. It is a genuinely impressive piece of technology.
Google's V2A (Video-to-Audio) technology combines video pixels with optional text prompts to create audio that closely matches the visual content. It can generate not only music and sound effects but also dialogue that matches the action on screen.
V2A adopts a diffusion-based approach to generate realistic audio. The system first encodes the video input into a compressed representation, then iteratively refines audio from random noise, guided by the visual encoding and any optional text prompts. The resulting audio is decoded into a waveform and combined with the video.
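To make the pipeline concrete, here is a minimal PyTorch sketch of that generation loop: encode the video, start from random noise, and iteratively denoise conditioned on the visual features. All module names, shapes, and the crude denoising schedule are illustrative assumptions, not DeepMind's actual implementation.

```python
# Toy sketch of a V2A-style diffusion pipeline. Everything here is a
# simplified assumption for illustration, not DeepMind's code.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Compresses raw video frames into a conditioning representation."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(3 * 64 * 64, dim)   # toy per-frame encoder

    def forward(self, frames):                    # frames: (T, 3, 64, 64)
        return self.proj(frames.flatten(1))       # -> (T, dim)

class AudioDenoiser(nn.Module):
    """Predicts the noise in a noisy audio latent, conditioned on video
    features (the real system can also condition on text prompts)."""
    def __init__(self, audio_dim=256, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + cond_dim, 1024), nn.SiLU(),
            nn.Linear(1024, audio_dim),
        )

    def forward(self, noisy_audio, video_cond):
        # Pool video features over time and attach them to every audio frame.
        cond = video_cond.mean(0, keepdim=True).expand(noisy_audio.size(0), -1)
        return self.net(torch.cat([noisy_audio, cond], dim=-1))

@torch.no_grad()
def generate_audio(frames, steps=50):
    """Iteratively refine random noise into an audio latent,
    guided by the encoded video (simplified DDPM-style loop)."""
    video_cond = VideoEncoder()(frames)
    audio = torch.randn(100, 256)                 # start from pure noise
    denoiser = AudioDenoiser()
    for _ in range(steps):
        noise_pred = denoiser(audio, video_cond)
        audio = audio - noise_pred / steps        # crude denoising step
    return audio                                  # a real system decodes this to a waveform
```

Calling `generate_audio(torch.randn(16, 3, 64, 64))` on a 16-frame clip yields an audio latent; in the actual system, a learned decoder then turns that latent into the final waveform that is muxed with the video.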
To improve audio quality and enable more precise sound generation, DeepMind trained the model on additional data, including AI-generated sound annotations and transcripts of spoken dialogue. This lets V2A associate specific audio events with the corresponding visual scenes while responding to the provided annotations or transcripts.
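The training side can be sketched in the same spirit: the denoiser sees noisy audio together with video features and an embedding of the annotation or transcript, and learns to predict the injected noise. This is a hedged toy sketch under the standard denoising objective; DeepMind has not published its loss or noise schedule, and every name here is hypothetical.

```python
# Hypothetical denoising-training objective with annotation conditioning.
import torch
import torch.nn.functional as F

def diffusion_training_loss(denoiser, clean_audio, video_cond, annotation_emb):
    """Corrupt the target audio with noise and train the model to
    recover that noise from the combined conditioning signals."""
    noise = torch.randn_like(clean_audio)
    t = torch.rand(clean_audio.size(0), 1)            # random noise level per sample
    noisy_audio = (1 - t) * clean_audio + t * noise   # toy linear noising schedule
    cond = torch.cat([video_cond, annotation_emb], dim=-1)
    pred_noise = denoiser(noisy_audio, cond)          # hypothetical denoiser signature
    return F.mse_loss(pred_noise, noise)
```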
However, V2A has its limitations. The generated audio depends heavily on the input video: artifacts or distortions in the footage directly degrade the sound. Lip synchronization for speech also has room for improvement, since the paired video generation model may not align mouth movements with the transcript.
Other generative AI tools are also working on this problem. Earlier this year, Pika Labs launched a similar feature called "Sound Effects," and Eleven Labs recently introduced its own Sound Effects Generator.
According to Google, what sets V2A apart is that it understands raw video pixels directly, which eliminates the tedious step of manually aligning generated sound with the visuals. Paired with video generation models like Veo, V2A can create a coherent audiovisual experience, making it promising for entertainment and virtual reality applications.
Google has been cautious about releasing video AI tools. To the disappointment of AI content creators, it does not plan to release V2A publicly for now; instead, the company is focused on addressing the existing limitations and ensuring a positive impact on the creative community. As with its other generative models, V2A's output will carry a SynthID watermark to guard against misuse.