DeepMind Unveils Next-Generation Video AI Veo 2, Challenging OpenAI

2024-12-17

Google's leading AI research lab, DeepMind, has unveiled Veo 2, its next-generation video generation AI. Veo 2 is an enhanced version of Veo and is integrated across multiple Google product lines. It can produce video clips longer than two minutes at resolutions of up to 4K (4096x2160 pixels).

On paper, Veo 2 holds a significant advantage over Sora, OpenAI's video generation model. Sora tops out at 1080p resolution and 20 seconds per clip, so Veo 2 offers roughly four times the pixel count and more than six times the duration. For now, however, that advantage is largely theoretical: in VideoFX, Google's experimental video creation tool, clips generated by Veo 2 are capped at 720p and 8 seconds.
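
The resolution and duration ratios quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the standard 1920x1080 frame for 1080p (the article gives no pixel dimensions for Sora) and treats "exceeding two minutes" as a lower bound of 120 seconds.

```python
# Figures as quoted in the article, except where marked as assumptions.
VEO2_RES = (4096, 2160)   # "up to 4K (4096x2160 pixels)"
SORA_RES = (1920, 1080)   # assumption: standard 1080p frame size
VEO2_SECONDS = 120        # assumption: "exceeding two minutes" used as a lower bound
SORA_SECONDS = 20         # Sora's stated maximum duration


def pixel_count(resolution):
    """Return the total number of pixels in a (width, height) frame."""
    width, height = resolution
    return width * height


pixel_ratio = pixel_count(VEO2_RES) / pixel_count(SORA_RES)
duration_ratio = VEO2_SECONDS / SORA_SECONDS

print(f"pixel ratio: {pixel_ratio:.2f}x")
print(f"duration ratio: at least {duration_ratio:.0f}x")
```

Under these assumptions the pixel ratio comes out at about 4.27 and the duration ratio at 6 or more, which matches the "four times" and "over six times" figures when resolution is measured in total pixels.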

VideoFX is currently in invitation-only testing, and Google says it will expand access this week. Eli Collins, Vice President of DeepMind, said that as the model matures, Veo 2 will become available to more users through Google's Vertex AI developer platform.

Veo 2 features several functional enhancements. Similar to Veo, Veo 2 can generate videos based on text prompts (e.g., "a car speeding on the highway") or a combination of text and reference images. However, Veo 2 demonstrates improved understanding of physical laws, camera control, and video clarity. Specifically, Veo 2 produces sharper textures and images, especially during rapid scene changes. Additionally, its advanced camera control allows for more precise positioning of the virtual camera, enabling the capture of objects and characters from various angles through camera movement.

Moreover, DeepMind asserts that Veo 2 simulates motion, fluid dynamics (such as coffee being poured into a cup), and lighting properties (including shadows and reflections) more realistically. The same applies to its rendering of various camera shots and cinematic effects, as well as complex human expressions.

Although DeepMind maintains that Veo 2 is unlikely to hallucinate elements such as extra fingers or "unexpected objects," the model has yet to fully escape the "uncanny valley." In some generated videos, the movements and expressions of characters or objects appear unnatural.

Collins acknowledges that Veo 2 still has room for improvement in coherence and consistency. He stated, "Veo can follow prompts for several minutes but struggles with maintaining complex instructions over extended periods. Similarly, character consistency can be a challenge. Additionally, generating fine details, handling fast and complex actions, and further enhancing realism are areas that need improvement."

DeepMind has declined to disclose the specific sources of Veo 2's training data. However, because YouTube is owned by Google, it is a likely source. DeepMind stated that Veo 2 was trained on a vast number of video-description pairs, but it has not provided a mechanism for creators to remove their works from the existing training datasets. This has sparked debate over copyright and artists' rights.

To mitigate potential risks associated with generative models, such as content duplication or the creation of harmful content, DeepMind has incorporated prompt-level filters in Veo 2 targeting violent, graphic, and adult content. Additionally, DeepMind employs its proprietary watermark technology, SynthID, to embed invisible markers within frames generated by Veo 2, aiming to reduce the risk of deepfakes. However, the watermark technology is not foolproof.

Meanwhile, DeepMind has announced an upgrade to its Imagen 3 image generation model. The new version will produce brighter, better-composed images and can render a variety of styles, including realism, impressionism, and anime. The upgraded Imagen 3 will also follow instructions more accurately and deliver images with richer details and textures.