Google AI has launched VideoPoet, a modeling approach that can turn any autoregressive language model or large language model (LLM) into a high-quality video generator. VideoPoet demonstrates state-of-the-art performance in video generation, particularly in producing large, coherent, and high-fidelity motion.
The core of VideoPoet is its multitasking capability. From animating still images to video inpainting and outpainting, and even generating audio from video, it covers a wide range of applications. The model can take text, images, or videos as input, and its outputs span transformations such as text-to-video, image-to-video, and video-to-audio. This versatility makes VideoPoet a comprehensive solution for video generation tasks. One of its main advantages is that it integrates multiple capabilities into a single model, eliminating the need for separate specialized components.
What sets VideoPoet apart is its reliance on discrete tokens to represent video and audio, similar to how LLMs handle text. By using multiple tokenizers (MAGVIT V2 for videos and images, and SoundStream for audio), VideoPoet can encode these modalities into sequences of discrete tokens and decode such sequences back into pixels and waveforms. This approach allows the model to extend its language-modeling capabilities to video and audio, providing powerful tools for creators and technical experts.
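The pipeline described above can be sketched in a few lines. Everything here is a toy stand-in: the real MAGVIT V2 and SoundStream tokenizers are learned neural codecs, and the next-token predictor is a trained LLM, none of which are reproduced here. The codebook size, the hash-style "encoder," and the arithmetic "model" are all illustrative assumptions.

```python
VOCAB_SIZE = 1024  # hypothetical codebook size; the real value is model-specific


def video_to_tokens(frames):
    """Toy encoder: map each frame (a list of pixel values) to one discrete token.
    Stands in for a learned tokenizer such as MAGVIT V2."""
    return [sum(frame) % VOCAB_SIZE for frame in frames]


def tokens_to_video(tokens):
    """Toy decoder: map tokens back to placeholder frames."""
    return [[t] for t in tokens]


def next_token(context):
    """Stand-in for the autoregressive LLM's next-token prediction
    (a deterministic toy function, not a trained model)."""
    return (sum(context) * 31 + len(context)) % VOCAB_SIZE


def generate(prompt_tokens, n_new):
    """Autoregressively extend a token sequence, one token at a time."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens


# Encode a conditioning clip, generate new tokens, decode back to "video".
prompt = video_to_tokens([[1, 2, 3], [4, 5, 6]])
out = generate(prompt, n_new=8)
clip = tokens_to_video(out)
print(len(clip))  # 10 frames: 2 conditioning + 8 generated
```

The point is the shape of the loop, not the components: once every modality is a stream of discrete tokens, the same next-token machinery an LLM uses for text applies unchanged.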
VideoPoet is capable of generating videos with diverse actions and styles based on specific text inputs, showcasing its advanced content and contextual understanding. Whether it's bringing a painting to life or generating video clips based on descriptive text, the model demonstrates exceptional ability in maintaining object integrity and appearance, even over longer durations. Google notes that the model supports generating videos in square or portrait orientations to accommodate the generation of short video content, and also supports generating audio from video inputs.
A notable feature of VideoPoet is its interactive video editing capability. Users can guide the model to modify actions or dynamics in a video, providing a high level of creative control. The model can also accurately respond to camera-motion commands, further enhancing its utility in creating dynamic and visually appealing content. Additionally, VideoPoet can generate plausible audio for its videos without any explicit guidance, showcasing its strong multimodal understanding.
By default, VideoPoet outputs 2-second videos. To go longer, it can condition on the last 1 second of an existing clip and predict the next 1 second of video. Repeating this step extends a video to arbitrary length.
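That looping trick can be sketched as follows. The tokens-per-second count, vocabulary size, and the deterministic "predictor" are illustrative assumptions standing in for the real model; only the structure of the loop reflects the description above.

```python
SECOND_LEN = 4    # hypothetical number of tokens per second of video
VOCAB_SIZE = 1024  # hypothetical codebook size


def predict_next_second(context):
    """Stand-in for the model: deterministically maps the last second's
    tokens to the next second's tokens (a toy function, not a trained LLM)."""
    return [(sum(context) + i) % VOCAB_SIZE for i in range(SECOND_LEN)]


def extend_video(tokens, target_seconds):
    """Repeatedly condition on the final second and append the predicted
    next second until the clip reaches the target length."""
    tokens = list(tokens)
    while len(tokens) < target_seconds * SECOND_LEN:
        tokens += predict_next_second(tokens[-SECOND_LEN:])
    return tokens


clip = extend_video([1, 2, 3, 4], target_seconds=5)
print(len(clip))  # 20 tokens = 5 "seconds" of video
```

Because each step only needs the most recent second as context, the procedure can in principle run indefinitely, which is how a model with a fixed 2-second output window produces videos of arbitrary length.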
While there is still a considerable gap in output quality compared to tools like Runway and Pika, VideoPoet highlights the significant progress Google has made in AI-based video generation and editing.