Meta announces new breakthroughs in AI image and video generation with Emu-based tools.

2023-11-17

Researchers at Meta have made significant progress in AI image and video generation.

The parent company of Facebook and Instagram has developed a new tool that gives users finer control over image editing through text commands, as well as a new method for text-to-video generation. Both build on Emu (Expressive Media Universe), the company's first foundation model for image generation.

Emu was announced in September and is now in production, powering Meta's AI feature Imagine, which lets users generate realistic images within Messenger. In a blog post, Meta's AI researchers explain that AI image generation is typically an iterative process: users try input prompts, the generated images may not fully match their intent, and they must keep adjusting the prompt until the result closely resembles what they had in mind.

Emu Edit for image editing

Meta aims to eliminate this trial-and-error process and give users more precise control, which is where its new Emu Edit tool comes in. It offers a novel approach to image manipulation in which users simply type text-based editing instructions. It handles local and global edits, background removal and addition, color and geometry transformations, object detection, segmentation, and many other editing tasks.

"Current methods often tend to overmodify or perform poorly on various editing tasks," the researchers write. "We believe that the primary goal should not just be to produce a 'believable' image, but that the model should focus on precisely modifying only the pixels relevant to the editing request."

To achieve this goal, Emu Edit is designed to strictly follow the user's instructions, ensuring that pixels unrelated to the editing request are not affected. For example, if a user wants to add the text "Aloha!" to an image of a baseball cap, the cap itself should not be altered.
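Emu Edit itself is not publicly available, but the same instruction-driven editing workflow can be tried with InstructPix2Pix, an open model with a similar text-command interface, through the Hugging Face diffusers library. A minimal sketch, assuming a CUDA GPU and a hypothetical local photo cap.png:

```python
# Instruction-based image editing with InstructPix2Pix -- an open analogue of
# Emu Edit's text-command workflow, not Meta's model.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

original = Image.open("cap.png").convert("RGB")  # hypothetical input photo

# The instruction plays the same role as an Emu Edit command: only the
# requested region should change, the rest of the image should stay intact.
edited = pipe(
    "Add the text 'Aloha!' to the baseball cap",
    image=original,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # higher values preserve more of the original
).images[0]
edited.save("cap_edited.png")
```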

The researchers state that incorporating computer vision tasks as instructions to image generation models gives users unprecedented control over image editing.

Emu Edit was trained on a dataset of 10 million synthetic samples, each consisting of an input image, a task description, and a target output image. The researchers believe this is the largest dataset of its kind to date, and credit it with Emu Edit's strong results in both faithfulness to the instructions and image quality.
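Meta has not published the dataset's schema, but each training sample can be pictured as a simple triplet. The field names below are illustrative only, not Meta's actual format:

```python
# Sketch of one Emu Edit training record as described above: an input image,
# a task instruction, and the target (edited) image. Field names are assumed.
from dataclasses import dataclass

@dataclass
class EditSample:
    input_image_path: str   # image before the edit
    instruction: str        # e.g. "Add the text 'Aloha!' to the cap"
    target_image_path: str  # ground-truth image after the edit
    task: str               # e.g. "local edit", "segmentation", "background removal"
```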

Emu Video for video generation

Meta's AI team is also focused on improving video generation. The researchers explain that the process of creating videos using generative AI is similar to image generation, but with the added element of bringing the images to life through motion.

Emu Video utilizes the Emu model and offers a simple diffusion-based text-to-video generation method. Meta states that this tool can respond to various inputs, including text-only, image-only, or a combination of both.

The video generation process is split into two steps: first an image is created conditioned on a text prompt, then a video is created conditioned on both that image and the text. According to the team, this "decomposition" approach provides an extremely efficient way to train video generation models.
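In outline, the two-step factorization looks like the sketch below. The function names are hypothetical stand-ins for the two diffusion models, since Emu Video itself is not public:

```python
# Hedged sketch of the "decomposed" text-to-video pipeline described above.
from typing import Any, Callable

Image = Any  # placeholder types; real tensor shapes are model-specific
Video = Any

def generate_video(
    prompt: str,
    text_to_image: Callable[[str], Image],              # first diffusion model
    image_text_to_video: Callable[[Image, str], Video], # second diffusion model
) -> Video:
    """Two-step generation: text -> keyframe image -> video."""
    keyframe = text_to_image(prompt)              # step 1: image from text
    return image_text_to_video(keyframe, prompt)  # step 2: animate it, still
                                                  # conditioned on the prompt
```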

"We demonstrate that decomposed video generation can be achieved through a diffusion model," the researchers write. "We provide key design decisions such as adjusting the noise process of video diffusion and multi-stage training, which enable us to directly generate higher-resolution videos."

Meta states that one advantage of this new approach is simplicity: it requires only a pair of diffusion models to generate four-second videos at 512x512 resolution and 16 frames per second, whereas the earlier Make-A-Video tool required a cascade of five models. The company claims that in human evaluations this work is strongly preferred in terms of overall quality and faithfulness to the text prompt compared to prior video generation work.
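Those figures pin down the raw output exactly. A back-of-envelope check, assuming uncompressed 8-bit RGB frames:

```python
# Size of one Emu Video clip from the figures above
# (assumption: uncompressed 8-bit RGB frames).
seconds, fps = 4, 16
frames = seconds * fps                   # 64 frames per clip
height = width = 512
raw_bytes = frames * height * width * 3  # 3 bytes per RGB pixel
print(f"{frames} frames, {raw_bytes / 2**20:.0f} MiB uncompressed")
# -> 64 frames, 48 MiB uncompressed
```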

Emu Video also offers additional features, such as animating a user's own image from a simple text prompt, again outperforming prior work.

Meta's research in generative AI image editing and video generation is still ongoing, but the team emphasizes that the technology has many exciting applications. For example, it could let users instantly create their own animated stickers and GIFs instead of searching for existing ones that fit what they want to say. It could also let people edit their own photos without complex tools like Photoshop.

The company adds that its latest models are unlikely to replace professional artists and animators in the short term. Instead, their potential lies in helping people express themselves in new ways.