Apple has released a new open-source AI model called "MGIE" that can edit images based on natural language instructions. MGIE, short for MLLM-Guided Image Editing, utilizes a multimodal large language model (MLLM) to interpret user commands and perform pixel-level operations. The model can handle various editing aspects such as Photoshop-style modifications, global photo optimization, and local editing.
MGIE is a collaboration between Apple and researchers from the University of California, Santa Barbara. The model was presented at the 2024 International Conference on Learning Representations (ICLR), one of the top conferences in artificial intelligence research. The paper showed that MGIE improves results on both automatic metrics and human evaluations while maintaining competitive inference efficiency.
How does MGIE work?
MGIE leverages an MLLM, a powerful AI model that can process both text and images, to enhance instruction-based image editing. MLLMs have demonstrated remarkable abilities in cross-modal understanding and visually grounded response generation, but they have not been widely applied to image editing tasks.
MGIE integrates the MLLM into the image editing process in two ways. First, it uses the MLLM to derive expressive instructions from user input. These instructions provide clear guidance for the editing process. For example, given the input "make the sky bluer," MGIE can derive the instruction "increase the saturation of the sky region by 20%."
Second, it uses the MLLM to generate a visual imagination, i.e., a latent representation of the desired edit. This representation captures the essence of the edit and guides the pixel-level operations. MGIE adopts a novel end-to-end training scheme that jointly optimizes the instruction derivation, visual imagination, and image editing modules.
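The two-stage flow described above can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical illustration, not Apple's implementation: `MockMLLM` and `apply_edit` are stand-ins for the real multimodal model and diffusion-based editing module, and the "visual imagination" is reduced to a plain list of floats.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EditPlan:
    expressive_instruction: str   # explicit, human-readable editing guidance
    visual_latent: List[float]    # stand-in for the learned "visual imagination"

class MockMLLM:
    """Hypothetical stand-in for the MLLM used by MGIE."""
    def derive(self, image: List[List[float]], instruction: str) -> EditPlan:
        # A real MLLM would ground the instruction in the image content;
        # here we just rewrite the input into a more explicit form.
        expressive = f"({instruction}) -> increase saturation of the relevant region"
        latent = [0.1, 0.2, 0.3]  # placeholder conditioning vector
        return EditPlan(expressive, latent)

def apply_edit(image: List[List[float]], plan: EditPlan) -> List[List[float]]:
    """Stub for the pixel-level editing module conditioned on the latent.
    A real system would run a diffusion model guided by plan.visual_latent."""
    return [[min(1.0, px + plan.visual_latent[0]) for px in row] for row in image]

image = [[0.5, 0.6], [0.7, 0.8]]          # toy grayscale "image" in [0, 1]
plan = MockMLLM().derive(image, "make the sky bluer")
edited = apply_edit(image, plan)
```

The point of the split is that the expressive instruction is inspectable by the user, while the latent conditions the editor; in MGIE both are optimized jointly end to end.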
What can MGIE do?
MGIE can handle various editing scenarios, from simple color adjustments to complex object manipulations. The model can also perform global and local edits based on user preferences. Some features and functionalities of MGIE include:
Expressive instruction-based editing: MGIE derives concise, explicit instructions that effectively guide the editing process. This improves both editing quality and the overall user experience.
Photoshop-style modifications: MGIE can execute common Photoshop-style edits such as cropping, resizing, rotating, flipping, and adding filters. The model can also apply more advanced edits like changing backgrounds, adding or removing objects, and blending images.
Global photo optimization: MGIE can optimize the overall quality of photos, including brightness, contrast, sharpness, and color balance. The model can also apply artistic effects like sketching, painting, and comic styles.
Local editing: MGIE can edit specific regions or objects in an image, such as faces, eyes, hair, clothing, and accessories. The model can modify attributes of these regions or objects, such as shape, size, color, texture, and style.
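To make "global photo optimization" concrete, here is a minimal, generic brightness/contrast adjustment in plain Python. It illustrates the kind of whole-image, pixel-level operation MGIE automates from a natural language instruction; it is not code from the MGIE repository.

```python
from typing import List

def adjust(image: List[List[float]], brightness: float = 0.0,
           contrast: float = 1.0) -> List[List[float]]:
    """Global adjustment: out = (px - 0.5) * contrast + 0.5 + brightness.
    Pixels are floats in [0, 1]; results are clamped back into that range."""
    def clamp(v: float) -> float:
        return max(0.0, min(1.0, v))
    return [[clamp((px - 0.5) * contrast + 0.5 + brightness) for px in row]
            for row in image]

image = [[0.2, 0.5], [0.8, 1.0]]
brighter = adjust(image, brightness=0.1)   # shift every pixel up
punchier = adjust(image, contrast=1.5)     # stretch values away from mid-gray
```

An instruction-based editor maps a request like "brighten this photo a little" onto parameter choices such as `brightness=0.1`, sparing the user from tuning sliders manually.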
How to use MGIE?
MGIE is an open-source project on GitHub, where users can find the code, data, and pretrained models. The project also provides a demo notebook demonstrating how to use MGIE for various editing tasks. Users can also try MGIE online through a web demo on Hugging Face Spaces, a machine learning (ML) project sharing and collaboration platform.
MGIE is designed to be user-friendly and customizable. Users can provide natural language instructions for editing images, and MGIE will generate the edited images along with derived instructions. Users can also provide feedback to MGIE to refine the edits or request different edits. MGIE can also be integrated with other applications or platforms that require image editing capabilities.
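The interaction pattern described above, issuing an instruction, receiving an edited image plus the derived instruction, and refining with follow-ups, can be sketched as a small session object. The class and method names here are illustrative, not the actual API of the MGIE repository, and the "image" is just a string so the flow is easy to trace.

```python
from typing import List, Tuple

class MGIESession:
    """Illustrative session: keeps the current image so that follow-up
    instructions refine earlier edits, as described in the article."""
    def __init__(self, image: str):
        self.image = image
        self.history: List[str] = []  # derived expressive instructions

    def edit(self, instruction: str) -> Tuple[str, str]:
        # Stand-in for: the MLLM derives an expressive instruction, then the
        # editing module produces a new image conditioned on it.
        derived = f"expressive: {instruction}"
        self.history.append(derived)
        self.image = f"{self.image}|{instruction}"
        return self.image, derived

session = MGIESession("photo.png")
img1, d1 = session.edit("make the sky bluer")
img2, d2 = session.edit("remove the person on the left")
```

Keeping the derived instruction alongside each result is what lets the user inspect what the model thought it was asked to do and correct course with the next instruction.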
Why is MGIE important?
MGIE represents a breakthrough in instruction-based image editing, which is a challenging and important task for both artificial intelligence and human creativity. MGIE showcases the potential of using MLLM to enhance image editing and opens up new possibilities for cross-modal interaction and communication.
MGIE is not only a research achievement but also a practical tool applicable to various scenarios. It can help users create, modify, and optimize images for personal or professional purposes, such as social media, e-commerce, education, entertainment, and art. MGIE also enables users to express their ideas and emotions through images, inspiring their creativity.
For Apple, MGIE demonstrates the company's growing strength in AI research and development. In recent years, the consumer technology giant has rapidly expanded its machine learning capabilities, and MGIE may be one of its most impressive showcases of how AI can enhance everyday creative tasks.
While MGIE represents a significant breakthrough, experts note that much work remains to improve multimodal AI systems. However, progress in this field is accelerating rapidly. If the release of MGIE signifies anything, it is that assistive AI of this kind may soon become an indispensable creative tool.