Meta's Transfusion Model Uses a Single Architecture for Text and Image Processing

2024-09-02

Multi-modal models that can handle both text and images are a hot topic in the field of artificial intelligence. However, training these models faces a unique challenge: language models deal with discrete values (words and tokens), while image generation models must handle continuous pixel values.


Current multi-modal models rely on techniques that compromise how faithfully data is represented. In a recent research paper, researchers from Meta and the University of Southern California introduced a new technique called Transfusion, which allows a single model to seamlessly handle both discrete and continuous modalities.


The Challenges of Multi-modal Models


Current approaches to multi-modal modeling come with different trade-offs. Some techniques use separate architectures for language and image processing, often pre-training each component independently. This is the approach taken by models such as LLaVA. These models struggle to learn the complex interactions between modalities, especially when dealing with documents that interleave images and text.


Other techniques quantize images into discrete values, effectively converting them into token sequences similar to text. This is the approach used by Meta's Chameleon, which was introduced earlier this year. While this method lets language models process images, it discards information contained in the continuous pixel values.


Chunting Zhou, Senior Research Scientist at Meta AI and co-author of the paper, previously worked on the Chameleon paper.


"We noticed that quantization methods create an information bottleneck in image representation, where the discrete representation of images is highly compressed, leading to the loss of information from the original images," she told VentureBeat. "At the same time, training a good discrete image tokenizer is also very challenging. So, we posed the question: 'Can we directly use the more natural continuous representation of images when training multi-modal models with discrete text?'"


Transfusion: A Unified Approach to Multi-modal Learning


"Diffusion models and autoregressive models based on next-token prediction are the best methods for generating continuous and discrete data," Zhou said. "This inspired us to develop a new multi-modal approach that combines the best aspects of these two methods in a natural and simple way."


Transfusion is a method for training a single model to handle both discrete and continuous modalities without quantization or separate modules. The core idea is to train one model with two objectives: language modeling for text and diffusion modeling for images.


Transfusion combines these two objectives to train a Transformer model capable of processing and generating both text and images. During training, the model is exposed to both text and image data, and the loss functions for language modeling and diffusion are applied simultaneously.
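
The paper's code is not reproduced in this article, so the following is only a minimal PyTorch sketch of what such a combined objective could look like. The function name `transfusion_loss`, the tensor shapes, and the weighting coefficient `lambda_diff` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def transfusion_loss(text_logits, text_targets, noise_pred, noise_true, lambda_diff=1.0):
    """Hypothetical combined objective: next-token prediction on text positions
    plus a diffusion-style noise-prediction loss on image positions, summed
    with an illustrative weighting coefficient."""
    lm_loss = F.cross_entropy(text_logits, text_targets)  # discrete objective (text)
    diff_loss = F.mse_loss(noise_pred, noise_true)         # continuous objective (images)
    return lm_loss + lambda_diff * diff_loss

# Dummy shapes: 16 text tokens over a 32k vocabulary, 64 image patches of dimension 8
loss = transfusion_loss(
    text_logits=torch.randn(16, 32000),
    text_targets=torch.randint(0, 32000, (16,)),
    noise_pred=torch.randn(64, 8),
    noise_true=torch.randn(64, 8),
)
```

In an actual training step, both terms would be computed from the same Transformer's outputs over a mixed text-and-image sequence, so gradients from both objectives update the same shared parameters.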


"We demonstrate that by training a single model to simultaneously predict discrete text tokens and diffuse continuous images, we can fully integrate these two modalities without loss of information," the researchers wrote.


Transfusion uses a unified architecture and vocabulary to handle mixed-modal inputs. The model includes lightweight modality-specific components that transform text tokens and image patches into appropriate representations before they are processed by the Transformer.
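
The article does not detail these components, so here is a minimal sketch of what such lightweight input layers might look like, assuming a token embedding for text and a simple linear projection for image patches. The class name and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ModalityInputLayers(nn.Module):
    """Illustrative lightweight layers that map discrete text tokens and
    continuous image-patch vectors into the shared Transformer width."""

    def __init__(self, vocab_size=32000, patch_dim=32, d_model=4096):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)  # text: lookup table
        self.patch_projection = nn.Linear(patch_dim, d_model)     # images: linear projection

    def forward(self, text_tokens, image_patches):
        text_states = self.token_embedding(text_tokens)       # (num_text, d_model)
        image_states = self.patch_projection(image_patches)   # (num_patches, d_model)
        # In a real mixed-modal sequence the two streams would be interleaved
        # in document order before entering the shared Transformer.
        return torch.cat([text_states, image_states], dim=0)

layers = ModalityInputLayers()
hidden = layers(torch.randint(0, 32000, (16,)), torch.randn(64, 32))  # (80, 4096)
```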


To better capture the information in image data, Transfusion uses a variational autoencoder (VAE), a neural network that learns to represent complex data, such as images, in a low-dimensional continuous space. In Transfusion, the VAE encodes each 8x8 patch of an image into a vector of continuous values.
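
As a concrete illustration, the snippet below uses an off-the-shelf Stable Diffusion VAE from the diffusers library purely as a stand-in for whatever VAE the model actually uses. Like the setup described above, it downsamples images by a factor of 8, so each latent position summarizes an 8x8 pixel patch as a small vector of continuous values.

```python
import torch
from diffusers import AutoencoderKL

# Stand-in VAE: a public Stable Diffusion autoencoder that downsamples 8x,
# so every latent position corresponds to an 8x8 pixel patch of the input.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

images = torch.randn(1, 3, 256, 256)  # dummy image batch in the VAE's input range
with torch.no_grad():
    latents = vae.encode(images).latent_dist.sample()  # (1, 4, 32, 32)

# Flatten the latent grid into a sequence of continuous vectors that can be
# interleaved with text tokens in the Transformer's input.
b, c, h, w = latents.shape
patch_sequence = latents.permute(0, 2, 3, 1).reshape(b, h * w, c)  # (1, 1024, 4)
```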


"Our main innovation lies in demonstrating that we can use separate losses for shared data and parameters of different modalities - language modeling for text and diffusion for images," the researchers wrote.


Transfusion Outperforms Quantization-based Methods


The researchers trained a 7-billion-parameter model using Transfusion and evaluated it on a range of standard single-modal and cross-modal benchmarks, including text-to-text, text-to-image, and image-to-text tasks. They compared Transfusion against an equivalently sized model based on Chameleon, currently a prominent open-science approach for training native multi-modal models.


In their experiments, Transfusion consistently outperformed Chameleon on all modalities. In text-to-image generation, Transfusion achieved better results at less than one-third of the computational cost of Chameleon. Similarly, in image-to-text generation, Transfusion achieved performance comparable to Chameleon with only 21.8% of the computational resources.


Surprisingly, Transfusion also performed better on text-only benchmarks, despite using the same language modeling objective as Chameleon. This suggests that training on quantized image tokens may degrade text performance.


"As an alternative, Transfusion offers greater scalability in all aspects compared to commonly used quantization-based multi-modal training methods," Zhou said.


The researchers also ran separate experiments on image generation, comparing Transfusion with other image generation models. Transfusion outperformed popular models such as DALL-E 2 and Stable Diffusion XL while also being able to generate text.


"Transfusion opens up many new opportunities and interesting applications for multi-modal learning," Zhou said. "By working like large language models (LLMs) but handling multi-modal data, Transfusion could unlock new applications with better controllability in user-input interactive sessions, such as interactive editing of images and videos."