"Synthetic Image Sets New Standard for AI Training Efficiency"

2023-11-21

Researchers at MIT have studied the potential of using synthetic images generated by text-to-image models to learn visual representations. They are the first team to demonstrate that, at large scale, models trained solely on synthetic images can outperform their counterparts trained on real images.

Data is the soil, and MIT researchers have planted more than just pixels in this fertile new ground. By training machine learning models on synthetic images, the team has recently surpassed the results obtained with traditional "real image" training methods.

The core of this approach is a system called StableRep, which generates synthetic images using text-to-image models such as Stable Diffusion. It's like creating a world with words.

So what's the secret of StableRep? It's a strategy called "multi-positive contrastive learning."

Lijie Fan, an MIT doctoral student in electrical engineering and a research affiliate at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), who led the work, said, "We teach the model to understand higher-level concepts through context and variation, rather than just providing it with data." The work is posted on the arXiv preprint server.

"When multiple images generated from the same text are considered as depictions of the same underlying thing, the model delves deeper into the concepts behind the images, such as objects, rather than just their pixels."

This method treats the multiple images generated from the same text prompt as positive pairs, providing additional information during training: it not only increases diversity but also tells the visual system which images are alike and which differ. Notably, StableRep surpasses top models such as SimCLR and CLIP that were trained on large-scale real-image datasets.
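In code terms, a minimal PyTorch sketch of such a multi-positive objective might look like the following. The function name, temperature, and batching convention here are illustrative assumptions, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(img_emb, caption_ids, temperature=0.1):
    """Treat every image generated from the same caption as a positive.

    img_emb:     (N, D) L2-normalized image embeddings
    caption_ids: (N,)   id of the caption each image was generated from
    """
    n = img_emb.size(0)
    # Pairwise cosine similarities, scaled by temperature.
    logits = img_emb @ img_emb.t() / temperature

    # An image should never be compared against itself: suppress the
    # diagonal with a large negative value before the softmax.
    self_mask = torch.eye(n, dtype=torch.bool, device=img_emb.device)
    logits = logits.masked_fill(self_mask, -1e9)

    # Ground-truth distribution: uniform over the *other* images that
    # share the same caption.
    positives = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~self_mask
    target = positives.float() / positives.sum(dim=1, keepdim=True).clamp(min=1)

    # Cross-entropy between that target distribution and the softmax
    # over all other images in the batch.
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# Usage: e.g. 4 captions x 3 images each ->
# caption_ids = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
```

Compared with a standard single-positive contrastive loss, each image here gets several positives at once, which is what lets the model focus on the shared concept behind a caption rather than on any one rendering of it.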

"StableRep not only helps alleviate the challenges of data acquisition in machine learning but also takes a step towards a new era of AI training techniques. The ability to generate high-quality, diverse synthetic images on demand can help reduce the required costs and resources," said Fan.

The process of data collection has never been simple. In the 1990s, researchers had to photograph objects and faces by hand to assemble datasets. In the early 2000s, they turned to scraping the internet. However, this raw, uncurated data often diverges from real-world scenes and reflects social biases, presenting a distorted view of reality.

Manually cleaning such datasets is not only expensive but also extremely difficult. Now imagine if this laborious collection process could be reduced to simply issuing commands in natural language.

A key aspect of StableRep's success is tuning the "guidance scale" of the generative model, which strikes a delicate balance between the diversity and the fidelity of the synthetic images. When finely tuned, the synthetic images used to train these self-supervised models proved to be as effective as, if not more effective than, real images.
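As an illustration, in the Hugging Face diffusers library this trade-off is exposed through the guidance_scale argument of the text-to-image pipeline. The checkpoint, prompt, and values below are illustrative choices, not the paper's exact setting:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a public Stable Diffusion checkpoint (illustrative choice).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a golden retriever playing in the snow"

# Low guidance -> more diverse but less prompt-faithful images;
# high guidance -> more faithful but less varied images.
# StableRep's finding is that an intermediate setting works best
# for representation learning.
for scale in (2.0, 8.0):
    images = pipe(prompt, num_images_per_prompt=4, guidance_scale=scale).images
```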

To go further, the researchers added language supervision, creating an enhanced variant: StableRep+. Trained on 20 million synthetic images, StableRep+ not only achieved higher accuracy but also showed remarkable efficiency compared to a CLIP model trained on an astonishing 50 million real images.
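Conceptually, adding language supervision amounts to combining the multi-positive image loss sketched above with a CLIP-style image-text contrastive term. The symmetric loss below is the standard CLIP objective; the equal weighting of the two terms is an assumption for illustration, not the paper's reported configuration:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss (the standard CLIP objective).

    image_emb, text_emb: (N, D) L2-normalized embeddings; row i of each
    is an image-caption pair.
    """
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(image_emb.size(0), device=image_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

# Illustrative combination (equal weighting is an assumption):
# total = multi_positive_contrastive_loss(img_emb, caption_ids) \
#         + clip_style_loss(img_emb, txt_emb)
```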

However, the future path is not without obstacles. The researchers candidly pointed out several limitations, including the slow speed of image generation, semantic mismatch between text prompts and resulting images, amplification of potential biases, and the complexity of image ownership, all of which must be addressed in future advancements.

Another caveat is that the generative model underlying StableRep must itself first be trained on large-scale real data. The team acknowledges that starting from real data remains a necessity; once a good generative model exists, however, it can be repurposed for new tasks such as training recognition models and learning visual representations.

While StableRep offers a good solution by reducing the reliance on massive collections of real images, it raises concerns about hidden biases in the uncurated data used to train these text-to-image models. The careful choice of text prompts during image synthesis is therefore crucial, "highlighting the key role of thoughtful text selection or necessary human curation," said Fan.

"With the latest text-to-image models, we have gained unprecedented control over image generation, allowing for diverse visual effects from a single text input. This surpasses image collection in the real world in terms of efficiency and diversity. It is particularly useful in specific tasks, such as balancing image diversity in long-tail recognition, and demonstrates a practical complement to training with real images," said Fan.

"Our work marks a step forward in visual learning, towards cost-effective training alternatives, while highlighting the need for continuous improvement in data quality and synthesis," Fan added.

"Being able to generate data that is useful for training discriminative models has always been the dream of generative model learning," said David Fleet, a researcher at Google DeepMind and a professor of computer science at the University of Toronto, who was not involved in the writing of this paper.

"While we have seen some signs of it, this dream has always been elusive, especially in large-scale complex domains like high-resolution images. This paper provides the first compelling evidence that this dream is becoming a reality."