Google DeepMind's recent research reveals hidden issues in how the performance of text-to-image AI models is evaluated. In a study published on the preprint server arXiv, the researchers introduce Gecko, a new benchmark and evaluation method aimed at providing a more comprehensive and reliable standard for this emerging technology.
"Although text-to-image generation models have become very popular, the images they generate may not fully align with the given textual descriptions," warns the DeepMind team in their paper titled "Re-evaluating Text-to-Image Evaluation with Gecko: Metrics, Prompts, and Human Ratings."
They point out that the current datasets and automated evaluation metrics used to assess the capabilities of models like DALL-E, Midjourney, and Stable Diffusion are not comprehensive. Small-scale human evaluations provide limited information, while automated evaluation metrics may overlook important details and even contradict human reviewers' perspectives.
Introduction to Gecko: A New Benchmark for Text-to-Image Models
To uncover these issues, the researchers developed Gecko, an evaluation suite that raises the bar for text-to-image models. Gecko challenges these models with 2,000 textual descriptions covering a wide range of skills and levels of complexity, and it categorizes each description into specific sub-skills, going beyond vague categories to pinpoint the specific bottlenecks holding models back.
"This skill-based benchmark categorizes descriptions into sub-skills, allowing practitioners to not only identify which skills are challenging but also determine the level of complexity at which these skills become challenging," explains co-lead author Olivia Wiles.
A More Accurate Portrait of AI Capabilities
The researchers also collected over 100,000 human ratings of images generated by several leading models using Gecko prompts. By gathering this unprecedented volume of feedback across different models and human annotation templates, the benchmark can distinguish whether performance differences stem from genuine model limitations, from ambiguity in the descriptions, or from inconsistencies in the evaluation method.
"We collected over 100,000 annotations from humans for four prompts and four text-to-image models. This allows us to understand whether differences are due to the inherent ambiguity of the descriptions or due to differences in evaluation metrics and model quality," emphasizes the study.
Finally, Gecko incorporates an improved question-answering-based automated evaluation metric that aligns more closely with human judgments than existing metrics. When the benchmark and metric are used together to compare state-of-the-art models, they reveal previously undetected differences in the models' strengths and weaknesses.
"We introduce a new QA-based automated evaluation metric that correlates better with human ratings on our new dataset, different human prompts, and existing metrics on TIFA160," the paper states. Overall, DeepMind's own Muse model performs exceptionally well when subjected to the Gecko test.
The researchers hope that their work will demonstrate the importance of using diverse benchmarks and evaluation methods to truly understand what text-to-image AI can and cannot do before deploying it in the real world. They plan to release the code and data of Gecko for free to promote further progress in this field.
"Our research shows that the choice of dataset and evaluation metrics has a significant impact on performance perception," says Wiles. "We hope that Gecko can more accurately evaluate and diagnose model capabilities in the future."
So, however impressive AI-generated images may appear, rigorous testing is still needed to separate genuine capability from convincing imitation. Gecko gives us a way to do exactly that.