Accuracy of Image Recognition: A Potential Challenge for Artificial Intelligence

2023-12-18

Imagine you are scrolling through photos on your phone and suddenly come across an image that you can't immediately recognize. It might be something on a sofa, but is it a pillow or a jacket? After a few seconds, it dawns on you - of course! That fluffy blob is your friend's cat, Mocha. Some photos you can understand in an instant, but why are certain cat photos so much more challenging?


Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) were surprised to discover that the concept of image difficulty, despite being crucial in fields ranging from healthcare to transportation to consumer devices, has been largely overlooked. While datasets have been the main driving force behind advances in deep learning-based AI, our understanding of how data propels large-scale deep learning progress remains limited.


In real-world applications that require understanding visual data, models may perform well on current datasets, yet humans still outperform them on object recognition tasks, including on benchmarks explicitly designed to challenge machines with debiased images or distribution shifts. This problem persists partly because we have no guidance on the absolute difficulty of an image or dataset. Without controlling for the difficulty of the images used in evaluations, it is hard to objectively assess whether model capabilities have actually improved.


To address this knowledge gap, David Mayo, an MIT doctoral student in electrical engineering and computer science and a CSAIL member, delved into the world of image datasets to explore why certain images are harder than others for humans and machines to recognize. "Some images inherently take longer to recognize, and it is essential to understand the brain's activity during this process and its relationship to machine learning models. Perhaps our current models are missing complex neural circuits or unique mechanisms that only become visible when they are tested with challenging visual stimuli. This exploration is crucial for understanding and enhancing machine vision models," said Mayo.


This led to the development of a new metric called "minimum viewing time" (MVT), which quantifies the difficulty of recognizing an image by how long a person needs to view it before identifying it correctly. Using a subset of ImageNet, a popular dataset in machine learning, and ObjectNet, a dataset designed to test object recognition robustness, the team showed participants images for durations ranging from as short as 17 milliseconds to as long as 10 seconds and asked them to select the correct object from a set of 50 options. After more than 200,000 image presentation trials, the team found that existing test sets, including ObjectNet, skew toward easier images with shorter MVTs, and that most benchmark performance is driven by images that are easy for humans to recognize.
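To make the idea concrete, here is a minimal sketch of how an MVT-style score could be derived from such trial data. It is an illustration only, not the team's released tool: the trial format, the set of presentation durations, and the 50 percent accuracy threshold are all assumptions.

```python
from collections import defaultdict

# Assumed presentation durations in milliseconds, spanning 17 ms to 10 s.
DURATIONS_MS = [17, 50, 100, 150, 250, 500, 1000, 10000]

def minimum_viewing_time(trials, threshold=0.5):
    """Estimate an MVT per image from (image_id, duration_ms, correct) trials.

    MVT is taken here as the shortest presentation duration at which the
    fraction of correct responses reaches `threshold`; images that are never
    recognized reliably get None. The threshold is an illustrative choice.
    """
    # Tally correct and total responses per (image, duration) cell.
    hits = defaultdict(int)
    counts = defaultdict(int)
    for image_id, duration_ms, correct in trials:
        counts[(image_id, duration_ms)] += 1
        hits[(image_id, duration_ms)] += int(correct)

    mvt = {}
    for image_id in {image_id for image_id, _, _ in trials}:
        mvt[image_id] = None
        for duration in sorted(DURATIONS_MS):
            n = counts.get((image_id, duration), 0)
            if n and hits[(image_id, duration)] / n >= threshold:
                mvt[image_id] = duration  # first duration that is "long enough"
                break
    return mvt

# Example with simulated trials: unrecognized at 17 ms, recognized at 150 ms.
print(minimum_viewing_time([("cat_001", 17, False),
                            ("cat_001", 150, True),
                            ("cat_001", 150, True)]))
```

Under this reading, an image with a short MVT is one that almost everyone names correctly after a brief flash, while a long MVT flags an image that needs sustained viewing.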


The project identified interesting trends in model performance, particularly those related to scalability. Larger models showed significant improvements on simple images but made less progress on more challenging ones. Models that combined language and vision, such as the CLIP model, stood out as they moved towards more human-like recognition.


"Traditionally, object recognition datasets tend to have less complex images, which leads to inflated model performance metrics that do not truly reflect the model's robustness or its ability to handle complex visual tasks. Our research reveals that more difficult images present more daunting challenges, resulting in a distribution shift that is often not considered in standard evaluations," said Mayo. "We have released a collection of difficulty-labeled image sets and an automated tool for calculating MVT, enabling the addition of MVT to existing benchmarks and its extension to various applications. This includes measuring the difficulty of test sets before deploying real-world systems, discovering neural correlates of image difficulty, and advancing object recognition technology to bridge the gap between benchmarks and real-world performance."


"My biggest takeaway is that we now have another dimension to evaluate models. We want models that can recognize any image, even those that are difficult for humans to identify. Our results show that even state-of-the-art techniques today cannot achieve this, and our current evaluation methods lack the ability to tell us when we can achieve this goal because standard datasets are overly biased towards simple images," said Jesse Cummings, a graduate student in electrical engineering and computer science at MIT and the first author of the paper.


From ObjectNet to MVT


A few years ago, the team behind this project identified a major challenge in the field of machine learning: models struggle with images that lie outside their training distribution or are poorly represented in the training data. This led to the creation of ObjectNet, a dataset of images collected from real-life environments. The dataset highlights the performance gap between machine learning models and human recognition abilities by eliminating incidental correlations present in other benchmarks, such as correlations between objects and their backgrounds. ObjectNet revealed the gap between how machine vision models perform on datasets and how they perform in real-world applications. Many researchers and developers have adopted it, and it has driven improvements in model performance.


Fast forward to the present, and the team has taken the research a step further with MVT. Unlike traditional approaches that report a single accuracy number, this method evaluates models by comparing how they perform on the easiest and the most difficult images. The research further explores how image difficulty can be interpreted and tested in relation to human visual processing. Using metrics such as c-score, prediction depth, and adversarial robustness, the team found that networks handle harder images differently. "While some observable trends exist, such as more recognizable images being more typical, a comprehensive semantic explanation of image difficulty still eludes the scientific community," said Mayo.
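To show how difficulty labels change the picture, the sketch below bins test images by an assumed MVT label and reports accuracy per bin rather than a single aggregate score. The bin edges, function name, and example data are illustrative assumptions, not the paper's released tooling.

```python
def accuracy_by_difficulty(predictions, labels, mvt_ms,
                           bins=((0, 150), (150, 1000), (1000, float("inf")))):
    """Report accuracy stratified by MVT bins instead of one aggregate number.

    `predictions`, `labels`, and `mvt_ms` are parallel lists; the bin edges
    (in milliseconds) are illustrative choices, not values from the paper.
    """
    report = {}
    for lo, hi in bins:
        idx = [i for i, t in enumerate(mvt_ms) if lo <= t < hi]
        if not idx:
            report[(lo, hi)] = None  # no images fall in this difficulty bin
            continue
        correct = sum(predictions[i] == labels[i] for i in idx)
        report[(lo, hi)] = correct / len(idx)
    return report

# Example: a model that gets the easy images right but fails the hardest one.
preds = ["cat", "mug", "chair", "lamp"]
truth = ["cat", "mug", "chair", "sock"]
mvts  = [17, 100, 500, 2000]
print(accuracy_by_difficulty(preds, truth, mvts))
```

A model evaluated this way can look strong overall while its accuracy on the long-MVT bin collapses, which is exactly the kind of gap a single headline number hides.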


For example, in the healthcare field, understanding the complexity of visual information becomes particularly important. The ability of AI models to interpret medical images, such as X-rays, depends on the diversity and difficulty distribution of those images. The researchers advocate for detailed analysis of difficulty distributions tailored to each professional domain, so that AI systems are evaluated against expert standards rather than layperson interpretations.


Towards Human-Level Performance


Looking ahead, the researchers are not only focused on methods to enhance AI's ability to predict image difficulty. The team is also investigating correlations with viewing-time difficulty in order to generate harder or easier versions of images.


While this research has made significant progress, the researchers acknowledge limitations, particularly in separating object recognition from visual search. The current approach focuses on recognizing objects and leaves out the complexity introduced by cluttered scenes.


"This comprehensive approach addresses the long-standing challenge of objectively evaluating object recognition towards human-level performance and opens up new avenues for understanding and advancing the field," said Mayo. "This work has the potential to adapt the Minimum Viewing Time metric to various visual tasks, paving the way for more robust and human-like performance in object recognition, ensuring models are truly put to the test, and preparing for understanding the visual complexity of the real world."