UniBench: A Comprehensive Framework for Evaluating VLMs

2024-08-19

Vision-language models (VLMs) have attracted considerable attention for their ability to handle a wide variety of multimodal tasks. However, the rapid proliferation of benchmarks used to evaluate these models has left the evaluation landscape complex and fragmented, creating several challenges for researchers. Implementing dozens of benchmarks is time-consuming and labor-intensive, and interpreting results across many evaluation metrics is difficult. Running every available benchmark also demands considerable computational resources, so many researchers evaluate new models on only a subset of benchmarks. This selective approach produces a biased picture of model performance and complicates comparisons between different VLMs. A standardized evaluation framework is therefore needed to draw meaningful conclusions about the most effective strategies for advancing VLM technology. Ultimately, the field needs a more streamlined and comprehensive approach to benchmarking these models.


Researchers from Meta FAIR, Univ Gustave Eiffel, CNRS, LIGM, and Brown University have introduced UniBench, a comprehensive framework that addresses the challenges of evaluating VLMs. This unified platform implements 53 diverse benchmarks in a user-friendly codebase, covering a wide range of capabilities including object recognition, spatial understanding, counting, and domain-specific applications such as medical and satellite imagery. UniBench organizes these benchmarks into seven broad categories and seventeen finer-grained capabilities, allowing researchers to quickly identify a model's strengths and weaknesses in a standardized manner.
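To make the taxonomy idea concrete, below is a minimal Python sketch of how per-benchmark scores could be rolled up into category-level summaries of the kind such a framework reports. This is not UniBench's actual API; the benchmark names and category labels are illustrative examples drawn from this article.

```python
# Illustrative sketch only -- not UniBench's actual API. Benchmark and
# category names are examples taken from the article, not the real taxonomy.
from collections import defaultdict
from statistics import mean

# Map each benchmark to a coarse capability category (illustrative).
BENCHMARK_CATEGORIES = {
    "imagenet": "object recognition",
    "countbench": "counting",
    "winoground": "relations",
    "clevr": "reasoning",
    "pcam": "medical",
}

def summarize(per_benchmark_accuracy: dict[str, float]) -> dict[str, float]:
    """Aggregate per-benchmark accuracies into category-level scores,
    making a model's strengths and weaknesses easier to spot."""
    by_category = defaultdict(list)
    for benchmark, accuracy in per_benchmark_accuracy.items():
        category = BENCHMARK_CATEGORIES.get(benchmark, "other")
        by_category[category].append(accuracy)
    return {category: mean(scores) for category, scores in by_category.items()}

print(summarize({"imagenet": 0.75, "winoground": 0.27, "countbench": 0.33}))
```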


The utility of UniBench has been validated by evaluating nearly 60 publicly available VLMs spanning different architectures, model sizes, training dataset scales, and learning objectives. This systematic comparison across axes of progress reveals that while scaling up model size and training data significantly improves performance in many domains, the benefits are limited for visual relationships and reasoning tasks. UniBench also shows that numerical understanding remains challenging even for state-of-the-art VLMs.


To facilitate practical use, UniBench provides a distilled set of representative benchmarks that can be run quickly on standard hardware. This combination of breadth and efficiency aims to simplify VLM evaluation and to make comparisons, and insights into effective strategies for VLM research, more meaningful.


UniBench demonstrates its utility through a comprehensive evaluation of 59 publicly available VLMs across 53 benchmarks, organized into seven categories and seventeen capabilities. This systematic evaluation yields several key insights into VLM performance and the areas where it falls short.


Key insights revealed by the evaluation include:


- Significant performance differences across tasks: While VLMs excel in many domains, they perform close to or below chance on certain benchmarks, such as Winoground, iNaturalist, dSprites, SmallNORB, DMLab, CLEVR, PCam, Rendered SST-2, and KITTI.


- Limitations of scaling: Increasing model size and training dataset size significantly improves performance in many domains, particularly object recognition and robustness. However, the benefits of this scaling approach are minimal for visual relationships and reasoning tasks.


- Surprising weaknesses: VLMs perform poorly on traditionally simple tasks such as MNIST digit recognition. Even the top five VLMs by accuracy reach only around 90% on MNIST, whereas a basic two-layer MLP (multilayer perceptron) can achieve 99% (see the sketch after this list).


- Counting and numerical tasks: VLMs consistently show weak numerical understanding across multiple benchmarks, including SVHN, CountBench, and CLEVR Count.


- Data quality over quantity: Models trained on 2 billion high-quality samples outperform models trained on larger datasets, highlighting the importance of data curation.


- Customized objectives: Models with specific learning objectives, such as NegCLIP, outperform large-scale models in relation understanding tasks.

- Model recommendations: For general-purpose use, large ViT encoders like Eva-2 ViT-E/14 exhibit the best overall performance. For specific tasks like relationships or counting, specialized models like NegCLIP are recommended.
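For context on the MNIST comparison in the third bullet, here is a minimal PyTorch sketch of the kind of two-layer MLP baseline the article refers to. The architecture (784 -> 256 -> 10) and hyperparameters are our own illustrative choices, not values from the UniBench paper, but a model of this size typically reaches roughly 98-99% test accuracy after a few epochs.

```python
# Minimal two-layer MLP baseline for MNIST (illustrative; hyperparameters
# are our own choices, not taken from the UniBench paper).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=256)

# Two fully connected layers: 784 -> 256 -> 10.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):  # a handful of epochs is enough to approach ~99%
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()

model.eval()
with torch.no_grad():
    correct = sum((model(x).argmax(dim=1) == y).sum().item() for x, y in test_loader)
print(f"test accuracy: {correct / len(test_set):.4f}")
```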


UniBench addresses the computational burden of comprehensive VLM evaluation by distilling its 53 benchmarks into representative subsets, striking a balance between coverage and efficiency. Running the full suite would require processing over 6 million images, taking more than 2 hours on an A100 GPU. And while ImageNet performance correlates with results on many benchmarks, it is representative of only a small fraction of the other 18 benchmarks, highlighting the importance of diverse metrics. UniBench's streamlined set is selected to represent the key axes of progress and can be run on a single A100 GPU in just 5 minutes for a ViT-B/32 model. This efficient workflow offers a practical path to rapid yet comprehensive VLM evaluation, enabling researchers and practitioners to gain meaningful insights quickly.
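For a sense of what running one of these benchmarks involves, here is a simplified sketch of the zero-shot classification protocol typically used to evaluate CLIP-style VLMs, written with the open_clip library and CIFAR-10 as a stand-in dataset. This is not UniBench's actual code; the model name, pretrained tag, and prompt template are illustrative assumptions.

```python
# Simplified zero-shot evaluation loop for a CLIP-style VLM (illustrative;
# not UniBench's code). Uses open_clip and CIFAR-10 as a stand-in benchmark.
import torch
import open_clip
from torch.utils.data import DataLoader
from torchvision import datasets

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")  # pretrained tag is an assumption
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

dataset = datasets.CIFAR10("data", train=False, download=True, transform=preprocess)
loader = DataLoader(dataset, batch_size=256)

# Encode one text prompt per class; the template is a common illustrative choice.
prompts = tokenizer([f"a photo of a {name}" for name in dataset.classes]).to(device)
with torch.no_grad():
    text_features = model.encode_text(prompts)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    correct = 0
    for images, labels in loader:
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        # Predict the class whose prompt embedding is most similar to the image.
        preds = (image_features @ text_features.T).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()

print(f"zero-shot accuracy: {correct / len(dataset):.4f}")
```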