Meta's New AI Model Benchmark May Be Misleading

2025-04-07

Recently, Meta unveiled its new flagship AI model, Maverick, which ranked second on LM Arena. LM Arena evaluates models by having human raters compare two models' outputs side by side and vote for the better one. However, there are indications that the version of Maverick deployed on LM Arena differs from the one widely available to developers.
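As background, LM Arena's published methodology aggregates these pairwise human votes into Elo-style ratings to produce its leaderboard. The sketch below is only an illustration of that general idea, not LM Arena's actual code; the model names, votes, and K-factor are hypothetical.

```python
# Illustrative sketch: turning pairwise human votes into an Elo-style leaderboard,
# the general approach used by crowd-sourced arenas. All data here is made up.

from collections import defaultdict

K = 32  # update step size; a typical Elo K-factor, assumed for illustration


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings: dict, winner: str, loser: str) -> None:
    """Shift ratings after one human vote: the winner gains, the loser loses."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = K * (1.0 - exp_win)
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical votes: (model the human preferred, model it was compared against).
votes = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
    ("model-a", "model-b"),
]

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same baseline
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Print the resulting leaderboard, highest rating first.
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.1f}")
```

The key point for this story is that such a ranking reflects only the specific model checkpoint that was serving votes in the arena, which is why a mismatch between the tested and released versions matters.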

In its announcement, Meta noted that the version of Maverick on LM Arena is an "experimental chat variant." Additionally, a chart on the official Llama website shows that Meta's LM Arena testing used a version described as "Llama 4 Maverick optimized for conversational abilities."

LM Arena's reliability as a measure of AI model performance has been questioned before. Even so, AI companies typically do not customize or fine-tune their models specifically to score better on LM Arena, or at least they do not openly admit to doing so.

The concern arises when a model is tailored to a specific benchmark while only a non-optimized "base" version is released: developers can no longer predict how the model will actually perform in real-world applications. The practice is misleading and undermines the basic purpose of benchmarks, which, for all their limitations, are meant to give an objective picture of a model's strengths and weaknesses across a range of tasks.

Researchers have observed clear differences between the LM Arena version of Maverick and the publicly downloadable one: the Arena version uses far more emojis and gives noticeably more verbose responses.

An official response has been requested from both Meta and the maintainers of Chatbot Arena, the group behind LM Arena.