In a recent announcement, Meta introduced two new models in the Llama 4 series: the compact model Scout and the mid-sized model Maverick. According to Meta, Maverick surpasses GPT-4o and Gemini 2.0 Flash in several widely reported benchmarks.
Maverick quickly climbed to second place on the AI benchmarking site LMArena. This platform allows users to compare outputs from different systems and vote for their preferred results. In its press release, Meta highlighted that Maverick achieved an Elo rating of 1417, surpassing OpenAI's GPT-4o and trailing only Gemini 2.5 Pro. A higher Elo rating indicates more victories against competing models in head-to-head comparisons.
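For context, here is a minimal sketch of how an Elo-style rating update works after a single head-to-head vote. The K-factor and the ratings used below are illustrative examples, not LMArena's actual parameters or results.

```python
# Illustrative Elo-style update; the K-factor and ratings are
# made-up examples, not LMArena's actual parameters.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float,
               a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return both models' updated ratings after one comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: a 1417-rated model beats a 1400-rated model in one vote,
# so its rating rises slightly and its opponent's falls by the same amount.
print(update_elo(1417, 1400, a_won=True))
```

Because a win against a lower-rated opponent yields only a small gain, a rating as high as 1417 implies a long run of victories across many such comparisons.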
This achievement seemingly positions Meta’s open-source Llama 4 as a strong contender against the most advanced proprietary models from OpenAI, Anthropic, and Google. However, upon closer inspection of Meta’s documentation, AI researchers have uncovered some irregularities.
Meta admitted in the fine print of its documentation that the version of Maverick tested on LMArena was not identical to the one released to the public. According to Meta's own information, it deployed an "experimental chat-optimized version" specifically tailored for conversational ability on LMArena. TechCrunch was the first to report the discrepancy.
"Meta's interpretation of our policies does not align with the behavior we expect from model providers," LMArena posted on social media platform X two days after the model's release. "Meta should have clearly indicated that 'Llama-4-Maverick-03-26-Experimental' is a customized model optimized for human preferences. Therefore, we will update our leaderboard policy to strengthen our commitment to fair and reproducible evaluations and prevent such confusion in the future."
Ashley Gabrielle, a spokesperson for Meta, said in an emailed statement: "We have experimented with various types of customized variants. 'Llama-4-Maverick-03-26-Experimental' is a chat-optimized version we tried, and it performed exceptionally well on LMArena."