Did OpenAI Cheat on Its Big Math Test?

2025-01-25

OpenAI launched the o3 model in December of last year, heavily promoting its outstanding performance across a range of benchmarks. At the time, some commentators went so far as to hail it as nearly AGI (Artificial General Intelligence), capable of matching human performance on whatever task a user throws at it.

Money, however, can change everything, even a math exam. When OpenAI triumphed on the notoriously difficult FrontierMath benchmark with an astonishing score of 25.2%, an awkward fact soon emerged: OpenAI had not merely taken the test, it had helped create it.

What first drew attention was an updated footnote in Epoch AI's FrontierMath white paper thanking OpenAI for its support in creating the benchmark. Worse still, OpenAI had not only funded the development of FrontierMath but had also commissioned questions and solutions to its own specifications: Epoch AI later disclosed that OpenAI had hired it to produce 300 math problems along with their solutions.

Epoch AI stated on Thursday that, as is conventional for commissioned work, OpenAI retained ownership of the questions and had access to both the problems and their answers. Although Epoch said OpenAI had agreed in advance not to use the material to train the o3 model, experts pointed out that mere access to the test set could still inflate a model's performance through iterative tuning: developers can keep adjusting prompts, checkpoints, and settings until the scores on the known problems improve.
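A toy simulation makes the point, purely as an illustration and not a claim about OpenAI's actual process: if fifty equally skilled model variants are scored against the same fixed test set and only the best result is reported, the reported number drifts above the true skill through luck alone.

```python
import random

random.seed(0)

N_TESTS, N_VARIANTS, TRUE_SKILL = 300, 50, 0.15

def observed_score() -> float:
    # Every variant has the same true skill; observed scores differ only by luck.
    return sum(random.random() < TRUE_SKILL for _ in range(N_TESTS)) / N_TESTS

# Measure many variants (prompts, checkpoints, decoding settings) against
# the same fixed test set, then report only the best run.
scores = [observed_score() for _ in range(N_VARIANTS)]
print(f"true skill:      {TRUE_SKILL:.3f}")
print(f"best reported:   {max(scores):.3f}")  # reliably above the true skill
```

No training on the test data occurs here, yet the headline number is optimistically biased, which is exactly the risk the experts describe.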

Tamay Besiroglu, vice president of Epoch AI, revealed that OpenAI had initially asked that the financial relationship between the two parties not be disclosed. He wrote in a post: "Before o3 launched, we were restricted from disclosing the partnership. In hindsight, we should have pushed harder for the ability to be transparent with the benchmark's contributors as soon as possible. Our contract specifically prevented us from disclosing the funding source, and the fact that OpenAI has access to much, but not all, of the dataset."

Tamay added that although OpenAI promised not to use Epoch AI's questions and solutions for training, no legal contract guaranteed it. "We acknowledge that OpenAI does have access to a large fraction of FrontierMath's problems and solutions," he wrote, "but we have a verbal agreement that these materials will not be used in model training."

Suspicious as this arrangement may sound, Elliot Glazer, chief mathematician at Epoch AI, believed OpenAI would keep its word. He posted on Reddit: "My personal opinion is that OpenAI's scores are legitimate (i.e., they did not train on the dataset), and that they had no reason to lie about internal benchmarking performance." He also shared a link on Twitter to an ongoing debate on the LessWrong forum.

The controversy goes beyond OpenAI and points to a systemic problem with how the AI industry verifies progress. A survey by AI researcher Louis Hunt showed that other top-performing models, including Mistral 7B, Google's Gemma, Microsoft's Phi-3, Meta's Llama 3, and Alibaba's Qwen 2.5, could reproduce, word for word, 6,882 pages of content from the MMLU and GSM8K benchmarks. MMLU and GSM8K are widely used benchmarks measuring multi-task language understanding and grade-school math proficiency, respectively.
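The verbatim-reproduction check Hunt describes can be approximated in a few lines. Below is a minimal sketch, assuming the Hugging Face `transformers` and `datasets` libraries and greedy decoding; the model name is just an example, and the 80-character match window is an arbitrary choice:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-v0.1"  # example base model

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def completes_verbatim(text: str, prefix_frac: float = 0.5) -> bool:
    """Feed the model the first half of a benchmark question and check
    whether greedy decoding reproduces the rest word for word."""
    cut = int(len(text) * prefix_frac)
    prefix, rest = text[:cut], text[cut:]
    inputs = tokenizer(prefix, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=len(tokenizer(rest)["input_ids"]),
        do_sample=False,  # greedy: memorized text should come back exactly
    )
    continuation = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return rest.strip()[:80] in continuation  # crude verbatim-match window

# Count how many test questions the model can regurgitate.
gsm8k = load_dataset("gsm8k", "main", split="test")
hits = sum(completes_verbatim(row["question"]) for row in gsm8k.select(range(100)))
print(f"{hits}/100 questions completed verbatim")
```

A clean model should almost never complete an unseen question exactly; a high hit rate is strong evidence the benchmark was in the pre-training data.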

All of this makes it hard to judge how capable these models really are. It is like handing a student with a photographic memory the questions and answers to their next exam: we cannot tell whether they reasoned their way to a solution or simply recited a memorized answer. Since these tests exist precisely to demonstrate the models' reasoning ability, the problem is obvious.

Vasily Morzakov, founder of RemBrain, warned: "This is actually a very big problem. Models are tested in their instruct versions on MMLU and GSM8K, but the fact that the base models can reproduce the tests means those tests were in their pre-training data."

To keep future tests meaningful, Epoch plans to maintain a "holdout set" of 50 randomly selected problems that will not be provided to OpenAI. Even so, building truly independent evaluations remains hard. Computer scientist Dirk Roeckmann argues that an ideal test would require a neutral sandbox environment, which is not easy to achieve, and even then there is still the risk of humans deliberately leaking test data.
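Mechanically, a holdout split is only a few lines of code; the hard part is institutional, not technical. A minimal sketch of the idea, assuming the problems live in a local JSON file (the file name and seed are illustrative, not Epoch's actual setup):

```python
import json
import random

# Load the full problem set; the file name is hypothetical.
with open("frontiermath_problems.json") as f:
    problems = json.load(f)

rng = random.Random(2025)            # fixed seed makes the split auditable
holdout = rng.sample(problems, 50)   # kept private, never sent to the vendor
released = [p for p in problems if p not in holdout]

# If a model was tuned against the released problems, its score on the
# holdout should lag; a large gap between the two subsets is a red flag.
```

The value of the holdout comes entirely from secrecy, which is why Roeckmann's worry about deliberate human leakage cannot be solved with code alone.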