The Arc Prize Foundation, a nonprofit co-founded by renowned AI researcher François Chollet, has introduced a new and demanding test of the general intelligence of leading AI models. The test, named ARC-AGI-2, has so far proven extremely difficult for most AI systems.
According to the Arc Prize leaderboard, so-called "reasoning" AI models like OpenAI’s o1-pro and DeepSeek’s R1 scored between just 1% and 1.3% on ARC-AGI-2. Similarly, powerful non-reasoning models such as GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash achieved scores hovering around 1%.
The ARC-AGI test consists of a series of puzzle-like questions that require AI to identify visual patterns from grids of colored blocks and generate the correct "answer" grid. These questions are crafted to force AI systems to adapt to problems they have never encountered before.
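To make that task format concrete, here is a minimal sketch in the style of the publicly released ARC task files, where each grid is a list of rows of color indices (0–9) and a task pairs a few "train" demonstrations with a held-out "test" input. The grids and the toy "swap the columns" rule below are invented for illustration and are far simpler than real ARC-AGI-2 puzzles.

```python
# Illustrative ARC-style task: grids of color indices with train/test pairs.
# The example grids and the inferred rule are hypothetical, not from ARC-AGI-2.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # the solver must produce the matching output grid
    ],
}

def solve(grid):
    """Toy rule inferred from the demonstrations: mirror each row (swap the columns)."""
    return [list(reversed(row)) for row in grid]

# Check the inferred rule against the demonstrations, then apply it to the test input.
assert all(solve(pair["input"]) == pair["output"] for pair in task["train"])
print(solve(task["test"][0]["input"]))  # -> [[0, 3], [3, 0]]
```

The point of the benchmark is that the transformation rule differs for every task and cannot be memorized in advance; the solver has to infer it from a handful of demonstrations each time.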
To establish a human baseline, the Arc Prize Foundation invited over 400 individuals to participate in the ARC-AGI-2 test. The results showed that this group of participants achieved an average accuracy of 60%, significantly higher than any AI model’s performance.
Chollet stated that compared to its predecessor, ARC-AGI-1, ARC-AGI-2 serves as a better indicator of an AI model's true intelligence. The foundation's tests aim to assess whether AI systems can efficiently acquire new skills beyond their training data.
To address the "brute-force solving" issue seen with ARC-AGI-1, where models could lean heavily on raw computational power to find solutions, ARC-AGI-2 introduces a new efficiency metric and requires models to interpret patterns on the fly rather than rely on memorization.
Greg Kamradt, another co-founder of the Arc Prize Foundation, wrote in a blog post that intelligence is defined not only by the ability to solve problems or achieve high scores, but also by the efficiency with which those abilities are acquired and deployed. The core question isn’t just “Can AI acquire the skills to complete tasks?” but also “At what cost, and how efficiently?”
Notably, ARC-AGI-1 remained unbeaten for nearly five years, until December 2024, when OpenAI released its advanced reasoning model, o3. The o3 model outperformed all other AI systems on ARC-AGI-1 and matched human-level performance, but its success came at a significant computational cost: the first version of o3, o3 (low), which scored 75.7% on ARC-AGI-1, managed only about 4% on ARC-AGI-2 while spending $200 in compute per task.
With the introduction of ARC-AGI-2, many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Thomas Wolf, co-founder of Hugging Face, recently told TechCrunch that the AI industry lacks sufficient tests to evaluate key characteristics of so-called artificial general intelligence, including creativity.
In addition, the Arc Prize Foundation announced its 2025 Arc Prize competition, challenging developers to achieve 85% accuracy on the ARC-AGI-2 test while keeping costs below $0.42 per task.