What are the top AI models? There are many different ranking systems, but the ARC Prize leaderboard is a great place to start.
Sep 18, 2025 Ranking
Table recreated courtesy of ARC Prize, a nonprofit.
This table shows the latest rankings following ARC 1 and 2 tests.
Rank | AI System | Organization | System Type | ARC-AGI-1 | ARC-AGI-2 |
1 | Human Panel | Human | N/A | 98.00% | 100.00% |
2 | J. Berman (2025) | Bespoke | CoT + Synthesis | 79.60% | 29.40% |
3 | E. Pang (2025) | Bespoke | CoT + Synthesis | 77.10% | 26.00% |
4 | Grok 4 (Thinking) | xAI | CoT | 66.70% | 16.00% |
5 | GPT-5 (High) | OpenAI | CoT | 65.70% | 9.90% |
6 | Claude Opus 4 (Thinking 16K) | Anthropic | CoT | 35.70% | 8.60% |
7 | GPT-5 (Medium) | OpenAI | CoT | 56.20% | 7.50% |
8 | o3 (High) | OpenAI | CoT | 60.80% | 6.50% |
9 | o4-mini (High) | OpenAI | CoT | 58.70% | 6.10% |
10 | Claude Sonnet 4 (Thinking 16K) | Anthropic | CoT | 40.00% | 5.90% |
11 | o3-Pro (High) | OpenAI | CoT + Synthesis | 59.30% | 4.90% |
12 | Gemini 2.5 Pro (Thinking 32K) | Google | CoT | 37.00% | 4.90% |
13 | Claude Opus 4 (Thinking 8K) | Anthropic | CoT | 30.70% | 4.50% |
14 | GPT-5 Mini (High) | OpenAI | CoT | 54.30% | 4.40% |
15 | Gemini 2.5 Pro (Thinking 16K) | Google | CoT | 41.00% | 4.00% |
16 | GPT-5 Mini (Medium) | OpenAI | CoT | 37.30% | 4.00% |
17 | o3-preview (Low)* | OpenAI | CoT + Synthesis | 75.70% | 4.00% |
18 | Gemini 2.5 Pro (Preview) | Google | CoT | 33.00% | 3.80% |
19 | Gemini 2.5 Pro (Preview, Thinking 1K) | Google | CoT | 31.30% | 3.40% |
20 | o3-mini (High) | OpenAI | CoT | 34.50% | 3.00% |
See full table here: ARC Leaderboard
How ARC-AGI-1 Works
“ARC-AGI-1 consists of 800 puzzle-like tasks, designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three). This requires the test taker (human or AI) to deduce underlying rules through abstraction, inference, and prior knowledge rather than brute-force or extensive training.”
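To make the task format concrete, here is a toy sketch in Python. The grids and the candidate "rule" below are invented for illustration only; real ARC tasks are far more varied and harder to solve. The idea is the same, though: a few input-output grid pairs are given, and the test taker must find a transformation consistent with all of them, then apply it to a new input.

```python
# Toy illustration of the ARC task format. Grids are small 2-D arrays
# of color codes; a task provides ~3 example input/output pairs.
# Everything here (grids, rule) is a made-up example, not a real task.

def flip_horizontal(grid):
    """Candidate rule: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

# Hypothetical training pairs (around three, as in ARC-AGI-1).
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6]], [[6, 5, 4]]),
    ([[7], [8]], [[7], [8]]),
]

def rule_fits(rule, pairs):
    """Check that the candidate rule explains every training pair."""
    return all(rule(inp) == out for inp, out in pairs)

if rule_fits(flip_horizontal, train_pairs):
    # Apply the inferred rule to a held-out test input.
    test_input = [[9, 0, 1]]
    print(flip_horizontal(test_input))  # [[1, 0, 9]]
```

A real solver cannot hard-code the rule, of course; it must search over or synthesize candidate transformations, which is why the "CoT + Synthesis" entries in the table above pair chain-of-thought reasoning with program synthesis.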
ARC-AGI-2 Explained:
Here’s a direct quote:
“ARC-AGI-1 was created in 2019 (before LLMs even existed). It endured 5 years of global competitions, over 50,000x of AI scaling, and saw little progress until late 2024 with test-time adaptation methods pioneered by ARC Prize 2024 and OpenAI.
ARC-AGI-2 – the next iteration of the benchmark – is designed to stress test the efficiency and capability of state-of-the-art AI reasoning systems, provide useful signal towards AGI, and re-inspire researchers to work on new ideas.
Pure LLMs score 0%, AI reasoning systems score only single-digit percentages, yet extensive testing shows that humans can solve every task.
Can you create a system that can reach 85% accuracy?”
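The percentages in the quote above (and in the leaderboard table) are task-level accuracy: the fraction of tasks whose predicted output grid exactly matches the expected one. A minimal sketch, with invented per-task results:

```python
# Benchmark accuracy = solved tasks / total tasks.
# The results list below is hypothetical, just to show the arithmetic
# behind a figure like the 85% target.

def accuracy(results):
    """Fraction of tasks solved, given a list of True/False outcomes."""
    return sum(results) / len(results)

results = [True] * 17 + [False] * 3   # 17 of 20 hypothetical tasks solved
print(f"{accuracy(results):.0%}")     # 85%
```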
July 10, 2025 Ranking
As of July 10, 2025, Grok 4 is the best AI model, according to ARC Prize's ARC-AGI Leaderboard.

According to their X announcement:
“Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%. This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA.”
-ARC on X
View the full ARC-AGI Leaderboard page for real-time updates.
According to their team:
“ARC-AGI has evolved from its first version (ARC-AGI-1) which measured basic fluid intelligence, to ARC-AGI-2 which challenges systems to demonstrate both high adaptability and high efficiency.
The scatter plot above visualizes the critical relationship between cost-per-task and performance – a key measure of intelligence efficiency. True intelligence isn’t just about solving problems, but solving them efficiently with minimal resources.”