What are the top AI models? There are a lot of different ranking systems, but the ARC Prize leaderboard is a great place to start as a definitive source of LLM rankings. See our post on AI ranking factors for more intel.
October 26, 2025 Ranking
Just eyeballing it, GPT-5 Pro is the current top AI model in production, beaten only by bespoke custom systems and a human panel.
Grok 4 (Thinking) is the #2 production model, with a very solid $2.17 cost per task compared to GPT-5 Pro’s $7.14 for fairly close scores. That makes it a strong candidate for production use, and it’s likely to see more adoption soon.
Claude Sonnet 4.5 (Thinking 32K) is right behind the other two at an astonishingly low $0.76 per task, nearly a third of Grok 4 (Thinking)’s cost. That makes it an even stronger default AI for production.
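To make that cost/performance tradeoff concrete, here’s a minimal Python sketch that ranks the three production models by ARC-AGI-2 points per dollar, using the scores and costs from the table below. The metric itself is our own illustration, not an official ARC Prize statistic.

```python
# Rank production models by ARC-AGI-2 points per dollar.
# Scores and costs come from the leaderboard table below; the
# "points per dollar" metric is illustrative, not an ARC Prize figure.
models = {
    "GPT-5 Pro": {"arc_agi_2": 18.3, "cost_per_task": 7.14},
    "Grok 4 (Thinking)": {"arc_agi_2": 16.0, "cost_per_task": 2.17},
    "Claude Sonnet 4.5 (Thinking 32K)": {"arc_agi_2": 13.6, "cost_per_task": 0.76},
}

# Sort models from most to least score per dollar spent.
ranked = sorted(models.items(),
                key=lambda kv: kv[1]["arc_agi_2"] / kv[1]["cost_per_task"],
                reverse=True)

for name, m in ranked:
    print(f"{name}: {m['arc_agi_2'] / m['cost_per_task']:.1f} points/$")
```

On this metric Claude Sonnet 4.5 (~17.9 points/$) comes out far ahead of Grok 4 (Thinking) (~7.4) and GPT-5 Pro (~2.6), which is why it makes sense as a default despite the lower raw score.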
Here’s a fresh pull of the rankings; the top 20 as of today:
| AI System | Organization | System Type | ARC-AGI-1 | ARC-AGI-2 | Cost/Task |
| --- | --- | --- | --- | --- | --- |
| Human Panel | Human | N/A | 98.00% | 100.00% | $17.00 |
| J. Berman (2025) | Bespoke | CoT + Synthesis | 79.60% | 29.40% | $30.40 |
| E. Pang (2025) | Bespoke | CoT + Synthesis | 77.10% | 26.00% | $3.97 |
| GPT-5 Pro | OpenAI | CoT | 70.20% | 18.30% | $7.14 |
| Grok 4 (Thinking) | xAI | CoT | 66.70% | 16.00% | $2.17 |
| Claude Sonnet 4.5 (Thinking 32K) | Anthropic | CoT | 63.70% | 13.60% | $0.76 |
| GPT-5 (High) | OpenAI | CoT | 65.70% | 9.90% | $0.73 |
| Claude Opus 4 (Thinking 16K) | Anthropic | CoT | 35.70% | 8.60% | $1.93 |
| GPT-5 (Medium) | OpenAI | CoT | 56.20% | 7.50% | $0.45 |
| Claude Sonnet 4.5 (Thinking 8K) | Anthropic | CoT | 46.50% | 6.90% | $0.24 |
| Claude Sonnet 4.5 (Thinking 16K) | Anthropic | CoT | 48.30% | 6.90% | $0.35 |
| o3 (High) | OpenAI | CoT | 60.80% | 6.50% | $0.83 |
| Tiny Recursion Model (TRM) | Bespoke | N/A | 40.00% | 6.30% | $2.10 |
| o4-mini (High) | OpenAI | CoT | 58.70% | 6.10% | $0.86 |
| Claude Sonnet 4 (Thinking 16K) | Anthropic | CoT | 40.00% | 5.90% | $0.49 |
| Claude Sonnet 4.5 (Thinking 1K) | Anthropic | CoT | 31.00% | 5.80% | $0.14 |
| Grok 4 (Fast Reasoning) | xAI | CoT | 48.50% | 5.30% | $0.06 |
| o3-Pro (High) | OpenAI | CoT + Synthesis | 59.30% | 4.90% | $7.55 |
| Gemini 2.5 Pro (Thinking 32K) | Google | CoT | 37.00% | 4.90% | $0.76 |
| Claude Opus 4 (Thinking 8K) | Anthropic | CoT | 30.70% | 4.50% | $1.16 |
View the entire leaderboard at ARC Prize.
September 18, 2025 Ranking
Table recreated courtesy of ARC Prize, a nonprofit.
This table shows the rankings on the ARC-AGI-1 and ARC-AGI-2 tests as of that date.
| Rank | AI System | Organization | System Type | ARC-AGI-1 | ARC-AGI-2 |
| --- | --- | --- | --- | --- | --- |
| 1 | Human Panel | Human | N/A | 98.00% | 100.00% |
| 2 | J. Berman (2025) | Bespoke | CoT + Synthesis | 79.60% | 29.40% |
| 3 | E. Pang (2025) | Bespoke | CoT + Synthesis | 77.10% | 26.00% |
| 4 | Grok 4 (Thinking) | xAI | CoT | 66.70% | 16.00% |
| 5 | GPT-5 (High) | OpenAI | CoT | 65.70% | 9.90% |
| 6 | Claude Opus 4 (Thinking 16K) | Anthropic | CoT | 35.70% | 8.60% |
| 7 | GPT-5 (Medium) | OpenAI | CoT | 56.20% | 7.50% |
| 8 | o3 (High) | OpenAI | CoT | 60.80% | 6.50% |
| 9 | o4-mini (High) | OpenAI | CoT | 58.70% | 6.10% |
| 10 | Claude Sonnet 4 (Thinking 16K) | Anthropic | CoT | 40.00% | 5.90% |
| 11 | o3-Pro (High) | OpenAI | CoT + Synthesis | 59.30% | 4.90% |
| 12 | Gemini 2.5 Pro (Thinking 32K) | Google | CoT | 37.00% | 4.90% |
| 13 | Claude Opus 4 (Thinking 8K) | Anthropic | CoT | 30.70% | 4.50% |
| 14 | GPT-5 Mini (High) | OpenAI | CoT | 54.30% | 4.40% |
| 15 | Gemini 2.5 Pro (Thinking 16K) | Google | CoT | 41.00% | 4.00% |
| 16 | GPT-5 Mini (Medium) | OpenAI | CoT | 37.30% | 4.00% |
| 17 | o3-preview (Low)* | OpenAI | CoT + Synthesis | 75.70% | 4.00% |
| 18 | Gemini 2.5 Pro (Preview) | Google | CoT | 33.00% | 3.80% |
| 19 | Gemini 2.5 Pro (Preview, Thinking 1K) | Google | CoT | 31.30% | 3.40% |
| 20 | o3-mini (High) | OpenAI | CoT | 34.50% | 3.00% |
See full table here: ARC Leaderboard
How ARC-AGI-1 Works
“ARC-AGI-1 consists of 800 puzzle-like tasks, designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three). This requires the test taker (human or AI) to deduce underlying rules through abstraction, inference, and prior knowledge rather than brute-force or extensive training.”
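To give a feel for the format, here’s a minimal Python sketch of a toy ARC-style task. The grid values and the transformation rule are invented for illustration; real tasks use the same train/test structure of input-output grid pairs, but with far subtler rules.

```python
# A toy ARC-style task: grids are 2D lists of integers (colors 0-9).
# This example's rule -- "mirror the grid horizontally" -- is invented
# for illustration; real ARC tasks must be inferred from ~3 example pairs.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[0, 3], [0, 0]], "output": [[3, 0], [0, 0]]},
        {"input": [[5, 5], [0, 4]], "output": [[5, 5], [4, 0]]},
    ],
    "test": [{"input": [[7, 0], [0, 8]]}],  # solver must produce the output
}

def solve(grid):
    """A solver that has inferred this task's rule: reverse each row."""
    return [list(reversed(row)) for row in grid]

# Verify the inferred rule against all training pairs, then apply it.
assert all(solve(p["input"]) == p["output"] for p in task["train"])
print(solve(task["test"][0]["input"]))  # [[0, 7], [8, 0]]
```

The hard part, of course, is that an ARC solver gets no `solve` function handed to it; it has to induce the rule from the few training pairs alone.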
ARC-AGI-2 Explained
Here’s a direct quote:
“ARC-AGI-1 was created in 2019 (before LLMs even existed). It endured 5 years of global competitions, over 50,000x of AI scaling, and saw little progress until late 2024 with test-time adaptation methods pioneered by ARC Prize 2024 and OpenAI.
ARC-AGI-2 – the next iteration of the benchmark – is designed to stress test the efficiency and capability of state-of-the-art AI reasoning systems, provide useful signal towards AGI, and re-inspire researchers to work on new ideas.
Pure LLMs score 0%, AI reasoning systems score only single-digit percentages, yet extensive testing shows that humans can solve every task.
Can you create a system that can reach 85% accuracy?”
July 10, 2025 Ranking
As of July 10, 2025, Grok 4 is the best AI model, according to ARC Prize’s ARC-AGI Leaderboard.

According to their X announcement:
“Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%. This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA.”
– ARC Prize on X
View the full ARC-AGI Leaderboard page for real-time updates.
According to their team:
“ARC-AGI has evolved from its first version (ARC-AGI-1) which measured basic fluid intelligence, to ARC-AGI-2 which challenges systems to demonstrate both high adaptability and high efficiency.
The scatter plot above visualizes the critical relationship between cost-per-task and performance – a key measure of intelligence efficiency. True intelligence isn’t just about solving problems, but solving them efficiently with minimal resources.”
Other Leaderboards
Kearney Leaderboard: Out of Date
We don’t recommend referencing this one by Kearney; it lists o1 as an “up and coming” model, so it’s already out of date.