
What are the top AI models? There are many different ranking systems, but the ARC Prize leaderboard is a great one to start with.

Sep 18, 2025 Ranking

Table recreated courtesy of ARC Prize, a nonprofit.

This table shows the latest rankings on the ARC-AGI-1 and ARC-AGI-2 tests.

| Rank | AI System | Organization | System Type | ARC-AGI-1 | ARC-AGI-2 |
| --- | --- | --- | --- | --- | --- |
| 1 | Human Panel | Human | N/A | 98.00% | 100.00% |
| 2 | J. Berman (2025) | Bespoke | CoT + Synthesis | 79.60% | 29.40% |
| 3 | E. Pang (2025) | Bespoke | CoT + Synthesis | 77.10% | 26.00% |
| 4 | Grok 4 (Thinking) | xAI | CoT | 66.70% | 16.00% |
| 5 | GPT-5 (High) | OpenAI | CoT | 65.70% | 9.90% |
| 6 | Claude Opus 4 (Thinking 16K) | Anthropic | CoT | 35.70% | 8.60% |
| 7 | GPT-5 (Medium) | OpenAI | CoT | 56.20% | 7.50% |
| 8 | o3 (High) | OpenAI | CoT | 60.80% | 6.50% |
| 9 | o4-mini (High) | OpenAI | CoT | 58.70% | 6.10% |
| 10 | Claude Sonnet 4 (Thinking 16K) | Anthropic | CoT | 40.00% | 5.90% |
| 11 | o3-Pro (High) | OpenAI | CoT + Synthesis | 59.30% | 4.90% |
| 12 | Gemini 2.5 Pro (Thinking 32K) | Google | CoT | 37.00% | 4.90% |
| 13 | Claude Opus 4 (Thinking 8K) | Anthropic | CoT | 30.70% | 4.50% |
| 14 | GPT-5 Mini (High) | OpenAI | CoT | 54.30% | 4.40% |
| 15 | Gemini 2.5 Pro (Thinking 16K) | Google | CoT | 41.00% | 4.00% |
| 16 | GPT-5 Mini (Medium) | OpenAI | CoT | 37.30% | 4.00% |
| 17 | o3-preview (Low)* | OpenAI | CoT + Synthesis | 75.70% | 4.00% |
| 18 | Gemini 2.5 Pro (Preview) | Google | CoT | 33.00% | 3.80% |
| 19 | Gemini 2.5 Pro (Preview, Thinking 1K) | Google | CoT | 31.30% | 3.40% |
| 20 | o3-mini (High) | OpenAI | CoT | 34.50% | 3.00% |

See full table here: ARC Leaderboard

How ARC-AGI-1 Works

“ARC-AGI-1 consists of 800 puzzle-like tasks, designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three). This requires the test taker (human or AI) to deduce underlying rules through abstraction, inference, and prior knowledge rather than brute-force or extensive training.”
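To make that concrete, here's a minimal sketch of the task format (ARC tasks are published as JSON with "train" and "test" lists of input/output grid pairs, where grids are rows of integers 0-9 representing colors). The toy "recolor 1 to 2" rule below is invented for illustration; real ARC tasks expect the solver to infer the rule from the examples alone.

```python
# A minimal sketch of the ARC task format: a "train" list of example
# input/output grid pairs and a "test" list of held-out pairs. Grids
# are lists of rows of integers 0-9 (each integer is a color). The
# rule here ("recolor 1 -> 2") is a toy example, not a real ARC task.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
    ],
}

def solve(grid):
    # A hand-written rule deduced from the train pairs: replace color 1
    # with color 2. Real ARC solvers must infer such rules automatically
    # from just a few examples.
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

# An attempt counts only if every cell of the output grid matches.
for pair in task["test"]:
    assert solve(pair["input"]) == pair["output"]
print("toy task solved")
```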

ARC-AGI-2 Explained

Here’s a direct quote:

“ARC-AGI-1 was created in 2019 (before LLMs even existed). It endured 5 years of global competitions, over 50,000x of AI scaling, and saw little progress until late 2024 with test-time adaptation methods pioneered by ARC Prize 2024 and OpenAI.

ARC-AGI-2 – the next iteration of the benchmark – is designed to stress test the efficiency and capability of state-of-the-art AI reasoning systems, provide useful signal towards AGI, and re-inspire researchers to work on new ideas.

Pure LLMs score 0%, AI reasoning systems score only single-digit percentages, yet extensive testing shows that humans can solve every task.

Can you create a system that can reach 85% accuracy?”
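For readers wondering how a target like 85% would be measured, here's a rough sketch assuming exact-match grading: a task counts as solved only if the predicted output grid equals the expected grid cell for cell. The sample results are made up, and this ignores details like ARC Prize's per-task attempt budget.

```python
# A sketch of leaderboard-style scoring, assuming exact-match grading.
# The two sample results below are illustrative, not real ARC data.
def score(results):
    """results: list of (predicted_grid, expected_grid) pairs, one per task."""
    solved = sum(1 for pred, expected in results if pred == expected)
    return solved / len(results)

results = [
    ([[1, 2], [3, 4]], [[1, 2], [3, 4]]),  # exact match: solved
    ([[1, 2], [3, 0]], [[1, 2], [3, 4]]),  # one wrong cell: not solved
]
print(f"{score(results):.0%}")  # prints 50%
```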

July 10, 2025 Ranking

As of July 10, 2025, Grok 4 was the best AI model, according to ARC Prize's ARC-AGI Leaderboard.

According to their X announcement:

“Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%. This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA.”

– ARC on X

View the full ARC-AGI Leaderboard page for real-time updates.

According to their team:

“ARC-AGI has evolved from its first version (ARC-AGI-1) which measured basic fluid intelligence, to ARC-AGI-2 which challenges systems to demonstrate both high adaptability and high efficiency.

The scatter plot above visualizes the critical relationship between cost-per-task and performance – a key measure of intelligence efficiency. True intelligence isn’t just about solving problems, but solving them efficiently with minimal resources.”
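Here's a small illustrative sketch of that cost-per-task idea. The systems and dollar figures below are hypothetical, not leaderboard data; the point is simply that "intelligence efficiency" weighs score against cost per task rather than score alone.

```python
# An illustrative sketch of the cost-vs-performance trade-off described
# above. Entries and dollar figures are hypothetical, not real ARC data.
entries = [
    # (system, ARC-AGI-2 score, total cost in USD, tasks attempted)
    ("system_a", 0.16, 480.0, 120),
    ("system_b", 0.09, 60.0, 120),
]

for name, arc_score, total_cost, n_tasks in entries:
    cost_per_task = total_cost / n_tasks
    print(f"{name}: {arc_score:.0%} at ${cost_per_task:.2f}/task")

# A cheaper system with a lower score can still sit on the efficiency
# frontier if no rival beats it on both score and cost at once.
```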

Joe Robison

Founder & Consultant
Joe Robison is the founder of Green Flag Digital. He founded the agency in 2015 and has been heads-down scaling content marketing and SEO services for clients ever since. He is an occasional surfer, fledgling yogi, and sucker for organized travel tours.