
What are the top AI models? There are many different ranking systems, but the ARC Prize is a great one to start with as a definitive source of LLM leaderboard rankings. See our post on AI ranking factors for more intel.

October 26, 2025 Ranking

Just eyeballing it, GPT-5 Pro is the current top AI model in production, beaten only by bespoke custom AIs and a human panel.

Grok 4 (Thinking) is the #2 production model, with a very solid cost per task of $2.17 versus GPT-5 Pro’s $7.14 for fairly close scores. It deserves heavy production use, and will likely see even more soon.

Claude Sonnet 4.5 (Thinking 32K) is right behind the other two, at an astonishingly low $0.76 per task, nearly a third of Grok 4 (Thinking)’s cost. That makes it a strong default AI worth using even more frequently in production.
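
For a quick gut check on those value claims, here’s a back-of-the-envelope Python sketch. The scores and costs come straight from the table below; the “dollars per point” metric is our own crude heuristic, not an official ARC Prize figure:

```python
# Back-of-the-envelope check on the value claims above, using ARC-AGI-2
# scores and cost-per-task figures from the table below. "Dollars per
# point" is our own crude heuristic, not an official ARC Prize metric.
leaders = {
    "GPT-5 Pro": (18.3, 7.14),  # (ARC-AGI-2 score %, cost per task $)
    "Grok 4 (Thinking)": (16.0, 2.17),
    "Claude Sonnet 4.5 (Thinking 32K)": (13.6, 0.76),
}

for name, (score, cost) in leaders.items():
    print(f"{name}: ${cost / score:.3f} per ARC-AGI-2 point")

# GPT-5 Pro: $0.390 per ARC-AGI-2 point
# Grok 4 (Thinking): $0.136 per ARC-AGI-2 point
# Claude Sonnet 4.5 (Thinking 32K): $0.056 per ARC-AGI-2 point
```

By that rough measure, Sonnet 4.5 delivers each percentage point of ARC-AGI-2 performance at roughly a seventh of GPT-5 Pro’s price.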

Here’s a fresh pull of the rankings, showing the top 20 as of today:

| AI System | Organization | System Type | ARC-AGI-1 | ARC-AGI-2 | Cost/Task |
|---|---|---|---|---|---|
| Human Panel | Human | N/A | 98.00% | 100.00% | $17.00 |
| J. Berman (2025) | Bespoke | CoT + Synthesis | 79.60% | 29.40% | $30.40 |
| E. Pang (2025) | Bespoke | CoT + Synthesis | 77.10% | 26.00% | $3.97 |
| GPT-5 Pro | OpenAI | CoT | 70.20% | 18.30% | $7.14 |
| Grok 4 (Thinking) | xAI | CoT | 66.70% | 16.00% | $2.17 |
| Claude Sonnet 4.5 (Thinking 32K) | Anthropic | CoT | 63.70% | 13.60% | $0.76 |
| GPT-5 (High) | OpenAI | CoT | 65.70% | 9.90% | $0.73 |
| Claude Opus 4 (Thinking 16K) | Anthropic | CoT | 35.70% | 8.60% | $1.93 |
| GPT-5 (Medium) | OpenAI | CoT | 56.20% | 7.50% | $0.45 |
| Claude Sonnet 4.5 (Thinking 8K) | Anthropic | CoT | 46.50% | 6.90% | $0.24 |
| Claude Sonnet 4.5 (Thinking 16K) | Anthropic | CoT | 48.30% | 6.90% | $0.35 |
| o3 (High) | OpenAI | CoT | 60.80% | 6.50% | $0.83 |
| Tiny Recursion Model (TRM) | Bespoke | N/A | 40.00% | 6.30% | $2.10 |
| o4-mini (High) | OpenAI | CoT | 58.70% | 6.10% | $0.86 |
| Claude Sonnet 4 (Thinking 16K) | Anthropic | CoT | 40.00% | 5.90% | $0.49 |
| Claude Sonnet 4.5 (Thinking 1K) | Anthropic | CoT | 31.00% | 5.80% | $0.14 |
| Grok 4 (Fast Reasoning) | xAI | CoT | 48.50% | 5.30% | $0.06 |
| o3-Pro (High) | OpenAI | CoT + Synthesis | 59.30% | 4.90% | $7.55 |
| Gemini 2.5 Pro (Thinking 32K) | Google | CoT | 37.00% | 4.90% | $0.76 |
| Claude Opus 4 (Thinking 8K) | Anthropic | CoT | 30.70% | 4.50% | $1.16 |

View the entire leaderboard here at ARC Prize.

Sep 18, 2025 Ranking

Table recreated courtesy of ARC Prize, a nonprofit.

This table shows the rankings on the ARC-AGI-1 and ARC-AGI-2 tests as of September 18, 2025.

| Rank | AI System | Organization | System Type | ARC-AGI-1 | ARC-AGI-2 |
|---|---|---|---|---|---|
| 1 | Human Panel | Human | N/A | 98.00% | 100.00% |
| 2 | J. Berman (2025) | Bespoke | CoT + Synthesis | 79.60% | 29.40% |
| 3 | E. Pang (2025) | Bespoke | CoT + Synthesis | 77.10% | 26.00% |
| 4 | Grok 4 (Thinking) | xAI | CoT | 66.70% | 16.00% |
| 5 | GPT-5 (High) | OpenAI | CoT | 65.70% | 9.90% |
| 6 | Claude Opus 4 (Thinking 16K) | Anthropic | CoT | 35.70% | 8.60% |
| 7 | GPT-5 (Medium) | OpenAI | CoT | 56.20% | 7.50% |
| 8 | o3 (High) | OpenAI | CoT | 60.80% | 6.50% |
| 9 | o4-mini (High) | OpenAI | CoT | 58.70% | 6.10% |
| 10 | Claude Sonnet 4 (Thinking 16K) | Anthropic | CoT | 40.00% | 5.90% |
| 11 | o3-Pro (High) | OpenAI | CoT + Synthesis | 59.30% | 4.90% |
| 12 | Gemini 2.5 Pro (Thinking 32K) | Google | CoT | 37.00% | 4.90% |
| 13 | Claude Opus 4 (Thinking 8K) | Anthropic | CoT | 30.70% | 4.50% |
| 14 | GPT-5 Mini (High) | OpenAI | CoT | 54.30% | 4.40% |
| 15 | Gemini 2.5 Pro (Thinking 16K) | Google | CoT | 41.00% | 4.00% |
| 16 | GPT-5 Mini (Medium) | OpenAI | CoT | 37.30% | 4.00% |
| 17 | o3-preview (Low)* | OpenAI | CoT + Synthesis | 75.70% | 4.00% |
| 18 | Gemini 2.5 Pro (Preview) | Google | CoT | 33.00% | 3.80% |
| 19 | Gemini 2.5 Pro (Preview, Thinking 1K) | Google | CoT | 31.30% | 3.40% |
| 20 | o3-mini (High) | OpenAI | CoT | 34.50% | 3.00% |

See the full table here: ARC Leaderboard

How ARC-AGI-1 Works

“ARC-AGI-1 consists of 800 puzzle-like tasks, designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three). This requires the test taker (human or AI) to deduce underlying rules through abstraction, inference, and prior knowledge rather than brute-force or extensive training.”
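
If you want to poke at the benchmark yourself, the public ARC-AGI-1 tasks are published as small JSON files in the fchollet/ARC-AGI GitHub repo. Here’s a minimal Python sketch of loading and inspecting one; the specific file name below is illustrative:

```python
import json

# Each public ARC task is a small JSON file with "train" demonstration
# pairs and "test" pairs; every grid is a list of rows of integers 0-9
# (each integer is a color). The file name below is illustrative.
with open("data/training/0520fde7.json") as f:
    task = json.load(f)

# The solver sees only the few train pairs (usually around three)...
for i, pair in enumerate(task["train"]):
    print(f"train pair {i}: {pair['input']} -> {pair['output']}")

# ...and must infer the underlying rule well enough to produce the
# output grid for each held-out test input.
print("test input:", task["test"][0]["input"])
```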

ARC-AGI-2 Explained

Here’s a direct quote:

“ARC-AGI-1 was created in 2019 (before LLMs even existed). It endured 5 years of global competitions, over 50,000x of AI scaling, and saw little progress until late 2024 with test-time adaptation methods pioneered by ARC Prize 2024 and OpenAI.

ARC-AGI-2 – the next iteration of the benchmark – is designed to stress test the efficiency and capability of state-of-the-art AI reasoning systems, provide useful signal towards AGI, and re-inspire researchers to work on new ideas.

Pure LLMs score 0%, AI reasoning systems score only single-digit percentages, yet extensive testing shows that humans can solve every task.

Can you create a system that can reach 85% accuracy?”

July 10, 2025 Ranking

As of July 10, 2025, Grok 4 was the best AI model, according to ARC Prize’s ARC-AGI Leaderboard.

According to their X announcement:

“Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%. This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA.”

-ARC on X

View the full ARC-AGI Leaderboard page for real-time updates.

According to their team:

“ARC-AGI has evolved from its first version (ARC-AGI-1) which measured basic fluid intelligence, to ARC-AGI-2 which challenges systems to demonstrate both high adaptability and high efficiency.

The scatter plot above visualizes the critical relationship between cost-per-task and performance – a key measure of intelligence efficiency. True intelligence isn’t just about solving problems, but solving them efficiently with minimal resources.”
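
The scatter plot itself doesn’t survive in this post, but here’s a rough matplotlib sketch that recreates the same cost-vs-performance view from a few rows of the October table above (the model selection is ours, purely for illustration):

```python
import matplotlib.pyplot as plt

# (cost per task in USD, ARC-AGI-2 score in %) taken from the
# October 26, 2025 table above; a hand-picked illustrative subset.
models = {
    "GPT-5 Pro": (7.14, 18.3),
    "Grok 4 (Thinking)": (2.17, 16.0),
    "Claude Sonnet 4.5 (Thinking 32K)": (0.76, 13.6),
    "GPT-5 (High)": (0.73, 9.9),
    "Grok 4 (Fast Reasoning)": (0.06, 5.3),
}

fig, ax = plt.subplots()
for name, (cost, score) in models.items():
    ax.scatter(cost, score)
    ax.annotate(name, (cost, score), fontsize=8)

ax.set_xscale("log")  # costs span two orders of magnitude
ax.set_xlabel("Cost per task (USD, log scale)")
ax.set_ylabel("ARC-AGI-2 score (%)")
ax.set_title("Score vs. cost per task (intelligence efficiency)")
plt.tight_layout()
plt.show()
```

Plotted this way, the efficiency story jumps out: models in the upper left solve more while spending less.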

Other Leaderboards

Kearney Leaderboard: Out of Date

We don’t recommend referencing this one by Kearney; it mentions o1 as an “up and coming” model, so it’s already out of date.

Joe Robison

Founder & Consultant
Joe Robison is the founder of Green Flag Digital. He founded the agency in 2015 and has been heads-down scaling content marketing and SEO services for clients ever since. He is an occasional surfer, fledgling yogi, and sucker for organized travel tours.