
What are the top AI models? There are many different ranking systems, but the ARC Prize leaderboard is a great place to start as a rigorous source of LLM rankings. See our post on AI ranking factors for more intel.

October 26, 2025 Ranking

Just eyeballing it, GPT-5 Pro is the current top AI model in production, beaten only by bespoke custom systems and a human panel.

Grok 4 (Thinking) is the #2 production model, with a very solid cost per task of $2.17 versus GPT-5 Pro’s $7.14 for fairly close scores. It deserves heavy production use, and will likely see even more soon.

Claude Sonnet 4.5 (Thinking 32K) is right behind the other two at an astonishingly low $0.76 per task, roughly a third of Grok 4 (Thinking)’s cost. It should be used even more frequently in production as a strong default model.
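To make that cost/score tradeoff concrete, here’s a minimal Python sketch (ours, not ARC Prize’s) that computes ARC-AGI-2 percentage points per dollar for the top three production models, using the figures from the table below:

```python
# Score-per-dollar on ARC-AGI-2 for the top three production models.
# Figures come from the October 26, 2025 table below.
models = {
    "GPT-5 Pro":                        {"arc_agi_2_pct": 18.3, "cost_per_task": 7.14},
    "Grok 4 (Thinking)":                {"arc_agi_2_pct": 16.0, "cost_per_task": 2.17},
    "Claude Sonnet 4.5 (Thinking 32K)": {"arc_agi_2_pct": 13.6, "cost_per_task": 0.76},
}

for name, m in models.items():
    # Percentage points of ARC-AGI-2 score bought per dollar per task.
    points_per_dollar = m["arc_agi_2_pct"] / m["cost_per_task"]
    print(f"{name}: {points_per_dollar:.1f} points/$")

# Output (approx.): GPT-5 Pro 2.6, Grok 4 (Thinking) 7.4,
# Claude Sonnet 4.5 (Thinking 32K) 17.9 -- Sonnet is ~7x more
# cost-efficient than GPT-5 Pro at a modestly lower score.
```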

Here’s a fresh pull of the rankings, the top 20 as of today:

| AI System | Organization | System Type | ARC-AGI-1 | ARC-AGI-2 | Cost/Task |
| --- | --- | --- | --- | --- | --- |
| Human Panel | Human | N/A | 98.00% | 100.00% | $17.00 |
| J. Berman (2025) | Bespoke | CoT + Synthesis | 79.60% | 29.40% | $30.40 |
| E. Pang (2025) | Bespoke | CoT + Synthesis | 77.10% | 26.00% | $3.97 |
| GPT-5 Pro | OpenAI | CoT | 70.20% | 18.30% | $7.14 |
| Grok 4 (Thinking) | xAI | CoT | 66.70% | 16.00% | $2.17 |
| Claude Sonnet 4.5 (Thinking 32K) | Anthropic | CoT | 63.70% | 13.60% | $0.76 |
| GPT-5 (High) | OpenAI | CoT | 65.70% | 9.90% | $0.73 |
| Claude Opus 4 (Thinking 16K) | Anthropic | CoT | 35.70% | 8.60% | $1.93 |
| GPT-5 (Medium) | OpenAI | CoT | 56.20% | 7.50% | $0.45 |
| Claude Sonnet 4.5 (Thinking 8K) | Anthropic | CoT | 46.50% | 6.90% | $0.24 |
| Claude Sonnet 4.5 (Thinking 16K) | Anthropic | CoT | 48.30% | 6.90% | $0.35 |
| o3 (High) | OpenAI | CoT | 60.80% | 6.50% | $0.83 |
| Tiny Recursion Model (TRM) | Bespoke | N/A | 40.00% | 6.30% | $2.10 |
| o4-mini (High) | OpenAI | CoT | 58.70% | 6.10% | $0.86 |
| Claude Sonnet 4 (Thinking 16K) | Anthropic | CoT | 40.00% | 5.90% | $0.49 |
| Claude Sonnet 4.5 (Thinking 1K) | Anthropic | CoT | 31.00% | 5.80% | $0.14 |
| Grok 4 (Fast Reasoning) | xAI | CoT | 48.50% | 5.30% | $0.06 |
| o3-Pro (High) | OpenAI | CoT + Synthesis | 59.30% | 4.90% | $7.55 |
| Gemini 2.5 Pro (Thinking 32K) | Google | CoT | 37.00% | 4.90% | $0.76 |
| Claude Opus 4 (Thinking 8K) | Anthropic | CoT | 30.70% | 4.50% | $1.16 |

View the entire leaderboard at ARC Prize.

Sep 18, 2025 Ranking

Table recreated courtesy of ARC Prize, a nonprofit.

This table shows the rankings as of that date on the ARC-AGI-1 and ARC-AGI-2 tests.

| Rank | AI System | Organization | System Type | ARC-AGI-1 | ARC-AGI-2 |
| --- | --- | --- | --- | --- | --- |
| 1 | Human Panel | Human | N/A | 98.00% | 100.00% |
| 2 | J. Berman (2025) | Bespoke | CoT + Synthesis | 79.60% | 29.40% |
| 3 | E. Pang (2025) | Bespoke | CoT + Synthesis | 77.10% | 26.00% |
| 4 | Grok 4 (Thinking) | xAI | CoT | 66.70% | 16.00% |
| 5 | GPT-5 (High) | OpenAI | CoT | 65.70% | 9.90% |
| 6 | Claude Opus 4 (Thinking 16K) | Anthropic | CoT | 35.70% | 8.60% |
| 7 | GPT-5 (Medium) | OpenAI | CoT | 56.20% | 7.50% |
| 8 | o3 (High) | OpenAI | CoT | 60.80% | 6.50% |
| 9 | o4-mini (High) | OpenAI | CoT | 58.70% | 6.10% |
| 10 | Claude Sonnet 4 (Thinking 16K) | Anthropic | CoT | 40.00% | 5.90% |
| 11 | o3-Pro (High) | OpenAI | CoT + Synthesis | 59.30% | 4.90% |
| 12 | Gemini 2.5 Pro (Thinking 32K) | Google | CoT | 37.00% | 4.90% |
| 13 | Claude Opus 4 (Thinking 8K) | Anthropic | CoT | 30.70% | 4.50% |
| 14 | GPT-5 Mini (High) | OpenAI | CoT | 54.30% | 4.40% |
| 15 | Gemini 2.5 Pro (Thinking 16K) | Google | CoT | 41.00% | 4.00% |
| 16 | GPT-5 Mini (Medium) | OpenAI | CoT | 37.30% | 4.00% |
| 17 | o3-preview (Low)* | OpenAI | CoT + Synthesis | 75.70% | 4.00% |
| 18 | Gemini 2.5 Pro (Preview) | Google | CoT | 33.00% | 3.80% |
| 19 | Gemini 2.5 Pro (Preview, Thinking 1K) | Google | CoT | 31.30% | 3.40% |
| 20 | o3-mini (High) | OpenAI | CoT | 34.50% | 3.00% |

See full table here: ARC Leaderboard

How ARC-AGI-1 Works

“ARC-AGI-1 consists of 800 puzzle-like tasks, designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three). This requires the test taker (human or AI) to deduce underlying rules through abstraction, inference, and prior knowledge rather than brute-force or extensive training.”
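For a feel of the format: tasks in the public ARC-AGI dataset are stored as JSON, with a short “train” list of input/output grid pairs and one or more “test” inputs, where each grid is a 2-D array of color indices 0–9. Here’s a toy Python sketch; the mirror rule is our invention and far simpler than real ARC tasks:

```python
import json

# A toy ARC-style task in the JSON layout used by the public ARC-AGI
# dataset. The hidden rule in this toy example: mirror each grid
# horizontally. Real tasks are much less obvious.
task = json.loads("""
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
    {"input": [[0, 2], [0, 0]], "output": [[2, 0], [0, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 0]]}
  ]
}
""")

def solve(grid):
    # Hand-written rule for this toy task. An ARC solver must *infer*
    # such a rule from the few train pairs alone, with no training set
    # of similar puzzles to memorize.
    return [list(reversed(row)) for row in grid]

# Verify the rule against the train pairs, then apply it to the test input.
assert all(solve(p["input"]) == p["output"] for p in task["train"])
print(solve(task["test"][0]["input"]))  # [[0, 3], [0, 0]]
```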

ARC-AGI-2 Explained

Here’s a direct quote:

“ARC-AGI-1 was created in 2019 (before LLMs even existed). It endured 5 years of global competitions, over 50,000x of AI scaling, and saw little progress until late 2024 with test-time adaptation methods pioneered by ARC Prize 2024 and OpenAI.

ARC-AGI-2 – the next iteration of the benchmark – is designed to stress test the efficiency and capability of state-of-the-art AI reasoning systems, provide useful signal towards AGI, and re-inspire researchers to work on new ideas.

Pure LLMs score 0%, AI reasoning systems score only single-digit percentages, yet extensive testing shows that humans can solve every task.

Can you create a system that can reach 85% accuracy?”

July 10, 2025 Ranking

As of July 10, 2025, Grok 4 is the best AI model, according to ARC Prize’s ARC-AGI Leaderboard.

According to their X announcement:

“Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%. This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA.”

– ARC Prize on X

View the full ARC-AGI Leaderboard page for real-time updates.

According to their team:

“ARC-AGI has evolved from its first version (ARC-AGI-1) which measured basic fluid intelligence, to ARC-AGI-2 which challenges systems to demonstrate both high adaptability and high efficiency.

The scatter plot above visualizes the critical relationship between cost-per-task and performance – a key measure of intelligence efficiency. True intelligence isn’t just about solving problems, but solving them efficiently with minimal resources.”
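The official chart lives on the leaderboard page, but the idea is easy to reproduce. Here’s a small matplotlib sketch (our illustration, not ARC Prize’s plot) of cost per task versus ARC-AGI-2 score, using a few rows from the October 2025 table earlier in this post:

```python
import matplotlib.pyplot as plt

# Cost-per-task (x, log scale) vs ARC-AGI-2 score (y) for a handful of
# models from the October 26, 2025 table above. Points toward the
# upper-left are more "intelligence efficient".
points = {
    "GPT-5 Pro": (7.14, 18.3),
    "Grok 4 (Thinking)": (2.17, 16.0),
    "Claude Sonnet 4.5 (32K)": (0.76, 13.6),
    "GPT-5 (High)": (0.73, 9.9),
    "Grok 4 (Fast Reasoning)": (0.06, 5.3),
}

fig, ax = plt.subplots()
for name, (cost, score) in points.items():
    ax.scatter(cost, score)
    ax.annotate(name, (cost, score), fontsize=8)
ax.set_xscale("log")  # costs span two orders of magnitude
ax.set_xlabel("Cost per task (USD)")
ax.set_ylabel("ARC-AGI-2 score (%)")
ax.set_title("Intelligence efficiency: score vs. cost")
plt.show()
```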

Other Leaderboards

Kearney Leaderboard: Out of Date

We don’t recommend referencing this one by Kearney; it mentions o1 as an “up and coming” model, so it’s already out of date.

Joe Robison

Founder & Consultant
Joe Robison is the founder of Green Flag Digital. He founded the agency in 2015 and has been heads-down scaling content marketing and SEO services for clients ever since. He is an occasional surfer, fledgling yogi, and sucker for organized travel tours.