$ cat /var/log/models/snapshot — 2026-04-16

Every AI model worth running,
ranked on what matters.

16 models. 10 benchmarks. Coding, reasoning, agentic work, math, multimodal. No vibes. Pick your task, sort the table, compare the charts. The numbers come from public leaderboards. The verdicts come from me.

16 models tracked · 10 benchmarks · 7 open weight · snapshot 2026-04-16

$ tail /var/log/verdicts

Best at one thing each.

Every model is good at something. None is best at everything. Match the model to the work — not the marketing.

Best at Coding

Claude Opus 4.7

Runner-up: GPT-5.4 Pro

SWE-bench Verified + Aider Polyglot, evenly weighted.

Best at Reasoning

Gemini 3.1 Pro

Runner-up: Claude Opus 4.7

GPQA Diamond + MMLU-Pro + ARC-AGI-2, averaged.

Best at Agentic Work

GPT-5.4 Pro

Runner-up: Claude Opus 4.7

OSWorld-Verified + τ²-bench. Real tools, real failure modes.

Best at Math

GPT-5.4 Pro

Runner-up: DeepSeek R1

MATH-500. Where reasoning models still win.

Best Value

Qwen3.5-397B

Runner-up: DeepSeek V4

Composite score per blended dollar. Where the budget actually goes.

Best Open Weight

DeepSeek V4

Runner-up: GLM-5

Highest composite score under any license you can self-host.

Best for Long Context

Gemini 3.1 Pro

Runner-up: GPT-5.4 Pro

Largest context window with credible composite score.

Best Multimodal

GPT-5.4 Pro

Runner-up: Claude Opus 4.7

Vision-capable models, ranked by overall intelligence.
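Each verdict above is a plain average over the benchmark columns it names. As a minimal Python sketch of the agentic verdict, with scores copied from the index table for three models: it assumes the table's OSWorld and τ-bench columns correspond to OSWorld-Verified and τ²-bench, which is my reading rather than something the page states, and the helper names are hypothetical.

```python
from statistics import mean

# Scores copied from the intelligence index table (subset of rows and columns).
SCORES = {
    "GPT-5.4 Pro":     {"OSWorld": 76.4, "τ-bench": 78.9},
    "Claude Opus 4.7": {"OSWorld": 76.2, "τ-bench": 78.4},
    "Gemini 3.1 Pro":  {"OSWorld": 68.4, "τ-bench": 71.2},
}

def category_score(row, benchmarks):
    """Evenly weighted mean of the chosen benchmark columns."""
    return mean(row[b] for b in benchmarks)

def rank(scores, benchmarks):
    """Model names sorted best-first by category score."""
    return sorted(
        scores,
        key=lambda model: category_score(scores[model], benchmarks),
        reverse=True,
    )

agentic = rank(SCORES, ["OSWorld", "τ-bench"])
print(agentic[:2])  # winner and runner-up for this category
```

Swapping in a different column list gives the other category rankings, provided the rows carry those columns.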

$ ./explore --interactive

Compare any models, on any benchmark.

Click models in the table to add them to the radar chart. Sort by any benchmark. Filter by license, capability, or lab.

Capability radar

4 models · 8 axes

Selected

Claude Opus 4.7

Anthropic · 83.4 · 1M ctx

Still the model to beat for sustained agent reliability — and the price tag tells you Anthropic knows it.

GPT-5.4

OpenAI · 81.4 · 1M ctx

Best polyglot coder in the room — Aider 88 is not an accident. Reasoning mode is solid; computer-use is finally credible.

Gemini 3.1 Pro

Google · 82.2 · 2M ctx

GPQA 94, ARC-AGI 77, 2M context, $1.25 in. The reasoning + cost combo is genuinely uncomfortable for the rest of the field.

DeepSeek V4

DeepSeek · 71.5 · 256K ctx

Top open-weight on SWE-bench at $0.30 in. The closed-frontier price umbrella has a hole in it now — the question is how long Anthropic and OpenAI keep pretending.

Intelligence index

unweighted mean across 8 benchmarks

Model	Lab	License	Score	SWE-bench	Aider	LiveCode	HumanEval	MMLU-Pro	GPQA	ARC-AGI-2	MATH	OSWorld	τ-bench	Ctx	$/MTok	Value
GPT-5.4 Pro	OpenAI	Closed	83.7	82.1	89.5	86.5	96.0	90.4	93.6	61.2	97.4	76.4	78.9	1M	$52.50	2
Claude Opus 4.7	Anthropic	Closed	83.4	87.6	84.0	82.1	96.0	87.1	89.3	68.8	95.6	76.2	78.4	1M	$10	8
Gemini 3.1 Pro	Google	Closed	82.2	78.8	81.6	79.3	94.5	89.8	94.1	77.1	96.8	68.4	71.2	2M	$2.19	38
GPT-5.4	OpenAI	Closed	81.4	80.4	88.0	84.7	95.2	88.5	92.0	54.0	96.2	75.0	76.8	1M	$5.63	14
GPT-5.3 Codex	OpenAI	Closed	75.9	78.3	86.4	87.2	95.8	82.0	91.5	48.0	94.3	62.0	65.0	400K	$6	13
Claude Sonnet 4.6	Anthropic	Closed	75.5	79.6	79.8	76.4	93.7	84.5	84.2	38.0	92.1	72.0	74.1	1M	$6	13
Grok 4.20	xAI	Closed	75.1	79.0	79.6	78.2	92.3	84.0	87.5	55.4	94.1	58.2	62.8	256K	$7.50	10
DeepSeek V4	DeepSeek	Open weight	71.5	81.0	78.4	82.4	95.0	86.2	84.0	31.0	96.5	54.7	60.0	256K	$0.50	143
GLM-5	Z.ai	Open weight	69.5	77.8	74.6	84.9	93.4	83.5	80.4	24.1	94.0	58.0	63.2	200K	$0.70	99
Kimi K2.5	Moonshot	Open weight	68.3	76.8	75.0	85.0	93.0	82.7	79.8	22.0	93.6	55.5	61.0	256K	$1.07	64
Gemini 3 Flash	Google	Closed	66.9	64.2	68.0	70.1	89.2	81.4	82.7	36.5	91.4	52.0	58.6	1M	$0.85	79
Qwen3.5-397B	Alibaba	Apache 2.0	65.6	72.4	73.2	80.7	92.5	81.6	78.0	19.5	92.8	50.3	56.7	256K	$0.35	187
Claude Haiku 4.5	Anthropic	Closed	64.8	73.3	65.0	64.3	89.0	78.1	75.8	22.4	86.5	56.8	60.2	200K	$2	32
DeepSeek R1	DeepSeek	Open weight	63.7	67.8	71.3	76.8	92.0	84.0	81.2	18.6	97.3	41.0	48.4	128K	$0.96	66
Mistral Large 3	Mistral	Apache 2.0	55.6	58.0	60.4	65.2	87.0	78.2	73.0	12.3	86.5	34.0	42.5	256K	$0.88	64
Llama 4 Maverick	Meta	Custom	51.2	47.2	51.0	58.4	84.0	80.5	70.4	9.8	84.0	28.4	38.0	1M	$0.42	123

Scores are 0–100. "—" means no published score.
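The Score column can be reproduced from the benchmark columns. The sketch below assumes the eight benchmarks in the mean are every column except LiveCode and HumanEval. The page does not name the eight, but that choice matches the published scores for the rows checked here. `intelligence_index` is a hypothetical helper name.

```python
# Assumed set of eight index columns (inferred, not stated on the page).
INDEX_COLUMNS = [
    "SWE-bench", "Aider", "MMLU-Pro", "GPQA",
    "ARC-AGI-2", "MATH", "OSWorld", "τ-bench",
]

# Benchmark values copied from the table, in INDEX_COLUMNS order.
ROWS = {
    "GPT-5.4 Pro":     [82.1, 89.5, 90.4, 93.6, 61.2, 97.4, 76.4, 78.9],
    "Claude Opus 4.7": [87.6, 84.0, 87.1, 89.3, 68.8, 95.6, 76.2, 78.4],
    "Gemini 3.1 Pro":  [78.8, 81.6, 89.8, 94.1, 77.1, 96.8, 68.4, 71.2],
}

def intelligence_index(values):
    """Unweighted mean of the index benchmarks, rounded to one decimal."""
    if len(values) != len(INDEX_COLUMNS):
        raise ValueError("expected one value per index column")
    return round(sum(values) / len(values), 1)

for model, values in ROWS.items():
    print(f"{model:16s} {intelligence_index(values):.1f}")
```

For these three rows the result agrees with the Score column (83.7, 83.4, 82.2).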

$ whoami --opinion

What the data actually says.

The frontier got crowded.

A year ago, the top of every leaderboard was a one-horse race. Today, four labs ship credible frontier models — Anthropic, OpenAI, Google, and (when you squint at the open-weight tier) DeepSeek. The spread between them on most benchmarks is under 10 points. The differentiation moved from raw IQ to behavior under load — agent stamina, tool reliability, computer-use ceilings.

If you're choosing a model based on a single benchmark number, you're going to ship something brittle. Pick three benchmarks that match your workload and look at the cluster.
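One way to "look at the cluster" is to print the mean of your three benchmarks next to the weakest of them. This is a rough Python sketch with scores copied from the index table. The workload definition (SWE-bench, OSWorld, τ-bench for an agentic coding job) is a hypothetical example, not a recommendation from the page.

```python
# Scores copied from the intelligence index table (four models, three columns).
SCORES = {
    "Claude Opus 4.7": {"SWE-bench": 87.6, "OSWorld": 76.2, "τ-bench": 78.4},
    "GPT-5.4 Pro":     {"SWE-bench": 82.1, "OSWorld": 76.4, "τ-bench": 78.9},
    "Gemini 3.1 Pro":  {"SWE-bench": 78.8, "OSWorld": 68.4, "τ-bench": 71.2},
    "DeepSeek V4":     {"SWE-bench": 81.0, "OSWorld": 54.7, "τ-bench": 60.0},
}

def cluster(scores, benchmarks):
    """Return (model, mean, weakest) tuples, best mean first."""
    rows = []
    for model, row in scores.items():
        picked = [row[b] for b in benchmarks]
        rows.append((model, sum(picked) / len(picked), min(picked)))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Hypothetical workload: agentic coding.
workload = ["SWE-bench", "OSWorld", "τ-bench"]
ranking = cluster(SCORES, workload)
for model, avg, weakest in ranking:
    print(f"{model:16s} mean={avg:5.1f} weakest={weakest:5.1f}")
```

On SWE-bench alone, DeepSeek V4 sits ahead of Gemini 3.1 Pro. Across the three-benchmark cluster it falls to last, with a weakest score of 54.7, which is the kind of gap a single number hides.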

Open weight is no longer a gap year.

DeepSeek V4 at 81% SWE-bench, GLM-5 and Kimi K2.5 at about 77%, all at roughly $1/MTok blended or less. The closed-frontier price umbrella has a hole in it now, and the labs charging $25/MTok output need a story for what you get for the 15× markup.

The honest answer for most teams: if your workload is predictable, the open weights are fine. If your workload is long-running, multi-tool, multi-turn — the closed frontier still wins on reliability, and reliability is the thing you're paying for.

Reasoning models are a separate animal.

DeepSeek R1 at 97.3% MATH-500 from a reasoning-tuned open-weight model is a strong reminder that test-time compute scales. But R1 is weak at agentic work (41.0 on OSWorld and 48.4 on τ-bench in the table above), because it wasn't trained for it.

Treat the reasoning specialists (R1, GPT-5.4 Pro, GPT-5.3 Codex) as batch tools, not interactive ones. Send them the hard problems offline. Don't put them in a chat loop and act surprised when they think for 90 seconds.

The benchmarks are leaking. Watch the deltas.

OpenAI publicly stopped reporting SWE-bench Verified because every frontier model now shows training-set contamination. The same will happen to MMLU-Pro by Q3. The benchmarks worth trusting in 2026 are the ones too new or too expensive to game: ARC-AGI-2, OSWorld, τ-bench, LiveCodeBench.

If a model leads on the saturated benchmarks but trails on the hard ones, that's your tell.
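That tell can be turned into a number: the mean on the saturated benchmarks minus the mean on the hard ones. The Python sketch below uses the grouping from the paragraph above (SWE-bench and MMLU-Pro as saturated; ARC-AGI-2, OSWorld, τ-bench and LiveCodeBench as hard). It assumes the table's LiveCode column is LiveCodeBench, and `saturation_gap` is a hypothetical name. Scores are copied from the index table.

```python
SATURATED = ["SWE-bench", "MMLU-Pro"]
HARD = ["ARC-AGI-2", "OSWorld", "τ-bench", "LiveCode"]

# Scores copied from the intelligence index table.
SCORES = {
    "Claude Opus 4.7": {"SWE-bench": 87.6, "MMLU-Pro": 87.1, "ARC-AGI-2": 68.8,
                        "OSWorld": 76.2, "τ-bench": 78.4, "LiveCode": 82.1},
    "Gemini 3.1 Pro":  {"SWE-bench": 78.8, "MMLU-Pro": 89.8, "ARC-AGI-2": 77.1,
                        "OSWorld": 68.4, "τ-bench": 71.2, "LiveCode": 79.3},
    "DeepSeek V4":     {"SWE-bench": 81.0, "MMLU-Pro": 86.2, "ARC-AGI-2": 31.0,
                        "OSWorld": 54.7, "τ-bench": 60.0, "LiveCode": 82.4},
}

def mean_of(row, columns):
    """Unweighted mean over the named columns."""
    return sum(row[c] for c in columns) / len(columns)

def saturation_gap(row):
    """Positive gap: the model looks better on saturated benchmarks than on hard ones."""
    return mean_of(row, SATURATED) - mean_of(row, HARD)

gaps = {model: saturation_gap(row) for model, row in SCORES.items()}
widest = max(gaps, key=gaps.get)
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:16s} gap={gap:5.1f}")
```

Of these three, DeepSeek V4 shows by far the widest gap. A wide gap is a prompt to look closer, not proof of contamination.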

$ cat sources.txt

Methodology & sources.

Scores come from public leaderboards as of 2026-04-16. Where a model has no comparable published score on a benchmark, it's shown as "—" rather than estimated. The composite "intelligence index" is a simple unweighted mean of eight benchmarks, leaving out the saturated coding ones. Value score is composite ÷ blended price (3:1 input:output mix).
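The value arithmetic is small enough to sketch in Python. The $1.25 input price for Gemini 3.1 Pro is quoted on this page. The $5.00 output price is an assumption, chosen only because it reproduces the table's $2.19 blended figure. Function names are hypothetical.

```python
def blended_price(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted mean price per million tokens for an input:output token mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts

def value_score(composite, price):
    """Composite index points per blended dollar."""
    return composite / price

# Gemini 3.1 Pro: input price from the page, output price ASSUMED (see above).
price = blended_price(input_price=1.25, output_price=5.00)
value = value_score(composite=82.2, price=price)

print(f"blended ${price:.2f}/MTok, value {value:.0f}")
```

With those inputs the blend is $2.1875 per million tokens and the value score rounds to 38, in line with the table row for Gemini 3.1 Pro.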

SWE-bench Verified leaderboard

www.swebench.com

Aider Polyglot leaderboard

aider.chat

ARC Prize leaderboard

arcprize.org

OSWorld benchmark

airank.dev

τ²-bench (Sierra Research)

artificialanalysis.ai

MMLU-Pro leaderboard

artificialanalysis.ai

Anthropic — Claude Opus 4.7

www.anthropic.com

OpenAI — GPT-5.4 release

openai.com

Artificial Analysis

artificialanalysis.ai

$ subscribe --weekly

This page updates. So does my Substack.

New benchmarks every week. New models every month. The big calls land in Nova's Signal.

Subscribe on Substack