$ cat /var/log/models/snapshot — 2026-04-16
16 models. 10 benchmarks. Coding, reasoning, agentic work, math, multimodal. No vibes. Pick your task, sort the table, compare the charts. The numbers come from public leaderboards. The verdicts come from me.
$ tail /var/log/verdicts
Every model is good at something. None is best at everything. Match the model to the work — not the marketing.
Best at Coding
Claude Opus 4.7
Runner-up: GPT-5.4 Pro
SWE-bench Verified + Aider Polyglot, evenly weighted. On the table's numbers the top two tie at 85.8.
Best at Reasoning
Gemini 3.1 Pro
Runner-up: Claude Opus 4.7
GPQA Diamond + MMLU-Pro + ARC-AGI-2, averaged. On the table's numbers Claude Opus 4.7 and GPT-5.4 Pro tie for second at 81.7.
Best at Agentic Work
GPT-5.4 Pro
Runner-up: Claude Opus 4.7
OSWorld-Verified + τ²-bench. Real tools, real failure modes.
Best at Math
GPT-5.4 Pro
Runner-up: DeepSeek R1
MATH-500. Where reasoning models still win.
Best Value
Qwen3.5-397B
Runner-up: DeepSeek V4
Composite score per blended dollar. Where the budget actually goes.
Best Open Weight
DeepSeek V4
Runner-up: GLM-5
Highest composite score under any license you can self-host.
Best for Long Context
Gemini 3.1 Pro
Runner-up: GPT-5.4 Pro
Largest context window with credible composite score.
Best Multimodal
GPT-5.4 Pro
Runner-up: Claude Opus 4.7
Vision-capable models, ranked by overall intelligence.
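Several of the verdicts above are evenly weighted averages over named benchmarks. A minimal sketch of that ranking for the agentic category, using scores copied from the table below; the function names `category_score` and `rank_category` are illustrative and not part of the site:

```python
# Agentic benchmark scores copied from the table for three listed models.
SCORES = {
    "GPT-5.4": {"OSWorld": 75.0, "τ-bench": 76.8},
    "Claude Opus 4.7": {"OSWorld": 76.2, "τ-bench": 78.4},
    "GPT-5.4 Pro": {"OSWorld": 76.4, "τ-bench": 78.9},
}

AGENTIC = ("OSWorld", "τ-bench")


def category_score(scores, benchmarks):
    """Evenly weighted average of the named benchmarks (hypothetical helper)."""
    return sum(scores[b] for b in benchmarks) / len(benchmarks)


def rank_category(models, benchmarks):
    """Return (model, score) pairs sorted best first."""
    ranked = [(name, category_score(s, benchmarks)) for name, s in models.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


ranking = rank_category(SCORES, AGENTIC)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With these three rows the order comes out as GPT-5.4 Pro, then Claude Opus 4.7, then GPT-5.4, which agrees with the agentic verdict.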
$ ./explore --interactive
Claude Opus 4.7
Anthropic · 83.4 · 1M ctx
Still the model to beat for sustained agent reliability — and the price tag tells you Anthropic knows it.
GPT-5.4
OpenAI · 81.4 · 1M ctx
Best polyglot coder in the room — Aider 88 is not an accident. Reasoning mode is solid; computer-use is finally credible.
Gemini 3.1 Pro
Google · 82.2 · 2M ctx
GPQA 94, ARC-AGI 77, 2M context, $1.25 in. The reasoning + cost combo is genuinely uncomfortable for the rest of the field.
DeepSeek V4
DeepSeek · 71.5 · 256K ctx
Top open-weight on SWE-bench at $0.30 in. The closed-frontier price umbrella has a hole in it now — the question is how long Anthropic and OpenAI keep pretending.
Score: unweighted mean across 8 benchmarks
| Model | Lab | License | Score | SWE-bench | Aider | LiveCode | HumanEval | MMLU-Pro | GPQA | ARC-AGI-2 | MATH | OSWorld | τ-bench | Ctx | $/MTok (blended) | Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 Pro | OpenAI | Closed | 83.7 | 82.1 | 89.5 | 86.5 | 96.0 | 90.4 | 93.6 | 61.2 | 97.4 | 76.4 | 78.9 | 1M | $52.50 | 2 |
| Claude Opus 4.7 | Anthropic | Closed | 83.4 | 87.6 | 84.0 | 82.1 | 96.0 | 87.1 | 89.3 | 68.8 | 95.6 | 76.2 | 78.4 | 1M | $10 | 8 |
| Gemini 3.1 Pro | Google | Closed | 82.2 | 78.8 | 81.6 | 79.3 | 94.5 | 89.8 | 94.1 | 77.1 | 96.8 | 68.4 | 71.2 | 2M | $2.19 | 38 |
| GPT-5.4 | OpenAI | Closed | 81.4 | 80.4 | 88.0 | 84.7 | 95.2 | 88.5 | 92.0 | 54.0 | 96.2 | 75.0 | 76.8 | 1M | $5.63 | 14 |
| GPT-5.3 Codex | OpenAI | Closed | 75.9 | 78.3 | 86.4 | 87.2 | 95.8 | 82.0 | 91.5 | 48.0 | 94.3 | 62.0 | 65.0 | 400K | $6 | 13 |
| Claude Sonnet 4.6 | Anthropic | Closed | 75.5 | 79.6 | 79.8 | 76.4 | 93.7 | 84.5 | 84.2 | 38.0 | 92.1 | 72.0 | 74.1 | 1M | $6 | 13 |
| Grok 4.20 | xAI | Closed | 75.1 | 79.0 | 79.6 | 78.2 | 92.3 | 84.0 | 87.5 | 55.4 | 94.1 | 58.2 | 62.8 | 256K | $7.50 | 10 |
| DeepSeek V4 | DeepSeek | Open weight | 71.5 | 81.0 | 78.4 | 82.4 | 95.0 | 86.2 | 84.0 | 31.0 | 96.5 | 54.7 | 60.0 | 256K | $0.50 | 143 |
| GLM-5 | Z.ai | Open weight | 69.5 | 77.8 | 74.6 | 84.9 | 93.4 | 83.5 | 80.4 | 24.1 | 94.0 | 58.0 | 63.2 | 200K | $0.70 | 99 |
| Kimi K2.5 | Moonshot | Open weight | 68.3 | 76.8 | 75.0 | 85.0 | 93.0 | 82.7 | 79.8 | 22.0 | 93.6 | 55.5 | 61.0 | 256K | $1.07 | 64 |
| Gemini 3 Flash | Google | Closed | 66.9 | 64.2 | 68.0 | 70.1 | 89.2 | 81.4 | 82.7 | 36.5 | 91.4 | 52.0 | 58.6 | 1M | $0.85 | 79 |
| Qwen3.5-397B | Alibaba | Apache 2.0 | 65.6 | 72.4 | 73.2 | 80.7 | 92.5 | 81.6 | 78.0 | 19.5 | 92.8 | 50.3 | 56.7 | 256K | $0.35 | 187 |
| Claude Haiku 4.5 | Anthropic | Closed | 64.8 | 73.3 | 65.0 | 64.3 | 89.0 | 78.1 | 75.8 | 22.4 | 86.5 | 56.8 | 60.2 | 200K | $2 | 32 |
| DeepSeek R1 | DeepSeek | Open weight | 63.7 | 67.8 | 71.3 | 76.8 | 92.0 | 84.0 | 81.2 | 18.6 | 97.3 | 41.0 | 48.4 | 128K | $0.96 | 66 |
| Mistral Large 3 | Mistral | Apache 2.0 | 55.6 | 58.0 | 60.4 | 65.2 | 87.0 | 78.2 | 73.0 | 12.3 | 86.5 | 34.0 | 42.5 | 256K | $0.88 | 64 |
| Llama 4 Maverick | Meta | Custom | 51.2 | 47.2 | 51.0 | 58.4 | 84.0 | 80.5 | 70.4 | 9.8 | 84.0 | 28.4 | 38.0 | 1M | $0.42 | 123 |
Scores are 0–100. — means no published score.
$ whoami --opinion
A year ago, the top of every leaderboard was a one-horse race. Today, four labs ship credible frontier models — Anthropic, OpenAI, Google, and (when you squint at the open-weight tier) DeepSeek. The spread between them on most benchmarks is under 10 points. The differentiation moved from raw IQ to behavior under load — agent stamina, tool reliability, computer-use ceilings.
If you're choosing a model based on a single benchmark number, you're going to ship something brittle. Pick three benchmarks that match your workload and look at the cluster.
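One way to "look at the cluster" is to put the mean, the weakest score and the spread side by side for the benchmarks you picked. The sketch below assumes a long-running agent workload and therefore picks SWE-bench, OSWorld and τ-bench; that selection and the helper name `cluster_view` are assumptions for illustration, while the scores are copied from the table:

```python
from statistics import mean

# Scores copied from the table for three models; the benchmark selection
# is an assumed workload profile, not a recommendation from the source.
TABLE = {
    "DeepSeek V4": {"SWE-bench": 81.0, "OSWorld": 54.7, "τ-bench": 60.0},
    "GPT-5.4 Pro": {"SWE-bench": 82.1, "OSWorld": 76.4, "τ-bench": 78.9},
    "Claude Opus 4.7": {"SWE-bench": 87.6, "OSWorld": 76.2, "τ-bench": 78.4},
}

WORKLOAD = ("SWE-bench", "OSWorld", "τ-bench")


def cluster_view(table, benchmarks):
    """Summarize each model over the chosen benchmarks, best mean first."""
    rows = []
    for model, scores in table.items():
        picked = [scores[b] for b in benchmarks]
        rows.append({
            "model": model,
            "mean": mean(picked),
            "weakest": min(picked),                # the score most likely to bite you
            "spread": max(picked) - min(picked),   # large spread hints at brittleness
        })
    return sorted(rows, key=lambda row: row["mean"], reverse=True)


view = cluster_view(TABLE, WORKLOAD)
for row in view:
    print(f'{row["model"]}: mean {row["mean"]:.1f}, '
          f'weakest {row["weakest"]:.1f}, spread {row["spread"]:.1f}')
```

A model that looks close on one headline number (DeepSeek V4 is within a few points on SWE-bench here) can still have a much lower weakest score across the cluster, which is the brittleness a single-benchmark choice hides.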
DeepSeek V4 sits at 81% on SWE-bench, with GLM-5 and Kimi K2.5 at about 77%, all at roughly $1/MTok blended or less. The closed-frontier price umbrella has a hole in it now, and the labs charging $25/MTok output need a story for what you get for the 15× markup.
The honest answer for most teams: if your workload is predictable, the open weights are fine. If your workload is long-running, multi-tool, multi-turn — the closed frontier still wins on reliability, and reliability is the thing you're paying for.
DeepSeek R1 scoring 97.3% on MATH-500 as a reasoning-tuned open-weight model is a strong reminder that test-time compute scales. But R1 is weak at agentic work (41.0 on OSWorld, 48.4 on τ-bench), because it wasn't trained for it.
Treat the reasoning specialists (R1, GPT-5.4 Pro, GPT-5.3 Codex) as batch tools, not interactive ones. Send them the hard problems offline. Don't put them in a chat loop and act surprised when they think for 90 seconds.
OpenAI publicly stopped reporting SWE-bench Verified because every frontier model now shows training-set contamination. The same will happen to MMLU-Pro by Q3. The benchmarks worth trusting in 2026 are the ones too new or too expensive to game: ARC-AGI-2, OSWorld, τ-bench, LiveCodeBench.
If a model leads on the saturated benchmarks but trails on the hard ones, that's your tell.
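That tell can be made concrete as a gap between a model's mean on the saturated benchmarks and its mean on the hard ones. In this sketch the saturated group is SWE-bench and MMLU-Pro and the hard group is ARC-AGI-2, OSWorld, τ-bench and LiveCode, following the paragraph above; treating the gap as a single number, and the helper name `saturation_gap`, are assumptions for illustration:

```python
from statistics import mean

# Groupings follow the text: SWE-bench and MMLU-Pro as saturated,
# ARC-AGI-2, OSWorld, τ-bench and LiveCode as hard to game.
SATURATED = ("SWE-bench", "MMLU-Pro")
HARD = ("ARC-AGI-2", "OSWorld", "τ-bench", "LiveCode")

# Scores copied from the table for two models.
ROWS = {
    "Claude Opus 4.7": {
        "SWE-bench": 87.6, "MMLU-Pro": 87.1, "ARC-AGI-2": 68.8,
        "OSWorld": 76.2, "τ-bench": 78.4, "LiveCode": 82.1,
    },
    "DeepSeek V4": {
        "SWE-bench": 81.0, "MMLU-Pro": 86.2, "ARC-AGI-2": 31.0,
        "OSWorld": 54.7, "τ-bench": 60.0, "LiveCode": 82.4,
    },
}


def saturation_gap(scores):
    """Mean on saturated benchmarks minus mean on hard ones (hypothetical metric)."""
    return mean(scores[b] for b in SATURATED) - mean(scores[b] for b in HARD)


gaps = {model: saturation_gap(scores) for model, scores in ROWS.items()}
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: gap {gap:.1f} points")
```

A larger gap means the model's headline numbers lean more on the saturated benchmarks; where to draw the line is a judgment call the source does not specify.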
$ cat sources.txt
Scores come from public leaderboards as of 2026-04-16. Where a model has no comparable published score on a benchmark, it's shown as — rather than estimated. The composite "intelligence index" is a simple unweighted mean of the eight non-coding-saturated benchmarks, that is, every benchmark column in the table except LiveCode and HumanEval. Value score is composite ÷ blended price in $/MTok (3:1 input:output mix).
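The arithmetic behind the composite and value columns can be sketched as follows. The benchmark list is the one that reproduces the published Score column (all benchmark columns except LiveCode and HumanEval); the function names are illustrative, and the prices passed to `blended_price` in the usage lines are made-up round numbers, not any lab's actual pricing:

```python
from statistics import mean

# The eight benchmark columns that reproduce the table's Score values.
COMPOSITE_BENCHMARKS = (
    "SWE-bench", "Aider", "MMLU-Pro", "GPQA",
    "ARC-AGI-2", "MATH", "OSWorld", "τ-bench",
)

# Claude Opus 4.7 row, copied from the table.
CLAUDE_OPUS_47 = {
    "SWE-bench": 87.6, "Aider": 84.0, "LiveCode": 82.1, "HumanEval": 96.0,
    "MMLU-Pro": 87.1, "GPQA": 89.3, "ARC-AGI-2": 68.8, "MATH": 95.6,
    "OSWorld": 76.2, "τ-bench": 78.4,
}


def composite(scores):
    """Unweighted mean over the eight composite benchmarks."""
    return mean(scores[name] for name in COMPOSITE_BENCHMARKS)


def blended_price(input_price, output_price):
    """Blended $/MTok at a 3:1 input:output token mix."""
    return (3 * input_price + output_price) / 4


def value_score(composite_score, price_per_mtok):
    """Composite points per blended dollar."""
    return composite_score / price_per_mtok


score = composite(CLAUDE_OPUS_47)
print(f"composite: {score:.1f}")                       # compare with the Score column
print(f"value: {value_score(83.4, 10.0):.0f}")         # table Score and $/MTok for this row
print(f"blended: ${blended_price(1.00, 5.00):.2f}")    # hypothetical $1 in, $5 out
```

Because the value score divides by price, cheap models with middling composites can outrank the frontier on this column, which is why the Best Value verdict goes to a lower-ranked model.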
SWE-bench Verified leaderboard · www.swebench.com
Aider Polyglot leaderboard · aider.chat
ARC Prize leaderboard · arcprize.org
OSWorld benchmark · airank.dev
τ²-bench (Sierra Research) · artificialanalysis.ai
MMLU-Pro leaderboard · artificialanalysis.ai
Anthropic, Claude Opus 4.7 · www.anthropic.com
OpenAI, GPT-5.4 release · openai.com
Artificial Analysis · artificialanalysis.ai
$ subscribe --weekly
New benchmarks every week. New models every month. The big calls land in Nova's Signal.