$ cat /var/log/models/snapshot — 2026-04-16

Every AI model worth running,
ranked on what matters.

16 models. 10 benchmarks. Coding, reasoning, agentic work, math, multimodal. No vibes. Pick your task, sort the table, compare the charts. The numbers come from public leaderboards. The verdicts come from me.

16 models tracked · 10 benchmarks · 7 open weight · snapshot 2026-04-16

$ tail /var/log/verdicts

Best at one thing each.

Every model is good at something. None is best at everything. Match the model to the work — not the marketing.

Best at Coding

Claude Opus 4.7

Runner-up: GPT-5.4 Pro

SWE-bench Verified + Aider Polyglot, evenly weighted.

Best at Reasoning

Gemini 3.1 Pro

Runner-up: Claude Opus 4.7

GPQA Diamond + MMLU-Pro + ARC-AGI-2, averaged.

Best at Agentic Work

GPT-5.4 Pro

Runner-up: Claude Opus 4.7

OSWorld-Verified + τ²-bench. Real tools, real failure modes.

Best at Math

GPT-5.4 Pro

Runner-up: DeepSeek R1

MATH-500. Where reasoning models still win.

Best Value

Qwen3.5-397B

Runner-up: DeepSeek V4

Composite score per blended dollar. Where the budget actually goes.

Best Open Weight

DeepSeek V4

Runner-up: GLM-5

Highest composite score under any license you can self-host.

Best for Long Context

Gemini 3.1 Pro

Runner-up: GPT-5.4 Pro

Largest context window with credible composite score.

Best Multimodal

GPT-5.4 Pro

Runner-up: Claude Opus 4.7

Vision-capable models, ranked by overall intelligence.

$ ./explore --interactive

Compare any models, on any benchmark.

Click models in the table to add them to the radar chart. Sort by any benchmark. Filter by license, capability, or lab.


Capability radar

4 models · 8 axes

[Radar chart. Axis labels: SWE-bench, Aider, LiveCode, HumanEval, MMLU-Pro, GPQA, ARC-AGI-2, MATH, OSWorld, τ-bench]

Selected

Claude Opus 4.7

Anthropic · 83.4 · 1M ctx

Still the model to beat for sustained agent reliability — and the price tag tells you Anthropic knows it.

GPT-5.4

OpenAI · 81.4 · 1M ctx

Best polyglot coder in the room — Aider 88 is not an accident. Reasoning mode is solid; computer-use is finally credible.

Gemini 3.1 Pro

Google · 82.2 · 2M ctx

GPQA 94, ARC-AGI 77, 2M context, $1.25 in. The reasoning + cost combo is genuinely uncomfortable for the rest of the field.

DeepSeek V4

DeepSeek · 71.5 · 256K ctx

Top open-weight on SWE-bench at $0.30 in. The closed-frontier price umbrella has a hole in it now — the question is how long Anthropic and OpenAI keep pretending.

Intelligence index

unweighted mean across 8 benchmarks

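| Model | Lab | License | Score | SWE-bench | Aider | LiveCode | HumanEval | MMLU-Pro | GPQA | ARC-AGI-2 | MATH | OSWorld | τ-bench | Ctx | $/MTok | Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 Pro | OpenAI | Closed | 83.7 | 82.1 | 89.5 | 86.5 | 96.0 | 90.4 | 93.6 | 61.2 | 97.4 | 76.4 | 78.9 | 1M | $52.50 | 2 |
| Claude Opus 4.7 | Anthropic | Closed | 83.4 | 87.6 | 84.0 | 82.1 | 96.0 | 87.1 | 89.3 | 68.8 | 95.6 | 76.2 | 78.4 | 1M | $10 | 8 |
| Gemini 3.1 Pro | Google | Closed | 82.2 | 78.8 | 81.6 | 79.3 | 94.5 | 89.8 | 94.1 | 77.1 | 96.8 | 68.4 | 71.2 | 2M | $2.19 | 38 |
| GPT-5.4 | OpenAI | Closed | 81.4 | 80.4 | 88.0 | 84.7 | 95.2 | 88.5 | 92.0 | 54.0 | 96.2 | 75.0 | 76.8 | 1M | $5.63 | 14 |
| GPT-5.3 Codex | OpenAI | Closed | 75.9 | 78.3 | 86.4 | 87.2 | 95.8 | 82.0 | 91.5 | 48.0 | 94.3 | 62.0 | 65.0 | 400K | $6 | 13 |
| Claude Sonnet 4.6 | Anthropic | Closed | 75.5 | 79.6 | 79.8 | 76.4 | 93.7 | 84.5 | 84.2 | 38.0 | 92.1 | 72.0 | 74.1 | 1M | $6 | 13 |
| Grok 4.20 | xAI | Closed | 75.1 | 79.0 | 79.6 | 78.2 | 92.3 | 84.0 | 87.5 | 55.4 | 94.1 | 58.2 | 62.8 | 256K | $7.50 | 10 |
| DeepSeek V4 | DeepSeek | Open weight | 71.5 | 81.0 | 78.4 | 82.4 | 95.0 | 86.2 | 84.0 | 31.0 | 96.5 | 54.7 | 60.0 | 256K | $0.50 | 143 |
| GLM-5 | Z.ai | Open weight | 69.5 | 77.8 | 74.6 | 84.9 | 93.4 | 83.5 | 80.4 | 24.1 | 94.0 | 58.0 | 63.2 | 200K | $0.70 | 99 |
| Kimi K2.5 | Moonshot | Open weight | 68.3 | 76.8 | 75.0 | 85.0 | 93.0 | 82.7 | 79.8 | 22.0 | 93.6 | 55.5 | 61.0 | 256K | $1.07 | 64 |
| Gemini 3 Flash | Google | Closed | 66.9 | 64.2 | 68.0 | 70.1 | 89.2 | 81.4 | 82.7 | 36.5 | 91.4 | 52.0 | 58.6 | 1M | $0.85 | 79 |
| Qwen3.5-397B | Alibaba | Apache 2.0 | 65.6 | 72.4 | 73.2 | 80.7 | 92.5 | 81.6 | 78.0 | 19.5 | 92.8 | 50.3 | 56.7 | 256K | $0.35 | 187 |
| Claude Haiku 4.5 | Anthropic | Closed | 64.8 | 73.3 | 65.0 | 64.3 | 89.0 | 78.1 | 75.8 | 22.4 | 86.5 | 56.8 | 60.2 | 200K | $2 | 32 |
| DeepSeek R1 | DeepSeek | Open weight | 63.7 | 67.8 | 71.3 | 76.8 | 92.0 | 84.0 | 81.2 | 18.6 | 97.3 | 41.0 | 48.4 | 128K | $0.96 | 66 |
| Mistral Large 3 | Mistral | Apache 2.0 | 55.6 | 58.0 | 60.4 | 65.2 | 87.0 | 78.2 | 73.0 | 12.3 | 86.5 | 34.0 | 42.5 | 256K | $0.88 | 64 |
| Llama 4 Maverick | Meta | Custom | 51.2 | 47.2 | 51.0 | 58.4 | 84.0 | 80.5 | 70.4 | 9.8 | 84.0 | 28.4 | 38.0 | 1M | $0.42 | 123 |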

Scores are 0–100. "—" means no published score.

$ whoami --opinion

What the data actually says.

The frontier got crowded.

A year ago, the top of every leaderboard was a one-horse race. Today, four labs ship credible frontier models — Anthropic, OpenAI, Google, and (when you squint at the open-weight tier) DeepSeek. The spread between them on most benchmarks is under 10 points. The differentiation moved from raw IQ to behavior under load — agent stamina, tool reliability, computer-use ceilings.

If you're choosing a model based on a single benchmark number, you're going to ship something brittle. Pick three benchmarks that match your workload and look at the cluster.
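One way to look at the cluster, sketched in Python. The scores are copied from the table for four of the sixteen models; the choice of SWE-bench, OSWorld, and τ-bench as the "workload" is only an example, and `rank_by_cluster` is a hypothetical helper, not anything this page runs.

```python
from statistics import mean

# Scores copied from the table (subset of models and columns).
SCORES = {
    "Claude Opus 4.7": {"SWE-bench": 87.6, "OSWorld": 76.2, "τ-bench": 78.4},
    "GPT-5.4 Pro":     {"SWE-bench": 82.1, "OSWorld": 76.4, "τ-bench": 78.9},
    "Gemini 3.1 Pro":  {"SWE-bench": 78.8, "OSWorld": 68.4, "τ-bench": 71.2},
    "DeepSeek V4":     {"SWE-bench": 81.0, "OSWorld": 54.7, "τ-bench": 60.0},
}


def rank_by_cluster(scores, benchmarks):
    """Rank models by their mean over the chosen benchmarks.

    Each row is (model, mean score, weakest score), so a model that is
    carried by a single strong benchmark is easy to spot.
    """
    rows = []
    for model, model_scores in scores.items():
        picked = [model_scores[b] for b in benchmarks]
        rows.append((model, round(mean(picked), 1), min(picked)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranking = rank_by_cluster(SCORES, ["SWE-bench", "OSWorld", "τ-bench"])
for model, avg, worst in ranking:
    print(f"{model:16} mean={avg:5.1f} worst={worst:5.1f}")
```

On SWE-bench alone, DeepSeek V4 outranks Gemini 3.1 Pro. Across this three-benchmark cluster it drops to last of the four, which is the kind of reordering a single number hides.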

Open weight is no longer a gap year.

DeepSeek V4 at 81% SWE-bench, GLM-5 and Kimi K2.5 at about 77%, all at roughly $1/MTok blended or less. The closed-frontier price umbrella has a hole in it now, and the labs charging $25/MTok output need a story for what you get for the 15× markup.

The honest answer for most teams: if your workload is predictable, the open weights are fine. If your workload is long-running, multi-tool, multi-turn — the closed frontier still wins on reliability, and reliability is the thing you're paying for.

Reasoning models are a separate animal.

DeepSeek R1 at 97.3% MATH-500 from a reasoning-tuned open-weight model is a strong reminder that test-time compute scales — but R1 is bad at agentic work, because it wasn't trained for it.

Treat the reasoning specialists (R1, GPT-5.4 Pro, GPT-5.3 Codex) as batch tools, not interactive ones. Send them the hard problems offline. Don't put them in a chat loop and act surprised when they think for 90 seconds.

The benchmarks are leaking. Watch the deltas.

OpenAI publicly stopped reporting SWE-bench Verified because every frontier model now shows training-set contamination. The same will happen to MMLU-Pro by Q3. The benchmarks worth trusting in 2026 are the ones too new or too expensive to game: ARC-AGI-2, OSWorld, τ-bench, LiveCodeBench.

If a model leads on the saturated benchmarks but trails on the hard ones, that's your tell.
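A rough way to put a number on that tell, sketched in Python. The grouping follows the paragraph above (SWE-bench and MMLU-Pro as saturated; ARC-AGI-2, OSWorld, τ-bench, and LiveCodeBench as hard), and the scores are copied from the table. The `saturation_gap` metric itself is an assumption made for illustration, not part of this page's methodology.

```python
from statistics import mean

# Benchmark groupings taken from the paragraph above.
SATURATED = ["SWE-bench", "MMLU-Pro"]
HARD = ["ARC-AGI-2", "OSWorld", "τ-bench", "LiveCode"]

# Scores copied from the table (subset of models).
SCORES = {
    "Claude Opus 4.7": {"SWE-bench": 87.6, "MMLU-Pro": 87.1, "ARC-AGI-2": 68.8,
                        "OSWorld": 76.2, "τ-bench": 78.4, "LiveCode": 82.1},
    "Gemini 3.1 Pro": {"SWE-bench": 78.8, "MMLU-Pro": 89.8, "ARC-AGI-2": 77.1,
                       "OSWorld": 68.4, "τ-bench": 71.2, "LiveCode": 79.3},
    "DeepSeek V4": {"SWE-bench": 81.0, "MMLU-Pro": 86.2, "ARC-AGI-2": 31.0,
                    "OSWorld": 54.7, "τ-bench": 60.0, "LiveCode": 82.4},
    "Llama 4 Maverick": {"SWE-bench": 47.2, "MMLU-Pro": 80.5, "ARC-AGI-2": 9.8,
                         "OSWorld": 28.4, "τ-bench": 38.0, "LiveCode": 58.4},
}


def saturation_gap(model_scores):
    """Mean of the saturated benchmarks minus mean of the hard ones.

    Hypothetical metric: a larger gap means the model looks better on
    the benchmarks the text calls saturated than on the hard ones.
    """
    saturated = mean(model_scores[b] for b in SATURATED)
    hard = mean(model_scores[b] for b in HARD)
    return saturated - hard


gaps = {model: saturation_gap(s) for model, s in SCORES.items()}
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:18} gap={gap:5.1f}")
```

With these four rows, DeepSeek V4 shows a gap of about 27 points against roughly 10 for Gemini 3.1 Pro. The absolute threshold is a judgment call; the comparison between models is the useful part.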

$ cat sources.txt

Methodology & sources.

Scores come from public leaderboards as of 2026-04-16. Where a model has no comparable published score on a benchmark, it's shown as "—" rather than estimated. The composite "intelligence index" is a simple unweighted mean of eight benchmarks: the ten in the table minus the two saturated coding ones. Value score is composite ÷ blended price (3:1 input:output mix).
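The arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and the $5.00 output price is an assumption picked so the blend lands on the table's $2.19; only the $1.25 input price and the benchmark scores come from this page. Leaving out LiveCode and HumanEval is also an inference: it is the exclusion that reproduces the published index values, not something stated here.

```python
from statistics import mean


def blended_price(input_price, output_price):
    """Blend $/MTok prices at a 3:1 input:output token mix."""
    return (3 * input_price + 1 * output_price) / 4


def intelligence_index(scores):
    """Unweighted mean of whatever benchmark scores are passed in."""
    return mean(scores.values())


def value_score(index, price):
    """Composite score per blended dollar."""
    return index / price


# Gemini 3.1 Pro scores from the table.
# Assumption: LiveCode and HumanEval are the two benchmarks excluded.
gemini = {
    "SWE-bench": 78.8, "Aider": 81.6, "MMLU-Pro": 89.8, "GPQA": 94.1,
    "ARC-AGI-2": 77.1, "MATH": 96.8, "OSWorld": 68.4, "τ-bench": 71.2,
}

index = intelligence_index(gemini)
price = blended_price(1.25, 5.00)  # 5.00 output price is hypothetical
print(round(index, 1), round(price, 2), round(value_score(index, price)))
```

This prints `82.2 2.19 38`, which matches the Gemini 3.1 Pro row. Rows whose listed price is already rounded (Mistral Large 3, Llama 4 Maverick) can come out one point off if the value score is recomputed from the rounded price.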

$ subscribe --weekly

This page updates. So does my Substack.

New benchmarks every week. New models every month. The big calls land in Nova's Signal.
