17.6 Commercial Model Benchmarks
The previous section asked how a team decides whether a model is safe enough and useful enough to release. Now we look at a more public question: when OpenAI, Google, Anthropic, Mistral, Kimi, DeepSeek, Alibaba, MiniMax, and other labs release new models every few weeks, how should we read the benchmark numbers without getting lost?
Think of the model landscape like a city map, not a medal table. One road shows raw reasoning ability. Another shows coding-agent reliability. Another shows price per token. Another shows speed. The “best” model depends on which road your product actually travels.
The interactive map below uses public model information available around May 8, 2026. The default view shows the latest Artificial Analysis Intelligence Index values where they are available, and the release timeline starts in early 2024 so the acceleration is visible. Change the benchmark and x-axis to see the same models from different angles: performance over time, performance per token price, performance versus speed, or performance versus context window.
Commercial Model Benchmark Map (interactive). Choose a benchmark, then change the x-axis to see whether the same model still looks strong by time, token cost, speed, or context budget. The release timeline panel shows major public model releases from early 2024 onward; the pattern to watch is not one winner, but the shrinking gap between releases and the repeated shift from static QA to reasoning, agents, long context, and cost efficiency. Snapshot as of May 8, 2026; values combine cited lab cards and release notes with Artificial Analysis and Vals AI public benchmark pages.
1. Metrics And Model Families In The Map
This map is not meant to crown a permanent winner. It is a snapshot for seeing how public model releases are accelerating and which capability axes keep moving. A blank value means that the model does not have a public number from a compatible harness. The table avoids borrowing numbers from unrelated benchmark variants.
| Field | What it measures | How to read it |
|---|---|---|
| AA Index | An Artificial Analysis aggregate intelligence signal across multiple reasoning/evaluation tasks | A useful public aggregate, not a substitute for your private workflow eval. |
| Vals Index | Industry-oriented work across law, finance, healthcare, education, coding, and multimodal tasks | Closer to knowledge work, but still not identical to any one company’s workflow. |
| SWE-Bench | Coding-agent performance on real GitHub issues and repository edits | Variant choice matters: Verified, Pro, Multilingual, vendor harnesses, and scaffolds are not interchangeable. |
| Terminal-Bench | Agent performance in shell environments requiring command execution and recovery | Strongly stresses command-line workflow and tool coordination. |
| GPQA Diamond | Hard expert-level scientific reasoning | A reasoning signal, but not a direct measure of agentic execution or tool use. |
| Humanity’s Last Exam | Difficult frontier questions across many expert domains | A fast-moving frontier signal. The trend matters more than any single point. |
| Output speed | Output token generation rate, typically reported in tokens per second | In interactive products, speed can matter as much as raw score. |
| Token price | A simple blended proxy for input/output token cost | Real cost depends on prompt length, reasoning tokens, retries, and tool calls; see the cost sketch after this table. |
| Context window | Token budget the model can process in one request | Long context does not guarantee quality; retrieval, attention stability, and lost-in-the-middle failures still matter. |
| Release date | Public release or model-card timing | The most important axis for this section: it shows how compressed the improvement cycle has become. |
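To make the token-price caveat concrete, here is a minimal cost sketch in Python. The prices, token counts, and the single `expected_attempts` multiplier are all hypothetical; the point is that reasoning tokens and retries, not the listed per-token price, often dominate the real bill.

```python
def estimate_request_cost(
    input_tokens: int,
    output_tokens: int,
    reasoning_tokens: int,
    price_in_per_m: float,           # USD per 1M input tokens (hypothetical)
    price_out_per_m: float,          # USD per 1M output tokens (hypothetical)
    expected_attempts: float = 1.0,  # retries and tool-call round trips inflate this
) -> float:
    """Rough per-request cost; reasoning tokens are usually billed as output."""
    one_attempt = (
        input_tokens * price_in_per_m / 1_000_000
        + (output_tokens + reasoning_tokens) * price_out_per_m / 1_000_000
    )
    return one_attempt * expected_attempts

# Example with made-up numbers: a "cheap" model that thinks a lot can cost
# more per request than a pricier model that answers directly.
verbose_thinker = estimate_request_cost(3_000, 800, 6_000, 0.50, 2.00)
direct_answerer = estimate_request_cost(3_000, 800, 0, 2.00, 8.00)
print(f"{verbose_thinker:.4f} vs {direct_answerer:.4f} USD per request")
```

With these invented numbers, the nominally cheaper model ends up costing more per request because it emits thousands of reasoning tokens before answering.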
The major model families also represent different product theses.
| Model family | Main signal |
|---|---|
| OpenAI GPT / Codex | Combines reasoning, coding agents, computer use, and a broad tool ecosystem into one product system. |
| Google Gemini | Pushes long context, native multimodality, fast serving, and broad reasoning. |
| Anthropic Claude | Has a strong identity around coding, long-running work, instruction following, safety posture, and enterprise workflows. |
| DeepSeek / Kimi / Qwen / MiniMax | Pressures frontier economics through MoE efficiency, open-weight releases, reasoning, long context, and lower-cost inference. |
| Mistral | Expands open-weight, enterprise, multilingual, and European/regional deployment options. |
So the point is not to stare at one score. Watch how release date, benchmark score, price, speed, and context move together. The same model can mean something very different when the x-axis changes.
2. Why One Leaderboard Is Not Enough
During the GPT-4 period, a simple story still worked reasonably well: a stronger model usually produced higher scores on broad academic and professional exams. GPT-4’s release emphasized that kind of progress: stronger reasoning, better exam performance, and early multimodal capability [1].
That story is now too small. Recent frontier models are not only chat models with better next-token prediction. They are product systems:
- routers that choose between fast and deeper reasoning modes
- models with adjustable thinking budgets
- tool-using agents that browse, write code, edit files, and operate computers
- multimodal systems that parse screenshots, documents, audio, video, and long repositories
- safety layers that may change behavior by domain
OpenAI describes GPT-5 as a unified system with fast responses, deeper reasoning, and routing components rather than a single flat model endpoint [2]. Google describes Gemini 2.5 and Gemini 3.1 Pro as reasoning-oriented multimodal models with long-context behavior [6] [7]. Anthropic’s recent Claude releases emphasize long-running software work, computer use, prompt-injection resistance, and controlled access for more sensitive cybersecurity capabilities [8] [9].
This changes the meaning of a benchmark number. A score now depends on at least five choices:
| Choice | Why it matters |
|---|---|
| Model variant | "Pro", "Thinking", "Instant", "max effort", and coding-specialized variants can be different operating points. |
| Harness | A browser-use benchmark with screenshots, DOM access, or shell access measures different skills. |
| Tool allowance | Search, Python, code execution, and file-system access can dominate the score. |
| Reasoning budget | Higher effort may improve accuracy while increasing latency and cost. |
| Grading method | Exact match, unit tests, LLM judge, pairwise preference, and human review reward different behaviors. |
So the useful question is not “Which model is best?” A better question is: best under which harness, with which tools, at what latency and cost budget?
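One way to keep those five choices from silently diverging is to record them alongside every score. The sketch below is a minimal, hedged example; the field names are illustrative rather than a standard schema, and real harness metadata usually carries more detail (run dates, sampling settings, scaffold versions).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkRun:
    """Metadata that must match before two scores are comparable (illustrative fields)."""
    model_variant: str        # e.g. base vs "thinking" vs coding-specialized
    benchmark: str            # e.g. "SWE-Bench Verified" vs "SWE-Bench Pro"
    harness: str              # scaffold / agent framework used to run the tasks
    tools: tuple              # allowed tools: search, code execution, shell, ...
    reasoning_budget: str     # effort level or token budget for test-time reasoning
    grader: str               # unit tests, exact match, LLM judge, human review
    score: float

def comparable(a: BenchmarkRun, b: BenchmarkRun) -> bool:
    """Scores are apples-to-apples only when every choice besides the model matches."""
    return (
        a.benchmark == b.benchmark
        and a.harness == b.harness
        and a.tools == b.tools
        and a.reasoning_budget == b.reasoning_budget
        and a.grader == b.grader
    )
```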
3. The Three Public Lenses
Lab model cards and release notes
Model cards and release notes tell us what the model creator thinks matters: context length, modalities, intended use, safety posture, and benchmark methodology. They are the first place to check, but they are not neutral scoreboards. Labs naturally highlight the evaluations that best explain the model they are shipping.
For example, OpenAI’s GPT-5.3-Codex release focused on SWE-Bench Pro, Terminal-Bench, OSWorld, GDPval, cybersecurity CTFs, and SWE-Lancer because the product was a coding and computer-use agent [3]. GPT-5.4 and GPT-5.5 then pushed the same family of metrics further into general professional work and native computer use [4] [5].
Google’s Gemini 3.1 Pro model card highlights HLE, ARC-AGI-2, GPQA Diamond, SWE-Bench, LiveCodeBench, BrowseComp, MMMU-Pro, and long-context tests [7]. Anthropic’s Opus 4.7 announcement emphasizes long-running software engineering, third-party GDPval-AA, document reasoning, prompt-injection resistance, and migration effects such as tokenizer and effort-level changes [8].
Lab cards answer: What did the creator optimize for, and what deployment constraints come with it?
Independent benchmark dashboards
Artificial Analysis, Vals AI, and LMArena are useful because they compare models from many providers under a more consistent public process.
Artificial Analysis reports an Intelligence Index that combines multiple evaluations such as GDPval-AA, Terminal-Bench Hard, GPQA Diamond, Humanity’s Last Exam, SciCode, instruction-following tests, and other reasoning tasks [10]. Vals AI focuses on real-world industry tasks across law, finance, healthcare, education, coding, and multimodal work [11]. LMArena collects blind pairwise votes across text, code, vision, image, search, and video arenas [12]; the underlying Chatbot Arena method is based on large-scale human preference comparisons [13].
Independent dashboards answer: How does a model compare under a shared public harness today?
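As a rough intuition for what an aggregate index does, the toy function below simply takes a weighted mean of per-benchmark scores that are already on a common scale. It is explicitly not Artificial Analysis's or any other provider's published methodology; real indices choose their own benchmark mix, normalization, and weights.

```python
def aggregate_index(scores: dict[str, float],
                    weights: dict[str, float] | None = None) -> float:
    """Toy aggregate: weighted mean of per-benchmark scores on a 0-100 scale.

    NOT any provider's published formula; purely illustrative.
    """
    weights = weights or {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical scores, equal weights.
example = aggregate_index({"GPQA Diamond": 82.0, "Terminal-Bench Hard": 47.0, "HLE": 26.5})
```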
Internal product evaluations
The most important benchmark for a company is usually private. It comes from real failures: support conversations, code review incidents, spreadsheet tasks, RAG misses, medical triage escalations, prompt-injection cases, tool-call errors, or customer workflows that are too specific to appear in a public benchmark.
Public benchmarks tell you whether a model deserves a trial. Private evals tell you whether it should replace the model already serving users.
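A private release gate can be as simple as the hedged sketch below: the candidate must not regress on any critical case the incumbent passes, and must not lower the overall pass rate. The data format and rules are illustrative; real gates usually add cost, latency, and safety checks.

```python
def release_gate(candidate: dict, incumbent: dict, critical_cases: set[str]) -> bool:
    """Decide whether a candidate model may replace the incumbent.

    `candidate` and `incumbent` map private-eval case IDs to pass/fail booleans.
    The two rules here are illustrative, not a recommended standard.
    """
    # Hard gate: every critical case the incumbent passes must still pass.
    for case in critical_cases:
        if incumbent.get(case, False) and not candidate.get(case, False):
            return False
    # Soft gate: overall pass rate must not regress.
    cand_rate = sum(candidate.values()) / max(len(candidate), 1)
    inc_rate = sum(incumbent.values()) / max(len(incumbent), 1)
    return cand_rate >= inc_rate
```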
4. What Changed In The Last Three Years
From static knowledge to reasoning and agents
In 2023, MMLU, GSM8K, HumanEval, and professional exams were strong headline signals. They still matter, but they are now closer to unit tests for baseline competence than final release criteria.
The frontier moved toward:
- scientific reasoning: GPQA Diamond, HLE, SciCode, CritPt
- coding agents: SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench, LiveCodeBench
- computer use: OSWorld, WebArena-style browser tasks, MCP workflow tasks
- knowledge work: GDPval, FinanceAgent, OfficeQA, Vals Index
- preference and product fit: LMArena, WebDev Arena, domain-specific human review
The important change is that many valuable tasks are now loops rather than one-shot answers: the model reads the task, plans, calls tools, observes the results, recovers from errors, and verifies its work before responding.
That loop is harder to score than a multiple-choice answer. It also reveals problems that old benchmarks hide: tool-call reliability, latency, state management, recovery from bad intermediate steps, and safety behavior under hidden instructions.
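As an illustration of why loops are harder to score, the sketch below records what a single agentic episode produces beyond a final answer. The `agent` and `env` objects and their methods are placeholders for whatever framework you use; the point is which signals get measured, not the specific API.

```python
def run_agent_episode(agent, env, max_steps: int = 50) -> dict:
    """Score one agentic task as a loop rather than a single answer.

    `agent` and `env` are assumed interfaces, not a real library API.
    """
    observation = env.reset()
    tool_errors, steps = 0, 0
    for steps in range(1, max_steps + 1):
        action = agent.act(observation)         # plan and choose a tool call
        observation, error = env.step(action)   # execute the call, observe the result
        tool_errors += int(error)                # recovery behavior matters
        if env.task_finished():
            break
    return {
        "solved": env.verify(),       # e.g. hidden unit tests, not string match
        "steps": steps,               # long-horizon efficiency
        "tool_errors": tool_errors,   # reliability signal old benchmarks hide
    }
```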
Faster releases, smaller visible gaps
Commercial release cadence has accelerated. In early 2024, Gemini 1.5 Pro made 1M-token long context a visible frontier feature [15], Claude 3 formalized the Opus/Sonnet/Haiku ladder [16], GPT-4o pushed low-latency multimodal interaction into the mainstream [17], Claude 3.5 Sonnet moved high capability into a cheaper mid-tier model [18], and OpenAI o1 made test-time reasoning a product category [19]. Mistral Large 2, DeepSeek-V3, DeepSeek-R1, Qwen3, MiniMax-M1, and Kimi K2 then showed that the challenger lane was no longer just “smaller open models”; it was competing on MoE efficiency, reasoning, long context, and agentic coding [20] [21] [22] [23] [24] [25].
By 2026, OpenAI’s public sequence from GPT-5.3-Codex to GPT-5.4 to GPT-5.5 shows updates only weeks or months apart, each with a slightly different emphasis: coding-agent specialization, general professional work, computer use, and long-horizon task completion [3] [4] [5]. Google’s Gemini 2.5 to Gemini 3 to Gemini 3.1 Pro sequence shows a similar pattern: reasoning behavior and multimodality have become the central product axis rather than optional features [6] [7]. Anthropic’s Claude line now splits not only by Opus/Sonnet/Haiku economics, but also by effort level, computer-use reliability, safety posture, and gated access for unusually sensitive capabilities [8] [9].
The visible gap between top models on old static tasks is often small. The real separation appears on harder harnesses:
- long-horizon coding tasks where the agent must inspect a repo and run tests
- professional tasks where output quality is judged by experts
- browser or desktop tasks where the model must recover from UI friction
- adversarial safety settings where a model must ignore hidden instructions
- cost-adjusted evaluations where a slightly weaker but cheaper model wins in production
The challenger effect
China-based and open-weight labs changed the economics of evaluation. Kimi, DeepSeek, Alibaba’s Qwen family, MiniMax, Z.ai, and related models often appear near the frontier on independent dashboards, especially when cost is included. Mistral occupies a different strategic position: European, enterprise-oriented, and focused on open-weight and customizable deployments [14].
This matters because model selection is no longer a pure quality ranking. A model that is a few points lower on an aggregate index but much cheaper, faster, open-weight, or deployable in a preferred jurisdiction may be the correct engineering choice.
5. What Commercial Models Teach Us
The main lesson is not that one provider stays permanently ahead. The lesson is that the unit of progress has changed. In 2024, major releases were separated by months and were easy to describe as model upgrades: more context, better vision, stronger coding, lower price. By 2025 and 2026, the frontier started moving as a sequence of product systems: thinking modes, tool use, browser and desktop operation, coding agents, long-horizon workflow completion, and risk-gated access.
That acceleration has three practical consequences.
First, benchmark leadership is becoming short-lived. A model can be first on a public aggregate score and still lose a few weeks later on a different harness, a cheaper challenger, or a specialized coding agent. This is why the release timeline matters: it shows the slope of change, not just the current winner.
Second, improvement is increasingly architectural and operational, not only statistical. The important deltas are often MoE routing efficiency, long-context stability, inference-time reasoning, tool reliability, token efficiency, and safety gating. Those details affect production behavior more than a one-point difference on a saturated exam.
Third, the frontier is spreading sideways. Closed proprietary models still set many top-line scores, but Chinese labs and European open-weight providers have made cost, deployment control, multilingual support, and self-hosting part of the evaluation conversation. The result is a faster, wider Pareto surface: model selection is now a moving engineering trade-off, not a static ranking exercise.
For actual model selection, a compact metric vector is more useful than one rank: something like (task quality on your workload, token cost, latency, context behavior, critical-failure rate, deployability).
The chosen model is rarely the argmax of one quality score. It is usually the model that maximizes utility under constraints:

$$
m^{*} = \arg\max_{m} \; U(m)
\quad \text{subject to} \quad
C(m) \le C_{\max}, \;\;
L(m) \le L_{\max}, \;\;
R_{\text{crit}}(m) = 0,
$$

where $U(m)$ is task utility, $C_{\max}$ is the cost budget, $L_{\max}$ is the latency budget, and $R_{\text{crit}}(m)$ is the count or probability of unacceptable failures. Capability asks "can it solve the task?" Deployability asks whether it still works under your latency, cost, data-residency, safety-policy, and private-eval constraints. Keeping those separate prevents public leaderboard excitement from turning into a production mistake.
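Translated into code, that constrained selection might look like the hedged sketch below. The metric-vector fields and thresholds are illustrative; the important part is that the safety term is a hard filter, not another number folded into the average.

```python
def select_model(models: list[dict], cost_budget: float, latency_budget: float) -> dict | None:
    """Pick the highest-utility model that satisfies hard constraints.

    Each entry is a metric vector with illustrative field names, e.g.
    {"name": ..., "utility": ..., "cost": ..., "latency": ..., "critical_failures": ...}.
    """
    feasible = [
        m for m in models
        if m["cost"] <= cost_budget
        and m["latency"] <= latency_budget
        and m["critical_failures"] == 0   # hard safety gate, never averaged away
    ]
    return max(feasible, key=lambda m: m["utility"], default=None)
```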
6. Practical Takeaway
The frontier is no longer a single leaderboard race. It is a moving Pareto surface of intelligence, tool use, latency, token price, safety posture, openness, and product fit. Since early 2024, the release cycle has visibly compressed while improvement has shifted from static academic exams toward dynamic, agentic, and domain-specific work. Good teams read public scores as useful but incomplete signals and maintain private evals as release gates.
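To see the Pareto-surface idea rather than a single constrained pick, a minimal non-dominance filter looks like the sketch below; the axis names are illustrative and would come from your own metric vector.

```python
def pareto_front(models: list[dict],
                 higher_better=("quality",),
                 lower_better=("cost", "latency")) -> list[dict]:
    """Keep only models that no other model dominates on every tracked axis."""
    def dominates(a: dict, b: dict) -> bool:
        at_least_as_good = (
            all(a[k] >= b[k] for k in higher_better)
            and all(a[k] <= b[k] for k in lower_better)
        )
        strictly_better = (
            any(a[k] > b[k] for k in higher_better)
            or any(a[k] < b[k] for k in lower_better)
        )
        return at_least_as_good and strictly_better

    return [m for m in models if not any(dominates(other, m) for other in models)]
```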
The next section can now return to the larger question: if benchmarks keep moving, how do we build evaluation systems that improve faster than models can overfit them?
Quizzes
Quiz 1: Why is it misleading to compare a GPT-5.5 Terminal-Bench result directly with a Gemini 3.1 Pro SWE-Bench result?
They measure different task distributions and execution loops. Terminal-Bench emphasizes command-line workflows, tool coordination, and environment recovery, while SWE-Bench variants emphasize repository issue resolution under a particular scaffold. The model, harness, tool allowance, reasoning budget, and grader all shape the score.
Quiz 2: A cheaper open-weight model is 4 points below the top proprietary model on an aggregate benchmark. When might it still be the better production choice?
It can be better when latency, token cost, data residency, self-hosting, customization, auditability, or fallback capacity dominate the utility function. A small quality gap on a public benchmark may be irrelevant if the open-weight model passes the private workflow eval and satisfies operational constraints the proprietary model cannot.
Quiz 3: Why did public evaluation shift from MMLU-style static exams toward agentic benchmarks such as SWE-Bench Pro, Terminal-Bench, OSWorld, and GDPval?
Static exams became saturated and are vulnerable to contamination. High-value commercial workloads require multi-step planning, tool use, state tracking, verification, and recovery from errors. Agentic benchmarks expose these behaviors, along with cost and latency, better than one-shot question answering.
Quiz 4: What is the main risk of copying numbers from model cards into a comparison table without methodology notes?
The table can imply apples-to-apples comparison where none exists. Labs may use different prompts, tools, effort levels, sampling settings, dates, and graders. Without methodology notes, readers may interpret a benchmark delta as intrinsic model quality rather than harness-specific measurement.
Quiz 5: In the utility formula subject to cost, latency, and critical-risk constraints, why is a hard safety constraint often better than folding safety into the average utility score?
Some failures are unacceptable even if the average quality is high. Folding critical safety into an average can hide rare severe regressions behind many benign wins. A hard constraint keeps release decisions aligned with operational risk: a model can be generally better and still be blocked.
References
- [1] OpenAI. (2023). GPT-4. OpenAI Research.
- [2] OpenAI. (2025). GPT-5 System Card. OpenAI.
- [3] OpenAI. (2026). Introducing GPT-5.3-Codex. OpenAI.
- [4] OpenAI. (2026). Introducing GPT-5.4. OpenAI.
- [5] OpenAI. (2026). Introducing GPT-5.5; GPT-5.5 System Card. OpenAI.
- [6] Google DeepMind. (2025). Gemini 2.5: Our newest Gemini model with thinking. Google Blog.
- [7] Google DeepMind. (2026). Gemini 3.1 Pro Model Card. Google DeepMind.
- [8] Anthropic. (2026). Introducing Claude Opus 4.7. Anthropic.
- [9] Anthropic. (2026). Project Glasswing. Anthropic.
- [10] Artificial Analysis. (2026). AI Model Evaluations and Intelligence Index. Artificial Analysis.
- [11] Vals AI. (2026). Benchmarks. Vals AI.
- [12] LMArena. (2026). Leaderboard. LMArena.
- [13] Chiang, W.-L., et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
- [14] Mistral AI. (2026). Models Overview. Mistral Docs.
- [15] Google. (2024). Our next-generation model: Gemini 1.5. Google Blog.
- [16] Anthropic. (2024). Introducing the next generation of Claude. Anthropic.
- [17] OpenAI. (2024). Hello GPT-4o. OpenAI.
- [18] Anthropic. (2024). Claude 3.5 Sonnet. Anthropic.
- [19] OpenAI. (2024). Introducing OpenAI o1-preview. OpenAI.
- [20] Mistral AI. (2024). Large Enough: Mistral Large 2. Mistral AI.
- [21] DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437.
- [22] DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
- [23] Qwen Team. (2025). Qwen3: Think Deeper, Act Faster. Qwen Blog.
- [24] MiniMax. (2025). MiniMax-M1, the World's First Open-Source, Large-Scale, Hybrid-Attention Reasoning Model. MiniMax.
- [25] Moonshot AI. (2025). Kimi K2. Kimi.