AI Intelligence per Dollar — Frontier Model Benchmark Efficiency (2020–2026)
Tracks how much AI capability a dollar buys across six years of frontier model releases. Combines MMLU, HumanEval, MATH, BIG-Bench Hard, and GPQA Diamond scores into a normalised Composite Capability Index, then divides by published API inference cost to produce an Intelligence-per-Dollar ratio. Covers GPT-3 (2020) through the 2025–2026 frontier cohort, including GPT-5, Claude Sonnet 4.6, and Gemini 2.5 Pro.
Data
| Period | Model | Composite Score | MMLU (5-shot) | HumanEval (pass@1) | MATH | Cost / 1M Output Tokens | Intelligence-per-Dollar (GPT-3 = 1×) |
|---|---|---|---|---|---|---|---|
| 2026-Q1 | Claude Opus 4.6 (Anthropic) | 95.0 | 90.5% | 95.0% | 91.5% | $75.00 | 3× |
| 2026-Q1 | Claude Sonnet 4.6 (Anthropic) | 95.5 | 89.7% | 92.0% | 97.8% | $15.00 | 15× |
| 2025-Q3 | GPT-5 (OpenAI) | 97.3 | 92.5% | 93.4% | 96.0% | $10.00 | 23× |
| 2025-Q2 | GPT-4.1 (OpenAI) | 91.3 | 90.2% | 94.5% | 82.1% | $8.00 | 27× |
| 2025-Q1 | o3 (OpenAI) | 100.0 | 92.9% | 97.8% | 97.8% | $10.00 | 22× |
| 2025-Q1 | Gemini 2.5 Pro (Google) | 95.8 | 89.8% | 95.0% | 95.0% | $10.00 | 23× |
| 2025-Q1 | GPT-4.5 (OpenAI) | 91.1 | 90.8% | 88.6% | 87.1% | $150.00 | 1× |
| 2025-Q1 | Claude 3.7 Sonnet (Anthropic) | 90.2 | 88.8% | 94.0% | 82.2% | $15.00 | 14× |
| 2024-Q3 | o1 (OpenAI) | 88.3 | 90.8% | 92.4% | 96.4% | $60.00 | 3× |
| 2024-Q2 | GPT-4o (OpenAI) | 88.6 | 88.7% | 90.2% | 76.6% | $15.00 | 14× |
| 2023-Q2 | GPT-4 (OpenAI) | 70.3 | 86.4% | 67.0% | 42.2% | $60.00 | 3× |
| 2022-Q1 | GPT-3.5-turbo (OpenAI) | 45.5 | 70.0% | 53.9% | 34.1% | $2.00 | 66× |
| 2020-Q2 | GPT-3 (OpenAI) | 0.0 | 43.9% | 14.0% | 4.0% | $60.00 | 1× |
About this Dataset
GPT-3.5-turbo in early 2022 delivered 66 times more benchmark performance per dollar of API spend than GPT-3 had in 2020. That came from two things happening at once: a 97% price cut and average benchmark scores that nearly doubled. No subsequent model has matched that efficiency ratio. Understanding why tells you more about how to buy AI than any spec sheet.
The Composite Capability Index (CCI) aggregates performance across five benchmarks: MMLU (world knowledge, 57 subjects), HumanEval (Python code generation, pass@1), Hendrycks MATH (competition mathematics), BIG-Bench Hard (23 compositional reasoning tasks), and GPQA Diamond (PhD-level biology, chemistry, and physics). Each score is linearly normalised against GPT-3’s 2020 performance as the floor and o3’s 2025 performance as the ceiling, then averaged across whichever benchmarks existed at each model’s release date. GPQA was published in 2023, so the 2020–2022 data points use only four benchmarks; those comparisons are noted accordingly.
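The normalisation and averaging step described above can be sketched in a few lines. The floor and ceiling anchors below are taken from the GPT-3 and o3 rows of the table for the three benchmarks it shows; BBH and GPQA anchors are omitted, so the result will not reproduce the five-benchmark composite column exactly.

```python
def normalise(score, floor, ceiling):
    """Linearly rescale a raw benchmark score so the GPT-3 (2020)
    result maps to 0 and the o3 (2025) result maps to 100."""
    return 100.0 * (score - floor) / (ceiling - floor)

# Floor (GPT-3) and ceiling (o3) per benchmark, from the table above.
FLOOR = {"MMLU": 43.9, "HumanEval": 14.0, "MATH": 4.0}
CEILING = {"MMLU": 92.9, "HumanEval": 97.8, "MATH": 97.8}

def composite(scores):
    """Average the normalised scores over whichever benchmarks are
    available for the model; benchmarks without anchors are skipped."""
    vals = [normalise(scores[b], FLOOR[b], CEILING[b])
            for b in scores if b in FLOOR]
    return sum(vals) / len(vals)

gpt4 = {"MMLU": 86.4, "HumanEval": 67.0, "MATH": 42.2}
print(round(composite(gpt4), 1))  # → 63.6 on this three-benchmark
                                  # subset; the table's five-benchmark
                                  # figure for GPT-4 is 70.3
```

The same function also handles the pre-2023 data points, where GPQA simply never appears in the score dictionary.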
In September 2024, o1 became the first commercial model to score above the documented PhD expert baseline on GPQA Diamond, reaching 77.3% against a human expert range of 65–69.7%. The benchmark was specifically designed to require genuine domain knowledge rather than pattern-matching.
The chart shows o1 (88.3) slightly below GPT-4o (88.6). That is not a data error. o1’s BBH score of 72.3 is well below GPT-4o’s 86.8 on the same 3-shot evaluation protocol, because o1’s extended chain-of-thought architecture does not suit that prompting format — even as it dominates on MATH (96.4%) and GPQA Diamond (77.3%). The composite recovers to 100.0 at o3 once BBH performance also peaks. For teams building capability assessments, the practical implication is: weight HumanEval and MMLU for coding and knowledge retrieval tasks; weight GPQA Diamond for scientific or analytical workloads.
The Intelligence-per-Dollar chart does not move in one direction, and that is the point. GPT-4 launched in 2023 at $60 per million output tokens — the same price as GPT-3 three years earlier — so despite a 70-point composite improvement, its efficiency ratio reached only 3×. GPT-4o's reprice to $15/M in 2024 (driven partly by competition from Claude 3 and Gemini 1.5) pushed the ratio back to 14×. o3 at $10/M reaches 22×. GPT-5 at the same $10/M with a composite of 97.3 reaches 23×. GPT-4.1 at $8/M is, at 27×, the most cost-efficient model of the post-GPT-4 era; only the far cheaper and far less capable GPT-3.5-turbo (66×) exceeds it. The outlier in the other direction is GPT-4.5: a research preview at $150/M output, its efficiency ratio falls to approximately 1× — equivalent to GPT-3. The pattern holds across every generation: the cost-efficiency peak sits one tier below the capability frontier, in models repriced after their successors launched.
The Intelligence-per-Dollar figures for reasoning models in this dataset carry a caveat the headline ratio conceals. Most frontier releases from o1 onward generate reasoning tokens — an internal chain-of-thought stream that precedes the visible response — typically billed at the same output token rate as the response itself. A medium-complexity analytical task tends to produce between 2,000 and 15,000 reasoning tokens in addition to a few hundred tokens of visible output; hard mathematical or scientific problems can regularly exceed 30,000. At o1's $60/M output rate, a task drawing 8,000 reasoning tokens plus 500 visible output tokens would cost around $0.51 — versus roughly $0.03 for the equivalent output on GPT-4 at the same nominal price. The effective cost-per-task multiplier across reasoning models is typically 5–20× what the published token price implies, though it varies considerably by query type. Lower headline prices in later models are real improvements, but where reasoning chain length scales with problem difficulty, effective task cost on demanding queries is likely to remain materially higher than the nominal efficiency ratio suggests.

This dataset calculates Intelligence-per-Dollar on the published $/1M output token rate only, using the same token-count basis as non-reasoning predecessors. Buyers evaluating reasoning models for production workloads should benchmark effective cost-per-task on their own query distribution rather than relying on the nominal ratio shown here.
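The arithmetic above generalises into a small cost model. A minimal sketch, using the illustrative token counts from the text and the published prices from the table; only output-token charges are counted, so input-side costs would add slightly more in practice.

```python
def task_cost(reasoning_tokens, visible_tokens, price_per_m_output):
    """Effective cost of one task when hidden reasoning tokens are
    billed at the same rate as visible output tokens."""
    total_tokens = reasoning_tokens + visible_tokens
    return total_tokens * price_per_m_output / 1_000_000

# o1 at $60/M: 8,000 hidden reasoning tokens + 500 visible tokens
o1_cost = task_cost(8_000, 500, 60.00)    # → $0.51

# GPT-4 at the same nominal $60/M emits no hidden reasoning stream
gpt4_cost = task_cost(0, 500, 60.00)      # → $0.03

print(f"o1: ${o1_cost:.2f}, GPT-4: ${gpt4_cost:.2f}, "
      f"effective multiplier: {o1_cost / gpt4_cost:.0f}x")
```

At these token counts the effective multiplier lands at 17×, inside the 5–20× range quoted above; a hard scientific query at 30,000+ reasoning tokens would push it well past that.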
Key methodological considerations for professional use of this dataset:
- Benchmark saturation: MMLU has an estimated true ceiling of ~93% due to labelling errors in the original dataset; o3 at 92.9% is functionally at that limit. HumanEval’s 164-problem set is now considered insufficient for distinguishing frontier models.
- MATH discontinuity: Models from o1 onward are evaluated on MATH-500, a curated 500-question subset; earlier models used the full 12,500-question set. This inflates the apparent improvement in mathematical reasoning for the most recent epoch.
- Evaluation protocol variance: Temperature, few-shot count, and chain-of-thought instructions differ across providers and evaluations. Scores are best-effort figures from technical reports and community leaderboards, not controlled comparative experiments.
- API price vs. compute efficiency: Published prices reflect competitive strategy as much as underlying compute costs. The Intelligence-per-Dollar ratio measures commercial procurement value, not algorithmic progress per training FLOP.
- Equal weighting is a choice: GPQA Diamond has 198 questions; MMLU has 14,079. Equal weighting treats them identically. Teams with specific application contexts should reweight accordingly.
- Reasoning token inflation: The Intelligence-per-Dollar metric uses the published output token price on the same basis for all models. Most reasoning models (o1 onward) emit an internal chain-of-thought before the visible response; those tokens are typically billed at the same rate and can add 5–20× to the effective token count per task on demanding queries. The stated efficiency ratios for reasoning models likely overstate their cost competitiveness relative to non-reasoning predecessors on workloads that trigger extended reasoning.
- Benchmarks do not measure general capability: The CCI does not capture reliability, calibration, factual grounding, agentic performance, or fit for any particular workflow.
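Per the equal-weighting caveat above, reweighting the composite for a specific application context is a one-function change. A minimal sketch; the benchmark scores and the coding-profile weights below are illustrative assumptions, not values from this dataset.

```python
def weighted_composite(scores, weights):
    """Weighted average of normalised benchmark scores (0-100).
    Benchmarks absent from `scores` are dropped and the remaining
    weights renormalised, mirroring the missing-benchmark handling
    for pre-GPQA models."""
    keys = [b for b in weights if b in scores]
    total_w = sum(weights[b] for b in keys)
    return sum(scores[b] * weights[b] for b in keys) / total_w

# Illustrative normalised scores for a hypothetical model
scores = {"MMLU": 90.0, "HumanEval": 95.0, "MATH": 88.0,
          "BBH": 85.0, "GPQA": 70.0}

# Equal weighting, as the CCI in this dataset uses
equal = {b: 1.0 for b in scores}

# A coding-workload profile emphasising HumanEval and MMLU, per the
# guidance in the o1/GPT-4o discussion (weights are a made-up choice)
coding = {"MMLU": 0.3, "HumanEval": 0.5, "MATH": 0.1,
          "BBH": 0.05, "GPQA": 0.05}

print(round(weighted_composite(scores, equal), 1))   # → 85.6
print(round(weighted_composite(scores, coding), 1))
```

Shifting weight toward HumanEval lifts this hypothetical model's composite by several points, which is exactly why the equal-weighted column should not drive an application-specific procurement decision on its own.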
The 2025–2026 model cohort — Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-5, Claude Sonnet 4.6, Claude Opus 4.6 — clusters tightly between composite scores of 90 and 97. On the benchmarks tracked here, the gap between providers has largely closed. The remaining differentiation lies in dimensions the equal-weighted composite dilutes or omits entirely: GPQA Diamond scientific reasoning, where averaging masks a standout result (Claude Opus 4.6 at 91.3%), coding agent tasks (SWE-bench Verified), and structured tool use. For enterprise qualification decisions, those are now the benchmarks that matter.