Peak Intelligence-per-Dollar (GPT-3.5-turbo, 2022)
66×
capability per $1M output vs. GPT-3 2020 baseline
97% price cut + near-doubled benchmark scores in two years
Best Current Intelligence-per-Dollar (GPT-4.1, 2025-Q2)
27×
capability per $1M output; $8/M output price
Best cost-efficiency among 2025–2026 frontier models
Composite Capability Index — o3 (2025-Q1)
100.0
0–100 normalised index; GPT-3 2020 = 0
+100 points since GPT-3 (2020)
Frontier API Cost per 1M Output Tokens (2025–2026)
~$10
GPT-5, o3, Gemini 2.5 Pro (vs. $60 for GPT-3 in 2020)
−83% real cost reduction since 2020

Data

| Period  | Model                         | Composite Score | MMLU (5-shot) | HumanEval (pass@1) | MATH  | Cost / 1M Output Tokens | Intelligence-per-Dollar |
|---------|-------------------------------|-----------------|---------------|--------------------|-------|-------------------------|-------------------------|
| 2026-Q1 | Claude Opus 4.6 (Anthropic)   | 95.0            | 90.5%         | 95.0%              | 91.5% | $75.00                  | +3×                     |
| 2026-Q1 | Claude Sonnet 4.6 (Anthropic) | 95.5            | 89.7%         | 92.0%              | 97.8% | $15.00                  | +15×                    |
| 2025-Q3 | GPT-5 (OpenAI)                | 97.3            | 92.5%         | 93.4%              | 96.0% | $10.00                  | +23×                    |
| 2025-Q2 | GPT-4.1 (OpenAI)              | 91.3            | 90.2%         | 94.5%              | 82.1% | $8.00                   | +27×                    |
| 2025-Q1 | o3 (OpenAI)                   | 100.0           | 92.9%         | 97.8%              | 97.8% | $10.00                  | +22×                    |
| 2025-Q1 | Gemini 2.5 Pro (Google)       | 95.8            | 89.8%         | 95.0%              | 95.0% | $10.00                  | +23×                    |
| 2025-Q1 | GPT-4.5 (OpenAI)              | 91.1            | 90.8%         | 88.6%              | 87.1% | $150.00                 | +1×                     |
| 2025-Q1 | Claude 3.7 Sonnet (Anthropic) | 90.2            | 88.8%         | 94.0%              | 82.2% | $15.00                  | +14×                    |
| 2024-Q3 | o1 (OpenAI)                   | 88.3            | 90.8%         | 92.4%              | 96.4% | $60.00                  | +3×                     |
| 2024-Q2 | GPT-4o (OpenAI)               | 88.6            | 88.7%         | 90.2%              | 76.6% | $15.00                  | +14×                    |
| 2023-Q2 | GPT-4 (OpenAI)                | 70.3            | 86.4%         | 67.0%              | 42.2% | $60.00                  | +3×                     |
| 2022-Q1 | GPT-3.5-turbo (OpenAI)        | 45.5            | 70.0%         | 53.9%              | 34.1% | $2.00                   | +66×                    |
| 2020-Q2 | GPT-3 (OpenAI)                | 0.0             | 43.9%         | 14.0%              | 4.0%  | $60.00                  | +1×                     |

About this Dataset

GPT-3.5-turbo in early 2022 delivered 66 times more benchmark performance per dollar of API spend than GPT-3 had in 2020. That came from two things happening at once: a 97% price cut and average benchmark scores that nearly doubled. No subsequent model has matched that efficiency ratio. Understanding why tells you more about how to buy AI than any spec sheet.

The Composite Capability Index (CCI) aggregates performance across five benchmarks: MMLU (world knowledge, 57 subjects), HumanEval (Python code generation, pass@1), Hendrycks MATH (competition mathematics), BIG-Bench Hard (23 compositional reasoning tasks), and GPQA Diamond (PhD-level biology, chemistry, and physics). Each score is linearly normalised against GPT-3’s 2020 performance as the floor and o3’s 2025 performance as the ceiling, then averaged across whichever benchmarks existed at each model’s release date. GPQA was published in 2023, so the 2020–2022 data points use only four benchmarks; those comparisons are noted accordingly.
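The normalisation step can be sketched as follows. This is an illustrative reconstruction, not the dataset's own code: it uses only the three benchmarks tabulated above (MMLU, HumanEval, MATH), so the resulting composite for GPT-4 (about 63.6) differs from the published 70.3, which also averages in BBH and GPQA.

```python
# Sketch of the linear floor/ceiling normalisation described above.

def normalise(score: float, floor: float, ceiling: float) -> float:
    """Linearly map a raw benchmark score to 0-100 between floor and ceiling."""
    return 100.0 * (score - floor) / (ceiling - floor)

def composite(scores: dict[str, float],
              floors: dict[str, float],
              ceilings: dict[str, float]) -> float:
    """Average normalised scores over whichever benchmarks are available."""
    norms = [normalise(scores[b], floors[b], ceilings[b]) for b in scores]
    return sum(norms) / len(norms)

# GPT-3 (2020) floor and o3 (2025-Q1) ceiling for the three benchmarks
# shown in the table; BBH and GPQA are omitted here.
floors   = {"MMLU": 43.9, "HumanEval": 14.0, "MATH": 4.0}
ceilings = {"MMLU": 92.9, "HumanEval": 97.8, "MATH": 97.8}

gpt4 = {"MMLU": 86.4, "HumanEval": 67.0, "MATH": 42.2}
print(round(composite(gpt4, floors, ceilings), 1))  # ~63.6 on 3 benchmarks
```

The "averaged across whichever benchmarks existed" rule falls out naturally here: a 2020–2022 model simply passes a smaller `scores` dict.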

In September 2024, o1 became the first commercial model to score above the documented PhD expert baseline on GPQA Diamond, reaching 77.3% against a human expert accuracy of roughly 65–74%. The benchmark was specifically designed to require genuine domain knowledge rather than pattern-matching.

The chart shows o1 (88.3) slightly below GPT-4o (88.6). That is not a data error. o1’s BBH score of 72.3 is well below GPT-4o’s 86.8 on the same 3-shot evaluation protocol, because o1’s extended chain-of-thought architecture does not suit that prompting format — even as it dominates on MATH (96.4%) and GPQA Diamond (77.3%). The composite recovers to 100.0 at o3 once BBH performance also peaks. For teams building capability assessments, the practical implication is: weight HumanEval and MMLU for coding and knowledge retrieval tasks; weight GPQA Diamond for scientific or analytical workloads.

The Intelligence-per-Dollar chart does not move in one direction, and that is the point. GPT-4 launched in 2023 at $60 per million output tokens — the same price as GPT-3 three years earlier — despite a 70-point composite improvement. That produced only a 3× efficiency ratio, barely above the GPT-3 baseline. GPT-4o's reprice to $15/M in 2024 (driven partly by competition from Claude 3 and Gemini 1.5) pushed the ratio back to 14×. At $10/M, o3 reaches 22×; GPT-5, at the same $10/M with a composite of 97.3, reaches 23×. GPT-4.1 at $8/M is the most cost-efficient model in this dataset at 27×. The outlier in the other direction is GPT-4.5: a research preview priced at $150/M output, its efficiency ratio falls to approximately 1× — equivalent to GPT-3. The pattern holds across every generation: the cost-efficiency peak sits one tier below the capability frontier, in models repriced after their successors launched.
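The ratio itself is simple to reconstruct. The sketch below uses only the three benchmark columns from the table, so the computed ratios drift from the published figures (which also include BBH and, from 2023, GPQA Diamond), but the shape of the curve is the same: a 2022 peak, a 2023 collapse, then gradual recovery.

```python
# Intelligence-per-Dollar: raw average benchmark score divided by the
# published $/1M output token price, normalised so GPT-3 (2020) = 1x.
# Scores and prices are taken from the table above (3 benchmarks only).

def intelligence_per_dollar(scores: list[float], price_per_m: float) -> float:
    return (sum(scores) / len(scores)) / price_per_m

baseline = intelligence_per_dollar([43.9, 14.0, 4.0], 60.0)  # GPT-3, 2020

for name, scores, price in [
    ("GPT-3.5-turbo", [70.0, 53.9, 34.1],  2.0),
    ("GPT-4",         [86.4, 67.0, 42.2], 60.0),
    ("GPT-4.1",       [90.2, 94.5, 82.1],  8.0),
]:
    ratio = intelligence_per_dollar(scores, price) / baseline
    print(f"{name}: {ratio:.0f}x")
```

On this reduced basis GPT-4 still lands at roughly 3×, reproducing the collapse the text describes; the 2022 peak and GPT-4.1 figures come out somewhat higher than the published 66× and 27× because BBH is omitted.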

The Intelligence-per-Dollar figures for reasoning models in this dataset carry a caveat the headline ratio conceals. Most frontier releases from o1 onward generate reasoning tokens — an internal chain-of-thought stream that precedes the visible response — typically billed at the same output token rate as the response itself. A medium-complexity analytical task tends to produce between 2,000 and 15,000 reasoning tokens in addition to a few hundred tokens of visible output; hard mathematical or scientific problems can regularly exceed 30,000. At o1's $60/M output rate, a task drawing 8,000 reasoning tokens plus 500 visible output tokens would cost around $0.51 — versus roughly $0.03 for the same 500 tokens of visible output on GPT-4 at the same nominal price. The effective cost-per-task multiplier across reasoning models is typically 5–20× what the published token price implies, though it varies considerably by query type. Lower headline prices in later models are real improvements, but where reasoning chain length scales with problem difficulty, effective task cost on demanding queries is likely to remain materially higher than the nominal efficiency ratio suggests. This dataset calculates Intelligence-per-Dollar on the published $/1M output token rate only, using the same token-count basis as non-reasoning predecessors. Buyers evaluating reasoning models for production workloads should benchmark effective cost-per-task on their own query distribution rather than relying on the nominal ratio shown here.
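The arithmetic above can be wrapped into a small per-task cost helper. The token counts are the illustrative figures from the text, not measured values, and the sketch assumes the common case where reasoning and visible tokens are billed at the same output rate (billing structures vary by provider).

```python
# Effective cost of one task when reasoning tokens are billed at the same
# per-token rate as the visible response.

def task_cost(reasoning_tokens: int, visible_tokens: int,
              price_per_m_output: float) -> float:
    """Dollar cost of a single task, billing reasoning + visible output."""
    return (reasoning_tokens + visible_tokens) * price_per_m_output / 1_000_000

# o1 at $60/1M output: 8,000 reasoning tokens + 500 visible tokens
print(f"${task_cost(8_000, 500, 60.0):.2f}")  # $0.51

# Same 500 visible tokens, no reasoning stream, same nominal price
print(f"${task_cost(0, 500, 60.0):.2f}")      # $0.03
```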

Key methodological considerations for professional use of this dataset:

  • Benchmark saturation: MMLU has an estimated true ceiling of ~93% due to labelling errors in the original dataset; o3 at 92.9% is functionally at that limit. HumanEval’s 164-problem set is now considered insufficient for distinguishing frontier models.
  • MATH discontinuity: Models from o1 onward are evaluated on MATH-500, a curated 500-question subset; earlier models used the full 12,500-question set. This inflates the apparent improvement in mathematical reasoning for the most recent epoch.
  • Evaluation protocol variance: Temperature, few-shot count, and chain-of-thought instructions differ across providers and evaluations. Scores are best-effort figures from technical reports and community leaderboards, not controlled comparative experiments.
  • API price vs. compute efficiency: Published prices reflect competitive strategy as much as underlying compute costs. The Intelligence-per-Dollar ratio measures commercial procurement value, not algorithmic progress per training FLOP.
  • Equal weighting is a choice: GPQA Diamond has 198 questions; the MMLU test set has 14,042. Equal weighting treats them identically. Teams with specific application contexts should reweight accordingly.
  • Reasoning token inflation: The Intelligence-per-Dollar metric uses the published output token price on the same basis for all models. Most reasoning models (o1 onward) emit an internal chain-of-thought before the visible response; those tokens are typically billed at the same rate and can add 5–20× to the effective token count per task on demanding queries. The stated efficiency ratios for reasoning models likely overstate their cost competitiveness relative to non-reasoning predecessors on workloads that trigger extended reasoning.
  • Benchmarks do not measure general capability: The CCI does not capture reliability, calibration, factual grounding, agentic performance, or fit for any particular workflow.
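The reweighting point above can be made concrete with a minimal sketch. Both the weights and the normalised scores here are illustrative placeholders, not figures from this dataset or a recommendation.

```python
# Task-specific reweighting of a benchmark composite. Weights are
# renormalised over whichever benchmarks the model actually has scores for.

def weighted_composite(scores: dict[str, float],
                       weights: dict[str, float]) -> float:
    """Weighted average of normalised benchmark scores."""
    total = sum(weights[b] for b in scores)
    return sum(scores[b] * weights[b] for b in scores) / total

# Hypothetical normalised scores on the five CCI benchmarks
scores = {"MMLU": 90.0, "HumanEval": 95.0, "MATH": 80.0,
          "BBH": 70.0, "GPQA": 60.0}

equal  = {b: 1.0 for b in scores}                      # the CCI's choice
coding = {"MMLU": 2.0, "HumanEval": 4.0, "MATH": 1.0,  # illustrative
          "BBH": 1.0, "GPQA": 0.5}                     # coding-heavy weights

print(round(weighted_composite(scores, equal), 1))   # 79.0
print(round(weighted_composite(scores, coding), 1))  # 87.1
```

The same model moves by eight points depending on the weighting, which is the point of the caveat: the CCI's equal weighting is one defensible choice among many.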

The 2025–2026 model cohort — Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-5, Claude Sonnet 4.6, Claude Opus 4.6 — clusters tightly between composite scores of roughly 90 and 97. On the benchmarks tracked here, the gap between providers has largely closed. The remaining differentiation lies in dimensions the table above does not break out: GPQA Diamond scientific reasoning (Claude Opus 4.6 at 91.3%), coding agent tasks (SWE-bench Verified), and structured tool use. For enterprise qualification decisions, those are now the benchmarks that matter.

Frequently Asked Questions

What is the Composite Capability Index (CCI) and how is it constructed?

The Composite Capability Index (CCI) is a 0–100 normalised aggregate of performance across up to five standardised benchmarks: MMLU (world knowledge, 57 subjects), HumanEval (Python code generation, pass@1), MATH (Hendrycks competition mathematics), BIG-Bench Hard (23 compositional reasoning tasks), and GPQA Diamond (PhD-level STEM, available from 2023 only). Each benchmark score is linearly normalised against the minimum observed value (GPT-3, 2020) and the maximum observed value (o3, 2025-Q1), then averaged across available benchmarks. The CCI is a constructed index, not an official metric from any model developer; its primary utility is compressing heterogeneous benchmark performance into a single trend-visible number for comparative analysis. Limitations include: the weighting of each benchmark is equal regardless of task coverage; different model families use different evaluation protocols; the 2020–2022 composite excludes GPQA (not yet published); and benchmark saturation means the index cannot meaningfully distinguish between models once all constituent benchmarks approach ceiling performance.

Why were these five benchmarks chosen?

The five benchmarks span the major capability dimensions that investment analysts and enterprise technology teams track: MMLU captures broad knowledge retrieval and professional-level language understanding across 57 subject areas; HumanEval measures practical software engineering capability with direct commercial relevance; MATH tests multi-step symbolic reasoning that proxies for analytical depth; BIG-Bench Hard specifically isolates tasks where earlier models (pre-GPT-4) failed to exceed average human performance, making it a sensitive detector of genuine reasoning progress; and GPQA Diamond, introduced in 2023, provides the most rigorous available test of scientific reasoning, using questions validated by PhD researchers to require genuine domain expertise rather than pattern matching. Together they provide triangulated coverage of knowledge breadth, code generation, mathematical reasoning, compositional reasoning, and expert-level science — the domains most relevant to enterprise AI deployment decisions.

How is Intelligence-per-Dollar defined?

Intelligence-per-Dollar is defined as the raw average benchmark score across available benchmarks divided by the published API output token price ($/1M tokens), normalised so that GPT-3 in 2020 equals 1×. It measures how much capability a buyer receives per unit of inference spend, capturing the joint effect of capability improvements and price reductions. The GPT-3.5-turbo vintage (2022) shows the highest Intelligence-per-Dollar ratio in this dataset at approximately 66× the 2020 baseline — driven by a dramatic 97% price reduction to ~$2/1M tokens while capability nearly doubled. GPT-4 (2023) shows a sharp reversal to ~3× as its $60/1M output price matched GPT-3's original price despite materially higher capability. GPT-4o (2024) partially restored the ratio to ~14× through the $15/1M price point. The ratio is a commercial efficiency measure, not a compute efficiency measure: it reflects vendor pricing strategy and competitive dynamics as much as underlying model improvement. Enterprise buyers optimising for cost-adjusted capability should track this ratio across new model releases, particularly when providers cut prices in response to competitive pressure.

How reliable are the underlying benchmarks?

Benchmark saturation is the most serious methodological concern for this dataset. MMLU is estimated to have a theoretical ceiling of approximately 93% due to mislabelled questions in the original dataset; o3 at 92.9% is functionally at ceiling. HumanEval's original 164-problem test set is now considered inadequate: several providers have been found to include similar problems in training data, and pass@1 rates above 95% are common among frontier models. The MATH benchmark uses MATH-500 (a curated subset) for frontier model evaluations since 2024, whereas earlier models were evaluated on the full 12,500-question set — creating a methodological discontinuity that modestly inflates apparent improvement for reasoning models. BIG-Bench Hard is experiencing similar saturation, which prompted the February 2025 release of BIG-Bench Extra Hard (BBEH), where even leading models score below 55%. GPQA Diamond remains the least saturated benchmark (o3 at 83.3% vs. a human expert baseline of ~65–74%), though it is also the shortest (198 questions) and therefore noisiest. Finally, model providers select benchmark reporting conditions (temperature, number of shots, prompting strategy) and in some cases may have used benchmark-adjacent data during training — the scores represent best-effort published figures but cannot be considered fully controlled experiments.

How do reasoning tokens affect the cost comparison?

Most frontier models from o1 onward generate two categories of output tokens: reasoning tokens, which are the internal chain-of-thought computations that precede the response, and response tokens, which are the visible output. Both categories are typically billed at the same output token rate, though billing structures vary by provider and can change. Reasoning token counts are generally not directly controllable by the caller; they tend to scale with problem difficulty and can range from a few hundred tokens on trivial tasks to tens of thousands on hard mathematical or scientific problems. Evaluations of reasoning models in coding, mathematics, and analysis workloads have typically found 4,000–20,000 reasoning tokens per query, though observed ranges vary widely. At o1's $60/1M token price, a task drawing 8,000 reasoning tokens plus 500 visible output tokens would cost around $0.51 — compared to roughly $0.03–$0.08 for a similarly-scoped GPT-4o query. Later reasoning models carry lower headline prices, but where reasoning chain length scales with problem difficulty regardless of generation, the effective cost-per-task gap with non-reasoning models may persist. The Intelligence-per-Dollar ratios in this dataset are computed on the published $/1M output token price using the same methodology for all models; they do not adjust for reasoning token volume. Most reasoning models in the dataset therefore likely appear more cost-efficient here than they will be in practice on demanding workloads. GPT-4.1, which does not appear to emit reasoning tokens, is the most directly comparable model to non-reasoning predecessors on a per-task cost basis for standard tasks. For workloads that require chain-of-thought reasoning, benchmark actual token consumption on your own query distribution before committing to a deployment model.

What practical conclusions does the trajectory support?

The benchmark trajectory from 2020 to 2026 supports three practical conclusions for enterprise buyers and investors. First, the capability-per-dollar curve is non-monotonic: the maximum Intelligence-per-Dollar for this vintage set was achieved by GPT-3.5-turbo in 2022, not by frontier models, because frontier models carry substantial premium pricing at launch. Enterprises optimising for unit economics rather than frontier performance should systematically evaluate whether the previous generation's model — typically repriced 6–12 months after launch — meets their accuracy threshold. Second, the transition from GPT-3.5 to GPT-4 (composite index jump of +24.8 points) and from GPT-4 to GPT-4o (a further +18.3 points) correspond to the benchmark inflection points where models crossed professional-competence thresholds on knowledge and code tasks — the capability improvements most directly correlated with measurable enterprise workflow automation. Third, the jump from non-reasoning models (GPT-4o at 88.6) to explicit chain-of-thought reasoning models (o1, o3) primarily delivered gains on MATH and GPQA rather than on MMLU and HumanEval, which were already near-saturated — suggesting that the value of frontier reasoning models concentrates in scientific R&D, legal analysis, and complex financial modelling rather than general knowledge retrieval or standard code generation, where earlier models are already sufficient.
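The first conclusion, screening for the cheapest model that clears an accuracy threshold rather than defaulting to the frontier, can be expressed directly against the table's figures. The threshold values are illustrative.

```python
# Cheapest-model-above-threshold screen, using composite scores and
# $/1M output token prices from the dataset table above.

models = [
    # (name, composite score, $/1M output tokens)
    ("GPT-5",             97.3, 10.00),
    ("o3",               100.0, 10.00),
    ("GPT-4.1",           91.3,  8.00),
    ("Gemini 2.5 Pro",    95.8, 10.00),
    ("Claude 3.7 Sonnet", 90.2, 15.00),
    ("GPT-4o",            88.6, 15.00),
]

def cheapest_meeting(threshold: float):
    """Cheapest model whose composite score meets the accuracy threshold."""
    eligible = [m for m in models if m[1] >= threshold]
    return min(eligible, key=lambda m: m[2]) if eligible else None

print(cheapest_meeting(90.0))  # GPT-4.1 at $8/1M clears a 90-point bar
print(cheapest_meeting(96.0))  # only GPT-5 and o3 qualify, both at $10/1M
```

The screen reproduces the text's conclusion: for a 90-point threshold the repriced one-tier-down model (GPT-4.1) wins on cost, and only genuinely frontier-dependent workloads justify paying for the top of the table.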