AI Autonomous Task Time Horizon — How Long Frontier Models Work Without Human Intervention (2020–2026)
Tracks the 50% task time horizon of frontier AI agents on METR's HCAST benchmark: the length of task (measured by human expert completion time) at which a model succeeds roughly half the time when working autonomously. From 9 seconds for early GPT-3 agents in 2020 to 14.5 hours for Claude Opus 4.6 in February 2026, the horizon has doubled roughly every five to six months on average.
Data
| Model | Organisation | Date | 50% Task Horizon | Notes | Source |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | Feb 2026 | 14.5h | METR TH1.1 task suite; point estimate with ±30–50% uncertainty | METR Time Horizon 1.1 |
| Claude Opus 4.5 | Anthropic | Jul 2025 | 4h 49m | Direct METR evaluation; TH1.1 suite includes more 8h+ tasks | METR — LessWrong post |
| GPT-5 | OpenAI | Jun 2025 | 2h 17m | Direct METR evaluation | METR GPT-5 evaluation report |
| o3 / Claude 3.7 Sonnet | OpenAI / Anthropic | Q1 2025 | ~75 min | METR Time Horizon 1.0 estimate; extended trend | METR time-horizons page |
| Claude 3.5 Sonnet | Anthropic | Oct 2024 | ~40 min | Late-2024 frontier cohort; HCAST+SWAA benchmark | METR arXiv:2503.14499 |
| GPT-4o | OpenAI | May 2024 | ~8 min | METR HCAST+SWAA evaluation | METR arXiv:2503.14499 |
| GPT-4 | OpenAI | Mar 2023 | ~4 min | HCAST evaluation | METR arXiv:2503.14499 |
| GPT-3 agents | OpenAI | Mid-2020 | ~9s | Retrospective estimate; early agentic scaffolding | METR arXiv:2503.14499 |
About this Dataset
In mid-2020, an AI agent built on early GPT-3 infrastructure could autonomously complete tasks requiring roughly 9 seconds of equivalent human expert effort at the 50% success rate threshold. By February 2026, Claude Opus 4.6 reached a 50% task horizon of approximately 14.5 hours on METR’s Time Horizon 1.1 evaluation. That progression — from 9 seconds to 52,200 seconds — spans nearly four orders of magnitude in six years and represents the most systematically measured trajectory of AI agentic capability currently available in published research.
The metric behind this chart is METR’s task time horizon, introduced in the March 2025 paper “Measuring AI Ability to Complete Long Tasks” (Kwa et al., arXiv:2503.14499). The 50% task time horizon is defined as the human expert completion time at which a frontier agent succeeds approximately half the time when working entirely autonomously. “Human expert completion time” is the critical unit: it is a proxy for task difficulty, not a measure of how long the AI takes to execute the task. A task rated at 40 minutes of human expert time might take an AI agent considerably longer (or, where the work parallelizes, considerably less), but the benchmark uses human-time ratings as a stable, model-agnostic difficulty scale. Researchers derive the 50% threshold by fitting a logistic curve to the agent’s success or failure across hundreds of tasks, each pre-rated for human completion time by annotators, and bootstrapping confidence intervals around the point estimate.
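The sketch below illustrates that procedure on synthetic data: fit a logistic curve of success probability against log task length, read off the time at which the fitted probability crosses 50%, and bootstrap over tasks for an interval. The synthetic data, the `fit_h50` helper, and the regression setup are assumptions made for illustration, not METR's actual pipeline.

```python
# Minimal sketch of deriving a 50% task time horizon from per-task results,
# on synthetic data. Not METR's code; data and helper names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic task suite: human-expert completion times (minutes) and agent pass/fail,
# with success probability falling off logistically in log task length.
minutes = np.exp(rng.uniform(np.log(0.5), np.log(960), size=400))
true_h50 = 120.0                                    # assumed "true" horizon: 2 hours
p_success = 1.0 / (1.0 + np.exp(1.2 * (np.log(minutes) - np.log(true_h50))))
passed = rng.binomial(1, p_success)

def fit_h50(mins, outcomes):
    """Fit P(success) ~ logistic(log time) and return the time where P = 0.5."""
    clf = LogisticRegression(C=1e6, max_iter=1000)
    clf.fit(np.log(mins).reshape(-1, 1), outcomes)
    return float(np.exp(-clf.intercept_[0] / clf.coef_[0, 0]))

h50 = fit_h50(minutes, passed)

# Bootstrap over tasks to put a rough interval around the point estimate.
resamples = (rng.integers(0, len(passed), len(passed)) for _ in range(1000))
boot = [fit_h50(minutes[i], passed[i]) for i in resamples]
low, high = np.percentile(boot, [2.5, 97.5])
print(f"50% horizon ≈ {h50:.0f} min (bootstrap 95% interval {low:.0f}–{high:.0f} min)")
```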
Between early 2024 and early 2026, the 50% task time horizon for frontier models doubled roughly every three to five months. If that rate were sustained, the horizon would pass a full 24-hour day well within the current decade; from the February 2026 level of 14.5 hours, that requires less than one further doubling. Whether the rate is in fact sustained remains empirically open.
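As a sanity check on that implication, the sketch below runs the arithmetic under an assumed constant doubling time. The 4-month figure is an assumption taken from the range above, and the output is illustrative extrapolation, not a forecast.

```python
# Back-of-envelope extrapolation under an assumed constant doubling time.
from math import log2

h_feb_2026_hours = 14.5          # Claude Opus 4.6 point estimate from the table
doubling_months = 4.0            # assumed constant rate (illustrative only)

print(f"doublings to a 24 h horizon: {log2(24 / h_feb_2026_hours):.2f}")
for months_ahead in (6, 12, 24):
    horizon = h_feb_2026_hours * 2 ** (months_ahead / doubling_months)
    print(f"+{months_ahead:2d} months: ≈{horizon:,.0f} h")
```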
The growth pattern is most clearly visible on the logarithmic scale used in the chart above. On a log axis, each doubling of the task horizon appears as an equal vertical increment. The step from GPT-4 (approximately 4 minutes, March 2023) to GPT-4o (approximately 8 minutes, May 2024) is one doubling in fourteen months; the step from GPT-4o to Claude 3.5 Sonnet’s late-2024 cohort (approximately 40 minutes) is roughly two and a third further doublings in about five months, suggesting acceleration over 2024. The jump from GPT-5 (2h 17m, June 2025) to Claude Opus 4.5 (4h 49m, July 2025) amounts to slightly more than a full doubling within a single month, although single intervals between releases from different organisations are noisy and can overstate the underlying rate.
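The same arithmetic can be run over the table's point estimates to see the per-interval doubling times directly. In the sketch below, dates are taken as the first of the stated month and "Q1 2025" is treated as February 2025; both are assumptions for the example, and horizons are converted to minutes.

```python
# Per-interval doubling times computed from the point estimates in the table above.
from datetime import date
from math import log2

points = [
    ("GPT-4",             date(2023, 3, 1),    4.0),   # ~4 min
    ("GPT-4o",            date(2024, 5, 1),    8.0),   # ~8 min
    ("Claude 3.5 Sonnet", date(2024, 10, 1),  40.0),   # ~40 min
    ("o3 / 3.7 Sonnet",   date(2025, 2, 1),   75.0),   # ~75 min, Q1 2025 assumed Feb
    ("GPT-5",             date(2025, 6, 1),  137.0),   # 2h 17m
    ("Claude Opus 4.5",   date(2025, 7, 1),  289.0),   # 4h 49m
    ("Claude Opus 4.6",   date(2026, 2, 1),  870.0),   # 14.5h
]

for (n0, d0, h0), (n1, d1, h1) in zip(points, points[1:]):
    months = (d1.year - d0.year) * 12 + (d1.month - d0.month)
    doublings = log2(h1 / h0)
    print(f"{n0:>18} -> {n1:<18} {doublings:4.1f} doublings in {months:2d} months "
          f"(≈{months / doublings:.1f} months per doubling)")
```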
Three scope limitations are essential context for professional use of this data. First, HCAST is predominantly a software engineering benchmark. Most tasks involve Python scripting, shell automation, data pipelines, repository navigation, and related programming-adjacent work in a sandboxed Linux environment. The benchmark does not systematically cover the sustained natural-language reasoning, interpersonal coordination, and domain-specific judgment that characterise many professional workflows, so a 14.5-hour task horizon on HCAST does not imply equivalent autonomous capability across general office work. Second, the task suite used for the most recent evaluations, METR’s Time Horizon 1.1 (published January 2026), expanded the set of tasks in the 4–12 hour range to ensure measurement coverage at higher capability levels; as a result, the newest estimates are not fully methodologically comparable with the earlier HCAST estimates from the March 2025 paper. Third, point estimates carry approximately ±30–50% uncertainty, and that uncertainty is multiplicative: a stated horizon of 40 minutes plausibly covers a range from roughly 25 to 60 minutes. The chart plots point estimates; for quantitative applications, confidence intervals from the source data should be incorporated.
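Because the quoted uncertainty is a ratio, intervals should be formed by dividing and multiplying the point estimate rather than adding and subtracting. A minimal helper is sketched below; the 1.5× factor is an assumption matching the upper end of the quoted range.

```python
# Multiplicative uncertainty: bounds come from dividing/multiplying the estimate.
def horizon_interval(point_estimate, factor=1.5):
    """Return (low, high) bounds for a horizon under multiplicative uncertainty."""
    return point_estimate / factor, point_estimate * factor

print(horizon_interval(40))          # ≈ (26.7, 60.0) minutes
print(horizon_interval(14.5 * 60))   # Claude Opus 4.6 point estimate, in minutes
```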
The mid-2020 data point deserves a separate note: it is a retrospective estimate by METR researchers, not a contemporaneous evaluation. Early GPT-3-era agents were not systematically evaluated on task-horizon methodology at the time; the 9-second estimate is derived by back-fitting the observed trend and applying it to documented early agent performance. It should be treated as an indicative anchor rather than a precision measurement.
For policy researchers and investors, the practical signal in this dataset is primarily about pace of capability development rather than deployment readiness. Benchmark performance in a controlled sandbox environment and autonomous operation in real enterprise workflows involve meaningfully different conditions: live enterprise settings typically include authentication and access management, embedded human approval gates, integration with legacy systems not designed for agent access, and accountability structures where errors carry organisational consequences. The gap between benchmark capability and deployable autonomy is not constant, and measuring it requires empirical evidence from real deployment contexts — which the HCAST benchmark alone does not provide.
Key methodological notes for quantitative use of this dataset:
- Task time unit: “Human expert completion time” is a difficulty proxy rated by annotators, not AI execution time. The two can differ substantially.
- Logistic curve fitting: The 50% horizon is a model-derived estimate, not a directly observed value. Point estimates carry ±30–50% uncertainty as ratios.
- Benchmark scope: HCAST and TH1.1 are software-engineering-heavy. Performance on general professional tasks may differ substantially.
- TH1.1 expansion: Post-March-2025 evaluations use an expanded task suite with more long-duration tasks; the newest estimates are therefore not fully methodologically comparable with the earlier HCAST estimates.
- Retrospective baseline: The mid-2020 estimate is a back-fitted retrospective, not a contemporaneous evaluation.
- Trend extrapolation: The observed doubling rate is a historical pattern in eight data points. It carries no mechanistic guarantee of continuation, acceleration, or saturation.