50% Task Horizon — Claude Opus 4.6 (Feb 2026)
14.5 hours
METR Time Horizon 1.1 evaluation
vs. 9 seconds in 2020
Historical Doubling Rate (2020–2025)
~7 months
Average time to double the 50% task horizon
Recently accelerating to ~4 months in 2024–2025
50% Task Horizon — GPT-5 (Jun 2025)
2h 17m
METR direct evaluation
~34× increase since GPT-4 in 2023
Total Increase (2020 → 2026)
~4 orders of magnitude
9 seconds → 14.5 hours
~5,800× in absolute terms

Data

| Model | Organisation | Date | 50% Task Horizon | Notes | Source |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | Anthropic | Feb 2026 | 14.5h | METR TH1.1 task suite; point estimate with ±30–50% uncertainty | METR Time Horizon 1.1 |
| Claude Opus 4.5 | Anthropic | Jul 2025 | 4h 49m | Direct METR evaluation; TH1.1 suite includes more 8h+ tasks | METR (LessWrong post) |
| GPT-5 | OpenAI | Jun 2025 | 2h 17m | Direct METR evaluation | METR GPT-5 evaluation report |
| o3 / Claude 3.7 Sonnet | OpenAI / Anthropic | Q1 2025 | ~75 min | METR Time Horizon 1.0 estimate; extended trend | METR time-horizons page |
| Claude 3.5 Sonnet | Anthropic | Oct 2024 | ~40 min | Late-2024 frontier cohort; HCAST+SWAA benchmark | METR arXiv:2503.14499 |
| GPT-4o | OpenAI | May 2024 | ~8 min | METR HCAST+SWAA evaluation | METR arXiv:2503.14499 |
| GPT-4 | OpenAI | Mar 2023 | ~4 min | HCAST evaluation | METR arXiv:2503.14499 |
| GPT-3 agents | OpenAI | Mid-2020 | ~9s | Retrospective estimate; early agentic scaffolding | METR arXiv:2503.14499 |

About this Dataset

In mid-2020, an AI agent built on early GPT-3 infrastructure could autonomously complete tasks requiring roughly 9 seconds of equivalent human expert effort at the 50% success rate threshold. By February 2026, Claude Opus 4.6 reached a 50% task horizon of approximately 14.5 hours on METR’s Time Horizon 1.1 evaluation. That progression — from 9 seconds to 52,200 seconds — spans nearly four orders of magnitude in six years and represents the most systematically measured trajectory of AI agentic capability currently available in published research.

The metric behind this chart is METR’s task time horizon, introduced in the March 2025 paper “Measuring AI Ability to Complete Long Tasks” (Kwa et al., arXiv:2503.14499). The 50% task time horizon is defined as the human expert completion time of tasks at which a frontier agent succeeds approximately half the time when working entirely autonomously. “Human expert completion time” is the critical unit: it is a proxy for task difficulty, not a measure of how long the AI takes to execute. A task rated at 40 minutes of human expert time might take an AI agent considerably longer, or, where tasks parallelise, considerably less; the benchmark uses human-time ratings as a stable, model-agnostic difficulty scale. Researchers derive the 50% threshold by fitting a logistic curve to results on hundreds of tasks, each pre-rated by annotators for human completion time, and bootstrapping confidence intervals around the point estimate.
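As a rough illustration of that procedure, the sketch below fits a logistic curve to a small invented set of task outcomes and reads off the human-time value at which the fitted success probability crosses 50%. The per-task data, the use of scikit-learn, and the regularisation setting are assumptions for illustration; this is not METR's actual task data or fitting pipeline.

    # Minimal sketch of a 50% task-horizon estimate (illustrative only).
    # The per-task results below are invented; METR fits over hundreds of
    # annotator-rated HCAST/TH1.1 tasks and bootstraps confidence intervals.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
    success       = np.array([1, 1, 1, 1,  1,  0,  1,   0,   0,   0])

    X = np.log2(human_minutes).reshape(-1, 1)             # work on a log scale
    model = LogisticRegression(C=1e6).fit(X, success)     # effectively unregularised

    # The 50% horizon is the task length where the fitted log-odds cross zero.
    log2_h50 = -model.intercept_[0] / model.coef_[0, 0]
    print(f"Estimated 50% task horizon: {2 ** log2_h50:.0f} human-expert minutes")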

Between early 2024 and early 2026, the 50% task time horizon for frontier models doubled roughly every four to seven months, a rate that, if sustained, would imply week-length and then month-length autonomous task horizons well before the end of the decade. Whether that rate is sustained remains empirically open.

The growth pattern is most clearly visible on the logarithmic scale used in the chart above. On a log axis, each doubling of the task horizon appears as an equal vertical increment. The step from GPT-4 (approximately 4 minutes, March 2023) to GPT-4o (approximately 8 minutes, May 2024) is one doubling; the step from GPT-4o to Claude 3.5 Sonnet’s late-2024 cohort (approximately 40 minutes) is slightly more than two further doublings in roughly five months, suggesting acceleration over 2024. The jump from GPT-5 (2h 17m, June 2025) to Claude Opus 4.5 (4h 49m, July 2025) is a full doubling within a single calendar month; individual model-to-model steps are noisier than the fitted trend because frontier releases cluster in time, but the overall pattern is consistent with the accelerated doubling rate observed in 2024–2025.
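For readers who want to reproduce the doubling arithmetic, the short sketch below derives implied months-per-doubling figures from pairs of dated points in the Data table above. The elapsed-month counts are approximate, and the calculation is illustrative rather than a fit to the full series.

    # Months per doubling implied by two dated horizon estimates
    # (values from the Data table above; elapsed months are approximate).
    import math

    def doubling_time_months(h1, h2, months_elapsed):
        """Implied months per doubling when a horizon grows from h1 to h2."""
        return months_elapsed / math.log2(h2 / h1)

    # GPT-3 agents (~9 s = 0.15 min, mid-2020) to GPT-4 (~4 min, Mar 2023): ~33 months.
    print(f"{doubling_time_months(0.15, 4, 33):.1f} months per doubling")   # ~7

    # GPT-4 (~4 min, Mar 2023) to Claude Opus 4.6 (14.5 h = 870 min, Feb 2026): ~35 months.
    print(f"{doubling_time_months(4, 870, 35):.1f} months per doubling")    # ~4.5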

Three scope limitations are essential context for professional use of this data. First, HCAST is predominantly a software engineering benchmark. Most tasks involve Python scripting, shell automation, data pipelines, repository navigation, and related programming-adjacent work in a sandboxed Linux environment. The benchmark does not systematically cover the sustained natural-language reasoning, interpersonal coordination, and domain-specific judgment that characterise many professional workflows, so a 14.5-hour task horizon on HCAST does not imply equivalent autonomous capability across general office work. Second, the task suite used for the most recent evaluations, METR's Time Horizon 1.1 (published January 2026), expanded the set of tasks in the 4–12 hour range to ensure measurement coverage at higher capability levels; as a result, the TH1.1 estimates are not fully methodologically comparable with the earlier HCAST estimates from the March 2025 paper. Third, point estimates carry approximately ±30–50% uncertainty expressed as ratios: a stated horizon of 40 minutes plausibly covers a range from roughly 20 to 60 minutes. The chart plots point estimates; for quantitative applications, confidence intervals from the source data should be incorporated.

The mid-2020 data point deserves a separate note: it is a retrospective estimate by METR researchers, not a contemporaneous evaluation. Early GPT-3-era agents were not systematically evaluated on task-horizon methodology at the time; the 9-second estimate is derived by back-fitting the observed trend and applying it to documented early agent performance. It should be treated as an indicative anchor rather than a precision measurement.

For policy researchers and investors, the practical signal in this dataset is primarily about pace of capability development rather than deployment readiness. Benchmark performance in a controlled sandbox environment and autonomous operation in real enterprise workflows involve meaningfully different conditions: live enterprise settings typically include authentication and access management, embedded human approval gates, integration with legacy systems not designed for agent access, and accountability structures where errors carry organisational consequences. The gap between benchmark capability and deployable autonomy is not constant, and measuring it requires empirical evidence from real deployment contexts — which the HCAST benchmark alone does not provide.

Key methodological notes for quantitative use of this dataset:

  • Task time unit: “Human expert completion time” is a difficulty proxy rated by annotators, not AI execution time. The two can differ substantially.
  • Logistic curve fitting: The 50% horizon is a model-derived estimate, not a directly observed value. Point estimates carry ±30–50% uncertainty as ratios (a simplified bootstrap sketch follows this list).
  • Benchmark scope: HCAST and TH1.1 are software-engineering-heavy. Performance on general professional tasks may differ substantially.
  • TH1.1 expansion: Post-March-2025 evaluations use an expanded task suite with more long-duration tasks; the resulting estimates are not fully methodologically comparable with the earlier HCAST figures.
  • Retrospective baseline: The mid-2020 estimate is a back-fitted retrospective, not a contemporaneous evaluation.
  • Trend extrapolation: The observed doubling rate is a historical pattern in eight data points. It carries no mechanistic guarantee of continuation, acceleration, or saturation.
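As a simplified sketch of what bootstrapped confidence intervals look like in this setting, the snippet below resamples the invented task set from the earlier example and refits the logistic model to obtain a percentile interval around the 50% horizon. The data are invented and the procedure is simplified relative to METR's published methodology.

    # Simplified bootstrap interval for the 50% horizon (toy data reused from
    # the earlier sketch; not METR's actual resampling scheme).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
    success       = np.array([1, 1, 1, 1,  1,  0,  1,   0,   0,   0])
    log2_minutes  = np.log2(human_minutes)

    def fit_h50(x, y):
        m = LogisticRegression(C=1e6, max_iter=1000).fit(x.reshape(-1, 1), y)
        return 2 ** (-m.intercept_[0] / m.coef_[0, 0])

    estimates = []
    for _ in range(1000):
        idx = rng.integers(0, len(success), len(success))   # resample tasks with replacement
        if success[idx].min() == success[idx].max():        # skip all-pass / all-fail resamples
            continue
        estimates.append(fit_h50(log2_minutes[idx], success[idx]))

    low, high = np.percentile(estimates, [2.5, 97.5])
    print(f"Bootstrap 95% interval: roughly {low:.0f} to {high:.0f} human-expert minutes")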

Frequently Asked Questions

What does the 50% task time horizon measure?

The 50% task time horizon is the length of task at which a frontier AI agent succeeds approximately half the time when operating autonomously without human guidance. Task length is measured in human expert completion time — the estimated time a skilled human professional would require to perform the same task from scratch — not in wall-clock AI execution time. That distinction matters: a task rated at 40 minutes of human expert time might take an AI agent hours, or might be completed much faster through parallelism; the metric captures task complexity rather than AI speed. METR derives this estimate by fitting a logistic curve to a large set of tasks, each pre-rated by human annotators for complexity, then running AI agents on those tasks in a sandboxed environment. The 50% success rate threshold is chosen because it sits in the most informative region of the logistic curve — where task difficulty and agent capability interact most visibly. The resulting metric compresses the agent's capability distribution to a single interpretable number: the task difficulty, in human-time units, where the agent is about as likely to succeed as to fail.

Why is the chart plotted on a logarithmic scale?

The 50% task time horizon has grown from approximately 9 seconds in mid-2020 to 52,200 seconds (14.5 hours) by February 2026 — a factor of roughly 5,800 across six years. Plotting this trajectory on a linear axis would make the early data points (GPT-3 agents at 9 seconds, GPT-4 at 240 seconds) visually indistinguishable from zero, while compressing all the interesting variation into a near-vertical spike at the right edge. A logarithmic scale allows every doubling of the horizon to appear as an equal vertical step, which matches the underlying pattern in the data: the horizon has roughly doubled every 7 months over most of the observed period. On a log scale, a straight trend line indicates consistent exponential growth; deviations from linearity reveal acceleration or deceleration. The chart therefore does not exaggerate the trend — it is the scale that makes the empirical regularity visible without distortion.
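The same point can be checked by re-plotting the Data table on a log axis. The sketch below uses matplotlib and approximate fractional-year release dates (an assumption made purely for plotting convenience) together with the point estimates from the table.

    # Re-plot the Data table on a log axis; a near-straight line indicates
    # roughly exponential growth. Fractional years are approximations.
    import matplotlib.pyplot as plt

    years   = [2020.5, 2023.2, 2024.4, 2024.8, 2025.2, 2025.45, 2025.55, 2026.1]
    minutes = [9 / 60,  4,      8,      40,     75,     137,     289,     870]

    plt.semilogy(years, minutes, marker="o")
    plt.xlabel("Release date (approximate)")
    plt.ylabel("50% task horizon (human-expert minutes, log scale)")
    plt.tight_layout()
    plt.show()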

What is HCAST, and what kinds of tasks does it cover?

HCAST (Human-Comparable Agentic Software Tasks) is a benchmark suite developed by METR that consists of hundreds of real-world software engineering and technical tasks sourced from online task platforms, professional workflows, and internal test development. Tasks span Python scripting, shell automation, repository navigation, data analysis pipelines, web scraping, bug reproduction, and similar programming-adjacent work. Agents are evaluated in a sandboxed Linux environment with access to standard tools, internet-connected resources where tasks require them, and no human interaction during the task. The benchmark is specifically designed so that each task has a verifiable, objective completion criterion — allowing binary scoring without subjective judgment. The critical scope limitation is that HCAST is predominantly a software engineering benchmark. The majority of tasks are solvable by someone with programming competence; general professional tasks requiring sustained natural-language reasoning, interpersonal coordination, visual judgment, or domain expertise outside software are largely absent from the current suite. The Time Horizon 1.1 update published in January 2026 expanded the task set with more tasks in the 4–12 hour range, which widened the measurable range but did not substantially broaden the task type distribution. Inferences about agent capability in non-technical workplace contexts should be made cautiously from this data.

Will the doubling trend continue?

The observed doubling time of approximately 7 months (compressing to roughly 4 months in the 2024–2025 period) is an empirical pattern in a small historical dataset of eight data points, not a physical law or a guaranteed trajectory. Several factors could cause the trend to slow: benchmark saturation is possible if current task sets become fully solved before longer tasks are added; the tasks that remain hardest may require qualitatively different capabilities not currently improving at the same rate; and real-world deployment constraints — authentication, approval gates, organisational context — not present in HCAST may limit practical capability gains even as benchmark scores continue to rise. The trend could also accelerate if architectural changes, improved scaffolding, or expanded tool access produce larger-than-typical capability jumps in individual model releases. METR's own published commentary frames the trend descriptively rather than predictively. No published mechanistic model explains why the doubling rate should be stable or should change at any particular threshold. Any forward projection from this chart carries substantial uncertainty and should be treated as illustrative rather than predictive.
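Purely to show what "illustrative rather than predictive" means in practice, the sketch below extends the February 2026 point under an assumed fixed four-month doubling time. The dates and values it prints are arithmetic consequences of that assumption, not forecasts.

    # Illustrative extrapolation only: assumes a fixed doubling time persists,
    # which the text above explicitly does not guarantee.
    from datetime import date, timedelta

    horizon_hours   = 14.5              # Claude Opus 4.6 point estimate (Feb 2026)
    start           = date(2026, 2, 1)
    doubling_months = 4.0               # assumed; the historical range is ~4 to 7 months

    for n in range(1, 5):
        when = start + timedelta(days=round(n * doubling_months * 30.44))
        print(f"{when:%b %Y}: ~{horizon_hours * 2 ** n:.0f} h if the trend simply continued")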

What does this metric imply about real-world deployment?

The benchmark measures performance in a controlled, sandboxed environment on software-oriented tasks with objective completion criteria. Autonomous task completion in real enterprise workflows typically introduces several layers of complexity absent from HCAST: authentication and access management across multiple systems; human approval gates embedded in procurement, legal, and compliance workflows; integration with legacy software not designed for API or agent access; ambiguous or evolving task specifications that require iterative human clarification; and accountability structures where errors have real organisational consequences. A model capable of completing a 14.5-hour benchmark task autonomously may still require substantial human oversight infrastructure when deployed on tasks of equivalent nominal duration in a live enterprise setting. The benchmark is most directly informative for organisations building agentic pipelines in software development, data engineering, or technical operations — where HCAST's task composition more closely mirrors actual work. For policy researchers and investors, the metric provides a quantitative proxy for the pace of capability development; it does not directly translate to workforce substitution estimates or deployment-readiness conclusions without additional empirical evidence from real deployment contexts.

How reliable are the individual data points?

Reliability varies substantially across the dataset. The mid-2020 data point is a retrospective estimate by METR researchers applied to early GPT-3-era agentic systems; it was not a contemporaneous evaluation and carries higher uncertainty than the later points. Data points from METR's March 2025 paper (arXiv:2503.14499) — covering GPT-4 through the late-2024 frontier cohort — are derived from the most systematic evaluation methodology in the dataset: logistic-curve fitting across hundreds of tasks with bootstrapped confidence intervals. The paper notes that point estimates carry roughly ±30–50% uncertainty expressed as a ratio; a stated horizon of 40 minutes plausibly covers a range from approximately 20 minutes to 60 minutes. The post-March-2025 data points (o3/Claude 3.7 Sonnet, GPT-5, Claude Opus 4.5, Claude Opus 4.6) come from METR's ongoing evaluation programme using the TH1.1 task suite, which added more tasks in the 4–12 hour range specifically to ensure coverage at higher capability levels. The expansion of the task suite means the TH1.1 estimates are not strictly methodologically identical to the earlier HCAST estimates in the 2025 paper — longer-task performance may be measured more precisely in TH1.1, but the calibration anchor differs. Epoch AI's benchmark tracking page documents available confidence intervals alongside point estimates; for quantitative modelling applications, the confidence interval data should be incorporated rather than relying solely on point estimates.