AI Agents Reach 14.5-Hour Task Horizon as Google Confirms First AI-Assisted Zero-Day Exploit
METR's benchmark shows frontier AI agents can now sustain autonomous work for up to 14.5 hours — a capability whose security implications Google's first confirmed AI-assisted zero-day exploit illustrates in practice.
Source: METR (Model Evaluation & Threat Research); updated quarterly (approx.)
Why this matters
Google's threat intelligence team confirmed on 11 May 2026 that a threat actor used AI to develop a zero-day exploit — the first publicly documented case of AI being used to discover and weaponise a previously unknown vulnerability. The finding matters less as an isolated incident than as a proof-of-concept: AI-assisted vulnerability research can now move faster than defender response cycles built around human-speed reconnaissance.
The technical basis for that shift is visible in METR's benchmark data. Currently, frontier AI agents sustain autonomous work on tasks rated at up to 14.5 hours of human expert effort — measured at the 50% success threshold on METR's Time Horizon 1.1 evaluation. That figure, recorded for Claude Opus 4.6 in February 2026, represents a roughly 5,800-fold increase from the 9-second baseline measured for early GPT-3 agents in mid-2020. The progression was not smooth: the horizon stood at approximately 4 minutes for GPT-4 in March 2023, reached around 40 minutes by October 2024, then jumped to 1 hour 15 minutes by March 2025 and 14.5 hours by February 2026. Sustained vulnerability research — scanning codebases, generating candidate exploits, iterating on failures — sits squarely within this range of task complexity. An AI agent capable of multi-hour autonomous software engineering work is, technically, an agent capable of multi-hour autonomous security research.
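As a sanity check, the arithmetic behind those milestones can be reproduced in a few lines. The sketch below is illustrative only: the dates and durations come from the figures quoted above, the July anchor for "mid-2020" and the 30.44-day average month are assumptions, and nothing here reflects METR's own tooling.

```python
import math
from datetime import date

# (date, horizon in seconds): milestones as quoted in this article
milestones = [
    (date(2020, 7, 1), 9),                  # early GPT-3 agents, mid-2020 (assumed July)
    (date(2023, 3, 1), 4 * 60),             # GPT-4, ~4 minutes
    (date(2024, 10, 1), 40 * 60),           # ~40 minutes
    (date(2025, 3, 1), 75 * 60),            # 1 hour 15 minutes
    (date(2026, 2, 1), int(14.5 * 3600)),   # Claude Opus 4.6, 14.5 h = 52,200 s
]

first_date, first_h = milestones[0]
last_date, last_h = milestones[-1]

fold = last_h / first_h                          # ~5,800x, matching the text
months = (last_date - first_date).days / 30.44   # ~67 months elapsed
doublings = math.log2(fold)                      # ~12.5 doublings

print(f"overall increase: {fold:,.0f}x")
print(f"{doublings:.1f} doublings over {months:.0f} months "
      f"-> ~{months / doublings:.1f} months per doubling")
```

Run as written, this prints an overall increase of about 5,800x and roughly 5.4 months per doubling, consistent with the four-to-seven-month range cited below.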
The attack surface that such capability can target is now substantial. McKinsey's 2025 survey found that 79% of large organisations use generative AI in at least one function, and EU Eurostat data places enterprise AI adoption across firms with ten or more employees at 20% in 2025, up 12 percentage points since 2023. Each AI-integrated workflow likely introduces dependencies on third-party models, APIs, and inference infrastructure — systems whose vulnerability surfaces are still being mapped by defenders and attackers alike. Google's discovery suggests the mapping is now a competitive race, and AI is accelerating both sides. For enterprise security teams, the operational implication is that threat modelling assumptions calibrated to human-speed adversarial research may need revision.
The METR dataset tracked here measures AI agent capability on software-oriented tasks in controlled sandboxed environments, not real-world exploit development. The benchmark does not directly predict attack sophistication, and individual point estimates carry roughly ±30–50% multiplicative uncertainty. Nevertheless, for investment and strategy teams assessing AI-related security risk, the trajectory — from 9 seconds in 2020 to 52,200 seconds (14.5 hours) today, doubling roughly every four to seven months — provides the most systematic quantitative proxy currently available for the pace at which autonomous AI capability is expanding into territory previously reserved for skilled human specialists.
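That measurement noise matters less than it might appear. Because the growth spans more than twelve doublings, even a large multiplicative error on a single endpoint moves the implied doubling time only slightly. The check below is illustrative: the 67-month window and the endpoint values are taken from the figures above, and the ±30–50% scaling is applied to the latest point only.

```python
import math

months = 67            # mid-2020 to February 2026, approximately
ratio = 52_200 / 9     # 14.5-hour horizon vs the 9-second baseline

# Scale the latest point estimate by -30% and +50% and recompute
for scale in (0.7, 1.0, 1.5):
    doublings = math.log2(ratio * scale)
    print(f"endpoint x{scale}: ~{months / doublings:.1f} months per doubling")
```

The estimate varies by only a few weeks per doubling (roughly 5.1 to 5.6 months), which is why the trend line is a steadier signal than any single point estimate.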