AI Agents Reach 14.5-Hour Task Horizon as Google Confirms First AI-Assisted Zero-Day Exploit
METR's benchmark shows frontier AI agents can now sustain autonomous work for up to 14.5 hours — a capability whose security implications Google's first confirmed AI-assisted zero-day exploit illustrates in practice.
Source: METR (Model Evaluation & Threat Research); updated quarterly (approx.)
Why this matters
Google's threat intelligence team confirmed on 11 May 2026 that a threat actor used AI to develop a zero-day exploit — the first publicly documented case of AI being used to discover and weaponise a previously unknown vulnerability. The finding matters less as an isolated incident than as a proof-of-concept: AI-assisted vulnerability research can now move faster than defender response cycles built around human-speed reconnaissance.
The technical basis for that shift is visible in METR's benchmark data. Currently, frontier AI agents sustain autonomous work on tasks rated at up to 14.5 hours of human expert effort — measured at the 50% success threshold on METR's Time Horizon 1.1 evaluation. That figure, recorded for Claude Opus 4.6 in February 2026, represents a roughly 5,800-fold increase from the 9-second baseline measured for early GPT-3 agents in mid-2020. The progression was not smooth: the horizon stood at approximately 4 minutes for GPT-4 in March 2023, reached around 40 minutes by October 2024, then jumped to 1 hour 15 minutes by March 2025 and 14.5 hours by February 2026. Sustained vulnerability research — scanning codebases, generating candidate exploits, iterating on failures — sits squarely within this range of task complexity. An AI agent capable of multi-hour autonomous software engineering work is, technically, an agent capable of multi-hour autonomous security research.
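As a sanity check, the arithmetic behind those milestones can be reproduced in a few lines. The sketch below is illustrative only: the dates and durations come from the figures quoted above, the July anchor for "mid-2020" and the 30.44-day average month are assumptions, and nothing here reflects METR's own tooling.

```python
import math
from datetime import date

# (date, horizon in seconds): milestones as quoted in this article
milestones = [
    (date(2020, 7, 1), 9),                  # early GPT-3 agents, mid-2020 (assumed July)
    (date(2023, 3, 1), 4 * 60),             # GPT-4, ~4 minutes
    (date(2024, 10, 1), 40 * 60),           # ~40 minutes
    (date(2025, 3, 1), 75 * 60),            # 1 hour 15 minutes
    (date(2026, 2, 1), int(14.5 * 3600)),   # Claude Opus 4.6, 14.5 h = 52,200 s
]

first_date, first_h = milestones[0]
last_date, last_h = milestones[-1]

fold = last_h / first_h                          # ~5,800x, matching the text
months = (last_date - first_date).days / 30.44   # ~67 months elapsed
doublings = math.log2(fold)                      # ~12.5 doublings

print(f"overall increase: {fold:,.0f}x")
print(f"{doublings:.1f} doublings over {months:.0f} months "
      f"-> ~{months / doublings:.1f} months per doubling")
```

Run as written, this prints an overall increase of about 5,800x and roughly 5.4 months per doubling, consistent with the four-to-seven-month range cited below.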
The attack surface that such capability can target is now substantial. McKinsey's 2025 survey found that 79% of large organisations use generative AI in at least one function, and EU Eurostat data places enterprise AI adoption across firms with ten or more employees at 20% in 2025, up 12 percentage points since 2023. Each AI-integrated workflow likely introduces dependencies on third-party models, APIs, and inference infrastructure — systems whose vulnerability surfaces are still being mapped by defenders and attackers alike. Google's discovery suggests the mapping is now a competitive race, and AI is accelerating both sides. For enterprise security teams, the operational implication is that threat modelling assumptions calibrated to human-speed adversarial research may need revision.
The METR dataset tracked here measures AI agent capability on software-oriented tasks in controlled sandboxed environments, not real-world exploit development. The benchmark does not directly predict attack sophistication, and individual point estimates carry roughly ±30–50% multiplicative uncertainty. Nevertheless, for investment and strategy teams assessing AI-related security risk, the trajectory — from 9 seconds in 2020 to 52,200 seconds (14.5 hours) today, doubling roughly every four to seven months — provides the most systematic quantitative proxy currently available for the pace at which autonomous AI capability is expanding into territory previously reserved for skilled human specialists.
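That measurement noise matters less than it might appear. Because the growth spans more than twelve doublings, even a large multiplicative error on a single endpoint moves the implied doubling time only slightly. The check below is illustrative: the 67-month window and the endpoint values are taken from the figures above, and the ±30–50% scaling is applied to the latest point only.

```python
import math

months = 67            # mid-2020 to February 2026, approximately
ratio = 52_200 / 9     # 14.5-hour horizon vs the 9-second baseline

# Scale the latest point estimate by -30% and +50% and recompute
for scale in (0.7, 1.0, 1.5):
    doublings = math.log2(ratio * scale)
    print(f"endpoint x{scale}: ~{months / doublings:.1f} months per doubling")
```

The estimate varies by only a few weeks per doubling (roughly 5.1 to 5.6 months), which is why the trend line is a steadier signal than any single point estimate.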