Anthropic Study Reveals AI Agents Run 45 Minutes Autonomously as Trust Builds

February 18, 2026
in Blockchain
Reading Time: 3 min read


Felix Pinkston
Feb 18, 2026 20:03

New Anthropic research shows Claude Code autonomy nearly doubled in 3 months, with experienced users granting more independence while maintaining oversight.
AI agents are working independently for significantly longer periods as users develop trust in their capabilities, according to new research from Anthropic published February 18, 2026. The study, which analyzed millions of human-agent interactions, found that the longest-running Claude Code sessions nearly doubled from under 25 minutes to over 45 minutes between October 2025 and January 2026.

The findings arrive as Anthropic rides a wave of investor confidence, having just closed a $30 billion Series G round that valued the company at $380 billion. That valuation reflects growing enterprise appetite for AI agents—and this research offers the first large-scale empirical look at how humans actually work with them.

Trust Builds Gradually, Not Through Capability Jumps

Perhaps the most striking finding: the increase in autonomous operation time was smooth across model releases. If autonomy were purely about capability improvements, you’d expect sharp jumps when new models dropped. Instead, the steady climb suggests users are gradually extending trust as they gain experience.

The data backs this up. Among new Claude Code users, roughly 20% of sessions use full auto-approve mode. By the time users hit 750 sessions, that number exceeds 40%. But here’s the counterintuitive part—experienced users also interrupt Claude more frequently, not less. New users interrupt in about 5% of turns; veterans interrupt in roughly 9%.

What’s happening? Users aren’t abandoning oversight. They’re shifting strategy. Rather than approving every action upfront, experienced users let Claude run and step in when something needs correction. It’s the difference between micromanaging and monitoring.

Claude Knows When to Ask

The research revealed something unexpected about Claude’s own behavior: on the most difficult tasks, Claude-initiated pauses actually exceed human-initiated interruptions, with the AI stopping to ask clarifying questions more than twice as often as humans step in.

Common reasons Claude stops itself include presenting users with choices between approaches (35% of pauses), gathering diagnostic information (21%), and clarifying vague requests (13%). Humans, meanwhile, typically interrupt to provide missing technical context (32%) or because Claude was running slowly or doing more work than was asked (17%).

This suggests Anthropic’s training for uncertainty recognition is working. Claude appears calibrated to its own limitations—though the researchers caution it may not always stop at the right moments.

Software Dominates, But Riskier Domains Emerge

Software engineering accounts for nearly 50% of all agentic tool calls on Anthropic’s public API. That concentration makes sense—code is testable, reviewable, and relatively low-stakes if something breaks.

But the researchers found emerging usage in healthcare, finance, and cybersecurity. Most actions remain low-risk and reversible—only 0.8% of observed actions appeared irreversible, like sending customer emails. Still, the highest-risk clusters involved sensitive security operations, financial transactions, and medical records.

The team acknowledges limitations: many high-risk actions may actually be red-team evaluations rather than production deployments. They can’t always tell the difference from their vantage point.

What This Means for the Industry

Anthropic’s researchers argue against mandating specific oversight patterns like requiring human approval for every action. Their data suggests such requirements would create friction without safety benefits—experienced users naturally develop more efficient monitoring strategies.

Instead, they’re calling for better post-deployment monitoring infrastructure across the industry. Pre-deployment testing can’t capture how humans actually interact with agents in practice. The patterns they observed—trust building over time, shifting oversight strategies, agents limiting their own autonomy—only emerge in real-world usage.

For enterprises evaluating AI agent deployments, the research offers a concrete benchmark: even power users at the extreme end of the distribution are running Claude autonomously for under an hour at a stretch. The gap between what models can theoretically handle (METR estimates five hours for comparable tasks) and what users actually permit suggests significant headroom remains—and that trust, not capability, may be the binding constraint on adoption.

Image source: Shutterstock
CryptoABC.net

This is an Australian online news/education portal that aims to provide the latest crypto news, real-time updates, education and reviews within Australia and around the world.