NVIDIA Dynamo Gets Agentic AI Overhaul With 97% Cache Hit Rates

Lawrence Jengar
Apr 17, 2026 23:22

NVIDIA unveils major Dynamo updates targeting AI coding agents, achieving up to 97% KV cache hit rates and 4x latency improvements for enterprise deployments.

NVIDIA has released a comprehensive update to its Dynamo inference framework specifically optimized for AI coding agents, addressing a critical bottleneck as enterprise adoption of automated code generation accelerates. The company reports achieving up to 97.2% cache hit rates for multi-agent workflows—a metric that directly translates to reduced compute costs and faster response times.

The timing isn’t accidental. Stripe’s internal agents now generate over 1,300 pull requests weekly. Ramp attributes 30% of its merged PRs to AI agents. Spotify reports 650+ agent-generated PRs monthly. Behind each of these workflows sits an inference stack under intense pressure from repeated context processing.

The Cache Problem Nobody Talks About

Here’s what makes agentic AI different from chatbots: a coding agent like Claude Code or Codex makes hundreds of API calls per session, each carrying the full conversation history. After the first call writes the conversation prefix to KV cache, every subsequent call hits 85-97% cache on the same worker. NVIDIA measured an 11.7x read/write ratio—the system reads from cache nearly 12 times for every token written.

Without cache-aware routing, turn 2 of a conversation has roughly a 1/N chance of landing on the same worker as turn 1. Every miss forces complete prefix recomputation. For a 200K context window, that’s expensive.

Three-Layer Architecture

Dynamo’s update attacks the problem at three levels. The frontend now supports multiple API protocols—v1/responses, v1/messages, and v1/chat/completions—through a common internal representation. This matters because newer APIs use typed content blocks, letting the orchestrator see boundaries between thinking, tool calls, and text to apply different cache policies per block type.

The new “agent hints” extension allows harnesses to attach structured metadata to requests: priority levels, estimated output length, and speculative prefill flags. A harness can signal “warm this cache ahead of time” when it knows a tool call is about to return.

At the routing layer, NVIDIA’s Flash Indexer now handles 170 million operations per second for KV-aware placement decisions. The NeMo Agent Toolkit team built a custom router using these APIs and measured 4x reduction in p50 time-to-first-token and up to 63% latency improvement for priority-tagged requests under memory pressure.

Rethinking Cache Eviction

Standard LRU eviction treats all cached data identically—a fundamental mismatch with how agents actually work. System prompts get reused every turn. Reasoning tokens inside <think> blocks? Typically zero reuse after the loop closes, yet they account for roughly 40% of generated tokens.

The update introduces selective retention with per-region control. Teams can specify that system prompt blocks evict last, conversation context survives 30-second tool call gaps, and decode tokens go first. TensorRT-LLM’s new TokenRangeRetentionConfig enables this granularity within single requests.

NVIDIA is also building toward a four-tier memory hierarchy—GPU, CPU, local NVMe, and remote storage—where blocks flow automatically via write-through. When one worker computes KV for a prefix, any other worker can load those blocks via RDMA instead of recomputing. Four redundant prefill computations become one compute and three loads.

What This Means for Deployment

The company has been running internal Dynamo deployments of GLM-5 and MiniMax2.5 to power Codex and Claude Code harnesses, benchmarking against closed-source inference. They’re targeting parity on cache reuse performance with optimized recipes coming in the next few weeks.

For teams already running open-source models on their own GPUs, the gap with managed API providers just got smaller. The cache_control API mirrors Anthropic’s prompt caching semantics, so migration paths exist for teams familiar with that interface.

The agent hints specification remains v1, and NVIDIA is actively soliciting feedback from teams building agent harnesses on which signals prove most useful. Given that Dynamo 1.0 launched just last month with major cloud provider adoption, expect rapid iteration as enterprise agentic workloads scale.

Image source: Shutterstock

Credit: Source link