Notes from the Tail

KV-Cache Is the New Heap Allocator: Tail Latency at Scale for LLM Inference

Abstract: KV-cache management has become the defining systems challenge of LLM inference at scale. I argue that heap allocation is the most productive mental model we have for it. The views here are my own: a synthesis of public research, architectural reasoning, and opinions formed from watching this space evolve.

Introduction

In The Tail at Scale, Dean and Barroso showed that rare slowdowns stop being rare at scale. The tail dominates because variability amplifies through shared resources and queueing.

A decade later, Large Language Model (LLM) inference has arrived at the same inflection point. The dominant source of state is GPU memory, specifically the KV-cache. This post makes three claims:

  1. KV-cache is the new heap allocator for LLM inference systems.
  2. Managing this heap is now a semantic memory problem, not just a compute problem.
  3. Solving tail latency requires moving beyond local paging to disaggregated, CXL-backed sharing.


Tail Inside the GPU

The original Tail at Scale identified shared resources and background noise as the killers of predictability. In LLM systems, this noise is amplified by the KV-cache: the dynamic memory that stores the context (keys and values) of every token in a conversation.

Unlike model weights (static, read-only), the KV-cache is mutable, per-request state: it grows with every generated token and must stay resident for as long as the conversation might reference it.

This is not a cache in the traditional sense. It is dynamic, growing, long-lived state; closer to a process's heap than to an LRU lookup table.


KV-Cache as Stateful Memory

Every generated token allocates new key and value tensors in every attention layer, and that memory cannot be released until the request is finished with it. Allocation happens continuously, lifetimes are unpredictable, and the allocator never gets to start from a clean slate.

To make this concrete: for a 70B-parameter model with grouped-query attention, each token's KV state across all layers is on the order of a few hundred kilobytes. A single 128k-context request can occupy tens of gigabytes of HBM. A fleet serving thousands of concurrent requests is managing a heap measured in hundreds of gigabytes to terabytes of live, fragmented state.
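
As a sanity check on those numbers, here is a minimal back-of-the-envelope sketch. It assumes a Llama-3-70B-style configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128, 16-bit KV values); the exact figures vary with model architecture and quantization.

  # Back-of-the-envelope KV-cache sizing (assumed Llama-3-70B-style config).
  N_LAYERS = 80          # transformer layers
  N_KV_HEADS = 8         # KV heads after grouped-query attention
  HEAD_DIM = 128         # dimension per head
  BYTES_PER_VALUE = 2    # fp16/bf16 KV entries

  # Keys + values, for every layer, for one token.
  bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
  print(f"KV per token:       {bytes_per_token / 1024:.0f} KiB")          # ~320 KiB

  # One 128k-context request.
  ctx = 128_000
  print(f"128k-token request: {ctx * bytes_per_token / 2**30:.1f} GiB")   # ~39 GiB

  # A fleet of 2,000 concurrent requests averaging 4k tokens of live context.
  fleet_bytes = 2_000 * 4_000 * bytes_per_token
  print(f"Fleet (2k x 4k):    {fleet_bytes / 2**40:.2f} TiB")             # ~2.4 TiB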


Placement Is the Real Constraint

Frank Denneman has argued that in modern GPU systems, placement, not raw capacity, is the defining problem.

It's not enough that memory exists. It must exist in the right place, at the right time, in the right shape.

KV-cache makes this painfully concrete: requests arrive with wildly different context lengths and lifetimes, and they leave irregular holes behind when they finish. Over time, the GPU memory map becomes fragmented. The allocator can no longer place new requests efficiently, even when free memory appears sufficient.

This is classic external fragmentation, the same problem that has plagued heap allocators for decades. But the GPU version is arguably worse: allocations are measured in gigabytes rather than bytes, HBM is small and expensive, and the cost of evicting a block is not a page fault but a full recomputation of the context it held.

Architectural Evolution

Early solutions like PagedAttention (vLLM) solved basic fragmentation by borrowing virtual memory concepts, splitting KV-cache into fixed-size pages with an indirection table. This was a major step forward, analogous to moving from contiguous allocation to paged memory.
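
A minimal sketch of the idea (not vLLM's actual implementation): logical KV blocks are mapped to physical blocks through an indirection table, so a request's cache no longer needs to be contiguous in HBM.

  # Toy paged KV allocator: fixed-size blocks plus an indirection table.
  # Illustrative only; real systems add eviction, copy-on-write, and GPU kernels.
  BLOCK_TOKENS = 16  # tokens per block (vLLM uses a similarly small fixed size)

  class PagedKVAllocator:
      def __init__(self, num_physical_blocks: int):
          self.free_blocks = list(range(num_physical_blocks))
          self.block_table = {}   # request_id -> list of physical block ids
          self.token_count = {}   # request_id -> tokens written so far

      def append_token(self, request_id: str) -> None:
          n = self.token_count.get(request_id, 0)
          if n % BLOCK_TOKENS == 0:          # current block is full (or first token)
              if not self.free_blocks:
                  raise MemoryError("out of KV blocks: evict or spill")
              self.block_table.setdefault(request_id, []).append(self.free_blocks.pop())
          self.token_count[request_id] = n + 1

      def release(self, request_id: str) -> None:
          """Free a finished request; its blocks are reusable wherever they sit."""
          self.free_blocks.extend(self.block_table.pop(request_id, []))
          self.token_count.pop(request_id, None)

  alloc = PagedKVAllocator(num_physical_blocks=1024)
  for _ in range(40):                 # a 40-token request spans ceil(40/16) = 3 blocks
      alloc.append_token("req-1")
  print(alloc.block_table["req-1"])   # three non-contiguous physical blocks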

But PagedAttention operates within a single serving instance; it doesn't share or coordinate KV memory across replicas or machines. Today's scale requires a tiered, distributed approach.

1. Continuous Batching (Orca)

Before we can manage the heap well, we need to stop wasting it. Orca introduced continuous batching: instead of waiting for an entire batch to finish before admitting new requests, the scheduler inserts and removes requests at the iteration level. This is the equivalent of moving from batch free() to incremental garbage collection; it dramatically reduces the amount of dead KV memory sitting in HBM waiting for a batch to complete.
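
A sketch of the scheduling idea, not Orca's actual code: admission and retirement happen every iteration, so a finished request returns its KV memory immediately instead of waiting for the slowest member of a static batch.

  from collections import deque
  from dataclasses import dataclass

  # Toy iteration-level ("continuous") batching loop. Illustrative only: the real
  # engine's prefill/decode kernels are replaced with simple counters.

  @dataclass
  class Request:
      rid: str
      tokens_left: int   # tokens still to generate

  def serve(incoming, max_batch=4):
      waiting, running = deque(incoming), []
      step = 0
      while waiting or running:
          # Admit new work every iteration, not only when a whole batch drains.
          while waiting and len(running) < max_batch:
              running.append(waiting.popleft())       # (prefill would happen here)
          for req in list(running):
              req.tokens_left -= 1                    # one decode step per running request
              if req.tokens_left == 0:
                  running.remove(req)                 # KV memory is freed immediately
                  print(f"step {step}: {req.rid} done, KV released")
          step += 1

  serve([Request("short-1", 3), Request("long-1", 12), Request("short-2", 4)])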

2. Global Sharing with LMCache

LMCache treats KV-cache as a distributed hash table. When a GPU finishes a request, it doesn't just free the memory. It treats the KV blocks as immutable, content-addressable chunks. If another replica receives a prompt sharing the same prefix, it can pull those blocks rather than recomputing them, turning a cold request into a hot one.

This is analogous to a slab allocator with cross-process sharing: instead of each process maintaining its own heap, frequently used objects are promoted to a shared pool.
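
A minimal sketch of the content-addressing idea, under the assumption that KV chunks are keyed by a hash that commits to the entire prefix before them; this is the general technique, not LMCache's actual API or storage format.

  import hashlib

  CHUNK = 256  # tokens per shareable KV chunk (assumed granularity)

  def chunk_keys(token_ids):
      """Content-addressed keys: each chunk's key commits to everything before it,
      so two prompts share a chunk only if they agree on the whole prefix."""
      keys, running = [], hashlib.sha256()
      for start in range(0, len(token_ids) - len(token_ids) % CHUNK, CHUNK):
          running.update(str(token_ids[start:start + CHUNK]).encode("utf-8"))
          keys.append(running.copy().hexdigest())
      return keys

  shared_store = {}  # key -> KV block placeholder, shared across replicas

  def lookup_or_compute(token_ids, compute_block):
      blocks, hits = [], 0
      for i, key in enumerate(chunk_keys(token_ids)):
          if key in shared_store:
              hits += 1                                 # reuse a block another replica built
          else:
              shared_store[key] = compute_block(i)      # cold chunk: pay the prefill cost once
          blocks.append(shared_store[key])
      return blocks, hits

  prompt_a = list(range(1024))                    # e.g. shared system prompt + document
  prompt_b = list(range(1024)) + [7, 7, 7] * 100  # same prefix, different continuation
  lookup_or_compute(prompt_a, lambda i: f"kv-{i}")
  _, hits = lookup_or_compute(prompt_b, lambda i: f"kv-{i}")
  print(f"prompt_b reused {hits} chunks")          # the shared 1024-token prefix is a hit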

3. Disaggregated Prefill and Decode

Systems like DistServe and Splitwise separate the prefill phase (processing the input prompt, compute-bound) from the decode phase (generating tokens one at a time, memory-bound) onto different GPU pools. This is a placement optimization: prefill GPUs can run at high utilization without worrying about KV-cache lifetime, while decode GPUs are optimized for managing long-lived KV state.

This directly addresses the fragmentation problem by separating the allocation pattern (bursty, large prefills) from the steady state pattern (incremental decode growth).
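
A sketch of the handoff, with hypothetical pool names; DistServe and Splitwise differ in how they schedule and transfer KV state, but the shape is the same: the prompt is processed on a compute-optimized pool, the resulting KV blocks are shipped once, and the decode pool only ever sees incremental growth.

  # Toy disaggregated prefill/decode flow. All names here are illustrative.
  def prefill(prompt_tokens):
      """Compute-bound: process the whole prompt at once, emit its KV blocks."""
      return [f"kv[{i}:{i + 16}]" for i in range(0, len(prompt_tokens), 16)]

  def transfer(kv_blocks, src="prefill-pool", dst="decode-pool"):
      """One bulk transfer over the interconnect (NVLink/RDMA in real systems)."""
      print(f"moving {len(kv_blocks)} KV blocks {src} -> {dst}")
      return kv_blocks

  def decode(kv_blocks, max_new_tokens):
      """Memory-bound: append one token's worth of KV per step to the shipped cache."""
      for _ in range(max_new_tokens):
          kv_blocks.append("kv[+1]")
      return kv_blocks

  blocks = prefill(list(range(2048)))     # bursty, large allocation on the prefill pool
  blocks = transfer(blocks)               # pay the network cost once
  decode(blocks, max_new_tokens=64)       # steady incremental growth on the decode pool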

4. CXL-Based Memory Expansion

CXL (Compute Express Link) introduces a new tier in the memory hierarchy that changes the economics of KV-cache management.

Systems like Beluga explore using CXL-attached memory as a warm tier for KV-cache. Instead of the binary choice between "in HBM" and "evict and recompute", we get a middle ground: KV blocks that aren't actively needed for the current decode step can spill to CXL memory and be pulled back in microseconds rather than recomputed in milliseconds.

A back-of-the-envelope calculation: for a 70B model, storing 1 million tokens of KV state requires roughly 100-300 GB (depending on precision and GQA configuration). This is well within the range of a CXL-attached memory pool, but far beyond what a single GPU's HBM can hold. CXL makes the million-token warm heap physically plausible.

Note: CXL-based KV-cache management is still an emerging direction. Most production deployments today rely on HBM-only or HBM + CPU DRAM swapping via PCIe. The latency characteristics of CXL 2.0/3.0 for this workload are still being characterized. But the architectural direction is clear, and the memory capacity gap is only growing as context windows expand.
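
A minimal sketch of the tiering decision, assuming per-block access timestamps and a made-up HBM capacity; real systems must also account for transfer bandwidth and decode-step deadlines.

  import time

  # Toy two-tier KV placement: hot blocks in HBM, warm blocks spilled to a CXL pool.
  # The capacity and the LRU policy are illustrative assumptions.
  HBM_CAPACITY_BLOCKS = 4
  hbm, cxl = {}, {}          # block_id -> last access time

  def touch(block_id):
      """Access a block for the current decode step, promoting it to HBM if needed."""
      now = time.monotonic()
      if block_id in cxl:
          del cxl[block_id]            # pull back over CXL: microseconds, not a recompute
      hbm[block_id] = now
      while len(hbm) > HBM_CAPACITY_BLOCKS:
          victim = min(hbm, key=hbm.get)   # least recently used block spills, not evicted
          cxl[victim] = hbm.pop(victim)

  for b in ["req1/0", "req1/1", "req2/0", "req2/1", "req3/0", "req1/0"]:
      touch(b)
  print("HBM:", sorted(hbm), "CXL:", sorted(cxl))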

5. KV-Cache as a First-Class Distributed Resource (Mooncake)

Mooncake, from Moonshot AI, takes disaggregation to its logical conclusion: the KV-cache itself becomes a transferable, schedulable object managed by a global coordinator. Rather than each GPU managing its own local heap, a central Conductor makes placement decisions across the fleet, deciding not just where to run compute but where to place and move KV state.

This is the closest existing system to a true distributed heap allocator for KV-cache. The Conductor functions as a global memory manager: it tracks what KV blocks exist, where they live, where they're needed, and orchestrates proactive transfers rather than reactive evictions.
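
A sketch of what such a global placement decision might look like, with an entirely hypothetical Conductor class; Mooncake's real scheduler weighs many more signals (TTFT and time-between-token SLOs, transfer bandwidth, cache topology).

  # Hypothetical global KV coordinator: tracks where blocks live, picks a replica
  # that already holds the most useful blocks, and schedules transfers for the rest.
  class Conductor:
      def __init__(self, replicas):
          self.load = {r: 0 for r in replicas}        # outstanding requests per replica
          self.index = {r: set() for r in replicas}   # replica -> KV block keys resident there

      def register(self, replica, block_keys):
          self.index[replica].update(block_keys)

      def place(self, needed, load_weight=2.0):
          def score(r):
              return len(self.index[r] & needed) - load_weight * self.load[r]
          target = max(self.load, key=score)
          self.load[target] += 1
          # Proactively move blocks that exist elsewhere instead of recomputing them.
          transfers = [(src, target, k) for k in needed - self.index[target]
                       for src in self.index if k in self.index[src]]
          self.index[target] |= needed
          return target, transfers

  c = Conductor(["gpu-a", "gpu-b"])
  c.register("gpu-a", {"sys", "doc-1"})
  c.register("gpu-b", {"doc-2"})
  print(c.place({"sys", "doc-1", "doc-2"}))   # ('gpu-a', [('gpu-b', 'gpu-a', 'doc-2')])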


Real-World Failure: The RAG Tail Storm

Imagine a fleet running mixed traffic: 95% short chats, 5% heavy RAG (32k+ tokens).

  1. A RAG request arrives and allocates a massive block of the Heap (KV-cache).
  2. Short chats fill the gaps, fragmenting the remaining HBM.
  3. A new RAG request arrives. There is technically enough free memory, but it's not contiguous (or not in the right tier).
  4. The system triggers eviction. It wipes out the KV-cache of the first RAG request to make room.
  5. The first user sends another message in their conversation. The system must recompute the entire 32k-token context from scratch.
  6. This re-computation occupies GPU cycles and HBM that other requests were depending on. Those requests slow down. Their KV-cache lingers longer. Memory pressure increases further.

This is where the system hits the queueing knee. In any resource-constrained serving system there is a non-linear inflection point, well understood from queueing theory, where utilization and tail latency stop being proportional and become mutually reinforcing. Past this knee, P99 diverges: small increases in load produce enormous increases in tail latency. This is the same phenomenon that makes highway traffic collapse at roughly 85% of capacity, except here the highway is GPU HBM, and the cars are KV blocks that grow while they're stuck in traffic.
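
The knee is easy to see with a textbook M/M/1 approximation: mean response time scales like 1/(1 - utilization), so moving from 70% to 95% utilization multiplies it by six, and the tail grows even faster. A small illustration, using an arbitrary 50 ms service time:

  # M/M/1 back-of-envelope: mean response time W = S / (1 - rho), where S is the
  # mean service time and rho the utilization. The 50 ms figure is arbitrary.
  S = 0.050  # seconds per request
  for rho in (0.50, 0.70, 0.85, 0.95, 0.99):
      W = S / (1 - rho)
      print(f"utilization {rho:.0%}: mean response {W * 1000:6.0f} ms")
  # 50% -> 100 ms, 70% -> 167 ms, 85% -> 333 ms, 95% -> 1000 ms, 99% -> 5000 ms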

This failure mode is invisible to average latency monitoring. It only shows up in the tail, and it only shows up at scale: exactly the pattern Dean and Barroso warned about.


Tail-Tolerant Techniques

1. KV-Aware Routing (The Stateful Load Balancer)

Traditional load balancers are stateless; they route based on server load, not server state. Modern LLM routers must be Radix-Tree Aware:

  1. The router tokenizes the incoming prefix.
  2. It checks a global index (like SGLang's Radix Cache or NVIDIA Dynamo-style routing).
  3. It routes the request to the GPU that already has that specific KV data in its HBM or CXL tier.

The hard part: this is easier to describe than to implement. The router must make decisions in microseconds, but tokenizing a prefix and performing a distributed radix lookup has real latency. Prefix matching is fuzzy in multi-turn conversations: the system prompt matches, but conversation histories diverge. And there's a fundamental tension between KV-locality (route to the GPU with the best cache hit) and load-balancing (route to the least loaded GPU). The GPU with the warmest cache might also be the most overloaded.

This remains an active research area. Production systems today rely on heuristics rather than optimal solutions: sticky sessions, consistent hashing on system prompts, prefix-aware routing with a load-based fallback.
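
A sketch of one such heuristic, assuming we already have block keys for the prompt's prefix and a per-replica view of resident prefixes (both hypothetical inputs): score by longest matched prefix, fall back to plain load balancing when no replica under the load cap matches anything.

  # Toy prefix-aware router: longest matched prefix wins unless the owner is overloaded.
  # The threshold and scoring rule are illustrative, not any production system's policy.
  def longest_prefix_match(prompt_blocks, resident_blocks):
      n = 0
      for a, b in zip(prompt_blocks, resident_blocks):
          if a != b:
              break
          n += 1
      return n

  def route(prompt_blocks, replicas, max_load=8):
      # replicas: {name: {"prefix": [block keys...], "load": int}}
      best, best_hits = None, 0
      for name, r in replicas.items():
          hits = longest_prefix_match(prompt_blocks, r["prefix"])
          if r["load"] < max_load and hits > best_hits:
              best, best_hits = name, hits
      if best is None:
          best = min(replicas, key=lambda n: replicas[n]["load"])  # plain load balancing
      return best

  replicas = {
      "gpu-a": {"prefix": ["sys", "doc-1", "turn-1"], "load": 7},
      "gpu-b": {"prefix": ["sys"], "load": 2},
  }
  print(route(["sys", "doc-1", "turn-2"], replicas))  # gpu-a: warmest cache, still under the cap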

2. Differentiated Service Classes

Not all KV blocks are equal. A priority-aware eviction policy can dramatically reduce tail latency for high-value requests: pin shared system prompts and hot prefixes, spill interactive sessions to a warm tier before dropping them, and evict batch or offline work first, where recomputation is tolerable.
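
One way to encode this, sketched with made-up class names and a simple tie-break on recency (not any specific system's policy): eviction candidates are ordered by service class first and last access second, with pinned blocks excluded entirely.

  from enum import IntEnum

  # Illustrative service classes; lower value means evicted first.
  class KVClass(IntEnum):
      BATCH = 0          # offline / best-effort: recompute is acceptable
      INTERACTIVE = 1    # live chat sessions: spill before dropping
      PINNED = 2         # shared system prompts, hot prefixes: never evict under pressure

  def eviction_order(blocks):
      """blocks: list of (block_id, KVClass, last_access_ts).
      Returns eviction candidates, cheapest-to-lose first."""
      evictable = [b for b in blocks if b[1] != KVClass.PINNED]
      return sorted(evictable, key=lambda b: (b[1], b[2]))   # class, then least recent

  blocks = [
      ("sys-prompt", KVClass.PINNED, 100.0),
      ("chat-42",    KVClass.INTERACTIVE, 90.0),
      ("nightly-7",  KVClass.BATCH, 99.0),
  ]
  print([b[0] for b in eviction_order(blocks)])   # ['nightly-7', 'chat-42']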

3. Observability: Measuring the Heap

If we only track Average Latency, we will miss this entire class of failure. We need heap health metrics:

  Metric                             What It Tells You
  KV Fragmentation %                 Are we losing usable capacity to holes in the memory map?
  Cache-Hit-to-Recompute Ratio       How much compute are we burning on redundant prefill?
  TTFT Variance by Context Length    Are long-context requests bullying short ones?
  Eviction Rate by Service Class     Are we evicting high-priority sessions?
  CXL Spill Rate                     How often are we falling out of the fast tier?

These metrics are the equivalent of tracking heap fragmentation, page fault rates, and swap usage in a traditional operating system. Without them, we're flying blind.
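
As one concrete example, a sketch of how the first metric might be computed from a free/used block map (the bitmap representation is an assumption): compare total free capacity against the largest contiguous free run.

  # Toy fragmentation metric: 1 - largest_free_run / total_free_blocks.
  # High values mean plenty of free blocks exist but no request-sized hole does.
  def kv_fragmentation(block_map):
      free = sum(1 for b in block_map if b == 0)
      run = best = 0
      for b in block_map:
          run = run + 1 if b == 0 else 0
          best = max(best, run)
      return 0.0 if free == 0 else 1.0 - best / free

  # 8 of 16 blocks are free, but the largest hole is only 2 blocks wide.
  print(f"{kv_fragmentation([0,1,0,1,0,0,1,1,0,1,0,1,0,0,1,1]):.2f}")  # 0.75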


The Hardware Horizon

The architectural techniques described in this post (paged allocation, tiered memory, disaggregated prefill/decode, KV-aware routing) are the best responses available within the current hardware constraints. Those constraints are defined by three realities:

  1. HBM capacity is growing slowly. NVIDIA's B200 offers 192 GB of HBM3e. The next generation (Rubin, expected ~2026-2027) may push toward 256-384 GB. But context windows are growing faster than HBM; the gap is widening, not closing.

  2. CXL is maturing but not yet ubiquitous. CXL 3.0 promises memory pooling and sharing across hosts with improved latency characteristics. But widespread production adoption in inference fleets is likely 2-3 years away. Until then, the HBM to host DRAM boundary remains a painful cliff.

  3. Architectural alternatives are emerging but unproven. Linear attention variants (like Mamba-style state space models) replace the KV-cache with fixed-size recurrent state, potentially eliminating the heap problem entirely. Hybrid architectures that mix attention and recurrence are being explored. But pure transformer attention, and its KV-cache, will dominate production for the foreseeable future.

This means that for the next 2-4 years, the solution space is primarily software and systems architecture. The engineers and researchers working on this problem are operating in a regime where the hardware is essentially fixed and the workload keeps growing.

The degrees of freedom are in scheduling, placement, routing, eviction policy, and memory tiering: exactly the domains where systems programming has decades of hard-won wisdom. The innovators who will define the next generation of inference infrastructure are not waiting for better hardware. They are building better allocators.


Closing Thoughts

In the era of 100k+ context windows, LLM inference has become a systems programming problem. The GPU is no longer just a math co-processor; it is a complex memory environment with allocation, fragmentation, placement, and eviction dynamics that would be familiar to anyone who has debugged jemalloc or tuned a garbage collector.

If we treat the KV-cache like a static buffer, tail latency will destroy the user experience. If we treat it like a tiered, disaggregated heap, we can build systems that stay robust at scale.

Dean and Barroso taught us in 2013, in The Tail at Scale, that you cannot engineer reliability without engineering for the tail. The KV-cache is where that lesson meets the GPU. The heap is back, and it's measured in terabytes.


References

Dean, J. and Barroso, L. A. "The Tail at Scale." Communications of the ACM, 56(2), 2013.
Kwon, W. et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023.
Yu, G.-I. et al. "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022.
Zhong, Y. et al. "DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving." OSDI 2024.
Patel, P. et al. "Splitwise: Efficient Generative LLM Inference Using Phase Splitting." ISCA 2024.
Qin, R. et al. "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving." 2024.
Zheng, L. et al. "SGLang: Efficient Execution of Structured Language Model Programs." 2024.
LMCache (open-source project), https://github.com/LMCache/LMCache