Introduction
The KV (key-value) cache is the most important data structure in LLM inference. Understanding how to optimize it is the difference between a serving system that handles 100 requests/second and one that handles 1,000.
This guide covers what the KV cache is, why it's a bottleneck, and the engineering techniques production systems use to optimize it.
What Is the KV Cache?
During transformer attention, each token attends to all previous tokens. To avoid recomputing attention keys and values for each new token, we cache them:
```python
# Without KV cache: O(n²) compute per generated token
for step in range(generation_length):
    logits = model(full_sequence_so_far)  # recomputes attention for every prior token

# With KV cache: O(n) compute per generated token
kv_cache = {}
for step in range(generation_length):
    new_token_logits, kv_cache = model(new_token, kv_cache)
```
The cache stores intermediate attention states (K and V tensors) for each layer and each token in the sequence. It trades memory for compute.
Memory Cost
For a model with:
- L layers
- H attention heads
- D head dimension
- S sequence tokens
KV cache size = 2 × L × H × D × S × dtype_bytes
For LLaMA-3 70B (80 layers, 8 GQA heads, 128 dim, fp16):
- Per token: 2 × 80 × 8 × 128 × 2 = 327,680 bytes ≈ 327 KB
- 4K context: 1.3GB per request
- Batch of 10 requests: 13 GB, just for KV cache
This is why GPU memory fills up fast.
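This arithmetic is easy to script; a quick sanity check of the formula above (the function name `kv_cache_bytes` is mine):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Total KV cache size: 2 (K and V) x layers x heads x head_dim x tokens x bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# LLaMA-3 70B: 80 layers, 8 GQA KV heads, 128 head dim, fp16 (2 bytes)
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)        # 327,680 bytes
per_request_4k = kv_cache_bytes(80, 8, 128, seq_len=4096)
print(per_token, per_request_4k / 1e9)                   # ~1.34 GB per 4K-context request
```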
The Fragmentation Problem
Before PagedAttention, most inference engines pre-allocated a contiguous block of GPU memory for each request's maximum context length. This caused two problems:
- Internal fragmentation: A request using 1K of a 4K allocation wastes 75% of its reserved memory
- Reservation uncertainty: You don't know the final sequence length upfront
A batch of 10 requests each reserving 4K context might use only 2K on average, wasting 50% of KV cache memory.
PagedAttention: OS Paging for KV Caches
vLLM introduced PagedAttention, inspired by virtual memory in operating systems.
The Core Idea
Instead of one contiguous block per request, split KV cache into fixed-size pages (typically 16 tokens each) and maintain a page table per request:
Request A: page_table → [Page 3, Page 7, Page 1, ...]
Request B: page_table → [Page 2, Page 5, Page 8, ...]
Pages are allocated on demand as generation proceeds. No reservation needed.
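A toy allocator illustrates the bookkeeping; the class and method names below are illustrative, not vLLM's actual API:

```python
class BlockManager:
    """Toy page allocator: fixed-size pages handed out on demand."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # pool of physical page IDs
        self.page_tables = {}                     # request_id -> [physical page IDs]

    def append_token(self, request_id, token_index):
        """Allocate a new physical page only when a request crosses a page boundary."""
        table = self.page_tables.setdefault(request_id, [])
        if token_index % self.page_size == 0:     # first token of a new logical page
            table.append(self.free_pages.pop())
        return table[token_index // self.page_size]

mgr = BlockManager(num_pages=64)
for i in range(40):                               # 40 tokens -> ceil(40/16) = 3 pages
    mgr.append_token("req-A", i)
print(mgr.page_tables["req-A"])                   # three physical page IDs
```

Note that waste is bounded by the partially filled last page, matching the "at most page_size − 1 tokens" figure below.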
Advantages
- Near-zero internal fragmentation: Waste at most (page_size - 1) tokens per request
- Dynamic allocation: Memory grows with actual usage
- Sharing across requests: Multiple requests can share the same physical page (for shared prefixes)
Implementation Complexity
PagedAttention requires a custom attention kernel that:
- Looks up physical page addresses from the page table
- Gathers K/V tensors from non-contiguous memory
- Computes attention across the gathered tensors
CUDA makes this tractable with block-sparse attention patterns.
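The addressing logic of that gather step can be sketched in NumPy (shapes and names are illustrative; a real kernel fuses the gather into the attention computation rather than materializing a contiguous copy):

```python
import numpy as np

# Physical K pool: (num_pages, page_size, num_heads, head_dim). Random stand-in values.
page_size, heads, dim = 16, 8, 128
k_pool = np.random.randn(64, page_size, heads, dim).astype(np.float32)

def gather_keys(page_table, seq_len):
    """Assemble a request's logically contiguous K tensor from scattered physical pages."""
    pages = k_pool[page_table]                      # (n_pages, page_size, heads, dim)
    return pages.reshape(-1, heads, dim)[:seq_len]  # trim the partially filled last page

k = gather_keys(page_table=[3, 7, 1], seq_len=40)
print(k.shape)  # (40, 8, 128)
```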
Prefix Caching
Many production applications send requests with a shared prefix:
- System prompts (same for all users)
- Few-shot examples (same per use case)
- RAG context (shared across a session)
Prefix caching (also called prompt caching) stores the KV cache for these shared portions and reuses them across requests.
How It Works
Request 1: [system_prompt][user_query_1]
Request 2: [system_prompt][user_query_2]
Without caching:
- Compute KV for system_prompt twice
With prefix caching:
- Compute KV for system_prompt once
- Cache by hash of token IDs
- Reuse for request 2 (cache hit)
Cache Key Design
The cache key is typically a hash of the token IDs:
```python
import hashlib

def lookup_prefix(token_ids, prefix_length, kv_cache):
    # Hash the prefix token IDs to form the cache key (a sketch; systems often hash per page)
    prefix_hash = hashlib.sha256(str(token_ids[:prefix_length]).encode()).hexdigest()
    if prefix_hash in kv_cache:
        return kv_cache[prefix_hash]  # cache hit
    return None
```
With PagedAttention, entire pages can be shared via copy-on-write: the cached prefix pages are referenced by multiple requests without duplication.
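The sharing side can be sketched as reference-counted pages with copy-on-write; this is a simplified model, not vLLM's implementation:

```python
class SharedPages:
    """Sketch of copy-on-write page sharing via reference counts."""

    def __init__(self):
        self.refcount = {}   # physical page ID -> number of requests referencing it
        self.next_page = 0

    def alloc(self):
        page_id, self.next_page = self.next_page, self.next_page + 1
        self.refcount[page_id] = 1
        return page_id

    def share(self, page_id):
        self.refcount[page_id] += 1
        return page_id

    def write(self, page_id):
        """Before mutating a shared page, give the writer a private copy."""
        if self.refcount[page_id] > 1:
            self.refcount[page_id] -= 1
            return self.alloc()  # a real system would also copy the page contents here
        return page_id

pool = SharedPages()
p = pool.alloc()   # prefix page, owned by request 1
pool.share(p)      # request 2 reuses it: refcount == 2
q = pool.write(p)  # request 2 writes -> gets a private copy; p is left intact
```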
Production Impact
For applications with long system prompts (common in enterprise deployments):
- 60-80% of tokens are shared prefix
- Effective throughput improvement: 2-4x
- Latency improvement: significant reduction in time-to-first-token
Anthropic and several other providers now offer explicit prefix caching in their APIs, with per-token pricing discounts for cached tokens.
KV Cache Quantization
KV caches can be quantized independently from model weights:
FP8 KV Cache
Quantizing from FP16 to FP8:
- 2x memory reduction
- ~0.5-1% quality degradation on most tasks
- Now supported natively in NVIDIA H100 tensor cores
```python
# vLLM example (check the current vLLM docs for exact option names)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3-70b",
    kv_cache_dtype="fp8_e5m2",  # FP8 (e5m2) KV cache
)
```
INT4 KV Cache
More aggressive: 4x memory reduction, but requires careful per-channel scaling to avoid accuracy collapse. Used in some mobile and edge deployments.
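A minimal sketch of that symmetric per-channel scaling (real INT4 schemes also pack two 4-bit values per byte, which is omitted here):

```python
import numpy as np

def quantize_int4_per_channel(kv, axis=0):
    """Symmetric per-channel INT4 quantization: scale each channel into [-7, 7]."""
    scale = np.abs(kv).max(axis=axis, keepdims=True) / 7.0
    q = np.clip(np.round(kv / scale), -7, 7).astype(np.int8)  # int8 holds the 4-bit values
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.randn(16, 128).astype(np.float32)  # (tokens, head_dim)
q, scale = quantize_int4_per_channel(kv, axis=0)  # one scale per channel of head_dim
err = np.abs(dequantize(q, scale) - kv).max()     # bounded by half a scale step
```

Per-channel (rather than per-tensor) scales matter because K/V channels have very different dynamic ranges; a single shared scale would collapse the small-magnitude channels to zero.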
Key Insight
KV cache quantization is often more effective than weight quantization for throughput, because:
- Weights are read once per forward pass
- KV cache is read O(sequence_length) times per generation step
- Memory bandwidth is the bottleneck in decoding, not compute
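A back-of-the-envelope check makes the point; the ~3.35 TB/s figure is an assumed H100 HBM3 bandwidth:

```python
kv_bytes_per_request = 2 * 80 * 8 * 128 * 4096 * 2  # ~1.34 GB at 4K context (fp16)
batch = 10
hbm_bandwidth = 3.35e12                             # bytes/sec, approximate H100 HBM3

bytes_read_per_step = batch * kv_bytes_per_request  # every decode step re-reads all K/V
step_time_s = bytes_read_per_step / hbm_bandwidth
print(f"{step_time_s * 1e3:.1f} ms per decode step just reading KV cache")
```

Halving KV bytes (e.g. via FP8) halves this read time directly, which is why cache quantization shows up so clearly in decode throughput.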
Sliding Window Attention
For very long contexts, full KV cache storage becomes impractical. Sliding window attention (used in Mistral models) only attends to the last W tokens:
KV cache size = min(S, W) tokens
With W=4096, memory is bounded regardless of context length. Quality degrades for long-range dependencies but is acceptable for many tasks.
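The bounded cache can be modeled as a ring buffer over the last W token positions (a sketch, not Mistral's implementation):

```python
class RollingKVCache:
    """Ring buffer for sliding-window attention: keeps only the last W tokens' K/V."""

    def __init__(self, window):
        self.window = window
        self.buffer = [None] * window  # slot i holds K/V for one token position
        self.count = 0

    def append(self, kv):
        self.buffer[self.count % self.window] = kv  # overwrite the oldest slot
        self.count += 1

    def size(self):
        return min(self.count, self.window)

cache = RollingKVCache(window=4096)
for t in range(10_000):
    cache.append(("k", "v"))  # stand-in for real K/V tensors
print(cache.size())           # 4096: bounded regardless of context length
```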
Hybrid approaches: Combine full attention for some layers (for global context) and sliding window for others (for efficiency). This is used in Gemma 2 and several proprietary models.
Offloading to CPU/NVMe
When GPU memory is exhausted, KV cache pages can be offloaded:
GPU HBM (fast) → CPU DRAM (slower) → NVMe SSD (slowest)
The prefill phase can compute and immediately offload KV caches for low-priority requests. When the request is scheduled for decoding, pages are streamed back.
Bandwidth constraints:
- GPU ↔ CPU: ~50-100 GB/s (PCIe 5.0)
- CPU ↔ NVMe: ~10-15 GB/s (NVMe Gen4)
This only works when decoding latency requirements are loose (offline/async workloads).
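Plugging the earlier per-request size into those bandwidth figures gives rough restore latencies (the specific bandwidth values are assumptions drawn from the ranges above):

```python
kv_bytes = 1.34e9  # one 4K-context request's KV cache, from the earlier estimate
pcie_bw = 64e9     # PCIe 5.0 x16, ~64 GB/s (assumption)
nvme_bw = 12e9     # NVMe Gen4, ~12 GB/s (assumption)

print(f"CPU -> GPU restore: {kv_bytes / pcie_bw * 1e3:.0f} ms")
print(f"NVMe -> CPU restore: {kv_bytes / nvme_bw * 1e3:.0f} ms")
```

Tens to hundreds of milliseconds per restored request is tolerable for batch workloads but not for interactive chat, which is why offloading is usually reserved for low-priority traffic.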
Architecture Summary: vLLM's KV Cache System
```
┌───────────────────────────────────────────┐
│              GPU Memory Pool              │
│  ┌─────────────────────────────────────┐  │
│  │           KV Cache Pages            │  │
│  │   [Page 0] [Page 1] ... [Page N]    │  │
│  └─────────────────────────────────────┘  │
│  ┌─────────────────────────────────────┐  │
│  │            Block Manager            │  │
│  │  - Free page list                   │  │
│  │  - Per-request page tables          │  │
│  │  - Reference counting (sharing)     │  │
│  └─────────────────────────────────────┘  │
└───────────────────────────────────────────┘
                ↕ evict/load
┌───────────────────────────────────────────┐
│             CPU Memory (swap)             │
└───────────────────────────────────────────┘
```
Choosing the Right Optimizations
| Optimization | Memory savings | Quality impact | Complexity |
|---|---|---|---|
| PagedAttention | High (fragmentation) | None | Framework-level |
| Prefix caching | High (shared prompts) | None | Low |
| FP8 KV cache | 2x | Minimal | Low (hardware) |
| INT4 KV cache | 4x | Moderate | Moderate |
| Sliding window | Very high | Task-dependent | Model-level |
| CPU offloading | High | Latency impact | Moderate |
Conclusion
The KV cache is the central resource in LLM serving. Engineers who understand its structure and constraints can unlock dramatically higher throughput and lower costs. PagedAttention eliminates fragmentation, prefix caching eliminates redundant computation, and quantization reduces bandwidth pressure; each addresses a different dimension of the problem.
As models get longer context and higher QPS demands, KV cache engineering will only become more critical.
Explore more LLM infrastructure patterns in our LLM Inference at Scale course.