tutorial 2025-01-15 15 min read

KV Cache Optimization: The Engineering Core of Efficient LLM Serving

Everything you need to know about key-value caches in transformer serving — PagedAttention, prefix caching, quantized KV caches, and how companies like Anthropic and Together AI optimize memory at scale.


Introduction

The KV (key-value) cache is the most important data structure in LLM inference. Understanding how to optimize it is the difference between a serving system that handles 100 requests/second and one that handles 1,000.

This guide covers what the KV cache is, why it's a bottleneck, and the engineering techniques production systems use to optimize it.

What Is the KV Cache?

During transformer attention, each token attends to all previous tokens. To avoid recomputing attention keys and values for each new token, we cache them:

# Without KV cache: O(n²) compute per token
for step in range(generation_length):
    logits = model(full_sequence_so_far)  # recomputes everything

# With KV cache: O(n) compute per token
kv_cache = {}
for step in range(generation_length):
    new_token_logits, kv_cache = model(new_token, kv_cache)

The cache stores intermediate attention states (K and V tensors) for each layer and each token in the sequence. It trades memory for compute.
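The equivalence between incremental decoding with a cache and full recomputation can be checked with a minimal single-head sketch in NumPy (the shapes, random weights, and `attention` helper here are illustrative, not a real model):

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query token
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))                         # token representations
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Incremental decoding: append one K/V row per new token
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
cached_outputs = []
for t in range(5):
    K_cache = np.vstack([K_cache, x[t] @ Wk])
    V_cache = np.vstack([V_cache, x[t] @ Wv])
    cached_outputs.append(attention(x[t] @ Wq, K_cache, V_cache))

# Recomputing K/V from scratch at the last step gives the same answer
K_full, V_full = x @ Wk, x @ Wv
full_output = attention(x[4] @ Wq, K_full, V_full)
assert np.allclose(cached_outputs[-1], full_output)
```

The cache changes only *how* K and V are produced (appended once vs. recomputed every step), never the attention result.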

Memory Cost

For a model with:

  • L layers
  • H attention heads
  • D head dimension
  • S sequence tokens

KV cache size = 2 Ɨ L Ɨ H Ɨ D Ɨ S Ɨ dtype_bytes

For LLaMA-3 70B (80 layers, 8 GQA heads, 128 dim, fp16):

  • Per token: 2 Ɨ 80 Ɨ 8 Ɨ 128 Ɨ 2 bytes = 327,680 bytes ā‰ˆ 320 KB
  • 4K context: 1.3GB per request
  • Batch of 10 requests: 13GB — just for KV cache

This is why GPU memory fills up fast.
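The formula is easy to wrap in a helper for quick capacity planning (the config numbers below are the LLaMA-3 70B figures from above):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size: 2 (K and V) x layers x heads x head_dim x tokens x bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# LLaMA-3 70B style: 80 layers, 8 GQA KV heads, 128 head dim, fp16
per_token = kv_cache_bytes(80, 8, 128, 1)           # 327,680 bytes ā‰ˆ 320 KB
per_request_4k = kv_cache_bytes(80, 8, 128, 4096)   # ā‰ˆ 1.34 GB
print(per_token, per_request_4k / 1e9)
```

Multiply by your target batch size and the numbers explain why serving systems obsess over every byte of KV memory.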

The Fragmentation Problem

Before PagedAttention, most inference engines pre-allocated a contiguous block of GPU memory for each request's maximum context length. This caused two problems:

  1. Internal fragmentation: A request using 1K of a 4K allocation wastes 75% of its reserved memory
  2. Reservation uncertainty: You don't know the final sequence length upfront

A batch of 10 requests each reserving 4K context might use only 2K on average, wasting 50% of KV cache memory.

PagedAttention: OS Paging for KV Caches

vLLM introduced PagedAttention, inspired by virtual memory in operating systems.

The Core Idea

Instead of one contiguous block per request, split KV cache into fixed-size pages (typically 16 tokens each) and maintain a page table per request:

Request A: page_table → [Page 3, Page 7, Page 1, ...]
Request B: page_table → [Page 2, Page 5, Page 8, ...]

Pages are allocated on demand as generation proceeds. No reservation needed.
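A toy allocator makes the bookkeeping concrete. The `BlockManager` name and 16-token page size mirror vLLM's design, but this is a sketch of the idea, not its implementation:

```python
class BlockManager:
    """Toy paged KV allocator: fixed-size pages handed out on demand."""
    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))   # free physical page IDs
        self.tables = {}                     # request_id -> list of page IDs
        self.lengths = {}                    # request_id -> tokens written

    def append_token(self, rid):
        n = self.lengths.get(rid, 0)
        if n % self.page_size == 0:          # first token, or last page is full
            if not self.free:
                raise MemoryError("out of KV cache pages")
            self.tables.setdefault(rid, []).append(self.free.pop())
        self.lengths[rid] = n + 1

    def free_request(self, rid):
        self.free.extend(self.tables.pop(rid, []))
        self.lengths.pop(rid, None)

mgr = BlockManager(num_pages=4)
for _ in range(20):                          # 20 tokens -> ceil(20/16) = 2 pages
    mgr.append_token("A")
assert len(mgr.tables["A"]) == 2
mgr.free_request("A")
assert len(mgr.free) == 4                    # pages returned to the pool
```

Memory is consumed one page at a time as tokens are generated, so the worst-case waste per request is one partially filled page.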

Advantages

  • Near-zero internal fragmentation: Waste at most (page_size - 1) tokens per request
  • Dynamic allocation: Memory grows with actual usage
  • Sharing across requests: Multiple requests can share the same physical page (for shared prefixes)

Implementation Complexity

PagedAttention requires a custom attention kernel that:

  1. Looks up physical page addresses from the page table
  2. Gathers K/V tensors from non-contiguous memory
  3. Computes attention across the gathered tensors

Custom CUDA kernels make this tractable by treating the page-table indirection as a block-sparse memory access pattern: each thread block resolves its page address once, then reads that page's K/V contiguously.
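In NumPy terms, the gather step looks like this (the pool layout, page-table values, and sizes are made up for illustration):

```python
import numpy as np

page_size, head_dim = 4, 8
# Physical KV pool: this request's pages live at arbitrary slots
k_pool = np.random.default_rng(1).normal(size=(10, page_size, head_dim))

page_table = [3, 7, 1]   # logical order of this request's physical pages
seq_len = 10             # tokens actually written (last page partially full)

# Gather logical K back into contiguous order, then trim to seq_len
k_gathered = k_pool[page_table].reshape(-1, head_dim)[:seq_len]
assert k_gathered.shape == (seq_len, head_dim)

# Logical token t lives in page page_table[t // page_size], slot t % page_size
t = 6
assert np.array_equal(
    k_gathered[t], k_pool[page_table[t // page_size], t % page_size]
)
```

The real kernel never materializes `k_gathered`; it indexes the pool directly during the attention computation, which is exactly what makes the custom kernel necessary.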

Prefix Caching

Many production applications send requests with a shared prefix:

  • System prompts (same for all users)
  • Few-shot examples (same per use case)
  • RAG context (shared across a session)

Prefix caching (also called prompt caching) stores the KV cache for these shared portions and reuses them across requests.

How It Works

Request 1: [system_prompt][user_query_1]
Request 2: [system_prompt][user_query_2]

Without caching:
  - Compute KV for system_prompt twice

With prefix caching:
  - Compute KV for system_prompt once
  - Cache by hash of token IDs
  - Reuse for request 2 (cache hit)

Cache Key Design

The cache key is typically a hash of the token IDs:

prefix_hash = sha256(token_ids[:prefix_length])
if prefix_hash in kv_cache:
    return kv_cache[prefix_hash]  # cache hit

With PagedAttention, entire pages can be shared via copy-on-write — the cached prefix pages are referenced by multiple requests without duplication.
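A minimal sketch of the lookup flow. Here the cached payload is an opaque placeholder; a real system stores pages of K/V tensors and shares them copy-on-write:

```python
import hashlib

class PrefixCache:
    """Toy prefix cache keyed by a hash of the prefix token IDs."""
    def __init__(self):
        self.store = {}   # hash -> precomputed KV (opaque in this sketch)
        self.hits = 0

    @staticmethod
    def key(token_ids):
        return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

    def get_or_compute(self, prefix_ids, compute_kv):
        k = self.key(prefix_ids)
        if k in self.store:
            self.hits += 1                   # reuse: skip prefill entirely
        else:
            self.store[k] = compute_kv(prefix_ids)   # expensive prefill
        return self.store[k]

cache = PrefixCache()
system_prompt = [101, 102, 103]
kv1 = cache.get_or_compute(system_prompt, lambda ids: f"kv-{len(ids)}-tokens")
kv2 = cache.get_or_compute(system_prompt, lambda ids: f"kv-{len(ids)}-tokens")
assert kv1 == kv2 and cache.hits == 1        # second request is a cache hit
```

Production systems typically hash per page rather than per whole prefix, so a request that shares only the first N pages still gets a partial hit.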

Production Impact

For applications with long system prompts (common in enterprise deployments):

  • 60-80% of tokens are shared prefix
  • Effective throughput improvement: 2-4x
  • Latency improvement: significant reduction in time-to-first-token

Anthropic and several other providers now offer explicit prompt caching through their APIs, with discounted pricing for cached tokens.

KV Cache Quantization

KV caches can be quantized independently from model weights:

FP8 KV Cache

Quantizing from FP16 to FP8:

  • 2x memory reduction
  • ~0.5-1% quality degradation on most tasks
  • Now supported natively in NVIDIA H100 tensor cores

# vLLM example
llm = LLM(
    model="meta-llama/Llama-3-70b",
    kv_cache_dtype="fp8_e5m2",  # FP8 KV cache
)

INT4 KV Cache

More aggressive: 4x memory reduction, but requires careful per-channel scaling to avoid accuracy collapse. Used in some mobile and edge deployments.
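A sketch of symmetric per-channel quantization shows why the per-channel scales matter; the bit-packing of two INT4 values per byte is omitted for clarity:

```python
import numpy as np

def quantize_per_channel(k, bits=4):
    """Symmetric per-channel quantization of a (tokens, head_dim) K tensor."""
    qmax = 2 ** (bits - 1) - 1                   # 7 for int4
    scale = np.abs(k).max(axis=0) / qmax         # one scale per channel
    scale = np.where(scale == 0, 1.0, scale)     # guard all-zero channels
    q = np.clip(np.round(k / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
k = rng.normal(size=(64, 128)).astype(np.float32)
q, scale = quantize_per_channel(k)
err = np.abs(dequantize(q, scale) - k).mean()
assert err < 0.15   # coarse but bounded reconstruction error
```

With a single global scale instead, one outlier channel would stretch the quantization grid for every channel, which is the accuracy collapse the text refers to.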

Key Insight

KV cache quantization is often more effective than weight quantization for throughput, because:

  • Weights are read once per forward pass
  • KV cache is read O(sequence_length) times per generation step
  • Memory bandwidth is the bottleneck in decoding, not compute
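A back-of-envelope calculation makes this concrete (fp16 weights, the 70B KV figures from above; all numbers approximate):

```python
# Bytes moved per decode step, LLaMA-3-70B-style config (fp16)
weight_bytes = 70e9 * 2                # ~140 GB of weights, read once per step
kv_per_token = 2 * 80 * 8 * 128 * 2    # 327,680 bytes of K/V per cached token
kv_bytes_8k = kv_per_token * 8192      # K/V re-read on EVERY step at 8K context

# Batching amortizes the weight reads, but not the per-sequence KV reads
batch = 32
per_seq_weights = weight_bytes / batch     # ~4.4 GB per sequence
per_seq_kv = kv_bytes_8k                   # ~2.7 GB, and growing with context
print(per_seq_weights / 1e9, per_seq_kv / 1e9)
```

At batch 32 and 8K context, KV traffic per sequence is already the same order as the amortized weight traffic, and it keeps growing with sequence length while the weight term shrinks with batch size.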

Sliding Window Attention

For very long contexts, full KV cache storage becomes impractical. Sliding window attention (used in Mistral models) only attends to the last W tokens:

KV cache size = min(S, W) tokens

With W=4096, memory is bounded regardless of context length. Quality degrades for long-range dependencies but is acceptable for many tasks.
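The bounded-memory behavior can be sketched with a ring buffer; real kernels overwrite cache slots in place with a rotating index rather than using Python deques:

```python
from collections import deque

class SlidingWindowKV:
    """Ring-buffer KV cache: keeps only the last `window` tokens."""
    def __init__(self, window):
        self.window = window
        self.k = deque(maxlen=window)   # oldest entry evicted at capacity
        self.v = deque(maxlen=window)

    def append(self, k_t, v_t):
        self.k.append(k_t)
        self.v.append(v_t)

    def __len__(self):
        return len(self.k)

cache = SlidingWindowKV(window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")
assert len(cache) == 4                          # bounded at W, not at S
assert list(cache.k) == ["k6", "k7", "k8", "k9"]
```

Whatever the context length, memory stays at W entries per layer; the cost is that tokens older than W are simply unreachable by attention.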

Hybrid approaches: Combine full attention for some layers (for global context) and sliding window for others (for efficiency). This is used in Gemma 2 and several proprietary models.

Offloading to CPU/NVMe

When GPU memory is exhausted, KV cache pages can be offloaded:

GPU HBM (fast) → CPU DRAM (slower) → NVMe SSD (slow)

The prefill phase can compute and immediately offload KV caches for low-priority requests. When the request is scheduled for decoding, pages are streamed back.

Bandwidth constraints:

  • GPU ↔ CPU: ~64 GB/s per direction (PCIe 5.0 x16)
  • CPU ↔ NVMe: ~7 GB/s per Gen4 drive (more with striping)

This only works when decoding latency requirements are loose (offline/async workloads).
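A quick estimate of the reload cost, under an assumed effective PCIe bandwidth:

```python
# Time to stream one request's KV cache back from CPU DRAM over PCIe
kv_bytes = 1.3e9            # ~1.3 GB for a 4K-context request (figure from above)
pcie_bandwidth = 60e9       # assumed effective bytes/s over PCIe 5.0
reload_s = kv_bytes / pcie_bandwidth
print(f"{reload_s * 1e3:.0f} ms to page a 4K request back in")  # ~22 ms
```

Tens of milliseconds per swap-in is invisible to a batch job but ruinous for an interactive chat request, which is why offloading is gated on latency requirements.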

Architecture Summary: vLLM's KV Cache System

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│            GPU Memory Pool              │
│  ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”   │
│  │         KV Cache Pages          │   │
│  │  [Page 0] [Page 1] ... [Page N] │   │
│  ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜   │
│  ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”   │
│  │      Block Manager              │   │
│  │  - Free page list               │   │
│  │  - Per-request page tables      │   │
│  │  - Reference counting (sharing) │   │
│  ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜   │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
         ↕ evict/load
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│            CPU Memory (swap)            │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Choosing the Right Optimizations

Optimization     Memory savings          Quality impact    Complexity
PagedAttention   High (fragmentation)    None              Framework-level
Prefix caching   High (shared prompts)   None              Low
FP8 KV cache     2x                      Minimal           Low (hardware)
INT4 KV cache    4x                      Moderate          Moderate
Sliding window   Very high               Task-dependent    Model-level
CPU offloading   High                    Latency impact    Moderate

Conclusion

The KV cache is the central resource in LLM serving. Engineers who understand its structure and constraints can unlock dramatically higher throughput and lower costs. PagedAttention eliminated fragmentation, prefix caching eliminates redundant computation, and quantization reduces bandwidth pressure — each addressing a different dimension of the problem.

As models get longer context and higher QPS demands, KV cache engineering will only become more critical.


Explore more LLM infrastructure patterns in our LLM Inference at Scale course.
