Introduction
In 2023, vLLM was released with a claim that seemed too good to be true: up to 24x higher throughput than HuggingFace Transformers for LLM serving, on the same hardware, without any model changes. The two key innovations behind this were continuous batching and PagedAttention — systems-level insights that fundamentally changed how LLMs are served in production.
Understanding these techniques is essential for any ML engineer responsible for LLM serving infrastructure.
The Problem: GPU Underutilization in Naive LLM Serving
Naive LLM serving has severe efficiency problems:
Problem 1: Static Batching Wastes GPU Cycles
In static batching, a batch of requests is processed together. All requests in the batch must complete before the next batch starts:
Static batching:
Batch: [Request A (10 tokens), Request B (100 tokens), Request C (50 tokens)]
Time:
─────────────────────────────────────────────────────────────────
A: [generates 10 tokens] [IDLE for 90 tokens] ← GPU time wasted
B: [generates 100 tokens]
C: [generates 50 tokens] [IDLE for 50 tokens] ← GPU time wasted
─────────────────────────────────────────────────────────────────
↑ All three must finish before batch N+1 starts
GPU utilization: ~30-50% for variable-length requests
Short requests finish early and leave their GPU allocation idle until the longest request completes.
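The waste is easy to quantify. A minimal sketch, using the illustrative token counts from the diagram above (not measured values):

```python
# Utilization of a static batch: every slot is held until the
# longest request finishes, so short requests leave idle slots.
def static_batch_utilization(request_lengths):
    longest = max(request_lengths)
    busy = sum(request_lengths)              # slot-steps doing useful work
    total = longest * len(request_lengths)   # slot-steps reserved
    return busy / total

# The A/B/C example from the diagram: 10, 100, and 50 tokens.
print(static_batch_utilization([10, 100, 50]))  # → 0.5333...
```

With real traffic, where output lengths follow a long-tailed distribution, utilization routinely lands in the 30-50% range quoted above.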
Problem 2: Memory Fragmentation Wastes KV Cache Space
The KV cache stores key-value tensors for all tokens generated so far. Naive implementations pre-allocate a contiguous memory block for the maximum possible sequence length:
Memory allocation for max_sequence_length=2048:
Request A (eventually 50 tokens): reserves 2048 tokens of KV cache
Request B (eventually 200 tokens): reserves 2048 tokens of KV cache
Request C (eventually 2000 tokens): reserves 2048 tokens of KV cache
Internal fragmentation: (2048-50 + 2048-200 + 2048-2000) / (3×2048) = 63% waste
This dramatically limits the number of concurrent requests and wastes GPU memory.
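The 63% figure falls out of a one-line calculation; a quick sketch with the numbers from the example:

```python
# Internal fragmentation with contiguous pre-allocation:
# each request reserves max_seq_len slots but uses far fewer.
def kv_fragmentation(actual_lengths, max_seq_len):
    reserved = max_seq_len * len(actual_lengths)
    used = sum(actual_lengths)
    return (reserved - used) / reserved

print(kv_fragmentation([50, 200, 2000], 2048))  # → 0.6337...
```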
Solution 1: Continuous Batching
Continuous batching (also called iteration-level batching or in-flight batching) changes when new requests join the batch:
Continuous batching:
The batch is updated at every iteration (every generated token).
As soon as a request finishes, the freed slot is filled by a new request.
Step 1: [Request A, Request B, Request C] ← initial batch
Step 2: [Request A, Request B, Request C] ← all still generating
...
Step 10: Request A finishes
Step 11: [Request D (new!), Request B, Request C] ← D immediately replaces A
...
This is analogous to how OS schedulers work: rather than waiting for a whole group of processes to finish, the scheduler fills freed capacity with new work as soon as it becomes available.
Effect on throughput: Continuous batching maintains near-100% GPU utilization. Benchmarks show 5-10x throughput improvements over static batching for real-world request distributions.
The key challenge continuous batching introduces: batching requests of different lengths at different stages of generation. Standard batched attention assumes all sequences in the batch are the same length (for prefill) or at the same generation step (for decode). vLLM's scheduler and custom attention kernels handle batches that mix lengths and phases within a single step, without forcing every sequence to a common padded length.
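Where the throughput gain comes from can be seen in a toy simulation. This sketch makes deliberately crude assumptions (one token per slot per step, instant request swap-in, zero prefill cost); it does not model vLLM's real scheduler:

```python
def static_steps(lengths, batch_size):
    """Total decode steps with static batching: each batch runs
    until its longest member finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_steps(lengths, batch_size):
    """Total decode steps with continuous batching: a finished
    request's slot is refilled on the very next iteration."""
    queue = list(lengths)
    slots = []  # remaining tokens per active request
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.pop(0))
        steps += 1
        slots = [r - 1 for r in slots if r > 1]
    return steps

lengths = [10, 100, 50, 20, 80, 40, 100, 10]
print(static_steps(lengths, 4), continuous_steps(lengths, 4))  # → 200 150
```

Even in this small example continuous batching cuts total steps by 25%; the gap widens as length variance and queue depth grow.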
Solution 2: PagedAttention
PagedAttention borrows from virtual memory management in operating systems to solve the KV cache fragmentation problem.
The Insight: KV Cache is Like Virtual Memory
OS virtual memory maps virtual page addresses to physical memory pages. This allows:
- Non-contiguous physical allocation for logically contiguous data
- On-demand allocation (don't allocate pages until needed)
- Sharing pages between processes
PagedAttention applies the same idea to KV cache:
PagedAttention:
KV cache is divided into fixed-size "blocks" (pages), e.g., 16 tokens per block.
A request's KV cache is stored in a logical sequence of blocks,
which map to non-contiguous physical blocks in GPU memory.
Logical blocks: [0] → [1] → [2] → [3] → ...
↓ ↓ ↓ ↓
Physical blocks: [B7] [B2] [B15] [B9] ← scattered in GPU memory
Block Manager
vLLM maintains a block manager that tracks:
- Free physical blocks (available pool)
- Block table per request (logical → physical mapping)
    # Simplified block manager logic
    from math import ceil

    class BlockManager:
        def __init__(self, num_gpu_blocks, block_size):
            self.block_size = block_size
            self.free_blocks = list(range(num_gpu_blocks))
            self.block_tables = {}  # request_id → [physical_block_ids]

        def allocate(self, request_id, num_tokens):
            blocks_needed = ceil(num_tokens / self.block_size)
            if blocks_needed > len(self.free_blocks):
                raise MemoryError("not enough free KV cache blocks")
            allocated = self.free_blocks[:blocks_needed]
            self.free_blocks = self.free_blocks[blocks_needed:]
            self.block_tables[request_id] = allocated
            return allocated

        def free(self, request_id):
            freed = self.block_tables.pop(request_id)
            self.free_blocks.extend(freed)
Copy-on-Write for Beam Search and Parallel Sampling
A powerful benefit of paged KV cache: blocks can be shared between requests (like copy-on-write pages in OS memory). This is critical for:
Beam search: Multiple beams share the KV cache for their shared prefix. Only when they diverge do they get separate physical blocks.
Parallel sampling (generating N completions per prompt): All N samples share the KV cache for the prompt prefix — only the generated tokens have separate blocks.
Shared prefix caching:
Request: "Summarize this 5000-word document: [...]"
→ 4 parallel samples generated
Prompt KV cache (5000 tokens × num_layers × 2 [K and V] × d):
Stored ONCE in shared physical blocks
All 4 samples point to the same blocks for the prompt
Token generation blocks:
4 separate physical block sequences (one per sample)
Memory savings: ~3x for 4 parallel samples with long prompts
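The ~3× figure can be checked with block-level arithmetic. A sketch under stated assumptions (16-token blocks, 200 generated tokens per sample; per-block metadata overhead is ignored):

```python
from math import ceil

def blocks_needed(num_tokens, block_size=16):
    return ceil(num_tokens / block_size)

def kv_blocks(prompt_len, gen_len, n_samples, shared_prefix):
    """Physical KV blocks used by n parallel samples of one prompt."""
    if shared_prefix:
        # Prompt blocks stored once; only generated tokens are per-sample.
        return blocks_needed(prompt_len) + n_samples * blocks_needed(gen_len)
    return n_samples * (blocks_needed(prompt_len) + blocks_needed(gen_len))

naive = kv_blocks(5000, 200, 4, shared_prefix=False)
shared = kv_blocks(5000, 200, 4, shared_prefix=True)
print(naive, shared, round(naive / shared, 2))  # → 1304 365 3.57
```

The longer the prompt relative to the generated text, the closer the savings approach the sample count n.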
Combining Both: The vLLM Architecture
vLLM serving architecture:
Incoming requests → Request scheduler
↓
Batch construction (continuous batching)
↓
Block allocation (PagedAttention block manager)
↓
GPU forward pass
├─ Prefill (new requests): process full prompt in parallel
└─ Decode (existing requests): generate one token per request
↓
Detokenize completed sequences
↓
Free blocks for finished requests → return to pool
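Tying the pieces together, one serving loop iteration might look like the following. This is a structural sketch, not vLLM's actual code: `Request`, `BlockPool`, and the one-token-per-step decode model are all invented for illustration.

```python
from collections import deque
from dataclasses import dataclass
from math import ceil

@dataclass
class Request:
    id: int
    remaining: int  # tokens left to generate

class BlockPool:
    """Toy stand-in for the PagedAttention block manager."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}
    def allocate(self, rid, n_blocks):
        self.tables[rid], self.free = self.free[:n_blocks], self.free[n_blocks:]
    def release(self, rid):
        self.free.extend(self.tables.pop(rid))

def serve(requests, pool, max_batch=4, block_size=16):
    queue, active, steps = deque(requests), [], 0
    while queue or active:
        # Continuous batching: refill freed slots before every step.
        while queue and len(active) < max_batch:
            req = queue.popleft()
            pool.allocate(req.id, ceil(req.remaining / block_size))
            active.append(req)
        steps += 1                       # one decode iteration for the batch
        for req in active:
            req.remaining -= 1
        for req in [r for r in active if r.remaining == 0]:
            pool.release(req.id)         # blocks go straight back to the pool
            active.remove(req)
    return steps

reqs = [Request(i, n) for i, n in enumerate([10, 100, 50, 20, 80])]
print(serve(reqs, BlockPool(64)))  # → 100
```

The two innovations compose cleanly: continuous batching decides *which* requests occupy slots at each step, while the block pool decides *where* their KV cache lives.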
Performance Numbers
The impact of vLLM's techniques on Llama-2-13B throughput (requests/second):
| System | Throughput | vs. Baseline |
|---|---|---|
| HuggingFace (static batching) | 0.5 req/s | 1× |
| FasterTransformer | 1.2 req/s | 2.4× |
| vLLM (continuous batching only) | 4.0 req/s | 8× |
| vLLM (continuous + PagedAttention) | 5.5 req/s | 11× |
Approximate numbers from vLLM paper, Llama-2-13B, A100 40GB, mixed request lengths
Prefix Caching: Reusing KV Across Requests
An extension of PagedAttention: if multiple requests share the same prefix (e.g., a long system prompt), their KV cache for that prefix can be computed once and reused across requests.
Without prefix caching:
1000 requests × same 2000-token system prompt = 2M tokens of KV recomputed
With prefix caching:
2000-token system prompt computed once → KV cached in GPU memory
1000 requests → skip prefill for the shared prefix
Savings: 2M token computations → essentially free
vLLM's "automatic prefix caching" (APC) and SGLang's RadixAttention both implement this. For RAG systems and chatbots with long system prompts, prefix caching can reduce time-to-first-token by 50-80%.
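The mechanism can be sketched as a cache keyed on the prompt's token prefix. This structure is hypothetical and simplified — vLLM's APC actually hashes per-block token prefixes so partial matches can be reused — but the effect is the same:

```python
class PrefixCache:
    """Maps a tuple of prompt tokens to already-computed KV block ids,
    so a shared system prompt is prefilled only once."""
    def __init__(self):
        self.cache = {}        # token prefix (tuple) → KV block ids
        self.prefill_calls = 0

    def get_or_compute(self, prompt_tokens, compute_kv):
        key = tuple(prompt_tokens)
        if key not in self.cache:
            self.prefill_calls += 1          # cache miss: run prefill once
            self.cache[key] = compute_kv(prompt_tokens)
        return self.cache[key]

cache = PrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant"]
for _ in range(1000):
    # compute_kv is a stand-in for the real prefill forward pass.
    cache.get_or_compute(system_prompt, compute_kv=lambda t: list(range(len(t))))
print(cache.prefill_calls)  # → 1
```

One prefill serves all 1000 requests; only each request's unique suffix still needs computation.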
Modern Alternatives: SGLang, TensorRT-LLM
vLLM pioneered continuous batching and PagedAttention, but the landscape has expanded:
SGLang: Adds RadixAttention (more efficient prefix caching using a radix tree), compiled CUDA graphs for common patterns, and structured generation with grammar constraints.
TensorRT-LLM (NVIDIA): Production-grade serving with hardware-specific optimizations for NVIDIA GPUs (H100, A100). Fastest option for fixed workloads where the compilation overhead is acceptable.
Llama.cpp: For CPU/edge inference, uses memory mapping and quantization rather than PagedAttention — fundamentally different approach for resource-constrained settings.
Conclusion
Continuous batching and PagedAttention are the systems innovations that made high-throughput LLM serving economically viable. Continuous batching eliminates the GPU idle time caused by static batching; PagedAttention eliminates the memory fragmentation that limits concurrency. Together, they enable a 10× throughput improvement over naive implementations — the difference between serving 100 requests/minute and 1,000 requests/minute on the same hardware. Any team running LLMs in production should understand these fundamentals, whether they're using vLLM directly or a managed service built on top of it.
Want to understand the KV cache in depth? Read our guide on KV Cache Optimization for LLM Serving.