Introduction
When GPT-4 launched with an 8K context window, 128K tokens seemed like science fiction. Two years later, Claude's 200K context, Gemini's 1M context, and open-source models with 128K windows are standard production offerings. The journey from 2K to 1M tokens required solving fundamental problems in positional encoding, attention efficiency, and model training.
This post explains how context windows are extended, what breaks in practice, and when long context is the right tool versus retrieval.
Why Context Windows Are Hard to Extend
The core challenge isn't just compute; it's positional generalization. LLMs are trained with specific positional encodings, and at inference time they encounter positions they've never seen during training.
The RoPE Problem
Most modern LLMs (Llama, Mistral, Qwen, Gemini) use Rotary Position Embeddings (RoPE). RoPE encodes position by rotating query and key vectors:
Q_rotated[i] = Q[i] · R(pos_i)
K_rotated[j] = K[j] · R(pos_j)
dot_product(Q_rotated[i], K_rotated[j]) → depends only on (pos_i - pos_j)
RoPE uses sinusoidal functions at different frequencies:
R(pos) uses frequencies: θ_d = base^(-2d/D) for d = 0, 1, ..., D/2 - 1
Default RoPE base: 10,000
Trained on 4096 tokens: each dimension has only seen the rotation angles produced by positions 0 to 4095
At position 8192: low-frequency dimensions reach rotation angles never seen in training
→ attention patterns become incoherent
Simply running inference on longer sequences doesn't work: the model's positional encodings produce out-of-distribution rotation angles, causing attention to break down.
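The rotation described above is easy to sketch directly. Below is a minimal, illustrative RoPE implementation operating on a (seq_len, dim) tensor with even/odd dimension pairing; function names and shapes are assumptions for illustration, not any particular model's code:

```python
import torch

def rope_freqs(dim, base=10000.0):
    # Inverse frequencies: theta_d = base^(-2d/D) for d = 0 .. D/2 - 1
    return base ** (-torch.arange(0, dim, 2).float() / dim)

def apply_rope(x, positions, base=10000.0):
    # x: (seq_len, dim) query or key vectors; positions: (seq_len,) int tensor
    freqs = rope_freqs(x.shape[-1], base)        # (dim/2,)
    angles = positions[:, None].float() * freqs  # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]          # each 2-D pair gets rotated
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is rotated by an angle proportional to its position, the dot product of a rotated query and key depends only on the position difference, which is exactly the relative-position property the equations above describe.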
Technique 1: Positional Interpolation (PI)
The simplest fix: scale down position indices to fit within the trained range.
If the model was trained on 4096 tokens and you want 32768 tokens:
Scaling factor: s = 32768 / 4096 = 8
New position encoding: pos_interpolated = pos / s
Token at position 8192 → encoded as position 1024
Token at position 32768 → encoded as position 4096 (max trained position)
This works after a small amount of fine-tuning on longer sequences. The model adapts to the interpolated positions. Used in LLaMA 2's long-context fine-tunes.
Limitation: uniform interpolation dilutes the positional signal for nearby tokens. At a scale factor of 8, tokens at distance 1 and distance 2 differ by only 1/8 of a position, making them nearly indistinguishable, so short-range attention quality degrades.
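In code, positional interpolation is just a rescaling of the position indices before they reach the positional encoding. A minimal sketch (the function name is illustrative):

```python
def interpolate_positions(positions, trained_len=4096, target_len=32768):
    # Uniformly compress positions in [0, target_len) into [0, trained_len)
    scale = target_len / trained_len  # s = 8 for 4096 -> 32768
    return [p / scale for p in positions]
```

For example, `interpolate_positions([8192, 32768])` maps those tokens to positions 1024.0 and 4096.0, matching the worked example above.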
Technique 2: YaRN (Yet Another RoPE extensioN)
YaRN (Peng et al., 2023) improves on uniform interpolation by treating different frequency components differently:
RoPE frequencies split into three groups based on wavelength:
High frequencies (short wavelengths):
→ These are already well-trained in the base model
→ Keep them unchanged (no scaling)
Medium frequencies:
→ Interpolate with a linear ramp
Low frequencies (long wavelengths):
→ These need to generalize to long distances
→ Apply uniform interpolation
The intuition: high-frequency dimensions capture local relationships (adjacent tokens); they work fine as-is. Low-frequency dimensions capture global relationships (long-range dependencies); they need interpolation to work at extended lengths.
YaRN also introduces an attention temperature factor that compensates for attention spreading too thin when the context is very long (long context → many tokens to attend to → softmax weights dilute).
# YaRN implementation sketch (wavelength thresholds illustrative)
import math

import torch

def apply_yarn_scaling(freqs, scale_factor, low_wavelength=32.0, high_wavelength=None):
    # freqs: the inverse frequencies of RoPE, shape (D/2,)
    if high_wavelength is None:
        high_wavelength = low_wavelength * scale_factor
    wavelengths = 2 * math.pi / freqs
    # ramp = 0 for short wavelengths (high-freq dims: keep unchanged)
    # ramp = 1 for long wavelengths (low-freq dims: full interpolation)
    # in between: smooth linear transition
    ramp = ((wavelengths - low_wavelength)
            / (high_wavelength - low_wavelength)).clamp(0.0, 1.0)
    return freqs * (1 - ramp) + (freqs / scale_factor) * ramp
YaRN is used by Mistral, Qwen 2.5, and many other open-source models for long-context extension.
Technique 3: LongRoPE
LongRoPE (2024) extends context to 2M tokens by observing that different RoPE dimensions have different optimal scaling factors. Rather than applying a global interpolation, it searches for non-uniform, per-dimension scaling:
Evolutionary search over scaling factors:
For each dimension d in [0, D/2):
  Find the scaling factor λ_d that minimizes perplexity
  at the target context length
Result: non-uniform scaling that's better than any uniform scheme
The search is expensive (it requires evaluating many scaling configurations) but only needs to be done once per model. LongRoPE reached 2M-token context on LLaMA 2 and Mistral with minimal quality degradation, and a variant of it powers Phi-3's 128K context.
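Applying the result of the search is the simple part; the sketch below uses placeholder λ values (the evolutionary search itself is omitted, and the factor values are purely illustrative):

```python
import torch

def apply_longrope_scaling(freqs, lambdas):
    # freqs: (D/2,) RoPE inverse frequencies
    # lambdas: (D/2,) per-dimension factors found by the search (>= 1.0)
    return freqs / lambdas

# Placeholder factors: leave fast-rotating dims alone, compress slow ones
freqs = torch.tensor([1.0, 0.1, 0.01, 0.001])
lambdas = torch.tensor([1.0, 1.0, 4.0, 8.0])
scaled = apply_longrope_scaling(freqs, lambdas)
```

The point of the non-uniform scheme is visible even in this toy example: high-frequency dimensions pass through unchanged while low-frequency dimensions are slowed further, rather than every dimension sharing one global factor.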
Technique 4: Adjusting the RoPE Base
An even simpler approach that's often overlooked: change the base of the RoPE frequencies.
Default: base = 10,000, trained on 4096 tokens
Observation: larger base → lower frequencies → slower rotation per position
→ model can handle longer positions naturally
Code Llama 34B: trained with base = 1,000,000, 100K context
Llama 3.1: fine-tuned with base = 500,000, 128K context
Simply increasing the RoPE base during continued pre-training on long-context data allows the model to naturally extend. This is the approach Meta used for Llama 3.1's 128K context window.
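A quick way to see the effect is to compute the wavelength, in tokens, of the slowest-rotating RoPE dimension under different bases. A sketch (the function name is illustrative):

```python
import math

def max_wavelength(dim, base):
    # Slowest inverse frequency: theta = base^(-(D-2)/D), i.e. d = D/2 - 1
    slowest_freq = base ** (-(dim - 2) / dim)
    return 2 * math.pi / slowest_freq
```

With dim=128, raising the base from 10,000 to 1,000,000 stretches the slowest dimension's wavelength from tens of thousands of tokens into the millions, so much longer positions stay within rotation angles the model can make sense of.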
The Training Data Challenge
Extending the context window doesn't just require positional encoding changes; you also need training data with long-range dependencies. A model trained only on 4K-token documents, even with extended positional encodings, won't learn to use 128K context effectively.
Long-context training data requirements:
- Long documents (books, research papers, codebases)
- Tasks that require referring back to distant context
- Synthetic tasks designed to require long-range reasoning (needle-in-a-haystack, multi-hop QA over long documents)
Meta found that even 0.1% long-context data (sequences > 32K tokens) significantly improved Llama 3.1's long-context performance, with diminishing returns beyond 5%.
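The needle-in-a-haystack tasks mentioned above can be generated synthetically. A minimal sketch; the filler text, needle format, and function name are all illustrative:

```python
import random

def make_needle_haystack(n_filler=1000, needle_pos=0.5, seed=0):
    # Build a long distractor document with one retrievable fact
    # (the "needle") inserted at a controllable relative depth.
    rng = random.Random(seed)
    filler = [
        f"Note {i}: nothing of interest happened on day {rng.randint(1, 365)}."
        for i in range(n_filler)
    ]
    needle = "The secret passcode is 7c3f91."
    filler.insert(int(needle_pos * len(filler)), needle)
    prompt = "\n".join(filler) + "\n\nQuestion: What is the secret passcode?"
    return prompt, "7c3f91"
```

Sweeping `needle_pos` from 0.0 to 1.0 and scoring whether the model reproduces the answer is exactly how position-sensitivity curves like the "lost in the middle" results below are produced.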
"Lost in the Middle": The Practical Challenge
Even with perfect positional encodings, LLMs use long context imperfectly. A well-documented phenomenon: performance degrades for information placed in the middle of long contexts.
Performance on "find the relevant fact" task:
Fact at beginning of context: 90% accuracy
Fact in middle of context: 50% accuracy (significant degradation)
Fact at end of context: 85% accuracy
This is a training artifact: most training data naturally has important information at the start (abstracts, headers) or end (conclusions, summaries), so models over-attend to positions near the boundaries.
Practical implication: When structuring prompts for long-context tasks, put the most critical information at the beginning or end of the context window.
Long Context vs. RAG: When to Use Each
Long context and RAG solve overlapping problems. Choosing between them depends on:
| Factor | Long Context | RAG |
|---|---|---|
| Document collection size | Smallβmedium (<200K tokens total) | Unlimited |
| Query type | Requires holistic understanding, cross-doc reasoning | Factual lookup, specific retrieval |
| Update frequency | Static or rarely changing | Frequently updated |
| Latency | Higher (prefilling 100K tokens takes seconds) | Lower (retrieval + short-context generation) |
| Cost | Higher (long context → many tokens → expensive) | Lower (only retrieved chunks billed) |
| Reasoning across docs | Excellent | Harder (limited by what's retrieved) |
Use long context when: You need the model to reason across a document (summarize a 100-page report, find inconsistencies across sections, extract all mentions of an entity).
Use RAG when: You're answering factual questions over a large, frequently updated corpus, and answers depend on individual retrieved facts rather than holistic understanding.
Efficient Attention for Long Contexts
At 128K tokens, standard attention scores 128K × 128K ≈ 16 billion query-key pairs; materializing that matrix naively at FP16 would take roughly 32 GB per head per layer, which is infeasible. Production solutions:
FlashAttention (already discussed): Eliminates the O(NΒ²) HBM memory bottleneck, making 128K context possible on single-GPU configurations.
Ring Attention: Distribute attention across multiple GPUs by splitting the sequence. Each GPU holds a slice of Q, K, V and communicates in a ring to compute full attention. Enables millions of tokens.
Sliding Window Attention (Mistral): Each token only attends to the last W tokens. O(NΓW) complexity. Fast but loses access to very distant tokens. Mistral 7B uses W=4096 within its 32K context.
Sparse Attention patterns (BigBird, Longformer): Combine local windows with global tokens (CLS token attends to everything) and random attention. Approximates full attention with O(N) complexity.
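The sliding-window variant is easiest to see as a boolean mask over the attention logits. A minimal sketch:

```python
import torch

def sliding_window_mask(seq_len, window):
    # True where query i may attend to key j: causal AND within the window
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)
```

Each query attends to itself and at most the W-1 tokens before it; applying this mask before the softmax is what turns the O(N²) cost into O(N×W).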
Conclusion
Long-context LLMs are the result of careful co-design between positional encoding techniques (YaRN, LongRoPE, base scaling), attention efficiency improvements (FlashAttention, ring attention), and long-context training data. The path from 4K to 1M tokens was not obvious: it required understanding why positional encodings break, how to fix them without destroying short-range performance, and how to train models to actually use the extended context effectively. The "lost in the middle" phenomenon remains an unsolved challenge at scale, which is why RAG and long context are often complementary rather than competitive.
Interested in how transformers handle attention computation efficiently? Read our guide on FlashAttention Explained.