Design Pattern · 2025-03-06 · 13 min read

Long Context LLMs: RoPE Scaling, Retrieval, and the Path to 1M Tokens

How modern LLMs extend context windows to 128K, 1M, and beyond β€” covering RoPE scaling, positional extrapolation, attention approximations, and when long context beats RAG.

Tags: long context · context window · RoPE · RoPE scaling · LongRoPE · YaRN · LLM attention · RAG vs long context

Introduction

When GPT-4 launched with an 8K context window, 128K tokens seemed like science fiction. Two years later, Claude's 200K context, Gemini's 1M context, and open-source models with 128K windows are standard production offerings. The journey from 2K to 1M tokens required solving fundamental problems in positional encoding, attention efficiency, and model training.

This post explains how context windows are extended, what breaks in practice, and when long context is the right tool versus retrieval.

Why Context Windows Are Hard to Extend

The core challenge isn't just compute β€” it's positional generalization. LLMs are trained with specific positional encodings, and at inference time they encounter positions they've never seen during training.

The RoPE Problem

Most modern LLMs (Llama, Mistral, Qwen, Gemini) use Rotary Position Embeddings (RoPE). RoPE encodes position by rotating query and key vectors:

Q_rotated[i] = Q[i] Β· R(pos_i)
K_rotated[j] = K[j] Β· R(pos_j)

dot_product(Q_rotated[i], K_rotated[j]) β†’ depends on (pos_i - pos_j)

RoPE uses sinusoidal functions at different frequencies:

R(pos) uses frequencies: θ_d = base^(-2d/D) for d = 0, 1, ..., D/2 − 1

Default RoPE base: 10,000
Trained on 4096 tokens: frequencies have seen positions 0 to 4095

At position 8192: low-frequency dimensions (wavelengths longer than the training length) reach rotation angles never seen in training
→ attention patterns become incoherent

Simply running inference on longer sequences doesn't work β€” the model's positional encodings produce out-of-distribution rotation angles, causing attention to break down.
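To make the relative-position property concrete, here is a minimal NumPy sketch of RoPE rotation. The vector size and positions are arbitrary illustration values, not from any particular model:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE rotation to a vector x (even dimension D) at position pos."""
    D = x.shape[-1]
    # one frequency per 2-D pair: theta_d = base^(-2d/D)
    theta = base ** (-2 * np.arange(D // 2) / D)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]  # consecutive dims form rotation pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# the attention score depends only on the relative offset pos_i - pos_j:
score_a = rope_rotate(q, 100) @ rope_rotate(k, 90)     # offset 10
score_b = rope_rotate(q, 1000) @ rope_rotate(k, 990)   # offset 10, shifted by 900
assert np.isclose(score_a, score_b)
```

Shifting both positions by the same amount leaves the score unchanged, which is exactly why RoPE-trained attention cares about relative, not absolute, position.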

Technique 1: Positional Interpolation (PI)

The simplest fix: scale down position indices to fit within the trained range.

If the model was trained on 4096 tokens and you want 32768 tokens:

Scaling factor: s = 32768 / 4096 = 8

New position encoding: pos_interpolated = pos / s

Token at position 8192 → encoded as position 1024
Token at position 32767 → encoded as position ~4095.9 (just inside the trained range)

This works after a small amount of fine-tuning on longer sequences. The model adapts to the interpolated positions. Used in LLaMA 2's long-context fine-tunes.

Limitation: Uniform interpolation dilutes the positional signal for nearby tokens β€” tokens at distance 1 vs. distance 2 are now essentially indistinguishable at high scales. Short-range attention quality degrades.
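A minimal sketch of how PI rescales position indices before computing RoPE angles; the 4096 → 32768 numbers mirror the example above:

```python
import numpy as np

def rope_angles(positions, dim=128, base=10000.0, scale=1.0):
    # Positional Interpolation: divide positions by the scale factor so
    # every extended position maps back into the trained range.
    theta = base ** (-2 * np.arange(dim // 2) / dim)  # RoPE inverse frequencies
    return np.outer(np.asarray(positions) / scale, theta)

orig_len, target_len = 4096, 32768
s = target_len / orig_len  # scaling factor 8

angles = rope_angles(np.arange(target_len), scale=s)
# the largest effective position is 32767 / 8 = 4095.875, inside [0, 4096)
assert angles.max() < orig_len
```

The dilution problem is visible here too: after scaling, adjacent tokens differ by only 1/8 of a position, which is why short-range attention quality suffers at high scales.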

Technique 2: YaRN (Yet Another RoPE extensioN)

YaRN (Peng et al., 2023) improves on uniform interpolation by treating different frequency components differently:

RoPE frequencies split into three groups based on wavelength:

High frequencies (short wavelengths):
  β†’ These are already well-trained in the base model
  β†’ Keep them unchanged (no scaling)

Medium frequencies:
  β†’ Interpolate with a linear ramp

Low frequencies (long wavelengths):
  β†’ These need to generalize to long distances
  β†’ Apply uniform interpolation

The intuition: high-frequency dimensions capture local relationships (adjacent tokens); they work fine as-is. Low-frequency dimensions capture global relationships (long-range dependencies); they need interpolation to work at extended lengths.

YaRN also introduces a temperature scaling factor that prevents attention entropy from collapsing when context is very long (long context β†’ many tokens to attend to β†’ softmax spreads too thin).

# YaRN implementation sketch (ramp thresholds alpha=1, beta=32 as in the paper)
import math
import torch

def apply_yarn_scaling(freqs, scale_factor, orig_context_len=4096, alpha=1.0, beta=32.0):
    # freqs: the inverse frequencies of RoPE, shape (D/2,)
    wavelengths = 2 * math.pi / freqs

    # Number of full rotations each dimension completes over the original
    # context. High-frequency dims (many rotations) saw every angle during
    # training; low-frequency dims (few rotations) must be interpolated.
    rotations = orig_context_len / wavelengths

    # gamma = 1 -> keep unchanged (high freq); gamma = 0 -> full interpolation
    gamma = torch.clamp((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    scaled_freqs = gamma * freqs + (1 - gamma) * freqs / scale_factor

    # temperature correction: sqrt(1/t) = 0.1 * ln(s) + 1, folded into the
    # embeddings to keep attention entropy from collapsing at long lengths
    mscale = 0.1 * math.log(scale_factor) + 1.0
    return scaled_freqs, mscale

YaRN is used by Mistral, Qwen 2.5, and many other open-source models for long-context extension.

Technique 3: LongRoPE

LongRoPE (Ding et al., 2024) extends context to 2M tokens by observing that the optimal scaling factor differs across RoPE dimensions and token positions. Rather than applying a single global interpolation factor, it searches for non-uniform, per-dimension scaling (and leaves the earliest tokens uninterpolated):

Evolutionary search over scaling factors:
  For each dimension d in [0, D/2]:
    Find scaling factor Ξ»_d that minimizes perplexity
    at target context length

Result: non-uniform scaling that's better than any uniform scheme

The search is expensive (it requires evaluating many scaling configurations) but only needs to be done once per model. LongRoPE extended LLaMA 2 and Mistral to a 2M-token context with minimal quality degradation, and the same technique underlies Phi-3's 128K window.
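The search itself can be illustrated with a toy evolutionary loop. The objective below is a stand-in: a real run would measure the model's perplexity on long validation documents at the target length, and the "ideal" per-dimension scales here are invented for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
D_half = 64  # number of RoPE frequency pairs (illustrative)

def perplexity_proxy(scales):
    # stand-in for evaluating the model at the target context length;
    # the target vector is a hypothetical set of ideal per-dim scales
    target = np.linspace(1.0, 8.0, D_half)
    return float(np.mean((scales - target) ** 2))

# start from uniform PI scaling and mutate, keeping any improvement
best = np.full(D_half, 8.0)
best_score = perplexity_proxy(best)
for _ in range(200):
    candidate = np.clip(best + rng.normal(scale=0.5, size=D_half), 1.0, 16.0)
    score = perplexity_proxy(candidate)
    if score < best_score:
        best, best_score = candidate, score
```

Even this crude loop beats the uniform baseline, which is the core observation: non-uniform per-dimension scaling dominates any single global factor.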

Technique 4: Adjusting the RoPE Base

An even simpler approach that's often overlooked: change the base of the RoPE frequencies.

Default: base = 10,000, trained on 4096 tokens

Observation: larger base β†’ lower frequencies β†’ slower rotation per position
             β†’ model can handle longer positions naturally

Code Llama 34B: trained with base = 1,000,000, 100K context
Llama 3.1: fine-tuned with base = 500,000, 128K context

Simply increasing the RoPE base during continued pre-training on long-context data allows the model to naturally extend. This is the approach Meta used for Llama 3.1's 128K context window.
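A quick numerical check of this observation, comparing the rotation angle of the lowest-frequency dimension under the default and an enlarged base (the dimension index and position are illustrative):

```python
def rotation_angle(pos, d, D=128, base=10000.0):
    # angle accumulated by the d-th RoPE frequency pair at position pos
    return pos * base ** (-2 * d / D)

# lowest-frequency pair (d = D/2 - 1) at a position far past a 4K training window
angle_default = rotation_angle(100_000, d=63, base=10_000.0)
angle_large = rotation_angle(100_000, d=63, base=500_000.0)

# the larger base rotates low-frequency dimensions far more slowly,
# keeping distant positions within angle ranges seen during training
assert angle_large < angle_default
```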

The Training Data Challenge

Extending the context window doesn't just require positional encoding changes β€” you also need training data with long-range dependencies. A model trained only on 4K-token documents, even with extended positional encodings, won't learn to use 128K context effectively.

Long-context training data requirements:

  • Long documents (books, research papers, codebases)
  • Tasks that require referring back to distant context
  • Synthetic tasks designed to require long-range reasoning (needle-in-a-haystack, multi-hop QA over long documents)

Meta found that even 0.1% long-context data (sequences > 32K tokens) significantly improved Llama 3.1's long-context performance, with diminishing returns beyond 5%.

"Lost in the Middle": The Practical Challenge

Even with perfect positional encodings, LLMs use long context imperfectly. A well-documented phenomenon: performance degrades for information placed in the middle of long contexts.

Performance on "find the relevant fact" task:
  Fact at beginning of context: 90% accuracy
  Fact in middle of context:    50% accuracy  ← significant degradation
  Fact at end of context:       85% accuracy

This is a training artifact: most training data naturally has important information at the start (abstracts, headers) or end (conclusions, summaries), so models over-attend to positions near the boundaries.

Practical implication: When structuring prompts for long-context tasks, put the most critical information at the beginning or end of the context window.
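One way to apply this in code is a small helper (hypothetical, not from any library) that places the top-ranked chunks at the edges of the prompt, where degradation is smallest:

```python
def order_for_long_context(chunks, scores):
    # rank chunks by relevance score, most relevant first
    ranked = [c for _, c in sorted(zip(scores, chunks), key=lambda t: -t[0])]
    if len(ranked) < 3:
        return ranked
    # best chunk at the very start, runner-up at the very end, rest in the middle
    return [ranked[0]] + ranked[2:] + [ranked[1]]

ordered = order_for_long_context(["a", "b", "c", "d"], [1, 9, 7, 3])
# "b" (score 9) goes first, "c" (score 7) goes last, the rest sit in the middle
assert ordered == ["b", "d", "a", "c"]
```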

Long Context vs. RAG: When to Use Each

Long context and RAG solve overlapping problems. Choosing between them depends on:

| Factor | Long Context | RAG |
| --- | --- | --- |
| Document collection size | Small to medium (<200K tokens total) | Unlimited |
| Query type | Requires holistic understanding, cross-doc reasoning | Factual lookup, specific retrieval |
| Update frequency | Static or rarely changing | Frequently updated |
| Latency tolerance | High (prefilling 100K tokens takes seconds) | Lower (retrieval + short-context generation) |
| Cost | High (long context → many tokens → expensive) | Lower (only retrieved chunks billed) |
| Reasoning across docs | Excellent | Harder (limited by what's retrieved) |

Use long context when: You need the model to reason across a document (summarize a 100-page report, find inconsistencies across sections, extract all mentions of an entity).

Use RAG when: You're answering factual questions over a large, frequently updated corpus, and answers depend on individual retrieved facts rather than holistic understanding.

Efficient Attention for Long Contexts

At 128K tokens, standard attention builds a 128K × 128K attention matrix: roughly 16 billion entries, about 32 GB per attention head at FP16 if materialized naively. Production solutions:

FlashAttention (already discussed): Eliminates the O(NΒ²) HBM memory bottleneck, making 128K context possible on single-GPU configurations.

Ring Attention: Distribute attention across multiple GPUs by splitting the sequence. Each GPU holds a slice of Q, K, V and communicates in a ring to compute full attention. Enables millions of tokens.

Sliding Window Attention (Mistral): Each token only attends to the last W tokens. O(NΓ—W) complexity. Fast but loses access to very distant tokens. Mistral 7B uses W=4096 within its 32K context.

Sparse Attention patterns (BigBird, Longformer): Combine local windows with global tokens (CLS token attends to everything) and random attention. Approximates full attention with O(N) complexity.
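As a concrete illustration of the sliding-window idea, here is a small NumPy sketch of the attention mask used by Mistral-style models (the sequence length and window are toy values):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: causal AND within the last `window` tokens
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(8, window=3)
assert mask.sum(axis=1).max() == 3          # each query sees at most `window` keys
assert mask[7, 5] and not mask[7, 4]        # token 7 sees tokens 5..7, not token 4
```

Each row has at most W entries, so the cost per layer is O(N×W) instead of O(N²); information beyond the window can still propagate across layers, but only indirectly.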

Conclusion

Long-context LLMs are the result of careful co-design between positional encoding techniques (YaRN, LongRoPE, base scaling), attention efficiency improvements (FlashAttention, ring attention), and long-context training data. The path from 4K to 1M tokens was not obvious β€” it required understanding why positional encodings break, how to fix them without destroying short-range performance, and how to train models to actually use the extended context effectively. The "lost in the middle" phenomenon remains an unsolved challenge at scale, which is why RAG and long context are often complementary rather than competitive.


Interested in how transformers handle attention computation efficiently? Read our guide on FlashAttention Explained.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.