design pattern 2025-03-10 13 min read

LLM Scaling Laws: Chinchilla, Compute-Optimal Training, and What They Mean in Practice

A deep dive into neural scaling laws — how they predict model performance, what Chinchilla changed about how we train LLMs, and the emerging debate about whether scaling is hitting limits.

Tags: scaling laws · Chinchilla · compute-optimal · LLM training · model size · data scaling · GPT-4

Introduction

Neural scaling laws are among the most practically important results in modern AI. They tell you, before spending millions on a training run, what model performance you should expect as a function of compute, data, and model size. The 2022 Chinchilla paper from DeepMind fundamentally changed how leading labs train LLMs — and its insights are why Llama-3-8B outperforms GPT-3 175B on many benchmarks despite being 20x smaller.

This post explains the key scaling laws, what Chinchilla proved, and the state of the debate about whether scaling will continue to drive progress.

The Power Law Foundation

Scaling laws describe how model loss decreases as a power law with scale:

L(N) = A / N^α + L_irreducible    (loss vs. model parameters)
L(D) = B / D^β + L_irreducible    (loss vs. training tokens)
L(C) = E / C^γ + L_irreducible    (loss vs. total training FLOPs)

Where:

  • N = number of model parameters
  • D = number of training tokens
  • C = total training FLOPs ≈ 6·N·D (the standard approximation)
  • α, β, γ ≈ 0.07–0.10 (empirically determined exponents)
  • A, B, E = fitted constants; L_irreducible = the entropy floor of the text itself
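In code, these functional forms are simple to evaluate. A minimal sketch — every constant below is an illustrative placeholder, not a fitted value from any paper:

```python
# Sketch of the scaling-law functional forms. All constants are
# hypothetical placeholders chosen to give plausible-looking losses.
A, ALPHA = 3.0, 0.08            # model-size term (illustrative)
B, BETA = 5.0, 0.09             # data-size term (illustrative)
L_IRREDUCIBLE = 1.7             # entropy floor of the text (illustrative)

def loss_vs_params(n_params: float) -> float:
    """Predicted loss as a function of parameter count N."""
    return A / n_params**ALPHA + L_IRREDUCIBLE

def loss_vs_tokens(n_tokens: float) -> float:
    """Predicted loss as a function of training-token count D."""
    return B / n_tokens**BETA + L_IRREDUCIBLE

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard approximation: C ≈ 6·N·D FLOPs."""
    return 6.0 * n_params * n_tokens

# Doubling parameters shrinks only the reducible part of the loss;
# the curve flattens as it approaches the irreducible floor.
print(loss_vs_params(7e9), loss_vs_params(14e9))
print(f"Chinchilla-scale budget: {training_flops(70e9, 1.4e12):.2e} FLOPs")
```

Note that no amount of scale drives the predicted loss below `L_IRREDUCIBLE`: the power-law term only governs the reducible part.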

Key insight: Loss follows predictable power laws across many orders of magnitude. This means you can run small, cheap experiments and extrapolate to large, expensive runs — with surprisingly high accuracy.

The original OpenAI scaling laws paper (Kaplan et al., 2020) established these relationships for language models, validating them across roughly ten orders of magnitude of compute.
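The extrapolation works because a power law is a straight line in log-log space: fit a line to small runs, then read off the prediction for a large one. A self-contained sketch using synthetic data points generated from an assumed power law (in practice you would subtract an estimated irreducible loss before fitting):

```python
import math

# Hypothetical small-run results: (params, reducible loss), generated
# here from an assumed law 3.0 / N**0.08 so the fit recovers it exactly.
runs = [(n, 3.0 / n**0.08) for n in (1e7, 3e7, 1e8, 3e8, 1e9)]

# Least-squares line in log-log space: log L = log A - alpha * log N.
xs = [math.log(n) for n, _ in runs]
ys = [math.log(l) for _, l in runs]
x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
alpha = -slope                       # recovered exponent
A = math.exp(y_mean - slope * x_mean)

# Extrapolate two orders of magnitude beyond the largest run:
predicted = A / (1e11) ** alpha
print(f"alpha ≈ {alpha:.3f}, predicted reducible loss at 100B params: {predicted:.3f}")
```

Real small-scale runs are noisy, so the fit is only as good as the spread and quality of the measured points, but the log-log-linear mechanics are exactly this.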

The Kaplan et al. Recommendation (2020): Scale Models

The original scaling laws paper suggested that, given a fixed compute budget, you should prioritize scaling model size over training longer:

Kaplan 2020 finding:
  Model size scales as: N_optimal ∝ C^0.73
  Dataset size scales as: D_optimal ∝ C^0.27

Implication: As compute grows, grow model size much faster than data.
  10× more compute → 5× bigger model, 2× more data

This recommendation led labs to train large models on relatively small datasets. GPT-3 (175B params) was trained on ~300B tokens — less than 2 tokens per parameter.
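The Kaplan allocation can be expressed as scaling factors: multiplying compute by k multiplies the optimal model size by k^0.73 and the optimal data size by k^0.27. A small sketch that checks the 10× example above:

```python
def kaplan_split(compute_multiplier: float) -> tuple[float, float]:
    """Kaplan et al. (2020) allocation: N_opt ∝ C^0.73, D_opt ∝ C^0.27.
    Returns (model-size factor, data-size factor) for a given growth
    in the compute budget."""
    return compute_multiplier ** 0.73, compute_multiplier ** 0.27

model_factor, data_factor = kaplan_split(10.0)
print(f"10x compute -> {model_factor:.1f}x model, {data_factor:.1f}x data")
```

The two exponents sum to 1.0, so the factors multiply back to the compute multiplier (since C ≈ 6·N·D, growing N by k^0.73 and D by k^0.27 grows C by exactly k).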

Chinchilla (2022): The Dataset Was Too Small

DeepMind's "Training Compute-Optimal Large Language Models" (Hoffmann et al., 2022), known as the Chinchilla paper, re-ran the scaling experiment more carefully and reached a very different conclusion.

A key methodological difference: the Kaplan experiments used a learning-rate schedule whose length was not matched to each run's token budget, which biased the fits toward larger models. Chinchilla matched the schedule to the training duration and swept model size and data size jointly (IsoFLOP profiles) to find the true optimum.

Chinchilla's finding: For a compute-optimal training run, you should train roughly 20 tokens per parameter:

Chinchilla optimal scaling:
  N_optimal ∝ C^0.49 (model parameters)
  D_optimal ∝ C^0.51 (training tokens)

  → Scale model size and data size roughly equally

Implication for a fixed compute budget:
  10× more compute → ~3× bigger model AND ~3× more data

By this rule, GPT-3 (175B params, 300B tokens ≈ 1.7 tokens/param) was massively undertrained. The compute-optimal model for GPT-3's budget would have been roughly 3× smaller (~50B parameters) but trained on roughly 3× more data (~1T tokens).

The Chinchilla Experiment

DeepMind trained over 400 models, from 70M to over 16B parameters, across compute budgets from 6×10¹⁸ to 3×10²¹ FLOPs. The pattern was clear and consistent:

For a fixed compute budget C (in FLOPs), combining C = 6·N·D with D ≈ 20·N:
  Optimal N ≈ (C/120)^0.5
  Optimal D ≈ C / (6N) ≈ 20·N

Example: C = 6×10²³ FLOPs (Chinchilla's budget):
  N_optimal ≈ 70B parameters
  D_optimal ≈ 1.4T tokens (= 20 tokens per parameter)

Chinchilla 70B achieves lower loss than Gopher 280B (4× larger, same compute)
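Combining the approximation C ≈ 6·N·D with the ~20 tokens-per-parameter rule gives a closed form for the optimal allocation: 6·20·N² = C, so N = √(C/120). A sketch that reproduces the Chinchilla example:

```python
def chinchilla_optimal(compute_flops: float,
                       tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Compute-optimal allocation from C = 6*N*D with D = r*N (r ≈ 20).
    Solving 6*r*N**2 = C gives N = sqrt(C / (6*r))."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's budget of ~6e23 FLOPs yields ~70B params and ~1.4T tokens:
n, d = chinchilla_optimal(6e23)
print(f"N ≈ {n / 1e9:.0f}B params, D ≈ {d / 1e12:.1f}T tokens")
```

The `tokens_per_param` default encodes the 20:1 rule from the paper; changing it lets you explore over-trained regimes like the Llama models discussed below.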

Why Chinchilla Changed LLM Training

Before Chinchilla: GPT-3-scale models (100B–500B params, 300B–500B tokens).
After Chinchilla: models are smaller but trained on much more data.

The Llama family embodies this shift:
  Llama 1 7B: 1T tokens (143 tokens/param) — follows Chinchilla
  Llama 2 7B: 2T tokens (286 tokens/param) — goes beyond Chinchilla
  Llama 3 8B: 15T tokens (1875 tokens/param!) — dramatically over-trains

Wait — Llama 3 trains on 1875 tokens per parameter? That's 100× the Chinchilla optimum. Why?

"Inference-Optimal" vs. "Training-Optimal"

Chinchilla's optimum is about minimizing loss for a fixed training compute budget. But once a model is trained and deployed, you care about inference cost, not training cost.

A smaller, extensively trained model:

  • Has lower inference cost per request (fewer parameters = less compute per token)
  • Is cheaper to serve at scale
  • Can run on more hardware (fits on fewer GPUs)
  • Can be fine-tuned and deployed by more people

This led to the concept of inference-optimal training: deliberately over-train smaller models beyond the Chinchilla compute-optimal point to get a model that performs well and is cheap to serve.

Decision framework:
  Training budget >> deployment cost → use Chinchilla-optimal (train once, serve few)
  Training budget << deployment cost → over-train smaller model (serve millions of times)

Meta's situation with Llama 3:
  Training cost: ~$50M for 15T tokens
  Expected inference: billions of requests/day from open-source users
  → Over-train aggressively, minimize serving cost
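The decision framework can be made concrete by comparing lifetime compute. A rough sketch: training costs ~6·N FLOPs per token and a forward pass at inference costs ~2·N FLOPs per generated token; the model sizes, token counts, and serving volume below are illustrative assumptions, and the two models are assumed to reach similar quality:

```python
def lifetime_flops(n_params: float, train_tokens: float,
                   inference_tokens: float) -> float:
    """Total compute: ~6*N FLOPs per training token plus ~2*N FLOPs
    per token generated at inference (standard rough approximations)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

# Assumed serving volume over the model's lifetime (illustrative): 1e13 tokens.
serve = 1e13
big = lifetime_flops(70e9, 1.4e12, serve)    # Chinchilla-optimal 70B
small = lifetime_flops(8e9, 15e12, serve)    # over-trained 8B (Llama-3 style)
print(f"70B Chinchilla-optimal lifetime: {big:.2e} FLOPs")
print(f"8B over-trained lifetime:        {small:.2e} FLOPs")
```

At high serving volumes the inference term dominates, so the smaller over-trained model wins on total compute even though its training run alone is more expensive per parameter.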

Scaling Laws Beyond Pretraining: The Emerging Picture

Chinchilla's laws apply to pretraining loss on next-token prediction. The emerging question: do they predict performance on downstream tasks?

The relationship breaks in important ways:

Emergent capabilities: Some capabilities (multi-step arithmetic, chain-of-thought reasoning) appear suddenly at certain scales and are not predicted by smooth pretraining loss curves. A model might achieve low loss but near-zero accuracy on a task, then suddenly jump to high accuracy as scale increases.

Post-training matters more: Llama-3-8B-Instruct dramatically outperforms base Llama-3-8B. The alignment and instruction-tuning stage can unlock capabilities far beyond what pretraining loss predicts.

Test-time compute: DeepSeek-R1 and o3 reveal a different scaling dimension: spending more compute at inference (not training) through chain-of-thought, search, and verification. This curve hasn't plateaued.

Scaling Laws for Data Quality

A crucial extension not in the original Chinchilla paper: data quality changes the effective token count.

Effective token count = actual tokens × data quality multiplier

Randomly sampled web text: multiplier ~1× (baseline)
High-quality filtered data: multiplier ~3–5×
Synthetic textbook data (Phi-style): multiplier ~10–20×

This is why Phi-1 (1.3B parameters, 7B tokens of curated + synthetic data) outperforms models trained on 100B tokens of raw web data. The Chinchilla rules apply to effective tokens, not raw token counts.
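Under that framing, the Chinchilla ratio can be checked against effective rather than raw tokens. A sketch using the multipliers above, which are themselves rough illustrative estimates rather than measured constants:

```python
def effective_tokens(raw_tokens: float, quality_multiplier: float) -> float:
    """Effective token count = raw tokens x data-quality multiplier.
    The multiplier is a rough, illustrative estimate, not a measured value."""
    return raw_tokens * quality_multiplier

def effective_tokens_per_param(raw_tokens: float, quality_multiplier: float,
                               n_params: float) -> float:
    return effective_tokens(raw_tokens, quality_multiplier) / n_params

# Phi-1-style recipe: 1.3B params, 7B curated/synthetic tokens, assuming a
# ~15x multiplier (midpoint of the illustrative 10-20x range above).
ratio = effective_tokens_per_param(7e9, 15.0, 1.3e9)
print(f"effective tokens/param: {ratio:.0f}")  # well above the 20:1 baseline
```

By this (rough) accounting, a 7B-token curated corpus behaves more like a ~100B-token web corpus, which is the intuition behind the Phi results.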

Is Scaling Hitting Limits?

The ML community is actively debating whether the returns to pretraining scale are diminishing:

The pessimistic view: Web-scale text data is nearly exhausted. The next 10× of tokens doesn't exist in the same high-quality form. Labs are already using essentially all high-quality text available on the internet.

The optimistic view: Synthetic data can substitute for natural data at scale (DeepSeek, Phi, Llama 3 all use synthetic data). Multi-modal data (video, code, scientific papers) is largely untapped. Test-time compute scaling offers a new dimension.

The nuanced view: Pretraining scaling is slowing, but the combination of better data curation, synthetic data, post-training (RLHF, RLAIF, DPO), and test-time compute continues to drive rapid capability improvements.

Practical Implications for ML Engineers

When planning a training run:

  • Use 20 tokens/parameter as a baseline for Chinchilla-optimal
  • If you'll serve the model widely, consider over-training (Llama-3 style: 1000+ tokens/param)
  • If training budget dominates serving cost, Chinchilla-optimal is appropriate

When choosing a base model:

  • Llama-3-8B has seen 15T tokens — it's a very well-trained 8B model, more capable than undertrained 70B models in some settings
  • A smaller, heavily trained model often beats a larger, undertrained one at inference time

When evaluating model quality:

  • Perplexity on held-out text is predictable by scaling laws
  • Task performance is less predictable — test your actual use case

Conclusion

Scaling laws transformed LLM development from an art into a science. Chinchilla's finding — that the industry was dramatically over-sizing models and under-training them — changed how every major lab trains models. The subsequent shift to inference-optimal training (over-train smaller models) reflects the economic reality that inference cost dominates training cost at scale. The frontier is now in post-training, synthetic data, and test-time compute — new scaling dimensions that Chinchilla's original framework didn't capture.


Want to understand the training techniques that build on these insights? Read about DeepSeek-R1's RL approach and test-time compute scaling.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.