design pattern 2025-02-25 13 min read

Synthetic Data for LLM Training: Techniques, Trade-offs, and Industry Practice

How leading labs use synthetic data to train better LLMs — from self-instruct to distillation to execution-verified synthetic code. Covers the techniques, quality control, and when synthetic data helps vs. hurts.

synthetic data · LLM training · self-instruct · distillation · data generation · phi · Llama · alignment

Introduction

One of the most significant shifts in LLM development over the past two years is the rise of synthetic data — training examples generated by AI models rather than written by humans. Microsoft's Phi series demonstrated that a 1.3B model trained on high-quality synthetic data could match models 10x its size. Meta used synthetic data extensively in Llama 3's training. Google's Gemini models were partially trained on synthetic reasoning traces.

Understanding how synthetic data is generated, filtered, and used is now essential knowledge for anyone training or fine-tuning language models.

Why Synthetic Data?

Human-curated datasets are expensive to build and bottlenecked by human throughput. Synthetic data addresses several constraints:

Scale: A strong LLM can generate millions of training examples overnight; humans cannot.

Coverage: You can generate examples for rare, underrepresented scenarios that human datasets miss.

Quality control: You can filter synthetic data with automated quality signals (model perplexity, execution verification, reward model scoring) at scale.

Distribution control: You can precisely specify what kinds of examples to generate — specific topics, difficulty levels, formats.

The risk: garbage in, garbage out. Poorly filtered synthetic data introduces model-generated artifacts, errors, and biases that can be worse than no data at all.

Self-Instruct: Bootstrapping from a Base Model

Self-Instruct (Wang et al., 2022) is the foundational technique: use a language model to generate its own instruction-following training data.

Self-Instruct pipeline:
1. Start with a small seed of human-written task examples (~175)
2. Prompt the LLM to generate new task instructions:
   "Generate 5 new instructions similar to but different from these examples: ..."
3. Prompt the LLM to generate responses to the new instructions
4. Filter: remove low-quality, duplicate, or trivially similar instructions
5. Add to training pool → fine-tune the model on the accumulated data
6. Repeat with the improved model

This bootstrapped approach can generate hundreds of thousands of instruction-following examples from a small seed. Stanford's Alpaca used it to fine-tune LLaMA-7B for under $600 in total.
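
The dedup filter in step 4 can be sketched as follows. Self-Instruct itself scores similarity with ROUGE-L; this sketch substitutes plain word-set (Jaccard) overlap, and both function names are illustrative:

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two instructions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def filter_new_instructions(candidates, pool, max_sim=0.7):
    """Keep only candidates that are not near-duplicates of anything
    already in the pool (or of earlier-kept candidates)."""
    kept = []
    for cand in candidates:
        if all(token_overlap(cand, existing) < max_sim for existing in pool + kept):
            kept.append(cand)
    return kept
```

Because kept candidates are added to the comparison set as the loop runs, a batch of near-identical generations collapses to a single survivor.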

Key limitation: The synthetic data inherits all the base model's biases and errors. Self-Instruct improves instruction-following but doesn't improve factual accuracy or add new knowledge.

Distillation: Teacher to Student

A more powerful approach: use a stronger "teacher" model to generate data for training a smaller "student" model.

Teacher (large, expensive):  GPT-4, Claude, Llama-70B
                    ↓
Generate high-quality responses to diverse prompts
                    ↓
Student (small, cheap to serve): 7B, 13B parameter model
                    ↓
Fine-tune student on teacher's outputs

This is how models like WizardLM, Orca, and Zephyr were created. The critical insight from Microsoft's Orca paper: it's not just about generating responses — it's about generating rich reasoning traces.

Orca prompted GPT-4 with "explain your reasoning step by step" and trained the smaller model on those explanations, not just the final answers. The smaller model learned to reason like GPT-4, not just mimic its outputs.

# Orca-style data generation prompt
# (`gpt4` stands in for any teacher-model client; adapt to your API)
system_prompt = "You are a helpful assistant. Think step by step and explain your reasoning before answering."

synthetic_dataset = []
for instruction in seed_instructions:
    response = gpt4.complete(
        system=system_prompt,
        user=instruction
    )
    # response includes an explicit reasoning trace
    synthetic_dataset.append({
        "instruction": instruction,
        "response": response  # includes "<reasoning>...</reasoning><answer>...</answer>"
    })
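
To train on such traces, the teacher output often has to be split back into its reasoning and answer parts. A minimal parser for the tagged format above (the tag names and the `split_trace` helper are illustrative, not from the Orca paper):

```python
import re

def split_trace(response: str):
    """Split a teacher response into (reasoning, answer) using the
    <reasoning>/<answer> tags; fall back to the whole text as the answer."""
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return (
        reasoning.group(1).strip() if reasoning else "",
        answer.group(1).strip() if answer else response.strip(),
    )
```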

Phi: When Synthetic Data Beats Scale

Microsoft's Phi-1 and Phi-1.5 models (2023) made the most dramatic case for high-quality synthetic data. Phi-1 (1.3B parameters) outperformed GPT-3.5 on Python coding benchmarks despite being 100x smaller.

The key insight: the Phi team filtered web-scale corpora for "textbook quality" educational content and supplemented it with GPT-4-generated synthetic textbooks and exercises.

Phi data recipe:
  "textbook quality" code and web text: ~7B tokens
    → filtered from The Stack and Stack Overflow using a quality classifier trained on curated examples

  GPT-4 synthetic textbooks: ~1B tokens
    → prompted GPT-4: "Write a textbook chapter that teaches Python to a beginner, including exercises"
    → diverse topics, concepts, and difficulty levels

  GPT-4 synthetic exercises: ~180M tokens
    → code exercises with solutions, specifically designed to teach programming concepts

Total: ~8B tokens — tiny by modern standards, but extremely high quality

Phi-1's benchmark performance shattered assumptions about the relationship between model size and capability. The lesson: data quality per token matters enormously.
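
The recipe's first step, scoring raw text with a quality classifier, follows a simple filter-by-score pattern. The sketch below stands in for a trained classifier with a toy lexical heuristic; the vocabulary, threshold, and function names are all illustrative:

```python
# Toy stand-in for a trained quality classifier: score text by the
# density of "educational" vocabulary (the real classifier is a model
# trained on curated examples).
EDUCATIONAL_TERMS = {"example", "exercise", "definition", "define",
                     "function", "explain", "step", "because", "therefore"}

def quality_score(text: str) -> float:
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,") in EDUCATIONAL_TERMS for w in words) / len(words)

def filter_corpus(docs, threshold=0.05):
    """Keep only documents whose quality score clears the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]
```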

Execution-Verified Synthetic Code

For code generation, you have a powerful quality signal unavailable in other domains: does the code actually run correctly?

This enables a high-quality synthetic data flywheel:

1. Generate diverse coding problems (docstrings, function signatures)
2. Use LLM to generate solution candidates
3. Execute solutions against test cases
4. Keep only solutions that pass all tests  ← execution verification
5. Fine-tune on (problem, verified_solution) pairs

DeepMind's AlphaCode pioneered this approach, and most strong code models since have adopted it. The verification step dramatically increases data quality: you're training on solutions known to be correct, not merely solutions that look plausible.

# Execution-verified data generation
# (`llm.generate_code` and `passes_all_tests` are placeholders for your
#  sampling client and sandboxed test runner)
def generate_verified_solution(problem, test_cases, n_samples=10):
    candidates = []
    for _ in range(n_samples):
        solution = llm.generate_code(problem)
        if passes_all_tests(solution, test_cases):
            candidates.append(solution)

    if candidates:
        # Prefer the shortest passing solution as a proxy for cleanliness
        return min(candidates, key=len)
    return None  # discard the problem if no candidate passes
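
The `passes_all_tests` check itself can be sketched with `exec`, though a real pipeline would run candidates in a sandboxed subprocess with resource limits and timeouts. The `solve` entry-point name and the `(args, expected)` test-case format are assumptions of this sketch:

```python
def passes_all_tests(solution_code: str, test_cases) -> bool:
    """Run candidate code and check it against (args, expected) test cases.
    WARNING: exec on untrusted model output is unsafe; real pipelines
    sandbox this step in an isolated process."""
    namespace = {}
    try:
        exec(solution_code, namespace)  # defines the candidate function
        func = namespace["solve"]       # assumed entry-point name
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False                    # crashes count as failures
```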

Synthetic Reasoning Traces: The DeepSeek Contribution

DeepSeek-R1's distillation pipeline (covered in our separate post) represents the state of the art: use the strong R1 model to generate reasoning traces — full step-by-step chains of thought — and distill these into smaller models.

The distilled R1-7B model outperforms models 10x its size on reasoning benchmarks, trained purely on R1's synthetic reasoning traces.

DeepSeek distillation pipeline:
  R1 (671B, strong reasoner)
    ↓
  Generate 800K reasoning traces on diverse problems:
    <think>
    Let me break this down...
    [multi-step reasoning]
    I can verify this by...
    </think>
    <answer>...</answer>
    ↓
  Filter: keep traces where final answer is correct
    ↓
  SFT on filtered (problem, reasoning_trace) pairs → 7B distilled model
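
The filtering step above reduces to checking each trace's final answer against a reference. A minimal version, assuming exact-match string answers (real pipelines use math-aware or execution-based checkers):

```python
import re

def extract_answer(trace: str):
    """Pull the final answer out of an <answer>...</answer> block, if present."""
    match = re.search(r"<answer>(.*?)</answer>", trace, re.DOTALL)
    return match.group(1).strip() if match else None

def filter_traces(traces, gold_answers):
    """Keep (problem, trace) pairs whose final answer matches the reference."""
    kept = []
    for (problem, trace), gold in zip(traces, gold_answers):
        if extract_answer(trace) == gold.strip():
            kept.append((problem, trace))
    return kept
```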

Quality Control: The Critical Step

Synthetic data is only as good as your filtering pipeline. Common quality filters:

1. Perplexity filtering: High perplexity from a small reference model often indicates unusual/low-quality text. Used by the Phi team.

2. Reward model scoring: Score synthetic examples with a reward model; keep high-scoring ones.

3. Deduplication: Aggressive near-deduplication using MinHash or SimHash. Models trained on duplicated synthetic data memorize artifacts.

4. Model-as-judge filtering: Use a strong LLM to rate synthetic examples on quality dimensions.

# Quality filtering pipeline
# (`judge_model` is a placeholder for a strong LLM-as-judge client)
import numpy as np

def filter_synthetic_examples(examples):
    quality_scores = judge_model.batch_score(
        examples,
        criteria=["accuracy", "clarity", "educational_value", "no_harmful_content"]
    )

    # Keep the top 50% of examples by judge score
    threshold = np.percentile(quality_scores, 50)
    return [ex for ex, score in zip(examples, quality_scores) if score > threshold]

5. Distribution analysis: Check that synthetic data doesn't collapse to a narrow distribution. Measure diversity of topics, styles, and structures.
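
Filter 3's near-deduplication can be sketched with exact Jaccard similarity over character shingles; MinHash approximates the same comparison cheaply at scale. The shingle size and threshold here are illustrative:

```python
def shingles(text: str, n: int = 3):
    """Character n-gram shingle set for a document (whitespace-normalized)."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}

def near_dedup(docs, threshold=0.8):
    """Greedy near-deduplication by exact Jaccard similarity of shingle sets.
    MinHash sketches approximate this comparison in sub-linear space."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(len(s & t) / len(s | t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```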

When Synthetic Data Helps vs. Hurts

Synthetic data helps:

  • Instruction following and format compliance (easy to generate diverse examples)
  • Reasoning traces for tasks with verifiable correct answers
  • Rare domain coverage (generate medical, legal, scientific examples)
  • Data augmentation for imbalanced categories

Synthetic data hurts:

  • When teacher model is wrong: errors propagate and amplify
  • When you need grounding in real-world facts and events (LLMs hallucinate)
  • When there's too much homogeneity: synthetic data from one model family can introduce systematic biases that reduce the student's generalization

The fundamental limitation: A model cannot generate synthetic data that exceeds its own capabilities in open-ended domains. Synthetic data is powerful for teaching smaller models what larger models already know, not for discovering genuinely new knowledge.

Conclusion

Synthetic data has gone from a research curiosity to a cornerstone of LLM development. The Phi series proved that data quality trumps data scale; DeepSeek proved that distillation of reasoning traces unlocks capabilities far beyond what the student model would learn from human demonstrations alone. The essential toolkit: self-instruct for instruction diversity, teacher distillation for quality, execution verification for code, and aggressive filtering for everything.


Curious about the post-training techniques that rely on synthetic data? Read our guide on DeepSeek R1's GRPO training.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.