Case study · 2025-02-10 · 14 min read

DeepSeek-R1: How Reinforcement Learning Unlocks Reasoning in LLMs

A deep dive into DeepSeek-R1's training methodology — using pure RL to teach LLMs to reason step-by-step, and what this means for the future of AI at scale.

Tags: DeepSeek, reinforcement learning, reasoning, LLM, GRPO, chain of thought

Introduction

In January 2025, DeepSeek released R1 — a reasoning model that matched or exceeded OpenAI's o1 on a range of benchmarks while being trained at a fraction of the cost. What made R1 extraordinary wasn't just its performance, but how it achieved that performance: primarily through reinforcement learning — its R1-Zero precursor used no supervised fine-tuning on human-generated chain-of-thought data at all.

This post breaks down R1's architecture, training methodology, and the engineering lessons it offers for ML practitioners.

The Core Insight: RL Can Teach Reasoning

Before R1, the standard recipe for reasoning models was:

  1. Generate high-quality chain-of-thought (CoT) data
  2. Fine-tune the model on that data
  3. Optionally apply RLHF on top

This is expensive. You need human annotators or a strong teacher model to generate CoT data, and that data can embed biases in how the model "thinks."

DeepSeek-R1's key bet was: what if we skip supervised CoT data entirely and just reward correct answers?

GRPO: Group Relative Policy Optimization

DeepSeek trained R1 using GRPO, a variant of PPO optimized for language models. The key difference: instead of a separate value network (expensive), GRPO estimates baselines from a group of sampled responses.

For each question q:
  1. Sample G responses: {o1, o2, ..., oG} from policy π
  2. Score each response with a reward function r
  3. Normalize rewards within the group
  4. Update policy to increase probability of high-reward responses

This means the model gets a signal of "was this response better or worse than my other attempts at this same question?" — a clean relative comparison that's more stable than absolute reward values.
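Steps 3–4 hinge on this group-relative normalization. A minimal sketch of that step (an illustration only, not DeepSeek's implementation; the reward values are invented):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of G responses to one question.

    Each response's advantage is its reward relative to the group:
    A_i = (r_i - mean(r)) / (std(r) + eps).
    No value network needed -- the group itself is the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# G = 4 sampled responses to the same question, scored 0/1 for correctness
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct responses get positive advantage, incorrect negative
```

These advantages then weight a PPO-style policy-gradient update: tokens from above-average responses are pushed up, tokens from below-average responses pushed down.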

Emergent Reasoning Behaviors

The most striking finding: when trained with RL on correctness alone, the model spontaneously developed reasoning behaviors:

  • Self-reflection: Re-reading and questioning its own intermediate steps
  • Backtracking: Abandoning incorrect approaches mid-solution
  • Verification: Cross-checking final answers using alternative methods

These weren't engineered — they emerged from reward maximization. The model learned that longer, more careful reasoning led to more correct answers.

The "Aha Moment"

DeepSeek's paper documents a specific emergent behavior they call the "aha moment" — where the model, mid-solution, realizes its approach is wrong and pivots. This mirrors human metacognition and appeared without any explicit training signal for it.

The Training Pipeline

Stage 1: Cold Start Fine-tuning

Pure RL from scratch on a base model leads to unstable training (incoherent outputs, language mixing). DeepSeek addressed this with a small "cold start" SFT phase using a few thousand high-quality CoT examples — just enough to stabilize the format, not to teach reasoning itself.

Stage 2: RL Training at Scale

Base model: DeepSeek-V3 (671B MoE)
RL algorithm: GRPO
Reward signals:
  - Accuracy reward: is the final answer correct?
  - Format reward: does the output follow <think>...</think> structure?
  - No process reward model (PRM) needed

The absence of a process reward model is significant. PRMs require expensive annotation of which reasoning steps are correct. DeepSeek showed that outcome-based rewards alone suffice for tasks with automatically verifiable answers, such as math and code.
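A rule-based reward of this shape can be sketched as follows (a simplified illustration; the weights and the exact format check are assumptions, not DeepSeek's actual values):

```python
import re

# Expect a <think>...</think> block followed by the final answer
THINK_RE = re.compile(r"^<think>.*</think>\s*(.+)$", re.DOTALL)

def reward(output: str, gold_answer: str) -> float:
    """Rule-based reward: a small format bonus for using the
    <think>...</think> structure, plus an accuracy reward if the text
    after the think block matches the gold answer exactly."""
    m = THINK_RE.match(output.strip())
    if m is None:
        return 0.0                       # malformed output: no reward at all
    format_reward = 0.1                  # illustrative weight
    answer = m.group(1).strip()
    accuracy_reward = 1.0 if answer == gold_answer else 0.0
    return format_reward + accuracy_reward

print(reward("<think>2+2 is 4</think> 4", "4"))  # format + accuracy
print(reward("the answer is 4", "4"))            # no think block: zero
```

Because both checks are deterministic string rules, this reward needs no learned model and cannot be gamed by plausible-sounding but wrong reasoning steps.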

Stage 3: Rejection Sampling + SFT + RLHF

To improve helpfulness and safety (not just reasoning), DeepSeek applied a final stage:

  1. Generate many responses to diverse prompts
  2. Keep only high-quality ones via rejection sampling
  3. Fine-tune on this curated set
  4. Apply a second round of RLHF
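The rejection-sampling step above can be sketched like this (a toy illustration; `generate`, `score`, and the threshold are hypothetical stand-ins for model sampling and a quality judge):

```python
import random

def rejection_sample(prompt, generate, score, n=16, threshold=0.8):
    """Generate n candidate responses, keep only those scoring above a
    quality threshold, and return (prompt, response) pairs for SFT."""
    responses = [generate(prompt) for _ in range(n)]
    return [(prompt, r) for r in responses if score(prompt, r) >= threshold]

# Toy stand-ins: a constant generator and random "quality" scores
random.seed(0)
sft_pairs = rejection_sample(
    "Explain GRPO.",
    generate=lambda p: "a response",
    score=lambda p, r: random.random(),
)
print(len(sft_pairs), "of 16 responses kept for fine-tuning")
```

The curated pairs become the supervised dataset for step 3, so only the model's own best outputs are reinforced.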

Distillation: R1-7B Punches Above Its Weight

Perhaps the most practically important result: DeepSeek distilled R1's reasoning into smaller models (1.5B to 70B parameters) using the CoT traces from R1 as supervised training data.

The 7B distilled model outperforms models twice its size on reasoning benchmarks. This makes high-quality reasoning accessible to teams without the compute for frontier-scale models.

Engineering Lessons

1. Outcome rewards beat process rewards for reasoning

You don't need to label how the model should think — just whether it got the right answer. This dramatically reduces annotation costs.

2. Model size enables emergent behaviors

The self-reflection and backtracking behaviors required sufficient model capacity. Smaller models showed much weaker emergence under the same RL training.

3. Format matters more than content for RL stability

Enforcing a consistent output format (the <think> block) during early training prevented mode collapse and made reward signals cleaner.

4. Distillation captures reasoning, not just outputs

The key to R1's distilled models working well: training on reasoning traces, not just final answers. This teaches the student model how to think, not just what to output.
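One way to picture this: each distillation example pairs a question with the teacher's full trace as the supervised target (the schema below is illustrative, not the actual dataset format):

```python
def to_distillation_example(question: str, trace: str, answer: str) -> dict:
    """Build a supervised example whose target includes the teacher's full
    reasoning trace, not just the final answer, so the student learns the
    intermediate steps as well as the conclusion."""
    return {
        "prompt": question,
        "target": f"<think>{trace}</think> {answer}",
    }

ex = to_distillation_example(
    "What is 17 * 6?",
    "17 * 6 = 17 * 5 + 17 = 85 + 17 = 102",
    "102",
)
print(ex["target"])
```

Training on targets of this shape is just standard next-token SFT, which is why distillation is so much cheaper than running RL on the small model directly.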

Implications for the Industry

R1 demonstrated that:

  • Strong reasoning capabilities can be developed with substantially less compute than previously assumed
  • RL on outcome-based rewards is a viable and scalable path
  • Reasoning can be transferred through distillation, democratizing access

For teams building ML systems at scale, this opens the door to reasoning-capable models at practical serving costs.

Conclusion

DeepSeek-R1 is a landmark result not because it's the biggest model, but because it challenged the assumptions underlying how reasoning models are built. The RL-first approach, emergent CoT behaviors, and efficient distillation pipeline represent a shift in how the field thinks about training capable AI systems.


Interested in LLM inference and serving? Check out our guide on LLM Inference at Scale.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.