Introduction
In January 2025, DeepSeek released R1, a reasoning model that matched or exceeded OpenAI's o1 on a range of benchmarks while being trained at a fraction of the cost. What made R1 extraordinary wasn't just the results, but how it got them: through reinforcement learning, without supervised fine-tuning on human-generated chain-of-thought data.
This post breaks down R1's architecture, training methodology, and the engineering lessons it offers for ML practitioners.
The Core Insight: RL Can Teach Reasoning
Before R1, the standard recipe for reasoning models was:
- Generate high-quality chain-of-thought (CoT) data
- Fine-tune the model on that data
- Optionally apply RLHF on top
This is expensive. You need human annotators or a strong teacher model to generate CoT data, and that data can embed biases in how the model "thinks."
DeepSeek-R1's key bet was: what if we skip supervised CoT data entirely and just reward correct answers?
GRPO: Group Relative Policy Optimization
DeepSeek trained R1 using GRPO, a variant of PPO adapted for language models. The key difference: instead of training a separate value network (expensive at LLM scale), GRPO estimates the baseline from a group of responses sampled for the same prompt.
For each question q:
1. Sample a group of G responses {o_1, o_2, ..., o_G} from the current policy π
2. Score each response o_i with a reward function (R1 uses rule-based rewards rather than a learned reward model), giving r_i
3. Compute each response's advantage by normalizing within the group: A_i = (r_i - mean(r)) / std(r)
4. Update the policy to increase the probability of high-advantage responses
This means the model gets a signal of "was this response better or worse than my other attempts at this same question?" — a clean relative comparison that's more stable than absolute reward values.
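The group-relative baseline at the heart of GRPO is easy to sketch. The snippet below is a minimal illustration, assuming simple mean/std whitening within the group; function names are mine, not DeepSeek's.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each response's reward is normalized
    against the other responses sampled for the same prompt. This plays
    the role of the learned value baseline in standard PPO."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero when all rewards tie
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled answers to one question, two correct (1.0), two wrong (0.0).
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct answers get positive advantage, incorrect ones negative, regardless of the absolute reward scale, which is exactly the stability property described above.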
Emergent Reasoning Behaviors
The most striking finding: when trained with RL on correctness alone, the model spontaneously developed reasoning behaviors:
- Self-reflection: Re-reading and questioning its own intermediate steps
- Backtracking: Abandoning incorrect approaches mid-solution
- Verification: Cross-checking final answers using alternative methods
These weren't engineered — they emerged from reward maximization. The model learned that longer, more careful reasoning led to more correct answers.
The "Aha Moment"
DeepSeek's paper documents a specific emergent behavior they call the "aha moment" — where the model, mid-solution, realizes its approach is wrong and pivots. This mirrors human metacognition and appeared without any explicit training signal for it.
The Training Pipeline
Stage 1: Cold Start Fine-tuning
Pure RL on a base model (the R1-Zero experiment) produces strong reasoning but messy outputs: poor readability and frequent language mixing. DeepSeek addressed this with a small "cold start" SFT phase using a few thousand high-quality CoT examples, just enough to stabilize the format, not to teach reasoning itself.
Stage 2: RL Training at Scale
Base model: DeepSeek-V3-Base (671B-parameter MoE, roughly 37B active per token)
RL algorithm: GRPO
Reward signals:
- Accuracy reward: is the final answer correct?
- Format reward: does the output follow <think>...</think> structure?
- No process reward model (PRM) needed
The absence of a process reward model is significant. PRMs require expensive annotation of which reasoning steps are correct. DeepSeek showed outcome-based rewards alone suffice.
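Because the rewards are rule-based, they fit in a few lines of code. The sketch below shows the two signals named above; exact string match stands in for a real grader, and all function names are illustrative.

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the response wraps its reasoning in a <think>...</think>
    block before the final answer, else 0.0."""
    ok = re.fullmatch(r"<think>.*</think>.*", output.strip(), flags=re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(output: str, gold_answer: str) -> float:
    """1.0 if the text after the last </think> matches the reference.
    A real grader would parse math expressions or execute code; exact
    string match is a stand-in here."""
    final = output.split("</think>")[-1].strip()
    return 1.0 if final == gold_answer else 0.0

def total_reward(output: str, gold_answer: str) -> float:
    # Outcome + format only: no learned process reward model.
    return accuracy_reward(output, gold_answer) + format_reward(output)
```

Note that nothing inside the `<think>` block is ever scored; the model is free to reason however it likes, which is what allows the emergent behaviors described earlier.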
Stage 3: Rejection Sampling + SFT + RLHF
To improve helpfulness and safety (not just reasoning), DeepSeek applied a final stage:
- Generate many responses to diverse prompts
- Keep only high-quality ones via rejection sampling
- Fine-tune on this curated set
- Apply a second round of RLHF
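The rejection-sampling step above can be sketched as a simple filter. This is a toy illustration under my own assumptions: `generate` and `score` are stand-ins for the model's sampler and the quality grader, and the deterministic fake sampler exists only for the demo.

```python
from itertools import cycle

def rejection_sample(prompt, generate, score, n=16, threshold=0.5):
    """Draw n candidate responses for a prompt and keep only those whose
    quality score clears the threshold. Survivors become (prompt, response)
    pairs for the curated SFT set."""
    candidates = [generate(prompt) for _ in range(n)]
    return [(prompt, c) for c in candidates if score(prompt, c) >= threshold]

# Toy demo with a deterministic fake sampler: half the drafts are good.
drafts = cycle(["a careful, correct answer", "a sloppy draft"])
generate = lambda prompt: next(drafts)
score = lambda prompt, resp: 1.0 if "correct" in resp else 0.0
curated = rejection_sample("Explain GRPO.", generate, score, n=8)
```

The design choice worth noting: generation is cheap relative to human annotation, so oversampling and discarding most outputs is an economical way to build a high-quality SFT corpus.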
Distillation: R1-7B Punches Above Its Weight
Perhaps the most practically important result: DeepSeek distilled R1's reasoning into smaller models (1.5B to 70B parameters) using the CoT traces from R1 as supervised training data.
The 7B distilled model outperforms models twice its size on reasoning benchmarks. This makes high-quality reasoning accessible to teams without the compute for frontier-scale models.
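What makes this distillation different from ordinary fine-tuning is the shape of the training target. A minimal sketch of the distinction, assuming teacher outputs of the form `<think>...</think>answer` (the helper name is mine):

```python
def distillation_target(teacher_trace: str, keep_reasoning: bool = True) -> str:
    """Build the student's SFT target from a teacher generation shaped like
    '<think>reasoning</think>final answer'. Keeping the trace teaches the
    student how to reason; dropping it would teach only what to output."""
    if keep_reasoning:
        return teacher_trace
    return teacher_trace.split("</think>")[-1].strip()

trace = "<think>Try x = 2: 2 * 2 = 4, which matches.</think>x = 2"
full_target = distillation_target(trace)                        # reasoning + answer
answer_only = distillation_target(trace, keep_reasoning=False)  # answer alone
```

The distilled R1 models train on the full target, which is why the students inherit the reasoning style and not merely the answers.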
Engineering Lessons
1. Outcome rewards beat process rewards for reasoning
You don't need to label how the model should think — just whether it got the right answer. This dramatically reduces annotation costs.
2. Model size enables emergent behaviors
The self-reflection and backtracking behaviors required sufficient model capacity. Smaller models showed much weaker emergence under the same RL training.
3. Format matters more than content for RL stability
Enforcing a consistent output format (the <think> block) during early training prevented mode collapse and made reward signals cleaner.
4. Distillation captures reasoning, not just outputs
The key to R1's distilled models working well: training on reasoning traces, not just final answers. This teaches the student model how to think, not just what to output.
Implications for the Industry
R1 demonstrated that:
- Strong reasoning capabilities can be developed with substantially less compute than previously assumed
- RL on outcome-based rewards is a viable and scalable path
- Reasoning can be transferred through distillation, democratizing access
For teams building ML systems at scale, this opens the door to reasoning-capable models at practical serving costs.
Conclusion
DeepSeek-R1 is a landmark result not because it's the biggest model, but because it challenged the assumptions underlying how reasoning models are built. The RL-first approach, emergent CoT behaviors, and efficient distillation pipeline represent a shift in how the field thinks about training capable AI systems.
Interested in LLM inference and serving? Check out our guide on LLM Inference at Scale.