DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning
Published on: October 06, 2025
Shifting from Supervised Learning to Reinforcement Learning
```mermaid
graph TD
    subgraph "DeepSeek's Approach (Pure RL)"
        direction TB
        DS_HP["Hard Reasoning Problems<br/>(Math, Code)"] --> DS_RLLoop{"Reinforcement Learning<br/>Loop"}
        DS_RLLoop -- "Generates Reasoning & Answer" --> DS_Verifier{"Rule-Based Verifier"}
        DS_Verifier -- "Receives Reward Signal<br/>(based on final answer only)" --> DS_RLLoop
        DS_RLLoop --> DS_LLMR["LLM's Reasoning<br/>(Self-Discovered Pathways)"]
        subgraph Benefits
            direction TB
            B1["Surpasses human performance"]
            B2["Autonomous self-improvement"]
        end
        DS_LLMR --> Benefits
    end
    subgraph "Traditional Approach (SFT)"
        direction TB
        TA_HARS["Human-Annotated<br/>Reasoning Steps"] --> TA_SFT{"Supervised<br/>Fine-Tuning"}
        TA_SFT --> TA_LLMR["LLM's Reasoning<br/>(Mimics Human Thought)"]
        subgraph Limitations
            direction TB
            L1["Capped by human<br/>performance"]
            L2["Introduces cognitive biases"]
        end
        TA_LLMR --> Limitations
    end
    style L1 fill:#f8d7da,stroke:#721c24
    style L2 fill:#f8d7da,stroke:#721c24
    style B1 fill:#d4edda,stroke:#155724
    style B2 fill:#d4edda,stroke:#155724
```
Emergence of Sophisticated Reasoning Behaviors
```mermaid
graph TD
    A[Base LLM] --> B["Incentivized by RL on<br/>Hard Problems with Simple Rewards"]
    B --> C(Emergent Reasoning Engine)
    subgraph " "
        direction LR
        C --> D["📈 Increased Thinking Time<br/>(Generates Longer Chain-of-Thought)"]
        C --> E["🕵️ Self-Verification<br/>(Checks its own<br/>calculations and logic)"]
        C --> F["🤔 Self-Reflection<br/>(Identifies mistakes and<br/>re-evaluates)"]
        C --> G["💡 'Aha Moment'<br/>(Sudden strategy shifts,<br/>e.g., using 'Wait, let's<br/>reevaluate...')"]
    end
    style C fill:#cce5ff,stroke:#004085,stroke-width:2px
    style D fill:#fff3cd,stroke:#856404
    style E fill:#fff3cd,stroke:#856404
    style F fill:#fff3cd,stroke:#856404
    style G fill:#fff3cd,stroke:#856404
```
The Two-Model Development Pipeline
```mermaid
graph TD
    A["DeepSeek-V3<br/>Base Model"] --> B{"Stage 1: Pure<br/>Reinforcement Learning"}
    B -- "on reasoning tasks" --> C[DeepSeek-R1-Zero]
    C -- "Inherits Core Reasoning" --> D{"Stage 2: Multi-stage<br/>Refinement & Alignment"}
    D -- "Includes SFT, Rejection<br/>Sampling & Secondary RL" --> E[DeepSeek-R1]
    subgraph "Model Properties"
        direction LR
        P1["R1-Zero: ✅ Powerful Reasoner | ❌ Poor Readability"]
        P2["R1: ✅ Powerful Reasoner | ✅ Human-Aligned & Readable"]
    end
    C -.-> P1
    E -.-> P2
    style A fill:#e0e0e0,stroke:#333
    style C fill:#cce5ff,stroke:#004085
    style E fill:#d4edda,stroke:#155724
```
Efficient Training with Group Relative Policy Optimization (GRPO)
graph TD subgraph "PPO (Traditional Actor-Critic)" direction TB PPO_A["Policy Model
(Actor)"] -- "Generates action" --> PPO_B{Environment}; PPO_B -- "Returns state, reward" --> PPO_C["Value Model
(Critic)"]; PPO_C -- "Computes Advantage" --> PPO_A; PPO_D["Requires two complex networks
working in tandem"]; end subgraph "GRPO (Simpler Approach used in the paper)" direction TB GRPO_A[Policy Model] -- "Generates a group of G responses" --> GRPO_B["{Response 1..G}"]; GRPO_B --> GRPO_C{"Reward Model"}; GRPO_C -- "Assigns reward to each response" --> GRPO_D["{Reward 1..G}"]; GRPO_D --> GRPO_E{"Group Computation
(Calculates relative advantage)"}; GRPO_E --> GRPO_A; GRPO_F["More efficient: Eliminates
the need for a separate Value Model"]; end style GRPO_F fill:#d4edda,stroke:#155724 style PPO_D fill:#f8d7da,stroke:#721c24
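The "Group Computation" box is the heart of GRPO: instead of a learned value (critic) baseline, each response's reward is normalized against the other responses sampled for the same prompt. A minimal sketch of that advantage calculation is below; it shows only the normalization step, omitting the clipping and KL terms of the full GRPO objective, and the example reward values are made up.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each reward against the group of G
    responses sampled for the same prompt, i.e. A_i = (r_i - mean(r)) / std(r).
    This group baseline replaces PPO's separate value model."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:                       # identical rewards -> no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: G = 4 responses to one prompt, scored by a rule-based verifier.
rewards = [1.1, 0.1, 0.0, 1.1]
print(group_relative_advantages(rewards))
# Responses above the group mean get positive advantage, those below get negative.
```

Because the baseline is just the group's own statistics, only the policy model needs to be trained and held in memory, which is where the efficiency gain over PPO comes from.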
Source: