GRPO: Group Relative Policy Optimization

Shao et al. (DeepSeek) Β· 2024 Β· arXiv 2402.03300

TL;DR

GRPO is DeepSeek's simplified PPO variant. Key change: instead of training a separate value/critic network to estimate advantages, GRPO samples a GROUP of responses and uses the group's mean reward as the baseline. No critic network β†’ less memory, simpler training.

GRPO vs PPO

  • PPO needs a value network (critic): an extra model to train that roughly doubles memory
  • GRPO's key change: replace the critic with group statistics. Sample G responses per prompt and normalize their rewards within the group
  • Advantage = (reward - group_mean) / group_std
  • No critic network needed, so roughly 50% less GPU memory
  • Used in DeepSeek-R1

1. The Problem with PPO's Critic

In standard PPO, you need a value network V_ψ(s) to estimate the expected future reward at each state. This value network:

  • Is often as large as the policy model itself
  • Requires its own training loop
  • Can be hard to train well for language tasks

2. The GRPO Solution

For each prompt x, sample G responses and compute advantages using group statistics:

GRPO advantage estimation:

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}$$

where:

  • $G$: group size, the number of responses sampled per prompt (e.g. 64)
  • $r_i$: reward for the i-th response in the group (from a reward model or verifier)
  • $\hat{A}_i$: normalized advantage, i.e. the z-score within the group. Positive means better than the group average, negative means worse
GRPO objective (same structure as PPO, but with group advantages):

$$\mathcal{L}_{\text{GRPO}} = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)}\hat{A}_i,\ \text{clip}\!\left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i\right) - \beta\, D_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\right]$$
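A minimal NumPy sketch of both pieces. Per-response log-probabilities are assumed to be already summed over tokens; `clip_eps` is an illustrative value, and the KL penalty term is omitted for brevity:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Z-score each reward within its group: (r - mean) / std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, averaged over the group (to be maximized)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return float(np.minimum(unclipped, clipped).mean())

# Group of 4 responses: two correct (r=1), two wrong (r=0)
adv = group_advantages([1.0, 0.0, 1.0, 0.0])  # approx [+1, -1, +1, -1]
obj = grpo_surrogate([-0.9, -1.2, -1.0, -1.1],
                     [-1.0, -1.0, -1.0, -1.0], adv)
```

Note that `np.minimum` applies the pessimistic bound elementwise, exactly as the $\min(\cdot)$ inside the sum does.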

Prompt: "What is 7 Γ— 8?" Sample G=4 responses:

y₁: "56" β†’ r₁ = 1.0 βœ“
yβ‚‚: "54" β†’ rβ‚‚ = 0.0 βœ—
y₃: "56" β†’ r₃ = 1.0 βœ“
yβ‚„: "58" β†’ rβ‚„ = 0.0 βœ—
mean = 0.5, std = 0.5
Â₁ = (1.0 - 0.5) / 0.5 = +1.0 (reward correct answers)
Γ‚β‚‚ = (0.0 - 0.5) / 0.5 = -1.0 (penalize wrong answers)
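The arithmetic above can be checked in a few lines (the example uses the population standard deviation):

```python
import statistics

rewards = [1.0, 0.0, 1.0, 0.0]    # r1..r4 from the 7 x 8 example
mean = statistics.fmean(rewards)   # 0.5
std = statistics.pstdev(rewards)   # 0.5 (population std)
advantages = [(r - mean) / std for r in rewards]
print(advantages)                  # [1.0, -1.0, 1.0, -1.0]
```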

3. Why GRPO Works for Math/Code

GRPO is especially effective for verifiable tasks (math, code) where exact rewards (correct/incorrect) can be computed automatically, with no learned reward model in the loop. DeepSeek used it to train DeepSeekMath (51.7% on the MATH benchmark) and DeepSeek-R1.
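As a toy illustration of such a verifiable reward, consider an exact-match checker. The regex-based answer extraction here is a hypothetical stand-in, not DeepSeek's actual verifier:

```python
import re

def verifier_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the last number in the response matches the answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if nums and nums[-1] == ground_truth else 0.0

rewards = [verifier_reward(y, "56") for y in ["56", "54", "56", "58"]]
# rewards == [1.0, 0.0, 1.0, 0.0], exactly the group from the example above
```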

4. Connections

PPO

GRPO's parent β€” same clipping mechanism but GRPO replaces the critic with group statistics.

DPO

Different approach: DPO removes RL entirely, while GRPO keeps RL (online sampling) but simplifies it. GRPO tends to work better for verifiable math/code tasks; DPO is easier to apply for general alignment.

5. Additional Resources