GRPO: Group Relative Policy Optimization

Shao et al. (DeepSeek) Β· 2024 Β· arXiv 2402.03300

TL;DR

GRPO is DeepSeek's simplified PPO variant. Key change: instead of training a separate value/critic network to estimate advantages, GRPO samples a GROUP of responses and uses the group's mean reward as the baseline. No critic network β†’ less memory, simpler training.

GRPO vs PPO

  • PPO needs a value network (critic): an extra model to train that roughly doubles memory
  • GRPO's key change: replace the critic with group statistics. Sample G responses per prompt and normalize their rewards within the group
  • Advantage = (reward - group_mean) / group_std
  • No critic network needed, so roughly 50% less GPU memory
  • Used in DeepSeek-R1

1. The Problem with PPO's Critic

In standard PPO, you need a value network V_ψ(s) to estimate the expected future reward at each state. This value network:

  • Is often as large as the policy model itself
  • Requires its own training loop
  • Can be hard to train well for language tasks

2. The GRPO Solution

For each prompt x, sample G responses and compute advantages using group statistics:

GRPO advantage estimation:

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}$$

where:

  • $G$: group size, the number of responses sampled per prompt (e.g. 64)
  • $r_i$: reward for the i-th response in the group (from a reward model or verifier)
  • $\hat{A}_i$: normalized advantage, i.e. the z-score within the group. Positive means better than the group average, negative means worse
GRPO objective (same structure as PPO, but with group advantages):

$$\mathcal{L}_{\text{GRPO}} = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)}\hat{A}_i,\ \text{clip}\!\left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i\right) - \beta\, D_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\right]$$
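A minimal NumPy sketch of both pieces. Per-response log-probabilities are assumed to be already summed over tokens; `clip_eps` is an illustrative value, and the KL penalty term is omitted for brevity:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Z-score each reward within its group: (r - mean) / std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, averaged over the group (to be maximized)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return float(np.minimum(unclipped, clipped).mean())

# Group of 4 responses: two correct (r=1), two wrong (r=0)
adv = group_advantages([1.0, 0.0, 1.0, 0.0])  # approx [+1, -1, +1, -1]
obj = grpo_surrogate([-0.9, -1.2, -1.0, -1.1],
                     [-1.0, -1.0, -1.0, -1.0], adv)
```

Note that `np.minimum` applies the pessimistic bound elementwise, exactly as the $\min(\cdot)$ inside the sum does.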

Prompt: "What is 7 Γ— 8?" Sample G=4 responses:

y₁: "56" β†’ r₁ = 1.0 βœ“
yβ‚‚: "54" β†’ rβ‚‚ = 0.0 βœ—
y₃: "56" β†’ r₃ = 1.0 βœ“
yβ‚„: "58" β†’ rβ‚„ = 0.0 βœ—
mean = 0.5, std = 0.5
Â₁ = (1.0 - 0.5) / 0.5 = +1.0 (reward correct answers)
Γ‚β‚‚ = (0.0 - 0.5) / 0.5 = -1.0 (penalize wrong answers)
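The arithmetic above can be checked in a few lines (the example uses the population standard deviation):

```python
import statistics

rewards = [1.0, 0.0, 1.0, 0.0]    # r1..r4 from the 7 x 8 example
mean = statistics.fmean(rewards)   # 0.5
std = statistics.pstdev(rewards)   # 0.5 (population std)
advantages = [(r - mean) / std for r in rewards]
print(advantages)                  # [1.0, -1.0, 1.0, -1.0]
```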

3. Why GRPO Works for Math/Code

GRPO is especially effective for verifiable tasks (math, code) where exact rewards (correct/incorrect) can be computed automatically, with no learned reward model in the loop. DeepSeek used it to train DeepSeekMath (51.7% on the MATH benchmark) and DeepSeek-R1.
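As a toy illustration of such a verifiable reward, consider an exact-match checker. The regex-based answer extraction here is a hypothetical stand-in, not DeepSeek's actual verifier:

```python
import re

def verifier_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the last number in the response matches the answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if nums and nums[-1] == ground_truth else 0.0

rewards = [verifier_reward(y, "56") for y in ["56", "54", "56", "58"]]
# rewards == [1.0, 0.0, 1.0, 0.0], exactly the group from the example above
```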

4. Connections

PPO

GRPO's parent β€” same clipping mechanism but GRPO replaces the critic with group statistics.

DPO

Different approach: DPO removes RL entirely, while GRPO keeps RL (online sampling) but simplifies it. GRPO tends to work better for verifiable math/code tasks; DPO is easier to apply for general alignment.

5. Additional Resources