GRPO (Group Relative Policy Optimization) is DeepSeek's simplified PPO variant. Key change: instead of training a separate value/critic network to estimate advantages, GRPO samples a GROUP of responses and uses the group's mean reward as the baseline. No critic network means less memory and simpler training.
GRPO vs PPO

- PPO: needs a value network (critic), an extra model to train that roughly doubles memory.
- GRPO: replaces the critic with group statistics.
  - Sample G responses per prompt, then normalize rewards within the group
  - Advantage = (reward - group_mean) / group_std
  - No critic network needed: roughly 50% less GPU memory
- Used in DeepSeek-R1
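The comparison above can be sketched as a single update step. This is a minimal sketch, not DeepSeek's implementation: `policy.sample`, `policy.update`, and `reward_fn` are hypothetical stand-ins for the sampling policy, the PPO-style parameter update, and the reward model or verifier.

```python
def grpo_step(policy, reward_fn, prompt, G=8, eps=1e-8):
    """One GRPO update on a single prompt: sample a group, score it,
    and use the group mean/std as the baseline instead of a critic.

    policy.sample / policy.update and reward_fn are hypothetical stubs.
    """
    responses = [policy.sample(prompt) for _ in range(G)]
    rewards = [reward_fn(prompt, r) for r in responses]
    mean = sum(rewards) / G
    std = (sum((r - mean) ** 2 for r in rewards) / G) ** 0.5
    advantages = [(r - mean) / (std + eps) for r in rewards]
    policy.update(prompt, responses, advantages)  # PPO-style clipped update
    return advantages
```

Note that the baseline comes entirely from the sampled group, so no extra network is trained or stored.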
1. The Problem with PPO's Critic
In standard PPO, you need a value network V_φ(s) to estimate the expected future reward at each state. This value network:

- is often as large as the policy model itself
- requires its own training loop
- can be hard to train well for language tasks
2. The GRPO Solution
For each prompt x, sample G responses and compute advantages using group statistics:
- G: group size, the number of responses sampled per prompt (e.g. 64)
- r_i: reward for the i-th response in the group (from a reward model or verifier)
- Â_i: normalized advantage, the z-score within the group; positive = better than the group average, negative = worse
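The normalization above is just a z-score over the group's rewards. A minimal sketch (the function name and epsilon guard against zero std are my additions):

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Z-score each reward against its own group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

For binary rewards like [1, 0, 0, 1] (correct/incorrect), the mean is 0.5 and the std is 0.5, so correct responses get an advantage of about +1 and incorrect ones about -1.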
GRPO objective (same structure as PPO but with group advantages):

J_GRPO(θ) = E[ (1/G) Σ_i min( ρ_i Â_i, clip(ρ_i, 1−ε, 1+ε) Â_i ) ] − β D_KL(π_θ || π_ref),
where ρ_i = π_θ(y_i | x) / π_θ_old(y_i | x) is the importance ratio for response y_i.
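The clipped surrogate part of this objective can be sketched in a few lines of plain Python. This is an illustrative per-response version with hypothetical argument names; the KL penalty term is omitted for brevity:

```python
import math

def grpo_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate averaged over the group (KL penalty omitted).

    logp_new / logp_old: per-response log-probs under the current /
    sampling policy. advantages: group-normalized z-scores.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)            # importance ratio rho_i
        clipped = max(min(ratio, 1 + eps), 1 - eps)  # clip to [1-eps, 1+eps]
        total += min(ratio * adv, clipped * adv)     # pessimistic (clipped) term
    return total / len(advantages)                   # maximize this quantity
```

When the current and sampling policies agree (ratio = 1), the surrogate reduces to the mean advantage, which is zero by construction of the z-scores.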
GRPO is especially effective for verifiable tasks (math, code) where you can compute exact rewards (correct/incorrect). DeepSeek used it to train DeepSeekMath (51.7% on the MATH benchmark) and DeepSeek-R1.
Different approach: DPO removes RL entirely, while GRPO keeps RL (online sampling) but simplifies it. GRPO tends to work better for math/code with verifiable rewards; DPO is easier to apply for general alignment.