TL;DR
PPO (Proximal Policy Optimization) is a policy-gradient RL algorithm that clips its objective to prevent overly large policy updates. It's the "workhorse" of RLHF — used in InstructGPT/ChatGPT to optimize LLMs with human feedback. Simple, stable, and effective.
1. The PPO Objective
PPO-Clip objective:

L^CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t ) ]

where:
- r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t) — probability ratio: how much has the policy changed?
- Â_t — estimated advantage: how much better is action a_t than average? Positive = good, negative = bad.
- ε — clip range (typically 0.2): limits how much r_t can deviate from 1.
- clip() — clamps r_t to [1−ε, 1+ε]: prevents the policy from changing too much in one step.
- min() — conservative update: takes the worse of the clipped/unclipped terms, so the objective only improves if BOTH agree.
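The clipped objective above can be sketched in a few lines of plain Python; the function name `ppo_clip_objective` is illustrative, not from any library.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip objective (to be maximized).

    ratio:     r_t = pi_theta(a|s) / pi_old(a|s)
    advantage: estimated advantage A_t
    eps:       clip range epsilon
    """
    # clamp the ratio to [1 - eps, 1 + eps]
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    # take the worse (more pessimistic) of clipped and unclipped terms
    return min(ratio * advantage, clipped_ratio * advantage)
```

For example, with a positive advantage the clip caps the gain once the ratio exceeds 1+ε, and with a negative advantage the min keeps the full penalty of a large ratio shift — both directions discourage big policy changes.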
2. PPO for LLMs (RLHF)
When applied to LLMs, the setup becomes:
RLHF with PPO
- State = prompt x
- Action = full response y
- Reward = reward model score minus a KL penalty
- Policy = the LLM π_θ
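The reward term above (RM score minus KL penalty) can be sketched as follows; `rlhf_reward` and its arguments are illustrative names, and the per-sequence log-ratio is a common single-sample estimate of the KL term.

```python
def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    """Sequence-level RLHF reward: RM score minus a KL penalty.

    rm_score:    reward model score for (prompt x, response y)
    logp_policy: log pi_theta(y | x) under the current policy
    logp_ref:    log pi_ref(y | x) under the frozen reference model
    beta:        KL penalty coefficient (illustrative default)
    """
    # log-ratio estimate of KL(pi_theta || pi_ref) for this sample;
    # penalizes the policy for drifting away from the reference model
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate
```

The KL penalty is what keeps the optimized LLM from reward-hacking its way to degenerate text: high RM score alone is not enough if the response is very unlikely under the reference model.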
3. Connections
DPO
Eliminates PPO entirely by reparameterizing the reward — simpler but less flexible.
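The DPO reparameterization reduces training to a simple classification-style loss on preference pairs; this is a minimal sketch in plain Python, with illustrative argument names, assuming sequence-level log-probabilities from the policy and a frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (to be minimized).

    The implicit reward of a response is beta * (log pi_theta - log pi_ref);
    the loss pushes the chosen response's implicit reward above the rejected one's.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): small when the chosen response is clearly preferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note there is no reward model, no value network, and no sampling loop here — that is the sense in which DPO "eliminates PPO entirely", at the cost of being tied to offline preference pairs.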
GRPO
Simplifies PPO by removing the critic/value network and using group-relative baselines.
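The group-relative baseline can be sketched as follows: sample several responses to the same prompt, then normalize each reward against the group's mean and standard deviation instead of a learned value function. The function name is illustrative.

```python
def group_relative_advantages(rewards):
    """Advantages for one group of responses to the same prompt (GRPO-style).

    Replaces the critic's baseline with the group mean, normalized
    by the group standard deviation.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = variance ** 0.5
    if std == 0.0:
        std = 1.0  # all rewards equal: no preference signal, advantages are 0
    return [(r - mean) / std for r in rewards]
```

Responses scoring above the group mean get positive advantages and are reinforced; those below get negative advantages — no value network needed.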