PPO: Proximal Policy Optimization

Schulman et al. · 2017 · arXiv 1707.06347

TL;DR

PPO is a policy-gradient RL algorithm that clips its objective to prevent overly large policy updates. It is the "workhorse" of RLHF, used in InstructGPT/ChatGPT to optimize LLMs with human feedback. Simple, stable, and effective.

PPO in RLHF Pipeline
1. SFT model — the frozen reference policy
2. Reward model — trained on human preferences
3. PPO — optimizes the policy against the reward model by looping:
   sample from the policy → score with the reward model → clip & update the policy
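The sample → score → update loop above can be sketched as a toy rollout phase. This is a minimal illustration: `sample_response` and `reward_model` are stand-in stubs (a 2-action categorical "policy" and a fixed scorer), not real models.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_response(policy_logits):
    # Step 1: sample from the current policy (toy: one categorical action
    # stands in for generating a full response token-by-token).
    probs = np.exp(policy_logits) / np.sum(np.exp(policy_logits))
    return int(rng.choice(len(probs), p=probs))

def reward_model(action):
    # Step 2: score with a stand-in reward model (prefers action 1).
    return 1.0 if action == 1 else 0.0

# Step 3 (clip & update) would apply the PPO-Clip objective from
# Section 1 to this collected batch of (action, reward) pairs.
batch = []
for _ in range(4):
    a = sample_response(np.zeros(2))
    batch.append((a, reward_model(a)))
```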

1. The PPO Objective

PPO-Clip objective
$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
- r_t(θ) — probability ratio π_θ(a|s) / π_old(a|s): how much has the policy changed?
- Â_t — estimated advantage: how much better is action a than the average? Positive = good, negative = bad
- ε — clip range (typically 0.2), limiting how far r_t can deviate from 1
- clip() — clamps r_t to [1−ε, 1+ε], preventing the policy from changing too much in one step
- min() — conservative update: takes the worse of clipped/unclipped, so the objective only improves if BOTH agree
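The clipped objective can be sketched in a few lines of NumPy. This `ppo_clip_loss` helper is illustrative (not the paper's reference implementation); it returns the negated objective so that minimizing the loss maximizes L^CLIP.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO-Clip surrogate loss (negated objective, to be minimized).

    logp_new / logp_old: log-probs of the taken actions under the
    current and old policies; advantages: estimated A-hat_t.
    """
    ratio = np.exp(logp_new - logp_old)                   # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # min() makes the update conservative: improvement only counts
    # when the clipped and unclipped terms agree.
    return -np.mean(np.minimum(unclipped, clipped))
```

For example, if the ratio has grown to 2 with a positive advantage, the clipped term (1.2 · Â) wins the min, so the gradient incentive to push the ratio further vanishes.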

2. PPO for LLMs (RLHF)

When applied to LLMs, the setup becomes:

RLHF with PPO
$$\text{reward} = r_\phi(x, y) - \beta\, D_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]$$
- State = the prompt x
- Action = the full response y
- Reward = reward model score minus the KL penalty
- Policy = the LLM π_θ
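A minimal sketch of the KL-penalized reward, assuming the common per-token approximation KL ≈ log π_θ(y|x) − log π_ref(y|x) summed over the response tokens (the `rlhf_reward` helper is illustrative):

```python
import numpy as np

def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-sequence RLHF reward: reward-model score minus a KL penalty.

    rm_score: scalar r_phi(x, y) from the reward model.
    logp_policy / logp_ref: per-token log-probs of response y under
    the policy pi_theta and the frozen reference pi_ref.
    """
    # Approximate KL divergence from the reference policy; keeps the
    # policy from drifting into degenerate high-reward text.
    kl = np.sum(np.asarray(logp_policy) - np.asarray(logp_ref))
    return rm_score - beta * kl
```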

3. Connections

DPO

Eliminates PPO entirely by reparameterizing the reward — simpler but less flexible.

GRPO

Simplifies PPO by removing the critic/value network and using group-relative baselines.
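The group-relative baseline can be sketched as reward normalization within a group of responses sampled for the same prompt. This is an illustrative helper under that reading of GRPO, not an exact reference implementation:

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize rewards within a group of
    responses to one prompt, replacing a learned value network."""
    r = np.asarray(rewards, dtype=float)
    # Each response's advantage is its reward relative to the group
    # mean, scaled by the group's reward spread.
    return (r - r.mean()) / (r.std() + 1e-8)
```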
