PPO: Proximal Policy Optimization

Schulman et al. · 2017 · arXiv 1707.06347

TL;DR

PPO is a policy-gradient RL algorithm that clips its objective to prevent overly large policy updates. It is the "workhorse" of RLHF, used in InstructGPT/ChatGPT to optimize LLMs with human feedback. Simple, stable, and effective.

PPO in RLHF Pipeline
1. SFT model — the frozen reference policy
2. Reward model — trained on human preferences
3. PPO — optimizes the policy against the reward model by looping:
   sample from the policy → score with the reward model → clip & update the policy
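The sample → score → update loop above can be sketched as a toy rollout phase. This is a minimal illustration: `sample_response` and `reward_model` are stand-in stubs (a 2-action categorical "policy" and a fixed scorer), not real models.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_response(policy_logits):
    # Step 1: sample from the current policy (toy: one categorical action
    # stands in for generating a full response token-by-token).
    probs = np.exp(policy_logits) / np.sum(np.exp(policy_logits))
    return int(rng.choice(len(probs), p=probs))

def reward_model(action):
    # Step 2: score with a stand-in reward model (prefers action 1).
    return 1.0 if action == 1 else 0.0

# Step 3 (clip & update) would apply the PPO-Clip objective from
# Section 1 to this collected batch of (action, reward) pairs.
batch = []
for _ in range(4):
    a = sample_response(np.zeros(2))
    batch.append((a, reward_model(a)))
```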

1. The PPO Objective

PPO-Clip objective
$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
- r_t(θ) — probability ratio π_θ(a|s) / π_old(a|s): how much has the policy changed?
- Â_t — estimated advantage: how much better is action a than the average? Positive = good, negative = bad
- ε — clip range (typically 0.2), limiting how far r_t can deviate from 1
- clip() — clamps r_t to [1−ε, 1+ε], preventing the policy from changing too much in one step
- min() — conservative update: takes the worse of clipped/unclipped, so the objective only improves if BOTH agree
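The clipped objective can be sketched in a few lines of NumPy. This `ppo_clip_loss` helper is illustrative (not the paper's reference implementation); it returns the negated objective so that minimizing the loss maximizes L^CLIP.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO-Clip surrogate loss (negated objective, to be minimized).

    logp_new / logp_old: log-probs of the taken actions under the
    current and old policies; advantages: estimated A-hat_t.
    """
    ratio = np.exp(logp_new - logp_old)                   # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # min() makes the update conservative: improvement only counts
    # when the clipped and unclipped terms agree.
    return -np.mean(np.minimum(unclipped, clipped))
```

For example, if the ratio has grown to 2 with a positive advantage, the clipped term (1.2 · Â) wins the min, so the gradient incentive to push the ratio further vanishes.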

2. PPO for LLMs (RLHF)

When applied to LLMs, the setup becomes:

RLHF with PPO
$$\text{reward} = r_\phi(x, y) - \beta\, D_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]$$
- State = the prompt x
- Action = the full response y
- Reward = reward model score minus the KL penalty
- Policy = the LLM π_θ
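A minimal sketch of the KL-penalized reward, assuming the common per-token approximation KL ≈ log π_θ(y|x) − log π_ref(y|x) summed over the response tokens (the `rlhf_reward` helper is illustrative):

```python
import numpy as np

def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-sequence RLHF reward: reward-model score minus a KL penalty.

    rm_score: scalar r_phi(x, y) from the reward model.
    logp_policy / logp_ref: per-token log-probs of response y under
    the policy pi_theta and the frozen reference pi_ref.
    """
    # Approximate KL divergence from the reference policy; keeps the
    # policy from drifting into degenerate high-reward text.
    kl = np.sum(np.asarray(logp_policy) - np.asarray(logp_ref))
    return rm_score - beta * kl
```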

3. Connections

DPO

Eliminates PPO entirely by reparameterizing the reward — simpler but less flexible.

GRPO

Simplifies PPO by removing the critic/value network and using group-relative baselines.
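The group-relative baseline can be sketched as reward normalization within a group of responses sampled for the same prompt. This is an illustrative helper under that reading of GRPO, not an exact reference implementation:

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize rewards within a group of
    responses to one prompt, replacing a learned value network."""
    r = np.asarray(rewards, dtype=float)
    # Each response's advantage is its reward relative to the group
    # mean, scaled by the group's reward spread.
    return (r - r.mean()) / (r.std() + 1e-8)
```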
