Stage 1: Supervised fine-tuning (SFT)
SFT trains the model to imitate high-quality demonstrations: instructions paired with ideal responses. This establishes the “assistant” format and improves immediate helpfulness.
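In loss terms, SFT is ordinary next-token prediction restricted to the response: the prompt is given as context but masked out of the loss, so the model only learns to reproduce the demonstration. A minimal sketch with toy per-token log-probabilities (the function name and numbers are illustrative, not from any particular library):

```python
def sft_loss(token_logprobs, loss_mask):
    """Negative log-likelihood averaged over response tokens only.

    token_logprobs: log-probability the model assigns to each target token.
    loss_mask: 1 for response tokens, 0 for prompt tokens (context, not target).
    """
    total = sum(-lp * m for lp, m in zip(token_logprobs, loss_mask))
    return total / sum(loss_mask)

# Toy sequence: 3 prompt tokens (masked out), 2 response tokens.
logprobs = [-0.1, -0.2, -0.3, -0.5, -0.7]
mask = [0, 0, 0, 1, 1]
print(sft_loss(logprobs, mask))  # averages only the last two terms
```

Masking the prompt matters: without it, the model also spends capacity imitating user instructions rather than responses.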
Stage 2: Preference modeling
Human raters compare two (or more) model responses to the same prompt and pick the better one. A reward model is trained to predict these preferences, typically with a pairwise (Bradley–Terry) loss, creating a differentiable signal for “what humans like.”
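The standard pairwise objective scores both responses with the reward model and pushes the chosen one above the rejected one. A sketch of the Bradley–Terry loss on scalar reward scores (toy values; no real model behind them):

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model already ranks the chosen response higher;
    large when it prefers the rejected one.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(2.0, 0.0))   # small: ranking agrees with the rater
print(preference_loss(0.0, 2.0))   # large: ranking disagrees
```

Only the score difference matters, which is why reward-model scores have no absolute scale across prompts.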
Stage 3: Reinforcement learning (classical RLHF)
In classical RLHF, the policy (the language model) is optimized to maximize reward model scores while staying close to the SFT model, usually via a KL penalty against that reference. Algorithms like PPO became the default because they constrain each policy update and reduce catastrophic drift.
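The “maximize reward while staying close” tradeoff is commonly implemented by subtracting a per-token KL estimate from the reward-model score before the RL update. A sketch of that shaped reward (the function name and coefficient value are illustrative):

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward signal used in classical RLHF training.

    rm_score: reward-model score for the sampled response.
    logp_policy, logp_ref: log-prob of the response under the current
        policy and the frozen SFT reference model.
    beta: KL coefficient; larger values keep the policy closer to SFT.
    """
    kl_estimate = logp_policy - logp_ref  # per-sample KL estimate
    return rm_score - beta * kl_estimate

# Policy drifted toward higher-probability (for it) text: penalized.
print(shaped_reward(1.0, logp_policy=-2.0, logp_ref=-3.0))
```

If beta is too small, the policy drifts and reward hacking accelerates; too large, and it barely moves off the SFT model.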
Why people now say “post-training”
Modern systems often combine or replace classic RL with alternatives: direct preference optimization (DPO) and its variants, rejection sampling, multi-objective training, constitutional or self-critiquing data generation, and targeted safety fine-tunes. The common theme: optimize behavior under constraints after broad pretraining.
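DPO illustrates the “replace the RL loop” idea concretely: it folds the reward model and KL constraint into a single supervised loss on preference pairs, using only log-probabilities from the policy and the frozen reference. A sketch on toy sequence log-probs (illustrative values):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct preference optimization loss for one preference pair.

    Compares how much the policy has shifted toward the chosen response,
    relative to the reference model, versus toward the rejected one.
    No explicit reward model and no RL sampling loop are needed.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to reference: margin is 0, loss is log(2).
print(dpo_loss(-1.0, -2.0, ref_chosen=-1.0, ref_rejected=-2.0))
```

The implicit reward here is beta times the policy-vs-reference log-ratio, which is why DPO still inherits reward-model-style blind spots from its preference data.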
Failure modes and tradeoffs
Reward hacking: optimizing the reward model can exploit its blind spots.
Overrefusal: safety constraints can push models to refuse benign requests.
Helpfulness vs harmlessness: the “right” balance is contextual and product-dependent.
Distribution shift: real user prompts differ from training prompts, and both user behavior and deployment policies evolve over time.
Where this connects next
Optimization choices matter here (see Optimization). Scaling changes what’s even possible to align (see Scaling Laws). And deployed behavior is the full stack (see ChatGPT).