Optimizers: SGD vs Adam vs AdamW
SGD with momentum is simple and can generalize well, especially in vision. Adam adapts a per-parameter learning rate from running estimates of the gradient's first and second moments, which often improves early training speed and stability. AdamW decouples weight decay from the adaptive update: in plain Adam, an L2 penalty folded into the gradient gets rescaled by the adaptive denominator, so the effective decay strength varies per parameter. This decoupling became a common default for Transformers.
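To make the coupled-vs-decoupled distinction concrete, here is a minimal single-scalar sketch of one Adam step with the decay applied either way. Hyperparameter values and the function name are illustrative, not from any library:

```python
import math

def adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              wd=0.0, decoupled=False):
    """One Adam/AdamW step for a scalar parameter (illustration only)."""
    if wd and not decoupled:
        g = g + wd * p             # classic Adam: L2 penalty folded into the gradient
    m = b1 * m + (1 - b1) * g      # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * g * g  # second-moment estimate
    m_hat = m / (1 - b1 ** t)      # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    if wd and decoupled:
        p = p - lr * wd * p        # AdamW: decay applied directly to the weight
    return p, m, v

# Same gradient, same decay strength, different resulting step: the coupled
# penalty is divided by the adaptive denominator, the decoupled one is not.
p_adam,  *_ = adam_step(1.0, 0.5, 0.0, 0.0, 1, wd=0.1, decoupled=False)
p_adamw, *_ = adam_step(1.0, 0.5, 0.0, 0.0, 1, wd=0.1, decoupled=True)
print(p_adam, p_adamw)
```

With these numbers the coupled variant's extra decay gradient is almost entirely normalized away, while the decoupled variant shrinks the weight by a full lr * wd * p.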
Learning rate schedules
Large-scale training often uses warmup (to avoid early instability) followed by decay (cosine, linear, step). The learning rate is frequently the highest-leverage hyperparameter: too high and training diverges; too low and you waste compute.
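A warmup-then-cosine schedule is easy to state as a closed-form function of the step. The sketch below uses hypothetical values (max_lr, warmup length, total steps); real runs tune all of them:

```python
import math

def lr_at(step, max_lr=3e-4, warmup=1000, total=10000, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay to min_lr (illustrative values)."""
    if step < warmup:
        return max_lr * (step + 1) / warmup           # ramp up to avoid early blowups
    progress = (step - warmup) / max(1, total - warmup)  # 0 -> 1 over the decay phase
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(999), lr_at(10000))
```

The schedule peaks exactly at the end of warmup and lands on min_lr at the final step, so the two phases join without a discontinuity.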
Batch size and gradient noise
Increasing batch size reduces gradient noise (the standard error of the mini-batch gradient falls roughly as 1/sqrt(B)) and can improve hardware utilization. But it also changes optimization dynamics: large batches often need learning-rate adjustments, such as scaling the learning rate up with batch size, or extra regularization to keep generalization strong.
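The 1/sqrt(B) scaling of gradient noise can be checked with a toy simulation: treat per-example gradients as i.i.d. unit-variance noise and measure the spread of the mini-batch mean. Everything here is synthetic, for illustration only:

```python
import random
import statistics

random.seed(0)

def grad_noise(batch_size, trials=2000):
    """Std of the mini-batch mean of noisy per-example 'gradients' (toy model)."""
    means = [statistics.fmean(random.gauss(0.0, 1.0) for _ in range(batch_size))
             for _ in range(trials)]
    return statistics.stdev(means)

# Quadrupling the batch should roughly halve the noise (1/sqrt(B) scaling).
n4, n16 = grad_noise(4), grad_noise(16)
print(n4, n16)
```

The measured ratio lands near 2, matching sqrt(16/4); real gradients are correlated and non-Gaussian, but the scaling trend is the same.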
Stability: why training diverges
Common sources of instability include outlier batches, numerical precision issues, excessive learning rates, and activation/attention spikes. Practical mitigations include gradient clipping, loss scaling (for mixed precision), better initialization, and careful normalization.
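Gradient clipping is the most mechanical of these mitigations: rescale the whole gradient vector when its global L2 norm exceeds a threshold. A minimal sketch over a flat list of values (frameworks apply the same idea across all parameter tensors):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads                      # small gradients pass through untouched
    scale = max_norm / norm
    return [g * scale for g in grads]     # direction preserved, magnitude capped

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5 -> rescaled to 1
print(clipped)
```

Because the whole vector is scaled by one factor, clipping caps the step size without changing the update direction, which is why it tames outlier batches without biasing typical ones.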
Regularization and generalization
Weight decay, dropout, data augmentation (especially for vision), and early stopping are classic. For LLMs, data composition and post-training often dominate “generalization” in the user-perceived sense.
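Of the classic techniques listed above, early stopping is simple enough to sketch in a few lines: stop once the validation loss has failed to improve for a fixed number of evaluations, and keep the best checkpoint seen so far. Function name and patience value are illustrative:

```python
def early_stopping(val_losses, patience=3):
    """Return (stop_step, best_step): stop after `patience` evals without improvement."""
    best, best_step, waited = float("inf"), 0, 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best, best_step, waited = loss, step, 0  # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return step, best_step  # stop here; restore the best checkpoint
    return len(val_losses) - 1, best_step

stop, best = early_stopping([1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74])
print(stop, best)
```

Here the loss bottoms out at step 2 and three non-improving evaluations later training stops at step 5.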
How this ties to scaling and post-training
Scaling laws (see Scaling Laws) assume you can actually train at scale without instabilities. Post-training (see RLHF and Post-Training) inherits all these issues, plus the additional challenges of preference/reward optimization.