What “scaling law” means in practice
Empirically, within certain regimes, training loss decreases as a smooth, approximately power-law function of compute, parameters, and data. This lets teams forecast the return on additional scale and choose configurations that are near-optimal for a given compute budget.
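To make the "smooth function" claim concrete, here is a minimal sketch of a Chinchilla-style parametric loss curve, L(N, D) = E + A/N^α + B/D^β. The specific coefficients below are illustrative placeholders, not fitted values from any real training run:

```python
def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Chinchilla-style parametric fit: L = E + A/N^alpha + B/D^beta.
    Coefficients are illustrative assumptions, not measured values."""
    E, A, alpha, B, beta = 1.7, 400.0, 0.34, 410.0, 0.28
    # Irreducible loss E, plus terms that shrink as parameters (N)
    # and tokens (D) grow -- each with diminishing returns.
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling data at fixed parameters lowers predicted loss smoothly.
base = predicted_loss(1e9, 2e10)
more_data = predicted_loss(1e9, 4e10)
more_params = predicted_loss(2e9, 2e10)
```

A curve like this is what makes forecasting possible: fit the constants on small runs, then extrapolate to budgets you have not trained at yet.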
The compute–data–parameters triangle
Given a compute budget, you can spend it on:
More parameters (bigger model).
More tokens (more data).
More passes over the same data (more optimizer steps per token).
In many regimes there’s an “optimal frontier” along this trade-off: undertraining (too many parameters, too few tokens) or overtraining (the reverse) both waste compute relative to a balanced allocation.
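The allocation trade-off above can be sketched with the common approximation C ≈ 6·N·D (training FLOPs as a function of parameters N and tokens D) plus a fixed token-to-parameter ratio. The 20:1 ratio used here is a commonly cited Chinchilla-style rule of thumb; treat both the formula and the ratio as assumptions, not universal constants:

```python
import math

def split_budget(budget_flops: float, tokens_per_param: float = 20.0):
    """Split a training FLOP budget between parameters and tokens,
    assuming C ~= 6*N*D and a fixed D/N ratio (an assumed rule of
    thumb, not a fitted law)."""
    # C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r)),  D = r * N
    n_params = math.sqrt(budget_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = split_budget(1e21)  # a ~1e21 FLOP budget
```

Deviating from the chosen ratio at a fixed budget is exactly the "undertraining or overtraining" waste described above: the same FLOPs buy a worse loss.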
Capability emergence and thresholds
Some behaviors look like they “suddenly appear” at scale (tool use, multi-step planning, code synthesis). Often the underlying improvement is continuous, but our evaluation is discrete: a task either passes or fails, so small continuous gains can cross a threshold and register as an abrupt jump.
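A toy model makes the threshold effect vivid. Suppose a task passes only if every one of several sub-steps succeeds; then smooth gains in per-step accuracy produce a sharp jump in end-to-end pass rate. This is a simplified illustration of the discrete-metric argument, not a claim about any specific benchmark:

```python
def task_pass_rate(per_step_accuracy: float, steps: int = 10) -> float:
    """End-to-end success when a task requires `steps` independent
    sub-steps to all succeed: pass rate = p ** steps."""
    return per_step_accuracy ** steps

# Per-step accuracy improves smoothly across model scales...
accuracies = [0.5, 0.8, 0.9, 0.95]
pass_rates = [task_pass_rate(a) for a in accuracies]
# ...but the end-to-end metric stays near zero, then jumps -- which
# reads as "emergence" even though the underlying gain was continuous.
```

The general point: the sharper the evaluation threshold (more required sub-steps, stricter pass criteria), the more "emergent" a perfectly smooth capability curve will look.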
What scaling laws don’t tell you
They don’t guarantee alignment.
They don’t define the best architecture forever.
They don’t replace data quality concerns or evaluation.
They don’t fully capture inference-time constraints like latency and context length.
Connections
Scaling interacts with optimization stability (see Optimization), and it changes the “surface area” of post-training (see RLHF and Post-Training). Scaling also influenced organizational strategy and splits across labs (see OpenAI and New Labs).