Scaling Laws: Why Bigger (Often) Works

Scaling laws describe how loss and capability improve predictably as you increase model size, data, and compute—until you hit new bottlenecks. They don’t prove “AGI by brute force,” but they do explain why scaling became a dominant strategy.

What “scaling law” means in practice

Empirically, within certain regimes, training loss falls as a smooth, predictable function of compute, parameters, and data, often well approximated by a power law. This lets teams forecast the return on additional investment and choose configurations that are near-optimal for a given compute budget.
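As a toy illustration of such forecasting (the numbers below are synthetic, generated from an assumed power law, not measurements from any real training run), a handful of (compute, loss) pairs can be fit to the form L(C) = L_inf + a·C^(−b) and extrapolated to a larger budget:

```python
import math

# Hypothetical (compute, loss) pairs; compute in arbitrary units.
# These were generated from L(C) = 1.7 + 2.6 * C ** -0.05 and rounded.
runs = [(1.0, 4.30), (10.0, 4.02), (100.0, 3.77), (1000.0, 3.54)]

IRREDUCIBLE = 1.7  # assumed irreducible loss (entropy floor); a modeling choice

# Fit log(L - L_inf) = log(a) - b * log(C) by ordinary least squares.
xs = [math.log(c) for c, _ in runs]
ys = [math.log(loss - IRREDUCIBLE) for _, loss in runs]
n = len(runs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
a = math.exp(my - slope * mx)
b = -slope  # power-law exponent

def predict_loss(compute: float) -> float:
    """Extrapolate loss to a larger budget (assumes the same regime holds)."""
    return IRREDUCIBLE + a * compute ** (-b)

print(f"fit: L(C) ~ {IRREDUCIBLE} + {a:.2f} * C^-{b:.3f}")
print(f"forecast at C = 1e4: {predict_loss(1e4):.2f}")
```

The point is not the specific constants but the workflow: fit small runs, then budget large ones against the extrapolated curve.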

The compute–data–parameters triangle

Given a compute budget, you can spend it on:

More parameters (bigger model).

More tokens (more data).

More training steps (e.g., additional passes over the same data).

In many regimes there is a compute-optimal frontier: an undertrained model that is too large, or an overtrained model that is too small, wastes compute relative to a balanced allocation.
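One widely cited rule of thumb, from the Chinchilla line of work, approximates training compute as C ≈ 6·N·D (N parameters, D tokens) and suggests roughly 20 training tokens per parameter. A minimal sketch of that allocation, treating both constants as rough heuristics rather than exact values:

```python
def compute_optimal_allocation(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget between parameter count N and token count D.

    Uses the common approximation C ~ 6 * N * D together with the
    Chinchilla-style heuristic D ~ 20 * N. Both are rules of thumb,
    not exact constants, and only apply within the fitted regime.
    """
    # C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r))
    n_params = (flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: allocate a 1e23 FLOPs budget.
n, d = compute_optimal_allocation(1e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Doubling the budget under this heuristic increases both N and D by about √2, which is why "just make it bigger" without also scaling data drifts off the frontier.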

Capability emergence and thresholds

Some behaviors look like they “suddenly appear” at scale (tool use, multi-step planning, code synthesis). Often the underlying improvement is continuous, but our evaluation is discrete: a task is either passed or failed, so small continuous gains can cross a threshold.
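A small simulation makes this concrete (the accuracy curve below is entirely made up for illustration): per-step accuracy improves smoothly with scale, but a task that requires many steps to all succeed, graded pass/fail, appears to switch on abruptly.

```python
def per_step_accuracy(log_scale: float) -> float:
    """Hypothetical smooth trend: accuracy climbs from 0.5 toward 1.0."""
    return 1.0 - 0.5 * (0.7 ** log_scale)

K = 20  # number of chained steps the task requires; all must succeed

for log_scale in range(0, 9, 2):
    p = per_step_accuracy(log_scale)
    task = p ** K  # probability the whole chain succeeds
    verdict = "PASS" if task > 0.5 else "FAIL"
    print(f"scale 10^{log_scale}: step acc {p:.3f}, task success {task:.3f} -> {verdict}")
```

Step accuracy moves gradually from 0.50 to about 0.97, yet the pass/fail verdict flips only at the largest scale: a continuous improvement, read through a discrete metric, looks like emergence.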

What scaling laws don’t tell you

They don’t guarantee alignment.

They don’t define the best architecture forever.

They don’t eliminate the need for data quality work or careful evaluation.

They don’t fully capture inference-time constraints like latency and context length.

Connections

Scaling interacts with optimization stability (see Optimization), and it changes the “surface area” of post-training (see RLHF and Post-Training). Scaling also influenced organizational strategy and splits across labs (see OpenAI and New Labs).