Scaling Laws: Why Bigger (Often) Works

Scaling laws describe how loss and capability improve predictably as you increase model size, data, and compute—until you hit new bottlenecks. They don’t prove “AGI by brute force,” but they do explain why scaling became a dominant strategy.

What “scaling law” means in practice

Empirically, within certain regimes, training loss falls as a smooth, predictable function of compute, parameters, and data, often well approximated by a power law. This lets teams forecast the return on additional investment and choose configurations that are near-optimal for a given compute budget.
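As a toy illustration of such forecasting (the numbers below are synthetic, generated from an assumed power law, not measurements from any real training run), a handful of (compute, loss) pairs can be fit to the form L(C) = L_inf + a·C^(−b) and extrapolated to a larger budget:

```python
import math

# Hypothetical (compute, loss) pairs; compute in arbitrary units.
# These were generated from L(C) = 1.7 + 2.6 * C ** -0.05 and rounded.
runs = [(1.0, 4.30), (10.0, 4.02), (100.0, 3.77), (1000.0, 3.54)]

IRREDUCIBLE = 1.7  # assumed irreducible loss (entropy floor); a modeling choice

# Fit log(L - L_inf) = log(a) - b * log(C) by ordinary least squares.
xs = [math.log(c) for c, _ in runs]
ys = [math.log(loss - IRREDUCIBLE) for _, loss in runs]
n = len(runs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
a = math.exp(my - slope * mx)
b = -slope  # power-law exponent

def predict_loss(compute: float) -> float:
    """Extrapolate loss to a larger budget (assumes the same regime holds)."""
    return IRREDUCIBLE + a * compute ** (-b)

print(f"fit: L(C) ~ {IRREDUCIBLE} + {a:.2f} * C^-{b:.3f}")
print(f"forecast at C = 1e4: {predict_loss(1e4):.2f}")
```

The point is not the specific constants but the workflow: fit small runs, then budget large ones against the extrapolated curve.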

The compute–data–parameters triangle

Given a compute budget, you can spend it on:

More parameters (bigger model).

More tokens (more data).

More training steps (e.g., additional passes over the same data).

In many regimes there is a compute-optimal frontier: an undertrained model that is too large, or an overtrained model that is too small, wastes compute relative to a balanced allocation.
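One widely cited rule of thumb, from the Chinchilla line of work, approximates training compute as C ≈ 6·N·D (N parameters, D tokens) and suggests roughly 20 training tokens per parameter. A minimal sketch of that allocation, treating both constants as rough heuristics rather than exact values:

```python
def compute_optimal_allocation(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget between parameter count N and token count D.

    Uses the common approximation C ~ 6 * N * D together with the
    Chinchilla-style heuristic D ~ 20 * N. Both are rules of thumb,
    not exact constants, and only apply within the fitted regime.
    """
    # C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r))
    n_params = (flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: allocate a 1e23 FLOPs budget.
n, d = compute_optimal_allocation(1e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Doubling the budget under this heuristic increases both N and D by about √2, which is why "just make it bigger" without also scaling data drifts off the frontier.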

Capability emergence and thresholds

Some behaviors look like they “suddenly appear” at scale (tool use, multi-step planning, code synthesis). Often the underlying improvement is continuous, but our evaluation is discrete: a task is either passed or failed, so small continuous gains can cross a threshold.
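A small simulation makes this concrete (the accuracy curve below is entirely made up for illustration): per-step accuracy improves smoothly with scale, but a task that requires many steps to all succeed, graded pass/fail, appears to switch on abruptly.

```python
def per_step_accuracy(log_scale: float) -> float:
    """Hypothetical smooth trend: accuracy climbs from 0.5 toward 1.0."""
    return 1.0 - 0.5 * (0.7 ** log_scale)

K = 20  # number of chained steps the task requires; all must succeed

for log_scale in range(0, 9, 2):
    p = per_step_accuracy(log_scale)
    task = p ** K  # probability the whole chain succeeds
    verdict = "PASS" if task > 0.5 else "FAIL"
    print(f"scale 10^{log_scale}: step acc {p:.3f}, task success {task:.3f} -> {verdict}")
```

Step accuracy moves gradually from 0.50 to about 0.97, yet the pass/fail verdict flips only at the largest scale: a continuous improvement, read through a discrete metric, looks like emergence.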

What scaling laws don’t tell you

They don’t guarantee alignment.

They don’t define the best architecture forever.

They don’t eliminate the need for data quality work or careful evaluation.

They don’t fully capture inference-time constraints like latency and context length.

Connections

Scaling interacts with optimization stability (see Optimization), and it changes the “surface area” of post-training (see RLHF and Post-Training). Scaling also influenced organizational strategy and splits across labs (see OpenAI and New Labs).