Transformers: The Architecture That Changed NLP (and Then Everything Else)

The Transformer replaced recurrence with attention, then proved unusually friendly to scaling, parallelism, and transfer. This post builds the “mental model” that helps everything else in this blog click into place.

What problem did Transformers solve?

Pre-Transformer sequence models (RNNs, LSTMs, GRUs) processed tokens sequentially. That made training and inference hard to parallelize, and long-range dependencies were difficult to learn reliably. Attention mechanisms helped by letting the model “look back” at relevant tokens, but the early recipes still wrapped attention around a recurrent core.

The key move: self-attention everywhere

Transformers make self-attention the primary operation. Each token is projected into three vectors: a query (Q), a key (K), and a value (V). Attention weights come from the scaled dot-product similarity between queries and keys, and each token's output is the correspondingly weighted mix of values. Multi-head attention repeats this in parallel subspaces so the model can capture different relations simultaneously.
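To make the Q/K/V mechanics concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The function name and toy shapes are illustrative, not from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single head: weights from Q-K similarity, output is a mix of V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) all-pairs similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                            # weighted mix of values

# Toy example: 4 tokens, 8-dimensional head
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one mixed value vector per token
```

Each row of the softmaxed weight matrix sums to 1, so every token's output is a convex combination of all value vectors.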

Why positional encoding exists

Pure attention is permutation-invariant: without extra information, it can’t tell whether “A then B” differs from “B then A.” Positional encodings inject ordering. Classical Transformers used sinusoidal encodings; modern LLMs often use learned or relative schemes (e.g., RoPE) to improve extrapolation and stability.
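The classical sinusoidal scheme can be written in a few lines. This is a sketch of the formulation from the original paper; the helper name is mine:

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Classic sin/cos positional encodings (d_model assumed even)."""
    pos = np.arange(n_positions)[:, None]        # (n, 1) position indices
    i = np.arange(d_model // 2)[None, :]         # (1, d/2) frequency indices
    angles = pos / (10000 ** (2 * i / d_model))  # geometric frequency ladder
    enc = np.zeros((n_positions, d_model))
    enc[:, 0::2] = np.sin(angles)                # even dims: sine
    enc[:, 1::2] = np.cos(angles)                # odd dims: cosine
    return enc

pe = sinusoidal_positions(32, 16)
print(pe.shape)  # (32, 16)
```

These vectors are simply added to the token embeddings, giving attention a way to distinguish "A then B" from "B then A."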

Encoder-decoder vs decoder-only

Original Transformers had:

Encoder: reads an input sequence and builds contextual representations.

Decoder: generates output tokens with masked self-attention (can’t see the future) plus cross-attention to the encoder.

LLMs like GPT are typically decoder-only: they predict the next token using masked (causal) self-attention, with no encoder or cross-attention at all. They can still do translation or summarization by placing the task in a prompt prefix ("instruction + input") and generating the continuation.
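The "can't see the future" rule is just a lower-triangular mask applied before the softmax. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(causal_mask(len(Q)), scores, -np.inf)  # hide the future
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                      # exp(-inf) = 0: future gets no weight
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = masked_attention(Q, K, V)
# Token 0 can only attend to itself, so its output is exactly V[0].
print(np.allclose(out[0], V[0]))  # True
```

At training time this lets the model learn next-token prediction for every position in parallel, since each position's view of the sequence is already restricted to its past.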

Why Transformers scale well

Several properties made Transformers ideal for industrial-scale training:

Parallelizable training over sequence positions (no recurrence).

Uniform blocks that stack cleanly (depth scaling).

Strong transfer: the same architecture works for text, code, images (with tokenization), audio, and multimodal mixtures.

The two big costs

Self-attention is typically O(n^2) in sequence length: the all-pairs interaction matrix makes both compute and memory grow quadratically, which dominates the cost of long contexts. The industry response has been a mix of engineering and research: KV-caching, FlashAttention, better kernels, chunking, sparse or linear attention variants, and architectural hybrids.
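KV-caching is the simplest of these responses to sketch: during autoregressive decoding, the keys and values of past tokens are stored so each new token computes only one row of the attention matrix instead of rebuilding it. A toy sketch, with my own function and cache names:

```python
import numpy as np

def decode_step(q_new, k_new, v_new, cache):
    """One autoregressive step with a KV cache: O(t) work per new token
    instead of recomputing the full O(t^2) attention matrix."""
    cache["K"].append(k_new)                  # store this token's key
    cache["V"].append(v_new)                  # and value for future steps
    K = np.stack(cache["K"])                  # (t, d) all keys seen so far
    V = np.stack(cache["V"])
    scores = K @ q_new / np.sqrt(len(q_new))  # (t,) one row, not t rows
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # softmax over cached positions
    return w @ V

cache = {"K": [], "V": []}
rng = np.random.default_rng(0)
for _ in range(5):
    q, k, v = (rng.normal(size=8) for _ in range(3))
    out = decode_step(q, k, v, cache)
print(out.shape)  # (8,): attention output for the newest token
```

The trade-off is visible in the cache itself: per-step compute drops to linear, but memory still grows with context length, which is exactly what techniques like FlashAttention and sparse variants go after.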

Where this connects next

If you want the historical bridge into Transformers, read Sequence-to-Sequence. If you want the “product behavior” bridge, read ChatGPT. For how models get shaped after pretraining, read RLHF and Post-Training.