What problem did Transformers solve?
Pre-Transformer sequence models (RNNs, LSTMs, GRUs) processed tokens sequentially. That made training and inference hard to parallelize, and long-range dependencies were difficult to learn reliably. Attention mechanisms helped by letting the model “look back” at relevant tokens, but the early recipes still wrapped attention around a recurrent core.
The key move: self-attention everywhere
Transformers make self-attention the primary operation. Each token produces three vectors:
Q (query), K (key), V (value). Attention weights come from scaled dot products between queries and keys, normalized with a softmax, and the output is a weighted mix of V. Multi-head attention repeats this in parallel subspaces so the model can represent different relations simultaneously.
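A minimal single-head sketch of the computation above, in NumPy (shapes and the random inputs are illustrative, not from any particular model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices for one head.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # all-pairs Q-K similarity, (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mix of values

n, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Multi-head attention would run several such computations in parallel on smaller projected subspaces and concatenate the results.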
Why positional encoding exists
Pure attention is permutation-invariant: without extra information, it can’t tell whether “A then B” differs from “B then A.” Positional encodings inject ordering. Classical Transformers used sinusoidal encodings; modern LLMs often use learned or relative schemes (e.g., RoPE) to improve extrapolation and stability.
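The classic sinusoidal scheme can be sketched in a few lines: even dimensions get sines and odd dimensions get cosines, at geometrically spaced frequencies (sizes here are arbitrary examples):

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    # Sinusoidal positional encoding: each position gets a unique
    # pattern of sin/cos values across geometrically spaced frequencies.
    pos = np.arange(n_positions)[:, None]          # (n, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d/2)
    freqs = pos / (10000 ** (2 * i / d_model))     # (n, d/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(freqs)                    # even dimensions
    pe[:, 1::2] = np.cos(freqs)                    # odd dimensions
    return pe

pe = sinusoidal_positions(128, 64)
print(pe.shape)  # (128, 64)
```

These vectors are added to the token embeddings before the first attention layer, so attention scores can depend on position as well as content.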
Encoder-decoder vs decoder-only
The original Transformer was an encoder-decoder model:
Encoder: reads an input sequence and builds contextual representations.
Decoder: generates output tokens with masked self-attention (can’t see the future) plus cross-attention to the encoder.
LLMs like GPT are typically decoder-only: they predict the next token, using masked self-attention. They can still do translation/summarization by placing a prompt prefix (“instruction + input”) and generating a continuation.
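The "can't see the future" constraint is implemented with a causal mask: future positions get a score of negative infinity before the softmax, so their attention weight becomes zero. A minimal sketch (random inputs are illustrative):

```python
import numpy as np

def causal_attention(Q, K, V):
    # Masked self-attention: position i may only attend to positions <= i.
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)          # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax; masked entries -> 0
    return weights @ V

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out = causal_attention(Q, K, V)
# The first position can only attend to itself, so its output is exactly V[0].
print(np.allclose(out[0], V[0]))  # True
```

During training this lets the model predict every next token in a sequence in parallel, since each position's prediction only uses earlier tokens.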
Why Transformers scale well
Several properties made Transformers ideal for industrial-scale training:
Parallelizable training over sequence positions (no recurrence).
Uniform blocks that stack cleanly (depth scaling).
Strong transfer: the same architecture works for text, code, images (with tokenization), audio, and multimodal mixtures.
The two big costs
Attention is typically O(n^2) in sequence length due to the all-pairs interaction matrix. That drives memory and compute cost for long contexts. The industry response has been a mix of engineering and research: KV-caching, FlashAttention, better kernels, chunking, sparse or linear attention variants, and architectural hybrids.
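KV-caching is the simplest of these optimizations to see concretely: during generation, each token's key and value vectors are computed once and appended to a cache, so each new token costs one attention row (O(n)) instead of recomputing the full O(n²) matrix. A toy single-head sketch with made-up projection weights:

```python
import numpy as np

rng = np.random.default_rng(2)
d_k = 8

# Hypothetical projection weights for one attention head.
Wq, Wk, Wv = (rng.standard_normal((d_k, d_k)) for _ in range(3))

def attend(q, K, V):
    # One new query against all cached keys/values: O(n) per generated token.
    scores = (q @ K.T) / np.sqrt(d_k)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache, V_cache = np.empty((0, d_k)), np.empty((0, d_k))
for step in range(6):
    x = rng.standard_normal(d_k)              # embedding of the newest token
    # Compute this token's key/value once; earlier cache entries are reused.
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    out = attend(x @ Wq, K_cache, V_cache)

print(K_cache.shape)  # (6, 8)
```

The memory cost of the cache itself (layers × heads × sequence length × head dimension) is why long contexts remain expensive even with caching.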
Where this connects next
If you want the historical bridge into Transformers, read Sequence-to-Sequence. If you want the “product behavior” bridge, read ChatGPT. For how models get shaped after pretraining, read RLHF and Post-Training.