Transformers: The Architecture That Changed NLP (and Then Everything Else)

The Transformer replaced recurrence with attention, then proved unusually friendly to scaling, parallelism, and transfer. This post builds the “mental model” that helps everything else in this blog click into place.

What problem did Transformers solve?

Pre-Transformer sequence models (RNNs, LSTMs, GRUs) processed tokens sequentially. That made training and inference hard to parallelize, and long-range dependencies were difficult to learn reliably. Attention mechanisms helped by letting the model “look back” at relevant tokens, but the early recipes still wrapped attention around a recurrent core.

The key move: self-attention everywhere

Transformers make self-attention the primary operation. Each token is projected into three vectors: a query (Q), a key (K), and a value (V). Attention weights come from the scaled dot-product similarity between queries and keys, and each token's output is the correspondingly weighted mix of values. Multi-head attention repeats this in parallel subspaces so the model can capture different relations simultaneously.
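To make the Q/K/V mechanics concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The function name and toy shapes are illustrative, not from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single head: weights from Q-K similarity, output is a mix of V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) all-pairs similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                            # weighted mix of values

# Toy example: 4 tokens, 8-dimensional head
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one mixed value vector per token
```

Each row of the softmaxed weight matrix sums to 1, so every token's output is a convex combination of all value vectors.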

Why positional encoding exists

Pure attention is permutation-invariant: without extra information, it can’t tell whether “A then B” differs from “B then A.” Positional encodings inject ordering. Classical Transformers used sinusoidal encodings; modern LLMs often use learned or relative schemes (e.g., RoPE) to improve extrapolation and stability.
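The classical sinusoidal scheme can be written in a few lines. This is a sketch of the formulation from the original paper; the helper name is mine:

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Classic sin/cos positional encodings (d_model assumed even)."""
    pos = np.arange(n_positions)[:, None]        # (n, 1) position indices
    i = np.arange(d_model // 2)[None, :]         # (1, d/2) frequency indices
    angles = pos / (10000 ** (2 * i / d_model))  # geometric frequency ladder
    enc = np.zeros((n_positions, d_model))
    enc[:, 0::2] = np.sin(angles)                # even dims: sine
    enc[:, 1::2] = np.cos(angles)                # odd dims: cosine
    return enc

pe = sinusoidal_positions(32, 16)
print(pe.shape)  # (32, 16)
```

These vectors are simply added to the token embeddings, giving attention a way to distinguish "A then B" from "B then A."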

Encoder-decoder vs decoder-only

Original Transformers had:

Encoder: reads an input sequence and builds contextual representations.

Decoder: generates output tokens with masked self-attention (can’t see the future) plus cross-attention to the encoder.

LLMs like GPT are typically decoder-only: they predict the next token using masked (causal) self-attention, with no encoder or cross-attention at all. They can still do translation or summarization by placing the task in a prompt prefix ("instruction + input") and generating the continuation.
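The "can't see the future" rule is just a lower-triangular mask applied before the softmax. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(causal_mask(len(Q)), scores, -np.inf)  # hide the future
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                      # exp(-inf) = 0: future gets no weight
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = masked_attention(Q, K, V)
# Token 0 can only attend to itself, so its output is exactly V[0].
print(np.allclose(out[0], V[0]))  # True
```

At training time this lets the model learn next-token prediction for every position in parallel, since each position's view of the sequence is already restricted to its past.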

Why Transformers scale well

Several properties made Transformers ideal for industrial-scale training:

Parallelizable training over sequence positions (no recurrence).

Uniform blocks that stack cleanly (depth scaling).

Strong transfer: the same architecture works for text, code, images (with tokenization), audio, and multimodal mixtures.

The two big costs

Self-attention is typically O(n^2) in sequence length: the all-pairs interaction matrix makes both compute and memory grow quadratically, which dominates the cost of long contexts. The industry response has been a mix of engineering and research: KV-caching, FlashAttention, better kernels, chunking, sparse or linear attention variants, and architectural hybrids.
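KV-caching is the simplest of these responses to sketch: during autoregressive decoding, the keys and values of past tokens are stored so each new token computes only one row of the attention matrix instead of rebuilding it. A toy sketch, with my own function and cache names:

```python
import numpy as np

def decode_step(q_new, k_new, v_new, cache):
    """One autoregressive step with a KV cache: O(t) work per new token
    instead of recomputing the full O(t^2) attention matrix."""
    cache["K"].append(k_new)                  # store this token's key
    cache["V"].append(v_new)                  # and value for future steps
    K = np.stack(cache["K"])                  # (t, d) all keys seen so far
    V = np.stack(cache["V"])
    scores = K @ q_new / np.sqrt(len(q_new))  # (t,) one row, not t rows
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # softmax over cached positions
    return w @ V

cache = {"K": [], "V": []}
rng = np.random.default_rng(0)
for _ in range(5):
    q, k, v = (rng.normal(size=8) for _ in range(3))
    out = decode_step(q, k, v, cache)
print(out.shape)  # (8,): attention output for the newest token
```

The trade-off is visible in the cache itself: per-step compute drops to linear, but memory still grows with context length, which is exactly what techniques like FlashAttention and sparse variants go after.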

Where this connects next

If you want the historical bridge into Transformers, read Sequence-to-Sequence. If you want the “product behavior” bridge, read ChatGPT. For how models get shaped after pretraining, read RLHF and Post-Training.