The original seq2seq recipe
Early seq2seq systems used an encoder RNN to read an input sequence into a fixed-size state, and a decoder RNN to generate an output sequence from that state. This worked, but compressing an entire input into a single fixed-size vector created an information bottleneck, especially for long inputs.
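A minimal sketch of the recipe, using a toy vanilla tanh RNN with random weights (the dimensions and weight setup are illustrative assumptions, not from any particular system):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size (illustrative assumption)

# Random weights for a hypothetical vanilla tanh RNN cell.
W_in, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def rnn_step(h, x):
    return np.tanh(W_in @ x + W_h @ h)

def encode(inputs):
    """Fold the whole input sequence into one fixed-size state vector."""
    h = np.zeros(d)
    for x in inputs:
        h = rnn_step(h, x)
    return h  # the bottleneck: this is all the decoder gets to see

def decode(h, steps):
    """Unroll the decoder starting from the single encoder state."""
    outputs, x = [], np.zeros(d)
    for _ in range(steps):
        h = rnn_step(h, x)
        outputs.append(h.copy())
        x = h  # feed the previous state back in (toy greedy decoding)
    return outputs

state = encode([rng.normal(size=d) for _ in range(5)])
outs = decode(state, steps=3)
```

Note that `encode` returns a single `d`-dimensional vector no matter how long the input is; that fixed capacity is exactly the bottleneck the next section addresses.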
Attention: the bottleneck breaker
Attention allowed the decoder to reference encoder states directly at each output step. Instead of relying on a single compressed vector, the decoder could dynamically weight different input positions.
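One attention step can be sketched as a softmax-weighted mixture of encoder states, scored against the current decoder state (dot-product scoring here is one common choice; the shapes are toy assumptions):

```python
import numpy as np

def attend(query, encoder_states):
    """One attention step: weight encoder states by similarity to the query."""
    scores = encoder_states @ query            # one score per input position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over input positions
    context = weights @ encoder_states         # dynamically weighted mixture
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))   # 5 input positions, hidden size 8 (toy numbers)
query = rng.normal(size=8)      # current decoder state
context, weights = attend(query, enc)
```

The decoder now receives `context`, a fresh summary recomputed at every output step, instead of a single vector fixed once at encoding time.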
Transformers as “seq2seq without recurrence”
The Transformer keeps the encoder/decoder idea but replaces RNNs with self-attention blocks. This makes training dramatically more parallelizable and tends to work better at scale. See Transformers for the architectural core.
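The core replacement can be sketched as scaled dot-product self-attention: every position attends to every other in one batched matrix computation, with no sequential recurrence (single head, no mask, toy random projections assumed for illustration):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over all positions at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # all pairs of positions, in parallel
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                        # no step-by-step recurrence needed

rng = np.random.default_rng(0)
T, d = 4, 8                                   # toy sequence length and width
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

Because the whole sequence is processed as one matrix product rather than a loop over time steps, training parallelizes across positions, which is the practical win over RNNs.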
Decoder-only models reframe seq2seq
GPT-style models often implement seq2seq via prompting: concatenate instruction + input + delimiter, then generate the output continuation. In this view, the “encoder” is the prefix context inside the same decoder-only network.
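A sketch of the reframing, using a hypothetical prompt format (the exact delimiters and field names are assumptions; real systems vary):

```python
def build_prompt(instruction, source):
    """Frame a seq2seq task as a single continuation prompt (hypothetical format)."""
    return f"{instruction}\n\nInput: {source}\nOutput:"

prompt = build_prompt("Translate English to French.", "The cat sat on the mat.")
# A decoder-only model would now generate the translation as the continuation;
# everything before "Output:" plays the role the encoder played in classic seq2seq.
```

The same network both "reads" the prefix and "writes" the continuation; the encoder/decoder split becomes a convention about which part of the context is input and which part is generated.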
Why this paradigm keeps reappearing
Many tasks become "generate the right continuation" once you fix a representation for inputs and outputs. That includes:
Text and code generation.
Image generation (as diffusion or as discrete tokens).
Multimodal assistants (interleaving text with other modalities).
Where post-training fits
Pretraining teaches broad next-token prediction. Post-training changes which continuations are preferred given instructions, user goals, and safety constraints. The modern story is covered in RLHF and Post-Training.