The original seq2seq recipe
Early seq2seq systems used an encoder RNN to read an input sequence into a fixed-size state, and a decoder RNN to generate an output sequence from that state. This worked, but compressing an entire input into a single fixed-size vector created an information bottleneck, especially for long inputs.
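A minimal sketch of the recipe, using a toy vanilla tanh RNN with random weights (the dimensions and weight setup are illustrative assumptions, not from any particular system):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size (illustrative assumption)

# Random weights for a hypothetical vanilla tanh RNN cell.
W_in, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def rnn_step(h, x):
    return np.tanh(W_in @ x + W_h @ h)

def encode(inputs):
    """Fold the whole input sequence into one fixed-size state vector."""
    h = np.zeros(d)
    for x in inputs:
        h = rnn_step(h, x)
    return h  # the bottleneck: this is all the decoder gets to see

def decode(h, steps):
    """Unroll the decoder starting from the single encoder state."""
    outputs, x = [], np.zeros(d)
    for _ in range(steps):
        h = rnn_step(h, x)
        outputs.append(h.copy())
        x = h  # feed the previous state back in (toy greedy decoding)
    return outputs

state = encode([rng.normal(size=d) for _ in range(5)])
outs = decode(state, steps=3)
```

Note that `encode` returns a single `d`-dimensional vector no matter how long the input is; that fixed capacity is exactly the bottleneck the next section addresses.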
Attention: the bottleneck breaker
Attention allowed the decoder to reference encoder states directly at each output step. Instead of relying on a single compressed vector, the decoder could dynamically weight different input positions.
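One attention step can be sketched as a softmax-weighted mixture of encoder states, scored against the current decoder state (dot-product scoring here is one common choice; the shapes are toy assumptions):

```python
import numpy as np

def attend(query, encoder_states):
    """One attention step: weight encoder states by similarity to the query."""
    scores = encoder_states @ query            # one score per input position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over input positions
    context = weights @ encoder_states         # dynamically weighted mixture
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))   # 5 input positions, hidden size 8 (toy numbers)
query = rng.normal(size=8)      # current decoder state
context, weights = attend(query, enc)
```

The decoder now receives `context`, a fresh summary recomputed at every output step, instead of a single vector fixed once at encoding time.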
Transformers as “seq2seq without recurrence”
The Transformer keeps the encoder/decoder idea but replaces RNNs with self-attention blocks. This makes training dramatically more parallelizable and tends to work better at scale. See Transformers for the architectural core.
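The core replacement can be sketched as scaled dot-product self-attention: every position attends to every other in one batched matrix computation, with no sequential recurrence (single head, no mask, toy random projections assumed for illustration):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over all positions at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # all pairs of positions, in parallel
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                        # no step-by-step recurrence needed

rng = np.random.default_rng(0)
T, d = 4, 8                                   # toy sequence length and width
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

Because the whole sequence is processed as one matrix product rather than a loop over time steps, training parallelizes across positions, which is the practical win over RNNs.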
Decoder-only models reframe seq2seq
GPT-style models often implement seq2seq via prompting: concatenate instruction + input + delimiter, then generate the output continuation. In this view, the “encoder” is the prefix context inside the same decoder-only network.
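A sketch of the reframing, using a hypothetical prompt format (the exact delimiters and field names are assumptions; real systems vary):

```python
def build_prompt(instruction, source):
    """Frame a seq2seq task as a single continuation prompt (hypothetical format)."""
    return f"{instruction}\n\nInput: {source}\nOutput:"

prompt = build_prompt("Translate English to French.", "The cat sat on the mat.")
# A decoder-only model would now generate the translation as the continuation;
# everything before "Output:" plays the role the encoder played in classic seq2seq.
```

The same network both "reads" the prefix and "writes" the continuation; the encoder/decoder split becomes a convention about which part of the context is input and which part is generated.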
Why this paradigm keeps reappearing
Many tasks become "generate the right continuation" once you fix a representation for inputs and outputs. That includes:
Text and code generation.
Image generation (as diffusion or as discrete tokens).
Multimodal assistants (interleaving text with other modalities).
Where post-training fits
Pretraining teaches broad next-token prediction. Post-training changes which continuations are preferred given instructions, user goals, and safety constraints. The modern story is covered in RLHF and Post-Training.