Two big approaches: discrete tokens vs diffusion
Early systems generated images as sequences of discrete tokens, an explicit sequence-to-sequence framing (see Sequence-to-Sequence). Diffusion models instead learn to reverse a noising process, iteratively refining pure noise into a sample consistent with the text prompt.
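The iterative-refinement idea can be shown in a toy sketch. This is a deterministic DDIM-style reverse loop where, as an illustrative assumption, the "noise predictor" is replaced by the true noise (a trained network conditioned on the prompt would play that role); the schedule, step count, and array sizes are arbitrary choices for the demo, not from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (toy choice)
alpha_bars = np.cumprod(1.0 - betas)

x0 = rng.normal(size=(8, 8))         # stand-in "image"
eps = rng.normal(size=x0.shape)      # the noise added in the forward process

# Forward process: jump straight to the fully noised x_T.
x = np.sqrt(alpha_bars[-1]) * x0 + np.sqrt(1 - alpha_bars[-1]) * eps

# Reverse process: at each step, predict the clean image from the current
# noisy one, then re-noise to the previous (less noisy) level.
for t in reversed(range(T)):
    eps_hat = eps                    # stand-in for model(x, t, prompt)
    x0_pred = (x - np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    if t > 0:
        x = np.sqrt(alpha_bars[t - 1]) * x0_pred + np.sqrt(1 - alpha_bars[t - 1]) * eps_hat
    else:
        x = x0_pred                  # final step returns the denoised sample

print(np.abs(x - x0).max())          # with a perfect predictor, ~0
```

Because the stand-in predictor is exact, the loop recovers `x0` up to floating-point error; with a learned predictor the sample instead lands near the data distribution, steered by the conditioning.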
Text conditioning
Regardless of generator type, the model typically encodes the text prompt and uses the resulting features to steer image generation. Conditioning can happen via cross-attention, concatenation, or other mechanisms that couple text features to visual features.
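Cross-attention is the most common of these coupling mechanisms: visual features form the queries, text-encoder outputs form the keys and values, so each image location draws in information from relevant prompt tokens. A minimal single-head sketch, with all shapes and dimensions chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                        # shared feature dimension (toy choice)
n_img, n_txt = 64, 8          # e.g. 8x8 image patches, 8 text tokens

img_feats = rng.normal(size=(n_img, d))   # visual features -> queries
txt_feats = rng.normal(size=(n_txt, d))   # text-encoder outputs -> keys/values

Wq = rng.normal(size=(d, d)) / np.sqrt(d)
Wk = rng.normal(size=(d, d)) / np.sqrt(d)
Wv = rng.normal(size=(d, d)) / np.sqrt(d)

Q, K, V = img_feats @ Wq, txt_feats @ Wk, txt_feats @ Wv
scores = Q @ K.T / np.sqrt(d)             # (n_img, n_txt) affinities

# Softmax over the text tokens: each image position gets a distribution
# over prompt tokens to attend to.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V                         # text-conditioned visual update
print(out.shape)                          # (64, 16)
```

In a real model this output is added back into the visual stream at several layers, so the text signal influences every stage of refinement rather than just the input.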
Why the product impact was immediate
Text-to-image generation collapsed a complex creative workflow into a natural-language interface. It also created new issues: style imitation, copyright questions, watermarking, and safety concerns around harmful or deceptive imagery.
Connections
Multimodality connects back to Transformer scaling and deployment (see Transformers and OpenAI) and forward to multimodal assistants as products (see Mira Murati).