Most models from earlier days are discriminative: given an image, decide cat vs. dog; given a sentence, predict the next token. Today flips the question to something fundamentally different, generative modeling: not judging existing data, but conjuring up brand-new samples that never existed yet look real. This is the core behind Stable Diffusion painting images and Sora generating video. We'll walk the 2013–2022 evolution: GAN → VAE → Diffusion → Flow Matching, where each step is presented as a response to a specific flaw of the one before. Understand it and you'll see "image generation" isn't magic, but four different engineering answers to "how to transport random noise into real data."
A GAN is a counterfeiter vs. bill-detector arms race. Closer to your backend world: a fuzzer (generator) keeps crafting malformed inputs to fool a set of assertion checks (discriminator); each time the checker gets fooled it upgrades its rules, forcing the fuzzer to fake even more convincingly. The two pressure each other and co-evolve, until the fuzzer's "fakes" are so realistic the checker can only guess — at which point the generator has learned the true data distribution.
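The fundamental difficulty of generation: the probability distribution p(x) of real data (say all "human faces") is wildly complex and has no explicit formula. The genius of the GAN (Goodfellow et al. 2014) is to sidestep it. Instead of modeling p(x) directly, train two networks against each other: a generator G that maps random noise z to a fake sample G(z), and a discriminator D that outputs the probability that its input is real rather than generated.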
Key insight: the discriminator is itself a "learnable loss function." A hand-designed loss (like pixel-wise MSE) forces the model to produce blurry average faces; D, by contrast, keeps getting smarter as G improves, providing an always "just-right" training signal. The two play a minimax game:
That minimax objective written out:
min_G max_D  E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 − D(G(z)))]
Unpacked: term one — D wants D(x) → 1 (real judged real); term two — D wants D(G(z)) → 0 (fake judged fake), while G wants the opposite, → 1 (fool succeeds). Both fight over the max/min of the same expression — that's the mathematical meaning of "adversarial."
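To make the two terms concrete, here is a minimal sketch in plain Python that evaluates the objective on hand-picked discriminator scores. The function name `gan_value` and the score lists are illustrative assumptions, not part of any library; the scores stand in for what D would output on a batch.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    term_real = sum(math.log(d) for d in d_real) / len(d_real)
    term_fake = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return term_real + term_fake

# A sharp discriminator: reals scored near 1, fakes near 0, so the value is close to 0.
sharp = gan_value([0.99, 0.98], [0.01, 0.02])
# A fooled discriminator that can only guess: every score is 0.5, so the value is -log 4.
fooled = gan_value([0.5, 0.5], [0.5, 0.5])
print(sharp, fooled)
```

D pushes the value up toward 0, while G drags it down toward −log 4, the value reached when D is reduced to guessing.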
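```python
import torch
import torch.nn as nn

# Generator: 100-dim noise -> 784-dim sample in [-1, 1] (e.g. a flattened 28x28 image)
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
# Discriminator: 784-dim sample -> probability that it is real
D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
bce = nn.BCELoss()  # binary cross-entropy: real=1, fake=0
# Optimizer choice and hyperparameters are assumptions; the original snippet left them undefined.
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

for real in dataloader:  # real: a batch of real images (B, 784); dataloader is defined elsewhere
    B = real.size(0)
    ones, zeros = torch.ones(B, 1), torch.zeros(B, 1)  # target labels for real / fake
    z = torch.randn(B, 100)
    fake = G(z)
    # 1) train D: real→real, fake→fake (detach cuts G's gradient)
    loss_D = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # 2) train G: make D rate fakes as real (label deliberately = 1)
    loss_G = bce(D(fake), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```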
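One detail of the loop is worth a closer look: the generator step uses `bce(D(fake), ones)`, which is −log D(G(z)), not the log(1 − D(G(z))) term from the minimax formula. This variant is commonly called the non-saturating loss. The sketch below compares the gradient of each loss with respect to the discriminator's pre-sigmoid logit; the name `a` for that logit and the value −6 are assumptions made for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)): the generator loss taken literally from the minimax objective."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)): the loss that bce(D(fake), ones) computes."""
    return -(1.0 - sigmoid(a))

# Early in training D confidently rejects fakes, so the logit is very negative.
a = -6.0  # hypothetical logit; D(G(z)) = sigmoid(-6) is roughly 0.0025
print(grad_saturating(a), grad_non_saturating(a))
```

When D easily spots fakes, the literal minimax loss gives G almost no gradient, while the label-flipped version keeps the gradient magnitude near 1. That is one plausible reading of why the loop labels fakes as 1 instead of negating the discriminator loss.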
A VAE is a lossy codec with built-in uncertainty. A plain autoencoder is like JPEG: compress an image into a fixed latent code, then decompress. But a VAE encodes an image not into a point but into a small probabilistic cloud (a Gaussian) — like storing a database record as an "index with error bars" rather than exact coordinates. This one change makes the whole latent space continuous and hole-free: sample any point in it, decode, and you get a fresh, plausible image.
A plain autoencoder can compress and reconstruct but can't generate: its latent space is "full of holes." Sample randomly between two known points and the decode is often noise garbage, because it was never asked to "fill" the latent space. The VAE (Kingma & Welling 2013) fixes this by forcing the latent variable to follow a standard normal, achieved by two forces in tension: a reconstruction term that pushes the decoder output to match the input, and a KL term that pulls each encoded Gaussian toward N(0, 1).
The reconstruction loss plus the KL term is the negative of the famous ELBO (Evidence Lower Bound), so minimizing that sum maximizes the ELBO, which is a lower bound on the data log-likelihood. But there's an engineering obstacle: "sampling" is not differentiable, so gradients can't pass through the random node back to the encoder. The VAE's killer move is the reparameterization trick: instead of sampling z directly from N(μ, σ²), draw ε ~ N(0, 1) and compute z = μ + σ·ε.
Intuition: randomness is the source of "generation diversity," but it blocks gradients. Reparameterization peels the random part (ε) off as an external input, leaving μ and σ entirely on the differentiable trunk — keeping randomness yet letting gradients flow. This trick later reappears across reinforcement learning and diffusion models.
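To see gradients actually flowing through the sample, here is a small sketch that uses no autograd at all. It assumes a toy objective f(z) = z² (chosen only because its expectation has a known formula) and applies the chain rule by hand through z = μ + σ·ε. The function name `pathwise_grads` is made up for this example.

```python
import random

def pathwise_grads(mu, sigma, n_samples=20000, seed=0):
    """Estimate d/dmu and d/dsigma of E[z^2], z ~ N(mu, sigma^2), via z = mu + sigma * eps."""
    rng = random.Random(seed)
    g_mu = g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # randomness enters as an external input
        z = mu + sigma * eps        # deterministic and differentiable in mu and sigma
        dz = 2.0 * z                # derivative of the toy objective f(z) = z^2
        g_mu += dz * 1.0            # chain rule: dz/dmu = 1
        g_sigma += dz * eps         # chain rule: dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

g_mu, g_sigma = pathwise_grads(mu=1.5, sigma=0.5)
# Analytically E[z^2] = mu^2 + sigma^2, so the exact gradients are 2*mu = 3.0 and 2*sigma = 1.0.
print(g_mu, g_sigma)
```

The estimates should land close to the analytic values, which is the point: once ε is an input, μ and σ sit on an ordinary differentiable path.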
```python
import torch
import torch.nn.functional as F

def vae_step(x, encoder, decoder):
    # encoder emits μ and log(σ²), concatenated along the last dimension
    mu, logvar = encoder(x).chunk(2, dim=-1)
    # reparameterize: z = μ + σ·ε, σ = exp(½·logvar)
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)  # ε ~ N(0, 1), randomness externalized
    z = mu + std * eps
    x_hat = decoder(z)
    # negative ELBO = reconstruction loss + KL (pull latent toward N(0, 1))
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing it = maximizing the ELBO

# generate new samples: z = torch.randn(...); decoder(z); no encoder needed
```
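The `kl` line is the closed-form KL divergence between the encoder's Gaussian and N(0, 1), summed over latent dimensions. A one-dimensional sketch makes its behavior easy to check by hand; the helper name `kl_to_standard_normal` is an assumption for this example.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ) for one latent dimension, with logvar = log(sigma^2).

    Same expression as -0.5 * (1 + logvar - mu^2 - exp(logvar)), with the sign distributed.
    """
    return 0.5 * (mu ** 2 + math.exp(logvar) - logvar - 1.0)

print(kl_to_standard_normal(0.0, 0.0))   # 0.0: already a standard normal, no penalty
print(kl_to_standard_normal(1.0, 0.0))   # 0.5: mean shifted by one unit
print(kl_to_standard_normal(0.0, -4.0))  # larger: a near-point encoding (sigma about 0.135) is penalized
```

The last case is the "hole-filling" pressure in action: an encoder that tries to shrink its cloud to a point pays a growing KL cost.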
Diffusion breaks the hard problem of "one-shot generation" into hundreds of tiny denoising corrections — the same wisdom as "splitting one big transaction into idempotent small steps that gradually converge to the target state" in distributed systems. Concretely: the forward process is like letting a sharp photo slowly "rot" into noise, adding a little Gaussian noise per step until it's pure static; the reverse process trains a network to play that rotting backwards step by step, gradually "developing" an image out of pure noise.
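GANs are unstable and prone to mode collapse; VAEs tend to produce blurry samples. Diffusion (Ho, Jain & Abbeel 2020, i.e. DDPM) addresses both with a strikingly plain idea: break generation into a huge number of extremely simple subtasks. It has two processes: a fixed forward process that adds a little Gaussian noise at each step until the image is pure noise, and a learned reverse process in which a network undoes that noise one step at a time.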
The loss is simplified to the extreme — just a noise regression:
L = ‖ ε − εθ(x_t, t) ‖²
where ε is the actual noise added during the forward pass (you added it, so it's a known answer) and εθ is the network's prediction. This is plain mean-squared-error supervised learning: no adversary, no KL balancing, and training that is far more stable in practice. That stability is a major reason diffusion overtook GANs for image generation.
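The loss needs a noisy x_t for a random t, and that depends on a noise schedule. Here is a minimal sketch of one, assuming the linear schedule usually associated with DDPM (β from 1e-4 to 0.02 over T = 1000 steps); those numbers and the helper names are assumptions, since this text does not fix a schedule.

```python
import math

def linear_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noisy_sample(x0, eps, alpha_bar):
    """Closed-form forward jump: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

abar = linear_alpha_bars()
print(abar[0], abar[-1])                 # fraction of signal kept after the first and the last step
print(noisy_sample(1.0, 0.3, abar[-1]))  # at the last step the result is dominated by eps
```

Because ᾱ_t is a running product, any x_t can be produced in one jump from x₀, which is what keeps training cheap.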
```python
import torch
import torch.nn.functional as F

def diffusion_loss(x0, model, alpha_bars):
    # alpha_bars: 1-D tensor of ᾱ_t values, assumed to live on the same device as x0
    B = x0.size(0)
    T = alpha_bars.size(0)  # number of timesteps (undefined in the original snippet)
    t = torch.randint(0, T, (B,), device=x0.device)  # pick a random timestep t per image
    ab = alpha_bars[t].view(B, 1, 1, 1)  # ᾱ_t: cumulative retain ratio; assumes x0 is (B, C, H, W)
    noise = torch.randn_like(x0)  # ε: the actual added noise (known answer)
    # generate x_t in one step (closed form, no need to run t times)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise
    pred = model(x_t, t)  # network predicts the noise
    return F.mse_loss(pred, noise)  # just noise regression: simple & stable

# sampling: start from x_T ~ N(0, 1), loop T steps, subtract model's predicted noise each step
```
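The sampling comment can be spelled out for a scalar toy case. The sketch below follows the DDPM ancestral update with the per-step noise variance set to β_t. In place of a trained network it uses a hypothetical "oracle" noise predictor for a dataset that contains only the value 2.0, so the outcome can be checked exactly; `TARGET`, `oracle_eps`, and the linear schedule are all assumptions for illustration.

```python
import math
import random

def ddpm_sample_1d(eps_model, betas, seed=0):
    """Ancestral sampling for a scalar x, following the DDPM update rule."""
    rng = random.Random(seed)
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    x = rng.gauss(0.0, 1.0)  # x_T ~ N(0, 1)
    for t in reversed(range(len(betas))):
        eps_hat = eps_model(x, t, alpha_bars[t])
        # remove the predicted noise and rescale: mean of x_{t-1}
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(alphas[t])
        # add fresh noise on every step except the last one
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * noise
    return x

TARGET = 2.0  # hypothetical dataset consisting of the single value 2.0

def oracle_eps(x_t, t, alpha_bar):
    """Exact noise for the one-point dataset; stands in for the trained network."""
    return (x_t - math.sqrt(alpha_bar) * TARGET) / math.sqrt(1.0 - alpha_bar)

betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
result = ddpm_sample_1d(oracle_eps, betas)
print(result)  # lands on TARGET up to rounding error
```

With a real model the structure is the same; only `eps_model` changes, and the samples spread over the data distribution instead of collapsing to one point.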
If diffusion makes a noise point do a Brownian-walk slowly drifting to data, Flow Matching straightens that winding path into a single straight conveyor belt. It learns a "velocity field" — every point in space is annotated with "which direction and how fast to flow right now," like a flow field in fluid dynamics, or the turn-by-turn arrows in map navigation. Follow that field along a deterministic straight line (an ODE) and you transport a noise point precisely to a data point — straighter path, fewer steps.
Diffusion is stable but slow and mathematically roundabout (involving stochastic differential equations, SDEs, and score functions). Flow Matching (Lipman et al. 2022) radically simplifies it: directly learn a vector field vθ(x, t) that "flows" the noise distribution into the data distribution. The training target is strikingly simple. Given a noise point x₀ and a data point x₁, connect them with a straight line; the position and velocity at any time t are then closed-form: x_t = (1 − t)·x₀ + t·x₁, with velocity x₁ − x₀. The loss regresses the network onto that velocity: L = ‖ vθ(x_t, t) − (x₁ − x₀) ‖².
Another plain regression! Intuition: the network learns "standing at any point on the path, which way to walk to reach the data." To generate, start from noise x₀ and integrate along the velocity field with an ordinary ODE solver to t=1 — and because the path is near-straight, just a few steps finish it.
A profound unification: diffusion can be viewed as a special case of Flow Matching (corresponding to one particular curved Gaussian path). Flow Matching reframes generative modeling as "learning a velocity field that transports noise to data," with diffusion being just one way to walk. This conceptual unification is a large part of why it became a mainstream choice at the 2024–2025 frontier.
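One way to see what "curved" versus "straight" means is to compare how each path mixes data and noise. The sketch below only evaluates the two interpolation formulas used in this section; the function names are illustrative. Note the notation clash: in the diffusion formula x0 is the data, while in the flow matching formula x0 is the noise.

```python
import math

def diffusion_coeffs(alpha_bar):
    """(data weight, noise weight) on the diffusion path: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)

def linear_coeffs(t):
    """(data weight, noise weight) on the straight path: x_t = t * x1 + (1 - t) * x0."""
    return t, 1.0 - t

for level in (0.25, 0.5, 0.75):
    a, b = diffusion_coeffs(level)
    c, d = linear_coeffs(level)
    # diffusion: squared weights sum to 1 (an arc); straight path: weights sum to 1 (a line)
    print(level, round(a * a + b * b, 6), round(c + d, 6))
```

Both are Gaussian paths between noise and data; they differ only in how the two weights are scheduled, which is the sense in which diffusion is "one way to walk."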
```python
import torch
import torch.nn.functional as F

def flow_matching_loss(x1, model):
    x0 = torch.randn_like(x1)  # start point: pure noise
    t = torch.rand(x1.size(0), device=x1.device)  # pick a random time t∈[0,1] per sample
    # reshape t to (B, 1, ..., 1) so it broadcasts over any data shape, not just (B, D)
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))
    # straight-line interpolation: x_t between noise x0 and data x1
    x_t = (1 - t_b) * x0 + t_b * x1
    target = x1 - x0  # velocity of the line (constant along the path)
    pred = model(x_t, t)  # network predicts the velocity field here; t has shape (B,)
    return F.mse_loss(pred, target)  # another plain regression

# sampling: x = randn(...); integrate model's velocity field with an ODE solver from t=0 to t=1
```
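The sampling comment boils down to a short loop. Here is a minimal sketch with a fixed-step Euler solver on a scalar. As a stand-in for a trained model it assumes a hypothetical dataset containing only the value 2.0, for which the exact velocity field is (target − x) / (1 − t); `TARGET` and `oracle_velocity` are illustrative names.

```python
import random

def euler_sample(velocity, x_start, steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, h = x_start, 1.0 / steps
    for k in range(steps):
        t = k * h  # t never reaches 1 inside the loop
        x = x + h * velocity(x, t)
    return x

TARGET = 2.0  # hypothetical dataset consisting of the single value 2.0

def oracle_velocity(x, t):
    """Exact velocity field for the one-point dataset; stands in for the trained model."""
    return (TARGET - x) / (1.0 - t)

x_noise = random.Random(0).gauss(0.0, 1.0)  # start point drawn from N(0, 1)
one_step = euler_sample(oracle_velocity, x_noise, steps=1)
four_steps = euler_sample(oracle_velocity, x_noise, steps=4)
print(one_step, four_steps)
```

In this toy case the path is exactly straight, so even a single Euler step reaches the target. A learned field on real data is only approximately straight, so a handful of steps is the more realistic expectation.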