
From Noise to Images: A Short History of Diffusion Models


Tags: AI, Diffusion Model, Computer Vision, Generative AI

2026-05-05

Info

I started writing this because my final year project topic is connected to visual generation. Before reading more specialized papers, I needed a map of how diffusion models developed: where the idea came from, why DDPM made it practical again, and how later work such as DDIM, latent diffusion, and Diffusion Transformers changed the field.

Why diffusion models deserve their own story

When I first saw diffusion models, the idea felt almost backwards. Instead of generating an image in one shot, the model first defines a process that slowly destroys an image with noise, then learns how to reverse that destruction. The final model starts from noise and walks backward toward structure.

That sounds inefficient compared with a GAN generator that maps one latent vector directly into an image. But this strange design is exactly why diffusion became important. The training objective is stable. The generation process is gradual. The model can be guided by class labels, text, masks, layouts, or other conditions. Once the computation problem became manageable, diffusion models turned into one of the most flexible frameworks for visual generation.

This article is not a full mathematical tutorial. It is a historical map for myself: how the field moved from an elegant probabilistic idea to a practical image generation infrastructure.

2015: Diffusion as reversing a destruction process

The modern diffusion story is usually traced back to Sohl-Dickstein et al.'s 2015 paper, Deep Unsupervised Learning using Nonequilibrium Thermodynamics.[1] The core idea was inspired by non-equilibrium statistical physics: take a complex data distribution, slowly add noise until it becomes simple, then learn the reverse process that restores structure.

In the forward process, data is gradually corrupted:

$$x_0 \rightarrow x_1 \rightarrow \cdots \rightarrow x_T.$$

After many small noising steps, $x_T$ is close to a simple Gaussian distribution. The generative model then learns the reverse chain:

$$x_T \rightarrow x_{T-1} \rightarrow \cdots \rightarrow x_0.$$

Conceptually this is beautiful. It turns generation into a sequence of easier denoising problems. But at this stage, diffusion was not yet the public face of image generation. GANs were more dramatic, VAEs were easier to explain as latent-variable models, and autoregressive models had strong likelihoods. Diffusion had a good story, but not yet the kind of image quality and engineering recipe that would make everyone pay attention.

For me, this first stage answers one question:

What if generation is not one big leap from noise to image, but many small steps of recovering structure?

The score-based view: learning where data becomes more likely

There is another line that matters: score-based generative modeling. The "score" is the gradient of the log data density:

$$\nabla_x \log p(x).$$

Intuitively, it points toward regions where the data distribution is more likely. Song and Ermon's 2019 work, Generative Modeling by Estimating Gradients of the Data Distribution, made this view practical by estimating scores at different noise levels and sampling with annealed Langevin dynamics.[2]
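To make the score idea concrete for myself, here is a minimal sketch of plain (unannealed) Langevin dynamics on a 1D Gaussian whose score is known in closed form. The target mean and standard deviation, the step size, and the function names are all assumptions chosen for illustration. Song and Ermon's method instead learns the score with a network and anneals over several noise levels.

```python
import numpy as np

def gaussian_score(x, mean=2.0, std=0.5):
    """Score of N(mean, std^2): d/dx log p(x) = -(x - mean) / std^2."""
    return -(x - mean) / std**2

def langevin_sample(score_fn, n_samples=5000, n_steps=500, step_size=0.01, seed=0):
    """Langevin update: x <- x + (step / 2) * score(x) + sqrt(step) * z."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples) * 3.0  # start from a broad, wrong distribution
    for _ in range(n_steps):
        z = rng.standard_normal(n_samples)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

samples = langevin_sample(gaussian_score)
print(samples.mean(), samples.std())  # should end up close to the target mean and std
```

The sampler never evaluates the density itself. It only follows the score plus fresh noise, which is why a learned score is enough to generate samples.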

This matters because diffusion can be read in two related ways. One view says: learn a reverse Markov chain. Another says: learn the vector field that tells noisy samples how to move back toward data. The second view later becomes especially clean in the SDE formulation from Song et al. 2020, where diffusion and score-based models are placed in a shared continuous-time framework.[3]

I did not understand why people kept connecting DDPM with score matching until I saw this point:

Denoising is not only removing noise. It is also learning the local direction back toward the data manifold.

That intuition is useful for computer vision. Images do not fill the whole pixel space uniformly. Natural images sit on a much smaller, structured manifold. Noise pushes samples away from that structure; denoising learns how to return.

2020: DDPM makes diffusion practical again

The real turning point was Ho, Jain, and Abbeel's 2020 paper, Denoising Diffusion Probabilistic Models.[4] DDPM did not invent the basic diffusion idea, but it made the recipe simple, effective, and memorable.

The forward process adds Gaussian noise according to a variance schedule:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}x_{t-1}, \beta_t I).$$

The reverse process is learned by a neural network. In practice, a U-Net takes a noisy image $x_t$ and timestep $t$, then predicts the noise that was added. The training objective can be simplified into a noise prediction loss:

$$\mathbb{E}_{x_0,\epsilon,t}\left[\|\epsilon - \epsilon_\theta(x_t,t)\|^2\right].$$
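A consequence worth writing down, as I understand it from the DDPM paper, is that these Gaussian steps compose in closed form, so $x_t$ can be drawn directly from $x_0$ without simulating the whole chain. The symbols $\alpha_t$ and $\bar\alpha_t$ are extra notation not used elsewhere in this article:

$$\alpha_t = 1-\beta_t, \qquad \bar\alpha_t = \prod_{i=1}^{t}\alpha_i, \qquad q(x_t \mid x_0) = \mathcal{N}\left(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right).$$

Equivalently, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. As $t$ grows, $\bar\alpha_t$ shrinks toward zero and $x_t$ approaches pure Gaussian noise.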

This simplification is one reason DDPM became so influential. Instead of treating generation as an unstable adversarial game, we can train a denoising network with a straightforward regression objective. The model learns many tiny corrections, and the sample gradually becomes more image-like.
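The loss above can be sketched in a few lines of NumPy. This is a toy illustration, not the authors' code: it assumes the linear $\beta$ schedule from the DDPM paper, samples $x_t$ with the closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ (where $\bar\alpha_t$ is the cumulative product of $1-\beta_i$), and uses a placeholder `zero_model` where a real U-Net would go. All function names are hypothetical.

```python
import numpy as np

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and cumulative products alpha_bar_t = prod(1 - beta_i)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, eps, alpha_bars):
    """Draw x_t from x_0 in closed form: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = alpha_bars[t].reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast over image dims
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

def noise_prediction_loss(eps_model, x0, alpha_bars, rng):
    """One-batch Monte Carlo estimate of E[||eps - eps_theta(x_t, t)||^2]."""
    batch = x0.shape[0]
    t = rng.integers(0, len(alpha_bars), size=batch)  # one random timestep per sample
    eps = rng.standard_normal(x0.shape)               # the noise the model must predict
    x_t = q_sample(x0, t, eps, alpha_bars)
    pred = eps_model(x_t, t)
    per_sample = np.sum((eps - pred) ** 2, axis=tuple(range(1, x0.ndim)))
    return per_sample.mean()

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
x0 = rng.standard_normal((8, 3, 16, 16))              # stand-in "images"
zero_model = lambda x_t, t: np.zeros_like(x_t)        # placeholder for a U-Net
loss = noise_prediction_loss(zero_model, x0, alpha_bars, rng)
print(loss)
```

A model that always predicts zero scores roughly the dimensionality of the image (here 3 × 16 × 16 = 768), which gives a baseline that a trained network should beat.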

DDPM made diffusion feel less like a theoretical curiosity and more like a usable image synthesis method. It also changed the emotional center of generative modeling. GANs were fast but hard to train; diffusion models were slow but reliable. Once quality improved, reliability became a major advantage.

DDIM: the first pressure point is sampling speed

DDPM still had a serious weakness: sampling was expensive. If generation requires hundreds or thousands of denoising steps, the model may be elegant but inconvenient.

Song, Meng, and Ermon's 2020 Denoising Diffusion Implicit Models addressed this pressure directly.[5] DDIM showed that we can use a non-Markovian generative process with the same training objective as DDPM, but sample much faster. The paper reports 10x to 50x speedups compared with DDPM in wall-clock time.

The conceptual shift is important. The model does not have to replay every tiny forward noising step in reverse. We can follow a different, more deterministic path through the same family of marginals. This also makes interpolation more meaningful, because the sampling trajectory becomes less dominated by fresh randomness at every step.
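A minimal sketch of the deterministic DDIM update (the $\eta = 0$ case) shows why steps can be skipped. The names `abar_t` and `abar_prev` are my own and stand for the cumulative signal level $\bar\alpha$ at the current and target timesteps. The "oracle" noise below is an assumption used only to check the algebra, since a real sampler would call the trained network.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) from noise level abar_t to abar_prev."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Move that estimate to the target noise level, reusing the same predicted noise.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))
eps = rng.standard_normal((4, 8))
abar_t, abar_prev = 0.1, 0.6  # one large jump instead of many small DDPM steps
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# With an oracle that returns the true noise, the step lands exactly on the
# same sample noised to the new level.
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)
expected = np.sqrt(abar_prev) * x0 + np.sqrt(1.0 - abar_prev) * eps
print(np.allclose(x_prev, expected))
```

No fresh noise is drawn inside `ddim_step`, so the whole trajectory is a function of the starting noise. That is the property that makes interpolation between starting points behave sensibly.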

This is the first major engineering lesson in the history:

Once diffusion models could generate good images, the next question was not only "can they generate?" but "can they generate fast enough?"

Many later samplers, ODE/SDE solvers, and design-space papers continue this theme. They ask how many network evaluations are really needed, which noise schedule works best, and which parameterization makes the reverse process easier to solve.

2021: Quality, guidance, and the moment diffusion catches GANs

The next stage made diffusion models harder to ignore. Nichol and Dhariwal's Improved Denoising Diffusion Probabilistic Models showed that relatively simple modifications, including learned reverse-process variances, could reduce sampling cost while keeping quality high.[6]

Then Dhariwal and Nichol's Diffusion Models Beat GANs on Image Synthesis made the message explicit.[7] With architectural improvements and classifier guidance, diffusion models could produce sample quality competitive with or better than strong GAN baselines.

Guidance is one of the key ideas here. A conditional diffusion model can trade diversity for fidelity. Classifier guidance does this by using gradients from an external classifier to push samples toward a desired class. The cost is that the classifier must be trained separately and must behave well on noisy images.
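In noise-prediction form, classifier guidance can be written as a shift of the predicted noise. The following is how I read the guided update in Dhariwal and Nichol's paper. The symbols are extra notation: $s$ is the guidance scale, $p_\phi(y \mid x_t)$ is the classifier trained on noisy images, and $\bar\alpha_t = \prod_{i \le t}(1-\beta_i)$:

$$\hat\epsilon(x_t, t) = \epsilon_\theta(x_t, t) - s\,\sqrt{1-\bar\alpha_t}\;\nabla_{x_t}\log p_\phi(y \mid x_t).$$

Setting $s = 0$ recovers the unguided model, while larger $s$ pushes samples harder toward class $y$ at the expense of diversity.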

This stage matters because it changes diffusion's status:

  • It is no longer only stable.
  • It is no longer only mathematically elegant.
  • It can compete on image quality.

That is probably the point where diffusion moved from "interesting alternative" to "mainstream generative model."

Text conditioning and classifier-free guidance

Once diffusion models became strong image generators, the natural next question was control. Images are rarely generated unconditionally in real applications. We want to specify a class, a text prompt, a mask, a layout, or some other condition.

OpenAI's GLIDE explored text-conditional image generation and editing with diffusion models.[8] One important finding was that classifier-free guidance worked very well for text-to-image generation: human evaluators preferred it over CLIP guidance. The technique itself comes from Ho and Salimans's Classifier-Free Diffusion Guidance, which was first presented at a 2021 workshop and posted to arXiv in 2022: train the model sometimes with the condition and sometimes without it, then combine the conditional and unconditional predictions at sampling time.[9]

This idea is simple but powerful. Instead of training a separate classifier, one generative model learns both:

  • what images look like without the condition
  • how the condition changes the denoising direction

The guidance scale then becomes a practical control knob. Increase it, and the image tends to follow the prompt more strongly, often with less diversity. Lower it, and the sample may become more varied but less faithful to the condition.
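Both halves of the recipe fit in a short sketch. The function names, the `null_label` placeholder, and the toy vectors are assumptions for illustration. The scale convention used here is the one common in implementations, where 1.0 means "plain conditional". Ho and Salimans write the same combination with a weight $w$, which corresponds to a scale of $1 + w$.

```python
import numpy as np

def drop_condition(labels, null_label, p_uncond, rng):
    """Training side: replace each label with a null label with probability p_uncond."""
    mask = rng.random(labels.shape[0]) < p_uncond
    return np.where(mask, null_label, labels)

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Sampling side: extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
labels = np.array([3, 7, 1, 4])
print(drop_condition(labels, null_label=-1, p_uncond=0.5, rng=rng))

eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])
print(cfg_combine(eps_uncond, eps_cond, 0.0))  # unconditional prediction
print(cfg_combine(eps_uncond, eps_cond, 1.0))  # plain conditional prediction
print(cfg_combine(eps_uncond, eps_cond, 3.0))  # pushed past the conditional prediction
```

Only the component where the two predictions disagree gets amplified. Where the condition does not change the denoising direction, the scale has no effect.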

For text-to-image systems, this became a core part of the user experience. A prompt is not just a label. It becomes a steering signal inside the denoising process.

Latent diffusion: move the expensive part out of pixel space

Even with better samplers and guidance, pixel-space diffusion remains expensive. A 512 by 512 RGB image has 786,432 values (512 × 512 × 3), and the network has to process all of them at every denoising step.

Latent diffusion changed the scale of the problem. Rombach et al.'s High-Resolution Image Synthesis with Latent Diffusion Models proposed running diffusion in the latent space of a pretrained autoencoder instead of directly in pixel space.[10]

The pipeline can be summarized as:

$$\text{image} \rightarrow \text{latent} \rightarrow \text{diffusion in latent space} \rightarrow \text{decoded image}.$$

This is a major engineering move. The autoencoder handles perceptual compression. The diffusion model works in a smaller, more efficient representation. Cross-attention layers then allow the model to condition on text, bounding boxes, or other inputs.
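A quick piece of arithmetic shows how much smaller the denoising problem becomes. The downsampling factor of 8 and the 4 latent channels below are assumptions that match commonly used Stable Diffusion configurations. The latent diffusion paper itself studies several compression factors, and the helper names here are hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the autoencoder latent for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)

def num_values(shape):
    """Total number of scalar values in a tensor of this shape."""
    n = 1
    for d in shape:
        n *= d
    return n

pixels = num_values((3, 512, 512))           # 3 * 512 * 512 = 786432
latent = num_values(latent_shape(512, 512))  # 4 * 64 * 64 = 16384
print(pixels, latent, pixels // latent)      # 786432 16384 48
```

Under these assumptions, every denoising step touches 48 times fewer values than it would in pixel space, and the autoencoder runs only once at the end to decode the result.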

This is also the conceptual bridge to Stable Diffusion. The point is not merely that the model is smaller. The point is that diffusion becomes affordable enough to train, run, and modify outside the largest labs.

For visual learning, latent diffusion is one of the most important ideas to internalize:

Modern image generation is not only about a better denoising model. It is also about choosing the space where denoising happens.

The design-space stage: separating the knobs

By 2022, diffusion had many moving parts: noise schedules, parameterizations, samplers, loss weightings, preconditioning, guidance, architectures, and latent spaces. Karras et al.'s Elucidating the Design Space of Diffusion-Based Generative Models is useful because it tries to separate these choices instead of treating them as one tangled recipe.[11]

This kind of work is less flashy than a new demo, but it is important for understanding the field. When a diffusion model improves, the reason may not be "diffusion is better" in a vague sense. It may be:

  • the sampling trajectory is better conditioned
  • the network prediction target is easier
  • the noise schedule spends capacity in the right regions
  • the solver needs fewer function evaluations
  • the architecture scales more cleanly

This matters for reading later papers. Many papers are not replacing diffusion. They are improving one knob inside the diffusion system.
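One such knob, the spacing of noise levels during sampling, is easy to isolate. Below is a sketch of the $\rho$-warped schedule as I understand it from Karras et al., using what I believe are the paper's default values ($\sigma_{\min} = 0.002$, $\sigma_{\max} = 80$, $\rho = 7$). The function name is my own.

```python
import numpy as np

def karras_sigmas(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels from sigma_max down to sigma_min, interpolated in sigma^(1/rho)."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    max_inv_rho = sigma_max ** (1.0 / rho)
    min_inv_rho = sigma_min ** (1.0 / rho)
    return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho

sigmas = karras_sigmas(10)
print(np.round(sigmas, 3))
```

The steps are large at high noise and become progressively smaller near the data, so most of a fixed budget of network evaluations is spent where fine detail is resolved. Changing `rho` changes that allocation without touching the trained network.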

Diffusion Transformers: replacing the U-Net backbone

For a long time, the default mental image of a diffusion model was "U-Net plus noise timestep." The U-Net made sense for images because it preserves spatial structure and combines local detail with global context through downsampling and upsampling paths.

Diffusion Transformers, or DiT, changed that mental image. Peebles and Xie's Scalable Diffusion Models with Transformers replaced the commonly used U-Net backbone with a Transformer operating on latent patches.[12]

This is a natural meeting point between two trends:

  • latent diffusion makes the input smaller and more token-like
  • Transformers scale well when data can be represented as tokens

DiT does not mean U-Nets disappear immediately. It means diffusion has entered the scaling-law era more directly. If image latents can be patchified into tokens, then the architecture can inherit many ideas from large Transformer systems: depth, width, attention, conditioning, and compute scaling.
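The patchify step is small enough to sketch directly. The 4 × 32 × 32 latent and patch size 2 below are one configuration I recall from the DiT paper, and the function names are hypothetical. The real model also applies a learned linear embedding and positional information to each token, which this sketch leaves out.

```python
import numpy as np

def patchify(latent, patch_size):
    """Turn a (C, H, W) latent into (num_tokens, patch_size * patch_size * C) tokens."""
    c, h, w = latent.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "latent size must be divisible by patch size"
    x = latent.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)  # (h/p, w/p, p, p, C)
    return x.reshape((h // p) * (w // p), p * p * c)

def unpatchify(tokens, channels, height, width, patch_size):
    """Inverse of patchify: rebuild the (C, H, W) latent from its tokens."""
    p = patch_size
    x = tokens.reshape(height // p, width // p, p, p, channels)
    x = x.transpose(4, 0, 2, 1, 3)  # (C, h/p, p, w/p, p)
    return x.reshape(channels, height, width)

latent = np.arange(4 * 32 * 32, dtype=np.float32).reshape(4, 32, 32)
tokens = patchify(latent, patch_size=2)
print(tokens.shape)  # (256, 16)
```

Halving the patch size quadruples the number of tokens, so patch size acts as a direct compute knob for the Transformer.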

This is why DiT feels like a historical milestone rather than just another model variant. It suggests that visual generation is moving toward the same broad scaling logic that transformed language modeling.

A compact timeline

| Stage | Representative work | Main contribution |
| --- | --- | --- |
| 2015 | Sohl-Dickstein et al. | Diffusion as a learned reverse process from noise to data |
| 2019 | Song & Ermon | Score-based generative modeling with noise-conditioned scores |
| 2020 | DDPM | Simple denoising objective and high-quality image synthesis |
| 2020 | DDIM | Faster sampling through non-Markovian implicit processes |
| 2020/2021 | Score-based SDE | Continuous-time view connecting diffusion and score models |
| 2021 | Improved DDPM / Guided Diffusion | Better quality, better likelihoods, fewer sampling steps, guidance |
| 2021/2022 | GLIDE / classifier-free guidance | Practical text conditioning and guidance without a separate classifier |
| 2021/2022 | Latent Diffusion | Run diffusion in compressed latent space for high-resolution generation |
| 2022 | EDM | Clarify the design space of samplers, schedules, and parameterizations |
| 2022/2023 | DiT | Replace U-Net with Transformer over latent patches |

How I currently remember the whole story

The history of diffusion models can be compressed into four transitions.

First, diffusion made generation gradual. Instead of jumping directly from a latent vector to a data sample, it learned to reverse a controlled corruption process.

Second, DDPM made the recipe practical. A U-Net predicting noise at different timesteps turned the theory into a trainable image model.

Third, sampling and guidance made diffusion useful. DDIM and later samplers attacked speed, while classifier guidance and classifier-free guidance made generation controllable.

Fourth, latent diffusion and DiT made diffusion scalable. Latent diffusion moved the work into a cheaper representation; DiT connected diffusion to Transformer scaling.

For my own visual-generation learning path, this history is helpful because it shows where to look when reading a new paper. Some papers change the noising process. Some change the sampler. Some change the conditioning mechanism. Some change the latent representation. Some change the backbone. They all live inside the same broad question:

How do we learn a path from noise back to visual structure, and how do we make that path fast, controllable, and scalable?

That question is the reason diffusion models became more than a generation trick. They became a general framework for modern visual generative AI.

References


  1. Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." arXiv:1503.03585, 2015. https://arxiv.org/abs/1503.03585

  2. Yang Song and Stefano Ermon. "Generative Modeling by Estimating Gradients of the Data Distribution." arXiv:1907.05600, 2019. https://arxiv.org/abs/1907.05600

  3. Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. "Score-Based Generative Modeling through Stochastic Differential Equations." arXiv:2011.13456, 2020. https://arxiv.org/abs/2011.13456

  4. Jonathan Ho, Ajay Jain, and Pieter Abbeel. "Denoising Diffusion Probabilistic Models." arXiv:2006.11239, 2020. https://arxiv.org/abs/2006.11239

  5. Jiaming Song, Chenlin Meng, and Stefano Ermon. "Denoising Diffusion Implicit Models." arXiv:2010.02502, 2020. https://arxiv.org/abs/2010.02502

  6. Alex Nichol and Prafulla Dhariwal. "Improved Denoising Diffusion Probabilistic Models." arXiv:2102.09672, 2021. https://arxiv.org/abs/2102.09672

  7. Prafulla Dhariwal and Alex Nichol. "Diffusion Models Beat GANs on Image Synthesis." arXiv:2105.05233, 2021. https://arxiv.org/abs/2105.05233

  8. Alex Nichol et al. "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models." arXiv:2112.10741, 2021. https://arxiv.org/abs/2112.10741

  9. Jonathan Ho and Tim Salimans. "Classifier-Free Diffusion Guidance." arXiv:2207.12598, 2022. https://arxiv.org/abs/2207.12598

  10. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. "High-Resolution Image Synthesis with Latent Diffusion Models." arXiv:2112.10752, 2021. https://arxiv.org/abs/2112.10752

  11. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. "Elucidating the Design Space of Diffusion-Based Generative Models." arXiv:2206.00364, 2022. https://arxiv.org/abs/2206.00364

  12. William Peebles and Saining Xie. "Scalable Diffusion Models with Transformers." arXiv:2212.09748, 2022. https://arxiv.org/abs/2212.09748