The Anatomy of Diffusion LLMs

...explained from scratch!

Apr 12, 2026

This week’s deep dive covers one of the most important architectural shifts happening in language modeling right now: diffusion LLMs.

Read the full Part 1 deep dive here →

Diffusion LLMs Part 1

It builds a complete understanding from first principles:

how autoregressive generation is structurally memory-bandwidth bound)
why Gaussian noise can’t work on discrete tokens
how masked diffusion solves this with an ELBO-derived training objective
the math behind the forward and reverse processes
unmasking strategies
block diffusion for KV cache compatibility
and a detailed engineering comparison between the two paradigms.

Read the full Part 1 deep dive here →

Why care?

Every production LLM today, GPT-4, Claude, Gemini, LLaMA, generates text the same way: one token at a time, left to right.

Each token requires loading the full model weights through GPU memory, performing a tiny computation, and then loading all the weights again for the next token. On an A100, this means roughly 1 FLOP per byte of data moved, while the GPU is designed for 100+ FLOPs per byte.

Diffusion LLMs take a completely different approach. They start with a fully masked sequence and iteratively unmask all tokens in parallel, using bidirectional attention at every step. This shifts inference from memory-bandwidth bound to compute-bound, which is exactly where modern GPUs are efficient.

The results are catching up fast. Block diffusion (BD3-LM) is within 0.5 perplexity points of autoregressive on LM1B. LLaDA at 8B parameters matches LLaMA 3 on MMLU and exceeds it on TruthfulQA and HumanEval. And models like Dream 7B are already being served in production with SGLang.

Understanding how it works at a mathematical level, from the forward masking process to the ELBO objective to block-level KV caching, is going to be increasingly valuable as these models scale.

You can read the Part 1 here →

👉 Over to you: Do you think the future of LLM generation is pure diffusion, pure autoregressive, or some hybrid of the two?

Thanks for reading!

Daily Dose of Data Science

Discussion about this post

Ready for more?