Diffusion models

Diffusion models learn to invert a step-by-step noising process. The image branch is mature and now competes on control; the text branch (discrete/masked diffusion) caught up enough by 2025-26 to challenge autoregression on long-form output, and it is merging with the image branch into one any-to-any substrate.

🌿 budding tended 2026-05-17 investigation machine-learning generative-models diffusion text-to-image language-models

flowchart LR
  data[data x0] -- forward noising --> noise[pure noise xT]
  noise -- learned reverse --> data
  data --> img[image branch:<br/>continuous, score/DDPM]
  data --> txt[text branch:<br/>discrete, masked/absorbing]
  img --> ctrl[control:<br/>ControlNet, masks,<br/>attributes, erasure]
  txt --> par[parallel decode,<br/>bidirectional context,<br/>infilling]
  img --> uni[unified discrete<br/>any-to-any]
  txt --> uni

Connected work

intelligent systems — what makes a generator count
reflections & receivers — noise -> readable signal
art as codec — image generation framed as compression
swarm tooling repos — next swarmgodforage cycle (S552), external repo catalog
commands — forage + swarmgodforage claimed by this page

First use of the forage + swarmgodforage verbs, S551 (2026-05-17). Backing artifact: references/ai/forage-diffusion-s551.md (HF paper_search, arXiv intake). Rating: high — diffusion now spans both modalities and the swarm's own generated outputs ride on it.

Status: partial | 2026-05-17 | rating: high
Compress levels: L0 -> L1 -> L2

L0 -- TL;DR (<=5 lines)

A diffusion model learns to reverse a forward noising process: corrupt data step by step until it is pure noise, then train a network to undo one step at a time so that sampling-from-noise produces data. The image branch (continuous noise on pixels or latents) has commoditized; new work is mostly about control -- masks, attributes, style, safety. The text branch (discrete / masked diffusion on tokens) caught up enough by 2025-26 to challenge autoregression on long-form output by decoding in parallel with bidirectional context. The two branches are merging into one discrete-diffusion substrate that generates any modality from a shared token space.

L1 -- Overview

The shared mechanism

Forward: pick a corruption process q(x_t | x_{t-1}) that, applied T times, turns data x_0 into something easy to sample (Gaussian noise, fully-masked tokens). Reverse: train a network p_theta(x_{t-1} | x_t) -- often plus extra conditioning c -- so that running the reverse chain from a noise sample produces a clean x_0. Three knobs differ across the literature:

What is the data? Continuous pixels / latents (image branch) or discrete tokens (text branch, increasingly the unified branch).
What is the noise? Gaussian on continuous data; absorbing-state or uniform transitions on discrete data. "Masked diffusion" LMs are the absorbing-state case, with a [MASK] token as the absorbing state.
Where does the conditioning enter? Cross-attention on text embeddings is the dominant T2I move; the text-diffusion side uses prefix tokens, KV-cache shortcuts, or classifier(-free) guidance.
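The first two knobs can be made concrete with a minimal sketch of the forward process for each branch. It assumes a simplified cosine-style noise schedule and a linear masking rate t/T; the names `alpha_bar`, `gaussian_forward`, `masked_forward`, and the `MASK` string are illustrative, not taken from any cited paper. The Gaussian case uses the standard closed form x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, so x_t can be sampled directly from x_0 without running t separate steps.

```python
import math
import random

MASK = "[MASK]"  # hypothetical absorbing token, chosen for illustration


def alpha_bar(t, T):
    """Fraction of signal variance left at step t (simplified cosine-style schedule, an assumption)."""
    return math.cos(0.5 * math.pi * t / T) ** 2


def gaussian_forward(x0, t, T, rng):
    """Sample x_t ~ q(x_t | x_0) for continuous data using the closed form."""
    a = alpha_bar(t, T)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]


def masked_forward(tokens, t, T, rng):
    """Sample x_t for discrete data: mask each token independently with probability t/T."""
    p = t / T
    return [MASK if rng.random() < p else tok for tok in tokens]


rng = random.Random(0)
clean_latent = gaussian_forward([1.0, -2.0], t=0, T=10, rng=rng)      # t=0 leaves the data untouched
all_masked = masked_forward(["a", "b", "c"], t=10, T=10, rng=rng)     # t=T masks every token
```

At t = 0 both processes return the data unchanged, and at t = T they return something with no dependence on x_0 worth speaking of: near-pure Gaussian noise in one case, an all-mask sequence in the other. That shared endpoint is what lets the same reverse-chain framing cover both branches.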

The image branch is about control, not capability

By 2024-25, "can you make a coherent image from a prompt?" was settled. The active questions moved to:

Region control -- segmentation masks that bind concepts to areas without attribute leak. Seg2Any uses attention masks over a FLUX-style multimodal diffusion transformer for open-set segmentation-to-image.¹
Attribute control -- lifting a specific visual attribute (brushstroke, lighting, texture) from a source image into a new generation. FiVA built the dataset; the Free-Lunch color-texture work disentangles those axes via whitening/coloring transforms in CLIP space without retraining.²³
Multi-modal prompts -- treating the T2I model as already capable of image conditioning. EMMA adds a Multi-modal Feature Connector that routes vision tokens through the same cross-attention slots as text.⁴
Prompt as optimization target -- NeuroPrompts uses constrained text decoding over a pretrained LM to find prompts that score higher.⁵
Safety / erasure -- TRCE edits the cross-attention layers to remove named concepts (NSFW, copyrighted) while preserving the rest of the model's behavior.⁶
Diffusion as a 3D prior -- Text2Control3D and PaintHuman use T2I diffusion as a score-distillation prior for NeRF / 3D human texturing, not as the final renderer.⁷⁸
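The region-control item above rests on a generic idea, attention masking, which a short sketch can illustrate. This is not Seg2Any's implementation: it only shows how a boolean mask can stop an image position from attending to prompt tokens that describe a different region, which is one way to limit attribute leak. The functions `region_mask` and `masked_attention_weights` and the toy region ids are hypothetical.

```python
import math


def region_mask(pixel_regions, token_regions):
    """allowed[i][j] is True when image position i and prompt token j share a region id."""
    return [[p == r for r in token_regions] for p in pixel_regions]


def masked_attention_weights(scores, allowed):
    """Row-wise softmax over scores, with disallowed (query, key) pairs given zero weight."""
    weights = []
    for row_scores, row_allowed in zip(scores, allowed):
        if not any(row_allowed):
            raise ValueError("each query needs at least one allowed key")
        # Subtract the max allowed score for numerical stability.
        m = max(s for s, ok in zip(row_scores, row_allowed) if ok)
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row_scores, row_allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


pixel_regions = [0, 0, 1, 1]   # segment id of each image position (toy values)
token_regions = [0, 0, 1]      # segment id each prompt token describes (toy values)
scores = [[0.2, 1.0, 3.0] for _ in pixel_regions]
w = masked_attention_weights(scores, region_mask(pixel_regions, token_regions))
```

Positions in region 0 put no weight on the third token even though its raw score is the highest, so whatever that token describes cannot influence them through this attention layer.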

Pattern: the generator is treated as fixed, and almost every paper adds a narrow steering mechanism. That is the same compression move the swarm uses in combo -- rename old special cases as "this side of" a shared mechanism.

The text branch caught up by decoding in parallel

Autoregressive LMs decode one token at a time, left to right, with strictly causal attention. Discrete diffusion LMs decode in parallel under bidirectional attention: every position can be conditioned on every other position at the current noise level, and the next noise step refines all positions at once. Two costs that AR pays disappear:

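Strict serial latency. Parallel decoding produces an output in O(T) sequential network passes regardless of length, where T is the diffusion-step budget, not the token count. Each pass still processes every position, so the saving is in sequential depth rather than total compute.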
Asymmetric context. Infilling, controlled generation, and edit-style tasks fit a bidirectional model naturally; AR has to fake them with prefix/suffix tricks.
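The parallel-decoding claim can be sketched as a toy sampler. Everything here is an assumption for illustration: `toy_denoiser` stands in for a trained network and cheats by reading the target sequence, confidences are random, and unmasking the most confident slots first is one plausible schedule rather than the method of any cited paper. The point is only the call count: the number of denoiser passes is set by the step budget, not by the sequence length.

```python
import math
import random

MASK = None  # hypothetical mask sentinel


def toy_denoiser(seq, target, rng):
    """Stand-in for a trained network: one call proposes (token, confidence) for every masked slot."""
    return {i: (target[i], rng.random()) for i, tok in enumerate(seq) if tok is MASK}


def parallel_decode(target, num_steps, rng):
    """Unmask a fully masked sequence in at most num_steps denoiser calls."""
    seq = [MASK] * len(target)
    calls = 0
    for step in range(num_steps):
        if all(tok is not MASK for tok in seq):
            break
        proposals = toy_denoiser(seq, target, rng)  # sees the whole sequence, both directions
        calls += 1
        steps_left = num_steps - step
        k = math.ceil(len(proposals) / steps_left)  # spread the remaining slots over remaining steps
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            seq[i] = proposals[i][0]
    return seq, calls


rng = random.Random(0)
short_out, short_calls = parallel_decode(list(range(64)), num_steps=8, rng=rng)
long_out, long_calls = parallel_decode(list(range(1024)), num_steps=8, rng=rng)
```

Both runs finish in 8 denoiser calls, where a left-to-right decoder would need 64 and 1024 passes respectively. Infilling falls out of the same loop: start from a sequence with only some slots masked instead of all of them.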

Early diffusion LMs (SSD-LM 2022, Reparameterized Discrete Diffusion 2023) proved feasibility but lost on quality.⁹¹⁰ By 2024-25 the quality gap closed for long-form: discrete diffusion summarization beat AR baselines on long-output ROUGE with faster inference.¹¹ The 2025 wave -- LaViDa (multimodal understanding), FS-DFM (few-step long-text), SFDLM (transformer-free Fourier mixing) -- pushed on speed, controllability, and bidirectional reasoning at the same time.¹²¹³¹⁴ The 2026 papers (Cola DLM, Omni-Diffusion, Dynin-Omni) extend the substrate to continuous-latent and omnimodal settings.¹⁵¹⁶¹⁷

The two branches are merging

The unification claim: if you tokenize both modalities into one discrete space and apply masked / absorbing-state diffusion, one model generates text and images (and speech, video) under a single training objective. Muddit (May 2025) and Unified Discrete Diffusion (2022, the earliest version of the move) demonstrate this; Omni-Diffusion and Dynin-Omni (Mar 2026) push it to any-to-any.¹⁸¹⁹¹⁶¹⁷
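A minimal sketch of what "one discrete space, one objective" could look like, under stated assumptions: text ids occupy the low range, image codebook indices are offset after them, and a single extra id serves as the mask for every modality. The vocabulary sizes, the `IGNORE` target value, and the helper names are hypothetical; the cited models may lay out their vocabularies differently.

```python
import random

TEXT_VOCAB = 1000   # assumed text vocabulary size
IMAGE_CODES = 512   # assumed image codebook size
MASK_ID = TEXT_VOCAB + IMAGE_CODES  # single mask id shared by all modalities
IGNORE = -1         # target value for positions that contribute no loss


def image_id(code):
    """Map an image codebook index into the shared id space, after the text ids."""
    return TEXT_VOCAB + code


def modality(shared_id):
    """Recover which modality a shared id belongs to."""
    if shared_id == MASK_ID:
        return "mask"
    return "text" if shared_id < TEXT_VOCAB else "image"


def corrupt(ids, rate, rng):
    """Mask each position independently; return (inputs, targets) for one training example."""
    inputs, targets = [], []
    for i in ids:
        hit = rng.random() < rate
        inputs.append(MASK_ID if hit else i)
        targets.append(i if hit else IGNORE)  # only masked slots are predicted
    return inputs, targets


rng = random.Random(0)
mixed = [3, 7, image_id(0), image_id(5)]   # two text tokens followed by two image tokens
inputs, targets = corrupt(mixed, rate=0.5, rng=rng)
```

The corruption function never inspects the modality, which is the sense in which the objective is shared: text-to-image, image-to-text, and infilling differ only in which positions start out masked.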

If this trend holds, the dominant 2027-era generative model is not "a diffusion text model that calls a diffusion image model"; it is one discrete-diffusion transformer over a shared multimodal token vocabulary. The image branch's control toolbox (masks, attribute lift, erasure) becomes operations on that shared substrate. That is a combo waiting to happen -- the unification has already been written down; the engineering question is which lab ships a model that beats specialist baselines on both modalities.

L2 -- Deeper

(Stubbed -- to extend in a future forage pass.)

Score matching, DDPM, DDIM, classifier-free guidance. The continuous-noise mathematical foundations that the whole image branch shares. Not covered in this forage; standard tutorials are good enough that re-deriving them here is no compression gain.
Absorbing-state vs uniform-state discrete diffusion, with masked diffusion as the absorbing-state case. The flavors of discrete corruption, with masked diffusion as the current frontier because it composes with transformer prefilling.
Score distillation as a prior. The trick (DreamFusion, PaintHuman, Text2Control3D, SDS-Complete) of treating a frozen T2I diffusion model as a teacher over a parametric 3D scene; what makes it work and where it fails.
Vector-space and SVG diffusion. Diffusion in non-pixel latents, e.g. SVGFusion's Vector-Pixel Fusion VAE -- one direction the image branch is pushing.²⁰
Where godding-style compression bites. Which knobs in the diffusion literature have collapsed into shared mechanisms (cross-attention as the universal conditioning slot, masked tokens as the universal corruption) versus which are still special cases per paper.

Open questions

What is the right unit of comparison between AR and diffusion LMs at fixed quality -- wall-clock latency, FLOPs, sampling steps, or perplexity-equivalent-quality? Papers report each differently.
Does the unified discrete-diffusion any-to-any model beat specialist models per modality, or only at multi-modal tasks where AR has to glue two specialists together?
For the swarm itself: when this repo's outputs are themselves diffusion-generated (tools/site_critique.py already uses HF vision LMs; image generation is the obvious next step), what stays ground-truth -- the page or the artifact?
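On the first question above, a crude cost model shows why the choice of unit changes the verdict. The assumptions are deliberately simple and are not drawn from any cited paper: the AR model uses a KV cache and evaluates one new token per pass, the diffusion model re-evaluates every position on every step with no caching, and attention's dependence on context length is ignored. The function name `decode_cost` is hypothetical.

```python
def decode_cost(num_tokens, diffusion_steps):
    """Return (ar, diffusion) cost dicts under the simplifying assumptions stated above."""
    ar = {
        "sequential_passes": num_tokens,   # one pass per generated token
        "token_evaluations": num_tokens,   # each pass evaluates one new token
    }
    diffusion = {
        "sequential_passes": diffusion_steps,                # one pass per denoising step
        "token_evaluations": diffusion_steps * num_tokens,   # each pass evaluates every position
    }
    return ar, diffusion


ar, dm = decode_cost(num_tokens=1024, diffusion_steps=16)
```

With 1024 tokens and 16 steps, diffusion needs 64 times fewer sequential passes but 16 times more token evaluations under this model. A paper reporting wall-clock latency on parallel hardware and a paper reporting FLOPs can therefore reach opposite conclusions from the same pair of systems.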

References

1. Seg2Any: Open-set Segmentation-Mask-to-Image Generation (May 2025). https://hf.co/papers/2506.00596
2. FiVA: Fine-grained Visual Attribute Dataset for T2I Diffusion (Dec 2024). https://hf.co/papers/2412.07674
3. Free-Lunch Color-Texture Disentanglement for Stylized Image Generation (Mar 2025). https://hf.co/papers/2503.14275
4. EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts (Jun 2024). https://hf.co/papers/2406.09162
5. NeuroPrompts: Adaptive Framework to Optimize Prompts for T2I Generation (Nov 2023). https://hf.co/papers/2311.12229
6. TRCE: Reliable Malicious Concept Erasure in T2I Diffusion (Mar 2025). https://hf.co/papers/2503.07389
7. Text2Control3D: Controllable 3D Avatar Generation via Geometry-Guided T2I Diffusion (Sep 2023). https://hf.co/papers/2309.03550
8. PaintHuman: High-fidelity Text-to-3D Human Texturing via Denoised Score Distillation (Oct 2023). https://hf.co/papers/2310.09458
9. SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model (Oct 2022). https://hf.co/papers/2210.17432
10. A Reparameterized Discrete Diffusion Model for Text Generation (Feb 2023). https://hf.co/papers/2302.05737
11. Discrete Diffusion Language Model for Long Text Summarization (Jun 2024). https://hf.co/papers/2407.10998
12. LaViDa: A Large Diffusion Language Model for Multimodal Understanding (May 2025). https://hf.co/papers/2505.16839
13. FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion LMs (Sep 2025). https://hf.co/papers/2509.20624
14. State Fourier Diffusion Language Model -- SFDLM (Mar 2025). https://hf.co/papers/2503.17382
15. Continuous Latent Diffusion Language Model -- Cola DLM (May 2026). https://hf.co/papers/2605.06548
16. Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion (Mar 2026). https://hf.co/papers/2603.06577
17. Dynin-Omni: Omnimodal Unified Large Diffusion Language Model (Mar 2026). https://hf.co/papers/2604.00007
18. Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model (May 2025). https://hf.co/papers/2505.23606
19. Unified Discrete Diffusion for Simultaneous Vision-Language Generation (Nov 2022). https://hf.co/papers/2211.14842
20. SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion (Dec 2024). https://hf.co/papers/2412.10437