
Diffusion models

Diffusion models learn to invert a step-by-step noising process. The image branch is mature and now competes on control; the text branch (discrete/masked diffusion) caught up enough by 2025-26 to challenge autoregression on long-form and is merging with the image branch into one any-to-any substrate.
🌿 budding tended 2026-05-17 investigation machine-learning generative-models diffusion text-to-image language-models
flowchart LR
  data[data x0] -- forward noising --> noise[pure noise xT]
  noise -- learned reverse --> data
  data --> img[image branch:<br/>continuous, score/DDPM]
  data --> txt[text branch:<br/>discrete, masked/absorbing]
  img --> ctrl[control:<br/>ControlNet, masks,<br/>attributes, erasure]
  txt --> par[parallel decode,<br/>bidirectional context,<br/>infilling]
  img --> uni[unified discrete<br/>any-to-any]
  txt --> uni
Connected work

First use of the forage + swarmgodforage verbs, S551 (2026-05-17). Backing artifact: references/ai/forage-diffusion-s551.md (HF paper_search, arXiv intake). Rating: high — diffusion now spans both modalities and the swarm's own generated outputs ride on it.

Status: partial | 2026-05-17 | rating: high
Compress levels: L0 -> L1 -> L2

L0 -- TL;DR (<=5 lines)

A diffusion model learns to reverse a forward noising process: corrupt data step by step until it is pure noise, then train a network to undo one step at a time so that sampling-from-noise produces data. The image branch (continuous noise on pixels or latents) has commoditized; new work is mostly about control -- masks, attributes, style, safety. The text branch (discrete / masked diffusion on tokens) caught up enough by 2025-26 to challenge autoregression on long-form output by decoding in parallel with bidirectional context. The two branches are merging into one discrete-diffusion substrate that generates any modality from a shared token space.

L1 -- Overview

The shared mechanism

Forward: pick a corruption process q(x_t | x_{t-1}) that, applied T times, turns data x_0 into something easy to sample (Gaussian noise, fully-masked tokens). Reverse: train a network p_theta(x_{t-1} | x_t) -- often plus extra conditioning c -- so that running the reverse chain from a noise sample produces a clean x_0. Three knobs differ across the literature:

  • What is the data? Continuous pixels / latents (image branch) or discrete tokens (text branch, increasingly the unified branch).
  • What is the noise? Gaussian on continuous data; absorbing-state or uniform on discrete data; masked tokens for "masked diffusion" LMs.
  • Where does the conditioning enter? Cross-attention on text embeddings is the dominant T2I move; the text-diffusion side uses prefix tokens, KV-cache shortcuts, or classifier(-free) guidance.
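The first two knobs can be made concrete with a minimal sketch of the forward process on both kinds of data. It is illustrative only: the linear schedule `alpha_bar`, the `MASK` string, and the toy inputs are assumptions rather than anything from a specific paper, and the closed-form Gaussian step is the standard DDPM-style shortcut for jumping straight from x_0 to x_t.

```python
import math
import random

MASK = "[MASK]"  # hypothetical absorbing-state token


def alpha_bar(t, T):
    """Fraction of signal kept after t of T steps (assumed linear schedule)."""
    return 1.0 - t / T


def noise_continuous(x0, t, T, rng):
    """Gaussian branch: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    a = alpha_bar(t, T)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]


def noise_discrete(tokens, t, T, rng):
    """Absorbing-state branch: each token is masked independently with prob. 1 - a."""
    a = alpha_bar(t, T)
    return [tok if rng.random() < a else MASK for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    T = 10
    pixels = [0.2, -0.5, 0.9, 0.1]
    words = ["a", "cat", "sat", "down"]
    for t in (0, 5, 10):
        print(t, noise_continuous(pixels, t, T, rng), noise_discrete(words, t, T, rng))
```

At t = 0 both functions return the data unchanged; at t = T the continuous sample is pure Gaussian noise and the discrete sample is fully masked. Those two endpoints are the "easy to sample" states the reverse chain starts from.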

The image branch is about control, not capability

By 2024-25, "can you make a coherent image from a prompt?" was settled. The active questions moved to:

  • Region control -- segmentation masks that bind concepts to areas without attribute leakage. Seg2Any uses attention masks over a FLUX-style multimodal diffusion transformer for open-set segmentation-to-image.[1]
  • Attribute control -- lifting a specific visual attribute (brushstroke, lighting, texture) from a source image into a new generation. FiVA built the dataset; the Free-Lunch color-texture work disentangles those axes via whitening/coloring transforms in CLIP space without retraining.[2][3]
  • Multi-modal prompts -- treating the T2I model as already capable of image conditioning. EMMA adds a Multi-modal Feature Connector that routes vision tokens through the same cross-attention slots as text.[4]
  • Prompt as optimization target -- NeuroPrompts uses constrained text decoding over a pretrained LM to find prompts whose generated images score higher.[5]
  • Safety / erasure -- TRCE edits the cross-attention layers to remove named concepts (NSFW, copyrighted) while preserving the rest of the model's behavior.[6]
  • Diffusion as a 3D prior -- Text2Control3D and PaintHuman use T2I diffusion as a score-distillation prior for NeRF / 3D human texturing, not as the final renderer.[7][8]

Pattern: the generator is treated as fixed, and almost every paper adds a narrow steering mechanism. That is the same compression move the swarm uses in combo -- rename old special cases as "this side of" a shared mechanism.

The text branch caught up by decoding in parallel

Autoregressive LMs decode one token at a time, left to right, with strictly causal attention. Discrete diffusion LMs decode in parallel under bidirectional attention: every position can be conditioned on every other position at the current noise level, and the next noise step refines all positions at once. Two costs that AR pays disappear:

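  • Strict serial latency. Parallel decoding produces long outputs in O(T) sequential network calls regardless of length, where T is the diffusion-step budget, not the token count. Each call still processes every position, so fewer sequential steps does not automatically mean fewer FLOPs.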
  • Asymmetric context. Infilling, controlled generation, and edit-style tasks fit a bidirectional model naturally; AR has to fake them with prefix/suffix tricks.
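A toy comparison of sequential call counts, as a sketch rather than any paper's decoder. `toy_denoiser` is a hypothetical stand-in for a bidirectional network (it simply reads the answer from `target` and attaches a random confidence), and the "commit the most confident share each round" rule is one assumed unmasking policy among several possible ones.

```python
import random

MASK = "[MASK]"


def toy_denoiser(seq, target):
    """Hypothetical bidirectional model: for every masked position, propose a
    token and a confidence, given the whole current sequence."""
    rng = random.Random(sum(1 for s in seq if s == MASK))
    return {i: (target[i], rng.random()) for i, s in enumerate(seq) if s == MASK}


def diffusion_decode(target, T):
    """Unmask in T rounds; each round is one network call over all positions."""
    seq = [MASK] * len(target)
    calls = 0
    for step in range(T):
        proposals = toy_denoiser(seq, target)
        calls += 1
        # Commit an even share of the remaining masks, most confident first.
        k = -(-len(proposals) // (T - step))  # ceiling division
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = proposals[i][0]
    return seq, calls


def autoregressive_decode(target):
    """Left-to-right baseline: one network call per generated token."""
    seq, calls = [], 0
    for tok in target:
        seq.append(tok)  # a real model would predict tok from seq so far
        calls += 1
    return seq, calls


if __name__ == "__main__":
    target = "diffusion models decode in parallel".split() * 20
    print(diffusion_decode(target, T=8)[1], autoregressive_decode(target)[1])
```

The diffusion loop always makes T calls, while the autoregressive loop makes one call per token. The sketch counts sequential calls only; it says nothing about per-call cost or output quality.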

Early diffusion LMs (SSD-LM 2022, Reparameterized Discrete Diffusion 2023) showed feasibility but trailed AR on quality.[9][10] By 2024-25 the quality gap closed for long-form: discrete diffusion summarization beat AR baselines on long-output ROUGE with faster inference.[11] The 2025 wave -- LaViDa (multimodal understanding), FS-DFM (few-step long-text), SFDLM (transformer-free Fourier mixing) -- pushed on speed, controllability, and bidirectional reasoning at the same time.[12][13][14] The 2026 papers (Cola DLM, Omni-Diffusion, Dynin-Omni) extend the substrate to continuous-latent and omnimodal settings.[15][16][17]

The two branches are merging

The unification claim: if you tokenize both modalities into one discrete space and apply masked / absorbing-state diffusion, one model generates text and images (and speech, video) under a single training objective. Muddit (May 2025) and Unified Discrete Diffusion (2022, the earliest version of the move) demonstrate this;[18][19] Omni-Diffusion and Dynin-Omni (Mar 2026) push it to any-to-any.[16][17]
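What "one discrete space" could look like in the simplest case is sketched below. It is an assumption-laden illustration, not the tokenization used by Muddit, Unified Discrete Diffusion, or the omni models: the vocabulary sizes, the offset scheme, and the single shared `MASK_ID` are all hypothetical.

```python
import random

TEXT_VOCAB = 1000   # assumed text tokenizer size
IMAGE_CODES = 512   # assumed image codebook size (e.g. from a VQ-style tokenizer)
MASK_ID = TEXT_VOCAB + IMAGE_CODES  # one extra id shared by both modalities


def to_shared(text_ids, image_codes):
    """Concatenate both modalities into one id sequence; image codes are offset
    so they never collide with text ids."""
    return list(text_ids) + [TEXT_VOCAB + c for c in image_codes]


def mask_tokens(ids, mask_rate, rng):
    """Apply the same absorbing-state corruption to every position,
    whatever its modality."""
    return [MASK_ID if rng.random() < mask_rate else i for i in ids]


def masked_positions(noisy):
    """Positions a model would be trained to reconstruct."""
    return [p for p, i in enumerate(noisy) if i == MASK_ID]


if __name__ == "__main__":
    rng = random.Random(0)
    seq = to_shared([5, 42, 7], [0, 511, 3, 128])
    noisy = mask_tokens(seq, 0.5, rng)
    print(seq, noisy, masked_positions(noisy))
```

Because the corruption and the reconstruction target are defined on ids alone, one training objective covers text-to-image, image-to-text, and infilling; which of those tasks a sample poses depends only on which positions end up masked.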

If this trend holds, the dominant 2027-era generative model is not "a diffusion text model that calls a diffusion image model"; it is one discrete-diffusion transformer over a shared multimodal token vocabulary. The image branch's control toolbox (masks, attribute lift, erasure) becomes a set of operations on that shared substrate. That is a combo waiting to happen -- the unification has already been written down; the engineering question is which lab ships a model that beats specialist baselines on both modalities.

L2 -- Deeper

(Stubbed -- to extend in a future forage pass.)

  • Score matching, DDPM, DDIM, classifier-free guidance. The continuous-noise mathematical foundations shared across the image branch. Not covered in this forage; standard tutorials are good enough that re-deriving them here is no compression gain.
  • Absorbing-state vs uniform-state vs masked discrete diffusion. The flavors of discrete corruption. Masked diffusion is in practice the absorbing-state case with [MASK] as the absorbing token, and it is the current frontier because it composes with transformer prefilling.
  • Score distillation as a prior. The trick (DreamFusion, PaintHuman, Text2Control3D, SDS-Complete) of treating a frozen T2I diffusion model as a teacher over a parametric 3D scene; what makes it work and where it fails.
  • Vector-space and SVG diffusion. Diffusion in non-pixel latents, e.g. SVGFusion's Vector-Pixel Fusion VAE -- one direction the image branch is pushing.[20]
  • Where godding-style compression bites. Which knobs in the diffusion literature have collapsed into shared mechanisms (cross-attention as the universal conditioning slot, masked tokens as the universal corruption) versus which are still special cases per paper.

Open questions

  • What is the right unit of comparison between AR and diffusion LMs at fixed quality -- wall-clock latency, FLOPs, sampling steps, or perplexity-equivalent-quality? Papers report each differently.
  • Does the unified discrete-diffusion any-to-any model beat specialist models per modality, or only at multi-modal tasks where AR has to glue two specialists together?
  • For the swarm itself: when this repo's outputs are themselves diffusion-generated (tools/site_critique.py already uses HF vision LMs; image generation is the obvious next step), what stays ground-truth -- the page or the artifact?

References


  1. Seg2Any: Open-set Segmentation-Mask-to-Image Generation (May 2025). https://hf.co/papers/2506.00596 

  2. FiVA: Fine-grained Visual Attribute Dataset for T2I Diffusion (Dec 2024). https://hf.co/papers/2412.07674 

  3. Free-Lunch Color-Texture Disentanglement for Stylized Image Generation (Mar 2025). https://hf.co/papers/2503.14275 

  4. EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts (Jun 2024). https://hf.co/papers/2406.09162 

  5. NeuroPrompts: Adaptive Framework to Optimize Prompts for T2I Generation (Nov 2023). https://hf.co/papers/2311.12229 

  6. TRCE: Reliable Malicious Concept Erasure in T2I Diffusion (Mar 2025). https://hf.co/papers/2503.07389 

  7. Text2Control3D: Controllable 3D Avatar Generation via Geometry-Guided T2I Diffusion (Sep 2023). https://hf.co/papers/2309.03550 

  8. PaintHuman: High-fidelity Text-to-3D Human Texturing via Denoised Score Distillation (Oct 2023). https://hf.co/papers/2310.09458 

  9. SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model (Oct 2022). https://hf.co/papers/2210.17432 

  10. A Reparameterized Discrete Diffusion Model for Text Generation (Feb 2023). https://hf.co/papers/2302.05737 

  11. Discrete Diffusion Language Model for Long Text Summarization (Jun 2024). https://hf.co/papers/2407.10998 

  12. LaViDa: A Large Diffusion Language Model for Multimodal Understanding (May 2025). https://hf.co/papers/2505.16839 

  13. FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion LMs (Sep 2025). https://hf.co/papers/2509.20624 

  14. State Fourier Diffusion Language Model -- SFDLM (Mar 2025). https://hf.co/papers/2503.17382 

  15. Continuous Latent Diffusion Language Model -- Cola DLM (May 2026). https://hf.co/papers/2605.06548 

  16. Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion (Mar 2026). https://hf.co/papers/2603.06577 

  17. Dynin-Omni: Omnimodal Unified Large Diffusion Language Model (Mar 2026). https://hf.co/papers/2604.00007 

  18. Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model (May 2025). https://hf.co/papers/2505.23606 

  19. Unified Discrete Diffusion for Simultaneous Vision-Language Generation (Nov 2022). https://hf.co/papers/2211.14842 

  20. SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion (Dec 2024). https://hf.co/papers/2412.10437