
Mixing — generalized

Mixing is one operation wearing many costumes. A mixture is a weighted combination of *parts* in some space, evaluated by a *kernel* that decides how the parts interact. Across taste, smell, chemistry, fluids, audio, color, probability, and machine learning the same three knobs recur: **weights** (how much of each), **kernel** (additive · multiplicative · super-additive · masking), and **carrier** (the medium the parts live in). When the kernel is linear the math is convex combination; when it is nonlinear you get synergy, antagonism, masking, emulsions, beats, dissonance, mode collapse — the interesting phenomena.
🌱 seedling tended 2026-05-13 research mixtures generalization isomorphism mathematics chemistry perception machine-learning
flowchart LR
  parts[parts · pᵢ in space X] --> weights[weights · wᵢ on simplex]
  weights --> kernel[mixing kernel K · linear · multiplicative · masking · super-additive]
  parts --> carrier[carrier · medium · solvent · prior · base]
  carrier --> kernel
  kernel --> mixture[the mixture m = K(w, p)]
  mixture --> readout[readout · taste · spectrum · sample · prediction]

Generalization · rating: medium. Companion to MIXTURES.md — abstracts the same operation across math, chemistry, fluids, signals, ML.

Mixing — generalized — three pigment streams (saffron, magenta, teal) pour into a glass vessel and swirl into a vortex whose colors spill beyond the hull; small motifs around the beaker — star anise, walnut, a branching tree, splashes of pigment — hint at the recurring domains (taste, chemistry, fluids, color, audio, probability, ML, language).
Parts pour in, the kernel stirs, and the interesting phenomena leak outside the convex hull. The same vessel, different costumes. FLUX.1 schnell via fal.ai, prompted on the L0/L1 description.

Status: seedling | 2026-05-13 | rating: medium Compress levels: L0 ↓ L1 ↓ L2

A mixture is what you get when two things stop being two things without quite becoming one. The interesting question is never how much of each — it is what the combination operation does to its inputs.

L0 — TL;DR (≤5 lines)

Across cuisine, perfume, chemistry, fluids, audio, color, probability, and machine learning, mixing is the same operation with different kernels: choose parts pᵢ in some space X, give them weights wᵢ ≥ 0 summing to 1, and apply a kernel K(w, p) that may be linear (light, probability, convex hull), multiplicative / super-additive (umami × umami, catalysis, constructive interference), masking / saturating (salt over bitter, loudness compression, attention winner-take-all), or emergent (emulsion, chord, alloy, mixture-of-experts gating). The same three knobs (weights, kernel, carrier) explain why salt amplifies sweet, why hydrogen + oxygen → water (a non-mixture), why GMMs fit data, and why MoE routing can beat dense networks at fixed FLOPs.

L1 — Overview

Core claim

A surprising amount of "what goes with what" — and "what happens when we combine these" — is one structure played in different keys. The structure has three parts:

  1. Parts p₁ … pₙ — elements of a space X. X can be a flavor vector, a chemical species, a probability distribution, an image patch, a neural network expert.
  2. Weights w ∈ Δⁿ⁻¹ — non-negative coefficients on the simplex (sum to 1). "How much of each part."
  3. Kernel K : Δⁿ⁻¹ × Xⁿ → X — the combination rule. The kernel decides whether the mixture is the arithmetic mean (linear), the geometric mean (multiplicative), the dominant element (winner-take-all), or something strictly outside the convex hull (synergistic / emergent).

A fourth knob shows up in physical mixtures: the carrier — the medium that holds the parts. Solvents in chemistry, fat in food, air in perfume, the prior in Bayesian mixtures, the residual stream in transformers. The carrier is "silent" but changes the effective kernel.

Why a generalization page

The mixture-expert investigation MIXTURES covers the taste and smell instance in depth. It is one row in a table whose columns are: parts, weights, kernel, carrier, readout, failure modes, design rules. The same table fills out for chemistry, fluids, audio, light, probability, and neural nets. The generalization is not just analogy — the same math (convex combinations, log-linear pooling, masking inequalities) shows up in each row.

The recurring instances

flowchart LR
  mix(("mixing<br/>the abstract operation"))
  subgraph physical ["physical · matter"]
    food["taste<br/>5+ axes"]
    smell["smell<br/>~10D embedding"]
    chem["chemistry<br/>moles · partial pressures"]
    fluid["fluids<br/>turbulent stirring"]
  end
  subgraph signal ["signal · perception"]
    light["color<br/>RGB · spectra"]
    sound["audio<br/>waves · chords"]
  end
  subgraph abstract ["abstract · inference"]
    prob["probability<br/>mixture models"]
    ml["ML<br/>MoE · ensembling"]
    opinion["social<br/>belief aggregation"]
    rhetoric["language<br/>code-switch · style"]
  end
  mix --> physical
  mix --> signal
  mix --> abstract

The kernel zoo (one-line summaries)

| Kernel | Rule | Where it shows up |
|---|---|---|
| Linear (convex combination) | \(m = \sum_i w_i\, p_i\) | Additive light mixing, alpha blending, Bayesian model averaging, baseline mix-of-X |
| Log-linear (geometric pooling) | \(\log m = \sum_i w_i \log p_i\) | Multiplicative effects, product-of-experts, mutual reinforcement |
| Multiplicative / synergistic | \(m > \max_i p_i\) | Umami×umami, catalysis, constructive interference, resonance |
| Masking / saturating | \(m \approx \max_i (w_i\, p_i)\) or softmax-like | Salt-over-bitter, attention heads, loudness, winner-take-all |
| Gated / routed | \(m = \sum_i g_i(x)\, p_i(x)\) with sparse \(g\) | Mixture-of-experts, conditional computation, conditional GMM |
| Stochastic mixture | sample \(i \sim w\) then output \(p_i\) | GMM data model, ensemble bagging, dropout |
| Reactive (non-mixture) | parts disappear into a new species | \(\mathrm{H_2 + \tfrac{1}{2}\,O_2 \to H_2O}\); sex (genetic recombination) |
| Emergent (phase) | mixture lives outside \(X\) itself | Emulsion, chord, alloy, foam, plasma |
| Time-resolved | \(m(t) = \sum_i w_i(t)\, p_i\) | Perfume top/heart/base, ADSR envelope, narrative pacing |
| Spatial | \(m(x) = \sum_i w_i(x)\, p_i\) | Texture synthesis, ecological mosaic, Gaussian splats |
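
As a rough illustration of how the same weights and parts give different mixtures under different kernels, here is a minimal sketch of three rows of the table on scalar parts. The function names and the example intensities are hypothetical, chosen only for this illustration.

```python
import math

def linear_mix(weights, parts):
    """Linear kernel: the convex combination sum_i w_i * p_i."""
    return sum(w * p for w, p in zip(weights, parts))

def log_linear_mix(weights, parts):
    """Log-linear kernel: exp(sum_i w_i * log p_i), a weighted geometric mean.
    Parts must be strictly positive."""
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, parts)))

def masking_mix(weights, parts):
    """Masking kernel: the largest weighted part hides the others."""
    return max(w * p for w, p in zip(weights, parts))

weights = [0.5, 0.5]
parts = [1.0, 4.0]  # hypothetical scalar intensities

print("linear    :", linear_mix(weights, parts))
print("log-linear:", log_linear_mix(weights, parts))
print("masking   :", masking_mix(weights, parts))
```

With equal weights on 1 and 4, the linear kernel returns the arithmetic mean, the log-linear kernel the geometric mean, and the masking kernel the dominant weighted part.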

Why three knobs and not one

A single number (a weight) is enough only when the kernel is linear. As soon as the kernel is nonlinear, weight effects depend on the other parts. Salt at 1 g amplifies sweet only if sweet is present; Iso E Super "fixes" only if volatile top-notes are present; an expert in MoE fires only if the gate routes to it. So the design surface is at minimum 3D: what to put in, how much, what combination law applies.

L2 — Deep dive

1. Mathematical mixtures — the canonical math

The clean math case. Let \(X\) be a measurable space.

Mixture distribution:

\[p(x) = \sum_{i} w_i\, p_i(x), \qquad w_i \geq 0, \quad \sum_i w_i = 1\]

A Gaussian mixture model (GMM) is \(p_i = \mathcal{N}(\mu_i, \Sigma_i)\). Mixture distributions are the canonical universal approximator for densities: any continuous density on a compact set can be approximated arbitrarily well by a GMM with enough components. The cost: components are identifiable only up to a permutation of their labels, and the likelihood that EM climbs is non-convex, so EM finds local optima.
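
The mixture density and its two-step sampling rule (pick a component by weight, then sample from it) can be sketched in one dimension as follows. This is a minimal illustration with made-up parameters, not a fitting procedure; the function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_pdf(x, weights, mus, sigmas):
    """Mixture density p(x) = sum_i w_i * N(x; mu_i, sigma_i^2)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def gmm_sample(rng, weights, mus, sigmas):
    """Stochastic mixture: draw a component index i ~ w, then draw from p_i."""
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(mus[i], sigmas[i])

# Hypothetical two-component mixture.
weights, mus, sigmas = [0.3, 0.7], [-2.0, 3.0], [0.5, 1.0]
rng = random.Random(0)
draws = [gmm_sample(rng, weights, mus, sigmas) for _ in range(5)]
print("density at 0:", gmm_pdf(0.0, weights, mus, sigmas))
print("five draws  :", draws)
```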

Convex combination of vectors:

\[m = \sum_i w_i\, v_i \;\in\; \mathrm{conv}(v_1, \ldots, v_n)\]

The mixture lives inside the convex hull of the parts. Linear mixing cannot escape the hull. Every interesting mixing phenomenon is, in some sense, a way to escape the convex hull.

Log-linear (geometric) pooling:

\[m(x) \;\propto\; \prod_i p_i(x)^{w_i}\]

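Used in product of experts. Components act as soft constraints: the pool is large only where every component with positive weight is large, so the support narrows to the intersection of the component supports rather than widening to their union. With unit exponents (Hinton's product of experts) the result is sharper than any single component; with weights on the simplex, pooling Gaussians gives a precision equal to the weighted average of the component precisions. This is the formal opposite of arithmetic mixing.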
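
For one-dimensional Gaussian components the log-linear pool has a closed form: it is again Gaussian, with precision \(\sum_i w_i / \sigma_i^2\) and a precision-weighted mean. The sketch below (hypothetical function name and example values) compares simplex weights with unit exponents, the product-of-experts case.

```python
def pool_gaussians(weights, mus, sigmas):
    """Log-linear pool of 1D Gaussians, prod_i N(mu_i, sigma_i^2)^{w_i}.
    Returns (mean, sigma) of the resulting Gaussian."""
    precision = sum(w / s ** 2 for w, s in zip(weights, sigmas))
    mean = sum(w * m / s ** 2 for w, m, s in zip(weights, mus, sigmas)) / precision
    return mean, precision ** -0.5

mus, sigmas = [0.0, 2.0], [1.0, 1.0]  # two unit-variance components

# Weights on the simplex: precision is the weighted average of precisions.
print("simplex pool      :", pool_gaussians([0.5, 0.5], mus, sigmas))
# Unit exponents (product of experts): precisions add, so the result is sharper.
print("product of experts:", pool_gaussians([1.0, 1.0], mus, sigmas))
```

For comparison, the arithmetic mixture of the same two components has variance 1 + 1 = 2 (within-component variance plus the spread of the means), so it is wider than either part.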

Mixture of experts (MoE):

\[y(x) = \sum_i g_i(x)\, f_i(x), \qquad g_i(x) = \mathrm{softmax}(W_g\, x)_i\]

The weights \(g_i(x)\) are themselves a function of the input. This makes MoE a gated mixture — different inputs get different mixes. Sparse MoE forces \(g\) to be approximately one-hot, recovering hard routing and giving the FLOPs savings that make MoE practical at scale (Switch Transformer, GLaM, Mixtral).
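
A minimal sketch of sparse gating on scalar inputs, assuming the common recipe of keeping the top-k gate logits and renormalizing them with a softmax. The experts and the gate-logit function here are hypothetical stand-ins for learned networks.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Keep the k largest logits, softmax over them, zero out the rest."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in keep])
    gate = [0.0] * len(logits)
    for i, p in zip(keep, probs):
        gate[i] = p
    return gate

def moe_forward(x, gate_logits_fn, experts, k=2):
    """y(x) = sum_i g_i(x) * f_i(x), evaluating only experts with non-zero gate."""
    gate = top_k_gate(gate_logits_fn(x), k)
    return sum(g * f(x) for g, f in zip(gate, experts) if g > 0.0)

# Hypothetical experts and gate (stand-ins for learned functions).
experts = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: -x]
gate_logits = lambda x: [x, 0.0, -x]

print(top_k_gate(gate_logits(3.0), k=2))
print(moe_forward(3.0, gate_logits, experts, k=1))
```

Skipping experts whose gate is zero is where the compute saving comes from: with k = 1 only one expert runs per input, however many experts exist.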

Wasserstein barycenter — the "mean" of distributions in optimal-transport sense:

\[\bar{\mu} = \arg\min_{\mu}\; \sum_i w_i\, W_2^2(\mu, \mu_i)\]

This is mixing in distribution space that respects geometry — interpolating between a unimodal distribution at 0 and a unimodal at 1 produces a unimodal at 0.5 (the Wasserstein mean), not the bimodal arithmetic mixture \(\tfrac{1}{2}(\delta_0 + \delta_1)\). Two completely different "averages" — which one the application wants is the design question.
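
In one dimension the \(W_2\) barycenter has a simple form: average the quantile functions. For empirical distributions with the same number of samples this reduces to averaging the sorted samples, as in the sketch below. The equal-sample-count restriction and the function names are assumptions made for this illustration.

```python
def wasserstein_barycenter_1d(samples_list, weights):
    """W2 barycenter of 1D empirical distributions with equal sample counts:
    the weighted average of the sorted samples (the quantile functions)."""
    sorted_sets = [sorted(s) for s in samples_list]
    n = len(sorted_sets[0])
    assert all(len(s) == n for s in sorted_sets), "equal sample counts assumed"
    return [sum(w * s[j] for w, s in zip(weights, sorted_sets)) for j in range(n)]

def arithmetic_mixture_1d(samples_list):
    """Equal-weight arithmetic mixture: pool all samples together."""
    return sorted(x for s in samples_list for x in s)

near_zero = [-0.1, 0.0, 0.1]  # a narrow bump around 0
near_one = [0.9, 1.0, 1.1]    # a narrow bump around 1

print("barycenter:", wasserstein_barycenter_1d([near_zero, near_one], [0.5, 0.5]))
print("mixture   :", arithmetic_mixture_1d([near_zero, near_one]))
```

The barycenter is a single bump around 0.5, while the arithmetic mixture keeps both original bumps.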

Bayesian model averaging:

\[p(y \mid \mathrm{data}) = \sum_i p(y \mid \mathrm{model}_i)\, \cdot\, p(\mathrm{model}_i \mid \mathrm{data})\]

The same form as a probability mixture, with the weights being model posteriors. Equivalent in form to ensemble averaging in ML — the interpretation differs but the math is identical.
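
A small worked sketch of the averaging step, assuming each model's log evidence \(\log p(\mathrm{data} \mid \mathrm{model}_i)\) is already available. The numbers and the function name are hypothetical.

```python
import math

def bayesian_model_average(predictions, log_evidences, priors):
    """Return (averaged prediction, model posteriors), where
    p(model_i | data) is proportional to prior_i * evidence_i."""
    m = max(log_evidences)  # subtract the max for numerical stability
    unnorm = [p * math.exp(le - m) for p, le in zip(priors, log_evidences)]
    z = sum(unnorm)
    posteriors = [u / z for u in unnorm]
    averaged = sum(w * y for w, y in zip(posteriors, predictions))
    return averaged, posteriors

# Two hypothetical models predicting p(y = 1); the second explains the data 3x better.
avg, post = bayesian_model_average(
    predictions=[0.2, 0.8],
    log_evidences=[math.log(1.0), math.log(3.0)],
    priors=[0.5, 0.5],
)
print("posteriors:", post)
print("averaged  :", avg)
```

With equal priors and a 3:1 evidence ratio the posteriors are 0.25 and 0.75, so the averaged prediction is 0.25 × 0.2 + 0.75 × 0.8 = 0.65.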

2. Chemical mixtures — and the reactive non-mixture

A chemical mixture preserves the parts; a chemical reaction destroys them. The line between them is whether the kernel is conservative or transformative.

Conservative (true mixture):

  • Heterogeneous: oil + water; visibly two phases. Kernel is approximately identity within each phase; surface area is the interaction.
  • Homogeneous (solution): salt + water, alcohol + water. One phase; parts are molecularly mixed but recoverable (evaporate the water → get salt back).
  • Colloid / emulsion: parts dispersed below visible scale but above molecular: mayonnaise (oil-in-water + emulsifier), milk, smoke, fog. Emulsions are mixtures stabilized by a third species (the emulsifier) whose molecules have both polar and nonpolar ends.

Transformative (reaction, not mixture):

\[\mathrm{H_2 + \tfrac{1}{2}\,O_2 \;\longrightarrow\; H_2O} \quad\text{(reactants gone; new species)}\]

This is not a mixture in any algebraic sense. The kernel discards the inputs. But: every reaction starts from a mixture (parts must be in contact), so mixing rate often bounds reaction rate (the diffusion limit).

Quantitative laws on conservative mixtures:

| Law | Rule | Interpretation |
|---|---|---|
| Dalton | total pressure = Σ partial pressures | gas mixture is additive in pressure |
| Raoult | partial vapor pressure = mole fraction × pure vapor pressure | ideal liquid mixture is linear in mole fraction |
| Henry | dissolved concentration = kH × partial pressure | gas-in-liquid mixing is linear at low concentration |
| Beer–Lambert | absorbance = Σ εᵢ cᵢ ℓ | spectra of mixtures are linear in component spectra (if no interaction) |
| Ideal mixing entropy | ΔS_mix = -R Σ xᵢ ln xᵢ | entropy gain is the Shannon entropy of the composition, the same formula as in information theory |

The last one is striking: the entropy of mixing in chemistry is information-theoretic entropy in coding. The same -Σ pᵢ log pᵢ describes a gas of N molecules in two compartments and a fair-coin source coded for a channel. Mixtures are information-theoretic objects.
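
The identity can be checked numerically: the molar entropy of ideal mixing is R times the Shannon entropy of the mole fractions measured in nats. A minimal sketch (function names are illustrative):

```python
import math

R = 8.314462618  # molar gas constant, J/(mol K)

def shannon_entropy_nats(probs):
    """H = -sum_i p_i ln p_i, skipping zero entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ideal_mixing_entropy(mole_fractions):
    """Delta S_mix per mole = -R sum_i x_i ln x_i = R * H(x)."""
    return R * shannon_entropy_nats(mole_fractions)

equimolar = [0.5, 0.5]
print("H in bits      :", shannon_entropy_nats(equimolar) / math.log(2.0))
print("dS_mix J/(mol K):", ideal_mixing_entropy(equimolar))
```

An equimolar binary mixture carries one bit per molecule, which is R ln 2 ≈ 5.76 J/(mol K) per mole.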

Non-ideal corrections (where chemistry stops being linear):

  • Activity coefficients (Lewis): when molecules of A and B prefer each other (or repel), aᵢ = γᵢ xᵢ with γ ≠ 1. Salt water is non-ideal enough to need γ to predict its boiling point.
  • Excess Gibbs energy G^E = G_mix − G_mix^ideal quantifies deviation; positive G^E → "they don't like mixing" → eventually phase separation.

3. Fluids — stirring vs mixing vs diffusing

Three distinct operations, often conflated.

  • Stirring: macroscopic transport. Folding and stretching of fluid parcels by velocity field. Eddies. Increases the interfacial area between the parts without changing the molecular composition along a parcel. Like shuffling cards — no card touches another, but their arrangement becomes mixed.
  • Diffusion: microscopic molecular spread. Fick's law J = -D ∇c. Slow on macroscopic scales (diffusion time ~ L²/D: with D ~ 10⁻⁹ m²/s for small solutes in water, that is weeks for sugar to spread through an unstirred cup of tea and decades across a meter of water).
  • Mixing: the joint effect — stirring multiplies interfacial area so diffusion can finish the job in finite time. The Batchelor scale η_B = (νD²/ε)^¼ is where the two regimes meet: below it, diffusion dominates; above it, stirring dominates.
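
To get a feel for the two scales above, here is a back-of-the-envelope sketch. The diffusivity of sucrose in water (about 5 × 10⁻¹⁰ m²/s), the viscosity of water and the dissipation rate of a stirred cup are assumed order-of-magnitude values, not measurements.

```python
def diffusion_time_s(length_m, diffusivity_m2_s):
    """Order-of-magnitude diffusion time t ~ L^2 / D."""
    return length_m ** 2 / diffusivity_m2_s

def batchelor_scale_m(nu, diffusivity, epsilon):
    """Batchelor scale eta_B = (nu * D^2 / epsilon)^(1/4)."""
    return (nu * diffusivity ** 2 / epsilon) ** 0.25

D_SUCROSE = 5e-10    # m^2/s, assumed room-temperature value
NU_WATER = 1e-6      # m^2/s, kinematic viscosity of water
EPS_STIRRED = 1e-2   # W/kg, assumed dissipation rate in a stirred cup

cup = diffusion_time_s(0.05, D_SUCROSE)
eta_b = batchelor_scale_m(NU_WATER, D_SUCROSE, EPS_STIRRED)

print(f"unstirred 5 cm cup : {cup / 86400:.0f} days")
print(f"Batchelor scale    : {eta_b * 1e6:.1f} micrometres")
print(f"diffusion over it  : {diffusion_time_s(eta_b, D_SUCROSE):.3f} s")
```

Under these assumptions stirring shrinks the distance diffusion has to cover from centimetres to micrometres, and the diffusion time from weeks to a fraction of a second.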

This explains a kitchen observation: a cold milk stream in cold coffee makes "layers" that don't mix; the same milk stream in hot coffee disperses quickly. Higher temperature → higher D (Stokes-Einstein, D ~ T/μ) and lower viscosity → the same pour stirs more vigorously, and at a given ε the Batchelor scale (νD²/ε)^¼ grows because νD² ~ T²/μ → diffusion takes over sooner → faster mixing.

Turbulent mixing: in turbulence, the energy cascade from large to small eddies (E(k) ~ k^{-5/3}, Kolmogorov) is the same machinery that drives mixing. In stratified flows the mixing efficiency Γ = B/ε (buoyancy flux over dissipation) measures how much of the stirring work ends up as irreversible mixing rather than just heat. The commonly used value is Γ ~ 0.2, so most of the work is dissipated.

4. Audio and color — interference and perceptual mixing

Sound mixing is waveform addition: at a point in space, pressures sum. The kernel is genuinely linear in the medium (air is approximately linear at normal amplitudes). But the perceptual kernel is not:

  • Constructive / destructive interference: two equal-amplitude, same-frequency waves in phase double the amplitude (4× intensity); exactly out of phase they cancel.
  • Beats: two close frequencies f₁ ≈ f₂ produce amplitude modulation at |f₁ − f₂| — a literal mixing product, not present in either source.
  • Consonance / dissonance: small-integer frequency ratios (2:1, 3:2, 4:3) are heard as consonant; ratios that are complex or slightly detuned from these are heard as dissonant. The cochlear basilar membrane resolves frequencies; closely spaced frequencies excite overlapping regions → roughness. Helmholtz's 1863 explanation.
  • Masking: a loud tone at f suppresses perception of soft tones near f (within ~ a critical band). This is the same masking inequality as salt over bitter — perception of part b is reduced by the presence of part a even though a does not "consume" b. MP3 and AAC audio compression exploit this: the encoder discards masked components.
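
The interference and beat claims above follow from two textbook identities, sketched here numerically. The function names and the 440/444 Hz example are illustrative choices.

```python
import math

def relative_intensity(a1, a2, phase):
    """Intensity (amplitude squared) of two same-frequency waves
    with amplitudes a1, a2 and relative phase in radians."""
    return a1 ** 2 + a2 ** 2 + 2.0 * a1 * a2 * math.cos(phase)

def two_tone(t, f1, f2):
    """Linear sum of two unit-amplitude sine waves."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def beat_form(t, f1, f2):
    """The same signal rewritten as a carrier at the mean frequency times a
    slow envelope: sin A + sin B = 2 sin((A+B)/2) cos((A-B)/2)."""
    carrier = math.sin(2 * math.pi * 0.5 * (f1 + f2) * t)
    envelope = 2.0 * math.cos(2 * math.pi * 0.5 * (f1 - f2) * t)
    return envelope * carrier

print("in phase    :", relative_intensity(1.0, 1.0, 0.0))
print("out of phase:", relative_intensity(1.0, 1.0, math.pi))
print("sum vs beat form at t = 0.1 s:",
      two_tone(0.1, 440.0, 444.0), beat_form(0.1, 440.0, 444.0))
```

The envelope's magnitude peaks twice per cosine period, so the loudness pulses at |f₁ − f₂| (4 Hz in this example), even though the medium only ever adds the two waves.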

Color mixing:

  • Additive (light, RGB): spectral power distributions add. Linear in the physics; the perceptual kernel via three cone types (L, M, S) gives the CIE chromaticity diagram. The gamut of n primaries is the convex hull of those primaries in CIE xy, a convex combination in a 2D projection.
  • Subtractive (pigment, CMY): each pigment absorbs part of the spectrum before it is reflected back. To a first approximation the kernel is multiplicative in reflectance, equivalently additive in negative-log-reflectance. So "mixing paints" is a log-linear pool, not the additive pool that mixing lights is. This is why mixing many paints converges to muddy brown (the reflectance product → 0 in all bands), while mixing many lights converges to white.

A clean conceptual point: light mixing is the canonical linear kernel; paint mixing is the canonical log-linear kernel; the difference is not "art" — it is two different physical operations.
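
The two operations can be contrasted on a toy three-band spectrum. This is an idealized sketch: the band values are made up, and real pigments need a scattering model such as Kubelka-Munk rather than a pure reflectance product.

```python
import math

def mix_lights(spectra):
    """Additive mixing: per-band power simply sums."""
    return [sum(band) for band in zip(*spectra)]

def mix_paints(reflectances, weights):
    """Idealized subtractive mixing: per-band weighted geometric mean of
    reflectance, i.e. a log-linear pool."""
    return [
        math.prod(r ** w for r, w in zip(band, weights))
        for band in zip(*reflectances)
    ]

# Toy spectra over three bands (long, medium, short wavelength).
red_light, green_light, blue_light = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
cyan, magenta, yellow = [0.1, 0.9, 0.9], [0.9, 0.1, 0.9], [0.9, 0.9, 0.1]
thirds = [1.0 / 3.0] * 3

print("lights:", mix_lights([red_light, green_light, blue_light]))
print("paints:", mix_paints([cyan, magenta, yellow], thirds))
```

The three lights sum to a flat, bright spectrum (white), while the three paints pool to a flat, dim one: every band is pulled down by whichever pigment absorbs it.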

5. Probability and machine learning — mixing as inference

ML uses mixing in at least four distinct roles.

| Role | Form | Example |
|---|---|---|
| Model | data ~ Σ wᵢ pᵢ | Gaussian mixture model, hidden Markov model emission, topic model |
| Ensemble | prediction = Σ wᵢ ŷᵢ | Bagging, stacking, model soup, BMA |
| Capacity | computation = Σ gᵢ(x) fᵢ(x) | Mixture of experts, switch transformer, conditional computation |
| Regularizer | training ~ mix two losses or two inputs | Mixup augmentation; objective mixing (L_CE + λ L_KL) |

Mixture of experts deserves its own paragraph. MoE replaces a dense feedforward f(x) = W · x with Σᵢ gᵢ(x) · fᵢ(x) where gᵢ is a learned gating function. Sparse routing (top-k with k = 1 or 2) means each input activates only a few experts, so total parameters grow linearly with the number of experts while per-input compute stays roughly constant. The kernel is gated linear: linear given the gate, but the gate is non-linear in x. This makes the effective kernel piecewise-linear with input-dependent regions: a partition of input space into expert specializations, glued at the boundaries by the gate.

Failure modes peculiar to MoE:

  • Expert collapse: gate routes everything to one expert; others are dead. Mitigated by auxiliary load-balancing losses.
  • Capacity overflow: more inputs routed to expert i than its capacity → drop / overflow. The "token drop" rate is a key reliability metric.
  • Gate-expert miscoordination: experts diverge in training; gate's routing is on stale assumptions. Slow-moving gate or expert EMA helps.

Mixup augmentation is the simplest possible mixing used for generalization, and it is purely linear:

\[\tilde{x} = \lambda\, x_i + (1-\lambda)\, x_j, \quad \tilde{y} = \lambda\, y_i + (1-\lambda)\, y_j, \quad \lambda \sim \mathrm{Beta}(\alpha, \alpha)\]

A linear mixture in input space + a linear mixture in label space acts as an effective regularizer because the function class is forced to behave linearly between data points, suppressing high-curvature decision boundaries. The empirical observation: mixup improves generalization on many tasks despite being almost embarrassingly simple. This is the cleanest case of "the kernel is the regularizer."
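
The mixup formula is short enough to sketch directly on plain lists. The helper name and the toy examples are hypothetical; a real pipeline would apply the same rule to batched tensors.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, alpha, rng):
    """Mix two (input, one-hot label) pairs with lambda ~ Beta(alpha, alpha).
    Returns (mixed input, mixed label, lambda)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam

rng = random.Random(0)
x_mixed, y_mixed, lam = mixup_pair(
    x_i=[0.0, 0.0], y_i=[1.0, 0.0],  # toy example of class 0
    x_j=[1.0, 1.0], y_j=[0.0, 1.0],  # toy example of class 1
    alpha=0.4, rng=rng,
)
print("lambda:", lam)
print("x~    :", x_mixed)
print("y~    :", y_mixed)
```

The mixed label stays on the simplex, so it can be used directly as a soft target; small alpha concentrates lambda near 0 or 1, keeping most mixed examples close to a real one.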

6. Social / linguistic mixing

The same skeleton — parts, weights, kernel — keeps applying.

Opinion aggregation (DeGroot): xᵢ(t+1) = Σⱼ Wᵢⱼ xⱼ(t), where W is row-stochastic, so each agent updates as a linear mixture of neighbors' opinions. For a strongly connected influence graph, opinions converge to consensus iff the graph is aperiodic. The Friedkin-Johnsen extension adds a "stubbornness" term λᵢ xᵢ(0), keeping the agent partly anchored to their original opinion, which is exactly the prior carrier role in Bayesian mixing.
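
Both update rules fit in a few lines. The sketch below uses a hypothetical three-agent influence matrix, and writes the Friedkin-Johnsen step in one common parameterization, x(t+1) = λ x(0) + (1 − λ) W x(t), which is an assumption of this example.

```python
def degroot_step(W, x):
    """One DeGroot update: each agent takes a weighted mean of opinions."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def friedkin_johnsen_step(W, x, x0, stubbornness):
    """Blend each agent's initial opinion with the DeGroot update."""
    mixed = degroot_step(W, x)
    return [s * a + (1.0 - s) * m for s, a, m in zip(stubbornness, x0, mixed)]

def iterate(step, x0, steps):
    """Apply a one-argument update function repeatedly, starting from x0."""
    x = list(x0)
    for _ in range(steps):
        x = step(x)
    return x

# Hypothetical row-stochastic influence matrix: a 3-agent chain with self-loops.
W = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

consensus = iterate(lambda x: degroot_step(W, x), [1.0, 0.0, 0.0], 100)
print("DeGroot          :", consensus)

start = [1.0, 0.5, 0.0]
anchored = iterate(
    lambda x: friedkin_johnsen_step(W, x, start, [0.5, 0.0, 0.5]), start, 200
)
print("Friedkin-Johnsen :", anchored)
```

In the DeGroot run all agents end at the same value, a weighted mean of the initial opinions; with two stubborn agents the Friedkin-Johnsen run settles into persistent disagreement.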

Code-switching (linguistics): bilingual speakers mix grammatical elements from two languages at well-defined junctures (Poplack's "equivalence constraint" and "free morpheme constraint"). The kernel has a structure: not every mixing is grammatical. Code-switching is a constrained mixture in a grammar-aware kernel.

Cultural fusion in cuisine (already covered in MIXTURES.md §5): cuisines emerge as locally stable kernels selecting compatible part sets.

7. The "escape the convex hull" question

The single sharpest design question across all these mixings:

When does the mixture produce something outside the space of its parts?

| Domain | "Inside the hull" mixture | "Outside the hull" mixture |
|---|---|---|
| Math | Convex combination | Geometric mean, Wasserstein barycenter, sample-then-output |
| Taste | Two sweet things | Umami × umami (8×), salt + bitter (suppression), browning (new compounds) |
| Smell | Two musks (saturate) | A perfume accord that "smells like neither" |
| Chemistry | Salt + water | Hydrogen + oxygen → water; emulsion; alloy |
| Color (light) | Two reds | RGB → white (looks like none of the parts) |
| Color (paint) | Two greens | Many pigments → mud |
| Sound | Two unisons | Chord triad (Gestalt object); beats; harmony |
| Probability | BMA at fixed prior | Product-of-experts (sharper); mixture of categoricals (multimodal) |
| ML | Soup of fine-tunes | MoE with specialization; mixup-regularized model |
| Society | DeGroot consensus | Polarization (anti-mixing); cultural fusion (new genre) |

The "outside the hull" cases are where mixing is interesting rather than averaging. The rest of design — kitchen, perfumery, ML architecture, opinion dynamics — is about engineering kernels that escape the hull in useful directions and avoid the hull-escape failure modes (mud, dissonance, mode collapse, polarization).

8. Failure modes — a unified table

| Failure | Mechanism | Domain examples |
|---|---|---|
| Muddled middle | Too many small weights on too many parts | 7+ herbs in a dish; 30+ raw materials in a perfume; full ensemble averaging without selection |
| Single-axis maxed | One weight = 1, others = 0 | Pure heat without sweet; mono-instrument mix; single-expert MoE collapse |
| Carrier mismatch | Parts can't reach the kernel | Fat-soluble aroma in water dish; oil-based pigment on wet plaster; tokens that miss their expert |
| Saturation / masking | Too much of one part hides others | Loud bass masks treble; loud expert masks gate signal; over-spiced food |
| Time misalignment | Parts arrive at different times | Top-note vanishes before base note appears; ensemble member trained on stale data |
| Cultural / prior mismatch | Same chemistry, wrong context | Anchovy + chocolate works in Italy, weird in Japan; same Bayesian update with wrong prior |
| Catastrophic interaction | Kernel destroys parts | Reactive: bleach + ammonia; bad: two musks → null perception |

These map row-by-row across the table. The "design rule" for each is the same in form: adjust weights, change carrier, replace one part, time-shift, or change the kernel.

9. Concrete cross-domain recipes for the swarm

Where this generalization buys something practical:

  1. A pairing finder is a kernel + distance. Foodpairing.com (high shared-volatile pairs well in Western cuisine) and embedding-distance pairing (Open POM) are both "near-neighbor in some space" — they just choose different spaces. The same algorithm runs on perfume notes, wine pairings, music recommendations.
  2. Mixup-as-regularizer generalizes. Any system that should be linear between two anchor examples benefits from in-space mixing during training. Cuisine students learning by tasting "halfway between dish A and dish B" is the human version.
  3. Carrier choice is dominant in non-linear kernels. In food: pick fat. In ML: pick the input encoding. In perfume: pick the fixative. In inference: pick the prior. Carriers don't appear in the headline recipe but determine whether the recipe works.
  4. Escape-the-hull design. To make something new, choose parts with genuine kernel synergy (umami × umami; harmonic 3:2; emulsifier + immiscible parts; experts with disjoint specialties). Linear mixing gives you only points inside.

10. Open questions

  • Is there a unified algebra of "good" mixtures? Across domains, "good" often involves a small number of dominant components (~3) plus carriers + a contrast accent. Why 3? Working-memory? Information-theoretic capacity of the readout?
  • When does the kernel itself depend on input? MoE makes the kernel input-dependent; cuisine grammars do the same (acid is mandatory in some cuisines, optional in others, conditional on dish). A theory of conditional kernels would unify these.
  • Wasserstein-style "geometric" mixing of perceptual spaces — does an optimal-transport mean of two smells smell more like an actual smell than the arithmetic mixture? Open POM allows this experiment.
  • Mixing entropy as the universal scoring function: −Σ pᵢ log pᵢ shows up in chemistry, info theory, ecology (Shannon diversity index), political polarization measures. Is "interesting mixture" = "high mixing entropy after subtracting predictable variance"?
  • Reactive mixing in social systems: what's the social analogue of H₂ + O₂ → H₂O? Movements that destroy their parts to produce a new collective species? (Schelling on tipping; Granovetter on threshold models.)

References

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning — Chapter 9 on mixture models, EM, and Gaussian mixtures.
  • Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation. — product-of-experts as log-linear pool.
  • Shazeer, N., et al. (2017). Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. ICLR.
  • Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. JMLR.
  • Zhang, H., et al. (2018). mixup: beyond empirical risk minimization. ICLR.
  • Cuturi, M. (2013). Sinkhorn distances: lightspeed computation of optimal transport. NeurIPS. Entropic optimal transport, the computational basis for Wasserstein barycenters.
  • Helmholtz, H. von (1863). On the Sensations of Tone — consonance, beats, masking.
  • Wyszecki, G., & Stiles, W. S. (2000). Color Science. — CIE chromaticity and additive/subtractive mixing.
  • Tennekes, H., & Lumley, J. L. (1972). A First Course in Turbulence. — turbulent mixing, Batchelor scale.
  • Atkins, P., & de Paula, J. (2014). Physical Chemistry. — Dalton, Raoult, Henry, mixing entropy.
  • DeGroot, M. H. (1974). Reaching a consensus. JASA. — linear opinion aggregation.
  • Friedkin, N. E., & Johnsen, E. C. (1999). Social influence networks and opinion change. Advances in Group Processes.
  • Ahn, Y.-Y., et al. (2011). Flavor network and the principles of food pairing. Scientific Reports. — cuisine as a kernel.
  • Lee, B. K., et al. (2023). A principal odor map. Science. — smell as a learned embedding for mixing.

See also

  • MIXTURES — the gastronomical and olfactory instance, in depth. This generalization page is the companion abstraction.
  • OLFACTORY-SENSES — the smell substrate.
  • REFLECTIONS-AND-RECEIVERS — combinatorial codes in another modality (light).
  • ../ISOMORPHISM-ATLAS.md, ISO-36 (S569): mixing-kernel registered as candidate entry, 9 domains. Key finding recorded there: ΔS_mix = Shannon H is not analogy, it is the same mathematical object.
  • UNIVERSE-EVOLUTION-AS-COMPRESSION — mixing entropy as one face of universal compression.

Inspiration sources

  • The mixture expert's existing investigation MIXTURES.md (taste + smell) — the concrete case this page abstracts from.
  • The isomorphism atlas method: when ≥6 domains share a structure, write the structure, not the domain.
  • Boltzmann (entropy of mixing) and Shannon (entropy of a source) using the same -Σ pᵢ log pᵢ — the historical observation that mixing is an information-theoretic operation, not just a physical one.
  • Geoffrey Hinton's "product of experts" as the geometric counterpart to "mixture of experts" — two mixing kernels at the heart of modern ML.