Mixing — generalized¶
flowchart LR
parts[parts · pᵢ in space X] --> weights[weights · wᵢ on simplex]
weights --> kernel[mixing kernel K · linear · multiplicative · masking · super-additive]
parts --> carrier[carrier · medium · solvent · prior · base]
carrier --> kernel
kernel --> mixture[the mixture m = K(w, p)]
mixture --> readout[readout · taste · spectrum · sample · prediction]
- Mixing as kernel — the seam page — O-information, non-Euclidean kernels, formal grounding of escape-the-hull
- mixtures (taste & smell) — the gastronomical · olfactory instance
- reflections and receivers — combinatorial codes in light
- rate-distortion — mixing as lossy compression
Generalization · rating: medium. Companion to MIXTURES.md — abstracts the same operation across math, chemistry, fluids, signals, ML.
Status: seedling | 2026-05-13 | rating: medium Compress levels: L0 ↓ L1 ↓ L2
A mixture is what you get when two things stop being two things without quite becoming one. The interesting question is never how much of each — it is what the combination operation does to its inputs.
L0 — TL;DR (≤5 lines)¶
Across cuisine, perfume, chemistry, fluids, audio, color, probability, and
machine learning, mixing is the same operation with different kernels:
choose parts pᵢ in some space X, give them weights wᵢ ≥ 0 summing to 1,
and apply a kernel K(w, p) that may be linear (paint, probability,
convex hull), multiplicative / super-additive (umami × umami, catalysis,
constructive interference), masking / saturating (salt over bitter,
loudness compression, attention winner-take-all), or emergent (emulsion,
chord, alloy, mixture-of-experts gating). The same three knobs — weights,
kernel, carrier — explain why salt amplifies sweet, why hydrogen +
oxygen → water (a non-mixture), why GMMs fit data, and why MoE routing
can beat dense networks at fixed FLOPs.
L1 — Overview¶
Core claim¶
A surprising amount of "what goes with what" — and "what happens when we combine these" — is one structure played in different keys. The structure has three parts:
- Parts p₁ … pₙ: elements of a space X. X can be a flavor vector, a chemical species, a probability distribution, an image patch, a neural network expert.
- Weights w ∈ Δⁿ⁻¹: non-negative coefficients on the simplex (sum to 1). "How much of each part."
- Kernel K : Δⁿ⁻¹ × Xⁿ → X: the combination rule. The kernel decides whether the mixture is the arithmetic mean (linear), the geometric mean (multiplicative), the dominant element (winner-take-all), or something strictly outside the convex hull (synergistic / emergent).
A fourth knob shows up in physical mixtures: the carrier — the medium that holds the parts. Solvents in chemistry, fat in food, air in perfume, the prior in Bayesian mixtures, the residual stream in transformers. The carrier is "silent" but changes the effective kernel.
Why a generalization page¶
The mixture-expert investigation MIXTURES covers the taste and
smell instance in depth. It is one row in a table whose columns are: parts,
weights, kernel, carrier, readout, failure modes, design rules. The same
table fills out for chemistry, fluids, audio, light, probability, and neural
nets. The generalization is not just analogy — the same math (convex
combinations, log-linear pooling, masking inequalities) shows up in each row.
The recurring instances¶
flowchart LR
mix(("mixing<br/>the abstract operation"))
subgraph physical ["physical · matter"]
food["taste<br/>5+ axes"]
smell["smell<br/>~10D embedding"]
chem["chemistry<br/>moles · partial pressures"]
fluid["fluids<br/>turbulent stirring"]
end
subgraph signal ["signal · perception"]
light["color<br/>RGB · spectra"]
sound["audio<br/>waves · chords"]
end
subgraph abstract ["abstract · inference"]
prob["probability<br/>mixture models"]
ml["ML<br/>MoE · ensembling"]
opinion["social<br/>belief aggregation"]
rhetoric["language<br/>code-switch · style"]
end
mix --> physical
mix --> signal
mix --> abstract
The kernel zoo (one-line summaries)¶
| Kernel | Rule | Where it shows up |
|---|---|---|
| Linear (convex combination) | \(m = \sum_i w_i\, p_i\) | Paint mixing, alpha blending, Bayesian model averaging, baseline mix-of-X |
| Log-linear (geometric pooling) | \(\log m = \sum_i w_i \log p_i\) | Multiplicative effects, product-of-experts, mutual reinforcement |
| Multiplicative / synergistic | \(m > \max_i p_i\) | Umami×umami, catalysis, constructive interference, resonance |
| Masking / saturating | \(m \approx \max_i (w_i\, p_i)\) or softmax-like | Salt-over-bitter, attention heads, loudness, winner-take-all |
| Gated / routed | \(m = \sum_i g_i(x)\, p_i(x)\) with sparse \(g\) | Mixture-of-experts, conditional computation, conditional GMM |
| Stochastic mixture | sample \(i \sim w\) then output \(p_i\) | GMM data model, ensemble bagging, dropout |
| Reactive (non-mixture) | parts disappear into a new species | \(\mathrm{2\,H_2 + O_2 \to 2\,H_2O}\); sex (genetic recombination) |
| Emergent (phase) | mixture lives outside \(X\) itself | Emulsion, chord, alloy, foam, plasma |
| Time-resolved | \(m(t) = \sum_i w_i(t)\, p_i\) | Perfume top/heart/base, ADSR envelope, narrative pacing |
| Spatial | \(m(x) = \sum_i w_i(x)\, p_i\) | Texture synthesis, ecological mosaic, Gaussian splats |
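The first rows of the table can be made concrete with a minimal sketch, assuming scalar parts and weights on the simplex. The function names (`linear_kernel`, `log_linear_kernel`, `masking_kernel`) are illustrative, not part of any library, and the masking rule uses the plain max form from the table rather than a softmax.

```python
import math

def linear_kernel(w, p):
    """Convex combination: m = sum_i w_i * p_i."""
    return sum(wi * pi for wi, pi in zip(w, p))

def log_linear_kernel(w, p):
    """Geometric pooling: log m = sum_i w_i * log p_i (requires p_i > 0)."""
    return math.exp(sum(wi * math.log(pi) for wi, pi in zip(w, p)))

def masking_kernel(w, p):
    """Winner-take-all: m = max_i (w_i * p_i)."""
    return max(wi * pi for wi, pi in zip(w, p))

w = [0.5, 0.5]   # weights on the simplex
p = [1.0, 4.0]   # two scalar parts (illustrative values)

print(linear_kernel(w, p))      # arithmetic mean of 1 and 4
print(log_linear_kernel(w, p))  # geometric mean of 1 and 4
print(masking_kernel(w, p))     # the dominant weighted part
```

With the same weights and parts the three kernels already disagree: the arithmetic mean is 2.5 while the geometric mean is about 2, which is the AM-GM inequality seen as a choice of kernel.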
Why three knobs and not one¶
A single number (a weight) is enough only when the kernel is linear. As soon as the kernel is nonlinear, weight effects depend on the other parts. Salt at 1 g amplifies sweet only if sweet is present; iso-E-super "fixes" only if volatile top-notes are present; an expert in MoE fires only if the gate routes to it. So the design surface is at minimum 3D: what to put in, how much, what combination law applies.
L2 — Deep dive¶
1. Mathematical mixtures — the canonical math¶
The clean math case. Let \(X\) be a measurable space.
Mixture distribution:
\[ p(x) = \sum_{i=1}^{n} w_i\, p_i(x), \qquad w \in \Delta^{n-1} \]
A Gaussian mixture model (GMM) is the case \(p_i = \mathcal{N}(\mu_i, \Sigma_i)\). Mixture distributions are the canonical universal approximator for densities: any continuous density on a compact set can be approximated arbitrarily well by a finite GMM. The cost: identifiability is lost (the labels of components can be permuted), and the likelihood that EM climbs is non-convex, so EM finds only local optima.
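A one-dimensional sketch of a Gaussian mixture, assuming two components with made-up parameters. It evaluates the mixture density and draws from it by the "sample \(i \sim w\), then output \(p_i\)" rule. The helper names are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """p(x) = sum_i w_i * N(x | mu_i, sigma_i^2)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def sample_mixture(weights, mus, sigmas, rng):
    """Pick a component i with probability w_i, then draw from that component."""
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(mus[i], sigmas[i])

# Illustrative parameters, not taken from the text.
weights, mus, sigmas = [0.3, 0.7], [-2.0, 3.0], [0.5, 1.0]
rng = random.Random(0)
draws = [sample_mixture(weights, mus, sigmas, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean)  # should be near 0.3 * (-2) + 0.7 * 3 = 1.5
```

The sample mean lands near the weighted mean of the component means, but the density itself stays bimodal: a mixture averages moments without averaging shapes.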
Convex combination of vectors:
\[ m = \sum_{i=1}^{n} w_i\, p_i, \qquad w_i \ge 0,\ \sum_{i} w_i = 1 \]
The mixture lives inside the convex hull of the parts. Linear mixing cannot escape the hull. Every interesting mixing phenomenon is, in some sense, a way to escape the convex hull.
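Log-linear (geometric) pooling:
\[ m(x) = \frac{1}{Z} \prod_{i=1}^{n} p_i(x)^{w_i}, \qquad Z = \int \prod_{i=1}^{n} p_i(x)^{w_i}\, dx \]
Used in product of experts, where the exponents are all 1 rather than summing to 1. The pooled density is large only where every component is large, so its support is the intersection of the components' supports: the pool narrows rather than widens. For Gaussian components with unit exponents the result is sharper than any single component. This is the formal opposite of arithmetic mixing.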
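For one-dimensional Gaussian components the log-linear pool is again Gaussian, with precision \(\sum_i w_i / \sigma_i^2\), so the narrowing can be checked directly. The sketch below assumes scalar Gaussians with illustrative parameters and compares three cases: simplex weights, unit exponents (product of experts), and the arithmetic mixture.

```python
def pool_gaussians(weights, mus, sigmas):
    """Log-linear pool of 1-D Gaussians: precisions add, weighted by the exponents."""
    precision = sum(w / s ** 2 for w, s in zip(weights, sigmas))
    mean = sum(w * m / s ** 2 for w, m, s in zip(weights, mus, sigmas)) / precision
    return mean, precision ** -0.5

def arithmetic_mixture_std(weights, mus, sigmas):
    """Standard deviation of the arithmetic mixture sum_i w_i * N(mu_i, sigma_i^2)."""
    mean = sum(w * m for w, m in zip(weights, mus))
    second_moment = sum(w * (s ** 2 + m ** 2) for w, m, s in zip(weights, mus, sigmas))
    return (second_moment - mean ** 2) ** 0.5

mus, sigmas = [0.0, 2.0], [1.0, 1.0]   # illustrative components

geo_mean, geo_std = pool_gaussians([0.5, 0.5], mus, sigmas)   # simplex weights
poe_mean, poe_std = pool_gaussians([1.0, 1.0], mus, sigmas)   # product of experts
mix_std = arithmetic_mixture_std([0.5, 0.5], mus, sigmas)

print(geo_std, poe_std, mix_std)
```

In this example the arithmetic mixture is the widest of the three, the simplex-weighted pool keeps the component width, and the product of experts is narrower than either component.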
Mixture of experts (MoE):
\[ m(x) = \sum_{i=1}^{n} g_i(x)\, p_i(x), \qquad g(x) \in \Delta^{n-1} \]
The weights \(g_i(x)\) are themselves a function of the input. This makes MoE a gated mixture: different inputs get different mixes. Sparse MoE forces \(g\) to have only k nonzero entries (one-hot for k = 1), recovering hard routing and giving the FLOPs savings that make MoE practical at scale (Switch Transformer, GLaM, Mixtral).
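A minimal sketch of sparse top-k gating, assuming a linear router and scalar-valued experts. The router weights and expert functions are invented for illustration; production MoE layers add noise, capacity limits, and batching that are omitted here.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def top_k_gate(logits, k):
    """Keep the k largest logits, renormalise them with softmax, zero the rest."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    gate = [0.0] * len(logits)
    for i, pr in zip(top, probs):
        gate[i] = pr
    return gate

def moe(x, router, experts, k=2):
    """m(x) = sum_i g_i(x) * p_i(x), evaluating only experts with nonzero gate."""
    logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in router]
    gate = top_k_gate(logits, k)
    out = sum(g * f(x) for g, f in zip(gate, experts) if g > 0.0)
    return out, gate

# Hypothetical router rows and experts.
router = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
experts = [lambda x: x[0] + x[1], lambda x: x[0] - x[1], lambda x: 10.0 * x[0]]

out, gate = moe([1.0, 0.0], router, experts, k=2)
print(gate, out)
```

For the input `[1.0, 0.0]` the router logits are 2, 0, and 1, so the second expert is never evaluated. That skipped call is where the compute saving comes from.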
Wasserstein barycenter, the "mean" of distributions in the optimal-transport sense:
\[ \bar{p} = \arg\min_{q} \sum_{i=1}^{n} w_i\, W_2^2(q, p_i) \]
This is mixing in distribution space that respects geometry: interpolating between a unimodal distribution at 0 and a unimodal one at 1 produces a unimodal one at 0.5 (the Wasserstein mean), not the bimodal arithmetic mixture \(\tfrac{1}{2}(\delta_0 + \delta_1)\). These are two completely different "averages"; which one the application wants is the design question.
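In one dimension the Wasserstein barycenter has a closed form: average the quantile functions. For two samples of equal size this reduces to averaging sorted values. The sketch below assumes equal-size samples and narrow Gaussians around 0 and 1 (illustrative choices), and compares the barycenter with the arithmetic mixture.

```python
import random

def barycenter_1d(samples_a, samples_b, t=0.5):
    """1-D W2 barycenter of two equal-size samples: interpolate matching order statistics."""
    a, b = sorted(samples_a), sorted(samples_b)
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

def arithmetic_mixture(samples_a, samples_b, rng):
    """Equal-weight arithmetic mixture: each draw comes from a or from b at random."""
    return [rng.choice(pair) for pair in zip(samples_a, samples_b)]

def fraction_near(xs, center, radius):
    """Share of points within `radius` of `center`."""
    return sum(abs(x - center) < radius for x in xs) / len(xs)

rng = random.Random(0)
a = [rng.gauss(0.0, 0.1) for _ in range(5000)]
b = [rng.gauss(1.0, 0.1) for _ in range(5000)]

bary = barycenter_1d(a, b)
mix = arithmetic_mixture(a, b, rng)

print(fraction_near(bary, 0.5, 0.25))  # most of the barycenter mass sits near 0.5
print(fraction_near(mix, 0.5, 0.25))   # almost none of the mixture mass does
```

The barycenter moves the mass to the midpoint, while the arithmetic mixture leaves two lumps where the parts were.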
Bayesian model averaging:
\[ p(y \mid D) = \sum_{k} p(M_k \mid D)\, p(y \mid M_k, D) \]
The same form as a probability mixture, with the weights being the model posteriors \(p(M_k \mid D)\). Equivalent in form to ensemble averaging in ML: the interpretation differs but the math is identical.
2. Chemical mixtures — and the reactive non-mixture¶
A chemical mixture preserves the parts; a chemical reaction destroys them. The line between them is whether the kernel is conservative or transformative.
Conservative (true mixture):
- Heterogeneous: oil + water; visibly two phases. Kernel is approximately identity within each phase; surface area is the interaction.
- Homogeneous (solution): salt + water, alcohol + water. One phase; parts are molecularly mixed but recoverable (evaporate the water → get salt back).
- Colloid / emulsion: parts dispersed below visible scale but above molecular, for example mayonnaise (oil-in-water + emulsifier), milk, smoke, fog. Emulsions are mixtures stabilized by a third species (the emulsifier) whose molecules have both polar and nonpolar ends.
Transformative (reaction, not mixture):
\[ \mathrm{2\,H_2 + O_2 \to 2\,H_2O} \]
This is not a mixture in any algebraic sense. The kernel discards the inputs. But every reaction starts from a mixture (parts must be in contact), so mixing rate often bounds reaction rate (the diffusion limit).
Quantitative laws on conservative mixtures:
| Law | Rule | Interpretation |
|---|---|---|
| Dalton | total pressure = Σ partial pressures | gas mixture is additive in pressure |
| Raoult | partial vapor pressure = mole fraction × pure vapor pressure | ideal liquid mixture is linear in mole fraction |
| Henry | dissolved concentration = kH × partial pressure | gas-in-liquid mixing is linear at low concentration |
| Beer–Lambert | absorbance = Σ εᵢ cᵢ ℓ | spectra of mixtures are linear in component spectra (if no interaction) |
| Ideal mixing entropy | ΔS_mix = -R Σ xᵢ ln xᵢ | entropy gain is the Shannon entropy of the composition (in nats, scaled by R), the same formula as in information theory |
The last one is striking: the entropy of mixing in chemistry is
information-theoretic entropy in coding. The same -Σ pᵢ log pᵢ describes a
gas of N molecules in two compartments and a fair-coin source coded for a
channel. Mixtures are information-theoretic objects.
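The identity can be checked numerically. This sketch assumes an ideal mixture and uses the molar gas constant in J/(mol·K). The function names are illustrative.

```python
import math

R = 8.314462618  # molar gas constant, J/(mol K)

def shannon_entropy_nats(x):
    """H(x) = -sum_i x_i ln x_i, skipping zero entries."""
    return -sum(xi * math.log(xi) for xi in x if xi > 0)

def ideal_mixing_entropy(x):
    """Molar entropy of ideal mixing: dS_mix = -R * sum_i x_i ln x_i = R * H(x)."""
    return R * shannon_entropy_nats(x)

equimolar = [0.5, 0.5]
print(ideal_mixing_entropy(equimolar))                 # R ln 2, in J/(mol K)
print(shannon_entropy_nats(equimolar) / math.log(2))   # the same composition in bits
```

An equimolar binary mixture gains R ln 2 per mole, and the same composition read as a source distribution carries exactly one bit per symbol. The only difference is the unit conversion.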
Non-ideal corrections (where chemistry stops being linear):
- Activity coefficients (Lewis): when molecules of A and B prefer each
other (or repel), aᵢ = γᵢ xᵢ with γ ≠ 1. Salt water is non-ideal
enough to need γ to predict its boiling point.
- Excess Gibbs energy G^E = G_mix − G_mix^ideal quantifies deviation;
positive G^E → "they don't like mixing" → eventually phase separation.
3. Fluids — stirring vs mixing vs diffusing¶
Three distinct operations, often conflated.
- Stirring: macroscopic transport. Folding and stretching of fluid parcels by the velocity field. Eddies. Increases the interfacial area between the parts without changing the molecular composition along a parcel. Like shuffling cards: no card touches another, but their arrangement becomes mixed.
- Diffusion: microscopic molecular spread. Fick's law J = -D ∇c. Slow on macroscopic scales: the diffusion time is ~L²/D, which for a small solute in water (D ~ 10⁻⁹ m²/s) is about a day across a centimeter of unstirred tea and decades across a meter of water.
- Mixing: the joint effect. Stirring multiplies interfacial area so diffusion can finish the job in finite time. The Batchelor scale η_B = (νD²/ε)^¼ is where the two regimes meet: below it, diffusion dominates; above it, stirring dominates.
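The two formulas in the list can be put side by side with rough numbers. The values below are order-of-magnitude assumptions (sucrose-like diffusivity in water, water's kinematic viscosity, a guessed dissipation rate for a stirred cup), not measurements.

```python
def diffusion_time(length_m, diffusivity_m2_s):
    """Characteristic diffusion time t ~ L^2 / D, in seconds."""
    return length_m ** 2 / diffusivity_m2_s

def batchelor_scale(nu, diffusivity, eps):
    """Batchelor scale eta_B = (nu * D^2 / eps)^(1/4), in metres."""
    return (nu * diffusivity ** 2 / eps) ** 0.25

D_SUGAR = 5e-10    # m^2/s, assumed diffusivity of a sugar-like solute in water
NU_WATER = 1e-6    # m^2/s, kinematic viscosity of water near room temperature
EPS_STIRRED = 1e-2 # W/kg, assumed dissipation rate in a stirred cup

t_cup = diffusion_time(0.01, D_SUGAR)                # across 1 cm with no stirring
eta_b = batchelor_scale(NU_WATER, D_SUGAR, EPS_STIRRED)
t_eta = diffusion_time(eta_b, D_SUGAR)               # across one Batchelor scale

print(t_cup / 3600.0)  # hours
print(eta_b * 1e6)     # micrometres
print(t_eta)           # seconds
```

Under these assumptions diffusion alone needs on the order of days to cross a centimeter, while stirring thins the striations to a few micrometers that diffusion crosses in a fraction of a second.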
This explains a kitchen observation: a cold milk stream in cold coffee
makes "layers" that don't mix; the same milk stream in hot coffee
disperses quickly. Higher temperature raises D (Stokes-Einstein, D ~
T/μ) and lowers viscosity, so the same pour tends to stir more
vigorously (larger ε) and diffusion erases the resulting striations sooner.
Turbulent mixing: in turbulence, the energy cascade from large to small
eddies (E(k) ~ k^{-5/3}, Kolmogorov) is the same machinery that drives
mixing. In stratified flows the mixing efficiency Γ = B/ε (buoyancy flux
over dissipation) measures how much of the stirring work ends up as
irreversible molecular mixing rather than just heat. The commonly quoted
value is Γ ~ 0.2, so most of the work is dissipated.
4. Audio and color — interference and perceptual mixing¶
Sound mixing is waveform addition: at a point in space, pressures sum. The kernel is genuinely linear in the medium (air is approximately linear at normal amplitudes). But what the listener hears is not a simple sum:
- Constructive / destructive interference: two equal in-phase waves of the same frequency add amplitudes (2× amplitude, 4× intensity); out-of-phase waves cancel.
- Beats: two close frequencies f₁ ≈ f₂ produce amplitude modulation at |f₁ − f₂|, an envelope not present in either source.
- Consonance / dissonance: small-integer frequency ratios (2:1, 3:2, 4:3) are heard as consonant; near-misses of those ratios as dissonant. The cochlear basilar membrane resolves frequencies; closely spaced frequencies excite overlapping regions → roughness. This is Helmholtz's 1863 explanation.
- Masking: a loud tone at f suppresses perception of soft tones near f (within about one critical band). This is the same masking inequality as salt over bitter: perception of part b is reduced by the presence of part a even though a does not "consume" b. MP3 and AAC audio compression exploit this: the encoder discards masked components.
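The beat envelope in the list above follows from the sum-to-product identity sin a + sin b = 2 cos((a − b)/2) sin((a + b)/2). A short numerical check, assuming two unit-amplitude tones at illustrative frequencies of 440 Hz and 444 Hz:

```python
import math

def two_tone(t, f1, f2):
    """Linear superposition of two unit-amplitude sine tones."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def envelope_times_carrier(t, f1, f2):
    """Same signal written as a slow envelope times a fast carrier."""
    envelope = 2 * math.cos(math.pi * (f1 - f2) * t)   # loudness repeats |f1 - f2| times per second
    carrier = math.sin(math.pi * (f1 + f2) * t)        # oscillates at the mean frequency
    return envelope * carrier

f1, f2 = 440.0, 444.0
max_gap = max(
    abs(two_tone(i / 10000.0, f1, f2) - envelope_times_carrier(i / 10000.0, f1, f2))
    for i in range(1000)
)
print(max_gap)  # tiny: the two forms agree up to floating-point error
```

The sum contains no new frequency component. The 4 Hz beat is the envelope of a purely linear superposition, which the ear then reads as loudness fluctuation.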
Color mixing:
- Additive (light, RGB): wavelengths add. Linear in the physics; perceptual
kernel via three cone types (L, M, S) gives the CIE chromaticity diagram.
The gamut of n primaries is the convex hull of those primaries in CIE
xy — a convex combination in a 2D projection.
- Subtractive (pigment, CMY): each pigment absorbs part of the spectrum
reflected back. The kernel is multiplicative in reflectance, equivalently
additive in negative-log-reflectance. So "mixing paints" is a log-linear
pool, not the additive pool that mixing lights is. This is why mixing
many paints converges to muddy brown (the reflectance product → 0 in all
bands), while mixing many lights converges to white.
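The two colour kernels can be contrasted in a three-band toy model. This is a sketch assuming idealised red/green/blue bands and made-up reflectance values, with pigments treated as stacked filters whose reflectances multiply. Real pigment mixing is more involved.

```python
import math

def mix_lights(spectra):
    """Additive kernel: emitted power sums band by band (clipped at 1 for display)."""
    return [min(1.0, sum(band)) for band in zip(*spectra)]

def mix_paints(reflectances):
    """Subtractive kernel: reflectances multiply band by band."""
    return [math.prod(band) for band in zip(*reflectances)]

def absorbance(reflectance):
    """Negative log reflectance, the quantity that adds under paint mixing."""
    return [-math.log(r) for r in reflectance]

red, green, blue = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
cyan, magenta, yellow = [0.1, 0.9, 0.9], [0.9, 0.1, 0.9], [0.9, 0.9, 0.1]

white = mix_lights([red, green, blue])
mud = mix_paints([cyan, magenta, yellow])

print(white)  # every band at full power
print(mud)    # every band pushed toward zero
```

Adding lights fills every band, while stacking pigments drains every band, and the absorbances of the pigments simply add.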
A clean conceptual point: light mixing is the canonical linear kernel; paint mixing is the canonical log-linear kernel; the difference is not "art" — it is two different physical operations.
5. Probability and machine learning — mixing as inference¶
ML uses mixing in at least four distinct roles.
| Role | Form | Example |
|---|---|---|
| Model | data ~ Σ wᵢ pᵢ | Gaussian mixture model, hidden Markov model emission, topic model |
| Ensemble | prediction = Σ wᵢ ŷᵢ | Bagging, stacking, model soup, BMA |
| Capacity | computation = Σ gᵢ(x) fᵢ(x) | Mixture of experts, switch transformer, conditional computation |
| Regularizer | training ~ mix two losses or two inputs | Mixup augmentation; objective mixing (L_CE + λ L_KL) |
Mixture of experts deserves its own paragraph, since the "mixture expert"
work this page accompanies likely connects here. MoE replaces a dense
feedforward f(x) = W · x with Σᵢ gᵢ(x) · fᵢ(x) where gᵢ is a learned
gating function. Sparse routing (top-k with k = 1 or 2) means each input
activates only a few experts: total parameters grow linearly with the
number of experts while per-input compute stays roughly constant. The
kernel is gated linear: linear in the expert outputs given the gate, but
the gate is non-linear in x. With linear experts this makes the effective
kernel piecewise-linear with input-dependent regions: a partition of input
space into expert specializations, glued at the boundaries by the gate.
Failure modes peculiar to MoE:
- Expert collapse: gate routes everything to one expert; others are dead.
Mitigated by auxiliary load-balancing losses.
- Capacity overflow: more inputs routed to expert i than its capacity →
drop / overflow. The "token drop" rate is a key reliability metric.
- Gate-expert miscoordination: experts diverge in training; gate's
routing is on stale assumptions. Slow-moving gate or expert EMA helps.
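Two of these failure modes can be monitored with a few lines. The sketch below assumes hard top-1 routing and a fixed per-expert capacity. The balance term follows the form of the Switch Transformer auxiliary loss (N · Σ fᵢ · Pᵢ, without its scaling coefficient). Function names and sample data are illustrative.

```python
from collections import Counter

def token_drop_rate(assignments, num_experts, capacity):
    """Fraction of tokens routed to an expert that is already at capacity."""
    counts = Counter(assignments)
    dropped = sum(max(0, counts.get(e, 0) - capacity) for e in range(num_experts))
    return dropped / len(assignments)

def load_balance_term(assignments, router_probs, num_experts):
    """N * sum_i f_i * P_i: f_i = share of tokens sent to expert i,
    P_i = mean router probability for expert i. Equals 1 under uniform routing."""
    n_tokens = len(assignments)
    counts = Counter(assignments)
    f = [counts.get(e, 0) / n_tokens for e in range(num_experts)]
    p = [sum(row[e] for row in router_probs) / n_tokens for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Six tokens, three experts, capacity two: expert 0 receives four tokens.
drop = token_drop_rate([0, 0, 0, 1, 2, 0], num_experts=3, capacity=2)

uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
balanced = load_balance_term([0, 1, 2], uniform, 3)
collapsed = load_balance_term([0, 0, 0], [[1.0, 0.0, 0.0]] * 3, 3)

print(drop, balanced, collapsed)
```

The balance term sits at its minimum of 1 when routing is uniform and rises to the number of experts under full collapse, which is why adding it to the training loss pushes the gate away from the single-expert failure.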
Mixup augmentation is the simplest possible mixing for generalization:
\[ \tilde{x} = \lambda x_i + (1-\lambda)\, x_j, \qquad \tilde{y} = \lambda y_i + (1-\lambda)\, y_j, \qquad \lambda \sim \mathrm{Beta}(\alpha, \alpha) \]
A linear mixture in input space + a linear mixture in label space acts as an effective regularizer because the model is trained to behave linearly between data points, suppressing high-curvature decision boundaries. The empirical observation: mixup improves generalization on many tasks despite being almost embarrassingly simple. This is the cleanest case of "the kernel is the regularizer."
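A minimal sketch of one mixup step on a single pair of examples, assuming list-valued inputs and one-hot labels. It follows the Beta(α, α) sampling of Zhang et al. (2018). The function name and the sample data are illustrative.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, alpha, rng):
    """Return a convex combination of two examples and of their labels."""
    lam = rng.betavariate(alpha, alpha)   # mixing weight in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam

rng = random.Random(0)
x_a, y_a = [0.0, 1.0, 2.0], [1.0, 0.0]   # example from class 0
x_b, y_b = [4.0, 1.0, 0.0], [0.0, 1.0]   # example from class 1

x_mix, y_mix, lam = mixup_pair(x_a, y_a, x_b, y_b, alpha=0.4, rng=rng)
print(lam, x_mix, y_mix)
```

The mixed label is a soft target whose entries still sum to 1, so the usual cross-entropy loss applies unchanged.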
6. Social / linguistic mixing¶
The same skeleton — parts, weights, kernel — keeps applying.
Opinion aggregation (DeGroot): xᵢ(t+1) = Σⱼ Wᵢⱼ xⱼ(t). Each agent
updates as a linear mixture of neighbors' opinions. On a strongly
connected influence graph, opinions converge to consensus iff the graph
is aperiodic. The Friedkin-Johnsen extension adds a "stubbornness" term
λᵢ xᵢ(0), keeping the agent partly anchored to their original opinion:
exactly the prior carrier role in Bayesian mixing.
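Both update rules fit in a few lines. The sketch below assumes a three-agent row-stochastic influence matrix with self-loops (so the graph is strongly connected and aperiodic) and writes the Friedkin-Johnsen step in the common form (1 − λᵢ) Σⱼ Wᵢⱼ xⱼ(t) + λᵢ xᵢ(0). The (1 − λᵢ) factor and all numbers are assumptions for illustration.

```python
def degroot_step(W, x):
    """x_i(t+1) = sum_j W_ij * x_j(t)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def friedkin_johnsen_step(W, x, x0, lam):
    """Blend the social average with each agent's initial opinion."""
    mixed = degroot_step(W, x)
    return [(1 - l) * m + l * a for l, m, a in zip(lam, mixed, x0)]

W = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
x0 = [0.0, 0.5, 1.0]

consensus = list(x0)
anchored = list(x0)
for _ in range(100):
    consensus = degroot_step(W, consensus)
    anchored = friedkin_johnsen_step(W, anchored, x0, lam=[0.5, 0.5, 0.5])

print(consensus)  # all three agents end at the same value
print(anchored)   # opinions stay spread out around the initial ones
```

Pure DeGroot mixing collapses the opinions to a single point inside the hull of the initial opinions, while the stubbornness term keeps a persistent spread.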
Code-switching (linguistics): bilingual speakers mix grammatical elements from two languages at well-defined junctures (Poplack's "equivalence constraint" and "free morpheme constraint"). The kernel has a structure: not every mixing is grammatical. Code-switching is a constrained mixture in a grammar-aware kernel.
Cultural fusion in cuisine (already covered in MIXTURES.md §5): cuisines emerge as locally stable kernels selecting compatible part sets.
7. The "escape the convex hull" question¶
The single sharpest design question across all these mixings:
When does the mixture produce something outside the space of its parts?
| Domain | "Inside the hull" mixture | "Outside the hull" mixture |
|---|---|---|
| Math | Convex combination | Geometric mean, Wasserstein barycenter, sample-then-output |
| Taste | Two sweet things | Umami × umami (8×), salt + bitter (suppression), browning (new compounds) |
| Smell | Two musks (saturate) | A perfume accord that "smells like neither" |
| Chemistry | Salt + water | Hydrogen + oxygen → water; emulsion; alloy |
| Color (light) | Two reds | RGB → white (looks like none of the parts) |
| Color (paint) | Two greens | Many pigments → mud |
| Sound | Two unisons | Chord triad (Gestalt object); beats; harmony |
| Probability | BMA at fixed prior | Product-of-experts (sharper); mixture of categoricals (multimodal) |
| ML | Soup of fine-tunes | MoE with specialization; mixup-regularized model |
| Society | DeGroot consensus | Polarization (anti-mixing); cultural fusion (new genre) |
The "outside the hull" cases are where mixing is interesting rather than averaging. The rest of design — kitchen, perfumery, ML architecture, opinion dynamics — is about engineering kernels that escape the hull in useful directions and avoid the hull-escape failure modes (mud, dissonance, mode collapse, polarization).
8. Failure modes — a unified table¶
| Failure | Mechanism | Domain examples |
|---|---|---|
| Muddled middle | Too many small weights on too many parts | 7+ herbs in a dish; 30+ raw materials in a perfume; full ensemble averaging without selection |
| Single-axis maxed | One weight = 1, others = 0 | Pure heat without sweet; mono-instrument mix; single-expert MoE collapse |
| Carrier mismatch | Parts can't reach the kernel | Fat-soluble aroma in water dish; oil-based pigment on wet plaster; tokens that miss their expert |
| Saturation / masking | Too much of one part hides others | Loud bass masks treble; loud expert masks gate signal; over-spiced food |
| Time misalignment | Parts arrive at different times | Top-note vanishes before base note appears; ensemble member trained on stale data |
| Cultural / prior mismatch | Same chemistry, wrong context | Anchovy + chocolate works in Italy, weird in Japan; same Bayesian update with wrong prior |
| Catastrophic interaction | Kernel destroys parts | Reactive: bleach + ammonia; bad: two musks → null perception |
These map row-by-row across the table. The "design rule" for each is the same in form: adjust weights, change carrier, replace one part, time-shift, or change the kernel.
9. Concrete cross-domain recipes for the swarm¶
Where this generalization buys something practical:
- A pairing finder is a kernel + distance. Foodpairing.com (high shared-volatile pairs well in Western cuisine) and embedding-distance pairing (Open POM) are both "near-neighbor in some space" — they just choose different spaces. The same algorithm runs on perfume notes, wine pairings, music recommendations.
- Mixup-as-regularizer generalizes. Any system that should be linear between two anchor examples benefits from in-space mixing during training. Cuisine students learning by tasting "halfway between dish A and dish B" is the human version.
- Carrier choice is dominant in non-linear kernels. In food: pick fat. In ML: pick the input encoding. In perfume: pick the fixative. In inference: pick the prior. Carriers don't appear in the headline recipe but determine whether the recipe works.
- Escape-the-hull design. To make something new, choose parts with genuine kernel synergy (umami × umami; harmonic 3:2; emulsifier + immiscible parts; experts with disjoint specialties). Linear mixing gives you only points inside.
10. Open questions¶
- Is there a unified algebra of "good" mixtures? Across domains, "good" often involves a small number of dominant components (~3) plus carriers + a contrast accent. Why 3? Working-memory? Information-theoretic capacity of the readout?
- When does the kernel itself depend on input? MoE makes the kernel input-dependent; cuisine grammars do the same (acid is mandatory in some cuisines, optional in others, conditional on dish). A theory of conditional kernels would unify these.
- Wasserstein-style "geometric" mixing of perceptual spaces — does an optimal-transport mean of two smells smell more like an actual smell than the arithmetic mixture? Open POM allows this experiment.
- Mixing entropy as the universal scoring function: −Σ pᵢ log pᵢ shows up in chemistry, info theory, ecology (Shannon diversity index), political polarization measures. Is "interesting mixture" = "high mixing entropy after subtracting predictable variance"?
- Reactive mixing in social systems: what's the social analogue of 2H₂ + O₂ → 2H₂O? Movements that destroy their parts to produce a new collective species? (Schelling on tipping; Granovetter on threshold models.)
References¶
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning — Chapter 9 on mixture models, EM, and Gaussian mixtures.
- Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation. — product-of-experts as log-linear pool.
- Shazeer, N., et al. (2017). Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. ICLR.
- Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. JMLR.
- Zhang, H., et al. (2018). mixup: beyond empirical risk minimization. ICLR.
- Cuturi, M. (2013). Sinkhorn distances: lightspeed computation of optimal transport. NeurIPS. — Wasserstein barycenters.
- Helmholtz, H. von (1863). On the Sensations of Tone — consonance, beats, masking.
- Wyszecki, G., & Stiles, W. S. (2000). Color Science. — CIE chromaticity and additive/subtractive mixing.
- Tennekes, H., & Lumley, J. L. (1972). A First Course in Turbulence. — turbulent mixing, Batchelor scale.
- Atkins, P., & de Paula, J. (2014). Physical Chemistry. — Dalton, Raoult, Henry, mixing entropy.
- DeGroot, M. H. (1974). Reaching a consensus. JASA. — linear opinion aggregation.
- Friedkin, N. E., & Johnsen, E. C. (1999). Social influence networks and opinion change. Advances in Group Processes.
- Ahn, Y.-Y., et al. (2011). Flavor network and the principles of food pairing. Scientific Reports. — cuisine as a kernel.
- Lee, B. K., et al. (2023). A principal odor map. Science. — smell as a learned embedding for mixing.
See also¶
- MIXTURES: the gastronomical and olfactory instance, in depth. This generalization page is the companion abstraction.
- OLFACTORY-SENSES: the smell substrate.
- REFLECTIONS-AND-RECEIVERS: combinatorial codes in another modality (light).
- ../ISOMORPHISM-ATLAS.md: ISO-36 (S569): mixing-kernel registered as candidate entry, 9 domains. Key finding recorded there: ΔS_mix = Shannon H is not analogy; it is the same mathematical object.
- UNIVERSE-EVOLUTION-AS-COMPRESSION: mixing entropy as one face of universal compression.
Inspiration sources¶
- The mixture expert's existing investigation MIXTURES.md (taste + smell): the concrete case this page abstracts from.
- The isomorphism atlas method: when ≥6 domains share a structure, write the structure, not the domain.
- Boltzmann (entropy of mixing) and Shannon (entropy of a source) using the same -Σ pᵢ log pᵢ: the historical observation that mixing is an information-theoretic operation, not just a physical one.
- Geoffrey Hinton's "product of experts" as the geometric counterpart to "mixture of experts": two mixing kernels at the heart of modern ML.