Mixing — generalized¶

Mixing is one operation wearing many costumes. A mixture is a weighted combination of *parts* in some space, evaluated by a *kernel* that decides how the parts interact. Across taste, smell, chemistry, fluids, audio, color, probability, and machine learning the same three knobs recur: **weights** (how much of each), **kernel** (additive · multiplicative · super-additive · masking), and **carrier** (the medium the parts live in). When the kernel is linear the math is convex combination; when it is nonlinear you get synergy, antagonism, masking, emulsions, beats, dissonance, mode collapse — the interesting phenomena.

🌱 seedling tended 2026-05-13 research mixtures generalization isomorphism mathematics chemistry perception machine-learning

flowchart LR
  parts[parts · pᵢ in space X] --> weights[weights · wᵢ on simplex]
  weights --> kernel[mixing kernel K · linear · multiplicative · masking · super-additive]
  parts --> carrier[carrier · medium · solvent · prior · base]
  carrier --> kernel
  kernel --> mixture[the mixture m = K(w, p)]
  mixture --> readout[readout · taste · spectrum · sample · prediction]

L0 — TL;DR (≤5 lines)¶

Across cuisine, perfume, chemistry, fluids, audio, color, probability, and machine learning, mixing is the same operation with different kernels: choose parts pᵢ in some space X, give them weights wᵢ ≥ 0 summing to 1, and apply a kernel K(w, p). The kernel may be linear (additive light, probability, convex hull), multiplicative / super-additive (umami × umami, catalysis, constructive interference), masking / saturating (salt over bitter, loudness compression, attention winner-take-all), or emergent (emulsion, chord, alloy, mixture-of-experts gating). The same three knobs (weights, kernel, carrier) explain why salt amplifies sweet, why hydrogen + oxygen → water is a reaction rather than a mixture, why GMMs fit data, and why MoE routing can beat dense networks at matched FLOPs per token.

L1 — Overview¶

Core claim¶

A surprising amount of "what goes with what" — and "what happens when we combine these" — is one structure played in different keys. The structure has three parts:

Parts p₁ … pₙ — elements of a space X. X can be a flavor vector, a chemical species, a probability distribution, an image patch, a neural network expert.
Weights w ∈ Δⁿ⁻¹ — non-negative coefficients on the simplex (sum to 1). "How much of each part."
Kernel K : Δⁿ⁻¹ × Xⁿ → X — the combination rule. The kernel decides whether the mixture is the arithmetic mean (linear), the geometric mean (multiplicative), the dominant element (winner-take-all), or something strictly outside the convex hull (synergistic / emergent).

A fourth knob shows up in physical mixtures: the carrier — the medium that holds the parts. Solvents in chemistry, fat in food, air in perfume, the prior in Bayesian mixtures, the residual stream in transformers. The carrier is "silent" but changes the effective kernel.

Why a generalization page¶

The mixture-expert investigation MIXTURES covers the taste and smell instance in depth. It is one row in a table whose columns are: parts, weights, kernel, carrier, readout, failure modes, design rules. The same table fills out for chemistry, fluids, audio, light, probability, and neural nets. The generalization is not just analogy — the same math (convex combinations, log-linear pooling, masking inequalities) shows up in each row.

The recurring instances¶

flowchart LR
  mix(("mixing<br/>the abstract operation"))
  subgraph physical ["physical · matter"]
    food["taste<br/>5+ axes"]
    smell["smell<br/>~10D embedding"]
    chem["chemistry<br/>moles · partial pressures"]
    fluid["fluids<br/>turbulent stirring"]
  end
  subgraph signal ["signal · perception"]
    light["color<br/>RGB · spectra"]
    sound["audio<br/>waves · chords"]
  end
  subgraph abstract ["abstract · inference"]
    prob["probability<br/>mixture models"]
    ml["ML<br/>MoE · ensembling"]
    opinion["social<br/>belief aggregation"]
    rhetoric["language<br/>code-switch · style"]
  end
  mix --> physical
  mix --> signal
  mix --> abstract

The kernel zoo (one-line summaries)¶

Kernel	Rule	Where it shows up
Linear (convex combination)	\(m = \sum_i w_i\, p_i\)	Additive light mixing, alpha blending, Bayesian model averaging, baseline mix-of-X
Log-linear (geometric pooling)	\(\log m = \sum_i w_i \log p_i\)	Multiplicative effects, idealized pigment mixing, product-of-experts, mutual reinforcement
Multiplicative / synergistic	\(m > \max_i p_i\)	Umami×umami, catalysis, constructive interference, resonance
Masking / saturating	\(m \approx \max_i (w_i\, p_i)\) or softmax-like	Salt-over-bitter, attention heads, loudness, winner-take-all
Gated / routed	\(m = \sum_i g_i(x)\, p_i(x)\) with sparse \(g\)	Mixture-of-experts, conditional computation, conditional GMM
Stochastic mixture	sample \(i \sim w\) then output \(p_i\)	GMM data model, ensemble bagging, dropout
Reactive (non-mixture)	parts disappear into a new species	\(\mathrm{2\,H_2 + O_2 \to 2\,H_2O}\); sex (genetic recombination)
Emergent (phase)	mixture lives outside \(X\) itself	Emulsion, chord, alloy, foam, plasma
Time-resolved	\(m(t) = \sum_i w_i(t)\, p_i\)	Perfume top/heart/base, ADSR envelope, narrative pacing
Spatial	\(m(x) = \sum_i w_i(x)\, p_i\)	Texture synthesis, ecological mosaic, Gaussian splats
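
The first few rows of the table can be made concrete with a minimal sketch. It assumes parts are plain lists of positive numbers and that each kernel acts componentwise; the function names are illustrative, not taken from any library.

```python
import math

def linear_mix(weights, parts):
    """Convex combination: sum_i w_i * p_i, componentwise."""
    dim = len(parts[0])
    return [sum(w * p[d] for w, p in zip(weights, parts)) for d in range(dim)]

def log_linear_mix(weights, parts):
    """Weighted geometric mean: exp(sum_i w_i * log p_i), componentwise.
    Requires strictly positive parts."""
    dim = len(parts[0])
    return [math.exp(sum(w * math.log(p[d]) for w, p in zip(weights, parts)))
            for d in range(dim)]

def masking_mix(weights, parts):
    """Winner-take-all: max_i (w_i * p_i), componentwise."""
    dim = len(parts[0])
    return [max(w * p[d] for w, p in zip(weights, parts)) for d in range(dim)]

weights = [0.5, 0.5]
parts = [[1.0, 4.0], [4.0, 1.0]]

linear = linear_mix(weights, parts)       # [2.5, 2.5]
pooled = log_linear_mix(weights, parts)   # approximately [2.0, 2.0]
masked = masking_mix(weights, parts)      # [2.0, 2.0]
print(linear, pooled, masked)
```

Same parts, same weights, three different mixtures: the kernel is a separate design choice from the recipe.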

Why three knobs and not one¶

A single number (a weight) is enough only when the kernel is linear. As soon as the kernel is nonlinear, weight effects depend on the other parts. Salt at 1 g amplifies sweet only if sweet is present; iso-E-super "fixes" only if volatile top-notes are present; an expert in MoE fires only if the gate routes to it. So the design surface is at minimum 3D: what to put in, how much, what combination law applies.

L2 — Deep dive¶

1. Mathematical mixtures — the canonical math¶

The clean math case. Let \(X\) be a measurable space.

Mixture distribution:

\[p(x) = \sum_{i} w_i\, p_i(x), \qquad w_i \geq 0, \quad \sum_i w_i = 1\]

A Gaussian mixture model (GMM) is \(p_i = \mathcal{N}(\mu_i, \Sigma_i)\). Mixture distributions are the canonical universal approximator for densities: any continuous density on a compact set can be approximated arbitrarily well by a finite GMM. The cost is twofold. Parameters are identifiable only up to a permutation of the component labels, and the likelihood is non-convex, so EM converges to a local optimum that depends on initialization.
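
A small one-dimensional sketch of the mixture density and of the "sample \(i \sim w\), then output \(p_i\)" reading. The weights, means, and standard deviations are made-up illustration values.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """p(x) = sum_i w_i * N(x; mu_i, sigma_i)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def sample_mixture(rng, weights, mus, sigmas):
    """Pick a component index with probability w_i, then sample from it."""
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(mus[i], sigmas[i])

weights, mus, sigmas = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5]  # illustrative values
rng = random.Random(0)
samples = [sample_mixture(rng, weights, mus, sigmas) for _ in range(20000)]

sample_mean = sum(samples) / len(samples)
mixture_mean = sum(w * m for w, m in zip(weights, mus))  # 0.3*(-2) + 0.7*3 = 1.5

# Label permutation leaves the density unchanged (the identifiability caveat).
same_density = math.isclose(
    mixture_pdf(0.4, weights, mus, sigmas),
    mixture_pdf(0.4, weights[::-1], mus[::-1], sigmas[::-1]),
)
print(sample_mean, mixture_mean, same_density)
```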

Convex combination of vectors:

\[m = \sum_i w_i\, v_i \;\in\; \mathrm{conv}(v_1, \ldots, v_n)\]

The mixture lives inside the convex hull of the parts. Linear mixing cannot escape the hull. Every interesting mixing phenomenon is, in some sense, a way to escape the convex hull.

Log-linear (geometric) pooling:

\[m(x) \;\propto\; \prod_i p_i(x)^{w_i}\]

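Used in product of experts. The pool concentrates mass where all components agree: its support is the intersection of the component supports, whereas the arithmetic mixture's support is their union. With exponents that sum to 1, a pool of Gaussians has a precision equal to the weighted average of the component precisions; with unit exponents, as in Hinton's product of experts, the precisions add and the result is sharper than any single component. In this sense it is the formal opposite of arithmetic mixing.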
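
For one-dimensional Gaussians both pools have closed-form moments, so the contrast can be checked by hand. The sketch below uses the standard result that \(\prod_i \mathcal{N}(x;\mu_i,\sigma_i^2)^{a_i}\) is proportional to a Gaussian with precision \(\sum_i a_i/\sigma_i^2\); the two components are illustrative.

```python
def arithmetic_mixture_moments(weights, mus, variances):
    """Mean and variance of sum_i w_i N(mu_i, var_i)."""
    mean = sum(w * m for w, m in zip(weights, mus))
    second = sum(w * (v + m * m) for w, m, v in zip(weights, mus, variances))
    return mean, second - mean * mean

def log_linear_pool_moments(exponents, mus, variances):
    """Mean and variance of the normalized product prod_i N(mu_i, var_i)^a_i."""
    precision = sum(a / v for a, v in zip(exponents, variances))
    mean = sum(a * m / v for a, m, v in zip(exponents, mus, variances)) / precision
    return mean, 1.0 / precision

mus, variances = [-1.0, 1.0], [1.0, 1.0]

mix = arithmetic_mixture_moments([0.5, 0.5], mus, variances)   # (0.0, 2.0): wider
pool = log_linear_pool_moments([0.5, 0.5], mus, variances)     # (0.0, 1.0)
poe = log_linear_pool_moments([1.0, 1.0], mus, variances)      # (0.0, 0.5): sharper
print(mix, pool, poe)
```

The arithmetic mixture widens (variance 2), the simplex-weighted pool keeps the component width (variance 1), and the unit-exponent product narrows (variance 0.5).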

Mixture of experts (MoE):

\[y(x) = \sum_i g_i(x)\, f_i(x), \qquad g_i(x) = \mathrm{softmax}(W_g\, x)_i\]

The weights \(g_i(x)\) are themselves a function of the input. This makes MoE a gated mixture: different inputs get different mixes. Sparse MoE keeps only the top-k gates per input (k = 1 in Switch Transformer, k = 2 in GLaM and Mixtral) and zeroes the rest, which approximates hard routing and gives the FLOPs savings that make MoE practical at scale.
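
A toy sketch of top-k gating on a scalar input. The experts are arbitrary stand-in functions and `gate_logits` stands in for \(W_g x\); renormalizing the kept gates after the softmax is one common convention and is an assumption here, since implementations differ on the order of softmax and top-k.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Keep the k largest gate probabilities, zero the rest, renormalize."""
    probs = softmax(logits)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

def moe_output(x, gate_logits, experts, k):
    gates = top_k_gate(gate_logits, k)
    # Experts with a zero gate are never evaluated: this is the compute saving.
    return sum(g * f(x) for g, f in zip(gates, experts) if g > 0.0)

# Hypothetical experts, for illustration only.
experts = [lambda x: 2.0 * x, lambda x: x + 10.0, lambda x: -x, lambda x: x * x]
gate_logits = [2.0, 0.5, -1.0, 1.0]

gates = top_k_gate(gate_logits, k=2)
y = moe_output(3.0, gate_logits, experts, k=2)
print(gates, y)
```

With k = 2 only experts 0 and 3 run, so the output is a convex combination of their two predictions (6 and 9) and the other two experts cost nothing.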

Wasserstein barycenter — the "mean" of distributions in optimal-transport sense:

\[\bar{\mu} = \arg\min_{\mu}\; \sum_i w_i\, W_2^2(\mu, \mu_i)\]

This is mixing in distribution space that respects geometry — interpolating between a unimodal distribution at 0 and a unimodal at 1 produces a unimodal at 0.5 (the Wasserstein mean), not the bimodal arithmetic mixture \(\tfrac{1}{2}(\delta_0 + \delta_1)\). Two completely different "averages" — which one the application wants is the design question.
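
In one dimension the \(W_2\) barycenter has a simple form: its quantile function is the weighted average of the component quantile functions. With two equal-size samples that reduces to averaging sorted values, which the sketch below uses; the narrow Gaussians near 0 and 1 are smoothed stand-ins for \(\delta_0\) and \(\delta_1\).

```python
import random

def barycenter_1d(samples_a, samples_b, weight_a):
    """W2 barycenter of two equal-size 1-D samples via averaged order statistics."""
    a, b = sorted(samples_a), sorted(samples_b)
    return [weight_a * x + (1.0 - weight_a) * y for x, y in zip(a, b)]

def fraction_in(values, lo, hi):
    return sum(lo <= v <= hi for v in values) / len(values)

rng = random.Random(0)
near_zero = [rng.gauss(0.0, 0.05) for _ in range(5000)]
near_one = [rng.gauss(1.0, 0.05) for _ in range(5000)]

bary = barycenter_1d(near_zero, near_one, 0.5)
# Arithmetic mixture: each draw comes from one component or the other.
mixture = [rng.choice((x, y)) for x, y in zip(near_zero, near_one)]

bary_mid = fraction_in(bary, 0.4, 0.6)        # most of the mass sits near 0.5
mixture_mid = fraction_in(mixture, 0.4, 0.6)  # essentially none does
print(bary_mid, mixture_mid)
```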

Bayesian model averaging:

\[p(y \mid \mathrm{data}) = \sum_i p(y \mid \mathrm{model}_i)\, \cdot\, p(\mathrm{model}_i \mid \mathrm{data})\]

The same form as a probability mixture, with the weights being model posteriors. Equivalent in form to ensemble averaging in ML — the interpretation differs but the math is identical.

2. Chemical mixtures — and the reactive non-mixture¶

A chemical mixture preserves the parts; a chemical reaction destroys them. The line between them is whether the kernel is conservative or transformative.

Conservative (true mixture):

Heterogeneous: oil + water; visibly two phases. Kernel is approximately identity within each phase; the parts interact only at the interface, so interfacial area sets the interaction.
Homogeneous (solution): salt + water, alcohol + water. One phase; parts are molecularly mixed but recoverable (evaporate the water → get salt back).
Colloid / emulsion: parts dispersed below visible scale but above molecular scale, as in mayonnaise (oil-in-water + emulsifier), milk, smoke, fog. Emulsions are mixtures stabilized by a third species (the emulsifier) whose molecules have both polar and nonpolar ends.

Transformative (reaction, not mixture):

\[\mathrm{H_2 + \tfrac{1}{2}\,O_2 \;\longrightarrow\; H_2O} \quad\text{(reactants gone; new species)}\]

This is not a mixture in any algebraic sense. The kernel discards the inputs. But: every reaction starts from a mixture (parts must be in contact), so mixing rate often bounds reaction rate (the diffusion limit).

Quantitative laws on conservative mixtures:

Law	Rule	Interpretation
Dalton	total pressure = Σ partial pressures	gas mixture is additive in pressure
Raoult	partial vapor pressure = mole fraction × pure vapor pressure	ideal liquid mixture is linear in mole fraction
Henry	dissolved concentration = `kH` × partial pressure	gas-in-liquid mixing is linear at low concentration
Beer–Lambert	absorbance = Σ εᵢ cᵢ ℓ	spectra of mixtures are linear in component spectra (if no interaction)
Ideal mixing entropy	`ΔS_mix = -R Σ xᵢ ln xᵢ`	entropy gain is the Shannon entropy of the composition — the same formula as in information theory

The last one is striking: up to the constant R and the base of the logarithm, the entropy of mixing in chemistry is the Shannon entropy of information theory. The same -Σ pᵢ log pᵢ describes a gas of N molecules shared between two compartments and the output of a fair-coin source, which costs one bit per toss to encode. Mixtures are information-theoretic objects.
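
As a worked check of that correspondence, the sketch below evaluates the ideal molar mixing entropy and the Shannon entropy for the same composition. The two differ only by the factor R ln 2 that converts bits to J/(mol·K).

```python
import math

R = 8.314462618  # molar gas constant, J/(mol*K)

def mixing_entropy(mole_fractions):
    """Ideal molar entropy of mixing: -R * sum_i x_i ln x_i."""
    return -R * sum(x * math.log(x) for x in mole_fractions if x > 0.0)

def shannon_entropy_bits(probabilities):
    """Shannon entropy in bits: -sum_i p_i log2 p_i."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)

composition = [0.5, 0.5]                      # equimolar binary mixture / fair coin
delta_s = mixing_entropy(composition)         # R ln 2, about 5.76 J/(mol*K)
h_bits = shannon_entropy_bits(composition)    # 1 bit

print(delta_s, h_bits, R * math.log(2) * h_bits)
```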

Non-ideal corrections (where chemistry stops being linear):

Activity coefficients (Lewis): when molecules of A and B prefer each other (or repel), aᵢ = γᵢ xᵢ with γ ≠ 1. Salt water is non-ideal enough to need γ to predict its boiling point.
Excess Gibbs energy G^E = G_mix − G_mix^ideal quantifies deviation; positive G^E → "they don't like mixing" → eventually phase separation.

3. Fluids — stirring vs mixing vs diffusing¶

Three distinct operations, often conflated.

Stirring: macroscopic transport. Folding and stretching of fluid parcels by velocity field. Eddies. Increases the interfacial area between the parts without changing the molecular composition along a parcel. Like shuffling cards — no card touches another, but their arrangement becomes mixed.
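Diffusion: microscopic molecular spread. Fick's law J = -D ∇c. Slow on macroscopic scales: diffusion time ~ L²/D, so with D ≈ 5 × 10⁻¹⁰ m²/s (sucrose in water) it takes about half an hour to cross a millimeter, weeks to months to cross an unstirred cup of tea, and decades to cross a meter of water.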
Mixing: the joint effect — stirring multiplies interfacial area so diffusion can finish the job in finite time. The Batchelor scale η_B = (νD²/ε)^¼ is where the two regimes meet: below it, diffusion dominates; above it, stirring dominates.
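
The two scalings above are easy to put numbers on. The values below are order-of-magnitude assumptions (sucrose diffusivity in water, the kinematic viscosity of water, and a guessed dissipation rate for a stirred cup), not measurements.

```python
def diffusion_time(length_m, diffusivity):
    """Scaling estimate t ~ L^2 / D, in seconds."""
    return length_m ** 2 / diffusivity

def batchelor_scale(nu, diffusivity, epsilon):
    """eta_B = (nu * D^2 / epsilon)^(1/4), in meters."""
    return (nu * diffusivity ** 2 / epsilon) ** 0.25

D_SUCROSE = 5e-10  # m^2/s, assumed order of magnitude for sucrose in water
NU_WATER = 1e-6    # m^2/s, kinematic viscosity of water near room temperature
EPSILON = 1e-2     # W/kg, assumed dissipation rate for a stirred cup

DAY = 86400.0
for length in (1e-3, 5e-2, 1.0):
    print(f"L = {length:g} m: {diffusion_time(length, D_SUCROSE) / DAY:.3g} days")

eta_b = batchelor_scale(NU_WATER, D_SUCROSE, EPSILON)
t_across = diffusion_time(eta_b, D_SUCROSE)   # equals sqrt(nu / epsilon)
print(eta_b, t_across)
```

Under these assumptions stirring shrinks the length diffusion has to cross from centimeters to a few micrometers, and the remaining diffusion time falls from months to about a hundredth of a second.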

This is consistent with a kitchen observation: a cold milk stream in cold coffee makes "layers" that don't mix; the same milk stream in hot coffee disperses quickly. Higher temperature raises D (Stokes-Einstein, D ~ T/μ) and lowers viscosity, so the same pour stirs more vigorously and the diffusion time across the Batchelor scale, η_B²/D = (ν/ε)^½, shrinks.

Turbulent mixing: in turbulence, the energy cascade from large to small eddies (E(k) ~ k^{-5/3}, Kolmogorov) is the same machinery that drives mixing. In stratified flows the mixing efficiency Γ = B/ε (buoyancy flux over dissipation) measures how much of the stirring work ends up as irreversible mixing rather than heat. A commonly quoted value is Γ ≈ 0.2, so most of the work is dissipated rather than spent on mixing.

4. Audio and color — interference and perceptual mixing¶

Sound mixing is waveform addition: at a point in space, pressures sum. The kernel is genuinely linear in the medium (air is approximately linear at normal amplitudes). Linear superposition already produces interference and beats; the perceptual kernel layered on top adds consonance and masking, which are not linear:

Constructive / destructive interference: two in-phase waves of the same frequency and amplitude double the amplitude (4× intensity); in antiphase they cancel.
Beats: two close frequencies f₁ ≈ f₂ produce an amplitude envelope that rises and falls at |f₁ − f₂|. The modulation is a product of the mix and is present in neither source, although in a linear medium no new spectral component appears at |f₁ − f₂|.
Consonance / dissonance: small-integer frequency ratios (2:1, 3:2, 4:3) are heard as consonant; ratios that narrowly miss them are heard as dissonant. The basilar membrane of the cochlea resolves frequencies, and closely spaced frequencies excite overlapping regions, which is heard as roughness. This is Helmholtz's 1863 explanation.
Masking: a loud tone at f suppresses perception of soft tones near f (within ~ a critical band). This is the same masking inequality as salt over bitter — perception of part b is reduced by the presence of part a even though a does not "consume" b. MP3 and AAC audio compression exploit this: the encoder discards masked components.
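
The beat phenomenon follows from the identity \(\sin a + \sin b = 2\cos\tfrac{a-b}{2}\,\sin\tfrac{a+b}{2}\). The sketch below checks it numerically for two tones 4 Hz apart; the frequencies and sample rate are illustrative choices.

```python
import math

def two_tone(t, f1, f2):
    """Linear superposition of two unit-amplitude sines."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def carrier_times_envelope(t, f1, f2):
    """The same signal written as a slow envelope times a fast carrier."""
    envelope = 2 * math.cos(math.pi * (f1 - f2) * t)
    carrier = math.sin(math.pi * (f1 + f2) * t)
    return envelope * carrier

f1, f2 = 440.0, 444.0
times = [n / 8000.0 for n in range(8000)]  # one second at 8 kHz

max_gap = max(abs(two_tone(t, f1, f2) - carrier_times_envelope(t, f1, f2))
              for t in times)
beat_rate = abs(f1 - f2)  # loudness peaks per second

print(max_gap, beat_rate)
```

The envelope magnitude |2 cos(π(f₁ − f₂)t)| peaks |f₁ − f₂| times per second, which is the beat rate heard by the listener.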

Color mixing:

Additive (light, RGB): wavelengths add. Linear in the physics; perceptual kernel via three cone types (L, M, S) gives the CIE chromaticity diagram. The gamut of n primaries is the convex hull of those primaries in CIE xy, a convex combination in a 2D projection.
Subtractive (pigment, CMY): each pigment absorbs part of the spectrum reflected back. The kernel is multiplicative in reflectance, equivalently additive in negative-log-reflectance. So "mixing paints" is a log-linear pool, not the additive pool that mixing lights is. This is why mixing many paints converges to muddy brown (the reflectance product → 0 in all bands), while mixing many lights converges to white.

A clean conceptual point: light mixing is the canonical linear kernel; paint mixing is the canonical log-linear kernel; the difference is not "art" — it is two different physical operations.
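
A three-band toy model shows the two kernels side by side. It assumes idealized spectra reduced to (red, green, blue) bands and treats pigments as stacked ideal filters; real paints are usually modelled with Kubelka-Munk theory, which this sketch ignores.

```python
def mix_lights(spectra):
    """Additive kernel: emitted power sums band by band."""
    return [sum(band) for band in zip(*spectra)]

def mix_pigments(reflectances, exponents):
    """Multiplicative kernel: prod_i r_i^a_i per band (log-linear in reflectance)."""
    out = []
    for band in zip(*reflectances):
        value = 1.0
        for r, a in zip(band, exponents):
            value *= r ** a
        out.append(value)
    return out

# Illustrative three-band values, not measured spectra.
red, green, blue = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
cyan, magenta, yellow = [0.1, 0.9, 0.9], [0.9, 0.1, 0.9], [0.9, 0.9, 0.1]

white = mix_lights([red, green, blue])                            # flat, bright
layered = mix_pigments([cyan, magenta, yellow], [1.0, 1.0, 1.0])  # flat, very dark
blended = mix_pigments([cyan, magenta, yellow], [1/3, 1/3, 1/3])  # flat, darker than the mean

print(white, layered, blended)
```

Both results are spectrally flat, but the lights land at full power in every band while the pigments land near 0.08 (layered) or 0.43 (equal-weight blend), below the arithmetic mean reflectance of about 0.63.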

5. Probability and machine learning — mixing as inference¶

ML uses mixing in at least four distinct roles.

Role	Form	Example
Model	data `~ Σ wᵢ pᵢ`	Gaussian mixture model, hidden Markov model emission, topic model
Ensemble	prediction `= Σ wᵢ ŷᵢ`	Bagging, stacking, model soup, BMA
Capacity	computation `= Σ gᵢ(x) fᵢ(x)`	Mixture of experts, switch transformer, conditional computation
Regularizer	training `~` mix two losses or two inputs	Mixup augmentation; objective mixing (L_CE + λ L_KL)

Mixture of experts deserves its own paragraph. MoE replaces a dense feedforward f(x) = W · x with Σᵢ gᵢ(x) · fᵢ(x) where gᵢ is a learned gating function. Sparse routing (top-k with k = 1 or 2) means each input activates only a few experts, so total parameters grow linearly with the number of experts while per-input compute stays roughly constant. The kernel is gated linear: linear given the gate, but the gate is non-linear in x. This makes the effective kernel piecewise-linear with input-dependent regions: a partition of input space into expert specializations, glued at the boundaries by the gate.

Failure modes peculiar to MoE:

Expert collapse: gate routes everything to one expert; others are dead. Mitigated by auxiliary load-balancing losses.
Capacity overflow: more inputs routed to expert i than its capacity → drop / overflow. The "token drop" rate is a key reliability metric.
Gate-expert miscoordination: experts diverge in training; gate's routing is on stale assumptions. Slow-moving gate or expert EMA helps.
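
One form of auxiliary load-balancing loss, in the style described for Switch Transformer, is α · N · Σᵢ fᵢ Pᵢ, where fᵢ is the fraction of tokens whose top-1 expert is i and Pᵢ is the mean router probability for expert i. The sketch below is a plain-Python illustration under that assumption; the function name, the default α, and the toy router outputs are hypothetical.

```python
def load_balance_loss(router_probs, alpha=0.01):
    """alpha * N * sum_i f_i * P_i over a batch of per-token router probabilities."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f_i: fraction of tokens dispatched (top-1) to expert i.
    f = [0.0] * num_experts
    for probs in router_probs:
        f[probs.index(max(probs))] += 1.0 / num_tokens
    # P_i: mean router probability assigned to expert i.
    P = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

balanced = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
collapsed = [[0.9, 0.1]] * 4   # every token prefers expert 0

loss_balanced = load_balance_loss(balanced)
loss_collapsed = load_balance_loss(collapsed)
print(loss_balanced, loss_collapsed)
```

The collapsed router scores higher (0.018 against 0.01 here), so adding this term to the training loss pushes the gate back toward uniform expert usage.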

Mixup augmentation is the simplest possible linear mixing used for generalization:

\[\tilde{x} = \lambda\, x_i + (1-\lambda)\, x_j, \quad \tilde{y} = \lambda\, y_i + (1-\lambda)\, y_j, \quad \lambda \sim \mathrm{Beta}(\alpha, \alpha)\]

A linear mixture in input space + a linear mixture in label space acts as an effective regularizer because the function class is forced to behave linearly between data points, suppressing high-curvature decision boundaries. The empirical observation: mixup improves generalization on many tasks despite being almost embarrassingly simple. This is the cleanest case of "the kernel is the regularizer."
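
The mixup formula translates almost line for line into code. This sketch assumes inputs are plain feature lists and labels are one-hot lists; `mixup_pair` is an illustrative helper name, and α = 0.2 is just an example value.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, alpha, rng):
    """Return one mixed example: the same lambda mixes inputs and labels."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam

rng = random.Random(0)
x_mix, y_mix, lam = mixup_pair(
    x_i=[0.0, 0.0], y_i=[1.0, 0.0],   # example from class 0
    x_j=[2.0, 4.0], y_j=[0.0, 1.0],   # example from class 1
    alpha=0.2, rng=rng,
)
print(lam, x_mix, y_mix)
```

The mixed label stays on the simplex, so it can be fed to a cross-entropy loss with soft targets without further normalization.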

6. Social and linguistic mixtures¶

The same skeleton (parts, weights, kernel) keeps applying outside physics and ML.

Opinion aggregation (DeGroot): xᵢ(t+1) = Σⱼ Wᵢⱼ xⱼ(t), with W row-stochastic, so each agent updates to a linear mixture of neighbors' opinions. If the influence graph is strongly connected, opinions converge to consensus iff it is also aperiodic. The Friedkin-Johnsen extension adds a "stubbornness" term λᵢ xᵢ(0), keeping the agent partly anchored to their original opinion, which is exactly the prior-as-carrier role in Bayesian mixing.
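
A three-agent sketch of the DeGroot update. The influence matrix is an illustrative row-stochastic chain with self-loops (strongly connected and aperiodic), so the iteration should settle on a single consensus value, namely the stationary-distribution-weighted average of the initial opinions.

```python
def degroot_step(W, x):
    """One update: x_i <- sum_j W_ij * x_j."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

# Illustrative influence weights; each row sums to 1.
W = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
x0 = [1.0, 0.0, 0.0]

x = x0
for _ in range(200):
    x = degroot_step(W, x)

# Stationary distribution of W is (0.25, 0.5, 0.25), so consensus = 0.25 * x0[0].
stationary = [0.25, 0.5, 0.25]
consensus = sum(p * v for p, v in zip(stationary, x0))
print(x, consensus)
```

The middle agent carries twice the weight of the others in the consensus because it is listened to by both neighbors.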

Code-switching (linguistics): bilingual speakers mix grammatical elements from two languages at well-defined junctures (Poplack's "equivalence constraint" and "free morpheme constraint"). The kernel has a structure: not every mixing is grammatical. Code-switching is a constrained mixture in a grammar-aware kernel.

Cultural fusion in cuisine (already covered in MIXTURES.md §5): cuisines emerge as locally stable kernels selecting compatible part sets.

7. The "escape the convex hull" question¶

The single sharpest design question across all these mixings:

When does the mixture produce something outside the space of its parts?

Domain	"Inside the hull" mixture	"Outside the hull" mixture
Math	Convex combination	Geometric mean, Wasserstein barycenter, sample-then-output
Taste	Two sweet things	Umami × umami (8×), salt + bitter (suppression), browning (new compounds)
Smell	Two musks (saturate)	A perfume accord that "smells like neither"
Chemistry	Salt + water	Hydrogen + oxygen → water; emulsion; alloy
Color (light)	Two reds	RGB → white (looks like none of the parts)
Color (paint)	Two greens	Many pigments → mud
Sound	Two unisons	Chord triad (Gestalt object); beats; harmony
Probability	BMA at fixed prior	Product-of-experts (sharper); mixture of categoricals (multimodal)
ML	Soup of fine-tunes	MoE with specialization; mixup-regularized model
Society	DeGroot consensus	Polarization (anti-mixing); cultural fusion (new genre)

The "outside the hull" cases are where mixing is interesting rather than averaging. The rest of design — kitchen, perfumery, ML architecture, opinion dynamics — is about engineering kernels that escape the hull in useful directions and avoid the hull-escape failure modes (mud, dissonance, mode collapse, polarization).

8. Failure modes — a unified table¶

Failure	Mechanism	Domain examples
Muddled middle	Too many small weights on too many parts	7+ herbs in a dish; 30+ raw materials in a perfume; full ensemble averaging without selection
Single-axis maxed	One weight = 1, others = 0	Pure heat without sweet; mono-instrument mix; single-expert MoE collapse
Carrier mismatch	Parts can't reach the kernel	Fat-soluble aroma in water dish; oil-based pigment on wet plaster; tokens that miss their expert
Saturation / masking	Too much of one part hides others	Loud bass masks treble; loud expert masks gate signal; over-spiced food
Time misalignment	Parts arrive at different times	Top-note vanishes before base note appears; ensemble member trained on stale data
Cultural / prior mismatch	Same chemistry, wrong context	Anchovy + chocolate works in Italy, weird in Japan; same Bayesian update with wrong prior
Catastrophic interaction	Kernel destroys parts	Reactive: bleach + ammonia; bad: two musks → null perception

These map row-by-row across the table. The "design rule" for each is the same in form: adjust weights, change carrier, replace one part, time-shift, or change the kernel.

9. Concrete cross-domain recipes for the swarm¶

Where this generalization buys something practical:

A pairing finder is a kernel + distance. Foodpairing.com (high shared-volatile pairs well in Western cuisine) and embedding-distance pairing (Open POM) are both "near-neighbor in some space" — they just choose different spaces. The same algorithm runs on perfume notes, wine pairings, music recommendations.
Mixup-as-regularizer generalizes. Any system that should be linear between two anchor examples benefits from in-space mixing during training. Cuisine students learning by tasting "halfway between dish A and dish B" is the human version.
Carrier choice is dominant in non-linear kernels. In food: pick fat. In ML: pick the input encoding. In perfume: pick the fixative. In inference: pick the prior. Carriers don't appear in the headline recipe but determine whether the recipe works.
Escape-the-hull design. To make something new, choose parts with genuine kernel synergy (umami × umami; harmonic 3:2; emulsifier + immiscible parts; experts with disjoint specialties). Linear mixing gives you only points inside.

10. Open questions¶

Is there a unified algebra of "good" mixtures? Across domains, "good" often involves a small number of dominant components (~3) plus carriers and a contrast accent. Why 3? Working memory? Information-theoretic capacity of the readout?
When does the kernel itself depend on input? MoE makes the kernel input-dependent; cuisine grammars do the same (acid is mandatory in some cuisines, optional in others, conditional on dish). A theory of conditional kernels would unify these.
Wasserstein-style "geometric" mixing of perceptual spaces — does an optimal-transport mean of two smells smell more like an actual smell than the arithmetic mixture? Open POM allows this experiment.
Mixing entropy as the universal scoring function — −Σ pᵢ log pᵢ shows up in chemistry, info theory, ecology (Shannon diversity index), political polarization measures. Is "interesting mixture" = "high mixing entropy after subtracting predictable variance"?
Reactive mixing in social systems: what's the social analogue of 2H₂ + O₂ → 2H₂O? Movements that destroy their parts to produce a new collective species? (Schelling on tipping; Granovetter on threshold models.)

References¶

Bishop, C. M. (2006). Pattern Recognition and Machine Learning — Chapter 9 on mixture models, EM, and Gaussian mixtures.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation. — product-of-experts as log-linear pool.
Shazeer, N., et al. (2017). Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. ICLR.
Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. JMLR.
Zhang, H., et al. (2018). mixup: beyond empirical risk minimization. ICLR.
Cuturi, M. (2013). Sinkhorn distances: lightspeed computation of optimal transport. NeurIPS. — Wasserstein barycenters.
Helmholtz, H. von (1863). On the Sensations of Tone — consonance, beats, masking.
Wyszecki, G., & Stiles, W. S. (2000). Color Science. — CIE chromaticity and additive/subtractive mixing.
Tennekes, H., & Lumley, J. L. (1972). A First Course in Turbulence. — turbulent mixing, Batchelor scale.
Atkins, P., & de Paula, J. (2014). Physical Chemistry. — Dalton, Raoult, Henry, mixing entropy.
DeGroot, M. H. (1974). Reaching a consensus. JASA. — linear opinion aggregation.
Friedkin, N. E., & Johnsen, E. C. (1999). Social influence networks and opinion change. Advances in Group Processes.
Ahn, Y.-Y., et al. (2011). Flavor network and the principles of food pairing. Scientific Reports. — cuisine as a kernel.
Lee, B. K., et al. (2023). A principal odor map. Science. — smell as a learned embedding for mixing.

Inspiration sources¶

The mixture expert's existing investigation MIXTURES.md (taste + smell) — the concrete case this page abstracts from.
The isomorphism atlas method: when ≥6 domains share a structure, write the structure, not the domain.
Boltzmann (entropy of mixing) and Shannon (entropy of a source) using the same -Σ pᵢ log pᵢ — the historical observation that mixing is an information-theoretic operation, not just a physical one.
Geoffrey Hinton's "product of experts" as the geometric counterpart to "mixture of experts" — two mixing kernels at the heart of modern ML.