Mixing as Kernel: the seam

All combination phenomena share one skeleton: parts p in a space X, weights w on the simplex, and a kernel K(w,p) that decides whether the mixture stays inside the convex hull (additive, redundant, Ω > 0) or escapes it (synergistic, interesting, Ω < 0). O-information gives the signed scalar. Non-Euclidean kernels (Wasserstein, Fisher-Rao, orthogonal) beat Euclidean averaging whenever the parts live in a curved space, a result reported independently for distributions, model weights, and input clusters.

🌿 budding tended 2026-05-19 research mixtures isomorphism mathematics information-theory machine-learning kernel

flowchart LR
  parts[parts · pᵢ ∈ X] --> kernel["K(w, p)"]
  weights[weights · w ∈ Δ] --> kernel
  carrier[carrier · medium · prior · geometry] --> kernel
  kernel --> inside["Ω > 0 · redundant · inside hull"]
  kernel --> outside["Ω < 0 · synergistic · outside hull"]
  outside --> interesting[interesting: umami×umami · chord · MoE · alloy]
  inside --> averaging[averaging: paint mix · BMA · ensemble mean]

L0: TL;DR (≤5 lines)

Every combination phenomenon has the same three-knob skeleton: parts p in a space X, weights w on the simplex, and a carrier c, combined by a kernel K(w, p, c) that decides what the mixture is. The kernel either stays inside the convex hull (additive, linear, boring) or escapes it (synergistic, interesting). O-information Ω = TC − DTC is the signed scalar that measures which regime you are in: Ω < 0 = synergy (outside hull); Ω > 0 = redundancy (inside hull). Non-Euclidean kernels (Wasserstein, Fisher-Rao, orthogonal) beat Euclidean averaging whenever parts live in a curved space, a result reported independently in three regimes: distributions, model weights, and cluster-structured inputs.

L1: Overview

The three-knob model

mixture m = K(weights w, parts p, carrier c)

| Knob | What it is | Why it matters |
| --- | --- | --- |
| Parts `p₁…pₙ ∈ X` | the things being combined | the space `X` determines what distances and means are valid |
| Weights `w ∈ Δⁿ⁻¹` | how much of each | only decisive for linear kernels; non-linear kernels make weight effects input-dependent |
| Carrier `c` | the medium holding the parts | "silent" but changes the effective kernel (fat in food; prior in Bayes; geometry in model space) |

The kernel zoo (full version in MIXING-GENERALIZED): linear · log-linear · multiplicative-synergistic · masking-saturating · gated-routed · stochastic · reactive · emergent · time-resolved · spatial.
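As a rough illustration of the first two entries in the zoo, the sketch below pools two categorical distributions with a linear kernel and with a log-linear kernel. The function names and the example distributions are hypothetical, chosen only for this note, and the carrier is left out for brevity.

```python
import math

def linear_pool(weights, parts):
    """Linear kernel: m[k] = sum_i w_i * p_i[k]."""
    return [sum(w * p[k] for w, p in zip(weights, parts))
            for k in range(len(parts[0]))]

def log_linear_pool(weights, parts):
    """Log-linear kernel: m[k] proportional to prod_i p_i[k] ** w_i, then renormalised."""
    raw = [math.prod(p[k] ** w for w, p in zip(weights, parts))
           for k in range(len(parts[0]))]
    total = sum(raw)
    return [r / total for r in raw]

# Two hypothetical parts on the 3-outcome simplex, equal weights.
w = [0.5, 0.5]
parts = [[0.90, 0.05, 0.05],
         [0.05, 0.05, 0.90]]

lin = linear_pool(w, parts)       # stays on the segment between the two parts
geo = log_linear_pool(w, parts)   # linear in log-coordinates, not in probabilities
print("linear    :", [round(x, 4) for x in lin])
print("log-linear:", [round(x, 4) for x in geo])
```

Both parts give the middle outcome probability 0.05, so every linear mixture does too. The log-linear pool roughly doubles it, which is one concrete way the choice of kernel changes what "the mixture" is for the same parts and weights.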

The seam claim

The seam between MIXTURES and MIXING-GENERALIZED is not analogy — it is the same math playing in different keys. Each domain is one instantiation of K(w, p) with a domain-specific carrier:

| Domain | Carrier | Hull-escaping kernel |
| --- | --- | --- |
| Taste | fat | umami × umami (8× synergy) |
| Smell | air / receptor space | accord that "smells like neither" |
| Chemistry | solvent | emulsion; reactive (2H₂ + O₂ → 2H₂O) |
| Color (light) | CIE space | RGB → white |
| Color (paint) | reflectance | many pigments → mud (log-linear, hull inward) |
| ML | weight / distribution space | MoE with sparse gating; mixup |
| Social | influence graph | DeGroot consensus (inside); polarization (outside?) |

The design task is the same in every row: choose K and carrier to engineer Ω toward negative for the desired output, and away from the failure modes (mud, mode collapse, muddled middle).
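The "inside" label on DeGroot consensus can be checked numerically. In the standard DeGroot model each agent repeatedly replaces its opinion with a trust-weighted average of everyone's opinions, so every update is a convex combination. The trust matrix and starting opinions below are made up for illustration.

```python
def degroot_step(trust, opinions):
    """One DeGroot update: each agent takes a trust-weighted average of all opinions."""
    return [sum(t * x for t, x in zip(row, opinions)) for row in trust]

# Hypothetical row-stochastic trust matrix (each row sums to 1).
trust = [[0.6, 0.3, 0.1],
         [0.2, 0.6, 0.2],
         [0.1, 0.3, 0.6]]
opinions = [0.0, 0.5, 1.0]
lo, hi = min(opinions), max(opinions)

for _ in range(200):
    opinions = degroot_step(trust, opinions)

print("consensus:", [round(x, 6) for x in opinions])
print("inside initial range:", all(lo <= x <= hi for x in opinions))
```

Because each step is an average, the consensus value can never leave the range of the starting opinions. Anything that does leave it, such as polarization, needs a different kernel.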

What O-information adds

MIXING-GENERALIZED §7 asks "when does the mixture produce something outside the space of its parts?" without a computable answer. O-information answers it:

Ω(X₁, …, Xₙ) = TC(X₁:…:Xₙ) − DTC(X₁:…:Xₙ)

where TC = total correlation (redundancy pressure) and DTC = dual total correlation (synergy pressure). Bounoua et al. (2024) give a practical estimator that works on non-Gaussian systems. Sign of Ω determines the regime; magnitude tells you how far from linear you are.
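Writing both terms out in entropies makes the sign easier to reason about. The derivation below assumes the standard definitions TC = Σᵢ H(Xᵢ) − H(X₁,…,Xₙ) and DTC = H(X₁,…,Xₙ) − Σᵢ H(Xᵢ | X₋ᵢ), where X₋ᵢ denotes all parts except Xᵢ, and uses the chain rule H(Xᵢ | X₋ᵢ) = H(X₁,…,Xₙ) − H(X₋ᵢ).

```latex
\begin{aligned}
\Omega &= \mathrm{TC} - \mathrm{DTC} \\
       &= \Big[\sum_{i=1}^{n} H(X_i) - H(X_1,\dots,X_n)\Big]
        - \Big[H(X_1,\dots,X_n) - \sum_{i=1}^{n} H(X_i \mid X_{-i})\Big] \\
       &= (n-2)\,H(X_1,\dots,X_n) + \sum_{i=1}^{n} \big[H(X_i) - H(X_{-i})\big]
\end{aligned}
```

For n = 2 the first term vanishes and the sum cancels (X₋₁ = X₂ and X₋₂ = X₁), so Ω is identically zero. The scalar only discriminates between regimes when at least three variables are involved, for instance two parts plus the mixture's output.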

Design rule from the forage (S553): to produce a synergistic mixture (Ω < 0), you need parts whose dependence appears only jointly, so that the whole carries information that no single part or pair reveals. The forage reads umami × umami, harmonic chords, and gated MoE routing as instances of this. Parts with high mutual overlap (high shared information) give Ω > 0 by default; Ω < 0 requires structurally complementary parts.

Non-Euclidean kernels

MIXING-GENERALIZED §1 notes the Wasserstein barycenter as an alternative "geometric mean" of distributions without showing that it is better. The forage supplies three results that point the same way:

Fisher-Rao Karcher mean (Wang et al. 2026): avoids representation collapse and activation variance shrinkage that afflict linear weight averaging. The manifold geometry encodes information-theoretic distance.
Orthogonal manifold merging (Yang et al. 2026): prevents catastrophic forgetting; linear arithmetic merging fails.
Wasserstein geodesic (Zhu et al. 2023): improves certifiable robustness over linear Mixup. Geometric interpolation > arithmetic interpolation in distribution space.

The pattern: whenever the parts live in a curved space (distributions, probability simplices, model weight manifolds), the flat Euclidean average is wrong by construction. The carrier geometry is load-bearing, not cosmetic.

L2: Deep dive

1. O-information as a mixing instrument

Full formalism: let X₁, …, Xₙ be the parts.

TC  = ΣᵢH(Xᵢ) − H(X₁,…,Xₙ)          [total correlation, ≥0]
DTC = H(X₁,…,Xₙ) − ΣᵢH(Xᵢ|X₋ᵢ)     [dual total correlation, ≥0]
Ω   = TC − DTC

Ω > 0: TC dominates; the same information is copied across several parts, so the dependence is already visible in small subsets. Redundancy. Mixture inside hull.
Ω < 0: DTC dominates; the dependence only appears when the parts are considered together, and no single part accounts for it. Synergy. Mixture outside hull.
Ω = 0: redundancy and synergy balance exactly; for example, mutually independent parts, where TC = DTC = 0.

SΩI (Bounoua et al. 2024) estimates this without Gaussianity using score functions — applicable to taste data, audio, social networks, model activations.
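To make the sign convention concrete, here is a minimal plug-in computation of Ω on exact discrete joint distributions. It is not the score-based SΩI estimator, only the definitions above applied to small probability tables; the helper names and the three toy systems are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(joint, keep):
    """Shannon entropy in bits of the marginal over the variable indices in `keep`."""
    marginal = defaultdict(float)
    for outcome, prob in joint.items():
        marginal[tuple(outcome[i] for i in keep)] += prob
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

def o_information(joint, n):
    """Omega = TC - DTC for a discrete joint pmf given as {outcome tuple: probability}."""
    everyone = list(range(n))
    h_joint = entropy(joint, everyone)
    tc = sum(entropy(joint, [i]) for i in everyone) - h_joint
    # Chain rule: H(X_i | X_-i) = H(X) - H(X_-i)
    dtc = h_joint - sum(
        h_joint - entropy(joint, [j for j in everyone if j != i])
        for i in everyone
    )
    return tc - dtc

# Three toy systems of three binary variables.
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}   # third = XOR of the first two
copy = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}                             # all three identical
independent = {bits: 0.125 for bits in product((0, 1), repeat=3)}   # three fair coins

for name, joint in [("xor", xor), ("copy", copy), ("independent", independent)]:
    print(name, o_information(joint, 3))
```

XOR comes out at Ω = −1 bit (pure synergy), three copies of one bit at Ω = +1 bit (pure redundancy), and independent coins at Ω = 0. Real data would need an estimator rather than an exact table, which is the gap SΩI is meant to fill.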

2. When Euclidean averaging is wrong

The Euclidean mean assumes the space is flat. Three classes of failure:

Distributions: the arithmetic mean of two Gaussians N(0,1) and N(10,1) is a bimodal mixture with almost no mass near 5, which is meaningless as a "typical distribution." The equal-weight 2-Wasserstein barycenter is N(5,1), the geodesic midpoint under the metric of distribution space. For any application where "midpoint" should reflect a smooth interpolant (data augmentation, domain adaptation), use the Wasserstein kernel.
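The Gaussian example can be reproduced in a few lines. The sketch relies on the one-dimensional closed form, where the 2-Wasserstein barycenter of Gaussians is again Gaussian with the weighted average of the means and of the standard deviations. The function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, params):
    """Arithmetic (Euclidean) mean of the densities."""
    return sum(w * normal_pdf(x, mu, s) for w, (mu, s) in zip(weights, params))

def w2_barycenter_1d(weights, params):
    """1-D Gaussian W2 barycenter: average the means and the standard deviations."""
    mu = sum(w * m for w, (m, _) in zip(weights, params))
    sigma = sum(w * s for w, (_, s) in zip(weights, params))
    return mu, sigma

weights = [0.5, 0.5]
params = [(0.0, 1.0), (10.0, 1.0)]   # (mean, std) of N(0,1) and N(10,1)

bary_mu, bary_sigma = w2_barycenter_1d(weights, params)
density_mix_mid = mixture_pdf(5.0, weights, params)
density_bary_mid = normal_pdf(5.0, bary_mu, bary_sigma)

print("barycenter:", (bary_mu, bary_sigma))
print("density at x=5, arithmetic mixture:", density_mix_mid)
print("density at x=5, W2 barycenter     :", density_bary_mid)
```

At the midpoint x = 5 the arithmetic mixture has a density on the order of 1e-6, while the barycenter peaks there. In higher dimensions or for non-Gaussian parts there is generally no closed form and the barycenter has to be computed numerically.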

Model weights: modern LLMs live on or near low-dimensional manifolds in weight space. Arithmetic weight averaging can land off-manifold, which shows up as representation collapse (activation variance shrinks, effective rank degrades). The Fisher-Rao Karcher mean stays on-manifold by using the information-geometric metric. The cost is an iterative fixed-point solve (versus one-shot averaging); Wang et al. (2026) report a consistent quality gain.
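The iterative solve can be illustrated on a much smaller manifold than LLM weight space. The sketch below computes a Karcher mean of categorical distributions under the Fisher-Rao metric, using the fact that the square-root map sends the simplex onto part of the unit sphere, where geodesics are great-circle arcs. This is a toy stand-in, not the procedure of Wang et al.; all function names and the example distributions are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sphere_log(base, point):
    """Tangent vector at `base` pointing along the geodesic to `point`."""
    cos_theta = max(-1.0, min(1.0, dot(base, point)))
    theta = math.acos(cos_theta)
    if theta < 1e-12:
        return [0.0] * len(base)
    scale = theta / math.sin(theta)
    return [scale * (x - cos_theta * b) for b, x in zip(base, point)]

def sphere_exp(base, tangent):
    """Walk from `base` along `tangent` for its length, staying on the sphere."""
    norm = math.sqrt(dot(tangent, tangent))
    if norm < 1e-12:
        return list(base)
    return [math.cos(norm) * b + math.sin(norm) * t / norm
            for b, t in zip(base, tangent)]

def fisher_rao_distance(p, q):
    """Fisher-Rao distance between categorical distributions."""
    cos_theta = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return 2.0 * math.acos(max(-1.0, min(1.0, cos_theta)))

def karcher_mean(weights, dists, steps=100):
    """Fixed-point iteration: move the estimate along the weighted mean log-map."""
    roots = [[math.sqrt(x) for x in p] for p in dists]
    mean = [sum(w * r[k] for w, r in zip(weights, roots)) for k in range(len(roots[0]))]
    norm = math.sqrt(dot(mean, mean))
    mean = [m / norm for m in mean]          # initial guess, projected onto the sphere
    for _ in range(steps):
        logs = [sphere_log(mean, r) for r in roots]
        step = [sum(w * l[k] for w, l in zip(weights, logs)) for k in range(len(mean))]
        mean = sphere_exp(mean, step)
    return [m * m for m in mean]             # map back from the sphere to the simplex

p = [0.98, 0.01, 0.01]
q = [0.01, 0.01, 0.98]
karcher = karcher_mean([0.5, 0.5], [p, q])
arithmetic = [0.5 * a + 0.5 * b for a, b in zip(p, q)]

print("Karcher mean   :", [round(x, 4) for x in karcher])
print("arithmetic mean:", [round(x, 4) for x in arithmetic])
```

For two equally weighted parts the Karcher mean is the geodesic midpoint, exactly half the Fisher-Rao distance from each part, while the arithmetic mean sits slightly off the geodesic. The gap is small in three dimensions; the source's claim is that it becomes consequential in model weight space.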

Cluster-structured inputs: MoE provably learns cluster-structured regression (Kawata et al. 2025) where dense networks fail. The gated kernel m = Σᵢ gᵢ(x) fᵢ(x) is piecewise — each input routes to its cluster's specialist. Dense networks are forced to average across clusters; this is the hull-inside failure in ML. MoE escapes it structurally.
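A hand-built toy shows the difference between the gated kernel and an input-independent blend. Nothing here is learned: the experts, cluster centres, and gate sharpness are all hypothetical, and the fixed average is only a crude stand-in for a model that cannot route by cluster, not a reproduction of the setting in Kawata et al.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy: two clusters around x = -2 and x = +2, each with its own linear rule.
experts = [lambda x: -3.0 * x, lambda x: 2.0 * x + 1.0]
centres = [-2.0, 2.0]

def gate(x, sharpness=4.0):
    """Soft routing: higher score for the closer cluster centre."""
    return softmax([-sharpness * (x - c) ** 2 for c in centres])

def moe(x):
    """Gated kernel m(x) = sum_i g_i(x) * f_i(x)."""
    return sum(g * f(x) for g, f in zip(gate(x), experts))

def fixed_average(x):
    """Input-independent weights: the same blend of experts everywhere."""
    return sum(0.5 * f(x) for f in experts)

for x in (-2.0, 2.0):
    print(f"x={x:+.1f}  gated={moe(x):.3f}  fixed average={fixed_average(x):.3f}")
```

At each cluster centre the gated output matches that cluster's specialist (6 at x = −2, 5 at x = +2), while the fixed blend matches neither. The weights g(x) depend on the input, which is what the overview table means by non-linear kernels making weight effects input-dependent.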

The unified pattern: flat kernels fail when the data manifold is curved; curved kernels succeed; the carrier geometry determines which kernel is appropriate.

3. The carrier as a hidden design variable

Both source pages note the carrier changes the effective kernel. The forage adds a formal consequence: the carrier geometry determines what "mixing" even means. You cannot choose K independently of the space X that the parts live in.

| Carrier geometry | Correct kernel | Wrong kernel |
| --- | --- | --- |
| Euclidean (vector space) | arithmetic mean | none |
| Riemannian (smooth manifold) | Riemannian/Karcher mean | arithmetic mean |
| Probability simplex | Wasserstein barycenter or log-linear pool | arithmetic mix |
| Discrete (grammar, graph) | constrained mixture (code-switching, DeGroot) | unconstrained average |

The "carrier mismatch" failure mode in MIXING-GENERALIZED §8 is now precisely: using a flat kernel in a curved carrier.

4. What remains open

Why ~3 dominant components? MIXING-GENERALIZED §10 observes that "good" mixtures across domains tend to have ~3 dominant components. No formal grounding found in the forage. Candidate: Miller/Cowan working memory bounds × readout channel capacity. Not grounded here — one circuit to close in a future session.

Reactive mixing in social systems. MIXING-GENERALIZED §10 asks: what is the social analogue of 2H₂ + O₂ → 2H₂O? The forage found no paper on this. Schelling tipping / Granovetter threshold models are candidates.

Wasserstein mean of perceptual spaces. Does the Wasserstein mean of two smells produce a more natural intermediate odor than the arithmetic mean of their receptor activation vectors? Open POM allows this experiment (Lee et al. 2023, Science). Not done.

References (forage additions)

Full taste/smell/chemistry references are in MIXTURES; full ML/math references are in MIXING-GENERALIZED.

Bounoua, M., Franzese, G., & Michiardi, P. (2024). SΩI: Score-based O-information estimation. arXiv:2402.05667. — O-information as the synergy/redundancy scalar for mixing.
Wang, J., Ye, Z., & Yin, W. (2026). Functionality-oriented LLM merging on the Fisher–Rao manifold. arXiv:2603.04972. — Non-Euclidean model mixing beats Euclidean.
Yang, S., Shi, K., & Liu, W. (2026). Orthogonal model merging. arXiv:2602.05943. — Riemannian orthogonal merging prevents forgetting.
Zhu, J., et al. (2023). Interpolation for robust learning: data augmentation on geodesics. arXiv:2302.02092. — Wasserstein geodesic > linear Mixup for robustness.
Kawata, R., et al. (2025). Mixture of experts provably detect and learn the latent cluster structure. arXiv:2506.01656. — MoE hull-escape is structurally necessary for cluster data.
Liu, H., et al. (2023). Dataset distillation via the Wasserstein metric. arXiv:2311.18531. — Wasserstein barycenter as distribution-space mean.

Inspiration sources

MIXTURES.md and MIXING-GENERALIZED.md — the two source pages whose 113 shared salient terms (combo.py S553) surfaced this seam.
O-information literature (Timme et al. 2014; Williams & Beer 2010 PID; Bounoua et al. 2024) — the information-theoretic backbone.
The model-merging literature (2021–2026) — independent confirmation that the kernel choice is non-optional in curved spaces.