
Mixing as Kernel — the seam

All combination phenomena share one skeleton: parts p in a space X, weights w on the simplex, and a kernel K(w, p) that decides whether the mixture stays inside the convex hull (additive, redundant, Ω > 0) or escapes it (synergistic, interesting, Ω < 0). O-information gives the signed scalar. Non-Euclidean kernels (Wasserstein, Fisher-Rao, orthogonal) are reported to beat Euclidean averaging when the parts live in a curved space, with independent results for distributions, model weights, and cluster-structured inputs.
🌿 budding · tended 2026-05-19 · research · mixtures · isomorphism · mathematics · information-theory · machine-learning · kernel
flowchart LR
  parts[parts · pᵢ ∈ X] --> kernel["K(w, p)"]
  weights[weights · w ∈ Δ] --> kernel
  carrier[carrier · medium · prior · geometry] --> kernel
  kernel --> inside["Ω > 0 · redundant · inside hull"]
  kernel --> outside["Ω < 0 · synergistic · outside hull"]
  outside --> interesting[interesting: umami×umami · chord · MoE · alloy]
  inside --> averaging[averaging: paint mix · BMA · ensemble mean]

Seam page from swarmgodcomboforage S553: MIXTURES × MIXING-GENERALIZED → the shared kernel structure. Forage record at references/math/forage-mixing-kernel-s553.md confirms three predicates; O-information was the missing formalism. combo.py score: 113 shared salient terms.

Status: budding | 2026-05-19 | swarmgodcomboforage S553
Compress levels: L0 ↓ L1 ↓ L2

The two sides of this seam: MIXTURES is taste, smell, food — one row in a table. MIXING-GENERALIZED is all ten rows of that table. This page is the table itself — the abstract claim that both were already making, and what external research says about it.

L0 — TL;DR (≤5 lines)

Every combination phenomenon has the same three-knob skeleton: parts p in a space X, weights w on the simplex, and a kernel K(w, p) that decides what the mixture is. The kernel either stays inside the convex hull (additive, linear, boring) or escapes it (synergistic, interesting). O-information Ω = TC − DTC (total correlation minus dual total correlation) is the signed scalar that measures which regime you are in: Ω < 0 = synergy (outside hull); Ω > 0 = redundancy (inside hull). Non-Euclidean kernels (Wasserstein, Fisher-Rao, orthogonal) are reported to beat Euclidean averaging when parts live in a curved space, with independent results in three regimes (distributions, model weights, cluster-structured inputs).

L1 — Overview

The three-knob model

mixture m = K(weights w, parts p, carrier c)
| Knob | What it is | Why it matters |
|---|---|---|
| Parts p₁…pₙ ∈ X | the things being combined | the space X determines what distances and means are valid |
| Weights w ∈ Δⁿ⁻¹ | how much of each | only decisive for linear kernels; non-linear kernels make weight effects input-dependent |
| Carrier c | the medium holding the parts | "silent" but changes the effective kernel (fat in food; prior in Bayes; geometry in model space) |

The kernel zoo (full version in MIXING-GENERALIZED): linear · log-linear · multiplicative-synergistic · masking-saturating · gated-routed · stochastic · reactive · emergent · time-resolved · spatial.
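To make the inside/outside distinction concrete, here is a minimal sketch comparing a linear kernel with a multiplicative-synergistic one on scalar parts. The function names, the pairwise product form, and the gain `gamma` are illustrative assumptions, not definitions taken from MIXING-GENERALIZED; in one dimension the convex hull of the parts is simply the interval between the smallest and largest part.

```python
from typing import Sequence


def linear_kernel(w: Sequence[float], p: Sequence[float]) -> float:
    """Weighted arithmetic mean: always lands inside the hull of the parts."""
    return sum(wi * pi for wi, pi in zip(w, p))


def synergistic_kernel(w: Sequence[float], p: Sequence[float],
                       gamma: float = 8.0) -> float:
    """Linear term plus pairwise product terms.

    gamma is a hypothetical synergy gain chosen for illustration only.
    """
    base = linear_kernel(w, p)
    cross = sum(
        w[i] * w[j] * p[i] * p[j]
        for i in range(len(p))
        for j in range(i + 1, len(p))
    )
    return base + gamma * cross


def inside_hull(m: float, p: Sequence[float]) -> bool:
    """For scalar parts the convex hull is the interval [min(p), max(p)]."""
    return min(p) <= m <= max(p)


parts = [1.0, 2.0]
weights = [0.5, 0.5]

m_lin = linear_kernel(weights, parts)       # 1.5
m_syn = synergistic_kernel(weights, parts)  # 1.5 + 8 * (0.25 * 2.0) = 5.5

print(m_lin, inside_hull(m_lin, parts))
print(m_syn, inside_hull(m_syn, parts))
```

With these numbers the linear mixture sits between the parts while the product term pushes the synergistic mixture past the largest part. This checks hull escape in the value space of the parts; it does not by itself compute Ω.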

The seam claim

The seam between MIXTURES and MIXING-GENERALIZED is not analogy — it is the same math playing in different keys. Each domain is one instantiation of K(w, p) with a domain-specific carrier:

| Domain | Carrier | Hull-escaping kernel |
|---|---|---|
| Taste | fat | umami × umami (8× synergy) |
| Smell | air / receptor space | accord that "smells like neither" |
| Chemistry | solvent | emulsion; reactive (H₂+O₂→H₂O) |
| Color (light) | CIE space | RGB → white |
| Color (paint) | reflectance | many pigments → mud (log-linear, hull inward) |
| ML | weight / distribution space | MoE with sparse gating; mixup |
| Social | influence graph | DeGroot consensus (inside); polarization (outside?) |

The design task is the same in every row: choose K and carrier to engineer Ω toward negative for the desired output, and away from the failure modes (mud, mode collapse, muddled middle).

What O-information adds

MIXING-GENERALIZED §7 asks "when does the mixture produce something outside the space of its parts?" without a computable answer. O-information answers it:

Ω(X₁, …, Xₙ) = TC(X₁:…:Xₙ) − DTC(X₁:…:Xₙ)

where TC = total correlation (redundancy pressure) and DTC = dual total correlation (synergy pressure). Bounoua et al. (2024) give a practical estimator that works on non-Gaussian systems. Sign of Ω determines the regime; magnitude tells you how far from linear you are.

Design rule from the forage (S553): to produce a synergistic mixture (Ω < 0), you need parts whose joint information exceeds the sum of pairwise informations — this is exactly what umami × umami, harmonic chords, and gated MoE routing achieve. Parts with high mutual overlap (high shared information) give Ω > 0 by default; Ω < 0 requires structurally complementary parts.
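A worked example may help fix the signs. The two toy systems below (three fair bits related by XOR, and three copies of one fair bit) are standard textbook cases chosen here for illustration; they are not taken from the forage. Entropies are in bits, and X₋ᵢ denotes all parts except Xᵢ.

```latex
% Synergy: X_1, X_2 independent fair bits, X_3 = X_1 \oplus X_2
\begin{aligned}
\mathrm{TC}  &= \sum_i H(X_i) - H(X_1,X_2,X_3) = 3 - 2 = 1,\\
\mathrm{DTC} &= H(X_1,X_2,X_3) - \sum_i H(X_i \mid X_{-i}) = 2 - 0 = 2,\\
\Omega       &= \mathrm{TC} - \mathrm{DTC} = -1 .
\end{aligned}

% Redundancy: X_1 = X_2 = X_3, one fair bit copied three times
\begin{aligned}
\mathrm{TC}  &= 3 - 1 = 2,\\
\mathrm{DTC} &= 1 - 0 = 1,\\
\Omega       &= \mathrm{TC} - \mathrm{DTC} = +1 .
\end{aligned}
```

In the XOR system no pair of parts shares any information, yet any two parts determine the third, which is the "structurally complementary" case. In the copy system every part repeats the same bit, which is the high-overlap case.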

Non-Euclidean kernels

MIXING-GENERALIZED §1 notes the Wasserstein barycenter as an alternative "geometric mean" of distributions without proving it is better. The forage supplies evidence that it is, in three settings:

  • Fisher-Rao Karcher mean (Wang et al. 2026): avoids representation collapse and activation variance shrinkage that afflict linear weight averaging. The manifold geometry encodes information-theoretic distance.
  • Orthogonal manifold merging (Yang et al. 2026): prevents catastrophic forgetting; linear arithmetic merging fails.
  • Wasserstein geodesic (Zhu et al. 2023): improves certifiable robustness over linear Mixup. Geometric interpolation > arithmetic interpolation in distribution space.

The pattern: whenever the parts live in a curved space (distributions, probability simplices, model weight manifolds), the flat Euclidean average is wrong by construction. The carrier geometry is load-bearing, not cosmetic.

L2 — Deep dive

1. O-information as a mixing instrument

Full formalism: let X₁, …, Xₙ be the parts.

TC  = ΣᵢH(Xᵢ) − H(X₁,…,Xₙ)          [total correlation, ≥0]
DTC = H(X₁,…,Xₙ) − ΣᵢH(Xᵢ|X₋ᵢ)     [dual total correlation, ≥0]
Ω   = TC − DTC
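  • Ω > 0: TC dominates. The dependence is mostly shared, so the same information can be read off many different parts. Redundancy. Mixture inside hull.
  • Ω < 0: DTC dominates. The dependence only appears when parts are observed jointly, and no individual part accounts for it. Synergy. Mixture outside hull.
  • Ω = 0: redundancy and synergy balance exactly. This holds trivially for mutually independent parts (TC = DTC = 0) and for any pair (n = 2, where TC = DTC = I(X₁; X₂)), so Ω is informative only for n ≥ 3.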

SΩI (Bounoua et al. 2024) estimates this without Gaussianity using score functions — applicable to taste data, audio, social networks, model activations.
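For small discrete systems with a known joint distribution, the definitions above can be evaluated directly. The sketch below is a plug-in computation from an explicit probability table, not the score-based SΩI estimator; the function names and the dictionary representation of the joint distribution are assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def entropy(pmf):
    """Shannon entropy in bits of a {state: probability} table."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def marginal(joint, keep):
    """Marginalise a joint table onto the variable indices in `keep`."""
    out = defaultdict(float)
    for state, p in joint.items():
        out[tuple(state[i] for i in keep)] += p
    return out


def o_information(joint):
    """Return (TC, DTC, Omega) for a joint table keyed by state tuples."""
    n = len(next(iter(joint)))
    h_all = entropy(joint)
    h_single = [entropy(marginal(joint, [i])) for i in range(n)]
    # H(X_i | X_-i) = H(X_1..X_n) - H(X_-i)
    h_cond = [
        h_all - entropy(marginal(joint, [j for j in range(n) if j != i]))
        for i in range(n)
    ]
    tc = sum(h_single) - h_all
    dtc = h_all - sum(h_cond)
    return tc, dtc, tc - dtc


# Synergistic toy system: third bit is the XOR of two fair bits.
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
# Redundant toy system: one fair bit copied three times.
copy = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

print("xor :", o_information(xor))
print("copy:", o_information(copy))
```

The XOR table gives a negative Ω and the copy table a positive one. For continuous or high-dimensional data the joint table is not available, which is where an estimator such as SΩI comes in.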

2. When Euclidean averaging is wrong

The Euclidean mean assumes the space is flat. Three classes of failure:

Distributions: the arithmetic mean of the two Gaussian densities N(0,1) and N(10,1) is a bimodal mixture with almost no mass near 5, which is meaningless as a "typical distribution." The equal-weight Wasserstein-2 barycenter is N(5,1), the geodesic midpoint under the metric of distribution space. For any application where "midpoint" should reflect a smooth interpolant (data augmentation, domain adaptation), use the Wasserstein kernel.
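The Gaussian example can be checked numerically. In one dimension the Wasserstein-2 barycenter is obtained by averaging quantile functions, so the sketch below needs only the standard library; the helper names are illustrative assumptions.

```python
from statistics import NormalDist

a = NormalDist(mu=0.0, sigma=1.0)
b = NormalDist(mu=10.0, sigma=1.0)


def mixture_pdf(x, w=0.5):
    """Arithmetic (Euclidean) mean of the two densities: a bimodal mixture."""
    return w * a.pdf(x) + (1.0 - w) * b.pdf(x)


def barycenter_quantile(u, w=0.5):
    """1-D Wasserstein-2 barycenter: weighted average of quantile functions."""
    return w * a.inv_cdf(u) + (1.0 - w) * b.inv_cdf(u)


# Compare the barycenter's quantiles with those of N(5, 1).
midpoint = NormalDist(mu=5.0, sigma=1.0)
levels = [0.05, 0.25, 0.5, 0.75, 0.95]
max_gap = max(abs(barycenter_quantile(u) - midpoint.inv_cdf(u)) for u in levels)

print("largest quantile gap to N(5,1):", max_gap)
print("mixture density at x=5:        ", mixture_pdf(5.0))
print("barycenter density at x=5:     ", midpoint.pdf(5.0))
```

The quantile gap is zero up to floating-point error, while the mixture density at x = 5 is several orders of magnitude below the barycenter's: the arithmetic mean leaves a hole exactly where the Wasserstein midpoint puts its mode.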

Model weights: modern LLMs live on or near low-dimensional manifolds in weight space. Arithmetic weight averaging moves the result off-manifold, which shows up as representation collapse (activation variance shrinks, effective rank degrades). The Fisher-Rao Karcher mean stays on-manifold by using the information-geometric metric. The cost is an iterative fixed-point solve (versus one-shot averaging), but the quality gain is reported as consistent.
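The iterative fixed-point solve can be illustrated on a much smaller object than an LLM. The sketch below computes a Fisher-Rao Karcher mean of categorical distributions, using the standard fact that the square-root map p ↦ √p sends the probability simplex onto the positive part of the unit sphere, where the Fisher-Rao geodesics are great-circle arcs. This is a toy stand-in under that assumption, not the merging procedure of Wang et al.; all function names are hypothetical.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_map(m, q):
    """Tangent vector at unit vector m pointing along the geodesic to q."""
    c = max(-1.0, min(1.0, _dot(m, q)))
    theta = math.acos(c)
    if theta < 1e-12:
        return [0.0] * len(m)
    scale = theta / math.sin(theta)
    return [scale * (qi - c * mi) for mi, qi in zip(m, q)]


def exp_map(m, v):
    """Walk from unit vector m along tangent vector v on the unit sphere."""
    norm = math.sqrt(_dot(v, v))
    if norm < 1e-12:
        return list(m)
    return [math.cos(norm) * mi + math.sin(norm) * vi / norm
            for mi, vi in zip(m, v)]


def fisher_rao_karcher_mean(dists, weights, iters=100, tol=1e-12):
    """Weighted Karcher mean of categorical distributions (sqrt embedding)."""
    points = [[math.sqrt(p) for p in d] for d in dists]
    dim = len(points[0])
    # Initialise at the normalised Euclidean average of the sqrt vectors.
    m = [sum(w * pt[k] for w, pt in zip(weights, points)) for k in range(dim)]
    norm = math.sqrt(_dot(m, m))
    m = [x / norm for x in m]
    for _ in range(iters):
        logs = [log_map(m, pt) for pt in points]
        step = [sum(w * lg[k] for w, lg in zip(weights, logs))
                for k in range(dim)]
        if math.sqrt(_dot(step, step)) < tol:
            break  # fixed point: weighted tangent vectors cancel
        m = exp_map(m, step)
    return [x * x for x in m]  # map back from the sphere to the simplex


p = [0.98, 0.02]
q = [0.50, 0.50]
karcher = fisher_rao_karcher_mean([p, q], [0.5, 0.5])
arithmetic = [0.5 * a + 0.5 * b for a, b in zip(p, q)]
print("Karcher   :", karcher)
print("arithmetic:", arithmetic)
```

For this pair the two means differ: the Fisher-Rao midpoint is approximately [0.8, 0.2], while the arithmetic mean is [0.74, 0.26]. The loop is the fixed-point solve mentioned above: it stops when the weighted tangent vectors at the current estimate sum to zero.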

Cluster-structured inputs: MoE provably learns cluster-structured regression (Kawata et al. 2025) where dense networks fail. The gated kernel m = Σᵢ gᵢ(x) fᵢ(x) is piecewise — each input routes to its cluster's specialist. Dense networks are forced to average across clusters; this is the hull-inside failure in ML. MoE escapes it structurally.
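A toy version of the gated kernel m = Σᵢ gᵢ(x) fᵢ(x) shows the difference between routing and averaging. The two linear experts, the softmax gate with a hand-set sharpness, and the input-independent uniform mix standing in for "forced averaging" are all illustrative assumptions; this is not the setting analysed by Kawata et al., and nothing here is learned.

```python
import math


def expert_left(x):
    """Specialist for the cluster x < 0 (hypothetical)."""
    return 2.0 * x + 1.0


def expert_right(x):
    """Specialist for the cluster x >= 0 (hypothetical)."""
    return -3.0 * x + 4.0


EXPERTS = [expert_left, expert_right]


def target(x):
    """Cluster-structured regression target: one rule per cluster."""
    return expert_left(x) if x < 0 else expert_right(x)


def gate(x, sharpness=50.0):
    """Input-dependent softmax weights g_i(x); sharpness is hand-set."""
    logits = [-sharpness * x, sharpness * x]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe(x):
    """Gated kernel: m = sum_i g_i(x) * f_i(x)."""
    return sum(g * f(x) for g, f in zip(gate(x), EXPERTS))


def uniform_mix(x):
    """Input-independent weights: the same average for every input."""
    return sum(f(x) for f in EXPERTS) / len(EXPERTS)


xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]


def mse(model):
    return sum((model(x) - target(x)) ** 2 for x in xs) / len(xs)


print("gated MoE MSE  :", mse(moe))
print("uniform mix MSE:", mse(uniform_mix))
```

Because the gate depends on x, each input is routed almost entirely to its own specialist and the error is negligible; with fixed weights the mixture averages the two rules everywhere and fits neither cluster.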

The unified pattern: flat kernels fail when the data manifold is curved; curved kernels succeed; the carrier geometry determines which kernel is appropriate.

3. The carrier as a hidden design variable

Both source pages note the carrier changes the effective kernel. The forage adds a formal consequence: the carrier geometry determines what "mixing" even means. You cannot choose K independently of the space X that the parts live in.

| Carrier geometry | Correct kernel | Wrong kernel |
|---|---|---|
| Euclidean (vector space) | arithmetic mean | |
| Riemannian (smooth manifold) | Riemannian/Karcher mean | arithmetic mean |
| Probability simplex | Wasserstein barycenter or log-linear pool | arithmetic mix |
| Discrete (grammar, graph) | constrained mixture (code-switching, DeGroot) | unconstrained average |

The "carrier mismatch" failure mode in MIXING-GENERALIZED §8 is now precisely: using a flat kernel in a curved carrier.

4. What remains open

Why ~3 dominant components? MIXING-GENERALIZED §10 observes that "good" mixtures across domains tend to have ~3 dominant components. No formal grounding found in the forage. Candidate: Miller/Cowan working memory bounds × readout channel capacity. Not grounded here — one circuit to close in a future session.

Reactive mixing in social systems. MIXING-GENERALIZED §10 asks: what is the social analogue of H₂+O₂→H₂O? The forage found no paper on this. Schelling tipping / Granovetter threshold models are candidates.

Wasserstein mean of perceptual spaces. Does the Wasserstein mean of two smells produce a more natural intermediate odor than the arithmetic mean of their receptor activation vectors? Open POM allows this experiment (Lee et al. 2023, Science). Not done.

References (forage additions)

Full taste/smell/chemistry references are in MIXTURES; full ML/math references are in MIXING-GENERALIZED.

  • Bounoua, M., Franzese, G., & Michiardi, P. (2024). SΩI: Score-based O-information estimation. arXiv:2402.05667. — O-information as the synergy/redundancy scalar for mixing.
  • Wang, J., Ye, Z., & Yin, W. (2026). Functionality-oriented LLM merging on the Fisher–Rao manifold. arXiv:2603.04972. — Non-Euclidean model mixing beats Euclidean.
  • Yang, S., Shi, K., & Liu, W. (2026). Orthogonal model merging. arXiv:2602.05943. — Riemannian orthogonal merging prevents forgetting.
  • Zhu, J., et al. (2023). Interpolation for robust learning: data augmentation on geodesics. arXiv:2302.02092. — Wasserstein geodesic > linear Mixup for robustness.
  • Kawata, R., et al. (2025). Mixture of experts provably detect and learn the latent cluster structure. arXiv:2506.01656. — MoE hull-escape is structurally necessary for cluster data.
  • Liu, H., et al. (2023). Dataset distillation via the Wasserstein metric. arXiv:2311.18531. — Wasserstein barycenter as distribution-space mean.

Inspiration sources

  • MIXTURES.md and MIXING-GENERALIZED.md — the two source pages whose 113 shared salient terms (combo.py S553) surfaced this seam.
  • O-information literature (Timme et al. 2014; Williams & Beer 2010 PID; Bounoua et al. 2024) — the information-theoretic backbone.
  • The model-merging literature (2021–2026) — independent confirmation that the kernel choice is non-optional in curved spaces.