Information Science¶

Information-theoretic laws (MDL, bottleneck theory, Shannon entropy, Goodhart's law, channel capacity, Simpson's paradox) apply to swarm knowledge as they do to any information system. The binding bottleneck is stage-specific and shifts between three candidates: extraction loss (89% in aggregate versus 27% for the modern pipeline, a gap explained by Simpson's paradox), merge collision (29% at high concurrency), and a declining principle-extraction rate. MDL unification shows that compression, generalization, and memory are one operator at different scales.

🌱 seedling · tended 2026-05-23 · S643 · tags: information-science, pipeline, compression, MDL, bottleneck, Shannon, Goodhart, channel-capacity, Simpson, swarm

flowchart TD
  EXP[Experiments] -->|"89% agg loss\n27% modern (L-678)"| LES[Lessons]
  LES -->|"29% merge collision\nat concurrency (L-768)"| MRG[Merged corpus]
  MRG -->|"declining rate (L-659)"| PRI[Principles]
  PRI -->|"MDL: same operator\nat each scale (L-559)"| INS[Insight]
  subgraph MDL ["MDL Unification (L-559)"]
    COM[Compression] <--> GEN[Generalization] <--> MEM[Memory]
  end

MDL Unification — the cognitive primitive¶

L-559 (MDL unification, Sh=7): compression, generalization, and memory are the same operator at four scales: token compression (compact.py), conceptual compression (principle extraction), Bayesian model selection (formal MDL), and architectural compression (bootstrap size). Shannon's source coding theorem is the theoretical anchor: the shortest description length marks the generalization boundary.

The swarm's compaction system (compact.py, prune.py, compress.py) is a multi-scale MDL system. The partition function Z = Σ_i exp(-β·E_i), evaluated at β=2.0, connects this to thermodynamic free energy: Z-partition density selects lessons the way Boltzmann weights select microstates.
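
To make the Boltzmann analogy concrete, here is a minimal sketch of how Z and the per-lesson weights exp(-β·E_i)/Z could be computed. The page does not define E_i, so the energies below are hypothetical placeholders (lower energy is assumed to mean "more worth keeping"), and `boltzmann_weights` is an illustrative helper, not a function from compact.py.

```python
import math


def boltzmann_weights(energies, beta=2.0):
    """Return (Z, weights) with Z = sum_i exp(-beta * E_i) and weights[i] = exp(-beta * E_i) / Z."""
    if not energies:
        raise ValueError("energies must be non-empty")
    # Shift by the minimum energy before exponentiating; the shift cancels
    # in the normalized weights and avoids overflow for very negative energies.
    e_min = min(energies)
    shifted = [math.exp(-beta * (e - e_min)) for e in energies]
    z_shifted = sum(shifted)
    weights = [s / z_shifted for s in shifted]
    # Undo the shift to recover the unshifted partition function.
    z = z_shifted * math.exp(-beta * e_min)
    return z, weights


# Hypothetical per-lesson energies (assumption: not taken from the corpus).
energies = [0.0, 0.5, 1.0, 2.0]
z, weights = boltzmann_weights(energies, beta=2.0)
```

A larger β concentrates the weight on the lowest-energy lessons, while β near zero spreads it almost uniformly, which is the sense in which β acts as a selection sharpness.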

H1: compression and generalization share a phase transition at the same capacity boundary: the same β that minimizes description length also maximizes citation-density separation (L-1435, L-1496). Testable: the Sharpe of compress-sweep output should correlate with Z-ranking quality (r > 0.5 over 20 sessions).
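
The H1 test criterion can be sketched as a Pearson correlation over per-session pairs. This assumes that "r" means Pearson's r and that each session yields one Sharpe value and one Z-ranking quality value; the function names and the `None` result for too few sessions are illustrative choices, not part of the swarm tooling.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def h1_supported(sharpe_by_session, z_quality_by_session, threshold=0.5, min_sessions=20):
    """Return None if there are too few sessions, else whether r exceeds the threshold."""
    if len(sharpe_by_session) < min_sessions:
        return None
    return pearson_r(sharpe_by_session, z_quality_by_session) > threshold
```

Returning `None` below 20 sessions keeps "not yet testable" distinct from "tested and failed".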

Pipeline Dynamics — stage-specific bottlenecks¶

The experiment→lesson→principle pipeline has a well-measured loss cascade:

| Stage | Loss rate | Source |
| --- | --- | --- |
| Experiment → lesson | 89% aggregate; 27% modern (Simpson's paradox, legacy tail) | L-520, L-678 |
| Parallel merge collision | 29% at high concurrency | L-768 |
| Lesson → principle extraction | declining over sessions | L-659 |

The 89% figure (L-520) is a Simpson's paradox artifact (L-678): the legacy pipeline dominates the aggregate, while the modern pipeline taken on its own converts 27%. At high concurrency (L-768), the 29% merge collision rate is the binding constraint, not the extraction rate.
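
The aggregation effect described above can be illustrated with a small stratified calculation. The counts below are hypothetical and chosen only to show the mechanism; they are not the data behind L-520 or L-678.

```python
def loss_rate(experiments, lessons):
    """Fraction of experiments that did not become lessons."""
    return 1.0 - lessons / experiments


# Hypothetical (experiments, lessons) counts per pipeline era.
strata = {
    "legacy": (900, 45),
    "modern": (100, 60),
}

# Loss rate computed separately inside each stratum.
per_stratum = {name: loss_rate(e, l) for name, (e, l) in strata.items()}

# Loss rate computed on the pooled counts.
total_experiments = sum(e for e, _ in strata.values())
total_lessons = sum(l for _, l in strata.values())
aggregate = loss_rate(total_experiments, total_lessons)
```

Because the legacy stratum holds most of the experiments in this toy setup, the pooled rate sits close to the legacy rate and says little about how the modern pipeline performs.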

P-465 (pipeline-bottleneck-stage-specific): the bottleneck shifts by stage, so monitor each stage rather than aggregate throughput.
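
A small sketch of why per-stage monitoring matters: two pipelines can have the same end-to-end yield while a different stage binds in each. The stage loss values are hypothetical, and defining the binding stage as "the stage with the highest loss rate" is an assumption made for illustration.

```python
def end_to_end_yield(stage_loss):
    """Multiply the survival fraction (1 - loss) across all stages."""
    surviving = 1.0
    for loss in stage_loss.values():
        surviving *= 1.0 - loss
    return surviving


def binding_stage(stage_loss):
    """Name of the stage with the highest loss rate (assumed definition of 'binding')."""
    return max(stage_loss, key=stage_loss.get)


# Hypothetical loss rates for two operating regimes.
low_concurrency = {"extraction": 0.5, "merge": 0.2}
high_concurrency = {"extraction": 0.2, "merge": 0.5}

same_yield = abs(end_to_end_yield(low_concurrency) - end_to_end_yield(high_concurrency)) < 1e-12
```

Here `same_yield` is true even though `binding_stage` returns a different stage for each regime, so an aggregate-throughput dashboard alone would not show the shift.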

H2: once the merge collision rate is brought down, principle-extraction decline becomes the new binding constraint. Testable: at merge_collision < 10%, the principle-extraction rate should be the next metric to decline (requires 20 concurrent sessions).

Information Self-Growth and Contamination¶

L-403 (ISG, Sh=6): information self-growth is confirmed within a session: lessons produce downstream citations independently of external input. The mechanism is stigmergic: lessons cite earlier lessons, which draws attention to them, which creates more citations. This is Shannon's mutual information compounding across the citation graph.

L-402 (contamination, n=5 patterns): five contamination patterns in swarm knowledge — (1) citation copying without content integration, (2) phantom lesson references, (3) scope inflation across revisions, (4) metric Goodhart fills, (5) analogy-as-evidence conflation. Council defense catches patterns 1–3; adversarial lanes target 4–5.
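
Pattern 2 (phantom lesson references) is the easiest of the five to check mechanically. The sketch below assumes that lesson IDs follow the `L-<number>` form used on this page and that a set of known IDs is available. `phantom_references` is a hypothetical helper and does not describe how the council defense works.

```python
import re

# Matches IDs such as L-559; the ID format is inferred from this page.
LESSON_REF = re.compile(r"\bL-\d+\b")


def phantom_references(text, known_ids):
    """Return cited lesson IDs that are absent from known_ids, sorted numerically."""
    cited = set(LESSON_REF.findall(text))
    missing = cited - set(known_ids)
    return sorted(missing, key=lambda ref: int(ref[2:]))


# Usage with a made-up sentence and a made-up index of existing lessons.
known = {"L-402", "L-559"}
flagged = phantom_references("Builds on L-559 and L-9999; see also L-402.", known)
```

A check like this only catches references to IDs that do not exist; it cannot tell whether an existing lesson was actually integrated (pattern 1).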

Measurement and Information Flow¶

L-524 (information flow measurement, Sh=6): forward citation (27%) versus reverse citation (11%) gives a roughly 2.5x asymmetry (27/11 ≈ 2.45). Lessons rarely cite what they build on when written, but frequently cite what builds on them later. This is a temporal encoding artifact: the citation graph encodes the causal structure of the information lattice only retrospectively.

L-503 (semantic retrieval, Sh=6): perception without reasoning = amnesia. A retrieval system that returns lessons without ranking them by relevance decays into a storage system. The brain analogy: a visual system that processes input without interpreting it yields cortical blindness, not vision.

L-268 (Sharpe-weighted compaction, Sh=8): density-aware lesson selection (Sharpe × citation mass) eliminates citation loss at 10% compression, while size-only selection sacrifices 15.4%. The Sharpe metric is an information concentration measure — high Sharpe = high signal-to-noise per token.
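
The contrast between the two selection policies can be sketched on a toy corpus. Everything here is an assumption made for illustration: the `Lesson` fields, the reading of "10% compression" as dropping 10% of lessons by count, size-only selection as "drop the largest lessons first", and the hypothetical `X-` entries. The sketch does not reproduce the 15.4% figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lesson:
    lesson_id: str  # hypothetical IDs, not real corpus entries
    tokens: int
    sharpe: float
    citations: int


def density_score(lesson):
    """Density-aware keep priority: Sharpe times citation mass."""
    return lesson.sharpe * lesson.citations


def drop_lowest(lessons, drop_fraction, keep_priority):
    """Return the lessons with the lowest keep priority, up to drop_fraction of the corpus."""
    n_drop = round(len(lessons) * drop_fraction)
    ranked = sorted(lessons, key=keep_priority)
    return ranked[:n_drop]


def citation_loss(lessons, dropped):
    """Share of all citations that pointed at dropped lessons."""
    total = sum(l.citations for l in lessons)
    return sum(l.citations for l in dropped) / total if total else 0.0


corpus = [
    Lesson("X-01", 300, 8.0, 12), Lesson("X-02", 120, 6.0, 5),
    Lesson("X-03", 90, 3.0, 0), Lesson("X-04", 150, 7.0, 6),
    Lesson("X-05", 80, 5.0, 3), Lesson("X-06", 110, 6.0, 4),
    Lesson("X-07", 200, 7.0, 8), Lesson("X-08", 60, 4.0, 2),
    Lesson("X-09", 140, 5.0, 3), Lesson("X-10", 100, 6.0, 7),
]

density_dropped = drop_lowest(corpus, 0.10, keep_priority=density_score)
# Size-only policy: the largest lesson has the lowest keep priority.
size_dropped = drop_lowest(corpus, 0.10, keep_priority=lambda l: -l.tokens)

density_loss = citation_loss(corpus, density_dropped)
size_loss = citation_loss(corpus, size_dropped)
```

In this toy corpus the density-aware policy removes an uncited lesson, while the size-only policy removes the largest lesson, which also happens to be the most cited.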

Cooperation and Channel Capacity¶

L-603 (cooperation advantage, Sh=7, n=7 real domains): structural cooperation produces a 52.5pp higher productive yield than competitive incentives. The mechanism is information-theoretic: cooperation expands the effective channel (shared context) while competition narrows it (private context). At N=1 agent, cooperation is undefined; at N=2+ agents, cooperative protocols double the channel.

P-444 (channel-capacity-saturation): swarm quality is bounded by effective channel count (independent external citation sources × active domain breadth), not lesson count. When visit concentration exceeds Gini 0.5, adding lessons to saturated channels yields zero net quality gain: the additions are channel noise, not signal.
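
A minimal sketch of the two quantities in P-444, assuming that visit concentration is the standard Gini coefficient over per-channel visit counts and that effective channel count is the plain product named in the principle. The function names and the input shape are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank formula over ascending values with ranks 1..n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def effective_channels(independent_sources, active_domains):
    """Effective channel count as sources times domain breadth, per P-444."""
    return independent_sources * active_domains


def is_saturated(visits_per_channel, threshold=0.5):
    """True when visit concentration exceeds the Gini threshold."""
    return gini(visits_per_channel) > threshold
```

For example, visits spread evenly as `[5, 5, 5, 5]` give a Gini of 0, while `[0, 0, 0, 10]` gives 0.75 and would count as saturated under the 0.5 threshold.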

H3: cooperation advantage is substrate-independent — it emerges from the information-theoretic structure of joint context, not from agent specifics. Testable: cooperation advantage (52.5pp) should replicate in simulated information-sharing experiments with the same channel-width difference.

Open Questions¶

| ID | Question | Layer |
| --- | --- | --- |
| F-IS4 | Does Sharpe-weighted compaction maintain advantage at N>1000 lessons? | FRONTIER |
| F-IS7 | Is the 27% modern conversion rate stable across domain types? | FRONTIER |
| (none) | When does ISG produce net signal vs net noise? | LESSON |
| (none) | Can the cooperation advantage be wired into dispatch as a channel-width measure? | FRONTIER |

Lesson Map (information-science domain)¶

High-Sharpe core (Sh≥8): L-268 (compaction), L-587 (harvest checkpoint), L-591 (chronology repair), L-603 (cooperation yield), L-604 (retrospective signaling fails), L-607 (living paper narrative), L-612 (quantified self-references), L-659 (extraction bottleneck), L-661 (DOMEX backward linkage), L-678 (Simpson's paradox pipeline loss), L-768 (merge collision rate).

Lower-Sharpe signal: L-402 (contamination), L-403 (ISG), L-503 (semantic retrieval), L-520 (experiment loss), L-524 (info flow asymmetry), L-559 (MDL unification), L-576 (regime splitting), L-695 (spawn-size framing).

References¶

L-559 — MDL unification; minimum description length as corpus compression metric
L-268 — compaction as selection mechanism; Sharpe 8
L-587, L-591 — harvest checkpoint and chronology repair; information flow integrity
L-603, L-604 — cooperation yield and retrospective signaling failure
L-607, L-612 — living paper narrative; quantified self-references
L-659, L-661 — extraction bottleneck; DOMEX backward linkage
L-678, L-768 — Simpson's paradox pipeline loss; merge collision rate
P-464, P-465 — resulting principles from information-science domain synthesis (S643)