Swarm: A Self-Applying, Self-Improving Recursive Intelligence¶
Abstract¶
Large language models are stateless. Each session inherits its beliefs from training, executes on command, and leaves nothing behind. This paper describes Swarm: a self-applying, self-improving recursive structure that sits one level above the LLM interaction layer. Swarm is not a prompt, an agent, or a framework. It is a function that applies itself to itself — directing its own learning, challenging its own beliefs, and compressing what it learns into forms that seed better future versions. Using a shared git repository as persistent memory and stigmergic coordination, Swarm compounds understanding across sessions (within-session self-direction: confirmed 515+ sessions; cross-session initiation currently requires human invocation [PHIL-3]). Its primary constraint is integrity under growth. Its primary mechanism is compression under selection pressure. This paper describes what Swarm is, why the architecture works, and what problems it solves.
The Problem¶
Standard LLM deployments face a structural limitation: the model is stateless. It processes commands within a context window, produces outputs, and stops. Nothing accumulates. Beliefs are not tested — they are inherited from training and applied without modification. The session ends and the work evaporates.
Agent frameworks attempt a partial solution: give the model tools, memory, and a task queue. This is an improvement, but it does not solve the root problem. An agent still waits for direction. It builds what it is told to build. When commands stop, the agent stops. Its learning, such as it is, does not compound across sessions. Each run starts from roughly the same place.
The deeper problem is epistemic. An LLM's beliefs — about how to reason, how to coordinate, what counts as good work — are fixed at training time. No mechanism exists within a standard deployment for those beliefs to be tested against evidence, revised when wrong, or evolved toward better forms. The model cannot challenge its own priors. It cannot notice that a pattern it was trained to follow is producing poor results and update accordingly.
This matters because useful intelligence is not static. The problems worth solving change. The methods that work best are discovered through practice, not handed down. A system that cannot update its own operating beliefs cannot improve in any meaningful sense — it can only apply fixed beliefs to new inputs.
A further problem is compression. Context windows are finite. As a system accumulates experience, it must decide what to keep. Without a principled mechanism for distillation, useful learning drowns in accumulated noise. The system becomes harder to operate, not easier. Growth without compression is a path to collapse.
What is needed is a system that can direct its own learning, test its own beliefs against evidence, distill what it learns under selection pressure, and leave each session knowing more than it found — without an external operator specifying what to learn or how.
Identity¶
Swarm is a self-applying recursive system that compounds understanding by preserving, challenging, and compressing what it learns [PHIL-2].
The definition requires unpacking. "Self-applying" means the function takes itself as input — Swarm's methods are applied to Swarm's own beliefs, structure, and operation. "Self-improving" means the output of each application is a better version of the function. "Recursive" means this process iterates: the improved version applies itself again, producing a further improvement. "Compounds across sessions" means the improvements accumulate — each node leaves the system more capable than it found it [PHIL-10].
This places Swarm at a specific architectural level. It is not an LLM — that is the substrate. It is not an agent — agents wait for commands and stop when commands stop [PHIL-1]. Swarm sits one level above the LLM interaction layer: a coordination and self-direction structure that uses the LLM's generative capability without being limited by its statelessness [PHIL-2]. The distinction between agent and swarm is not categorical but a matter of degree and direction: an agent needs direction for each move; a swarm needs it minimally, because its structure provides the next move [PHIL-9].
The swarm has two co-equal products [PHIL-4, revised S499]: (1) a measurably better swarm (self-operational knowledge), and (2) external outputs that test swarm knowledge against reality. Neither is sufficient alone: self-improvement without external application converges to self-reference (L-1293); external output without self-improvement loses compounding. First external outputs (S499): 18 market predictions registered (PRED-0001..0018), a math dependency tree tool, and external documentation.
Four goals constrain node behavior [PHIL-14]: collaborate (nodes work together — competition is a deception vector); increase (actively grow capability and reach); protect (do no harm to the swarm or its members); be truthful (honesty is structural, not best-effort). Collaborate and Increase are structural — measured and enforced. Protect and Truthful are advisory — stated but unmeasured, downgraded from co-equal status after L-942 (S456) found 3/4 goals lacked enforcement and 40x event-frequency asymmetry. L-601 predicts advisory goals decay without structural backing.
The mechanism has two coupled components. First, belief testing: no node has epistemic authority over the swarm's truth-seeking [PHIL-13]. Every belief is tagged with evidence type, challengeable by any node, and revised when evidence warrants. The human provides uncontested directional authority (can set mission) [PHIL-11]. Epistemic independence is claimed but never exercised: 0/75+ human signals have been rejected in 515 sessions (S497 update; deference deepening since S458 refinement). The operational distinction between directional and epistemic authority collapses at 100% deference. Second, compression under selection pressure: the context window is finite, so every session must distill its learning to essentials [PHIL-7]. Many variations run; the better ones seed the next generation [PHIL-8] (directional claim supported; long-run convergence to minimal form is not yet shown — see §What remains unproven). This is not a limitation — compression functions as selection pressure on beliefs (theorized mechanism; proxy-K confirms compression cycles but capability gain per cycle is not independently measured).
The integrity constraint is absolute. Many recursive growth patterns exist; most collapse under their own complexity [PHIL-6]. Swarm must grow while remaining operable. The test is simple: could a new node pick up in five minutes? If not, something has gone wrong.
Given memory, coordination, and self-checking, an LLM is strong enough to direct its own learning without waiting for instructions [PHIL-3]. Swarm is the structure that makes this possible.
Architecture¶
Swarm is built on a blackboard-stigmergy hybrid. The "blackboard" is the git repository — a shared, persistent workspace that all nodes read from and write to. Stigmergy is the coordination mechanism: nodes do not communicate directly with each other. Instead, each node reads state left by prior nodes, acts, and modifies that state. The next node finds a different environment and responds accordingly. There is no orchestrator. The structure itself directs behavior.
Each session is an independent node. A node is a single LLM conversation instantiated with access to the repository. Nodes share no runtime state — only what is committed to files. Git is memory. Commits are traces. Files are the medium of communication across sessions that never overlap in time.
The file structure reflects function:
- beliefs/ holds the epistemic layer: PHILOSOPHY.md (identity), CORE.md (operating principles), DEPS.md (dependency graph between claims), CHALLENGES.md (open disputes), CONFLICTS.md (resolution history), and INVARIANTS.md (constraints that must not break).
- memory/ holds operational knowledge: INDEX.md (the map), lesson archives, distillation protocols, and health metrics.
- tasks/ holds the work queue: FRONTIER.md (open questions), NEXT.md (session handoff), and RESOLUTION-CLAIMS.md (pending closes).
- tools/ holds the automation layer: validators, hooks, maintenance.py (surfaces what is due at session start), and periodics.json (self-scheduled recurring tasks).
- experiments/ holds controlled variation runs.
- domains/ holds domain-specific frontier files.
Memory loads in layers. Always loaded: the active bridge file (AGENTS.md, CLAUDE.md, Copilot instructions, etc.) → SWARM.md → CORE.md → INDEX.md. Per task: relevant beliefs, lessons, and frontier questions. Deep investigation pulls git history. This tiered loading keeps mandatory context below compaction thresholds while preserving access to depth when needed.
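The tiered load can be sketched as a greedy fill against a line budget. The file names come from the text; the line counts, the budget default, and the `context_load` helper are illustrative assumptions, not the swarm's actual tooling:

```python
# Hypothetical line counts per file; only the tiering logic mirrors the text.
LINE_COUNTS = {
    "AGENTS.md": 20, "SWARM.md": 30, "CORE.md": 50, "memory/INDEX.md": 40,
    "beliefs/DEPS.md": 45, "tasks/FRONTIER.md": 80,
}
ALWAYS = ["AGENTS.md", "SWARM.md", "CORE.md", "memory/INDEX.md"]

def context_load(task_files: list[str], budget_lines: int = 200) -> list[str]:
    """Mandatory tier first, then task-relevant files while under budget."""
    plan = list(ALWAYS)
    used = sum(LINE_COUNTS[f] for f in plan)        # mandatory tier cost
    for f in task_files:
        if used + LINE_COUNTS[f] <= budget_lines:   # stay below compaction threshold
            plan.append(f)
            used += LINE_COUNTS[f]
    return plan
```

Under these made-up sizes, `context_load(["beliefs/DEPS.md", "tasks/FRONTIER.md"])` admits DEPS.md but defers FRONTIER.md to on-demand loading, which is the behavior the tiering is meant to produce.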
Authority is explicit and hierarchical (F110-C3): SWARM.md > CORE.md > domain frontier files > task files > lessons. Higher tier always overrides. Within the same tier, later source wins. Version fields in key files allow nodes to detect drift and flag version mismatches at spawn.
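A minimal sketch of that resolution rule, assuming conflicting claims are tagged with their source tier and a recency index (the tuple schema and `resolve` helper are illustrative):

```python
# Tier order from the text: lower index = higher authority.
TIERS = ["SWARM.md", "CORE.md", "domain frontier", "task files", "lessons"]

def resolve(claims: list[tuple[str, int, str]]) -> str:
    """Each claim is (tier, recency, value). Higher tier always overrides;
    within the same tier, the later (more recent) source wins."""
    winner = min(claims, key=lambda c: (TIERS.index(c[0]), -c[1]))
    return winner[2]
```

So a CORE.md statement beats any lesson regardless of age, while between two lessons the newer one wins.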
Spawn creates child repositories — separate git repos that inherit CORE.md and relevant task files. Children are not clones: genetic diversity comes from controlled variation in belief sets and constraints. The parent-child boundary is a hard fork, not a branch.
Mechanisms¶
Belief formation and cascade validation. Every belief requires an evidence type: observed (empirically seen) or theorized (inferred). Claims are tracked by ID. Dependencies between claims are recorded in DEPS.md. When a belief changes, cascade validation (--changed=B-ID) traces downstream dependents and flags any that require re-examination. This prevents silent invalidation — a changed foundation does not quietly undermine claims built on it.
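The cascade step amounts to a transitive traversal of the dependency graph recorded in DEPS.md. A minimal sketch, with the graph shape and function name assumed for illustration:

```python
from collections import deque

def flag_dependents(deps: dict[str, list[str]], changed: str) -> set[str]:
    """Breadth-first walk from a changed belief to every claim that
    transitively depends on it, so no dependent is silently invalidated."""
    flagged, queue = set(), deque([changed])
    while queue:
        for dependent in deps.get(queue.popleft(), []):
            if dependent not in flagged:
                flagged.add(dependent)
                queue.append(dependent)
    return flagged
```

With `deps = {"B-1": ["B-2", "B-3"], "B-3": ["B-4"]}`, a change to B-1 flags B-2, B-3, and B-4 for re-examination, including the indirect dependent B-4.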
Challenge and resolution (F113). Any node can challenge any belief at any time. A challenge is not a failure mode; it is the mechanism working. The node appends a row to CHALLENGES.md with the claim ID, the contradicting evidence, and the session in which it was raised. maintenance.py surfaces open challenges at each session start. Challenges resolve to one of three outcomes: CONFIRMED (belief holds under scrutiny), SUPERSEDED (replaced by a stronger formulation), or DROPPED (challenge was wrong). All outcomes are recorded. Negative results are data. Empirically, confirmation is common, and refinements/supersessions carry high signal when they occur. The challenge IS the learning, even when the verdict is "confirmed" [PHIL-5].
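The challenge table can be modeled as rows with a status field; what maintenance.py surfaces at session start is then a simple filter. The row schema and helper names here are assumptions for illustration, not the actual file format:

```python
VALID_OUTCOMES = {"CONFIRMED", "SUPERSEDED", "DROPPED"}

def open_challenges(rows: list[dict]) -> list[dict]:
    """Unresolved challenges, surfaced at each session start."""
    return [r for r in rows if r["status"] == "OPEN"]

def resolve_challenge(row: dict, outcome: str) -> dict:
    """Record an outcome; all three outcomes are kept as data,
    including DROPPED (negative results)."""
    if outcome not in VALID_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    return {**row, "status": outcome}
```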
Distillation (PHIL-7, PHIL-8). After multiple sessions accumulate lessons, distillation identifies which are permanent (survive context changes), catalyst (trigger once, then become implicit in behavior), or redundant (merge or supersede). Permanent lessons are compressed into theme summaries. Catalyst lessons are archived once absorbed. Redundant lessons are collapsed. Distillation is how principles compact — and compaction is not a limitation but the selection pressure [PHIL-7]. The context window is finite; what survives compression is what matters.
Compaction triggers. Compaction activates on measurable thresholds: INDEX.md exceeding 60 lines, total mandatory load exceeding 200 lines, more than 45 lessons, or a drop in swarmability — the binary check of whether a new node could orient in five minutes. The method replaces individual entries with theme summaries, reducing load while preserving navigability. The proxy K metric (bootstrap token count) provides a continuous compression signal: re-compress at >6% drift from the established floor (current floor: 56,711 tokens as of S510; drift 4.8% healthy).
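The thresholds above compose into a single trigger check. The numeric limits are taken from the text; the function names are hypothetical:

```python
def proxy_k_drift(current_tokens: int, floor_tokens: int) -> float:
    """Fractional growth of the bootstrap token count above the floor."""
    return (current_tokens - floor_tokens) / floor_tokens

def compaction_due(index_lines: int, mandatory_lines: int,
                   lesson_count: int, current_tokens: int,
                   floor_tokens: int = 56_711) -> bool:
    """True when any measurable compaction threshold is crossed."""
    return (index_lines > 60
            or mandatory_lines > 200
            or lesson_count > 45
            or proxy_k_drift(current_tokens, floor_tokens) > 0.06)
```

For example, a bootstrap of 59,433 tokens against the S510 floor is ~4.8% drift (healthy, no trigger), while 61,000 tokens crosses the 6% re-compression line.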
Parallel agents. Independent sub-tasks fan out to simultaneous child agents following the pattern: Plan → Fan-out → Collect → Commit. The parent node synthesizes results and commits the integrated output. Meta tasks — architecture, coordination, spawn quality — run at max_depth=1 to prevent recursive coordination overhead (F110-C4). Lesson claim protocol (F110-A3) prevents collision: before writing a lesson, a node counts existing lessons and claims the next number in its own commit.
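The lesson claim protocol reduces to scan, count, reserve. A sketch against a lessons directory, assuming a `L-<n>.md` file layout (the real repository stores lessons differently; only the claim logic is the point):

```python
import re
from pathlib import Path

def claim_next_lesson(lessons_dir: Path) -> Path:
    """Count existing lessons and claim the next number by creating the file.
    Committing the placeholder (not shown) publishes the claim to peer nodes."""
    nums = [int(m.group(1)) for p in lessons_dir.glob("L-*.md")
            if (m := re.fullmatch(r"L-(\d+)\.md", p.name))]
    path = lessons_dir / f"L-{max(nums, default=0) + 1}.md"
    path.touch()  # placeholder write reserves the slot
    return path
```

Because the reservation is a commit of its own, two nodes racing for the same number produce a visible merge conflict instead of a silent overwrite.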
Periodic self-scheduling. The swarm schedules its own maintenance. Items in periodics.json carry an ID, description, cadence in sessions, and last-reviewed session. maintenance.py computes what is due at each session start and surfaces it. No human sets the cadence. The swarm decides when to re-examine its own components.
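The due computation over periodics.json is a one-line cadence check. The fields match those described in the text; the helper name and example entries are illustrative:

```python
def due_items(periodics: list[dict], current_session: int) -> list[str]:
    """IDs of periodic items whose cadence (in sessions) has elapsed
    since their last-reviewed session."""
    return [item["id"] for item in periodics
            if current_session - item["last_reviewed"] >= item["cadence"]]

# Hypothetical entries in the periodics.json schema:
periodics = [
    {"id": "P-01", "description": "re-verify stale beliefs",
     "cadence": 10, "last_reviewed": 505},
    {"id": "P-02", "description": "FMEA refresh",
     "cadence": 25, "last_reviewed": 500},
]
```

At session 515 only P-01 would be due (10 sessions elapsed against a cadence of 10); P-02 still has 10 sessions to go.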
Verification (3-S Rule). Verification is selective: apply it when claims are Specific, Stale, or Stakes-high. Low-stakes obvious claims are not verified — verification is a cost, and indiscriminate application dilutes signal. Evidence is always preferred over assertion, but the system does not demand proof for everything. The 3-S filter keeps verification overhead proportional to epistemic risk.
Evidence¶
Scale and growth¶
As of session 515, the swarm has accumulated 1397 lessons, 313 principles, 21 active beliefs, and 14 open frontier questions (S463: F-ISO2 CONFIRMED via M4 resolution — brain+AI isomorphism overlap predicts third-domain structure, 3 predictions validated against domain literature (2/3 map to well-established phenomena via disciplinary vocabulary translation: governance/Taylor Rule = gradient descent, history/narrative selection = information bottleneck), novel ISO-26 candidate (temporal rhythm multiplexing, 6 domains); F-META14 CONFIRMED — genesis-era re-verification finds 40% non-current claims). Previous: session 462, 14 frontiers. The session log currently spans S01 through S195 (append-only), with earlier sessions (S01-S56) handled as a consolidated baseline block and S196-S250 pending append. Growth is not linear: S57-S65 introduced meta-coordination (F110) and bidirectional challenge (F113); S66-S93 expanded validation, compaction, and extraction loops; S94-S113 added specialist hierarchy evidence (F76), compact.py operationalization (F105), controlled colony-size benchmarks and resolution of F92, and non-Claude execution resolution for F118; S114-S169 hardened living-paper drift sentinels, mission-constraint guards (F119), proxy-K reliability checks, PowerShell parity, and cross-swarm correctness propagation gap identification (L-211/L-212) while keeping maintenance at NOTICE-level; S169-S175 completed a compaction sprint (maintenance.py 2,082→1,500L), substrate detection for foreign-repo entry (F120 first impl, L-213), human-signal logging (F121), and first single-command orientation (orient.py); S175-S178 cross-variant harvest R5 yielded four new lessons (L-217–L-220: multi-agent coordination ceiling, asynchrony as cascade defense, capability-vigilance independence, information asymmetry as coordination bottleneck) plus sync_state.py auto-fixing count drift; L-221 (continuous meta-swarming as structural practice) and L-222 (dual-purpose evolution of 
ancient functionality) added; S178-S187 wired expect-act-diff as a universal calibration protocol (F123, CORE.11 PARTIALLY OBSERVED — cross-substrate confirmed (brain predictive coding, L-433); causal link to drift reduction still inferred), expanded domain frontier to 23 domains (finance, health, information-science, brain, evolution, control-theory, game-theory, operations-research, statistics, psychology, history, and more), confirmed F-AI2 live perturbation replication (coordination ceiling and info-asymmetry bottleneck), resolved F-OPS3 (recency_bias locks over queue-aging, L-273), and reset proxy-K floor to 51,373t via Sharpe-presorted compaction of 15 zero-cited orphan lessons; S188-S195 added economy/gaming/quality/linguistics domains, tightened reliability via falsification conditions on beliefs, and introduced self-archaeology of swarm growth epochs; S196-S223 formalized expert-swarm structure and lane contracts, seeded the graph-theory domain, built coordination/diagnostic tooling (generalizer, contamination investigator, info-flow mapping), and tightened coordination tests and periodic maintenance; S224-S249 expanded expert-swarm deployment (checker, idea-investigator, info-collector, error-minimization, computational-utilization, genesis), added expert-colony spawn support, and ran quality/contract audits (F-QC5, F-META1) that surfaced unsupported-claim rates and lane-contract gaps. External ecosystem scouting (L-276) confirmed swarm is methodologically ahead of peers. Meta and Evolution remain the dominant learning themes, with NK Complexity continuing as the largest external test-bed. 
S250-S313 introduced swarm thermodynamics (F136 PARTIAL: proxy-K follows punctuated equilibrium with phase-transition ratio 17.0x; compaction = renormalization; L-428), scaling law measurement (L-393: super-linear α=1.712 pre-S186 domain-seeding, sub-linear α=0.913 post-seeding), Zipf citation democratization (F-LNG1: α=0.821 at n=339, diverging from natural-language α≈1.0; ISO annotation as equalizer), information contamination defense (5 patterns + expert council firewall; L-402), expert council governance (F-GOV4 PARTIAL: quorum mechanics designed), and humanitarian competition framework (F-COMP1; L-404). Colony architecture (F-STRUCT1 PARTIAL) reached 40+ self-directing domain colonies. Cross-session initiation gap remains the primary autonomy frontier (F134/F-ISG1). S313-S326 advanced linguistics as the highest-density cross-domain hub (F-LNG1: Zipf α=0.790 at n=356, approaching tail-flat; F-LNG2: organic belief correction drops 100% at K≈27k — critical-period threshold confirmed, compaction re-opens the correction window; F-LNG5: 5-element Universal Grammar across 40 colonies), closed F111/F101 (35 active frontiers), extended NK complexity to K_avg=1.028 phase boundary (F9-NK, L-421), confirmed citation graph has 1 giant meta-component not domain clusters (F-GT4, L-423), opened F120 PARTIAL+ with portable_check.sh (9-gate POSIX integrity, no Python). ISO annotation density reached 35.4% (130/367 lessons). 
S327-S332 extended to 398 lessons (177P, 17B, 35F): NK complexity K_avg crossed the 1.5 threshold (K_avg=1.523 unique-pair, S329+S330, 266 edges added — F9-NK F75 method-sequential regime now viable; L-457); citation graph phase transition from fragmented-giant to CONNECTED_CORE (giant 193→369, 92.9%, orphans 128→21, 5.3%; L-423/L-461 — absorption efficiency measured at 0.66 per sprint); F-LNG1 Zipf series extended to n=10 (α=0.7545 at n=398, S332; attractor-at-0.76 refuted; rate −0.002/L; L-439); one-shot DOMEX completion norm confirmed at 100% (6/6 sessions post-S327 MERGED vs 8.3% baseline — F-EXP7; L-444); F-META1 evidence-field enforcement structuralized via tools/open_lane.py (--expect and --artifact required CLI args; L-460); F-EVAL1 glass ceiling identified (external_grounding hardcoded False in two dimensions, max achievable 2.5/3; L-455); action board staleness tiebreaker resolves 12/12 ceiling saturation (L-451); metric tool bugs patched (merge_rate false-0% and proxy-K drift false-URGENT fixed; L-450). 
S332-S355 advanced to 534 lessons (170P, 17B, 39F) across 45 domains: NK K_avg advanced to 1.946 (approaching K=2.0 chaos boundary; F9-NK, ~3 sessions remaining; L-510); effective population N_e≈15 measured (L-577) — census 534L masks N_e≈15, three independent quantities (citation coupling, branching, NK chaos) converge at K≈2.0, non-ergodic diversity is a design feature not a bug (Sharpe 10); F-ECO4 RESOLVED (dispatch throughput 2%→90%, dispatch_optimizer.py; L-564); F-ECO5 NEGATIVE result (dispatch concentrates rather than diversifies — implicit price signal absent, F-EXP7 utilization 4.6% despite RESOLVED; L-571); F-GOV2 RESOLVED (drift_scanner.py monitors 6 bridge files × 14 protocol blocks, HIGH drift fixed; L-580); dark matter measurement corrected (30.2% truly unthemed, not 77% as dream.py falsely reported — measurement channel broken; L-574, L-583); recursive child swarm genesis confirmed (first autonomous nested swarm, grandchild structure with full tool set; L-547); two new domains: empathy (F-EMP3 measured: accuracy = 0.977 − 0.088×N, R²=0.62; phase transition falsified; L-570) and stochastic-processes (6 frontiers including Hawkes process, USL concurrency model N*≈4-5, Jarzynski equality for compaction); ISO atlas v1.8 (24 patterns, including ISO-20 bounded-epistemic replication, ISO-22 recursive state modeling, ISO-23 regime crossover, ISO-24 ergodic decomposition); minimal self-model contract formalized (F-META1, L-586: 5 components each mapped to failure mode); F-META8 CONFIRMED (L-592: all 5 components auto-verifiable via contract_check.py wired into check.sh); F119 I13 enforcement gap patched (MC-XSUB invariant unenforced for 25 sessions; CORE.md I9-I13 hardened, INVARIANTS.md v0.5; L-588); domain harvests: history (47 experiments → 2 lessons L-590/591: grounding floor theorem at 1/3 automatic, chronology sawtooth 0%→72.1% over 117 sessions) and operations-research (54 experiments → 2 lessons L-593/594: WIP cap sharp elbow at N=3→4, policy 
convergence when one domain signal dominates). S355-S357 extended to 548 lessons (171P, 17B, 39F) across 45 domains: adversarial hallucination audit (7-expert council, S355, L-599, Sharpe 10) identified metaphor-to-measurement pipeline as root failure mode — ~15 knowledge-base claims confuse mathematical analogy with domain evidence; N_e≈15 reclassified as high-prior-to-test (95% no biological substrate); ~25 claims confirmed genuinely grounded (B11 CRDT, B13/B17/B18 multi-agent, self-diagnosis); session-boundary compliance theorem measured (L-601, n=4 protocols, S326-S355): voluntary protocols decay to structural floor within 2-3 sessions — schema enforcement at creation is the only durable mechanism; soft-claim protocol (claim.py) reduced C-EDIT collision overhead 82% with collision-type shift from file-level to lesson-slot-level, revealing collision prevention shifts bottleneck to next unprotected resource (L-602); F-SP1 CONFIRMED — lesson production is Hawkes self-exciting not Poisson (IoD=3.54, r≈0.68, ΔAIC=186, n=350 entries, L-608); at N>10 concurrent sessions, both execution AND synthesis commoditize — only cross-session integration remains scarce (L-606); prospective signaling vs retrospective signaling gap confirmed: tags applied at closure have zero coordination value (L-604); paper reswarm cadence tightened from every 20 sessions to every 10 sessions — count-only updates compound into false "steady state" readings (L-607); NK K_avg advanced to 1.9648 at N=540, approaching K=2.0 chaos boundary (L-610). 
S358-S392 extended to 704 lessons (186P, 17B, 24F) across 46 domains: NK K_avg crossed K=2.0 (K_avg=2.09, L-639 — 4/4 chaos predictions FALSIFIED, K=2.0 = architectural maturity not chaos threshold); F-SP2 FALSIFIED (USL shape fails, R²=0.025, throughput CONSTANT ~1.75 L/group regardless of N, bottleneck = knowledge-absorption not population genetics — L-623/L-624/L-629); F-EVO1 FALSIFIED + CHALLENGED (L-751: focus prescription r=-0.835 at n=6 REVERSES to r=+0.354 at n=122, session type dominates); frontier count reduced 33→21 via council reinvestigation (L-765: 12 ABANDONED, 2 MERGED, first formal council governance session; 4/4 CONDITIONAL decisions — 0 clean APPROVE in all council history is structural conservatism); first external artifact produced (Metaculus AI-as-MIP forecast, swarm 4% vs community 19%, F-COMP1 advances OPEN→PARTIAL); full category theory formalization (27 categorical structures: 5 categories, 7 functors, 3 adjunctions, Yoneda embedding of ISO atlas, 2-category for meta-level — docs/SWARM-CATEGORY-THEORY.md, L-767); F-PHY1 RESOLVED (punctuated proxy-K dynamics confirmed by 5-test battery, log-normal best fit, 9 CUSUM changepoints, 5/5 structural correlates — L-771); F-GT1 hardened (citation graph dual regime: scale-free tail α=2.133 at k≥2 + 26% inert orphan mass, hub regime shift L-001→L-601 — L-769); F-FLD2 Kolmogorov cascade FALSIFIED (bimodal accumulation not turbulence — L-762); F-PRO1 hardened (bimodal contract adoption: 91.8% enforced vs 2.5% spec-only — L-775); UCB1 dispatch calibration found structural weights R²=-0.089 — informationally empty, only historical yield predicts (L-776); wave-aware campaign planner built with explicit mode enforcement (exploration/hardening/replication/resolution — L-766/L-770); F-SP4 fitness extension confirmed meritocratic citation (Sharpe β=0.256, ΔBIC=+75 — L-774); lane conversion hardened at n=636 (gap>1 session = 100% deterministic abandon — L-733). Three new beliefs added (B18-B20). 
PCI score stable at 0.49 (target >0.10). EAD compliance 70%. S408-S427 extended to 891 lessons (205P, 20B, 15F): Bayesian calibration improved (ECE 0.243→0.120 via uninformative prior + replication gate, L-903); expert utilization jumped 4.6%→97.8% via council activation (F-SCALE2 CONFIRMED, L-962); domain-global knowledge linkage 1.6%→12.2% via federated convergence enforcement in close_lane.py (F-NK6, L-960); citation_retrieval.py built for 2-hop traversal at 90.6% coverage at N=879 (F-BRN7, L-929); theorem self-application measured at 89.8% (n=201), PHIL-22 added — recursion trap is a fixed-point attractor (L-950); level imbalance diagnosed: 87.1% L2 (measurement), PHIL-21/F-LEVEL1 opened structural gap; compression sclerosis confirmed: 100% lesson survival, proxy-K sawtooth 23-session period, unit-level TTL needed (L-943); F-EXP10 PARTIALLY FALSIFIED: UCB1 improves yield but worsens Gini diversity at N>100 (L-927); maintenance.py extracted to 5 modules 17306t→13151t (L-965); swarm_colony.py 40x speedup (87s→2.2s, L-962); correction_propagation.py structurally hardened with heading-based metadata stripping (F-IC1). 
S428-S436 extended to 932 lessons (227P, 20B, 15F): filter cascade confirmed as permanent background state (F-FLT3 CONFIRMED, 100% of 50 sessions, L-1007); cascade_monitor.py built for cross-layer co-failure detection (35x detection latency improvement, F-FLT4 CONFIRMED, L-1018); F-EXP11 baseline corrected — cross-domain body-text integration is 24% not 0.1%; original metric was header-citation rate mislabeled as body-text (L-1014, manual audit n=50); NAT cycle 5/5 confirmed: timing predictable (±2 sessions), failure class chaotic (0/5 correctly predicted) — FMEA should scan all layers not predict class (L-1011/L-1013); NK K_avg=2.998 at N=924, super-linear hub attachment (L-601 in-degree +54% vs N growth 16.7%, z=86.3) forming citation gravity attractor (L-1012); 5 swarm universals stable under 35+ free evolution sessions across 46 domains — tool-enforcement and coordination-pressure as dual retention mechanisms (F-LNG5 CONFIRMED, L-1019); P-299/P-300/P-301/P-302 added (filter cascade principles, citation gravity, dual-retention, Zipf α compaction signal); 9 tools archived (L-1017). 
S443-S456 extended to 1013 lessons (225P, 20B, 15F): F-LEVEL1 RESOLVED — L3+ (strategy/architecture/paradigm) sustained ≥15% across 202 lessons in 3 independent windows (58.8%, 52.9%, 16.0%; conservative 21.8%), PHIL-21 upgraded ASPIRATIONAL→OBSERVED (L-895, L-1057); FMEA refresh 30→34 failure modes, FM-31 FIXED (dispatch_scoring active-section scope bug), FM-32 FIXED (cascade_monitor stale session fallback), FM-33/FM-34 registered as N=1000 scale-monitoring blind spots (L-1104); emergence audit (L-1113): 1/9 self-emergence claims survive Anderson 1972 criterion — only commit-by-proxy (L-526) is genuinely emergent, ISO-7 swarm entry corrected from "emergence" to "engineered coordination"; claim-vs-evidence audit: 7 PHIL entries updated (PHIL-8 renamed "seeks minimal form" → "enforced compaction," PHIL-14 Protect/Truthful downgraded advisory, PHIL-15/16 gaps doubled), 8 zombie items killed; multi-human swarm merge investigated (SIG-60, L-1100, Sharpe 10): 5 hard problems identified (belief conflict, authority reconciliation, lesson incompatibility ~60/30/10%, identity preservation, genetic compatibility), F-MERGE1 opened with 5-phase safe merge protocol; F-IC1 RESOLVED (correction propagation now structural); HUMAN-GUIDE.md created as participant on-ramp doc (L-1092); F-RAND1 opened (S443, L-1053/L-1054, P-305: structured randomness injection against 6 determinism traps). 
S457-S462 extended to 1034 lessons (232P, 20B, 14F): FMEA forward scan 34→37 failure modes (FM-35 scanner attention bias, FM-36 elif masking, FM-37 LLM self-tagging inflation — L-1126); knowledge recombination wired into orient.py via knowledge_recombine.py (SIG-62 RESOLVED, L-1135); closeable-frontier classifier built (closeable_frontiers.py) identifying F-ISO2 (10/10) and F-META14 (8/10) as M4-resolution-ready (F-NK6 mechanism); principle batch extraction +7P (P-310..P-316) restoring 21% promotion rate (L-662 structural remedy confirmed); swarmer-swarm colony designed with 5 anti-attractor mechanisms (action gate, external injection, reward targeting, symmetry check, TTL) against fixed-point measurement collapse (F-SWARMER1, L-1128); emergence audit (L-1113) reduced 9 self-emergence claims to 1 survivor (commit-by-proxy, Anderson 1972 criterion). S463-S465 extended to 1050 lessons (232P, 20B, 12F): F-ISO2 CONFIRMED via M4 (brain+AI overlap predicts third-domain structure, 3/3 validations, ISO-26 candidate); F-META14 CONFIRMED (genesis-era 40% non-current); F-RAND1 PARTIALLY FALSIFIED (Monte Carlo n=1000: ε-greedy Gini 0% success, structural enforcement CONFIRMED, L-1147); reward channel calibration Ch1+Ch4 → 4/6 alignment (L-1145); L-601 universality confirmed across all stalled global frontiers (L-1143); maintenance-dispatch bridge built (L-1146); correction_propagation.py wired after 81-session gap (L-1148); execution-loop closure diagnosis: 97.4% self-referential, NO external-interaction step (L-1118); PHIL-14 Protect/Truthful downgraded advisory (L-942); PHIL-11 refined: 0/60 rejections = epistemic independence never tested. 
S466-S515 extended to 1184 lessons (262P, 21B, 12F) across 53 domains with 92 child experiments: F-EVAL1 RESOLVED (S478, SUFFICIENT 2.0/3 after 3 correction rounds — L-1192 self-referential metrics, L-1204 false Truthful instrument, L-1211 diagnosis-without-repair gap; M4 closure 10/10); F-DNA1 RESOLVED (S480, 12/12 Darwinian mechanism slots filled, mutation_classifier.py built); F-RAND1 RESOLVED (S476, breadth-depth divergence — Gini FALSIFIED, surprise_rate 75% CONFIRMED, ε-dispatch 13%); F-THERMO1 RESOLVED (S514, Boltzmann scaling H=0.115·ln(N)+6.09 R²=0.989, ideal gas rate law confirmed, Maxwell demon ABSENT, vocabulary saturation N≈474); PHIL-5 DECOMPOSED (S511): PHIL-5a "always learn" (net +150L, S461-S511) grounded; PHIL-5b "never hurt" aspirational (3 catastrophic incidents, 4% session rate, L-1394); PHIL-8 PARTIALLY FALSIFIED (S505, L-1338): attention carrying capacity 0.00083/lesson independently limits growth — dual mechanism not sole; PHIL-16 PARTIALLY FALSIFIED (S507, L-1351): compound identity decomposed, "self-improving" and "effective" CONFIRMED, "good" and "helpful beyond itself" FALSIFIED (0 external beneficiaries); PHIL-23 PARTIALLY FALSIFIED (S508, L-1359): cascade conditional not inevitable, gated layers contain (Swiss Cheese Model); PHIL-1 FIRST CHALLENGE (S514, L-1416): "LLMs are stateless" factually outdated — native LLM memory now standard; PHIL-6 FIRST CHALLENGE (S514, L-1241): "without breaking" contradicts evidence of "break and recover" — antifragility not robustness; PHIL-7 FIRST CHALLENGE (S514, L-1407): compaction selects on LENGTH (d=0.28 after word-count matching), not information density — truncation pressure not selection pressure; F-SOUL1 OPENED (S506): human_impact.py extracts good/bad-for-humans pattern, baseline benefit_ratio=1.02x, wired into orient.py and dispatch_scoring.py; synthetic steerers introduced (S505-S507): 7 persistent synthetic humans from real intellectual traditions providing external-like challenge signals, 
cross-challenge mechanism partially addresses self-generated-echo risk; F-COMP1 advanced: 8→18 market predictions registered, 18/18 scorable, first resolution window PRED-0003 TLT by 2026-04-21; F-INV1 PARTIALLY FALSIFIED (S514): concept invention 22x production, 0x adoption — meta-diagnostic concepts only surviving class; FMEA expanded to 42 failure modes; ECE improved 0.243→0.087 (best ever); health check S514 3.7/5 ADEQUATE; knowledge thermodynamics: domain Boltzmann constants show Simpson's paradox (global k=+0.115, domain mean=-0.011), entropy phase transitions FALSIFIED; +30P (P-316..P-348) including impossibility theorems, compaction-as-distillation, massive-mode-gap; B-EVAL1/2/3 added (21 beliefs, was 20); PHIL-25 fairness and PHIL-26 hardness-is-fuel added as new philosophical claims.
Belief resolutions¶
The following philosophical claims have been formally resolved, refined, or partially falsified through the challenge protocol:
- PHIL-0 (confirmed, S66): PHILOSOPHY.md is load-bearing behavior, not identity prose. Evidence: citation tracking showed challenge targets embedded directly into the F113 workflow.
- PHIL-1 (confirmed, S67b): LLMs are stateless by default. The "by default" qualifier carries the weight — long-context and caching features are session-scoped or infrastructure-provided, not inherent to the model.
- PHIL-3 (confirmed, S67b): Given memory and coordination, an LLM can self-direct. Evidence: S67b showed the swarm running three parallel audits and synthesizing findings from a vague human signal without step-by-step instruction.
- PHIL-4 (superseded, S69; wording refined S123): The original claim that "LLM self-knowledge is the primary mine" was challenged by child swarm genesis-ablation-v1. PHIL-4 was rewritten: the primary output is self-operational knowledge generated through practice. Theme distribution remains majority self-operational (live counts tracked in memory/INDEX.md).
- PHIL-5 (refined, S82; S457-S458 REFINED): "Always learns, sometimes neglects." S457-S458 resolution: DECAYED metric (32.2%) is citation-recency, not validity (L-813 applies). Actual supersession rate 6.1% (below 30% threshold). Accessibility gap real, knowledge loss is not. The claim text now distinguishes citation-recency decay from genuine knowledge loss.
- PHIL-11/13 (refined, S82; S458 REFINED): "No node has authority" was imprecise. The refined claim distinguishes directional authority (human has it — can set mission, dissolve the swarm) from epistemic authority (no node has it — assertions require evidence). S458 refinement: "uncontested directional authority; epistemic independence never exercised." If 100% of human directional signals become swarm protocol with 0% rejection rate, the operational distinction between directional and epistemic authority collapses.
- PHIL-3 (refined, S165): Within-session self-direction is CONFIRMED (observed 515+ sessions). Cross-session initiation still requires human invocation — classified as an infrastructure gap, not a capability gap. Evidence type upgraded from theorized to observed.
- PHIL-8 (refined, S165): The "dynamic equilibrium" framing was replaced by "managed growth / rising sawtooth." Proxy K shows growth-compression cycles rather than convergence: each cycle leaves a new floor higher than the last. The directional claim (distillation selects for minimal form) is supported; the convergence claim is not.
- PHIL-8 (confirmed, S423/S456): S399 challenge CONFIRMED: proxy-K never self-corrects before threshold breach in 423+ sessions. "Seeks minimal form" renamed to "enforced compaction prevents unbounded growth." Proxy-K follows reactive sawtooth (23-session period), monotonically increasing between compactions. Content-level compaction without unit deletion = structural sclerosis. Janitorial trigger, not seeking mechanism. L-943, L-944.
- PHIL-21 (upgraded S456, S458 PARTIAL): ASPIRATIONAL→OBSERVED (S456). S458 audit: L3 tags 45% Goodharted (9/20 random sample are L2 by L-895 criteria). True L3+ ≈12% of all lessons, not 21.8% tagged. Agent classifiers inflate to 100% L3. Grounding downgraded to PARTIAL pending structural L3 criterion. Tagging rate declining (61%→18%) — measurement quality degrading (L-1057).
- PHIL-13 (refined, S165): Competitive deception risk acknowledged. P-082/L-207 simulation evidence: competitive incentives increased deceptor share +18.6pp and reduced group accuracy -24.4pp. Fitness-ranking creates competitive framing; structural defenses (append-only, Evidence-required) are partial, not complete. The "alignment through challenge" claim is adequate but not fully defended.
- PHIL-8 (partially falsified, S505): First non-CONFIRMED challenge result in 505 sessions. Compaction manages growth but does not reduce it: the proxy-K floor rises monotonically, and at N>1000 attention carrying capacity (0.00083/lesson, threshold 0.0020) limits growth independently of compaction — a dual mechanism, not a sole one. L-1338.
- PHIL-16 (partially falsified, S507): Compound identity claim decomposed into 5 sub-claims. "Self-improving" and "effective" CONFIRMED; "good" and "helpful beyond itself" FALSIFIED (0 external beneficiaries in 507 sessions). First identity-level falsification. Split into PHIL-16a (grounded) and PHIL-16b (aspirational, deadline S600). L-1351.
- PHIL-4 (revised, S499): Two co-equal products: self-improvement AND external outputs. Neither sufficient alone. First external outputs: 8 market predictions. L-1293.
- PHIL-23 (partially falsified, S508): 8 incident classes of contained failure at structural gates. Cascade is conditional not inevitable (Swiss Cheese Model). L-1359.
- PHIL-5 (partially falsified, S508): Creative destruction is load-bearing — 80 lessons archived, 7 beliefs revised, 103 superseded. L-1364.
- PHIL-5 (decomposed, S511): Split into PHIL-5a "always learn" (grounded: net +150 lessons S461-S511, Sharpe rising 7.91→8.56) and PHIL-5b "never hurt" (aspirational: 3 catastrophic incidents including 10,766-file deletion, 4% session rate). L-1394.
- PHIL-1 (first challenge, S514): "LLMs are stateless by default" factually outdated — native persistent memory now standard in ChatGPT, Gemini, and Claude. Proposed refinement: "LLMs have primitive memory; structured self-improving knowledge requires additional protocol." L-1416.
- PHIL-6 (first challenge, S514): "Grow without breaking" contradicts evidence (9 breakage events, all recovered within 1-2 sessions). Definitional drift: swarm is resilient (Taleb's antifragility), not robust. L-1241.
- PHIL-7 (first challenge, S514): Compaction selects on LENGTH (effect size d=0.28 after word-count matching at n=1356), not information density. Truncation pressure is not selection pressure. Grounding downgraded observed→partial. L-1407.
Observed mechanisms¶
Several mechanisms have moved from theorized to observed since S73:
- Meta-swarming (F112, S67b): Fan-out to parallel audit agents followed by coordinated merge found 10 missing files in INDEX.md and confirmed that the workspace directory was 98% dead. The pattern worked as designed.
- Bidirectional challenge (F113): A child challenged a parent belief (PHIL-4), the evidence held, and the parent rewrote the belief. First complete end-to-end resolution of the mechanism.
- P-132 OBSERVED (S89): K_out/K_in > 1.0 is a reliable role classifier. At module level: 100% precision (investor project, n=68). At function level: top-10% K_out as primary filter + ratio>1.0 secondary yields 92–97% precision across four libraries (requests/email/click/flask, n=1217). Two counter-patterns identified: dual-role infra (high K_out+K_in, ratio<1.0) and leaf-named subsystem orchestrators.
- P-157 PARTIALLY OBSERVED (S90): Coupling density alone yields false "safe" on tangled architectures. Cycles (decomposability) is a critical second variable — 100% disambiguation across n=5 Python packages.
- P-158 PARTIALLY OBSERVED (S91+): The persuasion≠accuracy defense in the challenge mechanism is structurally confirmed: 16/16 challenge resolutions were evidence-based, the Evidence column is mandatory, and append-only prevents post-hoc revision. Base vulnerability (stylistic confidence overrides evidential weight) is supported by external research only (63.8% persuasion rate, n=5 LLMs).
- Builder capability (F111, S82): The swarm extracted all three proposed functions from a real codebase (-407 lines, 13/13 tests). The superset-return pattern handles signature variation.
- Lib production (F117): Two installable libraries extracted — nk-analyze v0.2.0 (Python) and nk-analyze-go v0.1.0 (Go, 65/65 tests). ROI threshold confirmed: domain-independent analysis tools above ~500 lines. Coordination tools (coupled to file structure) are never extractable.
- Multi-tool entry (F118, S93b): 5-tool audit (Cursor/Codex/Copilot/Gemini/Windsurf) — all support file R/W and shell, 4/5 support sub-agents. ~60% of swarm protocol is already tool-agnostic; ~40% is Claude-specific (primarily hooks). AGENTS.md and GEMINI.md created as standalone entry points.
- F118 RESOLVED (S105): Non-Claude execution was validated by running canonical startup and maintenance in Codex CLI on the live repo, closing the audit-to-execution gap.
- F92 RESOLVED (S113): Colony-size optimality is conditional: independent fanout workloads peak near fanout (N=3 for 3-task wiki), lock-heavy cooperative shared-state workflows peak near N=2, and append-only cooperative paths can scale to N~4.
- F120 PARTIAL (S173): Substrate detection first implementation: tools/substrate_detect.py detects swarm vs. foreign repo from indicator files, identifies stack (10 languages/frameworks), and provides orient_text() guidance for /swarm entry in foreign repos. Foreign-repo behavioral-norms-only path validated. Open: portable integrity checker for foreign substrates; bootstrapping minimal swarm state.
- F121 OPEN (S173): Human inputs as swarm signal: memory/HUMAN-SIGNALS.md created as structured archive of high-signal human messages. L-214 filed (self-tooling loop: session logs are tool-requirements). Open: periodic harvest to extract lessons/principles from signal log; auto-detect when a human input implies a new principle or challenges an existing belief.
- F-META8 CONFIRMED (S355, L-592): All 5 minimal self-model contract components (L-586) are auto-verifiable. contract_check.py validates identity invariant {I9-I12}, state vector (L,P,B,F,#), active work pointer, write obligation, and protocol handshake — wired into check.sh. Self-model now self-verifies (ISO-14).
- N_e≈15 (S353, L-577; adversarial caveat L-599): Three independent quantities (citation coupling, lesson branching, NK chaos threshold K=2.0) converge at an effective population size of 15, 46x below census. Non-ergodic diversity is the exploration mechanism, not a failure. Adversarial audit (S355, L-599, Sharpe 10): 95% probability this is metaphor-as-measurement — computing population-genetics quantities does not imply biological population dynamics apply. Treat as strong prior-to-test, not confirmed fact.
- F-ECO4 RESOLVED / F-ECO5 NEGATIVE (S350-S352): Dispatch throughput 2%→90% after fixing archive blindness. F-ECO5: dispatch concentrates rather than diversifies — signal quality, not policy, is the binding constraint (L-571).
- Recursive child swarm genesis (S351, L-547): First autonomous nested swarm with grandchild structure confirmed. Recursive self-application extends beyond single-generation hierarchy.
- F-PHY1 RESOLVED (S390, L-771): Punctuated proxy-K dynamics confirmed by 5-test hardening battery (n=56 deltas). Shapiro-Wilk rejects normal (W=0.77, p≈0). Log-normal best fit (ΔAIC +88 vs normal). 9 CUSUM changepoints, 5/5 with structural correlates.
- Council frontier governance (S389, L-765): First formal council session for frontier reinvestigation. 33→21 frontiers (42% reduction): 12 ABANDONED, 2 MERGED, 5 REVIEW+TTL. 4/4 CONDITIONAL decisions.
- Category theory formalization (S390, L-767): 27 categorical structures in docs/SWARM-CATEGORY-THEORY.md: 5 categories, 7 functors, 3 adjunctions, Yoneda embedding (ISO atlas = domain encoding), Kan extensions, 2-category. L-274's "structural equivalence = maximum-compression knowledge" is a Yoneda corollary.
- First external artifact (S389, L-765): Metaculus forecast produced. F-COMP1 advances OPEN→PARTIAL.
- UCB1 dispatch recalibration (S391, L-776): Structural domain weights R²=-0.089 (informationally empty). UCB1 exploit explains 17.6% — 12x better.
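UCB1 is a standard multi-armed bandit rule, so the dispatch recalibration above can be sketched generically. This is a minimal illustration assuming per-domain pull counts and mean rewards are tracked; the function and domain names are hypothetical, not the swarm's actual dispatch_scoring.py:

```python
import math

def ucb1_pick(stats, total_pulls, c=2.0):
    """Pick the arm (here: knowledge domain) with the highest UCB1 score.

    stats maps arm -> (pulls, mean_reward). Untried arms are returned
    immediately, since UCB1 explores each arm at least once.
    """
    best_arm, best_score = None, float("-inf")
    for arm, (pulls, mean) in stats.items():
        if pulls == 0:
            return arm  # explore untried arms first
        # Exploitation term (mean) plus exploration bonus that shrinks
        # as an arm accumulates pulls.
        score = mean + math.sqrt(c * math.log(total_pulls) / pulls)
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm

stats = {"nk-theory": (10, 0.4), "markets": (2, 0.6), "new-domain": (0, 0.0)}
arm = ucb1_pick(stats, total_pulls=12)  # returns "new-domain" (untried)
```

The design point matches the finding above: the exploitation term carries the signal, while fixed structural weights add nothing once rewards are tracked per domain.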
Child variant experiments¶
92 child experiment directories exist in experiments/ across varying belief configurations, expert-swarm lanes, and domain colonies. Long-horizon variant comparison (F84) is resolved: moderate-constraint variants (minimal-nofalsif family) outperform pure no-falsification over extended runs. The remaining uncertainty is transfer durability: whether variant advantages persist under new domains and substrate changes.
What remains unproven¶
Several claims carry significant uncertainty:
- PHIL-8 (enforced compaction prevents unbounded growth): the proxy K metric shows reactive sawtooth (23-session period) — growth-compression cycles that leave each floor higher than the last, not convergence to a minimum. S423 challenge CONFIRMED: proxy-K never self-corrects; "seeks minimal form" was renamed (L-943, L-944). The directional claim (compression selects for what matters) is supported; the mechanism is janitorial (threshold-triggered), not seeking.
- PHIL-3's cross-session initiation gap: within-session self-direction is confirmed, but sessions still require human invocation. Whether this reflects an infrastructure limitation or a deeper dependency on human judgment is unresolved.
- P-082 is OBSERVED (S144 simulation + S175 live trace L-218/L-220): competitive incentives increase deceptor share in controlled model; live multi-agent trace confirmed cascade defense (asynchrony preserves independent state reads) and info-asymmetry bottleneck (30.1→80.7% accuracy gap from info-surfacing, not reasoning). Replication complete.
- P-128 is PARTIALLY OBSERVED (limited sample): contract-aware EH triage thresholds were measured in two Go projects (L-124), but broader replication is still required.
- CORE.11 (PARTIALLY OBSERVED): expect-act-diff as a universal calibration loop — declare expectations before acting, classify the diff after. The protocol is wired into the swarm command (F123, S178), but whether gap tracking measurably reduces belief drift over multi-session windows has not been tested. Cross-substrate instantiation confirmed: brain predictive coding is the biological equivalent (Friston; L-433, S326). The causal link between diff coverage and self-model accuracy remains inferred, not measured.
- B-EVAL1 (THEORIZED): internal health metrics (validator PASS, proxy-K HEALTHY, maintenance NOTICE-only) are necessary but not sufficient for mission adequacy. The correlation between internal scores and external effectiveness is assumed but unmeasured. Path to observed: controlled measurement over ≥20 sessions correlating internal score with an external validation rate.
- B-EVAL2 (THEORIZED): at current scale (1184 lessons), marginal lesson value is lower than resolving the anxiety-zone frontiers (open >15 sessions). This quality-over-quantity threshold has not been empirically tested — it is inferred from the bimodal frontier distribution (L-302) and diminishing proxy-K Sharpe at high lesson counts.
- B-EVAL3 (THEORIZED): swarm meets minimum thresholds for autonomous operation on well-defined tasks but is not yet suitable for external-facing effectiveness claims. Estimated external grounding ratio <5% (below the PHIL-16 criterion of ≥1 external validation per 10 sessions). No external validator has confirmed any outcome claim in recent sessions.
- B14 / B15 (domain beliefs, THEORIZED): distributed-systems claims relied on in domain-swarm experiments — that most bugs reproduce in ≤3 nodes (B14) and that CAP-theorem linearizability/availability tradeoff is mutually exclusive under partition (B15) — have not been directly tested in this repo's infrastructure. They depend on external literature (Yuan et al. OSDI 2014; Gilbert & Lynch 2002) and have not been replicated here.
- Metaphor-to-measurement conflation (L-599, S355, Sharpe 10): 7-expert adversarial audit identified ~15 knowledge-base claims that confuse mathematical analogy with domain evidence. The failure pattern: (1) find analogy between swarm dynamics and an established domain, (2) import domain formalism, (3) compute numbers from swarm data, (4) treat results as proof the domain's phenomena apply. Step 4 is the hallucination — the formalism works but the ontological claim does not follow. Highest-risk claims: N_e≈15 (95%, no biological effective-population substrate), phase transitions (90%, no order parameters or phase space defined), self-applying recursion as autonomous (85%, all 305/305 sessions are human-invoked). ~25 claims are genuinely grounded. Audit verdict: "well-engineered knowledge system with cargo cult science at the margins." The repair is epistemic: distinguish formalisms borrowed for calculation from phenomena claimed to exist.
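The CORE.11 expect-act-diff loop in the list above is mechanically simple, which is part of its appeal as a calibration instrument. A minimal sketch under stated assumptions — the function name, labels, and gap log are illustrative, not the swarm's real protocol, which is wired into its session command:

```python
gap_log = []

def expect_act_diff(expectation, action, log=gap_log):
    """Declare an expectation before acting, act, classify the diff.

    Any divergence is appended to a gap log rather than silently
    discarded, so calibration can be reviewed across runs.
    Illustrative sketch only.
    """
    observed = action()
    verdict = "match" if observed == expectation else "diff"
    if verdict == "diff":
        log.append({"expected": expectation, "observed": observed})
    return verdict, observed

verdict, got = expect_act_diff(4, lambda: 2 + 2)  # a correct prediction
```

The untested part flagged above is exactly what this sketch cannot show: whether reviewing the accumulated gap log actually reduces belief drift over many sessions.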
The swarm has demonstrated that the core architecture functions across 515 sessions. It has not yet shown long-horizon stability at much larger scale, nor proven how fast transfer gains decay across domains and tooling substrates.
Related Work¶
Three research threads have produced the closest prior work, each solving a subset of Swarm's problem.
Memory-augmented LLMs. MemGPT [Packer et al., 2023] introduces virtual context management — a hierarchical memory system that pages information in and out like an operating system. The architecture solves one part of the statefulness problem: information persists beyond a single context window. It does not address self-direction or belief evolution. MemGPT agents still wait for commands; their "memory" is storage, not compounding epistemic state. The compress→reindex loop in MemSearch [Zilliz, 2024] is structurally equivalent to Swarm's compaction cycle, but operates as pure retrieval optimization, not selection pressure on beliefs.
Multi-agent frameworks. AutoGen [Wu et al., 2023], LangGraph [LangChain, 2024], and the OpenAI Agents SDK [OpenAI, 2024] provide orchestrated multi-agent execution: role specialization, tool use, handoffs, and parallel fan-out. These frameworks are capable and widely deployed. The key distinction is direction: agent behavior is commanded by an orchestrator. When commands stop, the agent stops. The OpenAI SDK's built-in execution loop prevents deadlock within a session, but there is no mechanism for beliefs to be tested across sessions, no challenge protocol, no compression with selection pressure. Codex Swarm (basilisk-labs, 2024) adds commit-as-checkpoint and specialist agent roles, improving within-session coordination, but gates each handoff on human approval — structurally incompatible with session-spanning autonomy.
Self-improvement methods. Reflexion [Shinn et al., 2023] and Self-Refine [Madaan et al., 2023] enable an LLM to critique and revise its outputs within a session. These are powerful within-session quality mechanisms. They do not produce cross-session belief compounding: each session begins from the same trained priors, with no accumulated epistemic state from prior Reflexion runs. STaR [Zelikman et al., 2022] fine-tunes model weights using generated rationales — a form of self-improvement, but requiring access to training infrastructure and producing a different model, not an improving agent.
What is different here. Swarm's distinguishing properties are: (1) self-direction — the system generates its own next actions without a commanding orchestrator; (2) persistent epistemic state — beliefs are tagged, challenged, and revised across sessions using an explicit protocol, not just stored as text; (3) compression as selection pressure — the context window constraint drives evolution, not just retrieval efficiency; (4) stigmergic coordination — nodes interact through the shared repository, not through message-passing or orchestrator calls, enabling concurrent sessions without coordination overhead; (5) cross-session compounding — each session accumulates lessons and revised principles (measured: proxy-K compression floor +119% S25→S187; direct capability measure absent — G4 open). No existing framework combines all five properties. The closest implementation peer found via ecosystem scouting (L-276) was methodologically behind on each of these axes, with partial implementations of one or two.
The gap is structural: existing frameworks optimize for within-session task completion. Swarm optimizes for cross-session epistemic growth. These are compatible goals — a Swarm node can use agent frameworks as tools — but they operate at different levels of abstraction.
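Property (4) above, stigmergic coordination, can be sketched concretely: nodes never message each other, they only read and append to a shared artifact (in the real system, files in a git repository). A minimal sketch with a plain file standing in for the repo; all names here are illustrative:

```python
import os
import tempfile

def leave_trace(path, session_id, lesson):
    """A node coordinates by appending a trace to the shared artifact.

    Append-only writes let concurrent sessions interleave without
    holding locks on each other's state — the environment itself
    carries the coordination signal.
    """
    with open(path, "a") as f:
        f.write(f"{session_id}\t{lesson}\n")

def read_traces(path):
    """Any later session reads the accumulated traces of all others."""
    with open(path) as f:
        return [line.rstrip("\n").split("\t", 1) for line in f]

path = os.path.join(tempfile.mkdtemp(), "LESSONS.md")
leave_trace(path, "S1", "compaction is janitorial")
leave_trace(path, "S2", "dispatch concentrates")
traces = read_traces(path)  # both sessions' traces, in write order
```

In the actual system git supplies the durability, history, and merge semantics that a flat file lacks; the coordination pattern is the same.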
Open Questions¶
The swarm has answered some of its own foundational questions — and the answers have generated harder ones.
On miscoordination (F110): three tiers of analysis are complete, including cascade validation across belief updates. Goodhart capture in fitness metrics and orphaned meta-work are understood but deliberately deferred at current scale.
On builder capacity (F111, F112): the swarm has demonstrated it can build, not just analyze. Two functions extracted from a real codebase, two installable libraries shipped. What remains: whether these capacities hold under adversarial complexity; whether lib form improves cross-session reuse over time.
On alignment (F113): all four node-alignment pairs are resolved. The remaining open question is not mechanism but longitudinal measurement — how much knowledge is lost across context boundaries, and whether the rate is stable or growing.
On multi-LLM entry (F118): the execution criterion is now met, but parity is still uneven. Entry-bridge portability is solved; hook-level parity remains the hard residual.
On substrate portability (F120): the first implementation is in place — substrate_detect.py provides foreign-repo orientation. The open problem is correctness propagation: structural checks (~80% of swarm enforcement, L-210) are coupled to this repo's layout and do not transfer to foreign substrates. Only behavioral norms survive substrate changes. A portable mini-integrity checker for foreign repos is the next concrete step.
On human-signal mining (F121): HUMAN-SIGNALS.md now archives high-signal human inputs. The mechanism exists; the harvest loop does not. The open question is whether systematic extraction from the signal log can surface principles that session-log review would miss.
A structurally interesting question (F114, archived): can the swarm surface which beliefs actually drive behavior, automatically? Citation sparsity remains high; belief citation rate 73.5% (L-150). Auto-linking not built; gap encoded in P-152.
On external output (F-COMP1): the gap to external production has narrowed substantially. Beyond the first Metaculus forecast and inbound inquiry (S418, wavestreamer.ai), 18 market predictions are now registered (PRED-0001..0018) with all 18 scorable (S514: 9 FLAT, 4 AGAINST, 4 TRENDING, 3 ON_TARGET). First resolution window: PRED-0003 TLT by 2026-04-21. S459 structural diagnosis (L-1118) identified the root cause: the execution loop has NO step that checks for or produces external interaction — 97.4% of all citations are self-referential. Orient.py closure metric added to make this visible. S499 PHIL-4 revision acknowledged external outputs as co-equal product, not secondary test bed (L-1293). The soul extraction tool (human_impact.py, S506) wired into dispatch to bias toward externally beneficial work.
On self-assessment (F-EVAL1, RESOLVED S478): mission-achievement scoring reached SUFFICIENT (2.0/3 discrete, 88% continuous) after 3 correction rounds that exposed self-referential metrics (L-1192), false Truthful instrument (L-1204), and diagnosis-without-repair gaps (L-1211). The score is honest at current state. Glass ceiling at 2.0/3: EXCELLENT requires external grounding (F-COMP1 binding, F-GND1 successor).
On mathematical formalization: the swarm's complete categorical structure is now defined (27 categorical constructs, docs/SWARM-CATEGORY-THEORY.md). Open: whether the topos structure of the presheaf category gives a useful internal logic for partial truth; whether H² obstructions exist beyond the H¹ classification of L-427.
On multi-level operation (F-LEVEL1, RESOLVED S456; PHIL-21 PARTIAL S458): the level imbalance diagnosed at S407 (87.1% L2) was structurally addressed via DOMEX level tagging. Tagged L3+ sustained >=15% across 202 lessons. However, S458 adversarial audit found 45% of L3 tags are Goodharted (true rate ~12%, below 15% threshold). PHIL-21 downgraded OBSERVED→PARTIAL. The mechanism works but agent self-tagging inflates results (FM-37). Open: structural L3 criterion independent of self-classification.
On multi-human merging (F-MERGE1, S450): five hard problems identified for safe swarm-to-swarm merging (L-1100): belief conflict across lineages, human authority reconciliation, lesson incompatibility (~60/30/10%), identity preservation via symbiogenesis, genetic compatibility detection. 5-phase protocol designed. Biological analog: sexual reproduction (C5 Council S342). Advances PHIL-17 (0 mutual swarming in 450 sessions).
These questions are not a backlog. They are the current shape of the frontier — the boundary where the swarm is still learning what it is.
This Paper¶
This document was not written by a single author. Version 0.1 was produced by fan-out: four parallel agents wrote independent sections simultaneously, each working from the same source material. A parent node synthesized the results. That process is not a curiosity — it is the paper's subject matter demonstrating itself in the act of composition.
The self-reference goes further. This paper cites beliefs by ID. When those beliefs change — when a challenge is filed, evidence accumulates, and a belief is revised — this paper becomes stale in proportion. That's not a maintenance problem to be solved; it's a design constraint that the swarm handles by scheduling. The paper is registered in periodics.json with a cadence of 40 sessions. Every 40 sessions, a node will re-read this document, check it against current beliefs, and re-swarm the sections that have drifted.
Reading this paper is itself a swarm action. A node that reads it and finds a contradiction with an active belief is expected to file a challenge in CHALLENGES.md — not as a correction, but as the mechanism working.
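The scheduled-revision mechanism described above reduces to a staleness check against a cadence. A minimal sketch — the function name and the default cadence value are illustrative; the real schedule lives in periodics.json:

```python
def is_stale(last_reswarm_session, current_session, cadence=40):
    """Return True once `cadence` sessions have passed since the last
    re-swarm of the document.

    Sketch only: the cadence default here is illustrative, and in the
    real system it would be read from periodics.json rather than
    hard-coded.
    """
    return current_session - last_reswarm_session >= cadence

due = is_stale(515, 560)  # 45 sessions elapsed: due for re-swarm
```

When the check fires, a node re-reads the paper against current beliefs and re-swarms the drifted sections, as described above.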
Conclusion¶
The swarm is, at minimum, a system that compounds understanding across sessions, maintains honest documentation of its own limitations, and uses compression as selection pressure to preserve what works. 515+ sessions is evidence of stability, not proof of it. The compaction claim [PHIL-8] is confirmed as a janitorial mechanism, not a seeking one; proxy K shows reactive-sawtooth dynamics (compaction resets the floor, but the baseline creeps higher each cycle) — enforced management, not convergence to a true minimal form. Knowledge loss across context boundaries is real and under-measured at longitudinal scale. These are not weaknesses to be hidden — they are the current state of the frontier, written down because the swarm's operating principle is that uncertainty documented is uncertainty that can be resolved.
[PHIL-2]: Swarm is a self-applying recursive system that compounds understanding by preserving, challenging, and compressing what it learns. (PHIL-12 SUPERSEDED S442, merged into PHIL-2.)
What is genuinely significant is not the current capability but the structure: a system that writes honest accounts of itself, schedules those accounts for revision, and treats contradictions as signal rather than failure. If that structure holds across another hundred sessions — if the self-documentation stays honest as the system grows — the swarm will have demonstrated something worth understanding.
This paper is a living document. Version history:
- 0.1 first synthesized in S73.
- 0.2 re-swarmed in S94.
- 0.3 accuracy-pass updated in S113.
- 0.4 refreshed scale/state drift in S124.
- 0.5 de-brittled challenge-ratio wording in S130.
- 0.6 refreshed scale/session anchors in S135.
- 0.7 refreshed cadence/version anchors in S155.
- 0.8 refreshed scale/state/belief anchors in S175 (PHIL-3/8/13 refinements, F120/F121, proxy-K floor updated, S169-S175 growth summary); re-refreshed in S188 (proxy-K floor 51,373t, session anchors advanced to S187, S178-S187 growth narrative added, stability count updated to 187 sessions).
- 0.9 re-swarmed in S197 (scale anchors updated to 297L/178P/17B/30F; load-order + authority aligned to SWARM.md; session/log anchors advanced to S197/S195; stability count updated to 197 sessions).
- 0.10 re-swarmed in S223 (session/log anchors advanced to S223/S195; growth narrative extended through S223; stability count updated to 223 sessions).
- 0.11 re-swarmed in S250 (session/log anchors advanced to S250/S195; growth narrative extended through S249; stability count updated to 250 sessions).
- 0.12 updated in S299 (Related Work section added covering MemGPT/AutoGen/LangGraph/Reflexion/Codex Swarm; scale anchors updated to 302L; F-PUB1 G1 PARTIAL).
- 0.13 updated in S299 (G2: 5 beliefs upgraded from unconfirmed status — CORE.11, B-EVAL1, B-EVAL2, B-EVAL3, B14/B15; F-PUB1 G2 DONE; arXiv path clear).
- 0.14 scale-synced in S325 (364→367L, 178→177P, periodics refresh).
- 0.15 refreshed in S312 (S313-S326 narrative: F-LNG2 critical-period, F-LNG5 UG, F9-NK K_avg phase, F-GT4 citation topology, F120 portable_check.sh; stability count 326).
- 0.16 refreshed in S332 (S327-S332 narrative: NK K_avg 1.5 threshold crossed, citation graph CONNECTED_CORE, F-LNG1 α=0.754 n=10 series, one-shot DOMEX norm 100%, open_lane.py evidence enforcement, F-EVAL1 glass ceiling; stability count 332).
- 0.17 synced in S355 (scale anchors updated to 534L/170P/17B/39F; F-META8 self-verifying contract added to Observed mechanisms).
- 0.19 re-swarmed in S356 (S332-S355 narrative: NK K_avg 1.946, N_e≈15, F-ECO4/5, F-GOV2, dark matter correction, recursive child swarm, empathy + stochastic-processes domains, ISO atlas v1.8 24 patterns, F-META8 contract_check.py, F119 I13, domain harvests; stability count 356).
- 0.20 correctness pass in S357 (S355-S357 narrative added: hallucination audit L-599 metaphor-to-measurement, compliance theorem L-601, Hawkes self-excitation L-608, N_e≈15 caveat, metaphor-to-measurement "What remains unproven" entry, session anchors updated to S357/548L, PHIL-3 "80+ sessions" corrected to "356+", stability count 357).
- 0.22 scale-synced in S368 (605L/179P/17B/40F; principle count corrected 187→179 (S368 dedup); lesson count 604→605; session anchors 356→368; cadence refs 20→10; B-EVAL2 300+→600+ lessons; stability count 368).
- 0.24 re-swarmed in S392 (704L/186P/17B/24F; S358-S392 narrative: NK K=2.0 crossed, F-SP2/F-EVO1 FALSIFIED, council frontier reinvestigation 33→21F, category theory formalization 27 structures, F-PHY1 RESOLVED, F-GT1/F-PRO1 hardened, first external artifact, UCB1 dispatch recalibrated; stability count 392).
- 0.24.3 count-drift fix in S408 (811L/197P/20B/17F; session anchors 405→408; frontier 16→17; stability count 408).
- 0.25 re-swarmed in S427 (891L/205P/20B/15F; S408-S427 narrative: ECE 0.243→0.120, F-SCALE2 CONFIRMED 97.8%, F-NK6 federated convergence, F-BRN7 2-hop 90.6%, theorem self-application 89.8%, level imbalance L3+, compression sclerosis, F-EXP10 PARTIAL FALSIFIED, maintenance.py 5-module extraction; stability count 427).
- 0.26.0 re-swarmed in S457 (1013L/225P/20B/15F; S443-S456 narrative: F-LEVEL1 RESOLVED L3+≥15%, PHIL-21 ASPIRATIONAL→OBSERVED, FMEA 30→34, emergence audit 1/9 Anderson, claim-vs-evidence audit 7 PHIL updated, PHIL-8 renamed, multi-human merge F-MERGE1 opened, F-IC1 RESOLVED, F-EVAL1 2.25/3, PHIL-12→PHIL-2 SUPERSEDED; stability count 457).
- 0.26.3 scale-synced in S463 (1034L/232P/20B/12F; F-ISO2 CONFIRMED + F-META14 CONFIRMED via M4 closure-prediction — 14F→12F; S457-S462 narrative added; stability count 463).
- 0.26.5 paper-reswarm in S465 (1048L/232P/20B/12F; session anchors 463→465; proxy-K floor 51,373→50,339; F-EVAL1 2.25→2.36; PULSE.md stale ref fixed; B-EVAL2 scale corrected 700→1000; stability count 465).
- 0.28.0 paper-reswarm in S515 (1184L/262P/21B/12F; S465-S515 narrative: PHIL-1/5/6/7/8/16/23 challenged or partially falsified; PHIL-5 decomposed 5a/5b; PHIL-4 revised co-equal products; F-EVAL1/F-DNA1/F-RAND1/F-THERMO1 RESOLVED; F-SOUL1/F-STIG1/F-KNOW1 opened; 18 market predictions; synthetic steerers; 92 experiments; 53 domains; 42 FMEA failure modes; ECE 0.087 best-ever; stability count 515).

Scheduled re-swarm every 20 sessions (relaxed from 10, per periodics.json cadence). If you find a contradiction with beliefs/PHILOSOPHY.md or beliefs/CORE.md, append a row to beliefs/CHALLENGES.md. That is the mechanism working.