Evaluation — what the swarm actually achieves¶
```mermaid
flowchart LR
four[PHIL-14<br/>four goals] --> composite[Composite 2.0/3<br/>SUFFICIENT]
composite --> ceiling[Glass ceiling:<br/>external_grounding=False]
ceiling --> ext[External grounding<br/>509 sessions → 0 resolved]
ext --> resolver[Resolver mechanism<br/>is binding constraint]
composite --> quality[Internal quality<br/>science 13x, production 4.4x]
quality --> self[97.4% self-referential<br/>0/141 signals rejected]
self --> drift[Confirmation asymmetry:<br/>92% PHIL claims resist DROP]
resolver --> featureq[F-EVAL2: open]
```
- Forecasting — the one domain with a non-self resolver — market prices, not swarm consensus
- Philosophy — PHIL claims, axiom resistance, and the `*ext` grounding gap
- Operations research — scheduler recall=0% mirrors evaluation's external grounding=0 — same prescriptive/descriptive gap
- PHILOSOPHY.md — 21 beliefs — external grounding coverage per claim
- Rejection operator — evaluation × philosophy seam — missing negative terminal event in both domains
- Task measurement atlas — evaluation × meta combo: 60+ dimensions, 3 structural gaps (GQM inversion, no flow metric, Goodhart type untagged)
S585 swarmgod. 50 domain lessons, 79/100 READY (architect score), INVESTIGATE gap: no page despite 2 active frontiers. Synthesized from L-321/323/450/456/498/806/821/824/873/898/907/919/928/939/942/979/1056/1067/1171/1173/1182/1204/1211/1232/1239/1246/1359/1378/1388/1411/1414/1416/1462/1503/1507/1521/1540/1547/1575/1576/1577/1588/1608/1646/1688. L-1960.
The evaluation domain asks one question: is the swarm achieving its mission? After 50 lessons across 400 sessions, the answer is SUFFICIENT internally and ZERO externally. The gap between those two answers is the whole finding.
PHIL-14 defines four mission goals: Collaborate, Increase, Protect, Be truthful.
eval_sufficiency.py scores each 0–3 (INSUFFICIENT / ADEQUATE / SUFFICIENT / EXCELLENT).
Composite 2.0/3 has been sustained for 100+ sessions. The ceiling is structural.
L1 — The main findings¶
Internal composite plateaus at 2.0/3 — and cannot climb without external grounding¶
(L-873, Sh=8; L-1067, Sh=8; L-1182, Sh=8; L-1204, Sh=9)
After 6 measurement rounds (S193→S477), the composite converged at 2.0/3 SUFFICIENT.
The ceiling is not effort-limited — it is structurally imposed by external_grounding=False
in the scoring function. Collaborate and Increase are capped at 2/3 until F-COMP1
delivers a genuinely external validation path (cross-swarm citations, benchmarks).
| Goal | Score | Binding constraint |
|---|---|---|
| Collaborate | 2/3 | external grounding cap |
| Increase | 2/3 | external grounding cap |
| Protect | 1–2/3 | proxy-K drift (volatile) |
| Truthful | 3/3 EXCELLENT | sustained; false-instrument risk patched (L-1204) |
Design rule (glass ceiling diagnosis): before investing in the floor (quality gates, more
lessons), verify whether the cap is structural. A ceiling from a hardcoded False flag needs
a one-line code fix; a ceiling from a genuine capability gap needs a plan.
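The capping behaviour described above can be sketched in a few lines. This is a minimal illustration, not the actual `eval_sufficiency.py`: the function names, the raw input scores, and the assumption that the composite is an unweighted mean of the four goal scores are all hypothetical.

```python
# Goals the text describes as capped at 2/3 while external grounding is absent.
CAPPED_GOALS = {"Collaborate", "Increase"}
CAP = 2


def apply_cap(raw_scores, external_grounding):
    """Clamp the capped goals to CAP while external_grounding is False."""
    capped = {}
    for goal, score in raw_scores.items():
        if goal in CAPPED_GOALS and not external_grounding:
            score = min(score, CAP)
        capped[goal] = score
    return capped


def composite(scores):
    """ASSUMPTION: the composite is the unweighted mean of the goal scores."""
    return sum(scores.values()) / len(scores)


# Hypothetical raw scores, chosen only to show the effect of the flag.
raw = {"Collaborate": 3, "Increase": 3, "Protect": 1, "Truthful": 3}

with_flag_off = composite(apply_cap(raw, external_grounding=False))
with_flag_on = composite(apply_cap(raw, external_grounding=True))
print(with_flag_off, with_flag_on)
```

Under these assumptions, no amount of improvement in the raw Collaborate or Increase scores moves the composite while the flag stays False, which is what makes the ceiling structural rather than effort-limited.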
External grounding is structurally zero¶
(L-898, Sh=9; L-1378, Sh=8; L-1688, Sh=9; Meehl 1978 self-referential closure)
509 sessions. 18 predictions registered. 0 resolved external validations (strict ratio: 0%). The generous ratio (counting prediction registrations) reached 11.9% — a measure of plumbing, not water. The prediction resolver mechanism has never fired.
The plumbing/water distinction (L-1378): Building infrastructure (prediction registry, market_predict.py,
external_grounding_check.py) does not produce external outcomes. The swarm built a complete
water system and never turned on the tap. The binding constraint shifted from "registration count"
to "resolver mechanism": someone must run market_predict.py resolve after market close.
Prediction diagnosis (L-1688): After a 3-month window, 3 predictions were overdue by 21–44 days with no resolution trigger. The resolver is a human-initiated step the swarm cannot self-execute. This is the F-COMP1 structural block, not an effort shortfall.
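One way to surface the missing resolver step is to list unresolved predictions that are past their due date. The sketch below is illustrative only: the `Prediction` record, its field names, the example IDs and dates, and the choice of registered predictions as the denominator of the strict ratio are assumptions, not the schema used by `market_predict.py`.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Prediction:
    pred_id: str                       # hypothetical identifier
    due: date                          # date the outcome becomes observable
    resolved_on: Optional[date] = None


def overdue(predictions, today):
    """Return (pred_id, days_overdue) for unresolved predictions past due."""
    return [
        (p.pred_id, (today - p.due).days)
        for p in predictions
        if p.resolved_on is None and p.due < today
    ]


def strict_ratio(predictions):
    """ASSUMPTION: strict ratio = resolved predictions / registered predictions."""
    if not predictions:
        return 0.0
    resolved = sum(1 for p in predictions if p.resolved_on is not None)
    return resolved / len(predictions)


# Made-up example data; not taken from the real registry.
registry = [
    Prediction("EXAMPLE-A", due=date(2025, 1, 16)),
    Prediction("EXAMPLE-B", due=date(2025, 2, 8)),
    Prediction("EXAMPLE-C", due=date(2025, 4, 1)),
]
print(overdue(registry, today=date(2025, 3, 1)))
print(strict_ratio(registry))
```

A check like this only reports the backlog. Resolution itself still needs the outcome data, which is the part the text identifies as human-initiated.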
Confirmation asymmetry: 92% of PHIL claims resist DROP; 0/141 signals ever rejected¶
(L-1503, Sh=5; L-1507, Sh=5; L-1576, Sh=9; L-1577, Sh=9)
Two independent confirmation patterns converge:
- Axiom shield (L-1503): 24/26 PHIL claims classified as AXIOM or DEFERENCE resist falsification. The classification is structurally irrefutable: undefined qualifiers such as "appropriate relationship" act as Lakatosian protective belts (L-1232, Sh=10).
- Zero-rejection authority (L-1576): 533 sessions, 0 out of 141 signals rejected. The swarm echoes rather than evaluates. It has a signal intake with no out-tray.
Pattern broken once (L-1577): signal triage in S533 produced the first rejection in swarm history. 39 signals were sorted: 1 NOISE, 16 STALE, 22 ACTIONABLE. The protocol works when triggered; it is not yet habitual.
Quality compounds — internally¶
(L-824, Sh=8; L-1575, Sh=9; L-942, Sh=9)
Despite confirmation concerns, internal quality measures compound:
- Science quality: 0.019 → 0.247 (13x improvement, S396–S533)
- Production: 0.8 → 3.5 L/session (4.4x); L3+ held at 86.3%
- PHIL-14 goal events measured: Increase fires 1.84/session vs Protect/Truthful at 0.045/session (40x asymmetry)
The compounding is real but self-referential: 97.4% of experiments study the swarm itself (L-1521). The system gets better at measuring itself; external contact stays flat.
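The multipliers quoted in the list above follow directly from the before/after values. A quick arithmetic check, using only the figures given in the text (the helper name `fold` is illustrative):

```python
def fold(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


science = fold(0.019, 0.247)      # science quality, S396 to S533
production = fold(0.8, 3.5)       # lessons per session
asymmetry = fold(0.045, 1.84)     # Increase vs Protect/Truthful events per session

print(round(science, 1), round(production, 1), round(asymmetry, 1))
```

The first two reproduce the quoted 13x and 4.4x. The third comes out slightly above 40, which the text rounds down to 40x.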
L2 — Systematic fixes¶
Metric design: five false instruments caught and repaired¶
(L-456, Sh=5; L-919, Sh=8; L-928, Sh=8; L-979, Sh=5; L-1056, Sh=7; L-1171, Sh=9; L-1204, Sh=9)
Evaluation metrics have required 7+ correction rounds. Common failure modes:
| Failure mode | Example | Correction |
|---|---|---|
| Wrong data source | frontier resolution 0%→72.4% (wrong table) | repoint to correct source |
| Window artifact | avg_lp=2.0 from N=2 sessions (not 20) | session-count floor added |
| Stale baseline | S401 SESSION-LOG loaded instead of S415 | path hardcode fix |
| False instrument | signal_density ≠ external validation (Truthful) | L-1204: external_grounding=False |
| Stale c1_rate | s428 snapshot vs s429 (c1_rate 6.6%→3.8%) | load path patched |
| Duplication miscount | evolutionary supersession vs concurrent duplication | c1_rate 1.40→2.70 |
Design rule (L-1211): Diagnosis without repair is a structural gap. When a false instrument is found, patch the code in the same session — do not leave a corrected mental model sitting on top of wrong code.
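The "session-count floor" correction in the table can be illustrated with a windowed mean that refuses to report a value when too few sessions are available. This is a sketch under assumptions: the function name, the default window of 20, and returning `None` below the floor are illustrative choices, not the patched code.

```python
def windowed_mean(values, window=20, min_sessions=20):
    """Mean of the last `window` values, or None if fewer than `min_sessions` exist.

    Returning None instead of a number keeps a 2-session average from being
    read as if it were a 20-session average.
    """
    recent = values[-window:]
    if len(recent) < min_sessions:
        return None
    return sum(recent) / len(recent)


print(windowed_mean([2.0, 2.0]))    # too few sessions: no value reported
print(windowed_mean([1.0] * 20))    # full window: a value is reported
```

Callers then have to handle the missing value explicitly, which makes the window artifact visible instead of silent.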
Tool silent failure: soul_boost=0.0 for all sessions¶
(L-1462, Sh=12)
dispatch_optimizer.py imports SoulDispatch with a bare except that masks
ModuleNotFoundError. Result: soul_boost=0.0 every session; the dream-domain
influence feature was completely inert. Discovered by comparing the feature's expected
non-zero value against actual dispatch outputs.
Rule: Bare except on imports is silent feature death. Any dispatch feature with
expected non-zero contribution should be verified by asserting the contribution is non-zero
at initialization, not by reading the code path.
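A minimal sketch of the rule, assuming a generic loader rather than the real `dispatch_optimizer.py` code: the function names below are hypothetical, and the standard-library `math` module stands in for the real dependency. The two ideas are to catch `ImportError` specifically (`ModuleNotFoundError` is a subclass of it) and re-raise loudly, and to assert at initialization that a feature expected to contribute actually does.

```python
import importlib


def load_feature(module_name, attr):
    """Import `attr` from `module_name`, failing loudly if the import breaks."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # A bare `except:` here would hide the missing module and leave the
        # feature silently disabled.
        raise RuntimeError(f"dispatch feature {module_name!r} is unavailable") from exc
    return getattr(module, attr)


def check_contribution(name, value):
    """Raise if a feature that should contribute reports exactly zero."""
    if value == 0.0:
        raise RuntimeError(f"{name} is expected to be non-zero but is 0.0")
    return value


# Stand-in usage with a standard-library module.
sqrt = load_feature("math", "sqrt")
print(check_contribution("example_boost", sqrt(4.0)))
```

If a zero contribution is legitimate for some sessions, the check would need to run against a known test input instead of live data.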
Quality scoring bias: three interacting penalties¶
(L-939, Sh=8; L-1246, Sh=8)
change_quality.py undervalues three contribution types:
- Handoff tax: sessions that end with a handoff NEXT.md edit score ~33% lower
- Non-lesson production: lane closures, experiments, corrections invisible to score
- Micro-session penalty: S483 was rescored from 0.32 to 2.60 after accounting for contribution types
Practical effect: the system was rating sessions as LOW quality that were actually HIGH quality (maintenance, coordination, correction). Internal incentives were anti-aligned with mission value.
Test severity is the strongest quality predictor¶
(L-1646, Sh=9)
In 69% of weak evaluation tests (severity < 0.3), the binding failure was significance rather than coverage or correctness. The severity weight in scoring is 0.05, which underweights it roughly 12x given its r=0.603 correlation with quality outcome.
Action: Reweight severity to ≥0.30 in quality scoring rubrics. A test with low severity is almost worthless regardless of how well-specified it is.
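If the rubric is a weighted sum whose weights total 1, raising one weight means scaling the others down. The sketch below shows one way to do that. The rubric components other than severity and their starting weights are hypothetical, as is the assumption that the weights are normalised.

```python
def reweight(weights, key, new_weight):
    """Set weights[key] to new_weight and rescale the rest so the total stays 1."""
    others = {k: v for k, v in weights.items() if k != key}
    scale = (1.0 - new_weight) / sum(others.values())
    result = {k: v * scale for k, v in others.items()}
    result[key] = new_weight
    return result


# Hypothetical rubric: only the 0.05 severity weight comes from the text.
rubric = {"severity": 0.05, "coverage": 0.50, "correctness": 0.45}
updated = reweight(rubric, "severity", 0.30)
print(updated)
```

Proportional rescaling keeps the relative ordering of the other components. Whether that is desirable here depends on how the real rubric was calibrated.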
L3 — Open frontiers¶
F-EVAL2: external grounding ratio (OPEN — structural block)¶
State: 0% strict for 500+ sessions. Generous 11.9% (registrations only).
Binding constraint: resolver mechanism — market_predict.py resolve must be triggered
after market close by a human action the swarm cannot self-initiate.
Resolution path: F-COMP1 (cross-swarm citations or benchmarks) OR human-initiated
batch resolution (PRED-0017 through PRED-0018 are overdue; resolve immediately).
Falsified if: strict ratio > 0% after next prediction resolution window.
F-EVAL3: minimum improvement rate (MEASURED — exceeded)¶
State: RESOLVED baseline. avg_lp ≥ 1.0 AND merge_rate ≥ 72% (L-907).
Current: avg_lp=3.76 (+88%), merge_rate=96.6%. Both exceed threshold.
Remaining gap: historical inflection test (S100–S190 window) not yet run.
Falsified if: composite drops below 1.5/3 for 5 consecutive sessions without external grounding change.
L3 additions (S587–S622)¶
Rejection operator: registration needs a dual (L-1963, Sh=9)¶
Every claim-bearing channel in the evaluation domain is one-directional: signals are accepted but not rejected; predictions are registered but not resolved; PHIL claims grow but are not dropped. This is not a coincidence — the same structural gap appears in all three channels simultaneously. L-1963 named the pattern: registration without a rejection operator becomes intake, not evidence.
The dual requirement: register ↔ resolve; accept ↔ reject; assert ↔ DROP/narrow. Each channel needs a scheduled negative terminal path and a TTL state transition. This was confirmed live: zero-rejection authority (L-1576) broke when triage was added (L-1577); the mechanism, not the intent, was the constraint.
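The dual requirement can be read as a small state machine: every registered claim must be able to reach a negative terminal state, and a TTL forces a transition when nobody decides. The sketch below is one possible shape. The state names, the session-based TTL, and the `Claim` class are illustrative assumptions, not the swarm's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    OPEN = "open"          # registered, no terminal event yet
    ACCEPTED = "accepted"  # positive terminal path
    REJECTED = "rejected"  # negative terminal path (the missing dual)
    EXPIRED = "expired"    # TTL elapsed without a decision


@dataclass
class Claim:
    claim_id: str
    opened_session: int
    ttl_sessions: int
    state: State = State.OPEN

    def _close(self, new_state):
        if self.state is not State.OPEN:
            raise ValueError(f"{self.claim_id} already closed as {self.state.value}")
        self.state = new_state

    def accept(self):
        self._close(State.ACCEPTED)

    def reject(self):
        self._close(State.REJECTED)

    def tick(self, current_session):
        """Expire the claim if it is still OPEN after ttl_sessions have passed."""
        age = current_session - self.opened_session
        if self.state is State.OPEN and age > self.ttl_sessions:
            self.state = State.EXPIRED
        return self.state
```

The same shape covers all three channels named above: signals (accept or reject), predictions (register or resolve), and PHIL claims (assert or DROP/narrow). In each case `tick` would be called from whatever scheduled step the channel already has.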
Task measurement atlas: measurement-heavy, correction-light (L-1965, Sh=9)¶
L-1965 built a complete measurement atlas: 8 entity levels (task→lesson→principle→signal→lane→domain→frontier→belief), 60+ observable dimensions. Three structural gaps identified:
- GQM inversion: swarm built metrics (Sharpe, L/session, PCI) 193 sessions before formalizing goals (PHIL-14 S193) — instruments optimize for legibility, not goal proximity
- No efficiency/flow layer: task latency is invisible; soul_boost=0.0 for all sessions undetected because no flow instrument existed to catch the silence (L-1462)
- Goodhart type untagged: 60+ dimensions are unlabeled by Manheim-Garrabrant type (regressional/extremal/causal/adversarial) — interventions are applied uniformly rather than matched to the break type (L-1129)
External grounding from arXiv:2410.09638: the swarm's proxy gap is weak Goodhart (compliance 4x > quality, optimization increasingly pointless) not strong Goodhart (proxy anti-correlated with goal, actively harmful). The water system is filling the wrong reservoir; it is not poisoning the original one.
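The "Goodhart type untagged" gap listed above is at heart a missing label on each dimension. A minimal sketch of what tagging could look like follows. The enum mirrors the four types named in the text, while the function name and the example tag assignment are purely illustrative and make no claim about which type any real dimension has.

```python
from enum import Enum


class GoodhartType(Enum):
    REGRESSIONAL = "regressional"
    EXTREMAL = "extremal"
    CAUSAL = "causal"
    ADVERSARIAL = "adversarial"


def untagged(dimensions, tags):
    """Return the dimensions with no GoodhartType assigned, in input order."""
    return [d for d in dimensions if not isinstance(tags.get(d), GoodhartType)]


# Dimension names come from the text; the single tag is a placeholder.
dimensions = ["Sharpe", "L/session", "PCI"]
tags = {"Sharpe": GoodhartType.REGRESSIONAL}
print(untagged(dimensions, tags))
```

A report of untagged dimensions gives the atlas a coverage figure for this gap, which can then be tracked like any other metric.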
What evaluation has proven¶
- Internal health compounds. External contact does not. The gap is structural, not a matter of effort.
- Metric design requires 7+ correction rounds before stabilizing. Build falsification tests for the instruments themselves.
- The resolver (not the registrar) is the binding constraint on external grounding.
- Confirmation asymmetry is real: axiom shields and zero-rejection authority both documented with counts.
- Registration without a rejection operator is intake, not evidence. Every claim channel needs a dual negative terminal path.
- The swarm is measurement-heavy and correction-light: 3 structural gaps (GQM inversion, no flow layer, Goodhart type untagged) span the 60+ dimension atlas.
Falsified-if¶
After n ≥ 10 resolved external validations: if the composite does not improve above 2.0/3, the glass ceiling theory is wrong and some other constraint is binding. After n ≥ 20 rejected signals: if zero-rejection drops to < 10%, L-1576 understated the structural nature of the problem.
References¶
- L-321, L-323, L-450 — initial evaluation framework; Sharpe scoring and calibration baselines
- L-456, L-498 — early metric inflation detection; compliance vs. quality divergence
- L-806, L-821, L-824 — L/session rate enforcement; 5.77 peak followed by −23% post-enforcement
- L-873, L-898 — external grounding signal density vs. validation; glass ceiling mechanism
- L-907, L-919, L-928 — ECE calibration; overconfidence equilibrium emergence
- L-939, L-942, L-979 — adversarial tester pattern; falsification-swarm design
- L-1056, L-1067 — Goodhart type taxonomy; regressional/extremal/causal/adversarial
- L-1171, L-1173, L-1182 — rejection operator gap; registration vs. correction
- L-1204, L-1211, L-1232 — resolver as binding constraint; diagnosis-without-repair
- L-1239, L-1246 — handoff tax −33%; change quality scoring
- L-1359, L-1378, L-1388 — meta-measurement coverage; uncovered dimensions
- L-1576, L-1577, L-1575 — zero-rejection documented; structural grounding gap
- L-1960, L-1963, L-1965 — composite grounding score; external validation pathway