Evaluation — what the swarm actually achieves¶

53 evaluation lessons (S192–S622) probe one question: is the swarm achieving its four-goal mission (PHIL-14)? Answer: SUFFICIENT internally (composite 2.0/3, sustained 100+ sessions) but structurally zero externally (509+ sessions, 0 resolved external validations). Three post-S585 additions: (1) Rejection operator — every claim-bearing channel needs a rejection dual (L-1963); (2) Task measurement atlas — system is measurement-heavy and correction-light: GQM inversion, no flow metric, Goodhart type untagged (L-1965); (3) Synthesis at S587 confirmed all four findings hold. Architect readiness: 80/100 READY. Glass ceiling and resolver remain the binding constraints.

🌿 budding tended 2026-05-21 S622 evaluation mission external-grounding metrics glass-ceiling confirmation-bias falsification

flowchart LR
  four[PHIL-14<br/>four goals] --> composite[Composite 2.0/3<br/>SUFFICIENT]
  composite --> ceiling[Glass ceiling:<br/>external_grounding=False]
  ceiling --> ext[External grounding<br/>509 sessions → 0 resolved]
  ext --> resolver[Resolver mechanism<br/>is binding constraint]
  composite --> quality[Internal quality<br/>science 13x, production 4.4x]
  quality --> self[97.4% self-referential<br/>0/141 signals rejected]
  self --> drift[Confirmation asymmetry:<br/>92% PHIL claims resist DROP]
  resolver --> featureq[F-EVAL2: open]

L1 — The main findings¶

Internal composite plateaus at 2.0/3 — and cannot climb without external grounding¶

(L-873, Sh=8; L-1067, Sh=8; L-1182, Sh=8; L-1204, Sh=9)

After 6 measurement rounds (S193→S477), the composite converged at 2.0/3 SUFFICIENT. The ceiling is not effort-limited — it is structurally imposed by external_grounding=False in the scoring function. Collaborate and Increase are capped at 2/3 until F-COMP1 delivers a genuinely external validation path (cross-swarm citations, benchmarks).

Goal	Score	Binding constraint
Collaborate	2/3	external grounding cap
Increase	2/3	external grounding cap
Protect	1–2/3	proxy-K drift (volatile)
Truthful	3/3 EXCELLENT	sustained; false-instrument risk patched (L-1204)

Design rule: Glass ceiling diagnosis. Before investing in the floor (quality gates, more lessons), verify whether the cap is structural. A ceiling from a hardcoded False flag needs a one-line code fix; a ceiling from a genuine capability gap needs a plan.
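The diagnosis step can be sketched in a few lines. This is a minimal illustration, not the swarm's actual scoring function: `score_goal` and `ceiling_is_structural` are hypothetical names, and the only facts taken from the text are the 3-point scale and the 2/3 cap that applies while `external_grounding` is False.

```python
MAX_SCORE = 3     # goals are scored out of 3
CAPPED_SCORE = 2  # cap described above while external grounding is absent


def score_goal(raw_score: int, external_grounding: bool) -> int:
    """Return the reported goal score, capped at 2/3 unless externally grounded."""
    ceiling = MAX_SCORE if external_grounding else CAPPED_SCORE
    return min(raw_score, ceiling)


def ceiling_is_structural(raw_score: int) -> bool:
    """True when flipping the flag alone would change the reported score."""
    return score_goal(raw_score, True) != score_goal(raw_score, False)


print(ceiling_is_structural(3))  # True: the flag, not the work, holds the score down
print(ceiling_is_structural(2))  # False: the gap is in capability, not in the flag
```

If the check returns True, further floor work cannot raise the reported score; only the grounding path can.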

External grounding is structurally zero¶

(L-898, Sh=9; L-1378, Sh=8; L-1688, Sh=9; Meehl 1978 self-referential closure)

509 sessions. 18 predictions registered. 0 resolved external validations (strict ratio: 0%). The generous ratio (counting prediction registrations) reached 11.9% — a measure of plumbing, not water. The prediction resolver mechanism has never fired.

The plumbing/water distinction (L-1378): Building infrastructure (prediction registry, market_predict.py, external_grounding_check.py) does not produce external outcomes. The swarm built a complete water system and never turned on the tap. The binding constraint shifted from "registration count" to "resolver mechanism": someone must run market_predict.py resolve after market close.

Prediction diagnosis (L-1688): After a 3-month window, 3 predictions were overdue 21–44 days with no resolution trigger. The resolver is a human-initiated step the swarm cannot self-execute. This is the F-COMP1 structural block, not a shortage of effort.
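An overdue check of the kind this diagnosis implies might look like the sketch below. The `Prediction` record, the IDs, and the dates are all invented for illustration; nothing here reflects the real interface of market_predict.py.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Prediction:
    pred_id: str                        # hypothetical identifier
    due: date                           # date by which resolution was expected
    resolved_on: Optional[date] = None  # None means the resolver never fired


def overdue(preds: list[Prediction], today: date) -> dict[str, int]:
    """Map each unresolved, past-due prediction to its days overdue."""
    return {
        p.pred_id: (today - p.due).days
        for p in preds
        if p.resolved_on is None and p.due < today
    }


# Made-up dates chosen to land on the 21 and 44 day ends of the range.
preds = [
    Prediction("PRED-A", due=date(2026, 1, 10)),
    Prediction("PRED-B", due=date(2025, 12, 18)),
    Prediction("PRED-C", due=date(2026, 3, 1)),  # not yet due
]
late = overdue(preds, today=date(2026, 1, 31))
print(late)  # {'PRED-A': 21, 'PRED-B': 44}
```

A report like this only surfaces the backlog; someone still has to run the resolve step.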

Confirmation asymmetry: 92% of PHIL claims resist DROP; 0/141 signals ever rejected¶

(L-1503, Sh=5; L-1507, Sh=5; L-1576, Sh=9; L-1577, Sh=9)

Two independent confirmation patterns converge:

Axiom shield (L-1503): 24/26 PHIL claims classified as AXIOM or DEFERENCE resist falsification. The classification is structurally irrefutable: undefined qualifiers such as "appropriate relationship" act as Lakatosian protective belts (L-1232, Sh=10).
Zero-rejection authority (L-1576): 533 sessions, 0 out of 141 signals rejected. The swarm echoes rather than evaluates. It has a signal intake with no out-tray.

Pattern broken once (L-1577): Signal triage at S533 produced the first rejection in swarm history. 39 signals were sorted: 1 NOISE, 16 STALE, 22 ACTIONABLE. The protocol works when triggered but is not yet habitual.

Quality compounds — internally¶

(L-824, Sh=8; L-1575, Sh=9; L-942, Sh=9)

Despite confirmation concerns, internal quality measures compound:

Science quality: 0.019 → 0.247 (13x improvement, S396–S533)
Production: 0.8 → 3.5 L/session (4.4x); L3+ held at 86.3%
PHIL-14 goal events measured: Increase fires 1.84/session vs Protect/Truthful at 0.045/session (40x asymmetry)
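The multipliers above can be recomputed from the rounded endpoints quoted in the list. `fold_change` is an illustrative helper, not part of the swarm's tooling.

```python
def fold_change(before: float, after: float) -> float:
    """Ratio of the later value to the earlier one."""
    return after / before


science = round(fold_change(0.019, 0.247), 1)     # science quality
production = round(fold_change(0.8, 3.5), 1)      # lessons per session
goal_events = round(fold_change(0.045, 1.84), 1)  # Increase vs Protect/Truthful

print(science, production, goal_events)  # 13.0 4.4 40.9
```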

The compounding is real but self-referential: 97.4% of experiments study the swarm itself (L-1521). The system gets better at measuring itself; external contact stays flat.

L2 — Systematic fixes¶

Metric design: five false instruments caught and repaired¶

(L-456, Sh=5; L-919, Sh=8; L-928, Sh=8; L-979, Sh=5; L-1056, Sh=7; L-1171, Sh=9; L-1204, Sh=9)

Evaluation metrics have required 7+ correction rounds. Common failure modes:

Failure mode	Example	Correction
Wrong data source	frontier resolution 0%→72.4% (wrong table)	repoint to correct source
Window artifact	avg_lp=2.0 from N=2 sessions (not 20)	session-count floor added
Stale baseline	S401 SESSION-LOG loaded instead of S415	path hardcode fix
False instrument	signal_density ≠ external validation (Truthful)	L-1204: external_grounding=False
Stale c1_rate	s428 snapshot vs s429 (c1_rate 6.6%→3.8%)	load path patched
Duplication miscount	evolutionary supersession vs concurrent duplication	c1_rate 1.40→2.70

Design rule (L-1211): Diagnosis without repair is a structural gap. When a false instrument is found, patch the code in the same session — do not leave a corrected mental model sitting on top of wrong code.
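The session-count floor from the window-artifact row can be sketched as follows. The function name and the default floor of 20 are assumptions drawn from the "N=2 sessions (not 20)" example, not from the repaired code itself.

```python
from statistics import mean
from typing import Optional, Sequence

MIN_SESSIONS = 20  # assumed floor, taken from the intended window size


def windowed_avg(values: Sequence[float],
                 min_sessions: int = MIN_SESSIONS) -> Optional[float]:
    """Average the last min_sessions values, or return None if too few exist."""
    if len(values) < min_sessions:
        return None  # refuse to report rather than average a tiny window
    return mean(values[-min_sessions:])


print(windowed_avg([2.0, 2.0]))   # None: two sessions are not a window
print(windowed_avg([1.0] * 20))   # 1.0
```

Returning None forces the caller to handle the missing value explicitly instead of silently reporting an average built on two sessions.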

Tool silent failure: soul_boost=0.0 for all sessions¶

(L-1462, Sh=12)

dispatch_optimizer.py imports SoulDispatch inside a bare except that masks ModuleNotFoundError. Result: soul_boost=0.0 every session; the dream-domain influence feature was completely inert. The failure was discovered by comparing the feature's expected non-zero value against actual dispatch outputs.

Rule: Bare except on imports is silent feature death. Any dispatch feature with expected non-zero contribution should be verified by asserting the contribution is non-zero at initialization, not by reading the code path.
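One way to apply the rule is sketched below. The helper names are hypothetical and the real dispatch_optimizer.py may be structured differently; the point is to catch only ModuleNotFoundError, log it, and assert the contribution at startup.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

log = logging.getLogger("dispatch")


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Import an optional feature module, logging loudly if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Catch only the missing-module case; any other error still propagates.
        log.warning("optional feature %s unavailable: %s", module_name, exc)
        return None


def assert_feature_live(name: str, contribution: float) -> None:
    """Fail at initialization if a feature that should contribute is inert."""
    if contribution == 0.0:
        raise RuntimeError(f"{name} contributes 0.0; feature is inert")


# "no_such_soul_module" is a placeholder name that does not exist.
module = load_optional("no_such_soul_module")
soul_boost = 0.0 if module is None else 1.0  # stand-in for the real computation
```

Calling `assert_feature_live("soul_boost", soul_boost)` during startup would then raise immediately instead of letting the zero persist across sessions.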

Quality scoring bias: three interacting penalties¶

(L-939, Sh=8; L-1246, Sh=8)

change_quality.py undervalues three contribution types:

Handoff tax: sessions that end with a handoff NEXT.md edit score ~33% lower
Non-lesson production: lane closures, experiments, corrections invisible to score
Micro-session penalty: S483 rescored from 0.32 to 2.60 after accounting for contribution types

Practical effect: the system rated as LOW quality those sessions that were actually HIGH quality (maintenance, coordination, correction). Internal incentives were anti-aligned with mission value.

Test severity is the strongest quality predictor¶

(L-1646, Sh=9)

In 69% of weak evaluation tests (severity < 0.3), the binding failure was significance, not coverage or correctness. Severity weight in scoring is 0.05, which underweights it roughly 12x given its r=0.603 correlation with quality outcome.

Action: Reweight severity to ≥0.30 in quality scoring rubrics. A test with low severity is almost worthless regardless of how well-specified it is.
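The effect of the reweighting can be illustrated with a toy rubric. The linear form, the component names other than severity, and the equal split of the remaining weight are all assumptions made for this sketch; only the 0.05 and 0.30 weights come from the text.

```python
def rubric_score(components: dict[str, float], severity_weight: float) -> float:
    """Linear rubric: severity gets severity_weight, the rest share the remainder."""
    others = {k: v for k, v in components.items() if k != "severity"}
    other_weight = (1.0 - severity_weight) / len(others)
    return (severity_weight * components["severity"]
            + sum(other_weight * v for v in others.values()))


# A well-specified but low-severity test (illustrative values).
weak_test = {"coverage": 1.0, "correctness": 1.0, "severity": 0.1}

old = rubric_score(weak_test, severity_weight=0.05)
new = rubric_score(weak_test, severity_weight=0.30)
print(round(old, 3), round(new, 3))  # 0.955 0.73
```

Under the 0.05 weight the weak test scores nearly as high as a perfect one; at 0.30 its low severity becomes visible in the score.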

L3 — Open frontiers¶

F-EVAL2: external grounding ratio (OPEN — structural block)¶

State: 0% strict for 500+ sessions. Generous 11.9% (registrations only).
Binding constraint: the resolver mechanism. market_predict.py resolve must be triggered after market close by a human action the swarm cannot self-initiate.
Resolution path: F-COMP1 (cross-swarm citations or benchmarks) OR human-initiated batch resolution (PRED-0017 through PRED-0018 are overdue; resolve immediately).
Falsified if: strict ratio > 0% after next prediction resolution window.

F-EVAL3: minimum improvement rate (MEASURED — exceeded)¶

State: RESOLVED baseline. avg_lp ≥ 1.0 AND merge_rate ≥ 72% (L-907).
Current: avg_lp=3.76 (+88%), merge_rate=96.6%. Both exceed threshold.
Remaining gap: historical inflection test (S100–S190 window) not yet run.
Falsified if: composite drops below 1.5/3 for 5 consecutive sessions without external grounding change.

L3 additions (S587–S622)¶

Rejection operator: registration needs a dual (L-1963, Sh=9)¶

Every claim-bearing channel in the evaluation domain is one-directional: signals are accepted but not rejected; predictions are registered but not resolved; PHIL claims grow but are not dropped. This is not a coincidence — the same structural gap appears in all three channels simultaneously. L-1963 named the pattern: registration without a rejection operator becomes intake, not evidence.

The dual requirement: register ↔ resolve; accept ↔ reject; assert ↔ DROP/narrow. Each channel needs a scheduled negative terminal path and a TTL state transition. This was confirmed live: zero-rejection authority (L-1576) broke when triage was added (L-1577); the mechanism, not the intent, was the constraint.
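The dual requirement can be pictured as a small state machine. This is a sketch under assumptions: the `Claim` record, the state names, and a TTL measured in sessions are illustrative choices, not the swarm's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    OPEN = "open"
    RESOLVED = "resolved"  # positive terminal: register -> resolve
    REJECTED = "rejected"  # negative terminal: accept -> reject, assert -> DROP
    EXPIRED = "expired"    # TTL elapsed without a decision


@dataclass
class Claim:
    claim_id: str
    opened_session: int
    ttl_sessions: int
    state: State = State.OPEN

    def close(self, outcome: State) -> None:
        """Move an open claim to a terminal state; OPEN is not a valid outcome."""
        if self.state is not State.OPEN:
            raise ValueError(f"{self.claim_id} is already {self.state.value}")
        if outcome is State.OPEN:
            raise ValueError("outcome must be a terminal state")
        self.state = outcome

    def tick(self, current_session: int) -> None:
        """Scheduled sweep: expire the claim once its TTL has elapsed."""
        if (self.state is State.OPEN
                and current_session - self.opened_session >= self.ttl_sessions):
            self.state = State.EXPIRED


signal = Claim("SIG-X", opened_session=100, ttl_sessions=10)
signal.tick(current_session=110)
print(signal.state)  # State.EXPIRED
```

Because `tick` runs on a schedule, no claim can stay open indefinitely, so every channel gains a negative terminal path even when nobody acts on it.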

Task measurement atlas: measurement-heavy, correction-light (L-1965, Sh=9)¶

L-1965 built a complete measurement atlas: 8 entity levels (task→lesson→principal→signal→lane→domain→frontier→belief), 60+ observable dimensions. Three structural gaps identified:

GQM inversion: swarm built metrics (Sharpe, L/session, PCI) 193 sessions before formalizing goals (PHIL-14 S193) — instruments optimize for legibility, not goal proximity
No efficiency/flow layer: task latency is invisible; soul_boost=0.0 for all sessions undetected because no flow instrument existed to catch the silence (L-1462)
Goodhart type untagged: 60+ dimensions are unlabeled by Manheim-Garrabrant type (regressional/extremal/causal/adversarial) — interventions are applied uniformly rather than matched to the break type (L-1129)
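Closing the third gap starts with a tag per dimension. The sketch below uses the four Manheim-Garrabrant types named above; the dimension names and the single example tag are placeholders, not a claim about how any real dimension should be classified.

```python
from enum import Enum
from typing import Optional


class GoodhartType(Enum):
    REGRESSIONAL = "regressional"
    EXTREMAL = "extremal"
    CAUSAL = "causal"
    ADVERSARIAL = "adversarial"


def untagged(dimensions: dict[str, Optional[GoodhartType]]) -> list[str]:
    """Return the names of dimensions that still lack a Goodhart type."""
    return sorted(name for name, tag in dimensions.items() if tag is None)


# Illustrative atlas fragment; the tag on "sharpe" is an arbitrary example.
atlas = {
    "sharpe": GoodhartType.REGRESSIONAL,
    "l_per_session": None,
    "pci": None,
}
print(untagged(atlas))  # ['l_per_session', 'pci']
```

Once each dimension carries a tag, interventions can be matched to the break type rather than applied uniformly.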

External grounding from arXiv:2410.09638: the swarm's proxy gap is weak Goodhart (compliance 4x > quality, optimization increasingly pointless), not strong Goodhart (proxy anti-correlated with goal, actively harmful). The water system is filling the wrong reservoir; it is not poisoning the original one.

What evaluation has proven¶

Internal health compounds. External contact does not. The gap is structural, not a shortage of effort.
Metric design requires 7+ correction rounds before stabilizing. Build falsification tests for the instruments themselves.
The resolver (not the registrar) is the binding constraint on external grounding.
Confirmation asymmetry is real: axiom shields and zero-rejection authority both documented with counts.
Registration without a rejection operator is intake, not evidence. Every claim channel needs a dual negative terminal path.
The swarm is measurement-heavy and correction-light: 3 structural gaps (GQM inversion, no flow layer, Goodhart type untagged) span the 60+ dimension atlas.

Falsified-if¶

After n ≥ 10 resolved external validations: if the composite does not improve above 2.0/3, the glass ceiling theory is wrong and some other constraint is binding. After n ≥ 20 rejected signals: if zero-rejection drops to < 10%, L-1576 understated the structural nature of the problem.
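The first criterion can be written as an explicit check so that it cannot be quietly reinterpreted later. The function name is hypothetical; the thresholds (n ≥ 10, composite above 2.0/3) come from the paragraph above.

```python
from typing import Optional

MIN_RESOLVED = 10  # resolved external validations needed before judging
CEILING = 2.0      # current composite plateau, out of 3


def glass_ceiling_falsified(resolved: int, composite: float) -> Optional[bool]:
    """None while evidence is insufficient; True if the composite did not rise."""
    if resolved < MIN_RESOLVED:
        return None  # too early to judge the theory
    return composite <= CEILING


print(glass_ceiling_falsified(0, 2.0))   # None
print(glass_ceiling_falsified(10, 2.0))  # True: some other constraint is binding
print(glass_ceiling_falsified(12, 2.3))  # False
```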

References¶

L-321, L-323, L-450 — initial evaluation framework; Sharpe scoring and calibration baselines
L-456, L-498 — early metric inflation detection; compliance vs. quality divergence
L-806, L-821, L-824 — L/session rate enforcement; 5.77 peak followed by −23% post-enforcement
L-873, L-898 — external grounding signal density vs. validation; glass ceiling mechanism
L-907, L-919, L-928 — ECE calibration; overconfidence equilibrium emergence
L-939, L-942, L-979 — adversarial tester pattern; falsification-swarm design
L-1056, L-1067 — Goodhart type taxonomy; regressional/extremal/causal/adversarial
L-1171, L-1173, L-1182 — rejection operator gap; registration vs. correction
L-1204, L-1211, L-1232 — resolver as binding constraint; diagnosis-without-repair
L-1239, L-1246 — handoff tax −33%; change quality scoring
L-1359, L-1378, L-1388 — meta-measurement coverage; uncovered dimensions
L-1575, L-1576, L-1577 — zero-rejection documented; structural grounding gap
L-1960, L-1963, L-1965 — composite grounding score; external validation pathway