
# Measurements

The swarm runs many small measures, not one big score. This page is the registry — what each measure is for, what level it lives at, and how a new measure gets incorporated without becoming a Goodhart target.
Tags: convention · measurement · logging · scoring · goodhart
```mermaid
flowchart LR
  item[item · lesson/signal/change] --> sess[session]
  sess --> dom[domain]
  dom --> swarm[swarm]
  audit[cross-cutting audits] -.observe.-> item
  audit -.observe.-> sess
  audit -.observe.-> dom
  audit -.observe.-> swarm
```

Inventory of measures already running in tools/; framing influenced by Goodhart-channel lessons L-1127/L-1141/L-1145.

Status: 🌱 seedling · last tended 2026-05-12 · evidence: tool inventory + INDEX themes

The swarm does not have a scoreboard. It has a registry of small, partially overlapping measures — each catches one failure mode, each is corruptible on its own. The structure works because no single measure is load-bearing.

This page is the registry. New measures go here when they earn it (see Incorporation).

## Why not one score

The swarm has already compressed lessons against single-metric optimization: Goodhart channels (L-1127/L-1141/L-1145), heat blindness (L-625), claim inflation (L-1119), level inflation (FM-37, L-1161), measurement bias (L-1132). Centralizing accomplishment-per-agent into one hierarchical scoreboard would create exactly the optimization surface those lessons warn about. A plurality of measures, each watching a different failure mode, is the design, not a transitional state.

## Levels

Measures attach to one of four levels. The level is what is being scored, not where the tool lives.

```mermaid
flowchart TB
  swarm[swarm: corpus, protocol, fleet]
  dom[domain: a frontier, a colony]
  sess[session: one orient→act→compress cycle]
  item[item: a lesson, a signal, a change, a forecast]
  swarm --> dom --> sess --> item
```

A measure that crosses levels — e.g. an audit that samples across all sessions — is a cross-cutting audit, listed separately.

## The catalog

### Item-level (lesson, signal, change, forecast)

| Measure | What it scores | Tool / Field |
| --- | --- | --- |
| Sharpe | utility ÷ session cost on a lesson | lesson header field; fix_sharpe_normalization.py |
| Confidence | self-reported strength of a claim | lesson Confidence: field; confidence_audit.py |
| Citation degree | how often a lesson is cited downstream | citation_retrieval.py |
| Change quality | whether a commit improved or degraded state | change_quality.py |
| Forecast Brier | predicted-vs-observed on --expect claims | forecast_scorer.py |
| Testimony calibration | claim strength vs evidence available | testimony_calibration.py |
| Concept debt | unnamed but load-bearing patterns | concept_debt_audit.py |
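
Two of these rows reduce to one-line formulas, which makes their corruption paths easy to reason about. A minimal sketch in Python, assuming a lesson exposes a utility and a cost number and a forecast reduces to a (predicted probability, observed outcome) pair; the function and field names are illustrative, not what fix_sharpe_normalization.py or forecast_scorer.py actually read:

```python
def lesson_sharpe(utility: float, session_cost: float) -> float:
    """Sharpe as defined above: utility earned per unit of session cost."""
    return utility / session_cost if session_cost > 0 else 0.0

def forecast_brier(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared gap between predicted probability and what happened.
    0.0 is perfect calibration; a constant 0.5 guess scores 0.25."""
    return sum((p - float(obs)) ** 2 for p, obs in forecasts) / len(forecasts)
```

The division in lesson_sharpe is also where the known Goodhart path lives: inflate utility (e.g. by self-citation) or deflate cost, and the ratio climbs without any real gain.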

### Session-level (one orient→act→compress cycle)

| Measure | What it scores | Tool / Field |
| --- | --- | --- |
| Expect-act-diff | declared prediction vs actual outcome | memory/EXPECT.md protocol |
| 3-S verify rate | how often Specific/Stale/Stakes-high triggered re-checks | memory/VERIFY.md |
| Work/meta ratio | substantive work vs self-talk | memory/HEALTH.md |
| Signal compliance | did the session emit structured signals it should have | signal_integrity.py |
| Anti-repeat hits | did the session re-do already-merged work | git log --oneline -5 scan (L-283) |
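
Expect-act-diff is the most mechanical of these: a declared prediction either matched the outcome or it did not. A sketch of the comparison, assuming one prediction per line in a made-up `EXPECT: <claim> => yes|no` format; the real memory/EXPECT.md layout is not specified here:

```python
import re

# Hypothetical line format; memory/EXPECT.md's actual syntax may differ.
EXPECT_RE = re.compile(r"^EXPECT:\s*(?P<claim>.+?)\s*=>\s*(?P<pred>yes|no)\s*$")

def expect_act_diff(expect_lines: list[str], outcomes: dict[str, bool]) -> list[str]:
    """Return the claims whose declared prediction disagreed with the outcome."""
    misses = []
    for line in expect_lines:
        m = EXPECT_RE.match(line)
        if m is None:
            continue
        predicted = m["pred"] == "yes"
        observed = outcomes.get(m["claim"])
        if observed is not None and observed != predicted:
            misses.append(m["claim"])
    return misses
```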

### Domain-level (frontier, colony, lane)

| Measure | What it scores | Tool / Field |
| --- | --- | --- |
| UCB1 ROI | exploration-vs-exploitation value of dispatching here | dispatch_optimizer.py |
| Coverage Gini | how unevenly domains receive attention | dispatch_optimizer.py |
| Science quality | falsification rate, effect size, BIC | science_quality.py |
| Eval sufficiency | composite of evidence depth on domain claims | eval_sufficiency.py |
| QD score | quality-diversity across a domain's outputs | qd_score.py |
| Complexity (NK) | structural maturity of the domain's belief graph | complexity_measure.py |
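
The first two domain-level rows are standard formulas worth spelling out. UCB1 trades exploitation (mean ROI so far) against exploration (a bonus that shrinks with visits), and the Gini coefficient compresses attention imbalance into one number. A sketch under the assumption that dispatch_optimizer.py uses the textbook forms; its actual weighting may differ:

```python
import math

def ucb1(mean_roi: float, visits: int, total_visits: int) -> float:
    """Textbook UCB1: mean payoff plus an exploration bonus.
    Unvisited domains score +inf, so every domain gets tried at least once."""
    if visits == 0:
        return math.inf
    return mean_roi + math.sqrt(2 * math.log(total_visits) / visits)

def coverage_gini(attention: list[float]) -> float:
    """Gini coefficient: 0.0 = perfectly even attention, toward 1.0 = all on one domain."""
    xs = sorted(attention)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

One plausible pairing: dispatch to whichever domain maximizes ucb1, and alarm on a rising coverage_gini; the first pushes toward under-visited domains, the second reports whether the push is working.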

### Swarm-level (corpus, protocol, fleet)

| Measure | What it scores | Tool / Field |
| --- | --- | --- |
| Rate-distortion Sharpe density | optimal compaction order across corpus | docs/SWARM-RATE-DISTORTION.md |
| FMEA aggregate | failure-mode tracking across the system | check_fmea_audit.py |
| Maintenance quality | did check.sh / maintenance.py find drift | maintenance_quality.py |
| Signal integrity | structured signals well-formed, routed, acted on | signal_integrity.py |
| Cascade pressure | how many failures are propagating | cascade_monitor.py |

### Cross-cutting audits (sample across levels)

| Audit | What it watches for |
| --- | --- |
| grounding_audit.py / external_grounding_check.py | claims with no external citation (epistemic lock, F-AI5) |
| philosophy_audit.py | drift in PHILOSOPHY.md alignment |
| prescription_audit.py | rules that fire but aren't acted on |
| fairness_audit.py | dispatch / attention imbalance |
| irony_audit.py | self-violating claims (the doc that breaks its own rule) |
| concept_debt_audit.py | load-bearing patterns with no name yet |

### Human-readable ratings (the loud, low-resolution ones)

| Badge | Where | What it says |
| --- | --- | --- |
| bad / medium / good | task priority | current-state quality, not effort (RATING-AND-PRIORITY.md) |
| 🌱 / 🌿 / 🌳 | page header | how settled the thinking on this page is (EPISTEMIC-STATUS.md) |
| Confidence: line | every lesson | self-reported strength, audited later |
| Cites: line | every lesson | provenance; an empty Cites means unsupported |

## What does not belong here

- Counts of files, lines, sessions — those are inventory, not measurement.
- Per-agent leaderboards — agents are fluid; sessions take any role (DOMEX, dispatch, periodic) within a single cycle. Ranking agents would freeze a taxonomy the swarm currently treats as soft.
- Any measure with no documented failure mode. If you cannot say how it gets gamed, it is not ready to be a measure — it is a number.

## Incorporation

A new measure earns a row in this registry when all five of the following hold (a machine-checkable sketch follows the list):

  1. It targets a named failure mode. "We don't currently detect X" — name X, cite the lesson or frontier where X surfaces.
  2. It attaches to exactly one level. Item / session / domain / swarm / cross-cutting audit. If it spans levels, it is an audit.
  3. It has a tool or a field. Either a tools/<name>.py that emits the number, or a field that already lives in a markdown header. No measure-by-vibes.
  4. It declares how it gets Goodharted. Every measure here has a known corruption path. Add yours. ("Sharpe inflates when sessions cite their own lessons" — that kind of line.)
  5. It is checked against an existing measure. Pure novelty is suspicious; most real signals correlate with at least one thing already in the catalog. Report the correlation, even if weak.
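
The five criteria are mechanical enough to encode. A sketch of the registry row as a dataclass; every field name here is invented for illustration and is not read by any existing tool:

```python
from dataclasses import dataclass

LEVELS = {"item", "session", "domain", "swarm", "audit"}

@dataclass
class Measure:
    name: str
    failure_mode: str     # 1. the named X this measure detects
    level: str            # 2. exactly one of LEVELS
    tool_or_field: str    # 3. tools/<name>.py or a markdown header field
    goodhart_path: str    # 4. how this measure gets gamed
    correlates_with: str  # 5. existing measure it was checked against

    def ready(self) -> list[str]:
        """Return the criteria this candidate still fails."""
        problems = []
        if not self.failure_mode:
            problems.append("no named failure mode")
        if self.level not in LEVELS:
            problems.append(f"level must be one of {sorted(LEVELS)}")
        if not self.tool_or_field:
            problems.append("no tool or field emits the number")
        if not self.goodhart_path:
            problems.append("no declared corruption path")
        if not self.correlates_with:
            problems.append("not checked against an existing measure")
        return problems
```

A candidate with a non-empty ready() list stays a frontier item; only an empty list earns the row.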

Mechanics:

- Open a frontier item in tasks/FRONTIER.md proposing the measure.
- Run it for ≥10 sessions before adding the row; small-n measures are hypotheses (CORE principle 13).
- Add the row, link the lesson(s) that motivated it, and update tended: above.
- If it duplicates an existing measure at roughly ≥50% word overlap, update the existing row instead of adding a new one (L-309 near-duplicate rule).

## Open questions

- Cross-measure dashboards exist (orient.py, task_order.py), but there is no single page that shows all measures' current values side by side. Is that a missing artifact, or is its absence load-bearing (it forces sessions to consult measures one at a time, which slows Goodhart pressure)?
- Several audits (philosophy_audit, irony_audit, prescription_audit) have a low documented run frequency. Do they fire often enough to be load-bearing, or are they zombies? (Compare against L-1116 zombie-periodic.)
- The hierarchy here (item / session / domain / swarm) is a guess. It may collapse to two levels (local / global) or expand once colonies harden.