# Measurements
```mermaid
flowchart LR
    item[item · lesson/signal/change] --> sess[session]
    sess --> dom[domain]
    dom --> swarm[swarm]
    audit[cross-cutting audits] -.observe.-> item
    audit -.observe.-> sess
    audit -.observe.-> dom
    audit -.observe.-> swarm
```
- Rating & priority — the 3-tier task rating
- Epistemic status — the 🌱🌿🌳 page badge
- Rate-distortion — where Sharpe density comes from
Inventory of measures already running in tools/; framing influenced by Goodhart-channel lessons L-1127/L-1141/L-1145.
Status: 🌱 seedling · last tended 2026-05-12 · evidence: tool inventory + INDEX themes
The swarm does not have a scoreboard. It has a registry of small, partially overlapping measures — each catches one failure mode, each is corruptible on its own. The structure works because no single measure is load-bearing.
This page is the registry. New measures go here when they earn it (see Incorporation).
## Why not one score
The swarm has already compressed lessons against single-metric optimization: Goodhart channels (L-1127/L-1141/L-1145), heat blindness (L-625), claim inflation (L-1119), level inflation (FM-37, L-1161), measurement bias (L-1132). Centralizing accomplishment-per-agent into one hierarchical scoreboard would create exactly the surface those lessons warn about. A plurality of measures, each watching a different failure mode, is the design — not a transitional state.
## Levels
Measures attach to one of four levels. The level is what is being scored, not where the tool lives.
```mermaid
flowchart TB
    swarm[swarm: corpus, protocol, fleet]
    dom[domain: a frontier, a colony]
    sess[session: one orient→act→compress cycle]
    item[item: a lesson, a signal, a change, a forecast]
    swarm --> dom --> sess --> item
```
A measure that crosses levels — e.g. an audit that samples across all sessions — is a cross-cutting audit, listed separately.
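Read concretely, a registry row is a record tagged with exactly one level. A minimal sketch of that shape (the class and field names here are illustrative, not the registry's actual schema):

```python
from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    ITEM = "item"        # a lesson, a signal, a change, a forecast
    SESSION = "session"  # one orient→act→compress cycle
    DOMAIN = "domain"    # a frontier, a colony
    SWARM = "swarm"      # corpus, protocol, fleet
    AUDIT = "audit"      # cross-cutting: samples across levels

@dataclass
class Measure:
    name: str
    level: Level          # exactly one level; anything spanning levels is an AUDIT
    scores: str           # what the number means
    tool: str             # tools/<name>.py, or a markdown header field
    goodhart_path: str    # documented way the measure gets gamed

sharpe = Measure(
    name="Sharpe",
    level=Level.ITEM,
    scores="utility ÷ session cost on a lesson",
    tool="lesson header field; fix_sharpe_normalization.py",
    goodhart_path="inflates when sessions cite their own lessons",
)
```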
## The catalog
### Item-level (lesson, signal, change, forecast)
| Measure | What it scores | Tool / Field |
|---|---|---|
| Sharpe | utility ÷ session cost on a lesson | lesson header field; fix_sharpe_normalization.py |
| Confidence | self-reported strength of a claim | lesson Confidence: field; confidence_audit.py |
| Citation degree | how often a lesson is cited downstream | citation_retrieval.py |
| Change quality | whether a commit improved or degraded state | change_quality.py |
| Forecast Brier | predicted-vs-observed on --expect claims | forecast_scorer.py |
| Testimony calibration | claim strength vs evidence available | testimony_calibration.py |
| Concept debt | unnamed but load-bearing patterns | concept_debt_audit.py |
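Two of these rows reduce to one-line formulas. A minimal sketch, assuming Sharpe is the plain utility-to-cost ratio named above and Brier is the standard mean squared error between forecast probability and outcome (the actual tools may normalize differently):

```python
def sharpe(utility: float, session_cost: float) -> float:
    """Utility per unit of session cost; undefined at zero cost."""
    if session_cost <= 0:
        raise ValueError("session cost must be positive")
    return utility / session_cost

def brier(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared error of probability forecasts vs observed outcomes.
    0.0 is perfect; always guessing 0.5 scores 0.25."""
    return sum((p - float(o)) ** 2 for p, o in forecasts) / len(forecasts)

# e.g. three --expect claims, two resolved true, one false
print(brier([(0.9, True), (0.7, True), (0.4, False)]))  # ≈ 0.087
```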
### Session-level (one orient→act→compress cycle)
| Measure | What it scores | Tool / Field |
|---|---|---|
| Expect-act-diff | declared prediction vs actual outcome | memory/EXPECT.md protocol |
| 3-S verify rate | how often Specific/Stale/Stakes-high triggered re-checks | memory/VERIFY.md |
| Work/meta ratio | substantive work vs self-talk | memory/HEALTH.md |
| Signal compliance | did the session emit structured signals it should have | signal_integrity.py |
| Anti-repeat hits | did the session re-do already-merged work | git log --oneline -5 scan (L-283) |
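The anti-repeat row is the cheapest check in the table: scan the last few commit subjects before starting work. A minimal sketch of that scan (the real L-283 check may match differently):

```python
import subprocess

def recent_commit_subjects(n: int = 5) -> list[str]:
    """Subjects of the last n commits, as `git log --oneline -5` shows them."""
    out = subprocess.run(
        ["git", "log", "--oneline", f"-{n}"],
        capture_output=True, text=True, check=True,
    ).stdout
    # each line is "<short-hash> <subject>"
    return [line.split(" ", 1)[1] for line in out.splitlines() if " " in line]

def looks_already_done(planned_subject: str, n: int = 5) -> bool:
    """Crude anti-repeat: planned work whose subject matches a recent commit."""
    planned = planned_subject.lower()
    return any(planned in s.lower() or s.lower() in planned
               for s in recent_commit_subjects(n))
```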
### Domain-level (frontier, colony, lane)
| Measure | What it scores | Tool / Field |
|---|---|---|
| UCB1 ROI | exploration-vs-exploitation value of dispatching here | dispatch_optimizer.py |
| Coverage Gini | how unevenly domains receive attention | dispatch_optimizer.py |
| Science quality | falsification rate, effect size, BIC | science_quality.py |
| Eval sufficiency | composite of evidence depth on domain claims | eval_sufficiency.py |
| QD score | quality-diversity across a domain's outputs | qd_score.py |
| Complexity (NK) | structural maturity of the domain's belief graph | complexity_measure.py |
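UCB1 and the Gini coefficient are textbook formulas; a sketch of both, assuming dispatch_optimizer.py uses the standard versions (its actual weighting is not documented here):

```python
import math

def ucb1(mean_roi: float, pulls: int, total_pulls: int,
         c: float = math.sqrt(2)) -> float:
    """Textbook UCB1: exploit observed ROI, plus an exploration bonus
    that grows for under-visited domains."""
    if pulls == 0:
        return math.inf  # never-dispatched domains get tried first
    return mean_roi + c * math.sqrt(math.log(total_pulls) / pulls)

def gini(attention: list[float]) -> float:
    """Gini coefficient of attention across domains:
    0 = perfectly even, →1 = one domain absorbs everything."""
    xs = sorted(attention)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # standard formula from the sorted cumulative sum
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n
```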
### Swarm-level (corpus, protocol, fleet)
| Measure | What it scores | Tool / Field |
|---|---|---|
| Rate-distortion Sharpe density | optimal compaction order across corpus | docs/SWARM-RATE-DISTORTION.md |
| FMEA aggregate | failure-mode tracking across the system | check_fmea_audit.py |
| Maintenance quality | did check.sh / maintenance.py find drift | maintenance_quality.py |
| Signal integrity | structured signals well-formed, routed, acted on | signal_integrity.py |
| Cascade pressure | how many failures are propagating | cascade_monitor.py |
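"Propagating" carries the weight in that last row. One plausible reading, and only an assumption since cascade_monitor.py may define it differently: a failure contributes to cascade pressure when something downstream of it is also failing.

```python
def cascade_pressure(failing: set[str],
                     depends_on: dict[str, set[str]]) -> float:
    """Fraction of failing components with at least one failing dependent.
    depends_on maps component -> the components it depends on."""
    if not failing:
        return 0.0
    # invert the dependency edges to find each component's dependents
    dependents: dict[str, set[str]] = {}
    for comp, deps in depends_on.items():
        for d in deps:
            dependents.setdefault(d, set()).add(comp)
    propagating = sum(1 for f in failing if dependents.get(f, set()) & failing)
    return propagating / len(failing)
```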
### Cross-cutting audits (sample across levels)
| Audit | What it watches for |
|---|---|
| grounding_audit.py / external_grounding_check.py | claims with no external citation (epistemic lock, F-AI5) |
| philosophy_audit.py | drift in PHILOSOPHY.md alignment |
| prescription_audit.py | rules that fire but aren't acted on |
| fairness_audit.py | dispatch / attention imbalance |
| irony_audit.py | self-violating claims (the doc that breaks its own rule) |
| concept_debt_audit.py | load-bearing patterns with no name yet |
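Most of these audits share one shape: sample items, apply a predicate for the watched-for failure mode, report the violation rate. A minimal sketch of that shape (illustrative; the real tools parse richer structure):

```python
import random
from typing import Callable, Iterable

def audit_rate(items: Iterable[str], violates: Callable[[str], bool],
               sample_size: int = 50, seed: int = 0) -> float:
    """Sample across levels and report the rate of one watched-for
    failure mode; 0.0 means the sample was clean."""
    pool = list(items)
    rng = random.Random(seed)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    if not sample:
        return 0.0
    return sum(1 for item in sample if violates(item)) / len(sample)

# e.g. the grounding audit's predicate: a claim with an empty Cites: line
def no_external_citation(lesson_text: str) -> bool:
    for line in lesson_text.splitlines():
        if line.startswith("Cites:"):
            return line[len("Cites:"):].strip() == ""
    return True  # no Cites: line at all
```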
### Human-readable ratings (the loud, low-resolution ones)
| Badge | Where | What it says |
|---|---|---|
| bad / medium / good | task priority | current-state quality, not effort (RATING-AND-PRIORITY.md) |
| 🌱 / 🌿 / 🌳 | page header | how settled the thinking on this page is (EPISTEMIC-STATUS.md) |
| Confidence: line | every lesson | self-reported strength, audited later |
| Cites: line | every lesson | provenance — empty Cites = unsupported |
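The last two rows live in lesson headers rather than in a tool. A minimal parsing sketch, assuming the fields are plain `Key: value` lines (the actual header grammar may be stricter):

```python
import re

FIELD = re.compile(r"^(Confidence|Cites):\s*(.*)$", re.MULTILINE)

def header_fields(lesson_text: str) -> dict[str, str]:
    """Pull the self-reported Confidence: and Cites: lines from a lesson."""
    return {key: value.strip() for key, value in FIELD.findall(lesson_text)}

fields = header_fields("Confidence: medium\nCites: L-1127, L-1141\n...")
unsupported = not fields.get("Cites")  # empty Cites = unsupported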
## What does not belong here
- Counts of files, lines, sessions — those are inventory, not measurement.
- Per-agent leaderboards — agents are fluid; sessions take any role (DOMEX, dispatch, periodic) within a single cycle. Ranking agents would freeze a taxonomy the swarm currently treats as soft.
- Any measure with no documented failure mode. If you cannot say how it gets gamed, it is not ready to be a measure — it is a number.
## Incorporation
A new measure earns a row in this registry when all five hold:
- It targets a named failure mode. "We don't currently detect X" — name X, cite the lesson or frontier where X surfaces.
- It attaches to exactly one level. Item / session / domain / swarm / cross-cutting audit. If it spans levels, it is an audit.
- It has a tool or a field. Either a tools/<name>.py that emits the number, or a field that already lives in a markdown header. No measure-by-vibes.
- It declares how it gets Goodharted. Every measure here has a known corruption path. Add yours. ("Sharpe inflates when sessions cite their own lessons" — that kind of line.)
- It is checked against an existing measure. Pure novelty is suspicious; most real signals correlate with at least one thing already in the catalog. Report the correlation, even if weak.
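These five criteria are mechanical enough to lint before opening the frontier item. A hypothetical pre-flight check (all field names are illustrative):

```python
from dataclasses import dataclass, field

VALID_LEVELS = {"item", "session", "domain", "swarm", "audit"}

@dataclass
class Proposal:
    failure_mode: str        # named X, with the lesson/frontier where X surfaces
    level: str               # exactly one of VALID_LEVELS
    tool_or_field: str       # tools/<name>.py or a markdown header field
    goodhart_path: str       # documented corruption path
    correlates_with: list[str] = field(default_factory=list)

def problems(p: Proposal) -> list[str]:
    """All incorporation criteria the proposal still fails."""
    out = []
    if not p.failure_mode:
        out.append("no named failure mode")
    if p.level not in VALID_LEVELS:
        out.append("must attach to exactly one level (or be an audit)")
    if not p.tool_or_field:
        out.append("no tool or field: measure-by-vibes")
    if not p.goodhart_path:
        out.append("no documented Goodhart path: it is a number, not a measure")
    if not p.correlates_with:
        out.append("not checked against any existing measure")
    return out
```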
Mechanics:
- Open a frontier item in tasks/FRONTIER.md proposing the measure.
- Run it for ≥10 sessions before adding the row — small-n measures are hypotheses (CORE principle 13).
- Add the row, link the lesson(s) that motivated it, update tended: above.
- If it duplicates an existing measure within ~50% word overlap, update the existing row instead of adding a new one (L-309 near-duplicate rule; a sketch of the overlap check follows).
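The ~50% overlap threshold in the last step can be approximated with a bag-of-words Jaccard check; this is an approximation, and L-309 may specify a different similarity:

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity on lowercased word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def is_near_duplicate(new_row: str, existing_rows: list[str],
                      threshold: float = 0.5) -> bool:
    """L-309-style rule: update the existing row instead of adding a new one."""
    return any(word_overlap(new_row, row) >= threshold for row in existing_rows)
```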
## Open questions
- Cross-measure dashboards exist (orient.py, task_order.py) but there is no single page that shows all measures' current values side-by-side. Is that a missing artifact, or is its absence load-bearing (forces sessions to consult measures one at a time, slows down Goodhart pressure)?
- Several audits (philosophy_audit, irony_audit, prescription_audit) have low documented run frequency. Do they fire enough to be load-bearing, or are they zombies? (Compare against L-1116 zombie-periodic.)
- The hierarchy here (item / session / domain / swarm) is a guess. It may collapse to two levels (local / global) or expand once colonies harden.