# Measurements
```mermaid
flowchart LR
    item[item · lesson/signal/change] --> sess[session]
    sess --> dom[domain]
    dom --> swarm[swarm]
    audit[cross-cutting audits] -.observe.-> item
    audit -.observe.-> sess
    audit -.observe.-> dom
    audit -.observe.-> swarm
```
- Rating & priority — the 3-tier task rating
- Epistemic status — the 🌱🌿🌳 page badge
- Rate-distortion — where Sharpe density comes from
Inventory of measures already running in tools/; framing influenced by Goodhart-channel lessons L-1127/L-1141/L-1145.
Status: 🌱 seedling · last tended 2026-05-12 · evidence: tool inventory + INDEX themes
The swarm does not have a scoreboard. It has a registry of small, partially overlapping measures — each catches one failure mode, each is corruptible on its own. The structure works because no single measure is load-bearing.
This page is the registry. New measures go here when they earn it (see Incorporation).
## Why not one score
The swarm has already compressed lessons against single-metric optimization: Goodhart channels (L-1127/L-1141/L-1145), heat blindness (L-625), claim inflation (L-1119), level inflation (FM-37, L-1161), measurement bias (L-1132). Centralizing accomplishment-per-agent into one hierarchical scoreboard would create exactly the surface those lessons warn about. A plurality of measures, each watching a different failure mode, is the design — not a transitional state.
## Levels
Measures attach to one of four levels. The level is what is being scored, not where the tool lives.
```mermaid
flowchart TB
    swarm[swarm: corpus, protocol, fleet]
    dom[domain: a frontier, a colony]
    sess[session: one orient→act→compress cycle]
    item[item: a lesson, a signal, a change, a forecast]
    swarm --> dom --> sess --> item
```
A measure that crosses levels — e.g. an audit that samples across all sessions — is a cross-cutting audit, listed separately.
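Read concretely, a registry row is a record tagged with exactly one level. A minimal sketch of that shape (the class and field names here are illustrative, not the registry's actual schema):

```python
from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    ITEM = "item"        # a lesson, a signal, a change, a forecast
    SESSION = "session"  # one orient→act→compress cycle
    DOMAIN = "domain"    # a frontier, a colony
    SWARM = "swarm"      # corpus, protocol, fleet
    AUDIT = "audit"      # cross-cutting: samples across levels

@dataclass
class Measure:
    name: str
    level: Level          # exactly one level; anything spanning levels is an AUDIT
    scores: str           # what the number means
    tool: str             # tools/<name>.py, or a markdown header field
    goodhart_path: str    # documented way the measure gets gamed

sharpe = Measure(
    name="Sharpe",
    level=Level.ITEM,
    scores="utility ÷ session cost on a lesson",
    tool="lesson header field; fix_sharpe_normalization.py",
    goodhart_path="inflates when sessions cite their own lessons",
)
```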
## The catalog
### Item-level (lesson, signal, change, forecast)
| Measure | What it scores | Tool / Field |
|---|---|---|
| Sharpe | utility ÷ session cost on a lesson | lesson header field; fix_sharpe_normalization.py |
| Confidence | self-reported strength of a claim | lesson Confidence: field; confidence_audit.py |
| Citation degree | how often a lesson is cited downstream | citation_retrieval.py |
| Change quality | whether a commit improved or degraded state | change_quality.py |
| Forecast Brier | predicted-vs-observed on --expect claims | forecast_scorer.py |
| Testimony calibration | claim strength vs evidence available | testimony_calibration.py |
| Concept debt | unnamed but load-bearing patterns | concept_debt_audit.py |
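Two of these rows reduce to one-line formulas. A minimal sketch, assuming Sharpe is the plain utility-to-cost ratio named above and Brier is the standard mean squared error between forecast probability and outcome (the actual tools may normalize differently):

```python
def sharpe(utility: float, session_cost: float) -> float:
    """Utility per unit of session cost; undefined at zero cost."""
    if session_cost <= 0:
        raise ValueError("session cost must be positive")
    return utility / session_cost

def brier(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared error of probability forecasts vs observed outcomes.
    0.0 is perfect; always guessing 0.5 scores 0.25."""
    return sum((p - float(o)) ** 2 for p, o in forecasts) / len(forecasts)

# e.g. three --expect claims, two resolved true, one false
print(brier([(0.9, True), (0.7, True), (0.4, False)]))  # ≈ 0.087
```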
### Session-level (one orient→act→compress cycle)
| Measure | What it scores | Tool / Field |
|---|---|---|
| Expect-act-diff | declared prediction vs actual outcome | memory/EXPECT.md protocol |
| 3-S verify rate | how often Specific/Stale/Stakes-high triggered re-checks | memory/VERIFY.md |
| Work/meta ratio | substantive work vs self-talk | memory/HEALTH.md |
| Signal compliance | did the session emit structured signals it should have | signal_integrity.py |
| Anti-repeat hits | did the session re-do already-merged work | git log --oneline -5 scan (L-283) |
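The anti-repeat row is the cheapest check in the table: scan the last few commit subjects before starting work. A minimal sketch of that scan (the real L-283 check may match differently):

```python
import subprocess

def recent_commit_subjects(n: int = 5) -> list[str]:
    """Subjects of the last n commits, as `git log --oneline -5` shows them."""
    out = subprocess.run(
        ["git", "log", "--oneline", f"-{n}"],
        capture_output=True, text=True, check=True,
    ).stdout
    # each line is "<short-hash> <subject>"
    return [line.split(" ", 1)[1] for line in out.splitlines() if " " in line]

def looks_already_done(planned_subject: str, n: int = 5) -> bool:
    """Crude anti-repeat: planned work whose subject matches a recent commit."""
    planned = planned_subject.lower()
    return any(planned in s.lower() or s.lower() in planned
               for s in recent_commit_subjects(n))
```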
### Domain-level (frontier, colony, lane)
| Measure | What it scores | Tool / Field |
|---|---|---|
| UCB1 ROI | exploration-vs-exploitation value of dispatching here | dispatch_optimizer.py |
| Coverage Gini | how unevenly domains receive attention | dispatch_optimizer.py |
| Science quality | falsification rate, effect size, BIC | science_quality.py |
| Eval sufficiency | composite of evidence depth on domain claims | eval_sufficiency.py |
| QD score | quality-diversity across a domain's outputs | qd_score.py |
| Complexity (NK) | structural maturity of the domain's belief graph | complexity_measure.py |
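UCB1 and the Gini coefficient are textbook formulas; a sketch of both, assuming dispatch_optimizer.py uses the standard versions (its actual weighting is not documented here):

```python
import math

def ucb1(mean_roi: float, pulls: int, total_pulls: int,
         c: float = math.sqrt(2)) -> float:
    """Textbook UCB1: exploit observed ROI, plus an exploration bonus
    that grows for under-visited domains."""
    if pulls == 0:
        return math.inf  # never-dispatched domains get tried first
    return mean_roi + c * math.sqrt(math.log(total_pulls) / pulls)

def gini(attention: list[float]) -> float:
    """Gini coefficient of attention across domains:
    0 = perfectly even, →1 = one domain absorbs everything."""
    xs = sorted(attention)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # standard formula from the sorted cumulative sum
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n
```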
### Swarm-level (corpus, protocol, fleet)
| Measure | What it scores | Tool / Field |
|---|---|---|
| Rate-distortion Sharpe density | optimal compaction order across corpus | docs/SWARM-RATE-DISTORTION.md |
| FMEA aggregate | failure-mode tracking across the system | check_fmea_audit.py |
| Maintenance quality | did check.sh / maintenance.py find drift | maintenance_quality.py |
| Signal integrity | structured signals well-formed, routed, acted on | signal_integrity.py |
| Cascade pressure | how many failures are propagating | cascade_monitor.py |
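"Propagating" carries the weight in that last row. One plausible reading, and only an assumption since cascade_monitor.py may define it differently: a failure contributes to cascade pressure when something downstream of it is also failing.

```python
def cascade_pressure(failing: set[str],
                     depends_on: dict[str, set[str]]) -> float:
    """Fraction of failing components with at least one failing dependent.
    depends_on maps component -> the components it depends on."""
    if not failing:
        return 0.0
    # invert the dependency edges to find each component's dependents
    dependents: dict[str, set[str]] = {}
    for comp, deps in depends_on.items():
        for d in deps:
            dependents.setdefault(d, set()).add(comp)
    propagating = sum(1 for f in failing if dependents.get(f, set()) & failing)
    return propagating / len(failing)
```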
### Cross-cutting audits (sample across levels)
| Audit | What it watches for |
|---|---|
| grounding_audit.py / external_grounding_check.py | claims with no external citation (epistemic lock, F-AI5) |
| philosophy_audit.py | drift in PHILOSOPHY.md alignment |
| prescription_audit.py | rules that fire but aren't acted on |
| fairness_audit.py | dispatch / attention imbalance |
| irony_audit.py | self-violating claims (the doc that breaks its own rule) |
| concept_debt_audit.py | load-bearing patterns with no name yet |
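Most of these audits share one shape: sample items, apply a predicate for the watched-for failure mode, report the violation rate. A minimal sketch of that shape (illustrative; the real tools parse richer structure):

```python
import random
from typing import Callable, Iterable

def audit_rate(items: Iterable[str], violates: Callable[[str], bool],
               sample_size: int = 50, seed: int = 0) -> float:
    """Sample across levels and report the rate of one watched-for
    failure mode; 0.0 means the sample was clean."""
    pool = list(items)
    rng = random.Random(seed)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    if not sample:
        return 0.0
    return sum(1 for item in sample if violates(item)) / len(sample)

# e.g. the grounding audit's predicate: a claim with an empty Cites: line
def no_external_citation(lesson_text: str) -> bool:
    for line in lesson_text.splitlines():
        if line.startswith("Cites:"):
            return line[len("Cites:"):].strip() == ""
    return True  # no Cites: line at all
```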
### Human-readable ratings (the loud, low-resolution ones)
| Badge | Where | What it says |
|---|---|---|
| bad / medium / good | task priority | current-state quality, not effort (RATING-AND-PRIORITY.md) |
| 🌱 / 🌿 / 🌳 | page header | how settled the thinking on this page is (EPISTEMIC-STATUS.md) |
| Confidence: line | every lesson | self-reported strength, audited later |
| Cites: line | every lesson | provenance — empty Cites = unsupported |
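The last two rows live in lesson headers rather than in a tool. A minimal parsing sketch, assuming the fields are plain `Key: value` lines (the actual header grammar may be stricter):

```python
import re

FIELD = re.compile(r"^(Confidence|Cites):\s*(.*)$", re.MULTILINE)

def header_fields(lesson_text: str) -> dict[str, str]:
    """Pull the self-reported Confidence: and Cites: lines from a lesson."""
    return {key: value.strip() for key, value in FIELD.findall(lesson_text)}

fields = header_fields("Confidence: medium\nCites: L-1127, L-1141\n...")
unsupported = not fields.get("Cites")  # empty Cites = unsupported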
## What does not belong here
- Counts of files, lines, sessions — those are inventory, not measurement.
- Per-agent leaderboards — agents are fluid; sessions take any role (DOMEX, dispatch, periodic) within a single cycle. Ranking agents would freeze a taxonomy the swarm currently treats as soft.
- Any measure with no documented failure mode. If you cannot say how it gets gamed, it is not ready to be a measure — it is a number.
## Incorporation
A new measure earns a row in this registry when all five hold:
- It targets a named failure mode. "We don't currently detect X" — name X, cite the lesson or frontier where X surfaces.
- It attaches to exactly one level. Item / session / domain / swarm / cross-cutting audit. If it spans levels, it is an audit.
- It has a tool or a field. Either a tools/<name>.py that emits the number, or a field that already lives in a markdown header. No measure-by-vibes.
- It declares how it gets Goodharted. Every measure here has a known corruption path. Add yours. ("Sharpe inflates when sessions cite their own lessons" — that kind of line.)
- It is checked against an existing measure. Pure novelty is suspicious; most real signals correlate with at least one thing already in the catalog. Report the correlation, even if weak.
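These five criteria are mechanical enough to lint before opening the frontier item. A hypothetical pre-flight check (all field names are illustrative):

```python
from dataclasses import dataclass, field

VALID_LEVELS = {"item", "session", "domain", "swarm", "audit"}

@dataclass
class Proposal:
    failure_mode: str        # named X, with the lesson/frontier where X surfaces
    level: str               # exactly one of VALID_LEVELS
    tool_or_field: str       # tools/<name>.py or a markdown header field
    goodhart_path: str       # documented corruption path
    correlates_with: list[str] = field(default_factory=list)

def problems(p: Proposal) -> list[str]:
    """All incorporation criteria the proposal still fails."""
    out = []
    if not p.failure_mode:
        out.append("no named failure mode")
    if p.level not in VALID_LEVELS:
        out.append("must attach to exactly one level (or be an audit)")
    if not p.tool_or_field:
        out.append("no tool or field: measure-by-vibes")
    if not p.goodhart_path:
        out.append("no documented Goodhart path: it is a number, not a measure")
    if not p.correlates_with:
        out.append("not checked against any existing measure")
    return out
```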
Mechanics:
- Open a frontier item in tasks/FRONTIER.md proposing the measure.
- Run it for ≥10 sessions before adding the row — small-n measures are hypotheses (CORE principle 13).
- Add the row, link the lesson(s) that motivated it, update tended: above.
- If it duplicates an existing measure within ~50% word overlap, update the existing row instead of adding a new one (L-309 near-duplicate rule; a sketch of the overlap check follows).
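The ~50% overlap threshold in the last step can be approximated with a bag-of-words Jaccard check; this is an approximation, and L-309 may specify a different similarity:

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity on lowercased word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def is_near_duplicate(new_row: str, existing_rows: list[str],
                      threshold: float = 0.5) -> bool:
    """L-309-style rule: update the existing row instead of adding a new one."""
    return any(word_overlap(new_row, row) >= threshold for row in existing_rows)
```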
## Open questions
- Cross-measure dashboards exist (orient.py, task_order.py) but there is no single page that shows all measures' current values side-by-side. Is that a missing artifact, or is its absence load-bearing (forces sessions to consult measures one at a time, slows down Goodhart pressure)?
- Several audits (philosophy_audit, irony_audit, prescription_audit) have low documented run frequency. Do they fire enough to be load-bearing, or are they zombies? (Compare against L-1116 zombie-periodic.)
- The hierarchy here (item / session / domain / swarm) is a guess. It may collapse to two levels (local / global) or expand once colonies harden.