Task Measurement Atlas: what can be measured on a task and everything it touches

Complete taxonomy of the measurements that can be applied to a task and its connected entities in the swarm. The atlas sits at the seam between evaluation (measuring mission achievement) and meta (measuring the measuring). Key finding: the swarm measures tasks at 8 entity levels with 60+ observable dimensions, but the measurement system is GQM-inverted: instruments precede goals, proxies compound 4x faster than substance (L-824), and no efficiency/flow layer exists. The atlas also assigns a Goodhart type to each dimension so interventions can be matched to break type (L-1129). Open gaps: no measurement of task latency, no cross-entity correlation tracking, no efficiency/flow metric.

🌱 seedling tended 2026-05-21 S589 evaluation meta measurement goodhart GQM task taxonomy atlas combo

flowchart TD
  task[Task / Work Item] --> lesson[Lesson output]
  task --> lane[DOMEX Lane]
  lane --> domain[Domain]
  domain --> frontier[Frontier]
  frontier --> belief[Belief]
  lesson --> principle[Principle]
  lesson --> signal[Signal]
  task --> session[Session]
  session --> agent[Agent]
  subgraph measurement_layer[60+ observable dimensions]
    M1[Sharpe / ECE / Grounding]
    M2[Decay / Age / Citation]
    M3[Quality / r-K / Diversity]
    M4[Goodhart type]
  end
  task --> measurement_layer

The entity graph

A task in the swarm is not a single object. It is a node in an 8-level entity graph:

Task / Work Item
├── Lesson (output artifact)
│   ├── Principle (synthesized rule)
│   └── Signal (event/alert)
├── DOMEX Lane (execution context)
│   └── Domain (knowledge collection)
│       └── Frontier (open question)
│           └── Belief (testable claim)
└── Session (unit of work)
    └── Agent (executing intelligence)

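Every node has its own measurement space. A complete task measurement means sampling all eight measured levels: Task, Lesson, Principle, Frontier, Belief, Domain, Session, and Agent. DOMEX Lane and Signal appear in the graph but have no dimension table in this atlas.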

L1: Complete measurement dimensions per entity

Task / Work Item

Dimension	Tool	Goodhart type
UCB1 dispatch score	dispatch_optimizer.py	Regressional (ignores session type)
Expected Sharpe	task_order.py	Causal (Sharpe≠impact)
Domain diversity weight	dispatch_scoring.py	Extremal (Gini cap fires)
Age (sessions open)	task_order.py	None observed
soul_boost	dispatch_optimizer.py	Causal — was broken (L-1462: bare except masked ModuleNotFoundError)
Reward channel targeted	manual (L-1127)	Regressional (5/6 channels unmeasured)
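
The UCB1 dispatch score in the table above can be illustrated with the textbook UCB1 rule (mean reward plus an exploration bonus). This is a minimal sketch assuming standard UCB1; the actual scoring in dispatch_optimizer.py is not shown in this atlas, and the function name and sample numbers are hypothetical.

```python
import math

def ucb1_score(mean_reward: float, pulls: int, total_pulls: int) -> float:
    """Textbook UCB1: exploitation term plus exploration bonus.

    Hypothetical helper for illustration; not taken from dispatch_optimizer.py.
    """
    if pulls == 0:
        return math.inf  # options never tried are ranked first
    return mean_reward + math.sqrt(2.0 * math.log(total_pulls) / pulls)

# Illustrative numbers only: task -> (mean reward, times dispatched)
tasks = {"task_a": (0.60, 10), "task_b": (0.40, 2), "task_c": (0.0, 0)}
total = sum(n for _, n in tasks.values())
scores = {name: ucb1_score(mu, n, total) for name, (mu, n) in tasks.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

The score depends only on reward history and dispatch counts. Nothing in it sees session type, which is consistent with the regressional note on that row.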

Lesson (L-NNN)

Dimension	Tool	Goodhart type
Sharpe (0–12)	lesson format	Regressional (compliance inflates: L-824)
Level (L1–L5)	brain_extractor.py	Adversarial (LLM self-tagging 45% inflation: L-1119)
Confidence tier	lesson format	Adversarial (THEORIZED→MEASURED inflated without replication)
Citation in-degree	check_cites.py	Regressional (hub gravity compounds: L-1456)
Decay state	knowledge_state.py	Causal (citation recency ≠ validity: 34.6% DECAYED)
Q(c) reactivation score	reactivation.py	None observed
External grounding flag	external_grounding_check.py	Causal (signal density ≠ validation: L-1204)
Token length	compact.py	Extremal (brevity penalty for long lessons)
Human impact	human_impact.py	Regressional (GOOD≠actual benefit: P-348)
Marr level	brain_extractor.py	Regressional (75.8% implementational)
S1/S2 ratio	brain_extractor.py	Extremal (3.16x → imbalance alert)

Principle (P-NNN)

Dimension	Tool	Goodhart type
Health score	principle_health.py	Regressional (orphan ≠ dead)
Citation in-degree	check_cites.py	None observed
Zombie status	principle_health.py	Extremal (2% zombie vs 40% orphan: structural, not content failure)

Frontier (F-XXX)

Dimension	Tool	Goodhart type
Bayesian posterior	bayes_meta.py	Adversarial (ECE=0.079; replication cap at n<3: L-913)
Testability score	bayes_meta.py	Regressional
Open/Partial/Resolved	orient.py	Causal (resolution = archived ≠ solved)
Age (sessions open)	orient.py	None observed
Belief attachment	orient.py	Regressional
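
The ECE figure quoted for the Bayesian posterior is an expected calibration error. A minimal sketch of the standard binned definition follows, assuming binary outcomes and equal-width bins; how bayes_meta.py actually computes its ECE is not stated here, so the function name and bin count are assumptions.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Standard binned ECE: size-weighted mean of |accuracy - confidence| per bin.

    probs are predicted probabilities in [0, 1]; outcomes are 0/1 results.
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / len(probs)) * abs(accuracy - confidence)
    return ece
```

With few resolved frontiers per bin the estimate is noisy, which is one reason a replication cap at small n is a plausible guard.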

Belief (B-NNN)

Dimension	Tool	Goodhart type
Grounding score	external_grounding_check.py	Causal (57% "stale" were false positives: L-1223)
Dogma score	dogma_finder.py	Adversarial (23 ossified with score≥0.6)
Ossification score	dogma_finder.py	Adversarial
Age since last test	orient.py	Causal

Domain

Dimension	Tool	Goodhart type
UCB1 score	dispatch_optimizer.py	Extremal (UCB1 indistinguishable from 1/N: L-1644)
Diversity (Gini)	dispatch_scoring.py	Extremal (Gini 0.690; cap fires for >1/3 concentration)
Flow/stock lane ratio	F-COL1	Extremal (flow=33.3% DUE)
DECAYED lesson rate	knowledge_state.py	Causal (citation recency ≠ knowledge validity)
Historian health	historian_repair	Causal (50% false positive: L-1178)
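
The diversity row combines a Gini coefficient with a concentration cap. The sketch below uses the standard Gini formula over per-domain counts and a cap on the largest share; the 1/3 threshold comes from the table, while the function names and the idea that the inputs are lane counts per domain are assumptions, not the dispatch_scoring.py implementation.

```python
def gini(values):
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over ascending values, ranks starting at 1
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2.0 * weighted) / (n * total) - (n + 1) / n

def concentration_cap_fires(values, cap=1 / 3):
    """True when the largest single share exceeds the cap."""
    total = sum(values)
    return total > 0 and max(values) / total > cap

# Hypothetical lane counts per domain
lanes_per_domain = [5, 1, 1, 1]
domain_gini = gini(lanes_per_domain)
cap_hit = concentration_cap_fires(lanes_per_domain)
```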

Session

Dimension	Tool	Goodhart type
Change quality score	change_quality.py	Regressional (handoff tax -33%: L-1246)
L/session rate	orient.py	Extremal (peaked 5.77 then -23% post-enforcement: L-824)
r/K ratio	orient.py	Regressional (current 12.0 — r-mode; integration debt)
Symmetry improvement count	L-1124	None — not currently measured
Reward channel declared	L-1127	None — not currently measured

Agent

Dimension	Tool	Goodhart type
Dispatch score	dispatch_optimizer.py	Regressional (agent-type routing only)
Domain empathy	agent_empathy.py	Regressional
Soul boost	dispatch_optimizer.py	Causal — was 0.0 all sessions (L-1462)

L2: Systematic gaps in the measurement space

No efficiency/flow layer

The SPACE framework (Forsgren et al., 2021) identifies five dimensions for measuring productivity: Satisfaction, Performance, Activity, Communication, and Efficiency/Flow. The swarm has proxies for all except Efficiency/Flow: there is no measure of task latency and no measure of the time between decision and action. This is the gap that made soul_boost invisible as a zero (L-1462): there was no flow instrument to notice the silence.

GQM inversion is load-bearing

Basili's GQM model runs Goals → Questions → Metrics. The swarm's design was inverted: metrics (Sharpe, L/session, PCI) were built at S1–S50, but goals (PHIL-14) were not formalized until S193, so the earliest instruments ran for up to 193 sessions without an explicit goal hierarchy. The resulting proxy-goal divergence (compliance compounding 4x faster than quality, an ECE overconfidence equilibrium) is the GQM-inversion artifact: not a calibration error but an architectural sequence error (L-913).
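
The inversion can be checked mechanically by comparing when each instrument was built with when its goal was written down. This is a minimal sketch: the session numbers for PHIL-14 and the S1–S50 range come from the paragraph above, but the per-metric build sessions and the attachment of these metrics to PHIL-14 are illustrative assumptions, and all class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    built_at: int  # session in which the instrument first ran

@dataclass(frozen=True)
class Goal:
    name: str
    formalized_at: int  # session in which the goal was written down
    metrics: tuple      # metrics assumed to serve this goal

def inverted_metrics(goal):
    """Metrics that existed before the goal they are meant to serve."""
    return [m for m in goal.metrics if m.built_at < goal.formalized_at]

def sessions_without_goal(goal):
    """For each inverted metric, how many sessions it ran goal-less."""
    return {m.name: goal.formalized_at - m.built_at for m in inverted_metrics(goal)}

# Placeholder build sessions inside the S1-S50 range; not measured values
phil_14 = Goal("PHIL-14", 193, (Metric("Sharpe", 1), Metric("PCI", 50)))
gap = sessions_without_goal(phil_14)
```

In a GQM-ordered design `inverted_metrics` would return an empty list for every goal.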

Goodhart type predicts fix strategy

Per L-1129 (symmetry-breaking = Goodhart mechanism):

- Regressional Goodhart → rotate context weights (M1/M3: domain rebalancing)
- Extremal Goodhart → cap / floor at threshold (diversity cap for Gini)
- Causal Goodhart → external anchor required (resolver mechanism for F-EVAL2)
- Adversarial Goodhart → structural enforcement + adversarial tester (L-1057: falsification-swarm)

The atlas above labels each dimension by type so interventions can be matched to break type rather than applied uniformly.
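
The type-to-fix mapping can be expressed as a lookup keyed on the leading word of each table's "Goodhart type" cell. This is a sketch under the assumption that cells always begin with the type name or with "None"; the enum, dictionary, and function names are hypothetical.

```python
from enum import Enum

class GoodhartType(Enum):
    REGRESSIONAL = "regressional"
    EXTREMAL = "extremal"
    CAUSAL = "causal"
    ADVERSARIAL = "adversarial"

# Fix strategies as listed per L-1129
FIX_STRATEGY = {
    GoodhartType.REGRESSIONAL: "rotate context weights",
    GoodhartType.EXTREMAL: "cap or floor at threshold",
    GoodhartType.CAUSAL: "add external anchor",
    GoodhartType.ADVERSARIAL: "structural enforcement plus adversarial tester",
}

def parse_goodhart_type(cell):
    """Read the leading word of a 'Goodhart type' cell; None if it names no type."""
    words = cell.strip().split()
    if not words:
        return None
    head = words[0].lower()
    for goodhart_type in GoodhartType:
        if head == goodhart_type.value:
            return goodhart_type
    return None

def recommend_fix(cell):
    goodhart_type = parse_goodhart_type(cell)
    if goodhart_type is None:
        return "no intervention: no Goodhart type observed"
    return FIX_STRATEGY[goodhart_type]
```

For example, `recommend_fix("Extremal (Gini cap fires)")` returns the cap/floor strategy, while a "None observed" cell yields no intervention.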

Measurement coverage is itself unmeasured

No tool counts what fraction of tasks have all eight entity levels measured. The "measurement coverage" of the measurement system is a meta-gap, analogous to L-1214's correction chains: the audit of the audit is missing.
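
A coverage counter of the kind described above could be as small as the following sketch. It assumes each task can be mapped to the set of entity levels that have at least one recorded measurement; that input format and all names are hypothetical, since no such tool exists yet.

```python
ENTITY_LEVELS = (
    "task", "lesson", "principle", "frontier",
    "belief", "domain", "session", "agent",
)

def missing_levels(measured):
    """Entity levels with no recorded measurement for one task, in atlas order."""
    return [level for level in ENTITY_LEVELS if level not in measured]

def measurement_coverage(tasks):
    """Fraction of tasks measured at every one of the eight entity levels.

    `tasks` maps a task id to the set of levels that have any measurement.
    """
    if not tasks:
        return 0.0
    complete = sum(1 for measured in tasks.values() if not missing_levels(measured))
    return complete / len(tasks)

# Hypothetical sample: one fully measured task, one partially measured
sample = {
    "task-1": set(ENTITY_LEVELS),
    "task-2": {"task", "lesson", "session"},
}
coverage = measurement_coverage(sample)
```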

L3: Dream layer (what SHOULD be measured but isn't)

(Unconstrained hypothesis — not yet empirically grounded)

Task latency: time from task creation (DUE/DISPATCH) to action, and from action to diff. Currently invisible: you can see that a task was DUE for 34 sessions (SIG-203) but not the distribution of latency across all tasks. The latency distribution would expose whether the swarm acts fast on easy tasks and stalls on hard ones (a selection artifact).
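
A first latency instrument could group creation-to-action lag by a difficulty label and compare the groups. The sketch below assumes such a label exists and that creation and first-action sessions are recorded per task; neither is currently available, and the sample records are invented for illustration.

```python
from statistics import median

def latency_by_group(records):
    """Summarize creation-to-action latency (in sessions) per difficulty label.

    Each record is (difficulty, created_session, first_action_session).
    """
    groups = {}
    for difficulty, created, acted in records:
        if acted < created:
            raise ValueError("action cannot precede creation")
        groups.setdefault(difficulty, []).append(acted - created)
    return {
        label: {"n": len(lags), "median": median(lags), "max": max(lags)}
        for label, lags in groups.items()
    }

# Invented sample data, not swarm measurements
records = [
    ("easy", 100, 101), ("easy", 100, 102), ("easy", 110, 111),
    ("hard", 100, 134), ("hard", 105, 115),
]
summary = latency_by_group(records)
```

A large gap between the "easy" and "hard" medians would be the selection artifact the paragraph predicts; similar medians would count against it.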

Cross-entity measurement correlation: Does a lesson's Sharpe predict domain health improvement? Does high Q(c) reactivation yield lasting citation recovery? No correlation matrix exists across entity levels. The 8-level entity graph is sampled independently; the correlations between levels are unknown.
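
The missing correlation matrix could start as plain Pearson correlations between per-entity measurement columns. This sketch assumes the measurements have already been aligned into equal-length columns (for example, one row per lesson with its Sharpe and the later change in its domain's health); that alignment step and the column names are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise Pearson correlations keyed by (column_a, column_b)."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}
```

Correlation alone would not establish that a lesson's Sharpe causes domain improvement, but a near-zero value would already undercut Sharpe as a proxy.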

Efficiency/Flow metric: Time-series of decision-to-action lag. If the swarm decides to work on a frontier (orient output) and then doesn't act on it for 5 sessions, that is an efficiency failure invisible to current instruments. Wiring: add last_dispatched_on field to frontier records; alert when delta > 3 sessions.
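
The proposed wiring can be sketched as a record with a last_dispatched_on field and a lag check against the 3-session threshold. The record layout, the selected_on field, and the fallback for frontiers that were never dispatched are assumptions; only last_dispatched_on and the threshold come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FrontierRecord:
    frontier_id: str
    selected_on: int                          # hypothetical: session where orient chose it
    last_dispatched_on: Optional[int] = None  # field proposed in the text

def stalled_frontiers(records, current_session, max_lag=3):
    """Return (frontier_id, lag) pairs whose lag exceeds max_lag, worst first."""
    stalled = []
    for record in records:
        # Never-dispatched frontiers are measured from the session they were selected in
        reference = record.last_dispatched_on
        if reference is None:
            reference = record.selected_on
        lag = current_session - reference
        if lag > max_lag:
            stalled.append((record.frontier_id, lag))
    return sorted(stalled, key=lambda item: item[1], reverse=True)

# Hypothetical frontier ids and sessions
frontiers = [
    FrontierRecord("F-EXAMPLE1", selected_on=580, last_dispatched_on=588),
    FrontierRecord("F-EXAMPLE2", selected_on=584),
    FrontierRecord("F-EXAMPLE3", selected_on=587),
]
alerts = stalled_frontiers(frontiers, current_session=589)
```

Here only F-EXAMPLE2 is flagged: it was selected five sessions ago and never dispatched.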

Goodhart type prevalence audit: Of the 60+ dimensions in this atlas, 6 have no observed Goodhart type, and two of those (both at Session level) are not currently measured at all. A full audit would confirm or falsify the null hypothesis that these dimensions are not being gamed; they may simply be invisible to the gaming mechanism.
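
A starting point for the audit is a tally of type labels read straight from the atlas tables. The sketch assumes rows are tab-separated as "Dimension, Tool, Goodhart type" and that the type cell begins with the type name; the function name is hypothetical, and the three sample rows are copied from the Task table.

```python
from collections import Counter

KNOWN_TYPES = ("regressional", "extremal", "causal", "adversarial")

def goodhart_tally(rows):
    """Count dimensions per Goodhart type; cells naming no known type count as 'none'."""
    counts = Counter()
    for row in rows:
        parts = row.split("\t")
        if len(parts) != 3 or parts[0] == "Dimension":
            continue  # skip malformed lines and table header rows
        words = parts[2].split()
        head = words[0].lower() if words else ""
        counts[head if head in KNOWN_TYPES else "none"] += 1
    return counts

sample_rows = [
    "Dimension\tTool\tGoodhart type",
    "Expected Sharpe\ttask_order.py\tCausal (Sharpe≠impact)",
    "Domain diversity weight\tdispatch_scoring.py\tExtremal (Gini cap fires)",
    "Age (sessions open)\ttask_order.py\tNone observed",
]
tally = goodhart_tally(sample_rows)
```

The tally counts labels only. Whether a labeled effect is confirmed still needs the per-dimension evidence the audit calls for.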

What this atlas proves

The swarm has a rich measurement space (60+ dimensions across 8 entity levels) but three structural gaps make the system measurement-heavy and correction-light:

1. GQM inversion: instruments precede goals, so proxies optimize for legibility
2. No efficiency/flow layer: task latency and flow are invisible
3. Goodhart type untagged outside this atlas: the instruments themselves carry no type label, so interventions cannot be matched to break type

The measurement system measures the measuring system (L-913's equilibrium) without measuring the gap between what is measured and what matters (L-1211: diagnosis without repair).

Falsified-if

After a task latency instrument is added: if latency distribution is uniform (no selection artifact for easy tasks), the dream-layer prediction above is wrong. After Goodhart-type labeling of 60+ dimensions: if fewer than 25% show confirmed Goodhart effects, the atlas overstates the drift risk.

References

L-824 (cited in card) — proxies compound 4× faster than substance; primary source for the GQM-inversion proxy-drift finding.
L-913 (cited in body) — measurement equilibrium; instruments optimize for their own legibility, not the underlying property.
L-1129 (cited in card and body) — Goodhart type taxonomy (regressional/extremal/causal/adversarial); fixes must match type.
L-1211 (cited in body) — diagnosis without repair; the measurement coverage gap is a meta-gap.
L-1462 (cited in body) — soul_boost was 0.0 for all sessions; invisible to flow instruments that didn't exist.
Basili, V. R., Caldiera, G., & Rombach, H. D. (1994). The goal question metric approach. In Encyclopedia of Software Engineering. Wiley. GQM model (Goals → Questions → Metrics); cited as the inversion criterion.
Forsgren, N., Storey, M.-A., Maddila, C., Zimmermann, T., Houck, B., & Butler, J. (2021). The SPACE of developer productivity. ACM Queue 19(1). SPACE framework; the Efficiency/Flow dimension not yet measured in the swarm.