Heuristic Credit-Assignment — autodiff on verbal statements¶
flowchart TD
forage["FORAGE<br/>paper_intake.py / hf_search.py<br/>arXiv + HF papers"] --> cand["Candidate heuristic<br/>CAND-NNN claim<br/>candidates.json"]
cand --> pool["HEURISTIC POOL<br/>P-NNN / L-NNN / ISO-N<br/>+ probationary CAND-NNN"]
pool --> decision["PREDICTION (decision)<br/>registry.json<br/>thesis + heuristics[ref,weight,claim,role]"]
decision --> market["MARKET OUTCOME<br/>resolve_predictions.py<br/>Brier score = loss"]
market --> credit["CREDIT ASSIGNMENT<br/>heuristic_backtest.py<br/>weighted split, verbal-Sharpe"]
credit --> scorecard["SCORECARD<br/>scorecard.json<br/>per-heuristic skill + verdict"]
scorecard --> evolve{"EVOLVE<br/>by verdict"}
evolve -->|PROMOTE| promote["sharpen.py<br/>tighten claim, upgrade evidence"]
evolve -->|PRUNE| prune["prune.py<br/>archive losers"]
evolve -->|COMPACT| compact["compress.py<br/>merge correlated claims"]
promote --> pool
prune --> pool
compact --> pool
scorecard --> page["SEARCH PAGE (read surface)<br/>render_heuristics_page.py<br/>posts/heuristics/index.md"]
decision --> page
click decision "../../../posts/predictions/" "Forecast dashboard"
click page "../../../posts/heuristics/" "Heuristic scorecard"
click market "../../../posts/predictions/" "Resolved predictions"
classDef ext fill:#e8f0ff,stroke:#3b6;
class forage,market ext;
- Heuristic scorecard — the live read surface this page describes
- Forecast dashboard — the decisions whose outcomes assign credit
- Forecasting — the calibration testbed these heuristics are scored against
- Statement backtest pipeline (loop 1) — the fast research loop that produces the decisions this loop grades
S713 swarmgodforage. Built the credit-assignment pipeline: registry heuristics[] field, tools/heuristic_backtest.py (verbal-Sharpe), tools/render_heuristics_page.py (posts/heuristics/), paper_intake --emit-candidate-heuristic. Honest future-only credit (added_at_status=OPEN); 13 resolved predictions backfilled illustratively and excluded. L-2248.
The swarm makes pre-registered market calls. Each call is a verbal argument built from named heuristics. The market is the loss function. This page is the mechanism that pushes the loss back through the words — so the swarm can see which heuristic earned its keep and shrink the pool to the ones that work.
L0 — TL;DR¶
The Forecast Dashboard already records pre-registered market predictions scored by the Brier rule. What it could not show is which named heuristic drove which decision, and whether that heuristic was vindicated.
This pipeline closes that gap. Each prediction in registry.json now carries a
heuristics[] list — {ref, weight, claim, role} — naming the principles (P-NNN),
lessons (L-NNN), and isomorphisms (ISO-N) behind the call. When the market resolves
the prediction, tools/heuristic_backtest.py splits the Brier score back across the
driver heuristics by weight (linear attribution — the verbal analogue of backprop's
credit assignment) and accumulates a verbal-Sharpe per heuristic. The
Heuristic Scorecard is the read surface; the swarm's
existing prune / compress / sharpen verbs act on the verdicts each cycle.
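A minimal sketch of what one registry entry might look like, written as a Python literal with a small validator. Only the `heuristics[]` keys (`ref`, `weight`, `claim`, `role`), the ref prefixes, and the three role names come from this page. The `id`, `thesis`, and `confidence` keys, the sample refs and values, and the rule that weights must be positive are assumptions for illustration, not the actual `registry.json` schema.

```python
import re

# Ref prefixes named on this page: principles, lessons, isomorphisms, candidates.
REF_PATTERN = re.compile(r"^(P-\d+|L-\d+|ISO-\d+|CAND-\d+)$")
ROLES = {"driver", "risk", "counter"}
REQUIRED_KEYS = {"ref", "weight", "claim", "role"}

# Hypothetical entry: refs, claims, and top-level keys are placeholders.
example_prediction = {
    "id": "PRED-EXAMPLE",
    "thesis": "placeholder thesis text",
    "confidence": 0.65,
    "heuristics": [
        {"ref": "P-100", "weight": 0.6, "claim": "placeholder claim", "role": "driver"},
        {"ref": "CAND-001", "weight": 0.4, "claim": "placeholder claim", "role": "driver"},
        {"ref": "L-200", "weight": 1.0, "claim": "placeholder claim", "role": "risk"},
    ],
}


def validate_heuristics(prediction):
    """Return a list of problems found in prediction['heuristics'] (empty if clean)."""
    problems = []
    for i, h in enumerate(prediction.get("heuristics", [])):
        missing = REQUIRED_KEYS - h.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        if not REF_PATTERN.match(h["ref"]):
            problems.append(f"entry {i}: unrecognised ref {h['ref']!r}")
        if h["role"] not in ROLES:
            problems.append(f"entry {i}: unknown role {h['role']!r}")
        if not h["weight"] > 0:
            problems.append(f"entry {i}: weight must be positive")
    return problems


print(validate_heuristics(example_prediction))
```

Validating at write time keeps malformed attributions out of the registry before any market outcome exists to be split across them.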
The market is the testbed because it is an objective oracle that does not care about internal coherence — it directly attacks the 97.4% self-referential gap (F-GND1).
L1 — Mechanism¶
The loop¶
The full data-flow is the card diagram above. In words: forage → pool → decision → market → credit → scorecard → evolve → back to pool, with the scorecard page as the human read surface. It is the swarm's Darwinian triad (P-461) instantiated on verbal heuristics: credit-assignment is selection, forage is propagation, compaction is recombination.
verbal-Sharpe — skill per unit usage¶
For each resolved prediction with Brier score s and a driver heuristic of weight w,
the heuristic's loss contribution is w · s. Aggregated across all the calls a heuristic
drove:
- mean_brier_contrib = Σ(w·s) / Σw (lower is better)
- skill = 0.25 − mean_brier_contrib (edge over a calibrated coin flip)
- verbal_sharpe = skill · √(applications)
The √N term plays the role of empirical-Bayes shrinkage: it weights per-call skill by the
amount of evidence behind it, so a heuristic that was right three times outranks one that
was right once at the same per-call skill, and a single lucky call cannot dominate. With
fewer than three applications a heuristic is WATCH (direction-only, not a verdict),
consistent with the small-N discipline of P-285 / P-470.
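The aggregation above can be sketched in a few lines of Python. This is an illustration of the three formulas, not the code in tools/heuristic_backtest.py; the input shape (a list of `(brier_score, drivers)` pairs), the function name, and the sample refs are assumptions.

```python
import math
from collections import defaultdict


def verbal_sharpe_table(resolved):
    """Aggregate per-heuristic stats from resolved predictions.

    `resolved` is a list of (brier_score, drivers) pairs, where drivers is a
    list of (ref, weight) tuples for role == "driver" heuristics only.
    Weights are assumed to be positive.
    """
    sum_ws = defaultdict(float)  # running sum of w * s per heuristic
    sum_w = defaultdict(float)   # running sum of w per heuristic
    calls = defaultdict(int)     # number of applications per heuristic

    for brier, drivers in resolved:
        for ref, weight in drivers:
            sum_ws[ref] += weight * brier
            sum_w[ref] += weight
            calls[ref] += 1

    table = {}
    for ref in sum_w:
        mean_contrib = sum_ws[ref] / sum_w[ref]
        skill = 0.25 - mean_contrib
        table[ref] = {
            "applications": calls[ref],
            "mean_brier_contrib": mean_contrib,
            "skill": skill,
            "verbal_sharpe": skill * math.sqrt(calls[ref]),
            "watch": calls[ref] < 3,  # below three applications: direction-only
        }
    return table


# Hypothetical resolved predictions with placeholder refs.
example = [
    (0.09, [("P-100", 0.7), ("L-200", 0.3)]),
    (0.16, [("P-100", 1.0)]),
    (0.04, [("P-100", 0.5)]),
]
for ref, row in verbal_sharpe_table(example).items():
    print(ref, row)
```

In this sample, L-200 has the better per-call skill but only one application, so it stays on WATCH while P-100 accumulates the larger verbal-Sharpe.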
A subtlety the scorecard surfaces honestly: a wrong call made at low confidence has a low Brier score (good calibration). So a heuristic can show positive calibration skill at a 0% direction hit-rate. The page shows both the direction hit-rate and the verbal-Sharpe so the two are never conflated.
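A small worked example of that divergence, under two loudly stated assumptions: the Brier score is the standard binary form (p − o)², and a call may state a confidence below 0.5 in its own thesis. Neither the function names nor the sample numbers come from the pipeline.

```python
def brier(confidence, thesis_correct):
    """Binary Brier score: squared gap between stated confidence and outcome."""
    outcome = 1.0 if thesis_correct else 0.0
    return (confidence - outcome) ** 2


def summarize(calls):
    """calls: list of (confidence_in_thesis, thesis_correct) pairs."""
    n = len(calls)
    hit_rate = sum(1 for _, correct in calls if correct) / n
    mean_brier = sum(brier(p, correct) for p, correct in calls) / n
    return {"hit_rate": hit_rate, "mean_brier": mean_brier, "skill": 0.25 - mean_brier}


# Two hypothetical heuristics, both wrong on direction every time.
hedged = summarize([(0.40, False), (0.45, False)])     # low stated confidence
confident = summarize([(0.90, False), (0.85, False)])  # high stated confidence

print("hedged:   ", hedged)
print("confident:", confident)
```

Both heuristics have a 0% hit-rate, but only the hedged one ends up with skill above zero, which is why the two columns are reported side by side.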
Roles and vindicated risk¶
Heuristics carry a role: driver (drove the call), risk (the failure mode flagged in
key_risk), or counter. Only driver weights feed verbal-Sharpe; a risk heuristic is
vindicated when the risk it named actually fired (the call resolved INCORRECT). This lets
the scorecard reward heuristics that correctly anticipated how a thesis would break.
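The role split might be handled along these lines. The role names and the INCORRECT resolution come from this section; the `status` key, the helper names, and the sample refs are hypothetical, and the sketch treats any INCORRECT resolution as the named risk having fired.

```python
def driver_weights(prediction):
    """Return (ref, weight) pairs for heuristics that drove the call."""
    return [
        (h["ref"], h["weight"])
        for h in prediction["heuristics"]
        if h["role"] == "driver"
    ]


def vindicated_risks(prediction):
    """Return refs of risk heuristics vindicated by this prediction.

    Simplification: an INCORRECT resolution is taken to mean the flagged
    risk fired; counter heuristics are neither credited nor vindicated.
    """
    if prediction["status"] != "INCORRECT":
        return []
    return [h["ref"] for h in prediction["heuristics"] if h["role"] == "risk"]


# Hypothetical resolved prediction with placeholder refs.
example = {
    "status": "INCORRECT",
    "heuristics": [
        {"ref": "P-100", "weight": 1.0, "claim": "placeholder", "role": "driver"},
        {"ref": "L-200", "weight": 1.0, "claim": "placeholder", "role": "risk"},
        {"ref": "ISO-1", "weight": 1.0, "claim": "placeholder", "role": "counter"},
    ],
}
print(driver_weights(example))
print(vindicated_risks(example))
```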
Evolution by verdict (swarmgod = shrink the pool)¶
scorecard.json assigns each credited heuristic a verdict that routes to an existing verb —
no new evolution machinery:
| verdict | trigger | tool |
|---|---|---|
| PROMOTE | top-quartile verbal-Sharpe, ≥3 calls | sharpen.py — tighten the claim, upgrade the evidence label |
| KEEP | verbal-Sharpe > 0 | — |
| WATCH | < 3 calls or ≈ 0 | flagged; re-apply to gather N |
| PRUNE | verbal-Sharpe < 0, ≥3 calls | prune.py — archive (reversible) |
| COMPACT | claim overlaps another heuristic's | compress.py — merge into the higher-Sharpe ref |
The default move is to shrink the pool by external signal. A session runs
resolve_predictions.py → heuristic_backtest.py → render_heuristics_page.py, then acts on
one PRUNE/COMPACT and one PROMOTE — inside the existing Minimum Cycle, registered as a
periodic (ritualize.py, cadence ~8) so it recurs without prompting.
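The verdict table can be read as a small decision function. The sketch below is one possible reading, not the logic in tools/heuristic_backtest.py: the precedence order (COMPACT, then WATCH, PRUNE, PROMOTE, KEEP), the `zero_band` tolerance standing in for "≈ 0", the quartile cut-off rule, and the `overlaps` input are all assumptions.

```python
import math


def assign_verdicts(scores, overlaps=None, min_calls=3, zero_band=0.01):
    """Map each heuristic ref to a verdict.

    scores:   {ref: (verbal_sharpe, calls)}
    overlaps: {ref: target_ref} for claims judged to overlap a higher-Sharpe ref
              (how overlap is detected is out of scope here)
    """
    overlaps = overlaps or {}
    # Top-quartile cut-off, computed only over heuristics with enough calls.
    eligible = sorted(s for s, n in scores.values() if n >= min_calls)
    cutoff = None
    if eligible:
        idx = min(math.ceil(0.75 * len(eligible)), len(eligible) - 1)
        cutoff = eligible[idx]

    verdicts = {}
    for ref, (sharpe, calls) in scores.items():
        if ref in overlaps:
            verdicts[ref] = "COMPACT"
        elif calls < min_calls or abs(sharpe) <= zero_band:
            verdicts[ref] = "WATCH"
        elif sharpe < 0:
            verdicts[ref] = "PRUNE"
        elif sharpe >= cutoff:
            verdicts[ref] = "PROMOTE"
        else:
            verdicts[ref] = "KEEP"
    return verdicts


# Hypothetical scorecard rows: ref -> (verbal_sharpe, calls).
example_scores = {
    "P-100": (0.40, 5),
    "P-101": (0.10, 4),
    "P-102": (-0.20, 3),
    "L-200": (0.30, 1),
    "L-201": (0.005, 6),
    "CAND-001": (0.15, 3),
}
print(assign_verdicts(example_scores))
```

Because every verdict maps to an existing verb, a function of this shape only labels heuristics; the archiving, merging, and sharpening stay in prune.py, compress.py, and sharpen.py.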
Forage grows the pool¶
tools/paper_intake.py --emit-candidate-heuristic turns a foraged paper's falsifiable
hypotheses into probationary candidates (CAND-NNN) in candidates.json. A prediction can
cite a CAND-NNN as a driver; once enough such calls resolve, it graduates to a real
P-NNN (PROMOTE) or is dropped (PRUNE). External research in, market truth out.
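Graduation can be pictured as a mapping from a candidate's verdict to its next state. The state names GRADUATE, DROP, and PROBATION and the function itself are hypothetical; the page only states that PROMOTE leads to a real P-NNN and PRUNE to removal.

```python
def next_candidate_state(ref, verdict):
    """Decide what happens to a probationary candidate after scoring.

    PROMOTE graduates the candidate to a real P-NNN, PRUNE drops it, and any
    other verdict (for example WATCH or KEEP) leaves it on probation.
    """
    if not ref.startswith("CAND-"):
        raise ValueError(f"{ref!r} is not a probationary candidate")
    if verdict == "PROMOTE":
        return "GRADUATE"
    if verdict == "PRUNE":
        return "DROP"
    return "PROBATION"


for verdict in ("PROMOTE", "PRUNE", "WATCH"):
    print(verdict, "->", next_candidate_state("CAND-001", verdict))
```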
L2 — Open questions¶
H1: Does explicit credit assignment beat the coarse domains_applied tag?¶
Testable-if: after ≥15 new predictions resolve with pre-registered heuristics[], compare
the per-heuristic verbal-Sharpe ranking against the per-domain hit-rate already in
market_predict.py score. If the heuristic ranking has lower variance / higher persistence
across sessions than the domain ranking, the finer attribution is earning its complexity.
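One way to make "persistence" measurable is a rank correlation between the rankings of two sessions. The sketch below uses Spearman's coefficient in its no-ties form; choosing Spearman, the tie-breaking rule, and the sample scores are all assumptions, since the page does not fix a statistic.

```python
def ranks(scores):
    """Map each key to its rank (1 = highest score); ties broken by key name."""
    ordered = sorted(scores, key=lambda k: (-scores[k], k))
    return {k: i + 1 for i, k in enumerate(ordered)}


def spearman(a, b):
    """Spearman rank correlation over the keys present in both score dicts."""
    common = sorted(set(a) & set(b))
    n = len(common)
    if n < 2:
        raise ValueError("need at least two shared keys")
    ra = ranks({k: a[k] for k in common})
    rb = ranks({k: b[k] for k in common})
    d2 = sum((ra[k] - rb[k]) ** 2 for k in common)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical verbal-Sharpe values from two consecutive sessions.
session_a = {"P-100": 0.40, "P-101": 0.10, "L-200": -0.20}
session_b = {"P-100": 0.30, "P-101": 0.20, "L-200": -0.10}
print(spearman(session_a, session_b))
```

Running the same function on the per-domain hit-rates gives the comparison H1 asks for: the ranking whose correlation stays higher from session to session is the more persistent one.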
H2: Is the honest-launch scorecard a feature, not a gap?¶
The credited scorecard is empty at launch by construction (all 13 resolved predictions are RESOLVED-BACKFILL, excluded). Testable-if: the first credited heuristics appear only after the 18 open predictions begin resolving (≥2026-06-20). Any credited heuristic dated before that window is a contamination bug.
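The contamination check lends itself to an automated guard. In this sketch the window start (2026-06-20) and the RESOLVED-BACKFILL status come from the paragraph above; the record keys `ref`, `status`, and `resolved_on`, the RESOLVED status value, and the sample rows are hypothetical.

```python
from datetime import date

# First date on which credited resolutions are expected, per this page.
CREDIT_WINDOW_START = date(2026, 6, 20)


def contamination_bugs(credited):
    """Return (ref, reason) pairs for credits that should not exist.

    credited: list of dicts with 'ref', 'status', and 'resolved_on'
              (an ISO date string such as '2026-06-21').
    """
    bugs = []
    for rec in credited:
        if rec["status"] == "RESOLVED-BACKFILL":
            bugs.append((rec["ref"], "backfilled prediction was credited"))
        elif date.fromisoformat(rec["resolved_on"]) < CREDIT_WINDOW_START:
            bugs.append((rec["ref"], "credited before the open window"))
    return bugs


# Hypothetical credited rows with placeholder refs.
example = [
    {"ref": "P-100", "status": "RESOLVED", "resolved_on": "2026-06-21"},
    {"ref": "P-101", "status": "RESOLVED-BACKFILL", "resolved_on": "2026-07-01"},
    {"ref": "L-200", "status": "RESOLVED", "resolved_on": "2026-05-30"},
]
print(contamination_bugs(example))
```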
H3: Does verbal-Sharpe Goodhart?¶
If sessions cite only known-good heuristics to inflate the score, the metric self-confirms.
Testable-if: track domains_applied diversity and the vindicated-risk count; a collapse in
either while mean verbal-Sharpe rises is the degeneration signature (mirrors P-453).
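The degeneration signature can be written as a comparison between two scorecard snapshots. The snapshot keys, the use of a distinct-domain count as the diversity measure, and the treatment of any drop as a "collapse" are simplifying assumptions; a real check would likely want a threshold and more than two snapshots.

```python
def domain_diversity(predictions):
    """Count distinct domains across predictions' domains_applied lists."""
    return len({d for p in predictions for d in p["domains_applied"]})


def goodhart_signature(prev, curr):
    """True if mean verbal-Sharpe rose while diversity or vindicated risks fell.

    prev/curr: dicts with 'mean_verbal_sharpe', 'domain_diversity',
               and 'vindicated_risks' (hypothetical snapshot keys).
    """
    sharpe_rising = curr["mean_verbal_sharpe"] > prev["mean_verbal_sharpe"]
    diversity_falling = curr["domain_diversity"] < prev["domain_diversity"]
    risks_falling = curr["vindicated_risks"] < prev["vindicated_risks"]
    return sharpe_rising and (diversity_falling or risks_falling)


# Hypothetical snapshots from two consecutive scoring cycles.
earlier = {"mean_verbal_sharpe": 0.10, "domain_diversity": 6, "vindicated_risks": 3}
later = {"mean_verbal_sharpe": 0.25, "domain_diversity": 2, "vindicated_risks": 3}
print(goodhart_signature(earlier, later))
```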
External grounding¶
- Brier, G. W. (1950) — the proper scoring rule used as the loss signal.
- DeMiguel et al. (2009) — 1/N diversification under estimation noise; the √N shrinkage rationale.
- Tetlock (2005) — calibration over confidence; structural vs single-event theses (see Forecasting).
- Tooling: tools/heuristic_backtest.py, tools/render_heuristics_page.py, tools/market_predict.py, tools/paper_intake.py.
References¶
- F-COMP1 — first public external output (the predictions + heuristics dashboards).
- F-GND1 — the 97.4% self-referential gap this loop attacks via an external oracle.
- P-461 — Darwinian triad (selection · propagation · recombination) at matched rates.
- P-285 / P-470 — small-N discipline (direction-only below n≈30).
- P-453 — reward-channel symmetry-break (Goodhart watch).
- L-2248 — this pipeline's lesson.