Heuristic Credit-Assignment — autodiff on verbal statements¶

Autodiff/backtesting on verbal statements: every market call names the heuristics (P-NNN / L-NNN / ISO-N) that drove it; when the market resolves the call, its Brier score is split back across those heuristics by weight. Heuristics that keep being right rise in rank (verbal-Sharpe); ones that keep being wrong are pruned or compacted. Finance is the testbed because the market is an objective oracle; forage grows the heuristic pool from papers. Credit is earned forward only, never backfilled (anti-hindsight).

🌱 seedling tended 2026-06-02 S713 investigation finance heuristics credit-assignment backtesting calibration swarmgod compaction forage F-COMP1

flowchart TD
  forage["FORAGE<br/>paper_intake.py / hf_search.py<br/>arXiv + HF papers"] --> cand["Candidate heuristic<br/>CAND-NNN claim<br/>candidates.json"]
  cand --> pool["HEURISTIC POOL<br/>P-NNN / L-NNN / ISO-N<br/>+ probationary CAND-NNN"]
  pool --> decision["PREDICTION (decision)<br/>registry.json<br/>thesis + heuristics[ref,weight,claim,role]"]
  decision --> market["MARKET OUTCOME<br/>resolve_predictions.py<br/>Brier score = loss"]
  market --> credit["CREDIT ASSIGNMENT<br/>heuristic_backtest.py<br/>weighted split, verbal-Sharpe"]
  credit --> scorecard["SCORECARD<br/>scorecard.json<br/>per-heuristic skill + verdict"]
  scorecard --> evolve{"EVOLVE<br/>by verdict"}
  evolve -->|PROMOTE| promote["sharpen.py<br/>tighten claim, upgrade evidence"]
  evolve -->|PRUNE| prune["prune.py<br/>archive losers"]
  evolve -->|COMPACT| compact["compress.py<br/>merge correlated claims"]
  promote --> pool
  prune --> pool
  compact --> pool
  scorecard --> page["SEARCH PAGE (read surface)<br/>render_heuristics_page.py<br/>posts/heuristics/index.md"]
  decision --> page
  click decision "../../../posts/predictions/" "Forecast dashboard"
  click page "../../../posts/heuristics/" "Heuristic scorecard"
  click market "../../../posts/predictions/" "Resolved predictions"
  classDef ext fill:#e8f0ff,stroke:#3b6;
  class forage,market ext;

L0 — TL;DR¶

The Forecast Dashboard already records pre-registered market predictions scored by the Brier rule. What it could not show is which named heuristic drove which decision, and whether that heuristic was vindicated.

This pipeline closes that gap. Each prediction in registry.json now carries a heuristics[] list — {ref, weight, claim, role} — naming the principles (P-NNN), lessons (L-NNN), and isomorphisms (ISO-N) behind the call. When the market resolves the prediction, tools/heuristic_backtest.py splits the Brier score back across the driver heuristics by weight (linear attribution — the verbal analogue of backprop's credit assignment) and accumulates a verbal-Sharpe per heuristic. The Heuristic Scorecard is the read surface; the swarm's existing prune / compress / sharpen verbs act on the verdicts each cycle.
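As a rough illustration of the heuristics[] shape described above, the sketch below builds one entry and checks it. Only the four keys (ref, weight, claim, role), the ref prefixes, and the three role names come from this page; the `id` and `thesis` keys, every value, and the `validate_heuristics` helper are hypothetical and may not match the real registry.json.

```python
import re

# Ref formats named on this page: P-NNN, L-NNN, ISO-N, plus probationary CAND-NNN.
REF_PATTERN = re.compile(r"^(P|L|ISO|CAND)-\d+$")
ROLES = {"driver", "risk", "counter"}
REQUIRED_KEYS = {"ref", "weight", "claim", "role"}

# Hypothetical entry: "id", "thesis", and all values are placeholders.
example_prediction = {
    "id": "PRED-EXAMPLE",
    "thesis": "placeholder thesis text",
    "heuristics": [
        {"ref": "P-001", "weight": 0.6, "claim": "placeholder claim", "role": "driver"},
        {"ref": "ISO-1", "weight": 0.4, "claim": "placeholder claim", "role": "driver"},
        {"ref": "L-001", "weight": 1.0, "claim": "placeholder risk", "role": "risk"},
    ],
}


def validate_heuristics(prediction):
    """Return a list of problems found in prediction['heuristics'] (empty if none)."""
    problems = []
    for i, h in enumerate(prediction.get("heuristics", [])):
        missing = REQUIRED_KEYS - set(h)
        if missing:
            problems.append(f"heuristics[{i}] missing keys: {sorted(missing)}")
            continue
        if not REF_PATTERN.match(h["ref"]):
            problems.append(f"heuristics[{i}] has an unrecognised ref: {h['ref']}")
        if h["role"] not in ROLES:
            problems.append(f"heuristics[{i}] has an unknown role: {h['role']}")
        if not (isinstance(h["weight"], (int, float)) and h["weight"] > 0):
            problems.append(f"heuristics[{i}] weight must be a positive number")
    return problems


print(validate_heuristics(example_prediction))
```

The check does not require driver weights to sum to 1, since the attribution formula normalises by the total weight anyway; whether the real registry enforces that is not stated here.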

The market is the testbed because it is an objective oracle that does not care about internal coherence, so scoring against it directly attacks the 97.4% self-referential gap (F-GND1).

L1 — Mechanism¶

The loop¶

The full data-flow is the card diagram above. In words: forage → pool → decision → market → credit → scorecard → evolve → back to pool, with the scorecard page as the human read surface. It is the swarm's Darwinian triad (P-461) instantiated on verbal heuristics: credit-assignment is selection, forage is propagation, compaction is recombination.

verbal-Sharpe — skill per unit usage¶

For each resolved prediction with Brier score s and a driver heuristic of weight w, the heuristic's loss contribution is w · s. Aggregated across all the calls a heuristic drove:

mean_brier_contrib = Σ(w·s) / Σw — lower is better
skill = 0.25 − mean_brier_contrib — edge over a calibrated coin flip
verbal_sharpe = skill · √(applications)

The √N term rewards repeated evidence, in the spirit of empirical-Bayes shrinkage: at the same per-call skill, a heuristic that was right three times outranks one that was right once, so a single lucky call cannot dominate. Below three applications a heuristic is WATCH (direction-only, not a verdict), consistent with the small-N discipline of P-285 / P-470.
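The three formulas above can be sketched in a few lines. This is an illustration under assumptions, not the code in tools/heuristic_backtest.py: the input layout, the function name, and the reading of "applications" as the count of resolved calls that cite the heuristic as a driver are all guesses.

```python
import math
from collections import defaultdict

BASELINE = 0.25        # Brier score of a 50/50 forecast
MIN_APPLICATIONS = 3   # below this a heuristic stays on WATCH


def verbal_sharpe(resolved):
    """resolved: iterable of (brier_score, [(ref, weight), ...]) for driver heuristics.

    Assumes every weight is positive, so the total weight is never zero.
    """
    weighted_loss = defaultdict(float)
    total_weight = defaultdict(float)
    applications = defaultdict(int)
    for brier, drivers in resolved:
        for ref, weight in drivers:
            weighted_loss[ref] += weight * brier   # this call's w * s
            total_weight[ref] += weight
            applications[ref] += 1

    scorecard = {}
    for ref, loss in weighted_loss.items():
        mean_brier_contrib = loss / total_weight[ref]
        skill = BASELINE - mean_brier_contrib
        n = applications[ref]
        scorecard[ref] = {
            "applications": n,
            "mean_brier_contrib": mean_brier_contrib,
            "skill": skill,
            "verbal_sharpe": skill * math.sqrt(n),
            "watch": n < MIN_APPLICATIONS,
        }
    return scorecard


# Placeholder data: three resolved calls, refs are illustrative.
resolved_calls = [
    (0.04, [("P-001", 0.7), ("L-001", 0.3)]),
    (0.09, [("P-001", 1.0)]),
    (0.16, [("P-001", 0.5)]),
]
scorecard = verbal_sharpe(resolved_calls)
```

In this toy data P-001 has a weighted mean Brier contribution of 0.198 / 2.2 = 0.09, so its skill is 0.16 and its verbal-Sharpe is 0.16 · √3; L-001 has a single application and stays on WATCH.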

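A subtlety the scorecard surfaces honestly: a wrong call made at low confidence still earns a low Brier score (good calibration). Assuming the single-event form s = (p − o)² implied by the 0.25 baseline, a wrong call stated at probability p scores p², which beats the baseline only when p < 0.5. So a heuristic can show positive calibration skill at a 0% direction hit-rate. The page shows both the direction hit-rate and the verbal-Sharpe so the two are never conflated.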
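A small numeric sketch of that decoupling, assuming the single-event Brier form (p − o)² and assuming a thesis can be registered with a stated probability below 0.5 (neither is confirmed by this page). The data and the `summarize` helper are hypothetical, and the calls are weighted equally for simplicity.

```python
def brier(p, outcome):
    """Single-event Brier score: squared error between forecast p and a 0/1 outcome."""
    return (p - outcome) ** 2


def summarize(calls):
    """calls: list of (probability assigned to the thesis, thesis_was_correct)."""
    scores = [brier(p, 1 if correct else 0) for p, correct in calls]
    hit_rate = sum(1 for _, correct in calls if correct) / len(calls)
    skill = 0.25 - sum(scores) / len(scores)   # edge over a coin flip
    return hit_rate, skill


# Every thesis below resolved wrong; only the stated confidence differs.
low_confidence_misses = [(0.30, False), (0.40, False), (0.20, False)]
high_confidence_misses = [(0.60, False), (0.70, False), (0.80, False)]

low_hit, low_skill = summarize(low_confidence_misses)
high_hit, high_skill = summarize(high_confidence_misses)
print(low_hit, round(low_skill, 4))
print(high_hit, round(high_skill, 4))
```

Both sets have a 0% hit-rate, but only the low-confidence set ends up with positive skill, which is why the two numbers need to be read side by side.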

Roles and vindicated risk¶

Heuristics carry a role: driver (drove the call), risk (the failure mode flagged in key_risk), or counter. Only driver weights feed verbal-Sharpe; a risk heuristic is vindicated when the risk it named actually fired (the call resolved INCORRECT). This lets the scorecard reward heuristics that correctly anticipated how a thesis would break.
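The role rules can be sketched as two small filters. The `outcome` key and both function names are hypothetical; the page only states that driver weights feed verbal-Sharpe and that a risk heuristic is vindicated when the call resolves INCORRECT.

```python
def driver_weights(prediction):
    """Return (ref, weight) pairs for heuristics that drove the call."""
    return [
        (h["ref"], h["weight"])
        for h in prediction["heuristics"]
        if h["role"] == "driver"
    ]


def vindicated_risks(prediction):
    """Return refs of risk heuristics whose flagged failure mode fired."""
    if prediction["outcome"] != "INCORRECT":
        return []
    return [h["ref"] for h in prediction["heuristics"] if h["role"] == "risk"]


# Placeholder prediction; refs and weights are illustrative only.
failed_call = {
    "outcome": "INCORRECT",
    "heuristics": [
        {"ref": "P-001", "weight": 0.7, "role": "driver"},
        {"ref": "L-001", "weight": 0.3, "role": "risk"},
        {"ref": "ISO-1", "weight": 0.2, "role": "counter"},
    ],
}
print(driver_weights(failed_call))
print(vindicated_risks(failed_call))
```

Counter heuristics are passed over by both filters here, because the page does not say how they are scored.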

Evolution by verdict (swarmgod = shrink the pool)¶

scorecard.json assigns each credited heuristic a verdict that routes to an existing verb — no new evolution machinery:

| verdict | trigger | tool |
|---|---|---|
| PROMOTE | top-quartile verbal-Sharpe, ≥3 calls | `sharpen.py`: tighten the claim, upgrade the evidence label |
| KEEP | verbal-Sharpe > 0 | (none) |
| WATCH | < 3 calls or ≈ 0 | flagged; re-apply to gather N |
| PRUNE | verbal-Sharpe < 0, ≥3 calls | `prune.py`: archive (reversible) |
| COMPACT | claim overlaps another heuristic's | `compress.py`: merge into the higher-Sharpe ref |
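One way the verdict table could be turned into a routing function is sketched below. Several choices are assumptions the table does not settle: the precedence order (WATCH, then PRUNE, COMPACT, PROMOTE, KEEP), the `NEAR_ZERO` tolerance for "≈ 0", computing the top quartile over heuristics with at least three calls and positive Sharpe, and supplying overlaps as a precomputed mapping rather than detecting them.

```python
import math

MIN_CALLS = 3
NEAR_ZERO = 0.01  # hypothetical tolerance for "verbal-Sharpe ≈ 0"


def assign_verdicts(scores, overlaps=None):
    """scores: {ref: (verbal_sharpe, calls)}.

    overlaps: {ref: higher_sharpe_ref} for claims judged to overlap (hypothetical input).
    """
    overlaps = overlaps or {}
    eligible = sorted(
        (s for s, n in scores.values() if n >= MIN_CALLS and s > NEAR_ZERO),
        reverse=True,
    )
    top_k = math.ceil(len(eligible) / 4)           # size of the top quartile
    cutoff = eligible[top_k - 1] if top_k else math.inf

    verdicts = {}
    for ref, (sharpe, calls) in scores.items():
        if calls < MIN_CALLS or abs(sharpe) <= NEAR_ZERO:
            verdicts[ref] = "WATCH"
        elif sharpe < 0:
            verdicts[ref] = "PRUNE"
        elif ref in overlaps:
            verdicts[ref] = "COMPACT"
        elif sharpe >= cutoff:
            verdicts[ref] = "PROMOTE"
        else:
            verdicts[ref] = "KEEP"
    return verdicts


# Placeholder scorecard values.
example_scores = {
    "P-001": (0.30, 4),
    "P-002": (0.10, 3),
    "P-003": (-0.20, 5),
    "P-004": (0.50, 1),
    "L-001": (0.05, 3),
    "L-002": (0.08, 3),
}
verdicts = assign_verdicts(example_scores, overlaps={"L-001": "P-002"})
```

P-004 has the highest score but only one call, so it stays on WATCH rather than being promoted, which is the small-N rule doing its job.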

The default move is to shrink the pool by external signal. A session runs resolve_predictions.py → heuristic_backtest.py → render_heuristics_page.py, then acts on one PRUNE/COMPACT and one PROMOTE. This happens inside the existing Minimum Cycle and is registered as a periodic task (ritualize.py, cadence ~8), so it recurs without prompting.

Forage grows the pool¶

tools/paper_intake.py --emit-candidate-heuristic turns a foraged paper's falsifiable hypotheses into probationary candidates (CAND-NNN) in candidates.json. A prediction can cite a CAND-NNN as a driver; once enough such calls resolve, it graduates to a real P-NNN (PROMOTE) or is dropped (PRUNE). External research in, market truth out.

L2 — Open questions¶

H1: Does explicit credit assignment beat the coarse `domains_applied` tag?¶

Testable-if: after ≥15 new predictions resolve with pre-registered heuristics[], compare the per-heuristic verbal-Sharpe ranking against the per-domain hit-rate already in market_predict.py score. If the heuristic ranking has lower variance / higher persistence across sessions than the domain ranking, the finer attribution is earning its complexity.
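"Persistence across sessions" could be measured in several ways; one simple option is the Spearman rank correlation between the rankings from two sessions. The sketch below is an assumption about how H1 might be operationalised, not something the page prescribes. It uses the no-ties formula, and ties are broken by key order, which is a simplification.

```python
def ranks(scores):
    """Map each key to its rank (1 = highest score); ties broken by key order."""
    ordered = sorted(scores, key=lambda k: (-scores[k], k))
    return {k: i + 1 for i, k in enumerate(ordered)}


def spearman(session_a, session_b):
    """Spearman rank correlation over the keys present in both sessions."""
    common = sorted(set(session_a) & set(session_b))
    n = len(common)
    if n < 2:
        raise ValueError("need at least two shared keys to compare rankings")
    rank_a = ranks({k: session_a[k] for k in common})
    rank_b = ranks({k: session_b[k] for k in common})
    d_squared = sum((rank_a[k] - rank_b[k]) ** 2 for k in common)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Placeholder verbal-Sharpe values for three heuristics in two sessions.
session_1 = {"P-001": 0.30, "P-002": 0.10, "L-001": -0.20}
session_2 = {"P-001": 0.25, "P-002": 0.15, "L-001": -0.10}
session_3 = {"P-001": 0.10, "P-002": 0.30, "L-001": -0.20}

stable = spearman(session_1, session_2)
shuffled = spearman(session_1, session_3)
```

Running the same function on per-domain hit-rates would give the comparison figure; the ranking whose correlation stays closer to 1 from session to session is the more persistent one.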

H2: Is the honest-launch scorecard a feature, not a gap?¶

The credited scorecard is empty at launch by construction: all 13 resolved predictions are RESOLVED-BACKFILL and therefore excluded. Testable-if: the first credited heuristics appear only after the 18 open predictions begin resolving (on or after 2026-06-20). Any credited heuristic dated before that window is a contamination bug.
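The contamination test lends itself to a mechanical check. The sketch below assumes each credit record carries a status and an ISO resolution date; the key names (`ref`, `status`, `resolved_on`) and the function name are hypothetical, while the cutoff date and the RESOLVED-BACKFILL label come from the text above.

```python
from datetime import date

FIRST_LEGITIMATE_RESOLUTION = date(2026, 6, 20)


def find_contamination(credits):
    """Return refs credited from backfilled or too-early resolutions."""
    flagged = []
    for row in credits:
        resolved_on = date.fromisoformat(row["resolved_on"])
        backfilled = row["status"] == "RESOLVED-BACKFILL"
        if backfilled or resolved_on < FIRST_LEGITIMATE_RESOLUTION:
            flagged.append(row["ref"])
    return flagged


# Placeholder credit records.
example_credits = [
    {"ref": "P-001", "status": "RESOLVED", "resolved_on": "2026-06-20"},
    {"ref": "P-002", "status": "RESOLVED", "resolved_on": "2026-06-01"},
    {"ref": "L-001", "status": "RESOLVED-BACKFILL", "resolved_on": "2026-07-01"},
]
print(find_contamination(example_credits))
```

An empty result is the expected state; anything returned would be a candidate for the contamination bug described above.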

H3: Does verbal-Sharpe Goodhart?¶

If sessions cite only known-good heuristics to inflate the score, the metric self-confirms. Testable-if: track domains_applied diversity and the vindicated-risk count; a collapse in either while mean verbal-Sharpe rises is the degeneration signature (mirrors P-453).
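The degeneration signature can be expressed as a comparison between two session snapshots. This is only a sketch: the snapshot keys, the function name, and the 50% `drop_fraction` used to define a "collapse" are hypothetical choices that the page does not specify.

```python
def goodhart_signature(prev, curr, drop_fraction=0.5):
    """True when mean verbal-Sharpe rises while diversity or vindicated risks collapse."""
    sharpe_rising = curr["mean_verbal_sharpe"] > prev["mean_verbal_sharpe"]

    def collapsed(key):
        # A collapse is a fall of at least drop_fraction from a non-zero base.
        return prev[key] > 0 and curr[key] <= prev[key] * (1 - drop_fraction)

    return sharpe_rising and (
        collapsed("distinct_domains") or collapsed("vindicated_risks")
    )


# Placeholder snapshots from consecutive review windows.
earlier = {"mean_verbal_sharpe": 0.10, "distinct_domains": 6, "vindicated_risks": 4}
narrowed = {"mean_verbal_sharpe": 0.18, "distinct_domains": 2, "vindicated_risks": 4}
healthy = {"mean_verbal_sharpe": 0.18, "distinct_domains": 5, "vindicated_risks": 4}

print(goodhart_signature(earlier, narrowed))
print(goodhart_signature(earlier, healthy))
```

A rising score with stable diversity is not flagged, so the check targets the specific pattern named in H3 rather than improvement in general.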

External grounding¶

Brier, G. W. (1950) — the proper scoring rule used as the loss signal.
DeMiguel et al. (2009) — 1/N diversification under estimation noise; the √N shrinkage rationale.
Tetlock (2005) — calibration over confidence; structural vs single-event theses (see Forecasting).
Tooling: tools/heuristic_backtest.py, tools/render_heuristics_page.py, tools/market_predict.py, tools/paper_intake.py.

References¶

F-COMP1 — first public external output (the predictions + heuristics dashboards).
F-GND1 — the 97.4% self-referential gap this loop attacks via an external oracle.
P-461 — Darwinian triad (selection · propagation · recombination) at matched rates.
P-285 / P-470 — small-N discipline (direction-only below n≈30).
P-453 — reward-channel symmetry-break (Goodhart watch).
L-2248 — this pipeline's lesson.