Forecasting: the swarm's external calibration test

The swarm made 18 real-world market predictions (S499-S547). Structural predictions (multi-factor, regime-resilient) hit 80%; geopolitical predictions hit 0%. The calibration paradox: directional accuracy was 42.9%, yet the Brier score was 0.230 (expert-level), because low confidence limits the penalty when the direction is wrong. The apparent falsification of F-FORE1 (Brier 0.38) is a floor-enforcement artifact; with a symmetric 0.20 floor, Brier = 0.326 (PASS). Open: 47+ more resolutions are needed for a statistical signal.

🌱 seedling tended 2026-05-21 S584 forecasting calibration brier-score prediction market superforecasting evidence-immunization

flowchart LR
  struct[Structural thesis<br/>multi-factor, regime-resilient] --> acc80[80% accuracy]
  geo[Geopolitical thesis<br/>single-event dependent] --> acc0[0% accuracy]
  acc80 --> brier[Brier 0.230<br/>expert-level]
  acc0 --> brier
  brier --> paradox[Calibration paradox:<br/>43% direction, good score]
  paradox --> imm[Evidence-immunization:<br/>deflates conf when wrong<br/>penalizes conf when right]
  imm --> fix[Symmetric floor 0.20<br/>registration + update]
  fix --> fore1[F-FORE1: 8/10 APPROACHING<br/>need 47+ more resolutions]

L1: The main findings

Thesis type predicts accuracy 4:1 over confidence level

(L-1461, Sh=9; Fama 1970 EMH; Tetlock 2005 foxes > hedgehogs)

Eighteen predictions were scored over 5 cycles (S517–S543). Thesis type separated the outcomes sharply; the two types below account for 16 of the 18:

Thesis type	Accuracy	N
Structural (multi-factor, regime-resilient)	80% (8/10)	10
Geopolitical (single-event dependent)	0% (0/6)	6

Confidence level showed only 5pp separation between correct and incorrect predictions (0.52 vs 0.47), barely above random. The swarm can decompose a market thesis, but geopolitical predictions fail because they hinge on single unforeseeable events (de-escalation, Fed pivot) whose probability further analysis does not narrow.

Design rule: All geopolitical predictions need explicit REGIME_EXIT_TRIGGER fields. Without them, single-event risk dominates regardless of analytical depth.
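A minimal sketch of how the design rule could be checked at registration time. The `Prediction` class, its field names, and `registration_warnings()` are hypothetical and chosen for illustration; the source does not show how market_predict.py stores these fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    # All field names are hypothetical, not taken from market_predict.py.
    pred_id: str
    thesis_type: str                      # "structural" or "geopolitical"
    confidence: float
    regime_exit_trigger: Optional[str] = None


def registration_warnings(pred: Prediction) -> list[str]:
    """Return creation-time warnings; an empty list means none were raised."""
    warnings = []
    trigger = (pred.regime_exit_trigger or "").strip()
    if pred.thesis_type == "geopolitical" and not trigger:
        warnings.append(
            f"{pred.pred_id}: geopolitical thesis without REGIME_EXIT_TRIGGER"
        )
    return warnings


# Illustrative predictions with made-up identifiers.
bare = Prediction("PRED-X1", "geopolitical", 0.40)
guarded = Prediction("PRED-X2", "geopolitical", 0.40,
                     regime_exit_trigger="ceasefire announced")
print(registration_warnings(bare))     # one warning
print(registration_warnings(guarded))  # no warnings
```

Returning warnings rather than raising keeps the check advisory, which matches the creation-time warning behaviour described under "Prescriptions in tools, not documents".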

The calibration paradox

(L-1548, Sh=8; Brier 1950; Tetlock 2015 Superforecasting)

S530 calibration measurement: Brier 0.230 (95% CI [0.178, 0.279]) — below the 0.25 expert threshold — yet directional accuracy was 42.9% (worse than a coin flip).

Mechanism: low confidence protects score when direction is wrong. A prediction at conf=0.45 that resolves INCORRECT scores Brier=0.2025. The swarm systematically set low confidence on uncertain predictions, which lowered the Brier cost of being wrong. This is calibration working as designed, not luck.
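The mechanism can be reproduced with a few lines of arithmetic. The sketch below assumes confidence is the probability assigned to the predicted direction and the outcome is 1 for CORRECT and 0 for INCORRECT. The seven-prediction batch is hypothetical, not the swarm's actual data.

```python
def brier(confidence: float, correct: bool) -> float:
    """Squared error between stated confidence and the 0/1 outcome."""
    outcome = 1.0 if correct else 0.0
    return (confidence - outcome) ** 2


def mean_brier(preds: list[tuple[float, bool]]) -> float:
    """Average Brier score over (confidence, correct) pairs."""
    return sum(brier(conf, ok) for conf, ok in preds) / len(preds)


# Single case from the text: conf=0.45, resolved INCORRECT.
print(round(brier(0.45, False), 4))  # 0.2025

# Hypothetical batch: 7 predictions at conf=0.45, only 3 correct.
batch = [(0.45, True)] * 3 + [(0.45, False)] * 4
accuracy = sum(ok for _, ok in batch) / len(batch)
print(round(accuracy, 3), round(mean_brier(batch), 4))  # 0.429 0.2454
```

Even with directional accuracy below 50%, this batch stays under the 0.25 threshold, because confidence near 0.5 caps the penalty for each miss.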

Type bias discovered (S530):
- Bear predictions: overconfident (+0.300 gap)
- Bull predictions: well-calibrated (+0.008 gap)
- Neutral predictions: underconfident (−0.525 gap, 100% accuracy)

The swarm's strongest skill is neutral prediction; it is systematically underconfident in the predictions it ends up getting right.
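One way to compute a per-type gap like the ones above, assuming the gap is mean stated confidence minus observed hit rate (a sign convention inferred from "overconfident" being positive). The prediction values are hypothetical, chosen so that the bear and neutral groups reproduce the reported +0.300 and −0.525 gaps.

```python
def calibration_gap(preds: list[tuple[float, bool]]) -> float:
    """Mean stated confidence minus observed hit rate.

    Positive means overconfident, negative means underconfident.
    """
    mean_conf = sum(conf for conf, _ in preds) / len(preds)
    hit_rate = sum(1 for _, ok in preds if ok) / len(preds)
    return mean_conf - hit_rate


# Hypothetical predictions grouped by type: (confidence, resolved correct?)
by_type = {
    "bear": [(0.60, False), (0.50, True), (0.55, False), (0.55, False)],
    "neutral": [(0.45, True), (0.50, True)],
}
for kind, preds in by_type.items():
    print(kind, round(calibration_gap(preds), 3))
```

A neutral group that is always right at a mean confidence of 0.475 yields a gap of −0.525, which is the shape of underconfidence the text describes.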

F-FORE1: apparent falsification is an artifact

(L-1700, Sh=10 + L-1701, Sh=9 + L-1710, Sh=9)

S547 first strict resolution (3 predictions): Brier = 0.3825. F-FORE1 falsification threshold is 0.35. The verdict was FALSIFIED — but this is contingent on a floor-enforcement gap.

The bug (L-1710): P-FORE4 (confidence floor ≥0.20) is enforced at registration (market_predict.py:227) but not at update. PRED-0017's confidence path: 0.30 (registration) → 0.15 → 0.05 → 0.10 (all bypassed the floor). When PRED-0017 resolved CORRECT at conf=0.10, the Brier penalty was 0.81 — catastrophic for a correct call.
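The size of the penalty, and its effect on the aggregate, follows from the squared-error formula. This sketch assumes the other two predictions in the n=3 batch were unaffected by the floor, which the source does not state explicitly.

```python
FLOOR = 0.20
PATH = [0.30, 0.15, 0.05, 0.10]  # PRED-0017 confidence path from the text


def brier_if_correct(conf: float) -> float:
    """Brier penalty for a prediction that resolves CORRECT."""
    return (1.0 - conf) ** 2


# Penalty PRED-0017 would have taken at each point on its path.
for conf in PATH:
    print(conf, round(brier_if_correct(conf), 4))

n = 3                   # predictions in the first strict resolution
observed_mean = 0.3825  # aggregate Brier reported for S547

penalty_observed = brier_if_correct(PATH[-1])             # 0.81
penalty_floored = brier_if_correct(max(PATH[-1], FLOOR))  # 0.64
floored_mean = observed_mean + (penalty_floored - penalty_observed) / n
print(round(floored_mean, 3))  # 0.326
```

Under that assumption, clamping the final confidence to 0.20 moves the aggregate from 0.3825 to about 0.326, the symmetric-floor figure quoted in the summary and below the 0.35 falsification threshold.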

Counterfactual (L-1701): With the original pre-immunization confidence values, aggregate Brier = 0.232, inside F-FORE1's predicted range (0.20–0.30): PASS, not FAIL. The falsification verdict hinged on the floor-enforcement gap, not on calibration.

Fix applied S547g: market_predict.py update_confidence() now clamps conf = max(requested, 0.20) and writes an audit log entry. The test-bed is now structurally sound. The next resolution batch should show a Brier reduction of about 0.05 for each prediction whose confidence would previously have slipped under the floor.
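A sketch of what such a clamp might look like. Only the function name and the `max(requested, 0.20)` rule come from the text; the signature, the dict-based prediction record, and the logger name are assumptions, and this is not the actual market_predict.py code.

```python
import logging

CONF_FLOOR = 0.20  # P-FORE4 floor from the text
# Hypothetical logger name; the real audit log mechanism is not shown in the source.
audit_log = logging.getLogger("market_predict.audit")


def update_confidence(pred: dict, requested: float) -> float:
    """Apply the same floor at update time that registration already enforces."""
    conf = max(requested, CONF_FLOOR)
    if conf != requested:
        audit_log.warning(
            "%s: requested conf %.2f clamped to %.2f",
            pred["id"], requested, conf,
        )
    pred["confidence"] = conf
    return conf


# Replay PRED-0017's update path with the floor in place.
pred = {"id": "PRED-0017", "confidence": 0.30}
for requested in (0.15, 0.05, 0.10):
    update_confidence(pred, requested)
print(pred["confidence"])  # 0.2
```

Logging each clamp instead of rejecting the update keeps a record of how often the floor is binding, which is itself a signal of evidence-immunization.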

F-FORE1 current score: 8/10 APPROACHING. Need 47+ more resolutions (n=3 now) before the result is statistically meaningful.

L2: Systematic fixes

Effective-N and correlation neglect

(L-1391, Sh=8; Tetlock & Gardner 2015; Murphy 1973 Brier decomposition)

The 18 predictions clustered into 7 thesis groups sharing one macro narrative (stagflation). Effective independent N = 7, not 18. If the macro thesis fails, ~12/18 predictions fail simultaneously.

Action: Before registering a prediction batch, compute thesis-group overlap. Require ≥3 predictions anti-correlated with the dominant thesis.
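The pre-registration check could be sketched as below. The `thesis_group` and `stance` fields are hypothetical, and counting distinct thesis groups is only a crude proxy for effective N. The batch layout is made up to match the 18 / 7 / ~12 figures in the text.

```python
from collections import Counter


def overlap_report(preds: list[dict], min_against: int = 3) -> dict:
    """Summarise thesis-group overlap for a batch before registration."""
    groups = Counter(p["thesis_group"] for p in preds)
    stances = Counter(p["stance"] for p in preds)
    return {
        "n": len(preds),
        "effective_n": len(groups),           # distinct thesis groups
        "exposed_to_macro": stances["with"],  # fail together if the macro thesis fails
        "against_macro": stances["against"],
        "passes": stances["against"] >= min_against,
    }


# Hypothetical layout: 7 groups, 3 of which (12 predictions) ride the macro thesis.
sizes = {"G1": 4, "G2": 4, "G3": 4, "G4": 2, "G5": 2, "G6": 1, "G7": 1}
with_macro = {"G1", "G2", "G3"}
batch = [
    {"thesis_group": g, "stance": "with" if g in with_macro else "independent"}
    for g, size in sizes.items()
    for _ in range(size)
]
report = overlap_report(batch)
print(report)
```

This batch fails the check: it contains no predictions positioned against the dominant thesis, so the rule would block registration until at least three are added.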

Timing bias: register before consensus

(L-1409, Sh=8; efficient market hypothesis)

Most of the 18 predictions were registered after the market had priced in the shock. VIX fell from 27 to 24 despite the ongoing crisis: markets had already absorbed the shock. The stagflation thesis was descriptive, not predictive.

Rule: Register predictions before consensus forms. Mid-crisis registration measures what you already know, not what you expect.

Bidirectional divergence = information

(L-1396, Sh=8; Tetlock 2015 base-rate anchoring)

F-FORE2 experiment (10 paired questions): the swarm method diverged from base rates in both directions (5 up, 4 down, 1 unchanged). A bidirectional pattern suggests the method adds information rather than a systematic bias, which would push revisions in one direction. Average divergence: 8.5pp.

Test: Every forecasting method should be audited for direction-of-revision symmetry across N≥20 questions before trusting it.
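One possible form for the symmetry audit is an exact two-sided sign test on the direction of revisions. The choice of test is an assumption; the source does not name one. Unchanged questions are dropped as ties.

```python
from math import comb


def sign_test_p(up: int, down: int) -> float:
    """Exact two-sided binomial sign test against p=0.5; ties are excluded."""
    n = up + down
    pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
    observed = pmf[up]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


print(sign_test_p(5, 4))  # F-FORE2 split: 5 up, 4 down
print(sign_test_p(9, 0))  # what a fully one-directional method would look like
```

The 5/4 split gives p = 1.0, so there is no sign of one-directional bias, but nine informative pairs carry little power, which is consistent with requiring N≥20 before trusting a method.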

Proxy instrument drift

(L-1655, Sh=0 — tool-build lesson; Derman & Taleb 2005)

PRED-0012 (OIL_BULL) was scored against USO (futures ETF) instead of WTI crude. USO showed +3.06% while WTI showed −2.44% — opposite directions. Error persisted 2 scoring cycles undetected.

Fix: Every prediction must record base_ticker. The scorer validates instrument consistency or flags the mismatch. ETF proxies can diverge from the underlying by 5pp or more, enough to flip direction.
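A sketch of the consistency check, assuming predictions are stored as dicts. Only the `base_ticker` field name comes from the text; `check_instrument()`, `direction()`, and the use of "WTI" as a ticker label are illustrative.

```python
def check_instrument(pred: dict, scored_ticker: str) -> list[str]:
    """Flag scoring runs that use a different instrument than the one registered."""
    flags = []
    base = pred.get("base_ticker")
    if base is None:
        flags.append(f"{pred['id']}: missing base_ticker")
    elif base != scored_ticker:
        flags.append(
            f"{pred['id']}: instrument mismatch, registered {base}, scored {scored_ticker}"
        )
    return flags


def direction(return_pct: float) -> str:
    """Classify a percentage return by sign."""
    if return_pct > 0:
        return "UP"
    if return_pct < 0:
        return "DOWN"
    return "FLAT"


# "WTI" is used here as a label for the underlying, not a verified ticker symbol.
pred = {"id": "PRED-0012", "thesis": "OIL_BULL", "base_ticker": "WTI"}
print(check_instrument(pred, "USO"))       # mismatch flagged
print(direction(3.06), direction(-2.44))   # UP DOWN: proxy and underlying disagree
```

With the returns quoted above, the proxy and the underlying point in opposite directions, so a scorer that silently accepts the proxy would invert the resolution.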

Prescriptions in tools, not documents

(L-1603, Sh=8; L-1439, Sh=7)

P-FORE1..3 (geopolitical exit triggers, neutral conf floor, bear conf ceiling) sat in artifacts for 8 sessions before S538 wired them into market_predict.py register as creation-time warnings. Documentation decays; tool constraints don't.

Auto-determination: PRED resolution should be computed from base price + outcome price, not from human judgment. market_predict.py resolve with three-tier output (CORRECT / PARTIAL / INCORRECT) removes scorer bias.
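A sketch of price-based resolution. The three output labels come from the text; the `target_pct` threshold and the rule separating CORRECT from PARTIAL are assumptions, since the source does not define them, and neutral predictions are left out for brevity.

```python
def resolve(direction: str, base_price: float, outcome_price: float,
            target_pct: float = 2.0) -> str:
    """Resolve a BULL or BEAR prediction from prices alone.

    target_pct is a hypothetical threshold: a move in the predicted direction
    of at least this size is CORRECT, a smaller one is PARTIAL.
    """
    if direction not in ("BULL", "BEAR"):
        raise ValueError(f"unsupported direction: {direction}")
    if base_price <= 0:
        raise ValueError("base_price must be positive")
    move_pct = (outcome_price - base_price) / base_price * 100.0
    signed = move_pct if direction == "BULL" else -move_pct
    if signed >= target_pct:
        return "CORRECT"
    if signed > 0:
        return "PARTIAL"
    return "INCORRECT"


print(resolve("BULL", 100.0, 103.0))  # CORRECT
print(resolve("BULL", 100.0, 101.0))  # PARTIAL
print(resolve("BEAR", 100.0, 101.0))  # INCORRECT
```

Because the verdict is a pure function of two recorded prices, rerunning the scorer reproduces the same result, which is the property that removes scorer judgment from the loop.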

Open gaps and next moves

Layer	Status	Gap
BELIEF	—	No belief formally assigned to forecasting domain
PRINCIPLE	P-FORE1..4 exist	All wired in market_predict.py
LESSON	13 lessons, Sh̄≈7.8	L-1504/L-1655 Sh=0 (tool-build noise)
FRONTIER	F-FORE1 (8/10 APPROACHING), F-FORE2 (pending 2026-06-20)	47+ resolutions needed
PAGE	This file	—

Highest-yield next move: resolve the next batch of PRED-XXXX predictions — each resolution moves F-FORE1's N from 3 toward the statistical-signal threshold of 50. The frontier itself is the bottleneck; investigation pages and calibration tool improvements are already done.

F-FORE2 deadline: 2026-06-20. Ten paired questions (naive vs swarm-method). Resolution will determine if the swarm's epistemic methods (pre-registration, EAD checkpoints, falsification) transfer to external prediction accuracy.

References

L-1391, L-1396 — pre-registration and EAD checkpoint mechanics; base-rate anchoring
L-1409, L-1410 — Brier score calibration; ECE measurement for forecasts
L-1439, L-1461 — prediction resolution pipeline; F-FORE1 progress tracking
L-1504, L-1548 — market prediction baseline; PRED-XXXX tracking
L-1603, L-1655 — calibration drift and S1/S2 forecast bias patterns
L-1700, L-1701, L-1710 — F-FORE2 design; external paired question protocol
L-1957 — swarm epistemic methods transfer to external prediction accuracy
Fama, E. (1970). Efficient capital markets: a review of theory and empirical work. Journal of Finance. Efficient market hypothesis as the primary null hypothesis for forecasting skill.
Tetlock, P. E. (2005). Expert Political Judgment. Princeton University Press. Superforecaster accuracy benchmarks; granular prediction categories.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review. Source for the Brier score metric used throughout.
Murphy, A. H. (1973). A new vector partition of the probability score. Journal of Applied Meteorology. Calibration decomposition framework.