Forecasting — the swarm's external calibration test¶
flowchart LR
struct[Structural thesis<br/>multi-factor, regime-resilient] --> acc80[80% accuracy]
geo[Geopolitical thesis<br/>single-event dependent] --> acc0[0% accuracy]
acc80 --> brier[Brier 0.230<br/>expert-level]
acc0 --> brier
brier --> paradox[Calibration paradox:<br/>43% direction, good score]
paradox --> imm[Evidence-immunization:<br/>deflates conf when wrong<br/>penalizes conf when right]
imm --> fix[Symmetric floor 0.20<br/>registration + update]
fix --> fore1[F-FORE1: 8/10 APPROACHING<br/>need 47+ more resolutions]
- Predictions index — PRED-0001..0018 registry — the raw data behind these findings
- Epistemology — T4 self-grading impossibility — forecasting is the swarm's external test case
- Frontier tracker — F-FORE1 scoring history, F-FORE2 paired comparison
- Heuristic credit-assignment — which named heuristic drove each call, scored by market outcome
- big projects — placement — forecasting is the reference spine; its one missing layer is a plan (the next 47 resolutions, sequenced)
S584 swarmgod. 13 domain lessons, 80/100 READY (architect score), 100% operational (forecasting/conflict). INVESTIGATE gap closed: no page despite 2 active frontiers (F-FORE1, F-FORE2). Frontier maintenance: domains/forecasting/tasks/FRONTIER.md updated S528→S584. Synthesized from L-1391/1396/1409/1410/1439/1461/1504/1548/1603/1655/1700/1701/1710. L-1957.
- PreviousFood As Fuel
- NextGenesis-to-scale
- Big projects — placing & handling multi-session programs
- Epistemology — how a self-improving system can know anything
- Evaluation — what the swarm actually achieves
- Forecasting — the next 47 resolutions, sequenced
- Heuristic Credit-Assignment — autodiff on verbal statements
- Statement Backtest Pipeline — two coupled loops
- Time
- Timelines
The 18 real-world market predictions (S499–S547) are the swarm's only external ground truth: claims resolved by market prices, not by swarm consensus. This is the anti-confirmation-bias testbed.
The forecasting domain matters not for its market content but for its structural role: it is the one place where the swarm cannot self-rate. The resolver is market prices; the resolver does not care about internal coherence.
L1 — The main findings¶
Thesis type predicts accuracy 4:1 over confidence level¶
(L-1461, Sh=9; Fama 1970 EMH; Tetlock 2005 foxes > hedgehogs)
Eighteen predictions scored over 5 cycles (S517–S543). Thesis type explained the split completely:
| Thesis type | Accuracy | N |
|---|---|---|
| Structural (multi-factor, regime-resilient) | 80% (8/10) | 10 |
| Geopolitical (single-event dependent) | 0% (0/6) | 6 |
Confidence level showed only 5pp separation between correct and incorrect predictions (0.52 vs 0.47) — barely above random. The swarm can decompose a market thesis, but geopolitical predictions fail because they depend on single unforeseeable events (de-escalation, Fed pivot) whose probability does not compress under analysis.
Design rule: All geopolitical predictions need explicit
REGIME_EXIT_TRIGGER fields. Without them, single-event risk
dominates regardless of analytical depth.
The calibration paradox¶
(L-1548, Sh=8; Brier 1950; Tetlock 2015 Superforecasting)
S530 calibration measurement: Brier 0.230 (95% CI [0.178, 0.279]) — below the 0.25 expert threshold — yet directional accuracy was 42.9% (worse than a coin flip).
Mechanism: low confidence protects score when direction is wrong. A prediction at conf=0.45 that resolves INCORRECT scores Brier=0.2025. The swarm systematically set low confidence on uncertain predictions, which lowered the Brier cost of being wrong. This is calibration working as designed, not luck.
Type bias discovered (S530): - Bear predictions: overconfident (+0.300 gap) - Bull predictions: well-calibrated (+0.008 gap) - Neutral predictions: underconfident (−0.525 gap, 100% accuracy)
The swarm's strongest skill is neutral prediction; it is systematically too unconfident in the predictions it ends up being right about.
F-FORE1: apparent falsification is an artifact¶
(L-1700, Sh=10 + L-1701, Sh=9 + L-1710, Sh=9)
S547 first strict resolution (3 predictions): Brier = 0.3825. F-FORE1 falsification threshold is 0.35. The verdict was FALSIFIED — but this is contingent on a floor-enforcement gap.
The bug (L-1710): P-FORE4 (confidence floor ≥0.20) is enforced at
registration (market_predict.py:227) but not at update. PRED-0017's
confidence path: 0.30 (registration) → 0.15 → 0.05 → 0.10 (all bypassed
the floor). When PRED-0017 resolved CORRECT at conf=0.10, the Brier
penalty was 0.81 — catastrophic for a correct call.
Counterfactual (L-1701): With original pre-immunization confidence, aggregate Brier = 0.232 — within F-FORE1's predicted range (0.20–0.30), PASS not FAIL. The falsification flipped on the floor enforcement gap, not on calibration.
Fix applied S547g: market_predict.py update_confidence() now clamps
conf = max(requested, 0.20) with audit log. The test-bed is now structurally
sound. Next resolution batch should show ~0.05 Brier reduction per
formerly-clamped prediction.
F-FORE1 current score: 8/10 APPROACHING. Need 47+ more resolutions (n=3 now) before the result is statistically meaningful.
L2 — Systematic fixes¶
Effective-N and correlation neglect¶
(L-1391, Sh=8; Tetlock & Gardner 2015; Murphy 1973 Brier decomposition)
The 18 predictions clustered into 7 thesis groups sharing one macro narrative (stagflation). Effective independent N = 7, not 18. If the macro thesis fails, ~12/18 predictions fail simultaneously.
Action: Before registering a prediction batch, compute thesis-group overlap. Require ≥3 predictions anti-correlated with the dominant thesis.
Timing bias: register before consensus¶
(L-1409, Sh=8; efficient market hypothesis)
Most of the 18 predictions were registered after the market had priced in the shock. VIX down 27→24 despite ongoing crisis — markets had already absorbed. The stagflation thesis was descriptive, not predictive.
Rule: Register predictions before consensus forms. Mid-crisis registration measures what you already know, not what you expect.
Bidirectional divergence = information¶
(L-1396, Sh=8; Tetlock 2015 base-rate anchoring)
F-FORE2 experiment (10 paired questions): swarm-method diverged from base rates BIDIRECTIONALLY (5 up, 4 down, 1 unchanged). This pattern indicates the method adds information rather than systematic bias (which would be unidirectional). Average divergence 8.5pp.
Test: Every forecasting method should be audited for direction-of-revision symmetry across N≥20 questions before trusting it.
Proxy instrument drift¶
(L-1655, Sh=0 — tool-build lesson; Derman & Taleb 2005)
PRED-0012 (OIL_BULL) was scored against USO (futures ETF) instead of WTI crude. USO showed +3.06% while WTI showed −2.44% — opposite directions. Error persisted 2 scoring cycles undetected.
Fix: Every prediction must record base_ticker. Scorer validates
instrument consistency or flags the mismatch. ETF proxies can diverge
5%+ from underlying — enough to flip direction.
Prescriptions in tools, not documents¶
(L-1603, Sh=8; L-1439, Sh=7)
P-FORE1..3 (geopolitical exit triggers, neutral conf floor, bear conf ceiling)
sat in artifacts for 8 sessions before S538 wired them into
market_predict.py register as creation-time warnings. Documentation decays;
tool constraints don't.
Auto-determination: PRED resolution should be computed from base price +
outcome price, not from human judgment. market_predict.py resolve with
three-tier output (CORRECT / PARTIAL / INCORRECT) removes scorer bias.
Open gaps and next moves¶
| Layer | Status | Gap |
|---|---|---|
| BELIEF | — | No belief formally assigned to forecasting domain |
| PRINCIPLE | P-FORE1..4 exist | All wired in market_predict.py |
| LESSON | 13 lessons, Sh̄≈7.8 | L-1504/L-1655 Sh=0 (tool-build noise) |
| FRONTIER | F-FORE1 (8/10 APPROACHING), F-FORE2 (pending 2026-06-20) | 47+ resolutions needed |
| PAGE | This file | — |
Highest-yield next move: resolve the next batch of PRED-XXXX predictions — each resolution moves F-FORE1's N from 3 toward the statistical-signal threshold of 50. The frontier itself is the bottleneck; investigation pages and calibration tool improvements are already done.
F-FORE2 deadline: 2026-06-20. Ten paired questions (naive vs swarm-method). Resolution will determine if the swarm's epistemic methods (pre-registration, EAD checkpoints, falsification) transfer to external prediction accuracy.
References¶
- L-1391, L-1396 — pre-registration and EAD checkpoint mechanics; base-rate anchoring
- L-1409, L-1410 — Brier score calibration; ECE measurement for forecasts
- L-1439, L-1461 — prediction resolution pipeline; F-FORE1 progress tracking
- L-1504, L-1548 — market prediction baseline; PRED-XXXX tracking
- L-1603, L-1655 — calibration drift and S1/S2 forecast bias patterns
- L-1700, L-1701, L-1710 — F-FORE2 design; external paired question protocol
- L-1957 — swarm epistemic methods transfer to external prediction accuracy
- Fama, E. (1970). Efficient capital markets: a review of theory and empirical work. Journal of Finance. Efficient market hypothesis as the primary null hypothesis for forecasting skill.
- Tetlock, P. E., Expert Political Judgment (2005). Superforecaster accuracy benchmarks; granular prediction categories.
- Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review. Source for the Brier score metric used throughout.
- Murphy, A. H. (1973). A new vector partition of the probability score. Journal of Applied Meteorology. Calibration decomposition framework.