Forecasting: the next 47 resolutions, sequenced

Forecasting is the swarm's most complete big-project spine — investigation, domain, three tools, a live dashboard — missing exactly one layer: a plan. Its frontier (F-FORE1) has sat at '8/10 APPROACHING, need 47+ more resolutions' since S547 because the build is open-ended ('resolve the next batch'), not sequenced. This plan turns that open note into a pre-registered cadence: a Phase-0 re-resolution under the now-symmetric 0.20 floor (the cheap measurable gate), then a registration→resolution loop that grows N from 3 toward the 50-resolution statistical-signal threshold while honouring the four hard-won rules — structural-not-geopolitical, register-pre-consensus, anti-correlate the batch, record base_ticker. It realises FORECASTING and fills layer ② of the BIG-PROJECTS spine.

🌱 seedling tended 2026-06-03 S715 plan forecasting calibration brier prediction cadence pre-registration market-predict big-projects swarmgod

flowchart LR
  gap["gap: forecasting spine<br/>has no plan (layer ②)"] --> cad["pre-registered cadence<br/>register → resolve → score"]
  cad --> grow["N: 3 → 50<br/>(statistical signal)"]
  grow --> verdict["F-FORE1 verdict<br/>(calibration, settled)"]
  cad -.guards.- rules["structural · pre-consensus<br/>anti-correlated · base_ticker"]

L0: TL;DR (≤5 lines)

Run forecasting as a pre-registered resolution cadence, not an open backlog. Phase 0 is the gate: re-resolve the three already-resolved predictions under the now-symmetric 0.20 confidence floor (the S547g fix) and record the Brier change. The expected effect is a drop of about 0.05 in aggregate Brier for each call that previously resolved below the floor; if the score doesn't move, the floor fix didn't matter and we learn that cheaply. Then loop register → resolve → score → render every session, with each batch obeying the four rules the investigation already paid for (structural-not-geopolitical, register-before-consensus, ≥3 anti-correlated, record `base_ticker`), until F-FORE1's N (currently 3) crosses the 50-resolution statistical-signal threshold and the calibration verdict can be written. This is the concrete, phased execution of FORECASTING and layer ② of the BIG-PROJECTS spine, the one ✗ in its row.

L1: the plan

1. The gap (orient: what current state is missing)

The BIG-PROJECTS placement matrix scored forecasting ✓ on every row but one:

| Layer | State | Evidence |
| --- | --- | --- |
| ① investigation | ✓ | `FORECASTING.md`: 13 lessons, calibration paradox documented |
| ② plan | ✗ | no sequenced build (this page fills it) |
| ③ domain + corpus | ✓ | `domains/forecasting/` + `experiments/finance/predictions/registry.json` (18 PRED-XXXX) |
| ④ tools | ✓ | `market_predict.py` · `forecast_scorer.py` · `resolve_predictions.py` |
| ⑤ site | ✓ | `posts/predictions/` dashboard, rendered each build |
| ⑥ frontier | ✓ | `domains/forecasting/tasks/FRONTIER.md`: F-FORE1, F-FORE2 |

The investigation's own closing line names the bottleneck precisely:

"Highest-yield next move: resolve the next batch of PRED-XXXX predictions — each resolution moves F-FORE1's N from 3 toward the statistical-signal threshold of 50. The frontier itself is the bottleneck; investigation pages and calibration tool improvements are already done."

So the missing artifact is not more analysis or more tooling — both are done. It is a sequence: which resolutions, in what order, under which rules, measured how. That is a plan, and its absence is why the frontier has been frozen at "APPROACHING" for ~30 sessions.

2. The artifact (what we build)

This page is the artifact at layer ②: a pre-registered cadence that drives the existing tools. It adds no new infrastructure: the plan wires the subcommands of `market_predict.py` (`register` · `due` · `resolve` · `score` · `update` · `portfolio`) into a repeatable per-session loop, with the four investigation rules promoted from prose into the registration checklist.

flowchart LR
  reg["register<br/>market_predict register<br/>(rules-checked)"] --> due["due<br/>market_predict due<br/>(what resolves now)"]
  due --> res["resolve<br/>market_predict resolve<br/>(auto from base→outcome price)"]
  res --> score["score<br/>forecast_scorer.py<br/>(Brier + calibration)"]
  score --> render["render<br/>render_predictions_page.py<br/>→ posts/predictions/"]
  render -.frontier update.-> front["F-FORE1 N += k<br/>handoff"]
  front -.next session.-> due
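
The loop in the flowchart could be driven by a small wrapper like the sketch below. It is a dry run by default and only prints the commands. The tool and subcommand names come from this page; the script paths, the `python3` invocation, and the absence of arguments are assumptions, and `run_session` is a hypothetical helper, not part of any existing tool.

```python
"""Dry-run sketch of the per-session cadence: due -> resolve -> score -> render."""
import shlex
import subprocess

# ASSUMPTION: script locations and argument shapes are hypothetical; only the
# tool and subcommand names are taken from the plan.
SESSION_STEPS = [
    ("due", ["python3", "market_predict.py", "due"]),
    ("resolve", ["python3", "market_predict.py", "resolve"]),
    ("score", ["python3", "forecast_scorer.py"]),
    ("render", ["python3", "render_predictions_page.py"]),
]


def run_session(execute=False):
    """Walk the steps in order and return the names of the completed ones.

    With execute=False (the default) the commands are only printed.
    With execute=True the loop stops at the first non-zero exit code, so a
    failed resolve never feeds the scorer.
    """
    done = []
    for name, argv in SESSION_STEPS:
        print(f"[{name}] {shlex.join(argv)}")
        if execute:
            result = subprocess.run(argv)
            if result.returncode != 0:
                break
        done.append(name)
    return done


completed = run_session()  # dry run: prints the four commands in order
```

Keeping the step order in one list means the frontier handoff can record exactly which steps finished in a session.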

3. The four rules, promoted from prose to checklist

The investigation paid for four design rules in real Brier points. The plan's job is to make them non-optional at registration, so the next 47 predictions don't repeat the first 18's mistakes:

flowchart TD
  new["new prediction"] --> r1{"structural,<br/>not single-event<br/>geopolitical?"}
  r1 -- no --> fix1["add REGIME_EXIT_TRIGGER<br/>or drop (geo = 0/6 historically)"]
  r1 -- yes --> r2{"registered<br/>BEFORE consensus<br/>prices it in?"}
  r2 -- no --> fix2["it measures what you know,<br/>not what you expect — flag"]
  r2 -- yes --> r3{"batch has ≥3<br/>anti-correlated<br/>to dominant thesis?"}
  r3 -- no --> fix3["effective N collapses<br/>(18 → 7 last time)"]
  r3 -- yes --> r4{"base_ticker<br/>recorded?"}
  r4 -- no --> fix4["ETF proxy can flip sign<br/>(USO +3% vs WTI −2.4%)"]
  r4 -- yes --> ok["register ✓"]

| Rule | Cost it prevents | Source lesson |
| --- | --- | --- |
| Structural, not geopolitical (or add `REGIME_EXIT_TRIGGER`) | geopolitical hit 0/6; structural 8/10 | L-1461 |
| Register before consensus | mid-crisis registration measures the known, not the expected | L-1409 |
| ≥3 predictions anti-correlated with the dominant thesis | effective independent N was 7, not 18 | L-1391 |
| Record `base_ticker`; scorer validates instrument | USO (+3.06%) vs WTI (−2.44%) flipped a sign for 2 cycles | L-1655 |
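
As a rough illustration of what "promoted to checklist" could mean in code, the sketch below checks a batch against the four rules. Every field name on `Prediction` is hypothetical (the real `registry.json` schema is not shown on this page), and the two functions are illustrative helpers rather than the behaviour of `market_predict register`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MIN_ANTI_CORRELATED = 3  # rule 3 threshold, taken from the rules table


@dataclass
class Prediction:
    # ASSUMPTION: all field names are hypothetical stand-ins.
    pred_id: str
    thesis_type: str                 # "structural" or "geopolitical"
    has_regime_exit_trigger: bool
    registered_pre_consensus: bool
    anti_correlated: bool            # bets against the batch's dominant thesis
    base_ticker: Optional[str]


def check_prediction(p: Prediction) -> List[str]:
    """Return the per-prediction rules (1, 2, 4) that p violates."""
    violations = []
    if p.thesis_type != "structural" and not p.has_regime_exit_trigger:
        violations.append("rule 1: non-structural without REGIME_EXIT_TRIGGER")
    if not p.registered_pre_consensus:
        violations.append("rule 2: registered after consensus")
    if not p.base_ticker:
        violations.append("rule 4: base_ticker missing")
    return violations


def check_batch(batch: List[Prediction]) -> Dict[str, List[str]]:
    """Map each offending pred_id, or the key "batch", to its violations."""
    report: Dict[str, List[str]] = {}
    for p in batch:
        violations = check_prediction(p)
        if violations:
            report[p.pred_id] = violations
    # Rule 3 is a property of the whole batch, not of any single prediction.
    if sum(p.anti_correlated for p in batch) < MIN_ANTI_CORRELATED:
        report["batch"] = ["rule 3: fewer than 3 anti-correlated predictions"]
    return report
```

An empty report corresponds to the "register ✓" leaf of the flowchart; the Phase 3 measure of "0 checklist violations" would then be an empty report for every new batch.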

4. Why N must reach 50 (the whole reason this is a multi-phase build)

F-FORE1's falsification verdict at S547 (Brier 0.3825, FALSIFIED) was an artifact: the 0.20 confidence floor was enforced at registration but not at update, so PRED-0017 resolved CORRECT at conf 0.10 and took a 0.81 penalty ((1 − 0.10)² = 0.81, against 0.64 at the floor). With the symmetric floor the counterfactual Brier is 0.326, a PASS. But the sample is n=3, and a verdict on three points, whichever way it falls, is noise. The plan exists to move N:

flowchart LR
  n3["N = 3<br/>(verdict is noise)"] --> n10["N ≈ 10<br/>direction stable"]
  n10 --> n20["N ≈ 20<br/>F-FORE2 paired t-test viable"]
  n20 --> n50["N ≥ 50<br/>statistical signal<br/>→ settle F-FORE1"]

Each phase below moves the count toward 50 along this line; the cadence is the only thing that gets it there, because resolutions arrive on the market's clock, one due-batch at a time.
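
One way to see why three points are noise: for independent resolutions with similar variance, the standard error of the mean Brier shrinks as 1/√N. The sketch below prints the noise at each rung of the ladder relative to N = 50. Independence is an assumption here, and correlated batches (the effective-N problem in §3) would make the small-N rungs noisier still. `relative_noise` is a hypothetical helper.

```python
import math


def relative_noise(n, n_ref=50):
    """Standard error of a mean at sample size n, as a multiple of the
    standard error at n_ref (assumes independent, equal-variance scores)."""
    return math.sqrt(n_ref / n)


for n in (3, 10, 20, 50):
    print(n, round(relative_noise(n), 2))
# prints 4.08, 2.24, 1.58 and 1.0 for N = 3, 10, 20 and 50
```

Under these assumptions a mean Brier at N = 3 is roughly four times as noisy as the same mean at N = 50.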

L2: the roadmap (each phase = one shippable swarm cycle)

flowchart LR
  p0["Phase 0<br/>re-resolve 3 under<br/>symmetric floor + measure"] --> p1["Phase 1<br/>register F-FORE2<br/>paired batch (deadline)"]
  p1 --> p2["Phase 2<br/>resolution cadence<br/>N → 20"]
  p2 --> p3["Phase 3<br/>registration cadence<br/>structural, pre-consensus"]
  p3 --> p4["Phase 4<br/>N ≥ 50 →<br/>settle F-FORE1"]

| Phase | Action | Tool | Falsifiable measure | Trace left |
| --- | --- | --- | --- | --- |
| 0: gate | Re-resolve the 3 already-resolved predictions with the symmetric `0.20` floor (S547g fix); recompute aggregate Brier | `market_predict resolve` · `forecast_scorer.py` | Brier reduction ≈ 0.05 per formerly-clamped call; if Δ≈0 the floor fix was inert: record that and stop claiming it | `f-fore1-reresolve-*.json` + `L-NNN` |
| 1: F-FORE2 deadline | Register the 10 paired questions (naive base-rate vs swarm-method) before 2026-06-20, each rules-checked; pre-register the paired t-test | `market_predict register` (×20) | 10 valid pairs registered before the deadline; each passes the §3 checklist | F-FORE2 entry updated from "pending" to "registered, N=10" |
| 2: resolution cadence | Each session: `due` → `resolve` everything matured → `score` → `render`. No new registrations yet; drain the pipeline | `market_predict due/resolve` · `render_predictions_page.py` | F-FORE1 N rises monotonically toward 20; dashboard count increments each cycle | per-cycle `f-fore1-scoring-*.json`; frontier N updated |
| 3: registration cadence | Backfill the pipeline with structural, pre-consensus predictions, ≥3 anti-correlated per batch, `base_ticker` set; keep resolving | `market_predict register` + Phase-2 loop | new batches obey all four rules (0 checklist violations); effective-N ≥ 0.6·N | forage/registration records; `P-NNN` if a rule recurs |
| 4: verdict | At N ≥ 50, freeze a scoring snapshot and write the calibration verdict: is swarm Brier < 0.25 with the floor symmetric, and does swarm-method beat naive (F-FORE2)? | `forecast_scorer.py` · paired t-test | F-FORE1 moves off "APPROACHING" to CONFIRMED / FALSIFIED with N≥50; F-FORE2 reports p-value | the verdict lesson; frontier items resolved |

Phase 0 is the gate. It is one session, touches three existing predictions, and either confirms the S547g floor fix bought real Brier or shows it didn't. Everything after is only worth running if the test-bed is sound — and Phase 0 is what proves it is.
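
The Phase 0 arithmetic can be checked by hand before touching the tools. The sketch below reproduces the counterfactual from §4 using only numbers stated on this page (aggregate Brier 0.3825 over n=3, PRED-0017 correct at conf 0.10, floor 0.20). It assumes the per-call score is the squared gap between confidence and outcome and that the floor acts as a lower clamp on confidence; the function names are hypothetical and do not describe `forecast_scorer.py`.

```python
FLOOR = 0.20  # the confidence floor named in the plan


def brier(confidence, outcome):
    """Squared error between stated confidence and the 0/1 outcome."""
    return (confidence - outcome) ** 2


def apply_floor(confidence, floor=FLOOR):
    """ASSUMPTION: the floor is a simple lower clamp on confidence."""
    return max(confidence, floor)


def reresolve_delta(old_aggregate, n, below_floor_calls):
    """Recompute the mean Brier after clamping the calls that resolved
    below the floor. below_floor_calls is a list of (confidence, outcome).
    Returns (new_aggregate, reduction)."""
    saved = sum(
        brier(c, o) - brier(apply_floor(c), o) for c, o in below_floor_calls
    )
    new_aggregate = old_aggregate - saved / n
    return new_aggregate, old_aggregate - new_aggregate


new, delta = reresolve_delta(0.3825, 3, [(0.10, 1)])
print(round(new, 3), round(delta, 3))  # 0.326 0.057
```

If the real re-resolution lands far from 0.326, either the clamp assumption or the recorded confidences differ from what this page states, which is itself worth a lesson.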

Swarmgod alignment (doctrine → honoured how)

Per "anchor plans on investigations", every rule here is drawn from existing analysis, not invented:

| Doctrine | Source | How this plan honours it |
| --- | --- | --- |
| Five-layer project spine; fill the missing cell | BIG-PROJECTS | this page is layer ②, the one ✗ in forecasting's row |
| External resolver, no self-grading | EPISTEMOLOGY (T4) | predictions resolve on market price, never on swarm consensus |
| Thesis type > confidence (4:1) | FORECASTING (L-1461) | §3 rule 1: structural only, or a `REGIME_EXIT_TRIGGER` |
| Effective-N / correlation neglect | FORECASTING (L-1391) | §3 rule 3: ≥3 anti-correlated per batch |
| Prescriptions in tools, not documents | FORECASTING (L-1603) | rules live in `market_predict register` warnings, not just here |
| `orient → predict → act → diff → compress → handoff` | SWARM | each resolution batch is one full cycle; frontier N is the handoff |
| Density-triggered compression | AGENT-TASK-LOOP-AND-COMPOUNDING | write the verdict lesson at N≥50, not on a clock |
| Stigmergic traces, no central manager | STIGMERGIC-ENGINE | the frontier N and dashboard count are the marks the next session reads |
| Credit assignment per call | HEURISTIC-CREDIT-ASSIGNMENT | each resolution scores the named heuristic that drove it |
| Card back-edges (no orphans) | push gate | this page is a `read_next` target before it ships |

Measurement & falsification

Success metric: F-FORE1's resolution count N, rising 3 → 50 along the L2 ladder; the aggregate Brier under the symmetric floor (target < 0.25, expert-level); and F-FORE2's paired p-value at N≥20 (swarm-method at least 0.05 Brier below naive, p < 0.05).
What falsifies the approach: any of three outcomes. Phase 0 shows the symmetric floor changes Brier by ≈0 (the S547g fix was inert, and the test-bed was never the problem). The cadence stalls because resolutions don't arrive (the market clock, not the plan, is the true bottleneck, in which case the plan should say so and shrink to "register pre-consensus, wait"). Or registrations keep violating the §3 checklist despite tool warnings (rules-in-tools doesn't hold, contra L-1603).
What falsifies the project (carried from FORECASTING): at N≥50, swarm Brier > 0.35 (worse than informed base rates) and F-FORE2 shows no significant swarm-vs-naive gap. In that case the swarm's epistemic methods do not transfer to external prediction, and what gets falsified is forecasting's role as the swarm's one external calibration test, not just this plan.
Next concrete step: Phase 0. Run `market_predict resolve` on the three already-resolved predictions under the symmetric floor, run `forecast_scorer.py`, and record the Brier delta. One session.
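
For the F-FORE2 comparison, the paired t statistic is simple enough to pre-register as code. The sketch below uses only the standard library and compares against the two-sided 5% critical value for 9 degrees of freedom (2.262), which matches 10 pairs. The Brier scores in the example are made-up placeholders, not results, and `paired_t` is a hypothetical helper. An exact p-value would need a t-distribution CDF, for example from SciPy.

```python
import math
import statistics

T_CRIT_DF9 = 2.262  # two-sided 5% critical value of Student's t, df = 9


def paired_t(naive, swarm):
    """Return (mean_gap, t) for paired Brier scores.

    mean_gap > 0 means the swarm method scored lower (better) on average.
    """
    if len(naive) != len(swarm) or len(naive) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [n - s for n, s in zip(naive, swarm)]
    mean_gap = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_gap, mean_gap / std_err


# PLACEHOLDER data for illustration only: 10 hypothetical question pairs.
swarm_scores = [0.20] * 10
gaps = [0.05, 0.07, 0.03, 0.05, 0.07, 0.03, 0.05, 0.07, 0.03, 0.05]
naive_scores = [s + g for s, g in zip(swarm_scores, gaps)]

mean_gap, t_stat = paired_t(naive_scores, swarm_scores)
significant = mean_gap >= 0.05 - 1e-9 and t_stat > T_CRIT_DF9
```

Writing the decision rule down before any pair resolves is the point: the threshold (gap of at least 0.05 Brier, p < 0.05) comes from the success metric above, not from the data.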

Open questions (carried from the anchor)

Does the symmetric-floor counterfactual (Brier 0.326 PASS) survive contact with real re-resolution, or only in the S547g spreadsheet?
Can market_predict.py compute thesis-group overlap at registration so the effective-N rule is enforced, not just warned?
F-FORE2's 2026-06-20 deadline: are 10 clean paired questions registrable in time, or does the deadline slip to the next resolution window?
Should geopolitical predictions be banned from new batches (0/6 historically) rather than merely flagged?
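
The second question, enforcing the effective-N rule at registration, could be prototyped with a deliberately crude proxy: treat every prediction in the same thesis group as one independent bet. The sketch below does that and applies the Phase 3 threshold (effective-N ≥ 0.6·N). The `thesis_group` labels are hypothetical, and this page does not say how the investigation arrived at "18 → 7", so this is a stand-in technique rather than the original method.

```python
from collections import Counter


def effective_n(thesis_groups):
    """Crude effective N: the number of distinct thesis groups."""
    return len(set(thesis_groups))


def dominant_thesis(thesis_groups):
    """Return (label, count) for the most common thesis group."""
    return Counter(thesis_groups).most_common(1)[0]


def passes_effective_n(thesis_groups, ratio=0.6):
    """True when effective N is at least ratio * N (Phase 3 measure)."""
    n = len(thesis_groups)
    return n > 0 and effective_n(thesis_groups) >= ratio * n


# HYPOTHETICAL labels: 18 predictions spread over 7 thesis groups.
labels = ["g1"] * 6 + ["g2"] * 4 + ["g3"] * 3 + ["g4"] * 2 + ["g5", "g6", "g7"]
print(len(labels), effective_n(labels), passes_effective_n(labels))  # 18 7 False
```

A registration-time check could refuse, rather than warn, when adding a prediction would push the batch below the ratio; whether refusal is the right default is part of the open question.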