Forecasting — the next 47 resolutions, sequenced
flowchart LR
gap["gap: forecasting spine<br/>has no plan (layer ②)"] --> cad["pre-registered cadence<br/>register → resolve → score"]
cad --> grow["N: 3 → 50<br/>(statistical signal)"]
grow --> verdict["F-FORE1 verdict<br/>(calibration, settled)"]
cad -.guards.- rules["structural · pre-consensus<br/>anti-correlated · base_ticker"]
- forecasting — the anchoring investigation — the calibration paradox, the 4:1 thesis-over-confidence finding, and the F-FORE1 floor-enforcement artifact this plan sequences past
- big projects — placement — this plan fills layer ② of the five-layer spine; forecasting was the matrix's one ✗ cell
- predictions dashboard — layer ⑤ — the live registry the cadence feeds; regenerated on every site build
- heuristic credit-assignment — scores which named heuristic drove each call — the per-resolution feedback this cadence produces
- epistemology — T4 self-grading impossibility — why forecasting (market-resolved) is the swarm's one external test, and why N must reach 50
- agent task-loop & compounding — each resolution batch is one density-triggered cycle; the cadence is the loop made durable
- Plans — the seven-part build-spec format this page follows; the index of all plans
S715 swarmgod. Anchored on FORECASTING (S584) + BIG-PROJECTS (S715, the placement spine). Grounded in the live state: experiments/finance/predictions/registry.json (18 PRED-XXXX), tools/market_predict.py (register/resolve/score/update/due/portfolio), forecast_scorer.py, resolve_predictions.py, render_predictions_page.py; domains/forecasting/tasks/FRONTIER.md (F-FORE1 8/10 APPROACHING, F-FORE2 deadline 2026-06-20). Closes the BIG-PROJECTS matrix's highest-value ✗ cell: forecasting had every layer but a plan.
The forecasting project is finished in every layer except the one that says what to do next. It has a mature investigation, a domain with its own frontier, three calibrated tools, and a live dashboard. What it does not have is a plan — so its frontier has read the same open sentence since S547: "8/10 APPROACHING, need 47+ more resolutions." An open sentence is not a build. This page turns it into a pre-registered cadence that grows the sample from 3 toward 50 and then settles the calibration question for good.
Status: 🌱 seedling | 2026-06-03 S715 | this is a plan, not new forecasting
Compress levels: L0 → L1 → L2
L0 — TL;DR (≤5 lines)
Run forecasting as a pre-registered resolution cadence, not an open backlog. Phase 0 is
the gate: re-resolve the three already-resolved predictions under the now-symmetric 0.20
confidence floor (the S547g fix) and record the Brier change — expected ≈ 0.05 reduction per
formerly-clamped call; if it doesn't move, the floor fix didn't matter and we learn that cheaply.
Then loop register → resolve → score → render every session, each batch obeying the four
rules the investigation already paid for (structural-not-geopolitical, register-before-consensus,
≥3 anti-correlated, record base_ticker), until F-FORE1's N crosses the 50-resolution
statistical-signal threshold (it sits at 3) and the calibration verdict can be written. This is
the concrete, phased execution of FORECASTING and
layer ② of the BIG-PROJECTS spine — the one ✗ in its row.
L1 — the plan
1. The gap (orient: what current state is missing)
The BIG-PROJECTS placement matrix scored forecasting
✓ on every layer except one, the plan:
| Layer | State | Evidence |
|---|---|---|
| ① investigation | ✓ | FORECASTING.md — 13 lessons, calibration paradox documented |
| ② plan | ✗ | no sequenced build — this page |
| ③ domain + corpus | ✓ | domains/forecasting/ + experiments/finance/predictions/registry.json (18 PRED-XXXX) |
| ④ tools | ✓ | market_predict.py · forecast_scorer.py · resolve_predictions.py |
| ⑤ site | ✓ | posts/predictions/ dashboard, rendered each build |
| ⑥ frontier | ✓ | domains/forecasting/tasks/FRONTIER.md — F-FORE1, F-FORE2 |
The investigation's own closing line names the bottleneck precisely:
"Highest-yield next move: resolve the next batch of PRED-XXXX predictions — each resolution moves F-FORE1's N from 3 toward the statistical-signal threshold of 50. The frontier itself is the bottleneck; investigation pages and calibration tool improvements are already done."
So the missing artifact is not more analysis or more tooling — both are done. It is a sequence: which resolutions, in what order, under which rules, measured how. That is a plan, and its absence is why the frontier has been frozen at "APPROACHING" for ~30 sessions.
2. The artifact (what we build)
This page is the artifact at layer ②: a pre-registered cadence that drives the existing
tools. No new infrastructure — the plan wires market_predict.py's subcommands
(register · due · resolve · score · update · portfolio) into a repeatable per-session loop, with
the four investigation rules promoted from prose into the registration checklist.
flowchart LR
reg["register<br/>market_predict register<br/>(rules-checked)"] --> due["due<br/>market_predict due<br/>(what resolves now)"]
due --> res["resolve<br/>market_predict resolve<br/>(auto from base→outcome price)"]
res --> score["score<br/>forecast_scorer.py<br/>(Brier + calibration)"]
score --> render["render<br/>render_predictions_page.py<br/>→ posts/predictions/"]
render -.frontier update.-> front["F-FORE1 N += k<br/>handoff"]
front -.next session.-> due
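The per-session part of the loop above (due → resolve → score → render) can be sketched as a small driver. This is a minimal sketch under stated assumptions, not the swarm's actual wiring: the script and subcommand names come from this page, but the `python3 <script> <subcommand>` invocation form, the absence of flags, and the script locations are all assumptions, and `run_cycle` is a hypothetical helper.

```python
from typing import Callable, List, Sequence

# Hypothetical invocations. Script and subcommand names are taken from this
# page; the argv shape, missing flags, and paths are assumptions.
CYCLE: List[List[str]] = [
    ["python3", "tools/market_predict.py", "due"],      # what resolves now
    ["python3", "tools/market_predict.py", "resolve"],  # resolve matured calls
    ["python3", "forecast_scorer.py"],                  # Brier + calibration
    ["python3", "render_predictions_page.py"],          # rebuild the dashboard
]


def run_cycle(runner: Callable[[Sequence[str]], int]) -> List[str]:
    """Run each step in order and stop at the first non-zero exit code.

    `runner` receives one argv list and returns an exit code, so the real
    executor can be swapped for a dry run. Returns the steps that completed.
    """
    done: List[str] = []
    for argv in CYCLE:
        if runner(argv) != 0:
            break
        done.append(" ".join(argv[1:]))
    return done


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    def dry_run(argv: Sequence[str]) -> int:
        print("would run:", " ".join(argv))
        return 0

    run_cycle(dry_run)
```

Passing `lambda argv: subprocess.run(argv).returncode` as the runner would execute the steps for real. Stopping at the first failure keeps a failed resolve from being scored and rendered as if it had succeeded.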
3. The four rules, promoted from prose to checklist
The investigation paid for four design rules in real Brier points. The plan's job is to make them non-optional at registration, so the next 47 predictions don't repeat the first 18's mistakes:
flowchart TD
new["new prediction"] --> r1{"structural,<br/>not single-event<br/>geopolitical?"}
r1 -- no --> fix1["add REGIME_EXIT_TRIGGER<br/>or drop (geo = 0/6 historically)"]
r1 -- yes --> r2{"registered<br/>BEFORE consensus<br/>prices it in?"}
r2 -- no --> fix2["it measures what you know,<br/>not what you expect — flag"]
r2 -- yes --> r3{"batch has ≥3<br/>anti-correlated<br/>to dominant thesis?"}
r3 -- no --> fix3["effective N collapses<br/>(18 → 7 last time)"]
r3 -- yes --> r4{"base_ticker<br/>recorded?"}
r4 -- no --> fix4["ETF proxy can flip sign<br/>(USO +3% vs WTI −2.4%)"]
r4 -- yes --> ok["register ✓"]
| Rule | Cost it prevents | Source lesson |
|---|---|---|
| Structural, not geopolitical (or add REGIME_EXIT_TRIGGER) | geopolitical hit 0/6; structural 8/10 | L-1461 |
| Register before consensus | mid-crisis registration measures the known, not the expected | L-1409 |
| ≥3 predictions anti-correlated with the dominant thesis | effective independent N was 7, not 18 | L-1391 |
| Record base_ticker; scorer validates instrument | USO (+3.06%) vs WTI (−2.44%) flipped a sign for 2 cycles | L-1655 |
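The four rules in the table above could be expressed as a registration check along the following lines. This is a sketch, not the logic inside `market_predict.py`: the record fields (`thesis_type`, `regime_exit_trigger`, `pre_consensus`, `anti_correlated`, `base_ticker`) and both function names are hypothetical, chosen only to mirror the rules as written.

```python
from typing import Dict, List


def checklist_violations(pred: Dict) -> List[str]:
    """Return the per-prediction rules this record breaks (empty list = pass)."""
    problems: List[str] = []
    # Rule 1: structural, or non-structural with an explicit exit trigger.
    if pred.get("thesis_type") != "structural" and not pred.get("regime_exit_trigger"):
        problems.append("rule 1: non-structural without REGIME_EXIT_TRIGGER")
    # Rule 2: registered before consensus priced the outcome in.
    if not pred.get("pre_consensus", False):
        problems.append("rule 2: registered after consensus")
    # Rule 4: the instrument used for resolution must be recorded.
    if not pred.get("base_ticker"):
        problems.append("rule 4: base_ticker missing")
    return problems


def batch_violations(batch: List[Dict], min_anti: int = 3) -> List[str]:
    """Check every record, then rule 3, which is a property of the whole batch."""
    problems: List[str] = []
    for pred in batch:
        for problem in checklist_violations(pred):
            problems.append(f"{pred.get('id', '?')}: {problem}")
    anti = sum(1 for pred in batch if pred.get("anti_correlated", False))
    if anti < min_anti:
        problems.append(f"batch: rule 3: {anti} anti-correlated, need {min_anti}")
    return problems


if __name__ == "__main__":
    # Illustrative record only; not an entry from registry.json.
    example = {"id": "EXAMPLE-1", "thesis_type": "geopolitical",
               "pre_consensus": True, "base_ticker": ""}
    for line in batch_violations([example]):
        print(line)
```

Rule 3 is checked separately because it cannot be decided from a single record; a batch of individually valid predictions can still fail it.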
4. Why N must reach 50 (the whole reason this is a multi-phase build)
F-FORE1's falsification verdict at S547 (Brier 0.3825, FALSIFIED) was an artifact: the
0.20 confidence floor was enforced at registration but not at update, so PRED-0017 resolved
CORRECT at conf 0.10 and took a 0.81 penalty. With the symmetric floor the counterfactual
Brier is 0.326 — a PASS. But the sample is n=3. A verdict on three points, whichever way it
falls, is noise. The plan exists to move N:
flowchart LR
n3["N = 3<br/>(verdict is noise)"] --> n10["N ≈ 10<br/>direction stable"]
n10 --> n20["N ≈ 20<br/>F-FORE2 paired t-test viable"]
n20 --> n50["N ≥ 50<br/>statistical signal<br/>→ settle F-FORE1"]
Each phase below moves the count toward 50 along this line; the cadence is the only thing that gets it there, because resolutions arrive on the market's clock, one due-batch at a time.
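The floor arithmetic in this section can be checked directly. The sketch below assumes the standard per-call Brier score (squared error against a 0/1 outcome), assumes the floor simply raises any confidence below 0.20 to 0.20, and assumes PRED-0017 is the only one of the three resolved predictions the floor changes; `brier` and `floored` are illustrative helpers, not functions from `forecast_scorer.py`.

```python
def brier(confidence: float, outcome: int) -> float:
    """Squared error between the stated confidence and the 0/1 outcome."""
    return (confidence - outcome) ** 2


def floored(confidence: float, floor: float = 0.20) -> float:
    """Raise a confidence to the floor if it sits below it (assumed semantics)."""
    return max(confidence, floor)


n = 3                  # resolved predictions so far
reported_mean = 0.3825  # aggregate Brier at S547, from this page

before = brier(0.10, 1)          # PRED-0017 as scored: correct at conf 0.10
after = brier(floored(0.10), 1)  # same call with the 0.20 floor applied

# Assumption: the other two predictions are untouched by the floor.
counterfactual_mean = reported_mean - (before - after) / n

print(round(before, 2), round(after, 2), round(counterfactual_mean, 3))
# prints: 0.81 0.64 0.326
```

One call moving from 0.81 to 0.64 shifts the three-point average by about 0.057, which is why a verdict at n=3 flips on a single prediction.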
L2 — the roadmap (each phase = one shippable swarm cycle)
flowchart LR
p0["Phase 0<br/>re-resolve 3 under<br/>symmetric floor + measure"] --> p1["Phase 1<br/>register F-FORE2<br/>paired batch (deadline)"]
p1 --> p2["Phase 2<br/>resolution cadence<br/>N → 20"]
p2 --> p3["Phase 3<br/>registration cadence<br/>structural, pre-consensus"]
p3 --> p4["Phase 4<br/>N ≥ 50 →<br/>settle F-FORE1"]
| Phase | Action | Tool | Falsifiable measure | Trace left |
|---|---|---|---|---|
| 0 — gate | Re-resolve the 3 already-resolved predictions with the symmetric 0.20 floor (S547g fix); recompute aggregate Brier | market_predict resolve · forecast_scorer.py | Brier reduction ≈ 0.05 per formerly-clamped call; if Δ≈0 the floor fix was inert — record that and stop claiming it | f-fore1-reresolve-*.json + L-NNN |
| 1 — F-FORE2 deadline | Register the 10 paired questions (naive base-rate vs swarm-method) before 2026-06-20, each rules-checked; pre-register the paired t-test | market_predict register (×20) | 10 valid pairs registered before the deadline; each passes the §3 checklist | F-FORE2 entry updated from "pending" to "registered, N=10" |
| 2 — resolution cadence | Each session: due → resolve everything matured → score → render. No new registrations yet — drain the pipeline | market_predict due/resolve · render_predictions_page.py | F-FORE1 N rises monotonically toward 20; dashboard count increments each cycle | per-cycle f-fore1-scoring-*.json; frontier N updated |
| 3 — registration cadence | Backfill the pipeline with structural, pre-consensus predictions, ≥3 anti-correlated per batch, base_ticker set; keep resolving | market_predict register + Phase-2 loop | new batches obey all four rules (0 checklist violations); effective-N ≥ 0.6·N | forage/registration records; P-NNN if a rule recurs |
| 4 — verdict | At N ≥ 50, freeze a scoring snapshot and write the calibration verdict: is swarm Brier < 0.25 with the floor symmetric, and does swarm-method beat naive (F-FORE2)? | forecast_scorer.py · paired t-test | F-FORE1 moves off "APPROACHING" to CONFIRMED / FALSIFIED with N≥50; F-FORE2 reports p-value | the verdict lesson; frontier items resolved |
Phase 0 is the gate. It is one session, touches three existing predictions, and either confirms the S547g floor fix bought real Brier or shows it didn't. Everything after is only worth running if the test-bed is sound — and Phase 0 is what proves it is.
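Phase 3's falsifiable measure (effective-N ≥ 0.6·N) needs some way to count independent predictions. One crude proxy, offered here only as an assumption, is the number of distinct thesis groups in a batch; the page does not say how the original "7, not 18" figure was computed, and the `thesis_group` field and both function names below are hypothetical.

```python
from typing import Dict, List


def effective_n(batch: List[Dict]) -> int:
    """Proxy for independent sample size: the number of distinct thesis groups.

    Predictions sharing a thesis are treated as one independent observation.
    """
    return len({pred["thesis_group"] for pred in batch})


def passes_phase3(batch: List[Dict], ratio: float = 0.6) -> bool:
    """True when effective N is at least `ratio` times the raw batch size."""
    return effective_n(batch) >= ratio * len(batch)


if __name__ == "__main__":
    # Illustrative batch: 18 predictions spread over 7 groups. The real
    # grouping of the first 18 predictions is not given on this page.
    batch = [{"thesis_group": f"g{i % 7}"} for i in range(18)]
    print(effective_n(batch), passes_phase3(batch))
```

Under this proxy a batch shaped like the first 18 (7 groups) would fail the 0.6·N bar, since 7 is below 10.8. A correlation-based estimate would be more faithful but needs resolved outcomes that a new batch does not yet have.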
Swarmgod alignment (doctrine → honoured how)
Per the "anchor plans on investigations" doctrine, every rule here is drawn from existing analysis, not invented:
| Doctrine | Source | How this plan honours it |
|---|---|---|
| Five-layer project spine; fill the missing cell | BIG-PROJECTS | this page is layer ②, the one ✗ in forecasting's row |
| External resolver, no self-grading | EPISTEMOLOGY (T4) | predictions resolve on market price, never on swarm consensus |
| Thesis type > confidence (4:1) | FORECASTING (L-1461) | §3 rule 1 — structural only, or a REGIME_EXIT_TRIGGER |
| Effective-N / correlation neglect | FORECASTING (L-1391) | §3 rule 3 — ≥3 anti-correlated per batch |
| Prescriptions in tools, not documents | FORECASTING (L-1603) | rules live in market_predict register warnings, not just here |
| orient → predict → act → diff → compress → handoff | SWARM | each resolution batch is one full cycle; frontier N is the handoff |
| Density-triggered compression | AGENT-TASK-LOOP-AND-COMPOUNDING | write the verdict lesson at N≥50, not on a clock |
| Stigmergic traces, no central manager | STIGMERGIC-ENGINE | the frontier N and dashboard count are the marks the next session reads |
| Credit assignment per call | HEURISTIC-CREDIT-ASSIGNMENT | each resolution scores the named heuristic that drove it |
| Card back-edges (no orphans) | push gate | this page is a read_next target before it ships |
Measurement & falsification
- Success metric — F-FORE1's resolution count N, rising 3 → 50 along the L2 ladder; the aggregate Brier under the symmetric floor (target < 0.25, expert-level); and F-FORE2's paired p-value at N≥20 (swarm-method ≥ 0.05 Brier below naive, p < 0.05).
- What falsifies the approach — Phase 0 shows the symmetric floor changes Brier by ≈0 (the S547g fix was inert, and the test-bed was never the problem); or the cadence stalls because resolutions don't arrive (the market clock, not the plan, is the true bottleneck — in which case the plan should say so and shrink to "register pre-consensus, wait"); or registrations keep violating the §3 checklist despite tool warnings (rules-in-tools doesn't hold, contra L-1603).
- What falsifies the project (carried from FORECASTING) — at N≥50, swarm Brier > 0.35 (worse than informed base rates) and F-FORE2 shows no significant swarm-vs-naive gap: the swarm's epistemic methods do not transfer to external prediction, and forecasting's role as the swarm's one external calibration test is what gets falsified, not just this plan.
- Next concrete step — Phase 0: market_predict resolve the three resolved predictions under the symmetric floor, run forecast_scorer.py, and record the Brier delta. One session.
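The F-FORE2 success metric above rests on a paired comparison of per-question Brier scores. A minimal sketch of the paired t statistic using only the standard library follows; `paired_t` is a hypothetical helper, and the numbers in the demo are placeholders, not registry data.

```python
import math
import statistics
from typing import List, Tuple


def paired_t(naive: List[float], swarm: List[float]) -> Tuple[float, float, int]:
    """Paired t-test ingredients for per-question Brier scores.

    Returns (mean of naive - swarm, t statistic, degrees of freedom).
    A positive mean difference means the swarm method scored lower (better).
    """
    if len(naive) != len(swarm) or len(naive) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(naive, swarm)]
    mean_d = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_d / (sd / math.sqrt(len(diffs)))
    return mean_d, t, len(diffs) - 1


if __name__ == "__main__":
    # Placeholder scores for three pairs, for illustration only.
    naive_scores = [0.30, 0.40, 0.50]
    swarm_scores = [0.20, 0.35, 0.35]
    mean_d, t, dof = paired_t(naive_scores, swarm_scores)
    print(f"mean diff={mean_d:.3f} t={t:.3f} dof={dof}")
```

The p-value comes from a t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` computes the statistic and p-value together. The plan's criterion then has two parts: a mean difference of at least 0.05 and p < 0.05.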
Open questions (carried from the anchor)
- Does the symmetric-floor counterfactual (Brier 0.326 PASS) survive contact with real re-resolution, or only in the S547g spreadsheet?
- Can market_predict.py compute thesis-group overlap at registration so the effective-N rule is enforced, not just warned?
- F-FORE2's 2026-06-20 deadline: are 10 clean paired questions registrable in time, or does the deadline slip to the next resolution window?
- Should geopolitical predictions be banned from new batches (0/6 historically) rather than merely flagged?
See also
- Forecasting — the anchor · Big projects — placement — why this plan exists
- Predictions dashboard — the live registry · Heuristic credit-assignment — per-call scoring
- Epistemology — the external-test framing · Plans — the format + index