Statement Backtest Pipeline — two coupled loops
flowchart TD
subgraph L1["LOOP 1 — research (fast: seconds on history)"]
lit["LITERATURE statements<br/>statements.json + signals.py<br/>momentum, trend, mean-rev, breakout"] --> expand["EXPAND (generative pressure)<br/>expand_statements.py<br/>param sweeps"]
expand --> backtest["BACKTEST walk-forward<br/>backtest_statement.py<br/>OOS Sharpe, no look-ahead"]
lit --> backtest
backtest --> composite["COMPREHENSIVE heuristic<br/>composite_heuristic.py<br/>Sharpe-weighted ensemble"]
composite --> decision["DECISION<br/>direction + confidence<br/>registry.json"]
end
subgraph L2["LOOP 2 — live (slow: weeks to resolve)"]
decision --> market["MARKET OUTCOME<br/>resolve_predictions.py<br/>Brier = loss"]
market --> credit["CREDIT<br/>heuristic_backtest.py<br/>verbal-Sharpe"]
end
credit --> compare{"backtest-Sharpe<br/>vs<br/>live verbal-Sharpe"}
compare -->|edge persists| keep["KEEP / PROMOTE"]
compare -->|edge decays| revise["PRUNE / re-backtest<br/>regime shift"]
keep --> lit
revise --> lit
click backtest "../../../posts/heuristics/" "Statement backtests"
click composite "../../../posts/heuristics/" "Composite decisions"
click decision "../../../posts/predictions/" "Forecast dashboard"
classDef fast fill:#e8f7ec,stroke:#3b6;
classDef slow fill:#eef2ff,stroke:#46c;
class lit,expand,backtest,composite,decision fast;
class market,credit slow;
- Heuristic scorecard — the live surface: statement backtests + composite decisions + verbal-Sharpe
- Credit assignment (loop 2) — how live outcomes grade the heuristics this loop produces
- Forecast dashboard — the decisions this pipeline files
- Forecasting — the calibration testbed
S713. Loop 1 build: signals.py (6 price-derivable signals), statements.json (5 literature statements with citations), backtest_statement.py (walk-forward OOS Sharpe, Yahoo history + cache), composite_heuristic.py (Sharpe-weighted ensemble + --register bridge to loop 2), expand_statements.py (generative pressure). Real result: momentum OOS Sharpe 0.70, breakout 0.50, trend 0.39; mean-reversion roughly flat. L-2251.
Can's model: "clear statements from the literature → expand → run basic tests / model a good pick → comprehensive heuristic → decision." That is a fast research loop. The credit-assignment work (loop 2) was only the slow live half. This page covers the fast half and how the two connect.
L0 — TL;DR
Decisions are made by a comprehensive heuristic: a Sharpe-weighted ensemble of clear,
literature-grounded statements that have each been backtested on ~10 years of price history.
A statement (e.g. 12-1 momentum, Jegadeesh & Titman 1993) is encoded as a price-derivable signal
(tools/signals.py), backtested walk-forward with an out-of-sample split (tools/backtest_statement.py),
and combined per asset weighted by its OOS Sharpe (tools/composite_heuristic.py). The result is
a transparent call — you see which statement pushed it and how hard — filed as a prediction that the
live market then grades (loop 2). The two loops differ in speed and
in what they trust: history (fast, in-sample-prone) vs the live market (slow, the true out-of-sample).
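As a rough illustration of what "encoded as a price-derivable signal" can look like, here is a minimal sketch of a 12-1 momentum signal. It is not the code in tools/signals.py: the function name, the 252/21-bar defaults (about twelve months and one month of daily bars), and the list-of-closes input are all assumptions made for this example.

```python
from typing import Sequence


def momentum_12_1(closes: Sequence[float], t: int,
                  lookback: int = 252, skip: int = 21) -> int:
    """Return +1, -1 or 0 for bar t, reading only closes at or before t.

    The signal is the sign of the return from `lookback` bars ago to
    `skip` bars ago, so the most recent month is left out.
    (Hypothetical helper; bar counts are assumed, not taken from the repo.)
    """
    if t - lookback < 0:
        return 0  # not enough history: no vote
    past, recent = closes[t - lookback], closes[t - skip]
    formation_return = recent / past - 1.0
    if formation_return > 0:
        return 1
    if formation_return < 0:
        return -1
    return 0


# Synthetic price paths, for illustration only.
rising = [100.0 + i for i in range(300)]
falling = [400.0 - i for i in range(300)]
print(momentum_12_1(rising, 299),
      momentum_12_1(falling, 299),
      momentum_12_1(rising, 100))  # prints: 1 -1 0
```

Because the function indexes only `closes[t - lookback]` and `closes[t - skip]`, changing any bar after `t` cannot change its output, which is the property the look-ahead guard is meant to protect.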
Real numbers (2015→2026, see the scorecard): momentum OOS Sharpe ≈ 0.70, 52-week-high breakout ≈ 0.50, trend ≈ 0.39 — strong on SPY/QQQ/GLD, absent on IWM/WTI; short-horizon mean-reversion ≈ flat. Where statements have no edge, the composite stays silent (IWM, WTI → NEUTRAL).
L1 — The two loops
The card diagram shows both. Loop 1 (research, fast): literature → expand → backtest → composite → decision. Loop 2 (live, slow): decision → market Brier → verbal-Sharpe. They close on each other: the live result is compared against the backtest, and that gap is the product.
Why two loops instead of one
The live loop alone is honest but slow — a statement needs weeks of real resolutions before its verbal-Sharpe means anything. The backtest loop gives a verdict in seconds, but on history, which is in-sample and non-stationary. Neither is sufficient: a backtest edge that vanishes live is the classic failure (DeMiguel et al. 2009: optimized portfolios lose to naive 1/N out of sample; KellyBench 2026: agents over-fit betting rules). So the architecture keeps both and surfaces the discrepancy rather than trusting either number alone.
The comprehensive heuristic (SWARMGOD-WEIGHTED-ARCHITECTURE)
For an asset today, each applicable statement emits a current signal in {−1, 0, +1}; its weight is its
OOS Sharpe on that asset, floored at 0 (no edge → no vote) and capped (no single statement
dominates). score = Σ wᵢ·signalᵢ / Σ wᵢ maps to BULL/BEAR/NEUTRAL + a confidence. This is the
weighted-architecture pattern — aggregate by track record — with
the council's rolling Sharpe replaced by the statement's backtest Sharpe.
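The scoring rule above can be sketched in a few lines. This is an illustration, not the contents of tools/composite_heuristic.py: the cap value, the ±0.2 threshold that separates NEUTRAL from BULL/BEAR, and the choice of |score| as the confidence are assumptions, since the page does not state them.

```python
from typing import Iterable, Tuple


def composite_score(votes: Iterable[Tuple[int, float]],
                    cap: float = 1.0,
                    threshold: float = 0.2) -> Tuple[str, float, float]:
    """Combine (signal, oos_sharpe) pairs into (direction, confidence, score).

    `cap` and `threshold` are assumed values chosen for this sketch.
    """
    votes = list(votes)
    # Floor at 0 (no edge, no vote) and cap (no single statement dominates).
    weights = [min(max(sharpe, 0.0), cap) for _, sharpe in votes]
    total = sum(weights)
    if total == 0:
        return "NEUTRAL", 0.0, 0.0  # no statement has an edge: stay silent
    score = sum(w * signal for w, (signal, _) in zip(weights, votes)) / total
    if score > threshold:
        direction = "BULL"
    elif score < -threshold:
        direction = "BEAR"
    else:
        direction = "NEUTRAL"
    return direction, abs(score), score


# Signals here are invented for illustration; the last weight is negative
# and is therefore floored to zero, so that statement does not vote.
votes = [(+1, 0.70), (+1, 0.50), (-1, 0.39), (-1, -0.05)]
direction, confidence, score = composite_score(votes)
print(direction, round(score, 3))
```

With these inputs the score is (0.70 + 0.50 − 0.39) / (0.70 + 0.50 + 0.39), so the two stronger statements outvote the weaker dissenting one. When every weight is floored to zero, the function returns NEUTRAL, which mirrors the "composite stays silent" behaviour described for assets with no edge.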
Generative pressure (ACTION-VOCABULARY-CEILING)
tools/expand_statements.py sweeps each statement's parameters to mint variants (origin:"expanded"),
the vocabulary-growth complement to compaction. Variants are not
trusted on creation: sweeping many params inflates the best in-sample Sharpe by chance, so an expanded
statement must clear both the OOS backtest and the live verbal-Sharpe gate.
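A parameter sweep of this kind might look like the sketch below. Only the `origin: "expanded"` tag comes from the page; the record layout (`id`, `parent`, `params`, `passed_oos`, `passed_live`) and the function name are hypothetical and are not the schema of statements.json.

```python
from itertools import product
from typing import Dict, List


def expand_statement(base: Dict, grid: Dict[str, list]) -> List[Dict]:
    """Mint one untrusted variant per parameter combination in `grid`."""
    names = sorted(grid)
    variants = []
    for values in product(*(grid[name] for name in names)):
        params = {**base["params"], **dict(zip(names, values))}
        if params == base["params"]:
            continue  # the literature statement itself is not a variant
        suffix = "_".join(f"{k}{v}" for k, v in sorted(params.items()))
        variants.append({
            "id": f'{base["id"]}__{suffix}',
            "parent": base["id"],
            "origin": "expanded",
            "params": params,
            # Both gates start closed: a variant earns trust later.
            "passed_oos": False,
            "passed_live": False,
        })
    return variants


base = {"id": "momentum_12_1", "origin": "literature",
        "params": {"lookback": 252, "skip": 21}}
variants = expand_statement(base, {"lookback": [126, 252], "skip": [0, 21]})
print(len(variants))  # prints: 3
```

Keeping both gate flags false at creation makes the multiple-testing risk explicit in the data: the more variants a sweep mints, the more likely the best in-sample one is lucky, so nothing is promoted on the strength of the sweep alone.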
Honesty guards
Signals read only past bars (the absence of look-ahead is asserted in tools/test_signals.py); the headline
number is the OOS Sharpe on the held-out 30%; no transaction costs are modelled (stated on the page);
missing bars are never fabricated; N<20 is direction-only (P-285/P-470). Price series are cached to
experiments/finance/data/ so backtests are repeatable offline.
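Two of these guards are easy to show in miniature. The sketch below uses a single chronological 70/30 split, which is a simplification of a full walk-forward scheme, and a truncation check for look-ahead. It is not the code in tools/backtest_statement.py or tools/test_signals.py; all names, the 252-bar annualisation, and the `signal_fn(closes, t)` calling convention are assumptions.

```python
import math
from statistics import mean, stdev
from typing import Callable, Sequence, Tuple

SignalFn = Callable[[Sequence[float], int], int]


def annualized_sharpe(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Mean over standard deviation, scaled to a year; 0.0 if undefined."""
    if len(returns) < 2:
        return 0.0
    spread = stdev(returns)
    if spread == 0:
        return 0.0
    return mean(returns) / spread * math.sqrt(periods_per_year)


def split_sharpe(closes: Sequence[float], signal_fn: SignalFn,
                 oos_fraction: float = 0.30) -> Tuple[float, float]:
    """Return (in-sample Sharpe, out-of-sample Sharpe), costs ignored."""
    strategy_returns = []
    for t in range(len(closes) - 1):
        position = signal_fn(closes, t)  # decided with bars up to t
        next_return = closes[t + 1] / closes[t] - 1.0  # earned on bar t+1
        strategy_returns.append(position * next_return)
    held_out = int(len(strategy_returns) * oos_fraction)
    split = len(strategy_returns) - held_out
    return (annualized_sharpe(strategy_returns[:split]),
            annualized_sharpe(strategy_returns[split:]))


def reads_only_past(signal_fn: SignalFn, closes: Sequence[float], t: int) -> bool:
    """True if the signal at t is unchanged when every bar after t is removed."""
    truncated = list(closes[: t + 1])
    try:
        return signal_fn(truncated, t) == signal_fn(closes, t)
    except IndexError:
        return False  # the signal tried to read a bar after t
```

A test suite could call `reads_only_past` for every signal at a range of `t` values; a signal that peeks at `closes[t + 1]` either raises on the truncated series or returns a different value, and fails either way.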
L2 — Open questions
H1: Does the backtest edge survive live? (the central question)
Testable-if: as the composite-filed predictions resolve, compare each statement's live verbal-Sharpe against its OOS backtest Sharpe. Persistent positive correlation = the historical edge is real; systematic decay = overfitting / regime change. This is the experiment the whole two-loop design exists to run.
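One way to run this comparison, once live numbers exist, is sketched below. The function names and the dictionary inputs keyed by statement name are hypothetical, and the example values are synthetic placeholders, not measured results.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def edge_report(backtest: Dict[str, float],
                live: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Return (correlation, per-statement gap = live minus backtest)."""
    names = sorted(set(backtest) & set(live))
    gaps = {name: live[name] - backtest[name] for name in names}
    corr = pearson([backtest[n] for n in names], [live[n] for n in names])
    return corr, gaps


# Synthetic placeholders: every live value is exactly half its backtest value.
backtest = {"stmt_a": 0.8, "stmt_b": 0.4, "stmt_c": 0.2}
live = {"stmt_a": 0.4, "stmt_b": 0.2, "stmt_c": 0.1}
corr, gaps = edge_report(backtest, live)
print(round(corr, 3), all(gap < 0 for gap in gaps.values()))  # prints: 1.0 True
```

The placeholder case shows why both outputs matter: the ranking of statements is perfectly preserved (correlation 1.0) while every gap is negative, so the edge is real in order but decayed in size. With only a handful of statements the correlation is noisy and should be read as a direction, not a measurement.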
H2: Does the ensemble beat its best single statement?
Testable-if: track the composite's live Brier vs the best individual statement's. If the weighted ensemble doesn't beat its strongest member out of sample, the weighting is not earning its complexity (the DeMiguel 1/N warning, applied to statements).
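The comparison in H2 reduces to a few lines once each forecast is expressed as a probability that the asset goes up. The sketch below assumes that representation and binary outcomes (1 = up, 0 = not up); the function names are hypothetical and the numbers are synthetic.

```python
from typing import Dict, Sequence, Tuple


def brier(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    pairs = list(zip(probabilities, outcomes))
    if not pairs:
        raise ValueError("need at least one resolved prediction")
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


def ensemble_vs_best_member(composite_probs: Sequence[float],
                            member_probs: Dict[str, Sequence[float]],
                            outcomes: Sequence[int]) -> Tuple[bool, float, str, float]:
    """Return (ensemble_wins, ensemble_brier, best_member, best_member_brier)."""
    ensemble = brier(composite_probs, outcomes)
    best_name, best = min(
        ((name, brier(probs, outcomes)) for name, probs in member_probs.items()),
        key=lambda item: item[1],
    )
    return ensemble < best, ensemble, best_name, best


# Synthetic forecasts and outcomes, for illustration only.
outcomes = [1, 0, 1, 1]
composite = [0.8, 0.2, 0.7, 0.9]
members = {"stmt_a": [0.6, 0.4, 0.6, 0.6], "stmt_b": [0.9, 0.5, 0.5, 0.5]}
wins, ens_brier, best_name, best_brier = ensemble_vs_best_member(
    composite, members, outcomes)
print(wins, best_name)  # prints: True stmt_a
```

Picking the best member on the same resolutions it is scored on flatters that member, so this check is biased against the ensemble. A stricter version would choose the best member on an earlier window and score both on a later one.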
H3: How fast does an edge decay?
Testable-if: re-backtest on a rolling window; if a statement's trailing-3y Sharpe trends down while older windows were strong, non-stationarity is eating the edge — a prune signal independent of the live loop.
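A rolling re-backtest can be approximated by computing the Sharpe ratio over sliding windows of a statement's strategy returns and fitting a line through the results. The sketch below assumes daily returns, 252 bars per year (so 756 bars for a trailing three years), and a 63-bar step; these defaults and the function names are illustrative choices, not values from the repo.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def annualized_sharpe(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Mean over standard deviation, scaled to a year; 0.0 if undefined."""
    if len(returns) < 2:
        return 0.0
    spread = stdev(returns)
    if spread == 0:
        return 0.0
    return mean(returns) / spread * math.sqrt(periods_per_year)


def rolling_sharpe(returns: Sequence[float], window: int = 756,
                   step: int = 63) -> List[float]:
    """Sharpe of each full window, oldest window first."""
    return [annualized_sharpe(returns[start:start + window])
            for start in range(0, len(returns) - window + 1, step)]


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x, mean_y = (n - 1) / 2.0, sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Synthetic returns: a positive drift that disappears halfway through.
early = [0.02 if i % 2 == 0 else -0.01 for i in range(1000)]
late = [0.01 if i % 2 == 0 else -0.01 for i in range(1000)]
sharpes = rolling_sharpe(early + late, window=200, step=100)
print(trend_slope(sharpes) < 0)  # prints: True
```

A negative slope on its own is only a flag. Overlapping windows are correlated, so a prune decision would reasonably also look at how far the latest window sits below the older ones.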
External grounding
- Jegadeesh & Titman (1993); Asness, Moskowitz & Pedersen (2013) — momentum.
- Moskowitz, Ooi & Pedersen (2012) — time-series momentum / trend.
- George & Hwang (2004) — 52-week-high. Jegadeesh (1990) — short-horizon reversal. Ang et al. (2006) — volatility.
- DeMiguel, Garlappi & Uppal (2009) — 1/N beats optimized out of sample (the overfit warning).
- KellyBench (arXiv:2604.27865, 2026); Velay et al. (2023) — agents overfit / fail to generalize out of backtest.
- Tooling: tools/signals.py, tools/backtest_statement.py, tools/composite_heuristic.py, tools/expand_statements.py.
References
- HEURISTIC-CREDIT-ASSIGNMENT — loop 2 (live grading).
- SWARMGOD-WEIGHTED-ARCHITECTURE — the weighted-ensemble pattern.
- ACTION-VOCABULARY-CEILING — generative pressure.
- L-2251 — this pipeline's lesson.