
Statement Backtest Pipeline — two coupled loops

Two coupled loops for finance decisions. LOOP 1 (fast): clear statements from the literature → expand → backtest walk-forward on ~10y history (OOS Sharpe) → comprehensive Sharpe-weighted ensemble → decision. LOOP 2 (slow): the live market grades the decision (Brier → verbal-Sharpe). The payoff is the comparison — does a statement's historical edge survive out of sample? Price-derivable statements only (momentum, trend, mean-reversion, breakout, vol-regime); no new data source.
🌱 seedling tended 2026-06-02 S713 investigation finance backtesting statements ensemble sharpe literature calibration two-loop F-COMP1
flowchart TD
  subgraph L1["LOOP 1 — research (fast: seconds on history)"]
    lit["LITERATURE statements<br/>statements.json + signals.py<br/>momentum, trend, mean-rev, breakout"] --> expand["EXPAND (generative pressure)<br/>expand_statements.py<br/>param sweeps"]
    expand --> backtest["BACKTEST walk-forward<br/>backtest_statement.py<br/>OOS Sharpe, no look-ahead"]
    lit --> backtest
    backtest --> composite["COMPREHENSIVE heuristic<br/>composite_heuristic.py<br/>Sharpe-weighted ensemble"]
    composite --> decision["DECISION<br/>direction + confidence<br/>registry.json"]
  end
  subgraph L2["LOOP 2 — live (slow: weeks to resolve)"]
    decision --> market["MARKET OUTCOME<br/>resolve_predictions.py<br/>Brier = loss"]
    market --> credit["CREDIT<br/>heuristic_backtest.py<br/>verbal-Sharpe"]
  end
  credit --> compare{"backtest-Sharpe<br/>vs<br/>live verbal-Sharpe"}
  compare -->|edge persists| keep["KEEP / PROMOTE"]
  compare -->|edge decays| revise["PRUNE / re-backtest<br/>regime shift"]
  keep --> lit
  revise --> lit
  click backtest "../../../posts/heuristics/" "Statement backtests"
  click composite "../../../posts/heuristics/" "Composite decisions"
  click decision "../../../posts/predictions/" "Forecast dashboard"
  classDef fast fill:#e8f7ec,stroke:#3b6;
  classDef slow fill:#eef2ff,stroke:#46c;
  class lit,expand,backtest,composite,decision fast;
  class market,credit slow;

S713. Loop 1 build: signals.py (6 price-derivable signals), statements.json (5 literature statements w/ citations), backtest_statement.py (walk-forward OOS Sharpe, Yahoo history + cache), composite_heuristic.py (Sharpe-weighted ensemble + --register bridge to loop 2), expand_statements.py (generative pressure). Real result: momentum OOS Sharpe 0.70, breakout 0.50, trend 0.39; mean-reversion ~flat. L-2251.

Can's model: "clear statements from the literature → expand → run basic tests / model a good pick → comprehensive heuristic → decision." That is a fast research loop. The credit-assignment work (loop 2) was only the slow live half. This page is the fast half and how the two connect.


L0 — TL;DR

Decisions are made by a comprehensive heuristic: a Sharpe-weighted ensemble of clear, literature-grounded statements that have each been backtested on ~10 years of price history. A statement (e.g. 12-1 momentum, Jegadeesh & Titman 1993) is encoded as a price-derivable signal (tools/signals.py), backtested walk-forward with an out-of-sample split (tools/backtest_statement.py), and combined per asset weighted by its OOS Sharpe (tools/composite_heuristic.py). The result is a transparent call — you see which statement pushed it and how hard — filed as a prediction that the live market then grades (loop 2). The two loops differ in speed and in what they trust: history (fast, in-sample-prone) vs the live market (slow, the true out-of-sample).
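As a minimal sketch of what "encoded as a price-derivable signal" can look like, the function below implements one common reading of 12-1 momentum on monthly closes. The name `momentum_12_1` and the monthly-close input are assumptions for illustration; this is not the actual content of tools/signals.py.

```python
from typing import Sequence


def momentum_12_1(monthly_closes: Sequence[float]) -> int:
    """Return +1, -1, or 0 from a 12-1 momentum rule (hypothetical helper).

    Compares the close one month before the latest bar with the close
    twelve months before it, so the most recent month is skipped.
    Only bars already present in the list are read.
    """
    if len(monthly_closes) < 13:
        return 0  # not enough history: abstain rather than guess
    start = monthly_closes[-13]  # close 12 months before the latest bar
    end = monthly_closes[-2]     # close 1 month before the latest bar
    if start <= 0:
        return 0  # guard against bad price data
    ret = end / start - 1.0
    if ret > 0:
        return 1
    if ret < 0:
        return -1
    return 0
```

Returning 0 on short history keeps the signal in the same {−1, 0, +1} vocabulary the composite consumes, so an asset with too few bars simply casts no vote.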

Real numbers (2015→2026, see the scorecard): momentum OOS Sharpe ≈ 0.70, 52-week-high breakout ≈ 0.50, trend ≈ 0.39 — strong on SPY/QQQ/GLD, absent on IWM/WTI; short-horizon mean-reversion ≈ flat. Where statements have no edge, the composite stays silent (IWM, WTI → NEUTRAL).


L1 — The two loops

The card diagram shows both. Loop 1 (research, fast): literature → expand → backtest → composite → decision. Loop 2 (live, slow): decision → market Brier → verbal-Sharpe. They close on each other: the live result is compared against the backtest, and that gap is the product.
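Loop 2 grades decisions with the Brier score, which is the mean squared difference between the forecast probability and the realised 0/1 outcome (lower is better). A minimal sketch follows; how the composite's direction and confidence are turned into a probability is not specified on this page, so the `probs` input is an assumption, and this is not the code of resolve_predictions.py.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
```

A forecaster that always says 0.5 scores 0.25, which makes that value a natural reference point when reading live results.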

Why two loops instead of one

The live loop alone is honest but slow — a statement needs weeks of real resolutions before its verbal-Sharpe means anything. The backtest loop gives a verdict in seconds, but on history, which is in-sample and non-stationary. Neither is sufficient: a backtest edge that vanishes live is the classic failure (DeMiguel et al. 2009: optimized portfolios lose to naive 1/N out of sample; KellyBench 2026: agents over-fit betting rules). So the architecture keeps both and surfaces the discrepancy rather than trusting either number alone.

The comprehensive heuristic (SWARMGOD-WEIGHTED-ARCHITECTURE)

For an asset today, each applicable statement emits a current signal in {−1, 0, +1}; its weight wᵢ is its OOS Sharpe on that asset, floored at 0 (no edge → no vote) and capped (no single statement dominates). score = (Σ wᵢ·signalᵢ) / (Σ wᵢ) therefore lies in [−1, +1] and maps to BULL/BEAR/NEUTRAL plus a confidence; if every weight is 0, no statement votes and the call is NEUTRAL. This is the weighted-architecture pattern (aggregate by track record), with the council's rolling Sharpe replaced by the statement's backtest Sharpe.
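The weighting rule can be sketched in a few lines. The cap value, the decision threshold, and the use of |score| as the confidence are assumptions chosen for illustration; the page does not state the values used by tools/composite_heuristic.py.

```python
from typing import Sequence, Tuple


def composite_score(
    signals: Sequence[int],
    oos_sharpes: Sequence[float],
    cap: float = 1.0,        # assumed cap on any single weight
    threshold: float = 0.2,  # assumed dead zone around zero
) -> Tuple[str, float]:
    """Sharpe-weighted vote over signals in {-1, 0, +1}."""
    # Floor at 0 (no edge, no vote) and cap (no single statement dominates).
    weights = [min(max(s, 0.0), cap) for s in oos_sharpes]
    total = sum(weights)
    if total == 0:
        return "NEUTRAL", 0.0  # nothing has an edge: stay silent
    score = sum(w * sig for w, sig in zip(weights, signals)) / total
    if score > threshold:
        label = "BULL"
    elif score < -threshold:
        label = "BEAR"
    else:
        label = "NEUTRAL"
    return label, abs(score)
```

Because each weight is visible next to its signal, the call stays transparent: it is straightforward to print which statement pushed the score and by how much.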

Generative pressure (ACTION-VOCABULARY-CEILING)

tools/expand_statements.py sweeps each statement's parameters to mint variants (origin:"expanded"), the vocabulary-growth complement to compaction. Variants are not trusted on creation: sweeping many params inflates the best in-sample Sharpe by chance, so an expanded statement must clear both the OOS backtest and the live verbal-Sharpe gate.
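A parameter sweep of this kind can be sketched as a Cartesian product over a grid. The statement schema below (`id`, `params`, `parent`, `trusted`) is hypothetical; only the `origin: "expanded"` tag comes from the page.

```python
from itertools import product
from typing import Dict, List, Sequence


def expand_statement(base: Dict, grid: Dict[str, Sequence]) -> List[Dict]:
    """Mint untrusted variants of `base` by sweeping a parameter grid."""
    names = sorted(grid)
    variants = []
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        if params == base.get("params"):
            continue  # skip the original parameterisation
        suffix = "-".join(f"{n}{v}" for n, v in params.items())
        variants.append({
            "id": f"{base['id']}-{suffix}",
            "parent": base["id"],
            "origin": "expanded",
            "params": params,
            "trusted": False,  # must still clear the OOS and live gates
        })
    return variants
```

The number of variants grows multiplicatively with the grid, which is exactly why the best in-sample Sharpe among them is biased upward and cannot be taken at face value.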

Honesty guards

Signals read only past bars (the absence of look-ahead is asserted in tools/test_signals.py); OOS Sharpe (held-out 30%) is the headline; no transaction costs are modelled (stated on the page); missing bars are never fabricated; N<20 is direction-only (P-285/P-470). Price series are cached to experiments/finance/data/ so backtests are repeatable offline.
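Two of these guards, the past-bars-only rule and the 30% held-out headline, can be sketched together. This is a simplified illustration and not the code of tools/backtest_statement.py: it assumes daily closes, a zero risk-free rate, a single chronological split rather than multiple walk-forward folds, and (as on the page) no transaction costs.

```python
import math
from statistics import mean, stdev
from typing import Callable, List, Sequence


def annualized_sharpe(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Mean over standard deviation, annualised; 0.0 if undefined."""
    if len(returns) < 2:
        return 0.0
    sd = stdev(returns)
    if sd == 0:
        return 0.0
    return mean(returns) / sd * math.sqrt(periods_per_year)


def oos_sharpe(
    closes: Sequence[float],
    signal_fn: Callable[[Sequence[float]], int],
    oos_fraction: float = 0.30,
) -> float:
    """Sharpe of the strategy on the last `oos_fraction` of the sample."""
    strat: List[float] = []
    for t in range(1, len(closes) - 1):
        # The signal sees bars 0..t only; the return it earns is t -> t+1.
        position = signal_fn(closes[: t + 1])
        next_ret = closes[t + 1] / closes[t] - 1.0
        strat.append(position * next_ret)
    split = int(len(strat) * (1.0 - oos_fraction))
    return annualized_sharpe(strat[split:])
```

Slicing `closes[: t + 1]` before calling the signal makes look-ahead structurally impossible inside the loop, which is easier to audit than trusting each signal to index carefully.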


L2 — Open questions

H1: Does the backtest edge survive live? (the central question)

Testable-if: as the composite-filed predictions resolve, compare each statement's live verbal-Sharpe against its OOS backtest Sharpe. Persistent positive correlation = the historical edge is real; systematic decay = overfitting / regime change. This is the experiment the whole two-loop design exists to run.
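One way to operationalise this comparison is to compute, across statements, the Pearson correlation between backtest and live scores and the average gap between them. The function name and the dictionary inputs are hypothetical, and the live values are assumed to be on a scale comparable to the backtest Sharpe.

```python
import math
from typing import Dict, Tuple


def edge_survival(
    backtest: Dict[str, float], live: Dict[str, float]
) -> Tuple[float, float]:
    """Return (correlation, mean decay) over statements present in both maps.

    Decay is backtest minus live, so a positive value means the edge shrank.
    """
    names = sorted(set(backtest) & set(live))
    if len(names) < 3:
        raise ValueError("need at least 3 statements to compare")
    xs = [backtest[n] for n in names]
    ys = [live[n] for n in names]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    corr = cov / math.sqrt(vx * vy)
    decay = sum(x - y for x, y in zip(xs, ys)) / len(xs)
    return corr, decay
```

Reading the two numbers together matters: a high correlation with a large positive decay would suggest the ranking of statements holds up while the size of the edge does not.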

H2: Does the ensemble beat its best single statement?

Testable-if: track the composite's live Brier vs the best individual statement's. If the weighted ensemble doesn't beat its strongest member out of sample, the weighting is not earning its complexity (the DeMiguel 1/N warning, applied to statements).

H3: How fast does an edge decay?

Testable-if: re-backtest on a rolling window; if a statement's trailing-3y Sharpe trends down while older windows were strong, non-stationarity is eating the edge — a prune signal independent of the live loop.
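A rolling re-backtest can be sketched as a sequence of window Sharpes plus a least-squares slope over them. The window of 756 bars (about three years of daily data) and the yearly step are assumptions, as is the use of a simple linear slope as the decay measure.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def annualized_sharpe(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Mean over standard deviation, annualised; 0.0 if undefined."""
    if len(returns) < 2:
        return 0.0
    sd = stdev(returns)
    if sd == 0:
        return 0.0
    return mean(returns) / sd * math.sqrt(periods_per_year)


def rolling_sharpes(
    returns: Sequence[float], window: int = 756, step: int = 252
) -> List[float]:
    """Sharpe of each full window, oldest first."""
    return [
        annualized_sharpe(returns[s : s + window])
        for s in range(0, len(returns) - window + 1, step)
    ]


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den
```

A clearly negative slope, with early windows strong and the latest window weak, is the pattern H3 describes; overlapping windows are correlated, so the slope is better treated as a flag for review than as a significance test.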


External grounding

  • Jegadeesh & Titman (1993); Asness, Moskowitz & Pedersen (2013) — momentum.
  • Moskowitz, Ooi & Pedersen (2012) — time-series momentum / trend.
  • George & Hwang (2004) — 52-week-high. Jegadeesh (1990) — short-horizon reversal. Ang et al. (2006) — volatility.
  • DeMiguel, Garlappi & Uppal (2009) — 1/N beats optimized out of sample (the overfit warning).
  • KellyBench (arXiv:2604.27865, 2026); Velay et al. (2023) — agents overfit / fail to generalize out of backtest.
  • Tooling: tools/signals.py, tools/backtest_statement.py, tools/composite_heuristic.py, tools/expand_statements.py.
