5,039 trading days. 4-layer system. ROC-AUC vs forward drawdown 0.754 (non-circular). Phase 9 anti-redundancy statistically confirmed. Honest about what passed — and what didn't.
This report (v3, 2026-05-01) validates the current 4-layer system (Macro/Politics/Markets/Energy + three additive overlays + Phase 9 Credit/Funding) against 5,039 trading days of market history 2007-2026. Fully reproducible with code and data in the public repository. We publish the results as they are — including the layers that were not historically testable.
All tests run on a single dataset committed to Git under `backtest_data/v3/`.
Layer B (politics) was migrated from v1 to v2 with five simultaneous parameter changes. We validated the change against five historical episodes.
Power-limited at n=4 — direction is the strongest available evidence.
v2 eliminates v1's permanent background stress with overwhelming evidence.
All baseline, crisis-response, and out-of-sample checks.
In-sample ends 2025-02-28. First OOS event: Liberation Day, 2 April 2025. v2 classifies it correctly as CRITICAL (score 52.6 vs. VIX 52.3) — independent confirmation.
We checked every component for look-ahead bias, that is, the use of future information that would not have been available at decision time.
The Fed balance sheet and the Treasury account are published weekly on Thursdays and can be revised for up to 14 days. The fetcher takes the latest DB entry without a 7-day buffer.
fred.py and stooq.py do not assert that the latest date is ≤ today. A DB error could allow future data through.
_load_market_data() has no upper date limit. Defensive guardrail recommended.
All three are defensive guardrails for live operation. The backtest itself uses frozen CSV snapshots, so it is unaffected. The warnings are published transparently.
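The three guardrails can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function names `assert_not_future` and `latest_usable`, the `(date, value)` row format, and the choice to apply the buffer to the observation date are all assumptions.

```python
from datetime import date, timedelta


def assert_not_future(rows, today):
    """Raise if any observation is dated after `today` (hypothetical helper)."""
    latest = max(obs_date for obs_date, _ in rows)
    if latest > today:
        raise ValueError(f"latest observation {latest} is after {today}")


def latest_usable(rows, today, buffer_days=0):
    """Return the newest (obs_date, value) row dated on or before the cutoff.

    The cutoff is `today - buffer_days`. With buffer_days=0 this is a plain
    upper date limit; with buffer_days=7 it also skips the most recent week,
    which may still be revised. Buffering on the observation date is an
    assumption made for this sketch.
    """
    cutoff = today - timedelta(days=buffer_days)
    usable = [row for row in rows if row[0] <= cutoff]
    if not usable:
        raise ValueError(f"no observation on or before {cutoff}")
    return max(usable, key=lambda row: row[0])


# Illustrative rows only, not real data.
rows = [
    (date(2026, 4, 16), 1.0),
    (date(2026, 4, 23), 2.0),
    (date(2026, 4, 30), 3.0),
]
print(latest_usable(rows, today=date(2026, 5, 1)))
print(latest_usable(rows, today=date(2026, 5, 1), buffer_days=7))
```

Raising an error instead of silently clipping makes a corrupt or future-dated DB row visible in live operation, which is the point of a defensive guardrail.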
Full backtest of the current 4-layer system (35/25/25/15 + Phase 9 + Phase 9.1) against 5,039 trading days. Aggregation is renormalized: when Layer B or Layer D is structurally not testable, its weight is removed from the aggregation rather than filled with a constant, and the remaining layers are proportionally rescaled. Two benchmarks are used: circular (VIX > 25) and non-circular (S&P forward drawdown over 20 days ≤ -10 %).
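The renormalized aggregation can be illustrated as follows. This sketch assumes the 35/25/25/15 weights map to Layers A/B/C/D in that order and represents a non-testable layer as `None`; the function name `aggregate` is hypothetical, and the additive overlays and Phase 9 add-ons are not modelled.

```python
# Assumed mapping of the 35/25/25/15 weights to Layers A-D.
WEIGHTS = {"A": 0.35, "B": 0.25, "C": 0.25, "D": 0.15}


def aggregate(scores, weights=WEIGHTS):
    """Weighted mean over the layers that have a score.

    `scores` maps layer name -> score, or None when the layer is not
    testable. Missing layers are dropped and the remaining weights are
    rescaled so they sum to 1, instead of substituting a constant.
    """
    active = {k: w for k, w in weights.items() if scores.get(k) is not None}
    total = sum(active.values())
    if total == 0:
        raise ValueError("no active layers to aggregate")
    return sum(scores[k] * w / total for k, w in active.items())


# Layers B and D not testable: A and C are rescaled to 35/60 and 25/60.
print(aggregate({"A": 40.0, "B": None, "C": 60.0, "D": None}))
```

With B and D removed, the active weight is 0.35 + 0.25 = 0.60, which matches the "60 % of the live system" figure quoted for the backtest.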
Credit Stress vs volatility regime: ρ = +0.46 (threshold < 0.85) ✅ — Credit provides genuine independent information, no double-counting with VIX.
Funding Stress vs RRP sub-score: ρ = +0.43 (threshold < 0.95) ✅ — relocating RRP from USD-System Role into Funding Stress was clean.
Funding Stress vs USD-System Role (slimmed): ρ = -0.72 (threshold < 0.85) ✅ The two KPIs move inversely to each other but stay below the redundancy threshold, so neither is a duplicate of the other.
Layer A vs Layer C: ρ = -0.07 ✅ — structural macro and market mechanics are uncorrelated over the period.
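A redundancy check of this kind can be sketched as below. The report does not say which correlation coefficient ρ denotes or whether the threshold applies to the absolute value; this sketch assumes Pearson correlation and compares |ρ| against the threshold. The names `pearson` and `redundancy_check` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series, skipping NaN pairs."""
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def redundancy_check(x, y, threshold):
    """Return (rho, passed); passed means |rho| is below the threshold."""
    rho = pearson(x, y)
    return rho, abs(rho) < threshold


# Illustrative series only, not the report's data.
rho, passed = redundancy_check([1.0, 2.0, 3.0, 4.0, 5.0],
                               [1.0, -1.0, 1.0, -1.0, 1.0],
                               threshold=0.85)
print(rho, passed)
```

Testing the absolute value means a strongly negative correlation would also be flagged as redundant, since an inverted copy of a signal carries no new information.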
Forward-drawdown AUC is 0.754, which clears the 0.75 threshold with only 60 % of the live system's weight active (Layers A + C). We expect live performance to be higher, because Layers B and D also contribute in production.
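The non-circular benchmark can be sketched in two steps: label each day by its forward drawdown, then score how well the stress score ranks labelled days. The exact drawdown definition is an assumption here (lowest close within the next 20 trading days relative to today's close); the function names are hypothetical.

```python
def forward_drawdown_labels(prices, horizon=20, threshold=-0.10):
    """Label day t with 1 if the lowest close in the next `horizon` days
    is at least 10 % below the close on day t, else 0.

    The last `horizon` days get no label because their forward window
    is incomplete.
    """
    labels = []
    for t in range(len(prices) - horizon):
        window = prices[t + 1 : t + 1 + horizon]
        worst = min(window) / prices[t] - 1.0
        labels.append(1 if worst <= threshold else 0)
    return labels


def roc_auc(scores, labels):
    """Probability that a random positive day scores higher than a random
    negative day, counting ties as one half.

    Pairwise O(n_pos * n_neg) version, fine for a few thousand days.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative prices and scores only, not real data.
prices = [100.0, 101.0, 95.0, 89.0, 90.0, 100.0]
labels = forward_drawdown_labels(prices, horizon=2)
print(labels, roc_auc([10.0, 50.0, 20.0, 30.0], labels))
```

The benchmark is non-circular because the label depends only on future S&P prices, not on VIX or any other input to the score. The AUC computed this way equals the Mann-Whitney U statistic divided by the number of positive-negative pairs, so the two tests measure the same separation.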
On top of the deterministic score thresholds, we fit two probabilistic models to give P(green/yellow/red) for any score.
Train accuracy 72.5 %, test accuracy 36.9 %. Used for calibration tables, not as a primary signal.
Cluster centers at score levels 28.6 / 31.9 / 40.2. Identifies natural regime separations in the historical distribution.
Output: a probability table P(green | score), P(yellow | score), P(red | score) for every integer score from 0 to 100. Used to communicate confidence ranges around the deterministic verdict.
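To show the shape of such a table, here is a deliberately simpler stand-in: smoothed empirical frequencies per integer score. This is not one of the two fitted models described above. The function name `probability_table`, the rounding to integer buckets, and the smoothing constant `alpha` are assumptions made for illustration.

```python
from collections import Counter

REGIMES = ("green", "yellow", "red")


def probability_table(scores, regimes, alpha=1.0):
    """Map every integer score 0..100 to P(regime | score).

    Each historical score is rounded into an integer bucket and regime
    counts are turned into probabilities with additive (Laplace)
    smoothing, so buckets without history fall back to a uniform split.
    """
    counts = {s: Counter() for s in range(101)}
    for score, regime in zip(scores, regimes):
        bucket = min(100, max(0, round(score)))
        counts[bucket][regime] += 1
    table = {}
    for s, c in counts.items():
        total = sum(c[r] for r in REGIMES) + alpha * len(REGIMES)
        table[s] = {r: (c[r] + alpha) / total for r in REGIMES}
    return table


# Illustrative history only, not real data.
table = probability_table([30.2, 29.8, 30.4, 55.0],
                          ["green", "green", "yellow", "red"])
print(table[30])
```

Each row sums to 1, so a reader can look up the deterministic verdict for a score and see next to it how often history at that score level fell into each regime.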
Full transparency on the limits of this validation. The backtest is a strict lower bound on live performance — production has all four layers active.
RSS feeds in the current form with compound-pattern detection only exist from around 2018. Pre-2018: no standardised political event database can be reconstructed. In the v3 backtest, Layer B is removed from aggregation and the remaining layer weights are proportionally renormalised (rather than artificially dragging the score down with the 1.4 baseline). Geopolitical crises are captured only through Layer C (equity stress + crisis overlay), not through political event detection. OOS confirmation via Liberation Day 2025-04-02 (Layer B score 52.6, correctly CRITICAL) holds unchanged from v2.
ENERGY_LAYER_ACTIVE_FROM is intentionally pinned to the Phase-7 go-live, so historical backfills use the 3-layer legacy formula. In the historical test, Layer D is therefore blind to energy crises before this date (2008 oil spike, 2022 Russia gas cut).
FRED reorganised the ICE BofA series in 2023 — the free API no longer serves long history. HY OAS (BAMLH0A0HYM2) was stitched from the v2 snapshot (1996-2026); IG OAS remains limited to the short window. Pre-2023, Credit Stress uses only the HY component (0.7 internal weight); IG falls back to 50.0.
Day-ahead electricity prices only exist from 2014-2017 depending on bidding zone. Pseudo-ATR (Layer D Intraday) requires 5-min bar history that we deliberately do not persist. Both fall back to 50.0.
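The 50.0 fallback used for IG OAS, day-ahead prices and pseudo-ATR can be sketched with the Credit Stress blend as an example. The 0.7 HY weight and the 50.0 fallback come from the text above; treating the IG weight as the 0.3 complement, the `None` convention for missing data, and the function name `credit_stress` are assumptions.

```python
NEUTRAL = 50.0  # fallback value for a sub-score with no usable history


def credit_stress(hy_score, ig_score, hy_weight=0.7):
    """Blend HY and IG sub-scores on a 0-100 scale.

    A missing sub-score (None) is replaced by NEUTRAL instead of being
    dropped. The IG weight is assumed to be 1 - hy_weight.
    """
    hy = NEUTRAL if hy_score is None else hy_score
    ig = NEUTRAL if ig_score is None else ig_score
    return hy_weight * hy + (1.0 - hy_weight) * ig


# Short-window IG history missing: only HY moves the result.
print(credit_stress(80.0, None))
print(credit_stress(80.0, 60.0))
```

Unlike the renormalization used at layer level, a neutral fallback keeps the weights fixed and pulls the blended value toward 50, so a component without data dampens the result rather than amplifying the remaining signal.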
With Phase-11 crisis triggers (VIX acceleration + HY-OAS velocity), v3 now correctly classifies 7 of 8 historical crises as red or dark-red: Lehman 50.5 · Eurocrisis 62.3 · Volmageddon 62.7 · COVID 69.3 · UK LDI 57.9 · SVB 52.3, all without Layer B. Ukraine 37.7 (yellow) remains just below red. Liberation Day 2025-04-02 stays green in the backtest (29.9) because Trump's tariff announcement was primarily a political event, which only Layer B detects. That gap is the clearest evidence that Layer B contributes substantially to the live system.
We do not yet quantify how transaction costs would affect a strategy that uses our scores. Boiling Frog is an information signal, not a trading recommendation — but a full validation should include this.
Boiling Frog v3 is statistically validated: forward-drawdown AUC 0.754 (non-circular, ≥0.75 threshold passed), Mann-Whitney p<10⁻²⁸⁸ (calm vs stress discrimination is highly significant), all Phase-9 anti-redundancy correlations clean. Per-layer verdicts: A APPROVE · B not testable (RSS pre-2018 missing, OOS from v2 holds) · C APPROVE · D not testable (live from 2026-04-18) · Phase-9 add-ons APPROVE · Composite CONDITIONAL APPROVE. The backtest is a strict lower bound on live performance — production has all four layers + three additive overlays active.
We publish what we tested, what passed, and what is still open. Open methodology means open limitations.
Trust is not a feature you ship. It's a number you verify.