VALIDATION & BACKTEST v3

We tested it
before publishing it.

5,039 trading days. 4-layer system. ROC-AUC vs forward drawdown 0.754 (non-circular). Phase 9 anti-redundancy statistically confirmed. Honest about what passed — and what didn't.

This report (v3, 2026-05-01) validates the current 4-layer system (Macro/Politics/Markets/Energy + three additive overlays + Phase 9 Credit/Funding) against 5,039 trading days of market history from 2007 to 2026. It is fully reproducible with the code and data in the public repository. We publish the results as they are, including the layers that were not historically testable.

01 — The data

All tests run on a single dataset committed to Git under `backtest_data/v3/`:

Period: 2007-01-02 to 2026-04-30
Trading days: 5,039
Asset series: S&P 500 · NASDAQ 100 · DAX · STOXX 50 · Gold · EUR/USD (yfinance intraday + EOD)
Macro series (FRED): DGS2/10 · T10Y2Y · SOFR · EFFR · DCPF3M · BAMLH0A0HYM2 · BAMLC0A0CM · RRPONTSYD · WALCL · WTREGEN · DTWEXBGS · T5YIE · VIXCLS
Eurozone money market (ECB): €STR (daily) · Euribor 3M (monthly)
Energy: OVX · Brent BZ=F · WTI CL=F (yfinance) · EIA Crude/Cushing/Distillate/Brent/WTI Spot from 2024-05
Sources: FRED + ECB Data Portal + yfinance + EIA, all freely verifiable

02 — Layer B calibration audit

Verdict: APPROVE

Layer B (politics) was migrated from v1 to v2 with five simultaneous parameter changes. We validated the change against five historical episodes.

Crisis response: peak Layer B score

| Scenario | v1 | v2 | Δ | VIX |
|---|---|---|---|---|
| Lehman 2008 | 35.0 | 53.5 | +18.5 | 80.9 |
| COVID-19 2020 | 45.5 | 47.5 | +2.0 | 82.7 |
| Ukraine War 2022 | 45.4 | 54.5 | +9.1 | 36.5 |
| Tariff War 2025 (Liberation Day) | 28.9 | 52.6 | +23.7 | 52.3 |

Wilcoxon signed-rank (peak crisis response)
p = 0.0625
4 of 4 scenarios show v2 > v1

Power-limited at n=4 — direction is the strongest available evidence.
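
The power limit can be made concrete with a small enumeration. The sketch below computes an exact one-sided Wilcoxon signed-rank p-value from the four peak scores in the table above. It assumes the reported figure comes from a one-sided exact test, which the report does not state. The function name `exact_wilcoxon_greater` is illustrative and not taken from the repository.

```python
from itertools import product

def exact_wilcoxon_greater(x, y):
    """Exact one-sided signed-rank p-value for H1: x > y (assumes no ties or zero differences)."""
    diffs = [a - b for a, b in zip(x, y)]
    # Rank the absolute differences from smallest (rank 1) to largest (rank n).
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 every sign pattern is equally likely; count the ones at least as extreme.
    patterns = list(product([0, 1], repeat=len(diffs)))
    extreme = sum(
        1 for signs in patterns
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return w_plus, extreme / len(patterns)

# Peak Layer B scores from the crisis-response table (Lehman, COVID-19, Ukraine, Tariff War).
v1 = [35.0, 45.5, 45.4, 28.9]
v2 = [53.5, 47.5, 54.5, 52.6]

w_plus, p_value = exact_wilcoxon_greater(v2, v1)
print(w_plus, p_value)
```

With four pairs there are only 16 sign patterns, so 1/16 = 0.0625 is the smallest one-sided p-value the test can produce. All four differences pointing the same way is therefore the strongest result available at this sample size.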

Mann-Whitney U (calm phases, VIX < 15)
p < 0.000001
Median v1 = 14.20 · Median v2 = 1.40

v2 eliminates v1's permanent background stress with overwhelming evidence.

Scenario audit
16 of 16 PASS

All baseline, crisis-response, and out-of-sample checks.

Out-of-sample window

In-sample ends 2025-02-28. First OOS event: Liberation Day, 2 April 2025. v2 classifies it correctly as CRITICAL (score 52.6 vs. VIX 52.3) — independent confirmation.

03 — Look-ahead bias audit

Verdict: CONDITIONAL PASS

We checked every component for look-ahead bias — using future information that wouldn't be available at decision time.

Components free of look-ahead bias
  • Layer A change windows (20-day, strictly backward-looking)
  • Layer C equity stress (rolling max on past prices only)
  • Cross-region correlation (60-day Spearman on historical returns)
  • Volatility regime (VIX 5-day change, past data only)
  • Technical indicators (EMA50, MACD, RSI on historical prices)
  • Layer B decay model (exponential decay on past events)
  • Aggregation formula (arithmetic on past values)
  • Historical calculation path (historical_mode=True correctly set)
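
One way to test the backward-looking property of a component like those listed above is to tamper with future observations and confirm that the value at the decision date does not move. The following is a minimal sketch of that idea; `backward_change`, `centered_change` and `uses_future_data` are hypothetical helpers, not functions from the repository.

```python
def backward_change(series, window=20):
    """Change versus the value `window` observations earlier; None until enough history exists."""
    return [
        series[t] - series[t - window] if t >= window else None
        for t in range(len(series))
    ]

def centered_change(series, window=20):
    """Deliberately flawed counter-example: a centered window peeks `window // 2` steps ahead."""
    half = window // 2
    last = len(series) - 1
    return [series[min(t + half, last)] - series[max(t - half, 0)] for t in range(len(series))]

def uses_future_data(feature_fn, series, t):
    """True if the feature at index t changes when observations after t are altered."""
    baseline = feature_fn(series)[t]
    tampered = series[: t + 1] + [x + 1000.0 for x in series[t + 1:]]
    return feature_fn(tampered)[t] != baseline

prices = [100.0 + 0.5 * i for i in range(60)]  # synthetic series for illustration only
print(uses_future_data(backward_change, prices, t=30))
print(uses_future_data(centered_change, prices, t=30))
```

The same perturbation check can be pointed at any feature function that maps a full series to a per-day value.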
Latent risks (not active in current backtest, recommendations for live)
WALCL / WTREGEN release lag

The Fed balance sheet (WALCL) and the Treasury General Account (WTREGEN) are published weekly on Thursdays, with revision windows of up to 14 days. The fetcher takes the latest DB entry without a 7-day buffer.

fetch_latest() date guard

fred.py and stooq.py do not assert that the latest date is ≤ today. A DB error could allow future data through.

Market data upper limit

_load_market_data() has no upper date limit. Defensive guardrail recommended.

All three are defensive guardrails for live operation. The backtest itself uses frozen CSV snapshots, so it is unaffected. The warnings are published transparently.
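
A guardrail of the kind recommended above could look like the following sketch. It is not the repository's `fetch_latest()`; the function `latest_usable`, its parameters, and the sample values are hypothetical and only illustrate a date ceiling combined with a publication-lag buffer.

```python
from datetime import date, timedelta

def latest_usable(observations, as_of, publication_lag_days=0):
    """Return the newest (obs_date, value) pair that could have been known on `as_of`.

    observations: iterable of (date, value) tuples, in any order.
    publication_lag_days: assumed delay between observation date and availability.
    """
    cutoff = as_of - timedelta(days=publication_lag_days)
    usable = [(d, v) for d, v in observations if d <= cutoff]
    if not usable:
        raise ValueError(f"no observation dated on or before {cutoff.isoformat()}")
    return max(usable, key=lambda item: item[0])

# Placeholder weekly rows; the last one simulates an erroneous future-dated DB entry.
rows = [
    (date(2026, 4, 16), 1.0),
    (date(2026, 4, 23), 2.0),
    (date(2026, 4, 30), 3.0),
    (date(2026, 5, 7), 4.0),
]
today = date(2026, 4, 30)

print(latest_usable(rows, today))                          # date ceiling only
print(latest_usable(rows, today, publication_lag_days=7))  # with a 7-day buffer
```

Filtering on a cutoff rather than taking the last row means a future-dated entry is silently ignored. A stricter variant could raise an error instead, so that the DB problem surfaces.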

04 — Composite validation v3

Verdict: CONDITIONAL APPROVE

Full backtest of the current 4-layer system (35/25/25/15 + Phase 9 + Phase 9.1) against 5,039 trading days. Aggregation is renormalized: when Layer B or Layer D is structurally not testable, its weight is removed from the aggregation rather than filled with a constant, and the remaining layers are proportionally rescaled. Two benchmarks are used: circular (VIX > 25) and non-circular (S&P forward drawdown over 20 days ≤ -10 %).
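
The renormalization step can be sketched as follows. The weights follow the 35/25/25/15 split for Layers A to D; the function name `composite` and the example scores are illustrative, and the sketch leaves out the overlays and Phase 9 components.

```python
LAYER_WEIGHTS = {"A": 0.35, "B": 0.25, "C": 0.25, "D": 0.15}

def composite(scores, weights=LAYER_WEIGHTS):
    """Weighted average over the layers that have a score; missing layers are passed as None."""
    active = {k: w for k, w in weights.items() if scores.get(k) is not None}
    total = sum(active.values())
    if total == 0:
        raise ValueError("no testable layer available")
    # Dividing by the active weight sum rescales the remaining layers proportionally.
    return sum(scores[k] * w / total for k, w in active.items())

# Illustrative scores: B and D are not testable in the historical window.
backtest_scores = {"A": 40.0, "B": None, "C": 20.0, "D": None}
print(round(composite(backtest_scores), 2))
```

With only A and C active, the effective weights become 0.35/0.60 and 0.25/0.60, so the composite stays on the same 0 to 100 scale as the full four-layer score.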

ROC-AUC v3 (full period + walk-forward)

| Configuration | Benchmark | ROC-AUC |
|---|---|---|
| Full (5,039 days) | Forward DD ≤ -10 % (non-circular, primary) | 0.754 ✅ |
| Full (5,039 days) | VIX > 25 (circular) | 0.872 |
| Walk-forward train 2010-2018 (2,347 days) | VIX > 25 | 0.951 |
| Walk-forward test 2019-2026 (1,909 days) | VIX > 25 | 0.775 |

Mann-Whitney, calm (VIX<15) vs stress (VIX≥25), median score: 20.8 vs 32.7 (p<10⁻²⁸⁸)

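
The non-circular benchmark can be illustrated with a short sketch. It assumes that "forward drawdown" means the worst return from a day's close to any close within the following horizon, which the report does not spell out. ROC-AUC is computed from its rank definition, the probability that a positive day scores higher than a negative day. All prices and scores below are toy values.

```python
def forward_drawdown(prices, horizon=20):
    """Worst return from day t's close to any close in the next `horizon` days (None if incomplete)."""
    out = []
    for t in range(len(prices)):
        window = prices[t + 1 : t + 1 + horizon]
        out.append(min(window) / prices[t] - 1.0 if len(window) == horizon else None)
    return out

def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative), counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data with a 2-day horizon to keep the example short.
prices = [100.0, 101.0, 102.0, 90.0, 88.0, 95.0, 96.0, 97.0]
scores = [20.0, 45.0, 30.0, 35.0, 25.0, 22.0, 21.0, 23.0]

dd = forward_drawdown(prices, horizon=2)
pairs = [(s, d <= -0.10) for s, d in zip(scores, dd) if d is not None]
labels = [y for _, y in pairs]
auc = roc_auc([s for s, _ in pairs], labels)
print(labels, auc)
```

Days without a complete forward window are dropped rather than labelled, so the final days of the sample do not enter the benchmark.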
Phase 9 anti-redundancy (Spearman correlations)
  1. Credit Stress vs volatility regime: ρ = +0.46 (threshold < 0.85) ✅. Credit provides genuine independent information, with no double-counting against VIX.

  2. Funding Stress vs RRP sub-score: ρ = +0.43 (threshold < 0.95) ✅. Relocating RRP from USD-System Role into Funding Stress was clean.

  3. Funding Stress vs USD-System Role (slimmed): ρ = -0.72 (threshold < 0.85) ✅. The pair passes the redundancy threshold, although a correlation of this magnitude means the two KPIs are strongly inversely related rather than orthogonal.

  4. Layer A vs Layer C: ρ = -0.07 ✅. Structural macro and market mechanics are uncorrelated over the period.

  5. Forward-drawdown AUC 0.754 ≥ 0.75 with 60 % of the live system active (Layer A + C). Live performance is expected to be higher because Layers B and D contribute actively in production.
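
A redundancy check of this kind can be sketched in a few lines. The example below computes Spearman's ρ as the Pearson correlation of average ranks and compares its absolute value to a threshold. Applying the threshold to |ρ| is an assumption, since the report only states "threshold < 0.85". The helper names are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def is_redundant(x, y, threshold=0.85):
    """Flag a pair whose rank correlation is too strong in either direction (assumed rule)."""
    return abs(spearman(x, y)) >= threshold

# Toy series for illustration only.
kpi_a = [1.0, 2.0, 3.0, 4.0, 5.0]
kpi_b = [3.0, 1.0, 4.0, 2.0, 5.0]
print(spearman(kpi_a, kpi_b), is_redundant(kpi_a, kpi_b))
```

Using the absolute value treats a strongly negative correlation as just as redundant as a strongly positive one, because either direction lets one KPI be predicted from the other.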

05 — Probabilistic regime model

On top of the deterministic score thresholds, we fit two probabilistic models to give P(green/yellow/red) for any score.

Multinomial logistic regression

Train accuracy 72.5 %, test accuracy 36.9 %. Used for calibration tables, not as a primary signal.

Gaussian Mixture Model (3 clusters)

Cluster centers at score levels 28.6 / 31.9 / 40.2. Identifies natural regime separations in the historical distribution.

Output: a probability table P(green | score), P(yellow | score), P(red | score) for every integer score from 0 to 100. Used to communicate confidence ranges around the deterministic verdict.
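
One way such a table could be derived is from the posterior of a one-dimensional Gaussian mixture. The sketch below uses the three reported cluster centers. The mixture weights, the standard deviations, and the mapping of clusters to green/yellow/red by ascending center are all placeholder assumptions, not fitted values from the report.

```python
import math

CENTERS = (28.6, 31.9, 40.2)       # cluster centers from the report
MIX_WEIGHTS = (1 / 3, 1 / 3, 1 / 3)  # hypothetical: equal mixture weights
SIGMAS = (3.0, 3.0, 3.0)             # hypothetical: equal standard deviations
REGIMES = ("green", "yellow", "red")  # assumed mapping by ascending center

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def regime_probabilities(score):
    """Posterior P(cluster | score) via Bayes' rule over the mixture components."""
    dens = [w * gaussian_pdf(score, mu, s) for w, mu, s in zip(MIX_WEIGHTS, CENTERS, SIGMAS)]
    total = sum(dens)
    return tuple(d / total for d in dens)

# One row per integer score from 0 to 100.
table = {score: regime_probabilities(score) for score in range(101)}

for score in (25, 32, 45):
    row = ", ".join(f"P({name})={p:.2f}" for name, p in zip(REGIMES, table[score]))
    print(score, row)
```

Because the centers at 28.6 and 31.9 sit close together, the first two components overlap heavily under these placeholder parameters, so scores in that range get a spread-out probability rather than a sharp verdict.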

06 — What we honestly could not test

Full transparency on the limits of this validation. The backtest is a strict lower bound on live performance — production has all four layers active.

Layer B (25 % weight) historically not testable — dropped + renormalized

RSS feeds in the current form with compound-pattern detection only exist from around 2018. Pre-2018: no standardised political event database can be reconstructed. In the v3 backtest, Layer B is removed from aggregation and the remaining layer weights are proportionally renormalised (rather than artificially dragging the score down with the 1.4 baseline). Geopolitical crises are captured only through Layer C (equity stress + crisis overlay), not through political event detection. OOS confirmation via Liberation Day 2025-04-02 (Layer B score 52.6, correctly CRITICAL) holds unchanged from v2.

Layer D Energy (15 % weight) only active from 2026-04-18

ENERGY_LAYER_ACTIVE_FROM is intentionally pinned to the Phase-7 go-live, so historical backfills use the 3-layer legacy formula. Energy crises before this date (2008 oil spike, 2022 Russia gas cut) are therefore invisible to Layer D in the historical test.

IG OAS (BAMLC0A0CM) only from 2023-05

FRED reorganised the ICE BofA series in 2023 — the free API no longer serves long history. HY OAS (BAMLH0A0HYM2) was stitched from the v2 snapshot (1996-2026); IG OAS remains limited to the short window. Pre-2023, Credit Stress uses only the HY component (0.7 internal weight); IG falls back to 50.0.
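
The fallback can be sketched as a weighted blend in which a missing IG input is replaced by the neutral value 50.0. The 0.7 HY weight and the 50.0 fallback come from the text above. The complementary 0.3 IG weight, the function name `credit_stress`, and the example sub-scores are assumptions.

```python
NEUTRAL_FALLBACK = 50.0

def credit_stress(hy_score, ig_score=None, hy_weight=0.7):
    """Blend HY and IG sub-scores; a missing IG score falls back to the neutral value."""
    ig = NEUTRAL_FALLBACK if ig_score is None else ig_score
    # Assumption: IG takes the remaining weight, i.e. 1 - hy_weight.
    return hy_weight * hy_score + (1.0 - hy_weight) * ig

# Illustrative sub-scores only.
print(credit_stress(80.0))        # pre-2023: IG unavailable, neutral fallback used
print(credit_stress(80.0, 70.0))  # from 2023-05: both components available
```

Under this assumption the fallback adds a constant 15 points before 2023, which pulls high HY readings toward the middle of the scale, so pre-2023 Credit Stress values are damped relative to the later window.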

ENTSO-E pre-2015 + Pseudo-ATR not reconstructible

Day-ahead electricity prices are only available from between 2014 and 2017 onward, depending on the bidding zone. Pseudo-ATR (Layer D Intraday) requires 5-min bar history that we deliberately do not persist. Both fall back to 50.0.

Crisis detection in backtest: 7 of 8 captured (Phase 11)

With Phase-11 crisis triggers (VIX acceleration + HY-OAS velocity), v3 now correctly classifies 7 of 8 historical crises as red or dark-red: Lehman 50.5 · Eurocrisis 62.3 · Volmageddon 62.7 · COVID 69.3 · UK LDI 57.9 · SVB 52.3 — and that's without Layer B. Ukraine 37.7 (yellow) remains just below red. Liberation Day 2025-04-02 stays green in the backtest (29.9) because Trump's tariff announcement was primarily a political event — the most honest confirmation that Layer B contributes substantially to the live system.

Transaction cost analysis

We do not yet quantify how transaction costs would affect a strategy that uses our scores. Boiling Frog is an information signal, not a trading recommendation — but a full validation should include this.

07 — Overall verdict v3

Boiling Frog v3 is statistically validated: forward-drawdown AUC 0.754 (non-circular, ≥0.75 threshold passed), Mann-Whitney p<10⁻²⁸⁸ (calm vs stress discrimination is highly significant), all Phase-9 anti-redundancy correlations clean. Per-layer verdicts: A APPROVE · B not testable (RSS pre-2018 missing, OOS from v2 holds) · C APPROVE · D not testable (live from 2026-04-18) · Phase-9 add-ons APPROVE · Composite CONDITIONAL APPROVE. The backtest is a strict lower bound on live performance — production has all four layers + three additive overlays active.

We publish what we tested, what passed, and what is still open. Open methodology means open limitations.


Trust is not a feature you ship. It's a number you verify.