Dual-Pass Out-of-Sample — Axon Ridge Capital

Abstract

The standard chronological train/validation/test split is an inadequate OOS discipline for crypto perpetual data: the model has seen every regime that preceded the test window, and the test window is necessarily a small, correlated tail. We use two complementary OOS protocols and require any candidate strategy to be positive on both. The chronological-tail protocol (Pass A) measures regime-stress robustness: how does the model perform on the most recent, never-seen segment? The cascade-strict protocol (Pass B) measures steady-state edge: we excise a known regime-defining cascade (February through April 2025) from both training and test and ask whether the model trades anywhere else. Across 26 strategy candidates evaluated under both, only six are positive on both passes, and zero clear our full six-gate promotion checklist.

1 · Why one backtest is not enough

The conventional time-series split — train on the first 70 percent of the series, validate on the next 15 percent, test on the final 15 percent — has two failure modes that crypto perp data makes acute:

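Regime concentration. Crypto perpetuals have a small number of large, persistent regime windows. A single one of these in the test set can flip a model from positive to negative. Because a test window contains only one or two such regimes, there is no central-limit averaging across regimes: the test Sharpe is effectively a draw from a very small sample.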
Implicit conditioning. The validation set sits adjacent to the test set in time. Hyperparameter selection on the validation set implicitly conditions on the regime that follows it. This is a subtle form of look-ahead.

Both problems can be partially mitigated. The dual-pass protocol does so by asking two independent questions of the same model and treating disagreement as information.

2 · The two passes, defined

Figure: The two OOS protocols. Time runs left to right; grey = train, gold = validate, sage = test, stripes = excised.

Pass A holds out the most recent window (Feb–Apr 2026) without disturbing training. Pass B excises the Feb–Apr 2025 cascade from both training and test and asks whether the model has an edge anywhere else.

Pass A — chronological tail. Standard time-series split. Training runs from data start through 2026-01-31. Validation occupies a fixed window inside training (held out for model selection). The test window is 2026-02-01 through 2026-04-01. Pass A asks: does this model trade in the most recent regime?

Pass B — cascade-strict. A known regime-defining window — the February through mid-April 2025 cascade — is excised from training entirely, leaving a non-contiguous training set on either side. The test window is the same excised cascade. Pass B asks: can the model trade without having seen this specific regime in training? This is the “most realistic single backtest” we run.
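The two definitions above can be sketched as index-level splits. This is a minimal illustration, not the `cascade_strict_split` in `src/ml/dataset.py`, whose signature is not shown here. The function names are hypothetical, the cascade bounds (2025-02-01 to 2025-04-15) are an assumed reading of "February through mid-April 2025", and ending Pass B training where Pass A training ends is also an assumption. The validation window is omitted because its placement is not specified.

```python
import pandas as pd

# Pass A dates come from the text; the cascade bounds are assumptions.
PASS_A_TEST_START = pd.Timestamp("2026-02-01")
PASS_A_TEST_END = pd.Timestamp("2026-04-01")
CASCADE_START = pd.Timestamp("2025-02-01")  # assumed
CASCADE_END = pd.Timestamp("2025-04-15")    # assumed ("mid-April")


def pass_a_split(index: pd.DatetimeIndex):
    """Chronological tail: train on everything before the test window."""
    train = index[index < PASS_A_TEST_START]
    test = index[(index >= PASS_A_TEST_START) & (index <= PASS_A_TEST_END)]
    return train, test


def pass_b_split(index: pd.DatetimeIndex):
    """Cascade-strict: drop the cascade from training and test on it."""
    in_cascade = (index >= CASCADE_START) & (index <= CASCADE_END)
    # Assumption: Pass B training stops where Pass A training stops, so the
    # two passes share the same history apart from the excised window.
    train = index[~in_cascade & (index < PASS_A_TEST_START)]
    test = index[in_cascade]
    return train, test
```

With any timestamp index that spans the cascade, `pass_b_split` returns a training set with a gap where the cascade sat, which is the non-contiguous shape described above.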

3 · The result: only six strategies pass both

We applied both passes to 26 candidate strategy specifications. The scatter below plots Pass A Sharpe (x-axis) against Pass B Sharpe (y-axis). The upper-right quadrant — both positive — is sparsely populated.

Figure: Pass A vs Pass B annualised Sharpe across 26 strategies. Each point is one strategy candidate; 30 seeds, 4 bps per leg, 8 017 bars in Pass B.

Six of 26 candidates sit in the “both positive” upper-right quadrant (highlighted). STRAT-04b has the largest gap between the two passes: strong on Pass B (+1.51) and catastrophic on Pass A (−4.24). STRAT-48 is the cleanest dual-positive candidate. STRAT-08 sits near the Pass A axis: it is mostly a regime bet.
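The quadrant reading of the scatter is a sign test on each axis. The sketch below applies it to the three candidates whose Sharpe pairs are quoted in this paper; the function name and labels are illustrative choices, not part of any internal tooling.

```python
# (Pass A Sharpe, Pass B Sharpe) for the three candidates quoted in the text.
SHARPES = {
    "STRAT-08": (3.13, 0.87),
    "STRAT-04b": (-4.24, 1.51),
    "STRAT-48": (2.93, 1.91),
}


def quadrant(pass_a: float, pass_b: float) -> str:
    """Label a candidate by the sign of its Sharpe on each pass."""
    if pass_a > 0 and pass_b > 0:
        return "both positive"
    if pass_a > 0:
        return "Pass A only"
    if pass_b > 0:
        return "Pass B only"
    return "neither"


labels = {name: quadrant(a, b) for name, (a, b) in SHARPES.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

By sign alone STRAT-08 counts as both positive. Calling it a regime bet is a judgement about how far its Pass B Sharpe falls below its Pass A Sharpe, which a sign test does not capture.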

4 · What each pass catches

The interesting cases are the strategies that pass one but fail the other. They are the most informative data points the dual-pass protocol produces.

Pass A positive, Pass B fragile — STRAT-08

The seven-architecture consensus blend (STRAT-08) is positive on Pass A (+3.13 Sharpe) but only marginally positive on Pass B (+0.87). The implication is that much of its Pass A performance is concentrated in market behaviour correlated with the cascade window: when that window is excised from training, the model loses most of its edge. We treat strategies of this shape as regime bets and decline to promote them.

Pass B positive, Pass A negative — STRAT-04b

The opposite shape: the LightGBM lambdarank inverse-7 strategy (STRAT-04b) is strongly positive on Pass B (+1.51) and deeply negative on Pass A (−4.24). The implication is that the model has an edge in “normal” cross-sectional ranking on the inverse cluster but cannot handle the recent regime. We deploy this only as part of a portfolio with regime-orthogonal sleeves, never standalone — see Paper № 07.

Both positive — STRAT-48

Of the dual-positive candidates, the static eight-coin consensus basket (STRAT-48) is the one that sits most cleanly above zero on both passes (+2.93 Pass A, +1.91 Pass B). It is also model-free, which we discuss in detail in Paper № 05.

5 · The full promotion gate is stricter than “both positive”

Dual-pass positivity is necessary but not sufficient. Our promotion checklist requires six gates:

Pass A Sharpe > +0.0
Pass B Sharpe > +1.0
Pass B 10th-percentile Sharpe > +0.5 (seed stability)
Honest max drawdown ÷ 24 < 15 percent (overlap-corrected)
Mistake-#21 ρ > +0.5 (training metric tracks Sharpe across folds)
Plausibility flags clean (no OVER_CAPITAL, no EXTREME_SR)
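The six gates can be written as one boolean report per candidate. The following is a sketch under stated assumptions: the `Candidate` fields and function names are hypothetical, the 10th-percentile Sharpe is taken across per-seed Pass B Sharpes, the drawdown gate is read as raw max drawdown divided by 24 compared against 0.15, and "flags clean" is read as neither named flag being set. The demo numbers are made up for illustration and describe no real strategy.

```python
from dataclasses import dataclass, field

import numpy as np

BLOCKING_FLAGS = {"OVER_CAPITAL", "EXTREME_SR"}


@dataclass
class Candidate:
    name: str
    pass_a_sharpe: float
    pass_b_sharpe: float
    pass_b_seed_sharpes: list  # one Pass B Sharpe per seed
    honest_max_dd: float       # fraction, before the /24 overlap correction
    mistake21_rho: float
    flags: set = field(default_factory=set)


def gate_report(c: Candidate) -> dict:
    """Return one pass/fail entry per promotion gate."""
    p10 = float(np.percentile(c.pass_b_seed_sharpes, 10))
    return {
        "pass_a_sharpe > 0.0": c.pass_a_sharpe > 0.0,
        "pass_b_sharpe > 1.0": c.pass_b_sharpe > 1.0,
        "pass_b_p10_sharpe > 0.5": p10 > 0.5,
        "honest_max_dd / 24 < 0.15": c.honest_max_dd / 24 < 0.15,
        "mistake21_rho > 0.5": c.mistake21_rho > 0.5,
        "flags clean": not (c.flags & BLOCKING_FLAGS),
    }


def promotable(c: Candidate) -> bool:
    """A candidate is promotable only if every gate passes."""
    return all(gate_report(c).values())


# Made-up numbers, for illustration only.
demo = Candidate("DEMO", 0.4, 1.3, [0.6, 0.9, 1.2, 1.5, 1.8], 1.2, 0.7)
print(gate_report(demo))
```

Returning the full report rather than a single boolean keeps the failing gate visible, which matters when most candidates fail on exactly one or two of the six.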

As of the 2026-05-15 audit, zero of the 26 candidates clear all six simultaneously. STRAT-48 and STRAT-04b are the closest to a full clear and are the current paper-validation roster.

Honest framing

“Six positive on both passes” is the strongest defensible claim from this audit. It is not “six validated strategies.” The gap between both-pass-positive and ready-for-live-capital is exactly the gap between an interesting candidate and a deployable one — and we keep it explicit because the alternative is to dress up “promising” as “promoted.”

6 · Why we do not use only purged CV

The purged combinatorial cross-validation (CPCV) framework of López de Prado^[1] is the gold standard for IID-violating financial data and we use it heavily upstream: every architecture in our 28-arm screen has CPCV runs in `research/scoreboards/consolidated_arms_table.md`. What CPCV alone does not catch is the regime-stress failure mode that Pass A targets: CPCV test folds are drawn from across the whole history, so a model that is good on average across regimes but catastrophic in the most recent one can still post a respectable CPCV Sharpe. The chronological tail is what reveals that. The cascade excision is what reveals whether the model's edge is regime-specific. The three protocols (CPCV, Pass A, and Pass B) are complementary, not substitutable.
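To make the contrast concrete, here is a simplified sketch of how combinatorial purged splits are generated. It is not our CPCV implementation: the function name and defaults are hypothetical, and purging is approximated by dropping a fixed number of bars on either side of each test group, whereas López de Prado purges by label overlap and adds a separate embargo.

```python
from itertools import combinations

import numpy as np


def cpcv_splits(n_obs: int, n_groups: int = 6, n_test_groups: int = 2,
                purge: int = 0):
    """Yield (train_idx, test_idx) for every choice of test groups.

    Assumes n_obs >= n_groups so that every group is non-empty.
    """
    groups = np.array_split(np.arange(n_obs), n_groups)
    for test_ids in combinations(range(n_groups), n_test_groups):
        test_idx = np.concatenate([groups[g] for g in test_ids])
        keep = np.ones(n_obs, dtype=bool)
        for g in test_ids:
            # Drop the test group plus `purge` bars on each side of it.
            lo = max(groups[g][0] - purge, 0)
            hi = min(groups[g][-1] + purge, n_obs - 1)
            keep[lo:hi + 1] = False
        yield np.flatnonzero(keep), test_idx


splits = list(cpcv_splits(600, n_groups=6, n_test_groups=2, purge=5))
print(len(splits))  # number of train/test combinations
```

With six groups and two test groups per split, the most recent group is in the test set for only a minority of the splits, so its contribution to the averaged Sharpe is diluted by the older groups. That dilution is the reason a chronological-tail pass is run alongside CPCV.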

7 · A note on Deflated Sharpe Ratio

Bailey & López de Prado^[2] derive a deflation of the Sharpe ratio that accounts for the number of trials taken: the more candidates you screen, the higher the bar for any one of them to be real. Our screen has roughly 616 model cards in the consolidated arms table, which is a heavy multiple-testing burden. The combined positivity gate on Pass A and Pass B is partly an implicit deflation: a strategy that passes both is much less likely to be a multiple-testing artefact than a strategy that passes only one.
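For reference, the explicit deflation can be sketched as follows, using the expected-maximum-Sharpe approximation and the probabilistic Sharpe ratio form from Bailey & López de Prado. This is an illustration, not part of our gate: the function names are hypothetical, Sharpe ratios here are per bar rather than annualised, and `sharpe_var` (the variance of Sharpe estimates across trials) is an input we do not report in this paper.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sharpe_var: float) -> float:
    """Expected best Sharpe among n_trials (>= 2) with zero true edge."""
    a = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * e))
    return sqrt(sharpe_var) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sr: float, n_obs: int, n_trials: int, sharpe_var: float,
                    skew: float = 0.0, kurt: float = 3.0) -> float:
    """Probability that the true Sharpe exceeds the multiple-testing bar.

    sr is the per-bar (non-annualised) Sharpe; kurt is raw kurtosis,
    so 3.0 corresponds to normally distributed returns.
    """
    sr0 = expected_max_sharpe(n_trials, sharpe_var)
    denom = sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    return _STD_NORMAL.cdf((sr - sr0) * sqrt(n_obs - 1) / denom)


# 616 trials and 8017 bars are taken from the text; the Sharpe value and
# sharpe_var are made-up inputs for illustration.
dsr_many = deflated_sharpe(0.05, n_obs=8017, n_trials=616, sharpe_var=0.0004)
dsr_few = deflated_sharpe(0.05, n_obs=8017, n_trials=26, sharpe_var=0.0004)
print(dsr_many, dsr_few)
```

Holding everything else fixed, raising the trial count raises the benchmark Sharpe and lowers the deflated value, which is the multiple-testing penalty described above.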

Sources & references

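López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Ch. 7 (purged cross-validation and embargo) and Ch. 12 (combinatorial purged cross-validation).
Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5), 94–107.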
Axon Ridge internal — `research/experiments/results/STRAT_scoreboard_20260515.md`
Axon Ridge internal — `research/experiments/results/STRAT-04b_lgbm_rank_solo_inverse7_20260513.md`
Axon Ridge internal — `research/experiments/results/STRAT_review_vs_live_2026-05-18.md`
Axon Ridge internal — `src/ml/dataset.py` — `cascade_strict_split`