Abstract

A single backtest is a hypothesis. A grid of backtests with a maximum taken over it is a trap: the winning cell's Sharpe is an order statistic of noise plus edge, and with enough cells the noise term dominates. An internal audit of our own earlier promotions (Phase 1.7, spring 2026) found exactly this — strategies promoted on backtest numbers that were statistically indistinguishable from selection on noise, followed by paper performance that regressed to approximately zero. Grid A/B/C is the protocol we rebuilt on the other side of that audit. Its design principle is that the researcher is the adversary: every degree of freedom is either locked before evaluation or charged against a multiple-testing budget, and the final gate is a causal, chronological test that cannot be iterated against without spending the budget.

1 · Why the previous pipeline failed

The earlier validation pipeline (Paper № 04) already used dual-pass out-of-sample windows and walk-forward blocks. It still promoted overfit strategies, for three reasons:

  1. Best-seed selection. Training the same cell with N seeds and deploying the best-looking one converts seed noise directly into fake edge. The audit found promoted “edges” that were seed lottery tickets.
  2. Gate shopping. When a candidate failed one gate configuration, a slightly different configuration was tried. Each retry is another draw from the noise distribution; none was charged for.
  3. CPCV Sharpe treated as tradeable. Under combinatorial purged cross-validation (CPCV), folds routinely train on data that falls chronologically after their test block. That is by design, since it maximises data efficiency for signal detection, but it means a CPCV Sharpe is not a sequence anyone could have traded. We reported CPCV numbers as if they were.
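
The seed-lottery effect in item 1 can be illustrated with a small simulation. This is a sketch under stated assumptions, not the audit's actual procedure: every "seed" is an independent zero-edge return stream (real seeds trained on the same data would be correlated), daily returns are Gaussian, and 365 bars per year is assumed. The function name and parameters are hypothetical.

```python
import numpy as np

def best_vs_median_sharpe(n_seeds=15, n_days=730, n_trials=500, rng_seed=0):
    """Average best-seed and median-seed annualised Sharpe when true edge is zero."""
    rng = np.random.default_rng(rng_seed)
    # Zero-mean daily returns, shape (trials, seeds, days): pure noise by construction.
    rets = rng.normal(0.0, 0.01, size=(n_trials, n_seeds, n_days))
    # Annualised Sharpe per (trial, seed); 365 bars/year is an assumption.
    sharpe = rets.mean(axis=2) / rets.std(axis=2, ddof=1) * np.sqrt(365)
    best = sharpe.max(axis=1)            # what best-seed selection would report
    median = np.median(sharpe, axis=1)   # what a seed-median summary would report
    return float(best.mean()), float(median.mean())

if __name__ == "__main__":
    best, median = best_vs_median_sharpe()
    print(f"mean best-of-15 Sharpe: {best:.2f}, mean median Sharpe: {median:.2f}")
```

With no edge anywhere in the simulated data, the best-of-15 average sits well above zero while the median stays near zero, which is the gap that best-seed selection turns into apparent performance.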

2 · The funnel

Grid A — SIGNAL   (CPCV, is the ranking structure real?)
    → Grid B — COST   (locked predictions, does it survive execution?)
        → Grid C — CAUSALITY   (15-seed expanding walk-forward, could you
                                have traded it?)
            → PAPER   (live data, simulated fills — the only gate
                       that spends no multiple-testing budget)

Grid A — signal structure under CPCV

Candidates train under combinatorial purged cross-validation with embargoes sized to the prediction horizon, on horizons h=24 and h=48. The locked gate is fold consistency ≥ 60% and strong-fold share ≥ 20% — the signal must be present in most folds, not carried by one regime. Grid A deliberately uses the data-efficient, non-causal split: its job is detection, not tradeability. Nothing from Grid A is ever quoted as a performance claim.
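
A minimal sketch of the split structure and the gate check follows. Several details are assumptions, not taken from the protocol: the group count, the symmetric gap used as a stand-in for purging plus embargo, the reading of "fold consistency" as the share of folds with positive Sharpe, and the `strong_threshold` that defines a strong fold. Only the 60% and 20% cut-offs come from the text.

```python
from itertools import combinations
import numpy as np

def cpcv_splits(n_samples, n_groups=6, n_test_groups=2, embargo=24):
    """Yield (train_idx, test_idx) for every combination of test groups.

    Simplification: training bars within `embargo` bars of a test block,
    on either side, are dropped (a stand-in for purging plus embargo).
    """
    groups = np.array_split(np.arange(n_samples), n_groups)
    for combo in combinations(range(n_groups), n_test_groups):
        test_idx = np.concatenate([groups[g] for g in combo])
        keep = np.ones(n_samples, dtype=bool)
        for g in combo:
            lo, hi = groups[g][0], groups[g][-1]
            keep[max(0, lo - embargo): min(n_samples, hi + embargo + 1)] = False
        yield np.flatnonzero(keep), test_idx

def grid_a_gate(fold_sharpes, strong_threshold=1.0,
                min_consistency=0.60, min_strong_share=0.20):
    """Hypothetical gate: enough positive folds AND enough strong folds."""
    s = np.asarray(fold_sharpes, dtype=float)
    consistency = np.mean(s > 0)                  # assumed definition
    strong_share = np.mean(s >= strong_threshold) # assumed definition
    return bool(consistency >= min_consistency and strong_share >= min_strong_share)
```

Most of the generated splits contain training indices larger than some test index, which is the non-causal property described above: useful for detection, unusable as a performance claim.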

Grid B — execution cost on locked predictions

Grid A survivors have their out-of-sample predictions frozen. Grid B re-scores those locked predictions under execution overlays — entry bands, sizing rules, cost tiers at 0/4/8 bps per leg — without any retraining. Because the predictions cannot change, sweeping the execution layer here is cheap in multiple-testing terms: we are choosing how to trade a fixed signal, not reselecting the signal. The locked gate is net SR > 0 at the live fee tier.
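
As an illustration, re-scoring a fixed position path across cost tiers might look like the sketch below. It assumes a linear cost charged on absolute weight change (one interpretation of "per leg"), positions already derived from the frozen predictions, and 365 bars per year. The function names are hypothetical, and the live fee tier is left as a parameter because the text does not state it.

```python
import numpy as np

def net_sharpe(positions, asset_returns, cost_bps_per_leg, bars_per_year=365):
    """Annualised net Sharpe of a fixed position path under a linear cost.

    positions:     (T, N) weights held over bar t, derived from locked predictions
    asset_returns: (T, N) asset returns realised over bar t
    """
    positions = np.asarray(positions, dtype=float)
    asset_returns = np.asarray(asset_returns, dtype=float)
    gross = (positions * asset_returns).sum(axis=1)
    # Turnover per bar: absolute weight change, including the first entry from flat.
    prev = np.vstack([np.zeros((1, positions.shape[1])), positions[:-1]])
    turnover = np.abs(positions - prev).sum(axis=1)
    net = gross - turnover * cost_bps_per_leg * 1e-4
    sd = net.std(ddof=1)
    return float(net.mean() / sd * np.sqrt(bars_per_year)) if sd > 0 else float("nan")

def grid_b_rescore(positions, asset_returns, tiers_bps=(0, 4, 8)):
    """Score the same positions at each cost tier; nothing is retrained."""
    return {bps: net_sharpe(positions, asset_returns, bps) for bps in tiers_bps}

def grid_b_gate(scores, live_tier_bps):
    """Gate from the text: net SR > 0 at the live fee tier (tier value assumed)."""
    return scores[live_tier_bps] > 0
```

The model never appears in this code. Only the mapping from fixed positions to net returns varies, which is why the sweep reselects the execution layer and not the signal.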

Grid C — the causal test

The surviving (signal, construction) pair is retrained as a monthly expanding walk-forward: train through month M−1, predict month M, then step forward, so no fold ever sees the future. This is run under 15 pre-registered seeds; the seed list is written into the methodology document before Grid C starts. The locked gate is a distributional one:

The deploy object is the median prediction across all 15 seeds, never the best seed. Seed-median deployment means the number we quote is an estimate of the family's central tendency, not the maximum of 15 draws.
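
The expanding walk-forward and the seed-median deploy object can be sketched as follows. This is an illustration under assumptions: `fit_predict` is a hypothetical callable standing in for whatever model is trained, the 12-month warm-up is arbitrary, and timestamps are assumed to be timezone-naive.

```python
import numpy as np

def expanding_monthly_splits(timestamps, min_train_months=12):
    """Yield (train_mask, test_mask): train through month M-1, test on month M."""
    months = np.asarray(timestamps, dtype="datetime64[ns]").astype("datetime64[M]")
    for m in np.unique(months)[min_train_months:]:
        yield months < m, months == m

def seed_median_prediction(fit_predict, X, y, timestamps, seeds):
    """Median prediction across seeds on the causal out-of-sample bars.

    fit_predict(X_train, y_train, X_test, seed) -> predictions for X_test
    (hypothetical signature). Warm-up bars with no prediction stay NaN.
    """
    preds = np.full((len(seeds), len(y)), np.nan)
    for train, test in expanding_monthly_splits(timestamps):
        for i, seed in enumerate(seeds):
            preds[i, test] = fit_predict(X[train], y[train], X[test], seed)
    covered = ~np.isnan(preds).any(axis=0)
    out = np.full(len(y), np.nan)
    # Median across the seed axis, never the maximum or a hand-picked seed.
    out[covered] = np.median(preds[:, covered], axis=0)
    return out
```

Because the seed list is an argument and not something chosen after looking at results, the same function can be run against a list that was written down in advance.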

3 · What the funnel caught in its first full run

The first family to pass all three gates end-to-end was a pooled cross-sectional LightGBM ranker on a 50-coin universe (Paper № 12). Grid C passed with 15/15 seeds positive. But the run also produced the protocol's first honest surprise: the construction ranking chosen at Grid B did not transfer to the causal panel. The smoothed K=12 book that won Grid B by +0.55 SR on CPCV-ensembled predictions lost roughly 0.5 SR against the plain K=8 book on the causal walk-forward. The reason is that the CPCV panel averages 45 fold-models per bar and is smoother than any single causal model, so the extra smoothing that helped there over-dampens the deployable object.

Under the old culture, the response would have been a third sweep to find the “true” construction. Under the funnel's rules, that iteration is forbidden — the multiple-testing budget for construction was already spent at Grid B. Both constructions were sent to paper trading side by side, and paper will settle it. That decision — stop backtesting, let live data adjudicate — is the protocol working as designed.

4 · Multiplicity accounting

The funnel's degrees of freedom are enumerated up front: signal recipe, hyperparameter variant, coin or universe, direction, and execution overlay. Each is locked at a declared gate. Anything swept afterwards is flagged in the result document as post-hoc, and near-ties inside a sweep are resolved by robustness dominance (cost stability, drawdown, turnover, chronological-half stability) rather than by the point winner. A 0.03 SR gap on two years of daily bars is far inside the bootstrap standard error of ~0.4 and carries no information.
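
One way to put a standard error on a Sharpe gap between two candidate books is a paired block bootstrap, sketched below. The block length, the circular resampling scheme, and the 365 bars per year are assumptions; the text does not say which bootstrap variant was used. The resulting number also depends on how correlated the two books are, so this sketch is not expected to reproduce the ~0.4 figure.

```python
import numpy as np

def sharpe(r, bars_per_year=365):
    """Annualised Sharpe of a 1-D return series."""
    return r.mean() / r.std(ddof=1) * np.sqrt(bars_per_year)

def bootstrap_sharpe_gap_se(ret_a, ret_b, n_boot=2000, block=20,
                            bars_per_year=365, rng_seed=0):
    """Standard error of SR(a) - SR(b) via a paired circular block bootstrap."""
    ret_a = np.asarray(ret_a, dtype=float)
    ret_b = np.asarray(ret_b, dtype=float)
    T = len(ret_a)
    rng = np.random.default_rng(rng_seed)
    n_blocks = -(-T // block)            # ceiling division
    offsets = np.arange(block)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        starts = rng.integers(0, T, size=n_blocks)
        # The same resampled bars are used for both books, so pairing is preserved.
        idx = ((starts[:, None] + offsets[None, :]) % T).ravel()[:T]
        gaps[i] = sharpe(ret_a[idx], bars_per_year) - sharpe(ret_b[idx], bars_per_year)
    return float(gaps.std(ddof=1))

def is_near_tie(gap, se):
    """Treat a gap smaller than one bootstrap standard error as a tie."""
    return abs(gap) < se
```

When `is_near_tie` returns true, the point winner carries no weight and the choice falls to the robustness criteria listed above.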

Design principle

Backtest iteration is a spending account, not a free resource. Every query against the historical data buys information and costs credibility. The funnel's job is to make that ledger explicit — and paper trading is the only unlimited account, because live data cannot be overfit retroactively.

5 · Known limitations

Sources & references

  1. Axon Ridge — GRID-50: the first family through the funnel (Paper № 12). /research/grid50-pooled-ranker.html
  2. Axon Ridge — Five ways Sharpe gets warped (Paper № 07). /research/sharpe-pitfalls.html
  3. Axon Ridge — The earlier pipeline this supersedes (Paper № 04). /research/dual-pass-oos.html
  4. Axon Ridge internal — Specialist promotion methodology, locked 2026-06-08.
  5. López de Prado, M. (2018). Advances in Financial Machine Learning — CPCV, embargoes, and the deflated Sharpe ratio.