Backtest → Paper → Live: Where the Edge Evaporates

Abstract

Every systematic trader knows performance decays along the chain from backtest to paper to live, and most attribute the largest loss to the last hop: fills, fees, slippage, adverse selection. We instrumented all three stages with a shared ledger design — a synthetic paper ledger, a live model-mirror ledger (identical logic, live data), and the real on-chain fills — and decomposed the decay for our first generation of live directional specialists. The result inverted the folk wisdom: paper-to-live losses were under a dollar on a $25-notional test; the backtest-to-paper hop, driven by selection at a performance peak, accounted for essentially all of the damage. We describe the decomposition, the mean-reversion mechanics of promoting at a peak, and the two structural changes that followed.

1 · Three ledgers, one truth test

For every live strategy we maintain three parallel series on identical sizing:

PAPER — the synthetic ledger: model predictions applied to live market data with simulated fills at mid and modelled cost. No exchange contact.
MIRROR — the live model-mirror: the production scorer's own accounting of what its positions should be worth.
REAL — actual on-chain fills, mark-to-market, from exchange data.

The design gives a clean attribution: if REAL diverges from PAPER while PAPER stays healthy, the problem is execution. If PAPER itself turns, the problem is the signal — no amount of execution engineering will fix it.
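The attribution rule above can be sketched in a few lines. This is a minimal illustration, not our production diagnostic: the function name and the dollar tolerance are hypothetical, and the inputs are assumed to be cumulative P&L for the same window on identical sizing.

```python
def attribute_decay(paper_pnl: float, real_pnl: float, tolerance: float) -> list[str]:
    """Label where a loss comes from, given cumulative P&L in dollars.

    `tolerance` is a hypothetical threshold (in dollars) below which a gap
    between PAPER and REAL is treated as noise rather than execution cost.
    """
    causes = []
    if paper_pnl < 0:
        # The synthetic ledger never touches the exchange, so a loss here
        # cannot come from fills, fees or slippage: the signal has turned.
        causes.append("signal")
    if paper_pnl - real_pnl > tolerance:
        # Real money lags the synthetic ledger by more than the tolerance:
        # the shortfall was introduced between decision and fill.
        causes.append("execution")
    return causes


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the ledgers in this paper.
    print(attribute_decay(paper_pnl=-3.0, real_pnl=-2.8, tolerance=1.0))
    print(attribute_decay(paper_pnl=4.0, real_pnl=1.5, tolerance=1.0))
```

Returning a list rather than a single label allows for the case where both the signal and the execution are at fault in the same window.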

2 · The case study: promoted at the peak

Two directional specialists (GRU-family short models on NEAR and ENA) were promoted to live in early June 2026 on the strength of roughly three months of strong paper performance. Segmenting their paper ledgers around the live window:

Specialist	Pre-live (paper)	During live (paper)	Days
NEAR short	+159.6% cum · +112 bps/day	−11.6% · −77 bps/day	85 pre · 16 during
ENA short	+131.7% cum · +100 bps/day	−15.8% · −86 bps/day	84 pre · 20 during
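The bps/day figures in the table are consistent with an average daily log return computed from the cumulative figure and the day count. That convention is inferred from the numbers themselves rather than stated anywhere, so treat the sketch below as a sanity check under that assumption; the function name is illustrative.

```python
import math


def daily_log_rate_bps(cum_return_pct: float, days: int) -> float:
    """Average daily log return, in basis points, implied by a cumulative return."""
    growth = 1.0 + cum_return_pct / 100.0   # e.g. +159.6% -> 2.596
    return math.log(growth) / days * 10_000.0


if __name__ == "__main__":
    rows = [
        ("NEAR short, pre-live", 159.6, 85),
        ("NEAR short, during live", -11.6, 16),
        ("ENA short, pre-live", 131.7, 84),
        ("ENA short, during live", -15.8, 20),
    ]
    for label, cum_pct, days in rows:
        print(f"{label}: {daily_log_rate_bps(cum_pct, days):+.1f} bps/day")
```

Rounded to the nearest basis point, this reproduces the +112, −77, +100 and −86 bps/day shown in the table.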

The paper strategy, which never touches the exchange, swung from roughly +100 bps/day to roughly −80 bps/day at almost exactly the moment of promotion. Slippage cannot explain a synthetic ledger turning negative. Over the live overlap, all three ledgers lost money, and for NEAR they agreed closely in size:

Overlap total ($25 notional)	PAPER	MIRROR	REAL
NEAR short	−$2.64	−$2.91	−$2.09
ENA short	−$3.53	−$6.62	−$1.81

For NEAR, paper, mirror and real money landed within $0.82 of one another and moved in lockstep day by day. For ENA the spread was wider ($4.81, with MIRROR the outlier), but REAL still lost less than PAPER. In neither case did real money underperform the synthetic ledger, so execution (fills, fees, the exchange) was not the problem. The promotion decision was: we selected a strategy because it had just produced an extraordinary run, which is precisely when a mean-reverting performance series is most likely to disappoint. The +100 bps/day pre-live run was itself the kind of number our own plausibility rules now treat as a warning, not a credential.
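The selection effect can be illustrated with a small simulation. It is a deliberately extreme sketch, assuming every candidate has zero true edge and Gaussian daily returns; the candidate count, volatility and trial count are hypothetical parameters, not our data. It shows how promoting the best recent performer manufactures a strong pre-promotion record that then reverts toward zero. It does not by itself reproduce a swing to negative, which needs genuine mean reversion or a regime change on top of selection.

```python
import random
import statistics


def promote_at_peak(n_strategies: int = 40, pre_days: int = 85, live_days: int = 20,
                    daily_vol_bps: float = 300.0, trials: int = 100, seed: int = 0):
    """Average bps/day of the promoted strategy, before and after promotion.

    All strategies are pure noise (zero edge). In each trial the one with
    the best pre-promotion mean is "promoted" and its next `live_days` of
    returns are recorded.
    """
    rng = random.Random(seed)
    pre_means, live_means = [], []
    for _trial in range(trials):
        best_pre, best_live = None, None
        for _strategy in range(n_strategies):
            pre = statistics.fmean(rng.gauss(0.0, daily_vol_bps) for _ in range(pre_days))
            live = statistics.fmean(rng.gauss(0.0, daily_vol_bps) for _ in range(live_days))
            if best_pre is None or pre > best_pre:
                # Selection sees only the pre-promotion record.
                best_pre, best_live = pre, live
        pre_means.append(best_pre)
        live_means.append(best_live)
    return statistics.fmean(pre_means), statistics.fmean(live_means)


if __name__ == "__main__":
    pre, live = promote_at_peak()
    print(f"promoted strategy, pre-promotion:  {pre:+.1f} bps/day")
    print(f"promoted strategy, post-promotion: {live:+.1f} bps/day")
```

The pre-promotion figure is large purely because it is the maximum of many noisy draws; the post-promotion figure is an ordinary draw from the same zero-mean distribution.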

3 · Measuring execution properly anyway

Absolving execution required measuring it. Comparing fills against mid-price at decision time across the live book put realised cost close to the modelled taker assumption of 4 bps per leg: in line with the model, not the hidden tax that folk wisdom expects, at this size on Hyperliquid majors. Two genuine execution lessons did surface, both operational rather than statistical:

Delistings are an execution event. When TON was delisted, a stale prediction kept a phantom leg in the book until a staleness guard was added: any signal not refreshed within its horizon is force-flattened.
“Closing” must mean closing. A leg displayed as closing while the reduce order sat unfilled revealed a gap between intent and order state that the dashboard now surfaces explicitly.
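Both the cost measurement and the staleness guard described in this section are simple to express. The sketch below is illustrative only: the `Signal` fields, function names and timestamps are hypothetical, and the cost function measures price slippage against decision-time mid without adding exchange fees.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    symbol: str
    target_position: float  # signed size the model wants to hold
    refreshed_at: float     # unix seconds of the last prediction
    horizon_s: float        # how long a prediction remains valid


def realised_cost_bps(side: str, fill_price: float, decision_mid: float) -> float:
    """Slippage of a fill versus mid at decision time; positive is worse than mid."""
    sign = 1.0 if side == "buy" else -1.0
    return sign * (fill_price - decision_mid) / decision_mid * 10_000.0


def apply_staleness_guard(signals: list[Signal], now: float) -> dict[str, float]:
    """Return target positions, force-flattening any signal past its horizon."""
    targets = {}
    for s in signals:
        stale = (now - s.refreshed_at) > s.horizon_s
        # A stale prediction must not keep a leg alive, e.g. after a delisting.
        targets[s.symbol] = 0.0 if stale else s.target_position
    return targets


if __name__ == "__main__":
    print(realised_cost_bps("buy", fill_price=100.04, decision_mid=100.0))
    book = [
        Signal("TON", -1.0, refreshed_at=0.0, horizon_s=3600.0),
        Signal("NEAR", -1.0, refreshed_at=7000.0, horizon_s=3600.0),
    ]
    print(apply_staleness_guard(book, now=7200.0))
```

Keeping the guard as a pure function of the signal list and the clock makes it easy to test against a delisting scenario without touching the exchange.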

4 · What changed structurally

Directional single-coin bets were retired from live. The failure mode — a beta-shaped edge decaying — is cancelled by construction in a market-neutral K-of-N book, which is now the only shape we run live (Paper № 12).
Promotion criteria stopped rewarding hot streaks. The Grid A/B/C funnel (Paper № 11) promotes on seed-median causal walk-forward distributions, not on recent paper performance — the exact statistic that seduced us here is no longer an input.
Every live strategy keeps its three-ledger instrumentation, so the next decay gets attributed in hours, not weeks.
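Paper № 11 defines the actual funnel. Purely to illustrate why a seed-median statistic resists hot streaks, here is a minimal sketch; the data layout, function names and threshold are assumptions and do not describe the Grid A/B/C implementation.

```python
import statistics


def seed_median_score(fold_returns_by_seed: dict[int, list[float]]) -> float:
    """Median across seeds of each seed's mean out-of-sample fold return (bps/day)."""
    per_seed = [statistics.fmean(folds) for folds in fold_returns_by_seed.values()]
    return statistics.median(per_seed)


def should_promote(fold_returns_by_seed: dict[int, list[float]],
                   min_score_bps: float) -> bool:
    """Promote only if the typical seed clears the bar, not the luckiest one."""
    return seed_median_score(fold_returns_by_seed) >= min_score_bps


if __name__ == "__main__":
    # Hypothetical walk-forward results: seed 2 had an extraordinary run.
    results = {0: [10.0, 20.0], 1: [-5.0, 5.0], 2: [100.0, 200.0]}
    print(seed_median_score(results))
    print(should_promote(results, min_score_bps=20.0))
```

Because the median ignores how extreme the best seed is, a single extraordinary run cannot carry a candidate over the threshold on its own.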

Sample-size honesty

The live windows here are weeks, not months: 16 and 20 trading days of overlap. The co-movement of three separately computed ledgers is what makes the attribution credible at this sample size; the per-strategy dollar figures alone would not be. The K-of-N books currently in paper and live carry the same instrumentation and will extend this decomposition as their history accumulates.

Sources & references

Axon Ridge — GRID-50 pooled ranker (Paper № 12). /research/grid50-pooled-ranker.html
Axon Ridge — Grid A/B/C funnel (Paper № 11). /research/grid-abc-funnel.html
Axon Ridge internal — paper vs live three-ledger diagnostic, 2026-06.
Axon Ridge internal — live trader runbook: TON delisting and reduce-order gotchas.