Abstract
In May 2026 the program reported a top-of-roster strategy candidate, a LightGBM lambdarank at h = 2 on the 26-coin universe with K = 5 long / 5 short, at an annualised dollar Sharpe ratio of +7.376. The number stayed on the scoreboard for several days before a 38-hour live walk-forward showed the strategy down 3.46 percent. The investigation found a metric substitution: the “target” array in the evaluation pipeline held cross-sectional rank values in [−1, +1] rather than raw forward returns, so the reported figure was a rank information coefficient, not a dollar Sharpe. After replacing the target array with raw returns and re-running with realistic costs (4 bps per leg, applied to the per-bar weight delta), the corrected dollar Sharpe was −3.53. This paper documents the bug, the corrected protocol, and the standing guardrails that prevent its recurrence.
1 · How the substitution happens
The evaluation function in question expected a 1-D array of per-bar returns and computed an annualised dollar Sharpe via:
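from math import sqrt
def annualised_sr(returns, bars_per_year=24*365):
    # Mean over standard deviation of per-bar returns, scaled by sqrt(bars per year).
    return returns.mean() / returns.std() * sqrt(bars_per_year)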
The pipeline producing the “returns” argument for LightGBM lambdarank sidecars wrote the model's training target into the array — and for a lambdarank ranker, the training target is the per-bar cross-sectional rank in [−1, +1], not the raw forward return. The function does not raise: a vector of ranks has a perfectly well-defined mean and standard deviation. The output is just a different statistic — Spearman rank correlation against the prediction, scaled by the same √(bars per year).
This is metric substitution at its most insidious. The number is real and reproducible. It just measures the wrong thing.
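A minimal sketch of why nothing fails, run on synthetic data. Apart from `annualised_sr`, everything here is an assumption made for illustration: the helpers `xs_rank_target` and `long_short_weights`, the random returns and predictions, and the equal-weight K = 5 long / 5 short book. It is not the program's pipeline.

```python
import numpy as np

def annualised_sr(returns, bars_per_year=24 * 365):
    return returns.mean() / returns.std() * np.sqrt(bars_per_year)

def xs_rank_target(fwd_returns):
    """Map each bar's forward returns to cross-sectional ranks in [-1, +1]."""
    n = fwd_returns.shape[1]
    ranks = fwd_returns.argsort(axis=1).argsort(axis=1)  # 0 .. n-1 within each row
    return 2.0 * ranks / (n - 1) - 1.0

def long_short_weights(scores, k=5):
    """Equal-weight top-k long and bottom-k short; each leg has absolute weight 1."""
    order = scores.argsort(axis=1)
    w = np.zeros_like(scores, dtype=float)
    rows = np.arange(scores.shape[0])[:, None]
    w[rows, order[:, -k:]] = 1.0 / k
    w[rows, order[:, :k]] = -1.0 / k
    return w

rng = np.random.default_rng(0)
T, N = 2000, 26
raw = rng.normal(0.0, 0.01, size=(T, N))            # synthetic forward returns
scores = raw + rng.normal(0.0, 0.05, size=(T, N))   # weak, noisy predictions
w = long_short_weights(scores, k=5)

pnl_raw = (w * raw).sum(axis=1)                     # per-bar dollar return
pnl_rank = (w * xs_rank_target(raw)).sum(axis=1)    # per-bar "return" built from ranks

# Both calls succeed; only the second is computed on the wrong kind of input.
print("raw target :", annualised_sr(pnl_raw), np.abs(pnl_raw).max())
print("rank target:", annualised_sr(pnl_rank), np.abs(pnl_rank).max())
```

On this synthetic data both calls return a finite number without complaint, and the rank-built series contains per-bar values far larger than any plausible hourly return. The magnitude of the inputs is the only visible symptom.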
If more than 1% of your “return” vector has absolute value greater than 0.5, it is not a return vector. A typical 1-hour crypto perp return on this universe rarely exceeds 5% in absolute value, whereas a rank target in [−1, +1] spans the full range by construction. We now run `inspect_target_distribution.py` on every Sharpe number we publish.
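The rule of thumb above fits in a few lines. The contents of `inspect_target_distribution.py` are not reproduced in this paper, so the function below is a hypothetical stand-in: the name `looks_like_returns` and its parameters are assumptions, and only the 0.5 and 1% thresholds come from the text.

```python
import numpy as np

def looks_like_returns(values, abs_threshold=0.5, max_fraction=0.01):
    """Return (ok, fraction); ok is False when too many entries are implausibly large."""
    values = np.asarray(values, dtype=float)
    fraction = float(np.mean(np.abs(values) > abs_threshold))
    return fraction <= max_fraction, fraction

rng = np.random.default_rng(1)
hourly_returns = rng.normal(0.0, 0.01, size=10_000)  # synthetic, about 1% per bar
rank_target = rng.uniform(-1.0, 1.0, size=10_000)    # synthetic rank-like values

print(looks_like_returns(hourly_returns))  # passes: essentially nothing above 0.5
print(looks_like_returns(rank_target))     # fails: roughly half the values exceed 0.5
```

The check is deliberately crude. It cannot prove that a vector holds returns, but it rejects a rank target immediately.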
2 · Side-by-side: the same strategy, three lenses
The retracted candidate is a useful case study because it sits cleanly on the boundary of what a metric substitution can hide. Below are three different Sharpe-like numbers for the same trained model, evaluated on the same OOS test window, with the same prediction array:
The retracted +7.376 was a rank IR computed on a rank target. The honest gross Sharpe on raw returns at zero cost was +1.99, already a very different number. At our operational cost convention (4 bps per leg, applied to the per-bar weight delta), the strategy is deeply negative: the corrected dollar Sharpe is −3.53, and zero of 27 seeds were positive after cost.
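One way to read the cost convention, as a hedged sketch: each bar is charged the per-leg rate times the sum of absolute weight changes since the previous bar. The function name `net_returns`, the flat starting book, and the toy two-asset example are assumptions for illustration, not the program's evaluator.

```python
import numpy as np

def net_returns(weights, asset_returns, cost_bps_per_leg=4.0):
    """Per-bar net return: gross P&L minus cost charged on the per-bar weight change.

    weights[t] is the book held over the bar whose return is asset_returns[t].
    The book is assumed flat before bar 0, so opening the position is charged too.
    """
    weights = np.asarray(weights, dtype=float)
    asset_returns = np.asarray(asset_returns, dtype=float)
    gross = (weights * asset_returns).sum(axis=1)
    prev = np.vstack([np.zeros((1, weights.shape[1])), weights[:-1]])
    turnover = np.abs(weights - prev).sum(axis=1)
    return gross - cost_bps_per_leg * 1e-4 * turnover, turnover

# Two assets, three bars: open long A / short B, hold, then flip the book.
w = np.array([[0.5, -0.5], [0.5, -0.5], [-0.5, 0.5]])
r = np.array([[0.01, -0.01], [0.00, 0.00], [0.02, 0.00]])
net, turnover = net_returns(w, r)
print(turnover)  # [1. 0. 2.]
print(net)       # gross of 0.01, 0.0, -0.01 less 4 bps times each bar's turnover
```

Flipping the book costs twice as much as opening it, which is why a ranker that reshuffles its top and bottom names every bar pays so heavily.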
3 · The corrected cost sweep
Once the metric was fixed, we ran the same strategy across cost levels at three portfolio concentrations. The pattern is what cost-aware backtesting always shows on high-turnover strategies: a positive gross signal collapses well below break-even at realistic per-leg costs.
The highest concentration (K = 5/5) has the largest cost drag because its per-bar portfolio weight deltas are larger. At our live cost convention of 4 bps per leg, every concentration is solidly net negative. The strategy is not deployable as specified.
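Under a linear cost model (cost per bar equals the per-leg rate times turnover, which is an assumption about how the convention is applied), mean net return is mean gross return minus the rate times mean turnover. The break-even rate is therefore mean gross return divided by mean turnover. A minimal sketch, with a hypothetical function name and illustrative inputs that are not results from the sweep:

```python
import numpy as np

def break_even_cost_bps(gross_returns, turnover):
    """Per-leg cost in bps at which mean net return is zero under a linear cost model."""
    gross_returns = np.asarray(gross_returns, dtype=float)
    turnover = np.asarray(turnover, dtype=float)
    mean_turnover = turnover.mean()
    if mean_turnover == 0:
        raise ValueError("no turnover: break-even cost is undefined")
    return 1e4 * gross_returns.mean() / mean_turnover

# Illustrative numbers only: 1 bp of mean gross return per bar, 50% turnover per bar.
gross = np.full(100, 1e-4)
turn = np.full(100, 0.5)
print(break_even_cost_bps(gross, turn))  # about 2 bps, below a 4 bps per-leg convention
```

A strategy whose break-even rate sits below the operating cost is net negative no matter how good its gross Sharpe looks.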
4 · The substitution glossary
Metric substitution is a much broader hazard than one bug in one evaluator. We have observed four substitutions in our own scoreboards and in published crypto-ML papers, listed roughly in order of how often they appear in the wild:
- Rank target in place of raw return. This is what we caught here. The “target” vector is a rank in [−1, +1] rather than a raw return. Fixed by inspecting the target distribution and by naming variables explicitly (`target_mode='raw'` vs `target_mode='xs_rank'`).
- Scaled IC in place of Sharpe. Spearman rank correlation between predictions and returns, scaled by √(bars per year), looks like a Sharpe. It isn't one. Across our 28-arm screen the best IC is +0.20, which does not translate to a tradeable +7 Sharpe at any cost.
- RMSE in place of directional accuracy. The 918-experiment benchmark[1] shows that minimum-RMSE forecasters often have ~50 percent directional accuracy. For a cross-sectional trader, directional rank is the signal. We do not promote arms on RMSE alone (see Paper № 02).
- Gross Sharpe in place of net Sharpe. The same model can be the strongest at zero cost and the weakest at 4 bps, if turnover is asymmetric across the candidates. Full treatment in Paper № 06.
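The explicit-naming fix for the first substitution can be enforced in code rather than by convention. The sketch below is one possible shape, not the program's evaluator: the `Target` container and the `dollar_sharpe` function are hypothetical, and only the `target_mode` labels `'raw'` and `'xs_rank'` come from the text.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class Target:
    values: np.ndarray
    target_mode: str  # 'raw' (forward returns) or 'xs_rank' (ranks in [-1, +1])

def dollar_sharpe(target, weights, bars_per_year=24 * 365):
    """Annualised dollar Sharpe; refuses any target not labelled as raw returns."""
    if target.target_mode != "raw":
        raise ValueError(
            f"dollar Sharpe needs target_mode='raw', got {target.target_mode!r}"
        )
    pnl = (np.asarray(weights, dtype=float) * target.values).sum(axis=1)
    return pnl.mean() / pnl.std() * np.sqrt(bars_per_year)

rng = np.random.default_rng(2)
raw = Target(rng.normal(0.0, 0.01, size=(500, 26)), "raw")
ranks = Target(rng.uniform(-1.0, 1.0, size=(500, 26)), "xs_rank")
w = np.full((500, 26), 1.0 / 26)   # placeholder weights, for illustration only

print(dollar_sharpe(raw, w))       # computed normally
try:
    dollar_sharpe(ranks, w)
except ValueError as err:
    print("rejected:", err)        # the rank target never reaches the Sharpe formula
```

Carrying the label with the data means a mislabelled array fails loudly at the evaluator instead of producing a plausible-looking number.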
5 · Standing guardrails
Three changes were made after this incident, all now part of the standard evaluation pipeline:
- Target-distribution check. Every Sharpe number is paired with a histogram of the return vector that produced it. If more than 1% of values have absolute value above 0.5, the number is rejected.
- Mistake #21 gate. Promotion to live capital requires the rank correlation between the training metric used and the final dollar Sharpe across folds to exceed +0.5. Models whose training metric does not track their tradeable Sharpe fail this gate regardless of headline performance.
- Canonical scripts. All published metrics come from a small set of audited entry points (`audit_universe_ic.py`, `compare_strategies_raw_returns.py`). Hand-typed aggregates are not accepted; the audited script's output is the source of truth.
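The Mistake #21 gate can be sketched as a rank correlation over per-fold values. This reading (one training-metric value and one dollar Sharpe per fold, correlated across folds) is an assumption, as are the function names and the per-fold numbers, which are invented for illustration. The Spearman helper uses NumPy only and assumes no ties.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation; assumes no ties (ranks come from a double argsort)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def passes_metric_tracking_gate(train_metric_by_fold, dollar_sharpe_by_fold, threshold=0.5):
    """Return (passed, rho): passed only if the training metric tracks dollar Sharpe."""
    rho = spearman(train_metric_by_fold, dollar_sharpe_by_fold)
    return rho > threshold, rho

# Hypothetical per-fold values, for illustration only.
train_metric = [0.10, 0.14, 0.08, 0.20, 0.12]
tracks = [0.4, 0.9, 0.1, 1.5, 0.6]       # same fold ordering as the training metric
diverges = [0.9, -0.2, 1.1, -1.0, 0.3]   # exactly reversed fold ordering

print(passes_metric_tracking_gate(train_metric, tracks))    # passes, rho is 1
print(passes_metric_tracking_gate(train_metric, diverges))  # fails, rho is -1
```

A model can post a strong training metric in every fold and still fail this gate, which is the point: the gate tests whether the metric and the tradeable outcome move together.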
6 · What this incident did not invalidate
Lambdarank models on cross-sectional crypto data remain a valid and active research direction. STRAT-04b (covered in Paper № 07) uses the same family and produces a +1.51 Pass-B Sharpe at full cost. The retraction is about the evaluation pipeline reporting the wrong number for one candidate, not about lambdarank as a method. The fix is in the evaluator, not the architecture.
7 · A note on culture
We publish this paper because the alternative — quietly correcting the scoreboard — is worse for the program. The substitution is common enough in the literature that a public, reproducible case study is more useful than another paper claiming a seven-Sharpe crypto-ML strategy. We would rather publish a corrected −3.53 than an inflated +7.376.
A backtest without a metric audit is a number with a name attached.
Sources & references
- Saidd, M. et al. (2026). A 918-experiment empirical study of long-horizon forecasters. arXiv:2603.16886
- Axon Ridge internal — `research/experiments/results/STRATEGY_COMPARISON_KOFN_2026-05-05.md` (retraction)
- Axon Ridge internal — `research/experiments/results/STRATEGY_COMPARISON_RAW_2026-05-05.md` (corrected)
- Axon Ridge internal — `docs/CALCULATION_GLOSSARY.md` §5.3 (target_mode discipline)
- Axon Ridge internal — `research/scoreboards/06_top20_mistakes.md` Mistake #21 + #23
- Burges, C. J. C. (2010). From RankNet to LambdaRank to LambdaMART. Microsoft Research TR-2010-82.
- Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio.