Abstract

In May 2026 the program reported a top-of-roster strategy candidate (a LightGBM lambdarank at h = 2 on the 26-coin universe with K = 5 long / 5 short) at an annualised dollar Sharpe ratio of +7.376. The number stayed on the scoreboard for several days before a 38-hour live walk-forward showed the strategy down 3.46 percent. The investigation found a metric substitution: the “target” array in the evaluation pipeline held cross-sectional rank values in [−1, +1] rather than raw forward returns, so the reported figure was a rank information coefficient, not a dollar Sharpe. After replacing the target array with raw returns and re-running with realistic costs (4 bps per leg, applied to the per-bar weight delta), the corrected dollar Sharpe was −3.53. This paper documents the bug, the corrected protocol, and the standing guardrails that prevent its recurrence.

Retracted headline: +7.376
Gross @ 0 bps (raw returns): +1.99
Net @ 4 bps (production): −3.53

1 · How the substitution happens

The evaluation function in question expected a 1-D array of per-bar returns and computed an annualised dollar Sharpe via:

from math import sqrt

def annualised_sr(returns, bars_per_year=24*365):
    return returns.mean() / returns.std() * sqrt(bars_per_year)  # mean / std, scaled to a year of hourly bars

The pipeline producing the “returns” argument for LightGBM lambdarank sidecars wrote the model's training target into the array — and for a lambdarank ranker, the training target is the per-bar cross-sectional rank in [−1, +1], not the raw forward return. The function does not raise: a vector of ranks has a perfectly well-defined mean and standard deviation. The output is just a different statistic — Spearman rank correlation against the prediction, scaled by the same √(bars per year).

This is metric substitution at its most insidious. The number is real and reproducible. It just measures the wrong thing.
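To make the failure mode concrete, here is a minimal sketch of how a cross-sectional rank target in [−1, +1] can be built and then passed to a Sharpe function without any error. The helper name `xs_rank_target`, the (bars × coins) array layout, and the synthetic data are illustrative assumptions, not the program's actual pipeline code.

```python
import numpy as np


def xs_rank_target(forward_returns):
    """Map each bar's returns to cross-sectional ranks scaled to [-1, +1].

    forward_returns: 2-D array, shape (n_bars, n_coins) -- assumed layout.
    Ties are broken by position, which is good enough for a sketch.
    """
    n_coins = forward_returns.shape[1]
    # Applying argsort twice yields each element's 0-based rank within its row.
    ranks = forward_returns.argsort(axis=1).argsort(axis=1)
    return 2.0 * ranks / (n_coins - 1) - 1.0


def annualised_sr(returns, bars_per_year=24 * 365):
    return returns.mean() / returns.std() * np.sqrt(bars_per_year)


rng = np.random.default_rng(0)
raw = rng.normal(0.0, 0.01, size=(500, 26))  # synthetic hourly returns
ranked = xs_rank_target(raw)

# Neither call raises; only the first one is a statistic of returns.
sr_from_raw = annualised_sr(raw[:, 0])
stat_from_ranks = annualised_sr(ranked[:, 0])
```

Both calls return a finite float, which is the point: nothing in the arithmetic distinguishes a return series from a rank series.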

The diagnostic that catches it

If more than 1% of the values in your “return” vector have absolute value greater than 0.5, it is not a return vector. A typical 1-hour crypto perp return on this universe rarely exceeds ±5%. A rank target, by definition, spans the full [−1, +1] range. We now run inspect_target_distribution.py on every Sharpe number we publish.
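The rule of thumb above can be sketched in a few lines. This is not the content of inspect_target_distribution.py; the function name `looks_like_returns` and its parameter names are hypothetical, and the inputs are synthetic.

```python
import numpy as np


def looks_like_returns(values, abs_threshold=0.5, max_fraction=0.01):
    """Return True when at most max_fraction of |values| exceed abs_threshold."""
    values = np.asarray(values, dtype=float)
    fraction_large = np.mean(np.abs(values) > abs_threshold)
    return bool(fraction_large <= max_fraction)


rng = np.random.default_rng(1)
hourly_returns = rng.normal(0.0, 0.01, size=10_000)      # synthetic, ~1% per-bar vol
rank_target = np.tile(np.linspace(-1.0, 1.0, 26), 100)   # ranks for 26 coins, 100 bars

returns_ok = looks_like_returns(hourly_returns)
ranks_ok = looks_like_returns(rank_target)
```

On these synthetic inputs the return-like vector passes and the rank vector is rejected, because more than half of evenly spaced ranks in [−1, +1] lie beyond ±0.5.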

2 · Side-by-side: the same strategy, three lenses

The retracted candidate is a useful case study because it sits cleanly on the boundary of what a metric substitution can hide. Below are three different Sharpe-like numbers for the same trained model, evaluated on the same OOS test window, with the same prediction array:

[Figure: Same strategy, three Sharpes. LGBM-H2, 26-coin universe, K = 5L / 5S, OOS test window, 27 seeds.]

The retracted +7.376 was rank IR computed on a rank target. The honest gross Sharpe on raw returns at zero cost was +1.99 — already very different. At our operational cost convention (4 bps per leg, applied to the per-bar weight delta), the strategy is deeply negative. Zero of 27 seeds were positive after cost.
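One way to read the cost convention (a per-leg charge in basis points applied to the per-bar weight delta) is sketched below. The function name, the array layout, and the choice to start from a flat book are assumptions made for illustration; the production evaluator may differ in detail.

```python
import numpy as np


def net_bar_returns(weights, forward_returns, cost_bps_per_leg=4.0):
    """Per-bar portfolio return after turnover costs.

    weights, forward_returns: arrays of shape (n_bars, n_coins).
    Cost is charged on sum_i |w[t, i] - w[t-1, i]|, starting from zero weights.
    """
    weights = np.asarray(weights, dtype=float)
    forward_returns = np.asarray(forward_returns, dtype=float)
    gross = (weights * forward_returns).sum(axis=1)
    # Previous-bar weights, with an all-zero row before the first bar.
    prev = np.vstack([np.zeros((1, weights.shape[1])), weights[:-1]])
    turnover = np.abs(weights - prev).sum(axis=1)
    cost = cost_bps_per_leg * 1e-4 * turnover
    return gross - cost


# Two coins, two bars: enter long/short, then flip the book.
w = [[0.5, -0.5], [-0.5, 0.5]]
r = [[0.01, -0.01], [0.0, 0.0]]
net = net_bar_returns(w, r)
```

In this toy case the first bar earns 1% gross and pays 4 bps on a turnover of 1.0; the flip on the second bar has a turnover of 2.0 and pays 8 bps against a zero gross return.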

3 · The corrected cost sweep

Once the metric was fixed, we ran the same strategy across cost levels at three portfolio concentrations. The pattern is what cost-aware backtesting always shows on high-turnover strategies: a positive gross signal collapses well below break-even at realistic per-leg costs.

[Figure: Net Sharpe vs cost per leg, by K-of-N concentration. Same model, varied portfolio concentration K, raw returns, median over 27 seeds.]

Higher concentration (K = 5/5) has the largest cost drag because portfolio weight deltas per bar are larger. At our live cost convention of 4 bps per leg, every concentration is solidly net negative. The strategy is not deployable as specified.
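A cost sweep of this kind reduces to recomputing the Sharpe of gross return minus cost times turnover at each cost level. The sketch below assumes per-bar gross returns and per-bar turnover are already available as arrays; the function names and the toy numbers are hypothetical.

```python
import numpy as np


def sharpe_by_cost(gross, turnover, costs_bps, bars_per_year=24 * 365):
    """Annualised Sharpe of (gross - cost * turnover) for each cost level."""
    gross = np.asarray(gross, dtype=float)
    turnover = np.asarray(turnover, dtype=float)
    out = {}
    for bps in costs_bps:
        net = gross - bps * 1e-4 * turnover
        out[bps] = net.mean() / net.std() * np.sqrt(bars_per_year)
    return out


def break_even_bps(gross, turnover):
    """Cost per leg, in bps, at which the mean net return is zero."""
    return 1e4 * np.mean(gross) / np.mean(turnover)


# Toy series: 10 bps mean gross return per bar, turnover of 1.0 every bar.
gross = [0.002, 0.0, 0.001, 0.001]
turnover = [1.0, 1.0, 1.0, 1.0]
sweep = sharpe_by_cost(gross, turnover, costs_bps=[0, 10, 20])
be = break_even_bps(gross, turnover)
```

The break-even cost is simply mean gross return divided by mean turnover, which is why larger per-bar weight deltas push the break-even lower for the same gross signal.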

4 · The substitution glossary

Metric substitution is a much broader hazard than one bug in one evaluator. We have observed four substitutions in our own scoreboards and in published crypto-ML papers; they are listed here roughly in order of how often they appear in the wild:

Substitution #1 · Rank-IR reported as Sharpe

What we caught here. The “target” vector is a rank in [−1, +1] rather than a raw return. Fixed by inspecting the target distribution and by naming variables explicitly (target_mode='raw' vs target_mode='xs_rank').
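Explicit naming can also be enforced in code rather than by convention. The sketch below tags each target array with its `target_mode` and makes the Sharpe function refuse anything not tagged 'raw'. The `TargetArray` class and `dollar_sharpe` function are hypothetical illustrations, not the pipeline's real types.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class TargetArray:
    values: np.ndarray
    target_mode: str  # 'raw' (forward returns) or 'xs_rank' (ranks in [-1, +1])


def dollar_sharpe(target, bars_per_year=24 * 365):
    """Annualised Sharpe; refuses anything not tagged as raw returns."""
    if target.target_mode != "raw":
        raise ValueError(
            f"dollar Sharpe needs target_mode='raw', got {target.target_mode!r}"
        )
    r = target.values
    return r.mean() / r.std() * np.sqrt(bars_per_year)


raw = TargetArray(np.array([0.01, -0.005, 0.002, 0.004]), target_mode="raw")
ranked = TargetArray(np.array([1.0, -1.0, 0.2, 0.6]), target_mode="xs_rank")

sr = dollar_sharpe(raw)  # fine
try:
    dollar_sharpe(ranked)
    rejected = False
except ValueError:
    rejected = True      # the rank-tagged array never reaches the arithmetic
```

A tag only helps if it is set where the array is created; the distribution check remains the backstop for arrays that were mislabelled upstream.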

Substitution #2 · IC reported as Sharpe

Spearman rank correlation between predictions and returns, scaled by √(bars per year), looks like a Sharpe. It isn't one. Across our 28-arm screen the best IC is +0.20 — that does not translate to a tradeable +7 Sharpe at any cost.
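For reference, a per-bar Spearman IC and its Sharpe-shaped scaling might be computed as follows. This is a sketch under assumptions: the function names are hypothetical, ties are ignored, and the mean/std scaling in `scaled_ic` is one plausible reading of how an IC series ends up looking like a Sharpe.

```python
import numpy as np


def _ranks(x):
    # 0-based ranks along the last axis; assumes no tied values.
    return x.argsort(axis=-1).argsort(axis=-1).astype(float)


def per_bar_ic(predictions, forward_returns):
    """Spearman correlation between predictions and returns, one value per bar."""
    p = _ranks(np.asarray(predictions, dtype=float))
    r = _ranks(np.asarray(forward_returns, dtype=float))
    p -= p.mean(axis=1, keepdims=True)
    r -= r.mean(axis=1, keepdims=True)
    return (p * r).sum(axis=1) / np.sqrt((p**2).sum(axis=1) * (r**2).sum(axis=1))


def scaled_ic(ic_series, bars_per_year=24 * 365):
    """Sharpe-shaped statistic of an IC series; its units are not dollars."""
    return ic_series.mean() / ic_series.std() * np.sqrt(bars_per_year)


preds = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
rets = [[0.01, 0.02, 0.03], [0.01, 0.02, 0.03]]
ic = per_bar_ic(preds, rets)  # perfect agreement, then perfect disagreement
```

The IC never sees the size of the returns or the cost of trading them, so no scaling turns it into a tradeable Sharpe.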

Substitution #3 · RMSE-best is trading-best

The 918-experiment benchmark[1] shows that minimum-RMSE forecasters often have ~50 percent directional accuracy. For a cross-sectional trader, directional rank is the signal. We do not promote arms on RMSE alone; see Paper № 02.
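A contrived toy example shows how the two criteria can disagree. The numbers below are invented purely for illustration and are not taken from the benchmark: a nearly constant forecast has the lower RMSE, while a forecast with the right sign but the wrong magnitude has the better directional accuracy.

```python
import numpy as np


def rmse(pred, actual):
    return float(np.sqrt(np.mean((pred - actual) ** 2)))


def directional_accuracy(pred, actual):
    return float(np.mean(np.sign(pred) == np.sign(actual)))


actual = np.array([0.010, -0.020, 0.015, -0.005])
flat_forecast = np.full_like(actual, 0.001)  # nearly constant, small error
scaled_forecast = 5.0 * actual               # right sign, wrong magnitude

flat_scores = (rmse(flat_forecast, actual), directional_accuracy(flat_forecast, actual))
scaled_scores = (rmse(scaled_forecast, actual), directional_accuracy(scaled_forecast, actual))
```

Ranking the two forecasts by RMSE picks the flat one, which gets the direction right on only half the bars; ranking by direction picks the other.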

Substitution #4 · Gross-best is net-best

The same model can be the strongest at zero cost and the weakest at 4 bps, if turnover is asymmetric across the candidates. Full treatment in Paper № 06.

5 · Standing guardrails

Three changes were made after this incident, all now part of the standard evaluation pipeline:

  1. Target-distribution check. Every Sharpe number is paired with a histogram of the return vector that produced it. If more than 1% of values have absolute value above 0.5, the number is rejected.
  2. Mistake #21 gate. Promotion to live capital requires the rank correlation between the training metric used and the final dollar Sharpe across folds to exceed +0.5. Models whose training metric does not track their tradeable Sharpe fail this gate regardless of headline performance.
  3. Canonical scripts. All published metrics come from a small set of audited entry points (audit_universe_ic.py, compare_strategies_raw_returns.py). Hand-typed aggregates are not accepted; the audited script's output is the source of truth.
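The second guardrail can be sketched as a Spearman correlation across folds compared against the +0.5 threshold. The function names are hypothetical, ties are not handled, and the sketch assumes the training metric is oriented so that higher is better (a loss would need its sign flipped first).

```python
import numpy as np


def fold_rank_correlation(train_metric_by_fold, dollar_sharpe_by_fold):
    """Spearman correlation across folds (assumes no tied values)."""
    a = np.argsort(np.argsort(train_metric_by_fold)).astype(float)
    b = np.argsort(np.argsort(dollar_sharpe_by_fold)).astype(float)
    return float(np.corrcoef(a, b)[0, 1])


def passes_tracking_gate(train_metric_by_fold, dollar_sharpe_by_fold, min_corr=0.5):
    """True when the training metric tracks dollar Sharpe closely enough."""
    return fold_rank_correlation(train_metric_by_fold, dollar_sharpe_by_fold) > min_corr


# Invented per-fold values for illustration only.
train_metric = [0.10, 0.20, 0.30, 0.40]
sharpe_tracking = [1.0, 2.0, 1.5, 3.0]   # mostly agrees with the training metric
sharpe_inverted = [3.0, 2.0, 1.0, 0.5]   # moves opposite to the training metric

corr = fold_rank_correlation(train_metric, sharpe_tracking)
gate_ok = passes_tracking_gate(train_metric, sharpe_tracking)
gate_fail = passes_tracking_gate(train_metric, sharpe_inverted)
```

With only a handful of folds the correlation is a coarse statistic, so a gate like this is better read as a veto on clear disagreement than as proof of agreement.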

6 · What this incident did not invalidate

Lambdarank models on cross-sectional crypto data remain a valid and active research direction. STRAT-04b (covered in Paper № 07) uses the same family and produces a +1.51 Pass-B Sharpe at full cost. The retraction is about the evaluation pipeline reporting the wrong number for one candidate, not about lambdarank as a method. The fix is in the evaluator, not the architecture.

7 · A note on culture

We publish this paper because the alternative — quietly correcting the scoreboard — is worse for the program. The substitution is common enough in the literature that a public, reproducible case study is more useful than another paper claiming a seven-Sharpe crypto-ML strategy. We would rather publish a corrected −3.53 than an inflated +7.376.

A backtest without a metric audit is a number with a name attached.

Sources & references

  1. Saidd, M. et al. (2026). A 918-experiment empirical study of long-horizon forecasters. arXiv:2603.16886
  2. Axon Ridge internal — `research/experiments/results/STRATEGY_COMPARISON_KOFN_2026-05-05.md` (retraction)
  3. Axon Ridge internal — `research/experiments/results/STRATEGY_COMPARISON_RAW_2026-05-05.md` (corrected)
  4. Axon Ridge internal — `docs/CALCULATION_GLOSSARY.md` §5.3 (target_mode discipline)
  5. Axon Ridge internal — `research/scoreboards/06_top20_mistakes.md` Mistake #21 + #23
  6. Burges, C. J. C. (2010). From RankNet to LambdaRank to LambdaMART. Microsoft Research TR-2010-82.
  7. Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio.