Abstract
Most published evaluations of time-series forecasting architectures use RMSE on a handful of common benchmarks. For a directional cross-sectional trader on a 24-hour-ahead horizon, what matters is the rank correlation between predicted and realised forward returns, not squared error. We ran 32 candidate arms (28 neural networks, 2 rule-based deciles, plus failed runs) through a single canonical screening protocol on a 26-coin Hyperliquid perpetual universe and found that the family ranking differs substantially from the RMSE-based literature leaderboards. The strongest arm on our universe, TimeMixer, is not the strongest on the long-horizon literature leaderboards, and the literature's top picks (ModernTCN, PatchTST) rank in the middle or bottom of our roster.
1 · The funnel
The full pipeline has three gates:
- Universe IC screen. Each arm is trained on the 26-coin panel and produces predictions for every coin at every bar in the test window. We compute Spearman rank correlation between predictions and raw forward returns, per coin and per architecture, under two splits (described below). One canonical script, `scripts/funnel/audit_universe_ic.py`, produces every IC number we publish.
- Strategy simulation. Surviving arms enter STRAT-level sims that translate predictions into positions, apply cost (4 bps per leg), and report annualised dollar Sharpe over 30 seeds.
- Paper validation. Strategy survivors run on live data in simulated execution before any live capital is risked.
This paper is about gate 1. The other two are covered in Paper № 04 (dual-pass OOS) and the strategy-specific writeups.
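As a rough illustration of the gate-1 computation, the sketch below derives a per-coin Spearman IC from rows of (coin, prediction, forward return). It is not the contents of `scripts/funnel/audit_universe_ic.py`: the function names, the row layout, and the toy numbers are assumptions made for this example, and it uses only the Python standard library.

```python
from collections import defaultdict
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def spearman_ic(preds, rets):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(preds), average_ranks(rets))


def per_coin_ic(rows):
    """rows: iterable of (coin, prediction, forward_return), one per bar."""
    by_coin = defaultdict(lambda: ([], []))
    for coin, pred, ret in rows:
        by_coin[coin][0].append(pred)
        by_coin[coin][1].append(ret)
    return {coin: spearman_ic(p, r) for coin, (p, r) in by_coin.items()}


# Toy rows (hypothetical numbers): predictions track returns on TON
# and run opposite to returns on LINK.
rows = [
    ("TON", 0.1, 0.01), ("TON", 0.2, 0.02), ("TON", 0.3, 0.05),
    ("LINK", 0.1, 0.03), ("LINK", 0.2, 0.02), ("LINK", 0.3, 0.01),
]
ics = per_coin_ic(rows)
```

In this toy case the TON predictions are ordered exactly like the returns and the LINK predictions are ordered exactly opposite, so the two per-coin ICs sit at the two extremes of the Spearman range.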
2 · The metric: HiConv-6-inverse
Reporting 26 separate per-coin IC values per architecture is not useful for ranking. We aggregate using a deliberately stress-tested statistic: the HiConv-6-inverse IC is the mean of negated Spearman IC on a fixed six-coin cluster — TON, JUP, POPCAT, LINK, kPEPE, TIA — that the early arch-funnel work identified as the cleanest short-side cluster on this universe.
An arm with a positive HiConv-6-inverse is one whose predictions, when used as a short signal on these six names, would have shown positive directional rank correlation against realised returns over the Pass-B test window. It is not a Sharpe; it is a screening metric. We use it consistently because it ranks architectures by their cross-sectional discrimination on the historically harder side of the universe.
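The definition above reduces to a few lines once per-coin ICs are available. The following is a minimal sketch, assuming the per-coin Spearman ICs arrive as a plain dict keyed by coin symbol; the function name, the dict layout, and the toy values are hypothetical.

```python
from statistics import fmean

# The fixed six-coin cluster named in the text.
HICONV_6 = ("TON", "JUP", "POPCAT", "LINK", "kPEPE", "TIA")


def hiconv6_inverse(ic_by_coin, cluster=HICONV_6):
    """Mean of the negated per-coin Spearman IC over the fixed cluster."""
    missing = [c for c in cluster if c not in ic_by_coin]
    if missing:
        # Refuse to average over a partial cluster: the statistic is only
        # comparable across arms when all six names are present.
        raise KeyError(f"missing IC for: {missing}")
    return fmean(-ic_by_coin[c] for c in cluster)


# Toy input (hypothetical values): an arm with IC of -0.1 on every cluster
# coin and +0.3 on a coin outside the cluster, which is ignored.
toy_ic = {coin: -0.1 for coin in HICONV_6}
toy_ic["BTC"] = 0.3
score = hiconv6_inverse(toy_ic)
```

Because the ICs are negated before averaging, an arm whose raw IC is negative on the six names gets a positive score, which matches the short-signal reading described above.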
3 · Results
The top cluster (TimeMixer through TFT, ranks 1-15) is dominated by hybrid mixers and decomposition Transformers. The bottom of the table contains ModernTCN and the linear floor NLinear-RevIN. `xs_funding_z` and `xs_oi_velocity` are non-neural rule deciles kept on the screen as orthogonal baselines.
4 · Literature transfer is uneven
The long-horizon time-series literature ranks architectures on RMSE. The 918-experiment study[1] places ModernTCN at the very top of the crypto leaderboard. Our HiConv-6-inverse IC places it at rank 24. The chart below shows the disagreement for five canonical “literature top-tier” arms:
PatchTST and TimesNet transfer reasonably (slightly worse on our universe than in the literature). ModernTCN and DLinear do not transfer at all by this metric. Autoformer transfers up — it is mid-tier in the literature and a top arm on our screen.
5 · The redundancy problem
Many arms produce almost identical IC vectors. A pairwise correlation audit of the IC-per-coin profile across the top-15 arms found pairs with correlation above +0.95 — meaning that despite having different mechanisms, they predict the same coins in the same direction with the same intensity. The most extreme case is BITCN and StemGNN with a pairwise correlation of +0.999.
The rule-based `xs_oi_velocity` and `xs_funding_z` are the most independent. `lgbm_rank` is the most independent neural-adjacent arm and is our default diversifier. The TimeMixer / TSMixer / iTransformer / TSMixerx cluster is highly redundant: stacking three of them buys nothing.
“Seven architectures in an ensemble” sounds diverse. On this universe, it is one signal repeated seven times. The diversification properties of the roster need to be measured on the predictions, not on the architecture taxonomy.
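A redundancy audit of this kind can be sketched as a pairwise correlation over each arm's IC-per-coin profile, flagging pairs above the +0.95 level mentioned in this section. The text does not say which correlation the audit uses, so Pearson is an assumption here, as are the function names, the nested-dict layout, and the placeholder arm and coin labels.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def redundant_pairs(ic_profiles, threshold=0.95):
    """ic_profiles: {arm: {coin: ic}}. Returns (arm_a, arm_b, r) for r > threshold."""
    flagged = []
    for a, b in combinations(sorted(ic_profiles), 2):
        # Compare only coins that both arms were scored on.
        coins = sorted(set(ic_profiles[a]) & set(ic_profiles[b]))
        r = pearson([ic_profiles[a][c] for c in coins],
                    [ic_profiles[b][c] for c in coins])
        if r > threshold:  # a NaN correlation never passes this test
            flagged.append((a, b, r))
    return flagged


# Placeholder profiles (hypothetical numbers): arm_b is arm_a scaled by 0.5,
# arm_c points the other way on most coins.
profiles = {
    "arm_a": {"C1": 0.10, "C2": -0.20, "C3": 0.05, "C4": 0.30},
    "arm_b": {"C1": 0.05, "C2": -0.10, "C3": 0.025, "C4": 0.15},
    "arm_c": {"C1": -0.10, "C2": 0.20, "C3": 0.10, "C4": -0.05},
}
pairs = redundant_pairs(profiles)
```

The point of working on IC profiles rather than on architecture labels is visible even in the toy data: `arm_a` and `arm_b` could be entirely different model families and would still be flagged as one signal.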
6 · What this is and isn't
The funnel ranks architectures by their cross-sectional discrimination on a specific universe, on a specific horizon, under a specific cost-aware OOS protocol. It is not a universal claim that TimeMixer is “better than” ModernTCN — on a different benchmark with different metrics that ranking can flip, as the 918 paper demonstrates. The funnel is the input to our strategy work; the validation of any actual edge happens downstream in the dual-pass OOS work covered separately.
The other constraint worth being explicit about: IC ≠ Sharpe. A high HiConv-6-inverse IC is a necessary condition for an architecture to be worth simulating as a strategy. It is not a sufficient condition for that strategy to be profitable at full cost. The follow-up paper on cost-aware backtesting covers cases where a +0.15 IC becomes a deeply negative net Sharpe[2].
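To make the gap between a correct directional signal and a net result concrete, the sketch below charges the 4 bps per leg from gate 2 against position changes and compares gross and net annualised Sharpe. It is an illustration only, not the strategy simulator: it assumes cost is charged on the absolute change in position, one bar per day, and 365 periods per year, and the positions and returns are made-up numbers.

```python
from math import sqrt
from statistics import fmean, stdev


def pnl_series(positions, returns, cost_bps_per_leg=0.0):
    """Per-bar PnL: position * return minus cost on the change in position."""
    cost = cost_bps_per_leg / 10_000
    prev = 0.0
    out = []
    for pos, ret in zip(positions, returns):
        # Flipping from +1 to -1 is a turnover of 2: one leg to close,
        # one leg to open (assumed cost model).
        turnover = abs(pos - prev)
        out.append(pos * ret - turnover * cost)
        prev = pos
    return out


def annualised_sharpe(pnl, periods_per_year=365):
    """Mean over sample standard deviation, scaled by sqrt(periods per year)."""
    sd = stdev(pnl)
    if sd == 0:
        return float("nan")
    return fmean(pnl) / sd * sqrt(periods_per_year)


# Hypothetical toy path: the signal calls the direction correctly on every
# bar, but the edge per bar is smaller than the cost of flipping.
positions = [1.0, -1.0, 1.0, -1.0]
returns = [0.0005, -0.0003, 0.0006, -0.0004]

gross = annualised_sharpe(pnl_series(positions, returns))
net = annualised_sharpe(pnl_series(positions, returns, cost_bps_per_leg=4.0))
```

Every bar in the toy path is a directional win, yet each flip costs 8 bps against an edge of 3 to 6 bps, so the gross Sharpe is positive and the net Sharpe is negative.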
7 · Reproducibility
Every number on this page can be regenerated from the repository with a single command:
$ python3 -m scripts.funnel.audit_universe_ic
# writes research/experiments/results/audit_universe_ic_<date>.md
The 2026-05-15 audit caught seven cells in the pre-audit scoreboard that were hand-typed and wrong by more than 0.03 in IC. Those errors were corrected and the audit is now part of the standing automation.
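A check of this kind can be sketched as a cell-by-cell comparison between a scoreboard and freshly computed audit values, flagging any cell that differs by more than 0.03 in IC. The function name, the `(arm, coin)` key layout, and the numbers below are hypothetical; the actual audit automation is not shown in this paper.

```python
def stale_cells(scoreboard, audited, tolerance=0.03):
    """Return (key, scoreboard_value, audited_value, delta) where |delta| > tolerance.

    Both inputs map a cell key, here an (arm, coin) tuple, to an IC value.
    Cells absent from the audit are skipped rather than guessed at.
    """
    diffs = []
    for key in sorted(scoreboard):
        if key not in audited:
            continue
        delta = scoreboard[key] - audited[key]
        if abs(delta) > tolerance:
            diffs.append((key, scoreboard[key], audited[key], delta))
    return diffs


# Hypothetical cells: the first disagrees by 0.04, the second by 0.01.
scoreboard = {("arm_a", "TON"): 0.12, ("arm_a", "LINK"): 0.10}
audited = {("arm_a", "TON"): 0.08, ("arm_a", "LINK"): 0.09}
stale = stale_cells(scoreboard, audited)
```

Only the first cell is reported, since the second falls inside the 0.03 tolerance.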
Sources & references
- Saidd, M. et al. (2026). A 918-experiment empirical study of long-horizon forecasters. arXiv:2603.16886
- Axon Ridge — Cost-Aware Backtesting (Paper № 06). /research/cost-aware-backtesting.html
- Axon Ridge internal — `research/experiments/results/audit_universe_ic_2026-05-15.md`
- Axon Ridge internal — `research/scoreboards/01_top20_architectures.md`
- Axon Ridge internal — `research/experiments/results/IC_pattern_analysis_2026-05-15.md`