Literature Review — Axon Ridge Capital

Why this is not a generic survey

There are excellent broad surveys of deep learning for time series forecasting — Lim & Zohren (2021), the more recent Saidd et al. (2026) 918-experiment benchmark, and the LOBCAST critique of order-book deep learning, to name three. This paper is narrower and more honest: it covers only the families we actually train, screen, or keep as dossier-frontier on Hyperliquid perpetual futures. It is the reading list of a working research program rather than an exhaustive academic taxonomy.

Two operating constraints shape the survey. First, we evaluate on a fixed 26-coin Hyperliquid panel at hourly bars with horizon h = 24, costed at 4 basis points per leg — not the per-architecture single-asset benchmarks that fill most papers. Second, we report Spearman information coefficient (IC) as the primary screening metric and annualised dollar Sharpe at full cost as the deployment metric, and never confuse the two^[1].
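As a minimal sketch of the screening metric, assuming the IC is computed per timestamp as a Spearman rank correlation between model scores and realised forward returns across the panel. The function names and the four-coin sample are illustrative, not the program's actual code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman_ic(scores, forward_returns):
    """Spearman IC: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(scores), average_ranks(forward_returns))


# Hypothetical cross-section of four coins at one timestamp.
scores = [0.3, -0.1, 0.2, 0.0]
forward_returns = [0.02, -0.01, 0.005, 0.001]
ic = spearman_ic(scores, forward_returns)
```

Because only ranks enter the calculation, the IC says nothing about the size of the forecast returns, which is one reason it is kept separate from the dollar Sharpe.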

Family map

The chart below sorts each architecture family by best in-program Pass-B information coefficient on the audited 28-arm Phase-1 screen (the “HiConv-6-inverse” metric — mean of negated IC on a fixed 6-coin short cluster). It is the single best in-program comparison across families on this universe.

[Chart: Best Pass-B IC per architecture family. HiConv-6-inverse, audited 2026-05-15; higher = stronger cross-sectional signal.]

Hybrid mixers (TimeMixer family) and decomposition Transformers (Autoformer) lead. State-space and foundation-model families are dossier-only and have not yet produced an empirical row on this universe.

The eleven families, in order

1 — Patch / attention Transformers

The dominant family in the long-horizon time-series literature. PatchTST^[2] introduced patching (splitting each series into overlapping or non-overlapping sub-sequences that are treated as tokens) and channel independence. iTransformer^[3] inverts the convention and treats each variate's full time series as a single token, applying attention across variates instead of across time. Among the older long-horizon Transformers, Autoformer^[4] and FEDformer factor the series into seasonality and trend components, while Informer instead sparsifies the attention computation.
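The two tokenisation schemes can be sketched in a few lines. This is an illustration of the idea only: the helper names are hypothetical, and the end-padding that PatchTST applies before patching is omitted.

```python
def make_patches(series, patch_len, stride):
    """PatchTST-style tokens: windows of one series (overlapping if stride < patch_len)."""
    last_start = len(series) - patch_len
    return [series[s:s + patch_len] for s in range(0, last_start + 1, stride)]


def variate_tokens(panel):
    """iTransformer-style tokens: one token per variate, holding its full history.

    `panel` is a list of time steps, each a list with one value per variate.
    """
    return [list(column) for column in zip(*panel)]


# One series of 8 steps cut into patches of length 4 with stride 2.
patches = make_patches(list(range(8)), patch_len=4, stride=2)

# Three time steps of a two-variate panel become two tokens of length 3.
tokens = variate_tokens([[1, 10], [2, 20], [3, 30]])
```

In the first case attention runs over patches of a single series; in the second it runs over variates, so the sequence length seen by attention is the number of coins or features rather than the number of time steps.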

On our universe the picture is uneven: Autoformer ranks #2 by HiConv-6-inverse IC, PatchTST ranks #6, and iTransformer #9. All three are positive, but only Autoformer sits near the top. The PatchTST dossier records that it was dropped from active CPCV sweeps for repeatedly producing negative dollar Sharpe folds despite a respectable IC.

2 — MLP mixers and hybrids

A counter-trend family that argues that much of the Transformer's advantage on time series can be recovered by carefully designed MLPs. TSMixer^[5] alternates time-mixing and feature-mixing MLPs; PatchTSMixer adds patching; TSMixerx adds exogenous channels (we feed it the funding-rate stack). TimeMixer adds multi-scale decomposition before mixing.
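A stripped-down sketch of one mixing block, assuming a single linear layer plus ReLU for each mixing step and a residual connection around both. Normalisation, dropout, and the two-layer feature MLP of the published model are left out, and all names are illustrative.

```python
def relu(value):
    return max(0.0, value)


def linear(vector, weights, bias):
    """Dense layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def mixer_block(x, w_time, b_time, w_feat, b_feat):
    """x is a list of L time steps, each a list of C features."""
    # Time mixing: the same L-by-L layer is applied to each feature's history.
    columns = [list(col) for col in zip(*x)]
    mixed = [[c + relu(m) for c, m in zip(col, linear(col, w_time, b_time))]
             for col in columns]
    x = [list(row) for row in zip(*mixed)]
    # Feature mixing: the same C-by-C layer is applied to each time step.
    return [[v + relu(m) for v, m in zip(row, linear(row, w_feat, b_feat))]
            for row in x]


# Two time steps, one feature; the time-mixing weights swap the two steps.
out = mixer_block([[1.0], [2.0]],
                  w_time=[[0.0, 1.0], [1.0, 0.0]], b_time=[0.0, 0.0],
                  w_feat=[[0.0]], b_feat=[0.0])
```

With all weights set to zero the block reduces to the identity because of the residual connections, which is a convenient sanity check when wiring up a new arm.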

Empirically this is the strongest family on our screen. TimeMixer is the best single arm by HiConv-6-inverse IC (+0.2043), with TSMixerx, RMoK, and TSMixer all in the top seven. We treat the family as a cluster: pairwise IC correlation often exceeds +0.95 across these arms^[6], so stacking three of them buys very little diversification.
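A back-of-envelope calculation shows why. The sketch below assumes equal-weighted averaging of unit-variance signals that share one common pairwise correlation, which is a simplification of the measured IC correlations; the function names are illustrative.

```python
def variance_ratio(n_arms, rho):
    """Variance of the equal-weighted average relative to a single arm."""
    return (1 + (n_arms - 1) * rho) / n_arms


def effective_arms(n_arms, rho):
    """Number of independent arms that would give the same variance reduction."""
    return n_arms / (1 + (n_arms - 1) * rho)


stacked = variance_ratio(3, 0.95)      # about 0.967 of single-arm variance
independent = variance_ratio(3, 0.0)   # one third of single-arm variance
n_eff = effective_arms(3, 0.95)        # about 1.03 arms
```

Under these assumptions, three arms correlated at 0.95 behave like roughly 1.03 independent arms, compared with 3 if they were uncorrelated.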

3 — Modern TCNs and FFT-period CNNs

Two distinct cousins. ModernTCN^[7] modernises the temporal convolutional network with large kernels, depthwise separability, and channel mixing — and reaches state-of-the-art RMSE on the long-horizon benchmark in the 918-experiment study^[8]. TimesNet^[9] takes a different route: it uses FFT to discover dominant periods in the input, reshapes the 1-D series into a 2-D tensor along those periods, and runs Inception-style 2-D convolutions.
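The FFT-then-reshape step in TimesNet can be illustrated with a naive discrete Fourier transform. This is a sketch under simplifying assumptions: it keeps only the single strongest frequency (the published model keeps several), zero-pads the tail, and uses hypothetical helper names.

```python
import cmath
import math


def dominant_period(series):
    """Period (in steps) of the strongest non-zero frequency in the series."""
    n = len(series)
    centre = sum(series) / n
    best_k, best_amp = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum((v - centre) * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(series))
        if abs(coeff) > best_amp:
            best_k, best_amp = k, abs(coeff)
    return math.ceil(n / best_k)


def fold_by_period(series, period):
    """Reshape a 1-D series into rows of length `period`, zero-padding the tail."""
    padded = list(series) + [0.0] * (-len(series) % period)
    return [padded[i:i + period] for i in range(0, len(padded), period)]


# A pure sine with an 8-step cycle, observed for 32 steps.
wave = [math.sin(2 * math.pi * t / 8) for t in range(32)]
period = dominant_period(wave)
grid = fold_by_period(wave, period)
```

Each row of `grid` is one cycle, so variation along a row is within-period and variation down a column is across periods; that is the 2-D structure the convolutions then operate on.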

This is also where our cleanest counter-example to literature transfer lives. ModernTCN ranks #2 on the 918-experiment crypto benchmark but #24 on our HiConv-6-inverse screen, with IC near zero. The literature ranks the family on RMSE; we rank it on directional rank correlation against raw forward returns. The two metrics disagree sharply^[8].
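A toy example makes the disagreement concrete. The numbers below are invented for illustration and are not taken from either benchmark: one forecast hugs zero and gets the ordering backwards, the other overshoots threefold but orders the coins correctly.

```python
import math


def rmse(forecast, actual):
    return math.sqrt(sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual))


def ranks(values):
    """1-based ranks; assumes no ties, which holds for the toy data below."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_no_ties(x, y):
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


actual = [0.03, -0.02, 0.01, -0.01]           # realised forward returns
shrunk = [-0.001, 0.001, -0.0005, 0.0005]     # tiny forecasts, wrong ordering
overshoot = [0.09, -0.06, 0.03, -0.03]        # three times too large, right ordering

rmse_shrunk, rmse_overshoot = rmse(shrunk, actual), rmse(overshoot, actual)
ic_shrunk, ic_overshoot = spearman_no_ties(shrunk, actual), spearman_no_ties(overshoot, actual)
```

The shrunk forecast wins on RMSE yet has an IC of -1, while the overshooting forecast loses on RMSE with an IC of +1. A leaderboard sorted by RMSE and a screen sorted by rank correlation would order these two forecasts in opposite ways.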

4 — Recurrent networks (GRU, LSTM, xLSTM)

Still the most useful baseline for a sanity check. GRU^[10] appears in our zoo with a modest but stable IC (+0.061). We track this family primarily because its IC vector is empirically near-orthogonal to the modern mixer cluster — its errors decorrelate. xLSTM is reviewed in our dossier set as a candidate, not yet empirically run.

5 — Tabular trees

LightGBM^[11] is the workhorse for panels with hand-crafted features. Its lambdarank objective^[12] is what we use for cross-sectional ranking in STRAT-04b. The architecture is distinct from every other family in our screen — distinctness ~0.61 against the modern cluster — which makes it our default diversification source.

6 — Linear floors

The DLinear / NLinear family^[13] argues that with reversible instance normalisation (RevIN), a single linear layer is competitive on many time-series benchmarks. We don't treat these as a competitive arm — we treat them as a floor. If a deeper model can't beat NLinear with RevIN by a meaningful margin, it isn't contributing.
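A minimal sketch of such a floor, assuming a single linear layer wrapped in reversible instance normalisation. The learnable affine parameters of RevIN are omitted, and the weights shown are hand-picked for illustration rather than trained.

```python
import math
from statistics import mean, pvariance


def revin_linear_forecast(window, weights, bias, eps=1e-5):
    """Normalise the input window, apply one linear layer, undo the normalisation."""
    mu = mean(window)
    scale = math.sqrt(pvariance(window, mu) + eps)
    normed = [(v - mu) / scale for v in window]
    # One output row per forecast step.
    out = [sum(w * v for w, v in zip(row, normed)) + b
           for row, b in zip(weights, bias)]
    return [o * scale + mu for o in out]


window = [100.0, 101.0, 103.0, 102.0]

# Weights that copy the last observation: a "no change" forecast for two steps.
carry_last = revin_linear_forecast(window, [[0, 0, 0, 1], [0, 0, 0, 1]], [0.0, 0.0])

# All-zero weights fall back to the window mean after de-normalisation.
window_mean = revin_linear_forecast(window, [[0, 0, 0, 0]], [0.0])
```

The normalisation means the layer only has to learn the shape of the window, not its level, which is a large part of why so small a model is hard to beat.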

7 — LOB-specialised architectures

A family designed for limit order book data rather than candles. DeepLOB^[14] uses a CNN + Inception + LSTM stack on the L2 ladder and famously reported ~70% directional accuracy on London Stock Exchange data. LOBCAST^[15] later showed that this family generalises poorly off the original FI-2010 benchmark. More recent work (TLOB^[16]) reports modest improvements (BTC F1 +1.1 over SoTA). We hold DeepLOB as a frontier item gated on the L2 backfill EXP-006 and do not yet have a Hyperliquid IC for it.

8 — Foundation models

Pretrained zero-shot forecasters. Chronos / Chronos-Bolt^[17], TimesFM^[18], and Moirai^[19] have shown impressive zero-shot transfer across diverse benchmarks. Independent replication has been mixed — see Zhang & Gilpin's 2026 context-parroting analysis^[20]. Our position: dossier-only until we can put one of them through the same Pass-A / Pass-B funnel as the empirically-tested arms.

9 — Conformal prediction for sizing

A wrapper, not an architecture. Conformal prediction^[21] produces distribution-free prediction intervals around any underlying forecaster. The interesting use in our setting is not direction but sizing: a wider interval signals more model uncertainty, which argues for a smaller position. Frontier-only; dossier review covers JRFM 2024's 4,000-asset crypto VaR study.
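The sizing idea can be sketched with plain split conformal prediction on a recent calibration window. Note this is the basic split variant rather than the adaptive method of the cited paper, and the sizing rule (scale by a reference width, capped at full size) is a hypothetical choice made for illustration.

```python
import math


def conformal_half_width(calibration_residuals, alpha=0.1):
    """Half-width of a (1 - alpha) split-conformal interval from absolute residuals."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Finite-sample quantile index; round() guards against floating-point noise.
    k = math.ceil(round((n + 1) * (1 - alpha), 9))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return scores[k - 1]


def position_scale(half_width, reference_half_width):
    """Hypothetical rule: full size at or below the reference width, smaller above it."""
    if half_width <= 0:
        return 1.0
    return min(1.0, reference_half_width / half_width)


# Nineteen calibration residuals with absolute values 1..19 (signs alternate).
residuals = [(-1) ** i * float(i) for i in range(1, 20)]
half_width = conformal_half_width(residuals, alpha=0.1)
scale = position_scale(half_width, reference_half_width=9.0)
```

With absolute-residual scores the interval width is the same for every forecast made from one calibration window, so in this form the sizing signal varies over time as the window rolls rather than across coins at a single timestamp.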

10 — Self-supervised pretraining

Masked time-step or contrastive pretraining on the full unlabelled panel before fine-tuning on the forecasting task. Reviewed in our dossier set alongside the FITS critique^[22] and Kronos^[23]. Not yet on the screen.

11 — LLM-based time-series models

Time-LLM^[24], GPT4TS, and the broader use of frozen language model backbones for forecasting. Tan et al. (NeurIPS 2024)^[25] show that a simple patch-plus-attention baseline with no language model (“PAttn”) recovers most of the reported gains, which sets a high bar for the family. Dossier-only.

Validation, costs, and what we leave out

Three methodological references are in heavier rotation across our research base than any individual architecture paper:

López de Prado's Advances in Financial Machine Learning^[26] for purged cross-validation, embargoing, and the discipline around metric selection.
Bailey & López de Prado's Deflated Sharpe Ratio^[27] for the multiple-testing correction we apply to candidates that survive both Pass A and Pass B.
Saidd et al.'s 918-experiment benchmark^[8] — both as a literature comparison and as a warning that minimum-RMSE leaderboards do not select for directional accuracy.
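To illustrate the first of these, here is a simplified purged k-fold splitter with an optional embargo. It assumes each sample's label looks `horizon` bars ahead and that folds are contiguous blocks; it is plain purged k-fold, not the combinatorial variant, and the function name and parameters are illustrative.

```python
def purged_kfold(n_samples, n_splits, horizon, embargo=0):
    """Yield (train, test) index lists with label-overlap purging around each test block."""
    fold = n_samples // n_splits
    splits = []
    for k in range(n_splits):
        start = k * fold
        stop = n_samples if k == n_splits - 1 else start + fold
        test = list(range(start, stop))
        # Keep a training sample only if its label window [i, i + horizon]
        # ends before the test block starts, or starts after every test
        # label window (plus the embargo) has ended.
        train = [i for i in range(n_samples)
                 if i + horizon < start or i > stop - 1 + horizon + embargo]
        splits.append((train, test))
    return splits


# 100 hourly bars, 5 folds, labels that look 24 bars ahead.
splits = purged_kfold(n_samples=100, n_splits=5, horizon=24)
train_mid, test_mid = splits[2]
```

With a 24-bar horizon each interior fold loses 24 training bars on either side of the test block, so the purge is a meaningful fraction of a short history and should be budgeted for when choosing the number of folds.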

What this survey deliberately leaves out: pure-momentum factor models for crypto, reinforcement learning agents (we keep a small Decision Transformer dossier under EXP-017 but do not consider it competitive yet), classical statistical models (ARIMA, GARCH — covered briefly as baselines in the linear-floor section), and market-making microstructure literature beyond the LOB classification family.

A working assumption

The literature ranks architectures on RMSE on common benchmarks. We rank them on cross-sectional rank correlation against raw forward returns at h = 24 on a Hyperliquid panel, then on dollar Sharpe at 4 bps per leg. Where the two rankings disagree — as with ModernTCN — we follow the in-program measurement.
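The deployment metric can be sketched as follows, assuming a single instrument, a cost of 4 bps charged on every dollar of position change (so entry and exit each pay one leg), and annualisation over 24 x 365 hourly bars. These conventions and the function names are assumptions for illustration, not the program's accounting code.

```python
import math
from statistics import mean, stdev

HOURS_PER_YEAR = 24 * 365  # assumes round-the-clock trading


def net_pnl(positions, returns, cost_bps=4.0):
    """Per-bar dollar PnL after charging cost_bps on each change in position."""
    rate = cost_bps / 10_000
    pnl, previous = [], 0.0
    for position, ret in zip(positions, returns):
        traded = abs(position - previous)
        pnl.append(position * ret - rate * traded)
        previous = position
    return pnl


def annualised_sharpe(pnl, periods_per_year=HOURS_PER_YEAR):
    """Mean over sample standard deviation of per-bar PnL, scaled to a year."""
    spread = stdev(pnl)
    if spread == 0:
        return 0.0
    return mean(pnl) / spread * math.sqrt(periods_per_year)


# One dollar long for two bars, then flat: one entry leg and one exit leg.
positions = [1.0, 1.0, 0.0, 0.0]
returns = [0.01, 0.01, 0.0, 0.0]
gross = net_pnl(positions, returns, cost_bps=0.0)
net = net_pnl(positions, returns, cost_bps=4.0)
```

Comparing `annualised_sharpe(gross)` with `annualised_sharpe(net)` on the same positions shows how much of a strategy's apparent edge is consumed by turnover, which is the gap an IC-only ranking cannot see.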

References

Cited works

[1] Axon Ridge internal — Calculation Glossary, §3 & §5.3. The IC ↔ Sharpe distinction is enforced repo-wide.
[2] Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. (2023). A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. ICLR. arXiv:2211.14730
[3] Liu, Y. et al. (2024). iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. ICLR. arXiv:2310.06625
[4] Wu, H. et al. (2021). Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. NeurIPS. arXiv:2106.13008
[5] Chen, S.-A. et al. (2023). TSMixer: An All-MLP Architecture for Time Series Forecasting. arXiv:2303.06053 · PatchTSMixer: arXiv:2306.09364
[6] Axon Ridge — IC pattern analysis, 2026-05-15. Internal; documents pairwise IC correlations across the 28-arm roster.
[7] Luo, D., & Wang, X. (2024). ModernTCN: A Modern Pure Convolution Structure for General Time Series Analysis. ICLR. OpenReview: vpJMJerXHU
[8] Saidd, M., et al. (2026). A 918-experiment empirical study of long-horizon forecasters across crypto, FX and equities. arXiv:2603.16886
[9] Wu, H. et al. (2023). TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. ICLR. arXiv:2210.02186
[10] Cho, K. et al. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP. arXiv:1406.1078
[11] Ke, G. et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NeurIPS.
[12] Burges, C. J. C. (2010). From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research TR-2010-82.
[13] Zeng, A. et al. (2023). Are Transformers Effective for Time Series Forecasting? AAAI. arXiv:2205.13504 · RLinear / NLinear-RevIN: arXiv:2305.10721
[14] Zhang, Z., Zohren, S., & Roberts, S. (2019). DeepLOB: Deep Convolutional Neural Networks for Limit Order Books. IEEE TSP. arXiv:1808.03668
[15] Briola, A. et al. (2023). LOBCAST: An Open Source Framework for Limit Order Book Classification. arXiv:2308.01915
[16] Briola, A. et al. (2025). TLOB: Transformers for Limit Order Books. arXiv:2502.15757
[17] Ansari, A. F. et al. (2024). Chronos: Learning the Language of Time Series. arXiv:2403.07815
[18] Das, A. et al. (2024). A decoder-only foundation model for time-series forecasting. ICML. arXiv:2310.10688
[19] Woo, G. et al. (2024). Unified Training of Universal Time Series Forecasting Transformers. ICML. arXiv:2402.02592
[20] Zhang, J., & Gilpin, W. (2026). Are Foundation Models Forecasting or Context-Parroting? ICLR.
[21] Gibbs, I., & Candès, E. (2021). Adaptive Conformal Inference Under Distribution Shift. NeurIPS.
[22] FITS critique. (2024). On the limits of linear FFT baselines for time series forecasting. arXiv:2410.18318
[23] Kronos: self-supervised pretrain on cross-asset panels. AAAI 2026.
[24] Jin, M. et al. (2024). Time-LLM: Time Series Forecasting by Reprogramming Large Language Models. ICLR.
[25] Tan, M. et al. (2024). Are Language Models Actually Useful for Time Series Forecasting? NeurIPS. arXiv:2406.16964
[26] López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.
[27] Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio. Journal of Portfolio Management.