Mechanism diagram · Transformer / Attention
[Diagram: Patch / Attention Transformer. Input series (lookback × N coins) → RevIN (norm + reverse) → Patching (L → P patches) → Linear projection (→ d-model) → Transformer encoder (multi-head self-attention + FFN, × L) → Flatten + head (→ horizon h) → Forecast (h × N coins)]

How it works

iTransformer inverts the conventional axis assignment. Each variate's full lookback series becomes one token in the Transformer.

Self-attention is computed across variates (across coins), not across time positions, so the attention weights explicitly model cross-variate structure.

The feed-forward (FFN) layers are then applied to each variate token independently and process its time-axis information.
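The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the function name `inverted_attention_layer`, the random untrained weights, the single attention head, `d_model=16`, and the omission of layer normalisation are all simplifications introduced here.

```python
import numpy as np


def inverted_attention_layer(x, d_model=16, seed=0):
    """Toy inverted-Transformer layer (hypothetical helper, random weights).

    x has shape (lookback, n_variates). Returns the output tokens with shape
    (n_variates, d_model) and the attention weights with shape
    (n_variates, n_variates).
    """
    rng = np.random.default_rng(seed)
    lookback, _ = x.shape

    # Inverted embedding: each variate's whole lookback series becomes one token.
    w_embed = rng.normal(scale=lookback ** -0.5, size=(lookback, d_model))
    tokens = x.T @ w_embed  # (n_variates, d_model)

    # Single-head self-attention across variate tokens, not across time steps.
    w_q, w_k, w_v = (
        rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
        for _ in range(3)
    )
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(d_model)  # (n_variates, n_variates)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)  # row-wise softmax
    mixed = tokens + attn @ v  # residual connection

    # FFN applied to each variate token independently (same weights per token).
    w1 = rng.normal(scale=d_model ** -0.5, size=(d_model, 4 * d_model))
    w2 = rng.normal(scale=(4 * d_model) ** -0.5, size=(4 * d_model, d_model))
    out = mixed + np.maximum(mixed @ w1, 0.0) @ w2  # ReLU FFN plus residual
    return out, attn


if __name__ == "__main__":
    # Illustrative sizes only: 96 time steps, 5 coins.
    x = np.random.default_rng(1).normal(size=(96, 5))
    out, attn = inverted_attention_layer(x)
    print(out.shape, attn.shape)
```

The attention matrix has one row and one column per coin, which is where the cross-variate structure lives; the lookback length only affects the embedding step.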

Pros and cons on this universe

Pros

  • The only top-15 arm whose thesis is native cross-variate attention, which fits a cross-sectional trader's mental model.
  • Strong literature track record (ICLR 2024).
  • Rank #9 IC on this universe.

Cons / failure modes

  • Because attention runs across variates, its cost is O(N²) in the number of coins N rather than in the lookback length.
  • Pairwise IC correlation ≈ +0.96 with TSMixer — much of its signal is redundant.
  • The 918-paper benchmark places it mid-tier on RMSE.
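The O(N²) point in the cons list can be made concrete by counting pairwise attention scores per head per layer. The helper `attention_score_entries`, the lookback of 96 and the coin counts below are hypothetical values chosen for illustration, not figures from this universe.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Number of pairwise attention scores for one head in one layer."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    lookback = 96  # assumed lookback length, for comparison only
    time_scores = attention_score_entries(lookback)
    for n_coins in (15, 50, 200):  # assumed universe sizes
        variate_scores = attention_score_entries(n_coins)
        print(
            f"N={n_coins}: {variate_scores} scores across variates, "
            f"versus {time_scores} across {lookback} time steps"
        )
```

Doubling the number of coins quadruples the score count, while lengthening the lookback leaves it unchanged and only enlarges the per-variate embedding.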

References