Mechanism diagram · Transformer / Attention
How it works
iTransformer inverts the conventional axis assignment: instead of one token per time step, each variate's full lookback series is embedded as a single token in the Transformer.
Self-attention is computed across variates (across coins), not across time positions, so it explicitly models cross-variate structure.
A feed-forward network (FFN), shared across variates, is then applied to each variate token independently to process its time-axis information.
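The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `inverted_attention_block`, the dimensions `d_model` and `d_ff`, the random weights, and the 96-step by 15-coin input are all assumptions made for the example, and layer normalisation, multiple heads, and the forecasting head are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inverted_attention_block(x, d_model=16, d_ff=32, seed=0):
    """x has shape (T, N): T lookback steps, N variates (coins).

    Hypothetical single-head block with random weights, for illustration only.
    """
    rng = np.random.default_rng(seed)
    T, N = x.shape

    # Inverted embedding: each variate's whole series becomes one token.
    W_embed = rng.normal(scale=T ** -0.5, size=(T, d_model))
    tokens = x.T @ W_embed                      # (N, d_model)

    # Self-attention across variate tokens, not across time positions.
    W_q, W_k, W_v = (
        rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
        for _ in range(3)
    )
    q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
    attn = softmax(q @ k.T / np.sqrt(d_model))  # (N, N) coin-to-coin weights
    h = tokens + attn @ v                       # residual connection

    # Shared FFN applied to each variate token independently.
    W1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
    W2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))
    out = h + np.maximum(h @ W1, 0.0) @ W2      # (N, d_model)
    return out, attn

# Assumed example input: 96 lookback steps for 15 coins.
x = np.random.default_rng(1).normal(size=(96, 15))
out, attn = inverted_attention_block(x)
print(out.shape, attn.shape)  # (15, 16) (15, 15)
```

The attention matrix is N by N, one row per coin, which is what makes the cross-variate structure directly inspectable.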
Pros and cons on this universe
Pros
- The only top-15 arm built on a native cross-variate attention thesis, which fits a cross-sectional trader's mental model.
- Strong literature track record (ICLR 2024).
- Rank #9 IC on this universe.
Cons / failure modes
- Attention runs across variates, so its cost is O(N²) in the number of coins N rather than in the lookback length.
- Pairwise IC correlation ≈ +0.96 with TSMixer, so much of its signal is redundant.
- 918-paper benchmark places it mid-tier on RMSE.
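To make the O(N²) point in the cons concrete, the sketch below counts entries in the attention score matrix for a few universe sizes. The helper name `attention_score_entries` and the coin counts are hypothetical, chosen only to show the scaling; they are not figures from this universe.

```python
def attention_score_entries(n_tokens: int) -> int:
    # A full self-attention score matrix has one entry per token pair.
    return n_tokens * n_tokens

# Hypothetical universe sizes; with variate tokens, N is the number of coins.
for n_coins in (15, 50, 500):
    print(n_coins, attention_score_entries(n_coins))
# 15 225
# 50 2500
# 500 250000
```

Under this counting, lengthening the lookback leaves the score matrix unchanged, while adding coins grows it quadratically.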