Mechanism diagram · Transformer / Attention
How it works
The canonical Vaswani et al. (2017) architecture: multi-head self-attention with position-wise feed-forward layers, positional encoding, and encoder and decoder stacks.
No domain-specific tricks: no patching, no decomposition, no exogenous routing.
On time series, applied as encoder-only with a regression head.
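As a rough illustration of the encoder-only setup described above, the following NumPy sketch runs one post-norm encoder layer over a lookback window and mean-pools into a linear regression head. Everything specific here is an assumption made for illustration, not the configuration used on the screen: the names `init_params` and `encoder_forecast`, the single layer, the mean pooling, the random untrained weights, and the toy sizes (24 steps, 3 features, `d_model=8`). Biases and dropout are omitted for brevity.

```python
import numpy as np

def sinusoidal_positions(length, d_model):
    """Fixed sinusoidal positional encoding, shape (length, d_model)."""
    pos = np.arange(length)[:, None]
    idx = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, 2 * (idx // 2) / d_model)
    # Even channels use sine, odd channels use cosine.
    return np.where(idx % 2 == 0, np.sin(angles), np.cos(angles))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads):
    length, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (length, d_model) -> (n_heads, length, d_head)
        return t.reshape(length, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, L, L)
    weights = softmax(scores)
    heads = weights @ v                                   # (n_heads, L, d_head)
    merged = heads.transpose(1, 0, 2).reshape(length, d_model)
    return merged @ w_o, weights

def init_params(rng, n_features, d_model, d_ff):
    """Hypothetical random (untrained) weights, for shape checking only."""
    def w(*shape):
        return rng.normal(scale=shape[0] ** -0.5, size=shape)
    return {
        "w_in": w(n_features, d_model),
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
        "w_1": w(d_model, d_ff), "w_2": w(d_ff, d_model),
        "w_head": w(d_model),
    }

def encoder_forecast(window, params, n_heads=2):
    """Map a (lookback, n_features) window to one scalar forecast."""
    d_model = params["w_in"].shape[1]
    x = window @ params["w_in"] + sinusoidal_positions(len(window), d_model)
    attn, weights = multi_head_self_attention(
        x, params["w_q"], params["w_k"], params["w_v"], params["w_o"], n_heads
    )
    x = layer_norm(x + attn)                              # residual + norm
    ffn = np.maximum(x @ params["w_1"], 0.0) @ params["w_2"]  # ReLU feed-forward
    x = layer_norm(x + ffn)
    # Regression head: mean-pool over time, then a linear map to a scalar.
    return float(x.mean(axis=0) @ params["w_head"]), weights

rng = np.random.default_rng(0)
params = init_params(rng, n_features=3, d_model=8, d_ff=16)
window = rng.normal(size=(24, 3))  # toy lookback of 24 steps, 3 features
pred, weights = encoder_forecast(window, params, n_heads=2)
```

There is no decoder stack and no patching or decomposition step in this sketch; the only time-series-specific piece is the regression head.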
Pros and cons on this universe
Pros
- Surprisingly strong on our screen (rank #5) without any time-series-specific modifications.
- Deep short signal on LINK (campaign-max −0.223 per-coin IC).
- Useful as a sanity check that decomposition / patching tricks are not the only source of edge.
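The notes do not define per-coin IC. A minimal sketch, assuming it is the Spearman rank correlation between one coin's forecasts and its realized forward returns over time, could look like the following. The function names and the four-point toy series are hypothetical and are not data from the screen.

```python
from statistics import mean

def average_ranks(values):
    """Ranks starting at 1, with ties assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for j in range(start, end + 1):
            ranks[order[j]] = tied_rank
        start = end + 1
    return ranks

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def per_coin_ic(predictions, realized):
    """Assumed definition: Spearman correlation = Pearson correlation of ranks."""
    return pearson(average_ranks(predictions), average_ranks(realized))

# Toy series: forecasts rank in exactly the opposite order to realized returns.
predictions = [0.3, 0.1, 0.4, 0.2]
realized = [-0.02, 0.01, -0.03, 0.00]
ic = per_coin_ic(predictions, realized)  # -1.0 for this toy input
```

Under this assumed definition, a negative value means forecasts and realized returns tend to rank in opposite order for that coin.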
Cons / failure modes
- O(L²) attention cost in lookback length L, so training and inference are slow at long lookbacks.
- The literature repeatedly reports simple linear baselines beating vanilla Transformers on most long-term time series forecasting (LTSF) benchmarks.
- Cluster-redundant with patched and mixer variants on this universe.
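To make the quadratic attention cost listed above concrete, this small sketch counts attention-score entries and approximate multiply-adds per forward pass for a few lookback lengths. The lookback values, `d_model=64`, and the head and layer counts are illustrative assumptions, not settings from the screen. The count covers only the two L×L matrix products in attention, not the projections or the feed-forward layers.

```python
def attention_score_entries(lookback, n_heads=1, n_layers=1):
    """Entries in the L x L score matrices for one sequence."""
    return n_layers * n_heads * lookback * lookback

def attention_multiply_adds(lookback, d_model, n_layers=1):
    """Approximate multiply-adds for Q @ K^T and weights @ V, summed over heads."""
    return n_layers * 2 * lookback * lookback * d_model

costs = {
    lookback: (
        attention_score_entries(lookback, n_heads=4, n_layers=2),
        attention_multiply_adds(lookback, d_model=64, n_layers=2),
    )
    for lookback in (96, 192, 384, 768)
}

for lookback, (entries, macs) in costs.items():
    print(f"L={lookback:4d}  score entries={entries:>10,d}  multiply-adds={macs:>13,d}")
```

Doubling the lookback quadruples both counts, which is why the cost bites mainly at long lookbacks.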