Mechanism diagram · Transformer / Attention
[Diagram: patch / attention Transformer pipeline. Input series (lookback × N coins) → RevIN (norm + reverse) → patching (L → P patches) → linear projection to d-model → Transformer encoder (multi-head self-attention + FFN, × L) → flatten + head (horizon h) → forecast (h × N coins).]

How it works

The canonical architecture of Vaswani et al. (2017): multi-head self-attention with feed-forward layers, positional encoding, and stacked encoder and decoder blocks.

No domain-specific tricks: no patching, no decomposition, no exogenous routing.

For time series, it is applied encoder-only, with a regression head producing the forecast.
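
A minimal sketch of the encoder-only idea, in plain Python so that it runs without dependencies. It is not the implementation behind the screen. It shows one single-head self-attention layer with sinusoidal positional encoding and a residual connection, followed by a flatten-and-linear regression head. Multi-head splitting, the feed-forward sublayer, layer normalisation and training are all omitted. Every name, size and weight here (`forecast`, `params`, `lookback = 8`, `d_model = 4`, random weights) is an assumption made for illustration.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def positional_encoding(length, d_model):
    """Fixed sinusoidal encoding: sin on even dims, cos on odd dims."""
    pe = [[0.0] * d_model for _ in range(length)]
    for pos in range(length):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention across all time steps."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    k_t = [list(col) for col in zip(*k)]          # d x L
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, k_t)                       # L x L: every step vs every step
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def init(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def forecast(series, params):
    """series: lookback x n_coins. Returns a flat vector of horizon * n_coins values."""
    x = matmul(series, params["w_in"])            # project each time step to d_model
    pe = positional_encoding(len(x), len(x[0]))
    x = [[a + b for a, b in zip(r, p)] for r, p in zip(x, pe)]
    attended, weights = self_attention(x, params["wq"], params["wk"], params["wv"])
    x = [[a + b for a, b in zip(r, p)] for r, p in zip(x, attended)]  # residual
    flat = [v for row in x for v in row]          # flatten encoder output
    return matmul([flat], params["w_head"])[0], weights  # linear regression head


rng = random.Random(0)
lookback, n_coins, d_model, horizon = 8, 3, 4, 2  # toy sizes, not the screen's
params = {
    "w_in": init(n_coins, d_model, rng),
    "wq": init(d_model, d_model, rng),
    "wk": init(d_model, d_model, rng),
    "wv": init(d_model, d_model, rng),
    "w_head": init(lookback * d_model, horizon * n_coins, rng),
}
series = [[rng.gauss(0.0, 1.0) for _ in range(n_coins)] for _ in range(lookback)]
pred, weights = forecast(series, params)
print(len(pred), "forecast values;", len(weights), "x", len(weights[0]), "attention matrix")
```

Each time step is one token here, which is what "no patching" amounts to: the attention matrix has one row and one column per step of the lookback window.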

Pros and cons on this universe

Pros

  • Surprisingly strong on our screen (rank #5) without any time-series-specific modifications.
  • Deep short signal on LINK (campaign-max −0.223 per-coin IC).
  • Useful as a sanity check that decomposition / patching tricks are not the only source of edge.
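
The page does not define per-coin IC. The sketch below assumes the common convention: the rank (Spearman) correlation between a model's forecasts for one coin and the returns that coin subsequently realized. The function names and the toy numbers are hypothetical and are not taken from the campaign.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def information_coefficient(forecasts, realized):
    """Assumed definition: Spearman correlation = Pearson correlation of ranks."""
    return pearson(ranks(forecasts), ranks(realized))


# Toy data for a single coin (hypothetical, for illustration only).
forecasts = [0.02, -0.01, 0.03, 0.00, -0.02]
realized = [-0.01, 0.02, -0.03, 0.01, 0.00]
ic = information_coefficient(forecasts, realized)
print(round(ic, 3))
```

Under this convention an IC near zero means the forecasts carry no rank information about the realized returns, and the sign gives the direction of the relationship.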

Cons / failure modes

  • Attention cost is O(L²) in the lookback length L, so long lookbacks are slow.
  • The literature consistently shows simple linear baselines beating vanilla Transformers on most long-term time-series forecasting (LTSF) benchmarks.
  • Cluster-redundant with patched and mixer variants on this universe.
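
The quadratic cost can be made concrete with a little arithmetic. With one token per time step, each attention head in each layer fills an L × L score matrix. The sketch below counts those entries and, for contrast, the count when the window is first cut into patches. The lookback lengths, patch length and stride are hypothetical values chosen for illustration, and the patch count assumes no padding.

```python
def attention_score_entries(n_tokens):
    """Entries in one head's score matrix: every token attends to every token."""
    return n_tokens * n_tokens


def num_patches(lookback, patch_len, stride):
    """Number of sliding-window patches, assuming no padding at the end."""
    return (lookback - patch_len) // stride + 1


# Doubling the lookback quadruples the per-head, per-layer score matrix.
for lookback in (96, 192, 384):
    print(lookback, attention_score_entries(lookback))

# Same window, tokenised as patches instead of single time steps.
patched_tokens = num_patches(384, patch_len=16, stride=8)
print(patched_tokens, attention_score_entries(patched_tokens))
```

Token count, not raw lookback, is what drives attention cost, which is why the unpatched model is the one that slows down most as the window grows.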

References