Vanilla Transformer — Axon Ridge Capital

Mechanism diagram · Transformer / Attention

How it works

The canonical architecture of Vaswani et al. (2017): multi-head self-attention with position-wise feed-forward layers, positional encoding, and encoder and decoder stacks.

No domain-specific tricks: no patching, no decomposition, no exogenous routing.

On time series, only the encoder stack is used, followed by a regression head.
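
The encoder-only setup described above can be sketched as a NumPy forward pass. This is a minimal illustration, not the configuration used on this screen: the layer sizes, the random weights, mean pooling before the regression head, and all function names (`init_params`, `encoder_forecast`, and so on) are assumptions, and biases and learnable layer-norm gains are left out for brevity.

```python
import numpy as np

def sinusoidal_positions(length, d_model):
    """Fixed sinusoidal positional encoding (d_model assumed even)."""
    pos = np.arange(length)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def multi_head_self_attention(x, wq, wk, wv, wo, n_heads):
    length, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (length, d_model) -> (n_heads, length, d_head)
        return t.reshape(length, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, L, L)
    heads = softmax(scores) @ v                           # (n_heads, L, d_head)
    merged = heads.transpose(1, 0, 2).reshape(length, d_model)
    return merged @ wo

def init_params(n_features, d_model=8, d_ff=16, n_layers=2, seed=0):
    """Random weights for illustration only; sizes are hypothetical."""
    rng = np.random.default_rng(seed)

    def w(*shape):
        return rng.normal(scale=0.1, size=shape)

    layers = [
        dict(wq=w(d_model, d_model), wk=w(d_model, d_model),
             wv=w(d_model, d_model), wo=w(d_model, d_model),
             w1=w(d_model, d_ff), w2=w(d_ff, d_model))
        for _ in range(n_layers)
    ]
    return dict(w_in=w(n_features, d_model), layers=layers, w_out=w(d_model))

def encoder_forecast(window, params, n_heads=2):
    """Map a lookback window of shape (L, n_features) to one scalar forecast."""
    length = window.shape[0]
    d_model = params["w_in"].shape[1]
    x = window @ params["w_in"] + sinusoidal_positions(length, d_model)
    for layer in params["layers"]:
        attn = multi_head_self_attention(
            x, layer["wq"], layer["wk"], layer["wv"], layer["wo"], n_heads)
        x = layer_norm(x + attn)                          # residual + post-norm
        ff = np.maximum(0.0, x @ layer["w1"]) @ layer["w2"]  # ReLU feed-forward
        x = layer_norm(x + ff)
    # Regression head: mean-pool over time, then a linear map (an assumption).
    return float(x.mean(axis=0) @ params["w_out"])

# Toy usage on a random 24-step window with 3 features.
params = init_params(n_features=3)
window = np.random.default_rng(1).normal(size=(24, 3))
print(encoder_forecast(window, params))
```

There is no decoder and no causal mask here: every time step in the window attends to every other step, and the pooled representation feeds a single linear output.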

Pros and cons on this universe

Pros

Surprisingly strong on our screen (rank #5) without any time-series-specific modifications.
Deep short signal on LINK (campaign-max −0.223 per-coin IC).
Useful as a sanity check that decomposition / patching tricks are not the only source of edge.
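
The page does not define "per-coin IC". A common reading, assumed here, is the Spearman rank correlation between one coin's predicted and realized returns across time. The sketch below computes it in plain Python; the function names and the toy numbers are hypothetical.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def information_coefficient(predicted, realized):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(predicted), ranks(realized))

# Toy series (made up): predictions that mostly move against realized returns.
predicted = [0.02, -0.01, 0.03, 0.00, -0.02]
realized = [-0.01, 0.02, -0.03, 0.01, 0.00]
print(information_coefficient(predicted, realized))
```

Under this reading, a negative IC means the model's ranking of periods runs against the realized returns for that coin.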

Cons / failure modes

O(L²) attention cost in the lookback length L, so it is slow at long lookbacks.
The long-term time-series forecasting (LTSF) literature reports that simple linear baselines beat vanilla Transformers on most benchmarks.
Cluster-redundant with patched and mixer variants on this universe.
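
To make the quadratic attention cost concrete, the sketch below counts the entries of the L x L score matrix and the multiply-adds in the two matrix products that involve it (queries times keys, then weights times values). The defaults `d_model=64` and `n_layers=2` are hypothetical, and projections and feed-forward layers, which scale linearly in L, are ignored.

```python
def attention_cost(lookback, d_model=64, n_layers=2):
    """Return (score entries per head per layer, multiply-adds in the L x L matmuls).

    Per layer, QK^T and weights @ V each cost L * L * d_head multiply-adds
    per head, i.e. L * L * d_model summed over all heads.
    """
    score_entries = lookback * lookback
    macs = 2 * score_entries * d_model * n_layers
    return score_entries, macs

for lookback in (96, 192, 384, 768):
    entries, macs = attention_cost(lookback)
    print(f"L={lookback:4d}  score entries={entries:8d}  multiply-adds={macs:12d}")
```

Doubling the lookback quadruples both numbers, which is why this term dominates at long lookbacks.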

References