Mechanism diagram · Transformer / Attention
[Diagram: patch / attention transformer. Input series (lookback × N coins) → RevIN (norm + reverse) → Patching (L → P patches) → Linear projection (→ d-model) → Transformer encoder (multi-head self-attention + FFN, × L) → Flatten + head (→ horizon h) → Forecast (h × N coins)]

How it works

TFT (Temporal Fusion Transformer) routes inputs through variable selection networks: gates that learn how much weight each input feature receives at each time step.
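
As a rough illustration of the gating idea, the sketch below weights the features at a single time step with a softmax over per-feature scores. It is a simplification, not the dossier's implementation: the function names, the scalar features, and the `gate_scores` logits are all hypothetical, and in a real TFT the scores would come from learned gated residual networks rather than being passed in by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_variables(features, gate_scores):
    """Gate the features of one time step (hypothetical helper).

    features:    one scalar value per input feature
    gate_scores: one logit per feature (assumed to be learned elsewhere)
    Returns the selection weights and the gated combination.
    """
    if len(features) != len(gate_scores):
        raise ValueError("need exactly one gate score per feature")
    weights = softmax(gate_scores)
    gated = sum(w * x for w, x in zip(weights, features))
    return weights, gated


# Hypothetical time step with three features and made-up logits.
weights, gated = select_variables([0.5, -1.0, 2.0], [2.0, 0.0, -1.0])
```

The weights sum to one at every time step, so reading them off per feature is what makes the selection interpretable.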

An LSTM encodes local temporal patterns.

Multi-head self-attention captures longer-range dependencies.
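
For intuition, a single attention head can be sketched as scaled dot-product attention over the sequence. The code below is a minimal pure-Python version under stated assumptions: one head only, no learned projections, and a hypothetical function name `attention`. A multi-head layer runs several such heads in parallel on learned projections of the input.

```python
import math


def attention(queries, keys, values):
    """Single-head scaled dot-product attention (illustrative sketch).

    queries, keys: lists of equal-length vectors
    values:        one vector per key
    Returns one output vector per query.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs
```

Because every query scores every key directly, a time step can draw on any other position in the lookback window regardless of distance.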

The output head is trained with a quantile loss and emits one forecast per quantile, giving a probabilistic output rather than a single point estimate.
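
The quantile (pinball) loss behind such a head can be written in a few lines. This is a minimal sketch, not the dossier's training code: the function names and the dictionary layout of `preds_by_quantile` are assumptions made for illustration.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss for one observation at quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    diff = y_true - y_pred
    # Under-prediction is penalized by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1.0) * diff)


def quantile_loss(y_true, preds_by_quantile):
    """Mean pinball loss over all observations and quantile levels.

    y_true:            list of realized values
    preds_by_quantile: dict mapping quantile level -> list of predictions
                       (hypothetical layout, one prediction per observation)
    """
    total = 0.0
    count = 0
    for q, preds in preds_by_quantile.items():
        for y, p in zip(y_true, preds):
            total += pinball_loss(y, p, q)
            count += 1
    return total / count
```

At q = 0.9, missing low costs nine times as much as missing high by the same amount, which pushes that output toward the 90th percentile of the target.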

Pros and cons on this universe

Pros

  • Interpretable feature gates — variable selection makes the model's attention legible.
  • Strong BNB long (+0.307 per-coin IC).
  • Rank #15 IC.
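
The IC figures above are not defined on this page. Assuming IC means the information coefficient computed as a Spearman rank correlation between forecasts and realized returns (an assumption, not something the dossier states), a per-coin value could be sketched as follows. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_ic(forecasts, realized):
    """Spearman rank correlation between forecasts and realized returns."""
    if len(forecasts) != len(realized) or len(forecasts) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    rf, rr = average_ranks(forecasts), average_ranks(realized)
    n = len(rf)
    mf, mr = sum(rf) / n, sum(rr) / n
    cov = sum((a - mf) * (b - mr) for a, b in zip(rf, rr))
    var_f = sum((a - mf) ** 2 for a in rf)
    var_r = sum((b - mr) ** 2 for b in rr)
    if var_f == 0.0 or var_r == 0.0:
        raise ValueError("rank correlation is undefined for a constant series")
    return cov / math.sqrt(var_f * var_r)
```

Under this reading, a per-coin IC is the correlation computed on one coin's forecast and return series over time.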

Cons / failure modes

  • Dossier currently recommends diagnostic-only use — hyperparameter-fragile in our experience.
  • Published “80% return” financial-application claims have been debunked in the wider literature.
  • Heavyweight architecture relative to mixer alternatives.

References