Mechanism diagram · Transformer / Attention
[Diagram: patch / attention Transformer. Input series (lookback × N coins) → RevIN norm + reverse → patching (L → P patches) → linear projection (→ d-model) → Transformer encoder (multi-head self-attention + FFN, × L) → flatten + head (→ horizon h) → forecast (h × N coins)]

How it works

Each coin's lookback series is split into fixed-length patches (non-overlapping when the stride equals the patch length, overlapping when it is smaller). Each patch is linearly projected into one Transformer token.
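As a rough illustration of the patching step, the sketch below splits one coin's lookback series into patches in plain Python. The function name `make_patches`, the toy series, and the choice to drop trailing values that do not fill a whole patch are assumptions made for the example, not details taken from the model described here.

```python
def make_patches(series, patch_len, stride):
    """Split a 1-D lookback series into fixed-length patches.

    stride == patch_len gives non-overlapping patches; stride < patch_len
    gives overlapping ones. Trailing values that do not fill a whole patch
    are dropped (an assumption; padding the end is another option).
    """
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [series[s:s + patch_len] for s in range(0, last_start + 1, stride)]


lookback = list(range(12))  # toy lookback series for a single coin
non_overlapping = make_patches(lookback, patch_len=4, stride=4)
overlapping = make_patches(lookback, patch_len=4, stride=2)

print(len(non_overlapping), len(overlapping))  # 3 5
```

Each inner list would then be mapped to one token by the linear projection, so the Transformer sees 3 or 5 tokens here instead of 12 time steps.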

PatchTST is channel-independent: each variate (here, each coin) is processed separately by the same shared Transformer, and the per-coin outputs are then concatenated into the final h × N forecast.
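Channel independence can be sketched with a toy stand-in for the network. Here `shared_model` is a hypothetical fixed linear map that plays the role of the shared Transformer and head; the weights and the two-coin panel are made up for illustration. The point is only that one set of weights is applied to every coin and that no coin's forecast depends on another coin's input.

```python
def shared_model(lookback, weights):
    # Hypothetical stand-in for the shared Transformer + head:
    # one weighted sum over a single coin's lookback values.
    return sum(w * x for w, x in zip(weights, lookback))


def forecast_panel(panel, weights):
    # Apply the same weights to each coin separately, then collect
    # the per-coin forecasts into one output.
    return {coin: shared_model(series, weights) for coin, series in panel.items()}


weights = [0.5, 0.25, 0.25]
panel = {"BTC": [1.0, 2.0, 3.0], "ETH": [4.0, 4.0, 4.0]}
base = forecast_panel(panel, weights)

# Change only ETH's history and forecast again.
shocked = forecast_panel(dict(panel, ETH=[8.0, 8.0, 8.0]), weights)

print(base["BTC"] == shocked["BTC"])  # True: BTC's forecast ignores ETH
```

Under this design, any cross-coin effect has to be captured outside the model, since nothing inside it mixes information across coins.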

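RevIN (reversible instance normalisation) subtracts each window's mean and divides by its standard deviation before the encoder, then reverses the transform on the forecast. This handles non-stationary mean and variance on a per-window basis.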
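A minimal sketch of the normalise-then-reverse idea, using only the standard library. It omits the learnable affine parameters that RevIN normally carries, and the `EPS` value, the function names, and the sample window are assumptions for illustration.

```python
from statistics import fmean, pstdev

EPS = 1e-5  # small constant so flat windows do not divide by zero


def revin_normalize(window):
    # Per-window statistics: computed from this lookback window only.
    mean = fmean(window)
    scale = pstdev(window) + EPS
    normed = [(x - mean) / scale for x in window]
    return normed, (mean, scale)


def revin_denormalize(values, stats):
    # Reverse step: map model outputs back to the original level and scale.
    mean, scale = stats
    return [v * scale + mean for v in values]


window = [100.0, 102.0, 101.0, 105.0]
normed, stats = revin_normalize(window)

# Placeholder for a model forecast produced in normalised space.
model_output = normed[-2:]
forecast = revin_denormalize(model_output, stats)
print(forecast)
```

Because the statistics are stored per window, the reverse step restores the price level of that specific window rather than a global training-set average.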

Pros and cons on this universe

Pros

  • Strong literature track record on LTSF benchmarks.
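  • Patching reduces the token count from L to roughly L/S for stride S (S equals the patch size when patches do not overlap), so the quadratic self-attention cost falls by about a factor of S².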
  • Ranked #6 on the latest shared-metric walk-forward screen, which is still a positive signal.
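
The attention-cost point in the list above can be checked with simple arithmetic. The lookback of 512 and patch length of 16 below are illustrative values, not the configuration used on this universe, and the count ignores any end padding an implementation might add.

```python
def num_patches(lookback, patch_len, stride):
    # Number of full patches that fit in the lookback window (no padding).
    return (lookback - patch_len) // stride + 1


def attention_pairs(tokens):
    # Self-attention scores every token against every token.
    return tokens * tokens


lookback, patch_len = 512, 16

point_wise = attention_pairs(lookback)
non_overlap = attention_pairs(num_patches(lookback, patch_len, stride=16))
overlap = attention_pairs(num_patches(lookback, patch_len, stride=8))

print(point_wise, non_overlap, overlap)  # 262144 1024 3969
print(point_wise // non_overlap)         # 256, i.e. 16 squared
```

With overlapping patches (stride 8) the saving is smaller, roughly 66x instead of 256x, because more tokens are produced from the same window.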

Cons / failure modes

  • Dropped from the active CPCV sweep after repeated folds with negative dollar Sharpe despite positive IC.
  • Channel independence ignores cross-coin structure, which is the central object on a perpetual panel.
  • Top-tier in the literature but only sixth on our IC screen, so the benchmark results transfer only partially.

References