The Data Spine: Research-Grade Market Data for One DEX

Abstract

Quantitative research programs fail on data before they fail on models — silently, through survivorship, gaps, mislabelled joins, and unreproducible local caches. We describe the data layer under our Hyperliquid program: what we collect live (because it cannot be back-filled), what we backfill from the exchange's public S3 archive, what we derive and how derivations are labelled, and the storage contract that makes a two-host research/production split workable. Nothing here is novel computer science; all of it is the difference between results that replicate and results that evaporate.

1 · Collect live what cannot be re-fetched

The tick-level order book is the one dataset with no archive: if the collector is down, that history is gone. Two long-running services stream the venue continuously:

Order-book collector — WebSocket L2 snapshots and trades for the full tracked universe, including a raw top-20 ladder for the LOB-model universe (BTC, ETH, SOL, kPEPE, DOGE).
Funding poller — funding rate, open interest, mark/oracle prices, and candles on a REST cadence.

Operationally these carry a standing rule: no one stops the collectors without explicit human sign-off, because the cost of an hour's gap is permanent. Everything else in the stack is allowed to fail; these two are not.

2 · Backfill the rest from the venue archive

Hyperliquid publishes historical L2 snapshots, trades, and minute-level asset contexts (funding, OI, mark, mid, volume) to a public S3 archive. Backfills run as one-shot systemd jobs that are resumable and driven by per-day work lists; the deepest series reach back roughly three years. Pulling from the archive costs money (S3 egress is metered), so every pull goes through a catalogued fetch script with an explicit cost note rather than an ad-hoc download.
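A resumable per-day work list can be sketched as follows. This is a minimal illustration, not our backfill job: the names day_work_list, run_backfill and fetch_day are hypothetical, and the set of completed days is held in memory here, whereas a real job would persist it so that a restart resumes where the last run stopped.

```python
from datetime import date, timedelta

def day_work_list(start, end, done):
    """Return every day in [start, end] that has not been fetched yet."""
    n_days = (end - start).days + 1
    days = (start + timedelta(days=i) for i in range(n_days))
    return [d for d in days if d not in done]

def run_backfill(start, end, done, fetch_day):
    """Fetch pending days one at a time, recording each success immediately."""
    for day in day_work_list(start, end, done):
        fetch_day(day)   # hypothetical callable: pulls one day from the archive
        done.add(day)    # assumption: persisted in practice, in memory here
    return done

# Two days are already done, so only the remaining two are fetched.
done = {date(2024, 1, 1), date(2024, 1, 2)}
fetched = []
run_backfill(date(2024, 1, 1), date(2024, 1, 4), done, fetched.append)
```

Because completion is recorded per day, an interrupted run loses at most the day in flight, and a second invocation over the same range fetches nothing.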

3 · Derived data is allowed — labelled and bounded

When we widened the universe from 30 to 51 coins for the GRID-50 program (Paper № 12), the 21 new names had native hourly candles only as far back as the REST API reaches. Rather than discard two years of panel history, we derived hourly OHLC from minute-level snapshot mid-prices for the missing span, under three rules that make derivation safe:

Derived rows never overwrite native rows — the deriver inserts strictly before the earliest native candle and stops.
Derived rows are identifiable (zero volume marks the synthetic span) and the derivation is declared in every result document that trains on the affected coins.
Derivation is idempotent by engine choice: dedup-safe table engines make re-running any backfill harmless, so backfills can be restarted without bookkeeping.
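The first two rules above can be illustrated with a short sketch. It assumes timestamps are Unix seconds, that the minute mids arrive sorted by time, and that the earliest native candle starts on an hour boundary; the function name derive_hourly and the row layout are hypothetical, not our schema.

```python
HOUR = 3600  # seconds per hourly bucket

def derive_hourly(mids, earliest_native_ts):
    """Build hourly OHLC from (unix_ts, mid_price) pairs sorted by time.

    Only hours strictly before earliest_native_ts are emitted, so derived
    rows can never overlap the native span.
    """
    buckets = {}
    for ts, mid in mids:
        hour = ts - ts % HOUR
        if hour >= earliest_native_ts:   # rule 1: never touch native candles
            continue
        buckets.setdefault(hour, []).append(mid)
    return [
        {"ts": hour, "open": p[0], "high": max(p), "low": min(p),
         "close": p[-1], "volume": 0.0}  # rule 2: zero volume marks synthetic
        for hour, p in sorted(buckets.items())
    ]

# The native span starts at ts=7200, so only hours 0 and 3600 are derived.
mids = [(0, 10.0), (60, 12.0), (120, 9.0),
        (3600, 11.0), (3660, 11.5), (7200, 20.0)]
candles = derive_hourly(mids, earliest_native_ts=7200)
```

Zero volume works as a marker only if no native candle can legitimately carry zero volume; a dedicated flag column would be a stricter alternative.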

The same discipline covers retention. When we discovered that the snapshot table's TTL would eventually expire the oldest part of the training window, the fix was an explicit TTL extension migration, committed to the repo like any other schema change, not a quiet config edit.

4 · The weight-deletion incident, and the rule it produced

The instructive failure was not a data gap. Our GPU host runs a disk-pressure cleanup that deletes model weights after their predictions are safely ingested — correct behaviour for training scratch. Then a promotion decision selected 23 specialists whose weights were needed for deployment… and every single weight file had already been cleaned up. The retrain cost roughly 10–15 GPU-hours.

The root cause was a category error: weights had silently changed category from “training scratch” to “deploy state” the moment their cells became promotion candidates, and nothing in the system represented that transition. The fix is a standing contract:

The storage contract

If any downstream consumer might ever need an artefact — model weights, model cards, per-bar predictions, strategy configs — it is written to ClickHouse at creation time. Local files are cache, never contract. Cleanup scripts must check the promotion registry before deleting, and consumers read the database first, disk second.
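The cleanup half of the contract can be sketched as a guard that consults the promotion registry before deleting anything. This is an illustration under assumptions: the function name cleanup_weights, the one-file-per-cell layout, and the .pt extension are all hypothetical, and the registry is passed in as a plain set rather than queried from the database.

```python
from pathlib import Path

def cleanup_weights(weight_dir, ingested, promoted):
    """Delete weight files that are safe to drop; return the deleted cell ids.

    A file is deleted only if its predictions are ingested AND its cell is
    not in the promotion registry.
    """
    deleted = []
    for path in sorted(Path(weight_dir).glob("*.pt")):  # assumed extension
        cell_id = path.stem
        if cell_id in promoted:       # deploy state: never delete
            continue
        if cell_id not in ingested:   # predictions not yet safe: keep
            continue
        path.unlink()
        deleted.append(cell_id)
    return deleted
```

The registry check comes first so that a promoted cell survives even when every other condition for deletion is met.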

Model weights now live in a dedicated artefacts table (a String column holds a 50 MB state dict without complaint), keyed by run, cell, seed and pass, with SHA-256 digests for integrity. The two-host split — GPU box for training, production VM for trading — stopped being an rsync choreography and became a database read.
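A sketch of the keyed store and its read order follows. A plain dict stands in for the artefacts table and another for the local disk cache; ArtefactKey, put_artefact and get_artefact are hypothetical names, and only hashlib from the standard library is used.

```python
import hashlib
from typing import NamedTuple

class ArtefactKey(NamedTuple):
    run: str
    cell: str
    seed: int
    pass_no: int  # "pass" is a Python keyword, hence the suffix

def put_artefact(table, key, payload):
    """Store the payload with its SHA-256 digest; return the digest."""
    digest = hashlib.sha256(payload).hexdigest()
    table[key] = (payload, digest)
    return digest

def get_artefact(table, key, disk_cache):
    """Read the database first and disk second; verify database reads."""
    row = table.get(key)
    if row is not None:
        payload, digest = row
        if hashlib.sha256(payload).hexdigest() != digest:
            raise ValueError(f"digest mismatch for {key}")
        return payload
    if key in disk_cache:   # the cache is a fallback, never the contract
        return disk_cache[key]
    raise KeyError(key)
```

Verifying the digest on every read costs one hash per load, which is small next to deploying a corrupted state dict.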

5 · Coverage as a generated document

The last piece is epistemic hygiene about the data itself: a coverage audit script regenerates a per-symbol report (first/last timestamp, row counts, gap and quality notes, live freshness) on demand, and the program's rules require refreshing it before any data-dependent experiment is designed. The point is to make “what do we actually have?” a query, not a memory.
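The core of such an audit fits in a few lines. The sketch below assumes rows arrive as (symbol, unix_ts) pairs and that a gap is any spacing wider than the expected step; coverage_report and its field names are hypothetical, and the real generator also records quality notes that are omitted here.

```python
def coverage_report(rows, expected_step, now):
    """Summarise coverage per symbol from (symbol, unix_ts) rows."""
    by_symbol = {}
    for symbol, ts in rows:
        by_symbol.setdefault(symbol, []).append(ts)
    report = {}
    for symbol, stamps in sorted(by_symbol.items()):
        stamps.sort()
        # A gap is any pair of consecutive rows further apart than expected.
        gaps = [(a, b) for a, b in zip(stamps, stamps[1:])
                if b - a > expected_step]
        report[symbol] = {
            "first": stamps[0],
            "last": stamps[-1],
            "rows": len(stamps),
            "gaps": gaps,
            "staleness": now - stamps[-1],  # live freshness, in seconds
        }
    return report

rows = [("BTC", 0), ("BTC", 60), ("BTC", 240), ("ETH", 0), ("ETH", 60)]
report = coverage_report(rows, expected_step=60, now=300)
```

In this example BTC shows one gap, between 60 and 240, and ETH shows none but is 240 seconds stale.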

What we would tell a smaller shop

Start the live collectors before you think you need them — the order-book history you didn't record is the only dataset money can't buy later. Put everything in one queryable store with dedup-safe semantics. Label every derived series at creation. And decide explicitly, per artefact type, whether it is cache or contract — the default of “it's on some machine's disk” is how deploys break months later.

Sources & references

Axon Ridge — GRID-50 pooled ranker (Paper № 12). /research/grid50-pooled-ranker.html
Axon Ridge internal — Hyperliquid data catalog; coverage audit generator.
Axon Ridge internal — model-artifacts schema and storage contract, 2026-05-28.
Hyperliquid public S3 archive documentation.