An Operating System for Agent-Driven Research

Abstract

LLM agents fail at research in a characteristic way: fluent superficiality. They produce internally consistent numbers with confident prose around them, and the failure is invisible precisely because nothing looks broken. Small quant shops fail in a complementary way: vibes-based scoring, unwritten methodology, results that depend on who ran them. Our operating system is a set of written contracts that turns both failure modes into rule violations that are mechanical, checkable, and enforced at the moment of publication rather than in retrospect. We describe the main contracts, score them against the bugs they did and did not catch, and explain the governance line we drew: agents propose, paper trading adjudicates, a human promotes.

1 · The contracts

Depth contract

A research session has a mechanical floor: minimum sources consulted, minimum search effort, adversarial review of its own finding, an append-only session log, and an explicit output ratio (a session that reads much and writes little is re-classified as planning, not research). The contract exists because an agent asked to “research X” will otherwise return a plausible summary of its priors.
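A minimal sketch of how such a floor could be checked mechanically. The threshold values, the `SessionStats` fields, and the use of a words-written to words-read ratio are all assumptions made for illustration; the text does not state the contract's actual numbers, and the append-only log is not modelled here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical floor values; the real contract's thresholds are not given in the text.
MIN_SOURCES = 5
MIN_SEARCHES = 10
MIN_OUTPUT_RATIO = 0.05  # words written per word read


@dataclass
class SessionStats:
    sources_consulted: int
    searches_run: int
    adversarial_review_done: bool
    words_read: int
    words_written: int


def check_depth(stats: SessionStats) -> Tuple[str, List[str]]:
    """Return (classification, violations) for one session."""
    violations = []
    if stats.sources_consulted < MIN_SOURCES:
        violations.append("too few sources")
    if stats.searches_run < MIN_SEARCHES:
        violations.append("too little search effort")
    if not stats.adversarial_review_done:
        violations.append("no adversarial review")

    # A session that reads much and writes little is re-classified, not failed.
    ratio = stats.words_written / max(stats.words_read, 1)
    classification = "research" if ratio >= MIN_OUTPUT_RATIO else "planning"
    return classification, violations
```

Keeping the classification separate from the violation list mirrors the distinction in the text: a low output ratio changes what the session is called, while a missing review or thin sourcing is a breach.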

Numerical plausibility contract

No Sharpe-style number is publishable without its diagnostics inline: per-bar mean and volatility, tail percentile, sample size, compound return, max drawdown. Five automatic flags must be evaluated before any result document is written: extreme Sharpe ratio (SR), SR/compound-return inconsistency, over-capital drawdown, heavy tails, and implausibly clean per-bar ratios. A target-distribution check must run before any Sharpe is computed, because the worst bug we shipped was a rank array stored under a return array's name (Paper № 07, Bug 2).
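The sketch below illustrates one way the diagnostics, the five flags, and the target-distribution check could fit together. Every threshold is an illustrative assumption, as are the flag names other than EXTREME_SR and INCONSISTENT. The rank heuristic (evenly spaced distinct values, or values of one sign only) and the additive drawdown measured against one unit of starting capital are simplifications chosen for the example, not the rule as actually written.

```python
import numpy as np

# Illustrative thresholds only; the actual rule's values are not given in the text.
EXTREME_SR = 5.0          # annualised Sharpe above this is suspicious
HEAVY_TAIL_RATIO = 10.0   # 99th percentile of |r| relative to median |r|
CLEAN_RATIO = 1.0         # per-bar mean / per-bar vol above this is "too clean"


def target_looks_like_returns(x) -> bool:
    """Heuristic: reject arrays that look like ranks or are one-signed."""
    u = np.unique(np.asarray(x, dtype=float))  # sorted distinct values
    if u.size < 3:
        return False
    gaps = np.diff(u)
    evenly_spaced = np.allclose(gaps, gaps[0])  # ranks are evenly spaced
    both_signs = u[0] < 0 < u[-1]
    return bool(both_signs and not evenly_spaced)


def diagnostics(returns, bars_per_year: int) -> dict:
    """Assumes at least two bars and non-zero volatility."""
    r = np.asarray(returns, dtype=float)
    mean, vol = float(r.mean()), float(r.std(ddof=1))
    pnl = np.concatenate(([0.0], np.cumsum(r)))  # additive P&L on 1 unit of capital
    return {
        "n": int(r.size),
        "mean": mean,
        "vol": vol,
        "sharpe": mean / vol * float(np.sqrt(bars_per_year)),
        "p99_abs": float(np.percentile(np.abs(r), 99)),
        "median_abs": float(np.median(np.abs(r))),
        "compound": float(np.prod(1.0 + r) - 1.0),
        "max_drawdown": float(np.max(np.maximum.accumulate(pnl) - pnl)),
    }


def flags(d: dict) -> list:
    out = []
    if abs(d["sharpe"]) > EXTREME_SR:
        out.append("EXTREME_SR")
    if d["sharpe"] * d["compound"] < 0:  # Sharpe and compound return disagree in sign
        out.append("INCONSISTENT")
    if d["max_drawdown"] > 1.0:          # lost more than the starting capital
        out.append("OVER_CAPITAL_DRAWDOWN")
    if d["p99_abs"] > HEAVY_TAIL_RATIO * d["median_abs"]:
        out.append("HEAVY_TAILS")
    if abs(d["mean"]) > CLEAN_RATIO * d["vol"]:
        out.append("TOO_CLEAN")
    return out
```

In use, `target_looks_like_returns` would run first and block the Sharpe computation outright, while `flags(diagnostics(...))` would travel with the number into any result document.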

Calculation glossary

Every metric has exactly one canonical implementation, imported from one module — re-defining a Sharpe inside a script is a rule violation. Every formula added to the codebase must be added to the glossary in the same change, with a worked example and its plausibility flag. This kills the quiet drift where three evaluators compute three slightly different “Sharpes”.
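One way to make "re-defining a Sharpe inside a script is a rule violation" checkable is a small static scan. The following is a sketch under stated assumptions: the set of canonical metric names is hypothetical, and only `def` statements are inspected (lambdas and assignments are ignored).

```python
import ast

# Hypothetical: names owned by the single canonical metrics module.
CANONICAL_METRICS = {"sharpe", "max_drawdown", "compound_return"}


def redefined_metrics(source: str) -> list:
    """Return (name, line) for each function in `source` that shadows a canonical metric."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        is_def = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_def and node.name in CANONICAL_METRICS:
            hits.append((node.name, node.lineno))
    return sorted(hits, key=lambda hit: hit[1])


script = (
    "from metrics import sharpe\n"
    "\n"
    "def sharpe(r):\n"
    "    return r.mean() / r.std()\n"
)
print(redefined_metrics(script))  # [('sharpe', 3)]
```

A check like this could run at the publication checkpoint, so a script that imports the canonical function passes and one that quietly defines its own does not.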

Host-role and storage contracts

Training happens on the GPU host, production on the trading VM, and the only artefact bridge is the database (Paper № 10). Agents check which host they are on before acting. The rule reads as bureaucracy until an agent on the wrong host “restarts” a service that does not exist there and reports success.
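A host check of this kind can be a few lines that run before any side effect. In the sketch below the hostnames, role names, and action names are all hypothetical placeholders; only `socket.gethostname()` is a real standard-library call.

```python
import socket
from typing import Optional

# Hypothetical hostnames and actions; the real mapping is not given in the text.
HOST_ROLES = {"gpu-host": "training", "trading-vm": "production"}
ALLOWED_ACTIONS = {
    "training": {"train_model", "write_artifact_to_db"},
    "production": {"restart_service", "read_artifact_from_db"},
}


class WrongHostError(RuntimeError):
    """Raised instead of silently 'succeeding' on the wrong machine."""


def require_host_for(action: str, hostname: Optional[str] = None) -> str:
    """Return the host's role if `action` is allowed there, else raise."""
    host = hostname if hostname is not None else socket.gethostname()
    role = HOST_ROLES.get(host)
    if role is None:
        raise WrongHostError(f"unknown host {host!r}; refusing {action!r}")
    if action not in ALLOWED_ACTIONS[role]:
        raise WrongHostError(f"{action!r} is not allowed on {role} host {host!r}")
    return role
```

The design choice worth noting is that an unknown host fails closed: the agent gets an exception it must report rather than a default role it can act under.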

2 · The track record, honestly scored

Incident	Caught by	Would today's rules have caught it?
Stop-overlay +8 SR (look-ahead through overlapping bars)	Human, next day	Yes — EXTREME_SR flag + overlap subsampling rule
+7.38 “Sharpe” that was a rank-IR (−3.53 corrected)	Human, in review	Yes — target-distribution check + INCONSISTENT flag
Non-reproducible training (seed ensembles sampling build noise)	Twice-run identity check	Yes — reproducibility is now a precondition
Selection on noise in early promotions	Internal audit, after paper decay	Partially — the funnel (№ 11) is the structural fix; no single flag catches selection

Two of the four were caught by a human before the corresponding rule existed; the rules are codified hindsight. That is the honest pattern of this whole system: each contract is a post-mortem written so that the same failure cannot happen twice. The goal metric is simple: zero human-caught metric bugs going forward, because the machine now asks the human's questions first.
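The twice-run identity check in the table is the simplest of these rules to illustrate. The sketch below assumes a training function that takes a seed and returns a list of floats; `seeded_train` and `leaky_train` are toy stand-ins invented for the example, with the latter imitating a run whose output depends on something other than the seed.

```python
import hashlib
import random
from typing import Callable, List


def fingerprint(values: List[float]) -> str:
    """Hash the exact float representation so any drift changes the digest."""
    payload = ",".join(repr(v) for v in values).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_reproducible(train: Callable[[int], List[float]], seed: int) -> bool:
    """Run training twice with the same seed and compare fingerprints."""
    return fingerprint(train(seed)) == fingerprint(train(seed))


def seeded_train(seed: int) -> List[float]:
    # Toy "model": every random draw flows from the seed.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(8)]


def leaky_train(seed: int) -> List[float]:
    # Toy failure: ignores the seed and draws from OS entropy.
    rng = random.SystemRandom()
    return [rng.random() for _ in range(8)]
```

Treating reproducibility as a precondition then means that a seed ensemble is only built from a `train` function for which `is_reproducible` holds.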

3 · The loop we chose not to close

Midway through the program we designed a fully autonomous researcher: a standing agent loop that generates hypotheses, runs them through the funnel, deploys survivors to paper via a queue, tracks calibration between predicted and realised performance, and feeds the errors back into its own priors. Technically it was within reach — every component existed. We rejected it, for a reason worth stating precisely:

The governance line

An autonomous loop optimises against its own evaluation function, and every bug in this paper's companion (№ 07) was an evaluation-function bug. Until the evaluator itself has a long clean track record, closing the loop automates self-deception at machine speed. So the boundary is fixed: agents execute and propose; paper trading adjudicates; a human — and only a human — moves anything to live capital.

In practice the human gate has bound exactly where it should: promotions out of paper are rare and deliberate, and twice now the decision has been a refusal. One of those refusals rejected our own backtest's construction pick and let two book variants fight it out in paper instead (Paper № 12).
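The boundary described above can be written down as a transition table in which the actor is part of the key. This is a hypothetical encoding for illustration; the stage and actor names are assumptions, and a real system would also need to authenticate who the actor is.

```python
from enum import Enum


class Stage(Enum):
    PROPOSED = "proposed"
    PAPER = "paper"
    LIVE = "live"


class Actor(Enum):
    AGENT = "agent"
    HUMAN = "human"


# (from, to) -> actors allowed to make that move. Anything absent is forbidden.
ALLOWED = {
    (Stage.PROPOSED, Stage.PAPER): {Actor.AGENT, Actor.HUMAN},
    (Stage.PAPER, Stage.LIVE): {Actor.HUMAN},
}


def promote(current: Stage, target: Stage, actor: Actor) -> Stage:
    """Return the new stage, or raise if this actor may not make this move."""
    if actor not in ALLOWED.get((current, target), set()):
        raise PermissionError(
            f"{actor.value} may not move {current.value} -> {target.value}"
        )
    return target
```

Because the table lists only permitted moves, skipping paper entirely is rejected for every actor, and the agent has no path to live capital at all.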

4 · What we'd tell another shop trying this

Write the rules as executable prose, colocated with the code. Rules the agent reads at session start beat rules in a wiki nobody loads into context.
Make publication the checkpoint. Agents produce far too much intermediate output to review; the enforceable moment is when a number enters a result document, a scoreboard, or a commit message.
Keep negative knowledge. A debunked-claims folder with named post-mortems stops the same attractive idea from being “discovered” every six weeks — by agents and humans alike.
Assume the agent's number is mislabelled until its diagnostics agree. The failure mode is not arithmetic error; it is a correct computation of the wrong quantity, delivered fluently.
Decide your governance line before the first good backtest, because after it you will be negotiating with your own excitement.
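As an illustration of making publication the checkpoint, the sketch below scans a commit message or result document for Sharpe-style numbers and reports any that arrive without diagnostics. The `[diag:...]` tag is a convention invented for this example, and the regular expression only recognises the literal words "Sharpe" and "SR"; both are assumptions, not the authors' mechanism.

```python
import re

# Hypothetical convention: a published metric must share a line with a [diag:...] tag.
METRIC = re.compile(r"\b(?:Sharpe|SR)\s*[:=]?\s*[+\-−]?\d+(?:\.\d+)?", re.IGNORECASE)
DIAG_TAG = re.compile(r"\[diag:[^\]]+\]")


def unbacked_metrics(text: str) -> list:
    """Return metric mentions whose line carries no diagnostics tag."""
    missing = []
    for line in text.splitlines():
        if DIAG_TAG.search(line):
            continue
        missing.extend(match.group(0) for match in METRIC.finditer(line))
    return missing


message = "Add stop overlay, Sharpe 8.1\nBaseline SR=1.2 [diag:run-0042]\n"
print(unbacked_metrics(message))  # ['Sharpe 8.1']
```

A non-empty result would block the commit or document, which keeps review effort focused on the one moment a number becomes public rather than on the agent's intermediate output.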

Sources & references

Axon Ridge — Five ways Sharpe gets warped (Paper № 07). /research/sharpe-pitfalls.html
Axon Ridge — The Grid A/B/C funnel (Paper № 11). /research/grid-abc-funnel.html
Axon Ridge — The data spine (Paper № 10). /research/data-spine.html
Axon Ridge internal — DEPTH_CONTRACT; numerical-plausibility rule; calculation glossary; host-role rules.