Build log · trading-models
A walk through fine-tuning a 7-billion-parameter assistant to answer questions about quant trading models — what we were trying to do, how the pieces fit together, and the three separate bugs hiding behind one symptom.
What we were building, and the reason it's worth doing
The trading-models project is a portfolio of quant strategies — each with a reproducible backtest — wrapped in a web "workbench" where you pick a model, run a backtest over any window, and see the results as charts. Bolted onto that workbench is a chat assistant: you can ask "what data source does this model use?" or "how would higher fees change this?" and get an answer grounded in the run you're looking at.
That assistant can just call Claude. But calling a hosted model on every question costs money per query and needs a network round-trip. The goal of this sub-project was to fine-tune a small, open model that runs locally and matches the hosted model's quality on this one narrow domain — answering questions about these specific backtests, calling the right tools, and never inventing numbers.
A general 7B model is mediocre at this out of the box. But the domain is small and well-defined. If we can teach it the handful of tools and the house style for grounded answers, a free local model can stand in for the paid one — with a toggle in the UI to switch back to Claude when you want maximum quality.
How the system fits together
The teacher is Claude. We don't hand-write training data — we have Claude play the assistant across a grid of scenarios (every model × a legal ticker × a question template), producing realistic traces: the user's question, the tool calls the assistant makes, the tool outputs, and a final grounded answer.
The dataset sorts those traces into four kinds of question:
A grounding filter throws away any teacher answer that cites a number not present in its tool outputs — so the model only ever learns from answers that are actually supported by data. The working dataset (dataset_n20) ended up at 185 training + 51 held-out eval examples.
The student is Qwen2.5-7B-Instruct, fine-tuned to imitate those traces. The eval gate is how we measure whether it worked — more on both below.
QLoRA, in plain terms
We don't retrain all 7 billion of the model's weights — that needs far more memory than a 16 GB card has, and would risk wrecking its general ability. Instead we use QLoRA, which is two tricks stacked:
One important detail: we compute the training loss only on the assistant's turns (its tool calls and final answers), masking out the system prompt, the user's question, and the tool outputs. The model already knows how to read; we want to spend the limited training signal teaching it how to respond.
Train for 8 epochs (8 passes over the data), checkpoint after each one, and keep the checkpoint that scores best on the held-out eval set. On a dataset this small the model starts overfitting — memorizing the training examples — after just a few epochs, so the last epoch is rarely the best one. (This "keep the best checkpoint" logic is exactly where bug #2 hid — see below.)
How we decide if the student is any good
After training, the student answers all 51 held-out questions and we score it three ways. The held-out set is the key word: the model never saw these during training, so doing well means it generalized, not memorized.
A "ship bar" sets thresholds for each. The bars are deliberately strict (tool-call ≥ 0.90, grounded ≥ 0.99) — the model never cleared them in this work, but the metrics were invaluable as a compass for finding what was broken.
They measure different failure modes. A model can call perfect tools and still write a useless answer (high tool-call, low judge). It can write a beautiful answer full of made-up numbers (high judge, low grounded). You need all three to know where it's failing — which turned out to matter a lot.
The interesting part
We trained a new, bigger dataset (n20) and compared it to an older one (n12). The new model looked worse at the methodology questions — its score dropped noticeably. That one symptom turned out to be three completely separate problems, uncovered one at a time. Each looked like "the model regressed." None of them actually were.
Methodology tool-call accuracy looked terrible (~0.4) for the new model.
Methodology questions are answered by searching the docs — the model calls search_docs(query="..."). The scorer checked the query argument with an exact string match. So if the gold query was "slippage execution model" and the model searched "slippage fee model", it scored a flat zero — even though both find the same document. A quarter of the methodology questions were scored 0 for every model, purely on phrasing.
The metric demanded a word-for-word match on free-text search queries. It was measuring phrasing, not whether the model used the tool correctly.
Score the query by token overlap instead — match if the two queries share enough words (and keep exact matching for real identifiers like model IDs). Methodology tool-call jumped to 0.78–0.97 for both models. The "regression" was mostly the ruler.
Even after fixing the ruler, the new model's methodology tool-call (0.78) lagged the old one's (0.97).
Remember the recipe keeps the checkpoint with the lowest eval-loss. Eval-loss is a proxy for "how surprised is the model by the held-out answers" — a decent default, but not the same thing as "calls tools correctly." We trained again while keeping every epoch's checkpoint, then scored epochs 2, 3 and 4 directly.
Eval-loss bottomed out at epoch 2, but the tool-call format only fully locked in at epoch 3 — where eval-loss was a hair higher. The "keep best eval-loss" rule had been handing us the snapshot from one epoch too early.
Sweep the epochs and pick by what we actually care about. At epoch 3, methodology tool-call hit 0.969 (matching the old model) and overall tool-call rose to 0.675 — beating both earlier candidates. Added a flag to keep all checkpoints so this sweep is repeatable.
Tool-calls were now fixed, but the judge still rated the new model's methodology answers poorly (0.31) versus the old model (0.59). This one was real — and it survived every training knob we tried.
Since the model learns to answer like its training data, we compared the two datasets' methodology answers directly. The new ones weren't shorter or lazier — if anything, longer. But some were wrong. The clearest case: for a rules-based options strategy, the gold answer claimed it "uses a default 80/20 train/test split."
Checking the actual methodology docs: the 80/20 split is a discipline for machine-learning models only — and just one of the five models is ML. The other four are rule-based with no parameters to fit, so they have no train/test split at all. The teacher had pasted the ML boilerplate onto models it didn't apply to, and the student dutifully learned the wrong answer.
Bad gold answers. A model can only be as correct as what it's trained on — and judged answer quality is a data problem, not a training-knob problem. (Tool-calls and grounding can't catch this; only the judge, which compares the actual prose, can.)
Rewrite the affected gold answers to the ground-truth-correct version (ML → 80/20 out-of-sample; non-ML → no split, full-period backtest), then retrain. Methodology judge climbed 0.31 → 0.59 — exactly the old model's level — with no change to tool-calls. The gap was the textbook, not the student.
Before finding the real cause, we suspected the new dataset simply had too few methodology examples. It turned out there are only 12 unique methodology questions possible from the templates — the space was already saturated, and the new set had more copies than the old one. More data couldn't have helped; the problem was never quantity.
Methodology category, scored on the same 51 held-out questions
| Adapter | method. tool-call | method. judge | overall tool-call | what changed |
|---|---|---|---|---|
| n20 · epoch 2 start | 0.781 | 0.31 | 0.602 | baseline (after ruler fix) |
| n20 · epoch 3 | 0.969 | 0.31 | 0.675 | + checkpoint sweep (Act 2) |
| n21 · epoch 3 final | 0.969 | 0.59 | 0.699 | + corrected gold (Act 3) |
None of these clear the (deliberately strict) ship bar yet, but the trajectory is the point: every apparent "regression" was diagnosed to a specific, fixable cause, and the final adapter is the strongest on every metric we can attribute to a real change.
What this build actually taught us