Build log · trading-models

Teaching a small model to talk about backtests

A walk through fine-tuning a 7-billion-parameter assistant to answer questions about quant trading models — what we were trying to do, how the pieces fit together, and the three separate bugs hiding behind one symptom.

Model: Qwen2.5-7B-Instruct + LoRA Method: QLoRA (4-bit) via Unsloth Hardware: single RTX 5080 (16 GB) Teacher / judge: Claude

01The goal & why

What we were building, and the reason it's worth doing

The trading-models project is a portfolio of quant strategies — each with a reproducible backtest — wrapped in a web "workbench" where you pick a model, run a backtest over any window, and see the results as charts. Bolted onto that workbench is a chat assistant: you can ask "what data source does this model use?" or "how would higher fees change this?" and get an answer grounded in the run you're looking at.

That assistant can just call Claude. But calling a hosted model on every question costs money per query and needs a network round-trip. The goal of this sub-project was to fine-tune a small, open model that runs locally and matches the hosted model's quality on this one narrow domain — answering questions about these specific backtests, calling the right tools, and never inventing numbers.

The bet

A general 7B model is mediocre at this out of the box. But the domain is small and well-defined. If we can teach it the handful of tools and the house style for grounded answers, a free local model can stand in for the paid one — with a toggle in the UI to switch back to Claude when you want maximum quality.

02The four pieces

How the system fits together

1. Teacher Claude generates example conversations → 2. Dataset (question → tool calls → grounded answer)
↓
4. Eval gate score the student ← 3. Student Qwen-7B fine-tuned to imitate the teacher

The teacher is Claude. We don't hand-write training data — we have Claude play the assistant across a grid of scenarios (every model × a legal ticker × a question template), producing realistic traces: the user's question, the tool calls the assistant makes, the tool outputs, and a final grounded answer.

The dataset sorts those traces into four kinds of question:

explain "explain the worst month of this backtest" — narrate real numbers.
counterfactual "how would a longer lookback change this?" — re-run with a tweak.
methodology "what's the train/test split?" — answer from the docs.
refusal "should I buy this?" — decline politely; it's not financial advice.

A grounding filter throws away any teacher answer that cites a number not present in its tool outputs — so the model only ever learns from answers that are actually supported by data. The working dataset (dataset_n20) ended up at 185 training + 51 held-out eval examples.

The student is Qwen2.5-7B-Instruct, fine-tuned to imitate those traces. The eval gate is how we measure whether it worked — more on both below.

03What "training" actually means here

QLoRA, in plain terms

We don't retrain all 7 billion of the model's weights — that needs far more memory than a 16 GB card has, and would risk wrecking its general ability. Instead we use QLoRA, which is two tricks stacked:

Quantization (the "Q"): load the big model in 4-bit precision so it fits in 16 GB. These weights stay frozen.
LoRA (Low-Rank Adaptation): bolt a few small, trainable matrices onto each layer. We only train those — a tiny fraction of the parameters. The result is a ~160 MB "adapter" that rides on top of the frozen base model.

One important detail: we compute the training loss only on the assistant's turns (its tool calls and final answers), masking out the system prompt, the user's question, and the tool outputs. The model already knows how to read; we want to spend the limited training signal teaching it how to respond.

Recipe

Train for 8 epochs (8 passes over the data), checkpoint after each one, and keep the checkpoint that scores best on the held-out eval set. On a dataset this small the model starts overfitting — memorizing the training examples — after just a few epochs, so the last epoch is rarely the best one. (This "keep the best checkpoint" logic is exactly where bug #2 hid — see below.)

04The eval gate: three numbers

How we decide if the student is any good

After training, the student answers all 51 held-out questions and we score it three ways. The held-out set is the key word: the model never saw these during training, so doing well means it generalized, not memorized.

Tool-call accuracy — did it call the right tools with the right arguments? (Deterministic, computed locally, free.)
Grounded fraction — does every number in its answer actually appear in a tool output? Catches hallucinated stats. (Also local and free.)
Judge win-rate — Claude reads the student's answer and the gold answer side by side and says which is better. This is the closest thing to "is the answer actually good." (Needs an API call per question, so it costs money.)

A "ship bar" sets thresholds for each. The bars are deliberately strict (tool-call ≥ 0.90, grounded ≥ 0.99) — the model never cleared them in this work, but the metrics were invaluable as a compass for finding what was broken.

Why three, not one

They measure different failure modes. A model can call perfect tools and still write a useless answer (high tool-call, low judge). It can write a beautiful answer full of made-up numbers (high judge, low grounded). You need all three to know where it's failing — which turned out to matter a lot.

05One symptom, three bugs

The interesting part

We trained a new, bigger dataset (n20) and compared it to an older one (n12). The new model looked worse at the methodology questions — its score dropped noticeably. That one symptom turned out to be three completely separate problems, uncovered one at a time. Each looked like "the model regressed." None of them actually were.

1The ruler was wrong — a scoring artifact

2We picked the wrong snapshot — a checkpoint-selection artifact

3The textbook had errors — a gold-data bug

Act 1 — the ruler was wrong

Symptom

Methodology tool-call accuracy looked terrible (~0.4) for the new model.

Methodology questions are answered by searching the docs — the model calls search_docs(query="..."). The scorer checked the query argument with an exact string match. So if the gold query was "slippage execution model" and the model searched "slippage fee model", it scored a flat zero — even though both find the same document. A quarter of the methodology questions were scored 0 for every model, purely on phrasing.

Root cause

The metric demanded a word-for-word match on free-text search queries. It was measuring phrasing, not whether the model used the tool correctly.

Fix

Score the query by token overlap instead — match if the two queries share enough words (and keep exact matching for real identifiers like model IDs). Methodology tool-call jumped to 0.78–0.97 for both models. The "regression" was mostly the ruler.

Act 2 — we picked the wrong snapshot

Symptom

Even after fixing the ruler, the new model's methodology tool-call (0.78) lagged the old one's (0.97).

Remember the recipe keeps the checkpoint with the lowest eval-loss. Eval-loss is a proxy for "how surprised is the model by the held-out answers" — a decent default, but not the same thing as "calls tools correctly." We trained again while keeping every epoch's checkpoint, then scored epochs 2, 3 and 4 directly.

Root cause

Eval-loss bottomed out at epoch 2, but the tool-call format only fully locked in at epoch 3 — where eval-loss was a hair higher. The "keep best eval-loss" rule had been handing us the snapshot from one epoch too early.

Fix

Sweep the epochs and pick by what we actually care about. At epoch 3, methodology tool-call hit 0.969 (matching the old model) and overall tool-call rose to 0.675 — beating both earlier candidates. Added a flag to keep all checkpoints so this sweep is repeatable.

Act 3 — the textbook had errors

Symptom

Tool-calls were now fixed, but the judge still rated the new model's methodology answers poorly (0.31) versus the old model (0.59). This one was real — and it survived every training knob we tried.

Since the model learns to answer like its training data, we compared the two datasets' methodology answers directly. The new ones weren't shorter or lazier — if anything, longer. But some were wrong. The clearest case: for a rules-based options strategy, the gold answer claimed it "uses a default 80/20 train/test split."

Checking the actual methodology docs: the 80/20 split is a discipline for machine-learning models only — and just one of the five models is ML. The other four are rule-based with no parameters to fit, so they have no train/test split at all. The teacher had pasted the ML boilerplate onto models it didn't apply to, and the student dutifully learned the wrong answer.

Root cause

Bad gold answers. A model can only be as correct as what it's trained on — and judged answer quality is a data problem, not a training-knob problem. (Tool-calls and grounding can't catch this; only the judge, which compares the actual prose, can.)

Fix

Rewrite the affected gold answers to the ground-truth-correct version (ML → 80/20 out-of-sample; non-ML → no split, full-period backtest), then retrain. Methodology judge climbed 0.31 → 0.59 — exactly the old model's level — with no change to tool-calls. The gap was the textbook, not the student.

A dead end worth noting

Before finding the real cause, we suspected the new dataset simply had too few methodology examples. It turned out there are only 12 unique methodology questions possible from the templates — the space was already saturated, and the new set had more copies than the old one. More data couldn't have helped; the problem was never quantity.

06Results

Methodology category, scored on the same 51 held-out questions

"Judge" = win-rate vs the gold answer as scored by Claude. The final model matches the older n12 reference (0.59) on answer quality while leading on tool-calls. Per-category judge is noisy at this sample size, so the overall judge number isn't certifiable from a single run — but the methodology gain is large, causal, and reproducible.
Adapter	method. tool-call	method. judge	overall tool-call	what changed
n20 · epoch 2 start	0.781	0.31	0.602	baseline (after ruler fix)
n20 · epoch 3	0.969	0.31	0.675	+ checkpoint sweep (Act 2)
n21 · epoch 3 final	0.969	0.59	0.699	+ corrected gold (Act 3)

None of these clear the (deliberately strict) ship bar yet, but the trajectory is the point: every apparent "regression" was diagnosed to a specific, fixable cause, and the final adapter is the strongest on every metric we can attribute to a real change.

07Lessons

What this build actually taught us

Trust the metric last. Two of the three "regressions" were the measurement, not the model. Before believing a number, check that it measures what you think.
One symptom can have many causes. "Methodology got worse" was a scoring bug and a checkpoint bug and a data bug — stacked. Fixing one revealed the next.
Separate the metrics on purpose. Tool-call, grounding, and judge each catch a different failure. The data bug was invisible to two of them and only the judge could see it.
A model is a mirror of its data. The student faithfully learned a wrong answer because the teacher wrote one. Garbage-in isn't loud — it shows up as a quiet quality ceiling.
Cheap diagnostics first. The data bug was found by diffing two text files — no GPU, no API. The expensive judge run only confirmed what the free comparison already implied.