Engineering devlog — trading-models

Entry 12 · 2026-06-12 · PRs #75–#80

Replay the past, then fix five things at once (plans A–E)

A no-lookahead replay harness, the study it enabled, and five plans executed in one day — whose two biggest results were nulls.

Why

The scanner had been issuing tickets nightly since 06-10, but forward evidence accumulates at exactly one night per night. To learn anything this year we needed to replay the funnel over history without cheating: each replayed night may only see data available as of that night. PR #75 built that harness (scripts/backfill_scan.py), and the resulting study — every 5th session over ~6 months — produced a headline: the watch tier looked like +23.4R. Five improvement plans (A–E) came out of that study. Then one of them quietly destroyed the headline.
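The as-of rule is easiest to see in code. The following is a minimal sketch, not the contents of scripts/backfill_scan.py: the function names, the (date, value) bar layout, and the `scan` callback are all assumptions made for illustration. The only facts taken from the entry are that a replayed night sees nothing dated after itself and that the study sampled every 5th session.

```python
from bisect import bisect_right
from datetime import date


def bars_as_of(bars, night):
    """Return only the bars dated on or before `night`.

    `bars` is a list of (date, value) tuples sorted by date (assumed layout).
    """
    dates = [d for d, _ in bars]
    return bars[:bisect_right(dates, night)]


def replay_nights(sessions, step=5):
    """Pick every `step`-th session, oldest first."""
    return sessions[::step]


def replay(bars, sessions, scan, step=5):
    """Run `scan` once per sampled night on an as-of view of the data."""
    results = {}
    for night in replay_nights(sessions, step):
        visible = bars_as_of(bars, night)  # nothing after `night` leaks in
        results[night] = scan(visible)
    return results
```

Slicing before the scan runs, rather than trusting each detector to ignore future rows, keeps the no-lookahead guarantee in one place.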

B — issuance mechanics, and the correction (#77)

Plan B gave issuance some discipline: per-setup entry windows (the post-earnings-drift pair gets 15 sessions instead of the default), a re-issue cooldown keyed on (ticker, stance, strategy, tier) so a campaign that is still waiting or open cannot be issued again, and score-floor plumbing left deliberately empty until live evidence justifies any floor.
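A sketch of the cooldown check, under stated assumptions: the key tuple (ticker, stance, strategy, tier) comes from the plan, while the `Row` class, the status names "waiting" and "open", and the `campaign_status` mapping are hypothetical stand-ins for whatever the scanner actually stores.

```python
from dataclasses import dataclass

LIVE_STATUSES = {"waiting", "open"}  # assumed status names for a live campaign


@dataclass(frozen=True)
class Row:
    ticker: str
    stance: str
    strategy: str
    tier: str

    @property
    def key(self):
        """Cooldown key: the same setup on the same name in the same tier."""
        return (self.ticker, self.stance, self.strategy, self.tier)


def apply_cooldown(candidates, campaign_status):
    """Split tonight's candidates into (issued, suppressed).

    `campaign_status` maps a cooldown key to the status of its latest campaign.
    A key with no campaign, or a finished one, is free to issue again.
    """
    issued, suppressed = [], []
    for row in candidates:
        if campaign_status.get(row.key) in LIVE_STATUSES:
            suppressed.append(row)
        else:
            issued.append(row)
    return issued, suppressed
```

Returning the suppressed rows instead of dropping them is what makes a count like "43 of 128" recoverable afterwards.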

Problem

Replaying the study with the cooldown suppressed 43 of the 128 baseline watch rows. They were not new signals — they were the same persistent setups re-firing every replay night while their first campaign was still live, each re-fire counted as if it were a fresh trade.

Finding

The +23.4R headline was inflated roughly 3× by pseudo-replication. Deduplicated to one count per campaign, the window's truth is watch +3.15R: post-earnings drift +2.8R (3 of 4 campaigns hit), base breakouts ≈ noise (+1.2R across 36), shorts −6.6R. The replay harness's most valuable output to date is this correction to our own beliefs.

The correction, to scale. 43 of 128 baseline rows were the same campaigns re-firing on consecutive replay nights; counting each campaign once shrinks the window's watch-tier result from +23.4R to +3.15R. The per-strategy components are the deduplicated truth.

A — short detectors (#76)

The FA gate already computed a bottom-40 cohort that long-only scanning threw away. Plan A mirrored the detectors — base_breakdown, ma_rally_fade, pead_down — over that cohort into the watch path, with direction-aware briefs and a short chip on /scans.

Problem

Mirroring levels is where the bugs live: review caught moving-average pairs emitting trigger/stop in the wrong order — and the guard built for it exposed a pre-existing long-side bug (squeeze-spike tape emitting inverted levels). Short 2R targets can also be arithmetically unattainable when the risk exceeds half the trigger price.

Solution

Ordering guards on every emitted level pair, hard ValueErrors on stance/cohort mismatches, and unattainable targets demoted to report-only rows with levels: null. Replayed: the long side stayed byte-identical; the new shorts went 14 issued / 1 hit, −8.6R in an up-trending window — stated plainly in the PR, tracked in the live ledger by stance, not traded.
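The arithmetic behind the unattainable short target: with risk = stop − trigger, the 2R target is trigger − 2 × risk, which is only positive while risk is below half the trigger price. A hedged sketch of an ordering guard plus that check follows; `target_2r` is a hypothetical helper, not the scanner's function, and treating a target of exactly zero as unattainable is our assumption.

```python
def target_2r(stance, trigger, stop):
    """Return the 2R target for a trigger/stop pair, or None if unattainable.

    Raises ValueError when the pair is ordered the wrong way for the stance.
    """
    if stance == "long":
        if not stop < trigger:
            raise ValueError("long: stop must sit below trigger")
        return trigger + 2 * (trigger - stop)
    if stance == "short":
        if not stop > trigger:
            raise ValueError("short: stop must sit above trigger")
        risk = stop - trigger
        target = trigger - 2 * risk
        # A price cannot trade at or below zero, so such a target can never fill.
        return target if target > 0 else None
    raise ValueError(f"unknown stance: {stance!r}")
```

For example, a short with trigger 10 and stop 16 risks 6, more than half of 10, so its 2R target would be −2 and the helper returns None.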

C — regime overlay, first null (#78)

A November loss on a PLTR mean-reversion pair motivated a regime gate: block mean-reversion-against-trend when SPY is below its 200-day moving average. Built as scanner/regime.py, always reported, gating only behind a default-off flag, with a replay A/B arm.

Finding — null

The A/B arms were byte-identical (85 rows, +1.15R both). The motivating PLTR pair fired with SPY ~12% above its 200dma — that "selloff" never breached the trend line, and only 2 of 42 replay nights were down-trend at all. The plan's own sanity expectation (PLTR blocked) was not met, and we recorded that instead of tuning the definition until it fired. The gate ships dark; flipping it on requires ≥20 live nights of regime-block evidence. If mean-reversion losses recur in pullbacks that don't breach the 200dma, the fix is a better trend definition, not a forced flip.
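A sketch of the gate's shape, assuming a simple moving average that includes the latest close and a boolean flag per row; scanner/regime.py may define either differently, and the function and key names below are ours. What it mirrors from the entry: the regime is always computed and reported, and it only blocks when the default-off flag is set.

```python
from statistics import fmean


def is_downtrend(closes, window=200):
    """True when the latest close is below its `window`-session simple average."""
    if len(closes) < window:
        return False  # assumption: too little history never blocks
    return closes[-1] < fmean(closes[-window:])


def regime_check(is_mean_reversion_long, spy_closes, gate_enabled=False):
    """Report the regime for a row; enforce the block only if the gate is on."""
    downtrend = is_downtrend(spy_closes)
    would_block = downtrend and is_mean_reversion_long
    return {
        "downtrend": downtrend,
        "would_block": would_block,            # always reported
        "blocked": would_block and gate_enabled,  # dark unless flipped
    }
```

With SPY about 12% above its average, as on the PLTR night, `would_block` is False in both arms, which is why the A/B output could not differ.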

E — honest families for the replay (#79)

The replay had been using today's FA family for historical nights — a quiet survivorship distortion. --family-mode archived reconstructs each night's family from the newest archived report at or before that night, stamps family_source: "fallback" explicitly wherever coverage predates the archives, and loads the union of bars. A local 2026-06-09 report turned out to predate the FA format entirely — the skip path handled it in the wild on day one.
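The selection rule, "newest archived report at or before that night", is a right-bisect over report dates. A minimal sketch follows; the archive layout, the "archived" label, and falling back to the current family are assumptions, while the "fallback" stamp is the one named above.

```python
from bisect import bisect_right
from datetime import date


def family_for_night(night, archives, current_family):
    """Pick the FA family a replayed night is allowed to know about.

    `archives` is a list of (report_date, tickers) sorted by report_date.
    """
    dates = [d for d, _ in archives]
    i = bisect_right(dates, night)
    if i == 0:
        # No archive exists at or before this night: say so explicitly.
        return {"family": list(current_family), "family_source": "fallback"}
    report_date, tickers = archives[i - 1]
    return {
        "family": list(tickers),
        "family_source": "archived",
        "report_date": report_date,
    }


def tickers_to_load(nights, archives, current_family):
    """Union of every night's family, so bars are fetched once for all nights."""
    needed = set()
    for night in nights:
        needed.update(family_for_night(night, archives, current_family)["family"])
    return needed
```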

D — pooled certification, the bigger null (#80)

Per-ticker forward ledgers starve: no single ticker accumulates enough campaign dates to certify anything. Plan D pools by setup type across all tickers — an as-of sweep stamps every historical firing, per-date mean-R series feed the existing metrics engine, and a type is certified only at DSR ≥ 0.90, ≥ 20 pooled dates, and a Benjamini-Hochberg FDR pass across the 6-type menu. Certified types promote watch rows to real tickets, ahead of cooldown and regime in the pipeline (the ordering is mutation-locked by a test). A weekly Modal sidecar recomputes certification, failure-isolated and keep-stale-on-failure.
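The three-part bar can be sketched as one conjunction. The DSR and date thresholds below are the ones stated above; the FDR level of 0.05, the per-type p-value fed into Benjamini-Hochberg, and the dictionary layout are assumptions, since the entry does not specify them.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the keys rejected by the BH step-up procedure at level `alpha`."""
    ranked = sorted(pvalues.items(), key=lambda kv: kv[1])
    m = len(ranked)
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= alpha * rank / m:
            cutoff = rank  # step-up: keep the largest rank that passes
    return {key for key, _ in ranked[:cutoff]}


def certify(stats, min_dsr=0.90, min_dates=20, alpha=0.05):
    """Return the setup types that clear every bar at once.

    `stats` maps a type name to {"dsr": float, "dates": int, "p": float}.
    """
    passed_fdr = benjamini_hochberg({k: s["p"] for k, s in stats.items()}, alpha)
    return {
        k
        for k, s in stats.items()
        if s["dsr"] >= min_dsr and s["dates"] >= min_dates and k in passed_fdr
    }
```

Because the FDR pass is computed across the whole menu, adding a type to the menu can change whether an existing type certifies, which is one reason to keep the menu small.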

Why nothing certifies. Pooling solved the structural problem — every setup type now has 28–77 dates against a 20-date bar — so sample size is no longer the excuse. The Deflated Sharpe of the pooled per-date returns is what fails, for every type, every week of the replay.

Finding — the null is the result

0 of 6 types certified on every weekly recompute; the A/B arms were byte-identical (85 rows, +1.15R). And the pooled base rates contradict the study's best story: unconditioned post-earnings drift loses −18.5R over 54 pooled dates (its short mirror −18.2R over 55). The 86%-hit pead record that looked so good was 4 deduplicated campaigns — selection plus small-n noise, not an edge the cross-section corroborates. Promotion ships on, because certification is itself the evidence bar — and stays dark in practice until a type genuinely earns it. One known conservative bias, deliberately kept: the sweep has no cooldown, so serial re-fires over-count within a ticker; softening that to make the bar passable would be replay-tuning.

Problem

Review caught a true Critical before it reached production: render_markdown assumed every row has option structures and would have crashed with a KeyError on the first night a promoted row appeared, killing that night's report. Two test-fixture gotchas also bit: a constant return series produces DSR ≈ 1.0 from float noise, and a bimodal alternating series produces a negative variance term from its sample kurtosis, which trips the 0.0 sentinel.

Solution

Report, webapp, and scan-runner all hardened against structure-less promoted rows, with regression tests; DSR tests use a seeded normal series, the only fixture that behaves. The promotion-before-cooldown ordering and the default-on flag are both pinned by tests so they cannot drift silently.

The nightly funnel after plans A–E. Promotion runs before cooldown and the (dark) regime gate so that a freshly certified ticket cannot be suppressed by its own watch-campaign history; a test fails if anyone reorders it.

Lessons from the week

(1) A replay harness's best product is corrections — the 3× inflation, two null A/Bs, and an uncorroborated star strategy were each worth more than a feature. (2) Nulls get recorded, not tuned away. (3) Two-stage review caught a real bug in every plan, including one that would have killed a production report. (4) Evidence bars ship dark (or dark-in-practice) until the evidence exists.

Carried forward: ranking is cohort-blind (shorts sink in a mixed top-15 — needs a stance-aware term), suppressed rows and the regime block are JSON-only (should be visible before C's flip review), and a campaign-deduplicated pooled sweep becomes its own plan if live evidence says D's bar is wrongly harsh.

Entry 11 · 2026-06-11 · PRs #69, #70, #71, #73, #74

Reading the tape: three-tier sentiment

What people are saying about a ticker, separated by who is saying it — plus two production lessons the same day.

Why

The scanner says what is setting up; it says nothing about the conversation around a name. The /sentiment page reads a ticker three ways and refuses to blend them: official (filings, news), forums (Reddit), and viral (short-form social) — because "the company said" and "the crowd is chanting" are different signals with different failure modes.

Problem

The viral tier needed breadth without API bills. Bluesky's documented public host (public.api.bsky.app) gates the search endpoint, and a naive source order let one noisy network dominate the page. Separately, the same day delivered two production lessons: CI smoke tests that fetch live Google Trends data were rate-limited and broke main, and the first live night of the tiered ledger returned a 500 on the /tournaments index because of a None average-R that only a first night can produce.

Solution

Keyless search via api.bsky.app directly (#73) with round-robin interleaving so each viral source gets alternating slots; Trends rate-limits in CI classified as environment-unavailable and skipped rather than failed (#70); the tier strip guards None aggregates (#74). Rule extracted from the 500: first-night shapes are real shapes — any aggregate over an empty ledger needs a rendering path.
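Two of these fixes are small enough to sketch. Both functions below are illustrative stand-ins with hypothetical names, not the code from #73 or #74: one interleaves sources so each gets alternating slots, the other gives an empty ledger's missing average a rendering path.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel so that legitimate None items are preserved


def interleave(*sources):
    """Round-robin merge: one item from each source in turn until all run dry."""
    merged = []
    for group in zip_longest(*sources, fillvalue=_MISSING):
        merged.extend(item for item in group if item is not _MISSING)
    return merged


def format_avg_r(avg_r):
    """Render an average R, tolerating the None an empty ledger produces."""
    return "n/a" if avg_r is None else f"{avg_r:+.2f}R"
```

With three sources of lengths 3, 1, and 2, the merged order is first-of-each, then second-of-each from whichever sources remain, so a long feed cannot crowd the top of the page.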

Entry 10 · 2026-06-10/11 · PRs #57, #62–#68, #72

The options planner

A conversation in, a priced options ticket out — and the week's best disguised bug.

Why

Scanner tickets are share-denominated; expressing the same hypothesis in options means picking structure, strikes, expiry, and exits. The planner turns a stated hypothesis ("NVDA grinds higher into earnings") into a ticket through a short dialogue (#57), then grew a dedicated page (#62), neutral structures — iron condors and butterflies (#63), candidate strike ladders with computed exit plans (#65), guided level scenarios with a chart card and event warnings (#66), page-level sizing (#68), and after-hours quotes via a CBOE delayed-quote fallback (#72).

Problem

Tickets intermittently came out with a degenerate ATR — risk distances near zero, absurd ladders. The cause was nowhere near the options code: yfinance daily history can end in a trailing row whose OHLC is all-NaN (today's half-formed bar), and every volatility window that touched it collapsed. Process-side, two planner PRs (#64, #67) were merged into the wrong base branch of a PR stack and had to be re-landed cleanly as #65 and #68.

Solution

load_daily now drops a trailing all-NaN OHLC row at the loader — every consumer inherits the fix. The stack mishap became protocol: retarget a stacked PR's base before deleting the branch under it, and prefer clean re-lands over surgery on a wrong-base merge.
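The loader-level guard amounts to one check on the final row. This sketch uses plain dicts in place of the DataFrame that yfinance returns, so it runs without third-party packages; the function name and column keys are assumptions, not the signature of load_daily.

```python
import math

OHLC = ("open", "high", "low", "close")


def drop_trailing_nan_bar(rows):
    """Drop the last row when all four OHLC fields are NaN.

    `rows` is a date-ordered list of dicts. Only the final row is inspected:
    the failure mode is today's half-formed bar, not gaps inside the history.
    """
    if rows and all(math.isnan(rows[-1][field]) for field in OHLC):
        return rows[:-1]
    return rows
```

Fixing it where the data enters means no volatility window downstream has to know the row ever existed.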

Entry 09 · 2026-06-10 · PRs #56, #59, #60, #61

Hardening the pipeline until it says no

Golden metrics, walk-forward validation everywhere, FDR control, and tiers — the night the funnel returned one name.

Why

With real tickets flowing, the pipeline needed the disciplines the concept survey (CON-02) ranked highest: guard against non-stationarity and multiple testing before chasing edge. Four PRs landed it: a golden lock-down of the metrics engine plus a strategy conformance harness (#56), setup detectors rebuilt as walk-forward-validated strategies (#61), nightly Benjamini-Hochberg FDR with ticket/watchlist tiers and a tiered ledger (#59), and an ML model path with ridge_momentum plus an adding-a-model guide (#60).

Finding

The first live FDR night took 1003 candidates down to exactly 1 watch-tier name (NVDA). A funnel that strict feels broken; it is the opposite. Most nights, most setups are indistinguishable from noise, and a pipeline that admits that is the only kind whose survivors mean anything.

Problem

Two quiet traps from this batch: the scan lookback is measured in calendar days (~3.3 trading-years — easy to misread as trading days), and squash-merging a PR stack deletes context — the retarget-before-delete rule was learned here first.

Entry 08 · 2026-06-10 · PRs #51–#55

Russell 1000: tournament → tickets → ledger

Double the universe, fight strategies per ticker, write the winners down, then grade them forward.

Why

The S&P-500 scanner proved the funnel; scale and accountability were missing. This run widened the universe to the Russell 1000 with a two-sided FA gate (#51), made strategy choice empirical — a per-ticker walk-forward tournament in which candidate strategies compete on out-of-sample folds (#52) — turned winners into concrete trade tickets with trigger/stop/target (#53), wired the nightly cron (#54), and gave it all a public face plus a forward performance ledger that grades every issued ticket against subsequent bars (#55).

Problem

Free data at Russell scale: yfinance starts returning 429s around a thousand tickers a night (request thinning is still a standing carry-forward). Deploy-side, Modal redeploys don't roll a warm container: twice the "shipped" code wasn't the running code until the app was stopped explicitly. Comparing the commit shown by modal app history against origin/main became the standard cross-check.

Solution

The ledger is the heart of it: every ticket simulated forward with the same fill rules as the backtests (a limit-entry bar cannot also exit at target), refreshed on read. Strategy selection stopped being an opinion — if a setup family can't win its ticker's tournament out-of-sample, it doesn't issue tickets.
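The fill rule exists because a daily bar does not record whether its high or its low came first, so crediting an entry and a target exit inside the same bar assumes the favorable order. A sketch for a long ticket with a limit entry follows. The function, the (high, low) bar layout, and the choices to let the stop fire on the entry bar and to check the stop before the target are conservative assumptions of ours, not confirmed ledger behavior.

```python
def simulate_long(bars, entry, stop, target):
    """Walk a long limit-entry ticket forward over (high, low) bars.

    Returns "no_fill", "open", "stopped", or "target".
    """
    filled = False
    for high, low in bars:
        if not filled:
            if low <= entry:
                filled = True
                # Entry bar: the target is off limits, only the stop may fire.
                if low <= stop:
                    return "stopped"
            continue
        if low <= stop:  # checked first: pessimistic when both levels trade
            return "stopped"
        if high >= target:
            return "target"
    return "open" if filled else "no_fill"
```

A single bar that touches both the entry and the target therefore leaves the ticket open, and the target can only be credited from the next bar onward.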

Entry 07 · 2026-06-09 · PRs #46–#48, #50

The swing scanner ships

From backtesting the past to scanning tonight — a funnel that ends in grounded briefs.

Why

Everything before this analyzed history. The scanner (#47) looks at tonight: the S&P 500 through a two-pass fundamentals gate (with EDGAR filing trends), three technical setup detectors over the survivors, and an LLM brief per candidate that may only cite documents actually retrieved — the same grounding discipline the fine-tuned assistant was trained under. A nightly cron publishes to /scans. The Jane Street concept survey (#46) landed the same day and became the project's north star: adapt to non-stationarity first, validate with purged time-series CV, train on the true objective.

The original S&P-500 funnel on a real night: 503 → 40 → 4 → ranked cards with trigger/stop levels and grounded briefs. Later entries widen the left edge (Russell 1000) and tighten the right one (FDR tiers).

Problem

The first production wrinkle was infrastructure, not signal: the web app served /scans from a stale Modal volume — the cron wrote nightly reports the page never picked up.

Solution

Reload the data volume before serving (#48). The README got a scanner section with a live screenshot and an FA-gate walkthrough (#50), and the nightly cron has run since.

Entry 06 · 2026-06-08/09 · PRs #40–#45, #49

Earnings straddle: publishing the no

A complete model, a thorough backtest, and a verdict of "don't" — written up like the wins.

Why

An SSRN paper specified an earnings-vol strategy but declined to implement it; we finished the job. Long an at-the-money straddle into scheduled earnings (#40), validated with bootstrap and FDR rather than a lone Sharpe, surfaced in the workbench (#42), then re-run thoroughly (#45) and on real option chains — DoltHub history plus live yfinance snapshots (#49).

Problem

The first pass scanned a hardcoded 2023–2024 window for earnings events regardless of the window requested — silently producing empty calendars elsewhere (#43 fixed it).

Finding — negative result

Buying earnings volatility loses to the volatility risk premium and the post-report IV crush; the only defensible edge is a selection filter (trade only when the forecast move beats the implied one), and even filtered trades didn't clear significance — profit factor 0.82, bootstrap p ≈ 0.83, viability 3/10 (#44). Published with the same care as a win: a documented "no" prevents re-deriving it in six months.

Entry 05 · 2026-06-07/08 · PRs #37, #38

Frictions without a data bill

A validation-and-overfitting harness, and synthetic options frictions calibrated instead of purchased.

Why

Backtests flatter. PR #37 built the validation harness — overfitting checks as first-class machinery, next-open fills as the default everywhere. PR #38 made options backtests pay realistic costs: a synthetic volatility surface and bid/ask spreads, calibrated rather than bought.

Decision

Two scoping calls, made explicit: no paid options data until a strategy survives the synthetic frictions (a real-chain loader stays deferred), and no low-latency work at all — for 2-week-to-6-month holding periods, microseconds are someone else's problem.

Entry 04 · 2026-06-05/07 · PRs #25–#36

Teaching a 7B model to talk about backtests

QLoRA on one RTX 5080, a three-metric eval gate, and one symptom hiding three bugs.

Why

The workbench assistant called a hosted model on every question. The bet: the domain is narrow enough that a fine-tuned local 7B (Qwen2.5 + LoRA) can match it here — answering about these backtests, calling these tools, inventing nothing. Claude plays teacher and judge; a grounding filter discards any training answer citing a number absent from its tool outputs. Data harness (#25), QLoRA trainer (#26), eval gate (#27), then a GPU service on Modal (#36).
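The grounding filter can be sketched as a set comparison over numeric tokens. This is a deliberately naive version with hypothetical names: it matches numbers as literal strings, so "0.59" and "0.590" count as different, and the real filter may normalize more carefully.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_in(text):
    """Every unsigned integer or decimal literal that appears in `text`."""
    return set(NUMBER.findall(text))


def is_grounded(answer, tool_outputs):
    """True if every number the answer cites appears in some tool output."""
    allowed = set()
    for output in tool_outputs:
        allowed |= numbers_in(output)
    return numbers_in(answer) <= allowed
```

Erring toward discarding is the safer failure mode here: a rejected good example costs one training row, while an accepted invented number teaches the model to invent.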

Problem

A new, larger dataset scored worse on methodology questions. That one regression was three stacked bugs: an exact-string-match scorer punishing paraphrased search queries, checkpoint selection by eval-loss handing over the epoch before tool-call format locked in, and gold answers that were simply wrong (ML train/test boilerplate pasted onto rule-based models). The full story has its own page: the fine-tuning build log.

Solution

Token-overlap scoring, an epoch sweep keyed to the metric we actually care about (#33), and curated gold answers (#34) — methodology judge score 0.31 → 0.59 with tool-calls intact. Deploy gotchas worth remembering: Windows consoles crash printing "✓" without PYTHONIOENCODING=utf-8, and a Modal redeploy does not restart a warm container — stop the app to actually ship.
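To show why token overlap is kinder to paraphrase than exact match, here is one common form of it, a bag-of-words F1. The entry does not say which overlap measure the scorer uses, so the choice of F1 and the whitespace tokenizer are assumptions.

```python
from collections import Counter


def token_f1(predicted, gold):
    """F1 over lowercase whitespace tokens, counting repeated tokens."""
    p, g = predicted.lower().split(), gold.lower().split()
    if not p or not g:
        return float(p == g)  # both empty scores 1.0, one empty scores 0.0
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

A reordered query such as "sharpe ratio deflated" scores 1.0 against "deflated sharpe ratio", where an exact-string scorer would give it zero.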

Entry 03 · 2026-06-04/05 · PRs #13, #17, #21, #24, #29–#31

From script to service: the workbench

A typed backtest service, a FastAPI + HTMX + Plotly front end, and a deliberately bounded assistant.

Why

Streamlit had hit its ceiling (entry 01), and both the GUI and the planned LLM assistant needed the same thing underneath: a service substrate — typed request in, validated run, JSON result out (#13). On top of it: the workbench — pick a model, run any window, see charts (#17, #24) — and a chat assistant whose tools are the service itself, so every answer is grounded in a real run (#21).

Problem

The first users (us) immediately found the seams: a ticker control that ignored per-model locks, per-model symbol groups producing 400s, and a date picker happily requesting windows a model couldn't backtest.

Solution

Three small PRs (#29–#31): honor the lock, collapse to a single symbol field, bound the picker to each model's backtestable window. The bounded-tools design proved itself — the assistant cannot cite a number that no tool returned.

Entry 02 · 2026-06-04 · PRs #6–#12

Options engine, honest fills, a reading room

A pricing core, two changes that made every backtest more honest, and the research site this page lives on.

Why

Three foundations in one day. An options pricing core with a delta-hedged seed model (#6) — the strategy that should lose money (realized vol below implied) and measurably did, a correctness check disguised as a model. Backtest accuracy (#11): next-open fills, because filling at the close of the signal bar is quiet lookahead, and the Deflated Sharpe Ratio, because trying many configurations and reporting the best Sharpe is multiple testing. And GitHub Pages hosting with the research hub (#7, #12) — every model gets a working paper, negative results included.
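For readers who have not met the Deflated Sharpe Ratio, a standard-library sketch of the published formula (Bailey and López de Prado) follows. It is not the project's metrics engine: the function names are ours, the Sharpe ratio is per-period rather than annualized, and returning 0.0 for a zero standard deviation or a non-positive variance term is an assumption.

```python
from math import e, sqrt
from statistics import NormalDist, fmean, pstdev

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(sharpe_variance, n_trials):
    """Expected best Sharpe among `n_trials` skill-less trials (n_trials >= 2)."""
    return sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    )


def deflated_sharpe(returns, sharpe_variance, n_trials):
    """Probability that the true Sharpe beats the best-of-N noise benchmark.

    `sharpe_variance` is the variance of Sharpe estimates across the trials.
    """
    n = len(returns)
    mean, sd = fmean(returns), pstdev(returns)
    if sd == 0:
        return 0.0  # assumed sentinel: a constant series has no Sharpe
    z = [(r - mean) / sd for r in returns]
    skew = fmean(v**3 for v in z)
    kurt = fmean(v**4 for v in z)  # raw kurtosis, 3.0 for a normal
    sharpe = mean / sd
    variance_term = 1 - skew * sharpe + (kurt - 1) / 4 * sharpe**2
    if variance_term <= 0:
        return 0.0  # assumed sentinel: the estimator is undefined here
    benchmark = expected_max_sharpe(sharpe_variance, n_trials)
    return _NORMAL.cdf((sharpe - benchmark) * sqrt(n - 1) / sqrt(variance_term))
```

The benchmark rises with the number of configurations tried, so the same return series earns a lower DSR the more alternatives were tested alongside it.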

Finding

The Deflated Sharpe introduced here is the same statistic that, eight days later, becomes the certification bar in entry 12 — and declines to certify anything. The tools for being honest compound.

Entry 01 · 2026-05-29 · PRs #1–#5

Streamlit and the one-gigabyte wall

The prototype era: a quick GUI, and the memory ceiling that shaped everything after.

Why

The repo began as backtest scripts; the first GUI was a Streamlit app with per-model ticker switching (#1) so the models were explorable at all.

Problem

Streamlit Community Cloud caps memory around 1 GB, and the microstructure model loads tick data. The app died with the platform's generic "Error running app" overlay — no traceback, no metric, nothing to debug against. The binding constraint was invisible.

Solution

Stream the tick aggregation instead of materializing it (#5), and harden the sidebar against unresolvable ticker configs (#2). The deeper lesson outlived the fix: a platform whose failure mode is a blank overlay is the wrong host for memory-hungry research — the seed of entry 03's self-hosted workbench.
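The difference between streaming and materializing can be sketched with a generator that holds one bucket at a time. The entry does not say what the microstructure model aggregates, so the OHLCV bars, the (timestamp, price, size) tick layout, and the function name here are all assumptions.

```python
def stream_bars(ticks, bucket_seconds=60):
    """Yield (bucket_start, open, high, low, close, volume) bars lazily.

    `ticks` is any iterable of (ts, price, size) in time order, with `ts` in
    whole seconds. Memory use is one bar, however many ticks flow through.
    """
    current = None
    for ts, price, size in ticks:
        bucket = ts - ts % bucket_seconds
        if current is None or bucket != current[0]:
            if current is not None:
                yield tuple(current)  # the previous bucket is complete
            current = [bucket, price, price, price, price, size]
        else:
            current[2] = max(current[2], price)
            current[3] = min(current[3], price)
            current[4] = price
            current[5] += size
    if current is not None:
        yield tuple(current)
```

Feeding it a file iterator or a chunked reader keeps peak memory flat, which is the property a host with a hard memory cap needs.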

Fifteen days, seventy-nine pull requests, and the corrections along the way

Every merged PR, grouped by entry

PR	Merged (MM-DD)	Title
Entry 12 — replay + plans A–E
#80	06-12	feat(scanner): pooled cross-sectional setup certification — watch-to-ticket promotion path
#79	06-12	feat(replay): --family-mode archived — per-night FA families from archived reports
#78	06-12	feat(scanner): regime overlay — meanrev-against-trend gate (ships dark) + replay A/B
#77	06-12	feat(scanner): entry windows, re-issue cooldown, score-floor plumbing
#76	06-12	feat(scanner): short setup detectors in the watch path
#75	06-12	feat(scanner): no-lookahead historical replay of the technical funnel
Entry 11 — sentiment + production lessons
#74	06-11	fix(tournaments): guard None avg_r in tier strip — first-night ledger 500
#73	06-11	feat(sentiment): bluesky cashtag search joins the viral tier
#71	06-11	docs(readme): ticker sentiment section (three-tier read)
#70	06-11	fix(tests): skip model smoke test on Google Trends rate limiting
#69	06-11	feat(sentiment): /sentiment page — three-tier ticker read (official / forums / viral)
Entry 10 — options planner
#72	06-11	feat(planner): after-hours options via CBOE delayed-quotes fallback
#68	06-11	feat(planner): page-level sizing settings + NaN-bar loader fix (reland #67)
#67	06-11	feat(planner): page-level sizing settings + NaN-bar loader fix
#66	06-11	feat(planner): propose level scenarios with chart card + event warnings
#65	06-11	feat(planner): re-land #64 — candidate strike ladder + computed exit plan on chat tickets
#64	06-11	feat(planner): candidate strike ladder + computed exit plan on chat tickets
#63	06-11	feat(planner): options-only tickets + neutral stance (iron condor / iron butterfly)
#62	06-10	feat(webapp): /planner page — dedicated options-simplifier tab with ticket cards
#57	06-10	feat(assistant): options planner — conversational hypothesis to options ticket
Entry 09 — pipeline hardening
#61	06-10	feat(tournament): setup detectors become walk-forward-validated strategies (hardening B)
#60	06-10	feat(tournament): ML model path — ridge_momentum + adding-a-model guide (hardening D)
#59	06-10	feat(scanner): nightly FDR + ticket/watchlist tiers + tiered ledger (hardening C)
#56	06-10	test(pipeline): metrics golden lock-down + strategy conformance harness (hardening A)
Entry 08 — Russell 1000 tournament era
#55	06-10	feat(tournaments): /tournaments pages + forward ticket-performance ledger
#54	06-10	feat(scan): nightly cron wiring, tickets rendering, /models page (sub-project D)
#53	06-10	feat(strategist): ticket playbook — tournament winners to trade tickets
#52	06-10	feat(tournament): per-ticker walk-forward strategy tournament
#51	06-10	feat(scanner): Russell 1000 universe + two-sided FA gate
Entry 07 — the swing scanner
#50	06-09	docs(readme): nightly swing-scanner section with live screenshot and FA-gate walkthrough
#48	06-09	fix(deploy): reload the data Volume before serving /scans
#47	06-09	feat(scanner): S&P 500 swing-setup scanner with FA gate and LLM document briefs
#46	06-09	docs(concepts): CON-02 Jane Street key-concepts survey
Entry 06 — earnings straddle
#49	06-09	Earnings straddle Phase 2: real option-chain data (DoltHub + yfinance snapshots)
#45	06-09	Feat/earnings straddle thorough backtest
#44	06-09	docs(03-earnings-straddle): thorough backtest — no significant edge, viability 3/10
#43	06-09	fix(03-earnings-straddle): scan the requested window for earnings, not hardcoded 2023-2024
#42	06-08	feat(workbench): surface the earnings-straddle filtered-vs-unfiltered report
#41	06-08	fix(workbench): earnings-straddle empty-calendar note + restore OPT-02 hub card
#40	06-08	feat(options): earnings event-vol straddle model (03) + bootstrap/FDR validation
Entry 05 — validation + frictions
#39	06-08	style: clear ruff format/lint failures (SP1 validation harness + modal deploy)
#38	06-08	Realistic options frictions: synthetic vol surface + bid/ask spread
#37	06-07	feat: validation & overfitting harness (SP1) + next-open fill default
Entry 04 — fine-tuning the assistant
#36	06-07	feat(modal): serve the fine-tuned assistant on a dedicated GPU service
#35	06-06	docs: assistant fine-tuning build log (HTML)
#34	06-06	feat(dataset): curate methodology gold answers + ship n21 adapter
#33	06-06	feat(train): epoch-sweep checkpoint selection + ship epoch-3 n20 adapter
#32	06-06	feat: Step A — dataset/trainer/eval fixes + assistant model switch
#28	06-05	Merge pull request #26 from VanKyle00/feat/llm-assistant-training
#27	06-05	feat(eval): LLM assistant eval harness (sub-project 3)
#26	06-05	feat(training): QLoRA training pipeline (sub-project 2)
#25	06-05	feat(dataset): LLM assistant data harness (sub-project 1)
Entry 03 — service substrate + workbench
#31	06-05	fix(webapp): bound date picker to each model's backtestable window
#30	06-05	fix(webapp): single symbol field — per-model groups caused 400s
#29	06-05	fix(webapp): make ticker control honor per-model lock
#24	06-04	Feat/webapp gui
#21	06-04	LLM assistant (Plan 3): bounded counterfactual agent + /api/v1/chat
#18	06-04	Feat/backtest service substrate
#17	06-04	Web GUI (Plan 2): FastAPI + HTMX + Plotly workbench on the backtest service
#13	06-04	Backtest service substrate (Plan 1 of 4): typed request → validated run → JSON result
Entry 02 — options engine + research hub
#23	06-04	docs: inline filter chips on the research hub
#22	06-04	docs: restore filter & sort access on the hub
#20	06-04	docs: hub polish — family-code IDs, legend, and restored filter access
#19	06-04	docs: fix nested link in concept card (the three-mismatched-links bug)
#16	06-04	docs: make hub track columns visually consistent
#15	06-04	docs: tidy concept-track entry points
#14	06-04	docs: tidy concept-track entry points + point links at GitHub Pages
#12	06-04	docs: research hub + concept-writeups track (order flow & liquidity)
#11	06-04	Options backtest engine + backtest accuracy (next-open fills, Deflated Sharpe)
#10	06-04	feat(docs): tag filtering, sorting, and compact view on catalogue index
#9	06-04	docs: link README model table to rendered working papers
#8	06-04	docs: link working-papers site at top of README
#7	06-04	docs(options): Paper V research write-up + GitHub Pages hosting
#6	06-04	Options backtest & simulation: pricing core + delta-hedged seed model
Entry 01 — streamlit era
#5	05-29	Stream microstructure aggregation to fix Cloud OOM
#4	05-29	adding research docs
#3	05-29	Fix malformed live-demo link in README
#2	05-29	Harden GUI against unresolvable model ticker config
#1	05-29	Add per-model ticker switching to the Streamlit app