Fifteen days, seventy-nine pull requests, and the corrections along the way
The running build log of this repository: why each major change was needed, what broke or surprised us while building it, and what fixed it. From a Streamlit prototype hitting a 1 GB memory wall to a nightly Russell-1000 scanning pipeline whose newest feature exists to tell us our own best headline number was wrong.
Window: 2026-05-29 → 2026-06-12 · Merged PRs: 79 · Newest first, oldest at the bottom
Cumulative merged PRs, drawn the way this repo draws everything else. The flat stretch (05-30 → 06-03) is real: nothing merged. The steepening from 06-09 on is the scanner-pipeline era (entries 07 through 12 below). PR #58 was never merged, so 80 PR numbers map to 79 merges.
Replay the past, then fix five things at once (plans A–E)
A no-lookahead replay harness, the study it enabled, and five plans executed in one day — whose two biggest results were nulls.
Why
The scanner had been issuing tickets nightly since 06-10, but forward evidence accumulates at exactly one night per night. To learn anything this year we needed to replay the funnel over history without cheating: each replayed night may only see data available as of that night. PR #75 built that harness (scripts/backfill_scan.py), and the resulting study — every 5th session over ~6 months — produced a headline: the watch tier looked like +23.4R. Five improvement plans (A–E) came out of that study. Then one of them quietly destroyed the headline.
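The no-lookahead rule can be sketched in a few lines. This is a minimal illustration, not the contents of scripts/backfill_scan.py: the bar shape and the helper names bars_as_of and replay are hypothetical, and the toy calendar ignores weekends.

```python
from datetime import date, timedelta


def bars_as_of(bars, night):
    """Return only the bars dated on or before `night` (no lookahead)."""
    return [b for b in bars if b["date"] <= night]


def replay(bars, nights, scan):
    """Run `scan` once per replayed night, on the as-of slice only."""
    return {night: scan(bars_as_of(bars, night)) for night in nights}


# Toy history: six consecutive days with rising closes (assumed data).
start = date(2026, 1, 5)
bars = [{"date": start + timedelta(days=i), "close": 100.0 + i} for i in range(6)]

# Every 5th session, as in the study.
sessions = [b["date"] for b in bars]
nights = sessions[::5]


def last_close(visible):
    """A stand-in scan that reports the last close it is allowed to see."""
    return visible[-1]["close"] if visible else None


results = replay(bars, nights, last_close)
```

The point of the design is that the scan function never receives the full history, so it cannot peek even by accident.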
Plan B gave issuance some discipline: per-setup entry windows (the post-earnings-drift pair gets 15 sessions instead of the default), a re-issue cooldown keyed on (ticker, stance, strategy, tier) so a campaign that is still waiting or open cannot be issued again, and score-floor plumbing left deliberately empty until live evidence justifies any floor.
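A cooldown of this kind might look like the sketch below. The key fields come from the description above; the row shape, the function name apply_cooldown, and the choice to also block a same-night duplicate are assumptions for illustration.

```python
def campaign_key(row):
    """The cooldown key described above: (ticker, stance, strategy, tier)."""
    return (row["ticker"], row["stance"], row["strategy"], row["tier"])


def apply_cooldown(candidates, live_campaigns):
    """Split tonight's candidates into (issued, suppressed).

    `live_campaigns` holds rows whose campaign is still waiting or open.
    """
    live = {campaign_key(c) for c in live_campaigns}
    issued, suppressed = [], []
    for row in candidates:
        key = campaign_key(row)
        if key in live:
            suppressed.append(row)
        else:
            issued.append(row)
            live.add(key)  # assumption: a same-night duplicate is blocked too
    return issued, suppressed


live = [{"ticker": "AAA", "stance": "long", "strategy": "pead", "tier": "watch"}]
tonight = [
    {"ticker": "AAA", "stance": "long", "strategy": "pead", "tier": "watch"},
    {"ticker": "AAA", "stance": "short", "strategy": "pead_down", "tier": "watch"},
    {"ticker": "BBB", "stance": "long", "strategy": "base_breakout", "tier": "watch"},
]
issued, suppressed = apply_cooldown(tonight, live)
```

Keying on the full tuple means a short setup on the same ticker is a separate campaign and is not suppressed by a live long one.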
Problem
Replaying the study with the cooldown suppressed 43 of the 128 baseline watch rows. They were not new signals — they were the same persistent setups re-firing every replay night while their first campaign was still live, each re-fire counted as if it were a fresh trade.
Finding
The +23.4R headline was inflated roughly 3× by pseudo-replication. Deduplicated to one count per campaign, the window's truth is watch +3.15R: post-earnings drift +2.8R (3 of 4 campaigns hit), base breakouts ≈ noise (+1.2R across 36), shorts −6.6R. The replay harness's most valuable output to date is this correction to our own beliefs.
The correction, to scale. 43 of 128 baseline rows were the same campaigns re-firing on consecutive replay nights; counting each campaign once shrinks the window's watch-tier result from +23.4R to +3.15R. The per-strategy components are the deduplicated truth.
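The effect of pseudo-replication is easy to reproduce on toy data. The sketch below assumes a simplified campaign key of (ticker, stance, strategy) and keeps only the earliest firing per key; the rows and R values are invented for illustration and are not the study's numbers.

```python
def dedupe_campaigns(rows):
    """Keep the earliest firing per campaign key; later re-fires are dropped."""
    first = {}
    for row in sorted(rows, key=lambda r: r["night"]):
        key = (row["ticker"], row["stance"], row["strategy"])
        first.setdefault(key, row)
    return list(first.values())


# One persistent setup re-fires on three replay nights (toy data).
rows = [
    {"night": 1, "ticker": "AAA", "stance": "long", "strategy": "pead", "r": 2.0},
    {"night": 2, "ticker": "AAA", "stance": "long", "strategy": "pead", "r": 2.0},
    {"night": 3, "ticker": "AAA", "stance": "long", "strategy": "pead", "r": 2.0},
    {"night": 2, "ticker": "BBB", "stance": "long", "strategy": "base_breakout", "r": -1.0},
]

naive_total = sum(r["r"] for r in rows)
deduped_total = sum(r["r"] for r in dedupe_campaigns(rows))
```

Here the naive sum counts the same winning campaign three times, which is the mechanism behind the inflated headline.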
The FA gate already computed a bottom-40 cohort that long-only scanning threw away. Plan A mirrored the detectors — base_breakdown, ma_rally_fade, pead_down — over that cohort into the watch path, with direction-aware briefs and a short chip on /scans.
Problem
Mirroring levels is where the bugs live: review caught moving-average pairs emitting trigger/stop in the wrong order — and the guard built for it exposed a pre-existing long-side bug (squeeze-spike tape emitting inverted levels). Short 2R targets can also be arithmetically unattainable when the risk exceeds half the trigger price.
Solution
Ordering guards on every emitted level pair, hard ValueErrors on stance/cohort mismatches, and unattainable targets demoted to report-only rows with levels: null. Replayed: the long side stayed byte-identical; the new shorts went 14 issued / 1 hit, −8.6R in an up-trending window — stated plainly in the PR, tracked in the live ledger by stance, not traded.
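The unattainable-target condition follows from simple arithmetic: a short with trigger T and stop S has risk S − T, so a 2R target sits at T − 2(S − T), which is positive only when the risk is below T/2. A minimal sketch of the guards, assuming a hypothetical levels_for helper and returning None where the report would carry levels: null:

```python
def levels_for(stance, trigger, stop, r_multiple=2.0):
    """Return trigger/stop/target levels, or None for a report-only row.

    Raises ValueError when the level pair is ordered wrongly for the stance.
    """
    if stance == "long":
        if not stop < trigger:
            raise ValueError("long: stop must sit below trigger")
        target = trigger + r_multiple * (trigger - stop)
    elif stance == "short":
        if not stop > trigger:
            raise ValueError("short: stop must sit above trigger")
        target = trigger - r_multiple * (stop - trigger)
        if target <= 0:
            return None  # unattainable: price cannot fall to zero or below
    else:
        raise ValueError(f"unknown stance: {stance!r}")
    return {"trigger": trigger, "stop": stop, "target": target}


ok_short = levels_for("short", 100.0, 110.0)  # risk 10 < 50, target 80
no_target = levels_for("short", 10.0, 16.0)   # risk 6 > 5, target would be -2
```

Raising on inverted pairs rather than silently swapping them is what lets a guard like this surface a pre-existing bug instead of hiding it.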
A November loss on a PLTR mean-reversion pair motivated a regime gate: block mean-reversion-against-trend when SPY is below its 200-day moving average. Built as scanner/regime.py, always reported, gating only behind a default-off flag, with a replay A/B arm.
Finding — null
The A/B arms were byte-identical (85 rows, +1.15R both). The motivating PLTR pair fired with SPY ~12% above its 200dma — that "selloff" never breached the trend line, and only 2 of 42 replay nights were down-trend at all. The plan's own sanity expectation (PLTR blocked) was not met, and we recorded that instead of tuning the definition until it fired. The gate ships dark; flipping it on requires ≥20 live nights of regime-block evidence. If mean-reversion losses recur in pullbacks that don't breach the 200dma, the fix is a better trend definition, not a forced flip.
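A report-always, gate-behind-a-flag design could be sketched as follows. This is not the contents of scanner/regime.py; the function names, the row field kind, and the label strings are assumptions, and the closes are toy values chosen so the last print sits about 12% above the average.

```python
def sma(values, window):
    """Simple moving average of the last `window` values, or None if too short."""
    if len(values) < window:
        return None
    return sum(values[-window:]) / window


def regime(spy_closes, window=200):
    """Label the trend: 'down' when the last close is below its moving average."""
    avg = sma(spy_closes, window)
    if avg is None:
        return "unknown"
    return "down" if spy_closes[-1] < avg else "up"


def gate(row, regime_label, enabled=False):
    """Always annotate the regime; block only when the flag is on."""
    blocked = (
        enabled
        and regime_label == "down"
        and row["kind"] == "mean_reversion"
    )
    return {**row, "regime": regime_label, "regime_blocked": blocked}


pair = {"ticker": "PLTR", "kind": "mean_reversion"}
above_trend = regime([100.0] * 199 + [112.0])
below_trend = regime([100.0] * 199 + [90.0])
```

With the index above its average, the row passes even with the flag enabled, which mirrors why the A/B arms could not differ.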
The replay had been using today's FA family for historical nights — a quiet survivorship distortion. --family-mode archived reconstructs each night's family from the newest archived report at or before that night, stamps family_source: "fallback" explicitly wherever coverage predates the archives, and loads the union of bars. A local 2026-06-09 report turned out to predate the FA format entirely — the skip path handled it in the wild on day one.
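The "newest archived report at or before that night" lookup is a sorted search. A minimal sketch, assuming the archive is a date-sorted list of (report_date, tickers) pairs; the function name family_for and the "archived" label are hypothetical, while "fallback" is the stamp named above.

```python
from bisect import bisect_right
from datetime import date


def family_for(night, archive, current_family):
    """Pick the family from the newest archived report dated <= night.

    `archive` must be sorted by report date. When no report is old enough,
    fall back to today's family and say so explicitly.
    """
    dates = [report_date for report_date, _ in archive]
    i = bisect_right(dates, night)
    if i == 0:
        return {"family": current_family, "family_source": "fallback"}
    return {"family": archive[i - 1][1], "family_source": "archived"}


archive = [
    (date(2026, 6, 10), ["AAA", "BBB"]),
    (date(2026, 6, 11), ["AAA", "CCC"]),
]
today = ["AAA", "DDD"]

early = family_for(date(2026, 6, 1), archive, today)
exact = family_for(date(2026, 6, 10), archive, today)
later = family_for(date(2026, 6, 12), archive, today)
```

Stamping the source on every night keeps the survivorship distortion visible in the output instead of silent.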
Per-ticker forward ledgers starve: no single ticker accumulates enough campaign dates to certify anything. Plan D pools by setup type across all tickers — an as-of sweep stamps every historical firing, per-date mean-R series feed the existing metrics engine, and a type is certified only at DSR ≥ 0.90, ≥ 20 pooled dates, and a Benjamini-Hochberg FDR pass across the 6-type menu. Certified types promote watch rows to real tickets, ahead of cooldown and regime in the pipeline (the ordering is mutation-locked by a test). A weekly Modal sidecar recomputes certification, failure-isolated and keep-stale-on-failure.
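The three-part certification bar can be sketched with the standard Benjamini-Hochberg step-up procedure. The thresholds 0.90 and 20 come from the text; the alpha of 0.05, the per-type p_value field, and the function names are assumptions, since the log does not say how the p-values are derived or which FDR level is used.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where BH rejects the null."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank  # largest rank that clears its threshold
    passed = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            passed[i] = True
    return passed


def certify(types, alpha=0.05, min_dsr=0.90, min_dates=20):
    """A type is certified only if it clears DSR, sample size, and FDR."""
    fdr_pass = benjamini_hochberg([t["p_value"] for t in types], alpha)
    return {
        t["name"]: ok and t["dsr"] >= min_dsr and t["n_dates"] >= min_dates
        for t, ok in zip(types, fdr_pass)
    }


# Toy 6-type menu (invented numbers, not the replay's results).
menu = [
    {"name": "a", "p_value": 0.001, "dsr": 0.95, "n_dates": 30},
    {"name": "b", "p_value": 0.008, "dsr": 0.50, "n_dates": 40},
    {"name": "c", "p_value": 0.039, "dsr": 0.95, "n_dates": 40},
    {"name": "d", "p_value": 0.041, "dsr": 0.95, "n_dates": 40},
    {"name": "e", "p_value": 0.200, "dsr": 0.95, "n_dates": 40},
    {"name": "f", "p_value": 0.600, "dsr": 0.95, "n_dates": 10},
]
certified = certify(menu)
```

In the toy menu, type b passes FDR but fails on DSR, and type c has a good DSR but does not survive the multiple-testing correction.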
Why nothing certifies. Pooling solved the structural problem — every setup type now has 28–77 dates against a 20-date bar — so sample size is no longer the excuse. The Deflated Sharpe of the pooled per-date returns is what fails, for every type, every week of the replay.
Finding — the null is the result
0 of 6 types certified on every weekly recompute; the A/B arms were byte-identical (85 rows, +1.15R). And the pooled base rates contradict the study's best story: unconditioned post-earnings drift loses −18.5R over 54 pooled dates (its short mirror −18.2R over 55). The 86%-hit pead record that looked so good was 4 deduplicated campaigns — selection plus small-n noise, not an edge the cross-section corroborates. Promotion ships on, because certification is itself the evidence bar — and stays dark in practice until a type genuinely earns it. One known conservative bias, deliberately kept: the sweep has no cooldown, so serial re-fires over-count within a ticker; softening that to make the bar passable would be replay-tuning.
Problem
Review caught a true Critical before it reached production: render_markdown assumed every row has option structures and would have crashed with a KeyError on the first night a promoted row appeared — killing that night's report. Two test-fixture gotchas also bit: a constant return series produces DSR ≈ 1.0 from float noise, and a bimodal alternating series produces negative sample-kurtosis variance and a 0.0 sentinel.
Solution
Report, webapp, and scan-runner all hardened against structure-less promoted rows, with regression tests; DSR tests use a seeded normal series, the only fixture that behaves. The promotion-before-cooldown ordering and the default-on flag are both pinned by tests so they cannot drift silently.
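The crash class is a direct key lookup on a field that promoted rows do not carry. The sketch below is not the project's render_markdown; the row shape, the structures field name, and the render_row helper are hypothetical, and it only illustrates the defensive lookup.

```python
def render_row(row):
    """Render one report row, tolerating rows with no option structures."""
    # row["structures"] would raise KeyError on a promoted row;
    # .get() with an empty default gives every row a rendering path.
    structures = row.get("structures") or []
    lines = [f"{row['ticker']} ({row['tier']})"]
    if not structures:
        lines.append("  no option structures for this row")
    for s in structures:
        lines.append(f"  {s['name']}: {s['legs']} legs")
    return "\n".join(lines)


promoted = {"ticker": "NVDA", "tier": "ticket"}
planned = {
    "ticker": "AAA",
    "tier": "watch",
    "structures": [{"name": "call spread", "legs": 2}],
}
report = "\n".join(render_row(r) for r in (promoted, planned))
```

A regression test that renders a structure-less row is what keeps this path from breaking again.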
The nightly funnel after plans A–E. Promotion runs before cooldown and the (dark) regime gate so that a freshly certified ticket cannot be suppressed by its own watch-campaign history; a test fails if anyone reorders it.
Lessons from the week
(1) A replay harness's best product is corrections — the 3× inflation, two null A/Bs, and an uncorroborated star strategy were each worth more than a feature. (2) Nulls get recorded, not tuned away. (3) Two-stage review caught a real bug in every plan, including one that would have killed a production report. (4) Evidence bars ship dark (or dark-in-practice) until the evidence exists.
Carried forward: ranking is cohort-blind (shorts sink in a mixed top-15 — needs a stance-aware term), suppressed rows and the regime block are JSON-only (should be visible before C's flip review), and a campaign-deduplicated pooled sweep becomes its own plan if live evidence says D's bar is wrongly harsh.
What people are saying about a ticker, separated by who is saying it — plus two production lessons the same day.
Why
The scanner says what is setting up; it says nothing about the conversation around a name. The /sentiment page reads a ticker three ways and refuses to blend them: official (filings, news), forums (Reddit), and viral (short-form social) — because "the company said" and "the crowd is chanting" are different signals with different failure modes.
Problem
The viral tier needed breadth without API bills. Bluesky's documented public host (public.api.bsky.app) gates the search endpoint, and a naive source order let one noisy network dominate the page. Separately, the same day delivered two production lessons: CI smoke tests that fetch live Google Trends data got rate-limited and broke main, and the first live night of the tiered ledger 500'd the /tournaments index on a None average-R that only a first night can produce.
Solution
Keyless search via api.bsky.app directly (#73) with round-robin interleaving so each viral source gets alternating slots; Trends rate-limits in CI classified as environment-unavailable and skipped rather than failed (#70); the tier strip guards None aggregates (#74). Rule extracted from the 500: first-night shapes are real shapes — any aggregate over an empty ledger needs a rendering path.
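Both fixes are small. A minimal sketch of round-robin interleaving and of a None-safe aggregate formatter, with hypothetical function names and toy post ids:

```python
from itertools import zip_longest

_MISSING = object()  # sentinel so that None can be a legitimate item


def interleave(*sources):
    """Take one item from each source in turn until all are exhausted."""
    out = []
    for group in zip_longest(*sources, fillvalue=_MISSING):
        out.extend(item for item in group if item is not _MISSING)
    return out


def format_avg_r(avg_r):
    """Render an average R, including the empty-ledger first-night case."""
    return "n/a" if avg_r is None else f"{avg_r:+.2f}R"


merged = interleave(["b1", "b2", "b3"], ["t1"], ["m1", "m2"])
```

A source with many results still only gets every third slot while the others have items left, which is what stops one network from dominating the page.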
A conversation in, a priced options ticket out — and the week's best disguised bug.
Why
Scanner tickets are share-denominated; expressing the same hypothesis in options means picking structure, strikes, expiry, and exits. The planner turns a stated hypothesis ("NVDA grinds higher into earnings") into a ticket through a short dialogue (#57), then grew a dedicated page (#62), neutral structures — iron condors and butterflies (#63), candidate strike ladders with computed exit plans (#65), guided level scenarios with a chart card and event warnings (#66), page-level sizing (#68), and after-hours quotes via a CBOE delayed-quote fallback (#72).
Problem
Tickets intermittently came out with a degenerate ATR — risk distances near zero, absurd ladders. The cause was nowhere near the options code: yfinance daily history can end in a trailing row whose OHLC is all-NaN (today's half-formed bar), and every volatility window that touched it collapsed. Process-side, two planner PRs (#64, #67) were merged into the wrong base branch of a PR stack and had to be re-landed cleanly as #65 and #68.
Solution
load_daily now drops a trailing all-NaN OHLC row at the loader — every consumer inherits the fix. The stack mishap became protocol: retarget a stacked PR's base before deleting the branch under it, and prefer clean re-lands over surgery on a wrong-base merge.
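The loader-level fix can be illustrated without any data library. This is a sketch of the idea only: the real load_daily presumably works on the frame yfinance returns, whereas the version below assumes bars are plain dicts, and the helper name is hypothetical.

```python
import math

OHLC = ("open", "high", "low", "close")


def drop_trailing_nan_row(rows):
    """Drop the last bar when all four OHLC fields are NaN; otherwise no-op."""
    if rows and all(math.isnan(rows[-1][field]) for field in OHLC):
        return rows[:-1]
    return rows


nan = float("nan")
history = [
    {"open": 10.0, "high": 11.0, "low": 9.5, "close": 10.5},
    {"open": 10.5, "high": 12.0, "low": 10.0, "close": 11.5},
    {"open": nan, "high": nan, "low": nan, "close": nan},  # half-formed bar
]
clean = drop_trailing_nan_row(history)
```

Only the trailing row is touched, so a NaN gap in the middle of the history stays visible to whatever is responsible for handling it.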
Golden metrics, walk-forward validation everywhere, FDR control, and tiers — the night the funnel returned one name.
Why
With real tickets flowing, the pipeline needed the disciplines the concept survey (CON-02) ranked highest: guard against non-stationarity and multiple testing before chasing edge. Four PRs landed it: a golden lock-down of the metrics engine plus a strategy conformance harness (#56), setup detectors rebuilt as walk-forward-validated strategies (#61), nightly Benjamini-Hochberg FDR with ticket/watchlist tiers and a tiered ledger (#59), and an ML model path with ridge_momentum plus an adding-a-model guide (#60).
Finding
The first live FDR night took 1003 candidates down to exactly 1 watch-tier name (NVDA). A funnel that strict feels broken; it is the opposite. Most nights, most setups are indistinguishable from noise, and a pipeline that admits that is the only kind whose survivors mean anything.
Problem
Two quiet traps from this batch: the scan lookback is measured in calendar days (~3.3 trading-years — easy to misread as trading days), and squash-merging a PR stack deletes context — the retarget-before-delete rule was learned here first.
Double the universe, fight strategies per ticker, write the winners down, then grade them forward.
Why
The S&P-500 scanner proved the funnel; scale and accountability were missing. This run widened the universe to the Russell 1000 with a two-sided FA gate (#51), made strategy choice empirical — a per-ticker walk-forward tournament in which candidate strategies compete on out-of-sample folds (#52) — turned winners into concrete trade tickets with trigger/stop/target (#53), wired the nightly cron (#54), and gave it all a public face plus a forward performance ledger that grades every issued ticket against subsequent bars (#55).
Problem
Free data at Russell scale: yfinance starts returning 429s around a thousand tickers a night (request thinning is still a standing carry-forward). Deploy-side, Modal redeploys don't roll a warm container: twice the "shipped" code wasn't the running code until the app was stopped explicitly. Comparing the commit shown by modal app history against origin/main became the standard cross-check.
Solution
The ledger is the heart of it: every ticket simulated forward with the same fill rules as the backtests (a limit-entry bar cannot also exit at target), refreshed on read. Strategy selection stopped being an opinion — if a setup family can't win its ticker's tournament out-of-sample, it doesn't issue tickets.
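The fill rule in parentheses can be made concrete with a small forward simulation. The sketch assumes a long ticket with entry, stop, and target prices; the function name, the rule that the entry bar produces no exit at all, and checking the stop before the target within a bar are conservative assumptions, not the ledger's documented behavior.

```python
def simulate_long(ticket, bars):
    """Walk bars forward; return the result in R, or None if unresolved."""
    risk = ticket["entry"] - ticket["stop"]
    filled = False
    for bar in bars:
        if not filled:
            # Fill when the bar trades through the entry price.
            if bar["low"] <= ticket["entry"] <= bar["high"]:
                filled = True
            continue  # the entry bar itself never produces an exit
        if bar["low"] <= ticket["stop"]:
            return -1.0  # assumption: stop is checked before target
        if bar["high"] >= ticket["target"]:
            return (ticket["target"] - ticket["entry"]) / risk
    return None


ticket = {"entry": 100.0, "stop": 95.0, "target": 110.0}
bars = [
    {"low": 99.0, "high": 111.0},   # fills here and also touches the target
    {"low": 101.0, "high": 105.0},
    {"low": 104.0, "high": 112.0},  # target is credited on this bar
]
outcome = simulate_long(ticket, bars)
```

Daily bars do not reveal whether the high came before or after the fill, so crediting a same-bar target would assume the favorable ordering.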
From backtesting the past to scanning tonight — a funnel that ends in grounded briefs.
Why
Everything before this analyzed history. The scanner (#47) looks at tonight: the S&P 500 through a two-pass fundamentals gate (with EDGAR filing trends), three technical setup detectors over the survivors, and an LLM brief per candidate that may only cite documents actually retrieved — the same grounding discipline the fine-tuned assistant was trained under. A nightly cron publishes to /scans. The Jane Street concept survey (#46) landed the same day and became the project's north star: adapt to non-stationarity first, validate with purged time-series CV, train on the true objective.
The original S&P-500 funnel on a real night: 503 → 40 → 4 → ranked cards with trigger/stop levels and grounded briefs. Later entries widen the left edge (Russell 1000) and tighten the right one (FDR tiers).
Problem
The first production wrinkle was infrastructure, not signal: the web app served /scans from a stale Modal volume — the cron wrote nightly reports the page never picked up.
Solution
Reload the data volume before serving (#48). The README got a scanner section with a live screenshot and an FA-gate walkthrough (#50), and the nightly cron has run since.
A complete model, a thorough backtest, and a verdict of "don't" — written up like the wins.
Why
An SSRN paper specified an earnings-vol strategy but declined to implement it; we finished the job. Long an at-the-money straddle into scheduled earnings (#40), validated with bootstrap and FDR rather than a lone Sharpe, surfaced in the workbench (#42), then re-run thoroughly (#45) and on real option chains — DoltHub history plus live yfinance snapshots (#49).
Problem
The first pass scanned a hardcoded 2023–2024 window for earnings events regardless of the window requested — silently producing empty calendars elsewhere (#43 fixed it).
Finding — negative result
Buying earnings volatility loses to the volatility risk premium and the post-report IV crush; the only defensible edge is a selection filter (trade only when the forecast move beats the implied one), and even filtered trades didn't clear significance — profit factor 0.82, bootstrap p ≈ 0.83, viability 3/10 (#44). Published with the same care as a win: a documented "no" prevents re-deriving it in six months.
A validation-and-overfitting harness, and synthetic options frictions calibrated instead of purchased.
Why
Backtests flatter. PR #37 built the validation harness — overfitting checks as first-class machinery, next-open fills as the default everywhere. PR #38 made options backtests pay realistic costs: a synthetic volatility surface and bid/ask spreads, calibrated rather than bought.
Decision
Two scoping calls, made explicit: no paid options data until a strategy survives the synthetic frictions (a real-chain loader stays deferred), and no low-latency work at all — for 2-week-to-6-month holding periods, microseconds are someone else's problem.
QLoRA on one RTX 5080, a three-metric eval gate, and one symptom hiding three bugs.
Why
The workbench assistant called a hosted model on every question. The bet: the domain is narrow enough that a fine-tuned local 7B (Qwen2.5 + LoRA) can match it here — answering about these backtests, calling these tools, inventing nothing. Claude plays teacher and judge; a grounding filter discards any training answer citing a number absent from its tool outputs. Data harness (#25), QLoRA trainer (#26), eval gate (#27), then a GPU service on Modal (#36).
Problem
A new, larger dataset scored worse on methodology questions. That one regression was three stacked bugs: an exact-string-match scorer punishing paraphrased search queries, checkpoint selection by eval-loss handing over the epoch before tool-call format locked in, and gold answers that were simply wrong (ML train/test boilerplate pasted onto rule-based models). The full story has its own page: the fine-tuning build log.
Solution
Token-overlap scoring, an epoch sweep keyed to the metric we actually care about (#33), and curated gold answers (#34) — methodology judge score 0.31 → 0.59 with tool-calls intact. Deploy gotchas worth remembering: Windows consoles crash printing "✓" without PYTHONIOENCODING=utf-8, and a Modal redeploy does not restart a warm container — stop the app to actually ship.
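The scorer change can be sketched as follows. The log only says "token-overlap"; the specific choice of F1 over unique lowercase tokens, and the function names, are assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(predicted, gold):
    """F1 over unique tokens: 1.0 for the same words, 0.0 for none shared."""
    p, g = tokens(predicted), tokens(gold)
    common = len(p & g)
    if common == 0:
        return 0.0
    precision = common / len(p)
    recall = common / len(g)
    return 2 * precision * recall / (precision + recall)


exact = "walk-forward validation folds" == "walk forward validation"
overlap = token_overlap("walk-forward validation folds", "walk forward validation")
```

An exact-string scorer gives the paraphrased query zero credit, while the overlap score stays high because three of the four tokens match.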
A typed backtest service, a FastAPI + HTMX + Plotly front end, and a deliberately bounded assistant.
Why
Streamlit had hit its ceiling (entry 01), and both the GUI and the planned LLM assistant needed the same thing underneath: a service substrate — typed request in, validated run, JSON result out (#13). On top of it: the workbench — pick a model, run any window, see charts (#17, #24) — and a chat assistant whose tools are the service itself, so every answer is grounded in a real run (#21).
Problem
The first users (us) immediately found the seams: a ticker control that ignored per-model locks, per-model symbol groups producing 400s, and a date picker happily requesting windows a model couldn't backtest.
Solution
Three small PRs (#29–#31): honor the lock, collapse to a single symbol field, bound the picker to each model's backtestable window. The bounded-tools design proved itself — the assistant cannot cite a number that no tool returned.
A pricing core, two changes that made every backtest more honest, and the research site this page lives on.
Why
Three foundations in one day. An options pricing core with a delta-hedged seed model (#6) — the strategy that should lose money (realized vol below implied) and measurably did, a correctness check disguised as a model. Backtest accuracy (#11): next-open fills, because filling at the close of the signal bar is quiet lookahead, and the Deflated Sharpe Ratio, because trying many configurations and reporting the best Sharpe is multiple testing. And GitHub Pages hosting with the research hub (#7, #12) — every model gets a working paper, negative results included.
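The next-open rule is a one-bar shift between observing a signal and acting on it. A minimal sketch, with a hypothetical function name and toy bars:

```python
def next_open_fills(bars, signals):
    """Fill each signal at the open of the bar after the one that produced it.

    signals[i] is True when the signal is known at the close of bar i.
    A signal on the final bar has no next open yet, so it produces no fill.
    """
    fills = []
    for i, fired in enumerate(signals):
        if fired and i + 1 < len(bars):
            fills.append({"signal_bar": i, "fill_bar": i + 1, "price": bars[i + 1]["open"]})
    return fills


bars = [
    {"open": 10.0, "close": 10.4},
    {"open": 10.6, "close": 10.2},
    {"open": 10.1, "close": 10.9},
]
fills = next_open_fills(bars, [True, False, True])
```

Filling at 10.6 instead of the 10.4 signal-bar close is the cost of honesty here: the close is only known once it is too late to trade at it.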
Finding
The Deflated Sharpe introduced here is the same statistic that, eight days later, becomes the certification bar in entry 12 — and declines to certify anything. The tools for being honest compound.
The prototype era: a quick GUI, and the memory ceiling that shaped everything after.
Why
The repo began as backtest scripts; the first GUI was a Streamlit app with per-model ticker switching (#1) so the models were explorable at all.
Problem
Streamlit Community Cloud caps memory around 1 GB, and the microstructure model loads tick data. The app died with the platform's generic "Error running app" overlay — no traceback, no metric, nothing to debug against. The binding constraint was invisible.
Solution
Stream the tick aggregation instead of materializing it (#5), and harden the sidebar against unresolvable ticker configs (#2). The deeper lesson outlived the fix: a platform whose failure mode is a blank overlay is the wrong host for memory-hungry research — the seed of entry 03's self-hosted workbench.
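Streaming aggregation keeps one bucket in memory instead of the whole tick set. The sketch below is an illustration of the technique, not the code from #5: the tick shape (timestamp in seconds, price, size), the 60-second bucket, and the function name are assumptions, and ticks are assumed to arrive in time order.

```python
def stream_bars(ticks, bucket_seconds=60):
    """Yield one OHLCV bar per time bucket from an iterator of ticks.

    Only the bar being built is held in memory. Ticks must be time-ordered.
    """
    current = None
    for ts, price, size in ticks:
        bucket = ts - ts % bucket_seconds
        if current is None or bucket != current["start"]:
            if current is not None:
                yield current  # bucket finished, hand it downstream
            current = {"start": bucket, "open": price, "high": price,
                       "low": price, "close": price, "volume": size}
        else:
            current["high"] = max(current["high"], price)
            current["low"] = min(current["low"], price)
            current["close"] = price
            current["volume"] += size
    if current is not None:
        yield current


def tick_source():
    """Stand-in for a file or network reader that yields ticks lazily."""
    yield (0, 10.0, 1)
    yield (30, 12.0, 2)
    yield (61, 11.0, 1)


bars = list(stream_bars(tick_source()))
```

Because both the source and the aggregator are generators, peak memory depends on the bucket size rather than on the length of the tick history.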
Appendix · full PR ledger
Every merged PR, grouped by entry
79 merges. Era rows link back to the entries above; PR numbers link to GitHub.