Preference flywheel — outcome-weighted DPO with label inversion
Every Approve / Override / Reject decision becomes a labeled DPO pair. preferences/extract.py backfills the realized T+5 market outcome onto each pair. Pairs whose operator decision contradicts the market verdict have their chosen/rejected labels inverted before training. The resulting LoRA scores future rebalance proposals in the modal before the operator decides.
What it is · how it works · why it matters
A closed-loop training pipeline that combines two signals — operator preference and realized T+5 market return — on every preference pair, and weights training by their agreement.
Audit log → preference extractor → T+5 yfinance backfill → outcome-weighted DPO (boost on agreement, invert label on disagreement) → LoRA adapter → live scoring in the rebalance modal. 20% holdout never trained on; post-train AUC + drift detector + rejected-winners list.
Pure RLHF only imitates the labeler. Pure RLVR ignores the labeler's stated reasons. We have both signals on every pair — the cheap dense preference signal + the slower objective verdict — and the trainer corrects one with the other in a single procedure.
Calibration correction, not just imitation. AUC = 0.5 honestly reported when there's no signal yet. Rejected-winners surfaces decision-level evidence — "the LoRA found N plans you rejected that beat SPY by X%."
Pipeline — 4 steps
1. Collect. PreferenceLearningAgent (bus/agents/feedback.py) subscribes to OrderFilled / RebalanceApproved / RebalanceRejected. Each PM action lands in data/audit/rebalance_decisions.jsonl with the decision, reason text, and order list.
2. Extract + backfill. preferences/extract.py::extract_all() rebuilds data/rlhf/preferences.jsonl. For each row with a basket of symbols and a timestamp ≥ T+5 trading days ago, it calls yfinance to compute the equal-weighted basket return vs SPY over those 5 days. Result lands as outcome_t5 = {return_pct, vs_spy_pct, computed_at, window_days: 5}.
3. Outcome-weight + train. preferences/trainer.py::build_dpo_dataset() converts each row into a DPO-shaped tuple (prompt, chosen, rejected), computes an outcome_weight, replicates or inverts the pair, holds back 20%, and writes the rest to data/rlhf/dpo_dataset.jsonl (TRL shape) and data/preferences/pairs.jsonl (NeMo-RL HelpSteer3 chat-message shape).
The outcome-weighting table
| operator decision | T+5 verdict (vs SPY) | weight | treatment |
|---|---|---|---|
| approve | + positive | +1.0 | replicate × 3 (boost: decision was right) |
| approve | − negative | +0.3 | retain (preserve but discount) |
| reject / override | − negative | +1.0 | replicate × 3 (boost: rejection was right) |
| reject / override | + positive | −0.5 | SWAP chosen ↔ rejected, replicate × 1.5 |
| any | (no verdict yet) | +0.5 | retain at moderate weight |
The negative-weight cell is the load-bearing one. When the operator rejected a plan that the market subsequently validated, the chosen/rejected labels are swapped before training — the LoRA learns to push back on the operator in that specific class of decision, not imitate them.
4. Serve + measure. After a successful TRL DPO run, kick_off() updates data/rlhf/adapters/latest → <new adapter dir> atomically. preferences/style_adapter.py hot-reloads the LoRA on top of TinyLlama-1.1B when the symlink mtime changes, and exposes score_proposal(prompt, candidate) → {style_match_score, margin_logp, comment} via POST /api/preferences/style_score. The rebalance modal calls this when each plan opens.
Measurement scaffolding
The 20% holdout (data/rlhf/dpo_holdout.jsonl) is never trained on. After each retrain, preferences/trainer.py::_eval_auc() scores every holdout pair under the just-loaded LoRA and computes:
- approval-prediction AUC — Mann-Whitney U over (score, decision == 'approve'). AUC > 0.65 = useful signal · 0.55–0.65 = marginal · < 0.55 = no useful signal yet. Plain stats, no sklearn dep.
- rejected-winners — pairs where the operator rejected AND the LoRA scored ≥ 0.5 AND
outcome_t5.vs_spy_pct > 0. Top 5 returned with score + vs-SPY %. These are the plans the operator passed on that the market validated — the most actionable artifact the system produces. - drift — rolling mean of the last 3 AUCs from
data/rlhf/dpo_metrics.jsonl. BelowDRIFT_AUC_THRESHOLD = 0.55→drift = highbadge in the UI. Signals the operator's style has shifted relative to the market or their own past behavior.
UI surface — Style-match panel in the rebalance modal
When the rebalance modal opens, it calls POST /api/preferences/style_score with the proposal text. The Style-match panel renders:
- A percentage with color (> 70% green, 50–70% text, 30–50% warn, < 30% danger) — the sigmoid of
margin_logp = mean_token_logp(approve | proposal; π_LoRA) − π_BASE. - A one-line comment matching the bracket: "Consistent with your past approvals" → "Inconsistent with your past pattern; review carefully."
- The learning-signal line:
learning signal: NN% AUC on M held-out plans · stable | drift. - A warn-row when
n_rejected_winners > 0: "⚠ LoRA found N plans you rejected that beat SPY in holdout." Click →/api/preferences/training_statuswith details.
Two trainer routes
| TRL DPO (live) | NeMo-RL DPO (production) | |
|---|---|---|
| Base model | TinyLlama-1.1B-Chat-v1.0 | Nemotron-class (configurable) |
| Adapter | LoRA r=8, lr=5e-5, β=0.1 | full DPO loop, dtensor or Megatron backend |
| Wall-time (60 pairs, 1 epoch) | ~3.8s on GB10 | minutes; gated on multi-GPU when scaled |
| Source data | data/rlhf/dpo_dataset.jsonl (TRL shape) | data/preferences/pairs.jsonl (HelpSteer3 chat-message shape: context + completions[{rank, completion}]) |
| Serves | UI style score (every rebalance) | Production policy checkpoint (when sample volume justifies it) |
| Trigger | POST /api/preferences/train?dry_run=false | Bus: NeMoRLFeedbackAgent emits TrainNemoRLRequested at PAIRS_PER_RETRAIN = 10 new pairs |
Both trainers consume the same outcome-weighted source data. The TRL run is cheap enough to fire on demand and serves the in-UI feedback loop. The NeMo-RL run is the production path that produces a Nemotron checkpoint when sample volume warrants it.
REST surface
| Verb | Path | Purpose |
|---|---|---|
| POST | /api/preferences/extract | Rebuild preferences.jsonl + T+5 outcome backfill. Idempotent. |
| GET | /api/preferences/stats | Personal-style fingerprint (sector tilts, top rejected names, turnover bias) derived from all collected pairs. |
| GET | /api/preferences/dataset?limit=20 | Tail the latest DPO-shaped (prompt, chosen, rejected) pairs. |
| GET | /api/preferences/training_status | n_preferences, n_dpo_pairs, n_holdout, ready_to_train, blocking[], last_run.eval.{approval_auc, rejected_winners}, drift.{drift, mean_auc_last3, history}. |
| POST | /api/preferences/train?dry_run=false | Kick a TRL LoRA-DPO fine-tune. dry_run=true reports readiness only. |
| POST | /api/preferences/style_score | {prompt, candidate} → {available, style_match_score, margin_logp, comment, adapter_path}. Called by the rebalance modal. |
Live verification
End-to-end on Grace-Blackwell GB10, 2026-05-19:
- 60 synthetic preference pairs (random labels) → outcome-weighted to 108 training pairs (75 boosted + 20 inverted + 13 in holdout)
- TRL DPO trained on TinyLlama-1.1B + LoRA in 3.8s
- Adapter saved to
data/rlhf/adapters/<ts>/;latestsymlink updated - Holdout AUC = 0.5 — correctly reported as "No useful signal yet" because the synthetic data has no real pattern
- 3 rejected-winners surfaced with vs-SPY = +0.62%, +1.57%, +1.77%
- NeMo-RL bridge: 8-pair short-prompt smoke →
exit_code: 0, validation loss = 0.6931 =ln(2)(textbook DPO baseline), Epoch 1/1 entered
With real operator data + multi-epoch training the AUC rises and the rejected-winners list becomes the operator-facing learning signal. With random data the system stays at 0.5 and says so — no fabricated confidence.