[ engine · preference flywheel ]

Preference flywheel — outcome-weighted DPO with label inversion

Every Approve / Override / Reject decision becomes a labeled DPO pair. preferences/extract.py backfills the realized T+5 market outcome onto each pair. Pairs whose operator decision contradicts the market verdict have their chosen/rejected labels inverted before training. The resulting LoRA scores future rebalance proposals in the modal before the operator decides.

What it is · how it works · why it matters

[ what ]

A closed-loop training pipeline that combines two signals — operator preference and realized T+5 market return — on every preference pair, and weights training by their agreement.

[ how ]

Audit log → preference extractor → T+5 yfinance backfill → outcome-weighted DPO (boost on agreement, invert label on disagreement) → LoRA adapter → live scoring in the rebalance modal. 20% holdout never trained on; post-train AUC + drift detector + rejected-winners list.

[ why ]

Pure RLHF only imitates the labeler. Pure RLVR ignores the labeler's stated reasons. We have both signals on every pair — the cheap dense preference signal + the slower objective verdict — and the trainer corrects one with the other in a single procedure.

[ advantage ]

Calibration correction, not just imitation. AUC = 0.5 honestly reported when there's no signal yet. Rejected-winners surfaces decision-level evidence — "the LoRA found N plans you rejected that beat SPY by X%."

Pipeline — 4 steps

1. Collect. PreferenceLearningAgent (bus/agents/feedback.py) subscribes to OrderFilled / RebalanceApproved / RebalanceRejected. Each PM action lands in data/audit/rebalance_decisions.jsonl with the decision, reason text, and order list.

2. Extract + backfill. preferences/extract.py::extract_all() rebuilds data/rlhf/preferences.jsonl. For each row with a basket of symbols and a timestamp ≥ T+5 trading days ago, it calls yfinance to compute the equal-weighted basket return vs SPY over those 5 days. Result lands as outcome_t5 = {return_pct, vs_spy_pct, computed_at, window_days: 5}.

3. Outcome-weight + train. preferences/trainer.py::build_dpo_dataset() converts each row into a DPO-shaped tuple (prompt, chosen, rejected), computes an outcome_weight, replicates or inverts the pair, holds back 20%, and writes the rest to data/rlhf/dpo_dataset.jsonl (TRL shape) and data/preferences/pairs.jsonl (NeMo-RL HelpSteer3 chat-message shape).

The outcome-weighting table

operator decisionT+5 verdict (vs SPY)weighttreatment
approve+ positive+1.0replicate × 3 (boost: decision was right)
approve− negative+0.3retain (preserve but discount)
reject / override− negative+1.0replicate × 3 (boost: rejection was right)
reject / override+ positive−0.5SWAP chosen ↔ rejected, replicate × 1.5
any(no verdict yet)+0.5retain at moderate weight

The negative-weight cell is the load-bearing one. When the operator rejected a plan that the market subsequently validated, the chosen/rejected labels are swapped before training — the LoRA learns to push back on the operator in that specific class of decision, not imitate them.

4. Serve + measure. After a successful TRL DPO run, kick_off() updates data/rlhf/adapters/latest<new adapter dir> atomically. preferences/style_adapter.py hot-reloads the LoRA on top of TinyLlama-1.1B when the symlink mtime changes, and exposes score_proposal(prompt, candidate) → {style_match_score, margin_logp, comment} via POST /api/preferences/style_score. The rebalance modal calls this when each plan opens.

Measurement scaffolding

The 20% holdout (data/rlhf/dpo_holdout.jsonl) is never trained on. After each retrain, preferences/trainer.py::_eval_auc() scores every holdout pair under the just-loaded LoRA and computes:

UI surface — Style-match panel in the rebalance modal

When the rebalance modal opens, it calls POST /api/preferences/style_score with the proposal text. The Style-match panel renders:

Two trainer routes

TRL DPO (live)NeMo-RL DPO (production)
Base modelTinyLlama-1.1B-Chat-v1.0Nemotron-class (configurable)
AdapterLoRA r=8, lr=5e-5, β=0.1full DPO loop, dtensor or Megatron backend
Wall-time (60 pairs, 1 epoch)~3.8s on GB10minutes; gated on multi-GPU when scaled
Source datadata/rlhf/dpo_dataset.jsonl (TRL shape)data/preferences/pairs.jsonl (HelpSteer3 chat-message shape: context + completions[{rank, completion}])
ServesUI style score (every rebalance)Production policy checkpoint (when sample volume justifies it)
TriggerPOST /api/preferences/train?dry_run=falseBus: NeMoRLFeedbackAgent emits TrainNemoRLRequested at PAIRS_PER_RETRAIN = 10 new pairs

Both trainers consume the same outcome-weighted source data. The TRL run is cheap enough to fire on demand and serves the in-UI feedback loop. The NeMo-RL run is the production path that produces a Nemotron checkpoint when sample volume warrants it.

REST surface

VerbPathPurpose
POST/api/preferences/extractRebuild preferences.jsonl + T+5 outcome backfill. Idempotent.
GET/api/preferences/statsPersonal-style fingerprint (sector tilts, top rejected names, turnover bias) derived from all collected pairs.
GET/api/preferences/dataset?limit=20Tail the latest DPO-shaped (prompt, chosen, rejected) pairs.
GET/api/preferences/training_statusn_preferences, n_dpo_pairs, n_holdout, ready_to_train, blocking[], last_run.eval.{approval_auc, rejected_winners}, drift.{drift, mean_auc_last3, history}.
POST/api/preferences/train?dry_run=falseKick a TRL LoRA-DPO fine-tune. dry_run=true reports readiness only.
POST/api/preferences/style_score{prompt, candidate}{available, style_match_score, margin_logp, comment, adapter_path}. Called by the rebalance modal.

Live verification

End-to-end on Grace-Blackwell GB10, 2026-05-19:

With real operator data + multi-epoch training the AUC rises and the rejected-winners list becomes the operator-facing learning signal. With random data the system stays at 0.5 and says so — no fabricated confidence.

NVTrader v0.1.18 · docs ·⚠ Not financial advice ·Docs home ·App