[ engine · preference flywheel ]

Preference flywheel — outcome-weighted DPO with label inversion

Every Approve / Override / Reject decision becomes a labeled DPO pair. preferences/extract.py backfills the realized T+5 market outcome onto each pair. Pairs whose operator decision contradicts the market verdict have their chosen/rejected labels inverted before training. The resulting LoRA scores future rebalance proposals in the modal before the operator decides.

What it is · how it works · why it matters

[ what ]

A closed-loop training pipeline that combines two signals — operator preference and realized T+5 market return — on every preference pair, and weights training by their agreement.

[ how ]

Audit log → preference extractor → T+5 yfinance backfill → outcome-weighted DPO (boost on agreement, invert label on disagreement) → LoRA adapter → live scoring in the rebalance modal. 20% holdout never trained on; post-train AUC + drift detector + rejected-winners list.

[ why ]

Pure RLHF only imitates the labeler. Pure RLVR ignores the labeler's stated reasons. We have both signals on every pair — the cheap dense preference signal + the slower objective verdict — and the trainer corrects one with the other in a single procedure.

[ advantage ]

Calibration correction, not just imitation. AUC = 0.5 honestly reported when there's no signal yet. Rejected-winners surfaces decision-level evidence — "the LoRA found N plans you rejected that beat SPY by X%."

Pipeline — 4 steps

1. Collect. PreferenceLearningAgent (bus/agents/feedback.py) subscribes to OrderFilled / RebalanceApproved / RebalanceRejected. Each PM action lands in data/audit/rebalance_decisions.jsonl with the decision, reason text, and order list.

2. Extract + backfill. preferences/extract.py::extract_all() rebuilds data/rlhf/preferences.jsonl. For each row with a basket of symbols and a timestamp ≥ T+5 trading days ago, it calls yfinance to compute the equal-weighted basket return vs SPY over those 5 days. Result lands as outcome_t5 = {return_pct, vs_spy_pct, computed_at, window_days: 5}.

3. Outcome-weight + train. preferences/trainer.py::build_dpo_dataset() converts each row into a DPO-shaped tuple (prompt, chosen, rejected), computes an outcome_weight, replicates or inverts the pair, holds back 20%, and writes the rest to data/rlhf/dpo_dataset.jsonl (TRL shape) and data/preferences/pairs.jsonl (NeMo-RL HelpSteer3 chat-message shape).

The outcome-weighting table

operator decision	T+5 verdict (vs SPY)	weight	treatment
approve	+ positive	+1.0	replicate × 3 (boost: decision was right)
approve	− negative	+0.3	retain (preserve but discount)
reject / override	− negative	+1.0	replicate × 3 (boost: rejection was right)
reject / override	+ positive	−0.5	SWAP chosen ↔ rejected, replicate × 1.5
any	(no verdict yet)	+0.5	retain at moderate weight

The negative-weight cell is the load-bearing one. When the operator rejected a plan that the market subsequently validated, the chosen/rejected labels are swapped before training — the LoRA learns to push back on the operator in that specific class of decision, not imitate them.

4. Serve + measure. After a successful TRL DPO run, kick_off() updates data/rlhf/adapters/latest → <new adapter dir> atomically. preferences/style_adapter.py hot-reloads the LoRA on top of TinyLlama-1.1B when the symlink mtime changes, and exposes score_proposal(prompt, candidate) → {style_match_score, margin_logp, comment} via POST /api/preferences/style_score. The rebalance modal calls this when each plan opens.

Measurement scaffolding

The 20% holdout (data/rlhf/dpo_holdout.jsonl) is never trained on. After each retrain, preferences/trainer.py::_eval_auc() scores every holdout pair under the just-loaded LoRA and computes:

approval-prediction AUC — Mann-Whitney U over (score, decision == 'approve'). AUC > 0.65 = useful signal · 0.55–0.65 = marginal · < 0.55 = no useful signal yet. Plain stats, no sklearn dep.
rejected-winners — pairs where the operator rejected AND the LoRA scored ≥ 0.5 AND outcome_t5.vs_spy_pct > 0. Top 5 returned with score + vs-SPY %. These are the plans the operator passed on that the market validated — the most actionable artifact the system produces.
drift — rolling mean of the last 3 AUCs from data/rlhf/dpo_metrics.jsonl. Below DRIFT_AUC_THRESHOLD = 0.55 → drift = high badge in the UI. Signals the operator's style has shifted relative to the market or their own past behavior.

UI surface — Style-match panel in the rebalance modal

When the rebalance modal opens, it calls POST /api/preferences/style_score with the proposal text. The Style-match panel renders:

A percentage with color (> 70% green, 50–70% text, 30–50% warn, < 30% danger) — the sigmoid of margin_logp = mean_token_logp(approve | proposal; π_LoRA) − π_BASE.
A one-line comment matching the bracket: "Consistent with your past approvals" → "Inconsistent with your past pattern; review carefully."
The learning-signal line: learning signal: NN% AUC on M held-out plans · stable | drift.
A warn-row when n_rejected_winners > 0: "⚠ LoRA found N plans you rejected that beat SPY in holdout." Click → /api/preferences/training_status with details.

Two trainer routes

	TRL DPO (live)	NeMo-RL DPO (production)
Base model	TinyLlama-1.1B-Chat-v1.0	Nemotron-class (configurable)
Adapter	LoRA r=8, lr=5e-5, β=0.1	full DPO loop, dtensor or Megatron backend
Wall-time (60 pairs, 1 epoch)	~3.8s on GB10	minutes; gated on multi-GPU when scaled
Source data	`data/rlhf/dpo_dataset.jsonl` (TRL shape)	`data/preferences/pairs.jsonl` (HelpSteer3 chat-message shape: `context` + `completions[{rank, completion}]`)
Serves	UI style score (every rebalance)	Production policy checkpoint (when sample volume justifies it)
Trigger	`POST /api/preferences/train?dry_run=false`	Bus: `NeMoRLFeedbackAgent` emits `TrainNemoRLRequested` at `PAIRS_PER_RETRAIN = 10` new pairs

Both trainers consume the same outcome-weighted source data. The TRL run is cheap enough to fire on demand and serves the in-UI feedback loop. The NeMo-RL run is the production path that produces a Nemotron checkpoint when sample volume warrants it.

REST surface

Verb	Path	Purpose
POST	`/api/preferences/extract`	Rebuild preferences.jsonl + T+5 outcome backfill. Idempotent.
GET	`/api/preferences/stats`	Personal-style fingerprint (sector tilts, top rejected names, turnover bias) derived from all collected pairs.
GET	`/api/preferences/dataset?limit=20`	Tail the latest DPO-shaped (prompt, chosen, rejected) pairs.
GET	`/api/preferences/training_status`	n_preferences, n_dpo_pairs, n_holdout, ready_to_train, blocking[], last_run.eval.{approval_auc, rejected_winners}, drift.{drift, mean_auc_last3, history}.
POST	`/api/preferences/train?dry_run=false`	Kick a TRL LoRA-DPO fine-tune. `dry_run=true` reports readiness only.
POST	`/api/preferences/style_score`	`{prompt, candidate}` → `{available, style_match_score, margin_logp, comment, adapter_path}`. Called by the rebalance modal.

Live verification

End-to-end on Grace-Blackwell GB10, 2026-05-19:

60 synthetic preference pairs (random labels) → outcome-weighted to 108 training pairs (75 boosted + 20 inverted + 13 in holdout)
TRL DPO trained on TinyLlama-1.1B + LoRA in 3.8s
Adapter saved to data/rlhf/adapters/<ts>/; latest symlink updated
Holdout AUC = 0.5 — correctly reported as "No useful signal yet" because the synthetic data has no real pattern
3 rejected-winners surfaced with vs-SPY = +0.62%, +1.57%, +1.77%
NeMo-RL bridge: 8-pair short-prompt smoke → exit_code: 0, validation loss = 0.6931 = ln(2) (textbook DPO baseline), Epoch 1/1 entered

With real operator data + multi-epoch training the AUC rises and the rejected-winners list becomes the operator-facing learning signal. With random data the system stays at 0.5 and says so — no fabricated confidence.