Models · NemoRL · DPO

This page is where the learning loops are visible: the NemoRL PPO training curve, the preference-learning fingerprint, and the continuous-learning scheduler with its three jobs.

Overview

The Models page is split into three sections:

NemoRL live training curve — episode rewards over time for the current PPO run + the last N runs.
Preference fingerprint — sector tilts, rejected-symbol chips, approve / override / reject counts, realized-alpha capture.
Continuous-learning scheduler — three APScheduler jobs with next-run / last-run / run-now controls.

NemoRL

Click Train new run →, then pick a universe, total_timesteps (default 3000), and a learning-rate schedule. The PPO trainer runs on cuda, and the curve updates every 100 timesteps. The policy zip is saved to data/rl/policy_<timestamp>.zip.

See the NemoRL engine docs for the env shape and reward function.

Preference fingerprint

Reflects everything the PreferenceLearningAgent has seen on the bus. It updates on every OrderFilled, RebalanceApproved, and RebalanceRejected event.

Panel	Shows
Approve / Override / Reject counts	Total decisions in the audit ledger and the ratio between the three outcomes.
Sector tilt bars	Your historical over/under-weights by sector vs the proposals.
Rejected-symbol chips	Top 10 names you've rejected. Click to see why.
Turnover preference	Where your override rate spikes vs proposal turnover.
Realized-alpha capture	T+5 outcome of approved vs rejected proposals.
DPO dataset status	`{n_pairs, ready_to_train, blocking}`. Once `n_pairs >= 50`, the nightly `dpo_train_check` job can fire the LoRA train, provided `trl` is installed and a GPU is available.
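
The readiness rule behind the DPO dataset status panel can be sketched as a small pure function. This is a minimal illustration, not the platform's implementation: the function name `dpo_dataset_status`, its parameters, and the shape of `blocking` (a list of human-readable reasons) are assumptions. Only the three keys, the 50-pair threshold, and the `trl` and GPU requirements of `dpo_train_check` come from this page.

```python
MIN_PAIRS = 50  # threshold stated on this page


def dpo_dataset_status(n_pairs, trl_installed=True, gpu_available=True):
    """Return a status dict shaped like {n_pairs, ready_to_train, blocking}.

    Assumption: `blocking` is a list of reasons that prevent training.
    The real payload may encode it differently.
    """
    blocking = []
    if n_pairs < MIN_PAIRS:
        blocking.append(f"need {MIN_PAIRS - n_pairs} more pairs")
    if not trl_installed:
        blocking.append("trl not installed")
    if not gpu_available:
        blocking.append("no GPU available")
    return {
        "n_pairs": n_pairs,
        "ready_to_train": not blocking,  # ready only when nothing blocks
        "blocking": blocking,
    }
```

For example, `dpo_dataset_status(12)` reports `ready_to_train` as `False` with a single blocking reason, while `dpo_dataset_status(50)` reports it as `True` with an empty `blocking` list.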

Continuous-learning scheduler

Three APScheduler jobs:

Job	Cron	What it does
`preference_extract`	daily 23:00 ET	Read `rebalance_decisions.jsonl`; backfill T+5 outcomes; emit DPO pairs.
`dpo_train_check`	daily 23:15 ET	If `n_pairs >= 50`, `trl` is installed, and a GPU is available, LoRA fine-tune the PM-narration model.
`nemorl_retrain`	Sun 02:00 ET	PPO retrain on the current sleeve universe. Persists a policy.
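
As a rough sketch, the three schedules above could be registered with APScheduler's `CronTrigger` as follows. The `JOBS` table, the `register_jobs` helper, and the `handlers` mapping are hypothetical names introduced for illustration, and treating "ET" as the `America/New_York` zone is an assumption. The actual registration code may differ.

```python
# Assumption: "ET" on this page means the America/New_York zone.
TIMEZONE = "America/New_York"

# Schedules copied from the table above, using CronTrigger field names.
JOBS = [
    {"id": "preference_extract", "day_of_week": "*", "hour": 23, "minute": 0},
    {"id": "dpo_train_check", "day_of_week": "*", "hour": 23, "minute": 15},
    {"id": "nemorl_retrain", "day_of_week": "sun", "hour": 2, "minute": 0},
]


def register_jobs(scheduler, handlers):
    """Attach every entry in JOBS to an APScheduler scheduler.

    `handlers` maps job id -> callable. It is a hypothetical stand-in
    for the platform's real job functions.
    """
    # Imported lazily so the schedule table is usable without APScheduler.
    from apscheduler.triggers.cron import CronTrigger

    for job in JOBS:
        trigger = CronTrigger(
            day_of_week=job["day_of_week"],
            hour=job["hour"],
            minute=job["minute"],
            timezone=TIMEZONE,
        )
        scheduler.add_job(
            handlers[job["id"]],
            trigger=trigger,
            id=job["id"],
            replace_existing=True,  # re-registering on restart is harmless
        )
```

With APScheduler installed, this would be called as `register_jobs(BackgroundScheduler(), handlers)` before `scheduler.start()`.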

Each job has a Run now control, so you can fire it manually for demos or sanity checks. History persists to data/scheduler/jobs.jsonl: every fire (success, error, or skip) gets a row.
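
The one-row-per-fire history could be written along these lines. This is a sketch under stated assumptions: the helper `append_history` and the row fields (`job_id`, `status`, `detail`, `fired_at`) are hypothetical. This page only specifies the file path and the three outcomes.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

VALID_STATUSES = {"success", "error", "skip"}  # outcomes named on this page


def append_history(path, job_id, status, detail=""):
    """Append one JSON row per job fire to a .jsonl history file.

    The row schema here is an assumption, not the platform's actual schema.
    """
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    row = {
        "job_id": job_id,
        "status": status,
        "detail": detail,
        "fired_at": datetime.now(timezone.utc).isoformat(),
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier rows; each fire adds exactly one line.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
    return row
```

A skipped run would be recorded with something like `append_history("data/scheduler/jobs.jsonl", "dpo_train_check", "skip", "n_pairs below 50")`.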

What the curve should look like

Climbing then plateau — normal. The policy is exploiting what it found.
Climbing then collapse — usually a reward-shaping issue. Check turnover_cost; if it's too low the policy churns and realized vol blows up.
Flat at zero — env is rejecting actions. Check the NVTraderEnv action bounds.
Mean reward > current portfolio — the policy claims to beat your live portfolio. Inspect with a paper-side comparison before promoting.
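
The first three shapes above can be approximated with a simple heuristic over the episode-reward series. This is only an illustrative sketch: `classify_curve` and its thresholds (`flat_eps`, `collapse_frac`) are assumptions, not something the platform ships, and a run that is still climbing is lumped in with the plateau case.

```python
from statistics import mean


def classify_curve(rewards, flat_eps=1e-6, collapse_frac=0.5):
    """Label an episode-reward series with a coarse shape name.

    Heuristic only: thresholds are illustrative assumptions.
    """
    n = len(rewards)
    if n < 4:
        return "too_short"
    if all(abs(r) <= flat_eps for r in rewards):
        return "flat_at_zero"
    start = rewards[0]
    peak = max(rewards)
    peak_idx = rewards.index(peak)
    tail_len = max(1, n // 4)  # judge the ending on the last quarter
    tail_mean = mean(rewards[-tail_len:])
    if peak <= start:
        return "no_climb"
    # Collapse: the tail gave back more than collapse_frac of the climb,
    # and the peak happened before the tail window.
    gave_back = tail_mean < start + collapse_frac * (peak - start)
    if gave_back and peak_idx < n - tail_len:
        return "climb_then_collapse"
    return "climb_then_plateau"
```

A label of `climb_then_collapse` is a prompt to check turnover_cost, and `flat_at_zero` a prompt to check the NVTraderEnv action bounds, as described above.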

REST surface

Verb	Path	Purpose
GET	`/api/rl/status`	Active run, last completed run, policy registry.
POST	`/api/rl/train`	Body: `{symbols, total_timesteps, lr}`. Returns `{run_id}`.
GET	`/api/preferences/status`	DPO dataset state.
POST	`/api/preferences/extract`	Fire the extraction job ad-hoc.
GET	`/api/scheduler/jobs`	All three jobs + next/last fire.
POST	`/api/scheduler/run_now/{job_id}`	Fire a job manually.
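
A minimal client for the endpoints above might look like the following, using only the standard library. The `BASE_URL`, the helper names, and the assumption that every response body is JSON are illustrative. The type of `lr` is not documented on this page, so the sketch passes it through unchanged.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to your deployment


def build_request(verb, path, body=None):
    """Build (but do not send) a request for one of the endpoints above."""
    data = None
    headers = {"Accept": "application/json"}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=verb
    )


def call(verb, path, body=None, timeout=10):
    """Send the request and decode the response, assuming it is JSON."""
    req = build_request(verb, path, body)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def train_request(symbols, lr, total_timesteps=3000):
    """POST /api/rl/train with the documented body keys."""
    body = {"symbols": list(symbols), "total_timesteps": total_timesteps, "lr": lr}
    return build_request("POST", "/api/rl/train", body)


def run_now_request(job_id):
    """POST /api/scheduler/run_now/{job_id} to fire a job manually."""
    return build_request("POST", f"/api/scheduler/run_now/{job_id}")
```

Against a running instance, `call("GET", "/api/scheduler/jobs")` would fetch the job list, and `urllib.request.urlopen(run_now_request("nemorl_retrain"))` would fire the weekly retrain by hand.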