Models · NemoRL · DPO
Where the learning loops are visible. NemoRL PPO training curve, the preference-learning fingerprint, and the continuous-learning scheduler with three jobs.
Overview
The Models page is split into three sections:
- NemoRL live training curve — episode rewards over time for the current PPO run + the last N runs.
- Preference fingerprint — sector tilts, rejected-symbol chips, approve / override / reject counts, realized-alpha capture.
- Continuous-learning scheduler — three APScheduler jobs with next-run / last-run / run-now controls.
NemoRL
Click Train new run →. Pick the universe, total_timesteps (default 3000), and the learning-rate schedule. The PPO trainer runs on cuda, and the curve updates every 100 timesteps. The policy zip persists to data/rl/policy_<timestamp>.zip.
See NemoRL engine docs for env shape and reward function.
Preference fingerprint
Reflects everything the PreferenceLearningAgent has seen on the bus. It updates on every OrderFilled, RebalanceApproved, and RebalanceRejected event.
| panel | shows |
|---|---|
| Approve / Override / Reject counts | Total decisions in the audit ledger and ratio. |
| Sector tilt bars | Your historical over/under-weights by sector vs the proposals. |
| Rejected-symbol chips | Top 10 names you've rejected. Click to see why. |
| Turnover preference | Where your override rate spikes vs proposal turnover. |
| Realized-alpha capture | T+5 outcome of approved vs rejected proposals. |
| DPO dataset status | {n_pairs, ready_to_train, blocking}. Once n_pairs >= 50, the nightly LoRA train fires, provided trl is installed and a GPU is available. |
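
The DPO dataset status can be read as a small gate: collect every unmet precondition, and report ready only when none remain. The sketch below is one possible shape for that check, not the project's actual code. The function name `dpo_status`, the `DpoStatus` dataclass, and the choice to model `blocking` as a list of strings are assumptions; the threshold of 50 pairs and the trl / GPU preconditions come from this page.

```python
from dataclasses import dataclass, field
from typing import List

MIN_PAIRS = 50  # threshold quoted on this page


@dataclass
class DpoStatus:
    n_pairs: int
    ready_to_train: bool
    # Assumption: blocking is a list of human-readable reasons.
    blocking: List[str] = field(default_factory=list)


def dpo_status(n_pairs: int, trl_installed: bool, gpu_available: bool) -> DpoStatus:
    """Collect every unmet precondition; ready only when none remain."""
    blocking: List[str] = []
    if n_pairs < MIN_PAIRS:
        blocking.append(f"need {MIN_PAIRS - n_pairs} more pairs")
    if not trl_installed:
        blocking.append("trl not installed")
    if not gpu_available:
        blocking.append("no GPU available")
    return DpoStatus(n_pairs=n_pairs, ready_to_train=not blocking, blocking=blocking)
```

Returning all blockers at once, rather than stopping at the first, lets the panel show everything that still stands between the dataset and a training run.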
Continuous-learning scheduler
Three APScheduler jobs:
| job | cron | what it does |
|---|---|---|
| preference_extract | daily 23:00 ET | Read rebalance_decisions.jsonl; backfill T+5 outcomes; emit DPO pairs. |
| dpo_train_check | daily 23:15 ET | If n_pairs >= 50 && trl installed && GPU available → LoRA fine-tune the PM-narration model. |
| nemorl_retrain | Sun 02:00 ET | PPO retrain on the current sleeve universe. Persists a policy. |
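
As a rough illustration of how the three schedules in the table could be expressed, the sketch below registers them through the `add_job(func, "cron", ...)` call that APScheduler 3.x schedulers provide. The names `JOB_SPECS` and `register_jobs` and the `handlers` mapping are hypothetical, and reading "ET" as the `America/New_York` zone is an assumption.

```python
from typing import Any, Callable, Dict

ET = "America/New_York"  # assumption: "ET" means this IANA time zone

# Cron fields mirror the table; keys are APScheduler cron-trigger arguments.
JOB_SPECS: Dict[str, Dict[str, Any]] = {
    "preference_extract": {"hour": 23, "minute": 0},
    "dpo_train_check": {"hour": 23, "minute": 15},
    "nemorl_retrain": {"day_of_week": "sun", "hour": 2, "minute": 0},
}


def register_jobs(scheduler: Any, handlers: Dict[str, Callable[[], None]]) -> None:
    """Register one cron job per spec on an APScheduler-style scheduler."""
    for job_id, cron in JOB_SPECS.items():
        scheduler.add_job(
            handlers[job_id],       # hypothetical callable for this job
            "cron",                 # use the cron trigger
            id=job_id,
            timezone=ET,
            replace_existing=True,  # re-registering on restart is harmless
            **cron,
        )
```

With APScheduler 3.x installed, `scheduler` would typically be a `BackgroundScheduler()` from `apscheduler.schedulers.background`, started with `scheduler.start()` after registration.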
Each job has Run now so you can fire it manually for demos or sanity checks. History persists to data/scheduler/jobs.jsonl — every fire (success, error, skip) gets a row.
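One way to get a row for every fire is to wrap each job so that the outcome is appended to the history file whether the job succeeds, fails, or skips. The sketch below assumes that approach. The row fields, the `SkipJob` exception, and the `run_and_record` helper are hypothetical and are not taken from the actual jobs.jsonl schema.

```python
import json
import time
from pathlib import Path
from typing import Callable


class SkipJob(Exception):
    """Hypothetical signal a job raises when its preconditions are not met."""


def run_and_record(job_id: str, fn: Callable[[], None], history: Path) -> str:
    """Run fn once and append a single JSON row describing the outcome."""
    started = time.time()
    status, detail = "success", None
    try:
        fn()
    except SkipJob as exc:
        status, detail = "skip", str(exc)
    except Exception as exc:  # record the failure instead of propagating it
        status, detail = "error", repr(exc)
    row = {
        "job_id": job_id,
        "status": status,
        "detail": detail,
        "started_at": started,
        "duration_s": time.time() - started,
    }
    history.parent.mkdir(parents=True, exist_ok=True)
    with history.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")  # one line per fire
    return status
```

Both the scheduled trigger and a manual Run now could call the same wrapper, so the history would not need to distinguish how a fire started unless a field for that is added.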
What the curve should look like
- Climbing then plateau — normal. The policy is exploiting what it found.
- Climbing then collapse — usually a reward-shaping issue. Check turnover_cost; if it's too low the policy churns and realized vol blows up.
- Flat at zero — env is rejecting actions. Check the NVTraderEnv action bounds.
- Mean reward > current portfolio — the policy claims to beat your live portfolio. Inspect with a paper-side comparison before promoting.
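The first three shapes above can be approximated with a moving-window heuristic over the episode rewards. The sketch below is only a rough illustration: `classify_curve`, its labels, and all of its thresholds (`window`, `flat_eps`, `collapse_frac`) are assumptions chosen for the example, not values used by the page.

```python
from statistics import mean
from typing import Sequence


def classify_curve(rewards: Sequence[float], window: int = 5,
                   flat_eps: float = 1e-6, collapse_frac: float = 0.5) -> str:
    """Label a reward curve using simple moving-window heuristics."""
    if len(rewards) < 2 * window:
        return "too_short"
    if all(abs(r) <= flat_eps for r in rewards):
        return "flat_at_zero"
    # Smooth with a trailing window so single noisy episodes do not dominate.
    smoothed = [mean(rewards[i - window:i]) for i in range(window, len(rewards) + 1)]
    start, peak, last = smoothed[0], max(smoothed), smoothed[-1]
    climbed = peak > start
    # Collapse: the curve gave back more than collapse_frac of its total gain.
    if climbed and (peak - last) > collapse_frac * (peak - start):
        return "climb_then_collapse"
    if climbed:
        return "climb_then_plateau"
    return "no_improvement"
```

A label from a heuristic like this is a prompt to look at the curve, not a verdict. The comparison against the live portfolio still needs the paper-side check described above.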
REST surface
| Verb | Path | Purpose |
|---|---|---|
| GET | /api/rl/status | Active run, last completed run, policy registry. |
| POST | /api/rl/train | Body: {symbols, total_timesteps, lr}. Returns {run_id}. |
| GET | /api/preferences/status | DPO dataset state. |
| POST | /api/preferences/extract | Fire the extraction job ad-hoc. |
| GET | /api/scheduler/jobs | All three jobs + next/last fire. |
| POST | /api/scheduler/run_now/{job_id} | Fire a job manually. |
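
For scripting against these endpoints, a minimal client sketch using only the standard library might look like the following. The base URL, the helper names, and the assumption that responses are JSON objects are not from this page. The paths, verbs, and the `{symbols, total_timesteps, lr}` body are.

```python
import json
from typing import List, Optional
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumption: adjust to your deployment


def build_request(verb: str, path: str, body: Optional[dict] = None) -> Request:
    """Build (but do not send) a JSON request for one of the endpoints."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=verb)


def train_request(symbols: List[str], lr: float, total_timesteps: int = 3000) -> Request:
    """POST /api/rl/train with the documented body fields."""
    body = {"symbols": symbols, "total_timesteps": total_timesteps, "lr": lr}
    return build_request("POST", "/api/rl/train", body)


def run_now_request(job_id: str) -> Request:
    """POST /api/scheduler/run_now/{job_id} to fire a job manually."""
    return build_request("POST", f"/api/scheduler/run_now/{job_id}")


def send(req: Request, timeout: float = 10.0) -> dict:
    """Send a request and decode the response, assuming a JSON object body."""
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `send(run_now_request("nemorl_retrain"))` would fire the weekly retrain immediately, and `send(build_request("GET", "/api/scheduler/jobs"))` would list all three jobs.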