
Models · NemoRL · DPO

Where the learning loops are visible: the NemoRL PPO training curve, the preference-learning fingerprint, and the continuous-learning scheduler with its three jobs.

Overview

The Models page is split into three sections:

  1. NemoRL live training curve — episode rewards over time for the current PPO run + the last N runs.
  2. Preference fingerprint — sector tilts, rejected-symbol chips, approve / override / reject counts, realized-alpha capture.
  3. Continuous-learning scheduler — three APScheduler jobs with next-run / last-run / run-now controls.

NemoRL

Click Train new run →, then pick the universe, total_timesteps (default 3000), and the learning-rate schedule. The PPO trainer runs on CUDA, and the curve updates every 100 timesteps. The policy zip is saved to data/rl/policy_<timestamp>.zip.

See NemoRL engine docs for env shape and reward function.

Preference fingerprint

Reflects everything the PreferenceLearningAgent has seen on the bus. It updates on every OrderFilled, RebalanceApproved, and RebalanceRejected event.

| panel | shows |
| --- | --- |
| Approve / Override / Reject counts | Total decisions in the audit ledger and ratio. |
| Sector tilt bars | Your historical over/under-weights by sector vs the proposals. |
| Rejected-symbol chips | Top 10 names you've rejected. Click to see why. |
| Turnover preference | Where your override rate spikes vs proposal turnover. |
| Realized-alpha capture | T+5 outcome of approved vs rejected proposals. |
| DPO dataset status | {n_pairs, ready_to_train, blocking}. Once n_pairs >= 50, the nightly LoRA train fires. |
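The readiness gate behind the DPO dataset status can be pictured roughly as follows. This is a minimal sketch, not the application's code: the function names `dpo_status` and `probe_environment` are hypothetical, `blocking` is assumed to be a list of human-readable reasons, and the `nvidia-smi` lookup is only a crude stand-in for a real GPU check. The threshold of 50 pairs and the trl / GPU conditions come from the scheduler's dpo_train_check job.

```python
import importlib.util
import shutil

MIN_PAIRS = 50  # threshold stated in these docs


def dpo_status(n_pairs: int, trl_installed: bool, gpu_available: bool) -> dict:
    """Return a {n_pairs, ready_to_train, blocking} dict.

    Assumption: `blocking` lists the reasons training cannot start yet.
    """
    blocking = []
    if n_pairs < MIN_PAIRS:
        blocking.append(f"need {MIN_PAIRS - n_pairs} more pairs")
    if not trl_installed:
        blocking.append("trl not installed")
    if not gpu_available:
        blocking.append("no GPU available")
    return {
        "n_pairs": n_pairs,
        "ready_to_train": not blocking,
        "blocking": blocking,
    }


def probe_environment() -> tuple:
    """Best-effort environment probe (hypothetical helper)."""
    # find_spec returns None when the package cannot be imported.
    trl_installed = importlib.util.find_spec("trl") is not None
    # Crude heuristic: treat a visible nvidia-smi binary as "GPU available".
    gpu_available = shutil.which("nvidia-smi") is not None
    return trl_installed, gpu_available
```

For example, `dpo_status(42, *probe_environment())` would report at least one blocking reason, because 42 pairs is below the threshold.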

Continuous-learning scheduler

Three APScheduler jobs:

| job | cron | what it does |
| --- | --- | --- |
| preference_extract | daily 23:00 ET | Read rebalance_decisions.jsonl; backfill T+5 outcomes; emit DPO pairs. |
| dpo_train_check | daily 23:15 ET | If n_pairs >= 50 && trl installed && GPU available → LoRA fine-tune the PM-narration model. |
| nemorl_retrain | Sun 02:00 ET | PPO retrain on the current sleeve universe. Persists a policy. |
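To make the schedule concrete, here is a small standard-library sketch of how a "next run" time follows from the three cron entries. It is illustrative only: the real scheduling is done by APScheduler, the `SCHEDULE` mapping and `next_fire` function are hypothetical, and `now` is assumed to be a naive wall-clock datetime already expressed in ET (daylight-saving edge cases are ignored).

```python
from datetime import datetime, timedelta
from typing import Optional

# (hour, minute, weekday) per job. weekday follows datetime.weekday():
# Monday=0 ... Sunday=6; None means the job fires every day.
SCHEDULE = {
    "preference_extract": (23, 0, None),
    "dpo_train_check": (23, 15, None),
    "nemorl_retrain": (2, 0, 6),
}


def next_fire(now: datetime, hour: int, minute: int,
              weekday: Optional[int] = None) -> datetime:
    """Return the first fire time strictly after `now` (ET wall-clock)."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed; start from tomorrow.
        candidate += timedelta(days=1)
    if weekday is not None:
        # Roll forward to the requested day of the week.
        candidate += timedelta(days=(weekday - candidate.weekday()) % 7)
    return candidate


if __name__ == "__main__":
    now = datetime(2024, 1, 3, 12, 0)  # a Wednesday, noon ET
    for job, spec in SCHEDULE.items():
        print(job, next_fire(now, *spec))
```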

Each job has a Run now control so you can fire it manually for demos or sanity checks. History persists to data/scheduler/jobs.jsonl; every fire (success, error, skip) gets a row.
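An append-only JSONL history of this kind could be written along the following lines. This is a sketch under assumptions: the row fields (`job_id`, `status`, `detail`, `fired_at`) and the helper names are hypothetical, since the actual row schema is not documented here. Only the three statuses and the one-row-per-fire behavior are taken from the text above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

VALID_STATUSES = {"success", "error", "skip"}


def record_fire(history_path: Path, job_id: str, status: str,
                detail: str = "") -> dict:
    """Append one row per job fire; the field names are assumptions."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    row = {
        "job_id": job_id,
        "status": status,
        "detail": detail,
        "fired_at": datetime.now(timezone.utc).isoformat(),
    }
    history_path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier rows intact: one JSON object per line.
    with history_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
    return row


def read_history(history_path: Path) -> list:
    """Load every row back, skipping blank lines."""
    with history_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because skips are recorded as well as successes and errors, a job that declined to run (for example, dpo_train_check with too few pairs) still leaves a row.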

What the curve should look like

REST surface

| Verb | Path | Purpose |
| --- | --- | --- |
| GET | /api/rl/status | Active run, last completed run, policy registry. |
| POST | /api/rl/train | Body: {symbols, total_timesteps, lr}. Returns {run_id}. |
| GET | /api/preferences/status | DPO dataset state. |
| POST | /api/preferences/extract | Fire the extraction job ad-hoc. |
| GET | /api/scheduler/jobs | All three jobs + next/last fire. |
| POST | /api/scheduler/run_now/{job_id} | Fire a job manually. |
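The endpoints above could be called from a script roughly like this, using only the standard library. Several things here are assumptions rather than documented facts: the base URL `http://localhost:8000`, the helper names, `lr` being a plain number, and job ids matching the job names in the scheduler table. The paths and the {symbols, total_timesteps, lr} body come from the table.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # hypothetical; use your deployment's address


def build_request(method: str, path: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build a request, JSON-encoding the body when one is given."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def train_request(symbols: list, lr: float,
                  total_timesteps: int = 3000) -> urllib.request.Request:
    """POST /api/rl/train; 3000 is the documented default for total_timesteps."""
    body = {"symbols": symbols, "total_timesteps": total_timesteps, "lr": lr}
    return build_request("POST", "/api/rl/train", body)


def run_now_request(job_id: str) -> urllib.request.Request:
    """POST /api/scheduler/run_now/{job_id}; assumes job_id is the job name."""
    return build_request("POST", f"/api/scheduler/run_now/{job_id}")


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode a JSON response (needs a running server)."""
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, `send(train_request(["AAPL", "MSFT"], lr=3e-4))` would be expected to return a dict containing `run_id`; the symbols and learning rate in that call are placeholders.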
NVTrader v0.1.18 · docs · ⚠ Not financial advice