[ platform · self-improvement loop ]

Continuous RL

The platform's self-improvement surface. Seven panels that compound over time: a Gymnasium env, an LLM-meta-agent research loop (Karpathy pattern), manual PPO training, a cross-run Sharpe ratio curve, the saved-policy library, the DPO preference flywheel, and the nightly scheduler. Live page: /continuous-rl.html.

What it is · how it works · why it matters

[ what ]

The single page that owns NVTrader's continuous-learning loop. Manual PPO training, NemoRL AutoResearch (LLM-meta-agent search), a Sharpe ratio curve across all runs, the policy library, the DPO preference flywheel, and the cron scheduler — all in one place.

[ how ]

The page reads from data/rl/, data/autoresearch/, and data/audit/. PPO runs via stable-baselines3 on NVTraderEnv; the meta-agent runs on Nemotron 3 Super; chart interpretation uses Nemotron Nano Omni. The A2A bus emits every iteration as an OTel span via NeMo Agent Toolkit.

[ why ]

The discovered policies feed straight back into the PM via PolicyRetrained — the agent designs a strategy and the system executes it. End-to-end closed loop on one DGX Spark.

Section order on the page

The page is workflow-aligned: Setup → Train → Review → Deploy → Feedback → Schedule.

| # | Section | Phase | Use it for |
|---|---------|-------|------------|
| 1 | Gymnasium env | Setup | Verify the env config the rest of the page trains against (nvtrader/PortfolioCVaR-v0). |
| 2 | NemoRL AutoResearch | Train · headline | The Karpathy-pattern meta-agent search. Most powerful surface: pick a scope (PPO / cuFOLIO / unified), set a goal, hit Start. |
| 3 | NemoRL live training curve | Train · manual | Single one-off PPO run. Simpler than AutoResearch when you just want a baseline. |
| 4 | Sharpe ratio curve | Review | Cross-run timeline: every completed training run + AutoResearch trial, ranked by Sharpe with a max-DD overlay. |
| 5 | Policy library | Deploy | Every saved .zip from manual training + AutoResearch winners. Click load → to make it the PM's active policy via pointer file. |
| 6 | Preference learning | Feedback | Approve / Override / Reject → DPO pairs with T+5 outcome labels. Separate from PPO; complements it. |
| 7 | Scheduler | Admin | Three cron jobs: nightly preference extract, DPO check, weekly NemoRL retrain. Run-now buttons for each. |
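
The two metrics on the Sharpe ratio curve can be illustrated with a minimal sketch. It assumes daily returns, a zero risk-free rate, and 252 trading periods per year; the page's actual annualization and drawdown conventions are not stated here, and the run ids and return series below are made up for illustration.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate assumed 0)."""
    sd = stdev(returns)  # sample standard deviation; needs at least 2 points
    if sd == 0:
        return 0.0
    return mean(returns) / sd * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


# Hypothetical runs: run id -> daily returns (illustrative numbers only).
runs = {
    "manual-001": [0.01, -0.02, 0.015, 0.005, -0.01],
    "autoresearch-007": [0.004, 0.006, -0.002, 0.005, 0.003],
}

# Rank runs by Sharpe, highest first, and attach max drawdown as the overlay value.
ranked = sorted(runs, key=lambda run_id: sharpe_ratio(runs[run_id]), reverse=True)
overlay = {run_id: max_drawdown(runs[run_id]) for run_id in ranked}
```

Ranking by Sharpe alone can favor a run that suffered a deep drawdown along the way, which is presumably why the panel overlays max-DD on the same timeline.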

NemoRL AutoResearch · search scope

The headline panel exposes three eval modes the meta-agent searches over, picked at session start:

| Mode | Per-trial work | Cost | Best for |
|------|----------------|------|----------|
| ppo_only (default) | PPO train + walk-forward Sharpe eval | ~8s | Search reward shapes + PPO hparams. Fastest. |
| cufolio_only | One cuFOLIO Mean-CVaR solve on held-out window | ~2s | Cheap portfolio-knob sweep without retraining a policy. |
| unified | PPO train + cuFOLIO held-out | ~12s | Discovers joint optima — "tighter vol_penalty AND raise cvar_alpha together". |
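
For a rough sense of what a session costs, the per-trial figures above can be multiplied out. This is a back-of-the-envelope sketch only: it assumes trials run sequentially, that the session budget counts trials, and it ignores the meta-agent's own proposal latency, none of which is confirmed here. The function name is hypothetical.

```python
# Approximate per-trial cost in seconds, copied from the mode table above.
TRIAL_COST_S = {"ppo_only": 8, "cufolio_only": 2, "unified": 12}


def estimate_session_seconds(eval_mode: str, n_trials: int) -> int:
    """Lower-bound wall-clock estimate: per-trial cost times number of trials."""
    if eval_mode not in TRIAL_COST_S:
        raise ValueError(f"unknown eval_mode: {eval_mode!r}")
    if n_trials < 0:
        raise ValueError("n_trials must be non-negative")
    return TRIAL_COST_S[eval_mode] * n_trials


# Example: 50 unified trials come to roughly ten minutes of trial work.
unified_50 = estimate_session_seconds("unified", 50)
```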

The 16 knobs

The meta-agent's ConfigDelta schema covers three groups of knobs. An out-of-bounds proposal is clipped to the nearest bound and logged as a BoundsClip entry in the journal.

| Group | Knob | Bounds |
|-------|------|--------|
| Env (ppo_only / unified) | lookback | 5–60 |
| | episode_len | 20–120 |
| | rebal_freq | 1–10 |
| | turnover_cost_bps | 0–50 |
| | vol_penalty | 0–5 |
| PPO (ppo_only / unified) | learning_rate | 1e-5–1e-2 |
| | n_steps | 64–4096 |
| | batch_size | 32–512 |
| | gae_lambda | 0.85–0.99 |
| | gamma | 0.90–0.999 |
| | ent_coef | 0–0.10 |
| | n_epochs | 3–30 |
| cuFOLIO (cufolio_only / unified) | cvar_alpha | 0.85–0.99 |
| | max_position_pct | 0.02–0.30 |
| | n_scenarios | 1,000–10,000 |
| | cufolio_hold_days | 5–60 |
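
The clip-and-log behaviour can be sketched as follows. The bounds are taken from the table above; everything else is an assumption for illustration: the function name `clip_delta`, the shape of the journal entry, and the choice to reject knobs outside the active eval mode are not taken from the actual ConfigDelta implementation.

```python
# Bounds per knob group, copied from the table above.
BOUNDS = {
    "env": {"lookback": (5, 60), "episode_len": (20, 120), "rebal_freq": (1, 10),
            "turnover_cost_bps": (0, 50), "vol_penalty": (0, 5)},
    "ppo": {"learning_rate": (1e-5, 1e-2), "n_steps": (64, 4096),
            "batch_size": (32, 512), "gae_lambda": (0.85, 0.99),
            "gamma": (0.90, 0.999), "ent_coef": (0, 0.10), "n_epochs": (3, 30)},
    "cufolio": {"cvar_alpha": (0.85, 0.99), "max_position_pct": (0.02, 0.30),
                "n_scenarios": (1_000, 10_000), "cufolio_hold_days": (5, 60)},
}

# Which knob groups each eval mode searches over, per the table's group labels.
MODE_GROUPS = {
    "ppo_only": ("env", "ppo"),
    "cufolio_only": ("cufolio",),
    "unified": ("env", "ppo", "cufolio"),
}


def clip_delta(delta, eval_mode):
    """Clip proposed knob values to their bounds.

    Returns (clipped_delta, journal_entries). The journal entry fields are
    hypothetical; only the BoundsClip name comes from the documentation.
    """
    allowed = {}
    for group in MODE_GROUPS[eval_mode]:
        allowed.update(BOUNDS[group])

    clipped, journal = {}, []
    for knob, proposed in delta.items():
        if knob not in allowed:
            # Assumption: a knob outside the active mode is treated as an error.
            raise ValueError(f"{knob!r} is not searchable in mode {eval_mode!r}")
        lo, hi = allowed[knob]
        applied = min(max(proposed, lo), hi)
        if applied != proposed:
            journal.append({"type": "BoundsClip", "knob": knob,
                            "proposed": proposed, "applied": applied})
        clipped[knob] = applied
    return clipped, journal


clipped, journal = clip_delta({"learning_rate": 0.5, "lookback": 30}, "ppo_only")
```

Here learning_rate is pulled down to the 1e-2 ceiling and produces one journal entry, while lookback passes through untouched. The sketch does not round integer-valued knobs such as n_steps, which a real schema would likely enforce.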

Chart interpretation with Omni VLM

All three charts (live training curve, AutoResearch Sharpe-over-iterations, ratio curve) have an Analyze with Omni VLM → button styled identically to the trading-view chart button on the Research page. On click, Plotly captures the chart as a PNG and the page POSTs {kind, image_b64, context} to /api/continuous-rl/interpret-chart; Nemotron Nano Omni returns a 4-6 bullet read of the chart. The result renders inline below the chart, signed with the model id and elapsed seconds, and takes ~2-4s per click. Interpretation is user-triggered rather than automatic, so it costs nothing when ignored.
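
The same request can be assembled outside the browser. The sketch below only builds the JSON body with the three documented fields. The `kind` value, the structure of `context`, and the assumption that `image_b64` is plain base64 (rather than a data URL) are all illustrative guesses, and the helper name is hypothetical.

```python
import base64
import json


def build_interpret_body(kind: str, png_bytes: bytes, context: dict) -> str:
    """Serialize a chart-interpretation request body as a JSON string."""
    return json.dumps({
        "kind": kind,                                              # which chart is being read
        "image_b64": base64.b64encode(png_bytes).decode("ascii"),  # assumed plain base64
        "context": context,                                        # assumed free-form metadata
    })


# Placeholder bytes standing in for a real PNG capture of the chart.
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
body = build_interpret_body("ratio_curve", fake_png, {"note": "illustrative only"})
```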

Closed-loop · discover → execute

When AutoResearch produces a new best Sharpe, the policy zip lands in data/autoresearch/policies/. The NemoRLFeedbackAgent on the A2A bus picks it up via PolicyRetrained and hands it to the PortfolioManagerAgent. The next Scheduler.tick.eod runs the discovered policy through cuFOLIO → ComplianceAgent → ExecutionAgent → Alpaca. The agent designed a strategy; the system traded it. Same cascade as every other rebalance.

REST surface

| Verb | Path | Purpose |
|------|------|---------|
| GET | /api/nemorl/status | Active training run + 200-point reward curve. |
| GET | /api/nemorl/curve | Just the reward-curve points (used by the live chart). |
| POST | /api/nemorl/train | Kick a manual PPO run. Body: {symbols, total_timesteps}. |
| GET | /api/nemorl/ratio-curve | Cross-run Sharpe + max-DD timeline from AutoResearch trials and manual runs. |
| GET | /api/nemorl/policies | List every saved policy zip with metadata. |
| POST | /api/nemorl/policies/activate | Write pointer file → PM uses this policy on next tick. |
| POST | /api/nemorl-autoresearch/start | Kick a NemoRL AutoResearch session. Body: {goal, symbols, budget, total_timesteps, eval_mode}. |
| POST | /api/nemorl-autoresearch/stop | Graceful stop after current trial. |
| GET | /api/nemorl-autoresearch/status | Current session snapshot. |
| GET | /api/nemorl-autoresearch/journal/{sid} | Full journal for one session. |
| GET | /api/nemorl-autoresearch/sessions | All sessions on disk. |
| POST | /api/continuous-rl/interpret-chart | Omni VLM chart read. Body: {kind, image_b64, context}. |
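
As an illustration of the REST surface, the sketch below prepares (but does not send) a request to start an AutoResearch session using only the Python standard library. The base URL, the field types (a string goal, a list of ticker symbols, integer budget and total_timesteps), and the example values are assumptions; only the path and the field names come from the table above.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: wherever the NVTrader app is served


def autoresearch_start_request(goal, symbols, budget, total_timesteps,
                               eval_mode="ppo_only"):
    """Build a POST request for /api/nemorl-autoresearch/start without sending it."""
    body = {
        "goal": goal,
        "symbols": symbols,
        "budget": budget,
        "total_timesteps": total_timesteps,
        "eval_mode": eval_mode,  # ppo_only | cufolio_only | unified
    }
    return Request(
        BASE_URL + "/api/nemorl-autoresearch/start",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = autoresearch_start_request(
    goal="maximize walk-forward Sharpe",  # illustrative goal text
    symbols=["AAPL", "MSFT"],             # illustrative tickers
    budget=20,
    total_timesteps=10_000,
    eval_mode="unified",
)
# To actually send it: urllib.request.urlopen(req) against a running instance.
```

Building the request separately from sending it keeps the payload easy to inspect before a session is kicked off.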
NVTrader v0.1.18 · docs · ⚠ Not financial advice