Continuous RL
The platform's self-improvement surface. Seven panels that compound over time: a Gymnasium env, an LLM-meta-agent research loop (Karpathy pattern), manual PPO training, a cross-run Sharpe ratio curve, the saved-policy library, the DPO preference flywheel, and the cron scheduler. Live page: /continuous-rl.html.
What it is · how it works · why it matters
The single page that owns NVTrader's continuous-learning loop. Manual PPO training, NemoRL AutoResearch (LLM-meta-agent search), a Sharpe ratio curve across all runs, the policy library, the DPO preference flywheel, and the cron scheduler — all in one place.
Reads from data/rl/, data/autoresearch/, and data/audit/. PPO runs via stable-baselines3 on NVTraderEnv; the meta-agent is Nemotron 3 Super; chart interpretation uses Nemotron Nano Omni. Every iteration is emitted on the shared A2A bus as an OTel span via NeMo Agent Toolkit.
The discovered policies feed straight back into the PM via PolicyRetrained — the agent designs a strategy and the system executes it. End-to-end closed loop on one DGX Spark.
Section order on the page
The page is workflow-aligned: Setup → Train → Review → Deploy → Feedback → Schedule.
| # | Section | Phase | Use it for |
|---|---|---|---|
| 1 | Gymnasium env | Setup | Verify the env config the rest of the page trains against (nvtrader/PortfolioCVaR-v0). |
| 2 | NemoRL AutoResearch | Train · headline | The Karpathy-pattern meta-agent search. Most powerful surface — pick a scope (PPO / cuFOLIO / unified), set a goal, hit Start. |
| 3 | NemoRL live training curve | Train · manual | Single one-off PPO run. Simpler than AutoResearch when you just want a baseline. |
| 4 | Sharpe ratio curve | Review | Cross-run timeline — every completed training run + AutoResearch trial, ranked by Sharpe with a max-DD overlay. |
| 5 | Policy library | Deploy | Every saved .zip from manual training + AutoResearch winners. Click load → to make it the PM's active policy via pointer file. |
| 6 | Preference learning | Feedback | Approve / Override / Reject → DPO pairs with T+5 outcome labels. Separate from PPO; complements it. |
| 7 | Scheduler | Schedule | Three cron jobs: nightly preference extract, DPO check, weekly NemoRL retrain. Run-now buttons for each. |
NemoRL AutoResearch · search scope
The headline panel exposes three eval modes. One is picked at session start and determines what the meta-agent searches over:
| Mode | Per-trial work | Cost | Best for |
|---|---|---|---|
| ppo_only (default) | PPO train + walk-forward Sharpe eval | ~8s | Search reward shapes + PPO hparams. Fastest mode that trains a policy. |
| cufolio_only | One cuFOLIO Mean-CVaR solve on held-out window | ~2s | Cheap portfolio-knob sweep without retraining a policy. |
| unified | PPO train + cuFOLIO held-out | ~12s | Discovers joint optima, e.g. "tighten vol_penalty AND raise cvar_alpha together". |
The 16 knobs
The meta-agent's ConfigDelta schema covers three groups. Out-of-bounds proposals are clipped to the nearest bound and logged as a BoundsClip entry in the journal.
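| Group | Knob | Bounds |
|---|---|---|
| Env (ppo_only / unified) | lookback | 5–60 |
| Env (ppo_only / unified) | episode_len | 20–120 |
| Env (ppo_only / unified) | rebal_freq | 1–10 |
| Env (ppo_only / unified) | turnover_cost_bps | 0–50 |
| Env (ppo_only / unified) | vol_penalty | 0–5 |
| PPO (ppo_only / unified) | learning_rate | 1e-5–1e-2 |
| PPO (ppo_only / unified) | n_steps | 64–4096 |
| PPO (ppo_only / unified) | batch_size | 32–512 |
| PPO (ppo_only / unified) | gae_lambda | 0.85–0.99 |
| PPO (ppo_only / unified) | gamma | 0.90–0.999 |
| PPO (ppo_only / unified) | ent_coef | 0–0.10 |
| PPO (ppo_only / unified) | n_epochs | 3–30 |
| cuFOLIO (cufolio_only / unified) | cvar_alpha | 0.85–0.99 |
| cuFOLIO (cufolio_only / unified) | max_position_pct | 0.02–0.30 |
| cuFOLIO (cufolio_only / unified) | n_scenarios | 1,000–10,000 |
| cuFOLIO (cufolio_only / unified) | cufolio_hold_days | 5–60 |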
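The clip-and-log behaviour described for out-of-bounds proposals can be sketched as follows. This is a minimal illustration, not the platform's implementation: the bounds are copied from the knob table, while the `BOUNDS` dict, the `BoundsClip` fields, and the `clip_delta` function are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Bounds copied from the knob table; the dict layout itself is an assumption.
BOUNDS = {
    "lookback": (5, 60),
    "episode_len": (20, 120),
    "rebal_freq": (1, 10),
    "turnover_cost_bps": (0, 50),
    "vol_penalty": (0, 5),
    "learning_rate": (1e-5, 1e-2),
    "n_steps": (64, 4096),
    "batch_size": (32, 512),
    "gae_lambda": (0.85, 0.99),
    "gamma": (0.90, 0.999),
    "ent_coef": (0, 0.10),
    "n_epochs": (3, 30),
    "cvar_alpha": (0.85, 0.99),
    "max_position_pct": (0.02, 0.30),
    "n_scenarios": (1_000, 10_000),
    "cufolio_hold_days": (5, 60),
}


@dataclass
class BoundsClip:
    """Hypothetical journal entry recording one clipped proposal."""
    knob: str
    proposed: float
    clipped: float


def clip_delta(delta):
    """Clamp each proposed knob to its bounds; return (clipped, journal).

    Unknown knob names raise KeyError rather than being passed through.
    """
    clipped, journal = {}, []
    for knob, value in delta.items():
        lo, hi = BOUNDS[knob]
        safe = min(max(value, lo), hi)
        if safe != value:
            journal.append(BoundsClip(knob, value, safe))
        clipped[knob] = safe
    return clipped, journal
```

For example, `clip_delta({"learning_rate": 0.5, "lookback": 30})` leaves `lookback` untouched, pulls `learning_rate` down to `1e-2`, and returns one `BoundsClip` entry for the journal.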
Chart interpretation with Omni VLM
All three charts (live training curve, AutoResearch Sharpe-over-iterations, Sharpe ratio curve) have an Analyze with Omni VLM → button, styled the same as the trading-view chart button on the Research page. On click, Plotly captures the chart as a PNG and the page POSTs {kind, image_b64, context} to /api/continuous-rl/interpret-chart; Nemotron Nano Omni returns a 4–6 bullet read of the chart. The result renders inline below the chart, signed with the model id and elapsed seconds. Each click takes ~2–4s. Interpretation is user-triggered, never automatic, so it costs nothing when ignored.
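A script can build the same request body that the page sends. The sketch below only shows the {kind, image_b64, context} shape; the helper name, the example `kind` value, the shape of `context`, and the assumption that `image_b64` is plain base64 without a data-URI prefix are all illustrative guesses rather than documented behaviour.

```python
import base64
import json


def build_interpret_body(kind, png_bytes, context):
    """Return a JSON string for POST /api/continuous-rl/interpret-chart.

    Assumes image_b64 is standard base64 of the raw PNG bytes.
    """
    return json.dumps({
        "kind": kind,
        "image_b64": base64.b64encode(png_bytes).decode("ascii"),
        "context": context,
    })


# Hypothetical usage: the kind string and context keys are placeholders.
body = build_interpret_body("ratio_curve", b"\x89PNG...", {"note": "last 10 runs"})
```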
Closed-loop · discover → execute
When AutoResearch produces a new best Sharpe, the policy zip lands in data/autoresearch/policies/. The NemoRLFeedbackAgent on the A2A bus picks it up via PolicyRetrained and hands it to the PortfolioManagerAgent. The next Scheduler.tick.eod runs the discovered policy through cuFOLIO → ComplianceAgent → ExecutionAgent → Alpaca. The agent designed a strategy; the system traded it. Same cascade as every other rebalance.
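The "new best Sharpe triggers PolicyRetrained" step can be pictured with a small in-memory sketch. Only the event name comes from this page; the event fields, the `BestPolicyTracker` class, and the `publish` callback standing in for the A2A bus are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PolicyRetrained:
    """Illustrative event payload; the real fields are not documented here."""
    policy_path: str
    sharpe: float


class BestPolicyTracker:
    """Publishes PolicyRetrained only when a trial beats the best Sharpe so far."""

    def __init__(self, publish: Callable[[PolicyRetrained], None]):
        self._publish = publish  # stand-in for the A2A bus
        self.best_sharpe: Optional[float] = None

    def on_trial(self, policy_path: str, sharpe: float) -> bool:
        # Ignore trials that do not strictly improve on the current best.
        if self.best_sharpe is not None and sharpe <= self.best_sharpe:
            return False
        self.best_sharpe = sharpe
        self._publish(PolicyRetrained(policy_path, sharpe))
        return True
```

In this sketch a downstream subscriber (the PortfolioManagerAgent in the text) would receive only the improving policies, so weaker trials never reach the execution cascade.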
REST surface
| Verb | Path | Purpose |
|---|---|---|
| GET | /api/nemorl/status | Active training run + 200-point reward curve. |
| GET | /api/nemorl/curve | Just the reward-curve points (used by the live chart). |
| POST | /api/nemorl/train | Kick a manual PPO run. Body: {symbols, total_timesteps}. |
| GET | /api/nemorl/ratio-curve | Cross-run Sharpe + max-DD timeline from AutoResearch trials and manual runs. |
| GET | /api/nemorl/policies | List every saved policy zip with metadata. |
| POST | /api/nemorl/policies/activate | Write pointer file → PM uses this policy on next tick. |
| POST | /api/nemorl-autoresearch/start | Kick a NemoRL AutoResearch session. Body: {goal, symbols, budget, total_timesteps, eval_mode}. |
| POST | /api/nemorl-autoresearch/stop | Graceful stop after current trial. |
| GET | /api/nemorl-autoresearch/status | Current session snapshot. |
| GET | /api/nemorl-autoresearch/journal/{sid} | Full journal for one session. |
| GET | /api/nemorl-autoresearch/sessions | All sessions on disk. |
| POST | /api/continuous-rl/interpret-chart | Omni VLM chart read. Body: {kind, image_b64, context}. |
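As an illustration of the table above, the following sketch builds a start request for an AutoResearch session using only the standard library. The base URL, the value types for `budget` and `symbols`, and the helper name are assumptions; the path, body keys, and eval-mode names come from this page.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to the actual host and port
EVAL_MODES = {"ppo_only", "cufolio_only", "unified"}


def build_start_request(goal, symbols, budget, total_timesteps, eval_mode="ppo_only"):
    """Build (but do not send) POST /api/nemorl-autoresearch/start."""
    if eval_mode not in EVAL_MODES:
        raise ValueError(f"unknown eval_mode: {eval_mode!r}")
    body = json.dumps({
        "goal": goal,
        "symbols": symbols,
        "budget": budget,
        "total_timesteps": total_timesteps,
        "eval_mode": eval_mode,
    }).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/api/nemorl-autoresearch/start",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; sending requires a running server:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
req = build_start_request("maximize Sharpe", ["NVDA", "MSFT"], 20, 50_000, "unified")
```

Building the request separately from sending it makes the body easy to inspect or log before a session is started.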
Related docs
- NemoRL PPO — the underlying training engine.
- Preference learning (DPO) — the orthogonal feedback loop.
- cuFOLIO · cuOpt PDLP — the optimizer that both the cuFOLIO Sweep and unified-mode AutoResearch sweep over.
- Backtesting — home of the cuFOLIO Sweep (deterministic sibling of NemoRL AutoResearch).
- A2A event bus — the bus surface every iteration emits to.