[ platform · self-improvement loop ]

Continuous RL

The platform's self-improvement surface. Seven panels that compound over time: a Gymnasium env, an LLM-meta-agent research loop (Karpathy pattern), manual PPO training, a cross-run ratio curve, the saved-policy library, the DPO preference flywheel, and the nightly scheduler. Live page: /continuous-rl.html.

What it is · how it works · why it matters

[ what ]

The single page that owns NVTrader's continuous-learning loop. Manual PPO training, NemoRL AutoResearch (LLM-meta-agent search), a Sharpe ratio curve across all runs, the policy library, the DPO preference flywheel, and the cron scheduler — all in one place.

[ how ]

Reads from data/rl/ + data/autoresearch/ + data/audit/. PPO via stable-baselines3 on NVTraderEnv; meta-agent via Nemotron 3 Super; chart interpretation via Nemotron Nano Omni. Every iteration is emitted on the platform's A2A bus as an OTel span via NeMo Agent Toolkit.

[ why ]

The discovered policies feed straight back into the PM via PolicyRetrained — the agent designs a strategy and the system executes it. End-to-end closed loop on one DGX Spark.

Section order on the page

The page is workflow-aligned: Setup → Train → Review → Deploy → Feedback → Schedule.

#	Section	Phase	Use it for
1	Gymnasium env	Setup	Verify the env config the rest of the page trains against (`nvtrader/PortfolioCVaR-v0`).
2	NemoRL AutoResearch	Train · headline	The Karpathy-pattern meta-agent search. Most powerful surface — pick a scope (PPO / cuFOLIO / unified), set a goal, hit Start.
3	NemoRL live training curve	Train · manual	Single one-off PPO run. Simpler than AutoResearch when you just want a baseline.
4	Sharpe ratio curve	Review	Cross-run timeline — every completed training run + AutoResearch trial, ranked by Sharpe with a max-DD overlay.
5	Policy library	Deploy	Every saved `.zip` from manual training + AutoResearch winners. Click load → to make it the PM's active policy via pointer file.
6	Preference learning	Feedback	Approve / Override / Reject → DPO pairs with T+5 outcome labels. Separate from PPO; complements it.
7	Scheduler	Admin	Three cron jobs: nightly preference extract, DPO check, weekly NemoRL retrain. Run-now buttons for each.

NemoRL AutoResearch · search scope

The headline panel exposes three eval modes the meta-agent searches over, picked at session start:

Mode	Per-trial work	Cost	Best for
`ppo_only` (default)	PPO train + walk-forward Sharpe eval	~8s	Search reward shapes + PPO hparams. Fastest.
`cufolio_only`	One cuFOLIO Mean-CVaR solve on held-out window	~2s	Cheap portfolio-knob sweep without retraining a policy.
`unified`	PPO train + cuFOLIO held-out	~12s	Discovers joint optima, e.g. "tighten vol_penalty AND raise cvar_alpha together".
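
As a rough sketch of how a session request might be assembled, the snippet below validates `eval_mode` against the three modes in the table and estimates session wall time from the approximate per-trial costs. It assumes `budget` is a trial count, which this page does not state. `build_start_body` and `estimate_seconds` are hypothetical helper names, and the goal, symbols, and timestep values are illustrative only.

```python
import json

# Approximate per-trial cost in seconds, taken from the mode table above.
TRIAL_COST_S = {"ppo_only": 8, "cufolio_only": 2, "unified": 12}


def build_start_body(goal, symbols, budget, total_timesteps, eval_mode="ppo_only"):
    """Return the JSON body for POST /api/nemorl-autoresearch/start."""
    if eval_mode not in TRIAL_COST_S:
        raise ValueError(f"unknown eval_mode: {eval_mode!r}")
    return json.dumps({
        "goal": goal,
        "symbols": list(symbols),
        "budget": budget,
        "total_timesteps": total_timesteps,
        "eval_mode": eval_mode,
    })


def estimate_seconds(budget, eval_mode="ppo_only"):
    """Rough session wall time, ASSUMING budget is a number of trials."""
    return budget * TRIAL_COST_S[eval_mode]


body = build_start_body("maximize walk-forward Sharpe", ["AAPL", "MSFT"],
                        20, 50_000, "unified")
print(body)
print(estimate_seconds(20, "unified"))  # 20 trials x ~12 s = 240
```

Under that assumption, a 20-trial `unified` session costs roughly four minutes, against about 40 seconds for the same budget in `cufolio_only`.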

The 16 knobs

The meta-agent's ConfigDelta schema covers three groups. Out-of-bounds proposals are clipped to the nearest bound and logged as a BoundsClip entry in the journal.

Group	Knob	Bounds
Env (`ppo_only` / `unified`)	`lookback`	5–60
	`episode_len`	20–120
	`rebal_freq`	1–10
	`turnover_cost_bps`	0–50
	`vol_penalty`	0–5
PPO (`ppo_only` / `unified`)	`learning_rate`	1e-5–1e-2
	`n_steps`	64–4096
	`batch_size`	32–512
	`gae_lambda`	0.85–0.99
	`gamma`	0.90–0.999
	`ent_coef`	0–0.10
	`n_epochs`	3–30
cuFOLIO (`cufolio_only` / `unified`)	`cvar_alpha`	0.85–0.99
	`max_position_pct`	0.02–0.30
	`n_scenarios`	1,000–10,000
	`cufolio_hold_days`	5–60
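
The clipping behaviour described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: `clip_delta` is a hypothetical name, only a subset of the 16 knobs is included, and the shape of the journal entry is an assumption (only the name BoundsClip comes from this page).

```python
# Bounds copied from the knob table; a subset is shown for brevity.
BOUNDS = {
    "lookback": (5, 60),
    "vol_penalty": (0.0, 5.0),
    "learning_rate": (1e-5, 1e-2),
    "gae_lambda": (0.85, 0.99),
    "cvar_alpha": (0.85, 0.99),
    "max_position_pct": (0.02, 0.30),
}


def clip_delta(delta, journal):
    """Clamp each proposed knob into its bounds, journaling every clip."""
    clipped = {}
    for knob, value in delta.items():
        lo, hi = BOUNDS[knob]  # raises KeyError for a knob outside the schema
        bounded = min(max(value, lo), hi)
        if bounded != value:
            # Entry fields are an assumption made for this sketch.
            journal.append({"type": "BoundsClip", "knob": knob,
                            "proposed": value, "applied": bounded})
        clipped[knob] = bounded
    return clipped


journal = []
result = clip_delta({"lookback": 90, "vol_penalty": 1.5, "cvar_alpha": 0.999}, journal)
print(result)   # lookback -> 60, vol_penalty unchanged, cvar_alpha -> 0.99
print(journal)  # two BoundsClip entries
```

Clamping rather than rejecting keeps a trial usable when the meta-agent overshoots, while the journal entry preserves what was actually proposed.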

Chart interpretation with Omni VLM

Each of the three charts (live training curve, AutoResearch Sharpe-over-iterations, ratio curve) has an Analyze with Omni VLM → button, styled identically to the trading-view chart button on the Research page. On click, Plotly captures the chart as a PNG and the page POSTs it to /api/continuous-rl/interpret-chart as {kind, image_b64, context}; Nemotron Nano Omni returns a 4-6 bullet read of the chart. The result renders inline below the chart, signed with the model id and elapsed seconds. Each click takes ~2-4s. Interpretation is user-triggered rather than automatic, so it costs nothing when ignored.
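
For callers outside the page, the request body might be assembled along these lines. This is a sketch under assumptions: `build_interpret_body` is a hypothetical helper, the `kind` value `"ratio_curve"` and the dict-shaped `context` are illustrative guesses, and `image_b64` is assumed to hold raw base64 without a `data:image/png;base64,` prefix.

```python
import base64
import json


def build_interpret_body(kind, png_bytes, context):
    """Return the JSON body for POST /api/continuous-rl/interpret-chart."""
    return json.dumps({
        "kind": kind,
        # PNG bytes are base64-encoded so they survive JSON transport.
        "image_b64": base64.b64encode(png_bytes).decode("ascii"),
        "context": context,
    })


# Stand-in bytes: only the 8-byte PNG signature, not a real chart capture.
fake_png = b"\x89PNG\r\n\x1a\n"
body = build_interpret_body("ratio_curve", fake_png, {"runs": 12})

# Round-trip check: the server side would decode the same bytes.
decoded = base64.b64decode(json.loads(body)["image_b64"])
print(decoded == fake_png)  # True
```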

Closed-loop · discover → execute

When AutoResearch produces a new best Sharpe, the policy zip lands in data/autoresearch/policies/. The NemoRLFeedbackAgent on the A2A bus picks it up via PolicyRetrained and hands it to the PortfolioManagerAgent. The next Scheduler.tick.eod runs the discovered policy through cuFOLIO → ComplianceAgent → ExecutionAgent → Alpaca. The agent designed a strategy; the system traded it. Same cascade as every other rebalance.

REST surface

Verb	Path	Purpose
GET	`/api/nemorl/status`	Active training run + 200-point reward curve.
GET	`/api/nemorl/curve`	Just the reward-curve points (used by the live chart).
POST	`/api/nemorl/train`	Kick a manual PPO run. Body: `{symbols, total_timesteps}`.
GET	`/api/nemorl/ratio-curve`	Cross-run Sharpe + max-DD timeline from AutoResearch trials and manual runs.
GET	`/api/nemorl/policies`	List every saved policy zip with metadata.
POST	`/api/nemorl/policies/activate`	Write pointer file → PM uses this policy on next tick.
POST	`/api/nemorl-autoresearch/start`	Kick a NemoRL AutoResearch session. Body: `{goal, symbols, budget, total_timesteps, eval_mode}`.
POST	`/api/nemorl-autoresearch/stop`	Graceful stop after current trial.
GET	`/api/nemorl-autoresearch/status`	Current session snapshot.
GET	`/api/nemorl-autoresearch/journal/{sid}`	Full journal for one session.
GET	`/api/nemorl-autoresearch/sessions`	All sessions on disk.
POST	`/api/continuous-rl/interpret-chart`	Omni VLM chart read. Body: `{kind, image_b64, context}`.
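
A minimal client sketch for the routes above, using only the Python standard library. It builds requests without sending them. `BASE_URL` is an assumption (this page gives no host or port), `build_request` is a hypothetical helper, and the symbols and timestep count are illustrative values.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: host and port are not given here


def build_request(verb, path, body=None):
    """Build (but do not send) a JSON request for one of the routes above."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=verb)


# Kick a manual PPO run; the body fields come from the table.
train = build_request("POST", "/api/nemorl/train",
                      {"symbols": ["AAPL", "MSFT"], "total_timesteps": 50_000})
# List saved policies.
policies = build_request("GET", "/api/nemorl/policies")

print(train.get_method(), train.full_url)
print(policies.get_method(), policies.full_url)
# To actually send a request: urllib.request.urlopen(train)
```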

NemoRL PPO — the underlying training engine.
Preference learning (DPO) — the orthogonal feedback loop.
cuFOLIO · cuOpt PDLP — the optimizer the cuFOLIO Sweep + unified-mode autoresearch sweep over.
Backtesting — home of the cuFOLIO Sweep (deterministic sibling of NemoRL AutoResearch).
A2A event bus — the bus surface every iteration emits to.