CONTINUOUS RL · LLM post-training · nvidia/nemo-rl 0.6.0 · subprocess-launched in dedicated Python 3.13 env · github.com/NVIDIA-NeMo/RL
NVTrader
v0.1.18

Continuous RL · LLM post-training

NVIDIA NeMo-RL training surface. Post-train the Nemotron policies that drive PM Agent narration, deep research, compliance review, and risk explanations, using user approve/reject pairs (DPO), reward models (GRPO), or instruction data (SFT). Training runs are launched as subprocesses in a dedicated Python 3.13 environment, so this application's own Python environment and dependencies stay untouched.
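For reference, the two reward-driven objectives named above can be sketched in a few lines: the standard DPO loss on one approve/reject pair, and GRPO's group-normalized advantage. This is a minimal illustration, not NeMo-RL's implementation; the `beta` default, the `eps` term, and the use of population standard deviation are assumptions.

```python
import math
from statistics import mean, pstdev


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for a single approve/reject pair.

    Each argument is the summed token log-probability of a response:
    "chosen" is the user-approved output, "rejected" the rejected one,
    scored by the policy being trained and by the frozen reference policy.
    """
    # Implicit reward margin: how much more the policy prefers the approved
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), written to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalize each sampled response's reward
    against the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std: an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With all four log-probabilities equal, the margin is zero and the DPO loss is log 2; it falls below that as the policy shifts probability toward approved responses.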

Not for portfolio weights. Portfolio optimization is handled purely by cuFOLIO Mean-CVaR, the right tool for that problem. NeMo-RL trains LLMs, whose actions are generated tokens; a portfolio policy would be a classical RL problem over allocation weights, a different action space.

flow: api.nemo_rl → NeMoRLTrainingAgent → third_party/nemo-rl/examples/run_<algo>.py → PM Agent + AuditAgent
Launch a training run
POST /api/nemo-rl/launch · routed over the NAT A2A bus to NeMoRLTrainingAgent
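A client-side sketch of a launch call, plus the agent-side step of mapping an algorithm name to its example script. The JSON field names (`algo`, `overrides`), the base URL, the `max_steps` override, and the interpreter path for the Python 3.13 env are hypothetical placeholders, not the documented request schema; only the endpoint path and the `run_<algo>.py` naming come from this page.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical values: the real base URL, payload schema, and env path may differ.
BASE_URL = "http://localhost:8000"
NEMO_RL_PYTHON = "/opt/nemo-rl-py313/bin/python"
SUPPORTED_ALGOS = ("dpo", "grpo", "sft")


def build_launch_request(algo: str, overrides: Optional[dict] = None,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to the launch endpoint."""
    if algo not in SUPPORTED_ALGOS:
        raise ValueError(f"unsupported algo: {algo!r}")
    body = {"algo": algo, "overrides": overrides or {}}  # hypothetical schema
    return urllib.request.Request(
        url=f"{base_url}/api/nemo-rl/launch",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def training_argv(algo: str, python: str = NEMO_RL_PYTHON) -> list[str]:
    """Command line an agent could hand to subprocess.Popen, running the
    NeMo-RL example script inside the separate interpreter."""
    if algo not in SUPPORTED_ALGOS:
        raise ValueError(f"unsupported algo: {algo!r}")
    return [python, f"third_party/nemo-rl/examples/run_{algo}.py"]
```

Sending is then `urllib.request.urlopen(req)`. Running the script under a different interpreter is what keeps NeMo-RL's dependency stack out of the application's own environment.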
Training runs · loaded from /api/nemo-rl/runs
Select a run on the left to stream its log.
NeMo-RL AutoResearch · Karpathy pattern · meta-agent over NeMo-RL
Meta-loop pattern from karpathy/autoresearch. In each iteration, Nemotron 3 Super proposes a typed config edit (KL penalty, batch size, learning rate, …); the loop launches the inner training run through the NeMo-RL bridge, parses the eval metric, and keeps the edit if the metric improves or reverts it otherwise. Click ⚙ Launch with AutoResearch above to start a session; the meta-agent returns the best config found across `budget` iterations.
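The keep-or-revert loop reduces to a greedy hill climb over configs. In this sketch `propose_edit` stands in for the Nemotron 3 Super call and `train_and_eval` for launching training through the NeMo-RL bridge and parsing the eval metric; both are hypothetical callables. Two further assumptions: higher metrics are better, and the baseline run is not counted against `budget`.

```python
from dataclasses import dataclass
from typing import Callable

Config = dict[str, float]


@dataclass
class Trial:
    config: Config
    metric: float


def autoresearch(base_config: Config,
                 propose_edit: Callable[[Config], Config],
                 train_and_eval: Callable[[Config], float],
                 budget: int) -> tuple[Config, float, list[Trial]]:
    """Greedy keep-or-revert meta-loop (hypothetical sketch)."""
    best_config = dict(base_config)
    best_metric = train_and_eval(best_config)  # baseline run
    history = [Trial(dict(best_config), best_metric)]
    for _ in range(budget):
        # Apply the proposed edit on top of the best config found so far.
        candidate = {**best_config, **propose_edit(best_config)}
        metric = train_and_eval(candidate)
        history.append(Trial(candidate, metric))
        if metric > best_metric:  # keep: assumes higher is better
            best_config, best_metric = candidate, metric
        # otherwise revert: best_config is left unchanged
    return best_config, best_metric, history
```

Because each candidate is built from the best config so far, a reverted edit is simply discarded. For a loss-style metric, flip the comparison.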
events: TrainNemoRLRequested, NeMoRLTrainingStarted, NeMoRLTrainingProgress, NeMoRLTrainingComplete · PM Agent + AuditAgent subscribe to all of these
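The subscription model can be sketched with a minimal in-process publish/subscribe bus. `EventBus`, `RecordingAgent`, `wire_subscribers`, and the `run_id` payload key are hypothetical stand-ins for the NAT A2A bus and for PM Agent / AuditAgent; only the four event names come from this page.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[str, dict], None]

TRAINING_EVENTS = (
    "TrainNemoRLRequested",
    "NeMoRLTrainingStarted",
    "NeMoRLTrainingProgress",
    "NeMoRLTrainingComplete",
)


class EventBus:
    """Minimal synchronous publish/subscribe bus (hypothetical)."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)  # event type -> list of handlers

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Deliver to every subscriber in registration order.
        for handler in self._handlers[event_type]:
            handler(event_type, payload)


class RecordingAgent:
    """Stand-in subscriber that records what it received."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.seen: list[tuple[str, str]] = []

    def on_event(self, event_type: str, payload: dict) -> None:
        self.seen.append((event_type, payload["run_id"]))


def wire_subscribers(bus: EventBus, *agents: RecordingAgent) -> None:
    """Subscribe each agent to every training lifecycle event."""
    for event_type in TRAINING_EVENTS:
        for agent in agents:
            bus.subscribe(event_type, agent.on_event)
```

Subscribing both agents to every lifecycle event means each one sees the whole run, from request to completion, without polling the runs endpoint.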
[ DPO LOOP · PROMOTED POLICIES ] trained checkpoint → local vLLM → next narration call
[ active policy per decision_type ]
[ all registered policies ]
events: PolicyCandidateRegistered, PolicyPromoted, PolicyPromotionFailed, LocalInferenceReloadStarted · PolicyPromotionAgent → InferenceServerAgent → policy_router
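The promotion path can be sketched as a small registry that records candidates, promotes one checkpoint per `decision_type`, and answers routing lookups. Everything here is hypothetical: the class, the `"base"` fallback when nothing has been promoted, and the event ordering (promotion, then the local vLLM reload) are assumptions modeled on the event list and the PolicyPromotionAgent → InferenceServerAgent → policy_router chain above.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyRegistry:
    """In-memory stand-in for the promoted-policy store (hypothetical)."""

    candidates: dict[str, str] = field(default_factory=dict)  # checkpoint -> decision_type
    active: dict[str, str] = field(default_factory=dict)      # decision_type -> checkpoint
    events: list[tuple[str, str]] = field(default_factory=list)

    def register(self, checkpoint: str, decision_type: str) -> None:
        """Record a trained checkpoint as a candidate for one decision_type."""
        self.candidates[checkpoint] = decision_type
        self.events.append(("PolicyCandidateRegistered", checkpoint))

    def promote(self, checkpoint: str) -> bool:
        """Make a registered candidate the active policy for its decision_type."""
        decision_type = self.candidates.get(checkpoint)
        if decision_type is None:
            self.events.append(("PolicyPromotionFailed", checkpoint))
            return False
        self.events.append(("PolicyPromoted", checkpoint))
        # Assumed: the inference server then reloads the local vLLM model.
        self.events.append(("LocalInferenceReloadStarted", checkpoint))
        self.active[decision_type] = checkpoint
        return True

    def route(self, decision_type: str, fallback: str = "base") -> str:
        """policy_router lookup: the active checkpoint, else the fallback."""
        return self.active.get(decision_type, fallback)
```

Keeping one active entry per `decision_type` means promoting a new narration policy never changes which checkpoint serves, for example, compliance review.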