NVIDIA NeMo-RL
NVIDIA NeMo-RL 0.6.0 wired into NVTrader for LLM post-training. SFT, DPO, PPO, GRPO, DAPO, GDPO, RM, and on-policy distillation — every algorithm NeMo-RL ships, callable from the bus, the chat tools, and the Continuous RL page. Live-verified on a Grace-Blackwell GB10 (Python 3.13, torch 2.12+cu130, Ray 2.54, NeMo-RL 0.6.0+30afecc), 8-pair DPO smoke exited 0 with validation loss = ln(2) at step 0 (textbook DPO baseline). Not used for: portfolio weight selection — that stays on cuFOLIO Mean-CVaR.
NeMo-RL trains language models. Its PPO/GRPO/DPO operate on next-token distributions, not portfolio weights. The previous version of this doc described an in-house stable-baselines3 PPO under the "NemoRL" brand — that has been removed. cuFOLIO does the portfolio math; NeMo-RL post-trains the Nemotron LLMs that drive PM narration, deep research, and compliance review.
What it is · how it works · why it matters
The exact nvidia/nemo-rl 0.6.0 package + source tree from github.com/NVIDIA-NeMo/RL. Subprocess-launched into a dedicated Python 3.13 env so the heavy deps (torch 2.10, ray 2.54, vLLM, Megatron-Core, FlashAttention) don't touch the rest of the stack.
Bridge module src/traderspace/nemo_rl/ writes an OmegaConf-compatible YAML, spawns third_party/nemo-rl/examples/run_<algo>.py in the nemo_rl conda env. Subprocess stdout streams into data/nemo_rl/runs/<id>/log.jsonl. The runner emits NeMoRLTrainingStarted/Progress/Complete onto the bus — PM Agent + AuditAgent subscribe.
The platform's LLM behavior compounds with use: every approve/reject pair you record becomes a DPO training example; every research session can score a reward model; every chat turn is SFT-able material. NeMo-RL turns that data into a steadily-improving Nemotron checkpoint.
Algorithms exposed
| Algo | Run script | Use case in NVTrader |
|---|---|---|
| SFT | run_sft.py | Fine-tune Nemotron on the audit log + PM chat history so it picks up your vocabulary and preferred metrics. |
| DPO | run_dpo.py | Train on PreferenceLearningAgent's approve/reject pairs — make PM narrations + rebalance recommendations match your judgment. |
| GRPO | run_grpo.py | Group relative PO against a reward model that scores response quality (cited sources, named risks, specific numbers). |
| PPO | run_grpo.py | Token-level PPO ≡ GRPO with K=1. Same entrypoint, K=1 in config. |
| DAPO | run_grpo.py | Decoupled clip + dynamic sampling PO. GRPO recipe variant. |
| GDPO | run_grpo.py | Group reward-decoupled normalization PO. Multi-reward RL training. |
| RM | run_rm.py | Train an internal Bradley-Terry reward model — then GRPO against it. |
| Distillation | run_distillation.py | On-policy student-from-teacher: bring a smaller Nemotron Nano up to Super quality on your domain. |
| Eval | run_eval.py | Score any checkpoint against benchmarks (math-verify, custom reward fns). |
Bridge module
Source: src/traderspace/nemo_rl/.
runner.py—NemoRLRunner.launch(algo, config)spawns the subprocess, registers it in an in-memory map for cancellation, streams stdout tolog.jsonl, emits bus events when state changes.configs.py— Python builders matching NeMo-RL's example configs:build_dpo_config(),build_grpo_config(),build_sft_config(),build_rm_config(). Sensible defaults for a 1B-class model on a single GB10; copy + override for bigger runs.__init__.py— re-exportslaunch_run,list_runs,get_run,cancel_run,tail_log,ALGOS.
A2A bus wiring
Two new bus agents in src/traderspace/bus/agents/nemo_rl_agents.py:
- NeMoRLTrainingAgent · subscribes
TrainNemoRLRequested; callslaunch_run(algo, cfg); emitsNeMoRLTrainingStarted. Progress + Complete are emitted by the runner directly as the subprocess runs. - NeMoRLFeedbackAgent · subscribes
PreferenceRecorded+RebalanceDecided; counts accumulated approve/reject pairs against the threshold; emitsTrainNemoRLRequested(algo='dpo')when ready — wiring the closed-loop DPO retrain automatically.
PM Agent subscribes to the full lifecycle so chat surfaces can narrate training progress. AuditAgent captures every event onto the immutable JSONL ledger.
REST surface
| Verb | Path | Purpose |
|---|---|---|
| GET | /api/nemo-rl/env | Diagnostics: is the nemo_rl env + source clone bootstrapped? |
| POST | /api/nemo-rl/launch | Routes through the bus → NeMoRLTrainingAgent. Body: {algo, model_name?, train_data_path?, max_steps?, config?}. |
| GET | /api/nemo-rl/runs | List recent training runs (newest first). |
| GET | /api/nemo-rl/run/{run_id} | Run metadata + status. |
| GET | /api/nemo-rl/run/{run_id}/log | Tail the live training log (?since_line=N&max_lines=200). |
| POST | /api/nemo-rl/run/{run_id}/cancel | SIGINT the subprocess. |
Bootstrap on a fresh host
bash scripts/bootstrap_nemo_rl.sh # dtensor backend only (default)
bash scripts/bootstrap_nemo_rl.sh --with-megatron # also install megatron-core for TP/PP recipes
Creates the nemo_rl conda env (Python 3.13.13), pip-installs nemo-rl @ git+https://github.com/NVIDIA-NeMo/RL.git, clones the source tree to third_party/nemo-rl/ with submodules (Megatron-Bridge, Automodel, Gym — required for the tool.uv.workspace resolution at training time). Then force-installs torch 2.12+cu130 over NeMo-RL's pinned torch==2.10.0 because torch 2.10's cu130 wheels max out at CUDA capability (12, 0) and Blackwell GB10 needs sm_121. Verify with curl :8015/api/nemo-rl/env.
Override paths via env vars: NVTRADER_NEMORL_PYTHON (default /home/phdaggie/miniconda3/envs/nemo_rl/bin/python), NVTRADER_NEMORL_SRC (default ./third_party/nemo-rl).
The bootstrap encodes ten fixes discovered live on a GB10 box. If you see one of these errors, the bootstrap should already have handled it — listing them here so the error messages are searchable:
- 1.
torch 2.10.0+cpuresolved by default — bootstrap force-installstorch 2.12+cu130after the nemo-rl pin. - 2. "CUDA capability range (8.0)-(12.0)" on Blackwell — torch 2.12+cu130 supports sm_121.
- 3. "Transformer Engine and Apex are not installed" warnings — informational. Megatron-Core falls back to torch SDPA + Torch optimizers automatically. TE is optional even with Megatron-Core; do not install unless you specifically need FP8 fused attention.
- 4.
uv venv "No interpreter found for Python 3.13.13 in managed installations"— fixed byUV_PYTHON_PREFERENCE=system+UV_PYTHONin the bridge subprocess env. - 5. "`nemo-gym` references a workspace ... but is not a workspace member" — fixed by cloning with
--recurse-submodules --shallow-submodules. - 6. Empty worker venv reused on retry — clear with
rm -rf third_party/nemo-rl/venvs/*when a build failed before the venv finished syncing. - 7. Ray GCS connection timeout — usually stale Ray state from a previous failed run.
pkill -9 -f raylet+rm -rf /tmp/rayresolves it. - 8. NeMo-RL downloads HelpSteer3 (38k samples) instead of your
train_data_path— the reference YAML hasdata.train.dataset_name: HelpSteer3as default.nemo_rl/configs.py::build_dpo_configpops this and writes thePreferenceDatasetshape pointing at your JSONL. - 9.
KeyError: 'completions'frompreference_preprocessor— NeMo-RL expects HelpSteer3 chat-message shape (context+completions[{rank, completion}]), not TRL's{prompt, chosen, rejected}.preferences/trainer.py::build_dpo_datasetwrites both shapes to different paths. - 10.
PythonFinalizationError: preexec_fn not supported at interpreter shutdown— Python 3.13 specific. Happens during shutdown when the dataset filtered to empty; means the actual training failed earlier (usually a config or data-shape issue). Look further up the log for the real error.
dtensor vs Megatron-Core
| Backend | Use for | Install |
|---|---|---|
| dtensor (default) | All 1B-class DPO/GRPO/SFT/RM recipes — the bridge's build_*_config builders set megatron_cfg.enabled: false. | Always installed. No extra flag. |
| Megatron-Core (opt-in) | Llama 70B, Nemotron 30BA3B, and other tensor-parallel / pipeline-parallel recipes under third_party/nemo-rl/examples/configs/recipes/llm/*megatron*.yaml. | --with-megatron or WITH_MEGATRON=1. Needs CUDA dev toolkit (nvcc on PATH); install can be heavy. |
Docker: WITH_NEMORL=1 WITH_MEGATRON=1 docker compose build app. Shell deploy: bash scripts/deploy.sh --with-megatron.
AutoResearch (Karpathy meta-loop)
The Karpathy-pattern meta-loop (karpathy/autoresearch) is preserved at src/traderspace/nemorl_autoresearch/ — but its inner training is now NeMo-RL, not stable-baselines3 PPO. Each iteration, Nemotron 3 Super proposes a typed config edit (KL penalty, batch size, learning rate, …); the orchestrator launches a real DPO/GRPO/SFT run via the bridge, parses the eval metric out of the log, keeps or reverts.
Endpoint: POST /api/nemorl-autoresearch/start with {goal, model_name, algo, budget, max_steps, train_data_path}.
UI
Live surface on the Continuous RL page. Pick an algorithm, set the base model + training data, click Launch. The runs list polls every 8s; the selected run streams its log every 1.5s. AutoResearch sessions show the meta-agent's iteration history with accept/revert markers.
What NeMo-RL is NOT used for
| Problem | Stack | Why |
|---|---|---|
| Portfolio weight selection | cuFOLIO Mean-CVaR | Continuous weight vector on N stocks. Classical RL on continuous action spaces, not token-level. cuFOLIO + cuOpt PDLP is the right tool. |
| Risk attribution | cuFOLIO + per-position covariance | Pure linear algebra. No model training needed. |
| Latent factor scoring | AIFactorAgent · sklearn PCA on the cross-sectional feature matrix | One linear projection per rebalance. No RL needed. |
| Forward-return forecasting | PredictiveModelingAgent · XGBoost regressor on a (date × symbol) rolling panel | Supervised learning on tabular features, not RL. Booster cached per (universe, as-of) so repeat rebalances on the same day skip retraining. |
| Sleeve allocation | CapitalAllocationAgent (cuFOLIO at the sleeve level) | Same problem class as portfolio weights, one level up. |
Reading the training log
The page tail-streams log.jsonl from the subprocess. Look for:
- val/reward, val/accuracy, reward/mean — higher-is-better metrics; the AutoResearch loop maximizes these.
- loss — lower-is-better; DPO/SFT training loss.
- error, fail, traceback — surface red. Subprocess will exit non-zero; the run flips to
status: "failed".