[ engine · llm post-training ]

NVIDIA NeMo-RL

NVIDIA NeMo-RL 0.6.0 is wired into NVTrader for LLM post-training. SFT, DPO, PPO, GRPO, DAPO, GDPO, RM, and on-policy distillation: every algorithm NeMo-RL ships is callable from the bus, the chat tools, and the Continuous RL page. Live-verified on a Grace-Blackwell GB10 (Python 3.13, torch 2.12+cu130, Ray 2.54, NeMo-RL 0.6.0+30afecc): an 8-pair DPO smoke run exited 0 with validation loss = ln(2) ≈ 0.693 at step 0, the textbook DPO baseline for a policy that still equals its reference model. Not used for portfolio weight selection; that stays on cuFOLIO Mean-CVaR.
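Why ln(2) is the expected step-0 value can be checked by hand. The sketch below uses the standard DPO loss, -log σ(β · margin), where the margin is the policy-vs-reference log-probability gap on the chosen response minus the same gap on the rejected one. It is an illustration only, not NeMo-RL's loss code; the function name, the β value, and the log-probabilities are made up.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair, from sequence log-probs."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Step 0: the policy is a copy of the reference, so both gaps are zero.
loss_at_step_0 = dpo_loss(-12.3, -15.8, -12.3, -15.8)

# After some training: the policy prefers the chosen response more than the reference does.
loss_later = dpo_loss(-11.0, -17.0, -12.3, -15.8)
```

With identical policy and reference the margin is 0 for every pair, so the loss is log(1 + 1) = ln(2) regardless of β or the data. A value noticeably different from 0.693 at step 0 would suggest the reference model or the pair formatting is off.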

honest scope

NeMo-RL trains language models. Its PPO/GRPO/DPO operate on next-token distributions, not portfolio weights. The previous version of this doc described an in-house stable-baselines3 PPO under the "NemoRL" brand — that has been removed. cuFOLIO does the portfolio math; NeMo-RL post-trains the Nemotron LLMs that drive PM narration, deep research, and compliance review.

What it is · how it works · why it matters

[ what ]

The exact nvidia/nemo-rl 0.6.0 package + source tree from github.com/NVIDIA-NeMo/RL. It is launched as a subprocess in a dedicated Python 3.13 env so the heavy deps (torch 2.12+cu130, force-installed over NeMo-RL's pinned torch==2.10.0; Ray 2.54; vLLM; Megatron-Core; FlashAttention) don't touch the rest of the stack.

[ how ]

The bridge module src/traderspace/nemo_rl/ writes an OmegaConf-compatible YAML config, then spawns third_party/nemo-rl/examples/run_<algo>.py in the nemo_rl conda env. Subprocess stdout streams into data/nemo_rl/runs/<id>/log.jsonl. The runner emits NeMoRLTrainingStarted/Progress/Complete onto the bus; PM Agent + AuditAgent subscribe.
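The spawn-and-stream step could look roughly like the sketch below. It is not the bridge's actual code: run_and_log and the record fields (line, ts, text) are hypothetical, and the demo runs a trivial Python one-liner instead of a real run_<algo>.py so that it works without the nemo_rl env.

```python
import json
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def run_and_log(cmd: list[str], run_dir: Path) -> int:
    """Run cmd, copy each stdout line into run_dir/log.jsonl, return the exit code."""
    run_dir.mkdir(parents=True, exist_ok=True)
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered on the parent side
    )
    assert proc.stdout is not None
    with (run_dir / "log.jsonl").open("a", encoding="utf-8") as log:
        for n, line in enumerate(proc.stdout, start=1):
            # Hypothetical record shape; the real log.jsonl schema is not shown here.
            record = {"line": n, "ts": time.time(), "text": line.rstrip("\n")}
            log.write(json.dumps(record) + "\n")
            log.flush()  # keep the file tail-able while the run is live
    return proc.wait()


# Demo with a stand-in command; a real launch would pass the nemo_rl env's
# interpreter and the run script path instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "runs" / "demo"
    exit_code = run_and_log(
        [sys.executable, "-c", "print('step 1'); print('step 2')"], demo_dir
    )
    logged = [
        json.loads(row)["text"]
        for row in (demo_dir / "log.jsonl").read_text(encoding="utf-8").splitlines()
    ]
```

Flushing after every line is what lets a separate reader tail the file while training is still running.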

[ why ]

The platform's LLM behavior compounds with use: every approve/reject pair you record becomes a DPO training example; every research session can score a reward model; every chat turn is SFT-able material. NeMo-RL turns that data into a steadily-improving Nemotron checkpoint.

Algorithms exposed

Algo | Run script | Use case in NVTrader
SFT | run_sft.py | Fine-tune Nemotron on the audit log + PM chat history so it picks up your vocabulary and preferred metrics.
DPO | run_dpo.py | Train on PreferenceLearningAgent's approve/reject pairs to make PM narrations + rebalance recommendations match your judgment.
GRPO | run_grpo.py | Group relative PO against a reward model that scores response quality (cited sources, named risks, specific numbers).
PPO | run_grpo.py | Token-level PPO ≡ GRPO with K=1. Same entrypoint, K=1 in config.
DAPO | run_grpo.py | Decoupled clip + dynamic sampling PO. GRPO recipe variant.
GDPO | run_grpo.py | Group reward-decoupled normalization PO. Multi-reward RL training.
RM | run_rm.py | Train an internal Bradley-Terry reward model, then GRPO against it.
Distillation | run_distillation.py | On-policy student-from-teacher: bring a smaller Nemotron Nano up to Super quality on your domain.
Eval | run_eval.py | Score any checkpoint against benchmarks (math-verify, custom reward fns).
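The table above maps nine algorithm names onto six entrypoints. A bridge could encode that as a lookup, sketched below. The script names and the examples/ directory come from this page; RUN_SCRIPTS, resolve_script, and the lowercase algo keys are hypothetical and are not the bridge's real API.

```python
from pathlib import Path

# Entrypoints as listed in the table; PPO, DAPO and GDPO reuse the GRPO script
# and differ only in config.
RUN_SCRIPTS = {
    "sft": "run_sft.py",
    "dpo": "run_dpo.py",
    "grpo": "run_grpo.py",
    "ppo": "run_grpo.py",
    "dapo": "run_grpo.py",
    "gdpo": "run_grpo.py",
    "rm": "run_rm.py",
    "distillation": "run_distillation.py",
    "eval": "run_eval.py",
}


def resolve_script(algo: str, src_root: Path = Path("third_party/nemo-rl")) -> Path:
    """Return the example script path for an algorithm name (case-insensitive)."""
    key = algo.strip().lower()
    if key not in RUN_SCRIPTS:
        raise ValueError(f"unknown algo {algo!r}; expected one of {sorted(RUN_SCRIPTS)}")
    return src_root / "examples" / RUN_SCRIPTS[key]


ppo_script = resolve_script("PPO")
dpo_script = resolve_script("dpo")
```

Rejecting unknown names before spawning anything keeps a typo in a launch request from turning into a confusing subprocess failure.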

Bridge module

Source: src/traderspace/nemo_rl/.

A2A bus wiring

Two new bus agents in src/traderspace/bus/agents/nemo_rl_agents.py:

PM Agent subscribes to the full lifecycle so chat surfaces can narrate training progress. AuditAgent captures every event onto the immutable JSONL ledger.
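The subscription pattern can be illustrated with a toy in-process bus. Everything below is a stand-in: the Bus class, the event fields, and the two handler lists are hypothetical, and the three event class names are an expansion of the Started/Progress/Complete lifecycle mentioned on this page. The real A2A bus and agent classes are not shown.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class NeMoRLTrainingStarted:
    run_id: str
    algo: str


@dataclass(frozen=True)
class NeMoRLTrainingProgress:
    run_id: str
    step: int


@dataclass(frozen=True)
class NeMoRLTrainingComplete:
    run_id: str
    exit_code: int


class Bus:
    """Toy publish/subscribe bus keyed by event type."""

    def __init__(self) -> None:
        self._subs: dict[type, list[Callable[[object], None]]] = defaultdict(list)

    def subscribe(self, event_type: type, handler: Callable[[object], None]) -> None:
        self._subs[event_type].append(handler)

    def publish(self, event: object) -> None:
        for handler in self._subs[type(event)]:
            handler(event)


LIFECYCLE = (NeMoRLTrainingStarted, NeMoRLTrainingProgress, NeMoRLTrainingComplete)
bus = Bus()
narration: list[str] = []  # stands in for the PM Agent's chat narration
ledger: list[str] = []     # stands in for the AuditAgent's JSONL ledger

for event_type in LIFECYCLE:
    bus.subscribe(event_type, lambda e: narration.append(type(e).__name__))
    bus.subscribe(event_type, lambda e: ledger.append(repr(e)))

bus.publish(NeMoRLTrainingStarted(run_id="run-1", algo="dpo"))
bus.publish(NeMoRLTrainingProgress(run_id="run-1", step=1))
bus.publish(NeMoRLTrainingComplete(run_id="run-1", exit_code=0))
```

Both subscribers see the same three events in publish order, which is the property the narration and audit surfaces rely on.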

REST surface

Verb | Path | Purpose
GET | /api/nemo-rl/env | Diagnostics: is the nemo_rl env + source clone bootstrapped?
POST | /api/nemo-rl/launch | Routes through the bus → NeMoRLTrainingAgent. Body: {algo, model_name?, train_data_path?, max_steps?, config?}.
GET | /api/nemo-rl/runs | List recent training runs (newest first).
GET | /api/nemo-rl/run/{run_id} | Run metadata + status.
GET | /api/nemo-rl/run/{run_id}/log | Tail the live training log (?since_line=N&max_lines=200).
POST | /api/nemo-rl/run/{run_id}/cancel | SIGINT the subprocess.

Bootstrap on a fresh host

bash scripts/bootstrap_nemo_rl.sh                  # dtensor backend only (default)
bash scripts/bootstrap_nemo_rl.sh --with-megatron  # also install megatron-core for TP/PP recipes

Creates the nemo_rl conda env (Python 3.13.13), pip-installs nemo-rl @ git+https://github.com/NVIDIA-NeMo/RL.git, and clones the source tree to third_party/nemo-rl/ with submodules (Megatron-Bridge, Automodel, Gym; required for the tool.uv.workspace resolution at training time). It then force-installs torch 2.12+cu130 over NeMo-RL's pinned torch==2.10.0, because torch 2.10's cu130 wheels max out at CUDA capability (12, 0) and Blackwell GB10 needs sm_121. Verify with curl :8015/api/nemo-rl/env.

Override paths via env vars: NVTRADER_NEMORL_PYTHON (default /home/phdaggie/miniconda3/envs/nemo_rl/bin/python), NVTRADER_NEMORL_SRC (default ./third_party/nemo-rl).

install gotchas (fixed)

The bootstrap encodes ten fixes discovered live on a GB10 box. If you see one of these errors, the bootstrap should already have handled it — listing them here so the error messages are searchable:

dtensor vs Megatron-Core

Backend | Use for | Install
dtensor (default) | All 1B-class DPO/GRPO/SFT/RM recipes; the bridge's build_*_config builders set megatron_cfg.enabled: false. | Always installed. No extra flag.
Megatron-Core (opt-in) | Llama 70B, Nemotron 30BA3B, and other tensor-parallel / pipeline-parallel recipes under third_party/nemo-rl/examples/configs/recipes/llm/*megatron*.yaml. | --with-megatron or WITH_MEGATRON=1. Needs CUDA dev toolkit (nvcc on PATH); install can be heavy.

Docker: WITH_NEMORL=1 WITH_MEGATRON=1 docker compose build app. Shell deploy: bash scripts/deploy.sh --with-megatron.

AutoResearch (Karpathy meta-loop)

The Karpathy-pattern meta-loop (karpathy/autoresearch) is preserved at src/traderspace/nemorl_autoresearch/, but its inner training is now NeMo-RL, not stable-baselines3 PPO. In each iteration, Nemotron 3 Super proposes a typed config edit (KL penalty, batch size, learning rate, …); the orchestrator launches a real DPO/GRPO/SFT run via the bridge, parses the eval metric out of the log, and keeps or reverts the edit.
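The propose, run, keep-or-revert cycle can be sketched as a plain loop with the two expensive steps injected as callables. This is an illustration under stated assumptions, not the orchestrator's code: meta_loop, the lower-is-better metric, and the fake proposer and evaluator in the demo are all hypothetical. In the real system the proposer would be the LLM and the evaluator a full training run plus log parsing.

```python
from typing import Callable

Config = dict[str, float]


def meta_loop(
    config: Config,
    propose: Callable[[Config], Config],
    evaluate: Callable[[Config], float],
    budget: int,
) -> tuple[Config, list[tuple[Config, float, bool]]]:
    """Greedy keep-or-revert search; assumes a lower eval metric is better."""
    best_cfg = dict(config)
    best_metric = evaluate(best_cfg)  # baseline run before any edit
    history: list[tuple[Config, float, bool]] = []
    for _ in range(budget):
        edit = propose(best_cfg)            # e.g. {"learning_rate": 1e-5}
        candidate = {**best_cfg, **edit}    # apply the edit to the current best
        metric = evaluate(candidate)
        accepted = metric < best_metric
        if accepted:
            best_cfg, best_metric = candidate, metric  # keep
        # otherwise best_cfg is left untouched, which is the revert
        history.append((edit, metric, accepted))
    return best_cfg, history


# Demo with deterministic stand-ins for the LLM proposer and the training run.
proposals = iter([{"learning_rate": 1e-4}, {"learning_rate": 1e-5}, {"learning_rate": 1e-6}])
final_cfg, history = meta_loop(
    {"learning_rate": 1e-3},
    propose=lambda cfg: next(proposals),
    evaluate=lambda cfg: abs(cfg["learning_rate"] - 1e-5),
    budget=3,
)
accept_flags = [accepted for _, _, accepted in history]
```

In the demo the first two edits improve the fake metric and are kept, while the third makes it worse and is reverted, so the final config keeps the second learning rate.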

Endpoint: POST /api/nemorl-autoresearch/start with {goal, model_name, algo, budget, max_steps, train_data_path}.

UI

Live surface on the Continuous RL page. Pick an algorithm, set the base model + training data, click Launch. The runs list polls every 8s; the selected run streams its log every 1.5s. AutoResearch sessions show the meta-agent's iteration history with accept/revert markers.

What NeMo-RL is NOT used for

Problem | Stack | Why
Portfolio weight selection | cuFOLIO Mean-CVaR | Continuous weight vector on N stocks. Classical RL on continuous action spaces, not token-level. cuFOLIO + cuOpt PDLP is the right tool.
Risk attribution | cuFOLIO + per-position covariance | Pure linear algebra. No model training needed.
Latent factor scoring | AIFactorAgent · sklearn PCA on the cross-sectional feature matrix | One linear projection per rebalance. No RL needed.
Forward-return forecasting | PredictiveModelingAgent · XGBoost regressor on a (date × symbol) rolling panel | Supervised learning on tabular features, not RL. Booster cached per (universe, as-of) so repeat rebalances on the same day skip retraining.
Sleeve allocation | CapitalAllocationAgent (cuFOLIO at the sleeve level) | Same problem class as portfolio weights, one level up.

Reading the training log

The page tail-streams log.jsonl from the subprocess. Look for:

NVTrader v0.1.18 · docs · ⚠ Not financial advice