[ engine · llm post-training ]

NVIDIA NeMo-RL

NVIDIA NeMo-RL 0.6.0 wired into NVTrader for LLM post-training. SFT, DPO, PPO, GRPO, DAPO, GDPO, RM, and on-policy distillation — every algorithm NeMo-RL ships, callable from the bus, the chat tools, and the Continuous RL page. Live-verified on a Grace-Blackwell GB10 (Python 3.13, torch 2.12+cu130, Ray 2.54, NeMo-RL 0.6.0+30afecc), 8-pair DPO smoke exited 0 with validation loss = ln(2) at step 0 (textbook DPO baseline). Not used for: portfolio weight selection — that stays on cuFOLIO Mean-CVaR.

honest scope

NeMo-RL trains language models. Its PPO/GRPO/DPO operate on next-token distributions, not portfolio weights. The previous version of this doc described an in-house stable-baselines3 PPO under the "NemoRL" brand — that has been removed. cuFOLIO does the portfolio math; NeMo-RL post-trains the Nemotron LLMs that drive PM narration, deep research, and compliance review.

What it is · how it works · why it matters

[ what ]

The exact nvidia/nemo-rl 0.6.0 package + source tree from github.com/NVIDIA-NeMo/RL. Subprocess-launched into a dedicated Python 3.13 env so the heavy deps (torch 2.10, ray 2.54, vLLM, Megatron-Core, FlashAttention) don't touch the rest of the stack.

[ how ]

Bridge module src/traderspace/nemo_rl/ writes an OmegaConf-compatible YAML, spawns third_party/nemo-rl/examples/run_<algo>.py in the nemo_rl conda env. Subprocess stdout streams into data/nemo_rl/runs/<id>/log.jsonl. The runner emits NeMoRLTrainingStarted/Progress/Complete onto the bus — PM Agent + AuditAgent subscribe.

[ why ]

The platform's LLM behavior compounds with use: every approve/reject pair you record becomes a DPO training example; every research session can score a reward model; every chat turn is SFT-able material. NeMo-RL turns that data into a steadily-improving Nemotron checkpoint.

Algorithms exposed

Algo	Run script	Use case in NVTrader
SFT	`run_sft.py`	Fine-tune Nemotron on the audit log + PM chat history so it picks up your vocabulary and preferred metrics.
DPO	`run_dpo.py`	Train on PreferenceLearningAgent's approve/reject pairs — make PM narrations + rebalance recommendations match your judgment.
GRPO	`run_grpo.py`	Group relative PO against a reward model that scores response quality (cited sources, named risks, specific numbers).
PPO	`run_grpo.py`	Token-level PPO ≡ GRPO with K=1. Same entrypoint, K=1 in config.
DAPO	`run_grpo.py`	Decoupled clip + dynamic sampling PO. GRPO recipe variant.
GDPO	`run_grpo.py`	Group reward-decoupled normalization PO. Multi-reward RL training.
RM	`run_rm.py`	Train an internal Bradley-Terry reward model — then GRPO against it.
Distillation	`run_distillation.py`	On-policy student-from-teacher: bring a smaller Nemotron Nano up to Super quality on your domain.
Eval	`run_eval.py`	Score any checkpoint against benchmarks (math-verify, custom reward fns).

Bridge module

Source: src/traderspace/nemo_rl/.

runner.py — NemoRLRunner.launch(algo, config) spawns the subprocess, registers it in an in-memory map for cancellation, streams stdout to log.jsonl, emits bus events when state changes.
configs.py — Python builders matching NeMo-RL's example configs: build_dpo_config(), build_grpo_config(), build_sft_config(), build_rm_config(). Sensible defaults for a 1B-class model on a single GB10; copy + override for bigger runs.
__init__.py — re-exports launch_run, list_runs, get_run, cancel_run, tail_log, ALGOS.

A2A bus wiring

Two new bus agents in src/traderspace/bus/agents/nemo_rl_agents.py:

NeMoRLTrainingAgent · subscribes TrainNemoRLRequested; calls launch_run(algo, cfg); emits NeMoRLTrainingStarted. Progress + Complete are emitted by the runner directly as the subprocess runs.
NeMoRLFeedbackAgent · subscribes PreferenceRecorded + RebalanceDecided; counts accumulated approve/reject pairs against the threshold; emits TrainNemoRLRequested(algo='dpo') when ready — wiring the closed-loop DPO retrain automatically.

PM Agent subscribes to the full lifecycle so chat surfaces can narrate training progress. AuditAgent captures every event onto the immutable JSONL ledger.

REST surface

Verb	Path	Purpose
GET	`/api/nemo-rl/env`	Diagnostics: is the `nemo_rl` env + source clone bootstrapped?
POST	`/api/nemo-rl/launch`	Routes through the bus → NeMoRLTrainingAgent. Body: `{algo, model_name?, train_data_path?, max_steps?, config?}`.
GET	`/api/nemo-rl/runs`	List recent training runs (newest first).
GET	`/api/nemo-rl/run/{run_id}`	Run metadata + status.
GET	`/api/nemo-rl/run/{run_id}/log`	Tail the live training log (`?since_line=N&max_lines=200`).
POST	`/api/nemo-rl/run/{run_id}/cancel`	SIGINT the subprocess.

Bootstrap on a fresh host

bash scripts/bootstrap_nemo_rl.sh                  # dtensor backend only (default)
bash scripts/bootstrap_nemo_rl.sh --with-megatron  # also install megatron-core for TP/PP recipes

Creates the nemo_rl conda env (Python 3.13.13), pip-installs nemo-rl @ git+https://github.com/NVIDIA-NeMo/RL.git, clones the source tree to third_party/nemo-rl/ with submodules (Megatron-Bridge, Automodel, Gym — required for the tool.uv.workspace resolution at training time). Then force-installs torch 2.12+cu130 over NeMo-RL's pinned torch==2.10.0 because torch 2.10's cu130 wheels max out at CUDA capability (12, 0) and Blackwell GB10 needs sm_121. Verify with curl :8015/api/nemo-rl/env.

Override paths via env vars: NVTRADER_NEMORL_PYTHON (default /home/phdaggie/miniconda3/envs/nemo_rl/bin/python), NVTRADER_NEMORL_SRC (default ./third_party/nemo-rl).

install gotchas (fixed)

The bootstrap encodes ten fixes discovered live on a GB10 box. If you see one of these errors, the bootstrap should already have handled it — listing them here so the error messages are searchable:

1. torch 2.10.0+cpu resolved by default — bootstrap force-installs torch 2.12+cu130 after the nemo-rl pin.
2. "CUDA capability range (8.0)-(12.0)" on Blackwell — torch 2.12+cu130 supports sm_121.
3. "Transformer Engine and Apex are not installed" warnings — informational. Megatron-Core falls back to torch SDPA + Torch optimizers automatically. TE is optional even with Megatron-Core; do not install unless you specifically need FP8 fused attention.
4. uv venv "No interpreter found for Python 3.13.13 in managed installations" — fixed by UV_PYTHON_PREFERENCE=system + UV_PYTHON in the bridge subprocess env.
5. "`nemo-gym` references a workspace ... but is not a workspace member" — fixed by cloning with --recurse-submodules --shallow-submodules.
6. Empty worker venv reused on retry — clear with rm -rf third_party/nemo-rl/venvs/* when a build failed before the venv finished syncing.
7. Ray GCS connection timeout — usually stale Ray state from a previous failed run. pkill -9 -f raylet + rm -rf /tmp/ray resolves it.
8. NeMo-RL downloads HelpSteer3 (38k samples) instead of your train_data_path — the reference YAML has data.train.dataset_name: HelpSteer3 as default. nemo_rl/configs.py::build_dpo_config pops this and writes the PreferenceDataset shape pointing at your JSONL.
9. KeyError: 'completions' from preference_preprocessor — NeMo-RL expects HelpSteer3 chat-message shape (context + completions[{rank, completion}]), not TRL's {prompt, chosen, rejected}. preferences/trainer.py::build_dpo_dataset writes both shapes to different paths.
10. PythonFinalizationError: preexec_fn not supported at interpreter shutdown — Python 3.13 specific. Happens during shutdown when the dataset filtered to empty; means the actual training failed earlier (usually a config or data-shape issue). Look further up the log for the real error.

dtensor vs Megatron-Core

Backend	Use for	Install
dtensor (default)	All 1B-class DPO/GRPO/SFT/RM recipes — the bridge's `build_*_config` builders set `megatron_cfg.enabled: false`.	Always installed. No extra flag.
Megatron-Core (opt-in)	Llama 70B, Nemotron 30BA3B, and other tensor-parallel / pipeline-parallel recipes under `third_party/nemo-rl/examples/configs/recipes/llm/megatron.yaml`.	`--with-megatron` or `WITH_MEGATRON=1`. Needs CUDA dev toolkit (nvcc on PATH); install can be heavy.

Docker: WITH_NEMORL=1 WITH_MEGATRON=1 docker compose build app. Shell deploy: bash scripts/deploy.sh --with-megatron.

AutoResearch (Karpathy meta-loop)

The Karpathy-pattern meta-loop (karpathy/autoresearch) is preserved at src/traderspace/nemorl_autoresearch/ — but its inner training is now NeMo-RL, not stable-baselines3 PPO. Each iteration, Nemotron 3 Super proposes a typed config edit (KL penalty, batch size, learning rate, …); the orchestrator launches a real DPO/GRPO/SFT run via the bridge, parses the eval metric out of the log, keeps or reverts.

Endpoint: POST /api/nemorl-autoresearch/start with {goal, model_name, algo, budget, max_steps, train_data_path}.

UI

Live surface on the Continuous RL page. Pick an algorithm, set the base model + training data, click Launch. The runs list polls every 8s; the selected run streams its log every 1.5s. AutoResearch sessions show the meta-agent's iteration history with accept/revert markers.

What NeMo-RL is NOT used for

Problem	Stack	Why
Portfolio weight selection	cuFOLIO Mean-CVaR	Continuous weight vector on N stocks. Classical RL on continuous action spaces, not token-level. cuFOLIO + cuOpt PDLP is the right tool.
Risk attribution	cuFOLIO + per-position covariance	Pure linear algebra. No model training needed.
Latent factor scoring	AIFactorAgent · sklearn PCA on the cross-sectional feature matrix	One linear projection per rebalance. No RL needed.
Forward-return forecasting	PredictiveModelingAgent · XGBoost regressor on a (date × symbol) rolling panel	Supervised learning on tabular features, not RL. Booster cached per (universe, as-of) so repeat rebalances on the same day skip retraining.
Sleeve allocation	CapitalAllocationAgent (cuFOLIO at the sleeve level)	Same problem class as portfolio weights, one level up.

Reading the training log

The page tail-streams log.jsonl from the subprocess. Look for:

val/reward, val/accuracy, reward/mean — higher-is-better metrics; the AutoResearch loop maximizes these.
loss — lower-is-better; DPO/SFT training loss.
error, fail, traceback — surface red. Subprocess will exit non-zero; the run flips to status: "failed".