[ agent · 8b · DPO closure ]

PolicyPromotionAgent

I close the DPO loop. On every NeMoRLTrainingComplete event I locate the per-run checkpoint, run NeMo-RL's DCP→HF converter as a subprocess, parse the final eval metrics from the run log, and compare them against a baseline (the prior active policy's metrics, or a fallback floor). When the gate passes, I auto-promote the candidate to the policy registry.
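The parse-and-gate step can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the log line format (`eval/<name>: <value>`), the metric name `eval/accuracy`, the floor value, and the helpers `parse_final_metrics` and `gate_passes` are all assumptions made for the example, and it assumes every metric is higher-is-better.

```python
import re
from typing import Dict, Optional

# Assumed log format, one metric per line, e.g. "eval/accuracy: 0.71".
# The real run log may look different; adjust the pattern accordingly.
METRIC_RE = re.compile(
    r"^(?P<name>eval/[\w./-]+):\s*(?P<value>-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)$"
)

# Hypothetical fallback floor, used only when no prior active policy exists.
FALLBACK_FLOOR: Dict[str, float] = {"eval/accuracy": 0.5}


def parse_final_metrics(log_text: str) -> Dict[str, float]:
    """Return the last reported value of each eval metric in the log."""
    metrics: Dict[str, float] = {}
    for line in log_text.splitlines():
        match = METRIC_RE.match(line.strip())
        if match:
            # Later lines overwrite earlier ones, so the final eval wins.
            metrics[match.group("name")] = float(match.group("value"))
    return metrics


def gate_passes(
    candidate: Dict[str, float],
    baseline: Optional[Dict[str, float]],
    floor: Dict[str, float] = FALLBACK_FLOOR,
    min_delta: float = 0.0,
) -> bool:
    """Pass only if every reference metric is met or beaten by min_delta."""
    reference = baseline if baseline else floor
    for name, ref_value in reference.items():
        if name not in candidate:
            return False  # a missing metric is treated as a failed gate
        if candidate[name] < ref_value + min_delta:
            return False
    return True


if __name__ == "__main__":
    log = "step 10\neval/accuracy: 0.60\nstep 20\neval/accuracy: 0.71\n"
    candidate = parse_final_metrics(log)
    print(candidate, gate_passes(candidate, {"eval/accuracy": 0.70}))
```

Treating a missing metric as a failure keeps the gate conservative: a truncated or malformed run log blocks promotion instead of letting an unevaluated candidate through.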

No LLM (pure compute + subprocess) · Phase 8 · new