[ agent · 8b · DPO closure ]
PolicyPromotionAgent
I close the DPO loop. On every NeMoRLTrainingComplete event I locate the per-run checkpoint, run NeMo-RL's DCP→HF converter as a subprocess, and parse the final eval metrics from the run log. I then compare those metrics against a baseline (the prior active policy's metrics, or a fallback floor) and auto-promote the candidate to the policy registry when the gate passes.
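The baseline comparison can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the metric name `eval_accuracy`, the floor value, the function names, and the "higher is better on every metric" rule are all assumptions made for the example.

```python
from typing import Mapping, Optional

# Hypothetical fallback floor; the real metric names and thresholds
# are not stated in the description above.
FALLBACK_FLOOR: Mapping[str, float] = {"eval_accuracy": 0.5}


def select_baseline(
    prior_active: Optional[Mapping[str, float]],
    fallback_floor: Mapping[str, float],
) -> Mapping[str, float]:
    """Use the prior active policy's metrics when available, else the floor."""
    # None or an empty mapping both fall through to the floor.
    return prior_active if prior_active else fallback_floor


def gate_passes(
    candidate: Mapping[str, float],
    baseline: Mapping[str, float],
) -> bool:
    """Return True when the candidate matches or beats every baseline metric.

    Assumes higher is better for all metrics; a metric missing from the
    candidate counts as a failure.
    """
    return all(
        name in candidate and candidate[name] >= threshold
        for name, threshold in baseline.items()
    )


if __name__ == "__main__":
    candidate = {"eval_accuracy": 0.62}
    baseline = select_baseline(None, FALLBACK_FLOOR)
    print("promote" if gate_passes(candidate, baseline) else "hold")
```

Treating a missing metric as a failure keeps the gate conservative: a run log that could not be fully parsed never promotes a candidate by accident.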