Resiliency

Stable docs: @docs/training/resiliency.md, @docs/training/checkpointing.md
Card: @skills/nemo-mbridge-resiliency/card.yaml

Enablement

Fault tolerance (Slurm only)

Option 1: NeMo Run plugin (recommended)

    from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
    import nemo_run as run

Installs: 540
Repository: nvidia/skills
GitHub Stars: 1.3K
First Seen: May 29, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
nemo-mbridge-resiliency
Install
npx skills add https://github.com/nvidia/skills --skill nemo-mbridge-resiliency