nemo-mbridge-multi-node-slurm

Installs: 582
Rank: #9149

Install

npx skills add https://github.com/nvidia/skills --skill nemo-mbridge-multi-node-slurm

Multi-Node Slurm

Convert single-node uv run python -m torch.distributed.run commands into multi-node Slurm sbatch scripts with Enroot container support, and debug common multi-node failures.

Two Approaches: srun-native vs uv run torch.distributed

| Approach | ntasks-per-node | Process spawning | Best for |
| srun-native (preferred) | 8 | Slurm spawns 8 tasks/node | Conversion, inference, Bridge scripts |
| uv run torch.distributed (legacy) | 1 | uv run python -m torch.distributed.run spawns 8 procs/node | MLM pretrain_gpt.py |

Prefer srun-native: it is simpler and avoids shell escaping issues with TRAIN_CMD. Megatron Bridge auto-derives RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT from the Slurm environment variables (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID, SLURM_NODELIST) via common_utils.py helpers called during the distributed init in initialize.py, so you never need to set them manually.

Installs: 557
Repository: nvidia/skills
GitHub Stars: 1.3K
First Seen: May 29, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Pass
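To make the Slurm-to-torch.distributed variable mapping described above concrete, here is a minimal Python sketch of how such a derivation could work. It is not the actual common_utils.py code from Megatron Bridge: the function names first_host and derive_dist_env are hypothetical, the fallback port 29500 is an assumption, and the nodelist parser only handles simple forms such as gpu[03-06,09] or nodeA,nodeB (a real setup would more likely call scontrol show hostnames).

```python
import os
import re
from typing import Mapping


def first_host(nodelist: str) -> str:
    """Return the first hostname in a simple Slurm nodelist.

    Hypothetical helper: handles 'node1', 'nodeA,nodeB' and 'gpu[03-06,09]',
    but not hostnames with text after the bracket expression.
    """
    match = re.match(r"([^\[,]+)(?:\[([^\]]+)\])?", nodelist)
    if match is None:
        raise ValueError(f"cannot parse nodelist: {nodelist!r}")
    prefix, ranges = match.group(1), match.group(2)
    if ranges is None:
        return prefix
    # Take the first entry inside the brackets: '03-06' -> '03', '09' -> '09'.
    first = ranges.split(",")[0].split("-")[0]
    return prefix + first


def derive_dist_env(env: Mapping[str, str], default_port: int = 29500) -> dict[str, str]:
    """Map Slurm task variables onto the names torch.distributed expects.

    Assumption: explicitly set MASTER_ADDR / MASTER_PORT values win over
    the derived ones; default_port is an illustrative fallback.
    """
    return {
        "RANK": env["SLURM_PROCID"],          # global task index
        "WORLD_SIZE": env["SLURM_NTASKS"],    # total tasks in the job
        "LOCAL_RANK": env["SLURM_LOCALID"],   # task index within the node
        "MASTER_ADDR": env.get("MASTER_ADDR") or first_host(env["SLURM_NODELIST"]),
        "MASTER_PORT": env.get("MASTER_PORT", str(default_port)),
    }


if __name__ == "__main__":
    # Only meaningful inside a Slurm task (for example, one launched by srun).
    if "SLURM_PROCID" in os.environ:
        print(derive_dist_env(os.environ))
```

With the srun-native layout of 8 tasks per node on 2 nodes, the task with SLURM_PROCID=9 and SLURM_LOCALID=1 would get RANK 9, WORLD_SIZE 16, and LOCAL_RANK 1 under this sketch, with MASTER_ADDR pointing at the first node in the allocation.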
