Megatron-Core - Large-Scale LLM Training

Quick start

Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization on H100 GPUs through advanced parallelism strategies.

Installation:

Docker (recommended)

docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

Or pip

pip install megatron-core

Simple distributed training:

Train with 2 GPUs using data parallelism

torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py

Or LLaMA-3 8B training

./examples/llama/train_llama3_8b_fp8.sh

Common workflows

Workflow 1: Train LLaMA-style model with 3D parallelism

Copy this checklist:

LLaMA Training Setup:
- [ ] Step 1: Choose parallelism configuration
- [ ] Step 2: Configure training hyperparameters
- [ ] Step 3: Launch distributed training
- [ ] Step 4: Monitor performance metrics

Step 1: Choose parallelism configuration

Model size determines parallelism strategy:

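| Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel |
|---|---|---|---|---|---|
| 7B | 8 | 1 | 1 | 8 | 1 |
| 13B | 8 | 2 | 1 | 4 | 1 |
| 70B | 64 | 4 | 4 | 4 | 1 |
| 405B | 128 | 8 | 8 | 1 | 2 |

In every row, GPUs = Tensor Parallel × Pipeline Parallel × Data Parallel × Context Parallel.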
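The relationship between GPU count and parallelism degrees can be checked before launching. The sketch below assumes the world size factors as TP × PP × CP × DP; `dp_size` is a hypothetical helper, not part of Megatron-Core.

```bash
# Hypothetical helper: derive the data-parallel degree from the GPU count.
# Assumes world size = TP x PP x CP x DP.
dp_size() {
  world=$1; tp=$2; pp=$3; cp=$4
  model_parallel=$((tp * pp * cp))
  if [ $((world % model_parallel)) -ne 0 ]; then
    echo "error: $world GPUs not divisible by TP*PP*CP=$model_parallel" >&2
    return 1
  fi
  echo $((world / model_parallel))
}

dp_size 8 2 1 1      # 13B on 8 GPUs with TP=2
dp_size 64 4 4 1     # 70B on 64 GPUs with TP=4, PP=4
dp_size 128 8 8 2    # 405B on 128 GPUs with TP=8, PP=8, CP=2
```

A non-zero return means the chosen degrees cannot tile the available GPUs, which is worth catching before a job is queued.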

Step 2: Configure training hyperparameters

    #!/bin/bash
    # train_llama_70b.sh

    GPUS_PER_NODE=8
    NNODES=8    # 64 GPUs total
    TP=4        # Tensor parallel
    PP=4        # Pipeline parallel
    CP=1        # Context parallel

    # LLaMA 70B configuration
    MODEL_SIZE=70    # Billion parameters
    HIDDEN_SIZE=8192
    NUM_LAYERS=80
    NUM_HEADS=64
    SEQ_LENGTH=4096

    # Training hyperparameters
    MICRO_BATCH=1
    GLOBAL_BATCH=1024
    LR=3e-4

    torchrun \
      --nproc_per_node=$GPUS_PER_NODE \
      --nnodes=$NNODES \
      pretrain_gpt.py \
      --tensor-model-parallel-size $TP \
      --pipeline-model-parallel-size $PP \
      --context-parallel-size $CP \
      --sequence-parallel \
      --num-layers $NUM_LAYERS \
      --hidden-size $HIDDEN_SIZE \
      --num-attention-heads $NUM_HEADS \
      --seq-length $SEQ_LENGTH \
      --max-position-embeddings $SEQ_LENGTH \
      --micro-batch-size $MICRO_BATCH \
      --global-batch-size $GLOBAL_BATCH \
      --lr $LR \
      --train-iters 100000 \
      --lr-decay-style cosine \
      --lr-warmup-iters 2000 \
      --weight-decay 0.1 \
      --clip-grad 1.0 \
      --bf16 \
      --use-mcore-models \
      --transformer-impl transformer_engine \
      --data-path /path/to/data \
      --vocab-file /path/to/vocab.json \
      --merge-file /path/to/merges.txt
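The batch settings in this configuration imply a data-parallel degree and a number of gradient-accumulation steps. The following sketch derives both from the values above, assuming the global batch is split as micro-batch × DP × accumulation steps; the variable names `WORLD_SIZE`, `DP`, `GRAD_ACCUM` and `TOKENS_PER_ITER` are illustrative only.

```bash
# Values copied from the 70B configuration in this step.
GPUS_PER_NODE=8; NNODES=8
TP=4; PP=4; CP=1
MICRO_BATCH=1; GLOBAL_BATCH=1024; SEQ_LENGTH=4096

WORLD_SIZE=$((GPUS_PER_NODE * NNODES))
DP=$((WORLD_SIZE / (TP * PP * CP)))

# The global batch must divide evenly into micro-batches per data-parallel rank.
if [ $((GLOBAL_BATCH % (MICRO_BATCH * DP))) -ne 0 ]; then
  echo "warning: global batch not divisible by micro-batch x DP" >&2
fi

GRAD_ACCUM=$((GLOBAL_BATCH / (MICRO_BATCH * DP)))
TOKENS_PER_ITER=$((GLOBAL_BATCH * SEQ_LENGTH))

echo "DP=$DP grad_accum=$GRAD_ACCUM tokens/iter=$TOKENS_PER_ITER"
```

With these numbers each data-parallel rank runs 256 micro-batches per optimizer step, and one iteration consumes about 4.2M tokens.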

Step 3: Launch distributed training

Single node (8 GPUs; first set NNODES=1 and keep TP × PP × CP ≤ 8 in the script)

bash train_llama_70b.sh

Multi-node with SLURM

sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh
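For a multi-node run, each node's `torchrun` also needs rendezvous information, which the script above does not pass. The sketch below prints the extra flags for one node. It assumes static rendezvous through `--node_rank`, `--master_addr` and `--master_port`; `torchrun_node_args` and the hostname `node0` are hypothetical, and under SLURM the rank would typically come from `SLURM_NODEID`.

```bash
# Hypothetical helper: print the torchrun rendezvous flags for one node.
# Arguments: NNODES NODE_RANK MASTER_ADDR [MASTER_PORT]
torchrun_node_args() {
  nnodes=$1; node_rank=$2; master_addr=$3; master_port=${4:-29500}
  echo "--nnodes=$nnodes --node_rank=$node_rank --master_addr=$master_addr --master_port=$master_port"
}

# Inside a SLURM job these variables are set per task; the fallbacks let the
# snippet run outside a job. "node0" is a placeholder hostname.
torchrun_node_args "${SLURM_NNODES:-8}" "${SLURM_NODEID:-0}" "${MASTER_ADDR:-node0}"
```

The printed flags would be added to the `torchrun` line alongside `--nproc_per_node`, with every node using the same master address and port.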

Step 4: Monitor performance metrics

Key metrics to track:

- Model FLOP Utilization (MFU): Target >40% on H100
- Throughput: Tokens/sec/GPU
- Memory usage: <80GB per GPU for 70B model
- Loss: Should decrease steadily
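MFU can be estimated from measured throughput. The sketch below uses the common approximation of 6 FLOPs per parameter per token for training (attention FLOPs ignored) and assumes a dense BF16 peak of 989 TFLOP/s for an H100 SXM; both are assumptions, and `mfu` is a hypothetical helper.

```bash
# Hypothetical helper: estimate MFU in percent.
# Arguments: PARAMS TOKENS_PER_SEC_PER_GPU PEAK_FLOPS_PER_GPU
# Assumes ~6 FLOPs per parameter per token (forward + backward).
mfu() {
  awk -v n="$1" -v tps="$2" -v peak="$3" \
    'BEGIN { printf "%.1f\n", 100 * 6 * n * tps / peak }'
}

# 70B parameters at 942 tokens/sec/GPU against an assumed 989 TFLOP/s peak.
mfu 70e9 942 989e12
```

Under these assumptions a 70B model needs roughly 940 tokens/sec/GPU to reach the 40% target.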

Workflow 2: Configure Mixture of Experts (MoE) training

For sparse MoE models like Mixtral.

MoE Training:
- [ ] Step 1: Configure expert parallelism
- [ ] Step 2: Set MoE hyperparameters
- [ ] Step 3: Launch training with EP

Step 1: Configure expert parallelism

Mixtral 8x7B example

    TENSOR_PARALLEL=2
    PIPELINE_PARALLEL=1
    EXPERT_PARALLEL=4    # Split 8 experts across 4 GPUs
    DATA_PARALLEL=4

    TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * EXPERT_PARALLEL * DATA_PARALLEL))
    # = 2 * 1 * 4 * 4 = 32 GPUs

Step 2: Set MoE hyperparameters

    torchrun \
      --nproc_per_node=8 \
      pretrain_gpt.py \
      --tensor-model-parallel-size 2 \
      --pipeline-model-parallel-size 1 \
      --expert-model-parallel-size 4 \
      --num-experts 8 \
      --moe-router-topk 2 \
      --moe-router-load-balancing-type aux_loss \
      --moe-aux-loss-coeff 0.01 \
      --hidden-size 4096 \
      --num-layers 32 \
      --num-attention-heads 32 \
      --seq-length 4096 \
      --max-position-embeddings 4096 \
      --bf16 \
      --use-mcore-models \
      --transformer-impl transformer_engine \
      --data-path /path/to/data \
      --vocab-file /path/to/vocab.json \
      --merge-file /path/to/merges.txt

Step 3: Launch training with EP

Expert parallelism distributes different experts across GPUs, reducing memory while maintaining capacity.

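Taking the nominal 7B parameters per expert:

- Without EP: 8 experts × 7B = 56B expert parameters per GPU
- With EP=4: 2 experts × 7B = 14B expert parameters per GPU
- Savings: 75% fewer expert weights per GPU

These are parameter counts, not bytes. The memory in GB depends on precision and optimizer state.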
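The expert-parallel arithmetic above generalizes to other expert counts. A minimal sketch, assuming experts are split evenly across the EP group; `experts_per_gpu` and `expert_savings_pct` are hypothetical helpers.

```bash
# Hypothetical helper: number of experts each GPU holds for a given EP degree.
experts_per_gpu() {
  num_experts=$1; ep=$2
  if [ $((num_experts % ep)) -ne 0 ]; then
    echo "error: $num_experts experts not divisible by EP=$ep" >&2
    return 1
  fi
  echo $((num_experts / ep))
}

# Hypothetical helper: percent reduction in expert weights per GPU vs EP=1.
expert_savings_pct() {
  awk -v ep="$1" 'BEGIN { printf "%.1f\n", 100 * (1 - 1 / ep) }'
}

experts_per_gpu 8 4      # experts held per GPU with 8 experts and EP=4
expert_savings_pct 4     # reduction in expert weights per GPU
```

The divisibility check matters because an EP degree that does not divide the expert count cannot split the experts evenly.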

Workflow 3: Optimize for maximum throughput

Target: up to 47% MFU on H100.

Performance Optimization:
- [ ] Step 1: Enable optimizations
- [ ] Step 2: Use FP8 precision (Hopper/Ada/Blackwell)
- [ ] Step 3: Optimize micro-batch size
- [ ] Step 4: Tune parallelism degrees

Step 1: Enable optimizations

    --use-mcore-models                       # Use Megatron Core models
    --transformer-impl transformer_engine    # Use Transformer Engine
    --sequence-parallel                      # Reduce activation memory (use with TP)

Step 2: Use FP8 precision (Hopper/Ada/Blackwell)

    --fp8-format hybrid    # FP8 mixed precision training (some older releases used --fp8-hybrid)
    # Transformer Engine handles FP8 automatically

Result: 1.5-2x speedup on H100 vs BF16.

Step 3: Optimize micro-batch size

Find largest micro-batch that fits in memory:

    # Start with 1, increase until OOM
    for MBS in 1 2 4 8; do
      echo "Testing micro-batch-size=$MBS"
      torchrun ... --micro-batch-size $MBS
    done

Typical values:

- 7B model: 4-8
- 70B model: 1-2
- 405B model: 1

Step 4: Tune parallelism degrees

Rules of thumb:

- Tensor Parallel: Use ≤8 (limited by NVLink within node)
- Pipeline Parallel: Use for >70B models
- Context Parallel: Use for sequences >8K tokens
- Data Parallel: Fill remaining GPUs

Example 405B on 128 H100s:

- TP=8 (1 node)
- PP=8 (across nodes)
- CP=2 (long sequences)
- DP=1
- Total = 8 × 8 × 2 × 1 = 128 GPUs
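To explore alternatives to a layout like this one, the candidate combinations can be enumerated. The sketch below lists power-of-two TP and PP values that tile a given GPU count, keeping TP within one node as the rule of thumb suggests. `list_layouts` is a hypothetical helper, and it does not rank the candidates by performance.

```bash
# Hypothetical helper: list TP/PP/DP layouts that exactly tile WORLD GPUs.
# Arguments: WORLD GPUS_PER_NODE CP
list_layouts() {
  world=$1; gpus_per_node=$2; cp=$3
  for tp in 1 2 4 8; do
    # Keep tensor parallelism inside a single node.
    [ "$tp" -le "$gpus_per_node" ] || continue
    for pp in 1 2 4 8 16; do
      mp=$((tp * pp * cp))
      [ $((world % mp)) -eq 0 ] || continue
      echo "TP=$tp PP=$pp CP=$cp DP=$((world / mp))"
    done
  done
}

# All candidates for 128 GPUs, 8 GPUs per node, CP=2.
list_layouts 128 8 2
```

The 405B example (TP=8, PP=8, CP=2, DP=1) appears among the candidates. The others trade model parallelism for more data parallelism and would need to be checked against memory limits.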

When to use vs alternatives

Use Megatron-Core when:

- Training models >10B parameters
- Need maximum efficiency (target >40% MFU)
- Using NVIDIA GPUs (A100, H100)
- Production training at scale
- Want fine-grained parallelism control

Use alternatives instead:

- PyTorch FSDP: Models <70B, simpler API, PyTorch native
- DeepSpeed: Easier setup, good for <100B models
- HuggingFace Accelerate: Prototyping, simpler workflows
- LitGPT: Educational, single-file implementations

Common issues

Issue: Low GPU utilization (<30% MFU)

Causes:

- Micro-batch too small
- Too much parallelism overhead
- Not using Flash Attention

Fixes:

Increase micro-batch

--micro-batch-size 4 # Was 1

Enable optimizations

--use-flash-attn --sequence-parallel

Reduce TP if >8

--tensor-model-parallel-size 4 # Was 16

Issue: Out of memory

Reduce memory with:

    --tensor-model-parallel-size 2    # Split model across GPUs
    --recompute-granularity full      # Activation recomputation (gradient checkpointing)
    --recompute-method block          # Recompute a fixed number of layers per pipeline stage
    --recompute-num-layers 1          # Layers recomputed per stage; raise to save more memory
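A rough estimate of weight and optimizer memory shows why splitting the model helps. The sketch below assumes about 16 bytes per parameter for mixed-precision Adam (BF16 weights and gradients plus FP32 master weights and two Adam moments), sharded evenly over TP × PP. It ignores activations and any optimizer sharding across data-parallel ranks, so the result is a planning figure, not a measurement. `state_gb_per_gpu` is a hypothetical helper.

```bash
# Hypothetical helper: weights + gradients + optimizer state per GPU, in GB.
# Arguments: PARAMS TP PP [BYTES_PER_PARAM, default 16 (assumption)]
state_gb_per_gpu() {
  awk -v n="$1" -v tp="$2" -v pp="$3" -v bpp="${4:-16}" \
    'BEGIN { printf "%.1f\n", n * bpp / (tp * pp) / 1e9 }'
}

state_gb_per_gpu 70e9 4 4    # 70B with TP=4, PP=4
state_gb_per_gpu 70e9 2 1    # 70B with TP=2 only
```

Under these assumptions a 70B model with TP=4 and PP=4 already needs about 70GB per GPU before activations. The recompute flags reduce the activation part of the footprint, not the weight and optimizer part.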

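Or offload the optimizer to CPU (flag names vary between Megatron-LM releases and forks, so confirm these against pretrain_gpt.py --help):

    --cpu-optimizer              # Offload optimizer to CPU
    --cpu-optimizer-type ADAM    # CPU Adam variant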

Issue: Training slower than expected

Check:

- Network bottleneck: Ensure InfiniBand/NVLink enabled
- Pipeline bubbles: Use interleaved pipeline schedule
      --num-layers-per-virtual-pipeline-stage 2
- Data loading: Use fast data loader
      --dataloader-type cyclic
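The cost of pipeline bubbles can be estimated before changing the schedule. The sketch below uses the bubble fraction from the 2021 Megatron-LM paper listed under Resources: (PP - 1) / m for m micro-batches per iteration, divided by the number of virtual stages v when the interleaved schedule is used. `bubble_pct` is a hypothetical helper, and the result is relative to ideal compute time.

```bash
# Hypothetical helper: pipeline bubble as a percentage of ideal compute time.
# Arguments: PP NUM_MICROBATCHES [VIRTUAL_STAGES, default 1]
bubble_pct() {
  awk -v pp="$1" -v m="$2" -v v="${3:-1}" \
    'BEGIN { printf "%.2f\n", 100 * (pp - 1) / (v * m) }'
}

bubble_pct 4 256      # PP=4 with 256 micro-batches per iteration
bubble_pct 8 64       # PP=8 with 64 micro-batches
bubble_pct 8 64 2     # same, interleaved with 2 virtual stages per GPU
```

When the number of micro-batches is much larger than PP the bubble is already small, so interleaving helps most for deep pipelines with few micro-batches.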

Issue: Diverging loss

Stabilize training:

    --lr-warmup-iters 2000     # Longer warmup
    --clip-grad 1.0            # Gradient clipping
    --init-method-std 0.006    # Smaller init
    --attention-dropout 0.0    # No dropout in attention
    --hidden-dropout 0.0       # No dropout in FFN

Advanced topics

Parallelism strategies: See references/parallelism-guide.md for detailed comparison of TP/PP/DP/CP/EP with performance analysis and when to use each.

Performance benchmarks: See references/benchmarks.md for MFU numbers across different model sizes and GPU configurations.

Production configurations: See references/production-examples.md for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.

Training recipes: See references/training-recipes.md for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.

Hardware requirements

- GPU: NVIDIA Ampere+ (A100, H100, B200)
  - Turing works but slower
  - FP8 requires Hopper/Ada/Blackwell
- Network: InfiniBand or 400Gb+ Ethernet for multi-node
- Memory per GPU:
  - 7B model: 40GB+
  - 70B model: 80GB (with TP=4)
  - 405B model: 80GB (with TP=8, PP=8)
- Storage: Fast NVMe for checkpoints (1TB+ for 70B+ models)

Resources

- Docs: https://docs.nvidia.com/megatron-core/
- GitHub: https://github.com/NVIDIA/Megatron-LM
- Papers:
  - "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (2019)
  - "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (2021)
- NeMo Framework: https://docs.nvidia.com/nemo-framework/ (built on Megatron-Core)
