LLM Training

Frameworks and techniques for training and finetuning large language models.

Framework Comparison

| Framework | Best For | Multi-GPU | Memory Efficient |
|---|---|---|---|
| Accelerate | Simple distributed training | Yes | Basic |
| DeepSpeed | Large models, ZeRO | Yes | Excellent |
| PyTorch Lightning | Clean training loops | Yes | Good |
| Ray Train | Scalable, multi-node | Yes | Good |
| TRL | RLHF, reward modeling | Yes | Good |
| Unsloth | Fast LoRA finetuning | Limited | Excellent |

Accelerate (HuggingFace)

Minimal wrapper for distributed training. Run `accelerate config` for interactive setup.

Key concept: Wrap the model, optimizer, and dataloader with `accelerator.prepare()`, then call `accelerator.backward(loss)` in place of `loss.backward()`.

DeepSpeed (Large Models)

Microsoft's optimization library for training massive models.

ZeRO Stages (each stage builds on the previous one):

- Stage 1: Optimizer states partitioned across GPUs
- Stage 2: Gradients also partitioned
- Stage 3: Parameters also partitioned (for the largest models, 100B+)
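To make the effect of each stage concrete, here is a rough back-of-the-envelope estimate of per-GPU memory for model states. It assumes mixed-precision Adam at 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 weights plus the two Adam moments) and ignores activations, buffers, and fragmentation, so real usage will be higher. The function name `zero_model_state_gb` and the 7.5B-parameter, 64-GPU scenario are illustrative choices.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB (1 GB = 1e9 bytes)."""
    param_bytes = 2 * num_params    # fp16 weights
    grad_bytes = 2 * num_params     # fp16 gradients
    optim_bytes = 12 * num_params   # fp32 weights + Adam momentum + variance
    if stage >= 1:
        optim_bytes /= num_gpus     # Stage 1: shard optimizer states
    if stage >= 2:
        grad_bytes /= num_gpus      # Stage 2: also shard gradients
    if stage >= 3:
        param_bytes /= num_gpus     # Stage 3: also shard parameters
    return (param_bytes + grad_bytes + optim_bytes) / 1e9


# Illustrative scenario: 7.5B parameters on 64 GPUs.
for stage in range(4):
    gb = zero_model_state_gb(7.5e9, 64, stage)
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions the estimate drops from 120 GB per GPU without ZeRO to under 2 GB at Stage 3, which is why Stage 3 is the usual choice when the weights alone do not fit on one device.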
  
Key concept: Configure via JSON. Higher stages save more memory but add communication overhead.
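As a sketch of what such a JSON config can look like, the snippet below builds a ZeRO Stage 3 configuration as a Python dict and serializes it. The batch size, accumulation steps, and the choice to offload optimizer states to CPU are assumptions for illustration and should be tuned per model and hardware.

```python
import json

# Illustrative values only; tune per model and hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                    # partition optimizer states, gradients, parameters
        "overlap_comm": True,          # overlap communication with computation
        "contiguous_gradients": True,  # reduce memory fragmentation
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

The resulting text can be saved to a file such as `ds_config.json` (a hypothetical name), or the dict can be passed directly, since `deepspeed.initialize` accepts either a path or a dict through its `config` argument. Lowering `"stage"` to 2 or 1 reduces communication at the cost of more memory per GPU.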
  
TRL (RLHF/DPO)

HuggingFace library for reinforcement learning from human feedback.

Training types:

- SFT (Supervised Finetuning): Standard instruction tuning
- DPO (Direct Preference Optimization): Simpler than PPO-based RLHF, trains directly on preference pairs
- PPO: Classic RLHF with a reward model

Key concept: DPO is often preferred over PPO because it is simpler and needs no reward model, just chosen/rejected response pairs.
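To show why DPO needs only chosen/rejected pairs, here is a sketch of the DPO loss for a single pair in plain Python. This is not TRL code: TRL's trainer computes the batched equivalent internally. The inputs are summed log-probabilities of each response under the policy being trained and under a frozen reference model, which DPO still requires even though it drops the reward model. The function name `dpo_loss` and the default `beta=0.1` are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed response log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-11.0, -11.0, -11.0, -11.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference. The preference pairs themselves supply the training signal that a reward model would otherwise provide in PPO.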
  
Unsloth (Fast LoRA)

Optimized LoRA finetuning - 2x faster, 60% less memory.

Key concept: Drop-in replacement for standard LoRA with automatic optimizations. Best for 7B-13B models.

Memory Optimization Techniques

| Technique | Memory Savings | Trade-off |
|---|---|---|
| Gradient checkpointing | ~30-50% | Slower training |
| Mixed precision (fp16/bf16) | ~50% | Minor precision loss |
| 4-bit quantization (QLoRA) | ~75% | Some quality loss |
| Flash Attention | ~20-40% | Requires compatible GPU |
| Gradient accumulation | Larger effective batch | No memory cost |

Decision Guide

| Scenario | Recommendation |
|---|---|
| Simple finetuning | Accelerate + PEFT |
| 7B-13B models | Unsloth (fastest) |
| 70B+ models | DeepSpeed ZeRO-3 |
| RLHF/DPO alignment | TRL |
| Multi-node cluster | Ray Train |
| Clean code structure | PyTorch Lightning |

Resources

- Accelerate: https://huggingface.co/docs/accelerate
- DeepSpeed: https://www.deepspeed.ai/
- TRL: https://huggingface.co/docs/trl
- Unsloth: https://github.com/unslothai/unsloth
