Install
npx skills add https://github.com/eyadsibai/ltk --skill llm-training
- LLM Training
- Frameworks and techniques for training and finetuning large language models.
- Framework Comparison
| Framework | Best For | Multi-GPU | Memory Efficient |
|---|---|---|---|
| Accelerate | Simple distributed | Yes | Basic |
| DeepSpeed | Large models, ZeRO | Yes | Excellent |
| PyTorch Lightning | Clean training loops | Yes | Good |
| Ray Train | Scalable, multi-node | Yes | Good |
| TRL | RLHF, reward modeling | Yes | Good |
| Unsloth | Fast LoRA finetuning | Limited | Excellent |
- Accelerate (HuggingFace)
- Minimal wrapper for distributed training. Run `accelerate config` for interactive setup.
- Key concept: Wrap the model, optimizer, and dataloader with `accelerator.prepare()`, then call `accelerator.backward(loss)` in place of `loss.backward()`.
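The pattern above can be sketched as a small training function. This is a minimal sketch, not Accelerate's own example: the function name `train` and its arguments are hypothetical, `accelerator` is assumed to be an `accelerate.Accelerator` instance, and `model`, `optimizer`, `dataloader`, and `loss_fn` are assumed to be ordinary PyTorch objects created by the caller.

```python
def train(accelerator, model, optimizer, dataloader, loss_fn, epochs=1):
    """Train with the prepare()/backward() pattern and return the last epoch's mean loss.

    Assumption: `accelerator` behaves like accelerate.Accelerator, and the
    dataloader yields (inputs, targets) pairs.
    """
    # prepare() returns the wrapped objects in the order they were passed in.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    model.train()

    mean_loss = 0.0
    for _ in range(epochs):
        running, batches = 0.0, 0
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            # Replaces loss.backward(); the accelerator handles scaling and syncing.
            accelerator.backward(loss)
            optimizer.step()
            running += loss.item()
            batches += 1
        mean_loss = running / max(batches, 1)
    return mean_loss
```

With `accelerate` and `torch` installed, a call would look like `train(Accelerator(), model, optimizer, dataloader, loss_fn)`, and the script would typically be started with `accelerate launch` so that each process is assigned its own device.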
- DeepSpeed (Large Models)
- Microsoft's optimization library for training massive models.
- ZeRO Stages:
- Stage 1: Optimizer states partitioned across GPUs
- Stage 2: Optimizer states and gradients partitioned
- Stage 3: Optimizer states, gradients, and parameters partitioned (for the largest models, 100B+)
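The effect of each stage on per-GPU memory can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam with 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer states, and it ignores activations and buffers. The function name is hypothetical and the result is a rough estimate, not a DeepSpeed measurement.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Estimate per-GPU memory (in GB) for model states under a ZeRO stage.

    Assumed bytes per parameter (mixed-precision Adam):
      2  fp16 weights
      2  fp16 gradients
      12 optimizer states (fp32 weights, momentum, variance)
    Stage 0 means no partitioning.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus    # optimizer states are sharded
    if stage >= 2:
        grads /= num_gpus    # gradients are sharded as well
    if stage >= 3:
        weights /= num_gpus  # parameters are sharded as well
    return num_params * (weights + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_model_state_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, a 7.5B-parameter model on 64 GPUs needs about 120 GB per GPU without partitioning and about 31.4, 16.6, and 1.9 GB at stages 1, 2, and 3.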
- Key concept: Configure via a JSON file. Higher stages save more memory but add communication overhead.
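As an illustration of the JSON configuration, the sketch below builds a small config dictionary and serializes it. The helper name `make_zero_config` and the default values are assumptions chosen for the example; the keys shown (`train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`, `bf16`/`fp16`, `zero_optimization`) are standard DeepSpeed config fields, but a real setup will likely need more of them.

```python
import json


def make_zero_config(stage=2, micro_batch_size=4, grad_accum_steps=8, use_bf16=True):
    """Build a minimal DeepSpeed-style config dict for a given ZeRO stage.

    All default values here are illustrative assumptions, not recommendations.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    precision_key = "bf16" if use_bf16 else "fp16"
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        precision_key: {"enabled": True},
        "zero_optimization": {
            "stage": stage,
            "overlap_comm": True,          # overlap gradient reduction with backward pass
            "contiguous_gradients": True,  # reduce memory fragmentation
        },
    }


# Serialize a stage-3 config; this string could be written to a config file.
config_json = json.dumps(make_zero_config(stage=3), indent=2)
print(config_json)
```

The resulting JSON could be saved to a file (for example a hypothetical `ds_config.json`) and passed to DeepSpeed or to a trainer that accepts a DeepSpeed config.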
- TRL (RLHF/DPO)
- HuggingFace library for reinforcement learning from human feedback.
- Training types:
- SFT (Supervised Finetuning): Standard instruction tuning
- DPO (Direct Preference Optimization): Simpler than RLHF, uses preference pairs
- PPO: Classic RLHF with reward model
- Key concept: DPO is often preferred over PPO because it is simpler and needs no reward model, only chosen/rejected response pairs.
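To make the chosen/rejected idea concrete, the sketch below shows a preference pair and the per-example DPO objective in plain Python. It is not TRL's implementation: the `PreferencePair` class and `dpo_loss` function are hypothetical names, the log-probabilities are assumed to be summed token log-probs computed elsewhere by the policy and a frozen reference model, and `beta=0.1` is an illustrative value.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One training example: a prompt with a preferred and a dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_gain - rejected_gain))).

    Each *_logp argument is the log-probability of a full response under
    the policy or the reference model.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


pair = PreferencePair(
    prompt="Explain gradient checkpointing.",
    chosen="It trades compute for memory by recomputing activations.",
    rejected="It saves checkpoints of gradients to disk.",
)
# When the policy matches the reference, the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

The loss falls below log(2) once the policy raises the chosen response's probability relative to the reference more than it raises the rejected one, which is why no separate reward model is needed.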
- Unsloth (Fast LoRA)
- Optimized LoRA finetuning, reported to be about 2x faster with 60% less memory than a standard LoRA setup.
- Key concept: Drop-in replacement for standard LoRA with automatic optimizations. Best for 7B-13B models.
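For context on why LoRA finetuning is cheap in the first place, the sketch below counts trainable parameters for a single adapted weight matrix. This illustrates LoRA in general, not Unsloth's internals; the function names, the 4096x4096 layer shape, and rank 16 are assumptions chosen for the example.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA trains two low-rank factors, A (rank x d_in) and B (d_out x rank),
    while the original matrix stays frozen.
    """
    return rank * (d_in + d_out)


def lora_trainable_fraction(d_in, d_out, rank):
    """Ratio of LoRA parameters to the full matrix's parameter count."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)


if __name__ == "__main__":
    added = lora_param_count(4096, 4096, 16)
    frac = lora_trainable_fraction(4096, 4096, 16)
    print(f"{added} trainable parameters, {frac:.2%} of the full matrix")
```

For this assumed layer, rank 16 adds 131,072 trainable parameters, under 1% of the 16.8M in the full matrix, so gradients and optimizer states are needed only for that small fraction.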
Memory Optimization Techniques
| Technique | Memory Savings | Trade-off |
|---|---|---|
| Gradient checkpointing | ~30-50% | Slower training |
| Mixed precision (fp16/bf16) | ~50% | Minor precision loss |
| 4-bit quantization (QLoRA) | ~75% | Some quality loss |
| Flash Attention | ~20-40% | Requires compatible GPU |
| Gradient accumulation | Effective batch ↑ | No memory cost |
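Two of the rows above reduce to simple arithmetic, sketched below. The weight-memory estimate counts model weights only (no activations, gradients, optimizer states, or quantization constants), and the helper names are hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Approximate memory (GB) to hold the model weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def effective_batch_size(micro_batch_size, grad_accum_steps, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return micro_batch_size * grad_accum_steps * num_gpus


if __name__ == "__main__":
    for dtype in ("fp32", "bf16", "int4"):
        print(f"7B weights in {dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
    print("effective batch:", effective_batch_size(4, 8, num_gpus=2))
```

For a 7B-parameter model this gives 28 GB in fp32, 14 GB in bf16, and 3.5 GB in 4-bit, which is in line with the rough 50% and 75% figures in the table when weights dominate memory. Gradient accumulation changes only the effective batch size: 4 samples per step, 8 accumulation steps, and 2 GPUs behave like a batch of 64.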
Decision Guide
| Scenario | Recommendation |
|---|---|
| Simple finetuning | Accelerate + PEFT |
| 7B-13B models | Unsloth (fastest) |
| 70B+ models | DeepSpeed ZeRO-3 |
| RLHF/DPO alignment | TRL |
| Multi-node cluster | Ray Train |
| Clean code structure | PyTorch Lightning |
Resources
- Accelerate: https://huggingface.co/docs/accelerate
- DeepSpeed: https://www.deepspeed.ai/
- TRL: https://huggingface.co/docs/trl
- Unsloth: https://github.com/unslothai/unsloth