SimPO - Simple Preference Optimization Quick start

SimPO is a reference-free preference optimization method that outperforms DPO without needing a reference model.

Installation:

Create environment

conda create -n simpo python=3.10 && conda activate simpo

Install PyTorch 2.2.2

Visit: https://pytorch.org/get-started/locally/

Install alignment-handbook

git clone https://github.com/huggingface/alignment-handbook.git cd alignment-handbook python -m pip install .

Install Flash Attention 2

python -m pip install flash-attn --no-build-isolation

Training (Mistral 7B):

ACCELERATE_LOG_LEVEL=info accelerate launch \ --config_file accelerate_configs/deepspeed_zero3.yaml \ scripts/run_simpo.py \ training_configs/mistral-7b-base-simpo.yaml

Common workflows Workflow 1: Train from base model (Mistral 7B)

Config (mistral-7b-base-simpo.yaml):

Model

model_name_or_path: mistralai/Mistral-7B-v0.1 torch_dtype: bfloat16

Dataset

dataset_mixer: HuggingFaceH4/ultrafeedback_binarized: 1.0 dataset_splits: - train_prefs - test_prefs

SimPO hyperparameters

beta: 2.0 # Reward scaling (2.0-10.0) gamma_beta_ratio: 0.5 # Target margin (0-1) loss_type: sigmoid # sigmoid or hinge sft_weight: 0.0 # Optional SFT regularization

Training

learning_rate: 5e-7 # Critical: 3e-7 to 1e-6 num_train_epochs: 1 per_device_train_batch_size: 1 gradient_accumulation_steps: 8

Output

output_dir: ./outputs/mistral-7b-simpo

Launch training:

accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \ scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml

Workflow 2: Fine-tune instruct model (Llama 3 8B)

Config (llama3-8b-instruct-simpo.yaml):

model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

dataset_mixer: argilla/ultrafeedback-binarized-preferences-cleaned: 1.0

beta: 2.5 gamma_beta_ratio: 0.5 learning_rate: 5e-7 sft_weight: 0.1 # Add SFT loss to preserve capabilities

num_train_epochs: 1 per_device_train_batch_size: 2 gradient_accumulation_steps: 4 output_dir: ./outputs/llama3-8b-simpo

Launch:

accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \ scripts/run_simpo.py training_configs/llama3-8b-instruct-simpo.yaml

Workflow 3: Reasoning-intensive tasks (lower LR)

For math/code tasks:

model_name_or_path: deepseek-ai/deepseek-math-7b-base

dataset_mixer: argilla/distilabel-math-preference-dpo: 1.0

beta: 5.0 # Higher for stronger signal gamma_beta_ratio: 0.7 # Larger margin learning_rate: 3e-7 # Lower LR for reasoning sft_weight: 0.0

num_train_epochs: 1 per_device_train_batch_size: 1 gradient_accumulation_steps: 16

When to use vs alternatives

Use SimPO when:

Want simpler training than DPO (no reference model) Have preference data (chosen/rejected pairs) Need better performance than DPO Limited compute resources Single-node training sufficient

Algorithm selection:

SimPO: Simplest, best performance, no reference model DPO: Need reference model baseline, more conservative PPO: Maximum control, need reward model, complex setup GRPO: Memory-efficient RL, no critic

Use alternatives instead:

OpenRLHF: Multi-node distributed training, PPO/GRPO TRL: Need multiple methods in one framework DPO: Established baseline comparison Common issues

Issue: Loss divergence

Reduce learning rate:

learning_rate: 3e-7 # Reduce from 5e-7

Reduce beta:

beta: 1.0 # Reduce from 2.0

Issue: Model forgets capabilities

Add SFT regularization:

sft_weight: 0.1 # Add SFT loss component

Issue: Poor preference separation

Increase beta and margin:

beta: 5.0 # Increase from 2.0 gamma_beta_ratio: 0.8 # Increase from 0.5

Issue: OOM during training

Reduce batch size:

per_device_train_batch_size: 1 gradient_accumulation_steps: 16 # Maintain effective batch

Enable gradient checkpointing:

gradient_checkpointing: true

Advanced topics

Loss functions: See references/loss-functions.md for sigmoid vs hinge loss, mathematical formulations, and when to use each.

Hyperparameter tuning: See references/hyperparameters.md for beta, gamma, learning rate selection guide, and model-size-specific recommendations.

Dataset preparation: See references/datasets.md for preference data formats, quality filtering, and custom dataset creation.

Hardware requirements GPU: NVIDIA A100/H100 recommended VRAM: 7B model: 1× A100 40GB (DeepSpeed ZeRO-3) 8B model: 2× A100 40GB 70B model: 8× A100 80GB Single-node: DeepSpeed ZeRO-3 sufficient Mixed precision: BF16 recommended

Memory optimization:

DeepSpeed ZeRO-3 (default config) Gradient checkpointing Flash Attention 2 Resources Paper: https://arxiv.org/abs/2405.14734 (NeurIPS 2024) GitHub: https://github.com/princeton-nlp/SimPO Models: https://huggingface.co/princeton-nlp Alignment Handbook: https://github.com/huggingface/alignment-handbook

simpo-training

安装