# Model Merging: Combining Pre-trained Models

## When to Use This Skill
Use Model Merging when you need to:
- Combine capabilities from multiple fine-tuned models without retraining
- Create specialized models by blending domain-specific expertise (math + coding + chat)
- Improve performance beyond single models (often +5-10% on benchmarks)
- Reduce training costs: no GPUs needed, merges run on CPU
- Experiment rapidly: create new model variants in minutes, not days
- Preserve multiple skills: merge without catastrophic forgetting
**Success stories:** Marcoro14-7B-slerp topped the Open LLM Leaderboard (02/2024), and many leading models on HuggingFace are merges.

**Tools:** mergekit (Arcee AI), LazyMergekit, Model Soup
## Installation

```bash
# Install mergekit from source
git clone https://github.com/arcee-ai/mergekit.git
cd mergekit
pip install -e .

# Or via pip
pip install mergekit

# Optional: Transformers library for loading merged models
pip install transformers torch
```
## Quick Start

### Simple Linear Merge

```yaml
# config.yml - Merge two models with equal weights
merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.5
dtype: bfloat16
```
```bash
# Run the merge
mergekit-yaml config.yml ./merged-model --cuda
```
```python
# Use the merged model
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")
```
### SLERP Merge (Best for 2 Models)

```yaml
# config.yml - Spherical interpolation
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.1  # slerp requires a base model
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t: 0.5  # Interpolation factor (0 = first model, 1 = second model)
dtype: bfloat16
```
## Core Concepts

### 1. Merge Methods

#### Linear (Model Soup)

- Simple weighted average of parameters
- Fast; works well for similar models
- Can merge 2+ models (see the sketch below)

```
merged_weights = w1 * model1_weights + w2 * model2_weights + w3 * model3_weights
where w1 + w2 + w3 = 1
```
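As a concrete sketch of this formula applied to PyTorch state dicts (`merge_linear` is an illustrative helper, not a mergekit API):

```python
import torch

def merge_linear(state_dicts, weights):
    """Weighted average of parameter tensors across models (model soup)."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        # Same parameter name in every model; average the tensors
        merged[name] = sum(w * sd[name] for sd, w in zip(state_dicts, weights))
    return merged
```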
#### SLERP (Spherical Linear Interpolation)

- Interpolates along a sphere in weight space
- Preserves the magnitude of weight vectors
- Best for merging exactly 2 models
- Smoother than linear interpolation

```
# SLERP formula
merged = (sin((1 - t)·θ) / sin θ) · model1 + (sin(t·θ) / sin θ) · model2
where θ = arccos(dot(model1, model2)) and t ∈ [0, 1]
```
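A minimal PyTorch sketch of this formula on flattened weight vectors (`slerp` here is illustrative, not mergekit's implementation):

```python
import torch

def slerp(t, v1, v2, eps=1e-8):
    """Spherical linear interpolation between two flattened weight vectors."""
    # Angle between the normalized vectors
    cos_theta = torch.dot(v1, v2) / (v1.norm() * v2.norm() + eps)
    theta = torch.arccos(cos_theta.clamp(-1.0, 1.0))
    if theta < eps:
        # Nearly parallel vectors: fall back to linear interpolation
        return (1 - t) * v1 + t * v2
    return (torch.sin((1 - t) * theta) * v1 + torch.sin(t * theta) * v2) / torch.sin(theta)
```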
#### Task Arithmetic

- Extracts "task vectors" (fine-tuned weights minus base weights)
- Combines task vectors and adds them back to the base model
- Good for merging multiple specialized models

```
# Task vector
task_vector = finetuned_model - base_model

# Merge multiple task vectors
merged = base_model + α₁ · task_vector₁ + α₂ · task_vector₂
```
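The same idea as a sketch over PyTorch state dicts; `merge_task_vectors` is a hypothetical helper, not part of mergekit:

```python
import torch

def merge_task_vectors(base_sd, finetuned_sds, alphas):
    """Add scaled task vectors (finetuned - base) back onto the base weights."""
    merged = {}
    for name, base_w in base_sd.items():
        # Sum of per-model deltas, each scaled by its alpha
        delta = sum(
            alpha * (ft_sd[name] - base_w)
            for ft_sd, alpha in zip(finetuned_sds, alphas)
        )
        merged[name] = base_w + delta
    return merged
```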
#### TIES-Merging

- Task arithmetic plus sparsification
- Resolves sign conflicts between parameters
- Best for merging many task-specific models (see the sketch below)
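A simplified per-tensor sketch of the trim / elect-sign / disjoint-merge steps described in the TIES paper (`ties_merge` is illustrative and omits details of mergekit's implementation):

```python
import torch

def ties_merge(base, finetuned, density=0.5):
    """Simplified TIES merge for a single parameter tensor."""
    deltas = [ft - base for ft in finetuned]
    # 1. Trim: keep only the top-`density` fraction of entries by magnitude
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.numel()))
        threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))
    stacked = torch.stack(trimmed)
    # 2. Elect sign: majority sign by summed magnitude per entry
    sign = torch.sign(stacked.sum(dim=0))
    # 3. Disjoint merge: average only entries that agree with the elected sign
    agree = (torch.sign(stacked) == sign) & (stacked != 0)
    merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
    return base + merged_delta
```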
#### DARE (Drop And REscale)

- Randomly drops fine-tuned parameter deltas
- Rescales the remaining deltas to compensate
- Reduces redundancy while maintaining performance (sketch below)
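A one-tensor sketch of the drop-and-rescale step (`dare` is a hypothetical helper):

```python
import torch

def dare(delta, density=0.5):
    """Randomly keep a `density` fraction of a task vector and rescale."""
    # Bernoulli mask: each entry survives with probability `density`
    mask = torch.bernoulli(torch.full_like(delta, density))
    # Rescale survivors so the expected value of the delta is unchanged
    return delta * mask / density
```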
### 2. Configuration Structure

```yaml
# Basic structure
merge_method: <method>
models:
  - model: <model_name>
    parameters:
      weight: <float>
  - model: <model_name>
    parameters:
      weight: <float>
parameters:   # Method-specific parameters
dtype: <dtype>

# Optional
slices:       # Layer-wise merging
tokenizer:    # Tokenizer configuration
```
## Merge Methods Guide

### Linear Merge

**Best for:** simple model combinations, equal weighting

```yaml
merge_method: linear
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      weight: 0.4
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.3
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      weight: 0.3
dtype: bfloat16
```
### SLERP Merge

**Best for:** two models, smooth interpolation

```yaml
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.1  # slerp requires a base model
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t: 0.5  # 0.0 = first model, 1.0 = second model
dtype: bfloat16
```
Layer-specific SLERP:

```yaml
merge_method: slerp
base_model: model_a  # assumed base; slerp needs one
slices:
  - sources:
      - model: model_a
        layer_range: [0, 32]
      - model: model_b
        layer_range: [0, 32]
parameters:
  t:
    - filter: self_attn  # Attention layers
      value: 0.3
    - filter: mlp        # MLP layers
      value: 0.7
    - value: 0.5         # Default for all other layers
dtype: bfloat16
```
### Task Arithmetic

**Best for:** combining specialized skills

```yaml
merge_method: task_arithmetic
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1        # Math
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B  # Chat
    parameters:
      weight: 0.3
  - model: ajibawa-2023/Code-Mistral-7B       # Code
    parameters:
      weight: 0.2
dtype: bfloat16
```
### TIES-Merging

**Best for:** many models, resolving conflicts

```yaml
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5  # Keep top 50% of delta parameters
      weight: 1.0
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 1.0
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      density: 0.5
      weight: 1.0
parameters:
  normalize: true
dtype: bfloat16
```
### DARE Merge

**Best for:** reducing redundancy

```yaml
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5  # Keep 50% of deltas, drop the rest
      weight: 0.6
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 0.4
parameters:
  int8_mask: true  # Use int8 masks (saves memory)
dtype: bfloat16
```
## Advanced Patterns

### Layer-wise Merging

```yaml
# Different models for different layers
merge_method: passthrough
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 16]    # First half
  - sources:
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [16, 32]   # Second half
dtype: bfloat16
```
### MoE from Merged Models

```yaml
# Create a Mixture of Experts (consumed by the separate mergekit-moe entry point)
base_model: mistralai/Mistral-7B-v0.1
experts:
  - source_model: WizardLM/WizardMath-7B-V1.1
    positive_prompts:
      - "math"
      - "calculate"
  - source_model: teknium/OpenHermes-2.5-Mistral-7B
    positive_prompts:
      - "chat"
      - "conversation"
  - source_model: ajibawa-2023/Code-Mistral-7B
    positive_prompts:
      - "code"
      - "python"
dtype: bfloat16
```

Run MoE configs with `mergekit-moe config.yml ./merged-moe` rather than `mergekit-yaml`.
### Tokenizer Merging

```yaml
merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
  - model: custom/specialized-model
tokenizer:
  source: "union"  # Combine vocabularies from both models
  tokens:
    <|special_token|>:
      source: "custom/specialized-model"
```
## Best Practices

### 1. Model Compatibility

```python
# ✅ Good: same architecture
models = [
    "mistralai/Mistral-7B-v0.1",
    "teknium/OpenHermes-2.5-Mistral-7B",  # Both Mistral 7B
]

# ❌ Bad: different architectures
models = [
    "meta-llama/Llama-2-7b-hf",   # Llama
    "mistralai/Mistral-7B-v0.1",  # Mistral (incompatible!)
]
```
### 2. Weight Selection

```yaml
# ✅ Good: weights sum to 1.0
models:
  - model: model_a
    parameters:
      weight: 0.6
  - model: model_b
    parameters:
      weight: 0.4  # 0.6 + 0.4 = 1.0
```

```yaml
# ⚠️ Acceptable: weights that don't sum to 1 (task arithmetic)
models:
  - model: model_a
    parameters:
      weight: 0.8
  - model: model_b
    parameters:
      weight: 0.8  # May boost performance
```
### 3. Method Selection

Choose the merge method based on your use case:

- 2 models, smooth blend → `merge_method: slerp`
- 3+ models, simple average → `merge_method: linear`
- Multiple task-specific models → `merge_method: task_arithmetic` or `merge_method: ties`
- Reducing redundancy → `merge_method: dare_ties`
### 4. Density Tuning (TIES/DARE)

```yaml
# Start conservative (keep more parameters)
parameters:
  density: 0.8  # Keep 80%
```

If performance holds up, increase sparsity (e.g. `density: 0.5` to keep 50%); if it degrades, back off toward `density: 0.9`. A sweep sketch follows below.
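One way to run such a sweep is to generate a config per density value and invoke the `mergekit-yaml` CLI for each; the script below is a sketch, and the model choices and output paths are illustrative:

```python
import copy
import subprocess

import yaml

# Base TIES config; density is filled in per run
BASE_CONFIG = {
    "merge_method": "ties",
    "base_model": "mistralai/Mistral-7B-v0.1",
    "models": [
        {"model": "WizardLM/WizardMath-7B-V1.1", "parameters": {"weight": 1.0}},
        {"model": "teknium/OpenHermes-2.5-Mistral-7B", "parameters": {"weight": 1.0}},
    ],
    "parameters": {"normalize": True},
    "dtype": "bfloat16",
}

for density in (0.8, 0.6, 0.5):
    config = copy.deepcopy(BASE_CONFIG)
    for entry in config["models"]:
        entry["parameters"]["density"] = density
    path = f"ties-density-{density}.yml"
    with open(path, "w") as f:
        yaml.safe_dump(config, f)
    # Each run writes a separate merged model for benchmarking
    subprocess.run(["mergekit-yaml", path, f"./merged-density-{density}"], check=True)
```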
### 5. Layer-specific Merging

```yaml
# Preserve the base model's first and last layers
merge_method: passthrough
slices:
  - sources:
      - model: base_model
        layer_range: [0, 2]    # Keep first layers
  - sources:
      - model: merged_middle   # A previously merged model for the middle
        layer_range: [2, 30]
  - sources:
      - model: base_model
        layer_range: [30, 32]  # Keep last layers
```
## Evaluation & Testing

### Benchmark Merged Models

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load merged model
model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")

# Test on various tasks
test_prompts = {
    "math": "Calculate: 25 * 17 =",
    "code": "Write a Python function to reverse a string:",
    "chat": "What is the capital of France?",
}

for task, prompt in test_prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    print(f"{task}: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
```
### Common Benchmarks

- Open LLM Leaderboard: general capabilities
- MT-Bench: multi-turn conversation
- MMLU: multitask accuracy
- HumanEval: code generation
- GSM8K: math reasoning

## Production Deployment

### Save and Upload

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load merged model
model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")

# Upload to HuggingFace Hub
model.push_to_hub("username/my-merged-model")
tokenizer.push_to_hub("username/my-merged-model")
```
### Quantize Merged Model

```bash
# Convert to GGUF with llama.cpp (script name and flags vary by version)
python convert.py ./merged-model --outtype f16 --outfile merged-model.gguf

# Quantize with GPTQ (illustrative command; see the AutoGPTQ sketch below)
python quantize_gptq.py ./merged-model --bits 4 --group_size 128
```
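For a concrete GPTQ path, the sketch below assumes the AutoGPTQ library is installed (`pip install auto-gptq`); the calibration example is a placeholder and should be replaced with data representative of your use case:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

tokenizer = AutoTokenizer.from_pretrained("./merged-model")
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained("./merged-model", quantize_config)

# GPTQ needs a small calibration set of tokenized examples
examples = [
    tokenizer(
        "Model merging combines fine-tuned models without retraining.",
        return_tensors="pt",
    )
]
model.quantize(examples)
model.save_quantized("./merged-model-gptq")
```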
## Common Pitfalls

### ❌ Pitfall 1: Merging Incompatible Models

```yaml
# Wrong: different architectures
models:
  - model: meta-llama/Llama-2-7b   # Llama architecture
  - model: mistralai/Mistral-7B    # Mistral architecture
```

Fix: only merge models that share the same architecture.

### ❌ Pitfall 2: Over-weighting One Model

```yaml
# Suboptimal: one model dominates
models:
  - model: model_a
    parameters:
      weight: 0.95  # Too high
  - model: model_b
    parameters:
      weight: 0.05  # Too low
```

Fix: use more balanced weights (roughly the 0.3-0.7 range).

### ❌ Pitfall 3: Not Evaluating

```bash
# Wrong: merge and deploy without testing
mergekit-yaml config.yml ./merged-model
# ...deploy immediately (risky!)
```

Fix: always benchmark before deploying.
## Resources

- mergekit GitHub: https://github.com/arcee-ai/mergekit
- HuggingFace tutorial: https://huggingface.co/blog/mlabonne/merge-models
- LazyMergekit: automated merging notebook
- TIES paper: https://arxiv.org/abs/2306.01708
- DARE paper: https://arxiv.org/abs/2311.03099

## See Also

- references/methods.md - Deep dive into merge algorithms
- references/examples.md - Real-world merge configurations
- references/evaluation.md - Benchmarking and testing strategies