model-merging

Installs: 191
Rank: #4482

Install

npx skills add https://github.com/davila7/claude-code-templates --skill model-merging

Model Merging: Combining Pre-trained Models

When to Use This Skill

Use Model Merging when you need to:

- Combine capabilities from multiple fine-tuned models without retraining
- Create specialized models by blending domain-specific expertise (math + coding + chat)
- Improve performance beyond single models (often +5-10% on benchmarks)
- Reduce training costs: merges can run on CPU, so no GPU is required
- Experiment rapidly: create new model variants in minutes, not days
- Preserve multiple skills: merge without catastrophic forgetting

Success Stories: Marcoro14-7B-slerp (best on Open LLM Leaderboard 02/2024), many top HuggingFace models use merging

Tools: mergekit (Arcee AI), LazyMergekit, Model Soup

Installation

Install mergekit

    git clone https://github.com/arcee-ai/mergekit.git
    cd mergekit
    pip install -e .

Or via pip

pip install mergekit

Optional: Transformers library

pip install transformers torch

Quick Start

Simple Linear Merge

config.yml - Merge two models with equal weights

    merge_method: linear
    models:
      - model: mistralai/Mistral-7B-v0.1
        parameters:
          weight: 0.5
      - model: teknium/OpenHermes-2.5-Mistral-7B
        parameters:
          weight: 0.5
    dtype: bfloat16

Run merge

mergekit-yaml config.yml ./merged-model --cuda

Use merged model

    python -c "from transformers import pipeline; print(pipeline('text-generation', model='./merged-model')('Hello', max_new_tokens=20)[0]['generated_text'])"

SLERP Merge (Best for 2 Models)

config.yml - Spherical interpolation

    merge_method: slerp
    base_model: mistralai/Mistral-7B-v0.1  # SLERP needs one of the two models as base_model
    slices:
      - sources:
          - model: mistralai/Mistral-7B-v0.1
            layer_range: [0, 32]
          - model: teknium/OpenHermes-2.5-Mistral-7B
            layer_range: [0, 32]
    parameters:
      t: 0.5  # Interpolation factor (0 = base_model, 1 = the other model)
    dtype: bfloat16

Core Concepts

1. Merge Methods

Linear (Model Soup)

- Simple weighted average of parameters
- Fast, works well for similar models
- Can merge 2+ models

    merged_weights = w1 * model1_weights + w2 * model2_weights + w3 * model3_weights

where w1 + w2 + w3 = 1
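The weighted average above can be sketched on toy data. This is a minimal illustration, not mergekit's implementation: `linear_merge` is a hypothetical helper, and each "model" is assumed to be a flat list of floats rather than a real checkpoint.

```python
def linear_merge(tensors, weights):
    """Weighted average of same-length parameter vectors (lists of floats)."""
    if len(tensors) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    # Rescale so the effective weights sum to 1, as the formula assumes.
    norm = [w / total for w in weights]
    return [
        sum(w * t[i] for w, t in zip(norm, tensors))
        for i in range(len(tensors[0]))
    ]


# Toy "models" with three parameters each (illustrative values only).
model1 = [1.0, 2.0, 3.0]
model2 = [3.0, 2.0, 1.0]
print(linear_merge([model1, model2], [0.5, 0.5]))  # [2.0, 2.0, 2.0]
```

Rescaling inside the helper means unnormalized weights such as `[1, 3]` still produce a proper average.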

SLERP (Spherical Linear Interpolation)

- Interpolates along a sphere in weight space
- Preserves magnitude of weight vectors
- Best for merging 2 models
- Smoother than linear

SLERP formula

merged = (sin((1-t)θ) / sin(θ)) * model1 + (sin(tθ) / sin(θ)) * model2

where θ = arccos(dot(model1, model2) / (‖model1‖ · ‖model2‖)), the angle between the two weight vectors

t ∈ [0, 1]
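A minimal sketch of the SLERP formula on flat vectors follows. It assumes non-zero inputs represented as plain lists of floats; the `slerp` helper and its `eps` threshold are illustrative choices, not mergekit's code. When the vectors are nearly parallel, sin(θ) approaches zero, so the sketch falls back to linear interpolation.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two non-zero flat vectors."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # Cosine of the angle between the vectors, from their normalized dot product.
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        # Nearly parallel (or anti-parallel): fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    c0 = math.sin((1 - t) * theta) / sin_theta
    c1 = math.sin(t * theta) / sin_theta
    return [c0 * a + c1 * b for a, b in zip(v0, v1)]


# Halfway between two orthogonal unit vectors stays on the unit circle.
print(slerp(0.5, [1.0, 0.0], [0.0, 1.0]))
```

For the orthogonal unit vectors in the example, both coordinates come out close to 0.7071, so the result keeps unit length, whereas a plain average would give [0.5, 0.5].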

Task Arithmetic

- Extract "task vectors" (fine-tuned - base)
- Combine task vectors, add to base
- Good for merging multiple specialized models

Task vector

task_vector = finetuned_model - base_model

Merge multiple task vectors

merged = base_model + α₁task_vector₁ + α₂task_vector₂
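The two formulas above can be sketched together on toy vectors. The helpers `task_vector` and `task_arithmetic` are hypothetical names, and the three-parameter "models" are made-up values used only to show the arithmetic.

```python
def task_vector(finetuned, base):
    """Element-wise difference between a fine-tuned model and its base."""
    return [f - b for f, b in zip(finetuned, base)]


def task_arithmetic(base, finetuned_models, alphas):
    """Add each scaled task vector back onto the base parameters."""
    merged = list(base)
    for model, alpha in zip(finetuned_models, alphas):
        for i, delta in enumerate(task_vector(model, base)):
            merged[i] += alpha * delta
    return merged


base = [1.0, 1.0, 1.0]
math_model = [2.0, 1.0, 1.0]   # changed only the first parameter
code_model = [1.0, 1.0, 3.0]   # changed only the last parameter
print(task_arithmetic(base, [math_model, code_model], [0.5, 0.5]))
# [1.5, 1.0, 2.0]
```

Because each toy model changed a different parameter, the two task vectors do not interfere; TIES and DARE address the case where they do.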

TIES-Merging

- Task arithmetic + sparsification
- Resolves sign conflicts in parameters
- Best for merging many task-specific models
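As a rough sketch of the idea, TIES can be read as three steps: trim each task vector to its largest-magnitude entries, elect a sign per parameter, then average only the entries that agree with that sign. The code below assumes flat lists of floats and uses hypothetical helper names (`trim`, `ties_merge`); it illustrates the steps and is not mergekit's implementation.

```python
def trim(delta, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = int(round(density * len(delta)))
    if k <= 0:
        return [0.0] * len(delta)
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(delta)]


def ties_merge(base, finetuned_models, density=0.5, weight=1.0):
    """Trim task vectors, elect a sign per parameter, average agreeing entries."""
    deltas = [
        trim([f - b for f, b in zip(model, base)], density)
        for model in finetuned_models
    ]
    merged = []
    for i, b in enumerate(base):
        column = [d[i] for d in deltas]
        # Elect the sign carrying the larger total magnitude at this position.
        sign = 1.0 if sum(column) >= 0 else -1.0
        agreeing = [c for c in column if c * sign > 0]
        mean = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + weight * mean)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_a = [4.0, -1.0, 0.5, 0.0]
model_b = [-2.0, -3.0, 0.25, 1.0]
print(ties_merge(base, [model_a, model_b], density=0.5))
# [4.0, -2.0, 0.0, 0.0]
```

In the example, the first parameter has conflicting signs (+4 and -2); the positive sign wins and the -2 contribution is discarded rather than averaged in.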

DARE (Drop And REscale)

- Randomly drops fine-tuned parameters
- Rescales remaining parameters
- Reduces redundancy, maintains performance

2. Configuration Structure

Basic structure

    merge_method:        # linear, slerp, ties, dare_ties, task_arithmetic
    base_model:          # Optional: base model for task arithmetic

    models:
      - model:
        parameters:
          weight:        # Merge weight
          density:       # For TIES/DARE
      - model:
        parameters:
          weight:

    parameters:          # Method-specific parameters

    dtype:               # bfloat16, float16, float32

    # Optional
    slices:              # Layer-wise merging
    tokenizer:           # Tokenizer configuration

Merge Methods Guide

Linear Merge

Best for: Simple model combinations, equal weighting

    merge_method: linear
    models:
      - model: WizardLM/WizardMath-7B-V1.1
        parameters:
          weight: 0.4
      - model: teknium/OpenHermes-2.5-Mistral-7B
        parameters:
          weight: 0.3
      - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
        parameters:
          weight: 0.3
    dtype: bfloat16

SLERP Merge

Best for: Two models, smooth interpolation

    merge_method: slerp
    base_model: mistralai/Mistral-7B-v0.1  # SLERP needs one of the two models as base_model
    slices:
      - sources:
          - model: mistralai/Mistral-7B-v0.1
            layer_range: [0, 32]
          - model: teknium/OpenHermes-2.5-Mistral-7B
            layer_range: [0, 32]
    parameters:
      t: 0.5  # 0.0 = base_model, 1.0 = the other model
    dtype: bfloat16

Layer-specific SLERP:

    merge_method: slerp
    base_model: model_a  # SLERP needs one of the two models as base_model
    slices:
      - sources:
          - model: model_a
            layer_range: [0, 32]
          - model: model_b
            layer_range: [0, 32]
    parameters:
      t:
        - filter: self_attn  # Attention layers
          value: 0.3
        - filter: mlp        # MLP layers
          value: 0.7
        - value: 0.5         # Default for other layers
    dtype: bfloat16

Task Arithmetic

Best for: Combining specialized skills

    merge_method: task_arithmetic
    base_model: mistralai/Mistral-7B-v0.1
    models:
      - model: WizardLM/WizardMath-7B-V1.1        # Math
        parameters:
          weight: 0.5
      - model: teknium/OpenHermes-2.5-Mistral-7B  # Chat
        parameters:
          weight: 0.3
      - model: ajibawa-2023/Code-Mistral-7B       # Code
        parameters:
          weight: 0.2
    dtype: bfloat16

TIES-Merging

Best for: Many models, resolving conflicts

    merge_method: ties
    base_model: mistralai/Mistral-7B-v0.1
    models:
      - model: WizardLM/WizardMath-7B-V1.1
        parameters:
          density: 0.5  # Keep top 50% of parameters
          weight: 1.0
      - model: teknium/OpenHermes-2.5-Mistral-7B
        parameters:
          density: 0.5
          weight: 1.0
      - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
        parameters:
          density: 0.5
          weight: 1.0
    parameters:
      normalize: true
    dtype: bfloat16

DARE Merge

Best for: Reducing redundancy

    merge_method: dare_ties
    base_model: mistralai/Mistral-7B-v0.1
    models:
      - model: WizardLM/WizardMath-7B-V1.1
        parameters:
          density: 0.5  # Drop 50% of deltas
          weight: 0.6
      - model: teknium/OpenHermes-2.5-Mistral-7B
        parameters:
          density: 0.5
          weight: 0.4
    parameters:
      int8_mask: true  # Use int8 for masks (saves memory)
    dtype: bfloat16
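The effect of `density` in a DARE-style merge can be sketched as follows: each delta (fine-tuned minus base) is kept with probability `density`, and the survivors are divided by `density` so the expected value of the delta is unchanged. The `dare` helper is a hypothetical name operating on a flat list of floats, not mergekit's code.

```python
import random


def dare(delta, density, rng):
    """Drop each delta with probability 1 - density; rescale survivors by 1/density."""
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    return [d / density if rng.random() < density else 0.0 for d in delta]


rng = random.Random(0)           # fixed seed so the sketch is repeatable
delta = [0.2] * 10000            # toy task vector with identical entries
sparse = dare(delta, density=0.5, rng=rng)

kept = sum(1 for d in sparse if d != 0.0)
print(kept / len(sparse))         # roughly 0.5 of the entries survive
print(sum(sparse) / len(sparse))  # mean stays close to the original 0.2
```

Surviving entries become 0.4 (0.2 divided by 0.5), which is why the mean of the sparse vector stays near the original value even though about half the entries are zeroed.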

Advanced Patterns

Layer-wise Merging

Different models for different layers

    merge_method: passthrough
    slices:
      - sources:
          - model: mistralai/Mistral-7B-v0.1
            layer_range: [0, 16]   # First half
      - sources:
          - model: teknium/OpenHermes-2.5-Mistral-7B
            layer_range: [16, 32]  # Second half
    dtype: bfloat16

MoE from Merged Models

Create Mixture of Experts (this config is for the separate mergekit-moe script, not mergekit-yaml)

    # Run with: mergekit-moe config.yml ./merged-moe
    base_model: mistralai/Mistral-7B-v0.1
    gate_mode: hidden  # Initialize router gates from hidden states of the prompts
    dtype: bfloat16
    experts:
      - source_model: WizardLM/WizardMath-7B-V1.1
        positive_prompts:
          - "math"
          - "calculate"
      - source_model: teknium/OpenHermes-2.5-Mistral-7B
        positive_prompts:
          - "chat"
          - "conversation"
      - source_model: ajibawa-2023/Code-Mistral-7B
        positive_prompts:
          - "code"
          - "python"

Tokenizer Merging

    merge_method: linear
    models:
      - model: mistralai/Mistral-7B-v0.1
        parameters:
          weight: 0.5  # linear merges need a weight per model
      - model: custom/specialized-model
        parameters:
          weight: 0.5
    dtype: bfloat16

    tokenizer:
      source: "union"  # Combine vocabularies from both models
      tokens:
        <|special_token|>:
          source: "custom/specialized-model"

Best Practices

1. Model Compatibility

✅ Good: Same architecture

    models = [
        "mistralai/Mistral-7B-v0.1",
        "teknium/OpenHermes-2.5-Mistral-7B",  # Both Mistral 7B
    ]

❌ Bad: Different architectures

    models = [
        "meta-llama/Llama-2-7b-hf",   # Llama
        "mistralai/Mistral-7B-v0.1",  # Mistral (incompatible!)
    ]

2. Weight Selection

✅ Good: Weights sum to 1.0

    models:
      - model: model_a
        parameters:
          weight: 0.6
      - model: model_b
        parameters:
          weight: 0.4  # 0.6 + 0.4 = 1.0

⚠️ Acceptable: Weights don't sum to 1 (for task arithmetic)

    models:
      - model: model_a
        parameters:
          weight: 0.8
      - model: model_b
        parameters:
          weight: 0.8  # May boost performance
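For averaging methods such as linear, where the weights are meant to sum to 1.0, a small pre-flight helper can rescale whatever relative weights you start from before writing them into the config. `normalize_weights` is a hypothetical helper for illustration and is not part of mergekit.

```python
def normalize_weights(weights):
    """Rescale a {model_name: weight} mapping so the weights sum to 1.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: w / total for name, w in weights.items()}


# Relative importance 3:1 becomes 0.75 / 0.25.
print(normalize_weights({"model_a": 3, "model_b": 1}))
# {'model_a': 0.75, 'model_b': 0.25}
```

For task arithmetic, where weights scale task vectors rather than form an average, this rescaling is optional, as the "Acceptable" example above shows.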

3. Method Selection

Choose merge method based on use case:

2 models, smooth blend → SLERP

merge_method = "slerp"

3+ models, simple average → Linear

merge_method = "linear"

Multiple task-specific models → Task Arithmetic or TIES

merge_method = "ties"

Want to reduce redundancy → DARE

merge_method = "dare_ties"
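The rules of thumb above can be collected into one function. This is only a sketch of the decision order listed in this section; `choose_merge_method` and its flag names are hypothetical, and real choices should still be validated by benchmarking.

```python
def choose_merge_method(num_models, task_specific=False, reduce_redundancy=False):
    """Map the selection rules above to a mergekit merge_method string."""
    if num_models < 2:
        raise ValueError("merging needs at least two models")
    if reduce_redundancy:
        return "dare_ties"   # want to reduce redundancy
    if task_specific:
        return "ties"        # multiple task-specific models
    if num_models == 2:
        return "slerp"       # two models, smooth blend
    return "linear"          # three or more models, simple average


print(choose_merge_method(2))                      # slerp
print(choose_merge_method(3))                      # linear
print(choose_merge_method(3, task_specific=True))  # ties
```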

4. Density Tuning (TIES/DARE)

Start conservative (keep more parameters)

    parameters:
      density: 0.8  # Keep 80%

If performance holds, increase sparsity

    parameters:
      density: 0.5  # Keep 50%

If performance degrades, reduce sparsity

    parameters:
      density: 0.9  # Keep 90%

5. Layer-specific Merging

Preserve base model's beginning and end

    merge_method: passthrough
    slices:
      - sources:
          - model: base_model
            layer_range: [0, 2]    # Keep first layers
      - sources:
          - model: merged_middle   # Merge middle layers
            layer_range: [2, 30]
      - sources:
          - model: base_model
            layer_range: [30, 32]  # Keep last layers

Evaluation & Testing

Benchmark Merged Models

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load merged model
    model = AutoModelForCausalLM.from_pretrained("./merged-model")
    tokenizer = AutoTokenizer.from_pretrained("./merged-model")

    # Test on various tasks
    test_prompts = {
        "math": "Calculate: 25 * 17 =",
        "code": "Write a Python function to reverse a string:",
        "chat": "What is the capital of France?",
    }

    for task, prompt in test_prompts.items():
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, max_length=100)
        # Skip special tokens so BOS/EOS markers do not clutter the output
        print(f"{task}: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

Common Benchmarks

- Open LLM Leaderboard: General capabilities
- MT-Bench: Multi-turn conversation
- MMLU: Multitask accuracy
- HumanEval: Code generation
- GSM8K: Math reasoning

Production Deployment

Save and Upload

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load merged model
    model = AutoModelForCausalLM.from_pretrained("./merged-model")
    tokenizer = AutoTokenizer.from_pretrained("./merged-model")

    # Upload to HuggingFace Hub
    model.push_to_hub("username/my-merged-model")
    tokenizer.push_to_hub("username/my-merged-model")

Quantize Merged Model

Convert to GGUF (uses llama.cpp's conversion script; the script name varies by llama.cpp version)

python convert.py ./merged-model --outtype f16 --outfile merged-model.gguf

Quantize with GPTQ

python quantize_gptq.py ./merged-model --bits 4 --group_size 128

Common Pitfalls

❌ Pitfall 1: Merging Incompatible Models

Wrong: Different architectures

    models:
      - model: meta-llama/Llama-2-7b  # Llama architecture
      - model: mistralai/Mistral-7B   # Mistral architecture

Fix: Only merge models with same architecture

❌ Pitfall 2: Over-weighting One Model

Suboptimal: One model dominates

    models:
      - model: model_a
        parameters:
          weight: 0.95  # Too high
      - model: model_b
        parameters:
          weight: 0.05  # Too low

Fix: Use more balanced weights (0.3-0.7 range)

❌ Pitfall 3: Not Evaluating

Wrong: Merge and deploy without testing

mergekit-yaml config.yml ./merged-model

Deploy immediately (risky!)

Fix: Always benchmark before deploying

Resources

- mergekit GitHub: https://github.com/arcee-ai/mergekit
- HuggingFace Tutorial: https://huggingface.co/blog/mlabonne/merge-models
- LazyMergekit: Automated merging notebook
- TIES Paper: https://arxiv.org/abs/2306.01708
- DARE Paper: https://arxiv.org/abs/2311.03099

See Also

- references/methods.md - Deep dive into merge algorithms
- references/examples.md - Real-world merge configurations
- references/evaluation.md - Benchmarking and testing strategies
