# peft-fine-tuning

Installs: 194
Rank: #4417

## Install

```bash
npx skills add https://github.com/davila7/claude-code-templates --skill peft-fine-tuning
```

PEFT (Parameter-Efficient Fine-Tuning): fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.

## When to use PEFT

Use PEFT/LoRA when:

- Fine-tuning 7B-70B models on consumer GPUs (RTX 4090, A100)
- You need to train <1% of parameters (MB-scale adapters vs. a ~14 GB full model)
- You want fast iteration with multiple task-specific adapters
- Deploying multiple fine-tuned variants from one base model

Use QLoRA (PEFT + quantization) when:

- Fine-tuning 70B models on a single 24 GB GPU
- Memory is the primary constraint
- You can accept a ~5% quality trade-off vs. full fine-tuning

Use full fine-tuning instead when:

- Training small models (<1B parameters)
- You need maximum quality and have the compute budget
- Significant domain shift requires updating all weights

## Quick start

### Installation

```bash
# Basic installation
pip install peft

# With quantization support (recommended)
pip install peft bitsandbytes

# Full stack
pip install peft transformers accelerate bitsandbytes datasets
```

### LoRA fine-tuning (standard)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                                     # Rank (8-64, higher = more capacity)
    lora_alpha=32,                                            # Scaling factor (typically 2*r)
    lora_dropout=0.05,                                        # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    bias="none",                                              # Don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%

# Prepare dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=lambda data: {
        "input_ids": torch.tensor([f["input_ids"] for f in data]),
        "attention_mask": torch.tensor([f["attention_mask"] for f in data]),
        "labels": torch.tensor([f["input_ids"] for f in data]),
    },
)
trainer.train()

# Save adapter only (~6 MB vs ~16 GB for the full model)
model.save_pretrained("./lora-llama-adapter")
```

### QLoRA fine-tuning (memory-efficient)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype="bfloat16",  # Compute in bf16
    bnb_4bit_use_double_quant=True,     # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=64,  # Higher rank for 70B
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# 70B model now fits on a single 24 GB GPU!
```
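The win from 4-bit loading is that weight memory scales linearly with bytes per parameter. A back-of-envelope sketch (weights only; activations, optimizer state, and the LoRA parameters themselves come on top):

```python
# Rough weight-memory footprint of an 8B-class model at different precisions.
params = 8e9  # parameter count (illustrative)
for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("nf4", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.0f} GB")
# fp16: 16 GB, int8: 8 GB, nf4: 4 GB
```

Double quantization (`bnb_4bit_use_double_quant`) shaves a bit more by also quantizing the quantization constants.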

## LoRA parameter selection

### Rank (r): capacity vs. efficiency

| Rank | Trainable Params | Memory | Quality | Use Case |
|------|------------------|--------|---------|----------|
| 4 | ~3M | Minimal | Lower | Simple tasks, prototyping |
| 8 | ~7M | Low | Good | Recommended starting point |
| 16 | ~14M | Medium | Better | General fine-tuning |
| 32 | ~27M | Higher | High | Complex tasks |
| 64 | ~54M | High | Highest | Domain adaptation, 70B models |

### Alpha (lora_alpha): scaling factor

```python
# Rule of thumb: alpha = 2 * rank
LoraConfig(r=16, lora_alpha=32)  # Standard
LoraConfig(r=16, lora_alpha=16)  # Conservative (lower effective learning rate)
LoraConfig(r=16, lora_alpha=64)  # Aggressive (higher effective learning rate)
```
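What `lora_alpha` actually changes is visible in a one-dimensional sketch of the LoRA forward pass. Real LoRA uses matrices A (r×d) and B (d×r); the scalar version below only illustrates the `alpha / r` scaling applied to the adapter path:

```python
def lora_forward(x, w, a, b, r, alpha):
    # Frozen weight w plus the adapter path b*(a*x), scaled by alpha / r
    return w * x + (alpha / r) * (b * (a * x))

x, w, a = 2.0, 1.0, 0.5
# B is zero-initialized in LoRA, so the adapter starts as a no-op
assert lora_forward(x, w, a, b=0.0, r=16, alpha=32) == w * x

standard = lora_forward(x, w, a, b=0.5, r=16, alpha=32)    # alpha = 2*r
aggressive = lora_forward(x, w, a, b=0.5, r=16, alpha=64)  # alpha = 4*r
# Doubling alpha doubles the adapter's contribution at identical weights
assert aggressive - w * x == 2 * (standard - w * x)
```

This is why raising alpha behaves much like raising the learning rate on the adapter weights.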

### Target modules by architecture

```python
# Llama / Mistral / Qwen
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2 / GPT-Neo
target_modules = ["c_attn", "c_proj", "c_fc"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# BLOOM
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Auto-detect all linear layers (PEFT 0.6.0+)
target_modules = "all-linear"
```
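For an architecture not listed above, valid `target_modules` names are the leaf names of the model's Linear layers. The helper below is a hypothetical sketch of that pattern over hard-coded names; with a real model you would iterate `model.named_modules()` and keep the `torch.nn.Linear` instances:

```python
def leaf_linear_names(named_linears):
    # Deduplicate by the last path component, e.g. "model.layers.0.self_attn.q_proj" -> "q_proj"
    return sorted({name.rsplit(".", 1)[-1] for name in named_linears})

names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.gate_proj",
]
print(leaf_linear_names(names))  # ['gate_proj', 'q_proj', 'v_proj']
```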

## Loading and merging adapters

### Load a trained adapter

```python
from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

# Option 1: Load with PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")

# Option 2: Load directly (recommended)
model = AutoPeftModelForCausalLM.from_pretrained("./lora-llama-adapter", device_map="auto")
```

### Merge an adapter into the base model

```python
# Merge for deployment (no adapter overhead)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")

# Push to Hub
merged_model.push_to_hub("username/llama-finetuned")
```

### Multi-adapter serving

```python
from peft import AutoPeftModelForCausalLM

# Load base with first adapter (name it so we can switch back later)
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1", adapter_name="task1")

# Load additional adapters
model.load_adapter("./adapter-task2", adapter_name="task2")
model.load_adapter("./adapter-task3", adapter_name="task3")

# Switch between adapters at runtime
model.set_adapter("task1")
output1 = model.generate(**inputs)  # Uses the task1 adapter

model.set_adapter("task2")
output2 = model.generate(**inputs)  # Uses the task2 adapter

# Disable adapters (use the base model)
with model.disable_adapter():
    base_output = model.generate(**inputs)
```

## PEFT methods comparison

| Method | Trainable % | Memory | Speed | Best For |
|--------|-------------|--------|-------|----------|
| LoRA | 0.1-1% | Low | Fast | General fine-tuning |
| QLoRA | 0.1-1% | Very Low | Medium | Memory-constrained |
| AdaLoRA | 0.1-1% | Low | Medium | Automatic rank selection |
| IA3 | 0.01% | Minimal | Fastest | Few-shot adaptation |
| Prefix Tuning | 0.1% | Low | Medium | Generation control |
| Prompt Tuning | 0.001% | Minimal | Fast | Simple task adaptation |
| P-Tuning v2 | 0.1% | Low | Medium | NLU tasks |

### IA3 (minimal parameters)

```python
from peft import IA3Config

ia3_config = IA3Config(
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)
model = get_peft_model(model, ia3_config)
# Trains only ~0.01% of parameters!
```
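Conceptually, IA3 learns a per-channel scaling vector for the targeted key/value/feed-forward activations rather than low-rank weight deltas, which is why it trains so few parameters. A minimal sketch of the idea (not the PEFT internals):

```python
# Activation from a targeted module (illustrative values)
h = [0.5, -1.2, 3.0, 0.7]
# Learned IA3 vector, initialized to ones so the adapter starts as a no-op
# (analogous to LoRA's zero-initialized B matrix)
l = [1.0, 1.0, 1.0, 1.0]

scaled = [li * hi for li, hi in zip(l, h)]
assert scaled == h  # identity at init; training moves l away from ones
```

One vector per targeted module, instead of two rank-r matrices, explains the ~10x smaller trainable fraction vs. LoRA.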

### Prefix Tuning

```python
from peft import PrefixTuningConfig

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,   # Prepended tokens
    prefix_projection=True,  # Use MLP projection
)
model = get_peft_model(model, prefix_config)
```

## Integration patterns

### With TRL (SFTTrainer)

```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", max_seq_length=512),
    train_dataset=dataset,
    peft_config=lora_config,  # Pass the LoRA config directly
)
trainer.train()
```

### With Axolotl (YAML config)

```yaml
# axolotl config.yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_target_linear: true  # Target all linear layers
```

### With vLLM (inference)

```python
from vllm import LLM
from vllm.lora.request import LoRARequest

# Load base model with LoRA support
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

# Serve with an adapter
prompts = ["### Instruction:\nSummarize PEFT.\n\n### Response:\n"]
outputs = llm.generate(prompts, lora_request=LoRARequest("adapter1", 1, "./lora-adapter"))
```

## Performance benchmarks

### Memory usage (Llama 3.1 8B)

| Method | GPU Memory | Trainable Params |
|--------|------------|------------------|
| Full fine-tuning | 60+ GB | 8B (100%) |
| LoRA r=16 | 18 GB | 14M (0.17%) |
| QLoRA r=16 | 6 GB | 14M (0.17%) |
| IA3 | 16 GB | 800K (0.01%) |

### Training speed (A100 80GB)

| Method | Tokens/sec | vs Full FT |
|--------|------------|------------|
| Full FT | 2,500 | 1x |
| LoRA | 3,200 | 1.3x |
| QLoRA | 2,100 | 0.84x |

### Quality (MMLU benchmark)

| Model | Full FT | LoRA | QLoRA |
|-------|---------|------|-------|
| Llama 2-7B | 45.3 | 44.8 | 44.1 |
| Llama 2-13B | 54.8 | 54.2 | 53.5 |

## Common issues

### CUDA OOM during training

```python
# Solution 1: Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size + increase accumulation
TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=16)

# Solution 3: Use QLoRA
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
```

### Adapter not applying

```python
# Verify the adapter is active (should show the adapter name)
print(model.active_adapters)

# Check trainable parameters
model.print_trainable_parameters()

# Ensure the model is in training mode
model.train()
```

### Quality degradation

```python
# Increase rank
LoraConfig(r=32, lora_alpha=64)

# Target more modules
target_modules = "all-linear"

# Use more training data and epochs
TrainingArguments(num_train_epochs=5)

# Lower the learning rate
TrainingArguments(learning_rate=1e-4)
```
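The OOM mitigation of trading batch size for accumulation works because the effective batch seen by the optimizer stays the same; only peak activation memory drops. A quick sketch of the arithmetic:

```python
def effective_batch(per_device, accum_steps, n_gpus=1):
    # Samples contributing to each optimizer step
    return per_device * accum_steps * n_gpus

# batch 4 x accumulation 4 (the earlier training config) and
# batch 1 x accumulation 16 (the OOM fix) optimize over the same batch:
assert effective_batch(4, 4) == effective_batch(1, 16) == 16
```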
## Best practices

- **Start with r=8-16**, increase if quality is insufficient
- **Use alpha = 2 * rank** as a starting point
- **Target attention + MLP layers** for the best quality/efficiency trade-off
- **Enable gradient checkpointing** for memory savings
- **Save adapters frequently** (small files, easy rollback)
- **Evaluate on held-out data** before merging
- **Use QLoRA for 70B+ models** on consumer hardware
## References

- Advanced Usage: DoRA, LoftQ, rank stabilization, custom modules
- Troubleshooting: Common errors, debugging, optimization

## Resources

- GitHub: https://github.com/huggingface/peft
- Docs: https://huggingface.co/docs/peft
- LoRA paper: arXiv:2106.09685
- QLoRA paper: arXiv:2305.14314
- Models: https://huggingface.co/models?library=peft