# LLM Fine-Tuning Guide
Master the art of fine-tuning large language models to create specialized models optimized for your specific use cases, domains, and performance requirements.
## Overview
Fine-tuning adapts pre-trained LLMs to specific tasks, domains, or styles by training them on curated datasets. This improves accuracy, reduces hallucinations, and optimizes costs.
## When to Fine-Tune

- **Domain Specialization**: Legal documents, medical records, financial reports
- **Task-Specific Performance**: Better results on specific tasks than the base model
- **Cost Optimization**: A smaller fine-tuned model replaces an expensive large model
- **Style Adaptation**: Match specific writing styles or tones
- **Compliance Requirements**: Keep sensitive data within your infrastructure
- **Latency Requirements**: Smaller models deploy faster

## When NOT to Fine-Tune

- One-off queries (use prompting instead)
- Rapidly changing information (use RAG instead)
- Limited training data (< 100 examples is typically insufficient)
- General knowledge questions (the base model is sufficient)

## Quick Start
**Full Fine-Tuning:**

```bash
python examples/full_fine_tuning.py
```

**LoRA (Recommended for most cases):**

```bash
python examples/lora_fine_tuning.py
```

**QLoRA (Single GPU):**

```bash
python examples/qlora_fine_tuning.py
```

**Data Preparation:**

```bash
python scripts/data_preparation.py
```
## Fine-Tuning Approaches

### 1. Full Fine-Tuning
Update all model parameters during training.
Pros:
- Maximum performance improvement
- Can completely rewrite model behavior
- Best for significant domain shifts
Cons:
- High computational cost
- Requires a large dataset (1000+ examples)
- Risk of catastrophic forgetting
- Long training time

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

training_args = TrainingArguments(
    output_dir="./fine-tuned-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=10,
    save_steps=100,
    eval_strategy="steps",
    eval_steps=50,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
```
### 2. Parameter-Efficient Fine-Tuning (PEFT)
Train only a small fraction of parameters.
#### LoRA (Low-Rank Adaptation)
Adds trainable low-rank matrices to existing weights.
Pros:
- ~99% fewer trainable parameters
- Maintains base model knowledge
- Fast training (10-20x faster)
- Easy to switch between adapters
Cons:
- Slightly lower performance than full fine-tuning
- Requires the base model at inference

```python
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer

base_model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(base_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Configure LoRA
lora_config = LoraConfig(
    r=8,                                  # Rank of the low-rank matrices
    lora_alpha=16,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Wrap the model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,738,415,616 || trainable%: 0.06

# Train as normal
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()

# Save only the LoRA weights
model.save_pretrained("./llama-lora-adapter")
```
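The trainable-parameter count that `print_trainable_parameters()` reports can be reproduced by hand. A sketch assuming Llama-2-7B's layout (32 transformer layers, hidden size 4096, square attention projections) and the `r=8` config above; each adapted projection gains two low-rank matrices, A (r × d) and B (d × r):

```python
# LoRA trainable parameters for Llama-2-7B with r=8 and
# target_modules=["q_proj", "v_proj"]
# (assumed layout: 32 layers, hidden size 4096, square projections)
num_layers = 32
hidden = 4096
r = 8
modules_per_layer = 2  # q_proj and v_proj

# Each adapted d x d projection gains A (r x d) and B (d x r)
params_per_module = 2 * r * hidden
trainable = num_layers * modules_per_layer * params_per_module
print(trainable)  # 4194304 -- matches the printed count above

# Compare against fully fine-tuning just those projections
full = num_layers * modules_per_layer * hidden * hidden
print(f"{trainable / full:.2%}")  # 0.39%
```

This is where the "99% fewer trainable parameters" figure comes from: the adapters are a fraction of a percent of the weights they modify.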
#### QLoRA (Quantized LoRA)
Combines LoRA with quantization for extreme efficiency.
```python
import torch
from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, Trainer, TrainingArguments

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)

# Train on a single GPU
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./qlora-output",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=5e-4,
        num_train_epochs=3,
    ),
    train_dataset=train_dataset,
)
trainer.train()
```
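To see why 4-bit loading saves so much memory, here is a toy absmax block-quantization round trip in plain Python. It illustrates the general idea behind QLoRA's weight quantization, not the NF4 scheme itself (NF4 uses a codebook informed by the normal distribution of weights; this sketch uses uniform signed 4-bit levels):

```python
def quantize_block(values, levels=7):
    """Absmax-quantize a block of floats to signed 4-bit ints in [-7, 7]."""
    scale = max(abs(v) for v in values) / levels
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate floats from quantized codes and the block scale."""
    return [qi * scale for qi in q]

weights = [0.31, -0.12, 0.05, -0.44, 0.27, 0.01, -0.09, 0.38]
q, scale = quantize_block(weights)

# Every code fits in 4 bits (sign plus 3 magnitude bits)
assert all(-7 <= qi <= 7 for qi in q)

restored = dequantize_block(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max reconstruction error: {max_err:.4f}")
```

Each stored weight shrinks from 16 or 32 bits to 4 bits plus a shared per-block scale, at the cost of a bounded reconstruction error (at most half the scale here). QLoRA freezes these quantized weights and trains only the full-precision LoRA adapters on top.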
#### Prefix Tuning
Prepends trainable tokens to input.
```python
from peft import get_peft_model, PrefixTuningConfig, TaskType

config = PrefixTuningConfig(
    num_virtual_tokens=20,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, config)
# Trains only the virtual-token prefix: roughly
# num_virtual_tokens * num_layers * 2 * hidden_size parameters
```
### 3. Instruction Fine-Tuning
Train model to follow instructions with examples.
```python
# Training data format
training_data = [
    {
        "instruction": "Translate to French",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    },
    {
        "instruction": "Summarize this text",
        "input": "Long document...",
        "output": "Summary..."
    },
]

# Template for training
template = """Below is an instruction that describes a task, paired with an input that provides further context.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

# Create the formatted dataset
formatted_data = [template.format(**example) for example in training_data]
```
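One detail the template glosses over: during instruction tuning it is common to compute the loss only on the response tokens, masking the prompt portion of `labels` with -100 (the index Hugging Face cross-entropy losses ignore). A minimal sketch on plain token-id lists; `prompt_ids` and `response_ids` are stand-ins for real tokenizer output:

```python
IGNORE_INDEX = -100  # label value ignored by Hugging Face cross-entropy

def build_labels(prompt_ids, response_ids):
    """Mask prompt tokens so loss is computed only on the response."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Stand-in token ids (a real pipeline would use tokenizer output)
prompt_ids = [101, 2054, 2003]     # the "### Instruction:" portion
response_ids = [2023, 2003, 102]   # the "### Response:" portion

input_ids, labels = build_labels(prompt_ids, response_ids)
print(labels)  # [-100, -100, -100, 2023, 2003, 102]
```

Without this masking, the model spends capacity learning to reproduce the instruction boilerplate instead of the desired responses.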
### 4. Domain-Specific Fine-Tuning
Tailor models for specific industries or fields.
**Legal Domain Example**

```python
legal_training_data = [
    {
        "prompt": "What are the key clauses in an NDA?",
        "completion": """Key clauses typically include:
1. Definition of Confidential Information
2. Non-Disclosure Obligations
3. Permitted Disclosures
4. Term and Termination
5. Return of Information
6. Remedies"""
    },
    # ... more legal examples
]

# Train on the legal domain (fine_tune_on_domain is a placeholder
# for your training pipeline)
model = fine_tune_on_domain(
    base_model="gpt-3.5-turbo",
    training_data=legal_training_data,
    epochs=3,
    learning_rate=0.0002,
)
```
## Data Preparation

### 1. Dataset Quality

```python
class DatasetValidator:
    def validate_dataset(self, data):
        issues = {
            "empty_samples": 0,
            "duplicates": 0,
            "outliers": 0,
            "imbalance": {},
        }

        # Check for empty samples
        for sample in data:
            if not sample.get("text"):
                issues["empty_samples"] += 1

        # Check for duplicates
        texts = [s.get("text") for s in data]
        issues["duplicates"] = len(texts) - len(set(texts))

        # Check for length outliers (more than 3x the mean word count);
        # skip empty samples, which were already counted above
        lengths = [len(t.split()) for t in texts if t]
        if lengths:
            mean_length = sum(lengths) / len(lengths)
            issues["outliers"] = sum(1 for l in lengths if l > mean_length * 3)

        return issues

# Validate before training
validator = DatasetValidator()
issues = validator.validate_dataset(training_data)
print(f"Dataset Issues: {issues}")
```
### 2. Data Augmentation

```python
from nlpaug.augmenter.word import SynonymAug, RandomWordAug
import nlpaug.flow as naf

text = "The quick brown fox jumps over the lazy dog"

# Synonym replacement
aug_syn = SynonymAug(aug_p=0.3)
augmented_syn = aug_syn.augment(text)

# Random word insertion
aug_insert = RandomWordAug(action="insert", aug_p=0.3)
augmented_insert = aug_insert.augment(text)

# Combine augmentations in a pipeline
flow = naf.Sequential([
    SynonymAug(aug_p=0.2),
    RandomWordAug(action="swap", aug_p=0.2),
])
augmented = flow.augment(text)
```
### 3. Train/Validation Split

```python
from sklearn.model_selection import train_test_split

# Create an 80/10/10 split
train_data, eval_data = train_test_split(data, test_size=0.2, random_state=42)
eval_data, test_data = train_test_split(eval_data, test_size=0.5, random_state=42)

print(f"Train: {len(train_data)}, Eval: {len(eval_data)}, Test: {len(test_data)}")
```
## Training Techniques

### 1. Learning Rate Scheduling

```python
from transformers import get_linear_schedule_with_warmup

# Linear warmup followed by linear decay
def get_scheduler(optimizer, num_steps):
    return get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=500,
        num_training_steps=num_steps,
    )

# Or configure warmup + cosine annealing through TrainingArguments
# (warmup_steps takes precedence over warmup_ratio, so set only one)
training_args = TrainingArguments(
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=500,
)
```
### 2. Gradient Accumulation

```python
training_args = TrainingArguments(
    gradient_accumulation_steps=4,   # Accumulate gradients over 4 steps
    per_device_train_batch_size=1,   # Effective batch size: 1 * 4 = 4
)
# Simulates a larger batch on limited GPU memory
```
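The equivalence behind gradient accumulation can be checked with a toy example: for a loss averaged over the batch, averaging per-micro-batch gradients matches the full-batch gradient (assuming equal micro-batch sizes). A sketch with a one-parameter squared-error model, no framework needed:

```python
# Toy model: prediction = w * x, loss = mean((w*x - y)^2)
# Gradient wrt w for one sample: 2 * (w*x - y) * x
def grad_mean_loss(w, samples):
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)

samples = [(1.0, 2.0), (2.0, 3.0), (0.5, 1.0), (3.0, 5.0)]
w = 0.7

# Full-batch gradient (batch size 4)
full = grad_mean_loss(w, samples)

# Accumulated gradient: 4 micro-batches of size 1, averaged
micro = [grad_mean_loss(w, [s]) for s in samples]
accumulated = sum(micro) / len(micro)

assert abs(full - accumulated) < 1e-12
```

This is why the optimizer step after 4 accumulated micro-batches of size 1 behaves like a single step with batch size 4, at a quarter of the peak memory.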
### 3. Mixed Precision Training

```python
training_args = TrainingArguments(
    fp16=True,   # Use 16-bit floats (prefer bf16=True on Ampere+ GPUs)
    bf16=False,
)
# Roughly halves activation memory and speeds up training
```
### 4. Multi-GPU Training

```python
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    dataloader_pin_memory=True,
    dataloader_num_workers=4,
)

# Trainer automatically uses all available GPUs
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
```
## Popular Models for Fine-Tuning

### Open Source Models

**Llama 3.2 (Meta)**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Fine-tune on custom data
# ... training code
```

Characteristics:

- 1B, 3B, 11B, and 90B parameter versions
- Strong instruction-following
- Excellent for domain adaptation
- Llama 3.2 Community License

**Gemma 3 (Google)**

```python
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
# Gemma 3 sizes: 1B, 4B, 12B, 27B
# Very efficient, great for fine-tuning
```
Characteristics:

- Small, medium, and large sizes
- Efficient architecture
- Good for edge deployment
- Built on cutting-edge research

**Mistral 7B**

```python
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# Strong performance, efficient architecture
```
Characteristics:

- Sliding-window attention
- Efficient inference
- Strong performance-to-size ratio

### Commercial Models

**OpenAI Fine-Tuning API**

```python
from openai import OpenAI

client = OpenAI()

# Upload the training data
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Create a fine-tuning job
fine_tune_job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 3,
        "learning_rate_multiplier": 0.1,
    },
)

# Check job status (poll until it finishes)
job = client.fine_tuning.jobs.retrieve(fine_tune_job.id)
print(f"Status: {job.status}")

# Use the fine-tuned model once the job succeeds
response = client.chat.completions.create(
    model=job.fine_tuned_model,
    messages=[{"role": "user", "content": "Hello"}],
)
```
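For chat models such as gpt-3.5-turbo, `training_data.jsonl` must contain one JSON object per line, each with a `messages` list of role/content pairs. A minimal sketch that writes such a file with only the standard library (the example content echoes the legal-domain data from earlier in this guide):

```python
import json

# Each line is one training example in chat format
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful legal assistant."},
            {"role": "user", "content": "What are the key clauses in an NDA?"},
            {"role": "assistant", "content": "Key clauses typically include..."},
        ]
    },
]

with open("training_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

The API validates this format at upload time, so malformed lines fail fast rather than mid-job.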
## Evaluation and Metrics

### 1. Perplexity

```python
import torch
from math import exp

def calculate_perplexity(model, eval_dataloader):
    model.eval()
    total_loss = 0.0
    total_tokens = 0

    with torch.no_grad():
        for batch in eval_dataloader:
            outputs = model(**batch)
            # outputs.loss is the mean per-token loss; weight it by the
            # token count, not the batch size (approximated here by the
            # attention mask, ignoring the one-token causal shift)
            num_tokens = batch["attention_mask"].sum().item()
            total_loss += outputs.loss.item() * num_tokens
            total_tokens += num_tokens

    return exp(total_loss / total_tokens)

perplexity = calculate_perplexity(model, eval_dataloader)
print(f"Perplexity: {perplexity:.2f}")
```
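A quick sanity check on what the number means: perplexity is exp of the average per-token cross-entropy, so a model that assigns a uniform 1/V probability to each of V vocabulary tokens has loss ln(V) and perplexity exactly V. In plain Python:

```python
from math import log, exp

vocab_size = 32000  # e.g. the Llama-2 vocabulary

# Uniform model: every token gets probability 1/V,
# so the per-token cross-entropy is -ln(1/V) = ln(V)
per_token_loss = -log(1 / vocab_size)
perplexity = exp(per_token_loss)

print(round(perplexity))  # 32000
```

A perplexity of V reads as "on average, as uncertain as a uniform guess over V tokens", which is why lower is better and why fine-tuning on a narrow domain typically drives it down.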
### 2. Task-Specific Metrics

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate_task(predictions, ground_truth):
    return {
        "accuracy": accuracy_score(ground_truth, predictions),
        "precision": precision_score(ground_truth, predictions, average="weighted"),
        "recall": recall_score(ground_truth, predictions, average="weighted"),
        "f1": f1_score(ground_truth, predictions, average="weighted"),
    }

# Evaluate on the task
predictions = [model.predict(x) for x in test_data]
metrics = evaluate_task(predictions, test_labels)
print(f"Metrics: {metrics}")
```
### 3. Human Evaluation

```python
class HumanEvaluator:
    def evaluate_response(self, prompt, response):
        criteria = {
            "relevance": self._score_relevance(prompt, response),
            "coherence": self._score_coherence(response),
            "factuality": self._score_factuality(response),
            "helpfulness": self._score_helpfulness(response),
        }
        return sum(criteria.values()) / len(criteria)

    def _score_relevance(self, prompt, response):
        # Score 1-5
        pass

    def _score_coherence(self, response):
        # Score 1-5
        pass
```
## Common Challenges & Solutions

### Challenge: Catastrophic Forgetting
Model forgets pre-trained knowledge while adapting to new domain.
Solutions:
- Use lower learning rates (2e-5 to 5e-5)
- Use fewer training epochs (1-3)
- Regularization techniques
- Continual learning approaches

```python
# Conservative training settings
training_args = TrainingArguments(
    learning_rate=2e-5,           # Lower learning rate
    num_train_epochs=2,           # Few epochs
    weight_decay=0.01,            # L2 regularization
    warmup_steps=500,
    save_total_limit=3,
    load_best_model_at_end=True,
)
```
### Challenge: Overfitting
Model performs well on training data but poorly on new data.
Solutions:
- Use more training data
- Implement dropout
- Early stopping
- Validation monitoring

```python
from transformers import EarlyStoppingCallback

training_args = TrainingArguments(
    eval_strategy="steps",
    eval_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

# Early stopping is a Trainer callback, not a TrainingArguments field
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```
### Challenge: Insufficient Training Data

Only a handful of examples are available for fine-tuning.
Solutions:
- Data augmentation
- Use PEFT (LoRA) instead of full fine-tuning
- Few-shot learning with prompting
- Transfer learning

```python
# Use LoRA when data is limited
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
```
## Best Practices

**Before Fine-Tuning**

- ✓ Start with a strong base model
- ✓ Prepare high-quality training data (100+ examples recommended)
- ✓ Define clear evaluation metrics
- ✓ Set up proper train/validation splits
- ✓ Document your objectives

**During Fine-Tuning**

- ✓ Monitor training/validation loss
- ✓ Use appropriate learning rates
- ✓ Save checkpoints regularly
- ✓ Validate on held-out data
- ✓ Watch for overfitting/underfitting

**After Fine-Tuning**

- ✓ Evaluate on the test set
- ✓ Compare against the baseline
- ✓ Perform qualitative analysis
- ✓ Document configuration and results
- ✓ Version your fine-tuned models

## Implementation Checklist

- [ ] Determine fine-tuning approach (full, LoRA, QLoRA, instruction)
- [ ] Prepare and validate training dataset (100+ examples)
- [ ] Choose base model (Llama 3.2, Gemma 3, Mistral, etc.)
- [ ] Set up PEFT if using parameter-efficient methods
- [ ] Configure training arguments and hyperparameters
- [ ] Implement data loading and preprocessing
- [ ] Set up evaluation metrics
- [ ] Train model with monitoring
- [ ] Evaluate on test set
- [ ] Save and version fine-tuned model
- [ ] Test in production environment
- [ ] Document process and results

## Resources

**Frameworks**

- Hugging Face Transformers: https://huggingface.co/transformers/
- PEFT (Parameter-Efficient Fine-Tuning): https://github.com/huggingface/peft
- Hugging Face Datasets: https://huggingface.co/datasets

**Models**

- Llama 3.2: https://www.meta.com/llama/
- Gemma 3: https://deepmind.google/technologies/gemma/
- Mistral: https://mistral.ai/

**Papers**

- "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
- "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)