# Ollama Optimizer

Optimize Ollama configuration based on system hardware analysis.

## Repo Sync Before Edits (mandatory)

Before writing any output files, sync with the remote to avoid conflicts:

```bash
branch="$(git rev-parse --abbrev-ref HEAD)"
git fetch origin
git pull --rebase origin "$branch"
```

If the working tree is dirty, stash first (`git stash`), sync, then pop (`git stash pop`). If origin is missing or conflicts occur, stop and ask the user before continuing.

## Workflow

### Phase 1: System Detection

Run the detection script to gather hardware information:

```bash
python3 scripts/detect_system.py
```

Parse the JSON output to identify:

- OS and version
- CPU model and core count
- Total RAM / unified memory
- GPU type, VRAM, and driver version
- Current Ollama installation and environment variables

### Phase 2: Analyze and Recommend

Based on detected hardware, determine the optimization profile:

**Hardware Tier Classification:**

| Tier | Criteria | Max Model | Key Optimizations |
|------|----------|-----------|-------------------|
| CPU-only | No GPU detected | 3B | num_thread tuning, Q4_K_M quant |
| Low VRAM | <6GB VRAM | 3B | Flash attention, KV cache q4_0 |
| Entry | 6-8GB VRAM | 8B | Flash attention, KV cache q8_0 |
| Prosumer | 10-12GB VRAM | 14B | Flash attention, full offload |
| Workstation | 16-24GB VRAM | 32B | Standard config, Q5_K_M option |
| High-end | 48GB+ VRAM | 70B+ | Multiple models, Q5/Q6 quants |

**Apple Silicon Special Case:**

- Unified memory = shared CPU/GPU RAM
- 8GB Mac → treat as 6GB VRAM tier
- 16GB Mac → treat as 12GB VRAM tier
- 32GB+ Mac → treat as workstation tier

### Phase 3: Generate Optimization Plan

Create a structured optimization guide with these sections:

**1. System Overview**

Present detected hardware specs and highlight constraints (e.g., "8GB unified memory limits you to 7B models").

**2. Dependency Assessment**

List what's needed based on the platform:

- macOS: Ollama only (Metal automatic)
- Linux NVIDIA: Ollama + NVIDIA driver 450+
- Linux AMD: Ollama + ROCm 5.0+
- Windows: Ollama + NVIDIA driver 452+

**3. Configuration Recommendations**

Essential environment variables:
```bash
# Always recommended
export OLLAMA_FLASH_ATTENTION=1

# Memory-constrained systems (<12GB); use q4_0 for severe constraints
export OLLAMA_KV_CACHE_TYPE=q8_0
```
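The Phase 2 tier table plus the Apple Silicon rules can be sketched as a small classifier. This is an illustrative sketch, not part of the repo; the inputs mimic fields the detection JSON might carry (the field names and the 0.75 fallback ratio for unlisted Mac RAM sizes are assumptions):

```python
def classify_tier(vram_gb, apple_silicon=False, ram_gb=0):
    """Map detected hardware to the tier table in Phase 2 (sketch)."""
    if apple_silicon:
        # Unified memory: 8GB -> 6GB tier, 16GB -> 12GB tier, 32GB+ -> workstation
        if ram_gb >= 32:
            return "Workstation"
        vram_gb = {8: 6, 16: 12}.get(int(ram_gb), ram_gb * 0.75)
    if vram_gb is None:
        return "CPU-only"
    if vram_gb < 6:
        return "Low VRAM"
    if vram_gb <= 8:
        return "Entry"
    if vram_gb <= 12:
        return "Prosumer"
    if vram_gb <= 24:
        return "Workstation"
    return "High-end"

print(classify_tier(None))                                 # CPU-only
print(classify_tier(8))                                    # Entry
print(classify_tier(None, apple_silicon=True, ram_gb=16))  # Prosumer
```

The table leaves gaps (e.g., 9GB, 13-15GB); the sketch resolves them to the next tier down, which errs on the conservative side.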
**Model selection guidance:**

- Recommend specific models based on `ollama list` output
- Suggest appropriate quantization (Q4_K_M default, Q5_K_M if headroom exists)
- Warn if current models exceed hardware capacity
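To make "warn if current models exceed hardware capacity" concrete: weight memory is roughly parameters × bits-per-weight / 8, plus overhead for KV cache and runtime buffers. The ~20% overhead factor below is a rough heuristic for illustration, not an official Ollama formula:

```python
def estimated_gb(params_billions, quant_bits=4, overhead=1.2):
    """Rough memory estimate: weights at quant_bits per parameter,
    plus ~20% for KV cache, activations, and buffers (heuristic)."""
    return params_billions * quant_bits / 8 * overhead

def fits(params_billions, vram_gb, quant_bits=4):
    """Warn-check: does the model plausibly fit in available (V)RAM?"""
    return estimated_gb(params_billions, quant_bits) <= vram_gb

# An 8B model at Q4 needs roughly 8 * 4/8 * 1.2 = 4.8 GB
print(round(estimated_gb(8, 4), 1))  # 4.8
print(fits(8, 8))                    # True
print(fits(14, 8))                   # False
```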
**Modelfile tuning (when needed):**

```
PARAMETER num_gpu
```
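`num_gpu` controls how many layers Ollama offloads to the GPU. A tuned Modelfile can be generated and then registered with `ollama create`; the base model and layer count below are hypothetical examples, not recommendations:

```python
from pathlib import Path

# Hypothetical values: partial offload of 24 layers from llama3.1:8b
modelfile = """FROM llama3.1:8b
PARAMETER num_gpu 24
"""
Path("Modelfile.tuned").write_text(modelfile)
# Then register the variant with:
#   ollama create llama3.1-tuned -f Modelfile.tuned
```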
```bash
# Benchmark current performance
python3 scripts/benchmark_ollama.py --model <model>

# Check GPU memory usage (NVIDIA)
nvidia-smi

# Verify config is applied
ollama run <model> "test" --verbose 2>&1 | head -20
```

## Reference Files

- VRAM Requirements - Model sizing and quantization guide
- Environment Variables - Complete env var reference
- Platform-Specific Setup - OS-specific installation and configuration

## Output Format

Generate an `ollama-optimization-guide.md` file in the current directory with:
```markdown
# Ollama Optimization Guide

**Generated:** <timestamp>
**System:** <OS> | <CPU> | <RAM>GB RAM | <GPU>

## System Overview
<hardware summary and constraints>

## Current Configuration
<existing Ollama setup and env vars>

## Recommendations

### Environment Variables
<shell commands to set vars>

### Model Selection
<recommended models with rationale>

### Performance Tuning
<Modelfile adjustments if needed>

## Execution Checklist
- [ ] <step 1>
- [ ] <step 2>
...

## Verification
<benchmark commands and expected results>

## Rollback
<commands to revert changes if needed>
```
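Filling the template header from detection results can be sketched as plain string formatting. The `system` dict below is placeholder data; in practice the values come from `scripts/detect_system.py`:

```python
from datetime import datetime, timezone

# Placeholder detection results (illustrative values only)
system = {"os": "macOS 15.1", "cpu": "Apple M2 (8 cores)",
          "ram_gb": 16, "gpu": "Apple M2 (unified memory)"}

header = (
    "# Ollama Optimization Guide\n\n"
    f"**Generated:** {datetime.now(timezone.utc):%Y-%m-%d %H:%M UTC}\n"
    f"**System:** {system['os']} | {system['cpu']} | "
    f"{system['ram_gb']}GB RAM | {system['gpu']}\n"
)
print(header)
```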
## Quick Optimization Commands

For users who want immediate results without full analysis:

**macOS (Apple Silicon):**

```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.2:3b  # Safe for 8GB, fast
```

**Linux/Windows with 8GB NVIDIA GPU:**

```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.1:8b-instruct-q4_K_M
```

**CPU-only systems:**

```bash
export CUDA_VISIBLE_DEVICES=-1
ollama pull llama3.2:3b
```
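After pulling a model on an NVIDIA system, remaining headroom can be checked programmatically. `--query-gpu=memory.used --format=csv,noheader,nounits` are real `nvidia-smi` options that print one MiB value per GPU; the `sample` parameter lets the sketch run without a GPU:

```python
import subprocess

def gpu_memory_used_mib(sample=None):
    """Parse nvidia-smi's per-GPU memory.used values (in MiB).
    Pass `sample` (canned output) to run without a GPU present."""
    if sample is None:
        sample = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"], text=True)
    return [int(line.strip()) for line in sample.splitlines() if line.strip()]

print(gpu_memory_used_mib(sample="4523\n"))  # [4523]
```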