# Ollama Optimizer

Optimize Ollama configuration based on system hardware analysis.

## Repo Sync Before Edits (mandatory)

Before writing any output files, sync with the remote to avoid conflicts:

```bash
branch="$(git rev-parse --abbrev-ref HEAD)"
git fetch origin
git pull --rebase origin "$branch"
```

If the working tree is dirty, stash first (`git stash`), sync, then pop (`git stash pop`). If origin is missing or conflicts occur, stop and ask the user before continuing.

## Workflow

### Phase 1: System Detection

Run the detection script to gather hardware information:

```bash
python3 scripts/detect_system.py
```

Parse the JSON output to identify:

- OS and version
- CPU model and core count
- Total RAM / unified memory
- GPU type, VRAM, and driver version
- Current Ollama installation and environment variables

### Phase 2: Analyze and Recommend

Based on detected hardware, determine the optimization profile:

**Hardware Tier Classification:**

| Tier | Criteria | Max Model | Key Optimizations |
|------|----------|-----------|-------------------|
| CPU-only | No GPU detected | 3B | num_thread tuning, Q4_K_M quant |
| Low VRAM | <6GB VRAM | 3B | Flash attention, KV cache q4_0 |
| Entry | 6-8GB VRAM | 8B | Flash attention, KV cache q8_0 |
| Prosumer | 10-12GB VRAM | 14B | Flash attention, full offload |
| Workstation | 16-24GB VRAM | 32B | Standard config, Q5_K_M option |
| High-end | 48GB+ VRAM | 70B+ | Multiple models, Q5/Q6 quants |

**Apple Silicon Special Case:**

- Unified memory = shared CPU/GPU RAM
- 8GB Mac → treat as 6GB VRAM tier
- 16GB Mac → treat as 12GB VRAM tier
- 32GB+ Mac → treat as workstation tier

### Phase 3: Generate Optimization Plan

Create a structured optimization guide with these sections:

**1. System Overview**

Present detected hardware specs and highlight constraints (e.g., "8GB unified memory limits you to 7B models").

**2. Dependency Assessment**

List what's needed based on the platform:

- macOS: Ollama only (Metal automatic)
- Linux NVIDIA: Ollama + NVIDIA driver 450+
- Linux AMD: Ollama + ROCm 5.0+
- Windows: Ollama + NVIDIA driver 452+

**3. Configuration Recommendations**

Essential environment variables:
```bash
# Always recommended
export OLLAMA_FLASH_ATTENTION=1

# Memory-constrained systems (<12GB); use q4_0 for severe constraints
export OLLAMA_KV_CACHE_TYPE=q8_0
```
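The Phase 2 tier table plus the Apple Silicon rules can be sketched as a small classifier. This is an illustrative sketch, not part of the repo; the inputs mimic fields the detection JSON might carry (the field names and the 0.75 fallback ratio for unlisted Mac RAM sizes are assumptions):

```python
def classify_tier(vram_gb, apple_silicon=False, ram_gb=0):
    """Map detected hardware to the tier table in Phase 2 (sketch)."""
    if apple_silicon:
        # Unified memory: 8GB -> 6GB tier, 16GB -> 12GB tier, 32GB+ -> workstation
        if ram_gb >= 32:
            return "Workstation"
        vram_gb = {8: 6, 16: 12}.get(int(ram_gb), ram_gb * 0.75)
    if vram_gb is None:
        return "CPU-only"
    if vram_gb < 6:
        return "Low VRAM"
    if vram_gb <= 8:
        return "Entry"
    if vram_gb <= 12:
        return "Prosumer"
    if vram_gb <= 24:
        return "Workstation"
    return "High-end"

print(classify_tier(None))                                 # CPU-only
print(classify_tier(8))                                    # Entry
print(classify_tier(None, apple_silicon=True, ram_gb=16))  # Prosumer
```

The table leaves gaps (e.g., 9GB, 13-15GB); the sketch resolves them to the next tier down, which errs on the conservative side.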
**Model selection guidance:**

- Recommend specific models based on `ollama list` output
- Suggest appropriate quantization (Q4_K_M default, Q5_K_M if headroom exists)
- Warn if current models exceed hardware capacity
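To make "warn if current models exceed hardware capacity" concrete: weight memory is roughly parameters × bits-per-weight / 8, plus overhead for KV cache and runtime buffers. The ~20% overhead factor below is a rough heuristic for illustration, not an official Ollama formula:

```python
def estimated_gb(params_billions, quant_bits=4, overhead=1.2):
    """Rough memory estimate: weights at quant_bits per parameter,
    plus ~20% for KV cache, activations, and buffers (heuristic)."""
    return params_billions * quant_bits / 8 * overhead

def fits(params_billions, vram_gb, quant_bits=4):
    """Warn-check: does the model plausibly fit in available (V)RAM?"""
    return estimated_gb(params_billions, quant_bits) <= vram_gb

# An 8B model at Q4 needs roughly 8 * 4/8 * 1.2 = 4.8 GB
print(round(estimated_gb(8, 4), 1))  # 4.8
print(fits(8, 8))                    # True
print(fits(14, 8))                   # False
```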
**Modelfile tuning (when needed):**

```
PARAMETER num_gpu
```
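`num_gpu` controls how many layers Ollama offloads to the GPU. A tuned Modelfile can be generated and then registered with `ollama create`; the base model and layer count below are hypothetical examples, not recommendations:

```python
from pathlib import Path

# Hypothetical values: partial offload of 24 layers from llama3.1:8b
modelfile = """FROM llama3.1:8b
PARAMETER num_gpu 24
"""
Path("Modelfile.tuned").write_text(modelfile)
# Then register the variant with:
#   ollama create llama3.1-tuned -f Modelfile.tuned
```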
```bash
# Benchmark current performance
python3 scripts/benchmark_ollama.py --model <model>

# Check GPU memory usage (NVIDIA)
nvidia-smi

# Verify config is applied
ollama run <model> "test" --verbose 2>&1 | head -20
```

## Reference Files

- VRAM Requirements - Model sizing and quantization guide
- Environment Variables - Complete env var reference
- Platform-Specific Setup - OS-specific installation and configuration

## Output Format

Generate an `ollama-optimization-guide.md` file in the current directory with:
```markdown
# Ollama Optimization Guide

**Generated:** <timestamp>
**System:** <OS> | <CPU> | <RAM>GB RAM | <GPU>

## System Overview
<hardware summary and constraints>

## Current Configuration
<existing Ollama setup and env vars>

## Recommendations

### Environment Variables
<shell commands to set vars>

### Model Selection
<recommended models with rationale>

### Performance Tuning
<Modelfile adjustments if needed>

## Execution Checklist
- [ ] <step 1>
- [ ] <step 2>
...

## Verification
<benchmark commands and expected results>

## Rollback
<commands to revert changes if needed>
```
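Filling the template header from detection results can be sketched as plain string formatting. The `system` dict below is placeholder data; in practice the values come from `scripts/detect_system.py`:

```python
from datetime import datetime, timezone

# Placeholder detection results (illustrative values only)
system = {"os": "macOS 15.1", "cpu": "Apple M2 (8 cores)",
          "ram_gb": 16, "gpu": "Apple M2 (unified memory)"}

header = (
    "# Ollama Optimization Guide\n\n"
    f"**Generated:** {datetime.now(timezone.utc):%Y-%m-%d %H:%M UTC}\n"
    f"**System:** {system['os']} | {system['cpu']} | "
    f"{system['ram_gb']}GB RAM | {system['gpu']}\n"
)
print(header)
```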
## Quick Optimization Commands

For users who want immediate results without full analysis:

**macOS (Apple Silicon):**

```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.2:3b  # Safe for 8GB, fast
```

**Linux/Windows with 8GB NVIDIA GPU:**

```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.1:8b-instruct-q4_K_M
```

**CPU-only systems:**

```bash
export CUDA_VISIBLE_DEVICES=-1
ollama pull llama3.2:3b
```
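After pulling a model on an NVIDIA system, remaining headroom can be checked programmatically. `--query-gpu=memory.used --format=csv,noheader,nounits` are real `nvidia-smi` options that print one MiB value per GPU; the `sample` parameter lets the sketch run without a GPU:

```python
import subprocess

def gpu_memory_used_mib(sample=None):
    """Parse nvidia-smi's per-GPU memory.used values (in MiB).
    Pass `sample` (canned output) to run without a GPU present."""
    if sample is None:
        sample = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"], text=True)
    return [int(line.strip()) for line in sample.splitlines() if line.strip()]

print(gpu_memory_used_mib(sample="4523\n"))  # [4523]
```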