GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp. It enables efficient inference on CPUs, Apple Silicon, and GPUs, with a range of quantization options.

When to use GGUF

Use GGUF when:

- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Running CPU inference with no GPU available
- Choosing among quantization levels (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)

Key advantages:

- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD support
- No Python runtime: Pure C/C++ inference
- Flexible quantization: 2-8 bit with various methods (K-quants)
- Ecosystem support: LM Studio, Ollama, koboldcpp, and more
- imatrix: Importance matrix for better low-bit quality

Use alternatives instead:

- AWQ/GPTQ: Maximum accuracy with calibration on NVIDIA GPUs
- HQQ: Fast calibration-free quantization for HuggingFace
- bitsandbytes: Simple integration with transformers library
- TensorRT-LLM: Production NVIDIA deployment with maximum speed

Quick start

Installation

Clone llama.cpp

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

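Build (CPU)

Recent llama.cpp versions build with CMake rather than make. Binaries are written to build/bin/; the commands in this guide assume you run them from that directory or have it on your PATH.

cmake -B build
cmake --build build --config Release

Build with CUDA (NVIDIA)

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Build with Metal (Apple Silicon)

cmake -B build -DGGML_METAL=ON
cmake --build build --config Release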

Install Python bindings (optional)

pip install llama-cpp-python

Convert model to GGUF

Install requirements

pip install -r requirements.txt

Convert HuggingFace model to GGUF (FP16)

python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

Or specify output type

python convert_hf_to_gguf.py ./path/to/model \
  --outfile model-f16.gguf \
  --outtype f16

Quantize model

Basic quantization to Q4_K_M

./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

Quantize with importance matrix (better quality)

./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M

Run inference

CLI inference

./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

Interactive mode

./llama-cli -m model-q4_k_m.gguf --interactive

With GPU offload

./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"

Quantization types

K-quant methods (recommended)

Type     Bits  Size (7B)  Quality    Use case
Q2_K     2.5   ~2.8 GB    Low        Extreme compression
Q3_K_S   3.0   ~3.0 GB    Low-Med    Memory constrained
Q3_K_M   3.3   ~3.3 GB    Medium     Balance
Q4_K_S   4.0   ~3.8 GB    Med-High   Good balance
Q4_K_M   4.5   ~4.1 GB    High       Recommended default
Q5_K_S   5.0   ~4.6 GB    High       Quality focused
Q5_K_M   5.5   ~4.8 GB    Very High  High quality
Q6_K     6.0   ~5.5 GB    Excellent  Near-original
Q8_0     8.0   ~7.2 GB    Best       Maximum quality

Legacy methods

Type  Description
Q4_0  4-bit, basic
Q4_1  4-bit with delta
Q5_0  5-bit, basic
Q5_1  5-bit with delta

Recommendation: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
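The sizes in the table above can be roughly reproduced from the nominal bit widths. The sketch below is a back-of-the-envelope estimate, not how llama.cpp computes file sizes: it assumes every weight is stored at the nominal bits-per-weight and ignores metadata and tensors kept at higher precision, so real files come out somewhat larger. The function name `estimate_gguf_size_gb` is made up for this example.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB (10^9 bytes) for a quantized model.

    Assumption: all weights use the nominal bits-per-weight; metadata and
    higher-precision tensors are ignored, so this is a lower-bound estimate.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Nominal bits per weight, copied from the table above.
BITS = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.0}

if __name__ == "__main__":
    for name, bpw in BITS.items():
        print(f"{name}: ~{estimate_gguf_size_gb(7e9, bpw):.1f} GB")
```

For a 7B model at 4.5 bits per weight this gives about 3.9 GB, slightly under the ~4.1 GB listed for Q4_K_M, which is the expected direction of the error.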

Conversion workflows

Workflow 1: HuggingFace to GGUF

1. Download model

huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

2. Convert to GGUF (FP16)

python convert_hf_to_gguf.py ./llama-3.1-8b \
  --outfile llama-3.1-8b-f16.gguf \
  --outtype f16

3. Quantize

./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

4. Test

./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50

Workflow 2: With importance matrix (better quality)

1. Convert to GGUF

python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

2. Create calibration text (diverse samples)

cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF

3. Generate importance matrix

./llama-imatrix -m model-f16.gguf \
  -f calibration.txt \
  --chunk 512 \
  -o model.imatrix \
  -ngl 35  # GPU layers if available

4. Quantize with imatrix

./llama-quantize --imatrix model.imatrix \
  model-f16.gguf \
  model-q4_k_m.gguf \
  Q4_K_M

Workflow 3: Multiple quantizations

#!/bin/bash

MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once
./llama-imatrix -m "$MODEL" -f wiki.txt -o "$IMATRIX" -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  # Lowercase with tr; ${QUANT,,} needs bash 4+, which stock macOS lacks
  OUTPUT="llama-3.1-8b-$(echo "$QUANT" | tr '[:upper:]' '[:lower:]').gguf"
  ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" "$QUANT"
  echo "Created: $OUTPUT ($(du -h "$OUTPUT" | cut -f1))"
done
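If you would rather drive the batch from Python, the loop above can be sketched as a small planner that only builds the output names and command lines. It uses the same `llama-quantize` arguments shown in this guide; the helper name `plan_quantizations` and the assumption that the input file name ends in `-f16.gguf` are illustrative choices, not part of llama.cpp.

```python
from pathlib import Path


def plan_quantizations(f16_path, imatrix_path, quant_types,
                       binary="./llama-quantize"):
    """Return one (output_name, argv) pair per quantization type.

    Assumption: the FP16 file is named '<stem>-f16.gguf', as in this guide.
    Nothing is executed here; the caller decides how to run each argv.
    """
    stem = Path(f16_path).name.removesuffix("-f16.gguf")
    plan = []
    for quant in quant_types:
        output = f"{stem}-{quant.lower()}.gguf"
        argv = [binary, "--imatrix", imatrix_path, f16_path, output, quant]
        plan.append((output, argv))
    return plan


if __name__ == "__main__":
    for output, argv in plan_quantizations(
        "llama-3.1-8b-f16.gguf",
        "llama-3.1-8b.imatrix",
        ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"],
    ):
        print(output, "<-", " ".join(argv))
```

Each `argv` can be passed to `subprocess.run(argv, check=True)` once the imatrix file exists. Keeping planning separate from execution makes it easy to inspect the commands before committing to a long run.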

Python usage

llama-cpp-python

from llama_cpp import Llama

Load model

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,        # Context window
    n_gpu_layers=35,   # GPU offload (0 for CPU only)
    n_threads=8,       # CPU threads
)

Generate

output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"],  # End-of-sequence marker or a blank line
)
print(output["choices"][0]["text"])
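The `stop` parameter deserves a quick illustration. Conceptually, the returned text ends just before the first occurrence of any stop string, such as an end-of-sequence marker like `</s>` or a blank line. The sketch below shows that behaviour on a plain string; it illustrates the semantics under that assumption and is not llama-cpp-python's actual implementation. `truncate_at_stop` is a hypothetical helper.

```python
def truncate_at_stop(text, stop):
    """Cut text at the earliest occurrence of any stop string.

    Empty strings are skipped: they would match at index 0 and
    truncate everything.
    """
    cut = len(text)
    for marker in stop:
        if not marker:
            continue
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


if __name__ == "__main__":
    raw = "Machine learning is a field of AI.\n\nUnrelated second paragraph"
    print(truncate_at_stop(raw, ["</s>", "\n\n"]))
```

One practical consequence: an empty string in the `stop` list is never useful, so check that template placeholders or stripped tags have not left one behind.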

Chat completion

from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3",  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])

Streaming

from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

Stream tokens

for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
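When you need the full text as well as live output, the streamed chunks can be accumulated as they arrive. The sketch below assumes each chunk has the `chunk["choices"][0]["text"]` shape used above; `fake_stream` is a stand-in generator so the example runs without a model, and both function names are made up for illustration.

```python
def collect_stream(chunks):
    """Join the text pieces of a completion stream into one string."""
    parts = []
    for chunk in chunks:
        parts.append(chunk["choices"][0]["text"])
    return "".join(parts)


def fake_stream(pieces):
    """Stand-in for llm(..., stream=True): yields chunks of the same shape."""
    for piece in pieces:
        yield {"choices": [{"text": piece}]}


if __name__ == "__main__":
    print(collect_stream(fake_stream(["Quantum", " computing", " uses qubits."])))
```

With a real model, pass the generator returned by `llm(prompt, stream=True)` in place of `fake_stream(...)`.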

Server mode

Start OpenAI-compatible server

Start server

./llama-server -m model-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35 \
  -c 4096

Or with Python bindings

python -m llama_cpp.server \
  --model model-q4_k_m.gguf \
  --n_gpu_layers 35 \
  --host 0.0.0.0 \
  --port 8080

Use with OpenAI client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
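The same request can be made without the `openai` package, since the server speaks plain HTTP and JSON. The sketch below only builds the request for the `/v1/chat/completions` endpoint with the standard library; the helper name `build_chat_request` is hypothetical, and the base URL and model name are the ones used above.

```python
import json
import urllib.request


def build_chat_request(base_url, user_message, max_tokens=256,
                       model="local-model"):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("http://localhost:8080/v1", "Hello!")
    print(req.get_method(), req.full_url)
```

To send it against a running server, use `with urllib.request.urlopen(req) as resp: body = json.load(resp)` and read `body["choices"][0]["message"]["content"]`.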

Hardware optimization

Apple Silicon (Metal)

Build with Metal (binaries are written to build/bin/)

cmake -B build -DGGML_METAL=ON && cmake --build build --config Release

Run with Metal acceleration

./llama-cli -m model.gguf -ngl 99 -p "Hello"

Python with Metal

llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1,      # Metal handles parallelism
)

NVIDIA CUDA

Build with CUDA (binaries are written to build/bin/)

cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release

Run with CUDA

./llama-cli -m model.gguf -ngl 35 -p "Hello"

Specify GPU

CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35

CPU optimization

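Build for CPU (AVX2/AVX-512 are used when the host CPU supports them; binaries are written to build/bin/)

cmake -B build && cmake --build build --config Release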

Run with optimal threads

./llama-cli -m model.gguf -t 8 -p "Hello"

Python CPU config

llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512,     # Batch size for prompt processing
)
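Rather than hard-coding `n_threads=8`, a starting value can be derived from the machine. `os.cpu_count()` reports logical cores, so the sketch below divides by an assumed number of hardware threads per core. The default of 2 is an assumption that fits typical SMT/Hyper-Threading desktops; pass `threads_per_core=1` on CPUs without SMT, such as Apple Silicon. `suggest_n_threads` is a hypothetical helper.

```python
import os


def suggest_n_threads(logical_cores=None, threads_per_core=2):
    """Estimate the physical core count from the logical core count.

    Assumption: every core exposes `threads_per_core` hardware threads.
    """
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1
    return max(1, logical_cores // threads_per_core)


if __name__ == "__main__":
    print("Suggested n_threads:", suggest_n_threads())
```

Treat the result as a first guess and benchmark a few neighbouring values; hybrid CPUs with mixed core types often do not follow the simple division.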

Integration with tools

Ollama

Create Modelfile

cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }} {{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

Create Ollama model

ollama create mymodel -f Modelfile

Run

ollama run mymodel "Hello!"
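When you register several GGUF files with Ollama, generating the Modelfile text from a template avoids copy-paste mistakes. The sketch below emits only the directives used above (`FROM`, `TEMPLATE`, `PARAMETER`); the function name `render_modelfile` and its defaults are illustrative assumptions.

```python
def render_modelfile(gguf_path, temperature=0.7, num_ctx=4096):
    """Return Modelfile text for a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",
        'TEMPLATE """{{ .System }} {{ .Prompt }}"""',
        f"PARAMETER temperature {temperature}",
        f"PARAMETER num_ctx {num_ctx}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_modelfile("./model-q4_k_m.gguf"), end="")
```

Write the returned string to a file named `Modelfile`, then run `ollama create mymodel -f Modelfile` as shown above.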

LM Studio

1. Place GGUF file in ~/.cache/lm-studio/models/
2. Open LM Studio and select the model
3. Configure context length and GPU offload
4. Start inference

text-generation-webui

Place in models folder

cp model-q4_k_m.gguf text-generation-webui/models/

Start with llama.cpp loader

python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35

Best practices

1. Use K-quants: Q4_K_M offers best quality/size balance
2. Use imatrix: Always use importance matrix for Q4 and below
3. GPU offload: Offload as many layers as VRAM allows
4. Context length: Start with 4096, increase if needed
5. Thread count: Match physical CPU cores, not logical
6. Batch size: Increase n_batch for faster prompt processing

Common issues

Model loads slowly:

Memory mapping (mmap) is enabled by default and speeds up loading; make sure --no-mmap is not set

./llama-cli -m model.gguf -p "Hello"

Out of memory:

Reduce GPU layers

./llama-cli -m model.gguf -ngl 20 # Reduce from 35

Or use smaller quantization

./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
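To pick a starting `-ngl` value instead of lowering it by trial and error, you can estimate how many layers fit in VRAM. The sketch below is a rough heuristic under stated assumptions: all layers are the same size, and a fixed `reserve_gb` (a guessed 1.5 GB here) is set aside for the KV cache and scratch buffers. `estimate_gpu_layers` is a hypothetical helper, not a llama.cpp feature.

```python
import math


def estimate_gpu_layers(model_size_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers can be offloaded to the GPU.

    Assumptions: layers are equally sized, and `reserve_gb` of VRAM is
    held back for the KV cache and buffers.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = vram_gb - reserve_gb
    if usable_gb <= 0:
        return 0
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # A ~4.1 GB Q4_K_M file with 32 layers on a 4 GB card.
    print(estimate_gpu_layers(4.1, 32, 4.0))
```

If the model still runs out of memory at the estimated value, raise `reserve_gb`; longer contexts need a larger reserve.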

Poor quality at low bits:

Always use imatrix for Q4 and below