High-performance inference engines for serving large language models.
## Engine Comparison

| Engine | Best For | Hardware | Throughput | Setup |
|---|---|---|---|---|
| vLLM | Production serving | GPU | Highest | Medium |
| llama.cpp | Local/edge, CPU | CPU/GPU | Good | Easy |
| TGI | HuggingFace models | GPU | High | Easy |
| Ollama | Local desktop | CPU/GPU | Good | Easiest |
| TensorRT-LLM | NVIDIA production | NVIDIA GPU | Highest | Complex |
## Decision Guide

| Use Case | Recommendation |
|---|---|
| Production API server | vLLM or TGI |
| Maximum throughput | vLLM |
| Local development | Ollama or llama.cpp |
| CPU-only deployment | llama.cpp |
| Edge/embedded | llama.cpp |
| Apple Silicon | llama.cpp with Metal |
| Quick experimentation | Ollama |
| Privacy-sensitive (no cloud) | llama.cpp |
## vLLM
Production-grade serving with PagedAttention for optimal GPU memory usage.
### Key Innovations

| Technique | Description |
|---|---|
| PagedAttention | Non-contiguous KV cache for better memory utilization |
| Continuous batching | Dynamic request grouping for throughput |
| Speculative decoding | Small model drafts, large model verifies |
**Strengths:** Highest throughput, OpenAI-compatible API, multi-GPU support.
**Limitations:** GPU required, more complex setup.

**Key concept:** vLLM serves OpenAI-compatible endpoints, so it works as a drop-in replacement for the OpenAI API.
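For illustration, here is a minimal sketch of calling a locally running vLLM server through the official `openai` Python client. The launch command, port, and model name are assumptions based on vLLM's defaults, not part of this guide; substitute whatever model you actually serve.

```python
# Assumes a vLLM server was started separately, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint under /v1; the api_key is
# required by the client but ignored by a default local vLLM setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
    temperature=0.2,
)
print(response.choices[0].message.content)
```

Because the endpoint mirrors the OpenAI API, existing client code usually only needs its `base_url` changed.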
## llama.cpp
C++ inference for running models anywhere—laptops, phones, Raspberry Pi.
### Quantization Formats (GGUF)

| Format | Size (7B model) | Quality | Use When |
|---|---|---|---|
| Q8_0 | ~7 GB | Highest | You have RAM to spare |
| Q6_K | ~6 GB | High | Good balance |
| Q5_K_M | ~5 GB | Good | Balanced |
| Q4_K_M | ~4 GB | OK | Memory constrained |
| Q2_K | ~2.5 GB | Low | Minimum viable |
**Recommendation:** Q4_K_M for the best quality/size balance.
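As an illustration of that recommendation, the sketch below downloads a Q4_K_M GGUF file and loads it with `llama-cpp-python`. The Hugging Face repo and filename are placeholders chosen as an example; substitute any GGUF repo and quant level that fits your RAM.

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Example repo/filename only; pick any GGUF model and the quant that fits your RAM.
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
)

# Load the quantized model with a modest context window.
llm = Llama(model_path=model_path, n_ctx=4096)
```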
### Memory Requirements

| Model | Quantized Size | Recommended RAM |
|---|---|---|
| 7B | ~4 GB | 8 GB |
| 13B | ~7 GB | 16 GB |
| 30B | ~17 GB | 32 GB |
| 70B | ~38 GB | 64 GB |
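These figures follow roughly from parameters × bits-per-weight ÷ 8. A quick back-of-the-envelope sketch; the ~4.5 bits-per-weight figure for Q4_K_M is an approximation, not an exact spec:

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough model-file estimate: parameters * bits-per-weight / 8 bytes.
    Real GGUF files differ slightly (mixed-precision layers, metadata)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Q4_K_M averages roughly 4.5 bits per weight (approximate figure)
for size in (7, 13, 30, 70):
    print(f"{size}B @ ~4.5 bpw = {quantized_size_gb(size, 4.5):.1f} GB")
# 7B -> ~3.9 GB, 13B -> ~7.3 GB, 30B -> ~16.9 GB, 70B -> ~39.4 GB,
# in line with the table above; leave headroom for the KV cache and OS.
```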
### Platform Optimization

| Platform | Recommended Settings |
|---|---|
| Apple Silicon | `n_gpu_layers=-1` (Metal offload) |
| CUDA GPU | `n_gpu_layers=-1` + `offload_kqv=True` |
| CPU only | `n_gpu_layers=0` + set `n_threads` to core count |
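A minimal sketch of how those settings map onto the `llama-cpp-python` constructor; the model path is a placeholder, and using the OS-reported core count for `n_threads` is an assumption, not a rule from this guide.

```python
import os
from llama_cpp import Llama

# Apple Silicon / CUDA: offload every layer to the GPU.
gpu_llm = Llama(
    model_path="model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,                 # -1 = offload all layers (Metal or CUDA build)
    offload_kqv=True,                # keep the KV cache on the GPU
    n_ctx=4096,
)

# CPU only: keep layers on the CPU and match threads to core count.
# (In practice you would create only one of these, not both.)
cpu_llm = Llama(
    model_path="model.Q4_K_M.gguf",
    n_gpu_layers=0,
    n_threads=os.cpu_count(),        # assumed default: one thread per core
    n_ctx=2048,
)
```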
**Strengths:** Runs anywhere, GGUF format, Metal/CUDA support.
**Limitations:** Lower throughput than vLLM, single-user focused.

**Key concept:** GGUF format plus quantization lets you run large models on consumer hardware.
## Key Optimization Concepts

| Technique | What It Does | When to Use |
|---|---|---|
| KV Cache | Reuse attention computations | Always (automatic) |
| Continuous Batching | Group requests dynamically | High-throughput serving |
| Tensor Parallelism | Split model across GPUs | Large models |
| Quantization | Reduce precision (fp16 → int4) | Memory constrained |
| Speculative Decoding | Small model drafts, large verifies | Latency sensitive |
| GPU Offloading | Move layers to GPU | When GPU available |
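As a sketch of how a few of these concepts surface in practice, the example below uses vLLM's offline `LLM` API with tensor parallelism and a batched list of prompts; the model name and two-GPU setup are assumptions for illustration.

```python
from vllm import LLM, SamplingParams

# Continuous batching is handled internally: pass many prompts at once and
# vLLM schedules them together on the GPU(s).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    tensor_parallel_size=2,                    # split weights across 2 GPUs (assumed setup)
)

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["What is PagedAttention?", "Explain KV caching."], params)
for out in outputs:
    print(out.outputs[0].text)
```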
## Common Parameters

| Parameter | Purpose | Typical Values |
|---|---|---|
| `n_ctx` | Context window size | 2048-8192 |
| `n_gpu_layers` | Layers to offload | -1 (all) or 0 (none) |
| `temperature` | Randomness | 0.0-1.0 |
| `max_tokens` | Output limit | 100-2000 |
| `n_threads` | CPU threads | Match core count |
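A short sketch of where these parameters go in `llama-cpp-python`: context and offload settings are set at load time, while sampling controls are per request. The model path and prompt are placeholders.

```python
from llama_cpp import Llama

# Load-time parameters: context window and GPU offload.
llm = Llama(
    model_path="model.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,
    n_gpu_layers=-1,                 # offload all layers if a GPU build is available
)

# Per-request parameters: temperature and max_tokens.
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three uses for a Raspberry Pi."}],
    temperature=0.7,
    max_tokens=300,
)
print(result["choices"][0]["message"]["content"])
```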
## Troubleshooting

| Problem | Fix |
|---|---|
| Out of memory | Reduce `n_ctx`, use a smaller quant |
| Slow inference | Enable GPU offload, use a faster quant |
| Model won't load | Check GGUF file integrity, check RAM |
| Metal not working | Reinstall with `-DLLAMA_METAL=on` |
| Poor quality | Use a higher quant (Q5_K_M, Q6_K) |
## Resources

- vLLM: https://docs.vllm.ai
- llama.cpp: https://github.com/ggerganov/llama.cpp
- Ollama: https://ollama.ai
- GGUF Models: https://huggingface.co/TheBloke