awq-quantization

Install

```bash
npx skills add https://github.com/davila7/claude-code-templates --skill awq-quantization
```

AWQ (Activation-aware Weight Quantization)

4-bit quantization that preserves salient weights based on activation patterns, achieving 3x speedup with minimal accuracy loss.

When to use AWQ

Use AWQ when:

- Need 4-bit quantization with <5% accuracy loss
- Deploying instruction-tuned or chat models (AWQ generalizes better)
- Want ~2.5-3x inference speedup over FP16
- Using vLLM for production serving
- Have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support

Use GPTQ instead when:

- Need maximum ecosystem compatibility (more tools support GPTQ)
- Working with the ExLlamaV2 backend specifically
- Have older GPUs without Marlin support

Use bitsandbytes instead when:

- Need zero calibration overhead (quantize on-the-fly)
- Want to fine-tune with QLoRA
- Prefer simpler integration

Quick start

Installation

```bash
# Default (Triton kernels)
pip install autoawq

# With optimized CUDA kernels + Flash Attention
pip install autoawq[kernels]

# Intel CPU/XPU optimization
pip install autoawq[cpu]
```

Requirements: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+
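If you are not sure whether a machine meets these requirements, a quick check along the following lines can help. This is a minimal sketch using standard PyTorch device queries, with the thresholds taken from the requirements above:

```python
import sys
import torch

# Quick environment check against the requirements above (sketch, not exhaustive)
print("Python:", sys.version_info[:2], "(need 3.8+)")
print("CUDA runtime:", torch.version.cuda, "(need 11.8+)")

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor} (need 7.5+; 8.0+ for Marlin)")
else:
    print("No CUDA device visible")
```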

Load pre-quantized model

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True  # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Quantize your own model

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,    # Use zero-point quantization
    "q_group_size": 128,   # Group size (128 recommended)
    "w_bit": 4,            # 4-bit weights
    "version": "GEMM"      # GEMM for batch, GEMV for single-token
}

# Quantize (uses the pileval dataset by default)
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")
```

Timing: ~10-15 min for 7B, ~1 hour for 70B models.

AWQ vs GPTQ vs bitsandbytes

| Feature | AWQ | GPTQ | bitsandbytes |
|---|---|---|---|
| Speedup (4-bit) | ~2.5-3x | ~2x | ~1.5x |
| Accuracy loss | <5% | ~5-10% | ~5-15% |
| Calibration | Minimal (128-1K tokens) | More extensive | None |
| Overfitting risk | Low | Higher | N/A |
| Best for | Production inference | GPU inference | Easy integration |
| vLLM support | Native | Yes | Limited |

Key insight: AWQ assumes not all weights are equally important. It protects ~1% of salient weights identified by activation patterns, reducing quantization error without mixed-precision overhead.
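As a rough illustration of that idea (a toy sketch, not the AutoAWQ implementation; the function name, the alpha exponent, and returning a dequantized weight are simplifications for readability), the code below scales weight columns that see large activations before group-wise 4-bit quantization, then folds the inverse scale back so the layer output is unchanged up to quantization error:

```python
import torch

def awq_style_scale_and_quantize(w, x_calib, w_bit=4, group_size=128, alpha=0.5):
    """Toy AWQ-style quantization: protect salient input channels by scaling
    them up before group-wise int4 quantization, then fold the scale back."""
    # Per-input-channel activation magnitude from calibration data: [in_features]
    act_scale = x_calib.abs().mean(dim=0)
    s = act_scale.clamp(min=1e-5).pow(alpha)
    s = s / s.mean()  # keep scales centered around 1

    w_scaled = w * s  # broadcast over input channels: [out, in] * [in]

    # Group-wise asymmetric quantization along the input dimension
    qmax = 2 ** w_bit - 1
    wq = torch.empty_like(w_scaled)
    for g in range(0, w_scaled.shape[1], group_size):
        blk = w_scaled[:, g:g + group_size]
        lo = blk.min(dim=1, keepdim=True).values
        hi = blk.max(dim=1, keepdim=True).values
        scale = (hi - lo).clamp(min=1e-8) / qmax
        q = ((blk - lo) / scale).round().clamp(0, qmax)
        wq[:, g:g + group_size] = q * scale + lo  # dequantized here for clarity

    # Folding the inverse scale back means y = (wq / s) @ x ~= w @ x,
    # with less error on the small fraction of channels that matter most.
    return wq / s
```

Calling this on a random `[out, in]` weight and a `[samples, in]` calibration batch shows the effect: quantization error is pushed away from the channels with the largest activations, which is the core of the AWQ idea.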

Kernel backends

GEMM (default, batch inference)

```python
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"  # Best for batch sizes > 1
}
```

GEMV (single-token generation)

```python
quant_config = {
    "version": "GEMV"  # 20% faster for batch_size=1
}
```

Limitation: supports batch size 1 only; not well suited to long contexts.

Marlin (Ampere+ GPUs)

```python
from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    version="marlin"  # 2x faster on A100/H100
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-AWQ",
    quantization_config=config
)
```

Requirements: Compute Capability 8.0+ (A100, H100, RTX 40xx)

ExLlamaV2 (AMD compatible)

```python
config = AwqConfig(
    bits=4,
    version="exllama"  # Faster prefill, AMD GPU support
)
```

HuggingFace Transformers integration

Direct loading

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
```

Fused modules (recommended)

```python
from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Max sequence length for fusing
    do_fuse=True           # Enable fused attention/MLP
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization_config=config
)
```

Note: fused modules cannot be combined with FlashAttention-2.
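If you need FlashAttention-2 instead (for example for long prompts), skip fusing and request it explicitly when loading; a minimal sketch, assuming the standard transformers attn_implementation argument:

```python
import torch
from transformers import AutoModelForCausalLM

# Either fused AWQ modules OR FlashAttention-2, not both (sketch)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```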

vLLM integration

```python
from vllm import LLM, SamplingParams

# vLLM auto-detects AWQ models
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half"
)

sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)
```

Performance benchmarks

Memory reduction

| Model | FP16 | AWQ 4-bit | Reduction |
|---|---|---|---|
| Mistral 7B | 14 GB | 5.5 GB | 2.5x |
| Llama 2-13B | 26 GB | 10 GB | 2.6x |
| Llama 2-70B | 140 GB | 35 GB | 4x |

Inference speed (RTX 4090)

| Model | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|
| Mistral 7B GEMM | 3,897 | 114 | 5.55 GB |
| TinyLlama 1B GEMV | 5,179 | 431 | 2.10 GB |
| Llama 2-13B GEMM | 2,279 | 74 | 10.28 GB |

Accuracy (perplexity)

| Model | FP16 | AWQ 4-bit | Degradation |
|---|---|---|---|
| Llama 3 8B | 8.20 | 8.48 | +3.4% |
| Mistral 7B | 5.25 | 5.42 | +3.2% |
| Qwen2 72B | 4.85 | 4.95 | +2.1% |

Custom calibration data

```python
# Use a custom dataset for domain-specific models
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="wikitext",   # Or a custom list of strings
    max_calib_samples=256,   # More samples = better accuracy
    max_calib_seq_len=512    # Sequence length
)

# Or provide your own samples
calib_samples = [
    "Your domain-specific text here...",
    "More examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
```
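If your domain data lives in a plain text file, building that list is ordinary file I/O; a minimal sketch where the file name and the one-sample-per-line layout are assumptions about your data:

```python
# Build calibration samples from a local file, one example per line (hypothetical path)
with open("domain_corpus.txt", encoding="utf-8") as f:
    calib_samples = [line.strip() for line in f if len(line.strip()) > 50]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples[:256])
```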

Multi-GPU deployment

```python
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-AWQ",
    device_map="auto",  # Auto-split across GPUs
    max_memory={0: "40GB", 1: "40GB"}
)
```

Supported models

35+ architectures including:

- Llama family: Llama 2/3, Code Llama, Mistral, Mixtral
- Qwen: Qwen, Qwen2, Qwen2.5-VL
- Others: Falcon, MPT, Phi, Yi, DeepSeek, Gemma
- Multimodal: LLaVA, LLaVA-Next, Qwen2-VL

Common issues

CUDA OOM during quantization:

```python
# Reduce the number of calibration samples
model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=64)
```

Slow inference:

```python
# Enable fused layers
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)
```

AMD GPU support:

```python
# Use the ExLlama backend
config = AwqConfig(bits=4, version="exllama")
```

Deprecation notice

AutoAWQ is officially deprecated. For new projects, consider:

- vLLM llm-compressor: https://github.com/vllm-project/llm-compressor
- MLX-LM: for Macs with Apple Silicon

Existing quantized models remain usable.

References

- Paper: AWQ: Activation-aware Weight Quantization (arXiv:2306.00978), MLSys 2024 Best Paper
- GitHub: https://github.com/casper-hansen/AutoAWQ
- MIT Han Lab: https://github.com/mit-han-lab/llm-awq
- Models: https://huggingface.co/models?library=awq
