# hqq-quantization


## Installation

```bash
npx skills add https://github.com/davila7/claude-code-templates --skill hqq-quantization
```

## HQQ - Half-Quadratic Quantization

Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.

## When to use HQQ

**Use HQQ when:**

- Quantizing models without calibration data (no dataset needed)
- Need fast quantization (minutes vs. hours for GPTQ/AWQ)
- Deploying with vLLM or HuggingFace Transformers
- Fine-tuning quantized models with LoRA/PEFT
- Experimenting with extreme quantization (2-bit, 1-bit)

**Key advantages:**

- **No calibration**: Quantize any model instantly without sample data
- **Multiple backends**: PyTorch, ATEN, TorchAO, Marlin, BitBlas for optimized inference
- **Flexible precision**: 8/4/3/2/1-bit with configurable group sizes
- **Framework integration**: Native HuggingFace and vLLM support
- **PEFT compatible**: Fine-tune quantized models with LoRA

**Use alternatives instead:**

- **AWQ**: When you need calibration-based accuracy for production serving
- **GPTQ**: Maximum accuracy when calibration data is available
- **bitsandbytes**: Simple 8-bit/4-bit without custom backends
- **llama.cpp/GGUF**: CPU inference, Apple Silicon deployment

## Quick start

### Installation

```bash
pip install hqq
```

```bash
# With a specific backend
pip install hqq[torch]    # PyTorch backend
pip install hqq[torchao]  # TorchAO int4 backend
pip install hqq[bitblas]  # BitBlas backend
pip install hqq[marlin]   # Marlin backend
```

### Basic quantization

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Configure quantization
config = BaseQuantizeConfig(
    nbits=4,        # 4-bit quantization
    group_size=64,  # Group size for quantization
    axis=1          # Quantize along the output dimension
)

# Quantize a linear layer
linear = nn.Linear(4096, 4096)
hqq_linear = HQQLinear(linear, config)

# Use it like a regular layer; the input should match the layer's
# compute dtype and device (fp16 on CUDA by default)
input_tensor = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
output = hqq_linear(input_tensor)
```

### Quantize a full model with HuggingFace

```python
from transformers import AutoModelForCausalLM, HqqConfig

# Configure HQQ
quantization_config = HqqConfig(
    nbits=4,
    group_size=64,
    axis=1
)

# Load and quantize in one step
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto"
)

# The model is now quantized and ready to use
```
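To confirm which layers were replaced, you can walk the module tree. A minimal sketch (module paths vary by architecture; the check below only assumes that quantized layers are instances of `HQQLinear`, the core class shown later):

```python
import torch.nn as nn
from hqq.core.quantize import HQQLinear

# Count quantized layers vs. nn.Linear layers left in full precision
# (e.g., lm_head is typically not quantized)
n_hqq = sum(isinstance(m, HQQLinear) for m in model.modules())
n_fp = sum(isinstance(m, nn.Linear) and not isinstance(m, HQQLinear)
           for m in model.modules())
print(f"{n_hqq} HQQLinear layers, {n_fp} full-precision nn.Linear layers")
```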

## Core concepts

### Quantization configuration

HQQ uses `BaseQuantizeConfig` to define quantization parameters:

```python
from hqq.core.quantize import BaseQuantizeConfig

# Standard 4-bit config
config_4bit = BaseQuantizeConfig(
    nbits=4,        # Bits per weight (1-8)
    group_size=64,  # Weights per quantization group
    axis=1          # 0 = input dim, 1 = output dim
)

# Aggressive 2-bit config
config_2bit = BaseQuantizeConfig(
    nbits=2,
    group_size=16,  # Smaller groups for low-bit quantization
    axis=1
)

# Mixed precision per layer type
layer_configs = {
    "self_attn.q_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.k_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.v_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "mlp.gate_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.up_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.down_proj": BaseQuantizeConfig(nbits=4, group_size=64),
}
```
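Group size also sets the metadata overhead: every group stores its own scale and zero point, so smaller groups cost extra bits per weight. A back-of-the-envelope estimate, assuming half-precision scale and zero point per group (an illustration only; HQQ can also quantize the metadata itself):

```python
def effective_bits(nbits: int, group_size: int, meta_bits: int = 16) -> float:
    """Approximate storage per weight: payload bits plus the per-group
    scale and zero point amortized across the group (assumed fp16)."""
    return nbits + 2 * meta_bits / group_size

print(effective_bits(4, 64))  # 4.5 bits/weight
print(effective_bits(2, 16))  # 4.0 bits/weight: tiny groups erode 2-bit savings
```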

### HQQLinear layer

`HQQLinear` is the core quantized layer that replaces `nn.Linear`:

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

# Create a quantized layer
linear = torch.nn.Linear(4096, 4096)
hqq_layer = HQQLinear(linear, config)

# Access the quantized representation
W_q = hqq_layer.W_q      # Quantized weights
scale = hqq_layer.scale  # Scale factors
zero = hqq_layer.zero    # Zero points

# Dequantize for inspection
W_dequant = hqq_layer.dequantize()
```
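Dequantizing is a quick way to sanity-check a layer: compare the reconstructed weights against the originals. A minimal sketch, assuming `dequantize()` returns a tensor that can be reshaped to the original weight shape:

```python
# Illustrative per-layer error check; the reshape is an assumption
# about the dequantized tensor's layout
W_orig = linear.weight.data.float().cpu()
W_deq = hqq_layer.dequantize().float().cpu().reshape(W_orig.shape)

rel_err = (W_orig - W_deq).norm() / W_orig.norm()
print(f"Relative reconstruction error: {rel_err.item():.4f}")
```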

### Backends

HQQ supports multiple inference backends for different hardware:

```python
from hqq.core.quantize import HQQLinear

# Available backends
backends = [
    "pytorch",          # Pure PyTorch (default)
    "pytorch_compile",  # torch.compile optimized
    "aten",             # Custom CUDA kernels
    "torchao_int4",     # TorchAO int4 matmul
    "gemlite",          # GemLite CUDA kernels
    "bitblas",          # BitBlas optimized
    "marlin",           # Marlin 4-bit kernels
]

# Set the backend globally
HQQLinear.set_backend("torchao_int4")

# Or per layer
hqq_layer.set_backend("marlin")
```
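The optimized backends depend on optional packages and specific GPUs, so a fallback chain can be handy. A hedged sketch, assuming `set_backend` fails loudly when a backend's dependency is missing (an assumption; verify the actual failure mode against the hqq docs):

```python
from hqq.core.quantize import HQQLinear

# Try the fastest candidates first, fall back to pure PyTorch
for backend in ("marlin", "torchao_int4", "pytorch"):
    try:
        HQQLinear.set_backend(backend)
        print(f"Using backend: {backend}")
        break
    except Exception:
        continue
```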

**Backend selection guide:**

| Backend | Best for | Requirements |
|---|---|---|
| `pytorch` | Compatibility | Any GPU |
| `pytorch_compile` | Moderate speedup | torch>=2.0 |
| `aten` | Good balance | CUDA GPU |
| `torchao_int4` | 4-bit inference | torchao installed |
| `marlin` | Maximum 4-bit speed | Ampere+ GPU |
| `bitblas` | Flexible bit-widths | bitblas installed |

## HuggingFace integration

### Load pre-quantized models

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an HQQ-quantized model from the Hub
model = AutoModelForCausalLM.from_pretrained(
    "mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Use it like any other model
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
```

### Quantize and save

```python
from transformers import AutoModelForCausalLM, HqqConfig

# Quantize
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)

# Save the quantized model locally
model.save_pretrained("./llama-8b-hqq-4bit")

# Or push it to the Hub
model.push_to_hub("my-org/Llama-3.1-8B-HQQ-4bit")
```

### Mixed precision quantization

```python
from transformers import AutoModelForCausalLM, HqqConfig

# Different precision per layer type: keep attention layers at
# higher precision, compress MLP layers harder for memory savings
config = HqqConfig(
    nbits=4,
    group_size=64,
    dynamic_config={
        "attn": {"nbits": 4, "group_size": 64},
        "mlp": {"nbits": 2, "group_size": 32}
    }
)
```

## vLLM integration

### Serve HQQ models with vLLM

```python
from vllm import LLM, SamplingParams

# Load an HQQ-quantized model
llm = LLM(
    model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    quantization="hqq",
    dtype="float16"
)

# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)
print(outputs[0].outputs[0].text)
```

### vLLM with custom HQQ config

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    quantization="hqq",
    quantization_config={
        "nbits": 4,
        "group_size": 64
    }
)
```

## PEFT/LoRA fine-tuning

### Fine-tune quantized models

```python
from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model

# Load the quantized model
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto"
)

# Apply LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Train normally with Trainer or a custom loop
```

### QLoRA-style training

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./hqq-lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

# train_dataset and data_collator are assumed to be defined elsewhere
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator
)

trainer.train()
```

## Quantization workflows

### Workflow 1: Quick model compression

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# 1. Configure quantization
config = HqqConfig(nbits=4, group_size=64)

# 2. Load and quantize (no calibration needed!)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 3. Verify output quality
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

# 4. Save
model.save_pretrained("./llama-8b-hqq")
tokenizer.save_pretrained("./llama-8b-hqq")
```
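To confirm the memory savings, Transformers exposes `get_memory_footprint()`. A quick check (the fp16 baseline of roughly 16 GB for an 8B-parameter model is an estimate at two bytes per parameter):

```python
# Compare against the fp16 footprint to verify the compression ratio
print(f"Quantized footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```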

### Workflow 2: Optimize for inference speed

```python
import time
import torch
from hqq.core.quantize import HQQLinear
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# 1. Quantize
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 2. Switch to a fast backend
HQQLinear.set_backend("marlin")  # or "torchao_int4"

# 3. Compile for additional speedup
model = torch.compile(model)

# 4. Benchmark
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
start = time.time()
for _ in range(10):
    model.generate(**inputs, max_new_tokens=100)
print(f"Avg time: {(time.time() - start) / 10:.2f}s")
```

## Best practices

- **Start with 4-bit**: Best quality/size tradeoff for most models
- **Use group_size=64**: Good balance; use smaller groups for extreme quantization
- **Choose the backend wisely**: Marlin for 4-bit on Ampere+, TorchAO for flexibility
- **Verify quality**: Always test generation quality after quantization (see the sketch below)
- **Mixed precision**: Keep attention at higher precision, compress MLP layers more
- **PEFT training**: Use LoRA r=16-32 for good fine-tuning results
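For the quality check, a perplexity spot-check catches gross regressions faster than eyeballing generations. A minimal sketch reusing `model` and `tokenizer` from the workflows above, with a throwaway string standing in for a real held-out set:

```python
import torch

# Quick perplexity probe; for a real evaluation use a held-out
# corpus (e.g., WikiText) instead of a repeated sentence
text = "The quick brown fox jumps over the lazy dog. " * 50
enc = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    loss = model(input_ids=enc["input_ids"], labels=enc["input_ids"]).loss
print(f"Perplexity: {torch.exp(loss).item():.2f}")
```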

## Common issues

**Out of memory during quantization:**

```python
# Quantize layer-by-layer to limit peak memory
from hqq.models.hf.base import AutoHQQHFModel

model = AutoHQQHFModel.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="sequential"  # Load layers sequentially
)
```

**Slow inference:**

```python
import torch
from hqq.core.quantize import HQQLinear

# Switch to an optimized backend
HQQLinear.set_backend("marlin")  # Requires an Ampere+ GPU

# Or compile the model
model = torch.compile(model, mode="reduce-overhead")
```

**Poor quality at 2-bit:**

```python
from hqq.core.quantize import BaseQuantizeConfig

# Use a smaller group size
config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,  # Smaller groups help at low bit-widths
    axis=1
)
```

## References

- Advanced Usage - custom backends, mixed precision, optimization
- Troubleshooting - common issues, debugging, benchmarks

## Resources

- Repository: https://github.com/mobiusml/hqq
- Paper: Half-Quadratic Quantization
- HuggingFace models: https://huggingface.co/mobiuslabsgmbh
- Version: 0.2.0+
- License: Apache 2.0
