turboquant-pytorch

Installs: 317
Rank: #6577

Install

    npx skills add https://github.com/aradotso/trending-skills --skill turboquant-pytorch
TurboQuant PyTorch
Skill by ara.so, from the Daily 2026 Skills collection.
From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for compressing LLM KV caches. Achieves 5x compression at 3-bit with 99.5% attention fidelity via two-stage vector quantization.
What It Does
TurboQuant compresses LLM key-value caches to 2–4 bits per coordinate:
Stage 1: Random orthogonal rotation + Lloyd-Max scalar quantization (MSE-optimal)
Stage 2: QJL residual correction, a 1-bit sign projection that makes inner product estimates unbiased

Result: attention scores remain accurate even when individual vectors look quite different from the originals. The algorithm preserves inner products, not vector fidelity.

Compression ratios at 8K context on Qwen2.5-3B (289 MB FP16 baseline):

    4-bit → 76 MB (3.8x)
    3-bit → 58 MB (5.0x)  ← practical sweet spot
    2-bit → 40 MB (7.3x)

Installation

    git clone https://github.com/tonbistudio/turboquant-pytorch
    cd turboquant-pytorch
    pip install -r requirements.txt

For CUDA PyTorch:

    pip install torch --index-url https://download.pytorch.org/whl/cu128

requirements.txt includes:

- torch>=2.0
- scipy (Lloyd-Max codebook computation)
- transformers, accelerate, bitsandbytes (only for real model validation)

Project Structure

    turboquant/
      __init__.py         # Package exports
      lloyd_max.py        # Lloyd-Max optimal scalar quantizer
      turboquant.py       # Core: TurboQuantMSE, TurboQuantProd, TurboQuantKVCache
      compressors.py      # Production compressors for real model tensors
      test_turboquant.py  # Synthetic validation tests
      validate.py         # Real model (Qwen2.5-3B) validation

Key Commands

    # Run synthetic algorithm validation (no GPU required, but a GPU enables the speed benchmark)
    python -m turboquant.test_turboquant

    # Run real model validation on Qwen2.5-3B-Instruct
    # Requires a CUDA GPU with ≥6GB VRAM; downloads a ~2GB model on first run
    python -m turboquant.validate

Core API

Lloyd-Max Codebook

    from turboquant.lloyd_max import build_lloyd_max_codebook

    # Build the optimal scalar quantizer codebook for d-dimensional rotated unit vectors
    # Returns (boundaries, centroids) for the given bit-width
    boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)

Stage 1: MSE Quantization (TurboQuantMSE)

    import torch
    from turboquant.turboquant import TurboQuantMSE

    # Initialize for head_dim=128, 3-bit quantization
    tq_mse = TurboQuantMSE(dim=128, bits=3)

    # Compress a batch of vectors: shape (batch, dim)
    keys = torch.randn(512, 128)              # 512 key vectors
    codes = tq_mse.quantize(keys)             # integer codes, (512, 128)
    reconstructed = tq_mse.dequantize(codes)  # approximate keys, (512, 128)
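
To make the Stage 1 codebook less abstract, the following is a minimal, dependency-free sketch of Lloyd-Max iteration for a single Gaussian coordinate. It assumes each rotated coordinate is treated as N(0, 1/d), as described under Algorithm Details. The names `lloyd_max_gaussian`, `_pdf`, and `_cdf` are hypothetical and are not the repository's `build_lloyd_max_codebook`, which relies on scipy and may differ in detail.

```python
import math

def _pdf(x):
    """Standard normal density (0 at ±inf)."""
    return 0.0 if math.isinf(x) else math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def _cdf(x):
    """Standard normal CDF; math.erf accepts ±inf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def lloyd_max_gaussian(bits, sigma=1.0, iters=500):
    """Illustrative Lloyd-Max levels for N(0, sigma^2); returns (boundaries, centroids)."""
    n = 2 ** bits
    # Start from evenly spaced levels in [-2, 2], in units of sigma.
    centroids = [-2.0 + 4.0 * (i + 0.5) / n for i in range(n)]
    for _ in range(iters):
        # Nearest-neighbour rule: cell boundaries are midpoints between levels.
        inner = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        edges = [-math.inf] + inner + [math.inf]
        # Centroid rule: each level becomes the conditional mean of its cell.
        centroids = [
            (_pdf(lo) - _pdf(hi)) / (_cdf(hi) - _cdf(lo))
            for lo, hi in zip(edges, edges[1:])
        ]
    inner = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
    return [sigma * b for b in inner], [sigma * c for c in centroids]

# Example: a 3-bit codebook for coordinates of a rotated unit vector with d = 128.
boundaries, centroids = lloyd_max_gaussian(bits=3, sigma=1 / math.sqrt(128))
print(len(boundaries), len(centroids))  # 7 boundaries, 8 centroids
```

Quantizing a coordinate then amounts to finding which cell it falls in (the integer code), and dequantizing is a lookup of that cell's centroid.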

Stage 2: Unbiased Inner Product Estimation (TurboQuantProd)

    from turboquant.turboquant import TurboQuantProd

    # Initialize with QJL correction
    tq_prod = TurboQuantProd(dim=128, bits=3, proj_dim=64)

    # Compress key vectors (stores codes + QJL residual signs)
    compressed = tq_prod.compress(keys)  # dict with 'codes', 'signs', 'residual_norms'

    # Estimate inner products for all keys (unbiased estimator)
    query = torch.randn(128)
    scores = tq_prod.inner_product(query, compressed)  # shape (512,)
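
The idea behind the 1-bit residual correction can be sketched in a few lines. The example below assumes the usual QJL-style form: a shared Gaussian projection, sign bits of the projected residual, and a stored residual norm. The helper names `qjl_sign_sketch` and `qjl_inner_product` are hypothetical and are not taken from the repository, whose internals may differ.

```python
import math
import torch
import torch.nn.functional as F

def qjl_sign_sketch(residual, proj):
    """Keep only the signs of the projected residual, plus its norm."""
    return torch.sign(proj @ residual), residual.norm()

def qjl_inner_product(query, signs, residual_norm, proj):
    """Estimate <query, residual> from the 1-bit sketch.

    For Gaussian rows s, E[(s.q) * sign(s.r)] = sqrt(2/pi) * <q, r> / ||r||,
    so rescaling by sqrt(pi/2) * ||r|| gives an unbiased estimate.
    """
    m = proj.shape[0]
    return math.sqrt(math.pi / 2) / m * residual_norm * torch.dot(proj @ query, signs)

torch.manual_seed(0)
dim = 64
proj_dim = 50_000  # far larger than a practical setting; used only to make the average visible
query = F.normalize(torch.randn(dim), dim=0)
residual = 0.3 * F.normalize(torch.randn(dim), dim=0)  # stands in for key minus Stage 1 reconstruction
proj = torch.randn(proj_dim, dim)                      # Gaussian projection shared by keys and queries

signs, norm = qjl_sign_sketch(residual, proj)
estimate = qjl_inner_product(query, signs, norm, proj).item()
exact = torch.dot(query, residual).item()
print(f"exact={exact:.4f}  estimate={estimate:.4f}")
```

Under these assumptions, the full key estimate would be the Stage 1 term `<query, reconstructed_key>` plus this residual term. With a practical `proj_dim` such as 64, each individual estimate is noisy but centered on the exact value.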

KV Cache Wrapper (TurboQuantKVCache)

    from turboquant.turboquant import TurboQuantKVCache

    # Wrap a KV cache for a single attention head
    cache = TurboQuantKVCache(dim=128, bits=3, proj_dim=64)

    # Add key/value vectors as tokens are generated
    new_key = torch.randn(128)   # placeholder for the newest token's key
    new_val = torch.randn(128)   # placeholder for the newest token's value
    cache.append_key(new_key)    # shape (dim,)
    cache.append_value(new_val)  # shape (dim,)

    # Compute attention scores for a query against all cached keys
    query = torch.randn(128)
    scores = cache.attention_scores(query)  # shape (seq_len,), unbiased

    # Get values (MSE-reconstructed, used for the weighted sum)
    values = cache.get_values()  # shape (seq_len, dim)

Production Compressors (for real model tensors)

    from turboquant.compressors import TurboQuantCompressorV2, TurboQuantCompressorMSE

    # layer_keys, layer_values: (num_heads, seq_len, head_dim) tensors from one layer

    # Key compressor: supports asymmetric attention score computation
    key_compressor = TurboQuantCompressorV2(dim=128, bits=3, proj_dim=64)

    # Compress all keys in a layer: shape (num_heads, seq_len, head_dim)
    compressed_keys = key_compressor.compress(layer_keys)

    # Compute attention scores directly from compressed keys (no decompression needed)
    # query shape: (num_heads, head_dim)
    scores = key_compressor.asymmetric_attention_scores(query, compressed_keys)
    # scores shape: (num_heads, seq_len)

    # Value compressor: MSE reconstruction (Stage 1 only, acceptable for values)
    val_compressor = TurboQuantCompressorMSE(dim=128, bits=3)
    compressed_vals = val_compressor.compress(layer_values)
    reconstructed_vals = val_compressor.decompress(compressed_vals)

Common Patterns

Pattern 1: Compress a Full Model's KV Cache

    import torch
    from turboquant.compressors import TurboQuantCompressorV2, TurboQuantCompressorMSE

    def compress_kv_cache(kv_cache, head_dim=128, bits=3, proj_dim=64):
        """
        kv_cache: list of (keys, values) per layer
          keys/values shape: (num_heads, seq_len, head_dim)
        Returns (compressed, key_comp, val_comp), where compressed is a list of
        compressed (keys, values) per layer.
        """
        key_comp = TurboQuantCompressorV2(dim=head_dim, bits=bits, proj_dim=proj_dim)
        val_comp = TurboQuantCompressorMSE(dim=head_dim, bits=bits)

        compressed = []
        for layer_keys, layer_vals in kv_cache:
            c_keys = key_comp.compress(layer_keys)
            c_vals = val_comp.compress(layer_vals)
            compressed.append((c_keys, c_vals))

        return compressed, key_comp, val_comp

    def run_attention_with_compressed_cache(query, compressed_keys, compressed_vals, key_comp, val_comp):
        """
        query: (num_heads, head_dim)
        Returns: attention output (num_heads, head_dim)
        """

        # Unbiased attention scores from compressed keys
        scores = key_comp.asymmetric_attention_scores(query, compressed_keys)  # (num_heads, seq_len)
        attn_weights = torch.softmax(scores, dim=-1)  # (num_heads, seq_len)

        # Decompress values and compute the weighted sum
        values = val_comp.decompress(compressed_vals)  # (num_heads, seq_len, head_dim)
        output = torch.einsum('hs,hsd->hd', attn_weights, values)
        return output

Pattern 2: Validate Compression Quality

    import torch
    import torch.nn.functional as F
    from turboquant.turboquant import TurboQuantProd

    def measure_attention_fidelity(keys, queries, bits=3, proj_dim=64):
        """
        Measure how well TurboQuant preserves attention distributions.
        keys: (seq_len, head_dim)
        queries: (num_queries, head_dim)
        """
        dim = keys.shape[-1]
        tq = TurboQuantProd(dim=dim, bits=bits, proj_dim=proj_dim)
        compressed = tq.compress(keys)

        cosine_sims = []
        top1_matches = []
        for q in queries:

            # True attention scores
            true_scores = keys @ q  # (seq_len,)
            true_attn = torch.softmax(true_scores, dim=0)

            # TurboQuant estimated scores
            est_scores = tq.inner_product(q, compressed)  # (seq_len,)
            est_attn = torch.softmax(est_scores, dim=0)

            # Cosine similarity of attention distributions
            cos_sim = F.cosine_similarity(true_attn.unsqueeze(0), est_attn.unsqueeze(0)).item()
            cosine_sims.append(cos_sim)

            # Top-1 match (stored as a plain Python bool)
            top1_matches.append((true_attn.argmax() == est_attn.argmax()).item())

        return {
            'mean_cosine_sim': sum(cosine_sims) / len(cosine_sims),
            'top1_accuracy': sum(top1_matches) / len(top1_matches),
        }

    # Example usage
    keys = torch.randn(2048, 128)
    keys = F.normalize(keys, dim=-1)
    queries = torch.randn(100, 128)
    queries = F.normalize(queries, dim=-1)

    results = measure_attention_fidelity(keys, queries, bits=3)
    print(f"Cosine similarity: {results['mean_cosine_sim']:.4f}")
    print(f"Top-1 accuracy: {results['top1_accuracy']:.2%}")

Pattern 3: Needle-in-Haystack Retrieval Test

    import torch
    import torch.nn.functional as F
    from turboquant.turboquant import TurboQuantProd

    def needle_in_haystack(seq_len=2048, dim=128, bits=3):
        """Test whether TurboQuant preserves nearest-neighbor ordering."""
        tq = TurboQuantProd(dim=dim, bits=bits, proj_dim=64)

        # Build haystack of random unit vectors
        haystack = F.normalize(torch.randn(seq_len, dim), dim=-1)

        # Insert needle at a random position
        needle_idx = torch.randint(0, seq_len, (1,)).item()
        query = F.normalize(torch.randn(dim), dim=0)
        needle = query + 0.1 * torch.randn(dim)  # similar to query
        needle = F.normalize(needle, dim=0)
        haystack[needle_idx] = needle

        # Compress
        compressed = tq.compress(haystack)

        # True nearest neighbor
        true_scores = haystack @ query
        true_best = true_scores.argmax().item()

        # TurboQuant estimated nearest neighbor
        est_scores = tq.inner_product(query, compressed)
        est_best = est_scores.argmax().item()

        return true_best == est_best, true_best, est_best

    # Run multiple trials
    successes = sum(needle_in_haystack(seq_len=8192)[0] for _ in range(20))
    print(f"Retrieval accuracy: {successes}/20")

Pattern 4: Compute Memory Savings

    def estimate_memory_savings(num_layers, num_kv_heads, seq_len, head_dim, bits, proj_dim=64):
        """
        Estimate compressed KV cache size vs FP16 baseline.
        """

        # FP16 baseline: 2 bytes per element (the first factor of 2 counts keys and values)
        fp16_bytes = num_layers * 2 * num_kv_heads * seq_len * head_dim * 2

        # Stage 1 codes: bits per element, packed into bytes
        codes_bytes = (num_layers * 2 * num_kv_heads * seq_len * head_dim * bits) // 8

        # Stage 2 signs (keys only): 1 bit per proj_dim element
        signs_bytes = (num_layers * num_kv_heads * seq_len * proj_dim) // 8

        # Residual norms: 1 float16 per vector (keys only)
        norms_bytes = num_layers * num_kv_heads * seq_len * 2

        total_compressed = codes_bytes + signs_bytes + norms_bytes
        ratio = fp16_bytes / total_compressed

        print(f"FP16 baseline: {fp16_bytes / 1e6:.1f} MB")
        print(f"TurboQuant {bits}-bit: {total_compressed / 1e6:.1f} MB")
        print(f"Compression ratio: {ratio:.1f}x")
        return ratio

    # Qwen2.5-3B: 36 layers, 2 KV heads, head_dim=128
    estimate_memory_savings(num_layers=36, num_kv_heads=2, seq_len=8192, head_dim=128, bits=3)
    # With exactly 8192 tokens these formulas print:
    #   FP16 baseline: 302.0 MB
    #   TurboQuant 3-bit: 62.5 MB
    #   Compression ratio: 4.8x
    # The reported 8K-context figures (289 MB FP16, 58 MB at 3-bit, 5.0x) are slightly more favorable.
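
The estimate above counts Stage 1 codes as "bits per element, packed into bytes". As an illustration of what such packing could look like, here is a small pure-Python sketch. The bit layout (little-endian within a running buffer) and the names `pack_codes` and `unpack_codes` are assumptions made for this example; the repository's storage format is not described here and may differ.

```python
def pack_codes(codes, bits):
    """Pack small non-negative integers into bytes, `bits` bits each."""
    buffer, filled, out = 0, 0, bytearray()
    for code in codes:
        if not 0 <= code < (1 << bits):
            raise ValueError(f"code {code} does not fit in {bits} bits")
        buffer |= code << filled      # append the new code above the bits already buffered
        filled += bits
        while filled >= 8:            # flush complete bytes
            out.append(buffer & 0xFF)
            buffer >>= 8
            filled -= 8
    if filled:
        out.append(buffer & 0xFF)     # final partial byte, zero-padded
    return bytes(out)

def unpack_codes(packed, bits, count):
    """Inverse of pack_codes: recover `count` integers of `bits` bits each."""
    buffer, filled, codes = 0, 0, []
    mask = (1 << bits) - 1
    for byte in packed:
        buffer |= byte << filled
        filled += 8
        while filled >= bits and len(codes) < count:
            codes.append(buffer & mask)
            buffer >>= bits
            filled -= bits
    return codes

codes = [5, 0, 7, 3, 1, 6, 2, 4] * 16   # 128 three-bit codes, one per coordinate
packed = pack_codes(codes, bits=3)
print(len(packed))                      # 48 bytes = 128 * 3 / 8
assert unpack_codes(packed, bits=3, count=len(codes)) == codes
```

Storing one code per byte would cost 128 bytes per vector, so packing is what makes the 3-bit figure in the estimate achievable.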

Algorithm Details
Why Random Rotation?
Rotating by a random orthogonal matrix R maps unit vectors to a space where each coordinate approximately follows N(0, 1/d). This makes coordinates nearly independent with a known distribution, enabling optimal per-coordinate scalar quantization (Lloyd-Max).
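
A short sketch can make the effect of the rotation concrete. It draws a random orthogonal matrix from the QR decomposition of a Gaussian matrix (a common construction; whether the repository builds its rotation this way is an assumption) and applies it to a deliberately "spiky" unit vector. The helper name `random_orthogonal` is hypothetical.

```python
import torch

def random_orthogonal(dim, generator=None):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    gaussian = torch.randn(dim, dim, generator=generator)
    q, r = torch.linalg.qr(gaussian)
    # Flip column signs using diag(R) so the distribution is uniform over rotations.
    return q * torch.sign(torch.diagonal(r))

gen = torch.Generator().manual_seed(0)
dim = 128
rotation = random_orthogonal(dim, gen)

# Worst case for per-coordinate quantization: all mass in one coordinate.
x = torch.zeros(dim)
x[0] = 1.0
y = rotation @ x

print("orthogonal:", torch.allclose(rotation @ rotation.T, torch.eye(dim), atol=1e-4))
print(f"norm after rotation: {y.norm().item():.4f}")
print(f"largest |coordinate| before: {x.abs().max().item():.3f}, after: {y.abs().max().item():.3f}")
print(f"coordinate std after: {y.std().item():.4f}  vs  1/sqrt(d) = {dim ** -0.5:.4f}")
```

The rotation preserves norms and inner products, but spreads the vector's energy across all coordinates, so a single scalar codebook designed for N(0, 1/d) fits every coordinate reasonably well.
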
Why QJL for Keys but Not Values?
Keys: Used in dot products with queries. Bias in inner product estimates directly corrupts attention weights, so QJL correction is essential.
Values: Used in weighted sums after softmax. Small per-vector MSE errors average out, so Stage 1 MSE quantization is sufficient.

Choosing proj_dim (QJL projection dimension)

Higher proj_dim → lower variance in inner product estimates, but more memory:

    # Rule of thumb: proj_dim = head_dim // 2 is a good default
    # head_dim=128 → proj_dim=64
    # head_dim=64  → proj_dim=32
    # head_dim=256 → proj_dim=128
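
The memory side of this trade-off can be worked out directly from the per-key accounting used in Pattern 4 (codes, one sign bit per projection dimension, and one FP16 residual norm). The helper `key_bits` below is a hypothetical name introduced only for this arithmetic.

```python
def key_bits(head_dim, bits, proj_dim):
    """Bits stored per key vector: Stage 1 codes + QJL sign bits + one FP16 residual norm."""
    return head_dim * bits + proj_dim + 16

head_dim, bits = 128, 3
fp16_bits = head_dim * 16
for proj_dim in (head_dim // 4, head_dim // 2, head_dim):
    total = key_bits(head_dim, bits, proj_dim)
    print(f"proj_dim={proj_dim:3d}: {total} bits/key, {fp16_bits / total:.2f}x vs FP16")

# proj_dim= 32: 432 bits/key, 4.74x vs FP16
# proj_dim= 64: 464 bits/key, 4.41x vs FP16
# proj_dim=128: 528 bits/key, 3.88x vs FP16
```

Since the correction averages proj_dim sign terms, its standard deviation can be expected to shrink roughly as 1/sqrt(proj_dim). Doubling proj_dim from 64 to 128 therefore buys about a 1.4x noise reduction for 64 extra bits per key.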

Bit-width Selection Guide

| Bits | Compression | Cosine Sim | Top-1 Match | Use Case |
|------|-------------|------------|-------------|----------|
| 4 | 3.8x | 0.999 | 87% | Quality-critical tasks |
| 3 | 5.0x | 0.995 | 82% | Recommended default |
| 2 | 7.3x | 0.988 | 66% | Extreme memory pressure |

Troubleshooting

scipy import error when building codebooks:

    pip install scipy

CUDA out of memory during validate.py:

- Requires ≥6GB VRAM for Qwen2.5-3B in 4-bit
- Reduce seq_len in the validation script or use a smaller model

Inner product estimates have high variance:

- Increase proj_dim (try head_dim instead of head_dim // 2)
- Check that input vectors are normalized before compressing

Codebook build is slow on first run:

- Lloyd-Max uses numerical integration (scipy), so this is expected
- Codebooks are computed once per (dim, bits) combination; cache them:

    import pickle
    from turboquant.lloyd_max import build_lloyd_max_codebook

    # Save codebook
    boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)
    with open('codebook_128_3bit.pkl', 'wb') as f:
        pickle.dump((boundaries, centroids), f)

    # Load cached codebook
    with open('codebook_128_3bit.pkl', 'rb') as f:
        boundaries, centroids = pickle.load(f)

Attention fidelity lower than expected:

- Ensure vectors are L2-normalized before compressing (F.normalize(x, dim=-1))
- The compressors in compressors.py handle normalization internally; TurboQuantProd expects unit vectors

References

- TurboQuant paper (ICLR 2026)
- QJL paper: 1-bit residual correction technique
- PolarQuant: related polar coordinate approach
