TurboQuant PyTorch

Skill by ara.so, part of the Daily 2026 Skills collection.

From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for compressing LLM KV caches. Achieves 5x compression at 3-bit with 99.5% attention fidelity via two-stage vector quantization.

What It Does

TurboQuant compresses LLM key-value caches to 2–4 bits per coordinate:

- Stage 1: Random orthogonal rotation + Lloyd-Max scalar quantization (MSE-optimal)
- Stage 2: QJL residual correction, a 1-bit sign projection that makes inner product estimates unbiased

Result: attention scores remain accurate even when individual vectors look quite different from the originals. The algorithm preserves inner products, not vector fidelity.

Compression ratios at 8K context on Qwen2.5-3B (289 MB FP16 baseline):

- 4-bit → 76 MB (3.8x)
- 3-bit → 58 MB (5.0x) ← practical sweet spot
- 2-bit → 40 MB (7.3x)

Installation

    git clone https://github.com/tonbistudio/turboquant-pytorch
    cd turboquant-pytorch
    pip install -r requirements.txt

For CUDA PyTorch:

    pip install torch --index-url https://download.pytorch.org/whl/cu128

requirements.txt includes:

- torch>=2.0
- scipy (Lloyd-Max codebook computation)
- transformers, accelerate, bitsandbytes (only for real model validation)

Project Structure

    turboquant/
      __init__.py          # Package exports
      lloyd_max.py         # Lloyd-Max optimal scalar quantizer
      turboquant.py        # Core: TurboQuantMSE, TurboQuantProd, TurboQuantKVCache
      compressors.py       # Production compressors for real model tensors
      test_turboquant.py   # Synthetic validation tests
      validate.py          # Real model (Qwen2.5-3B) validation

Key Commands

    # Run synthetic algorithm validation (no GPU required, but a GPU enables the speed benchmark)
    python -m turboquant.test_turboquant

    # Run real model validation on Qwen2.5-3B-Instruct
    # Requires CUDA GPU with ≥6GB VRAM; downloads ~2GB model on first run
    python -m turboquant.validate

Core API

Lloyd-Max Codebook

    from turboquant.lloyd_max import build_lloyd_max_codebook

    # Build optimal scalar quantizer codebook for d-dimensional rotated unit vectors
    # Returns (boundaries, centroids) for the given bit-width
    boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)

Stage 1: MSE Quantization (TurboQuantMSE)

    import torch
    from turboquant.turboquant import TurboQuantMSE

    # Initialize for head_dim=128, 3-bit quantization
    tq_mse = TurboQuantMSE(dim=128, bits=3)

    # Compress a batch of vectors: shape (batch, dim)
    keys = torch.randn(512, 128)               # 512 key vectors
    codes = tq_mse.quantize(keys)              # integer codes, (512, 128)
    reconstructed = tq_mse.dequantize(codes)   # approximate keys, (512, 128)
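To make the quantize/dequantize step concrete, here is a minimal sketch of a Lloyd-Max scalar quantizer in plain Python. It is not the package's `build_lloyd_max_codebook`: it fits the codebook to random samples with Lloyd's algorithm instead of numerical integration, and the names `lloyd_max_from_samples`, `quantize`, and `dequantize` are hypothetical helpers used only for illustration. The samples are drawn from N(0, 1/dim), the distribution assumed for rotated coordinates.

```python
import bisect
import random

def lloyd_max_from_samples(samples, bits, iters=30):
    """Fit a 2**bits-level scalar quantizer to samples with Lloyd's algorithm."""
    levels = 2 ** bits
    xs = sorted(samples)
    n = len(xs)
    # Start from evenly spaced quantiles so every cell is non-empty.
    centroids = [xs[int((i + 0.5) * n / levels)] for i in range(levels)]
    for _ in range(iters):
        # Nearest-centroid rule: boundaries sit midway between centroids.
        boundaries = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        for x in xs:
            k = bisect.bisect_right(boundaries, x)
            sums[k] += x
            counts[k] += 1
        # Centroid rule: each centroid becomes the mean of its cell.
        centroids = [s / c if c else old
                     for s, c, old in zip(sums, counts, centroids)]
    boundaries = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
    return boundaries, centroids

def quantize(x, boundaries):
    """Map a scalar to its integer code in [0, 2**bits)."""
    return bisect.bisect_right(boundaries, x)

def dequantize(code, centroids):
    """Map an integer code back to its centroid."""
    return centroids[code]

rng = random.Random(0)
dim = 128
sigma = (1.0 / dim) ** 0.5   # rotated coordinates are assumed ~ N(0, 1/dim)
samples = [rng.gauss(0.0, sigma) for _ in range(10_000)]
boundaries, centroids = lloyd_max_from_samples(samples, bits=3)

mse = sum((x - dequantize(quantize(x, boundaries), centroids)) ** 2
          for x in samples) / len(samples)
second_moment = sum(x * x for x in samples) / len(samples)
print(f"levels={len(centroids)}, relative MSE={mse / second_moment:.3f}")
```

Each coordinate is stored as a 3-bit index into `centroids`, which is where the per-coordinate bit budget comes from.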

Stage 2: Unbiased Inner Product Estimation (TurboQuantProd)

    from turboquant.turboquant import TurboQuantProd

    # Initialize with QJL correction
    tq_prod = TurboQuantProd(dim=128, bits=3, proj_dim=64)

    # Compress key vectors (stores codes + QJL residual signs)
    compressed = tq_prod.compress(keys)   # dict with 'codes', 'signs', 'residual_norms'

    # Estimate inner products for all keys (unbiased estimator)
    query = torch.randn(128)
    scores = tq_prod.inner_product(query, compressed)   # shape (512,)
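The unbiasedness claim can be checked with a small standalone sketch of a QJL-style estimator. This is an illustration under stated assumptions, not the code inside `TurboQuantProd`: it assumes a Gaussian projection matrix S, stores only sign(S r) and the norm of the residual r, and rescales using the identity E[(s·q) sign(s·r)] = sqrt(2/π) · ⟨q, r⟩ / ‖r‖ for a standard Gaussian row s. The helper `qjl_estimate` is hypothetical, and a fresh S is drawn on every trial so that the average over trials approximates the expectation.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def qjl_estimate(query, residual, proj_dim, rng):
    """One QJL-style estimate of <query, residual> from 1-bit signs."""
    d = len(residual)
    res_norm = math.sqrt(dot(residual, residual))   # stored per key
    total = 0.0
    for _ in range(proj_dim):
        row = [rng.gauss(0.0, 1.0) for _ in range(d)]       # one row of S
        sign = 1.0 if dot(row, residual) >= 0 else -1.0     # stored: 1 bit
        total += dot(row, query) * sign                     # query stays full precision
    # Rescale so the expectation equals <query, residual>.
    return math.sqrt(math.pi / 2) * res_norm * total / proj_dim

rng = random.Random(0)
d = 16
residual = [0.3 * x for x in unit([rng.gauss(0, 1) for _ in range(d)])]
query = unit([rng.gauss(0, 1) for _ in range(d)])
true_ip = dot(query, residual)

trials = 2000
estimates = [qjl_estimate(query, residual, proj_dim=8, rng=rng)
             for _ in range(trials)]
mean_est = sum(estimates) / trials
print(f"true inner product = {true_ip:+.4f}, mean estimate = {mean_est:+.4f}")
```

Individual estimates are noisy, but their mean converges to the true inner product, which is the property that keeps attention logits from drifting systematically.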

KV Cache Wrapper (TurboQuantKVCache)

    from turboquant.turboquant import TurboQuantKVCache

    # Wrap a KV cache for a single attention head
    cache = TurboQuantKVCache(dim=128, bits=3, proj_dim=64)

    # Add key/value vectors as tokens are generated
    cache.append_key(new_key)       # shape (dim,)
    cache.append_value(new_val)     # shape (dim,)

    # Compute attention scores for a query against all cached keys
    query = torch.randn(128)
    scores = cache.attention_scores(query)   # shape (seq_len,), unbiased

    # Get values (MSE-reconstructed, used for weighted sum)
    values = cache.get_values()     # shape (seq_len, dim)

Production Compressors (for real model tensors)

    from turboquant.compressors import TurboQuantCompressorV2, TurboQuantCompressorMSE

    # Key compressor: supports asymmetric attention score computation
    key_compressor = TurboQuantCompressorV2(dim=128, bits=3, proj_dim=64)

    # Compress all keys in a layer: shape (num_heads, seq_len, head_dim)
    compressed_keys = key_compressor.compress(layer_keys)

    # Compute attention scores directly from compressed keys (no decompress needed)
    # query shape: (num_heads, head_dim)
    scores = key_compressor.asymmetric_attention_scores(query, compressed_keys)
    # scores shape: (num_heads, seq_len)

    # Value compressor: MSE reconstruction (Stage 1 only, acceptable for values)
    val_compressor = TurboQuantCompressorMSE(dim=128, bits=3)
    compressed_vals = val_compressor.compress(layer_values)
    reconstructed_vals = val_compressor.decompress(compressed_vals)

Common Patterns

Pattern 1: Compress a Full Model's KV Cache

    import torch
    from turboquant.compressors import TurboQuantCompressorV2, TurboQuantCompressorMSE

    def compress_kv_cache(kv_cache, head_dim=128, bits=3, proj_dim=64):
        """
        kv_cache: list of (keys, values) per layer
                  keys/values shape: (num_heads, seq_len, head_dim)
        Returns (compressed, key_comp, val_comp), where compressed is a
        list of compressed (keys, values) per layer.
        """
        key_comp = TurboQuantCompressorV2(dim=head_dim, bits=bits, proj_dim=proj_dim)
        val_comp = TurboQuantCompressorMSE(dim=head_dim, bits=bits)

        compressed = []
        for layer_keys, layer_vals in kv_cache:
            c_keys = key_comp.compress(layer_keys)
            c_vals = val_comp.compress(layer_vals)
            compressed.append((c_keys, c_vals))

        return compressed, key_comp, val_comp

    def run_attention_with_compressed_cache(query, compressed_keys, compressed_vals,
                                            key_comp, val_comp):
        """
        query: (num_heads, head_dim)
        Returns: attention output (num_heads, head_dim)
        """

        # Unbiased attention scores from compressed keys
        scores = key_comp.asymmetric_attention_scores(query, compressed_keys)  # (num_heads, seq_len)
        attn_weights = torch.softmax(scores, dim=-1)                           # (num_heads, seq_len)

        # Decompress values and compute weighted sum
        values = val_comp.decompress(compressed_vals)   # (num_heads, seq_len, head_dim)
        output = torch.einsum('hs,hsd->hd', attn_weights, values)
        return output

Pattern 2: Validate Compression Quality

    import torch
    import torch.nn.functional as F
    from turboquant.turboquant import TurboQuantProd

    def measure_attention_fidelity(keys, queries, bits=3, proj_dim=64):
        """
        Measure how well TurboQuant preserves attention distributions.
        keys: (seq_len, head_dim)
        queries: (num_queries, head_dim)
        """
        dim = keys.shape[-1]
        tq = TurboQuantProd(dim=dim, bits=bits, proj_dim=proj_dim)
        compressed = tq.compress(keys)

        cosine_sims = []
        top1_matches = []

        for q in queries:

            # True attention scores
            true_scores = keys @ q                         # (seq_len,)
            true_attn = torch.softmax(true_scores, dim=0)

            # TurboQuant estimated scores
            est_scores = tq.inner_product(q, compressed)   # (seq_len,)
            est_attn = torch.softmax(est_scores, dim=0)

            # Cosine similarity of attention distributions
            cos_sim = F.cosine_similarity(true_attn.unsqueeze(0), est_attn.unsqueeze(0)).item()
            cosine_sims.append(cos_sim)

            # Top-1 match, stored as a Python bool so the mean below is a plain float
            top1_matches.append((true_attn.argmax() == est_attn.argmax()).item())

        return {
            'mean_cosine_sim': sum(cosine_sims) / len(cosine_sims),
            'top1_accuracy': sum(top1_matches) / len(top1_matches),
        }

    # Example usage
    keys = torch.randn(2048, 128)
    keys = F.normalize(keys, dim=-1)
    queries = torch.randn(100, 128)
    queries = F.normalize(queries, dim=-1)

    results = measure_attention_fidelity(keys, queries, bits=3)
    print(f"Cosine similarity: {results['mean_cosine_sim']:.4f}")
    print(f"Top-1 accuracy: {results['top1_accuracy']:.2%}")

Pattern 3: Needle-in-Haystack Retrieval Test

    import torch
    import torch.nn.functional as F
    from turboquant.turboquant import TurboQuantProd

    def needle_in_haystack(seq_len=2048, dim=128, bits=3):
        """Test whether TurboQuant preserves nearest-neighbor ordering."""
        tq = TurboQuantProd(dim=dim, bits=bits, proj_dim=64)

        # Build haystack of random unit vectors
        haystack = F.normalize(torch.randn(seq_len, dim), dim=-1)

        # Insert needle at random position
        needle_idx = torch.randint(0, seq_len, (1,)).item()
        query = F.normalize(torch.randn(dim), dim=0)
        needle = query + 0.1 * torch.randn(dim)   # Similar to query
        needle = F.normalize(needle, dim=0)
        haystack[needle_idx] = needle

        # Compress
        compressed = tq.compress(haystack)

        # True nearest neighbor
        true_scores = haystack @ query
        true_best = true_scores.argmax().item()

        # TurboQuant estimated nearest neighbor
        est_scores = tq.inner_product(query, compressed)
        est_best = est_scores.argmax().item()

        return true_best == est_best, true_best, est_best

    # Run multiple trials
    successes = sum(needle_in_haystack(seq_len=8192)[0] for _ in range(20))
    print(f"Retrieval accuracy: {successes}/20")

Pattern 4: Compute Memory Savings

    def estimate_memory_savings(num_layers, num_kv_heads, seq_len, head_dim, bits, proj_dim=64):
        """
        Estimate compressed KV cache size vs FP16 baseline.
        """

        # FP16 baseline: 2 bytes per element (keys and values)
        fp16_bytes = num_layers * 2 * num_kv_heads * seq_len * head_dim * 2

        # Stage 1 codes: bits per element, packed into bytes
        codes_bytes = (num_layers * 2 * num_kv_heads * seq_len * head_dim * bits) // 8

        # Stage 2 signs (keys only): 1 bit per proj_dim element
        signs_bytes = (num_layers * num_kv_heads * seq_len * proj_dim) // 8

        # Residual norms: 1 float16 per vector (keys only)
        norms_bytes = num_layers * num_kv_heads * seq_len * 2

        total_compressed = codes_bytes + signs_bytes + norms_bytes
        ratio = fp16_bytes / total_compressed

        print(f"FP16 baseline: {fp16_bytes / 1e6:.1f} MB")
        print(f"TurboQuant {bits}-bit: {total_compressed / 1e6:.1f} MB")
        print(f"Compression ratio: {ratio:.1f}x")
        return ratio

    # Qwen2.5-3B: 36 layers, 2 KV heads, head_dim=128
    estimate_memory_savings(num_layers=36, num_kv_heads=2, seq_len=8192, head_dim=128, bits=3)
    # Prints:
    # FP16 baseline: 302.0 MB
    # TurboQuant 3-bit: 62.5 MB
    # Compression ratio: 4.8x
    # This rough estimate differs slightly from the figures quoted at the top
    # of this document for the real model (289 MB → 58 MB, 5.0x).
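The estimate above assumes the Stage 1 codes are bit-packed rather than stored one per byte. A minimal sketch of such packing for a single vector, using only Python integers, is shown below. The helpers `pack_codes` and `unpack_codes` are hypothetical and are not taken from the repository, which may lay out its bits differently.

```python
def pack_codes(codes, bits):
    """Pack integer codes (each < 2**bits) into bytes, least significant bits first."""
    acc = 0
    for i, code in enumerate(codes):
        if not 0 <= code < (1 << bits):
            raise ValueError(f"code {code} does not fit in {bits} bits")
        acc |= code << (i * bits)
    n_bytes = (len(codes) * bits + 7) // 8   # round up to whole bytes
    return acc.to_bytes(n_bytes, "little")

def unpack_codes(packed, bits, count):
    """Recover `count` codes from bytes produced by pack_codes."""
    acc = int.from_bytes(packed, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]

# One head_dim=128 key vector quantized to 3 bits per coordinate.
codes = [5, 0, 7, 3, 1, 6, 2, 4] * 16
packed = pack_codes(codes, bits=3)
restored = unpack_codes(packed, bits=3, count=len(codes))

print(len(packed))        # 48 bytes, versus 256 bytes for the same vector in FP16
print(restored == codes)  # True
```

Packing is lossless, so it changes only the storage size and not the reconstruction error.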

Algorithm Details

Why Random Rotation?

Rotating by a random orthogonal matrix R maps unit vectors to a space where each coordinate approximately follows N(0, 1/d). This makes the coordinates nearly independent with a known distribution, which enables optimal per-coordinate scalar quantization (Lloyd-Max).
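A small standalone sketch can illustrate the effect of the rotation. It builds a random orthogonal matrix by Gram-Schmidt on Gaussian rows (one common construction, assumed here and not necessarily what the package uses) and applies it to a deliberately "spiky" unit vector. The helpers `random_orthogonal` and `rotate` are hypothetical names for illustration.

```python
import math
import random

def random_orthogonal(d, rng):
    """Build a d x d orthogonal matrix by Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:   # remove components along earlier rows
            proj = sum(a * b for a, b in zip(v, u))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

def rotate(R, x):
    """Matrix-vector product R @ x."""
    return [sum(r * xi for r, xi in zip(row, x)) for row in R]

rng = random.Random(0)
d = 64
R = random_orthogonal(d, rng)

x = [1.0] + [0.0] * (d - 1)   # all energy in one coordinate
y = rotate(R, x)

norm_y = math.sqrt(sum(v * v for v in y))
max_abs_y = max(abs(v) for v in y)
print(f"norm after rotation: {norm_y:.6f}")
print(f"largest |coordinate|: before=1.0, after={max_abs_y:.3f}")
print(f"target std 1/sqrt(d): {1 / math.sqrt(d):.3f}")
```

The rotation preserves the norm (and therefore inner products) while spreading the energy across coordinates, so a single scalar codebook fits every coordinate.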

Why QJL for Keys but Not Values?

- Keys: Used in dot products with queries. Bias in inner product estimates directly corrupts attention weights, so QJL correction is essential.
- Values: Used in weighted sums after softmax. Small per-vector MSE errors average out, so Stage 1 MSE quantization is sufficient.

Choosing proj_dim (QJL projection dimension)

Higher proj_dim → lower variance in inner product estimates, but more memory:

    # Rule of thumb: proj_dim = head_dim // 2 is a good default
    # head_dim=128 → proj_dim=64
    # head_dim=64  → proj_dim=32
    # head_dim=256 → proj_dim=128
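The variance/memory trade-off can be seen in a small simulation. This is a sketch under assumptions, not the package's code: it uses a QJL-style estimator with a Gaussian projection (the hypothetical helper `qjl_estimate`), a small `head_dim` of 32 to keep the pure-Python loop fast, and a fresh projection on every trial.

```python
import math
import random
import statistics

def qjl_estimate(q, r, proj_dim, rng):
    """QJL-style estimate of <q, r> from proj_dim sign bits of r."""
    r_norm = math.sqrt(sum(x * x for x in r))
    total = 0.0
    for _ in range(proj_dim):
        s = [rng.gauss(0.0, 1.0) for _ in r]        # one Gaussian projection row
        sr = sum(a * b for a, b in zip(s, r))       # only its sign is stored
        sq = sum(a * b for a, b in zip(s, q))
        total += sq if sr >= 0 else -sq
    return math.sqrt(math.pi / 2) * r_norm * total / proj_dim

rng = random.Random(1)
head_dim = 32
r = [rng.gauss(0, 1) for _ in range(head_dim)]
q = [rng.gauss(0, 1) for _ in range(head_dim)]

variances = {}
for proj_dim in (head_dim // 2, head_dim, 2 * head_dim):
    ests = [qjl_estimate(q, r, proj_dim, rng) for _ in range(300)]
    variances[proj_dim] = statistics.variance(ests)
    sign_bytes = proj_dim // 8   # 1 bit per projection, per key
    print(f"proj_dim={proj_dim:3d}  sign bytes/key={sign_bytes}  "
          f"variance={variances[proj_dim]:.3f}")
```

Because the estimate averages `proj_dim` independent terms, its variance falls roughly in proportion to 1/proj_dim, while the sign storage per key grows linearly.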

Bit-width Selection Guide

| Bits | Compression | Cosine Sim | Top-1 Match | Use Case |
|------|-------------|------------|-------------|----------|
| 4 | 3.8x | 0.999 | 87% | Quality-critical tasks |
| 3 | 5.0x | 0.995 | 82% | Recommended default |
| 2 | 7.3x | 0.988 | 66% | Extreme memory pressure |

Troubleshooting

scipy import error when building codebooks:

    pip install scipy

CUDA out of memory during validate.py:

- Requires ≥6GB VRAM for Qwen2.5-3B in 4-bit
- Reduce seq_len in the validation script or use a smaller model

Inner product estimates have high variance:

- Increase proj_dim (try head_dim instead of head_dim // 2)
- Check that input vectors are normalized before compressing

Codebook build is slow on first run:

- Lloyd-Max uses numerical integration (scipy), so this is expected
- Codebooks are computed once per (dim, bits) combination; cache them:

    import pickle
    from turboquant.lloyd_max import build_lloyd_max_codebook

    # Save codebook
    boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)
    with open('codebook_128_3bit.pkl', 'wb') as f:
        pickle.dump((boundaries, centroids), f)

    # Load cached codebook
    with open('codebook_128_3bit.pkl', 'rb') as f:
        boundaries, centroids = pickle.load(f)

Attention fidelity lower than expected:

- Ensure vectors are L2-normalized before compressing (F.normalize(x, dim=-1))
- The compressors in compressors.py handle normalization internally; TurboQuantProd expects unit vectors

References

- TurboQuant paper: ICLR 2026
- QJL paper: 1-bit residual correction technique
- PolarQuant: related polar coordinate approach
