mlx-optiq
Research · April 17, 2026

TurboQuant — rotation preserves inner products. Affine quantization doesn't.

For long-context inference, the KV cache becomes the dominant memory cost — at 64 k tokens it's often larger than the model weights themselves. The standard fix is to quantize the cache. The standard quantization is affine: stretch each tensor to fit in a smaller integer range, store an offset and a scale per group. It works, mostly. But it loses something attention specifically cares about.

Attention is a dot-product operation. Affine quantization preserves magnitudes. It does not preserve inner products.

This is the gap TurboQuant — our research-path KV-cache primitive — closes. It's vector quantization on a random orthogonal rotation of the key/value space, plus a custom Metal kernel that does attention directly in rotated space, so the stored keys never have to be un-rotated or materialized in full precision. The result is the cleanest A/B we have on long-context retrieval.

The problem with affine on attention

An affine quantizer maps x → round((x - z) / s) where z is a zero-point and s is a scale. Reconstruct with x' ≈ s · q + z. The error x - x' has a sawtooth shape that depends on x's position in its quantization bin.
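For concreteness, here is a minimal NumPy sketch of a per-group affine quantizer following the formulas above. The group size and function names are illustrative, not the library's API.

import numpy as np

def affine_quantize(x, bits=4, group_size=64):
    # q = round((x - z) / s), computed per group of `group_size` elements.
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    s = np.maximum(hi - lo, 1e-8) / levels   # scale per group
    z = lo                                   # zero-point per group
    q = np.clip(np.round((groups - z) / s), 0, levels)
    return q.astype(np.uint8), s, z

def affine_dequantize(q, s, z):
    # x' ≈ s * q + z
    return s * q + z

x = np.random.default_rng(0).standard_normal((1, 128)).astype(np.float32)
q, s, z = affine_quantize(x, bits=4)
x_hat = affine_dequantize(q, s, z).reshape(x.shape)
print("max reconstruction error:", np.abs(x - x_hat).max())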

That's fine for storing values. But attention isn't storing — it's comparing. The attention score is q · k, an inner product. The error in that inner product, q · k - q · k', depends on the alignment of the per-element errors with the query direction. In high-dimensional spaces, those alignments are not random — they correlate with semantic content. Specifically, they correlate with which tokens are semantically similar — exactly what attention is trying to surface.

The empirical signature: needle-in-a-haystack retrieval breaks at 4-bit affine. We measured 73 % retrieval (vs 100 % at fp16) on Qwen3.5 with affine 4-bit KV, even though perplexity was only mildly degraded. The retrieval failures concentrate on long-distance matches — exactly where small inner-product errors push the wrong token over the softmax threshold.

The rotation trick

Multiply both K and Q by the same random orthogonal matrix R. The attention score is preserved exactly: (R q) · (R k) = q · RᵀR · k = q · k, since RᵀR = I. So we can quantize R k instead of k and the original score is recoverable.
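A quick numerical check of that identity, assuming nothing beyond NumPy (a sketch, not the library's rotation code):

import numpy as np

d = 128
rng = np.random.default_rng(0)
# A random orthogonal matrix from the QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)
k = rng.standard_normal(d)

# (R q) · (R k) equals q · k because RᵀR = I.
print(np.allclose((R @ q) @ (R @ k), q @ k))   # True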

Why is this an improvement? Random rotation distributes a vector's mass roughly uniformly across all coordinates (concentration of measure). The marginal distribution along each post-rotation dimension is approximately Gaussian with the same variance. Quantization noise becomes nearly isotropic in rotated space — it doesn't preferentially align with any one direction, including the query direction. Inner-product errors collapse from "structured" to "spherical."
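The uniform-spread claim is easy to check numerically. The sketch below uses synthetic keys (illustrative only) in which a handful of "outlier" dimensions carry most of the variance, a pattern often reported for real key caches; after a random orthogonal rotation the per-dimension spread is nearly flat, which is what makes a shared per-dimension codebook viable.

import numpy as np

rng = np.random.default_rng(1)
n, d = 4096, 128

# Synthetic keys where a few dimensions dominate the variance.
scales = np.ones(d)
scales[:4] = 20.0
K = rng.standard_normal((n, d)) * scales

R, _ = np.linalg.qr(rng.standard_normal((d, d)))
K_rot = K @ R.T   # each row becomes R k

print("per-dim std before rotation: min %.2f, max %.2f" % (K.std(axis=0).min(), K.std(axis=0).max()))
print("per-dim std after rotation:  min %.2f, max %.2f" % (K_rot.std(axis=0).min(), K_rot.std(axis=0).max()))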

We use Lloyd-Max scalar quantization per dimension (1-D vector quantization) on the rotated values. The codebook adapts to the empirical distribution of rotated coordinates, which is well-behaved by the concentration argument above.
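As a rough illustration of what fitting a per-dimension Lloyd-Max codebook involves (this is plain 1-D k-means on synthetic data, not the library's fitting code):

import numpy as np

def lloyd_max_1d(x, bits=4, iters=25):
    # Lloyd-Max in one dimension is k-means: alternate nearest-codeword
    # assignment with centroid updates until the codebook settles.
    k = 2 ** bits
    codebook = np.quantile(x, (np.arange(k) + 0.5) / k)   # quantile initialization
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                codebook[j] = x[idx == j].mean()
    return np.sort(codebook)

# Rotated coordinates are approximately Gaussian, so Gaussian samples are a fair stand-in.
samples = np.random.default_rng(2).standard_normal(10_000)
cb = lloyd_max_1d(samples, bits=4)
codes = np.abs(samples[:, None] - cb[None, :]).argmin(axis=1)
recon = cb[codes]
print("quantization MSE:", np.mean((samples - recon) ** 2))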

The kernel — attending in rotated space

The naive way to use a rotated quantization is: dequantize the K cache, multiply by RT, then attend. That costs O(seq_len × d²) in extra ops per step — fatal for long contexts.

The fix: rotate the query once, attend in rotated space directly against the stored quantized rotated keys. The rotation is now O(d²) fixed cost per step, not per-key. The fused Metal kernel does the dequant inline as part of the dot-product reduction, never materializing dequantized keys in memory.
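The kernel itself is Metal, but the math it fuses can be written down as a NumPy reference. The sketch below is a hypothetical helper, with the value path omitted; it assumes per-dimension codebooks from the Lloyd-Max step above and stored integer codes for the rotated keys.

import numpy as np

def rotated_space_scores(q, codes, codebooks, R):
    # q         : (d,) query in the original space
    # codes     : (n, d) integer codes for the rotated, quantized keys
    # codebooks : (d, 2**bits) per-dimension Lloyd-Max codewords
    # R         : (d, d) orthogonal rotation applied when the keys were cached
    d = q.shape[0]
    q_rot = R @ q                              # one O(d²) rotation per decode step
    # Dequantize by codebook lookup; the fused kernel does this lookup inline
    # inside the dot-product reduction instead of materializing K_hat.
    K_hat = codebooks[np.arange(d), codes]     # (n, d) dequantized rotated keys
    return (K_hat @ q_rot) / np.sqrt(d)        # softmax over these gives the attention weights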

This is what gets us back to ~speed parity with affine despite the rotation overhead.

Results — quality and speed

Method             Bits   Needle retrieval   Reasoning   Speed vs fp16
fp16 (reference)   16     100 %              32 %        1.00 ×
Affine             4      73 %               30 %        0.96 ×
TurboQuant         4      100 %              32 %        0.98 ×

100 % needle retrieval at 4-bit, vs 73 % for affine. Reasoning quality also slightly higher. Speed within 2 % of fp16 — a 47 percentage-point swing from our first naive implementation, achieved through incremental dequantization and the rotated-space SDPA kernel.

Where this lives in mlx-optiq

TurboQuant ships today as a Python primitive you can import directly:

turbo_kv.py
from optiq.core.turbo_kv_cache import (
    TurboQuantKVCache, patch_attention,
)

# Enable the TurboQuant attention path before building the cache.
patch_attention()

# model: an already-loaded mlx-lm model; give each attention layer a TurboQuant cache.
cache = model.make_cache()
for i, layer in enumerate(model.layers):
    if hasattr(layer, "self_attn"):
        cache[i] = TurboQuantKVCache(
            head_dim=layer.self_attn.head_dim,
            bits=4, seed=42 + i,  # a distinct rotation seed per layer
        )

Why optiq serve doesn't use it (yet)

The default optiq serve --kv-config path uses mlx_lm.models.cache.QuantizedKVCache (affine). The reason is integration: that cache class plugs into mx.quantized_matmul's fused fast path on Apple Silicon, which gives the 30–60 % decode speedup at long context that we ship as our headline number.

A fused-TurboQuant serve path is on the roadmap. Until then, TurboQuant is for retrieval-quality-critical workloads where you can drop into the lower-level Python API.

Where the name comes from

"Turbo" because the rotated-space attention kernel was the speed unlock — without it, rotated quantization is a 47 % slowdown, which kills it as a serving option. The kernel turns a research curiosity into something that can plausibly hold its own against affine in production.

The full empirical write-up — including perplexity comparisons across bit-widths and the speed-optimization journey — lives in experiments.

— the mlx-optiq team