mlx-optiq
Postmortem · April 17, 2026

TurboQuant — what we built, what we measured, and why we didn't ship it.

This is the postmortem we wish we'd written months ago. We built a thing. The benchmarks looked good. We didn't ship it. The reason isn't a flaw in the technique — it's that the marginal win didn't justify the cost of a parallel serving path. That's a reasonable call but a deeply unsexy one to write up, so most teams don't.

The right answer to "should we ship this research?" is sometimes "no, even though the numbers are real."

The technique — rotated-space attention

Affine quantization is the standard way to compress the KV cache: stretch each tensor to fit in a smaller integer range, store an offset and a scale per group. It works for storage. But attention is a dot-product operation, and affine quantization doesn't preserve dot products well — it preserves magnitudes. The error in q · k after quantizing k isn't isotropic; it correlates with semantic structure in ways that bias which tokens win the softmax.
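
For concreteness, a minimal per-group affine quantizer looks roughly like the sketch below. This is an illustration of the idea in numpy, not mlx-lm's implementation; the group size and dtypes are arbitrary, and the input length is assumed divisible by the group size.

```python
import numpy as np

def affine_quantize(x, bits=4, group_size=32):
    """Per-group affine quantization: each group of `group_size` values gets
    its own scale and offset. Assumes x.size is divisible by group_size."""
    qmax = 2**bits - 1
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)   # avoid divide-by-zero
    q = np.clip(np.round((groups - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, lo):
    """Reconstruct the (lossy) float values; caller reshapes as needed."""
    return q.astype(np.float32) * scale + lo
```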

The fix is mathematically simple. Multiply both K and Q by the same random orthogonal matrix R. The attention score is preserved exactly: (Rq) · (Rk) = q · RᵀR · k = q · k, since RᵀR = I. So you can quantize Rk instead of k and recover the original score. Random rotation distributes a vector's mass roughly uniformly across all coordinates — the per-coordinate distribution becomes nearly Gaussian (concentration of measure), which is exactly what scalar quantizers like best.
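
Here's a quick numpy sketch of both claims. Building the rotation from a QR factorization of a Gaussian matrix is one standard construction for a random orthogonal matrix, not necessarily the one we shipped.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Random orthogonal matrix: orthonormalize a Gaussian matrix via QR.
# (np.linalg.qr returns Q first; we call it R to match the prose.)
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)
k = rng.standard_normal(d)

# Exact invariance: (Rq)·(Rk) = q·(RᵀR)·k = q·k because RᵀR = I.
assert np.allclose((R @ q) @ (R @ k), q @ k)

# Rotation spreads a key's mass across coordinates: the norm is unchanged,
# but the per-coordinate range a scalar quantizer must cover shrinks for
# spiky inputs.
k_spiky = np.zeros(d)
k_spiky[:4] = 10.0
print(np.abs(k_spiky).max(), np.abs(R @ k_spiky).max())
```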

The catch is the cost: naive rotated quantization requires dequantizing keys and rotating them back at attention time, which is O(seq_len × d²) per token — fatal at long context. Our trick was attending in rotated space: rotate the query once per step (a fixed O(d²) cost) and dot it directly against the stored quantized rotated keys via a fused Metal kernel that does the dequant inline. That brings the per-step overhead back to roughly parity with affine.
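
Stripped down to numpy, the flow looks like the sketch below. The per-row affine quantization and the explicit dequant are there for readability; the shipped path fuses the dequant into the Metal kernel and never materializes the dequantized keys the way this sketch does.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq_len, bits = 128, 4096, 4
R, _ = np.linalg.qr(rng.standard_normal((d, d)))     # shared rotation for K and Q

# Cache write: rotate each key, then quantize (per-row affine, for brevity).
K = rng.standard_normal((seq_len, d))
K_rot = K @ R.T                                       # row i is R @ K[i]
lo = K_rot.min(axis=1, keepdims=True)
scale = (K_rot.max(axis=1, keepdims=True) - lo) / (2**bits - 1)
K_q = np.round((K_rot - lo) / scale).astype(np.uint8)

# Decode step: rotate the query once (fixed O(d²)), then score it against the
# stored quantized rotated keys. Dequant is explicit here only for clarity.
q = rng.standard_normal(d)
scores = (K_q * scale + lo) @ (R @ q)

# Reference: exact scores in the original, unrotated space.
print(np.abs(scores - K @ q).max())
```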

The numbers we measured

On Qwen3.5-9B with 4-bit KV at 64 k context, head-to-head against mlx-lm's affine QuantizedKVCache:

| Method | Bits | Needle retrieval | Reasoning | PPL drift | Speed vs fp16 |
| --- | --- | --- | --- | --- | --- |
| fp16 (reference) | 16 | 100 % | 32 % | | 1.00 × |
| Affine | 4 | 73 % | 30 % | +0.48 | 0.96 × |
| TurboQuant (rotated) | 4 | 100 % | 32 % | +0.37 | 0.98 × |

The needle-in-a-haystack number was the most striking: 100 % vs 73 % at 4-bit, on the exact same model and exact same prompts. The retrieval failures with affine concentrated on long-distance matches — exactly where small inner-product errors push the wrong token over the softmax threshold. Reasoning quality and perplexity were close to fp16; speed was within 2 % of the affine path.

The optimization journey itself was satisfying engineering: our first naive Python implementation of rotated-space attention was 47 % slower than affine. Three rounds of work — incremental dequantization, a custom Metal kernel, a fused SDPA pass that never materializes dequantized keys — closed the gap to 2 %.

So why didn't we ship it

Three things killed it in turn.

1. The 100 % vs 73 % needle test was synthetic

Needle-in-a-haystack is a worst-case probe — single rare token, single position, against a vast distractor field. Real workloads almost never look like this. When we re-ran on more realistic long-context tasks (multi-fact retrieval, tool-result interpretation, multi-turn chat with file context), the gap shrank dramatically. At 32 k context on a 9B model, both quantizers landed within 1–2 percentage points of fp16 quality on the tasks our users actually run — and within noise of each other.

The headline number was real. It just wasn't predictive of user-visible quality.

2. Per-layer mixed-precision affine already captured most of the win

The companion experiment — per-layer KV bit-width assignment — was the bigger lever. Once we measured per-layer KV sensitivity and protected layer 0 (often 56× more sensitive than the average) at 8-bit, the affine path's quality regressions on long context largely disappeared. Mixed-precision affine was already covering most of what TurboQuant fixed, and it integrates with mx.quantized_matmul's fused fast path on Apple Silicon — which TurboQuant (with its custom kernel) does not.
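
As an illustration of the kind of policy this implies, here is a small sketch. The helper name, the protection fraction, and the toy sensitivity scores are hypothetical, not the optiq kv-cache implementation.

```python
def assign_kv_bits(sensitivity, protect_frac=0.15, high_bits=8, low_bits=4):
    """Hypothetical policy: keep the most KV-sensitive layers at 8-bit and
    drop the rest to 4-bit. `sensitivity` is a per-layer score, e.g. PPL
    drift measured with only that layer's KV cache quantized."""
    n_protect = max(1, int(len(sensitivity) * protect_frac))
    ranked = sorted(range(len(sensitivity)),
                    key=lambda i: sensitivity[i], reverse=True)
    protected = set(ranked[:n_protect])
    return [high_bits if i in protected else low_bits
            for i in range(len(sensitivity))]

# Toy example: layer 0 measured as far more sensitive than the rest, so it
# gets the 8-bit cache while the others stay at 4-bit.
print(assign_kv_bits([5.6, 0.10, 0.09, 0.12, 0.08, 0.10, 0.07, 0.11]))
```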

Net for users: mixed-precision affine gave roughly the same quality as TurboQuant on real workloads, and ~30–60 % faster decode at 64 k context. We can't ship a quality wash that's also slower.

3. Two parallel serving paths is one too many

Even ignoring quality and speed, shipping TurboQuant in optiq serve meant maintaining a fork of mlx-lm's attention path — every kernel update, every model class addition, every new attention variant (sliding window, hybrid, MoE-routed) would need to be ported into our rotated-space version. That's a structural ongoing cost the technique would have had to keep earning.

For a research codebase that's a fine trade. For something we ship to users on PyPI as their default serving stack, it's not.

What we kept, what we removed

For a long time we kept the TurboQuant code in the package as an "import-it-yourself" library primitive — disabled by default, available for users who specifically wanted to play with rotated-space KV. As we matured toward our first real release, we audited what was actually load-bearing and what was just historical research that hadn't earned its place. TurboQuant fell into the second bucket. We removed it from the package.

What stayed:

  • Per-layer KV-cache sensitivity analysis (optiq kv-cache) — this is the real production feature that drives the long-context wins. The results page has the numbers.
  • The conceptual framing that informed TurboQuant — "attention compression has different sensitivity than weight compression" — is now embedded in the standard mixed-precision KV pipeline.

What went away:

  • optiq.core.turbo_kv_cache, optiq.core.turbo_quant, optiq.core.turbo_metal, optiq.core.turbo_state_cache — all deleted.
  • The demo/demo_turbo_kv.py stress test.
  • The TurboQuant-flavoured options that never made it into the public CLI anyway.

Lessons we took with us

Synthetic benchmarks lie about marginal wins. The 100 % vs 73 % needle gap was real and reproducible. It was also a worst-case probe of one failure mode that doesn't dominate real usage. We've since moved to multi-domain eval suites (5-shot MMLU, IFEval, BFCL, plus task-specific tests at multiple context lengths) so we don't get fooled the same way again.

If a feature requires a separate serving path, it has to clear a much higher bar. The cost of maintaining parallel infrastructure compounds with every new model architecture. Anything that doesn't cleanly compose with stock mlx-lm needs to deliver a meaningful, durable win — not a marginal one that erodes as the upstream landscape changes.

Ship fewer things, but ship them well. A library that does five things excellently beats one that does six things plus a research demo. mlx-optiq's edge is mixed-precision quantization, sensitivity-aware LoRA, and a dual-protocol serving stack. That's enough.

The TurboQuant code is gone from main. The git history still has it for anyone who wants to extract it for their own research — the math is straightforward and the Metal kernel works. We just won't be the ones maintaining it.

— the mlx-optiq team