mlx-optiq
Measurements · Apple M3 Max 36 GB

Benchmarks: quality, speed, retrieval.

Three axes — weight quantization quality on GSM8K, KV-cache serving throughput at 64 k context, and TurboQuant rotated-space retrieval accuracy. All numbers are reproducible from the CLI and dataset listed in each section.

01 Weight quantization quality

mlx-optiq vs uniform 4-bit on GSM8K.

200-sample GSM8K subset, 3-shot chain-of-thought prompt, identical sampling parameters, identical chat-template handling for both quantizations; a reproduction sketch follows below.

LLM quality — GSM8K · 200 samples
| Model | Uniform-4 | mlx-optiq-4 | Delta |
|---|---|---|---|
| Qwen3.5-0.8B | 11.5% | 27.0% | +15.5 pp |
| Qwen3.5-2B | 48.5% | 48.0% | −0.5 pp |
| Qwen3.5-4B | 79.5% | 81.5% | +2.0 pp |
| Qwen3.5-9B | 90.0% | 90.0% | 0.0 pp |
| Qwen3.6-27B | 94.0% | 95.0% | +1.0 pp |
| Qwen3.5-35B-A3B | 89.5% | 89.5% | 0.0 pp |
| gemma-4-e2b-it | 5.5% | 13.0% | +7.5 pp |
| gemma-4-e4b-it | 23.5% | 55.5% | +32.0 pp |
| gemma-4-26B-A4B-it | 92.0% | 94.0% | +2.0 pp |
| gemma-4-31B-it | 96.0% | 96.0% | 0.0 pp |
Pattern: mlx-optiq's wins grow with how far uniform 4-bit degrades the model. On saturated benchmarks (Qwen3.5-9B at 90.0%), there is nothing left to recover. On models that uniform 4-bit nearly breaks (gemma-4-e4b-it at 23.5%), the recovery is dramatic: +32 points, to 55.5%.
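The evaluation loop is straightforward to rebuild. A minimal sketch, assuming the quantized model directories already exist on disk: mlx_lm's load/generate and Hugging Face datasets are real APIs, but the few-shot exemplars and the answer-extraction regex below are illustrative stand-ins for whatever the optiq harness actually uses.

```python
# Minimal GSM8K harness matching the protocol above: 200 samples, 3-shot CoT,
# identical chat-template handling per model. FEW_SHOT and the answer regex
# are placeholders, not the exact optiq harness internals.
import re
from datasets import load_dataset
from mlx_lm import load, generate

FEW_SHOT = "..."  # 3 worked GSM8K examples with chain-of-thought, fixed across runs

def last_number(text):
    """Take the final number in the completion as the predicted answer."""
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def gsm8k_accuracy(model_path, n=200):
    model, tokenizer = load(model_path)
    rows = load_dataset("gsm8k", "main", split="test").select(range(n))
    correct = 0
    for row in rows:
        prompt = tokenizer.apply_chat_template(
            [{"role": "user", "content": FEW_SHOT + "\n\n" + row["question"]}],
            add_generation_prompt=True,
        )
        out = generate(model, tokenizer, prompt=prompt, max_tokens=512)
        gold = row["answer"].split("####")[-1].strip().replace(",", "")
        correct += last_number(out) == gold
    return correct / n
```

Running this once on the uniform 4-bit directory and once on the mlx-optiq directory, with nothing else changed, reproduces the comparison in the table.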

02 KV cache serving

Mixed-precision KV at 64 k tokens.

Apple M3 Max, 36 GB. 64,000-token English prose prompt, streaming 500 output tokens. Comparing stock mlx_lm.server (fp16 KV) against optiq serve --kv-config (per-layer, sensitivity-guided KV quantization); a timing-measurement sketch follows the takeaway below.

Decode tok/s at 64k
| Model | fp16 TTFT | fp16 decode | Mixed TTFT | Mixed decode | Decode speedup |
|---|---|---|---|---|---|
| Qwen3.5-0.8B | 34.5 s | 47.2 tok/s | 40.5 s | 42.4 tok/s | −10% |
| Qwen3.5-2B | 82.8 s | 27.9 tok/s | 161.7 s | 41.8 tok/s | +50% |
| Qwen3.5-4B | 165.8 s | 8.1 tok/s | 252.0 s | 13.1 tok/s | +62% |
| Qwen3.5-9B | 214.8 s | 20.7 tok/s | 163.8 s | 27.1 tok/s | +31% |
Takeaway: For Qwen3.5-2B and larger, mixed-precision KV gives a 31–62% decode speedup at 64 k context. 0.8B is too small to benefit: optiq kv-cache's sensitivity pass correctly picks uniform 4-bit (no layer needs 8-bit protection), but on M3 Max even that is marginally slower than fp16 at this scale.
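The TTFT and decode columns can be measured with a small streaming client. A sketch under two assumptions: that both servers expose an OpenAI-style streaming completions endpoint (true of mlx_lm.server; assumed here for optiq serve), and that one SSE chunk corresponds to one generated token.

```python
# Streaming timing client: TTFT = time to the first SSE data chunk, decode
# rate = remaining chunks over the remaining wall time. The endpoint path and
# the one-chunk-per-token assumption are hedges, not optiq-documented behavior.
import json
import time

import requests

def measure(url, prompt, max_tokens=500):
    t0 = time.perf_counter()
    resp = requests.post(
        url,  # e.g. "http://localhost:8080/v1/completions"
        json={"prompt": prompt, "max_tokens": max_tokens, "stream": True},
        stream=True,
    )
    ttft, n_chunks = None, 0
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip blank SSE lines and keep-alives
        if line[6:] == b"[DONE]":
            break
        if ttft is None:
            ttft = time.perf_counter() - t0  # time to first token
        n_chunks += 1
    decode_rate = (n_chunks - 1) / (time.perf_counter() - t0 - ttft)
    return ttft, decode_rate
```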
Per-layer KV configs (generated by optiq kv-cache --target-bits 4.5)
| Model | Full-attn layers | Config | Avg bits |
|---|---|---|---|
| Qwen3.5-0.8B | 6 of 24 | 6 @ 4-bit | 4.00 |
| Qwen3.5-2B | 6 of 24 | 6 @ 4-bit | 4.00 |
| Qwen3.5-4B | 8 of 32 | 7 @ 4-bit + 1 @ 8-bit (layer 3) | 4.50 |
| Qwen3.5-9B | 8 of 32 | 7 @ 4-bit + 1 @ 8-bit (layer 3) | 4.50 |
Why layer 3? In Qwen3.5's hybrid architecture, layer 3 is the first full-attention layer (layers 0, 1, and 2 are linear-attention). It is consistently the most KV-sensitive layer across both the 4B and 9B models, and protecting it at 8-bit also happens to flip it onto mx.quantized_matmul's fast path on Apple Silicon. Quality and speed point in the same direction; a sketch of what such a config amounts to follows.
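Expressed directly against mlx_lm's cache API, the Qwen3.5-4B row above amounts to something like the following. make_prompt_cache, KVCache, and QuantizedKVCache are real mlx_lm APIs; how optiq serve wires its --kv-config internally is an assumption.

```python
# Per-layer mixed-precision KV: 8-bit on layer 3, 4-bit everywhere else,
# mirroring the Qwen3.5-4B config above. This is an illustration of the idea,
# not optiq's actual --kv-config plumbing.
from mlx_lm import generate, load
from mlx_lm.models.cache import KVCache, QuantizedKVCache, make_prompt_cache

model, tokenizer = load("path/to/qwen3.5-4b")  # hypothetical local model path

# make_prompt_cache picks the right cache type per layer (linear-attention
# layers get their own cache objects), so swap only the full-attention entries.
cache = make_prompt_cache(model)
for i, layer_cache in enumerate(cache):
    if isinstance(layer_cache, KVCache):
        bits = 8 if i == 3 else 4  # protect the first full-attention layer
        cache[i] = QuantizedKVCache(group_size=64, bits=bits)

out = generate(model, tokenizer, prompt="...", max_tokens=500, prompt_cache=cache)
```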

03 TurboQuant research

Rotated-space attention vs affine KV.

The research path: rotation-based vector quantization that preserves inner products for attention. Compared against mlx-lm's affine QuantizedKVCache.

Quality and speed comparison
TurboQuant 4-bit: better reasoning (32% vs 30%), perfect retrieval (100% vs 73%), same speed (−2%).
Needle retrieval
100% at 4-bit vs 73% for affine: rotated-space quantization preserves inner products across all sequence positions.
Perplexity
Tight perplexity gap at matched bit-widths: TurboQuant MSE 4-bit is PPL +0.37 vs affine's +0.48.
Memory scaling
Per-token KV storage as a function of sequence length: 4-bit TurboQuant is 4× smaller than fp16.
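The inner-product-preservation claim is easy to sanity-check in isolation. A toy numpy sketch, not TurboQuant itself: a shared random orthogonal rotation and plain round-to-nearest quantization stand in for TurboQuant's actual rotation and quantizer.

```python
# Toy check: quantizing queries/keys after a shared orthogonal rotation
# preserves q·k inner products far better than quantizing the raw vectors
# when a few dimensions carry outliers, as attention activations often do.
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 1000

q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
q[:, :4] *= 20.0  # outlier dimensions
k[:, :4] *= 20.0

def quantize(x, bits=4):
    """Symmetric per-tensor round-to-nearest quantization."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

def ip_error(q_ref, k_ref, q_hat, k_hat):
    """Mean relative error of per-pair inner products vs the exact values."""
    exact = np.einsum("nd,nd->n", q_ref, k_ref)
    approx = np.einsum("nd,nd->n", q_hat, k_hat)
    return np.mean(np.abs(approx - exact) / (np.abs(exact) + 1e-9))

# Raw-space quantization: the outlier dims dominate the per-tensor scale.
err_raw = ip_error(q, k, quantize(q), quantize(k))

# Rotated-space: R is orthogonal, so (qR)·(kR) = q·k exactly before
# quantization, and the rotation spreads outlier energy across dimensions.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
err_rot = ip_error(q, k, quantize(q @ R), quantize(k @ R))

print(f"raw-space relative error:     {err_raw:.3f}")
print(f"rotated-space relative error: {err_rot:.3f}")  # several times smaller
```

Spreading the outlier energy shrinks the quantization step, which is the same mechanism behind the 100% needle-retrieval result above.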