mlx-optiq
Measurements · Apple M3 Max 36 GB

Benchmarks: quality, speed, retrieval.

Three axes — weight quantization quality on GSM8K, KV-cache serving throughput at 64 k context, and TurboQuant rotated-space retrieval accuracy. All numbers are reproducible from the CLI and dataset listed in each section.

01 Weight quantization quality

mlx-optiq vs uniform 4-bit on GSM8K.

200-sample GSM8K subset, 3-shot chain-of-thought prompt, identical sampling parameters, identical chat-template handling for both quantizations; a reproduction sketch follows below.

LLM quality — GSM8K · 200 samples
| Model | Uniform-4 | mlx-optiq-4 | Delta |
|---|---|---|---|
| Qwen3.5-0.8B | 11.5% | 27.0% | +15.5 pp |
| Qwen3.5-2B | 48.5% | 48.0% | −0.5 pp |
| Qwen3.5-4B | 79.5% | 81.5% | +2.0 pp |
| Qwen3.5-9B | 90.0% | 90.0% | 0.0 pp |
| Qwen3.6-27B | 94.0% | 95.0% | +1.0 pp |
| Qwen3.5-35B-A3B | 89.5% | 89.5% | 0.0 pp |
| gemma-4-e2b-it | 5.5% | 13.0% | +7.5 pp |
| gemma-4-e4b-it | 23.5% | 55.5% | +32.0 pp |
| gemma-4-26B-A4B-it | 92.0% | 94.0% | +2.0 pp |
| gemma-4-31B-it | 96.0% | 96.0% | 0.0 pp |
Pattern: mlx-optiq's wins grow with how far uniform 4-bit degrades the model. On saturated benchmarks (Qwen3.5-9B at 90.0%), there is nothing left to recover. On models that uniform 4-bit nearly breaks (gemma-4-e4b-it at 23.5%), the recovery is dramatic: +32 points, to 55.5%.
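The evaluation loop is straightforward to rebuild. A minimal sketch, assuming the quantized model directories already exist on disk: mlx_lm's load/generate and Hugging Face datasets are real APIs, but the few-shot exemplars and the answer-extraction regex below are illustrative stand-ins for whatever the optiq harness actually uses.

```python
# Minimal GSM8K harness matching the protocol above: 200 samples, 3-shot CoT,
# identical chat-template handling per model. FEW_SHOT and the answer regex
# are placeholders, not the exact optiq harness internals.
import re
from datasets import load_dataset
from mlx_lm import load, generate

FEW_SHOT = "..."  # 3 worked GSM8K examples with chain-of-thought, fixed across runs

def last_number(text):
    """Take the final number in the completion as the predicted answer."""
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def gsm8k_accuracy(model_path, n=200):
    model, tokenizer = load(model_path)
    rows = load_dataset("gsm8k", "main", split="test").select(range(n))
    correct = 0
    for row in rows:
        prompt = tokenizer.apply_chat_template(
            [{"role": "user", "content": FEW_SHOT + "\n\n" + row["question"]}],
            add_generation_prompt=True,
        )
        out = generate(model, tokenizer, prompt=prompt, max_tokens=512)
        gold = row["answer"].split("####")[-1].strip().replace(",", "")
        correct += last_number(out) == gold
    return correct / n
```

Running this once on the uniform 4-bit directory and once on the mlx-optiq directory, with nothing else changed, reproduces the comparison in the table.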

02 KV cache serving

Mixed-precision KV at 64 k tokens.

Apple M3 Max, 36 GB. 64,000-token English prose prompt, streaming 500 output tokens. Comparing stock mlx_lm.server (fp16 KV) against optiq serve --kv-config (per-layer, sensitivity-guided KV quantization); a timing-measurement sketch follows the takeaway below.

Decode tok/s at 64k
| Model | fp16 TTFT | fp16 decode | Mixed TTFT | Mixed decode | Decode speedup |
|---|---|---|---|---|---|
| Qwen3.5-0.8B | 34.5 s | 47.2 tok/s | 40.5 s | 42.4 tok/s | −10% |
| Qwen3.5-2B | 82.8 s | 27.9 tok/s | 161.7 s | 41.8 tok/s | +50% |
| Qwen3.5-4B | 165.8 s | 8.1 tok/s | 252.0 s | 13.1 tok/s | +62% |
| Qwen3.5-9B | 214.8 s | 20.7 tok/s | 163.8 s | 27.1 tok/s | +31% |
Takeaway: For Qwen3.5-2B and larger, mixed-precision KV gives a 31–62% decode speedup at 64 k context. 0.8B is too small to benefit: optiq kv-cache's sensitivity pass correctly picks uniform 4-bit (no layer needs 8-bit protection), but on M3 Max even that is marginally slower than fp16 at this scale.
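The TTFT and decode columns can be measured with a small streaming client. A sketch under two assumptions: that both servers expose an OpenAI-style streaming completions endpoint (true of mlx_lm.server; assumed here for optiq serve), and that one SSE chunk corresponds to one generated token.

```python
# Streaming timing client: TTFT = time to the first SSE data chunk, decode
# rate = remaining chunks over the remaining wall time. The endpoint path and
# the one-chunk-per-token assumption are hedges, not optiq-documented behavior.
import json
import time

import requests

def measure(url, prompt, max_tokens=500):
    t0 = time.perf_counter()
    resp = requests.post(
        url,  # e.g. "http://localhost:8080/v1/completions"
        json={"prompt": prompt, "max_tokens": max_tokens, "stream": True},
        stream=True,
    )
    ttft, n_chunks = None, 0
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip blank SSE lines and keep-alives
        if line[6:] == b"[DONE]":
            break
        if ttft is None:
            ttft = time.perf_counter() - t0  # time to first token
        n_chunks += 1
    decode_rate = (n_chunks - 1) / (time.perf_counter() - t0 - ttft)
    return ttft, decode_rate
```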
Per-layer KV configs (generated by optiq kv-cache --target-bits 4.5)
| Model | Full-attn layers | Config | Avg bits |
|---|---|---|---|
| Qwen3.5-0.8B | 6 of 24 | 6 @ 4-bit | 4.00 |
| Qwen3.5-2B | 6 of 24 | 6 @ 4-bit | 4.00 |
| Qwen3.5-4B | 8 of 32 | 7 @ 4-bit + 1 @ 8-bit (layer 3) | 4.50 |
| Qwen3.5-9B | 8 of 32 | 7 @ 4-bit + 1 @ 8-bit (layer 3) | 4.50 |
Why layer 3? In Qwen3.5's hybrid architecture, layer 3 is the first full-attention layer (layers 0, 1, and 2 are linear-attention). It is consistently the most KV-sensitive layer across both the 4B and 9B models, and protecting it at 8-bit also happens to flip it onto mx.quantized_matmul's fast path on Apple Silicon. Quality and speed point in the same direction; a sketch of what such a config amounts to follows.
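Expressed directly against mlx_lm's cache API, the Qwen3.5-4B row above amounts to something like the following. make_prompt_cache, KVCache, and QuantizedKVCache are real mlx_lm APIs; how optiq serve wires its --kv-config internally is an assumption.

```python
# Per-layer mixed-precision KV: 8-bit on layer 3, 4-bit everywhere else,
# mirroring the Qwen3.5-4B config above. This is an illustration of the idea,
# not optiq's actual --kv-config plumbing.
from mlx_lm import generate, load
from mlx_lm.models.cache import KVCache, QuantizedKVCache, make_prompt_cache

model, tokenizer = load("path/to/qwen3.5-4b")  # hypothetical local model path

# make_prompt_cache picks the right cache type per layer (linear-attention
# layers get their own cache objects), so swap only the full-attention entries.
cache = make_prompt_cache(model)
for i, layer_cache in enumerate(cache):
    if isinstance(layer_cache, KVCache):
        bits = 8 if i == 3 else 4  # protect the first full-attention layer
        cache[i] = QuantizedKVCache(group_size=64, bits=bits)

out = generate(model, tokenizer, prompt="...", max_tokens=500, prompt_cache=cache)
```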

03 TurboQuant research

Rotated-space attention vs affine KV.

The research path: rotation-based vector quantization that preserves inner products for attention. Compared against mlx-lm's affine QuantizedKVCache.

Quality and speed comparison
TurboQuant 4-bit: better reasoning (32% vs 30%), perfect retrieval (100% vs 73%), same speed (−2%).
Needle retrieval
100% at 4-bit vs 73% for affine: rotated-space quantization preserves inner products across all sequence positions.
Perplexity
Tight perplexity gap at matched bit-widths: TurboQuant MSE 4-bit is PPL +0.37 vs affine's +0.48.
Memory scaling
Per-token KV storage as a function of sequence length: 4-bit TurboQuant is 4× smaller than fp16.
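The inner-product-preservation claim is easy to sanity-check in isolation. A toy numpy sketch, not TurboQuant itself: a shared random orthogonal rotation and plain round-to-nearest quantization stand in for TurboQuant's actual rotation and quantizer.

```python
# Toy check: quantizing queries/keys after a shared orthogonal rotation
# preserves q·k inner products far better than quantizing the raw vectors
# when a few dimensions carry outliers, as attention activations often do.
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 1000

q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
q[:, :4] *= 20.0  # outlier dimensions
k[:, :4] *= 20.0

def quantize(x, bits=4):
    """Symmetric per-tensor round-to-nearest quantization."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

def ip_error(q_ref, k_ref, q_hat, k_hat):
    """Mean relative error of per-pair inner products vs the exact values."""
    exact = np.einsum("nd,nd->n", q_ref, k_ref)
    approx = np.einsum("nd,nd->n", q_hat, k_hat)
    return np.mean(np.abs(approx - exact) / (np.abs(exact) + 1e-9))

# Raw-space quantization: the outlier dims dominate the per-tensor scale.
err_raw = ip_error(q, k, quantize(q), quantize(k))

# Rotated-space: R is orthogonal, so (qR)·(kR) = q·k exactly before
# quantization, and the rotation spreads outlier energy across dimensions.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
err_rot = ip_error(q, k, quantize(q @ R), quantize(k @ R))

print(f"raw-space relative error:     {err_raw:.3f}")
print(f"rotated-space relative error: {err_rot:.3f}")  # several times smaller
```

Spreading the outlier energy shrinks the quantization step, which is the same mechanism behind the 100% needle-retrieval result above.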