Engineering · research · releases

Blog

Release notes, methodology dives, and benchmarking deep-dives. New posts land alongside major releases and research findings.

Gemma-4 lands on mlx-optiq — four sizes, +32 pp on the small one

Adding Google's full Gemma-4 instruct lineup — e2b, e4b, 26B-A4B sparse MoE, and 31B dense. The +32-point GSM8K recovery on gemma-4-e4b is the cleanest mixed-precision win we have. Plus the shared-KV caveat that means you'll want Qwen for quantized-KV serving.

TurboQuant — rotated-space KV attention preserves inner products

Affine quantization preserves magnitudes but distorts the inner products that attention actually uses. Rotation-based vector quantization — plus a Metal kernel that attends in rotated space — closes the gap. 100 % needle retrieval at 4-bit vs 73 % for affine, at the same speed.

Sensitivity-aware LoRA — fine-tuning that respects the bit budget

The same per-layer signal that drives mixed-precision quantization also drives adapter rank. 8-bit-quantized layers get 2× the adapter rank of 4-bit-quantized ones at the same parameter budget — and validation loss drops 12 % in head-to-head A/Bs. Plus the empirical training-ceiling map for a 36 GB Mac across all 10 supported models.

Not All Layers Are Equal — mixed-precision quantization for weights and KV cache on Apple Silicon

The research foundation behind mlx-optiq. Some layers are 56× more sensitive than others. The KV cache becomes the dominant memory cost at long contexts. Mixed-precision recovers what uniform 4-bit drops; mixed-precision KV fixes the perplexity collapse uniform 4-bit causes.