mlx-optiq
Engineering · April 25, 2026

Gemma-4 lands on mlx-optiq.

Four new pre-built quants on Hugging Face this week — Google's full Gemma-4 instruct lineup, from the compact 4 GB e2b through the 18 GB 31 B-dense. The headline result is on the small one: gemma-4-e4b at uniform 4-bit collapses to 23.5 % on GSM8K. Mixed-precision recovers it to 55.5 % — a +32-point jump at the same 6 GB on disk.

The further uniform-4 drags a model below usable quality, the more mixed-precision recovers.

The lineup

Model               | Size     | GSM8K   | vs uniform-4
gemma-4-e2b-it      | 4.0 GB   | 13.0 %  | +7.5 pp
gemma-4-e4b-it      | 6.0 GB   | 55.5 %  | +32.0 pp
gemma-4-26B-A4B-it  | 14.9 GB  | 94.0 %  | +2.0 pp
gemma-4-31B-it      | 18.1 GB  | 96.0 %  | 0.0 pp

Why e4b recovers so dramatically

The pattern across all 10 quants we ship is consistent: the bigger the gap between bf16 quality and uniform-4 quality, the bigger mlx-optiq's win. Saturated benchmarks (gemma-4-31B already at 96 %) leave nothing to recover. Models that uniform-4 nearly breaks (e4b at 23.5 %) have all the room to grow.

e4b is right at the edge — it's a 4 B model with strong reasoning capability that gets lobotomized by quantization noise on a few sensitive layers. The per-layer KL pass identifies those layers (lm_head, the first attention block, layer 0's KV-projection sibling, the last few transformer blocks) and protects them at 8-bit. Everything else stays 4-bit. Net result: same 6 GB on disk, 2.4× the GSM8K accuracy.
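For intuition, here is a minimal greedy sketch of that kind of sensitivity-driven bit assignment. It is an illustration only, not mlx-optiq's code: the layer names, KL scores, and per-layer sizes are made up, and the greedy loop is a stand-in for the knapsack described below.

def assign_bits(kl_per_layer, base_size_gb, budget_gb, upgrade_cost_gb):
    """Greedy stand-in for the bit-width knapsack: protect the most
    KL-sensitive layers at 8-bit until the disk budget is exhausted."""
    bits = {name: 4 for name in kl_per_layer}      # everything starts at 4-bit
    total = base_size_gb
    for name in sorted(kl_per_layer, key=kl_per_layer.get, reverse=True):
        cost = upgrade_cost_gb[name]               # extra GB for a 4-bit -> 8-bit upgrade
        if total + cost <= budget_gb:
            bits[name] = 8
            total += cost
    return bits, total

# Toy inputs: per-layer KL divergence of quantized vs bf16 outputs (illustrative values).
kl = {"lm_head": 0.41, "layers.0.self_attn": 0.22, "layers.0.kv_proj": 0.19,
      "layers.34.mlp": 0.11, "layers.17.mlp": 0.02}
cost = {name: 0.3 for name in kl}
bits, size_gb = assign_bits(kl, base_size_gb=5.0, budget_gb=6.0, upgrade_cost_gb=cost)
print(bits, f"{size_gb:.1f} GB")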

The 26B-A4B sparse MoE

Gemma-4-26B-A4B is the family's mixture-of-experts: 26 B total parameters, 4 B active per token. Unlike the Qwen MoE layout, Gemma uses switch_glu with a fused gate-and-up projection, so each expert carries two tensors (the fused gate-and-up plus the down projection) rather than the three-tensor split Qwen uses.
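Schematically, and with illustrative tensor names rather than the exact checkpoint keys, the two expert layouts look like this:

# Qwen-style expert: three separate projections per expert.
qwen_expert = {
    "gate_proj": ("d_model", "d_ff"),
    "up_proj":   ("d_model", "d_ff"),
    "down_proj": ("d_ff", "d_model"),
}

# Gemma-4 switch_glu expert: gate and up fused into one tensor.
gemma_expert = {
    "gate_up_proj": ("d_model", "2 * d_ff"),
    "down_proj":    ("d_ff", "d_model"),
}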

mlx-optiq's MoE walker is arch-aware: it identifies the layout, walks the experts, and treats each fused expert tensor as a single layer for sensitivity purposes. Per-expert bit-widths come out of the same knapsack as everything else. The result: 26B-A4B at 14.9 GB on disk runs faster than the dense 31 B variant because only 4 B of weights actually multiply per token.
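As a rough sketch of what the arch-aware walk amounts to (the function and key names are hypothetical, not the optiq API), each fused expert tensor surfaces as one sensitivity unit, and those units then compete for bits in the same budget as the dense layers:

def iter_expert_units(layer_idx, num_experts, layout):
    """Yield (unit_name, tensor_keys) pairs, one per sensitivity measurement."""
    for e in range(num_experts):
        if layout == "switch_glu":          # Gemma-4: the fused gate-and-up is one unit
            yield f"layers.{layer_idx}.experts.{e}.gate_up_proj", ["gate_up_proj"]
            yield f"layers.{layer_idx}.experts.{e}.down_proj", ["down_proj"]
        else:                               # Qwen-style: three units per expert
            for key in ("gate_proj", "up_proj", "down_proj"):
                yield f"layers.{layer_idx}.experts.{e}.{key}", [key]

# Example: enumerate the units one Gemma-4 MoE block would contribute to the knapsack.
for name, keys in iter_expert_units(layer_idx=3, num_experts=4, layout="switch_glu"):
    print(name, keys)

Feeding these unit names into the same kind of greedy assignment sketched earlier yields per-expert bit-widths with no MoE-specific special casing.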

One known caveat — KV-quant serving

Gemma-4 uses shared-KV attention (multiple Q heads share one KV head). The current mlx-lm KV-cache implementation doesn't yet support mixed-precision quantization on this layout. optiq kv-cache exits cleanly with an explanation when run on Gemma-4. Standard fp16 KV serving (stock mlx_lm.server or optiq serve without --kv-config) works fine — only the quantized KV path is affected.
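Conceptually the guard is just a config check before any KV quantization is attempted. A hypothetical sketch, assuming the standard Hugging Face config keys, not the actual optiq kv-cache code:

import sys

def check_kv_quant_supported(config):
    """Exit with an explanation if the model uses shared-KV (grouped-query) attention,
    since mixed-precision KV-cache quantization is not supported on that layout yet."""
    n_heads = config.get("num_attention_heads", 0)
    n_kv_heads = config.get("num_key_value_heads", n_heads)
    if n_kv_heads < n_heads:
        sys.exit(
            f"{n_heads} query heads share {n_kv_heads} KV heads: "
            "quantized KV serving is not supported for this layout yet. "
            "Serve with fp16 KV instead (omit --kv-config)."
        )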

If you need quantized KV at long context, use Qwen3.5 / 3.6. The Gemma family is best for outright quality at moderate context lengths.

Get them

from mlx_lm import load, generate

model, tok = load("mlx-community/gemma-4-e4b-it-OptiQ-4bit")
# 6.0 GB on disk · 55.5 % GSM8K · runs on a 16 GB MacBook Air

# Stock mlx-lm generation; the prompt is just an illustration.
print(generate(model, tok, prompt="What is 17 * 24? Think step by step.", max_tokens=128))

Full sampling defaults, training recipes and the shared-KV caveat are in the Gemma-4 family guide.

— the mlx-optiq team