# Gemma-4 lands on mlx-optiq
Four new pre-built quants on Hugging Face this week: Google's full Gemma-4 instruct lineup, from the compact 4 GB e2b through the 18 GB dense 31B. The headline result is on the small one: gemma-4-e4b at uniform 4-bit collapses to 23.5 % on GSM8K. Mixed precision recovers it to 55.5 %, a 32-point jump at the same 6 GB on disk.
The rule of thumb: the further uniform 4-bit knocks a model out of usable territory, the more mixed precision recovers.
## The lineup
| Model | Size | GSM8K | vs uniform-4 |
|---|---|---|---|
| gemma-4-e2b-it | 4.0 GB | 13.0 % | +7.5 pp |
| gemma-4-e4b-it | 6.0 GB | 55.5 % | +32.0 pp |
| gemma-4-26B-A4B-it | 14.9 GB | 94.0 % | +2.0 pp |
| gemma-4-31B-it | 18.1 GB | 96.0 % | 0.0 pp |
## Why e4b recovers so dramatically
The pattern across all 10 quants we ship is consistent: the bigger the gap between bf16 quality and uniform-4 quality, the bigger mlx-optiq's win. Saturated benchmarks (gemma-4-31B already at 96 %) leave nothing to recover. Models that uniform-4 nearly breaks (e4b at 23.5 %) have all the room to grow.
e4b is right at the edge — it's a 4 B model with strong reasoning capability that gets lobotomized by quantization noise on a few sensitive layers. The per-layer KL pass identifies those layers (lm_head, the first attention block, layer 0's KV-projection sibling, the last few transformer blocks) and protects them at 8-bit. Everything else stays 4-bit. Net result: same 6 GB on disk, 2.4× the GSM8K accuracy.
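The protect-the-sensitive-layers idea can be sketched in a few lines. This is a toy illustration, not mlx-optiq's actual API: the layer names and per-layer KL numbers below are invented, and the real pipeline measures KL against the bf16 model rather than hard-coding it.

```python
# Toy sketch: rank layers by per-layer KL divergence measured at uniform 4-bit,
# then promote the most sensitive ones to 8-bit. Names and numbers are invented.

def assign_bits(kl_by_layer, n_protect):
    """Give the n_protect most KL-sensitive layers 8-bit, the rest 4-bit."""
    ranked = sorted(kl_by_layer, key=kl_by_layer.get, reverse=True)
    protected = set(ranked[:n_protect])
    return {name: 8 if name in protected else 4 for name in kl_by_layer}

kl = {
    "lm_head": 0.41,          # hypothetical per-layer KL at uniform 4-bit
    "layers.0.attn": 0.33,
    "layers.0.kv_proj": 0.29,
    "layers.34.mlp": 0.18,
    "layers.17.mlp": 0.02,
}
bits = assign_bits(kl, n_protect=3)
```

With these toy numbers, `lm_head`, `layers.0.attn`, and `layers.0.kv_proj` land at 8-bit and everything else stays 4-bit, mirroring the protected-layer list described above.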
## The 26B-A4B sparse MoE
Gemma-4-26B-A4B is the family's mixture-of-experts: 26 B total parameters, 4 B active per token. Unlike Qwen's MoE layout, Gemma uses `switch_glu` with a fused gate-and-up projection, so each expert is a pair of fused tensors rather than the three-tensor split Qwen uses.
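To make the two layouts concrete, here is a toy comparison. The shapes and the tensor names (`gate_up_proj`, `gate_proj`, etc.) are assumptions for illustration; the real checkpoints' dimensions and key names differ.

```python
# Toy illustration of the fused switch_glu expert layout vs a three-tensor split.
# Shapes and tensor names are invented for the example.
d_model, d_ff = 8, 16

# Qwen-style expert: three separate tensors per expert.
qwen_expert = {
    "gate_proj": (d_ff, d_model),
    "up_proj": (d_ff, d_model),
    "down_proj": (d_model, d_ff),
}

# Gemma-style switch_glu expert: gate and up fused into one tensor.
gemma_expert = {
    "gate_up_proj": (2 * d_ff, d_model),
    "down_proj": (d_model, d_ff),
}

def split_gate_up(fused_shape):
    """Recover the two logical projections from the fused tensor's shape."""
    rows, cols = fused_shape
    return (rows // 2, cols), (rows // 2, cols)

gate_shape, up_shape = split_gate_up(gemma_expert["gate_up_proj"])
```

Splitting the fused tensor halfway along its output dimension recovers the same two logical projections Qwen stores separately, which is why the walker can treat the fused tensor as one layer.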
mlx-optiq's MoE walker is arch-aware: it identifies the layout, walks the experts, and treats the fused expert tensor as a single layer for sensitivity purposes. Per-expert bit-widths come out of the same knapsack as everything else. The result: 26B-A4B at 14.9 GB on disk runs faster than the dense 31B variant because only 4 B of weights actually multiply per token.
## One known caveat: KV-quant serving
Gemma-4 uses shared-KV attention (multiple Q heads share one KV head). The current mlx-lm KV-cache implementation doesn't yet support mixed-precision quantization on this layout. `optiq kv-cache` exits cleanly with an explanation when run on Gemma-4. Standard fp16 KV serving (stock `mlx_lm.server`, or `optiq serve` without `--kv-config`) works fine; only the quantized KV path is affected.
If you need quantized KV at long context, use Qwen3.5 / 3.6. The Gemma family is best for outright quality at moderate context lengths.
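A serving layer can detect the shared-KV layout from the model config before choosing a cache path. A minimal sketch, assuming Hugging Face-style config keys (`num_attention_heads`, `num_key_value_heads`); the keys mlx-lm actually reads may differ, and the numbers below are made up:

```python
def kv_quant_supported(config):
    """Shared-KV (grouped-query) attention has fewer KV heads than Q heads.
    Hypothetical check; key names mirror common HF configs."""
    n_q = config["num_attention_heads"]
    n_kv = config.get("num_key_value_heads", n_q)  # absent key -> full multi-head
    return n_kv == n_q  # only full multi-head KV takes the quantized path here

cfg = {"num_attention_heads": 32, "num_key_value_heads": 8}  # invented numbers
use_quant_kv = kv_quant_supported(cfg)  # shared-KV -> fall back to fp16 KV
```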
## Get them
```python
from mlx_lm import load, generate

model, tok = load("mlx-community/gemma-4-e4b-it-OptiQ-4bit")
# 6.0 GB on disk · 55.5 % GSM8K · runs on a 16 GB MacBook Air
```
Full sampling defaults, training recipes and the shared-KV caveat are in the Gemma-4 family guide.
— the mlx-optiq team