# Gemma-4 lands on mlx-optiq
Four new pre-built quants on Hugging Face this week: Google's full Gemma-4 instruct lineup, from the compact 4 GB e2b through the 18 GB dense 31B. The headline result is on the small one: gemma-4-e4b at uniform 4-bit collapses to 23.5 % on GSM8K. Mixed precision recovers it to 55.5 %, a 32-point jump at the same 6 GB on disk.
The rule of thumb: the further uniform 4-bit knocks a model out of usable territory, the more mixed precision recovers.
## The lineup
| Model | Size | GSM8K | vs uniform-4 |
|---|---|---|---|
| gemma-4-e2b-it | 4.0 GB | 13.0 % | +7.5 pp |
| gemma-4-e4b-it | 6.0 GB | 55.5 % | +32.0 pp |
| gemma-4-26B-A4B-it | 14.9 GB | 94.0 % | +2.0 pp |
| gemma-4-31B-it | 18.1 GB | 96.0 % | 0.0 pp |
## Why e4b recovers so dramatically
The pattern across all 10 quants we ship is consistent: the bigger the gap between bf16 quality and uniform-4 quality, the bigger mlx-optiq's win. Saturated benchmarks (gemma-4-31B already at 96 %) leave nothing to recover. Models that uniform-4 nearly breaks (e4b at 23.5 %) have all the room to grow.
e4b is right at the edge — it's a 4 B model with strong reasoning capability that gets lobotomized by quantization noise on a few sensitive layers. The per-layer KL pass identifies those layers (lm_head, the first attention block, layer 0's KV-projection sibling, the last few transformer blocks) and protects them at 8-bit. Everything else stays 4-bit. Net result: same 6 GB on disk, 2.4× the GSM8K accuracy.
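The protect-the-sensitive-layers idea can be sketched in a few lines. This is a toy illustration, not mlx-optiq's actual API: the layer names and per-layer KL numbers below are invented, and the real pipeline measures KL against the bf16 model rather than hard-coding it.

```python
# Toy sketch: rank layers by per-layer KL divergence measured at uniform 4-bit,
# then promote the most sensitive ones to 8-bit. Names and numbers are invented.

def assign_bits(kl_by_layer, n_protect):
    """Give the n_protect most KL-sensitive layers 8-bit, the rest 4-bit."""
    ranked = sorted(kl_by_layer, key=kl_by_layer.get, reverse=True)
    protected = set(ranked[:n_protect])
    return {name: 8 if name in protected else 4 for name in kl_by_layer}

kl = {
    "lm_head": 0.41,          # hypothetical per-layer KL at uniform 4-bit
    "layers.0.attn": 0.33,
    "layers.0.kv_proj": 0.29,
    "layers.34.mlp": 0.18,
    "layers.17.mlp": 0.02,
}
bits = assign_bits(kl, n_protect=3)
```

With these toy numbers, `lm_head`, `layers.0.attn`, and `layers.0.kv_proj` land at 8-bit and everything else stays 4-bit, mirroring the protected-layer list described above.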
## The 26B-A4B sparse MoE
Gemma-4-26B-A4B is the family's mixture-of-experts: 26 B total parameters, 4 B active per token. Unlike Qwen's MoE layout, Gemma uses `switch_glu` with a fused gate-and-up projection, so each expert is a pair of fused tensors rather than the three-tensor split Qwen uses.
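To make the two layouts concrete, here is a toy comparison. The shapes and the tensor names (`gate_up_proj`, `gate_proj`, etc.) are assumptions for illustration; the real checkpoints' dimensions and key names differ.

```python
# Toy illustration of the fused switch_glu expert layout vs a three-tensor split.
# Shapes and tensor names are invented for the example.
d_model, d_ff = 8, 16

# Qwen-style expert: three separate tensors per expert.
qwen_expert = {
    "gate_proj": (d_ff, d_model),
    "up_proj": (d_ff, d_model),
    "down_proj": (d_model, d_ff),
}

# Gemma-style switch_glu expert: gate and up fused into one tensor.
gemma_expert = {
    "gate_up_proj": (2 * d_ff, d_model),
    "down_proj": (d_model, d_ff),
}

def split_gate_up(fused_shape):
    """Recover the two logical projections from the fused tensor's shape."""
    rows, cols = fused_shape
    return (rows // 2, cols), (rows // 2, cols)

gate_shape, up_shape = split_gate_up(gemma_expert["gate_up_proj"])
```

Splitting the fused tensor halfway along its output dimension recovers the same two logical projections Qwen stores separately, which is why the walker can treat the fused tensor as one layer.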
mlx-optiq's MoE walker is arch-aware: it identifies the layout, walks the experts, and treats the fused expert tensor as a single layer for sensitivity purposes. Per-expert bit-widths come out of the same knapsack as everything else. The result: 26B-A4B at 14.9 GB on disk runs faster than the dense 31B variant because only 4 B of weights actually multiply per token.
## One known caveat: KV-quant serving
Gemma-4 uses shared-KV attention (multiple Q heads share one KV head). The current mlx-lm KV-cache implementation doesn't yet support mixed-precision quantization on this layout. `optiq kv-cache` exits cleanly with an explanation when run on Gemma-4. Standard fp16 KV serving (stock `mlx_lm.server`, or `optiq serve` without `--kv-config`) works fine; only the quantized KV path is affected.
If you need quantized KV at long context, use Qwen3.5 / 3.6. The Gemma family is best for outright quality at moderate context lengths.
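A serving layer can detect the shared-KV layout from the model config before choosing a cache path. A minimal sketch, assuming Hugging Face-style config keys (`num_attention_heads`, `num_key_value_heads`); the keys mlx-lm actually reads may differ, and the numbers below are made up:

```python
def kv_quant_supported(config):
    """Shared-KV (grouped-query) attention has fewer KV heads than Q heads.
    Hypothetical check; key names mirror common HF configs."""
    n_q = config["num_attention_heads"]
    n_kv = config.get("num_key_value_heads", n_q)  # absent key -> full multi-head
    return n_kv == n_q  # only full multi-head KV takes the quantized path here

cfg = {"num_attention_heads": 32, "num_key_value_heads": 8}  # invented numbers
use_quant_kv = kv_quant_supported(cfg)  # shared-KV -> fall back to fp16 KV
```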
## Get them
```python
from mlx_lm import load, generate

model, tok = load("mlx-community/gemma-4-e4b-it-OptiQ-4bit")
# 6.0 GB on disk · 55.5 % GSM8K · runs on a 16 GB MacBook Air
```
Full sampling defaults, training recipes and the shared-KV caveat are in the Gemma-4 family guide.
— the mlx-optiq team