Gemma-4 lands on mlx-optiq.
We added four mlx-optiq Gemma-4 instruct quants to Hugging Face, recovering up to 32 percentage points of GSM8K accuracy on the small model through mixed-precision quantization. The headline: gemma-4-e4b at uniform 4-bit collapses to 23.5% on GSM8K; mlx-optiq mixed-precision recovers it to 55.5%, a +32-point jump at the same 6 GB on disk. The full lineup spans the compact 4 GB e2b through the 18 GB 31B-dense.
The further uniform-4 knocks the model out of usable territory, the more mixed-precision recovers.
The lineup
| Model | Size | GSM8K | vs uniform-4 |
|---|---|---|---|
| gemma-4-e2b-it | 4.0 GB | 13.0 % | +7.5 pp |
| gemma-4-e4b-it | 6.0 GB | 55.5 % | +32.0 pp |
| gemma-4-26B-A4B-it | 14.9 GB | 94.0 % | +2.0 pp |
| gemma-4-31B-it | 18.1 GB | 96.0 % | 0.0 pp |
Why e4b recovers so dramatically
The pattern across all 10 quants we ship is consistent: the bigger the gap between bf16 quality and uniform-4 quality, the bigger mlx-optiq's win. Saturated benchmarks (gemma-4-31B already at 96 %) leave nothing to recover. Models where uniform-4 nearly breaks them (e4b at 23.5 %) have all the room to grow.
e4b is right at the edge. It's a 4 B model with strong reasoning capability that gets lobotomized by quantization noise on a few sensitive layers. The per-layer KL pass identifies those layers (lm_head, the first attention block, layer 0's KV-projection sibling, the last few transformer blocks) and protects them at 8-bit. Everything else stays 4-bit. Net result: same 6 GB on disk, 2.4× the GSM8K accuracy.
The 26B-A4B sparse MoE
Gemma-4-26B-A4B is the family's mixture-of-experts: 26 B total parameters, 4 B active per token. Different from the Qwen MoE layout, Gemma uses switch_glu with a fused gate-and-up projection, so each expert is a pair of fused tensors rather than the three-tensor split Qwen uses.
mlx-optiq's MoE walker is arch-aware: it identifies the layout, walks the experts, and treats the fused expert tensor as a single layer for sensitivity purposes. Per-expert bit-widths come out of the same knapsack as everything else. The result: 26 B-A4B at 14.9 GB on disk runs faster than the dense 27 B variants because only 4 B of weights actually multiply per token.
KV-quant serving on Gemma-4 (v0.1.3+)
Earlier OptIQ versions failed on Gemma-4's KV-quant path because upstream mlx-lm raises NotImplementedError: RotatingKVCache Quantization NYI on the sliding-window cache used by Gemma's SWA layers. v0.1.3 ships optiq.runtime.kv.RotatingQuantizedKVCache as a drop-in subclass with quantized (packed, scales, biases) storage and the rotating-buffer mechanics preserved, plus a small SDPA dispatch patch for Gemma's KV-sharing layers (where one layer's quantized K/V tuples flow into a downstream layer whose own cache is None). The patch installs automatically when --kv-bits or --kv-config is set.
Each Gemma-4 OptIQ-4bit repo on Hugging Face now ships a recommended kv_config.json from a real sensitivity-analysis pass — drop-in via optiq serve --kv-config kv_config.json.
Get them
from mlx_lm import load, generate model, tok = load("mlx-community/gemma-4-e4b-it-OptiQ-4bit") # 6.0 GB on disk · 55.5 % GSM8K · runs on a 16 GB MacBook Air
Full sampling defaults, training recipes and the shared-KV caveat are in the Gemma-4 family guide.
— the mlx-optiq team