Gemma-4 on Apple Silicon
Google's Gemma-4 instruct series spans five sizes: two compact dense models (e2b, e4b), a dense 12 B (the unified text+vision model), one sparse mixture-of-experts (26 B-A4B), and one frontier-class dense model (31 B). All five ship as mlx-optiq 4-bit quants on Hugging Face, and all five take image input. They share the Gemma chat template and a distinctive shared-KV attention design.
With the optiq_vision sidecar, optiq serve and the Lab answer image+text prompts without an mlx-vlm dependency. See vision (image input).
Available quants
| Model | Size | GSM8K | vs uniform-4 | Best for |
|---|---|---|---|---|
| gemma-4-e2b-it-OptiQ-4bit | 4.0 GB | 13.0% | +7.5pp | Compact daily-driver |
| gemma-4-e4b-it-OptiQ-4bit | 6.0 GB | 55.5% | +32.0pp | Best small-model recovery |
| gemma-4-12B-it-OptiQ-4bit | 8.3 GB | 93.4% | +3.3pp | Unified text+vision, image input |
| gemma-4-26B-A4B-it-OptiQ-4bit | 14.9 GB | 94.0% | +2.0pp | Sparse MoE, 4 B active |
| gemma-4-31B-it-OptiQ-4bit | 18.1 GB | 96.0% | 0.0pp | Strongest dense quant |
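As a rough way to read the table, the sketch below picks the largest quant whose listed size fits a memory budget. The sizes are copied from the Size column; the helper name `largest_fitting_quant` and the 4 GB `headroom_gb` default (space left for KV cache, activations, and the OS) are illustrative assumptions, not part of mlx-optiq.

```python
# Sizes in GB, copied from the "Size" column of the table above.
QUANT_SIZES_GB = {
    "gemma-4-e2b-it-OptiQ-4bit": 4.0,
    "gemma-4-e4b-it-OptiQ-4bit": 6.0,
    "gemma-4-12B-it-OptiQ-4bit": 8.3,
    "gemma-4-26B-A4B-it-OptiQ-4bit": 14.9,
    "gemma-4-31B-it-OptiQ-4bit": 18.1,
}


def largest_fitting_quant(budget_gb, headroom_gb=4.0):
    """Return the largest quant that fits in budget_gb minus headroom_gb, or None.

    headroom_gb is a hypothetical safety margin, not a measured figure.
    """
    usable = budget_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= usable}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


print(largest_fitting_quant(16))  # gemma-4-12B-it-OptiQ-4bit
```

Size is only one axis: the "Best for" column still matters, for example when the 26B-A4B MoE is preferred for its 4 B active parameters.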
Hello world
from mlx_lm import load, generate

model, tok = load("mlx-community/gemma-4-31B-it-OptiQ-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What's the difference between TF-IDF and BM25?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=400))
Recommended sampling
Gemma-4-it variants prefer a slightly higher temperature than Qwen3.x:
from mlx_lm.sample_utils import make_sampler

# Default chat (Google's recommended)
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=64)

# Reasoning / math: slightly tighter
sampler = make_sampler(temp=0.7, top_p=0.9)
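To see what "slightly tighter" means in practice, here is a minimal pure-Python sketch of temperature, top-k, and top-p filtering over a toy logit vector. It is not mlx-lm's implementation: the function name `filtered_probs` and the filter ordering (top-k first, then top-p on the original probabilities) are assumptions chosen for illustration.

```python
import math


def filtered_probs(logits, temp=1.0, top_p=1.0, top_k=0):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temp for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Nucleus filter: keep the shortest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
chat = filtered_probs(toy_logits, temp=1.0, top_p=0.95, top_k=64)
math_mode = filtered_probs(toy_logits, temp=0.7, top_p=0.9)
print(len(chat), len(math_mode))  # the tighter settings keep fewer candidates
```

On this toy distribution the chat settings keep three candidate tokens and the reasoning settings keep two, which is the narrowing effect the lower temperature and top_p are meant to have.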
Mixed-precision KV cache (v0.1.3+)
RotatingKVCache for sliding-window attention (SWA) layers. Upstream mlx-lm raises NotImplementedError: RotatingKVCache Quantization NYI on the rotating path, which blocked mixed-precision KV on all Gemma-4 sizes through v0.1.2. v0.1.3 ships optiq.runtime.kv.RotatingQuantizedKVCache, a drop-in rotating cache that stores quantized (packed, scales, biases) triples, plus a small SDPA dispatch patch that handles Gemma-4's KV-sharing edge case. The patch is auto-installed by optiq serve and optiq kv-cache whenever quantized KV is requested.
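The idea of a rotating cache with (packed, scales, biases) storage can be sketched in plain Python. This is a conceptual toy, not optiq.runtime.kv.RotatingQuantizedKVCache: the class name `RotatingQuantizedWindow`, the 4-bit width, the one-group-per-vector layout, and the nibble packing order are all assumptions made for illustration.

```python
from collections import deque

BITS = 4
LEVELS = (1 << BITS) - 1  # 15 steps between the group's min and max


def quantize_group(values):
    """Affine-quantize an even-length group so that x ~= scale * q + bias."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    # Pack two 4-bit codes per byte, low nibble first.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_group(packed, scale, bias):
    """Unpack the nibbles and map each code back to a float."""
    codes = []
    for byte in packed:
        codes += [byte & 0xF, byte >> 4]
    return [scale * q + bias for q in codes]


class RotatingQuantizedWindow:
    """Hypothetical toy: keeps the latest `window` vectors in quantized form."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest entry once full.
        self.entries = deque(maxlen=window)

    def append(self, vector):
        self.entries.append(quantize_group(vector))

    def read(self):
        return [dequantize_group(*entry) for entry in self.entries]


cache = RotatingQuantizedWindow(window=2)
for vec in ([0.0, 1.5, 3.0, -1.5], [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]):
    cache.append(vec)
restored = cache.read()
print(len(restored), restored[-1])
```

Only the two most recent vectors survive, and each is reconstructed to within half a quantization step, which is the trade a sliding-window layer makes when its cache is quantized.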
Each Gemma-4 OptiQ-4bit repo on Hugging Face now bundles a recommended kv_config.json produced by a real sensitivity-analysis pass. Pass it via --kv-config:
$ optiq serve \
--model mlx-community/gemma-4-31B-it-OptiQ-4bit \
--kv-config kv_config.json
The 26B-A4B MoE
Gemma-4-26B-A4B is a sparse mixture-of-experts: 26 B total parameters, 4 B active per token. Unlike Qwen MoE, Gemma uses switch_glu with a fused gate-and-up projection. mlx-optiq's MoE walker handles both layouts. For allocation purposes, the per-expert sensitivities roll up into a single switch_glu tensor.
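The difference between a fused and a separate gate/up layout can be shown with a tiny pure-Python GLU. This is a sketch under stated assumptions: the fused weight is taken to stack the gate rows above the up rows, and a tanh-approximated GELU stands in for the activation. Neither detail is confirmed for Gemma-4 by this page, and the function names are hypothetical.

```python
import math


def gelu(x):
    """Tanh approximation of GELU (assumed activation for this sketch)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def fused_glu_hidden(w_gate_up, x, act=gelu):
    """One projection, then split: first half is the gate, second half is up."""
    fused = matvec(w_gate_up, x)
    half = len(fused) // 2
    gate, up = fused[:half], fused[half:]
    return [act(g) * u for g, u in zip(gate, up)]


def separate_glu_hidden(w_gate, w_up, x, act=gelu):
    """Two projections, one per branch."""
    return [act(g) * u for g, u in zip(matvec(w_gate, x), matvec(w_up, x))]


w_gate = [[0.5, -1.0], [1.5, 0.25]]
w_up = [[1.0, 2.0], [-0.5, 0.75]]
x = [0.8, -0.3]

fused_out = fused_glu_hidden(w_gate + w_up, x)  # stack gate rows above up rows
separate_out = separate_glu_hidden(w_gate, w_up, x)
print(fused_out == separate_out)
```

Because the two layouts compute the same values, a walker that understands the row split can treat the fused tensor as one unit for sensitivity and bit allocation.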
Long-context
fp16 KV is still the simplest path; mixed-precision KV (above) gives faster decode at long contexts. The stock fp16 invocation:
# Stock fp16 KV serving via mlx-optiq (no --kv-config)
$ optiq serve --model mlx-community/gemma-4-31B-it-OptiQ-4bit \
--max-tokens 8192 --temp 1.0 --top-p 0.95
Fine-tuning recipes
Empirical training-ceiling map at iogpu.wired_limit_mb=0 on a 36 GB Mac:
| Model | Max seq len | Peak mem | Tokens/sec |
|---|---|---|---|
| gemma-4-e2b-it | 2,400 | 22 GB | ~28 |
| gemma-4-e4b-it | 1,800 | 24 GB | ~22 |
| gemma-4-26B-A4B-it | 512 | 27.6 GB | 22.2 |
| gemma-4-31B-it | 32 | 21.4 GB | 30.9 |
The 31B dense model is unusually memory-tight at long context because of its larger embedding+vocab footprint. The 26B-A4B MoE allows a longer sequence at the same RAM budget: sparse activation pays off here.
$ optiq lora train mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit \
--data ./my_data \
--max-seq-length 512 \
--rank 8 --rank-scaling by_bits \
--num-layers 16 --iters 2000 \
-o ./gemma_adapter
Next: read about how sensitivity works or the LoRA fine-tuning guide.