Gemma-4 on Apple Silicon
Google's Gemma-4 instruct series spans four sizes: two compact dense models (e2b, e4b), one sparse mixture-of-experts (26B-A4B), and one frontier-class dense model (31B). All four ship as mlx-optiq 4-bit quants on Hugging Face. They share the Gemma chat template and a distinctive shared-KV attention design.
Available quants
| Model | Size | GSM8K | Δ vs uniform 4-bit | Best for |
|---|---|---|---|---|
| gemma-4-e2b-it-OptiQ-4bit | 4.0 GB | 13.0% | +7.5pp | Compact daily-driver |
| gemma-4-e4b-it-OptiQ-4bit | 6.0 GB | 55.5% | +32.0pp | Best small-model recovery |
| gemma-4-26B-A4B-it-OptiQ-4bit | 14.9 GB | 94.0% | +2.0pp | Sparse MoE, 4 B active |
| gemma-4-31B-it-OptiQ-4bit | 18.1 GB | 96.0% | 0.0pp | Strongest dense quant |
Hello world
```python
from mlx_lm import load, generate

model, tok = load("mlx-community/gemma-4-31B-it-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What's the difference between TF-IDF and BM25?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=400))
```
Recommended sampling
Gemma-4-it variants prefer slightly higher temperature than Qwen3.x:
```python
from mlx_lm.sample_utils import make_sampler

# Default chat (Google's recommended settings)
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=64)

# Reasoning / math: slightly tighter
sampler = make_sampler(temp=0.7, top_p=0.9)
```
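To apply a sampler, pass it to generate. A minimal sketch, assuming a recent mlx_lm where generate() accepts a sampler keyword; the model and prompt here are just examples:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Assumes a recent mlx_lm where generate() accepts a sampler object.
model, tok = load("mlx-community/gemma-4-e4b-it-OptiQ-4bit")
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=64)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain BM25 in two sentences."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, sampler=sampler, max_tokens=300))
```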
The shared-KV caveat
mlx-lm's KV-cache implementation doesn't yet support mixed-precision quantization on this layout. optiq kv-cache will exit cleanly with an explanation when run on Gemma-4. If you specifically need quantized KV at long context, use Qwen3.5 / 3.6 instead. Standard fp16 KV serving (stock mlx_lm.server, or optiq serve without --kv-config) works fine.
The 26B-A4B MoE
Gemma-4-26B-A4B is a sparse mixture-of-experts: 26 B parameters total, 4 B active per token. Unlike Qwen's MoE, Gemma uses switch_glu with a fused gate-and-up projection. mlx-optiq's MoE walker handles both layouts; the per-expert sensitivity rolls up into a single switch_glu tensor for allocation purposes.
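For intuition, here is a minimal MLX sketch of a fused gate-and-up block in the switch_glu style. The class and dimension names are illustrative, not the actual Gemma-4 expert module: one linear layer emits both halves, which are split and combined as silu(gate) * up.

```python
import mlx.core as mx
import mlx.nn as nn

class FusedGateUpMLP(nn.Module):
    """Illustrative fused gate-and-up projection (not the real Gemma-4 module).

    A single weight produces both the gate and up halves in one tensor;
    the result is split and combined as silu(gate) * up before down-projecting.
    """

    def __init__(self, dims: int, hidden: int):
        super().__init__()
        self.gate_up_proj = nn.Linear(dims, 2 * hidden, bias=False)  # fused weight
        self.down_proj = nn.Linear(hidden, dims, bias=False)

    def __call__(self, x: mx.array) -> mx.array:
        gate, up = mx.split(self.gate_up_proj(x), 2, axis=-1)
        return self.down_proj(nn.silu(gate) * up)
```

Because the gate and up weights live in one tensor, a sensitivity-driven allocator naturally treats them as a single unit, which matches the roll-up behaviour described above.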
Long-context
Use stock fp16 KV serving (mixed-precision KV is not yet supported on Gemma-4; see the caveat above):
```bash
# Stock fp16 KV serving via mlx-optiq (no --kv-config)
$ optiq serve --model mlx-community/gemma-4-31B-it-OptiQ-4bit \
    --max-tokens 8192 --temp 1.0 --top-p 0.95
```
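Once the server is up you can query it from Python. A minimal sketch, assuming optiq serve mirrors stock mlx_lm.server's OpenAI-compatible /v1/chat/completions endpoint; the host and port below are placeholders, so use whatever the server prints at startup:

```python
import requests

# Placeholder host/port; check what the server prints at startup.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Summarize the following report: ..."}
        ],
        "max_tokens": 2048,
        "temperature": 1.0,
        "top_p": 0.95,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```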
Fine-tuning recipes
Empirical training-ceiling map at iogpu.wired_limit_mb=0 on a 36 GB Mac:
| Model | Max seq len | Peak mem | Tokens/sec |
|---|---|---|---|
| gemma-4-e2b-it | 2,400 | 22 GB | ~28 |
| gemma-4-e4b-it | 1,800 | 24 GB | ~22 |
| gemma-4-26B-A4B-it | 512 | 27.6 GB | 22.2 |
| gemma-4-31B-it | 32 | 21.4 GB | 30.9 |
The 31B-dense is unusually memory-tight at long context due to its larger embedding+vocab footprint. The 26B-A4B MoE actually allows a longer sequence at the same RAM budget — sparse activation pays off here.
```bash
# LoRA fine-tune the 26B-A4B MoE at the 512-token ceiling from the table above
$ optiq lora train mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit \
    --data ./my_data \
    --max-seq-length 512 \
    --rank 8 --rank-scaling by_bits \
    --num-layers 16 --iters 2000 \
    -o ./gemma_adapter
```
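After training finishes, you can sanity-check the adapter on top of the base quant. A minimal sketch, assuming the output directory contains adapter weights in the format mlx_lm's load(..., adapter_path=...) expects:

```python
from mlx_lm import load, generate

# Assumes ./gemma_adapter holds mlx_lm-compatible adapter weights.
model, tok = load(
    "mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
    adapter_path="./gemma_adapter",
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "A prompt drawn from your training data"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=200))
```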
Next: read about how sensitivity works or the LoRA fine-tuning guide.