Gemma-4 on Apple Silicon

Google's Gemma-4 instruct series spans four sizes: two compact dense models (e2b, e4b), one sparse mixture-of-experts model (26B-A4B), and one frontier-class dense model (31B). All four ship with mlx-optiq 4-bit quants on Hugging Face. They share the Gemma chat template and a distinctive shared-KV attention design.

Available quants

Model	Size on disk	GSM8K accuracy	Δ vs uniform 4-bit	Best for
gemma-4-e2b-it-OptiQ-4bit	4.0 GB	13.0%	+7.5pp	Compact daily-driver
gemma-4-e4b-it-OptiQ-4bit	6.0 GB	55.5%	+32.0pp	Best small-model recovery
gemma-4-26B-A4B-it-OptiQ-4bit	14.9 GB	94.0%	+2.0pp	Sparse MoE, 4B active
gemma-4-31B-it-OptiQ-4bit	18.1 GB	96.0%	0.0pp	Strongest dense quant

Headline result: at uniform 4-bit, gemma-4-e4b-it collapses to 23.5% on GSM8K. mlx-optiq recovers it to 55.5%, a 32-point gain at the same 6 GB on disk. This is one of the cleanest examples of how much accuracy uniform-bit quantization can leave on the table.
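The delta column can be read back into the implied uniform 4-bit baseline by subtracting the reported gain from the OptiQ score. A minimal sketch using only the figures in the table above; the dictionary and function names are illustrative and not part of mlx-optiq:

```python
# (GSM8K accuracy in %, gain vs uniform 4-bit in percentage points),
# copied from the "Available quants" table.
RESULTS = {
    "gemma-4-e2b-it": (13.0, 7.5),
    "gemma-4-e4b-it": (55.5, 32.0),
    "gemma-4-26B-A4B-it": (94.0, 2.0),
    "gemma-4-31B-it": (96.0, 0.0),
}


def uniform_baseline(model: str) -> float:
    """Implied uniform 4-bit GSM8K score: OptiQ score minus the reported gain."""
    score, delta = RESULTS[model]
    return round(score - delta, 1)


for name in RESULTS:
    print(f"{name}: implied uniform 4-bit score = {uniform_baseline(name):.1f}%")
```

For gemma-4-e4b-it this gives 23.5%, which matches the headline figure.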

Hello world

hello.py (python)

from mlx_lm import load, generate

model, tok = load("mlx-community/gemma-4-31B-it-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What's the difference between TF-IDF and BM25?"}],
    tokenize=False, add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=400))

Recommended sampling

Gemma-4-it variants prefer a slightly higher temperature than Qwen3.x:

sampling.py (python)

from mlx_lm.sample_utils import make_sampler

# Default chat (Google's recommended settings)
chat_sampler = make_sampler(temp=1.0, top_p=0.95, top_k=64)

# Reasoning / math: slightly tighter
reasoning_sampler = make_sampler(temp=0.7, top_p=0.9)
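A sampler only takes effect once it is passed to generation. The sketch below keeps the two presets above in a plain dictionary and hands the selected one to generate through its sampler keyword. It assumes a recent mlx-lm release in which generate accepts sampler; the preset names and both helper functions are hypothetical, not part of mlx-lm or mlx-optiq:

```python
# Sampling presets copied from the section above; the keys are illustrative names.
SAMPLING_PRESETS = {
    "chat": {"temp": 1.0, "top_p": 0.95, "top_k": 64},
    "reasoning": {"temp": 0.7, "top_p": 0.9},
}


def sampler_kwargs(task: str = "chat") -> dict:
    """Return a copy of the make_sampler keyword arguments for a task."""
    if task not in SAMPLING_PRESETS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(SAMPLING_PRESETS)}")
    return dict(SAMPLING_PRESETS[task])


def generate_with_preset(model, tok, prompt, task="chat", max_tokens=400):
    """Generate text with the preset for `task` (assumes mlx-lm is installed)."""
    # Imported lazily so the presets can be inspected without mlx-lm.
    from mlx_lm import generate
    from mlx_lm.sample_utils import make_sampler

    sampler = make_sampler(**sampler_kwargs(task))
    return generate(model, tok, prompt=prompt, max_tokens=max_tokens, sampler=sampler)
```

With the model, tok, and prompt from the hello-world example, generate_with_preset(model, tok, prompt, task="reasoning") would apply the tighter settings.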

The shared-KV caveat

Known limitation: Gemma-4 uses shared-KV attention (multiple Q heads share one KV head). The current mlx-lm KV-cache implementation doesn't yet support mixed-precision quantization on this layout, so optiq kv-cache exits cleanly with an explanation when run on Gemma-4. If you specifically need quantized KV at long context, use Qwen3.5 / 3.6. Standard fp16 KV serving (stock mlx_lm.server, or optiq serve without --kv-config) works fine.

The 26B-A4B MoE

Gemma-4-26B-A4B is a sparse mixture-of-experts model: 26B total parameters, 4B active per token. Unlike Qwen's MoE layers, Gemma uses switch_glu with a fused gate-and-up projection. mlx-optiq's MoE walker handles both layouts: the per-expert sensitivity rolls up into a single switch_glu tensor for allocation purposes.
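To make "fused gate-and-up projection" concrete, here is a toy pure-Python sketch showing that one stacked weight matrix gives the same gated output as two separate projections. It is an illustration only: the stacking order (gate rows first, then up rows), the SiLU default, and every function name are assumptions, and the real layer also includes a down projection and expert routing that are omitted here.

```python
import math


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def silu(v):
    """SiLU activation; the real model's activation comes from its config."""
    return v / (1.0 + math.exp(-v))


def glu_separate(w_gate, w_up, x, act=silu):
    """Gated unit with two separate projections."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    return [act(g) * u for g, u in zip(gate, up)]


def glu_fused(w_gate_up, x, act=silu):
    """Gated unit with one stacked projection, split in half afterwards."""
    fused = matvec(w_gate_up, x)
    half = len(fused) // 2
    gate, up = fused[:half], fused[half:]
    return [act(g) * u for g, u in zip(gate, up)]


# Toy weights: 2 inputs, 2 hidden units.
w_gate = [[0.5, -1.0], [1.5, 0.25]]
w_up = [[1.0, 2.0], [-0.5, 0.75]]
x = [0.2, -0.4]

w_gate_up = w_gate + w_up  # stack gate rows on top of up rows
separate = glu_separate(w_gate, w_up, x)
fused = glu_fused(w_gate_up, x)
print(separate, fused)
```

Because the fused layout stores gate and up weights in one tensor, a quantizer sees a single tensor per layer rather than two, which is consistent with sensitivity being tracked at the switch_glu tensor level.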

Long-context

Use stock fp16 KV serving (mixed-precision KV is not yet supported on Gemma-4; see the caveat above):

serve.sh (bash)

# Stock fp16 KV serving via mlx-optiq (no --kv-config)
$ optiq serve --model mlx-community/gemma-4-31B-it-OptiQ-4bit \
    --max-tokens 8192 --temp 1.0 --top-p 0.95
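Since the KV cache stays at fp16 here, it helps to estimate how much memory it will take before committing to a long context. The sketch below uses the standard size formula for a plain per-layer cache (keys plus values, for every layer, KV head, and position). The layer count, KV-head count, and head dimension are hypothetical placeholders, not Gemma-4's real configuration, and layouts that share KV across heads or layers, or that use sliding-window attention, need less than this upper bound:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Upper-bound KV-cache size for a plain per-layer cache.

    The leading factor of 2 covers keys and values; bytes_per_elem=2 is fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration, for illustration only (not Gemma-4's real values).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128

for seq_len in (8192, 32768):
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, seq_len)
    print(f"{seq_len:>6} tokens: {size / 2**30:.2f} GiB")
```

The cache grows linearly with context length, so it has to fit alongside the quantized weights in unified memory.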

Fine-tuning recipes

Empirical training-ceiling map with iogpu.wired_limit_mb=0 on a 36 GB Mac:

Model	Max seq len	Peak mem	Tokens/sec
gemma-4-e2b-it	2,400	22 GB	~28
gemma-4-e4b-it	1,800	24 GB	~22
gemma-4-26B-A4B-it	512	27.6 GB	22.2
gemma-4-31B-it	32	21.4 GB	30.9

The 31B dense model is unusually memory-tight at long context due to its larger embedding+vocab footprint. The 26B-A4B MoE actually allows a longer sequence at the same RAM budget: sparse activation pays off here.

finetune.sh (bash)

$ optiq lora train mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit \
    --data ./my_data \
    --max-seq-length 512 \
    --rank 8 --rank-scaling by_bits \
    --num-layers 16 --iters 2000 \
    -o ./gemma_adapter
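Before training, it can be worth checking how many examples exceed the sequence ceiling for the chosen model. A minimal sketch, assuming you can supply an encode callable that maps text to a token list (for example tok.encode from the tokenizer returned by mlx_lm.load). The helper name and the whitespace stand-in tokenizer are hypothetical, and whether over-length examples are truncated or dropped depends on the trainer:

```python
# Max sequence lengths copied from the training-ceiling table above (36 GB Mac).
MAX_SEQ_LEN = {
    "gemma-4-e2b-it": 2400,
    "gemma-4-e4b-it": 1800,
    "gemma-4-26B-A4B-it": 512,
    "gemma-4-31B-it": 32,
}


def too_long(texts, encode, limit):
    """Return the indices of examples whose token count exceeds `limit`."""
    return [i for i, text in enumerate(texts) if len(encode(text)) > limit]


# Whitespace splitting stands in for a real tokenizer in this example.
examples = ["short example", "word " * 600]
over = too_long(examples, str.split, MAX_SEQ_LEN["gemma-4-26B-A4B-it"])
print(f"{len(over)} of {len(examples)} examples exceed the limit: {over}")
```

Real tokenizers usually produce more tokens than whitespace splitting, so counts from the actual Gemma tokenizer are the ones to trust.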

Next: read about how sensitivity works or the LoRA fine-tuning guide.