
Gemma-4 on Apple Silicon

Google's Gemma-4 instruct series spans four sizes: two compact dense models (e2b, e4b), one sparse mixture-of-experts (26B-A4B), and one frontier-class dense model (31B). All four ship with mlx-optiq 4-bit quants on Hugging Face. They share the Gemma chat template and a distinctive shared-KV attention design.

Available quants

Model                           Size      GSM8K   vs uniform-4   Best for
gemma-4-e2b-it-OptiQ-4bit       4.0 GB    13.0%   +7.5 pp        Compact daily-driver
gemma-4-e4b-it-OptiQ-4bit       6.0 GB    55.5%   +32.0 pp       Best small-model recovery
gemma-4-26B-A4B-it-OptiQ-4bit   14.9 GB   94.0%   +2.0 pp        Sparse MoE, 4 B active
gemma-4-31B-it-OptiQ-4bit       18.1 GB   96.0%   0.0 pp         Strongest dense quant
Headline result: gemma-4-e4b-it at uniform 4-bit collapses to 23.5% on GSM8K. mlx-optiq recovers it to 55.5%, a 32-point jump at the same 6 GB on disk. This is one of the cleanest examples of why uniform-bit quantization wastes potential.

Hello world

hello.py
from mlx_lm import load, generate

model, tok = load("mlx-community/gemma-4-31B-it-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What's the difference between TF-IDF and BM25?"}],
    tokenize=False, add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=400))

Recommended sampling

Gemma-4-it variants prefer a slightly higher temperature than Qwen3.x:

sampling.py
from mlx_lm.sample_utils import make_sampler

# Default chat (Google's recommended settings)
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=64)

# Reasoning / math — slightly tighter
sampler = make_sampler(temp=0.7, top_p=0.9)
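
To use one of these samplers, pass it to generate. A minimal sketch, assuming a recent mlx-lm release where generate accepts a sampler object (older releases took temp/top_p keyword arguments directly):

sample_and_generate.py
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/gemma-4-e4b-it-OptiQ-4bit")
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=64)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain beam search in two sentences."}],
    tokenize=False, add_generation_prompt=True,
)
# The sampler object carries temp/top_p/top_k into decoding.
print(generate(model, tok, prompt=prompt, max_tokens=200, sampler=sampler))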

The shared-KV caveat

Known limitation: Gemma-4 uses shared-KV attention (multiple Q heads share one KV head). The current mlx-lm KV-cache implementation doesn't yet support mixed-precision quantization on this layout. optiq kv-cache will exit cleanly with an explanation when run on Gemma-4. Use Qwen3.5 / 3.6 if you specifically need quantized KV at long context. Standard fp16 KV serving (stock mlx_lm.server or optiq serve without --kv-config) works fine.
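
For intuition, here is a toy sketch of the shared-KV layout in plain mlx.core (not mlx-optiq code; head counts and shapes are invented). Several query heads attend against the same cached K/V head, so there is no independent per-Q-head KV tensor for a mixed-precision cache to assign bits to:

shared_kv_sketch.py
import mlx.core as mx

# Toy shared-KV (grouped-query) attention: 16 Q heads share 4 KV heads,
# i.e. 4 query heads read from each cached K/V head.
n_q, n_kv, head_dim, seq = 16, 4, 128, 8
q = mx.random.normal((1, n_q, seq, head_dim))
k = mx.random.normal((1, n_kv, seq, head_dim))
v = mx.random.normal((1, n_kv, seq, head_dim))

# Each KV head is broadcast across its group of query heads; a per-head
# cache precision choice would touch the same underlying K/V tensor 4 times.
group = n_q // n_kv
k_rep = mx.repeat(k, group, axis=1)
v_rep = mx.repeat(v, group, axis=1)

scores = (q @ k_rep.transpose(0, 1, 3, 2)) / (head_dim ** 0.5)
out = mx.softmax(scores, axis=-1) @ v_rep
print(out.shape)  # (1, 16, 8, 128)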

The 26B-A4B MoE

Gemma-4-26B-A4B is a sparse mixture-of-experts model: 26 B parameters total, 4 B active per token. Unlike Qwen's MoE layout, Gemma uses switch_glu with a fused gate-and-up projection. mlx-optiq's MoE walker handles both layouts, and the per-expert sensitivity rolls up into a single switch_glu tensor for allocation purposes.
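
To make the fused layout concrete, here is a hypothetical expert-MLP sketch (not actual Gemma-4 or mlx-optiq code; the dimensions and gating activation are made up). The gate and up projections live in one weight, which is why sensitivity is scored per fused switch_glu tensor rather than per separate gate/up matrix:

switch_glu_sketch.py
import mlx.core as mx
import mlx.nn as nn

# Hypothetical fused gate-and-up expert MLP, switch_glu-style.
d_model, d_ff = 8, 32
w_gate_up = mx.random.normal((d_model, 2 * d_ff))  # one fused tensor for gate + up
w_down = mx.random.normal((d_ff, d_model))

def expert_mlp(x):
    gate_up = x @ w_gate_up                   # a single matmul produces both halves
    gate, up = mx.split(gate_up, 2, axis=-1)  # split back into gate and up
    return (nn.gelu(gate) * up) @ w_down      # gated activation, then down-projection

x = mx.random.normal((1, d_model))
print(expert_mlp(x).shape)  # (1, 8)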

Long-context

Use stock fp16 KV serving (mixed-precision KV is not yet supported on Gemma-4; see the caveat above):

serve.sh
# Stock fp16 KV serving via mlx-optiq (no --kv-config)
$ optiq serve --model mlx-community/gemma-4-31B-it-OptiQ-4bit \
    --max-tokens 8192 --temp 1.0 --top-p 0.95
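
Once the server is up, you can query it from Python. A minimal sketch, assuming optiq serve exposes the same OpenAI-compatible /v1/chat/completions endpoint on port 8080 as stock mlx_lm.server; adjust the URL if your setup differs:

client.py
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Summarize the trade-offs between dense and MoE models in three bullet points."}],
    "max_tokens": 8192,
    "temperature": 1.0,
    "top_p": 0.95,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # mlx_lm.server default; may differ for optiq serve
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])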

Fine-tuning recipes

Empirical training-ceiling map at iogpu.wired_limit_mb=0 on a 36 GB Mac:

Model                Max seq len   Peak mem   Tokens/sec
gemma-4-e2b-it       2,400         22 GB      ~28
gemma-4-e4b-it       1,800         24 GB      ~22
gemma-4-26B-A4B-it   512           27.6 GB    22.2
gemma-4-31B-it       32            21.4 GB    30.9

The 31B-dense is unusually memory-tight at long context due to its larger embedding+vocab footprint. The 26B-A4B MoE actually allows a longer sequence at the same RAM budget — sparse activation pays off here.

finetune.sh
$ optiq lora train mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit \
    --data ./my_data \
    --max-seq-length 512 \
    --rank 8 --rank-scaling by_bits \
    --num-layers 16 --iters 2000 \
    -o ./gemma_adapter
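
After training, you can sanity-check the adapter by loading it alongside the base quant. A minimal sketch, assuming the directory written by optiq lora train is a standard mlx-lm adapter (adapter_path is the stock mlx_lm.load argument; the path matches the -o flag above):

test_adapter.py
from mlx_lm import load, generate

# Load the 4-bit base model together with the freshly trained LoRA adapter.
model, tok = load(
    "mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
    adapter_path="./gemma_adapter",
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Give one example answer from the fine-tuning domain."}],
    tokenize=False, add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=200))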

Next: read about how sensitivity works, or see the LoRA fine-tuning guide.