Pre-built quants · Hugging Face

Twelve OptiQ-4bit quants. Ready to load.

Each model is a standard MLX checkpoint — load it with mlx_lm.load(...), no special runtime. Sensitivity-driven mixed-precision quantization recovers what uniform 4-bit drops, especially at the smaller end where every layer counts.

01 Qwen3.5 family

Qwen3.5: the daily-driver series.

From 0.8 B for prompt rewriters and toy agents up to 27 B for serious reasoning, plus a 35 B-A3B sparse MoE. All are quantized with the same mlx-optiq pass. On GSM8K (200-sample subset), four of the six match or beat uniform 4-bit; the 2 B and 27 B quants trail it by 0.5 pp and 2.5 pp.

GSM8K · 200 samples · 3-shot CoT

Model	Original size	mlx-optiq size	Compression	mlx-optiq vs uniform-4
Qwen3.5-0.8B-OptiQ-4bit	1,666 MB	570 MB	2.9×	27.0% vs 11.5% +15.5pp
Qwen3.5-2B-OptiQ-4bit	4,338 MB	1,365 MB	3.2×	48.0% vs 48.5% −0.5pp
Qwen3.5-4B-OptiQ-4bit	8,888 MB	2,811 MB	3.2×	81.5% vs 79.5% +2.0pp
Qwen3.5-9B-OptiQ-4bit	18,412 MB	5,763 MB	3.2×	90.0% vs 90.0% 0.0pp
Qwen3.5-27B-OptiQ-4bit	52,989 MB	15,710 MB	3.4×	87.5% vs 90.0% −2.5pp
Qwen3.5-35B-A3B-OptiQ-4bit MoE · 3 B active	68,597 MB	20,146 MB	3.4×	89.5% vs 89.5% 0.0pp
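
The Compression and delta columns are plain arithmetic on the other columns. A minimal sketch that recomputes them from the rows above; the helper names are illustrative and not part of mlx-optiq:

```python
# Rows copied from the Qwen3.5 table:
# (model, original MB, mlx-optiq MB, mlx-optiq GSM8K %, uniform-4 GSM8K %)
ROWS = [
    ("Qwen3.5-0.8B-OptiQ-4bit", 1666, 570, 27.0, 11.5),
    ("Qwen3.5-2B-OptiQ-4bit", 4338, 1365, 48.0, 48.5),
    ("Qwen3.5-4B-OptiQ-4bit", 8888, 2811, 81.5, 79.5),
    ("Qwen3.5-9B-OptiQ-4bit", 18412, 5763, 90.0, 90.0),
    ("Qwen3.5-27B-OptiQ-4bit", 52989, 15710, 87.5, 90.0),
    ("Qwen3.5-35B-A3B-OptiQ-4bit", 68597, 20146, 89.5, 89.5),
]


def compression(original_mb, quant_mb):
    """Size ratio between the original and the quantized checkpoint."""
    return original_mb / quant_mb


def delta_pp(optiq_pct, uniform_pct):
    """Accuracy difference in percentage points (positive favours mlx-optiq)."""
    return optiq_pct - uniform_pct


def one_sample_pp(n_samples=200):
    """How many percentage points a single question is worth."""
    return 100.0 / n_samples


for name, orig, quant, optiq, uniform in ROWS:
    print(f"{name}: {compression(orig, quant):.1f}x, {delta_pp(optiq, uniform):+.1f}pp")
```

On a 200-sample subset one question is worth 0.5 pp, so the 0.5 pp gap on the 2 B model is a single answer.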

Recommended starting point: for most users on a 36 GB Mac, the Qwen3.5-9B quant is the default. It reaches 90.0% GSM8K in 5,763 MB and runs at full 64 k context with mixed-precision KV. Drop to 4 B for laptops with less RAM; step up to 27 B if you have headroom.
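
One way to apply that advice is to pick the largest checkpoint whose weights fit a memory budget you choose. The sketch below is illustrative: `largest_fit` is a hypothetical helper, the sizes are the mlx-optiq column from the table, and the budget is an assumption you supply yourself. Checkpoint size is only a lower bound, since the KV cache and activations need memory on top of it.

```python
# (model, mlx-optiq checkpoint size in MB), copied from the Qwen3.5 table
QWEN35 = [
    ("Qwen3.5-0.8B-OptiQ-4bit", 570),
    ("Qwen3.5-2B-OptiQ-4bit", 1365),
    ("Qwen3.5-4B-OptiQ-4bit", 2811),
    ("Qwen3.5-9B-OptiQ-4bit", 5763),
    ("Qwen3.5-27B-OptiQ-4bit", 15710),
    ("Qwen3.5-35B-A3B-OptiQ-4bit", 20146),
]


def largest_fit(weight_budget_mb):
    """Return the largest (name, size_mb) whose weights fit the budget, else None."""
    fitting = [row for row in QWEN35 if row[1] <= weight_budget_mb]
    return max(fitting, key=lambda row: row[1]) if fitting else None


# Example: a budget of 8,000 MB reserved for weights (an arbitrary figure).
print(largest_fit(8_000))
```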

Qwen3.5 getting-started guide →

02 Qwen3.6 family

Qwen3.6 — frontier-class reasoning.

Qwen3.6 in two configurations: a dense 27 B and a 256-expert MoE with 3 B active per token. Both are quantized with the same mlx-optiq pass. On GSM8K the dense 27 B gains 1.0 pp over uniform 4-bit, while the MoE trails it by 2.0 pp.

GSM8K · 200 samples

Model	Original size	mlx-optiq size	Compression	mlx-optiq vs uniform-4
Qwen3.6-27B-OptiQ-4bit	52,989 MB	15,710 MB	3.4×	95.0% vs 94.0% +1.0pp
Qwen3.6-35B-A3B-OptiQ-4bit MoE · 256 experts · 3 B active	68,597 MB	20,146 MB	3.4×	89.5% vs 91.5% −2.0pp

Qwen3.6 getting-started guide →

03 Gemma-4 family

Gemma-4 — Google's instruct series.

Two small dense models (e2b, e4b) and two large ones (26 B-A4B sparse MoE, 31 B dense). The two smallest benefit most from mixed precision: gemma-4-e4b at uniform 4-bit drops to 23.5% GSM8K, and mlx-optiq recovers it to 55.5%.

GSM8K · 200 samples · with chat template

Model	Original size	mlx-optiq size	Compression	mlx-optiq vs uniform-4
gemma-4-e2b-it-OptiQ-4bit	9,772 MB	3,978 MB	2.5×	13.0% vs 5.5% +7.5pp
gemma-4-e4b-it-OptiQ-4bit	15,252 MB	6,028 MB	2.5×	55.5% vs 23.5% +32.0pp
gemma-4-26B-A4B-it-OptiQ-4bit MoE · 4 B active	48,584 MB	14,866 MB	3.3×	94.0% vs 92.0% +2.0pp
gemma-4-31B-it-OptiQ-4bit dense	59,648 MB	18,123 MB	3.3×	96.0% vs 96.0% 0.0pp

Heads up: Gemma-4 inference works fine with fp16 KV (stock mlx_lm.server, or optiq serve without --kv-config). The mixed-precision KV path currently fails on Gemma-4's shared-KV attention, an upstream mlx-lm limitation we're tracking. Use Qwen3.5 / 3.6 if you specifically need quantized KV.
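
For the stock-server path, a client only needs to post to the OpenAI-style chat endpoint that mlx_lm.server exposes. The sketch below assumes a server was started separately (for example `mlx_lm.server --model <repo>`) and is listening on 127.0.0.1:8080; the host, port, and helper names are assumptions for illustration.

```python
import json
import urllib.request


def build_request(user_text, host="127.0.0.1", port=8080, max_tokens=300):
    """Build a POST request for the server's /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_text):
    """Send one chat turn to a running server and return the reply text."""
    with urllib.request.urlopen(build_request(user_text)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With the server running:
#   print(ask("What is 17 * 23? Show your work."))
```

No KV-cache options appear in the request, so the server runs with whatever KV setup it was started with.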

Gemma-4 getting-started guide →

04 Loading any of them

One snippet. Any model on this page.

Every quant on this page follows the same load contract. Swap the repo name; the rest stays.

load_any.py · python

from mlx_lm import load, generate

# Pick any model on this page. Stock mlx-lm, no special loader needed.
model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-precision quantization in 3 sentences."}],
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(model, tok, prompt=prompt, max_tokens=300)
print(out)

Per-family notes: each model family has slightly different recommended sampling defaults and chat templates. See the Qwen3.5, Qwen3.6, and Gemma-4 getting-started guides.
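
Sampling settings can be passed through a sampler object. This is a minimal sketch, assuming a recent mlx-lm release in which `generate` accepts a `sampler` argument built by `mlx_lm.sample_utils.make_sampler`. The wrapper names are illustrative, and `temp` and `top_p` are left as required arguments because the recommended values live in the per-family guides, not here.

```python
def build_messages(user_text):
    """Wrap a single user turn in the chat-message format the template expects."""
    return [{"role": "user", "content": user_text}]


def generate_with_sampling(repo, user_text, temp, top_p, max_tokens=300):
    """Load a quant and generate with explicit sampling settings."""
    # Imported lazily so build_messages stays usable without MLX installed.
    from mlx_lm import load, generate
    from mlx_lm.sample_utils import make_sampler

    model, tok = load(repo)
    prompt = tok.apply_chat_template(
        build_messages(user_text),
        tokenize=False,
        add_generation_prompt=True,
    )
    sampler = make_sampler(temp=temp, top_p=top_p)
    return generate(model, tok, prompt=prompt, max_tokens=max_tokens, sampler=sampler)


# Example call; take temp and top_p from the relevant getting-started guide:
#   generate_with_sampling("mlx-community/Qwen3.6-27B-OptiQ-4bit", "Hello", temp=..., top_p=...)
```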