mlx-optiq
Pre-built quants · Hugging Face

Ten OptiQ-4bit quants. Ready to load.

Each model is a standard MLX checkpoint — load it with mlx_lm.load(...), no special runtime. Sensitivity-driven mixed-precision quantization recovers what uniform 4-bit drops, especially at the smaller end where every layer counts.
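
The exact mlx-optiq pass isn't reproduced here, but the idea is easy to sketch: score how much each layer suffers under plain 4-bit quantization, then spend extra bits only where it hurts. Below is a toy illustration with a made-up sensitivity proxy (4-bit round-trip error) and a hypothetical bit budget; it is not the actual mlx-optiq algorithm or API.

import numpy as np

# Toy sensitivity-driven bit allocation (illustrative only, not mlx-optiq itself).

def quantize_dequantize(w, bits, group_size=64):
    """Uniform per-group quantization round-trip, used only to score error."""
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / (2 ** bits - 1), 1e-12)
    return (np.round((g - lo) / scale) * scale + lo).reshape(w.shape)

def allocate_bits(layers, high_bits=6, low_bits=4, budget_frac=0.25):
    """Give high_bits to the most sensitive quarter of layers, low_bits to the rest."""
    sensitivity = {
        name: float(np.mean((w - quantize_dequantize(w, bits=low_bits)) ** 2))
        for name, w in layers.items()
    }
    cutoff = np.quantile(list(sensitivity.values()), 1.0 - budget_frac)
    return {name: (high_bits if s >= cutoff else low_bits) for name, s in sensitivity.items()}

# Fake "layers" stand in for real weight matrices.
rng = np.random.default_rng(0)
layers = {f"layers.{i}.mlp": rng.standard_normal((512, 256)).astype(np.float32) for i in range(8)}
print(allocate_bits(layers))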

01 Qwen3.5 family

Qwen3.5 — the daily-driver dense series.

From 0.8 B for prompt rewriters and toy agents up to 27 B for serious reasoning, plus a 35 B-A3B sparse MoE. All quantized with the same mlx-optiq pass; all at or near parity with uniform 4-bit on GSM8K (200-sample subset), with the biggest gains at the small end.

GSM8K · 200 samples · 3-shot CoT
Model | Original size | mlx-optiq size | Compression | mlx-optiq vs uniform-4
Qwen3.5-0.8B-OptiQ-4bit | 1,666 MB | 570 MB | 2.9× | 27.0% vs 11.5% (+15.5pp)
Qwen3.5-2B-OptiQ-4bit | 4,338 MB | 1,365 MB | 3.2× | 48.0% vs 48.5% (−0.5pp)
Qwen3.5-4B-OptiQ-4bit | 8,888 MB | 2,811 MB | 3.2× | 81.5% vs 79.5% (+2.0pp)
Qwen3.5-9B-OptiQ-4bit | 18,412 MB | 5,763 MB | 3.2× | 90.0% vs 90.0% (0.0pp)
Qwen3.5-27B-OptiQ-4bit | 52,989 MB | 15,710 MB | 3.4× | 87.5% vs 90.0% (−2.5pp)
Qwen3.5-35B-A3B-OptiQ-4bit (MoE · 3 B active) | 68,597 MB | 20,146 MB | 3.4× | 89.5% vs 89.5% (0.0pp)
Recommended starting point: for most users on a 36 GB Mac, the Qwen3.5-9B quant is the default. It has the strongest GSM8K-per-GB and runs at full 64 k context with mixed-precision KV. Drop to the 4 B on laptops with less RAM; step up to the 27 B if you have headroom.
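
Why quantized KV matters at that context length: a back-of-envelope sketch of KV-cache size versus KV precision. The layer, head, and dim numbers below are placeholders rather than Qwen3.5-9B's real config (read those from the model's config.json), and per-group scale overhead is ignored.

def kv_cache_mb(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    # One key and one value entry per layer, KV head, position and head dim.
    n_values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return n_values * bits_per_value / 8 / 1024 ** 2

for bits in (16, 8, 4):
    mb = kv_cache_mb(n_layers=40, n_kv_heads=8, head_dim=128,
                     context_len=64 * 1024, bits_per_value=bits)
    print(f"{bits:>2}-bit KV at 64k context ≈ {mb:,.0f} MB")
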
Qwen3.5 getting-started guide →

02 Qwen3.6 family

Qwen3.6 — frontier-class reasoning.

Qwen3.6 in two configurations: a dense 27 B and a 256-expert MoE with 3 B active per token. Both quantized with the same mlx-optiq pass; both at or near parity with uniform 4-bit on GSM8K.

GSM8K · 200 samples
Model | Original size | mlx-optiq size | Compression | mlx-optiq vs uniform-4
Qwen3.6-27B-OptiQ-4bit | 52,989 MB | 15,710 MB | 3.4× | 95.0% vs 94.0% (+1.0pp)
Qwen3.6-35B-A3B-OptiQ-4bit (MoE · 256 experts · 3 B active) | 68,597 MB | 20,146 MB | 3.4× | 89.5% vs 91.5% (−2.0pp)
Qwen3.6 getting-started guide →

03 Gemma-4 family

Gemma-4 — Google's instruct series.

Two small dense models (e2b, e4b) and two large ones: a 26 B-A4B sparse MoE and a 31 B dense. The two smaller models benefit dramatically from mixed-precision quantization: gemma-4-e4b at uniform 4-bit drops to 23.5% on GSM8K, and mlx-optiq recovers it to 55.5%.

GSM8K · 200 samples · with chat template
Model | Original size | mlx-optiq size | Compression | mlx-optiq vs uniform-4
gemma-4-e2b-it-OptiQ-4bit | 9,772 MB | 3,978 MB | 2.5× | 13.0% vs 5.5% (+7.5pp)
gemma-4-e4b-it-OptiQ-4bit | 15,252 MB | 6,028 MB | 2.5× | 55.5% vs 23.5% (+32.0pp)
gemma-4-26B-A4B-it-OptiQ-4bit (MoE · 4 B active) | 48,584 MB | 14,866 MB | 3.4× | 94.0% vs 92.0% (+2.0pp)
gemma-4-31B-it-OptiQ-4bit (dense) | 59,648 MB | 18,123 MB | 3.3× | 96.0% vs 96.0% (0.0pp)
Heads up: Gemma-4 inference works fine with fp16 KV (stock mlx_lm.server or optiq serve without --kv-config). The mixed-precision KV path currently fails on Gemma-4's shared-KV attention; this is an upstream mlx-lm limitation we're tracking. Use Qwen3.5 / 3.6 if you specifically need quantized KV.
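
For example, a Gemma-4 quant served with the stock server can be queried over its OpenAI-compatible route. A minimal sketch assuming mlx_lm.server's /v1/chat/completions endpoint; the model choice and port are arbitrary.

# Launch the stock server first (fp16 KV), for example:
#   python -m mlx_lm.server --model mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit --port 8080
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "One sentence on why MoE models are cheap to run."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
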
Gemma-4 getting-started guide →

04 Loading any of them

One snippet. Any model on this page.

All ten quants follow the same load contract. Swap the repo name; the rest stays.

load_any.py
from mlx_lm import load, generate

# Pick any of the 10. Stock mlx-lm, no special loader needed.
model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-precision quantization in 3 sentences."}],
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(model, tok, prompt=prompt, max_tokens=300)
print(out)
Per-family notes: Each model family has slightly different recommended sampling defaults and chat templates. See the Qwen3.5, Qwen3.6, and Gemma-4 getting-started guides.
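
To override sampling from Python, recent mlx-lm versions accept a sampler object built with mlx_lm.sample_utils.make_sampler. A minimal sketch; the temperature and top-p values here are placeholders, not the per-family recommendations from the guides.

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Summarize mixed-precision quantization in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Placeholder sampling values; use each family's recommended defaults instead.
sampler = make_sampler(temp=0.7, top_p=0.95)
print(generate(model, tok, prompt=prompt, sampler=sampler, max_tokens=200))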