Ten OptiQ-4bit quants. Ready to load.
Each model is a standard MLX checkpoint: load it with `mlx_lm.load(...)`, no special runtime. Sensitivity-driven mixed-precision quantization recovers what uniform 4-bit drops, especially at the smaller end where every layer counts.
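Curious what a sensitivity-driven pass looks like mechanically? Below is a minimal sketch using the `quant_predicate` hook that recent mlx-lm versions expose on `convert`: most layers stay at 4-bit, and layers flagged as fragile get 8-bit. The sensitivity scores, the 0.5 cutoff, and the input path are hypothetical placeholders for illustration, not the actual mlx-optiq recipe.

```python
# Minimal sketch of a sensitivity-driven mixed-precision pass on top of
# stock mlx-lm. The sensitivity table, the 0.5 cutoff, and the input path
# are hypothetical placeholders, not the actual mlx-optiq recipe.
from mlx_lm import convert

# Pretend these per-layer scores were measured offline (e.g. each layer's
# output error under uniform 4-bit); higher = more fragile.
sensitivity = {"model.layers.0.self_attn.v_proj": 0.91}  # hypothetical

def quant_predicate(path, module, config):
    # Skip modules that don't support quantization (norms, etc.).
    if not hasattr(module, "to_quantized"):
        return False
    # Fragile layers get 8-bit; everything else stays at plain 4-bit.
    bits = 8 if sensitivity.get(path, 0.0) > 0.5 else 4
    return {"group_size": 64, "bits": bits}

convert(
    "path/to/fp16-checkpoint",  # placeholder: any unquantized MLX model
    mlx_path="my-mixed-4bit",
    quantize=True,
    quant_predicate=quant_predicate,
)
```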
Qwen3.5 — the daily-driver dense series.
From 0.8 B for prompt rewriters and toy agents, up to a 27 B dense model for serious reasoning, plus a 35 B-A3B sparse MoE. All quantized with the same mlx-optiq pass; all at or near parity with uniform 4-bit on GSM8K (200-sample subset), and well above it at the smallest size. A sketch of a do-it-yourself spot check follows the table.
| Model | Original size | mlx-optiq size | Compression | GSM8K: mlx-optiq vs uniform-4 |
|---|---|---|---|---|
| Qwen3.5-0.8B-OptiQ-4bit | 1,666 MB | 570 MB | 2.9× | 27.0% vs 11.5% (+15.5 pp) |
| Qwen3.5-2B-OptiQ-4bit | 4,338 MB | 1,365 MB | 3.2× | 48.0% vs 48.5% (−0.5 pp) |
| Qwen3.5-4B-OptiQ-4bit | 8,888 MB | 2,811 MB | 3.2× | 81.5% vs 79.5% (+2.0 pp) |
| Qwen3.5-9B-OptiQ-4bit | 18,412 MB | 5,763 MB | 3.2× | 90.0% vs 90.0% (0.0 pp) |
| Qwen3.5-27B-OptiQ-4bit | 52,989 MB | 15,710 MB | 3.4× | 87.5% vs 90.0% (−2.5 pp) |
| Qwen3.5-35B-A3B-OptiQ-4bit (MoE · 3 B active) | 68,597 MB | 20,146 MB | 3.4× | 89.5% vs 89.5% (0.0 pp) |
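The scores above come from a 200-sample GSM8K slice. If you want to spot-check a quant yourself, a bare-bones exact-match harness looks roughly like the sketch below; the prompt suffix and answer extraction are our illustration, not the exact harness behind these numbers.

```python
# Bare-bones GSM8K spot check: exact match on the final "#### <answer>".
# Illustrative only; not the harness used for the table above.
from datasets import load_dataset
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-4B-OptiQ-4bit")
ds = load_dataset("gsm8k", "main", split="test").select(range(200))

def final_answer(text: str) -> str:
    # GSM8K references end with "#### <answer>"; we ask the model to match.
    return text.rsplit("####", 1)[-1].strip().replace(",", "")

hits = 0
for ex in ds:
    prompt = tok.apply_chat_template(
        [{"role": "user",
          "content": ex["question"] + "\n\nEnd with a line '#### <answer>'."}],
        tokenize=False,
        add_generation_prompt=True,
    )
    out = generate(model, tok, prompt=prompt, max_tokens=512)
    if "####" in out and final_answer(out) == final_answer(ex["answer"]):
        hits += 1

print(f"GSM8K-200 accuracy: {hits / len(ds):.1%}")
```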
Qwen3.6 — frontier-class reasoning.
Qwen3.6 comes in two configurations: a dense 27 B and a 256-expert MoE with 3 B active per token. Both were quantized with the same mlx-optiq pass; the dense model edges past uniform 4-bit on GSM8K, and the MoE lands just below it.
| Model | Original size | mlx-optiq size | Compression | GSM8K: mlx-optiq vs uniform-4 |
|---|---|---|---|---|
| Qwen3.6-27B-OptiQ-4bit | 52,989 MB | 15,710 MB | 3.4× | 95.0% vs 94.0% (+1.0 pp) |
| Qwen3.6-35B-A3B-OptiQ-4bit (MoE · 256 experts · 3 B active) | 68,597 MB | 20,146 MB | 3.4× | 89.5% vs 91.5% (−2.0 pp) |
Gemma-4 — Google's instruct series.
Two small dense models (e2b, e4b) and two large ones (a 26 B-A4B sparse MoE and a 31 B dense). The two smallest benefit dramatically from mixed precision: gemma-4-e4b drops to 23.5% on GSM8K at uniform 4-bit, and mlx-optiq recovers it to 55.5%.
| Model | Original size | mlx-optiq size | Compression | GSM8K: mlx-optiq vs uniform-4 |
|---|---|---|---|---|
| gemma-4-e2b-it-OptiQ-4bit | 9,772 MB | 3,978 MB | 2.5× | 13.0% vs 5.5% (+7.5 pp) |
| gemma-4-e4b-it-OptiQ-4bit | 15,252 MB | 6,028 MB | 2.5× | 55.5% vs 23.5% (+32.0 pp) |
| gemma-4-26B-A4B-it-OptiQ-4bit (MoE · 4 B active) | 48,584 MB | 14,866 MB | 3.4× | 94.0% vs 92.0% (+2.0 pp) |
| gemma-4-31B-it-OptiQ-4bit (dense) | 59,648 MB | 18,123 MB | 3.3× | 96.0% vs 96.0% (0.0 pp) |
Serve Gemma-4 with the default KV cache (`mlx_lm.server` or `optiq serve` without `--kv-config`). The mixed-precision KV path currently fails on Gemma-4's shared-KV attention, an upstream mlx-lm limitation we're tracking. Use Qwen3.5 / 3.6 if you specifically need quantized KV.
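On the Qwen quants, a quantized KV cache works straight through mlx-lm's generation kwargs. Here is a quick sketch using uniform 8-bit KV quantization; note this is the stock mlx-lm KV path, not optiq's `--kv-config` mixed-precision path, and `kv_bits` / `kv_group_size` / `quantized_kv_start` require a reasonably recent mlx-lm release.

```python
# Uniform 8-bit KV cache on a Qwen quant via stock mlx-lm kwargs.
# This is the plain mlx-lm KV path, not optiq's --kv-config one; the
# kwargs below need a recent mlx-lm release.
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain what a quantized KV cache buys you."}],
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(
    model, tok, prompt=prompt, max_tokens=300,
    kv_bits=8,             # quantize cached keys/values to 8-bit
    kv_group_size=64,      # quantization group size for the cache
    quantized_kv_start=0,  # quantize from the first token onward
)
print(out)
```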
One snippet. Any model on this page.
All ten quants follow the same load contract. Swap the repo name; the rest stays.
```python
from mlx_lm import load, generate

# Pick any of the 10. Stock mlx-lm, no special loader needed.
model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-precision quantization in 3 sentences."}],
    tokenize=False,
    add_generation_prompt=True,
)

out = generate(model, tok, prompt=prompt, max_tokens=300)
print(out)
```