mlx-optiq
Pre-built quants · Hugging Face

Sixteen OptiQ-4bit quants. Ready to load.

Each model is a standard MLX checkpoint. Load it with mlx_lm.load(...), no special runtime. Sensitivity-driven mixed-precision quantization recovers what uniform 4-bit drops, especially at the smaller end where every layer counts.

01 Nemotron 3 family · added Jun 3, 2026

Nemotron 3: NVIDIA's Mamba-attention hybrid.

NVIDIA's Nemotron 3 Nano interleaves Mamba2 state-space blocks with a handful of full-attention layers, and the larger model adds a 128-expert sparse MoE. The 30B-A3B (≈3 B active per token) is the standout: OptiQ assigns 4-bit or 8-bit precision per layer across the fused routed experts and beats uniform 4-bit by 2.02 Capability Score points, winning or tying all six benchmarks. The dense 4B posts a smaller +0.24 gain.

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)
Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4
NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit | 63,236 MB | 21,043 MB | 3.0× | 69.15 | +2.02
NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit | 7,947 MB | 2,938 MB | 2.7× | 63.60 | +0.24
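The Compression and Δ columns are plain arithmetic on the other columns. A minimal sketch using the two Nemotron 3 rows above; the helper names are illustrative and not part of mlx-optiq, and the "implied uniform-4" score is simply the Capability Score minus the reported delta:

```python
# Rows copied from the table above:
# (bf16 size in MB, mlx-optiq size in MB, Capability Score, delta vs uniform-4)
ROWS = {
    "NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit": (63_236, 21_043, 69.15, 2.02),
    "NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit": (7_947, 2_938, 63.60, 0.24),
}


def compression(bf16_mb, quant_mb):
    """Size ratio of the bf16 checkpoint to the quantized one, to one decimal."""
    return round(bf16_mb / quant_mb, 1)


def implied_uniform4(score, delta):
    """Uniform 4-bit score implied by the reported score and its delta."""
    return round(score - delta, 2)


for name, (bf16_mb, quant_mb, score, delta) in ROWS.items():
    print(name, compression(bf16_mb, quant_mb), implied_uniform4(score, delta))
```

The same two functions apply unchanged to every table on this page.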
Hybrid KV cache
Only the four full-attention layers carry a KV cache; the Mamba2 blocks keep recurrent state instead. Each repo ships a kv_config.json covering just those attention layers (three at 4-bit, one at 8-bit). Pass it with optiq serve --kv-config kv_config.json. optiq kv-cache gained NemotronH support in v0.1.5; earlier versions raised ZeroDivisionError on this architecture.
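As a rough illustration of what that three-at-4-bit, one-at-8-bit split buys, the sketch below compares the average KV bit width with a 16-bit cache. It assumes all four attention layers hold equally sized caches and ignores the per-group scale and bias overhead that quantized caches carry, so real savings will be somewhat smaller. The list holds counts only; which layer gets 8-bit is not stated here.

```python
from fractions import Fraction

# Bit widths for the four full-attention layers (order is not meaningful).
KV_BITS = [4, 4, 4, 8]
FP16_BITS = 16

# Average bits per cached element across the attention layers.
avg_bits = Fraction(sum(KV_BITS), len(KV_BITS))

# Idealized shrink factor relative to a 16-bit KV cache.
shrink = Fraction(FP16_BITS) / avg_bits

print(f"average KV bits: {float(avg_bits)}")
print(f"idealized shrink vs 16-bit: {float(shrink)}x")
```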
Nemotron 3 getting-started guide →

02 MiniCPM5 family · added May 28, 2026

MiniCPM5: a 1B that punches above its weight.

OpenBMB's 1.08B-parameter base uses the Llama architecture and is fully Apache-2.0 licensed. Its hybrid-reasoning chat template exposes an enable_thinking flag. In non-thinking mode (the OptiQ benchmark recipe) it posts 52% MMLU, 65% IFEval, and 58% HumanEval on a model that weighs less than a gigabyte on disk. The OptiQ-4bit quant beats stock uniform-4 by 12 points on HumanEval and rescues HashHop from a 0% floor, the same sensitivity-aware allocation story as the larger families.

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)
Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4
MiniCPM5-1B-OptiQ-4bit | 2,062 MB | 875 MB | 2.4× | 30.28 | +4.44
Heads up
MiniCPM5 ships a hybrid <think> reasoning mode. Pass chat_template_kwargs={"enable_thinking": true} to turn it on; expect substantially higher math and tool scores in that mode. OptiQ's benchmark recipe forces thinking off for cross-family comparability, so the table reflects fast-assistant performance.
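When calling mlx-lm directly from Python rather than through a server, one way to set the flag is to pass it as an extra keyword to apply_chat_template, which Hugging Face tokenizers forward to the chat template. The sketch below assumes the MiniCPM5 template reads enable_thinking this way and that the repo lives under mlx-community (an org name borrowed from the loading snippet on this page); build_prompt and run_demo are hypothetical helper names.

```python
def build_prompt(tok, question, thinking=False):
    """Render a single-turn chat prompt, toggling the template's thinking flag."""
    return tok.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,  # extra kwarg forwarded to the chat template
    )


def run_demo():
    """Load the quant and answer one question with thinking enabled."""
    # Imported lazily so the helper above can be used without mlx-lm installed.
    from mlx_lm import load, generate

    # Assumed repo id; adjust if the checkpoint is published elsewhere.
    model, tok = load("mlx-community/MiniCPM5-1B-OptiQ-4bit")
    prompt = build_prompt(tok, "What is 17 * 23?", thinking=True)
    return generate(model, tok, prompt=prompt, max_tokens=512)
```

Call run_demo() on a machine with mlx-lm installed. Thinking traces can be long, so a larger max_tokens than the non-thinking default is usually sensible.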
MiniCPM5 getting-started guide →

03 Gemma-4 family · added Apr 25, 2026

Gemma-4: Google's instruct series.

Five models: two small dense (e2b, e4b), the new 12 B (the unified text+vision Gemma-4, now with image input), and two large (31 B dense, 26 B-A4B sparse MoE). Mixed-precision recovery is dramatic: gemma-4-e4b posts a +13.6-point Capability Score gain over uniform 4-bit, and the 12 B adds +6.4. Pair e4b or 31B with its matching -assistant-bf16 drafter for speculative decoding. All five also take image input through a bf16 vision sidecar; see the vision guide.

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)
Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4
gemma-4-e2b-it-OptiQ-4bit | 9,216 MB | 4,098 MB | 2.2× | 53.21 | +2.12
gemma-4-e4b-it-OptiQ-4bit | 14,336 MB | 6,231 MB | 2.3× | 65.84 | +13.57
gemma-4-12B-it-OptiQ-4bit | 22,811 MB | 8,449 MB | 2.7× | 68.23 | +6.40
gemma-4-31B-it-OptiQ-4bit | 63,488 MB | 21,328 MB | 3.0× | 79.69 | +3.47
gemma-4-26B-A4B-it-OptiQ-4bit | 53,248 MB | 16,813 MB | 3.2× | 72.68 | +3.06
Mixed-precision KV now works on Gemma-4 (v0.1.3+)
Each Gemma-4 repo above ships a recommended kv_config.json from a real sensitivity-analysis pass. Pass it with optiq serve --kv-config kv_config.json. Upstream mlx-lm's RotatingKVCache.to_quantized raises NotImplementedError, which is what mlx-optiq v0.1.2 and earlier ran into; from v0.1.3 the runtime substitutes optiq.runtime.kv.RotatingQuantizedKVCache and adds a small SDPA dispatch patch for Gemma-4's KV-sharing layers. The model still loads fine without the config, using the stock fp16 KV cache.
Gemma-4 getting-started guide →

04 Qwen3.6 family · added earlier in April 2026

Qwen3.6: frontier-class reasoning.

Qwen3.6 comes in two configurations: a dense 27 B and a 256-expert MoE with 3 B active per token. Both are quantized with the same mlx-optiq pass, and both ship a bundled MTP head for ~1.4× faster decode via optiq serve --mtp. Both beat uniform 4-bit on the six-metric Capability Score, and both take image input via a bf16 vision sidecar (vision guide).

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)
Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4
Qwen3.6-27B-OptiQ-4bit | 57,344 MB | 17,876 MB | 3.2× | 82.96 | +0.46
Qwen3.6-35B-A3B-OptiQ-4bit | 73,728 MB | 22,679 MB | 3.3× | 76.78 | +1.12
Qwen3.6 getting-started guide →

05 Qwen3.5 family · the founding lineup

Qwen3.5: the daily-driver series.

From 0.8 B for prompt-rewriters and toy agents up to 27 B for serious reasoning, plus a 35 B-A3B sparse MoE. All quantized with the same mlx-optiq pass; all ship a bundled MTP head for speculative decoding via optiq serve --mtp; all beat uniform 4-bit on the six-metric Capability Score. Every Qwen3.5 size also takes image input via a bf16 vision sidecar (vision guide).

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)
Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4
Qwen3.5-0.8B-OptiQ-4bit | 1,229 MB | 620 MB | 2.0× | 36.00 | +4.27
Qwen3.5-2B-OptiQ-4bit | 3,072 MB | 1,463 MB | 2.1× | 47.66 | +2.12
Qwen3.5-4B-OptiQ-4bit | 8,192 MB | 3,118 MB | 2.6× | 65.76 | +1.90
Qwen3.5-9B-OptiQ-4bit | 18,432 MB | 6,772 MB | 2.7× | 66.77 | +0.19
Qwen3.5-27B-OptiQ-4bit | 57,344 MB | 17,788 MB | 3.2× | 79.05 | +0.17
Qwen3.5-35B-A3B-OptiQ-4bit | 73,728 MB | 21,603 MB | 3.4× | 74.17 | +0.42
Recommended starting point
For most users on a 36 GB Mac, the Qwen3.5-9B quant is the default: it has the strongest Capability-per-GB and runs at the full 64 k context with mixed-precision KV. Its bundled MTP head delivers ~1.4× faster decode via optiq serve --mtp. Drop to 4 B on laptops with less RAM, or step up to 27 B if you have headroom.
Qwen3.5 getting-started guide →

06 Loading any of them

One snippet. Any model on this page.

All sixteen quants follow the same load contract. Swap the repo name; the rest stays.

load_any.py · python
from mlx_lm import load, generate

# Pick any of the 16. Stock mlx-lm, no special loader needed.
model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-precision quantization in 3 sentences."}],
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(model, tok, prompt=prompt, max_tokens=300)
print(out)
Per-family notes
Each model family has slightly different recommended sampling defaults and chat templates. See the Nemotron 3, MiniCPM5, Gemma-4, Qwen3.6 and Qwen3.5 getting-started guides.
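Because every model name in the tables ends in -OptiQ-4bit, the "swap the repo name" step can be wrapped in a small helper. This is a sketch only: repo_id and ask are hypothetical names, and the mlx-community org is assumed from the snippet above rather than confirmed for every repo.

```python
SUFFIX = "-OptiQ-4bit"


def repo_id(base, org="mlx-community"):
    """Build a Hugging Face repo id from a base model name (org is assumed)."""
    return f"{org}/{base}{SUFFIX}"


def ask(base, question, max_tokens=300):
    """Load one quant with stock mlx-lm and answer a single question."""
    # Imported lazily so repo_id stays usable without mlx-lm installed.
    from mlx_lm import load, generate

    model, tok = load(repo_id(base))
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
    )
    return generate(model, tok, prompt=prompt, max_tokens=max_tokens)
```

For example, ask("Qwen3.5-9B", "Explain mixed-precision quantization in 3 sentences.") follows the same load contract as the snippet above with a different checkpoint.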