The eval framework that drives every quant we ship.
For most of mlx-optiq's life, the regression check on a fresh quant was: does it answer the same GSM8K-50 questions the bf16 model answered? A quick 50-sample math check. Five minutes. If it passed, we shipped.
That worked when the things we cared about were *does it still talk* and *can it still do arithmetic*. It stopped working the day someone tried to use one of our quants behind an MCP server and got `JSONDecodeError` on every third tool call. The model could still do math. It had quietly forgotten how to emit valid function-call syntax. GSM8K-50 didn't ask the question.
You catch the regression with the eval that tests the workload. Not the one that's cheap to run.
This release replaces the GSM8K-50 smoketest with a two-tier suite that we now run on every quant before it gets a HuggingFace card. Sandboxed HumanEval execution, auto-resolved KL reference, single roll-up Capability Score, all from one CLI command.
The two tiers
Quants pass through two checkpoints. The first is fast and triages; the second is slow and decides what we ship.
| Tier | Time / model | What it answers | Tasks |
|---|---|---|---|
| Tier-1 smoketest | ~5 min | Did the convert work? Are we close to the reference distribution? | KL on 64 prompts × 256 tokens · GSM8K-50 (chat-templated, thinking off) |
| Tier-2 headline | ~90 min | How much capability did we keep across the workloads users actually run? | MMLU-1k 5-shot · GSM8K-1k · IFEval (full) · BFCL-V3 simple (200) · HumanEval (164) |
Tier-1 is the gate. A quant that fails Tier-1 doesn't get Tier-2. Tier-2 is the headline number that ends up on the model card.
Tier 1: KL + GSM8K-50
KL divergence between two language models, computed token-by-token over a small batch of held-out prompts, is a cheap signal that works well in practice. The reference is the highest-fidelity version of the model that fits on the box. The candidate is the OptIQ quant. We compute KL(reference ‖ candidate) per token, average across 64 prompts × 256 tokens, and report mean + p95.
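A minimal sketch of the statistic itself, assuming you already have both models' logits for the same tokenized prompt (variable names and shapes here are illustrative, not mlx-optiq internals):

```python
import numpy as np
from scipy.special import log_softmax

def per_token_kl(ref_logits: np.ndarray, cand_logits: np.ndarray) -> np.ndarray:
    """KL(reference ‖ candidate) at every token position.

    Both inputs are [num_tokens, vocab_size] logits over the same prompt.
    """
    ref_logp = log_softmax(ref_logits, axis=-1)
    cand_logp = log_softmax(cand_logits, axis=-1)
    return (np.exp(ref_logp) * (ref_logp - cand_logp)).sum(axis=-1)

# Toy shapes only: in the real run the logits come from the reference and
# candidate models over 64 prompts × 256 tokens, then mean + p95 are reported.
rng = np.random.default_rng(0)
ref, cand = rng.normal(size=(256, 32_000)), rng.normal(size=(256, 32_000))
kl = per_token_kl(ref, cand)
print(f"mean={kl.mean():.3f}  p95={np.percentile(kl, 95):.3f}")
```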
The auto-resolver picks the reference automatically:
```python
# Pick the highest-fidelity reference that fits on the box.
bf16_gb = hf_repo_size_gb(strip_quant_suffix(model_id))
avail_gb = psutil.virtual_memory().available / 1024**3

if bf16_gb <= 0.70 * avail_gb:
    return "bf16", strip_quant_suffix(model_id)

# bf16 doesn't fit; fall back to the uniform-4-bit MLX baseline.
return "uniform_4bit", uniform_4bit_repo(model_id)
```
The smoketest sweep across the 12 quants we shipped pre-v0.1.0 produced this table. It's also the source of the re-quant list at the bottom of the post:
| Model | KL mean | KL p95 | GSM8K-50 | Reference |
|---|---|---|---|---|
| Qwen3.5-27B-OptiQ-4bit | 0.05 | 0.15 | 96 % | uniform-4bit |
| Qwen3.6-27B-OptiQ-4bit | 0.06 | 0.29 | 100 % | uniform-4bit |
| Qwen3.5-9B-OptiQ-4bit | 0.18 | 0.80 | 82 % | bf16 |
| Qwen3.5-2B-OptiQ-4bit | 0.19 | 0.84 | 56 % | bf16 |
| gemma-4-e4b-it-OptiQ-4bit | 0.28 | 1.35 | 92 % | bf16 |
| gemma-4-e2b-it-OptiQ-4bit | 0.57 | 3.04 | 56 % | bf16 |
| gemma-4-26B-A4B-it-OptiQ-4bit 🔻 | 0.93 | 4.31 | 96 % | uniform-4bit |
| gemma-4-31B-it-OptiQ-4bit 🔻 | 0.99 | 4.76 | 98 % | uniform-4bit |
Two flags fell out: the Gemma-4 26B-A4B sparse-MoE and the 31B dense, both with KL 20× higher than the Qwen3.5-27B at comparable size. The GSM8K-50 numbers were fine, 96 % and 98 % respectively. That's exactly why we needed Tier-2.
(Two more quants, Qwen3.5-27B and Qwen3.6-27B, looked perfect on Tier-1 but went into the re-quant list anyway, on the suspicion that the WikiText-only calibration was also under-protecting their tool-call layers. Tier-2 confirmed.)
Tier 2: five-metric headline
The Tier-2 suite is what ends up on the model card. Each task targets a capability slice:
- MMLU: 5-shot, stratified across the 57 subjects, 1000 samples. Encyclopedic knowledge after instruction-tuning. The bf16 anchor.
- GSM8K: 1000 samples, 3-shot CoT, chat-templated, `enable_thinking=False` for reasoning models. Multi-step arithmetic.
- IFEval: full Google IFEval set with all 25+ constraint verifiers. Measures whether the model can follow detailed format / length / capitalization / inclusion-exclusion instructions. Strict + loose accuracy.
- BFCL-V3 simple: 200 single-turn function calls with AST equivalence scoring (sketched after this list). Whether the model can emit a syntactically valid call and pick the right tool from a small candidate set.
- HumanEval: all 164 problems, sandboxed Python execution, pass@1 only.
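AST equivalence scoring for the BFCL task boils down to comparing the predicted call and the gold call structurally rather than string-for-string, so argument order and whitespace don't matter. A minimal sketch of the idea (keyword arguments only; an illustration, not the BFCL harness itself):

```python
import ast

def calls_equivalent(predicted: str, gold: str) -> bool:
    """True if two single function-call strings are structurally equal,
    e.g. f(x=1, y="a") matches f(y="a", x=1) but not f(x=2, y="a")."""
    def parse(src: str):
        node = ast.parse(src.strip(), mode="eval").body
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            return None
        return node.func.id, {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    try:
        p, g = parse(predicted), parse(gold)
    except (SyntaxError, ValueError):
        return False  # an unparseable prediction simply scores 0
    return p is not None and p == g

print(calls_equivalent('get_weather(city="Paris", unit="C")',
                       'get_weather(unit="C", city="Paris")'))  # True
```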
Run from the CLI as a single task:
```
optiq eval ./optiq_mixed --task all --score
```

Each individual task is also addressable (`--task mmlu`, `--task ifeval`, etc.) for when you only need one number.
Sandboxing HumanEval
HumanEval requires actually executing the model's generated Python against a unit-test harness. Doing that on the user's machine with no isolation is a footgun. A model that emits os.system("rm -rf …") ruins someone's afternoon. The sandbox helper falls through three tiers:
- `apple/container`: when present, runs each candidate inside a fresh container with no network, no filesystem mount outside `/tmp`, and a wall-clock timeout. Hardest isolation, slowest start.
- `sandbox-exec`: macOS native, when `/usr/bin/sandbox-exec` is available. Subprocess with a tight seatbelt profile (no network, deny file-write outside `/tmp`). Fast.
- subprocess + rlimit: universal fallback. Spawn a Python child with `RLIMIT_AS`, `RLIMIT_CPU`, `RLIMIT_FSIZE` caps and a process-group timeout. No filesystem isolation; exists so the eval doesn't simply fail to run on Linux CI.
The helper picks the strictest tier available at runtime. Reported pass@1 is identical across tiers because the test harness is deterministic. Only the blast radius of malicious code changes.
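For a concrete picture of the weakest tier, here is a minimal sketch of the rlimit fallback. The limit values and timeout are illustrative, not the ones mlx-optiq ships:

```python
import resource
import subprocess
import sys

def run_candidate(code: str, timeout_s: int = 10) -> subprocess.CompletedProcess:
    """Run model-generated code in a child Python with resource caps.

    No filesystem isolation in this tier; it only bounds memory, CPU time and
    file output so a runaway or hostile candidate can't take the host down.
    """
    def set_limits():  # runs in the child, after fork and before exec
        resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, 2 * 1024**3))   # 2 GiB address space
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))      # CPU seconds
        resource.setrlimit(resource.RLIMIT_FSIZE, (1024**2, 1024**2))        # 1 MiB per file

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=set_limits,
        start_new_session=True,   # own process group, so the whole tree dies together
        capture_output=True,
        text=True,
        timeout=timeout_s,        # wall-clock cap on top of RLIMIT_CPU
    )
```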
The Capability Score
Five percentages are hard to compare side-by-side. We want one number that answers "which quant is more capable on average?" And we want a formula the reader can audit, not a hidden value judgement dressed up as math.
The simplest one that meets that bar:
Capability_Score = mean(MMLU, GSM8K, IFEval, BFCL, HumanEval)
We tried a weighted formula first. Something like MMLU + 0.3 × IFEval + 0.5 × BFCL − 5 × disk_GB. It looked clever. It also embedded our quality/disk tradeoff in a way users can't see, and it could turn a +1 pp capability win into a "loss" if the disk grew by half a gigabyte. That's a recommendation, not a measurement.
So we stripped it down. The five benchmarks each get an equal vote. disk_gb is reported next to the score as an unweighted second axis, and the reader picks their own tradeoff. If you're optimizing for an 8 GB Mac, smaller wins. If you're on a 64 GB Studio, larger probably wins. The score doesn't pretend to know.
Two consequences worth flagging. (1) GSM8K is now back in the average. Earlier we worried about double-counting MMLU's reasoning content, but in practice GSM8K and MMLU disagree often enough on quants that letting GSM8K vote catches regressions MMLU misses. (2) HumanEval is in too, which means a quant that breaks code generation can't hide behind strong instruction-following.
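In code, the roll-up really is that boring (the field names and example numbers below are purely illustrative):

```python
TIER2_TASKS = ("mmlu", "gsm8k", "ifeval", "bfcl", "humaneval")

def capability_score(results: dict[str, float]) -> float:
    """Unweighted mean of the five Tier-2 accuracies, each in percent."""
    return sum(results[t] for t in TIER2_TASKS) / len(TIER2_TASKS)

# disk_gb rides alongside, unweighted; the reader applies their own tradeoff.
results = {"mmlu": 71.2, "gsm8k": 84.0, "ifeval": 62.5, "bfcl": 78.0, "humaneval": 56.1}
print(f"Capability Score: {capability_score(results):.1f}  (disk_gb reported separately)")
```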
Picking the KL reference
One technical note that took us a few iterations to get right.
For models that fit in RAM (everything ≤ ~10 B at bf16 on a 36 GB Mac), the KL reference is unambiguous: it's the bf16 model itself. For 27 B+, bf16 doesn't fit, and you need a substitute reference that's still strictly higher-fidelity than the candidate. The community's uniform-4-bit MLX publish of the same model is exactly that: same architecture and weights modulo quantization noise, just at uniform 4-bit (no per-layer mixed precision).
The auto-resolver picks bf16 if available, falls back to uniform-4-bit otherwise. The fall-back was originally driven by a crude params × 2 bytes size estimate, which under-counted gemma-4-26B-A4B's MoE expert tensors and tried to load 110 GB of bf16 into 36 GB of RAM. Now we hit HfApi.model_info() and sum the actual safetensors shard sizes. The resolver is exact and the OOM is gone.
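A sketch of what that lookup amounts to with `huggingface_hub`, using the `hf_repo_size_gb` name from the resolver above (the actual implementation may differ in details):

```python
from huggingface_hub import HfApi

def hf_repo_size_gb(repo_id: str) -> float:
    """Sum the actual safetensors shard sizes for a repo, in GiB."""
    info = HfApi().model_info(repo_id, files_metadata=True)
    shard_bytes = sum(
        sibling.size or 0
        for sibling in info.siblings
        if sibling.rfilename.endswith(".safetensors")
    )
    return shard_bytes / 1024**3
```

Summing what is actually on disk is what closed the gemma-4-26B-A4B gap: the MoE expert shards are counted, where a params × 2 bytes estimate under-counted them.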
What the framework caught
Tier-1 across our 12 pre-v0.1.0 quants flagged the two Gemma-4 27 B+ models as obvious calibration regressions (KL ≥ 0.93). Tier-2 on the same set surfaced two more, Qwen3.5-27B and Qwen3.6-27B, with low KL but degraded BFCL and IFEval from the WikiText-only calibration. We re-quanted those four and re-ran the same suite.
Tier-2 numbers for the four re-quants are landing as the sweep finishes. We'll update this section with the per-model deltas (MMLU / GSM8K / IFEval / BFCL / HumanEval / Capability Score) as soon as the data is in, and cross-link it from the model cards on HuggingFace.
Reproducing
Everything in this post runs from the CLI. No special setup beyond `pip install mlx-optiq`:

```
# Tier 1: fast smoketest (KL + GSM8K-50)
optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task smoketest

# Tier 2: full headline (MMLU + GSM8K + IFEval + BFCL + HumanEval + Score)
optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task all --score

# Single tasks if you only need one number
optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task bfcl
optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task humaneval

# Custom reference for KL (skip auto-resolver)
optiq eval ./my-quant --task kl --reference-model Qwen/Qwen3.5-9B --reference-mode bf16
```
Every task above is callable on its own. Pick the one you need with `optiq eval --task <name>`.
— the mlx-optiq team