mlx-optiq
Engineering · April 28, 2026

What you calibrate on is what you protect.

mlx-optiq is a sensitivity-driven quantizer: for every layer in a model, it measures how much the output distribution shifts when that layer is quantized to k bits, then a knapsack solver hands out the bit budget to the layers that need it most. The whole pipeline rests on a single thing: what input data we run through the model to take that measurement.
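
To make that concrete, here's a minimal sketch of the allocation step in Python. It assumes the per-layer sensitivities have already been measured (a map of layer → KL at each candidate bit-width); the greedy strategy and every name in it are illustrative, not mlx-optiq's actual internals.

# Illustrative only -- not mlx-optiq's implementation. sens[layer][bits] is the
# measured KL divergence between the full-precision output distribution and the
# output distribution with just that layer quantized to `bits`.

def allocate_bits(sens, budget, choices=(2, 4, 8)):
    """Greedy knapsack over the bit budget: start every layer at the cheapest
    width, then repeatedly upgrade whichever layer buys the largest KL
    reduction per extra bit, until the budget runs out."""
    alloc = {layer: min(choices) for layer in sens}
    spent = sum(alloc.values())
    while True:
        best = None
        for layer, bits in alloc.items():
            higher = [b for b in choices if b > bits]
            if not higher:
                continue
            nxt, cost = min(higher), min(higher) - bits
            if spent + cost > budget:
                continue
            gain = (sens[layer][bits] - sens[layer][nxt]) / cost  # KL saved per extra bit
            if best is None or gain > best[0]:
                best = (gain, layer, nxt, cost)
        if best is None:          # nothing else fits in the budget
            return alloc
        _, layer, nxt, cost = best
        alloc[layer], spent = nxt, spent + cost

# Toy numbers: the first layer degrades badly at low bits, the second barely cares.
sens = {
    "layers.3.mlp":  {2: 0.90, 4: 0.15, 8: 0.01},
    "layers.7.attn": {2: 0.05, 4: 0.02, 8: 0.00},
}
print(allocate_bits(sens, budget=10))   # -> {'layers.3.mlp': 8, 'layers.7.attn': 2}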

If your calibration data is prose, you'll protect the layers that matter for prose. If your model is supposed to call tools, you'll quietly degrade tool-calling.

For most of mlx-optiq's life that calibration set was WikiText-2. This release replaces it with optiq.jsonl: 32 hand-curated samples spanning prose, multi-step reasoning, code, agent loops, and function-calling. The file ships inside the Python package, so optiq convert never touches the network for calibration data. This post covers the why, the what, and how to reproduce or replace it.

Why WikiText was the wrong floor

WikiText-2 was the right calibration set for the GPT-2 era. It's clean, English, well-distributed, large enough to sample from, and the models you were quantizing were fundamentally next-token predictors over web prose. A small batch of WikiText sentences would activate roughly the same circuits the model used in production.

Modern instruction-tuned and reasoning-tuned LLMs aren't that. Qwen3.5-9B-Instruct is asked to do five very different things in a typical day: explain a concept, work through a multi-step proof inside a <think> block, write Python that compiles, execute an agent loop with tool feedback, and emit valid JSON for a function call. Each of those uses different layer subspaces.

When the calibration set is exclusively prose, two failure modes happen at once:

  • Tool-calling layers look "robust." They aren't activated by prose, so quantizing them produces ~zero KL divergence on the calibration probe. The optimizer happily drops them to the lowest bit-width. In production the model now hallucinates JSON brackets.
  • Reasoning-critical layers look like noise. The <think> sub-circuit is most active during long multi-step solutions; a 256-token WikiText snippet barely touches it.

You can't see this from a perplexity number. WikiText perplexity stays flat. GSM8K and IFEval on the quantized artifact quietly drop a few percent. The calibration set isn't lying. It's just not being asked the question that matters.

The five domains

We picked five capability slices that match what people actually run mlx-optiq quants for, and built a calibration sample from a public dataset for each.

Domain  | Samples | Source dataset                           | What it activates
prose   | 5       | wikitext-2-raw-v1                        | Baseline next-token prediction. The thing WikiText was always good for.
thought | 6       | open-r1/Mixture-of-Thoughts (math)       | R1-style reasoning with explicit <think> blocks. Long, structured, self-correcting.
code    | 6       | nvidia/OpenCodeReasoning                 | Programming problems with step-by-step reasoning into a working Python solution.
agent   | 8       | lambda/hermes-agent-reasoning-traces     | Multi-turn agent loops: system prompt → user → think+tool_call → tool_result → continuation.
tool    | 7       | NousResearch/hermes-function-calling-v1  | Function-calling traces with tool schemas. Forces the model into JSON-emitting subspace.

32 samples total · ~124 K characters · ~31 K tokens. Small enough that even a dense 31 B model finishes the per-layer KL pass in under an hour on a 36 GB Mac, dense enough that every major capability slice gets at least 5 K activated tokens of probe pressure.

Schema

Each line is a JSON object. Two shapes: raw text for prose, chat for everything else:

// raw text, used as-is
{"domain": "prose", "text": "..."}

// chat: runs through the model's tokenizer.apply_chat_template()
{"domain": "thought", "messages": [{"role": "user", "content": "..."}, ...]}

// chat with tools: schema is rendered into the system prompt by the tokenizer
{"domain": "tool", "messages": [...], "tools": [{"name": "...", ...}]}

The loader (optiq.calibration.datasets.load_llm_calibration) is opinionated about one thing: chat samples go through the target model's own chat template before tokenization. That means a sample we built from a Qwen-flavored agent trace gets re-rendered into Gemma chat tokens before being fed to a Gemma model. The activated subspace is the production subspace, not the donor model's.
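
Concretely, that re-rendering is just the tokenizer's chat-template call. A rough sketch of the idea, assuming a transformers tokenizer with tools= support; this is illustrative, not the actual body of load_llm_calibration, and the model id is a placeholder:

import json
from transformers import AutoTokenizer   # needs a transformers version with tools= support

def render_sample(row, tokenizer):
    """Turn one optiq.jsonl row into the token ids the *target* model would see."""
    if "text" in row:                                  # prose: raw text, used as-is
        return tokenizer(row["text"])["input_ids"]
    return tokenizer.apply_chat_template(              # chat: re-rendered through the
        row["messages"],                               # target model's own template
        tools=row.get("tools"),                        # tool schemas, when present
        tokenize=True,
        add_generation_prompt=False,
    )

tokenizer = AutoTokenizer.from_pretrained("path/to/target-model")   # placeholder id
with open("optiq/calibration/data/optiq.jsonl") as f:
    calibration_batches = [render_sample(json.loads(line), tokenizer) for line in f]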

Per-domain design notes

prose

WikiText-2 still earns its keep at 15 % weight. It's the only domain that activates the model purely as a next-token predictor without instruction-tuned templating, and it keeps the calibration set honest about the underlying language model. Without it, mixed-precision quants started skewing too aggressively toward instruction layers.

thought

open-r1's Mixture-of-Thoughts math split, chosen because the chain-of-thought is long, structured, and self-correcting. The <think> block exercises the residual stream in a way short-form QA never does, and the math domain forces step-by-step intermediate computation rather than free-form discussion.

code

nvidia/OpenCodeReasoning has the same shape as thought but in Python instead of math: competition-style problems with explicit reasoning into a working solution. Strong predictor of HumanEval pass@1 and code-generation BFCL slices.

agent

The hardest domain to source. Public agent-trace datasets are either short demos (don't exercise the loop) or production logs that pile up to ~86 K characters per trace. We use lambda/hermes-agent-reasoning-traces and aggressively truncate: keep the first 7 turns, cap each message at 600 characters. The result preserves the shape of an agent loop (system → user → think+tool_call → tool_result → continuation) without spending the entire token budget on one trace's tail.
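
The truncation itself is mechanical. A sketch of the rule, with the caveat that it treats one message as one turn, which is a simplification:

MAX_TURNS = 7      # keep the head of the loop: system -> user -> first tool round-trips
MAX_CHARS = 600    # cap each message so one verbose tool_result can't eat the budget

def truncate_trace(messages, max_turns=MAX_TURNS, max_chars=MAX_CHARS):
    """Keep the first `max_turns` messages of an agent trace, each capped at
    `max_chars` characters. Illustrative; treats one message as one turn."""
    return [
        {**msg, "content": (msg.get("content") or "")[:max_chars]}
        for msg in messages[:max_turns]
    ]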

tool

NousResearch/hermes-function-calling-v1 in func_calling config. We filter for traces that contain at least one explicit <tool_call> (or equivalent name+arguments JSON), so every sample exercises the JSON-emitting head. Tool schemas are extracted from the source's <tools>...</tools> block and attached as a structured field, then re-rendered through the target model's tool template at calibration time.
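
A sketch of that filter-and-extract step, assuming Hermes-style traces where the schemas sit in a <tools>…</tools> block that parses as a single JSON array; the real build script may need more lenient parsing:

import json
import re

TOOLS_RE = re.compile(r"<tools>\s*(.*?)\s*</tools>", re.DOTALL)

def extract_tools(system_prompt):
    """Pull the tool schemas out of a <tools>...</tools> block, if there is one."""
    m = TOOLS_RE.search(system_prompt)
    return json.loads(m.group(1)) if m else None

def emits_tool_call(messages):
    """Keep only traces where the assistant actually produces a <tool_call>."""
    return any(
        msg.get("role") == "assistant" and "<tool_call>" in (msg.get("content") or "")
        for msg in messages
    )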

Why ship the JSONL inside the package

Three reasons.

  1. Reproducibility. A bf16-source quantization run in 2026 should be byte-identical to one in 2027. If the calibration set is "stream from HuggingFace at convert time," you can't promise that. Datasets get re-uploaded, the source repos change, streaming order is non-deterministic.
  2. Offline. optiq convert already needs network for the bf16 source weights; calibration shouldn't add a second hop. The JSONL is 328 KB. It travels in the wheel.
  3. Auditable. One file, 32 lines, you can read all of them. There's nowhere for a subtle distribution bias to hide.
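
It's also small enough to tally programmatically. A quick audit of the shipped file (path as written by the build script below; character counts for chat samples are approximate):

import json
from collections import Counter

samples, chars = Counter(), Counter()
with open("optiq/calibration/data/optiq.jsonl") as f:
    for line in f:
        row = json.loads(line)
        samples[row["domain"]] += 1
        text = row.get("text") or " ".join(
            str(m.get("content") or "") for m in row.get("messages", [])
        )
        chars[row["domain"]] += len(text)

for domain, n in samples.most_common():
    print(f"{domain:8} {n:2} samples  ~{chars[domain]:,} chars")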

Building your own

The mix above is what we ship as the default. The CLI accepts a path to your own JSONL (same schema) for domain-specific quants:

# Default: ships in the package, no network.
optiq convert Qwen/Qwen3.5-9B --calibration-mix optiq

# Roll your own (same schema, point at any .jsonl on disk).
optiq convert Qwen/Qwen3.5-9B --calibration-mix ./my-domain-mix.jsonl

If you want to rebuild the default mix from scratch (different seed, different per-domain weighting, different sources), the script is checked into the repo:

# Re-derive optiq.jsonl from the 5 source datasets, deterministic seed=42.
python scripts/build_calibration.py

The builder streams from HuggingFace, applies the per-domain length filters and truncation rules, and writes the result back to optiq/calibration/data/optiq.jsonl. About 90 seconds end-to-end on a warm cache.
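
For orientation, the overall shape of that builder, heavily simplified; the dataset configs, the convert step, and the length threshold below are placeholders, and the real rules live in scripts/build_calibration.py:

import json
from datasets import load_dataset

SEED = 42
PLAN = [  # (domain, hub repo, config, samples to keep) -- configs are placeholders
    ("prose",   "wikitext",                                "wikitext-2-raw-v1", 5),
    ("thought", "open-r1/Mixture-of-Thoughts",             None,                6),
    ("code",    "nvidia/OpenCodeReasoning",                None,                6),
    ("agent",   "lambda/hermes-agent-reasoning-traces",    None,                8),
    ("tool",    "NousResearch/hermes-function-calling-v1", "func_calling",      7),
]

def convert(domain, raw):
    """Placeholder for the per-domain filtering/truncation described above."""
    if domain == "prose":
        text = (raw.get("text") or "").strip()
        return {"domain": domain, "text": text} if len(text) > 200 else None
    messages = raw.get("messages") or raw.get("conversations")
    return {"domain": domain, "messages": messages} if messages else None

def build(out_path="optiq/calibration/data/optiq.jsonl"):
    with open(out_path, "w") as out:
        for domain, repo, config, n in PLAN:
            stream = load_dataset(repo, config, split="train", streaming=True)
            kept = 0
            for raw in stream.shuffle(seed=SEED, buffer_size=1_000):
                sample = convert(domain, raw)
                if sample is None:
                    continue
                out.write(json.dumps(sample) + "\n")
                kept += 1
                if kept == n:
                    break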

What this buys you

The interesting part is what changes downstream: which layers the optimizer protects, and what that does to held-out evals like IFEval (instruction-following), BFCL (tool-calling), and HumanEval (code) on the same target BPW.

That's a long enough story to be its own post. See the eval-framework writeup for the per-model deltas on Qwen3.5-27B, Qwen3.6-27B, and the Gemma-4 26B-A4B / 31B re-quants.

— the mlx-optiq team