mlx-optiq frequently asked questions

Question 1

What is mlx-optiq?

Accepted Answer

mlx-optiq is an MLX-native optimizing compiler that quantizes LLMs for Apple Silicon. Instead of giving every layer the same bit-width, it measures each layer's sensitivity on calibration data and allocates higher precision to layers that need it, lower precision to layers that don't. The result is a mixed-precision quant that runs in stock mlx-lm with no custom runtime, typically at higher accuracy than uniform 4-bit at the same average bits-per-weight.

Question 2

How is mlx-optiq different from uniform 4-bit quantization?

Accepted Answer

Uniform 4-bit treats every layer the same. mlx-optiq measures per-layer KL sensitivity on a calibration mix, then runs a greedy knapsack to assign bit-widths (4 / 8 bits by default) such that the total budget hits a target average BPW. Sensitive layers stay at 8-bit, robust layers go to 4-bit. On the small end the recovery is dramatic: gemma-4-e4b drops to 23.5% GSM8K at uniform 4-bit; mlx-optiq lifts it to 55.5% at the same disk size.

Question 3

How do I install mlx-optiq on a Mac?

Accepted Answer

pip install mlx-optiq. Requires Python 3.11+ and Apple Silicon (M1 or later). For the convert workflow add the convert extra: pip install 'mlx-optiq[convert]'. For the local web UI: pip install 'mlx-optiq[lab]'. Full install guide at https://mlx-optiq.com/docs/install.

Question 4

How do I quantize Qwen3.5 on a Mac with mlx-optiq?

Accepted Answer

Run optiq convert Qwen/Qwen3.5-9B --target-bpw 5.0 --candidate-bits 4,8. The CLI downloads the bf16 weights, runs per-layer sensitivity, allocates bits to hit the 5.0 BPW target, converts via mlx_lm.convert, and writes the artifact to ./optiq_output. Add --reference uniform_4bit if you're memory-constrained; that path streams bf16 layers from disk instead of holding them all in RAM. Pre-built Qwen3.5 quants from 0.8B to 35B-A3B are on Hugging Face under mlx-community/Qwen3.5-...-OptiQ-4bit.

Question 5

What is mixed-precision KV cache and why does it matter?

Accepted Answer

The KV cache is the activations stored across the attention layers during long-context generation. Uniform 4-bit KV is catastrophic because layer 0's KV is roughly 56x more sensitive than the average layer; quantizing it the same as every other layer wrecks long-context accuracy. mlx-optiq runs a separate per-layer sensitivity pass on the KV cache and assigns higher bit-widths to the few layers that need it. The result: 4-bit average KV that matches fp16 quality on hash-hop and reaches 34% lower peak memory than fp16 at 32k context.

Question 6

Does mlx-optiq work with Claude Code, Codex, or other coding agents?

Accepted Answer

Yes. The optiq serve command starts a local server that speaks both the OpenAI Chat Completions API and Anthropic's Messages API on the same port. Claude Code and OpenClaw point at /v1/messages, Codex and OpenCode point at /v1/chat/completions, all using a fake sk-optiq-local API key. Copy-paste configs for each integration are at https://mlx-optiq.com/docs/integrations/.

Question 7

Can I fine-tune a quantized model with LoRA?

Accepted Answer

Yes. optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit --data ./jsonl_dir --rank 8 --rank-scaling by_bits. The trainer is MLX-native (no PyTorch detour) and supports sensitivity-aware rank scaling: layers that mlx-optiq kept at 8 bits get proportionally higher LoRA rank than layers it quantized to 4 bits, so the adapter capacity matches the bit budget. Output is a standard mlx-lm adapter directory you can serve via optiq serve --adapter.

Question 8

What is MTP speculative decoding?

Accepted Answer

Multi-Token Prediction is a speculative-decoding scheme that uses the model's own auxiliary head as the draft model. Greedy generation on Apple Silicon: 1.20x on Qwen3.5-4B, 1.32x on 9B, 1.40x on 27B. Enabled with optiq serve --mtp on a model whose checkpoint includes mtp.safetensors. Detailed methodology at https://mlx-optiq.com/blog/mtp-on-apple-silicon.

Question 9

Does mlx-optiq run on Linux or NVIDIA GPUs?

Accepted Answer

No. mlx-optiq targets Apple Silicon (M1 and later) and uses Apple's MLX runtime. For NVIDIA GPUs you would want bitsandbytes, GPTQ, or AWQ paired with vLLM or Transformers.

Question 10

Where can I find pre-built mlx-optiq quants?

Accepted Answer

On Hugging Face under https://huggingface.co/mlx-community, with the OptiQ-4bit suffix. OptiQ-4bit quants ship across these families: NVIDIA Nemotron 3 Nano (4B + 30B-A3B), MiniCPM5-1B, Qwen3.5 (0.8B, 2B, 4B, 9B, 27B, 35B-A3B), Qwen3.6 (27B, 35B-A3B), Gemma-4 (e2b, e4b, 12B, 26B-A4B, 31B). The full catalog with size and accuracy numbers is at https://mlx-optiq.com/models.

Question 11

Which inference tools can load OptiQ quants?

Accepted Answer

mlx-optiq and mlx-lm are the two supported paths. In mlx-optiq everything runs: text, image input on the VLM families, and MTP speculative decoding. In stock mlx-lm the text path loads and generates directly with mlx_lm.load("mlx-community/<model>-OptiQ-4bit"), since mlx-lm is the library OptiQ quantizes with. Other Mac front-ends such as mlx-vlm, LM Studio, and oMLX load MLX weights through their own stack and wrap mlx-vlm for image models, so support there depends on that stack; text usually works, while image input and newer architectures can lag until mlx-vlm's per-layer mixed-precision handling and Gemma-4 shared-KV support catch up upstream. The language weights load and generate correctly in mlx-lm and mlx-optiq everywhere, so a failure in another tool is a loader gap, not a broken quant.

Question 12

Can I run code execution and web search from a local chat?

Accepted Answer

Yes. OptiQ Lab ships a chat surface with three tools the model can call: web_search (DuckDuckGo, no API key), python (AST-checked sandbox with PNG-inline matplotlib output), and terminal (bash one-liner with token-aware command blocking). The tool-call orchestrator includes a healer for six malformed shapes that quantized models commonly emit. Full description at https://mlx-optiq.com/blog/lab-chat-tools.

Frequently asked questions

What is mlx-optiq?

How is it different from uniform 4-bit quantization?

How do I install mlx-optiq?

How do I quantize Qwen3.5 on a Mac?

What is mixed-precision KV cache?