Frequently asked questions
Common questions about running quantized LLMs on Apple Silicon with mlx-optiq.
What is mlx-optiq?
mlx-optiq is an MLX-native optimizing compiler that quantizes LLMs for Apple Silicon. Instead of giving every layer the same bit-width, it measures each layer's sensitivity on calibration data and allocates higher precision to layers that need it, lower precision to layers that don't. The result is a mixed-precision quant that runs in stock mlx-lm with no custom runtime, typically at higher accuracy than uniform 4-bit at the same average bits-per-weight.
How is it different from uniform 4-bit quantization?
Uniform 4-bit treats every layer the same. mlx-optiq measures per-layer KL sensitivity on a calibration mix, then runs a greedy knapsack to assign bit-widths (4 / 8 bits by default) such that the total budget hits a target average BPW. Sensitive layers stay at 8-bit, robust layers go to 4-bit. On the small end the recovery is dramatic: gemma-4-e4b drops to 23.5% GSM8K at uniform 4-bit; mlx-optiq lifts it to 55.5% at the same disk size. Full methodology in the research post.
How do I install mlx-optiq?
pip install mlx-optiq. Requires Python 3.11+ and Apple Silicon (M1 or later). For the convert workflow add the convert extra:
$ pip install "mlx-optiq[convert]" # quantize HF models $ pip install "mlx-optiq[lab]" # local web UI
How do I quantize Qwen3.5 on a Mac?
$ optiq convert Qwen/Qwen3.5-9B --target-bpw 5.0 --candidate-bits 4,8
The CLI downloads bf16 weights, runs per-layer sensitivity, allocates bits to hit the 5.0 BPW target, converts via mlx_lm.convert, and writes the artifact to ./optiq_output. Add --reference uniform_4bit if you're memory-constrained; that path streams bf16 layers from disk instead of holding them in RAM. Pre-built Qwen3.5 quants from 0.8B to 35B-A3B are on mlx-community.
What is mixed-precision KV cache?
The KV cache is the activations stored across the attention layers during long-context generation. Uniform 4-bit KV is catastrophic because layer 0's KV is roughly 56x more sensitive than the average layer; quantizing it the same as every other layer wrecks long-context accuracy. mlx-optiq runs a separate per-layer sensitivity pass on the KV cache and assigns higher bit-widths to the few layers that need it. The result: 4-bit average KV that matches fp16 quality on hash-hop and reaches 34% lower peak memory than fp16 at 32k context. Full methodology and benchmarks.
Does mlx-optiq work with Claude Code, Codex, or other coding agents?
Yes. optiq serve starts a local server that speaks both the OpenAI Chat Completions API and Anthropic's Messages API on the same port. Claude Code and OpenClaw point at /v1/messages, Codex and OpenCode at /v1/chat/completions, all using a fake sk-optiq-local API key. Copy-paste configs at Integrations.
Can I fine-tune a quantized model with LoRA?
Yes. The trainer is MLX-native (no PyTorch detour) and supports sensitivity-aware rank scaling: layers mlx-optiq kept at 8 bits get proportionally higher LoRA rank than layers it quantized to 4 bits, so adapter capacity matches the bit budget.
$ optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
--data ./jsonl_dir --rank 8 --rank-scaling by_bits
Output is a standard mlx-lm adapter directory you can serve via optiq serve --adapter. Fine-tuning docs.
What is MTP speculative decoding?
Multi-Token Prediction is a speculative-decoding scheme that uses the model's own auxiliary head as the draft model. Greedy generation on Apple Silicon: 1.20x on Qwen3.5-4B, 1.32x on 9B, 1.40x on 27B. Enabled with optiq serve --mtp on a model whose checkpoint includes mtp.safetensors. Detailed methodology.
Does mlx-optiq run on Linux or NVIDIA GPUs?
No. mlx-optiq targets Apple Silicon (M1 and later) and uses Apple's MLX runtime. For NVIDIA GPUs you would want bitsandbytes, GPTQ, or AWQ paired with vLLM or Transformers.
Where can I find pre-built quants?
On Hugging Face under mlx-community, with the OptiQ-4bit suffix. Thirteen quants ship as of v0.1.2: MiniCPM5-1B, Qwen3.5 (0.8B, 2B, 4B, 9B, 27B, 35B-A3B), Qwen3.6 (27B, 35B-A3B), Gemma-4 (e2b, e4b, 26B-A4B, 31B). The full catalog with size and accuracy numbers is at /models.
Can I run code execution and web search from a local chat?
Yes. OptIQ Lab ships a chat surface with three tools the model can call: web_search (DuckDuckGo, no API key), python (AST-checked sandbox with PNG-inline matplotlib output), and terminal (bash one-liner with token-aware command blocking). The tool-call orchestrator includes a healer for six malformed shapes that quantized models commonly emit. Full description.