Quantize, fine-tune
and serve LLMs
entirely on Apple Silicon.
mlx-optiq is the open-source toolkit for running large language models natively on Mac — per-layer sensitivity analysis for mixed-precision weights, LoRA fine-tuning that respects the bit budget, and a server that speaks both OpenAI and Anthropic APIs (point Claude Code at your local quant). No GPU cluster, no API key.
$ pip install mlx-optiq
Drop-in 4-bit quants. Same weights, smarter bits.
Ten production mlx-optiq-quantized LLMs on Hugging Face — Qwen3.5, Qwen3.6 and Gemma-4 families, from 0.8 B dense to 35 B-A3B mixture-of-experts. They load directly into stock mlx-lm. No special runtime.
Qwen3.5-9B-OptiQ-4bit
The default daily-driver: 9 B parameters in 5.6 GB, GSM8K parity with the original at 4-bit. Long context to 64 k via mixed-precision KV.
Qwen3.6-27B-OptiQ-4bit
Frontier-class reasoning at 15.7 GB. Fits comfortably on a 36 GB Mac with room for fine-tuning.
gemma-4-26B-A4B-it-OptiQ-4bit
Sparse mixture-of-experts: 26 B total parameters, 4 B active per token. mlx-optiq's per-layer pass quantizes router and expert blocks independently.
gemma-4-31B-it-OptiQ-4bit
The largest dense quant we ship: 31 B in 18.1 GB. 96.0% GSM8K — within noise of the original. Use it when you need the strongest single model on a 36 GB Mac.
From zero to a serving LLM in three commands.
Each step is reversible and works with stock MLX tools — mlx-optiq is additive. Skip any of these and you still have a working pipeline.
Install
Pure Python. Pulls in mlx, mlx-lm and huggingface-hub. Tested on Python 3.11+ on M2 / M3 / M4.
$ pip install mlx-optiq
Use a pre-built quant
Pre-built mlx-optiq quants load with stock mlx-lm. Per-layer bit assignment is recorded in the model metadata — no special loader required.
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok, prompt="Explain mixed-precision quantization.", max_tokens=200)
print(out)
Serve with mixed-precision KV
The KV cache is its own sensitivity problem. optiq kv-cache measures it once per model; optiq serve then applies the resulting per-layer config behind an OpenAI- and Anthropic-compatible API.
# 1-2 min — once per model
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 4.5 -o ./kv

# OpenAI + Anthropic compatible server on :8080
#   /v1/chat/completions (OpenAI)
#   /v1/messages (Anthropic — works with Claude Code, anthropic SDK, etc.)
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/kv_config.json \
    --port 8080
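Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch with the official openai SDK, assuming the local server accepts a placeholder API key and that the served model is addressed by the name it was loaded under:

from openai import OpenAI

# Placeholder key; assumed here that the local server does not validate it.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="optiq")

resp = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    messages=[{"role": "user", "content": "Explain mixed-precision KV caches."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)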
One sensitivity signal. Six places it pays off.
mlx-optiq's per-layer KL-divergence pass is a single measurement reused across the entire optimization stack. The same numbers that decide weight bit-width also decide KV bit-width and LoRA rank.
Mixed-precision weights
Per-layer KL on calibration data + greedy knapsack. Sensitive layers stay at higher precision; the rest get aggressively quantized at the same average BPW.
Mixed-precision KV cache
Independent sensitivity pass on the KV cache. Layer 0 is often 56× more sensitive than the average — uniform 4-bit KV is catastrophic.
Sensitivity-aware LoRA
optiq lora train assigns higher adapter rank to layers mlx-optiq kept at 8-bit. Same parameter budget, smarter capacity allocation.
Hot-swap adapters
Mounted-LoRA primitive: keep N adapters resident on one base model and switch per request via ContextVar. No reload, no GPU re-upload; see the sketch below.
TurboQuant research
Rotation-based vector quantization that preserves attention inner products. Quality-critical research path with library bindings.
OpenAI and Anthropic API
One server, both protocols. /v1/chat/completions for OpenAI clients; /v1/messages for Anthropic clients — point Claude Code at your local quant.
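The hot-swap card above leans on a standard Python pattern. A minimal sketch of per-request adapter selection via ContextVar; the class and adapter layout here are illustrative, not mlx-optiq's actual API:

from contextvars import ContextVar

# Which adapter the current request should use; each request context sees its own value.
_active_adapter: ContextVar[str] = ContextVar("active_adapter", default="base")

class MountedLoRALinear:
    """Illustrative only: one shared base projection plus N resident low-rank adapters."""
    def __init__(self, base_linear, adapters):
        self.base = base_linear      # shared (quantized) base weights
        self.adapters = adapters     # name -> (lora_A, lora_B), all kept in memory

    def __call__(self, x):
        y = self.base(x)
        name = _active_adapter.get()           # set by the request handler
        if name in self.adapters:
            a, b = self.adapters[name]
            y = y + (x @ a) @ b                # apply only the active adapter's update
        return y

# In a request handler: _active_adapter.set("my-adapter")
# Concurrent requests keep their own value, so nothing is reloaded or re-uploaded.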
Sensitivity, in 30 seconds.
Uniform 4-bit quantization treats every layer the same — but layers are not the same. mlx-optiq measures, then allocates.
1. Measure
For each layer, simulate-quantize just that layer at each candidate bit-width, forward-pass the calibration data, and measure the KL divergence between the perturbed logits and the reference logits. The result is a (layer, bits) → quality-cost table.
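In code, the measurement loop is short. A sketch using MLX's quantize/dequantize round trip; model, layers, and set_layer_weight are hypothetical stand-ins for however your wrapper runs the calibration batch and swaps one layer's weights:

import mlx.core as mx

def simulate_quantize(weight, bits, group_size=64):
    # Quantize-then-dequantize round trip: same shape back, but carrying
    # the precision loss of storing this layer at `bits` bits.
    wq, scales, biases = mx.quantize(weight, group_size=group_size, bits=bits)
    return mx.dequantize(wq, scales, biases, group_size=group_size, bits=bits)

def mean_kl(ref_logits, perturbed_logits):
    # KL(reference || perturbed), averaged over all calibration tokens.
    log_p = ref_logits - mx.logsumexp(ref_logits, axis=-1, keepdims=True)
    log_q = perturbed_logits - mx.logsumexp(perturbed_logits, axis=-1, keepdims=True)
    return mx.mean(mx.sum(mx.exp(log_p) * (log_p - log_q), axis=-1)).item()

def sensitivity_table(model, layers, calib_batch, candidate_bits=(4, 8)):
    # `layers` maps layer name -> weight array; set_layer_weight is a
    # hypothetical helper that swaps one layer's weight in place.
    ref_logits = model(calib_batch)                    # full-precision reference
    table = {}
    for name, weight in layers.items():
        for bits in candidate_bits:
            set_layer_weight(model, name, simulate_quantize(weight, bits))
            table[(name, bits)] = mean_kl(ref_logits, model(calib_batch))
            set_layer_weight(model, name, weight)      # restore before the next probe
    return table                                       # (layer, bits) -> quality cost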
2. Allocate
Greedy knapsack on the table: start every layer at the lowest bit-width,
then greedily upgrade the layer that buys the most KL-reduction per extra bit
until the average bit-budget is exhausted. Layers like lm_head
and the first/last attention blocks are protected at 8-bit by default.
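A minimal sketch of the allocation step for a two-candidate (4-/8-bit) setup; the names are illustrative:

def allocate_bits(kl_table, layer_sizes, target_avg_bits, low=4, high=8):
    # kl_table: (layer, bits) -> KL cost from the measurement step.
    # layer_sizes: layer -> parameter count, used to weight the bit budget.
    assignment = {name: low for name in layer_sizes}        # start everything cheap
    total_params = sum(layer_sizes.values())
    budget = target_avg_bits * total_params                 # total bits we may spend
    spent = low * total_params

    def gain_per_bit(name):
        # KL reduction bought by upgrading this layer, per extra bit spent.
        kl_drop = kl_table[(name, low)] - kl_table[(name, high)]
        return kl_drop / ((high - low) * layer_sizes[name])

    for name in sorted(layer_sizes, key=gain_per_bit, reverse=True):
        extra = (high - low) * layer_sizes[name]
        if spent + extra <= budget:                         # upgrade while budget lasts
            assignment[name] = high
            spent += extra
    return assignment                                       # layer -> assigned bits

Protected layers (lm_head, first/last attention blocks) would simply be pinned to the high bit-width before the loop and excluded from the budget accounting.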
3. Convert
Hand the per-layer bit map to mlx_lm.convert as a quant
predicate. The output is a standard MLX checkpoint that loads anywhere
stock mlx-lm loads — with sensitivity metadata stashed on
the side for downstream LoRA training.
# Auto-routes between bf16 and uniform-4-bit reference
# based on available RAM.
$ optiq convert Qwen/Qwen3.5-9B \
    --target-bpw 4.5 \
    --candidate-bits 4,8 \
    --reference auto \
    -o optiq_output/Qwen3.5-9B
--reference auto
picks bf16 if it fits, otherwise falls back to a uniform-4-bit baseline
with bf16-streaming probes — so 27 B+ models still get a calibration-driven
signal on a 36 GB Mac. The full methodology lives in
our research write-up →
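For the programmatic route, a sketch of handing a per-layer bit map to mlx_lm.convert through its quant_predicate hook; the exact predicate signature and return format may differ between mlx-lm versions, and the bit map shown is illustrative:

from mlx_lm import convert

# Illustrative bit map from the allocation step: layer path -> bits.
bit_map = {"model.layers.0.self_attn.q_proj": 8, "lm_head": 8}

def quant_predicate(path, module, config):
    # Layers not named in the map fall back to 4-bit.
    return {"group_size": 64, "bits": bit_map.get(path, 4)}

convert(
    "Qwen/Qwen3.5-9B",
    mlx_path="optiq_output/Qwen3.5-9B",
    quantize=True,
    quant_predicate=quant_predicate,
)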
Make your Mac an LLM workstation.
Pick a model, get a snippet, ship it. The docs cover every supported family, fine-tuning recipes, and the OpenAI-compatible serving stack.