mlx-optiq
Optimization toolkit · MLX

Quantize, fine-tune and serve LLMs entirely on Apple Silicon.

mlx-optiq is the open-source toolkit for running large language models natively on Mac — per-layer sensitivity analysis for mixed-precision weights, LoRA fine-tuning that respects the bit budget, and a server that speaks both OpenAI and Anthropic APIs (point Claude Code at your local quant). No GPU cluster, no API key.

Per-layer bit allocation (sample LLM): 8-bit for sensitive layers · 4-bit for robust layers

$ pip install mlx-optiq

10 pre-built quants · 3.4× avg compression · +62% decode at 64k · +32pp small-model recovery
Apple Silicon native · tested on M3 Max 36 GB · works on M2 / M3 / M4 · MIT licensed · zero vendor lock-in

02 Quickstart

From zero to a serving LLM in three commands.

Each step is reversible and works with stock MLX tools — mlx-optiq is additive. Skip any of these and you still have a working pipeline.

i

Install

Pure Python. Pulls in mlx, mlx-lm and huggingface-hub. Tested on Python 3.11+ on M2 / M3 / M4.

terminal · bash
$ pip install mlx-optiq
ii

Use a pre-built quant

Pre-built mlx-optiq quants load with stock mlx-lm. Per-layer bit assignment is recorded in the model metadata — no special loader required.

generate.py · python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok, prompt="Explain mixed-precision quantization.", max_tokens=200)
print(out)
iii

Serve with mixed-precision KV

The KV cache is its own sensitivity problem. optiq kv-cache measures it once per model; optiq serve then runs with the resulting per-layer config behind an OpenAI-compatible API.

terminal · bash
# 1-2 min — once per model
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 4.5 -o ./kv

# OpenAI + Anthropic compatible server on :8080
# /v1/chat/completions  (OpenAI)
# /v1/messages          (Anthropic — works with Claude Code, anthropic SDK, etc.)
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/kv_config.json \
    --port 8080
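Once the server is up, any OpenAI-compatible client can point at it. A minimal sketch with the official openai Python SDK; the api_key value is a placeholder, since the local server is described as keyless:

client.py · python
# Talk to the local optiq server with the stock OpenAI SDK.
# base_url points at the server started above; the key is a dummy value.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    messages=[{"role": "user", "content": "Explain mixed-precision quantization."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)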
Where to next? Each model family has a getting-started guide with model-specific sampling defaults and recommended use cases. Building an agent? Drop llms.txt into your IDE: it's the entire library reference in one Markdown file.

03 What it does

One sensitivity signal. Six places it pays off.

mlx-optiq's per-layer KL-divergence pass is a single measurement reused across the entire optimization stack. The same numbers that decide weight bit-width also decide KV bit-width and LoRA rank.

i

Mixed-precision weights

Per-layer KL on calibration data + greedy knapsack. Sensitive layers stay at higher precision; the rest get aggressively quantized at the same average BPW.

+15.5pp GSM8K · Qwen3.5-0.8B
ii

Mixed-precision KV cache

Independent sensitivity pass on the KV cache. Layer 0 is often 56× more sensitive than the average — uniform 4-bit KV is catastrophic.

+62% decode at 64k · Qwen3.5-4B
iii

Sensitivity-aware LoRA

optiq lora train assigns higher adapter rank to layers mlx-optiq kept at 8-bit. Same parameter budget, smarter capacity allocation.

−12% val loss · GSM8K subset
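The allocation idea is easy to picture. A hypothetical sketch of it (function, layer names and rank values are illustrative, not the optiq lora train implementation):

rank_sketch.py · python
# Illustrative only: give layers kept at 8-bit (the sensitive ones) twice the
# LoRA rank of robust layers, choosing the two levels so the average rank,
# and hence the adapter parameter budget, stays roughly where it started.
def allocate_ranks(bits_per_layer: dict[str, int], avg_rank: int = 8) -> dict[str, int]:
    n = len(bits_per_layer)
    n_sensitive = sum(1 for b in bits_per_layer.values() if b >= 8)
    low = max(1, round(avg_rank * n / (n + n_sensitive)))
    return {name: (2 * low if b >= 8 else low) for name, b in bits_per_layer.items()}

print(allocate_ranks({"layers.0.q_proj": 8, "layers.1.q_proj": 4, "layers.2.q_proj": 4}))
# {'layers.0.q_proj': 12, 'layers.1.q_proj': 6, 'layers.2.q_proj': 6}  -> mean rank stays 8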
iv

Hot-swap adapters

Mounted-LoRA primitive: keep N adapters resident on one base model, switch per-request via ContextVar. No reload, no GPU re-upload.

~50 MB per extra adapter
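The per-request switch is the standard library contextvars pattern. A generic sketch of the mechanism, with illustrative names rather than mlx-optiq's actual classes:

adapters.py · python
# Generic contextvars-based adapter selection: all adapters stay resident on
# one base model, and each request reads its own choice without any reload.
from contextvars import ContextVar

active_adapter: ContextVar[str] = ContextVar("active_adapter", default="base")

class MountedAdapters:
    def __init__(self, adapters: dict[str, object]):
        # name -> loaded LoRA weights, all kept in memory alongside the base model
        self.adapters = adapters

    def current(self):
        # Resolved per task/request, so concurrent requests can use different adapters.
        return self.adapters.get(active_adapter.get())

# Inside a request handler: switch for this context only, then restore.
# token = active_adapter.set("math-lora")
# ... generate with mounted.current() ...
# active_adapter.reset(token)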
v

TurboQuant research

Rotation-based vector quantization that preserves attention inner products. Quality-critical research path with library bindings.

100% needle vs 73% affine
vi

OpenAI and Anthropic API

One server, both protocols. /v1/chat/completions for OpenAI clients; /v1/messages for Anthropic clients — point Claude Code at your local quant.

curl, OpenAI SDK, Anthropic SDK, Claude Code

04 How it works

Sensitivity, in 30 seconds.

Uniform 4-bit quantization treats every layer the same — but layers are not the same. mlx-optiq measures, then allocates.

1. Measure

For each layer, simulate-quantize just that layer at each candidate bit-width. Forward-pass calibration data. Measure KL divergence between the perturbed logits and the reference logits. Repeat for every layer; you now have a (layer, bits) → quality cost table.
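As a sketch, the measurement loop looks like the following; simulate_quantize, snapshot_weights, restore_weights, forward_logits and kl_divergence are illustrative placeholders, not mlx-optiq's internals:

sensitivity_sketch.py · python
# Sketch of building the (layer, bits) -> KL cost table.
def build_cost_table(model, layers, candidate_bits, calib_batches):
    # Reference logits from the unmodified model.
    ref_logits = [forward_logits(model, batch) for batch in calib_batches]

    cost = {}  # (layer_name, bits) -> mean KL vs. reference
    for name in layers:
        original = snapshot_weights(model, name)
        for bits in candidate_bits:
            simulate_quantize(model, name, bits)        # perturb only this layer
            kls = [
                kl_divergence(ref, forward_logits(model, batch))
                for ref, batch in zip(ref_logits, calib_batches)
            ]
            cost[(name, bits)] = sum(kls) / len(kls)
            restore_weights(model, name, original)      # undo before the next trial
    return cost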

2. Allocate

Greedy knapsack on the table: start every layer at the lowest bit-width, then greedily upgrade the layer that buys the most KL-reduction per extra bit until the average bit-budget is exhausted. Layers like lm_head and the first/last attention blocks are protected at 8-bit by default.
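The same table then drives a few lines of greedy selection. A sketch under the same assumptions (cost is the table from step 1; param_counts is an illustrative per-layer weight count):

allocate_sketch.py · python
# Greedy knapsack over the (layer, bits) -> KL table: start everything at the
# lowest bit-width, then repeatedly upgrade whichever layer buys the most
# KL reduction per extra bit until the average-BPW budget is spent.
def allocate_bits(cost, layers, candidate_bits, param_counts, target_bpw, protected=()):
    # Protected layers start (and stay) at the highest candidate bit-width.
    bits = {l: (max(candidate_bits) if l in protected else min(candidate_bits)) for l in layers}
    total_params = sum(param_counts[l] for l in layers)

    def avg_bpw():
        return sum(bits[l] * param_counts[l] for l in layers) / total_params

    while avg_bpw() < target_bpw:
        best, best_next, best_gain = None, None, 0.0
        for l in layers:
            upgrades = [b for b in candidate_bits if b > bits[l]]
            if not upgrades:
                continue  # already at the top (includes protected layers)
            nxt = min(upgrades)
            extra_bits = (nxt - bits[l]) * param_counts[l]
            gain = (cost[(l, bits[l])] - cost[(l, nxt)]) / extra_bits  # KL saved per bit spent
            if gain > best_gain:
                best, best_next, best_gain = l, nxt, gain
        if best is None:
            break  # no remaining upgrade reduces KL
        bits[best] = best_next
    return bits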

3. Convert

Hand the per-layer bit map to mlx_lm.convert as a quant predicate. The output is a standard MLX checkpoint that loads anywhere stock mlx-lm loads — with sensitivity metadata stashed on the side for downstream LoRA training.

convert.sh · bash
# Auto-routes between bf16 and uniform-4-bit reference
# based on available RAM.
$ optiq convert Qwen/Qwen3.5-9B \
    --target-bpw 4.5 \
    --candidate-bits 4,8 \
    --reference auto \
    -o optiq_output/Qwen3.5-9B
runs in 1–2 min on a 9B model · longer for 27B+
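The same bit map can also be handed to mlx_lm's Python API: recent mlx-lm releases accept a per-layer quant_predicate in convert. A sketch assuming a hypothetical bit_map.json emitted by the sensitivity pass (check your mlx-lm version for the exact predicate signature):

convert_api.py · python
# Sketch: drive mlx_lm.convert with a per-layer bit map.
# bit_map.json is a hypothetical artifact of the sensitivity pass; the
# predicate signature below matches recent mlx-lm versions but may differ in yours.
import json
from mlx_lm import convert

with open("optiq_output/Qwen3.5-9B/bit_map.json") as f:
    bit_map = json.load(f)  # e.g. {"model.layers.0.self_attn.q_proj": 8, ...}

def quant_predicate(path, module, config):
    # Per-layer override: sensitive layers get 8-bit, everything else 4-bit.
    return {"bits": bit_map.get(path, 4), "group_size": 64}

convert(
    "Qwen/Qwen3.5-9B",
    mlx_path="optiq_output/Qwen3.5-9B",
    quantize=True,
    quant_predicate=quant_predicate,
)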
Why this scales: a single calibration-driven sensitivity path. --reference auto picks bf16 if it fits, otherwise falls back to a uniform-4-bit baseline with bf16-streaming probes, so 27B+ models still get a calibration-driven signal on a 36 GB Mac. The full methodology lives in our research write-up →
Get started

Make your Mac an LLM workstation.

Pick a model, get a snippet, ship it. The docs cover every supported family, fine-tuning recipes, and the OpenAI-compatible serving stack.