Optimizing compiler · MLX

Quantize, fine-tune
and serve LLMs
entirely on Apple Silicon.

mlx-optiq is the open-source toolkit for running large language models natively on Mac — per-layer sensitivity analysis for mixed-precision weights, LoRA fine-tuning that respects the bit budget, and a server that speaks both OpenAI and Anthropic APIs (point Claude Code at your local quant). No GPU cluster, no API key.

Per-layer bit allocation · sample LLM

8-bit · sensitive layers 4-bit · robust layers

$ pip install mlx-optiq

pre-built quants

3.4×

avg compression

+62%

decode at 64k

+32pp

small-model recovery

Apple Silicon native · tested on M3 Max 36 GB · works on M2 / M3 / M4 · MIT licensed · zero vendor lock-in

01 Pre-built models

Drop-in 4-bit quants. Same weights, smarter bits.

Ten production mlx-optiq-quantized LLMs on Hugging Face — Qwen3.5, Qwen3.6 and Gemma-4 families, from 0.8 B dense to 35 B-A3B mixture-of-experts. They load directly into stock mlx-lm. No special runtime.

Dense · text

Qwen3.5-9B-OptiQ-4bit

The default daily-driver: 9 B parameters in 5.6 GB, GSM8K parity with the original at 4-bit. Long context to 64 k via mixed-precision KV.

5.6 GB on disk 90.0% GSM8K 3.2× compression

Dense · text

Qwen3.6-27B-OptiQ-4bit

Frontier-class reasoning at 15.7 GB. Fits comfortably on a 36 GB Mac with room for fine-tuning.

15.7 GB on disk 95.0% GSM8K +1.0pp vs uniform-4

MoE · 4 B active

gemma-4-26B-A4B-it-OptiQ-4bit

Sparse mixture-of-experts: 26 B total parameters, 4 B active per token. mlx-optiq's per-layer pass quantizes router and expert blocks independently.

14.9 GB on disk 94.0% GSM8K +2.0pp vs uniform-4

Dense · text

gemma-4-31B-it-OptiQ-4bit

The largest dense quant we ship: 31 B in 18.1 GB. 96.0% GSM8K — within noise of the original. Use it when you need the strongest single model on a 36 GB Mac.

18.1 GB on disk 96.0% GSM8K 3.3× compression

all 10 models →

02 Quickstart

From zero to a serving LLM in three commands.

Each step is reversible and works with stock MLX tools — mlx-optiq is additive. Skip any of these and you still have a working pipeline.

Install

Pure Python. Pulls in mlx, mlx-lm and huggingface-hub. Tested on Python 3.11+ on M2 / M3 / M4.

terminalbash

$ pip install mlx-optiq

Use a pre-built quant

Pre-built mlx-optiq quants load with stock mlx-lm. Per-layer bit assignment is recorded in the model metadata — no special loader required.

generate.pypython

from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok, prompt="Explain mixed-precision quantization.", max_tokens=200)
print(out)

iii

Serve with mixed-precision KV

The KV cache is its own sensitivity problem. optiq kv-cache measures it once per model; optiq serve serves with the resulting per-layer config behind an OpenAI-compatible API.

terminalbash

# 1-2 min — once per model
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 4.5 -o ./kv

# OpenAI + Anthropic compatible server on :8080
# /v1/chat/completions  (OpenAI)
# /v1/messages          (Anthropic — works with Claude Code, anthropic SDK, etc.)
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/kv_config.json \
    --port 8080

Where to next Each model family has a getting-started guide with model-specific sampling defaults and recommended use cases. Building an agent? Drop llms.txt into your IDE — it's the entire library reference in one Markdown file.

03 What it does

One sensitivity signal. Six places it pays off.

mlx-optiq's per-layer KL-divergence pass is a single measurement reused across the entire optimization stack. The same numbers that decide weight bit-width also decide KV bit-width and LoRA rank.

Mixed-precision weights

Per-layer KL on calibration data + greedy knapsack. Sensitive layers stay at higher precision; the rest get aggressively quantized at the same average BPW.

+15.5pp GSM8K · Qwen3.5-0.8B

Mixed-precision KV cache

Independent sensitivity pass on the KV cache. Layer 0 is often 56× more sensitive than the average — uniform 4-bit KV is catastrophic.

+62% decode at 64k · Qwen3.5-4B

iii

Sensitivity-aware LoRA

optiq lora train assigns higher adapter rank to layers mlx-optiq kept at 8-bit. Same parameter budget, smarter capacity allocation.

−12% val loss · GSM8K subset

Hot-swap adapters

Mounted-LoRA primitive: keep N adapters resident on one base model, switch per-request via ContextVar. No reload, no GPU re-upload.

~50 MB per extra adapter

TurboQuant research

Rotation-based vector quantization that preserves attention inner products. Quality-critical research path with library bindings.

100% needle vs 73% affine

OpenAI and Anthropic API

One server, both protocols. /v1/chat/completions for OpenAI clients; /v1/messages for Anthropic clients — point Claude Code at your local quant.

curl, OpenAI SDK, Anthropic SDK, Claude Code

04 How it works

Sensitivity, in 30 seconds.

Uniform 4-bit quantization treats every layer the same — but layers are not the same. mlx-optiq measures, then allocates.

1. Measure

For each layer, simulate-quantize just that layer at each candidate bit-width. Forward-pass calibration data. Measure KL divergence between the perturbed logits and the reference logits. Repeat for every layer; you now have a (layer, bits) → quality cost table.

2. Allocate

Greedy knapsack on the table: start every layer at the lowest bit-width, then greedily upgrade the layer that buys the most KL-reduction per extra bit until the average bit-budget is exhausted. Layers like lm_head and the first/last attention blocks are protected at 8-bit by default.

3. Convert

Hand the per-layer bit map to mlx_lm.convert as a quant predicate. The output is a standard MLX checkpoint that loads anywhere stock mlx-lm loads — with sensitivity metadata stashed on the side for downstream LoRA training.

convert.shbash

# Auto-routes between bf16 and uniform-4-bit reference
# based on available RAM.
$ optiq convert Qwen/Qwen3.5-9B \
    --target-bpw 4.5 \
    --candidate-bits 4,8 \
    --reference auto \
    -o optiq_output/Qwen3.5-9B

runs in 1–2 min on a 9 B model · longer for 27 B+

Why this scales A single calibration-driven sensitivity path. --reference auto picks bf16 if it fits, otherwise falls back to a uniform-4-bit baseline with bf16-streaming probes — so 27 B+ models still get a calibration-driven signal on a 36 GB Mac. The full methodology lives in our research write-up →

Get started

Make your Mac an LLM workstation.

Pick a model, get a snippet, ship it. The docs cover every supported family, fine-tuning recipes, and the OpenAI-compatible serving stack.

read the docs → browse models

Quantize, fine-tune and serve LLMs entirely on Apple Silicon.

Drop-in 4-bit quants. Same weights, smarter bits.

Qwen3.5-9B-OptiQ-4bit

Qwen3.6-27B-OptiQ-4bit

gemma-4-26B-A4B-it-OptiQ-4bit

gemma-4-31B-it-OptiQ-4bit

From zero to a serving LLM in three commands.

Install

Use a pre-built quant

Serve with mixed-precision KV

One sensitivity signal. Six places it pays off.

Mixed-precision weights

Mixed-precision KV cache

Sensitivity-aware LoRA

Hot-swap adapters

TurboQuant research

OpenAI and Anthropic API

Sensitivity, in 30 seconds.

1. Measure

2. Allocate

3. Convert

Make your Mac an LLM workstation.

Quantize, fine-tune
and serve LLMs
entirely on Apple Silicon.