mlx-optiq
Optimizing compiler · MLX

Quantize, fine-tune
and serve LLMs
entirely on Apple Silicon.

Run large language models natively on a Mac. Per-layer sensitivity analysis for mixed-precision weights. LoRA fine-tuning that respects the bit budget. A server that speaks both OpenAI and Anthropic APIs (point Claude Code at your local quant). On Gemma-4, send it an image, not just text. No GPU cluster, no API key.

Per-layer bit allocation · sample LLM
Per-layer bit allocation across a 32-layer transformer: tall emerald bars are 8-bit protected layers, short warm-grey bars are 4-bit.
8-bit · sensitive layers 4-bit · robust layers
$ pip install mlx-optiq
3.1×
avg compression vs bf16
+1.4×
decode via MTP / drafter
+13.6
best Capability Score gain

02 Quickstart

From zero to a serving LLM in three commands.

Each step is reversible and works with stock MLX tools. mlx-optiq is additive. Skip any of these and you still have a working pipeline.

i

Install

Pure Python. Pulls in mlx, mlx-lm and huggingface-hub. Python 3.11+ on Apple Silicon.

terminalbash
$ pip install mlx-optiq
ii

Use a pre-built quant

Pre-built mlx-optiq quants load with stock mlx-lm. Per-layer bit assignment is recorded in the model metadata. No special loader required.

generate.pypython
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok, prompt="Explain mixed-precision quantization.", max_tokens=200)
print(out)
iii

Serve with mixed-precision KV

The KV cache is its own sensitivity problem. optiq kv-cache measures it once per model; optiq serve serves with the resulting per-layer config behind an OpenAI-compatible API.

terminalbash
# 1-2 min, once per model
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 5.0 -o ./kv

# OpenAI + Anthropic compatible server on :8080
# /v1/chat/completions  (OpenAI)
# /v1/messages          (Anthropic; works with Claude Code, anthropic SDK, etc.)
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/kv_config.json \
    --port 8080
Where to next Each model family has a getting-started guide with model-specific sampling defaults and recommended use cases. Building an agent? Drop llms.txt into your IDE. It's the entire library reference in one Markdown file.

03 What it does

One sensitivity signal. A whole toolkit around it.

A single per-layer KL-divergence pass drives weight, KV-cache and LoRA-rank allocation. The rest of the toolkit (hot-swap adapters, multi-protocol serving with five tested client integrations, image input on the vision models, and the OptIQ Lab GUI for quantize, fine-tune, dataset, and chat workflows) sits around that core.

i

Mixed-precision weights

Per-layer KL on calibration data picks the bits. Sensitive layers stay high-precision, the rest go low, at the same average size as uniform-4.

Higher accuracy at the same disk size as uniform-4
ii

Mixed-precision KV cache

A separate sensitivity pass on the KV cache. Layer 0 is often 56× more sensitive than average, so uniform 4-bit KV is catastrophic; mixed-precision is not.

Faster long-context decode without breaking quality
iii

LoRA, two ways

Fine-tune with adapter rank scaled by each layer's bits, then keep N adapters mounted on one base and switch them per request, no reload.

Sensitivity-aware rank · hot-swap mounted adapters
iv

Text and images, one stack

Run text models, and send pictures to the ones that take vision. The vendored vision tower rides in a bf16 sidecar, so one repo loads text-only under mlx-lm or full image+text under OptIQ. Vision docs.

VLM + LLM · one artifact, no mlx-vlm dependency
v

OpenAI and Anthropic APIs

optiq serve speaks both the OpenAI and Anthropic protocols from one process. Point Claude Code, Codex, OpenCode, OpenClaw, or Hermes Agent at a local quant.

curl your local LLM, or drive your coding agent with it
vi

OptIQ Lab, a local GUI

A web UI for the whole workflow: quantize wizard, SFT/DPO fine-tuning with a dataset designer, and chat with sandboxed tools (web search, Python, terminal) and image upload.

Quantize, design data, fine-tune, and chat in the browser

04 How it works

Sensitivity, in three steps.

Uniform 4-bit quantization treats every layer the same, but layers are not the same. mlx-optiq measures, then allocates.

1. Measure

For each layer, simulate-quantize just that layer at each candidate bit-width. Forward-pass calibration data. Measure KL divergence between the perturbed logits and the reference logits. Repeat for every layer; you now have a (layer, bits) → quality cost table.

2. Allocate

Greedy knapsack on the table: start every layer at the lowest bit-width, then greedily upgrade the layer that buys the most KL-reduction per extra bit until the average bit-budget is exhausted. Layers like lm_head and the first/last attention blocks are protected at 8-bit by default.

3. Convert

Hand the per-layer bit map to mlx_lm.convert as a quant predicate. The output is a standard MLX checkpoint that loads anywhere stock mlx-lm loads, with sensitivity metadata stashed on the side for downstream LoRA training.

convert.shbash
# Auto-routes between bf16 and uniform-4-bit reference
# based on available RAM.
$ optiq convert Qwen/Qwen3.5-9B \
    --target-bpw 5.0 \
    --candidate-bits 4,8 \
    --reference auto \
    -o optiq_output/Qwen3.5-9B
runs in 1–2 min on a 9 B model · longer for 27 B+
Why this scales A single calibration-driven sensitivity path. --reference auto picks bf16 if it fits, otherwise falls back to a uniform-4-bit baseline with bf16-streaming probes, so 27 B+ models still get a calibration-driven signal on a 36 GB Mac. The full methodology lives in our research write-up →

05 How it compares

Where mlx-optiq sits among the Mac LLM options.

A snapshot of how the popular paths stack up on the things that actually move quality and speed on Apple Silicon. None of these are wrong; they're optimizing different axes.

mlx-optiq mlx-lm llama.cpp
Per-layer mixed-precision weights Yes, calibration-driven Uniform 4-bit Block-wise K-quant
Per-layer mixed-precision KV cache Yes Uniform 4 / 8 / fp16 Group-wise int8 only
Sensitivity-aware LoRA fine-tuning Rank scaled by per-layer bits Constant rank LoRA Inference only
OpenAI and Anthropic compatible server One process, both OpenAI only llama-server (OpenAI shim)
Text and image input Yes Text only Image via separate build
Sandboxed tool support for chat Three tools: web search, Python, terminal
Reading the table mlx-optiq is the only path on this list that uses calibration-driven, per-layer bit allocation for both weights and KV cache, in the native MLX runtime, with serving, fine-tuning, and image input on the vision models in the same package. The others are great at what they target. They just target different things.
Get started

Make your Mac an LLM workstation.

Pick a model, get a snippet, ship it. The docs cover every supported family, fine-tuning recipes, and the OpenAI-compatible serving stack.