# mlx-optiq

> mlx-optiq is an open-source Python toolkit for running large language models entirely on Apple Silicon. It provides three things in one PyPI package: (1) sensitivity-driven mixed-precision quantization that beats uniform 4-bit at the same size; (2) sensitivity-aware LoRA fine-tuning with PEFT-compatible output; and (3) an OpenAI-compatible inference server with a mixed-precision quantized KV cache and hot-swappable LoRA adapters. Tested on M2/M3/M4 Macs. Free, MIT-licensed.

## Install

```
pip install mlx-optiq
```

Requirements: macOS 14+, Apple Silicon (M1/M2/M3/M4), Python 3.11+.

Optional extras: `mlx-optiq[convert]` (psutil for RAM precheck), `mlx-optiq[eval]` (datasets for GSM8K), `mlx-optiq[serve]` (uvicorn/fastapi), `mlx-optiq[all]`.

## Pre-built models on Hugging Face

All twelve quants live under the `mlx-community` organization on HF. They load with stock `mlx_lm.load(...)` — no special runtime.

### Qwen3.5 family (dense + 1 sparse MoE)

- `mlx-community/Qwen3.5-0.8B-OptiQ-4bit` — 0.5 GB · 27.0% GSM8K (+15.5pp vs uniform 4-bit)
- `mlx-community/Qwen3.5-2B-OptiQ-4bit` — 1.4 GB · 48.0% GSM8K
- `mlx-community/Qwen3.5-4B-OptiQ-4bit` — 2.8 GB · 81.5% GSM8K (+2.0pp)
- `mlx-community/Qwen3.5-9B-OptiQ-4bit` — 5.6 GB · 90.0% GSM8K (default daily driver)
- `mlx-community/Qwen3.5-27B-OptiQ-4bit` — 15.7 GB · 87.5% GSM8K
- `mlx-community/Qwen3.5-35B-A3B-OptiQ-4bit` — 20.1 GB · 89.5% GSM8K (sparse MoE, 3 B active)

### Qwen3.6 family

- `mlx-community/Qwen3.6-27B-OptiQ-4bit` — 15.7 GB · 95.0% GSM8K (+1.0pp)
- `mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit` — 20.1 GB · 89.5% GSM8K (256-expert MoE, 3 B active)

### Gemma-4 family (instruct)

- `mlx-community/gemma-4-e2b-it-OptiQ-4bit` — 4.0 GB · 13.0% GSM8K (+7.5pp)
- `mlx-community/gemma-4-e4b-it-OptiQ-4bit` — 6.0 GB · 55.5% GSM8K (+32.0pp — best small-model recovery)
- `mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit` — 14.9 GB · 94.0% GSM8K (sparse MoE, 4 B active)
- `mlx-community/gemma-4-31B-it-OptiQ-4bit` — 18.1 GB · 96.0% GSM8K (strongest dense quant)

## Loading any pre-built quant

```python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-precision quantization."}],
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(model, tok, prompt=prompt, max_tokens=300)
print(out)
```

For Qwen3.5/3.6 reasoning models, pass `enable_thinking=False` to `apply_chat_template` to skip the `<think>…</think>` channel for faster (slightly less accurate) output.

## Streaming generation

```python
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)
for response in stream_generate(model, tok, prompt="...", max_tokens=200, sampler=sampler):
    print(response.text, end="", flush=True)
```

## Quantizing your own model

```bash
# auto-routes between bf16 and uniform_4bit reference based on available RAM
optiq convert Qwen/Qwen3.5-9B \
  --target-bpw 4.5 \
  --candidate-bits 4,8 \
  --reference auto \
  -o ./optiq_output/Qwen3.5-9B
```

Three reference modes:

- `bf16` (gold standard; requires the bf16 model in RAM, roughly 2× the parameter count in GB)
- `uniform_4bit` (for big models; builds a 4-bit baseline, then streams bf16 layer-by-layer from disk)
- `auto` (the default; picks `bf16` if it fits in RAM, else `uniform_4bit`)

The output is a standard MLX checkpoint with per-layer bit assignments stored in metadata. It loads anywhere stock `mlx-lm` loads.
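As a sanity check on what `--target-bpw 4.5` with `--candidate-bits 4,8` implies: if `p` is the fraction of weights held at 8-bit, the average bit width is `4(1-p) + 8p`, so `p = (4.5 - 4)/(8 - 4) = 12.5%`. A back-of-envelope helper (illustrative only; it ignores group-scale overhead and the handful of layers protected at 8-bit):

```python
def high_bit_fraction(target_bpw: float, low: int = 4, high: int = 8) -> float:
    """Fraction of weights that must sit at `high` bits so the weighted
    average bit width lands on `target_bpw` (two candidate widths only)."""
    return (target_bpw - low) / (high - low)

print(high_bit_fraction(4.5))  # 0.125 -> roughly 1 weight in 8 ends up at 8-bit
```

In other words, a 4.5-bpw checkpoint is mostly 4-bit, with the ~12% most sensitive weights kept at 8-bit.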
## Mixed-precision KV-cache serving

One-time sensitivity pass, then serve:

```bash
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --target-bits 4.5 --candidate-bits 4,8 \
  -o ./kv/qwen35_9b

optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --kv-config ./kv/qwen35_9b/kv_config.json \
  --port 8080
```

Delivers a +31% to +62% decode speedup at 64k context on Qwen3.5 4B/9B versus fp16 KV. KV quantization is currently broken on Gemma-4 (a shared-KV attention limitation upstream); use stock fp16 KV for Gemma-4 long-context serving.

## OpenAI- and Anthropic-compatible API

`optiq serve` exposes both endpoints from the same process:

- OpenAI: `/v1/chat/completions` (streaming SSE)
- Anthropic: `/v1/messages` (streaming SSE) — works with Claude Code and the official `anthropic` Python SDK

```python
# OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")
resp = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

# Anthropic client — same server
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8080", api_key="not-used")
resp = client.messages.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.content[0].text)
```

Claude Code via environment variables:

```bash
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_API_KEY="not-used"
claude  # now driven by your local quant
```

## Sensitivity-aware LoRA fine-tuning

```bash
optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --data ./my_training_data \
  --max-seq-length 1400 \
  --rank 8 --rank-scaling by_bits \
  --num-layers 16 --iters 1000 \
  -o ./my_adapter

optiq lora info ./my_adapter
```

`--rank-scaling by_bits` gives the layers mlx-optiq kept at 8-bit during quantization 2× the adapter rank of 4-bit layers, at the same total parameter budget: the layers that needed more weight precision also get more adapter capacity (a sketch of the allocation idea closes this section). Output is PEFT-compatible (`adapter_config.json` + `adapters.safetensors`) plus an mlx-optiq sidecar (`optiq_lora_config.json`) recording the per-layer rank distribution.

Data format is JSONL (one example per line, `{"text": "..."}` or `{"messages": [...]}`) — same as `mlx_lm.lora`.

### Empirical training-ceiling map (M3 Max 36 GB, default config)

| Model | Max seq len | Peak mem |
|---|---|---|
| Qwen3.5-0.8B | 2,800 | 23.4 GB |
| Qwen3.5-2B | 2,400 | 19.3 GB |
| Qwen3.5-4B | 1,600 | 24.8 GB |
| Qwen3.5-9B | 1,400 | 25.4 GB |
| Qwen3.5-27B / Qwen3.6-27B | 512 | 27.7 GB |
| gemma-4-26B-A4B | 512 | 27.6 GB |
| Qwen3.5-35B-A3B / Qwen3.6-35B-A3B | 128 | 25.3 GB |
| gemma-4-31B-it | 32 | 21.4 GB |

Two distinct failure modes when pushing past these ceilings:

- **Memory cliff** (~27-28 GB): macOS falls back to compressed memory and throughput drops 9-30%.
- **MTLResource cliff** (independent of bytes): Apple GPUs cap at 499 K simultaneously bound resources. Qwen3.5-2B at a sequence length of 3,200 hits a hard `kIOGPUCommandBufferCallbackErrorOutOfMemory` even at only 22 GB peak. Don't extrapolate from spare GB headroom to longer sequence lengths.
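To make `by_bits` concrete, here is a minimal sketch of the allocation idea. `scale_ranks_by_bits` is a hypothetical helper, not mlx-optiq's API, and it assumes uniform layer dimensions so a rank budget stands in for the parameter budget:

```python
def scale_ranks_by_bits(layer_bits: list[int], base_rank: int = 8) -> list[int]:
    """Give 8-bit layers 2x the raw rank of 4-bit layers, then rescale so
    the total stays at base_rank * num_layers (the fixed budget).
    Hypothetical sketch; mlx-optiq's actual allocator may differ."""
    raw = [base_rank * (2 if bits == 8 else 1) for bits in layer_bits]
    budget = base_rank * len(layer_bits)
    scale = budget / sum(raw)
    return [max(1, round(r * scale)) for r in raw]

# Example: 16 trainable layers, 4 of them kept at 8-bit during quantization.
bits = [8, 4, 4, 4] * 4
print(scale_ranks_by_bits(bits))  # 8-bit layers land near rank 13, 4-bit near rank 6
```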
## Hot-swap mounted LoRA adapters

```python
from mlx_lm import load, generate
from optiq.adapters.mount import mount_adapter_on_model, AdapterActivation

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
p = "..."  # your prompt

# Mount both adapters once; only the active one affects generation.
mount_adapter_on_model(model, "agent-A", "./adapter_a")
mount_adapter_on_model(model, "agent-B", "./adapter_b")

with AdapterActivation("agent-A"):
    out_a = generate(model, tok, prompt=p, max_tokens=100)
with AdapterActivation("agent-B"):
    out_b = generate(model, tok, prompt=p, max_tokens=100)
```

Or via CLI:

```bash
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --adapter ./adapter_a --adapter ./adapter_b
```

Then select per-request via the OpenAI `model` field. Memory: ~50 MB per extra adapter vs ~5 GB per full base copy.

## CLI reference

- `optiq convert MODEL [--target-bpw 4.5] [--candidate-bits 4,8] [--reference auto|bf16|uniform_4bit] [-o PATH]`
- `optiq kv-cache MODEL [--target-bits 4.5] [--candidate-bits 4,8] [-o PATH]`
- `optiq lora train MODEL --data PATH [--rank 8] [--rank-scaling by_bits|constant|by_kl|by_quantile] [--num-layers 16] [--max-seq-length 1024] [--iters 1000] [-o PATH]`
- `optiq lora info ADAPTER_PATH`
- `optiq serve --model MODEL [--kv-config PATH] [--adapter PATH-OR-REPO] [--host 127.0.0.1] [--port 8080]`
- `optiq eval MODEL_PATH --task gsm8k --baseline UNIFORM_PATH --n-samples 200`
- `optiq latency MODEL_PATH --calibrate`
- `optiq --version`

## How sensitivity works (algorithm)

For each `(layer L, candidate bits b)`:

1. Forward-pass the calibration data with all weights at reference precision; record the output logits.
2. Replace just L's weight with a simulate-quantized copy at b bits (round-trip quantize→dequantize).
3. Forward-pass the same calibration data; record the perturbed logits.
4. Compute the KL divergence between reference and perturbed logits, averaged over samples.
5. Restore L; move to the next layer.

Then a greedy knapsack: start every layer at the lowest bit width, and greedily upgrade the layer with the largest KL-reduction-per-bit until the average bits-per-weight reaches the target. `lm_head`, `embed_tokens`, and the first/last attention blocks are protected at 8-bit by default.

Calibration: WikiText-2 validation, 32 sequences × 128 tokens. Generic web text suffices because we measure relative layer sensitivity, not absolute accuracy.
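Steps 2 and 4 map directly onto stock MLX primitives. Below is a minimal sketch of the per-layer measurement, assuming the reference and perturbed logits are already collected; the function names are illustrative, not mlx-optiq's internals:

```python
import mlx.core as mx

def simulate_quantize(w: mx.array, bits: int, group_size: int = 64) -> mx.array:
    """Step 2: round-trip a weight matrix through quantize -> dequantize."""
    wq, scales, biases = mx.quantize(w, group_size=group_size, bits=bits)
    return mx.dequantize(wq, scales, biases, group_size=group_size, bits=bits)

def mean_kl(ref_logits: mx.array, pert_logits: mx.array) -> float:
    """Step 4: KL(reference || perturbed), averaged over all token positions."""
    ref_lp = ref_logits - mx.logsumexp(ref_logits, axis=-1, keepdims=True)
    pert_lp = pert_logits - mx.logsumexp(pert_logits, axis=-1, keepdims=True)
    kl = mx.sum(mx.exp(ref_lp) * (ref_lp - pert_lp), axis=-1)
    return mx.mean(kl).item()
```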
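And a compact sketch of the greedy knapsack itself, simplified to equal-sized layers and candidate widths {4, 8} (the real allocator would also weight each layer by its parameter count and honor the protected layers):

```python
def greedy_bit_allocation(
    kl_at_4: list[float],  # per-layer KL when that layer is quantized to 4-bit
    kl_at_8: list[float],  # per-layer KL when that layer is quantized to 8-bit
    target_bpw: float,
) -> list[int]:
    """Start all layers at 4-bit, then repeatedly upgrade the layer whose
    upgrade buys the largest KL reduction per extra bit, until the average
    bits-per-weight reaches the target. Simplified illustrative sketch."""
    n = len(kl_at_4)
    bits = [4] * n
    while sum(bits) / n < target_bpw:
        gains = [((kl_at_4[i] - kl_at_8[i]) / 4, i) for i in range(n) if bits[i] == 4]
        if not gains:  # every layer is already at 8-bit
            break
        _, best = max(gains)
        bits[best] = 8
    return bits

# Example: layers 0 and 5 suffer most at 4-bit, so they are upgraded first.
kl4 = [0.9, 0.1, 0.05, 0.2, 0.1, 0.8, 0.1, 0.1]
kl8 = [0.02] * 8
print(greedy_bit_allocation(kl4, kl8, target_bpw=5.0))  # [8, 4, 4, 4, 4, 8, 4, 4]
```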
## Site map

- https://mlx-optiq.com/ — overview
- https://mlx-optiq.com/models — all 12 pre-built quants
- https://mlx-optiq.com/docs/ — documentation index
- https://mlx-optiq.com/docs/install — installation
- https://mlx-optiq.com/docs/quants — using pre-built quants
- https://mlx-optiq.com/docs/sensitivity — methodology
- https://mlx-optiq.com/docs/qwen3.5 — Qwen3.5 family guide
- https://mlx-optiq.com/docs/qwen3.6 — Qwen3.6 family guide
- https://mlx-optiq.com/docs/gemma-4 — Gemma-4 family guide
- https://mlx-optiq.com/docs/finetune — LoRA fine-tuning
- https://mlx-optiq.com/docs/serve — KV-quant serving
- https://mlx-optiq.com/docs/cli — CLI reference
- https://mlx-optiq.com/blog/ — engineering posts and research
- https://mlx-optiq.com/blog/gemma-4-support — Gemma-4 family launch (e2b/e4b/26B-A4B/31B), +32 pp recovery on e4b
- https://mlx-optiq.com/blog/turboquant-rotated-attention — research path: rotated-space KV attention, 100% needle vs 73% affine
- https://mlx-optiq.com/blog/sensitivity-aware-lora — LoRA fine-tuning with rank scaled by per-layer bit assignment
- https://mlx-optiq.com/blog/not-all-layers-are-equal — research foundation: per-layer sensitivity for weights and KV cache
- https://mlx-optiq.com/experiments — research threads
- https://mlx-optiq.com/results — benchmark results

## Distribution

- PyPI: https://pypi.org/project/mlx-optiq/
- Hugging Face quants: https://huggingface.co/mlx-community
- License: MIT