# mlx-optiq

> mlx-optiq is an open-source Python toolkit for running large language models entirely on Apple Silicon. It provides three things in one PyPI package: (1) sensitivity-driven mixed-precision quantization that beats uniform 4-bit at the same size; (2) sensitivity-aware LoRA fine-tuning with PEFT-compatible output; and (3) an OpenAI-compatible inference server with a mixed-precision quantized KV cache and hot-swappable LoRA adapters. Tested on M2/M3/M4 Macs. Free, MIT-licensed.

## Install

```
pip install mlx-optiq
```

Requirements: macOS 14+, Apple Silicon (M1/M2/M3/M4), Python 3.11+.

Optional extras: `mlx-optiq[convert]` (psutil for RAM precheck), `mlx-optiq[eval]` (datasets for GSM8K), `mlx-optiq[serve]` (uvicorn/fastapi), `mlx-optiq[all]`.

## Pre-built models on Hugging Face

All twelve quants live under the `mlx-community` organization on HF. They load with stock `mlx_lm.load(...)` — no special runtime.

### Qwen3.5 family (dense + 1 sparse MoE)

- `mlx-community/Qwen3.5-0.8B-OptiQ-4bit` — 0.5 GB · 27.0% GSM8K (+15.5pp vs uniform 4-bit)
- `mlx-community/Qwen3.5-2B-OptiQ-4bit` — 1.4 GB · 48.0% GSM8K
- `mlx-community/Qwen3.5-4B-OptiQ-4bit` — 2.8 GB · 81.5% GSM8K (+2.0pp)
- `mlx-community/Qwen3.5-9B-OptiQ-4bit` — 5.6 GB · 90.0% GSM8K (default daily driver)
- `mlx-community/Qwen3.5-27B-OptiQ-4bit` — 15.7 GB · 87.5% GSM8K
- `mlx-community/Qwen3.5-35B-A3B-OptiQ-4bit` — 20.1 GB · 89.5% GSM8K (sparse MoE, 3 B active)

### Qwen3.6 family

- `mlx-community/Qwen3.6-27B-OptiQ-4bit` — 15.7 GB · 95.0% GSM8K (+1.0pp)
- `mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit` — 20.1 GB · 89.5% GSM8K (256-expert MoE, 3 B active)

### Gemma-4 family (instruct)

- `mlx-community/gemma-4-e2b-it-OptiQ-4bit` — 4.0 GB · 13.0% GSM8K (+7.5pp)
- `mlx-community/gemma-4-e4b-it-OptiQ-4bit` — 6.0 GB · 55.5% GSM8K (+32.0pp — best small-model recovery)
- `mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit` — 14.9 GB · 94.0% GSM8K (sparse MoE, 4 B active)
- `mlx-community/gemma-4-31B-it-OptiQ-4bit` — 18.1 GB · 96.0% GSM8K (strongest dense quant)

## Loading any pre-built quant

```python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-precision quantization."}],
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(model, tok, prompt=prompt, max_tokens=300)
print(out)
```

For Qwen3.5/3.6 reasoning models, pass `enable_thinking=False` to `apply_chat_template` to skip the `<think>…</think>` channel for faster (slightly less accurate) output.

## Streaming generation

```python
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)
for response in stream_generate(model, tok, prompt="...", max_tokens=200, sampler=sampler):
    print(response.text, end="", flush=True)
```

## Quantizing your own model

```bash
# auto-routes between bf16 and uniform_4bit reference based on available RAM
optiq convert Qwen/Qwen3.5-9B \
  --target-bpw 4.5 \
  --candidate-bits 4,8 \
  --reference auto \
  -o ./optiq_output/Qwen3.5-9B
```

Three reference modes:

- `bf16` (gold standard; requires the bf16 model in RAM, roughly 2× the parameter count in GB)
- `uniform_4bit` (for big models; builds a 4-bit baseline, then streams bf16 layer-by-layer from disk)
- `auto` (the default; picks `bf16` if it fits in RAM, else `uniform_4bit`)

The output is a standard MLX checkpoint with per-layer bit assignments stored in metadata. It loads anywhere stock `mlx-lm` loads.
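As a sanity check on what `--target-bpw 4.5` with `--candidate-bits 4,8` implies: if `p` is the fraction of weights held at 8-bit, the average bit width is `4(1-p) + 8p`, so `p = (4.5 - 4)/(8 - 4) = 12.5%`. A back-of-envelope helper (illustrative only; it ignores group-scale overhead and the handful of layers protected at 8-bit):

```python
def high_bit_fraction(target_bpw: float, low: int = 4, high: int = 8) -> float:
    """Fraction of weights that must sit at `high` bits so the weighted
    average bit width lands on `target_bpw` (two candidate widths only)."""
    return (target_bpw - low) / (high - low)

print(high_bit_fraction(4.5))  # 0.125 -> roughly 1 weight in 8 ends up at 8-bit
```

In other words, a 4.5-bpw checkpoint is mostly 4-bit, with the ~12% most sensitive weights kept at 8-bit.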
## Mixed-precision KV-cache serving

One-time sensitivity pass, then serve:

```bash
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --target-bits 4.5 --candidate-bits 4,8 \
  -o ./kv/qwen35_9b

optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --kv-config ./kv/qwen35_9b/kv_config.json \
  --port 8080
```

Delivers a +31% to +62% decode speedup at 64k context on Qwen3.5 4B/9B versus fp16 KV. KV quantization is currently broken on Gemma-4 (a shared-KV attention limitation upstream); use stock fp16 KV for Gemma-4 long-context serving.

## OpenAI- and Anthropic-compatible API

`optiq serve` exposes both endpoints from the same process:

- OpenAI: `/v1/chat/completions` (streaming SSE)
- Anthropic: `/v1/messages` (streaming SSE) — works with Claude Code and the official `anthropic` Python SDK

```python
# OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")
resp = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

# Anthropic client — same server
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8080", api_key="not-used")
resp = client.messages.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.content[0].text)
```

Claude Code via environment variables:

```bash
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_API_KEY="not-used"
claude  # now driven by your local quant
```

## Sensitivity-aware LoRA fine-tuning

```bash
optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --data ./my_training_data \
  --max-seq-length 1400 \
  --rank 8 --rank-scaling by_bits \
  --num-layers 16 --iters 1000 \
  -o ./my_adapter

optiq lora info ./my_adapter
```

`--rank-scaling by_bits` gives the layers mlx-optiq kept at 8-bit during quantization 2× the adapter rank of 4-bit layers, at the same total parameter budget: the layers that needed more weight precision also get more adapter capacity (a sketch of the allocation idea closes this section). Output is PEFT-compatible (`adapter_config.json` + `adapters.safetensors`) plus an mlx-optiq sidecar (`optiq_lora_config.json`) recording the per-layer rank distribution.

Data format is JSONL (one example per line, `{"text": "..."}` or `{"messages": [...]}`) — same as `mlx_lm.lora`.

### Empirical training-ceiling map (M3 Max 36 GB, default config)

| Model | Max seq len | Peak mem |
|---|---|---|
| Qwen3.5-0.8B | 2,800 | 23.4 GB |
| Qwen3.5-2B | 2,400 | 19.3 GB |
| Qwen3.5-4B | 1,600 | 24.8 GB |
| Qwen3.5-9B | 1,400 | 25.4 GB |
| Qwen3.5-27B / Qwen3.6-27B | 512 | 27.7 GB |
| gemma-4-26B-A4B | 512 | 27.6 GB |
| Qwen3.5-35B-A3B / Qwen3.6-35B-A3B | 128 | 25.3 GB |
| gemma-4-31B-it | 32 | 21.4 GB |

Two distinct failure modes when pushing past these ceilings:

- **Memory cliff** (~27-28 GB): macOS falls back to compressed memory and throughput drops 9-30%.
- **MTLResource cliff** (independent of bytes): Apple GPUs cap at 499 K simultaneously bound resources. Qwen3.5-2B at a sequence length of 3,200 hits a hard `kIOGPUCommandBufferCallbackErrorOutOfMemory` even at only 22 GB peak. Don't extrapolate from spare GB headroom to longer sequence lengths.
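To make `by_bits` concrete, here is a minimal sketch of the allocation idea. `scale_ranks_by_bits` is a hypothetical helper, not mlx-optiq's API, and it assumes uniform layer dimensions so a rank budget stands in for the parameter budget:

```python
def scale_ranks_by_bits(layer_bits: list[int], base_rank: int = 8) -> list[int]:
    """Give 8-bit layers 2x the raw rank of 4-bit layers, then rescale so
    the total stays at base_rank * num_layers (the fixed budget).
    Hypothetical sketch; mlx-optiq's actual allocator may differ."""
    raw = [base_rank * (2 if bits == 8 else 1) for bits in layer_bits]
    budget = base_rank * len(layer_bits)
    scale = budget / sum(raw)
    return [max(1, round(r * scale)) for r in raw]

# Example: 16 trainable layers, 4 of them kept at 8-bit during quantization.
bits = [8, 4, 4, 4] * 4
print(scale_ranks_by_bits(bits))  # 8-bit layers land near rank 13, 4-bit near rank 6
```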
## Hot-swap mounted LoRA adapters

```python
from mlx_lm import load, generate
from optiq.adapters.mount import mount_adapter_on_model, AdapterActivation

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
p = "..."  # your prompt

# Mount both adapters once; only the active one affects generation.
mount_adapter_on_model(model, "agent-A", "./adapter_a")
mount_adapter_on_model(model, "agent-B", "./adapter_b")

with AdapterActivation("agent-A"):
    out_a = generate(model, tok, prompt=p, max_tokens=100)
with AdapterActivation("agent-B"):
    out_b = generate(model, tok, prompt=p, max_tokens=100)
```

Or via CLI:

```bash
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --adapter ./adapter_a --adapter ./adapter_b
```

Then select per-request via the OpenAI `model` field. Memory: ~50 MB per extra adapter vs ~5 GB per full base copy.

## CLI reference

- `optiq convert MODEL [--target-bpw 4.5] [--candidate-bits 4,8] [--reference auto|bf16|uniform_4bit] [-o PATH]`
- `optiq kv-cache MODEL [--target-bits 4.5] [--candidate-bits 4,8] [-o PATH]`
- `optiq lora train MODEL --data PATH [--rank 8] [--rank-scaling by_bits|constant|by_kl|by_quantile] [--num-layers 16] [--max-seq-length 1024] [--iters 1000] [-o PATH]`
- `optiq lora info ADAPTER_PATH`
- `optiq serve --model MODEL [--kv-config PATH] [--adapter PATH-OR-REPO] [--host 127.0.0.1] [--port 8080]`
- `optiq eval MODEL_PATH --task gsm8k --baseline UNIFORM_PATH --n-samples 200`
- `optiq latency MODEL_PATH --calibrate`
- `optiq --version`

## How sensitivity works (algorithm)

For each `(layer L, candidate bits b)`:

1. Forward-pass the calibration data with all weights at reference precision; record the output logits.
2. Replace just L's weight with a simulate-quantized copy at b bits (round-trip quantize→dequantize).
3. Forward-pass the same calibration data; record the perturbed logits.
4. Compute the KL divergence between reference and perturbed logits, averaged over samples.
5. Restore L; move to the next layer.

Then a greedy knapsack: start every layer at the lowest bit width, and greedily upgrade the layer with the largest KL-reduction-per-bit until the average bits-per-weight reaches the target. `lm_head`, `embed_tokens`, and the first/last attention blocks are protected at 8-bit by default.

Calibration: WikiText-2 validation, 32 sequences × 128 tokens. Generic web text suffices because we measure relative layer sensitivity, not absolute accuracy.
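Steps 2 and 4 map directly onto stock MLX primitives. Below is a minimal sketch of the per-layer measurement, assuming the reference and perturbed logits are already collected; the function names are illustrative, not mlx-optiq's internals:

```python
import mlx.core as mx

def simulate_quantize(w: mx.array, bits: int, group_size: int = 64) -> mx.array:
    """Step 2: round-trip a weight matrix through quantize -> dequantize."""
    wq, scales, biases = mx.quantize(w, group_size=group_size, bits=bits)
    return mx.dequantize(wq, scales, biases, group_size=group_size, bits=bits)

def mean_kl(ref_logits: mx.array, pert_logits: mx.array) -> float:
    """Step 4: KL(reference || perturbed), averaged over all token positions."""
    ref_lp = ref_logits - mx.logsumexp(ref_logits, axis=-1, keepdims=True)
    pert_lp = pert_logits - mx.logsumexp(pert_logits, axis=-1, keepdims=True)
    kl = mx.sum(mx.exp(ref_lp) * (ref_lp - pert_lp), axis=-1)
    return mx.mean(kl).item()
```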
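And a compact sketch of the greedy knapsack itself, simplified to equal-sized layers and candidate widths {4, 8} (the real allocator would also weight each layer by its parameter count and honor the protected layers):

```python
def greedy_bit_allocation(
    kl_at_4: list[float],  # per-layer KL when that layer is quantized to 4-bit
    kl_at_8: list[float],  # per-layer KL when that layer is quantized to 8-bit
    target_bpw: float,
) -> list[int]:
    """Start all layers at 4-bit, then repeatedly upgrade the layer whose
    upgrade buys the largest KL reduction per extra bit, until the average
    bits-per-weight reaches the target. Simplified illustrative sketch."""
    n = len(kl_at_4)
    bits = [4] * n
    while sum(bits) / n < target_bpw:
        gains = [((kl_at_4[i] - kl_at_8[i]) / 4, i) for i in range(n) if bits[i] == 4]
        if not gains:  # every layer is already at 8-bit
            break
        _, best = max(gains)
        bits[best] = 8
    return bits

# Example: layers 0 and 5 suffer most at 4-bit, so they are upgraded first.
kl4 = [0.9, 0.1, 0.05, 0.2, 0.1, 0.8, 0.1, 0.1]
kl8 = [0.02] * 8
print(greedy_bit_allocation(kl4, kl8, target_bpw=5.0))  # [8, 4, 4, 4, 4, 8, 4, 4]
```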
## Site map

- https://mlx-optiq.com/ — overview
- https://mlx-optiq.com/models — all 12 pre-built quants
- https://mlx-optiq.com/docs/ — documentation index
- https://mlx-optiq.com/docs/install — installation
- https://mlx-optiq.com/docs/quants — using pre-built quants
- https://mlx-optiq.com/docs/sensitivity — methodology
- https://mlx-optiq.com/docs/qwen3.5 — Qwen3.5 family guide
- https://mlx-optiq.com/docs/qwen3.6 — Qwen3.6 family guide
- https://mlx-optiq.com/docs/gemma-4 — Gemma-4 family guide
- https://mlx-optiq.com/docs/finetune — LoRA fine-tuning
- https://mlx-optiq.com/docs/serve — KV-quant serving
- https://mlx-optiq.com/docs/cli — CLI reference
- https://mlx-optiq.com/blog/ — engineering posts and research
- https://mlx-optiq.com/blog/gemma-4-support — Gemma-4 family launch (e2b/e4b/26B-A4B/31B), +32 pp recovery on e4b
- https://mlx-optiq.com/blog/turboquant-rotated-attention — research path: rotated-space KV attention, 100% needle vs 73% affine
- https://mlx-optiq.com/blog/sensitivity-aware-lora — LoRA fine-tuning with rank scaled by per-layer bit assignment
- https://mlx-optiq.com/blog/not-all-layers-are-equal — research foundation: per-layer sensitivity for weights and KV cache
- https://mlx-optiq.com/experiments — research threads
- https://mlx-optiq.com/results — benchmark results

## Distribution

- PyPI: https://pypi.org/project/mlx-optiq/
- Hugging Face quants: https://huggingface.co/mlx-community
- License: MIT