mlx-optiq

KV-quant serving

optiq serve is a drop-in replacement for mlx_lm.server. It exposes both the OpenAI /v1/chat/completions endpoint and the Anthropic /v1/messages endpoint from the same process — point Claude Code, the OpenAI SDK, the Anthropic SDK, or plain curl at the same local URL. On top of that, it adds a sensitivity-aware quantized KV cache for long-context throughput and mounted LoRA adapters that swap per request.

Quickstart

terminal (bash)
# Stock fp16 KV serving — works for any mlx-optiq quant
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --port 8080

Then call it like any OpenAI endpoint:

curl_chat.sh (bash)
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
      "messages": [{"role": "user", "content": "What is RoPE?"}],
      "max_tokens": 300,
      "stream": false
    }'

Mixed-precision KV cache

By default, optiq serve uses an fp16 KV cache. For long contexts (16k+) on Qwen3.5 / 3.6, a sensitivity-driven quantized KV cache delivers a 30-60% decode speedup with no quality loss.

Step 1 — measure

terminal (bash)
# 1-2 min. Once per model.
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 4.5 \
    --candidate-bits 4,8 \
    -o ./kv/qwen35_9b

# writes ./kv/qwen35_9b/kv_config.json
# [{"layer_idx": 3, "bits": 8, "group_size": 64}, ...]

Step 2 — serve

terminal (bash)
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/qwen35_9b/kv_config.json \
    --max-tokens 32768 --temp 0.6 --top-p 0.95

Why this works: Layer 0's KV is often 56× more sensitive than the layer average — uniform 4-bit KV is catastrophic. A single 8-bit layer (the most KV-sensitive one, often layer 3 in Qwen3.5's hybrid attention) protects quality while every other layer runs 4-bit. Apple Silicon's mx.quantized_matmul also handles the 8-bit fast path more efficiently than 4-bit, so protecting that one layer also flips it onto a faster kernel. Quality and speed point the same way.
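
As a back-of-envelope check on why cutting KV precision moves decode speed at all, the sketch below sizes the cache that must be streamed per generated token. The layer count, KV-head count, and head dimension are illustrative assumptions, not Qwen3.5-9B's real config, and the per-group overhead assumes an fp16 scale and bias per quantization group:

kv_budget.py (python)
# Rough KV-cache sizing: less resident KV means less memory traffic per decoded token.
n_layers, n_kv_heads, head_dim = 36, 8, 128   # assumed dims, for illustration only
ctx = 32_768                                  # tokens resident for one long-context request

def kv_bytes_per_token_per_layer(bits, group_size=64):
    elems = 2 * n_kv_heads * head_dim         # K and V for one token in one layer
    data = elems * bits / 8
    # quantized groups carry an fp16 scale and bias (~4 bytes per group)
    overhead = 0 if bits == 16 else (elems / group_size) * 4
    return data + overhead

fp16_cache  = n_layers * kv_bytes_per_token_per_layer(16) * ctx
mixed_cache = ((n_layers - 1) * kv_bytes_per_token_per_layer(4)
               + 1 * kv_bytes_per_token_per_layer(8)) * ctx

for name, size in [("fp16 KV", fp16_cache), ("mixed 4/8-bit KV", mixed_cache)]:
    print(f"{name:>16}: {size / 2**30:.2f} GiB at {ctx} tokens")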

For empirical numbers across Qwen3.5 sizes, see benchmark results.

Mounted LoRA adapters

mlx-optiq ships a reversible mounted LoRA primitive — distinct from mlx-lm's load_adapters, which merges adapter weights into the base. Mounted adapters stay separate; a ContextVar selects which one is active per request.
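
Conceptually, the dispatch works like the sketch below: each patched projection keeps its base weight plus named low-rank pairs and consults a ContextVar at call time. This is an illustrative sketch of the idea, not mlx-optiq's internals; the class and variable names are made up.

mounted_dispatch_sketch.py (python)
from contextvars import ContextVar

# Which adapter the current request should use; None means base model only.
_active_adapter: ContextVar = ContextVar("active_adapter", default=None)

class MountedLinear:
    """Base weight plus named (A, B) low-rank pairs that are never merged."""
    def __init__(self, base_weight):
        self.base_weight = base_weight
        self.adapters = {}                     # name -> (A, B)

    def mount(self, name, A, B):
        self.adapters[name] = (A, B)

    def __call__(self, x):
        y = x @ self.base_weight.T
        name = _active_adapter.get()
        if name in self.adapters:              # only the selected adapter contributes
            A, B = self.adapters[name]
            y = y + (x @ A.T) @ B.T            # low-rank LoRA delta
        return y

The library's AdapterActivation context manager (shown further down) is the public face of this kind of set-and-reset pattern, which is why switching adapters touches no weights.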

One adapter at startup

terminal (bash)
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --adapter ./my_adapter

Multiple adapters from Hugging Face

terminal (bash)
# Auto-downloads from HF, mounts both
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --adapter codelion/agent-A \
    --adapter codelion/agent-B

Pick one per request via the model field in the OpenAI request body — set it to the adapter name:

switch_per_request.sh (bash)
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "agent-A", "messages": [...] }'

$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "agent-B", "messages": [...] }'

Memory math: One Qwen3.5-9B-OptiQ-4bit base is ~5.6 GB. Each LoRA adapter is ~50 MB. 10 adapters co-resident ≈ 6.1 GB, vs ~56 GB if you spun up one full model copy per adapter. Switching is free — no weight reload, no GPU re-upload.

Python API

If you want to skip the CLI and embed serving in your own process:

embed_serve.py (python)
from optiq.serve import create_app
import uvicorn

app = create_app(
    model_name="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    kv_config_path="./kv/qwen35_9b/kv_config.json",
    adapters=[("agent-A", "./adapter_a"),
              ("agent-B", "./adapter_b")],
)
uvicorn.run(app, host="0.0.0.0", port=8080)

Mounted-adapter Python API

Lower-level: mount adapters and switch them programmatically without going through the HTTP server.

mount.py (python)
from mlx_lm import load, generate
from optiq.adapters.mount import (
    mount_adapter_on_model,
    AdapterActivation,
)

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
mount_adapter_on_model(model, "agent-A", "./adapter_a")
mount_adapter_on_model(model, "agent-B", "./adapter_b")

with AdapterActivation("agent-A"):
    out_a = generate(model, tok, prompt=p, max_tokens=100)

with AdapterActivation("agent-B"):
    out_b = generate(model, tok, prompt=p, max_tokens=100)

OpenAI client compatibility

Use the official openai Python client by pointing it at your local server:

openai_client.py (python)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",  # local server, but key is required
)

resp = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

Anthropic API — point Claude Code at your local quant

The same server simultaneously answers Anthropic's /v1/messages endpoint with the exact response shape Claude clients expect. This means you can drive a local mlx-optiq quant from any tool that speaks the Anthropic API — Claude Code, the official anthropic Python SDK, or your own integrations.

terminal (bash)
# Same optiq serve invocation — no extra flag needed.
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --port 8080

Anthropic SDK against your local quant:

anthropic_client.py (python)
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8080",
    api_key="not-used",
)
resp = client.messages.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.content[0].text)

Claude Code via env vars:

claude_code.sh (bash)
$ export ANTHROPIC_BASE_URL="http://localhost:8080"
$ export ANTHROPIC_API_KEY="not-used"
$ claude    # now driven by your local quant

What's translated: The shim accepts Anthropic-shaped requests, translates them into the OpenAI request the underlying mlx-lm engine wants, runs generation, and translates the response back into Anthropic shape — including streaming events. system, messages, max_tokens, stream, temperature, and top_p all work. Tool-use parameters are accepted but route through the same generation path (the underlying model does what it does — there's no server-side function-calling router).
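
The request-side mapping is small enough to sketch. The function below is illustrative rather than the server's actual code, and it assumes plain-string message content; the real shim also handles Anthropic's list-of-content-blocks form and the streaming event translation.

shim_sketch.py (python)
def anthropic_to_openai(body: dict) -> dict:
    """Map an Anthropic /v1/messages body onto an OpenAI /v1/chat/completions body."""
    messages = []
    if body.get("system"):
        # Anthropic carries the system prompt as a top-level field;
        # OpenAI expects it as the first message.
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body.get("messages", []))

    out = {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body["max_tokens"],      # required by the Anthropic schema
        "stream": body.get("stream", False),
    }
    for key in ("temperature", "top_p"):
        if key in body:
            out[key] = body[key]
    return out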

Production tips

  • Bind to 127.0.0.1 for local-only use, or put the server behind a reverse proxy. Don't expose the raw 0.0.0.0 binding to the public internet — there's no built-in auth (a minimal token-check sketch follows this list).
  • Set max-concurrency. The MLX runtime is single-process; concurrent generations share the GPU and degrade tail latency.
  • Tune --max-tokens conservatively. Each in-flight request keeps a KV cache resident; long contexts dominate memory.
  • Pre-load adapters. Loading a new adapter mid-flight stalls all in-flight requests. Mount everything you need at startup.
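
If you do need to reach the server from beyond localhost, one lightweight option is to wrap the ASGI app returned by create_app with a shared-token check before handing it to uvicorn. This is a minimal sketch that assumes only that create_app returns an ASGI app (it is served with uvicorn above); the OPTIQ_TOKEN variable and the exact header handling are illustrative, not part of mlx-optiq.

guarded_serve.py (python)
import os
import uvicorn
from optiq.serve import create_app

inner = create_app(model_name="mlx-community/Qwen3.5-9B-OptiQ-4bit")
TOKEN = os.environ["OPTIQ_TOKEN"]              # hypothetical shared secret for this sketch

async def app(scope, receive, send):
    # Reject HTTP requests that don't present the token as a bearer key or x-api-key.
    if scope["type"] == "http":
        headers = dict(scope.get("headers", []))
        ok = (headers.get(b"authorization") == f"Bearer {TOKEN}".encode()
              or headers.get(b"x-api-key") == TOKEN.encode())
        if not ok:
            await send({"type": "http.response.start", "status": 401,
                        "headers": [(b"content-type", b"text/plain")]})
            await send({"type": "http.response.body", "body": b"unauthorized"})
            return
    await inner(scope, receive, send)

uvicorn.run(app, host="0.0.0.0", port=8080)

OpenAI clients then pass the token as their api_key (it arrives as an Authorization: Bearer header); Anthropic clients send it as x-api-key.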