KV-quant serving
optiq serve is a drop-in replacement for mlx_lm.server. It exposes both the OpenAI /v1/chat/completions endpoint and the Anthropic /v1/messages endpoint from the same process — point Claude Code, the OpenAI SDK, the Anthropic SDK, or plain curl at the same local URL. On top of that: sensitivity-aware quantized KV cache for long-context throughput, and mounted LoRA adapters that swap per request.
Quickstart
# Stock fp16 KV serving — works for any mlx-optiq quant
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --port 8080
Then call it like any OpenAI endpoint:
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
      "messages": [{"role": "user", "content": "What is RoPE?"}],
      "max_tokens": 300,
      "stream": false
    }'
Mixed-precision KV cache
By default, optiq serve uses an fp16 KV cache. For long contexts (16k+) on Qwen3.5 / 3.6, a sensitivity-driven quantized KV cache delivers a 30-60% decode speedup with no quality loss.
Step 1 — measure
# 1-2 min. Once per model.
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 4.5 \
    --candidate-bits 4,8 \
    -o ./kv/qwen35_9b

# writes ./kv/qwen35_9b/kv_config.json
# [{"layer_idx": 3, "bits": 8, "group_size": 64}, ...]
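The config is plain JSON, so you can sanity-check what the measurement decided before serving with it. A minimal sketch (standard library only), assuming the file is a flat list of per-layer entries as in the excerpt above:

import json
from collections import Counter

# Summarize the per-layer bit allocation written by `optiq kv-cache`.
with open("./kv/qwen35_9b/kv_config.json") as f:
    entries = json.load(f)

print("layers per bit-width:", dict(Counter(e["bits"] for e in entries)))
print("8-bit (protected) layers:",
      sorted(e["layer_idx"] for e in entries if e["bits"] == 8))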
Step 2 — serve
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/qwen35_9b/kv_config.json \
    --max-tokens 32768 --temp 0.6 --top-p 0.95
mx.quantized_matmul also runs its 8-bit fast path more efficiently than the 4-bit one, so protecting that one sensitive layer at 8 bits also flips it onto a faster kernel. Quality and speed point the same way.
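If you want to check the kernel effect on your own machine, a micro-benchmark of mx.quantized_matmul at 4 vs 8 bits is enough; the shape below is illustrative, not taken from any particular layer.

import time
import mlx.core as mx

# Illustrative decode-like shape; run locally to compare the 4-bit and 8-bit kernels.
x = mx.random.normal((1, 4096))      # single-token activation
w = mx.random.normal((4096, 4096))   # projection weight

for bits in (4, 8):
    wq, scales, biases = mx.quantize(w, group_size=64, bits=bits)
    mx.eval(mx.quantized_matmul(x, wq, scales, biases, group_size=64, bits=bits))  # warm-up
    t0 = time.perf_counter()
    for _ in range(200):
        y = mx.quantized_matmul(x, wq, scales, biases, group_size=64, bits=bits)
    mx.eval(y)
    print(f"{bits}-bit: {(time.perf_counter() - t0) / 200 * 1e6:.1f} µs per matmul")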
For empirical numbers across Qwen3.5 sizes, see benchmark results.
Mounted LoRA adapters
mlx-optiq ships a reversible mounted LoRA primitive — distinct from mlx-lm's load_adapters which merges adapter weights into the base. Mounted adapters stay separate; a ContextVar selects which one is active per request.
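The selection mechanism is plain ContextVar semantics: each asyncio task carries its own copy of the context, so one request's adapter choice can't leak into another. A generic illustration of that isolation (the variable and handler names here are hypothetical, not optiq internals):

import asyncio
from contextvars import ContextVar

# Hypothetical names, for illustration only: each task sets the variable
# in its own context, so concurrent requests see their own value.
active_adapter: ContextVar[str] = ContextVar("active_adapter", default="base")

async def handle_request(name: str) -> None:
    active_adapter.set(name)
    await asyncio.sleep(0)   # yield so the other request runs in between
    print(f"request {name!r} still sees {active_adapter.get()!r}")

async def main() -> None:
    await asyncio.gather(handle_request("agent-A"), handle_request("agent-B"))

asyncio.run(main())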
One adapter at startup
$ optiq serve \
--model mlx-community/Qwen3.5-9B-OptiQ-4bit \
--adapter ./my_adapter
Multiple adapters from Hugging Face
# Auto-downloads from HF, mounts both
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --adapter codelion/agent-A \
    --adapter codelion/agent-B
Pick one per request via the model field in the OpenAI request body — set it to the adapter name:
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "agent-A", "messages": [...] }'

$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "agent-B", "messages": [...] }'
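The same selection works from the OpenAI Python client (covered in more detail below): pass the adapter name as model.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")

# "agent-A" / "agent-B" are the adapters mounted at startup above.
resp = client.chat.completions.create(
    model="agent-A",
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.choices[0].message.content)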
Python API
If you want to skip the CLI and embed serving in your own process:
from optiq.serve import create_app
import uvicorn

app = create_app(
    model_name="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    kv_config_path="./kv/qwen35_9b/kv_config.json",
    adapters=[("agent-A", "./adapter_a"), ("agent-B", "./adapter_b")],
)
uvicorn.run(app, host="0.0.0.0", port=8080)
Mounted-adapter Python API
Lower-level: mount adapters and switch them programmatically without going through the HTTP server.
from mlx_lm import load, generate
from optiq.adapters.mount import (
    mount_adapter_on_model,
    AdapterActivation,
)

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
mount_adapter_on_model(model, "agent-A", "./adapter_a")
mount_adapter_on_model(model, "agent-B", "./adapter_b")

p = "What is RoPE?"  # any prompt

with AdapterActivation("agent-A"):
    out_a = generate(model, tok, prompt=p, max_tokens=100)

with AdapterActivation("agent-B"):
    out_b = generate(model, tok, prompt=p, max_tokens=100)
OpenAI client compatibility
Use the official openai Python client by pointing it at your local server:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",  # local server, but key is required
)

resp = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")
Anthropic API — point Claude Code at your local quant
The same server simultaneously answers Anthropic's /v1/messages endpoint with the exact response shape Claude clients expect. This means you can drive a local mlx-optiq quant from any tool that speaks the Anthropic API — Claude Code, the official anthropic Python SDK, or your own integrations.
# Same optiq serve invocation — no extra flag needed.
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --port 8080
Anthropic SDK against your local quant:
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8080",
    api_key="not-used",
)

resp = client.messages.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.content[0].text)
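Streaming works through the SDK's stream helper against the same server (stream support is noted below); a short sketch:

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8080", api_key="not-used")

# Print tokens as they arrive from the local server.
with client.messages.stream(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    messages=[{"role": "user", "content": "hi"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)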
Claude Code via env var (one line):
$ export ANTHROPIC_BASE_URL="http://localhost:8080"
$ export ANTHROPIC_API_KEY="not-used"
$ claude   # now driven by your local quant
system, messages, max_tokens, stream, temperature, and top_p all work. Tool-use parameters are accepted but route through the same generation path (the underlying model does what it does — there's no server-side function-calling router).
Production tips
- Bind to 127.0.0.1 for local-only use, or put it behind a reverse proxy. Don't expose the raw 0.0.0.0 binding to the public internet — there's no auth.
- Set max-concurrency. The MLX runtime is single-process; concurrent generations share the GPU and degrade tail latency (one way to cap in-flight requests is sketched after this list).
- Tune --max-tokens conservatively. Each in-flight request keeps a KV cache resident; long contexts dominate memory.
- Pre-load adapters. Loading a new adapter mid-flight stalls all in-flight requests. Mount everything you need at startup.
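If you embed the app via create_app, one way to enforce such a cap is a semaphore at the ASGI layer. A minimal sketch; the middleware is illustrative, not an optiq feature:

import asyncio
import uvicorn
from optiq.serve import create_app

MAX_CONCURRENCY = 2                      # tune for your machine
_slots = asyncio.Semaphore(MAX_CONCURRENCY)
inner = create_app(model_name="mlx-community/Qwen3.5-9B-OptiQ-4bit")

async def app(scope, receive, send):
    # Pass lifespan/websocket events straight through; gate only HTTP requests.
    if scope["type"] != "http":
        return await inner(scope, receive, send)
    async with _slots:                   # excess requests queue here
        return await inner(scope, receive, send)

uvicorn.run(app, host="127.0.0.1", port=8080)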