Integration · Claude Code

Claude Code

Anthropic's Claude Code is a terminal-based coding agent that uses the Anthropic Messages API. Point it at optiq serve via ANTHROPIC_BASE_URL and it'll talk to your local OptiQ model instead of Anthropic's hosted Claude.

1. Install Claude Code

terminalbash

$ npm install -g @anthropic-ai/claude-code

2. Start `optiq serve`

terminalbash

$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --mtp --mtp-depth 2 \
    --port 8080

The default --anthropic flag installs the /v1/messages endpoint. The --mtp flag enables ~1.4-1.8× decode speedup on Qwen3.5 / 3.6 family.

3. Point Claude Code at OptiQ

terminalbash

# In your shell rc, or per-session:
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=sk-optiq-local
export ANTHROPIC_MODEL=mlx-community/Qwen3.5-9B-OptiQ-4bit

$ claude

That's it. Claude Code will route every message through your local optiq serve. To go back to hosted Claude, unset the three env vars.

4. Right-size the context window (auto-compact)

Claude Code decides when to auto-compact the conversation by comparing the token usage the server reports against the context window it assumes the model has (~200k for a Claude model). Point it at a smaller-context local model and it won't compact until far past the model's real limit — the model overflows and generation fails first.

--context-scale FACTOR fixes the timing: it multiplies the token counts in the reported usage by FACTOR, so Claude Code's "compact at N% of the window" logic fires at the right real-token point. Only the reported usage is scaled — generation, the KV cache, and the prompt are untouched. Compute the factor as (window Claude Code assumes) / (your model's context):

terminalbash

# a model with a 32k context, behind Claude Code's ~200k assumption:
# 200000 / 32000 ≈ 6.25
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --mtp --mtp-depth 2 \
    --context-scale 6.25 \
    --port 8080

Now Claude Code auto-compacts as the conversation approaches your model's real ceiling instead of overrunning it. Leave it at the default 1.0 when your model's context already matches (or exceeds) what the client assumes.

Notes

Tool use: Qwen and Llama models that emit <tool_call>...</tool_call> blocks are translated into Anthropic tool_use content blocks transparently. Models without native tool-call training (most small ones) won't drive Claude Code's full agentic loop; pick 9B+ for serious coding work.
Streaming: works out of the box. Claude Code shows tokens as they arrive.
Prompt caching: multi-turn reuse is automatic — after the first turn, the server reuses the KV of the shared conversation prefix and prefills only the new tokens, so every turn is near-instant to first token (a ~4× TTFT cut on a ~4k context, growing with model size). No configuration; see Prompt caching.
Thinking on/off by name: reasoning models expose a thinking toggle as a model-id suffix — set ANTHROPIC_MODEL=…-OptiQ-4bit:no-think for direct answers (faster, no rambling) or :think for full reasoning, with no extra request fields.
Model id is forgiving: optiq serve is single-model by default, so if you forget to set ANTHROPIC_MODEL (Claude Code then sends claude-…) or send a basename, it's served the one local model rather than 404ing. Pass --allow-model-switch only if you want one server to hot-swap between cached quants.
Auth token: any string starting with sk-optiq- works. Mirrors Unsloth's sk-unsloth-* convention.
?beta=true query string: Claude Code appends ?beta=true to /v1/messages for the prompt-caching beta. Our endpoint strips the query string and routes the request normally, so the beta header is a no-op on the server side without breaking the wire.
Verified: tested against Claude Code 2.1.201 on macOS (Apple Silicon).

Why a local Claude Code matters Claude Code is one of the most polished coding agents shipping today. Running it against a local model gives you the same UX with no API spend, no network round-trips, and complete data sovereignty. The OptiQ + Qwen3.5-9B-MTP combination is fast enough for a fluid edit-and-run loop on M3 Pro and up.

Claude Code

1. Install Claude Code

2. Start optiq serve

3. Point Claude Code at OptiQ

4. Right-size the context window (auto-compact)

Notes

2. Start `optiq serve`