mlx-optiq

Codex

OpenAI's Codex CLI speaks only the OpenAI Responses API (Chat Completions was deprecated for Codex in 2026). optiq serve exposes /v1/responses by default, so Codex can talk to your local OptIQ-quantized model after a short addition to ~/.codex/config.toml.

1. Install Codex

terminal (bash)
$ npm install -g @openai/codex

2. Start optiq serve

terminal (bash)
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --mtp --mtp-depth 2 \
    --port 8080

The --responses flag is on by default and serves /v1/responses. --mtp enables in-checkpoint MTP (multi-token prediction) speculation, for a ~1.4-1.8× decode speedup on the Qwen3.5 / 3.6 family.
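Before pointing Codex at the server, it can help to confirm that /v1/responses answers at all. The following is a minimal sketch, not part of optiq: it assumes the endpoint accepts the standard Responses API body (`model` plus `input`), and the helper names `build_responses_request` and `send` are hypothetical. Whether optiq serve checks the bearer token is not stated here, so sending one is optional.

```python
import json
import urllib.request


def build_responses_request(base_url, model, prompt, token=None):
    """Build (but do not send) a POST to <base_url>/responses.

    Assumption: the server accepts the standard Responses API body,
    i.e. a JSON object with "model" and "input".
    """
    body = json.dumps({"model": model, "input": prompt}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: a bearer token is passed the same way Codex would pass it.
        headers["Authorization"] = f"Bearer {token}"
    url = f"{base_url.rstrip('/')}/responses"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send(request, timeout=60):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the server from step 2 running, `send(build_responses_request("http://localhost:8080/v1", "mlx-community/Qwen3.5-9B-OptiQ-4bit", "Say hello."))` should return a JSON object if the endpoint is up.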

3. Configure Codex

Edit ~/.codex/config.toml and add:

~/.codex/config.toml (toml)
[model_providers.optiq]
name                  = "OptIQ Local"
base_url              = "http://localhost:8080/v1"
env_key               = "OPTIQ_AUTH_TOKEN"
wire_api              = "responses"
requires_openai_auth  = false

[profiles.optiq]
model_provider = "optiq"
model          = "mlx-community/Qwen3.5-9B-OptiQ-4bit"

Then export the auth token and launch Codex with this profile:

terminal (bash)
$ export OPTIQ_AUTH_TOKEN=sk-optiq-local
$ codex -p optiq

Notes

  • wire_api = "responses" is required. Codex no longer accepts wire_api = "chat".
  • Tool calls: Codex relies on function-calling for its edit / run / search loop. Models without robust function-calling training won't drive the full agent. Qwen3.5-9B-OptiQ and up handle this well; smaller models work for plain chat only.
  • Built-in tools (web_search, file_search, computer_use): requests for these are silently dropped by our shim. Codex's local tool stack (apply_patch, shell, etc.) runs in the CLI itself and works fine.
  • Streaming: works. Codex relies on response.output_text.delta and response.completed events; both are emitted in spec-compliant order.
  • Verified: tested against Codex v0.130.0 on macOS (Apple Silicon).
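To illustrate the streaming note above, here is a small sketch of how a client could consume the two events Codex depends on. It assumes the stream is delivered as server-sent events whose `data:` lines carry JSON objects with a `type` field, as in the OpenAI Responses API. The function name `collect_output_text` is hypothetical, and this is not how Codex itself is implemented.

```python
import json


def collect_output_text(lines):
    """Join text deltas from an SSE line stream until response.completed.

    Returns (text, completed), where completed is False if the stream
    ended without a response.completed event.
    """
    chunks = []
    completed = False
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip "event:" lines, comments, and blank separators
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        event = json.loads(payload)
        kind = event.get("type")
        if kind == "response.output_text.delta":
            chunks.append(event.get("delta", ""))
        elif kind == "response.completed":
            completed = True
            break
    return "".join(chunks), completed
```

Any iterable of decoded lines works as input, such as the lines of a streamed HTTP response body. Other event types are ignored, so the sketch tolerates additional events a server may emit.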

Codex + MTP on Apple Silicon

Codex's edit-run-review loop is decode-heavy. With --mtp, MTP speculation holds at ~70% acceptance on the Qwen3.5 / 3.6 family for typical code-edit prompts. It pairs well with longer max_tokens settings, since the MTP head amortizes the per-token cost.
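As a back-of-the-envelope illustration of how acceptance rate and speculation depth relate to speedup, the sketch below uses the standard speculative-decoding estimate. It rests on several assumptions not confirmed by this page: that the ~70% figure is a per-draft-token acceptance rate, that acceptances are independent, and that each drafted token costs a fixed fraction `draft_cost` of a main-model step. `draft_cost` is a hypothetical parameter, not an optiq setting.

```python
def expected_tokens_per_step(acceptance, depth):
    """Expected tokens emitted per verification pass.

    One token always comes from the main model; draft token i is kept
    only if all i drafts up to it were accepted (probability acceptance**i).
    """
    return sum(acceptance ** i for i in range(depth + 1))


def estimated_speedup(acceptance, depth, draft_cost):
    """Tokens per step divided by the relative cost of one step.

    draft_cost is the assumed cost of one drafted token, as a fraction
    of a full main-model decode step.
    """
    return expected_tokens_per_step(acceptance, depth) / (1.0 + depth * draft_cost)
```

Under these assumptions, 70% acceptance at depth 2 gives about 1 + 0.7 + 0.49 = 2.19 tokens per pass. The resulting speedup depends entirely on the assumed overhead: with an illustrative `draft_cost` of 0.2 it comes to roughly 1.56×. Measured numbers on real hardware may differ.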