mlx-optiq

MTP and spec decoding on Apple Silicon

OptIQ ships two flavors of speculative decoding. Both follow the same draft, verify, accept loop, but the source of drafts is different per model family.

Qwen3.5 and Qwen3.6 each ship with an extra tiny prediction head bundled into the model weights. The literature calls it MTP, for Multi-Token Prediction. OptIQ uses this head as a draft model when you turn on speculative decoding. The base model still produces every output token, but the head tries to guess ahead, and on the cycles where its guess matches what the base model wanted, that token is free.

Gemma-4 does not ship an MTP head. Google instead publishes a separate small drafter model, the -assistant variant: a 4-layer Q-only model trained to share K and V with two specific layers of the target. OptIQ loads this drafter alongside the target and runs the same outer spec loop. The plumbing is different (typed K/V sharing, target-hidden conditioning, two-layer-type cache slicing), but from the serving side it is just another spec backend.
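The draft, verify, accept loop shared by both backends can be sketched with toy models. This is a minimal illustration and not OptIQ's engine: toy_target and toy_drafter are hypothetical stand-ins, and the toy calls the target twice per cycle where a real engine would score both positions in a single verify forward.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target_next: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain greedy decoding: one target call per emitted token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def greedy_spec_decode(
    target_next: NextToken, draft_next: NextToken, prompt: List[int], max_new: int
) -> Tuple[List[int], int, int]:
    """Depth-1 greedy speculative decoding. Returns (tokens, accepted, attempted)."""
    seq = list(prompt)
    accepted = attempted = 0
    while len(seq) - len(prompt) < max_new:
        draft = draft_next(seq)       # 1. draft: guess one token ahead
        attempted += 1
        verified = target_next(seq)   # 2. verify: the token the target wants
        seq.append(verified)          # the target's token is always emitted
        if draft == verified:         # 3. accept: the guess was right
            accepted += 1
            if len(seq) - len(prompt) < max_new:
                # Bonus token. A real engine reads it from the same verify
                # forward; the toy simply calls the target again.
                seq.append(target_next(seq))
    return seq[len(prompt):], accepted, attempted


def toy_target(seq: List[int]) -> int:
    """Hypothetical deterministic 'model' over a 50-token vocabulary."""
    return (31 * sum(seq) + 7) % 50


def toy_drafter(seq: List[int]) -> int:
    """Agrees with the target except when len(seq) is a multiple of 3."""
    guess = toy_target(seq)
    return guess if len(seq) % 3 else (guess + 1) % 50


prompt = [1, 2, 3]
baseline = greedy_decode(toy_target, prompt, 12)
tokens, accepted, attempted = greedy_spec_decode(toy_target, toy_drafter, prompt, 12)
print(tokens == baseline, accepted, attempted)
```

The output sequence is identical to plain greedy decoding because every emitted token comes from the target. The drafter only changes how many tokens land per cycle.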

On a 24 GB M4 Pro with our published OptIQ-4bit quants and greedy decoding, we measure 1.20x on Qwen3.5-4B, 1.32x on Qwen3.5-9B, and 1.40x on Qwen3.6-27B with the MTP head, and 1.18x geomean on Gemma-4 E4B with the -assistant drafter. Bigger Qwen models benefit more because the verify cost amortizes over a slower base. Gemma-4's smaller speedup follows from a lower acceptance rate, which has more to do with bf16 numeric drift in mlx-lm's multi-token verify than with drafter quality; see the Gemma-4 E4B section below.

Turning it on

For Qwen3.5 or Qwen3.6 with their bundled MTP head:

serve.sh (bash)
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit --mtp

For Gemma-4 with its -assistant drafter (Phase 3 ships γ=1 greedy):

spec.py (python)
from optiq.runtime.spec import GemmaAssistantDrafter, spec_generate, SpecConfig
from mlx_lm.utils import load_model, load_tokenizer

target, _ = load_model("mlx-community/gemma-4-E4B-it-4bit", lazy=False)
tokenizer = load_tokenizer("mlx-community/gemma-4-E4B-it-4bit")
drafter   = GemmaAssistantDrafter.from_pretrained(
              "mlx-community/gemma-4-E4B-it-assistant-bf16")

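# Example prompt; replace with your own.
prompt = "Explain speculative decoding in two sentences."

# Stream events from the spec loop and print text as tokens arrive.
for ev in spec_generate(target, drafter, tokenizer, prompt, SpecConfig(gamma=1)):
    if ev.kind == "token":
        print(ev.text, end="", flush=True)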

The Server page in OptIQ Lab has both a checkbox for the Qwen MTP path and a "Spec drafter" picker for the Gemma -assistant path. They are mutually exclusive: pick one or the other per loaded model. Qwen paths look up the model's recommended sampling settings from generation_config.json and apply them unless you pass your own via --temp, --top-p, and --top-k. The Gemma path is currently greedy only.
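The precedence described above (recommended settings from generation_config.json, overridden by explicit flags) can be sketched as follows. This is an illustration, not OptIQ's code: resolve_sampling is a hypothetical helper, the key names follow the usual Hugging Face generation_config.json fields, and per-flag override is an assumption about how partial overrides behave.

```python
import json
from typing import Optional


def resolve_sampling(
    generation_config_text: str,
    temp: Optional[float] = None,
    top_p: Optional[float] = None,
    top_k: Optional[int] = None,
) -> dict:
    """Start from the model's recommended sampler, then apply explicit overrides."""
    recommended = json.loads(generation_config_text)
    settings = {
        "temperature": recommended.get("temperature"),
        "top_p": recommended.get("top_p"),
        "top_k": recommended.get("top_k"),
    }
    overrides = {"temperature": temp, "top_p": top_p, "top_k": top_k}
    for key, value in overrides.items():
        # Compare against None so an explicit 0.0 (greedy) is still honored.
        if value is not None:
            settings[key] = value
    return settings


# Example file contents, using the Qwen sampler values quoted on this page.
config_text = '{"temperature": 1.0, "top_p": 0.95, "top_k": 20}'

defaults = resolve_sampling(config_text)
greedy = resolve_sampling(config_text, temp=0.0)
print(defaults)
print(greedy)
```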

Numbers

We report greedy here because it isolates the speedup measurement from sampling noise and matches how unsloth and the upstream MTP literature publish their headlines.

| Model | Base tok/s | MTP tok/s | Speedup | Acceptance |
|---|---|---|---|---|
| Qwen3.5-4B | 29.2 | 35.0 | 1.20x | 67% |
| Qwen3.5-9B | 19.5 | 25.8 | 1.32x | 66% |
| Qwen3.6-27B | 6.0 | 8.4 | 1.40x | 72% |

With Qwen's recommended production sampler (temp=1.0, top_p=0.95, top_k=20), the speedup is smaller but still positive everywhere:

| Model | Base tok/s | MTP tok/s | Speedup | Acceptance |
|---|---|---|---|---|
| Qwen3.5-4B | 28.7 | 31.2 | 1.09x | 56% |
| Qwen3.5-9B | 19.1 | 22.3 | 1.17x | 56% |
| Qwen3.6-27B | 6.2 | 8.0 | 1.30x | 56% |

Acceptance follows the literature definition, drafts_accepted / drafts_attempted, and is read straight from the engine's counters.
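To relate acceptance to speedup, here is a small worked calculation on the greedy Qwen numbers. It assumes that each depth-1 cycle emits one guaranteed token plus one more when the draft is accepted, so tokens per cycle is 1 + acceptance. Dividing that by the measured speedup gives the implied cost of one spec cycle relative to a plain decode step. The function names are illustrative only.

```python
# (acceptance, measured speedup) for the greedy Qwen results reported above.
greedy_rows = {
    "Qwen3.5-4B": (0.67, 1.20),
    "Qwen3.5-9B": (0.66, 1.32),
    "Qwen3.6-27B": (0.72, 1.40),
}


def tokens_per_cycle(acceptance: float) -> float:
    """Depth 1: one verified token always lands, plus one when the draft is accepted."""
    return 1.0 + acceptance


def implied_cycle_cost(acceptance: float, speedup: float) -> float:
    """Cost of one spec cycle in units of a plain single-token decode step."""
    return tokens_per_cycle(acceptance) / speedup


for name, (acc, speedup) in greedy_rows.items():
    cost = implied_cycle_cost(acc, speedup)
    print(f"{name}: {tokens_per_cycle(acc):.2f} tokens/cycle, cycle cost ~{cost:.2f}x")
```

Under that assumption the implied cycle cost falls from about 1.39x on the 4B model to about 1.23x on the 27B model, which fits the observation that verify overhead amortizes better over a slower base.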

Gemma-4 E4B with -assistant drafter

Greedy, γ=1, 200-token generation, five prompt categories, median of three runs each on M4 Pro 24 GB:

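| Prompt type | Base tok/s | Spec tok/s | Speedup | Acceptance |
|---|---|---|---|---|
| math | 29.97 | 38.66 | 1.29x | 37.5% |
| code | 29.68 | 37.09 | 1.25x | 34.0% |
| prose | 31.19 | 36.65 | 1.18x | 30.3% |
| dialogue | 31.66 | 35.17 | 1.11x | 29.5% |
| reasoning | 30.43 | 32.14 | 1.06x | 25.5% |
| Geomean | | | 1.18x | 31.4% |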

Acceptance is lower than Qwen MTP for two reasons. First, the Gemma drafter is a separately trained Q-only model that has to predict the target's distribution from a few shared cache layers, not a head that was co-trained on the target's loss. Second, mlx-lm's multi-token verify forward is not bit-identical to the equivalent sequence of single-token forwards due to bf16 attention precision; the largest diff we measured was 0.68 in logit magnitude at the second of two positions. This means a draft the target would have accepted in a single-token verify can be rejected in a multi-token verify. Greedy outputs still match a baseline greedy run for a long prefix (200 tokens identical on our reasoning prompt), then drift on longer sequences.
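To see why two mathematically equivalent bf16 computations need not agree bit for bit, the sketch below emulates bf16 rounding in pure Python and sums the same three numbers in two orders. It only illustrates that bf16 addition is not associative. Whether accumulation order is the actual cause of the drift in mlx-lm's verify forward is an assumption here, not something this page establishes. to_bf16 is a hypothetical helper that ignores NaN and overflow handling.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values):
    """Accumulate left to right, rounding to bf16 after every addition."""
    total = 0.0
    for v in values:
        total = to_bf16(total + v)
    return total


large_first = bf16_sum([256.0, 1.0, 1.0])  # each +1 is rounded away
small_first = bf16_sum([1.0, 1.0, 256.0])  # 1 + 1 = 2 survives the final add
print(large_first, small_first)
```

Between 256 and 512 adjacent bf16 values are 2 apart, so adding 1 to 256 rounds back to 256, while adding 2 is exact. A discrepancy of this kind in a logit is enough to flip an argmax when the top two candidates are close.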

γ-sweep on the math prompt above (200 tokens, median of three):

| γ | Spec tok/s | Speedup |
|---|---|---|
| baseline | 29.09 | 1.00x |
| 1 | 39.02 | 1.34x |
| 2 | 36.91 | 1.27x |
| 3 | 28.04 | 0.96x |
| 4 | 23.83 | 0.82x |
| 5 | 20.39 | 0.70x |

γ=1 is optimal on Metal for the same reason it is optimal for Qwen MTP: the K-token verify forward scales near-linearly with K, while acceptance stays roughly constant per draft slot. The math is in the blog post Getting MTP to actually work on Apple Silicon, under "What about depth 2 or higher". γ>1 is implemented and is lossless within the same bf16 precision bound as γ=1, but the shipped default is γ=1.

Where MTP does not pay off

For Qwen3.5 0.8B the base model is already at 130 tokens per second, and the speculation overhead per cycle eats more than the head can give back. We measured a regression to about 0.7x. The 2B model lands close to break-even. Skip MTP at these sizes. 4B and up consistently win.

About depth

The default is depth 1, meaning one drafted token per cycle. Depth 2 and above does not help on Apple Silicon. The reason is that Metal's K-token verify forward scales close to linearly with K, while on CUDA the same forward is nearly free due to spare matmul throughput on Tensor Cores. We measured depth 2 through 4, and depth 1 wins in every configuration. We also tried HuggingFace's adaptive depth heuristic (raise K after a clean cycle, lower it on a partial accept). It lost 4 to 17 percent depending on the model and sampler. So we ship a fixed depth of 1.
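The trade-off can be sketched with a simple cost model. It assumes each draft slot is accepted independently with the same probability a and that a cycle stops at the first rejected draft, so the expected tokens per cycle at depth K is 1 + a + ... + a^K. It also assumes cycle cost grows by a fixed amount per extra draft slot. The cost parameters below are made-up illustrative values, not measurements, and the function names are hypothetical.

```python
def expected_tokens(acceptance: float, depth: int) -> float:
    """Expected tokens per cycle: 1 + a + a^2 + ... + a^depth."""
    return sum(acceptance ** i for i in range(depth + 1))


def modeled_speedup(
    acceptance: float, depth: int, marginal_verify_cost: float, draft_cost: float
) -> float:
    """Speedup over plain decoding, with costs in units of one single-token forward."""
    cycle_cost = 1.0 + depth * (marginal_verify_cost + draft_cost)
    return expected_tokens(acceptance, depth) / cycle_cost


def best_depth(
    acceptance: float, marginal_verify_cost: float, draft_cost: float, max_depth: int = 5
) -> int:
    """Depth in 1..max_depth with the highest modeled speedup."""
    return max(
        range(1, max_depth + 1),
        key=lambda k: modeled_speedup(acceptance, k, marginal_verify_cost, draft_cost),
    )


# Illustrative parameters only: an expensive extra verify position (Metal-like)
# versus a nearly free one (CUDA-like), both at 67% acceptance.
metal_like = best_depth(0.67, marginal_verify_cost=0.35, draft_cost=0.05)
cuda_like = best_depth(0.67, marginal_verify_cost=0.02, draft_cost=0.05)
print(metal_like, cuda_like)
```

When each extra verified position costs a sizable fraction of a forward pass, the geometric decay in expected tokens loses to the linear growth in cost and depth 1 comes out best. When extra positions are nearly free, deeper drafts win.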

How MTP plays with the rest of OptIQ

Weight quantization is fine. The MTP head ships as a 4-bit projection with a BF16 final layer in our quants, matching the host model's bit width on the projection weights.

All three serving endpoints (OpenAI, Anthropic, Responses) work with MTP. Tools using these APIs do not need any knowledge that speculation is happening.

Chat templates work the same. enable_thinking=False still applies.

Compatibility

| Family | Spec backend | Status |
|---|---|---|
| Qwen3.5 | Bundled MTP head | Yes for 4B and up. 0.8B and 2B regress; skip. |
| Qwen3.6 | Bundled MTP head | Yes for 27B. |
| Gemma-4 E4B | External -assistant drafter | Yes. Greedy, γ configurable; γ=1 default and optimal on Metal. 1.18x geomean on M4 Pro 24 GB. |
| Gemma-4 E2B / 26B / 31B | No drafter published | Google has not released matching -assistant weights for the other Gemma-4 sizes. |

Methodology

Each Qwen measurement runs in its own subprocess for clean memory state. The prompt is a 166-token Python question. We generate 512 tokens. Decode tokens per second comes from mlx-lm's GenerationResponse.generation_tps, which is measured after prefill so it captures only the decoding phase.

For Gemma-4 we run five chat-templated prompts (one each from math, code, prose, dialogue, reasoning), 200-token generations, median of three runs per prompt, all in a single Python process. Decode tokens per second is wall-clock n_tokens / elapsed for both baseline and spec, with a warm-up generation discarded.
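The Gemma-4 timing procedure (discard a warm-up generation, time each run by wall clock, take the median of three) can be sketched as below. This is not the actual benchmark script: measure_decode_tps is a hypothetical helper, and generate stands for any callable that runs one generation and returns the number of tokens produced. The clock is injectable so the example can run deterministically.

```python
import statistics
import time
from typing import Callable


def measure_decode_tps(
    generate: Callable[[], int],
    runs: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Median tokens per second over `runs` timed generations."""
    generate()  # warm-up generation, result discarded
    samples = []
    for _ in range(runs):
        start = clock()
        n_tokens = generate()
        elapsed = clock() - start
        samples.append(n_tokens / elapsed)  # wall-clock n_tokens / elapsed
    return statistics.median(samples)


# Deterministic demo: a fake generator that always yields 200 tokens and a
# fake clock that reports run durations of 2 s, 4 s and 5 s.
ticks = iter([0.0, 2.0, 10.0, 14.0, 20.0, 25.0])
tps = measure_decode_tps(lambda: 200, runs=3, clock=lambda: next(ticks))
print(tps)
```

In a real run, the same function would be called once with a baseline generation and once with the spec generation, and the ratio of the two medians is the reported speedup.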

Qwen acceptance comes from OptiqEngine's drafts_accepted and drafts_attempted counters. Gemma acceptance comes from the equivalent counters inside SpecStats in optiq.runtime.spec. Both follow the standard literature definition, so the comparisons are apples to apples with unsloth and llama.cpp numbers.

Hardware: Apple M4 Pro, 24 GB unified memory, 19.1 GB Apple-recommended working set.

For the longer story of how we got MTP working correctly on this stack, see the blog post on Apple Silicon MTP. For the Gemma -assistant path, see Gemma-4 spec decoding on Apple Silicon.