Workflow · MTP and spec decoding

MTP and spec decoding on Apple Silicon

OptiQ ships two flavors of speculative decoding. Both follow the same draft, verify, accept loop, but the source of drafts is different per model family.

Qwen3.5 and Qwen3.6 each ship with an extra tiny prediction head bundled into the model weights. The literature calls it MTP, for Multi-Token Prediction. OptiQ uses this head as a draft model when you turn on speculative decoding. The base model still produces every output token, but the head tries to guess ahead, and on the cycles where its guess matches what the base model wanted, that token is free.

Gemma-4 does not ship an MTP head. Google instead publishes a separate small drafter model, the so-called -assistant variant, that is a 4-layer Q-only model trained to share K and V with two specific layers of the target. OptiQ loads this drafter alongside the target and runs the same outer spec loop. The plumbing is different (typed K/V sharing, target-hidden conditioning, two-layer-type cache slicing) but from the serving side it is just another spec backend.

On a 24 GB M4 Pro with our published OptiQ-4bit quants and greedy decoding we measure 1.20x on Qwen3.5-4B, 1.32x on 9B, and 1.40x on Qwen3.6-27B with the MTP head, and 1.18x geomean on Gemma-4 E4B with the -assistant drafter. Bigger Qwen models benefit more because the verify cost amortizes over a slower base. Gemma-4's smaller speedup follows from a lower acceptance rate, which has more to do with bf16 numeric drift in mlx-lm's multi-token verify than with the drafter quality, see the methodology section.

Turning it on

For Qwen3.5 or Qwen3.6 with their bundled MTP head:

serve.shbash

$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit --mtp

For Gemma-4 with its -assistant drafter (Phase 3 ships γ=1 greedy):

spec.pypython

from optiq.runtime.spec import GemmaAssistantDrafter, spec_generate, SpecConfig
from mlx_lm.utils import load_model, load_tokenizer

target, _ = load_model("mlx-community/gemma-4-E4B-it-4bit", lazy=False)
tokenizer = load_tokenizer("mlx-community/gemma-4-E4B-it-4bit")
drafter   = GemmaAssistantDrafter.from_pretrained(
              "mlx-community/gemma-4-E4B-it-assistant-bf16")

for ev in spec_generate(target, drafter, tokenizer, prompt, SpecConfig(gamma=1)):
    if ev.kind == "token": print(ev.text, end="")

The Server page in OptiQ Lab has both a checkbox for the Qwen MTP path and a "Spec drafter" picker for the Gemma -assistant path. They are mutually exclusive: pick one or the other per loaded model. Qwen paths look up the model's recommended sampling settings from generation_config.json and apply them unless you pass your own via --temp, --top-p, and --top-k. The Gemma path is currently greedy only.

Numbers

We report greedy here because it isolates the speedup measurement from sampling noise and matches how unsloth and the upstream MTP literature publish their headlines.

Model	Base tok/s	MTP tok/s	Speedup	Acceptance
Qwen3.5-4B	29.2	35.0	1.20x	67%
Qwen3.5-9B	19.5	25.8	1.32x	66%
Qwen3.6-27B	6.0	8.4	1.40x	72%

With Qwen's recommended production sampler (temp=1.0, top_p=0.95, top_k=20), the speedup is smaller but still positive everywhere:

Model	Base tok/s	MTP tok/s	Speedup	Acceptance
Qwen3.5-4B	28.7	31.2	1.09x	56%
Qwen3.5-9B	19.1	22.3	1.17x	56%
Qwen3.6-27B	6.2	8.0	1.30x	56%

Acceptance is the literature definition, drafts_accepted / drafts_attempted, read straight from the engine.

Gemma-4 E4B with `-assistant` drafter

Greedy, γ=1, 200-token generation, five prompt categories, median of three runs each on M4 Pro 24 GB:

Prompt type	Base tok/s	Spec tok/s	Speedup	Acceptance
math	29.97	38.66	1.29x	37.5%
code	29.68	37.09	1.25x	34.0%
prose	31.19	36.65	1.18x	30.3%
dialogue	31.66	35.17	1.11x	29.5%
reasoning	30.43	32.14	1.06x	25.5%
Geomean			1.18x	31.4%

Acceptance is lower than Qwen MTP for two reasons. First, the Gemma drafter is a separately trained Q-only model that has to predict the target's distribution from a few shared cache layers, not a head that was co-trained on the target's loss. Second, mlx-lm's multi-token verify forward is not bit-identical to the equivalent sequence of single-token forwards due to bf16 attention precision; the largest diff we measured was 0.68 in logit magnitude at the second of two positions. This means a draft the target would have accepted in a single-token verify can be rejected in a multi-token verify. Greedy outputs still match a baseline greedy run for a long prefix (200 tokens identical on our reasoning prompt), then drift on longer sequences.

γ-sweep on the math prompt above (200 tokens, median of three):

γ	Spec tok/s	Speedup
baseline	29.09	1.00x
1	39.02	1.34x
2	36.91	1.27x
3	28.04	0.96x
4	23.83	0.82x
5	20.39	0.70x

γ=1 is optimal on Metal for the same reason it is optimal for Qwen MTP: the K-token verify forward scales near-linearly with K, while acceptance stays roughly constant per draft slot. The math is in Getting MTP to actually work on Apple Silicon, "What about depth 2 or higher". γ>1 is implemented and lossless within the same bf16 precision bound as γ=1, but ships defaulted to γ=1.

Where MTP does not pay off

For Qwen3.5 0.8B the base model is already at 130 tokens per second, and the speculation overhead per cycle eats more than the head can give back. We measured a regression to about 0.7x. The 2B model lands close to break-even. Skip MTP at these sizes. 4B and up consistently win.

About depth

The default is depth 1, meaning one drafted token per cycle. Depth 2 and above does not help on Apple Silicon. The reason is that Metal's K-token verify forward scales close to linearly with K, while on CUDA the same forward is nearly free due to spare matmul throughput on Tensor Cores. We measured depth 2 through 4 and depth 1 wins every single configuration. We also tried HuggingFace's adaptive depth heuristic (raise K after a clean cycle, lower it on partial accept). It lost 4 to 17 percent depending on the model and sampler. So we ship a fixed depth 1.

How MTP plays with the rest of OptiQ

Weight quantization is fine. The MTP head ships as a 4-bit projection with a BF16 final layer in our quants, matching the host model's bit width on the projection weights.

All three serving endpoints (OpenAI, Anthropic, Responses) work with MTP. Tools using these APIs do not need any knowledge that speculation is happening.

Chat templates work the same. enable_thinking=False still applies.

Compatibility

Family	Spec backend	Status
Qwen3.5	Bundled MTP head	Yes for 4B and up. 0.8B and 2B regress; skip.
Qwen3.6	Bundled MTP head	Yes for 27B.
Gemma-4 E4B	External `-assistant` drafter	Yes. Greedy, γ configurable; γ=1 default and optimal on Metal. 1.18x geomean on M4 Pro 24 GB.
Gemma-4 E2B / 26B / 31B	No drafter published	Google has not released matching `-assistant` weights for the other Gemma-4 sizes.

Methodology

Each Qwen measurement runs in its own subprocess for clean memory state. The prompt is a 166-token Python question. We generate 512 tokens. Decode tokens per second comes from mlx-lm's GenerationResponse.generation_tps, which is measured after prefill so it captures only the decoding phase.

For Gemma-4 we run five chat-templated prompts (one each from math, code, prose, dialogue, reasoning), 200-token generations, median of three runs per prompt, all in a single Python process. Decode tokens per second is wall-clock n_tokens / elapsed for both baseline and spec, with a warm-up generation discarded.

Qwen acceptance comes from OptiqEngine's drafts_accepted and drafts_attempted counters. Gemma acceptance comes from the equivalent counters inside SpecStats in optiq.runtime.spec. Both follow the standard literature definition, so the comparisons are apples to apples with unsloth and llama.cpp numbers.

Hardware: Apple M4 Pro, 24 GB unified memory, 19.1 GB Apple-recommended working set.

For the longer story of how we got MTP working correctly on this stack, see the blog post on Apple Silicon MTP. For the Gemma -assistant path, see Gemma-4 spec decoding on Apple Silicon.