Getting MTP to actually work on Apple Silicon

Qwen3.5 and Qwen3.6 ship with a small extra prediction head bundled into the model. The literature calls it MTP, short for Multi-Token Prediction. The point of having it is to use the head as a draft model for speculative decoding: it spits out K candidate tokens fast, and the main model verifies all K in one parallel forward pass. When the head's guesses turn out to match what the main model would have produced anyway, you get those tokens essentially free.

We expected this to be a straightforward feature to wire through OptiQ. Hook into the serving stack's already-patched stream_generate path, route through our OptiqEngine, run a bench, publish the speedup. The plan was one afternoon.

It took longer. The wiring worked on the first try. What did not work was almost everything around it. We found three separate issues, fixed them in order, and ended up at the same 1.4x speedup Unsloth publishes for the same model class on an RTX 6000. This is what each fix was.

The memory thing

OptiQ's install_mtp_speculation patched mlx-lm's stream_generate so that when --mtp was on, generation routed through our OptiqEngine. The engine, on first request, called mtplx_load(model_path), which in turn called mlx_lm.load. The catch: mlx_lm.server had already loaded the same model. So we were holding two full copies of the base weights in unified memory.

For Qwen3.5-9B this meant peak memory went from 6.5 GB without MTP to 16.7 GB with it. The MTP head itself is only about 185 MB. The other 10 GB or so was the duplicate. On a 24 GB Mac this also explained why 27B refused to run with MTP at all: two copies of a 17 GB model do not fit in 24 GB.

The fix was an OptiqEngine.from_loaded(model, tokenizer, path) classmethod that takes the already-loaded model and injects MTP support into it in place. After the fix, 9B MTP peak dropped to 6.7 GB. 27B MTP fit at 17.6 GB peak, which is what actually made benching it possible at all.
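The shape of that fix can be sketched with a toy class. This is an illustration of the pattern only, not OptiQ's code: the `Engine` class, its `from_path` method, and `fake_loader` are hypothetical stand-ins, and the "model" here is just a placeholder object.

```python
class Engine:
    """Hypothetical stand-in for an engine that wraps a loaded model."""

    def __init__(self, model, tokenizer, path):
        self.model = model
        self.tokenizer = tokenizer
        self.path = path

    @classmethod
    def from_path(cls, path, loader):
        # Loads a fresh copy of the weights. If the server already holds
        # one, this is the duplicate.
        model, tokenizer = loader(path)
        return cls(model, tokenizer, path)

    @classmethod
    def from_loaded(cls, model, tokenizer, path):
        # Reuses the caller's objects, so no weights are loaded again.
        return cls(model, tokenizer, path)


load_calls = []

def fake_loader(path):
    """Placeholder loader that records how often weights are loaded."""
    load_calls.append(path)
    return object(), object()

# The server loads the model once, then hands the same objects to the engine.
server_model, server_tokenizer = fake_loader("some/model")
engine = Engine.from_loaded(server_model, server_tokenizer, "some/model")
```

After this runs, `load_calls` has a single entry and `engine.model` is the very object the server loaded, which is the property that keeps peak memory close to the no-MTP baseline.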

The verify was wrong at temperature

Greedy decoding worked fine. The verify took the main model's argmax and compared it to the drafted token. If they matched, accept. Standard greedy spec decoding.
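A minimal sketch of that greedy check, assuming the main model's argmax at each position is already available as a plain list. The function name and argument layout are hypothetical, chosen for illustration rather than taken from OptiQ.

```python
def greedy_verify(draft_tokens, main_argmax):
    """Return the tokens emitted by one greedy verify step.

    draft_tokens: K tokens proposed by the draft head.
    main_argmax:  K + 1 argmax tokens from the main model's parallel
                  forward pass. main_argmax[i] is the main model's choice
                  at the position where draft_tokens[i] was proposed; the
                  last entry is the bonus token that follows all K drafts.
    """
    if len(main_argmax) != len(draft_tokens) + 1:
        raise ValueError("expected K + 1 main-model predictions")
    emitted = []
    for drafted, verified in zip(draft_tokens, main_argmax):
        if drafted != verified:
            # First mismatch: keep the main model's token and stop.
            emitted.append(verified)
            return emitted
        emitted.append(drafted)
    # Every draft matched, so the bonus token comes along as well.
    emitted.append(main_argmax[-1])
    return emitted
```

With K=1, a match yields two tokens from one main-model forward pass and a mismatch yields one, so the output is always what plain greedy decoding would have produced.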

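At temp > 0 the same code path applied, just with sampled tokens instead of argmaxes. Sample from the main model's softmax, compare to the drafted token, accept on equality. This is wrong in a subtle way: both sides are now independent samples from their respective distributions. Two independent samples agree only with probability equal to the sum over tokens of p_main(x) * q_draft(x), which is far below 1 whenever the distributions are spread over several tokens, even if the draft head models the main distribution perfectly.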
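To put a number on the independent-samples problem, here is a small sketch with a made-up four-token distribution. The probabilities are illustrative, not measured.

```python
def match_probability(p_main, q_draft):
    """Chance that one sample from p_main equals one independent sample
    from q_draft: the sum over tokens of p_main(x) * q_draft(x)."""
    return sum(p * q for p, q in zip(p_main, q_draft))

# Illustrative distribution; the draft head is assumed to be a perfect
# copy of the main model here.
probs = [0.4, 0.3, 0.2, 0.1]
print(round(match_probability(probs, probs), 2))
```

Even with a draft head that reproduces the main distribution exactly, accept-on-equality passes only about 30 percent of drafts in this example.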

We measured this on 4B with temp=1.0, top_p=0.95, top_k=20 (the recipe Qwen ships in their generation_config.json). Acceptance came out at 28 percent. The greedy version on the same model was 67 percent.

The correct probabilistic verify is rejection sampling, due to Leviathan et al. and Chen et al. You accept the drafted token d_t with probability min(1, p_main(d_t) / q_draft(d_t)). If you reject, you sample a replacement from the residual distribution max(0, p_main - q_draft), renormalized to sum to 1. Rewriting the verify to do this raised 4B acceptance at the same sampler from 28 percent to 57 percent, and throughput went from 0.87x base (a regression) to 1.11x.
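The accept-or-resample rule can be sketched for a single drafted token as follows. This is a plain-Python illustration under simplifying assumptions (probabilities as lists over a tiny vocabulary, a hypothetical `verify_token` helper), not the MLX implementation.

```python
import random

def verify_token(draft_token, p_main, q_draft, rng=random):
    """Accept or replace one drafted token via rejection sampling.

    p_main and q_draft are probability lists over the same vocabulary.
    Returns (token, accepted).
    """
    p = p_main[draft_token]
    q = q_draft[draft_token]
    # q > 0 whenever the token was really sampled from q_draft.
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: sample from the residual max(0, p_main - q_draft).
    # random.choices normalizes the weights itself.
    residual = [max(0.0, pm - qd) for pm, qd in zip(p_main, q_draft)]
    if sum(residual) <= 0.0:
        # Only reachable when the two distributions coincide up to
        # rounding; fall back to the main distribution.
        residual = list(p_main)
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False
```

The emitted token is distributed according to p_main whatever the draft proposes, which is why this verify preserves the main model's sampling behavior while accepting far more drafts than an equality check.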

Truncation has to apply on both sides

Rejection sampling fixed the catastrophic case, but our temp=1.0 numbers were still below greedy. Depending on model size we were 8 to 17 percent behind the greedy speedup, which stood at 1.36x for 27B.

Unsloth's published numbers report around 83 percent acceptance for this model class. We were at 57 percent. The gap turned out to be sampling truncation. Unsloth's MTP guide specifies temp=1.0 with top_p=0.95 and top_k=20. Our verify was using the untruncated softmax even though the draft path was applying top_p and top_k. Truncating both the draft and verify distributions with the same rule (top 20 tokens, top 0.95 cumulative mass) makes the overlap between them, the sum over tokens of min(p_main, q_draft), much higher, and that overlap is exactly the expected acceptance rate of rejection sampling.
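A toy example shows how one-sided truncation caps acceptance. The `truncate` helper below is a simplified stand-in for a top_k / top_p sampler (real samplers may order or renormalize the steps differently), and the 12-token distribution is made up for illustration.

```python
def truncate(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative mass reaches top_p, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order[:top_k]:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out

def overlap(p, q):
    """Sum of min(p, q): expected acceptance rate of rejection sampling."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

# Illustrative long-tailed distribution; the draft head is assumed to be
# a perfect copy of the main model before truncation.
raw = [0.5, 0.3] + [0.02] * 10
draft = truncate(raw, top_k=2, top_p=0.95)

one_sided = overlap(raw, draft)                                    # verify untruncated
two_sided = overlap(truncate(raw, top_k=2, top_p=0.95), draft)     # verify truncated
```

In this example `one_sided` comes out near 0.8 and `two_sided` near 1.0: the tail mass the verify side keeps and the draft side drops can never be accepted, no matter how good the draft head is.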

After applying the same truncation on both sides, acceptance on 9B jumped from 32 percent to 63 percent. Throughput jumped from 1.00x to 1.28x.

We also added the model's recommended sampler from generation_config.json as a default in optiq serve and optiq lab. Without that, users who set just --temp 1.0 would see the same diffuse-distribution behavior we hit. The new default reads top_p and top_k from the model's published settings unless the user passes them explicitly.
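The defaulting rule can be sketched as below. The `resolve_sampler` function is a hypothetical helper, not part of the optiq CLI; it assumes the standard `temperature`, `top_p`, and `top_k` keys that Hugging Face style generation_config.json files use.

```python
import json
from pathlib import Path

def resolve_sampler(model_dir, temp=None, top_p=None, top_k=None):
    """Fill unset sampler arguments from the model's generation_config.json.

    Explicit (non-None) arguments always win over the model's defaults.
    """
    config_path = Path(model_dir) / "generation_config.json"
    defaults = {}
    if config_path.is_file():
        defaults = json.loads(config_path.read_text())
    return {
        "temp": temp if temp is not None else defaults.get("temperature"),
        "top_p": top_p if top_p is not None else defaults.get("top_p"),
        "top_k": top_k if top_k is not None else defaults.get("top_k"),
    }
```

A user who passes only a temperature then still gets the model's published top_p and top_k, while anything passed explicitly is left alone.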

What about depth 2 or higher

vLLM, Hugging Face Transformers, and Unsloth's MTP guide all use K=2 or higher as defaults. We tested K=2 through K=4 on all three model sizes and found that K=1 wins every time on Apple Silicon. The greedy numbers for 9B and 27B:

Model	K=1	K=2	K=3	K=4
Qwen3.5-9B	1.28x	1.19x	0.89x	0.74x
Qwen3.6-27B	1.36x	1.34x	0.94x	0.74x

The reason is hardware. On CUDA the K-token verify forward is nearly free because single-token decode is bound by memory bandwidth and leaves the Tensor Cores with spare throughput, so adding tokens to verify just uses idle compute. Apple Silicon GPUs have a compute-to-bandwidth ratio roughly 10x lower than an H100. K=2 verify on Metal costs about 2x what K=1 verify costs, not the 1.1x that CUDA gets.

Tokens-per-verify-cost works out like this:

Outcome	Tokens	Cost	Tokens/cost
K=1 full accept	2	1x	2.00
K=2 full accept	3	2x	1.50
K=2 partial (1/2)	2	2x	1.00

K=1's best case is already higher than K=2's best case on Apple Silicon. No algorithmic trick beats this without a custom Metal kernel that reduces the K-token verify cost.
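The same arithmetic extends from the best case to any acceptance rate. The sketch below assumes each draft is accepted independently with a fixed probability, which is a simplification; the verify costs (1x for K=1, 2x for K=2) are the ones from the table above.

```python
def expected_tokens(accept_rate, k):
    """Expected tokens emitted per verify cycle at draft depth k, assuming
    each draft is accepted independently with probability accept_rate and
    the cycle stops at the first rejection."""
    return sum(accept_rate ** i for i in range(k + 1))

def tokens_per_cost(accept_rate, k, verify_cost):
    """Expected tokens per unit of verify cost."""
    return expected_tokens(accept_rate, k) / verify_cost

for accept_rate in (0.5, 0.67, 0.83, 1.0):
    k1 = tokens_per_cost(accept_rate, k=1, verify_cost=1.0)
    k2 = tokens_per_cost(accept_rate, k=2, verify_cost=2.0)
    print(f"accept={accept_rate:.2f}  K=1: {k1:.2f}  K=2: {k2:.2f}")
```

Under these assumptions K=1 yields 1 + a tokens per unit cost and K=2 yields (1 + a + a^2) / 2, so K=1 stays ahead at every acceptance rate a between 0 and 1 as long as the cost ratio is 2x.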

We also tried Hugging Face's dynamic depth heuristic, where K rises after a fully accepted cycle and falls on partial accept. It lost 4 to 17 percent across all sizes versus fixed K=1. No pattern of K=2 cycles the heuristic can settle into outweighs the cost when the cost ratio is 2x.
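The update rule behind such a heuristic is short. This sketch follows the description above only; the step size of 1 and the bounds are assumptions for illustration and may not match the Transformers implementation.

```python
def next_depth(current_k, accepted, k_min=1, k_max=4):
    """Pick the draft depth for the next cycle.

    Raise K after a fully accepted cycle, lower it otherwise.
    Step size and bounds are illustrative assumptions.
    """
    if accepted == current_k:
        return min(current_k + 1, k_max)
    return max(current_k - 1, k_min)
```

Starting from K=1, a single accepted draft pushes the next cycle to K=2, so the heuristic cannot avoid the 2x verify cost whenever the draft head is doing well.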

Where we landed

The OptiQ 0.1.0 MTP numbers, M4 Pro 24 GB, greedy decoding, 512-token generation:

Model	Base tok/s	MTP tok/s	Speedup
Qwen3.5-4B	29.2	35.0	1.20x
Qwen3.5-9B	19.5	25.8	1.32x
Qwen3.6-27B	6.0	8.4	1.40x

With Qwen's recommended sampler (temp=1.0, top_p=0.95, top_k=20), the speedups become 1.09x / 1.17x / 1.30x for 4B / 9B / 27B. The 27B greedy number is within 5 percent of Unsloth's published 1.4x on the same model class running on an RTX 6000. We will take that on a Mac.

For 0.1.0 we ship K=1 with the numbers above. The MTP guide has the reference table, methodology, and the --mtp flag.