# Qwen3.5 on Apple Silicon
Qwen3.5 is Alibaba's late-2025 release: five dense models plus one sparse MoE. We ship OptiQ 4-bit quants for all six. All six share the same chat template, the same hybrid linear + full-attention architecture, and the same `<think>...</think>` reasoning channel.
## Available quants
| Model | Size | GSM8K | Best for |
|---|---|---|---|
| Qwen3.5-0.8B-OptiQ-4bit | 0.5 GB | 27.0% | Toy agents, prompt rewriters |
| Qwen3.5-2B-OptiQ-4bit | 1.4 GB | 48.0% | Local-only chat, classifiers |
| Qwen3.5-4B-OptiQ-4bit | 2.8 GB | 81.5% | Sweet spot for laptops |
| Qwen3.5-9B-OptiQ-4bit | 5.6 GB | 90.0% | Default daily-driver |
| Qwen3.5-27B-OptiQ-4bit | 15.7 GB | 87.5% | Long-form reasoning |
| Qwen3.5-35B-A3B-OptiQ-4bit | 20.1 GB | 89.5% | Sparse MoE, 3B active |
## Hello world
```python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of Australia?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=200))
```
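For interactive use you can stream tokens as they decode instead of waiting for the full reply. A minimal sketch using mlx_lm's `stream_generate` (available in recent releases), reusing the model and prompt from above:

```python
from mlx_lm import stream_generate

# Print each chunk as it is generated
for chunk in stream_generate(model, tok, prompt=prompt, max_tokens=200):
    print(chunk.text, end="", flush=True)
print()
```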
## Recommended sampling
The Qwen3.5-Instruct variants behave well with these settings:
```python
from mlx_lm.sample_utils import make_sampler

# Reasoning / math / code
sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)

# Conversational chat
sampler = make_sampler(temp=0.7, top_p=0.9)

# Deterministic / classification
sampler = make_sampler(temp=0.0)
```
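The sampler is applied by passing it to `generate` via the `sampler` keyword (supported by recent mlx_lm releases). For example, with the reasoning settings:

```python
# Generate with the reasoning/math/code sampler from above
sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)
print(generate(model, tok, prompt=prompt, sampler=sampler, max_tokens=400))
```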
## Reasoning channel
Qwen3.5 has built-in chain-of-thought via `<think>...</think>` tags. It is enabled by default for the instruct variants:
```python
# Default: thinking enabled (slower, more accurate on math/logic)
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Disable thinking for snappier replies (chat, classification)
prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```
To programmatically strip the `<think>` block before showing output to the user:
```python
import re

out = generate(model, tok, prompt=prompt, max_tokens=800)
final = re.sub(r"<think>.*?</think>", "", out, flags=re.DOTALL).strip()
```
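If you would rather keep the reasoning trace for logging instead of discarding it, a small helper can split the output into both halves (`split_thinking` is an illustrative name, not part of mlx_lm):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split model output into (reasoning trace, final answer)."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

trace, answer = split_thinking(out)
```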
## Hybrid attention
Qwen3.5 uses a hybrid architecture: most layers are linear-attention (Gated DeltaNet), and a sparse subset are full-attention. Layer indices [3, 7, 11, 15, 19, 23, ...] (every 4th layer) are full-attention; the rest are linear. This matters for KV-cache serving: only the full-attention layers carry a KV cache that needs sensitivity-aware quantization, and the `optiq kv-cache` command correctly skips the linear-attention layers.
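The every-4th-layer pattern is easy to enumerate. A sketch based only on the rule stated above (the layer count here is illustrative; read it from the model config in practice):

```python
def full_attention_layers(num_layers: int, period: int = 4) -> list[int]:
    # Every `period`-th layer, starting at index 3, is full attention;
    # the remaining layers use linear attention (Gated DeltaNet).
    return [i for i in range(num_layers) if i % period == period - 1]

print(full_attention_layers(36))  # [3, 7, 11, 15, 19, 23, 27, 31, 35]
```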
## Long-context serving
For 64k-token contexts, run a one-time KV sensitivity pass and serve with the resulting config:
```bash
# One-time sensitivity pass (1-2 min per model)
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 4.5 --candidate-bits 4,8 \
    -o ./kv/qwen35_9b

# Serve at :8080 with mixed-precision KV
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/qwen35_9b/kv_config.json \
    --max-tokens 32768 --temp 0.6 --top-p 0.95
```
This delivers a 31% decode speedup at 64k context on Qwen3.5-9B compared to fp16 KV. See the benchmark results for the full A/B.
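This page does not document the serving API itself. Assuming `optiq serve` exposes an OpenAI-compatible `/v1/chat/completions` endpoint on port 8080 (an assumption, not confirmed here; check `optiq serve --help`), a client call might look like:

```python
import json
import urllib.request

# Assumes an OpenAI-compatible chat endpoint on :8080 (unverified assumption)
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
        "messages": [{"role": "user", "content": "Summarize this document."}],
        "temperature": 0.6,
        "top_p": 0.95,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```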
## Fine-tuning recipes
Empirical training ceilings measured with `iogpu.wired_limit_mb=0` on a 36 GB Mac (default config: `q_proj`, `v_proj`, `num_layers=16`, `rank=8`, `rank_scaling=by_bits`):
| Model | Max seq len | Peak mem | Tokens/sec |
|---|---|---|---|
| Qwen3.5-0.8B | 2,800 | 23.4 GB | 29.2 |
| Qwen3.5-2B | 2,400 | 19.3 GB | 38.3 |
| Qwen3.5-4B | 1,600 | 24.8 GB | 19.1 |
| Qwen3.5-9B | 1,400 | 25.4 GB | 21.6 |
| Qwen3.5-27B | 512 | 27.7 GB | 11.4 |
| Qwen3.5-35B-A3B | 128 | 25.3 GB | 32.2 |
```bash
# 9B at T=1400: proven sweet spot, peak 25.4 GB
$ optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --data ./my_training_data \
    --max-seq-length 1400 \
    --rank 8 --rank-scaling by_bits \
    --num-layers 16 --iters 1000 \
    -o ./my_adapter
```
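After training, the adapter can be loaded back with mlx_lm for inference. A minimal sketch, assuming the output directory contains mlx_lm-compatible adapter weights (the PEFT-compat format mentioned below):

```python
from mlx_lm import load, generate

# Load the base 4-bit quant with the trained LoRA adapter applied
model, tok = load(
    "mlx-community/Qwen3.5-9B-OptiQ-4bit",
    adapter_path="./my_adapter",
)
print(generate(model, tok, prompt=prompt, max_tokens=200))
```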
See the full LoRA fine-tuning guide for the algorithm, the rank-scaling explanation, and the PEFT-compatible output format.