mlx-optiq
Family guide · Qwen3.5

Qwen3.5 on Apple Silicon

Qwen3.5 is Alibaba's late-2025 release: a five-model dense lineup plus one sparse MoE. We ship mlx-optiq 4-bit quants for all six. They share the same chat template, the same hybrid linear+full attention architecture, and the same <think>...</think> reasoning channel.

Available quants

Model | Size | GSM8K | Best for
Qwen3.5-0.8B-OptiQ-4bit | 0.5 GB | 27.0% | Toy agents, prompt rewriters
Qwen3.5-2B-OptiQ-4bit | 1.4 GB | 48.0% | Local-only chat, classifiers
Qwen3.5-4B-OptiQ-4bit | 2.8 GB | 81.5% | Sweet spot for laptops
Qwen3.5-9B-OptiQ-4bit | 5.6 GB | 90.0% | Default daily driver
Qwen3.5-27B-OptiQ-4bit | 15.7 GB | 87.5% | Long-form reasoning
Qwen3.5-35B-A3B-OptiQ-4bit | 20.1 GB | 89.5% | Sparse MoE, 3B active
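
Not sure which quant fits your machine? A rough way to choose programmatically is to leave headroom above the 4-bit weights for the KV cache and activations. The sketch below is illustrative only: the sizes are copied from the table, the repo names for the non-9B sizes are assumed to follow the same pattern as the 9B example, and the 1.5x headroom factor and pick_quant helper are our own assumptions, not part of mlx-optiq.

pick_quant.py
# Illustrative model picker. Sizes (GB) come from the table above; the
# 1.5x headroom factor for KV cache / activations is an assumption.
QUANTS = {
    "mlx-community/Qwen3.5-0.8B-OptiQ-4bit": 0.5,
    "mlx-community/Qwen3.5-2B-OptiQ-4bit": 1.4,
    "mlx-community/Qwen3.5-4B-OptiQ-4bit": 2.8,
    "mlx-community/Qwen3.5-9B-OptiQ-4bit": 5.6,
    "mlx-community/Qwen3.5-27B-OptiQ-4bit": 15.7,
    "mlx-community/Qwen3.5-35B-A3B-OptiQ-4bit": 20.1,
}

def pick_quant(free_gb: float, headroom: float = 1.5) -> str:
    """Largest quant whose weights * headroom still fit in free_gb."""
    fitting = {repo: gb for repo, gb in QUANTS.items() if gb * headroom <= free_gb}
    if not fitting:
        raise ValueError("No Qwen3.5 quant fits this memory budget")
    return max(fitting, key=fitting.get)

print(pick_quant(16.0))  # -> mlx-community/Qwen3.5-9B-OptiQ-4bit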

Hello world

hello.py
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of Australia?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=200))

Recommended sampling

Qwen3.5-Instruct variants behave well with the following sampler settings:

sampling.py
from mlx_lm.sample_utils import make_sampler

# Reasoning / math / code
sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)

# Conversational chat
sampler = make_sampler(temp=0.7, top_p=0.9)

# Deterministic / classification
sampler = make_sampler(temp=0.0)
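
Any of these samplers can be passed straight to generate via its sampler argument (supported in recent mlx-lm releases). A minimal end-to-end sketch reusing the 9B quant from the hello-world example:

sampler_usage.py
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "A train leaves at 9:40 and arrives at 11:05. How long is the trip?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Reasoning-friendly settings from the block above
sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)
print(generate(model, tok, prompt=prompt, max_tokens=400, sampler=sampler))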

Reasoning channel

Qwen3.5 has built-in chain-of-thought via <think>...</think> tags, enabled by default for instruct variants:

thinking.py
# Default: thinking enabled (slower, more accurate on math/logic)
prompt = tok.apply_chat_template(messages,
    tokenize=False, add_generation_prompt=True)

# Disable thinking for snappier replies (chat, classification)
prompt = tok.apply_chat_template(messages,
    tokenize=False, add_generation_prompt=True,
    enable_thinking=False)

To programmatically strip the <think> block before showing output to the user:

strip_think.py
import re
out = generate(model, tok, prompt=prompt, max_tokens=800)
final = re.sub(r"<think>.*?</think>", "", out, flags=re.DOTALL).strip()
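
If you want to keep the reasoning for logs or evals instead of discarding it, split it out before stripping. A small sketch building on the regex above; split_thinking is a hypothetical helper name of ours, not an mlx-optiq API:

split_thinking.py
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, final_answer) from a raw completion."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    final = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, final

reasoning, final = split_thinking(out)
print(final)  # show only the answer to the user; keep `reasoning` for debugging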

Hybrid attention

Qwen3.5 uses a hybrid architecture: most layers use linear attention (Gated DeltaNet), and a sparse subset uses full attention. Layer indices [3, 7, 11, 15, 19, 23, ...] (every 4th layer) are full-attention; the rest are linear. This matters for KV-cache serving: only the full-attention layers carry a KV cache that needs sensitivity-aware quantization.

Why this matters for mlx-optiq: The sensitivity-aware KV quantization pass focuses on the full-attention layers (the ones with a real KV cache). Linear-attention layers don't carry per-token KV state in the same way; mlx-optiq's kv-cache command correctly skips them.
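
As a concrete illustration of that layout, the full-attention layer indices can be derived from the pattern above. This is just the stated pattern restated in code; the real layer map ships with the model config, and the 36-layer count in the example is hypothetical:

full_attention_layers.py
# Layers carrying a real (quantizable) KV cache, per the pattern above:
# full attention at indices 3, 7, 11, ... i.e. every 4th layer, 0-indexed.
def full_attention_layers(num_layers: int) -> list[int]:
    return [i for i in range(num_layers) if i % 4 == 3]

# For a hypothetical 36-layer model:
print(full_attention_layers(36))
# [3, 7, 11, 15, 19, 23, 27, 31, 35]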

Long-context serving

For 64k-token contexts, run a one-time KV sensitivity pass and serve with the resulting config:

serve.sh
# 1-2 min — once per model
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 4.5 --candidate-bits 4,8 \
    -o ./kv/qwen35_9b

# Serve at :8080 with mixed-precision KV
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/qwen35_9b/kv_config.json \
    --max-tokens 32768 --temp 0.6 --top-p 0.95

This delivers a +31% decode speedup at 64k context on Qwen3.5-9B compared to fp16 KV. See the benchmark results for the full A/B.
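
Once the server is up, clients talk to it over HTTP. The sketch below assumes optiq serve exposes an OpenAI-compatible /v1/chat/completions endpoint on port 8080; that endpoint path is an assumption on our part, so check the serving docs for the actual API:

client.py
# Assumes an OpenAI-compatible chat-completions endpoint -- not confirmed here.
import json
import urllib.request

payload = {
    "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
    "messages": [{"role": "user", "content": "Summarize this long transcript: ..."}],
    "max_tokens": 1024,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])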

Fine-tuning recipes

Empirical training ceilings measured with iogpu.wired_limit_mb=0 on a 36 GB Mac (default config: LoRA targets q_proj and v_proj, num_layers=16, rank=8, rank_scaling=by_bits):

Model | Max seq len | Peak mem | Tokens/sec
Qwen3.5-0.8B | 2,800 | 23.4 GB | 29.2
Qwen3.5-2B | 2,400 | 19.3 GB | 38.3
Qwen3.5-4B | 1,600 | 24.8 GB | 19.1
Qwen3.5-9B | 1,400 | 25.4 GB | 21.6
Qwen3.5-27B | 512 | 27.7 GB | 11.4
Qwen3.5-35B-A3B | 128 | 25.3 GB | 32.2

finetune.sh
# 9B at T=1400 — proven sweet spot, peak 25.4 GB
$ optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --data ./my_training_data \
    --max-seq-length 1400 \
    --rank 8 --rank-scaling by_bits \
    --num-layers 16 --iters 1000 \
    -o ./my_adapter

See the full LoRA fine-tuning guide for the algorithm, the rank-scaling explanation, and the PEFT-compatible output format.
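
After training, a quick smoke test is to load the adapter back with mlx-lm. The sketch below assumes the -o directory above is written in mlx-lm's standard adapter layout (the PEFT-compatible export may live elsewhere; see the LoRA guide):

load_adapter.py
from mlx_lm import load, generate

# adapter_path points at the -o directory from the training run above;
# assumes mlx-lm's standard adapter format.
model, tok = load(
    "mlx-community/Qwen3.5-9B-OptiQ-4bit",
    adapter_path="./my_adapter",
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "A prompt from your training distribution"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=200))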