# Qwen3.6 on Apple Silicon
Qwen3.6 is the early-2026 successor to Qwen3.5: frontier-class reasoning at sizes that fit on consumer Apple Silicon. We ship two quants: a dense 27B and a 35B MoE with 256 experts and 3B active per token. Both load with stock mlx_lm.load and score 89–95% on GSM8K.
## Available quants
| Model | Size | GSM8K | Best for |
|---|---|---|---|
| Qwen3.6-27B-OptiQ-4bit | 15.7 GB | 95.0% | Strongest dense quant we ship |
| Qwen3.6-35B-A3B-OptiQ-4bit | 20.1 GB | 89.5% | Sparse MoE, 256 experts, 3B active |
## Hello world
```python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Compare REINFORCE and PPO in two paragraphs."}],
    tokenize=False,
    add_generation_prompt=True,
)

print(generate(model, tok, prompt=prompt, max_tokens=600))
```
## Recommended sampling
```python
from mlx_lm.sample_utils import make_sampler

# Strong reasoning baseline (Qwen3.6 supports thinking mode)
sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)

# Conversational
sampler = make_sampler(temp=0.7, top_p=0.9)
```
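To use one of these, pass it to generate. The sketch below assumes your mlx_lm version forwards a sampler keyword to the underlying generation step, as recent releases do; the exact signature can vary between versions.

```python
# Minimal sketch: reuse model, tok, and prompt from the hello-world example.
# Assumes this mlx_lm release accepts a `sampler` kwarg on generate().
sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)
print(generate(model, tok, prompt=prompt, max_tokens=600, sampler=sampler))
```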
## The 35B-A3B MoE model
Qwen3.6-35B-A3B is a 256-expert sparse mixture-of-experts model: 35B total parameters, only 3B active per token. The fused expert tensor (switch_mlp in MLX terminology) is quantized as a single layer in mlx-optiq's sensitivity pass, though each expert independently uses the assigned bit-width.
Expect MoE inference to be faster than the dense 27B at the same memory footprint, because only 3B of weights are multiplied per token. The sensitivity pass is also faster because there are fewer "layers" to score (the experts collapse into single switch_mlp tensors).
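If you want to see how the fused expert tensors came out, a quick module walk works. This is a sketch only: the "switch_mlp" naming and the bits / group_size attributes are assumptions about the MLX Qwen3-MoE port and may differ across mlx_lm versions.

```python
from mlx_lm import load

# Sketch: list the fused expert (switch_mlp) tensors and the bit-width each received.
# The "switch_mlp" substring and the `bits`/`group_size` attributes are assumptions
# about the MLX port; adjust the filter for your mlx_lm version.
model, tok = load("mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit")

for name, module in model.named_modules():
    if "switch_mlp" in name and hasattr(module, "bits"):
        print(f"{name}: {module.bits}-bit, group_size={getattr(module, 'group_size', '?')}")
```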
## Long-context serving
```bash
# Sensitivity pass (1-2 min, once per model)
$ optiq kv-cache mlx-community/Qwen3.6-27B-OptiQ-4bit \
    --target-bits 4.5 --candidate-bits 4,8 \
    -o ./kv/qwen36_27b

# Mixed-precision KV serving
$ optiq serve --model mlx-community/Qwen3.6-27B-OptiQ-4bit \
    --kv-config ./kv/qwen36_27b/kv_config.json \
    --max-tokens 32768 --temp 0.6 --top-p 0.95
```
## Fine-tuning
The 27B fits at max-seq-length=512 on a 36 GB Mac with the default rank=8 LoRA on q_proj/v_proj. The 35B-A3B MoE caps at max-seq-length=128 due to per-expert memory overhead, but trains about 3× faster per iteration than the dense 27B because of sparse activation.
```bash
# Qwen3.6-27B at T=512, peak ~27.7 GB
$ optiq lora train mlx-community/Qwen3.6-27B-OptiQ-4bit \
    --data ./my_data \
    --max-seq-length 512 \
    --rank 8 --rank-scaling by_bits \
    --iters 1000 -o ./my_adapter

# Qwen3.6-35B-A3B at T=128, peak ~25.3 GB, 32 tok/s
$ optiq lora train mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit \
    --data ./my_data \
    --max-seq-length 128 \
    --rank 8 --rank-scaling by_bits \
    --iters 2000 -o ./my_moe_adapter
```
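After training, the adapter can be attached to the base quant at load time for inference. The sketch below assumes the adapter directory written by optiq lora train is compatible with mlx_lm's adapter_path argument; if optiq uses its own format, follow the fine-tuning guide's loading workflow instead.

```python
from mlx_lm import load, generate

# Sketch: attach the LoRA adapter to the base 4-bit quant for inference.
# Assumes ./my_adapter is in a format that mlx_lm's `adapter_path` understands.
model, tok = load(
    "mlx-community/Qwen3.6-27B-OptiQ-4bit",
    adapter_path="./my_adapter",
)
print(generate(model, tok, prompt="Hello from the fine-tuned model.", max_tokens=100))
```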
If you run the quantization pass yourself, it uses --reference uniform_4bit automatically (the bf16 weights won't fit in RAM). Plan ~80 GB of disk headroom for a full pass. The pre-built quants on Hugging Face are usually what you want.
Next: see how sensitivity works, or read the fine-tuning guide.