mlx-optiq
Family guide · Qwen3.6

Qwen3.6 on Apple Silicon

Qwen3.6 is the early-2026 successor to Qwen3.5, bringing frontier-class reasoning to sizes that fit on consumer Apple Silicon. We ship two quants: a dense 27B and a 256-expert MoE with 3B active parameters per token. Both load with stock mlx_lm.load and score 89–95% on GSM8K.

Available quants

Model                      | Size    | GSM8K | Best for
Qwen3.6-27B-OptiQ-4bit     | 15.7 GB | 95.0% | Strongest dense quant we ship
Qwen3.6-35B-A3B-OptiQ-4bit | 20.1 GB | 89.5% | Sparse MoE, 256 experts, 3B active

Hello world

hello.py
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Compare REINFORCE and PPO in two paragraphs."}],
    tokenize=False, add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=600))
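
For long reasoning answers you may not want to wait for the full completion. A streaming variant is sketched below, assuming a recent mlx_lm that exposes stream_generate; the model and prompt are the same as in hello.py.

stream.py
from mlx_lm import load, stream_generate

model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Compare REINFORCE and PPO in two paragraphs."}],
    tokenize=False, add_generation_prompt=True,
)

# stream_generate yields partial responses; print each new chunk as it arrives
for response in stream_generate(model, tok, prompt=prompt, max_tokens=600):
    print(response.text, end="", flush=True)
print()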

Recommended sampling

sampling.py
from mlx_lm.sample_utils import make_sampler

# Strong reasoning baseline (Qwen3.6 supports thinking mode)
sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)

# Conversational
sampler = make_sampler(temp=0.7, top_p=0.9)
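
To use one of these samplers, pass it to generate via the sampler argument (recent mlx_lm releases accept a sampler callable built by make_sampler). A minimal sketch, reusing the 27B quant from above:

sampled.py
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

# Reasoning baseline from above
sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Why is the sky blue?"}],
    tokenize=False, add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, sampler=sampler, max_tokens=600))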

The 35B-A3B MoE model

Qwen3.6-35B-A3B is a 256-expert sparse mixture-of-experts model: 35B total parameters, only 3B active per token. The fused expert tensor (switch_mlp in MLX terminology) is scored as a single layer in mlx-optiq's sensitivity pass, and each expert within it is quantized independently at that layer's assigned bit-width.

Expect MoE inference to be faster than the dense 27B at a comparable memory footprint, because only 3B of weights actually participate in each token's forward pass. The sensitivity pass is also quicker, since there are fewer "layers" to score: the experts collapse into single switch_mlp tensors.
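
To see the fused expert tensors for yourself, one option is to list the parameters whose names contain switch_mlp. This assumes the checkpoint follows mlx_lm's standard Qwen MoE layout; the filename below is illustrative.

inspect_experts.py
from mlx.utils import tree_flatten
from mlx_lm import load

model, tok = load("mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit")

# Each MoE layer contributes a single fused switch_mlp tensor that stacks all
# 256 experts, which is why the sensitivity pass scores it as one "layer".
for name, param in tree_flatten(model.parameters()):
    if "switch_mlp" in name and name.endswith(".weight"):
        print(name, param.shape)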

Long-context serving

serve.sh
# Sensitivity pass (1-2 min, once per model)
$ optiq kv-cache mlx-community/Qwen3.6-27B-OptiQ-4bit \
    --target-bits 4.5 --candidate-bits 4,8 \
    -o ./kv/qwen36_27b

# Mixed-precision KV serving
$ optiq serve --model mlx-community/Qwen3.6-27B-OptiQ-4bit \
    --kv-config ./kv/qwen36_27b/kv_config.json \
    --max-tokens 32768 --temp 0.6 --top-p 0.95
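
Once the server is up, any HTTP client can talk to it. The sketch below assumes optiq serve exposes an OpenAI-compatible /v1/chat/completions endpoint on localhost:8080, like mlx_lm.server does; check optiq serve --help for the actual host, port, and routes.

client.py
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed default host/port
    json={
        "messages": [{"role": "user", "content": "Summarize the attached meeting notes."}],
        "max_tokens": 2048,
        "temperature": 0.6,
        "top_p": 0.95,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])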

Fine-tuning

The 27B fits at max-seq-length=512 on a 36 GB Mac with the default rank=8 LoRA on q_proj/v_proj. The 35B-A3B MoE caps out at max-seq-length=128 due to per-expert memory overhead, but trains 3× faster per iteration than the dense 27B because of sparse activation.

finetune.sh
# Qwen3.6-27B at T=512, peak ~27.7 GB
$ optiq lora train mlx-community/Qwen3.6-27B-OptiQ-4bit \
    --data ./my_data \
    --max-seq-length 512 \
    --rank 8 --rank-scaling by_bits \
    --iters 1000 -o ./my_adapter

# Qwen3.6-35B-A3B at T=128, peak ~25.3 GB, 32 tok/s
$ optiq lora train mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit \
    --data ./my_data \
    --max-seq-length 128 \
    --rank 8 --rank-scaling by_bits \
    --iters 2000 -o ./my_moe_adapter
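
After training, the adapter can be applied at load time. A minimal sketch using mlx_lm's adapter_path argument, assuming optiq lora writes adapters in the standard mlx_lm format and using the output directory from the first command above:

use_adapter.py
from mlx_lm import load, generate

# Load the 4-bit quant with the LoRA adapter applied on top of the base weights
model, tok = load(
    "mlx-community/Qwen3.6-27B-OptiQ-4bit",
    adapter_path="./my_adapter",
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False, add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=200))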

Re-quantizing locally

Because the bf16 base is a vision-language model with a ~50 GB on-disk footprint, re-quantizing locally requires the bf16 weights to be cached and automatically uses --reference uniform_4bit (the bf16 model won't fit in RAM). Plan on ~80 GB of disk headroom for a full pass. The pre-built quants on Hugging Face are usually what you want.

Next: see how sensitivity works, or read the fine-tuning guide.