# Using mlx-optiq quants
mlx-optiq-quantized models are standard MLX checkpoints. They load with the unmodified `mlx_lm.load` function and generate with `mlx_lm.generate`. The only difference from a uniform-4-bit checkpoint is the per-layer bit-width recorded in metadata.
## One-shot generation
```python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(
    model,
    tok,
    prompt="Why is mixed-precision quantization a good idea?",
    max_tokens=300,
)
print(out)
```
## Streaming generation
For interactive UIs and CLIs, stream tokens as they come:
```python
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)

for response in stream_generate(
    model,
    tok,
    prompt="Write a haiku about Apple Silicon.",
    max_tokens=200,
    sampler=sampler,
):
    print(response.text, end="", flush=True)
```
## Chat templates
Instruction-tuned models (`Qwen3.5-*-Instruct`, `Qwen3.6-*`, `Gemma-4-*-it`) need their chat template applied. Always do this for chat-style use:
```python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

messages = [
    {"role": "system", "content": "You are a concise expert."},
    {"role": "user", "content": "Explain RoPE in 3 bullet points."},
]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(model, tok, prompt=prompt, max_tokens=400)
print(out)
```
The Qwen3.6 models also emit a `<think>...</think>` reasoning channel before the final answer. Pass `enable_thinking=False` to `apply_chat_template` to skip it (much faster, slightly less accurate on math/logic), or leave it on for best quality.
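For example, to disable it (a minimal sketch; `enable_thinking` is a keyword that `apply_chat_template` forwards to the chat template, and Qwen-style templates honor it, so check that your model's template supports it):

```python
# Skip the <think>...</think> channel for faster, non-reasoning replies.
# enable_thinking is forwarded to the chat template; Qwen-style templates
# honor it, while other templates may simply ignore it.
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
```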
## Multi-turn chat loop
```python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
messages = [{"role": "system", "content": "You are helpful."}]

while True:
    user = input("> ")
    if not user:
        break
    messages.append({"role": "user", "content": user})
    prompt = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    reply = generate(model, tok, prompt=prompt, max_tokens=800)
    print(reply)
    messages.append({"role": "assistant", "content": reply})
```
## Inspecting mlx-optiq metadata
Each mlx-optiq checkpoint records its per-layer bit assignment in `config.json` under the `quantization` field, plus a sidecar `optiq_metadata.json` with the full sensitivity table:
```python
import json

from huggingface_hub import snapshot_download

local = snapshot_download(
    "mlx-community/Qwen3.5-9B-OptiQ-4bit",
    allow_patterns=["optiq_metadata.json", "config.json"],
)
with open(f"{local}/optiq_metadata.json") as f:
    meta = json.load(f)

print("Method:       ", meta["method"])
print("Achieved BPW: ", round(meta["achieved_bpw"], 3))
print("8-bit layers: ", meta["n_high_bits"])
print("4-bit layers: ", meta["n_low_bits"])
```
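To see which layers landed at which width, you can also read the per-layer assignments straight out of `config.json`. A sketch, reusing `local` from above and assuming the MLX mixed-quant convention where `quantization` holds scalar defaults (`bits`, `group_size`) alongside per-layer override dicts keyed by layer path:

```python
import json
from collections import Counter

# Count layers per bit-width from the quantization field in config.json.
# Assumption: per-layer overrides are nested dicts keyed by layer path,
# next to scalar defaults like "bits" and "group_size".
with open(f"{local}/config.json") as f:
    cfg = json.load(f)

quant = cfg["quantization"]
default_bits = quant.get("bits", 4)
counts = Counter(
    v.get("bits", default_bits) for v in quant.values() if isinstance(v, dict)
)
for bits, n in sorted(counts.items()):
    print(f"{bits}-bit layers: {n}")
```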
## Memory ceilings on a 36 GB Mac
Approximate working-set sizes during single-prompt inference at moderate context (≤8 k tokens). Add ~2 GB for the framework and OS reserve.
| Model | Disk | Inference RAM (≤8 k tokens) | Recommended Mac |
|---|---|---|---|
| Qwen3.5-0.8B | 0.5 GB | ~1.5 GB | 8 GB+ |
| Qwen3.5-2B | 1.4 GB | ~3 GB | 16 GB+ |
| Qwen3.5-4B | 2.8 GB | ~5 GB | 16 GB+ |
| Qwen3.5-9B | 5.6 GB | ~9 GB | 24 GB+ |
| Qwen3.5-27B | 15.7 GB | ~22 GB | 36 GB+ |
| Qwen3.6-27B | 15.7 GB | ~22 GB | 36 GB+ |
| Qwen3.5-35B-A3B | 20.1 GB | ~26 GB | 36 GB+ |
| Qwen3.6-35B-A3B | 20.1 GB | ~26 GB | 36 GB+ |
| gemma-4-26B-A4B-it | 14.9 GB | ~21 GB | 32 GB+ |
| gemma-4-31B-it | 18.1 GB | ~24 GB | 36 GB+ |
For 64 k-context inference on the 9 B and larger models, use mixed-precision KV serving: it both shrinks the cache and speeds up decoding.
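One way to turn this on is the quantized-KV options that recent `mlx_lm` releases forward from `generate` to the decode step. A sketch, assuming your installed version accepts `kv_bits`, `kv_group_size`, and `quantized_kv_start` (older releases may not), with `long_prompt` standing in for your own long input:

```python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")

long_prompt = "..."  # assumed: a long (tens of thousands of tokens) prompt

# Keep the KV cache in 4-bit groups once it passes 2k tokens.
# kv_bits / kv_group_size / quantized_kv_start are forwarded to the
# decode step in recent mlx_lm versions; older releases may reject them.
out = generate(
    model,
    tok,
    prompt=long_prompt,
    max_tokens=500,
    kv_bits=4,
    kv_group_size=64,
    quantized_kv_start=2048,
)
print(out)
```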
Next: dive into how sensitivity works, or jump to your model family's getting-started guide.