mlx-optiq
Family guide · MiniCPM5

MiniCPM5-1B on Apple Silicon

OpenBMB's MiniCPM5-1B is a 1.08B-parameter Llama-architecture base released under Apache-2.0. The mlx-optiq build is one HF repo: a 4.5-BPW mixed-precision quant that fits in 875 MB on disk and runs comfortably on any M-series Mac. The model ships a hybrid chat template: pass enable_thinking=true for a <think> reasoning channel, or leave it off for fast direct answers.

The quant

| Model | Size on disk | Capability Score | vs uniform-4 | Best for |
| --- | --- | --- | --- | --- |
| MiniCPM5-1B-OptiQ-4bit | 875 MB | 30.28 | +4.44 | Fast on-device assistant, fine-tuning base |

Per-benchmark breakdown

| Benchmark | uniform-4 | OptIQ-4 (mixed) | Δ |
| --- | --- | --- | --- |
| MMLU (5-shot, 1000) | 49.0% | 52.4% | +3.4 |
| GSM8K (no thinking) | 1.7% | 2.7% | +1.0 |
| IFEval (strict) | 58.6% | 64.7% | +6.1 |
| BFCL V3 (simple AST) | 0.0% | 0.0% | 0.0 |
| HumanEval (pass@1) | 45.7% | 57.9% | +12.2 |
| HashHop (overall) | 0.0% | 4.0% | +4.0 |
| KL vs bf16 (mean) | 0.350 | 0.136 | 2.6× closer |
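
This guide does not define the Capability Score. The headline figures in the first table do look consistent with an unweighted mean over the six accuracy rows, so the check below treats that as an assumption rather than the official formula. It uses only numbers from the breakdown above.

```python
# Accuracy rows from the per-benchmark table, in percent.
# Order: MMLU, GSM8K, IFEval, BFCL V3, HumanEval, HashHop.
uniform_4 = [49.0, 1.7, 58.6, 0.0, 45.7, 0.0]
optiq_4 = [52.4, 2.7, 64.7, 0.0, 57.9, 4.0]

# ASSUMPTION: Capability Score = unweighted mean of the six rows.
uniform_mean = sum(uniform_4) / len(uniform_4)
optiq_mean = sum(optiq_4) / len(optiq_4)
delta = optiq_mean - uniform_mean

# KL divergence to the bf16 model (lower is closer).
kl_ratio = 0.350 / 0.136

print(f"uniform-4 mean: {uniform_mean:.2f}")
print(f"OptIQ-4 mean:   {optiq_mean:.2f}")
print(f"delta:          {delta:+.2f}")
print(f"KL ratio:       {kl_ratio:.1f}x")
```

Under that assumption the OptIQ mean comes out at 30.28, matching the table. The delta computed from the rounded rows is about +4.45, within rounding distance of the published +4.44, and the KL ratio rounds to 2.6×.
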
Why HumanEval and HashHop jump so much

Sensitivity-aware allocation gives the layers OptIQ measured as most fragile (output projection, gate, and the last few blocks) 8-bit precision; the robust ones stay at 4-bit. For a 1B model with limited redundancy, that targeting matters a lot. With uniform 4-bit, HashHop drops to 0% at every hop level; OptIQ keeps the 1-hop case usable. The same pattern shows up on small Gemma quants; see the sensitivity-aware research.
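
To make the idea of sensitivity-aware allocation concrete, here is a minimal greedy sketch. It is not OptIQ's actual algorithm, which this guide does not describe. The layer names, parameter counts, and sensitivity scores are hypothetical, and the sketch ignores the per-group scale and bias overhead that real quantized formats carry.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str           # hypothetical layer name
    params: int         # number of weights in the layer
    sensitivity: float  # hypothetical score; higher = more fragile at 4-bit


def allocate_bits(layers, target_bpw, low=4, high=8):
    """Start every layer at `low` bits, then promote the most sensitive
    layers to `high` bits while the average bits per weight stays at or
    below `target_bpw`."""
    total = sum(layer.params for layer in layers)
    budget = target_bpw * total   # total bits we may spend
    used = low * total            # cost of the all-4-bit baseline
    bits = {layer.name: low for layer in layers}

    for layer in sorted(layers, key=lambda l: l.sensitivity, reverse=True):
        extra = (high - low) * layer.params
        if used + extra <= budget:
            bits[layer.name] = high
            used += extra

    return bits, used / total


# Hypothetical toy model: four layers, 32M weights in total.
layers = [
    Layer("blocks.0.mlp.gate_proj", 4_000_000, 0.9),
    Layer("lm_head", 8_000_000, 0.8),
    Layer("blocks.0.self_attn.q_proj", 4_000_000, 0.2),
    Layer("blocks.0.mlp.up_proj", 16_000_000, 0.1),
]
bits, achieved_bpw = allocate_bits(layers, target_bpw=4.5)
for name, b in bits.items():
    print(f"{name}: {b}-bit")
print(f"average bits per weight: {achieved_bpw}")
```

In this toy case only the most sensitive layer fits inside the 4.5-BPW budget. A real allocator would likely weigh sensitivity against each layer's bit cost rather than sorting on sensitivity alone.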

Hello world

hello.py · python
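from mlx_lm import load, generate

# Fetch the quantized weights and tokenizer (downloaded on first use).
model, tok = load("mlx-community/MiniCPM5-1B-OptiQ-4bit")

# Render the conversation to a prompt string; enable_thinking=False
# selects the direct-answer mode of the hybrid chat template.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Summarize the plot of The Iliad in three sentences."}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(generate(model, tok, prompt=prompt, max_tokens=300))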

Hybrid reasoning: think or no-think

MiniCPM5's chat template accepts an enable_thinking flag. With it on, the model emits a <think>...</think> block before the answer, which is useful for math, multi-step planning, or anywhere you want chain-of-thought. The model card recommends these sampling settings:

| Mode | temperature | top_p | Use when |
| --- | --- | --- | --- |
| No-think (default) | 0.7 | 0.95 | Fast assistant, rewriting, conversational |
| Think | 0.9 | 0.95 | Math, code, multi-hop reasoning |

Pass the flag via chat_template_kwargs at the OpenAI endpoint level or as a keyword to apply_chat_template directly. optiq serve forwards chat_template_kwargs verbatim, so you can flip thinking per-request from any OpenAI-compatible client.
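
With thinking on, downstream code usually wants the reasoning and the final answer separately. The helper below is a small sketch for that. It assumes both the opening and closing tags appear inline in the generated text, which may not hold for every serving path. The function name is hypothetical.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer).

    If no complete <think>...</think> block is found, reasoning is ""
    and the whole text is returned as the answer.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_thinking(
    "<think>17 * 23 = 17 * 20 + 17 * 3 = 391</think>\nThe answer is 391."
)
print("reasoning:", reasoning)
print("answer:", answer)
```

If generation stops at max_tokens before the closing tag, this helper finds no match and returns the truncated text as the answer, so a generous token limit is worth setting in think mode.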

Serving

terminal · bash
$ optiq serve --model mlx-community/MiniCPM5-1B-OptiQ-4bit --port 8000

# From any OpenAI-compatible client:
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"mlx-community/MiniCPM5-1B-OptiQ-4bit",
         "messages":[{"role":"user","content":"What is 17 * 23?"}],
         "chat_template_kwargs":{"enable_thinking":true}}'

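The same request can be sent from Python with only the standard library. This sketch pairs the enable_thinking flag with the sampling settings from the mode table. It assumes the server accepts the standard OpenAI temperature and top_p fields and returns the usual chat-completions JSON; the helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "mlx-community/MiniCPM5-1B-OptiQ-4bit"

# Sampling presets copied from the mode table; keyed by enable_thinking.
PRESETS = {
    False: {"temperature": 0.7, "top_p": 0.95},
    True: {"temperature": 0.9, "top_p": 0.95},
}


def build_payload(user_text, thinking=False):
    """Build a chat-completions request body for the chosen mode."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "chat_template_kwargs": {"enable_thinking": thinking},
        **PRESETS[thinking],
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to a running server and return the parsed JSON.
    ASSUMPTION: the response follows the OpenAI chat-completions shape."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_payload("What is 17 * 23?", thinking=True)
print(json.dumps(payload, indent=2))
# With the server running:
# reply = post_chat(payload)
# print(reply["choices"][0]["message"]["content"])
```

Keeping the presets keyed by the flag means a caller cannot flip thinking on without also picking up the matching temperature.
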
Fine-tuning

The 1B size makes MiniCPM5 the smallest base in the mlx-optiq lineup that's still capable enough to fine-tune for real tasks. On a 24 GB Mac, LoRA training fits comfortably at max_seq_length=2048 with all 7 Unsloth target modules adapted, so no num-layers or sequence-length workarounds are needed. The sensitivity-aware LoRA overlay reads the optiq_metadata.json sidecar and gives 8-bit layers 2× the adapter rank of 4-bit layers at the same parameter budget.

terminal · bash
$ optiq lora train mlx-community/MiniCPM5-1B-OptiQ-4bit \
      --data ./my_training_data \
      --preset default \
      --max-seq-length 2048
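
The "2× rank at the same parameter budget" rule can be worked through with a little arithmetic. The sketch below is one way to read it, not the overlay's actual implementation. It assumes the budget is whatever a uniform-rank LoRA would spend, and it uses the standard LoRA cost of rank × (d_in + d_out) parameters per adapted layer. The layer names, dimensions, and tuple layout are hypothetical and do not reflect the optiq_metadata.json schema.

```python
def sensitivity_aware_ranks(layers, base_rank):
    """layers: list of (name, d_in, d_out, bits) tuples.

    Returns {name: rank} where 8-bit layers get twice the rank of 4-bit
    layers and the total adapter size does not exceed what a uniform
    `base_rank` LoRA would use.
    """
    # Budget: adapter parameters of a uniform-rank LoRA.
    budget = sum(base_rank * (d_in + d_out) for _, d_in, d_out, _ in layers)
    # Cost per unit of 4-bit rank, with 8-bit layers counted twice.
    cost_per_unit = sum(
        (2 if bits == 8 else 1) * (d_in + d_out)
        for _, d_in, d_out, bits in layers
    )
    # Floor division keeps the total within budget; ranks stay >= 1.
    rank_4bit = max(1, budget // cost_per_unit)
    return {
        name: (2 * rank_4bit if bits == 8 else rank_4bit)
        for name, _, _, bits in layers
    }


# Hypothetical layers: (name, d_in, d_out, quantization bits).
layers = [
    ("q_proj", 2048, 2048, 4),
    ("k_proj", 2048, 2048, 4),
    ("o_proj", 2048, 2048, 8),
]
ranks = sensitivity_aware_ranks(layers, base_rank=16)
total = sum(ranks[name] * (d_in + d_out) for name, d_in, d_out, _ in layers)
uniform_total = sum(16 * (d_in + d_out) for _, d_in, d_out, _ in layers)
print(ranks)
print(total, uniform_total)
```

In this toy case the two 4-bit layers drop to rank 12 and the 8-bit layer rises to rank 24, which spends exactly the same number of adapter parameters as rank 16 everywhere.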

License + provenance

MiniCPM5-1B is released under Apache-2.0 as openbmb/MiniCPM5-1B. The mlx-optiq quant is bit-for-bit deterministic given that base and a 40-sample calibration mix (optiq.jsonl), and is reproducible via optiq convert openbmb/MiniCPM5-1B --target-bpw 5.0 --candidate-bits 4,8.
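
One way to check a bit-for-bit claim is to hash every file in a locally converted copy and in the downloaded repo, then compare. The sketch below uses only the standard library. The directory paths are hypothetical, and which files the determinism guarantee covers (weights only, or metadata too) is not stated here, so treat any non-weight mismatch with care.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dir_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def mismatches(dir_a, dir_b):
    """Return sorted relative paths that differ or exist on one side only."""
    a, b = dir_digests(dir_a), dir_digests(dir_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Hypothetical usage, with placeholder directory names:
# print(mismatches("./my-local-convert", "./downloaded-repo"))
```

An empty list means every file matched byte for byte.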

— the mlx-optiq team