Family guide · MiniCPM5

MiniCPM5-1B on Apple Silicon

OpenBMB's MiniCPM5-1B is a 1.08B-parameter Llama-architecture base released under Apache-2.0. The mlx-optiq build is one HF repo: a 4.5-BPW mixed-precision quant that fits in 875 MB on disk and runs comfortably on any M-series Mac. The model ships a hybrid chat template: pass enable_thinking=true for a <think> reasoning channel, or leave it off for fast direct answers.

The quant

Model	Size on disk	Capability Score	vs uniform-4	Best for
MiniCPM5-1B-OptiQ-4bit	875 MB	30.28	+4.44	Fast on-device assistant, fine-tuning base

Per-benchmark breakdown

Benchmark	uniform-4	OptiQ-4 (mixed)	Δ
MMLU (5-shot, 1000)	49.0%	52.4%	+3.4
GSM8K (no thinking)	1.7%	2.7%	+1.0
IFEval (strict)	58.6%	64.7%	+6.1
BFCL V3 (simple AST)	0.0%	0.0%	0.0
HumanEval (pass@1)	45.7%	57.9%	+12.2
HashHop (overall)	0.0%	4.0%	+4.0
KL vs bf16 (mean)	0.350	0.136	2.6× closer
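
As a quick cross-check between the two tables, the Capability Score appears consistent with an unweighted mean of the six benchmark percentages above. That reading is an assumption inferred from the numbers, not a documented definition; the sketch below only redoes the arithmetic.

```python
# Benchmark percentages copied from the table above.
# Order: MMLU, GSM8K, IFEval, BFCL V3, HumanEval, HashHop.
uniform4 = [49.0, 1.7, 58.6, 0.0, 45.7, 0.0]
optiq4 = [52.4, 2.7, 64.7, 0.0, 57.9, 4.0]


def mean(xs):
    return sum(xs) / len(xs)


# Assumption: Capability Score = unweighted mean of the six rows.
score_uniform = mean(uniform4)
score_optiq = mean(optiq4)

# Ratio of mean KL divergences (uniform-4 over OptiQ-4), from the last row.
kl_ratio = 0.350 / 0.136

print(f"uniform-4: {score_uniform:.2f}")
print(f"OptiQ-4:   {score_optiq:.2f}")
print(f"delta:     {score_optiq - score_uniform:+.2f}")
print(f"KL ratio:  {kl_ratio:.1f}x")
```

Under that assumption the OptiQ mean matches the 30.28 in the quant table. The delta from the rounded rows comes out near 4.45 rather than the published +4.44, which would be consistent with the published figure being computed from unrounded per-benchmark scores.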

Why HumanEval and HashHop jump so much

Sensitivity-aware allocation gives 8-bit precision to the layers OptiQ measured as most fragile (output projection, gate, and the last few blocks), while the robust ones stay at 4-bit. For a 1B model with limited redundancy, that targeting matters a lot. On uniform 4-bit, HashHop falls off a cliff to 0% at every hop level; OptiQ keeps the 1-hop case usable. The same pattern shows up on small Gemma quants; see the sensitivity-aware research.
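
The allocation idea can be sketched as a greedy budgeted promotion: start every layer at 4-bit, then upgrade the most sensitive layers to 8-bit while the average stays within the target bits per weight. This is an illustrative sketch, not OptiQ's actual algorithm. The layer names, sensitivity scores, and parameter counts are hypothetical, and the quantizer's per-group scale overhead is ignored.

```python
def allocate_bits(layers, target_bpw, low=4, high=8):
    """Greedy mixed-precision allocation.

    layers: {name: (sensitivity, n_params)}; higher sensitivity = more fragile.
    Returns {name: bits} with an average of at most target_bpw bits per weight.
    """
    total = sum(n for _, n in layers.values())
    budget = target_bpw * total   # total bits we are allowed to spend
    used = low * total            # every layer starts at the low width
    bits = {name: low for name in layers}

    # Visit layers from most to least sensitive and promote when affordable.
    ranked = sorted(layers.items(), key=lambda kv: kv[1][0], reverse=True)
    for name, (_, n) in ranked:
        extra = (high - low) * n
        if used + extra <= budget:
            bits[name] = high
            used += extra
    return bits


# Hypothetical layers: (sensitivity score, parameter count).
layers = {
    "lm_head": (0.9, 100),
    "blocks.23.mlp": (0.5, 100),
    "blocks.1.mlp": (0.2, 300),
    "blocks.0.mlp": (0.1, 300),
}
bits = allocate_bits(layers, target_bpw=5.0)
print(bits)
```

With these toy numbers the two most sensitive layers land at 8-bit and the two large, robust ones stay at 4-bit, which exactly spends a 5.0 bits-per-weight budget.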

Hello world

hello.py (python)

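from mlx_lm import load, generate

# Load the quantized weights and tokenizer from the Hugging Face Hub.
model, tok = load("mlx-community/MiniCPM5-1B-OptiQ-4bit")

# Render the chat template to a prompt string. enable_thinking=False
# skips the <think> channel and asks for a direct answer.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Summarize the plot of The Iliad in three sentences."}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Generate up to 300 new tokens and print the completion.
print(generate(model, tok, prompt=prompt, max_tokens=300))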

Hybrid reasoning: think or no-think

MiniCPM5's chat template accepts an enable_thinking flag. With it on, the model emits a <think>...</think> block before the answer, which helps with math, multi-step planning, or any task that benefits from chain-of-thought. The model card recommends these sampling settings:

Mode	temperature	top_p	Use when
No-think (default)	0.7	0.95	Fast assistant, rewriting, conversational
Think	0.9	0.95	Math, code, multi-hop reasoning

Pass the flag either inside chat_template_kwargs in the request body of the OpenAI-compatible endpoint, or as a keyword argument to apply_chat_template directly. optiq serve forwards chat_template_kwargs verbatim, so any OpenAI-compatible client can toggle thinking per request.
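
When thinking is on, downstream code usually wants the reasoning and the final answer separately. The helper below is a minimal sketch of that split, together with the two sampling presets from the table. It assumes the decoded text contains a literal </think> tag, with or without the opening <think>; the function and constant names are illustrative, not part of mlx-optiq.

```python
# Sampling presets copied from the table above, keyed by enable_thinking.
SAMPLING = {
    False: {"temperature": 0.7, "top_p": 0.95},  # no-think
    True: {"temperature": 0.9, "top_p": 0.95},   # think
}


def split_thinking(text):
    """Return (reasoning, answer) from a decoded completion.

    Assumption: reasoning ends at a literal "</think>" tag. Some templates
    place the opening "<think>" in the prompt, so it may be absent here.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No reasoning block: the whole completion is the answer.
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_thinking("<think>17 * 23 = 340 + 51</think>\n391")
print(reasoning)
print(answer)
```

Keeping the presets next to the parser makes it easy to switch both the flag and the sampling settings together.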

Serving

terminal (bash)

$ optiq serve --model mlx-community/MiniCPM5-1B-OptiQ-4bit --port 8000

# From any OpenAI-compatible client:
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"mlx-community/MiniCPM5-1B-OptiQ-4bit",
         "messages":[{"role":"user","content":"What is 17 * 23?"}],
         "chat_template_kwargs":{"enable_thinking":true}}'

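The same request can be sent from Python with only the standard library. This is a sketch under two assumptions that the text above does not confirm: that the server accepts the standard OpenAI temperature and top_p fields, and that it returns the standard chat-completions response shape. The helper names build_payload and ask are illustrative.

```python
import json
import urllib.request

MODEL = "mlx-community/MiniCPM5-1B-OptiQ-4bit"


def build_payload(question, thinking):
    """Build a chat-completions request body with per-request thinking."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        # Forwarded to the chat template by the server.
        "chat_template_kwargs": {"enable_thinking": thinking},
        # Model-card sampling recipe: 0.9 when thinking, 0.7 otherwise.
        "temperature": 0.9 if thinking else 0.7,
        "top_p": 0.95,
    }


def ask(question, thinking=False, base_url="http://localhost:8000"):
    """POST the request to a running server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(question, thinking)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumption: standard OpenAI response shape.
    return body["choices"][0]["message"]["content"]
```

With the server from the command above running, ask("What is 17 * 23?", thinking=True) would send the same body as the curl call, plus the sampling fields.
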
Fine-tuning

The 1B size makes MiniCPM5 the smallest base in the mlx-optiq lineup that's still capable enough to fine-tune for real tasks. On a 24 GB Mac, LoRA training fits comfortably at max_seq_length=2048 with all 7 Unsloth target modules adapted, with no need for num-layers or sequence-length workarounds. The sensitivity-aware LoRA overlay reads the optiq_metadata.json sidecar and gives 8-bit layers 2× the adapter rank of 4-bit layers at the same parameter budget.

terminal (bash)

$ optiq lora train mlx-community/MiniCPM5-1B-OptiQ-4bit \
      --data ./my_training_data \
      --preset default \
      --max-seq-length 2048
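
To make "2× the rank at the same parameter budget" concrete, here is one way such an allocation could be computed. It is a sketch, not the optiq implementation: it assumes a LoRA adapter on a d_in × d_out layer costs rank × (d_in + d_out) parameters, takes the budget to be what a uniform rank would spend, and uses a plain dict with hypothetical layer names and shapes because the schema of optiq_metadata.json is not described here.

```python
def allocate_ranks(layers, uniform_rank):
    """Split a LoRA parameter budget so 8-bit layers get twice the rank.

    layers: {name: (bits, d_in, d_out)}.
    The budget equals what `uniform_rank` on every layer would cost.
    Returns {name: rank}.
    """
    # Adapter parameters per unit of rank: A is (r x d_in), B is (d_out x r).
    cost = {name: d_in + d_out for name, (_, d_in, d_out) in layers.items()}
    mult = {name: 2 if layers[name][0] == 8 else 1 for name in layers}

    budget = uniform_rank * sum(cost.values())
    weighted = sum(cost[name] * mult[name] for name in layers)

    # Largest integer base rank that keeps the total within budget.
    base = max(1, budget // weighted)
    return {name: base * mult[name] for name in layers}


# Hypothetical layers: (quantized bits, d_in, d_out).
layers = {
    "blocks.0.q_proj": (4, 1024, 1024),
    "blocks.1.q_proj": (4, 1024, 1024),
    "blocks.2.q_proj": (4, 1024, 1024),
    "blocks.3.o_proj": (8, 1024, 1024),
}
ranks = allocate_ranks(layers, uniform_rank=8)
print(ranks)
```

In this toy case a uniform rank of 8 turns into rank 6 on the 4-bit layers and rank 12 on the 8-bit layer, staying just under the original parameter count.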

License + provenance

MiniCPM5-1B is Apache-2.0 from openbmb/MiniCPM5-1B. The mlx-optiq quant is bit-for-bit deterministic from that base and a 40-sample calibration mix (optiq.jsonl), reproducible via optiq convert openbmb/MiniCPM5-1B --target-bpw 5.0 --candidate-bits 4,8.

— the mlx-optiq team