MiniCPM5-1B on Apple Silicon
OpenBMB's MiniCPM5-1B is a 1.08B-parameter Llama-architecture base released under Apache-2.0. The mlx-optiq build is one HF repo: a 4.5-BPW mixed-precision quant that fits in 875 MB on disk and runs comfortably on any M-series Mac. The model ships a hybrid chat template: pass enable_thinking=true for a <think> reasoning channel, or leave it off for fast direct answers.
The quant
| Model | Size on disk | Capability Score | vs uniform-4 | Best for |
|---|---|---|---|---|
| MiniCPM5-1B-OptiQ-4bit | 875 MB | 30.28 | +4.44 | Fast on-device assistant, fine-tuning base |
Per-benchmark breakdown
| Benchmark | uniform-4 | OptiQ-4 (mixed) | Δ |
|---|---|---|---|
| MMLU (5-shot, 1000) | 49.0% | 52.4% | +3.4 |
| GSM8K (no thinking) | 1.7% | 2.7% | +1.0 |
| IFEval (strict) | 58.6% | 64.7% | +6.1 |
| BFCL V3 (simple AST) | 0.0% | 0.0% | 0.0 |
| HumanEval (pass@1) | 45.7% | 57.9% | +12.2 |
| HashHop (overall) | 0.0% | 4.0% | +4.0 |
| KL vs bf16 (mean) | 0.350 | 0.136 | 2.6× closer |
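This post does not define how the Capability Score is computed. As a sanity check, the sketch below assumes it is the unweighted mean of the six accuracy rows (everything except the KL row); under that assumption the rounded table values reproduce the 30.28 headline figure. The variable names are illustrative only.

```python
# Accuracy rows from the table above, in percent: (uniform-4, OptiQ-4).
# The KL row is a divergence, not an accuracy, so it is left out.
scores = {
    "MMLU": (49.0, 52.4),
    "GSM8K": (1.7, 2.7),
    "IFEval": (58.6, 64.7),
    "BFCL V3": (0.0, 0.0),
    "HumanEval": (45.7, 57.9),
    "HashHop": (0.0, 4.0),
}


def mean(values):
    """Plain arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# ASSUMPTION: Capability Score = unweighted mean of the six accuracy rows.
uniform_mean = mean(u for u, _ in scores.values())
optiq_mean = mean(o for _, o in scores.values())

# Ratio of mean KL divergences against the bf16 reference.
kl_ratio = 0.350 / 0.136

print(f"uniform-4 mean: {uniform_mean:.2f}")
print(f"OptiQ-4 mean:   {optiq_mean:.2f}")
print(f"KL ratio:       {kl_ratio:.1f}x")
```

The gap between the two means computed from these rounded rows differs from the published +4.44 in the last digit, which would be consistent with the published figure using unrounded per-benchmark scores.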
Hello world
from mlx_lm import load, generate

model, tok = load("mlx-community/MiniCPM5-1B-OptiQ-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Summarize the plot of The Iliad in three sentences."}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(generate(model, tok, prompt=prompt, max_tokens=300))
Hybrid reasoning: think or no-think
MiniCPM5's chat template accepts an enable_thinking flag. With it on, the model emits a <think>...</think> block before the answer, which helps with math, multi-step planning, or any task that benefits from chain-of-thought. The model card recommends these sampling settings:
| Mode | temperature | top_p | Use when |
|---|---|---|---|
| No-think (default) | 0.7 | 0.95 | Fast assistant, rewriting, conversational |
| Think | 0.9 | 0.95 | Math, code, multi-hop reasoning |
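When thinking is on, downstream code usually wants the reasoning and the final answer separately. A minimal sketch of that split follows; `split_thinking` is a hypothetical helper, not part of mlx-lm or mlx-optiq, and it assumes both the opening and closing tags appear in the generated text (some chat templates place the opening tag in the prompt instead, in which case the pattern would need adjusting).

```python
import re

# Matches one leading <think>...</think> block; DOTALL lets it span lines.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer); reasoning is None if there is no think block."""
    match = _THINK_RE.match(text)
    if match is None:
        return None, text.strip()
    return match.group(1).strip(), text[match.end():].strip()


# Hand-written sample output, not a real generation from the model.
sample = "<think>\n17 * 23 = 17 * 20 + 17 * 3 = 340 + 51\n</think>\n\n17 * 23 = 391"
reasoning, answer = split_thinking(sample)
print(answer)
```

Text without a think block passes through unchanged, so the same helper can be used in both modes.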
Pass the flag via chat_template_kwargs at the OpenAI endpoint level or as a keyword to apply_chat_template directly. optiq serve forwards chat_template_kwargs verbatim, so you can flip thinking per-request from any OpenAI-compatible client.
Serving
$ optiq serve --model mlx-community/MiniCPM5-1B-OptiQ-4bit --port 8000

# From any OpenAI-compatible client:
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"mlx-community/MiniCPM5-1B-OptiQ-4bit",
         "messages":[{"role":"user","content":"What is 17 * 23?"}],
         "chat_template_kwargs":{"enable_thinking":true}}'
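The same request can be built from Python with only the standard library. The sketch below pairs the enable_thinking flag with the sampling settings from the mode table; `build_payload` and `ask` are hypothetical helper names, and the response parsing assumes the server returns the usual OpenAI chat-completions shape (`choices[0].message.content`), which this post implies but does not spell out.

```python
import json
import urllib.request

# Sampling settings copied from the mode table in this post.
SAMPLING = {
    False: {"temperature": 0.7, "top_p": 0.95},  # no-think
    True: {"temperature": 0.9, "top_p": 0.95},   # think
}


def build_payload(question, think=False,
                  model="mlx-community/MiniCPM5-1B-OptiQ-4bit"):
    """Build a chat-completions request body with per-request thinking."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "chat_template_kwargs": {"enable_thinking": think},
        **SAMPLING[think],
    }


def ask(question, think=False, base_url="http://localhost:8000"):
    """POST the request to a running server and return the reply text.

    ASSUMPTION: the response follows the OpenAI chat-completions schema.
    """
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(question, think)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Inspect the request body without needing a server.
payload = build_payload("What is 17 * 23?", think=True)
print(json.dumps(payload, indent=2))
```

With the server from the command above running, `ask("What is 17 * 23?", think=True)` would send this payload to port 8000.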
Fine-tuning
The 1B size makes MiniCPM5 the smallest base in the mlx-optiq lineup that's still capable enough to fine-tune for real tasks. On a 24 GB Mac, LoRA training fits comfortably at max_seq_length=2048 with all 7 Unsloth target modules adapted, with no need for num-layers or sequence-length workarounds. The sensitivity-aware LoRA overlay reads the optiq_metadata.json sidecar and gives 8-bit layers 2× the adapter rank of 4-bit layers at the same parameter budget.
$ optiq lora train mlx-community/MiniCPM5-1B-OptiQ-4bit \
--data ./my_training_data \
--preset default \
--max-seq-length 2048
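To make the "2× rank at the same parameter budget" idea concrete, here is a rough sketch of one way such an allocation could work. It is not the mlx-optiq implementation: the `allocate_ranks` function, the layer names, the 1024×1024 shapes, and the base rank of 8 are all made up for illustration, and the real optiq_metadata.json schema is not shown in this post.

```python
def allocate_ranks(layers, base_rank=8):
    """Pick LoRA ranks so 8-bit layers get twice the rank of 4-bit layers.

    layers: list of (name, bits, d_in, d_out) tuples.
    The total adapter size stays at or below what a uniform base_rank
    adapter on every layer would use.
    """
    # A rank-r LoRA adapter on a d_in x d_out matrix adds r * (d_in + d_out) params.
    width = {name: d_in + d_out for name, _, d_in, d_out in layers}
    budget = base_rank * sum(width.values())

    # Each 8-bit layer counts twice because it receives twice the rank.
    weighted = sum(width[name] * (2 if bits == 8 else 1)
                   for name, bits, _, _ in layers)
    low_rank = max(1, budget // weighted)

    return {name: low_rank * (2 if bits == 8 else 1)
            for name, bits, _, _ in layers}


# Hypothetical bit assignments and shapes, for illustration only.
layers = [
    ("layers.0.q_proj", 8, 1024, 1024),
    ("layers.0.k_proj", 4, 1024, 1024),
    ("layers.0.v_proj", 4, 1024, 1024),
]
ranks = allocate_ranks(layers, base_rank=8)
print(ranks)
```

In this toy case the single 8-bit layer ends up at rank 12 and the two 4-bit layers at rank 6, which adds up to the same adapter size as rank 8 everywhere.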
License + provenance
MiniCPM5-1B is Apache-2.0 from openbmb/MiniCPM5-1B. The mlx-optiq quant is bit-for-bit deterministic from that base and a 40-sample calibration mix (optiq.jsonl), reproducible via optiq convert openbmb/MiniCPM5-1B --target-bpw 5.0 --candidate-bits 4,8.
— the mlx-optiq team