Nemotron 3 Nano on Apple Silicon
NVIDIA's Nemotron 3 Nano is a hybrid: it interleaves Mamba2 state-space blocks with a handful of full-attention layers, and the larger model adds a 128-expert sparse MoE. In the dense 4B, only four of the 42 backbone blocks are true attention; the rest are linear-attention SSM or MLP blocks. The 30B-A3B routes each token through a subset of its 128 experts, for ≈3B active parameters per token. Both load through mlx-lm's built-in nemotron_h class, so no custom modeling files are needed.
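To make "sparse MoE with ≈3B active parameters" concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not the model's actual router: the value of `K`, the softmax-over-selected normalization, and the synthetic logits are all assumptions chosen for illustration. Only the expert count of 128 comes from the text above.

```python
import math

def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    # Indices of the k largest logits; ties go to the lower index (stable sort).
    top = sorted(range(len(router_logits)), key=lambda i: -router_logits[i])[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

NUM_EXPERTS = 128  # from the article
K = 4              # HYPOTHETICAL: the article does not state experts-per-token

# Synthetic router scores standing in for one token's router output.
logits = [math.sin(i) for i in range(NUM_EXPERTS)]
chosen = route_top_k(logits, K)
print(f"{len(chosen)}/{NUM_EXPERTS} experts run for this token")
```

Only the selected experts' weights take part in the forward pass for a token, which is why the active parameter count sits far below the total.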
The quants
| Model | Size on disk | Capability Score | vs uniform-4 | Best for |
|---|---|---|---|---|
| NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit | 20.6 GB | 69.15 | +2.02 | Strongest of the two — math, code, long context |
| NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit | 2.9 GB (2,938 MB) | 63.60 | +0.24 | Small dense assistant, hybrid-SSM experimentation |
Per-benchmark breakdown — 30B-A3B
| Benchmark | uniform-4 | OptiQ-4 (mixed) | Δ |
|---|---|---|---|
| MMLU (5-shot, 1000) | 74.8% | 76.2% | +1.4 |
| GSM8K (3-shot CoT) | 78.5% | 81.6% | +3.1 |
| IFEval (strict) | 67.5% | 69.1% | +1.7 |
| BFCL V3 (simple AST) | 74.0% | 74.0% | 0.0 |
| HumanEval (pass@1) | 86.0% | 89.0% | +3.0 |
| HashHop (overall) | 22.0% | 25.0% | +3.0 |
Per-benchmark breakdown — 4B
| Benchmark | uniform-4 | OptiQ-4 (mixed) | Δ |
|---|---|---|---|
| MMLU (5-shot, 1000) | 63.3% | 64.0% | +0.7 |
| GSM8K (3-shot CoT) | 79.9% | 81.5% | +1.6 |
| IFEval (strict) | 56.0% | 56.2% | +0.2 |
| BFCL V3 (simple AST) | 75.5% | 75.5% | 0.0 |
| HumanEval (pass@1) | 80.5% | 77.4% | -3.1 |
| HashHop (overall) | 25.0% | 27.0% | +2.0 |
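The article does not define the Capability Score, but the values in the summary table coincide with the unweighted mean of the six mixed-precision benchmark results. The following check assumes that definition. The numbers are copied from the two per-benchmark tables, and the function name is illustrative.

```python
from statistics import mean

# Mixed-precision column of the per-benchmark tables, in percent.
MIXED_RESULTS = {
    "30B-A3B": {
        "MMLU": 76.2, "GSM8K": 81.6, "IFEval": 69.1,
        "BFCL V3": 74.0, "HumanEval": 89.0, "HashHop": 25.0,
    },
    "4B": {
        "MMLU": 64.0, "GSM8K": 81.5, "IFEval": 56.2,
        "BFCL V3": 75.5, "HumanEval": 77.4, "HashHop": 27.0,
    },
}

def capability_score(benchmarks):
    """ASSUMED definition: unweighted mean of the benchmark percentages."""
    return round(mean(benchmarks.values()), 2)

for model_name, benchmarks in MIXED_RESULTS.items():
    print(model_name, capability_score(benchmarks))
```

Under that assumption the script reproduces 69.15 for the 30B-A3B and 63.60 for the 4B. Because every benchmark carries equal weight, the 4B's HumanEval regression offsets most of its gains elsewhere.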
Hello world
    from mlx_lm import load, generate

    # Fetch (or reuse the cached copy of) the quantized weights and tokenizer.
    model, tok = load("mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit")

    # Render the chat template to a plain string that ends at the assistant turn.
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": "Explain why hybrid Mamba+attention models scale to long contexts."}],
        tokenize=False,
        add_generation_prompt=True,
    )
    print(generate(model, tok, prompt=prompt, max_tokens=300))
Hybrid KV cache
Only the four full-attention layers carry a KV cache — the Mamba2 blocks keep recurrent state instead, which is what gives the architecture its flat long-context memory profile. The repo ships a kv_config.json from a real sensitivity pass that covers just those attention layers: three at 4-bit, one at 8-bit, 5.0 average KV bits. Point optiq serve at it for mixed-precision KV.
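The arithmetic behind the 5.0-bit average, and a rough feel for how little cache four attention layers need, can be sketched as follows. The per-layer bit widths come from the paragraph above. `N_KV_HEADS`, `HEAD_DIM`, and the context length are hypothetical placeholders rather than the model's real dimensions, and the estimate ignores the scales and biases that a quantized cache also stores.

```python
def average_kv_bits(bits_per_layer):
    """Mean KV precision across the layers that actually hold a KV cache."""
    return sum(bits_per_layer) / len(bits_per_layer)

def kv_cache_bytes(seq_len, bits_per_layer, n_kv_heads, head_dim):
    """Approximate KV cache size, ignoring quantization scales and biases."""
    # Each cached token stores one key and one value vector per KV head.
    values_per_token = 2 * n_kv_heads * head_dim
    total_bits = sum(seq_len * values_per_token * bits for bits in bits_per_layer)
    return total_bits / 8

ATTENTION_KV_BITS = [4, 4, 4, 8]  # three layers at 4-bit, one at 8-bit
N_KV_HEADS = 8                    # HYPOTHETICAL: not stated in the article
HEAD_DIM = 128                    # HYPOTHETICAL: not stated in the article
SEQ_LEN = 32_768                  # HYPOTHETICAL context length

print("average KV bits:", average_kv_bits(ATTENTION_KV_BITS))
mixed = kv_cache_bytes(SEQ_LEN, ATTENTION_KV_BITS, N_KV_HEADS, HEAD_DIM)
fp16 = kv_cache_bytes(SEQ_LEN, [16] * 4, N_KV_HEADS, HEAD_DIM)
print(f"mixed-precision: {mixed / 2**20:.0f} MiB, 16-bit: {fp16 / 2**20:.0f} MiB")
```

The cache grows linearly with context only for the attention layers. The Mamba2 blocks contribute a fixed-size recurrent state, which this estimate leaves out because it does not depend on sequence length.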
Before v0.1.5, optiq kv-cache looked for a self_attn submodule on every layer and assumed one cache slot per layer. NemotronH names its attention module mixer and skips MLP layers in the prompt cache, so the old code found no attention layers and raised ZeroDivisionError. v0.1.5 classifies each layer as attention, SSM, or MLP and maps cache slots to the right layer indices. Upgrade with pip install -U mlx-optiq before running optiq kv-cache on this family.
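The slot-to-layer bookkeeping described above can be illustrated with a short sketch. This is not the mlx-optiq implementation: the one-character layer encoding, the `Kind` enum, and the toy seven-layer pattern are assumptions made for the example. The only behavior taken from the text is that MLP layers have no entry in the prompt cache while attention and SSM layers do.

```python
from enum import Enum

class Kind(Enum):
    ATTENTION = "attention"
    SSM = "ssm"
    MLP = "mlp"

# ILLUSTRATIVE encoding for this sketch: one character per layer.
SYMBOLS = {"*": Kind.ATTENTION, "M": Kind.SSM, "-": Kind.MLP}

def classify(pattern):
    """Turn a pattern string into a per-layer list of Kind values."""
    return [SYMBOLS[ch] for ch in pattern]

def attention_slots(kinds):
    """Map prompt-cache slot index -> layer index, for attention layers only."""
    mapping = {}
    slot = 0
    for layer_idx, kind in enumerate(kinds):
        if kind is Kind.MLP:
            continue  # MLP layers have no entry in the prompt cache
        if kind is Kind.ATTENTION:
            mapping[slot] = layer_idx
        slot += 1  # attention and SSM layers each occupy one slot
    return mapping

# Toy 7-layer pattern, NOT the real model layout.
kinds = classify("M-M*-M*")
print(attention_slots(kinds))  # {2: 3, 4: 6}
```

In the toy pattern the attention layers sit at layer indices 3 and 6 but occupy cache slots 2 and 4, which is the mismatch a one-slot-per-layer assumption gets wrong.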
Serving
    $ optiq serve --model mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit \
        --kv-config kv_config.json --port 8000

    # From any OpenAI-compatible client:
    $ curl -s http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model":"mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit",
             "messages":[{"role":"user","content":"What is 17 * 23?"}]}'
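The same request can be sent from Python with only the standard library. This sketch assumes the server from the command above is listening on localhost:8000 and returns the usual OpenAI chat-completions response shape (`choices[0].message.content`). The helper names `build_chat_request` and `ask` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches --port 8000 above
MODEL = "mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit"

def build_chat_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build the POST request for a single-turn chat completion."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, timeout=120):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # ASSUMES the standard OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]

# Requires a running `optiq serve` instance:
# print(ask("What is 17 * 23?"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent.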
License + provenance
Nemotron 3 Nano 4B is distributed under the NVIDIA Nemotron Open Model License. The mlx-optiq quant is built from mlx-community/NVIDIA-Nemotron-3-Nano-4B-BF16 and inherits that license. The result is deterministic given the bf16 base and the 40-sample calibration mix (optiq.jsonl), and is reproducible via optiq convert mlx-community/NVIDIA-Nemotron-3-Nano-4B-BF16 --target-bpw 5.0 --candidate-bits 4,8 --reference bf16.
— the mlx-optiq team