mlx-optiq
Family guide · Nemotron 3

Nemotron 3 Nano on Apple Silicon

NVIDIA's Nemotron 3 Nano is a hybrid architecture: it interleaves Mamba2 state-space blocks with a handful of full-attention layers, and the larger model adds a 128-expert sparse MoE. In the dense 4B, only four of the 42 backbone blocks are true attention; the rest are linear-attention SSM or MLP blocks. The 30B-A3B routes each token across 128 experts, with ≈3B parameters active per token. Both load through mlx-lm's built-in nemotron_h class; the custom modeling files ship in each repo and are picked up automatically.

The quants

| Model | Size on disk | Capability Score | vs uniform-4 | Best for |
| --- | --- | --- | --- | --- |
| NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit | 20.6 GB | 69.15 | +2.02 | Strongest of the two: math, code, long context |
| NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit | 2,938 MB | 63.60 | +0.24 | Small dense assistant, hybrid-SSM experimentation |

Per-benchmark breakdown — 30B-A3B

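| Benchmark | uniform-4 | OptIQ-4 (mixed) | Δ |
| --- | --- | --- | --- |
| MMLU (5-shot, 1000) | 74.8% | 76.2% | +1.3 |
| GSM8K (3-shot CoT) | 78.5% | 81.6% | +3.1 |
| IFEval (strict) | 67.5% | 69.1% | +1.7 |
| BFCL V3 (simple AST) | 74.0% | 74.0% | 0.0 |
| HumanEval (pass@1) | 86.0% | 89.0% | +3.0 |
| HashHop (overall) | 22.0% | 25.0% | +3.0 |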

Per-benchmark breakdown — 4B

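| Benchmark | uniform-4 | OptIQ-4 (mixed) | Δ |
| --- | --- | --- | --- |
| MMLU (5-shot, 1000) | 63.3% | 64.0% | +0.7 |
| GSM8K (3-shot CoT) | 79.9% | 81.5% | +1.6 |
| IFEval (strict) | 56.0% | 56.2% | +0.2 |
| BFCL V3 (simple AST) | 75.5% | 75.5% | 0.0 |
| HumanEval (pass@1) | 80.5% | 77.4% | -3.1 |
| HashHop (overall) | 25.0% | 27.0% | +2.0 |
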
The MoE 30B is the cleaner win

The 30B-A3B beats uniform 4-bit by +2.02 Capability Score, winning or tying all six benchmarks. The mixed-precision assignment covers the fused routed-expert tensors too: OptIQ picks 4-bit or 8-bit per layer, and most stay at 4-bit, which keeps the model at 5.05 BPW / 20.6 GB. The dense 4B is a tighter +0.24: it wins four of six, ties BFCL, and gives up 3.1 points on HumanEval. Its size overhead over uniform 4-bit is proportionally larger because a Mamba2 block carries only two linear layers, so more of them land at 8-bit. The Capability Score gives every benchmark one equal vote; disk size sits next to the score as a second axis. See the eval-framework writeup.
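The "one equal vote" rule can be checked against the tables. The sketch below assumes the Capability Score is the unweighted mean of the six benchmark percentages, which is consistent with the published 69.15 and 63.60; the names `SCORES`, `capability_score`, and `summarize` are illustrative and not part of mlx-optiq.

```python
# Per-benchmark scores (%) copied from the tables above,
# stored as (uniform-4, OptIQ-4 mixed).
SCORES = {
    "30B-A3B": {
        "MMLU": (74.8, 76.2),
        "GSM8K": (78.5, 81.6),
        "IFEval": (67.5, 69.1),
        "BFCL V3": (74.0, 74.0),
        "HumanEval": (86.0, 89.0),
        "HashHop": (22.0, 25.0),
    },
    "4B": {
        "MMLU": (63.3, 64.0),
        "GSM8K": (79.9, 81.5),
        "IFEval": (56.0, 56.2),
        "BFCL V3": (75.5, 75.5),
        "HumanEval": (80.5, 77.4),
        "HashHop": (25.0, 27.0),
    },
}


def capability_score(values):
    """Unweighted mean: every benchmark gets one equal vote (assumption)."""
    return sum(values) / len(values)


def summarize(bench):
    """Return (uniform score, mixed score, mixed minus uniform)."""
    uniform = capability_score([u for u, _ in bench.values()])
    mixed = capability_score([m for _, m in bench.values()])
    return uniform, mixed, mixed - uniform


for name, bench in SCORES.items():
    uniform, mixed, delta = summarize(bench)
    print(f"{name}: uniform-4 {uniform:.2f}, OptIQ-4 {mixed:.2f}, delta {delta:+.2f}")
```

With the rounded table values this reproduces both published scores and the +2.02 delta for the 30B-A3B. The 4B delta comes out at +0.23 rather than +0.24, presumably because the published figure is computed from unrounded benchmark scores.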

Hello world

hello.py (python)
from mlx_lm import load, generate

model, tok = load("mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain why hybrid Mamba+attention models scale to long contexts."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=300))

Hybrid KV cache

Only the four full-attention layers carry a KV cache — the Mamba2 blocks keep recurrent state instead, which is what gives the architecture its flat long-context memory profile. The repo ships a kv_config.json from a real sensitivity pass that covers just those attention layers: three at 4-bit, one at 8-bit, 5.0 average KV bits. Point optiq serve at it for mixed-precision KV.
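The 5.0 figure follows directly from the bit assignment. As a minimal sketch, with placeholder layer labels rather than the real layer indices or the actual kv_config.json schema, the average can be computed like this:

```python
# Hypothetical per-layer KV bit widths: three attention layers at 4-bit and
# one at 8-bit. The keys are placeholders, not real layer indices, and this
# dict is not the kv_config.json format.
kv_bits = {"attn_a": 4, "attn_b": 4, "attn_c": 8, "attn_d": 4}


def average_kv_bits(bits_by_layer):
    """Mean bit width across the layers that actually hold a KV cache."""
    if not bits_by_layer:
        raise ValueError("no attention layers: nothing to average")
    return sum(bits_by_layer.values()) / len(bits_by_layer)


avg = average_kv_bits(kv_bits)
print(f"average KV bits: {avg:.1f}")

# Rough footprint relative to a 16-bit cache, ignoring the per-group scales
# and offsets that a quantized cache also has to store (assumption).
print(f"relative KV footprint vs 16-bit: {avg / 16:.4f}")
```

The explicit guard for an empty mapping matters on hybrid models: a tool that finds no attention layers should fail with a clear message rather than divide by zero.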

NemotronH KV support landed in v0.1.5

Earlier versions of optiq kv-cache looked for a self_attn submodule and assumed one cache slot per layer. NemotronH names its attention module mixer and skips MLP layers in the prompt cache, so the old code raised ZeroDivisionError. v0.1.5 classifies each layer as attention / SSM / MLP and maps cache slots to the right layer indices. Upgrade with pip install -U mlx-optiq before running optiq kv-cache on this family.
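The slot-mapping idea can be illustrated with a toy layer list. This is a sketch of the general technique, not the mlx-optiq implementation: the layer kinds are given as plain strings, the function names are hypothetical, and the only behavior taken from the text is that MLP layers get no cache slot.

```python
from typing import Dict, List

# Layer kinds that own an entry in the prompt cache. Attention layers hold a
# KV cache and SSM layers hold recurrent state; MLP layers are skipped.
CACHED_KINDS = {"attention", "ssm"}
KNOWN_KINDS = CACHED_KINDS | {"mlp"}


def map_cache_slots(layer_kinds: List[str]) -> Dict[int, int]:
    """Return {layer_index: cache_slot} for every layer that has a slot."""
    slots: Dict[int, int] = {}
    for idx, kind in enumerate(layer_kinds):
        if kind not in KNOWN_KINDS:
            raise ValueError(f"unknown layer kind at {idx}: {kind!r}")
        if kind in CACHED_KINDS:
            slots[idx] = len(slots)  # next free slot, in layer order
    return slots


def attention_slots(layer_kinds: List[str]) -> Dict[int, int]:
    """Restrict the mapping to layers whose slot is a quantizable KV cache."""
    slots = map_cache_slots(layer_kinds)
    return {i: s for i, s in slots.items() if layer_kinds[i] == "attention"}


# Toy 7-layer stack, not the real 42-block layout.
toy = ["ssm", "mlp", "ssm", "attention", "mlp", "ssm", "attention"]
print(map_cache_slots(toy))   # {0: 0, 2: 1, 3: 2, 5: 3, 6: 4}
print(attention_slots(toy))   # {3: 2, 6: 4}
```

In the toy stack the attention layer at index 3 sits in cache slot 2, so code that assumes slot number equals layer number would read or quantize the wrong entry.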

Serving

terminal (bash)
$ optiq serve --model mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit \
      --kv-config kv_config.json --port 8000

# From any OpenAI-compatible client:
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit",
         "messages":[{"role":"user","content":"What is 17 * 23?"}]}'

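The same request can be sent from Python. The sketch below uses only the standard library and assumes the server returns the standard OpenAI chat-completions response shape (`choices[0].message.content`); `build_chat_request` and `ask` are illustrative helper names, not part of mlx-optiq.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches --port 8000 in the serve command
MODEL = "mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit"


def build_chat_request(question, base_url=BASE_URL, model=MODEL):
    """Build the POST request for /v1/chat/completions without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, timeout=120):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the standard OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


# With optiq serve running locally:
#     print(ask("What is 17 * 23?"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent.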
License + provenance

Nemotron 3 Nano 4B is distributed under the NVIDIA Nemotron Open Model License. The mlx-optiq quant is built from mlx-community/NVIDIA-Nemotron-3-Nano-4B-BF16 and inherits that license. The quantization is deterministic given the bf16 base and the 40-sample calibration mix (optiq.jsonl), and can be reproduced with optiq convert mlx-community/NVIDIA-Nemotron-3-Nano-4B-BF16 --target-bpw 5.0 --candidate-bits 4,8 --reference bf16.

— the mlx-optiq team