Family guide · Nemotron 3

Nemotron 3 on Apple Silicon

NVIDIA's Nemotron 3 is a hybrid: it interleaves Mamba2 state-space blocks with a handful of full-attention layers, and the MoE variants add a sparse expert mixture. In the 4B (dense), only four of the 42 backbone blocks are true attention; the rest are linear-attention SSM or MLP. The 30B-A3B routes through 128 experts at ≈3 B active parameters per token. The flagship Super 120B-A12B goes further: a 512-expert MoE (22 active) that, at 2-bit, runs on a 36 GB Mac by streaming its experts off SSD. All load through mlx-lm's built-in nemotron_h class.

The quants

Model	Size on disk	Capability Score	vs uniform-4	Best for
NVIDIA-Nemotron-3-Super-120B-A12B-OptiQ-2bit	47.5 GB	—	—	Super flagship · 2-bit, SSD-streamed on a 36 GB Mac
NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit	20.6 GB	72.32	+2.02	Stronger of the two scored quants: math, code, long context
NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit	2.9 GB	66.68	+0.24	Small dense assistant, hybrid-SSM experimentation

The Super 120B, streamed off SSD

The Nemotron-3-Super-120B-A12B is the family's flagship. Its OptiQ quant is an extreme 2-bit static build (47.5 GB) that runs on a 36 GB Mac: about 14 GB stays resident in RAM (the Mamba blocks, attention and shared experts), while the 34 GB of routed experts stream off the SSD at ~3 tok/s. It carries no Capability Score by design: an extreme quant ships with a coherence demo instead, and this one wrote and plays its own Flappy Bird. Read the write-up.
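As a rough back-of-envelope check on those figures, the sketch below adds up the resident and streamed portions and estimates how much expert data might be read per token. It assumes equal-sized experts, that "22 active" applies uniformly to the routed experts, and that no expert is reused from one token to the next. None of this is stated in the guide, so treat the result as an upper bound rather than a measurement.

```python
# Figures quoted in the guide (GB, approximate).
RAM_GB = 36.0
RESIDENT_GB = 14.0       # Mamba blocks, attention, shared experts
STREAMED_GB = 34.0       # routed experts, read from SSD on demand
EXPERTS_TOTAL = 512
EXPERTS_ACTIVE = 22
TOKENS_PER_SEC = 3.0


def streaming_budget():
    """Return (total size, fits in RAM?, GB read per token, GB/s read)."""
    total_gb = RESIDENT_GB + STREAMED_GB
    fits_in_ram = RESIDENT_GB < RAM_GB
    # Assumption: equal-sized experts and no reuse between tokens,
    # so each token reads the active fraction of the streamed weights.
    read_per_token_gb = STREAMED_GB * EXPERTS_ACTIVE / EXPERTS_TOTAL
    read_bandwidth_gbps = read_per_token_gb * TOKENS_PER_SEC
    return total_gb, fits_in_ram, read_per_token_gb, read_bandwidth_gbps


total, fits, per_token, bandwidth = streaming_budget()
print(f"total on disk  ~{total:.1f} GB")
print(f"resident fits in {RAM_GB:.0f} GB RAM: {fits}")
print(f"expert reads   ~{per_token:.2f} GB/token, ~{bandwidth:.1f} GB/s at {TOKENS_PER_SEC:.0f} tok/s")
```

Under these assumptions the worst-case read rate works out to a few GB/s. Any expert reuse between consecutive tokens would lower it.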

Per-benchmark breakdown, 30B-A3B

Benchmark	uniform-4	OptiQ-4 (mixed)	Δ
MMLU (5-shot, 1000)	74.8%	76.2%	+1.3
GSM8K (3-shot CoT)	78.5%	81.6%	+3.1
IFEval (strict)	67.5%	69.1%	+1.7
BFCL V3 (simple AST)	74.0%	74.0%	0.0
HumanEval (pass@1)	86.0%	89.0%	+3.0
HashHop (overall)	22.0%	25.0%	+3.0

Per-benchmark breakdown, 4B

Benchmark	uniform-4	OptiQ-4 (mixed)	Δ
MMLU (5-shot, 1000)	63.3%	64.0%	+0.7
GSM8K (3-shot CoT)	79.9%	81.5%	+1.6
IFEval (strict)	56.0%	56.2%	+0.2
BFCL V3 (simple AST)	75.5%	75.5%	0.0
HumanEval (pass@1)	80.5%	77.4%	-3.1
HashHop (overall)	25.0%	27.0%	+2.0

The MoE 30B is the cleaner win

The 30B-A3B clears uniform 4-bit by a full +2.0 Capability Score, winning or tying all six benchmarks. OptiQ's per-layer 4/8-bit assignment extends to the fused routed-expert tensors; most of them stay at 4-bit, which keeps the model at 5.05 BPW / 20.6 GB. The dense 4B is a tighter +0.24: it wins four of six, ties one and trades a little HumanEval, and its disk delta runs richer because a Mamba2 block carries only two linears, so more of them land at 8-bit. Every metric gets one equal vote; disk size sits next to the score as a second axis. See the eval-framework writeup.
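The "vs uniform-4" column can be sanity-checked against the per-benchmark tables. The sketch below assumes that "one equal vote" means an unweighted mean of the six per-benchmark deltas. It does not recompute the absolute Capability Scores, whose full definition lives in the eval-framework writeup.

```python
# (uniform-4, OptiQ-4) accuracies in percent, copied from the tables above.
NANO_30B = {
    "MMLU": (74.8, 76.2),
    "GSM8K": (78.5, 81.6),
    "IFEval": (67.5, 69.1),
    "BFCL V3": (74.0, 74.0),
    "HumanEval": (86.0, 89.0),
    "HashHop": (22.0, 25.0),
}
NANO_4B = {
    "MMLU": (63.3, 64.0),
    "GSM8K": (79.9, 81.5),
    "IFEval": (56.0, 56.2),
    "BFCL V3": (75.5, 75.5),
    "HumanEval": (80.5, 77.4),
    "HashHop": (25.0, 27.0),
}


def mean_delta(scores):
    """Unweighted mean of (OptiQ - uniform) across benchmarks."""
    deltas = [optiq - uniform for uniform, optiq in scores.values()]
    return sum(deltas) / len(deltas)


for name, scores in (("30B-A3B", NANO_30B), ("4B", NANO_4B)):
    print(f"{name}: mean delta {mean_delta(scores):+.2f}")
```

The rounded table values land within about 0.01 of the published +2.02 and +0.24, which is consistent with the deltas having been computed from unrounded scores.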

Hello world

hello.py

from mlx_lm import load, generate

# Downloads the quant from the Hugging Face Hub on first use.
model, tok = load("mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit")

# Wrap the user message in the model's chat template and return it as text.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain why hybrid Mamba+attention models scale to long contexts."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=300))

Hybrid KV cache

Only the four full-attention layers carry a KV cache; the Mamba2 blocks keep recurrent state instead, which is what gives the architecture its flat long-context memory profile. The repo ships a kv_config.json from a real sensitivity pass that covers just those attention layers: three at 4-bit, one at 8-bit, 5.0 average KV bits. Point optiq serve at it for mixed-precision KV.
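The 5.0-bit average follows directly from the per-layer assignment. A minimal sketch of that arithmetic, assuming all four attention layers have the same KV shape (so a plain mean is the right average) and using a 16-bit cache as a hypothetical baseline for comparison:

```python
# Per-layer KV precision as described: three attention layers at 4-bit, one at 8-bit.
KV_BITS = [4, 4, 4, 8]


def average_bits(bits):
    """Mean bits per cached value, assuming equally sized layers."""
    return sum(bits) / len(bits)


def relative_size(bits, baseline_bits=16):
    """Cache size relative to a uniform baseline (16-bit is an assumption)."""
    return average_bits(bits) / baseline_bits


print(f"average KV bits: {average_bits(KV_BITS):.1f}")
print(f"size vs 16-bit cache: {relative_size(KV_BITS):.4f}")
```

This ignores any per-group scale and bias overhead that a quantized cache would add in practice.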

NemotronH KV support

NemotronH names its attention module mixer (not self_attn) and skips MLP layers in the prompt cache. optiq kv-cache classifies each layer as attention / SSM / MLP and maps cache slots to the right layer indices.
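To illustrate the slot-to-layer bookkeeping described above, here is a toy sketch. It is not optiq's implementation. The pattern string, the character codes and the helper names are all made up for illustration, and the only behaviour taken from the text is that MLP layers get no cache slot.

```python
# Hypothetical 10-block pattern: "M" = Mamba2 SSM, "*" = attention, "-" = MLP.
PATTERN = "M-M*-M-M*-"

KIND = {"M": "ssm", "*": "attention", "-": "mlp"}


def classify(pattern):
    """Label every layer index as 'ssm', 'attention' or 'mlp'."""
    return [KIND[c] for c in pattern]


def cache_slot_map(pattern):
    """Map cache slot -> layer index, skipping MLP layers (they hold no cache)."""
    slots = {}
    for layer_idx, kind in enumerate(classify(pattern)):
        if kind == "mlp":
            continue
        slots[len(slots)] = layer_idx
    return slots


def attention_slots(pattern):
    """Cache slots that hold a real KV cache, i.e. the ones worth quantizing."""
    kinds = classify(pattern)
    return [s for s, layer in cache_slot_map(pattern).items() if kinds[layer] == "attention"]


print(cache_slot_map(PATTERN))
print(attention_slots(PATTERN))
```

The point of the mapping is that a cache slot index and a layer index diverge as soon as one MLP layer is skipped, so a per-layer KV setting has to be looked up through the map rather than by position.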

Serving

terminal

$ optiq serve --model mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit \
      --kv-config kv_config.json --port 8000

# From any OpenAI-compatible client:
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit",
         "messages":[{"role":"user","content":"What is 17 * 23?"}]}'

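The same request can be sent from Python with only the standard library. This is a minimal sketch that assumes the server started above is listening on localhost:8000 and returns the usual OpenAI-style response body (choices[0].message.content). The helper names build_chat_request and ask are illustrative, not part of optiq.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit"


def build_chat_request(question, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, timeout=60):
    """Send the question and return the assistant's reply text.

    Assumes an OpenAI-compatible response shape.
    """
    with urllib.request.urlopen(build_chat_request(question), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, print(ask("What is 17 * 23?")) mirrors the curl call above.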
License + provenance

Nemotron 3 Nano 4B is distributed under the NVIDIA Nemotron Open Model License. The mlx-optiq quant is built from mlx-community/NVIDIA-Nemotron-3-Nano-4B-BF16 and inherits that license. It's deterministic from the bf16 base and the 40-sample calibration mix (optiq.jsonl), and reproducible via:

$ optiq convert mlx-community/NVIDIA-Nemotron-3-Nano-4B-BF16 --target-bpw 5.0 --candidate-bits 4,8 --reference bf16

The mlx-optiq team