Engineering · June 19, 2026

A 122-billion-parameter model, on a laptop.

Topic: SSD expert streaming · Reading time: 6 min · Related: quant methods

Qwen3.5-122B-A10B ships as 244 GB of bf16 weights. Below, it is playing a Flappy Bird game it wrote, on a 36 GB MacBook, with no GPU and no cloud.

[GIF: a 2-bit quant of Qwen3.5-122B-A10B playing a Flappy Bird game it wrote, on a 36 GB Mac]

The quant is 44 GB on disk. While it runs, about 12 GB sits in RAM. The expert weights, about 35 GB, stay on the SSD and stream in one expert at a time. This is the headline of mlx-optiq 0.2.5: large mixture-of-experts models that don't fit in memory now run anyway. Here is how, and why we report a game instead of a benchmark for quants this aggressive.

Most of a 122 B MoE never fires for a given token. So most of it never needs to be in memory.

The experts live on disk

A mixture-of-experts model is mostly experts. Qwen3.5-122B-A10B has 256 experts per layer across 48 layers, and 98% of its parameters sit in those tensors. Only 8 of the 256 fire for any given token. For that token, the other 248 are dead weight: loaded into RAM, never read.

So we stop loading them. mlx-optiq 0.2.5 keeps the experts on the SSD and reads only the active ones, per token, as the router selects them. Attention, the router, the embeddings, the few sensitive blocks stay resident. The experts stream.
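A minimal sketch of the routing step that makes this possible, in plain Python. The 256 and 8 come from the model description above. The random logits stand in for a real router, and renormalising the softmax over only the selected experts is one common convention that we assume here, not a statement about how Qwen3.5 or mlx-optiq weights its experts.

```python
import math
import random

NUM_EXPERTS = 256  # experts per layer, from the post
TOP_K = 8          # experts that fire per token, from the post

def top_k_experts(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    # (Assumed convention; some models normalise over all experts first.)
    peak = max(router_logits[i] for i in ranked)
    exps = [math.exp(router_logits[i] - peak) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]

# Stand-in router output for one token in one layer.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

active = top_k_experts(logits)
fraction_touched = len(active) / NUM_EXPERTS  # 8 / 256
```

Only the indices in `active` need their weights for this token, which is 1/32 of the layer's experts. Everything else can stay on disk until a later token routes to it.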

Part of the model	Size	Where it lives
Attention, embeddings, router, scales	10.7 GB	resident in RAM
256 experts × 48 layers	35 GB	streamed from SSD
Peak while generating	12 GB	on a 36 GB Mac

The model on disk is roughly four times larger than its peak memory footprint (44 GB against 12 GB). Decode runs at about 5 tokens a second. The active experts are read by byte range from the shards on each step, with the small bf16 scales kept resident, so resident memory stays flat no matter how big the model on disk gets.
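Reading by byte range can be sketched with nothing but the standard library. The toy shard layout, the `index` dictionary, and the function names below are all hypothetical and chosen for illustration. They are not mlx-optiq's on-disk format or API. `os.pread` is available on macOS and Linux.

```python
import os
import tempfile

EXPERT_BYTES = 16  # toy size; real expert tensors are far larger

def build_toy_shard(path, num_experts):
    """Write experts back to back; return {expert_id: (offset, length)}."""
    index = {}
    with open(path, "wb") as f:
        for expert_id in range(num_experts):
            index[expert_id] = (f.tell(), EXPERT_BYTES)
            f.write(bytes([expert_id]) * EXPERT_BYTES)
    return index

def read_expert(fd, index, expert_id):
    """Read exactly one expert's byte range, leaving the rest on disk."""
    offset, length = index[expert_id]
    chunks, got = [], 0
    while got < length:  # pread may return fewer bytes than requested
        chunk = os.pread(fd, length - got, offset + got)
        if not chunk:
            raise EOFError(f"shard ended inside expert {expert_id}")
        chunks.append(chunk)
        got += len(chunk)
    return b"".join(chunks)

with tempfile.TemporaryDirectory() as tmp:
    shard_path = os.path.join(tmp, "toy_shard.bin")
    index = build_toy_shard(shard_path, num_experts=256)
    shard_size = os.path.getsize(shard_path)

    fd = os.open(shard_path, os.O_RDONLY)
    try:
        active = [3, 17, 200]  # whatever the router picked this step
        loaded = {e: read_expert(fd, index, e) for e in active}
    finally:
        os.close(fd)

bytes_read = sum(len(blob) for blob in loaded.values())
```

The point of the sketch is the ratio: `bytes_read` depends on how many experts are active, not on `shard_size`, which is why memory use does not grow with the file.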

The quant: 2-bit, by rule not by guess

Fitting 122 B in 44 GB means 2-bit weights for the experts. 2-bit is lossy. Spend it carelessly and the model breaks.

mlx-optiq's usual method measures each layer's sensitivity with a calibration pass: perturb one layer, watch the output distribution move, allocate bits where it moves most. That works beautifully up to about 30 B. On a 122 B MoE it would run for days, and it needs the full-precision model resident as a reference, which defeats the purpose.
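The idea behind the calibration pass can be shown at toy scale. This is a sketch under stated assumptions, not the library's implementation: we assume KL divergence as the measure of how far the output distribution moves, the layer names and logits are made up, and a simple "top N layers get the high bits" rule stands in for the real allocator.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, assuming q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def allocate_bits(sensitivity, low=2, high=4, num_high=2):
    """Give `high` bits to the num_high most sensitive layers, `low` to the rest."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    protected = set(ranked[:num_high])
    return {name: (high if name in protected else low) for name in sensitivity}

# Output logits of the full-precision reference on one calibration token.
reference = softmax([2.0, 1.0, 0.1])

# Made-up output logits after quantizing ONE layer at a time.
perturbed_logits = {
    "attn.0": [1.2, 1.6, 0.1],
    "mlp.0": [2.0, 1.0, 0.15],
    "router.0": [0.5, 2.0, 0.1],
    "experts.0": [1.9, 1.0, 0.1],
}

sensitivity = {
    name: kl_divergence(reference, softmax(logits))
    for name, logits in perturbed_logits.items()
}
bits = allocate_bits(sensitivity)
```

Each entry in `perturbed_logits` costs a forward pass against a resident reference model. That is the part that stops scaling: the number of passes grows with the layer count, and the reference has to fit in memory.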

So 0.2.5 adds a second method, static. It does no measurement at all. It assigns bits from architecture alone, using the priors the calibration pass keeps rediscovering: the embedding and output head, the first and last block, attention, and the MoE router get the high bits; the dense MLP and the routed experts stay low.
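Written as code, the static rule is a lookup. The sketch below is one reading of the paragraph above, with hypothetical `kind` labels and a hypothetical function name. In particular, it assumes that "first and last block" protects every tensor in those blocks, which the text does not spell out.

```python
# Tensor kinds that always get the high bit-width, per the priors above.
ALWAYS_HIGH = {"embedding", "lm_head", "attention", "router"}

def static_bits(kind, block_index, num_blocks, low=2, high=4):
    """Assign a bit-width from architecture alone, with no measurement.

    kind:        hypothetical label such as "attention" or "expert"
    block_index: position of the transformer block, or None for tensors
                 outside any block (embedding, output head)
    """
    if kind in ALWAYS_HIGH:
        return high
    # Assumption: the first and last blocks are protected wholesale.
    if block_index is not None and block_index in (0, num_blocks - 1):
        return high
    return low  # dense MLP and routed experts elsewhere

NUM_BLOCKS = 48  # layers in Qwen3.5-122B-A10B, from the post

plan = {
    ("embedding", None): static_bits("embedding", None, NUM_BLOCKS),
    ("attention", 10): static_bits("attention", 10, NUM_BLOCKS),
    ("router", 10): static_bits("router", 10, NUM_BLOCKS),
    ("expert", 10): static_bits("expert", 10, NUM_BLOCKS),
    ("dense_mlp", 10): static_bits("dense_mlp", 10, NUM_BLOCKS),
    ("expert", 0): static_bits("expert", 0, NUM_BLOCKS),
    ("expert", 47): static_bits("expert", 47, NUM_BLOCKS),
}
```

There is no model in this function and no forward pass, which is why the conversion cost collapses: the plan depends only on names and positions.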

We checked that skipping the measurement does not cost quality. On the smallest base, Qwen3.5-0.8B, static lands the exact same GSM8K score as the full calibration method, 34.5%, while converting 125 times faster and at a lower bit-width. For a typical transformer, the structure carries most of the signal. The expensive measurement is insurance for the cases structure cannot predict, and on a model this large that insurance is not affordable. For the 122 B, static put 4-bit on the router, attention, and the protected blocks, and 2-bit on the experts. The average comes out at 2.5 bits per weight.

Why a game, not a benchmark

Scoring an extreme quant of a huge model runs into an honest wall. A full capability suite is thousands of generations. At 5 tokens a second through SSD streaming, GSM8K alone is hours, and the code and long-context metrics are days each. You cannot run a six-metric suite on a streaming 122 B and call it a release.
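The arithmetic behind "hours" is short. The GSM8K test split has 1,319 problems and the decode rate is the 5 tokens a second quoted above. The 256 generated tokens per answer is our assumption for illustration, and prompt processing time is ignored, so treat the result as a rough lower bound.

```python
GSM8K_TEST_PROBLEMS = 1319      # size of the standard GSM8K test split
TOKENS_PER_ANSWER = 256         # assumption: average generated tokens
DECODE_TOKENS_PER_SECOND = 5    # streaming decode rate from the post

def eval_hours(num_generations, tokens_each, tokens_per_second):
    """Wall-clock hours to generate every answer, decode time only."""
    return num_generations * tokens_each / tokens_per_second / 3600

gsm8k_hours = eval_hours(
    GSM8K_TEST_PROBLEMS, TOKENS_PER_ANSWER, DECODE_TOKENS_PER_SECOND
)
```

Under those assumptions a single GSM8K pass takes most of a day of continuous decoding, before any benchmark with longer outputs or longer prompts is counted.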

The field already settled on the alternative. When Unsloth ships a 1.58-bit DeepSeek-R1, they do not post MMLU. They show the model writing a working Flappy Bird, or a spinning-heptagon physics sim, on hardware people actually own. Size, the machine it runs on, tokens per second, and a coherence demo. That is the report.

So we asked the 2-bit 122 B for Flappy Bird in one HTML file. It returned 7.7 KB of self-contained markup: a canvas, a game loop, gravity, collision, a score counter. node --check passes on the script. The braces balance. An autopilot reading the game's own state threads the pipes and the score climbs, which a broken game cannot do. The gif at the top is that game, played.
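A cheap structural check of this kind can be sketched with the standard library. This is an illustration, not the script we ran: the class and function names are hypothetical, the sample page is made up, and the bracket check is naive (it does not skip brackets inside strings or comments), so it complements a real parser such as node --check rather than replacing it.

```python
from html.parser import HTMLParser

class ScriptExtractor(HTMLParser):
    """Collect the text of every <script> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data

CLOSERS = {")": "(", "]": "[", "}": "{"}

def brackets_balance(source):
    """Naive check: every closer matches the most recent unclosed opener."""
    stack = []
    for ch in source:
        if ch in "([{":
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or stack.pop() != CLOSERS[ch]:
                return False
    return not stack

# Made-up stand-in for a generated single-file game.
page = (
    '<html><body><canvas id="game"></canvas>'
    "<script>function tick(){ if (y > 0) { y -= 1; } }</script>"
    "</body></html>"
)

extractor = ScriptExtractor()
extractor.feed(page)
script_ok = all(brackets_balance(s) for s in extractor.scripts)
```

A truncated generation, the most common failure for a model that loses coherence partway through, usually fails this check immediately because the closing braces never arrive.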

A 2-bit, 44 GB quant of a 122 B model, writing and running working code, on a 36 GB laptop. That says more than a number would.

Run it


$ pip install -U mlx-optiq

# streaming turns on automatically for a MoE too big to fit resident
$ optiq serve --model mlx-community/Qwen3.5-122B-A10B-OptiQ-2bit \
              --stream-experts

Open the Lab, ask for a game, and watch it render in the Canvas pane. The same playbook makes any large MoE laptop-runnable: convert with --method static --candidate-bits 2,4, smoke-test, ship the card with the demo. The two methods are laid out side by side in the sensitivity guide.

— the mlx-optiq team