Streaming weights from RAM to GPU for prefill makes sense due to batching and pc...

simonw · 2026-04-24T06:23:41 1777011821

There have been some very interesting experiments with streaming from SSD recently: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/

EnPissant · 2026-04-24T18:16:08 1777054568

I don't mean to be a jerk, but this is a 2-bit quant with the experts reduced from 10 to 4, who knows whether the test runs long enough for the SSD to thermal throttle, and it still only gets 5.5 tokens/s. That does not sound useful to me.

simonw · 2026-04-24T19:26:28 1777058788

It's a lot more useful than being entirely unable to try out the model.

EnPissant · 2026-04-24T20:19:21 1777061961

But you aren't trying out the model. You quantized beyond what people generally say is acceptable, and reduced the number of experts, which these models are not designed for.

Even worse, the GitHub repo advertises:

> Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.

That hides the fact that, with fewer experts in use, the active parameter count is _not_ 17B.
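To make the point about active parameters concrete, here is a rough sketch of how the active count could shrink when fewer experts are routed per token. It assumes the 17B figure corresponds to 10 routed experts and that the routed part scales linearly with the expert count. `shared_fraction` (the share of the 17B that is always active, such as attention and embeddings) is a hypothetical input, since the real split is not given here.

```python
def active_params_b(full_active_b, experts_full, experts_used, shared_fraction):
    """Estimate active parameters (in billions) with fewer routed experts.

    shared_fraction is a hypothetical value: the share of the full active
    count that is used for every token regardless of routing. The remainder
    is assumed to be split evenly across the routed experts.
    """
    if not 0.0 <= shared_fraction <= 1.0:
        raise ValueError("shared_fraction must be between 0 and 1")
    shared = full_active_b * shared_fraction
    routed = full_active_b - shared
    return shared + routed * experts_used / experts_full


# Sweep a few assumed shared fractions, since the true value is unknown here.
for frac in (0.2, 0.4, 0.6):
    estimate = active_params_b(17.0, experts_full=10, experts_used=4,
                               shared_fraction=frac)
    print(f"shared_fraction={frac:.1f}: about {estimate:.1f}B active")
```

Whatever the real split is, under these assumptions the result lands below 17B whenever any of the active parameters sit in routed experts.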

simonw · 2026-04-24T22:09:21 1777068561

It doesn't have to be a 2-bit quant - see the update at the bottom of my post:

> Update: Dan's latest version upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well.

That was also just the first version of this pattern that I encountered; it has since seen a bunch of additional activity from other developers in other projects.

I linked to some of those in this follow-up: https://simonwillison.net/2026/Mar/24/streaming-experts/

inventor7777 · 2026-04-24T14:04:31 1777039471

On Apple Silicon Macs, the RAM is shared between the CPU and GPU. So while it may not match raw GPU VRAM speeds, it still manages over 450GB/s in real-world use on the M4 Pro/Max series, wherever it is needed.

They are all still limited by the SSD, but Apple SSDs can do over 17GB/s on high-end models (the more typical ones are around 8GB/s).

EnPissant · 2026-04-24T18:18:21 1777054701

Yeah, I am mostly talking about the SSD being the bottleneck. No way Apple gets 17GB/s sustained. SSDs thermally throttle really fast, and there is some random access involved when it needs the next expert.
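The bandwidth argument can be put as a back-of-the-envelope ceiling: if decoding is limited by how fast expert weights can be read, tokens/s cannot exceed bandwidth divided by bytes read per token. The sketch below uses the 17GB/s and 8GB/s figures mentioned above. The amount of expert weight streamed per token (`streamed_params_b`) and the cache hit rate are hypothetical inputs, not measurements from any of the projects discussed.

```python
def gb_read_per_token(streamed_params_b, bits_per_weight, cache_hit_rate=0.0):
    """GB that must come from the SSD for one token.

    streamed_params_b: billions of expert parameters needed per token
                       (hypothetical; depends on model and expert count).
    cache_hit_rate:    assumed fraction already resident in RAM.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    # billions of params * bits per weight / 8 bits per byte = gigabytes
    return streamed_params_b * bits_per_weight / 8 * (1.0 - cache_hit_rate)


def max_tokens_per_second(bandwidth_gb_s, gb_per_token):
    """Upper bound on decode speed when SSD reads are the only limit."""
    if gb_per_token <= 0:
        raise ValueError("gb_per_token must be positive")
    return bandwidth_gb_s / gb_per_token


# Peak vs. more typical SSD bandwidth, 4-bit weights, assumed 5B streamed.
for bandwidth in (17.0, 8.0):
    per_token = gb_read_per_token(streamed_params_b=5.0, bits_per_weight=4)
    ceiling = max_tokens_per_second(bandwidth, per_token)
    print(f"{bandwidth:.0f} GB/s: at most {ceiling:.1f} tokens/s")
```

This is a ceiling only: thermal throttling and random access both lower the effective bandwidth, so sustained speeds would be below what the formula gives for the peak figure.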