So far I'm unimpressed for local inference. got 11 tokens per second on omlx on ...

fshen · 2026-04-23T02:08:57 1776910137

I use the same computer as you do. The M5 can run faster:

pip install mlx_lm mlx_vlm

python -m mlx_vlm.convert --hf-path Qwen/Qwen3.6-27B --mlx-path ~/.mlx/models/Qwen3.6-27B-mxfp4 --quantize --q-mode mxfp4 --trust-remote-code

mlx_lm.generate --model ~/.mlx/models/Qwen3.6-27B-mxfp4 -p 'how cpu works' --max-tokens 300

Prompt: 13 tokens, 51.448 tokens-per-sec
Generation: 300 tokens, 35.469 tokens-per-sec
Peak memory: 14.531 GB
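To put those figures in wall-clock terms, here is a minimal sketch that parses the summary printed by mlx_lm.generate and converts each rate into seconds. The helper name `parse_mlx_stats` is made up for illustration, and the 11 tokens-per-sec baseline is the number quoted at the top of the thread.

```python
import re


def parse_mlx_stats(text):
    """Pull token counts and rates out of an mlx_lm.generate summary.

    Hypothetical helper for illustration; it only understands the
    "<Phase>: N tokens, R tokens-per-sec" pattern shown in the comment above.
    """
    stats = {}
    for name in ("Prompt", "Generation"):
        m = re.search(rf"{name}: (\d+) tokens, ([\d.]+) tokens-per-sec", text)
        if m is None:
            raise ValueError(f"no {name} stats found")
        tokens, rate = int(m.group(1)), float(m.group(2))
        # seconds spent in this phase = tokens processed / tokens per second
        stats[name.lower()] = {"tokens": tokens, "rate": rate, "seconds": tokens / rate}
    return stats


summary = (
    "Prompt: 13 tokens, 51.448 tokens-per-sec "
    "Generation: 300 tokens, 35.469 tokens-per-sec "
    "Peak memory: 14.531 GB"
)
stats = parse_mlx_stats(summary)

baseline_rate = 11.0  # tokens-per-sec reported in the parent comment
speedup = stats["generation"]["rate"] / baseline_rate

for phase, s in stats.items():
    print(f"{phase}: {s['tokens']} tokens in {s['seconds']:.2f} s")
print(f"generation speedup over baseline: {speedup:.1f}x")
```

By this arithmetic the 300 generated tokens take roughly 8.5 seconds, a bit over 3x the rate the parent comment reported.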

hresvelgr · 2026-04-23T01:21:47 1776907307

You have better specs than I do and I'm running the same model almost twice as fast through GGUF on llama.cpp. I'd try some different harnesses.

AlexC04 · 2026-04-23T14:09:49 1776953389

One other thing you might want to check out for running locally. (I have not independently verified yet, it's on the TODO list though)

https://docs.vllm.ai/en/latest/api/vllm/model_executor/layer...

vLLM apparently already has an implementation of TurboQuant available, which is said to losslessly reduce the required memory footprint by 6x and improve inference speed by 8x.

From what I understand, the steps are:

1. launch vLLM
2. execute a vLLM configure command like "use kv-turboquant for model xyz"
3. that's it

I've got two kids under 8 years old, a full-time job, and a developer-tools project that takes like 105% of my mental interests... so there's been a bit of a challenge finding the time to swap from Ollama to vLLM in order to find out if that is true.

SO buyer beware :D - and also - if anyone tries it, please let me know if it is worth the time to try it!

someguydave · 2026-04-22T23:42:52 1776901372

I got about 7 tokens/sec generation on an M2 Max MacBook running an 8-bit quant of an MLX version.

mswphd · 2026-04-23T00:02:40 1776902560

This is a dense model, so that's expected. On a Mac you'd want to try out the Mixture of Experts Qwen3.6 release, namely Qwen3.6-35B-A3B. On an M4 Pro I get ~70 tok/s with it. If your numbers are slower than this, it might be because you're accidentally using a "GGUF" formatted model instead of "MLX" (an Apple-specific format that is often more performant on Macs).
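A rough way to see why dense vs. MoE matters so much: decoding is usually limited by memory bandwidth, since the active weights have to be read once per generated token. The sketch below is a back-of-envelope ceiling, not a benchmark. Assumptions that are not from the thread: memory bandwidth of 400 GB/s for the M2 Max and 273 GB/s for the M4 Pro (Apple's published figures), "A3B" read as roughly 3B active parameters per token, and a 4-bit quant for the M4 Pro run (the comment doesn't say). The function name is made up for illustration.

```python
def decode_ceiling_tok_s(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if every active weight is read once per token.

    active_params_b: active parameters in billions
    bits_per_weight: quantization width (8 for 8-bit, 4 for 4-bit)
    bandwidth_gb_s:  memory bandwidth in GB/s (assumed hardware spec)
    Ignores KV cache reads, quantization scales, and compute overhead.
    """
    gb_per_token = active_params_b * bits_per_weight / 8  # billions of params * bytes each
    return bandwidth_gb_s / gb_per_token


# Dense 27B at 8-bit on an M2 Max (assumed 400 GB/s): all 27B weights per token.
dense = decode_ceiling_tok_s(27, 8, 400)

# MoE with ~3B active at an assumed 4-bit quant on an M4 Pro (assumed 273 GB/s).
moe = decode_ceiling_tok_s(3, 4, 273)

print(f"dense 27B, 8-bit, M2 Max ceiling: {dense:.0f} tok/s")
print(f"MoE ~3B active, 4-bit, M4 Pro ceiling: {moe:.0f} tok/s")
```

Under those assumptions the ceilings come out around 15 tok/s for the dense model and around 180 tok/s for the MoE one, so the reported 7 and ~70 tok/s both land at roughly 40-50% of the theoretical limit. Real numbers will always sit below the ceiling.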

noman-land · 2026-04-22T22:07:55 1776895675

OpenCode seems to be a lot better than Claude at using local models.

jedisct1 · 2026-04-23T06:42:58 1776926578

For local models, you should check out https://swival.dev instead of Claude Code.