So far I'm unimpressed with local inference. I got 11 tokens per second on omlx on an M5 Pro with 128 GB of RAM, so it took an hour to write a few hundred lines of code that didn't work. Opus and Sonnet in CC completed the same task successfully in a matter of minutes. The 3.6:35b model seemed okay on ollama yesterday.
I need to check out other harnesses for this besides Claude Code, but the local models are just painfully slow.
vLLM apparently already has an implementation of TurboQuant available, which is said to reduce the required memory footprint by 6x without quality loss and improve inference speed by 8x.
From what I understand, the steps are:
1. launch vLLM
2. execute a vLLM configure command like "use kv-turboquant for model xyz"
3. that's it
I've got two kids under 8 years old, a full time job, and a developer-tools project that takes like 105% of my mental interests... so there's been a bit of a challenge finding the time to swap from ollama to vLLM in order to find out if that is true.
So buyer beware :D And if anyone tries it, please let me know whether it's worth the time!
This is a dense model, so that's expected. On a Mac you'd want to try out the Mixture of Experts Qwen3.6 release, namely Qwen3.6-35B-A3B. On an M4 Pro I get ~70 tok/s with it. If your numbers are slower than this, it might be because you're accidentally using a "GGUF" formatted model rather than "MLX" (an Apple-specific format that is often more performant on Macs).
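For a rough sense of why dense vs. MoE matters so much here, this is a back-of-envelope sketch. It assumes decoding is memory-bandwidth bound, so each generated token has to read every active weight once. The bandwidth and bytes-per-parameter figures are placeholders I made up for illustration, not measurements, and the "dense" case is a hypothetical model with all 35B parameters active. The result is an upper bound, so real numbers will come in lower.

```python
def decode_upper_bound_tok_s(active_params_billion, bytes_per_param, bandwidth_gb_s):
    """Rough ceiling on decode speed when memory bandwidth is the bottleneck.

    Each generated token reads every active weight once, so the ceiling is
    bandwidth divided by the bytes of active weights.
    """
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Placeholder assumptions, swap in your own hardware and quantization.
BANDWIDTH_GB_S = 250.0   # hypothetical memory bandwidth in GB/s
BYTES_PER_PARAM = 0.5    # assumes roughly 4-bit quantized weights

# Hypothetical dense model: all 35B parameters are read for every token.
dense = decode_upper_bound_tok_s(35, BYTES_PER_PARAM, BANDWIDTH_GB_S)

# MoE like Qwen3.6-35B-A3B: only about 3B parameters are active per token.
moe = decode_upper_bound_tok_s(3, BYTES_PER_PARAM, BANDWIDTH_GB_S)

print(f"dense ceiling: {dense:.1f} tok/s")
print(f"MoE ceiling:   {moe:.1f} tok/s")
print(f"ratio:         {moe / dense:.1f}x")
```

The absolute numbers depend entirely on the placeholders, but the ratio doesn't: under this model the speedup is just total parameters divided by active parameters.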