Since Gemma 4 came this easter the gap from self hosting models to Claude has de...

djfergus · 2026-04-22T22:45:24 1776897924

>> I feel closer to where I should be; writing code should still be free. Both free as in free beer, and free as in freedom.

I’m just pleased by the competition, agree with the ideal of free and local but sustainable competition is key: driving $200 p/m down to a much much lower number.

datadrivenangel · 2026-04-22T21:47:39 1776894459

Gemma4 feels the most "claude-like" of all the models I've run locally on my M5 mbp.

chr15m · 2026-04-22T22:49:55 1776898195

I found on coding tasks that Qwen 3.5 can actually do the thing whereas Gemma 4 went off the rails frequently. Will try this new 3.6 release today.

da-x · 2026-04-23T12:23:56 1776947036

I use Qwen 3.5 122B on an RTX PRO 6000 with open code, and very pleased. I don't feel a need for using a closed model any more. The result after answering questions in Plan mode is almost always what I want, with very few occasional bugs. It does a lot of effort to see how the code I am working on is written now while extending it in the same style.

If they release a Qwen 3.6 that also makes good use of the card, may move to it.

verdverm · 2026-04-23T00:05:42 1776902742

There was a qwen-3.6 MoE six days ago that I thought was better than Gemma 4. Today's is a dense model. (gemma release both a 26B MoE and a 31B dense at the same time)

I have intention to evaluate all four on some evals I have, as long as I don't get squirrelled again.

djyde · 2026-04-22T23:01:08 1776898868

What level of programming tasks can a 27B model handle? Even with Claude, I'm occasionally not satisfied, and I can't imagine how effective a 27B model would be.

sleepyeldrazi · 2026-04-23T09:27:06 1776936426

I ran 3 prompts (short versions, full version in the repo):

- Implement a numerically stable backward pass for layer normalization from scratch in NumPy.

- Design and implement a high-performance fused softmax + top-k kernel in CUDA (or CUDA-like pseudocode).

- Implement an efficient KV-cache system for autoregressive transformer inference from scratch.

and tested Qwen3.6-27B (IQ4_NL on a 3090) against MiniMax-M2.7 and GLM-5 with kimi k2.6 as the judge (imperfect, i know, it was 2AM). Qwen surpassed minimax and won 2/3 of the implementations again GLM-5 according to kimi k2.6, which still sounds insane to me. The env was a pi-mono with basic tools + a websearch tool pointing to my searxng (i dont think any of the models used it), with a slightly customized shorter system prompt. TurboQuant was at 4bit during all qwen tests. Full results https://github.com/sleepyeldrazi/llm_programming_tests.

I am also periodically testing small models in a https://www.whichai.dev style task to see their designs, and qwen3.6 27B also obliterated (imo) the other ones I tested https://github.com/sleepyeldrazi/llm-design-showcase .

Needless to say those tests are non-exhaustive and have flaws, but the trend from the official benchmarks looks like is being confirmed in my testing. If only it were a little faster on my 3090, we'll see how it performs once a DFlash for it drops.

__s · 2026-04-23T00:43:50 1776905030

Basic triage is good. I've found I need to mostly handle programming, but local models have been good for pointing me at where to look with just "investigate https://github.com/HarbourMasters/Shipwright/issues/6232" as prompt

justinclift · 2026-04-23T01:07:21 1776906441

> Qwen 3.6:27b uses 29/32gb of vram

What context size are you using for that?

Btw, are you using flash attention in Ollama for this model? I think it's required for this model to operate ok.

skirmish · 2026-04-23T03:29:28 1776914968

I squeezed it into 24 GiB VRAM (since I have RX7900XTX):

-- Q5_K_M Unsloth quantization on Linux llama.cpp

-- context 81k, flash attention on, 8-bit K/V caches

-- pp 625 t/s, tg 30 t/s

rsolva · 2026-04-24T22:05:26 1777068326

I have the same GPU and get very good results, even better than Gemma 4 26B A4B, using the following setup (Fedora 43 Silverblue, podman compose):

  services:
    llama:
      image: ghcr.io/ggml-org/llama.cpp:server-vulkan
      container_name: llama-qwen3.6-27b-dense
      ports:
        - 4201:8080
      volumes:
        - ./Qwen3.6-27B-Q4_K_M.gguf:/models/model.gguf:ro,z
        - ./mmproj-BF16.gguf:/models/mmproj.gguf:ro,z
      devices:
        - /dev/dri
      group_add:
        - video
      command: >
        -m /models/model.gguf
        --mmproj /models/mmproj.gguf
        --alias "Qwen3.6 27b Dense"
        -ngl 99
        -c 98304
        -b 2048
        --host 0.0.0.0
        --port 8080
        --parallel 2
        --kv-unified
        --ubatch-size 2048
        --flash-attn on
        -cb
        --jinja
        --no-webui
        -ctk q8_0
        -ctv q8_0
        --image-min-tokens 1024
        --temp 0.6
        --top-k 20
        --top-p 0.95
        --repeat-penalty 1
        --presence-penalty 1.5
        --reasoning auto
      restart: unless-stopped

tgtweak · 2026-04-23T03:31:24 1776915084

Depends entirely on quantization. Q6_K with max context length (262144) is ~40GB of VRAM.

Q8 with the same context wouldn't fit in 48GB of VRAM, it did with 128k of context.

pawelduda · 2026-04-22T22:35:18 1776897318

How many tokens/s do you get on RTX 5090?

gfosco · 2026-04-23T01:25:11 1776907511

I set this up today on my 5090 at Q6_K quantization and Q4_0 KV, got 50 tokens/s consistently at 123k context, using ~28/32gb vram through LM Studio.

pawelduda · 2026-04-23T11:15:33 1776942933

Wow, that sounds usable. I know it's anecdotal but how did you find the quality of the output, and can you compare it to any closed source model?

girvo · 2026-04-23T03:05:20 1776913520

Not that you asked but I’m getting ~20 tokens/s on my DGX Spark (Asus actually) using an Int4 AutoRound quant, MTP 1 and some other tricks

overgard · 2026-04-22T22:58:39 1776898719

Can't answer for an RTX 5090, but for an RTX 5080 16GB of RAM (desktop), I get about 6 tokens/sec after some tweaking (f16->q4_0). Kind of on the borderline of usable.. probably realistically need either a 5090 with more RAM or something like a Mac with a unified memory architecture.

datadrivenangel · 2026-04-22T23:38:19 1776901099

My M5 Pro is getting ~11 tokens per second via OMLX for an 8 bit quant.

angoragoats · 2026-04-23T00:40:05 1776904805

A Mac is not going to be all that much faster than a 5080 with any models, other than the ones you can’t currently run at all because you don’t have enough GPU+CPU memory combined.

You’re much better off adding a second GPU if you’ve already got a PC you’re using.