
We have an OCR job running with a lot of domain-specific knowledge. After testing different models we have clear evidence that some prompts are more effective with particular models, along with some general observations (e.g., some prompts performed badly across all models).

Sample size was 1000 jobs per prompt/model. We run them once per month to detect regression as well.
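The monthly regression check described above could be sketched roughly like this. Everything here is hypothetical: `run_job` stands in for the real OCR call (the commenter doesn't describe their scoring), and the names, threshold, and stub scoring are all assumptions for illustration.

```python
# Hypothetical sketch of a monthly prompt/model regression check.
# run_job() is a stub for the real OCR call scored against ground truth.
from statistics import mean

def run_job(model: str, prompt: str, doc: str) -> float:
    """Stub: a real harness would call the model API and score the
    output against ground truth (e.g., 1 - character error rate).
    Returns a deterministic placeholder in [0.0, 1.0) so this runs."""
    return (hash((model, prompt, doc)) % 100) / 100

def evaluate(model: str, prompt: str, docs: list[str]) -> float:
    """Average score of one prompt/model pair over a batch of docs."""
    return mean(run_job(model, prompt, d) for d in docs)

def detect_regressions(models, prompts, docs, baseline, tolerance=0.02):
    """Compare this run's scores to last month's baseline; flag any
    prompt/model pair whose score dropped by more than `tolerance`."""
    regressions = []
    for m in models:
        for p in prompts:
            score = evaluate(m, p, docs)
            prev = baseline.get((m, p))
            if prev is not None and score < prev - tolerance:
                regressions.append((m, p, prev, score))
    return regressions
```

With 1000 docs per prompt/model pair, a run like this gives per-pair averages you can diff month over month; the `tolerance` guard keeps ordinary sampling noise from being flagged as a regression.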




While I believe that performance varies with the prompt, I have a seriously hard time believing that a prompt which was effective with the previous model would perform worse with the next generation of the same model from the same lab.

You shouldn't have a hard time believing it. There are thousands of different domains out there. You find it hard to believe that any of them would perform worse in your scenario?

Labs are still really optimizing for maybe 10 of those domains. At most 25 if we're being incredibly generous.

And for many domains, "worse" can hardly be benchmarked. Think about creative writing. Think about a Burmese cooking recipe generator.


Bruh, how do you evaluate a batch of 1000 jobs against model X for creative writing or cooking recipes? It's vibes all the way down. This reeks of some kind of blog-spam SEO nonsense.

The entire point is that you _don't_ for creative writing; vibes are the whole point, and those vibes often get worse across model updates for the same prompts.


