Opus 4.5 and 4.6 were the first models that i could talk to and get a sense that they really "understood" WHY i'm saying the things i am
Opus 4.7 kinda took that away; it's a definite regression. it doesn't extrapolate.
———————————————
refactor this thing? sure, will do! wait, what do you mean "obviously do not refactor the unrelated thing that's colocated in the same file"? i'm sorry, you're absolutely right, conceptually these two things have nothing to do with each other. i see it now. i shouldn't have thought they're the same just because they're in the same file.
———————————————
whereas GPT 5.5, much like Opus 4.6, gets it.
i wanted to build a MIDI listener for a macOS app i'm making, and translate every message into a new enum. that enum was to be opinionated and not to reflect MIDI message data. moreover, i explicitly said not to do bit shifting or pointer arithmetic as part of the transport.
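to make the requirement concrete, here's a minimal sketch of what i mean (all names are hypothetical, not from my actual app): the transport/parsing is assumed to be handled by a crate, so my code never touches pointers or bit shifts, and the app-level enum is deliberately lossy and opinionated rather than mirroring raw MIDI fields.

```rust
/// What a parsing crate would typically hand back after decoding the
/// transport for us -- no pointer arithmetic or bit shifting on our side.
/// (Hypothetical type for illustration.)
#[derive(Debug)]
enum ParsedMidi {
    NoteOn { key: u8, velocity: u8 },
    NoteOff { key: u8 },
    ControlChange { controller: u8, value: u8 },
}

/// The app-level enum: opinionated and lossy on purpose. It models what
/// the app cares about, not the wire format.
#[derive(Debug, PartialEq)]
enum ControlEvent {
    KeyPressed { key: u8, soft: bool }, // "soft" is an app concept, not a MIDI field
    KeyReleased { key: u8 },
    DialTurned { delta: i8 },           // a relative CC interpreted as a wheel tick
    Ignored,                            // everything the app doesn't care about
}

fn translate(msg: ParsedMidi) -> ControlEvent {
    match msg {
        // NoteOn with velocity 0 is conventionally a release
        ParsedMidi::NoteOn { key, velocity: 0 } => ControlEvent::KeyReleased { key },
        ParsedMidi::NoteOn { key, velocity } => {
            ControlEvent::KeyPressed { key, soft: velocity < 64 }
        }
        ParsedMidi::NoteOff { key } => ControlEvent::KeyReleased { key },
        // relative CC on controller 1 (assumed mapping): 1..=63 is up,
        // 65..=127 is down, expressed as a signed offset from 128
        ParsedMidi::ControlChange { controller: 1, value } if (1..=63).contains(&value) => {
            ControlEvent::DialTurned { delta: value as i8 }
        }
        ParsedMidi::ControlChange { controller: 1, value } if value >= 65 => {
            ControlEvent::DialTurned { delta: (value as i16 - 128) as i8 }
        }
        _ => ControlEvent::Ignored,
    }
}
```

the point of the translation layer is that everything downstream matches on `ControlEvent` and never sees a MIDI byte.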
what did Opus 4.7 do? it still used pointer arithmetic for the parsing! should i have to be this explicit? it also seemingly didn't care that i wanted the enum to be opinionated and not reflect the raw MIDI values. Opus 4.6 got it right (although with ugly, questionable implementation).
GPT 5.5 immediately understood both that i didn't want pointer arithmetic because of the risk of UB and that shuffling bits around is cumbersome and out of place. it started searching for alternatives, looking up crates to handle MIDI transport and parsing independently.
then it built out a very lean implementation that was immediately understandable. even when i told Opus 4.7 to use packages, and even how to use them, it still added a ton of math weirdness, matching against raw MIDI packet bytes, indirection after indirection, etc. even worse, it still did so after i gave it the public API i wanted it to implement.
GPT 5.5 nailed it first try. incredibly impressed with this model and feel much safer delegating some harder tasks to it
Are you in the UK? I've not had this happen to me (I'm not in the UK) so I'm wondering if the Online Safety Act has affected this, as it has with other products.
about once a month since my childhood i get an awful, slowly ramping headache that makes me unable to think or function.
it seems to happen more when i'm overweight, making me think it's blood pressure (BP) related, but then doing the Valsalva maneuver, which spikes BP, doesn't cause any problems at all.
i've tried acetaminophen, even 1.2g of it, to no avail.
i've also tried every other remedy, such as curcumin, fire/ice locally, hot and cold showers, neck massages, working out muscles that may be involved in it, everything. nothing helps.
except for ibuprofen. 400-600mg kills it every time.
at least for me, there seems to be a definite difference, as ibuprofen can anecdotally help in some situations that acetaminophen can't. i wonder what exactly it can / can't treat and why.
i've noticed this too. i suspect they either did weight surgery on the model and distilled it from Mythos Preview, or they are currently not saving it correctly / having another adaptive thinking bug.
i asked it to look at writing research (especially from NN/g) and come up with some alternatives for a heading that is roughly supposed to convey:
"this app lets you create custom shortcuts for all mac apps, even sophisticated ones, mouse wheel ones, ..."
it came up with the following headlines:
1. ONE APP, MANY MAC APPS
2. ONE INSTALL, MANY APPS ONE APP.
3. ONE PACK PER MAC APP.
4. ALL YOUR APP SHORTCUTS IN ONE PLACE
5. [app name] IS ONE APP. PACKS COVER THE REST.
whereas GPT-5.4 came up with
1. The Shortcuts Your Apps Are Missing
2. What Your Apps Still Don't Let You
3. Do The Parts You Still Do by Hand
4. When Built-In Shortcuts Run Out
5. Where Your Apps Stop Short
now, neither set is amazing, but please tell me how in the world "ONE APP MANY MAC APPS" makes sense as a headline for fucking anything lol
that's not something even GPT-3.5 would come up with.
They swapped the tokenizer, which means either a new pretrain or token/weights surgery. The latter seems more likely, both because:
- economics: i'd wager that Opus 4.7 is just distilled Mythos Preview
- performance: surgery like this would explain the spiky performance and weird issues
one thing to keep in mind is that you have to use GPT-5.4 differently from codex; they "work" in different ways. i was aghast at how terrible Codex seemed compared to Claude Code, only to conclude a couple days later that it was me who wasn't using it right.
Opus 4.6 and 4.7 are better than GPT-5.4 xhigh, but only marginally. I can't give proper pointers on what to change because it's incredibly hard to quantify.
In essence, though, GPT-5.4 needs explicit instructions not to take liberties - this is included in the default system prompt of Claude Code, which leads me to think Opus is just as overzealous as GPT-5.4 unless explicitly told not to be.
And it takes EVERYTHING you say at face value. Questions like "don't you see why this is bad?" will be answered with "yeah, i do." which is also kind of cool...
because with Opus in Claude Code i constantly have to reassure the model i'm not insinuating anything, lest it take my question and run with it into a frenzy of "oh shit my bad let me fix it im so sorry" type changes.
but how true is this? it's almost impossible to measure, and those who do try to measure it[1] find no significant difference
i personally haven't noticed any downgrade at all.
it's entirely possible there's a mass delusion going on where everyone gets wowed by 4.6 initially, then accepts the new baseline and gets used to it, then thinks that baseline is no longer impressive and thus degraded
it doesn't help that anthropic suddenly changed the defaults of its claude code harness for all users
the best (and only) evidence i've seen for actual degradation is that the web version of opus 4.6 failed the car wash test; since you cannot simply disable adaptive thinking and other parameters in the web version, you truly may have gotten a worse product