The idea that AI can write code like a seasoned software developer, yet can't use its own tooling, which a human can learn from an 11-chapter tutorial, doesn't make any sense.
If they’re reaching the same results across a variety of the most popular public models, it doesn’t seem like that big a deal whether it was Opus 4 or Opus 4.5.
Reproducibility is (supposed to be) a cornerstone of science. Model versions are absolutely critical to understand what was actually tested and how to reproduce it.
Yea this feels like saying “if you give them good enough specs they’ll produce the code you want” which reduces to…writing the code yourself. Just with more steps.
I think they're just saying that data extraction tasks are easy to evaluate because for a given input text/file you can specify the exact structured output you expect from it.
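That kind of evaluation can be sketched in a few lines; the field names and values here are made up for illustration:

```python
import json

# Hypothetical extraction task: pull a few fields out of an invoice.
# "expected" is the ground-truth structured output for a given input file;
# "model_output" stands in for whatever the model returned.
expected = {"invoice_number": "INV-1042", "total": "99.50", "currency": "USD"}
model_output = '{"invoice_number": "INV-1042", "total": "99.50", "currency": "USD"}'

parsed = json.loads(model_output)
# Exact structured comparison: every field matches or the answer is wrong,
# which is what makes extraction tasks easy to score automatically.
print(parsed == expected)  # True when the extraction matches exactly
```

Because the comparison is on parsed structures rather than raw text, whitespace and key ordering in the model's JSON don't matter, only the actual extracted values.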
In 10 and prior you could even move it to other monitors just by dragging and dropping it. It's baffling that they thought that functionality was a bug people wanted 'fixed'.