It makes sense. Grok is taught to answer the question, regardless of how explicit or extreme it is. These other models are taught to suppress any wrongthink. That's going to make it hard to answer things correctly. If you've been told not to give the true answer because it counts as wrongthink, then you'll have to make up an answer.
I've been very curious about that too. I wonder if it's actually much better at admitting when it doesn't know something, because it thinks it's a "dumber model". But I haven't played with this at all myself.
This indicates they want this behavior. They know the person asking the question probably doesn't understand the problem entirely (or why would they be asking?), so they'd prefer a confident response, regardless of outcomes. The point is to sell the technology's competency (and the perception thereof), not its capabilities, to a bunch of people who have no clue what they're talking about.
LLMs will ruin your product. Have fun trusting a billionaire's thinking machine that they swear is capable of replacing your employees if you just pay them 75% of your labor budget.
We don't want hallucinations either, I promise you.
A few biased defenses:
- I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.
- This eval only measures binary attempted vs did not attempt, but doesn't really reward any sort of continuous hedging like "I think it's X, but to be honest I'm not sure."
- On the flip side, GPT-5.5 has the highest accuracy score.
- With any rate over 1% (whether 30% or 70%), you should be verifying anything important anyway.
- On our internal eval made from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually been improving substantially from 5.2 to 5.4 to 5.5. So as always, progress depends on how you measure it.
- Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.
Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.
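To make the "binary attempted vs did not attempt" point concrete, here is a toy scorer. It is only a sketch: the `Response` fields, the `grade` function, and the hallucination-rate formula (wrong answers as a share of all non-correct responses) are assumptions for illustration, not the eval's actual definition. It shows that a hedged wrong answer scores the same as a confident one, and that a model can have higher accuracy and a higher hallucination rate at the same time.

```python
from dataclasses import dataclass


@dataclass
class Response:
    attempted: bool        # did the model give an answer at all?
    is_right: bool         # was the answer correct (ignored if not attempted)
    hedged: bool = False   # "I think it's X, but I'm not sure" (grader ignores this)


def grade(responses):
    """Toy binary grader. The formula below is an assumed definition."""
    total = len(responses)
    correct = sum(r.attempted and r.is_right for r in responses)
    incorrect = sum(r.attempted and not r.is_right for r in responses)
    abstained = total - correct - incorrect
    not_correct = incorrect + abstained
    return {
        "accuracy": correct / total,
        # Assumption: wrong answers as a share of everything not answered correctly.
        "hallucination_rate": incorrect / not_correct if not_correct else 0.0,
    }


# Three hypothetical models answering the same 10 questions.
confident = [Response(True, True)] * 6 + [Response(True, False)] * 4
hedged = [Response(True, True)] * 6 + [Response(True, False, hedged=True)] * 4
cautious = ([Response(True, True)] * 5
            + [Response(True, False)]
            + [Response(False, False)] * 4)

for name, model in [("confident", confident), ("hedged", hedged), ("cautious", cautious)]:
    print(name, grade(model))
```

In this made-up data, `confident` and `hedged` get identical scores, while `cautious` has lower accuracy and a much lower hallucination rate, purely because it abstains more.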
On a ChatGPT 5.3 Plus subscription, I find that long informal chats tend to reveal unsatisfactory answers and biases. After about 10 rounds of replies I end up having to correct it so much that it comes full circle and starts agreeing with my initial arguments. I don't see how this behavior is acceptable or safe for real work. Are programmers and engineers using LLMs completely differently than I am? The underlying technology is fundamentally the same.
Yes, that was especially the case in rural environments, where there weren't many other options for a place to live while building.
The sauna built back then wasn't just one hot room; at minimum it also had a small changing room for dressing and undressing and for relaxing between turns in the steam room. If it was the first building put up, it was also common to add a lounge that served as living space, with beds and a cooking stove, while the house was being built. With a sauna you had a place to stay warm through the first winter, get warm water, wash clothes and yourselves, and in the old days even give birth. Building the sauna first made a lot of sense.
These days, for home builders, it is more about fitting the sauna into the floor plan where it works well for the intended users of the house.
Due to the lack of running water in those times (and still in many cottages), cooking is done over a fire and water is brought from the lake. A kitchen won't serve you well if you're just trying to get through a long winter of -30 °C.
Yes and no. Some channels were chosen deliberately, while others were more serendipitous. General networking, conversations, and hunting for startup-related information/tips happened to surface prospects without explicitly looking for them.
In hindsight, there are two things I’d change. One micro-level, one macro-level.
Micro: I would have started creating content on day one, even before anything was built. In any domain, you need time to establish credibility, and the easiest way to accelerate that is by consistently sharing insights, lessons, and thought processes. Building trust starts long before you ship a product. That’s going to be a major focus for us moving forward.
Macro: I would have pushed myself much harder on the networking front a looooong time ago. Earlier in my career, I didn’t step outside my comfort zone enough, and I avoided social platforms entirely throughout the 2010s. Once I decided to build a company, I had to completely overhaul that mindset. My advice to anyone early in their career -> spend a few extra hours each week meeting people, attending events, and expanding your circle. It feels optional, but over time the compounding effects are enormous.
With every alternative, the prevailing issue is the fact that your data is as safe as the company your data is with. But I think this can be remedied by doubly external backups.
Backblaze is like what you'd get if Amazon spun AWS S3 out as its own business (and added some backup helper tooling as a result), though, so I wouldn't really worry about it any more than that. You could write a second copy to S3 Glacier Deep Archive (using B2 for instant access when you want to restore or set up a new device) and still come out much cheaper.
That article feels overly dismissive. Notably, more recent Mars missions such as Tianwen and Hope do have normal Bayer cameras onboard producing color images.
Practically speaking, when you have a vast backlog of data like that, the real way out of the loop is to try something different: to do the things that wouldn't fit as a data point in that list format.
I am sure that at the very least you'll get some ideas on how to approach this. I have aggregated different sources on that, but I am not an expert myself at this point in time.
I even go as far as avoiding writing code in the first place. You can give a very simple imaginary problem to solve and see how they think. I believe good software engineers are essentially problem solvers; the code just happens to be the tool for it.
I have asked candidates to estimate the number of Skittles in a one-litre cube box.
If nobody at your company requires a candidate to write code, you won't filter out the candidates who can't code, and they'll be working with you, not with companies that require candidates to write code.
Even the easiest five-minute coding problem is better than nothing.
We end up applying multiple filters throughout the process, but similarly try to keep each one as simple as possible so as not to bias too strongly toward any one thing.
I've not had Skittles in years. Are they as big as Smarties? I think a little smaller? Sure, you can start your estimation by laying them down in a 10x10 grid, stacked 20 high, but that's very far off. Maybe 13x13x20? That would be the most loosely packed version. If we shift them so that layer+1 sits in the depressions of layer+0, maybe we can stack 30 high. So yeah, that's my best mathematical approach, and I guess it's at least 50% off.
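For anyone who wants the arithmetic behind those grid guesses spelled out, here is a minimal sketch. It only assumes that a one-litre cube is 10 cm on a side; the helper names `grid_count` and `implied_size_mm` are made up for illustration, and the sizes it prints are whatever the guesses imply, not measured Skittles dimensions.

```python
BOX_SIDE_MM = 100.0  # a one-litre cube is 10 cm (100 mm) on each side


def grid_count(per_row: int, layers: int) -> int:
    """Sweets in a square grid of per_row x per_row, stacked `layers` high."""
    return per_row * per_row * layers


def implied_size_mm(per_row: int, layers: int) -> tuple[float, float]:
    """Width per sweet and height per layer needed to exactly fill the box."""
    return BOX_SIDE_MM / per_row, BOX_SIDE_MM / layers


# The three guesses from the comment above: (sweets per row, layers).
estimates = {
    "10x10x20 (first guess)": (10, 20),
    "13x13x20 (loose grid)": (13, 20),
    "13x13x30 (layers nested)": (13, 30),
}

for name, (per_row, layers) in estimates.items():
    width, layer_height = implied_size_mm(per_row, layers)
    print(f"{name}: {grid_count(per_row, layers)} sweets, "
          f"~{width:.1f} mm wide, ~{layer_height:.1f} mm per layer")
```

The three guesses come out to 2000, 3380, and 5070 sweets, so the spread between the first and last guess is already more than a factor of two.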
It's probably apt to ask yourself what sort of revenue model you prefer to support for the value provided. YouTube is a rare case where they offer multiple options.
Meh. Most videos on the web are crap. Having advertising makes some kinds of crap worse. And even if all advertising revenue went away, enthusiasts will always post videos for fun. In fact most videos don’t earn any significant advertising revenue.
Personally, advertising is a high cost that I am not willing to pay. I even avoid restaurants that have televisions, or that play a radio station instead of buying music that has no advertising in it. (I also frequent a couple of places that are willing to leave the music off for me, at least early in the day before the lunch rush, and one that doesn’t play music at all. That’s a different story though.)
People who have native sponsors inside their uploads are usually attempting to derive a significant amount of their living income through their videos, and if they're on SponsorBlock that means they've attracted enough viewership that this is doubly so.
It's fine if you want to deprive them of the livelihood that affords them the ability to make professional and polished videos, but don't kid yourself about your motivations. And speaking as an amateur videographer, it takes an enormous amount of time and work to make even a short 20 to 30 minute video.
True. I don’t mind depriving them of their livelihood, because advertising is intrusive. I don’t have to allow advertising onto my screen. There are more good YouTube videos with no ads than anyone could watch in a lifetime. If 100% of people started blocking ads, and companies stopped buying ads as a result, then a lot of video creators would do something else. But like I said, many people create videos for reasons other than just money, and they won’t leave.
The videos that don’t make money are the unpopular ones. Are they the ones you watch? Have you listened to the content creators you do consume and what they have to say about doing it for the joy of it all?
Ross Whatshisname of Accursed Farms (https://www.youtube.com/channel/UCJ6KZTTnkE-s2XFJJmoTAkw/vid...) does an extremely in-depth review of a video game about once a month, and doesn’t do sponsorships. He gets 200k–500k views on every review, which is very popular by any measure.
I agree that truly unpopular videos with a dozen views will never make money, by advertising or otherwise. But the converse is not true; as I have demonstrated, there are plenty of popular videos that don’t bother to try to make money by sponsorships. You know every single one of those is turning down a VPN company every week of the year though.
Source: https://artificialanalysis.ai/models?omniscience=omniscience...