Hacker News | silvertaza's comments

Still a huge hallucination rate, unfortunately: 86%. For comparison, Opus sits at 36%.

Source: https://artificialanalysis.ai/models?omniscience=omniscience...


Grok is 17%? And that's the lowest, while most models are like 80%+?

Meanwhile, hallucination is probably closer to 100% depending on the question. This benchmark makes no sense.


> Meanwhile, hallucination is probably closer to 100% depending on the question.

But the benchmark didn't ask those questions, and otherwise it seems Grok is very good at saying it doesn't know the answer.


No one serious uses grok.

@grok is this true?

This comment deserves more love


YMMV, but Grok 4.1 Fast can usually find, via static analysis, a few things that other models don't seem to catch with the same prompt.

Why not? Honest question.

Because in serious contexts the Grok models offer nothing that the other leading models don't, and those don't come with a heaping pile of baggage.

It makes sense. Grok is taught to answer the question, regardless of how explicit or extreme it is. These other models are taught to suppress any wrongthink. That's going to make it hard to answer things correctly. If you've been told to answer something incorrectly because it's wrong, then you'll have to make up an answer.

There's something off with this because Haiku should not be that good.

Hallucination benchmarks accept "I don't know", which Haiku did at least a little. Here are other benchmarks corroborating: https://suprmind.ai/hub/ai-hallucination-rates-and-benchmark...
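As an illustration of why accepting "I don't know" matters, here is a hypothetical sketch of how such a benchmark can score answers (this is not the actual methodology of the linked benchmark, just the general idea): abstentions are excluded from the hallucination rate, so a model that declines often can score well even with modest accuracy.

```python
# Hypothetical scoring sketch: the hallucination rate counts only
# attempted answers, so abstaining ("I don't know") is never penalized
# as a hallucination.

def hallucination_rate(results):
    """results: list of 'correct' | 'incorrect' | 'abstain'."""
    attempted = [r for r in results if r != "abstain"]
    if not attempted:
        return 0.0
    return attempted.count("incorrect") / len(attempted)

# A model that abstains on the hard questions:
print(hallucination_rate(["correct", "abstain", "abstain", "incorrect"]))  # 0.5
```

Under this kind of scoring, a cautious smaller model can beat a confident larger one, which would be consistent with Haiku's result.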

I've been very curious about that too. I wonder if it's actually much better at admitting when it doesn't know something, because it thinks it's a "dumber model". But I haven't played with this at all myself.

The hallucination benchmark is hallucinating

This indicates they want this behavior. They know the person asking the question probably doesn't understand the problem entirely (or why would they be asking?), so they'd prefer a confident response, regardless of outcomes, because the point is to sell the technology's competence (or rather the perception of it), not its actual capabilities, to a bunch of people who have no clue what they're talking about.

LLMs will ruin your product. Have fun trusting a billionaire's thinking machine that they swear is capable of replacing your employees if you just pay them 75% of your labor budget.


We don't want hallucinations either, I promise you.

A few biased defenses:

- I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.

- This eval only measures binary attempted vs did not attempt, but doesn't really reward any sort of continuous hedging like "I think it's X, but to be honest I'm not sure."

- On the flip side, GPT-5.5 has the highest accuracy score.

- With any rate over 1% (whether 30% or 70%), you should be verifying anything important anyway.

- On our internal eval made from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually been improving substantially from 5.2 to 5.4 to 5.5. So as always, progress depends on how you measure it.

- Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.

Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.


On a ChatGPT 5.3 Plus subscription, I find that long informal chats tend to reveal unsatisfactory answers and biases; at this point, after 10 rounds of replies, I end up having to correct it so much that it comes full circle and starts to agree with my initial arguments. I don't see how this behavior is acceptable or safe for real work. Like are programmers and engineers using LLMs completely differently than I'm doing? Because the underlying technology is fundamentally the same.

Totally agreed, this has been and will continue to be a problem for all existing models.

> Like are programmers and engineers using LLMs completely differently than I'm doing

No, but the complexity of the problem matters. Lots of engineers doing basic CRUD and prototyping overestimate the capabilities of LLMs.


Not around the sauna per se, but sauna is often built first because it serves as a place to live while you're building the house!


Yes, it was especially the rural environment, where there weren't many other options for a place to live while building.

The sauna that was built then wasn't just one hot room; at minimum it also had a small changing room for dressing and undressing, and for relaxing between turns in the steam room. If it was the first building made, it was also common to add a lounge that served as living space, with beds and a cooking stove, while the house was being built. With a sauna you had a place to stay warm through the first winter, and you could get warm water, wash your clothes and yourselves, and in the old days even give birth there. Building the sauna first made a lot of sense.

These days, for home builders, the sauna is more about placing it somewhere in the floor plan that works well for the intended users of the house.


>sauna is often built first because it serves as a place to live while you're building the house

wouldn't a kitchen accomplish that goal better?


Due to the lack of running water in those times (still the case in many cottages), cooking is done over a fire and water is brought from the lake. A kitchen won't serve you well if you're just trying to get through a long winter at -30 °C.


The sauna provides heating.


The average yearly temperature in Finland is reported as 6.5 °C.


Was the multi-channel approach deliberate? Would you have done anything differently the second time around?


Yes and no. Some channels were chosen deliberately, while others were more serendipitous. General networking, conversations, and hunting for startup-related information/tips happened to surface prospects without explicitly looking for them.

In hindsight, there are two things I’d change. One micro-level, one macro-level.

Micro: I would have started creating content on day one, even before anything was built. In any domain, you need time to establish credibility, and the easiest way to accelerate that is by consistently sharing insights, lessons, and thought processes. Building trust starts long before you ship a product. That’s going to be a major focus for us moving forward.

Macro: I would have pushed myself much harder on the networking front a looooong time ago. Earlier in my career, I didn’t step outside my comfort zone enough, and I avoided social platforms entirely throughout the 2010s. Once I decided to build a company, I had to completely overhaul that mindset. My advice to anyone early in their career: spend a few extra hours each week meeting people, attending events, and expanding your circle. It feels optional, but over time the compounding effects are enormous.


With every alternative, the prevailing issue is that your data is only as safe as the company it's with. But I think this can be remedied by keeping a second backup with an independent provider.


B2 having an S3-compatible API available makes this particularly easy :)


Backblaze is like if Amazon spun AWS S3 out as its own business (and it added some backup helper tooling as a result), so I wouldn't really worry about it any more than about S3 itself. You could write a second copy to S3 Glacier Deep Archive (using B2 for instant access when you want to restore or set up a new device) and still be much cheaper.


I feel like we're at the extremes in both cases: general consumer interfaces are all super information-sparse, while terminals are heavy on text.

Having more visual terminals and more text-dense UIs elsewhere would be the best of both worlds.


If the visible map here is gray, why does it look reddish in the sky to the naked eye?



That article feels overly dismissive. Notably, more recent Mars missions such as Tianwen and Hope do have normal Bayer cameras on board producing color images.


I love the spirit!

Practically speaking, when it has a backlog of vast amounts of data similar to yours, the true way to get out of the loop is to try something different: to do the things that wouldn't fit as a data point in that list format.


Hi there, I just wrote about it here: https://silvertaza.com/product-market-fit/

I am sure that at the very least you'll get some ideas on how to approach this. I have aggregated different sources on that, but I am not an expert myself at this point in time.


I even go as far as avoiding writing code in the first place. You can give a very simple imaginary problem to solve and see how they think. I believe good software engineers are essentially problem solvers, the code just happens to be the tool for it.

I have asked candidates to estimate the number of Skittles in a one-litre cube box.


If nobody at your company requires a candidate to write code, you won't filter out the candidates who can't code, and they'll be working with you, not with companies that require candidates to write code.

Even the easiest five-minute coding problem is better than nothing.


Yes, but the OP asked for an idiot-filter first.

We end up applying multiple filters throughout the process, but similarly try to keep each one as simple as possible, so as not to bias too strongly toward one thing.


I find that a horrible question.

I've not had Skittles in years; are they as big as Smarties? I think a little smaller? Sure, you can start your estimate by laying them down in a 10x10 grid, stacked 20 high, but that's very far off. Maybe 13x13x20? That would be the most loosely packed version; if we offset the layers so that each one sits in the depressions of the layer below, maybe we can stack 30 high. So yeah, that's my best mathematical approach, and I guess it's at least 50% off.

Did I pass?
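For what it's worth, that back-of-the-envelope estimate can be sketched out. The grid width (13 per row) and the layer counts (20 loose, 30 nested) are the guesses from the comment above, not measured values:

```python
# Fermi estimate of Skittles in a 1-litre (10 cm) cube, following the
# grid-stacking reasoning above. All dimensions are rough guesses.

per_row = 13        # Skittles lying flat along one 10 cm edge
layers_loose = 20   # layers stacked directly on top of each other
layers_nested = 30  # layers offset into the depressions of the layer below

loose = per_row * per_row * layers_loose    # loosest packing
nested = per_row * per_row * layers_nested  # denser, nested packing

print(loose, nested)  # 3380 5070
```

So the reasoning above brackets the answer between roughly 3,400 and 5,100, with the stated caveat that it could easily be off by 50% or more.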


It's probably apt to ask yourself what sort of revenue model you prefer to support for the value provided. YouTube is a rare case where they offer multiple options.


Meh. Most videos on the web are crap. Having advertising makes some kinds of crap worse. And even if all advertising revenue went away, enthusiasts will always post videos for fun. In fact most videos don’t earn any significant advertising revenue.

Personally, advertising is a high cost that I am not willing to pay. I even avoid restaurants that have televisions, or that play a radio station instead of buying music that has no advertising in it. (I also frequent a couple of places that are willing to leave the music off for me, at least early in the day before the lunch rush, and one that doesn’t play music at all. That’s a different story though.)


People who have native sponsors inside their uploads are usually attempting to derive a significant amount of their living income through their videos, and if they're on sponsor block that means they've attracted enough viewership that this is doubly so.

It's fine if you want to deprive them of the livelihood that affords them the ability to make professional, polished videos, but don't kid yourself about your motivations. And speaking as an amateur videographer, it takes an enormous amount of time and work to make even a short 20-to-30-minute video.


True. I don’t mind depriving them of their livelihood, because advertising is intrusive. I don’t have to allow advertising onto my screen. There are more good youtube videos with no ads than anyone could watch in a lifetime. If 100% of people started blocking ads, and companies stopped buying ads as a result, then a lot of video creators would do something else. But like I said, many people create videos for reasons other than just money, and they won’t leave.


The videos that don’t make money are the unpopular ones. Are they the ones you watch? Have you listened to the content creators you do consume, and what they have to say about doing it for the joy of it all?


> The videos that don’t make money are the unpopular ones.

I disagree with that. A selection of channels that I watch regularly enough:

The Spiffing Brit doesn’t do sponsorships, and his videos are quite popular (https://www.youtube.com/channel/UCRHXUZ0BxbkU2MYZgsuFgkQ/vid...).

This Old Tony (https://www.youtube.com/watch?v=W9rlWu9KWe8) is not quite as popular, if you count popularity by video views, but still very popular and no sponsorships.

What about Ben Eater? (https://www.youtube.com/watch?v=g_koa00MBLg) He is exactly the kind of enthusiast who would make amazing videos even if all advertising disappeared.

Ross Whatshisname of Accursed Farms (https://www.youtube.com/channel/UCJ6KZTTnkE-s2XFJJmoTAkw/vid...) does an extremely in-depth review of a video game about once a month, and doesn’t do sponsorships. He gets 200k–500k views on every review, which is very popular by any measure.

Posy (https://www.youtube.com/channel/UCmEmX_jw_pRp5UbAdzkZq-g/vid...) is another enthusiast who makes videos for the sheer delight of it and whose videos are quite popular. If you haven’t watched his videos about LCD displays then you haven’t lived.

I agree that truly unpopular videos with a dozen views will never make money, by advertising or otherwise. But the converse is not true; as I have demonstrated, there are plenty of popular videos that don’t bother to try to make money by sponsorships. You know every single one of those is turning down a VPN company every week of the year though.


I've often enjoyed videos that are unpopular. I don't really care.

Either I watch or I don't watch, but I won't watch ads.

