Hacker News | silvertaza's comments

Still a huge hallucination rate, unfortunately: 86%. For comparison, Opus sits at 36%.

Source: https://artificialanalysis.ai/models?omniscience=omniscience...


Grok is 17%? And that's the lowest, while most models are like 80%+?

Meanwhile, hallucination is probably closer to 100% depending on the question. This benchmark makes no sense.


> Meanwhile, hallucination is probably closer to 100% depending on the question.

But the benchmark didn't ask those questions, and otherwise it seems Grok is very good at saying it doesn't know the answer.


No one serious uses grok.

@grok is this true?

This comment deserves more love


YMMV, but Grok 4.1 Fast can usually find, via static analysis, a few things that other models don't seem to catch with the same prompt.

Why not? Honest question.

Because in serious contexts the Grok models offer nothing that the other leading models don't, and those don't come with a heaping pile of baggage.

It makes sense. Grok is taught to answer the question, regardless of how explicit or extreme it is. These other models are taught to suppress any wrongthink. That's going to make it hard to answer things correctly. If you've been told to answer something incorrectly because it's wrong, then you'll have to make up an answer.

There's something off with this because Haiku should not be that good.

Hallucination benchmarks accept "I don't know", which Haiku did at least a little. Here are other benchmarks corroborating: https://suprmind.ai/hub/ai-hallucination-rates-and-benchmark...
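As an illustration of why accepting "I don't know" matters, here is a hypothetical sketch of how such a benchmark can score answers (this is not the actual methodology of the linked benchmark, just the general idea): abstentions are excluded from the hallucination rate, so a model that declines often can score well even with modest accuracy.

```python
# Hypothetical scoring sketch: the hallucination rate counts only
# attempted answers, so abstaining ("I don't know") is never penalized
# as a hallucination.

def hallucination_rate(results):
    """results: list of 'correct' | 'incorrect' | 'abstain'."""
    attempted = [r for r in results if r != "abstain"]
    if not attempted:
        return 0.0
    return attempted.count("incorrect") / len(attempted)

# A model that abstains on the hard questions:
print(hallucination_rate(["correct", "abstain", "abstain", "incorrect"]))  # 0.5
```

Under this kind of scoring, a cautious smaller model can beat a confident larger one, which would be consistent with Haiku's result.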

I've been very curious about that too. I wonder if it's actually much better at admitting when it doesn't know something, because it thinks it's a "dumber model". But I haven't played with this at all myself.

The hallucination benchmark is hallucinating

This indicates they want this behavior. They know the person asking the question probably doesn't understand the problem entirely (or why would they be asking?), so they'd prefer a confident response, regardless of outcomes, because the point is to sell the technology's competence (or rather the perception of it), not its actual capabilities, to a bunch of people who have no clue what they're talking about.

LLMs will ruin your product. Have fun trusting a billionaire's thinking machine that they swear is capable of replacing your employees if you just pay them 75% of your labor budget.


We don't want hallucinations either, I promise you.

A few biased defenses:

- I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.

- This eval only measures binary attempted vs did not attempt, but doesn't really reward any sort of continuous hedging like "I think it's X, but to be honest I'm not sure."

- On the flip side, GPT-5.5 has the highest accuracy score.

- With any rate over 1% (whether 30% or 70%), you should be verifying anything important anyway.

- On our internal eval made from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually been improving substantially from 5.2 to 5.4 to 5.5. So as always, progress depends on how you measure it.

- Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.

Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.


On a ChatGPT 5.3 Plus subscription, I find that long informal chats tend to reveal unsatisfactory answers and biases; at this point, after 10 rounds of replies, I end up having to correct it so much that it comes full circle and starts to agree with my initial arguments. I don't see how this behavior is acceptable or safe for real work. Like are programmers and engineers using LLMs completely differently than I'm doing? Because the underlying technology is fundamentally the same.

Totally agreed, this has been and will continue to be a problem for all existing models.

> Like are programmers and engineers using LLMs completely differently than I'm doing

No, but the complexity of the problem matters. Lots of engineers doing basic CRUD and prototyping overestimate the capabilities of LLMs.


Not around the sauna per se, but sauna is often built first because it serves as a place to live while you're building the house!


Yes, it was especially the rural environment, where there weren't many other options for a place to live while building.

The sauna that was built then wasn't just one hot room; at minimum it also had a small changing room for dressing and undressing, and for relaxing between turns in the steam room. If it was the first building made, it was also common to add a lounge that served as living space, with beds and a cooking stove, while the house was being built. With a sauna you had a place to stay warm through the first winter, and you could get warm water, wash your clothes and yourselves, and in the old days even give birth there. Building the sauna first made a lot of sense.

These days, for home builders, the sauna is more about placing it somewhere in the floor plan that works well for the intended users of the house.


>sauna is often built first because it serves as a place to live while you're building the house

wouldn't a kitchen accomplish that goal better?


Due to the lack of running water in those times (still the case in many cottages), cooking is done over a fire and water is brought from the lake. A kitchen won't serve you well if you're just trying to get through a long winter at -30 °C.


The sauna provides heating.


The average yearly temperature in Finland is reported as 6.5 °C.


Was the multi-channel approach deliberate? Would you have done anything differently the second time around?


Yes and no. Some channels were chosen deliberately, while others were more serendipitous. General networking, conversations, and hunting for startup-related information/tips happened to surface prospects without explicitly looking for them.

In hindsight, there are two things I’d change. One micro-level, one macro-level.

Micro: I would have started creating content on day one, even before anything was built. In any domain, you need time to establish credibility, and the easiest way to accelerate that is by consistently sharing insights, lessons, and thought processes. Building trust starts long before you ship a product. That’s going to be a major focus for us moving forward.

Macro: I would have pushed myself much harder on the networking front a looooong time ago. Earlier in my career, I didn’t step outside my comfort zone enough, and I avoided social platforms entirely throughout the 2010s. Once I decided to build a company, I had to completely overhaul that mindset. My advice to anyone early in their career: spend a few extra hours each week meeting people, attending events, and expanding your circle. It feels optional, but over time the compounding effects are enormous.


With every alternative, the prevailing issue is that your data is only as safe as the company it's with. But I think this can be remedied by keeping a second backup with an independent provider.


B2 having an S3-compatible API available makes this particularly easy :)


Backblaze is like if Amazon spun AWS S3 out as its own business (and it added some backup helper tooling as a result), so I wouldn't really worry about it any more than about S3 itself. You could write a second copy to S3 Glacier Deep Archive (using B2 for instant access when you want to restore or set up a new device) and still be much cheaper.


I feel like we're at the extremes in both cases: general consumer interfaces are all super information-sparse, while terminals are heavy on text.

Having more visual terminals and more text-dense UIs elsewhere would be the best of both worlds.


If the visible map here is gray, why does it look reddish in the sky to the naked eye?



That article feels overly dismissive. Notably, more recent Mars missions such as Tianwen and Hope do have normal Bayer cameras on board producing color images.


I love the spirit!

Practically speaking, when it has a backlog of vast amounts of data similar to yours, the true way to get out of the loop is to try something different: to do the things that wouldn't fit as a data point in that list format.


Hi there, I just wrote about it here: https://silvertaza.com/product-market-fit/

I am sure that at the very least you'll get some ideas on how to approach this. I have aggregated different sources on that, but I am not an expert myself at this point in time.


I even go as far as avoiding writing code in the first place. You can give a very simple imaginary problem to solve and see how they think. I believe good software engineers are essentially problem solvers, the code just happens to be the tool for it.

I have asked candidates to estimate the number of Skittles in a one-litre cube box.


If nobody at your company requires a candidate to write code, you won't filter out the candidates who can't code, and they'll be working with you, not with companies that require candidates to write code.

Even the easiest five-minute coding problem is better than nothing.


Yes, but the OP asked for an idiot-filter first.

We end up applying multiple filters throughout the process, but similarly try to keep each one as simple as possible, so as not to bias too strongly toward one thing.


I find that a horrible question.

I've not had Skittles in years; are they as big as Smarties? I think a little smaller? Sure, you can start your estimate by laying them down in a 10x10 grid, stacked 20 high, but that's very far off. Maybe 13x13x20? That would be the most loosely packed version; if we offset the layers so that each one sits in the depressions of the layer below, maybe we can stack 30 high. So yeah, that's my best mathematical approach, and I guess it's at least 50% off.

Did I pass?
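For what it's worth, that back-of-the-envelope estimate can be sketched out. The grid width (13 per row) and the layer counts (20 loose, 30 nested) are the guesses from the comment above, not measured values:

```python
# Fermi estimate of Skittles in a 1-litre (10 cm) cube, following the
# grid-stacking reasoning above. All dimensions are rough guesses.

per_row = 13        # Skittles lying flat along one 10 cm edge
layers_loose = 20   # layers stacked directly on top of each other
layers_nested = 30  # layers offset into the depressions of the layer below

loose = per_row * per_row * layers_loose    # loosest packing
nested = per_row * per_row * layers_nested  # denser, nested packing

print(loose, nested)  # 3380 5070
```

So the reasoning above brackets the answer between roughly 3,400 and 5,100, with the stated caveat that it could easily be off by 50% or more.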


It's probably apt to ask yourself what sort of revenue model you prefer to support for the value provided. YouTube is a rare case where they offer multiple options.


Meh. Most videos on the web are crap. Having advertising makes some kinds of crap worse. And even if all advertising revenue went away, enthusiasts will always post videos for fun. In fact most videos don’t earn any significant advertising revenue.

Personally, advertising is a high cost that I am not willing to pay. I even avoid restaurants that have televisions, or that play a radio station instead of buying music that has no advertising in it. (I also frequent a couple of places that are willing to leave the music off for me, at least early in the day before the lunch rush, and one that doesn’t play music at all. That’s a different story though.)


People who have native sponsors inside their uploads are usually attempting to derive a significant amount of their living income through their videos, and if they're on sponsor block that means they've attracted enough viewership that this is doubly so.

It's fine if you want to deprive them of the livelihood that affords them the ability to make professional, polished videos, but don't kid yourself about your motivations. And speaking as an amateur videographer, it takes an enormous amount of time and work to make even a short 20-to-30-minute video.


True. I don’t mind depriving them of their livelihood, because advertising is intrusive. I don’t have to allow advertising onto my screen. There are more good youtube videos with no ads than anyone could watch in a lifetime. If 100% of people started blocking ads, and companies stopped buying ads as a result, then a lot of video creators would do something else. But like I said, many people create videos for reasons other than just money, and they won’t leave.


The videos that don’t make money are the unpopular ones. Are they the ones you watch? Have you listened to the content creators you do consume, and what they have to say about doing it for the joy of it all?


> The videos that don’t make money are the unpopular ones.

I disagree with that. A selection of channels that I watch regularly enough:

The Spiffing Brit doesn’t do sponsorships, and his videos are quite popular (https://www.youtube.com/channel/UCRHXUZ0BxbkU2MYZgsuFgkQ/vid...).

This Old Tony (https://www.youtube.com/watch?v=W9rlWu9KWe8) is not quite as popular, if you count popularity by video views, but still very popular and no sponsorships.

What about Ben Eater? (https://www.youtube.com/watch?v=g_koa00MBLg) He is exactly the kind of enthusiast who would make amazing videos even if all advertising disappeared.

Ross Whatshisname of Accursed Farms (https://www.youtube.com/channel/UCJ6KZTTnkE-s2XFJJmoTAkw/vid...) does an extremely in-depth review of a video game about once a month, and doesn’t do sponsorships. He gets 200k–500k views on every review, which is very popular by any measure.

Posy (https://www.youtube.com/channel/UCmEmX_jw_pRp5UbAdzkZq-g/vid...) is another enthusiast who makes videos for the sheer delight of it and whose videos are quite popular. If you haven’t watched his videos about LCD displays then you haven’t lived.

I agree that truly unpopular videos with a dozen views will never make money, by advertising or otherwise. But the converse is not true; as I have demonstrated, there are plenty of popular videos that don’t bother to try to make money by sponsorships. You know every single one of those is turning down a VPN company every week of the year though.


I've often enjoyed videos that are unpopular. I don't really care.

Either I watch or I don't watch, but I won't watch ads.

