I've told this story before on HN, but my biz partner at ArenaNet, Mike O'Brien ...

PunchyHamster · 2026-03-06T09:15:02 1772788502

> Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.

Case in point: I was getting memory errors on my gaming machine that persisted even after replacing the sticks. It caused a Windows bluescreen maybe once a month, so I kinda lived with it as I couldn't afford to replace the whole setup (I theorized something on the motherboard was wrong).

Then my power supply finally died (it was cheap-ish, not cheap-est, but it was already a few years old). I replaced it and, lo and behold, the memory errors were gone.

versteegen · 2026-03-06T09:56:06 1772790966

I'm surprised "faulty PSU" is not on GP's list of common problems. Almost every unstable computer I've ever experienced has been due to either a dying PSU (not an under-specced one) or dying power conversion capacitors on the motherboard.

chedabob · 2026-03-06T10:56:30 1772794590

Ye some of the weirdest issues I've fixed have been PSU related.

I had a PC come to me that would boot fine, but if you opened the CD drive it'd shut off instantly.

urxvtcd · 2026-03-06T15:42:04 1772811724

There's a Polish electronics forum that's infamous because it's kind of actively hostile to them noobs. "Blacklisted power supply, closing thread." is a micro meme at this point.

drob518 · 2026-03-06T14:51:33 1772808693

I concur. A lot of “flakey” issues can be traced to poor quality power supplies. That’s a component that doesn’t get any attention in spec sheets other than a max power rating and I think a lot of manufacturers skimp there. As long as the system boots up and runs for a few minutes, they ship it.

MrDrMcCoy · 2026-03-06T18:09:43 1772820583

Heck, even dirty power from the wall can contribute. I've seen improvements in stability from putting things behind power conditioners.

drob518 · 2026-03-06T18:34:43 1772822083

Definitely that too, particularly in 2nd-world countries. I remember having a difficult time with dirty power for some hardware products I was responsible for at one time, where the customers were in the Middle East and Africa in the 1990s. We ended up having to have the PS manufacturer do a redesign to help compensate for dirty power. It can be done, but it costs a bit more.

likelystory · 2026-03-06T12:32:26 1772800346

I could see that:

- Firefox may be more prevalent on those using Linux, since FF is less “corporate” than Chrome or Edge.

- People using Linux are probably putting Linux on old machines that had versions of Windows that are no longer supported.

However, what I can’t say next is “PSUs would get old and stop putting out as much” because that doesn’t tend to happen. They just die.

Those running Linux on some old tower may hook up too many devices to an underpowered PSU which could cause problems, but I doubt this is the norm.

If it’s not PSUs, what is it? It’s not electromagnetic radiation doing the bitflipping because that’s too rare.

Maybe bitflips could be caused by low-quality peripherals.

People also don’t vacuum out laptops like they used to vacuum out towers and desktops, so maybe it’s dust.

Or maybe it’s all a ruse and FF is buggy, but they don’t have time to figure it out.

sandworm101 · 2026-03-06T15:54:36 1772812476

>> People using Linux are probably putting Linux on old machines

Maybe for linux noobs. But i would suggest that most linux users are not noobs booting a disused pentium from a live CD. They are running linux on the same hardware as windows users. I would further suggest that as anyone installing a not-windows OS is more tech savvy than the average, linux users actually take better care of their machines. Linux users take pride in their machines whereas the average windows user barely knows that computers have fans.

Ask any linux user for their specifications and they will quote system reports and memory figures like Marisa Tomei discussing engine timings. Ask a random windows user and they will probably start with the name of the store that sold it.

PaulDavisThe1st · 2026-03-06T17:12:19 1772817139

Unix user for 35 years, Linux for 30+ years ... my case fan died during the summer of last year ... just took the side panel off and kept things running.

So much for taking pride in my machine :)

sandworm101 · 2026-03-07T02:00:22 1772848822

An exception to prove the rule. You fixed it yourself and are here proud of your machine.

I did basically the same thing recently when I built an AI rig. I tried to put it in a server rack case but the fan noise was too much. So I ditched the rack and put it in an open mining frame.

agumonkey · 2026-03-07T15:13:38 1772896418

It's the powerhouse of the dell :p

BorisMelnik · 2026-03-06T14:10:24 1772806224

yeah dell consumer pc psus were so awful

mock-possum · 2026-03-06T17:06:18 1772816778

Which is kinda crazy to me, in light of how durable their business laptops have been in my experience. I’ve owned maybe 6 pc laptops in my career, and the only 2 that’ve survived that nearly 20 year space are both dells.

stubish · 2026-03-07T03:18:19 1772853499

Does Dell design and/or build their own laptops? Depending on the year it is likely just their brand and specs, designed and built by an ODM.

dvngnt_ · 2026-03-06T00:31:18 1772757078

GW1 was my childhood. The MMO with no monthly fees appealed to my Mom and I met friends for years. The 8 skill build system was genius, as were the cut scenes featuring your player character. If there's ever a 3rd game I would love to see something allowing for more expression through build creation, though I could see how that's hard to balance.

alexchantavy · 2026-03-06T07:53:40 1772783620

The PvP was so deep too. You would go 4v4 or 8v8 and coordinate a “3, 2, 1 spike” on a target so that all your damage would arrive at the same time regardless of spell windup times and be too much for the other team’s healer to respond to.

Could also fake spike to force the other team’s healer to waste their good heal on the wrong player while you downed the real target. Good times.

ndesaulniers · 2026-03-06T03:09:14 1772766554

I still remember summoning flesh golems as a necromancer! Too much of my life sunk into GW1. Beat all 4(?) expansions. Logged in years later after I finally put it down to find someone had guessed my weak password, stole everything, then deleted all my characters. C'est la vie.

jiggunjer · 2026-03-06T00:44:34 1772757874

Didn't they launch a remake of gw1 recently? Maybe I can get my kids hooked on that instead of this Roblox crap.

pndy · 2026-03-06T01:06:41 1772759201

Yes, they did relaunch it as Guild Wars Reforged with Steam Deck and controller support and other changes

https://wiki.guildwars.com/wiki/Guild_Wars_Reforged

hobofan · 2026-03-06T09:57:20 1772791040

Yes they did, but the social bump that was there shortly after release has significantly calmed down already.

It did rekindle my love for the game, but most outposts are empty, even in the international districts, so I think it's hard to get hooked on it for new joiners.

post-it · 2026-03-06T02:31:32 1772764292

For what it's worth, Roblox is how I discovered code at age 10.

Cthulhu_ · 2026-03-06T09:35:43 1772789743

It was ZZT for me, no idea how old I was, probably 8-10 or so.

But when you take a bird's eye view, it's interesting and great to see how over the years, games where you can build your own games remain popular and a common entryway into software development.

But also how Epic went from ZZT via Unreal to Fortnite, with the latter now being another platform (or what Zucc wanted to call a metaverse) for creativity.

Other notable mentions off the top of my head where people can build or invent their own games (in-game, via an external editor or through community support) or go crazy in besides Roblox are Second Life (...I think), LittleBigPlanet, Warcraft/Starcraft (which led to the genre of MOBAs), Geometry Dash, Mario Maker, TES, Source engine games, Minecraft, etc etc.

youarentrightjr · 2026-03-06T03:04:00 1772766240

How do you mean? Is there programming inside the game (ala Minecraft or Factorio)?

cortesoft · 2026-03-06T04:02:59 1772769779

Roblox is basically a developer platform for making games

LoganDark · 2026-03-06T03:07:16 1772766436

Roblox has a development environment for creating games (Roblox Studio) and the engine uses a fork of Lua as a scripting language.

I also was introduced to programming through Roblox.

dpe82 · 2026-03-06T01:42:28 1772761348

As a mobile dev at YouTube I'd periodically scroll through crash reports associated with code I owned and the long tail/non-clustered stuff usually just made absolutely no sense and I always assumed at least some of it was random bit flips, dodgy hardware, etc.

Cthulhu_ · 2026-03-06T09:29:14 1772789354

I heard the same thing from a colleague who worked on a Dutch banking app, they were quite diligent in fixing logic bugs but said that once you fix all of those, the rest is space rays.

As an aside, Apple and Google's phone-home crash reports are a really good system and one factor that makes mobile app development fun / interesting.

grishka · 2026-03-06T08:44:35 1772786675

For the Mastodon Android app, I also sometimes see crashes that make no sense. For example, how about native crashes, on a thread that is created and run by the system, that only contains system libraries in its stack trace, and that never ran any of my code because the app doesn't contain any native libraries to begin with?

Unfortunately I've never looked at crashes this way when I worked at VKontakte because there were just too many crashes overall. That app had tens of millions of users so it crashed a lot in absolute numbers no matter what I did.

izacus · 2026-03-07T15:51:32 1772898692

Framework, runtime, drivers and chips have bugs too. It's very easy to have some underlying component that corrupts your memory.

gf000 · 2026-03-06T09:20:38 1772788838

Well, vendors' randomly modified android systems are chock full of bugs, so it could have easily been some fancy os-specific feature failing not just in your case, but probably plenty other apps.

dpe82 · 2026-03-06T17:39:56 1772818796

Usually I'd just look at clusters of crashes (those that had similar stack traces) but sometimes when you're running a very small % experiment there's not enough signal so you end up looking at everything. And oh boy was there a lot of noise.

In an app with >billion users you get all kinds of wild stuff.

saagarjha · 2026-03-06T13:17:09 1772803029

Bugs in the system libraries?

Helmut10001 · 2026-03-06T05:11:46 1772773906

I don't understand why ECC memory is not the norm these days. It is only slightly more expensive, but solves all these problems. Some consumer mainboards even support it already.

Agingcoder · 2026-03-06T07:19:12 1772781552

No it doesn’t :-)

I’ve had plenty of servers with faulty ECC DIMMs that didn’t trigger, and would only show faults under actual memory testing. I had a hard time convincing some of our admins the first time (‘no ECC faults, you can’t be right’) but I won the bet.

Edit: very old paper by google on these topics. My issues were 6-7 years ago probably.

https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

thebruce87m · 2026-03-06T08:29:13 1772785753

That shouldn’t make sense. It’s not like the ECC info is stored in additional bits separate from the data, it’s built in with the data so you can’t “ignore” it. Hmm, off to read the paper.

smalley · 2026-03-06T21:09:56 1772831396

The ECC information is stored in separate DRAM devices on the DIMM. This is responsible for some of the increased cost of DIMMs with ECC at a given size. When marketed, the extra memory for ECC is typically not included in the size for DIMMs, so a 32GB DIMM with and without ECC will have differing numbers of total DRAM devices.

There's a pretty good set of diagrams and descriptions of the faults in this paper https://dl.acm.org/doi/10.1145/3725843.3756089.

Also to the parent: there's an updated public paper on DDR4 era fault observations https://ieeexplore.ieee.org/document/10071066

thebruce87m · 2026-03-06T23:12:37 1772838757

I think you responded to the wrong person, unless you think I was implying that the extra bits needed for ECC didn’t need extra space at all? I wasn’t suggesting that - just that they aren’t like a checksum that is stored elsewhere or something that can be ignored - the whole 72 bits are needed to decode the 64 bits of data and the 64 bits of data cannot be read independently.

smalley · 2026-03-07T01:34:39 1772847279

If we're talking about standard server RDIMMs with ECC (or the prosumer stuff) the CPU visible ECC (excluding DDR5's on-die ECC) is typically implemented as a sideband value you could ignore if you disabled the correction logic.

I suppose what winds up where is up to the memory controller but (for DDR5) in each BL16 transaction beat you're usually getting 32 bits of data value and 8 bits of ECC (per sub channel). Those ECC bits are usually called check bits CB[7:0] and they accompany the data bits DQ[31:0] .

If you're talking about transactions for LPDDR, things are a bit different there, though, as the ECC has to be transmitted in-band with your data.
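To make the "sideband check bits" idea concrete, here is a minimal sketch of a textbook (72,64) SECDED code: 64 data bits stored unchanged plus 8 check bits (7 Hamming bits and one overall parity bit). This is an illustration only. Real memory controllers use their own, often symbol-based, codes, and the function names below are made up for the example.

```python
# Textbook extended Hamming code over positions 1..71.
# Power-of-two positions hold check bits; the other 64 hold data bits.
CHECK_POS = (1, 2, 4, 8, 16, 32, 64)
DATA_POS = [p for p in range(1, 72) if p not in CHECK_POS]  # 64 positions
DATA_INDEX = {pos: i for i, pos in enumerate(DATA_POS)}


def _data_syndrome(data: int) -> int:
    """XOR of the code positions of all set data bits."""
    s = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            s ^= pos
    return s


def _ones(x: int) -> int:
    return bin(x).count("1")


def encode(data: int) -> tuple[int, int]:
    """Return (data, check). The data word is stored as-is (systematic code)."""
    hamming = _data_syndrome(data)                 # 7 Hamming check bits
    overall = (_ones(data) + _ones(hamming)) & 1   # overall even-parity bit
    return data, hamming | (overall << 7)


def decode(data: int, check: int) -> tuple[int, str]:
    """Return (data, status): 'ok', 'corrected' or 'uncorrectable'."""
    hamming, overall = check & 0x7F, check >> 7
    syndrome = hamming ^ _data_syndrome(data)
    odd = (_ones(data) + _ones(hamming) + overall) & 1
    if syndrome == 0 and not odd:
        return data, "ok"
    if odd:  # odd number of flips: treat it as a single-bit error
        if syndrome in DATA_INDEX:
            return data ^ (1 << DATA_INDEX[syndrome]), "corrected"
        if syndrome == 0 or syndrome in CHECK_POS:
            return data, "corrected"  # a check bit flipped; data is intact
    return data, "uncorrectable"      # even number of flips, or bad syndrome


word = 0xDEADBEEFCAFEF00D
stored_data, stored_check = encode(word)
print(decode(stored_data ^ (1 << 17), stored_check))  # single flip in data
print(decode(stored_data ^ 0b11, stored_check))       # double flip in data
```

Note that `encode` returns the data word untouched, which is why correction logic can in principle be disabled and the check bits ignored. With correction enabled, an application only ever sees the output of `decode`.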

thebruce87m · 2026-03-07T08:47:29 1772873249

We are talking about errors happening in user space applications with ECC operating normally and what the application ultimately sees.

My point is that when writing an app you wouldn’t be able to “not use” ECC accidentally or easily if it’s there. It’s just seamless. I’m not talking about special test modes or accessing stuff differently on purpose.

Interesting that DDR5 is different than DDR4. 8 bits for 32 is doubling of 8 for 64 so it must have been warranted.

Agingcoder · 2026-03-06T12:36:54 1772800614

I fully agree with you! Neither soft nor hard memory errors, nothing… but bit flips, and reproducible at that.

We scanned all our machines following this ( a few thousand servers ) and found out that ram issues were actually quite common, as said in the paper.

RealityVoid · 2026-03-06T14:11:56 1772806316

I'm sorry, but I, just like your admins, don't believe this. It's theoretically possible to have "undetectable" errors, but it's very unlikely, and you'd see a much higher incidence of detected unrecoverable errors and a much higher incidence still of repaired errors. I just don't buy the argument of "invisible errors".

EDIT: took a look at the paper you linked and it basically says the same thing I did. The probability of these cases becomes increasingly small, and while ECC would indeed not reduce it to _zero_, it would greatly reduce it.

Agingcoder · 2026-03-06T15:41:56 1772811716

Well my admins eventually believed me , so I’m fairly comfortable with what I said.

We also had a few thousand physical servers with about a terabyte of RAM each.

You are right : we did see repaired errors, but we also saw (indirectly, and after testing ) unrepaired ones

RealityVoid · 2026-03-06T17:32:39 1772818359

Ok, I am sure there is _some_ amount of unrepairable errors.

But the initial discussion was that ECC ram makes it go away and your point that it doesn't. And the vast vast majority of the errors, according to my understanding and to the paper you pointed to, are repairable. About 1 out of 400 ish errors are non-repairable. That's a huge improvement! If you had ECC ram, the failures Firefox sees here would drop from 10% to 0.025%! That is highly significant!

Even more! 2 bit errors now you would be informed of! You would _know_ what is wrong.

You could have 3(!) bit errors and this you might not see, but they'd be several orders of magnitude even rarer.

So yes, it would not 100% go away, but 99.9 % go away. That's... Making it go away in my book.

And last but not least, this paper mentions uncorrectable errors. It says nothing of undetectable ecc errors! You said _undetectable_ errors. I'm sure they happen, but would be surprised if you have any meaningful incidence of this, even at terabytes of data. It's probably on the order of 0.000625 of errors you can get (but if you want I can do more solid math)
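The back-of-the-envelope numbers above can be checked in a few lines. The inputs are the figures quoted in the comment (10% of crashes, about 1 uncorrectable error in 400). Reading the 0.000625 figure as (1/400)² expressed as a percentage is an assumption about how it was derived, not something the comment states.

```python
# Figures taken from the comment above; variable names are illustrative.
crash_share_from_bitflips = 0.10    # share of crashes attributed to bit flips
uncorrectable_fraction = 1 / 400    # "about 1 out of 400 ish" errors

# Share of crashes that would remain if only uncorrectable errors got through.
remaining_share = crash_share_from_bitflips * uncorrectable_fraction
print(f"remaining crash share: {remaining_share:.3%}")

# Fraction of memory errors that ECC would repair under the same assumption.
repaired_fraction = 1 - uncorrectable_fraction
print(f"errors repaired: {repaired_fraction:.2%}")

# ASSUMPTION: one reading that reproduces the 0.000625 figure is the
# square of 1/400, expressed as a percentage.
squared_percent = uncorrectable_fraction ** 2 * 100
print(f"(1/400)^2 as a percentage: {squared_percent:.6f}%")
```

With these inputs the repaired fraction comes out at 99.75% rather than 99.9%, which does not change the argument.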

Agingcoder · 2026-03-06T18:04:44 1772820284

We’re in agreement.

I think we diverge on ‘making it go away in my book’.

When you’re the one having to debug all these bizarre things ( there were real money numbers involved so these things mattered ), over millions of jobs every day , rare events with low probability don’t disappear - they just happen and take time to diagnose and fix.

So in my book ecc improves the situation, but I still had to deal with bad dimms, and ecc wasn’t enough. We used not to see these issues because we already had too many software bugs, but as we got increasingly reliable, hardware issues slowly became a problem, just like compiler bugs or other elements of the chain usually considered reliable.

I fully agree that there are lots of other cases where this doesn’t matter and ecc is good enough.

Thanks for taking the time to reply !

RealityVoid · 2026-03-06T19:05:45 1772823945

Oh, I get this point. If you have a sufficiently large amount of data and you monitor the errors and your software gets better and better, even low probability cases will happen and will stand out.

But this is sort of the march of nines.

My knee jerk reaction to blaming ECC is "naaah". Mostly because it's such a convenient scapegoat. It happens, I'm sure, but it would not be the first explanation I reach for. I once heard someone blame "cosmic rays" for a bug that happened multiple times. You can imagine how irked I was at the dang cosmic rays hitting the same data with such consistency!

Anyways, I'm sorry if my tone sounded abrasive, I, too, have appreciated the discussion.

Agingcoder · 2026-03-06T22:35:01 1772836501

:-) never forget Occam’s razor !

No you were not abrasive at all - I’ve learned to assume good faith in forum conversations.

In retrospect I should have started by giving the context ( march of 9s is a good description) actually, which would have made everything a lot clearer for everyone.

close04 · 2026-03-06T11:50:44 1772797844

If we’re being pragmatic, it solves enough problems that you could still call it an undisputed win for stability.

kasabali · 2026-03-06T08:13:44 1772784824

were they 3-bit flips?

thfuran · 2026-03-06T13:54:31 1772805271

It seems extremely unlikely that you’d end up with a lot of those but no smaller detectable errors.

hurfdurf · 2026-03-06T09:15:50 1772788550

Why? Intel making and keeping it workstation/Xeon-exclusive for a premium for too long. And AMD is still playing along not forcing the issue with their weird "yeah, Zen supports it, but your mainboard may or may not, no idea, don't care, do your own research" stance. These days it's a chicken and egg problem re: price and availability and demand. See also https://news.ycombinator.com/item?id=29838403

m000 · 2026-03-06T09:57:53 1772791073

Maybe it's high time for some regulation?

E.g. EU enforced mandatory USB-C charging from 2025, and pushes for ending production of combustion engine cars by 2035. Why not just make ECC RAM mandatory in new computers starting e.g. from 2030?

AMD is already one step away from being compliant. So, it's not an outlandish requirement. And regulating will also force Intel to cut their BS, or risk losing the market.

funcDropShadow · 2026-03-06T12:38:11 1772800691

OMG no. Politicians have no business making technological decisions. They make it harder to innovate, i.e. to invent the next generation of ECC with a different name.

m000 · 2026-03-06T13:44:30 1772804670

I would argue that in the present conditions, regulation can actually foster and guide real innovation.

With no regulations in place, companies would rather innovate in profit extraction than in improving technology. And if they have enough market capture, they may actually prefer to not innovate, if that would hurt profits.

cestith · 2026-03-06T15:24:38 1772810678

ECC is like Ethernet. The name doesn’t have to change for the technology to update.

saagarjha · 2026-03-06T13:18:07 1772803087

Politicians don’t have to be dumb.

m000 · 2026-03-07T19:02:14 1772910134

Reading this again, did you forget your trailing /s?

free652 · 2026-03-06T12:21:36 1772799696

Cost. You are about to make computers 10-20% more expensive.

Computers also aren't used much these days, and phones and tablets don't have ECC

m000 · 2026-03-06T13:32:03 1772803923

ECC has only 10-15% more transistor count. So you're only making one component of the computer 15% more expensive. This should have been a no-brainer, at least before the recent DRAM price hikes.

Also, while computers may not be used enough for cosmic rays to be a risk factor, they're still susceptible to rowhammer-style attacks, which ECC memory makes much harder.

Finally, if you account for the current performance loss due to rowhammer counter-measures, the extra cost of ECC memory is partially offset.

Helmut10001 · 2026-03-06T09:29:45 1772789385

Thanks for the details. I agree and had the same experience, trying to figure out if an AMD motherboard supports ECC or not. It is almost impossible to know ahead of trying it. At least we have ZFS now for parity checks on cold storage.

Dylan16807 · 2026-03-06T07:45:49 1772783149

Well for DDR5 that's 25% more chips which isn't great even if you don't get ripped off by market segmentation.

It's possible DDR6 will help. If it gets the ability to do ECC over an entire memory access like LPDDR, that could be implemented with as little as 3% extra chip space.

hikarudo · 2026-03-06T12:37:28 1772800648

Why 25%, shouldn't it be 12.5%? 8 ECC bits for every 64 bits.

ciupicri · 2026-03-06T13:46:07 1772804767

DDR5 ECC RDIMMs (R=registered) have 16 extra bits. From the specifications for Kingston's KSM64R52BS8-16MD [1]:

> x80 ECC (x40, 2 independent I/O sub channels)

On the other hand ECC UDIMMs (U=unbuffered) have only 8. From the specifications for Kingston's KSM56E46BS8KM-16HA [2]:

> x72 ECC (x36, 2 independent I/O sub channels)

Though if I remember correctly, the specifications for the older DDR4 ECC RDIMMs mention only 72 bits.

[1]: https://www.kingston.com/datasheets/KSM64R52BS8-16HA.pdf

[2]: https://www.kingston.com/datasheets/KSM56E46BS8KM-16HA.pdf
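The 12.5% vs 25% question falls straight out of the bus widths quoted above. A quick sketch of the arithmetic (the helper name is made up for illustration):

```python
def ecc_overhead(total_bits: int, data_bits: int = 64) -> float:
    """Extra bits carried for ECC, as a fraction of the data bits."""
    return (total_bits - data_bits) / data_bits


# Widths quoted from the datasheets above: x80 for the DDR5 ECC RDIMM,
# x72 for the DDR5 ECC UDIMM (and for the older DDR4 ECC RDIMMs).
for name, width in [("DDR5 ECC RDIMM (x80)", 80), ("DDR5 ECC UDIMM (x72)", 72)]:
    print(f"{name}: {ecc_overhead(width):.1%} extra bits")
```

So both figures in this sub-thread are right; they just refer to different DIMM types.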

PunchyHamster · 2026-03-06T09:16:20 1772788580

In case of Intel it's mostly coz they want to sell it as enterprise/workstation feature and make people pay extra.

AMD has been better on it but BIOS/mobo vendors not so much

epx · 2026-03-06T12:35:45 1772800545

And checksummed filesystems.

sznio · 2026-03-06T10:26:58 1772792818

What I'm wondering, even without ECC, afaik standard ram still has a parity bit, so a single flip should be detected. With ECC it would be fixed, without ECC it would crash the system. For it to get through and cause an app to malfunction you need two bit flips at least.

ciupicri · 2026-03-06T12:00:58 1772798458

I think standard RAM used to have one a long, long time ago, but not anymore. DDR5 finally re-added it, sort of.

roryirvine · 2026-03-06T15:25:04 1772810704

Yes, 30 pin SIMMs (the most common memory format from the mid-80s to the mid-90s) came in either '8 chip' or '9 chip' variants - the 9th chip being for the parity bit.

Most motherboards supported both, and the choice of which to use came down to the cost differential at the time of building a particular machine. The wild swings in DRAM prices meant that this could go from being negligible to significant within the course of a year or two!

When 72 pin SIMMs were introduced, they could in theory also come in a parity version but in reality that was fairly rare (full ECC was much better, and only a little more expensive). I don't think I ever saw an EDO 72 pin SIMM with parity, and it simply wasn't an option for DIMMs and later.
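For anyone who never met the "9th chip", a simple parity scheme works roughly like this. It is a minimal sketch assuming even parity over one byte; the function names are illustrative. It also shows the limit mentioned upthread: one flipped bit is detected (but cannot be located or fixed), while two flipped bits cancel out.

```python
def parity_bit(byte: int) -> int:
    """Even parity: 1 if the byte has an odd number of set bits, else 0."""
    return bin(byte & 0xFF).count("1") & 1


def parity_ok(byte: int, stored_parity: int) -> bool:
    """True if the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity


value = 0b1011_0010
stored = parity_bit(value)           # what the 9th chip would hold

one_flip = value ^ 0b0000_0100       # single bit flipped
two_flips = value ^ 0b0001_0100      # two bits flipped

print(parity_ok(value, stored))      # clean read
print(parity_ok(one_flip, stored))   # single flip is caught
print(parity_ok(two_flips, stored))  # double flip slips through
```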

meindnoch · 2026-03-06T11:20:36 1772796036

Wrong. Regular RAM has no parity bit.

colechristensen · 2026-03-06T05:18:59 1772774339

Bit flips do not only happen inside RAM

Also, in a game, there is a tremendously large chance that any particular bit flip will have exactly 0 effect on anything. Sure you can detect them, but one pixel being wrong for 1/60th of a second isn't exactly ... concerning.

The chance for a bit flip to affect a critical path that is noticeable by the player is very low, and quite a bit lower if you design your game to react gracefully. There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.

PunchyHamster · 2026-03-06T09:18:11 1772788691

> The chance for a bit flip to affect a critical path that is noticeable by the player is very low, and quite a bit lower if you design your game to react gracefully.

Nobody does

> There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.

And again, nobody except stuff that goes to space and few critical machines does. The closest normal user will get to code written like that are probably car ECUs, there are even automotive targeted MCUs that not only run ecc but also 2 cores in parallel and crash if they disagree

colechristensen · 2026-03-06T18:35:25 1772822125

Sure they do, you just have to think about it a different way.

It boils down to exception handling, you don't expect all of your bugs or security vulnerabilities to be known and write your code to be able to react to unplanned states without crashing. Bugs or security vulnerabilities can look a lot like a cosmic ray... a buffer overflow putting garbage in unexpected memory locations vs a cosmic ray putting garbage in unexpected memory locations... a lot of the mitigations are quite the same.

colinb · 2026-03-06T07:05:19 1772780719

> code for radiation hardened environments

I’m aware of code that detects bit flips via unreasonable value detection (“this counter cannot be this high so quickly”). What else is there?
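A minimal sketch of that kind of plausibility check, assuming a monotonically increasing counter with a known maximum rate (the function name and the rate limit are hypothetical):

```python
def counter_is_plausible(previous: int, current: int,
                         elapsed_s: float, max_rate_per_s: float) -> bool:
    """Reject readings that moved backwards or faster than physically possible."""
    delta = current - previous
    return 0 <= delta <= max_rate_per_s * elapsed_s


last = 1_000
good = 1_007                    # normal progress within one second
flipped = 1_007 ^ (1 << 20)     # same reading with bit 20 flipped

print(counter_is_plausible(last, good, elapsed_s=1.0, max_rate_per_s=10))
print(counter_is_plausible(last, flipped, elapsed_s=1.0, max_rate_per_s=10))
```

A flip in a low-order bit would pass this check, so it only catches the flips large enough to matter.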

gmueckl · 2026-03-06T07:30:43 1772782243

For safety critical systems, one strategy is to store at least two copies of important data and compare them regularly. If they don't match, you either try to recover somehow or go into a safe state, depending on the context.

d1sxeyes · 2026-03-06T07:44:25 1772783065

At least three copies, so you can recover based on consensus.
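Both variants fit in a few lines. This is a sketch only: the names are illustrative, and "safe state" is represented by raising an exception.

```python
def read_duplicated(copy_a: int, copy_b: int) -> int:
    """Two copies: a mismatch is detectable but there is no way to tell who is right."""
    if copy_a != copy_b:
        raise RuntimeError("copies disagree; entering safe state")
    return copy_a


def read_triplicated(copy_a: int, copy_b: int, copy_c: int) -> int:
    """Three copies: bitwise majority vote, each bit takes the value held by >= 2 copies."""
    return (copy_a & copy_b) | (copy_a & copy_c) | (copy_b & copy_c)


setpoint = 0x5A5A
print(hex(read_triplicated(setpoint, setpoint ^ 0x0010, setpoint)))
# Voting per bit even survives different bits being flipped in different copies.
print(hex(read_triplicated(setpoint ^ 0x0001, setpoint ^ 0x0100, setpoint)))
```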

Dylan16807 · 2026-03-06T07:52:53 1772783573

If your pieces of important data are very tiny, that's probably your best option.

If they're hundreds of bytes or more, then two copies plus two hashes will do a better job.

d1sxeyes · 2026-03-06T12:24:12 1772799852

Ah, true! You just restore the one that matches its hash. Elegant.
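A rough sketch of the "two copies plus two hashes" idea, using SHA-256 from the standard library as the hash (the choice of hash and the helper names are assumptions for the example; a CRC would serve the same purpose for random flips):

```python
import hashlib


def seal(payload: bytes) -> tuple[bytes, bytes]:
    """Store a payload together with its digest."""
    return payload, hashlib.sha256(payload).digest()


def recover(copies: list[tuple[bytes, bytes]]) -> bytes:
    """Return the first copy whose payload still matches its own digest."""
    for payload, digest in copies:
        if hashlib.sha256(payload).digest() == digest:
            return payload
    raise RuntimeError("no intact copy left")


original = b"important configuration block" * 8
copy_a = seal(original)
copy_b = seal(original)

# Simulate a bit flip in the first copy's payload.
damaged = bytearray(copy_a[0])
damaged[3] ^= 0x04
copy_a = (bytes(damaged), copy_a[1])

print(recover([copy_a, copy_b]) == original)
```

Compared with three full copies, this costs two payloads plus two small digests, which is the saving Dylan16807 is pointing at for larger blocks.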

rixed · 2026-03-06T16:39:56 1772815196

A single hash should be enough.

Dylan16807 · 2026-03-06T19:55:07 1772826907

Yes, but what's easier depends on layout. "Consensus" makes me think of multiple entire nodes, and in that situation you can have a nice symmetry by making each node store one copy and one small hash.

If you're doing something that's more centralized then one hash might be simpler, but if you're centralized then you should probably use your own error correction codes instead of having multiple copies.

qznc · 2026-03-06T16:48:17 1772815697

In many cases the system is perfectly safe when it shuts off. Two is enough for that.

pizza · 2026-03-06T11:20:51 1772796051

“never go to sea with two chronometers, take one or three”

DennisP · 2026-03-06T16:56:17 1772816177

Seems like chronometers would be a case where two are better than one, because the mistakes are analog. If they don't exactly agree, just take the average. You'll have more error than if you were lucky enough to take the better chronometer, but less than if you had taken only the worse one. Minimizing the worst case is probably the best way to stay off the rocks.

Helmut10001 · 2026-03-06T09:35:01 1772789701

I use ZFS even on consumer devices, these days. Parity checks all the way!

vntok · 2026-03-06T07:22:32 1772781752

You can have voting systems in place, where at least 2 out of 3 different code paths have to produce the same output for it to be accepted. This can be done with multiple systems (by multiple teams/vendors) or more simply with multiple tries of the same path, provided you fully reload the input in between.
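As a sketch, a 2-out-of-3 voter over independent code paths could look like this. The three checksum functions are stand-ins invented for the example, with the third one deliberately simulating a corrupted result.

```python
from collections import Counter


def vote(results, quorum: int = 2):
    """Accept the most common result only if at least `quorum` paths agree."""
    value, count = Counter(results).most_common(1)[0]
    if count < quorum:
        raise RuntimeError("no quorum among redundant results")
    return value


def checksum_builtin(data: bytes) -> int:
    return sum(data) % 256


def checksum_loop(data: bytes) -> int:
    total = 0
    for byte in data:
        total = (total + byte) & 0xFF
    return total


def checksum_faulty(data: bytes) -> int:
    # Stand-in for a path hit by a bit flip: correct result with bit 3 flipped.
    return (sum(data) % 256) ^ 0x08


payload = b"sensor frame 42"
paths = (checksum_builtin, checksum_loop, checksum_faulty)
print(vote([path(payload) for path in paths]) == checksum_builtin(payload))
```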

qznc · 2026-03-06T07:24:40 1772781880

The simplest one is a watchdog: If something stops with regular notifications, then restart stuff.

gmueckl · 2026-03-06T07:32:55 1772782375

A watchdog guards against unresponsive software. It doesn't protect against bad data directly. Not all bad data makes a system freeze.
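A software watchdog is small enough to sketch here. This is an illustration with made-up names; real systems usually rely on a hardware timer that resets the chip. Note that it only tracks liveness and never looks at the data, which is exactly the limitation described above.

```python
import time


class Watchdog:
    """Expires if kick() has not been called within timeout_s seconds."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable so it can be tested
        self.last_kick = clock()

    def kick(self) -> None:
        """Called by the supervised task to signal that it is still alive."""
        self.last_kick = self.clock()

    def expired(self) -> bool:
        """Polled by the supervisor; True means the task should be restarted."""
        return self.clock() - self.last_kick > self.timeout_s


# Drive it with a fake clock so the example runs instantly.
now = [0.0]
dog = Watchdog(timeout_s=1.0, clock=lambda: now[0])

now[0] = 0.5
dog.kick()                 # task reports in
now[0] = 1.2
print(dog.expired())       # 0.7 s since last kick
now[0] = 2.0
print(dog.expired())       # 1.5 s since last kick: restart needed
```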

Helmut10001 · 2026-03-06T05:20:33 1772774433

Interesting, I was not aware! Do you have statistics for the percentage of bit flips that happen in RAM? My feeling would be it's the majority of bit flips that happen, but I can be wrong.

Tomte · 2026-03-06T07:08:46 1772780926

IEC 61508 estimates a soft error rate of about 700 to 1200 FIT (Failure in Time, i.e. 1E-9 failures/hour).

That was in the 2000s though, and for embedded memory above 65nm. I would expect smaller sizes to be more error-prone.
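To put FIT numbers in perspective, here is the unit conversion as a small sketch. What counts as one "unit" (a device, a megabit of memory) depends on how the rate is specified, which the comment above does not say, so the fleet size below is purely hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def failures_per_year(fit: float, units: int = 1) -> float:
    """Expected failures per year; 1 FIT = 1 failure per 1e9 unit-hours."""
    return fit * 1e-9 * HOURS_PER_YEAR * units


for fit in (700, 1200):
    per_unit = failures_per_year(fit)
    print(f"{fit} FIT: {per_unit:.5f} failures/year per unit, "
          f"about one every {1 / per_unit:.0f} years")

# Hypothetical fleet of one million units at 1000 FIT.
print(f"{failures_per_year(1000, units=1_000_000):.0f} failures/year across the fleet")
```

A rate that looks negligible for one unit becomes a daily occurrence at fleet scale, which matches the crash-report anecdotes elsewhere in the thread.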

colechristensen · 2026-03-06T05:39:31 1772775571

It would be quite hard to gather that data and would be highly dependent on hardware and source of bit flip.

But there's volatile and nonvolatile memory all over in a computer and anywhere data is in flight be it inside the CPU or in any wires, traces, or other chips along the data path can be subject to interference, cosmic rays, heat or voltage related errors, etc.

ZiiS · 2026-03-06T06:10:46 1772777446

It should be fairly easy to see statistically if ECC helps, people do run Firefox on it.

The number of bits in registers, busses, cache layers is very small compared to the number in RAM. Obviously they might be hotter or more likely to flip.

bpye · 2026-03-06T07:41:05 1772782865

I believe caches and maybe registers often have ECC too though I'm sure there are still gaps.

bell-cot · 2026-03-06T10:48:09 1772794089

Talk to someone in consumer sales about customer priorities. A bit-cheaper computer? Or one which is, in theory, more resilient against some rare random sort of problem which customers do not see as affecting them.

jodrellblank · 2026-03-06T17:31:43 1772818303

This is getting off-topic but I’m amazed by this ability to reach out to computers around the world as a sensor array and infer things we can’t easily find out in other ways. It’s in popular culture and HN comments most often as spyware and mass surveillance of people, and that’s a bit of a shame.

GPS location and movement data is what gives Google maps its near-real-time view of traffic on all roads, and busy-ness of all shops.

I think they collect location data from people riding public transport so they can tell you how long people wait on average at bus stops before getting on a bus.

Does Google collect atmospheric pressure readings from phone altimeters and use it for weather models? Could they?

Kindle collects details on books people read, how far they read, where they stop, which sections they highlight and quote, which words they look up in dictionaries.

I wonder if anyone’s curated a list of things like this which do happen or have been tried, excluding the “gathers user data for advertising” category which would become the biggest one, drowning out everything else.

I think current phones use accelerometer data to detect possible car crashes and call emergency services. Google could use that in aggregate to identify accident blackspots but I don’t know if they do. But that would be less useful because the police already know everywhere a big accident happens because people call the police. So that’s data easily found a different way.

seanw444 · 2026-03-06T18:47:50 1772822870

> It’s in popular culture and HN comments most often as spyware and mass surveillance of people, and that’s a bit of a shame.

I don't know whether you mean it's a shame that people consider it spyware, or if you meant that it's a shame that it manifests as spyware typically. I agree with the latter, not the former. It usually is spyware. If companies went for simple opt-in popups with a brief description of the reasoning, I'd be all for that. I sometimes opt-in to these requests myself, despite being a fairly privacy-conscious person, because I understand the benefit they have to the people collecting the data for good purposes. But when surveillance is opt-out (or no choice given), it's just spyware.

jodrellblank · 2026-03-06T21:06:35 1772831195

I mean what you did is a shame.

I asked to put the spyware aside for one sub-thread and focus on the astonishing worldwide sensor array, and you talked about the spyware and nothing else.

MBCook · 2026-03-06T17:47:34 1772819254

Doesn’t Google also use the phone accelerometer to try and spot earthquakes?

jodrellblank · 2026-03-07T16:08:58 1772899738

I don't know, but that's a good one. I wonder if they could do something like LIGO [1] which is an experiment of shining LASERS on mirrors 4km apart, to detect gravitational waves. Phone accelerometers don't have that kind of precision, but there are hundreds of millions of them and they are thousands of miles apart, is there possibly a signal among that noise?

[1] https://en.wikipedia.org/wiki/LIGO

mobilio · 2026-03-06T00:12:21 1772755941

Yup!

I read this a decade ago... https://www.codeofhonor.com/blog/whose-bug-is-this-anyway

john_strinlai · 2026-03-06T01:35:59 1772760959

for people that don't know, www.codeofhonor.com is netcoyote's (the gp comment) blog, and there is some good reading to be had there

Modified3019 · 2026-03-05T23:42:47 1772754167

Thanks to asrock motherboards for AMD’s threadripper 1950x working with ECC memory, that’s what I learned to overclock on.

I eventually discovered with some timings I could pass all the usual tests for days, but would still end up seeing a few corrected errors a month, meaning I had to back off if I wanted true stability. Without ECC, I might never have known, attributing rare crashes to software.

From then on I considered people who think you shouldn’t overclock ECC memory to be a bit confused. It’s the only memory you should be overclocking, because it’s the only memory where you can prove you don’t have errors.

I found that DDR3 and DDR4 memory (on AMD systems at least) had quite a bit of extra “performance” available over the standard JEDEC timings. (Performance being a relative thing, in practice the performance gained is more a curiosity than a significant real life benefit for most things. It should also be noted that higher stated timings can result in worse performance when things are on the edge of stability.)

What I’ve noticed with DDR5, is that it’s much harder to achieve true stability. Often even cpu mounting pressure being too high or low can result in intermittent issues and errors. I would never overclock non-ECC DDR5, I could never trust it, and the headroom available is way less than previous generations. It’s also much more sensitive to heat, it can start having trouble between 50-60 degrees C and basically needs dedicated airflow when overclocking. Note, I am not talking about the on chip ECC, that’s important but different in practice from full fat classic ECC with an extra chip.

I hate to think of how much effort will be spent debugging software in vain because of memory errors.
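For anyone wondering what a "corrected error" means mechanically, here is a minimal sketch using a toy Hamming(7,4) code. This is only an illustration of single-bit correction: real ECC memory uses wider codes implemented in hardware, and the function names below are made up for this example.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8                      # index 0 is unused so indices match positions
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # parity over positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # parity over positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # parity over positions 4, 5, 6, 7
    return sum(bits[p] << (p - 1) for p in range(1, 8))


def hamming74_decode(code: int) -> tuple[int, int]:
    """Return (nibble, corrected_position); position 0 means no error was seen."""
    bits = [0] + [(code >> (p - 1)) & 1 for p in range(1, 8)]
    syndrome = 0
    for p in range(1, 8):
        if bits[p]:
            syndrome ^= p               # XOR of the positions of all set bits
    if syndrome:
        bits[syndrome] ^= 1             # a single flipped bit sits at this position
    d = [bits[3], bits[5], bits[6], bits[7]]
    return sum(b << i for i, b in enumerate(d)), syndrome


if __name__ == "__main__":
    word = hamming74_encode(0b1011)
    damaged = word ^ (1 << 4)           # simulate one bit flip in storage
    print(hamming74_decode(damaged))    # data is recovered, error position reported
```

The nonzero syndrome is the part that matters for overclocking: the data still comes back right, but the system gets to log that a correction happened, which is the signal you never see without ECC.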

monster_truck · 2026-03-06T02:48:19 1772765299

DDR4 and 5 both have similar heat sensitivity curves which call for increased refresh timings past 45C.

Some of the (legitimately) extreme overclockers have been testing what amounts to massive hunks of metal in place of the original mounting plates because of the boards bending from mounting pressure, with good enough results.

On top of all of this, it really does not help that we are also at the mercy of IMC and motherboard quality too. To hit the world records they do and also build 'bulletproof', highest performance, cost is no object rigs, they are ordering 20, 50 motherboards, processors, GPUs, etc and sitting there trying them all, then returning the shit ones. We shouldn't have to do this.

I had a lot of fun doing all of this myself and hold a couple very specific #1/top 10/100 results, but it's IMHO no longer worth the time or effort and I have resigned to simply buying as much ram as the platform will hold and leaving it at JEDEC.

golem14 · 2026-03-06T00:46:51 1772758011

Hmm, I wonder if, now that we are in a RAM availability crisis, we'll see more borderline-to-bad RAM creep into the supply chain.

If we had a time series graph of this data, it might be revealing.

monster_truck · 2026-03-06T02:54:27 1772765667

If you look around you'll see people already putting the new, Chinese-made DDR4 through its paces, and it's holding up far better than anyone expected.

Every single time I've had someone pay me to figure out why their build isn't stable, it's always some combination of cheap power supply with no noise filtering, cheap motherboard, and poor cooling. Can't cut corners like that if you want to go fast. That is to say, I've never encountered "almost ok" memory. They're quite good at validation.

iamflimflam1 · 2026-03-06T07:55:55 1772783755

The danger is we’ll start to see more QA rejects coming into the market. The temptation to mix in factory rejects into your inventory is going to get very high for a lot of resellers.

kombine · 2026-03-06T07:53:58 1772783638

Where does one find these? I'm looking for DDR4 ECC for my homelab.

bpye · 2026-03-06T07:43:57 1772783037

Similar experience. I played with overclocking the DDR5 ECC memory I have on my system, it would appear to be stable and for quite a while it would be. But after a few days I'd notice a handful of correctable errors.

I now just run at the standard 5600MHz timing, I really don't find the potential stability trade off worth it. We already have enough bugs.

kmeisthax · 2026-03-06T00:19:04 1772756344

> From then on I considered people who think you shouldn’t overclock ECC memory to be a bit confused. It’s the only memory you should be overclocking, because it’s the only memory where you can prove you don’t have errors.

This attitude is entirely corporate-serving cope from Intel to serve market segmentation. They wanted to trifurcate the market between consumers, business, and enthusiast segments. Critically, lots of business tasks demand ECC for reliability, and business has huge pockets, so that became a business feature. And while Intel was willing to sell product to overclockers[0], they absolutely needed to keep that feature quarantined from consumer and business product lines lest it destroy all their other segmentation.

I suspect they figured a "pro overclocker" SKU with ECC and unlocked multipliers would be about as marketable as Windows Vista Ultimate, i.e. not at all, so like all good marketing drones they played the "Nobody Wants What We Aren't Selling" card and decided to make people think that ECC and overclocking were diametrically opposed.

[0] In practice, if they didn't, they'd all just flock to AMD.

gruez · 2026-03-06T00:37:29 1772757449

>[0] In practice, if they didn't, they'd all just flock to AMD.

only when AMD had better price/performance, not because of ECC. At best you have a handful of homelabbers that went with AMD for their NAS, but approximately nobody who cares about performance switched to AMD for ECC RAM, because ECC RAM also tends to be clocked lower. Back in Zen 2/3 days the choice was basically DDR4-3600 without ECC, or DDR4-2400 with ECC.

pushedx · 2026-03-06T00:48:17 1772758097

At the beginning of your comment I was wondering if the "attitude" that was corporate serving was the anti-ECC stance or the pro-ECC stance (based on the full chunk that you quoted). I'm glad that by the end of the comment you were clearly pro ECC.

Any workstation where you are getting serious work done should use ECC

jug · 2026-03-06T01:04:41 1772759081

As a community alpha tester of GW1, this was a fun read! Such an educational journey and what a well organized and fruitful one too. We could see the game taking shape before our eyes! As a European, I 100% relied on being young and single with those American time zones. :D Tests could end in my group at like 3 am, lol.

netcoyote · 2026-03-06T05:13:29 1772774009

Oh yeah, those were some good times. It was great getting early feedback from you & the other alpha testers, which really changed the course of our efforts.

I remember in the earlier builds we only had a “heal area” spell, which would also heal monsters, and no “resurrect” spell, so it was always a challenge to take down a boss and not accidentally heal it when trying to prevent a player from dying.

aiiane · 2026-03-06T08:20:16 1772785216

I remember one of the first impressions I had in GW1 during test events was the sense of scale in the world that still managed to avoid excessive harsh geometry angles for the most part. Not surprised to hear it was pushing more polygons than average.

P.S. GW1 remains one of my favorite games and the source of many good memories from both PvP and PvE. From fun stories of holding the Hall of Heroes to some unforgettable GvG matches, y'all made a great game.

pndy · 2026-03-05T08:36:04 1772699764

I didn't expect to read bits of GW story here from one of the founders - thanks!

samiv · 2026-03-06T11:10:10 1772795410

Plot twist. The memory bit flip checking code was actually buggy and contained UB.

No, seriously, did you actually verify the code for correctness before relying on its results?

arprocter · 2026-03-05T23:25:15 1772753115

>Sometimes I'm amazed that computers even work at all!

Funny you say this, because for a good while I was running OC'd RAM

I didn't see any instability, but Event Viewer was a bloodbath - reducing the speed a few notches stopped the entries (iirc 3800MHz down to 3600)

Dylan16807 · 2026-03-06T07:18:46 1772781526

> And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.

For that one I'd guess no, because under normal circumstances hot locations like that will stay in cache.

monster_truck · 2026-03-06T02:36:04 1772764564

Every interesting bug report I've read about Guild Wars is Dwarf Fortress tier. A very hardcore, longtime player who was recounting some of the better ones to me shared a most excellent one wrt spirits or ghosts, some sort of player summoned thing that were sticking around endlessly and causing OOM errors?

Analemma_ · 2026-03-05T22:46:10 1772750770

There's a famous Raymond Chen post about how a non-trivial percentage of the blue screen of death reports they were getting appeared to be caused by overclocking, sometimes from users who didn't realize they had been ripped off by the person who sold them the computer: https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35.... Must've been really frustrating.

jnellis · 2026-03-06T03:32:33 1772767953

This was a design choice by AMD at the time for their Athlon Slot A CPUs. They all used the same Slot A board, on which you could set the CPU speed by bridging connections. Since the Slot A CPU came in a package, you couldn't see the actual CPU etching. So shady CPU sellers would pull the covers off high-speed CPUs and put them on slow-speed CPUs after overclocking them to unstable levels.

projektfu · 2026-03-06T00:35:34 1772757334

E.g., running a Pentium 75, at 75MHz.

nxobject · 2026-03-06T11:23:38 1772796218

> Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause

Oh god yes… Dell OptiPlexes and bad caps went together in those days. I’m half convinced Valve put the gray towers in Counter-Strike so IT employees wasting time could shoot them up for therapy.

Agentlien · 2026-03-06T05:50:11 1772776211

That's a really cool anecdote. The overclock makes sense. When we released Need For Speed (2015) I spent some time in our "war room", monitoring incoming crash reports and doing emergency patches for the worst issues.

The vast majority of crashes came from two buckets:

1. PCs running below our minimum specs

2. Bugs in MSI Afterburner.

kasabali · 2026-03-06T08:25:52 1772785552

> Bugs in MSI Afterburner.

Do you mean the OSD?

Agentlien · 2026-03-06T11:07:06 1772795226

It seemed to be the monitoring side of it which caused a lot of crashes. It was apparently a very common issue in many games around that time.

PaulHoule · 2026-03-06T13:51:39 1772805099

Back in the 90's I had an overclocked AMD486 machine which seemed OK most of the time but had segfaults compiling the Linux kernel. I sent in a bug report and Alan Cox closed it saying it was the fault of my machine being overclocked.

I dialed the machine back to the rated speed but it failed completely within 6 months.

fennecbutt · 2026-03-06T17:53:18 1772819598

That's awesome. But also guild waaars, GW2 I played from beta for years, but it just got boring. Endless expansions with weird story.

We need GW3 already but my fear is mmo as a genre is dying.

uncSoft · 2026-03-06T17:55:15 1772819715

They just need to call it GW Classic apparently and it will sell

sidewndr46 · 2026-03-06T13:31:27 1772803887

Well wow I wasn't expecting to see yet another story from Patrick Wyatt here in the comments! Much appreciated, I've enjoyed reading everything you've written over the years.

SunnyNeon · 2026-03-06T10:40:03 1772793603

How did you determine which of the causes it was?

danielEM · 2026-03-06T12:09:21 1772798961

> problems because Dell sourced the absolute cheapest stuff for their computers;

Price in itself does not cause problems; it is either bad design, or false or incomplete data in datasheets, or all of these. Please STOP spreading this narrative. The right thing is for ads, datasheets, marketing materials, etc. to tell you the truth you need to make a proper decision as a client/consumer.

taneq · 2026-03-06T05:02:49 1772773369

Wow, that’s really interesting! I always suspected bit flips happened undetected way more than we thought, so it’s great to get some real life war stories about it. Also thanks for Guild Wars, many happy hours spent in GW2. :)

just_testing · 2026-03-06T02:53:58 1772765638

I loved reading your comment and got curious: how did he detect the bitflips?

mayama · 2026-03-06T03:29:49 1772767789

It looks like they ran a math-heavy computation with a known answer, like the 301st prime, and compared the result.

General memory testing programs like memtest86 or memtester write random bit patterns into memory and then verify them.
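A minimal sketch of both ideas, to make them concrete. This is not the Guild Wars code (which isn't shown in the thread) and not how memtest86 is actually implemented; the function names and sizes are made up, and a Python script is a poor tool for catching real hardware faults. It only shows the shape of the checks.

```python
import random

EXPECTED_301ST_PRIME = 1993   # reference answer, computed once on known-good hardware


def nth_prime(n: int) -> int:
    """Return the n-th prime (1-indexed) using plain trial division."""
    count, candidate = 0, 1
    while count < n:
        candidate += 1
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            count += 1
    return candidate


def known_answer_check(rounds: int = 20) -> int:
    """Recompute the known answer repeatedly; return how many rounds disagreed."""
    return sum(1 for _ in range(rounds) if nth_prime(301) != EXPECTED_301ST_PRIME)


def pattern_test(size: int = 1 << 16, seed: int = 0) -> int:
    """Fill a buffer with a seeded pseudo-random pattern, then count mismatches."""
    buf = bytearray(size)
    rng = random.Random(seed)
    for i in range(size):
        buf[i] = rng.getrandbits(8)
    rng = random.Random(seed)           # regenerate the same sequence for the verify pass
    return sum(1 for i in range(size) if buf[i] != rng.getrandbits(8))


if __name__ == "__main__":
    print("known-answer failures:", known_answer_check())
    print("pattern mismatches:", pattern_test())
```

On healthy hardware both counts stay at zero; any nonzero count means the CPU or memory produced a different result from the same deterministic input.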

Salgat · 2026-03-06T01:54:13 1772762053

Mike is such a legend.

cookiengineer · 2026-03-06T02:49:07 1772765347

I kind of wanted to confirm that. At that time I was still using a Compaq business laptop on which I played Guild Wars.

The Turion64 chipset was the worst CPU I've ever bought. Even 10-year-old games had rendering artefacts all over the place, with triangle strips being "disconnected" and leading to big triangles appearing everywhere. It was such weird behavior, because it always happened around 10 minutes after I started playing. It didn't matter _what_ I was playing. Every game had rendering artefacts, one way or the other.

The most obvious ones were 3d games like CS1.6, Guild Wars, NFSU(2), and CC Generals (though CCG running better/longer for whatever reason).

The funny part behind the VRAM(?) bitflips was that the triangles then connected to the next triangle strip, so you had e.g. large surfaces in between houses or other things, and the connections were always in the same z distance from the camera because game engines presorted it before uploading/executing the functional GL calls.

After that laptop I never bought these types of low budget business laptops again because the experience with the Turion64 was just so ridiculously bad.

benatkin · 2026-03-06T16:35:22 1772814922

> Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.

Yikes. Dude, you're getting a Packard Bell.

andrepd · 2026-03-06T13:47:18 1772804838

Amazing story! Reminds me of old gamasutra posts like these https://web.archive.org/web/20170522151205/http://www.gamasu...

jiggawatts · 2026-03-06T02:58:10 1772765890

Some multiplayer real-time strategy (RTS) games used deterministic fixed-point maths and incremental updates to keep the players in sync. Despite this, there would be the occasional random de-sync kicking someone out of a game, more than likely because of bit flips.
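A rough sketch of that lockstep idea, assuming a 16.16 fixed-point format, a toy two-field game state, and CRC32 as the per-tick sync hash. All of these are hypothetical choices for illustration, not taken from any particular game.

```python
import struct
import zlib

FRAC_BITS = 16                     # 16.16 fixed point (assumed format)
ONE = 1 << FRAC_BITS


def to_fixed(x: float) -> int:
    """Convert a float to fixed point once, at load time, never during simulation."""
    return int(round(x * ONE))


def fx_mul(a: int, b: int) -> int:
    """Multiply two fixed-point numbers using integer math only."""
    return (a * b) >> FRAC_BITS


DT = to_fixed(1 / 32)              # fixed timestep


def step(state: dict, command: int) -> None:
    """Advance the toy state by one tick."""
    state["vel"] += command
    state["pos"] += fx_mul(state["vel"], DT)


def checksum(state: dict) -> int:
    """Hash the state; peers exchange this value every tick."""
    return zlib.crc32(struct.pack("<qq", state["pos"], state["vel"]))


def simulate(commands, flip_at=None):
    """Run the simulation, optionally injecting a single bit flip at one tick."""
    state = {"pos": 0, "vel": 0}
    sums = []
    for tick, cmd in enumerate(commands):
        step(state, cmd)
        if tick == flip_at:
            state["pos"] ^= 1 << 3     # simulated memory error
        sums.append(checksum(state))
    return sums


def first_desync(a, b):
    """Return the first tick at which two peers' checksums differ, or None."""
    for tick, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return tick
    return None


if __name__ == "__main__":
    cmds = [to_fixed(0.5)] * 10
    print(first_desync(simulate(cmds), simulate(cmds, flip_at=4)))
```

Because every peer runs the same integer math on the same commands, the checksums match tick for tick unless something corrupts one peer's state, whether that is a software bug or a flipped bit.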

netcoyote · 2026-03-06T05:46:01 1772775961

For RTS games I wish we could blame bit flips, but more typically it is uninitialized memory, incorrectly-not-reinitialized static variables, memory overwrites, use-after-free, non-deterministic functions (eg time), and pointer comparisons.

God I love C/C++. It’s like job security for engineers who fix bugs.

blep-arsh · 2026-03-06T08:56:36 1772787396

Some games are reliable enough. I found out the DRAM in my PC was going bad when Factorio started behaving weird. Did a memory test to confirm. Yep, bitflips.

yownie · 2026-03-06T20:40:50 1772829650

this is exactly the type of story I come to HN to read, thanks!

hsbauauvhabzb · 2026-03-05T23:15:20 1772752520

Did you/he ever consider redundant allocation for high value content and hash checks for low value assets that are still important?

I imagine the largest volume of game memory consumption is media assets which if corrupted would really matter, and the storage requirement for important content would be reasonably negligible?

nomel · 2026-03-06T00:13:03 1772755983

I think the most reasonable take would be to just tell the users their hardware is borked, that they're going to have a bad time outside the game too, and point them to one of the many guides around this topic.

I don't think engineering effort should ever be put into handling literal bad hardware. But, the user would probably love you for letting them know how to fix all the crashing they have while they use their broken computer!

To counter that, we're LONG overdue for ECC in all consumer systems.

AlotOfReading · 2026-03-06T01:37:46 1772761066

I put engineering effort into handling bad hardware all the time because safety critical, :)

It significantly overlaps the engineering to gracefully handle non-hardware things like null pointers and forgetting to update one side of a communication interface.

80/20 rule, really. If you're thoughtful about how you build, you can get most of the benefits without doing the expensive stuff.

shakna · 2026-03-06T01:18:41 1772759921

I think I sit in another camp. A lot of my engineering efforts are in working around bad hardware.

Better the user sees some lag due to state rebuild versus a crash.

Most consumers have what they have, and use what they have. Upgrading everything is now rare. If they got screwed, they'll remain screwed for a few years.

andai · 2026-03-06T00:07:58 1772755678

That's an interesting idea. How might you implement that? Like RAID but on the level of variables? Maybe the one valid use case for getters/setters? :)

hsbauauvhabzb · 2026-03-06T00:43:39 1772757819

As another user fairly pointed out, ECC. But a compiler level flag would probably achieve the redundancy, sourcing stuff from disk etc would probably still need to happen twice to ensure that bit flips do not occur, etc.
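To make the "RAID for variables" idea concrete, here is a minimal sketch: three copies behind a getter/setter with a bitwise majority vote, plus a hash check for bulk assets. The class and function names are hypothetical. Note also that in Python the three "copies" may share storage, so this only demonstrates the voting logic; in C or C++ you would place the copies in separate allocations.

```python
import hashlib


class TripleRedundant:
    """Keep three copies of an integer; reads return the bitwise majority."""

    def __init__(self, value: int):
        self._copies = [value, value, value]

    @property
    def value(self) -> int:
        a, b, c = self._copies
        voted = (a & b) | (a & c) | (b & c)   # each bit takes the value held by 2 of 3
        if not (a == b == c):
            self._copies = [voted] * 3        # repair the damaged copy
        return voted

    @value.setter
    def value(self, new_value: int) -> None:
        self._copies = [new_value] * 3


def load_asset(data: bytes, expected_sha256: str) -> bytes:
    """Return the asset if its hash matches; otherwise signal that a reload is needed."""
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("asset hash mismatch; reload from disk")
    return data


if __name__ == "__main__":
    hp = TripleRedundant(100)
    hp._copies[1] ^= 1 << 5                   # simulate a bit flip in one copy
    print(hp.value)                           # the vote recovers the original value
```

The trade-off is the one raised above: voting triples the memory for whatever it protects, so it only makes sense for a small amount of high-value state, while a hash check costs almost nothing in memory but can only detect damage, not repair it.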

rurban · 2026-03-06T07:24:56 1772781896

I hate HW soo much. To revise the biggest problems in computing, beside out of tokens: HW bugs