This wasn't a change to the entire codebase, only one portion of it (the tag engine). But more importantly, when only 6 web servers are running stackoverflow.com, there's a decent chance a non-triggered GC happens and stalls the servers still in rotation anyway, though granted, rotation lessens the chance of a user-facing stall overall. It also complicates the build process, since the queue for rotating servers out gets involved. For example, if servers 3-4 are in a GC rotation while 1-2 are starting the build loop, you're running only on 5 and 6. That's fine and doesn't even hit 50% CPU, but it's a little risky. Worse, you have the very real possibility of taking ALL servers down during a build and GC combo, which we'd never want to happen.
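To make the rotation risk concrete, here is a minimal Python sketch of the kind of guard a rotation queue would need. The `Pool` class, its field names, and the `min_serving` floor of 2 are all hypothetical, invented for illustration. Nothing here reflects how the actual build or rotation tooling works.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical model of a web tier with servers rotating out."""
    total: int
    out_for_gc: int = 0
    out_for_build: int = 0
    min_serving: int = 2  # assumed safety floor, not a figure from the thread

    @property
    def serving(self) -> int:
        # Servers still taking user traffic.
        return self.total - self.out_for_gc - self.out_for_build

    def can_take_out(self, count: int = 1) -> bool:
        # Allow a rotation only if the floor still holds afterwards.
        return self.serving - count >= self.min_serving


# The scenario above: two servers out for GC, two out for the build.
busy = Pool(total=6, out_for_gc=2, out_for_build=2)
idle = Pool(total=6)
```

With `busy`, only two servers are serving, so `busy.can_take_out()` is false and any further GC or build rotation has to wait. The point is that the GC queue and the build queue must share this check, which is the extra coupling described above.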
They rewrote a relatively small part of their code base, a small part of their in-memory cache. For a team who a) wrote the code base and b) changes it every day, this was less a big deal than an interesting problem. From an operations standpoint, I'd consider app servers that need to be rotated every hour or so a very brittle architecture. (I don't let that kind of thing go without protesting wildly.)
It does? What is the "root cause" though? You seem to take it to mean the generation of references that must be traversed in gen2, but that sounds like just as much a hack to me: they modify the code in ways that don't represent the semantics of the problem to work around a limitation of the implementation.
Hell, you could argue the "root cause" is the use of a garbage-collected environment in the first place. All popular GC implementations have latency issues like this. All of them. If you can't deal with occasional high latencies, you should identify that requirement before choosing Java or C#.
All the solutions provided are just patching around that issue. I don't see anything in the post that looks like a root cause, and the GP's post has the advantage of being much simpler to implement.
I consider the root cause to be a mismatch between the memory management needs of the application and the assumptions of the GC. Their solution matches the application's memory usage with the assumptions of the GC.
What would be more useful to them is to be able to provide either hints to the GC, or to actually have some control over memory management. That is, keep the application the same, but let it influence GC behavior. They're forced to go at it the other way: keep the GC the same, but change the application so what the GC does becomes the right thing.
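As an illustration of what such a hint can look like, CPython (not the .NET runtime under discussion) exposes `gc.freeze()`, which moves all currently tracked objects to a permanent generation that later collections ignore. The sketch below assumes a hypothetical `build_cache` function standing in for a large, long-lived cache. It shows the general idea only and says nothing about what .NET offers.

```python
import gc


def build_cache():
    # Hypothetical stand-in for a large, long-lived in-memory cache.
    return {i: [i, str(i)] for i in range(1000)}


def freeze_long_lived(builder):
    """Build long-lived data, then tell the collector to stop scanning it."""
    cache = builder()
    gc.collect()  # clear out existing garbage before freezing
    gc.freeze()   # move all tracked objects to the permanent generation
    return cache, gc.get_freeze_count()


cache, frozen = freeze_long_lived(build_cache)
# Later collections skip the `frozen` objects, including the cache.

gc.unfreeze()  # return the objects to normal collection
```

The application code that uses `cache` stays the same. Only the collector's view of it changes, which is the direction of control being asked for here.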
they modify the code in ways that don't represent the semantics of the problem to work around a limitation of the implementation.
That is always true when you start optimizing for performance. And nick_craver explains why the proposed suggestion may be easy to state but carries hidden complexities. Personally, once you start introducing external control to your application like that, alarm bells go off in my head.
I think this would be a cleaner approach than changing the entire code base.