Google Outage in Europe (google.com)
266 points by vanburen on Nov 12, 2021 | hide | past | favorite | 176 comments


There were outages around the same time last year. Somebody in the HN thread commented back then that employees' evaluation and promotion window ends around December/end of year, thus more releases are made.

https://en.m.wikipedia.org/wiki/Google_services_outages


Perf has been over for almost a month now, and the evaluation period was over more than two months ago.


> and the evaluation period was over more than two months ago

Excellent time to slack off a bit, make a few mistakes, then come next eval you can point to a marked improvement over the intervening six-to-nine months!


The line of logic this thread has followed so far suggests there will be a reason for a Google outage every month of the year.


Couldn't have phrased it any better.


I believe the term of art is “just so story.”


Then it might, indeed, be unrelated. Thanks for clarification!


Could it not be knock-on effects of earlier changes? Or indeed post-review changes from scared/precarious engineers?


I think we are exactly midway between the end-of-year 2021 perf and the mid-year 2022 perf, so it's not clear to me that this could be to blame.


That explains:

The people who did the stuff have either gotten their promotions and moved on or didn't get a promotion and have given up.

:-)


He he. It's kind of hilarious. Maybe they should stagger reviews to ensure high availability? :-D


Nothing wrong with it. People make errors; distributed systems aren't easy. The more frequent the changes, the more likely a bug was introduced.

My post is just speculation; let's wait for the actual technical details doc from Google.


Well, there are various whitepapers that reformulate this exact truth you've highlighted: issues tend to happen more when changes are made.

Regarding the technical details doc, Google will never state that outright in individual postmortems. And they will definitely not draw this to the logical conclusion regarding the spiky yearly activity.


> logical conclusion regarding the spiky yearly activity

Why is it logical? There’s tons of changes being deployed at all times, at all large companies.

Across products, verticals, everything - hundreds of changes at any given point. Some of these changes can introduce hard-to-predict bugs in globally distributed systems. Most of the time, external users don’t even notice before they’re fixed.

Like another commenter said, performance reviews do not coincide with the year-end at Google and other companies.


Hypothetically speaking? 11 months a year there's no incentive to cut corners and only do 2 weeks of testing on something that really needs 3. If one month a year rushing things out with a bit less testing is rewarded, I can believe some people would respond to that.

Of course, I wouldn't go as far as to call this a "logical conclusion" as the evidence I've seen is very slim.


Except that there are two perf cycles a year at Google, so it's 10 months of the year, 5 and 5, and Covid's made everything weird.


> Like another commenter said, performance reviews do not coincide with the year-end at Google and other companies

When are they? My bet is that they're roughly around the same time period, even if they don't start or close on December 31st.


It can’t be logical because it is based on bad facts.

We wrote evals in August. Anybody racing to get things launched before perf did so months ago. The timing described in the top post is just false.

We write the next one in February. We are almost exactly half way between perf cycles. Your hunch that this lines up with the end of the year is false.


Far more likely IMO to coincide with trying to get Exciting New Features out before release freezes for some conference or other.


There is also a lot of pressure on engineers around black Friday and Cyber Monday to get things done before any change embargoes come into place. This is coming from someone who just worked 16 hours straight to scale things before a change freeze.

I don't know how or if this would impact Google, but I'm sure someone there has at least thought about those dates.


November 11 is a public holiday in the US. And in recent years it has become a company holiday at many companies too. Maybe that break in the cadence of workdays and holidays, established over many years, plays a role here - say, by affecting the smoothness of the support handoff between US, EMEA, and APJ.


Veterans Day is not a day off for US Googlers.


> evaluation and promotion window ends around december

Googler here, that is not when those promotion/perf evaluation windows happen.


I would be very interested in any research about code quality in relation to promotion packets. Ideally academic.

I am not sure where to look for it.


Is there an objective way to measure code quality?

We barely understand the systems we’re writing code for - I don’t see how you could objectively judge the code to manage those systems without fully understanding the systems first.

You could hand-wave some metrics like number of bugs or test coverage, but I can't think of any (the aforementioned included) that aren't subject to tons of confounding variables.


Some people are definitely not getting a raise.


More likely someone actually will. A blameless postmortem will be written, and the people who fix the bug or systems issue will have something to work on with high visibility and high impact, which tends to translate into good performance ratings.

(Googler, opinions are my own)


"Impact: fixed processes that led to 8 hr outage" seems like an easy case to make.


And this is why nobody serious should build on GCP.

Google's more interested in placating their primadonna engineers than solving customer problems.


Then they get fired instead, well done ha.


Comments like this make me wonder if people really expect engineers to be fired because of an outage? I do not work at Google, but none of my workplaces would fire engineers because of a failure. Mistakes happen. As long as they are not repeated, everything is good.

If your company fires people in situations like this, run away and never look back.


Googler here, not speaking on the behalf of the company, my opinions are my own

People do absolutely NOT get fired over incidents. Making mistakes is human. An incident will prompt a review of the systems and safeguards in place to prevent such an incident, much like an airline incident investigation -

basically "somebody fat-fingered it" is never the answer, postmortems are always blameless

EDIT: now that I think of it, the opposite thing happens after a major incident - a systemic failure should be identified, people are being hired to fix it :)


> Googler here, not speaking on the behalf of the company, my opinions are my own

Why are employees at big tech names (FAANG et al.) so often so cautious as to include this as a foreword everywhere? Twitter bios are full of it, for instance.

It is crazy to me; who would expect anything other than our opinions being our own and nothing more? Who would expect that your word (with all due respect) is worth anything with regard to the company's PR?

Is there an actual risk in the US? Have there been trials or anything that push people to add such statements?


Can't speak for other companies, but this is covered in basic training at Google - if you're not authorized to speak on behalf of the company, you must make it clear whenever your writings might be mistaken for, or construed as, representing the company.

Basically, the company has specially trained people who speak on behalf of the company, and that message should not be confounded with the personal opinions of other employees. For example, during the recent FB outage, an employee posted inside information on Reddit - media companies just took it at face value and ran with it, reporting it as if it were what FB itself was saying about the outage.

I'm not aware of any actual risks in the US, but then again I'm not in the US. For me this seems a minor point, and I actually enjoy separating my public persona from the company for which I work, be it Google or a small startup.


> During the recent FB outage, an employee posted inside information on Reddit - media companies just took it at face value and ran with it, reporting it as if it were what FB itself was saying about the outage.

To be fair, I think the media would have done that even with a "speaking only for myself" disclaimer.


Every company says that; the obvious solution is to never name it when you speak. Why do these people need to say "I'm a Googler" and then immediately "but forget it, I speak for myself"... obviously there's value in the fact that they're at Google, and it will color their discourse, which is probably already forbidden.

Don't name your company if you intend to speak for yourself.


It's the same at all large companies - it's CYA boilerplate.

Though I almost got to be the official spokesperson for British Telecom responding on the alt.2600 newsgroup about the Met police VMB hack - the press office was cool but internal security was not.


Would you say: "I live in [city], opinions are my own", or "I am married to [person], opinions are my own"?

If no, why are you doing this for the company you're trading skills and time against money?


I work at a large tech company, and they do mention in the onboarding materials that we represent the company, so we should be careful with our social media profiles. My solution to this is to not associate my social media profiles with my employer. This is technically not really what we're supposed to do, and I might have to change that approach at some point if I move high enough in the org to start getting attention from people, but this works for me better than disclaimers on all my posts.


if they wanted to control this so bad they'd provision you a managed account like how email addresses are managed.


Yet all these googlers breached it immediately in the first sentence by naming their company...


Sarcastically, "I'm a googler, opinions my own" reads a lot like "I'm a googler, just so you know".

I didn't want to emphasize that in my first comment, but to be honest, I find it pedantic because it's pointless, legally speaking.


> Why are employees at big tech names (FAANG et al.) so often so cautious as to include this as a foreword everywhere?

This isn’t just ‘big tech’ - I work at a relatively small tech company, but I’d never want anything I say about the company to be mistaken as some sort of ‘official statement’ especially if it related to an incident that possibly had a financial impact on external parties, and could conceivably be misused in that context in the future.

I go as far as never writing private emails from my work mail for the same sort of reasons - although that is from a possibly over-abundance of caution.


It's in the spirit of full disclosure, which some, including me, appreciate.


Part of it is because the company asks us to. Part of it is because I think it's reasonable to tell people your biases, and it can avoid the situation where substantive conversation gets derailed by "gotchas". If I make a comment about how I think Google Meet has the best noise cancellation of any video chat software, even though I don't work on Meet or anything adjacent to it, it's still a bad look if someone can dig through my comment history and pull out a previous comment about how I work for Google.


> It is crazy to me; who would expect anything other than our opinions being our own and nothing more?

Rumor mill journalists will mine social media and forum comments and write entire articles about "so and so FAANG employee gives hint at future merger" when some dev comments how much they've enjoyed using some library recently.


They want to mention their employer to gain authority in the discussion, but since mentioning their employer is a legal/PR risk, they need to follow it up with a disclaimer (this only partially mitigates the risk, but it’s worth it to get the brag in).


It's because they only hire idiots at Google. I'm from a big company; I just don't name it and assume humans understand my opinions are my own :D


Especially when it's not an opinion.


In fairness, 'opinion' is a horrendously vague and ill-defined word. It does double duty as (i) 'normative value' and (ii) 'personal understanding of the descriptive facts', which two senses are constantly confused - for example right here.

That's why we constantly get "it's just my opinion" used in reference to type-ii opinions (personal understanding of descriptive fact), when it's only really appropriate to type-i opinions (normative value).

Many conversations would be far clearer if it were abandoned in favour of more precise language, IMPUOTDF.


"Many conversations would be far clearer if it were abandoned in favour of more precise language, IMPUOTDF"... um, what's IMPUOTDF? I did try to google it, but only this post came up.


Sorry, I was joking: 'in my personal understanding of the descriptive facts', referring back to that second definition of 'opinion' earlier in the comment.


Yeah, this 'blameless' ethos has definitely trickled down from FAANG to decently-sized decently-reputed places I've worked at - and certainly to #EngTwitter.

I think it's a bit over-applied in some cases. Does it not commit you to the theorem that every process can be made so perfect as to be completely invulnerable to one human being making a mistake? (At least, in the form exemplified by the common tweets to the effect that "your processes are to blame for $incident, not your interns/engineers/etc".)

Even if you required two-person auth for every single thing, two people will make a mistake now and then, and in reality - due to our being social animals - the two probabilities are not truly independent.

I just don't see how this is feasible in reality. A more realistic principle feels like: "people will infrequently make mistakes, and that's of course natural and human and forgiveable, but far fewer incidents should be vulnerable to human error than currently are".


I of course agree that mistakes are inevitable. That being said, the point of blameless culture is not to make a process invulnerable to mistakes. Instead during a post-mortem, we look at how to prevent that particular incident from happening again.


You're totally right, and the SRE book by Google goes over this - the company's culture does not allow firing people for outages. If you're somewhere this still happens, run away (or you'd better be getting paid more than top-end ICs at Google).


Why would you fire an engineer you have just spent millions to train?


What tech company spends millions on training anyone?


It's reframing lost revenue, not talking about literal training cost.


The other comments already explained it, but I'm wondering how you haven't come across this 'saying' before. It's so overused and also cheesy in my opinion.


People are born every day. Every day tens of thousands of people will hear about hacker news, the pyramids, Darth Vader being someone's father, for the first time.


It was not meant with ill will. I was just wondering. Regarding what you said, I think I understand your point. On the other hand, it makes some difference whom you ask - some things are much likelier to be common knowledge (or at least heard of) in some places than in others. Whatever.


Lol maybe if people actually did spend millions training their people up front we could do better?


139,995 employees at Google * 1,000,000 = $139,995,000,000

$140 billion dollars. On training.

On the one hand... you know what, I'd love to work in an environment like that. Seriously.

On the other hand... what's the argument you make to the CFO in support of this? Honest question, interested to hear answers.


I work part time in the Army. In the Army when you go from their equivalent of junior to mid-level they take you out of your job for eight months of dedicated personal development, before you start your first mid-level job. When you go to their equivalent of senior they take you out for a year.

I don't know how much that costs all-in, including the salaries, instructors, facilities, but might be starting to approach a million.

That's valuing training!


But the army is a cost center! The workers have some money shaved off their salary to pay for an army that lets delinquents and the half-disabled pew-pew guns in the forest, leaving them in peace. It's not comparable to a productive enterprise that needs to build things for people or perish.

For instance, if Google fails and can't profit, it can't just shoot at its clients until they pay. Your organisation can.


> It's not comparable to a productive enterprise that needs to build things for people or perish.

Well we had to build an Army to win against fascism in the Second World War or we all would have perished.

And 'perished' means literally dead or subject to fascism, not just going out of business.


WW2 was 70+ years ago. The USSR fell 30+ years ago. Military budgets are still incredibly high. The army as it is today does not need to be that efficient. And even during WW2 times the US did not face a credible threat of invasion. The last time the US faced an invasion on the mainland was during the war of 1812.


> The army as it is today does not need to be that efficient.

You want the Army to be... less efficient? Spend more for less capability?


No I mean that for the US army today the downside to the army being inefficient is that money is wasted. Not great, but not a disaster. For a different country that could mean the country gets invaded or the government collapses (like Afghanistan).


Middle and upper management get there via connections and picking things up on their own. It's unsurprising that they don't want to be subjected to competition with lesser people who can "merely" be trained to do their job as well, or better.


Some jobs can be learned on the job.

But I'm glad that learning to kill people (military) is not taught that way.


If your job is something like a staff officer in a Brigade, you could learn that on the job, the Google way, because exercising is also 'the job', but they don't get you to do that - instead they take the time to fully pull you out of all work commitments for dedicated personal development. These periods of personal development are about personal skills rather than combat training, which you've already done by this point.


Someone who just cost their company millions in revenue is going to be _extremely_ careful not to make the same mistake again. Hence, million-dollar training.


The GP is a reference to the anecdote about IBM's Thomas Watson not firing an executive who had made an error costing the company a substantial amount of money.


The implication is that millions were spent training the person who made the mistake when they cost the company millions by making that mistake.


An outage can be pretty expensive, but it is training for those who triggered it and/or those who fix it.


I'd guess because Europe-wide outages cost more than millions.


But firing someone doesn't undo an incident. It just introduces other weird incentives. People become afraid to change things for fear of breaking something, or when something does break they try to hide it rather than feeling like they can immediately ask for and receive help to fix it.

The only time someone should be fired for causing an outage is if they're negligent or sloppy or mess things up all the time. This is rare. Almost always, outages in large systems are the combination of many factors — latent bugs, design flaws, abnormal load, etc., any one or two of which wouldn't take the site down. But when they combine in a perfect storm that nobody foresaw, things fall over.


But now you have someone on your team who will never, ever make that same mistake again and who should be your new go-to guy for all X-related changes (X being DNS or what-have-you). Firing someone with that type of experience does not lead to success.

100% of all devs make huge mistakes, at least once.


> But now you have someone in your team who will never, ever make that same mistake again and that should be your new go-to guy for all DNS related changes.

I'm not entirely sure that's always true. For example, I've seen people introduce N+1 issues into a codebase, spend evenings fixing them and refactoring code to fix production issues... only to later introduce those very same types of issues.

Sure, you can learn from mistakes, have post-mortems and so on (provided that your org even does those and that anyone listens to and cares about their conclusions), but to me the most foolproof way is to ensure that no one can make these mistakes again, be it with a checklist (which tends to be ignored, honestly), or better yet, an automated CI step or a new test suite.

In my eyes, it's basically the same as with unit tests - everyone agrees that you need them, but people rarely write enough of them. So if you introduce something that keeps people from skipping what they should do, e.g. a quality gate within a CI step that disallows a merge once coverage falls below a set margin, suddenly things are a lot better in the long run.
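
For anyone unfamiliar with the N+1 pattern under discussion, here is a minimal, hypothetical sketch using Python's built-in sqlite3 (table names and row counts are invented for illustration):

```python
import sqlite3

# Toy schema: users and their orders, entirely made up for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
""")
conn.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(100)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 100, 10.0) for i in range(500)])

def totals_n_plus_one(conn):
    """N+1 pattern: one query for the users, then one more per user."""
    queries = 1
    totals = {}
    for (uid,) in conn.execute("SELECT id FROM users"):
        queries += 1
        totals[uid] = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (uid,)).fetchone()[0]
    return totals, queries

def totals_batched(conn):
    """Same answer from a single aggregate query."""
    rows = conn.execute(
        "SELECT user_id, SUM(total) FROM orders GROUP BY user_id")
    return dict(rows), 1

slow, n_slow = totals_n_plus_one(conn)
fast, n_fast = totals_batched(conn)
assert slow == fast  # identical results, but 101 round-trips vs 1
```

Both functions return the same totals, but the first issues 101 round-trips where one suffices - exactly the kind of regression a load test or a query-count assertion in CI can catch automatically.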


N+1 issues aren't nearly as devastating as N^2. Commend them for not putting your systems to a complete halt, then teach them how to reason about this properly.

>> a quality gate

Yes, this, also.


> N+1 issues aren't nearly as devastating as N^2.

Depends on the project, I guess: if you're unlucky enough to be working on a monolith and suddenly a page takes 5,000 SQL queries to load as opposed to 100, because someone thought that initializing data through service/DB calls in a loop is "easier" than writing views in the DB, it might still kill the entire system anyway, depending on the number of users.

And once this data initialization is sufficiently complicated and convoluted for you not to be able to rewrite it, and them not wanting to rewrite it, all while "the business" is breathing down your neck, you might either want to introduce caching (and possibly run into cache-invalidation problems down the road), or just freshen up your CV.

I'd also like to expand on the previous suggestion and advise others to consider performance/load testing as well, especially when coupled with APM solutions like SkyWalking or even Matomo analytics, both of which let you aggregate historical page load times, CPM, and the overall performance of your applications, to figure out what went wrong and when.


Still, that engineer (if it's the engineer's fault) is extremely unlikely to make that mistake again. IMHO, the problem is systemic: why does the system allow such errors (if it's human error) to happen at all? Given Google's scale, I think a lot of the generally known scenarios are covered, and what you see is tens of services interacting in non-obvious ways. Those non-obvious scenarios manifest in situations like this.


It makes no sense at all. After the outage you have not only a review of the causes and appropriate remedies, but also more experienced people who are now more aware of possible consequences of seemingly unrelated actions and will take extra care not to make these things happen in the future.

Also, such cases are rarely the "fault" of a single person. Or, the direct/immediate cause is often not the main one.


I felt a great disturbance in the force, as if millions of pagers suddenly cried out in terror


Is it me or are big provider outages getting more common these days?


As time passes, there are more big cloud providers, and each of them individually gets more complex (more products).

Assume the chance of any single outage is static. If there is one (highly reliable and trusted) provider with five products one year, and three providers with ten products each the next year, the chance of you seeing an outage has gone way up purely because of surface area.
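
The surface-area point can be sketched with back-of-envelope probability; the per-product outage probability below is purely illustrative, not measured from any provider:

```python
# If each product independently has monthly outage probability p,
# a user touching n products sees at least one outage that month
# with probability 1 - (1 - p)**n.  p and n are made-up numbers.
def p_any_outage(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(f"5 products:  {p_any_outage(0.01, 5):.1%}")   # one provider, five products
print(f"30 products: {p_any_outage(0.01, 30):.1%}")  # three providers, ten each
```

Even with each product unchanged at 99% monthly reliability, simply depending on six times as many products makes a visible outage several times more likely.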


More reason to promote self-hosting and decentralised systems like SMTP, matrix.org, and ActivityPub. Having all the data and servers concentrated in a few players such as google, amazon, ccp is not reliable for the digital operation of the planet.


The question is: should we trust small, underfunded, hobbyist servers more than large corporations that have a money-driven reason to maintain high quality of services?


for the average end user: heck no.

for those with budgets to hire server administrators (or pay for third-party managed services)? yes.

second category includes almost anyone with a large following like institutional users.

hell the incumbent social media service operators can white-label their existing software and sell this as a service.


Not really: many open source projects such as Nextcloud and Mattermost are easy to self-host, or the hosting provider offers them as an app.


Yes. Like some sort of workspace? Google could make one. I wonder what they'd call it?


I’m sure Google wonders the same thing, given that it’s had like five different names over the years.


Disclaimer: am Googler.

Self-hosting is a good option, especially when mixed with multi-cloud offerings.

The big rub is that it's really hard to approach the same level of availability the cloud offerings already give you. Depending on workload, self-hosting is typically more expensive.

This is why the SRE book talks about availability budget. You can't have 100% uptime, so how much do you want to pay to get close to it?
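
The availability-budget arithmetic is simple enough to write down; the SLO values below are generic examples, not Google's actual targets:

```python
# Downtime allowed by an availability SLO over a 30-day month.
# E.g. "three nines" (99.9%) leaves roughly 43 minutes of budget.
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    return days * 24 * 60 * (1 - slo)

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {downtime_budget_minutes(slo):6.1f} min/month")
```

Each extra nine cuts the budget by a factor of ten, which is why the cost question in the comment above matters: the engineering effort to earn the next nine grows much faster than linearly.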


Except that Google services don't run within GCP, but apart from it (mostly on Borg, I'd guess).


Their internal platform is likely analogous in terms of accumulating complexity (and a bit of cruft) over time though


If anyone else is curious, Borg is Google's cluster management software: https://kubernetes.io/blog/2015/04/borg-predecessor-to-kuber...


The above applies to cloud as understood by techies just as much as to B2C cloud as understood by laypeople.


I think it's just you, probably recency bias. I recall the days of Gmail being globally down for hours, more than once (3 times, IIRC), back in 2009.


I also can't think of the last time I saw Twitter's Failwhale, which almost became their default mascot at one point.

(Yes, I know they discontinued it, but I haven't seen the service "over capacity" in some time. I say this as a casual web user of it, not an active account holder or app user).


Twitter had a global outage this year (April 16).

Facebook had an October outage.

This happens quite regularly with distributed and complex systems.


Almost like bad weather. Perhaps we'll have cloud outage forecasts at some point.


I think they're reasonably common; it's just becoming more of a thing to notice.


There’s always this question on every “X is down” post on HN.

Quite boring commentary tbh.


ask anybody who researches complex and distributed systems and see what they say.


Outage reports in UTC, amazing!

(glares at AWS)


I want to smack whoever invented time zones.

Everyone is bad at them. Trading desks, international trade offices, and space-related offices all have a set of clocks on the wall, because even really smart people just suck at figuring out time zones. I even use https://everytimezone.com/ in lieu of a set of clocks.

What a needless complication. I wish we could all just switch to UTC and stop daylight saving time.
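
The usual programmer's coping strategy is exactly that: keep everything in UTC internally and convert only at the display edge. A small sketch with Python's standard zoneinfo module (the timestamp is made up):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# Store and compare in UTC; render per-zone only at the edges.
incident_start = datetime(2021, 11, 12, 9, 30, tzinfo=timezone.utc)

for tz in ("America/New_York", "Europe/Paris", "Asia/Tokyo"):
    local = incident_start.astimezone(ZoneInfo(tz))
    print(f"{tz:20s} {local:%Y-%m-%d %H:%M %Z}")
```

The wall-clock rendering differs per zone, but the single UTC instant is what goes in logs and outage reports, which is why UTC-stamped status pages are so much easier to reason about.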


I can't imagine anyone taking kindly to a change as big as eradicating time zones. Get rid of DST, yes, but only if the U.S. goes up an hour (the extra sunlight at 6pm during DST is nice).


I can imagine hordes of programmers everywhere taking kindly to it. It's a simpler system.


Hahaha. This one made me laugh hard.


No it will make you laugh in about 30 minutes. Oh wait, DST. Never mind.


And many people blame their internet provider, as Google outages are rare.


Now that I think of it, I might have blamed my provider or at least restarted my modem if Hacker News was ever down...


It’s the perfect connection test.

9 times out of 10 if you can’t connect to hn the problem is you, not them. Probably even 99/100


I usually check the HN website to see if my internet is down, both on my mobile and on my work computer. I used to just type "test" in the browser address bar hoping for the related Google SERPs but checking for HN seems way faster nowadays.


HN is down pretty frequently, but I think it shows custom error page served from Cloudflare.


I don't think it's that so much as the fact that half of the internet services that rely on Google somewhere along the line (but not visibly to the users) are down, at least for me. And yet the Google search page can be reached (again, speaking from my European perspective, sample size of one). To make it even more confusing: these issues are limited to my WiFi. If I use my phone connection as a hotspot the problems disappear.

So from the perspective of a regular person it is very reasonable to conclude that it is their internet provider that is down.


And I think no internet provider has such a good name that an outage on their part would be unheard of - much less so than Google.


People and software: when Facebook was down, my mobile thought the internet was unavailable.


ITT: "My Raspberry Pi has been up for 2 years with almost no downtime, what is up with these suckers?"


It's valid criticism for projects that don't need the scale or features that cloud-based solutions offer.

A single machine won't give you the same level of scaling and management features, but also won't have hundreds of distributed moving parts that could break and take your service down as a side-effect.


I feel like you are selling the cloud short for SMBs. The majority of my career has been spent working for shops that have their own datacenters or colo. It's so freaking cheap to run a huge service it's not even funny. And it's a dream to have so much compute and storage floating around that you can "waste" it without even thinking -- it's already paid for!

However, when you're small the economics flip. You want to run as little as possible, as cheap as possible. And a load balancer pointing at two ASGs in different availability zones is by far the cheapest way to get something production-ready on the internet. You can go from your garage to a $100M business and probably never need to graduate from this architecture.


"For a Linux user, you can already build a Google yourself quite trivially with wget -R and grep."


Yep, we're being hit by that right now. It's not a total outage; we're losing about 10-15% of our calls to BigQuery from within Cloud Functions. What we have going through a VPC connector is OK. Fortunately the areas affected are ancillary (mainly monitoring, ironically); our service is still running.


Degoogled European here. Genuinely didn't even notice. Google can fuck off.


Probably related to the Google Cloud "service disruption" that started earlier this morning: https://status.cloud.google.com/incidents/1xkAB1KmLrh5g3v9ZE...

It was very frustrating to see DNS working perfectly fine (since DNS is always the problem) and the connections just timing out.


reCAPTCHA and Translate are also down. Even Search is affected; it is super slow.


Yeah had recaptcha issues too


And Meet. Starting meetings is super slow; sometimes it times out.


Let me crosslink to invidious tracker:

https://github.com/iv-org/invidious/issues/2577


The Google IMAP service has also been unstable in South Africa this morning.


Google Forms definitely has issues, too. I just had an issue where I couldn't fill out a form.


I do wonder how many companies and businesses got affected by the outage.



We just released a rewrite of a (web) app today, so lots of support calls and mails. Really unfortunate timing, hah.


So what you're saying is, you brought down Google with your rewrite. :)


EU to AI:

Here's a €2.4 billion fine for you.

AI to EU:

OK, human.


I was also getting 500 errors in the GCP console


Looks like it might be resolved now.


Docs was a bit flaky earlier on.


Maybe they are just upset because they got fined. It's common among big companies.


Or maybe this is unrelated and it's just one of the many cloud outages contradicting their uptime promises. Welcome to the reality of cloud computing, where everything's foggy and the hallucinogenic fumes made us believe downtime was a thing of the past.


I guess the network admin who skipped the classes on Network Fundamentals, has moved on from their recent placement in OVH, to Facebook, then to Google.

Next stop AWS, firewall maintenance division?!


I do laugh at people that say "Using the cloud means less downtime than your own server", then things like this come along :D


It’s still less downtime than your own server and you have hundreds of engineers 24/7 ready to diagnose and fix issues


Really? Downtime of Google so far this year: more than 0 minutes. Downtime of my own server: 0 minutes.

Of course I'm not trying to run a massively scalable service coping with millions of customers, because I don't need that.


You think that's how statistics works? Obviously it's possible to keep your own server running with 0 downtime. But you are exposed to a much higher risk of severe downtimes, longer than a cloud provider's would likely be. Be it hardware failure, grid, ISP, whatever.


Claim was

"It’s still less downtime than your own server"

Claim disproven


That claim was a direct snippet from your original comment. So unless you meant that you laugh at people who are talking specifically about your personal server, I'd say you are being selectively literal.


Cloud kool-aid drinkers say "The cloud is always better"

I say that in many cases your own server is better. My company runs its own on-prem Confluence; it's been taken down for updates on a regular basis at a known maintenance time. That's far better than losing it because a cloud-based one was hosted on GCP this morning when we actually need the data

Obviously in many other cases the cloud is better. You wouldn't want to serve a million customers across the world from a single server in your own basement. That's not the only model.


I agree completely with that added nuance. Cloud infra is definitely overused and is sometimes seen as the only option these days, even for the simplest of projects. Having your own metal is often times much more convenient and cheaper. Plus it's much more enjoyable to work with.

My only gripe was with your gross over-simplification that read like "hurr durr, my server hasn't gone down this year, so self-hosting has better uptime than Google". It's such an unnecessary and baseless argument.


The vast, vast majority of arguments you hear on HN say it's impossible to run your own equipment. That's as ridiculous as saying you could run Dropbox off a Raspberry Pi, but it tends to get pushed to the top, and the schadenfreude when these events inevitably come along is too good to miss.

Every few months another major outage of another cloud provider hits the headlines, meanwhile millions of small companies have no problem with the uptime of their 'legacy' services.

I was at a farm a few weeks ago; the farmer had a server in a closet. It did break on occasion, when there was a power outage. His desktop and internet broke too, so what would be the point in his server working?

If it was hosted on google it wouldn't have been working this morning, despite his computer being fine.

If you build your business processes around accepting failure, it's not a problem. It's far easier to keep at least one out of 3 machines online for 99.999% of the time than to keep a single service running for the same time.
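To put rough numbers on that last point (my illustrative figures, not real availability data, and assuming independent failures, which real machines sharing power, network, and software bugs never quite are):

```python
# Back-of-the-envelope redundancy math under an independence assumption.
def combined_availability(single: float, n: int) -> float:
    """Probability that at least one of n machines is up,
    given each machine's individual availability."""
    return 1 - (1 - single) ** n

# One machine at 99% availability (~3.65 days of downtime a year):
print(combined_availability(0.99, 1))  # ~0.99
# Three such machines, needing only one alive at a time:
print(combined_availability(0.99, 3))  # ~0.999999, i.e. "six nines"
```

The point being: three mediocre machines beat one heroic one, provided their failures really are independent.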


To be fair, this is exactly how statistics work. The average downtime of self-hosted servers might be higher (arguably) but many people are under Google's yearly downtime (aka variance is high).


>It’s still less downtime than your own server

what makes you think so, actually?


In my experience, it's very easy to achieve high uptime through luck. If I only run a single server, and it has a 10% chance of failing in any given year, I have a 90% chance of achieving 100% uptime in a given year.
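The luck factor is easy to quantify. A quick sketch with the same made-up 10% annual failure rate:

```python
# Chance of a "lucky" 100%-uptime record, assuming an independent
# 10% failure probability each year (an illustrative number).
fail_per_year = 0.10
years = 5

p_perfect_one_year = 1 - fail_per_year
p_perfect_streak = p_perfect_one_year ** years

print(f"P(zero failures in 1 year)  = {p_perfect_one_year:.2f}")   # 0.90
print(f"P(zero failures in {years} years) = {p_perfect_streak:.2f}")  # 0.59
```

So even over five years, a coin-flip-ish majority of such servers never fail at all, and their owners conclude self-hosting is bulletproof.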

In my experience it's also very easy to think you've got all your bases covered when actually you haven't. I'm protected against mains power failures by a UPS and a generator - but a UPS switchgear fault can cut my power even without a mains power outage. My server has dual power supplies and network cards - but that won't help me if a clumsy worker sent to replace the server above mine unplugs mine by mistake. And so on.

If I think I'm doing a better job than a billion-dollar corporation with hundreds of thousands of servers, does that mean I am? Or is it more likely I'm fooling myself?


I also have a personal dedicated server that never crashed or restarted this year. However I am not sure how much it was actually available. I know for a fact that there were multiple network issues at OVH. I also know that had my server been home, it would have been worse (Optimum residential is awful).

The server not failing is not the only outage mode.


Indeed. People do not realize just how advanced the reliability infrastructure of those services is. Things like diesel power generators have been baked into cloud datacenters for what, a decade now? Probably longer. Show me your alternative power source when the power goes out (and power does disappear, everywhere, eventually).


Diesel generators have been baked into my on prem equipment room for at least 40 years

For your average person working in an average office, if the power goes out you're not going to be working anyway, so it doesn't matter if your server is offline too.


This so much.

And an excellent corollary to this is that when you have lucky 100% uptime there is no incentive to optimize mean time to recovery.

Sure the raspberry pi in your closet has been running fine for years, 100% uptime, but then a component fails. Do you have a replacement on hand? Are you continuously monitoring it to know it went down? The component failed at 3am, did it page you? Did you hop right out of bed to rush to fix it?

Single systems can have really nice uptime until they don’t. Then you are hoping that the people on hand can repair what’s going on after months or years of never having to do that. Mean time to recovery might be a week while you wait for new hardware or a few hours while you google some error message you’ve never encountered.

People can run their own systems if they want to, but they shouldn’t confuse good luck with rigorous engineering.


Do you have some external ping test running continuously? If not, you really have no idea.


Of course; how else would I know the exact failure times? My service monitoring only polls every few minutes. That doesn't tell me how my service is performing, though; I'm not offering a ping service.

More importantly, I do know I had an error loading Twitter at Thu 11 Nov 04:17:01 GMT 2021; however, my websites (and Google and Hacker News) were working. At 17:53:01 GMT Twitter took over 2 seconds to load its first HTTP page, far beyond the normal 400ms. Google took 23 seconds to load this morning, BBC News just 0.071.
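For anyone curious, the kind of external probe described above is a few lines of Python. This is a minimal sketch (the URL is a placeholder, and a real setup would run it on a schedule from a separate network and log results):

```python
# Minimal external availability probe: time the first HTTP response,
# record failures. The target URL here is a placeholder.
import time
import urllib.request

def probe(url: str, timeout: float = 10.0):
    """Return time-to-first-response in seconds, or None on failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers DNS errors, refused connections, timeouts
        return None

elapsed = probe("https://example.com/")
if elapsed is None:
    print("DOWN at", time.strftime("%Y-%m-%d %H:%M:%S"))
else:
    print(f"up, first response in {elapsed * 1000:.0f} ms")
```

Run it from cron every minute or two and you have exactly the sort of timestamped failure log quoted above.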

On the other hand I also need to provide services which can't cope with outages measured in milliseconds. Good luck complaining to a cloud provider that your traffic vanished for 4 seconds. Those services thus have multiple connections on independent hardware and circuits, with no single point of failure


This is so misinformed by industry propaganda. Modern cloud services are very often unavailable from a region without any apparent reasons. Or the service appears to be available but some specific feature doesn't work. Or it's available but just really slow or dropping packets.

When you have a "simple" (already quite complex) BGP(TCP(HTTP)) tunnel, chances are things just work and it's easy enough to diagnose issues. Between anycast, auto-scaling, and WAFs (among others) you have added so many layers of complexity that the chance of a "random" error somewhere in the stack has dramatically increased and diagnosis is close to impossible.

It used to be simple: web services were either up or down. Now it's not so obvious anymore, and that's definitely not reflected in the five-nines SLAs falsely promised by cloud providers. I can say with confidence that most self-hosted systems I've been close to have much better uptime than modern cloud services and are much cheaper to service and maintain.


I agree with specific dimensions of this sentiment.

Firstly, I think that the GP comment ever so slightly misapplies the localized awareness available in smaller environments to the cloud. Yes, there are tons more engineers to look at stuff, but those engineers are a) already bogged down keeping up with internal infra, politics and bureaucratic machinery, and b) quite some distance further away from actual errors occurring on the ground because everything's aggregated to the hilt so the stats remain comprehensible at the higher scale.

I also agree that the newer stuff that is less mature has an order of magnitude more intrinsic leaky abstractions than old designs, which I do think were more hygienic and espoused more effective separation of concerns than what is used today.

Also, not to nitpick, but the "classical" canonical definition would probably look closer to BGP(TCP(TLSv1.3(HTTP/1.1))), with the modern equivalent being BGP(UDP(HTTP2)) and the future being BGP(UDP(QUIC)). I do agree that the rapid consolidation of HTTP and TLS, without wide-scale awareness and slow, methodological development of general introspection tooling, does make things net worse in general. I suspect infra will still be using HTTP/1.1 for a long time into the future until this materially changes.


Right?

My private website on my RPi has been running for 2 years now without a problem, with only minimal downtime from rebooting for new kernels.

It is amazing how much uptime you can achieve with a $5 computer in comparison to a $1,730,000,000,000 ($1.73 trillion) company. Even if you compensate for dynamic content.


Maybe Google's servers are also running without a problem and you can't reach it for a different reason. Self-reported uptime is not a good measure of server availability.

Was your home internet available all of the time? How many times did you reboot your modem?


(Google is currently processing multiple terabits of traffic per second using several billion dollars' worth of distributed infrastructure.)


And yet it doesn't mean his RPi-powered website needs to ever sustain that kind of traffic, so he doesn't have to take the additional risk of running a distributed system.

In contrast, in the cloud you typically take on the risk of the underlying distributed platform (optimized for managing thousands of VMs, etc) even though you only need a single machine for yourself.


Actually it's not, and that's the problem


FWIW I vaguely recall some kind of website that showed approximate bandwidth use... it was something like a cross between statcounter and alexa except for network stats and the like.

I think it purported Google to be hovering around 15Gbps. That sounded humongously wrong, like I'd expect global traffic <-> Google to at least be a couple terabits, right?

I'd be very curious if there's a way to actually ballpark the number. Or maybe it is in fact possible to implement services that reliably track this sort of thing, and my futile searches earlier just weren't finding them...


Yeah, but Google also has 27,169 software engineers.


Didn't even notice. Work doesn't use any GCloud components, and personally it's been a while since I degoogled myself. No Google Docs, mail or search. Not even the quad-8 nameservers.


Not sure why you got downvoted. I'm also a degoogled European and didn't even notice.


Downvoted from people who are jealous, perhaps?


Seems that Google is no longer the company it used to be advertised as. The myth prevails though.

It's a pity, as some of its services are frankly among the best, and I depend on them.


> Seems that google is not anymore the company it used to be advertised.

Do you mean about the "Don't be evil" slogan it started with? Even back then, it was a pretty obvious move for any movie villain.

Or do you mean you actually bought into the reliability promises of cloud providers thinking downtime would not exist anymore?


Due to their reliability, and over the top: it always overdelivered during its first decade. If anyone read any of the books from its founders, or the papers and stories related to the company and them, it seemed to be the new "non zero day" infrastructure. However, the longer the company is alive, the more it feels just like any other.

Unrelated: not sure what has been going on on HN for a few months, but almost all comments I write get automatically downvoted, even little things that no one should really care much about.



