Hacker News
Wikipedia Outage (wikimediastatus.net)
39 points by fekle on Feb 22, 2023 | 48 comments


Wikipedia is one of those services whose reliability I have never thought about, simply because it always seems to work. Very impressive, especially considering that their tech stack doesn't appear to be the most modern, nor (in the case of PHP) especially known for its robustness.


They do have a pretty complex infra with redundancy, extensive caching, the whole works. But I think they use a relatively boring, reliable stack throughout. It doesn't look like they run a lot of GARTNER MAGIC QUADRANT Cloud Solutions to Supercharge™ their web-scale data blazingly.

On the plus side solid stable software can run for decades without incidents, as long as the hardware is willing. The main negative is that it's not as Supercharged™ as the latest cloud product with a 99.9% SLA, which itself relies on 5 other internal microservices with 99.9% SLA each.


I doubt anyone on HN takes Gartner seriously; as James Plamondon (Microsoft) wrote:

> Analysts sell out - that's their business model. But they are very concerned that they never look like they are selling out, so that makes them very prickly to work with.

The question is, who is gullible enough to still believe Gartner has any value? You'd think by now even the most clueless of MBAs would have caught on.


5 internal microservices? I thought the Supercharged™ had deployed 256 microservices last week alone.


Lots of valid criticisms exist for PHP, but "not robust" isn't one I would make.

(Lots of the criticism is exaggerated as well.)


> nor (in the case of PHP) especially known for its robustness.

It's 2023, can this pointless PHP bashing please stop once and for all if you're not talking about Wordpress?

In some cases, "keep it simple, stupid" is the way to go - MediaWiki has mastered this.


I'm a big fan of "boring" technology for this reason. Huge knowledge base behind the existing stuff. No real need to change.


Remember that they mostly serve static content to non-logged-in users.

As long as the cache layer is working, most people wouldn't notice an outage.
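The split between cached anonymous traffic and uncacheable logged-in traffic can be sketched in Varnish's VCL (the Grafana dashboards linked elsewhere in this thread show Varnish at their edge). This is an illustrative fragment only, not Wikimedia's actual configuration; the cookie name and TTL are invented:

```vcl
# Illustrative sketch, not Wikimedia's real VCL: anonymous page views are
# cacheable; a session cookie marks a logged-in user who must bypass cache.
sub vcl_recv {
    if (req.http.Cookie ~ "session") {
        return (pass);         # logged-in: always go to the app servers
    }
    unset req.http.Cookie;     # anonymous: strip cookies so pages cache
}

sub vcl_backend_response {
    set beresp.ttl = 1d;       # keep cached pages for up to a day
}
```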


Maybe that’s why it’s so reliable.


Thought the same thing. Finding the root cause is easier when you don't have to debug dozens of abstraction layers (KVM, Containerd/Docker, Kubernetes, etc).


Operating at this scale, you can assume that all the things you mentioned do exist :)


It's all public: https://wikitech.wikimedia.org/wiki/Wikimedia_infrastructure

You can even get access to devops/SREs tickets (and IRC messages)


Docker? You’re probably right. K8s? Wouldn’t be so sure. Would like to hear from the horse’s mouth though.

Edit: looks like it’s used. Wonder if it’s more tooling than production traffic. No time to dig further.



You’re implying serious and useful work can be done in a stable tech stack? Preposterous.

Seriously though, not a fan of php at all but the js tooling is rocket science in comparison.


> Seriously though, not a fan of php at all but the js tooling is rocket science in comparison.

laughs in left-pad, Webpack, Grunt, ...

JavaScript tooling absolutely sucks - even for a moderately sized project, `npm install` can take many minutes, often enough because some native code has to be compiled. Webpack builds can take even longer.

In contrast, with PHP you run `composer install` and it works, no bundling or whatever required.


I think that’s what they meant - “rocket science” as in “way too complicated for 99% of down-to-earth jobs”


Back when I was doing serious work in PHP, it was FTP (yes, no 's') the PHP files onto an Apache box and that was it! Go went the same way, at least.


The Foundation has invested a lot in reliability over the last decade. Nowadays it is in a great place, but about 10 years ago it was up and down all the time.


Uh? If anything wikipedia is robust because their tech stack isn't modern.

Also php is robust as fuck


Wikipedia is maybe the most important open source project ever (competing only with Linux, but that operates at a more hidden, technical level).

This (non)episode of "wikipedia outage" is as good a prompt as any to get our collective human chatGPT going :-). Here is my take:

Wikipedia's world-changing impact has been stifled in the era of centralized walled gardens regurgitating memes and ads, and isn't really future-proof. Yes, you can run your own MediaWiki, but 1) it's not very easy or pleasant at all, 2) knowledge base federation [0] is still a distant dream, and 3) its whole stack is nowhere near being integrated with the other open source tools people have at their disposal (heck, even the wiki markup is now a lone exception in the markdown monoculture).

Somehow the next decade of wikipedia should look much more decentralized, resilient and integrated and that could be a sublime next level.

[0] https://www.mediawiki.org/wiki/Wikibase/Federation


It's unclear there's substantial additional value in it being decentralized.


There is vastly more (and ever-exploding) knowledge than the current Wikipedia model can integrate (talking always about public knowledge).

Unless there is a breakthrough in how the open source ecosystem integrates, validates, and federates public knowledge, somebody else will claim that they can better "organize the world's information" (using, e.g., "AI") and we will probably not like the new state of the world.

I think there is substantial value in preventing that outcome. NB: I use the term decentralization not just as a technical database/server architecture but also as a means of engaging far more resources.


> There is vastly more (and ever-exploding) knowledge than the current Wikipedia model can integrate (talking always about public knowledge).

Sure, but there are two vastly different "distributed" architectures at stake here:

- the social one, as in everyone can contribute easily in a positive, constructive manner, with an optimized user path for each user level and interest, and no worries about any form of online harassment/bullying, or offline threats like the-angry-party-staring-at-you that might make you vanish for contributing the-wrong-thing-under-the-bad-perspective

- the technical one, which is mainly about "no single point of failure"

Sure, there are things that largely overlap, but they are still two vastly independent sets.


> Sure, there are things that largely overlap, but they are still two vastly independent sets.

Yes, they are by no means the same aspect, but the technical domain (the design part) and the "social" domain (how human actors interact using the technology, what incentives they have to contribute, how easy it is to contribute, how consensus forms, etc.) are never very far apart.

My argument is a general one. If, say, ten years from now wikipedia is to have the same quantum leap impact and positive role it had twenty years ago (when it first got going) it will have to be much more ambitious. But that ambition is unlikely to be served well by a forever centralized service because most knowledge is produced, stored and consumed in distributed ways.

Just for concreteness, think of all the other sources of public knowledge: from OpenStreetMap to research paper archives, museum collections, and many other public institutions of all kinds (laws and regulations, public statistics, etc.). Right now everything must come to a centralized store, but the size of Wikipedia is already flattening [0], and a lot of that data never gets referenced, or referencing it is a very cumbersome manual job.

In any case, as per the link in my other comment, some federation is part of the plan (wikibases), and there is the broader "linked data" movement. It's just that at some point those ideas have to get real legs and start walking :-).

[0] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia


Indeed, I'd rather have a single version of a truth most people can agree on, rather than thousands of bubbles each spewing their own agenda.


That would be one of the many issues to resolve in a federated context. But achieving "consensus" is never trivial and compromises and choices underlie its current model too.


Decentralization implies more complex failure modes and unclear paths to resolve them. Not sure how that'd help.


That's a strange assertion when the internet was famously invented to eliminate complex failure modes. Quoting from "DARPA and the Internet Revolution" by Mitch Waldrop [0]:

"Finally, Roberts decided to make the network completely decentralized, with no one master computer responsible for sorting the packets and routing them to their destination. Such a Grand Central Station approach would have been much simpler to implement, Roberts knew. But one blown transistor could have taken the whole network down.

So instead, he decreed that the ARPA sites would be linked in a complex pattern rather like the Interstate highway map, and that each site would share the routing responsibilities equally. That is, the computer there would read the digital address on each packet as it came in, accept that packet if the address was local, or else send it off again on the next stage of its journey.

This “packet-switching” approach would mean more complexity and more programming during the set-up phase. But the final system would be far more robust: No one failure could bring it down."

[0] https://www.darpa.mil/about-us/timeline/modern-internet


Seems like they have fixed it, but it doesn't hurt to mention Kiwix. Download a full copy of Wikipedia as a .zim file, and you're good to go even in low-connectivity scenarios.

The instance of Wikipedia (and other sites) that I'm hosting can be tried out here if you wish: https://kiwix.ounapuu.ee


You can even set up a local kiwix-serve instance as a web server to serve up the ZIM for those who don't have Kiwix installed (or space on their devices; the full English-language Wikipedia with images is about 82GB).


That’s what my setup is running off of, just a kiwix-serve command pointed at a folder of .zim files and exposed through nginx.

I have a write-up of the technical details as well: https://ounapuu.ee/posts/2021/12/09/self-hosting-wikipedia/
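For anyone wanting to replicate the basic shape of such a setup, a minimal sketch (port, path, and address are invented examples, not the values from the write-up):

```shell
# Serve every ZIM archive found in /srv/zim on a local port
kiwix-serve --port=8090 /srv/zim/*.zim
```

A plain reverse-proxy block in front (e.g. nginx with `proxy_pass http://127.0.0.1:8090;`) then handles TLS and the public hostname.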


Wow, this project is super cool. Thanks for sharing it.


FYI they also have a nice public Grafana instance at https://grafana.wikimedia.org

It has all the info about HTTP requests, DB info, CDN, etc.

The wiki page about this https://wikitech.wikimedia.org/wiki/Grafana


Interesting. I kinda expected more traffic. 20 edits/s, 100k requests/s. One tuned box could handle that.


Not sure about the edits, but the title of the requests graph including "Varnish frontend CDN" and the query ("job_method_status:varnish_requests...") imply to me that those are the requests from a frontend cache (Varnish) or series of caches to the backend, rather than all requests.

(I'm not 100% sure you intended to suggest Wikipedia only gets 100k requests/s, I just thought it was an interesting discussion point.)


Based on all the monitoring I've ever seen, the stats for Varnish would be the requests into Varnish.

If you wanted actual requests into the app servers you would look at those stats, and from that site, it's like ~5k req/s.

https://grafana.wikimedia.org/d/RIA1lzDZk/application-server...

I'm not sure why you think 100k req/s is low for Wikipedia
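Incidentally, taking those two figures at face value (100k req/s at the edge, ~5k req/s reaching the app servers), the implied cache hit ratio is easy to compute:

```python
# Back-of-the-envelope from the two request rates quoted above
edge_rps = 100_000  # requests/s into the Varnish edge
app_rps = 5_000     # requests/s reaching the app servers

# Fraction of requests the cache layer absorbs
hit_ratio = 1 - app_rps / edge_rps

print(f"implied cache hit ratio: {hit_ratio:.0%}")  # 95%
```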


Feb 22, 2023 - 09:25 UTC : Investigating - Issues with accessing some wikis

Feb 22, 2023 - 09:36 UTC : Monitoring - A fix has been implemented


It's already back up. Was this post really necessary?


I'm not sure the person who submitted this could have known the service would be restored, however imminent the restoration turned out to be.

It's an interesting event (albeit a relatively low-impact one) which we would not have known about if it hadn't been posted here. Also, check out the Wikimedia service status page, pretty cool, huh?


Indeed. I not only find it impressive how honest it is (as opposed to others[1]) but also how quickly they resolved the event.

[1] - https://ably.com/blog/honest-status-reporting-aws-service


The incident took ten minutes to resolve. What's the point of immediately rushing to HN whenever a large website is transiently down?


These kinds of posts are interesting to learn more about the architectures and failure modes at play.

I agree that HN should not be a down detector but for a high profile service with infrequent outages like Wikipedia in this instance, a post with discussion is acceptable in my eyes.


Tangential: what is the current funding status of the Wikimedia Foundation related to the donation banners that have started popping up on Wikipedia a day ago or so?


Running the organization is expensive. According to Wikimedia, base salaries are 20,000 to 30,000 USD per month or so. It has to be paid somehow.

https://meta.wikimedia.org/wiki/Wikimedia_Foundation_salarie...


It switched from a charity-like operation, with barebones staffing and incredible benefits for users per dollar spent, to a 'splash the cash and hope' strategy.

I personally don't want my money spent like that, so I don't donate.



From the page:

In 2005, they were doing 1.6 billion pageviews monthly, with a bandwidth budget of $5000/mo and a staff of 1.

By 2015, their pageviews had gone up 11x. Their bandwidth costs had gone up 33x, and their staffing costs 1250x.

In my view, both bandwidth and staffing costs should be nearly linear with pageviews - and ideally a little sublinear.
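The sub/superlinear point is easy to check from the quoted multipliers alone. A quick back-of-the-envelope, using only the growth factors above (not absolute dollar figures):

```python
# Growth multipliers quoted above, 2005 -> 2015
pageview_growth = 11    # monthly pageviews: 1.6B -> ~17.6B
bandwidth_growth = 33   # bandwidth spend
staffing_growth = 1250  # staffing spend

# Under linear scaling, each cost-per-pageview ratio would be ~1.0
bandwidth_per_view = bandwidth_growth / pageview_growth  # 3.0
staffing_per_view = staffing_growth / pageview_growth    # ~113.6

print(f"bandwidth cost per pageview grew {bandwidth_per_view:.1f}x")
print(f"staffing cost per pageview grew {staffing_per_view:.1f}x")
```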



