Lichess: Post-Mortem of Our Longest Downtime

carlsborg|1 year ago

The main lichess engine (lila, open source) is a single monolith program that's deployed on a single server. It serves ~5 million games per day. But there are a several other pieces too. They discuss the architecture here https://www.youtube.com/watch?v=crKNBSpO2_I

BTW consider donating if you use lichess.

justinclift|1 year ago

Wow. ~US$40k/mo running costs, with about US$5k/mo for server hosting:

https://lichess.org/costs

It looks like the servers are individually managed via OVH or similar, rather than running their own gear in co-location or similar. Wonder why?

squigz|1 year ago

https://lichess.org/patron

hilux|1 year ago

I'm a patron!

I really appreciate the benefits package for patrons. Thibault is zee best.

jefurii|1 year ago

There's also a nice architectural diagram on their GitHub page: https://github.com/lichess-org/lila

theideaofcoffee|1 year ago

I guess some of my questions are addressed in the latter half of the post, but I'm still puzzled why a prominent service didn't have a plan for what looked like a run of the mill hardware outage. It's hard to know exactly what happened as I'm having trouble parsing some of the post (what is a 'network connector'? is it a cable? nic?). What were some of the 'increasingly outlandish' workarounds? Are they actually standing up production hosts manually, and was that the cause of a delay or unwillingness to get new hardware goin? I think it would be important to have all of that set down either in documentation or code seeing as most of their technical staff are either volunteers, who may come and go, or part timers. Maybe they did, it's not clear.

It's also weird seeing that they are still waiting on their provider to tell them exactly what was done to the hardware to get it going again, that's usually one of the first things a tech mentions: "ok, we replaced the optics in port 1" or "I replaced that cable after seeing increased error rates", something like that.

trod123|1 year ago

You are not wrong that this is puzzling, especially when viewed through the perspective lens of a professional with background in these areas (10 years).

There are many red flags which beg questions.

That said, I stopped taking them at their word years ago, this isn't the first time they've had dubious announcements following entirely preventable failures. In my mind, they really don't have any professional credibility.

People in the business of System Administration would follow basic standard practices that eliminate most of these risks.

The linked post isn't a valid post-mortem, if it were it would contain unambiguous details of the timetables and specifics, both of the failure domains and resolutions.

As you say, a network connector could mean any number of things. Its ambiguous, and ambiguity in technical material is used to hide or mislead most times which is why professionals detailing a post mortem would remove any possible ambiguity they could.

It is common professional practice to have a recovery playbook, and a plan for disaster recovery for business continuity which is tested at least every 6 months, usually quarterly. This is true of both charities and business.

Based on their post, they don't have one and they don't follow this well known industry practice. You really cannot call yourself a System Administrator if you don't follow the basics of the profession.

TPOSNA covers these basics for those not in the profession, its roughly two decades old now, it is well established, and ignorance of the practices isn't a valid excuse.

Professional budgets also always have a fund for emergencies based on these BC/DR plans. Additionally, using resilient design is common practice; single points of failures are not excusable in production failure domains especially when zero-downtime must be achieved.

Automated Deployment is a standard practice as well factoring into RTO and capacity planning improvements. Cattle not Pets.

Also, you don't ever wait on a vendor to take action. You make changes, and revert when the issue gets resolved.

First thing I would have done is set the domain DNS TTL to 5 minutes upon alerted failures (as a precaution), and then if needed point the DNS to a viable alternative server (either deployed temporarily or running in parallel).

Failures inevitably happen which is why you risk manage this using a topology with load balancers/servers set up in HA groups, eliminating any single provider as a single point of failure.

This is so basic that any junior admin knows these things.

Outlandish workarounds only happen when you do not have a plan and you are dredging the bottom of the barrel.

holsta|1 year ago

This response and post-mortem is superior to most commercial services I have seen in recent years.

hyperbovine|1 year ago

That's basically every aspect of their service. The founder Thibault Duplessis is criminally undercompensated (his choice) for running a site that is better designed, faster, and more popular than 99% of commercial websites out there.

nomilk|1 year ago

Exact same thought went through my head. Also note in the first few paragraphs they acknowledge the worst impacts to users. That's very selfless - often corporate postmortems downplay the impact, which frustrates users more. Incidentally, a critical service I use (Postmark) had an outage this week and I didn't even hear from them (I found out via a random twitter post). Shows the difference.

redbell|1 year ago

> so you, as our beneficiaries and stakeholders, who support us and encourage us — deserve to get clarification on what happened

Is it that complicated for big tech to reply politely with the above statement when they suddenly disable your account for no obvious reason!

morgante|1 year ago

The post-mortem is honest, but the infrastructure is well below what I'd expect from commercial services.

If a commercial provider told me they're dependent on a single physical server, with no real path or plans to fail over to another server if they need to, I would consider it extremely negligent.

It's fine to not use big cloud providers, but frankly it's pretty incompetent to not have the ability to quickly deploy to a new server.

ctippett|1 year ago

Once the private link was reestablished, could they not have tunneled out to the internet via another server acting as a sort of gateway?

Disclaimer: I'm not a network engineer so I may be misunderstanding the practicality and complexity of such a workaround.

unknown|1 year ago

[deleted]

lazyant|1 year ago

summary for the lazy: OVH

53 comments