
Twitter was down

695 points | idlewords | 6 years ago | status.twitterstat.us | reply

485 comments

[+] rossdavidh|6 years ago|reply
Ok, this is too many high-profile, apparently unrelated outages in the last month to be completely a coincidence. Hypotheses:

1) Software complexity is escalating over time, and logically will continue to do so until something makes it stop. It has now reached the point where even large companies cannot maintain high reliability.

2) Internet volume is continually increasing over time, and periodically we hit a point where there are just too many pieces required to make it work (until some change to the infrastructure solves that). We hit such a point when dialup was no longer enough, and we solved it with fiber. Now we have a chokepoint somewhere else in the system, and it will require a different infrastructure change.

3) Russia or China or Iran or somebody is f*(#ing with us, to see what they could break if they ever needed to apply leverage, for example, to get sanctions lifted.

4) Just a series of unconnected errors at big companies

5) Other possibilities?

[+] bdd|6 years ago|reply
#4

I work at Facebook. I worked at Twitter. I worked at CloudFlare. The answer is nothing other than #4.

#1 has the right premise but the wrong conclusion. Software complexity will keep escalating until it drops, through either commoditization or redefined problems. Companies at the scale of FAANG(+T) continually accumulate pockets of tech debt, and those pockets eventually become the biggest threats to availability, not the shiny new things. The sinusoidal pattern of exposure will continue.

[+] idlewords|6 years ago|reply
Write a script to fire random events and you will notice they sometimes cluster in ways that look like a pattern.
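A minimal sketch of such a script (Python; the daily outage probability is made up purely for illustration):

    import random

    DAYS = 365
    RATE = 0.05  # assumed: a big outage roughly every 20 days, on average

    # Fire independent random "outage" events over a year.
    outage_days = sorted(d for d in range(DAYS) if random.random() < RATE)
    print("outage days:", outage_days)

    # Count pairs of outages landing within a week of each other.
    close_pairs = [(a, b) for a, b in zip(outage_days, outage_days[1:]) if b - a <= 7]
    print("pairs within 7 days:", close_pairs)

Run it a few times: tight clusters show up regularly even though every event is independent.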
[+] JaRail|6 years ago|reply
First, I think our general uptime metrics are trending upwards. Recovery times tend to be much shorter as well.

Big services are bigger, so more mission-critical parts can fail.

Continuous development culture is designed with failure as part of the process. We don't spend time looking for obscure issues when they'll be easier to find by looking at metrics. This is fine when a staggered deployment can catch an issue with a small number of users. It's bad when that staggered deployment creates a side-effect that isn't fixed by rolling it back. Much harder to fix corrupted metadata, etc.

Automated systems can propagate/cascade/snowball mistakes far more quickly than having to manually apply changes.

We notice errors more now. Mistakes are instantly news.

[+] johngalt|6 years ago|reply
5) Operational reliability is both difficult and unsexy.

A fancy new feature, growing traffic, or adding AI to something will generate headlines, accolades, and positive attention. Not having outages is something everyone expects by default. This goes double for work that prevents outages. No one wins awards for what doesn't happen.

How many medals are pinned on the guys installing fire sprinklers?

[+] t0astbread|6 years ago|reply
Or maybe it's because the internet is increasingly becoming just a few companies' data centers? AFAIK Twitter moved to GCP a few months ago. Maybe this is another Google outage?
[+] listic|6 years ago|reply
I (don't) like how you exclude Russia, China, Iran and somebody from your definition of 'us'.
[+] godarni|6 years ago|reply
Lots of people on vacation this time of year. Would be interesting to see if there is a seasonal component to the reliability of these services.
[+] papito|6 years ago|reply
#1. I think the art of keeping things simple is being lost. These days people will mush together ten different cloud services and 5,000 dependencies just for a Hello World.
[+] moret1979|6 years ago|reply
One possibility for 5): too many KPIs and quarterly goals to hit, too many corners cut.
[+] humanfactor|6 years ago|reply
1/2) These are web apps. Big web apps, but web apps nonetheless. We know what can go wrong; there's nothing really new here. How would you quantify "too many pieces to make work"? Is 1,000 too many? 10,000? There are millions of pieces of data on your hard drive and they work fine. In general, variance can be handled with redundancy (rough numbers below). Maybe there are not enough backups at Twitter.
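A rough sketch of the redundancy arithmetic, assuming each replica is up 99% of the time (a made-up figure) and, crucially, that failures are independent:

    # Availability when at least one of n independent replicas must be up.
    single = 0.99
    for n in (1, 2, 3):
        combined = 1 - (1 - single) ** n
        print(f"{n} replica(s): {combined:.6f} availability")

The catch is that real-world failures (bad configs, cascading overload) are correlated, which is exactly when this formula stops applying.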

5/4) Incompetent people led by incompetent people, surrounded by yes-men and a drug culture. Also, having a company that demonizes conservatives, who are some of the best engineers (scientists are squares, naturally).

Human error is bound to happen, and software is complex, but so are rockets and supply chains. Things can go right and things can go wrong. Usually when they do go wrong, there is a human-error reason.

Does Twitter foster a place where human error can occur more frequently than at other places? I don't know. I have my bias about the company, and about any SJW company, but that's very anecdotal.

Twitter worked yesterday and it doesn't work today. That doesn't have to mean anything really important, except that there is a blind spot in their process which they need to harden.

I guess the first person to ask is the devops person, then the developer. Something wasn't tested enough. That happens in commercial software; deadlines can't wait.

3) Russia / China / Iran ... stop watching CNN. You are parroting talking points. If Twitter were crushed, America couldn't care less and would probably turn up sanctions, not lift them. Taking down Twitter won't cripple anything in America except certain marketers' budgets.

[+] outworlder|6 years ago|reply
Brains are excellent pattern matchers.

Brains also suck at statistics.

[+] pennaMan|6 years ago|reply
> July 11, 2019 7:56PM UTC [Identified] The outage was due to an internal configuration change, which we're now fixing. Some people may be able to access Twitter again and we're working to make sure Twitter is available to everyone as quickly as possible.

Seems like #4 is the winner.

[+] djtriptych|6 years ago|reply
I've been suspecting 3) for a few months now, and I'm quite curious how our government would handle it if it _were_ the case. Only a few of these outages have had plausible post-mortems ever made public.
[+] MrStonedOne|6 years ago|reply
Operational consistency creates a hidden single point of failure.

If everybody is doing the same things and setting things up the same way to ensure reliability, then any failures or shortcomings in that system are shared by all.

[+] AnIdiotOnTheNet|6 years ago|reply
It's #1. The real question is how this isn't blindingly obvious to everyone.
[+] jayd16|6 years ago|reply
My guess is it's a slow news time of year, coupled with more usage of cloud services, which means these types of stories are higher profile.
[+] marenkay|6 years ago|reply
Or we just managed to construct the biggest circular dependency ever using the whole internet and a combination of all hyped languages and frameworks.

That would in turn lead to an insanely fragile system, with increasing numbers of failures that seem inexplicable/weird.

[+] chrismarlow9|6 years ago|reply
Everything is made of plastic these days, even software. It's put out as soon as an MVP is ready. Too many managers with zero coding experience. The marketing people have taken over the browser. Time to start over.
[+] asark|6 years ago|reply
This is a pattern one might see if there were a secret, rolling disclosure of some exceptionally bad software vulnerability, I'd think. Or the same for some kind of serious but limited malware infection across devices of a certain class that sees some use at any major tech company. If you also didn't want to clue anyone else (any other governments) in that you'd found something (in either case), you might fix the problem this way. Though at that point it might be easier to just manufacture some really bad "routing issue" and have everyone fix it at once, under cover of the network problem.
[+] depr|6 years ago|reply
so like all software has reached peak complexity this month?
[+] rossdavidh|6 years ago|reply
Ok, I have one to add myself:

6) We used to have many small outages at different websites. Now, with so many things that were once separate small sites aggregated onto sites like FB, Twitter, Reddit, etc., we have a few large sites, and we have aggregated the failures along with them. The failure rate, by this theory, is the same, but we have replaced "many small failures" with "periodic widespread failures, big enough to make headlines". Turning many small problems into a few bigger ones (toy numbers in the sketch below). Just another hypothesis.
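A toy model of that trade-off, holding the total failure rate and user base fixed and only varying how concentrated the users are (every number here is made up for illustration):

    # Toy model: same total users, same number of failure events per month;
    # only the number of sites those users are spread across changes.
    USERS = 10_000_000
    FAILURES_PER_MONTH = 50        # assumed: overall failure rate held constant
    HEADLINE_THRESHOLD = 500_000   # assumed: an outage makes news above this size

    for n_sites in (1000, 10):
        users_per_site = USERS // n_sites
        headlines = FAILURES_PER_MONTH if users_per_site >= HEADLINE_THRESHOLD else 0
        print(f"{n_sites:4d} sites: each outage hits {users_per_site:,} users "
              f"-> {headlines} headline outages/month")

Same fifty failures either way; only the visibility changes.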

[+] dv_dt|6 years ago|reply
Another possibility: US (or other) authorities are requiring some sort of monitoring software or hardware whose installation makes some service disruption unavoidable.
[+] NightlyDev|6 years ago|reply
Software is getting increasingly complex. Why? To ensure better uptime, amongst other things. The funny part is that all the complexity often leads to downtime.

A single server would usually have less downtime than Google, Facebook and so on. But Google and Facebook need this complexity to handle the amount of traffic they're getting.

Makes me wonder why people are trying to do stuff like Google when they're not Google. Keeping it simple is the best solution.

[+] DaveInTucson|6 years ago|reply
> Just a series of unconnected errors at big companies

Except that "at big companies" is basically selection bias; problems at little companies don't get noticed because they're, well, small companies.

And the underlying issue of the "unconnected errors" is that software is rather like the airline industry: things don't really get fixed until there's a sufficiently ugly crash.

[+] bArray|6 years ago|reply
For point #3, there are a few irregularities:

1. Services all going down one after another. One goes down - it happens. Two go down - it happens sometimes. Three go down - quite a rare sequence of events. But now a large number of Silicon Valley companies have experienced service outages over the last few weeks.

2. A Russian sub that is said to be a "deep sea research vessel" somehow experiences a fire whilst in international waters [1]. It has been suspected that it could have been tapping undersea cables. Let's imagine for a moment a scenario where they were caught in the act, some NATO sub decided to put an end to it, and Russia covered it up to save face.

3. Russia announces tests to ensure that it could survive if completely cut off from the internet [2]. A few months later it's like somebody is probing US services in the same way.

4. There is currently a large NATO exercise in a simulated take-over of Russia happening in countries close to Russia [3].

Of course it's completely possible it's all unconnected, but my tin-foil-hat brain says there is a game of cloak and dagger going on here. I would say that Russia's incentive for probing the US/NATO is to test their weaknesses while the US is undergoing a trade war with China and raising sanctions against Iran. After all, Russian fighter planes regularly try to fly into UK airspace just to test their rapid-response crews [4]; this sort of behaviour is typical of them.

[1] https://en.wikipedia.org/wiki/Russian_submarine_Losharik

[2] https://techcrunch.com/2019/02/11/russia-internet-turn-off-d...

[3] https://sofiaglobe.com/2019/05/13/6000-military-personnel-to...

[4] https://www.theguardian.com/world/2018/jan/15/raf-fighters-i...

[+] idlewords|6 years ago|reply
So storytime! I worked at Twitter as a contractor in 2008 (my job was to make internal hockey-stick graphs of usage to impress investors) during the Fail Whale era. The site would go down pretty much daily, and every time the ops team brought it back up, Twitter's VCs would send over a few bottles of really fancy imported Belgian beer (the kind with elaborate wire bottle caps that tell you it's expensive).

I would intercept these rewards and put them in my backpack for the bus ride home, in order to avoid creating perverse incentives for the operations team. But did anyone call me 'hero'?

Also at that time, I remember asking the head DB guy about a specific metric, and he ran a live query against the database in front of me. It took a while to return, so he used the time to explain how, in an ordinary setup, the query would have locked all the tables and brought down the entire site, but he was using special SQL-fu to make it run transparently.

We got so engrossed in the details of this topic that half an hour passed before we noticed that everyone had stopped working and was running around in a frenzy. Someone finally ran over and asked him if he was doing a query, he hit Control-C, and Twitter came back up.

[+] lukey_q|6 years ago|reply
A lot of high-profile outages recently. Can't actually remember the last time Twitter went fully down. Have to confess I immediately assumed an issue with my own connection, even though every other site is working.

Unrelated, but for some reason the phrase "I have no mouth and I must scream" just popped into my head

[+] whatshisface|6 years ago|reply
I remember when we were at three outages, someone posted that they thought three was a reasonably sized random cluster given the rate at which services go down. How many outages have we had in the last 30 days, how many do we have per month on average, and how strongly can we reject the null hypothesis?

The relevant formula is the Poisson distribution: the probability of exactly k outages in a window is `λ^k * e^-λ / k!`, where λ is the average number of outages per 30 days and k is the number observed in the past 30 days; for "this many or more", sum the tail. If you find the numbers, let me know what the answer is (quick sketch below).
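A quick sketch of that calculation with placeholder numbers, since nobody has posted real base rates yet:

    from math import exp, factorial

    lam = 2.0  # assumed: average big-name outages per 30 days (plug in real data)
    k = 6      # assumed: outages observed in the past 30 days

    def pmf(i):
        # Poisson probability of exactly i outages: λ^i * e^-λ / i!
        return lam**i * exp(-lam) / factorial(i)

    p_at_least_k = 1 - sum(pmf(i) for i in range(k))  # the tail, P(X >= k)
    print(f"P(exactly {k} outages) = {pmf(k):.4f}")
    print(f"P(at least {k} outages) = {p_at_least_k:.4f}")

With these made-up numbers the tail probability comes out around 0.017, but the real answer depends entirely on the base rate you plug in.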

[+] lopespm|6 years ago|reply
A comment made before by another user about Facebook, Instagram and WhatsApp outages offers an interesting perspective:

"This outage coincides with FBs PSC (performance summary cycle) time. I wonder if this is folks trying to push features so they get “impact” for PSC."[1]

I wonder if the recent outages at other well-known services could be heavily influenced by a similar phenomenon. If this holds water, it would be interesting to see an article or study on it. I certainly would be interested in reading it.

[1] https://news.ycombinator.com/item?id=20350579

[+] mikece|6 years ago|reply
I posted the question on Slack "How do you spread the word when Twitter goes down?" People thought that was so hilarious... until they realized Twitter was actually down.

Honestly, "Hacker News" was my answer which seems to be effectively correct -- and today I learned about the existence of twitterstat.us!

[+] pcora|6 years ago|reply
Google, Apple, Microsoft, Facebook... and now Twitter? I keep asking the same thing: when is Amazon's outage day?
[+] EvanAnderson|6 years ago|reply
All I can think, smugly, is that DNS, SMTP, HTTP, etc. don't "go down". Twitter should be a protocol, not a website.
[+] kevinlou|6 years ago|reply
It's weird seeing the go-to downtime tracker go down. I'm so wired to check Twitter that I kept refreshing for a good 10 seconds.
[+] abadabadingdong|6 years ago|reply
I wonder how many conspiracy theories this single outage will trigger.
[+] dthedev|6 years ago|reply
Pray for the team that has to handle this ticket.
[+] falsedan|6 years ago|reply
That'll be fine; a post-mortem will show that ops weren't the cause, and their comp package will help them get over this little package of stress.
[+] idlewords|6 years ago|reply
On the status posts in particular I really miss the ability to sort comments by new on this site.
[+] dewey|6 years ago|reply
It has been a long time since I've seen the equivalent of the fail whale on Twitter. It was a weekly occurrence back in the day.
[+] unwabuisi|6 years ago|reply
I wish they would bring back the fail whale!
[+] KuhlMensch|6 years ago|reply
Years ago I read an amazing article (via HN) about how (complex) config, rather than code, ends up being the cause of outages at scale. I always reflect on that when designing almost anything these days.
[+] tschellenbach|6 years ago|reply
Really curious which part of their infrastructure was the root cause.
[+] anonymousjunior|6 years ago|reply
the internet is just falling apart these days
[+] abstract7|6 years ago|reply
My guess is that the whales have been securing parts of their codebases against internal leaks, or something related but security-driven. Workflow disruptions. It may be bad code biting them weeks or more after they pushed it.

There have been many embarrassing and controversial leaks this year. Allegations of uneven TOS enforcement. Hence the WH Social Media Summit. It could also be a security-related combination of measures ahead of the elections, which is also a bit sensitive for low-trust devs.

Imagine code getting pushed that only a small subset of devs are privy to. Possibly pushing obfuscated code, or launching services outside of the standard pipeline.

Remember that the Spectre and Meltdown patches for the Linux kernel were a nightmare, because the normal open and free-to-discuss-and-review workflow was broken. That applies in these situations too, with large codebases that are internally 'open source'.

[+] nevi-me|6 years ago|reply
I was in the middle of a loosely legal argument about the politics of my country, and tonight I had found people obliging enough to reason with me instead of calling me names.

The discussion was beautiful, until the app stopped working. I even thought I was blocked. I'm glad that it's just down.