
App Engine down

152 points| soofaloofa | 13 years ago |code.google.com | reply

128 comments

[+] davidjgraph|13 years ago|reply
Before the doom and gloomers come out, this is the first time since leaving beta I can remember it happening.

We left AWS about 18 months ago after one of the outages and switched to GAE. I've counted 3-4 big downtimes for AWS compared to this one on GAE. That's still a good decision (for now)....

[+] acdha|13 years ago|reply
One thing to remember: this took down all of app engine for at least an hour. AWS has had only 17 minutes of downtime affecting all of us-east this year (that network glitch a couple days after PyCon) - the rest of it has been a subset of the service amplified by people rediscovering that they weren't as redundant as they thought.

The correct lesson to draw is that any one point of infrastructure is a risk, so you need to scale wide. This is possible to do with AWS regions, or other providers - even internal bare iron if you're so inclined, but impossible to do with GAE because you're committed to a single-vendor API as well as their infrastructure.

[+] tedroden|13 years ago|reply
It's one of the first times the whole service has been down, but parts of the service go down at least once a week. memcache, task queues are "elevated" with regularity and urlfetch is frequently down totally. ("elevated" generally means unusable).

Of course master/slave even has scheduled downtime.

[+] lyddonb|13 years ago|reply
Yep. We've never seen anything like this before.
[+] namank|13 years ago|reply
Same. Minor updates and issues pop up with GAE but I've never experienced an outage such as this in my two years of use.

Wonder what's happening.

[+] mikegioia|13 years ago|reply
I have yet to experience downtime with RackspaceCloud and I've been using them for like 3 years.
[+] davis_m|13 years ago|reply
I think this is larger than just GAE.

http://internettrafficreport.com/namerica.htm

It seems like large portions of the internet are down.

[+] EwanToo|13 years ago|reply
Internet Traffic Report, while a nice concept, is unfortunately very misleading.

Their sample size is extremely small, and most of those are permanently down.

Have a look through their list of north american routers and find one of them where packet loss has gotten worse as their main overall graph for packet loss would suggest - I've just been through them all and couldn't find one.

[+] neanderdog|13 years ago|reply
May not be anything or may be..

I noticed a couple of days ago that some of our dns entries were mysteriously removed from level 3 servers which out of old habit are used for resolution (some of the ip's go back to uu-net/worldcom/mci)

Now the interesting bit is they were for private subnet ip's. They're working fine everywhere else.

Today the last of their dns servers removed the entries so I had users go to google (8.8.8.8) and all's well with our apps.

Level 3's entries for our external stuff are there, just the private subnet stuff is removed.

If others do this too and resolve with level 3....

edit: just found this: http://tracker.outages.org/reports/view/59

[+] X-Istence|13 years ago|reply
And here we have another website that does something similar:

http://www.internetpulse.net

And according to them all routes are up and running just fine; not only that, but ping times aren't elevated.

[+] untog|13 years ago|reply
Yes, Tumblr is experiencing difficulties and AFAIK they don't run anything on App Engine.
[+] sparkinson|13 years ago|reply
Now those are some interesting graphs.
[+] FiloSottile|13 years ago|reply
The internet is burning :O Well, seriously what could be a root cause that affects so many nodes?
[+] bsaul|13 years ago|reply
Only 51% of the North American internet is working?? Am I reading this correctly?
[+] aidos|13 years ago|reply
Add Dropbox to that list.
[+] cfontes|13 years ago|reply
Great find, I did not know this site.
[+] fidotron|13 years ago|reply
It's time we remembered the whole strength of the internet was that it was distributed and we avoided introducing single points of failure. We have ended up using vast amounts of infrastructure for no reason other than developer convenience (often with respect to security), when having local direct connections is often more suitable than shooting everything into the cloud.
[+] magicalist|13 years ago|reply
Huh? Historically, almost all content on the internet has had a single point of failure. Moving "into the cloud" is moving to a distributed model, and generally app engine will protect you from single points of failure. As others have noted, if you distribute your own self-managed servers across the globe, you end up with a very similar system, but now you have to manage it.

What appears to be happening here is that you're still vulnerable because the C&C infrastructure ultimately has a single command source, and so can be vulnerable if some code is pushed that affects the whole system. Your homegrown cloud will suffer from the same vulnerability, it may just be more or less easy to manage depending on how specialized your needs are compared to the more general requirements of GAE.

Edit: actually, maybe I missed what you're advocating with "local direct connections". This might make more sense from a user's perspective: if everyone ran their own little cloud, a failure may bring down reddit, but not reddit, heroku, pinterest, etc simultaneously. That's actually an interesting point, but I'm not sure if it really matters if they sync up their downtimes (since they would still have downtime, and maybe more or less of it depending on how much they could afford to invest into managing their distributed solution). I'm also not sure if that really solves the problem, since there are other concentrations in the network, just less visible ones (there are a fairly small number of major datacenters around the world, for instance, and managing your own colocated server doesn't matter if the whole building goes dark).

I do agree that at the very least we need to maintain an ecosystem of "cloud" providers, however.

[+] mhurron|13 years ago|reply
The resiliency of the internet has nothing to do with the uptime of any service running at endpoints. The resiliency is that of the network itself: if a router goes down, there are paths around it, but not necessarily paths to every endpoint hanging off it or the services they offer.
[+] digeridoo|13 years ago|reply
We shifted from services that fail often independently, to services failing rarely all at once. Clearly the former is going to be more noticeable and has greater societal impact, but as a business I'll take the latter any day.
[+] lurker14|13 years ago|reply
Which is better? Having a day of downtime each year, or not launching at all?
[+] minm|13 years ago|reply
Point well made. We operate our Tonido relay servers across the world using Linode, SoftLayer and Hetzner, and they currently support a couple of million devices and half a million users. In the last 2 years our downtime has been less than 30 minutes. It is nil for end users, since they migrate to the nearest relay server hosted by a different provider.

SPOF, security and control are the major issues with IaaS and PaaS offerings.

[+] daave|13 years ago|reply
And they've sent the all-clear:

At this point, we have stabilized service to App Engine applications. App Engine is now successfully serving at our normal daily traffic level, and we are closely monitoring the situation and working to prevent recurrence of this incident.

This morning around 7:30AM US/Pacific time, a large percentage of App Engine’s load balancing infrastructure began failing. As the system recovered, individual jobs became overloaded with backed-up traffic, resulting in cascading failures. Affected applications experienced increased latencies and error rates. Once we confirmed this cycle, we temporarily shut down all traffic and then slowly ramped it back up to avoid overloading the load balancing infrastructure as it recovered. This restored normal serving behavior for all applications.

We’ll be posting a more detailed analysis of this incident once we have fully investigated and analyzed the root cause.

Regards,

Christina Ilvento on behalf of the Google App Engine Team

https://groups.google.com/forum/#!topic/google-appengine-dow...
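
The slow ramp-up described in the notice can be illustrated with a small sketch. This is not Google's actual mechanism, which the notice does not describe. The function names `admit_fraction` and `should_admit`, the linear ramp, and the 600-second default are all assumptions made for illustration.

```python
import random


def admit_fraction(elapsed_s: float, ramp_s: float = 600.0) -> float:
    """Fraction of incoming traffic to admit, rising linearly from 0 to 1.

    elapsed_s: seconds since traffic was re-enabled (hypothetical input).
    ramp_s:    assumed length of the ramp; 600 s is an arbitrary default.
    """
    if elapsed_s <= 0:
        return 0.0
    return min(1.0, elapsed_s / ramp_s)


def should_admit(elapsed_s: float, ramp_s: float = 600.0, rng=random.random) -> bool:
    """Admit a single request with probability equal to the current fraction."""
    return rng() < admit_fraction(elapsed_s, ramp_s)


if __name__ == "__main__":
    # Show how the admitted share grows over the assumed 10-minute ramp.
    for t in (0, 150, 300, 450, 600):
        print(t, admit_fraction(t))
```

Shedding the excess requests early, instead of letting them queue, is what keeps the recovering backends from being hit by the backed-up traffic all at once.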

[+] abhijitr|13 years ago|reply
Meanwhile... Gmail etc are working quite fine. So the claim that if you build on GAE you "take advantage of the same infrastructure used for Google services!!" starts to ring a bit hollow.
[+] cilvento|13 years ago|reply
At about 7:30am US/Pacific time this morning, Google began experiencing slow performance and dropped connections from one of the components of App Engine. Many App Engine applications are experiencing slow responses and an inability to connect to services. We currently show that a majority of App Engine users and services are affected. We are actively working on restoring service as quickly as possible.

We are posting regular updates to our downtime-notify list here: https://groups.google.com/forum/?fromgroups=#!topic/google-a...

Thanks, Christina, Google App Engine Product Manager

[+] kjhughes|13 years ago|reply
What's the earliest sign of trouble you've had?

Pingdom reports my GAE-hosted site has been down since 2012-10-26 10:37:38 EST, a bit over an hour now.

UPDATE: My site is back. Delayed report from Pingdom says site came back online after 50 minutes. Performance is sketchy still. We're probably not in the clear yet.

At least we can now get to the status dash:

http://code.google.com/status/appengine

[+] jsdalton|13 years ago|reply
It's really quite remarkable (to be honest, inexcusable is probably a better word) that their status page is failing as well. My expectations for a company with Google's resources and infrastructure are a lot higher than that.

Nothing on their Twitter account either: https://twitter.com/app_engine

A poor handling of a systems failure in my opinion.

[+] Yoms|13 years ago|reply
Latest update:

"At approximately 7:30am Pacific time this morning, Google began experiencing slow performance and dropped connections from one of the components of App Engine. The symptoms that service users would experience include slow response and an inability to connect to services. We currently show that a majority of App Engine users and services are affected. Google engineering teams are investigating a number of options for restoring service as quickly as possible, and we will provide another update as information changes, or within 60 minutes."

https://groups.google.com/forum/?fromgroups=#!topic/google-a...

[+] debacle|13 years ago|reply
I'm really happy I don't host in the cloud. How quickly are the cost savings of cloud computing obliterated by PR, customer service, and system administration time when an outage like this occurs?
[+] foolery|13 years ago|reply
Yup, all of the HRD apps are down. But the M/S apps are working.
[+] jis|13 years ago|reply
Well, my one M/S app is also down.
[+] tomnewton|13 years ago|reply
My Google contact said that 'SRE are all over it. Hope to have more details soon.' but that was about 30 minutes ago.

Does tumblr.com use app engine? They're down...

[+] libria|13 years ago|reply
Hm, bad week for the Cloud. Can't even get to the status page; hopefully it's not hosted on App Engine.

So going forward, what's the best way to protect against cloud downtime? Have a hot/standby failover with a different provider? Prepare customers' expectations for the possibility of server outages? Do a ton of research, pay $$$ for lots of nines uptime, and lambast the host when they don't deliver?

[+] bad_user|13 years ago|reply
Downtimes happen regardless, unless you have a lot of money and talent to spend on your own infrastructure and even then it's hard to beat cloud providers like Amazon, or Google, who have more resources and knowledge than you do.

The greatest thing about cloud-hosting is that you can just sit by and let them fix it. It usually takes about half an hour, or a couple of hours if the outage is severe, but usually less than the time it takes for an update of DNS records (unless you've got some proxy in front of your IPs, which would be another point of failure).

And then, even with these severe outages, the overall monthly uptime is still better than 99.9% and it's really hard to beat that, so just relax and let them fix it.
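
For reference, the downtime allowed by an uptime percentage is simple arithmetic. A minimal sketch, assuming a 30-day month (the function name `downtime_budget_minutes` is made up for this example):

```python
def downtime_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed in `days` days at the given uptime percentage."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        print(f"{pct}% uptime allows about {downtime_budget_minutes(pct):.1f} minutes of downtime")
```

At 99.9% over a 30-day month the budget works out to roughly 43 minutes.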

[+] acdha|13 years ago|reply
There's no such thing as "cloud downtime" - it's still servers, data centers, networks, same as everything else.

You need to decide how much uptime you're willing to pay for, how much your service can degrade for how long, and methodically address each level of the hierarchy between you and your customers – and it might be the case that you decide that the ongoing costs of your engineering support for e.g. wide geographic separation just aren't sustainable at the level your customers are willing to pay, particularly if you have something like a CDN helping keep your site partially responsive during less than catastrophic failures.

[+] davidjgraph|13 years ago|reply
I'd say the answer depends on how fast GAE recovers. If you're building redundancy over multiple clouds and there's a lot of data:

1) It's very complex and expensive
2) You're looking at DNS to hot failover, in most cases.

If GAE can recover in less than 30 minutes and sticks to, say, one outage a year, you just can't justify the kind of cost you're looking at with 2 (seriously, it's a lot of cash).

[+] josh2600|13 years ago|reply
Build redundancy into your software to deal with single provider failure.
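
One way to read that advice, as a rough sketch only: try each provider in turn and fall through to the next when one fails. The names `call_with_failover` and `AllProvidersFailed` are hypothetical, and each provider is represented by a plain callable so the example stays self-contained. A real setup would also need data replication, which this does not address.

```python
from typing import Callable, Iterable, List, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider raised an error."""


def call_with_failover(
    providers: Iterable[Tuple[str, Callable[[], str]]],
) -> Tuple[str, str]:
    """Try each (name, fetch) pair in order; return the first success.

    Returns (provider_name, result). Any exception from a provider is
    treated as a failure and the next provider is tried.
    """
    errors: List[str] = []
    for name, fetch in providers:
        try:
            return name, fetch()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def primary() -> str:
        raise ConnectionError("primary unreachable")  # simulated outage

    def secondary() -> str:
        return "served by secondary"

    print(call_with_failover([("primary", primary), ("secondary", secondary)]))
```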
[+] bsaul|13 years ago|reply
I would love to see people at Google showing all the internal tools they're using to detect and solve these kinds of issues. I can only imagine a war room with screens all over the place showing a gigantic number of red flashing lines :) Hope it doesn't last long though, I was just praising what a good choice App Engine has been so far 10 minutes ago...
[+] hugofierro|13 years ago|reply
I hope it's not due to DiRT Exercises (SRE Disaster Recovery Test). Looking forward to reading the post-mortem report!
[+] notreadbyhumans|13 years ago|reply
It's a bit nuts that they're hosting the status pages on the same infrastructure.