item 19056911

Microsoft Azure data deleted because of DNS outage

217 points | stonewhite | 7 years ago | nakedsecurity.sophos.com | reply

84 comments

[+] m0zg|7 years ago|reply
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." -- Leslie Lamport. I feel this should be rephrased for cloud computing at this point. The more people rely on cloud, the more these global fuck-ups are going affect them. Makes me feel pretty good about that server rack in my garage that addresses most of my own (and my business') compute needs.
[+] outworlder|7 years ago|reply
If you already have a rack available with spare capacity, your business isn't expected to blow up overnight, and you can tolerate failures (and the turnaround time for you or your employees to fix stuff, order spare parts, etc.), sure, why not?

The capacity is there, might as well use it.

That said, if you didn't already have said rack, I'm not so sure it would be worth even making a purchase order. Sure, things outside your control may break when you are using a cloud. But guess what: things outside your control will also break on-prem, particularly hardware and network connectivity. There is no way your networking can be better than, say, GCP's own networking, or that you can deploy redundant workloads across availability zones (or even regions!) yourself.

By the time a purchase order for a new server can arrive, we can have a production-ready system running, with redundancy across availability zones, automatic failover, CDNs, backups, the works.

Basically, I don't care if someone knocks out power in my block, if someone cuts a network cable, or even if a machine goes up in flames.

One thing I would say is: even if you are very happy with your current setup, if you have some time to automate a similar setup on the cloud (keyword: automate), then I would suggest doing just that, and offloading backups to the cloud too. Even if only as a business continuity thing.

[+] userbinator|7 years ago|reply
Another perhaps more relevant way to think about this is that security and availability are at odds with each other, and in this case a system designed for security made a secure choice.

The more secure a system is designed to be, the more likely it is to treat unusual conditions as an attack and possibly perform some destructive action to thwart the assumed attacker. Think of phones configured to delete all data after X incorrect password attempts, HSMs with anti-tamper switches, etc.

[+] boulos|7 years ago|reply
Disclosure: I work on Google Cloud.

I’ve always enjoyed this quote, but my problem with [the description of] this outage is the third-party dependency.

Packets can’t get from your cloud provider to downstream users of CenturyLink? That’s fair.

Your cloud provider can’t send packets to/from CenturyLink, so they nuke your database? I literally don’t understand.

Is the service described actually a third-party service that’s been white boxed? (I mean this in the most honest way possible. I do not understand the details, and I found the article surprising).

[+] Waterluvian|7 years ago|reply
I feel that while valid for you, this sentiment is highly out of touch with the reality of the needs, resources, and capabilities of most people who need these kinds of systems.

It reminds me of a friend who wonders why his parents don't just install Ubuntu because Windows is so awful.

[+] bdibs|7 years ago|reply
Because nothing could happen to your rack (or garage)?
[+] lenticular|7 years ago|reply
I just don't hear any good things about Azure. That is unfortunate, because I'd love AWS to have some competition.
[+] outworlder|7 years ago|reply
> I just don't hear any good things about Azure. That is unfortunate, because I'd love AWS to have some competition.

Google Cloud is fair competition, provided they have the service you need. AWS and Azure both beat them in number of services. If Google has it, though, it should behave as expected, and some services are downright impressive (GKE, and VM auto-migration on GCE).

Azure is... infuriating. Inconsistent, unreliable APIs, surprising behavior everywhere (attach an internal load balancer, lose internet connectivity!?), lots of restrictions on which features can be used with which SKUs.

I see improvements and it is difficult to beat them in the enterprise, but speaking as an engineer, man Azure is infuriating.

[+] TheIronYuppie|7 years ago|reply
Disclosure: I work at Azure.

We're doing our best, but we're not going to suggest there's not more to do. Every major cloud provider has had issues at one point or another (I formerly worked at Amazon and Google), and I'll just say - we hear you, and we are fiercely committed to earning your trust.

[+] itsdrewmiller|7 years ago|reply
We use Azure and love it. We've had more problems over the past few years from downstream services being broken by AWS (S3 outage, etc.) than our primary apps being broken by Azure.
[+] shanemhansen|7 years ago|reply
Azure's ILBs are still just bizarre to me. It's the first load balancer I've ever worked with where a member of the pool sometimes can't reach the load balancer. At the packet level, given their implementation, this makes sense, but tell me how many times you've had to write a stats endpoint where the stats nodes needed workarounds just to send their own stats?

Source: https://docs.microsoft.com/en-us/azure/load-balancer/load-ba...

[+] briffle|7 years ago|reply
Try google cloud?
[+] Dayshine|7 years ago|reply
Well, I'm not sure that's true.

I only hear bad things about AWS and Google Cloud, and I hear nothing much about Azure.

[+] stevenjohns|7 years ago|reply
That's because Azure is a poor service with mediocre support.

My anecdotal experience: I spent a couple of weeks (!) setting up our environment (Bitbucket, Django, Ubuntu, Dockerized) on Azure App Service and Azure Pipelines. Their documentation was incomplete and out of date, and MS support staff struggled to help if you didn't have a Windows machine (their RDP software doesn't support Linux, Skype for Business doesn't support Linux, and normal Skype for Linux doesn't support screen sharing).

Little things like trying to SSH into any machine so that you can execute commands on your Docker container (for, say, database migrations or to check logs) are almost impossible. If it wasn't for the help of a lot of people on #docker on Freenode, I would probably still be working on it.

I had to use Google Hangouts with a Microsoft support person's personal gmail account, while he was connected over VPN (since he was based in Shanghai), so I could show my issue. The support person was extremely pleasant to deal with and understanding, though, and he went above and beyond to help get my issue resolved even though it turned out to not be from his department.

However, after getting set up, I noticed I was getting 12 second (!) responses from an API I had written just to retrieve a logged-in user's first name, last name and email in JSON. This API resolves locally in 20ms - including layers of authentication.

This turned out to be a known issue when running a managed "Azure Database for PostgreSQL" service and was common on MS support forums.

After reaching out to Microsoft support for Azure Database for PostgreSQL, their response was this, copy-and-pasted:

> As you are currently using Basic Tier (2 vCores, 51200MB), the bad performance is expected.

> When comparing with the performance in your VM, the on-prem is supposed to be better than cloud even within the same hardware environment.

> Please give it a test in higher tier and configure it with a compatible settings compared with your VM. In the meanwhile, you can monitor the slow queries via Query Performance Insight to find out what queries were running at a long time when those API were called.

> Pricing tier information can be found at https://docs.microsoft.com/en-us/azure/postgresql/concepts-p... .

...they tried to upsell me on the higher tier database 3 times in that email chain, believing that this level of performance was acceptable for my database tier.

Of course, the next tier up from the $60/month I was on was $160/month, and since we only have maybe two concurrent users at most, it didn't make sense to nearly triple our costs just to avoid 12-second database calls.

I moved the entire service to AWS last week. The setup was painless and swift. Using equivalently priced services, the API now resolves in 50ms.

I don't think I'll ever go back. Not even for free.

[+] cddotdotslash|7 years ago|reply
Sounds like they built a dead-man's switch and then broke the process through which the man and the switch communicate.
[+] ajross|7 years ago|reply
It was garbage collection. They were deleting data that couldn't be accessed (because no key existed in the system to decrypt it), but the DNS failure fooled the detection into thinking that a failed lookup meant "no key exists". Yikes.

To be clear: it was ultimately only a 5-minute loss (and the fact that the DNS outage was simultaneous probably meant there wasn't much data being stored anyway) because they had a regular snapshot facility. So defense in depth saved them.

Still, yikes. That's a pretty disastrous bug.

[+] dtech|7 years ago|reply
Would be a pretty ineffective dead man's switch if a backup from 5 minutes ago is available.
[+] excalibur|7 years ago|reply
> The deletions were automated, triggered by a script that drops TDE database tables when their corresponding keys can no longer be accessed in the Key Vault, explained Microsoft in a letter reportedly sent to customers.

By what logic is this NOT a terrible idea?

[+] booi|7 years ago|reply
It sounds like they do this for data security and compliance reasons, but it seems like sloppy engineering not to consider unreachability as a possible temporary error.
[+] sowbug|7 years ago|reply
It's reasonable to delay deleting encrypted data (which can take a long time) and just delete its keys (which is very fast) upon a user request to delete the data. If you believe in encryption, then once you delete the only remaining copies of keys, the encrypted data is as good as deleted.

So that's why it's a great idea to implement data deletion as a two-phase sequence of synchronous key deletion, then asynchronous low-priority block scrubbing (or marking free for reclamation).

But not handling the case where your system is confused whether the keys are deleted (versus just temporarily unavailable) is less of a great idea.
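A minimal Python sketch of that two-phase scheme, with all names hypothetical and in-memory dicts standing in for the key vault and block store:

```python
import secrets

class CryptoShredStore:
    """Sketch of two-phase deletion: drop the key synchronously,
    scrub the ciphertext blocks lazily afterwards."""

    def __init__(self):
        self.keys = {}         # key_id -> key bytes (stand-in for the key vault)
        self.blobs = {}        # blob_id -> (key_id, ciphertext)
        self.scrub_queue = []  # blob_ids awaiting low-priority reclamation

    def store(self, blob_id, ciphertext):
        key_id = secrets.token_hex(8)
        self.keys[key_id] = secrets.token_bytes(32)
        self.blobs[blob_id] = (key_id, ciphertext)
        return key_id

    def delete(self, blob_id):
        key_id, _ = self.blobs[blob_id]
        # Phase 1 (synchronous, fast): destroy the key. The data is now
        # unrecoverable even though the ciphertext still occupies blocks.
        self.keys.pop(key_id, None)
        # Phase 2 (asynchronous, slow): queue the blocks for scrubbing.
        self.scrub_queue.append(blob_id)

    def scrub_one(self):
        # Background worker reclaims one queued blob at a time.
        if self.scrub_queue:
            self.blobs.pop(self.scrub_queue.pop(0), None)
```

After `delete()`, the key is gone immediately while the ciphertext lingers until the scrubber gets to it; that is exactly why a key-vault lookup failure is such a dangerous trigger for the scrubber.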

[+] microtherion|7 years ago|reply
Well, it's certainly one way to guarantee database consistency after a network partition…
[+] mh8h|7 years ago|reply
At the very least it needs better error handling. A "not found" response and a "cannot resolve domain name" failure should be handled differently.
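That distinction can be made explicit in the garbage collector's decision logic. A sketch with hypothetical names (Azure's actual client surface differs): only an authoritative "not found" from the vault may trigger deletion; a resolution or transport failure tells you nothing about the key.

```python
import enum

class KeyLookup(enum.Enum):
    FOUND = "found"              # vault answered: key exists
    NOT_FOUND = "not_found"      # vault answered: key was deleted
    UNREACHABLE = "unreachable"  # DNS/network error: state unknown

def gc_decision(lookup: KeyLookup) -> str:
    """Map a key-vault lookup outcome to a garbage-collector action.

    The crucial rule: UNREACHABLE is not evidence of deletion, so
    destructive action must never follow from it -- retry instead."""
    if lookup is KeyLookup.FOUND:
        return "keep"
    if lookup is KeyLookup.NOT_FOUND:
        return "drop"
    return "retry"
```

Collapsing the last two cases into one is, by this account, essentially the bug that dropped the tables.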
[+] LoSboccacc|7 years ago|reply
The whole "managed resources have a mandatory public DNS name and IP" idea is insane.

Yeah, they come with a firewall, but still. Imagine competing with everyone else in a single namespace.

At least for S3 buckets it's justified, because those are meant to be accessible, but the databases?

[+] snockerton|7 years ago|reply
It appears that the SLA guaranteed uptime for Azure SQL Database is 99.9% or 99.99%, depending on tier. That equates to the following allowable downtime per month (which I think is what they base SLA fulfillment on):

99.9: 43m 49.7s

99.99: 4m 23.0s

Sounds like they need to cough up some money for their four 9s customers...
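Those figures fall straight out of the percentage. A quick Python check (the 730.5-hour average month is an assumption; actual SLAs may define the monthly window differently):

```python
def allowed_downtime_seconds(sla_percent: float, month_hours: float = 730.5) -> float:
    """Monthly downtime budget for a given SLA percentage.

    month_hours defaults to the average Gregorian month
    (365.25 days * 24 h / 12 months = 730.5 h)."""
    return month_hours * 3600 * (1 - sla_percent / 100)

print(allowed_downtime_seconds(99.9))   # ~2629.8 s, i.e. about 43m 50s
print(allowed_downtime_seconds(99.99))  # ~263.0 s, i.e. about 4m 23s
```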

[+] kthejoker2|7 years ago|reply
As the article indicates, MSFT is offering 3 months of free service to affected customers.