item 32931096

AWS vs. GCP reliability is wildly different

545 points | icyfox | 3 years ago | freeman.vc | reply

234 comments

[+] rwiggins|3 years ago|reply
There were 84 errors for GCP, but the breakdown says 74 409s and 5 timeouts. Maybe it was 79 409s? Or 10 timeouts?

I suspect the 409 conflicts are probably from the instance name not being unique in the test. It looks like the instance name used was:

    instance_name = f"gpu-test-{int(time())}"
which has a 1-second precision. The test harness appears to do a `sleep(1)` between test creations, but this sort of thing can have weird boundary cases, particularly because (1) it does cleanup after creation, which will have variable latency, (2) `int()` will truncate the fractional part of the second from `time()`, and (3) `time.time()` is not monotonic.

I would not ask the author to spend money to test it again, but I think the 409s would probably disappear if you replaced `int(time())` with `uuid.uuid4()`.
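A minimal sketch of that fix (the helper name here is hypothetical, not from the author's harness); a UUID-based suffix is collision-free regardless of clock precision or monotonicity, and the result stays within GCP's 63-character instance-name limit:

```python
import uuid

def unique_instance_name(prefix: str = "gpu-test") -> str:
    # uuid4 is random, lowercase hex with hyphens: 36 chars, so the
    # full name fits GCP's 63-char limit for typical short prefixes.
    return f"{prefix}-{uuid.uuid4()}"

# Two names generated back to back can never collide, unlike int(time()).
a, b = unique_instance_name(), unique_instance_name()
assert a != b
```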

Disclosure: I work at Google - on Google Compute Engine. :-)

[+] sitharus|3 years ago|reply
This is a very good point - AWS uses tags to give instances a friendly name, so the name does not have to be unique. The same logic would not fail on AWS.
[+] mempko|3 years ago|reply
What are your thoughts on the generally slower launch times with a huge variance on GCP?
[+] Cthulhu_|3 years ago|reply
I've naively used millisecond-precision timestamps for a long time - not in anything critical, I don't think - but I've only recently come to appreciate that a millisecond is a pretty long time. A recent example: I used a timestamp to version a record in a database, but in a Go application a record could feasibly be mutated multiple times per millisecond by different users / processes / requests.

Unfortunately, millisecond-precise timestamps proved to be a bit tricky in combination with sqlite.

[+] okdood64|3 years ago|reply
Hope icyfox can try running this with a fix.
[+] dark-star|3 years ago|reply
I wonder why someone would equate "instance launch time" with "reliability"... I won't go as far as calling it "clickbait", but wouldn't some other phrasing ("startup performance is wildly different") have made more sense?
[+] mikewave|3 years ago|reply
Well, if your system elastically uses GPU compute and needs to be able to spin up, run compute on a GPU, and spin down in a predictable amount of time to provide reasonable UX, launch time would definitely be a factor in terms of customer-perceived reliability.
[+] iLoveOncall|3 years ago|reply
It is clickbait, the real title should be "AWS vs. GCP on-demand provisioning of GPU resources performance is wildly different".

That said, while I agree that launch time and provisioning error rate are not sufficient to define reliability, they are definitely a part of it.

[+] xmonkee|3 years ago|reply
GCP also had 84 errors compared to 1 for AWS
[+] hericium|3 years ago|reply
Cloud reliability is not the same as the reliability of an already-spawned VM.

Here it's about the ability to launch new VMs to satisfy a project's dynamic needs. A cloud provider should let you scale up in a predictable way. When it doesn't, it can fairly be called unreliable.

Also, "unreliable" is basically a synonym for "Google" these days.

[+] irjustin|3 years ago|reply
I'll say it is valid to use reliability.

If I depend on some performance metric - startup, speed, etc. - my dependence on it equates to reliability. Not just on/off, but the spectrum it produces.

If a CPU doesn't operate at its 2GHz setting 60% of the time, I would say that's not reliable. When my bus shows up on time only 40% of the time - I can't rely on that bus to get me where I need to go consistently.

If the GPU took 1 hour to boot, but still booted, is it reliable? What about 1 year? At some point it crosses a "personal" threshold of reliability.

The comparison to AWS, which consistently outperforms GCP, implicitly (if not explicitly) turns that into a reliability metric by setting the AWS boot time as "the standard".

[+] chrismarlow9|3 years ago|reply
I mean if you're talking about worst case systems you assume everything is gone except your infra code and backups. In that case your instance launch time would ultimately define what your downtime looks like assuming all else is equal. It does seem a little weird to define it that way but in a strict sense maybe not.
[+] thayne|3 years ago|reply
Well, I mean it is measuring how reliably you can get a GPU instance. But it certainly isn't the overall reliability. And depending on your workflow, it might not even be a very interesting measure. I would be more interested in seeing a comparison of how long regular non-GPU instances can run without having to be rebooted, and maybe how long it takes to allocate a regular VM.
[+] thesuperbigfrog|3 years ago|reply
"AWS encountered one valid launch error in these two weeks whereas GCP had 84."

84 times more launch errors seems like a valid definition for "less reliable".

[+] RajT88|3 years ago|reply
Reliability is a fair term, with an asterisk. It is a specific flavor of reliability: deployment, or scaling, or net-new, or allocation, or whatever you want to call it.
[+] santoshalper|3 years ago|reply
I won't go so far as saying "you didn't read the article", but I think you missed something.
[+] rmah|3 years ago|reply
They are talking about the reliability of AWS vs GCP. As a user of both, I'd categorize predictable startup times under reliability because if it took more than a minute or so, we'd consider it broken. I suspect many others would have even tighter constraints.
[+] lacker|3 years ago|reply
Anecdotally I tend to agree with the author. But this really isn't a great way of comparing cloud services.

The fundamental problem with cloud reliability is that it depends on a lot of stuff that's out of your control, that you have no visibility into. I have had services running happily on AWS with no errors, and the next month without changing anything they fail all the time.

Why? Well, we look into it and it turns out AWS changed something behind the scenes. There's a different underlying hardware behind the instance, or some resource started being in high demand because of some other customers.

So, I completely believe that at the time of this test, this particular API was performing a lot better on AWS than on GCP. But I wouldn't count on it still performing this way a month later. Cloud services aren't like a piece of dedicated hardware where you test it one month, and then the next month it behaves roughly the same. They are changing a lot of stuff that you can't see.

[+] citizenpaul|3 years ago|reply
That was my thought too. People are probably pummeling GCP's GPU free tier right now with Stable Diffusion image generators, since it seems like all the free plug-and-play examples use Google's Python notebooks.
[+] ryukoposting|3 years ago|reply
You've just perfectly characterized why on-site infrastructure will always have its place.
[+] RajT88|3 years ago|reply
Instance types and regions make a big difference.

Some regions and hardware generations are just busier than others. It may not be the same across cloud providers (although I suspect it is similar given the underlying market forces).

[+] remus|3 years ago|reply
> The offerings between the two cloud vendors are also not the same, which might relate to their differing response times. GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator - you can separately configure quantity of the CPUs as needed. AWS only provisions defined VMs that have GPUs attached - the g4dn.x series of hardware here. Each of these instances are fixed in their CPU allocation, so if you want one particular varietal of GPU you are stuck with the associated CPU configuration.

At a surface level, the above (from the article) seems like a pretty straightforward explanation: GCP gives you more flexibility in configuring GPU instances at the cost of increased startup-time variability.

[+] btgeekboy|3 years ago|reply
I wouldn't be surprised if GCP has GPUs scattered throughout the datacenter. If you happen to want to attach one, it has to find one for you to use - potentially live migrating your instance or someone else's so that it can connect them. It'd explain the massive variability between launch times.
[+] politelemon|3 years ago|reply
A few weeks ago I needed to change the volume type on an EC2 instance to gp3. Following the instructions, the change happened while the instance was running. I didn't need to reboot or stop the instance, it just changed the type. While the instance was running.

I didn't understand how they were able to do this, I had thought volume types mapped to hardware clusters of some kind. And since I didn't understand, I wasn't able to distinguish it from magic.

[+] lomkju|3 years ago|reply
Having been a high-scale AWS user with a bill of $1M+/month, and having now worked for 2 years at a company that uses GCP, I would say AWS is superior and way ahead.

** NOTE: If you're a low scale company this won't matter to you **

1. GKE

When you cross a certain scale, certain GKE components won't scale with you, and the SLOs on those components are crazy: it takes 15+ minutes for us to update an Ingress backed by the GKE ingress controller.

Cloud Logging hasn't been able to keep up with our scale; we've had it disabled for 2 years now. Last quarter we got an email from them asking us to enable it and try it again on our clusters; we still have to verify their claims, as our scale is much higher now.

The Konnectivity agent release was really bad for us: it affected some components internally, and the total dev time we lost debugging the issue was more than 3 months. They had to disable the Konnectivity agent on our clusters. I had to collect TCP dumps and other evidence just to prove nothing was wrong on our end, and fight with our TAM to get a meeting with the product team. After 4 months they agreed and reverted our clusters to SSH tunnels. Initially GCP support said they couldn't do this. Next quarter I'll be updating the clusters; hopefully they will have fixed this by then.

2. Support.

I think AWS support was always more proactive in debugging with us; GCP support agents most of the time lack the expertise or proactiveness to debug/solve even simple cases. We pay for enterprise support and don't feel we get much from it. At AWS we had infra reviews every 2 quarters on how we could improve things, where we got new suggestions; that was also when we shared what we would like to see on their roadmap.

3. Enterprisyness is missing from the design

A simple thing like Cloud Build doesn't have access to static IPs. We have to maintain a forward proxy just because of this.

L4 LBs were a mess: you could only use specific ports in an (L4) TCP proxy. For a TCP-proxy-based load balancer, the allowed set of ports was [25, 43, 110, 143, 195, 443, 465, 587, 700, 993, 995, 1883, 3389, 5222, 5432, 5671, 5672, 5900, 5901, 6379, 8085, 8099, 9092, 9200, and 9300]. Today I see they have removed these restrictions. I don't know who came up with the idea of allowing only a few ports on an L4 LB. I think such design decisions make it less enterprisy.

[+] outworlder|3 years ago|reply
Unclear what the article has to do with reliability. Yes, spinning up machines on GCP is incredibly fast and has always been. AWS is decent. Azure feels like I'm starting a Boeing 747 instead of a VM.

However, there's one aspect where GCP is a clear winner on the reliability front. They auto-migrate instances transparently and with close to zero impact to workloads – I want to say zero impact but it's not technically zero.

In comparison, on AWS you need to stop/start your instance yourself so that it will move to another hypervisor (depending on the actual issue, AWS may do it for you). That definitely has an impact on your workloads. We can sometimes architect around it, but it's still something to worry about. Given the number of instances we run, we have multiple machines to deal with weekly. We get all these 'scheduled maintenance' events (which sometimes aren't really all that scheduled) with some instance IDs (they don't even bother sending the name tag), and we have to deal with that.

I already thought stop/start was an improvement on the tech of the time (OpenStack, for example, or even VMware), just because we don't have to think about hypervisors - we don't have to know, we don't care. We don't have to ask for migrations to be performed; hypervisors are pretty much stateless.

However, on GCP? We had to stop/start instances exactly zero times, out of the thousands we run and have been running for years. We can see auto-migration events when we bother checking the logs. Otherwise, we don't even notice the migration happened.

It's pretty old tech too:

https://cloudplatform.googleblog.com/2015/03/Google-Compute-...

[+] yolovoe|3 years ago|reply
EC2 live migrates instances too. Not sure where we are with rollout across the fleet.

The reason, from what I understand, why GCP does live migration more is that EC2 focused on live updates instead of live migration. Whereas GCP migrates instances to update servers, EC2 live-updates everything down to firmware while instances are running.

Curious, what instance types are you using on EC2 that you see so much maintenance?

[+] jcheng|3 years ago|reply
> Yes, spinning up machines on GCP is incredibly fast and has always been. AWS is decent.

FWIW this article is saying the opposite--it's AWS that beats GCP in startup speed.

[+] voidfunc|3 years ago|reply
> Azure feels like I'm starting a Boeing 747 instead of a VM.

Huh... interesting, this has not been my experience with Azure VM launch times. I'm usually surprised how quickly they pop up.

[+] user-|3 years ago|reply
I wouldn't call this reliability, which already has a loaded definition in the cloud world; I'd instead call it something along the lines of time-to-start or latency.
[+] systemvoltage|3 years ago|reply
It is, though, based on a specific definition. If X doesn't do Y, as measured by metric Z, within the predefined tolerance T (its spec limits and allowed standard deviation), it is not reliable per that tolerance.

  X = Compute intances
  Y = Launch
  Z = Time to launch
  T = LSL (N/A), USL (10s), Std Dev (2s)
Where LSL is lower spec limit, USL is upper spec limit. LSL is N/A since we don't care if the instance launches instantly (0 seconds).

You can define T as per your requirements. Here we are ignoring the accuracy of the clock that measures time, assuming that the measurement device is infinitely accurate.

If your criterion is, for example, to define reliability as how fast an instance shuts down, then this article isn't relevant. The article is pretty narrow in testing reliability; it only cares about launch time.
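That definition can be sketched in a few lines of Python. The sample launch times and threshold names below are made up for illustration; the constants correspond to the USL (10s) and allowed standard deviation (2s) above:

```python
from statistics import pstdev

USL_SECONDS = 10.0       # upper spec limit on time-to-launch (Z)
MAX_STDEV_SECONDS = 2.0  # allowed spread; no LSL, since instant launch is fine

def is_reliable(launch_times: list[float]) -> bool:
    # Reliable iff every sample is under the USL and the spread
    # stays within the allowed standard deviation.
    within_spec = all(t <= USL_SECONDS for t in launch_times)
    stable = pstdev(launch_times) <= MAX_STDEV_SECONDS
    return within_spec and stable

print(is_reliable([4.1, 5.0, 6.2, 4.8]))   # tight cluster under the USL -> True
print(is_reliable([4.1, 5.0, 48.0, 4.8]))  # one 48s outlier fails both checks -> False
```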

[+] 0xbadcafebee|3 years ago|reply
Reliability in general is measured on the basic principle of: does it function within our defined expectations? As long as it's launching, and it eventually responds within SLA/SLO limits, and on failure comes back within SLA/SLO limits, it is reliable. Even with GCP's multiple failures to launch, that may still be considered "reliable" within their SLA.

If both AWS and GCP had the same SLA, and one did better than the other at starting up, you could say one is more performant than the other, but you couldn't say it's more reliable if they are both meeting the SLA. It's easy to look at something that never goes down and say "that is more reliable", but it might have been pure chance that it never went down. Always read the fine print, and don't expect anything better than what they guarantee.

[+] zmmmmm|3 years ago|reply
> In total it scaled up about 3,000 T4 GPUs per platform

> why I burned $150 on GPUs

How do you rent 3,000 GPUs over a period of weeks for $150? Were they literally requisitioning them and releasing them immediately? This seems like quite an unrealistic usage pattern, and would depend a lot on whether the cloud provider optimizes to hand you back the same warm instance you just relinquished.

> GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator

It's quite fascinating that GCP can do this. GPUs are physical things (!) - do they provision every single instance type in the data center with GPUs? That would seem very expensive.

[+] orf|3 years ago|reply
AWS has different pools of EC2 instances depending on the customer, the size of the account and any reservations you may have.

Spawning a single GPU at varying times is nothing. Try spawning more than one, or using spot instances, and you’ll get a very different picture. We often run into capacity issues with GPU and even the new m6i instances at all times of the day.

Very few realistic company-scale workloads need a single GPU. I would willingly wait 30 minutes for my instances to become available if it meant all of them were available at the same time.

[+] playingalong|3 years ago|reply
This is great.

I've always felt there is very little independent content benchmarking the IaaS providers. There is so much you can measure in how they behave.

[+] kccqzy|3 years ago|reply
Heard from a Googler that the internal infrastructure (Borg) is simply not optimized for quick startup. Launching a new Borg job often takes multiple minutes before the job runs. Not surprising at all.
[+] epberry|3 years ago|reply
Echoing this. The SRE book is also highly revealing about how Google request prioritization works. https://sre.google/sre-book/load-balancing-datacenter/

My personal opinion is that Google's resources are more tightly optimized than AWS's - they may try to find the 99%-best allocation versus the 95%-best allocation on AWS - and this leads to more rejected requests. Open to being wrong on this.

[+] dekhn|3 years ago|reply
A well-configured isolated Borg cluster and a well-configured job can be really fast. If there's no preemption (i.e., no other job being kicked off that gets some grace period), the packages are already cached locally, there is no undue load on the scheduler, the resources are available, and it's a single job with tasks rather than multiple jobs, it will be close to instantaneous.

I spent a significant fraction of my 11+ years there clicking Reload on my job's Borg page. I was able to (re-)start ~100K jobs globally in about 15 minutes.

[+] devxpy|3 years ago|reply
Is this testing for spot instances?

In my limited experience, persistent (on-demand) GCP instances always boot up much faster than AWS EC2 instances.

[+] encryptluks2|3 years ago|reply
I noticed that too, and it does appear to be using spot instances. I have a feeling that if it were run without them, you might see much better startup times. Spot instances on GCP are hit and miss, and you sort of have to build that into your workflow.
[+] marcinzm|3 years ago|reply
In my experience GPU persistent instances often simply don't boot up on GCP due to lack of available GPUs. One reason I didn't choose GCP at my last company.
[+] ajross|3 years ago|reply
Worth pointing out that the article is measuring provisioning latency and success rates (how quickly can you get a GPU box running and whether or not you get an error back from the API when you try), and not "reliability" as most readers would understand it (how likely they are to do what you want them to do without failure).

Definitely seems like interesting info, though.

[+] jupp0r|3 years ago|reply
That's interesting but not what I expected when I read "reliability". I would have expected SLO metrics like uptime of the network or similar metrics that users would care about more. Usually when scaling a system that's built well you don't have hard short constraints on how fast an instance needs to be spun up. If you are unable to spin up any that can be problematic of course. Ideally this is all automated so nobody would care much about whether it takes a retry or 30s longer to create an instance. If this is important to you, you have other problems.
[+] vienarr|3 years ago|reply
The article only talks about GPU start time, but the title is "CloudA vs. CloudB reliability".

Bit of a stretch, right?

[+] runeks|3 years ago|reply
> These differences are so extreme they made me double check the process. Are the "states" of completion different between the two clouds? Is an AWS "Ready" premature compared to GCP? It anecdotally appears not; I was able to ssh into an instance right after AWS became ready, and it took as long as GCP indicated before I was able to login to one of theirs.

This is a good point and should be part of the test: after launching, SSH into the machine and run a trivial task to confirm that the hardware works.

[+] Animats|3 years ago|reply
> GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator - you can separately configure quantity of the CPUs as needed.

That would seem to indicate that asking for a VM on GCP gets you a minimally configured VM on basic hardware, and then it gets migrated to something bigger if you ask for more resources. Is that correct?

That could make sense if, much of the time, users get a VM and spend a lot of time loading and initializing stuff, then migrate to bigger hardware to crunch.

[+] humanfromearth|3 years ago|reply
We have constant autoscaling issues because of this in GCP - glad someone plotted this - hope people in GCP will pay a bit more attention to this. Thanks to the OP!