item 32932153

mikewave|3 years ago

Well, if your system elastically uses GPU compute and needs to be able to spin up, run compute on a GPU, and spin down in a predictable amount of time to provide reasonable UX, launch time would definitely be a factor in terms of customer-perceived reliability.

jhugo|3 years ago

All the clouds are pretty upfront about availability being non-guaranteed if you don't reserve it. I wouldn't call it a reliability issue if your non-guaranteed capacity takes some tens of seconds to provision. I mean, it might be your reliability issue, because you chose not to reserve capacity, but it's not really unreliability of the cloud — they're providing exactly what they advertise.

deanCommie|3 years ago

"Guaranteed" has different tiers of meaning - both theoretical and practical.

In many cases, "guaranteed" just means "we'll give you a refund if we fuck up". SLAs are very much like this.

IN PRACTICE, unless you're launching tens of thousands of instances of an obscure image type, reasonable customers would be able to get capacity from the cloud, and promptly.

That's the entire cloud value proposition.

So no, you can't just hand-wave past these GCP results and say "Well, they never said these were guaranteed".

dark-star|3 years ago

I'd still consider it a "performance issue", not a "reliability issue". There is no service unavailability here. It just takes your system a minute longer until the target GPU capacity is available. Until then it runs on fewer GPU resources, which makes it slower. Hence performance.

The errors might be considered a reliability issue, but then again, errors are a very common thing in large distributed systems, and any orchestrator/autoscaler would just retry the instance creation and succeed. Again, a performance impact (since it takes longer until your target capacity is reached), but reliability? Not really.
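The retry loop described above could be sketched roughly like this (a minimal illustration, not any real cloud SDK; `create_fn` is a hypothetical callable wrapping the provider's instance-creation call):

```python
import random
import time

def create_instance_with_retry(create_fn, max_attempts=5, base_delay=2.0):
    """Retry transient instance-creation errors with exponential backoff.

    `create_fn` is a hypothetical wrapper around a cloud API call that
    raises on transient capacity/API errors and returns the instance
    on success.
    """
    for attempt in range(max_attempts):
        try:
            return create_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter keeps many simultaneous retries from stampeding
            # the API at the same instant.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

From the caller's perspective a transient capacity error just shows up as a slower launch, which is exactly the performance-vs-reliability framing of the comment.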

irrational|3 years ago

I’d like to see a breakdown of the cost differences. If the costs are nearly equal, why would I not choose the one that has a faster startup time and fewer errors?

rco8786|3 years ago

Sure but not anywhere remotely near clearing the bar to simply calling that “reliability”.

Waterluvian|3 years ago

When I think “reliability” I think “does it perform the act consistently?”

Consistently slow is still reliability.

somat|3 years ago

It is not reliably running the machine but reliably getting the machine.

Like the article said, the promise of the cloud is that you can easily get machines when you need them. A cloud that sometimes does not get you that machine (or does not get you that machine in time) is a less reliable cloud than the one that does.

VWWHFSfQ|3 years ago

I would still call it "reliability".

If the instance takes too long to launch then it doesn't matter if it's "reliable" once it's running. It took too long to even get started.

Art9681|3 years ago

Why would you scale to zero in high perf compute? Wouldn't it be wise to have a buffer of instances ready to pick up workloads instantly? I get that it shouldn't be necessary with a reliable and performant backend, and that the cost of having some instances waiting for a job can be substantial depending on how you do it, but I wonder if the cost difference between AWS and GCP would make up for that, and you could get an equivalent amount of performance for an equivalent price? I'm not sure. I'd like to know though.

thwayunion|3 years ago

> Why would you scale to zero in high perf compute?

Midnight - 6am is six hours. The on demand price for a G5 is $1/hr. That's over $2K/yr, or "an extra week of skiing paid for by your B2B side project that almost never has customers from ~9pm west coast to ~6am east coast". And I'm not even counting weekends.
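Sanity-checking the arithmetic above (assuming the quoted ~$1/hr on-demand G5 rate and six idle hours every night, weekends excluded as in the comment):

```python
# Back-of-envelope cost of keeping one G5 running through the idle
# overnight window instead of scaling to zero.
hourly_rate = 1.00          # USD, approximate on-demand G5 price
idle_hours_per_night = 6    # midnight to 6am
nights_per_year = 365

yearly_idle_cost = hourly_rate * idle_hours_per_night * nights_per_year
print(yearly_idle_cost)  # 2190.0 -- "over $2K/yr" saved by scaling to zero
```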

But that's sort of a silly edge case (albeit probably a real one for lots of folks commenting here). The real savings are in predictable startup times for bursty workloads. Fast and low variance startup times unlock a huge amount of savings. Without both speed and predictability, you have to plan to fail and over-allocate. Which can get really expensive fast.

Another way to think about this is that zero isn't special. It's just a special case of the more general scenario where customer demand exceeds current allocation. The larger your customer base, and the burstier your demand, the more instances you need sitting on ice to meet customers' UX requirements. This is particularly true when you're growing fast and most of your customers are new; you really want a good customer experience every single time.

diroussel|3 years ago

Scaling to zero means zero cost when there is zero work. If you have a buffer pool, how long do you keep it populated when you have no work?

Maintaining a buffer pool is hard. You need to maintain state, have a prediction function, track usage through time, etc. Just spinning up new nodes for new work is substantially easier.

And the author said he could spin up new nodes in 15 seconds, that’s pretty quick.
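The bookkeeping a buffer pool needs (state, a prediction function, usage tracked through time) could be sketched as a toy controller like this; `launch` and `terminate` are hypothetical callables, not any real cloud API, and the prediction here is deliberately naive:

```python
from collections import deque

class WarmPool:
    """Toy warm-instance pool illustrating the state a buffer pool
    has to carry: idle instances, recent demand, and a prediction
    function that decides the target pool size.
    """

    def __init__(self, launch, terminate, window=10):
        self.launch = launch            # hypothetical: starts an instance
        self.terminate = terminate      # hypothetical: stops an instance
        self.idle = deque()             # warmed, unassigned instances
        self.recent_demand = deque(maxlen=window)  # usage through time

    def record_demand(self, n):
        self.recent_demand.append(n)

    def predict_target(self):
        # Naive prediction: keep as many warm instances as the peak
        # demand seen in the recent window.
        return max(self.recent_demand, default=0)

    def reconcile(self):
        target = self.predict_target()
        while len(self.idle) < target:
            self.idle.append(self.launch())
        while len(self.idle) > target:
            self.terminate(self.idle.pop())
```

Even this stripped-down version shows why "just spin up new nodes" is simpler: every piece of state here is something you otherwise don't have to maintain.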

HenriTEL|3 years ago

GCP provides elastic features for that. One should use them instead of manually requesting new instances.

mikepurvis|3 years ago

Hopefully anyone with a workload that's that latency sensitive would have a preallocated pool of warmed-up instances ready to go.

pier25|3 years ago

Wouldn't Cloud Run be a better product for that use case?