Well, if your system elastically uses GPU compute and needs to be able to spin up, run compute on a GPU, and spin down in a predictable amount of time to provide reasonable UX, launch time would definitely be a factor in terms of customer-perceived reliability.
jhugo|3 years ago
deanCommie|3 years ago
In many cases, "guaranteed" just means "we'll give you a refund if we fuck up". SLAs are very much like this.
IN PRACTICE, unless you're launching tens of thousands of instances of an obscure image type, reasonable customers can get capacity from the cloud, and get it promptly.
That's the entire cloud value proposition.
So no, you can't just hand-wave past these GCP results and say "Well, they never said these were guaranteed".
dark-star|3 years ago
The errors might be considered a reliability issue, but then again, errors are a very common thing in large distributed systems, and any orchestrator/autoscaler would just retry the instance creation and succeed. Again, a performance impact (since it takes longer until your target capacity is reached), but reliability? Not really.
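The retry-until-success behavior described above might look like this minimal sketch; the `launch_fn` callable and the error type are hypothetical stand-ins for whatever a real cloud SDK raises:

```python
import time


class TransientLaunchError(Exception):
    """Stand-in for a provider's transient capacity/API error."""


def launch_with_retry(launch_fn, max_attempts=5, base_delay=1.0):
    """Retry instance creation with exponential backoff.

    Errors cost time (target capacity is reached later), so the failure
    shows up as latency rather than an outage, matching the point above.
    """
    for attempt in range(max_attempts):
        try:
            return launch_fn()
        except TransientLaunchError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Any real autoscaler adds jitter and a per-error-type policy on top, but the shape is the same.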
irrational|3 years ago
rco8786|3 years ago
Waterluvian|3 years ago
Consistently slow is still reliability.
unknown|3 years ago
[deleted]
somat|3 years ago
Like the article said, the promise of the cloud is that you can easily get machines when you need them. A cloud that sometimes does not get you that machine (or does not get it to you in time) is a less reliable cloud than one that does.
VWWHFSfQ|3 years ago
If the instance takes too long to launch then it doesn't matter if it's "reliable" once it's running. It took too long to even get started.
Art9681|3 years ago
thwayunion|3 years ago
Midnight - 6am is six hours. The on-demand price for a G5 is $1/hr. That's over $2K/yr, or "an extra week of skiing paid for by your B2B side project that almost never has customers from ~9pm west coast to ~6am east coast". And I'm not even counting weekends.
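The arithmetic behind that figure, as a quick sanity check (using the $1/hr on-demand G5 price quoted above):

```python
hours_per_night = 6      # midnight to 6am
price_per_hour = 1.00    # on-demand G5, USD
annual_savings = hours_per_night * price_per_hour * 365
# 6 * $1 * 365 = $2190/yr, i.e. "over $2K" before counting weekends
```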
But that's sort of a silly edge case (albeit probably a real one for lots of folks commenting here). The real savings are in predictable startup times for bursty workloads. Fast and low-variance startup times unlock a huge amount of savings. Without both speed and predictability, you have to plan to fail and over-allocate, which can get really expensive fast.
Another way to think about this is that zero isn't special. It's just a special case of the more general scenario where customer demand exceeds current allocation. The larger your customer base, and the burstier your demand, the more instances you need sitting on ice to meet customers' UX requirements. This is particularly true when you're growing fast and most of your customers are new; you really want a good customer experience every single time.
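One way to put numbers on "instances sitting on ice": if an instance takes T seconds to become useful and demand can burst at R instances per second, you need roughly R × T warm instances to hide startup latency. A toy calculation (all numbers made up for illustration):

```python
import math


def warm_pool_size(burst_rate_per_s, startup_s, safety_factor=1.5):
    """Idle instances needed so bursts never wait on cold starts.

    Slower or higher-variance startup inflates the pool linearly,
    which is where the over-allocation cost comes from.
    """
    return math.ceil(burst_rate_per_s * startup_s * safety_factor)


fast = warm_pool_size(0.5, 15)    # predictable 15s startup -> 12 idle
slow = warm_pool_size(0.5, 120)   # plan for a 120s tail    -> 90 idle
```

Same demand, same safety factor; only the startup time you must plan for changes, and the idle fleet (and bill) scales with it.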
diroussel|3 years ago
Maintaining a buffer pool is hard. You need to maintain state, have a prediction function, track usage through time, etc. Just spinning up new nodes for new work is substantially easier.
And the author said he could spin up new nodes in 15 seconds, that’s pretty quick.
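A skeleton of the bookkeeping such a pool needs (all names hypothetical) shows why it's more work than spin-up-per-job:

```python
import time
from collections import deque


class BufferPool:
    """Minimal warm-node pool: state, a prediction hook, usage tracking."""

    def __init__(self, predictor):
        self.idle = deque()          # state: warm nodes waiting for work
        self.usage_log = []          # usage tracked through time
        self.predictor = predictor   # prediction function: log -> target size

    def acquire(self):
        self.usage_log.append(time.time())
        return self.idle.popleft() if self.idle else self._cold_start()

    def release(self, node):
        self.idle.append(node)

    def rebalance(self, spin_up, tear_down):
        """Grow or shrink the idle pool toward the predicted target."""
        target = self.predictor(self.usage_log)
        while len(self.idle) < target:
            self.idle.append(spin_up())
        while len(self.idle) > target:
            tear_down(self.idle.pop())

    def _cold_start(self):
        return object()  # placeholder for a real (slow) node launch
```

With 15-second cold starts, every line of this (plus the predictor, which is the genuinely hard part) is machinery you can skip.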
HenriTEL|3 years ago
mikepurvis|3 years ago
pier25|3 years ago