l-p | 3 years ago

> We never thought our startup would be threatened by the unreliability of a company like Microsoft

You're new to Azure I guess.

I'm glad the outage I had yesterday was only the third major one this year, though the one in August cost me days of traffic, months of back and forth with their support, and a good chunk of my sanity and patience in the face of blatant, documented lies and general incompetence.

One consumer-grade fiber link is enough to serve my company's traffic, and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.

Twirrim|3 years ago

It's worth pointing out that every cloud is the same when it comes to capacity / capacity risk. They all apply a lot of time and effort to figuring out the optimal amount of capacity to order based on track record of both customer demand and supply chain satisfaction.

Too much capacity is money spent getting no return, up front capex, ongoing opex, physical space in facilities etc.

At cloud scale (averaged out over all the customers) demand tends to follow pretty stable and predictable patterns, and the customers that actually tend to put capacity at risk (large ones) have contracts under which they give the providers plenty of heads-up.

What has been very problematic over the past few years is the supply chain. Intel's issues getting CPUs out for a few years really hurt. All of the major providers struggled through it, and the market is still somewhat unpredictable. The supply chain woes that have been wreaking havoc on everything from the car industry to the domestic white goods industry are having similar impacts on the server industry.

The level of unreliability in the supply chain is making it very difficult for the capacity management folks to do their job. It's not even that predictable which supply chain is going to be affected. Some of them are running far smoother and faster and capacity lands far faster than you'd expect, while others are completely messed up, then next month it's all flipped around. They're being paranoid, assuming the worst and still not getting it right.

This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.

The best thing you can do is be as hardware agnostic as is technically possible, so you can use whatever is available... which sucks.

marcinzm|3 years ago

In my experience there are differences between clouds, so while all have the same basic problem, in practice some may be better than others. I've never had issues getting GPUs on AWS, but GCP constantly has issues with GPU/TPU capacity.

whoknew1122|3 years ago

It may be a risk borne by every cloud provider, but why does this only really happen to Microsoft among large providers?

As far as chip shortages, it probably helps that Amazon makes its own chips. Microsoft could do the same rather than running out of capacity and blaming chip shortages.

Microsoft had to know that at some point they were going to run out of capacity. They should've either done something about it or let customers know.

Spooky23|3 years ago

> This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.

Yup. And a few of the OEMs have stopped talking about supply chain integrity. Many folks have observed more memory and power supply problems since the pandemic.

more_corn|3 years ago

All cloud providers are NOT equal here. Amazon over-provisions and sells the excess capacity as spot instances.

moralestapia|3 years ago

Never happened to me in AWS.

Wasn't the whole point of "the cloud" that these things shouldn't happen?

adrr|3 years ago

Azure has had some of the biggest outages, like when they went down on Feb 29th for the whole day.

https://azure.microsoft.com/en-us/blog/summary-of-windows-az...

jepler|3 years ago

It seems like in nearly 3 out of every 4 years the whole internet is unusable on February 29... why pick on microsoft?

Godel_unicode|3 years ago

That was 10 years ago. Has there been something similar recently?

rufius|3 years ago

Having worked for a company that's a very large customer of AWS's, it's not much better.

I've worked with both Azure and AWS professionally and both have had their fair share of "too many outages" or capacity issues. At this point, you basically must go multi-region to ensure capacity and even better if you can go multi-cloud.

janober|3 years ago

We've actually been using Azure for ~2 years now. It has worked reasonably well most of the time, even though we've also had a few issues. But our current issue, plus reading your comment and others, will probably result in us looking for a new home.

ckdarby|3 years ago

> One consumer-grade fiber link is enough to serve my company's traffic and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.

I don't believe that is even remotely correct.

It isn't the hardware pricing you should be worried about but the staffing, redundancy, and 24/7 operations coverage.

I'm dealing with both AWS and on-prem. On the on-prem side we spent some $5M to build out a whole new setup, and it took literally multiple months of racking, planning, designing, setting up, etc.

It's not even entirely in use because we hit supply chain issues with 100 Gbit switches, and they won't be coming until at least April of 2023 (after many months of delays upon delays already).

aprdm|3 years ago

Depending on your scale, things are really not that complicated. If you can run your company from a single machine, having two for redundancy, and two internet links for redundancy, will likely go a loooooooooong way until something bad happens...
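To put rough numbers on the "two machines, two links" idea, here is a minimal sketch of the standard availability arithmetic. It assumes failures are independent, which is optimistic when both machines share a room or both links share a trench, and the availability figures are hypothetical placeholders, not measurements from anyone in this thread.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures.

    The group is down only when every copy is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def serial_availability(*parts: float) -> float:
    """Availability of a chain in which every part must be up."""
    total = 1.0
    for part in parts:
        total *= part
    return total


# Hypothetical figures, chosen only for illustration.
MACHINE = 0.99   # one server
LINK = 0.995     # one internet link

single_setup = serial_availability(MACHINE, LINK)
redundant_setup = serial_availability(
    parallel_availability(MACHINE, 2),
    parallel_availability(LINK, 2),
)

print(f"one machine, one link:   {single_setup:.5f}")
print(f"two machines, two links: {redundant_setup:.5f}")
```

The gain shrinks quickly if the failures are correlated, so the result is best read as an upper bound.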

ethbr0|3 years ago

Out of curiosity (from someone inexperienced with Azure), is it a skill/ability chasm between MS engineering and outsourced support?

TAMs tend to be an organizational band-aid, a sign that support-as-normal sucks and isn't sufficient to get the job done (i.e., fix everything that breaks and isn't self-serve).

Spooky23|3 years ago

Microsoft support is really awful. Basically, if you need it regularly, you just pay for resident engineers who can bypass the wall between the product groups and you. I’ve had nothing but great experiences with those guys.

Otherwise, especially if there’s a broader problem, they play lots of games with SLAs, etc.

aprdm|3 years ago

YES! We tried a big project in the cloud (many many many high-end VMs), and Azure was SO unreliable. From BGP config fuck-ups to obscure bugs in their stack.

Their support was also amazing in the beginning... but once they've hooked you... you're just a ticket in their system. It takes weeks to fix something you could fix in minutes on-prem, or that their black belt would have gotten fixed very quickly at the beginning of the relationship.

Cloud isn't that magical unicorn!

SergeAx|3 years ago

Yes, and what is your contingency plan for said fiber going dark?

roflyear|3 years ago

I have DB connection issues at least a few times a week. Annoying.

marcosdumay|3 years ago

New Microsoft customer at all.

Insanity|3 years ago

The common argument of "our own hardware would be more profitable in X years" is typically countered with "but you need to pay engineers to maintain it, which adds to the cost".

Another advantage of not having to own the hardware is that it's easier to scale and to get started with new types of services (e.g., data warehouse solutions, serverless compute, new DB types, ...).

I'm not trying to advocate for or against cloud solutions here, just pointing out that the decision involves more factors than "hardware cost".

unionpivo|3 years ago

Depends on how stable your needs are, but sometimes it's cheaper even when you consider total cost, and not just for big deployments.

In the past two or three years, we have probably moved more services off the cloud than the other way around. That said, one reason is that most new services are built in the cloud, so there are fewer services off the cloud than on it.

Cloud is best when you are starting out, when you don't know what you need, when you need high velocity in adding new stuff, or when you have very bursty demand for traffic, CPU, etc. Or if you are just a small, developer-only team.

But if you have applications that are relatively stable and mostly feature complete, and you don't expect much sudden growth, it's useful to run the numbers on whether cloud is still something you want/need.
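As a rough illustration of what "running the numbers" could look like, here is a minimal break-even sketch. The function name and every figure in it are hypothetical, and it deliberately folds staffing, colocation, power, and spares into a single monthly on-prem number, which a real comparison would need to break down much further.

```python
import math
from typing import Optional


def break_even_months(
    cloud_monthly: float,
    hardware_capex: float,
    onprem_monthly: float,
) -> Optional[int]:
    """Months until buying hardware is cheaper than staying in the cloud.

    Returns None when on-prem running costs are at least as high as the
    cloud bill, because the up-front purchase then never pays for itself.
    """
    monthly_saving = cloud_monthly - onprem_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(hardware_capex / monthly_saving)


# Hypothetical inputs: hardware costing two months of cloud spend,
# plus a monthly on-prem cost covering staff, hosting, and spares.
months = break_even_months(
    cloud_monthly=10_000,
    hardware_capex=20_000,
    onprem_monthly=6_000,
)
print(months)
```

The interesting input is usually `onprem_monthly`: if it ignores engineering time and redundancy, the answer will look far better than it is.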