item 33743567

Ask HN: Azure has run out of compute – anyone else affected?

651 points | janober | 3 years ago

Last week we at n8n ran into problems getting a new database from Azure. After contacting support, it turns out that we can't add instances to our k8s cluster either. Azure has told us they'll have more capacity in April 2023(!), but we'll have to stop accepting new users in ~35 days if we don't get any more. These problems seem to affect only the German region, but setting up in a new region would be complicated for us.

We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.

Is anyone else experiencing these problems?

341 comments

[+] l-p|3 years ago|reply
> We never thought our startup would be threatened by the unreliability of a company like Microsoft

You're new to Azure I guess.

I'm glad the outage I had yesterday was only the third major one this year, though the one in August made me lose days of traffic, months of back and forth with their support, and a good chunk of my sanity and patience in the face of blatant, documented lies and general incompetence.

One consumer-grade fiber link is enough to serve my company's traffic, and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.

[+] Twirrim|3 years ago|reply
It's worth pointing out that every cloud is the same when it comes to capacity / capacity risk. They all apply a lot of time and effort to figuring out the optimal amount of capacity to order based on track record of both customer demand and supply chain satisfaction.

Too much capacity is money spent getting no return, up front capex, ongoing opex, physical space in facilities etc.

On cloud scales (averaged out over all the customers) the demand tends to follow pretty stable and predictable patterns, and the ones that actually tend to put capacity at risk (large customers) have contracts where they'll give plenty of heads-up to the providers.

What has been very problematic over the past few years has been the supply chains. Intel's issues for a few years in getting CPUs out really hurt the supply chains. All of the major providers struggled through it, and the market is still somewhat unpredictable. The supply chain woes that have been wreaking havoc on everything from the car industry to the domestic white goods industry are having similar impacts on the server industry.

The level of unreliability in the supply chain is making it very difficult for the capacity management folks to do their job. It's not even that predictable which supply chain is going to be affected. Some of them are running far smoother and faster and capacity lands far faster than you'd expect, while others are completely messed up, then next month it's all flipped around. They're being paranoid, assuming the worst and still not getting it right.

This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.

The best thing to try to do is do your best to be as hardware agnostic as is technically possible, so you can use whatever is available... which sucks.

[+] rufius|3 years ago|reply
Having worked for a company that's a very large customer of AWS's, it's not much better.

I've worked with both Azure and AWS professionally and both have had their fair share of "too many outages" or capacity issues. At this point, you basically must go multi-region to ensure capacity and even better if you can go multi-cloud.

[+] janober|3 years ago|reply
We have actually used Azure for ~2 years now. It worked reasonably well most of the time, even though we also had a few issues. But our current issue + reading your and other comments will probably result in looking for a new home.
[+] ckdarby|3 years ago|reply
> One consumer-grade fiber link is enough to serve my company's traffic and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year of two of sustained growth.

I don't believe that is even remotely correct.

It isn't the pricing you should be worried about but the staffing, redundancy, and 24/7 operations staff.

I'm dealing with AWS and on-prem. The on-prem side spent some $5M to build out a whole new setup; it took literally months of racking, planning, designing, setting up, etc.

It's not even entirely in use because we got supply chain issues for 100 Gbit switches and they won't be coming until at least April of 2023 (after many months of delays upon delays already).

[+] ethbr0|3 years ago|reply
Out of curiosity (from someone inexperienced with Azure), is it a skill/ability chasm between MS engineering and outsourced support?

TAMs tend to be a bandaid organizational sign that support-as-normal sucks and isn't sufficient to get the job done (ie fix everything that breaks and isn't self-serve).

[+] aprdm|3 years ago|reply
YES! We tried a big project in the cloud (many, many, many high-end VMs), and Azure was SO unreliable. From BGP config fuck-ups to obscure bugs in their stack.

Their support was also amazing in the beginning... but after they've hooked you, you're just a ticket in their system. It takes weeks to fix something you could fix in minutes on-prem, or that their black belt would have fixed in a very short amount of time at the beginning of the relationship.

Cloud isn't that magical unicorn!

[+] SergeAx|3 years ago|reply
Yes, and what is your contingency plan for said fiber going dark?
[+] roflyear|3 years ago|reply
I have DB connection issues at least a few times a week. Annoying.
[+] Insanity|3 years ago|reply
The common argument of "our own hardware would be more profitable in X years" is typically countered with "but you need to pay engineers to maintain it, which adds to the cost".

Another advantage of not having to own the hardware is that it's easier to scale, and get started with new types of services. (i.e, datawarehouse solutions, serverless compute, new DB types,..).

I'm not trying to advocate for or against cloud solutions here, but just pointing out that the decision making has more factors apart from "hardware cost".

[+] xwowsersx|3 years ago|reply
Oof, that sucks and I feel for you. That said...

> setting up in a new region would be complicated for us.

Sounds to me like you've got a few weeks to get this working. Deprioritize all other work, get everyone working on this little DevOps/Infra project. You should've been multi-region from the outset, if not multi-cloud.

When using the public cloud, we do tend to take it all for granted and don't even think about the fact that physical hardware is required for our clusters and that, yes, they can run out.

Anyways, however hard getting another region set up may be, it seems you've no choice but to prioritize that work now. May also want to look into other cloud providers as well, depending on how practical or how overkill going multi-cloud may or may not be for your needs.

I wish you luck.

[+] craigkerstiens|3 years ago|reply
This is nothing new; Azure has been having capacity problems for over a year now[1]. Germany is not the only region affected at all; it's the case for a number of instance types in some of their larger US regions as well. In the meantime you can still commit to reserved instances, there is just no guarantee of getting those instances when you need them.

The biggest advice I can give is: 1. keep trying and grabbing capacity continuously, then run with more than what you need. 2. Explore migrating to another Azure region that is less constrained. You mention a new region would be complicated, but it is likely much easier than another cloud.

1. https://www.zdnet.com/article/azures-capacity-limitations-ar...
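The "keep trying and grabbing capacity continuously" advice above can be sketched as a simple retry loop that round-robins over candidate regions and holds whatever it gets. This is a minimal sketch: `try_provision`, the region names, and the stub below are all hypothetical stand-ins, not a real Azure API.

```python
import itertools
import time

def acquire_capacity(try_provision, regions, want, max_attempts=100, delay=0.0):
    """Round-robin over regions, retrying until `want` instances are held.

    `try_provision(region)` stands in for whatever your tooling uses to
    request one VM; it should return an instance handle or None on failure.
    """
    held = []
    for _attempt, region in zip(range(max_attempts), itertools.cycle(regions)):
        if len(held) >= want:
            break
        instance = try_provision(region)
        if instance is not None:
            held.append(instance)  # keep it, even if over-provisioned for now
        else:
            time.sleep(delay)  # back off before hammering the API again
    return held

# Example with a stub that fails three times before succeeding:
attempts = {"n": 0}
def stub(region):
    attempts["n"] += 1
    return f"vm-{attempts['n']}" if attempts["n"] > 3 else None

held = acquire_capacity(stub, ["germanywestcentral", "westeurope"], want=2, delay=0)
print(len(held))  # → 2
```

Running with more than you need, as the comment suggests, then falls out naturally: set `want` above your current requirement and keep the surplus as a buffer.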

[+] cfeduke|3 years ago|reply
I worked briefly in an enterprise facing sales organization that targeted multi-cloud deployments. Azure always had capacity problems.

As ridiculous as it sounds, having an enterprise's applications exist on multi-cloud isn't terrible if the application is mission critical - not only does this get around Azure's constant provisioning issues but protects an organization from the rare provider failure. (Though multi-region AWS has never been a problem in my experience, there is a first time for everything.) Data transfer pricing between clouds is prohibitively expensive, especially when you consider the reason why you may want multi-cloud in the first place (e.g., it's easier to provision 1000+ instances on AWS than Azure for an Apache Spark cluster for a few minutes or hours execution - mostly irrelevant if your data lives in Azure Data Lake Storage).

[+] bri3d|3 years ago|reply
Every cloud provider will have these issues with specific instance types in specific regions, although the Azure Germany situation sounds perhaps a bit more dire. At my past (much larger) employers we’ve always run into hardware capacity issues with AWS too - we’re just able to work around them.

Building on cloud requires a lot of trade offs, one being a need for very robust cross-region capability and the ability to be flexible with what instance types your infrastructure requires.

I’d use this as a driver to either invest in making your software multi regional or cloud agnostic. Multi regional will be easier. If you’re already on k8s you should have a head start here.

[+] Innominate|3 years ago|reply
As much as this happens, I don't feel it's something to be expected or even okay.

The major cloud services are expensive. This extra cost is supposed to provide for cloud services' high level of flexibility. Running out of capacity should be a rare event and treated as a high priority problem to be fixed asap.

Without the ability to rapidly and arbitrarily scale, they're just overpriced server farms.

[+] PaulHoule|3 years ago|reply
There is a "minimum viable product" of documenting the configuration of your system so you can (1) run development, test, and staging instances, (2) jump to another region when necessary, and (3) recover from other disasters.

Ideally you have a script that goes from credentials to the service to a complete working instance.
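One way to picture that "credentials to complete working instance" script is as an ordered pipeline where the target region is just a parameter. The sketch below uses invented stand-in steps (in practice each would shell out to Terraform, a CLI, or an SDK); nothing here is a real provisioning API.

```python
def bootstrap(region, steps):
    """Run the provisioning steps in order against one target region.

    Each step is a function taking the region. Keeping the whole path from
    credentials to a working instance in code like this is what makes
    "jump to another region" a parameter change rather than a project.
    """
    results = []
    for step in steps:
        results.append(step(region))
    return results

# Hypothetical stand-ins for the real steps (network, cluster, DB, app):
steps = [
    lambda r: f"network in {r}",
    lambda r: f"cluster in {r}",
    lambda r: f"database in {r}",
    lambda r: f"app deployed to {r}",
]
print(bootstrap("westeurope", steps)[-1])  # → app deployed to westeurope
```

The same `steps` list run with `"northeurope"` is the region failover; that symmetry is the point of scripting it.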

[+] andrewstuart|3 years ago|reply
Yes, it's weird that you have to ask them for instances, and some actual physical person looks at your request, thinks about it, and says yes or no.

Instead of providing you with a list of the resources they do have, you have to play this weird game where you ask for specific instances in specific regions and then within several hours someone emails back to say yes or no.

If it’s no, you have to guess again where you might get the instance you want and email them again and ask.

I envisage going to an old shop, and asking the shopkeep for a compute instance in a region. He hobbles out the back, and after a long delay comes back and says “nope, don’t have no more of them, anything else you might want?”.

It's surprising that this is how it works. Not the auto-scaling that cloud computing used to bring to mind.

[+] victor106|3 years ago|reply
I am sorry to say but at this point Azure is so f’ed up I think it should only be considered after AWS and GCP.

The documentation is terrible and the Azure portal is so slow and laggy I can’t even believe it. Not to mention how unreliable their stack is.

[+] arecurrence|3 years ago|reply
This is not as rare as public clouds may lead people to believe. I have had to move workloads around since AWS began (even between public clouds on occasion).

In particular, GPU availability has been a continuing problem. Unlike interchangeable x64 / arm64 instances with some adjustments based on the new core and ram count... if no GPU instances are available then I simply cannot run the job. AMD's improved support has increasingly provided an alternative in some situations but the problem persists.

I recommend doing the work to make the business somewhat cloud agnostic, or at the very least multi-region capable. I realize this is not an option for some services that have no equivalent on other clouds but you mentioned databases and k8s clusters which are both supported elsewhere.

[+] andrewstuart|3 years ago|reply
GPUs are better run in your own office.

All cloud providers charge much, much more for GPUs than if you run a local machine.

Cloud GPUs are also a lot slower than state of the art consumer GPUs.

Cloud GPUs: much slower, less available, much more expensive.

[+] dehrmann|3 years ago|reply
You want to be in a position where you can spin up in a nearby region and pretend it's local and have things be good enough for a while. Properly building out multi-region is hard, and multi-cloud isn't worth it because it improves how you handle rare events (where half the internet is already down) with ongoing operational toil.
[+] mmcconnell1618|3 years ago|reply
I used to be a technical seller for Azure. This situation is obviously not great for you as a customer but there are proactive steps you can take to prevent this going forward. Reach out to your sales team and work with them on your roadmap for compute requirements going forward. The sales team has a forecast tool that feeds back into the department that buys and racks the equipment. If you can provide enough lead time, they will make sure you have compute resources available in your subscriptions.
[+] wstuartcl|3 years ago|reply
What you describe is like the inverse of 90% of the reason companies host in the cloud. What makes needing to forecast and reach out to a sales guy to eventually stock hardware for your needs (while now competing against other customers for those resources) any better than hosting on-prem?

AWS for sure has had resource constraints in different AZs (especially during Black Friday and holiday loads), but I have never had an issue finding resources to spin up, especially if I was willing to be flexible on VM type.

[+] alexeldeib|3 years ago|reply
What VM sizes?

Besides what’s already been said, internal capacity differs HUGELY based on VM SKU. If you need GPUs or something it’ll be tough. But a lot of the newer v4/v5 general compute SKUs (D/Da/E/Ea/etc) have plenty of capacity in many regions.

If changing regions sounds like a pain, consider gambling on other VM size availability.

(azure employee)
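One way to act on this advice: `az vm list-skus` reports per-region, per-subscription restrictions, and its JSON output can be filtered down to SKUs you can actually provision. This is a sketch — the sample records below are invented, and the exact field names should be checked against your CLI's output.

```python
def unrestricted_skus(skus):
    """Filter `az vm list-skus`-style records down to SKUs with no restrictions.

    Each record is expected to look roughly like the CLI's JSON output:
    {"name": ..., "restrictions": [...]}. An empty restrictions list means
    the SKU can actually be provisioned in that subscription/region.
    """
    return [s["name"] for s in skus if not s.get("restrictions")]

# Made-up sample resembling `az vm list-skus -l germanywestcentral -o json`:
sample = [
    {"name": "Standard_E4s_v4",
     "restrictions": [{"reasonCode": "NotAvailableForSubscription"}]},
    {"name": "Standard_D4s_v5", "restrictions": []},
    {"name": "Standard_E4s_v5", "restrictions": []},
]
print(unrestricted_skus(sample))  # → ['Standard_D4s_v5', 'Standard_E4s_v5']
```

Running this across the newer v4/v5 general-compute families mentioned above is a cheap way to find a substitute size before committing to a region move.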

[+] janober|3 years ago|reply
Actually nothing fancy, for sure no GPUs. Just Standard_E4s_v4.
[+] whalesalad|3 years ago|reply
> We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.

Yikes, this is totally the first thing you need to come to expect when working with MSFT.

[+] scotty79|3 years ago|reply
When Amazon S3 was a new thing, I managed to convince my company to move to it. The first week after we started serving some of our stuff from S3, Amazon had an outage.
[+] janober|3 years ago|reply
Probably a good learning for the future ;-)
[+] jenscow|3 years ago|reply
Maybe Microsoft had just got their AWS bill?
[+] usgroup|3 years ago|reply
Well I thought that was funny :-)
[+] DannyBee|3 years ago|reply
Most of Europe expects the winter to be quite painful from a power perspective. It would not be surprising if cloud providers (major power users) are being asked to not increase (or even decrease) power usage.

The timeframe they gave would match that kind of ask.

I wonder whether you see the same behavior from other cloud providers there (ie if you ask them whether new capacity is available, what do they say)

[+] arcturus17|3 years ago|reply
> It would not be surprising if cloud providers (major power users) are being asked to not increase (or even decrease) power usage.

I doubt it. It will be easier - and probably safer - to ask citizens and physical industry (eg, factories) to bear the brunt than to risk having problems in critical IT infrastructure. Ask people and factories to turn the heat 3 degrees down and the effects will be more or less predictable. Asking to shut compute power down at random will have unpredictable consequences.

[+] analyst74|3 years ago|reply
Obviously Azure failed its customer here, but everyone with data centers in Europe is tightening their belts and preparing for the worst.

I suspect AWS and GCP just have more headroom in EU.

[+] andrewstuart|3 years ago|reply
Message to cloud providers:

List what you do have available so we can choose.

Do not force users to randomly guess and be refused until eventually finding something available.

[+] RajT88|3 years ago|reply
Imagine if they did this in realtime. There's already DDOS attacks happening which are abusing the cloud free trials at scale - this would give them another attack vector.

I can see why they wouldn't want to do this.

[+] crmd|3 years ago|reply
I need big m4n instances with 100gbe for product demos, and spinning them up lately is like trying to get Taylor Swift tickets on Ticketmaster. We end up wasting money running them for days at a time instead of on demand because we’re afraid of losing them.

It’s infuriating that AWS doesn’t have an API that returns a list of AZs with available inventory for a given instance type.
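The closest thing AWS does expose, as far as I know, is `describe-instance-type-offerings`, which lists where a type is *offered* — it says nothing about live capacity, which is the complaint above. Below, the boto3 call is commented out and a pure helper parses an invented response of the same shape.

```python
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# resp = ec2.describe_instance_type_offerings(
#     LocationType="availability-zone",
#     Filters=[{"Name": "instance-type", "Values": ["m5n.24xlarge"]}],
# )

def offered_zones(resp, instance_type):
    """Pull the AZ names for one instance type out of the response shape."""
    return sorted(
        o["Location"]
        for o in resp["InstanceTypeOfferings"]
        if o["InstanceType"] == instance_type
    )

# Invented sample response for illustration:
sample_resp = {"InstanceTypeOfferings": [
    {"InstanceType": "m5n.24xlarge", "Location": "us-east-1a"},
    {"InstanceType": "m5n.24xlarge", "Location": "us-east-1c"},
    {"InstanceType": "m5.large",     "Location": "us-east-1b"},
]}
print(offered_zones(sample_resp, "m5n.24xlarge"))  # → ['us-east-1a', 'us-east-1c']
```

Offered-but-full is still possible, so this narrows the guessing game rather than ending it.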

[+] layer8|3 years ago|reply
Why would they make any promises, or be upfront about their resources at the risk of becoming less attractive compared to competitors with more resources? It’s not like many people are shunning the cloud for that reason today (although maybe they should).
[+] Too|3 years ago|reply
At bare minimum there should be a feature like "Give me any VM that closely resembles an E8s_v5 with at least 32G ram". Or "anything from these 4 approved types".

I don't always care if you give me a E8_v4 or a D8 instead, just give something. With all the 100 of variants of VMs that are available, finding an exact match is obviously an unnecessary constraint. Maybe they already simulate this behind the scenes, I don't know, though given the sizes are advertised with HW capabilities I'd imagine they can't really simulate a v4 using a v5 and vice versa.

Only place I've seen compute be treated this fluidly is in Container instance, which is a bad choice for many many other reasons.
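The "give me anything close enough" feature is essentially SKU substitution, and callers can approximate it client-side today: walk an ordered list of approved types and take the first available one meeting a vCPU/RAM floor. A sketch with a made-up catalog; the SKU names mirror the comment, not a real API.

```python
def pick_substitute(available, approved, min_vcpu, min_ram_gb):
    """Return the first approved SKU that is available and meets the floor specs.

    `available` maps SKU name -> (vcpu, ram_gb); `approved` is an ordered
    preference list, so callers get the closest acceptable match first.
    """
    for sku in approved:
        spec = available.get(sku)
        if spec and spec[0] >= min_vcpu and spec[1] >= min_ram_gb:
            return sku
    return None

# Invented catalog; real numbers should come from the provider's SKU API.
catalog = {"Standard_E8_v4": (8, 64), "Standard_D8_v5": (8, 32)}
choice = pick_substitute(
    catalog, ["Standard_E8s_v5", "Standard_E8_v4", "Standard_D8_v5"], 8, 32
)
print(choice)  # → Standard_E8_v4
```

The preference ordering encodes the "4 approved types" idea from the comment: you only ever get something you pre-vetted, just not necessarily your first choice.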

[+] robertlagrant|3 years ago|reply
Interesting semi-confirmed anecdote: when lockdown hit, Azure began to refuse to allocate servers. One of the main reasons was they prioritised servers in this way:

1. Government/health/defence cloud customers

2. Teams, which was exploding in use and they wanted to capitalise on it

3. Regular cloud customers

[+] dszoboszlay|3 years ago|reply
Good news is that today is Black Friday, so the e-commerce industry is running at peak capacity. In 30 days it will be Christmas, and by then (the very latest!) everybody will scale back, so you have a good chance to gain access to more compute before you reach the end of your runway.
[+] ttrrooppeerr|3 years ago|reply
> We never thought our startup would be threatened by the unreliability of a company like Microsoft

You will be threatened by your own unreliability of building something that's dependent on one region or one cloud.

[+] deathanatos|3 years ago|reply
I've seen this before. I think it was in us-west1, ran out of VMs of the size we used for CI. Had to move to a different region. (Never moved back…)

It is shocking to me that it happened at all. Capacity planning shouldn't be so far behind in a cloud that wants to position itself as on par with AWS/GCP. (Which Azure absolutely isn't.) To me, having capacity planning solved is part of what I am paying for in the higher price of the VM.

> We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.

Oh my sweet summer child, welcome to Azure. Don't depend on them being proactive about anything; even depending on them to react is a mistake, e.g., they do not reliably post-mortem severe failures. (At least, externally. But as a customer, I want to know what you're doing to prevent $massive_failure from happening again, and time and time again they're just silent on that front.)

[+] plantain|3 years ago|reply
I'm baffled to read stories that suggest Azure is a viable competitor to GCP/AWS - they're an absolute nightmare on capacity.

It took me six months to get approved to start six instances! With multiple escalations including being forcibly changed to invoice billing - for which they never match the invoices automatically, every payment requires we file a ticket.

[+] Moissanite|3 years ago|reply
Azure Germany is a separate partition from the rest of Azure - presumably for compliance reasons. This is distinct from AWS, where Frankfurt is just another region, albeit one with high demand.
[+] Terretta|3 years ago|reply
> AWS .. Frankfurt is just another region

Unlike GCP and Azure, all AWS regions are (were) partitioned by design. This "blast radius" is (was) fantastic for resilience, security, and data sovereignty. It is (was) incredibly easy to be compliant in AWS, not to mention the ruggedness benefits.

AWS customers with more money than cloud engineers kept clamoring for cross-region capabilities ("Like GCP has!"), and in last couple years AWS has been adding some.

Cloud customers should be careful what they wish for. If you count on it in the data center, and you don't see it in a well-architected cloud service provider, perhaps it's a legacy pattern best left on the datacenter floor. In this case, at some point hard partitioning could become tough to prove to audit and impossible to count on for resilience.

UPDATE TO ADD: See my123's link below, first published 2022-11-16, super helpful even if familiar with their approach.

PDF: https://docs.aws.amazon.com/pdfs/whitepapers/latest/aws-faul...

[+] tjungblut|3 years ago|reply
Yep, it's run by the Telekom entirely IIRC from my time back at MSFT. Microsoft "just deploys" Azure on it.
[+] option|3 years ago|reply
this: compliance plus lack of energy for new datacenter capacity. source: colleague who works at msft. they have a true crisis there and it will get worse.