Last week we at n8n ran into problems getting a new database from Azure. After contacting support, it turns out that we can’t add instances to our k8s cluster either. Azure has told us they'll have more capacity in April 2023(!) — but we’ll have to stop accepting new users in ~35 days if we don't get any more. These problems seem limited to the German region, but setting up in a new region would be complicated for us. We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.
Is anyone else experiencing these problems?
[+] [-] l-p|3 years ago|reply
You're new to Azure I guess.
I'm glad the outage I had yesterday was only the third major one this year, though the one in August made me lose days of traffic, months of back and forth with their support, and a good chunk of my sanity and patience in the face of blatant, documented lies and general incompetence.
One consumer-grade fiber link is enough to serve my company's traffic, and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.
[+] [-] Twirrim|3 years ago|reply
Too much capacity is money spent getting no return: up-front capex, ongoing opex, physical space in facilities, etc.
On cloud scales (averaged out over all the customers) the demand tends to follow pretty stable and predictable patterns, and the ones that actually tend to put capacity at risk (large customers) have contracts where they'll give plenty of heads-up to the providers.
What has been very problematic over the past few years is the supply chains. Intel's multi-year struggle to get CPUs out the door really hurt. All of the major providers struggled through it, and the market is still somewhat unpredictable. The supply chain woes that have been wreaking havoc on everything from the car industry to domestic white goods are having similar impacts on the server industry.
The level of unreliability in the supply chain is making it very difficult for the capacity management folks to do their job. It's not even that predictable which supply chain is going to be affected. Some of them are running far smoother and faster and capacity lands far faster than you'd expect, while others are completely messed up, then next month it's all flipped around. They're being paranoid, assuming the worst and still not getting it right.
This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.
The best thing you can do is be as hardware agnostic as is technically possible, so you can use whatever is available... which sucks.
[+] [-] adrr|3 years ago|reply
https://azure.microsoft.com/en-us/blog/summary-of-windows-az...
[+] [-] rufius|3 years ago|reply
I've worked with both Azure and AWS professionally and both have had their fair share of "too many outages" or capacity issues. At this point, you basically must go multi-region to ensure capacity and even better if you can go multi-cloud.
[+] [-] janober|3 years ago|reply
[+] [-] ckdarby|3 years ago|reply
I don't believe that is even remotely correct.
It isn't the pricing you should be worried about but the staffing, redundancy, and 24/7 operations staff.
I'm dealing with AWS and on-prem. On-prem spent some $5M to build out a whole new setup; it took literally months of racking, planning, designing, setting up, etc.
It's not even entirely in use, because we hit supply chain issues for 100 Gbit switches and they won't arrive until at least April 2023 (after many months of delays upon delays already).
[+] [-] ethbr0|3 years ago|reply
TAMs tend to be a band-aid, an organizational sign that support-as-normal sucks and isn't sufficient to get the job done (i.e., fix everything that breaks and isn't self-serve).
[+] [-] aprdm|3 years ago|reply
Their support was also amazing in the beginning... but once they've hooked you, you're just a ticket in their system. It takes weeks to fix something you could fix in minutes on-prem, or that their black belt would have fixed in a very short amount of time at the start of the relationship.
Cloud isn't that magical unicorn!
[+] [-] SergeAx|3 years ago|reply
[+] [-] roflyear|3 years ago|reply
[+] [-] marcosdumay|3 years ago|reply
[+] [-] Insanity|3 years ago|reply
Another advantage of not having to own the hardware is that it's easier to scale and to get started with new types of services (e.g., data warehouse solutions, serverless compute, new DB types, ...).
I'm not trying to advocate for or against cloud solutions here, but just pointing out that the decision making has more factors apart from "hardware cost".
[+] [-] xwowsersx|3 years ago|reply
> setting up in a new region would be complicated for us.
Sounds to me like you've got a few weeks to get this working. Deprioritize all other work, get everyone working on this little DevOps/Infra project. You should've been multi-region from the outset, if not multi-cloud.
When using the public cloud, we do tend to take it all for granted and don't even think about the fact that physical hardware is required for our clusters and that, yes, they can run out.
Anyway, however hard getting another region set up may be, it seems you've no choice but to prioritize that work now. You may also want to look into other cloud providers, depending on how practical or how overkill going multi-cloud may be for your needs.
I wish you luck.
[+] [-] craigkerstiens|3 years ago|reply
The best advice I can give is 1. keep trying and grabbing capacity continuously, then run with more than what you need. 2. Explore migrating to another Azure region that is less constrained. You mention a new region would be complicated, but it is likely much easier than another cloud.
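Point 1 (keep trying and grabbing capacity) can be sketched as a retry loop with jittered backoff. This is only a sketch: `provision_instance` is a hypothetical stand-in for whatever SDK call you actually use, not a real API.

```python
import random
import time

def grab_capacity(provision_instance, attempts=10, base_delay=1.0):
    """Keep retrying a provisioning call until capacity appears.

    `provision_instance` is a hypothetical callable wrapping your cloud
    SDK; it should raise on a capacity error and return an instance
    handle on success.
    """
    for attempt in range(attempts):
        try:
            return provision_instance()
        except Exception:
            # Exponential backoff with jitter, so many clients retrying
            # at once don't hammer the control plane in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(min(delay, 60))
    raise RuntimeError("no capacity after %d attempts" % attempts)
```

Once a VM is acquired, hold it (run with headroom, per point 1) rather than releasing and hoping to reacquire later.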
1. https://www.zdnet.com/article/azures-capacity-limitations-ar...
[+] [-] cfeduke|3 years ago|reply
As ridiculous as it sounds, having an enterprise's applications exist on multi-cloud isn't terrible if the application is mission critical - not only does this get around Azure's constant provisioning issues but protects an organization from the rare provider failure. (Though multi-region AWS has never been a problem in my experience, there is a first time for everything.) Data transfer pricing between clouds is prohibitively expensive, especially when you consider the reason why you may want multi-cloud in the first place (e.g., it's easier to provision 1000+ instances on AWS than Azure for an Apache Spark cluster for a few minutes or hours execution - mostly irrelevant if your data lives in Azure Data Lake Storage).
[+] [-] bri3d|3 years ago|reply
Building on cloud requires a lot of trade-offs, one being a need for very robust cross-region capability and flexibility about which instance types your infrastructure requires.
I’d use this as a driver to either invest in making your software multi regional or cloud agnostic. Multi regional will be easier. If you’re already on k8s you should have a head start here.
[+] [-] Innominate|3 years ago|reply
The major cloud services are expensive. This extra cost is supposed to provide for cloud services' high level of flexibility. Running out of capacity should be a rare event and treated as a high priority problem to be fixed asap.
Without the ability to rapidly and arbitrarily scale, they're just overpriced server farms.
[+] [-] PaulHoule|3 years ago|reply
Ideally you have a script that goes from credentials to the service to a complete working instance.
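A minimal sketch of that shape, i.e., a driver that runs ordered, idempotent setup steps from bare credentials to a working instance. All step names here are hypothetical placeholders, not any provider's real SDK.

```python
def provision_from_scratch(credentials, steps):
    """Drive a full bring-up: each step is a (name, callable) pair; the
    callable receives the shared context dict and returns whatever it
    created. Keeping every step idempotent means you can re-run the
    whole script safely after a partial failure."""
    ctx = {"credentials": credentials}
    for name, step in steps:
        ctx[name] = step(ctx)
    return ctx

# Usage sketch (hypothetical step functions):
# steps = [("network", create_network), ("vm", create_vm), ("ok", wait_healthy)]
# provision_from_scratch(load_credentials(), steps)
```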
[+] [-] andrewstuart|3 years ago|reply
Instead of providing you with a list of the resources they do have, you have to play this weird game where you ask for specific instances in specific regions and then within several hours someone emails back to say yes or no.
If it’s no, you have to guess again where you might get the instance you want and email them again and ask.
I envisage going to an old shop, and asking the shopkeep for a compute instance in a region. He hobbles out the back, and after a long delay comes back and says “nope, don’t have no more of them, anything else you might want?”.
It’s surprising that this is how it works. It's not the auto-scaling cloud that "cloud computing" used to bring to mind.
[+] [-] victor106|3 years ago|reply
The documentation is terrible and the Azure portal is so slow and laggy I can’t even believe it. Not to mention how unreliable their stack is.
[+] [-] arecurrence|3 years ago|reply
In particular, GPU availability has been a continuing problem. Unlike interchangeable x64 / arm64 instances with some adjustments based on the new core and ram count... if no GPU instances are available then I simply cannot run the job. AMD's improved support has increasingly provided an alternative in some situations but the problem persists.
I recommend doing the work to make the business somewhat cloud agnostic, or at the very least multi-region capable. I realize this is not an option for some services that have no equivalent on other clouds but you mentioned databases and k8s clusters which are both supported elsewhere.
[+] [-] andrewstuart|3 years ago|reply
All cloud providers charge much, much more for GPUs than if you run a local machine.
Cloud GPUs are also a lot slower than state of the art consumer GPUs.
Cloud GPUs: much slower, less available, much more expensive.
[+] [-] dehrmann|3 years ago|reply
[+] [-] mmcconnell1618|3 years ago|reply
[+] [-] wstuartcl|3 years ago|reply
AWS for sure has had resource constraints in different AZs (especially during Black Friday and holiday loads), but I have never had an issue finding resources to spin up, especially if I was willing to be flexible on VM type.
[+] [-] alexeldeib|3 years ago|reply
Besides what’s already been said, internal capacity differs HUGELY based on VM SKU. If you need GPUs or something it’ll be tough. But a lot of the newer v4/v5 general compute SKUs (D/Da/E/Ea/etc) have plenty of capacity in many regions.
If changing regions sounds like a pain, consider gambling on other VM size availability.
(azure employee)
[+] [-] janober|3 years ago|reply
[+] [-] whalesalad|3 years ago|reply
Yikes, this is exactly the kind of thing you come to expect when working with MSFT.
[+] [-] scotty79|3 years ago|reply
[+] [-] janober|3 years ago|reply
[+] [-] jenscow|3 years ago|reply
[+] [-] usgroup|3 years ago|reply
[+] [-] DannyBee|3 years ago|reply
The timeframe they gave would match that kind of ask.
I wonder whether you see the same behavior from other cloud providers there (ie if you ask them whether new capacity is available, what do they say)
[+] [-] arcturus17|3 years ago|reply
I doubt it. It will be easier - and probably safer - to ask citizens and physical industry (eg, factories) to bear the brunt than to risk having problems in critical IT infrastructure. Ask people and factories to turn the heat 3 degrees down and the effects will be more or less predictable. Asking to shut compute power down at random will have unpredictable consequences.
[+] [-] analyst74|3 years ago|reply
I suspect AWS and GCP just have more headroom in EU.
[+] [-] andrewstuart|3 years ago|reply
List what you do you have available so we can choose.
Do not force users to randomly guess and be refused until eventually finding something available.
[+] [-] RajT88|3 years ago|reply
I can see why they wouldn't want to do this.
[+] [-] crmd|3 years ago|reply
It’s infuriating that AWS doesn’t have an API that returns a list of AZs with available inventory for a given instance type.
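The closest thing AWS does expose is `DescribeInstanceTypeOfferings`, which tells you which AZs *offer* a type at all; it reflects the catalog, not live inventory. A sketch using boto3 (requires `pip install boto3` and credentials):

```python
def azs_from_offerings(response):
    """Pure helper: pull AZ names out of a DescribeInstanceTypeOfferings
    response dict, so the parsing can be tested without AWS access."""
    return sorted(o["Location"] for o in response.get("InstanceTypeOfferings", []))

def azs_offering_type(instance_type, region):
    """List the AZs in `region` whose catalog includes `instance_type`.
    Note: this says nothing about remaining capacity -- AWS has no
    public API for that."""
    import boto3
    ec2 = boto3.client("ec2", region_name=region)
    pages = ec2.get_paginator("describe_instance_type_offerings").paginate(
        LocationType="availability-zone",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    azs = []
    for page in pages:
        azs.extend(azs_from_offerings(page))
    return sorted(azs)
```

For actual capacity you're left probing: attempt the launch and handle `InsufficientInstanceCapacity`, or pre-buy with On-Demand Capacity Reservations.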
[+] [-] layer8|3 years ago|reply
[+] [-] Too|3 years ago|reply
I don't always care if you give me an E8_v4 or a D8 instead; just give me something. With the 100+ variants of VMs available, finding an exact match is obviously an unnecessary constraint. Maybe they already simulate this behind the scenes, I don't know, though given that the sizes are advertised with HW capabilities, I'd imagine they can't really simulate a v4 using a v5 and vice versa.
The only place I've seen compute treated this fluidly is Container Instances, which is a bad choice for many, many other reasons.
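Lacking that server-side, the "any of these is fine" behavior can be approximated client-side with an ordered fallback list. A sketch, where `try_create` is a hypothetical callable wrapping your cloud SDK, not a real API:

```python
def first_available(preferred_sizes, try_create):
    """Walk an ordered list of acceptable VM sizes and return the first
    one that actually provisions. `try_create(size)` should return an
    instance handle on success and raise on a capacity error."""
    errors = {}
    for size in preferred_sizes:
        try:
            return size, try_create(size)
        except Exception as exc:
            errors[size] = exc
    raise RuntimeError("no capacity for any of %s" % list(errors))
```

Order the list by preference (price, perf) so you only pay the substitution cost when your first choice is genuinely unavailable.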
[+] [-] avereveard|3 years ago|reply
It's not the exact metric but you can find which have more availability without knowing the exact number (which is constantly changing anyway)
[+] [-] robertlagrant|3 years ago|reply
1. Government/health/defence cloud customers
2. Teams, which was exploding in use and they wanted to capitalise on it
3. Regular cloud customers
[+] [-] dszoboszlay|3 years ago|reply
[+] [-] ttrrooppeerr|3 years ago|reply
You will be threatened by your own unreliability for building something that's dependent on one region or one cloud.
[+] [-] deathanatos|3 years ago|reply
It is shocking to me that it happened at all. Capacity planning shouldn't be so far behind in a cloud that wants to position itself as on par with AWS/GCP. (Which Azure absolutely isn't.) To me, having capacity planning solved is part of what I am paying for in the higher price of the VM.
> We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.
Oh my sweet summer child, welcome to Azure. Don't depend on them being proactive about anything; even depending on them to react is a mistake, e.g., they do not reliably post-mortem severe failures. (At least, externally. But as a customer, I want to know what you're doing to prevent $massive_failure from happening again, and time and time again they're just silent on that front.)
[+] [-] plantain|3 years ago|reply
It took me six months to get approved to start six instances! With multiple escalations including being forcibly changed to invoice billing - for which they never match the invoices automatically, every payment requires we file a ticket.
[+] [-] Moissanite|3 years ago|reply
[+] [-] Terretta|3 years ago|reply
Unlike GCP and Azure, all AWS regions are (were) partitioned by design. This "blast radius" is (was) fantastic for resilience, security, and data sovereignty. It is (was) incredibly easy to be compliant in AWS, not to mention the ruggedness benefits.
AWS customers with more money than cloud engineers kept clamoring for cross-region capabilities ("Like GCP has!"), and in last couple years AWS has been adding some.
Cloud customers should be careful what they wish for. If you count on it in the data center, and you don't see it in a well-architected cloud service provider, perhaps it's a legacy pattern best left on the datacenter floor. In this case, at some point hard partitioning could become tough to prove to audit and impossible to count on for resilience.
UPDATE TO ADD: See my123's link below, first published 2022-11-16, super helpful even if familiar with their approach.
PDF: https://docs.aws.amazon.com/pdfs/whitepapers/latest/aws-faul...
[+] [-] tjungblut|3 years ago|reply
[+] [-] option|3 years ago|reply