Debugging Hetzner: Uncovering failures with powerstat, sensors, and dmidecode

nik736|1 year ago

Most other AX models (AX42, AX52 and AX102) also have serious reliability issues, where they will fail after some months. They are based on a faulty motherboard. Hetzner has to replace most, if not all, motherboards for servers built before a certain date over the next 12 months [0]

[0] https://docs.hetzner.com/robot/dedicated-server/general-info...

babuskov|1 year ago

I have two AX42's. One has been stable since I got it during the Eurocup discount period. The other got replaced 2 times so far, but it looks like the latest replacement is holding up. So, it's like 50% failure rate based on my small sample. I guess only Hetzner and ASRock know the real numbers.

jonatron|1 year ago

At a previous company, devops would regularly find CPU fan failures on Hetzner. That's in addition to the usual expected HD/SSD failures. You've got to do your own monitoring, it's one of the reasons why unmanaged servers are cheaper than cloud instances.

jeffbee|1 year ago

I regularly find broken thermal solutions in azure and when I worked at Google it was also a low-level but constant irritant. When I joined Dropbox I said to my team on my first day that I could find a machine in their fleet running at 400MHz, and I was right: a bogus redundant PSU controller was asserting PROCHOT. These things happen whenever you have a lot of machines.

KennyBlanken|1 year ago

No? Maybe you cloud kids don't know how this stuff works, but unmanaged just means you get silicon-level access and remote KVM.

It's still the hosting company's responsibility to competently own, maintain, and repair the physical hardware. That includes monitoring. In the old days you had to run a script or install a package to hook into their monitoring....but with IPMI et al being standard they don't need anything from you to do their job.

The only time a hosting company should be hands-off is when they're just providing rack space, power, and data. Anything beyond that is between you and them in a contract/agreement.

Every time I hear Hetzner come up in the last few years it's been a story about them being incompetent. If they're not detecting things like CPU fan failures of their own hardware and they deployed new systems without properly testing them first, then that's just further evidence they're still slipping.

unknown|1 year ago

[deleted]

unknown|1 year ago

[deleted]

unknown|1 year ago

[deleted]

throwaway984393|1 year ago

[deleted]

TZubiri|1 year ago

I'm heavily against both relying on free dependencies and going for the cheapest option.

If you can't put yourself in the shoes for a second when evaluating a purchase and you just braindead try to make cost go lower and income go higher, your ngmi except in shady sales businesses.

Server hardware is incredibly cheap, if you are somewhat of a competent programmer you can handle most programs in a single server or even a virtual machine. Just give them a little bit of margin and pay 50$/mo instead of 25$/mo, it's not even enough to guarantee they won't go broke or make you a valuable customer, you'll still be banking on whales to make the whole thing profitable.

Also, if your business is in the US, find a US host ffs.

V__|1 year ago

> Looking back, waiting six months could have helped us avoid many issues. Early adopters usually find problems that get fixed later.

This is really good advice and what I'm following for all systems which need to be stable. If there aren't any security issues, I either wait a few months or keep one or two versions behind.

esafak|1 year ago

GitHub is looking to add this feature to dependabot: https://github.com/dependabot/dependabot-core/issues/3651

InDubioProRubio|1 year ago

This is a wildly successfully pattern in nature, the old using the young and inexperienced, as enthusiastic test units.

In the wild for example in Forrest, old boars give safety squeaks to send the younglings ahead into a clearing they do not trust. The equivalent to that- would be to write a tech-blog entry that hypes up a technology that is not yet production ready.

pwmtr|1 year ago

Author of the blog post here.

Yeah, this is generally a good practice. The silver lining is that our suffering helped uncover the underlying issue faster. :)

This isn’t part of the blog post, but we also considered getting the servers and keeping them idle, without actual customer workload, for about a month in the future. This would be more expensive, but it could help identify potential issues without impacting our users. In our case, the crashes started three weeks after we deployed our first AX162 server, so we need at least a month (or maybe even longer) as a buffer period.

fdr|1 year ago

It varies by system. As the legendary (to some) Kelly Johnson of the Skunk Works had as one of his main rules:

> The inspection system as currently used by the Skunk Works, which has been approved by both the Air Force and the Navy, meets the intent of existing military requirements and should be used on new projects. Push more basic inspection responsibility back to the subcontractors and vendors. Don't duplicate so much inspection.

But this will be the only and last time Ubicloud does not burn in a new model, or even tranches of purchases (I also work there...and am a founder).

bayindirh|1 year ago

Dell has this problem sometimes. I remember getting the first batch one of their older servers when they were new. We had to replace motherboards' I/O (rear) section because the servers lost some devices on that part (e.g.: Ethernet controllers, iDRAC, sometimes BIOS) for some time. After shaking out these problems, they ran for almost a decade.

We recently retired them because we worn down everything on these servers. From RAID cards to power regulators. Rebooting a perfectly running server due to a configuration change and losing the RAID card forever because electron migration erode a trace inside the RAID processor is a sobering experience.

merb|1 year ago

Dell has tons of issues. A faulty mini board of the front led can actually stop the server from booting/running at all (even drac will be dead)

andai|1 year ago

> Hetzner didn’t confirm or deny the possibility of power limiting

What are the consequences of power limiting? The article says it can cause hardware to degrade more quickly, why?

Hetzner's lack of response here (and UbiCloud's measurements) seems to suggest they are indeed limiting power, since if they weren't doing it, they'd say so, right?

radicality|1 year ago

Related and perhaps useful: I’ve seen this in multiple cloud offerings already, where the cpu scaling governor is set to some eco-friendly value, in benefit to the cloud provider and in zero benefit to you and much reduced peak cpu perf.

To check, run `cat /sys/devices/system/cpu/cpu/cpufreq/scaling_governor`. It should be `performance`.

If it’s not, set it with `echo performance | sudo tee /sys/devices/system/cpu/cpu/cpufreq/scaling_governor`. If your workload is cpu hungry this will help. It will revert on startup, so can make it stick, with some cron/systemd or whichever.

Of course if you are the one paying for power or it’s your own hardware, make your own judgement for the scaling governor. But if it’s a rented bare metal server, you do want `performance`.

vitus|1 year ago

> To increase the number of machines under power constraints, data center operators usually cap power use per machine. However, this can cause motherboards to degrade more quickly.

Can anyone elaborate on this point? This is counter to my intuition (and in fact, what I saw upon a cursory search), which is that power capping should prolong the useful lifetime of various components.

The only search results I found that claimed otherwise were indicating that if you're running into thermal throttling, then higher operating temperatures can cause components (e.g. capacitors) to degrade faster. But that's expressly not the case in the article, which looked at various temperature sensors.

pwmtr|1 year ago

At the time of our investigation, we found few articles supporting that power caps could potentially cause hardware degradation, though I don't have the exact sources at hand. I see the child comment shared one example, and after some searching, I found a few more sources [1], [2].

That said, I'm not an electronics engineer, so my understanding might not be entirely accurate. It’s possible that the degradation was caused by power fluctuations rather than the power cap itself, or perhaps another factor was at play.

[1] https://electronics.stackexchange.com/questions/65837/can-el... [2] https://superuser.com/questions/1202062/what-happens-when-ha...

nickcw|1 year ago

Power = volts * amps

Volts is as supplied by the utility company.

Amps are monitored per rack and the usual data centre response to going over an amp limit is that a fuse blows or the data centre asks you for more money!

The only way you can decrease power used by a server is by throttling the CPUs.

The normal way of throttling CPUs is via the OS which requires cooperation.

I speculate this is possible via the lights out base band controller (which doesn't need the os to be involved), but I'm pretty sure you'd see that in /sys if it was.

tecleandor|1 year ago

Yep, that's weird, I've always read that high power/temp can degrade electronics way faster. Any EE can shed a light here?

redleader55|1 year ago

Every rack in a data center has a power budget, which is actually constrained by how much heat the HVAC system can pull out of the DC, rather than how much power is available. Nevertheless it is limited per rack to ensure a few high power servers don't bring down a larger portion of the DC.

I don't know for sure how the limiting is done, but a simple circuit breaker like the ones we have in our houses would be a simple solution for it. That causes the rack to loose power when the circuit breaks, which is not ideal because you loose the whole rack and affect multiple customers.

Another option would be a current/power limiter[0], which would cause more problems because P = U * I. That would make the voltage (U) drop and then the whole system to be undervolted - weird glitches happen here and it's a common way to bypass various security measures in chips. For example, Raspberry Pi ran this challenge [1] to look for this kind of bugs and test how well their chips can handle attacks, including voltage attacks.

[0] - https://en.m.wikipedia.org/wiki/Current_limiting [1] - https://www.raspberrypi.com/news/security-through-transparen...

cibyr|1 year ago

One possibility is that at lower power settings, the CPUs don't get as hot, which means the fans don't spin up as much, which can mean that other components also get less airflow and then get hotter than they would otherwise. The fix for this is usually to monitor the temperature of those other components and include that as an input to the fan speed algorithm. No idea if that's what's actually going on here though.

wmf|1 year ago

Expert in server power management here. Your intuition is right and the comments/links to the contrary are wrong. Undervolting is unreliable but let's be clear: no one is undervolting servers. I don't even know if it's possible. Power limiting (e.g. RAPL) is completely safe to use because it keeps voltage, frequency, temperature, fan speed, etc within safe bounds.

OptionOfT|1 year ago

The only place I could find some answer that sheds some light was StackOverflow:

https://electronics.stackexchange.com/a/65827

> A mosfet needs a certain voltage at its gate to turn fully on. 8V is a typical value. A simple driver circuit could get this voltage directly from the power that also feeds the motor. When this voltage is too low to turn the mosfet fully on a dangerous situation (from the point of view of the moseft) can arise: when it is half-on, both the current through it and the voltage across it can be substantial, resulting in a dissipation that can kill it. Death by undervoltage.

chronid|1 year ago

We will never know, but I wonder if it could be a power/signaling or VRM issue - the CPU non getting hot doesn't mean something else on the board has gone out of spec and into catastrophic failure.

Motherboard issues around power/signaling are a pain to diagnose, they will emerge as all sort of problems apparently related to other components (ram failing to initialize and random restarts are very common in my experience) and you end up swapping everything before actually replacing the MB...

rikafurude21|1 year ago

Similar thing happened to a AX102 I currently use, something related the network card which caused crashes. Thankfully hetzner support was helpful with replacement hardware. caused quite some grief but at least it was a good lesson in hardware troubleshooting. Worth it to me personally

yread|1 year ago

Yep same here. AX102 crashes with almost no load, nothing in the logs, won't come on. Hetzner looked at it multiple times and found either nothing or replaced cpu paste or a PSU connector. I migrated to AX162 and so far so good

urbandw311er|1 year ago

Would anybody with data center experience be able to hazard a guess on what type of commercial resolution Hetzner would have reached with the Motherboard supplier here? Would we assume all mobos replaced free of charge plus compensation?

wmf|1 year ago

When you buy name-brand servers you'll definitely get any faulty hardware replaced. Compensation would only happen if you negotiated for that and you'd have to pay extra. You're probably better off buying some kind of business interruption insurance instead of trying to get vendors to pay you for downtime (even if it is their fault).

Hetzner is not a normal customer though. As part of their extreme cost optimization they probably buy the cheapest components available and they might even negotiate lower prices in exchange for no warranty. In that case they would have to buy replacement motherboards.

babuskov|1 year ago

I think they probably got a batch of these really cheap in the first place, because those servers were offered without the setup fee initially. It was during the soccer World Cup in Germany.

jauntywundrkind|1 year ago

> To increase the number of machines under power constraints, data center operators usually cap power use per machine. However, this can cause motherboards to degrade more quickly.

This was something I hadn't heard before, & a surprise to me.

scottcha|1 year ago

I’d like to see what cpu governor is running on those systems before assuming a power cap is in place. Lots of defaults installs of Linux ship with the power save governor running which is going to limit your max frequencies and through that the max power you can hit.

__m|1 year ago

schedutil on mine scheduled for mainboard replacement

trod1234|1 year ago

It would have been nice if they linked to the power metrics for the new servers.

I think it would be amusing if it turns out they just raised the power limits for those servers not showing the problem up to base that was originally advertised.

vednig|1 year ago

as a CI/CD provider wouldn't it benefit if Ubicloud had their own servers?

immibis|1 year ago

Depends how many they need and how much control. Do they want to be a server company or an adapting-servers-to-run-your-CI/CD company or both? You can extract value from both parts of the equation, but theoretical economics tells us you can get the most value for the least effort by doing more of what you're best at and paying someone else to do what they're best at, rather than doing everything mediocrely yourself.

Sometimes that other company isn't actually very good and you can increase value by insourcing their part of your operation. But you can't assume that is always the case. It wouldn't have solved this particular problem - I think we can safely guess that your chance of getting a batch of faulty motherboards is at least as high as Hetzner's chance.

eitland|1 year ago

They are in the early stages.

I think the website said they recently raised 16 million euros (or dollars).

Making investments into data centers and hardware could burn through that really quick in addition to needing more engineers.

By using rented servers (and only renting them when a customer signs up) they avoid this problem.

wink|1 year ago

> One of the providers we like is Hetzner because of their affordable and reliable servers.

> In the days that followed, the crash frequency increased.

I don't find the article conclusive whether they would still call them reliable.

cbozeman|1 year ago

Hetzner's reliable... until they aren't.

Since they don't do any sort of monitoring on their bare metal servers at all, at least insofar as I can tell having been a customer of theirs for ten years, you don't know there's as problem until there's a problem, or unless you've got your own monitoring solution in place.

immibis|1 year ago

Seems like this problem, was unforeseeable, is isolated to a particular current-generation/model of server motherboards (AX2), and doesn't usually happen. I had an AX41* previously with no such problem, so it's not all AXes, just all current-generation AXes (which is all of the AXes they give to new customers, so that's no consolation).

aduffy|1 year ago

To their credit they actually fixed the problem. Good luck getting this level of support from any of the big 3 public cloud providers.

unknown|1 year ago

[deleted]

dangoodmanUT|1 year ago

is there a provider that's like bare metal, but would detect these kinds of things mostly automatic? E.g. faulty or constantly crashing hardware.

greggyb|1 year ago

Managed servers: https://www.hetzner.com/managed-server/

There are also others, but Hetzner is under discussion here.

gtirloni|1 year ago

Anyone got experience with Ubicloud's OpenStack stack?

fdr|1 year ago

Ubicloud does not have an OpenStack dependency.

TacticalCoder|1 year ago

[deleted]

indulona|1 year ago

i am so glad my sign up process with hetzner failed when i was so dumb that i wanted to give them a chance even with the internet full of horrific stories of bad experiences from their customers. lucky me.

cbozeman|1 year ago

Hetzner is fine for what it is, you just need to know that it's all on you and only YOU.

YOU do the monitoring.

YOU do the troubleshooting.

YOU etc., etc.

If that doesn't appeal to you, or if you don't have the requisite knowledge, which I admit is fairly broad and encompassing, then it's not for you. For those of you that meet those checkboxes, they're a pretty amazing deal.

Where else could I get a 4c/8t CPU with 32 GB of RAM and four (4) 6TB disks for $38 a month? I really don't know of many places with that much hardware for that little cost. And yes, it's an Intel i7-3770, but I don't care. It's still a hell of a lot of hardware for not much price.

108 comments