top | item 39804742


randomgiy3142 | 1 year ago

I use zfsbootmenu with hrmpf (https://github.com/leahneukirchen/hrmpf). You can see the list of packages here (https://github.com/leahneukirchen/hrmpf/blob/master/hrmpf.pa...). I usually build images based off this so they're all there; otherwise you'll need to ssh into zfsbootmenu and load the 2 GB separate distro. This is for a home server, though if I had a startup I'd probably set up a "cloud setup" and throw a bunch of servers somewhere. A lot of the time, for internal projects and even non-production client research, having your own cluster is a lot cheaper and easier than paying a cloud provider. It also gets around the cases where you can't run k8s and need bare metal. I've advised clients on this setup, with contingencies in case of catastrophic failure (and, more importantly, testing those contingencies), but this is more so you don't have developers sitting idle, not to prevent overnight outages. It's a lot cheaper than cloud solutions for non-critical projects. While larger companies will look closely at the numbers if something happens and devs can't work for an hour, the advantage of a startup is that devs will find a way to be productive locally, or you simply have them take the afternoon off (neither has happened).

I imagine the problems described here happen on big-iron hardware clusters that are extremely expensive and where spare capacity isn't possible. I might be wrong, but especially with (sigh) AI setups, with extremely expensive $30k GPUs and crazy bandwidth between planes that you buy from IBM at crazy prices (the hardware vendor being on the line so quickly was a hint), you're way past the commodity-server cloud model. I have no idea what could go wrong with such equipment, where nearly every piece of hardware is close to custom built, but I'm glad I don't have to deal with that. Debugging that kind of hardware, which only a few huge pharma or research companies use, has to come down to really strange things.


semi-extrinsic | 1 year ago

On compute clusters there are quite a few "exotic" things that can go wrong. The workload orchestration is typically SLURM, which can throw errors and has a million config options to get lost in.
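For concreteness, a minimal SLURM batch script looks roughly like this (the partition name and `./my_app` binary are made-up placeholders, and real sites pile on many more options):

```shell
#!/bin/bash
#SBATCH --job-name=example        # name shown in squeue
#SBATCH --partition=compute       # hypothetical partition name
#SBATCH --nodes=2                 # number of nodes to allocate
#SBATCH --ntasks-per-node=32      # tasks (e.g. MPI ranks) per node
#SBATCH --time=01:00:00           # wall-clock limit
#SBATCH --output=%x-%j.out        # stdout file: job name + job id

srun ./my_app                     # launch tasks across the allocation
```

Most of the "million config options" live on the admin side (slurm.conf, QOS, fairshare), not in user scripts like this one.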

Then you have storage, often tiered in three levels: job-temporary scratch storage on each node, fast distributed storage with only a few weeks' retention, and external permanent storage attached somehow. Relatively often the middle layer here, which is Lustre or something similar, throws a fit.
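The few-weeks retention on that middle tier is usually enforced by a periodic purge sweep. A minimal sketch, using a throwaway demo directory and an assumed 28-day window (real sites use Lustre-aware purge tools rather than a bare `find`):

```shell
#!/bin/sh
# Demo of a scratch-tier retention sweep: files untouched for 28 days are purged.
# The demo directory, file names, and 28-day window are illustrative assumptions.
SCRATCH_DIR=$(mktemp -d)
touch "$SCRATCH_DIR/recent.dat"                  # fresh file: survives the sweep
touch -d "40 days ago" "$SCRATCH_DIR/stale.dat"  # old file: gets purged

# The actual sweep: delete regular files not modified in the last 28 days.
find "$SCRATCH_DIR" -type f -mtime +28 -delete

ls "$SCRATCH_DIR"
```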

Then you have the interconnect, which can be anything from super flaky to rock solid. I've seen fifteen-year-old setups be rock solid, and in one extreme example a brand new system was so unstable that all the IB cards were shipped back to Mellanox and replaced under warranty with a previous-generation model. This type of thing usually follows something like a Weibull distribution, where wrinkles are ironed out over time and the IB drivers become more robust for a particular HW model.
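Reading "Weibull" loosely as a claim about failure rates: the Weibull hazard (instantaneous failure) rate is

```latex
h(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}
```

with shape $k$ and scale $\lambda$. For $k < 1$ the hazard decreases with time, which is the "infant mortality" regime the comment describes: weak cards and immature drivers fail early, and the surviving population plus a patched driver stack get steadily more reliable.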

Then you have the general hardware and drivers on each node. Typically there is extensive performance testing to establish the best compiler flags etc., as well as how to distribute the work optimally for a given workload. Failures at this level are easier in the sense that they typically affect only a couple of nodes, which you can take offline and fix while the rest keep running.
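Taking a suspect node out of service while the rest keep running is typically a one-line SLURM admin command (the node name and reason below are made up):

```shell
# Drain the node: running jobs finish, but no new jobs are scheduled onto it.
scontrol update NodeName=node042 State=DRAIN Reason="ECC errors on DIMM A3"

# After the repair, return it to service.
scontrol update NodeName=node042 State=RESUME
```

This is why the comment calls these failures "easier": the blast radius is one node, not the whole cluster.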