GlenTheMachine | 8 months ago

Space roboticist here.

As with a lot of things, it isn't the initial outlay, it's the maintenance costs. Terrestrial datacenters have parts fail and get replaced all the time. The mass analysis given here -- which appears quite good, at first glance -- doesn't include any mass, energy, or thermal system numbers for the infrastructure you would need to have to replace failed components.

As a first cut, this would require:

- an autonomous rendezvous and docking system

- a fully railed robotic system, e.g. some sort of robotic manipulator that can move along rails and reach every card in every server in the system, which usually means a system of relatively stiff rails running throughout the interior of the plant

- CPU, power, comms, and cooling to support the above

- importantly, the ability of the robotic servicing system to replace itself. In other words, it would need to be at least two fault tolerant -- which usually means dual wound motors, redundant gears, redundant harness, redundant power, comms, and compute. Alternately, two or more independent robotic systems that are capable of not only replacing cards but also of replacing each other.

- regular launches containing replacement hardware

- ongoing ground support staff to deal with failures

The mass analysis also doesn't appear to include the massive number of heat pipes you would need to transfer the heat from the chips to the radiators. For an orbiting datacenter, that would probably be the single biggest mass allocation.

vidarh|8 months ago

I've had actual, real-life deployments in datacentres where we just left dead hardware in the racks until we needed the space, and we rarely did. Typically we'd visit a couple of times a year, because it was cheap to do so, but it'd have been totally viable to let failures accumulate over a much longer time horizon.

Failure rates tend to follow a bathtub curve, so if you burn-in the hardware before launch, you'd expect low failure rates for a long period and it's quite likely it'd be cheaper to not replace components and just ensure enough redundancy for key systems (power, cooling, networking) that you could just shut down and disable any dead servers, and then replace the whole unit when enough parts have failed.
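For anyone who hasn't met the bathtub curve, here is a minimal sketch of one common way to model it: a falling Weibull hazard (infant mortality), a constant random-failure rate, and a rising Weibull hazard (wear-out), all added together. Every parameter value below is a made-up illustration, not measured server data, and the function names are hypothetical.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t, infant=(0.5, 20.0), random_rate=0.005, wearout=(5.0, 8.0)):
    """Failures per unit-year at age t (years).

    infant:      (shape < 1, scale) gives a hazard that falls with age
    random_rate: constant background failure rate
    wearout:     (shape > 1, scale) gives a hazard that rises with age
    All defaults are illustrative assumptions, not measurements.
    """
    return weibull_hazard(t, *infant) + random_rate + weibull_hazard(t, *wearout)


# Print the hazard at a few ages to see the high-low-high shape.
for years in (0.01, 0.5, 2.0, 5.0, 10.0):
    print(f"t = {years:5.2f} y  hazard = {bathtub_hazard(years):.3f} per year")
```

Burn-in before launch amounts to starting the clock partway down the left wall of the curve, so the hardware reaches orbit already in the flat section.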

rajnathani|8 months ago

Exactly what I was thinking when the OP comment brought up "regular launches containing replacement hardware", this is easily solvable by actually "treating servers as cattle and not pets" whereby one would simply over-provision servers and then simply replace faulty servers around once per year.

Aside: thanks for sharing the "bathtub curve". TIL, and I'm surprised I hadn't heard of it before, especially as it's related to reliability engineering (searching HN via Algolia, no post about the bathtub curve has crossed 9 points).

TheOtherHobbes|8 months ago

The analysis has zero redundancy for either servers or support systems.

Redundancy is a small issue on Earth, but completely changes the calculations for space because you need more of everything, which makes the already-unfavourable space and mass requirements even less plausible.

Without backup cooling and power one small failure could take the entire facility offline.

And active cooling - which is a given at these power densities - requires complex pumps and plumbing which have to survive a launch.

The whole idea is bonkers.

IMO you'd be better off thinking about a swarm of cheaper, simpler, individual serversats or racksats connected by a radio or microwave comms mesh.

I have no idea if that's any more economic, but at least it solves the most obvious redundancy and deployment issues.

asah|8 months ago

serious q: how much extra failure rate would you expect from the physical transition to space?

on one hand, I imagine you'd rack things up so the whole rack/etc moves as one into space, OTOH there's still movement and things "shaking loose" plus the vibration, acceleration of the flight and loss of gravity...

VectorLock|8 months ago

The original article even addresses this directly. Plus hardware turns over fast enough that you'll simply be replacing modules with a smattering of dead servers with entirely new generations anyways.

Coffeewine|8 months ago

It would be interesting to see if the failure rate across time holds true after a rocket launch and time spent in space. My guess is that it wouldn’t, but that’s just a guess.

drewg123|8 months ago

I'd naively assume that the stress of launch (vibration, G-forces) would trigger failures in hardware that had been working on the ground. So I'd expect to see a large-ish number of failures on initial bringup in space.

geon|8 months ago

Yes. I think I read a blogpost from Backblaze about running their Red Pod rack mounted chassis some 10 years ago.

They would just keep the failed drives in the chassis. Maybe swap out the entire chassis if enough drives died.

4ndrewl|8 months ago

A new meaning to the term "space junk"

NitpickLawyer|8 months ago

Appreciate the insights, but I think failing hardware is the least of their problems. In that underwater pod trial, MS saw lower failure rates than expected (nitrogen atmosphere could be a key factor there).

> The company only lost six of the 855 submerged servers versus the eight servers that needed replacement (from the total of 135) on the parallel experiment Microsoft ran on land. It equates to a 0.7% loss in the sea versus 5.9% on land.

6/855 servers over 6 years is nothing. You'd simply re-launch the whole thing in 6 years (with advances in hardware anyways) and you'd call it a day. Just route around the bad servers. Add a bit more redundancy in your scheme. Plan for 10% to fail.
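The percentages in the quote can be checked directly from the server counts it gives. A small sketch follows; the function name is hypothetical and the only inputs are the four numbers quoted above.

```python
def loss_fraction(failed, total):
    """Fraction of servers lost out of the total deployed."""
    return failed / total


sea = loss_fraction(6, 855)    # submerged pod, counts from the quote
land = loss_fraction(8, 135)   # land-based control group, counts from the quote

print(f"sea: {sea:.1%}  land: {land:.1%}  land/sea ratio: {land / sea:.1f}x")
```

The counts do reproduce the quoted 0.7% and 5.9%, though the samples are small enough that the ratio should be read loosely.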

That being said, it's a complete bonkers proposal until they figure out the big problems, like cooling, power, and so on.

nine_k|8 months ago

Indeed, MS had it easier with a huge, readily available cooling reservoir and a layer of water that additionally protects (a little) against cosmic rays, plus the whole thing had to be heavy enough to sink. An orbital datacenter would be in the opposite situation: all cooling is radiative, many more high-energy particles, and the weight should be as light as possible.

dragonwriter|8 months ago

> In that underwater pod trial, MS saw lower failure rates than expected

Underwater pods are the polar opposite of space in terms of failure risks. They don't require a rocket launch to get there, and they further insulate the servers from radiation compared to operating on the surface of the Earth, rather than increasing exposure.

(Also, much easier to cool.)

sheepybloke|8 months ago

The biggest difference is radiation. Even in LEO, you will get radiation-caused Single Events that will affect the hardware. That could be a small error or a destructive error, depending on what gets hit.

looofooo0|8 months ago

Power!? Isn't that just PV and batteries? LEO has like 1.5h orbit.
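The roughly 1.5 h figure follows from Kepler's third law for a circular orbit, T = 2π·sqrt(a³/μ). A quick sketch, assuming a 400 km altitude (my choice, not a number from the article) and a mean Earth radius of 6371 km:

```python
import math

MU_EARTH = 3.986004418e14   # Earth's gravitational parameter, m^3/s^2
R_EARTH_M = 6_371_000.0     # mean Earth radius in metres (assumed value)


def circular_orbit_period_s(altitude_m):
    """Period of a circular orbit at the given altitude, in seconds."""
    a = R_EARTH_M + altitude_m  # orbit radius measured from Earth's centre
    return 2.0 * math.pi * math.sqrt(a**3 / MU_EARTH)


period_min = circular_orbit_period_s(400_000.0) / 60.0
print(f"period at 400 km: {period_min:.1f} min")
```

How much of each orbit is spent in eclipse, and so how big the battery has to be, depends on the orbit's orientation relative to the Sun and is not modelled here.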

VectorLock|8 months ago

Power is solar and cooling is radiators. They did the math on it, it's feasible and mostly an engineering problem now.

protocolture|8 months ago

Did Microsoft do any of that with their submersible tests?

My feeling is that, a bit like starlink, you would just deprecate failed hardware, rather than bother with all the moving parts to replace faulty ram.

Does mean your comms and OOB tools need to be better than the average American colo provider but I would hope that would be a given.

protocolture|8 months ago

>The mass analysis also doesn't appear to include the massive number of heat pipes you would need to transfer the heat from the chips to the radiators. For an orbiting datacenter, that would probably be the single biggest mass allocation.

And once you remove all the moving parts, you just fill the whole thing with oil rather than air and let heat transfer more smoothly to the radiators.

lumost|8 months ago

I used to build and operate data center infrastructure. There is very limited reason to do anything more than a warranty replacement on a GPU. With a high quality hardware vendor that properly engineers the physical machine, failure rates can be contained to less than .5% per year. Particularly if the network has redundancy to avoid critical mass failures.

In this case, I see no reason to perform any replacements of any kind. Proper networked serial port and power controls would allow maintenance for firmware/software issues.
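To put the 0.5% per year figure in context, here is a rough sketch of how much of the fleet would still be running after several years with no replacements. It assumes a constant annual failure rate and independent failures, which ignores launch stress and radiation effects; the function name is hypothetical.

```python
def surviving_fraction(annual_failure_rate, years):
    """Fraction of units still working after `years`, assuming a constant,
    independent annual failure rate (a simplifying assumption)."""
    return (1.0 - annual_failure_rate) ** years


for years in (1, 3, 6, 10):
    frac = surviving_fraction(0.005, years)
    print(f"after {years:2d} years: {frac:.1%} still working")
```

Under those assumptions about 97% of the hardware survives six years, so the case for no replacements mostly depends on whether the rate really stays that low in orbit.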

oceanplexian|8 months ago

Why does it need to be robots?

On Earth we have skeleton crews maintain large datacenters. If the cost of mass to orbit is 100x cheaper, it’s not that absurd to have an on-call rotation of humans to maintain the space datacenter and install parts shipped on space FedEx or whatever we have in the future.

verzali|8 months ago

If you want to have people you need to add in a whole lot of life support and additional safety to keep people alive. Robots are easier, since they don't die so easily. If you can get them to work at all, that is.

monster_truck|8 months ago

That isn't going to last for much longer with the way power density projections are looking.

Consider that we've been at the point where layers of monitoring & lockout systems are required to ensure no humans get caught in hot spots, which can surpass 100C, for quite some time now.

spauldo|8 months ago

This sort of work is ideal for robots. We don't do it much on Earth because you can pay a tech $20/hr to swap hardware modules, not because it's hard for robots to do.

Robotbeat|8 months ago

Bingo.

It's all contingent on a factor of 100-1000x reduction in launch costs, and a lot of the objections to the idea don't really engage with that concept. That's a cost comparable to air travel (both air freight and passenger travel).

(Especially irritating is the continued assertion that thermal radiation is really hard, and not like something that every satellite already seems to deal with just fine, with a radiator surface much smaller than the solar array.)
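For a sense of scale on the radiator question, here is a back-of-envelope sketch using the Stefan-Boltzmann law, P = ε·σ·A·(T⁴ − T_sink⁴). The emissivity, radiator temperature and sink temperature are all assumptions chosen for illustration, and the model ignores absorbed sunlight, Earth's infrared and albedo, so it gives a lower bound on area and nothing more.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4


def radiator_area_m2(heat_w, temp_k, emissivity=0.9, sink_temp_k=0.0):
    """Radiating area needed to reject heat_w watts from a surface at temp_k.

    Counts one radiating face only and ignores every environmental heat
    input; the default emissivity and sink temperature are assumptions.
    """
    flux = emissivity * SIGMA * (temp_k**4 - sink_temp_k**4)  # W per m^2
    return heat_w / flux


area_per_kw = radiator_area_m2(1000.0, 300.0)
print(f"ideal area per kW at 300 K: {area_per_kw:.2f} m^2")
```

Because the flux scales with T⁴, running the radiator hotter cuts the area quickly, which is why the permissible chip and coolant temperatures matter so much to this argument.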

wmf|8 months ago

Yeah, just attach a Haven module to the data center.

monster_truck|8 months ago

I suspect they'd stop at automatic rendezvous & docking. Use some sort of cradle system that holds heat fins, power, etc that boxes of racks would slot into. Once they fail just pop em out and let em burn up. Someone else will figure out the landing bit

I won't say it's a good idea, but it's a fun way to get rid of e-waste (I envision this as a sort of old persons home for parted out supercomputers)

closewith|8 months ago

Spreading heavy metals in the upper atmosphere. Fun.

angadh|8 months ago

Thanks for the thorough comment—yes, the heat pipes etc haven’t been accounted for. Might be a future addition but the idea was to look at some key large parts and see where that takes us in terms of launch. The pipes would definitely skew the business case further. Similarly, the analysis is missing trusses.

Don’t even get me started on the costs of maintenance. I am sweating bricks just thinking of the mission architecture for assembly and how the robotic system might actually look. Unless there’s a single 4 km long deployable array (of what width?), which would be ridiculous to imagine.

Spooky23|8 months ago

Don’t you need to look at different failure scenarios or patterns in orbit due to exposure to cosmic rays as well?

It just seems funny, I recall when servers started getting more energy dense it was a revelation to many computer folks that safe operating temps in a datacenter should be quite high.

I’d imagine operating in space has lots of revelations in store. It’s a fascinating idea with big potential impact… but I wouldn’t expect this investment to pay out!

RecycledEle|8 months ago

What if we just integrate the hardware so it fails softly?

That is, as hardware fails, the system loses capacity.

That seems easier than replacing things on orbit, especially if StarShip becomes the cheapest way to launch to orbit because StarShip launches huge payloads, not a few rack mounted servers.

markemer|8 months ago

Not to mention radiation hardening. The soft error rate alone on these single digit nm chips would be massive.

hamburglar|8 months ago

Seems prudent to achieve fully robotic datacenters on earth before doing it in space. I know, I’m a real wet blanket.

Robotbeat|8 months ago

If mass is going to be as cheap as is needed for this to work anyway, there's no reason you can't just use people like in a normal datacenter.

HPsquared|8 months ago

The economics don't work the same on earth.

empath75|8 months ago

I think what you actually do is let it gradually degrade over time and then launch a new one.

callamdelaney|8 months ago

What, why would you fly out and replace it? It'd be much cheaper just to launch more.

intended|8 months ago

It sounds like building it on the moon would be better.

spauldo|8 months ago

Depends what you want to use it for. Ping time to the moon and back is about 2.5 seconds best case.
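That figure is just the light-travel time there and back. A quick check, assuming the average Earth-Moon distance of about 384,400 km and no routing or processing delay:

```python
C_M_PER_S = 299_792_458.0        # speed of light in vacuum, m/s
MOON_DISTANCE_M = 384_400_000.0  # average Earth-Moon distance in metres (assumed)


def round_trip_s(distance_m):
    """Minimum round-trip signal time over distance_m, in seconds."""
    return 2.0 * distance_m / C_M_PER_S


rtt = round_trip_s(MOON_DISTANCE_M)
print(f"Earth-Moon round trip: {rtt:.2f} s")
```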

spullara|8 months ago

you don't replace it, you just let it fail and over time the datacenter wears out.