top | item 46865061


lugao | 28 days ago

Only people who have never interacted with data center reliability think it's doable to maintain servers with no human intervention.


mrweasel|27 days ago

Microsoft did do the experiment (Project Natick) where they had "datacenters" in pods under the sea, with really good results. The idea was simply to ship enough extra capacity, but due to the environment, the failure rates were 1/8th of normal.

Still, dropping a pod into the sea makes more sense than launching it into space. At least cooling, power, connectivity, and eventual maintenance are simpler.

The whole thing makes no sense and seems like it's just Musk doing financial manipulation again.

https://news.microsoft.com/source/features/sustainability/pr...

zarzavat|27 days ago

> The whole thing makes no sense and seems like it's just Musk doing financial manipulation again.

It's a fig leaf for getting two IPOs in one. There's no sense in analyzing it any further.

moontear|27 days ago

The experiment may have been successful, but if it was, why don't we see underwater datacenters everywhere? It's probably a similar reason why we won't see space datacenters in the near future either.

Space has solar energy going for it. Underwater, you don't need to lug a 1,420-ton rocket with a datacenter payload into space.

yencabulator|24 days ago

Dropping a pod into the sea makes more sense than launching it into space, and Microsoft decided it wasn't worth doing.

jmyeet|28 days ago

There is a class of people who may seem smart until they start talking about a subject you know about. Hank Green is a great example of this.

For many on HN, Elon buying Twitter was a wake-up call: he suddenly started talking about software and servers and data centers and reliability, and a ton of people with experience in those things went "oh... this guy's an idiot".

Data centers in space are exactly like this. Your comment (correctly) alludes to this.

Companies like Google, Meta, Amazon and Microsoft all have so many servers that parts are failing constantly. They fail so often at that scale that it's expected that something like a hard drive will fail while a single job is running.

So all of these companies build systems to detect failures, disable the node until it's fixed, alert someone to what the problem is, and bring the node back online once the problem is addressed. Everything will fail: hard drives, RAM, CPUs, GPUs, SSDs, power supplies, fans, NICs, cables, etc.
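A toy sketch of that detect -> drain -> alert -> return cycle (all class and node names here are hypothetical, not any real fleet-management API):

```python
from enum import Enum

class NodeState(Enum):
    HEALTHY = "healthy"
    DRAINED = "drained"  # out of the scheduling pool, awaiting repair

class Fleet:
    """Toy model of the failure-handling loop described above."""
    def __init__(self, node_ids):
        self.state = {n: NodeState.HEALTHY for n in node_ids}
        self.tickets = []  # open repair tickets for technicians

    def report_failure(self, node_id, component):
        # Detect: a health check or failed job flags the node.
        self.state[node_id] = NodeState.DRAINED    # stop scheduling work on it
        self.tickets.append((node_id, component))  # alert a technician

    def repair_done(self, node_id):
        # Return the node to the pool once the problem is addressed.
        self.state[node_id] = NodeState.HEALTHY
        self.tickets = [t for t in self.tickets if t[0] != node_id]

    def schedulable(self):
        return [n for n, s in self.state.items() if s is NodeState.HEALTHY]

fleet = Fleet(["node-1", "node-2", "node-3"])
fleet.report_failure("node-2", "hdd")
print(fleet.schedulable())  # node-2 stays drained until a tech repairs it
```

The point of the parent comment is that the `repair_done` step assumes a human with spare parts can reach the machine, which is exactly the step an orbital datacenter loses.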

So all data centers have a number of technicians who are constantly fixing problems. IIRC Google's ratio tended to be about 10,000 servers per technician; good technicians could handle higher ratios. When a node goes offline it's often not clear why, so techs take known-good parts, basically replace all of them, figure out what the actual problem was later, dispose of any bad parts, and put tested good parts back into the pool of known-good parts for a later incident.

Data centers in space lose all of this ability. So if you have a large number of orbital servers, they're going to be failing constantly with no ability to fix them. You can really only deorbit them and replace them and that gets real expensive.

Electronics and chips on satellites also aren't consumer grade. They're not even enterprise grade. They're orders of magnitude more reliable than that, because they have to deal with errors terrestrial components don't, thanks to cosmic rays and the solar wind. That's why they're a fraction of the power of something you can buy from Amazon but cost 1000x as much: they need to last years and not fail, something no home computer or data center server has to deal with.

Put it this way, a hardened satellite or probe CPU is like paying $1 million for a Raspberry Pi.

And anybody who has dealt with data centers knows this.

fblp|28 days ago

Great comment on hardware and maintenance costs. In comparison, Elon wrote "My estimate is that within 2 to 3 years, the lowest cost way to generate AI compute will be in space." It's a pity this reads like the entire acquisition of xAI is based on "Elon's napkin math" (maybe he checked it with Grok).

rkagerer|28 days ago

Thanks for putting words to that; the paragraph which most stuck out to me as outlandish is (emphasis mine):

    The basic math is that launching a million tons per year of satellites generating 100 kW of compute power per ton would add 100 gigawatts of AI compute capacity annually, *with no ongoing operational or maintenance needs*.
I'm deeply disillusioned to arrive at this conclusion, but the Occam's Razor in me feels this whole acquisition is more likely a play to increase the perceived value of SpaceX before a planned IPO.

mosquitobiten|27 days ago

For me, trying to apply liquid TIM to a CPU on a space station in a big-ass suit would be a total nightmare. Maybe robots could make it bearable, but the racks would get greasy fast from the many failed attempts.

e4325f|27 days ago

I'm pretty sure they don't harden compute in space anymore, that's one thing SpaceX pioneered with their cost-cutting approach early on.

skartik|27 days ago

Excellent comment.

Sparyjerry|27 days ago

[deleted]

WalterBright|27 days ago

> but they cost 1000x as much

Compute power has increased more than 1000x while the cost came down.

I recall paying $3000 for my first IBM PC.

> they need to last years and not fail

Not if they are cheap enough to build and launch. Quantity has a quality all its own.

everfrustrated|28 days ago

Might be why he's also investing in building their own fabs - if he can keep the silicon costs low then that flips a lot of the math here.

keepamovin|28 days ago

Whoa there, space-faring sysadmin. You really want that off-world contract tho?

lugao|28 days ago

Haha, hard pass on the job. I prefer my oxygen at 1 atm.

I'm not a data center technician myself, but I have deep respect for those folks and the complexity they manage. It's quite surprising the market still buys Musk's claims day after day.

lugao|26 days ago

I did some more reading and want to walk back my skepticism a bit. There is actually serious effort going into this, such as Google’s research on space-based AI infrastructure: https://research.google/blog/exploring-a-space-based-scalabl...

They highlight the exact reliability constraint I was thinking of: that replacing failed TPUs is trivial on Earth but impossible in space. Their solution is redundant provisioning, which moves the problem from "operationally impossible" to "extremely expensive."
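To get a feel for why "redundant provisioning" is expensive, here's an illustrative sketch (the failure rates and mission lengths are assumptions for illustration, not numbers from the Google paper): if failures are independent and nothing can be repaired, keeping a target capacity at end of life means overprovisioning by the inverse of the surviving fraction.

```python
def overprovision_factor(annual_failure_rate, mission_years):
    """Units to launch per unit of capacity still wanted at end of life,
    assuming independent failures and no possibility of repair."""
    surviving_fraction = (1 - annual_failure_rate) ** mission_years
    return 1 / surviving_fraction

# Illustrative numbers only: 2%, 5%, 10% annual failure over a 5-year life.
for afr in (0.02, 0.05, 0.10):
    print(f"{afr:.0%} annual failures -> launch {overprovision_factor(afr, 5):.2f}x")
```

Every extra unit in that factor is mass that has to be built, launched, and powered purely to absorb failures that a terrestrial technician would have fixed with a spare part.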

You would effectively need custom, super-redundant motherboards designed to bypass dead chips rather than replace them. The paper also tackles the interconnect problem using specialized optics to sustain high bitrates, which is fascinating but seems incredibly difficult to pull off given that the constellation topology changes constantly. It might be possible, but the resulting hardware would look nothing like a regular datacenter.

Also, this would require lots of satellites to rival a regular DC, which is also very hard to justify. Let's see what the promised 2027 tests reveal.

donny2018|27 days ago

I'd assume datacenters built for space would have different reliability standards. I mean, if a communication satellite (which already has a lot of electronic and computing components) can work unattended, then a satellite working as a server could too.

vagab0nd|27 days ago

You are right. But in the future we'll be refueling the satellites anyway. Might as well maintain the servers using robots all in one go.

SilverElfin|27 days ago

Right now that’s not the case. Satellites just store whatever fuel they need for orbital adjustments, and by default they fall back to Earth and burn up at the end of their life. All the Starlink satellites are configured to fall back to Earth within 5 years (the fuel is used to re-raise their orbit). The new proposed datacenters would sit in a higher orbit to avoid debris, allegedly, but that means it is even more expensive to get to them and refuel them, and the potential for future debris is far worse (since they wouldn’t fall back to Earth and burn up for centuries or millennia).

angled|28 days ago

But … but what if we had solar-powered AI SREs to fix the solar-powered AI satellites… /in space/?

lugao|28 days ago

Maintaining modern accelerators requires frequent hands-on intervention -- replacing hardware, reseating chips, and checking cable integrity.

Because these platforms are experimental and rapidly evolving, they aren't 'space-ready.' Space-grade hardware must be 'rad-hardened' and proven over years of testing.

By the time an accelerator is reliable enough for orbit, it’s several generations obsolete, making it nearly impossible to compete with ground-based clusters or turn a profit.

elihu|28 days ago

Do they need to be maintained? If one compute node breaks, you just turn it off and don't worry about it. You just assume you'll have some amount of unrecoverable errors and build that into the cost/benefit analysis. As long as failures are in line with projections, it's baked in as a cost of doing business.

The idea itself may be sound, though that's unrelated to the question of whether Elon Musk can be relied on to be honest with investors about what their real failure projections and cost estimates are and whether it actually makes financial sense to do this now or in the near future.

lugao|28 days ago

AI clusters are heavily interconnected, so the blast radius of a single component failure is much larger than with independent nodes -- working around failures would fragment the cluster beyond the point where it can be used meaningfully.

I can't go into detail about real numbers, but it's not doable with current hardware, by a large margin.
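One way to see the blast-radius point: a synchronous training job that needs all N accelerators up at once has availability p^N, so per-node availability that looks fine at small scale collapses at cluster scale. A sketch with an assumed (not measured) per-node figure:

```python
def job_availability(per_node_availability, nodes):
    """A synchronous job that needs every node up simultaneously.
    Assumes independent failures: overall availability is p ** N."""
    return per_node_availability ** nodes

p = 0.9999  # assumed: each node is up 99.99% of the time
for n in (8, 1024, 100_000):
    print(f"{n:>7} nodes -> job availability {job_availability(p, n):.4f}")
```

At 8 nodes the job is almost always runnable; at cluster scale the same per-node number leaves the job down most of the time, which is why unrepairable failures hurt an interconnected cluster far more than a pile of independent servers.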

andrewinardeer|28 days ago

This guy invented reusable rockets that land themselves. I'm sure xAI is not just one guy. Plenty of talented people work there.