ttfvjktesd | 5 months ago

The biggest part that is always missing in such comparisons is employee salaries. The calculation gives $354k/year of total cost, but now add the cost of staff in SF to operate that thing.

827a|5 months ago

The biggest part missing from the opposing side: their view is very much rooted in the pre-cloud hardware infrastructure world, where you'd pay sysadmins a full salary to sit in a dark room and monitor these servers.

The reality nowadays is that on-prem staffing is covered by the colo fees, which are split among everyone colocating at the facility and are reasonably affordable. The software-level work above that has simplified massively over the past 15 years and is now roughly comparable to the volume of work it takes to run workloads in the cloud (do you think managing IAM and Terraform is free?).

ttfvjktesd|5 months ago

> do you think managing IAM and Terraform is free?

No, but I would argue that a SaaS offering, where the storage system is maintained for you, actually requires fewer maintenance hours than hosting 30 PB in a colo.

In Terraform you define the S3 bucket and run terraform apply. Afterwards, the company's credit card is the limit. Setting up and operating 30 PB yourself is an entirely different story.
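Something like this is the entire cloud-side step (sketching it with boto3 rather than the Terraform resource, and the bucket name and region are placeholders):

    import boto3

    # Create the bucket; everything after this (capacity, durability,
    # rebalancing) is the provider's problem and billing is pay-as-you-go.
    s3 = boto3.client("s3", region_name="us-west-2")
    s3.create_bucket(
        Bucket="example-training-data",  # hypothetical name; must be globally unique
        CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    )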

g413n|5 months ago

yeah, colo help has been great. We had a power blip and, without any hassle, they covered the cost and installation of UPSes for every rack; we didn't need to think about it beyond some email coordination.

Aurornis|5 months ago

Small startup teams can sometimes get away with datacenter management being a side task that gets done on an as-needed basis at first. It will come with downtime and your stability won't be anywhere near as good as Cloudflare or AWS no matter how well you plan, though.

Every real-world colocation or self-hosting project I've ever been around has underestimated its downtime and rate of problems by at least an order of magnitude. The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.

There is a false sense of security that comes in the early days of the project when you think you've gotten past the big issues and developed a system that's reliable enough. The real test is always 1-2 years later when teams have churned, systems have grown, and the initial enthusiasm for playing with hardware has given way to deep groans whenever the team has to draw straws to see who gets to debug the self-hosted server setup this time or, worse, drive to the datacenter again.

calvinmorrison|5 months ago

> The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.

I don't have this experience at all. Our colo handled almost all the work. The only time I ever went to the server farm was to build out whole new racks. Even replacing servers, the colo handled for us at a good cost.

Our reliability came from software, not hardware, though of course we had hundreds of spares sitting by. The defense in depth: multiple datacenters, each datacenter with two 'brains' that could hot-swap, and each client backed up on 3-4 machines (rough sketch of that kind of placement below).

Servers going down was fairly commonplace, and servers dying outright was too. I think once we had a whole-rack outage when the switch died, and we flipped over to the backup.

Yes, these things can be done, and for a lot less than paying AWS.
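Not our actual code, just a minimal sketch of that style of placement, with a made-up two-datacenter inventory and a hash-based pick so every client's copies land on different machines across sites:

    import hashlib
    from itertools import cycle

    # Hypothetical inventory, not a real topology: a few machines in each
    # of two datacenters.
    MACHINES = {
        "dc-east": ["e01", "e02", "e03"],
        "dc-west": ["w01", "w02", "w03"],
    }

    def replica_set(client_id: str, copies: int = 3) -> list[str]:
        """Pick `copies` machines for a client's data, alternating datacenters
        so a single-site outage never takes out every copy."""
        seed = int(hashlib.sha256(client_id.encode()).hexdigest(), 16)
        picks = []
        for i, dc in zip(range(copies), cycle(MACHINES)):
            hosts = MACHINES[dc]
            picks.append(f"{dc}/{hosts[(seed + i) % len(hosts)]}")
        return picks

    print(replica_set("client-42"))  # e.g. ['dc-east/e01', 'dc-west/w02', 'dc-east/e03']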

g413n|5 months ago

FWIW, our first test rack has been up for about a year now, and the full cluster has been operational for training for the past ~6 months. Having it right down the block from our office has been incredibly helpful; I am a bit worried about what e.g. Fremont would look like if we expand there.

I think another big crux here is that there isn't really any notion of cluster-wide downtime, aside from e.g. a full datacenter power outage (which we've had, I guess, and now have UPSes in each rack, kindly provided and installed by our datacenter). At the software/network level the storage isn't coordinated in any manner, so failures of one machine only show up as a degradation of the total theoretical bandwidth for training (back-of-the-envelope below). This means there's generally no scrambling and we can just schedule maintenance at our leisure. Last time I drew straws for maintenance, I clocked a 30-minute round trip to walk over and plug a crash cart into each of the 3 problematic machines to reboot and re-initialize them, and that was it.

Again, having it right by the office is super nice; we'll need to really trust our KVM setup before considering anything offsite.
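The degradation math with made-up node counts and per-node bandwidth, just to illustrate:

    # Hypothetical figures: each storage node contributes some read bandwidth
    # to training; a dead node shaves the aggregate, it doesn't take the
    # cluster "down".
    nodes, dead = 40, 3
    per_node = 10  # GB/s, made up
    print(f"{(nodes - dead) * per_node} GB/s of {nodes * per_node} GB/s theoretical")
    # -> "370 GB/s of 400 GB/s theoretical"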

rtp4me|5 months ago

For drive issues, this is easy. Have a stack of replacements on hand and just open a "remote-hands" ticket with the colo provider to swap out the drive. This can usually be done within 1-2 hours of opening the ticket.

For server issues: again, pretty easy. Just use iKVM/IPMI and iPXE to diagnose a faulty server, and "remote-hands" from the colo provider can help fix problems if your staff doesn't have the skills.
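For example, something along these lines with ipmitool (the BMC address and credentials are placeholders; the subcommands themselves are standard):

    import subprocess

    # Placeholder BMC address/credentials for one misbehaving node.
    BMC = ["ipmitool", "-I", "lanplus", "-H", "10.0.0.21", "-U", "admin", "-P", "secret"]

    def run(*args: str) -> str:
        out = subprocess.run(BMC + list(args), capture_output=True, text=True, check=True)
        return out.stdout.strip()

    print(run("chassis", "power", "status"))   # e.g. "Chassis Power is on"
    run("chassis", "bootdev", "pxe")           # PXE-boot a diagnostic image on next reset
    run("chassis", "power", "cycle")           # reboot the wedged machine remotely

Console access for the actual debugging goes over Serial-over-LAN (ipmitool sol activate) or the vendor's iKVM, which is what saves the drive to the datacenter.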

kabdib|5 months ago

I've built and maintained similar setups (10PB range). Honestly, you just shove disks into it, and when they fail you replace them. You need folks around to handle things like controller / infrastructure failure, but hopefully you're paying them to do other stuff, too.
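E.g., a periodic sweep like this is most of the disk babysitting (device list is a placeholder, and it assumes ATA-style smartctl output):

    import subprocess

    # Hypothetical sweep over a few data disks; smartctl -H reports overall
    # SMART health, which is the usual trigger for pulling a drive and
    # opening a remote-hands ticket.
    DISKS = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # placeholder device list

    for dev in DISKS:
        result = subprocess.run(["smartctl", "-H", dev], capture_output=True, text=True)
        status = "OK" if "PASSED" in result.stdout else "REPLACE?"  # SAS drives report "OK" instead
        print(f"{dev}: {status}")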

g413n|5 months ago

someone has to go and power-cycle the machines every couple of months; it's chill. That's the point of not using Ceph.

ttfvjktesd|5 months ago

You are under the assumption that only Ceph (and similarly complex software) requires staff, whereas a plain 30 PB setup can be operated basically just by rebooting from time to time.

I think that anyone with actual experience of operating thousands of physical disks in datacenters would challenge this assumption.

datadrivenangel|5 months ago

Assuming they end up hiring a full-time ops person at $500k total annual cost ($250k base for a datacenter wizard), that's ~$42k extra a month, bringing the total to ~$70k/month. Still $200k per month lower than their next best offering.
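Spelled out (all figures are the thread's hypotheticals):

    hardware = 354_000 / 12   # ~29.5k/month, the article's self-hosted total cost
    ops_hire = 500_000 / 12   # ~41.7k/month, one fully loaded ops salary
    print(round(hardware + ops_hire))  # -> 71167, i.e. the "~$70k/month" figure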

paxys|5 months ago

So the drives are never going to fail? PSUs are never going to burn out? You are never going to need to procure new parts? Negotiate with vendors?