
Ceph: A Journey to 1 TiB/s

411 points | davidmr | 2 years ago | ceph.io

210 comments


alberth|2 years ago

Ceph has an interesting history.

It was created at Dreamhost (DH), for their internal needs by the founders.

DH was effectively doing IaaS & PaaS before those terms were coined by the industry (VPS, managed OS/database/app servers).

They spun Ceph off and Redhat bought it.

https://en.wikipedia.org/wiki/DreamHost

artyom|2 years ago

Yeah, as a customer (still one) I remember their "Hey, we're going to build this Ceph thing, maybe it ends up being cool" blog entry (or newsletter?) kinda just sharing what they were toying with. It was a time of no marketing copy and not crafting every sentence to sell you things.

I think it was the university project of one of the founders, and the others jumped in to support it. Docker has a similar origin story as far as I know.

epistasis|2 years ago

A bit more to the story: it was also created at UC Santa Cruz, by Sage Weil, a Dreamhost founder, while he was doing graduate work there. UCSC has had a lot of good storage research.

stuff4ben|2 years ago

I used to love doing experiments like this. I was afforded that luxury as a tech lead back when I was at Cisco setting up Kubernetes on bare metal and getting to play with setting up GlusterFS and Ceph just to learn and see which was better. This was back in 2017/2018 if I recall. Good ole days. Loved this writeup!

knicholes|2 years ago

I had to run a bunch of benchmarks comparing speeds not just across AWS instance types, but across individual instances within each type, since some NVMe SSDs have been more used than others, in order to lube up some Aerospike response times. Crazy.

redrove|2 years ago

A Heketi man! I had the same experience around the same years, what a blast. Everything was so new... and broken!

amluto|2 years ago

I wish someone would try to scale the nodes down. The system described here is ~300W/node for 10 disks/node, so 30W or so per disk. That’s a fair amount of overhead, and it also requires quite a lot of storage to get any redundancy at all.

I bet some engineering effort could divide the whole thing by 10. Build a tiny SBC with 4 PCIe lanes for NVMe, 2x10GbE (as two SFP+ sockets), and a just-fast-enough ARM or RISC-V CPU. Perhaps an eMMC chip or SD slot for boot.

This could scale down to just a few nodes, and it reduces the exposure to a single failure taking out 10 disks at a time.

I bet a lot of copies of this system could fit in a 4U enclosure. Optionally the same enclosure could contain two entirely independent switches to aggregate the internal nodes.

pl4nty|2 years ago

Nvidia's SODIMM compute module interface can prove this concept already. I have two 7W ARM Turing RK1s arriving soon, each with PCIe 3.0 x4 at 4 GB/s, and the Turing Pi 2 cluster board can fit four in an ITX form factor. I'm expecting over 3 Gbps per watt at a total cost of 820 USD.

PCIe lanes are the bottleneck so far - even my $90 2TB SSDs are rated at 7 GB/s on PCIe 4.0 x4. So I don't think SBCs are the optimal solution yet. It looks like Ampere's Altra line can do 128 lanes of PCIe 4.0 at 40W, so a 1U blade with 100G networking could be interesting. I've seen lots of bugs and missing optimisations with ARM though, even in a homelab, so this kind of solution might not be ready for datacenters yet.

walrus01|2 years ago

10 Gbps is increasingly obsolete with very low cost 100 Gbps switches and 100Gbps interfaces. Something would have to be really tiny and low cost to justify doing a ceph setup with 10Gbps interfaces now... If you're at that scale of very small stuff you are probably better off doing local NVME storage on each server instead.

Palomides|2 years ago

here's a weird calculation:

this cluster does something vaguely like 0.8 gigabits per second per watt (1 terabyte/s * 8 bits per byte * 1024 gb per tb / 34 nodes / 300 watts)

a new mac mini (super efficient arm system) runs around 10 watts in interactive usage and can do 10 gigabits per second network, so maybe 1 gigabit per second per watt of data

so OP's cluster, back of the envelope, is basically the same bits per second per watt that a very efficient arm system can do
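That back-of-envelope number checks out. A quick sketch using the same figures (note this comment uses 34 nodes, while another comment downthread counts 68):

```python
# Cluster side: 1 TiB/s across 34 nodes at ~300 W each (the comment's figures)
cluster_gbit = 1 * 8 * 1024                  # 1 TiB/s -> Gib/s
cluster_gbit_per_watt = cluster_gbit / (34 * 300)

# Mac mini side: ~10 W in interactive use, 10 Gb/s networking
mac_gbit_per_watt = 10 / 10

print(f"cluster:  {cluster_gbit_per_watt:.2f} Gb/s per watt")   # ~0.80
print(f"mac mini: {mac_gbit_per_watt:.2f} Gb/s per watt")       # 1.00
```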

I don't think running tiny nodes would actually get you any more efficiency, and would probably cost more! performance per watt is quite good on powerful servers now

anyway, this is all open source software running on off-the-shelf hardware, you can do it yourself for a few hundred bucks

jeffbee|2 years ago

I think the chief source of inefficiency in this architecture would be the NVMe controller. When the operating system and the NVMe device are at arm's length, there is natural inefficiency, as the controller needs to infer the intent of the request and do its best in terms of placement and wear leveling. The new FDP (flexible data placement) features try to address this by giving the operating system more control. The best thing would be to just hoist it all up into the host operating system and present the flash, as nearly as possible, as a giant field of dumb transistors that happens to be a PCIe device. With layers of abstraction removed, the hardware unit could be something like an Atom with integrated 100gbps NICs and a proportional amount of flash to achieve the desired system parallelism.

booi|2 years ago

Is that a lot of overhead? The disks themselves use about 10W each, and high-speed controllers use about 75W, which leaves pretty much 100W for the rest of the system, including overhead of about 10%. Scale the system up to 16 disks and there's not a lot of room for improvement.
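One way to read that budget as a sketch (the 75 W controller figure and ~10% overhead are this comment's estimates, not measured values):

```python
node_watts = 300
disk_watts = 10 * 10                  # 10 disks at ~10 W each
controller_watts = 75                 # high-speed controller estimate
overhead_watts = 0.10 * node_watts    # ~10% conversion/cooling overhead

rest = node_watts - disk_watts - controller_watts - overhead_watts
print(f"{rest:.0f} W left for CPU, RAM, NIC")   # ~95 W, i.e. "pretty much 100W"
```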

somat|2 years ago

I have always wanted to set up a Ceph system with one drive per node. The ideal form factor would be a drive with a couple of network interfaces built in. Western Digital had a press release about an experiment that was exactly this, but it never ended up as a drive you could buy.

The Hardkernel HC2 was a nearly ideal form factor for this, and I still have a stack of them lying around that I bought to make a Ceph cluster, but I ran out of steam when I figured out they were 32-bit. Not to say it would be impossible; I just never did it.

kbenson|2 years ago

There probably is a sweet spot for power versus speed, but I think it's possibly a bit larger than you suggest. There's overhead from the other components as well. For example, the Mellanox NIC seems to use about 20W itself, and while the reduced number of drives might allow for a single-port NIC, which seems to use about half the power, if we're going to increase the number of cables (3 per 12 disks instead of 2 per 5), we're not just increasing the power usage of the nodes themselves but also possibly increasing the power usage, or changing the type, of the switch required to combine the nodes.

If looked at as a whole, it appears to be more about whether you're combining resources at a low level (on the PCI bus on nodes) or a high level (in the switching infrastructure), and we should be careful not to push power (or complexity, as is often a similar goal) to a separate part of the system that is out of our immediate thoughts but still very much part of the system. Then again, sometimes parts of the system are much better at handling the complexity for certain cases, so in those cases that can be a definite win.

wildylion|2 years ago

IIRC, WD has experimented with placing Ethernet and some compute directly onto hard drives some time back.

sigh I used to do some small-scale Ceph back in 2017 or so...

chx|2 years ago

There was a point in history when the total amount of digital data stored worldwide reached 1TiB for the first time. It is extremely likely this day was within the last sixty years.

And here we are moving that amount of data every second on the servers of a fairly random entity. We're not talking about a nation state or a supranational research effort.

qingcharles|2 years ago

That reminds me of a calculation I did which showed that my desktop PC would be more powerful than all of the computers on the planet combined in like 1978 :D

fiddlerwoaroof|2 years ago

It’s at least 20ish years ago: I remember an old sysadmin talking about managing petabytes before 2003

kylegalbraith|2 years ago

This is a fascinating read. We run a Ceph storage cluster for persisting Docker layer cache [0]. We went from using EBS to Ceph and saw a massive difference in throughput. Went from a write throughput of 146 MB/s and 3,000 IOPS to 900 MB/s and 30,000 IOPS.

The best part is that it pretty much just works. Very little babysitting with the exception of the occasional fs trim or something.

It’s been a massive improvement for our caching system.

[0] https://depot.dev/blog/cache-v2-faster-builds

guywhocodes|2 years ago

Did something very similar almost 10 years ago; EBS cost 10x+ more than a same-performance Ceph cluster on the node disks. Eventually we switched to our own racks and cut costs almost tenfold again. We developed the in-house expertise for how to do it and we were free.

e12e|2 years ago

Did you host ebs on bare metal? How are you hosting ceph - your own/rented metal, ec2 - VMs?

Wasn't immediately clear to me from the blog.

MPSimmons|2 years ago

The worst problems I've had with in-cluster dynamic storage were never strictly IO related; they were more about the storage controller software in Kubernetes struggling with real-world failures like pods dying and PVCs not attaching until very long timeouts expired, with the pod sitting in ContainerCreating until the PVC lock was freed.

This has happened in multiple clusters, using rook/ceph as well as Longhorn.

matheusmoreira|2 years ago

Does anyone have experience running ceph in a home lab? Last time I looked into it, there were quite significant hardware requirements.

nullwarp|2 years ago

There still are. As someone who has done both production and homelab deployments: unless you are specifically just looking for experience with it and just setting up a demo - don't bother.

When it works, it works great - when it goes wrong it's a huge headache.

Edit: As just an edit, if distributed storage is just something you are interested in there are much better options for a homelab setup:

- seaweedfs has been rock solid for me for years in both small and huge scales. we actually moved our production ceph setup to this.

- longhorn was solid for me when i was in the k8s world

- glusterfs is still fine as long as you know what you are going into.

ianlevesque|2 years ago

I played around with it and it has a very cool web UI, object storage & file storage, but it was very hard to get decent performance and it was possible to get the metadata daemons stuck pretty easily with a small cluster. Ultimately when the fun wore off I just put zfs on a single box instead.

victorhooi|2 years ago

I have some experience with Ceph, both for work, and with homelab-y stuff.

First, bear in mind that Ceph is a distributed storage system - so the idea is that you will have multiple nodes.

For learning, you can definitely virtualise it all on a single box - but you'll have a better time with discrete physical machines.

Also, Ceph does prefer physical access to disks (similar to ZFS).

And you do need decent network connectivity - I think that's the main thing people think of when they think of high hardware requirements for Ceph. Ideally 10GbE at the minimum - more if you want higher performance - as there can be a lot of network traffic, particularly with things like backfill. (25Gbps if you can find that gear cheap for a homelab - 50Gbps is a technological dead-end. 100Gbps works well.)

But honestly, for a homelab, a cheap mini PC or NUC with 10GbE will work fine, you should get acceptable performance, and it'll be good for learning.

You can install Ceph directly on bare-metal, or if you want to do the homelab k8s route, you can use Rook (https://rook.io/).

Hope this helps, and good luck! Let me know if you have any other questions.

mcronce|2 years ago

I run Ceph in my lab. It's pretty heavy on CPU, but it works well as long as you're willing to spring for fast networking (at least 10Gb, ideally 40+) and at least a few nodes with 6+ disks each if you're using spinners. You can probably get away with far fewer disks per node if you're going all-SSD.

aaronax|2 years ago

I just set up a three-node Proxmox+Ceph cluster a few weeks ago. Three Optiplex desktops 7040, 3060, and 7060 and 4x SSDs of 1TB and 2TB mix (was 5 until I noticed one of my scavenged SSDs was failed). Single 1gbps network on each so I am seeing 30-120MB/s disk performance depending on things. I think in a few months I will upgrade to 10gbps for about $400.

I'm about 1/2 through the process of moving my 15 virtual machines over. It is a little slow but tolerable. Not having to decide on RAIDs or a NAS ahead of time is amazing. I can throw disks and nodes at it whenever.

chomp|2 years ago

I’ve run Ceph in my home lab since Jewel (~8 years ago). Currently up to 70TB storage on a single node. Have been pretty successful vertically scaling, but will have to add a 2nd node here in a bit.

Ceph isn’t the fastest, but it’s incredibly resilient and scalable. Haven’t needed any crazy hardware requirements, just ram and an i7.

sgarland|2 years ago

Yes. I first tried it with Rook, and that was a disaster, so I shifted to Longhorn. That has had its own share of problems, and is quite slow. Finally, I let Proxmox manage Ceph for me, and it’s been a dream. So far I haven’t migrated my K8s workloads to it, but I’ve used it for RDBMS storage (DBs in VMs), and it works flawlessly.

I don’t have an incredibly great setup, either: 3x Dell R620s (Ivy Bridge-era Xeons), and 1GbE. Proxmox’s corosync has a dedicated switch, but that’s about it. The disks are nice, to be fair - Samsung PM863 3.84 TB NVMe. They are absolutely bottlenecked by the LAN at the moment.

I plan on upgrading to 10GbE as soon as I can convince myself to pay for an L3 10G switch.

louwrentius|2 years ago

If you want decent performance, you need a lot of OSDs, especially if you use HDDs. And a lot of consumer SSDs will suffer terrible performance degradation on writes, depending on the circumstances and workload.

willglynn|2 years ago

The hardware minimums are real, and the complexity floor is significant. Do not deploy Ceph unless you mean it.

I started considering alternatives when my NAS crossed 100 TB of HDDs, and when a scary scrub prompted me to replace all the HDDs, I finally pulled the trigger. (ZFS resilvered everything fine, but replacing every disk sequentially gave me a lot of time to think.) Today I have far more HDD capacity and a few hundred terabytes of NVMe, and despite its challenges, I wouldn't dare run anything like it without Ceph.

antongribok|2 years ago

I run Ceph on some Raspberry Pi 4s. It's super reliable, and with cephadm it's very easy[1] to install and maintain.

My household is already 100% on Linux, so having a native network filesystem that I can just mount from any laptop is very handy.

Works great over Tailscale too, so I don't even have to be at home.

[1] I run a large install of Ceph at work, so "easy" might be a bit relative.

mikecoles|2 years ago

Works great, depending on what you want to do. Running on SBCs or computers with cheap sata cards will greatly reduce the performance. It's been running well for years after I found out the issues regarding SMR drives and the SATA card bottlenecks.

45Drives has a homelab setup if you're looking for a canned solution.

bluedino|2 years ago

Related question, how does someone get into working with Ceph? Other than working somewhere that already uses it.

mmerlin|2 years ago

Proxmox makes Ceph easy, even with just one single server if you are homelabbing...

I had 4 NUCs running Proxmox+Ceph for a few years, and apart from slightly annoying slowness syncing after spinning the machines up from cold start, it all ran very smoothly.

loeg|2 years ago

Why would you bother with a distributed filesystem when you don't have to?

m463|2 years ago

I think you need 3 or was it 5 machines?

proxmox will use it - just click to install

mrb|2 years ago

I wanted to see how 1 TiB/s compares to the actual theoretical limits of the hardware. So here is what I found:

The cluster has 68 nodes, each a Dell PowerEdge R6615 (https://www.delltechnologies.com/asset/en-us/products/server...). The R6615 configuration they run is the one with 10 U.2 drive bays. The U.2 link carries data over 4 PCIe gen4 lanes. Each PCIe lane is capable of 16 Gbit/s, with negligible ~1.5% overhead thanks to 128b/130b encoding.

This means each U.2 link has a maximum link bandwidth of 16 * 4 = 64 Gbit/s or 8 Gbyte/s. However, the U.2 NVMe drives they use are Dell 15.36TB Enterprise NVMe Read Intensive AG, which appear to be capable of 7 Gbyte/s read throughput (https://www.serversupply.com/SSD%20W-TRAY/NVMe/15.36TB/DELL/...). So they are not bottlenecked by the U.2 link (8 Gbyte/s).

Each node has 10 U.2 drives, so each node can do local read I/O at a maximum of 10 * 7 = 70 Gbyte/s.

However, each node has a network bandwidth of only 200 Gbit/s (2 x 100GbE Mellanox ConnectX-6), which is only 25 Gbyte/s. This implies that remote reads under-utilize the drives (capable of 70 Gbyte/s). The network is the bottleneck.

Assuming no additional network bottlenecks (they don't describe the network architecture), this implies the 68 nodes can provide 68 * 25 = 1700 Gbyte/s of network reads. The author actually benchmarked 1025 GiB/s = 1101 Gbyte/s, which is 65% of the theoretical maximum of 1700 Gbyte/s. That's pretty decent, but in theory it could do a bit better, assuming all nodes can truly saturate their 200 Gbit/s network links concurrently.
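The whole bottleneck chain above can be written out as a small sanity check (all figures taken from this comment; link rates are nominal):

```python
# Per-drive: U.2 link (PCIe gen4 x4) vs. the drive's rated read throughput
u2_link_gb = 16 * 4 / 8           # 16 Gbit/s per lane, 4 lanes -> 8 GB/s
drive_gb = 7                      # rated drive read throughput, GB/s

# Per-node: local disk bandwidth vs. network bandwidth
local_gb = 10 * min(u2_link_gb, drive_gb)    # 10 drives -> 70 GB/s
net_gb = 2 * 100 / 8                         # 2x100GbE -> 25 GB/s

# Cluster: 68 nodes, network-bound
ceiling_gb = 68 * min(local_gb, net_gb)      # 1700 GB/s
achieved_gb = 1025 * 2**30 / 1e9             # 1025 GiB/s ~= 1101 GB/s

print(f"ceiling {ceiling_gb:.0f} GB/s, achieved {achieved_gb / ceiling_gb:.0%}")
```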

Reading this whole blog post, I got the impression Ceph's complexity hits the CPU pretty hard. That not compiling a module with -O2 ("Fix Three", linked by the author: https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453) can make performance "up to 5x slower with some workloads" (https://bugs.gentoo.org/733316) is pretty unexpected for a pure I/O workload. Also, what's up with the OSD's threads wasting CPU grabbing the IOMMU spinlock? I agree with the conclusion that the OSD threading model is suboptimal. A relatively simple synthetic 100% read benchmark should not expose thread contention if that part of Ceph's software architecture were well designed (this is fixable, so I hope the Ceph devs prioritize it).

markhpc|2 years ago

I wanted to chime in and mention that we've never seen any issues with IOMMU before in Ceph. We have a previous generation of the same 1U chassis from Dell with AMD Rome processors in the upstream ceph lab and they don't suffer from the same issue despite performing similarly at the same scale (~30 OSDs). The customer did say they've seen this in the past in their data center. I'm hoping we can work with AMD to figure out what's going on.

I did some work last summer kind of duct taping the OSD's existing threading model (double buffering the hand-off between async msgr and worker threads, adaptive thread wakeup, etc). I could achieve significant performance / efficiency gains under load, but at the expense of increased low-load latency (Ceph by default is very aggressive about waking up threads when new IO arrives for a given shard).

One of the other core developers and I discussed it and we both came to the conclusion that it probably makes sense to do a more thorough rewrite of the threading code.

magicalhippo|2 years ago

They're benchmarking random IO though, and the disks can "only" do a bit over 1000k random 4K read IOPS, which translates to about 5 GiB/s. With 320 OSDs, that's around 1.6 TiB/s.

At least that's the number I could find. Not exactly tons of reviews of these enterprise NVMe disks...

Still, that seems like a good match to the NICs. At this scale most workloads will likely appear as random IO at the storage layer anyway.
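The IOPS-to-throughput conversion behind that estimate is just IOPS times I/O size (a sketch; the exact IOPS figure below is an interpolation of "a bit over 1000k"):

```python
iops = 1_300_000        # assumed; "a bit over 1000k" random 4K read IOPS
io_size = 4096          # bytes per 4K read

per_drive_gib = iops * io_size / 2**30      # ~5 GiB/s per drive
cluster_tib = 320 * per_drive_gib / 1024    # ~1.5 TiB/s across 320 OSDs

print(f"{per_drive_gib:.1f} GiB/s per drive, {cluster_tib:.1f} TiB/s total")
```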

wmf|2 years ago

I think PCIe TLP overhead and NVMe commands account for the difference between 7 and 8 GB/s.

kaliszad|2 years ago

What surprises me is why they went with harder-to-cool 1U nodes with 10 SSDs and 2x100Gb NICs instead of 2U nodes with 24 SSDs and 2x200 or even 400Gb NICs. They could remove the network bottleneck and save on power thanks to larger, lower-speed fans and fewer CPU packages, possibly with more cores per socket though. Also, having a smaller number of nodes increases the blast radius, but even with 34 nodes this is probably not such a problem. However, with fewer nodes they could have a flatter network with 4 switches or so too.

PiratesScorn|2 years ago

Blast radius is the primary factor as you say and just generally makes things like patching and HW replacements less stressful. The racks and switches already exist and are heavily utilised for other purposes so the additional physical footprint for ceph is pretty tiny :)

mobilemidget|2 years ago

Cool benchmark, and interesting. However, it would have read a lot better if abbreviations were explained at first use. Not everybody is familiar with all the terminology used in the post. Nonetheless, congrats on the results.

markhpc|2 years ago

Thanks (truly) for the feedback! I'll try to remember for future articles. It's easy to forget how much jargon we use after being in the field for so long.

one_buggy_boi|2 years ago

Is modern Ceph appropriate for transactional database storage? How is the IO latency? I'd like to move to a cheaper clustered file system that can compete with systems like Oracle's clustered file system, or DBs backed by something like Veritas. Veritas supports multi-petabyte DBs, and I haven't seen much outside of it or OCFS that scales similarly with acceptable latency.

antongribok|2 years ago

Not sure about putting DBs on CephFS directly, but Ceph RBD can definitely run RDBMS workloads.

You need to pay attention to the kind of hardware you use, but you can definitely get Ceph down to 0.5-0.6 ms latency on block workloads doing single thread, single queue, sync 4K writes.

Source: I run Ceph at work doing pretty much this.

samcat116|2 years ago

Latency is quite poor, I wouldn't recommend running high performance database loads there.

rafaelturk|2 years ago

I'm playing a lot with MicroCeph. It's an opinionated, friendly setup of Ceph. Looking forward to additional comments. Planning to use it in production and replace lots of NAS servers.

louwrentius|2 years ago

I think Ceph can be fine for NAS use cases, but be wary of latency and do some benchmarking. You may need more nodes/osds than you think to reach latency and throughput targets.

louwrentius|2 years ago

Remember, random IOPs without latency is a meaningless figure.

francoismassot|2 years ago

Does anyone know how Ceph compares to other object storage engines like MinIO/Garage/...?

I would love to see some benchmarks there.

matesz|2 years ago

It would be great to have a universal benchmark of all the available open-source solutions for self-hosting. Links appreciated!

peter_d_sherman|2 years ago

Ceph is interesting... open source software whose only purpose is to implement a distributed file system...

Functionally, Linux implements a file system (well, several!) as well (in addition to many other OS features) -- but (usually!) only on top of local hardware.

There seems to be some missing software here -- if we examine these two paradigms side-by-side.

For example, what if I want a Linux (or more broadly, a general OS) -- but one that doesn't manage a local file system or local storage at all?

One that operates solely using the network, solely using a distributed file system that Ceph, or software like Ceph, would provide?

Conversely, what if I don't want to run a full OS on a network machine, a network node that manages its own local storage?

The only thing I can think of to solve those types of problems -- is:

What if the Linux filesystem were written as a completely separate piece of software, like a distributed file system such as Ceph, not dependent on the rest of the kernel source code (although still compilable into the kernel, as most Linux components normally are)...

A lot of work? Probably!

But there seems to be some software need for something between a solely distributed file system as Ceph is, and a completely monolithic "everything baked in" (but not distributed!) OS/kernel as Linux is...

Note that I am just thinking aloud here -- I probably am wrong and/or misinformed on one or more fronts!

So, kindly take this random "thinking aloud" post -- with the proverbial "grain of salt!" :-)

wmf|2 years ago

what if I want a Linux ... that doesn't manage a local file system or local storage at all [but] operates solely using the network, solely using a distributed file system

Linux can boot from NFS although that's kind of lost knowledge. Booting from CephFS might even be possible if you put the right parts in the initrd.

plagiarist|2 years ago

It sounds you want microkernels, and I agree, it would be nice.

nghnam|2 years ago

My old company ran a public and private cloud with OpenStack and Ceph. We had 20 Supermicro storage nodes (24 disks per server) and total capacity was 3PB. We learned some lessons; in particular, a flapping disk degraded the performance of the whole system. The solution was removing the disk with bad sectors as soon as possible.

einpoklum|2 years ago

Where can I read about the rationale for ceph as a project? I'm not familiar with it.

jacobwg|2 years ago

Not sure how common the use-case is, but we're using Ceph to effectively roll our own EBS inside AWS on top of i3en EC2 instances. For us it's about 30% cheaper than the base EBS cost, but provides access to 10x the IOPS of base gp3 volumes.

The downside is durability and operations - we have to keep Ceph alive and are responsible for making sure the data is persistent. That said, we're storing cache from container builds, so in the worst-case where we lose the storage cluster, we can run builds without cache while we restore.

jseutter|2 years ago

http://www.45drives.com/blog/ceph/what-is-ceph-why-our-custo... is a pretty good introduction. Basically you can take off-the-shelf hardware and keep expanding your storage cluster and ceph will scale fairly linearly up through hundreds of nodes. It is seeing quite a bit of use in things like Kubernetes and OpenShift as a cheap and cheerful alternative to SANs. It is not without complexity, so if you don't know you need it, it's probably not worth the hassle.

brobinson|2 years ago

I'm curious what the performance difference would be on a modern kernel.

PiratesScorn|2 years ago

For context, I’ve been leading the work on this cluster client-side (not the engineer that discovered the IOMMU fix) with Clyso.

There was no significant difference when testing between the latest HWE on Ubuntu 20.04 and kernel 6.2 on Ubuntu 22.04. In both cases we ran into the same IOMMU behaviour. Our tooling is all very much catered around Ubuntu so testing newer kernels with other distros just wasn’t feasible in the timescale we had to get this built. The plan was < 2 months from initial design to completion.

Awesome to see this on HN, we’re a pretty under-the-radar operation so there’s not much more I can say but proud to have worked on this!

riku_iki|2 years ago

What router/switch one would use for such speed?

NavinF|2 years ago

Linked article says they used 68 machines with 2 x 100GbE Mellanox ConnectX-6 cards. So any 100G pizza box switches should work.

Note that 36-port 56G switches are dirt cheap on eBay, and 4 Tbps is good enough for most homelab use cases.

KeplerBoy|2 years ago

800Gbps via OSFP and QSFP-DD are already a thing. Multiple vendors have NICs and switches for that.

hinkley|2 years ago

Sure would be nice if you defined some acronyms.

up2isomorphism|2 years ago

This is an insanely expensive cluster built to show a benchmark. 68 node cluster serving only 15TB storage in total.

PiratesScorn|2 years ago

The purpose of the benchmarking was to validate the design of the cluster and to identify any issues before going into production, so it achieved exactly that objective. Without doing this work a lot of performance would have been left on the table before the cluster could even get out the door.

As per the blog, the cluster is now in a 6+2 EC configuration for production, which gives ~7PiB usable. Expensive, yes, but well worth it if this is the scale and performance required.
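That usable-capacity figure follows from the erasure-coding overhead (a sketch; node and drive counts come from mrb's comment upthread, and 6+2 EC stores 6 data chunks plus 2 coding chunks, so 6/8 of raw capacity is usable):

```python
raw_bytes = 68 * 10 * 15.36e12        # 68 nodes x 10 drives x 15.36 TB
raw_pib = raw_bytes / 2**50           # convert to PiB

k, m = 6, 2                           # erasure coding: k data + m coding chunks
usable_pib = raw_pib * k / (k + m)

print(f"raw {raw_pib:.1f} PiB, usable {usable_pib:.1f} PiB")   # ~9.3 / ~7.0
```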
