top | item 26760266


emmericp | 4 years ago

Lesson 1: Never ever reboot multiple Ceph nodes without checking that Ceph is healthy between reboots. This failure happened early during boot, and it could have been handled with no downtime if they had checked each rebooted node before rebooting the next one.
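That between-reboots check is easy to automate. A minimal sketch, assuming the JSON shape emitted by `ceph status -f json` on a recent release (the `health.status` and `pgmap` field names should be verified against your Ceph version):

```python
import json
import subprocess
import time

def cluster_healthy(status: dict) -> bool:
    """Return True only when Ceph reports HEALTH_OK and every PG is active+clean.

    `status` is the parsed output of `ceph status -f json`; the field names
    used here are an assumption based on recent releases.
    """
    if status.get("health", {}).get("status") != "HEALTH_OK":
        return False
    pgmap = status.get("pgmap", {})
    # Every placement group should be active+clean before touching another node.
    clean = sum(s["count"] for s in pgmap.get("pgs_by_state", [])
                if s["state_name"] == "active+clean")
    return clean == pgmap.get("num_pgs", 0)

def wait_until_healthy(poll_seconds: int = 30) -> None:
    """Block until the cluster is healthy; run this between node reboots."""
    while True:
        out = subprocess.check_output(["ceph", "status", "-f", "json"])
        if cluster_healthy(json.loads(out)):
            return
        time.sleep(poll_seconds)
```

Gating each reboot on `wait_until_healthy()` in the orchestration script would have prevented exactly this failure mode.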

Lesson 2: Avoid using RAID controllers except for the most simple "pass through" mode.

Lesson 3: XFS+Ceph never really worked out. BlueStore solved so many problems by just removing the XFS dependency for the actual data. Recommended reading: https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf

ceph-volume finally fully removed the dependency on file systems. Yeah, the LVM mess is sometimes annoying and early versions of ceph-volume had many problems, but nowadays I wouldn't want ceph-disk back.


INTPenis | 4 years ago

>Lesson 3: XFS+Ceph never really worked out. BlueStore solved so many problems by just removing the XFS dependency for the actual data. Recommended reading: https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf

This gave me pause. My kube nodes do use XFS in some cases, but Ceph uses raw block devices, so XFS only holds system files, not Ceph data (except, of course, the Ceph config stored on each node).

So I assume I'm safe. I'm not entirely sure how you'd use XFS with Ceph anyway, because Ceph takes a raw device file and formats it for its own storage.

Nullabillity | 4 years ago

Ceph OSD has two different storage backends:

- Filestore is the legacy backend that uses files on a filesystem (strongly recommended to be XFS)

- Bluestore is the modern backend that uses raw device files directly
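If you're not sure which backend your OSDs are on, `ceph osd metadata` reports it per OSD. A minimal sketch, assuming the `osd_objectstore` field present in the JSON output of recent releases (verify the field name on your version):

```python
import json
import subprocess

def filestore_osds(metadata: list[dict]) -> list[int]:
    """Return the IDs of OSDs still running the legacy Filestore backend.

    `metadata` is the parsed output of `ceph osd metadata -f json`; the
    `osd_objectstore` field name is an assumption based on recent releases.
    """
    return [m["id"] for m in metadata if m.get("osd_objectstore") == "filestore"]

def check_cluster() -> list[int]:
    """Query the live cluster and list any remaining Filestore OSDs."""
    out = subprocess.check_output(["ceph", "osd", "metadata", "-f", "json"])
    return filestore_osds(json.loads(out))
```

An empty result from `check_cluster()` would mean every OSD is already on BlueStore.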

sp332 | 4 years ago

From the linked PDF: "For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file systems today because it allows them to benefit from the convenience and maturity of battle-tested code."

merb | 4 years ago

> Lesson 2: Avoid using RAID controllers except for the most simple "pass through" mode.

Well, RAID 1 works as well, if you can take the performance hit. Most RAID 1 drives inside a RAID controller can be run outside of the RAID. We already needed to do that, since one of our customers thought it was a good idea to keep a running server temporarily near an open window, with production data and no backup (we exclude our liability in this case and monitor whether backups are created), so we needed to recover the data. That worked by using another controller in pass-through mode and just running a single disk (the other disk was destroyed, and so was the RAID controller).

Btw, rainwater damages a server, especially if you do not notice it for 30 minutes and a full bucket of water ends up inside it. Kudos to Dell: the server kept running for 25 minutes while it was full of water, until it died. (We transported the server and it still had water in it...)

karmakaze | 4 years ago

Interesting read and helpful lesson list.

I've only used Ceph as provided to me by others, and have considered setting it up in some instances. I didn't know about the development of BlueStore, and it does seem much simpler. The choice between xfs, btrfs, and ext4 always seemed a bit unclear (except that I had experienced non-Ceph troubles with btrfs).

Note to self: use ceph-volume/BlueStore.

_mikulely | 4 years ago

ceph-volume still relies on LVM, which brings unnecessary complexity.

We'd like to stick to ceph-disk (already unavailable in the P release) with raw block devices only.

marcan_42 | 4 years ago

ceph-disk relies on partitions (sometimes with magic type IDs) and a stub XFS filesystem, which is more complexity than ceph-volume.

Really, ceph-volume is better. You create an LVM PV/VG/LV (which is completely standard, well supported Linux stuff) on your OSD drive and then pass it to ceph-volume. It puts the OSD metadata in LVM metadata (no stub partition! No XFS!), and the actual OSD directory just gets mounted as a tmpfs and populated from that data. Only one LV for the BlueStore block device. It all just works, and is much easier to reason about than the partitioning stuff with ceph-disk.

Plus you can play around with multiple OSDs on the same device, or OSDs plus system volumes, or RAID members, or anything. I used to have to do some horrible stuff to get somewhat "interesting" Ceph setups with e.g. a system volume on a small RAID next to the OSDs on the same disks, with ceph-disk. All that just works without any confusion with ceph-volume, just make more LVs. Bog standard stuff.
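The workflow described above can be sketched roughly like this (the device name is a placeholder, the commands need root, and `pvcreate` wipes the disk; the extra LVs at the end show the multiple-OSDs-per-disk case):

```shell
# Standard LVM setup on the OSD drive -- nothing Ceph-specific yet.
pvcreate /dev/sdX                     # WARNING: destroys data on /dev/sdX
vgcreate ceph-osd-vg /dev/sdX

# Single-OSD case: one LV covering the whole drive, handed to ceph-volume,
# which stores the OSD metadata in LVM tags and uses the LV as the
# BlueStore block device.
lvcreate -l 100%FREE -n osd-data ceph-osd-vg
ceph-volume lvm create --data ceph-osd-vg/osd-data

# Multiple-OSDs-per-disk case: just carve more LVs instead.
# lvcreate -l 50%VG -n osd-a ceph-osd-vg
# lvcreate -l 50%VG -n osd-b ceph-osd-vg
# ceph-volume lvm create --data ceph-osd-vg/osd-a
# ceph-volume lvm create --data ceph-osd-vg/osd-b
```

No stub partitions and no magic GPT type IDs: everything above is plain LVM plus one ceph-volume call per OSD.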