Ex-GlusterFS person here (used to work at Red Hat on the project side, left mid last year).
"Small file access" and "lots of files in a directory" have been pain points with GlusterFS for ages. The 3.7.0 release had some important improvements in it, specifically designed to help fix that:
https://www.gluster.org/community/documentation/index.php/Fe...
The latest Gluster release is 3.7.8 (in the same series as 3.7.0), and it's worth looking at if you need a good distributed file system. If you have something like 1 million files in a single directory though... hrmmm... NFS or other technologies might still be a better idea. ;)
I worked with a GlusterFS deployment in production about 2 years ago, and it was such a nightmare that I feel compelled both to write about it and to never touch anything made by that team again.
It was the whole shebang: kernel panics, inconsistent views, data loss, very slow performance, split-brain problems all the time. Our setup, IIRC, was very simple: two bricks in a replicated volume. It worked so poorly that we had to take it out of production. Some of our experience can be explained by GlusterFS performing poorly under network partitions, but nothing could justify kernel panics. It blew my mind that Red Hat acquired that company and product.
Edit: I hope there's been a big improvement to the reliability and performance of GlusterFS. Can anyone with more recent experience running it in production comment?
I'm not a GlusterFS expert and haven't used it before, but you should know that most consensus algorithms (Paxos, Raft, etc.) rely on a majority quorum, which is why clusters are typically deployed with an odd number of nodes. I have to wonder if your problems were mostly self-inflicted by running 2 nodes. Of course, any network partition in a 2-node cluster has a huge potential for data corruption, as each node then thinks it is the master (split-brain).
In a 3-node cluster, any system with a decent consensus algorithm (to be clear, I'm not sure if GlusterFS has one) would know that during a partition the cluster can only continue to operate if at least 2 nodes can communicate with each other to elect a new master.
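For intuition, the majority-quorum arithmetic behind this is simple. Here's a generic sketch (not GlusterFS-specific code) showing why a 2-node cluster can't tolerate a partition while a 3-node cluster can:

```python
def quorum(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can drop off while a majority still remains reachable."""
    return n - quorum(n)

for n in (2, 3, 5):
    print(f"{n} nodes: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")

# 2 nodes: quorum is 2, so zero failures are tolerated -- any partition
# leaves each side below quorum (or, without quorum enforcement, leaves
# both sides accepting writes: the split-brain described above).
# 3 nodes: quorum is 2, so the majority side of a partition keeps serving.
```

Note that adding a fourth node doesn't help availability: quorum rises to 3, so you still only tolerate one failure, which is why odd cluster sizes are the convention.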
You're right: nothing justifies kernel panics. Nothing that GlusterFS or any other user-space program does should be able to cause one. We (yes, I'm a GlusterFS developer) don't really do anything, as far as the kernel is concerned, that any application shouldn't be able to do. If what we do on behalf of our callers causes a crash, it's the kernel developers' fault, and you should engage with them instead of blaming your peers out in user-land.
As far as performing poorly under network partitions, I'd love to hear more. That is our responsibility, and sounds like something we can/should fix.
My experience was not as bad as yours, but the problems I saw at a similarly small scale (6 servers, three 2x replica sets) make me glad I'm not having to scale up a GlusterFS infrastructure. The biggest problem I saw was absolutely atrocious performance when healing after a temporary node loss. Even if it was just a loss of a minute or two (a server reboot), the cluster wouldn't lose availability or corrupt files, but performance would get so bad for the next 2 hours that it might as well have been down.
If I had to make a comparison, I'd say GlusterFS reminds me a lot of MongoDB in the beginning. It wins a lot of kudos at the outset based on ease of setup, management and CLI UI, plus it has a good "story" on ability to scale up that gradually begins to fray when pushed. Hopefully there have been big improvements.
I feel compelled to write a disagreement. I've been running GlusterFS for 6 months with 15TB of data on a 43TB cluster using 5 servers with zero issues. I have no idea what your particular combination of bad luck was, but I don't think your experience is truly reflective of the product, the team, or Red Hat's sensibilities.
Last time I tried GlusterFS was in 2012. The way it worked was very impressive back then and I would have loved to actually put it into production.
Unfortunately, I hit a roadblock in relation to enumeration of huge directories: Even with just 5K files in a directory, performance started to drop really badly to the point where enumerating a directory containing 10K files would take longer than 5 minutes.
Yes. You're not supposed to store many files in one directory, but this was about giving third parties FTP upload access for product pictures, and I can't possibly ask them to follow any schema for file and folder naming. These people want a directory to put stuff in with their GUI FTP client, and they want their client to be able to skip uploading files when the target already exists. So having all files in one directory was a huge improvement UX-wise.
So in the end, I had to move to NFS on top of DRBD to provide shared backend storage. Enumerating 20K files over NFS still isn't fast, but it completes within 2 seconds instead of more than 5 minutes.
Of course, now that we're talking about GlusterFS, I wonder whether this has been fixed since?
I started using Gluster this last fall. At first it was not a contender: the articles I read were not encouraging. However, many of those articles were older, and it appears a lot of progress has been made with Gluster since version 3.5.
It may be worth a second look; however, a directory with a large file count still might be a problem.
Couldn't your FTP server have handled this in a clever way? For example, sort files into directories by first letter, then by first two letters, while still presenting a virtual flat view to the FTP client user. It'd be a simple mapping away.
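As a sketch of that mapping (`shard_path` is a hypothetical helper; a real FTP server would have to apply it transparently in both directions, including for directory listings):

```python
import os

def shard_path(filename: str, depth: int = 2) -> str:
    """Map a flat filename onto a sharded directory layout,
    e.g. 'photo.jpg' -> 'p/ph/photo.jpg'.

    Prefixes come from the lowercased name so sharding is
    case-insensitive; the stored filename itself is unchanged.
    """
    stem = filename.lower()
    # One nesting level per leading character, up to `depth`.
    parts = [stem[:i + 1] for i in range(min(depth, len(stem)))]
    return os.path.join(*parts, filename)

print(shard_path("photo.jpg"))  # p/ph/photo.jpg
print(shard_path("a"))          # a/a
```

With a layout like this, no single directory holds more than a small slice of the uploads, which sidesteps the huge-directory enumeration problem on the backend while the client still sees one flat folder.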
I'm not sure what is being announced here. GlusterFS has been around for a few years already (it's at version 3.6 now), and the article doesn't mention that any managed service based on it has been launched. It reads more like a reminder that you can set up a distributed file system on your cloud servers using Gluster. There isn't even a step-by-step tutorial on how to do that.
I think, though I'm not sure, that it's an official, active support relationship between Google and Red Hat for GlusterFS on GCE, related to Red Hat's vendor certification programs.
If so, this seems like the kind of thing that has the most impact on people with support contracts with Red Hat, but should also have peripheral impacts on the stability, quality of support, etc., of the product on the GCE platform generally.
Basically, GlusterFS is trying to solve a hard problem: making a distributed/remote filesystem feel like a local filesystem to the applications built on top of it. For the client, you can choose between NFS, SMB, or its homemade FUSE client, which makes the remote system accessible as if everything were on the local file system. I used to build similar systems in-house and found them extremely painful to design and maintain; we did lots of custom hacks to make our system suit our needs. GlusterFS, as a general solution, won't have that much flexibility and may or may not suit your custom needs.
Overall, I feel AWS S3 is a better (or at least simpler) approach: just acknowledge that files are not locally stored and use them as they are. AWS is experimenting with EFS as well, which we also found less desirable.
Edit: I am not saying that you cannot make GlusterFS or EFS perform great. My point is that it's hard to do so, and it might not be worth the effort to develop such a system given that S3 can serve most needs of distributed file storage.
Aren't you comparing apples to oranges? S3 is an object store (non-POSIX, and also only eventually consistent); GlusterFS is neither of those things. They simply solve different problem spaces.
I needed a shared volume across multiple EC2 instances in a VPC. My use case is that multiple "ingress" boxes write files to the shared volume, and then a single "worker" box processes those files. This is a somewhat unusual use case in that it means one box is responsible for 99% of IO heavy operations, and the other boxes are responsible only for writing to the volume, with no latency requirements.
My solution was to attach an EBS volume to the "worker box" and run an NFS server on it. Each "ingress box" runs an NFS client that connects to the server via its internal VPC IP address and mounts the NFS volume to a local directory. It works wonderfully. In three months of running this setup, I've had no downtime or issues, not even minor ones. Granted, I don't need any kind of extreme I/O performance, so I haven't measured it, but this system took less than an hour to set up and fit my needs perfectly.
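A setup like this boils down to a few commands. A minimal sketch (the device name `/dev/xvdf`, the paths, and the `10.0.0.x` addresses are placeholders, and the NFS ports must also be open in the VPC security groups):

```shell
# --- On the worker box: format and mount the EBS volume, then export it ---
sudo mkfs -t ext4 /dev/xvdf
sudo mkdir -p /srv/shared
sudo mount /dev/xvdf /srv/shared

# Allow read/write access from the VPC subnet (append to /etc/exports):
echo '/srv/shared 10.0.0.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra

# --- On each ingress box: mount the export via the worker's internal IP ---
sudo mkdir -p /mnt/shared
sudo mount -t nfs 10.0.0.5:/srv/shared /mnt/shared
```

The `sync` export option trades some write throughput for durability, which suits the ingress-boxes-write, worker-reads pattern described above.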
It's a bit light for a press release. Considering Red Hat is officially promoting AWS on their website, it would have been helpful to provide more information on whether the offering on Google Cloud will be better than, or merely similar to, the AWS one.
Before reading the article, I was going to ask if it solves the "high read access of many small files" I/O problem, but alas, it's on GlusterFS, so only insomuch as Gluster has been making improvements these last few minor releases.
Is anyone here running a GlusterFS setup with high read/write volume on small files successfully? If so, what's your secret?
If you are looking for a POSIX-compatible file system for GCE or EC2, we think our ObjectiveFS[1] is the easiest way to get started. It is a log-structured filesystem using GCS or S3 for backend storage, with a ZFS-like interface to manage your filesystems.
[1] https://objectivefs.com
Glad to see Gluster is still making waves. I was an early customer. It's impressive when a brand survives acquisition much less a transition into a new type of offering like this. Kudos to everyone who helped make Gluster special!
I wonder why they even did this. They already have a state-of-the-art distributed filesystem (Colossus) which doesn't have any scalability problems at all, since they use it for everything.
http://www.highlyscalablesystems.com/3202/colossus-successor...
Disclaimer: I work on GCP.
GlusterFS works best on RHEL, and consumes normal GCP resources like GCE and PD-SSD storage. To host a rocking-fast, best-practices, HA, 3TB all-SSD filer, it'd be less than $900 on GCP: https://cloud.google.com/products/calculator/#id=e76e9a5a-bf...
I see that the GlusterFS FAQ says it is fully POSIX compliant. That's a pretty good trick. Ten years ago or so, I had a suite of compliance tests I would use to embarrass salesmen from iBrix and Panasas. The only actually POSIX-compliant distributed filesystem I could find in those days was Lustre (unrelated to Gluster, despite the naming). Lustre works well, but it is almost impossible to install and operate.
I remember at around the same time I was playing with GlusterFS, and it was interesting, but painful to configure as well. It was prone to client/server configuration mismatches, and you had to reason carefully about where you enabled a feature, the client or the server, to make sure it failed in a safe way. It did have capabilities for fcntl and flock file locking, though, which was interesting at the time. Unfortunately it was also somewhat unstable, and I would see segfaults every few weeks. My focus for work projects moved into different areas and I didn't keep up with its development, so I'm not sure what they've done in the last decade, but it was promising and refreshingly new back then. I should take another look.
I invite you to check out our product, Quobyte (www.quobyte.com). It's not open-source, but it is a parallel, high-performance POSIX file system with split-brain-safe quorum replication of all components; it can also do erasure coding for files, with policy-based data placement, running on standard server hardware.
We designed it to yield top performance both for (parallel) file system workloads and block storage workloads. So you can run VMs and databases on it with a performance better than any other partition-tolerant software storage system. The goal is to provide customers with a scalable automated general storage platform for all workloads, a la Google, but for real world applications.
In HPC/HTC, Lustre is very common, especially in DOE labs. I've never tried to install it, but I don't think my colleagues who have are some kind of special genius.
It's announcing that Gluster is available on Google Cloud, not that this is a new technology.
We've been using it in production for a few years now, and having a single namespace that can basically grow ad infinitum has been pretty neat.
If you want a trouble-free Gluster experience, stay away from MANY small files and replicated volumes.
So, here is the link to the star of the show: https://www.redhat.com/en/technologies/storage/gluster