SeaweedFS does the thing: I've used it to store billions of medium-sized XML documents, image thumbnails, PDF files, etc. It fills the gap between "databases" (broadly defined; maybe you can do few-tens-KByte docs but stretching things) and "filesystems" (hard/inefficient in reality to push beyond tens/hundreds of millions of objects; yes I know it is possible with tuning, etc, but SeaweedFS is better-suited).
The docs and operational tooling feel a bit janky at first, but they get the job done, and the whole project is surprisingly feature-rich. I've dealt with basic power-outages, hardware-caused data corruption (cheap old SSDs), etc, and it was possible to recover.
In some ways I feel like the surprising thing is that there is such a gap in open source S3 API blob stores. Minio is very simple and great, but is one-file-per-object on disk (great for maybe 90% of use-cases, but not billions of thumbnails). Ceph et al are quite complex. There are a bunch of almost-sort-kinda solutions like base64-encoded bytes in HBase/postgresql/etc, or chunking (like MongoDB), but really you just want to concatenate the bytes like a .tar file, and index in with range requests.
The Wayback Machine's WARC files plus CDX (index files with offset/range) is pretty close.
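To make the "concatenate the bytes and index in with ranges" idea concrete, here is a minimal sketch. It is not how SeaweedFS, WARC, or CDX are actually implemented; the `PackFile` class and its method names are made up for illustration, and the index lives only in memory.

```python
import io
from typing import BinaryIO, Dict, Tuple


class PackFile:
    """Hypothetical append-only pack: payloads are concatenated into one
    stream and an index maps each key to its (offset, length) range."""

    def __init__(self, stream: BinaryIO) -> None:
        self.stream = stream
        self.index: Dict[str, Tuple[int, int]] = {}

    def put(self, key: str, payload: bytes) -> Tuple[int, int]:
        # Always append at the end; existing bytes are never rewritten.
        offset = self.stream.seek(0, io.SEEK_END)
        self.stream.write(payload)
        self.index[key] = (offset, len(payload))
        return self.index[key]

    def get(self, key: str) -> bytes:
        # One seek plus one ranged read, regardless of how many objects exist.
        offset, length = self.index[key]
        self.stream.seek(offset)
        return self.stream.read(length)


pack = PackFile(io.BytesIO())
pack.put("thumb-1.jpg", b"\xff\xd8 first thumbnail")
pack.put("doc-7.xml", b"<doc>hello</doc>")
```

The point of the layout is that a billion small objects become a handful of large files on disk, and the per-object cost is one index entry rather than one inode.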
I was quite surprised to discover that minio is one file per object. Having read some papers about object stores, this is definitely not what I expected.
When you had corruption and failures, what was the general procedure to deal with that? I love SeaweedFS and want to try it (Neocities is a nearly perfect use case), but part of my concern is not having a manual/documentation for the edge cases so I can figure things out on the fringes. I didn't see any documentation around that when I last looked but maybe I missed something.
(If any SeaweedFS devs are seeing this, having a section of the wiki that describes failure situations and how to manage them would be a huge add-on.)
SeaweedFS is built on top of a blob storage based on Facebook's Haystack paper.
The features are not fully developed yet, but what makes it different is a new way of programming for the cloud era.
When needing some storage, just fallocate some space to write to, and a file_id is returned. Use the file_id similar to a pointer to a memory block.
There will be more features built on top of it. File system and Object store are just a couple of them. Need more help on this.
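To illustrate the "file_id as a pointer" analogy, here is a rough sketch of taking a file id apart. It assumes the layout I recall from the project README (a decimal volume id, a comma, then a hex string whose last 8 digits are a cookie and whose leading digits are the file key); treat that layout as an assumption to check against the current docs. `FileId` and `parse_file_id` are hypothetical names, not part of any SeaweedFS client.

```python
from typing import NamedTuple


class FileId(NamedTuple):
    volume_id: int  # which volume (the big append-only file) holds the blob
    file_key: int   # key of the blob inside that volume
    cookie: int     # random value that makes ids hard to guess


def parse_file_id(fid: str) -> FileId:
    """Split an id such as '3,01637037d6' under the assumed layout."""
    volume_part, _, rest = fid.partition(",")
    # The hex part must hold at least one key digit plus the 8 cookie digits.
    if not volume_part or len(rest) <= 8:
        raise ValueError(f"malformed file id: {fid!r}")
    return FileId(int(volume_part), int(rest[:-8], 16), int(rest[-8:], 16))


parsed = parse_file_id("3,01637037d6")
```

Like a pointer, the id already says where the data lives, so a read does not need a path lookup first.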
First, the feature set you have built is very impressive.
I think SeaweedFS would really benefit from more documentation on what exactly it does.
People who want to deploy production systems need that, and it would also help potential contributors.
Some examples:
* It says "optimised for small files", but it is not super clear from the whitepaper and other documentation what that means. It mostly talks about how small the per-file overhead is, but that's not enough. For example, on Ceph I can also store 500M files without problem, but then later discover that some operations that happen only infrequently, such as recovery or scrubs, are O(files) and thus have O(files) many seeks, which can mean 2 months of seeks for a recovery of 500M files to finish. ("Recovery" here means when a replica fails and the data is copied to another replica.)
* More on small files: Assuming small files are packed somehow to solve the seek problem, what happens if I delete some files in the middle of the pack? Do I get fragmentation (space wasted by holes)? If yes, is there a defragmentation routine?
* One page https://github.com/seaweedfs/seaweedfs/wiki/Replication#writ... says "volumes are append only", which suggests that there will be fragmentation. But here I need to piece together info from different unrelated pages in order to answer a core question about how SeaweedFS works.
* https://github.com/seaweedfs/seaweedfs/wiki/FAQ#why-files-ar... suggests that "vacuum" is the defragmentation process. It says it triggers automatically when deleted-space overhead reaches 30%. But what performance implications does a vacuum have, can it take long and block some data access? This would be the immediate next question any operator would have.
* Scrubs and integrity: It is common for redundant-storage systems (md-RAID, ZFS, Ceph) to detect and recover from bitrot via checksums and cross-replica comparisons. This requires automatic regular inspections of the stored data ("scrubs"). For SeaweedFS, I can find no docs about it, only some Github issues (https://github.com/seaweedfs/seaweedfs/issues?q=scrub) that suggest that there is some script that runs every 17 minutes. But looking at that script, I can't find which command is doing the "repair" action. Note that just having checksums is not enough for preventing bitrot: It helps detect it, but does not guarantee that the target number of replicas is brought back up (as it may take years until you read some data again). For that, regular scrubs are needed.
* Filers: For a production store of a highly-available POSIX FUSE mount I need to choose a suitable Filer backend. There's a useful page about these on https://github.com/seaweedfs/seaweedfs/wiki/Filer-Stores. But they are many, and information is limited to ~8 words per backend. To know how a backend will perform, I need to know both the backend well, and also how SeaweedFS will use it. I will also be subject to the workflows of that backend, e.g. running and upgrading a large HA Postgres is unfortunately not easy. As another example, Postgres itself also does not scale beyond a single machine, unless one uses something like Citus, and I have no info on whether SeaweedFS will work with that.
* The word "Upgrades" seems generally un-mentioned in Wiki and README. How are forward and backward compatibility handled? Can I just switch SeaweedFS versions forward and backward and expect everything will automatically work? For Ceph there are usually detailed instructions on how one should upgrade a large cluster and its clients.
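The append-only, fragmentation and vacuum questions in the list above can be pictured with a toy model. This is only a sketch under assumptions: the `Volume` class is hypothetical, the 30% threshold is the figure quoted from the FAQ, and compacting by copying live data into a fresh buffer is my guess at the general technique, not a description of what SeaweedFS does internally.

```python
from typing import Dict, Tuple

GARBAGE_THRESHOLD = 0.3  # the 30% deleted-space figure quoted from the FAQ


class Volume:
    """Toy append-only volume: deletes only drop the index entry, so the
    bytes stay behind as garbage until a vacuum copies the live data out."""

    def __init__(self) -> None:
        self.data = bytearray()
        self.index: Dict[str, Tuple[int, int]] = {}
        self.deleted_bytes = 0

    def append(self, key: str, payload: bytes) -> None:
        if key in self.index:
            self.delete(key)  # an overwrite leaves the old copy as garbage
        self.index[key] = (len(self.data), len(payload))
        self.data += payload

    def delete(self, key: str) -> None:
        _, length = self.index.pop(key)
        self.deleted_bytes += length

    def read(self, key: str) -> bytes:
        offset, length = self.index[key]
        return bytes(self.data[offset:offset + length])

    def garbage_ratio(self) -> float:
        return self.deleted_bytes / len(self.data) if self.data else 0.0

    def vacuum(self) -> None:
        # Copy only live blobs into a new buffer, then swap it in.
        compacted = bytearray()
        new_index: Dict[str, Tuple[int, int]] = {}
        for key, (offset, length) in self.index.items():
            new_index[key] = (len(compacted), length)
            compacted += self.data[offset:offset + length]
        self.data, self.index, self.deleted_bytes = compacted, new_index, 0


vol = Volume()
vol.append("a", b"x" * 40)
vol.append("b", b"y" * 40)
vol.append("c", b"z" * 20)
vol.delete("a")
ratio_before = vol.garbage_ratio()
if ratio_before >= GARBAGE_THRESHOLD:
    vol.vacuum()
```

In this model a vacuum costs one sequential pass over the live bytes of a volume. Whether the real thing blocks reads or writes while it runs is exactly the kind of detail the docs would need to spell out.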
In general the way this should be approached is: Pretend to know nothing about SeaweedFS, and imagine what a user that wants to use it in production wants to know, and what their followup questions would be.
Some parts of that are partially answered in the presentations, but it is difficult to piece together how a software currently works from presentations of different ages (maybe they are already outdated?) and the presentations are also quite light on infos (usually only 1 slide per topic). I think the Github Wiki is a good way to do it, but it too, is too light on information and I'm not sure it has everything that's in the presentations.
I understand the README already says "more tools and documentation", I just want to highlight how important the "what does it do and how does it behave" part of documentation is for software like this.
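For readers unfamiliar with scrubs, the detect-and-repair loop asked about above looks roughly like this. It is a generic sketch, not SeaweedFS code: replicas are modelled as plain dicts, `scrub` is a hypothetical function, and CRC32 from `zlib` stands in for whatever checksum a real system records at write time.

```python
import zlib
from typing import Dict, List

Replica = Dict[str, bytes]


def scrub(replicas: List[Replica], checksums: Dict[str, int]) -> List[str]:
    """Check every copy of every blob against its recorded checksum and
    overwrite missing or corrupt copies from an intact one."""
    repaired: List[str] = []
    for key, expected in checksums.items():
        # Detection: find any replica whose copy still matches the checksum.
        good = next(
            (r[key] for r in replicas if key in r and zlib.crc32(r[key]) == expected),
            None,
        )
        if good is None:
            raise RuntimeError(f"no intact copy of {key!r} left")
        # Repair: bring every other replica back to the intact content.
        for i, replica in enumerate(replicas):
            if replica.get(key) != good:
                replica[key] = good
                repaired.append(f"replica {i}: {key}")
    return repaired


blob = b"thumbnail bytes"
checksums = {"a": zlib.crc32(blob)}
replicas = [{"a": blob}, {"a": b"thumbnail bytez"}, {}]
report = scrub(replicas, checksums)
```

The checksum alone only covers the detection half; the second loop is the "repair" action, and it only happens if something walks all the data on a schedule.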
We (https://hivegames.io/) use this for storing 50+ TB of multiplayer match recordings ("replays"), heavily using the built-in expiry feature. It's incredibly easy to use and to build on top of; never had an issue updating, migrating or utilizing new features.
This sounds like what Microsoft has tried but failed to do in numerous iterations for two decades: OFS (Cairo, unreleased predecessor to Windows 95), Storage+ (SQL Server 7.0), RFS (SQL Server 2000), Exchange Webstore, Outlook LIS, WinFS, and finally Microsoft Semantic Engine.
All projects were either cancelled, features cut, or officially left in limbo.
It's a pretty remarkable piece of Microsoft history as it has been there on the sidelines since roughly post-Windows 3.11. The reason they returned to it so often was in part because Bill Gates loved the idea of a more high level object storage that, like this, bridges the gap between files and databases.
He would probably have loved this kind of technology as part of Windows -- and indeed in 2013, he cited the failure of WinFS as his greatest disappointment at Microsoft, saying that it was ahead of its time and that it would re-emerge.
>> and indeed in 2013, he cited the failure of WinFS as his greatest disappointment at Microsoft,
Failing to capture any of the mobile handset market while missing out almost entirely on search and social media businesses would be higher on my list if I were in BG's shoes.
I was asking around in my network for experience with self-hosting S3-like solutions. One serious user of SeaweedFS recommended looking into min.io instead. Another serious user of min.io recommended looking into SeaweedFS instead…
It used to be if you wanted thousands of tiny files give seaweed a go, minio would suck. But minio has since had a revision so you'd have to test it out.
Seaweed has been running my k8s persistent volumes pretty admirably for like a year for about 4 devs.
Tried and rejected SeaweedFS due to Postgres failing to even initialize itself on a POSIX FS volume mounted over SeaweedFS' CSI driver. And that's too bad, because SeaweedFS was otherwise working well!
What we need and haven't identified yet is an SDS system that provides both fully-compliant POSIX FS and S3 volumes, is FOSS, a production story where individuals can do all tasks competently/quickly/effectively (management, monitoring, disaster recovery incl. erasure coding and tooling), and CSI drivers that work with Nomad.
This rules out Ceph and friends. GarageFS, also mentioned in this thread, is S3 only. We went through everything applicable on the K8S drivers list https://kubernetes-csi.github.io/docs/drivers.html except for Minio, because it claimed it needed a backing store anyways (like Ceph) although just a few days ago I encountered Minio being used standalone.
While I'm on this topic, I noticed that the CSI drivers (SeaweedFS and SDS's in general) use tremendous resources when effecting mounts, instantiating nearly a whole OS (OCI image w/ functional userland) just to mount over what appears to be NFS or something.
You do know that you cannot implement a fully-compliant POSIX FS with only the S3 API?
None of the scalable SDS' support random writes. Atomic rename (for building transactional systems like lakehouse table formats) is not there. Listing of files is often eventually consistent.
The closest functional API to a posix-compliant one in scalable SDS' is the HDFS API. Only ADLS supports that. But then again, they are the only one who enable you to fuse mount a directory for local FS read/write access. All of the S3 fuse mount stuff is fundamentally limited by the S3 API.
Things to make sure of when choosing your distributed storage:
1) Are you _really_ sure you need it distributed, or can you shard it yourself? (Hint: distributed anything sucks up at least one if not two innovation tokens. If you're using other innovation tokens as well, you're going to have a very bad time.)
2) Do you need to modify blobs, or can you get away with read/modify/replace? (S3 doesn't support partial writes; a one-bit change requires the whole file to be rewritten.)
3) What's your ratio of reads to writes? (Do you need local caches, or local pools in GPFS parlance?)
4) How much are you going to change the metadata? (If there's POSIX somewhere, it'll be a lot.)
5) Are you going to try to write to the same object at the same time in two different locations? (How do you manage locking and concurrency?)
6) Do you care about availability, consistency or speed? (Pick one, maybe one and a half.)
7) How are you going to recover from the distributed storage shitting itself all at the same time?
The comments already mention several alternatives (Minio, Ceph, GarageFS). I think another one, not mentioned yet, is JuiceFS [1]. Found one comparison here [2].
JuiceFS isn't standalone, it requires separate backing storage for each of data [0] and metadata. So for example, JuiceFS would target SeaweedFS or GarageFS as its data store. JuiceFS can also target the local file system, but SDS use cases typically care about things like redundancy and availability of the data itself, things that JuiceFS happily delegates. JuiceFS itself can be distributed, but that's merely the control plane as I understand it.
I tried JuiceFS with AWS S3 and an (admittedly slow) self-hosted postgres instance, and it didn't work at all. I would have understood if it had been really slow, but erroring out really seems wrong for software where correctness is paramount.
Does anyone know how well the Seaweed Filer works for renaming files or missing files? My use case involves writing a lot of data to temporary files that are then renamed to their final name. This is always the Achilles heel for distributed file storage, where files are put into buckets based on the file path… when you rename the path, but keep the data, lookups become more complicated.
(This is HPC work with large processing pipelines. I keep track of if the job was successful based upon if the final file exists. The rename only happens if the job was successful. It’s a great way to track pipeline status, but metadata lookups can be a pain — particularly for missing files. )
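For anyone not familiar with the pattern, this is roughly what the write-then-rename workflow looks like on a local POSIX file system. It is a sketch only: `publish_atomically` and `job_succeeded` are made-up helper names, and nothing here says how the Seaweed Filer behaves on rename, which is the open question.

```python
import os
import tempfile
from pathlib import Path


def publish_atomically(final_path: Path, payload: bytes) -> None:
    """Write to a temp file in the same directory, then rename it into
    place, so the final name only ever refers to a complete file."""
    fd, tmp_name = tempfile.mkstemp(
        dir=final_path.parent, prefix=final_path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_name, final_path)  # atomic on a local POSIX file system
    except BaseException:
        # A failed job leaves no final file behind, only a cleaned-up temp.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def job_succeeded(final_path: Path) -> bool:
    # The pipeline status check: success means the final name exists.
    return final_path.exists()


with tempfile.TemporaryDirectory() as workdir:
    out = Path(workdir) / "result.dat"
    before = job_succeeded(out)
    publish_atomically(out, b"pipeline output")
    after = job_succeeded(out)
    leftovers = sorted(p.name for p in Path(workdir).iterdir())
```

Every status check for a job that has not finished is a lookup of a missing name, which is why negative lookups matter so much for this workload.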
One similar use case used Cassandra as SeaweedFS filer store, and created thousands of files per second in a temp folder, and moved the files to a final folder. It caused a lot of tombstones for the updates in Cassandra.
Later, they changed to use Redis for the temp folder, and keep Cassandra for other folders. Everything has been very smooth since then.
I am aware of some research into operating systems with a database rather than filesystem as a base layer. If SeaweedFS serves a middle-ground between databases and filesystems, could it also suggest a middle-ground in conceiving of research operating systems??
SeaweedFS is a non hierarchical distributed key value store. It makes different tradeoffs to a filesystem which provides a hierarchical view of local only data. There’s some evidence to suggest that a hierarchical structuring of the data itself isn’t actually beneficial for modern systems. And you could design a system that used similar techniques to SeaweedFS to do a semi-distributed local store (ie locally stored data for fast access with offloading to cheap remote storage for durability / infinite extensibility). And the plain KV store will likely be faster for most operations although in practice you’ll probably only see it in micro benchmarks.
Have used SeaweedFS to store billions of thumbnails. The tooling is a bit clunky, but it mostly works. The performance is very good for small-ish objects (memory usage + latency), and latency remains consistently good into 99.9 percentiles. We had some issues with data loss and downtime, but that was mostly our own fault.
I don't understand why you wouldn't just use plain s3. There is no comparison in the readme and I would love to understand what the benefits are. Also I would have expected a comparison to maybe Apache Iceberg, but this might be more specialized for relational data lake data?
Ceph should have 10x+ metadata overhead for chunk storage. When using erasure-coding writes are faster because it's using replication and then erasure-coding is done async for whole volumes (30GB).
bnewbold|2 years ago
seized|2 years ago
https://garagehq.deuxfleurs.fr/
no_wizard|2 years ago
I was expecting C/C++ or Rust, pleasantly surprised to see Go.
riku_iki|2 years ago
why you would base64 encode them, they all store binary formats?
pilgrim0|2 years ago
kyledrake|2 years ago
chrislusf|2 years ago
CyberDildonics|2 years ago
>> just fallocate some space to write to, and a file_id is returned. Use the file_id similar to a pointer to a memory block.
How is that not mmap?
Also what is the difference between a file, an object, a blob, a filesystem and an object store? Is all this just files indexed with sql?
nh2|2 years ago
clankstar|2 years ago
candiddevmike|2 years ago
jug|2 years ago
osigurdson|2 years ago
gorset|2 years ago
jasonjayr|2 years ago
Already__Taken|2 years ago
seized|2 years ago
https://garagehq.deuxfleurs.fr/
I use it for self hosting.
junon|2 years ago
chaxor|2 years ago
papaver-somnamb|2 years ago
jamesblonde|2 years ago
arccy|2 years ago
snthpy|2 years ago
I've never used it myself and just learned about it from this thread but it seems to fit the bill.
4by4by4|2 years ago
Seaweed had much better performance for our use case.
Scaevolus|2 years ago
bomewish|2 years ago
erikaww|2 years ago
Drop in S3 compatibility with much better performance would be insane
_zoltan_|2 years ago
KaiserPro|2 years ago
1) Are you _really_ sure you need it distributed, or can you shard it yourself? (Hint: distributed anything sucks up at least one if not two innovation tokens. If you're using other innovation tokens as well, you're going to have a very bad time.)
2) Do you need to modify blobs, or can you get away with read/modify/replace? (S3 doesn't support partial writes; a one-bit change requires the whole file to be rewritten.)
3) What's your ratio of reads to writes? (Do you need local caches, or local pools in GPFS parlance?)
4) How much are you going to change the metadata? (If there's POSIX somewhere, it'll be a lot.)
5) Are you going to try to write to the same object at the same time in two different locations? (How do you manage locking and concurrency?)
6) Do you care about availability, consistency or speed? (Pick one, maybe one and a half.)
7) How are you going to recover from the distributed storage shitting itself all at the same time?
8) how are you going to control access?
flemhans|2 years ago
2) No modifications, just new files and the occasional deletion request.
3) Almost just 1 write and 1 read per file, this is a backing storage for the source files, and they are cached in front.
4) Never
5) Files are written only by one other server, and there will be no parallel writes.
6) I pick consistency and as the half, availability.
7) This happened something like 15 years ago with MogileFS and thus scared us away. (Hence the single-server ZFS setup).
8) Reads are public, writes restricted to one other service that may write.
SheddingPattern|2 years ago
PhilippGille|2 years ago
[1] https://juicefs.com/en/
[2] https://dzone.com/articles/seaweedfs-vs-juicefs-in-design-an...
papaver-somnamb|2 years ago
[0] https://juicefs.com/docs/community/reference/how_to_set_up_o...
remram|2 years ago
mbreese|2 years ago
chrislusf|2 years ago
pitherpather|2 years ago
vlovich123|2 years ago
jakjak123|2 years ago
hardwaresofton|2 years ago
DrDroop|2 years ago
throwup238|2 years ago
Otherwise: egress costs.
lolpanda|2 years ago
ddorian43|2 years ago
fodkodrasz|2 years ago
jamesblonde|2 years ago
sighansen|2 years ago
throw0101d|2 years ago
ddorian43|2 years ago
thrusong|2 years ago
I think since v0.7— I was always intrigued by Facebook's Haystack.
SeaweedFS has been super reliable, efficient, and trouble free.
jdthedisciple|2 years ago
Now I only need to wait 10 years until all the hidden but crucial bugs are found (at the massive loss of real data, ofc) before I'm ready to use it,
like with every new piece of technology...
Or what should give me the confidence that it isn't so?
stevekemp|2 years ago
https://github.com/seaweedfs/seaweedfs/pull/187
unknown|2 years ago
[deleted]
_3u10|2 years ago
killingtime74|2 years ago
fefferkorn|2 years ago
monlockandkey|2 years ago
fuddle|2 years ago
feliciegerald|2 years ago
[deleted]