SeaweedFS fast distributed storage system for blobs, objects, files and datalake

bnewbold|2 years ago

SeaweedFS does the thing: I've used it to store billions of medium-sized XML documents, image thumbnails, PDF files, etc. It fills the gap between "databases" (broadly defined; maybe you can do few-tens-KByte docs but stretching things) and "filesystems" (hard/inefficient in reality to push beyond tens/hundreds of millions of objects; yes I know it is possible with tuning, etc, but SeaweedFS is better-suited).

The docs and operational tooling feel a bit janky at first, but they get the job done, and the whole project is surprisingly feature-rich. I've dealt with basic power-outages, hardware-caused data corruption (cheap old SSDs), etc, and it was possible to recover.

In some ways I feel like the surprising thing is that there is such a gap in open source S3 API blob stores. Minio is very simple and great, but is one-file-per-object on disk (great for maybe 90% of use-cases, but not billions of thumbnails). Ceph et al are quite complex. There are a bunch of almost-sort-kinda solutions like base64-encoded bytes in HBase/postgresql/etc, or chunking (like MongoDB), but really you just want to concatenate the bytes like a .tar file, and index in with range requests.

The Wayback Machine's WARC files plus CDX (index files with offset/range) is pretty close.
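A minimal sketch of the "concatenate the bytes and index in with ranges" idea described above. The `Pack` class and its method names are hypothetical, and an in-memory buffer stands in for one large on-disk file; this is neither SeaweedFS's nor WARC/CDX's actual format.

```python
import io


class Pack:
    """Append-only blob pack: bytes are concatenated, an index maps key -> (offset, size)."""

    def __init__(self):
        self.data = io.BytesIO()  # stands in for one large on-disk file
        self.index = {}           # key -> (offset, size), similar in spirit to a CDX entry

    def put(self, key, blob):
        offset = self.data.seek(0, io.SEEK_END)  # always append at the end
        self.data.write(blob)
        self.index[key] = (offset, len(blob))
        return offset, len(blob)

    def get(self, key):
        offset, size = self.index[key]
        self.data.seek(offset)    # one seek plus one ranged read per object
        return self.data.read(size)


pack = Pack()
pack.put("thumb/1.png", b"\x89PNG-one")
pack.put("doc/2.xml", b"<doc>two</doc>")
print(pack.index["doc/2.xml"])  # (8, 14)
print(pack.get("doc/2.xml"))    # b'<doc>two</doc>'
```

Billions of small objects then cost one index entry each rather than one filesystem inode each, which is the gap the comment describes.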

seized|2 years ago

GarageS3 is a nice middle ground: it is not one file on disk per object, but it's simpler than SeaweedFS as well.

https://garagehq.deuxfleurs.fr/

no_wizard|2 years ago

Written in Go no less, a GC language!

I was expecting C/C++ or Rust, pleasantly surprised to see Go.

riku_iki|2 years ago

> almost-sort-kinda solutions like base64-encoded bytes in HBase/postgresql/etc

Why would you base64-encode them? They can all store binary data.

pilgrim0|2 years ago

I was quite surprised to discover that minio is one file per object. Having read some papers about object stores, this is definitely not what I expected.

kyledrake|2 years ago

When you had corruption and failures, what was the general procedure to deal with that? I love SeaweedFS and want to try it (Neocities is a nearly perfect use case), but part of my concern is not having a manual/documentation for the edge cases so I can figure things out on the fringes. I didn't see any documentation around that when I last looked but maybe I missed something.

(If any SeaweedFS devs are seeing this, having a section of the wiki that describes failure situations and how to manage them would be a huge add-on.)

chrislusf|2 years ago

Thanks for sharing! I work on SeaweedFS.

SeaweedFS is built on top of a blob storage based on Facebook's Haystack paper. The features are not fully developed yet, but what makes it different is a new way of programming for the cloud era.

When needing some storage, just fallocate some space to write to, and a file_id is returned. Use the file_id similar to a pointer to a memory block.
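To make the "file_id as a pointer" analogy concrete, here is a rough sketch that takes a file_id apart. The layout is an assumption not confirmed in this thread: a decimal volume id, a comma, a hex file key, then an 8-hex-digit cookie, with `3,01637037d6` as the sample value. `FileId` and `parse_fid` are hypothetical helper names.

```python
from typing import NamedTuple


class FileId(NamedTuple):
    volume_id: int  # which volume (the "memory region") holds the blob
    file_key: int   # key of the blob inside that volume
    cookie: int     # random value that makes ids hard to guess


def parse_fid(fid: str) -> FileId:
    """Split an assumed 'volume,keyhex+cookiehex' file id into its parts."""
    volume, _, rest = fid.partition(",")
    if len(rest) <= 8:
        raise ValueError(f"unexpected file id: {fid!r}")
    # Assumption: the last 8 hex digits are the 32-bit cookie, the rest is the key.
    return FileId(int(volume), int(rest[:-8], 16), int(rest[-8:], 16))


fid = parse_fid("3,01637037d6")
print(fid.volume_id, fid.file_key, hex(fid.cookie))
```

Like a pointer, the id encodes where the data lives, so a read needs no separate lookup of the path-to-location mapping.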

There will be more features built on top of it. File system and Object store are just a couple of them. Need more help on this.

CyberDildonics|2 years ago

> what makes it different is a new way of programming for the cloud era.

> just fallocate some space to write to, and a file_id is returned. Use the file_id similar to a pointer to a memory block.

How is that not mmap?

Also what is the difference between a file, an object, a blob, a filesystem and an object store? Is all this just files indexed with sql?

nh2|2 years ago

First, the feature set you have built is very impressive.

I think SeaweedFS would really benefit from more documentation on what exactly it does.

People who want to deploy production systems need that, and it would also help potential contributors.

Some examples:

* It says "optimised for small files", but it is not super clear from the whitepaper and other documentation what that means. It mostly talks about how small the per-file overhead is, but that's not enough. For example, on Ceph I can also store 500M files without problem, but then later discover that some operations that happen only infrequently, such as recovery or scrubs, are O(files) and thus have O(files) many seeks, which can mean 2 months of seeks for a recovery of 500M files to finish. ("Recovery" here means when a replica fails and the data is copied to another replica.)

* More on small files: Assuming small files are packed somehow to solve the seek problem, what happens if I delete some files in the middle of the pack? Do I get fragmentation (space wasted by holes)? If yes, is there a defragmentation routine?

* One page https://github.com/seaweedfs/seaweedfs/wiki/Replication#writ... says "volumes are append only", which suggests that there will be fragmentation. But here I need to piece together info from different unrelated pages in order to answer a core question about how SeaweedFS works.

* https://github.com/seaweedfs/seaweedfs/wiki/FAQ#why-files-ar... suggests that "vacuum" is the defragmentation process. It says it triggers automatically when deleted-space overhead reaches 30%. But what performance implications does a vacuum have, can it take long and block some data access? This would be the immediate next question any operator would have.

* Scrubs and integrity: It is common for redundant-storage systems (md-RAID, ZFS, Ceph) to detect and recover from bitrot via checksums and cross-replica comparisons. This requires automatic regular inspections of the stored data ("scrubs"). For SeaweedFS, I can find no docs about it, only some Github issues (https://github.com/seaweedfs/seaweedfs/issues?q=scrub) that suggest that there is some script that runs every 17 minutes. But looking at that script, I can't find which command is doing the "repair" action. Note that just having checksums is not enough for preventing bitrot: It helps detect it, but does not guarantee that the target number of replicas is brought back up (as it may take years until you read some data again). For that, regular scrubs are needed.

* Filers: For a production store of a highly-available POSIX FUSE mount I need to choose a suitable Filer backend. There's a useful page about these on https://github.com/seaweedfs/seaweedfs/wiki/Filer-Stores. But they are many, and information is limited to ~8 words per backend. To know how a backend will perform, I need to know both the backend well, and also how SeaweedFS will use it. I will also be subject to the workflows of that backend, e.g. running and upgrading a large HA Postgres is unfortunately not easy. As another example, Postgres itself also does not scale beyond a single machine, unless one uses something like Citus, and I have no info on whether SeaweedFS will work with that.

* The word "Upgrades" seems generally un-mentioned in Wiki and README. How are forward and backward compatibility handled? Can I just switch SeaweedFS versions forward and backward and expect everything will automatically work? For Ceph there are usually detailed instructions on how one should upgrade a large cluster and its clients.
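A toy model of the fragmentation question raised in the bullets above: an append-only volume where deletes only mark entries, plus a compaction pass once the deleted share exceeds 30%. The `Volume` class is hypothetical; only the 30% figure comes from the FAQ quoted above, and the real vacuum implementation may differ in every other respect.

```python
class Volume:
    """Toy append-only volume: deletes only mark entries, vacuum rewrites live data."""

    def __init__(self, garbage_threshold=0.3):
        self.entries = []  # each entry is [key, size, live]
        self.garbage_threshold = garbage_threshold

    def append(self, key, size):
        self.entries.append([key, size, True])

    def delete(self, key):
        for entry in self.entries:
            if entry[0] == key and entry[2]:
                entry[2] = False  # the space is not reclaimed yet

    def garbage_ratio(self):
        total = sum(size for _, size, _ in self.entries)
        dead = sum(size for _, size, live in self.entries if not live)
        return dead / total if total else 0.0

    def maybe_vacuum(self):
        """Compact only when the deleted share exceeds the threshold."""
        if self.garbage_ratio() <= self.garbage_threshold:
            return False
        self.entries = [e for e in self.entries if e[2]]  # copy live entries only
        return True


vol = Volume()
for i in range(10):
    vol.append(f"f{i}", 100)
vol.delete("f0")
vol.delete("f1")
first = vol.maybe_vacuum()   # 20% garbage: no compaction
vol.delete("f2")
vol.delete("f3")
second = vol.maybe_vacuum()  # 40% garbage: compaction runs
print(first, second, len(vol.entries), vol.garbage_ratio())
```

The compaction step copies all live data, which is why its cost and its blocking behaviour are the natural follow-up questions for an operator.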

In general the way this should be approached is: Pretend to know nothing about SeaweedFS, and imagine what a user that wants to use it in production wants to know, and what their followup questions would be.

Some parts of that are partially answered in the presentations, but it is difficult to piece together how the software currently works from presentations of different ages (maybe they are already outdated?), and the presentations are also quite light on information (usually only 1 slide per topic). I think the Github Wiki is a good way to do it, but it, too, is light on information, and I'm not sure it has everything that's in the presentations.

I understand the README already says "more tools and documentation", I just want to highlight how important the "what does it do and how does it behave" part of documentation is for software like this.

clankstar|2 years ago

We (https://hivegames.io/) use this for storing 50+ TB of multiplayer match recordings ("replays"), heavily using the built-in expiry feature. It's incredibly easy to use and to build on top of; never had an issue updating, migrating or utilizing new features.

candiddevmike|2 years ago

What do you use for the metadata store?

jug|2 years ago

This sounds like what Microsoft has tried but failed to do in numerous iterations for two decades: OFS (Cairo, unreleased predecessor to Windows 95), Storage+ (SQL Server 7.0), RFS (SQL Server 2000), Exchange Webstore, Outlook LIS, WinFS, and finally Microsoft Semantic Engine.

All projects were either cancelled, features cut, or officially left in limbo.

It's a pretty remarkable piece of Microsoft history as it has been there on the sidelines since roughly post-Windows 3.11. The reason they returned to it so often was in part because Bill Gates loved the idea of a more high level object storage that, like this, bridges the gap between files and databases.

He would probably have loved this kind of technology as part of Windows -- and indeed in 2013, he cited the failure of WinFS as his greatest disappointment at Microsoft, saying that it was ahead of its time and that it would re-emerge.

osigurdson|2 years ago

>> and indeed in 2013, he cited the failure of WinFS as his greatest disappointment at Microsoft,

Failing to capture any of the mobile handset market while missing out almost entirely on search and social media businesses would be higher on my list if I were in BG's shoes.

gorset|2 years ago

I was asking around in my network after experience with self-hosting S3-like solutions. One serious user of SeaweedFS recommended looking into min.io instead. Another serious user of min.io recommended looking into SeaweedFS instead…

jasonjayr|2 years ago

If you're looking for more recommendations, try Garage ( https://garagehq.deuxfleurs.fr/ ), which is on my short list to try in my home lab...

Already__Taken|2 years ago

It used to be if you wanted thousands of tiny files give seaweed a go, minio would suck. But minio has since had a revision so you'd have to test it out.

Seaweed has been running my k8s persistent volumes pretty admirably for like a year for about 4 devs.

seized|2 years ago

Take a look at GarageS3, it's a nice option for "just an S3 server" for self hosting.

https://garagehq.deuxfleurs.fr/

I use it for self hosting.

junon|2 years ago

Sounds like you should try both and write an article!

chaxor|2 years ago

A serious user of both suggested using iroh instead.

papaver-somnamb|2 years ago

Tried and rejected SeaweedFS due to Postgres failing to even initialize itself on a POSIX FS volume mounted over SeaweedFS' CSI driver. And that's too bad, because SeaweedFS was otherwise working well!

What we need and haven't identified yet is an SDS system that provides both fully-compliant POSIX FS and S3 volumes, is FOSS, has a production story where individuals can do all tasks competently/quickly/effectively (management, monitoring, disaster recovery incl. erasure coding and tooling), and has CSI drivers that work with Nomad.

This rules out Ceph and friends. GarageFS, also mentioned in this thread, is S3 only. We went through everything applicable on the K8S drivers list https://kubernetes-csi.github.io/docs/drivers.html except for Minio, because it claimed it needed a backing store anyways (like Ceph) although just a few days ago I encountered Minio being used standalone.

While I'm on this topic, I noticed that the CSI drivers (SeaweedFS and SDS's in general) use tremendous resources when effecting mounts, instantiating nearly a whole OS (OCI image w/ functional userland) just to mount over what appears to be NFS or something.

jamesblonde|2 years ago

You do know that you cannot implement a fully-compliant POSIX FS with only the S3 API? None of the scalable SDSs support random writes. Atomic rename (for building transactional systems like lakehouse table formats) is not there. Listing of files is often eventually consistent. The closest functional API to a POSIX-compliant one in scalable SDSs is the HDFS API. Only ADLS supports that. But then again, they are the only ones that enable you to fuse mount a directory for local FS read/write access. All of the S3 fuse mount stuff is fundamentally limited by the S3 API.

arccy|2 years ago

running something like postgres over a networked filesystem sounds very wrong

snthpy|2 years ago

What about JuiceFS?

I've never used it myself and just learned about it from this thread but it seems to fit the bill.

4by4by4|2 years ago

We tested both SeaweedFS and Min.io for cheaply (HDD) storing > 100TB of audio data.

Seaweed had much better performance for our use case.

Scaevolus|2 years ago

Do you wish it supported Erasure Coding for lower disk usage, or is your workload such that the extra spindles from replication are useful?

bomewish|2 years ago

Forgive my ignorance but why is this preferable to a big ZFS pool?

erikaww|2 years ago

Any hiccups?

Drop in S3 compatibility with much better performance would be insane

_zoltan_|2 years ago

why not ceph?

KaiserPro|2 years ago

Things to make sure of when choosing your distributed storage:

1) are you _really_ sure you need it distributed, or can you shard it yourself? (hint: distributed anything costs at least one, if not two, innovation tokens; if you're using other innovation tokens as well, you're going to have a very bad time)

2) do you need to modify blobs, or can you get away with read/modify/replace? (s3 doesn't support partial writes, one bit change requires the whole file to be re-written)

3) what's your ratio of reads to writes? (do you need local caches, or local pools in gpfs parlance?)

4) How much are you going to change the metadata? (if there's posix somewhere, it'll be a lot)

5) Are you going to try and write to the same object at the same time in two different locations (how do you manage locking and concurrency?)

6) do you care about availability, consistency or speed? (pick one, maybe one and a half)

7) how are you going to recover from the distributed storage shitting itself all at the same time?

8) how are you going to control access?
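Point 2 can be sketched as follows, assuming a store that, like the S3 behaviour described above, only reads or replaces whole objects. `ObjectStore` and `patch_object` are hypothetical names for illustration, not any real client API.

```python
class ObjectStore:
    """Toy whole-object store: objects can only be read or replaced in full."""

    def __init__(self):
        self._objects = {}
        self.bytes_written = 0  # counts every byte sent to the store

    def put(self, key, data):
        self._objects[key] = bytes(data)
        self.bytes_written += len(data)

    def get(self, key):
        return self._objects[key]


def patch_object(store, key, offset, new_bytes):
    """Emulate a partial write: read everything, modify in memory, replace everything."""
    buf = bytearray(store.get(key))
    buf[offset:offset + len(new_bytes)] = new_bytes
    store.put(key, bytes(buf))


store = ObjectStore()
store.put("blob", b"a" * 1_000_000)
patch_object(store, "blob", 10, b"Z")  # change a single byte
print(store.bytes_written)             # 2000000
```

Changing one byte of a 1 MB object writes another full 1 MB, so the answer to question 2 largely decides whether an S3-style API is workable at all.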

flemhans|2 years ago

1) only if it removes a "janitor" token of nannying the servers. Right now I just have one big server with a big 160TB ZFS pool, but it's running out.

2) No modifications, just new files and the occasional deletion request.

3) Almost just 1 write and 1 read per file, this is a backing storage for the source files, and they are cached in front.

4) Never

5) Files are written only by one other server, and there will be no parallel writes.

6) I pick consistency and as the half, availability.

7) This happened something like 15 years ago with MogileFS and thus scared us away. (Hence the single-server ZFS setup).

8) Reads are public, writes restricted to one other service that may write.

SheddingPattern|2 years ago

Sounds like you are talking from experience. Are you a storage specialist? How did you learn so much about this?

PhilippGille|2 years ago

The comments already mention several alternatives (Minio, Ceph, GarageFS). I think another one, not mentioned yet, is JuiceFS [1]. Found one comparison here [2].

[1] https://juicefs.com/en/

[2] https://dzone.com/articles/seaweedfs-vs-juicefs-in-design-an...

papaver-somnamb|2 years ago

JuiceFS isn't standalone, it requires separate backing storage for each of data [0] and metadata. So for example, JuiceFS would target SeaweedFS or GarageFS as its data store. JuiceFS can also target the local file system, but SDS use cases typically care about things like redundancy and availability of the data itself, things that JuiceFS happily delegates. JuiceFS itself can be distributed, but that's merely the control plane as I understand it.

[0] https://juicefs.com/docs/community/reference/how_to_set_up_o...

remram|2 years ago

I tried JuiceFS with AWS S3 and an (admittedly slow) self-hosted postgres instance, and it didn't work at all. I would have understood if it had been really slow, but erroring out really seems wrong for software where correctness is paramount.

mbreese|2 years ago

Does anyone know how well the Seaweed Filer works for renaming files or missing files? My use case involves writing a lot of data to temporary files that are then renamed to their final name. This is always the Achilles heel for distributed file storage, where files are put into buckets based on the file path… when you rename the path, but keep the data, lookups become more complicated.

(This is HPC work with large processing pipelines. I keep track of if the job was successful based upon if the final file exists. The rename only happens if the job was successful. It’s a great way to track pipeline status, but metadata lookups can be a pain — particularly for missing files. )
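The write-then-rename pattern described above can be sketched like this on a local filesystem. `write_result` and `job_succeeded` are hypothetical helper names; the atomicity of `os.replace` is a property of a local POSIX filesystem and is only assumed, not verified, for a SeaweedFS mount.

```python
import os
import tempfile


def write_result(final_path, payload):
    """Write to a temp file in the same directory, then rename it on success."""
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp_path, final_path)  # atomic on a local POSIX filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)           # never leave a partial result behind
        raise


def job_succeeded(final_path):
    """The final name only exists if the job ran to completion."""
    return os.path.exists(final_path)


workdir = tempfile.mkdtemp()
out = os.path.join(workdir, "sample.out")
before = job_succeeded(out)
write_result(out, b"done")
after = job_succeeded(out)
print(before, after)  # False True
```

Every status check for an unfinished job is a lookup of a path that does not exist yet, which is why missing-file lookups matter so much for this workload.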

chrislusf|2 years ago

Should not be a problem.

One similar use case used Cassandra as SeaweedFS filer store, and created thousands of files per second in a temp folder, and moved the files to a final folder. It caused a lot of tombstones for the updates in Cassandra.

Later, they changed to use Redis for the temp folder, and keep Cassandra for other folders. Everything has been very smooth since then.

pitherpather|2 years ago

I am aware of some research into operating systems with a database rather than filesystem as a base layer. If SeaweedFS serves a middle-ground between databases and filesystems, could it also suggest a middle-ground in conceiving of research operating systems??

vlovich123|2 years ago

SeaweedFS is a non hierarchical distributed key value store. It makes different tradeoffs to a filesystem which provides a hierarchical view of local only data. There’s some evidence to suggest that a hierarchical structuring of the data itself isn’t actually beneficial for modern systems. And you could design a system that used similar techniques to SeaweedFS to do a semi-distributed local store (ie locally stored data for fast access with offloading to cheap remote storage for durability / infinite extensibility). And the plain KV store will likely be faster for most operations although in practice you’ll probably only see it in micro benchmarks.

jakjak123|2 years ago

Have used SeaweedFS to store billions of thumbnails. The tooling is a bit clunky, but it mostly works. The performance is very good for small-ish objects (memory usage + latency), and latency remains consistently good into 99.9 percentiles. We had some issues with data loss and downtime, but that was mostly our own fault.

hardwaresofton|2 years ago

What issues did you run into? Not setting up replication?

DrDroop|2 years ago

Is there any reason to use something like this instead of S3 or similar products when you are not running your own infra?

throwup238|2 years ago

If you’re running it on AWS? Probably not.

Otherwise: egress costs.

lolpanda|2 years ago

For companies hosting their entire infra on AWS, what's the advantage of SeaweedFS running on a fleet of EC2 machines over storing on S3?

ddorian43|2 years ago

Nothing. AWS doesn't give you the option to rent HDDs to create your own S3 so you're locked in to use S3.

fodkodrasz|2 years ago

Hard to imagine anything.

jamesblonde|2 years ago

Nobody pointed out yet that Chris, the main developer, developed this for Roblox. My kids love Roblox - massively popular game.

sighansen|2 years ago

I don't understand why you wouldn't just use plain s3. There is no comparison in the readme and I would love to understand what the benefits are. Also I would have expected a comparison to maybe Apache Iceberg, but this might be more specialized for relational data lake data?

throw0101d|2 years ago

Advantages over Ceph?

ddorian43|2 years ago

Ceph should have 10x+ metadata overhead for chunk storage. When using erasure-coding writes are faster because it's using replication and then erasure-coding is done async for whole volumes (30GB).
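For a sense of the disk-usage side of replication versus erasure coding, a back-of-the-envelope calculation. The numbers are assumptions, not taken from this thread: 3 full copies for replication, and a 10 data + 4 parity shard layout for erasure coding.

```python
def raw_tb_replicated(logical_tb, copies):
    """Raw capacity needed when every byte is stored `copies` times."""
    return logical_tb * copies


def raw_tb_erasure_coded(logical_tb, data_shards, parity_shards):
    """Raw capacity needed when data is split into data shards plus parity shards."""
    return logical_tb * (data_shards + parity_shards) / data_shards


logical = 100  # TB of user data
replicated = raw_tb_replicated(logical, copies=3)
erasure_coded = raw_tb_erasure_coded(logical, data_shards=10, parity_shards=4)
print(replicated, erasure_coded)  # 300 140.0
```

Under these assumptions erasure coding needs less than half the raw capacity, at the price of reconstruction work when a shard is lost.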

thrusong|2 years ago

I'm a small user— only about 250,000 objects in storage and a lot of those cold storage behind Cloudflare, but I've been using SeaweedFS for years.

I think since v0.7— I was always intrigued by Facebook's Haystack.

SeaweedFS has been super reliable, efficient, and trouble free.

jdthedisciple|2 years ago

Sounds great!

Now I only need to wait 10 years until all the hidden but crucial bugs are found (at the massive loss of real data, ofc) before I'm ready to use it,

like with every new piece of technology...

Or what should give me the confidence that it isn't so?

stevekemp|2 years ago

This is an old project, I had a quick look and see that I submitted a pull-request back in 2015:

https://github.com/seaweedfs/seaweedfs/pull/187

unknown|2 years ago

[deleted]

_3u10|2 years ago

What’s the difference between files, objects, blobs and data lake?

killingtime74|2 years ago

Each one is a pay increase for the administrator and vendor.

fefferkorn|2 years ago

@chrislusf, I use btrfs with lz4 compression and beesd for deduplication. Does Seaweed support chunking in a way that lets deduplication happen?

monlockandkey|2 years ago

What would be the best S3-like storage software with user-based access and limits that I can host locally?

fuddle|2 years ago

Is it compatible with OpenStack?

feliciegerald|2 years ago

[deleted]

123 comments