A highly available, durable filesystem is a difficult problem to solve. It usually starts with NFS, which is one big single point of failure. Depending on the nature of the application this might be good enough.
But if it's not, you'll typically want cross-datacenter replication so that if one rack goes down you don't lose all your data. So then you're looking at something like GlusterFS/MooseFS/Ceph. But the latencies involved in synchronously replicating to multiple datacenters can really kill your performance. For example, try git cloning a large project onto a GlusterFS mount with >20ms ping between nodes. It's brutal.
Other products try to do asynchronous replication; EdgeFS is one I was looking at recently. This follows the general industry trend, like it or not, of "eventually consistent is consistent enough". Not much better than a cron job + rsync, in my opinion, but for some workloads it's good enough. If there's a partition, you'll lose data.
Eventually you just give up and realize that a perfectly synchronized, geographically distributed POSIX filesystem is a pipe dream; you bite the bullet, rewrite your app against S3, and call it a day.
Reading through your article, this solution is built on top of S3. So moving and listing files is faster, presumably due to a new metadata system you've built for tracking files. The trade-off here is that writes must be strictly slower than they were previously, because you've added a network hop: all read and write data now flows through these workers. That also adds a point of failure - if you stream too much data through these workers, you could potentially OOM them. Reads are potentially faster, but that depends on the hit rate of the block cache. It would be nice to see a more transparent post listing the pros and cons, rather than what reads as a technical advertisement.
The linked post says "has the same cost as S3", yet on the linked pricing page there's only a "Contact Us" Enterprise plan besides a free one. Am I missing something?
Diagram seems to imply an active/passive namenode setup like HDFS. Doesn't that limit it to tiny filesystems, and curse it with the same availability problems that plague HDFS?
We see similar results using JuiceFS [1], and we're glad to see another player moving in the same direction. JuiceFS has been available to the public since 2017 and runs on most public cloud vendors.
The architecture of JuiceFS is simpler than HopsFS: it has no worker nodes (the client accesses S3 and manages the cache directly), and the metadata is stored in a highly optimized server (similar to the NN in HDFS, but it can be scaled out by adding more nodes).
JuiceFS provides a POSIX client (using FUSE), a Hadoop SDK (in Java), an S3 gateway, and a WebDAV gateway.
[1]: https://juicefs.com/docs/en/metadata_performance_comparison....
Disclaimer: Founder of JuiceFS here
Systems like Hadoop and Spark will run multiple versions of the same task writing out to a FS at once, but to a temporary directory; when the first finishes, the output data is just moved to the final place. It's not uncommon for a job to "complete" writing data to S3 and then just sit there and hang as the FS move commands run, copying the data and deleting the old version. Some systems simply assume a rename/move is a no-op.
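For context on why that move step is so slow: S3 has no rename primitive, so connectors typically emulate "move" as a server-side copy followed by a delete of the original. A minimal sketch of that pattern in boto3 (the function name, bucket, and keys are hypothetical; objects over 5 GB would additionally need a multipart `upload_part_copy` flow):

```python
def s3_rename(client, bucket, src_key, dst_key):
    """Emulate a rename on S3: server-side copy, then delete the source.

    Unlike a POSIX rename, this is O(object size) on the server side,
    which is why committing job output by "moving" it can hang for a
    long time on large datasets.
    """
    client.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
    )
    client.delete_object(Bucket=bucket, Key=src_key)


# Usage (assuming a configured boto3 client):
# s3_rename(boto3.client("s3"), "my-bucket", "tmp/part-0000", "out/part-0000")
```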
Also, I presume that "move" means "update the index, held in another object or objects" (no slight! If I understand correctly, that's what most on-disk filesystems do as well). That's the only way I can imagine getting performance that much better.
We do this in our ETL jobs on several hundred thousand files a day. Not a reason to switch to a different system, but there are definitely non-maintenance use cases for this.
From the title, I thought this was a technology that gives you 100x faster static sites compared to S3, which sounded really cool. Discovering that it was about move operations was really underwhelming. The title is on the edge of being misleading.
Dude, there’s literally no rename/move in the official S3 API; you’re going to need to do a PUT object operation, which is only going to be as fast as a PUT operation is. If you judge a fish by its ability to fly, it will feel stupid.
I was gonna ask how it compares with MinIO. MinIO looks incredibly easy to "deploy" (single executable!) [1]. Looks nice for hosting files on a single machine if there's no immediate need for the industrial-strength features of AWS S3, or even for running locally during development.
[1] https://min.io/download
We haven't run mv/list experiments on 100,000 and 1,000,000 files for the blog. However, we expect the latency gap between HopsFS/S3 and EMRFS to widen even further with larger directories. In our original HopsFS paper, we showed that the latency of mv/rename on a directory with 1 million files was around 5.8 seconds.
I already have VMs for my data scientists in GCP, and my datasets live in Google Cloud Storage. Can I take advantage of HopsFS for a shared file system across my VMs? Google Filestore is ridiculously expensive, and the minimum capacity they sell is 1TB. Multi-writer only supports 2 VMs.
If Filestore (our managed NFS product) is too large for you, I'd suggest having gcsfuse on each box (or just use the GCS Connector for Hadoop). You won't get the kind of NFS-style locking semantics that a real distributed filesystem would support, but it sounds like you mostly need your data scientists to be able to read and write from buckets as if they're a local filesystem (and you wouldn't expect them to actively be overwriting each other or something where caching gets in the way).
Edit: We used gcsfuse for our Supercomputing 2016 run with the folks at Fermilab. There wasn't time to rewrite anything (we went from idea => conference in a few weeks) and since we mostly just cared about throughput it worked great.
Reading/writing SMALL files are SUPER slow on things like S3, Google Drive, Backblaze. Also, using a lot of threads only helps a little bit, but it's nowhere near reading/writing speeds of e.g. a single 600MB file.
> it's nowhere near reading/writing speeds of e.g. a single 600MB file.
I have to disagree here. If you look at benchmarks on the internet, yes, it will look like S3 is dead slow. But that is a client problem, not an S3 problem. For instance, boto3 (s3transfer) is an awful implementation, so overengineered with a reimplementation of futures, task stealing, etc., that the download throughput is pathetic. Most often it will make you top out below 200MB/s.
But S3 itself scales very well if you know how to use it, and skip boto3.
From my experience and benchmarks, each S3 connection will deliver up to 80MB/s, and with range requests you can easily have multiple parallel blocks downloaded.
I wrote a simple library that does this called s3pd (https://github.com/NewbiZ/s3pd). It's not perfect and is process-based instead of coroutine-based, but it will give you an idea.
For reference, using s3pd I can saturate any EC2 network interface I've tried (tested up to 10Gb/s interfaces, with download speeds >1GB/s).
Boto really gives S3 bad press.
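To illustrate the range-request approach described above (this is not s3pd itself, just a minimal thread-based sketch along the same lines; bucket/key names and chunk size are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, chunk=8 * 1024 * 1024):
    """Split an object of `size` bytes into inclusive HTTP byte ranges."""
    return [(start, min(start + chunk, size) - 1)
            for start in range(0, size, chunk)]


def parallel_download(client, bucket, key, workers=16):
    """Fetch an S3 object as parallel ranged GETs and reassemble it.

    Each connection tops out at some per-stream throughput, so issuing
    many ranged GETs in parallel is how you approach NIC line rate.
    """
    size = client.head_object(Bucket=bucket, Key=key)["ContentLength"]

    def fetch(rng):
        start, end = rng
        resp = client.get_object(
            Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
        return resp["Body"].read()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the chunks concatenate correctly
        return b"".join(pool.map(fetch, split_ranges(size)))
```

With a real boto3 client this would be `parallel_download(boto3.client("s3"), "my-bucket", "big-file.bin")`.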
It's not surprising that it's slow. Handling files has two costs associated with it: one is a fixed setup cost, which includes looking up the file metadata, creating connections between all the services that make it possible to access the file, and starting the read/write. The other part is actually transferring the file content.
For small files the fixed cost will be the most important factor. The "transfer time" after this "first byte latency" might actually be 0, since all the data could be transferred within a single write call on each node.
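A back-of-the-envelope model of that point (the latency and bandwidth figures here are made-up illustrative defaults, not measured S3 numbers):

```python
def transfer_time(size_bytes, first_byte_latency_s=0.05, bandwidth_bps=500e6):
    """Rough model: fixed per-request cost plus streaming time."""
    return first_byte_latency_s + size_bytes / bandwidth_bps


# For a 4 KB object, essentially all of the time is the fixed cost;
# for a 600 MB object, the fixed cost is noise.
small = transfer_time(4 * 1024)            # ~0.050008 s, >99.9% latency
large = transfer_time(600 * 1024 * 1024)   # ~1.31 s, <4% latency
```

This is why batching many small objects into fewer large ones (or pipelining requests) matters far more than raw bandwidth for small-file workloads.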
Here are some strategies to make S3 go faster: https://news.ycombinator.com/item?id=19475726
You're right that S3 isn't for small files, but for a lot of small files (think 500 bytes), I either funnel them into S3 through Kinesis Firehose or fit them gzipped into DynamoDB (depending on access patterns).
One could also consider using Amazon FSx, which can do I/O to S3 natively.
> Reading/writing SMALL files are SUPER slow on things like S3
My experience was the opposite. Small files work acceptably quickly, the cost can't be beat, and the reliability probably beats out anything else I've seen. We saw latencies of ~50ms to write out small files. Slower than an SSD, yes, but not really that slow.
You do have to make sure you're not doing things like re-establishing the connection to get that, though. If you have to do a TLS handshake… it won't be fast. Also, if you're in Python, boto's API encourages, IMO, people to call `get_bucket`, which does an existence check on the bucket (and thus doubles the latency due to the extra round trip); usually, you have out-of-band knowledge that the bucket exists and can skip the check. (If you're wrong, the subsequent object GET/PUT will error anyway.)
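A hypothetical sketch of that advice: create one client up front (so the connection pool and TLS session are reused) and write objects directly, with no preliminary bucket-existence check. The function and names here are illustrative, not an established API:

```python
def write_many(client, bucket, items):
    """Write many small objects over one reused client.

    Deliberately no head_bucket/get_bucket check first: if the bucket
    doesn't exist, put_object itself will raise, so the extra
    round-trip per write buys nothing.
    """
    for key, body in items:
        client.put_object(Bucket=bucket, Key=key, Body=body)


# client = boto3.client("s3")   # create once, reuse everywhere
# write_many(client, "my-bucket", [("k1", b"a"), ("k2", b"b")])
```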
Does it matter? I often see "our product is much faster than thing X at cloud Y!" and find myself asking why. Why would I want something less integrated, for a performance gain I don't need, a software change I'll have to write, and the extra overhead of dealing with another source?
It's great that one individual thing is better than one other individual thing, but if you look at the bigger picture it generally isn't that individual thing by itself that you are using.
Ok so you're not the target customer for this product. Do you believe people with this problem should be deprived of a solution just because you don't need it?
It does matter, because not every use case requires everything but the kitchen sink. If you're building things that only ever require the same mass produced components for every application, well that stifles the possibilities of what can be built.
Do you plan on having a Kubernetes storage provider? I like the idea of presenting a POSIX-like mount to a container while paying per-use for storage in S3.
Cool work! I love seeing people pushing distributed storage.
IIUC though, you make a similar choice as Avere and others. You're treating the object store as a distributed block store [1]:
> In HopsFS-S3, we added configuration parameters to allow users to provide their Amazon S3 bucket to be used as the block data store. Similar to HopsFS, HopsFSS3 stores the small files, < 128 KB, associated with the file system’s metadata. For large files, > 128 KB, HopsFS-S3 will store the files in the user-provided bucket.
...
> HopsFSS3 implements variable-sized block storage to allow for any new appends to a file to be treated as new objects rather than overwriting existing objects
It's somewhat unclear to me, but I think the combination of these statements means "S3 is always treated as a block store, but sometimes the File == Variably-Sized-Block == Object". Is that right?
Using S3 / GCS / any object store as a block store with a different frontend is a fine assumption for dedicated clients or applications like HDFS-based ones. But it does mean you throw away interop with other services. For example, if your HDFS-speaking data pipeline produces a bunch of output and you want to read it via some tool that only speaks S3 (like something in SageMaker or whatever), you're kind of trapped.
It sounds like you're already prepared to support variably-sized chunks / blocks, so I'd encourage you to have a "transparent mode". So many users love things like s3fs, gcsfuse and so on, because even if they're slow, they preserve interop. That's why we haven't gone the "blocks" route in the GCS Connector for Hadoop, interop is too valuable.
P.S. I'd love to see which things get easier for you if you are also able to use GCS directly (or at least know you're relying on our stronger semantics). A while back we finally ripped out all the consistency cache stuff in the Hadoop Connector once we'd rolled out the Megastore => Spanner migration [2]. Being able to use Dual-Region buckets that are metadata consistent while actively running Hadoop workloads in two regions is kind of awesome.
[1] https://content.logicalclocks.com/hubfs/HopsFS-S3%20Extendin...
[2] https://cloud.google.com/blog/products/gcp/how-google-cloud-...
It's conceptually similar to EMR in the way it works. You connect your AWS account and we'll deploy a cluster there. HopsFS will run on top of an S3 bucket in your organization.
You get a fully featured Spark environment (with metrics and logging included - no need for CloudWatch), a UI with Jupyter notebooks, the Hopsworks feature store, and ML capabilities that EMR does not provide.
Yeah, https://hopsworks.ai is an alternative to EMR that includes HopsFS out-of-the-box. You get $4k free credits right now - it includes Spark, Flink, Python, Hive, Feature Store, GPUs (TensorFlow), Jupyter, notebooks as jobs, Airflow. And a UI for development and operations. So kind of like Databricks with more of an ML focus, but still supporting Spark.
You could but I would do thorough real-world benchmarks.
EMRFS has dozens of optimisations for Spark/Hadoop workloads, e.g. S3 Select, partition pruning, optimised committers, etc., and since EMR is a core product it is continually being improved. Using HopsFS would negate all of that.
geertj | 5 years ago
NFS is just the protocol. Whether it's a single point of failure depends on the server-side implementation. In Amazon EFS it is not.
(disclaimer: I'm a PM-T on the EFS team)
fh973 | 5 years ago
Isn't rename in S3 effectively a copy-delete operation?
rsync | 5 years ago
What is the access protocol? What tools am I using to access the POSIX presentation?
hartator | 5 years ago
I don’t see how it can be useful. Moving or renaming files in S3 seems more like maintenance than something you want to do on a regular basis.
Aperocky | 5 years ago
There's no 'renaming' of any file in S3. I don't think AWS had S3 in mind as a file system. There's EBS for that.
threeseed | 5 years ago
Would've been more interesting had it taken advantage of newer technologies such as io_uring, NVMe over Fabrics, RDMA, etc.
therealmarv | 5 years ago
Is HopsFS helping in this area?
chordysson | 5 years ago
https://kth.diva-portal.org/smash/record.jsf?pid=diva2:12608...
mwcampbell | 5 years ago
ObjectiveFS has been around for several years. How is HopsFS better?
daviesliu | 5 years ago
There is a blog post [1] talking about this use case. Unfortunately, we have not published an English version; you can read it using Google Translate.
[1] https://juicefs.com/blog/cn/posts/globalegrow-big-data-platf...
Disclaimer: Founder of JuiceFS here.