A highly available, durable filesystem is a difficult problem to solve. It usually starts with NFS, which is one big single point of failure. Depending on the nature of the application this might be good enough.
But if it's not, you'll typically want cross-datacenter replication so that if one rack goes down you don't lose all your data. So then you're looking at something like GlusterFS/MooseFS/Ceph. But the latencies involved in synchronously replicating to multiple datacenters can really kill your performance. For example, try git cloning a large project onto a GlusterFS mount with >20ms ping between nodes. It's brutal.
Other products try to do asynchronous replication; EdgeFS is one I was looking at recently. This follows the general industry trend, like it or not, of "eventually consistent is consistent enough". Not much better than a cron job + rsync, in my opinion, but for some workloads it's good enough. If there's a partition, you'll lose data.
Eventually you just give up and realize that a perfectly synchronized, geographically distributed POSIX filesystem is a pipe dream; you bite the bullet, rewrite your app against S3, and call it a day.
Reading through your article, this solution is built on top of S3. So moving and listing files is faster, presumably due to a new metadata system you've built for tracking files. The trade-off here is that writes must be strictly slower than they were previously, because you've added a network hop: all read and write data now flows through these workers. That also adds a point of failure - if you stream too much data through these workers, you could potentially OOM them. Reads are potentially faster, but that depends on the hit rate of the block cache. It would be nice to see a more transparent post listing the pros and cons, rather than what reads as a technical advertisement.
The linked post says "has the same cost as S3", yet on the linked pricing page there's only a "Contact Us" Enterprise plan besides a free one. Am I missing something?
Diagram seems to imply an active/passive namenode setup like HDFS. Doesn't that limit it to tiny filesystems, and curse it with the same availability problems that plague HDFS?
We see similar results using JuiceFS [1], and we're glad to see another player moving in the same direction. JuiceFS has been available to the public since 2017 and runs on most public cloud vendors.
The architecture of JuiceFS is simpler than HopsFS: it has no worker nodes (the client accesses S3 and manages the cache directly), and the metadata is stored in a highly optimized server (similar to the NN in HDFS, but it can be scaled out by adding more nodes).
JuiceFS provides a POSIX client (using FUSE), a Hadoop SDK (in Java), an S3 gateway, and a WebDAV gateway.
[1]: https://juicefs.com/docs/en/metadata_performance_comparison....
Disclaimer: Founder of JuiceFS here
Systems like Hadoop and Spark will run multiple versions of the same task writing out to a FS at once, but to a temporary directory; when the first finishes, the output data is just moved to the final place. It's not uncommon for a job to "complete" writing data to S3 and then just sit there and hang as the FS move commands run, copying the data and deleting the old version. Some systems simply assume a rename/move is a no-op.
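For context on why that move step is so slow: S3 has no rename primitive, so connectors typically emulate "move" as a server-side copy followed by a delete of the original. A minimal sketch of that pattern in boto3 (the function name, bucket, and keys are hypothetical; objects over 5 GB would additionally need a multipart `upload_part_copy` flow):

```python
def s3_rename(client, bucket, src_key, dst_key):
    """Emulate a rename on S3: server-side copy, then delete the source.

    Unlike a POSIX rename, this is O(object size) on the server side,
    which is why committing job output by "moving" it can hang for a
    long time on large datasets.
    """
    client.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
    )
    client.delete_object(Bucket=bucket, Key=src_key)


# Usage (assuming a configured boto3 client):
# s3_rename(boto3.client("s3"), "my-bucket", "tmp/part-0000", "out/part-0000")
```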
Also, I presume that "move" means "update the index, held in another object or objects" (no slight! If I understand correctly, that's what most on-disk filesystems do as well). That's the only way I can imagine getting performance that much better.
We do this in our ETL jobs on several hundred thousand files a day. Not a reason to switch to a different system, but there are definitely non-maintenance use cases for this.
From the title, I thought this was a technology that gives you 100x faster static sites compared to S3, which sounded really cool. Discovering that it was about move operations was really underwhelming. The title is on the edge of being misleading.
Dude, there’s literally no rename/move in the official S3 API; you’re going to need to do a PUT object operation, which is only going to be as fast as a PUT operation is. If you judge a fish by its ability to fly, it will feel stupid.
I was gonna ask how it compares with MinIO. MinIO looks incredibly easy to "deploy" (single executable!) [1]. Looks nice for hosting files on a single machine if there's no immediate need for the industrial-strength features of AWS S3, or even for running locally during development.
[1] https://min.io/download
We haven't run mv/list experiments on 100,000 and 1,000,000 files for the blog. However, we expect the latency gap between HopsFS/S3 and EMRFS to widen even further with larger directories. In our original HopsFS paper, we showed that the latency of mv/rename on a directory with 1 million files was around 5.8 seconds.
I already have VMs for my data scientists in GCP, and my datasets live in Google Cloud Storage. Can I take advantage of HopsFS for a shared file system across my VMs? Google Filestore is ridiculously expensive, and the minimum capacity they sell is 1TB. Multi-writer only supports 2 VMs.
If Filestore (our managed NFS product) is too large for you, I'd suggest having gcsfuse on each box (or just use the GCS Connector for Hadoop). You won't get the kind of NFS-style locking semantics that a real distributed filesystem would support, but it sounds like you mostly need your data scientists to be able to read and write from buckets as if they're a local filesystem (and you wouldn't expect them to actively be overwriting each other or something where caching gets in the way).
Edit: We used gcsfuse for our Supercomputing 2016 run with the folks at Fermilab. There wasn't time to rewrite anything (we went from idea => conference in a few weeks) and since we mostly just cared about throughput it worked great.
Reading/writing SMALL files are SUPER slow on things like S3, Google Drive, Backblaze. Also, using a lot of threads only helps a little bit, but it's nowhere near reading/writing speeds of e.g. a single 600MB file.
> it's nowhere near reading/writing speeds of e.g. a single 600MB file.
I have to disagree here. If you look at benchmarks on the internet, yes, it will look like S3 is dead slow. But that is a client problem, not an S3 problem. For instance, boto3 (s3transfer) is an awful implementation, so overengineered with a reimplementation of futures, task stealing, etc., that the download throughput is pathetic. Most often it will make you top out below 200MB/s.
But S3 itself scales very well if you know how to use it, and skip boto3.
From my experience and benchmarks, each S3 connection will deliver up to 80MB/s, and with range requests you can easily have multiple parallel blocks downloaded.
I wrote a simple library that does this called s3pd (https://github.com/NewbiZ/s3pd). It's not perfect and is process-based instead of coroutine-based, but it will give you an idea.
For reference, using s3pd I can saturate any EC2 network interface I've tried (tested up to 10Gb/s interfaces, with download speeds >1GB/s).
Boto really gives S3 bad press.
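To illustrate the range-request approach described above (this is not s3pd itself, just a minimal thread-based sketch along the same lines; bucket/key names and chunk size are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, chunk=8 * 1024 * 1024):
    """Split an object of `size` bytes into inclusive HTTP byte ranges."""
    return [(start, min(start + chunk, size) - 1)
            for start in range(0, size, chunk)]


def parallel_download(client, bucket, key, workers=16):
    """Fetch an S3 object as parallel ranged GETs and reassemble it.

    Each connection tops out at some per-stream throughput, so issuing
    many ranged GETs in parallel is how you approach NIC line rate.
    """
    size = client.head_object(Bucket=bucket, Key=key)["ContentLength"]

    def fetch(rng):
        start, end = rng
        resp = client.get_object(
            Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
        return resp["Body"].read()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the chunks concatenate correctly
        return b"".join(pool.map(fetch, split_ranges(size)))
```

With a real boto3 client this would be `parallel_download(boto3.client("s3"), "my-bucket", "big-file.bin")`.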
It's not surprising that it's slow. Handling files has two costs associated with it: one is a fixed setup cost, which includes looking up the file metadata, creating connections between all the services that make it possible to access the file, and starting the read/write. The other part is actually transferring the file content.
For small files the fixed cost will be the most important factor. The "transfer time" after this "first byte latency" might actually be 0, since all the data could be transferred within a single write call on each node.
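A back-of-the-envelope model of that point (the latency and bandwidth figures here are made-up illustrative defaults, not measured S3 numbers):

```python
def transfer_time(size_bytes, first_byte_latency_s=0.05, bandwidth_bps=500e6):
    """Rough model: fixed per-request cost plus streaming time."""
    return first_byte_latency_s + size_bytes / bandwidth_bps


# For a 4 KB object, essentially all of the time is the fixed cost;
# for a 600 MB object, the fixed cost is noise.
small = transfer_time(4 * 1024)            # ~0.050008 s, >99.9% latency
large = transfer_time(600 * 1024 * 1024)   # ~1.31 s, <4% latency
```

This is why batching many small objects into fewer large ones (or pipelining requests) matters far more than raw bandwidth for small-file workloads.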
Here are some strategies to make S3 go faster: https://news.ycombinator.com/item?id=19475726
You're right that S3 isn't for small files, but for a lot of small files (think 500 bytes), I either funnel them into S3 through Kinesis Firehose or fit them gzipped into DynamoDB (depending on access patterns).
One could also consider using Amazon FSx, which can do I/O to S3 natively.
> Reading/writing SMALL files are SUPER slow on things like S3
My experience was the opposite. Small files work acceptably quickly, the cost can't be beat, and the reliability probably beats out anything else I've seen. We saw latencies of ~50ms to write out small files. Slower than an SSD, yes, but not really that slow.
You do have to make sure you're not doing things like re-establishing the connection to get that, though. If you have to do a TLS handshake… it won't be fast. Also, if you're in Python, boto's API encourages, IMO, people to call `get_bucket`, which does an existence check on the bucket (and thus doubles the latency due to the extra round trip); usually, you have out-of-band knowledge that the bucket exists and can skip the check. (If you're wrong, the subsequent object GET/PUT will error anyway.)
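A hypothetical sketch of that advice: create one client up front (so the connection pool and TLS session are reused) and write objects directly, with no preliminary bucket-existence check. The function and names here are illustrative, not an established API:

```python
def write_many(client, bucket, items):
    """Write many small objects over one reused client.

    Deliberately no head_bucket/get_bucket check first: if the bucket
    doesn't exist, put_object itself will raise, so the extra
    round-trip per write buys nothing.
    """
    for key, body in items:
        client.put_object(Bucket=bucket, Key=key, Body=body)


# client = boto3.client("s3")   # create once, reuse everywhere
# write_many(client, "my-bucket", [("k1", b"a"), ("k2", b"b")])
```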
Does it matter? I often see "our product is much faster than thing X at cloud Y!" and find myself asking why. Why would I want something less integrated, for a performance gain I don't need, a software change I'll have to write, and the extra overhead of dealing with another source?
It's great that one individual thing is better than one other individual thing, but if you look at the bigger picture it generally isn't that individual thing by itself that you are using.
Ok so you're not the target customer for this product. Do you believe people with this problem should be deprived of a solution just because you don't need it?
It does matter, because not every use case requires everything but the kitchen sink. If you're building things that only ever require the same mass produced components for every application, well that stifles the possibilities of what can be built.
Do you plan on having a Kubernetes storage provider? I like the idea of presenting a POSIX-like mount to a container while paying per-use for storage in S3.
Cool work! I love seeing people pushing distributed storage.
IIUC though, you make a similar choice as Avere and others. You're treating the object store as a distributed block store [1]:
> In HopsFS-S3, we added configuration parameters to allow users to provide their Amazon S3 bucket to be used as the block data store. Similar to HopsFS, HopsFSS3 stores the small files, < 128 KB, associated with the file system’s metadata. For large files, > 128 KB, HopsFS-S3 will store the files in the user-provided bucket.
...
> HopsFSS3 implements variable-sized block storage to allow for any new appends to a file to be treated as new objects rather than overwriting existing objects
It's somewhat unclear to me, but I think the combination of these statements means "S3 is always treated as a block store, but sometimes the File == Variably-Sized-Block == Object". Is that right?
Using S3 / GCS / any object store as a block store with a different frontend is a fine assumption for dedicated clients or applications like HDFS-based ones. But it does mean you throw away interop with other services. For example, if your HDFS-speaking data pipeline produces a bunch of output and you want to read it via some tool that only speaks S3 (like something in SageMaker or whatever), you're kind of trapped.
It sounds like you're already prepared to support variably-sized chunks / blocks, so I'd encourage you to have a "transparent mode". So many users love things like s3fs, gcsfuse and so on, because even if they're slow, they preserve interop. That's why we haven't gone the "blocks" route in the GCS Connector for Hadoop, interop is too valuable.
P.S. I'd love to see which things get easier for you if you are also able to use GCS directly (or at least know you're relying on our stronger semantics). A while back we finally ripped out all the consistency cache stuff in the Hadoop Connector once we'd rolled out the Megastore => Spanner migration [2]. Being able to use Dual-Region buckets that are metadata consistent while actively running Hadoop workloads in two regions is kind of awesome.
[1] https://content.logicalclocks.com/hubfs/HopsFS-S3%20Extendin...
[2] https://cloud.google.com/blog/products/gcp/how-google-cloud-...
It's conceptually similar to EMR in the way it works. You connect your AWS account and we'll deploy a cluster there. HopsFS will run on top of an S3 bucket in your organization.
You get a fully featured Spark environment (with metrics and logging included - no need for CloudWatch), a UI with Jupyter notebooks, the Hopsworks feature store, and ML capabilities that EMR does not provide.
Yeah, https://hopsworks.ai is an alternative to EMR that includes HopsFS out-of-the-box. You get $4k free credits right now - it includes Spark, Flink, Python, Hive, Feature Store, GPUs (TensorFlow), Jupyter, notebooks as jobs, Airflow. And a UI for development and operations. So kind of like Databricks with more of an ML focus, but still supporting Spark.
You could but I would do thorough real-world benchmarks.
EMRFS has dozens of optimisations for Spark/Hadoop workloads, e.g. S3 Select, partition pruning, optimised committers, etc., and since EMR is a core product it is continually being improved. Using HopsFS would negate all of that.
geertj | 5 years ago
NFS is just the protocol. Whether it's a single point of failure depends on the server-side implementation. In Amazon EFS it is not.
(disclaimer: I'm a PM-T on the EFS team)
fh973 | 5 years ago
Isn't rename in S3 effectively a copy-delete operation?
rsync | 5 years ago
What is the access protocol? What tools am I using to access the POSIX presentation?
hartator | 5 years ago
I don’t see how it can be useful. Moving or renaming files in S3 seems more like maintenance than something you want to do on a regular basis.
Aperocky | 5 years ago
There's no 'renaming' of any file in S3. I don't think AWS had S3 in mind as a file system. There's EBS for that.
threeseed | 5 years ago
Would've been more interesting had it taken advantage of newer technologies such as io_uring, NVMe over Fabrics, RDMA, etc.
therealmarv | 5 years ago
Is HopsFS helping in this area?
chordysson | 5 years ago
https://kth.diva-portal.org/smash/record.jsf?pid=diva2:12608...
mwcampbell | 5 years ago
ObjectiveFS has been around for several years. How is HopsFS better?
daviesliu | 5 years ago
There is a blog post [1] talking about this use case. Unfortunately, we have not published an English version; you can read it using Google Translate.
[1] https://juicefs.com/blog/cn/posts/globalegrow-big-data-platf...
Disclaimer: Founder of JuiceFS here.