(no title)
Sirupsen | 2 years ago
Even with the cache, the cold query latency lower-bound to S3 is subject to ~50ms roundtrips [0]. To build a performant system, you have to tightly control roundtrips. S3 Express changes that equation dramatically, as S3 Express approaches HDD random read speeds (single-digit ms), so we can build production systems that don't need an SSD cache—just the zero-copy, deserialized in-memory cache.
Many systems will probably continue to have an SSD cache (~100 us random reads), but now MVPs can be built without it, and cold query latency goes down dramatically. That's a big deal
We're currently building a vector database on top of object storage, so this is extremely timely for us... I hope GCS ships this ASAP. [1]
[0]: https://github.com/sirupsen/napkin-math [1]: https://turbopuffer.com/
jamesblonde|2 years ago
NVMe is what is changing the equation, not SSD. NVMe disks now have up to 8 GB/s, although the crap in the cloud providers barely goes to 2 GB/s - and only for expensive instances. So, instead of 40X better throughput than S3, we can get like 10X. Right now, these workloads are much better on-premises on the cheapest m.2 NVMe disks ($200 for 4TB with 4 GB/s read/write) backed by a S3 object store like Scality.
[0] https://www.hopsworks.ai/post/faster-than-aws-s3
dekhn|2 years ago
The comment you reply to is talking mostly about latency - reporting that S3 object get latencies (time to open the object and return its head) in the single-digits ms, where S3 was 50ms before.
BTW EBS can do 4GB/sec per volume. But you will pay for it.
kernelsanderz|2 years ago
seasily|2 years ago
This instantly makes a number of applications able to run directly on S3, sans any caching system.
__turbobrew__|2 years ago
Take it a step further, gdal supports s3 raster data sources out of the box for a while now. Any gdal powered system may be able to operate on s3 files as if they are local.