top | item 22433440

(no title)

MichaelSalib | 6 years ago

Ah, thanks for these! But I see nothing has changed.

* pyfive is interesting but immature and doesn't seem to have any cloud bucket support
* h5s3 is an abandoned experiment that hasn't been touched in two years
* h5py is fine but again, no cloud support
* kita is a commercial offering from the HDF Group and -- I cannot stress this enough -- these people are shockingly incompetent; plus when I last looked at their system architecture diagram I thought it was a joke (well, I thought it was an intentional joke)

Efficient access to scientific datasets hosted on S3/GCP is a full-blown crisis in the scientific computing community. People aren't switching to zarr for the fun of it, but because zarr is here today, isn't a joke, and is actually open.


hcrisp|6 years ago

It's been a while since I worked on it, but I did get pyfive to read from S3 objects in two ways: either by wrapping a BytesIO around the entire object read into memory, or by using a custom class that implemented peek, seek, etc. against the S3 object. The first method was better if you needed to read most of a large file; the second was better for a small subset of it. Note that pyfive supports reading only, not writing. Later I heard that I wouldn't have needed pyfive at all, since h5py now supports file-like objects. So your comments about no cloud bucket support are not exactly true.
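A minimal sketch of the second approach, not the code I actually used: a read-only, seekable file-like object that fetches only the byte range each read asks for. The `RangeReader` class and the `fetch_range(start, end)` callable are hypothetical names; here the callable slices an in-memory blob, and in real use it would be replaced by whatever issues a ranged GET against the bucket.

```python
import io


class RangeReader(io.RawIOBase):
    """Read-only, seekable file-like object backed by a range-fetch callable."""

    def __init__(self, fetch_range, size):
        super().__init__()
        # fetch_range(start, end) -> bytes for [start, end); hypothetical hook
        # standing in for a ranged GET against an object store.
        self._fetch = fetch_range
        self._size = size
        self._pos = 0
        self.requests = 0  # number of range fetches issued so far

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError("invalid whence")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return pos

    def readinto(self, buf):
        # Fetch only the bytes requested, clipped to the end of the object.
        end = min(self._pos + len(buf), self._size)
        if end <= self._pos:
            return 0
        data = self._fetch(self._pos, end)
        self.requests += 1
        n = len(data)
        buf[:n] = data
        self._pos += n
        return n


# Stand-in for a remote object: an in-memory blob and a slicing fetcher.
blob = bytes(range(256)) * 4
remote = RangeReader(lambda start, end: blob[start:end], len(blob))

# The first approach, for comparison: the whole object held in memory.
in_memory = io.BytesIO(blob)
```

Either object can be handed to a reader that accepts file-like objects. The trade-off is the one described above: `in_memory` pays for one full download, while `remote` pays one fetch per read.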

MichaelSalib|6 years ago

To be clear, our experience using gcsfuse and friends to do basically the same things was extremely painful and a performance nightmare. The HDF format was designed for a world where seeks are free. On cloud storage every seek can turn into a network round trip, which makes access very high latency and very low throughput.
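A rough back-of-the-envelope model of why that hurts. The numbers are illustrative assumptions, not measurements: 50 ms per request, 100 MB/s of bandwidth, and 1000 small reads of 4 KiB each. The function name is made up for the example.

```python
def estimated_read_seconds(n_requests, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Each request pays one round trip; the payload pays for bandwidth."""
    return n_requests * latency_s + total_bytes / bandwidth_bytes_per_s


# Assumed figures, for illustration only.
LATENCY_S = 0.05          # 50 ms per request
BANDWIDTH = 100_000_000   # 100 MB/s
TOTAL_BYTES = 1000 * 4096

# 1000 scattered 4 KiB reads, one request each.
many_small = estimated_read_seconds(1000, TOTAL_BYTES, LATENCY_S, BANDWIDTH)

# The same bytes fetched in a single contiguous request.
one_large = estimated_read_seconds(1, TOTAL_BYTES, LATENCY_S, BANDWIDTH)
```

Under these assumptions the round trips dominate completely: the time is almost entirely latency, and the bytes transferred barely register.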