The generalized form of this range-request-based streaming approach looks something like my project VirtualiZarr [0].
Many of these scientific file formats (HDF5, netCDF, TIFF/COG, FITS, GRIB, JPEG and more) are essentially just contiguous multidimensional array(/"tensor") chunks embedded alongside metadata about what's in the chunks. Efficiently fetching these from object storage is just about efficiently fetching the metadata up front so you know where the chunks you want are [1].
The data model of Zarr [2] generalizes this pattern pretty well, so that when backed by Icechunk [3], you can store a "datacube" of "virtual chunk references" that point at chunks anywhere inside the original files on S3.
This allows you to stream data out as fast as the S3 network connection allows [4], and then you're free to pull that directly, or build tile servers on top of it [5].
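To make the mechanic concrete: a "virtual chunk reference" is essentially (file key, byte offset, length), and reading a chunk is one ranged GET against the original object. A minimal sketch (all names are made up for illustration, not VirtualiZarr's or Icechunk's actual API; the dict stands in for an object store):

```python
from dataclasses import dataclass

@dataclass
class VirtualChunkRef:
    """Hypothetical reference: where one chunk lives inside an original file."""
    key: str      # object key of the original file (e.g. an HDF5 file on S3)
    offset: int   # byte offset of the chunk within that file
    length: int   # chunk size in bytes

def range_header(ref: VirtualChunkRef) -> str:
    # HTTP byte ranges are inclusive on both ends (RFC 7233).
    return f"bytes={ref.offset}-{ref.offset + ref.length - 1}"

def fetch_chunk(store: dict, ref: VirtualChunkRef) -> bytes:
    # Stand-in for s3.get_object(Bucket=..., Key=ref.key, Range=range_header(ref));
    # here `store` is just a dict mapping keys to file bytes.
    return store[ref.key][ref.offset : ref.offset + ref.length]
```

Once the references are collected up front, each chunk read is a single independent request, which is why this parallelizes so well against S3.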
In the Pangeo project and at Earthmover we do all this for weather and climate science data. But the underlying OSS stack is domain-agnostic, so it works for all sorts of multidimensional array data, and VirtualiZarr has a plugin system for parsing different scientific file formats.
I would love to see if someone could create a virtual Zarr store pointing at this WSI data!
Sounds like an approach that would also work for ML model weights files — just another kind of multidimensional array with metadata.
I wonder what exactly the big multi-model AI companies are doing to optimize model cold-start latency, and how much it just looks like Zarr on top of on-prem object storage.
> Many of these scientific file formats (HDF5, netCDF, TIFF/COG, FITS, GRIB, JPEG and more) are essentially just contiguous multidimensional array(/"tensor") chunks
Yeah, a recurring thought is that these should condense into Apache Arrow queried by DuckDB but there must be some reason for this not to have already happened.
Interesting guide to the Whole Slide Images (WSI) format. The surprising thing for me is that compression is used, and they note it does not affect diagnostic use.
Back in the day we used TIFF for a similar application (X-ray detector images).
Seems very similar to how maps work on the web these days, in particular Protomaps files [0]. I wonder if you could view the medical images in Leaflet or another frontend map library with the addition of a shim layer? Cool work!
Thanks! Indeed, digital pathology, satellite imaging and geospatial data share a lot of computational problems: efficient storage, fast spatial retrieval/indexing. I think this could be doable.
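The machinery the two fields share is the tile pyramid: a viewer just maps (level, x, y) tile coordinates to pixel regions at that resolution level. A rough sketch with hypothetical slide dimensions (not taken from any real format):

```python
def tile_region(level: int, tx: int, ty: int, tile_size: int = 256,
                base_width: int = 100_000, base_height: int = 80_000):
    """Map a (level, tx, ty) tile to its pixel region at that pyramid level.

    Level 0 is full resolution; each level halves both dimensions,
    as in web-map and pyramidal-TIFF conventions.
    """
    w = max(1, base_width >> level)
    h = max(1, base_height >> level)
    x0, y0 = tx * tile_size, ty * tile_size
    if x0 >= w or y0 >= h:
        raise ValueError("tile outside image")
    # Edge tiles may be smaller than tile_size.
    return (x0, y0, min(tile_size, w - x0), min(tile_size, h - y0))
```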
As for digital pathology, the field is very much tied to scanner-vendor proprietary formats (SVS, NDPI, MRXS, etc).
I did something similar once for a mining technique called “core logging”. It’s a single photo about 1000 pixels wide and several million “deep”: what the earth looks like for a few km down.
Existing solutions are all complicated and clunky; I put something together with S3 and a bastardised Cloud-Optimized GeoTIFF that gives an instant view of any part of the image.
I'm curious about the "core logging" photo. Where can I find one? Do you have an implementation of your solution? I would be curious to have a look at it.
A while back I worked on a project where s3 held giant zip files containing zip files (turtles all the way down) and also made good use of range requests. I came up with seekable-s3-stream[1] to generalize working with them via an idiomatic C# stream.
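The same idea translates to any language: wrap remote range requests in the platform's standard stream interface so existing parsers work unmodified. A minimal Python analogue (the `fetch` callable stands in for an S3 GetObject range request; this is a sketch, not the linked C# library):

```python
import io

class RangeReader(io.RawIOBase):
    """Minimal seekable, read-only stream over a remote object."""

    def __init__(self, size: int, fetch):
        self._size = size    # total object size, e.g. from a HEAD request
        self._fetch = fetch  # fetch(offset, length) -> bytes
        self._pos = 0

    def seekable(self):
        return True

    def readable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def tell(self):
        return self._pos

    def read(self, n=-1):
        if n < 0:
            n = self._size - self._pos
        n = max(0, min(n, self._size - self._pos))
        data = self._fetch(self._pos, n)  # one ranged request per read
        self._pos += n
        return data
```

Anything that accepts a file-like object (zipfile, tifffile, etc.) can then read directly out of object storage, pulling only the bytes it actually touches.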
Maybe a bit pedantic, but if you're streaming it, then you're still downloading portions of it, yah? Just not persisting the whole thing locally before viewing it.
Edit: Looks like this is a slight discrepancy between the HN title and the GitHub description.
Yes, I agree. I'm not persisting the WSI locally, which creates a smoother user experience. But I do need to transfer tiles from server to client. They are stored in an LRU cache and evicted if not used.
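An LRU tile cache like the one described is small to sketch; assuming tiles keyed by (level, x, y) and a capacity I've picked arbitrarily (this is illustrative, not WSIStreamer's actual code):

```python
from collections import OrderedDict

class TileCache:
    """Tiny LRU cache for decoded tiles, keyed by (level, x, y)."""

    def __init__(self, capacity: int = 128):
        self._cap = capacity
        self._tiles = OrderedDict()

    def get(self, key):
        if key not in self._tiles:
            return None
        self._tiles.move_to_end(key)  # mark as most recently used
        return self._tiles[key]

    def put(self, key, tile: bytes):
        self._tiles[key] = tile
        self._tiles.move_to_end(key)
        if len(self._tiles) > self._cap:
            self._tiles.popitem(last=False)  # evict least recently used
```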
Interesting - I'm not so familiar with S3, but I wonder if this would work for WSI stored on-premises. Lower network requirements and a lightweight web viewer would be very advantageous in this use case. I'll have to try it out!
When WSI are stored on-premises, they are typically kept on hard drives with a filesystem. If you have a filesystem, you can use OpenSlide, with a viewer like OpenSeadragon to visualize the slide.
WSIStreamer is relevant for storage systems without a filesystem. In this case, OpenSlide cannot work (it needs to seek and open the file).
Yes there is a requirement to work with the vendor format. For instance, TCGA (The Cancer Genome Atlas - a large dataset of 12k+ human tumor cases) has mostly .svs files (scanned with an Aperio scanner). We tend to work with these formats as they contain all the metadata we need.
Sometimes we re-write the image in a pyramidal TIFF format (this happened to me a few times, when NDPI images contained only the highest-resolution level and no pyramid), in which case COGs could work.
You could probably do it completely client-side. I have a parser for 12 scanner formats in JS. It doesn't read the pixels, just parses metadata, but JPEG is easy and the most common anyway.
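Parsing just the metadata is often only a handful of header reads. For example, a classic TIFF file (the container behind SVS and many other scanner formats) starts with an 8-byte header giving byte order, the magic number 42, and the offset of the first image file directory (IFD). A sketch of that first step:

```python
import struct

def parse_tiff_header(buf: bytes):
    """Parse the 8-byte classic TIFF header.

    Returns the struct endianness prefix and the offset of the first IFD,
    which is where the tag/metadata walk begins.
    """
    order = buf[:2]
    if order == b"II":
        endian = "<"   # little-endian ("Intel")
    elif order == b"MM":
        endian = ">"   # big-endian ("Motorola")
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", buf[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic")
    return endian, ifd_offset
```

From there, each IFD entry is 12 bytes (tag, type, count, value/offset), so tile offsets and sizes can be located with a few more small ranged reads.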
As data scientists, we usually don't get to choose. It's usually up to the hospital or digital lab's CISO to decide where the digitized slides are stored, and S3 is a fairly common option.
That being said, I plan to support more cloud platforms in the future, starting with GCP.
Currently we only support TIFF and SVS, with JPEG and JPEG2000 compression. I plan on supporting more formats (e.g. NDPI, MRXS) in the future, each with its own compression schemes.
[0]: https://virtualizarr.readthedocs.io/en/stable/
[1]: https://earthmover.io/blog/fundamentals-what-is-cloud-optimi...
[2]: https://earthmover.io/blog/what-is-zarr
[3]: https://earthmover.io/blog/icechunk-1-0-production-grade-clo...
[4]: https://earthmover.io/blog/i-o-maxing-tensors-in-the-cloud
[5]: https://earthmover.io/blog/announcing-flux
0: https://protomaps.com/
carderne|1 month ago
Wish I knew how to commercialise it…
kirubakaran|1 month ago
You've already done the "building v1" part, and have started to do the "talking about it" part.
Next step is to write up how one could use it, how it is better than the alternatives, and put it up on a website.
I'm happy to chat about it if you like. My email is in my profile.
Once you have real users, they will pull the v2 out of you, and that will be what you'll sell.
What I've written above sounds like a business proposition, but I want to clarify that I'm just offering to share what I know for free :-)
[1] https://github.com/mlhpdx/seekable-s3-stream
invaderJ1m|1 month ago
Was there a requirement to work with these formats directly without converting?
lijok|1 month ago
There are alternatives that speak the S3 data-plane API (GetObject, ListBucket, etc.), but none that support most of the wider AWS S3 functionality, such as replication and event notifications.