Although "wide" data is touted as an optimization guideline for Nimble, how well does it fare against "normal" data, i.e. with just a few to tens of columns?

Yes, I'd be curious to know how much better it is than them - from my limited understanding, they also share many of the advantages that Nimble boasts of, so I can appreciate that they'd both beat legacy formats, but it's not clear how close these two actually are.

Well, Parquet seems to be so widely supported that it's my default pick, unless you can explain why it's not the right fit.
jgarzik|1 year ago
I would prefer to write a parser with zero dependencies.
SonOfLilit|1 year ago
Unified: More than a specification, Nimble is a product. We strongly discourage developers to (re-)implement Nimble’s spec to prevent environmental fragmentation issues observed with similar projects in the past. We encourage developers to leverage the single unified Nimble library, and create high-quality bindings to other languages as needed.
CharlesW|1 year ago
winwang|1 year ago
Also, are there any preliminary benchmarks?
quadrature|1 year ago
It seems to be optimized towards ML, where sequential scan is the access pattern, so it wouldn't be suitable for analytical workloads yet, though they are planning to work on that.
1-6|1 year ago
levzettelin|1 year ago
snthpy|1 year ago
1: https://lancedb.github.io/lance/
2: https://lancedb.github.io/lance/format.html
3: https://youtu.be/ixpbVyrsuL8?si=9QhF0wyxYtl2L01_
nmstoker|1 year ago
zX41ZdbW|1 year ago
1-6|1 year ago
albertzeyer|1 year ago
We still use HDF (https://en.wikipedia.org/wiki/Hierarchical_Data_Format).
But I wonder: if I were choosing a new file format today, what should I choose? Nimble is maybe too new, and there is too little experience with it (outside Meta).
Is there a good overview somewhere of all the available options, with some fair comparison? These are some that I found, but they're older:
https://www.hopsworks.ai/post/guide-to-file-formats-for-mach...
https://iopscience.iop.org/article/10.1088/1742-6596/1085/3/...
https://github.com/pangeo-data/pangeo/issues/285
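For context on why HDF remains attractive for array-heavy data: HDF5 datasets are natively n-dimensional, so tensors round-trip without flattening. A minimal sketch using h5py (the file and dataset names here are illustrative, not from any project above):

```python
import numpy as np
import h5py

# A small 3-D tensor; HDF5 stores the shape natively, no flattening needed.
data = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

with h5py.File("example.h5", "w") as f:
    # Datasets can carry chunking/compression; gzip is built in.
    f.create_dataset("activations", data=data, compression="gzip")

with h5py.File("example.h5", "r") as f:
    loaded = f["activations"][:]  # reads back as a NumPy array with shape (2, 3, 4)

print(loaded.shape)
```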
yencabulator|1 year ago
Though I'll say if your primary use case is "higher-dimensional arrays", none of Parquet etc are likely to be a good fit -- these things are columnar formats where each column has a separate name, datatype etc, not formats for multi-dimensional arrays of numbers. That's a different problem. A Parquet column can be a list of arrays, but there's no special handling of matrices.