Although "wide" data is touted as an optimization guideline for Nimble, how well does it fare against "normal" data, i.e. with just a few to tens of columns?

Yes, I'd be curious to know how much better it is than them - from my limited understanding, they also share many of the advantages that Nimble boasts of, so I can appreciate that they'd both beat legacy formats, but it's not clear how close these two actually are.

Well, Parquet seems to be so widely supported that it's my default pick, unless you can explain why it's not the right fit.
jgarzik|1 year ago
I would prefer to write a parser with zero dependencies.
SonOfLilit|1 year ago
Unified: More than a specification, Nimble is a product. We strongly discourage developers to (re-)implement Nimble’s spec to prevent environmental fragmentation issues observed with similar projects in the past. We encourage developers to leverage the single unified Nimble library, and create high-quality bindings to other languages as needed.
CharlesW|1 year ago
winwang|1 year ago
Also, are there any preliminary benchmarks?
quadrature|1 year ago
It seems to be optimized towards ML, where sequential scan is the access pattern, so it wouldn't be suitable for analytical workloads yet, though they are planning to work on that.
1-6|1 year ago
levzettelin|1 year ago
snthpy|1 year ago
1: https://lancedb.github.io/lance/
2: https://lancedb.github.io/lance/format.html
3: https://youtu.be/ixpbVyrsuL8?si=9QhF0wyxYtl2L01_
nmstoker|1 year ago
zX41ZdbW|1 year ago
1-6|1 year ago
albertzeyer|1 year ago
We still use HDF (https://en.wikipedia.org/wiki/Hierarchical_Data_Format).
But I wonder: if I were choosing a new file format today, what should I choose? Nimble is maybe too new, and there is too little experience with it (outside Meta).
Is there a good overview somewhere of all the available options, with some fair comparison? These are some that I found, but they're older:
https://www.hopsworks.ai/post/guide-to-file-formats-for-mach...
https://iopscience.iop.org/article/10.1088/1742-6596/1085/3/...
https://github.com/pangeo-data/pangeo/issues/285
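For context on why HDF remains attractive for array-heavy data: HDF5 datasets are natively n-dimensional, so tensors round-trip without flattening. A minimal sketch using h5py (the file and dataset names here are illustrative, not from any project above):

```python
import numpy as np
import h5py

# A small 3-D tensor; HDF5 stores the shape natively, no flattening needed.
data = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

with h5py.File("example.h5", "w") as f:
    # Datasets can carry chunking/compression; gzip is built in.
    f.create_dataset("activations", data=data, compression="gzip")

with h5py.File("example.h5", "r") as f:
    loaded = f["activations"][:]  # reads back as a NumPy array with shape (2, 3, 4)

print(loaded.shape)
```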
yencabulator|1 year ago
Though I'll say if your primary use case is "higher-dimensional arrays", none of Parquet etc are likely to be a good fit -- these things are columnar formats where each column has a separate name, datatype etc, not formats for multi-dimensional arrays of numbers. That's a different problem. A Parquet column can be a list of arrays, but there's no special handling of matrices.