top | item 41841079

_willmanning | 1 year ago

Perhaps that verbiage is just confusing. "On-disk" sort of implies "file format" but could be more explicit.

That said, the immediate next line in the README perhaps clarifies a bit?

"Vortex is designed to be to columnar file formats what Apache DataFusion is to query engines (or, analogously, what LLVM + Clang are to compilers): a highly extensible & extremely fast framework for building a modern columnar file format, with a state-of-the-art, "batteries included" reference implementation."

jagged-chisel|1 year ago

“Vortex is […] a highly extensible & extremely fast framework for building a modern columnar file format.”

It’s a framework for building file formats. This does not indicate that Vortex is, itself, a file format.

aduffy|1 year ago

Will and I actually work on Vortex :wave:

Perhaps we should clean up the wording in the intro, but yes, there is in fact a file format!

We actually built the toolkit first, before building the file format. The interesting thing here is that we have a consistent in-memory and on-disk representation of compressed, typed arrays.

This is nice for a couple of reasons:

(a) It makes it really easy to test out new compression algorithms and compute functions. We just implement a new codec and it's automatically available for the file format.

(b) We spend a lot of energy on efficient pushdown. Many compute functions, such as slicing and cloning, are zero-cost, and all compute operations can execute directly over compressed data.
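To make the two points above concrete, here is a minimal sketch in Python of the general idea, not Vortex's actual API or file layout. It assumes a toy run-length codec; the class name `RunLengthArray`, its methods, and the byte layout are all hypothetical and chosen only for illustration. The same fields are used in memory and written to bytes, and `total()` computes a sum over the runs without ever decoding them.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class RunLengthArray:
    """Toy run-length encoded integer array (hypothetical, not Vortex's API).

    values[i] is repeated lengths[i] times.
    """
    values: tuple
    lengths: tuple

    @classmethod
    def encode(cls, data):
        # Collapse consecutive equal elements into (value, run length) pairs.
        values, lengths = [], []
        for x in data:
            if values and values[-1] == x:
                lengths[-1] += 1
            else:
                values.append(x)
                lengths.append(1)
        return cls(tuple(values), tuple(lengths))

    def decode(self):
        # Expand the runs back into a plain list.
        out = []
        for v, n in zip(self.values, self.lengths):
            out.extend([v] * n)
        return out

    def to_bytes(self):
        # Assumed layout: run count (u32), then all values (i64),
        # then all run lengths (u32), little-endian. The compressed
        # fields are written out as-is; nothing is decoded first.
        n = len(self.values)
        return struct.pack(f"<I{n}q{n}I", n, *self.values, *self.lengths)

    @classmethod
    def from_bytes(cls, buf):
        # Reading yields the same compressed structure, not raw data.
        (n,) = struct.unpack_from("<I", buf, 0)
        fields = struct.unpack_from(f"<{n}q{n}I", buf, 4)
        return cls(tuple(fields[:n]), tuple(fields[n:]))

    def total(self):
        # Compute directly over compressed data: one multiply per run.
        return sum(v * n for v, n in zip(self.values, self.lengths))


data = [7, 7, 7, 7, 2, 2, 9, 9, 9]
arr = RunLengthArray.encode(data)
restored = RunLengthArray.from_bytes(arr.to_bytes())
```

In this sketch, `restored` compares equal to `arr`, and `restored.total()` touches three runs rather than nine elements. A real system would cover many codecs and types, but the shape is the same: once a codec defines its compressed fields, serialization and compute can both work on those fields directly.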

I highly encourage you to check out the vortex-serde crate in the repo for file format things, and the vortex-datafusion crate for some examples of integrating the format into a query engine!