cpard | 15 days ago

A book covering an emerging field (data engineering for LLMs) should also mention the emerging categories around it, such as storage formats purpose-built for the full ML lifecycle.

Lance[1] (the format, not just LanceDB) is a great example: columnar storage optimized for both analytical and vector workloads, with built-in versioning for dataset iteration.

Plus, and this is very important, random access, which matters for things like sampling and efficient filtering during curation, and also for working with multimodal data, e.g. videos.
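To make the random access point concrete, here is a minimal sketch in plain Python of why it helps with sampling. It is not Lance's actual on-disk layout or API; the length-prefixed file, the offset index, and the helper names (write_records, read_record, sample_records) are all hypothetical and only there for illustration. The idea is that if you know where each record starts, drawing k samples costs k seeks instead of a scan over the whole file.

```python
import os
import random
import struct
import tempfile


def write_records(path, records):
    """Write variable-length byte records back to back; return their start offsets."""
    offsets = []
    with open(path, "wb") as f:
        for rec in records:
            offsets.append(f.tell())
            f.write(struct.pack("<I", len(rec)))  # 4-byte little-endian length prefix
            f.write(rec)
    return offsets


def read_record(f, offset):
    """Jump straight to one record without reading anything before it."""
    f.seek(offset)
    (length,) = struct.unpack("<I", f.read(4))
    return f.read(length)


def sample_records(path, offsets, k, seed=0):
    """Pick k distinct record indices at random and fetch only those records."""
    rng = random.Random(seed)
    picked = rng.sample(range(len(offsets)), k)
    with open(path, "rb") as f:
        return {i: read_record(f, offsets[i]) for i in picked}


# Hypothetical stand-in for large multimodal blobs of very different sizes.
records = [f"clip-{i}".encode() * (i + 1) for i in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    offsets = write_records(path, records)
    sample = sample_records(path, offsets, k=5)
```

With large blobs like video the gap gets bigger, since a sequential scan has to read past every record you did not ask for.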

Lance is not alone: Vortex[2] is another, Nimble[3] from Meta is yet another, and I might be missing a few more.

[1] https://github.com/lance-format/lance

[2] https://vortex.dev

[3] https://github.com/facebookincubator/nimble