wesm | 5 years ago

Microsoft is also on top of this with their Magpie project

http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf

"A common, efficient serialized and wire format across data engines is a transformational development. Many previous systems and approaches (e.g., [26, 36, 38, 51]) have observed the prohibitive cost of data conversion and transfer, precluding optimizers from exploiting inter-DBMS performance advantages. By contrast, in-memory data transfer cost between a pair of Arrow-supporting systems is effectively zero. Many major, modern DBMSs (e.g., Spark, Kudu, AWS Data Wrangler, SciDB, TileDB) and data-processing frameworks (e.g., Pandas, NumPy, Dask) have or are in the process of incorporating support for Arrow and ArrowFlight. Exploiting this is key for Magpie, which is thereby free to combine data from different sources and cache intermediate data and results, without needing to consider data conversion overhead."

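To make the "effectively zero" point concrete, here is a rough, stdlib-only sketch of the idea. It does not use Arrow or any Magpie API; the "producer" and "consumer" engines and the `column_sum` helper are hypothetical stand-ins. The assumption is simply that both sides agree on one in-memory column layout, so hand-off is a shared view rather than a conversion.

```python
from array import array

# Hypothetical "producer engine": owns a column of int64 values in one
# contiguous buffer (a stand-in for an Arrow-style columnar buffer).
column = array("q", [10, 20, 30, 40])

# Hypothetical "consumer engine": receives a view of the same memory.
# memoryview does not copy; both sides agree on the layout ("q" = int64).
view = memoryview(column)


def column_sum(buf):
    # Reads values straight out of the shared buffer.
    return sum(buf)


total = column_sum(view)

# A write by the producer is visible through the consumer's view,
# which shows that no copy was made at hand-off.
column[0] = 11
shares_memory = view[0] == 11

# The conversion-based alternative: serialize to text and parse it again,
# which costs work proportional to the data size on both sides.
round_tripped = [int(x) for x in ",".join(map(str, column)).split(",")]
```

The hand-off through `memoryview` costs the same regardless of column length, while the text round trip grows with the data. That difference is the conversion overhead the quoted passage says a shared format removes.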
polskibus|5 years ago

I wish MS would put some resources behind Arrow in .NET. I tried raising this on the dotnet repos (especially within ML.NET), but to no avail. Hopefully that will change now that Arrow is more popular and MS itself is writing about it.

data_ders|5 years ago

Way cool! Is Magpie end-user facing anywhere yet? We were using the azureml-dataprep library for a while, which seems similar but doesn't cover all of Magpie.