lidavidm | 1 year ago

Arrow has several other related projects in this space:

Arrow Flight SQL defines a full protocol designed to support JDBC/ODBC-like APIs, but using columnar Arrow data transfer for performance (why take your data and transpose it twice?)

https://arrow.apache.org/blog/2022/02/16/introducing-arrow-f...
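The "transpose it twice" point can be sketched in plain Python with hypothetical toy data: a row-oriented wire protocol forces a rows-to-columns pivot on the analytics side, and handing columnar results back through a row-oriented API pivots the same data again.

```python
# Hypothetical toy data: what a row-oriented (JDBC/ODBC-style) wire
# protocol hands you, one record at a time.
rows = [(1, "a"), (2, "b"), (3, "c")]

# Transpose #1: a columnar engine or analytics client must pivot the
# rows into columns before it can operate on the data.
columns = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
}

# Transpose #2: returning columnar results through a row-oriented API
# pivots the same data again. Arrow-native transfer avoids both steps.
back_to_rows = list(zip(columns["id"], columns["name"]))
assert back_to_rows == rows
```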

There's an Apache-licensed JDBC driver that talks the Flight SQL protocol (i.e. it's a driver for _any_ server that implements the protocol): https://arrow.apache.org/blog/2022/11/01/arrow-flight-sql-jd...

(There's also an ODBC driver, but at the moment it's GPL - the developers are working on upstreaming it and rewriting the GPL bits. And yes, this means that you're still transposing your data, but it turns out that transferring your data in columnar format can still be faster - see https://www.vldb.org/pvldb/vol10/p1022-muehleisen.pdf)

There's an experiment to put Flight SQL in front of PostgreSQL: https://arrow.apache.org/blog/2023/09/13/flight-sql-postgres...

There's also ADBC; where Flight SQL is a generic protocol (akin to TDS or how many projects implement the PostgreSQL wire protocol), ADBC is a generic API (akin to JDBC/ODBC in that it abstracts the protocol layer/database, but it again uses Arrow data): https://arrow.apache.org/blog/2023/01/05/introducing-arrow-a...
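To make the API-vs-protocol distinction concrete: a generic database API hides which driver and wire protocol sit underneath. The sketch below uses stdlib sqlite3 (a stand-in, not ADBC itself) to show the DB-API shape; an ADBC driver plays the same role but returns Arrow-formatted results instead of Python row tuples.

```python
import sqlite3

# Stand-in sketch: stdlib sqlite3 showing the generic-API shape that
# ADBC also provides. Swapping the driver underneath doesn't change
# the calling code -- that's the point of an API-level abstraction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (run_id INTEGER, value REAL)")
conn.executemany("INSERT INTO results VALUES (?, ?)",
                 [(1, 0.5), (1, 1.5), (2, 2.5)])

cur = conn.execute("SELECT run_id, SUM(value) FROM results GROUP BY run_id")
totals = cur.fetchall()  # row-oriented here; ADBC would hand back columns
conn.close()
```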

majoe | 1 year ago

Arrow is pretty cool, although I haven't yet had the opportunity to use it.

I skimmed the paper you linked and wondered how one measures the ser/de time a query takes, or more generally how one would estimate the possible speedup from using Arrow Flight to communicate with a database.

Do you by chance have any insights in that direction?

At work we have a Java application that produces a large amount of simulation results (ca. 1 TB per run), which are stored in a database. I suspect that a lot of time is wasted on ser/de when aggregating the results, but it would be good to have some numbers.

lidavidm | 1 year ago

To get a really precise answer you'd have to profile or benchmark. I'd say it's also hard to do an apples-to-apples comparison (if you only replace the data format in the wire protocol, the database probably still has to transpose the data to ingest it). And it's hard to benchmark in the first place, since your database's wire protocol probably isn't exposed in a way that lets you measure it directly.

You can sort of see what benefits you might get from a post like this, though: https://willayd.com/leveraging-the-adbc-driver-in-analytics-...

While Arrow isn't on the wire here, the ADBC driver uses Postgres's binary format (which is still row-oriented) plus COPY, and can get significant speedups compared to other Postgres drivers.
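A rough way to see why the encoding format matters, without touching a real wire protocol: the hypothetical microbenchmark below serializes one numeric column as text (roughly what a text-based protocol does) versus packed binary (roughly what a binary COPY does). Absolute numbers are machine-specific; the point is that the ser/de cost gap is measurable with a simple timing harness like this.

```python
import struct
import time

# Hypothetical column of 100k integers (a stand-in for query results).
values = list(range(100_000))

# Text encoding: one decimal string per value, newline-separated.
t0 = time.perf_counter()
text_payload = "\n".join(str(v) for v in values).encode()
text_time = time.perf_counter() - t0

# Binary encoding: fixed-width 8-byte little-endian integers.
t0 = time.perf_counter()
binary_payload = struct.pack(f"<{len(values)}q", *values)
binary_time = time.perf_counter() - t0

# Round-trip checks: both encodings must decode back to the same data.
assert [int(s) for s in text_payload.decode().split("\n")] == values
assert list(struct.unpack(f"<{len(values)}q", binary_payload)) == values
```

The same pattern (wrap the suspect ser/de step in `time.perf_counter()` calls, or run it under a profiler) is how you'd put numbers on the overhead in your own application.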

The other thing might be to consider whether you can just dump to Parquet files or something like that and bypass the database entirely (maybe using Iceberg as well).

chrisjc | 1 year ago

And all the Arrow parts work together quite nicely.

    ADBC client --> Flight SQL (duckdb/whatever) --> Flight --> ?
The result highlights your exact point: why take your data and transpose it twice?

It's quite an exciting space, with lots of projects popping up around Arrow Flight and DuckDB.