eduren | 4 years ago

I've been following Arrow and Datafusion dev for a little bit, mostly because the architecture and goals look interesting.

What I'm curious about is one of the possible use cases mentioned in the README: ETL processes. I have yet to come across any projects building ETL/ELT/pipeline tools that leverage Datafusion. I might not be looking in the right places.

Would anyone have insight into whether this is simply unexplored territory, or just not as good of a fit as other use cases?

seddonm1|4 years ago

Disclosure: I am a contributor to Datafusion.

I have done a lot of work in the ETL space in Apache Spark to build Arc (https://arc.tripl.ai/) and have ported a lot of the basic functionality of Arc to Datafusion as a proof-of-concept. The appeal to me of the Apache Spark and Datafusion engines is the ability to a) separate compute and storage and b) express transformation logic in SQL.

Performance: In those early experiments, Datafusion would frequently finish processing an entire job _before_ the SparkContext could be started, even on a local Spark instance. Obviously this is at smaller data sizes, but in my experience a lot of ETL is about repeatable processes, not necessarily huge datasets.

Compatibility: Those experiments were done a few months ago, and the SQL compatibility of the Datafusion engine has improved extremely rapidly since then (WINDOW functions were recently added). There is still some missing SQL functionality (for example, what is needed to run all the TPC-H queries: https://github.com/apache/arrow-datafusion/tree/master/bench...) but it is moving quickly.
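
As a rough illustration of point b) above, here is a minimal sketch of an ETL step whose transformation logic lives entirely in a SQL string. It uses Python's built-in sqlite3 module as a stand-in engine, not Datafusion or Spark, and the table name, column names, and `run_etl` helper are all made up for the example.

```python
import sqlite3

# Hypothetical transformation: the logic is plain SQL, so in principle it
# could be handed to a different SQL engine without being rewritten.
TRANSFORM_SQL = """
SELECT customer_id, SUM(amount) AS total_amount
FROM raw_orders
WHERE status = 'complete'
GROUP BY customer_id
ORDER BY customer_id
"""


def run_etl(rows):
    """Load rows into an in-memory table, apply TRANSFORM_SQL, return the result rows."""
    conn = sqlite3.connect(":memory:")
    try:
        # Extract/load: stage the raw input (assumed schema, for illustration only).
        conn.execute(
            "CREATE TABLE raw_orders (customer_id INTEGER, amount REAL, status TEXT)"
        )
        conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
        # Transform: the engine only executes the SQL; it holds no business logic.
        return conn.execute(TRANSFORM_SQL).fetchall()
    finally:
        conn.close()
```

The design point is that `TRANSFORM_SQL` is the only part that encodes the job itself; the staging and execution code around it is what would change when swapping engines.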

eduren|4 years ago

Oh hey, thanks for the info!

I spent some time evaluating Arc for my team's ETL purposes and I was really impressed. I hesitated somewhat to move forward with it because it seemed really tied into the Spark ecosystem (for great reasons). We just weren't at all familiar with deploying and operating Spark, so we ended up rolling our own scripts on top of (an existing) Airflow cluster for now.

Besides performance, are there any other advantages to porting Arc to run on top of Datafusion? If the porting effort is shared somewhere, I'd love to dig in and see what the proof-of-concept looks like.

houqp|4 years ago

ETL pipelines are a perfect fit for Datafusion and its distributed version, Ballista. Personally, this is the main reason I am investing my time in Datafusion.