top | item 41393036

cmollis | 1 year ago

True, Spark has existed for years and is a great toolset. I use it every day. It's also a huge hassle spinning clusters up and down, and configuration is complex.

I can execute some pretty hairy scans against a huge S3 Parquet dataset in DuckDB that I would typically have to run in either Spark or Athena. It's a little slower, but not ridiculously slower. And it does all of that from my desktop: no clusters, no memory or task configs, just run the query. Being able to do all of the expensive historical scanning and knit that back into an ML pipeline with desktop Python is pretty nice.
