I am impressed that Polars is close to DuckDB near the top. It's surprising that a Python library would often outperform everything but DuckDB. DuckDB is very impressive, but DataFrames and Python are too useful to give up on.
DuckDB interoperates with polars dataframes easily. I see DuckDB as a SQL engine for dataframes.
Any DuckDB result is easily converted to Pandas (by appending .df()) or Polars (by appending .pl()).
The conversion to Polars is instantaneous because it's zero-copy: everything goes through the Arrow in-memory format.
So I usually write complex queries in DuckDB SQL, but if I need to manipulate the result in Polars I just convert it midstream in my workflow (it only takes milliseconds) and then continue working with that in DuckDB. It's seamless thanks to Apache Arrow.
Wow, what a cool workflow. It looks like the interop promise of Apache Arrow is real. It's a great thing when your computer works as fast as you think, as opposed to sitting around waiting for queries to finish.
I mean, Polars is great, but there's nothing fundamentally stopping Polars from providing similar performance to DuckDB: Polars is written in Rust, and a lazy dataframe really just provides an alternative frontend (SQL being another frontend).
There's nothing in the architecture that would make it so that performance in one OLAP engine is fundamentally impossible to achieve in another.
I didn't know that Polars was implemented in Rust. In fact, it's very neat that Rust can interop with Python so cleanly, but that shouldn't be surprising since Python is basically a wrapper around C libraries.
But I still think it's surprising how much mileage Python's model of wrapping C/C++/Rust libraries has. I would have assumed that if you have Python calling the libraries, you can't do lazy evaluation, and thus you hit a wall the way Pandas does.
But we've seen with compiled PyTorch and Polars that you can have your cake and eat it too: keep the ease of use of Python while getting performance, with enough engineering.
wenc|2 years ago
https://duckdb.org/docs/guides/python/polars.html
indeedmug|2 years ago
theLiminator|2 years ago
indeedmug|2 years ago
shpongled|2 years ago
unknown|2 years ago
[deleted]