

nchammas | 2 years ago

There is something I don't get about the Polars DataFrame API.

https://docs.pola.rs/user-guide/migration/spark/

Look at the examples on this page of the Spark vs. Polars DataFrame APIs. (Disclaimer: I contributed this documentation. [1])

Having used SQL and Spark DataFrames heavily, but not Polars (or Pandas, for that matter), my impression is that a Spark DataFrame is analogous to a SQL table, whereas a Polars DataFrame is something a bit different, perhaps closer to a matrix.

I'm not sure how else to explain some of the operations you can perform in Polars, which seem really weird coming from relational databases. I assume they are useful for something, but I'm not sure what. Perhaps machine learning?

[1]: https://github.com/pola-rs/polars-book/pull/113


tomtom1337|2 years ago

I have not used Spark, but I have written a lot of SQL, Polars, and Pandas. I think much more in terms of SQL when I write Polars than when I write Pandas. Do you have any examples of what you are referring to?

nchammas|2 years ago

The examples I'm referring to are on the page I linked to in my comment above.

Here's one of them:

  # Polars
  df.select(
    pl.col("foo").sort().head(2),
    pl.col("bar").sort(descending=True).head(2),
  )

In SQL and Spark DataFrames, it doesn't make sense to sort columns of the same table independently like this and then just juxtapose them. It is in fact very awkward to do with either interface, as the equivalent Spark code on that page shows. The SQL version is similarly awkward.

But in Polars (and maybe in Pandas too) you can do this easily, and I'm not sure why. There is something qualitatively different about the Polars DataFrame that makes this possible.
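
To make the difference concrete, here is a rough sketch in plain Python. It is a toy model, not Polars itself and not how Polars is implemented: the table is just a dict of columns, and the names `sort_rows` and `select_independent` are made up for illustration. The first function mimics a relational sort, where whole rows move together. The second mimics the select above, where each expression works on one column by itself and the results are simply placed side by side.

```python
# Toy model only: a "table" as a dict of equal-length columns.
# The data and the helper names below are invented for illustration.
table = {
    "foo": [3, 1, 2],
    "bar": [10, 20, 30],
}


def sort_rows(table, by):
    """Relational-style sort: rows move as a unit, so columns stay aligned."""
    names = list(table)
    key_index = names.index(by)
    rows = sorted(zip(*table.values()), key=lambda row: row[key_index])
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def select_independent(table, exprs):
    """Column-wise select: each expression sees one column and returns a new one.

    The results are juxtaposed; the only constraint in this toy model is
    that they all end up with the same length.
    """
    out = {name: fn(table[name]) for name, fn in exprs.items()}
    if len({len(col) for col in out.values()}) > 1:
        raise ValueError("all result columns must have the same length")
    return out


# Rows stay intact: (1, 20), (2, 30), (3, 10).
relational = sort_rows(table, by="foo")

# Each column is sorted by itself, then truncated to 2 values.
columnar = select_independent(
    table,
    {
        "foo": lambda col: sorted(col)[:2],
        "bar": lambda col: sorted(col, reverse=True)[:2],
    },
)

print(relational)
print(columnar)
```

In the second result, the pairs (1, 30) and (2, 20) never existed as rows in the input. That is the part that has no natural counterpart in a relational model, where a row is the unit of data, and it is what I mean by the DataFrame feeling closer to a collection of independent columns.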