Interesting choice of Pandas in this day and age. Maybe he’s after imparting general concepts that you could apply to any tabular data manipulator rather than selecting for the latest shiny tool.
You can assert whatever you want, but Polars is a great answer. The performance improvements are secondary to me compared to the dramatic improvement in interface.
Today all serious DS work will ultimately become data engineering work anyway. The time when DS can just fiddle around in notebooks all day has passed.
Pandas is generally awful unless you're just living in a notebook (and even then it's probably least favorite implementation of the 'data frame' concept).
Since Pandas lacks Polars' concept of an Expression, it's actually quite challenging to programmatically interact with non-trivial Pandas queries. In Polars the query logic can be entirely independent of the data frame while still referencing specific columns of the data frame. This makes Polars data frames work much more naturally with typical programming abstractions.
Pandas multi-index is a bad idea in nearly all contexts other than it's original use case: financial time series (and I'll admit, if you're working with purely financial time series, then Pandas feels much better). Sufficiently large Pandas code bases are littered with seemingly arbitrary uses of 'reset_index', there are many times where multi-index will create bugs, and, most important, I've never seen any non-financial scenario where anyone has ever used Multi-index to their advantage.
Finally Pandas is slow, which is honestly the least priority for me personally, but using Polars is so refreshing.
What other data frames have you used? Having used R's native dataframes extensively (the way they make use of indexing is so much nicer) in addition to Polars both are drastically preferable to Pandas. My experience is that most people use Pandas because it has been the only data frame implementation in Python. But personally I'd rather just not use data frames if I'm forced to used Pandas. Could you expand on what you like about Pandas over other data frames models you've worked with?
msto|2 months ago
mkl|2 months ago
> Copyright 2016
dahcryn|2 months ago
What other framework would you replace it with?
No, polars or spark is not a good answer, those are optimized for data engineering performance, not a holistic approach to data science.
crystal_revenge|2 months ago
Today all serious DS work will ultimately become data engineering work anyway. The time when DS can just fiddle around in notebooks all day has passed.
porker|2 months ago
Can you expand on why Polars isn't optimised for a holistic approach to data science?
minimaxir|2 months ago
maxnoe|2 months ago
xenophonf|2 months ago
crystal_revenge|2 months ago
Since Pandas lacks Polars' concept of an Expression, it's actually quite challenging to programmatically interact with non-trivial Pandas queries. In Polars the query logic can be entirely independent of the data frame while still referencing specific columns of the data frame. This makes Polars data frames work much more naturally with typical programming abstractions.
Pandas multi-index is a bad idea in nearly all contexts other than it's original use case: financial time series (and I'll admit, if you're working with purely financial time series, then Pandas feels much better). Sufficiently large Pandas code bases are littered with seemingly arbitrary uses of 'reset_index', there are many times where multi-index will create bugs, and, most important, I've never seen any non-financial scenario where anyone has ever used Multi-index to their advantage.
Finally Pandas is slow, which is honestly the least priority for me personally, but using Polars is so refreshing.
What other data frames have you used? Having used R's native dataframes extensively (the way they make use of indexing is so much nicer) in addition to Polars both are drastically preferable to Pandas. My experience is that most people use Pandas because it has been the only data frame implementation in Python. But personally I'd rather just not use data frames if I'm forced to used Pandas. Could you expand on what you like about Pandas over other data frames models you've worked with?
clickety_clack|2 months ago
amelius|2 months ago
wesleywt|2 months ago
unknown|2 months ago
[deleted]