edschofield | 1 month ago
It’s probably not worth incurring the pain of a compatibility-breaking Pandas upgrade. Switch to Polars instead for new projects and you won’t look back.
data-ottawa|1 month ago
Pandas created the modern Python data stack when there were not really any alternatives (except R and closed source). The original split-apply-combine paradigm was well thought out, simple, and effective, and the built-in tools to read pretty much anything (including all of your awful CSV files and Excel tables) and deal with timestamps easily made it fit into tons of workflows. It pioneered a lot, and basically still serves as the foundation and common format for the industry.
I always recommend every member of my teams read Modern Pandas by Tom Augspurger when they start, as it covers all the modern concepts you need to get data work done fast and with high quality. The concepts carry over to polars.
And I have to thank the pandas team for being a very open and collaborative bunch. They’re humble and smart people, and every PR or issue I’ve interacted with them on has been great.
Polars is undeniably great software, and it's my standard tool today. But they did benefit from the failures and hard edges of pandas, PySpark, Dask, the tidyverse, and xarray. It's an advantage pandas didn't have, and one pandas still pays for.
I’m not trying to take away from polars at all. It’s damn fast — the benchmarks are hard to beat. I’ve been working on my own library and basically every optimization I can think of is already implemented in polars.
I do have a concern with their VC funding/commercialization with cloud. The core library is MIT licensed, but knowing they'll always have this feature wall when you want to scale is not ideal. I think it limits the future of the library a lot, and I think long term someone will fill that niche and the users will leave.
neves|1 month ago
https://tomaugspurger.net/posts/modern-1-intro/
nothrowaways|1 month ago
sampo|1 month ago
For better or worse, like Excel and like the simpler programming languages of old, Pandas lets you overwrite data in place.
Polars, by contrast, comes from a more modern data engineering philosophy: data is immutable. In Polars, if you ever wanted to do such a thing, you'd write a pipeline to process and replace the whole column. If you are just interactively playing around with your data, and want to do it in Python and not in Excel or R, Pandas might still hit the spot. Or use Polars, and if need be temporarily convert the data to Pandas or even to a NumPy array, manipulate, and then convert back.
P.S. Polars has an optimization to overwrite a single value.
But as far as I know, it doesn't allow slicing or anything.
richardbachman|1 month ago
I believe it is just "syntax sugar" for calling `Series.scatter()`[1]
> it doesn't allow slicing
I believe you are correct:
You can do single-value assignment, though. Perhaps nobody has requested slice syntax? It seems like it would be easy to add.
[1]: https://github.com/pola-rs/polars/blob/9079e20ae59f8c75dcce8...
goatlover|1 month ago
satvikpendem|1 month ago
Polars is great, but it is better precisely because it learned from all the mistakes of Pandas. Don't besmirch the latter just because it now has to deal with the backwards compatibility of those mistakes, because when it first started, it was revolutionary.
crystal_revenge|1 month ago
I (and many others) hated Pandas long before Polars was a thing. The main problem is that it's a DSL that doesn't really work well with the rest of Python (that and multi-index is awful outside of the original financial setting). If you're doing pure data science work it doesn't really come up, but as soon as you need to transform that work into a production solution it starts to feel quite gross.
Before Polars my solution was (and still largely remains) to do most of the relational data transformations in the data layer, and then use dicts, lists, and NumPy for all the additional downstream transformations. This made it much easier to break out of the "DS bubble" and incorporate solutions into main products.
vegabook|1 month ago
Xunjin|1 month ago
bicepjai|1 month ago
v3ss0n|1 month ago
gkbrk|1 month ago
They get forked and stay open source? At least this is what happens to all the popular ones. You can't really un-open-source a project if users want to keep it open-source.
quentindanjou|1 month ago
rdedev|1 month ago
I work with chemical datasets and this always involves converting SMILES strings to RDKit Molecule objects. Polars cannot do this as simply as calling `.map` in pandas.
Pandas is also much better for EDA. So calling it worse in every instance is not true. If you are doing pure data manipulation, then go ahead with Polars.
data-ottawa|1 month ago
When it feels like you're writing some external UDF that's executed in another environment, it does not feel as nice as throwing in a lambda, even if the lambda is not ideal.
rich_sasha|1 month ago
Where I certainly disagree is the "frame as a dict of time series" setting, and general time series analysis.
The feel is also different. Pandas is an interactive data analysis container, poorly suited for production use. Polars I feel is the other way round.
thelastbender12|1 month ago
sirfz|1 month ago
lairv|1 month ago
skylurk|1 month ago
ritchie46|1 month ago
However, this is not a Polars issue. Using "fork" can leave ANY MUTEX in the child process in an invalid state (a multi-threaded query engine has plenty of mutexes). It is highly unsafe and assumes that none of the libraries in your process hold a lock at that time. That's not an assumption PyTorch dataloaders can safely make.
schmidtleonard|1 month ago
Do they really still not have a good mechanism to toss a flag on a for loop to capture embarrassing parallelism easily?
torcete|1 month ago
datsci_est_2015|1 month ago
jvican|1 month ago
bovermyer|1 month ago
The professor doesn't actually care which tool we use as long as we produce nice graphs, so this is as good a time as any to experiment.
__mharrison__|1 month ago
Pandas is better for plotting and third party integration.
vaylian|1 month ago
I used Pandas a lot with Jupyter notebooks. I don't have any experience with Polars. Is it also possible to work with Polars dataframes in Jupyter notebooks?
disgruntledphd2|1 month ago
bhadass|1 month ago
data-ottawa|1 month ago
UDFs in most dataframe libraries tend to feel better than writing UDFs for a SQL engine as well.
Polars specifically has a lazy mode which enables a query optimizer, so you get predicate pushdown and all the goodies of SQL, with extra control/primitives (sane pivoting, group_by_dynamic, etc.).
I do use ibis on top of duckdb sometimes, but the UDF situation persists and the way they organize their docs is very difficult to use.
vegabook|1 month ago
bikelang|1 month ago
unknown|1 month ago
[deleted]
pelasaco|1 month ago
noitpmeder|1 month ago
noo_u|1 month ago
Unfortunately, there are a lot of third party libraries that work with Pandas that do not work with Polars, so the switch, even for new projects, should be done with that in mind.
skylurk|1 month ago
I maintain one of those libraries and everything is polars internally.
unknown|1 month ago
[deleted]