top | item 38920043

Polars

981 points| tosh | 2 years ago |pola.rs

386 comments

[+] j1elo|2 years ago|reply
It's crystal clear that this page has been written for people who already know what they are looking at; the first line of the first paragraph, far from describing the tool, is about some qualities of it: "Polars is written from the ground up with performance in mind"

And the rest follows the same line.

Could anyone ELI5 what this is and what needs it is a good solution for?

EDIT: So an alternative implementation of Pandas DataFrame. Google gave me [0] which explains:

> The pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields.

> DataFrames are similar to SQL tables or the spreadsheets that you work with in Excel or Calc. In many cases, DataFrames are faster, easier to use, and more powerful than tables or spreadsheets because they’re an integral part of the Python and NumPy ecosystems.

[0]: https://realpython.com/pandas-dataframe/

[+] anigbrowl|2 years ago|reply
Yes, it's an annoying negative feature of many tech products. Of course it's natural to want to speak to your target audience (in this case, data scientists who like Pandas but find it annoyingly slow/inflexible), but it's quite alienating to newbies who might otherwise become your most enthusiastic customers.

I am the target audience for Polars and have been meaning to try it for several months, but I keep procrastinating because I feel residual loyalty to Pandas: Wes McKinney (its creator) took the time to write a helpful book about the most common analytical tools: https://wesmckinney.com/book/

[+] DylanDmitri|2 years ago|reply
It’s pandas, but fast. Pandas is the original open source data frame library. Pandas is robust and widely used, but sprawling and apparently slower than this newcomer. The word “data frames” keys in people who have worked with them before.
[+] madeofpalk|2 years ago|reply
I was going to say - it always feels so humbling seeing pages like this. "DataFrames for the new era" okay… maybe I know what data frames are? "Multi-threaded query engine" ahh, so it’s like a database. A graph comparing it to things called pandas, modin, and vaex - I have no clue what any of these are either! I guess this really isn’t for me.

It’s a shame because I like to read about new tech or project and try and learn more, even if I don’t understand it completely. But there’s just nothing here for me.

This must be what normal people go through when I talk about my lowly web development work…

[+] pama|2 years ago|reply
In fairness, the title of the page is “Dataframes for the new Era”. The “Get Started” link below the title links to a document that points to the GitHub page, which explains what the library is about to people with data analysis backgrounds: https://github.com/pola-rs/polars
[+] Joeboy|2 years ago|reply
I'm currently getting dragged into "data" stuff, and I get the impression it's a parallel universe, with its own background and culture. A lot of stuff is like "connect to your Antelope or Meringue instances with the usability of Nincompoop and the performance of ARSE2".

Anyway, probably the interesting things about polars are that it's like pandas, but is written in Rust on top of a more efficient columnar memory format called Arrow (although I think that part's also in pandas now), and it has something like a "query planner" that makes combining operations more efficient. Typically doing things in polars is much more efficient than pandas, to the extent that things that previously required complicated infrastructure can often be done on a single machine. It's a very friendly competition: Arrow itself was co-created by the main developer of pandas.

As far as I can tell everybody loves it and it'll probably supplant pandas over time.

[+] godelski|2 years ago|reply
I try to use polars each time I have to do some analysis where dataframes help. So basically any time I'd reach for pandas, which isn't too often. So each time it's fairly "new". This makes me have a hard time believing everyone that is saying "Pandas but faster" has used Polars, because I can often write Pandas from memory.

There are enough subtle and breaking changes that it is a bit frustrating. I really think Polars would be much more popular if the learning curve wasn't so high. It wouldn't be so high if there were just good docs. I'm also confused why there's a split between "User Guide" and "Docs".

To all devs:

Your docs are incredibly important! They are not an afterthought. And dear god, don't treat them as an afterthought and then tell people opening issues to RTFM. It's totally okay to point people in the right direction without hostility. It even takes less energy! It's okay to have a bad day and apologize later too, you'll even get more respect! Your docs are just as important as your code, even if you don't agree with me, they are to everyone but you. Besides technical debt there is also design debt. If you're getting the same questions over and over, you probably have poor design or you've miscommunicated somewhere. You're not expected to be a pro at everything you do and that's okay, we're all learning.

This isn't about polars, but I'm sure I'm not the only one to experience main character devs. It makes me (and presumably others) not want to open issues on __any__ project, not just bad projects, and that's bad for the whole community (including me, because users find mistakes. And we know there's 2 types of software: those with bugs and those that no one uses). Stupid people are just wizards in training and you don't get more wizards without noobs.

[+] drbaba|2 years ago|reply
Above that it says “DataFrames for a new era” hidden in their graphics. I believe it’s a competitor to the Python library “Pandas”, which makes it easy to do complex transformations on tabular data in Python.
[+] notatoad|2 years ago|reply
I think something like dataframes suffers from having a name that isn't obscure enough. You read "dataframes" and think those are two words you know, so you should understand what it is.

If they'd called them flurzles you wouldn't feel like you should understand if it's not something you work with.

[+] mekster|2 years ago|reply
How come some submissions don't describe what the thing is about beyond just its name? It's really puzzling how everyone is meant to know what it is by its name.
[+] Ultimatt|2 years ago|reply
Right... but the title before the first line reads "DataFrames for the new era". If you don't know what a data frame is then, yes, it's for people who already know that.
[+] jmspring|2 years ago|reply
You were right that the page is written for those that know what they are looking for, which is just fine. If you are getting started in DS/ML/etc and you have used numpy, pandas, etc., polars is useful in some cases. A simple one: it loads dataframes faster than pandas (from experience with a team I help).

I haven't played with it enough to know all its benefits, but yes, it's the next logical step if you are in the space using the above-mentioned libraries; it's something one will find.

[+] the__alchemist|2 years ago|reply
Dataframes in Python are a wrapper around 2D numpy arrays, that have labels and various accessors. Operations on them are OOM slower than using the underlying arrays.
[+] esafak|2 years ago|reply
Marketing is a skill that needs to be learned. You have to put yourself in the shoes of a person who knows nothing about your product. This does not come naturally to the engineers who make these products and are used to talking to other specialists like themselves.
[+] nnevatie|2 years ago|reply
Noticed exactly the same - there's no description of the library whatsoever on the landing page. It is implied that it is a DataFrame library, whatever that means.
[+] SalmoShalazar|2 years ago|reply
It’s not written for you and that’s fine. This is a library targeted at a very specific subset of people and you’re not in it.
[+] dang|2 years ago|reply
Related:

Detailed Comparison Between Polars, DuckDB, Pandas, Modin, Ponder, Fugue, Daft - https://news.ycombinator.com/item?id=37087279 - Aug 2023 (1 comment)

Polars: Company Formation Announcement - https://news.ycombinator.com/item?id=36984611 - Aug 2023 (52 comments)

Replacing Pandas with Polars - https://news.ycombinator.com/item?id=34452526 - Jan 2023 (82 comments)

Fast DataFrames for Ruby - https://news.ycombinator.com/item?id=34423221 - Jan 2023 (25 comments)

Modern Polars: A comparison of the Polars and Pandas dataframe libraries - https://news.ycombinator.com/item?id=34275818 - Jan 2023 (62 comments)

Rust polars 0.26 is released - https://news.ycombinator.com/item?id=34092566 - Dec 2022 (1 comment)

Polars: Fast DataFrame library for Rust and Python - https://news.ycombinator.com/item?id=29584698 - Dec 2021 (124 comments)

Polars: Rust DataFrames Based on Apache Arrow - https://news.ycombinator.com/item?id=23768227 - July 2020 (1 comment)

[+] dangoodmanUT|2 years ago|reply
so you took my original username
[+] serjester|2 years ago|reply
Used pandas for years and it always felt like rolling a ball uphill - just look at doing something as simple as a join (don't forget to reset the index).

Polars feels better than pandas in every way (faster + multi-core, less memory, more intuitive API). The library is still relatively young which has its downsides but in my opinion, at minimum, it deserves to be considered on any new project.

Easily being able to leverage the Rust ecosystem is also awesome - I sped up some geospatial code 100X by writing my own plugin to parallelize a function.

[+] nerdponx|2 years ago|reply
> just look at doing something as simple as a join (don't forget to reset the index)

It's slightly ironic that you mention this, because I always thought the biggest problem with Pandas was its documentation. Case in point: did you know there's a way to join data frames without using the index? It's called "merge" rather than "join".

Pandas was originally very heavily inspired by R terminology and usage patterns, where the term "merge" to mean "join" was already commonplace. If I didn't already know R when I started learning Pandas (~2015), I don't think I'd have been able to pick it up quickly at all.
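
For anyone who hasn't run into the merge/join distinction, a small sketch (the tables and column names are hypothetical) showing the same inner join written both ways:

```python
import pandas as pd

# Hypothetical toy tables sharing a "user_id" column.
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"user_id": [1, 1, 3], "amount": [9.5, 3.0, 7.25]})

# merge() joins on ordinary columns; no index manipulation is needed.
merged = users.merge(orders, on="user_id", how="inner")

# join() aligns on the index by default, so the key has to be moved
# into the index first and restored afterwards to get the same table.
joined = (
    users.set_index("user_id")
    .join(orders.set_index("user_id"), how="inner")
    .reset_index()
)

print(merged)
```

Both frames end up with the columns `user_id`, `name` and `amount` and the same three matching rows; the `reset_index()` step in the second version is the part the parent comment was complaining about.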

[+] snthpy|2 years ago|reply
I am very curious to know how you feel about PRQL (prql-lang.org) ? IMHO it gives you the ergonomics and DX of Polars or Pandas with the power and universality of SQL because you can still execute your queries on any SQL compatible query execution engine of your choice, including Polars and Pandas but also DuckDB, ClickHouse, BigQuery, Redshift, Postgres, Trino/Presto, SQLite, ... to name just a few popular ones.

The join syntax and semantics is one of the trickiest parts and is under discussion again recently. It's actually one of the key parts of any data transformation platform and is foundational to Relational Algebra, being right there in the "Relational" part and also the R in PRQL. Most of the PRQL built-in primitive transforms are just simple list manipulations like map, filter or reduce but joins require care to preserve monadic composition (see for example the design of SelectMany in LINQ or flatmap in the List Monad). See this comment for some of my thoughts on this: https://github.com/PRQL/prql/issues/3782#issuecomment-181131... That issue is closed but I would love to hear any comments and you are welcome to open a new issue referencing that comment or simply tagging me (@snth).

Disclaimer: I'm a PRQL contributor.

[+] dcreater|2 years ago|reply
What's difficult with pandas dataframe merges?
[+] maliker|2 years ago|reply
Biggest advantage I found when I evaluated it was the API was much more consistent and understandable than the pandas one. Which is probably a given, they’ve learned from watching 20 major versions of pandas get released. However, since it’s much rarer, copilot had trouble writing polars code. So I’m sticking with pandas and copilot for now. Interesting barrier to new libraries in general I hadn’t noticed until I tried this.
[+] epolanski|2 years ago|reply
You're the first person I've ever encountered who publicly states a preference for a library because of its Copilot support.

Not making a judgement, just finding it interesting.

Anyway, for what it's worth, Copilot learns fast in your repos, very fast.

I use an extremely custom stack built on TS-Plus, a TypeScript fork that not even its author uses or recommends, and Copilot churns out very good TS-Plus code.

So don't underestimate how good Copilot can get at the boilerplate stage once it has seen a few examples.

[+] BadHumans|2 years ago|reply
Copilot support is a chicken and egg problem. It needs to train on others' code, but if people don't write Polars code without Copilot then Copilot will not get better at writing Polars code.
[+] humbleharbinger|2 years ago|reply
I had a similar experience using danfo.js, another data frame library in JS. Copilot straight up hallucinates functionality and method names.

Not a big deal because I just read the docs but it was annoying that I couldn't have copilot just spit out what I need.

[+] xpe|2 years ago|reply
You recognize the API is more consistent and understandable, but you want to stay with Pandas only because Copilot makes it easier? Please, (a) for your own sake and (b) for the sake of open source innovation, use the tool that you admit is better.

About me: I've used and disliked the Pandas API for a long time. I'm very proactive about continual improvement in my learning, tooling, mindset, and skills.

[+] naiv|2 years ago|reply
The Polars lib changes rapidly. I am not using Copilot but achieved very good results with ChatGPT if you set system instructions to let it know that, e.g., with_column was replaced with with_columns, and add the updated doc information to the system instructions.
[+] __mharrison__|2 years ago|reply
Copilot support is basically nonexistent for Polars. It does a decent job of writing basic pandas... (But could do a lot better).
[+] bradhilton|2 years ago|reply
I use polars, but I've also run into this problem with copilot.
[+] mmastrac|2 years ago|reply
When we shipped Jupyter support in Deno, `nodejs-polars` was one of the cornerstone libraries for data science we supported.

https://blog.jupyter.org/bringing-modern-javascript-to-the-j...

I'm not personally a Data Science guy, but considering how early the JS/Jupyter ecosystem is, it was surprisingly quick to get pola.rs-based analysis up and running in TypeScript.

The JS bindings certainly need a bit of love, but hopefully now that it's more accessible we'll see some iteration on it.

[+] benrutter|2 years ago|reply
I'm really excited about Polars and its speed is super impressive buuutt... It annoys me to see vaex, modin and dask all compared on the same benchmarks.

For anyone who doesn't use those libraries, they are all targeted towards out-of-core data processing (i.e. computing across multiple machines because your data is too big). Comparing them to a single core data frame library is just silly, and they will obviously be slower because they necessarily come with a lot of overhead. It just wouldn't make sense to use polars in the same context as those libraries, so seeing them presented in benchmarks as if they are equivalents is a little silly.

And on top of that, duckdb, which you might use in the same context as polars and is faster than polars in a lot of contexts, isn't included in the benchmarks.

The software engineering behind polars is amazing work and there's no need to have misleading benchmarks like this.

[+] wenc|2 years ago|reply
I don’t use Polars directly, but instead I use it as a materialization format in my DuckDB workflows.

`duckdb.query(sql).pl()` is much faster than `duckdb.query(sql).df()`. It’s zero-copy to Polars and happens instantaneously, while Pandas takes quite a while if the DataFrame is big. And you can manipulate it like a Pandas DataFrame (albeit with slightly different syntax).

It’s great for working with big datasets.

[+] imgabe|2 years ago|reply
There must be a corollary to Greenspun's Tenth Rule (https://en.wikipedia.org/wiki/Greenspun's_tenth_rule) that any sufficiently complicated data analysis library contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of SQL.

I use Pandas from time to time and I'll probably try this out, but I always find myself wishing I'd just started with shoving whatever data I'm working with into Postgres.

It's not like I'm some database expert either, I'm far more comfortable with Python, but the facilities for selecting, sorting, filtering, joining, etc tabular data are just way better in SQL.
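
As a rough illustration of that point, here is a join, filter, aggregate and sort in a single SQL statement. The sketch uses Python's built-in `sqlite3` with an in-memory database as a stand-in for Postgres (an assumption made so it runs without a server), and the tables are invented.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 9.5), (1, 3.0), (3, 7.25), (2, 1.0);
    """
)

# Join, filter, aggregate and sort in one declarative statement.
rows = conn.execute(
    """
    SELECT u.name, SUM(o.amount) AS total
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.id
    WHERE o.amount > 2
    GROUP BY u.name
    ORDER BY total DESC
    """
).fetchall()

print(rows)
conn.close()
```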

[+] frogamel|2 years ago|reply
A few months ago I tried migrating a large pandas codebase to polars. I'm not much of a fan of doing analytics/data pipelining in Python - a complex transformation takes me 2-5x as long in pandas compared to Julia or R (using dataframes.jl & dplyr).

Unfortunately polars was not it. Too many bugs on standard operations, unreliable interoperability with pandas (which is an issue since so many libraries require pandas dataframes as inputs), the API is also very verbose for a modern dataframe library, though it's still better than pandas.

Hopefully these will get resolved over time, but for now I had the best luck using duckdb on top of pandas; it is as fast as polars but more stable, with better interoperability.

Eventually I hope the Python dataframe ecosystem gets to the same point as R's, where you have an analytics-oriented dataframe library with an intuitive API (dplyr) that can be easily used alongside a high-performance dataframe library (data.table).

[+] kelseyfrog|2 years ago|reply
My data science team evaluated Polars and came back with a mixed bag of results. If there was any performance-critical section, then we would consider employing it, but otherwise it was a marginal negative given the overhead of replacing Pandas across dozens of projects.
[+] recursive4|2 years ago|reply
I recently reached the limits of Pandas running on my 2020 16gb M1. Counting the number of times an element appears in a 1.7B row DataFrame using `df.groupby().size()` would consistently exceed available memory.

Rust Polars is able to handle this using Lazy DataFrames / Streaming without issue.

[+] sweezyjeezy|2 years ago|reply
FWIW I think df.column.value_counts() is better to use here in pandas.
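
For reference, the two spellings side by side on a tiny made-up column. This sketch only shows that they produce the same counts; it does not demonstrate anything about memory use on a 1.7B-row frame.

```python
import pandas as pd

# Hypothetical toy column, invented for this sketch.
df = pd.DataFrame({"element": ["a", "b", "a", "c", "a", "b"]})

# groupby(...).size() counts rows per distinct value, ordered by key.
by_group = df.groupby("element").size()

# value_counts() gives the same counts, ordered by frequency instead.
by_value = df["element"].value_counts()

print(by_value.to_dict())
```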
[+] nomilk|2 years ago|reply
Is the sole appeal of polars (vs, say, pandas) its execution speed?

I've found being able to express ideas clearly in code (to aid comprehension now and in the future) to be much more important than shaving off a few seconds of run time.

For this reason I think speed alone is not a strong selling point, except specifically in cases where execution times really matter.

Analogous somewhat to how ruby/rails might be a 'slow' language/framework (e.g. 600ms when another framework might be 200ms) but multiples faster in facilitating the expression of complex ideas through code, which tends to be the far bigger problem in most software projects.

[+] jh_zab|2 years ago|reply
I have ported a few internal libraries to polars from pandas and had great results.

I never liked pandas much due to its annoying API, indices and single thread only implementation (we usually get a 10x performance boost at least and for me that also means improvements in productivity). Also, pandas never handled NULLs properly. This should now work with the pyarrow backend in pandas, but we can’t read our parquet files generated by PySpark. With polars it mostly just works, but we use pyarrow to read/write parquet.

Overall I can recommend it; conversion from/to pandas DataFrames was never an issue either.

[+] spenczar5|2 years ago|reply
Polars is cool, but man, I really have come to think that dataframes are disastrous for software. The mess of internal state and confusion of writing functions that take “df” and manipulate it - it's all so hard to clean up once you’re deep in the mess.

Quivr (https://github.com/spenczar/quivr) is an alternative approach that has been working for me. Maybe types are good!

[+] theLiminator|2 years ago|reply
Polars is a lot better than pandas at maintaining valid state.

Because you ideally describe everything in terms of lazy operations, it actually internally keeps track of all your data types at every step until materialization and execution of the query plan. Because of that, you're not going to have the same kind of data type issues you might have in pandas. There are also libraries based off pydantic built for polars dataframes validation (see patito, though it's not mature yet).

[+] efxhoy|2 years ago|reply
Yes, the dataframe is very tricky to get right function signature wise. I used to write a lot of pandas heavy software and converged on writing most functions to take a dataframe or series and when possible just return a series, then put that series back into the df at the function call site. Handing dfs off to functions that mutate them gets gnarly very quickly.
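
A minimal sketch of that convention; the function name `net_price` and the columns are hypothetical:

```python
import pandas as pd


def net_price(df: pd.DataFrame, tax_rate: float) -> pd.Series:
    """Return a new Series; the input frame is left untouched."""
    return df["price"] * (1 + tax_rate)


df = pd.DataFrame({"price": [100.0, 250.0]})

# The only mutation happens here, visibly, at the call site.
df["net_price"] = net_price(df, tax_rate=0.25)

print(df)
```

Because the helper never assigns into `df`, a reader can see every column that gets added just by scanning the call sites.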
[+] ImageXav|2 years ago|reply
Like with many such projects, it's very helpful if you use DataFrames in isolation, but it lacks support from the wider scientific ecosystem. I've found that using Polars will often break other common data science packages such as scikit-learn. This unfortunately often makes it impractical in the wild.
[+] anonu|2 years ago|reply
We've been using polars in production for over a year as a replacement to pandas. It's been a good experience: smaller memory footprint, way faster and just more pleasant in general to code. Package is being developed quickly so things get deprecated quickly but I'm not one to complain.
[+] mmaunder|2 years ago|reply
Anyone got any real-world comparison with Pandas? Like an orders of magnitude wow moment?