In my world, anything that isn't "identical to R's dplyr API but faster" just isn't quite worth switching for. There's absolutely no contest: dplyr has the most productive API, and that matters to me more than anything else. But I'm glad to see Polars moving away from the kludgey sprawl of the pandas API toward the perfection of dplyr... while also being blazingly fast!
Now just mix in a bit of DSL so people aren't obligated* to write lame boilerplate like "pandas.blahblah" or "polars.blahblah" just to reference a freaking column, and you're there!
*If you like the boilerplate for "production robustness" or whatever, go wild, but analysts and scientists benefit from the option to write more concisely.
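For what it's worth, pandas already ships a small string DSL (`query`/`eval`) that avoids repeating the frame name for every column reference; something in this spirit is presumably what's being asked for:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# The usual boilerplate: the frame name repeated for every column reference.
verbose = df[df["x"] > 1]

# pandas' string mini-DSL: columns are referenced bare inside the expression.
concise = df.query("x > 1")

# eval() builds a new column from an expression string (returns a new frame).
with_z = df.eval("z = x + y")
```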
I've been working on a dataframe library for Elixir that's built on top of Polars and heavily influenced by dplyr, if you're interested in checking it out: https://github.com/elixir-nx/explorer
Also worth plugging the speed of R's data.table package, which continues to trump dplyr to this day. The syntax is also more compact and straightforward once you understand how to query data with it.
The dplyr API is not ideal in my experience: it's overly verbose, with confusing group/melt/cast operators. I much, much prefer data.table. You mention concision in your edit; data.table is practically the platonic ideal of that!
Is there a dplyr API for pandas? That would seem like a very valuable "translation" layer for transitioning or cross-language devs. Maybe there is some language barrier to implementing an elegant/faithful version in Python?
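One lightweight approach, sketched here with hypothetical verb names rather than any existing library, is to build dplyr-style verbs as plain functions and chain them through pandas' own `.pipe`:

```python
import pandas as pd

# Hypothetical dplyr-style verbs: each takes the frame first, returns a new frame.
def mutate(df, **kwargs):
    return df.assign(**kwargs)

def filter_rows(df, pred):
    return df[pred(df)]

def select(df, *cols):
    return df[list(cols)]

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

out = (df
       .pipe(mutate, z=lambda d: d.x + d.y)   # like dplyr's mutate()
       .pipe(filter_rows, lambda d: d.z > 5)  # like filter()
       .pipe(select, "x", "z"))               # like select()
```

The lambdas stand in for R's non-standard evaluation, which is the main thing Python can't replicate.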
I built my own data frame implementation on top of NumPy, specifically trying to accomplish a better API, similar to dplyr. It's not exactly the same naming or operations, but it should feel familiar and much simpler and more consistent than pandas. And no indexes or axes.
Having done this, a couple of notes on what will unavoidably differ in Python:
* It probably makes more sense in Python to use classes, so method chaining instead of function piping. I wish one could syntactically skip the enclosing parentheses in Python, though; method chains look a bit verbose.
* Python doesn't have R's "non-standard evaluation", so you end up needing lambda functions for arguments in method chains, group-wise aggregation, etc. I'd be interested if someone has a better solution.
* NumPy (and Pandas) is still missing a proper missing value (NA). It's a big pain to try to work around that.
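The NA point is easy to demonstrate: NumPy has no integer-capable missing value, so a single NaN silently promotes the whole array to float, while pandas' nullable extension dtypes work around it:

```python
import numpy as np
import pandas as pd

# A NaN forces promotion to float64 — NumPy has no integer NA.
with_nan = np.array([1, 2, np.nan])
print(with_nan.dtype)  # float64

# pandas' nullable Int64 extension dtype keeps the integers intact.
s = pd.Series([1, 2, None], dtype="Int64")
```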
You're clearly on the dplyr bandwagon, but as someone who wrote R code for about 10 years before dplyr came along and saw the direction the language was going, it's the reason I now mainly use Python. I just could not put up with the non-standard evaluation, so everything ends up being a 100+ line script instead of composable functions, plus breaking API changes every 6 months.
> No Index
> They are not needed. Not having them makes things easier. Convince me otherwise
Agree completely. First-class indices in pandas just complicate everything by having a specially blessed column that can't be manipulated consistently. Secondary indices should be "just" an optimization, while primary indices are a constraint on the whole table (not a single column).
The library in general seems interesting. I'm not 100% sold on the syntax (as usual, the projection operation is called select...), but it is not pandas, which is already a huge plus.
Yeah... this confusion is in the API as well (you can pass a projection to IO readers). We used `select` because of SQL. In the logical plan we make the correct distinction between selection and projection, but you don't see that very much in the API.
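In relational-algebra terms (shown here with pandas, since the thread is Python-centric): selection picks rows, projection picks columns, and SQL's SELECT keyword actually performs projection, which is the naming overload in question:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

projection = df[["a"]]       # projection: choose columns (SQL: SELECT a FROM t)
selection = df[df["a"] > 1]  # selection: choose rows (SQL: WHERE a > 1)
```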
There are so many dataframe libraries, many of which have APIs closely following pandas without being drop-in replacements. I wish we could agree on a standard describing the core parts of what a dataframe must do, such that code depending only on those operations could easily move between dataframes.
This was my PhD focus. We identified a core "dataframe algebra"[1] that encompasses all of pandas (and R/S data.frames): a total of 16 operators that cover all 600+ operators of pandas. What you describe was exactly our aim. It turns out there are a lot of operators that are really easy to support and make fast, and that gets you about 60% or so of the way to supporting all of pandas. Then there are really complex operators that may alter the schema in a way that cannot be determined before the operation is carried out (think a row-wise or column-wise `df.apply`). The flexibility that pandas offers is something we were able to express mathematically, and with that math we can start to optimize the dataframe holistically, rather than chipping away at small parts of pandas that are embarrassingly parallel.
Most dataframe libraries cannot architecturally support the entire dataframe algebra and data model because they are optimized for specific use-cases (which is not a bad thing). It can be frustrating for users who may have no idea what they can do with a given tool just because it is called "dataframe", but I don't know how to fix that.
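The schema-indeterminacy point can be seen with a plain pandas `apply`: the output schema depends on what the user function returns, which is unknowable until it runs:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# The result's columns ("s", "p") exist only because this particular lambda
# returns them — no engine can know the output schema before execution.
wide = df.apply(lambda row: pd.Series({"s": row.sum(), "p": row.prod()}), axis=1)
```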
In Julia there's something better, called Tables.jl. It's not exactly an API for dataframes (what would be the point of that? You don't need many implementations of dataframes, you just need one great one). Instead it's an API for table-shaped data. Dataframes are containers for table-shaped data.
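A rough Python analogue of the idea, as a sketch: generic table code targets a minimal protocol (here, simply "a mapping from column names to sequences of values") rather than any particular dataframe class:

```python
# Any object satisfying the protocol works — a dict of lists, a pandas
# DataFrame via dict(df), a database result converted to columns, etc.
def column_means(table):
    """Mean of every column in a mapping of name -> sequence of numbers."""
    return {name: sum(vals) / len(vals) for name, vals in table.items()}

t = {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}
means = column_means(t)
```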
I wrote a library that wraps polars DataFrame and Series objects to allow you to use them with the same syntax as with pandas DataFrame and Series objects. The goal is not to be a replacement for polars' objects and syntax, but rather to (1) Allow you to provide (wrapped) polars objects as arguments to existing functions in your codebase that expect pandas objects and (2) Allow you to continue writing code (especially EDA in notebooks) using the pandas syntax you know and (maybe) love while you're still learning the polars syntax, but with the underlying objects being all-polars. All methods of polars' objects are still available, allowing you to interweave pandas syntax and polars syntax when working with MppFrame and MppSeries objects.
Furthermore, the goal should always be to transition away from this library over time, as the LazyFrame optimizations offered by polars can never be fully taken advantage of when using pandas-based syntax (as far as I can tell). In the meantime, the code in this library has allowed me to transition my company's pandas-centric code to polars-centric code more quickly, which has led to significant speedups and memory savings even without being able to take full advantage of polars' lazy evaluation. To be clear, these gains have been observed both when working in notebooks in development and when deployed in production API backends / data pipelines.
I'm personally just adding methods to the MppFrame and MppSeries objects whenever I try to use pandas syntax and get AttributeErrors.
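The core trick can be sketched in a few lines (hypothetical names; `FakeBackend` stands in for a real polars DataFrame so the sketch is self-contained): unknown attributes fall through to the wrapped object, and only pandas-specific signatures need explicit translation:

```python
class FakeBackend:
    """Stand-in for the wrapped backend (a polars DataFrame in the real library)."""
    def __init__(self, cols):
        self.cols = cols

    def rename(self, mapping):
        return FakeBackend({mapping.get(k, k): v for k, v in self.cols.items()})


class WrappedFrame:
    def __init__(self, backend):
        self._backend = backend

    def __getattr__(self, name):
        # Fall through to the backend, so all native methods stay available.
        return getattr(self._backend, name)

    def rename(self, columns):
        # A pandas-style signature translated to the backend's own method.
        return WrappedFrame(self._backend.rename(columns))


wf = WrappedFrame(FakeBackend({"a": [1, 2]}))
renamed = wf.rename({"a": "b"})
```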
They have a benchmark for expressiveness (as opposed to performance). Part of this inquiry has been to form a "standard library" of Dataframes operations.
Polars could bring the best of both worlds together if it could codegen Python API calls to their Rust equivalents. A user conducts ad-hoc analysis and rapid development in Python; when the work is ready to ship, they invoke a codegen that transforms it into the Rust-equivalent API calls, producing a new Rust module.
I've been using it for the past quarter. In addition to the speed, I'm very pleased with the PySpark-esque API. This means migrating code from research to production is that much easier.
I'm confused. Polars is built on top of the Rust implementation of Apache Arrow. Arrow already has Python bindings. What does this project add by creating a new Python binding on top of the Rust one?
I'm reading all these comments and keep asking myself if I'm missing something, because I honestly sort of like pandas' API?
Sure, dplyr is nice -- it felt that way on the rare occasions I got to use it, at least -- but you get used to anything.
So since I'm using Python and know it quite well, I'm just more comfortable sticking with pandas rather than switching to R for data processing.
If you use pandas daily, maybe you get used to it and can ignore the issues, but for anyone using pandas occasionally, it's a huge pain every time trying to figure out how to use it. The API is not intuitive, and the documentation is verbose and unclear. And the top Stack Overflow answers are often the "old way" of doing something, after yet another way of doing the same thing has been added to the API.
For some people pandas seems to click. Good for you. I always struggle with Google and the manual to get even simple things done.
I can never figure out if I am gonna get a series or a data frame out of an operation. It seems to edit rows when I think it’ll edit columns and I constantly have to explicitly reset the index not to get into problems.
I think dplyr is easy to read and write. It does get longer than other alternatives, but the readability is IMHO so good that it doesn't feel verbose.
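The Series-versus-DataFrame surprise (and the groupby index shuffle) mentioned above is easy to demonstrate:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

s = df["a"]      # a single label returns a Series
f = df[["a"]]    # a list of labels returns a DataFrame

g = df.groupby("a").sum()                  # "a" silently becomes the index
h = df.groupby("a", as_index=False).sum()  # as_index=False keeps "a" a column
```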
It's just so bloated and verbose: many ways to do the same things, annoying defaults (how is column not the default axis to drop?), indices that are beyond frustrating (I have never met anyone who doesn't just reset them after a groupby), inconvenient custom aggregations, very slow, not opinionated enough.
Then there are the inherent Python issues, like dates and times, poor support for non-standard evaluation, and handling mixed data types and nulls.
I've never seen the term "dataframe" used as it is on this website, and the commenters here all seem to use it. Judging by the examples it seems to just refer to a "row" from e.g. a CSV or SQL query. So is that all it is, or am I missing something?
Does anybody here know of dataframe systems that are able to handle file sizes bigger than the available RAM? Is Polars able to handle this? I am only aware of disk.frame (diskframe.com), but don't know how well it performs.
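For context, the basic idea behind every out-of-core dataframe engine can be shown with pandas' own chunked CSV reader: only a bounded window of rows is ever in memory, and lazy engines like Polars' query planner automate the same thing behind the scenes:

```python
import io
import pandas as pd

# A small in-memory stand-in for a file far bigger than RAM.
csv = io.StringIO("x\n" + "\n".join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(csv, chunksize=4):  # at most 4 rows in memory at once
    total += chunk["x"].sum()
```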
It looks interesting but phrases like "embarrassingly parallel execution" make my marketing hype detectors trigger. Maybe they could tone down their self promotion just a touch. Also "Even though Polars is completely written in Rust (no runtime overhead!) ...". I find that hard to believe.
The "embarrassingly parallel" is aimed at the expression API. This allows one to write multiple expressions, all of which get executed in parallel. (So "embarrassingly" meaning they don't have to communicate or use locks.)
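A minimal illustration of the claim, in plain Python: independent per-column computations share no state, so they can be fanned out with no locks or coordination:

```python
from concurrent.futures import ThreadPoolExecutor

cols = {"a": [1, 2, 3], "b": [4, 5, 6]}

def total(values):
    # Touches only its own column: no shared state, hence no locks needed.
    return sum(values)

with ThreadPoolExecutor() as pool:
    results = dict(zip(cols, pool.map(total, cols.values())))
```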
pdeffebach | 4 years ago
Here is a tutorial for those familiar with dplyr: https://juliadata.github.io/DataFramesMeta.jl/stable/dplyr/
otsaloma | 4 years ago
https://github.com/otsaloma/dataiter
pietroppeter | 4 years ago
Being in Nim, it will also be easy to add sweet DSLs.
cabalamat | 4 years ago
Ths s lbrry whs nm nds mr vwls. F m tlkng t smn, hw m sppsd t prnc t?
[1] https://arxiv.org/pdf/2001.00888
austospumanto | 4 years ago
pip install minimal-pandas-api-for-polars
chrisaycock | 4 years ago
https://news.ycombinator.com/item?id=29509439
unixhero | 4 years ago
I have used Pandas a lot for data analysis and for data integration duct tape scenarios. For me it has been a low bar for achieving a lot.
StreamBright | 4 years ago
I have no idea what the developers' intention is most of the time.
nojito | 4 years ago
The benchmarks speak volumes.
https://h2oai.github.io/db-benchmark/