In my world, anything that isn't "identical to R's dplyr API but faster" just isn't quite worth switching for. There's absolutely no contest: dplyr has the most productive API, and that matters to me more than anything else. But I'm glad to see Polars moving away from the kludgey sprawl of the pandas API toward the perfection of dplyr... while also being blazingly fast!
Now just mix in a bit of DSL so people aren't obligated* to write lame boilerplate like "pandas.blahblah" or "polars.blahblah" just to reference a freaking column, and you're there!
*If you like the boilerplate for "production robustness" or whatever, go wild, but analysts and scientists benefit from the option to write more concisely.
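For what it's worth, pandas already ships a small string DSL (`query`/`eval`) that avoids repeating the frame name for every column reference; something in this spirit is presumably what's being asked for:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# The usual boilerplate: the frame name repeated for every column reference.
verbose = df[df["x"] > 1]

# pandas' string mini-DSL: columns are referenced bare inside the expression.
concise = df.query("x > 1")

# eval() builds a new column from an expression string (returns a new frame).
with_z = df.eval("z = x + y")
```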
I've been working on a dataframe library for Elixir that's built on top of Polars and heavily influenced by dplyr, if you're interested in checking it out: https://github.com/elixir-nx/explorer
Also worth plugging the speed of R's data.table package, which continues to trump dplyr to this day. The syntax is also more compact and straightforward once you understand how to query data with it.
The dplyr API is not ideal in my experience: it's overly verbose, with confusing group/melt/cast operators. I much, much prefer data.table. You mention concision in your edit; data.table is practically the platonic ideal of that!
Is there a dplyr API for pandas? That would seem like a very valuable "translation" layer for transitioning or cross-language devs. Maybe there is some language barrier to implementing an elegant/faithful version in Python?
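One lightweight approach, sketched here with hypothetical verb names rather than any existing library, is to build dplyr-style verbs as plain functions and chain them through pandas' own `.pipe`:

```python
import pandas as pd

# Hypothetical dplyr-style verbs: each takes the frame first, returns a new frame.
def mutate(df, **kwargs):
    return df.assign(**kwargs)

def filter_rows(df, pred):
    return df[pred(df)]

def select(df, *cols):
    return df[list(cols)]

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

out = (df
       .pipe(mutate, z=lambda d: d.x + d.y)   # like dplyr's mutate()
       .pipe(filter_rows, lambda d: d.z > 5)  # like filter()
       .pipe(select, "x", "z"))               # like select()
```

The lambdas stand in for R's non-standard evaluation, which is the main thing Python can't replicate.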
I built my own data frame implementation on top of NumPy, specifically trying to accomplish a better API, similar to dplyr. It's not exactly the same naming or operations, but it should feel familiar and much simpler and more consistent than pandas. And no indexes or axes.
Having done this, a couple of notes on what will unavoidably differ in Python:
* It probably makes more sense in Python to use classes, so method chaining instead of function piping. I wish one could syntactically skip the enclosing parentheses in Python, though; method chains look a bit verbose.
* Python doesn't have R's "non-standard evaluation", so you end up needing lambda functions for arguments in method chains, group-wise aggregation, etc. I'd be interested if someone has a better solution.
* NumPy (and Pandas) is still missing a proper missing value (NA). It's a big pain to try to work around that.
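The NA point is easy to demonstrate: NumPy has no integer-capable missing value, so a single NaN silently promotes the whole array to float, while pandas' nullable extension dtypes work around it:

```python
import numpy as np
import pandas as pd

# A NaN forces promotion to float64 — NumPy has no integer NA.
with_nan = np.array([1, 2, np.nan])
print(with_nan.dtype)  # float64

# pandas' nullable Int64 extension dtype keeps the integers intact.
s = pd.Series([1, 2, None], dtype="Int64")
```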
You're clearly on the dplyr bandwagon, but as someone who wrote R code for about 10 years before dplyr came along and saw the direction the language was going, it's the reason I now mainly use Python. I just could not put up with the non-standard evaluation, so everything ends up being a 100+ line script instead of composable functions, plus breaking API changes every 6 months.
> No Index
> They are not needed. Not having them makes things easier. Convince me otherwise
Agree completely. First-class indices in pandas just complicate everything by having a specially blessed column that can't be manipulated consistently. Secondary indices should be "just" an optimization, while primary indices are a constraint on the whole table (not a single column).
The library in general seems interesting. I'm not 100% sold on the syntax (as usual, the projection operation is called select...), but it is not pandas, which is already a huge plus.
Yeah... this confusion is in the API as well (you can pass a projection to IO readers). We used `select` because of SQL. In the logical plan we make the correct distinction between selection and projection, but you don't see that very much in the API.
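In relational-algebra terms (shown here with pandas, since the thread is Python-centric): selection picks rows, projection picks columns, and SQL's SELECT keyword actually performs projection, which is the naming overload in question:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

projection = df[["a"]]       # projection: choose columns (SQL: SELECT a FROM t)
selection = df[df["a"] > 1]  # selection: choose rows (SQL: WHERE a > 1)
```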
There are so many dataframe libraries, many of which have APIs closely following pandas without being drop-in replacements. I wish we could agree on a standard describing the core parts of what a dataframe must do, such that code depending only on those operations could easily move between dataframes.
This was my PhD focus. We identified a core "dataframe algebra"[1] that encompasses all of pandas (and R/S data.frames): a total of 16 operators that cover all 600+ operators of pandas. What you describe was exactly our aim. It turns out there are a lot of operators that are really easy to support and make fast, and that gets you about 60% or so of the way to supporting all of pandas. Then there are really complex operators that may alter the schema in a way that cannot be determined before the operation is carried out (think a row-wise or column-wise `df.apply`). The flexibility that pandas offers is something we were able to express mathematically, and with that math we can start to optimize the dataframe holistically, rather than chipping away at small parts of pandas that are embarrassingly parallel.
Most dataframe libraries cannot architecturally support the entire dataframe algebra and data model because they are optimized for specific use-cases (which is not a bad thing). It can be frustrating for users who may have no idea what they can do with a given tool just because it is called "dataframe", but I don't know how to fix that.
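The schema-indeterminacy point can be seen with a plain pandas `apply`: the output schema depends on what the user function returns, which is unknowable until it runs:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# The result's columns ("s", "p") exist only because this particular lambda
# returns them — no engine can know the output schema before execution.
wide = df.apply(lambda row: pd.Series({"s": row.sum(), "p": row.prod()}), axis=1)
```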
In Julia there's something better, called Tables.jl. It's not exactly an API for dataframes (what would be the point of that? You don't need many implementations of dataframes, you just need one great one). Instead it's an API for table-shaped data. Dataframes are containers for table-shaped data.
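A rough Python analogue of the idea, as a sketch: generic table code targets a minimal protocol (here, simply "a mapping from column names to sequences of values") rather than any particular dataframe class:

```python
# Any object satisfying the protocol works — a dict of lists, a pandas
# DataFrame via dict(df), a database result converted to columns, etc.
def column_means(table):
    """Mean of every column in a mapping of name -> sequence of numbers."""
    return {name: sum(vals) / len(vals) for name, vals in table.items()}

t = {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}
means = column_means(t)
```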
I wrote a library that wraps polars DataFrame and Series objects to allow you to use them with the same syntax as with pandas DataFrame and Series objects. The goal is not to be a replacement for polars' objects and syntax, but rather to (1) Allow you to provide (wrapped) polars objects as arguments to existing functions in your codebase that expect pandas objects and (2) Allow you to continue writing code (especially EDA in notebooks) using the pandas syntax you know and (maybe) love while you're still learning the polars syntax, but with the underlying objects being all-polars. All methods of polars' objects are still available, allowing you to interweave pandas syntax and polars syntax when working with MppFrame and MppSeries objects.
Furthermore, the goal should always be to transition away from this library over time, as the LazyFrame optimizations offered by polars can never be fully taken advantage of when using pandas-based syntax (as far as I can tell). In the meantime, the code in this library has allowed me to transition my company's pandas-centric code to polars-centric code more quickly, which has led to significant speedups and memory savings even without being able to take full advantage of polars' lazy evaluation. To be clear, these gains have been observed both when working in notebooks in development and when deployed in production API backends / data pipelines.
I'm personally just adding methods to the MppFrame and MppSeries objects whenever I try to use pandas syntax and get AttributeErrors.
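The core trick can be sketched in a few lines (hypothetical names; `FakeBackend` stands in for a real polars DataFrame so the sketch is self-contained): unknown attributes fall through to the wrapped object, and only pandas-specific signatures need explicit translation:

```python
class FakeBackend:
    """Stand-in for the wrapped backend (a polars DataFrame in the real library)."""
    def __init__(self, cols):
        self.cols = cols

    def rename(self, mapping):
        return FakeBackend({mapping.get(k, k): v for k, v in self.cols.items()})


class WrappedFrame:
    def __init__(self, backend):
        self._backend = backend

    def __getattr__(self, name):
        # Fall through to the backend, so all native methods stay available.
        return getattr(self._backend, name)

    def rename(self, columns):
        # A pandas-style signature translated to the backend's own method.
        return WrappedFrame(self._backend.rename(columns))


wf = WrappedFrame(FakeBackend({"a": [1, 2]}))
renamed = wf.rename({"a": "b"})
```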
They have a benchmark for expressiveness (as opposed to performance). Part of this inquiry has been to form a "standard library" of Dataframes operations.
Polars could bring the best of both worlds together if it could codegen Python API calls to their Rust equivalents. A user conducts ad-hoc analysis and rapid development in Python; when the work is ready to ship, they invoke a codegen that transforms it into the Rust-equivalent API calls, producing a new Rust module.
I've been using it for the past quarter. In addition to the speed, I'm very pleased with the PySpark-esque API. This means migrating code from research to production is that much easier.
I'm confused. Polars is built on top of the Rust implementation of Apache Arrow. Arrow already has Python bindings. What does this project add by creating a new Python binding on top of the Rust one?
I'm reading all these comments and keep asking myself if I'm missing something, because I honestly sort of like pandas' API?
Sure, dplyr is nice -- it felt that way on the rare occasions I got to use it, at least -- but you get used to anything.
So since I'm using Python and know it quite well, I'm just more comfortable sticking with pandas rather than switching to R for data processing.
If you use pandas daily, maybe you get used to it and can ignore the issues, but for anyone using pandas occasionally, it's a huge pain every time trying to figure out how to use it. The API is not intuitive, and the documentation is verbose and unclear. And the top Stack Overflow answers are often the "old way" of doing something, after yet another way of doing the same thing has been added to the API.
For some people pandas seems to click. Good for you. I always struggle with Google and the manual to get even simple things done.
I can never figure out if I am gonna get a series or a data frame out of an operation. It seems to edit rows when I think it’ll edit columns and I constantly have to explicitly reset the index not to get into problems.
I think dplyr is easy to read and write. It does get longer than other alternatives, but the readability is IMHO so good that it doesn't feel verbose.
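The Series-versus-DataFrame surprise (and the groupby index shuffle) mentioned above is easy to demonstrate:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

s = df["a"]      # a single label returns a Series
f = df[["a"]]    # a list of labels returns a DataFrame

g = df.groupby("a").sum()                  # "a" silently becomes the index
h = df.groupby("a", as_index=False).sum()  # as_index=False keeps "a" a column
```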
It's just so bloated and verbose: many ways to do the same things, annoying defaults (how is column not the default axis to drop?), indices that are beyond frustrating (I have never met anyone who doesn't just reset them after a groupby), inconvenient custom aggregations, very slow, not opinionated enough.
Then there are the inherent Python issues, like dates and times, poor support for non-standard evaluation, and handling mixed data types and nulls.
I've never seen the term "dataframe" used as it is on this website, and the commenters here all seem to use it. Judging by the examples it seems to just refer to a "row" from e.g. a CSV or SQL query. So is that all it is, or am I missing something?
Does anybody here know of dataframe systems that are able to handle file sizes bigger than the available RAM? Is Polars able to handle this? I am only aware of disk.frame (diskframe.com), but don't know how well it performs.
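For context, the basic idea behind every out-of-core dataframe engine can be shown with pandas' own chunked CSV reader: only a bounded window of rows is ever in memory, and lazy engines like Polars' query planner automate the same thing behind the scenes:

```python
import io
import pandas as pd

# A small in-memory stand-in for a file far bigger than RAM.
csv = io.StringIO("x\n" + "\n".join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(csv, chunksize=4):  # at most 4 rows in memory at once
    total += chunk["x"].sum()
```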
It looks interesting but phrases like "embarrassingly parallel execution" make my marketing hype detectors trigger. Maybe they could tone down their self promotion just a touch. Also "Even though Polars is completely written in Rust (no runtime overhead!) ...". I find that hard to believe.
The "embarrassingly parallel" is aimed at the expression API. This allows one to write multiple expressions, all of which get executed in parallel. (So "embarrassingly" meaning they don't have to communicate or use locks.)
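A minimal illustration of the claim, in plain Python: independent per-column computations share no state, so they can be fanned out with no locks or coordination:

```python
from concurrent.futures import ThreadPoolExecutor

cols = {"a": [1, 2, 3], "b": [4, 5, 6]}

def total(values):
    # Touches only its own column: no shared state, hence no locks needed.
    return sum(values)

with ThreadPoolExecutor() as pool:
    results = dict(zip(cols, pool.map(total, cols.values())))
```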
pdeffebach | 4 years ago
Here is a tutorial for those familiar with dplyr: https://juliadata.github.io/DataFramesMeta.jl/stable/dplyr/
otsaloma | 4 years ago
https://github.com/otsaloma/dataiter
pietroppeter | 4 years ago
Being in Nim, it will also be easy to add sweet DSLs.
cabalamat | 4 years ago
Ths s lbrry whs nm nds mr vwls. F m tlkng t smn, hw m sppsd t prnc t?
[1] https://arxiv.org/pdf/2001.00888
austospumanto | 4 years ago
pip install minimal-pandas-api-for-polars
chrisaycock | 4 years ago
https://news.ycombinator.com/item?id=29509439
unixhero | 4 years ago
I have used Pandas a lot for data analysis and for data integration duct tape scenarios. For me it has been a low bar for achieving a lot.
StreamBright | 4 years ago
I have no idea what the developers' intention is most of the time.
nojito | 4 years ago
The benchmarks speak volumes.
https://h2oai.github.io/db-benchmark/