top | item 15335462

Apache Arrow and the “Things I Hate About Pandas”

394 points | jbredeche | 8 years ago | wesmckinney.com

139 comments

[+] drej|8 years ago|reply
I started with Python because of (or rather, thanks to) pandas; it was my gateway drug. Over the past ~5 years I've done all sorts of things with it, including converting the whole company I worked at. At one employer, I sampled from our big data platform and used pandas instead, because the platform was tedious and slow to work with.

All that being said, I'd stress pretty clearly that I never let a single line of pandas into production. There are a few reasons I've long wanted to summarise, but just real quick: 1) It's a heavy dependency and things can go wrong. 2) It can act in unexpected ways: throw a missing value into a list of integers and you suddenly get floats (I know why, but still), or increase the number of rows beyond a certain threshold and type inference works differently. 3) It can be very slow, especially if your workflow is write-heavy (at the same time it's blazing fast for reads and joins in most cases, thanks to its columnar data structure). 4) The API evolves and breaking changes are not infrequent. That's fine for exploratory work, but not when you want to update libs in production.

pandas is an amazing library, the best at exploratory work, bar none. But I would not let it power some unsupervised service.

[+] chasedehan|8 years ago|reply
> pandas is an amazing library, the best at exploratory work,

I will add: "in Python." That is definitely true, but to call it "the best at exploratory work" is not accurate. I might be opening up a completely separate debate, but for down-and-dirty exploratory work nothing beats R's dplyr and ggplot.

With that being said, I now do most of my work in python because of putting models into production. I also haven't had any issues with pandas in production; maybe because I'm not doing high throughput operations and our ML application is relatively lightweight.

[+] filmor|8 years ago|reply
1) Conda helps with that quite a bit, Pandas is not a much heavier dependency than NumPy itself.

2) Depends a bit on your background, but to me this is not really unexpected. Integers don't have a well-defined "missing" value while floats do (NaN), so pandas is trying to help you by not falling back to Python objects and instead converting to the "most useful" array type. It only does so if it can convert the integers without loss of precision.
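The upcast the parent comments are discussing is easy to demonstrate (a minimal sketch; later pandas versions added nullable integer dtypes, but this was the behavior of the era):

```python
import pandas as pd

# A list of plain ints becomes an int64 Series...
s = pd.Series([1, 2, 3])
print(s.dtype)  # int64

# ...but a single missing value forces an upcast to float64,
# because NumPy integers have no NaN representation.
s = pd.Series([1, 2, None])
print(s.dtype)  # float64
```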

3) This one I totally get, I wrote a custom, msgpack-based serialisation due to that for our usage (before Arrow was around, seriously considering that for data exchange now).

4) Apart from the changes to `resample` all of those breaking changes had a prior `DeprecationWarning`, IIRC.

[+] pmart123|8 years ago|reply
pandas is a great example of data science code at its best and its worst. If you look at the source code, you will see that every object and every function allows for way too many variations in input options, and therefore has about 20 conditional statements. For instance, I believe DataFrame's init method can take a dictionary, DataFrame, Series, etc., versus a class method for each one. Contrast that with requests, where the public interface is a nice requests.get, requests.post. Yet have a CSV file you are only loading once or twice to peek at? Then it's super efficient. I think my biggest issue is all the effort that goes into pandas-like APIs, e.g. https://github.com/ibis-project/ibis. To me, it doesn't make sense to take something stable and known (SQL) and build a complex DSL on top so it works like pandas.
[+] aldanor|8 years ago|reply
Ditto, same here (prop trading); pandas in notebooks and for quick (to hack together, not to run) scripts; pure NumPy or C++ via pybind11 for releases
[+] krit_dms|8 years ago|reply
so you prototyped in pandas, and built production code around numpy arrays?
[+] stdbrouw|8 years ago|reply
Wes seems to be very focused on performance and big data applications these days, and of course it'd be great if Pandas could be used for bigger datasets, but when I hear people complain about Pandas they complain about:

1. the weird handling of types and null values (#4)

2. the verbosity of filtering like `dataframe[dataframe.column == x]` and transformations like `dataframe.col_a - dataframe.col_b`, compared to `dplyr` in R

3. warts on the indexing system (including MultiIndex, which is very powerful but confusing)

For those of us who use Pandas as an alternative to R, these usability shortcomings matter way more than memory efficiency.

[+] timClicks|8 years ago|reply
I too would welcome a friendlier pandas library, but every time I've tried to think of an API that would work I fail. Well actually, I keep on wanting pandas to understand SQL.
[+] chasedehan|8 years ago|reply
I will definitely echo (2). dplyr is amazing and works in far, far fewer lines of code than pandas. That was my largest issue when migrating over from R.

There is dplython, but it doesn't quite work the same so I don't use it much. https://github.com/dodger487/dplython

[+] paultopia|8 years ago|reply
#2 is a big issue for me. Filtering and subsetting are really arcane-feeling transformations, there's a lot of weirdness with view vs. copy, etc.
[+] filmor|8 years ago|reply
You can write 2) as dataframe.query("column == x").
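Side by side, the two spellings select the same rows (a sketch with made-up data; `@x` is query()'s syntax for referencing a local variable):

```python
import pandas as pd

df = pd.DataFrame({"column": [1, 2, 3], "other": ["a", "b", "c"]})
x = 2

# Boolean-mask filtering, the verbose form from the parent comment...
masked = df[df.column == x]

# ...and the query() equivalent; @x pulls in the local variable.
queried = df.query("column == @x")

print(masked.equals(queried))  # True
```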
[+] geocar|8 years ago|reply
I often want one of these middlewares.

Strings are a killer -- indeed any variable-length object makes array programming tricky when it's nested, so a sound strategy is to intern your strings first (in some way; KDB has enumerations, but language support isn't necessary: hashing the strings and saving an inverted index works well enough for a lot of applications). Interning strings means you see integers in your data operations, which is about as un-fun to program as it sounds. People want to be able to write something like:

    ….str.extract('([ab])(\d)', expand=False)
and then get disappointed that it's slow. Everything is slow when you do it a few trillion times, but slow things are really slow when you do them a few trillion times.

If we think about how we build our tables, we can store these as a single-byte column (or even a bitmask) and an int (or long) column, then we get fast again.
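The interning strategy described above can be sketched in pandas itself with `factorize()`, which splits a string column into an integer code array plus an inverted index (the ticker symbols here are made up for illustration):

```python
import pandas as pd

strings = pd.Series(["GOOG", "AAPL", "GOOG", "MSFT", "AAPL"])

# factorize() interns the strings: `codes` is an integer array,
# `uniques` is the inverted index back to the original values.
codes, uniques = pd.factorize(strings)
print(list(codes))    # [0, 1, 0, 2, 1]
print(list(uniques))  # ['GOOG', 'AAPL', 'MSFT']

# Operations run on the integers; strings are recovered on demand.
print(uniques[codes[3]])  # MSFT
```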

However it is clear "fast" and "let's use JSON" are incompatible, and a good middleware or storage system isn't going to make me trade.

[+] lifeisstillgood|8 years ago|reply
Could you expand on this (I'm asking because you clearly intend something you have thought about a lot, but I am missing the point - it's me not you)

As far as I understand you want to handle nested arrays of strings in your data. Ok

The "right" way is to build an index of the strings we are storing and then store the index values (hashes of some kind) in the arrays as longs

This way our arrays are doing numbers and we handwavy search for or use strings through some wrapper

Is this right?

And I am guessing the middleware you want does this transparently? Maybe storing the index alongside the data in some fashion

[+] jampekka|8 years ago|reply
I would love to see some sort of "smarter" indexing in the engine. I use pandas quite a bit, but I've never really understood the rationale behind the indexing, especially why indexes are treated so separately from data columns. I seem to be resetting and recreating indexes all the time, and I use .values a lot.

More SQL-style indexing would be a lot more intuitive at least for me.
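The reset/recreate round trip the comment above describes looks like this (a minimal sketch with hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"user": ["a", "b"], "score": [1.0, 2.0]})

# Promote a data column to the index for label-based lookup...
indexed = df.set_index("user")
print(indexed.loc["b", "score"])  # 2.0

# ...then demote it back to a regular column, the dance many
# pandas users end up doing constantly.
print(indexed.reset_index().equals(df))  # True
```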

[+] nerdponx|8 years ago|reply
I used to hate it, but I've come around to its usefulness in some cases.

However, I do prefer the R data.table model, which is what you describe. You can set an index on one or more columns in the table, and that's that.

[+] rkwasny|8 years ago|reply
Apache Arrow is the next big thing in data analytics. Imagine doing a SQL query in MapD (executes in milliseconds on GPU), then passing the data zero-copy into Python, doing some calculations in pandas, and outputting the result to a web interface. Because everything is zero-copy, it can be done faster than ever before.
[+] F_J_H|8 years ago|reply
What is your "go to" web interface for displaying/visualizing data?

I use Oracle APEX because it has a killer "interactive report" feature (i.e. a data grid on steroids), which enables non-programmers to easily filter, aggregate, export, report on, etc., the data. However, although APEX is a free option that comes with the DB, it ties you to Oracle.

It would be great if there was a similar, database independent, low-code tool like APEX out there, so am curious what you have seen to work well.

[+] kornish|8 years ago|reply
(Disclaimers: I don't have much experience using Python to build data science products; potentially silly questions)

In industry, does Pandas tend to power the application layer, or does it find more use as an exploratory data tool?

If the latter, do people prefer to push computation down into OLAP databases for performance reasons?

And if so, what impact will the convergence of libraries and database functionality have on product development? These features strike me as things that you'd find in a database, e.g. query optimizer. I know in the past couple years there have been a couple commercial acquisitions of in-memory execution engines, e.g. Hyper by Tableau.

[+] nigelcleland|8 years ago|reply
We currently use a combination of Pandas and Scikit-Learn to run our production models. We're not in the big data space; instead, we create small, tightly tuned models for a very specific purpose at a large energy company.

At the moment the general work flow is:

* Internal library built on top of Pandas which abstracts our mess of internal databases

* Application specific model code that utilises the internal library to pull data in. This is then fed into a trained scikit-learn model and then further processed by Pandas.

* Internal monitoring tools (dashboards based on Plotly and Flask, as well as an alerting system) are built using the internal library and Pandas as the glue.

As a design decision, we made Pandas the root source of all data. Everything is a DataFrame throughout the entire application.

Pain points:

* Writing to a database is pretty painful (SQL Server here as Windows shop).

* Minor API changes can be irritating.

* Pandas MultiIndexing is both very painful and mind-bending, especially when trying to get the slice syntax to work.
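For reference, the MultiIndex slice syntax that trips people up can be tamed a little with `pd.IndexSlice` (a sketch with made-up site/year data):

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["plantA", "plantB"], [2016, 2017]], names=["site", "year"]
)
df = pd.DataFrame({"output": [10, 12, 9, 11]}, index=idx)

# pd.IndexSlice makes the slice readable: all sites, year 2017.
# Note the index must be lexsorted for label slicing to work.
sliced = df.loc[pd.IndexSlice[:, 2017], :]
print(len(sliced))  # 2
```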

Overall though, Pandas is a huge value add, and we've gradually rolled it out from 2 people to approximately 9-10 people who hadn't used Python in anger before.

Almost all reporting functionality is being migrated into Pandas instead of SQL stored procs, Excel, Tableau, etc., for the additional flexibility it provides.

[+] wesm|8 years ago|reply
Wes here. From what I understand, pandas is the middleware layer powering about 90% (if not more) of analytical applications in Python. It is what people use for data ingest, data prep, and feature engineering for machine learning models.

The existence of other database systems that perform equivalent tasks isn't useful if they are not accessible to Python programmers with a convenient API.

[+] ves|8 years ago|reply
We use pandas as a last-mile library for offline exploration. Typically, datasets have been sampled or aggregated enough that they fit inside 10 GB, so you can work with them comfortably using pandas. I don't like using pandas in prod because the performance is really sensitive to stuff like missing a type declaration or calling the wrong method, and the API is really convoluted.

E: that's not to say pandas isn't good. It's really good. Thanks for the software, Wes!

[+] tanilama|8 years ago|reply
> In industry, does Pandas tend to power the application layer, or does it find more use as an exploratory data tool?

My experience echoes yours. Pandas, from my observation, is more of a post-modeling tool that people use to further process data they get from a DB query or Spark jobs.

After reading through the Arrow homepage, I am left somewhat baffled about where it sits. If my reading is correct, it is a client-side protocol that abstracts away the underlying data storage implementations? If so, isn't it still limited by how much data the client machine can handle? Or is the benefit the unified interface for accessing different storage systems? No matter what, it seems pretty ambitious. Looking forward to seeing how it goes.

[+] atupis|8 years ago|reply
I usually use pandas as an exploratory data tool and then rewrite the code using numpy, because pandas needs much more memory and is a lot slower.
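A quick way to see where the extra memory goes for a given frame (a sketch; the actual overhead depends heavily on dtypes, and object/string columns are where it balloons):

```python
import numpy as np
import pandas as pd

arr = np.arange(1_000_000, dtype=np.float64)
df = pd.DataFrame({"x": arr})

# Raw ndarray: exactly 8 bytes per float64.
print(arr.nbytes)  # 8000000

# The equivalent DataFrame adds index and block overhead on top;
# deep=True also counts the Python objects in object columns.
print(int(df.memory_usage(deep=True).sum()))
```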
[+] Dowwie|8 years ago|reply
Anyone here who is smart enough to write alternative solutions to the problems that pandas solves is capable of making a meaningful contribution to the pandas project, yet it seems that once pandas reaches the limits of its usefulness people go off and write a proprietary solution, never giving back.

What stopped you from contributing improvements to pandas? Have you taken alternate routes to open source your work?

[+] antod|8 years ago|reply
Not very familiar with pandas, but it looks like the author of this post is the creator of pandas.
[+] coliveira|8 years ago|reply
You probably never took a deep look at pandas. It is a very complex library with lots of dependencies. It is not surprising that it is easier to implement an alternative rather than change the existing one.
[+] mojoe|8 years ago|reply
There are pretty huge barriers to open sourcing work in certain large companies.
[+] robochat42|8 years ago|reply
This is exciting stuff but will it have any downsides for the majority (??) of users who don't use pandas for big data? Also, this all sounds very similar to the Blaze ecosystem, whatever happened to that? Finally, will arrow/feather replace hdf5 and bcolz in the future?
[+] misnome|8 years ago|reply
Blaze was also my thought - I'd love to know how this/these proposals match up with what Blaze is doing/planning to do.
[+] sevensor|8 years ago|reply
Pandas is the library I wish I'd had in the late '00s when my employer decided our site license for JMP was too costly. Well, really pandas plus matplotlib plus Jupyter notebooks. My job frequently involved creating plots and putting them in Powerpoint. Often the same plot, day in and day out, with new data from the production line. An interactive tool that can automate this, with a low barrier to entry, can save an incredible amount of time. Since I discovered pandas, I've been recommending it to anybody who works in a putting-plots-in-powerpoint job. And there are a lot of people who have jobs like that.
[+] teekert|8 years ago|reply
I'd add seaborn to that; it works perfectly with pandas.DataFrames. Often it creates exactly what you want with minimal input (just a dataframe), e.g.:

    import seaborn as sns
    sns.violinplot(data=dataframe)

Set %matplotlib inline and you don't need any more commands in your notebook.

[+] taeric|8 years ago|reply
That sounds more amenable to an Excel sheet, honestly. Which I suppose is not that surprising, since spreadsheets were the original freeform notebook style program.
[+] kortex|8 years ago|reply
Look into Jupyter dashboard
[+] pleasecalllater|8 years ago|reply
The pandas memory consumption is hilarious.

The last time I tried to use pandas, it was on the Hacker News data dump. It wasn't big. However, once pandas started loading it into memory, my 32 GB was just too little.

I ended up converting the data within Postgres instead: much faster, with sensible memory usage.

[+] goatlover|8 years ago|reply
How does Julia DataFrames compare to R & Pandas for the 11 issues he mentioned?
[+] twic|8 years ago|reply
Noob question: what is the relationship between Arrow, Parquet, and ORC? Do we need all three?
[+] jamesblonde|8 years ago|reply
Parquet and ORC are columnar on-disk data formats that power SQL-on-Hadoop engines (Impala/Spark SQL and Hive, respectively). Arrow is an in-memory representation (Parquet/ORC are on-disk). The idea is that workflows in different languages or frameworks can share the same in-memory representation, without having to rebuild it just because you're going from Spark to another framework.
[+] wodenokoto|8 years ago|reply
Does anyone know how dataframe in R compares on these 10/11 points?
[+] huac|8 years ago|reply
I'd say native data frames in R aren't great at these (maybe #4, maybe #9). I'm excited to see how Arrow can perform, and hopefully we'll see solid bindings to R as well.

The data.table package (https://github.com/Rdatatable/data.table/wiki) does make progress on some of these - I'd say #1, #3, maybe #7, #8. Dplyr has a query planner too, fwiw.

[+] sandGorgon|8 years ago|reply
This is why I was hoping ONNX (Facebook+Microsoft's new machine learning serialization format) was built on top of Arrow rather than proto2.

Just like Feather is built on top of Arrow, ONNX could be based on top of Arrow.

[+] StreamBright|8 years ago|reply
This is huge. Performance matters (even in 2017) and we need to do things the right way. Projects like Julia and Apache Arrow are paving the way for high-performance analytics, even for large data sets.
[+] anentropic|8 years ago|reply
> Logical operator graphs for graph dataflow-style execution (think TensorFlow or PyTorch, but for data frames)

> A multicore scheduler for parallel evaluation of operator graphs

Does anything like this already exist somewhere?

[+] quotemstr|8 years ago|reply
Pandas is definitely powerful, if somewhat mind-bending at first when you're used to a relational, SQL world. It's never been clear to me why Pandas wasn't more copy-on-write from the start: it's difficult to predict which operations copy.
[+] Myrmornis|8 years ago|reply
As has been said before, the problem with pandas is the confusing and hard-to-remember python API.