I started with Python because of (or rather, thanks to) pandas; it was my gateway drug. Over the past ~5 years I've done all sorts of things with it, including converting the whole company I worked at. At one of my employers, I sampled our big data platform, found it tedious and slow to work with, and used pandas instead.
All that being said, I'd stress pretty clearly that I never let a single line of pandas into production. There are a few reasons that I've long wanted to summarise, but just real quick:

1) It's a heavy dependency, and things can go wrong.

2) It can act in unexpected ways: throw an empty value into a list of integers and you suddenly get floats (I know why, but still), or increase the number of rows beyond a certain threshold and type inference works differently.

3) It can be very slow, especially if your workflow is write-heavy (at the same time, it's blazing fast for reads and joins in most cases, thanks to its columnar data structure).

4) The API evolves and breaking changes are not infrequent. That's a great thing for exploratory work, but not when you want to update libraries in production.
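The int-to-float promotion in point 2 is easy to reproduce: a single missing value forces the column out of NumPy's `int64`, which has no representation for NaN.

```python
import pandas as pd

# A clean integer column gets an integer dtype...
s = pd.Series([1, 2, 3])
assert str(s.dtype) == "int64"

# ...but one missing value silently promotes the whole column to float64,
# because NumPy's int64 has no way to store NaN.
s2 = pd.Series([1, 2, None])
assert str(s2.dtype) == "float64"
```

(Newer pandas versions add a nullable `Int64` extension dtype that keeps integers with missing values, but you have to opt in.)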
pandas is an amazing library, the best at exploratory work, bar none. But I would not let it power some unsupervised service.
> pandas is an amazing library, the best at exploratory work,
I will add ... "in python." That is definitely true, but to call it "the best at exploratory work" is not accurate. I might be opening up a completely separate debate, but for down and dirty exploratory work nothing beats R's dplyr and ggplot.
With that being said, I now do most of my work in python because of putting models into production. I also haven't had any issues with pandas in production; maybe because I'm not doing high throughput operations and our ML application is relatively lightweight.
1) Conda helps with that quite a bit; Pandas is not a much heavier dependency than NumPy itself.
2) Depends a bit on your background, but to me this is not really unexpected. Integers don't have a well-defined "missing" value while floats do, so Pandas is trying to help you by not using Python objects and instead converting to the "most useful" array type. It only does so if it can convert the integers without loss of precision.
3) This one I totally get; I wrote a custom, msgpack-based serialisation for our usage because of that (this was before Arrow was around; I'm seriously considering Arrow for data exchange now).
4) Apart from the changes to `resample` all of those breaking changes had a prior `DeprecationWarning`, IIRC.
pandas is a great example of data science code at its best and its worst. If you look at the source code you will see that every object and every function allows for way too many variations in input options, and therefore contains about 20 conditional statements. For instance, I believe DataFrame's init method can take a dictionary, DataFrame, Series, etc., versus a class method for each one. Contrast that to requests, where the public interface is a nice requests.get, requests.post. Yet, have a CSV file you are only loading once or twice to peek at? Then it's super efficient. I think my biggest issue is all the effort that goes into pandas-like APIs, i.e. https://github.com/ibis-project/ibis. To me, it doesn't make sense to take something stable and known (SQL) and build a complex DSL so it works like pandas.
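The constructor polymorphism is easy to see; to be fair, pandas does also ship a few explicit classmethods in the `requests.get` spirit:

```python
import pandas as pd

# One __init__ accepts wildly different inputs...
d1 = pd.DataFrame({"a": [1, 2]})                 # dict of columns
d2 = pd.DataFrame([[1, 2], [3, 4]])              # nested lists
d3 = pd.DataFrame(pd.Series([1, 2], name="a"))   # a Series

# ...alongside dedicated per-input constructors:
d4 = pd.DataFrame.from_records([(1, 2), (3, 4)], columns=["a", "b"])
d5 = pd.DataFrame.from_dict({"a": [1, 2]})
```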
Wes seems to be very focused on performance and big data applications these days, and of course it'd be great if Pandas could be used for bigger datasets, but when I hear people complain about Pandas they complain about:
1. the weird handling of types and null values (#4)
2. the verbosity of filtering like `dataframe[dataframe.column == x]` and transformations like `dataframe.col_a - dataframe.col_b`, compared to `dplyr` in R
3. warts on the indexing system (including MultiIndex, which is very powerful but confusing)
For those of us who use Pandas as an alternative to R, these usability shortcomings matter way more than memory efficiency.
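On the verbosity complaint in point 2: pandas' own `query` and `assign` methods get closer to dplyr's `filter`/`mutate`, at the cost of string expressions (column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"col_a": [5, 3, 8],
                   "col_b": [1, 2, 3],
                   "group": ["x", "y", "x"]})

# The verbose spelling the comment complains about:
filtered = df[df.group == "x"]
diff = df.col_a - df.col_b

# Closer to dplyr's filter/mutate:
filtered2 = df.query("group == 'x'")
with_diff = df.assign(diff=lambda d: d.col_a - d.col_b)
```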
DataVec (https://github.com/deeplearning4j/datavec, https://deeplearning4j.org/datavec) vectorizes/tensorizes most major data types to put them in shape for machine learning. It also lets you save the data pipeline as a reusable object.
I too would welcome a friendlier pandas library, but every time I've tried to think of an API that would work I fail. Well actually, I keep on wanting pandas to understand SQL.
I will definitely echo (2). dplyr is amazing and works in far, far fewer lines of code than pandas. That was my largest issue when migrating over from R.
#2 is a big issue for me. Filtering and subsetting are really arcane-feeling transformations, there's a lot of weirdness with view vs. copy, etc.
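The view-vs-copy weirdness usually shows up with chained indexing, where a write can silently land on a temporary copy:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained indexing: the boolean mask returns a copy, so this write is lost
# (pandas warns about chained assignment rather than raising an error):
df[df.a > 1]["b"] = 0
assert df["b"].tolist() == [4, 5, 6]   # unchanged!

# The reliable spelling is a single .loc indexer:
df.loc[df.a > 1, "b"] = 0
assert df["b"].tolist() == [4, 0, 0]
```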
Strings are a killer -- indeed any variable-length object makes array programming tricky when it's nested, so a sound strategy is to intern your strings first (in some way; KDB has enumerations, but language support isn't necessary: hashing the strings and saving an inverted index works well enough for a lot of applications). Interning strings means you see integers in your data operations, which is about as un-fun to program in as it sounds. People want to be able to write something like:
….str.extract('([ab])(\d)', expand=False)
and then get disappointed that it's slow. Everything is slow when you do it a few trillion times, but slow things are really slow when you do them a few trillion times.
If we think about how we build our tables, we can store these as a single-byte column (or even a bitmask) and an int (or long) column, then we get fast again.
However it is clear "fast" and "let's use JSON" are incompatible, and a good middleware or storage system isn't going to make me trade.
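pandas' `Categorical` dtype is essentially the interning scheme described here (the analogue of KDB's enumerations): each row stores a small integer code, plus one lookup table of unique strings.

```python
import pandas as pd

# A repetitive string column (made-up data):
s = pd.Series(["GET", "POST", "GET", "GET", "PUT"] * 1000)
c = s.astype("category")

# Under the hood: one int8 code per row + 3 unique strings,
# instead of 5000 Python string objects.
print(c.cat.codes.dtype)       # int8
print(list(c.cat.categories))  # ['GET', 'POST', 'PUT']
```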
I would love to see some sort of "smarter" indexing in the engine. I use pandas quite a bit, but I've never really understood the rationale behind the indexing, especially why indexes are treated so separately from data columns. I seem to be resetting and recreating indexes all the time, and I use .values a lot.
More SQL-style indexing would be a lot more intuitive at least for me.
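The reset/recreate churn described above typically looks like this; `reset_index` is the escape hatch back to ordinary, SQL-style columns (column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"user": ["u1", "u2", "u1"], "score": [10, 20, 30]})

# The index is a separate axis, so workflows bounce in and out of it:
by_user = df.set_index("user")                   # 'user' leaves the columns
totals = by_user.groupby(level="user")["score"].sum()
flat = totals.reset_index()                      # back to plain columns
```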
Apache Arrow is the next big thing in data analytics. Imagine doing a SQL query in MapD (it executes in milliseconds on a GPU), then passing the data zero-copy into Python, doing some calculations in pandas, and outputting the result to a web interface. Because everything is zero-copy, it can be done faster than ever before.
What is your "go to" web interface for displaying/visualizing data?
I use Oracle APEX because it has a killer "interactive report" feature (i.e. a data grid on steroids), which enables non-programmers to easily filter, aggregate, export, report on, etc., the data. However, although APEX is a free option that comes with the DB, it ties you to Oracle.
It would be great if there was a similar, database-independent, low-code tool like APEX out there, so I am curious what you have seen work well.
(Disclaimers: I don't have much experience using Python to build data science products; potentially silly questions)
In industry, does Pandas tend to power the application layer, or does it find more use as an exploratory data tool?
If the latter, do people prefer to push computation down into OLAP databases for performance reasons?
And if so, what impact will the convergence of libraries and database functionality have on product development? These features strike me as things that you'd find in a database, e.g. query optimizer. I know in the past couple years there have been a couple commercial acquisitions of in-memory execution engines, e.g. Hyper by Tableau.
We currently use a combination of Pandas and Scikit-Learn to run our production models. We're not in the big data space; instead, we create small, tightly tuned models for a very specific purpose at a large energy company.
At the moment the general work flow is:
* An internal library built on top of Pandas which abstracts our mess of internal databases
* Application specific model code that utilises the internal library to pull data in. This is then fed into a trained scikit-learn model and then further processed by Pandas.
* Internal monitoring tools (dashboards based upon Plotly and Flask, as well as an alerting system) are built using the internal library and Pandas as the glue.
As a design decision, we focused upon Pandas as the root source of all data. Everything is a DataFrame throughout the entire application.
Pain points:
* Writing to a database is pretty painful (SQL Server here, as we're a Windows shop).
* Minor API changes can be irritating.
* Pandas MultiIndexing is both very painful and mind-bending when trying to get the slice syntax to work.
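The MultiIndex slice-syntax pain is usually about `.loc` wanting a tuple of slices per level; `pd.IndexSlice` is the more readable spelling (toy data for illustration):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3]], names=["grp", "n"])
df = pd.DataFrame({"val": range(6)}, index=idx)

# All groups, n in {1, 2} -- the raw tuple-of-slices form:
sub = df.loc[(slice(None), [1, 2]), :]

# The same selection, spelled readably:
sub2 = df.loc[pd.IndexSlice[:, [1, 2]], :]
assert sub.equals(sub2)
```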
Overall though, Pandas is a huge value-add, and we've gradually rolled it out from 2 people to approximately 9-10 people who hadn't used Python in anger before.
Almost all reporting functionality is being migrated into Pandas instead of SQL stored procs, Excel, Tableau, etc., for the additional flexibility it provides.
Wes here. From what I understand, pandas is the middleware layer powering about 90% (if not more) of analytical applications in Python. It is what people use for data ingest, data prep, and feature engineering for machine learning models.
The existence of other database systems that perform equivalent tasks isn't useful if they are not accessible to Python programmers with a convenient API.
We use pandas as a last-mile library for offline exploration. Typically, datasets have been sampled or aggregated enough that they fit inside 10 GB, so you can work with them comfortably using pandas. I don't like using pandas in prod because the performance is really sensitive to things like missing a type declaration or calling the wrong method, and the API is really convoluted.
E: that's not to say pandas isn't good. It's really good. Thanks for the software, Wes!
> In industry, does Pandas tend to power the application layer, or does it find more use as an exploratory data tool?
My experience echoes yours. Pandas, from my observation, is more of a post-modeling tool that people use to further process data they digest from DB queries or Spark jobs.
After reading through the Arrow homepage, I am left somewhat baffled about where it sits. If my reading is correct, it is a client-side protocol that abstracts away the underlying data storage implementations? If so, isn't it still limited by how much data the client machine can handle? Or is the benefit the unified interface for accessing different storage systems? No matter what, it seems pretty ambitious. Looking forward to seeing how it goes.
Anyone here who is smart enough to write alternative solutions to the problems that pandas solves is capable of making a meaningful contribution to the pandas project, yet it seems that once pandas reaches the limits of its usefulness people go off and write a proprietary solution, never giving back.
What stopped you from contributing improvements to pandas? Have you taken alternate routes to open source your work?
You probably never took a deep look at pandas. It is a very complex library with lots of dependencies. It is not surprising that it is easier to implement an alternative rather than change the existing one.
This is exciting stuff but will it have any downsides for the majority (??) of users who don't use pandas for big data? Also, this all sounds very similar to the Blaze ecosystem, whatever happened to that? Finally, will arrow/feather replace hdf5 and bcolz in the future?
Pandas is the library I wish I'd had in the late '00s when my employer decided our site license for JMP was too costly. Well, really pandas plus matplotlib plus Jupyter notebooks. My job frequently involved creating plots and putting them in Powerpoint. Often the same plot, day in and day out, with new data from the production line. An interactive tool that can automate this, with a low barrier to entry, can save an incredible amount of time. Since I discovered pandas, I've been recommending it to anybody who works in a putting-plots-in-powerpoint job. And there are a lot of people who have jobs like that.
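That put-plots-in-PowerPoint loop is only a few lines once it's scripted; a minimal sketch, with hypothetical column names and output filename:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, so this can run in a scheduled job

# Stand-in for the day's production-line data:
df = pd.DataFrame({"lot": range(1, 8),
                   "yield_pct": [92, 95, 91, 96, 94, 97, 95]})

# Same plot, day in and day out, regenerated from fresh data:
ax = df.plot(x="lot", y="yield_pct", title="Daily yield")
ax.figure.savefig("daily_yield.png")
```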
That sounds more amenable to an Excel sheet, honestly. Which I suppose is not that surprising, since spreadsheets were the original freeform notebook style program.
I would be interested to know what Wes thinks of the Weld project (https://weld-project.github.io/), which seems to have some similar goals, but takes the 'query planner' concept much further.
The last time I was trying to use pandas, it was on the Hacker News data dump. It wasn't big. However, once pandas started using memory, my 32 GB was just too little.
I just ended up converting the data within Postgres instead; it was much faster, with sensible memory usage.
Parquet and ORC are columnar on-disk data formats that power SQL-on-Hadoop engines (Impala/Spark SQL and Hive, respectively).
Arrow is an in-memory representation (Parquet/ORC are on-disk). The idea is that workflows in different languages or frameworks can use the same in-memory representation, not having to rebuild it just because you're going from Spark to another framework.
I'd say native data frames in R aren't great at these (maybe #4, maybe #9). I'm excited to see how Arrow can perform, and hopefully we'll see solid bindings to R as well.
This is huge. Performance matters (even in 2017) and we need to do things the right way. Projects like Julia and Apache Arrow are paving the way for high-performance analytics, even for large data sets.
Pandas is definitely powerful, if somewhat mind-bending at first if you're used to a relational, SQL world. It's never been clear to me why Pandas wasn't more copy-on-write from the start: it's difficult to predict which operations copy.
chasedehan (8 years ago):
There is dplython, but it doesn't quite work the same so I don't use it much. https://github.com/dodger487/dplython
lifeisstillgood (8 years ago):
As far as I understand, you want to handle nested arrays of strings in your data. OK.
The "right" way is to build an index of the strings we are storing, and then store the index values (hashes of some kind) in the arrays as longs.
This way our arrays deal in numbers, and we hand-wave searching for or using strings through some wrapper.
Is this right?
And I am guessing the middleware you want does this transparently? Maybe storing the index alongside the data in some fashion
nerdponx (8 years ago):
However, I do prefer the R data.table model, which is what you describe. You can set an index on one or more columns in the table, and that's that.
teekert (8 years ago):
```python
import seaborn as sns
sns.violinplot(data=dataframe)
```
Set `%matplotlib inline` and you don't need more commands in your notebook.
huac (8 years ago):
The data.table package (https://github.com/Rdatatable/data.table/wiki) does make progress on some of these -- I'd say #1, #3, maybe #7 and #8. dplyr has a query planner too, FWIW.
sandGorgon (8 years ago):
Just like Feather is built on top of Arrow, ONNX can be based on top of Arrow.
anentropic (8 years ago):
> A multicore scheduler for parallel evaluation of operator graphs
Does anything like this already exist somewhere?