m16ghost|4 years ago
SQL is very useful, but there are some data manipulations which are much easier to perform in pandas/dplyr/data.table than in SQL. For example, the article discusses how to perform a pivot table, which takes data in a "long" format, and makes it "wider".
In the article, the pandas version is:
>pd.pivot_table(df, values='name', index='role', columns='department', aggfunc='count')
Compared to the SQL version:
>SELECT role,
>       SUM(CASE department WHEN 'R&D' THEN 1 ELSE 0 END) AS "R&D",
>       SUM(CASE department WHEN 'Sales' THEN 1 ELSE 0 END) AS "Sales"
>FROM emp
>GROUP BY role;
Not only does the SQL code require you to know up front how many distinct columns you are creating, it requires you to write a line out for each new column. This is okay in simple cases, but is untenable when you are pivoting on a column with hundreds or more distinct values, such as dates or zip codes.
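For concreteness, here is a minimal runnable version of the pandas side (the employee data is invented for illustration):

```python
import pandas as pd

# Hypothetical employee table in "long" format.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cal", "Dee"],
    "role": ["Engineer", "Engineer", "Manager", "Manager"],
    "department": ["R&D", "Sales", "R&D", "Sales"],
})

# One call, no need to enumerate the output columns:
# pandas discovers the distinct departments itself.
wide = pd.pivot_table(df, values="name", index="role",
                      columns="department", aggfunc="count")
print(wide)
```

The same call works unchanged whether `department` has two distinct values or two hundred.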
There are some SQL dialects which provide pivot functions like in pandas, but they are not universal.
There are other examples in the article where the SQL code is much longer and less flexible, such as binning, where the bins are hardcoded into the query.
kbelder|4 years ago
But after some trial and error, I find it much faster to pull relatively large, unprocessed datasets and do everything in pandas on the local client. Faster both in total analysis time and in DB cycles.
It turns out that running a couple of simple queries like "select * from cars" and "select * from drivers where age < 30", then doing all the joining, filtering, and summarizing on my machine, is often less burdensome on the db than doing it up-front in SQL.
Of course, this can change depending on the specific dataset, how big it is, how you're indexed, and all that jazz. Just wanted to mention how my initial intuition was misguided.
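As a sketch of that workflow (the table and column names here are made up, and in practice the frames would come from `read_sql` rather than literals), the client-side join might look like:

```python
import pandas as pd

# Stand-ins for the results of "select * from cars" and
# "select * from drivers where age < 30".
cars = pd.DataFrame({
    "driver_id": [1, 2, 3],
    "model": ["sedan", "coupe", "truck"],
})
drivers = pd.DataFrame({
    "driver_id": [1, 2],
    "age": [22, 28],
})

# Join, filter, and summarize locally instead of in the database.
young = cars.merge(drivers, on="driver_id")
summary = young.groupby("model")["age"].mean()
print(summary)
```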
mr_toad|4 years ago
And you need to define the columns in advance anyway, because query planners can’t handle columns that change at runtime.
michelpp|4 years ago
https://www.postgresql.org/docs/13/tablefunc.html
m16ghost|4 years ago
- You still need to enumerate and label each new column and its type. This particular problem is fixed by crosstabN().
- You need to know upfront how many columns will be created before performing the pivot. In the context of data analysis, this is often dynamic or unknown.
- The input to the function is not a dataframe but a text string that generates the pre-pivot results. This means your analysis up to that point has to be converted into a string. Not only does this disrupt the flow of an analysis, but you also have to worry about escape characters inside the string.
- It is not standard across SQL dialects. This function is specific to Postgres, and other dialects have their own versions with their own limitations.
The article contains several examples like this where SQL is much more verbose and brittle than the equivalent pandas code.
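To illustrate the dynamic-columns point (data invented here): `pd.crosstab` derives the output columns from the data at runtime, so nothing about their number or names appears in the code.

```python
import pandas as pd

df = pd.DataFrame({
    "role": ["Engineer", "Engineer", "Manager"],
    "department": ["R&D", "Sales", "R&D"],
})

# The set of output columns comes from the data itself;
# none of them are enumerated in the query text.
counts = pd.crosstab(df["role"], df["department"])
print(counts.columns.tolist())
```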
ellis-bell|4 years ago
a pattern that i converged on --- at least in postgres --- is to aggregate your data into json objects and then go from there. you don't need to know how many attributes (columns) should be in the result of your pivot. you can also do this in reverse (pivot from wide to long) with the same technique.
so for example if you have the schema `(obj_id, key, value)` in a long-formatted table, where an `obj_id` will have data spanning multiple rows, then you can issue a query like
```
SELECT obj_id, jsonb_object_agg(key, value)
FROM table
GROUP BY obj_id;
```
up to actual syntax...it's been a while since i've had to do a task requiring this, so the details are fuzzy, but the pattern's there.
so each row in your query result would look like a json document: `(obj_id, {"key1": "value", "key2": "value", ...})`
see https://www.postgresql.org/docs/current/functions-json.html for more goodies.
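On the client side, the aggregated JSON column can be expanded back into wide columns; a sketch, assuming the query result has already been fetched into Python as dicts (the `attrs` key name is invented):

```python
import pandas as pd

# Rows as they might come back from a jsonb_object_agg query
# like the one above, after JSON decoding.
rows = [
    {"obj_id": 1, "attrs": {"key1": "a", "key2": "b"}},
    {"obj_id": 2, "attrs": {"key1": "c"}},
]

# json_normalize flattens the nested objects into columns,
# filling NaN where an object lacks a key.
wide = pd.json_normalize(rows)
print(wide.columns.tolist())
```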