top | item 38724536

IKantRead | 2 years ago

> A lot to be said for not defaulting to data frames, in both r and python

I would even add: especially in Python. The main issue I have found is that pandas-heavy code is just not as easy to integrate with other Python tools, features, and abstractions as code that does the vast majority of its work with numpy, dictionaries, and various comprehensions.

As a heavy pandas user for several years, I decided about a year ago to not import pandas by default and instead treat most data problems like regular Python problems. I've been genuinely surprised at how much easier it is to create useful abstractions with the code I've been writing, and also how much easier it's been to onboard non-DS devs into the code base.

There are a few obvious cases where pandas is very helpful, and I'll pull it out in those places, but I've been able to do a tremendous amount of data work in the last year while using very little pandas. The result is that I have an actual codebase to work with now rather than a billion broken notebooks.
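To make "treat it like a regular Python problem" concrete, here is a minimal sketch of a group-and-average step using only a dictionary and a comprehension. The `orders` records and their field names are made up for illustration; they are not from the comment above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, e.g. as parsed from a CSV or JSON file.
orders = [
    {"customer": "ann", "amount": 12.0},
    {"customer": "bob", "amount": 5.0},
    {"customer": "ann", "amount": 8.0},
]

# Group the amounts by customer using a plain dictionary of lists.
amounts_by_customer = defaultdict(list)
for order in orders:
    amounts_by_customer[order["customer"]].append(order["amount"])

# Aggregate each group with a dict comprehension.
mean_amount = {
    customer: mean(amounts)
    for customer, amounts in amounts_by_customer.items()
}
```

The result is an ordinary `dict`, so it can be passed to any function, serialized, or asserted on in a test without anything else in the code base needing to know about data frames.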

kristjansson|2 years ago

> The result is that I have an actual codebase to work with now rather than a billion broken notebooks.

This is the biggest part. Giving yourself permission to make real abstractions, rather than forcing yourself to go directly from data-on-disk to pandas (or whatever), makes it that much easier to test, repeat, modify, and extend whatever analysis you're working on.
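One way this could look in practice, as a rough sketch: keep the loading step and the analysis step as separate functions joined by a small record type, so the analysis can be tested against in-memory data. The `Measurement` type, the function names, and the CSV columns are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator, TextIO


@dataclass(frozen=True)
class Measurement:
    """One parsed row; a hypothetical domain type for illustration."""
    site: str
    value: float


def read_measurements(source: TextIO) -> Iterator[Measurement]:
    """Parse CSV text with 'site' and 'value' columns into Measurement objects."""
    for row in csv.DictReader(source):
        yield Measurement(site=row["site"], value=float(row["value"]))


def total_by_site(measurements: Iterable[Measurement]) -> dict[str, float]:
    """Sum values per site; knows nothing about files or CSV."""
    totals: dict[str, float] = {}
    for m in measurements:
        totals[m.site] = totals.get(m.site, 0.0) + m.value
    return totals


# The analysis runs the same on a real file or on an in-memory sample.
sample = io.StringIO("site,value\nnorth,1.5\nsouth,2.0\nnorth,0.5\n")
totals = total_by_site(read_measurements(sample))
```

Because `total_by_site` only depends on an iterable of `Measurement` objects, swapping the data source or adding a unit test does not require touching the analysis code.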

franklin_p_dyer|2 years ago

In what cases have you found it worthwhile to use pandas?

isoprophlex|2 years ago

Resampling, regularizing, binning and forward/backward filling time series data is an absolute pain in the ass using only SQL and/or vanilla Python. Pandas does its thing well there.

(Note that in general, I'm the biggest pandas hater I know)
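For reference, a small sketch of the resample-then-forward-fill case using `Series.resample` and `ffill`. The timestamps and values are invented for illustration.

```python
import pandas as pd

# Hypothetical irregularly spaced readings (made-up values).
readings = pd.Series(
    [1.0, 3.0, 5.0, 9.0],
    index=pd.to_datetime(
        [
            "2024-01-01 08:00",
            "2024-01-01 20:00",
            "2024-01-02 12:00",
            "2024-01-04 09:00",
        ]
    ),
)

# Regularize to one value per calendar day by averaging within each day.
daily = readings.resample("D").mean()

# 2024-01-03 has no readings, so it comes out as NaN;
# forward fill carries the last known daily value into the gap.
daily_filled = daily.ffill()
```

Doing the same by hand means generating the regular date grid, bucketing each reading, and tracking the last seen value per gap, which is the tedium the comment is describing.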

canjobear|2 years ago

It can be nice for groupby-aggregate logic. And it feeds into plotnine.
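A minimal sketch of the groupby-aggregate pattern, with made-up column names and values:

```python
import pandas as pd

# Hypothetical observations (made-up values).
df = pd.DataFrame(
    {
        "species": ["a", "a", "b", "b", "b"],
        "mass": [1.0, 3.0, 2.0, 4.0, 6.0],
    }
)

# One row per species, with the mean mass and the number of observations.
summary = df.groupby("species")["mass"].agg(["mean", "count"]).reset_index()
```

The result is a small tidy DataFrame with one row per group, which is the kind of input plotnine's `ggplot` takes as its data argument.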