top | item 28524098

infimum | 4 years ago

scikit-learn (next to numpy) is the one library I use in every single project at work. Every time I consider switching away from python I am faced with the fact that I'd lose access to this workhorse of a library. Of course it's not all sunshine and rainbows - I had my fair share of rummaging through its internals - but its API design is a de-facto standard for a reason. My only recurring gripe is that the serialization story (basically just pickling everything) is not optimal.
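The serialization gripe is easy to see in a minimal sketch (dataset and model choice here are just illustrative): sklearn's official persistence story is to pickle the fitted estimator object, which ties the artifact to the Python environment and sklearn version it was trained with.

```python
# Minimal sketch of sklearn's serialization story: pickle the fitted object.
# The dataset and estimator are arbitrary choices for illustration.
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize: the entire Python object graph is pickled, so loading it back
# requires a compatible sklearn (and Python) version.
blob = pickle.dumps(model)

# Deserialize and predict -- works, but the artifact is opaque and
# environment-dependent rather than a portable model format.
restored = pickle.loads(blob)
assert (restored.predict(X) == model.predict(X)).all()
```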

CapmCrackaWaka | 4 years ago

I recently ran into this issue as well. Serialization of sklearn random forests results in absolutely massive files. I had to switch to lightgbm, which is 100x faster to load from a save file and about 20x smaller.
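A quick way to see the size problem (the parameters below are arbitrary, and actual sizes depend on tree count, depth, and data): pickling a random forest serializes every node of every tree, so the payload grows quickly.

```python
# Hedged illustration of the file-size complaint: pickle a small random
# forest and measure the payload. All parameters here are arbitrary.
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Every node of every tree is serialized, so size scales with the ensemble.
size_mb = len(pickle.dumps(forest)) / 1e6
print(f"pickled forest: {size_mb:.1f} MB")
```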

kzrdude | 4 years ago

What's a typical task you do with sklearn? Just trying to get inspired about what it can do.

zeec123 | 4 years ago

There is so much wrong with the API design of sklearn (how can one think "predict_proba" is a good function name?). I can understand this, since most of it was probably written by PhD students without the time or expertise to come up with a proper API, many of them without a CS background. [1]

[1] https://www.reddit.com/r/haskell/comments/7brsuu/machine_lea...
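For context on the naming complaint, here is the method in question (a minimal sketch; the dataset and classifier are illustrative): classifiers expose predict() for hard labels and predict_proba() for per-class probabilities.

```python
# The method the naming complaint refers to: predict() returns hard labels,
# predict_proba() returns one probability per class for each sample.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

labels = clf.predict(X[:3])       # hard class labels
probs = clf.predict_proba(X[:3])  # shape (3, n_classes), each row sums to 1
```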

kzrdude | 4 years ago

These seem like minor gripes (reading your link), and I don't even agree with them; it seems like an OK use of mutable state (otherwise a separate object would be needed for hyperparameter state?). Maybe my expectations are low, but the way sklearn unifies the API across different estimators throughout the library is already way above what you can expect, especially if you consider it to be "written by a bunch of PhD students".
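The unified API praised here can be sketched in a few lines (the two estimators are arbitrary picks): hyperparameters live in the constructor, fit() mutates the estimator in place and returns self, so any estimator can be swapped in behind the same calls.

```python
# Sketch of sklearn's unified estimator API: hyperparameters go in
# __init__, fit() updates internal state and returns self, and score()
# works the same way regardless of which estimator is used.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

scores = []
for est in (LogisticRegression(max_iter=1000),
            RandomForestClassifier(n_estimators=10, random_state=0)):
    # Identical calls for both estimators; fit() returns the mutated object.
    scores.append(est.fit(X, y).score(X, y))

print(scores)  # training accuracy for each estimator
```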

mrtranscendence | 4 years ago

I didn't want to bag on sklearn (I've already bagged on pandas enough here), but for what it's worth I agree with you. It's, ahh, not the API I would've come up with. It's what everybody has standardized on, though, and maybe there's some value in that.