amakelov | 2 years ago
For a concrete example, in an ML project you often want to study how several quantities vary across several parameters. A straightforward workflow for this is: write some nested loops, collect results in Python dictionaries, and finally put everything together in a dataframe and compare (by plotting or otherwise).
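The loop-and-collect workflow described above might look like the following sketch (the experiment function, its parameters, and the "loss" it returns are all made up for illustration):

```python
import itertools
import pandas as pd

def run_experiment(lr, width):
    # stand-in for an expensive training/evaluation run (hypothetical)
    return {"lr": lr, "width": width, "loss": 1.0 / (width * lr)}

rows = []
for lr, width in itertools.product([0.1, 0.01], [32, 64]):
    rows.append(run_experiment(lr, width))

# bring everything together for comparison/plotting
df = pd.DataFrame(rows)
```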
However, after looking at the results, you might spot a trend and wonder whether it continues if you try a new value for one of the parameters; of course, you also want to keep the previous results and bring everything together in the same plot(s). You now have a problem: either re-run the cell (losing previous work, which is annoying even if the wait is only a minute - you know it's a wasted minute!), or write the new computation in a new cell, usually with a lot of redundancy (which over time makes the notebook hard to navigate and keep consistent).
So, this and other considerations eventually convinced me that the function, rather than the cell, is the more natural interface/boundary at which to implement caching, at least for my use cases (coming from ML research). I wrote a framework based on this idea, with lots of other features (some quite experimental/unusual) to turn it into a feasible experiment management tool - check it out at https://github.com/amakelov/mandala
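The core idea can be sketched with a plain in-memory memoizer (this is NOT mandala's actual API, which persists results to storage and does much more; all names below are made up). Extending a parameter sweep then only runs the new points:

```python
import functools

calls = {"train": 0}   # count real executions, to show caching works
_cache = {}            # in-memory only; a real tool would persist this

def cached(fn):
    """Memoize fn on its positional arguments."""
    @functools.wraps(fn)
    def wrapper(*args):
        key = (fn.__name__, args)
        if key not in _cache:
            _cache[key] = fn(*args)
        return _cache[key]
    return wrapper

@cached
def train(lr, width):
    calls["train"] += 1
    # stand-in for an expensive training run (hypothetical)
    return {"lr": lr, "width": width, "loss": 1.0 / (width * lr)}

# first sweep
results = [train(lr, 32) for lr in (0.1, 0.01)]
# extend the sweep: only the new point actually executes
results = [train(lr, 32) for lr in (0.1, 0.01, 0.001)]
```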
P.S.: I notice you use `pickle` for the hashing - `joblib.dump` is faster for objects containing numpy arrays, which covers a lot of common ML objects.