
amakelov | 2 years ago

I see two concerns here:

- high-volume inputs/outputs: the inputs/outputs that are large are often also things that don't change over the course of a project (e.g. a dataset or a model). So you don't really need to cache the object itself, just a short, immutable reference to it (typically a string). As long as the object can be looked up at runtime, everything's fine;

- detecting changes in data: content hashing is the general way to tell whether a result has changed; serializing with `joblib.dump` and then hashing the resulting bytes provides a good starting implementation, though certainly there are some corner cases to be aware of.
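
As a minimal sketch of the by-reference idea (the registry and all names here are hypothetical, not any particular library's API):

```python
# Hypothetical in-memory registry: large immutable objects are stored
# once under a short string key; caches then record only the key.
REGISTRY = {}

def register(key: str, obj) -> str:
    """Store a large object once and return its short reference."""
    REGISTRY[key] = obj
    return key

def lookup(ref: str):
    """Resolve a reference back to the object at runtime."""
    return REGISTRY[ref]

dataset_ref = register("dataset-v1", list(range(1_000_000)))
# A cache entry stays tiny: it holds the reference, not the data.
cache_entry = {"input": dataset_ref, "output": "some small result"}
```

The cache only ever sees the short string, so its size is independent of how big the underlying dataset or model is.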
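
And a starting content hash along those lines, using the stdlib `pickle` here in place of `joblib.dump` just to keep the sketch dependency-free (the same caveats about serialization corner cases apply):

```python
import hashlib
import pickle

def content_hash(obj) -> str:
    # Serialize to bytes, then hash the bytes. Note that pickle output
    # can differ across Python versions or for objects with unstable
    # serialization -- examples of the corner cases mentioned above.
    return hashlib.sha256(pickle.dumps(obj, protocol=4)).hexdigest()
```

Equal content yields equal hashes, so a changed hash is the signal that a result changed.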

Both of these approaches are available/used in mandala (https://github.com/amakelov/mandala; disclosure: I'm the author). It uses content hashing to tell when data (or even code and code dependencies) have changed, and gives you a generic caching decorator for functions that can then look up large objects by reference. This is how I used it for e.g. my mechanistic interpretability work, which often takes the form of one big model plus lots of analyses producing tiny artifacts based on it.
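
A toy version of such a caching decorator (not mandala's actual API, just an illustration of the mechanism) might key the cache on a content hash of the function name and its arguments:

```python
import functools
import hashlib
import pickle

def _call_key(*parts) -> str:
    # Content hash of the call signature (pickle-based sketch only).
    return hashlib.sha256(pickle.dumps(parts, protocol=4)).hexdigest()

def memo(fn):
    """Cache results by content hash of (function name, arguments)."""
    cache = {}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        k = _call_key(fn.__name__, args, sorted(kwargs.items()))
        if k not in cache:
            cache[k] = fn(*args, **kwargs)
        return cache[k]

    return wrapper

calls = []

@memo
def analyze(model_ref: str, n: int) -> int:
    # The big model is passed by reference (a short string), so the
    # cache key stays cheap to compute and store.
    calls.append(1)
    return len(model_ref) + n
```

Repeated calls with the same arguments hit the cache instead of re-running the analysis.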
