amakelov | 2 years ago
- inputs/outputs being high volume: the inputs/outputs that are large are often also things that don't change over the course of a project (e.g. a dataset or a model). So you don't really need to cache the object itself, just an immutable reference to it (typically a short string). As long as the object can be looked up at runtime, everything's fine;
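A minimal sketch of the reference idea (the registry and names here are hypothetical, not mandala's actual API): the large immutable object lives in a lookup table, and cached results store only the short string reference.

```python
# Hypothetical registry: large immutable objects are stored once,
# under a short string reference, and resolved at runtime.
REGISTRY = {}

def register(ref, obj):
    """Store a large object under a short immutable reference."""
    REGISTRY[ref] = obj

def lookup(ref):
    """Resolve a reference back to the actual object."""
    return REGISTRY[ref]

# The big dataset is registered once...
register("dataset:v1", list(range(1_000_000)))

# ...and a cached result only needs to carry the short ref, not the data:
cached_inputs = {"data": "dataset:v1"}
data = lookup(cached_inputs["data"])
```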
- detecting changes in data: content hashing is the general way to tell whether a result changed; using `joblib.dump` and then hashing the resulting bytes is a good starting implementation, though there are certainly some corner cases to be aware of.
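The serialize-then-hash idea can be sketched with the standard library (the comment uses `joblib.dump`; stdlib `pickle` shows the same pattern for portability): serialize the object to bytes, then hash the bytes.

```python
import hashlib
import pickle

def content_hash(obj):
    """Serialize an object and hash the bytes, so equal content
    yields equal hashes and any change yields a different hash.
    (Sketch only: non-deterministic serialization, e.g. of sets or
    custom objects, is one of the corner cases to watch for.)"""
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()

a = content_hash({"x": [1, 2, 3]})
b = content_hash({"x": [1, 2, 3]})  # same content, same hash
c = content_hash({"x": [1, 2, 4]})  # changed content, different hash
```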
Both of these approaches are available/used in mandala (https://github.com/amakelov/mandala; disclosure: I'm the author), which uses content hashing to tell when data (or even code/code dependencies) have changed, and gives you a generic caching decorator for functions that can then look up large objects by reference. This is how I used it for e.g. my mechanistic interpretability work, which is often of the form: one big model, plus lots of analyses producing tiny artifacts based on it.
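Putting the two pieces together, a caching decorator in this spirit might look like the sketch below (this is a hypothetical illustration, NOT mandala's real API): results are memoized under a content hash of the call, and the big model is passed by short reference rather than by value.

```python
import functools
import hashlib
import pickle

CACHE = {}  # content-hash key -> cached result

def cached(fn):
    """Memoize a function by the content hash of its arguments."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        key = hashlib.sha256(
            pickle.dumps((fn.__name__, args, sorted(kwargs.items())))
        ).hexdigest()
        if key not in CACHE:
            CACHE[key] = fn(*args, **kwargs)
        return CACHE[key]
    return wrapper

calls = []  # track how many times the body actually runs

@cached
def analyze(model_ref, layer):
    # `model_ref` is a short string reference to the big model,
    # so the cache never stores the model itself.
    calls.append((model_ref, layer))
    return f"stats for {model_ref}/layer{layer}"

analyze("model:gpt2", 3)
analyze("model:gpt2", 3)  # cache hit: the body runs only once
```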