Mandala: Python programs that save, query and version themselves

34 points | benkuhn | 2 years ago | amakelov.github.io

4 comments

emptysongglass|2 years ago

I write Python (not well) and I don't understand any of this:

> mandala is a tool that implements this idea. It turns function calls into interlinked, queriable, content-versioned data that is automatically co-evolved with your codebase. By applying different semantics to this data, the same ordinary Python code, including control flow and collections, can be used to not only compute, but also save, load, query, batch-compute, and combinations of those.

I wanted to call it out because I think it's such an obvious example of where programmers need to find ways of describing their programs that everyone can understand.

amakelov|2 years ago

Hi, author here. Sorry about the confusion - this blog post's intention was to give a more programming-language-themed introduction to the project (discussion on r/programminglanguages is here: https://www.reddit.com/r/ProgrammingLanguages/comments/12im9...). As such, it doesn't really talk very much about what you'd actually use this for - tracking computational experiments (for example). As for the part you highlighted, I was looking for a two-sentence summary that conveys the main features of the project... well, I guess I succeeded with the two-sentence part! :)

Let me try to unpack this in a hopefully saner format:

- the whole thing is based on memoization. You put a memoization decorator on all the functions whose results you want to persist/reuse, and you compose entire programs by calling such functions on the outputs of other functions. In between calls, you can also mix in data structure operations - so say a function returns a list, you can call another memoized function on just an element of this list.

- as such programs execute, some metadata is passed around that links together the inputs and outputs of each call, and the elements of each data structure. This dynamically builds a computational graph of the program behind the scenes.

- why would you need this graph? One reason is that there is a principled way to extract a SQL query from this graph that pattern-matches to all analogous computational graphs that have already been memoized. This gives you a flexible and natural interface to ask the queries you usually ask of your programs - "How do these outputs depend on these inputs across all the experiments of this kind?" - without writing extra code.

- since memoized results can go stale when you change your code, there is also a versioning system that tracks the dependencies that each memoized call accessed, and alerts you when a dependency changes. Tracking dependencies dynamically on a per-call basis, rather than statically on a per-function basis, gives you more opportunity to reuse computation automatically - for example, a function can have two branches that rely on different dependencies. If only one of those dependencies changes, you won't recompute the calls that used the other one. The "content-versioned" part refers to how the system recognizes which version a function is at: by looking at its code (content), instead of by you explicitly providing it with some arbitrary version name/number. This means that e.g. you can "go back in time" w.r.t. a given function (or functions) by restoring the old code, and the storage will recognize that it's back in that earlier state.
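
To make the first two bullets concrete, here is a toy sketch of a memoization decorator that also records which values flow into which calls. It is not mandala's actual API: `Ref`, `memo`, `CALLS`, `VALUES` and `consumers` are all hypothetical names made up for illustration, the storage is just an in-memory dict, and the "query" is a plain Python lookup rather than generated SQL.

```python
import functools
import hashlib
import pickle

# Toy in-memory storage. Every name here is hypothetical, not mandala's API.
CALLS = {}   # (function name, input ids) -> output id
VALUES = {}  # value id -> value


class Ref:
    """A value plus a content-derived id, so calls can be linked through it."""

    def __init__(self, value):
        self.value = value
        self.uid = hashlib.sha256(pickle.dumps(value)).hexdigest()[:12]
        VALUES[self.uid] = value


def wrap(x):
    """Wrap a plain value in a Ref; leave existing Refs untouched."""
    return x if isinstance(x, Ref) else Ref(x)


def memo(func):
    """Memoize func and record the input -> output link for every new call."""

    @functools.wraps(func)
    def wrapper(*args):
        refs = [wrap(a) for a in args]
        call_id = (func.__name__, tuple(r.uid for r in refs))
        if call_id not in CALLS:
            # Cache miss: actually run the function and store the result.
            out = Ref(func(*(r.value for r in refs)))
            CALLS[call_id] = out.uid
            wrapper.executions += 1
        return Ref(VALUES[CALLS[call_id]])

    wrapper.executions = 0
    return wrapper


def consumers(uid):
    """Names of memoized functions that took the value `uid` as an input."""
    return sorted({name for (name, in_uids) in CALLS if uid in in_uids})


@memo
def double(x):
    return 2 * x


@memo
def add(x, y):
    return x + y


a = double(3)                # executes double
b = add(a, 1)                # executes add, linked to double's output
b_again = add(double(3), 1)  # both calls are served from the cache
```

Because every output carries an id, the `CALLS` table doubles as the edge list of the computational graph: `consumers(a.uid)` walks it to find which functions used the output of `double(3)`.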

I hope that helps clarify things, and thanks a lot for bringing this up.
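
For the "content-versioned" point, a minimal sketch under some loud assumptions: the version of a function is taken to be a hash of its compiled bytecode and constants, and the cache key includes that hash. `content_version`, `call_versioned`, `CACHE` and `EXECUTIONS` are hypothetical names, not mandala's API, and this toy hashes only the function itself rather than tracking the dependencies each call touched.

```python
import hashlib

CACHE = {}       # (logical name, content version, args) -> result
EXECUTIONS = []  # keys that actually had to be computed


def content_version(func):
    """Derive a version id from the function's code rather than a manual label."""
    code = func.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


def call_versioned(name, func, *args):
    """Memoize under a key that includes the content version of func."""
    key = (name, content_version(func), args)
    if key not in CACHE:
        EXECUTIONS.append(key)
        CACHE[key] = func(*args)
    return CACHE[key]


# Three successive "edits" of the same logical function `scale`.
def scale_v1(x):
    return x * 10


def scale_v2(x):
    return x * 100


def scale_v1_restored(x):
    # Same body as scale_v1, i.e. the old code restored.
    return x * 10


r1 = call_versioned("scale", scale_v1, 2)           # computed
r2 = call_versioned("scale", scale_v2, 2)           # code changed: recomputed
r3 = call_versioned("scale", scale_v1_restored, 2)  # old code again: cache hit
```

Restoring the old body yields the same hash, so the third call reuses the first call's stored result instead of recomputing; that is the "go back in time" behaviour in miniature.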