For indexing operators, there is some flexibility in how much internal operator state is persisted. In a stream-stream join, for example, it is often faster to rebuild the state from its "boundary conditions" than to persist it in full.
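To make the join case concrete, here is a minimal sketch in plain Python. It is not the engine's actual code: it assumes that "boundary conditions" means the rows retained on the operator's two inputs, and `build_join_state` and `join` are hypothetical helpers written only for illustration.

```python
from collections import defaultdict


def build_join_state(left_rows, right_rows):
    """Index both inputs by join key; these two maps are the operator's internal state."""
    left_index, right_index = defaultdict(list), defaultdict(list)
    for key, value in left_rows:
        left_index[key].append(value)
    for key, value in right_rows:
        right_index[key].append(value)
    return left_index, right_index


def join(left_index, right_index):
    """Emit every (key, left_value, right_value) match, sorted for easy comparison."""
    return sorted(
        (key, left_value, right_value)
        for key in left_index.keys() & right_index.keys()
        for left_value in left_index[key]
        for right_value in right_index[key]
    )


# Hypothetical retained inputs (the assumed "boundary conditions").
left_rows = [(1, "a"), (2, "b"), (1, "c")]
right_rows = [(1, "x"), (3, "y")]

state = build_join_state(left_rows, right_rows)

# After a restart: replay the inputs instead of loading a snapshot of the indexes.
rebuilt = build_join_state(left_rows, right_rows)
assert join(*state) == join(*rebuilt)
```

Because the indexes are a deterministic function of the inputs, nothing is lost by discarding them; the trade-off is only replay time against snapshot size.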
Vector indexes need to persist rather more of their internal state because of determinism issues: a rebuilt index could come back different and return different approximate results, which is bad. Currently, the HNSW implementation that underlies VectorStoreServer is not yet fully integrated into the main Differential Dataflow organization, and has its own way of persisting/caching data "on the side".
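The determinism point can be illustrated with a toy index. The sketch below is not HNSW and does not use any VectorStoreServer API: `ToyApproxIndex` is a hypothetical random-hyperplane index, chosen only because its structure depends on unseeded randomness, so a fresh rebuild may bucket points differently while a restored copy cannot.

```python
import json
import random


class ToyApproxIndex:
    """Toy random-hyperplane index; a stand-in for an approximate index, not HNSW."""

    def __init__(self, planes):
        self.planes = planes
        self.buckets = {}

    @classmethod
    def fresh(cls, dim, n_planes=3):
        # Unseeded randomness: every fresh build draws different hyperplanes.
        return cls([[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)])

    def _signature(self, vec):
        # One boolean per hyperplane: which side of it the vector falls on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in self.planes
        )

    def add(self, item_id, vec):
        self.buckets.setdefault(self._signature(vec), []).append((item_id, vec))

    def query(self, vec):
        """Return the id of the closest item in the query's bucket, or None."""
        candidates = self.buckets.get(self._signature(vec), [])
        if not candidates:
            return None
        best = min(
            candidates,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(c[1], vec)),
        )
        return best[0]

    def dump(self):
        # Persist the random structure (planes) together with the data.
        items = [item for bucket in self.buckets.values() for item in bucket]
        return json.dumps({"planes": self.planes, "items": items})

    @classmethod
    def load(cls, blob):
        state = json.loads(blob)
        index = cls(state["planes"])
        for item_id, vec in state["items"]:
            index.add(item_id, vec)
        return index


points = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.5], "d": [0.7, 0.7]}
queries = [[0.9, 0.1], [-0.8, 0.6], [0.1, 0.9]]

index = ToyApproxIndex.fresh(dim=2)
for item_id, vec in points.items():
    index.add(item_id, vec)

# Restoring the persisted structure reproduces the same approximate answers.
restored = ToyApproxIndex.load(index.dump())
answers = [index.query(q) for q in queries]
restored_answers = [restored.query(q) for q in queries]
assert answers == restored_answers
```

A second call to `ToyApproxIndex.fresh` over the same points would draw new hyperplanes and could return different answers for the same queries, which is why the internal structure is persisted and not just the input vectors.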
All in all, this part of the codebase is relatively young, and there is a fair amount of room for improvement.