Arimbr | 1 year ago

Nice, thanks! I was reading https://pathway.com/developers/user-guide/deployment/persist.... If I understand correctly, you persist both the source data and the internal state, including the intermediate state of the computational graph, and you rely only on the backend to recover from failures and upgrades. So if I want to clone a Pathway instance, I don't need to reprocess all the source data; I can recover the intermediate state from the snapshot.

Is it the same logic for the VectorStoreServer? https://pathway.com/developers/user-guide/llm-xpack/vectorst...
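
To make the question concrete, here is a minimal sketch of the idea in plain Python. It does not use the Pathway API: `RunningCount`, `snapshot`, and `restore` are hypothetical names invented for illustration. A toy stateful operator saves its intermediate state together with the offset of the last consumed source record, so a clone only has to process the records that arrived after the snapshot.

```python
import json
import os
import tempfile


class RunningCount:
    """Toy stateful operator (hypothetical, not Pathway): counts keys."""

    def __init__(self):
        self.counts = {}  # intermediate operator state
        self.offset = 0   # number of source records consumed so far

    def process(self, records):
        for key in records:
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def snapshot(self, path):
        # Persist the state together with the source offset it corresponds to.
        with open(path, "w") as f:
            json.dump({"counts": self.counts, "offset": self.offset}, f)

    @classmethod
    def restore(cls, path):
        op = cls()
        with open(path) as f:
            data = json.load(f)
        op.counts = data["counts"]
        op.offset = data["offset"]
        return op


source = ["a", "b", "a", "c", "a", "b"]

# The original instance processes part of the stream, then snapshots.
original = RunningCount()
original.process(source[:4])
snapshot_path = os.path.join(tempfile.mkdtemp(), "snapshot.json")
original.snapshot(snapshot_path)

# The clone starts from the snapshot and only reads the remaining records.
clone = RunningCount.restore(snapshot_path)
clone.process(source[clone.offset:])

# Reference: an instance that reprocesses everything from scratch.
full = RunningCount()
full.process(source)
```

In this sketch the clone ends up with the same counts as the instance that reprocessed the whole source, which is the behavior I would expect from recovering via a snapshot.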

dxtrous|1 year ago

For indexing operators, there is some flexibility in how much internal operator state is persisted. For example, for a stream-stream join, it's often faster to rebuild its state from its "boundary conditions" than to persist it fully. For vector indexes, more of the internal state has to be persisted because of determinism issues: a rebuilt index could come back different and give different approximate results, which is bad. Currently, the HNSW implementation that underlies VectorStoreServer is still not fully integrated into the main Differential Dataflow organization and has its own way of persisting/caching data "on the side". All in all, this part of the codebase is relatively young, and there is a fair amount of room for improvement.
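
The determinism point can be illustrated with a toy example. This is only a sketch: it uses random-hyperplane bucketing rather than HNSW, and `ToyApproxIndex`, `to_state`, and `from_state` are hypothetical names, not part of Pathway. The index construction is randomized, so rebuilding from the same data may produce a different structure and different approximate answers, while restoring the persisted internal state reproduces the original answers.

```python
import json
import random


class ToyApproxIndex:
    """Toy approximate nearest-neighbor index (not HNSW, illustration only)."""

    def __init__(self, dim, n_planes, seed=None):
        rng = random.Random(seed)
        # Randomized internal state: hyperplanes that define the buckets.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(n_planes)]
        self.items = []
        self.buckets = {}

    def _signature(self, vec):
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0
                     for plane in self.planes)

    def add(self, item_id, vec):
        self.items.append((item_id, vec))
        self.buckets.setdefault(self._signature(vec), []).append((item_id, vec))

    def query(self, vec):
        # Approximate: only candidates in the query's bucket are considered.
        candidates = self.buckets.get(self._signature(vec), [])
        if not candidates:
            return None
        best = min(candidates,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(c[1], vec)))
        return best[0]

    def to_state(self):
        # Persist the randomized internals, not just the input data.
        return json.dumps({"planes": self.planes, "items": self.items})

    @classmethod
    def from_state(cls, state):
        data = json.loads(state)
        index = cls(dim=0, n_planes=0)
        index.planes = data["planes"]
        for item_id, vec in data["items"]:
            index.add(item_id, vec)
        return index


data_rng = random.Random(0)
points = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(50)]
queries = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]

index = ToyApproxIndex(dim=4, n_planes=3, seed=1)
for i, p in enumerate(points):
    index.add(i, p)

# Restoring from persisted state reproduces the same answers.
restored = ToyApproxIndex.from_state(index.to_state())

# Rebuilding from the same data with different randomness may not.
rebuilt = ToyApproxIndex(dim=4, n_planes=3, seed=2)
for i, p in enumerate(points):
    rebuilt.add(i, p)

original_answers = [index.query(q) for q in queries]
restored_answers = [restored.query(q) for q in queries]
rebuilt_answers = [rebuilt.query(q) for q in queries]
```

Here `restored_answers` always matches `original_answers`, whereas `rebuilt_answers` is free to differ because the buckets were drawn with different randomness.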