vlovich123|11 months ago

You are deeply misunderstanding what the KV cache referred to here is. It's not for storing data. This is the KV cache that is part of the model's inference runtime, used to reduce the quadratic compute complexity of self-attention to linear. It is not stored on SSD; it lives in VRAM (or in system RAM if you're not using a GPU).

boroboro4|11 months ago

They do, in fact, mention the inference KV cache as a use case in the README. The most advanced KV caching setups use a hierarchy of GPU RAM / regular RAM / SSD; it seems they were able to use their storage abstraction for the last tier.

magicalhippo|11 months ago

KVCache is a technique used to optimize the LLM inference process. It avoids redundant computations by caching the key and value vectors of previous tokens in the decoder layers. The top figure demonstrates the read throughput of all KVCache clients (1×400Gbps NIC/node), highlighting both peak and average values, with peak throughput reaching up to 40 GiB/s.
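The caching the thread describes can be sketched in a few lines. This is a minimal, hypothetical single-head decode loop (made-up dimension and random weights, no batching or multiple layers): each step projects only the newest token into a key and a value and appends them to the cache, instead of reprojecting every past token, which is what turns the per-step projection cost from quadratic over the whole sequence into linear.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # model/head dimension (hypothetical)

# Random projection weights stand in for trained parameters.
W_q = rng.standard_normal((d, d)) / np.sqrt(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x):
    """Attend over all previous tokens using cached K/V.

    Without the cache we would recompute K and V for every past
    token at every step (O(n^2) projections over the sequence);
    with it, each step projects only the new token (O(n) total).
    """
    q = x @ W_q
    k_cache.append(x @ W_k)  # project the new token once, reuse forever
    v_cache.append(x @ W_v)
    K = np.stack(k_cache)    # (n, d) keys seen so far
    V = np.stack(v_cache)    # (n, d) values seen so far
    scores = softmax(K @ q / np.sqrt(d))
    return scores @ V        # attention output for this position

for _ in range(5):           # pretend to generate 5 tokens
    out = decode_step(rng.standard_normal(d))

print(len(k_cache))  # one cached key vector per generated token
```

The cache here is just two Python lists in memory, which is the point of the first comment: it normally lives in VRAM next to the model. The tiered setups the second comment mentions spill these K/V tensors from GPU RAM to host RAM and then to SSD when sequences or batch sizes outgrow device memory.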