I think llama.cpp supports prompt caching via `--prompt-cache`, but it produces a very large cache file, since it has to persist the full KV state for the prompt. I'd also guess caching is expensive for any inference API provider to support, as they have to persist that state and load/unload it on every request.
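A back-of-the-envelope sketch of why the cache file gets big: the cached state has to hold the K and V tensors for every layer and every prompt token. The formula below is the standard KV-cache size estimate; the model dimensions are illustrative numbers for a Llama-2-7B-style architecture, not measured llama.cpp output.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    # Two tensors (K and V) per layer, each n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Illustrative Llama-2-7B-style dims: 32 layers, 32 KV heads, head dim 128, fp16.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_tokens=4096)
print(size / 2**30)  # 2.0 -> about 2 GiB just for a 4096-token prompt
```

So even a moderately long prompt can cost gigabytes per cached session, which is why persisting and reloading these files per request is painful for a provider.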