bluecoconut | 6 months ago
It's fun that it works, but the prefill time makes it feel unusable (2-3 minutes per tool use / completion). That means a ~10-20 tool-use interaction could take 30-60 minutes.
(This was editing a single server.py file of ~1000 lines; the tool definitions plus Claude context came to around 30k input tokens, and after the file read the input was around ~50k tokens. It could definitely be optimized. Also, I'm not sure whether ollama supports a kv-cache between invocations of /v1/completions, which could help.)
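(One rough way to check is to time an identical long-prompt request twice against ollama's OpenAI-compatible endpoint; the model name below is just an example:)

    # identical payload both times; a large speedup on the second call
    # suggests the server reused its KV cache across invocations
    PAYLOAD='{"model":"llama3","prompt":"...long shared prefix here...","max_tokens":1}'
    time curl -s http://localhost:11434/v1/completions \
      -H "Content-Type: application/json" -d "$PAYLOAD" > /dev/null
    time curl -s http://localhost:11434/v1/completions \
      -H "Content-Type: application/json" -d "$PAYLOAD" > /dev/null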
tarruda | 6 months ago
Not sure about ollama, but llama-server does have a transparent kv cache.
You can run it with something like:
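(Sketch only; the model path is a placeholder, and -c raises the context window for long prompts:)

    llama-server -m ./model.gguf --port 8080 -c 32768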
Web UI at http://localhost:8080 (also OpenAI compatible API)
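For example (standard OpenAI-style payload; llama-server answers with whatever model it was started with):

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "hello"}]}'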