top | item 44805075

bluecoconut | 6 months ago

I was able to get gpt-oss:20b wired up to claude code locally via a thin proxy and ollama.

It's fun that it works, but the prefill time makes it feel unusable (2-3 minutes per tool use/completion), which means a ~10-20 tool-use interaction could take 30-60 minutes.

(This was editing a single server.py file of ~1000 lines; the tool definitions plus Claude context were around 30k input tokens, and after the file read, input was around ~50k tokens. It could definitely be optimized. Also, I'm not sure whether ollama supports a kv-cache between invocations of /v1/completions, which could help.)
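The "thin proxy" amounts to translating Claude Code's Anthropic-style /v1/messages requests into the OpenAI-style /v1/chat/completions requests that ollama serves. A minimal sketch of the request-translation step, with illustrative function and model names (this is an assumption about how such a proxy could be structured, not the poster's actual code):

```python
# Sketch: translate an Anthropic /v1/messages request body into an
# OpenAI-style /v1/chat/completions body (as served by ollama).
# Function name and model string are illustrative assumptions.

def anthropic_to_openai(body: dict, model: str = "gpt-oss:20b") -> dict:
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # the OpenAI chat format expects it as the first message.
    system = body.get("system")
    if system is not None:
        if isinstance(system, list):  # Anthropic also allows a list of blocks
            system = "".join(block.get("text", "") for block in system)
        messages.append({"role": "system", "content": system})
    for msg in body.get("messages", []):
        content = msg["content"]
        if isinstance(content, list):  # flatten text blocks to a plain string
            content = "".join(
                b.get("text", "") for b in content if b.get("type") == "text"
            )
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": model,
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
    }
```

A real proxy would also have to map tool definitions and tool-use/tool-result blocks in both directions, which is where most of the wiring effort goes.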

tarruda|6 months ago

> Also I'm not sure if ollama supports a kv-cache between invocations of /v1/completions, which could help

Not sure about ollama, but llama-server does have a transparent kv cache.

You can run it with

    llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja --reasoning-format none

The web UI is served at http://localhost:8080, which also exposes an OpenAI-compatible API.
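Since llama-server reuses the cached kv state of the longest matching prompt prefix, keeping the earlier turns of the conversation byte-identical between calls means only the newly appended tokens need prefill. A small client-side sketch against that OpenAI-compatible endpoint (URL and model string are assumptions based on the command above):

```python
import json
import urllib.request

def build_chat_request(messages,
                       url="http://localhost:8080/v1/chat/completions"):
    """Build an OpenAI-style chat request for llama-server.

    Appending new turns to an otherwise unchanged message list keeps the
    prompt prefix stable, so llama-server's prompt cache can skip
    re-prefilling the earlier tokens on the next call.
    """
    payload = {"model": "gpt-oss-20b", "messages": messages}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Usage (requires a running llama-server):
#   history = [{"role": "user", "content": "Summarize server.py"}]
#   with urllib.request.urlopen(build_chat_request(history)) as resp:
#       reply = json.load(resp)["choices"][0]["message"]
#   history.append(reply)  # extend, don't rewrite, to preserve the prefix
```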