SeriousStorm | 1 year ago
Are you just running an LLM server (Ollama, llama.cpp, etc) and then making API calls to that server with plain Python or is it more than that?
OutOfHere | 1 year ago
For now I have used only cloud APIs with their Python SDKs, including the completion, TTS, and embedding endpoints. They let me run many jobs in parallel, which is useful for complex workflows or when facing heavy user demand. For caching responses, I have used a local disk caching library, although I suppose one could alternatively use a standalone or embedded database. For concurrent jobs I have used threading via `concurrent.futures`, although asyncio would work too.
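A minimal sketch of that setup, under some assumptions the comment doesn't spell out: the OpenAI SDK standing in for "cloud APIs", and `diskcache` standing in for "a local disk caching library". The model name and cache path are placeholders.

```python
import concurrent.futures

import diskcache  # assumption: any local disk cache works; diskcache is one option
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
cache = diskcache.Cache("./llm_cache")  # hypothetical cache directory


@cache.memoize()  # repeat prompts are served from disk, not the API
def complete(prompt: str) -> str:
    """Return a completion for the prompt, cached on local disk."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


prompts = ["Summarize document A.", "Summarize document B.", "Summarize document C."]

# Threads are a fine fit here: each job mostly waits on the network.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(complete, prompts))
```

Since `memoize` keys on the function arguments, identical prompts across runs hit the disk cache instead of burning API calls, which is most of the point of caching in a workflow like this.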
The one useful external Python library I have found so far is `semantic-text-splitter`, for splitting long texts by token count, though I could have written that myself with a bit of effort (a rough version is sketched below). I think langchain has something for this too.
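For reference, here is what the do-it-yourself version might look like, assuming `tiktoken` for token counting (an assumption; the comment doesn't name a tokenizer). Unlike `semantic-text-splitter`, this naive cut ignores sentence and paragraph boundaries.

```python
import tiktoken  # assumption: OpenAI's tokenizer library


def split_by_tokens(text: str, max_tokens: int = 512) -> list[str]:
    """Naive splitter: hard-cuts the token stream every max_tokens tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

The "bit of effort" the library saves is splitting at semantic boundaries (sentences, paragraphs) while still respecting the token budget, rather than cutting mid-word like this sketch does.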