SeriousStorm | 1 year ago
Are you just running an LLM server (Ollama, llama.cpp, etc) and then making API calls to that server with plain Python or is it more than that?
OutOfHere | 1 year ago
For now I have used only cloud APIs with their Python SDKs, including the completion, TTS, and embedding endpoints. They let me run many jobs in parallel, which is useful for complex workflows or when facing heavy user demand. For caching responses, I have used a local disk caching library, although I suppose one could alternatively use a standalone or embedded database. For concurrent jobs I have used threading via `concurrent.futures`, although asyncio would work too.
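A minimal sketch of that setup, under some assumptions the comment doesn't spell out: the OpenAI SDK standing in for "cloud APIs", and `diskcache` standing in for "a local disk caching library". The model name and cache path are placeholders.

```python
import concurrent.futures

import diskcache  # assumption: any local disk cache works; diskcache is one option
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
cache = diskcache.Cache("./llm_cache")  # hypothetical cache directory


@cache.memoize()  # repeat prompts are served from disk, not the API
def complete(prompt: str) -> str:
    """Return a completion for the prompt, cached on local disk."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


prompts = ["Summarize document A.", "Summarize document B.", "Summarize document C."]

# Threads are a fine fit here: each job mostly waits on the network.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(complete, prompts))
```

Since `memoize` keys on the function arguments, identical prompts across runs hit the disk cache instead of burning API calls, which is most of the point of caching in a workflow like this.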
The one useful external Python library I have found so far is `semantic-text-splitter`, for splitting long texts by token count, though I could have written that myself with a bit of effort (a rough version is sketched below). I think langchain has something for this too.
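For reference, here is what the do-it-yourself version might look like, assuming `tiktoken` for token counting (an assumption; the comment doesn't name a tokenizer). Unlike `semantic-text-splitter`, this naive cut ignores sentence and paragraph boundaries.

```python
import tiktoken  # assumption: OpenAI's tokenizer library


def split_by_tokens(text: str, max_tokens: int = 512) -> list[str]:
    """Naive splitter: hard-cuts the token stream every max_tokens tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

The "bit of effort" the library saves is splitting at semantic boundaries (sentences, paragraphs) while still respecting the token budget, rather than cutting mid-word like this sketch does.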