Translationaut | 2 years ago
Here with Go, it seems to make sense to use an abstraction. Though don't you lose a lot of flexibility?
lolinder | 2 years ago
* You have a machine with a GPU that can do the inference but you need to run the application code on a smaller device. In my case, that's a Raspberry Pi with a touchscreen.
* You have multiple applications that all need to use LLM inference for different purposes. Loading the models exactly once is more efficient than loading them with huggingface for each application.
* Your python script is ephemeral and run on demand. You don't want to load the model each execution because it's slow, so you need a daemon. Ollama can serve that role without you having to write anything other than the script.
If you're writing Python and only need one application that's going to run on a powerful machine, you're probably better off running the models directly, but I'd venture a guess that that's a minority case.
As for flexibility, there are probably some things that are easier to do when you're running the inference yourself, but Ollama's API is powerful enough for most use cases.
rybosome | 2 years ago
This is exactly my use case for it; I invoke a Python binary which uses the Ollama API and get a model response within seconds because it’s already resident in memory.
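A minimal sketch of the kind of ephemeral script described above, assuming a local Ollama daemon listening on its default port (11434) and using its `/api/generate` endpoint. The model name and the helper names `build_request`/`generate` are illustrative, not from the thread:

```python
import json
import urllib.request

# Assumes an Ollama daemon is already running locally with the model loaded,
# so repeated invocations of this script skip the slow model-load step.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    # stream=False asks the daemon for one complete JSON response
    # instead of a stream of partial tokens.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    payload = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The daemon returns a JSON object whose "response" field
        # holds the generated text.
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("llama3", "Say hello in one word."))
```

Because the daemon keeps the model resident in memory, the script itself stays a thin stateless client, which is what makes the "invoke and get a response within seconds" pattern work.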