llmtosser | 7 months ago | on: Ollama Turbo
This is not true.
No inference engine does all of:
- Model switching
- Unload after idle
- Dynamic layer offload to CPU to avoid OOM
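The three behaviors above can be sketched as a toy serving policy. This is a hypothetical interface for illustration, not any real engine's API:

```python
import time

class ModelManager:
    """Toy single-slot model manager: loads on demand, switches models,
    and unloads after an idle timeout. Purely illustrative -- no real
    inference engine exposes exactly this interface."""

    def __init__(self, idle_timeout=300.0, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock            # injectable for testing
        self.loaded = None            # name of the currently loaded model
        self.last_used = None

    def request(self, model):
        # Model switching: evict the old model if a different one is asked for.
        if self.loaded != model:
            self._unload()
            self.loaded = model       # stand-in for the real (slow) load
        self.last_used = self.clock()
        return self.loaded

    def tick(self):
        # Unload after idle: called periodically by a background loop.
        if self.loaded and self.clock() - self.last_used > self.idle_timeout:
            self._unload()

    def _unload(self):
        self.loaded = None            # stand-in for freeing VRAM
```

With an injected fake clock you can see both behaviors: requesting a second model evicts the first, and a `tick` past the timeout frees the slot.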
llmtosser | 7 months ago | on: Ollama Turbo
Distractions like this are probably the reason they still, over a year on, do not support sharded GGUF:
https://github.com/ollama/ollama/issues/5245
If any of the major inference engines (vLLM, SGLang, llama.cpp) incorporated API-driven model switching, automatic model unload after idle, and automatic CPU layer offloading to avoid OOM, there would be no need for Ollama.
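The CPU-offload piece amounts to computing how many transformer layers fit in free VRAM (minus a reserve for the KV cache) and spilling the rest to CPU, which llama.cpp exposes manually as `--n-gpu-layers`. A hedged sketch of that arithmetic, with made-up sizes and a uniform per-layer cost:

```python
def gpu_layer_split(free_vram_bytes, layer_bytes, n_layers, reserve_bytes=0):
    """Return (gpu_layers, cpu_layers): place as many layers on the GPU as
    fit in free VRAM after a reserve for the KV cache, spill the rest to CPU.
    Illustrative only; real engines also account for activations, quantized
    tensor shapes, and per-layer size differences."""
    usable = max(0, free_vram_bytes - reserve_bytes)
    gpu = min(n_layers, usable // layer_bytes)
    return gpu, n_layers - gpu
```

For example, a 32-layer model with ~500 MiB layers against 8 GiB of free VRAM and a 1 GiB KV-cache reserve lands 14 layers on the GPU and 18 on the CPU; with 24 GiB free, the whole model fits.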
However, the approach to model swapping is not Ollama-compatible, which means the OSS tools that support Ollama (e.g. Open WebUI, OpenHands, Bolt.diy, n8n, Flowise, browser-use) can't take advantage of this particularly useful capability, as best I can tell.
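What those tools actually depend on is that loading is implicit in Ollama's API: naming a model in the request body triggers the load, and `keep_alive` controls when it is unloaded again. A minimal request body in that shape (the helper name is mine; the HTTP call itself is omitted):

```python
import json

def ollama_chat_payload(model, prompt, keep_alive="5m"):
    """Body for a POST to Ollama's /api/chat. The 'model' field is what makes
    switching implicit; 'keep_alive' is the idle-unload knob (a duration
    like "5m", "0" to unload immediately, "-1" to pin the model)."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "keep_alive": keep_alive,
    })
```

An engine that swaps models through some other mechanism (a CLI flag, a separate admin endpoint) breaks this contract even if its chat endpoint is otherwise OpenAI- or Ollama-shaped.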