llmtosser | 7 months ago | on: Ollama Turbo
This is not true.
No inference engine does all of:
- Model switching
- Unload after idle
- Dynamic layer offload to CPU to avoid OOM
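The three behaviors above can be sketched as a toy serving policy. This is a hypothetical interface for illustration, not any real engine's API:

```python
import time

class ModelManager:
    """Toy single-slot model manager: loads on demand, switches models,
    and unloads after an idle timeout. Purely illustrative -- no real
    inference engine exposes exactly this interface."""

    def __init__(self, idle_timeout=300.0, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock            # injectable for testing
        self.loaded = None            # name of the currently loaded model
        self.last_used = None

    def request(self, model):
        # Model switching: evict the old model if a different one is asked for.
        if self.loaded != model:
            self._unload()
            self.loaded = model       # stand-in for the real (slow) load
        self.last_used = self.clock()
        return self.loaded

    def tick(self):
        # Unload after idle: called periodically by a background loop.
        if self.loaded and self.clock() - self.last_used > self.idle_timeout:
            self._unload()

    def _unload(self):
        self.loaded = None            # stand-in for freeing VRAM
```

With an injected fake clock you can see both behaviors: requesting a second model evicts the first, and a `tick` past the timeout frees the slot.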
llmtosser | 7 months ago | on: Ollama Turbo
Distractions like this are probably the reason they still, over a year on, do not support sharded GGUF:
https://github.com/ollama/ollama/issues/5245
If any of the major inference engines (vLLM, SGLang, llama.cpp) incorporated API-driven model switching, automatic model unload after idle, and automatic CPU layer offloading to avoid OOM, there would be no need for Ollama.
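The CPU-offload piece amounts to computing how many transformer layers fit in free VRAM (minus a reserve for the KV cache) and spilling the rest to CPU, which llama.cpp exposes manually as `--n-gpu-layers`. A hedged sketch of that arithmetic, with made-up sizes and a uniform per-layer cost:

```python
def gpu_layer_split(free_vram_bytes, layer_bytes, n_layers, reserve_bytes=0):
    """Return (gpu_layers, cpu_layers): place as many layers on the GPU as
    fit in free VRAM after a reserve for the KV cache, spill the rest to CPU.
    Illustrative only; real engines also account for activations, quantized
    tensor shapes, and per-layer size differences."""
    usable = max(0, free_vram_bytes - reserve_bytes)
    gpu = min(n_layers, usable // layer_bytes)
    return gpu, n_layers - gpu
```

For example, a 32-layer model with ~500 MiB layers against 8 GiB of free VRAM and a 1 GiB KV-cache reserve lands 14 layers on the GPU and 18 on the CPU; with 24 GiB free, the whole model fits.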
However, the approach to model swapping is not Ollama-compatible, which means the OSS tools that support Ollama (e.g. Open WebUI, OpenHands, Bolt.diy, n8n, Flowise, browser-use) can't take advantage of this particularly useful capability, as best I can tell.
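What those tools actually depend on is that loading is implicit in Ollama's API: naming a model in the request body triggers the load, and `keep_alive` controls when it is unloaded again. A minimal request body in that shape (the helper name is mine; the HTTP call itself is omitted):

```python
import json

def ollama_chat_payload(model, prompt, keep_alive="5m"):
    """Body for a POST to Ollama's /api/chat. The 'model' field is what makes
    switching implicit; 'keep_alive' is the idle-unload knob (a duration
    like "5m", "0" to unload immediately, "-1" to pin the model)."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "keep_alive": keep_alive,
    })
```

An engine that swaps models through some other mechanism (a CLI flag, a separate admin endpoint) breaks this contract even if its chat endpoint is otherwise OpenAI- or Ollama-shaped.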