More like an easy-mode llama.cpp: it now wraps the library via cgo (previously it built patched llama.cpp runners, managed them as child processes, and talked to them over IPC), and it does a few clever things like automatically figuring out layer splits if you have meager GPU VRAM. The easy mode is that it will auto-load whatever model you ask for per request. They also implement docker-like layers in their representation of a model, letting you overlay configuration parameters and tag the result. So far, it has been trivial to mix and match different models (or even the same model with different parameters) for different tasks within the same application.
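To illustrate the per-request model loading and parameter overlays, here is a minimal sketch against Ollama's local REST endpoint (`/api/generate` on port 11434, its default). The task names, model tags, and the base/overlay option merging are my own assumptions for illustration, not anything Ollama prescribes; the overlay dict loosely mirrors its layered configuration idea in plain Python.

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

# Base options shared by every task; each task's overlay wins on conflict,
# loosely mirroring the docker-like layered configuration described above.
BASE_OPTIONS = {"temperature": 0.7, "num_ctx": 4096}

# Hypothetical task registry: different models, or the same model with
# different parameters, selected per request.
TASKS = {
    "summarize":  {"model": "llama3",    "options": {"temperature": 0.2}},
    "brainstorm": {"model": "llama3",    "options": {"temperature": 1.0}},
    "code":       {"model": "codellama", "options": {}},
}

def build_payload(task: str, prompt: str) -> dict:
    """Pick the task's model and merge base options with its overlay."""
    cfg = TASKS[task]
    return {
        "model": cfg["model"],
        "prompt": prompt,
        "stream": False,
        "options": {**BASE_OPTIONS, **cfg["options"]},
    }

def generate(task: str, prompt: str) -> str:
    """POST to the local Ollama server; it loads the model on demand."""
    data = json.dumps(build_payload(task, prompt)).encode()
    req = request.Request(OLLAMA_URL, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Build (without sending) two payloads for the same model, different params.
print(build_payload("summarize", "TL;DR this thread.")["options"])
print(build_payload("brainstorm", "Name ideas?")["options"])
```

Because the server auto-loads whichever model a request names, the application code stays this simple: routing a task just means picking a different payload.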
okwhateverdude|2 years ago