vsolina | 2 years ago
You're definitely right about the feature richness, but the truth is I just want completions :D
Performance is a funny thing: it mostly scales with the slowest part of the system. Since both servers use the same inference library (llama.cpp), which does all the heavy lifting, there's essentially no completion performance difference in single-user mode according to my tests. But because I use a smaller model by default (Q5_K_M instead of Tabby's Q8, ~30% difference in size), and LLM inference is essentially memory-bandwidth bound, my new deployment is around 30% faster on identical hardware, with no noticeable quality difference.
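The size-to-speed argument can be sketched with back-of-envelope numbers. This is just an illustration of the "bandwidth-bound" reasoning above; the model sizes and bandwidth figure are hypothetical, not measurements:

```python
# Token generation in llama.cpp-style inference is roughly memory-bandwidth
# bound: each token streams (most of) the model weights through memory once,
# so throughput ~ bandwidth / model size. Numbers below are hypothetical.

def est_tokens_per_sec(model_size_gb: float, mem_bw_gbps: float) -> float:
    """tokens/s ~= memory bandwidth / bytes read per token (~ model size)."""
    return mem_bw_gbps / model_size_gb

BANDWIDTH = 100.0  # GB/s, hypothetical memory bandwidth

q8 = est_tokens_per_sec(6.5, BANDWIDTH)  # hypothetical Q8 model size
q5 = est_tokens_per_sec(5.0, BANDWIDTH)  # ~30% smaller quant (e.g. Q5_K_M)

print(f"Q8:     {q8:.1f} tok/s")
print(f"Q5_K_M: {q5:.1f} tok/s ({q5 / q8 - 1:+.0%})")
```

With a ~30% smaller model and the same bandwidth, the estimate comes out ~30% faster, matching the observation above.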
p.s. I'd highly recommend providing additional quantization methods in your model repository to make it easier for novice users.
Thank you