My gut feeling is that there's optimization headroom for faster performance, though I could be wrong since I don't know your setup or requirements. In general, on a 4090 running Q6-Q8 quants, my tokens/sec have been comparable to what I see from cloud providers hosting the same open models. The fastest local configuration I've tested is ExLlama via TabbyAPI with speculative decoding, plus a quantized KV cache to fit more context.
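If you want to sanity-check your own numbers, here's a rough tok/s probe you could run against TabbyAPI's OpenAI-compatible endpoint. This is just a sketch based on my setup: the port (5000 is TabbyAPI's default last I checked), the API key, and the model name are all placeholders you'd swap for your own.

```python
# Rough tokens/sec measurement against a local OpenAI-compatible server
# (e.g. TabbyAPI). Endpoint URL, api_key, and model name are assumptions
# from my setup -- adjust to match your config.
import time

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",  # TabbyAPI's default port, I believe
    api_key="your-key-here",              # whatever key your server config expects
)

start = time.time()
resp = client.chat.completions.create(
    model="local-model",  # placeholder; TabbyAPI serves whichever model you loaded
    messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
    max_tokens=256,
)
elapsed = time.time() - start

tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```

This measures end-to-end throughput including prompt processing, so run it a couple of times with a longer generation if you want a number closer to pure decode speed. It's also a handy way to see the speedup from enabling a draft model for speculative decoding: run it once with and once without.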