If you want to do it at home, ik_llama.cpp has some performance optimizations that make it semi-practical to run a model of this size on a server with plenty of memory bandwidth and a GPU or two for offload. You can get 6-10 tok/s with modest workstation hardware. Thinking chews up a lot of tokens, though, so it will be a slog.
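For reference, a rough invocation might look something like this (the model path, thread count, and tensor-override pattern are placeholders, and exact flag names can differ between ik_llama.cpp and mainline llama.cpp builds, so treat it as a sketch rather than a recipe):

    ./llama-server -m /models/big-moe-Q4_K.gguf \
        -c 16384 -t 32 \
        -ngl 99 \
        -ot ".ffn_.*_exps.=CPU"

The idea is to push the attention and shared layers onto the GPU (-ngl) while keeping the large expert tensors in system RAM (the -ot override), which is where the memory bandwidth matters.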