chessgecko|2 years ago
Some of the things being done to improve the quality of 6-8 bit inference use extra compute, which pushes it a little in the other direction, but it's still pretty memory-intensive until the batch size gets quite large.

FanaHOVA|2 years ago
It'll help, but the GPU crunch isn't caused by people running 6-8 bit inference on a single card; it's caused by all the large-scale pre-training and fine-tuning runs.

yazzku|2 years ago

simne|2 years ago
Easy. I ran tests on a desktop Core i7-7700 with 64 GB of DDR4-2400, trying 13B, 30B, and 70B models, and you can imagine how easy it is to control how many CPU cores are used.
The answer: it really works, but slowly (about 0.5-1 tokens per second, at near 100% CPU usage).
The i7-7700 is a well-balanced machine, but I have hit memory-bandwidth limits a few times before with highly optimized software, and that looks very different: when using all cores, I saw only about 50% CPU usage.
BTW, llama.cpp is very good software.
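The numbers above are consistent with decoding being memory-bandwidth-bound rather than compute-bound: each generated token requires streaming roughly the whole weight matrix from RAM, so tokens/sec is capped near (memory bandwidth) / (model size in bytes). A rough sketch of that arithmetic, assuming dual-channel DDR4-2400 (~38.4 GB/s peak, an assumed figure, not from the thread) and ~0.5 bytes per weight for 4-bit quantization:

```python
# Back-of-envelope upper bound on single-stream decode speed,
# assuming every weight is streamed from RAM once per token.

def tokens_per_second(params_billions: float, bytes_per_weight: float,
                      bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling: GB/s of RAM divided by GB of weights."""
    model_gb = params_billions * bytes_per_weight  # bytes streamed per token
    return bandwidth_gb_s / model_gb

# Assumption: dual-channel DDR4-2400 peak bandwidth,
# 2 channels * 8 bytes * 2400 MT/s = 38.4 GB/s.
BW = 38.4

for params in (13, 30, 70):
    rate = tokens_per_second(params, 0.5, BW)  # 0.5 bytes/weight ~ 4-bit
    print(f"{params}B @ 4-bit: <= {rate:.1f} tok/s")
```

For a 70B model at 4-bit this ceiling works out to about 1 token/second, in line with the 0.5-1 tok/s reported above; real throughput falls short of the peak-bandwidth figure, which is why the cores can show 100% busy while mostly waiting on memory.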