
rawrawrawrr | 2 years ago

It depends. Right now, once we hit 6-8 bit precision for inference, H100s/A100s are not memory-bound but compute-bound.


chessgecko|2 years ago

This is wrong; whether you're memory-bound has to do with the dimensions of the matrices being multiplied (if you're on tensor cores). https://docs.nvidia.com/deeplearning/performance/dl-performa...

Some of the things being done to improve the quality of 6-8 bit inference use extra compute, which pushes things a little in the other direction, but it's still pretty memory-intensive until the batch size gets quite large.
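The batch-size point above can be sketched with a roofline back-of-envelope: the arithmetic intensity of an M×K @ K×N GEMM only crosses a GPU's compute/memory "ridge point" once M (the batch/sequence dimension) is large. The ridge value and hidden size below are assumptions (A100 spec-sheet numbers, ~312 fp16 tensor-core TFLOPS and ~2039 GB/s HBM; hidden size 4096), not measurements.

```python
# Roofline sketch: arithmetic intensity of an M x K @ K x N fp16 GEMM,
# assuming each matrix is read/written exactly once.

def gemm_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte of DRAM traffic for one GEMM."""
    flops = 2 * m * k * n                              # multiply-accumulate count
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C
    return flops / traffic

# Assumed A100 ridge point: peak fp16 TFLOPS / peak memory bandwidth.
RIDGE = 312e12 / 2039e9  # ~153 FLOP/byte; below this the GEMM is memory-bound

for batch in (1, 64, 2048):
    ai = gemm_intensity(batch, 4096, 4096)  # 4096 = assumed hidden size
    bound = "memory-bound" if ai < RIDGE else "compute-bound"
    print(f"batch={batch:5d}  intensity={ai:7.1f} FLOP/B -> {bound}")
```

At batch 1 the intensity is roughly 1 FLOP/byte (deeply memory-bound); it only passes the ridge point at batch sizes in the hundreds, which matches the comment's "until the batch size gets quite large".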

FanaHOVA|2 years ago

It'll help, but the GPU crunch isn't caused by people running 6-8 bit inference on a single card; it's caused by all the large-scale pre-training + fine-tuning runs.

yazzku|2 years ago

Can you link to an actual performance analysis on this?

simne|2 years ago

Easy. I ran tests on a desktop Core i7-7700 with 64 GB of DDR4-2400, trying 13B, 30B, and 70B models, and as you can imagine, it's easy to control how many CPU cores are used.

The answer is: it really works, but slowly (about 0.5-1 tokens per second, with near 100% CPU usage).

The i7-7700 is a well-balanced machine, but I've hit its memory-bandwidth limits a few times before with highly optimized software, and that looks very different: with all cores in use, I'd see only about 50% CPU usage.
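The ~0.5-1 tokens/s figure above is close to what a memory-bandwidth ceiling predicts, since batch-1 decoding must stream every weight from RAM once per token. The quantization width and the dual-channel DDR4-2400 configuration below are assumptions for illustration, not details from the comment.

```python
# Back-of-envelope: DRAM-bandwidth ceiling on CPU token generation.
# Assumption: each decoded token reads all model weights from RAM once.

def peak_bandwidth_gbs(channels: int, bus_bytes: int, mt_per_s: int) -> float:
    """Theoretical peak DRAM bandwidth in GB/s."""
    return channels * bus_bytes * mt_per_s / 1000

def tokens_per_s_ceiling(weights_gb: float, bandwidth_gbs: float) -> float:
    """Upper bound on decode speed if weight streaming is the bottleneck."""
    return bandwidth_gbs / weights_gb

# i7-7700 with dual-channel DDR4-2400 (assumed): 2 channels x 8 bytes x 2400 MT/s
bw = peak_bandwidth_gbs(channels=2, bus_bytes=8, mt_per_s=2400)
print(f"peak bandwidth: {bw:.1f} GB/s")

# A 70B model at an assumed ~4 bits/weight is ~35 GB of weights.
print(f"70B ceiling: {tokens_per_s_ceiling(35, bw):.1f} tok/s")
```

The theoretical ceiling comes out around 1 token/s for a 70B model, so the observed 0.5-1 tokens/s is consistent with the run being memory-bandwidth-bound rather than compute-bound.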

BTW, llama.cpp is very good software.