vladf | 1 year ago
1 - Is it fair to use RAM in two places and report only one of them without an asterisk? (If you think that's fair, oh boy, wait till you hear about my 0GB-HBM inference algorithm.)
2 - I know how subchannel quantization works. Are they hitting the reported latency numbers with a per-layer CPU ping-pong to rescale?
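For anyone following along who hasn't seen the term: a minimal sketch of subchannel (group-wise) quantization, assuming plain min/max scaling. The function names, the group size, and the 16-bit metadata width are illustrative assumptions, not what the model in question actually uses. The point is that every group carries its own scale and zero point, and that metadata has to be stored somewhere.

```python
def quantize_groupwise(weights, group_size=8, bits=1):
    """Quantize a flat list of floats, one (scale, zero point) pair per group."""
    levels = (1 << bits) - 1  # highest integer code, e.g. 1 for 1-bit
    codes, scales, zeros = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # Guard against a constant group, where hi == lo.
        scale = (hi - lo) / levels if hi > lo else 1.0
        scales.append(scale)
        zeros.append(lo)
        codes.extend(round((w - lo) / scale) for w in group)
    return codes, scales, zeros


def dequantize_groupwise(codes, scales, zeros, group_size=8):
    """Rescale integer codes back to floats using each group's metadata."""
    return [
        codes[i] * scales[i // group_size] + zeros[i // group_size]
        for i in range(len(codes))
    ]


def effective_bits_per_weight(bits, group_size, meta_bits=16):
    """Storage per weight once one scale and one zero point per group are counted."""
    return bits + 2 * meta_bits / group_size


weights = [0.0, 1.0, 0.2, 0.9, -2.0, 2.0, -1.9, 1.8]
codes, scales, zeros = quantize_groupwise(weights, group_size=4, bits=1)
restored = dequantize_groupwise(codes, scales, zeros, group_size=4)
print(codes)
print(restored)
print(effective_bits_per_weight(bits=1, group_size=8))
```

With small groups the metadata can outweigh the low-bit codes themselves, which is why it matters where that metadata is kept and whether it is counted in the reported number.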
danielhanchen | 1 year ago
You can see from https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-hf_1bi... that the model takes 3GB of disk space, plus 100MB of LoRA weights. I also uploaded a 4-bit one to https://huggingface.co/unsloth/llama-2-7b-bnb-4bit/tree/main, which uses 3.87GB.
So the offloading trick lowers the GPU VRAM use, but in practice about 3GB is still needed overall.
Not sure about latency, sadly.
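The arithmetic behind those figures, as a quick sketch. The sizes are the ones quoted in this comment, assuming decimal units (1GB = 1000MB). How the 3GB splits between GPU and CPU isn't stated here, so it isn't modeled.

```python
# Sizes quoted above, in MB (assumption: 1GB = 1000MB).
LOW_BIT_MODEL_MB = 3000  # mobiuslabsgmbh model on disk
LORA_MB = 100            # LoRA weights shipped with it
BNB_4BIT_MB = 3870       # unsloth bnb 4-bit upload

total_low_bit_mb = LOW_BIT_MODEL_MB + LORA_MB
saving_vs_4bit_mb = BNB_4BIT_MB - total_low_bit_mb

print(f"low-bit total: {total_low_bit_mb} MB")
print(f"saving vs 4-bit: {saving_vs_4bit_mb} MB")
```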
mobicham | 1 year ago