vladf | 1 year ago
> We compared the performance of the Llama2-7B model in three configurations: FP16 (full precision), HQQ (without fine-tuning), and HQQ+ (with adapter layers) using a group-size of 8.
Interesting, what is "group-size of 8"?
From their HQQ post (https://mobiusml.github.io/hqq_blog/), it's the block size at which they add scales (presumably 16-bit) and shifts (in that post, it's 8-bit).
So for every 8 binary weights we have a 16-bit scale and 8-bit shift?
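If that reading is right, group-wise dequantization would look roughly like this (a minimal pure-Python sketch; the real HQQ kernels operate on packed tensors, and the exact scale/shift dtypes are my assumption):

```python
# Sketch of group-wise dequantization as described above: each group of 8
# one-bit weights shares one scale and one shift (zero-point).
# Affine dequantization: w = scale * (q - shift).

GROUP_SIZE = 8

def dequantize(q_bits, scales, shifts, group_size=GROUP_SIZE):
    """q_bits: flat list of 0/1 weights; scales/shifts: one value per group."""
    assert len(q_bits) % group_size == 0
    out = []
    for g in range(len(q_bits) // group_size):
        s, z = scales[g], shifts[g]
        for q in q_bits[g * group_size:(g + 1) * group_size]:
            out.append(s * (q - z))
    return out

# Example: one group of 8 one-bit weights with made-up scale/shift values.
w = dequantize([1, 0, 1, 1, 0, 0, 1, 0], scales=[0.5], shifts=[0.25])
```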
> Fine-tuning with Low-Rank Adapters
They say they inline the shift into the LoRA but how can you do this, block-wise, without increasing your LoRA rank by num-blocks (they claim to only use 1 additional rank)?
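The rank worry can be made concrete: the shift contribution to W is a matrix S with S[i][j] = shift[i][j // group_size]. If shifts vary per group along the input dimension, rank(S) can be as large as the number of groups; only a shift that is constant along each row collapses to rank 1. A toy check with made-up numbers (pure Python, small Gaussian elimination):

```python
# Rank of the shift matrix under two assumptions about where shifts live.
# Toy values below are illustrative, not taken from the paper.

def matrix_rank(m, eps=1e-9):
    """Rank via Gaussian elimination on a copy (fine for tiny matrices)."""
    m = [row[:] for row in m]
    rank, rows, cols = 0, len(m), len(m[0])
    for c in range(cols):
        pivot = next((r for r in range(rank, rows) if abs(m[r][c]) > eps), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rows):
            if r != rank and abs(m[r][c]) > eps:
                f = m[r][c] / m[rank][c]
                m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Two groups of size 2 per row, shifts differing per group -> rank > 1:
S_groupwise = [[0.1, 0.1, 0.3, 0.3],
               [0.2, 0.2, 0.7, 0.7]]
# One shift per row, broadcast across all columns -> rank 1 (absorbable
# into a single extra LoRA rank):
S_rowwise = [[0.1] * 4,
             [0.2] * 4]
```

So a genuinely per-group shift would not, in general, fit in one extra rank, which is exactly the question here.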
Then, the reported 7B sizes, in GB:
> 13.5 (fp16) 1.76 (HQQ 1-bit) 1.85 (HQQ+ 1-bit) 2.72 (quip# 2-bit)
Those numbers would make sense if it were _actually_ 1 bit. But if you include the overhead of 16-bit scales (and why is the shift inlineable into the LoRA? Still unexplained), it'd be more like 3-bit.
From their HF page:
> This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.
Interesting, so we have to go back to the CPU to rescale? Is this how they counted GB? This should have been clearly caveated in the table. I'm also amazed they got latency lower than QuIP# if they ping-pong to the CPU.
mobicham | 1 year ago
All the linear-quantization methods have meta-data, including the 1.58-bit paper. You can control the quality vs. memory usage by reducing the group-size. However, the meta-data is not the same thing as the quantized weights, for many reasons:
> The meta-data size doesn't change the fact that you can do binary/ternary matmul, which is the most important thing in this story.
> The meta-data size doesn't increase the actual compute: these are point-wise operations and even if you have 1 scalar you still need to multiply the same amount of weights.
> Meta-data is offloaded to the CPU with pinned memory, which allows non-blocking transfers. Technically, you can trigger the copy in the layer before and synchronize, which makes it almost seamless. I did some experiments with CUDA streams that worked very well on an older machine, and then on a better machine the transfer was much faster. Obviously, if you are trying it on Google Colab it's very slow for this reason.
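The "trigger the copy in the layer before" pattern can be sketched with a stdlib producer/consumer: while layer i computes, a background worker fetches layer i+1's CPU-resident meta-data. This only simulates the scheduling; real code would use pinned host memory and a side CUDA stream (e.g. torch's `pin_memory()` / `non_blocking=True`), and the layer math below is a stand-in with made-up values:

```python
# Overlapping meta-data fetches with layer compute, as described above.
# A one-worker pool stands in for an async host-to-device copy engine.
from concurrent.futures import ThreadPoolExecutor

cpu_metadata = {i: {"scale": 0.5, "shift": 0.25} for i in range(4)}  # toy values

def fetch_metadata(layer_idx):
    # Stand-in for a non-blocking copy of this layer's scales/shifts.
    return cpu_metadata[layer_idx]

def run_layer(layer_idx, metadata, x):
    # Stand-in for dequantize + matmul; just applies the affine params.
    return metadata["scale"] * (x - metadata["shift"])

def forward(x, num_layers=4):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_metadata, 0)   # prefetch layer 0
        for i in range(num_layers):
            meta = pending.result()                # synchronize: wait for copy
            if i + 1 < num_layers:
                pending = pool.submit(fetch_metadata, i + 1)  # prefetch next
            x = run_layer(i, meta, x)              # compute overlaps the fetch
    return x
```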
> Smaller models like Llama2-7B are very hard to directly quantize at very low bits, so they need a lower group-size to function well. Larger models (like what we showed for Mixtral), can be quantized to 2-bit on the fly, without any data, and still work very well. So basically larger models are less sensitive to extreme quantization and you could use a much larger group-size. I still think that the meta-data size is really not a big deal for the reasons I have explained above.
> There are many other ways to increase the group-size or even get rid of it altogether; many ideas are available, but they need lots of experimentation.
> Binary/ternary CUDA matmul kernels don't exist yet. The current code is implementing the dequantization step in CUDA but then uses torch.matmul as fp16. I tried doing matmul at low-bits with CUDA but it is very difficult to even beat cuBLAS with fp16, especially for a novice CUDA coder like me :)
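That pipeline, storing the weights packed at 1 bit each, then unpacking, dequantizing, and handing off to a regular matmul, can be sketched in pure Python. The LSB-first packing layout here is an illustrative choice, not HQQ's actual format:

```python
# Packed 1-bit storage, then dequantize + plain matmul, mirroring the
# "dequantization step in CUDA, then torch.matmul as fp16" pipeline above.

def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, LSB first."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for j, b in enumerate(bits[i:i + 8]):
            byte |= b << j
        out.append(byte)
    return bytes(out)

def unpack_bits(data, n):
    """Recover the first n bits from packed bytes."""
    return [(data[i // 8] >> (i % 8)) & 1 for i in range(n)]

bits = [1, 0, 1, 1, 0, 0, 1, 0]
packed = pack_bits(bits)            # 1 byte instead of 8 separate values
restored = unpack_bits(packed, 8)

# Dequantize (scale * (q - shift)); a plain dot product stands in for the
# fp16 matmul step. Scale/shift values are made up.
scale, shift = 0.5, 0.25
w = [scale * (q - shift) for q in restored]
x = [1.0] * 8
y = sum(wi * xi for wi, xi in zip(w, x))
```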
Please note: this is early experimental work. Since it showed promising results, we wanted to share it with the community first as we progress. There are still a lot of things to be done, and we are actively working on it despite the very limited resources we have.
Happy to answer any questions here!
vladf | 1 year ago
1. Could you post the full memory use of the methods? E.g. you include QuIP# meta-data in its GB figure but not HQQ meta-data in its.
2. If you have to go to the CPU to shift and scale, how did you get latency lower than pure on-device? Was this bsz=1? No speculative decoding?
3. How can LoRA absorb the shifts while increasing rank by only 1 if you have a shift per group?
mikeravkine | 1 year ago
It's getting tougher to use older, cheaper GPUs (Pascal/Maxwell) with modern quantization schemes so anything you can do to keep kernels compatible with SM52 and SM61 would be greatly appreciated.
danielhanchen | 1 year ago
For HQQ 1-bit, a group size of 8 needs two extra numbers per group: a 16-bit scale and the shift (you mentioned 8-bit for the shift). So you need 8 * 1 bit + 16 bit + 8 bit for each group, i.e. 32 bits per group of 8, or 4 bits per param.
I'm assuming the scale and zero_point are maybe both moved to 8-bit, so 8 * 1 bit + 8 bit + 8 bit = 24 bits / 8 = 3 bits per param?
"This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory." So the 8+8 scale/zero_point moves to the CPU. GPU memory is 1 bit per param, but the CPU meta-data is the rest. Very smart!
Dylan16807 | 1 year ago
Doesn't it need all the weight metadata for a layer to use that layer?
* If yes, then can't any algorithm offload x% of its data as a balancing act between speed and RAM?
* If no, then what's it for and when does it get used?
vladf | 1 year ago
1 - Is it fair to use RAM in two places and report only one of them without any asterisk? (If you think this is fair, oh boy, wait till you hear about my 0 GB HBM-use inference algorithm.)
2 - I know how subchannel quantization works. Are they hitting those reported latency numbers with a per-layer CPU ping-pong to rescale?