top | item 39858949


bigattichouse | 1 year ago

I've been fascinated by a small mention in the 1.58bit quantization article that mentioned 0.68 quantization , which I believe to mean 0,1 instead of 1.58's -1,0,1. When I read https://www.reddit.com/r/LocalLLaMA/comments/1bpa6ol/unoffic... great experiment of making their own unofficial 1.58b quantization, I began to wonder if I could squeeze a vector down to 1 bit. And.. I can! (with some caveats in the discussion)

The breakthrough came when I realized that XNOR plus a population count can score 32 dimensions at a time: XNOR a pair of 32-bit words and the popcount of the result is the number of dimensions on which the two binary vectors agree.
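To make the idea concrete, here's a minimal sketch (not the repo's actual code; `binarize` and `xnor_score` are hypothetical names): binarize a float vector by sign, pack the bits into an integer, then score two packed vectors with XNOR and a popcount.

```python
def binarize(vec):
    """Pack a float vector into an int: bit i is 1 if vec[i] > 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def xnor_score(a_bits, b_bits, dims):
    """Similarity = number of dimensions where the sign bits agree."""
    mask = (1 << dims) - 1
    xnor = ~(a_bits ^ b_bits) & mask    # XNOR, masked to the vector width
    return bin(xnor).count("1")         # popcount of matching bits

a = binarize([0.3, -1.2, 0.7, 0.0])
b = binarize([0.1, -0.5, -0.9, 2.0])
print(xnor_score(a, b, 4))  # dims 0 and 1 agree -> 2
```

On real hardware the XNOR and popcount each map to a single instruction per 32- or 64-bit word, which is where the speedup over float dot products comes from.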

While this isn't ANYTHING like an actual quantized LLM, I thought it was a really nice proof-of-concept, and could be very useful for smaller machines running RAG applications.

My Code: https://github.com/bigattichouse/bitvector_research

NOTE: I'm not claiming this is 30X faster than GPUs; I'm saying a CPU implementation of this approach could be 30X faster than a conventional float-vector CPU implementation.
