w1nk | 2 years ago
From my understanding of the issue, mmap'ing the file is showing that inference is only accessing a fraction of the weight data.
Doesn't the forward pass necessitate accessing all the weights and not a fraction of them?
losteric | 2 years ago
Sounds like the big win is the load-time improvement from these optimizations. Also, maybe llama.cpp can now support low-memory systems by letting the OS page weights in and out via mmap? ... at the end of the day, a quantized 30B model is still 19GB...