Your work is an inspiration as always!! My n00b question is: what do you think is currently the most practical path to running a reasonably-sized LLM (it doesn't have to be the biggest) on a commodity Linux server, i.e. one without a fancy GPU, for hooking up to a hobby web app? (Renting GPU instances on, say, Linode is significantly more expensive than the standard servers that host web apps.) Is this totally out of reach, or are approaches like yours (or others you know of) a feasible path forward?
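To make the question concrete, here's roughly what I picture — a sketch using the llama-cpp-python bindings (the model file, quant level, and parameters are placeholders I made up, not recommendations):

    # CPU-only inference via llama-cpp-python; model path and parameters
    # below are placeholders, not tested recommendations.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-2-7b.Q4_K_M.gguf",  # any 4-bit quant small enough for RAM
        n_ctx=2048,    # context window
        n_threads=4,   # match the server's vCPU count
    )

    out = llm("Q: Name the planets in the solar system. A:",
              max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])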
vikp|2 years ago
pedrovhb|2 years ago
It's a shame the current Llama 2 jumps from 13B to 70B. In the past I tried running larger stuff by making a 32GB swap volume, but it's just impractically slow.
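The back-of-the-envelope math shows why (the ~4.5 bits/weight figure for a q4-style quant is an approximation):

    # Rough RAM needed just for the weights: params * bits_per_weight / 8.
    # 4.5 bits/weight for q4-style quants is approximate, and the KV cache
    # and scratch buffers add more on top.
    GIB = 1024**3
    for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
        for quant, bits in [("~q4", 4.5), ("fp16", 16.0)]:
            print(f"{name} {quant}: {params * bits / 8 / GIB:5.1f} GiB")

So a 4-bit 70B wants roughly 37 GiB for the weights alone, which is why it spills into swap on a typical box, while a 4-bit 13B fits in about 7 GiB.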
brucethemoose2|2 years ago
Also, it's really tricky to even build llama.cpp with a BLAS library to make prompt ingestion less slow. The Oracle Linux OpenBLAS build isn't detected out of the box, and it doesn't perform well compared to x86 for some reason.
LLVM/GCC have some kind of issue identifying the Ampere ARM architecture (-march=native doesn't really work), so maybe this could be improved with the right compiler flags? See the sketch below.
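For reference, roughly the configure step I mean, forcing OpenBLAS and pinning the CPU target by hand (LLAMA_BLAS / LLAMA_BLAS_VENDOR are llama.cpp CMake options of this era; targeting Neoverse-N1 is my guess at what Altra's cores want, since -march=native misfires):

    import subprocess

    # Configure llama.cpp against OpenBLAS and pin the CPU target explicitly
    # instead of relying on -march=native. -mcpu=neoverse-n1 is an assumption
    # about the Ampere Altra cores, not a verified fix.
    subprocess.run(
        [
            "cmake", "-B", "build",
            "-DLLAMA_BLAS=ON",
            "-DLLAMA_BLAS_VENDOR=OpenBLAS",
            "-DCMAKE_C_FLAGS=-mcpu=neoverse-n1",
            "-DCMAKE_CXX_FLAGS=-mcpu=neoverse-n1",
        ],
        cwd="llama.cpp",  # placeholder path to a local checkout
        check=True,
    )
    subprocess.run(["cmake", "--build", "build"], cwd="llama.cpp", check=True)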