Author here: I implemented several versions of matmul with different unrolling schemes using the Vector API, and I got a ~4X speedup with a single thread, but the speedup fades the more threads you add. I think performance is constrained by memory bandwidth, which is saturated with a small number of threads, regardless of vectorization.
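The kind of vectorized dot product described above might look roughly like the sketch below, using the incubating Vector API (`jdk.incubator.vector`). The method name, row-major layout, and FMA-based accumulation here are assumptions for illustration, not the author's actual unrolling schemes:

```java
// Sketch of a Vector API matmul: out[i] = dot(row i of w, x).
// Compile/run with: --add-modules jdk.incubator.vector
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MatmulVector {
    static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

    // w is d x n (row-major), x has length n, out has length d
    static void matmul(float[] out, float[] w, float[] x, int n, int d) {
        for (int i = 0; i < d; i++) {
            int off = i * n;
            FloatVector acc = FloatVector.zero(S);
            int j = 0;
            int upper = S.loopBound(n);
            for (; j < upper; j += S.length()) {
                FloatVector wv = FloatVector.fromArray(S, w, off + j);
                FloatVector xv = FloatVector.fromArray(S, x, j);
                acc = wv.fma(xv, acc); // per-lane fused multiply-add
            }
            float sum = acc.reduceLanes(VectorOperators.ADD);
            for (; j < n; j++) sum += w[off + j] * x[j]; // scalar tail
            out[i] = sum;
        }
    }

    public static void main(String[] args) {
        float[] w = {1, 2, 3, 4, 5, 6}; // 2x3 matrix
        float[] x = {1, 1, 1};
        float[] out = new float[2];
        matmul(out, w, x, 3, 2);
        System.out.println(out[0] + " " + out[1]); // 6.0 15.0
    }
}
```

Note that a sketch like this only accelerates the compute; as the comment above says, once memory bandwidth is saturated, wider lanes or more threads stop helping.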
Thanks for sharing this! It's great to have a reference implementation written in Java. Given the original's simplicity, it's really easy to follow the llama architecture logic.
Hey man, awesome stuff. Surely any JIT compiler will struggle to vectorize something using IntStream.range, though? Looking at matmul, I'd not expect that to be auto-vectorized. The Panama API can be used to vectorize a matmul too; too bad it seems like it will never launch.
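For context, ports in this style typically write matmul as a parallel IntStream over output rows, something like the sketch below (signature and names are assumptions, not necessarily the repo's exact code). The outer stream distributes rows across threads, but the inner dot-product loop stays scalar unless the JIT manages to auto-vectorize it:

```java
import java.util.stream.IntStream;

public class MatmulScalar {
    // xout = w @ x, where w is d x n (row-major).
    // Parallelism comes from the stream; the inner loop is plain scalar code.
    static void matmul(float[] xout, float[] x, float[] w, int n, int d) {
        IntStream.range(0, d).parallel().forEach(i -> {
            float val = 0.0f;
            for (int j = 0; j < n; j++) {
                val += w[i * n + j] * x[j];
            }
            xout[i] = val;
        });
    }

    public static void main(String[] args) {
        float[] w = {1, 2, 3, 4, 5, 6}; // 2x3 matrix
        float[] x = {1, 1, 1};
        float[] xout = new float[2];
        matmul(xout, x, w, 3, 2);
        System.out.println(xout[0] + " " + xout[1]); // 6.0 15.0
    }
}
```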
Have any of you used these things for anything useful? I can't get them to give useful results on my 3060 8gb. If I wanted decent results, I think I'd need to rent a GPU node somewhere, but ChatGPT is still free.
To quote Gary Frost (creator of Aparapi), TornadoVM is the state-of-the-art right now. He mentioned this at JVMLS 2023. Hopefully the videos will be available soon from this link:
https://openjdk.org/projects/mlvm/jvmlangsummit/
gavinray|2 years ago
Looked at the author and realized it's Alfonso from the Graal team -- makes sense.
I wonder whether the "matmul" code could be further optimized with the Vector API and SIMD.
atairov|2 years ago
In case anyone is interested in a Python version, I spent some time over the weekend and ported it to pure Python -- https://github.com/tairov/llama2.py
I never knew that it would take only about 500 lines of core code to implement inference for such a cutting-edge AI technology.
jiehong|2 years ago
Are there any abstractions for GPGPU or shader programming?
pjmlp|2 years ago
http://javagl.de/jcuda.org/
https://dragan.rocks/software/
https://blogs.oracle.com/javamagazine/post/programming-the-g...
mike_hearn|2 years ago
But it's a research project.