nigma | 2 years ago
First of all, the project is based on llama.cpp[1], which does the heavy lifting of loading and running multi-GB model files on GPU/CPU; the inference speed is therefore not limited by the choice of wrapper (there are other wrappers in Go, Python, Node, Rust, etc., or one can use llama.cpp directly). The size of the binary is also not that important when common quantized model files are often in the range of 5-40GB and require a beefy GPU or a machine with 16-64GB of RAM.
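As a rough back-of-the-envelope check on those file sizes, the bulk of a quantized model file is just the weights at a given bit width, so size scales as parameter count times bits per weight. A minimal sketch (weights only; real GGUF files run somewhat larger because of per-block scales and metadata):

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of a quantized model file in GB (weights only,
    ignoring per-block quantization scales and file metadata)."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B model at 4 bits/weight is ~3.5 GB; a 70B model at 4 bits is ~35 GB,
# which matches the 5-40GB range of common quantized model files.
print(quantized_size_gb(7e9, 4))   # → 3.5
print(quantized_size_gb(70e9, 4))  # → 35.0
```

This is why the wrapper binary's own size is negligible: even a few tens of MB disappears next to the model file it loads.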