(no title)
nenkoru | 2 years ago
- Low memory footprint (thanks to quantization)
- Fast inference (thanks to IO binding)
Particularly in the case of Alpaca, I have seen a 5x decrease in latency on an A100 and 10x on an AMD EPYC. I believe this is the way for users to have an AI that can generate a response as fast as their hardware allows. I have also added a link to my profile on HF with small Alpacas converted to ONNX format. Take a look at them.