Hyzer | 2 years ago
Comparing it to llama.cpp on my M1 Max (32 GB), it seems at least as fast, just eyeballing it. I'm not sure the inference-speed numbers can be compared directly, though.
vicuna-7b-v0 on Chrome Canary with the disable-robustness flag: encoding 74.4460 tokens/sec, decoding 18.0679 tokens/sec ≈ 55.3 ms per decoded token
llama.cpp: $ ./main -m models/7B/ggml-model-q4_0-ggjt.bin -t 8 --ignore-eos = 45 ms per token
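For anyone checking the comparison: throughput in tokens/sec converts to per-token latency as 1000 / rate. A quick sketch (function name is mine):

```python
# Convert throughput (tokens/sec) into per-token latency (ms/token).
def ms_per_token(tokens_per_sec: float) -> float:
    return 1000.0 / tokens_per_sec

# vicuna-7b-v0 numbers from the comment above
print(round(ms_per_token(74.4460), 1))  # encoding: 13.4 ms/token
print(round(ms_per_token(18.0679), 1))  # decoding: 55.3 ms/token
```

Note that only the decoding latency is directly comparable to llama.cpp's reported ms per token.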