Small point of order: "bit slower" might not set expectations accurately. You noted in a previous post in the same thread[^1] that we'd expect about 1 minute per 10K tokens(!) of prompt processing time with the smaller model. I agree, and I contribute to llama.cpp. If anything, that estimate is quite generous.
[^1]: https://news.ycombinator.com/item?id=43595888
terhechte|11 months ago
anoncareer0212|11 months ago
IIUC the data we have:
2K tokens / 12 seconds ≈ 166 tokens/s prefill
120K tokens / (10 minutes = 600 seconds) = 200 tokens/s prefill
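The two prefill rates above can be sanity-checked with a few lines of arithmetic (a sketch using only the figures quoted in the thread):

```python
# Figures quoted upthread: 2K tokens in 12 s, 120K tokens in 10 minutes.
prompt_small = 2_000        # tokens
time_small = 12             # seconds
prompt_large = 120_000      # tokens
time_large = 10 * 60        # 10 minutes = 600 seconds

rate_small = prompt_small / time_small   # prefill rate, small prompt
rate_large = prompt_large / time_large   # prefill rate, large prompt
print(round(rate_small, 1), rate_large)  # ≈166.7 and 200.0 tokens/s
```

Both numbers match the rates stated in the comment (166 tokens/s rounds down from 166.7).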
kgwgk|11 months ago
It seems the other way around?
120k : 2k = 600s : 10s
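The proportion can be checked directly; note the measured small-prompt time upthread was 12 s rather than 10 s, which is what makes the two ratios differ (a sketch, using only the thread's figures):

```python
# Ratio check for the quoted figures: 120K vs 2K tokens, 600 s vs 12 s.
token_ratio = 120_000 / 2_000   # the large prompt has 60x the tokens
time_ratio = 600 / 12           # but took only 50x the time
print(token_ratio, time_ratio)  # 60.0 50.0
```

Since time grew more slowly than token count, the 120K-token run had the higher per-token prefill rate (200 vs ≈166 tokens/s).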