item 38739479

badFEengineer | 2 years ago

This was surprisingly fast: 276.27 T/s (although Llama 2 70B is noticeably worse than GPT-4 Turbo). I'm actually curious whether there are good benchmarks for inference tokens per second. I imagine optimizing for throughput is a bit different from optimizing for single-stream inference, but I'd be curious if there's an analysis of this somewhere.

edit: I re-ran the same prompt on Perplexity's llama-2-70b and got 59 tokens per second there
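
For reference, a minimal sketch of how single-stream tokens-per-second is usually measured. This assumes an OpenAI-compatible streaming endpoint; the URL, API key, and model name are placeholders, and counting one token per streamed chunk is an approximation:

    # Rough single-stream decode-speed measurement.
    # Assumes an OpenAI-compatible streaming API; the endpoint URL,
    # API key, and model name below are placeholders.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="https://example.com/v1", api_key="...")

    stream = client.chat.completions.create(
        model="llama-2-70b",  # placeholder model name
        messages=[{"role": "user", "content": "Explain KV caching."}],
        stream=True,
        max_tokens=512,
    )

    first_token_time = None
    n_chunks = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_time is None:
                # Start the clock at the first token, so time-to-first-token
                # doesn't get mixed into the decode-speed number.
                first_token_time = time.perf_counter()
            n_chunks += 1
    elapsed = time.perf_counter() - first_token_time

    # Most servers stream roughly one token per chunk, so this is
    # approximate; tokenize the full output for an exact count.
    print(f"~{n_chunks / elapsed:.1f} tokens/sec (decode only)")

Measuring throughput is a different exercise: you'd fire many concurrent requests and divide total tokens generated by wall-clock time, which usually gives a much higher aggregate number than the single-stream figure above.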

andygeorge | 2 years ago

fast but wrong/gibberish

razorguymania | 2 years ago

It's using vanilla Llama 2 from Meta with no fine-tuning. The point here is the speed and responsiveness of the underlying hardware and software.