seldo | 1 year ago

For agentic use cases, where you might need several round-trips to the LLM to reflect on a query, improve a result, etc., getting fast inference means you can do more round-trips while still responding in a reasonable time. So basically any LLM use case is improved by having greater speed available, IMO.
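
A rough back-of-the-envelope sketch of the round-trip argument, in Python, with entirely hypothetical numbers (the latency budget, tokens per trip, and throughput figures are made up for illustration):

    # Hypothetical numbers: how many agentic round-trips fit in a fixed
    # latency budget at a given decode throughput. Time to first token is
    # ignored here (see the reply below).
    LATENCY_BUDGET_S = 10.0   # total time allowed before answering the user
    TOKENS_PER_TRIP = 500     # output tokens generated per LLM round-trip

    def max_round_trips(tokens_per_sec: float) -> int:
        """Number of reflect/improve round-trips that fit in the budget."""
        seconds_per_trip = TOKENS_PER_TRIP / tokens_per_sec
        return int(LATENCY_BUDGET_S / seconds_per_trip)

    for tps in (50, 200, 1000):
        print(f"{tps:>5} tok/s -> {max_round_trips(tps):>2} round-trips")
    #    50 tok/s ->  1 round-trips
    #   200 tok/s ->  4 round-trips
    #  1000 tok/s -> 20 round-trips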

freediver | 1 year ago

The problem with this is that tok/sec does not tell you what the time to first token is. I've seen cases (with Groq) where it is large for long prompts, nullifying the advantage of faster tok/sec.
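
Extending the same hypothetical sketch to include time to first token (TTFT) makes the objection concrete: wall-clock time per round-trip is roughly TTFT + output_tokens / tok_per_sec, so a large prefill delay can dominate. All numbers below are made up for illustration:

    # Hypothetical numbers: per-round-trip latency once time to first
    # token is included. A large TTFT on long prompts can erase the
    # benefit of a faster decode rate.
    TOKENS_PER_TRIP = 500  # output tokens generated per round-trip

    def trip_latency_s(ttft_s: float, tokens_per_sec: float) -> float:
        """Total wall-clock time for one round-trip: prefill wait + decode."""
        return ttft_s + TOKENS_PER_TRIP / tokens_per_sec

    # Fast decode with slow prefill vs. slower decode with fast prefill:
    print(trip_latency_s(ttft_s=4.0, tokens_per_sec=1000))  # 4.5 s
    print(trip_latency_s(ttft_s=0.5, tokens_per_sec=200))   # 3.0 s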