top | item 38478147

dnnssl2 | 2 years ago

What are some of the better use cases of fast inference? From my experience using ChatGPT, I don't need it to generate faster than I can read, but waiting for code generation is painful because I'm waiting for the whole code block to format correctly, be available to copy or execute (in the case of code interpreter). Anything else fall under this pattern?

rfw300 | 2 years ago

The main thing is that chat is just one application of LLMs. Other applications are much more latency sensitive. Imagine, for instance, an LLM-powered realtime grammar checker in an editor.
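A rough sketch of why that's so latency sensitive: an in-editor checker is typically debounced on keystrokes, so the model only has the pause a typist naturally leaves in which to respond. Everything below is hypothetical; `check_fn` stands in for whatever LLM call the editor would make.

```python
import threading

# Hypothetical sketch: debounce grammar checks on keystrokes so only the
# text state after a typing pause actually hits the model.
class DebouncedChecker:
    def __init__(self, delay_s, check_fn):
        self.delay_s = delay_s      # how long a typing pause must last
        self.check_fn = check_fn    # stand-in for the LLM call
        self.results = []
        self._timer = None

    def on_keystroke(self, text):
        # Each keystroke cancels the pending check and restarts the countdown.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.delay_s, self._run, args=(text,))
        self._timer.start()

    def _run(self, text):
        self.results.append(self.check_fn(text))
```

The model's round trip has to fit inside `delay_s` (a few hundred milliseconds at most) or the suggestions lag behind the cursor, which is why raw inference speed matters far more here than in chat.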

lmeyerov | 2 years ago

Most LLM use shouldn't be 'raw' but part of a smart & iterative pipeline. Ex:

* reading: If you want it to do inference over a lot of context, you'll need to do multiple inferences. If each inference is faster, you can 'read' more in the same time on the same hardware

* thinking: a lot of analytical approaches essentially use writing as both memory & thinking. Imagine iterative summarization, or automatically iteratively refining code until it's right

For louie.ai sessions, that's meant a fascinating trade-off when doing the above:

* We can use smarter models like gpt-4 to do fewer iterations...

* ... or a faster but dumber model to get more iterations in the same amount of time

The answer is entirely non-obvious. For example, the HumanEval leaderboard has GPT-4 for code being beaten by GPT-3.5 for code when run by a LATS agent: https://paperswithcode.com/sota/code-generation-on-humaneval . This highlights that the agent framework is really responsible for final result quality, so its ability to run many iterations in the same time window matters.
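As a toy model of that trade-off (the latencies below are made up for illustration, not benchmarks): if each agent iteration costs one full inference round trip, the number of iterations that fit in a fixed time budget is just the budget divided by per-iteration latency.

```python
# Toy model of the smarter-vs-faster trade-off (illustrative numbers only).
def iterations_in_budget(latency_s: float, budget_s: float) -> int:
    # Each agent iteration costs one full inference round trip.
    return int(budget_s // latency_s)

# A slower, smarter model vs. a faster, dumber one in a 60-second window:
smart = iterations_in_budget(latency_s=15.0, budget_s=60.0)  # 4 iterations
fast = iterations_in_budget(latency_s=2.0, budget_s=60.0)    # 30 iterations
```

Whether 30 attempts from the dumber model beat 4 from the smarter one is exactly the empirical question results like the LATS leaderboard entry speak to.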

jasonjmcghee | 2 years ago

Programmatic and multi-step use cases. If you need chain-of-thought or similar, tool use, etc. Generating data.

Most use cases outside of classic chat.

For example, I made an on-demand educational video project, and the slowest part was by far the content generation. RAG, TTS, Image generation, text rendering, and video processing were all a drop in the bucket, in comparison.

The gap would be even wider now: TTS is faster than realtime, and image generation can be done in a single step.

ClarityJones | 2 years ago

Perhaps this is naive, but in my mind it can be useful for learning.

- Hook an LLM to VMs.

- Ask for code that [counts to 10].

- Run the code on a VM.

- Ask a different LLM to evaluate the results.

- Repeat for sufficient volume.

- Train.

The faster it can generate results, the faster those results can be tested against the real world: a VM, users on X, or other models with known accuracies.
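A minimal sketch of that loop, with both "LLMs" stubbed out (the function names are made up) and a subprocess standing in for the VM:

```python
import subprocess
import sys

# Toy version of the generate -> run -> evaluate -> train loop. A real
# pipeline would call two different models here; these stubs are made up.
def generator_llm(task: str) -> str:
    # Pretend the model wrote code for the task "count to 10".
    return "print(list(range(1, 11)))"

def evaluator_llm(task: str, output: str) -> bool:
    # Pretend a second model judges the run's output against the task.
    return output.strip() == str(list(range(1, 11)))

def run_in_vm(code: str) -> str:
    # Stand-in for an isolated VM: a subprocess with a timeout.
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, timeout=5)
    return result.stdout

training_pairs = []
task = "count to 10"
code = generator_llm(task)
output = run_in_vm(code)
if evaluator_llm(task, output):
    # Keep only verified (task, code) examples for training.
    training_pairs.append((task, code))
```

The whole loop is bounded by generation latency, so the faster the generator model, the faster verified training pairs accumulate.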

wedn3sday | 2 years ago

One obvious use case is that it makes per-token generation much cheaper.

dnnssl2 | 2 years ago

That's not so much a use case, but I get what you're saying. It's nice that you can find optimizations that shift the Pareto frontier down across the cost and latency dimensions. The hard trade-offs are for cases like inference batching, where it's cheaper and higher throughput but slower for the end consumer.
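A back-of-envelope sketch of that batching trade-off (the cost model below is made up for illustration): a bigger batch makes each decode step somewhat slower, but amortizes the step over many sequences.

```python
# Illustrative numbers only, not measurements of any real system.
def decode_step_time(batch_size: int) -> float:
    # Made-up model: a fixed per-step overhead plus a per-sequence cost.
    return 0.02 + 0.001 * batch_size  # seconds per decode step

def tokens_per_second(batch_size: int) -> float:
    # Each decode step yields one token per sequence in the batch.
    return batch_size / decode_step_time(batch_size)
```

Throughput (and thus cost per token) improves with batch size, but any single user's token-to-token latency gets worse, which is exactly the cheaper-but-slower tension.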

What's a good use case for an order-of-magnitude decrease in price per token? Web-scale "analysis" or cleaning of unstructured data?