top | item 42430368

roh26it | 1 year ago

What are the trade-offs you've made to achieve this?

adiraja|1 year ago

We focused mainly on the scheduling side of things: we essentially prioritize prefills over decodes. To do this correctly, we had to monitor KV cache usage, and whenever the cache is close to running out of memory, we go back to scheduling more decodes.

This means you end up either having many decodes wait for prefills to complete or scheduling decodes together with prefills. Both scenarios result in slower decodes, which is why we're seeing an increase in ITL (inter-token latency). This is the main trade-off we've made.
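
To make the policy concrete, here is a minimal toy sketch in Python, not our actual scheduler. The names `Request`, `simulate` and `high_watermark`, the 90% threshold, and the one-prefill-per-step model are all assumptions for illustration. It admits a waiting prefill whenever the KV cache stays under the watermark, and otherwise runs one decode step for everything already running. Preemption and mixed prefill/decode batches are not modeled.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0

    @property
    def kv_tokens(self) -> int:
        # KV cache slots held by this request: prompt plus generated tokens.
        return self.prompt_tokens + self.generated


def simulate(requests, kv_capacity, high_watermark=0.9):
    """Prefill-first scheduling with a KV cache watermark (hypothetical)."""
    waiting = deque(requests)
    running = []
    trace = []
    budget = high_watermark * kv_capacity
    while waiting or running:
        used = sum(r.kv_tokens for r in running)
        if waiting and used + waiting[0].prompt_tokens <= budget:
            # Prefill has priority while the cache is below the watermark.
            req = waiting.popleft()
            running.append(req)
            trace.append(("prefill", req.rid))
        elif running:
            # Cache is nearly full (or nothing is waiting): decode one token
            # for every running request, then retire the finished ones.
            for r in running:
                r.generated += 1
            trace.append(("decode", tuple(r.rid for r in running)))
            running = [r for r in running if r.generated < r.max_new_tokens]
        else:
            raise ValueError("prompt does not fit in the KV cache budget")
    return trace


if __name__ == "__main__":
    reqs = [Request(0, 4, 2), Request(1, 4, 2), Request(2, 4, 2)]
    for step in simulate(reqs, kv_capacity=10):
        print(step)
```

In the trace, request 0 is prefilled first but does not decode until request 1's prefill has also run. That wait is the kind of stall that shows up as higher ITL.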

roh26it|1 year ago

So, while time to first token is lower, throughput might also be lower in most cases?