We focused mainly on the scheduling side of things: we essentially prioritize prefills over decodes. To do this correctly, we had to monitor KV cache usage, and whenever the cache is close to running out of memory, we go back to scheduling more decodes.
This means you end up either having many decodes wait for prefills to complete, or scheduling decodes together with prefills. Both scenarios slow down decodes, which is why we're seeing an increase in inter-token latency (ITL). This is the main tradeoff we've made.
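To make the policy above concrete, here is a rough sketch of a prefill-first scheduler with a KV cache watermark. It is only an illustration under stated assumptions, not the actual implementation: the names `Request`, `PrefillFirstScheduler`, `kv_capacity_tokens`, and `high_watermark` are hypothetical, KV usage is approximated as one slot per token, and each step runs either a single prefill or one decode round (no mixed batches).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; KV usage is counted in tokens."""
    req_id: str
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0

    @property
    def kv_tokens(self) -> int:
        # Assumption: one KV cache slot per prompt or generated token.
        return self.prompt_tokens + self.generated


class PrefillFirstScheduler:
    """Toy scheduler: admit prefills first, fall back to decodes when
    admitting the next prompt would push KV usage past the watermark."""

    def __init__(self, kv_capacity_tokens: int, high_watermark: float = 0.9):
        self.kv_capacity = kv_capacity_tokens
        self.high_watermark = high_watermark  # assumed threshold, not from the source
        self.waiting = deque()  # requests that still need a prefill
        self.running = []       # requests currently decoding

    def add(self, request: Request) -> None:
        self.waiting.append(request)

    def kv_used(self) -> int:
        return sum(r.kv_tokens for r in self.running)

    def step(self):
        """Run one scheduling step and return (kind, requests)."""
        if self.waiting:
            nxt = self.waiting[0]
            projected = self.kv_used() + nxt.prompt_tokens
            # Prefill has priority as long as the cache stays under the watermark.
            if projected <= self.high_watermark * self.kv_capacity:
                self.waiting.popleft()
                self.running.append(nxt)
                return ("prefill", [nxt])
        if self.running:
            # Cache is nearly full (or nothing is waiting): run a decode round.
            batch = list(self.running)
            for r in batch:
                r.generated += 1
            # Finished requests leave the batch and free their KV slots.
            self.running = [r for r in self.running if r.generated < r.max_new_tokens]
            return ("decode", batch)
        # Nothing runnable. A waiting prompt larger than the watermark stays
        # queued forever in this sketch; a real system would need to handle that.
        return ("idle", [])
```

In this sketch, running requests only get decode steps once the next prompt no longer fits under the watermark, so their tokens arrive later than they would under a decode-first policy. That waiting is the ITL increase described above.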
adiraja|1 year ago
roh26it|1 year ago