item 35244794

borzunov | 2 years ago

A Petals dev here. FlexGen is good at high-throughput inference (generating many sequences in parallel). For single-batch inference, it spends more than 5 sec/token on GPT-3/BLOOM-sized models.

So, I believe 1 sec/token with Petals is the best you can get for models of this size, unless you have enough GPUs to fit the entire model into GPU memory (you'd need 3x A100 or 8x 3090 for the 8-bit quantized model).
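The GPU counts follow from simple memory math. As a back-of-envelope sketch (the 176B parameter count and per-GPU memory sizes are assumptions for BLOOM-176B, A100-80GB, and RTX 3090, not stated in the comment):

```python
import math

# BLOOM-176B in int8 quantization: roughly 1 byte per parameter,
# ignoring activation/buffer overhead (assumption for illustration).
weights_gb = 176

# Assumed per-GPU memory capacities in GB.
gpus = {"A100-80GB": 80, "RTX 3090": 24}

for name, mem_gb in gpus.items():
    # Each GPU holds a shard of the layers, so round up.
    count = math.ceil(weights_gb / mem_gb)
    print(f"{name}: at least {count} GPUs for the weights alone")
```

This reproduces the figures above: ceil(176/80) = 3 A100s, ceil(176/24) = 8 3090s, before counting activations and KV-cache memory.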
