
borzunov | 2 years ago

A Petals dev here. We say up front that "Single-batch inference runs at ≈ 1 sec per step (token)".

In turn, "parallel inference" refers to the high-throughput scenario where you generate multiple sequences in parallel. This is useful when you process a large dataset with an LLM (e.g., run inference with a batch size of 200) or run beam search with a large beam width. In this case, you can actually get speeds of hundreds of tokens per sec — see our benchmarks for parallel forward passes: https://github.com/bigscience-workshop/petals#benchmarks
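The throughput claim above is just amortization arithmetic. A minimal sketch (illustrative only — not the Petals API), assuming each inference step takes a roughly fixed wall-clock time regardless of batch size until compute saturates:

```python
# Illustrative throughput arithmetic, not actual Petals code.
# Assumption: per-step latency is roughly constant across batch sizes.

STEP_LATENCY_S = 1.0  # ~1 sec per step, as quoted for single-batch inference


def aggregate_tokens_per_sec(batch_size: int,
                             step_latency_s: float = STEP_LATENCY_S) -> float:
    """Each step emits one token per sequence in the batch,
    so aggregate throughput scales with batch size."""
    return batch_size / step_latency_s


print(aggregate_tokens_per_sec(1))    # single sequence: 1.0 token/s
print(aggregate_tokens_per_sec(200))  # batch of 200: 200.0 tokens/s aggregate
```

Per-sequence latency stays the same (~1 token/s each); only the aggregate throughput grows, which is why batching helps dataset processing but not interactive chat.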

If you have clearer wording in mind that is more up front, please let us know — we'd be happy to improve the project description. Petals is a non-commercial research project, and we don't want to oversell anything.


null4bl3 | 2 years ago

Can it run in a docker-compose container with a set resource limit?

Does each node earn points for supplying resources that can then be spent for greater query/processing speed?
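On the docker-compose question: a minimal config sketch under the Compose spec's `deploy.resources.limits` keys. The image name and server command here are placeholders/assumptions, not the project's documented setup:

```yaml
# Sketch only: image name and command are assumed, not official.
services:
  petals-node:
    image: petals-server:latest        # hypothetical image name
    command: python -m petals.cli.run_server bigscience/bloom-petals  # assumed entrypoint
    deploy:
      resources:
        limits:
          cpus: "4.0"                  # cap at 4 CPU cores
          memory: 16G                  # cap RAM at 16 GiB
```

Note that GPU access and memory limits for GPU workloads need additional runtime configuration (e.g., `device_reservations` for GPUs), which the Compose resource limits above do not cover.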

borzunov | 2 years ago

Sure! The point system is being developed, we'll ship it soon. Once it's ready, you'll be able to spend points on high-priority requests to increase the speed.