(no title)
borzunov | 2 years ago
In turn, "parallel inference" refers to the high-throughput scenario where you generate multiple sequences in parallel. This is useful when you process a large dataset with an LLM (e.g., run inference with a batch size of 200) or run beam search with a large beam width. In this case, you can actually get speeds of hundreds of tokens per second; see our benchmarks for parallel forward passes: https://github.com/bigscience-workshop/petals#benchmarks
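For illustration, here is a minimal sketch of what batched inference looks like with the Petals client. The model name and prompts are placeholders, and it assumes the client mirrors the standard transformers generate() interface (padding/attention-mask handling may differ between versions):

    # Sketch of batched ("parallel") inference via Petals; model name and prompts are placeholders.
    import torch
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    MODEL_NAME = "bigscience/bloom-petals"  # any Petals-served model

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.padding_side = "left"              # pad on the left so generation continues from real tokens
    tokenizer.pad_token = tokenizer.eos_token    # ensure a pad token exists for batching
    model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

    prompts = ["A cat sat on", "The capital of France is"]  # e.g. 200 rows in the batch scenario
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model.generate(inputs["input_ids"], max_new_tokens=20)

    for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
        print(text)

All sequences in the batch are decoded in the same forward passes over the swarm, which is where the throughput gain comes from.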
If you have another wording in mind that is more upfront, please let us know; we'd be happy to improve the project description. Petals is a non-commercial research project, and we don't want to oversell anything.
null4bl3 | 2 years ago
Does each node earn points for supplying resources that can then be spent for greater query / processing speed?
borzunov | 2 years ago