top | item 34874976

borzunov | 3 years ago

Note that the authors report the speed of generating many sequences in parallel (per token):

> The batch size is tuned to a value that maximizes the generation throughput for each system.

> FlexGen cannot achieve its best throughput in [...] single-batch case.

For 175B models, this likely means the system takes a few seconds for each generation step, but you can generate many sequences in parallel and get good performance _per token_.

However, what you actually need for ChatGPT and interactive LM apps is to generate _one_ sequence reasonably quickly (so a generation step takes <= 1 sec/token). I'm not sure this system can be used for that, since our measurements [1] show that even the theoretically best RAM-offloading setup can't run single-batch generation faster than 5.5 sec/token due to hardware constraints.
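The throughput-vs-latency distinction above can be sketched numerically. A minimal sketch, where the 5.5 s step time echoes the figure cited here but the batch size of 64 is an illustrative assumption, not a number from the paper or from [1]:

```python
# Illustrative arithmetic only: the batch size is an assumed value,
# not a measurement from the FlexGen paper or from [1].

def per_token_throughput(step_time_s: float, batch_size: int) -> float:
    """Aggregate tokens generated per second across the whole batch.

    Each generation step emits one token for every sequence in the
    batch, so throughput scales with batch size even when the step
    itself is slow.
    """
    return batch_size / step_time_s

def single_sequence_latency(step_time_s: float) -> float:
    """Seconds per token seen by any one sequence -- unchanged by batching."""
    return step_time_s

# Suppose an offloading setup takes 5.5 s per generation step but can
# fit 64 sequences in a batch:
step, batch = 5.5, 64

print(per_token_throughput(step, batch))   # ~11.6 tokens/s in aggregate
print(single_sequence_latency(step))       # still 5.5 s/token for each user
```

The point is that tuning batch size can make the aggregate per-token number look good while each individual sequence still crawls, which is exactly why the single-batch case matters for interactive apps.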

The authors don't report the speed of single-batch generation in either the repo or the paper.

[1] https://arxiv.org/pdf/2209.01188.pdf

152334H | 3 years ago

I spoke with the authors of the paper; the leftmost points in Figure 1 were generated with batch size 1, indicating ~1.2x and ~2x improvements in speed over DeepSpeed for 30B and 175B models respectively. For reference, this is speeding up from ~0.009 tokens/s to about ~0.02 tokens/s on 175B.
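Converting those quoted batch-size-1 figures into seconds per token makes it clear why they're underwhelming (the 0.009 and 0.02 tokens/s numbers are the ones quoted above; everything else is straightforward arithmetic):

```python
# Sanity check on the batch-size-1 figures quoted above
# (~0.009 tokens/s for DeepSpeed -> ~0.02 tokens/s for FlexGen, 175B).

def seconds_per_token(tokens_per_s: float) -> float:
    """Invert a tokens/s rate into per-token latency."""
    return 1.0 / tokens_per_s

baseline = 0.009   # DeepSpeed, tokens/s
flexgen = 0.02     # FlexGen, tokens/s

speedup = flexgen / baseline
print(round(speedup, 2))                     # ~2.22x
print(seconds_per_token(flexgen))            # 50.0 s per token
```

So even with the ~2x speedup, a single interactive sequence still waits on the order of a minute per token at 175B.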

These results are generally unimpressive, of course. Most of the improvements at that point are attributable to the authors making use of a stripped down library for autoregressive sampling. HN falling for garbage once again...

ImprobableTruth | 3 years ago

Calling this garbage is absolutely wild. The authors make it very clear that this is optimized for throughput, not latency. Throughput-focused scenarios absolutely do exist; editorializing this as "running large language models like ChatGPT" and focusing on chatbot applications is the fault of HN.

It's also a neat result that 4-bit quantization doesn't cause much of an issue even at 175B, though that was kinda to be expected.