borzunov | 3 years ago
> The batch size is tuned to a value that maximizes the generation throughput for each system.
> FlexGen cannot achieve its best throughput in [...] single-batch case.
For 175B models, this likely means that the system takes a few seconds for each generation step, but you can generate multiple sequences in parallel and get good _per-token_ throughput.
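A toy sketch of why batching helps throughput but not latency, assuming (hypothetically) that a generation step takes roughly the same wall-clock time regardless of batch size because streaming the weights dominates:

```python
# Illustrative numbers only, not measurements from FlexGen or the paper.
def per_token_throughput(batch_size: int, step_s: float) -> float:
    """Aggregate tokens generated per second across the whole batch,
    given a fixed per-step latency of step_s seconds."""
    return batch_size / step_s

# With a hypothetical 5 s generation step:
print(per_token_throughput(1, 5.0))   # 0.2 tok/s - single sequence
print(per_token_throughput(64, 5.0))  # 12.8 tok/s - same latency, 64x throughput
```

The per-sequence latency is unchanged in both cases; only the aggregate token rate improves.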
However, what you actually need for ChatGPT and interactive LM apps is to generate _one_ sequence reasonably quickly (i.e., <= 1 sec/token per generation step). I'm not sure this system can be used for that, since our measurements [1] show that even a theoretically optimal RAM-offloading setup can't run single-batch generation faster than 5.5 sec/token due to hardware constraints.
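A back-of-the-envelope sketch of the hardware constraint being described, assuming single-batch decoding with full weight offloading must stream the entire model over the CPU-GPU link on every step. The parameter count, bytes-per-weight, and link bandwidth below are illustrative assumptions, not the measurements from [1]:

```python
# Bandwidth-bound latency floor: each decoding step reads all weights
# from host RAM, so step latency >= weight_bytes / link_bandwidth.
def min_step_latency_s(n_params: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Lower bound on seconds per generated token for offloaded decoding."""
    weight_gb = n_params * bytes_per_param / 1e9
    return weight_gb / bandwidth_gb_s

# 175B parameters in fp16 (350 GB) over an assumed ~32 GB/s PCIe 4.0 x16 link:
print(min_step_latency_s(175e9, 2, 32))  # ~10.9 s/token
```

Even generous assumptions about quantization and bandwidth leave single-batch offloaded decoding far above 1 sec/token, which is the point being made.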
The authors don't report single-batch generation speed in either the repo or the paper.
152334H | 3 years ago
These results are generally unimpressive, of course. Most of the improvement at that point is attributable to the authors using a stripped-down library for autoregressive sampling. HN falling for garbage once again...
ImprobableTruth | 3 years ago
It's also a neat result that fp4 quantization doesn't cause much of an issue even at 175B, though that was kinda to be expected.