ein0p|10 months ago
Note that this is _way_ slower at the small batch sizes you'd need for interactive use. At batch size 1 it seems to run at 1/3rd the speed of bf16 (so about 1/6th the speed of the fp8 you'd realistically be using), if figure 5 is to be believed. That is actually a pretty impressive feat in itself if you know anything about GPU kernel programming, but it is much slower nevertheless. For this to work at "wire speed" it'd need hardware support, which takes years. Their "baseline" elsewhere in the paper is CPU offloading, which is dog slow and can't be made fast because of the PCIe bottleneck.
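A quick sanity check of the arithmetic above. The specific numbers here (fp8 being ~2x bf16, PCIe 4.0 x16 bandwidth, a 70B fp8 model) are illustrative assumptions on my part, not figures from the paper:

```python
# Throughput ratio: if the method runs at 1/3rd of bf16, and fp8 is
# assumed ~2x bf16 with hardware support, it's ~1/6th of fp8.
bf16 = 1.0               # normalize bf16 decode throughput to 1
fp8 = 2.0 * bf16         # assumption: fp8 ~2x bf16 on supporting hardware
method = bf16 / 3        # "1/3rd the speed of bf16" at batch size 1
print(f"fraction of fp8 speed: {method / fp8:.3f}")  # ~0.167

# Why CPU offloading is bandwidth-bound: if each decode step streams
# the weights over PCIe, the link sets a hard floor on per-token latency.
pcie_gb_s = 32.0         # assumption: PCIe 4.0 x16, ~32 GB/s practical
weights_gb = 70.0        # assumption: 70B params in fp8 (1 byte/param)
print(f"min seconds per token: {weights_gb / pcie_gb_s:.2f}")  # ~2.19
```

That PCIe floor is per token regardless of kernel quality, which is why offloading makes for a weak baseline.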
ow5|10 months ago
Also, when we ran experiments for streaming with the current kernels, we were a median of ~1.3x slower at inference.