VictorSh | 4 years ago
I don't have exact latency numbers, but the inference widget currently runs on a TPU v3-8 (which, if I'm not mistaken, is roughly comparable to a cluster of 8 V100s). That should give you a rough idea of the latency for short inputs.
Note that a colleague just reminded me that it is possible to run inference for T5-11B (which is the size we use) on a single (big) GPU with enough CPU memory, using offloading -> https://github.com/huggingface/transformers/issues/9996#issu...
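For anyone wanting to try this: below is a minimal sketch of GPU+CPU offloading as it can be done with more recent versions of transformers + accelerate. The linked issue predates this API and likely used a different mechanism, and the memory behavior here is an assumption, not a measurement.

```python
# Minimal sketch, assuming recent transformers + accelerate (+ sentencepiece).
# device_map="auto" keeps as many layers as fit on the GPU and offloads the
# rest to CPU RAM (and, via offload_folder, to disk if RAM also runs out).
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-11b")
model = T5ForConditionalGeneration.from_pretrained(
    "t5-11b",
    device_map="auto",          # split layers across GPU and CPU automatically
    torch_dtype=torch.float16,  # halves the memory footprint vs fp32
    offload_folder="offload",   # spill to disk as a last resort
)

inputs = tokenizer(
    "translate English to German: Hello, world!",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Expect offloaded inference to be much slower than keeping the whole model on-device, since weights get shuttled between CPU and GPU on each forward pass.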