Correct me if I'm wrong, but if you run multiple inferences at the same time on the same GPU, don't you need to load multiple copies of the model into VRAM, and won't they fight over resources? So running 10 parallel inferences would slow everything down several times over, right? Or am I missing something?
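What this question may be missing is batching: serving stacks generally load a single copy of the model weights and push many concurrent requests through the model together in one forward pass, rather than loading one model copy per request. A minimal NumPy sketch of the idea (toy shapes and a single matmul standing in for a model, not any real serving API):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d))  # one copy of the "model weights" in VRAM

# 10 concurrent requests, each an input vector of size d
requests = rng.standard_normal((10, d))

# One batched pass serves all 10 requests against the same weights
batched = requests @ W

# Mathematically equivalent to 10 separate passes --
# but never needs 10 copies of W in memory
separate = np.stack([requests[i] @ W for i in range(10)])

assert np.allclose(batched, separate)
print(batched.shape)
```

Because GPUs are throughput machines, the batched matmul typically finishes in far less than 10x the time of a single request, so 10 parallel inferences usually cost much less than a 10x (or even 5x) slowdown until you saturate compute or memory bandwidth.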