Depends how you’re defining "expensive." Prefill can have a lot of tokens to ingest, so it’s a lot of compute in absolute terms. But it’s also much more efficient per token: the whole prompt goes through the model in one batch of large matmuls, so prefill tends to be compute-bound, and you can throw a lot of hardware at the problem. Generation (decode) is the opposite: in wall-clock terms it can be significantly more expensive, because it produces one token per step, each step re-reads the weights for a single token’s worth of math (memory-bandwidth-bound), and you can’t batch within a single sequence (speculative decoding with a draft model is the main workaround).
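To make the compute-bound vs memory-bound distinction concrete, here’s a back-of-envelope sketch of arithmetic intensity (FLOPs per byte of weights read) for a single weight matrix. The dimension and prompt length are made-up illustrative numbers, not from any particular model:

```python
# Toy arithmetic-intensity estimate for one d x d weight matrix.
# Hypothetical numbers; shows why batched prefill is compute-bound
# while one-token-at-a-time decode is memory-bandwidth-bound.

d = 4096             # model dimension (assumed for illustration)
bytes_per_param = 2  # fp16 weights

def arithmetic_intensity(tokens):
    # FLOPs for pushing `tokens` activations through W: 2 * tokens * d * d
    flops = 2 * tokens * d * d
    # Bytes moved is dominated by reading W once per step
    bytes_moved = d * d * bytes_per_param
    return flops / bytes_moved

prefill = arithmetic_intensity(2048)  # whole prompt in one batched matmul
decode = arithmetic_intensity(1)      # one new token per decode step

print(f"prefill: {prefill:.0f} FLOPs/byte")  # ~2048: GPU is busy computing
print(f"decode:  {decode:.0f} FLOPs/byte")   # ~1: GPU is waiting on memory
```

At ~1 FLOP per byte, decode throughput is set by memory bandwidth rather than FLOPs, which is why each generated token is slow even on a big GPU.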
vlovich123 | 6 days ago