Great question! The model can make more efficient use of existing GPU hardware: it performs more computation per unit of memory transferred (higher arithmetic intensity). Since classical LLM inference is typically limited by memory bandwidth rather than raw compute, this means that on older hardware one should be able to reach inference speeds similar to what a classical LLM achieves on recent hardware. This is also interesting commercially, since it opens up new ways of reducing AI inference costs.
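As a rough illustration of that arithmetic-intensity argument, here is a small roofline-style back-of-envelope calculation in Python. The GPU specs and intensity values are made-up placeholders for an "older" and a "recent" card, not measurements of any particular hardware or model; the point is only how the gap between the two shrinks once the workload stops being memory-bound.

```python
# Roofline sketch: attainable throughput is capped either by peak compute
# or by memory bandwidth times arithmetic intensity (FLOPs per byte moved).
# All figures below are illustrative placeholders, not real GPU specs.

def attainable_tflops(peak_tflops, bandwidth_tbs, arithmetic_intensity):
    return min(peak_tflops, bandwidth_tbs * arithmetic_intensity)

older  = dict(peak_tflops=100.0, bandwidth_tbs=0.9)   # hypothetical older GPU, ~0.9 TB/s
recent = dict(peak_tflops=150.0, bandwidth_tbs=3.0)   # hypothetical recent GPU, ~3 TB/s

# Classical autoregressive decoding moves lots of weights per token, so its
# arithmetic intensity is low; a model doing more compute per byte sits higher.
for name, ai in [("memory-bound decode", 10), ("compute-heavy model", 500)]:
    old = attainable_tflops(**older, arithmetic_intensity=ai)
    new = attainable_tflops(**recent, arithmetic_intensity=ai)
    print(f"{name:20s}  older GPU: {old:6.1f} TFLOP/s   recent GPU: {new:6.1f} TFLOP/s")
```

With these placeholder numbers, the recent GPU is roughly 3x faster on the memory-bound case but only about 1.5x faster on the compute-heavy one, which is the sense in which older hardware can stay competitive.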