top | item 44572508

AlekseiSavin | 7 months ago

You're right: modern edge devices are powerful enough to run small models, so the real bottleneck for a forward pass is usually memory bandwidth, which sets the theoretical upper limit on inference speed. Right now we've figured out how to run computations in a granular way on specific processing units, but we expect the real benefits to come later, when we add support for VLMs and advanced speculative decoding, where you process more than one token at a time.
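To illustrate why memory bandwidth caps decode speed: each generated token requires streaming the full set of model weights from memory, so tokens/sec can't exceed bandwidth divided by model size in bytes. A back-of-envelope sketch (the device bandwidth and model size below are illustrative assumptions, not measurements):

```python
def max_tokens_per_sec(model_params: float, bytes_per_param: float,
                       mem_bandwidth_gb_s: float) -> float:
    """Upper bound on autoregressive decode speed when weight
    reads dominate (one full pass over the weights per token)."""
    model_bytes = model_params * bytes_per_param
    return (mem_bandwidth_gb_s * 1e9) / model_bytes

# e.g. a 3B-parameter model quantized to 4 bits (0.5 bytes/param)
# on a hypothetical edge device with ~50 GB/s memory bandwidth:
limit = max_tokens_per_sec(3e9, 0.5, 50.0)
print(f"~{limit:.1f} tokens/sec upper bound")  # ~33.3 tokens/sec
```

This is also why speculative decoding helps: verifying several draft tokens in one forward pass amortizes that single sweep over the weights across multiple tokens.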


J_Shelby_J | 7 months ago

VLMs = very large models?

mmorse1217 | 7 months ago

Probably vision language models.