Deploying a model on an NPU requires significant profile-based optimization. Picking up a model that works fine on the CPU but hasn't been optimized for an NPU usually leads to disappointing results.
I don't think this is correct. The difference between well-optimized and unoptimized code on the CPU is frequently at least an order of magnitude in performance.
The reason it doesn't seem that way is that the CPU is so fast we often bottleneck on I/O first. However, for compute-bound workloads like inference, it really does matter.
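As a rough illustration of that point (a toy sketch, not a benchmark of any real inference stack): the two functions below compute the same matrix product in pure Python, and only the loop structure differs. The function names, the `make_matrix` helper, and the matrix size are all made up for the example. The gap you see here comes mostly from interpreter overhead; in compiled code, cache blocking and SIMD are what typically produce the larger differences.

```python
import timeit
from operator import mul


def matmul_naive(a, b):
    """Textbook triple loop: indexes into b column-wise on every step."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                out[i][j] += a[i][k] * b[k][j]
    return out


def matmul_transposed(a, b):
    """Same arithmetic, but b is transposed once so each dot product
    walks two rows in order instead of indexing element by element."""
    b_cols = list(zip(*b))
    return [[sum(map(mul, row, col)) for col in b_cols] for row in a]


def make_matrix(n, seed):
    """Deterministic n x n integer matrix (hypothetical test data)."""
    return [[(seed + i * n + j) % 7 for j in range(n)] for i in range(n)]


def compare(n=120, repeats=3):
    """Return the best-of-`repeats` wall time for each variant, in seconds."""
    a, b = make_matrix(n, 1), make_matrix(n, 2)
    t_naive = min(timeit.repeat(lambda: matmul_naive(a, b), number=1, repeat=repeats))
    t_fast = min(timeit.repeat(lambda: matmul_transposed(a, b), number=1, repeat=repeats))
    return t_naive, t_fast


if __name__ == "__main__":
    t_naive, t_fast = compare()
    print(f"naive: {t_naive:.3f}s  transposed: {t_fast:.3f}s  ratio: {t_naive / t_fast:.1f}x")
```

The ratio depends on the machine and the Python version, so treat whatever it prints as an illustration of the idea and not as a measurement.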
Yeah, whenever I've spoken to people who work on stuff like IREE or OpenXLA, they've given me the impression that understanding how to use those compilers/runtimes is an entire job.
CAP_NET_ADMIN|1 year ago
marginalia_nu|1 year ago
catgary|1 year ago