top | item 47205386

(no title)

bytesandbits | 15 hours ago

I am not sure what "edge" device you want to run this on, but you can compress parakeet to under 500MB on RAM / disk with dynamic quants on-the-fly dequantization (GGUF or CoreML centroid palettization style). And retain essentially almost all accuracy.

And just to be clear, 500MB is even enough for a raspberry Pi. Then your problem is not memory, is FLOPS. It might run real-time in a RPi 5, since it has around 50 GFLOPS of FP32, i.e. 100 GFLOPS of FP16. So about 20-50 times less than a modern iPhone. I don't think it will be able to keep it real time, TBF, but close.

regardless, this model with such quantization strategy runs real time at +10x real-time factor even in 6-year old iPhones (which you can acquire for under $200) and offline at a reasonable speed, essentially anywhere.

You get the best of both worlds: the accuracy of a whisper transformer at the speed and footprint of a small model.

discuss

order

No comments yet.