It is about the parameter numbers if what you care about is edge devices with limited RAM. Beyond a certain size your model just doesn't fit, it doesn't matter how good it is - you still can't run it.
I am not sure what "edge" device you want to run this on, but you can compress parakeet to under 500MB on RAM / disk with dynamic quants on-the-fly dequantization (GGUF or CoreML centroid palettization style). And retain essentially almost all accuracy.
And just to be clear, 500MB is even enough for a raspberry Pi. Then your problem is not memory, is FLOPS. It might run real-time in a RPi 5, since it has around 50 GFLOPS of FP32, i.e. 100 GFLOPS of FP16. So about 20-50 times less than a modern iPhone. I don't think it will be able to keep it real time, TBF, but close.
regardless, this model with such quantization strategy runs real time at +10x real-time factor even in 6-year old iPhones (which you can acquire for under $200) and offline at a reasonable speed, essentially anywhere.
You get the best of both worlds: the accuracy of a whisper transformer at the speed and footprint of a small model.
bytesandbits|20 hours ago
And just to be clear, 500MB is even enough for a raspberry Pi. Then your problem is not memory, is FLOPS. It might run real-time in a RPi 5, since it has around 50 GFLOPS of FP32, i.e. 100 GFLOPS of FP16. So about 20-50 times less than a modern iPhone. I don't think it will be able to keep it real time, TBF, but close.
regardless, this model with such quantization strategy runs real time at +10x real-time factor even in 6-year old iPhones (which you can acquire for under $200) and offline at a reasonable speed, essentially anywhere.
You get the best of both worlds: the accuracy of a whisper transformer at the speed and footprint of a small model.