
Whisper: Nvidia RTX 4090 vs. M1 Pro with MLX

343 points | interpol_p | 2 years ago | owehrens.com

149 comments

[+] whywhywhywhy|2 years ago|reply
I find these findings questionable unless Whisper was very poorly optimized in the way it was run on the 4090.

I have a 3090 and an M1 Max 32GB, and although I haven't tried Whisper, the inference difference on Llama and Stable Diffusion between the two is staggering, especially with Stable Diffusion, where an SDXL generation takes about 9 seconds on the 3090 and 1:10 minutes on the M1 Max.

[+] woadwarrior01|2 years ago|reply
You're taking benchmark numbers from a latent diffusion model's (SDXL) inference and extrapolating them to an encoder-decoder transformer model's (Whisper) inference. These two model architectures have little in common (except perhaps the fact that Stable Diffusion models use a pre-trained text encoder from CLIP, which again is very different from an encoder-decoder transformer).
[+] kamranjon|2 years ago|reply
There has been a ton of optimization around Whisper with regard to Apple Silicon; whisper.cpp is a good example that takes advantage of this. Also, this article is specifically referencing the new Apple MLX framework, which I'm guessing your tests with Llama and Stable Diffusion weren't utilizing.
[+] tgtweak|2 years ago|reply
Reading through some (admittedly very early) MLX docs, it seems that convolutions (used heavily in GANs and particularly in Stable Diffusion) are not really seeing meaningful uplifts on MLX at all, and in some cases are slower than on the CPU.

I'm not sure if this is a hardware limitation or just unoptimized MLX libraries, but I find it hard to believe they would have just ignored this very prominent use case. It's more likely that convolutions use higher precision and much larger tile sets, which require some expensive context switching when the entire transform can't fit on the GPU.

[+] ps|2 years ago|reply
I have a 4090 and an M1 Max 64GB. The 4090 is far superior on Llama 2.
[+] stefan_|2 years ago|reply
Having used Whisper a ton: there are versions of it that have one or two orders of magnitude better performance at the same quality while using less memory, for reasons I don't fully understand.

So I'd be very careful about your intuition on Whisper performance unless it's literally the same software and the same model (and even then the comparison isn't very meaningful, seeing how we want to optimize it differently for different platforms).

[+] agloe_dreams|2 years ago|reply
It's all really messy. I would assume that almost any model is poorly optimized to run on Apple Silicon as well.
[+] liuliu|2 years ago|reply
Both your 3090 and M1 Max SDXL numbers should be faster (of course, it depends on how many steps). But the point stands: for SDXL, a 3090 should be 5x to 6x faster than an M1 Max, and 2x to 2.5x faster than an M2 Ultra.
[+] mv4|2 years ago|reply
Thank you for sharing this data. I've been debating between an M2 Max Mac Studio and a 64GB i9-10900X with an RTX 3090 for personal ML use. Glad I chose the 3090! Would love to learn more about your setup.
[+] KingOfCoders|2 years ago|reply
"I haven't tried Whisper"

I haven't tried the hardware/software/framework/... of the article, but I have an opinion on this exact topic.

[+] oceanplexian|2 years ago|reply
The M1 Max has 400GB/s of memory bandwidth and a 4090 has 1TB/s; the M1 Max has 32 GPU cores and a 4090 has 16,384 CUDA cores. The difference is more about how well the software is optimized for the hardware platform than any inherent performance gap between the two, which are frankly not comparable in any way.
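As a back-of-envelope check on the bandwidth point: autoregressive LLM decoding has to stream the full weight set from memory for every generated token, so bandwidth alone puts a ceiling on tokens per second. A rough sketch; the 14 GB figure assumes a hypothetical 7B-parameter model in fp16, and all the numbers are illustrative, not measurements:

```python
def max_tokens_per_s(bandwidth_gb_s, model_gb):
    """Ceiling on decode speed for a memory-bound autoregressive model:
    every generated token has to read all the weights once."""
    return bandwidth_gb_s / model_gb

MODEL_GB = 14  # hypothetical 7B-parameter model in fp16 (~2 bytes/param)
m1_max = max_tokens_per_s(400, MODEL_GB)     # M1 Max: ~400 GB/s
rtx4090 = max_tokens_per_s(1008, MODEL_GB)   # RTX 4090: ~1 TB/s
print(f"M1 Max ceiling: {m1_max:.0f} tok/s, 4090 ceiling: {rtx4090:.0f} tok/s")
```

By this crude estimate the bandwidth gap alone accounts for a ~2.5x difference; compute and software optimization explain the rest.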
[+] atty|2 years ago|reply
I think this is using the OpenAI Whisper repo? If they want a real comparison, they should compare MLX to faster-whisper or insanely-fast-whisper on the 4090. faster-whisper runs sequentially; insanely-fast-whisper batches the audio in 30-second intervals.

We use Whisper in production, and these are our findings: we use faster-whisper because we find the quality is better when you include the previous segment's text. For comparison, we find that faster-whisper is generally 4-5x faster than openai/whisper, and insanely-fast-whisper can be another 3-4x faster than faster-whisper.
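The previous-segment trick can be sketched roughly like this; `transcribe_chunk` is a hypothetical stub standing in for a real call such as faster-whisper's `model.transcribe(..., initial_prompt=...)`, so the control flow is visible without a GPU:

```python
def transcribe_chunk(audio_chunk, initial_prompt=""):
    # Hypothetical stub standing in for a real call such as
    # faster_whisper's model.transcribe(chunk, initial_prompt=...).
    return f"<text for {audio_chunk}>"

def transcribe_with_context(chunks):
    """Feed each chunk the previous chunk's transcript as the prompt,
    which tends to keep wording consistent across segment boundaries."""
    previous = ""
    pieces = []
    for chunk in chunks:
        text = transcribe_chunk(chunk, initial_prompt=previous)
        pieces.append(text)
        previous = text  # condition the next segment on this one
    return " ".join(pieces)
```

If I recall correctly, faster-whisper also exposes a `condition_on_previous_text` option that does this within a single call; the explicit loop is only meant to show the idea.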

[+] moffkalast|2 years ago|reply
Is insanely-fast-whisper fast enough to actually run on the CPU and still transcribe in realtime? I see that none of these are running quantized models; it's still fp16. Seems like there's more speed left to be found.

Edit: I see it doesn't yet support CPU inference, should be interesting once it's added.

[+] youssefabdelm|2 years ago|reply
Does insanely-fast-whisper use beam size of 5 or 1? And what is the speed comparison when set to 5?

Ideally it also exposes that parameter to the user.

Speed comparisons seem moot to me when quality is sacrificed; I'm working with very poor audio, so transcription quality matters.

[+] PH95VuimJjqBqy|2 years ago|reply
yeah well, I find that super-duper-insanely-fast-whisper is 3-4x faster than insanely-fast-whisper.

/s

[+] tiffanyh|2 years ago|reply
Key to this article is understanding that it's leveraging the newly released Apple MLX, and that their code uses these Apple-specific optimizations.

https://news.ycombinator.com/item?id=38539153

[+] modeless|2 years ago|reply
Also, this is not comparing against an optimized Nvidia implementation. There are faster implementations of Whisper.

Edit: OK, I took the bait. I downloaded the 10-minute file he used and ran it on my 4090 with insanely-fast-whisper, which took two commands to install. Using whisper-large-v3, the file is transcribed in less than eight seconds; fifteen seconds if you include the model loading time before transcription starts (this extra time obviously does not depend on the length of the audio file).

That makes the 4090 somewhere between 6 and 12 times faster than Apple's best. It's also much cheaper than M2 Ultra if you already have a gaming PC to put it in, and still cheaper even if you buy a whole prebuilt PC with it.

This should not be surprising to people, but I see a lot of wishful thinking here from people who own high end Macs and want to believe they are good at everything. Yes, Apple's M-series chips are very impressive and the large RAM is great, but they are not competitive with Nvidia at the high end for ML.
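Taking the parent's figures at face value, the arithmetic works out like this (a quick sanity check on the claim, not a new measurement):

```python
audio_s = 10 * 60    # the 10-minute test file
transcribe_s = 8     # 4090 + insanely-fast-whisper, excluding model load
rtf = audio_s / transcribe_s  # 75x realtime
print(f"Realtime factor: {rtf:.0f}x")

# For the "6 to 12 times faster than Apple's best" claim to hold, the
# article's M-series time would have to fall in this window:
low, high = transcribe_s * 6, transcribe_s * 12  # 48 s to 96 s
print(f"Implied M-series transcription time: {low}-{high} s")
```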

[+] intrasight|2 years ago|reply
Honest question: Why would I (or most users) care? If I have a Mac, I'm going to get the performance of that machine. If I have a gaming PC, I'll get the performance of that machine. If I have both, I'm still likely to use whichever AI is running on my daily driver.
[+] Flux159|2 years ago|reply
How does this compare to insanely-fast-whisper though? https://github.com/Vaibhavs10/insanely-fast-whisper

I think that not using optimizations allows this to be a 1:1 comparison, but if the optimizations are not ported to MLX, then it would still be better to use a 4090.

Having looked at MLX recently, I think it's definitely going to get traction on Macs - and iOS when Swift bindings are released https://github.com/ml-explore/mlx/issues/15 (although there might be some C++20 compilation issue blocking right now).

[+] brucethemoose2|2 years ago|reply
This is the thing about Nvidia. Even if some hardware beats them in a benchmark, if it's a popular model, there will be some massively hand-optimized CUDA implementation that blows anything else out of the water.

There are some rare exceptions (like GPT-Fast on AMD thanks to PyTorch's hard work on torch.compile, and only in a narrow use case), but I can't think of a single one for Apple Silicon.

[+] claytonjy|2 years ago|reply
To have a good comparison I think we'd need to run the insanely-fast-whisper code on a 4090. I bet it handily beats both the benchmarks in OP, though you'll need a much smaller batch size than 24.

You can beat these benchmarks on a CPU; 3-4x realtime is very slow for whisper these days!

[+] tgtweak|2 years ago|reply
Does this translate to other models, or was Whisper cherry-picked due to its serial nature and integer math? Looking at https://github.com/ml-explore/mlx-examples/tree/main/stable_... seems to hint that this is the case:

>At the time of writing this comparison convolutions are still some of the least optimized operations in MLX.

I think the main thing at play is the fact that you can have 64+GB of very fast RAM directly coupled to the CPU/GPU, and the benefits that brings from a latency/co-accessibility point of view.

These numbers are certainly impressive when you look at the power packages of these systems.

Worth considering/noting that an M3 Max system with the minimum RAM config costs ~2x the price of a 4090...

[+] densh|2 years ago|reply
Apple silicon's memory is fast only in comparison to consumer CPUs, which stagnated for ages with only 2 memory channels; that was fine in the 4-core era but makes no sense at all with modern core counts. Memory scaling on GPUs is much better, even on the consumer front.
[+] SlavikCA|2 years ago|reply
It's easy to run Whisper on my Mac M1, but it doesn't use MLX out of the box.

I spent an hour or two trying to figure out what I needed to install / configure to enable MLX. I was getting cryptic Python errors, Torch errors... and gave up on it.

I rented a VM with a GPU and had Whisper running on it within a few minutes.

[+] Lalabadie|2 years ago|reply
There will be a lot of debate about which is the absolute best choice for X task, but what I love about this is the level of performance at such a low power consumption.
[+] mightytravels|2 years ago|reply
Use this Whisper derivative repo instead; one hour of audio gets transcribed in a minute or less on most GPUs: https://github.com/Vaibhavs10/insanely-fast-whisper
[+] claytonjy|2 years ago|reply
Anecdotally, I've found ctranslate2 to be even faster than insanely-fast-whisper. On an L4, ctranslate2 with a batch size as low as 4 beats all their benchmarks except the A100 with Flash Attention 2.

It's a shame faster-whisper never landed batch mode, as I think that's preventing folks from trying ctranslate2 more easily.

[+] thrdbndndn|2 years ago|reply
Could someone elaborate on how this is accomplished, and whether there is any quality disparity compared to the original?

Repos like https://github.com/SYSTRAN/faster-whisper make immediate sense as to why they're faster than the original implementation, and lots of others get there by lowering quantization precision etc. (with worse results).

But with this one, it's not very clear how. Especially considering it's even much faster.

[+] theschwa|2 years ago|reply
I feel like this is particularly interesting in light of their Vision Pro. Being able to run models in a power efficient manner may not mean much to everyone on a laptop, but it's a huge benefit for an already power hungry headset.
[+] LiamMcCalloway|2 years ago|reply
I'll take this opportunity to ask for help: What's a good open source transcription and diarization app or work flow?

I looked at https://github.com/thomasmol/cog-whisper-diarization and https://about.transcribee.net/ (from the people behind Audapolis) but neither work that well -- crashes, etc.

Thank you!

[+] dvfjsdhgfv|2 years ago|reply
I developed my own solution, pretty rudimentary: it divides the MP3s into chunks that Whisper is able to handle and then sends them one by one to the API to transcribe. Works as expected so far, and it's just a couple of lines of Python code.
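The chunking step of an approach like the parent's can be sketched in a few lines. The 600-second chunk length and 5-second overlap here are made-up defaults, and the actual MP3 splitting and API upload are left out; real code would also have to respect the API's file-size limit:

```python
def chunk_spans(total_s, chunk_s=600.0, overlap_s=5.0):
    """Split a recording of total_s seconds into (start, end) spans of at
    most chunk_s seconds, overlapping slightly so words that straddle a
    boundary are not cut in half."""
    spans, start = [], 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        start = end - overlap_s  # back up a little for the next chunk
    return spans

# e.g. a 25-minute recording -> three overlapping spans
print(chunk_spans(1500))
```

Each span would then be cut out with something like ffmpeg and sent to the transcription API, concatenating the results in order.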
[+] mosselman|2 years ago|reply
I would like to know the same.

It shouldn’t be so hard since many apps have this. But what is the most reliable way right now?

[+] bcatanzaro|2 years ago|reply
What precision is this running in? If 32-bit, it’s not using the tensor cores in the 4090.
[+] lars512|2 years ago|reply
Is there a great speech generation model that runs on MacOS, to close the loop? Something more natural than the built in MacOS voices?
[+] treprinum|2 years ago|reply
You can try VALL-E; it takes around 5s to generate a sentence on a 3090 though.
[+] 2lkj22kjoi|2 years ago|reply
4090 -> 82 TFLOPS

M3 Max GPU -> 10 TFLOPS

That makes it roughly 8 times slower than a 4090.

But yeah, you can claim that a bike has faster acceleration than a Ferrari because it reaches a speed of 1 km per hour sooner...

[+] jauntywundrkind|2 years ago|reply
I wonder how AMD's XDNA accelerator will fare.

They just shipped 1.0 of the Ryzen AI Software and SDK. It alleges ONNX, PyTorch, and TensorFlow support. https://www.anandtech.com/show/21178/amd-widens-availability...

Interestingly, the upcoming XDNA2 is supposedly going to boost generative performance a lot ("3x"). I'd kind of assumed these sorts of devices would mainly be helping with inference. (I don't really know what characterizes the different workloads; just a naive grasp.)

[+] sim7c00|2 years ago|reply
Looking at the comments, perhaps the article could be more aptly titled. The author does stress that these benchmarks (maybe better called test runs) are not of any scientific accuracy or worth, but simply demonstrate what is being tested. I think it's interesting, though, that Apple silicon and 4090s are even compared in any way, since the devices are so vastly different. I'd expect the 4090 to be more powerful, but Apple-optimized code runs really quickly on Apple silicon despite this seemingly obvious fact, and that, I think, is interesting. You don't need a 4090 to do things if you use the right libraries. Is that what I can take from it?
[+] darknoon|2 years ago|reply
It would be more interesting if PyTorch with the MPS backend were also included.
[+] runjake|2 years ago|reply
Anyone have overall benchmarks or qualified speculation on how an optimized implementation for a 4070 compares against the M series -- especially the M3 Max?

I'm trying to decide between the two. I figure the M3 Max would crush the 4070?

[+] etchalon|2 years ago|reply
The shocking thing about these M-series comparisons is never "the M series is as fast as the GIANT NVIDIA THING!" It's always "Man, the M series is 70% as fast with like 1/4 the power."
[+] accidbuddy|2 years ago|reply
About Whisper: does anyone know of a project (on GitHub) for using the model in real time? I'm studying a new language, and it seems like a good chance to use it for learning pronunciation vs. the written word.