Here's my D3D11 implementation of speech-to-text: https://github.com/Const-me/Whisper. With the medium model it needs 1.43 GB of assets and 2 GB of VRAM, and on gaming GPUs it runs at 10x realtime speed. These performance figures might be good enough for modern videogames. BTW, the model understands almost 100 spoken languages and can translate them to English.
You wouldn't be able to run them locally, but these models are pretty cheap to run, assuming you batch everything. You wouldn't want to use it for an F2P game, but for a subscription game (on the order of a few dollars a month) it would not be prohibitively expensive.
Const-me|3 years ago
sebzim4500|3 years ago