(no title)
Teleoflexuous | 1 year ago
For reference, I'm transcribing research-related podcasts, meaning speech doesn't overlap much (overlapping speech would be a problem for WhisperX, from what I understand). There are also a lot of accents, which strain Whisper (though it still does well) but surely help WhisperX. It did have trouble figuring out the number of speakers on its own, but that wasn't a problem for my use case.
joshspankit | 1 year ago
Here’s an example for clarity:
1. AI is trained on the voice of a podcast host. As a side effect, it now (presumably) has all the information it needs to recognize that voice.
2. All the past podcasts can be processed, with the AI comparing each detected voice against the known voice, which leads to highly accurate labelling of that person.
3. Probably a nice side bonus: if two people with different registers are speaking over each other, the AI could separate them out. "That's clearly person A, and the other one is clearly person C."
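Step 2 above amounts to comparing a per-segment speaker embedding against enrolled embeddings of known voices. A minimal sketch of that matching step, using cosine similarity over toy vectors (`label_segment` and the example embeddings are illustrative stand-ins; a real pipeline would get embeddings from a trained speaker-encoder model, e.g. the diarization backend WhisperX uses):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def label_segment(segment_emb, known):
    """Return the best-matching known speaker and its similarity score."""
    name = max(known, key=lambda k: cosine(segment_emb, known[k]))
    return name, cosine(segment_emb, known[name])

# Toy enrolled embeddings; in practice these come from the speaker model.
known = {
    "host_A": np.array([1.0, 0.1, 0.0]),
    "guest_C": np.array([0.0, 0.9, 0.4]),
}

# A segment embedding close to host_A's voice gets labelled as host_A.
name, score = label_segment(np.array([0.95, 0.15, 0.02]), known)
```

Here `score` is near 1.0 for a confident match; a threshold below which the segment is labelled "unknown speaker" is the usual guard against false matches.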
c0brac0bra | 1 year ago
You pass N PCM frames through their trainer, and once you reach a certain percentage you can extract and save an embedding.
Then you can identify audio against the set of enrolled speakers, and it will return a percentage match for each.
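That enrollment-then-identify flow can be sketched roughly as below. The `SpeakerTrainer`/`identify` names and the mean-frame "embedding" are purely illustrative assumptions, not the actual library's API; a real trainer would run the frames through a neural speaker encoder:

```python
import numpy as np

class SpeakerTrainer:
    """Accumulates PCM frames and reports enrollment progress as a percentage."""
    def __init__(self, target_frames=10):
        self.frames = []
        self.target = target_frames

    def add_frame(self, pcm):
        """Add one frame of samples; return enrollment progress (0-100)."""
        self.frames.append(np.asarray(pcm, dtype=float))
        return min(100, int(100 * len(self.frames) / self.target))

    def embedding(self):
        # Stub featurizer: mean of the frames. A real system uses a model.
        return np.mean(self.frames, axis=0)

def identify(emb, enrolled):
    """Score an embedding against each enrolled speaker as a percentage match."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return {name: round(100 * max(0.0, cos(emb, e)), 1)
            for name, e in enrolled.items()}

# Enroll a speaker: feed frames until the trainer reports 100%.
t = SpeakerTrainer(target_frames=2)
t.add_frame([1.0, 0.0, 0.0])            # progress: 50
t.add_frame([0.8, 0.2, 0.0])            # progress: 100 -> embedding is usable
saved_emb = t.embedding()

# Identify: percentage match against every enrolled speaker.
matches = identify(saved_emb, {
    "host":  np.array([1.0, 0.1, 0.0]),
    "guest": np.array([0.0, 1.0, 0.0]),
})
```

The per-speaker percentages make thresholding easy: accept the top match only when it clears some cutoff, otherwise treat the audio as an unknown speaker.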