My experience with more traditional (non-Whisper-based) diarization and transcription is that quality depends heavily on how well the audio is isolated. In an ideal setup (one speaker per audio channel, well-placed mics) you'll potentially see some value from it. But when a speaker's audio is mixed with other sounds or music, that speaker will often be flagged as one or more additional speakers (so speaker 1 might also show up as speakers 7 and 9), which makes for a less useful summary.
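One common workaround for that over-splitting is a post-processing pass that merges speaker labels whose voice embeddings are nearly identical. This is a minimal illustrative sketch, not any particular library's API — the embeddings, labels, and threshold are all hypothetical:

```python
# Hypothetical sketch: merge over-split diarization speakers by cosine
# similarity of per-speaker embeddings. All values here are illustrative.
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_speakers(embeddings, threshold=0.95):
    # Map each speaker label to a canonical label, folding any speaker
    # whose embedding is near-identical to an earlier one into that label.
    labels = sorted(embeddings)
    canonical = {lbl: lbl for lbl in labels}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if canonical[b] == b and cosine(embeddings[a], embeddings[b]) >= threshold:
                canonical[b] = canonical[a]
    return canonical

# Speakers 7 and 9 are really speaker 1 (near-identical embeddings),
# as happens when mixed-in music/noise fools the diarizer.
emb = {
    "SPEAKER_1": [0.90, 0.10, 0.00],
    "SPEAKER_2": [0.00, 0.20, 0.95],
    "SPEAKER_7": [0.89, 0.12, 0.01],
    "SPEAKER_9": [0.91, 0.09, 0.00],
}
mapping = merge_speakers(emb)
# mapping folds SPEAKER_7 and SPEAKER_9 back into SPEAKER_1,
# while SPEAKER_2 stays distinct.
```

Whether this helps in practice depends on getting usable per-speaker embeddings out of the diarizer in the first place; with heavy music bleed, even the embeddings of the "phantom" speakers may drift too far to merge cleanly.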
notjulianjaynes|1 year ago