djsamseng | 2 years ago

Unfortunately I think the answer is “we don’t know.” We have loads of techniques (e.g. band-pass filters) and hypotheses (e.g. harmonic frequencies and timbre), but we haven’t been able to implement them perfectly, which seems to be why deep learning has worked so well.
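
To make that concrete: a band-pass filter only keeps a range of frequencies, which is exactly why it fails when two sources share that range. A minimal sketch (assuming scipy; the band edges and test tones are made-up illustration values):

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def bandpass(x, sr, lo_hz=300.0, hi_hz=3400.0, order=4):
        # Zero-phase Butterworth band-pass over a telephone-ish speech band.
        sos = butter(order, [lo_hz, hi_hz], btype="bandpass", fs=sr, output="sos")
        return sosfiltfilt(sos, x)

    sr = 16000
    t = np.arange(sr) / sr
    # Two overlapping "sources" that both live inside the pass band.
    mix = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 1000 * t)
    filtered = bandpass(mix, sr)  # both components survive: the band doesn't separate them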

Personally I hypothesize that the reason it’s so hard is that the sources are intermixed and share frequencies, so filtering to certain frequencies doesn’t isolate a speaker. We’d need something like beamforming to know how much amplitude of each frequency to extract. I’d also hypothesize that humans, while able to focus on a directional source, can’t “extract” a clean signal either (imagine someone talking while a pan crashes to the floor - it completely drowns out what the person said).
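
For the beamforming part, the textbook starting point is delay-and-sum: delay each microphone so the wavefront from one direction lines up, then average. A rough numpy-only sketch (the array geometry, sample rate and steering angle are hypothetical, and it obviously needs multiple mics):

    import numpy as np

    def delay_and_sum(mics, sr, mic_positions_m, angle_rad, c=343.0):
        # mics: (n_mics, n_samples) array of time-aligned microphone signals.
        # mic_positions_m: positions of the mics along a linear array (metres).
        n_mics, n_samples = mics.shape
        freqs = np.fft.rfftfreq(n_samples, d=1.0 / sr)
        acc = np.zeros(freqs.size, dtype=complex)
        for m in range(n_mics):
            # Extra travel time of a plane wave arriving from angle_rad to this mic.
            tau = mic_positions_m[m] * np.sin(angle_rad) / c
            # Undo that delay as a phase shift, so the steered direction adds coherently.
            acc += np.fft.rfft(mics[m]) * np.exp(2j * np.pi * freqs * tau)
        return np.fft.irfft(acc / n_mics, n=n_samples)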

HarHarVeryFunny|2 years ago

Speech is pretty well understood - there are two complementary aspects to it: speech production (synthesis) and speech recognition (via the changing frequency components as they show up in the spectrogram).

When we recognize speech it's almost as if we're hearing the way the speaker is articulating words, since what we're recognizing is the changing resonant frequencies ("formants") of the vocal tract corresponding to articulation, as well as other articulation cues such as the sudden energy onset of plosives or the high frequencies of fricatives (see my other post in this topic for a bit more info).
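
One classical (non-neural) way to estimate those formant frequencies from a waveform is linear predictive coding: fit an all-pole model to a short frame and read candidate formants off the pole angles. A rough sketch, assuming librosa is available for the LPC fit and that frame is a short mono chunk of speech sampled at sr Hz:

    import numpy as np
    import librosa

    def estimate_formants(frame, sr, lpc_order=None):
        # Return candidate formant frequencies (Hz) from one frame of speech.
        if lpc_order is None:
            lpc_order = 2 + sr // 1000                               # common rule of thumb
        frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
        frame = frame * np.hamming(len(frame))                       # taper the frame
        a = librosa.lpc(frame, order=lpc_order)                      # all-pole fit
        roots = [r for r in np.roots(a) if np.imag(r) > 0]           # keep upper-half poles
        freqs = np.arctan2(np.imag(roots), np.real(roots)) * sr / (2 * np.pi)
        return sorted(f for f in freqs if f > 90)                    # drop implausibly low candidates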

High-quality (that is, highly intelligible) speech synthesis has been available for a long time based on this understanding of speech production/recognition. One of the earliest speech synthesizers was the DECtalk (from Digital Equipment), introduced in 1984 - a formant-based synthesizer built on the work of speech scientist Dennis Klatt.

The fact that most of the information in speech comes from the formants can be demonstrated by generating synthetic formant-only speech consisting of nothing but sine waves at the changing formant frequencies (so-called "sine-wave speech"). It doesn't sound at all natural, but it is nonetheless very easy to recognize.
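
A toy version of that demonstration is easy to generate: a few sinusoids following formant tracks. The tracks below are made up for illustration; real demos use tracks measured from an actual utterance:

    import numpy as np

    sr, dur = 16000, 0.5
    t = np.arange(int(sr * dur)) / sr
    # Hypothetical formant trajectories for an /ai/-like glide (Hz).
    f1 = np.linspace(700, 300, t.size)
    f2 = np.linspace(1200, 2300, t.size)
    f3 = np.linspace(2600, 3000, t.size)
    # Phase is the running integral of instantaneous frequency.
    signal = sum(np.sin(2 * np.pi * np.cumsum(f) / sr) for f in (f1, f2, f3))
    signal /= np.abs(signal).max()   # write `signal` out as a wav file to listen to it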

The starting point for human speech recognition is similar to a spectrogram - it's a frequency analysis (cf. an FFT) done by the inner ear, where hair cells along the basilar membrane respond to different frequencies depending on their position along it, thereby picking up the dominant formant frequencies.
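
That frequency analysis is essentially what a short-time Fourier transform gives you. A minimal spectrogram sketch with scipy (random noise stands in here for a real recording):

    import numpy as np
    from scipy.signal import spectrogram

    sr = 16000
    audio = np.random.randn(sr)          # stand-in for one second of speech
    freqs, times, power = spectrogram(audio, fs=sr, nperseg=512, noverlap=384)
    # power[i, j] = energy near freqs[i] Hz at times[j] seconds; in real speech the
    # formants show up as bands of high energy that move as articulation changes.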

djsamseng|2 years ago

Agreed theoretically. However, if I gave you two spectrograms, would you be able to tell which one is clear speech and which one is garbled? I’d bet we could come up with one that wouldn’t pass the sniff test.

If you know of any implementations that can look at a spectrogram and say “hey, there are peaks at 150 Hz, 220 Hz and 300 Hz with standard deviations of 5 Hz, 7 Hz, and 10 Hz, decreasing in frequency over time, thus this is a deep voice saying ‘ay’” and get it right every time, I’d be really interested in seeing it (besides neural networks).
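
For what it’s worth, the naive rule-based version of that is peak-picking against a hand-written template, something like the sketch below (scipy's find_peaks; the template frequencies, tolerance and threshold are all made-up illustration values) - and it falls apart as soon as sources overlap or the voice doesn’t match the template:

    import numpy as np
    from scipy.signal import find_peaks

    def match_vowel(frame, sr, template_hz=(150, 220, 300), tol_hz=15):
        # frame: one short chunk of mono audio sampled at sr Hz.
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        peaks, _ = find_peaks(spectrum, height=0.1 * spectrum.max())
        peak_hz = freqs[peaks]
        # "Match" if every template peak has a detected peak within tol_hz of it.
        return all(np.any(np.abs(peak_hz - f) < tol_hz) for f in template_hz)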