top | item 35782611

djsamseng | 2 years ago

This is a pretty good introductory primer. https://medium.com/analytics-vidhya/understanding-the-mel-sp...

1. STFT (get frequencies from the audio signal)

2. Log scale/ decibel scale (since we hear on the log scale)

3. Optionally convert to the Mel scale (filters to how humans hear)
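
The three steps above can be sketched with NumPy/SciPy. This is a minimal illustration, not a production pipeline: the test signal, frame size, and mel-bank parameters are all arbitrary choices, and the triangular mel filter bank is hand-rolled using the common 2595·log10(1 + f/700) formula.

```python
import numpy as np
from scipy.signal import stft

sr = 16000                              # sample rate in Hz (illustrative)
t = np.arange(sr) / sr                  # 1 second of audio
x = np.sin(2 * np.pi * 440 * t)         # a pure 440 Hz tone as a test signal

# 1. STFT: frequency content over time
freqs, times, Z = stft(x, fs=sr, nperseg=512)
power = np.abs(Z) ** 2

# 2. Decibel (log) scale, with a small floor to avoid log(0)
power_db = 10 * np.log10(power + 1e-12)

# 3. Mel scale: build a triangular filter bank and apply it
def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

n_mels = 40
mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
hz_pts = mel_to_hz(mel_pts)
fbank = np.zeros((n_mels, len(freqs)))
for i in range(n_mels):
    lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
    rising = (freqs - lo) / (ctr - lo)   # rising edge of the triangle
    falling = (hi - freqs) / (hi - ctr)  # falling edge
    fbank[i] = np.clip(np.minimum(rising, falling), 0, 1)

mel_spec = fbank @ power                 # (n_mels, n_frames)
mel_spec_db = 10 * np.log10(mel_spec + 1e-12)
print(mel_spec_db.shape)
```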

Happy to answer any questions

discuss

order

peepwaah|2 years ago

Thanks for your effort in sharing the link. I'm kind of comfortable with most of the theoretical aspects of STFT/FFT/mel scale etc., but when I look at a spectrogram I still feel I'm missing something. I want to know how clear the speech in the audio is: is there background noise, is there reverb, is there loss anywhere? I have a feeling these things can be learned from analyzing spectrograms, but I'm not sure how. Hence the question.

timlod|2 years ago

I would recommend constructing some spectrograms from specific sounds, especially simulated ones, to help you connect the visual with the audible.

For example:

- Sine sweeps (a sine wave that starts at a low frequency and sweeps up to a high one) - to learn to associate the frequencies you hear with the Y-axis

- Sine pulses at various frequencies - to better understand the time axis

- Different types of noise (e.g. white)

Perhaps move on to your own voice as well, and try different scales (log or mel spectrograms, which are commonly used).

With this, I think you can develop a familiarity quickly!
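
A quick sketch of generating those practice signals with SciPy (the sample rate, durations, and frequencies are just assumptions to play with):

```python
import numpy as np
from scipy.signal import chirp, spectrogram

sr = 16000
t = np.arange(2 * sr) / sr                       # 2 seconds of audio

# Sine sweep: 100 Hz up to 4 kHz over the whole clip
sweep = chirp(t, f0=100, t1=t[-1], f1=4000, method="logarithmic")

# Sine pulses: 50 ms bursts at a few frequencies, spaced 0.5 s apart
pulses = np.zeros_like(t)
for k, f in enumerate([200, 800, 3200]):
    start = int(k * 0.5 * sr)
    seg = t[start:start + sr // 20]
    pulses[start:start + sr // 20] = np.sin(2 * np.pi * f * seg)

# White noise
noise = np.random.default_rng(0).normal(size=t.shape)

for name, x in [("sweep", sweep), ("pulses", pulses), ("noise", noise)]:
    f, tt, Sxx = spectrogram(x, fs=sr, nperseg=512)
    print(name, Sxx.shape)                       # frequency bins x time frames
```

Plot each `Sxx` in dB (e.g. matplotlib's `pcolormesh(tt, f, 10*np.log10(Sxx + 1e-12))`) and compare what you see with what you hear.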

0xFEE1DEAD|2 years ago

Look for clear and distinct frequency bands corresponding to the vocal range of human speech (generally around 100 Hz to 8 kHz). If the frequency bands are well-defined and distinct, the speech is likely clear and intelligible. If they are blurred or fuzzy, the speech may be muffled or distorted.

Note that speech, like any audio source, consists of multiple frequencies: a fundamental frequency and its harmonics.
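
To make the fundamental-plus-harmonics point concrete, here's a small sketch with a synthetic "voiced" tone (the 150 Hz fundamental and the 1/n amplitude roll-off are just illustrative choices): the spectrum peaks land at integer multiples of f0.

```python
import numpy as np

sr = 16000
f0 = 150                                  # fundamental frequency in Hz
t = np.arange(sr) / sr                    # exactly 1 second -> 1 Hz bins

# Fundamental plus 4 harmonics with decaying amplitude
x = sum((1 / n) * np.sin(2 * np.pi * n * f0 * t) for n in range(1, 6))

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / sr)

# The five strongest bins sit at f0, 2*f0, ... - harmonics of one source
peaks = freqs[np.argsort(spectrum)[-5:]]
print(sorted(peaks))                      # [150.0, 300.0, 450.0, 600.0, 750.0]
```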

Background noise can be identified as distinct frequency bands that are not part of the vocal range of human speech. E.g. if you see lots of bright lines below or above the human vocal range, there's lots of background noise. Lower frequencies especially can have a big impact on the perceived clarity of a recording, whereas high frequencies come off as more annoying.

Noise within the frequency range of human speech is harder to spot, and you should always use your ears to decide whether it's noise or not.
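
One rough way to turn the "bright lines outside the vocal range" idea into a number: compare spectral energy inside the speech band (100 Hz - 8 kHz) to energy outside it. Everything here, including the fake voice and the 50 Hz mains hum, is a stand-in for illustration, not a calibrated noise metric.

```python
import numpy as np
from scipy.signal import spectrogram

sr = 22050                                     # Nyquist comfortably above 8 kHz
t = np.arange(sr) / sr
speechish = np.sin(2 * np.pi * 220 * t)        # stand-in for a voice
hum = 0.5 * np.sin(2 * np.pi * 50 * t)         # mains-hum-like noise below 100 Hz

f, _, Sxx = spectrogram(speechish + hum, fs=sr, nperseg=1024)
in_band = Sxx[(f >= 100) & (f <= 8000)].sum()
out_band = Sxx[(f < 100) | (f > 8000)].sum()
frac = out_band / (in_band + out_band)
print(f"out-of-band energy fraction: {frac:.2f}")
```

A clean recording should have most of its energy in-band; lots of out-of-band energy suggests hum, rumble, or hiss worth looking at.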

You can also use a spectrogram to check for plosives (e.g. "p" "k" "t" sounds) and sibilants (e.g. "s" sounds), as they can also make a recording sound bad/harsh.

djsamseng|2 years ago

Unfortunately I think the answer is "we don't know." We have loads of techniques (e.g. band-pass filters) and hypotheses (e.g. harmonic frequencies and timbre), but we haven't been able to implement them perfectly, which seems to be why deep learning has worked so well.
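
For reference, the band-pass technique mentioned above, sketched with SciPy (all parameters are illustrative). Note this also shows the limitation: if two speakers both sit inside 300 Hz - 3.4 kHz, the filter passes both of them, so frequency isolation alone doesn't separate sources.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

sr = 16000
t = np.arange(sr) / sr
voice = np.sin(2 * np.pi * 500 * t)            # inside the pass band
rumble = np.sin(2 * np.pi * 40 * t)            # below it
hiss = 0.3 * np.sin(2 * np.pi * 7000 * t)      # above it

# 4th-order Butterworth band-pass over a telephone-like speech band
sos = butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
filtered = sosfiltfilt(sos, voice + rumble + hiss)

# The 500 Hz component survives; rumble and hiss are strongly attenuated
spectrum = np.abs(np.fft.rfft(filtered))
peak = np.fft.rfftfreq(len(filtered), 1 / sr)[np.argmax(spectrum)]
print(peak)
```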

Personally I hypothesize that the reason it's so hard is that the sources are intermixed and share frequencies, so isolating certain frequencies doesn't isolate a speaker. We'd need something like beamforming to know how much amplitude of each frequency to extract. I'd also hypothesize that humans, while able to focus on a directional source, cannot "extract" a clean signal either (imagine someone talking while a pan crashes on the floor: it completely drowns out what the person said).