(no title)
adeptima | 11 months ago
I recently dug into forced alignment and found that most of the newer models don't operate on word or phoneme boundaries; instead they chunk the audio with overlap and match words against context. Older HMM-style models use shorter strides (10 ms vs. 20 ms).
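To make the stride point concrete, here is a rough sketch of how a frame index maps to a timestamp. It assumes frames are evenly spaced starting at t = 0; the function names are made up for illustration and don't come from Kaldi, Sherpa, or any other toolkit.

```python
def frame_to_seconds(frame_index: int, stride_ms: int) -> float:
    """Start time of a frame, assuming evenly spaced frames starting at t = 0."""
    return frame_index * stride_ms / 1000.0


def frames_per_second(stride_ms: int) -> int:
    """Number of whole frames that start within one second of audio."""
    return 1000 // stride_ms


if __name__ == "__main__":
    for stride_ms in (10, 20):
        print(
            f"stride {stride_ms} ms: {frames_per_second(stride_ms)} frames/s, "
            f"frame 50 starts at {frame_to_seconds(50, stride_ms)} s"
        )
```

So with a 20 ms stride, any boundary read straight off the frame grid can only land on multiples of 20 ms, which is twice as coarse as the 10 ms grid.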
I searched the Kaldi/Sherpa ecosystem, but most of what I found leads nowhere or points to very small, inaccurate models.
I'd appreciate any tips on the subject.
No comments yet.