top | item 28482366


praccu | 4 years ago

At Amazon I set up an evaluation approach based on whether the system completed the desired task (in that context it was "did the search using the speech recognition return the same set of items to buy as the transcript?").

https://scholar.google.com/citations?view_op=view_citation&h...
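The task-completion idea above can be sketched as a small scorer. Here `search` is a hypothetical stand-in for the retrieval system (not Amazon's actual pipeline), and the pair format is an assumption for illustration:

```python
# Sketch of a task-success metric for ASR: an utterance counts as
# correct if the ASR query retrieves the same set of items as the
# human transcript, regardless of word-level errors.

def task_success_rate(pairs, search):
    """pairs: iterable of (reference_transcript, asr_hypothesis).
    search: hypothetical function mapping a query string to a set
    of item IDs."""
    successes = 0
    total = 0
    for reference, hypothesis in pairs:
        total += 1
        # Task succeeds when both queries return the same items,
        # even if the transcripts differ word-for-word.
        if search(hypothesis) == search(reference):
            successes += 1
    return successes / total if total else 0.0
```

A hypothesis with several word errors can still score as a success here, which is exactly the point: the metric tracks the outcome the user cares about, not transcript fidelity.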


dylanbfox | 4 years ago

Interesting. It seems like in the "real world" WER is not really the metric that matters; it's more about "is this ASR system performing well for my use case" - which is better measured through task-specific metrics like the one you outlined in your paper.

6gvONxR4sf7o | 4 years ago

A pure ASR analog of this is how long a continuous utterance it enables. When I use tools like the ones lunixbochs builds (including his own), the challenge as a user is trading off dictating little bits at a time (slow, but easier to go back and correct) vs saying a whole 'sentence' in one go (fast and natural, but you're probably going to have to go back and edit or try again).

Sentence/command error rate (the fraction of sentences/commands that are not 100% correct and so need editing or re-attempting) is a decent proxy for this. It's no silver bullet, but it more directly measures how frustrated your users will be.
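A minimal sketch of that metric, assuming parallel lists of reference and hypothesis strings:

```python
def sentence_error_rate(references, hypotheses):
    """Fraction of utterances that are not recognized perfectly.
    A hypothesis counts as wrong if it differs from its reference
    in any way, since any error forces the user to edit or retry."""
    assert len(references) == len(hypotheses)
    if not references:
        return 0.0
    wrong = sum(1 for r, h in zip(references, hypotheses) if r != h)
    return wrong / len(references)
```

Unlike WER, a one-word slip in a ten-word command costs the same as a totally garbled one: either way the user has to intervene.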

If you really wanted to take care of the issues in the article, you could interview a bunch of users and find what percent of them would go back and edit each kind of mistake (if 70% would have to go back and change 'liked' to 'like', then it's 70% as bad as substituting 'pound' for 'around', which presumably every user will go back and edit).
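That weighting scheme could be sketched like this; the probabilities are the made-up illustrations from the comment above, not real survey data:

```python
# Sketch of a user-weighted error cost: each substitution type is
# weighted by the fraction of surveyed users who would go back and
# fix it. Keys are (hypothesis_word, reference_word) pairs.

EDIT_PROBABILITY = {
    ("liked", "like"): 0.70,    # 70% of users would correct this
    ("pound", "around"): 1.00,  # presumably everyone corrects this
}

def weighted_error_cost(substitutions):
    """substitutions: list of (hypothesis_word, reference_word) pairs
    produced by an alignment step. Substitution types not covered by
    the survey default to full cost 1.0."""
    return sum(EDIT_PROBABILITY.get(sub, 1.0) for sub in substitutions)
```

Plain WER would charge both substitutions a cost of 1.0 each; this version charges 1.7 total, reflecting that some mistakes are tolerable enough that users leave them in.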

The infuriating thing as a user is when metrics don’t map to the extra work I have to do.