top | item 18710237


stephensonsco | 7 years ago

It's a metric that's hard to nail down because there is so much parameter space that you are flattening into one number. It also doesn't address questions like "I care about these five high-value words (which may be made up), can you recognize them?" - e.g. product names and company names.
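For concreteness, WER is just word-level edit distance (substitutions + deletions + insertions) divided by the reference length. A minimal sketch of the standard dynamic-programming computation:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# 1 deletion ("the") + 1 substitution ("office" -> "officer") over 5 ref words = 0.4
print(wer("go to the main office", "go to main officer"))  # -> 0.4
```

Note this single number says nothing about *which* words were wrong, which is exactly why it can't capture "do you recognize my five high-value product names".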

There are ~4 types of audio:

Phone call: close microphone, conversational, low-bandwidth audio, two-way conversation, more industry-specific terminology

Meetings: 2-5 people, conversational, far-away mic, better-bandwidth audio, more industry-specific terminology

Broadcast: usually good diction, close mic, good-bandwidth audio, more general terminology

Command & control (saying to your phone: "go to <this address>"): close mic or an array of mics far away, short audio chunks (2-10 seconds), spoken in a way that makes it easier to recognize (learned behavior), usually a lot of widely known named entities

In that full aggregated lineup I bet we'd be in the 22-24% WER pack. That's mostly because we focus only on phone calls and meetings; we don't try to improve the command & control, broadcast, or podcast types yet. Broadcast because it's perceived as lower value, so customers tend not to pay for good recognition there (we do train models for specific customers/verticals, usually a 20-40% reduction in errors, but for now the buyer has to have a budget for it; there are ways to make it cheaper in the long term). Command & control because you need a fleet of devices out in the field collecting data and driving use cases, and we don't have customers there yet.

dumbfoundded | 7 years ago

I guess a better way to ask is: which acoustic environments do you excel in?

In terms of gathering data, I'm curious how you plan to get the 15K audio hours it takes to train each of these models. The more you want to segment it (e.g. by acoustic environment or gender), the more data you need. Do you have a cheap way of generating high quality data?

stephensonsco | 7 years ago

I didn't answer "Do you have a cheap way of generating high quality data?". We have good ways to do it, but they're not that cheap. Labeling large amounts of data is expensive (organizationally and in real $$$) no matter what.

But we do use our own capabilities to better tackle wild-data gathering and labeling. For instance: is every labeled minute just as valuable as any other? Definitely not. So if you can find and select only the data you want to label, rather than labeling a bunch indiscriminately, you can increase your overall efficacy.

stephensonsco | 7 years ago

If you're training from scratch, around 10k hours is needed to get a good model, but with transfer learning you don't need nearly that much (100 hours gets you a lot).
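The idea behind getting away with ~100 hours is that the expensive part of the model (the acoustic feature extractor trained on the big corpus) is reused frozen, and only a small new head is trained on the in-domain data. A toy sketch with made-up dimensions and synthetic data standing in for the real audio pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" feature extractor: weights learned on the big corpus, kept frozen.
W_frozen = rng.standard_normal((16, 8))

def features(x):
    return np.tanh(x @ W_frozen)  # frozen layer, never updated during fine-tuning

# Small in-domain dataset (stand-in for ~100 hours of labeled audio).
X = rng.standard_normal((200, 16))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

# Train only a new output head (logistic regression) on top of frozen features.
w, b, lr = np.zeros(8), 0.0, 0.5
for _ in range(300):
    F = features(X)
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # predicted probabilities
    grad = p - y                             # gradient of cross-entropy wrt logits
    w -= lr * F.T @ grad / len(y)
    b -= lr * grad.mean()

acc = ((1.0 / (1.0 + np.exp(-(features(X) @ w + b))) > 0.5) == y).mean()
print(f"training accuracy with frozen features: {acc:.2f}")
```

Because only the small head's parameters are fit, far fewer labeled examples are needed than when training every layer from scratch.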

We excel in phone-call and meeting settings, i.e. the typical sales/office/support environment.