top | item 18708876

Launch HN: Deepgram (YC W16) – Scalable Speech API for Businesses

56 points | stephensonsco | 7 years ago

Hey HN,

I’m Scott Stephenson, one of the cofounders of Deepgram (https://www.deepgram.com/). Getting information from recorded phone calls and meetings is time-intensive, costly, and imprecise. Our speech recognition API allows businesses to reliably translate high-value unstructured audio into accurate, parsable data.

Deepgram started when my cofounder Noah Shutty and I had just finished looking for dark matter (while in a particle physics lab at the University of Michigan). Noah had the idea to start recording all audio from his life, 24/7. After gathering hundreds of hours of recordings, we wanted to search inside this fresh dataset, but realized there wasn’t a good way to find specific moments. So, we built a tool utilizing the same AI techniques we used for finding dark matter particle events, and it ended up working pretty well. A few months later, we made a single-page demo to show off “searching through sound” and posted it to HN. Pretty soon we were in the winter batch of YC in 2016 (https://techcrunch.com/2016/09/27/launching-a-google-for-sou...).

I’d say we didn’t know what we were getting ourselves into. Speech is a really big problem with a huge market, but it’s also a tough nut to crack. For decades, companies have been unable to get real learnings from their massive amounts of recorded audio (some companies record more than 1,000,000 minutes of call center calls every single day). They have a few reasons why they record the audio — some for compliance, some for training, and some for market research. The questions they’re trying to answer are usually as simple as:

  - “What is the topic of the call?” 
  - “Is this call compliant?” (did I say: my company name, my name, and “this call may be recorded”)
  - “Are people getting their problems solved quickly?” 
  - “Do my agents need training?” 
  - “What are our customers talking about? Competitors? Our latest marketing campaign?”

It’s the most intimate view you can get on your customers, but the problem is so large and difficult to solve that companies have pushed it into the corner over the past couple of decades, only trying to stem the bleeding. Current tools only transcribe with around 50-60% accuracy on real-world, noisy, accented, industry-specific audio (don’t believe the ‘human level accuracy’ hype). When companies start solving problems using speech data, they first want transcription that’s accurate. After accuracy comes scale — another big problem. Speech processing is computationally expensive and slow. Imagine trying to get into an iterative problem-solving loop when you have to wait 24 hours to get your transcripts back.

So we’ve set our sights on building the speech company. Competition from companies like Google, Amazon, and Nuance is real, but none of them approaches speech recognition like we do. We've rebuilt the entire speech processing stack, replacing heuristic- and statistics-based speech processing with fully end-to-end deep learning (we use CNNs and RNNs). Using GPUs, we train speech models to learn customers’ unique vocabularies, accents, product names, and acoustic environments. This can be the difference between correctly capturing “wasn’t delivered” and “was in the liver.” We’ve focused on speed since we think that’s very important for exploration and scale. Our API returns hour-long transcripts interactively in seconds. It’s a tool many businesses wish they had.

So far we’ve released tools that:

  - transcribe speech with timestamps
  - support real-time streaming
  - have multi-channel support
  - understand multiple languages (in beta now)
  - allow you to deeply search for keywords and phrases
  - transcribe to phonemes
  - get more accurate with use

Some of those are better mousetraps of things you’re familiar with, and some are completely new levers to pull in your audio data. We’ve built the core on English but now we’re releasing the tools for all of the Americas. (aside: You can transfer-learn speech and it works well!)
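To make the “transcribe speech with timestamps” item above concrete, here is a minimal sketch of what consuming such a response could look like. The JSON shape and field names ("word", "start", "end") are assumptions for illustration, not Deepgram's actual schema:

```python
# Hypothetical word-level transcript, as a timestamped-transcription API
# might return it. Field names are assumptions, not a real schema.
transcript = [
    {"word": "wasn't", "start": 12.4, "end": 12.7},
    {"word": "delivered", "start": 12.7, "end": 13.2},
    {"word": "on", "start": 13.2, "end": 13.3},
    {"word": "time", "start": 13.3, "end": 13.6},
]

def find_word(words, target):
    """Return (start, end) spans (in seconds) where `target` was spoken."""
    return [(w["start"], w["end"]) for w in words
            if w["word"].lower() == target.lower()]

print(find_word(transcript, "delivered"))  # [(12.7, 13.2)]
```

Word-level timestamps are what turn a flat transcript into something searchable and linkable back to the audio.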

Accuracy will continue to improve for transcription, but I think we can do more. It's such a large problem, and we really want to make a dent in “solving speech”. That means asking, truly: “What can a human do?”

People can, with little context, jump into a conversation and determine:

  - What are the words? When are they said? Who said what?
  - Is this person young/old? Male/Female? Exhausted/energetic?
  - Where is there confusion?
  - What language are they speaking? What’s the speaker’s accent?
  - What’s the topic of the conversation? Small talk or real? Is it going well?

Some of those things are being worked on now: additional language support, language and accent detection, sentiment analysis, auto-summarization, topic modeling, and more.

We’d love to hear your feedback and ideas.

23 comments


btown|7 years ago

(FYI your https://deepgram.com/v2/docs links are giving "error": "Not Found" JSON responses.)

I love progress in this space. Something I also think is necessary, though, is innovation in the discoverability interfaces around speech data. Can you search over potential transcriptions weighted by their likelihood, rather than just doing full-text search on the most-likely transcriptions? Can you visualize multiple potential transcriptions inline without overloading someone's visual cortex with information? Can you one-click-to-listen to any specific line? Can you enable people to switch conversations on the fly to an "off-the-record" mode, with such confidence that the default can be that every conversation is highlighted? Can you do all of this from Slack? Can you make setup a one-click process with Twilio OAuth? Can you do all of this from a web app that requires no coding?

All this, I'm sure, is part of an ecosystem that will be built on tools like yours, and that ecosystem fundamentally depends on the quality of the data - so it makes sense for you all to focus there first. But to the extent you want to capture the entire "stack," there's a tremendous space for someone to take the level of "passion" for data quality and apply that same instinct to quality-of-experience.

stephensonsco|7 years ago

This is a seriously fertile area where you get to "define the new interface".

It's a big problem though, since few buyers know they want those things. Around 95% of customers come into it with "give me the transcripts" and discover over time they want these other things too (some graphical, some technical). They just didn't know it was available.

New GUIs and data representations are a big part of it. Getting accuracy and scale in place is another big part. Building awareness and distribution of what's possible now is another big part.

Re: JSON error: we fixed the doc link error you saw (it was pointing to the wrong place since we _just_ updated it).

The real docs link is: https://brain.deepgram.com/docs

trevyn|7 years ago

>Noah had the idea to start recording all audio from his life, 24/7

Want this as a product. :)

stephensonsco|7 years ago

You find out very interesting things even randomly sampling your life in audio.

We still come back to this for fun. The original device was an Intel Edison, but recent variants have been based on the Raspberry Pi Zero W.

vitovito|7 years ago

Do you plan to offer something around one-shot machine transcription with offline/on-prem search?

I have ~200k hours of legacy audio I'd love to be able to do a fuzzy (phonetic?) search on, to pull content from and get real (human-edited) transcriptions of important stuff to resurface it, but there's not a lot of incentive to push it through a service for a quarter million dollars and then also pay to store and search it, since we're currently doing without it. Doing it at extremely low priority, delivering it over a long span of time, for an order of magnitude cheaper, with our IT standing up some stock fuzzy search engine, is a pretty easy sell, though.

stephensonsco|7 years ago

We do custom models (train the full DNN, not just tack on a new text language model) using transfer learning and it works for small numbers of examples too.

Glad to hear you asking about fuzzy search. That's something we do (it's actually what Deepgram started on!). It's not in the docs at the moment (it tends to confuse people who are looking for transcription; we're working on a better way to present it). You can submit queries and get back confidences and timestamps.

Many times the model doesn't need any training, but training does increase accuracy, and it can get really good if it's focused (it's a lot like wake-word detection -- we don't offer WWD as a real product yet either, just saying the challenges are similar). Best thing to do is search for phrases if you can; that really helps signal/noise.
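The "confidences and timestamps" response described above might be consumed something like this. The hit structure and field names are purely illustrative assumptions, not the real API:

```python
# Hypothetical fuzzy keyword-search results: each hit carries a
# confidence score and a timestamp. Field names are assumptions.
hits = [
    {"query": "refund", "confidence": 0.91, "time": 34.2},
    {"query": "refund", "confidence": 0.42, "time": 102.7},
    {"query": "refund", "confidence": 0.78, "time": 310.0},
]

def confident_hits(hits, threshold=0.7):
    """Keep hits at or above the confidence threshold, in playback order."""
    return sorted((h for h in hits if h["confidence"] >= threshold),
                  key=lambda h: h["time"])

print([h["time"] for h in confident_hits(hits)])  # [34.2, 310.0]
```

Thresholding on confidence is the caller's lever for trading recall against false positives, which matters more for fuzzy/phonetic search than for plain full-text search.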

dumbfoundded|7 years ago

Hi! Thanks for sharing and I have a few questions.

- How does your WER compare to other engines? https://medium.com/descript/which-automatic-transcription-se...

- How do you gather data?

- Where do you see your long-term differentiation? Is it the features you build on top of other engines or is it the engine itself?

Disclaimer: I led engineering for temi.com (a competitor of yours) but am no longer affiliated with it.

stephensonsco|7 years ago

It's a metric that's hard to nail down because there's so much parameter space being flattened into one number. It also doesn't address questions like "I care about these five high-value words (that are made up); can you recognize them?", e.g. product names and company names.

There are roughly four types of audio:

  - Phone calls: close microphone, conversational, low-bandwidth audio, two-way conversation, more industry-specific terminology

  - Meetings: 2-5 people, conversational, far-away mic, better-bandwidth audio, more industry-specific terminology

  - Broadcast: usually good diction, close mic, good-bandwidth audio, more general terminology

  - Command & control (saying to your phone: "go to <this address>"): close mic or an array of mics far away, short audio chunks (2-10 seconds), spoken in a way that makes it easier to recognize (learned behavior), usually a lot of widely known named entities

In that full aggregated lineup I bet we'd be in the 22-24% WER pack. That's mostly because we focus only on phone calls and meetings; we don't try to improve the command-and-control/broadcast/podcast types yet. Broadcast because it's perceived as lower value, so customers tend not to pay for good recognition of it (we do train models to make them better for specific customers/verticals, usually a 20-40% reduction in errors, but the buyer has to have a budget for it for now; there are ways to make it cheaper in the long term). Command and control because you have to have a fleet of devices out in the field collecting data and driving use cases, and we don't have customers there yet.

jaredwiener|7 years ago

This looks interesting. Curious what the pricing is? I don't see it on the website.

stephensonsco|7 years ago

Price starts at $1/hr, billed in 1-second increments. Frequently we charge less than that, since the price drops with volume, and typically businesses have a steady amount running through (a few thousand hours). Medium usage would be $0.25-0.75/hr (e.g. 10,000 to 100,000 hours a month). Large usage is around 10,000+ hrs per day, and the price per transcribed hour can be much lower (like $0.15/hr).
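As a back-of-the-envelope sketch of the rates quoted above: the tier boundaries and the mid-tier rate below are rough illustrations picked from the stated ranges, not a real price sheet (actual pricing is negotiated):

```python
# Illustrative volume tiers based on the figures in the comment above.
# Boundaries and the $0.50 mid-tier rate are assumptions, not real pricing.
def rate_per_hour(monthly_hours):
    if monthly_hours >= 300_000:   # roughly 10,000+ hrs/day
        return 0.15
    if monthly_hours >= 10_000:    # medium usage
        return 0.50                # somewhere in the quoted $0.25-0.75 band
    return 1.00                    # list price

def call_cost(duration_seconds, monthly_hours=0):
    """Cost of one call, billed in 1-second increments at the tier rate."""
    return round(duration_seconds / 3600 * rate_per_hour(monthly_hours), 6)

print(call_cost(90))  # a 90-second call at $1/hr -> 0.025 (i.e. $0.025)
```

Per-second billing is why short calls stay cheap: a 90-second call costs 2.5 cents at list price rather than a full hour's $1.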

That's the ballpark for cloud + batch mode. If it's cloud + realtime it's a little more. If you need it on premise it's a little more (we work with integration partners to do parts of it).

Pricing for speech is interesting since there's more than just how many $/hr in the equation. Usually businesses care about turnaround time, throughput, failover/availability, and a collection of features. So we usually want to talk about those goals and price accordingly to support 'em. I definitely wish I had a better way to frame it than "it's complicated"!

ivankirigin|7 years ago

With multiple speakers, can you identify who is speaking?

If you were in a conference room with multiple threads of conversation, could you tease out all of them?

stephensonsco|7 years ago

Best to say "yes! but only some of the time". It's something we're working on right now. You can be 80% accurate, by some metric, but that's usually still not good enough to pass a human's sniff test. Good speaker-labeled audio in various settings is hard to find.

There are several ways to look at this problem too.

L1: exact speaker is known (voiceprint) and can be picked out of all humans with accuracy, even when others are talking

L2: exact speaker is known from a subset of people, even while talking in a conversation with others

L3: speaker 1, 2, 3, ... are identified accurately

L4: speaker changes are identified accurately

L1 is a really hard problem. L2 is fine if you don't care about the time domain (knowing exactly when they spoke), but is harder if you have to accurately detect changes. L3 is about as hard as L2, but the big goal isn't who anymore, it's when. And L4 is easier, kind of like putting line breaks in when transcribing a file by hand. Not too bad. All of them need better data sources.
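The L4 case above (detecting speaker changes) can be sketched as plain bookkeeping once you have per-word speaker labels; producing those labels is the hard, model-driven part. Everything here is an illustrative assumption, not Deepgram's implementation:

```python
# Collapse a per-word speaker labeling into speaker turns (the "L4" view:
# we only care where the speaker changes, not who the speakers are).
def to_turns(labeled_words):
    """labeled_words: [(speaker_id, word), ...] -> [(speaker_id, text), ...]"""
    turns = []
    for speaker, word in labeled_words:
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + word)
        else:
            # Speaker changed: start a new turn (a "line break").
            turns.append((speaker, word))
    return turns

words = [(1, "hello"), (1, "there"), (2, "hi"), (1, "how"), (1, "are"), (1, "you")]
print(to_turns(words))  # [(1, 'hello there'), (2, 'hi'), (1, 'how are you')]
```

This also shows why L3 is harder than L4: the segmentation is trivial given labels, but keeping speaker 1 consistently labeled "1" across the whole call is where accuracy breaks down.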

pouta|7 years ago

How does this compare to Trint in terms of speech recognition performance?