The video is one of the best I've seen, really makes me excited for the product.
As for the product itself, I think the biggest "feature" is the ability to cut the audio by cutting the transcript, which makes it easier to quickly edit files. Transcribing is pretty common; the dubbing also sounds interesting, but it depends on how good the quality is.
I think the use case for this is not YouTubers who expect high quality, but social-media users who want to generate more average-quality content in a short amount of time.
That is BY FAR the best product video I've ever seen. Wow. "Really makes me excited for the product."... well said! I think that's what marketing is supposed to do.
I don't even have to look at the credits to know who put the video together - https://sandwichvideo.com/
They are f*king awesome and produce the best ads/product videos I've ever seen in my life. Unfortunately they are extremely expensive (the cost of 2 Ferraris) - unaffordable unless you are a SV startup that has raised $millions :)
Yes! Extremely helpful and time saving. The app has changed since then but a few years ago I had a blast using it.
You can also indicate who is speaking as you generate a transcript. I would then export a closed caption file and use ffmpeg to generate the video based on who was talking before merging it with the audio (think no-budget v-tuber, just a different image on screen depending on whose voice is playing).
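For anyone wanting to try the caption-to-video trick described above, here is a minimal sketch of the middle step: turning speaker-labelled captions into an image list for ffmpeg's concat demuxer. It assumes the captions are SRT and that each cue's text starts with "Name: " (an assumption; the actual export format may differ). The names `parseSpeakerCues` and `buildConcatList` and the image file names are made up for illustration.

```javascript
// Convert an SRT timestamp such as "00:01:02,500" into seconds.
function srtTimeToSeconds(stamp) {
  const [h, m, rest] = stamp.trim().split(":");
  const [s, ms] = rest.split(",");
  return Number(h) * 3600 + Number(m) * 60 + Number(s) + Number(ms) / 1000;
}

// Parse SRT text into cues of the form { start, end, speaker }.
// ASSUMPTION: each cue's text begins with "Speaker Name: ".
function parseSpeakerCues(srt) {
  const cues = [];
  for (const block of srt.trim().split(/\r?\n\s*\r?\n/)) {
    const lines = block.split(/\r?\n/);
    const timing = lines.find((line) => line.includes("-->"));
    if (!timing) continue;
    const [start, end] = timing.split("-->").map((t) => srtTimeToSeconds(t));
    const text = lines.slice(lines.indexOf(timing) + 1).join(" ");
    const match = text.match(/^([^:]+):/);
    cues.push({ start, end, speaker: match ? match[1].trim() : "unknown" });
  }
  return cues;
}

// Build the text of an ffmpeg concat-demuxer list. Each speaker's image is
// held until the next cue starts, and the first image is shown from time 0,
// so the video stays aligned with the audio track.
function buildConcatList(cues, imageFor) {
  const out = [];
  cues.forEach((cue, i) => {
    const from = i === 0 ? 0 : cue.start;
    const until = i + 1 < cues.length ? cues[i + 1].start : cue.end;
    out.push(`file '${imageFor(cue.speaker)}'`);
    out.push(`duration ${(until - from).toFixed(3)}`);
  });
  // The concat demuxer needs the last image listed once more to honour
  // the final duration.
  if (cues.length > 0) {
    out.push(`file '${imageFor(cues[cues.length - 1].speaker)}'`);
  }
  return out.join("\n");
}
```

Saving the result as list.txt, something along the lines of `ffmpeg -f concat -safe 0 -i list.txt -i audio.wav -vsync vfr -pix_fmt yuv420p -shortest out.mp4` should then merge it with the audio, though the exact flags may need adjusting for your ffmpeg version.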
The download page doesn't show anything particularly helpful if you're on an unsupported platform (eg Linux). From the source:
  window.onload = function detectOS() {
    if (navigator.userAgent.indexOf("Mac") != -1) window.location.replace("/download/mac");
    if (navigator.userAgent.indexOf("Win") != -1) window.location.replace("/download/windows");
    if (screen.width <= 992) { window.location.replace("/download/other-device"); }
    return undefined;
  }
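One way to give unsupported platforms an explicit landing page is to make the decision a pure function with a fallback branch. This is only a sketch: `downloadPathFor` and the `/download/unsupported` path are hypothetical and not part of the site's code; the other three paths are the ones from the snippet above.

```javascript
// Decide which download page to show. Exactly one path is returned, so
// every visitor, including one on Linux, ends up somewhere explicit.
// ASSUMPTION: "/download/unsupported" is a hypothetical page.
function downloadPathFor(userAgent, screenWidth) {
  // Small screens are checked first; this ordering is a choice made for
  // the sketch, since phone user agents can also contain "Mac".
  if (screenWidth <= 992) return "/download/other-device";
  if (userAgent.indexOf("Mac") !== -1) return "/download/mac";
  if (userAgent.indexOf("Win") !== -1) return "/download/windows";
  return "/download/unsupported";
}

// In a browser this could be wired up as:
// window.location.replace(downloadPathFor(navigator.userAgent, screen.width));
```

Keeping the decision separate from the redirect also makes it easy to test each branch without a browser.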
It's a very cool product, which I've only used briefly.
However, product aside, their promotional videos are _phenomenal_. Not sure if they are making these in-house or some company is putting them together, but someone is doing a great job.
It looks like I might be able to do this (speech recognition) in less than real time (because I don't have a GPU) using https://github.com/mozilla/DeepSpeech
> Are there any really good open source speech to text programs?
I've looked into the field this year (exploring building a product in a similar niche to Descript), but everything I've encountered and tested is severely lacking (including Descript).
There are no good text(!) speech recognition programs in general. This is in contrast to sentence speech recognition which is decent.
Once you go beyond a single sentence you encounter a lot more problems which are generally under-researched (or at the minimum under-productivized), like sentence boundary detection, punctuation, etc.
Yes, there are really good open source speech to text tools (automatic speech recognition (ASR) is the common name for that).
Kaldi (https://kaldi-asr.org/) is probably the most well known, and supports hybrid NN-HMM and lattice-free MMI models. Kaldi is used by many people both in research and in production.
Lingvo (https://github.com/tensorflow/lingvo) is the open source version of Google's speech recognition toolkit, with support mostly for end-to-end models.
RASR (https://github.com/rwth-i6/rasr) + RETURNN (https://github.com/rwth-i6/returnn) are very good as well, both for end-to-end models and hybrid NN-HMM, but they are for non-commercial applications only (or you need a commercial licence) (disclaimer: I work at the university chair which develops these frameworks).
To add: If you want to do something like Descript, you are mostly also interested in accurate time-stamps of the recognized text (start and end time of each spoken word). The end-to-end models are usually not so good at this (the goal is mostly to get a good word-error-rate (WER)). The conventional hybrid NN-HMM is maybe actually a better choice for this task.
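To illustrate why per-word time-stamps matter for transcript-based editing, here is a minimal sketch that turns "delete these words from the transcript" into the audio ranges to keep. It is not how Descript does it; `keepRanges` and the shape of the word objects (`text`, `start`, `end` in seconds) are assumptions made for the example.

```javascript
// Given words with start/end times (sorted by time), the set of deleted
// word indexes, and the total audio duration, return the time ranges of
// audio that should be kept.
function keepRanges(words, deleted, totalDuration) {
  // Merge runs of consecutive deleted words into single cut ranges, so the
  // pause between two adjacent deleted words is removed as well.
  const cuts = [];
  words.forEach((word, i) => {
    if (!deleted.has(i)) return;
    const last = cuts[cuts.length - 1];
    if (last && deleted.has(i - 1)) {
      last.end = word.end;
    } else {
      cuts.push({ start: word.start, end: word.end });
    }
  });

  // The kept audio is everything outside the cut ranges.
  const keep = [];
  let cursor = 0;
  for (const cut of cuts) {
    if (cut.start > cursor) keep.push({ start: cursor, end: cut.start });
    cursor = Math.max(cursor, cut.end);
  }
  if (cursor < totalDuration) keep.push({ start: cursor, end: totalDuration });
  return keep;
}
```

Any error in the word boundaries shows up directly as a clipped syllable or a leftover fragment at each cut, which is why time-stamp accuracy matters as much as the recognized text here.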
The user quotes sound almost too good to be true. When I think of the average YouTuber discussing video editing software, the words "best productization of machine learning I've ever seen" do not come to mind.
This looks amazing for audio. There's no doubt that this will massively improve podcasts, radio, etc.
I can't imagine watching a video that's been chopped up like that would be a particularly nice experience though. Editing video to remove unwanted sections and not have it look like people's heads are jumping around weirdly is really hard. Cuts are really noticeable. If they've managed to fix that with ML it's going to have a huge impact on the cost of video production.
I watched a YouTuber go through the whole flow of producing a video using Descript and it doesn't do anything in terms of real video editing. It's useful for removing bad takes out of a longer video, but then you need to export the chopped up product and spend a lot of additional time fixing it up and making it flow seamlessly in a tool like After Effects or DaVinci Resolve.
I wonder if deep learning algorithms such as worldsheet [0] would help in simulating multiple angles, so the program can switch from one angle to another on cuts, to make them less jarring ...
We don't use your Project Information for anything other than providing the service we offer — e.g. we don’t sell it; we don’t use it for marketing; we don’t use it for advertising.
This is a strong statement, which is nice. Only covers the projects themselves, though.
Under what's shared:
Google Cloud Speech-to-Text to provide automatic transcription
Google only accesses or uses your data to complete the automatic transcription service. Shortly after completing the service, Google deletes your data from its servers.
As the only HIPAA-compliant automatic transcription service, Google is an extremely privacy-friendly transcription service.
And then there's a bunch of other integrations with Google/AWS/others.
I understand some of the issues were fixed bugs, but I don't know if selling Google integration as a privacy choice is honest, given Google's business model.
There's a difference between Google's consumer-facing products, where all these services are offered for free, and enterprise clients using GCP - it would be very surprising, to say the least, if they didn't honor these terms for their Cloud Speech-to-Text offering.
I've had to do a lot of videos of my talks this year for virtual conferences. This would be useful to eliminate all of those ums and ahs. My 20-minute presentation would likely get cut down to 15...
I'd be interested to know how cutting out the ums affects the pacing/cadence of a talk.
From the pricing page, it looks like even if you pay for the most expensive option, you still only get to remove their watermark for 30 minutes! If you are paying for the basic service (NOT free) you can only remove their watermark for 2 minutes!
Apart from that, it looks like an interesting product.
Looks great. Would love to be able to use it without signing up for an online account :(
That is annoying in the first place; then, when you try to sign up, the password field has finicky rules like requiring upper+lowercase+numbers+symbols and min length 10.
This is compounded by the fact that it does not integrate with Chrome so I can't easily get an auto-generated strong password stored with all my other ones. Nor does it allow using a Google or Facebook account to sign up.
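For reference, rules like the ones described can be written down as a small check. The exact rules the sign-up form enforces are not known; this sketch simply assumes "at least 10 characters, with at least one lowercase letter, one uppercase letter, one digit and one symbol", and `meetsPasswordRules` is a hypothetical name.

```javascript
// Check a password against the assumed rules: length >= 10 plus at least
// one lowercase letter, one uppercase letter, one digit and one symbol.
function meetsPasswordRules(password) {
  return (
    password.length >= 10 &&
    /[a-z]/.test(password) &&
    /[A-Z]/.test(password) &&
    /[0-9]/.test(password) &&
    /[^A-Za-z0-9]/.test(password)
  );
}
```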
Descript is an amazing product and one of the main inspirations for our startup and product: www.spoke.app
We offer the same service of editing by highlighting text, but differ insofar as we offer video + direct capture of your content (both your microphone and system sound). Our goal isn't so much to produce nice, clean-cut videos as to summarize any video conversation you might have for co-workers, friends, etc.
I thought that was neat too (and the video itself was awesome), however I do wish that they had hidden the gif when the video begins playing. Most modern machines won't mind, but my older machine didn't like having to play a video whilst also rendering an animated gif in the background - it managed, but the CPU usage shot up considerably.
[+] [-] jamessb|5 years ago|reply
* Content-Based Tools for Editing Audio Stories [2] (the software is released as an open-source project called speecheditor [3])
* Text-Based Editing of Talking-head video [4]
* QuickCut: An Interactive Tool for Editing Narrated Video [5]
[1]: https://graphics.stanford.edu/~maneesh/
[2]: http://vis.berkeley.edu/papers/audiostories/
[3]: https://ucbvislab.github.io/speecheditor/
[4]: https://www.ohadf.com/projects/text-based-editing/
[5]: https://graphics.stanford.edu/projects/quickcut/
[+] [-] XCSme|5 years ago|reply
As for the product itself, I think the biggest "feature" is the ability to cut the audio by cutting the transcript, which makes it easier to quickly edit files. Transcribing is pretty common; the dubbing also sounds interesting, but it depends on how good the quality is.
I think the use case for this is not YouTubers who expect high quality, but social-media users who want to generate more average-quality content in a short amount of time.
[+] [-] chris_st|5 years ago|reply
[+] [-] exnor|5 years ago|reply
They are f*king awesome and produce the best ads/product videos I've ever seen in my life. Unfortunately they are extremely expensive (the cost of 2 Ferraris) - unaffordable unless you are a SV startup that has raised $millions :)
[+] [-] tehwebguy|5 years ago|reply
You can also indicate who is speaking as you generate a transcript. I would then export a closed caption file and use ffmpeg to generate the video based on who was talking before merging it with the audio (think no-budget v-tuber, just a different image on screen depending on whose voice is playing).
[+] [-] hhkb|5 years ago|reply
[+] [-] Yenrabbit|5 years ago|reply
window.onload = function detectOS(){
[+] [-] iansinnott|5 years ago|reply
However, product aside, their promotional videos are _phenomenal_. Not sure if they are making these in-house or some company is putting them together, but someone is doing a great job.
[+] [-] robenkleene|5 years ago|reply
[+] [-] mikewarot|5 years ago|reply
Are there any really good open source speech to text programs? I imagine it's going to involve a pre-trained neural net.
[update] Following a thread https://news.ycombinator.com/item?id=20097542
It looks like I might be able to do this (speech recognition) in less than real time (because I don't have a GPU) using https://github.com/mozilla/DeepSpeech
[+] [-] hobofan|5 years ago|reply
I've looked into the field this year (exploring building a product in a similar niche to Descript), but everything I've encountered and tested is severely lacking (including Descript).
There are no good text(!) speech recognition programs in general. This is in contrast to sentence speech recognition which is decent.
Once you go beyond a single sentence you encounter a lot more problems which are generally under-researched (or at the minimum under-productivized), like sentence boundary detection, punctuation, etc.
[+] [-] albertzeyer|5 years ago|reply
Kaldi (https://kaldi-asr.org/) is probably the most well known, and supports hybrid NN-HMM and lattice-free MMI models. Kaldi is used by many people both in research and in production.
Lingvo (https://github.com/tensorflow/lingvo) is the open source version of Google's speech recognition toolkit, with support mostly for end-to-end models.
ESPNet (https://github.com/espnet/espnet) is good and well known for end-to-end models as well.
RASR (https://github.com/rwth-i6/rasr) + RETURNN (https://github.com/rwth-i6/returnn) are very good as well, both for end-to-end models and hybrid NN-HMM, but they are for non-commercial applications only (or you need a commercial licence) (disclaimer: I work at the university chair which develops these frameworks).
Wav2Letter (https://github.com/facebookresearch/wav2letter), the tool by Facebook.
These are probably just the most well known. There are many others as well. DeepSpeech is inferior to all the ones above, but maybe simpler.
The data to train these is also important, but you will find quite a few resources online for English, e.g. TED-LIUM, LibriSpeech, etc.
You will find lots of resources actually for ASR. Some random links:
https://github.com/gooofy/zamia-speech
https://commonvoice.mozilla.org/en/datasets
https://www.openslr.org/resources.php
To add: If you want to do something like Descript, you are mostly also interested in accurate time-stamps of the recognized text (start and end time of each spoken word). The end-to-end models are usually not so good at this (the goal is mostly to get a good word-error-rate (WER)). The conventional hybrid NN-HMM is maybe actually a better choice for this task.
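For readers unfamiliar with WER: it is the word-level edit distance (substitutions + deletions + insertions) between the recognized text and a reference transcript, divided by the number of reference words. A minimal sketch, assuming plain whitespace tokenization and no text normalization (real evaluation tools usually normalize case and punctuation first); `wordErrorRate` is a name chosen for this example.

```javascript
// Word error rate: Levenshtein distance over words divided by the number
// of words in the reference transcript.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);

  // prev[j] holds the edit distance between the first i-1 reference words
  // and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr[j] = Math.min(substitution, deletion, insertion);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

Note that nothing in this metric looks at timing: a system can score a low WER while placing word boundaries poorly, which is the point made above about end-to-end models.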
[+] [-] JMTQp8lwXL|5 years ago|reply
[+] [-] ced|5 years ago|reply
[+] [-] onion2k|5 years ago|reply
I can't imagine watching a video that's been chopped up like that would be a particularly nice experience though. Editing video to remove unwanted sections and not have it look like people's heads are jumping around weirdly is really hard. Cuts are really noticeable. If they've managed to fix that with ML it's going to have a huge impact on the cost of video production.
[+] [-] nimbix|5 years ago|reply
[+] [-] tgirod|5 years ago|reply
[0] https://worldsheet.github.io/
[+] [-] cush|5 years ago|reply
https://youtu.be/2UP6CSZsc5o
[+] [-] suchire|5 years ago|reply
If anyone is interested, we are hiring engineers and PMs! In particular, we’re looking for senior backend engineers and full stack engineers.
https://www.descript.com/careers
I’m the tech lead for the backend/server team, happy to answer any questions as best as I can
[+] [-] hienyimba|5 years ago|reply
[+] [-] bassman9000|5 years ago|reply
We don't use your Project Information for anything other than providing the service we offer — e.g. we don’t sell it; we don’t use it for marketing; we don’t use it for advertising.
This is a strong statement, which is nice. Only covers the projects themselves, though.
Under what's shared:
Google Cloud Speech-to-Text to provide automatic transcription
Google only accesses or uses your data to complete the automatic transcription service. Shortly after completing the service, Google deletes your data from its servers.
As the only HIPAA-compliant automatic transcription service, Google is an extremely privacy-friendly transcription service.
3 second search:
https://edition.cnn.com/2019/07/22/tech/google-street-view-p...
https://www.reuters.com/article/us-alphabet-google-privacy-l...
https://www.forbes.com/sites/daveywinder/2019/06/23/google-c...
And then there's a bunch of other integrations with Google/AWS/others.
I understand some of the issues were fixed bugs, but I don't know if selling Google integration as a privacy choice is honest, given Google's business model.
[+] [-] ramraj07|5 years ago|reply
[+] [-] ageitgey|5 years ago|reply
[+] [-] roter|5 years ago|reply
I'd be interested to know how cutting out the ums affects the pacing/cadence of a talk.
[+] [-] rjvs|5 years ago|reply
Apart from that, it looks like an interesting product.
[+] [-] anentropic|5 years ago|reply
[+] [-] krackers|5 years ago|reply
[+] [-] ffpip|5 years ago|reply
https://www.descript.com/security
[+] [-] DTE|5 years ago|reply
[+] [-] anentropic|5 years ago|reply
That is annoying in the first place; then, when you try to sign up, the password field has finicky rules like requiring upper+lowercase+numbers+symbols and min length 10.
This is compounded by the fact that it does not integrate with Chrome so I can't easily get an auto-generated strong password stored with all my other ones. Nor does it allow using a Google or Facebook account to sign up.
[+] [-] Erazal|5 years ago|reply
Descript is an amazing product and one of the main inspirations for our startup and product: www.spoke.app
We offer the same service of editing by highlighting text, but differ insofar as we offer video + direct capture of your content (both your microphone and system sound). Our goal isn't so much to produce nice, clean-cut videos as to summarize any video conversation you might have for co-workers, friends, etc.
[+] [-] Erazal|5 years ago|reply
Western leaders all seem to adopt the same compassionate tone, and wish to shine hope on the new year (with the exception of Trump).
On the other hand, Xi Jinping is just....self-congratulating.
[+] [-] thunderbong|5 years ago|reply
Very innovative!
[+] [-] betamaxthetape|5 years ago|reply
[+] [-] Epskampie|5 years ago|reply
[+] [-] dkarp|5 years ago|reply
Does anyone else find this kind of "hypey" copy off-putting?
[+] [-] dkarp|5 years ago|reply
[+] [-] ambivalents|5 years ago|reply