Descript – A collaborative audio/video editor that works like a doc

[+] jamessb|5 years ago|reply

Maneesh Agrawala [1] and his group have done lots of similar work with the same basic idea of editing audio or narrated video as if it's text:

* Content-Based Tools for Editing Audio Stories [2] (the software is released as an open-source project called speecheditor [3])

* Text-Based Editing of Talking-head video [4]

* QuickCut: An Interactive Tool for Editing Narrated Video [5]

[1]: https://graphics.stanford.edu/~maneesh/

[2]: http://vis.berkeley.edu/papers/audiostories/

[3]: https://ucbvislab.github.io/speecheditor/

[4]: https://www.ohadf.com/projects/text-based-editing/

[5]: https://graphics.stanford.edu/projects/quickcut/

[+] XCSme|5 years ago|reply

The video is one of the best I've seen, really makes me excited for the product.

As for the product itself, I think the biggest "feature" is the ability to cut the audio by cutting the transcript, which makes it easier to quickly edit files. Transcribing is pretty common; the dubbing also sounds interesting, but that depends on how good the quality is.

I think the use-case for this is not YouTubers who expect high-quality, but social-media users who want to generate more average-quality content in a short amount of time.

[+] chris_st|5 years ago|reply

That is BY FAR the best product video I've ever seen. Wow. "Really makes me excited for the product."... well said! I think that's what marketing is supposed to do.

[+] exnor|5 years ago|reply

I don't even have to look at the credits to know who put the video together - https://sandwichvideo.com/

They are f*king awesome and produce the best ads/product videos I've ever seen in my life. Unfortunately, they are extremely expensive (the cost of 2 Ferraris) - unaffordable unless you are a SV startup that has raised $millions :)

[+] tehwebguy|5 years ago|reply

Yes! Extremely helpful and time saving. The app has changed since then but a few years ago I had a blast using it.

You can also indicate who is speaking as you generate a transcript. I would then export a closed caption file and use ffmpeg to generate the video based on who was talking before merging it with the audio (think no-budget v-tuber, just a different image on screen depending on whose voice is playing).
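
A rough sketch of that caption-to-video step, assuming the caption file has already been parsed into cues with a speaker label and start/end times in seconds. The function name, the cue shape, and the image paths are all hypothetical; only the `file` / `duration` lines of ffmpeg's concat demuxer are real syntax. It also assumes the first cue starts at time 0.

```javascript
// Build an ffmpeg concat-demuxer list that shows one still image per speaker.
// cues: [{ speaker, start, end }] in seconds (hypothetical shape, sorted by start).
// imageFor: map from speaker label to an image path (hypothetical paths).
function buildConcatList(cues, imageFor) {
  const lines = [];
  let lastImage = null;
  cues.forEach((cue, i) => {
    const image = imageFor[cue.speaker];
    if (!image) throw new Error(`no image for speaker ${cue.speaker}`);
    // Hold each image until the next cue starts so the video stays in sync
    // with the audio across pauses; the last image is held until its cue ends.
    const until = i + 1 < cues.length ? cues[i + 1].start : cue.end;
    lines.push(`file '${image}'`, `duration ${(until - cue.start).toFixed(3)}`);
    lastImage = image;
  });
  // The concat demuxer tends to drop the final duration unless the last
  // file is listed once more.
  if (lastImage) lines.push(`file '${lastImage}'`);
  return lines.join("\n") + "\n";
}
```

The resulting text could be saved as, say, list.txt and combined with the audio using something along the lines of `ffmpeg -f concat -safe 0 -i list.txt -i audio.wav -pix_fmt yuv420p -shortest out.mp4`; the exact flags would depend on the inputs.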

[+] hhkb|5 years ago|reply

I think this has huge potential for the education market, especially in a post-pandemic world where remote learning is more or less accepted.

[+] Yenrabbit|5 years ago|reply

The download page doesn't show anything particularly helpful if you're on an unsupported platform (e.g. Linux). From the source:

window.onload = function detectOS(){
   if (navigator.userAgent.indexOf("Mac")!=-1) window.location.replace("/download/mac");
   if (navigator.userAgent.indexOf("Win")!=-1) window.location.replace("/download/windows");
   if (screen.width <= 992) {window.location.replace("/download/other-device");};
   return undefined;
}
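
For comparison, here is a minimal sketch of how the same check could give unsupported desktop platforms an explicit destination. This is not Descript's code: the function name and the "/download/unsupported" path are made up for illustration, and the other three paths are taken from the snippet above. The small-screen check is moved first on the assumption that phones should win over the "Mac" substring match, since iPhone user agents contain "like Mac OS X".

```javascript
// Pure function so the decision can be tested without a browser.
// userAgent: string such as navigator.userAgent; screenWidth: number such as screen.width.
function downloadPathFor(userAgent, screenWidth) {
  // Small screens first: iPhone/iPad user agents also contain "Mac".
  if (screenWidth <= 992) return "/download/other-device";
  if (userAgent.includes("Mac")) return "/download/mac";
  if (userAgent.includes("Win")) return "/download/windows";
  // Hypothetical fallback page for desktop platforms such as Linux.
  return "/download/unsupported";
}
```

In a page this would be wired up with something like `window.location.replace(downloadPathFor(navigator.userAgent, screen.width))`, so exactly one redirect fires.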

[+] iansinnott|5 years ago|reply

It's a very cool product, which I've only used briefly.

However, product aside, their promotional videos are _phenomenal_. Not sure if they are making these in-house or some company is putting them together, but someone is doing a great job.

[+] robenkleene|5 years ago|reply

Both this video (https://sandwich.co/work/descript-video/), and their original promotional video (https://sandwich.co/work/its-how-you-make-a-podcast/), were done by Sandwich.

[+] mikewarot|5 years ago|reply

This is amazing, I wonder how I can do this offline, using open source tools.

Are there any really good open source speech to text programs? I imagine it's going to involve a pre-trained neural net.

[update] Following a thread https://news.ycombinator.com/item?id=20097542

It looks like I might be able to do this (speech recognition), though slower than real time (because I don't have a GPU), using https://github.com/mozilla/DeepSpeech

[+] hobofan|5 years ago|reply

> Are there any really good open source speech to text programs?

I've looked into the field this year (exploring to build a product in a similar niche to Descript), but everything I've encountered and tested is severely lacking (including Descript).

There are no good text(!) speech recognition programs in general. This is in contrast to sentence speech recognition, which is decent.

Once you go beyond a single sentence you encounter a lot more problems which are generally under-researched (or at the minimum under-productivized), like sentence boundary detection, punctuation, etc.

[+] albertzeyer|5 years ago|reply

Yes, there are really good open source speech to text tools (automatic speech recognition (ASR) is the common name for that).

Kaldi (https://kaldi-asr.org/) is probably the most well known, and supports hybrid NN-HMM and lattice-free MMI models. Kaldi is used by many people both in research and in production.

Lingvo (https://github.com/tensorflow/lingvo) is the open source version of Google's speech recognition toolkit, with support mostly for end-to-end models.

ESPNet (https://github.com/espnet/espnet) is good and well known for end-to-end models as well.

RASR (https://github.com/rwth-i6/rasr) + RETURNN (https://github.com/rwth-i6/returnn) are very good as well, both for end-to-end models and hybrid NN-HMM, but they are for non-commercial applications only (or you need a commercial licence) (disclaimer: I work at the university chair which develops these frameworks).

Wav2Letter (https://github.com/facebookresearch/wav2letter), the tool by Facebook.

These are probably just the most well known. There are many others as well. DeepSpeech is inferior to all the ones above, but maybe simpler.

The data to train these is also important, but you will find quite a few resources online for English, e.g. Tedlium, Librispeech, etc.

You will find lots of resources actually for ASR. Some random links:

https://github.com/gooofy/zamia-speech

https://commonvoice.mozilla.org/en/datasets

https://www.openslr.org/resources.php

To add: If you want to do something like Descript, you are mostly also interested in accurate time-stamps of the recognized text (start and end time of each spoken word). The end-to-end models are usually not so good at this (the goal is mostly to get a good word-error-rate (WER)). The conventional hybrid NN-HMM is maybe actually a better choice for this task.
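
To illustrate why the per-word time-stamps matter, here is a minimal sketch of turning a transcript edit into audio cuts. The word shape ({ text, start, end } in seconds) and the function name are assumptions for illustration, not the output format of any of the toolkits above and not how Descript does it.

```javascript
// Given word-level time-stamps and the indices of words deleted from the
// transcript, return the audio intervals that should be kept.
// words: [{ text, start, end }] in seconds (hypothetical shape).
// deleted: Set of word indices removed by the editor.
function keptIntervals(words, deleted) {
  const intervals = [];
  words.forEach((word, i) => {
    if (deleted.has(i)) return;
    const prevKept = i > 0 && !deleted.has(i - 1);
    if (prevKept) {
      // Consecutive kept words share one interval, so the natural pause
      // between them is preserved.
      intervals[intervals.length - 1].end = word.end;
    } else {
      intervals.push({ start: word.start, end: word.end });
    }
  });
  return intervals;
}

// Usage: delete the filler word "um" (index 1).
const demoWords = [
  { text: "Hello", start: 0.0, end: 0.4 },
  { text: "um", start: 0.5, end: 0.8 },
  { text: "world", start: 0.9, end: 1.3 },
  { text: "today", start: 1.4, end: 1.9 },
];
const demoCuts = keptIntervals(demoWords, new Set([1]));
```

Any error in the word boundaries shows up directly as a clipped or doubled syllable at the cut, which is why time-stamp accuracy matters more here than in plain transcription.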

[+] JMTQp8lwXL|5 years ago|reply

The user quotes sound almost too good to be true. When I think of the average YouTuber discussing video editing software, the words "best productization of machine learning I've ever seen" do not come to mind.

[+] ced|5 years ago|reply

Some of the HN comments here are suspicious as well.

[+] onion2k|5 years ago|reply

This looks amazing for audio. There's no doubt that this will massively improve podcasts, radio, etc.

I can't imagine watching a video that's been chopped up like that would be a particularly nice experience though. Editing video to remove unwanted sections and not have it look like people's heads are jumping around weirdly is really hard. Cuts are really noticeable. If they've managed to fix that with ML it's going to have a huge impact on the cost of video production.

[+] nimbix|5 years ago|reply

I watched a YouTuber go through the whole flow of producing a video using Descript and it doesn't do anything in terms of real video editing. It's useful for removing bad takes out of a longer video, but then you need to export the chopped up product and spend a lot of additional time fixing it up and making it flow seamlessly in a tool like After Effects or DaVinci Resolve.

[+] tgirod|5 years ago|reply

I wonder if deep learning algorithms such as worldsheet [0] would help in simulating multiple angles, so the program can switch from one angle to another on cuts, to make them less jarring ...

[0] https://worldsheet.github.io/

[+] cush|5 years ago|reply

Every modern YouTube video is cut just like this. It looks good. Jumping heads turn out to not look bad, when the audio flows perfectly.

https://youtu.be/2UP6CSZsc5o

[+] suchire|5 years ago|reply

Thanks for posting this! It’s great when our users are excited about our product :-)

If anyone is interested, we are hiring engineers and PMs! In particular, we’re looking for senior backend engineers and full stack engineers.

https://www.descript.com/careers

I’m the tech lead for the backend/server team, happy to answer any questions as best as I can

[+] hienyimba|5 years ago|reply

It would be cool to know the tech stack you guys use to handle such heavy loads.

[+] bassman9000|5 years ago|reply

https://www.descript.com/security

We don't use your Project Information for anything other than providing the service we offer — e.g. we don’t sell it; we don’t use it for marketing; we don’t use it for advertising.

This is a strong statement, which is nice. Only covers the projects themselves, though.

Under what's shared:

Google Cloud Speech-to-Text to provide automatic transcription

Google only accesses or uses your data to complete the automatic transcription service. Shortly after completing the service, Google deletes your data from its servers.

As the only HIPAA-compliant automatic transcription service, Google is an extremely privacy-friendly transcription service.

3 second search:

https://edition.cnn.com/2019/07/22/tech/google-street-view-p...

https://www.reuters.com/article/us-alphabet-google-privacy-l...

https://www.forbes.com/sites/daveywinder/2019/06/23/google-c...

And then there's a bunch of other integrations with Google/AWS/others.

I understand some of the issues were fixed bugs, but I don't know if selling Google integration as a privacy choice is honest, given Google's business model.

[+] ramraj07|5 years ago|reply

There's a difference between Google's consumer-facing products, where all these services are offered for free, and enterprise clients using GCP. It would be very surprising, to say the least, if they didn't honor these terms with their Cloud Speech-to-Text offering.

[+] ageitgey|5 years ago|reply

This is Andrew Mason's baby (of Groupon fame). They've been working on it for several years and it looks really great.

[+] roter|5 years ago|reply

I've had to do a lot of videos of my talks this year for virtual conferences. This would be useful to eliminate all of those ums and ahs. My 20-minute presentation would likely get cut down to 15...

I'd be interested to know the impact of cutting out the ums on the pacing/cadence of a talk.

[+] rjvs|5 years ago|reply

From the pricing page, it looks like even if you pay for the most expensive option, you still only get to remove their watermark for 30 minutes! If you are paying for the basic service (NOT free) you can only remove their watermark for 2 minutes!

Apart from that, it looks like an interesting product.

[+] anentropic|5 years ago|reply

Where did you read that? I can't see anything about watermarking here https://www.descript.com/pricing

[+] krackers|5 years ago|reply

Their TTS alone is very convincing. Seems to be better than what Google and Amazon have to offer https://www.descript.com/overdub

[+] ffpip|5 years ago|reply

They use Google for TTS.

https://www.descript.com/security

[+] DTE|5 years ago|reply

Another really exciting co in this space is Runway (https://runwayml.com/). Primarily focused on ML-enhanced media production workflows.

[+] anentropic|5 years ago|reply

Looks great. Would love to be able to use it without signing up for an online account :(

That is annoying in the first place, then when you try to sign up the password field has finicky rules like requiring upper+lowercase+numbers+symbols and min length 10.

This is compounded by the fact that it does not integrate with Chrome, so I can't easily get an auto-generated strong password stored with all my other ones. Nor does it allow using a Google or Facebook account to sign up.

[+] Erazal|5 years ago|reply

s/Descript/self-promotion/g

Descript is an amazing product and one of the main inspirations for our startup and product: www.spoke.app

We offer the same service of editing by highlighting text, but differ insofar as we offer video + direct capture of your content (both your microphone and system sound). Our goal isn't so much to produce nice, clean-cut videos as to summarize any video conversation you might have for co-workers, friends, etc.

[+] Erazal|5 years ago|reply

Here's a small demo I made for the new year, comparing speeches from different heads of state: https://cutt.ly/happy_new_year_from_spoke

Western leaders all seem to adopt the same compassionate tone, and wish to shine hope on the new year (with the exception of Trump).

On the other hand, Xi Jinping is just... self-congratulating.

[+] thunderbong|5 years ago|reply

I really liked the idea of showing a gif with a 'Play with sound' button which pops up the video.

Very innovative!

[+] betamaxthetape|5 years ago|reply

I thought that was neat too (and the video itself was awesome); however, I do wish that they had hidden the gif when the video begins playing. Most modern machines won't mind, but my older machine didn't like having to play a video whilst also rendering an animated gif in the background - it managed, but the CPU usage shot up considerably.

[+] Epskampie|5 years ago|reply

Probably forced by technology, browsers don't allow autoplay with sound anymore.

[+] dkarp|5 years ago|reply

"some mind-bendingly useful AI tools"

Does anyone else find this kind of "hypey" copy off-putting?

[+] dkarp|5 years ago|reply

After watching the video the AI tools do look mind-bendingly useful, so I take back what I said before!

[+] ambivalents|5 years ago|reply

I would absolutely use this for work and personal projects. But, couldn't one easily manipulate another person's work and falsely present it?

79 comments