
Launch HN: Milk Video (YC W21) – Edit online event recordings quickly

154 points| rememberlenny | 5 years ago | reply

Hello HN gang! Lenny and Ross here, working on Milk Video (https://milkvideo.com/), a browser-based tool to turn long videos into watchable clips. We speed up the workflow for marketers editing long, boring Zoom recordings and webinars into visually engaging clips with quality templates and styled captions.

Ross and I met 8 years ago in Shanghai, where we worked at an education startup and organized tech and design events. When we realized Covid was creating a tsunami of webinars, Ross noticed the growing cost of editing all the new content as B2B companies replaced their in-person marketing channels with online events.

Most registrants to online events don't end up attending. They may be interested in the content, but they won’t take time to watch an entire webinar recording. Webinar content has a short shelf life unless it is reworked into a friendlier format. Doing that with traditional video editing software is cumbersome, so it often doesn’t happen. It’s time-intensive to review videos for key moments, ask designers to create appropriate graphics and captions, and receive final approval from managers.

We started out contacting companies organizing webinars, and learned they were stuck in a vicious cycle of constantly having to focus on the next upcoming event. We started manually editing videos for them to better understand how the most engaging bits could be reworked. Doing this manually revealed a glaring problem: the technology interfacing with video has changed dramatically, but the editing software hasn’t. Video editing software is designed for filmmakers or social media, and businesses creating video content have very different needs.

Milk Video uses a transcript-to-video based interface to review long recordings and minimize the mental effort around editing. We transcribe uploaded videos, present you with the content so you can quickly clip the best parts, and allow you to use templates to compose visually interesting layouts with additional assets, like logos or static text.

We made a drag-and-drop interface for creating short video clips with styled word-by-word captions. In a world where people often don't have their audio on, the timestamp information on a machine-generated transcript is perfect for creating interesting visual elements, such as captions styled one word at a time. This also makes content accessible by default. And because most webinars or Zoom recordings are visually similar, in the future we will be able to recommend which video templates are best suited to a user's uploaded content.
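To illustrate how word-level timestamps can drive captions styled one word at a time, here is a minimal sketch in Python. It is not Milk Video's implementation: the `Word` shape, the `word_cues` helper and the four-words-per-line grouping are all assumptions made for the example. It emits one WebVTT cue per word and wraps the active word in `<b>` tags.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical shape of one transcript word with timestamps in seconds.
    text: str
    start: float
    end: float


def fmt(t: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms = round(t * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def word_cues(words, line_len=4):
    """Build WebVTT text with one cue per word; the active word is bolded."""
    out = ["WEBVTT", ""]
    for i in range(0, len(words), line_len):
        line = words[i:i + line_len]  # words shown together on screen
        for j, w in enumerate(line):
            text = " ".join(
                f"<b>{x.text}</b>" if k == j else x.text
                for k, x in enumerate(line)
            )
            out += [f"{fmt(w.start)} --> {fmt(w.end)}", text, ""]
    return "\n".join(out)


words = [Word("Hello", 0.0, 0.4), Word("HN", 0.4, 0.9)]
vtt = word_cues(words)
```

A renderer could swap the `<b>` tag for any per-word style; the point is only that each word's start and end time becomes its own cue.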

The frontend is a React app based on Redux Toolkit and Recoil.js. Our performant transcript interface is made possible by Slate.js. Our backend is a Ruby on Rails app and depends on a non-trivial number of serverless functions hosted on Google Cloud and AWS. Our speech-to-text provider is AssemblyAI, which we found to be cheaper, faster and better than Google and Amazon.

We would love your feedback on the tool. We are spending a lot of time working directly with our first users, and would appreciate all of the input we can get. I’m also happy to go into detail around how any specific parts work! We’ll be in the comments and are eager to hear all your thoughts!

52 comments

[+] derrickli978|5 years ago|reply

We use both Milk and Descript at Macro for our podcast workflows.

Milk is really good at creating short (for us it's ~1min), impactful Twitter + LinkedIn marketing collateral based on each podcast guest (we can add our logo, the guest's background, the guest's bio + picture, transcript, etc.).

Descript is amazing at editing the entire podcast and making sure we have the overall content needed to publish.

Can't imagine doing our podcast without Milk + Descript.

[+] rememberlenny|5 years ago|reply

Thank you! Thank you! Thank you!

Per the podcast point, last night (phew) we launched the ability to upload audio files and work with them.

We are focused on webinars/Zoom recordings, but now you can upload a podcast and create a promo tile.

Here are links to some other clips that were made in Milk Video:

- https://twitter.com/m_cieplinski/status/1356331228954292224

- https://twitter.com/rabois/status/1310644068326629376

- https://twitter.com/rememberlenny/status/1339618249575714816

[+] ahstilde|5 years ago|reply

Pretty cool to see a product I use on HN.

I've been using Milk to promote my podcast (https://www.allschemesconsidered.com) on LinkedIn. Here are some sample videos:

https://www.linkedin.com/posts/mraakashshah_cloud-aws-gcp-ug...

https://www.linkedin.com/posts/mraakashshah_heres-a-teaser-f...

All the social media companies are pushing video really hard in their algos (and stories). Recording and editing a podcast is super fun for me, but the audience-building part was a drag. Milk lets me make professional quality highlights super easily. Ironically, the viewership on these highlight videos is 100x the listenership on the podcast.

Anyway, I like the software.

Disclaimer: Ross found and pitched me on Milk, but I've been a happy user ever since.

[+] rosscranwell|5 years ago|reply

It's been great working with you, Aakash! Thanks for the support.

[+] mchusma|5 years ago|reply

I have some constructive feedback. The main video on your homepage confused me immensely. It was a guy from brex talking about retool, but they weren't very eloquent (no offense) and it rambled on forever. I thought it was going to be a before and after thing showing how bad it was before and how amazing it was after...but it was just a before? Or was that after milk? I also thought for a minute I was on the wrong video, hearing about retool.

I use brex and retool and descript and canva so I think I would be a target user but just didn't get it at all from the video.

My feedback would be to make the company "Acme" and show before or after...but definitely for a video product that first video should be a really good video.

My 2 cents.

[+] rememberlenny|5 years ago|reply

Extremely valuable and noted! Thank you for taking the time to write this out and share.

[+] nate|5 years ago|reply

Congrats y'all! Looks interesting. One use case I believe you're talking about that I'd love to see even more fleshed out is the tool to find the interesting clips for me? I'd love if after I just did this upload, you were like: "here's 3 clips that seem interesting to our AI." Maybe even a summarization algorithm would suffice at finding the most relevant chunks in transcript? Or maybe something more fancy if it's doable. But I'd love a best effort stab at the clips so I don't even have to think about finding them :)

[+] rememberlenny|5 years ago|reply

I love this idea in concept.

The point of the transcript is to lower the bar on who can review video content. One layer on top of that is moving the technical work (cutting video) into an editorial role (picking the parts that are recommended).

We aren't trying to position ourselves to do the clip picking/recommendation now, but we have already done some machine learning based analysis to make the interesting parts easier to find. We have a video processing task that looks for "scene changes" based on image threshold changes, so we already have metadata about when a new person joins, a slide changes, etc.

The original thinking here is that we can recommend "templates" that correspond with certain kinds of video (i.e. multiple speakers vs single presenter).
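For readers curious what threshold-based "scene change" detection can look like, here is a minimal sketch. It is an illustration under stated assumptions, not the actual processing task: frames are assumed to be already decoded into flat lists of grayscale pixel values, and the `scene_changes` name, the sampling rate and the threshold of 30 are all hypothetical.

```python
def mean_abs_diff(a, b):
    """Average absolute per-pixel difference between two equal-sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def scene_changes(frames, fps=1.0, threshold=30.0):
    """Return timestamps (in seconds) where a frame differs sharply from the previous one."""
    changes = []
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            changes.append(i / fps)
    return changes


# Four tiny 2x2 "frames": small flicker, then a big jump (e.g. a slide change).
frames = [[10] * 4, [12] * 4, [200] * 4, [201] * 4]
changes = scene_changes(frames)
```

The resulting timestamps are the kind of metadata that could later feed a template recommendation, such as telling multi-speaker recordings apart from slide-heavy ones.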

[+] 1234letshaveatw|5 years ago|reply

So it wasn't by choice, but most of my grad school business classes are now virtual. Most have a team presentation aspect where you collaborate on a powerpoint and charts and spreadsheets then "present" virtually by tacking on audio and video and then upload the whole mess after a lot of stress and panic.

Have you considered creating a free "edu" edition that would generate a mashup of uploaded videos, ppt and media, watermarked or whatever? The students would then convert to a paid model for work? I would use it

[+] rememberlenny|5 years ago|reply

Yes! Non-profit and academic use is currently free.

It's worth noting, the use case you are mentioning is interesting, but not exactly the workflow we are building for.

We are focused on improving the workflow around a marketer/demand generation/sales person at a company who needs to use existing content to get attention on social media/blog/email.

One of the ways we are thinking about this is around increasing the "shelf-life" of quality content, which doesn't get discovered because it's long. This is very much a problem that appeared when businesses became content creators, as opposed to individuals.

That being said, if you have any questions, please email either Ross ([email protected]) or me ([email protected]) and we will be happy to help!

[+] whoisjuan|5 years ago|reply

"The students would then convert to a paid model for work?"

Honestly that almost never works for startups.

It never happens or at least it never happens fast enough. I'm fairly sure that large companies like Microsoft, which popularized student licenses, are benefiting from this, but through very long cycles of adoption.

At the end of the day if you get discounted Office for 4 or 5 years, you will very likely keep using it once you lose the discount. The secret sauce there is distribution. Office is so massively popular that you don't even need to advocate for it at work. It's the default choice.

[+] rargulati|5 years ago|reply

You mentioned your backend being a Rails app + serverless functions, what's the benefit of doing both there? What does your video processing infra look like, is that in one of those systems?

[+] rememberlenny|5 years ago|reply

I could go pretty deep here, so let me know if I should elaborate on anything.

The backend is a Ruby on Rails application that serves the frontend app's API. This interfaces with the user tables, database, and handles all the "state" of the app.

The serverless stuff has changed over the months, but primarily it handles the stuff I don't want Rails to handle: file uploads, video processing and transcription.

First, huge props to the Mux (https://mux.com) team and product. I cannot express how easy it has been to build video (and audio) products. File uploads go to AWS/GCP (depending on a few things) and then trigger a serverless callback to Mux.com. Mux was the fastest way we found to turn an arbitrary video file (mp4/mov/etc) into HLS format for quick streaming.

Then once the video is uploaded, we have another serverless callback that sends the video for transcription using Assembly AI (https://assemblyai.com). There are a ton of transcription based services and they vary dramatically in quality, based on the media content. I believe Google/Amazon services were largely built around the need to process phone calls, so unless you pay for their "enhanced" models, the quality is surprisingly bad (and surprisingly slow).

I *highly highly* recommend Mux and Assembly AI if you are doing any video/transcription based work.

To get an immediate update to the end user, we actually process two transcript requests - one that is just the first 60 seconds, and then the remainder of the video. This lets us render a preview transcript in the first 15-20 seconds.
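As a rough sketch of how the two results might be stitched back together: this assumes the second request is sent audio trimmed to start at the 60-second mark, so its word timestamps are relative to that point and need shifting. The dictionary shape and the `merge_transcripts` name are hypothetical, not the provider's response format.

```python
def merge_transcripts(preview_words, rest_words, split_at=60.0):
    """Combine the preview transcript with the remainder, shifting the
    remainder's timestamps so they are relative to the full video."""
    shifted = [
        {**w, "start": w["start"] + split_at, "end": w["end"] + split_at}
        for w in rest_words
    ]
    return preview_words + shifted


preview = [{"text": "hi", "start": 0.5, "end": 0.75}]
rest = [{"text": "there", "start": 0.25, "end": 0.5}]
merged = merge_transcripts(preview, rest)
```

The preview can be rendered as soon as the first request returns, and the merged list replaces it once the second request finishes.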

We also have a serverless pipeline for generating the videos, but I won't go into that unless you're interested. In short, a serverless function kicks off a Docker instance running on ECS.

The requests to the serverless apps (mostly Node) have a callback to the Rails app, which then updates the end user state using websockets (which are very easy to use with Rails' ActionCable).

[+] acemarke|5 years ago|reply

> The frontend is a React app based on Redux Toolkit and Recoil.js

Hey, great to see Redux Toolkit being used in the wild! Would love to hear your thoughts on using RTK, and I'm particularly curious about the combination of RTK + Recoil together. What use cases are you using each of those for?

Please let me know if you've got any suggestions for improving RTK! I'm usually in the Reactiflux Discord evenings US time, and always happy to chat.

[+] rememberlenny|5 years ago|reply

Redux Toolkit is INCREDIBLE. I have the utmost respect for the developers working on it. I've worked on 6 large redux based applications, and they were all implemented incredibly differently. This has been the first time I really love the implementation.

I am using RTK for the overall app state and Recoil for the on-page state. I make API requests and store the results in the redux store, but the hooks/prop passing is too slow for handling video players/transcript manipulation.

I initially had everything in RTK, but noticed the render cycle for dispatching to and listening to the store was creating unusual issues.

With Recoil, I'm able to represent the video player's current time state, and then listen to it in the other parts of the app. Similarly, when I have the transcript updating the time, the React Context based API performs better than the hook/props.

Happy to dig more into this. I'll reach out via Twitter too!

[+] tfizzz|5 years ago|reply

Congrats on the launch! I plan on running/promoting webinars a lot more this year and will check this out. What you shared about your backend/infra is really interesting. Could you share more about how you chose your transcription/video APIs vs google/aws?

[+] rememberlenny|5 years ago|reply

Regarding general infrastructure, we are running on Heroku, AWS, and GCP for various things.

I touched on the video APIs above, but regarding transcription - I have a lot of thoughts.

We chose to work with AssemblyAI (https://assemblyai.com) after trying AWS's transcript service, RevAI and Google's Speech-to-text API.

First, we started with doing manual transcripts through Rev, but the cost was unmanageable at scale. We were really happy with the quality, but couldn't charge $100 per video, so we needed a cost-effective automated solution.

I then found an old blog post from Descript's co-founder Andrew Mason, who talked about which speech-to-text API they decided to use. The blog post is old, so the metrics used are going to be irrelevant, but I was impressed they decided to use Google's API.

When we implemented the GCP option, I was shocked at how slow and expensive it was. For one, the quality wasn't that great, and to use the lower cost option (audio-only), you need to do some additional FFmpeg-based transcoding, which is very error prone. Because we receive a range of video types from users, it was causing more problems than it was worth. Also, the time lost made the cost savings irrelevant.

Enter - AssemblyAI.

I did some research around what other companies were using today, and saw they have great ratings on G2 (https://www.g2.com/products/assemblyai-speech-to-text-api/re...). The CEO jumped on a call on a Sunday, when I was trying to improve our transcript processing time, and after testing the API, I was shocked that the transcript quality was closer to the human done transcripts I was getting, at a cost significantly cheaper than the Google option.

Conclusion - we needed speed, quality, low-cost and support. AssemblyAI won on all these fronts.

[+] 35mm|5 years ago|reply

I’m interested to try this as my day job involves editing 2-3 webinars and 2-5 Zoom interviews per week.

Currently I use Premiere Pro with some templates I’ve created.

I haven’t found any of the transcript based editing tools to be robust enough. Descript is buggy and slow on my MacBook Pro (which has zero issues running Prem Pro & After Effects). Transcriptive has issues where it gets out of sync with the original.

What would be really helpful is detecting speaker changes, long pauses, the start and end of slide presentations (or switching to different decks), and transcription if it actually works smoothly and stays in sync across edits.
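The long-pause part, at least, falls out of word-level timestamps fairly directly. A minimal sketch, assuming a transcript given as a list of words with `start` and `end` times in seconds (the shape, the `long_pauses` name and the 2-second cutoff are made up for illustration):

```python
def long_pauses(words, min_gap=2.0):
    """Return (start, end) spans where the silence between two consecutive
    words lasts at least min_gap seconds."""
    pauses = []
    for prev, cur in zip(words, words[1:]):
        if cur["start"] - prev["end"] >= min_gap:
            pauses.append((prev["end"], cur["start"]))
    return pauses


words = [
    {"start": 0.0, "end": 1.0},
    {"start": 1.5, "end": 2.0},
    {"start": 5.0, "end": 6.0},
]
pauses = long_pauses(words)
```

Speaker changes and slide boundaries need more than this, since they depend on diarization and on the video frames themselves.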

[+] rememberlenny|5 years ago|reply

Would you email me, and we can chat? [email protected]

The inspiration behind making this is to replace Premiere Pro, so I'd love to understand your MVP to solve your problem.

I do think we can be very performant for you, given that all the processing is done on the cloud, and you are only ever interfacing with JavaScript/video tags. That being said, there is work to do!

We don't have speaker diarization right now, but it's just a feature flag for us. Also the start/end content is something that we don't have active, but is planned for next week.

[+] yeldarb|5 years ago|reply

We tried out the demo last summer (which was a bit of manu-mation before the product was fully built out -- kudos to Lenny & crew for doing things that don't scale) and had a great experience! Here's an example of one of the videos we got via Milk: https://www.youtube.com/watch?v=O4jOqVqyAo8

Excited to circle back and try it out again now that it's software instead of humans doing the heavy lifting behind the scenes!

[+] rememberlenny|5 years ago|reply

Thank you!

Context here - before we started working on the current software, we planned to do an opaque marketplace for post-production video work. To vet the idea, we reached out to companies with webinars and manually edited their videos.

In the process, one finding was that it's hard to make styled word-by-word highlighted captions. This resulted in a small utility app that turned SRT and VTT caption files into dynamically sized/styled caption videos, and later evolved into today's product.

[+] mfleit|5 years ago|reply

I’m using Type Studio https://typestudio.co What is the difference?

[+] rememberlenny|5 years ago|reply

We aren't focused on being a transcript editing tool. You can upload a video, get a transcript, and edit it in Milk Video, but that's not our focus.

Our focus is helping make a visual clip that is engaging, based on the transcript information.

Here are some examples I posted above:

- https://twitter.com/rememberlenny/status/1339618249575714816...

- https://twitter.com/rabois/status/1310644068326629376?s=20

- https://twitter.com/m_cieplinski/status/1356331228954292224?...

[+] trowngon|5 years ago|reply

Assembly AI is at $0.5 per hour, which is not especially cheap these days. With open source models like Facebook RASR or Vosk you can get a self-hosted solution with even better accuracy at a cost of $0.05 per hour, 10 times cheaper.

Once any of your customers come to require a private video setup, you'll end up with a self-hosted solution anyway.

[+] rememberlenny|5 years ago|reply

I'll say this clearly: Assembly AI is the hands down best speech to text transcription service on price, quality, speed and support. Hands down. It more than pays for itself time and time again.

If we were free, then I might have this concern, but our average customer value is far beyond the cost to transcribe.

Actually this is insanely cheap, especially given how much they are regularly improving. Amazon costs over $1.25/hour, RevAI costs over $2/hour, and Google Speech to Text is over $2.15/hour.
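To put the per-hour prices quoted in this thread side by side, here is a quick back-of-the-envelope calculation. The 100 hours of video per month is a hypothetical workload, and the "over $X" figures are treated as lower bounds, so the real numbers for those providers would be higher.

```python
# Per-hour rates in USD as quoted in this thread; the "over $X" rates are lower bounds.
RATES_PER_HOUR = {
    "self-hosted (open source)": 0.05,
    "AssemblyAI": 0.50,
    "Amazon": 1.25,
    "RevAI": 2.00,
    "Google Speech to Text": 2.15,
}


def monthly_costs(hours, rates=RATES_PER_HOUR):
    """Return the transcription cost per provider for a given number of hours."""
    return {name: round(rate * hours, 2) for name, rate in rates.items()}


costs = monthly_costs(100)  # hypothetical workload: 100 hours of video per month
for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${cost:.2f}")
```

At that volume the gap between the self-hosted figure and AssemblyAI is tens of dollars a month, which is the scale to weigh against the engineering time needed to run a model yourself.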

We have a shared Slack channel with their team and I can't convey how incredible they have been. Literally the moment we have a question, we get instant responses.

Also, our users will not accept poor transcripts. Every word that needs correcting is time/money lost for them, so our goal is to give them the highest quality transcripts.

We want to focus on our key value props. Transcription is not one of them. We focus on user experience, design options, and speed.

[+] gramakri|5 years ago|reply

In the mobile view of https://milk.video/pricing, all I see are numbers/unlimited. I guess the row titles got scrolled down

[+] rememberlenny|5 years ago|reply

Thanks! Will fix this. The app definitely doesn't work on a small screen, but the homepage/pricing should.

FWIW - they are Webflow templates, but do a great job at making it easy to manage.

[+] gamesbrainiac|5 years ago|reply

Just checked this out. I prefer Descript. Better editing, as well as overdub.

[+] rememberlenny|5 years ago|reply

Thanks for signing up and trying it out.

We actually drive people to use Descript for most use cases that aren't relevant to us.

Think Photoshop vs Canva.

Since speech-to-text APIs have become really good (props to companies like AssemblyAI, https://www.assemblyai.com/), transcript-based interfaces are going to become much more common.

Our product goal is to solve the use case around making the visual output, when editing/correction isn't the goal. That being said, the editor should be performant and work well, so lots to improve there.

As an aside, there are a few evolving open-source libraries that consume the output of these STT services (https://github.com/bbc/react-transcript-editor) and make turnkey transcript interfaces.

The newest/most developed one I like is based on Slate, and made by a really amazing engineer at the Wall Street Journal named Pietro.

Link: https://github.com/pietrop/slate-transcript-editor

[+] elviejo|5 years ago|reply

How does Milk compare to Descript?

what are the advantages / disadvantages ?

[+] rememberlenny|5 years ago|reply

Thanks for asking this. This is like comparing Photoshop and Canva.

Descript is hands down the leader in any transcript based video/audio editing. They set the standard for detailed editing and magically manipulating audio/video.

We are focused on the workflow around creating something visually appealing, that uses a Zoom recording in it. Specifically, the transcript-based interface is only for speeding up the review process, but our main focus is on visual templates to drop in the video/captions.

One way to think of it is that we took the Descript Audiogram feature, and built out a workflow that creates a wider variety of outputs applicable to marketing/sales related needs.

We are solving the problem where you need to quickly take a video recording and make something you/your team can proudly share on social media, that reflects your company's brand guidelines/visual aesthetic.

[+] sachaker|5 years ago|reply

i've not used the product but i know the founders, who are both extremely talented. super excited to know about this rocket during its early days!

[+] mohitgarg|5 years ago|reply

Hey Lenny, congrats on the launch!!!

[+] rememberlenny|5 years ago|reply

Thank you!

[+] annelibby|5 years ago|reply

Go, Lenny!

[+] rememberlenny|5 years ago|reply

Thank you Anne!!

[+] vanpelt|5 years ago|reply

Hey Lenny! Congrats on the launch, so excited to see your product come together.

[+] rememberlenny|5 years ago|reply

Thank you Chris for your advice and support! The partnership you and Luke have is an inspiration for Ross and me.