Launch HN: Milk Video (YC W21) – Edit online event recordings quickly
154 points| rememberlenny | 5 years ago | reply
Ross and I met 8 years ago in Shanghai, where we worked at an education startup and organized tech and design events. When we realized Covid was creating a tsunami of webinars, Ross noticed the growing cost of editing all the new content as B2B companies replaced their in-person marketing channels with online events.
Most registrants to online events don't end up attending. They may be interested in the content, but they won’t take time to watch an entire webinar recording. Webinar content has a short shelf life unless it is reworked into a friendlier format. Doing that with traditional video editing software is cumbersome, so it often doesn’t happen. It’s time-intensive to review videos for key moments, ask designers to create appropriate graphics and captions, and receive final approval from managers.
We started out contacting companies organizing webinars, and learned they were stuck in a vicious cycle of constantly having to focus on the next upcoming event. We started manually editing videos for them to better understand how the most engaging bits could be reworked. Doing this manually revealed a glaring problem: the technology interfacing with video has changed dramatically, but the editing software hasn’t. Video editing software is designed for film makers or social media, and businesses creating video content have very different needs.
Milk Video uses a transcript-to-video based interface to review long recordings and minimize the mental effort around editing. We transcribe uploaded videos, present you with the content so you can quickly clip the best parts, and allow you to use templates to compose visually interesting layouts with additional assets, like logos or static text.
We made a drag-and-drop interface for creating short video clips with styled word-by-word captions. In a world where people often don't have their audio on, the timestamp information on a machine-generated transcript is perfect for creating interesting visual elements, such as captions styled one word at a time. This also makes content accessible by default. And because most webinars or Zoom recordings are visually similar, we have the ability to recommend which video templates might be best suited for their uploaded content in the future.
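As a rough illustration of how word-level timestamps can drive captions styled one word at a time, here is a minimal sketch. It is not Milk Video's code: the `Word` type, the `caption_frames` function, and the `max_words` parameter are all hypothetical names, and the input is assumed to be a list of words with start and end times in milliseconds, as a speech-to-text service typically returns.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int


def caption_frames(words, max_words=4):
    """Return one frame per spoken word: the visible line plus the index of the active word."""
    frames = []
    # Break the transcript into short lines of at most max_words words.
    for i in range(0, len(words), max_words):
        line = words[i:i + max_words]
        texts = [w.text for w in line]
        # Each word in the line gets its own frame, so a renderer can
        # restyle the active word while the rest of the line stays put.
        for j, w in enumerate(line):
            frames.append({
                "start_ms": w.start_ms,
                "end_ms": w.end_ms,
                "line": texts,
                "active": j,
            })
    return frames
```

A renderer would then draw `line` for each frame and apply a highlight style to the word at index `active` between `start_ms` and `end_ms`.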
The frontend is a React app based on Redux Toolkit and Recoil.js. Our performant transcript interface is made possible by Slate.js. Our backend is a Ruby on Rails app and depends on a non-trivial number of serverless functions hosted on Google Cloud and AWS. Our speech-to-text provider is AssemblyAI, which we found to be cheaper, faster and better than Google and Amazon.
We would love your feedback on the tool. We are spending a lot of time working directly with our first users, and would appreciate all of the input we can get. I’m also happy to go into detail around how any specific parts work! We’ll be in the comments and are eager to hear all your thoughts!
[+] [-] derrickli978|5 years ago|reply
Milk is really good at creating short (for us it's ~1min), impactful Twitter + LinkedIn marketing collateral based on each podcast guest (we can add our logo, the guest's background, the guest's bio + picture, transcript, etc.).
Descript is amazing at editing the entire podcast and making sure we have the overall content needed to publish.
Can't imagine doing our podcast without Milk + Descript.
[+] [-] rememberlenny|5 years ago|reply
Per the podcast point, last night (phew) we launched the ability to upload audio files and work with them.
We are focused on webinars/Zoom recordings, but now you can upload a podcast and create a promo tile.
These are some other links that were made in Milk Video:
- https://twitter.com/m_cieplinski/status/1356331228954292224
- https://twitter.com/rabois/status/1310644068326629376
- https://twitter.com/rememberlenny/status/1339618249575714816
[+] [-] ahstilde|5 years ago|reply
I've been using Milk to promote my podcast ( https://www.allschemesconsidered.com ) on LinkedIn. Here are some sample videos:
https://www.linkedin.com/posts/mraakashshah_cloud-aws-gcp-ug...
https://www.linkedin.com/posts/mraakashshah_heres-a-teaser-f...
All the social media companies are pushing video really hard in their algos (and stories). Recording and editing a podcast is super fun for me, but the audience-building part was a drag. Milk lets me make professional quality highlights super easily. Ironically, the viewership on these highlight videos is 100x the listenership on the podcast.
Anyway, I like the software.
Disclaimer: Ross found and pitched me on Milk, but I've been a happy user ever since.
[+] [-] rosscranwell|5 years ago|reply
[+] [-] mchusma|5 years ago|reply
I use brex and retool and descript and canva so I think I would be a target user but just didn't get it at all from the video.
My feedback would be to make the company "Acme" and show before or after...but definitely for a video product that first video should be a really good video.
My 2 cents.
[+] [-] rememberlenny|5 years ago|reply
[+] [-] nate|5 years ago|reply
[+] [-] rememberlenny|5 years ago|reply
The point of the transcript is to lower the bar on who can review video content. One layer on top of that is moving the technical work (cutting video) into an editorial role (picking the parts that are recommended).
We aren't trying to position ourselves to do the clip picking/recommendation now, but we have already done some machine learning based analysis to make key moments easier to find. We have a video processing task that looks for "scene changes" based on image threshold changes, so the metadata associated with when a new person joins, a slide changes, etc. is present.
The original thinking here is that we can recommend "templates" that correspond with certain video (i.e. multiple speakers vs single presenter).
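To make the "scene changes based on image threshold changes" idea concrete, here is a minimal sketch of one common way to do it: compare each frame with the previous one and flag the timestamp when the average pixel difference crosses a threshold. This is an assumption about the general technique, not the actual Milk Video task. The function names, the flat grayscale frame representation, and the default threshold of 30 are all hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute per-pixel difference between two equally sized grayscale frames."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def detect_scene_changes(frames, fps, threshold=30.0):
    """Return timestamps (seconds) where a frame differs from the previous one by more than threshold."""
    changes = []
    for i in range(1, len(frames)):
        # A large jump between consecutive frames suggests a new speaker
        # view or a slide change rather than ordinary motion.
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            changes.append(i / fps)
    return changes
```

In practice the frames would be sampled and downscaled from the video first; the threshold would need tuning per content type, since webinar slides change far more abruptly than a talking head.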
[+] [-] 1234letshaveatw|5 years ago|reply
Have you considered creating a free "edu" edition that would generate a mashup of uploaded videos, ppt and media, watermarked or whatever? The students would then convert to a paid model for work. I would use it.
[+] [-] rememberlenny|5 years ago|reply
It's worth noting, the use case you are mentioning is interesting, but not exactly the workflow we are building for.
We are focused on improving the workflow around a marketer/demand generation/sales person at a company who needs to use existing content to get attention on social media/blog/email.
One of the ways we are thinking about this is around increasing the "shelf-life" of quality content, which doesn't get discovered because it's long. This is very much a problem that appeared when businesses became content creators, as opposed to individuals.
That being said, if you have any questions, please email either Ross ([email protected]) or me ([email protected]) and we will be happy to help!
[+] [-] whoisjuan|5 years ago|reply
Honestly that almost never works for startups.
It never happens or at least it never happens fast enough. I'm fairly sure that large companies like Microsoft, which popularized student licenses, are benefiting from this, but through very long cycles of adoption.
At the end of the day if you get discounted Office for 4 or 5 years, you will very likely keep using it once you lose the discount. The secret sauce there is distribution. Office is so massively popular that you don't even need to advocate for it at work. It's the default choice.
[+] [-] rargulati|5 years ago|reply
[+] [-] rememberlenny|5 years ago|reply
The backend is a Ruby on Rails application that serves the frontend app's API. This interfaces with the user tables, database, and handles all the "state" of the app.
The serverless stuff has changed over the months, but primarily it handles the stuff I don't want Rails to handle: file uploads, video processing and transcription.
First, huge props to the Mux (https://mux.com) team and product. I cannot express how easy it has been to build video (and audio) products. File uploads go to AWS/GCP (depending on a few things) and then trigger a serverless callback to Mux.com. Mux was the fastest way we found to turn an arbitrary video file (mp4/mov/etc) into HLS format for quick streaming.
Then once the video is uploaded, we have another serverless callback that sends the video for transcription using Assembly AI (https://assemblyai.com). There are a ton of transcription based services and they vary dramatically in quality, based on the media content. I believe Google/Amazon services were largely built around the need to process phone calls, so unless you pay for their "enhanced" models, the quality is surprisingly bad (and surprisingly slow).
I *highly highly* recommend Mux and Assembly AI if you are doing any video/transcription based work.
To get an immediate update to the end user, we actually process two transcript requests - one that is just the first 60 seconds, and then the remainder of the video. This lets us render a preview transcript in the first 15-20 seconds.
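The two-request approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and the dictionary shapes are hypothetical, and it assumes the remainder is transcribed as its own segment, so its word timestamps come back relative to the segment and need shifting onto the full recording's timeline.

```python
def plan_transcript_requests(duration_s, preview_s=60.0):
    """Split a recording into a short preview segment and the remainder."""
    if duration_s <= 0:
        return []
    if duration_s <= preview_s:
        # Short recordings fit entirely in the preview request.
        return [{"kind": "preview", "start_s": 0.0, "end_s": float(duration_s)}]
    return [
        {"kind": "preview", "start_s": 0.0, "end_s": float(preview_s)},
        {"kind": "remainder", "start_s": float(preview_s), "end_s": float(duration_s)},
    ]


def shift_words(words, offset_s):
    """Move word timestamps from segment-relative to recording-relative time."""
    return [
        {**w, "start_s": w["start_s"] + offset_s, "end_s": w["end_s"] + offset_s}
        for w in words
    ]
```

The preview result can be rendered as soon as it arrives, and the remainder's words are appended after `shift_words(words, 60.0)` once the longer request finishes.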
We also have a serverless pipeline for generating the videos, but I won't go into that unless you're interested. In short, a serverless function kicks off a Docker instance running on ECS.
The requests to the serverless apps (mostly Node) have a callback to the Rails app, which then updates the end user state using websockets (which are very easy to use with Rails' ActionCable).
[+] [-] acemarke|5 years ago|reply
Hey, great to see Redux Toolkit being used in the wild! Would love to hear your thoughts on using RTK, and I'm particularly curious about the combination of RTK + Recoil together. What use cases are you using each of those for?
Please let me know if you've got any suggestions for improving RTK! I'm usually in the Reactiflux Discord evenings US time, and always happy to chat.
[+] [-] rememberlenny|5 years ago|reply
I am using RTK for the overall app state and Recoil for the on-page state. I make API requests and store the results in the redux store, but the hooks/prop passing is too slow for handling video players/transcript manipulation.
I initially had everything in RTK, but noticed the render cycle for dispatching to and listening to the store was creating unusual issues.
With Recoil, I'm able to represent the video player's current time state, and then listen to it in the other parts of the app. Similarly, when I have the transcript updating the time, the React Context based API performs better than the hook/props.
Happy to dig more into this. I'll reach out via Twitter too!
[+] [-] tfizzz|5 years ago|reply
[+] [-] rememberlenny|5 years ago|reply
I touched on the video APIs above, but regarding transcription - I have a lot of thoughts.
We chose to work with AssemblyAI (https://assemblyai.com) after trying AWS's transcript service, RevAI and Google's Speech-to-text API.
First, we started with doing manual transcripts through Rev, but the cost was unmanageable at scale. We were really happy with the quality, but couldn't charge $100 per video, so we needed a cost-effective automated solution.
I then found an old blog post from Descript's co-founder Andrew Mason, who talked about which speech-to-text API they decided to use. The blog post is old, so the metrics used are going to be irrelevant, but I was impressed they decided to use Google's API.
When we implemented the GCP option, I was shocked at how slow and how expensive it was. For one, the quality wasn't that great, and to use the lower cost option (audio-only), you need to do some additional FFMPEG based transcoding, which is very error prone. Because we receive a range of video types from users, it was causing more problems than it was worth dealing with. Also the time lost made the cost savings irrelevant.
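For readers unfamiliar with the audio-only transcoding step, here is a minimal sketch of what such an FFmpeg call usually looks like. The exact settings used are not stated in this thread, so the mono channel, the 16 kHz sample rate, and the FLAC output are assumptions, and `audio_only_ffmpeg_args` is a hypothetical helper name. The flags themselves (`-vn`, `-ac`, `-ar`, `-c:a`) are standard FFmpeg options.

```python
def audio_only_ffmpeg_args(src_path, dst_path, sample_rate=16000):
    """Build an ffmpeg argument list that drops video and writes mono FLAC audio."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", src_path,            # input video (mp4/mov/etc)
        "-vn",                     # discard the video stream
        "-ac", "1",                # downmix to a single audio channel
        "-ar", str(sample_rate),   # resample the audio
        "-c:a", "flac",            # encode the audio as FLAC
        dst_path,
    ]
```

The list can be passed to `subprocess.run(args, check=True)`. The error-prone part mentioned above tends to come from the inputs rather than the command: user uploads with missing audio tracks, odd codecs, or variable frame rates each need their own handling.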
Enter - AssemblyAI.
I did some research around what other companies were using today, and saw AssemblyAI has great ratings on G2 (https://www.g2.com/products/assemblyai-speech-to-text-api/re...). The CEO jumped on a call on a Sunday, when I was trying to improve our transcript processing time, and after testing the API, I was shocked that the transcript quality was closer to the human-made transcripts I was getting, at a cost significantly lower than the Google option.
Conclusion - we needed speed, quality, low-cost and support. AssemblyAI won on all these fronts.
[+] [-] 35mm|5 years ago|reply
Currently I use Premiere Pro with some templates I’ve created.
I haven’t found any of the transcript based editing tools to be robust enough. Descript is buggy and slow on my MacBook Pro (which has zero issues running Prem Pro & After Effects). Transcriptive has issues where it gets out of sync with the original.
What would be really helpful is detecting speaker changes, long pauses, the start and end of slide presentations (or switching to different decks), and transcription if it actually works smoothly and stays in sync across edits.
[+] [-] rememberlenny|5 years ago|reply
The inspiration behind making this is to replace Premiere Pro, so I'd love to understand your MVP to solve your problem.
I do think we can be very performant for you, given that all the processing is done on the cloud, and you are only ever interfacing with JavaScript/video tags. That being said, there is work to do!
We don't have speaker diarization right now, but it's just a feature flag for us. Also the start/end content is something that we don't have active, but is planned for next week.
[+] [-] yeldarb|5 years ago|reply
Excited to circle back and try it out again now that it's software instead of humans doing the heavy lifting behind the scenes!
[+] [-] rememberlenny|5 years ago|reply
Context here - before we started working on the current software, we planned to do an opaque marketplace for post-production video work. To vet the idea, we reached out to companies with webinars and manually edited their videos.
In the process, one finding was that it's hard to make styled word-by-word highlighted captions. This resulted in a small utility app that turned SRT and VTT caption files into dynamically sized/styled caption videos, and later evolved into today's product.
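One reason word-by-word captions are hard to make from SRT/VTT files is that a cue only carries a start and end time for the whole line. A minimal sketch of one workaround is to spread the cue's duration across its words in proportion to word length. This is an approximation chosen for illustration, not the original utility's method; `to_ms` and `split_cue` are hypothetical names, and timestamps are assumed to include the hours field.

```python
import re

_TS = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_ms(stamp):
    """Convert an SRT ('00:00:01,500') or VTT ('00:00:01.500') timestamp to milliseconds."""
    m = _TS.fullmatch(stamp.strip())
    if m is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in m.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def split_cue(timing_line, text):
    """Spread a cue's duration across its words, weighted by word length."""
    start_raw, end_raw = timing_line.split("-->")
    start = to_ms(start_raw)
    # VTT may append cue settings after the end time, so keep only the first token.
    end = to_ms(end_raw.split()[0])
    words = text.split()
    total_chars = sum(len(w) for w in words)
    timed, cursor, seen = [], start, 0
    for w in words:
        seen += len(w)
        word_end = start + (end - start) * seen // total_chars
        timed.append((w, cursor, word_end))
        cursor = word_end
    return timed
```

Machine-generated transcripts avoid this guesswork because they report a timestamp for every word.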
[+] [-] mfleit|5 years ago|reply
[+] [-] rememberlenny|5 years ago|reply
Our focus is helping make a visual clip that is engaging, based on the transcript information.
Here are some examples I posted above:
- https://twitter.com/rememberlenny/status/1339618249575714816...
- https://twitter.com/rabois/status/1310644068326629376?s=20
- https://twitter.com/m_cieplinski/status/1356331228954292224?...
[+] [-] trowngon|5 years ago|reply
Once any of your customers come to require a private video setup, you'll come to a hosted solution anyway.
[+] [-] rememberlenny|5 years ago|reply
If we were free, then I might have this concern, but our average customer value is far beyond the cost to transcribe.
Actually this is insanely cheap, especially given how much they are regularly improving. Amazon costs over $1.25, RevAI costs over $2/hour, and Google Speech to Text is over $2.15/hour.
We have a shared Slack channel with their team and I can't convey how incredible they have been. Literally the moment we have a question, we get instant responses.
Also, our users will not accept poor transcripts. Every word that needs correcting is time/money lost for them, so our goal is to give them the highest quality transcripts.
We want to focus on our key value prop. Transcription is not part of it. We focus on user experience, design options, and speed.
[+] [-] gramakri|5 years ago|reply
[+] [-] rememberlenny|5 years ago|reply
FWIW - they are Webflow templates, but do a great job at making it easy to manage.
[+] [-] gamesbrainiac|5 years ago|reply
[+] [-] rememberlenny|5 years ago|reply
We actually drive people to use Descript for most use cases that aren't relevant to us.
Think Photoshop vs Canva.
Since speech-to-text APIs have become really good (props to companies like AssemblyAI (https://www.assemblyai.com/)), transcript-based interfaces are going to become much more common.
Our product goal is to solve the use case around making the visual output, when editing/correction isn't the goal. That being said, the editor should be performant and work well, so lots to improve there.
As an aside, there are a few evolving open-source libraries that consume the output of these STT services (https://github.com/bbc/react-transcript-editor) and make turnkey transcript interfaces.
The newest/most developed one I like is based on Slate, and made by a really amazing engineer at the Wall Street Journal named Pietro.
Link: https://github.com/pietrop/slate-transcript-editor
[+] [-] elviejo|5 years ago|reply
what are the advantages / disadvantages ?
[+] [-] rememberlenny|5 years ago|reply
Descript is hands down the leader in any transcript based video/audio editing. They set the standard for detailed editing and magically manipulating audio/video.
We are focused on the workflow around creating something visually appealing, that uses a Zoom recording in it. Specifically, the transcript-based interface is only for speeding up the review process, but our main focus is on visual templates to drop in the video/captions.
One way to think of it is that we took the Descript Audiogram feature, and built out a workflow that creates a wider variety applicable to marketing/sales related needs.
We are solving the problem where you need to quickly take a video recording and make something you/your team can proudly share on social media, that reflects your company's brand guidelines/visual aesthetic.
[+] [-] sachaker|5 years ago|reply
[+] [-] mohitgarg|5 years ago|reply
[+] [-] rememberlenny|5 years ago|reply
[+] [-] annelibby|5 years ago|reply
[+] [-] rememberlenny|5 years ago|reply
[+] [-] vanpelt|5 years ago|reply
[+] [-] rememberlenny|5 years ago|reply