By transcribing a talk by Andrej, you already picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. He's one of those people for whom you absolutely have to set your YouTube speed back down to 1x to follow what's going on.
In the spirit of getting more out of an OpenAI minute, don't send it any silence.
Trimming silence cuts the talk down from 39m31s to 31m34s, by replacing any silence (with a -50dB threshold) longer than 20ms with a 20ms pause. And to keep with the spirit of your post, I measured only that the input file got shorter; I didn't look at all at the quality of the transcription from feeding it the shorter version.
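To make the idea concrete, here is a minimal pure-Python sketch of collapsing long silences. It is not the command used above; `collapse_silence` and its parameters are hypothetical names, samples are assumed to be floats in [-1, 1], and the per-sample loudness check is a simplification of what a real audio filter does over a window.

```python
def collapse_silence(samples, sample_rate, threshold_db=-50.0, max_pause_s=0.02):
    """Shorten every run of quiet samples longer than max_pause_s to max_pause_s.

    Assumption: `samples` are floats in [-1.0, 1.0] and "quiet" means the
    absolute amplitude is below threshold_db relative to full scale.
    """
    threshold = 10 ** (threshold_db / 20.0)        # dBFS -> linear amplitude
    max_run = int(round(max_pause_s * sample_rate))  # pause length in samples
    out = []
    run = 0  # length of the current quiet run
    for s in samples:
        if abs(s) < threshold:
            run += 1
            if run <= max_run:   # keep only the first max_run quiet samples
                out.append(s)
        else:
            run = 0
            out.append(s)
    return out
```

With a 1 kHz sample rate, 100 ms of silence between two loud bursts shrinks to 20 ms, so the clip gets shorter while the speech samples themselves are untouched.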
From my own experience with whisper.cpp, normalizing the audio and removing silence not only shortens the processing time significantly, but also greatly improves the quality of the transcription, as silence can cause hallucinations. You can do that graphically with Audacity too, if you do not want to deal with the command line. You also do not need any special hardware to run whisper.cpp; with the small model, literally any computer should be able to do it if you can wait a bit (less than the audio length).
One half interesting / half depressing observation I made is that at my workplace any meeting recording I tried to transcribe in this way had its length reduced to almost 2/3 when cutting off the silence. Makes you think about the efficiency (or lack of it) of holding long(ish) meetings.
Andrej's talk seemed normal to listen to at 2x, but I've also listened to everything at 2x for a long time.
Unfortunately a byproduct of listening to everything at 2x is I've had a number of folks say they have to watch my videos at 0.75x but even when I play back my own videos it feels painfully slow unless it's 2x.
For reference I've always found John Carmack's pacing perfect / natural and watchable at 2x too.
A recent video of mine is https://www.youtube.com/watch?v=pL-qft1ykek. It was posted on HN by someone else the other day so I'm not trying to do any self promotion here, it's just an example of a recent video I put up and am generally curious if anyone finds that too fast or it's normal. It's a regular unscripted video where I have a rough idea of what I want to cover and then turn on the mic, start recording and let it pan out organically. If I had to guess I'd say the last ~250-300 videos were recorded this way.
> His natural talking speed is already >=1.5x that of a normal human. One of the people you absolutely have to set your YouTube speed back down to 1x when listening to follow what's going on.
I wonder if there's a way to automatically detect how "fast" a person talks in an audio file. I know it's subjective and different people talk at different paces in an audio, but it'd be cool to kinda know when OP's trick fails (they mention x4 ruined the output; maybe for karpathy that would happen at x2).
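One rough way to put a number on talking speed is words per minute from a timestamped transcript. The sketch below assumes you already have segments as `(start_seconds, end_seconds, text)` tuples (a hypothetical shape; adapt it to whatever your transcription tool returns), and it measures only the time covered by segments, not pauses between them.

```python
def words_per_minute(segments):
    """Estimate speaking rate from (start_s, end_s, text) segments.

    Assumption: segment boundaries roughly bracket actual speech, so the
    result reflects articulation speed rather than total recording length.
    """
    total_words = 0
    total_seconds = 0.0
    for start, end, text in segments:
        total_words += len(text.split())
        total_seconds += max(0.0, end - start)
    if total_seconds == 0:
        return 0.0
    return total_words * 60.0 / total_seconds
```

Running a short 1x sample through this first could give a hint about how much headroom a speaker has before a speed-up starts hurting, though the thresholds would need to be found empirically.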
I wish there was a 2.25x YouTube option for "normal" humans. I already use every shortcut, and listen at 2x 90% of the time. But Andrej I can't take faster than 1.25x
The interesting thing here is that OpenAI likely has a layer that trims down videos exactly how you suggest, so they can still charge by the full length while costing less for them to actually process the content.
Gemini charges by tokens rather than minutes. I used VAD to trim silence, hoping the token count would go down. I noticed the token count wasn't much different (e.g. 30 seconds of background noise had the same count as 2s of background noise). Either the Gemini API trims silence under the hood, or tokenization depends on speech content rather than length. Not sure which.
In either case, I bet OpenAI is doing the same optimization under the hood and keeping the savings for themselves.
I've heard of people doing this for podcasts and audiobooks and never understood it all that much there. Just feels like 'skimming' a real book instead of actually reading it.
That's an amusing perspective. I really struggle with watching any video at double speed, but I've never had trouble listening to any of his talks at 1x. To me, he seems to speak at a perfectly reasonable pace.
A point on skimming vs taking the time to read something properly.
I read a transcript + summary of that exact talk. I thought it was fine, but uninteresting, I moved on.
Later I saw it had been put on youtube and I was on the train, so I watched the whole thing at normal speed. I had a huge number of different ideas, thoughts and decisions, sparked by watching the whole thing.
This happens to me in other areas too. Watching a conference talk in person is far more useful to me than watching it online with other distractions. Watching it online is more useful again than reading a summary.
Going for a walk to think about something deeply beats a 10 minute session to "solve" the problem and forget it.
Seriously this is bonkers to me. I, like many hackers, hated school because they just threw one-size-fits-all knowledge at you and here we are, paying for the privilege to have that in every facet of our lives.
Reading is a pleasure. Watching a lecture or a talk and feeling the pieces fall into place is great. Having your brain work out the meaning of things is surely something that defines us as a species. We're willingly heading for such stupidity, I don't get it. I don't get how we can all be so blind at what this is going to create.
For what it's worth, I completely agree with you, for all the reasons you're saying. With talks in particular I think it's seldom about the raw content and ideas presented and more about the ancillary ideas they provoke and inspire, like you're describing.
There is just so much content out there. And context is everything. If the person sharing it had led with some specific ideas or thoughts I might have taken the time to watch and looked for those ideas. But in the context it was received—a quick link with no additional context—I really just wanted the "gist" to know what I was even potentially responding to.
In this case, for me, it was worth it. I can go back and decide if I want to watch it. Your comment has intrigued me so I very well might!
Not to discount slower speeds for thinking but I wonder if there is also value in dipping into a talk or a subject and then revisiting (re-watching) with the time to ponder on the thoughts a little more deeply.
Was it the speed or the additional information vended by the audio and video? If someone is a compelling speaker, the same message will be way more effective in an audiovisual format. The audio has emphasis on certain parts of the content, for example, which is missing from the transcript or summary entirely. Video has gestural and facial cues, also often utilized to make a point.
I was trying to summarize a 40-minute talk with OpenAI’s transcription API, but it was too long. So I sped it up with ffmpeg to fit within the 25-minute cap. It worked quite well (up to 3x speed) and was cheaper and faster, so I wrote about it.
Felt like a fun trick worth sharing. There’s a full script and cost breakdown.
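For anyone who wants to try the speed-up without reading the full script, here is a hedged sketch of building the ffmpeg arguments from Python. It assumes ffmpeg's `atempo` audio filter (the comment above only says "ffmpeg"), and it chains several `atempo` instances because some ffmpeg builds limit a single instance to the 0.5 to 2.0 range. The function names and file names are hypothetical.

```python
def atempo_chain(speed):
    """Build an ffmpeg audio filter string that changes tempo by `speed`.

    Splits the factor into steps within 0.5..2.0 so it also works on
    ffmpeg builds that restrict a single atempo instance to that range.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    parts = []
    remaining = float(speed)
    while remaining > 2.0:
        parts.append("atempo=2")
        remaining /= 2.0
    while remaining < 0.5:
        parts.append("atempo=0.5")
        remaining /= 0.5
    parts.append(f"atempo={remaining:g}")
    return ",".join(parts)


def speedup_command(src, dst, speed):
    """Return an argument list suitable for subprocess.run()."""
    return ["ffmpeg", "-i", src, "-filter:a", atempo_chain(speed), dst]
```

To fit a 40-minute talk under a 25-minute cap the factor has to be at least 40/25 = 1.6, so `speedup_command("talk.m4a", "talk-2x.m4a", 2.0)` would leave some margin.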
> I don’t know—I didn’t watch it, lol. That was the whole point. And if that answer makes you uncomfortable, buckle-up for this future we're hurtling toward. Boy, howdy.
This is a great bit of work, and the author accurately summarizes my discomfort
As if human-generated transcriptions of audio ever came with guarantees of accuracy?
This kind of transformation has always come with flaws, and I think that will continue to be expected implicitly. Far more worrying is the public's trust in _interpretations_ and claims of _fact_ produced by gen AI services, or at least the popular idea that "AI" is more trustworthy/unbiased than humans, journalists, experts, etc.
There was a similar trick which worked with Gemini versions prior to Gemini 2.0: they charged a flat rate of 258 tokens for an image, and it turns out you could fit more than 258 tokens of text in an image of text and use that for a discount!
I built a Chrome extension with one feature that transcribes audio to text in the browser using huggingface/transformers.js running the OpenAI Whisper model with WebGPU. It works perfectly! Here is a list of examples of all the things you can do in the browser with WebGPU for free. [0]
The last thing in the world I want to do is listen to or watch presidential social media posts, but, on the other hand, sometimes enormously stupid things are said which move the SP500 up or down $60 in a session. So this feature queries for new posts every minute, does OCR image-to-text and transcribes video audio to text locally, and sends the post with text for analysis, all in the background inside a Chrome extension, before notifying me of anything economically significant.
Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with whisper-large-v3-turbo. I believe OpenAI comes out to like ~$0.36/hr.
We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube. It uses Groq by default, but I also added support for Replicate and Deepgram as backups because sometimes Groq errors out.
If you have a recent MacBook you can run the same whisper model locally for free. People are really sleeping on how cheap the compute is on hardware they already own.
Interesting! At $0.02 to $0.04 an hour I don't suspect you've been hunting for optimizations, but I wonder if this "speed up the audio" trick would save you even more.
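A quick back-of-the-envelope check, using the per-hour rates quoted above and assuming (as a simplification) that cost scales linearly with the duration of the uploaded audio. `transcription_cost` is a hypothetical helper, not any provider's API.

```python
def transcription_cost(minutes, rate_per_hour, speed=1.0):
    """Cost in dollars for `minutes` of source audio sped up by `speed`.

    Assumption: the provider bills purely on the duration of the file it
    receives, so a 2x speed-up halves the billed minutes.
    """
    billed_minutes = minutes / speed
    return billed_minutes / 60.0 * rate_per_hour


# Rates as quoted in this thread (dollars per hour of audio).
for label, rate in [("OpenAI", 0.36), ("Groq turbo", 0.04), ("Groq distil", 0.02)]:
    for speed in (1.0, 2.0, 3.0):
        print(f"{label:12s} {speed:.0f}x  ${transcription_cost(40, rate, speed):.4f}")
```

For a 40-minute talk the absolute savings at Groq's rates are fractions of a cent, which is why the trick matters more at the higher rate or at large volumes.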
> We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube
Doesn't YouTube do this for you automatically these days within a day or so?
You could use Hugging Face's Inference API (which supports all of these API providers) directly making it easier to switch between them, e.g. look at the panel on the right on: https://huggingface.co/openai/whisper-large-v3
Let me know if you are interested in a more reliable transcription API. I'm building Lemonfox.ai and we've optimized our transcription API to be highly available and very fast for large files. Happy to give you a discount (email: bruno at lemonfox.ai)
I am a blue collar electrician. Not a coder (but definitely geeky).
Whisper works quite well on Apple Silicon with simple drag/drop install (i.e. no terminal commands). Program is free; you can get an M4 mini for ~$550; don't see how an online platform can even compete with this, except for one-off customers (i.e. not great repeat customers).
We used it to transcribe ddaayyss of audio microcassettes which my mother had made during her lifetime. Whisper.app even transcribed a few hours that are difficult to comprehend as a human listener. It is VERY fast.
I've used the text to search for timestamps worth listening to, skipping most dead-space (e.g. she made most while driving, in a stream of not-always-focused consciousness).
I came here to ask the same question. This is a well-solved problem, red queen racing it seems utterly pointless, a symptom of reflexive adversarialism.
I use the YouTube trick and will share it here: upload to YouTube, use their built-in transcription service to convert the audio to text for you, and then use Gemini 2.5 Pro to rebuild the transcript.
If you are hosting whisper yourself, you can do something slightly more elegant, but with the same effect. You can downsample/pool the context 2:1 (or potentially more) a few layers into the encoder. That allows you to do the equivalent of speeding up audio without worrying about potential spectral losses. For whisper large v3, that gets you nearly double throughput in exchange for a relative ~4% WER increase.
Do you have more details or examples on how to downsample the context in the encoder? I treat the encoder as an opaque block, so I have no idea where to start.
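As a starting point, the pooling step itself is simple. The sketch below is only an illustration of 2:1 mean pooling along the time axis on plain Python lists; it is not the commenter's implementation and does not touch a real Whisper encoder, where you would also need to decide which layer to pool after and how to handle positional information.

```python
def mean_pool_2to1(frames):
    """Average each pair of adjacent frames, halving the sequence length.

    `frames` is a list of equal-length feature vectors (one per time step).
    An odd trailing frame is kept as-is.
    """
    pooled = []
    for i in range(0, len(frames) - 1, 2):
        a, b = frames[i], frames[i + 1]
        pooled.append([(x + y) / 2.0 for x, y in zip(a, b)])
    if len(frames) % 2 == 1:
        pooled.append(list(frames[-1]))
    return pooled
```

Because attention cost grows with sequence length, halving the number of time steps inside the encoder is where the throughput gain would come from; the accuracy cost has to be measured on your own data.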
You can just dump the YouTube video link into Google AI Studio and ask it to transcribe the video with speaker labels, and even ask it to add useful visual cues, because the model is multimodal for video too.
Yeah, I'd like to do a more formal analysis of the outputs if I can carve out the time.
I don't think a simple diff is the way to go, at least for what I'm interested in. What I care about more is the overall accuracy of the summary—not the word-for-word transcription.
The test I want to set up is using LLMs to evaluate the summarized output and see if the primary themes/topics persist. That's more interesting and useful to me for this exercise.
This is great, thank you for sharing. I work on these APIs at OpenAI, it's a surprise to me that it still works reasonably well at 2/3x speed, but on the other hand for phone channels we get 8khz audio that is upsampled to 24khz for the model and it still works well. Note there's probably a measurable decrease in transcription accuracy that worsens as you deviate from 1x speed. Also we really need to support bigger/longer file uploads :)
I kind of want to take a more proper poke at this but focus more on summarization accuracy over word-for-word accuracy, though I see the value in both.
I'm actually curious, if I run transcriptions back-to-back-to-back on the exact same audio, how much variance should I expect?
Maybe I'll try three approaches:
- A straight diff comparison (I know a lot of people are calling for this, but I really think this is less useful than it sounds)
- A "variance within the model" test running it multiple times against the same audio, tracking how much it varies between runs
- An LLM analysis assessing if the primary points from a talk were captured and summarized at 1x, 2x, 3x, 4x runs (I think this is far more useful and interesting)
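The first two approaches can be sketched with the standard library alone. This is a rough illustration, assuming transcripts are plain strings; `word_similarity` and `run_to_run_similarity` are hypothetical helper names, and the LLM-based comparison is left out because it depends on whichever model and prompt you choose.

```python
import difflib
from itertools import combinations


def word_similarity(a, b):
    """Similarity in [0, 1] between two transcripts, compared word by word."""
    matcher = difflib.SequenceMatcher(
        None, a.lower().split(), b.lower().split(), autojunk=False
    )
    return matcher.ratio()


def run_to_run_similarity(transcripts):
    """Mean pairwise similarity across repeated runs on the same audio."""
    pairs = list(combinations(transcripts, 2))
    if not pairs:
        return 1.0
    return sum(word_similarity(a, b) for a, b in pairs) / len(pairs)
```

Measuring `run_to_run_similarity` at 1x first gives a noise floor: a 2x or 3x transcript that scores about as close to the 1x one as 1x runs score to each other is arguably indistinguishable at the word level.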
Quick feedback: it would be cool to research this internally and maybe find a sweet spot in the speed multiplier where the loss is minimal. This pre-processing is quite cheap and could eventually bring down the API price.
Appreciated the concise summary + code snippet upfront, followed by more detail and background for those interested. More articles should be written this way!
I'm implementing a similar workflow for VideoToBe.com
My Current Pipeline:
Media Extraction - yt-dlp for reliable video/audio downloads
Local Transcription - OpenAI Whisper running on my own hardware (no API costs)
Storage & UI - Transcripts stored in S3 with a custom web interface for viewing
After reading your blog post, I will be testing the effect of speeding up audio for locally hosted Whisper models. Running Whisper locally eliminates the ongoing cost concerns since my infrastructure is already a sunk cost. Speeding up audio could be an interesting performance enhancement to explore!
Omg long post. TLDR from an LLM for anyone interested
Speed your audio up 2–3× with ffmpeg before sending it to OpenAI’s gpt-4o-transcribe: the shorter file uses fewer input-tokens, cuts costs by roughly a third, and processes faster with little quality loss (4× is too fast). A sample yt-dlp → ffmpeg → curl script shows the workflow.
If you're already doing local ffmpeg stuff (i.e. pretty involved with code and scripting already) you're only a couple of steps away from just downloading the openai-whisper models (or even the faster-whisper models, which run about two times faster). Since this looks like personal usage and not building production-quality code, you can use AI (e.g. Cursor) to write a script in seconds that runs the whisper model inference.
Then there is no cost at all to run any length of audio. (since cost seems to be the primary factor of this article)
On my m1 mac laptop it takes me about 30 seconds to run it on a 3-minute audio file. I'm guessing for a 40 minute talk it takes about 5-10 minutes to run.
This seems like a good place for me to complain about the fact that the automatically generated subtitle files Youtube creates are horribly malformed. Every sentence is repeated twice. In many subtitle files, the subtitle timestamp ranges overlap one another while also repeating every sentence twice in two different ranges. It's absolutely bizarre and has been like this for years or possibly forever. Here's an example - I apologize that it's not in English. I don't know if this issue affects English. https://pastebin.com/raw/LTBps80F
Seems like Thai. Thai translation and recognition is about 10 years behind compared to the other languages I deal with in my everyday life. The good news, though, is that Russian was at the same level years ago, and now it is near perfect.
Gemini 2.5 Pro is, in my usage, quite superior for high-quality transcriptions of phone calls, in Dutch in my case. As long as you upload the audio to GCS, you can easily process conversations of over an hour. It correctly identified and labeled speakers.
The cheaper 2.5 flash made noticeably more mistakes, for example it didn't correctly output numbers while the Pro model did.
As for OpenAI, their gpt-4o-transcribe model did worse than 2.5 flash, completely messing up names of places and/or people. Plus it doesn't label the conversation in turns, it just outputs a single continuous piece of text.
So wait… is whisper transcription really all that slow locally on an M3 MacBook? It’s been a while since I used whisper.cpp, but I seem to remember it taking maybe 20 minutes on a comparatively slowpoke (and power-hungry) i5 12600K for maybe 40 minutes of audio. It might take less time on a faster M chip (maybe I’m imagining mobile Apple silicon to be more performant than even desktop Intel CPUs), and even less if there is support for the built-in GPU cores and other AI-optimized silicon?
Do the APIs support simultaneous voice transcription in a way that different voices are tagged? (either in text or as metadata)
If so: could you split the audiofile and process the latter half by pitch shifting, say an octave, and then merging them together to get shorter audiofile — then transcribe and join them back to a linear form, tagging removed. (You could insert some prerecorded voice to know at which point the second voice starts.). If pitch change is not enough, maybe manipulate it further by formants.
This is really interesting, although the cheapest route is still to use an alternative audio-compatible LLM (Gemini 2.0 Flash Lite, Phi 4 Multimodal) or an alternative host for Whisper (Deepinfra, Fal).
When extracting transcripts from YouTube videos, can anyone give advice on the best (cost effective, quick, accurate) way to do this?
I'm confused because I read in various places that the YouTube API doesn't provide access to transcripts ... so how do all these YouTube transcript extractor services do it?
I want to build my own YouTube summarizer app. Any advice and info on this topic greatly appreciated!
For our internal tool that transcribes local city council meetings on YouTube (often 1-3 hours long), we found that these automatic ones were never available though.
(Our tool usually 'processes' the videos within ~5-30 mins of being uploaded, so that's also why none are probably available 'officially' yet.)
So we use yt-dlp to download the highest quality audio and then process them with whisper via Groq, which is way cheaper (~$0.02-0.04/hr with Groq compared to $0.36/hr via OpenAI's API.) Sometimes groq errors out so there's built-in support for Replicate and Deepgram as well.
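The "try Groq first, fall back to the others" part can be sketched generically. The code below is not our actual tool; `transcribe_with_fallback` is a hypothetical helper and each provider is assumed to be a plain callable that takes an audio path and either returns text or raises.

```python
def transcribe_with_fallback(audio_path, providers):
    """Try each (name, callable) provider in order; return the first success.

    Returns a (provider_name, transcript) tuple. Raises RuntimeError with
    the collected error messages if every provider fails.
    """
    errors = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio_path)
        except Exception as exc:  # provider-specific errors vary widely
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Ordering the list from cheapest to most expensive keeps the average cost close to the cheapest provider's rate while still finishing the job when it has a bad day.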
We run yt-dlp on our remote Linode server and I have a Python script I created that will automatically login to YouTube with a "clean" account and extract the proper cookies.txt file, and we also generate a 'po token' using another tool:
Both cookies.txt and the "po token" get passed to yt-dlp when running on the Linode server and I haven't had to re-generate anything in over a month. Runs smoothly every day.
(Note that I don't use cookies/po_token when running locally at home, it usually works fine there.)
You can use yt-dlp to get the transcripts. For instance, to grab just the transcript of a video:
./yt-dlp --skip-download --write-sub --write-auto-sub --sub-lang en --sub-format json3 <youtube video URL>
You can also feed the same command a playlist or channel URL and it'll run through and grab all the transcripts for each video in the playlist or channel.
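Once you have the json3 file, flattening it to plain text takes a few lines. This sketch assumes the layout I have seen in these files, a top-level `events` list whose items may carry a `segs` list of `{"utf8": ...}` pieces; treat those field names as an assumption and check them against your own download.

```python
import json


def json3_to_text(raw):
    """Flatten a json3 subtitle document (as a string) into plain text.

    Assumption: text lives in events[*].segs[*].utf8; events without
    "segs" (e.g. layout-only entries) are skipped.
    """
    data = json.loads(raw)
    pieces = []
    for event in data.get("events", []):
        for seg in event.get("segs", []):
            pieces.append(seg.get("utf8", ""))
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join("".join(pieces).split())
```

From there the text can go straight into a summarizer, which is all a basic YouTube summarizer app needs when captions are available.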
If you're looking for a cheaper transcription API you could also use https://Lemonfox.ai. We've optimized the API for long audio files and are much faster and cheaper than OpenAI.
This "hack" also works in real life: YouTubers love to talk slowly to increase the video runtime, so I watch everything other than songs at 2x speed (and that's only because their player doesn't let you go faster).
You'd need a WER comparison to check whether there really is no drop in quality. With this trick, there might be trouble if the audio is noisy, and it may not always be obvious whether or not to speed up.
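For reference, WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. A minimal sketch, assuming simple lowercase whitespace tokenization (real evaluations usually normalize punctuation and numbers as well):

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

Using a careful 1x transcript (ideally human-checked) as the reference and the 2x, 3x and 4x outputs as hypotheses would show where quality actually starts to fall off.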
I noticed something similar with images as inputs to Claude, you can scale down the images and still get good outputs. There is an accuracy drop off at a certain point but the token savings are worth doing a little tuning there.
In my experience, transcription software has no problem with transcribing sped up audio, or audio that is inaudible to humans or extremely loud (as long as not clipped), I wonder if LLM transcription works the same.
Hmm…doesn’t this technique effectively make the minute longer, not shorter? Because you can pack more speech into a minute of recording? Seems like making a minute shorter would be counterproductive.
I wonder how much time and battery transcoding/uploading/downloading over coffeeshop wifi would really save vs just running it locally through optimized Whisper.
I had this same thought and won't pretend my fear was rational, haha.
One thing that I thought was fairly clear in my write-up but feels a little lost in the comments: I didn't just try this with whisper. I tried it with their newer gpt-4o-transcribe model, which seems considerably faster. There's no way to run that one locally.
That's really cool! Also, isn't this effectively the same as supplying audio with a sampling rate of 8kHz instead of the 16kHz that the model is supposed to work with?
There is also probably a way to send a smaller sample of the audio at different speeds and compare the results, to find the fastest speed with no quality loss for each individual clip.
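A hedged sketch of that search: `transcribe` is a hypothetical callable you would supply that takes a speed factor, speeds up a short sample of the clip, and returns the transcript. The sketch assumes quality degrades as speed increases and stops at the first speed that drifts too far from the 1x baseline.

```python
import difflib


def fastest_acceptable_speed(transcribe, speeds, min_similarity=0.95):
    """Return the highest speed whose transcript stays close to the 1x one.

    `transcribe(speed)` is a user-supplied function (assumption). Similarity
    is a word-level difflib ratio against the 1x transcript.
    """
    baseline = transcribe(1.0).lower().split()
    best = 1.0
    for speed in sorted(speeds):
        words = transcribe(speed).lower().split()
        score = difflib.SequenceMatcher(
            None, baseline, words, autojunk=False
        ).ratio()
        if score < min_similarity:
            break  # assume anything faster is at least as bad
        best = speed
    return best
```

The sample has to be long enough to be representative, and each probe costs an API call, so this only pays off for long recordings.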
With this logic, you should also be able to trim the parts that don’t have words. Just add a cut-off for dB, and trim the video before transcription.
But you know that you can run OpenAI's Whisper audio recognition model locally for free, right? It has very modest GPU requirements, and the new "turbo" model works quite fast (there are also several Python libraries which make it significantly faster still).
w-m|8 months ago
jwrallie|8 months ago
swyx|8 months ago
guys how hard is it to toss both versions into like diffchecker or something haha youre just comparing text
nickjj|8 months ago
behnamoh|8 months ago
georgemandis|8 months ago
QuantumGood|8 months ago
brunoborges|8 months ago
pragmatic|8 months ago
vayup|8 months ago
unknown|8 months ago
[deleted]
CSMastermind|8 months ago
Is it common for people to watch Youtube sped up?
cbsmith|8 months ago
niutech|8 months ago
heeton|8 months ago
Slower is usually better for thinking.
pluc|8 months ago
georgemandis|8 months ago
++ to "Slower is usually better for thinking"
itsoktocry|8 months ago
Yeah, I see people talking about listening to podcasts or audiobooks on 2x or 3x.
Sometimes I set mine to 0.8x. I find you get time to absorb and think. Am I an outlier?
bongodongobob|8 months ago
mutagen|8 months ago
conradev|8 months ago
georgemandis|8 months ago
bravesoul2|8 months ago
timerol|8 months ago
raincole|8 months ago
Newspaper is essentially just an inaccurate summary of what really happened. So I don't find this realization that uncomfortable.
BHSPitMonkey|8 months ago
simonw|8 months ago
Graziano_M|8 months ago
dataviz1000|8 months ago
[0] https://github.com/huggingface/transformers.js/tree/main/exa...
[1] https://github.com/adam-s/doomberg-terminal
kgc|8 months ago
rob|8 months ago
[0] https://groq.com/pricing/
colechristensen|8 months ago
georgemandis|8 months ago
abidlabs|8 months ago
pzo|8 months ago
https://developers.cloudflare.com/workers-ai/models/whisper-...
BrunoJo|8 months ago
Tepix|8 months ago
With faster-whisper (int8, batch=8) you can transcribe 13 minutes of audio in 51 seconds on CPU.
ProllyInfamous|8 months ago
anigbrowl|8 months ago
alok-g|8 months ago
Love this! I wish more authors followed this approach. So many articles keep going all over the place before 'the point' appears.
If trying, perhaps some 50% of the authors may realize that they don't _have_ a point.
appleaday1|8 months ago
ffmpeg \
  -f lavfi \
  -i color=c=black:s=1920x1080:r=5 \
  -i file_you_want_transcripted.wav \
  -c:v libx264 \
  -preset medium \
  -tune stillimage \
  -crf 28 \
  -c:a aac \
  -b:a 192k \
  -pix_fmt yuv420p \
  -shortest \
  file_you_upload_to_youtube_for_free_transcripts.mp4
This works VERY well for my needs.
conjecTech|8 months ago
nomercy400|8 months ago
mt_|8 months ago
MaxDPS|8 months ago
brendanfinan|8 months ago
https://news.ycombinator.com/item?id=44125598
jasonjmcghee|8 months ago
And if someone had this idea and pitched it to Claude (the model this project was vibe coded with) it would be like "what a great idea!"
raincole|8 months ago
Are people just starring it for meme value or something? Is this a scam?
[0]: https://github.com/Olow304/memvid
stogot|8 months ago
georgemandis|8 months ago
pbbakkum|8 months ago
georgemandis|8 months ago
I'm actually curious, if I run transcriptions back-to-back-to-back on the exact same audio, how much variance should I expect?
Maybe I'll try three approaches:
- A straight diff comparison (I know a lot of people are calling for this, but I really think this is less useful than it sounds)
- A "variance within the model" test running it multiple times against the same audio, tracking how much it varies between runs
- An LLM analysis assessing if the primary points from a talk were captured and summarized at 1x, 2x, 3x, 4x runs (I think this is far more useful and interesting)
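For what it's worth, the first two approaches in that list can be roughed out with nothing but the Python standard library. This is only a sketch under assumptions: the function names are made up for illustration, the similarity measure is `difflib`'s word-level ratio rather than a proper word error rate, and the transcripts are assumed to already be plain strings. It says nothing about whether the main points of a talk survived, which is what the LLM-based test is meant to check.

```python
import difflib
import itertools
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio in [0, 1] between two transcripts."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


def run_to_run_variance(transcripts: list[str]) -> float:
    """Mean pairwise dissimilarity (1 - similarity) across repeated runs."""
    pairs = list(itertools.combinations(transcripts, 2))
    if not pairs:
        return 0.0  # a single run has nothing to vary against
    return sum(1.0 - similarity(a, b) for a, b in pairs) / len(pairs)
```

Running `run_to_run_variance` on several 1x transcripts of the same audio would give a baseline; a 2x or 3x transcript is only meaningfully worse if its `similarity` to a 1x run falls outside that baseline.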
nerder92|8 months ago
pimlottc|8 months ago
meerab|8 months ago
I'm implementing a similar workflow for VideoToBe.com
My Current Pipeline:
- Media Extraction - yt-dlp for reliable video/audio downloads
- Local Transcription - OpenAI Whisper running on my own hardware (no API costs)
- Storage & UI - Transcripts stored in S3 with a custom web interface for viewing
Y Combinator playlist https://videotobe.com/play/playlist/ycombinator
and Andrej's talk is https://videotobe.com/play/youtube/LCEmiRjPEtQ
After reading your blog post, I will be testing the effect of speeding up audio for locally hosted Whisper models. Running Whisper locally eliminates the ongoing cost concerns since my infrastructure is already a sunk cost. Speeding up audio could be an interesting performance enhancement to explore!
karpathy|8 months ago
Speed your audio up 2–3× with ffmpeg before sending it to OpenAI’s gpt-4o-transcribe: the shorter file uses fewer input-tokens, cuts costs by roughly a third, and processes faster with little quality loss (4× is too fast). A sample yt-dlp → ffmpeg → curl script shows the workflow.
;)
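A minimal sketch of the speed-up step, assuming ffmpeg's `atempo` audio filter is what does the work (the blog post's actual script is not reproduced in this thread). The helper names and file names are hypothetical. Factors above 2x are split into chained stages because older ffmpeg builds limit a single `atempo` stage to the 0.5 to 2.0 range.

```python
def atempo_chain(speed: float) -> str:
    """Build an ffmpeg audio filter string for the given speed factor.

    Factors above 2.0 are split into chained atempo stages of at most 2.0
    each, which also works on older ffmpeg builds.
    """
    if speed < 0.5:
        raise ValueError("speed must be at least 0.5")
    stages = []
    remaining = speed
    while remaining > 2.0:
        stages.append(2.0)
        remaining /= 2.0
    stages.append(remaining)
    return ",".join(f"atempo={stage:g}" for stage in stages)


def speedup_command(src: str, dst: str, speed: float) -> list[str]:
    """Return the ffmpeg argument list; pass it to subprocess.run to execute."""
    return ["ffmpeg", "-y", "-i", src, "-filter:a", atempo_chain(speed), dst]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(speedup_command("talk.m4a", "talk_3x.m4a", 3.0)))
```

Building the argument list separately from running it makes it easy to check the filter string before spending any API credits.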
georgemandis|8 months ago
(Thanks for your good sense of humor)
bravesoul2|8 months ago
I have been thinking for a while how do you make good use of the short space in those places.
LLM did well here.
lordspace|8 months ago
godot|8 months ago
Then there is no cost at all to run any length of audio. (since cost seems to be the primary factor of this article)
On my M1 Mac laptop it takes me about 30 seconds to run it on a 3-minute audio file. I'm guessing for a 40-minute talk it takes about 5-10 minutes to run.
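The 5-10 minute guess is consistent with a simple linear extrapolation from the 30-seconds-per-3-minutes figure. The sketch below assumes runtime scales linearly with audio length, which may not hold exactly in practice (model load time, chunking); the function name is made up for illustration.

```python
def estimate_runtime_minutes(
    audio_minutes: float,
    sample_audio_minutes: float = 3.0,
    sample_runtime_seconds: float = 30.0,
) -> float:
    """Linearly extrapolate transcription time from one measured sample."""
    seconds_per_audio_minute = sample_runtime_seconds / sample_audio_minutes
    return audio_minutes * seconds_per_audio_minute / 60.0


if __name__ == "__main__":
    print(f"{estimate_runtime_minutes(40):.1f} minutes for a 40-minute talk")
```

That works out to a little under 7 minutes for a 40-minute talk.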
Tepix|8 months ago
55555|8 months ago
xenator|8 months ago
dajonker|8 months ago
The cheaper 2.5 flash made noticeably more mistakes, for example it didn't correctly output numbers while the Pro model did.
As for OpenAI, their gpt-4o-transcribe model did worse than 2.5 flash, completely messing up names of places and/or people. Plus it doesn't label the conversation in turns, it just outputs a single continuous piece of text.
7speter|8 months ago
Did I miss that the task was time sensitive?
addaidirectory|8 months ago
mushishi|8 months ago
If so: could you split the audio file and process the latter half by pitch shifting, say by an octave, then merge the two halves together to get a shorter audio file, then transcribe and join them back into a linear form, with the tagging removed? (You could insert some prerecorded voice to know at which point the second voice starts.) If pitch change is not enough, maybe manipulate it further by formants.
KTibow|8 months ago
fallinditch|8 months ago
I'm confused because I read in various places that the YouTube API doesn't provide access to transcripts ... so how do all these YouTube transcript extractor services do it?
I want to build my own YouTube summarizer app. Any advice and info on this topic greatly appreciated!
rob|8 months ago
https://github.com/jdepoix/youtube-transcript-api
For our internal tool that transcribes local city council meetings on YouTube (often 1-3 hours long), we found that these automatic ones were never available though.
(Our tool usually 'processes' the videos within ~5-30 mins of being uploaded, so that's also why none are probably available 'officially' yet.)
So we use yt-dlp to download the highest quality audio and then process them with whisper via Groq, which is way cheaper (~$0.02-0.04/hr with Groq compared to $0.36/hr via OpenAI's API). Sometimes Groq errors out so there's built-in support for Replicate and Deepgram as well.
We run yt-dlp on our remote Linode server and I have a Python script I created that will automatically log in to YouTube with a "clean" account and extract the proper cookies.txt file, and we also generate a 'po token' using another tool:
https://github.com/iv-org/youtube-trusted-session-generator
Both cookies.txt and the "po token" get passed to yt-dlp when running on the Linode server and I haven't had to re-generate anything in over a month. Runs smoothly every day.
(Note that I don't use cookies/po_token when running locally at home, it usually works fine there.)
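To put the per-hour prices quoted above in per-meeting terms, here is a small arithmetic sketch. The rates are the approximate figures from this comment, not official pricing, and the function name is illustrative.

```python
# Approximate per-hour rates as quoted in the comment (USD per audio hour).
GROQ_PER_HOUR = (0.02, 0.04)  # low and high estimate
OPENAI_PER_HOUR = 0.36


def meeting_costs(hours: float) -> dict[str, float]:
    """Return the estimated transcription cost of one meeting per provider."""
    return {
        "groq_low": GROQ_PER_HOUR[0] * hours,
        "groq_high": GROQ_PER_HOUR[1] * hours,
        "openai": OPENAI_PER_HOUR * hours,
    }


if __name__ == "__main__":
    for hours in (1, 3):
        costs = meeting_costs(hours)
        print(
            f"{hours}h meeting: Groq ${costs['groq_low']:.2f}-${costs['groq_high']:.2f}, "
            f"OpenAI ${costs['openai']:.2f}"
        )
```

At those rates the OpenAI API comes out roughly 9 to 18 times more expensive per hour of audio.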
banana_giraffe|8 months ago
vjerancrnjak|8 months ago
isubkhankulov|8 months ago
I use this free tool to extract those and dump the transcripts into an LLM with basic prompts: https://contentflow.megalabs.co
jasonjmcghee|8 months ago
georgemandis|8 months ago
BrunoJo|8 months ago
ta8903|8 months ago
impossiblefork|8 months ago
another_twist|8 months ago
tmaly|8 months ago
pzo|8 months ago
cprayingmantis|8 months ago
georgemandis|8 months ago
Clearly the next thing we need to test is removing all the vowels from words, or something like that :)
ryanar|8 months ago
donkey_brains|8 months ago
StochasticLi|8 months ago
PeterStuer|8 months ago
georgemandis|8 months ago
One thing that I thought was fairly clear in my write-up but feels a little lost in the comments: I didn't just try this with whisper. I tried it with their newer gpt-4o-transcribe model, which seems considerably faster. There's no way to run that one locally.
xg15|8 months ago
ada1981|8 months ago
There is also probably a way to send a smaller sample of audio at different speeds and compare the results, to find the fastest speed with no quality loss for each individual clip.
moralestapia|8 months ago
Nice. Any blog post, twitter comment or anything pointing to that?
appleaday1|8 months ago
pottertheotter|8 months ago
Or you can just copy the transcript that YouTube provides below the video.
celltalk|8 months ago
Possibly another 10-20% gain?
fuzztester|8 months ago
there is tons of this happening everywhere, and we need to fight this, and boycott it.
mcc1ane|8 months ago
canyp|8 months ago
raluk|8 months ago
anshumankmr|8 months ago
Nevermark|8 months ago
pknerd|8 months ago
amelius|8 months ago
yashasolutions|8 months ago
KPennig86852|8 months ago
b0a04gl|8 months ago
[deleted]
spapinwar|8 months ago
[deleted]
Raphell|8 months ago
[deleted]
weird-eye-issue|8 months ago
topaz0|8 months ago