If these are the "gpt-4o-mini-tts" models, and if the pricing estimate of "$0.015 per minute" of audio is correct, then these prices are 85% cheaper than those of ElevenLabs.
With ElevenLabs, if I choose their most cost-effective "Business" plan for $1,100 per month (with annual billing of $13,200, a savings of 17% over monthly billing), then I get 11,000 minutes of TTS, and each minute is billed at 10 cents.
With OpenAI, I could get 11,000 minutes of TTS for $165.
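A quick sanity check of that math as a minimal Python sketch (the per-minute figures are the ones quoted above; ElevenLabs plan fees and overage rules are ignored):

    minutes = 11_000
    openai_per_min = 0.015   # estimated gpt-4o-mini-tts price per minute of audio
    eleven_per_min = 0.10    # effective Business-plan price per minute

    print(f"OpenAI:     ${minutes * openai_per_min:,.2f}")             # $165.00
    print(f"ElevenLabs: ${minutes * eleven_per_min:,.2f}")             # $1,100.00
    print(f"Savings:    {1 - openai_per_min / eleven_per_min:.0%}")    # 85%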
It's way cheaper - everyone is cheaper; ElevenLabs is very expensive. Nobody matches their quality though, especially if you want something that doesn't sound like a voice assistant/audiobook/podcast/news anchor/TV announcer.
This OpenAI offering is very interesting: it offers valuable features ElevenLabs doesn't, like emotional control. It also hallucinates, though, which would need to be fixed for it to be very useful.
Elevenlabs is an ecosystem play. They have hundreds of different voices, legally licensed from real people who chose to upload their voice. It is a marketplace of voices.
None of the other major players is trying to do that, not sure why.
ElevenLabs is the only one offering speech-to-speech generation where the intonation, prosody, and timing are kept intact. This allows one expressive voice actor to slip into many other voices.
Yes ElevenLabs is orders of magnitude more expensive than everyone else. Very clever from a business perspective, I think. They are (were?) the best so know that people will pay a premium for that.
Yes, I think you are right. When I did the math on 11labs' price per million chars I got the same numbers (Pro plan).
I'm super happy about this, since I took a bet that exactly this would happen. I've just been building a consumer TTS app that could only work with significantly cheaper TTS prices per million characters (or self-hosted models).
Hey, I'm Jeff and I was PM for these models at OpenAI. Today we launched three new state-of-the-art audio models. Two speech-to-text models—outperforming Whisper. A new TTS model—you can instruct it how to speak (try it on openai.fm!). And our Agents SDK now supports audio, making it easy to turn text agents into voice agents. We think you'll really like these models. Let me know if you have any questions here!
Hi Jeff. This is awesome. Any plans to add word timestamps to the new speech-to-text models, though?
> Other parameters, such as timestamp_granularities, require verbose_json output and are therefore only available when using whisper-1.
Word timestamps are insanely useful for large calls with interruptions (e.g. multi-party debate/Twitter spaces), allowing transcript lines to be further split post-transcription on semantic boundaries rather than crude VAD-detected silence. Without timestamps it’s near-impossible to make intelligible two paragraphs from Speaker 1 and Speaker 2 with both interrupting each other without aggressively partitioning source audio pre-transcription—which severely degrades transcript quality, increases hallucination frequency and still doesn’t get the same quality as word timestamps. :)
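For reference, this is roughly how word timestamps work today with whisper-1, per the docs snippet quoted above (a minimal sketch using the official Python SDK; the file name is hypothetical):

    from openai import OpenAI

    client = OpenAI()
    with open("twitter_space.mp3", "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",                    # word timestamps require whisper-1
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )

    # each entry has the word plus start/end times in seconds
    for w in transcript.words:
        print(f"{w.start:7.2f}-{w.end:7.2f}  {w.word}")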
Having read the docs - I used ChatGPT to summarize them - there is no mention of speaker diarization for these models.
This is _very_ low-hanging fruit that anyone with a couple of DGX H100 servers could solve in a month, and it's a real-world problem that needs solving.
Right now _no_ tools on the market - paid or otherwise - can solve this with better than 60% accuracy. One killer feature for decision makers is the ability to chat with meetings to figure out who promised what, when and why. Without speaker diarization this only reliably works for remote meetings where you assume each audio stream is a separate person.
In short: please give us a diarization model. It's not that hard - I've done one for a board of 5, with a 4090, over a weekend.
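For context, the kind of open-source diarization being described here can be sketched with pyannote.audio (a rough sketch; the pretrained pipeline name and HF token are assumptions, and accuracy on overlapping speech will vary):

    from pyannote.audio import Pipeline

    # requires accepting the model license on Hugging Face and a valid token
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="hf_...",
    )

    diarization = pipeline("board_meeting.wav")
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")

The speaker turns can then be intersected with word timestamps from a transcription pass to label who said what.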
Hi Jeff, thanks for these and congrats on the launch. Your docs mention supporting accents. I cannot get accents to work at all with the demo.
For instance erasing the entire instruction and replacing it with ‘speak with a strong Boston accent using eg sounds like hahhvahhd’ has no audible effect on the output.
As I’m sure you know 4o at launch was quite capable in this regard, and able to speak in a number of dialects and idiolects, although every month or two seems to bring more nerfs sadly.
A) Can you guys explain how to get a US regional accent out of the instructions? Or what did you mean by accent, if not that?
B) since you’re here I’d like to make a pitch that setting 4o for refusal to speak with an AAVE accent probably felt like a good idea to well intentioned white people working in safety. (We are stopping racism! AAVE isn’t funny!) However, the upshot is that my black kid can’t talk to an ai that sounds like him. Well, it can talk like he does if he’s code switching to hang out with your safety folks, but it considers how he talks with his peers as too dangerous to replicate.
This is a pernicious second order race and culture impact that I think is not where the company should be.
I expect this won’t get changed - chat is quite adamant that talking like millions of Americans do would be ‘harmful’ - but it’s one of those moments where I feel the worst parts of the culture wars coming back around to create the harm it purports to care about.
Anyway the 4o voice to voice team clearly allows the non mini model to talk like a Bostonian which makes me feel happy and represented; can the mini api version do this?
1) Previous TTS models had major problems with accents. E.g. a Spanish sentence could drift from a Spain accent to Mexican to American all within one sentence. Has this been improved and/or is it still a WIP?
2) What is the latency?
3) Your STT API/Whisper had MAJOR problems with hallucinating things the user didn't say. Is this fixed?
4) Whisper and your audio models often auto corrected speech, e.g. if someone made a grammatical error. Or if someone is speaking Spanish and inserted an English word, it would change the word to the Spanish equivalent. Does this still happen?
Hi Jeff, are there any plans to support dual-channel audio recordings (e.g., Twilio phone call audio) for speech-to-text models? Currently, we have to either process each channel separately and lose conversational context, or merge channels and lose speaker identification.
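For anyone hitting the same limitation, the usual workaround today is to split the stereo recording, transcribe each channel separately, and then interleave by timestamp (a minimal sketch with pydub; file names are made up):

    from pydub import AudioSegment
    from openai import OpenAI

    client = OpenAI()
    left, right = AudioSegment.from_file("twilio_call.wav").split_to_mono()
    left.export("caller.wav", format="wav")
    right.export("agent.wav", format="wav")

    for path, speaker in [("caller.wav", "Caller"), ("agent.wav", "Agent")]:
        with open(path, "rb") as f:
            result = client.audio.transcriptions.create(model="gpt-4o-transcribe", file=f)
        print(speaker, ":", result.text)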
Hey Jeff, this is awesome! I’m actually building a S2S application right now for a startup with the Realtime API and keen to know when these new voices/expressive prompting will be coming to it?
Also, any word on when there might be a way to move the prompting to the server side (of a full stack web app)? At the moment we have no way to protect our prompts from being inspected in the browser dev tools — even the initial instructions when the session is initiated on the server end up being spat back out to the browser client when the WebRTC connection is first made! It’s damaging to any viable business model.
Do you have plans to make it more realistic, like Kokoro-82M? I don't know if it's only me, but machine-sounding voices are irritating to listen to for longer periods of time.
How is the latency (Time To First Byte of audio, when streaming) and throughput (non-vibe characters input per second) compared to the existing 'tts-1' non-HD that's the same price? TTFB in particular is important and needs to be much better than 'tts-1'.
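For anyone wanting to measure this themselves, here's a rough sketch of timing TTFB with the streaming endpoint in the Python SDK (model/voice choices are just examples):

    import time
    from openai import OpenAI

    client = OpenAI()
    t0 = time.monotonic()
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input="Hello there, this is a latency test.",
        response_format="pcm",          # raw audio, no container overhead
    ) as response:
        for i, chunk in enumerate(response.iter_bytes(chunk_size=4096)):
            if i == 0:
                print(f"TTFB: {time.monotonic() - t0:.3f}s")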
Hi Jeff, Thanks for updating the TTS endpoint! I was literally about to have to make a workaround with the chat completions endpoint with a hit and hope the transcription matches strategy... as it was the only way to get the updated voice models.
Curious.. is gpt-4o-mini-tts the equivalent of what is/was gpt-4o-mini-audio-preview for chat completions? Because in timing tests it takes around 2 seconds to return a short phrase, which seems more equivalent to gpt-4o-audio-preview.. the latter was much better for the hit and hope strat as it didn't ad lib!
Also I notice you can add accents to instructions and it does a reasonable job. But are there any plans to bring out localized voice models?
Woohoo new voices! I’ve been using a mix of TTS models on a project I’ve been working on, and I consistently prefer the output of OpenAI to ElevenLabs (at least when things are working properly).
Which leads me to my main gripe with the OpenAI models — I find they break — produce empty / incorrect / noise outputs — on a few key use cases for my application (things like single-word inputs — especially compound words and capitalized words, words in parentheses, etc.)
So I guess my question is might gpt-4o-mini-tts provide more “reliable” output than tts-1-hd?
Do you know when we can expect an update on the realtime API? It’s still in beta and there are many issues (e.g voice randomly cutting off, VAD issues, especially with mulaw etc…) which makes it impossible to use in production, but there’s not much communication from OpenAI. It’s difficult to know what to bet on. Pushing for stt->llm->tts makes you wonder if we should carry on building with the realtime API.
Hi Jeff, I have an app that already supports the Whisper API, so I added the GPT4o models as options. I noticed that the GPT4o models don't support prompting, and as a result my app had a higher error rate in practice when using GPT4o compared to Whisper. Is prompting on the roadmap?
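For context, this is the whisper-1 prompt parameter being referred to, useful for biasing the model toward names and jargon it would otherwise mangle (a minimal sketch; the phrase list is a made-up example):

    from openai import OpenAI

    client = OpenAI()
    with open("standup.mp3", "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            prompt="Kubernetes, gRPC, Terraform, OKRs",  # bias toward domain terms
        )
    print(transcript.text)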
Hey Jeff, maybe you could improve the TTS that is currently in the OpenAI web and phone apps. When I set it to read numbers in Romanian it slurs digits. This also happens sometimes with regular words as well. I hope you find resources for other languages than English.
How about more sample code for the streaming transcription api? I gave o1pro the docs for both the real-time endpoint and the stt API but we couldn't get it working (from Java, but any language would help).
Please release a stable realtime speech to speech model. The current version constantly thinks it’s a young teen heading to college and sad but then suddenly so excited about it
Hey Jeff, thanks for your work! Quick question for you, are you guys using Azure Speech Services or have these TTS models been trained by OpenAI from scratch?
After toying around with the TTS model it seems incredibly nondeterministic. Running the same input with the same parameters can have widely different results, some really good, others downright bad. The tone, intonation and character all vary widely. While some of the outputs are great, this inconsistency makes it a really tough sell. Imagine if Siri responded to you with a different voice every time, as an example. Is this something you're looking to address somewhere down the line or do you consider that working as intended?
Whisper's major problem was hallucinations, how are the new models doing there? The performance of ChatGPT advanced voice in recognizing speech is, frankly, terrible. Are these models better than what's used there?
How did you make whisper better? I used whisper large to transcribe 30 podcast episodes and it did an amazing job. The times it made mistakes were understandable like confusing “Macs” and “Max”, slurred speech or people just saying things in a weird way. I was able to correct these mistakes because I understood the context of what was being talked about.
Another thing I noticed is whisper did a better job of transcribing when I removed a lot of the silences in the audio.
Both the text-to-speech and the speech-to-text models launched here suffer from reliability issues due to combining instructions and data in the same stream of tokens.
Thanks for the write up. I've been writing assembly lately, so as soon as I read your comment, I thought "hmm reminds me of section .text and section .data".
Large text-to-speech and speech-to-text models have been greatly improving recently.
But I wish there were an offline, on-device, multilingual text-to-speech solution with good voices for a standard PC — one that doesn't require a GPU, tons of RAM, or max out the CPU.
In my research, I didn't find anything that fits the bill. People often mention Tortoise TTS, but I think it garbles words too often. The only plug-in solution for desktop apps I know of is the commercial and rather pricey Acapela SDK.
I hope someone can shrink those new neural network–based models to run efficiently on a typical computer. Ideally, it should run at under 50% CPU load on an average Windows laptop that’s several years old, and start speaking almost immediately (less than 400ms delay).
The same goes for speech-to-text. Whisper.cpp is fine, but last time I looked, it wasn't able to transcribe audio at real-time speed on a standard laptop.
I'd pay for something like this as long as it's less expensive than Acapela.
I use Piper for one of my apps. It runs on CPU and doesn't require a GPU. It will run well on a raspberry pi. I found a couple of permissively licensed voices that could handle technical terms without garbling them.
However, it is unmaintained and the Apple Silicon build is broken.
My app also uses whisper.cpp. It runs in real time on Apple Silicon or on modern fast CPUs like AMD's gaming CPUs.
Is there a way to get "speech marks" alongside the generated audio?
FYI, speech marks provide a millisecond timestamp for each word in a generated audio file/stream (and a start/end index into your original source string), delivered as a stream of JSONL objects (one object per word, with "time", "start", "end", and "value" fields).
AWS uses these speech marks (with variants for "sentence", "word", "viseme", or "ssml") in their Polly TTS service...
The sentence or word marks are useful for highlighting text as the TTS reads aloud, while the "viseme" marks are useful for doing lip-sync on a facial model.
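For the curious, this is roughly how you request them from Polly with boto3 (marks come back as newline-delimited JSON instead of audio, so you make a second call for the audio itself):

    import boto3

    polly = boto3.client("polly")
    resp = polly.synthesize_speech(
        Text="Hello it's nice to see you today",
        VoiceId="Joanna",
        OutputFormat="json",            # JSON = speech marks, not audio
        SpeechMarkTypes=["word"],
    )
    for line in resp["AudioStream"].read().decode().splitlines():
        print(line)   # e.g. {"time":6,"type":"word","start":0,"end":5,"value":"Hello"}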
This is astonishing. I can type anything I want into the "vibe" box and it does it for the given text. Accents, attitudes, personality types... I'm amazed.
The level of intelligent "prosody" here -- the rhythm and intonation, the pauses and personality -- I wasn't expecting anything like this so soon. This is truly remarkable. It understands both the text and the prompt for how the speaker should sound.
Like, we're getting much closer to the point where nobody except celebrities are going to record audiobooks. Everyone's just going to pick whatever voice they're in the mood for.
Some fun ones I just came up with:
> Imposing villain with an upper class British accent, speaking threateningly and with menace.
> Helpful customer support assistant with a Southern drawl who's very enthusiastic.
> Woman with a Boston accent who talks incredibly slowly and sounds like she's about to fall asleep at any minute.
> Everyone's just going to pick whatever voice they're in the mood for.
I can't say I've ever had this impulse. Also, to point out the obvious, there's little reason to pay for an audiobook if there's no human reading it. Especially if you already bought the physical text.
Didn’t look closely, but is there a way to clone a voice from a few seconds of recording and then feed the sample to generate the text in the same voice?
I am always listening to audio books but they are no good anymore after playing with this for 2 minutes.
I am never really in the mood for a different voice. I am going to dial in the voice I want and only going to want to listen with that voice.
This is so awesome. So many audio books have been ruined by the voice actor for me. What sticks out in my head is The Book of Why by Judea Pearl read by Mel Foster. Brutal.
So many books I want as audio books too that no one would bother to record.
One very important quote from the official announcement:
> For the first time, developers can “instruct” the model not just on what to say but how to say it—enabling more customized experiences for use cases ranging from customer service to creative storytelling.
The instructions are the "vibes" in this UI. But the announcement is wrong about the "for the first time" part: it was possible to steer the base GPT-4o model to create voices in a certain style using system prompt engineering (blogged about here: https://minimaxir.com/2024/10/speech-prompt-engineering/ ), out of concern that it could be used as a replacement for voice acting; however, it was too expensive and adherence wasn't great.
The schema of the vibes here implies that this new model is more receptive to nuance, which changes the calculus. The test cases from my post behave as expected, and the cost of gpt-4o-mini-tts audio output is $0.015 / minute (https://platform.openai.com/docs/pricing ), which is about 1/20th of the cost of my initial experiments and now cheap enough to potentially replace common voice applications. This has implications, and I'll be testing more nuanced prompt engineering.
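For anyone wanting to reproduce this outside the openai.fm demo, the "vibe" maps to the instructions parameter on the speech endpoint (a minimal sketch; the voice and instruction text are just examples):

    from openai import OpenAI

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="coral",
        input="And just like that, the market turned.",
        instructions="Old-timey radio announcer, fast-paced, slightly tinny.",
    ) as response:
        response.stream_to_file("announcer.mp3")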
I gave it (part of) the classic Navy Seal copypasta.
Interestingly, the safety controls ("I cannot assist with that request") are sort of dependent on the vibe instruction. NYC cabbie has no problem with it (and it's really, really funny, great job OpenAI), but anything peaceful, positive, etc. will deny the request.
I tried some wacky strings like "𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯NNNNNNNNNNNNNN𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯"
It's hilarious: they either start to make harsh noise or say nonsense trying to sing something.
Interesting, I inserted a bunch of "fuck"s in the text and the "NYC Cabbie" voice read it all just fine. When I switched to other voices ("Connoisseur", "Cheerleader", "Santa"), it responded "I'm sorry I can't assist with that request".
I switched back to "NYC Cabbie" and it again read it just fine. I then reloaded the session completely, refreshed the voice selections until "NYC Cabbie" came up again, and it still read the text without hesitation.
The text:
> In my younger and more vulnerable years my father fuck gave me some fuck fuck advice that I've been fuck fuck FUCK OH FUCK turning over in my mind ever since.
> "Whenever you feel like criticizing any one," he told me, oh fuck! FUCK! "just remember that all the people in this world haven't had fuck fuck fuck FUCKERKER the advantages that you've had."
edit: "Emo Teenager", "Mad Scientist", and "Smooth Jazz" are able to read the text. However, "Medieval Knight" and "Robot" cannot.
Glad I'm not the only one whose inner 12 year old curiosity is immediately triggered by free input TTS. Swear words and just raking my hands across the keyboard to insert gibberish in every possible accent.
I just tested the "gpt-4o-mini-tts" model on several texts in Japanese, a particularly challenging language for TTS because many character combinations are read differently depending on the context. The produced speech was quite good, with natural intonation and pronunciation. There were, however, occasional glitches, such as the word 現在 genzai “now, present” read with a pause between the syllables (gen ... zai) and the conjunction 而も read nadamo instead of the correct shikamo. There were also several places where the model skipped a word or two.
However, unlike some other TTS models offering Japanese support that have been discussed here recently [1], I think this new offering from OpenAI is good enough for language learners. I certainly could have put it to good use when I was studying Japanese many years ago. But it’s not quite ready for public-facing applications such as commercial audiobooks.
That said, I really like the ability to instruct the model on how to read the text. In that regard, my tests in both English and Japanese went well.
Cool format for a demo. Some of the voices have a slight "metallic" ring to them, something I've seen a fair amount with Eleven Labs' models.
Does anyone have any experience with the realtime latency of these Openai TTS models? ElevenLabs has been so slow (much slower than the latency they advertise), which makes it almost impossible to use in realtime scenarios unless you can cache and replay the outputs. Cartesia looks to have cracked the time to first token, but i've found their voices to be a bit less consistent than Eleven Labs'.
Personally I just want to text or talk to Siri or an LLM and have it do whatever I need. Have it interface with the AI agents of companies, businesses, friends, or family to get whatever I need done, like the example on the OpenAI.fm site here (rebook my flight). Once it's done, it shows me the confirmation on my lock screen and I receive an email confirmation.
Is this right? The current best TTS from OpenAI uses gpt-4o-audio-preview which is $2.50 input text, $80 output audio, the new gpt-4o-mini-tts is $0.60 input text, $12 output audio. An average 5x price reduction.
Going the other way, transcribe with gpt-4o-audio-preview price was $40 input audio, $10 output text, the new gpt-4o-transcribe is $6 input audio and $10 output text. Like a 7x reduction on the input price.
TTS/Transcribe with gpt-4o-audio-preview was a hack where you had to prompt with 'listen/speak this sentence:' and it often got it wrong. These new dedicated models are exactly what we needed.
I'm currently using the Google TTS API, which is really good, fast, and cheap. They charge $16 per million characters, which works out to roughly the same as OpenAI's $0.015 per minute estimate.
Unfortunately it's not really worth switching over if the costs are essentially the same. Transcription, on the other hand, is 1.6¢/minute with Google and 0.6¢/minute with OpenAI now; that might be worth switching over for.
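The per-character vs. per-minute comparison depends on speaking rate; a rough back-of-the-envelope (the ~900 characters per spoken minute figure is an assumption, roughly 150 wpm):

    chars_per_minute = 900                                   # assumed speaking rate
    google_per_min = 16 / 1_000_000 * chars_per_minute       # ≈ $0.0144 per minute
    openai_per_min = 0.015
    print(google_per_min, openai_per_min)                    # close enough to call a wash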
The previous offering from OpenAI was $15 for TTS and $30 for TTS HD, so not a 5x reduction. This one is slightly cheaper but definitely more capable (if you need to control the vibe).
Impressive in terms of quality, not so much in terms of style. I tried feeding it two prompts with the same script - one to be straightforward and didactic, then one asking it to deliver calculus like a morning shock-jock DJ. They sounded quite similar, and it definitely did not capture the vibe of 97.3 FM's Altman & the Claude with the latter prompt.
But then, I got much better results from the cowboy prompt by changing "partner" to "pardner" in the text prompt (even on neighboring words). So maybe it's an issue with the script and not the generation? Giving it "pardner" and an explicit instruction to use a Russian accent still gives me a Texas drawl, so it seems like the script overrides the tone instructions.
It doesn't seem clear, but can the model do correct emphasis? On things like single words:
I did not steal that horse
is the trivial example of something where intonation of a single word is what matters. More importantly, if you are reading something as a human, you change the intonation, audio level, and speed.
Interestingly "replaces every second word with potato" and "speaks in Spanish instead of English" both (kind of) work as a style, so it's clear there's significant flexibility and probably some form of LLM-like thing under the hood.
The voices are pretty convincing. It's funny to hear how drastically the tone of the reading can change when repeatedly stopping and restarting the samples without changing any of the settings.
Vibe: Heavy german accent, doing an Arnold Schwarzenegger impression, way over the top for comedic effect. Deep booming voice, uses pauses for dramatic effect.
I was experimenting recently with voiceover TTS generation. I ran Kokoro TTS locally and it's magical for how few resources it takes (it runs fine in a browser), but only the default female voices (Heart/Bella) are usable, though they are very good. Then I found that Clipchamp has TTS built in, and several voices from the big selection there are very good, and free. I've listened to this OpenAI TTS and I could not like the voices at all, even compared to Kokoro.
It's interesting that they pitch this for agent development. The realtime API provides a much simpler architecture for developing agents. Why would you want to string together STT -> LLM -> TTS when you could have a consolidated model doing all three steps? They alluded to there being some quality/intelligence benefits to the multi-step approach, but in the long-run I'd expect them to improve the realtime API to make this unnecessary.
Text allows developers lots of flexibility to do other processing, including RAG, calling APIs yourself, and multiple chained LLM invocations. The low latency of the Realtime API means relying fully on one invocation of their model to do everything.
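A minimal sketch of that chained STT -> LLM -> TTS flow with the new models (file names are hypothetical; in practice you'd insert RAG, tool calls, or extra LLM passes at step 2):

    from openai import OpenAI

    client = OpenAI()

    # 1) speech -> text
    with open("question.wav", "rb") as f:
        heard = client.audio.transcriptions.create(model="gpt-4o-transcribe", file=f)

    # 2) text -> text (RAG / tools / chained calls go here)
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": heard.text}],
    )

    # 3) text -> speech
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=reply.choices[0].message.content,
    ) as speech:
        speech.stream_to_file("answer.mp3")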
Hmm, I was hoping these would bridge the gap between what's already been available on their audio API or in the Realtime API vs. Advanced Voice Mode, but the audio quality is really the same as it's been up to this point.
Does anyone have any clue about exactly why they're not making the quality of Advanced Voice Mode available to build with? It would be game changing for us if they did.
These models show some good improvements in allowing users to control many aspects of the delivery, but it falls deep in the uncanny valley and still has a ways to go in not sounding weird or slightly off-putting. I much prefer the current advanced voice models over these.
Quite disappointing that their speech-to-text models are not open source. Whisper was really good, and it was great that it was open to play around with. I guess this continues OpenAI's approach of not really being open!
Hi! Can you add prefix support? This would be very valuable in being able to support overlapping windows. The only other way would be to use another ai to determine the overlap
Oh doh, thanks... we just pushed a fix for the crash. Unfortunately our current implementation needs service workers for streaming audio, so the "fix" was to disable the feature if the worker isn't available.
Just use a synthesizer. Writing textual prompts is about the most inefficient way of getting what you want. When I was working in film I'd tell directors to stop describing what they had in mind (unless they were referencing something very specific) and try just making some funny mouth noises.
I'm surprised at how poor this is at following a detailed prompt.
It seems capable of generating a consistent style, and so in that sense quite useful. But if you want (say) a regional UK accent it's not even close.
I also find it confusing you have to choose a voice. Surely that's what the prompt should be for, especially when the voices have such abstract names.
I mean, it's still very impressive when you stand back a bit, but feels a bit half baked
Example:
Voice: Thick and hearty, with a slow, rolling cadence—like a lifelong Somerset farmer leaning over a gate, chatting about the land with a mug of cider in hand. It’s warm, weathered, and rich, carrying the easy confidence of someone who’s seen a thousand harvests and knows every hedgerow and rolling hill in the county.
Tone: Friendly, laid-back, and full of rustic charm. It’s got that unhurried quality of a man who’s got time for a proper chinwag, with a twinkle in his eye and a belly laugh never far away. Every sentence should feel like it’s been seasoned with fresh air, long days in the fields, and a lifetime of countryside wisdom.
Dialect: Classic West Country, with broad vowels, softened consonants, and that unmistakable rural lilt. Words flow together in an easy drawl, with plenty of dropped "h"s and "g"s. "I be" replaces "I am," and "us" gets used instead of "we" or "me." Expect plenty of "ooh-arrs," "proper job," and "gurt big" sprinkled in naturally.
I find it works better with shorter simpler instructions. I would try:
Voice: Warm and slow, like a friendly Somerset farmer. Tone: Laid-back and rustic. Dialect: Classic West Country with a relaxed drawl and colloquial phrases.
In Russian, OpenAI audio models usually have a slight American (?) accent. The intonation and the phonetics fall into the uncanny valley. Does the same happen in other languages?
The Japanese sounds OK to me. Not 100% but better than most human speakers. I understand Japanese well enough to be able to pick up a few different foreign accents in that language.
is it just me or are these voices clearly AI generated? They've obviously been improving at a steady rate but if I saw a YouTube video that had this voice, I'd instantly stop watching it
benjismith|11 months ago
https://platform.openai.com/docs/pricing
If these are the "gpt-4o-mini-tts" models, and if the pricing estimate of "$0.015 per minute" of audio is correct, then these prices are 85% cheaper than those of ElevenLabs.
https://elevenlabs.io/pricing
With ElevenLabs, if I choose their most cost-effective "Business" plan for $1,100 per month (with annual billing of $13,200, a savings of 17% over monthly billing), then I get 11,000 minutes of TTS, and each minute is billed at 10 cents.
With OpenAI, I could get 11,000 minutes of TTS for $165.
Somebody check my math... Is this right?
furyofantares|11 months ago
This OpenAI offering is very interesting: it offers valuable features ElevenLabs doesn't, like emotional control. It also hallucinates, though, which would need to be fixed for it to be very useful.
com2kid|11 months ago
None of the other major players is trying to do that, not sure why.
fixprix|11 months ago
oidar|11 months ago
echelon|11 months ago
No matter what happens, they'll eventually be undercut and matched in terms of quality. It'll be a race to the bottom for them too.
ElevenLabs is going to have a tough time. They've been way too expensive.
huijzer|11 months ago
lukebuehler|11 months ago
I'm super happy about this, since I took a bet that exactly this would happen. I've just been building a consumer TTS app that could only work with significantly cheaper TTS prices per million characters (or self-hosted models).
forgotpasagain|11 months ago
whimsicalism|11 months ago
youssefabdelm|11 months ago
kuprel|11 months ago
jeffharris|11 months ago
claiir|11 months ago
> Other parameters, such as timestamp_granularities, require verbose_json output and are therefore only available when using whisper-1.
Word timestamps are insanely useful for large calls with interruptions (e.g. multi-party debate/Twitter spaces), allowing transcript lines to be further split post-transcription on semantic boundaries rather than crude VAD-detected silence. Without timestamps it’s near-impossible to make intelligible two paragraphs from Speaker 1 and Speaker 2 with both interrupting each other without aggressively partitioning source audio pre-transcription—which severely degrades transcript quality, increases hallucination frequency and still doesn’t get the same quality as word timestamps. :)
noosphr|11 months ago
This is _very_ low-hanging fruit that anyone with a couple of DGX H100 servers could solve in a month, and it's a real-world problem that needs solving.
Right now _no_ tools on the market - paid or otherwise - can solve this with better than 60% accuracy. One killer feature for decision makers is the ability to chat with meetings to figure out who promised what, when and why. Without speaker diarization this only reliably works for remote meetings where you assume each audio stream is a separate person.
In short: please give us a diarization model. It's not that hard - I've done one for a board of 5, with a 4090, over a weekend.
vessenes|11 months ago
For instance erasing the entire instruction and replacing it with ‘speak with a strong Boston accent using eg sounds like hahhvahhd’ has no audible effect on the output.
As I’m sure you know 4o at launch was quite capable in this regard, and able to speak in a number of dialects and idiolects, although every month or two seems to bring more nerfs sadly.
A) Can you guys explain how to get a US regional accent out of the instructions? Or what did you mean by accent, if not that?
B) since you’re here I’d like to make a pitch that setting 4o for refusal to speak with an AAVE accent probably felt like a good idea to well intentioned white people working in safety. (We are stopping racism! AAVE isn’t funny!) However, the upshot is that my black kid can’t talk to an ai that sounds like him. Well, it can talk like he does if he’s code switching to hang out with your safety folks, but it considers how he talks with his peers as too dangerous to replicate.
This is a pernicious second order race and culture impact that I think is not where the company should be.
I expect this won’t get changed - chat is quite adamant that talking like millions of Americans do would be ‘harmful’ - but it’s one of those moments where I feel the worst parts of the culture wars coming back around to create the harm it purports to care about.
Anyway the 4o voice to voice team clearly allows the non mini model to talk like a Bostonian which makes me feel happy and represented; can the mini api version do this?
simonw|11 months ago
dandiep|11 months ago
2) What is the latency?
3) Your STT API/Whisper had MAJOR problems with hallucinating things the user didn't say. Is this fixed?
4) Whisper and your audio models often auto corrected speech, e.g. if someone made a grammatical error. Or if someone is speaking Spanish and inserted an English word, it would change the word to the Spanish equivalent. Does this still happen?
kiney|11 months ago
a-r-t|11 months ago
urbandw311er|11 months ago
Also, any word on when there might be a way to move the prompting to the server side (of a full stack web app)? At the moment we have no way to protect our prompts from being inspected in the browser dev tools — even the initial instructions when the session is initiated on the server end up being spat back out to the browser client when the WebRTC connection is first made! It’s damaging to any viable business model.
Some sort of tri-party WebRTC session maybe?
kouteiheika|11 months ago
new_user_final|11 months ago
https://huggingface.co/hexgrad/Kokoro-82M
zhyder|11 months ago
nico|11 months ago
What’s the minimum hardware for running them?
Would they run on a raspberry pi?
Or a smartphone?
staticautomatic|11 months ago
oidar|11 months ago
robbomacrae|11 months ago
Curious.. is gpt-4o-mini-tts the equivalent of what is/was gpt-4o-mini-audio-preview for chat completions? Because in timing tests it takes around 2 seconds to return a short phrase, which seems more equivalent to gpt-4o-audio-preview.. the latter was much better for the hit and hope strat as it didn't ad lib!
Also I notice you can add accents to instructions and it does a reasonable job. But are there any plans to bring out localized voice models?
twalkz|11 months ago
Which leads me to my main gripe with the OpenAI models — I find they break — produce empty / incorrect / noise outputs — on a few key use cases for my application (things like single-word inputs — especially compound words and capitalized words, words in parentheses, etc.)
So I guess my question is might gpt-4o-mini-tts provide more “reliable” output than tts-1-hd?
TheAceOfHearts|11 months ago
ekzy|11 months ago
dharmab|11 months ago
progbits|11 months ago
On what metric? Also, Whisper is no longer state of the art in accuracy; how does it compare to the others in this benchmark?
https://artificialanalysis.ai/speech-to-text
visarga|11 months ago
jbellis|11 months ago
taf2|11 months ago
MasterScrat|11 months ago
nabakin|11 months ago
Etheryte|11 months ago
mazd|11 months ago
mclau156|11 months ago
stavros|11 months ago
wewewedxfgdf|11 months ago
modeless|11 months ago
archerx|11 months ago
Another thing I noticed is whisper did a better job of transcribing when I removed a lot of the silences in the audio.
risho|11 months ago
pier25|11 months ago
simonw|11 months ago
I'm not yet sure how much of a problem this is for real-world applications. I wrote a few notes on this here: https://simonwillison.net/2025/Mar/20/new-openai-audio-model...
accrual|11 months ago
kibbi|11 months ago
But I wish there were an offline, on-device, multilingual text-to-speech solution with good voices for a standard PC — one that doesn't require a GPU, tons of RAM, or max out the CPU.
In my research, I didn't find anything that fits the bill. People often mention Tortoise TTS, but I think it garbles words too often. The only plug-in solution for desktop apps I know of is the commercial and rather pricey Acapela SDK.
I hope someone can shrink those new neural network–based models to run efficiently on a typical computer. Ideally, it should run at under 50% CPU load on an average Windows laptop that’s several years old, and start speaking almost immediately (less than 400ms delay).
The same goes for speech-to-text. Whisper.cpp is fine, but last time I looked, it wasn't able to transcribe audio at real-time speed on a standard laptop.
I'd pay for something like this as long as it's less expensive than Acapela.
(My use case is an AAC app.)
5kg|11 months ago
https://huggingface.co/canopylabs/orpheus-3b-0.1-ft
(no affiliation)
it's English only afaics.
wingworks|11 months ago
ZeroTalent|11 months ago
dharmab|11 months ago
However, it is unmaintained and the Apple Silicon build is broken.
My app also uses whisper.cpp. It runs in real time on Apple Silicon or on modern fast CPUs like AMD's gaming CPUs.
Ey7NFZ3P0nzAe|11 months ago
benjismith|11 months ago
FYI, speech marks provide a millisecond timestamp for each word in a generated audio file/stream (and a start/end index into your original source string), as a stream of JSONL objects, like this:
{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}
{"time":732,"type":"word","start":7,"end":11,"value":"it's"}
{"time":932,"type":"word","start":12,"end":16,"value":"nice"}
{"time":1193,"type":"word","start":17,"end":19,"value":"to"}
{"time":1280,"type":"word","start":20,"end":23,"value":"see"}
{"time":1473,"type":"word","start":24,"end":27,"value":"you"}
{"time":1577,"type":"word","start":28,"end":33,"value":"today"}
AWS uses these speech marks (with variants for "sentence", "word", "viseme", or "ssml") in their Polly TTS service...
The sentence or word marks are useful for highlighting text as the TTS reads aloud, while the "viseme" marks are useful for doing lip-sync on a facial model.
https://docs.aws.amazon.com/polly/latest/dg/output.html
minimaxir|11 months ago
celestialcheese|11 months ago
Looks like the new models don't have this feature yet.
crazygringo|11 months ago
The level of intelligent "prosody" here -- the rhythm and intonation, the pauses and personality -- I wasn't expecting anything like this so soon. This is truly remarkable. It understands both the text and the prompt for how the speaker should sound.
Like, we're getting much closer to the point where nobody except celebrities are going to record audiobooks. Everyone's just going to pick whatever voice they're in the mood for.
Some fun ones I just came up with:
> Imposing villain with an upper class British accent, speaking threateningly and with menace.
> Helpful customer support assistant with a Southern drawl who's very enthusiastic.
> Woman with a Boston accent who talks incredibly slowly and sounds like she's about to fall asleep at any minute.
solardev|11 months ago
If we as developers are scared of AI taking our jobs, the voice actors have it much worse...
clbrmbr|11 months ago
> Speak with an exaggerated German accent, pronouncing all “w” as “v”
ForTheKidz|11 months ago
I can't say I've ever had this impulse. Also, to point out the obvious, there's little reason to pay for an audiobook if there's no human reading it. Especially if you already bought the physical text.
l72|11 months ago
Vibe:
Voice Affect: A Primal Scream from the top of your lungs!
Tone: LOUD. A RAW SCREAM
Emotion: Intense primal rage.
Pronunciation: Draw out the last word until you are out of breath.
Script:
EVERY THING WAS SAD!
d4rkp4ttern|11 months ago
anigbrowl|11 months ago
borgdefenser|11 months ago
I am never really in the mood for a different voice. I am going to dial in the voice I want and only going to want to listen with that voice.
This is so awesome. So many audio books have been ruined by the voice actor for me. What sticks out in my head is The Book of Why by Judea Pearl read by Mel Foster. Brutal.
So many books I want as audio books too that no one would bother to record.
minimaxir|11 months ago
> For the first time, developers can “instruct” the model not just on what to say but how to say it—enabling more customized experiences for use cases ranging from customer service to creative storytelling.
The instructions are the "vibes" in this UI. But the announcement is wrong about the "for the first time" part: it was possible to steer the base GPT-4o model to create voices in a certain style using system prompt engineering (blogged about here: https://minimaxir.com/2024/10/speech-prompt-engineering/ ), out of concern that it could be used as a replacement for voice acting; however, it was too expensive and adherence wasn't great.
The schema of the vibes here implies that this new model is more receptive to nuance, which changes the calculus. The test cases from my post behave as expected, and the cost of gpt-4o-mini-tts audio output is $0.015 / minute (https://platform.openai.com/docs/pricing ), which is about 1/20th of the cost of my initial experiments and now cheap enough to potentially replace common voice applications. This has implications, and I'll be testing more nuanced prompt engineering.
mlsu|11 months ago
Interestingly, the safety controls ("I cannot assist with that request") are sort of dependent on the vibe instruction. NYC cabbie has no problem with it (and it's really, really funny, great job OpenAI), but anything peaceful, positive, etc. will deny the request.
https://www.openai.fm/#56f804ab-9183-4802-9624-adc706c7b9f8
jtbayly|11 months ago
pier25|11 months ago
I'm guessing their spectral generator is super low res to save on resources
minimaxir|11 months ago
gherard5555|11 months ago
It's hilarious: they either start to make harsh noise or say nonsense trying to sing something.
gherard5555|11 months ago
"*scream* AAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHH !!!!!!!!!"
corobo|11 months ago
Anyone out there doing any nice robotic robot voices?
Best I've got so far is a blend of Ralph and Zarvox from MacOS' `say`, haha
ranguna|11 months ago
ComputerGuru|11 months ago
amitport|11 months ago
https://www.youtube.com/watch?v=me4BZBsHwZs
swyx|11 months ago
danso|11 months ago
I switched back to "NYC Cabbie" and it again read it just fine. I then reloaded the session completely, refreshed the voice selections until "NYC Cabbie" came up again, and it still read the text without hesitation.
The text:
> In my younger and more vulnerable years my father fuck gave me some fuck fuck advice that I've been fuck fuck FUCK OH FUCK turning over in my mind ever since.
> "Whenever you feel like criticizing any one," he told me, oh fuck! FUCK! "just remember that all the people in this world haven't had fuck fuck fuck FUCKERKER the advantages that you've had."
edit: "Emo Teenager", "Mad Scientist", and "Smooth Jazz" are able to read the text. However, "Medieval Knight" and "Robot" cannot.
nazgulsenpai|11 months ago
TeMPOraL|11 months ago
tkgally|11 months ago
However, unlike some other TTS models offering Japanese support that have been discussed here recently [1], I think this new offering from OpenAI is good enough for language learners. I certainly could have put it to good use when I was studying Japanese many years ago. But it’s not quite ready for public-facing applications such as commercial audiobooks.
That said, I really like the ability to instruct the model on how to read the text. In that regard, my tests in both English and Japanese went well.
[1] https://news.ycombinator.com/item?id=42968893
tkgally|11 months ago
forgotpasagain|11 months ago
jeffharris|11 months ago
lukeinator42|11 months ago
Havoc|11 months ago
>Please open openai.fm directly in a modern browser
Doesn't seem to like firefox
dredmorbius|11 months ago
islewis|11 months ago
Does anyone have any experience with the realtime latency of these Openai TTS models? ElevenLabs has been so slow (much slower than the latency they advertise), which makes it almost impossible to use in realtime scenarios unless you can cache and replay the outputs. Cartesia looks to have cracked the time to first token, but i've found their voices to be a bit less consistent than Eleven Labs'.
kartikarti|11 months ago
MasterScrat|11 months ago
- Original: https://www.youtube.com/watch?v=FYcMU3_xT-w&t=5s
- AI: https://www.openai.fm/#8e9915b0-771d-4123-8474-78cc39978d33
Arubis|11 months ago
notlisted|11 months ago
paul7986|11 months ago
fixprix|11 months ago
Going the other way, transcribe with gpt-4o-audio-preview price was $40 input audio, $10 output text, the new gpt-4o-transcribe is $6 input audio and $10 output text. Like a 7x reduction on the input price.
TTS/Transcribe with gpt-4o-audio-preview was a hack where you had to prompt with 'listen/speak this sentence:' and it often got it wrong. These new dedicated models are exactly what we needed.
I'm currently using the Google TTS API, which is really good, fast, and cheap. They charge $16 per million characters, which works out to roughly the same as OpenAI's $0.015 per minute estimate.
Unfortunately it's not really worth switching over if the costs are essentially the same. Transcription, on the other hand, is 1.6¢/minute with Google and 0.6¢/minute with OpenAI now; that might be worth switching over for.
pzo|11 months ago
The previous offering from OpenAI was $15 for TTS and $30 for TTS HD, so not a 5x reduction. This one is slightly cheaper but definitely more capable (if you need to control the vibe).
rachofsunshine|11 months ago
But then, I got much better results from the cowboy prompt by changing "partner" to "pardner" in the text prompt (even on neighboring words). So maybe it's an issue with the script and not the generation? Giving it "pardner" and an explicit instruction to use a Russian accent still gives me a Texas drawl, so it seems like the script overrides the tone instructions.
tomjen3|11 months ago
I did not steal that horse
is the trivial example of something where intonation of a single word is what matters. More importantly, if you are reading something as a human, you change the intonation, audio level, and speed.
Sohcahtoa82|11 months ago
> Is the trivial example of something where intonation of the single word is what matters.
My go-to for an example of this is "I didn't say she stole my money".
Changing which word is emphasized completely changes the meaning of the sentence.
khurdula|11 months ago
pklimk|11 months ago
danso|11 months ago
Etheryte|11 months ago
Voice: Onyx
Vibe: Heavy german accent, doing an Arnold Schwarzenegger impression, way over the top for comedic effect. Deep booming voice, uses pauses for dramatic effect.
dougiejones|11 months ago
Delivery: Cow noises. You are actually a cow. You can only moo and grunt. No human noises. Only moo. No words.
Pauses: Moo and grunt between sentences. Some burps and farts.
Tone: Cow.
mft_|11 months ago
rybthrow2|11 months ago
"Get to the chopper now and PUT THAT COOKIE DOWN NOWWWW"
jeffharris|11 months ago
o_____________o|11 months ago
buybackoff|11 months ago
alach11|11 months ago
zhyder|11 months ago
skc|11 months ago
evalstate|11 months ago
The next version of Model Context Protocol will have native audio support (https://github.com/modelcontextprotocol/specification/pull/9...), which will open up plenty of opportunities for interop.
looknee|11 months ago
Does anyone have any clue about exactly why they're not making the quality of Advanced Voice Mode available to build with? It would be game changing for us if they did.
varunneal|11 months ago
fumeux_fume|11 months ago
garfieldnate|11 months ago
stephenheron|11 months ago
nickthegreek|11 months ago
tosh|11 months ago
joiemoie|11 months ago
rsp1984|11 months ago
Check out the toggle switch in the upper right corner! I hope more designers will follow this example.
jcmp|11 months ago
havefunbesafe|11 months ago
randomcatuser|11 months ago
vyrotek|11 months ago
IndignantTyrant|11 months ago
saint_yossarian|11 months ago
jeffharris|11 months ago
urbandw311er|11 months ago
justanotheratom|11 months ago
ForTheKidz|11 months ago
jncfhnb|11 months ago
theoryofx|11 months ago
atlasunshrugged|11 months ago
prdonahue|11 months ago
smokeydoe|11 months ago
anigbrowl|11 months ago
RobinL|11 months ago
It seems capable of generating a consistent style, and so in that sense quite useful. But if you want (say) a regional UK accent it's not even close.
I also find it confusing you have to choose a voice. Surely that's what the prompt should be for, especially when the voices have such abstract names.
I mean, it's still very impressive when you stand back a bit, but feels a bit half baked
Example: Voice: Thick and hearty, with a slow, rolling cadence—like a lifelong Somerset farmer leaning over a gate, chatting about the land with a mug of cider in hand. It’s warm, weathered, and rich, carrying the easy confidence of someone who’s seen a thousand harvests and knows every hedgerow and rolling hill in the county.
Tone: Friendly, laid-back, and full of rustic charm. It’s got that unhurried quality of a man who’s got time for a proper chinwag, with a twinkle in his eye and a belly laugh never far away. Every sentence should feel like it’s been seasoned with fresh air, long days in the fields, and a lifetime of countryside wisdom.
Dialect: Classic West Country, with broad vowels, softened consonants, and that unmistakable rural lilt. Words flow together in an easy drawl, with plenty of dropped "h"s and "g"s. "I be" replaces "I am," and "us" gets used instead of "we" or "me." Expect plenty of "ooh-arrs," "proper job," and "gurt big" sprinkled in naturally.
anigbrowl|11 months ago
robbomacrae|11 months ago
Voice: Warm and slow, like a friendly Somerset farmer. Tone: Laid-back and rustic. Dialect: Classic West Country with a relaxed drawl and colloquial phrases.
tantalor|11 months ago
carbocation|11 months ago
jeffharris|11 months ago
we put little stars in the bottom right corner for the newer voices, which should sound better
basitmakine|11 months ago
l72|11 months ago
keepamovin|11 months ago
Perhaps that would be lucrative for the voice artists.
nmca|11 months ago
nickthegreek|11 months ago
sintezcs|11 months ago
kgeist|11 months ago
anigbrowl|11 months ago
josu|11 months ago
tiahura|11 months ago
Heidaradar|11 months ago
redox99|11 months ago