Generate audiobooks from E-books with Kokoro-82M

[+] laserbeam|1 year ago|reply

On the one hand, this is very convenient. Probably cool for some non-fiction.

On the other, some of my favorite audio books all stood out because the narrator was interpreting the text really well, for example by changing the pacing during chaotic moments. Or those audiobooks with multiple narrators and different voices for each character. Not to mention that sometimes the only cue you get for who's speaking during dialogue is how the voice actor changes their tone. I have mixed feelings about using this and losing some of that quality.

I would totally use this over amateur ebooks or public domain audiobooks like the ones on Project Gutenberg. As cool as it is/was for someone to contribute to free books... as a listener it was always jarring to switch to a new chapter and hear a completely different voice and microphone quality for no reason.

[+] stavros|1 year ago|reply

> On the other, some of my favorite audio books all stood out because the narrator was interpreting the text really well

This (and everything else with AI) isn't saying "you don't need good actors any more". It's saying "if you don't have an audiobook, you can make a mediocre one automatically".

AI (text, images, videos, whatever) doesn't replace the top end, it replaces the entire bottom-to-middle end.

[+] felixhummel|1 year ago|reply

I wholeheartedly agree. https://en.m.wikipedia.org/wiki/Stephen_Briggs got me hooked on Terry Pratchett's Discworld series. I loved "Going Postal".

[+] dmazin|1 year ago|reply

Absolutely.

Even on the non-fiction side, the narration for Gleick's The Information adds something.

While I want this tool for all the stuff with no narration, NYT/New Yorker/etc replacing human narrators with AI ones has been so shitty. The human narrators sound good, not just average. They add something. The AI narrators are simply bad.

[+] ldoughty|1 year ago|reply

I agree with you, but also want to point out:

New authors, self-publishers, can't afford tens of thousands of dollars to get an audiobook recorded professionally... This can limit their distribution.

Authors might even choose not to make such a version (or lack the confidence to record it themselves), so AI capable of making a decently passable version would be nice -- something more than reading text blandly. AI in theory could attempt to track the scene and adjust.

[+] WillAdams|1 year ago|reply

Yes, but if the alternative is not having a book, or having to listen to one poorly read (I love Librivox, but there are some books which I just haven't been able to finish because of readers, and many more which were nixed for family vacation travel listening on that account), this may be workable.

[+] micw|1 year ago|reply

With this technology, one could produce high quality audio books without having access to high quality narrators by annotating the books with the voice, speed and such things.

I wonder if a standardized markup exists to do so.
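
One existing candidate is SSML, the W3C Speech Synthesis Markup Language, which has `voice` and `prosody` elements for this kind of annotation. A minimal sketch of generating it from annotated passages follows; the voice names and the `to_ssml` helper are made up for illustration, and whether a given TTS engine (Kokoro included) accepts SSML is something to check, not something claimed here.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def to_ssml(segments):
    """Build an SSML document from (voice_name, rate, text) tuples.

    voice_name is a hypothetical engine-specific voice id; rate is an SSML
    prosody rate such as "slow", "medium", "fast" or a percentage like "120%".
    """
    speak = ET.Element(
        "speak",
        {"version": "1.1", "xmlns": SSML_NS, "xml:lang": "en-US"},
    )
    for voice_name, rate, text in segments:
        # One <voice> per passage, with <prosody> carrying the pacing hint.
        voice = ET.SubElement(speak, "voice", {"name": voice_name})
        prosody = ET.SubElement(voice, "prosody", {"rate": rate})
        prosody.text = text  # ElementTree escapes &, <, > on output
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    book = [
        ("narrator", "medium", "The door creaked open."),
        ("villain", "fast", "Run, before it is too late!"),
    ]
    print(to_ssml(book))
```

The annotation itself (who speaks, how fast) still has to come from somewhere, by hand or from a model; the markup only gives it a place to live.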

[+] ahoka|1 year ago|reply

I guess this is still very useful if you are blind.

[+] taude|1 year ago|reply

Agree with you on this.

My example: I was never a Wheel of Time fan, but the new audio editions done by Rosamund Pike are quite the performance, and make me like the story. She brings all the characters to life in a way that's different than just reading. It's a true performance.

[+] lern_too_spel|1 year ago|reply

On the other hand, there are a lot of narrators who are just bad, and the publisher is not going to pay for an alternate narration. These tools are a good way to re-narrate Wil Wheaton narrated books with correct pronunciation and inflection, for example.

Computer chess took a long time to get better than the best players in the world, but it was better than most chess players for many years before that. We're seeing that a lot with these generative models.

[+] Oneunscripted|1 year ago|reply

I guess using different narrators is essential for both fiction and non-fiction books if you want the full experience. Personally, I love it when audiobooks have narrators who stick to the characters’ personalities—it just feels right. Some of the audiobooks I’ve listened to have narrators who switch up their voices for each character, and others even use a different narrator for every character, which gets really good. Narration Box has been doing a really great job with this lately

[+] stevenwoo|1 year ago|reply

A couple of my favorite audiobooks are Stranger in a Strange Land and Flowers for Algernon where the performer changes the intonation and enunciation of main character with the character’s journey and it was a revelation and made me appreciate the stories in a way I did not get reading the printed books the first time. Just the consistency of the performance is sometimes difficult to do in my imagination perhaps.

[+] whazor|1 year ago|reply

A GenAI model that reads audiobooks with such dramatisation is really my dream. There are so many books that I would want to listen to, but still lack such an adaptation. Also it takes months after the book release before the audiobook gets released.

Just imagine what this would do for writers. They can get instant feedback and adjust their book for the audiobook.

[+] rd11235|1 year ago|reply

I agree but the opposite can be true too. Sometimes the narrator seems to target some general audience that doesn’t fit me at all, in a way that makes me cringe when I listen, until I stop listening altogether. In these cases I’d rather listen to a relatively flat narration from a tool like this.

[+] delegate|1 year ago|reply

The quality is great (amazing even), but I can't listen to AI generated voices for more than 1 minute. I don't know why, I just don't like it. I immediately skip the video on youtube if the voice is AI generated.

Might be because our brains try to 'feel' the speaker, the emotion, the pauses, the invisible smile, etc.

No doubt models will improve and will be harder to identify as AI generated, but for now, as with diffusion images, I still notice it and react by just moving on..

[+] swores|1 year ago|reply

Can anyone recommend an open source option that would allow training on a custom voice (my own, so I'd be able to record as many snippets as it needed to train on) to allow me to use it for TTS generation without sharing it off my machine?

Edit: I'll wait to see if any recommendations get made here, if not I might give this one a go: https://github.com/coqui-ai/TTS

[+] pprotas|1 year ago|reply

I would love to have an e-reader that allows me to switch between text and audio at the press of a button. Imagine reading your book on the couch and then switching into audio mode while doing the dishes seamlessly, by connecting bluetooth headphones.

[+] InsideOutSanta|1 year ago|reply

Kindles used to provide this feature, but publishers and/or the Authors Guild stopped it, because audio rights and text rights are handled differently. In other words, when Amazon sells you a text book, it does not have the right to then also do TTS on that text and let you listen to it.

There's some contemporary discussion of what happened here: https://tidbits.com/2009/03/02/why-the-kindle-2-should-speak...

I think there is still integration with Audible, though. If you buy a book on the Kindle and on Audible, the position will sync, and you can switch between listening and reading without losing your place in the book.

[+] dsign|1 year ago|reply

It is a supported feature in the epub 3.0 standard. It's possible to distribute an epub with audio, and have the audio sync to the HTML elements that form the ebook's text. And there is an e-reader that actually supports this feature, I can't remember which one now but it should be possible to find it with Google.

It's more of an open problem how to create those epubs. I have some code that can do it using Elevenlabs audio, but I imagine it's way harder to have something similar for a human narrator... who's going to do the sync? Maybe we need a sync AI.
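
For reference, the sync data in EPUB 3 is a Media Overlay document: a SMIL file where each `par` pairs a text fragment with an audio clip. A rough sketch of writing one, assuming you already have per-sentence timestamps (from the TTS engine or a forced aligner); the `build_overlay` helper, file names and fragment ids are hypothetical, and this is not the code mentioned above.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"


def build_overlay(text_doc, audio_file, clips):
    """Return a SMIL media overlay as a string.

    clips is a list of (fragment_id, start_seconds, end_seconds), one entry
    per element in text_doc that should be highlighted while its clip plays.
    """
    smil = ET.Element("smil", {"xmlns": SMIL_NS, "version": "3.0"})
    body = ET.SubElement(smil, "body")
    for index, (fragment_id, start, end) in enumerate(clips, start=1):
        par = ET.SubElement(body, "par", {"id": f"par{index}"})
        # <text> points at an element id inside the XHTML chapter.
        ET.SubElement(par, "text", {"src": f"{text_doc}#{fragment_id}"})
        # <audio> selects the matching slice of the audio file.
        ET.SubElement(
            par,
            "audio",
            {
                "src": audio_file,
                "clipBegin": f"{start:.3f}s",
                "clipEnd": f"{end:.3f}s",
            },
        )
    return ET.tostring(smil, encoding="unicode")


if __name__ == "__main__":
    timings = [("s1", 0.0, 2.35), ("s2", 2.35, 5.1)]
    print(build_overlay("chapter1.xhtml", "audio/chapter1.mp3", timings))
```

A real package also needs the overlay listed in the OPF manifest and linked from the chapter's manifest item via `media-overlay`; that part is left out here.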

[+] freefaler|1 year ago|reply

You can do it easily with non-DRM books (or DRM stripped books):

For Android:

- Moon+ reader pro - some paid high-quality TTS voices (like Acapella)

For iOS:

- Kybook reader and internal iOS voices (no external TTS voices for the walled garden)

This works well enough to listen to a book while you walk and when you get back home read on the WC from the place you stopped.

Additionally, if you buy a tablet or an Android ebook reader, you can install the app there and continue on your bigger/better device seamlessly.

Whisper-sync for the masses! Ahoy...

[+] monkeydust|1 year ago|reply

Literally started doing that this week with Amazon Audible. I gave in and started the three month 99c trial and downloaded the app.

What surprised me in a good way was that my Kindle app was aware of this and asked if I wanted to download the Audible version of the book I am currently reading.

Been listening on the way to work and then reading on the way back. Enjoying it so far.

[+] llamaimperative|1 year ago|reply

Boox Ultra Tab whatever the fuck (their product naming sucks) + Readwise Reader = amazing for this

Not quite seamless but it works. It has a cursor that follows the words as they’re spoken, which allows you to read and hear at the same time (“immersive reading”), which I find to be extremely helpful for maintaining focus.

[+] leobg|1 year ago|reply

iOS Voice Dream Reader. First app I install on a new iPhone since 2010 I believe. I will even cut and scan physical books just so I can read them in the app. The story of the guy who made it is also interesting!

[+] qurashee|1 year ago|reply

This looks incredible! I’ve had an idea simmering in the back of my mind for a while now: creating an audiobook from an ebook for my commute using the voice of a specific audiobook narrator I really enjoy. The concept struck me after coming across the Infinite Conversation project here on HN. Unfortunately, I just haven’t found the time to bring it to life yet. :(

[+] eamag|1 year ago|reply

For a specific narrator you can try F5-TTS, here's a post how https://eamag.me/2025/Voice-Cloning

[+] leobg|1 year ago|reply

Made this for my kids for Christmas:

- take an ebook in any language
- AI translates it to German
- AI speaks it using the voice of their fav narrator
- a UI showing the text as it is being read

Now they can read Asimov, Kurlansky, Bryson, regardless of whether a translation or audio version exists. :)

[+] vinni2|1 year ago|reply

What about the copyright issue? You can’t mimic the voice of a narrator without their consent. OpenAI landed in trouble after using Scarlett Johansson’s voice in a demo.

https://www.theverge.com/2024/5/20/24161253/scarlett-johanss...

[+] cwmoore|1 year ago|reply

The word “kokoro” means “heart” in Japanese, which I learned making the (heart shaped and paperback) puzzle books at https://www.kakurokokoro.com/

[+] tkgally|1 year ago|reply

Note that kokoro (心) means “heart” in the sense of “spirit,” “soul,” “mind,” “emotions,” etc. It doesn’t mean “heart” in the sense of “internal organ that pumps blood.” That is shinzō (心臓).

I once heard an American friend with so-so Japanese ability ask a Japanese woman who had recently had a heart operation how her kokoro was doing, and she looked surprised and taken aback.

Side note: After I started reading HN in 2019, I was struck by how many tech products mentioned here have Japanese names. I compiled a list for a few years and eventually posted it:

https://news.ycombinator.com/item?id=31310370

[+] terhechte|1 year ago|reply

It's also the name of the AI in Terminator Zero https://villains.fandom.com/wiki/Kokoro

I'm not sure if that is related here.

[+] albert_e|1 year ago|reply

I hope a plugin for Calibre ebook management software comes along that makes it easier to convert select titles from your epub library to decent audio versions -- and a decent open source app for tablets and smartphones that can let us seamlessly consume both the ebook and audiobook at will.

[+] Dowwie|1 year ago|reply

2025 may be the year where we can generate a dramatic audiobook with ambient music, sound effects, and theatrical narration using neural networks. Many of the parts already exist.

[+] cess11|1 year ago|reply

I would for sure not want this for fiction, it's too obvious that the voice has no understanding whatsoever of the text, but it's probably pretty nice for converting short news texts or notifications to audio.

[+] sysworld|1 year ago|reply

Finally! Been trying all the TTS models popping up on here for ages, and they've all been pretty average, or don't work on Mac, or only work on really short text, or are reeealy slow.

But this one works pretty quick, is easy to install, and has some passable voices. Finally I can start listening to those books that have no audio version.

I'm a slow reader, so don't read many books. If a book doesn't have an audiobook version, chances are I won't read it.

PS, I have used elevenlabs in the past for some small TTS projects, but for a full book, it's price prohibitive for personal use. (elevenlabs has some amazing voices)

Thank you to the dev/s who worked on this!

[+] TypoAtLineZero|1 year ago|reply

I have a very similar setup locally, which uses Chrome with the 'Read Aloud' plugin. I capture the audio stream via QJackCtl/VLC. Voices, speed, and pitch can be adjusted. Efficient and quick to set up.

[+] lc64|1 year ago|reply

"was trained on <100 hours of audio"

How the hell was it trained on that little data ?

[+] woolion|1 year ago|reply

If you look for a lot of the great classics, audiobook results are inundated with basic TTS "audiobooks" that are impossible to filter out. These are impossible to listen to because they lack the proper intonation marking the end of sentences, making them very tiring to parse. It might be better than tuna-can-sounding recordings, especially if you want to hear them in traffic (a common requirement), but that's about it. The alternative, if you want real quality recordings, is to stop reading classics and instead read the latest Japanime Isekai or murder mystery; these have very good options on the market. Anyway, I don't think it needs more justification that it covers a good niche usage.

I'm checking what the actual quality is (not a cherry-picked example), but:

Started at: 13:20:04
Total characters: 264,081
Total words: 41548
Reading chapter 1 (197,687 characters)...

That was 1h30 ago, and there's no progress notification of any kind, so I'm hoping it will finish sometime. It's using 100% of all available CPUs so it's quite a bother. (This is "A Tale of a Tub" by Swift; it's about half of a typical novel length.)

[+] msoad|1 year ago|reply

To people who are experts in AI TTS:

Why does ElevenLabs have such a lead in this space? It sounds better than the OpenAI and Google models.

[+] dbspin|1 year ago|reply

Does it? The podcasts created by NotebookLM are completely convincing, at least in terms of voice generation.

[+] eamag|1 year ago|reply

Single-purpose company vs a huge corp with many other objectives in mind

[+] katspaugh|1 year ago|reply

Sounds better than many books on Audible.

[+] TheChaplain|1 year ago|reply

For accessibility I think this is a great thing, but as entertainment less so.

An example is The Hobbit and The Lord of the Rings: the narrator, Rob Inglis, gives an amazing voice performance that adds depth to environments and characters. And of course the songs!

[+] flypunk|1 year ago|reply

I really liked it and added a variable speed argument: https://github.com/santinic/audiblez/pull/4

[+] yoavm|1 year ago|reply

Was just looking for a TTS model to run locally for reading out loud articles, and never heard about Kokoro before! This looks great. I wonder if it can run in the browser somehow - could be a nice WebExtension.

[+] nottorp|1 year ago|reply

Well there was some hope with ChatGPT that people will go back to being able to process text communication.

Guess it was just a matter of time till someone figured out how to use "AI" to resume encouraging illiteracy.

[+] stavros|1 year ago|reply

There was some hope with the rise of equestrianism that people will go back to being able to shoe horses.

Guess it was just a matter of time till someone figured out how to use "cars" to resume encouraging being unable to do a basic farrier job.

[+] nickpsecurity|1 year ago|reply

The page says it was trained on under 100 hours of audio. Then, the link says “we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training.” I don’t have time to read the paper to see what that means.

Depending on what that means, it might be more accurate to say it was trained on 100 hours of audio and with the aid of another, pre-trained model. The reader who thinks “only 100 hours?!” will know to look at the pretraining requirements of the other model, too.

246 comments