r4victor | 2 years ago

Amazing! I made a similar ebook-audiobook aligner years ago: https://github.com/r4victor/syncabook. At the time, I chose to synthesize the text and align the two audio sequences because I found text-alignment approaches (including ML-based ones) too compute-intensive and inadequate for long texts. I see Storyteller works by aligning the texts. Could you give a sense of how long it takes to sync a book?

Also, my experience was that audio and text versions are often very different (e.g. the audio having an intro missing from the text). It'd be very interesting to know how well Storyteller handles such cases. Does it require manual audio/text editing or handle the differences automatically?

smoores|2 years ago

Hello! syncabook is awesome, and indeed Storyteller does take "the opposite" approach when it comes to forced alignment.

Others have linked to the docs, where I go into detail about the syncing algorithm, but at a high level:

Storyteller uses Whisper to transcribe the audio to text (this is the most computationally expensive part of the process)

Then we use a Levenshtein-distance-based fuzzy search algorithm to find each chapter in the text (this is attempting to account for the difference between audio and text versions, as you said!)

Then for each chapter, we find the start and end timestamp of each sentence, again using a fuzzy search across the transcription.
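To make the fuzzy search step concrete, here is a minimal sketch of a Levenshtein-based approximate substring search. It is not Storyteller's actual code; the function name `fuzzy_find` and the character-level comparison are assumptions for illustration only. It finds the span of a transcript that is closest, by edit distance, to a given sentence:

```python
def fuzzy_find(needle: str, haystack: str) -> tuple[int, int, int]:
    """Return (start, end, distance) for the substring of `haystack`
    with the smallest Levenshtein distance to `needle`.

    Hypothetical helper: a textbook approximate-matching dynamic program,
    not taken from Storyteller.
    """
    n = len(needle)
    # prev[i]: best distance between needle[:i] and a substring of the
    # haystack ending at the previous position; prev_start[i]: where
    # that substring begins.
    prev = list(range(n + 1))
    prev_start = [0] * (n + 1)
    best = (0, 0, n)  # matching the empty substring costs n edits

    for j, ch in enumerate(haystack, start=1):
        cur = [0] * (n + 1)        # an empty needle prefix matches for free
        cur_start = [j] * (n + 1)  # ...and a match may begin at any position
        for i in range(1, n + 1):
            substitute = prev[i - 1] + (needle[i - 1] != ch)
            extra_in_haystack = prev[i] + 1
            missing_in_haystack = cur[i - 1] + 1
            cost = min(substitute, extra_in_haystack, missing_in_haystack)
            cur[i] = cost
            if cost == substitute:
                cur_start[i] = prev_start[i - 1]
            elif cost == extra_in_haystack:
                cur_start[i] = prev_start[i]
            else:
                cur_start[i] = cur_start[i - 1]
        if cur[n] < best[2]:
            best = (cur_start[n], j, cur[n])
        prev, prev_start = cur, cur_start
    return best


transcript = "intro words. the quikc brown fox jumps over"
start, end, distance = fuzzy_find("the quick brown fox", transcript)
print(transcript[start:end], distance)
```

This runs in time proportional to len(needle) * len(haystack), so a real implementation would presumably narrow the search window first, or match on words rather than characters.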

In general, Storyteller does a pretty good job; it treats the ebook as the source of truth, which means that at the moment it sometimes misses introductory and ending pieces of the audiobook, though it's on the roadmap to have some support for explicitly triggering those when that happens.

NoahKAndrews|2 years ago

The docs say it's usually 1-4 hours depending on the book and the hardware: https://smoores.gitlab.io/storyteller/docs/syncing-books

The docs also have a detailed section about the algorithm that goes into how it auto-handles differences between the audio and the text.

cyberax|2 years ago

One obvious optimization is to sample the audio file at regular intervals and transcribe only those samples rather than the whole recording. Then just interpolate the locations in between. This can speed it up by a couple of orders of magnitude.
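A rough sketch of the interpolation part, assuming the sampled clips have already been transcribed and located in the ebook text. The anchor format (character offset, seconds) and the name `interpolate_time` are made up for illustration:

```python
from bisect import bisect_right


def interpolate_time(anchors: list[tuple[int, float]], offset: int) -> float:
    """Estimate the audio timestamp for a character offset in the text.

    `anchors` is a list of (char_offset, seconds) pairs sorted by strictly
    increasing char_offset, one per sampled clip that was matched to the
    text. Offsets outside the anchored range are clamped to the nearest
    anchor. Hypothetical helper, not part of any existing tool.
    """
    offsets = [o for o, _ in anchors]
    i = bisect_right(offsets, offset)
    if i == 0:
        return anchors[0][1]
    if i == len(anchors):
        return anchors[-1][1]
    (o0, t0), (o1, t1) = anchors[i - 1], anchors[i]
    # Linear interpolation between the two surrounding anchors.
    return t0 + (t1 - t0) * (offset - o0) / (o1 - o0)


anchors = [(0, 0.0), (1000, 60.0), (3000, 200.0)]
print(interpolate_time(anchors, 500))
print(interpolate_time(anchors, 2000))
```

Linear interpolation assumes the narration pace is roughly constant between anchors, so sentence boundaries estimated this way will be less precise than ones taken from a full transcription.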