My intuition is that with audiobook narrators who adopt different voices for different characters, VALL-E would struggle to identify which character is speaking, and so would fail to produce the correct voice for a given passage, say, text in quotation marks. For that matter, in some prose it can be difficult even for a human reader to determine who is speaking, but this is a classic weakness of the current crop of AI tools, and it stems from a lack of actual understanding of the text.