Hey, one of the Suno founders/creators of Bark here. Thanks for all the comments; we love seeing how we can improve things in the future. At Suno we work on audio foundation models, creating speech, music, sound effects, etc.
Text to speech was a natural playground for us to share with the community and get some feedback. Given that this is a full GPT model, the text input is merely guidance; the model can technically create any audio from scratch, even without input text (aka hallucinations, or audio continuation).
When used as a TTS model, it's very different from the awesome high-quality TTS models already available. It produces a wider range of audio: the same text could yield a high-quality studio recording of an actor, or two people shouting in an argument at a noisy bar. Excited to see what the community can build and what we can learn for future products.
Please let us know with any feedback, or if you’re interested in working on this: [email protected]
This tech will be used by crooks to automate attacks. Generate the language using GPT-4 and the audio using Bark, and then start making phone calls. Because it’s open source, all you need is GPUs. This is not a criticism. I’m impressed and grateful for the openness. Everyone needs to wake up and recognize that these attacks are coming at us essentially right now.
I like the emphasis tags; it's something that isn't seen with a lot of these transformer models. Things like [laughs] make a lot of sense. I could see where hundreds or possibly thousands of emphasis-style tags could be added to support a vast array of intonations in human speech, e.g. [yells], [shouts], [cries], [crying], [whispers], [sarcasm], etc.
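A tag vocabulary like that could be validated before synthesis. A minimal sketch, assuming a hypothetical known-tag set (Bark's README only documents a handful, e.g. [laughs], [sighs], [music]):

```python
import re

# Hypothetical tag vocabulary. Only a few of these are actually documented
# for Bark; the set here is illustrative, not the model's real list.
KNOWN_TAGS = {"laughter", "laughs", "sighs", "music", "gasps", "clears throat"}

def check_tags(prompt: str) -> list[str]:
    """Return any bracketed tags in the prompt that are not in the known set."""
    tags = re.findall(r"\[([^\]]+)\]", prompt)
    return [t for t in tags if t.lower() not in KNOWN_TAGS]

# 'whispers' is not in the illustrative set, so it gets flagged
unknown = check_tags("Well [laughs] I suppose [whispers] that could work.")
```

Unknown tags would presumably just be read aloud or ignored by the model, so flagging them up front could save a wasted generation.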
Very cool. Side note: bark-gpt.com is already taken for a dog translator: "The world’s first AI powered, real-time communications tool between humans and their furry best friends."[0] I only know this because my law firm partner's name is Bark, and I wanted to automate some legal work and name the software "Bark GPT" after him.
Am I hallucinating, or did several of the examples have background audio artifacts? It sounds like it's been trained on speech with noisy backgrounds; I'm guessing audio from movies paired with subtitles. Having random background audio can make it quite hard to use in production.
>Am I hallucinating or didn't several of the examples have background audio artifacts, like it's been trained on speech with noisy backgrounds, I'm guessing audio from movies paired with subtitles? Having random background audio can make it quite hard to use in production.
The other side of that problem is an opportunity. That's why the same model can also generate music, background noise and sound effects, just because the prompt specifies those things. The input is truly semantic, so the output is rich and reflects that context. If your input text sounds like it came from a speech, then there's a high chance your output audio will sound like a megaphone in a public space, with crowd reactions and maybe even applause.
Man, I know this is HN, and I know we have a certain decorum we should be maintaining, but with the recent activity in this field the most appropriate response to these posts is "4bit when?" or "f16 when?". Not sure which one is applicable. I'm having no luck running it on a 6GB-VRAM GPU, so I guess it's the 16-bit floating point one.
Related to this: to those releasing models, it would be great if you could share how much VRAM is required (it seems very common for this key piece of info to be missing).
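For a rough sense of scale, weight memory alone can be estimated from parameter count and precision. A back-of-envelope sketch (this ignores activations, the KV cache, and framework overhead, so real usage is higher):

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bits / 8 / 1024**3

# Illustrative: a 1B-parameter model at common precisions
for bits in (32, 16, 8, 4):
    print(f"{bits:2d}-bit: {weight_memory_gb(1e9, bits):.2f} GB")
```

This is why "4bit when?" matters: halving the bit width halves the weight footprint, which can be the difference between fitting on a consumer GPU or not.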
I'm successfully running it on a 12GB GPU (while it downloads some 12.1GB of model data on first run, the highest GPU memory usage was ~6.5GB, settling back down to around 5GB); however, the results are nothing like the samples given on the GitHub page. Using the exact code given, the results in the runs I've tried are rather terrible.
I'm not being negative -- some of the samples are really neat on their page -- and I know there is some idiosyncrasy of my setup that is causing issues, though it is a pretty typical conda + pytorch with CUDA 11.8.
Playing with the text and waveform temperatures (default 0.7 each) yields some semi-decent results, but it feels essentially random.
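As I understand it, Bark's text and waveform temperatures scale the logits before sampling, like any temperature-controlled softmax: lower values concentrate probability on the top tokens, higher values flatten the distribution. A minimal illustration in plain Python (no Bark dependency):

```python
import math

def softmax_with_temperature(logits: list[float], temp: float) -> list[float]:
    """Scale logits by 1/temp before softmax; as temp -> 0 this approaches argmax."""
    scaled = [l / temp for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.3)   # peaked: most mass on index 0
high = softmax_with_temperature(logits, 1.5)  # flatter: sampling is more random
```

That would explain the "essentially random" feel at the 0.7 defaults: there is still a lot of probability mass on alternative continuations every step.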
A few years back someone had the genius idea of making a robo-answering phone program that would give vague but encouraging replies when it received an unsolicited sales pitch. Although it was a fixed sequence of responses, it fooled some callers for a surprisingly long time.
Someone needs to create the plumbing to capture speech-to-text, feed it to a GPT script that has been told how to reply to such call-center calls, then send that back through a TTS generator like this one.
To overcome any latency issues, it could build in a ploy to buy time like the old script did, e.g., make the robo-answerer sound like a somewhat addled old man who has to think before each reply, perhaps prefixing responses with "hmm, ahh, ..." while the real response is generated.
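The shape of that pipeline is simple enough to sketch. All three components below are stubbed placeholders, not real APIs; in practice they would be a speech-to-text model, an LLM, and a TTS model such as Bark:

```python
import random

# Stub components standing in for real STT / LLM / TTS calls (hypothetical).
def transcribe(audio: bytes) -> str:
    return "Hello, I'm calling about your car's extended warranty."

def llm_reply(transcript: str) -> str:
    return "Oh my, the warranty you say? Let me find my glasses..."

def synthesize(text: str) -> bytes:
    return text.encode()  # placeholder for generated audio samples

# Canned fillers, which in a real system would be pre-rendered audio
# so they can play instantly while the slow reply is being generated.
FILLERS = ["Hmm...", "Ahh, let me think...", "Hold on now..."]

def answer_call(incoming_audio: bytes) -> list[bytes]:
    """Play a filler immediately to buy time, then the generated reply."""
    filler = synthesize(random.choice(FILLERS))
    reply = synthesize(llm_reply(transcribe(incoming_audio)))
    return [filler, reply]
```

The key design point is the filler: it masks the combined STT + LLM + TTS latency, exactly the role the "addled old man" persona played in the original script.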
It seems like a lot of the entries in TTS are either closed-source SaaS apps or something like this, with limitations on customizing it. It seems clearly inevitable, and likely only months away, that a high-quality, unrestricted open-source option for things like voice cloning will emerge, so I'm not sure why these projects even bother trying to stop it. For TTS to have its StableDiffusion moment, it will just take an unrestricted, easily trainable open-source model.
Is there likely to be a way to stream this audio in the future? As in, here's an incoming stream of text, generate the audio on-the-fly instead of all at once.
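No streaming API is documented, but one could approximate streaming client-side by chunking incoming text into sentences and synthesizing each chunk as it arrives. A sketch with a stub synthesizer (the real per-chunk call to a TTS model is an assumption, not shown):

```python
import re

def sentences(text: str) -> list[str]:
    """Split text into sentence-sized chunks for incremental synthesis."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def stream_tts(text: str, synthesize):
    """Yield audio chunk by chunk instead of waiting for the full text."""
    for sentence in sentences(text):
        yield synthesize(sentence)

# Stub synthesizer standing in for a per-sentence TTS call
chunks = list(stream_tts("First sentence. Second one! Third?", lambda s: s.encode()))
```

The first chunk can start playing while later ones generate, though with a GPT-style audio model each chunk would lose the prosodic context of its neighbors unless some history is carried across calls.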
Great news! It's astounding how quickly technology is advancing. Only yesterday, I was wondering about when a new model for text-to-speech would be developed, and today a game-changing model has been released! This new model is simply incredible!
Any idea what the training data for this is? Looking at the model, it looks like it is literally just copy-paste from Karpathy's nanoGPT, so the training data is what's most interesting. Pretty amazing anyway.
I found a secret demo page that shows in real time how they assess any sound file's mood swings along with number of detected laughs, coughs, etc. Guessing that ability is involved somehow.
Imagine Torvalds saying the same in the context of Linux: 'to mitigate misuse of this technology, we limit the audio history prompts to a limited set of Suno-provided...'
On the other hand, Russian was disappointing. It put the stress in one word incorrectly (it confused the grammatical form, using the genitive case instead of the accusative) and in general it sounded strange.
The fact that this is open source and can generate more than just speech is really nice, but for speech itself, it's much lower quality than what Eleven Labs provides.
All the open source models I've seen so far have this weird kind of neural fuzziness to them. I don't know what Eleven does better, but there's definitely a big difference.
"However, to mitigate misuse of this technology, we limit the audio history prompts to a limited set of Suno-provided, fully synthetic options to choose from for each language."
Isn't this open source and can be easily removed or am I missing something?
Some of it is very impressive although some of it seems about equal to the TTS built into my phone. How long until someone can package this up and make a program that takes in epubs and spits out mp3s?
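The rough shape of such a pipeline is easy to sketch. Both the epub extraction and the synthesis below are stubs; real code might use ebooklib for parsing and a per-paragraph Bark call (both assumptions, not shown here):

```python
# Stub: a real version might use ebooklib to pull chapter text from the epub.
def extract_paragraphs(epub_path: str) -> list[str]:
    return ["Chapter one.", "It was a dark and stormy night."]

# Stub: a real version would return synthesized audio, not encoded text.
def synthesize(text: str) -> bytes:
    return text.encode()

def epub_to_audio(epub_path: str) -> bytes:
    """Synthesize each paragraph and concatenate into one audio stream."""
    return b"".join(synthesize(p) for p in extract_paragraphs(epub_path))

audio = epub_to_audio("book.epub")
```

Per-paragraph chunking matters here because short-context audio models handle short prompts best; an MP3 encoding step would then wrap the concatenated output.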
Does it sound fairly robotic/static-y to anyone else, or is it just me? It doesn't sound any better than any other TTS software I've tried, and in fact it sounds a bit worse, like it's noisy.
gkucsko | 2 years ago
ttul | 2 years ago
dmix | 2 years ago
turnsout | 2 years ago
jijji | 2 years ago
calny | 2 years ago
[0] https://www.bark-gpt.com/
dasickis | 2 years ago
Reference:
1. https://www.laika.berlin/en/blog/new-client-barkgpt-ai-dog-b...
seydor | 2 years ago
lIl-IIIl | 2 years ago
cm2187 | 2 years ago
ripperdoc | 2 years ago
JonathanFly | 2 years ago
CreepGin | 2 years ago
newswasboring | 2 years ago
montebicyclelo | 2 years ago
joseph_grobbles | 2 years ago
tasty_freeze | 2 years ago
https://www.youtube.com/watch?v=XSoOrlh5i1k
jdprgm | 2 years ago
kleer001 | 2 years ago
CYA aka https://en.wikipedia.org/wiki/Cover_your_ass
Also, it still requires tons of money to run, so it's likely only businesses will do it.
generalizations | 2 years ago
vlugorilla | 2 years ago
rck | 2 years ago
unraveller | 2 years ago
DarthNebo | 2 years ago
bdg | 2 years ago
cyberax | 2 years ago
miki123211 | 2 years ago
ignoramous | 2 years ago
I guess the "open" part of it is mostly for marketing.
drowsspa | 2 years ago
quaintdev | 2 years ago
sschueller | 2 years ago
gs17 | 2 years ago
causi | 2 years ago
billconan | 2 years ago
It seems to be easy to reproduce if I specify a non-existent speaker:
audio_array = generate_audio(text_prompt, 'en_speaker_3')
nathias | 2 years ago
newswasboring | 2 years ago
itomato | 2 years ago
https://suno-ai.notion.site/Bark-Examples-5edae8b02a604b54a4...
xingped | 2 years ago