BARK: Free Text to Speech & Voice Cloning

Captions

Uh, hello, my name is Abhishek, and welcome to my YouTube channel. Namaste. Pretty cool, right? In today's video we are going to see how I generated these amazing voices using just one single model called Bark. I'm also going to show you, towards the end, how you can clone any voice from just a 10-second audio clip. "This is one of the best YouTube channels for applied machine learning." Yeah, that's something you will be able to do by the end of the video. I tried to clone Obama's voice using only 10 seconds of audio.

So this is the Bark repository. Bark is a Transformer-based text-to-audio model created by Suno. It can generate pretty realistic voices, so the text-to-speech is very realistic. It also has multilingual speech, and it can generate music, background noises, or simple sound effects. We are going to take a look at all of these, and the bonus at the end is that you will be able to clone anybody's voice using just a 10-second sample. Bark does not actually provide that functionality, so we are going to see how we can do that.

To get started, all you need to do is update your Transformers version, and then we can just use Transformers. Let's get started in VS Code. I will write from transformers import AutoProcessor (I do have a reference with me for the code) and BarkModel, and we will create a function so that we can keep generating different texts easily without having to change a lot of code. We will also import scipy; I will let you know why we do that.

Okay, now things are a little bit simpler, hopefully. We create a processor: processor is AutoProcessor.from_pretrained, and here you write the model name, which is suno/bark. If you're doing this for the first time, it's going to download the model and everything, which takes a while if you don't have very fast internet. Next time it's quite fast, because it doesn't have to download anything and just uses the cache. So, suno/bark, and let's send the model to CUDA so that generating an audio takes just a few seconds instead of several minutes.

Then we define a function called generate_audio, which takes three arguments: text, preset, and output. text is the text; preset is something we are going to take a look at. inputs is processor(text, voice_preset=preset), and we send all the inputs to CUDA as well: for k, v in inputs, inputs[k] is v.to("cuda"). Now we generate an audio array, which is model.generate of the inputs. Then we put the audio array on the CPU: audio_array is audio_array.cpu(), converted to NumPy, and since the batch size is one, so it's just one sample, we squeeze it.

Now, what is the sample rate? The sample rate is also saved in the model config, so I can just say model.generation_config.sample_rate. Now we use scipy to save the audio: scipy.io.wavfile.write, where the first argument is the output file name, rate equals sample_rate, and data is the audio array itself. Okay, great.

So now we can just run it. I call generate_audio with text, and my text will be "Hi, welcome to my YouTube channel". Then we have the preset, which we'll leave blank for now, and output equals output.wav.

Now let's go back to the repo and see what presets are. They have created a voice prompt library, and these are basically the presets. If I click on this one, let's say... okay, this is Hindi, and Spanish. So let's get rid of all the filters, and you can see that presets are available in many different languages. For example, let's take a look at this prompt audio here: "There are a lot of things I could talk about, but it would probably sound similar to this." Okay, so this is one of the prompt audios. The idea is basically like language models: you have a prompt, and then you have some kind of continuation. So this was the German one, and similarly you have a lot of them for different languages.

Since we typed in English, let's copy this female voice, go back to VS Code, and paste it here: v2 and speaker 9. Let's try to run this and hope for no errors: generate.py. It's going to take only a few seconds to generate the audio. So yeah, I do have an error, as usual: it should be .items(). Okay, let's run it again and see what happens. After a few seconds we have the output WAV here: "Welcome to my YouTube channel." Pretty good, right?

Bark allows a lot of different things, so you can also modify the audio. Let's go to this one, and if you scroll down a bit there is a dictionary, so you can also include laughter, music, gasps, clearing the throat, these kinds of things. Let's try a few. You can use this for song lyrics, so let's try the song lyrics one. Let's add a text prompt here; I've just copy-pasted this from the Bark README, with the Unicode music note at the start and one at the end. Let's try this one; it can generate something like this: "In the jungle, the mighty jungle, the lion barks tonight." I don't think there is any music there, so let's set the preset to None so that we don't have a pre-prompt, and see what happens. Okay. [Music] "A mighty jungle, the lion barks tonight." Okay, great. I just wanted to show you that it works.

You can also include things like laughter: "Subscribe to my YouTube channel", and now I have not included any kind of preset. "Subscribe to my YouTube channel." Not too bad, right?

There is also one more feature, which you can try on your own. This is the Transformers documentation, and the feature is for when you have a mix of different languages. Here, this is German and then you have English, so it will start in German and keep the accent, and you will see that it generates the English text in a German accent. This is also something you can try on your own.

Okay, now let's move on to the next part, which is cloning. For cloning we will need another repository: we will be using the TTS package by Coqui AI. I hope I'm pronouncing that correctly, but if not, it doesn't matter, it works. They have Bark now, but it's not really very straightforward: the documentation doesn't say a lot, and it took me a while to get this up and running. So let's try it out.

Okay, we are back in VS Code, and you can see there's a lot of stuff here. The first thing is the repository, which you can just clone. Next, I've created a folder called bark_voices; you can call it anything you want. Inside that I have another folder called speaker, which can also be anything you want, and inside that I have speaker_0. Let's say I'm using the voice of Obama; here is the sample: "I have faith that we will emerge from this trying time even stronger and more prosperous than we were before." Okay, so this is the sample, and it's only 10 seconds. I can also call this folder obama and obama_0. Bark allows you to have only one file, at least with the TTS package.

We import the config from TTS.tts.configs.bark_config and the model, similar to Transformers, and then scipy, because we have to save the WAV file. We initialize the config, then the model with init_from_config, exactly like Transformers, and then you load the checkpoint. My checkpoint directory here is bark/, because it doesn't like the Hugging Face checkpoints. Then, and this is also not in the documentation, you have to move the model to the CUDA device if you have GPUs; otherwise it's going to be super slow. Then I wrote a text, and then you can use the model.synthesize function. You have to remember that the speaker ID is your folder name, and the voice directory is where all the voices are. If you put this as random, it's going to generate random voices, but I'm going to use it only for cloning voices.

Now I can just go to the terminal and type python clone.py. Let's see if it sounds anything like Obama: "Subscribe to my YouTube channel." Yeah, I would say not too bad, not too bad at all.

So anyway, this is the video, and I hope you like it. Do subscribe, do like, and try it out on your own. If you have any questions or comments, feel free to leave them in the comment section, and see you in the next one.
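
The Transformers part of the walkthrough above could be sketched roughly as follows. This is a reconstruction from the spoken description, not the author's actual file. It assumes that transformers (a release recent enough to include Bark), torch, and scipy are installed and that a CUDA GPU is available. The helper names move_to_device and load_bark are hypothetical and do not appear in the video. The heavy imports sit inside the functions only so that the file can be imported without loading the model.

```python
from functools import lru_cache


def move_to_device(inputs, device):
    """Move every value that has a .to() method (tensors) to `device`."""
    # Iterating a mapping directly yields only its keys; .items() yields
    # (key, value) pairs, which is the fix for the error hit in the video.
    return {k: (v.to(device) if hasattr(v, "to") else v) for k, v in inputs.items()}


@lru_cache(maxsize=1)
def load_bark(device="cuda"):
    """Load the processor and model once; later calls reuse the cached pair."""
    from transformers import AutoProcessor, BarkModel

    processor = AutoProcessor.from_pretrained("suno/bark")
    model = BarkModel.from_pretrained("suno/bark").to(device)
    return processor, model


def generate_audio(text, preset, output, device="cuda"):
    """Generate speech for `text` and save it as a WAV file at `output`."""
    from scipy.io import wavfile

    processor, model = load_bark(device)
    # preset is a voice preset name, or None for no voice prompt.
    inputs = move_to_device(processor(text, voice_preset=preset), device)
    audio_array = model.generate(**inputs)
    # Batch size is one, so squeeze() leaves a 1-D array of samples.
    audio_array = audio_array.cpu().numpy().squeeze()
    sample_rate = model.generation_config.sample_rate
    wavfile.write(output, rate=sample_rate, data=audio_array)
```

With this in place, generate_audio("Hi, welcome to my YouTube channel", "v2/en_speaker_9", "output.wav") would correspond to the first run in the video, and passing preset=None with a text such as "♪ In the jungle, the mighty jungle, the lion barks tonight ♪" would correspond to the song-lyrics attempt.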

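The cloning steps can be sketched in the same way. This is only an approximation of what the video describes, assuming the Coqui TTS package is installed, a Bark checkpoint has been downloaded to a local bark/ directory, and the reference clip lives at bark_voices/<speaker_id>/<clip>.wav. The function names find_speaker_clip and clone_voice are hypothetical, and the 24000 Hz output rate is hard-coded on the assumption that the model produces 24 kHz audio.

```python
from pathlib import Path


def find_speaker_clip(voice_dir, speaker_id):
    """Return the single reference audio clip in <voice_dir>/<speaker_id>/."""
    folder = Path(voice_dir) / speaker_id
    clips = sorted(p for p in folder.glob("*") if p.suffix.lower() in {".wav", ".mp3"})
    # The video notes that only one file per speaker is allowed with the TTS package.
    if len(clips) != 1:
        raise ValueError(f"expected exactly one audio clip in {folder}, found {len(clips)}")
    return clips[0]


def clone_voice(text, speaker_id, voice_dir="bark_voices/", checkpoint_dir="bark/",
                output="output.wav", device="cuda"):
    """Synthesize `text` in the voice of the clip stored under `speaker_id`."""
    # Heavy imports stay inside the function so importing this file is cheap.
    from scipy.io import wavfile
    from TTS.tts.configs.bark_config import BarkConfig
    from TTS.tts.models.bark import Bark

    find_speaker_clip(voice_dir, speaker_id)  # fail early on a bad folder layout

    config = BarkConfig()
    model = Bark.init_from_config(config)
    model.load_checkpoint(config, checkpoint_dir=checkpoint_dir, eval=True)
    model.to(device)  # per the video, generation is very slow without this

    # speaker_id is the folder name; voice_dirs is the folder holding all speakers.
    output_dict = model.synthesize(text, config, speaker_id=speaker_id, voice_dirs=voice_dir)
    wavfile.write(output, 24000, output_dict["wav"])
```

For the layout in the video, clone_voice("Subscribe to my YouTube channel", "speaker") would look for the clip under bark_voices/speaker/. Passing speaker_id="random" to synthesize is what the video describes for random voices, but this sketch would need its folder check skipped for that case.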
Info

Channel: Abhishek Thakur

Views: 16,553

Keywords: machine learning, deep learning, artificial intelligence, abhishek thakur, bark, bark voice clone, clone voice, best free tts, free text to speech, best tts, free tts, multilingual tts, multilingual text to speech, how to clone voices, how to clone voice free, free voice cloning, bark transformers, coqui voice clone, clone voice for free, best text to speech, best free text to speech, tts, text to speech, bark tts, bark text to speech, clone voices, ai voice cloning

Id: OHZHM8hcyI4

Length: 14min 49sec (889 seconds)

Published: Tue Aug 01 2023