Meet Bark (New Text to Speech Model): Clone Any Voice to Generate Music and Speech

Video Statistics and Information

Captions
Hello everyone, welcome to the AI Anytime channel. In this video we are going to look at a newly released text-to-speech model called Bark. I came across Bark on Reddit, and the headline read: "Bark — text-to-speech, but with custom voice cloning using your own audio and text samples." The model is completely open source at this moment, and you can use it commercially, although for large-scale use you should check the licensing; I believe they use the EnCodec audio codec from Facebook, but Bark itself is released as a commercially usable open-source model. The description says you can clone a voice using just five-to-ten-second samples of paired audio and text. Remember VALL-E from Microsoft? It made the news for cloning a voice from less than two seconds of audio. Now Bark is here, and it was partly inspired by VALL-E — you can read that in the GitHub repository I will show you in a moment. I will also try to unpack why Bark might be better: what techniques its architecture uses, how the model was trained, and how Bark differs from traditional speech-synthesis models. As I said, it is completely open source and you can run it locally. Surprisingly, it even runs on a CPU, but the inference time is very long: for a single-line prompt, generating audio can take up to 30 minutes on a CPU machine, whereas on a GPU it is much faster. This could be the "Apple moment" of audio generation — let me know your thoughts in the comment box.

I'm now on the GitHub repository; it says Bark, by Suno AI. Bark is a transformer-based text-to-audio model created by Suno that can generate highly realistic, multilingual speech. When you have a text prompt you want to turn into speech, you can use its multilingual capabilities: it supports languages such as Hindi, German, and French. It can also produce music, background noise, and other audio effects. Traditional speech-synthesis models — Tortoise TTS, for example — struggle to insert noise, music, or effects such as laughter into generated speech, but Bark claims to handle them, and we will listen to the results. They also provide a Colab notebook, which we will use to run inference on some sample prompts. If you scroll down the README you will find samples, inference code, and documentation, plus notes on hardware requirements and inference speed. The work is similar to VALL-E and other recent models in the field, but it uses GPT-style (generative pre-trained transformer) models to generate audio from scratch. What are they doing differently? It's very interesting, and I will explain it intuitively — I don't want to just read the documentation, since anyone can go to the GitHub repository for that; I want to help you understand how Bark differs from the models released before it. Let me head back to my tab and walk through one particular line of the README.
That line says that, different from VALL-E, the initial text prompt is embedded into high-level semantic tokens without the use of phonemes. Very interesting — but what is a phoneme? This is quite intuitive, so let me explain what the line means. VALL-E and traditional speech-synthesis models such as Tortoise TTS typically build on phonemes, each with its own architecture and training techniques. Bark, by contrast, lets you put non-speech sounds directly in the prompt — [laughter], [laughs], [music], [clears throat], and other audio effects are all supported.

So, phonemes. Phonemes are the basic units of speech sound, and they are represented using the symbols of the IPA — the International Phonetic Alphabet. If you work with speech sounds, you will constantly come across these symbols. Let's take a simple example, the word "cat". It consists of three phonemes — in IPA, /k/, /æ/, /t/ — call them P1, P2, and P3. To produce audio, you first learn the sound associated with each IPA symbol and then produce the sounds in sequence; a traditional TTS pipeline reads the phoneme sequence and generates an audio waveform from it.

Now, what are the limitations of phonemes — why might we not want them when generating audio from a text prompt? Remember, the README says the text prompt (at inference time, and during training as well) is embedded into high-level semantic tokens, without phonemes. There are two major limitations. The first, and the bigger one, is that phoneme symbols are specific to a language: different languages have different phoneme sets. English has one set of phonemes, French another, German yet another. That means a phoneme-based system does not generalize easily across languages. Bark, however, claims multilingual support: the README lists supported languages from English to Chinese (Mandarin), including Hindi, with Arabic, Bengali, and Telugu — Indian regional languages among them — coming soon. So phonemes are not good for multilingual tasks; that is their biggest limitation. The second limitation is that phonemes do not capture the full meaning of the text: a phoneme-based architecture can miss the context behind the prompt you are making.

That is why, as the GitHub repository says, Bark embeds the prompt into high-level semantic tokens instead of phonemes, to represent the meaning of the input. (I really like the name, by the way — Bark, a text-to-speech model that can literally bark.) Let's understand what "high-level semantic tokens" means with a prompt: suppose you want to generate the audio "a dog barks loudly". There are three major concepts behind this text prompt: concept one is the dog, concept two is barking, and concept three is loudness. The model tries to capture the essence of the sentence — the context of the prompt — and creates embeddings for these concepts.
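To make the phoneme limitation concrete, here is a tiny toy sketch in Python. The phoneme dictionaries are hand-written illustrative data (not output from a real grapheme-to-phoneme library), but they show why each language needs its own lexicon:

```python
# Toy illustration: phoneme lookups are language-specific.
# The IPA transcriptions below are hand-written toy data, not from a real lexicon.

PHONEMES_EN = {
    "cat": ["k", "æ", "t"],   # the three phonemes P1, P2, P3 from the example
    "dog": ["d", "ɒ", "g"],
}

PHONEMES_DE = {
    "katze": ["k", "a", "t", "s", "ə"],  # German "cat" uses a different phoneme set
}

def to_phoneme_sequence(words, lexicon):
    """Flatten words into the phoneme sequence a traditional TTS
    front end would hand to its waveform generator."""
    seq = []
    for w in words:
        if w not in lexicon:
            raise KeyError(f"no phoneme entry for {w!r} in this language's lexicon")
        seq.extend(lexicon[w])
    return seq

print(to_phoneme_sequence(["cat"], PHONEMES_EN))  # ['k', 'æ', 't']
# An English lexicon cannot transcribe a German word -- this raises KeyError:
# to_phoneme_sequence(["katze"], PHONEMES_EN)
```

This is exactly the generalization problem described above: every new language needs a new phoneme inventory, whereas Bark's semantic tokens are learned from text directly.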
Then an encoder-decoder network comes into the picture: these embeddings, together with the corresponding audio waveforms, are used to train the model, which can then generate audio from a prompt. The interesting part is that this works multilingually and generalizes better than a phoneme-based architecture — that is why Bark uses the high-level semantic-token technique rather than phonemes; it is more flexible and versatile across a wide range of tasks. That's a high-level view of phonemes versus semantic tokens; maybe I will share snapshots of these notes through a Drive link in the description.

Now let's go back to the Bark GitHub repository (link in the description) and use the notebook they provide to run inference. As you can see under the Runtime menu, you need a GPU: go to Runtime > Change runtime type, select GPU, and save. Colab warns that the notebook was not authored by Google; run it anyway. While the dependencies install, let's head over to their samples page. Under speech synthesis they have generated some audio — let's hear the quality: "The model is called Bark, as in Clifford the Big Red Dog, or bark, as in tree bark." Did you hear that? It sounds very natural. They also have a laughter example I'd like to show you: the prompt includes a [laughs] tag, and the generated audio is "Hello, my name is Suno and, uh — and I like pizza. [laughs]" Did you hear that laughter? The audio effect makes it feel like you are listening to a genuinely natural recording, unlike older speech-synthesis models, which are difficult to customize with such settings — presets for fast, slow, and so on. I'm not saying it was impossible before, but this is amazing at first look, and from an open-source model you can run completely locally — even on a CPU, though that takes time, or on a good 4-to-8 GB GPU; the standard requirement I read on Reddit is about 10 GB of VRAM.

Back in the notebook, you can see it installing specific versions of PyTorch (a nightly build) and the other dependencies. Once installed, we import the Bark modules, call preload_models(), and set a text prompt. I'll change the prompt to: "Hello, I am testing a model for text to speech that's called Bark. [clears throat] But I also have constraints, like running it locally." Then we generate the sample. They also have an advanced section with prompts in different languages, so you can pick another language and see whether it generates correctly. There is a multi-speaker format too, with MAN and WOMAN turns, so you can script a two-voice conversation in a single audio clip, and history prompts, where the model conditions on a reference speaker and generates on top of it.

The installation has finished, so let's run the model-loading cell. Loading can take a little time even on Colab, because the model checkpoints and weights — around 5 GB plus — have to be downloaded first. I'll also put the Bark examples page in the description; Suno uses Notion for that documentation, and it lists all the supported languages, including Hindi, which is very interesting and which I'll test. I'll pause the video here and come back once this completes.

OK, it's done — it took a little while to download the checkpoints, which get stored in the local cache directory. Now look at the inference cell: it takes the text prompt variable, passes it to the generate_audio() method imported from bark, and returns an audio array. When playing it back we pass rate=SAMPLE_RATE; I'm not sure what the default sample rate is — maybe 22,000 or so — we'll figure it out. Once it runs, you can see the generation steps ticking by: 20, 21, 22 out of 100.
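Putting the notebook steps together, a minimal inference script looks roughly like the following. The generate_audio, preload_models, and SAMPLE_RATE names come from the Bark quickstart used in the video; the try/except guard is my addition so the sketch degrades gracefully when bark (or scipy) is not installed, and remember that on a CPU-only machine the generate_audio call can take tens of minutes:

```python
# Minimal Bark inference sketch, following the Colab quickstart from the video.
# The import guard is my addition: the script still runs when the `bark` package
# (pip install git+https://github.com/suno-ai/bark.git) is not available.

text_prompt = (
    "Hello, I am testing a model for text to speech that's called Bark. "
    "[clears throat] But I also have constraints, like running it locally."
)

try:
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()                        # downloads ~5 GB of checkpoints on first run
    audio_array = generate_audio(text_prompt)
    write_wav("bark_test.wav", SAMPLE_RATE, audio_array)
    print(f"wrote bark_test.wav at {SAMPLE_RATE} Hz")
except ImportError:
    print("bark (or scipy) not installed -- skipping generation")
```

Note how the non-speech effect is just a bracketed tag inside the prompt string — no separate preset or configuration parameter is needed.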
You may have seen this kind of step-wise progress bar with other autoregressive models — if you have used Tortoise TTS or another text-to-speech model in the past, for example — and with diffusion models as well. This is on a GPU machine; you can run the same model on a CPU, but for a prompt like this it would take 30 minutes to an hour. It does work, though, given a decent 16 GB of RAM and some gigabytes of disk space for the checkpoints. On the GPU side, as the Reddit thread says, about 10 GB of VRAM is needed to run it at a good inference speed. Colab's free tier provides a decent Tesla T4 GPU, and you can see we are now 78 to 80 percent complete.

While we wait, let's hear some more samples — I'm especially interested in the speaker prompts. [music] "Wow, that's expensive!" You can hear two different speakers generated in the same audio, which is very interesting. This would really help if you need to generate a lot of sample data — say, for a contact or call center in your organization. You could generate conversations for training modules, so new joiners can go through simulated training material generated by Bark: pick the WOMAN and MAN voices, pass in any script, and simulate the conversation. You need a GPU machine for that, and note that generation time depends on the length of your prompt — a longer prompt means more steps and, of course, more time.

Next, voice presets and voice cloning. As I was saying, you can also clone voices: you pass a reference in history_prompt. If you want to clone your own voice — or, say, Elon Musk's or Sam Altman's — this is the mechanism you would use. Here's their sample: "I have a silky smooth voice, and today I will tell you about the exercise regimen of the common sloth." Cool. We have heard speaker prompts and presets, so let's hear the music. They claim Bark can generate all types of audio and, in principle, doesn't see a difference between speech and music; sometimes Bark even chooses to render text as music, and you can encourage it by putting music notes (♪) around your lyrics: "♪ In the jungle, the mighty jungle, the lion barks tonight ♪". This is amazing. Just imagine pairing it with a deepfake video — there are plenty of GAN models that can generate a fake face or video, and you could use Bark to voice it. You could create a virtual singer: pass a speaker preset as history_prompt, wrap your lyrics in musical notes, generate a Rihanna-style or Justin Bieber-style voice, combine it with a deepfake video, and post the music video on social media — it could go viral. Our generation is around 54 percent complete, so it needs a little more time. Meanwhile, here is a foreign-language sample... wow. The README says Bark supports various languages out of the box and automatically determines the language from the input text, which is fantastic.
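The speaker presets and music prompts above go through the same generate_audio API; here is a sketch along the lines of the examples page. The preset name "v2/en_speaker_6" is one of the named voice presets shipped with Bark (treat it as an illustrative choice, not a recommendation), and the import guard is again my addition:

```python
# Sketch: speaker presets and music prompts with Bark.
# "v2/en_speaker_6" is an example preset name from Bark's voice library;
# the guard below is mine so this file runs even without bark installed.

music_prompt = "♪ In the jungle, the mighty jungle, the lion barks tonight ♪"
dialog_prompt = (
    "WOMAN: Look at this new text-to-speech model!\n"
    "MAN: Wow, that's expensive!"
)

try:
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()
    # history_prompt conditions generation on a named reference speaker.
    song = generate_audio(music_prompt, history_prompt="v2/en_speaker_6")
    write_wav("bark_song.wav", SAMPLE_RATE, song)
except ImportError:
    print("bark (or scipy) not installed -- skipping generation")
```

The ♪ markers nudge the model toward singing rather than reading the lyrics, and the WOMAN:/MAN: turn format is how the multi-speaker examples script a two-voice clip.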
Notice that we are not passing any extra parameter to specify the language: the function determines it from the input text automatically. That ties back to our earlier discussion of phonemes versus high-level semantic tokens — a phoneme-based approach would struggle with this kind of task, and I find the semantic-token approach very intuitive to understand; I'm really enjoying it. Maybe I will build an application on top of this soon — in fact, in one of the upcoming videos on this channel I'll try to create an entire deepfake music video, combining Bark, Midjourney, and some GAN models. Generation is now 74 percent complete. Also note how high-level the API is: Suno provides high-level APIs, classes, and modules, so you don't have to write much code — you just import a module, pass your prompt, and generate the audio. Still, you should understand the fundamentals behind it: what phonemes are, what semantic tokens are, how an encoder-decoder network works. If you want to learn more, please watch my previous videos, where I cover these fundamentals. We're at 82, now 85 percent; on Colab's free T4 GPU it's still taking around six to eight minutes, so the inference speed is poor — but that is expected, since these are compute-heavy, compute-hungry models, and as I said, you need about 10 GB of VRAM to run it smoothly. Let me know your experience with the Bark model in the comment box, and tell me if you are extending it into some kind of application, like the use cases I discussed: generating data for call and contact centers, simulating training modules for new employees, and creating deepfake music videos by combining these models and techniques.

It says 97 percent... and 100 percent — we finally have our audio. Let's hear it: "Hello, I am testing a model for text to speech that's called Bark. [clears throat] But I also have constraints, like running it locally." Wow, this is great — it is so good. Let me also print the sample rate: 24,000. I really liked it; it's giving good responses. You can also play with the other cells; they take time, so I'm not covering them here, and they are fairly self-explanatory since you are just using Suno's code and documentation. What I wanted to help you understand in this video is phonemes, high-level semantic tokens, and a bit of awareness about Bark — these days a new AI model or tool is released in the community every day and people go crazy about it, so I hope this covered one of them well. That's all for today's video. If you like the content, please hit the like button; if you haven't subscribed to the channel yet, do subscribe, and share the video and channel with your friends and peers. Thank you so much for watching — see you in the next video.
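A closing aside on that 24,000 figure printed at the end: knowing the output sample rate lets you compute a clip's duration directly from the length of the returned array. A quick sketch with stand-in data (a list of zeros instead of real Bark output):

```python
SAMPLE_RATE = 24_000  # Bark's output sample rate, as printed in the video

# Stand-in for a generated clip: 3 seconds of silence instead of real Bark output.
audio_array = [0.0] * (3 * SAMPLE_RATE)

duration_seconds = len(audio_array) / SAMPLE_RATE
print(f"{duration_seconds:.2f} s")  # → 3.00 s
```

The same arithmetic works on the real numpy array that generate_audio returns.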
Info
Channel: AI Anytime
Views: 7,078
Keywords: bark, generative ai, llm, speech, audio, colab
Id: 7np8uOfJfls
Length: 27min 11sec (1631 seconds)
Published: Sat Apr 22 2023