FREE AI Voice Tool - Best Open Source AI Text-to-Speech is out!

Video Statistics and Information

Captions
"Hello, my name is Abdul and, uh, um, and I like making YouTube videos."

This is a completely new text-to-speech model. It's open source, it's quite amazing, and it's nothing like what we have seen before. If you have used ElevenLabs before, this is brilliant stuff just like that, but open source: you give it a text and it generates the speech for you with natural-sounding voices. For example, you can add laughter and you can add breaks. It's quite amazing, and in this video I want to show you how you can use it on Google Colab and also go over the details of the model itself.

The model is called Bark. Yeah, it's actually Bark. There's a company called Suno AI, and this company has created this model. Bark is a transformer-based text-to-audio model created by Suno. Before I show you more: the first clip that you listened to is actually completely AI generated. You can see here I've given the text "Hello, my name is Abdul and, um, I like making YouTube videos." I wanted to clear the throat here and I wanted to have a laughter here, so I can go here and then run this: "Hello, my name is Abdul and, uh, um, and I like making YouTube videos." Absolutely. See: "um, I like making YouTube videos."

So what is this model? Like I said, this is a transformer-based model that can do text to audio, from this company called suno.ai. Bark can generate highly realistic multilingual speech as well as other audio, including music, background noise, and simple sound effects. The model can also produce non-verbal communication like laughing, sighing, and crying. Let's look at a demo. What does this demo say? "Hello, my name is Suno and, uh, I like pizza", then laughs, "but I also have other interests such as playing tic-tac-toe." Let's see what the AI actually does: "...and, uh, and I like pizza, but, um, I also have other interests such as playing tic-tac-toe." It's super impressive. It can also do multilingual speech, which means... oh, I have to enable the audio, yes: "...but I suppose your English isn't terrible." "My English isn't terrible." That's okay. And it can generate music. [Music]

It can also do voice cloning, if you are looking for software that can do cloning. I don't know the legal implications. You cannot use the software or the model for commercial applications at this point, so do not think that you can build the next deepfake tool or something. But you can do cloning; that's what they're saying: Bark has the capability to fully clone voices, including tone, pitch, emotion, and prosody. The model also attempts to preserve music, ambient noise, etc. from the input audio. However, to mitigate misuse of this technology, they limit the audio history prompts to a limited set of Suno-provided, fully synthetic options to choose from for each language. You can see the details here: you cannot literally take any celebrity's voice or anybody's voice and clone it, so it's quite limited. It can also do speaker prompts; you can have a woman and a man: "...please." "Wow, that's expensive!" And this is all purely based on the text prompt, so the text prompt is the core here.

Let's see how this actually happens. Some time back Microsoft released an amazing paper called VALL-E, but the model was never released. Similar to VALL-E and some other amazing work in the field, Bark uses a GPT-style model to generate audio from scratch, completely from scratch. Different from VALL-E, the initial prompt is embedded into high-level semantic tokens without the use of phonemes. So whatever prompt you give gets translated into high-level semantic tokens. It can therefore generalize to arbitrary instructions beyond speech that occur in the training data, which is how they can capture music lyrics, sound effects, and all the other things. A subsequent second model is used to convert the generated semantic tokens into audio codec tokens to generate the full waveform. So the words become high-level semantic tokens, and those high-level semantic tokens are then converted into audio form using the codec that Facebook had released.

It can also do these kinds of non-speech sounds. So far they have figured out laughter, laughs, sighs, music, gasps, clears throat, hesitations, song lyrics, capitalization for emphasis on a word, and man and woman so that you have different kinds of speakers. What languages does it support? English, German, Spanish, French, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Turkish, and Chinese (simplified); Arabic, Bengali, and Telugu are coming soon. And you have got some appreciation here.

The main thing that you need to notice is that Bark is licensed under a non-commercial license, CC-BY 4.0 NC. The Suno models themselves may be used commercially; however, this version of Bark uses EnCodec as a neural codec backend, which is licensed under a non-commercial license. "Please contact us at bark@suno.ai if you need access to a larger version of the model" or a version of the model that you can use commercially. And if you want to use a playground, you can go here and sign up for the playground, where you don't have to run this entire thing on Google Colab. This is pretty interesting and pretty amazing, I would say. My mind was blown when I actually saw this thing, so that's why I ended up making the video.

Now go to the Google Colab that they have given. You can also open it in Hugging Face Spaces and play with it, but as usual there is a huge queue, so open it in Colab and try it there. It's quite simple. The first thing is to go to Runtime, change the runtime type, and make sure you have got a GPU. Then install everything that they've given here: you have to install Bark, torchvision, torchaudio, and also the right CUDA version for you. Once you install it, all you have to do is, from bark, import SAMPLE_RATE, generate_audio, and preload_models; the other import is just to display the audio in Google Colab. Preload the models. It takes a bit of time, but as you can see, I have successfully managed to run this on free Colab. Then give the text prompt, "Hello, my name is Abdul, um", then whatever fillers you want, whatever non-verbal communication you want, "and I like making YouTube videos." Then generate the audio using the text prompt, take the audio array, and give it to Audio so that it can actually play, like this. Let me play this again: "...and, uh, I like making YouTube videos." This is unbelievable.

You have a lot more options in the Google Colab notebook if you want; you can see man and woman and all these kinds of examples, and I would strongly encourage you to play with this. If you like this project, go ahead and star the repository. It should mean a lot to the developers. We have now got the GPT equivalent of text-to-audio models, and it is open source. Even though it is not allowed to be used commercially, it is still open source, and I highly appreciate it. The team is quite active on GitHub as well. So thank you so much, Suno AI, for making this library and making it open source. I would love to hear from you how you feel about this new text-to-audio model. See you in another video.
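The Colab steps described in the captions can be sketched roughly as follows. This is a minimal illustration rather than the notebook's exact code: `SAMPLE_RATE`, `generate_audio`, and `preload_models` are the names imported from `bark` in the video, while `NONVERBAL_TAGS`, `build_prompt`, and `synthesize` are hypothetical helpers introduced here, and the square-bracket spelling of the cues is an assumption, since the captions only name the cues and do not show how they are typed.

```python
# Non-verbal cues named in the video. The bracketed spelling is an assumption;
# the captions only list the cue names.
NONVERBAL_TAGS = {
    "laughter": "[laughter]",
    "laughs": "[laughs]",
    "sighs": "[sighs]",
    "music": "[music]",
    "gasps": "[gasps]",
    "clears throat": "[clears throat]",
}


def build_prompt(*parts):
    """Hypothetical helper: join text fragments and cue names into one prompt.

    A part that matches a cue name is replaced by its tag; anything else is
    passed through unchanged.
    """
    return " ".join(NONVERBAL_TAGS.get(part, part) for part in parts)


def synthesize(text_prompt):
    """Hypothetical wrapper around the three names imported in the video.

    bark is imported lazily so build_prompt works without it installed.
    Requires the bark package and, realistically, a GPU runtime.
    """
    from bark import SAMPLE_RATE, generate_audio, preload_models

    preload_models()  # downloads and loads the models; slow on first run
    audio_array = generate_audio(text_prompt)
    return audio_array, SAMPLE_RATE


prompt = build_prompt(
    "Hello, my name is Abdul and, uh,",
    "clears throat",
    "and I like making",
    "laughs",
    "YouTube videos.",
)
print(prompt)
```

In a Colab cell, something like `audio, rate = synthesize(prompt)` followed by `IPython.display.Audio(audio, rate=rate)` would then play the result, which matches the "take the audio array and give it to Audio" step in the captions.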
Info
Channel: 1littlecoder
Views: 74,961
Keywords: ai, machine learning, artificial intelligence
Id: 84LzaXAo6vE
Length: 8min 9sec (489 seconds)
Published: Thu Apr 20 2023