Local AI Voice Cloning with Tortoise TTS - 2024 Installation

Captions

Today's an updated tutorial on how to install Tortoise TTS locally. For those of you who aren't familiar with Tortoise TTS, I'm going to run through a quick demo. Here are a few sentences that I'm going to generate audio for with some models that I've trained, so let's go ahead and take a listen.

"The men gazed at him with scorn and contempt as Subaru appraised them. They appeared to be in their mid-20s. Their clothes were filthy, and it was as though their inner evil had manifested on their faces. They weren't subhuman, but they couldn't be called decent humans either."

"'Damn, a compulsory event.' Facing the grinning men, Subaru wiped his face and stood up in a panic."

All right, and so with that I'm going to jump into the tutorial. For this tutorial you don't need Python or VS Code on your computer. All you're going to need to download is 7-Zip, and you also need to make sure that you have an Nvidia GPU. If you don't have an Nvidia GPU, unfortunately this tutorial and package will not work for you.

Head on over to 7-Zip and download this Windows executable file, then run it and install 7-Zip on your computer. Then you can head on over to my GitHub right here. This is the AI voice cloning repository that I took from the git.ecker page. I just uploaded all the files and then added HiFi-GAN into this repository, because the original author, I believe, is on some type of hiatus, so I went ahead and uploaded it to GitHub to keep it alive and to make any changes needed to keep it running. If you click on this link here, it'll bring you to the original repository.

But let's go ahead and jump into installation. What you need to do is go to this releases area, click on the AI voice cloning link, and then you're going to have to download it from Hugging Face. So click on Hugging Face, and in here you're going to want to download this package. This package is going to be pretty big: after it's done being unzipped, it's going to be about 20 to 22 GB
because of all of the models that I have already put inside of this package, so this may take a couple of hours to download.

After you've finished downloading it, navigate to the folder. You should have 7-Zip here, so go to 7-Zip, then Extract files, and then choose wherever you want to put it. I'm just going to leave it in this folder. This is going to take a little bit of time as well. Once it's done extracting, you can head on into the folder. What I'm going to do is Ctrl+X this to cut it and paste it so that it's just one level above. So here we are: now when I click into it, it's right there.

Once you're in here, all you need to do is go to start.bat and it should start up. You're going to see a terminal window open, and then a link is going to pop up. These errors don't really mean anything, because we don't need those for Tortoise TTS in this case, and everything is going to load here. Here is your local URL, so what you can do is hold Ctrl and left-click it, and it's going to open up in Chrome. If not, all you need to do is type out this address, and it'll bring you to this page. That is all you need to do to get Tortoise TTS installed.

This one is enabled with HiFi-GAN and DeepSpeed as well. Those are features that allow the model to run a little bit faster for inference. If you want to turn those on, head over to the Settings tab here, and check DeepSpeed for speed bump and use HiFi-GAN instead of diffusion. HiFi-GAN's quality may be a little bit lower, but the output will be much, much faster. If you've checked these, you need to restart the GUI, so go ahead and click Reload TTS here. That's going to restart this program, and then you want to wait until it says "Loaded TTS, ready for generation". So let's wait until we see that. Okay, and so once it's
finished loading, we can head on over into the Generate tab and generate a quick sample. In this we have a random voice, and we're just going to use random for now. With that, let's go ahead and click Generate, and it's going to generate an audio sample. That generated pretty quick, so let's go ahead and get a longer sample.

All right, I put a new sentence in here, so I'm going to generate, and it'll come out in about 1 second for 5 seconds of audio with my 4090. The longer the text, the better the output-to-time ratio tends to be. So let's go ahead and play this: "The men gazed at him with scorn and contempt as Subaru appraised them. They appear to be in their mid-20s." So that is pretty good.

If you want to change it back to the diffusion model, head on over into Settings, uncheck "Use HiFi-GAN instead of diffusion", and then once again click on Reload TTS. This window is now going to restart, as you can see, and we'll wait until it says "Loaded TTS".

All righty, it's finished loading the TTS, so we can head on over into Generate. In this case it's going to take quite a bit more time to generate: this one takes about 7 seconds, because I have samples set at 16. If you turn off HiFi-GAN, you can use all of these different settings here. The ones that affect speed are samples and iterations. Iterations affects how long it takes roughly linearly, and samples is kind of exponential. So if I put samples at 2 and iterations at 50, and then scroll down to generate, this one is going to take about 3 seconds. Let's listen to this: "The men gazed at him with scorn and contempt as Subaru appraised them. They appeared to be in their mid-20s."

All right, and since that is a random voice, the output is going to be random each time. But the cool thing about Tortoise TTS, and why I like it a lot, is that you can use it for voice cloning. So let's go ahead and throw in a new voice. I won't be
going over how to actually curate an entire dataset; I do have a video right here. I'll be going over some brief training things later on in the video today, but I'm just going to grab a sample from a dataset that I've already curated.

All right, and with that, to add a voice, what you need to do is get some audio clips. I have five short audio clips, all less than 5 seconds. I just copy these and head over into the AI voice cloning folder, where you go into voices, and in here you want to create a folder with a name. So I'm going to do New, then Folder, and in this case I'm going to call it Melina (this is Melina from Elden Ring), and I'm going to paste those audio samples into here. With that, you can now refresh the voice list in the Gradio interface and it'll pick up that voice.

This is without training; what this is is zero-shot inference. It's taking those clips from that audio folder, creating latents, and then generating audio for the sentence here based on those latents. So if we select Melina here, we can go ahead and generate. I'm going to click Generate, and what it's going to do is create some diffusion latents. If you take a look inside of the folder, I now have a .pth file, and these are the conditioning latents. So let's go ahead and take a listen: "The men gazed at him with scorn and contempt as Subaru appraised them. They appeared to be in their mid-20s." All righty, so if you kind of know how Melina's voice sounds, that is generally close.

Let's go ahead and now compare it with HiFi-GAN. I'm going to check "Use HiFi-GAN instead of diffusion" and click on Reload TTS. Now that TTS is loaded, I can go ahead and click Generate, and what you're going to end up running into is this error. So what happened? Well, what ends up happening is that there is some latent mismatch between the HiFi-GAN model and the diffusion model. What you have to do is click this Recompute Voice Latents.
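The voice-folder steps described above (copy a few short clips into voices, inside a folder named after the voice, then look for the cached .pth latents) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names, the package folder name used in the usage note, and the idea of scripting this at all are hypothetical and not part of the package. Only the voices folder, the per-voice subfolder, and the .pth latent file come from the walkthrough.

```python
import shutil
from pathlib import Path


def add_voice(package_root, voice_name, clips):
    """Copy reference clips into <package_root>/voices/<voice_name>/.

    Assumption: 'voices' sits directly inside the unzipped package
    folder, as shown in the walkthrough. This helper is hypothetical.
    """
    voice_dir = Path(package_root) / "voices" / voice_name
    voice_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for clip in map(Path, clips):
        # copy2 keeps timestamps; the destination keeps the clip's file name
        copied.append(Path(shutil.copy2(clip, voice_dir / clip.name)))
    return copied


def cached_latents(voice_dir):
    """List the .pth files in a voice folder (the cached conditioning latents)."""
    return sorted(Path(voice_dir).glob("*.pth"))
```

For example, `add_voice("ai-voice-cloning", "Melina", ["clip1.wav", "clip2.wav"])` would mirror the manual copy and paste, where the folder and clip names are placeholders. After that, the voice list still has to be refreshed in the interface, and the latents still have to be recomputed with the button when switching between diffusion and HiFi-GAN.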
So click on Recompute Voice Latents, and then we can do Generate. I wasn't in the mood to completely automate that process, so I just created an error so that you could click on this button that's already here. Sorry about that. Now that that is done, you can just click Generate, and it'll generate the audio sample. So let's take a listen: "The men gazed at him with scorn and contempt as Subaru appraised them. They appear to be in their mids." All righty, and that is not too bad.

So those are the basics of inference for voice cloning, if you want to voice clone with just some samples for zero-shot inference. What I'm going to show you next is how to train a voice. Once again, I'm not going to show entirely how to get a dataset, just how the process works once you do have your dataset. Luckily, you don't need to split the dataset super carefully if you don't want to, but I do advise that you take a look at this video right here to get a feel for how dataset curation works. It's kind of the same process I use with RVC, so that's what that video is for.

But let's say that you just have one long YouTube video, like this video right here (this is me; this is one of my videos). What you want to do is separate the vocals from any background music. An easy way to do that is to head on over to Ultimate Vocal Remover and download this application, UVR. You can just scroll down to the Windows mirror here and install it via the installation executable there. Once you have it, you can select MDX, and yours will be on something like UVR-MDX-NET, and then we can extract the vocals with this. The easiest way to do that is to put two windows side by side, UVR and a File Explorer, and then drag your YouTube video into this input area right here, or you could just click on Select Input. Then for output, you can put the output wherever you want.
What I'm going to actually do in this case is go to AI voice cloning, head to voices, and create a new folder called me. Then I'm going to drag me into the output here, and that's going to change the output folder. The next thing you want to do is make sure the output format is a WAV file. You can leave all of these here, and I'm just going to do Vocals Only and then click on Start Processing. You want to make sure this is on GPU Conversion, as the speed of this will depend on your GPU, so it may take some time depending on what graphics card you have.

All right, so for me that took about half a minute to get that extracted. If we head to the me folder, we'll have a WAV file with the vocals extracted. And so, yeah, this is me. So cool.

If you've followed along, in concept, inside of the voices folder you should have a folder with a name, and inside of that folder you should have some type of WAV file or some type of training data file. The longer it is, the longer training is going to take.

With that, head to Generate and click on Refresh Voice List, and that's going to allow us to select the data source. I'm going to select me, leave the language at English because it's English, and leave all these parameters the same. In here I'm going to change the Whisper model to large-v2. If you have less VRAM, or if you want it to run faster, you can use base or medium, but in this case I'm going to do large-v2. If you do want it to be a little bit faster, you can also change the backend to WhisperX and use large-v2 for WhisperX, but this is going to incur an additional download of about 3 GB for models. If you choose WhisperX, which is actually much faster, it's going to be inside of models, where a new folder called hugging face is going to be created with that WhisperX
model inside of it. But I'm going to head on back over to Whisper. The ones that are available are the ones that are inside of AI voice cloning, models, and then whisper, and those are large-v2, medium, and small. If you do choose to use any of the other ones, it will download those into that folder. The reason that I omitted them was to make the download for the package a little bit smaller.

With that little explanation, click on Transcribe and Process, and it's going to create us a dataset. Let's just wait for this to finish up. As you can see, the progress bar is slowly going along. This may take longer depending on how large your file is, and it would be a little bit faster if you were using WhisperX.

All righty, and so we are finished here. One quick check we want to do is head on over into the AI voice cloning folder. Your curated dataset is going to be put into a new folder called training. Go into training and you'll see a folder with the name matching the one that you put into voices. Go into there and you will have three files (train, validation, and whisper.json) along with an audio folder, and here are all of the split audio clips.

Once we've confirmed we do have files in here, we can head on over to Generate Configuration. This is the configuration for training. What we're going to do is put epochs at, let's say, 200. We're going to leave the learning rate the same here, I'm going to leave the mel LR ratio the same, and the text LR ratio as well. For learning rate scheme, I'm going to select cosine annealing, which is going to enable learning rate restarts during the training. For batch size, I'm just going to let it figure it out on its own.

The next important one is save frequency. This tells the trainer how frequently it should save the model. Basically, an epoch is one run through the entire dataset, so if you have 100 data samples, for the model to see all 100 samples, that is
equivalent to one epoch. In this case I want to save every 40 epochs to save a little bit on disk space, and I'll show you why a little bit later. All of these other things you can leave the same. Then you want to click on Refresh Dataset List, click on this text box here, and click on the name of the dataset that you have.

Now we're going to do Validate Training Configuration. As you can see, the batch size changed on its own, but this is actually probably going to throw an error. What you want to make sure is that batch size is evenly divisible by gradient accumulation size. I'm going to do something along the lines of 80 and 20: 80 divided by 20 gives you four, so that'll work. You just want to make sure that there's no remainder. Once we've validated the training configuration, we can do Save Training Configuration. So that is important: do make sure that your batch size is divisible by the gradient accumulation size, or you may run into an error when you're training. I believe it's something along the lines of "index out of range", but I forget.

All right, now with the training configuration saved, you can head into Run Training, go to Refresh Configurations, and select your train.yaml. Now all you need to do is head to train and click on the Train button. On the right-hand side here you have a loss graph, and the important thing is that it's continually decreasing. Just a quick rundown on what this graph means: in general, the lower the value, the better the model. So let's wait 20 minutes. I'm going to have the model finish training, and then I'll do a little bit more explaining.

So training is done. What I was talking about earlier is that the lower points on the top graph here usually indicate the better model to take a listen to, and as you can see, it was decreasing the entire time, and at the end we have epoch 200. If you want to find where those models are, if you go
over into the training folder (once again, that's inside of AI voice cloning, training, inside of the name that you have here), it's inside of finetune and then models. These are pretty hefty, about 1.5 to 1.6 GB each. To save space, what I generally like to do is just delete the first couple of models, as generally they are lower quality. I say generally because sometimes they can be higher quality than the later epochs; that is what would be classified as overtraining, but I don't see it happen too often. And then training states: I just delete all of the training states. That leaves me with a total of 1.6 GB for a trained voice.

There's one thing that you want to do after training, before we actually use the model, and that is to take the audio files and put them somewhere else, so that we don't end up creating a latent from all of those audio files, which may result in inference taking a very long time. Create a new folder, let's call it backup, and put everything else that's in this folder into backup, so it should look like finetune and then backup.

With that, inside of backup, I'm going to sort by size and grab two audio files from here, because these are already split. This is a 15-second audio file, and actually that is a little bit too long, so I want to take a listen to some of these. These are about 9 seconds, about 429 kilobytes. What I want to do is copy these, and this is going to be the same thing as what we did earlier with the Melina voice. We're going to head over into voices, and inside of me, which is the name that I used, I'm going to delete the original training file, the one that all of those audio samples got cut from. I'm going to delete that and paste in those two short samples. This process is very important, so make sure that you follow along here if you want to use your models. We can now head on
over and use that trained model. Let's head to Settings and click on Refresh Model List down here. Then inside of Autoregressive Model, you can select the finetune that you have here, so I select me. Once you have that selected, click on Reload TTS.

All right, so TTS is now loaded, and inside of Voice here we can select me and run inference again. I do have HiFi-GAN and DeepSpeed enabled, so I'm going to click on Generate. It's going to create the latents first and then create the audio sample. Let's go ahead and regenerate, just to see how fast it actually is with the latents already computed, and it's about 3 seconds. Let's take a listen: "This is a cloned version of my voice. The man Gaz at him with score Warn and contempt as Subaru praised them, as Subaru praised them. They appeared to be in the mid 20s."

All right, so that was pretty terrible, but once again, this isn't a video on how to get the best trained model; it's just how to train a model real quick. If I did want to increase the quality of that training, I would curate the dataset and probably run the training for a little bit longer, so I might do 500 epochs in this case, because the amount of data that I was training on was about 9 minutes. What I generally do is, for about an hour of data, train for 200 or so epochs. But all of that is a lot of experimentation for voices. I do have a video on how to make better models inside of Tortoise TTS, where I go through some of the training that I did. And yeah, there's a lot that goes into this, so you may have to experiment quite a bit.

If you run into any issues, the easiest way for me to track them is for you to head on over to the GitHub for AI voice cloning, go into Issues, and click on New issue
and then create a new issue. Describe your issue with a nice title, and try to give me as much information in the description as possible. I don't have a template here, but if I need to make one, I will. There will be small, minor adjustments here and there, but I don't plan on actively working on or improving this project; I just want to make sure that it stays alive.

All righty, and that is it. Once again, this isn't my repository originally. It was by a guy named mrq, but he seems to be inactive, so I went ahead and re-uploaded it onto my GitHub to make sure that these tutorials can stay alive and that people can still use the Tortoise TTS model. If you have any questions or comments, please leave a comment down below and let me know. Once again, thank you to my members for supporting the channel and helping me out. If you found anything useful, like, subscribe, all of those things, and I will see you all in a future video. Catch you later.
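The arithmetic from the training configuration step in the captions (batch size evenly divisible by gradient accumulation size, and a save every 40 epochs over 200 epochs) can be checked with a small Python sketch like the one below. It is not the package's own validation: the function name, the parameter names, and the returned keys are hypothetical. The roughly 1.6 GB per saved model is the figure quoted in the video, and the count of saved models assumes one save at every multiple of the save frequency.

```python
def check_training_config(batch_size, grad_accum, epochs, save_every,
                          gb_per_checkpoint=1.6):
    """Sanity-check the numbers before saving a training configuration.

    Hypothetical helper; it only reproduces the arithmetic from the video.
    """
    if batch_size % grad_accum != 0:
        # The video warns that a remainder here can cause an error in training.
        raise ValueError(
            f"batch size {batch_size} is not evenly divisible by "
            f"gradient accumulation size {grad_accum}"
        )
    # Assumption: one model is saved at each multiple of save_every.
    checkpoints = epochs // save_every
    return {
        "quotient": batch_size // grad_accum,
        "checkpoints": checkpoints,
        "approx_disk_gb": checkpoints * gb_per_checkpoint,
    }
```

With the values used in the video, `check_training_config(80, 20, 200, 40)` gives a quotient of 4 and five saved models, which at about 1.6 GB each is why deleting the earlier ones frees a noticeable amount of disk space.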

Info

Channel: Jarods Journey

Views: 7,843

Id: p31Ax_A5VKA

Length: 21min 45sec (1305 seconds)

Published: Mon Dec 18 2023