How to Clone Most Languages Using Tortoise TTS - AI Voice Cloning

Captions
What's up YouTube! Today I'm showing you how to install the latest version of the AI voice cloning repository. This is the one that lets you train other languages, and it includes all the tools I used in my previous videos on the topic. You do need an NVIDIA GPU for this to work, so if you don't have one, this repository isn't for you. But first, let's take a look at a few examples.

This is not a package installation today; I'm going to show you how to set it up with Python and Git, so those are the two prerequisites you'll need. For Git, go to the Git website, click the "Download for Windows" option, and select the standalone installer. For Python you need 3.11; click the Windows 64-bit installer, run through the setup, and make sure you check "Add to PATH" during installation.

Once we're on the GitHub page, click the Code button and copy the URL to the clipboard. Then, with Python installed, open a command prompt wherever you want. A quick way to do that in Windows: navigate to the desktop in File Explorer, type "cmd" into the address bar, and a command prompt opens right inside that folder. From there, type "git clone", paste in the link, and it will clone my repository. You'll see the repository has been cloned onto the desktop, and I'll double-click into it so we can take a look.

Once you're in here, I've put everything into a batch file, so it should be pretty hands-off: click setup-cuda.bat and it will run all of the installation. Just let it run; it takes quite a bit of time because it has to download a lot. One thing: if you run into any CUDA issues, make sure your GPU drivers are up to date, or at least at CUDA 12.1, which is what I have on this machine. On NVIDIA's CUDA download page you'd pick Windows, x86_64, Windows 11, and the exe installer; just make sure those are up to date if you're running into issues.

Once it's finished, it will boot up automatically and download the additional models. Looking at the screen, you can see a bunch of downloads occurred. Scroll up a little until you see "Running on local URL", then hold Ctrl and left-click it and it will open in your browser. If it doesn't, just type "localhost", a colon, and "7860" into your browser, and that brings you into the Gradio interface.

Here is your prompt. First we want to generate some audio to make sure everything is working: set samples down to 2, leave iterations at 30, and use the random voice, then generate the sentence. If you go into the command-line window and scroll back down, you'll know everything is going well if there are no errors and you can see it processing. Here's the audio sample; let's take a listen. Alrighty, so we have it working. A couple of additional settings are DeepSpeed and HiFi-GAN, but I'll show those a little later.
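As a quick sanity check on the driver requirement mentioned above (CUDA 12.1 or newer), here's a minimal, hypothetical version-comparison helper. It's not part of the repository, just an illustration of the check; in practice you'd read the installed version from `nvidia-smi` output.

```python
def parse_cuda_version(version: str) -> tuple[int, int]:
    """Parse a 'major.minor' CUDA version string into a comparable tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

def cuda_version_ok(installed: str, required: str = "12.1") -> bool:
    """Return True if the installed CUDA version meets the minimum (12.1 here)."""
    return parse_cuda_version(installed) >= parse_cuda_version(required)
```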
Now that we've got that working, let's train a voice, and what you're going to need are some audio files. In this case I already have a few on my desktop, two video files, so I'll copy those. Then navigate into the AI voice cloning repository, go into the voices folder, and create a new folder inside it. You can name it anything; I'm just going to call it "me". Paste your audio samples in there. If they're in any format other than MP3, training prep takes a little extra time because everything gets converted to MP3 first; the reason I use MP3 is that it's much more compressed and uses less space.

Now that our dataset is inside the voices folder, go into the Training tab, then "Prepare dataset for large files", and refresh this area: go into Generate, click "Refresh voice list", then go back into Training, and you'll see our dataset source, "me".

There are a few things you can modify in this tab. There's the continuation directory: if something happens while Whisper is transcribing your dataset, you can continue from that directory, which I'll cover a little later. The other parameters you can adjust are: language, which is English in this case, but for other languages you'll use the ISO 639 language code, so just search up the ISO 639 code for your language and put it in here. Chunk size we'd normally leave at 15, but for some languages you might want to go smaller, so let's do 10; this controls how long the segments are that Whisper cuts the audio files into. "Processes to use" is based on your CPU and is required for pre-processing the data; it populates automatically from your CPU specs, so you can leave it at the default.
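To illustrate the dataset layout just described, here's a small hypothetical sketch (not the repository's code) that creates a voices/&lt;name&gt; folder and reports which source files would still need MP3 conversion:

```python
from pathlib import Path

def prepare_voice_folder(voices_root: str, name: str, sources: list[str]) -> list[str]:
    """Create voices/<name> and report which source files are not already MP3.

    Files in other formats (e.g. MP4) would need conversion before training,
    which is why non-MP3 datasets take a little longer to prepare."""
    folder = Path(voices_root) / name
    folder.mkdir(parents=True, exist_ok=True)
    return [s for s in sources if Path(s).suffix.lower() != ".mp3"]
```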
If you're doing any language other than English, I recommend disabling WhisperX alignment, since some languages don't have WhisperX alignment models. For "rename audio files" I'd suggest yes, because file names that aren't UTF-8 can cause issues. And if you don't want it to delete the original files inside the voices folder (in this case the MP4 files), leave "keep original files" checked. Since I copied mine from another location, I'm going to uncheck it so it deletes them; my recommendation is to always have a backup of your dataset before you run any of this.

Now click "Transcribe and process" and it will run through everything. You can see it running the WhisperX steps, and it will download the model. By default it uses Whisper large-v3, and I have that hardcoded; maybe I'll make it a configurable option in the future, but as of right now it's hardcoded to large-v3. We'll wait for that to finish up.

Once it's done, you'll see a lot of lines pop up in the terminal and then a "transcription and processing completed successfully" message. Back in the file browser: since I unchecked "keep original files", I now have just the MP3 files in the voices folder. Now go back into the AI voice cloning root folder, into "training", and you'll find a folder with the name you used inside "voices". It contains a train.txt, a validation.txt, and an audio folder holding all of the segments that were transcribed and cut from the dataset you provided. I'm using a very small dataset in this case, and at the end of the video I'll talk more about some of the best practices I've come to figure out for these datasets, so stick around if you want to know more about the process.
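The tool writes train.txt and validation.txt for you; purely as an illustration of that idea (the repository's actual split logic may differ), a list of transcribed segment lines could be divided like this:

```python
def split_dataset(lines: list[str], val_fraction: float = 0.05) -> tuple[list[str], list[str]]:
    """Split transcribed segment lines into train/validation lists.

    Keeps at least one validation line when there are two or more lines;
    a single-line dataset all goes to training. val_fraction is an assumed
    default, not the repository's."""
    n_val = max(1, int(len(lines) * val_fraction)) if len(lines) > 1 else 0
    return lines[:len(lines) - n_val], lines[len(lines) - n_val:]
```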
With that, let's continue along with how the tool works. Once the transcription is finished, we want to click "Create BPE tokenizer". This creates a tokenizer and is pretty quick. Once it's done, you'll find it inside the AI voice cloning root under models/tokenizers; in this case I have an English tokenizer. It takes the name you chose and appends "_tokenizer" to it. Before we run training, we want to select that tokenizer: you'll find it in Settings, and clicking "Refresh model list" refreshes whatever models you have so you can select it. Do that before you run any training.

Once you have that, head over to the "Generate configuration" tab for training. You can select however many epochs you want; generally, more epochs means you train the model longer. Just for the sake of the video I'm going to do one epoch. For the learning rate, for other languages you want to crank this all the way up to max, and the text learning rate you also want to set to 1; in my experience this is what works best. For the scheduler, cosine annealing is generally fine, so I go with cosine annealing and set learning rate restarts to 4. Batch size I usually just leave as is, and it gets validated automatically. Then select the save frequency in epochs. Since I only have one epoch here I can only choose up to one, but if you had 10, you'd probably want to divide the epochs by two and put that number in save frequency, just so you don't end up with too many save states.
In this case I'm going with one and one. Then down here, click "Refresh dataset list" so it pulls in the dataset you have. Now we can click "Validate configuration", and one peculiar thing you'll notice is that we've got a batch size of 101 and a gradient accumulation size of 25. This will result in an error: you want the batch size to be evenly divisible by the gradient accumulation size, or you'll run into issues. Doing some mental math, let's make it 125, and now we can save the configuration.

Once we've got that, go to "Run training", refresh the configuration so you can find the training file, select the name right there, and click "Train". Now we hop back into the command-line window, and you'll see it download the dvae.pth file and then start training. If you run into any issues, an error will usually pop up inside the training window.

And I actually ran into an issue here: it says division by zero is not possible. I forgot this is an issue; it's because my configuration has fewer epochs than learning rate restarts. In this case let's just do four epochs and re-save the configuration. Also, once you've run one training, you have to click "Reload TTS" so it refreshes and relaunches the terminal process, so now we just wait for this to load back up. Alright, it says "Loaded TTS, ready for generation", so we go back into "Run training", refresh configs, and run training once again. This time it should go through the process and start training.
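Two of the pitfalls above (a batch size that isn't divisible by the gradient accumulation size, and more learning-rate restarts than epochs causing a division by zero) are easy to check up front. Here's a hypothetical pre-flight check, not the tool's actual validation code, alongside the save-frequency rule of thumb from earlier:

```python
def validate_training_config(epochs: int, batch_size: int,
                             grad_accum: int, lr_restarts: int) -> list[str]:
    """Return a list of configuration problems (empty list means OK)."""
    problems = []
    if batch_size % grad_accum != 0:
        problems.append(f"batch size {batch_size} not divisible by "
                        f"gradient accumulation size {grad_accum}")
    if lr_restarts > 0 and epochs // lr_restarts == 0:
        problems.append(f"epochs ({epochs}) fewer than LR restarts "
                        f"({lr_restarts}): causes division by zero")
    return problems

def suggested_save_frequency(epochs: int) -> int:
    """Rule of thumb from the video: save every epochs // 2, at least 1."""
    return max(1, epochs // 2)
```

For example, the 101/25 combination from the video fails both the divisibility check (with one epoch) until it's bumped to 125 and four epochs.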
Alright, and now you can see the graphs have populated, and you'll see training continue along. It should give you a rough estimate of the ETA; in this case it's giving me 23 seconds per epoch with an estimated time of 3 minutes, but I've found this isn't always accurate, so you may just have to let training run.

Alrighty, training is done. You'll see the console stop glowing orange once it's finished, and you'll also see "finished training" inside the terminal. Now we just want to confirm we've got models: go back into the root of the AI voice cloning repository, into "training", then "me", and you'll see a folder called "finetune", and inside it the models you're going to use. You can actually delete all of the training state files; they took up an additional 9 GB on my computer, which is a lot of space. Once we've confirmed those model files exist, go into Settings, click "Refresh model list", and scroll down to whatever model you want to try out; in this case we're going to use the finetuned gpt.pth model.
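Deleting the training-state checkpoints (the ones that took up 9 GB here) can be scripted. A cautious, hypothetical sketch, assuming the checkpoints live in a training_state subfolder next to the models as shown in the video; verify the paths before deleting for real:

```python
import shutil
from pathlib import Path

def clear_training_state(finetune_dir: str) -> int:
    """Delete everything under <finetune_dir>/training_state, keeping models/.

    Returns the number of files removed. The folder names are assumptions
    based on the layout shown in the video."""
    state_dir = Path(finetune_dir) / "training_state"
    if not state_dir.is_dir():
        return 0
    count = sum(1 for p in state_dir.rglob("*") if p.is_file())
    shutil.rmtree(state_dir)
    return count
```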
For demonstration's sake, I'm going to enable "Use DeepSpeed" for a speed bump, and "Use HiFi-GAN" instead of diffusion; these let you generate faster. Any time you change settings in here, click "Reload TTS", which reruns the script, so we'll wait a little for that.

While that's going, we need some reference audio for Tortoise in order for it to run. The "me" folder currently has files that are too large to use as reference audio, so let's delete those, since we don't need them anymore, and take two audio files from the training output for the model we're going to use. I'll sort by size and take the two largest. You can listen to them first to make sure there aren't any cuts or weird audio artifacts before you move them, but I'm just going to take the two largest, go back into the root, into "voices", into that folder I deleted the big files from, and paste them in. We've now got two reference audio files inside the voices folder, and we'll use those for generation.

There's one more thing to do: we need to move some files. Go back into "training", into your named folder, and create a folder called "backup". You don't strictly need to do this (you could just delete everything), but I'm going to create a backup folder and move everything into it, so all you're left with is "finetune" and "backup". The reason is that if you don't, then when you try to use the voice inside the Generate tab (the "me" voice in this case, or whatever you named it), it will try to use all of the audio you trained with as reference audio.
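Picking the two largest clips as reference audio, as done above, can be sketched like this. This is illustrative only (the function and folder names are mine, not the tool's), and you should still listen to the files first to catch cuts or artifacts:

```python
import shutil
from pathlib import Path

def copy_largest_clips(src_dir: str, dest_dir: str, count: int = 2) -> list[str]:
    """Copy the `count` largest .mp3 files from src_dir into dest_dir.

    Returns the copied file names, largest first."""
    clips = sorted(Path(src_dir).glob("*.mp3"),
                   key=lambda p: p.stat().st_size, reverse=True)[:count]
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    for clip in clips:
        shutil.copy2(clip, Path(dest_dir) / clip.name)
    return [c.name for c in clips]
```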
So if you had 30 hours of training data, it would try to use 30 hours of reference audio, which would take forever. I know this is kind of bulky, but that's the process for now, until I can get around to adjusting it in the code.

Now that we've got a model trained, selected it in Settings as the autoregressive model, and enabled DeepSpeed and HiFi-GAN, we can generate. Refresh the voice list, select the voice (and model) you want to use, and change the prompt; I'm going to say "thank you for watching the YouTube video, let me know if you have any questions" and run with that. One important distinction with HiFi-GAN is that you don't use the samples and iterations sliders; those are for the diffusion model, so you can just leave them at their defaults. Click "Show experimental settings" and we've got a couple of other options, but right now I'm just going to click Generate so we can first see how this works. It runs through the process and should generate pretty fast: in this case it took 4 seconds on my 3060 for 5 seconds of audio. Let's listen to the final audio real quick. "Thank you for watching the YouTube video, let me know if you have any questions." So that's my trained model. The reason it sounds terrible is that I trained it on a small amount of audio, and only for four epochs, so it's not enough.
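The 4 seconds of compute for 5 seconds of audio mentioned above works out to a real-time factor below 1, i.e. faster than real time. Written out as a trivial helper (my own arithmetic, not part of the tool):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = compute time / audio duration; below 1.0 means faster than real time."""
    return generation_seconds / audio_seconds
```

By the same arithmetic, the diffusion run later in the video (9 seconds for the same sentence) comes out to an RTF of 1.8, slower than real time.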
Now I'll talk a little more about the other settings in here. You can also use RVC voice models: I won't go into much detail since I have a previous video on it, but basically you go into the AI voice cloning root folder, into "models", put whatever RVC voice models you want inside the RVC models folder, and then you can select and use them as you normally would with RVC. I don't have any voice models on this computer, but it works exactly the same as in the other video. As for the remaining settings, I usually leave most of them at default; I might move the length penalty to 8 and the repetition penalty up to 8, which can sometimes help if your model is repeating words, artifacting, or glitching.

Now I'll show you how to run it without HiFi-GAN. Head into Settings, turn HiFi-GAN off so it falls back to diffusion, and reload the TTS one more time; you always want to make sure it says "Loaded TTS, ready for generation" before you do any generation inside the web interface. Now we can play around with these options. My recommendation is to leave samples at the lowest amount, or you can put it at something like 4, and leave iterations anywhere between 30 and 50, so I'm going to do 4 and 50 and generate some audio. In this case I get a possible latent mismatch; this is because I previously used HiFi-GAN, so I just need to click the "Recompute voice latents" button, which recomputes those latents, and then regenerate. This takes a little more time because the diffusion model is pretty heavy: now 9 seconds instead of 4 for the same sentence. Let's take a listen. "Thank you for watching the YouTube video." Alrighty, it just completely dropped the last part of the sentence, but once again that's because the model is lightly trained and only trained on a small amount of audio. Other than that, for the remaining options: if you have a low amount of VRAM, you can try selecting the low-VRAM option, which loads the models only when you use them for generation rather than on boot-up, so that you can at least open the interface.
And that's about it for generation. Now, as promised, I'm going to talk about training other languages and some tips I can give you. When training other languages, make sure you use the "Prepare dataset for large files" window, and you're going to want a lot of data: I recommend a minimum of around 25 hours of audio in the target language. I'd also recommend isolating the vocals, using a tool like UVR to remove background music from the files, though in my testing I've found Tortoise is actually pretty robust; as long as the majority of the audio is speech, it should be fine. But you want 25 hours or more. In my training I used up to 1,500 hours of audio, all transcribed through this "prepare for large files" process, and it can take days to transcribe a dataset that size.

If you need to come back to it later, say you run into some issues, I'll show you real quick how to continue. I'm going to move those files back into "voices", select the dataset source, and click "Transcribe and process". Alright, it finished transcribing one audio segment, and I'm going to exit the command-line window to simulate a crash or needing to restart. To launch the program again after closing it, just run start.bat and it will boot back up. Now, in order to see the continuation directory when you only have one folder, there's a little bug I haven't fixed yet: you need two folders inside "voices". So create, say, a mock folder and throw a single audio file into it.
Now, since the continuation directory doesn't show up when we click it, go into Generate, refresh the voice list, and then you can click on "mock", click on "me", and you'll see there's a continuation directory called "run" here. It's a bit of a roundabout way until I can fix that bug, but it lets you continue running the transcription. Now all we need to do to finish the previous transcription is click "Transcribe and process", and it will continue and finish the process; since I only had two audio files, you can see it's already on step two, and it just runs as normal.

Once it's finished, it's the same as before: you'll see the same output inside the training folder. And if we try to transcribe and process again, we'll run into this error. I put it there so that you don't transcribe another dataset over already existing files. To resolve it, click the "Archive existing" button, which moves all of the audio, the train.txt, and the validation.txt into archived data, sorted by date.
You'll see those folders in here. If you want to restore them for training, just cut them, go back into "me", and paste them in, and then you can train with those files. In this case I'm going to click "Archive existing" once again and it will throw them back into the archive. So those are the two other options on this window, the continuation directory and archive existing; they're more like convenience features in case something happens.

Some other notes for training other languages, on the generate-configuration side: my go-to is 10 epochs. For 1,000 hours of data I used 10 epochs, with the learning rate at max and the text LR at max as well. If you have, say, 100 hours of data, you could probably increase this to 100 epochs, or you could still stick with 10 and see how the model sounds after a few epochs. For other languages I used cosine annealing with four learning rate restarts, and I let "Validate configuration" calculate batch size and gradient accumulation on its own; but like I said earlier, if the batch size isn't evenly divisible by the gradient accumulation size you will run into issues, so make sure that's the case. For save frequency, with such large datasets I want to save as often as possible, so I save every epoch; that gives me 10 save files for this training run. The source model you leave as autoregressive, the dataset list you refresh to whatever you named it, and then you save the configuration.
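The rule of thumb above (10 epochs for 1,000 hours, up to 100 epochs for 100 hours) scales roughly inversely with dataset size. Here's that heuristic written out, purely as an illustration of the speaker's guidance; the `budget` constant is my own assumption chosen to match those two data points, not a tuned formula:

```python
def suggested_epochs(hours_of_audio: float, budget: float = 10_000.0) -> int:
    """Rough inverse heuristic: ~10 epochs at 1,000 h, ~100 epochs at 100 h.

    `budget` (in epoch-hours) is an assumed constant fitted to the two
    data points from the video; capped at 100 epochs, floored at 1."""
    return max(1, min(100, round(budget / hours_of_audio)))
```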
Now let's say you want to continue a run; I know this is a question several people have had. What I'm going to do is refresh the dataset list, validate, and save real quick; in this case I'll do 10 epochs and save every 5. Run training with this, and you'll notice a folder called "finetune" appears, and once it's training you'll start to see files inside "finetune/training_state". These are basically save states that you can continue from, so now I'll show how to continue from any save state. Let's close out of this and reopen the window with start.bat. Once the program has started loading, go back into Training, "Generate configuration", refresh the dataset list, select the name you want to train, and click "Reuse/import dataset". This takes the previous YAML file we used to save our configuration and repopulates the form with that state; in my case it's state 6, and if I go into the training folder you'll see inside "finetune" a training state, which is this one here. Then you can make whatever modifications you want: if you want to train longer, you could select 100 epochs, save the configuration, and continue training from there. You can change the batch size and most of these settings along with the resume state path; I just don't know for sure whether you can change the learning rate, though I think you should be able to. Once you have that, refresh the configuration and rerun training, and it will continue on. So that's how you'd resume from a certain point or epoch, or train the Tortoise model longer.

Alright, so that's how you get the latest version of the AI voice cloning repository set up, and this one allows you to train other languages, so good luck if you're going to endeavor in that. Sorry it's taken me a little longer to get this video out: for one, I ran into some bugs that I had to fix, and that took some time, and there are still some current bugs to fix, so that'll take a little more time in the future. I also have plans to make this a package as well, but since I'm using a completely new version of Python, that's going to take longer because I've got to package all the dependencies and make sure they work. With that, I'd like to thank all the members of my channel for supporting me. Thank you for subscribing, and I will see you in a future video. See you later!
Info
Channel: Jarods Journey
Views: 14,969
Id: WWhNqJEmF9M
Length: 29min 39sec (1779 seconds)
Published: Thu Apr 11 2024