Complete Guide: AI Voice Training with So-Vits-SVC - Part 1: Google Colab

Video Statistics and Information

Captions
What's going on, YouTube? Today I'm going to be doing a full tutorial on so-vits-svc, which is an AI model that you can train voices on in order to have an AI voice sing basically any song. It's not text-to-speech; rather, you feed it the vocals of a song and it changes those vocals into the voice that you trained it on. Today I'm going to be training it with my own voice: getting the audio samples for that and then running it all the way through to where it's trained. So it's going to be a full end-to-end tutorial, and I'm going to be using both Google Colab and a local version of it so we can get both sides. Without further ado, let's go ahead and jump into it.

We are going to need to download some things. Before we actually get started installing anything, I really recommend you install Python, Git, and VS Code. A link in the description down below goes to a tutorial I have released on it, where you can download everything in less than five minutes. If you have all that, or if you don't care about any of that, it is okay, because you'll be able to follow the tutorial without it, but I highly recommend it if you want to do these types of things on your own computer.

The first downloads are going to be the voice recorder GitHub and then the audio splitter GitHub. These are things that I created. The voice recorder is just a simple Python script that records your own voice so that you can create training samples for the AI model. The link to this GitHub is going to be in the description down below. You want to go over to this Releases area in the bottom right corner, then scroll down to voice recorder.exe and click on that. Then do Save As and save it into a folder. I have a folder called voice 
recorder, so go ahead and create a folder called voice recorder and save it into there. Microsoft Edge is going to show this warning up here that you'll need to accept in order to download it, because exe files can often contain viruses. If you trust it, then you trust it; I made this, so I know it's safe for me, but use your own discretion when you're downloading these types of things. In this case we're going to click Keep, then Show more, and then Keep anyway. That is going to allow it to download successfully into the folder; if not, you're going to have an incomplete exe file.

The next thing we need to do is download the audio splitter, which is the exact same process. Go into Releases and scroll down to audio splitter. I'm going to click Save As up here, go into that audio splitter folder, and save it. When it's done downloading, we do the exact same thing: click the three dots, click Keep, click Show more, and then Keep anyway. Now if we go into this folder, we have everything we need: audio splitter and voice recorder.

So what are these two things you just downloaded? The voice recorder simply uses your default microphone to record audio and output it to a folder. If you have some other type of recording software, you can use that instead of the voice recorder I made; that works just fine as well. The audio splitter takes long audio files and splits them into segments of 10 seconds or less. This is needed because of the VRAM issues with so-vits-svc. If you don't know what that means, that's okay; just think of it this way: we need audio files that are 10 seconds or less in order for 
this to work, so that's what this does. With those two installed, we don't need to do much more in order to run the Colab version of so-vits-svc, so we're going to jump into data curation and how to get some data.

What I'm going to be reading for my dataset are these Harvard sentences, which are supposed to be phonetically balanced. There are a bunch of sentences here, so there's a lot of variation. I'm going to do a hundred samples, which I think is going to be 10 lists. If you're using some other type of data, if you're not training it on your own voice, don't worry; I will go over how you can cut those audio files up and split them later. But right now we're going to record my voice for all 10 of these lists.

If you're following along with me, go into the voice recorder folder and open up that voice recorder exe. Your antivirus may block it; you just want to allow it to go through. Then this little tiny window is going to open with a "tk" on it. There are two buttons here, Start and Stop. Here's a quick demonstration. I'm going to click Start: "Hello, this is a quick demonstration." I click Stop, so it stops recording. Then if I go into audio files here, let's play the output: "Hello, this is a quick demonstration." As you can see, it captured it as needed, and it created its own folder here.

So what we're going to do is run it and then say all of the sentences down below, and it's going to give us a nice long audio file that we can split up evenly. I'm going to go ahead and do that, and we'll be back when I'm finished.

All right, so we just finished reading all 100 samples and we clicked Stop recording here. Perfect, that is all we need to do for that, and we have our data. There are a couple of things that we want to do. If we go into that voice recorder folder, 
you can see that there's an output file here with six minutes of audio. This may or may not be enough audio for training your model, so you might need to record yourself saying some more things, maybe in more natural contexts. But you definitely don't want to go over 10 seconds for the length of those audio files, and that is where the audio splitter comes in handy.

So let's head on over into our audio splitter folder and process the data now. Inside audio splitter we've got the exe here, so let's run it. You'll see two options pop up: audio segmenter and audio splitter. The audio segmenter is going to remove the silence and split all of the speech into separate WAV files. The audio splitter instead goes through an entire folder, checks if any file is 10 seconds or more, and then cuts those files into shorter ones. For this one we're going to use the audio segmenter, choice number one, so we type 1 and press Enter. A window is going to pop up, and we navigate into voice recorder, where our audio file was saved, into the audio files folder, and click output.wav. Then we click Open, and we name the folder. Let's call this folder Harvard sentences, because that is what I read; I'm trying not to have any spaces in there. Then it's going to find all the silences and split the recording into individual WAV files. This might take a little bit of time, so just wait for it to finish, and we'll be back when that's done.

All right, so it finished, and it automatically closes. Now you can see inside the file browser here that I have a folder called Harvard sentences. If we double-click into there, it's going to load all of those new files, and you're going to see a bunch of wave files, 
or WAV. So here we go: we have all of the samples, 100 of them, numbered 0 to 99 because it starts at zero. Let's take a listen to a couple of them. We'll play the first one: "The birch canoe slid on smooth planks." Cool. The second one: "Glue the sheet to the dark blue background." And the third one: "It's easy to tell the depth of a well." As you can see, each one is its own individual file, and they're cut to the appropriate length. That is the part of the audio splitter that we use right now.

The other part of the audio splitter is only needed if your individual WAV files are more than 10 seconds. Say that this audio file, instead of two seconds, was actually 15 seconds. If I used option two when I opened up this executable, it would split that into two audio files, one that's 10 seconds long and one that's five seconds long. To show you what I mean, I'm going to record a quick dummy audio that is 15 seconds long, so let's hit Start recording here.

So we finished that. Now we're going to take this file, which came out a little different at 20 seconds long, copy it, and bring it into the audio splitter folder. Here we're going to create a new folder; you can name it whatever you want. I'm going to call it test, and I'm going to paste the file into here. One thing to note is that this will delete the original file, so only do this if you want to delete the original file, or make sure you have backups of it. So let's run that. We're going to do option two, and then navigate into audio splitter, test. It scans through the entire directory, which is why you don't see any files inside of here. So we're 
going to select this test folder, click Select, and it runs through it and outputs three files here. That's because the original was 20 seconds: this one is super short, zero length as you can see, but the other two files here are 10 seconds and 10 seconds. Let's take a listen real quick: "Hello, I'm talking into the mic consistently so that I can create..." Okay, and then the second one: "...purpose of this is to demonstrate how the audio splitter segmenter works." And there you go. As you can see, the audio has been split into three, and you can put as many files into this test folder as you want before you run that audio splitter option on the executable. I'm going to delete that because I don't need it, and I'm also going to delete the original file. Make sure you delete this output.json; we don't need that.

We have all 100 samples, so we're going to head on over to so-vits-svc-fork. That is this GitHub repository right here. This is where all of the magic happens, and this is the model that you train the data on. We're going to do the Google Colab version first, so scroll down to this Google Colab option here and click to open it. It's going to open a Google Colab window. Before we actually do anything in here, we need to upload the data into our Google Drive, so head on over into your Google Drive. As you can see, I already have one here. You're going to need to create a folder called so-vits-svc-fork, so click New, then New folder, and name it so-vits-svc-fork. Once you have that done, navigate into it. You're not actually going to have any of these files; that's old stuff, so I'm going to delete all of it. We're going to create a new folder, and this is going to be the speaker name. You can name this whatever you want; I'm going to call it me, because the model is me. We create it, and then inside 
of the speaker name folder we're going to upload those sentences. Here I have them; I'm just going to drag and drop them into here and upload all 100 samples. We'll wait for those to upload. As you can see in the bottom right corner, it's uploading slowly, and that'll finish up.

While that's taking place, we're going to head back into Google Colab, where we need to set up a couple of things. Go over to Runtime, then Change runtime type, and make sure you have GPU selected here. If you don't have premium, you won't have the option of A100s or V100s; those are much more powerful GPUs if you want to train models faster. The T4 works just fine for the free option. Click Save, and then click Connect in this top right corner and wait for it to connect. It's going to say Connected, and if you click on it you can see these different charts right here that we'll take a look at. You want to click this Check GPU cell and then click Run anyway. The important thing is that it says it's a Tesla T4.

Now that we have confirmation that we have a GPU running, we want to mount our Google Drive. What this basically does is give the Colab notebook access to our Google Drive. If you don't trust so-vits-svc-fork 4.0, you will probably deny it access, but we need to give it access in order for this to work. So run the cell, and it's going to tell you the same thing right here: it's going to give this notebook access to your Google Drive. Click Connect to Google Drive. A window is going to pop up; click on your email, then scroll down and click Allow. You'll know that it was successful if it says "Mounted at /content/drive". To double-check, we're going to click this 
folder icon in the left sidebar that says Files, and we want to make sure that it says drive here. If you click on the drive folder and then on MyDrive, you'll see everything that's inside of your Google Drive. Here we have so-vits-svc-fork; if we click on that we have our speaker name, and if you click on the speaker name we have our data samples. You should be able to see all of the data samples in here if you've uploaded them. We're going to close all of that back down to drive, and we're all set.

Let's minimize that, scroll down, and run Install dependencies. This will take a little bit of time, a couple of minutes or so, as it installs all the necessary Python libraries. Once that is done, you might see this Restart runtime button here. I don't think we actually need to restart the runtime for it to run, but we'll see later on if something doesn't actually run.

The next part is Make dataset directory. If we run it and take a look down here, now you have a dataset_raw folder with nothing in it. To put stuff in there, we have to actually copy the dataset. This is where it accesses the data inside of our Google Drive, and this dataset name is going to be the speaker name that we chose. As you remember, mine is named me; if you don't remember yours, you can go back into your Google Drive and take a look at the name. So here it is, me. If you take a look here, you can see the directory: MyDrive, so-vits-svc-fork, dataset, and then the dataset name.

Wait, one second. I actually made a quick mistake, which is why it's always good to read things. You want to create a new folder called dataset in your Google Drive folder and then move your samples into that. The correct file structure should be so-vits-svc-fork, dataset, then 
speaker name, and then all of your samples. Now if we take a look back here, we have it correct: dataset and then the dataset name, which is me. Let's run that. You'll know it ran successfully when you see a green check mark on the cell, as you do up here. We'll double-check in Files to make sure we have stuff here, and we do: all of the sentences have been copied. Since everything is in Google's cloud, it copies everything super quickly once it's on there. So we already have our dataset.

We're going to scroll past this cell, and these next four cells are the important ones that train your model. The first thing is preprocessing of the data, and then we run pre-config, which creates a configuration file that we'll need to edit. Now go into configs, then 44k, and here is our config.json. When we click on it, a window is going to pop up on the right side where we have all of the training variables that we need to set. This can get a little crazy, so I'm not going to go too much in depth with it, but some people say that you want to have a total of 25,000 steps.

Here's a quick way to calculate the number of steps you're doing. A step is basically how long it takes to finish a batch, and the batch size is just how many samples are in a batch. In this case we're running through 16 samples, so each step is 16 samples long. If you do the math, the formula comes out to: number of steps equals your total sample size, in my case 100, divided by the batch size, multiplied by your epochs. You don't really need to do any math, though; you could just leave it as is, so that's what we're going to do here. We're going to leave it at 10K, and we're going to set our log interval to 400, so 
this is going to save a new model every 400 steps. You can use that steps formula to calculate how many times it's going to create a new model, and I'll show you what that model is later. We're going to leave the learning rate at its regular value, and we'll leave the batch size at 16, which is the default. You could pretty much leave this all at default and it should be fine. Press Ctrl+S if you changed anything, and it'll save. Then we can hide that window again and run the next cells.

We're going to run Copy configs file, which copies that configuration file over into our Google Drive, as it says here. Then, this is important: we want to make sure that it's set to crepe. Crepe produces a more human-like, more natural-sounding model, so you want to make sure that you train your model with crepe. I don't believe you can go back and change it once you've done this. The next thing you might have to do as well is add -n 2. I believe this tells it how many cores or threads you're running it on, but we'll run it without that first just to see how it goes. So let's run this and see if it goes through pre-hubert correctly.

Now let's run Train and see if it does anything. This will be the biggest indicator of whether you did any of the previous steps incorrectly, in which case you might just have to run the previous cells again. So let's see if it runs correctly.

All right, so I've been sitting and waiting for it to output some more stuff, and it's been going for about two minutes and 45 seconds, which I think means something is happening. We're going to click this refresh option here on the TensorBoard and see if it's doing anything. Let's go to Scalars, go 
to HParams. I do actually think it is training, but I want to wait for one thing before I leave this, to show you that it is in fact training.

Okay, so it looks like it actually is training, or else it would have stopped at around a minute. I stopped it and re-executed it to see if it was actually doing anything, and it looks like it's training. We'll leave this until it hits the first mark and see how that works.

In the meantime, we'll jump into another step of the tutorial, which is extracting vocals from audio files. Let's open up a new tab and go to Ultimate Vocal Remover, which is another GitHub repository; it allows you to remove vocals from an audio file. The installation is pretty easy; I'm going to go over the Windows one. All we have to do is click this main download link here, and it's going to ask you in the top right what you want to do with it. Click Save As and save it. It is a huge file, so we'll wait for that to download. Once it's finished, if you're on Edge, you want to do the same thing as before: Keep, Show more, and then Keep anyway, the same process as we did for my exes earlier. Then navigate to where you downloaded it and run the exe. I already have this installed, so I'm not going to run through the entire installation, but if you double-click it, a new window with an agreement is going to pop up. You click Next, then it asks about a desktop shortcut, then Next again, and then Install. I'm not going to install because I already have it, so I'm going to cancel out of that.

Once you have UVR installed, what we're 
going to want to do is get some type of song. Here I have Hey There Delilah; this is a music video that I found on YouTube and downloaded. What I'm going to do is run UVR and extract the instrumental and vocals from it.

Once it opens up, click Select Input, navigate into that folder, and select the file. It doesn't matter that it's an MP4 video, which is what this is; we're going to open it anyway. Then we select the same folder as the output. We want to save it as a WAV file. I selected MDX-Net, with default batch size and volume compensation set to Auto, and then you want to select this UVR MDX model. In this case we're going to do both, so we will leave these unselected. I'm going to use the GPU for my conversion because I have a good GPU, but you can leave this unchecked if you want to run on your CPU; you can try either one to see which is faster for you. I'm going to click Start Processing, it runs through the process, and done.

So here we go: we have the vocals and the instrumental, if I widen this out a little bit. Let's listen to the vocals. [vocals play] As you can see, it's pretty good; it catches the vocals pretty spot on. And same with the instrumental. [Music] All right, I just played that to sample it. I don't know if it's going to trigger one of those copyright things again for my video, so that's why I didn't play too much of it.

What is important is this vocals file. We're going to rename it to something shorter, so let's rename it to vocals.wav. Then we're going to upload this into our Google Drive folder, so 
head back on over into so-vits-svc-fork, where we need to create a new folder called audio. Inside of this audio folder, drag and drop that vocals file, and it's going to upload into your Google Drive.

All right, so now we're back. I'm going to jump back over into Google Colab, and this is what I wanted to show you. If your model is training, it should show you your epochs and how fast it's going through them. As you can see, each epoch is taking nine seconds, so if you do the math, 100 epochs is going to take 900 seconds, which comes out to 15 minutes. This is the longest part; you just have to be patient and wait for it. As I said, if pre-hubert didn't work and your training finishes in about a minute or less, then what you want to do is add -n 2, but it looks like they have fixed it in the more recent version, so we don't need to do that.

It's going to continue training, and here is your TensorBoard, where you can see how it's training. If you don't have any experience with training models, you don't really need to look at any of this. Remember, because I have my log interval set to 400, it's going to log every 400 steps, which saves your data, and then it's going to evaluate a new model every 800. So if we do 800 divided by 6, it should output a new model into this folder once it hits 133. If we take a look back here, it's at 101, so I'll just painstakingly sit here until it hits 134 and see if it outputs a new model in my Google Drive.

All right, so here we are at 133, and a new window popped up here which should be indicating that it's saving. As you can see, nothing is happening here, so it is currently saving the model and it's going to output it to 
the Google Drive, so we'll just see what it says after.

All righty, and as you can see, inside of my Google Drive folder I have a G_134 PyTorch file, and it's also going to output a D_134 PyTorch file into my Google Drive; it looks like that's what it's doing right now. If you want to know how often it outputs models, take this number right here, the evaluation interval, and divide it by the number of steps it takes to finish an epoch. That gives 133.3, which rounds up to the next whole number, and that's where you get the 134. We'll wait until it finishes outputting the D file and continues training, and then I'm going to stop it and show how inference works for the model, or basically how you can use it. It's going to be terrible because it's only one evaluation, but I'm not going to sit here and train it on Google Colab the entire time.

All righty, so it has finished outputting the D PyTorch file into the Google Drive. If we go out to the so-vits-svc-fork folder, we should have folders that say dataset, logs, and audio. If you noticed earlier, these are some huge files, so it's going to periodically delete them, every three models it outputs. You might want to keep track of that, because there's something called overtraining. You shouldn't worry about it too much, but it can happen. If you notice that your finished model is sounding pretty terrible, there are a couple of things that could have gone wrong: maybe you overtrained the model, or you don't have enough good-quality audio samples. There's a ton of stuff that could be going wrong, and I don't have an entire list of what it could be.

We're going to stop the training here, and I'm going to show you how you can use your model after you're done 
training it. We continue scrolling down. Let's say that you come back to it another day: you're still going to need to do all of the setup back here. You need to mount your Google Drive and install dependencies, and then scroll down to this Inference section and go to Use trained model, because that is what we have. It gives you the instructions here: put your .wav file into the so-vits-svc-fork/audio directory. If we take a look here, we do have the file.

Aha, okay, so it says a name is not defined. Let's set it to what we have in here: vocals.wav. All right, and it looks like it's running inference right now. The issue is that there's supposed to be something that looks like this on the right-hand side, I believe, where you insert the name of your file, but it wasn't there, so we just hard-coded it into the code on the left side. It should output a file at the end, so let's wait for it to finish.

All right, so inference took one minute to finish, or about two minutes, and if we go into here, we now have two files, vocals.wav and vocals.out. I don't know how it's going to sound after one cycle of training, so let's play it. [converted vocals play] All right, as you can see, it did output a terrible rendition of my voice. Actually, I would say it does sound like my voice, but the model was not trained long enough. That is a very undertrained model; we only allowed it to go through once. But as you can see, it already has a resemblance to my voice, and that is the power of this repository.

This is actually me coming back at the end of the video. There might be an issue with 
this. Where it says svc infer, you might want to add -na and then run it. So let's run this and see how that changes the output sound. Now we're running Use trained model, and as you can see, there is this parameter that says auto predict f0: false. We want to make sure that this is off. It wasn't off the first time around, and that is why we got that weird-sounding audio. We'll see how much it has improved simply by adding -na to this inference command.

It finished, and the output is right here. It looks like I can actually play the audio sample here, but we're not going to do that; we're going to go into our Google Drive and play it from there. Let's run this and see if it made much of a difference on a model that I trained for literally one evaluation interval: "Hey there Delilah, what's it like in New York City? ... a thousand miles away, but girl, tonight you look so pretty..." Oh, that was pretty shocking. That is almost as good as the model that I had trained for 25,000 steps, and this was only 800 steps. That's pretty crazy. If you want to hear the difference, just go back a couple of minutes and you can hear how terrible the other one sounded with auto predict on.

All right, so this video is getting pretty long. This is going to be the Google Colab part; I will be releasing a second part that shows you how to train it on your own GPU, and that'll be part two of this so-vits-svc tutorial. Hopefully you found something insightful in this video. If you liked it, please leave a like, it really helps me out, and leave a comment down below if you need help with anything, found anything confusing, or had any revelations. See you again in a future video; look forward to the other one, and I'll see 
you guys sometime later.
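
The step arithmetic quoted in the captions (steps equal samples divided by batch size, multiplied by epochs, and a first saved model at epoch 134) can be checked with a minimal sketch like the one below. It is not code from the so-vits-svc-fork notebook. The function names are hypothetical, and rounding 100 / 16 down to 6 steps per epoch is an assumption chosen to match the "800 divided by 6" figure in the video; a real data loader may count a partial batch or hold out validation samples.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Steps needed to see the dataset once.

    Assumption: partial batches are dropped (100 // 16 == 6), which matches
    the figure of 6 used in the video. This is illustrative only.
    """
    return max(1, num_samples // batch_size)


def total_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Total steps = steps per epoch * number of epochs."""
    return steps_per_epoch(num_samples, batch_size) * epochs


def first_checkpoint_epoch(eval_interval: int, num_samples: int, batch_size: int) -> int:
    """Epoch during which the step counter first reaches eval_interval."""
    return math.ceil(eval_interval / steps_per_epoch(num_samples, batch_size))


def epochs_for_target(target_steps: int, num_samples: int, batch_size: int) -> int:
    """Epochs needed to reach a target step count, rounded up."""
    return math.ceil(target_steps / steps_per_epoch(num_samples, batch_size))


if __name__ == "__main__":
    samples, batch = 100, 16  # values quoted in the video
    print("steps per epoch:", steps_per_epoch(samples, batch))
    print("steps in 100 epochs:", total_steps(samples, batch, 100))
    print("first checkpoint at epoch:", first_checkpoint_epoch(800, samples, batch))
    print("epochs for 25,000 steps:", epochs_for_target(25_000, samples, batch))
    print("minutes for 100 epochs at 9 s each:", 100 * 9 / 60)
```

With the video's numbers, a small dataset needs several thousand epochs to reach the 25,000 steps that "some people" recommend, which is why a single 800-step checkpoint is described as undertrained.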
Info
Channel: Jarods Journey
Views: 74,874
Id: xgvT7UnUTng
Length: 34min 15sec (2055 seconds)
Published: Sun May 14 2023