Train / Finetune Custom Voice With Piper TTS

Captions
Today we will be looking at pictures of ducks, and then we will...

Sorry to interrupt, but I have returned, and today we will actually learn how I returned using Piper.

Well, I am pretty excited about this. This voice kind of sounds like the other voice, and it generates very fast because it is using Piper. Today I will show the process I use to create this voice and add it to this Piper TTS UI. If we look at the top, we see there is a new dropdown called Custom. This is where I can select from custom voices that I have created for Piper. I do notice that this voice contains some, as they call it, artifacts and distortions. I created this voice with only nine clips that were about 5 to 10 seconds in length, and it is fine-tuned from an existing voice rather than a new voice from scratch. I could use more clips, have an overall longer duration, and create a new voice from scratch with that to get better quality.

Anyways, let's get started with the process. Here is the folder with the audio clips. These are very short clips that I randomly took from audio I used in a previous video: "Here for today's tests..." "This is the input..." "I tried to find different typ..." "This one is a very short..." "4K HD close-up of the face." I just randomly picked some short clips for this test. At some point I would gather a lot more high-quality clips and fine-tune or train again for better quality.

Here is a transcript file, which is required by Piper. It basically contains the folder name and file name, followed by a vertical pipe symbol, and then the text that is spoken in the audio.

It also requires a specific format for the audio file, so I used Audacity to convert the audio files. I first drag and drop the original MP3 file into Audacity, then go to File and Export Audio. For the sample rate I selected 22,050 Hz, for the encoding Signed 16-bit PCM, and for the channels Mono. For this example I will change the folder name to the desktop so it generates the WAV file on the desktop, and change the file name to a temporary name for this
example, and then click Export to generate the WAV file. And here is the final WAV file. I did this manually for all nine clips for this test voice. When I do it with more clips, I will automate this process.

Next I will upload the WAV files to my Google Drive. I created a WAV zip file which contains the folder with all nine WAV files. I am going to just drag and drop the zip file to my Google Drive. Once that is done, we can begin the training process.

Go to the Piper GitHub page and go into the notebooks folder, then click the Piper multilingual training notebook and click the Open in Colab link. This notebook will do the actual training of the voice using our audio clips and transcript. Run the first cell for the anti-disconnect. The next cell will check the existence of the GPU. Then we need to mount our Google Drive. On the left we will see the drive folder appear in the folder view once the drive is mounted. Then we install the software. This part can take a while. Once it finishes you can see the output, and for me there are messages in red about the pip dependency resolver. We ignore those and move on.

Next we extract the dataset. This mentions the details about the requirements of the input audio files, and that they need to be numbered and in that format. We already did all that. This zip path already matches the path where the zip was uploaded. I just changed it to lowercase for the file name, and then we can run this cell.

Next we upload the transcript file. Here it mentions the format that the transcript file needs to be in. We have already created our transcript file, which has this information for each audio file in this format. I will go ahead and run this cell. Throw down, and there will be a browse button where we can select the transcript file.

Did you mean to say scroll down? Because it kind of sounded like you said throw down.

Oh hush, Alba.

Anyways, moving on to the next cell. For the language I will select English British. I am going to leave the rest of these as default for
model name, output path, dataset format, single speaker, sample rate, and resample. Let's click the Run button for this cell. It has finished.

Let's move on to the settings next. There is a lot of information here about how to continue training, convert a single-speaker model to multi-speaker, and so forth. I am going to change the action dropdown from fine-tune to train from scratch. I will leave the batch size at the default 12. For the quality I will select high, because why not. The rest of these settings I am going to leave at default for this test. Let's click the Run button for this cell.

Well, I don't think that is a good idea. It says you need at least 8 hours of audio, and you have less than 50 seconds of total audio.

Oh, fine. Very well. It looks like I won't be training from scratch. Ideally you would want more than 50 seconds of total audio to even fine-tune it. The voice you are currently hearing was fine-tuned on less than 50 seconds of total audio, so there will be some artifacts and distortions and such, as we have heard during this video.

Now we get an error saying there is no pre-trained model we can fine-tune for the English British language, so we will need to select English US for the language. I am going to go back to step three, change the language dropdown value, and then rerun this cell. Then let's go back to step four and rerun this cell for the settings again. This time it worked.

If we scroll down, we can select which pre-trained model we want to fine-tune. These are the English US models available. I am going to use the Piper UI app, have a listen to each one, and see which one I want to use.

Let's listen to Arctic first: "This is all wrong. I shouldn't be out here. I should be in school on the other side of the ocean."

Let's listen to Joe next: "This is all wrong. I shouldn't be out here. I should be in school on the other side of the ocean."

Let's listen to Kusal: "This is all wrong. I shouldn't be out here. I should be in school on the other side of the ocean."

L2Arctic next:
"This is all wrong. I shouldn't be out here. I should be in school on the other side of the ocean."

Lessac high quality: "This is all wrong. I shouldn't be out here. I should be in school on the other side of the ocean."

LibriTTS: "This is all wrong. I shouldn't be out here. I should be in school on the other side of the ocean."

I think I like the Lessac high quality: "This is all wrong. I shouldn't be out here. I should be in school on the other side of the ocean." I will just go ahead and select this one to use as the pre-trained model to fine-tune with my less than 50 seconds of audio.

Once the pre-trained model has finished downloading, we can begin the actual training. We can simply click Run for the train cell and it will begin the training process. For this test I let it train for 2 hours. Ideally you would want to have much more high-quality audio, longer than 50 seconds in total, and train for longer if needed, but for this test I am going to just use the 50 seconds of audio trained for 2 hours. It will print out this debug message the entire time during the training. I think the max epochs was set to 1,000, so I am not sure how long that equates to, but you can click stop once you want to stop the training. It will keep a checkpoint file for the iteration specified in the settings and use it next time in case you want to resume the training later.

It has now been 2 hours, so I am going to stop the training. It will take a while to stop and to finalize the last checkpoint file. Here is that last checkpoint file on the Google Drive that it was referencing in all of those debug messages. You can leave this here and reuse it later in case you want to continue training where you left off, but I am just going to go ahead and download it. It is nearly 1 GB. There is also a corresponding JSON file. Let's go to it. Keep going up one folder until you get to the folder where the config JSON file is. You can leave this here too, but I am just going to go ahead and download this one too. And now we are done
with training.

Next, before I export to the ONNX format, I am going to run this inference notebook to test this last checkpoint out. But first I am going to disconnect and delete the runtime from the training notebook before going to the inference notebook. In the inference notebook we can generate audio. This notebook is optional, but I want to try it out anyways to see how it works. I am going to check the GPU option for the first step. I will also change the runtime to use the GPU explicitly through the settings.

Your voice sounds a bit shaky. Are you feeling okay?

Oh, I am fine. I was created with less than 50 seconds of audio. My voice is fine. This is fine. Let's go ahead and click Run on the cell for the install.

Installing the required components. One moment, please.

Well, that was unexpected. The notebook is talking to us. This one also had the same red pip dependency error messages that we ignore.

Next we download the model and config. Here we will just need to copy and paste the URLs for the last checkpoint and the config JSON. In the Google Drive I will go into the piper folder, and in test there is the config JSON file. Click on Share, change from Restricted to Anyone with the link, and then click on Done. You can copy the link from here, but I forgot to do that, so I will just click on the three dots, select Share, and then Copy link. Go back to the Google Colab and paste it in the URL of the config JSON file box. Back in the Google Drive, go into lightning_logs, version_0, and checkpoints, and do the same for the last checkpoint file. First make it shareable with the link: select Anyone with the link and click Done. Then click on the three dots, go to Share, and Copy link. Back in the Google Colab, paste that link in the model link box, and then let's run this cell to download the model and config.

Downloading voice package.

The notebook's voice sounds friendlier than yours, in my opinion.

How incredibly rude. And now we have three different AI voices in one video. What madness. The voice package has
been downloaded, and finally it is time for inference. Let's run this cell.

It is pronounced inference. You put the wrong emphasis on the wrong syllable.

Interface opened. Write your texts, configure the different synthesis options, or download all the voices you want. Enjoy!

Great, and now we are done with inference. That sounds good enough for this test.

Inference. Inference. Close enough.

Now we can go on to the third notebook to export the model as an ONNX file. Remember to disconnect and delete this runtime first. This notebook will export the model and create the ONNX file from the input checkpoint file. Run the first cell to install the software. Once it finishes, ignore any red dependency errors and scroll down to the next cell. Here, just like last time, we paste the link to the model and the JSON. So we go to the Google Drive, click on the three dots next to the model and next to the JSON file, click Copy link for each, and paste each in the Google Colab in the corresponding boxes. I am going to leave the language as it is and also leave the voice name as it is, but will change the quality to high. I am not sure if that affects the process or if it is just for the output file name. I will leave the write model card option as is, unchecked, and then let's run this cell.

Why does this message say downloading model and his config?

Starting the process. This may take up to 2 minutes. We will notify you when everything is ready.

That is a pretty nice voice you have there, notebook.

Once that finishes, I will select the option to download the voice package on my device from this export mode dropdown and then run the cell to start the download. This might take a while. Once it finishes, it should have downloaded a file that you can extract with 7-Zip. If you don't have 7-Zip, it is a free app you can download. Just Google 7-Zip. You can select Extract Here or extract to a folder with that name. I chose the folder, so let's go into the new folder and extract this tar file here. This will give us
the ONNX file and the JSON file for our model, which we can now use with Piper TTS.

In the folder where the Piper UI is, for me, I am going to create a new folder called custom, and I will move the ONNX and JSON files into this folder. I will actually just copy and paste them. The UI has a new dropdown that reads ONNX files in this custom folder. The ONNX and JSON file names have to match, so I am going to change the name of both to my name so that it is more user friendly when seen in the dropdown. Now if I go back and launch the UI, I should see my name in the new dropdown. I am going to go ahead and select it and enter some text in the prompt: "This is all wrong. I shouldn't be up here. I should be back in school on the other side of the ocean."

If I were to select values for the language, voice, and quality dropdowns while there is a value selected in the custom dropdown, it will prioritize the custom voice over the built-in ones and generate using the custom voice: "This is all wrong. I shouldn't be up here. I should be back in school on the other side of the ocean." I can click on Clear to unselect the custom voice. Now if I generate, it will use the selected Alba voice: "This is all wrong. I shouldn't be up here. I should be back in school on the other side of the ocean."

I have merged the updated code to GitHub and created a new release package with the updated files and the new custom dropdown. So I just wanted to show the process I am using with the Google Colab notebooks to train or fine-tune a voice. I will probably be playing around with that some more and writing an automation script to auto-create the WAV files in the appropriate format from input audio files, auto-transcribe the audio, and format it into the transcript text file that Piper requires, so that I can train a voice from scratch. I would also probably want to train it for a longer time for better results.

Where do you find the time to breathe when talking non-stop like that?

Anyways, back to staring at duck pictures. Ducks.
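
The automation mentioned at the end of the captions could start from something like the sketch below. It is only an illustration under stated assumptions, not the author's script: ffmpeg stands in for the manual Audacity export, and the helper names (`ffmpeg_command`, `wav_problems`, `transcript_lines`), the `wavs` folder name, and the sample file names are all hypothetical. The target format (22,050 Hz, signed 16-bit PCM, mono) and the `folder/file|text` transcript layout are taken from the captions.

```python
import wave

# Target format, taken from the export settings described in the captions.
SAMPLE_RATE = 22050   # Hz
SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit PCM
CHANNELS = 1          # mono


def ffmpeg_command(src, dst):
    """Build (but do not run) an ffmpeg call that converts src to the target format.

    Assumption: ffmpeg replaces the manual Audacity export shown in the video.
    """
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-ar", str(SAMPLE_RATE),   # resample to 22,050 Hz
        "-ac", str(CHANNELS),      # downmix to mono
        "-c:a", "pcm_s16le",       # signed 16-bit little-endian PCM
        str(dst),
    ]


def wav_problems(path):
    """Return a list of mismatches between a WAV file and the target format."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getframerate() != SAMPLE_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != SAMPLE_WIDTH:
            problems.append(f"sample width is {8 * wav.getsampwidth()} bits")
        if wav.getnchannels() != CHANNELS:
            problems.append(f"{wav.getnchannels()} channels")
    return problems


def transcript_lines(clips, folder="wavs"):
    """Format (file_name, spoken_text) pairs as 'folder/file_name|text' lines.

    The folder name "wavs" is a placeholder, not a value confirmed by the video.
    """
    lines = []
    for file_name, text in clips:
        text = " ".join(text.split())  # collapse line breaks and repeated spaces
        if "|" in text:
            raise ValueError(f"pipe character in transcript text for {file_name}")
        lines.append(f"{folder}/{file_name}|{text}")
    return lines
```

With ffmpeg installed, `subprocess.run(ffmpeg_command("clip.mp3", "wavs/1.wav"), check=True)` would perform the conversion, and `wav_problems("wavs/1.wav")` should then return an empty list. `transcript_lines([("1.wav", "This is all wrong.")])` yields the single line `wavs/1.wav|This is all wrong.`, ready to be written to the transcript file.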
Info
Channel: Natlamir
Views: 3,805
Id: 67QvWOp3dkU
Length: 13min 37sec (817 seconds)
Published: Wed Nov 15 2023