5x Faster Voice Cloning | Tortoise-TTS-Fast | Tutorial

Captions
"As the sun rises over the vast expanse of our planet, a new day begins, and with it the wonders of the natural world awaken once more."

In this video I will show you how you can clone any voice using the Tortoise-TTS model. I know many of you have been using Tortoise-TTS in the past, and I definitely got questions like: can't the model be faster? Unfortunately, that's why it's called Tortoise-TTS. But there is a new library called tortoise-tts-fast, and today I will show you two highlights: we will be able to generate speech more than five times faster, and on top of that there's a web UI, so it's way less hacky now and you can just use the web interface to clone any voice and generate speech with it.

Before we start generating speech, I first want to get a better understanding with you of how this was even possible, because we are still using the Tortoise-TTS model; some changes were just made to the code so that inference is much faster. Let's have a look at how this was done, because I think five times or more faster is very impressive. Here we can see the main contributions of the repository on top of the existing Tortoise-TTS model; the authors state: expect speedups of 5 to 10 times compared to the original project. The repository adds the following config options for faster inference.

The first is a KV cache. In case that sounds a little cryptic to you, maybe check out my video about attention, because those are the key and value vectors being referred to. To better understand this, let's look at the denoising process of a diffusion model, which actually happens under the hood of the Tortoise-TTS model. We basically start with Gaussian random noise, and then in each denoising step we add our conditioning latents, which are the speech embeddings, i.e. the voice that we would like to clone: we represent the characteristics of the voice in a vector and use it as a condition while we denoise. We also add the text, so we basically guide the denoising steps with text embeddings and voice embeddings, and the model ends up generating speech for the given text using the specific voice.

And if you're now thinking, "hey, those are images, but we're actually generating speech": yes, you're right. Under the hood the model actually works with mel spectrograms, which can look like this one. We take our waveform audio sample, transform it into the spectral domain, and get a representation that is closely aligned with how we humans perceive sound and is a more efficient representation than raw waveform audio. I don't want to go too deep into this topic; just be aware that we can take the representations we see here and convert them back to a waveform speech file.

Now back to our KV cache. Inside those denoising steps we often apply attention inside a Transformer, so in each step we would recalculate the key and value vectors, even though they technically don't change compared to the step before. By storing the key and value vectors between steps, we save a lot of computation; this is also where 90% of the Tortoise model's inference time is spent. This might not sound too fascinating, but from my experimentation this is definitely the biggest improvement in the tortoise-tts-fast library. I think the KV cache was first applied in the GPT-2 or GPT-3 models. As you can maybe imagine, this is a bit of a trade-off: our memory requirements increase a little, because we have to store the key and value vectors, but on the other hand our inference gets much faster, especially for longer sequences.
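To make the KV-cache idea concrete, here is a minimal, library-free sketch (all names are my own for illustration, not from the tortoise-tts-fast code): a toy decoder either recomputes the key/value projections of every token seen so far at each step, or caches them and projects only the newest token.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                               # embedding size of our toy model
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
tokens = rng.normal(size=(6, d))    # 6 "decoded tokens" so far

def kv_no_cache(step):
    # Recompute K/V for every token seen so far: O(step) projections per step.
    seen = tokens[: step + 1]
    return seen @ Wk, seen @ Wv

K_cache, V_cache = [], []
def kv_with_cache(step):
    # Project only the newest token and append: O(1) projections per step.
    K_cache.append(tokens[step] @ Wk)
    V_cache.append(tokens[step] @ Wv)
    return np.stack(K_cache), np.stack(V_cache)

for step in range(len(tokens)):
    K1, V1 = kv_no_cache(step)
    K2, V2 = kv_with_cache(step)
    # Identical results, far less work per step.
    assert np.allclose(K1, K2) and np.allclose(V1, V2)
```

The trade-off mentioned above is visible here too: the cache grows by one key and one value vector per generated token, which is exactly the extra memory you pay for skipping the recomputation.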
For a longer input text, the impact of using the KV cache increases even further.

Okay, the next change is half-precision inference, a concept we have especially seen with large language models recently, where it even made it possible to run them on our local computers. The idea is simply to store the model weights using fewer bits per number: instead of using, for example, 32 bits to represent a floating-point number, this implementation uses just 16 bits, which is half as precise. Overall our inference is less precise, but the assumption is that the perceived audio quality of the generated speech is almost as good as in full-precision mode, while inference is much faster using just half the precision to calculate and store the model weights. I don't want to talk much more about this option, though, because the author also states that half precision currently significantly worsens outputs. So feel free to experiment with this option, but do it at your own risk.

Another thing the repository adds is a DPM solver. I showed you this process of the diffusion model earlier, where we basically start with Gaussian random noise and steadily generate a picture that is less and less noisy, until we end up with a high-fidelity image. As you can see here, we have one, two, three, four, five, six steps to denoise the whole image, while in practice you often apply many more steps to actually generate a new mel spectrogram (or image). The DPM solver is basically an improvement to that denoising process that lets us generate an image like this using far fewer denoising steps. To further illustrate this: here you can see what an image looks like after 10 denoising steps in a classic diffusion model, while with the DPM solver you already have a high-fidelity image after 10 steps. So basically, to reach the same level of quality or fidelity in an image, the DPM solver needs far fewer denoising steps than a classic diffusion model.
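The storage side of the half-precision option described above is easy to see with plain NumPy (a toy illustration, not the tortoise-tts-fast implementation): float16 halves the memory per weight, at the cost of coarser rounding.

```python
import numpy as np

# Pretend these are model weights stored in full precision (32-bit floats).
weights_fp32 = np.random.default_rng(0).normal(size=100_000).astype(np.float32)

# Half precision: the same values, but 16 bits per number.
weights_fp16 = weights_fp32.astype(np.float16)

print(weights_fp32.nbytes)   # 400000 bytes
print(weights_fp16.nbytes)   # 200000 bytes -- exactly half

# The price: each value is rounded to roughly 3 decimal digits of precision.
max_err = np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max()
print(max_err)               # small, but nonzero
```

This is exactly the trade the author warns about: half the memory and faster arithmetic, but every weight carries a small rounding error, which for Tortoise currently shows up as audibly worse outputs.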
Per step, though, using the DPM solver actually slows down the inference of the model, as you can see here in this study by the author. But potentially, by using the DPM solver, we get the best speech quality. So in case speech quality is more important to you than inference time, using the DPM solver could be a good option.

Okay, and there's one last thing: the repository lets you toggle CPU offloading. The Tortoise-TTS model basically consists of multiple sub-models, and each time a specific sub-model is used, its weights are loaded into VRAM to actually run inference with it. For example, the model that computes the embeddings for our audio samples is different from the one that does the denoising, i.e. the speech-generating part. By default, the Tortoise repository offloads the parts that are not currently in use to save VRAM, so that the model also fits on GPUs with less VRAM. But let's say you have 24 gigabytes or even more; in that case, disabling CPU offloading helps, because loading and unloading the model weights takes a little time each time. So if VRAM is not a problem for you and you have a GPU with, say, 10 or 24 gigabytes or more, this could be an option to further speed up the inference of the model.

Okay, and those are basically all the changes that led to a speedup of Tortoise inference of around five to ten times, depending especially on the sequence length. But there are a few more things this repository has to offer. One thing I already mentioned at the start is that there's a Streamlit web UI, which is super cool because now it's really easy to generate speech. Another really cool thing: I think I got asked maybe 50 or 100 times whether it's also possible to generate speech for languages other than English using the Tortoise-TTS model. Unfortunately, so far that was not the case.
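The CPU-offloading behavior described above can be caricatured without any GPU at all (every name in this sketch is made up for illustration; this is not the Tortoise code): weights live in "RAM" and are copied into a "VRAM" dict only while their sub-model runs, then evicted again when offloading is on.

```python
import time

VRAM = {}  # what is currently resident on our pretend GPU

class SubModel:
    def __init__(self, name, size_mb):
        self.name, self.size_mb = name, size_mb
        self.weights_in_ram = f"<{size_mb} MB of weights for {name}>"

    def load(self):
        time.sleep(self.size_mb / 1e6)        # stand-in for the host->device copy
        VRAM[self.name] = self.weights_in_ram

    def unload(self):
        VRAM.pop(self.name, None)

    def run(self, offload=True):
        if self.name not in VRAM:             # pay the copy cost on every use
            self.load()
        result = f"{self.name} ran"
        if offload:                           # default: free VRAM for the next sub-model
            self.unload()
        return result

embedder = SubModel("voice_embedder", 400)
decoder = SubModel("speech_decoder", 1200)

# With offloading, VRAM holds at most one sub-model at a time...
embedder.run(offload=True); decoder.run(offload=True)
assert len(VRAM) == 0

# ...without it, both stay resident and repeated calls skip the slow load step.
embedder.run(offload=False); decoder.run(offload=False)
```

This mirrors the trade-off from the video: offloading keeps peak VRAM low so the model fits on small GPUs, while disabling it spends VRAM to avoid paying the load/unload cost on every call.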
But now there's a repository that allows us to fine-tune the Tortoise-TTS model, which you can find here, and using the AR checkpoint flag you can then use your fine-tuned Tortoise-TTS model in this repository. It's really cool that they provide support for this.

The last thing they added is VoiceFixer, which is basically a model for speech restoration. Overall, it slightly improves the speech quality when applied, but of course it also increases the inference time of the model. So one more time: if your biggest priority is generation speed, I wouldn't apply VoiceFixer at the end of the speech generation; but if your biggest goal is high-quality speech, then it definitely makes sense to apply VoiceFixer to your generated speech.

Okay, but enough theory for now. Let's start cloning a voice and generating speech with that cloned voice. For this we will use the web UI, as I mentioned earlier. One thing I encountered is that there's no real option to upload audio samples and create a new voice, so for this reason I created a fork of the tortoise-tts-fast library and added an option to upload new audio samples and create new voices in the web UI. This way it's basically an all-in-one solution.

Before we can use the web UI, we first have to set up our environment so that all the required dependencies and modules are installed. As always, we'll start with setting up our virtual environment, and for this we will use Miniconda. To run the model we will need an NVIDIA GPU; if you don't have one yourself, don't worry, I will share a link to a Google Colab notebook in the description box, so just use the Colab notebook in that case. As always, I will also share all the commands I use in a Medium article, which I will also link in the description box. First we will download and install Miniconda. As a next step, we will clone the repository that I forked and navigate into the created folder. I know some people recently had issues with running the Tortoise-TTS model.
By following the setup of this virtual environment with the commands that I used, it definitely should also work on your computer. So now let's create the virtual environment, which we call tts-fast, and use Python 3.8 for it. Once this is done, we can activate our new environment by typing conda activate tts-fast. I needed to install GCC in this environment to actually build wheel files; this is optional, but in case you encounter errors when installing the repository, maybe come back to this step. Then we will install PyTorch for CUDA version 11.7. This step might be optional too, but in the past I also had some issues with installing the Tortoise-TTS library, which is why I'm running this command; it might be optional, I'm not 100% sure, to be honest. Once this is done (it can take a little longer), we will install all required modules for the tortoise-tts-fast library. As a last step, we will also install the BigVGAN model (I'm not sure how that is supposed to be pronounced); this will help transform a mel spectrogram into a waveform audio file.

Now we'll install the Python module for ngrok, through which we will serve the web UI. Since I'm working on a cloud GPU, I can't access the local web server directly, so I have to proxy the session so that I can access the web UI over the internet. In case you're working on your local computer, you don't have to worry about this and don't have to install it, because you will be able to access the web UI locally. In the Colab notebook I'm using a tool called localtunnel, which is a nice way to avoid signing up for ngrok; but since Node and npm weren't installed in this environment and I already had an ngrok account, I just used ngrok in this case. Feel free to install npm and then use localtunnel instead; this way you can avoid setting up an ngrok account.
We install the pyngrok module by typing pip install pyngrok. Then you can specify your auth token for ngrok by typing in the ngrok auth token command; in the ngrok user interface, under "Getting Started", you find your auth token, and you can just copy it. (Don't worry, I will reset the token, so don't put in the effort of trying this one.) Okay, now we have the auth token set up. Sorry, I had a small typo in here; I fixed the auth token in the notebook and will fix it later. Now we will run ngrok against localhost, and then we can start the web UI using this command. Since our ngrok agent is running in the background, it would usually print the public URL with which we can access our web UI, but because it's running in the background we can't really see it. So instead I went to the ngrok dashboard, to "Agents", where you can see your running agent; just double-click on it and you can copy the link. In case you trust yourself, you can then click "Visit Site".

Okay, perfect, now our Streamlit web UI is running. The first time we start the model, all the required model weights are loaded in the background; this will take around 5 to 10 minutes depending on your bandwidth. If you're working on your local computer, you only have to download those model weights once. In case you're working on a cloud GPU like myself, or in the Colab notebook, you unfortunately have to download the model weights again every time you start a session, which is a little annoying because it takes a while. But again, on your local computer you only have to worry about this once.

And voilà, now everything is set up and running. As you can see, this is the option I mentioned earlier that I added: you can just browse your files and upload audio samples for a new voice. For example, I have these audio samples that I created in an earlier video, which I will link here.
If you don't have your own audio samples yet: I don't show the recording process in this video, but make sure to watch the other video, where I showed how you can record your own voice using the tool Audacity and which settings you should use. The author of the original Tortoise-TTS noted that good sources are, for example, YouTube videos without much background noise, and that you want to have at least three clips. Another important thing is to save your audio samples as WAV files in floating-point format with a 22 kHz sample rate. If you're not sure how to do this, just check out my other video that I linked here; there you will definitely see how to do it.

Now you can see that I uploaded the audio samples of my voice, so I will just add "Martin" as a new voice and click the "Create voice" button. As you can see, the voice is now added, and here we can just select the new voice; let's see where it is, it's right here. I'll just leave the text, but as you can see, you can type in even longer text. For example, you could copy and paste it many times; I mean, that doesn't make much sense, but you can see the text length is basically limitless, because I also got asked whether you can generate speech for longer texts, and in the background this is now handled by this repository. The only thing is, if you work on Colab, I found that if you type in a text that's too long you may run out of memory, so be a little careful in the Colab notebook. If you plan to generate speech for longer texts, maybe also check out a cloud GPU like the one I'm using right now.

For the moment I will use the voice "Martin" and the preset "ultra fast"; this is basically the fastest option to generate speech that still has quite decent or good quality. We also have other presets, for example this one is the ultra fast preset of the old Tortoise-TTS model, but we also have a high-quality setup. Then there are the advanced options, so we can generate multiple candidates for a given text and decide which one is best for us.
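The "WAV, floating-point, 22 kHz" requirement above is easy to verify programmatically. Python's standard-library wave module can at least check the sample rate and channel count (it only handles PCM WAV; the actual conversion to 32-bit-float WAV is typically done in Audacity or with a tool like ffmpeg, which I'm only assuming here):

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, channels, sample_width_bytes) of a PCM WAV file."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getsampwidth()

# Create a tiny dummy clip at the rate Tortoise expects, then inspect it.
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)                       # mono
    w.setsampwidth(2)                       # 16-bit PCM (stand-in; the video asks for float)
    w.setframerate(22050)                   # the 22 kHz sample rate from the video
    w.writeframes(b"\x00\x00" * 22050)      # one second of silence

rate, channels, width = wav_info("sample.wav")
```

Running wav_info over each uploaded clip before creating a voice is a quick way to catch clips recorded at 44.1 or 48 kHz before the model produces odd results.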
There are even improvements to the latent averaging mode; this is basically how the vector representation that characterizes your voice samples gets calculated, and apparently there are also improvements to it in this repository, though to be honest I haven't checked them out yet. Here we can choose which sampler we would like to use: as we have seen, inference is faster with these two, but if you want higher-quality speech, then maybe use the DPM sampler. I also saw that they recommend at least 20 steps when using the DPM sampler, maybe even 30; I think that was the best preset the author stated. And here are the other options that we saw: the KV cache is now activated, half precision is disabled, and we are also still in low-VRAM mode; if you want faster generation, you can even disable this one. We will also apply VoiceFixer to slightly improve the speech and remove any background noise from the generated audio. That's basically it; then you can hit "Start", and now one candidate is generated. As I said, if you want multiple candidates, just increase the number here.

As you can see, this was very fast: we generated five seconds of speech and it took just eight seconds. If you remember how long it took with the original Tortoise-TTS model, this is a big improvement, and I'm very sure many of you will be happy about it. Now let's listen to it: "The expressiveness of autoregressive transformers is literally nuts. I absolutely adore them." Yeah, and I think this is what I also encountered in earlier versions of the Tortoise model: since my voice samples still have a German accent, and the model is not really good at diverse inputs, the clone is only so-so. If you're a native English speaker, I'm pretty sure the generated speech with your voice would sound more similar to you than it does for me. I think it's okay, but I don't think people would say, "hey, that's basically Martin, that's me".
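Back to the latent averaging mode for a moment: in spirit, it reduces several per-clip conditioning latents to one voice vector, and the simplest variant of that is a plain mean (a sketch with made-up shapes, not the repository's actual computation, which may weight and process the latents differently):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend each uploaded clip was encoded into one conditioning latent.
per_clip_latents = [rng.normal(size=512) for _ in range(3)]   # 3 clips

# Plain averaging: a single vector that characterizes the voice.
voice_latent = np.mean(per_clip_latents, axis=0)
```

Averaging is why more (and more varied) clips tend to give a more robust voice vector: idiosyncrasies of any single recording get smoothed out.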
To prove the point that this works better for native speakers, I created the same sample for a very famous person. I won't tell you the name, but please write your guess in the comments; then let's see how good the voice cloning capabilities of the Tortoise model actually are, i.e. whether you can identify the person. I will play it now: "The expressiveness of autoregressive transformers is literally nuts. I absolutely adore them."

All right, and that's it for today's video. I think this is definitely a big improvement to the Tortoise-TTS model, and many thanks to the author and creator of the tortoise-tts-fast library, which allows us to generate speech five to ten times faster. As I said, I think that's really cool, and I'm also pretty sure many of you will like it. Now I will show you an extended version of the introduction video. If you liked this video, give it a thumbs up, I would really appreciate that, and also subscribe to my channel. Until then, have a good time, see you in the next video, and now enjoy the outro video.

"As the sun rises over the vast expanse of our planet, a new day begins, and with it the wonders of the natural world awaken once more. Today we delve deep into the depths of the internet, where we will witness a spectacle unlike any other. For in this digital realm we will explore the captivating world of YouTube, a place where creators from all corners of the globe share their passions and inspirations with the rest of us. Join me, David Attenborough, as we embark on a journey through the diverse and fascinating landscape of YouTube, discovering the extraordinary stories and personalities that make this platform truly one of a kind."
Info
Channel: Martin Thissen
Views: 50,293
Id: 8i4T5v1Fl_M
Length: 19min 53sec (1193 seconds)
Published: Thu Apr 20 2023