Hello everyone, I'm Lao Sun. Today I'd like to introduce a voice-cloning project called GPT-SoVITS. It launched only recently and has already received a lot of praise: with as little as a five-second sample you can do text-to-speech conversion, and with about a minute of audio the similarity of the cloned voice improves significantly. I'll demonstrate this step by step in the tutorial that follows.

Another feature is cross-lingual synthesis: train on a Chinese speech dataset and the model can also generate English and Japanese directly. It's also an integrated tool with a web UI: vocal separation, slicing of the training set, Chinese speech recognition (ASR) and text labeling, all the way through fine-tuning and inference can be done inside one web interface. The level of integration is high, which makes it very suitable for beginners who want to try voice cloning.

The author has a demonstration on Bilibili, but it's only a quick sample and quite brief, so many people have said it's hard to follow and moves too fast. I introduced the project to a few friends and basically ended up recording demonstration videos for each of them, so rather than recording the same thing over and over, I decided to make one complete video covering the whole process, from collecting and processing samples to finishing inference. If you need it, you can use this video as a reference. I'm a beginner in speech synthesis myself, so my experience is limited to using this project; treat this as a user manual and nothing more.

This is the project's address; if you have access, you can go to the author's page and take a look. The author's project demonstration on Bilibili is also marked here, and you can watch it directly; if you're experienced, a quick look is probably enough, and if you're a beginner like me, follow along with this video and by the end you should be able to train your own model and complete inference.

There are three ways to get the project. The first is the author's GitHub code page, where you can obtain the code directly; as a beginner I suggest the methods mentioned next instead. The second is the author's prepackaged integrated version: download it and simply unzip it to use. The third is a cloud-storage address, which is kept up to date; download it from there, decompress it, and you can start running the project.

With the files in place, let's go over what preparation is needed before the project can actually run.
In other words, the prerequisites. First, for the graphics card, an NVIDIA card is recommended, at least a 20-series, and ideally with more than 6 GB of VRAM. This isn't strictly required: older cards can still run it, just less efficiently. For example, one of my friends has an old GTX card from years back and it still runs, only more slowly, especially during training. I also recommend 32 GB of RAM; mine is 32 GB, and raising the virtual memory a bit makes things go quite smoothly.

If you follow my recommendation and use the integrated version, you basically don't need to worry about Python and PyTorch, because the package already ships the whole environment. You only need CUDA 11 and cuDNN installed on your machine (the project recommends CUDA 11), and to confirm the installation succeeded. If instead you want to build the environment from source yourself, little by little, refer to the project requirements on GitHub: pull the project down, then install the dependencies and download the models. For beginners I suggest the integrated package: start with the simplest, easiest path and get the project running first.

Okay, let's officially begin. If the prerequisites are met, download the integration package, whichever way you like. After unzipping it you get these files. We need to run go-webui.bat, the batch file: open it, a terminal window pops up, and after a moment the project's web UI opens in your default browser.

This is the project's UI. Briefly, and don't worry about the details of the page yet, it covers three stages: preliminary preparation, training, and inference. Preparation means obtaining and processing the audio. You need some raw audio material, and then it gets processed: noise reduction, extracting the voice from the background, then slicing, speech recognition, and finally proofreading. That's the material-preparation stage. The second part is training: the dataset formatting steps, text extraction and so on, and then the training runs themselves.
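Before we go into those stages in detail, one quick aside on the prerequisites above: here is a minimal way to sanity-check the GPU side (card detected, rough VRAM size) from Python. This is just my own sketch, not part of the project; it assumes PyTorch is importable, which it is if you run it with the integrated package's embedded interpreter.

```python
# Quick sanity check of the GPU prerequisites (a personal sketch, not part of GPT-SoVITS).
# Run it with the Python that will actually execute the project, e.g. the integrated
# package's embedded interpreter, so the PyTorch/CUDA it reports is the one that matters.
import torch

if not torch.cuda.is_available():
    print("No usable CUDA GPU detected - training will be very slow or fail.")
else:
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 6:
        print("Less than 6 GB of VRAM - it may still run, just expect it to be slow.")
```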
The last stage is inference, which is what we'll ultimately use: take the well-trained model, run inference with it, and have it say the words we want it to say. So the work divides into those three steps. Now let's go through the whole process with a concrete case.

First, processing the audio data. The project ships with the UVR5 separation tool, but today's material doesn't need vocal separation, so we can go straight to the slicing tool. If there's time later, I may make another video specifically about how to separate vocals.

I've prepared an audio clip; it's the "Zhang San" material, a law-lecture recording. Let's listen to a bit of it: it's a lecture segment about the constituent elements of joint crime. It's a little over three minutes long; since this is a demonstration, my material isn't very long. We copy it over first. Here you fill in the directory of the source material, including the file name, and this is the default directory for the sliced output; it will write into this folder, which we can check in a moment. Let's run it.

The slicing is done: it cut 32 segments. The source is 3 minutes 17 seconds, and looking at the segment durations, the longest is about ten seconds and the shortest about three. You can use the slicer built in here, or do the segmentation of the audio in some other external tool; whichever method you use, the suggested segment length is around ten seconds, and don't exceed twenty. The longer each segment, the more pressure on the GPU and the heavier the load during training, so aim for roughly ten seconds. Too short isn't good either; let's listen to one of the three-second ones: "First, let's look at the broadest definition of joint crime." That's fine; a clip with too few words isn't ideal, but this kind of segmentation is still acceptable.

If you're a beginner, don't change any of the settings in the middle. My habit is to come back after slicing and check the durations. If any segment is too long, I suggest deleting it, provided your material isn't particularly scarce; you can always find another clip and slice again. You can slice as many files as you like: for example, I have another file, and if I fill it in after this one and click to slice the audio, it generates new segments into the same folder. So in theory you can slice any number of files into it, delete whatever is too long, and keep what's appropriate.

If you really are short on material and have no other audio of the speaker, you can do it manually: take a clip of around 30 seconds to a minute, cut it yourself outside with another tool, and try to keep each piece within about ten seconds. I've run into a lot of material that has already been edited, where there's essentially no pause between sentences, and the integrated slicer simply can't cut it.
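For those cases where the built-in slicer can't cut the audio and you slice outside the webUI, here is a rough sketch of what I mean by cutting manually with another tool. It assumes pydub and ffmpeg are installed (any audio editor works just as well): split on silence first, then chop anything still longer than about twenty seconds into roughly ten-second pieces.

```python
# A rough external-slicing sketch (my own helper, not part of GPT-SoVITS).
# Assumes pydub + ffmpeg are installed; adjust the paths and thresholds to your material.
from pathlib import Path
from pydub import AudioSegment
from pydub.silence import split_on_silence

src = Path("raw/zhangsan_long.wav")        # hypothetical input clip
out_dir = Path("output/manual_slices")
out_dir.mkdir(parents=True, exist_ok=True)

audio = AudioSegment.from_file(str(src))
# First try to cut at pauses; tightly edited audio may come back as one big chunk.
chunks = split_on_silence(audio, min_silence_len=400,
                          silence_thresh=audio.dBFS - 16, keep_silence=200)

pieces = []
for chunk in chunks:
    # Keep segments around 10 s and never above ~20 s, as recommended above.
    if len(chunk) > 20_000:
        pieces += [chunk[i:i + 10_000] for i in range(0, len(chunk), 10_000)]
    else:
        pieces.append(chunk)

for i, piece in enumerate(pieces):
    if len(piece) >= 3_000:                # drop fragments shorter than ~3 s
        piece.export(str(out_dir / f"slice_{i:03d}.wav"), format="wav")
```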
For example, with some files it only manages to cut a few seconds' worth and leaves a chunk of more than a minute. In that case, either give up on that file and switch to a new one to re-slice, or take the big leftover piece, cut it open manually with another tool, and copy the results over.

One more thing: the file names don't matter. If you're a particularly tidy person you can select them all and rename them, but there's no need; these are intermediate files for the audio pre-processing and will be processed further later, so it doesn't matter whether you rename them, and Chinese file names are fine too.

With the slicing done, our second step is ASR, the step that generates the text for each clip. The work here is simply to put the name of that folder in this field; the output will be generated into this other folder, which we can look at once it's done. Click Start. You'll see a lot of information scroll by in the terminal window; the warning messages can be ignored. Processing complete.

Let's open the output file and look at its structure. Each line has the file path, the language (Chinese), and the automatically recognized text. The number of lines should match the number of clips: 32 lines, and the slicing just now produced 32 files, so that's correct. This step has produced a new annotation file in the output folder, and this part is finished.

The next step is proofreading and annotating the text, so let's demonstrate that too. We still need the annotation file we just produced: copy its path, paste it here, and open the text-labeling tool. It starts up a bit slowly, so wait a moment. Once it opens, these are the 32 entries generated here, with the corresponding text and audio side by side. You can go through them, make further modifications where something is wrong, and save the edited text. You can also adjust the audio here: because our clips are already suitable we don't need to cut anything further, but if you feel a clip isn't right you can split it, or select two and merge them; either is fine. Since this is just a demonstration, we won't proofread it seriously; let's close it.

With slicing, ASR, and the text labeling done, we move on to the preliminary work for the training set. In this step we first need to give our model a name; I'll just call it Zhang San. It also displays your graphics card information and the paths of some pretrained models; if you downloaded the integrated version, these are the base models, what we call the 底模, and the paths are already filled in. If you built your own environment, you need to make sure these models are in place. If that information all looks fine, we move on to the next three steps: first, paste the path of the text annotation file here; yes, this one, the annotation file we just copied.
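Since the next stage consumes the annotation list we just produced, this is also a good moment to double-check it programmatically rather than counting lines by eye. A small sketch, assuming the pipe-separated layout (audio path | name | language | text) that my .list file showed; peek at a line of yours first to confirm the format, and the paths here are placeholders.

```python
# Sanity-check the ASR annotation list against the sliced audio (a personal helper).
# Assumes each line looks like: path/to/clip.wav|speaker_or_dir|language|recognized text
from pathlib import Path

list_file = Path("output/asr_opt/slicer_opt.list")   # hypothetical path, use your own
audio_dir = Path("output/slicer_opt")                # hypothetical path, use your own

lines = [l for l in list_file.read_text(encoding="utf-8").splitlines() if l.strip()]
wavs = sorted(audio_dir.glob("*.wav"))
print(f"{len(lines)} annotation lines vs {len(wavs)} audio files")

for line in lines:
    parts = line.split("|")
    if len(parts) != 4:
        print("Unexpected field count:", line)
        continue
    path, _name, _lang, text = parts
    if not Path(path).exists():
        print("Missing audio file:", path)
    if not text.strip():
        print("Empty transcription for:", path)
```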
Next, the training-set audio directory: the path of the folder containing the audio files themselves. Everything else here can stay at the defaults. Text acquisition (step 1A) is enabled by default, so click it and watch the terminal; done. After it finishes, look in the output folder: there's already a folder for Zhang San, and inside it the text that was just extracted, folders 2 and 3.

Then step 1B, the SSL feature extraction. Watching the output again, folder 5 has been created and the process reports that it has ended. Looking inside folder 5: did it carry over the 32 audio files? Yes, they're all there, so that's fine.

Then step 1C, the semantic-token extraction. Again watch the terminal until it shows it's finished; file 6 has also been generated. With that, these three steps are complete. Once you're proficient you don't have to take such small steps: just use the "one-click triple" button at the bottom and it runs all three at once. I only did it step by step to demonstrate, because when some friends of mine ran the one-click version and something failed, they couldn't tell which step caused the error. Taking one step at a time in the early stage makes it easier to locate the problem.

With the formatting tool finished, we reach the next stage, the most important one: training the model. A few places in the fine-tuning settings need emphasis. The first is the batch size. Mine is a 3060 Ti with 8 GB of VRAM, so setting it to four is about right; the usual advice is to divide your card's VRAM in gigabytes by two and use that. Of course everyone's card is different, so give it a try; lower is not automatically better, please test for yourself.

Next, the total epochs, that is, how many rounds you want to train. The number of training rounds depends on how much data you have. If your data is like ours, only about three minutes, I suggest 20 to 25 is enough; if you have several hours, you can raise it accordingly. That's not to say it can be infinite, or that more is better or bigger is better; it still depends on your sample. The default cap here is 25; mine can reach several hundred because I edited the config file for testing. Let's set the epochs to 25. The save frequency is how many epochs between saved checkpoints; we'll set it to save a file every five epochs, and the rest of the defaults are fine.

Those are the SoVITS training settings. Below that is the GPT training, which can be kept consistent with the above. We'll set 20 epochs; the GPT model doesn't actually need many, somewhere around ten to fifteen is usually plenty, and likewise save a checkpoint every five epochs. The final weights will be saved into these two locations: the SoVITS weights directory and the GPT weights directory.

Now we start training. You can follow the training steps through the terminal. If this training turns out to be particularly slow, I'll cut that part out of the recording.
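While that trains, a quick aside on the batch-size rule of thumb above: dividing VRAM by two is easy to compute instead of guessing. A minimal sketch, again assuming PyTorch; treat the result only as a starting value to adjust by trial on your own card.

```python
# Rough starting value for batch size: VRAM in GB divided by two (just the rule of
# thumb from above, not a guarantee - test and adjust on your own card).
import torch

vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
suggested = max(1, int(vram_gb) // 2)
print(f"~{vram_gb:.0f} GB VRAM -> try batch size {suggested}")   # e.g. 8 GB -> 4
```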
The speed actually looks fine. While it runs, let me give a general introduction to why we do two trainings, a GPT training and a SoVITS training; that's where the name GPT-SoVITS comes from. This project takes a different approach from some other projects; it's not about which is better or worse, just a different way of implementing things. The GPT model mainly serves as a reference for the audio generated during final inference. What exactly does it provide? I have a note here that I wrote: it inherits the timbre, the emotion, the pauses and so on from the reference audio. In other words, in other projects, once you've trained a model and have it generate new speech, it generates purely from the model as a whole; with GPT-SoVITS, it generates following the tone and emotion of the specific sentence you provide as reference, which makes it much more convenient to get a particular delivery. If that sounds a bit confusing, we'll make sense of it together when we do the inference at the end.

This is going fairly fast; it's already at epoch 25. Looking at the files, it saved one checkpoint every five epochs: 5, 10, 15, 20, 25. Okay, that's done. Now the GPT training starts, also 20 epochs with a save every 5, and the GPT checkpoints go into this folder. It should have started... there it goes... and it's done. We can see checkpoints at 5, 10, 15 and 20 in here, three or four files; those are the GPT models. Then we can move on to the next step: inference.

For inference, first make sure our experiment's model name is selected, then click to open the inference web UI. Keep that annotation text handy, we'll need it in a moment, and now you can see what the "reference audio" I mentioned earlier is about. Refresh the model paths here. The entry ranked first is the base model: if you don't train your own model and rely only on the GPT base model and the SoVITS base model, you can type text and it will still speak, it just won't have any particular character to it. So instead we pick the models we trained: the 20-epoch GPT model and the 25-epoch SoVITS model.

Now we've reached the key point of this project: we just need to upload a reference audio. Take one of the clips from our sliced audio, any one. Let's listen to this one: it's the sentence about "the theory of the unity of the subjective and the objective." The reference audio is best a single complete sentence, without long pauses, so let's swap it for another one. Let's use this one: "First, let's look at the broadest definition of joint crime." We check its file name, something like 3587840..., and search the annotation list by that file name to find its text. Let's play it: the sound quality is good, and it is indeed the "broadest definition of joint crime" sentence.
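Hunting through the annotation list by file name like this is a bit tedious by hand. Here is a small sketch of what an automated pick could look like: list clips of roughly 3 to 10 seconds (about one clean sentence) together with their recognized text. It reuses the same pipe-separated .list layout assumption as before and pydub for durations; the path is a placeholder, use your own.

```python
# List candidate reference clips (3-10 s) with their recognized text (a personal helper).
# Assumes the same pipe-separated .list layout as earlier: path|name|language|text.
from pathlib import Path
from pydub import AudioSegment

list_file = Path("output/asr_opt/slicer_opt.list")   # hypothetical path, use your own

for line in list_file.read_text(encoding="utf-8").splitlines():
    parts = line.strip().split("|")
    if len(parts) != 4:
        continue
    path, _name, lang, text = parts
    seconds = len(AudioSegment.from_file(path)) / 1000
    if 3 <= seconds <= 10:
        print(f"{seconds:4.1f}s  {lang}  {Path(path).name}  ->  {text}")
```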
So that's the reference audio and the text of the reference audio filled in. One small reminder: if the reference text doesn't end with a period, it's better to add one, because it helps prevent errors later when generating the speech output.

Now we can have it say this same sentence itself, and then something else. The segmentation options here are for when you want to synthesize a large block of text: you choose how to cut it, and there are several splitting modes below as well. For now we can just pick a split here. Also, I forgot to mention just now: this place requires selecting the language as Chinese; if you want it to speak Chinese, choose Chinese for the target text too. Since our text is short, let's just generate it and have a listen.

The generated line plays: "First, let's look at the broadest definition of joint crime." If you listen carefully there's a bit of an electronic edge to it, but there it is: a complete pass from preparing the audio, to slicing, to recognition, to processing, to training the models, and finally to inference. This is one generated sentence, and you can click below to download it. And remember, this is just 25 epochs on a three-minute reference set; if you use a large amount of data, say half an hour or two hours, the tone will be better than this.

Now let's have it say something different, words that aren't in our reference audio at all: "So this combination makes the problem increasingly complex." And then let's find a longer text, this passage from Metro 2033; since it's relatively long, we'll let it be cut into pieces. The result reads roughly: "Artyom had no memory of his father. His mother was with him until he was five. Mother and son lived at Timiryazevskaya station for many years; life there was good and the days passed peacefully and calmly, until one day a rat infestation overran the station." The result sounds like that.

As for which checkpoint to choose here, I suggest the 15- or 20-epoch one: too few epochs isn't good, and much bigger isn't necessary. Switching checkpoints here is easy, so if you train a bit more, to several tens or around a hundred epochs, come back and listen again: with more checkpoints you'll be able to hear subtle differences between them.

Let's also test the base models on their own. You can actually generate using only the base models, so even with no sample and no training of your own, it will produce speech. Listen to the effect: "Those rats were dark and frighteningly large. On that day, without any warning, the dripping-wet horde poured like a tide out of the pitch-black tunnel." So that's not our model; that's the sound of the base model. Now we switch back to our own voice model.

Then let's test another feature: cross-language support. We'll have our model speak English. I'll find one of my test texts; this one comes from the novel Flipped: "All my life I've wanted Juli Baker to leave me alone. For her to back off, you know, just give me some space." That's the English result. Now Japanese; the Japanese text comes from Norwegian Wood. Listening to it, can you still hear a bit of our speaker's timbre in it? It seems there's still some.
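One more note on the segmentation options and long passages like that Metro 2033 text: if you prefer to split the text yourself before pasting it in (which also helps with the word-swallowing issue I'll mention in the precautions), a simple punctuation-based splitter is enough. This is a hypothetical helper of my own, not something from the project.

```python
# Split long target text into short sentence groups before synthesis (a personal helper).
import re

def split_for_tts(text: str, max_chars: int = 50) -> list[str]:
    # Cut at Chinese/Western sentence punctuation, then regroup into chunks
    # no longer than max_chars so each synthesized piece stays short.
    sentences = [s for s in re.split(r"(?<=[。！？!?.])\s*", text) if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current += s
    if current:
        chunks.append(current)
    return chunks

demo = "阿尔乔姆对父亲没有印象。母亲陪他到五岁。直到有一天，鼠患侵袭了车站。"  # example text
for part in split_for_tts(demo):
    print(part)
```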
That's the cross-language support, the second and third features. It's very convenient to have: you only need to train a Chinese model, and then you can have it speak English and Japanese; conversely, if you train English or Japanese models, they can also speak Chinese. We won't demonstrate that in today's presentation; when you try it yourself, go find some corresponding audio material and train on it. The entire process is like that. Now let's look at the notes I organized afterwards.

Finally, let me go over the precautions once more. The first is virtual memory. As I said, besides the graphics card there are other prerequisites to satisfy: for the card, more than 6 GB of VRAM, and beyond the machine's 32 GB of physical memory, your virtual memory had better be generous. It's best to let the system manage the virtual memory: don't set a fixed size, let the system allocate it itself. That avoids a lot of error situations.

In addition, there's the s2 config JSON file, the one where you can change half precision to full precision. If you run into related problems, try modifying that file and see. After version updates this problem basically no longer occurs, though it seems some graphics cards may still hit it.

The third is the number of epochs. As you saw just now, the default epoch cap is 25; it can be raised to 300 or more, and you can change this yourself in the project file. I'll show you where it is: this one, at this position, in the training section. The default of 25 can be changed to 300. The GPT training cap can also be changed, but there's usually no need to change that one much; it's fine as it is.

The fourth item is batch size, which we already covered while operating: set it according to your own graphics card, the suggestion being half of your VRAM in gigabytes. Those are the precautions.
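For that half-precision item, here is roughly what the edit could look like as a few lines of Python instead of hand-editing. Both the config location and the key name here are assumptions on my part, since they vary between versions, so open your own s2 JSON first and check what it actually contains.

```python
# Flip the s2 config from half precision to full precision (a sketch - the path and
# key name are assumptions on my part; confirm them in your own s2 JSON first).
import json
from pathlib import Path

cfg_path = Path("GPT_SoVITS/configs/s2.json")         # hypothetical location, check yours
cfg = json.loads(cfg_path.read_text(encoding="utf-8"))

train = cfg.get("train", {})
if "fp16_run" in train:                               # assumed flag name
    train["fp16_run"] = False                         # False = full precision
    cfg_path.write_text(json.dumps(cfg, ensure_ascii=False, indent=2), encoding="utf-8")
    print("Switched training to full precision.")
else:
    print("No fp16 flag found under 'train' - the key name differs in this version.")
```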
Besides those precautions, there are several common questions we must mention. The first is swallowed words. Because there's a GPT model involved and it's autoregressive by nature, especially when the output text is long, some words may get dropped or repeated: for example it reads the first few words, skips a few in the middle, or reads something twice. What do you do when that happens?

First, both the reference audio and the reference audio's text need to be up to standard: the audio quality should be high, clear and clean, with no noise and no background music; take care of all of that. In addition, the text generated automatically by ASR is worth another pass: go through the labels by hand and make sure every word corresponds to the audio. That's very important. Also, don't fine-tune for too long: with roughly forty minutes of material, the default epoch counts are basically enough. That's not an absolute rule, though; in the end you judge by ear, go and feel the quality of the sound yourself, and your ear becomes more sensitive the longer you work with it. Another way to reduce word swallowing is simply not to make the synthesized text too long; if it's long, split it manually. And one extra trick I mentioned with the reference audio: regardless of how your text ends, put a period at the end of the reference text; that also reduces the chance of errors later.

The second question is memory and VRAM errors. Satisfy the VRAM requirement; if your graphics card is simply too weak, there's no fix short of upgrading it. During training, during fine-tuning, set the batch size to an appropriate value, which is something you have to try out for yourself. And again, virtual memory: absolutely do not restrict it, let the system manage it automatically, with the prerequisite that your physical memory is also sufficient; I suggest 32 GB.

The third is comparing two projects: GPT-SoVITS versus Bert-VITS2, which is better? Personally, I think Bert-VITS2 is more stable, but its training takes longer and its model is bigger; give it a large text, a whole article, and it will read it aloud with stable output, but the emotional variation isn't something you can control. GPT-SoVITS, on the other hand, can give you the timbre or emotion you want through the reference audio and reference text. So I think each has its own advantages and disadvantages.

The last question is what role the GPT model ultimately plays. Overall, its role is that, through the reference audio you set, it provides our synthesized speech with a reference for timbre, emotion, pauses, and speech rate.

That's the whole project. The purpose of my video is to provide an introductory walkthrough for beginners like me; I hope that through this demonstration everyone can get hands-on, get the project running, and feel the fun of it. If anyone still has questions, leave a message and we can discuss them together. I've also only just started making this kind of sharing video and don't have much experience, so if you have suggestions or opinions, please leave me a message and we'll discuss them together. That's it for today. Thank you, goodbye.