Actually attempting to CLONE my voice using ML?

Captions
What's happening, guys? How are you doing today? I just got out of a bunch of different client meetings, actually back-to-back client meetings, and I did a little bit of prep compared to yesterday so that we're not just spending an absolutely ridiculous ton of time downloading PyTorch tensors. We are going to get this done tonight, because I've tested it out and it works, and it'll be interesting to hear your thoughts on the results.

For those of you that didn't tune in to yesterday's live stream, what we were doing was testing out some voice cloner technology. There's a repository available on GitHub by a guy called CorentinJ which gives us the ability to clone our voice and generate speech. So, synthesizing speech purely from a clip of ourselves speaking: pass through a text prompt and it's actually able to generate that text, which I thought was pretty cool, because you can train it on your own voice or somebody else's voice. Don't be going using this for bad things, guys. Be good people. I'm assuming you guys are good people and you're gonna use it for good things. But anyway, yeah, we're gonna give this a crack.

What's happening, everyone, welcome on down to the live stream. Yeah, there was a typo in the title of the video originally. Thanks, David, for pointing that out. I did catch it eventually once you mentioned it, and it's all fixed now. Okay, we're going to dive into this. I am going to keep this one a little bit shorter tonight because my throat is killing me. Hope it's not the 'rona, but we shall see in a short time. I don't know, we'll find out. Alrighty, let's jump into this.

Also, let me know if the audio sounds better, because we had an absolute shocker in yesterday's. I don't know if I explained what happened, but the game plan was to use some powerline adapters for Ethernet to ideally speed up my net, because it's really fast at the router but it's slow as hell over Wi-Fi. So I got all these
powerline adapters, I plugged them in, and then I got a massive amount of distortion via the microphone and it went absolutely terrible. I was then thinking, going to all this effort to get a good mic and then the powerline adapter just throws it out. So I literally ripped them all out. You can see them here, I tore them out straight away. I was like, gosh, what an absolute waste of time. Anyway, you live and you learn, that's part of the game.

All right, cool, let's jump back into it. Okay, awesome, I'm loving that the audio is good. So we are going to go back to where we were yesterday. If you didn't watch yesterday's stream, we went and set this up so that we could effectively go and use this voice cloner. Now, for anyone that has not checked it out, it is by Corentin...

Oh, also, I got my language translator model working this morning. I went through and debugged the absolute hell out of it. It is working and it's training right now. Let me show you: we're at 10 epochs, so it's taking like half an hour to train per epoch. It's probably super small, you probably can't see that, it's way too bright. Yeah, 10 epochs, we are getting there.

All right, cool, what are we doing? We are going back to CorentinJ and we are leveraging this Real-Time Voice Cloning capability. The guy's a machine learning engineer at Resemble AI, an absolute wizard, because he's made this all open source and available for you guys. It's brilliant. Okay, so we're going to give this a crack.

Yesterday, if you tuned in, we were just downloading the PyTorch weights to be able to leverage this model. Let me open this up inside of VS Code so it's a little bit easier to see. Don't need those. So if I go into here, what I went and did yesterday is finish downloading this saved model. You can see that we've now got an encoder set of weights, a synthesizer set of
weights and a vocoder set of weights. We need all three of those different sets of weights to be able to leverage this library. They are effectively the pre-trained model components, which is going to make our life so much easier.

Now, I have gone and done one additional bit of prep to be able to run through this: I've actually gone and recorded some examples of my voice. Don't criticize my singing, but one of them is me singing quite possibly one of the greatest songs in the world. So these are my MP3 audio files here. I've got this one, which is me just talking:

"Test. This is me speaking just to test out the voice cloner. We're trying to work out whether or not this works on my voice. Me speaking."

Cool. And then I did this one:

"99 bottles of beer on the wall, 99 bottles of beer. Take one down, pass it around, 98 bottles of beer on the wall."

Cool, so we've got some audio, and this one's me singing. We're going to try to use those inside of the voice cloner. I don't know if there's a prescribed length of audio that you need to include in here. We'll go back and read the documentation later, but ain't nobody want to read documentation. We want to get coding and get this up and working.

So I'm going to copy these two recordings (I really need to get these a little bit bigger to show when I'm working inside of Windows Explorer) and I'm going to drop them inside of the Real-Time-Voice-Cloning folder. I've already got them there, but I'm just going to show you, so we're just going to replace those two files. These two recording files are purely what I went and recorded today.

If you're curious as to how I recorded these, I recorded them via OBS, the thing that I do streaming on. You can probably see it in that corner over there when I'm on the chat window. Then what I actually did is I passed it to VLC, which is just an open source video watching thingo, and you can actually convert it to MP3 so you can go and use
that. And that's effectively what I've gone and done. Nothing fancy, that is purely it.

Now we're going to go and give this a crack. There is the documentation here, but I've got an idea as to how to do this now because I went and played around with it a little. So what we can do is go into the YouTube folder where I saved it, so cd youtube, and then we need to go into the voice cloner folder, which I have dubbed as commencing in 2021. We're clearly in 2022, my bad. And then we're going to go into the Real-Time-Voice-Cloning GitHub repo that we went and cloned. Then we need to activate our environment. Actually, we should have gone back and activated our environment first, so it's going to be .\voicecloner\Scripts\activate, and then we are going to go into the Real-Time-Voice-Cloning folder, and then what we need to do is run python demo_cli.py (can you see that? yeah, you can see that). As soon as we run this command, it's going to go through a bunch of prompts and allow us to actually test this out. So I can go and run this now and it should ideally give us some reasonable results as we go through the rest of the CLI steps.

Okay, cool, so this is good. You can see it's gone and found all of the different file paths for our saved models. You can see that it's not using our CPU, it's going to be using our GPU. It's gone and loaded up a bunch of stuff. It's using CUDA, so that means it's going to be using our GPU. It's building a WaveRNN.

Also, I know I'm getting sidetracked, but damn, have you seen the prices of GPUs lately? They've dropped massively. Let me know in the comments if it's time for us to upgrade that deep learning machine beneath my desk to a 3090 Ti. You let me know.

All right, cool, so we're up to this point: we've gone and run python demo_cli.py. Now all we need to do is pass through a reference voice. You can see that that's what it's
asking us for right there. So we need to pass through a reference voice: enter an audio file path of a voice to be cloned. Now keep in mind we've got those two reference files that I just showed you, which I copied over into this folder, so we just need to pass through the name of that file. We can pass through recording one or recording two. I've got a feeling passing through a longer recording might yield better results, but who knows, we shall find out.

So if we go and type in recording 1.mp3 (can you guys see that? yeah, you can), that will train based on... wait, what's happened there? That looks okay. Yep, all right, cool. So you can see it looks like it didn't load successfully with PySoundFile, but it's trying audioread instead. That then loaded successfully and created an embedding.

Then all we need to do is write out a sentence that we actually want to synthesize. So we can type in, I don't know, you let me know, guys: what should I go and write out based on my cloned voice? Because we've now effectively cloned our voice, and now we can test it out. All right, I'm just gonna come up with mine: "Hey, how you doing? Hope you're having a good day." I hope punctuation doesn't screw it up. Okay, there you go, it is now creating a spectrogram. It's doing it, I think.

"How are you doing? Hope you're having a good day."

Do you guys hear that? So it went and synthesized it. That is an example of how to actually go and do it, and then it'll prompt you again, so we could go and do it with another file. I could use recording 2.mp3. "Hello world." Nice.

"It's over, Anakin, I have the high ground." "It's over, Anakin, I have the high ground."

Thanks, Devita. So it's synthesizing. Keep in mind the audio files that I put in were tiny as well, like eight seconds, and it's generating "It's over, Anakin, I have the high ground." How cool is that? Do you hear it? Let me know if you hear it, give me a thumbs up in the chat. Oh well, clearly you're hearing it, you're saying so.
Cool. And the nice thing is that when it actually goes and generates it, it creates these outputs, so if you wanted to go and play it later on or do something with it, you could.

I want to try making a longer source file and seeing if it actually generates slightly different audio. So let's close this now, and I'll see if I can show you how I did it. This is very meta: I'm actually going to be using OBS, so I'm going to start recording my voice. Let me just turn the system audio off.

"Hey, this is me talking, just having a chat, seeing how everything is going. We're recording some live audio, we're YouTubing right now and we're getting stuff done. This is an audio file that we are recording so that ideally we're gonna be able to generate a significantly better voice clone. Who knows where this will go. Today I went to the gym, I did a bunch of random workouts, I did some chin-ups, I did some dips. I'm running out of stuff to say, who knows where this is going to go, but I'm going to keep talking so that we get a sufficiently long audio file."

Okay, so that is the audio file generated there. That's now 38 seconds, still not massively long, but we'll see if that actually works. What I've been doing is opening up VLC, going into Media, and then over here you can hit Convert / Save and add the file that you want. So I'm going to add that file that I just recorded, which is 38 seconds. I wonder if that'll improve results. Who knows, we'll see, but I've got a feeling that a longer source file will actually clone a lot better. So if we throw that one in and then hit Convert / Save, we want it as MP3, and I'm just going to dump it on my desktop. It happens reasonably quickly. Uh, that has saved an MKV file. No, that's not what I want. This is still transcoding. Having a chat, seeing how everything is going, we're recording some live
audio. This is the audio, but I don't want an MKV file, I want it in MP3 ideally. Let's try that again. So Convert / Save, Add (I didn't try it live), so I go grab this, grab this, grab this, then Convert / Save. I'm selecting Convert, the source, and then the destination file, which I want on the desktop, and it's going to go to MP3. Oh, is that because we're saving it as .mkv? Okay, maybe that'll work. All right, let's try that. That's MP3, okay, that looks better.

"Hey, this is me talking, just having a chat, seeing how everything is going. We're recording some live..."

All right, cool, we're good. So now let's call it longer recording, and we're going to cut that and throw it into our main repo: YouTube, voice cloner, Real-Time-Voice-Cloning. I'm going to throw it in there so that we're in the same folder. Oh, I closed the command prompt. All right, we're going to have to go do this again: D drive, cd youtube, cd... what is it? Yep, that one. We need to activate, so .\voicecloner\Scripts\activate, and then we need to go into the folder and run python demo_cli.py. You probably can't see that because my chat's blocking it. Run that. Have to hydrate.

All right, cool, so it's asking for the reference voice. This is crazy how easy this is. I actually really, really like this library. Mad props to you, CorentinJ. Guys, if you're gonna use this, make sure you go give the guy some stars on the GitHub repo. I'll go and do it afterwards, but actually, before I forget, otherwise I will forget: give him a star. Uh, sign in. Mad props, this is sick.

All right, what are we doing? Longer recording. We need to pass through the name of the recording, so it's called longer recording.mp3. All right, guys, give me the prompts: what do you want me to type? What are we generating? Yeah, it's really, really quick, Rudy, it's nuts how fast that is. I wonder if it's because I've got the GPU, and it's not training. It's probably using some sort of single-shot model. What, a small glass of
water? Oh yes, got to stay hydrated, and I just dropped water all over my keyboard. All right, what... "I'll try spinning, that's a good trick." Is that all right? Oh, right, I know it's already got 30k stars, but an additional one won't hurt.

"I'll try spinning, that's a good trick." "I'll try spinning, that's a good trick."

So that is sounding a little bit more like me. It's a little bit crusty, but it's doing the right thing. Longer recording... I wonder what would happen if I just had a monster video of me. I'm probably going to try that at some stage, because I'm really enjoying this. Longer recording.mp3. What did we actually say we were just talking about? "I am live streaming right now on YouTube, what's up." "My money don't jiggle jiggle, it folds, I like the way you wiggle wiggle, fo' sho'." Oh, we should do that next.

"I am live streaming right now on YouTube, what's up."

Kind of works. I mean, I'm really enjoying this. That's the voice cloner, guys, relatively straightforward to set up. So what did we do yesterday? We really just went and installed PyTorch, we set up a virtual environment, and then it's really just cloning the repo and downloading the PyTorch tensors. "I need another ML to delete the extra noise." Yeah, that would be good.

All right, let's do the Louis Theroux rap. "My money don't jiggle jiggle, it folds. I like the way..." here we go... "...you wiggle wiggle, fo' sho'." Can I say "fo'"? I will say "for sure". Wow, Jay-Z over here. Yeah, thanks, Ash.

"My money jiggle jiggle, it folds. I like the way you wiggle wiggle, for sure."

That's great, I'm loving this. There is a... what's a quote as well... How do we exit out of this? If I quit it, how do we stop the CLI? Do we actually have to close it down? Yeah, there we go. All right, so Ctrl+C closes down the CLI. But there is a GUI as well: python demo_toolbox.py. Let's run that. Let's clear this: python demo_toolbox.py. What does this do?

"Is the model only male?" It's a voice cloner [unclear], so you can clone your own
voice or clone any voice within reason, and it will generate whoever's voice you want.

Oh, this looks hardcore. What are we doing here? All right: "Welcome to the toolbox! To begin, load an utterance from your datasets or record one yourself." Right, where do we start? "Add four more points"... We should really watch a YouTube video of how to get this set up. Use embedding, Record, Play, "using embedding from"... Can I just throw this... wait, longer recording. Okay, that's loaded up a spectrogram, cool. "Add three more points to generate the projections." So what happens if I load this? This is going to load each one. "Add two more points." It's just saying we need one more recording. Let's quickly create one:

"My money don't jiggle jiggle, it folds. I like the way you wiggle wiggle, fo' sho'."

We're gonna start turning this into a rap channel in a sec. I mean, I'm not opposed to that. This is the longer recording, this is the shorter one, and we'll call it Louis. Then I'm going to open up VLC, Media, let's open that bad boy up. I'm just converting it from MKV, which is the default output that I've got OBS set to, and then convert it. I'm going to dump it on my desktop and we'll call it louis.mp3. It should convert... looking good. Then I'm going to grab that and throw it into the voice cloner folder, bang. And where's that GUI? Then I'm just going to add that one last one. Okay, so this is drawing projections now. Interesting. Wow, I don't know what to make of this. I should go and read what it is that we've got in here.

"My money don't jiggle jiggle..."

Okay, so that is... calm down, Nick. Synthesize and vocode. Oh wait, so they're going to type this out. Oh, so you can do it via a GUI rather than using the CLI. Here I am just clicking buttons. You're gonna play it:

Welcome to the toolbox! To begin, load an utterance from your datasets or record one yourself. Once its embedding has been created, you can synthesize any text written here. The synthesizer expects to generate outputs that are
somewhere between 5 and 12 seconds. To mark breaks, write a new line. Each line will be treated separately. Then they are joined together to make the final spectrogram. Use the vocoder to generate audio. The vocoder generates almost in constant time, so it will be more time efficient for longer inputs like this one. On the left you have the embedding projections. Load or record more utterances to see them. If you have at least two or three utterances from the same speaker, a cluster should form. Synthesized utterances are of the same color as the speaker whose voice was used, but they're represented with a cross. Synthesized utterances are of the same color as the speaker whose voice was used, but they're represented with a cr...

All right, so what we're saying is that we need more embeddings to produce closer representations of my voice. At least that's how I'm taking it, because what it's saying here is that a cluster will form as I pass through more samples. So if we pass through more samples, we should get better results.

Yes, Piero, you've created a monster. We need to start integrating some more rap into the YouTube channel. But what do you think, guys? I think this is absolutely awesome for something that is completely free, and it works, and you can also enhance the output as well. "At least you..." Ash, buddy. Oh, "at least you know this TikTok song, too bad you missed the memo of [unclear]." Look, I don't watch the news, you know this.

All right, cool, so let's try something else. If you wanted to go and pass through your own text, you could pass through whatever you wanted. So: "This is us testing out our own text." What happens if we enhance the vocoder output? It'd be funny to pipe that output back onto itself. I wonder if you'd see a huge distortion in the cluster.

Welcome to the toolbox! To begin, load an utterance from your datasets or record one yourself. Once its embedding has been created, you can synthesize
any text written here. The synthesizer expects to generate outputs...

Interesting. So you can see that it's getting further and further away from this cluster. I reckon good practice would be to pass through a ton of samples of you speaking, and then you'd probably get a reasonably good example of this. What happens if we export? What are we exporting, are we exporting the audio files? So what happens if we synthesize and vocode with the enhance? Let's do this.

"This is us testing out our own text."

Okay. I mean, this is slightly... oh wait, hold on, so we are seeing clusters forming, although this sample... I wonder if that's me rapping and this is me just speaking. Should we add in another one? Let's quickly jump back on over to the chat. Let's have a quick chat, let's discuss, shall we?

So what did you think? I thought that was actually reasonably cool in terms of what's possible with open source. And I just threw it on my little GPU. Oh, nobody said: are you guys looking at GPUs at the moment for deep learning, now that they're actually coming back down?

"Is it possible to input more than one audio? It seems like it only allows one." Nah, Piero, we actually put through a bunch just then in the GUI. If you jump back, you can pass through a bunch of different files into the GUI, and it's suggesting that a cluster will form and you'll get better audio results, I think. At least that's what it's saying.

"What about the Spaceship challenge on Kaggle?" Yeah, we're definitely still going to go back and do some more Kaggle stuff. I'm just finding my feet in the streaming game and trying to cover some interesting topics that I'm particularly interested in, but I'm definitely going to do the Kaggle stuff.

"Is it possible to speak another language that I don't know?" I don't actually know. Maybe if we can combine it with deep fakes
it's gonna be really cool. I reckon it'll get super, super interesting. Yeah, I'm 100% with you, David, on upgrading the GPU. Something definitely needs to happen soon.

Anyway, cool, let me just quickly show you that yes, you can definitely put through more audio files. You were seeing this on the screen here, right? So let's add in some more samples. Let's just record a bunch more samples and see if we actually get more of a cluster forming. So if I go back and start recording:

"Hi, this is me testing out speech, just to see whether or not we're going to get sufficiently high quality synthesized audio out of the current model."

"Let's test out what recording audio via this model sounds like when we give it a bunch of different types of samples."

"I wonder what I'm gonna eat for dinner tonight. I'm just recording a bunch of different samples here, and we're gonna try using all of these and see what they end up turning into."

Should we do one more rap? "Armed and dangerous, ain't too many can bang with us, straight up weed, no angel dust, label us Notorious." That's a little bit of Biggie there.

All right, cool, so what are we going to do? We are going to take all of those audio files and push them out. So if we go back to VLC... if we remux them, is that going to allow me to do... "Probably pizza." Yeah, 100%, I'm definitely going to eat pizza for dinner tonight. My diet is terrible at the moment, Lily, you know this.

Where did we start? 559... from here to here, I think. Yeah, must be those. All right, we're going to take all of those, Convert / Save, and we're going to output... why can I not change the output? All right, let's just do that and see what happens. Okay, so we've got all of our MP3 files. All of these have just been created now. Let's throw these in YouTube, voice cloner, Real-Time-Voice-Cloning, and give it more samples. So let's go and add in more: this one (you can see it's loading it up), and this one, and then this one. All right, so there is definitely a
cluster forming now. Did we load that one already, 57? Yes, we did. You can see which ones you've loaded from over here, in this little region. All right, so you can definitely see a cluster forming, which ideally should get better results. I wonder... this is probably the rap one, me changing my voice.

All right, what if we go and do this now? Synthesize and vocode.

"This is us testing out our own text."

I'm sounding American now, but at least you can see it. It looks like it's a little bit off, but I still think this is absolutely brilliant in terms of what we're actually doing. What does toolbox output do? Three, two... "Welcome to the toolbox. To begin..." These are... "...to the toolbox." "This is us testing out our own text." These are all of the different samples. These are brilliant. What happens if we use this? "This is us testing out our..." "This is us testing out our own text." Me just clicking around.

Anyway, that is the voice cloner now run through. Hopefully you enjoyed this one, guys, hopefully you enjoyed testing out the voice cloner. And again, mad props to the guy that created it, CorentinJ. Make sure you give him an extra star on his GitHub repo, because the guy's out there doing God's work and helping all of us developers alike to be able to leverage some new and advanced technologies.

Thank you so much to all of you that have joined the live stream, I'm so pumped to have all of you here. I'll see you tomorrow, I'll catch you in the next one. Peace.
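
The CLI run in the stream depends on three sets of pre-trained weights (encoder, synthesizer, vocoder) and on the reference recordings sitting in the repo folder before `python demo_cli.py` is launched. A minimal pre-flight sketch of that check is below. It is not part of the repository: the weight file names, the `saved_models/default` subfolder, and the helper names `missing_files` and `demo_command` are assumptions for illustration, and only `demo_cli.py` itself comes from the stream.

```python
import sys
from pathlib import Path

# ASSUMPTION: illustrative file names for the three pre-trained components
# named in the stream (encoder, synthesizer, vocoder). Adjust them to
# whatever your download actually contains.
WEIGHT_FILES = ("encoder.pt", "synthesizer.pt", "vocoder.pt")


def missing_files(repo_dir, weights_subdir="saved_models/default", recordings=()):
    """Return the expected weight and recording paths that are not on disk.

    weights_subdir is an assumed location inside the cloned repo folder;
    recordings are reference clips expected directly inside repo_dir.
    """
    repo = Path(repo_dir)
    wanted = [repo / weights_subdir / name for name in WEIGHT_FILES]
    wanted += [repo / name for name in recordings]
    return [path for path in wanted if not path.is_file()]


def demo_command(python_exe=sys.executable):
    """Build the command used in the stream: `python demo_cli.py`.

    It is meant to be run from inside the repo folder, with the virtual
    environment already activated.
    """
    return [python_exe, "demo_cli.py"]
```

For example, `missing_files("Real-Time-Voice-Cloning", recordings=["my_clip.mp3"])` (with `my_clip.mp3` a hypothetical file name) returns an empty list only when all three weight files and the clip are in place. At that point the list from `demo_command()` could be handed to `subprocess.run(..., cwd="Real-Time-Voice-Cloning")`, after which the CLI prompts for the reference audio path and then for the text to synthesize.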
Info
Channel: Nicholas Renotte
Views: 19,661
Keywords: data science, machine learning, deep learning, python
Id: 6rFXsG0tUMI
Length: 34min 10sec (2050 seconds)
Published: Wed Jul 13 2022