Free Speech: Reviewing Coqui-ai, Mycroft Mimic3 and Tortoise TTS Libraries

Video Statistics and Information

Captions
Hello and welcome to Code with JV. Today I'm exploring the world of text to speech. I've found three free and open source libraries that I'm comparing head to head to see which one is best for getting your computer to generate speech for you. There are a whole bunch of paid services out there where you can use APIs like ElevenLabs and whatnot to do fancy stuff with voice, but today I'm really focusing on what you can do on your own machine, without paying anyone a dime. Let's jump into it.

The first project I'm reviewing is Coqui AI TTS. If you run pip install TTS in Python, this will get you there. They're a commercial project built on top of this open source platform, and they are shipping things quite recently: there was a release that went out yesterday and a patch that went out today. It's got a lot of stuff going on and it's a bit intimidating to jump into. There are a lot of academic papers that they link to, but it's also not too hard to get going once you find your way around. pip install TTS will get you the command line tool, and if you want to train your own models then you'll need to install the GitHub repo locally.

Let's have a look at some things. I've got a TTS demo. If we go tts, the simplest thing we can do is just get it to speak to us. Let's save the output into a file. "Hello world." This is the default voice you get, but they've got tons of voices that you can configure. All these settings here, you can tweak them. One of the things you can do is list models: you can see all the models that are available, and there are a lot of them. One tip: if you look at the English ones, the ones towards the end of the list are more recently added, so I think Jenny was added earlier this year. If you want to know what a model sounds like, you can copy its name and then pass it along with your text as the model name, as long as you spell "model" correctly. Hello, I'm a different model, different
voice. These models are often a couple of gigabytes to download, so they can take some time.

Now, the other thing you can do is tts-server. Let's go with the same model name. Made the same mistake, didn't I? One day you'll learn. When you start a server you don't have to keep typing the text in; it'll run something on localhost:5002. We'll pull up the server and type whatever you want to say, so you can start to do this kind of thing. I found this useful for when you get a voice that you really want to kick around a bit: fire up the server, throw some different things at it, and see if you want to invest further in using that voice.

So those are the first things you can do: you can run it on your command line and you can run a server. Okay, not too bad. But you can also run it from Python. They've got quite a lot of examples in their docs about the sorts of things you can do. If we jump over to Python here, I've got an index.python, and here I've just got a couple of ones I was playing with. One of the most recent English models that they've got in here is the Jenny model. If we go into this Python index, it looks fairly similar to when you were typing it out in the terminal, but you can control a lot more settings, and you can also save them so you don't have to keep typing which model you want, etc. "Hello, this is Jenny. You can get weekly updates at codewithjv.com newsletter." Did you catch the Irish twang?

If you look at Jenny, Jenny is from a public repo; they link to it in the docs of Coqui. It's a 30 hour voice data set: newspaper headlines, YouTube videos, books, transcripts from Friends, a whole bunch of prompts, and you can download it. They've got a license, but basically it's saying that if you use it to make a voice assistant you have to call the voice Jenny. If you just generate clips you don't need the attribution, though it's welcome. Commercial use is fine. Don't do unfair things like claim it as your own. So it's quite a generous license. That's the sort of thing where, using the Coqui library and
the data set of 30 hours, the creators of this repo were able to create that model. So if you want to spend some time making a model, you can do that. If you don't want to spend the time, you can also just do some simple voice cloning. This is the JV model; I spent a minute recording some voice. Let's have a look at what this one sounds like. You'll notice these models don't say letters very well; I found you have to sound the letters out to get it. "Hello, this is a clone JV." Maybe this is what I would sound like if I was born somewhere else in the world. Maybe I can hear a little bit of me in there, but not much. That's the difference: spend 30 hours recording quality stuff and you'll get a good, high quality voice coming out; spend a minute just throwing some wav files in there and you'll get less quality.

The Coqui crew, they've got a website. They're a business built on top of this library; they're investing in it, they're updating it, and essentially ("First, put away your sword.") they're doing things like voices for game characters. I've signed up for their free account here. They've got a whole bunch of characters as a sort of teaser to get people into that system. If you go in here: "Only one of us walks out of here." "First, put away your sword." Stuff I'd believe in a video game. Hey, if you think you might have it tough with AI coming your way, what do you think about the voice actors who are voicing video games? High budget games might still use real actors, but the low budget games, you can see them moving to this tech pretty fast, right?

I've made a testing project here. You've got emotions, so you can actually say whether you want your voice happy, sad, etc. I'm not too sure how they've done this using their library; you might need to play around a bit more. "Why, hello there, how are you today?" "Why, hello there, how are you today?" "Why, hello there, how are you today?" I messed around with this one a bit because they've got this fancy editor where you can shrink words, or if you turn the phoneme level on you can
shrink particular syllables and raise the pitch or the energy, so you can really tweak it the way you want. A lot of their thing is built so you make multiple takes with the voices, and each time you do a take it's slightly different. It's a little bit like ChatGPT being slightly different when you ask the same question; they're doing the same for the intonation of the voice, just so people can get exactly the voice they want out of it. Down here you can play them: "Why, hello there, how are you today?" "Why, hello there, how are you today?" "Why, hello there, how are you today?" You can start to stitch them all together to make scenes and dialogue between multiple characters, etc. All of that is built on top of this library, which you can download and start to mess with.

All right, next project. This is Mimic 3 from Mycroft, an open source voice assistant business, but I don't know if they'll continue. They put out a notice in January that they had to downsize and weren't doing active development. But they've built a text to voice system called Mimic 3, designed to run on very, very cheap hardware like Raspberry Pis or computers without big graphics cards, so you can do this without the same compute load as the other one. It's got a fairly similar install path: you install this library, get yourself the Mycroft Mimic library, and then you can start to just run it in the terminal, which is what I did.

So if we go over to Mimic here, we can go mimic3 and you just throw words at it. "Hello world, I am Mimic." You may not have heard it, but my graphics card didn't spin up to do that kind of work. It's much less compute, but it's also a little bit more monotonous; I found less emotion in there, so it's more suited to voice assistants than to voicing video games. Mimic 3 also has a server. When you run that server, like this here, we can see this is the English one: "A rainbow is a meteorological phenomenon that is caused by reflection." So you can start to do that. If
you look at the English US one, it actually has a bunch of different model names but also a bunch of different speakers. "A rainbow is a meteorological phenomenon that is caused by..." etc.

I'm mentioning Mycroft because maybe you want to build a voice assistant, and maybe you don't want to spend lots of compute on a fancy voice and you want to put it on a lower powered device. Mimic 3 is what I'd definitely recommend for that use case. Whereas if you want a fancier voice assistant with a more expressive voice, you may be able to resource the Coqui setup to generate voice on the fly for you, but you'd probably see a bit more of a delay in there.

All right, last one, and this is my favorite: this is Tortoise. It's mostly because it's just someone who finished a computer science degree, bought like 15 Nvidia graphics cards, and trained a model at home on 50,000 hours of voice. So, really interesting. There's a great readme; I encourage you to look at it. They've found things like, if you put "I am really sad" in the text, it can start to change the intonation. It has a whole section on the ethical considerations, where they basically did this work, realized how powerful their model was and how people could use it for fraud, and were a bit conscious about whether to put this out there. But then they figured, well, I'm just someone with a computer science degree and 15 graphics cards; anyone's going to be able to do this. Therefore they have built a tool for detecting Tortoise output, so you can spot when something has been generated using this model, and that's a tool they ship with it.

It's a cool library, but it's a little bit more complex to set up. You need to clone the repo, and I had a bit of a muddle of a time doing this, getting versions right, because it was written a while ago and if you use modern versions of Python like 3.10 or 3.11 it wasn't working. I had to go down to 3.9 and then tweak some dependencies. So, a non-trivial thing. If you want to run
this and it's causing you pain, shout out and maybe I'll be able to set you up a Dockerfile.

Let's have a look at Tortoise. Here's Tortoise. The thing about Tortoise is there's only one model but there are lots of voices. In tortoise/voices you can see them all. You can call tortoise do_tts; it takes text. All right, what voice do you want? What's one I haven't listened to? The ones with "train" in front of them are the higher quality ones, so they're the ones which are closer to the training set, whereas the others are just folders where the author's gone and put four or five sound files next to each other, about 10 seconds long. Let's see what the narrow sounds like.

The thing about Tortoise is it's slow and you must have an Nvidia graphics card, so it's not a very efficient model, but the voices are really interesting. The way they built it was inspired by the DALL-E image generator from OpenAI. It basically maps a voice into vector space, where each of the numbers in the vector changes something a little bit, like the speed or the pitch or the volume or the voice defects, so lots of things about the quality of the voice. What it does is sample all the wav files and then say, let's generate a vector based off that, and use that vector to generate the voice. So it's kind of like a completely synthetic configuration, and you can actually save these vectors and not have it sample the files in doing this. And you'll see that's going to be a minute 40 to generate just this sentence here.

While that's going, the other thing it does is save it all in a results folder, so here's a bunch of them that I was playing with. I did a JV clone again, the results JV: "The programmer's eyes were bright in the hope someone would subscribe to his YouTube channel." Not much better than the other one, but it was based off five little wav files next to each other. "The programmer's eyes were bright in the hopes someone would subscribe to his YouTube channel." The other thing you can do with Tortoise which is interesting is
essentially this Python read file. What it does is read a big chunk of text and then narrate it for you. This is the idea of making your own audiobooks. It takes ages; let it run overnight on a computer with a graphics card if you want to get this done. In practice it will chunk up the file, make lots of little wav files, and stitch them all together into a big one, and you get your own narrated book. This is the main use case of Tortoise if you actually want to use it: reading poetry or books or that kind of thing. Here's one that I messed around with. I didn't go for that long, but it still took like 20 minutes to run. All the stuff lives in longform, and I think it was train_grace that I used. "In the idyllic valley of the Mangaroa River, just a stone's throw from Wellington, lived a devoted programming teacher named JV. His eyes, normally bright and sparkly, were dull, for many people would watch his videos but few would subscribe." Yeah, subscribe.

Oh, here we go, this one's finished. If we go and play it... well, usually it's prefixed by the speaker so you can find them again. "Would you like a Dockerfile? Comment below if so." Yeah, there are a couple of versions and they've got slightly different intonations. "Would you like a Dockerfile? Comment below if so." So that is Tortoise, and again, probably one of my favorites.

In short, I would suggest Mimic if you're building a voice assistant. Coqui I think is probably the best general purpose one, where you can start to do quite a lot of different things with it and go quite deep. And Tortoise if you want something which can narrate text for you and build up long form audio. If you want to see another video about training models and building up some more custom things, let me know. Have fun.
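The "chunk up the file, make lots of little wav files, stitch them together" idea from the Tortoise section can be sketched in plain Python. This is only an illustration of the general approach, not the code Tortoise ships: chunk_text, stitch_wavs, narrate and the synthesize callback are all hypothetical names made up for this sketch, and synthesize stands in for whichever TTS engine you call to turn one chunk of text into one wav file.

```python
import os
import re
import wave


def chunk_text(text, max_chars=200):
    """Group sentences into chunks of at most max_chars characters.

    Hypothetical helper. A single sentence longer than max_chars is
    kept whole as its own chunk rather than being cut mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def stitch_wavs(part_paths, out_path):
    """Concatenate WAV files that share one audio format into out_path."""
    if not part_paths:
        raise ValueError("no parts to stitch")
    with wave.open(part_paths[0], "rb") as first:
        fmt = (first.getnchannels(), first.getsampwidth(), first.getframerate())
    with wave.open(out_path, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        for path in part_paths:
            with wave.open(path, "rb") as part:
                part_fmt = (part.getnchannels(), part.getsampwidth(), part.getframerate())
                if part_fmt != fmt:
                    raise ValueError(f"{path} has a different audio format")
                out.writeframes(part.readframes(part.getnframes()))


def narrate(text, synthesize, out_path, work_dir):
    """Chunk text, call synthesize(chunk, wav_path) per chunk, then stitch.

    synthesize is an assumed callback supplied by the caller; it must
    write a WAV file for the given chunk at the given path.
    """
    part_paths = []
    for i, chunk in enumerate(chunk_text(text)):
        part_path = os.path.join(work_dir, f"part_{i:04d}.wav")
        synthesize(chunk, part_path)
        part_paths.append(part_path)
    stitch_wavs(part_paths, out_path)
    return part_paths
```

Keeping the little wav files around, as narrate does here, means a long overnight run that fails partway can in principle be resumed from the parts already generated rather than started again.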
Info
Channel: Learn Code With JV
Views: 15,617
Id: JZWeYbtCisk
Length: 14min 23sec (863 seconds)
Published: Tue Jun 06 2023