XTTS FAQ | Interview with Josh Meyer from Coqui AI

Captions
Good day, and welcome to this video where I talk with Josh Meyer from Coqui AI about XTTS. Before starting with the actual video, let me say two thank-yous: on the one side to you, my wonderful YouTube community, for providing me with XTTS-related questions, and on the other side, obviously, to Josh for taking his time to talk with me about XTTS. It's been a pleasure having you as my first-ever interview partner, or guest, on this YouTube channel. And as it's the first time, please let me know in the comment box below what you think: is this interview-type format something you would like to see more often, and if so, who would you like to see as a next guest on this channel? And do not forget to give this video a thumbs-up and subscribe to my channel. Thank you, and now, without further ado, let's put the spotlight on Josh.

First of all, Josh, let me say thank you for taking some time to talk with me about your newest XTTS evolution, or product, or project. Before this video I asked my wonderful YouTube community for questions on XTTS in the comment box. To be honest and transparent: I sent you the questions the community sent me beforehand, so you could talk with your crew and team and hopefully provide us the best possible answers and information on XTTS. So, that's all for the intro. Thank you again, Josh, for taking your time and joining me today.

Yeah, I am excited, I'm honored to be the first guest here. I've been following your YouTube channel and the other kind of work you've been doing, like the open wiki. I recently was actually Googling for an answer on speech technology, and the first best answer was your wiki. So, happy to talk about XTTS from all different fun angles. If anybody watching wants really deep-in-the-weeds technical information on XTTS, I would encourage you to check out our Discord server and talk to some of our core engineers who are working on it. I've been involved with XTTS since we've been working on it, but I'm not the one who's super deep in the weeds, you know, training the models. I don't get to train models so much anymore, unfortunately. But that being said, happy to jump in.

Maybe you would like to introduce yourself for the people who do not know you yet?

Yeah, for sure. My name is Josh Meyer. I've been working on speech technology for over a decade now, which sounds weird to say. I came out of academia: I did a PhD in speech recognition, and I've always had a deep love of expanding speech technologies to new languages that are typically, how do you say, underrepresented, underserved. So I spent a good chunk of time, something like six or seven years, working on speech recognition and making these kinds of deep neural net models work more efficiently when you have only a small amount of data. I looked at fine-tuning and multitask learning, and towards the end of my academic research I joined the Mozilla group working on speech technologies. At the time there were three projects going on in parallel: there was Project Common Voice, there was Project DeepSpeech, and then there was TTS, text to speech. For a long time, Project DeepSpeech and Project Common Voice were the shining stars of Mozilla's work on speech technology. Actually, historically, Project Common Voice came out of a need from Project DeepSpeech: they started working on Project DeepSpeech, realized they needed more data, and that's where Project Common Voice came to life.
So I joined the team. They were basically working on English; they were making very good speech technologies for English, and my axe to grind was: how do we make this work for as many languages as possible? That's what I spent most of my time working on with them, both on the data collection side and on the training of speech recognition models. All the time in parallel, and our team wasn't huge to begin with, there was always one person working on TTS, and that's Eren. Eren kept working on TTS, kept making breakthroughs, and ended up with these new algorithms, especially for zero-shot voice cloning and cross-lingual zero-shot voice cloning, that ended up being, to the best of my knowledge, the first publicly available proof of concept. I think that was about two years ago, and it went viral on Hacker News. If I go back and listen to the quality of that now versus what we have today, it makes me cringe. The state of the art is moving so fast that with every breakthrough, you listen to it and think: how did I ever like this stuff in the past?

But in any case, long story short: I started in academia, then joined Mozilla working on open speech technologies, open source. The code was always MPL 2.0, and the data was released under Creative Commons Zero. Basically, we were working as a team and decided that the projects we were working on could, outside of Mozilla, get some resources, build a bigger team, and be pushed further, more than just a kind of R&D project inside Mozilla. So that's what we did. About two and a half years ago we left Mozilla as a team, the machine learning group. Project Common Voice continued, but we left as a group to continue working on speech technologies, and in particular we started making more and more breakthroughs with TTS, with generative voice, as I guess people are calling it now. We felt the strongest pull in that direction, so we've focused our entire company on TTS, and not just TTS but also voice cloning and cross-lingual voice cloning.

And that's where we are today. We still have a very strong open-source project, the TTS project, which is released under the MPL 2.0, with different models released under their own licenses. I think that's something really special about us and what we're doing. We're a for-profit company: we have an API that's paid access, and we have Coqui Studio, which is a kind of creator application for people who don't code but want to use these core technologies in their workflow, say an animation team or a video game development studio. If they want to use these technologies with a nice interface (I kind of pitch it as GarageBand for voice-over), that's what we provide. And then we also have the open-source, or open-access side (there's some nuance there), where we very openly release lots of code and models. I think that's what's special about us and our project: we started as an open project, and we've kept that at our core. So that's maybe a long intro, but there it is.

Thank you for the introduction. Let's start with probably the most obvious question: what is XTTS, and what makes it so special in comparison to existing models?
Yeah. So the X doesn't officially stand for anything. I like to think of it as "extra good TTS"; it's also kind of "expressive TTS". We decided on the name before Twitter switched to X, so it's unrelated. But XTTS is a voice synthesis model that is able to do zero-shot voice cloning and also to synthesize speech from that voice clone in any language we've trained the model on. Right now there are 13 languages; the next release is going to have more languages, and the release after that will have more languages. We're always adding more languages.

In a nutshell, XTTS belongs to this new wave of speech synthesis models, and it's basically the best one that's been openly released. It doesn't fit the typical definitions of open source, so we created a new license for it; we can get into the details of that. Basically, it is currently released under the Coqui Public Model License, the CPML, which is essentially non-commercial: for non-commercial use you can do whatever you want with it. Commercial use we very much encourage, but you should come talk to us and get a commercial license from us.

The model is special for a few reasons. It stands out in terms of quality. There's always been a pretty wide gap between the best open-source or open-access TTS model and what you can get commercially from providers, and the release of XTTS is closing that gap. XTTS is actually the model that we have in production, well, a version of it. Unlike lots of currently available open-access or open-source speech synthesis models, it's built on GPT, and the level of expressiveness is much higher than what you get from earlier architectures. Probably one of the most common model architectures out there is VITS, and in general a lot of the neural voice models out there have this kind of boring, audiobook-like narrator style, optimized to be not very expressive because that's easier for the model to learn. That's partly a function of the data used for training, but it's also a function of the architecture itself. XTTS is able to clone voices with a much higher emotional range. It's something we released, I think, just over a month ago.

So it stands out for quality, and it also stands out for efficiency in terms of compute usage: it can run on a GPU with as little as four gigs of VRAM. And in terms of speed, it is the first of this new wave of speech models that has native, model-level streaming. The latency, the time to the first chunk of audio from when you give it text to when you start getting audio back, can be as low as about 130 milliseconds on a 3090 GPU. So it's very fast. We're already at the point where you could have this running alongside a game engine on your computer and be talking to your video game in real time, with a realistic voice, and that has really not been possible before.

So when you ask me what XTTS is, those are the most important things that come to mind. Why is it special? It's a voice synthesis model, it's a voice cloning model; those have been around before, but they've never been this high quality, they've never been this fast, and they've never been able to run on such normal, consumer-grade GPUs. In terms of making consumer-grade, engaging AI voice applications possible: this is now possible.
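For readers who want to see what that model-level streaming looks like in practice, here is a rough sketch using the low-level XTTS API from the coqui-ai/TTS repository. Treat it as illustrative only: the checkpoint paths are placeholders, and the exact function names and signatures (get_conditioning_latents, inference_stream) have varied between XTTS releases.

```python
# Sketch of model-level streaming with the low-level XTTS API from the
# coqui-ai/TTS repository. Paths are placeholders; exact names/signatures
# may differ between XTTS releases.
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("xtts_checkpoint/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts_checkpoint/")
model.cuda()  # the ~130 ms time-to-first-audio figure assumes a fast GPU

# One-time conditioning from a ~6 second reference clip of the target voice.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["my_voice_6s.wav"]
)

# Audio arrives chunk by chunk; the first yielded chunk is what the
# time-to-first-audio latency refers to, not the finished utterance.
chunk_stream = model.inference_stream(
    "Hello there, this sentence is being streamed as it is generated.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)
wav = torch.cat(list(chunk_stream), dim=0)  # or play each chunk as it arrives
```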
You addressed two important points that I have on my question list, so that's a good point: on the one side the new Coqui model license, which we'll talk about a little later, and, as you said, running on your local system. That's probably the next question: can I use XTTS to clone my own voice fully offline, or are there any dependencies on, as you said, Coqui Studio or the provided API? So, can I clone my voice locally, offline, or not?

100% yes. When we first made the release, the first thing I did was make sure it's working on my MacBook with my CPU. You can download it and use it, and if you're using it for your own purposes, personal and non-commercial, you can go do this today. You just do pip install TTS, you get the XTTS model, you use a clip of your own voice, and you can start synthesizing. You can turn off your Wi-Fi and it'll still work, which is pretty awesome. So the short answer is yes.

Please let me know in the comments if I should make a tutorial on local TTS voice cloning. So, as you said, just run pip install TTS and then the magic happens. That probably answers the next question too: if I can clone my voice locally, fully disconnected from Wi-Fi (after running pip install, obviously), then I can also synthesize the audio locally, for private or non-commercial use?

Yep.
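As a minimal sketch of the flow Josh describes, assuming the Python API in the TTS package (the model id string and file paths here are illustrative and depend on the release you install):

```python
# pip install TTS
# Minimal local voice cloning with XTTS. After the one-time model download,
# this runs fully offline. Model id and paths are illustrative.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v1")  # one multilingual model

tts.tts_to_file(
    text="This is my cloned voice, synthesized entirely on my own machine.",
    speaker_wav="my_voice_6s.wav",  # roughly six seconds of your own voice
    language="en",
    file_path="cloned.wav",
)
```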
And I'd also say: there's one model, the XTTS model, which speaks all the languages. So you could use, you know, five seconds of your voice speaking German and then synthesize your voice speaking English, Chinese, Arabic, whatever.

So you don't need to download 13 different models; it's one model?

It's one model, right.

Oh well, then that's one of the other questions answered. But first of all, as you said, five seconds, or I've read three or six seconds, of pure audio, which is really amazing.

It's six.

Okay. And did you find that if people provide just six seconds of audio, in comparison to, let's say, one, two, or three minutes, does this affect the quality of the cloned voice, or does that not make any sense?

No, and it makes sense to ask; this is something that people try to do all the time. The model has explicit architecture to use six seconds, so if you upload two minutes, it's not going to get any better. And this is something we're very actively working on, because it depends on the voice. Sometimes six seconds is perfect and gives you an amazing voice that sounds identical; sometimes it doesn't, and the question is: okay, what do I do next? If I don't like my voice clone from six seconds of me speaking from this audio clip, uploading more isn't going to help. What will help is trying different audio clips. So whenever I'm cloning a voice or doing a demo, I might have ten different audio clips that are each six seconds, and I try them all and see which one I like best.

And actually, a fun thing, I guess a feature, not a bug: you can take the same voice, and I'd be interested to hear you do this, because for the Thorsten-Voice data you've also got emotional data, right?

Yeah, true.

So you could make a clone of each of your emotions, and they would technically be different voices, but it would actually be one voice with different emotions. We've seen people doing this. This is what I recommend to people who want explicit emotions that they can call, whether via our SaaS API or locally. There are some emotions that we provide; they're limited, and that's not for XTTS, I'll say that clearly, but with our API you can have, like, happy, sad, angry, surprised. But if you're making a custom voice, you can record the voice doing any style or emotion and then have that in your bank of emotions.

A little tip, so maybe that's worth an effort: I'll just play around with my Thorsten-Voice emotional datasets and give this a try. What I found interesting, because I had expected that more audio data would provide different or better results: as I understood you, six seconds is perfect, and if you are not satisfied with the result, do not upload more seconds, but instead use phonetically different input audio. So play around with the input audio, but stay with six seconds, because the model is optimized for this length.

Yeah, 100%. Don't upload more audio; upload different audio.

I'll give this a try. Maybe I'll make a special video on how multiple inputs to XTTS lead to different results, because I think this might be really interesting. Then you mentioned one aspect about foreign languages. Normally, if I provide German input, I would guess that I can clone my German voice, and you said there are right now 13 supported languages, with more to come soon. So XTTS provides the feature that I input six seconds of audio and then, let's say, synthesize French or English in my cloned voice?

That's possible, 100%, yep.

You know that I will give this a try and call you back if it's not working.

Yeah, but that's it: you upload six seconds of any language, and it actually doesn't even have to be a language that we support. For instance, we don't support Hindi currently. You could upload six seconds of yourself speaking Hindi or Bengali or whatever language, and it will still synthesize perfectly in the other 13 languages. The input can be anything, because we're doing a speaker embedding extraction; it doesn't rely on text, so it's language-independent.

Well, that's interesting. If you had asked me this before, I would never have guessed it. So I really can input to the model a language that is not supported, it does its magical fairy-dust machine learning magic, and it will synthesize audio in the 13 (and more to come) languages, but the input language does not matter at all?

Correct, yeah.
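Putting the last two tips together, several different six-second reference clips rather than one long one, and one model covering every target language, might look like this small loop (same hypothetical setup as the sketch above):

```python
# Try several different ~6 s reference clips and synthesize each target
# language from each clip; keep whichever clip sounds best per language.
# The reference clips could just as well be one recording per emotion,
# giving you the "bank of emotions" idea from the conversation.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v1")

references = ["clip_neutral.wav", "clip_happy.wav", "clip_reading.wav"]
sentences = {
    "en": "The weather is lovely today.",
    "de": "Das Wetter ist heute herrlich.",
    "fr": "Il fait très beau aujourd'hui.",
}

for ref in references:
    for lang, text in sentences.items():
        out = f"{ref.removesuffix('.wav')}_{lang}.wav"
        tts.tts_to_file(text=text, speaker_wav=ref, language=lang, file_path=out)
```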
One other question when it comes to foreign languages, because this is one of the questions I hear a lot. I guess this is true for most languages, but in German I know we have lots of English words in our everyday talk: talking about computers and cloud and party and whatever, we have lots of words that come from the English language. Normally, models have problems when I synthesize with my German model and there are English words in the text; they are not pronounced the way they should be. So how does XTTS handle a German phrase with foreign-language words inside?

Typically what happens, this is called code-switching, right, is when you have a sentence that's mostly in one language but you might switch into another language. And it's interesting, there are also opinions on how this should be done: should we keep the phonology of the first language, or should we use the phonology of the second language? If you're a fluent speaker of English and German, you might use the English phonology when you're switching, but if you don't speak English at all, what would you do? But I think typically what happens with XTTS is that you get the native phonology of both languages. So if you're switching between German and English, you'll get English phonology coming out.

Then the next, and important, question: why should people subscribe to my Thorsten-Voice YouTube channel?

Well, the simple answer is because you're the best YouTube channel for open voice technology. Honestly, this is something I've talked about for a few years: NLP, natural language processing, has always had this kind of big, hypey community, lots of people and lots of ways to get information, but speech has not been like that. Voice technology has always been a smaller field. It's changing now, I think it's really changing, especially with this new wave of technologies. Even for me, who's been in this space for over 10 years and is pretty familiar with the weeds, there's so much stuff happening all the time that it's hard to make heads or tails of it, especially separating the noise from companies that are trying to sell you something from the interesting breakthroughs and interesting new technologies. That's why I watch your videos. And also, honestly, because of how hands-on they are: if somebody's trying to install and use a new tool, I still point people to your videos, like when they're asking, "How do I run Coqui on Windows?"

We have one aspect when it comes to the RTF, the real-time factor, so the comparison of CPUs versus GPUs. As you mentioned, a 3090 GPU, consumer hardware, provides way more than real time, a 0.1 to 0.2 real-time factor. Are there any experiences for people who just use a CPU because they have no CUDA-enabled GPU?

Yeah. Using a CPU makes sense depending on what you need. It doesn't currently make sense for real-time conversational applications, but say you want to generate, I don't know, let's say you like listening to your news but you don't like reading the news. You could very easily set something up on your laptop to synthesize your favorite news channel: scrape the website, synthesize it overnight, and when you wake up in the morning you've got your audio file to listen to. So any time you want to generate audio for consumption later, CPU makes a lot of sense. And CPU is not crazy slow; I said overnight, but that's overkill. The real-time factor is, I think, 1.7, and that's not too bad. It's too slow for a real-time conversation, but at a real-time factor of 1.7, if you want to generate 10 seconds of speech, it takes you 17 seconds to generate it; that's not bad. The same model can be way faster than real time on a GPU, or slower than real time on a CPU.
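The arithmetic behind those numbers is simple enough to jot down; the figures below are the ones mentioned in this conversation, not benchmarks of any particular setup:

```python
# Real-time factor (RTF) = wall-clock synthesis time / duration of audio
# produced. RTF below 1 is faster than real time; above 1 is slower, which
# is still fine for batch jobs like the overnight news example above.
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time to generate `audio_seconds` of speech."""
    return audio_seconds * rtf

print(synthesis_seconds(10, 1.7))  # CPU figure from the interview: 17.0 s
print(synthesis_seconds(10, 0.1))  # 3090 GPU figure: 1.0 s
```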
And there's a very real difference in the cost of those things too. For instance, we have customers that might be interested in using fast GPUs because they have a real-time video game they want to plug this into, or they're generating news overnight, and they're not generating on a laptop, they're generating on servers. The cost of running, let's say, some slow GPUs or fast CPUs might be a really great time and money saver for them. So it's more than just local; we've talked a decent amount in this conversation about local, but even if you're running in the cloud, there are ways you can take advantage of the time-cost trade-off.

One question when it comes to the license, because I guess people are sometimes confused: Coqui's code repository is on the one side under this open-source license, I guess it's the Mozilla MPL, I'm not too sure. So we have on the one side the code base, and people are sometimes confused.

Yes, so your question is about clearing up a little confusion about the licensing situation with Coqui, right? Coqui TTS is a repository on GitHub, and it contains code, mostly in Python, for training and inference of models. All of that, the training and inference code, is released under the Mozilla Public License 2.0. However, there are individual pre-trained models that you can use inside Coqui TTS, and those have different licenses. Some of those are licenses that we set, and some are licenses that other people set. For instance, we support Meta's thousand-or-so speech synthesis models, which I think are all under a non-commercial license, I think it's CC BY-NC 4.0. And then there are also different communities: somebody and their friend decided to train a model in, whatever, Bulgarian, and they want to share it with the community, so they release it under Apache or MIT or something and make it very free. So we have a few different models under a few different licenses.

But we've been talking about XTTS in particular. XTTS is released under a license that we created in collaboration with basically the world leader on open-source licensing; she wrote the book on open-source licensing, Heather Meeker. We created the license for two reasons. One is that we are a business, and we need to run the business sustainably, and right now releasing XTTS under a completely permissive license is something we don't want to do, because we need to monetize it somehow. Being very transparent there: we spend a lot of time and resources, GPUs are not cheap, and people need to get paid to use the GPUs to train the models; this is all not cheap. There are some other startups, or I'd say larger corporations, whose paradigm right now is that they can afford to lose a few million dollars to release a model openly, which is really great for the community, but we are not at the level of Meta money, where we can throw away a few million dollars. So we created the CPML, first, so that we can monetize the model, because we do want it to be available to researchers and to people using it for non-commercial purposes. If you have a Home Assistant setup in your house that's controlling your refrigerator and your garage, and you're really into Home Assistant: love it, we get it, we want you to run XTTS and do your thing. But if you're a corporation and you want to use the XTTS model, we still want you to do that, but we want to sell it to you. So that's one reason why we created the license.
The other reason is that, I think, and we think largely, open-source licensing for machine learning models is a weird landscape, and the old licenses that have typically been approved by the Open Source Initiative for source code don't apply to machine learning models, for a few reasons. Source code is very different from a machine learning model. A machine learning model is really closer to a binary, because it's been trained, it's compiled: you can't open a file of a machine learning model, interpret the weights and parameters, and make changes to them as if you were looking at the Linux kernel. So I think any of the Open Source Initiative licenses that were written for source code don't make sense for machine learning models.

Furthermore, there are the licenses from Creative Commons; those are the two sets of licenses people typically use for machine learning models: they'll either grab a license from the OSI, the Open Source Initiative, or they'll grab one from Creative Commons. And the Creative Commons licenses are made for content that has been artistically created. It's called Creative Commons because the licenses are used for creative works: a picture, a song, a photograph, a painting. Those licenses talk a lot about derivative works, so I can take your photo, change the colors, and then redistribute and sell it, or maybe I can't, because Creative Commons has clear commercial and non-commercial variants. The problem is that those licenses were made for human-created things, not for machine learning models, which are in a weird place because they're generated by code but can then also generate content. So we wanted a license that applied to both the model itself and the content the model generates, and there's nothing out there currently that does a good job at this. We also wanted it to be as clear and interpretable as possible. The CPML is a very short license; you can read it, it's not full of super legalese, and it's pretty clear what you can and can't do. You can basically do whatever you want as long as you're not using it commercially.

So you tried to close this gap: as you said, we have the source code side with lots of great licenses, and we have licenses for creative work, art, music, photographs, and so on, but for this new field of machine learning models there is a gap, and you tried to close it, or hoped to find a good license for it.

Yeah. We wanted something that was going to work for us to create a sustainable business while also letting us share what we're doing with the research community. Our team all comes from the research community, and we love working with the research community, and also the open-source community; I mean, we came from Mozilla, and we built the project in the open. So we wanted a license that was going to be as clear as possible and allowed everybody to use this: the DIY hackers who love Raspberry Pis and Arduinos and doing all that stuff, and also the PhD students and the master's students and the universities.
We wanted to get it out there, and this is the clearest way we could do that. Machine learning models are like compiled code that creates content, so you want to be able to cover all of that. And it's not just about restricting people from doing certain things; a license has to be very clear, or else companies, institutions, and people won't use it. If there are any unknowns, they won't touch it; that's a reason a license has to be very clear. It's a hard thing. I think we did our best to make a good license, and I'm personally very happy with the CPML. I love how clear and straightforward it is.

I'll put a link to the license in the description, so people can take a look at it, get an idea of the license, and know the motivation and idea behind it. I'm really excited to see in the comments whether people have more questions or now have a better understanding of the idea and vision behind the need for this new type of license. And when I take a look at our question catalog, I think we've covered all the important topics. So by now I can just say it's been a pleasure to have you as my first and special guest in this little place on YouTube for open voice technology. Thank you, Josh, for taking your time, for joining me today, and for providing, as I think, great answers to the questions. Thanks to you and to the team at Coqui AI. And if there's anything you would like to share, the next words are yours.

Yeah, I'd say: if you're watching this, keep watching and follow Thorsten, because these videos are great, and not just the videos; the content he posts on other platforms like Twitter, or X now, I very much enjoy. Besides that, if you want to get involved with Coqui, there are a few ways. There's GitHub, github.com/coqui-ai/TTS; that's the main project repository. You can also find our models and some demos on Hugging Face; just search for XTTS or search for Coqui. And we have Discord; I guess there will hopefully be a Discord link here. The Discord is a really fun place. The community has actually existed from way back in Mozilla times: originally it was IRC, then it changed to, I think, Element, and then Gitter, and now we've landed on Discord. It's a fun group of people, a lot of people who really care about sharing research, technology, and what they're working on, and a lot of different language enthusiasts. So if there's a language you want that's not in XTTS yet, come chat with us on Discord; there are people working on lots of different languages. Thanks for having me, Thorsten, and if you want to do this again, I'm more than happy to.

Thanks again, Josh, it's been a pleasure for me, and I'll put all the links Josh mentioned in the description box. Thank you, Josh, I wish you a nice rest of the day, and hopefully we'll stay in regular contact and see each other soon.

Yes, definitely. Thanks, bye.

Bye, Josh. Thanks for watching; I hope you found it useful. Please give this video a thumbs-up if so, do not forget to subscribe to my channel, and please let me know in the comment box what you think of this interview-type format and who you would like to see as a next guest. That's all; I hope you liked it. I wish you all a nice rest of the day, and if you like, we might see each other next time. Bye!
Info
Channel: Thorsten-Voice
Views: 1,592
Keywords: TTS, Text to Speech, Voice Tech, Machine Learning, AI, KI, Voice Assistant, Python, Tech, Open Source, Künstliche Stimme, Coqui, STT, Raspberry, News, Coqui tts, coqui voice cloning, XTTS, ExTTS, xtts tutorial, xtts-v1, Josh, Josh Meyer
Id: XsOM1WZ0k84
Length: 41min 49sec (2509 seconds)
Published: Mon Oct 30 2023