Custom Speech-to-Text (STT) and Text-to-Speech (TTS) Servers for Mycroft AI | Digi-Key Electronics

Captions
Hey Jay, I know we've been working on your robots and you want to have the chatbot. We have the wake word "Hey Jorvan" as well as controlling some hardware, and I think the next thing we talked about was wanting to get it completely offline, right? Yeah. Okay. I know that a lot of smart speakers, like our Alexa devices or Siri, require backend components, and a lot of that has to do with the speech-to-text and text-to-speech machine learning needed to interpret your voice and give some sort of audio response. Right now, a lot of that's done online. My hope is that we can move this to something local, either running on the Pi or, more likely, running on something like a laptop. Are you okay doing something like that? Of course. I casually wear robots outside normally, so this wouldn't bother me at all. It will actually make it more fun, especially seeing how it reacts with people outside. Excellent. I think I can get speech-to-text and text-to-speech running on, say, a local laptop. You would just connect via an Ethernet cable or Wi-Fi, and that way you could throw the laptop in your backpack and walk around, or be up on stage with the laptop right there. I don't know about the Mycroft backend stuff yet; there are a bunch of places where it has to connect to the backend to get weather data or to verify your device. But I think I can get speech-to-text and text-to-speech, and that's the first step. Are you good with at least getting those running? Of course, of course. As we both know, I am not the best programmer, so the fact that you're even helping me with this is 100% appreciated. Excellent. I am excited to get a chatbot up and running. I want to see one on stage or out in the wild, with you wearing it and people talking to it. So let's get started. [Music]

Here's a basic look at how Mycroft works. There's a wake word listener running on the Raspberry Pi from the last episode; we have the wake word set to "Hey Jorvan." When that wake word is heard, Mycroft streams raw audio to a speech-to-text server across the internet; by default, Mycroft uses Google's STT engine. The audio is parsed and converted to a string. Mycroft compares that string, the intent, to the trigger phrases in the available skills. If there's a match, the skill runs its code. Often this involves fetching information from the internet, like the time or local weather. A response is provided to the skill, which parses it and sends a string to a text-to-speech server, which converts the string to audio. That audio is streamed out over the speaker as a response to the user. As you can see, most smart speakers rely heavily on remote servers to process information.

In an attempt to make Mycroft less dependent on the internet, and possibly a little more secure, we're going to run speech-to-text and text-to-speech on a separate laptop. There is a TTS service called Mimic 1 that can run on the Pi, but it's a bit slow and robotic; we're going to use a fork of Mozilla TTS to get a more natural-sounding voice. Speech-to-text is notoriously difficult to perform. While it might run on the Pi, everything I've read says it makes things incredibly slow, and one reason to run things on a local network is to make response times faster. Finally, Mycroft is hopelessly attached to a number of backend services at home.mycroft.ai. This online service manages skills and accounts, and it helps provide responses to skills that need information.
Mycroft is open source, including the backend, which is called Selene. It's possible to run it locally, but it requires a lot of effort and is honestly a bit overkill for what we're trying to do. A few projects, like this one from OpenVoiceOS, have tried to create a dummy backend for Mycroft that prevents it from phoning home. I've tinkered with it, but I've not had much luck getting it to work completely offline; it's likely new versions of Mycroft circumvent some of the hacks there. If you're able to get a completely offline Mycroft or other smart speaker working, please let us know in the comments. It's something I'd like to keep working toward for future videos.

If you head to mycroft-ai.gitbook.io/docs, you can get a lot of information about how to customize Mycroft. Let's head to Customizations, which can be found right here, and go to Speech-to-Text. Here you can see all the different STT engines that Mycroft supports. We'll be using Mozilla DeepSpeech, as it's open source and relatively easy to set up. Next, head to Text-to-Speech. You can see that Mycroft defaults to Mimic 1 or Mimic 2, depending on the voice you select. However, we'll be using Coqui TTS, which is a fork of the Mozilla TTS project. It comes with its own server, which is perfect for our needs.

To start, you'll want to install Ubuntu 18.04 on some computer; this will run our STT and TTS servers. I highly recommend using something with an NVIDIA graphics card, as we can enable the machine learning engines with CUDA to make them much faster. I'll put a link in the description that points you to a guide showing how to configure CUDA on Ubuntu. I'm going to SSH into my laptop; it just makes it easier for me to screen capture what's going on, and we don't need a GUI anyway. I've actually disabled X Windows on my laptop just to save some processing power. If you installed CUDA (again, there's a link in the description to walk you through that process), you should be able to run nvcc --version to see which CUDA version you have. Note that for these exercises you really want CUDA 10.1; the STT and TTS servers seem to really like that version. Otherwise you'll be fighting with versions and dependencies, which is why I'm specifically sticking with Ubuntu 18.04. Note the Python version, too: Python 3.6.9 seems to work along with CUDA 10.1. There's a whole dependency chain you have to take care of to make sure everything is happy and matches. I'll put that in the written guide, because it's a really boring walkthrough; I want to get to the interesting stuff, which is installing the servers for speech-to-text and text-to-speech.

We'll be using Mozilla DeepSpeech, an open source project that performs speech-to-text. Go ahead and create a folder to hold our DeepSpeech project; in this case, I'm just going to make a projects directory. If you don't have virtualenv, you might have to install it, but I can run it to create a virtual environment inside this directory. That's going to keep all of my dependencies (Python versions, package versions, everything) separate from my main system, so that I can have different versions of different packages running for my different servers. I'll use virtualenv -p python3 to create the new virtual environment, then call source deepspeech-venv/bin/activate to activate it. You should see the virtual environment name in parentheses in the prompt.
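Collected in one place, the environment setup looks something like this (a sketch; the directory and environment names follow the ones used in this walkthrough, so adjust to taste):

    # create a project folder and an isolated Python environment for DeepSpeech
    mkdir -p ~/projects/deepspeech
    cd ~/projects/deepspeech
    python3 -m pip install virtualenv     # only if virtualenv isn't installed yet
    virtualenv -p python3 deepspeech-venv
    source deepspeech-venv/bin/activate   # prompt should now show (deepspeech-venv)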
Mozilla maintains a version of DeepSpeech for CUDA, called deepspeech-gpu, that will let DeepSpeech use the acceleration in your GPU to make STT a lot faster. If you're not using the CUDA version, this would just be deepspeech, without the -gpu. We also want deepspeech-server, so that we can create a server that runs as a service on the computer; we're then going to have Mycroft connect to it over Wi-Fi. Press enter and let those install; this may take a moment if you don't already have them like I do.

Once that's done, we can download a model into this deepspeech directory. I'm going to paste in the URL because it's quite long; we're getting it from the DeepSpeech GitHub repo, and it's the .pbmm model for version 0.9.3, the one we'll be using. When that's done, we need the scorer from the same location, so instead of the .pbmm we're getting the .scorer file. The .scorer is a language model (and I suppose there are different ones for different languages) that works in conjunction with the acoustic model, which is the .pbmm.

When it's done, let's download a WAV file to use as a test. Once again, the DeepSpeech repository has some test files for us. They come zipped up in a tarball, so we'll download that and untar it, and you can see it's just some WAV files of people speaking. You'll notice that since we've installed DeepSpeech, we now have a deepspeech command we can use. We'll feed it the model with --model (that's the .pbmm), give it the scorer with --scorer (that's the language model; both of those are in this folder), and give it a test audio file with --audio, which should be in the audio directory we just unzipped. Let's give it that 2830-something WAV file. If all goes well, it should fire up and you should see the speech-to-text: that WAV file gave us the string "experience proves this" as output. If you listen to the file, you'll hear somebody saying exactly that. If you're using CUDA with your graphics card, you should see the GPU being used during inference. Take a look at the time, too; that looks pretty good here, as it took less time than the duration of the audio file itself. That's promising. What I've noticed is that it has a habit of running faster after you fire up the TensorFlow engine (or whatever inference engine you're using), especially on graphics cards; I think some data is cached, so the second time you run it, it's a little faster.

I'm going to use the example configuration settings from the MainRo deepspeech-server repository; in fact, I'm just going to copy the example as-is. If we look in this directory, we can see we have our models, the example audio files, and the virtual environment. I'm going to create a config.json file right in this directory and paste in that example we just saw. We're going to create a systemd service to run the server, and one thing I've learned about systemd is that it does not like relative paths, so I'm going to give it absolute paths to the model files; in this case, that's my home directory, then projects/deepspeech, then the name of the model. Remember, we downloaded 0.9.3, not the 0.7.1 in the example, and I'll do the same for the scorer. I don't like the default port, so I'm going to put it on 5008, which lets me put my TTS server on 5009; it's just easier for me to remember where those are. Everything else, I believe, you can leave the same. Ctrl+X to exit, yes to save, and save that file.
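The resulting config.json comes out along these lines (a sketch based on the repo's example, with the 0.9.3 model file names, absolute paths, and port 5008 substituted as described; <user> is a placeholder for your username):

    {
      "deepspeech": {
        "model": "/home/<user>/projects/deepspeech/deepspeech-0.9.3-models.pbmm",
        "scorer": "/home/<user>/projects/deepspeech/deepspeech-0.9.3-models.scorer",
        "beam_width": 500
      },
      "server": {
        "http": {
          "host": "0.0.0.0",
          "port": 5008,
          "request_max_size": 1048576
        }
      },
      "log": {
        "level": [
          { "logger": "deepspeech_server", "level": "DEBUG" }
        ]
      }
    }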
Next, we're going to run deepspeech-server with the config file we set up. We can call deepspeech-server --config config.json, which should fire up a server we can connect to from, say, another computer. I'm going to open a browser on my laptop and point it to the other laptop, which is at 192.168.1.209 for me. Let me cancel the server for a second: if you run ifconfig, you should be able to see your address. I recommend setting a static IP, either on the laptop itself or by assigning a preferred IP address to your MAC address on your router, which is my preferred way to do it. Here you can see where I'm getting that .209 from on the laptop I'm SSHed into. Let me fire up the server again and refresh the page. The 404 Not Found means I need to interact with the server in a different way; it's not serving up web pages, and that's perfectly fine. In fact, what we'll be doing is sending commands to this address at /stt. There we get Method Not Allowed, meaning it really doesn't want to communicate with my browser. That's fine; we'll use other methods to send it audio data, but this tells us it's working. In fact, what I can do is open another SSH window into that same laptop, go into the deepspeech folder (you can see it's the laptop I'm working on), and send it that WAV file we just tried as an HTTP request to /stt. Even though the laptop is sending to itself, you can see how you would interact with it across a network. If you open the window showing the server's output, sure enough, you can see the STT result: "experience proves this." That's how we're going to send audio files, or stream audio data directly, to that server. Let's close that session, exit the server with Ctrl+C, and call deactivate to leave the virtual environment; you can see the name has gone away in front of my command prompt.

I'm going to create a systemd service called stt-server.service (STT for speech-to-text), and I'm putting it in /etc/systemd/system. I'll create the unit file and just fill out the basics. I won't go deep into writing unit files here, but know that this allows you to run software or services as soon as Linux boots, and as far as I know, this is the preferred method of doing things on boot for Ubuntu. There are other ways of running programs when Linux starts, but this works well for my purposes. I'm going to put the command on two separate lines; remember, I have to use absolute paths for everything here, and I'm going into the virtual environment's bin folder to run the deepspeech-server tool, passing it the config file we created. The WantedBy=multi-user.target line basically says to run this after pretty much everything else, when the system is waiting for a user to log in; this ensures things like networking are up before we attempt to run the service. Ctrl+X to save the unit file. We need to call systemctl daemon-reload so systemd knows where to find the new unit file; it re-reads all the unit files, notices the new one, and loads it in. Then we enable it so it runs on boot, and start the service right now, which means the server should be running. We can call service to check, and it looks like it is, which means we can bring the browser back up. Let's refresh, and sure enough, the pages are working.
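For reference, the unit file ends up roughly like this (a sketch; <user> and the paths are placeholders for whatever is on your machine):

    [Unit]
    Description=DeepSpeech speech-to-text server

    [Service]
    ExecStart=/home/<user>/projects/deepspeech/deepspeech-venv/bin/deepspeech-server \
        --config /home/<user>/projects/deepspeech/config.json

    [Install]
    WantedBy=multi-user.target

Saved as /etc/systemd/system/stt-server.service, it gets picked up with the systemctl daemon-reload, enable, and start sequence described above.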
If you try to go to, say, the wrong port, your browser shouldn't be able to connect to anything, so that's how I know the server is running. We can get out of that, and we can use journalctl to follow the output of the server in real time. I'm going to take this window and hide it, but first I'm going to bring up a new SSH session, because we have one server running at the moment, and log into the laptop again so we have a new interactive shell. The old one I'll put away for the moment; we'll come back to it when we want to see how Mycroft is interacting with our server.

Now let's do the same thing, but for our text-to-speech server; these are two different servers running on one laptop. We'll be using Mozilla TTS, but encapsulated in that forked Coqui TTS server. This should look familiar: we create a tts folder inside projects, create a virtual environment for the project, and activate it; we're calling this one tts-venv. Lucky for us, this TTS project comes as a pip package, so we just need to do python3 -m pip install TTS, and this is the Coqui TTS we saw earlier. When it's done, there's an example server you can use by calling python3 -m TTS.server.server; if you have CUDA enabled, you can add the parameter --use_cuda true. Let's fire that up. It should download a couple of default models for us, and when it's done, you should see a URL, assuming everything started correctly. That one is IPv6, if you want to use IPv6; I'm going to use IPv4. Let's refresh the page and hope everything runs. It does not, and that's because I'm using the incorrect port number: it should be 5002, which we'll change in a moment. If we go to 5002, you see the Coqui TTS page, and I'm connected to my laptop to get this. I can type in a message and play it: "Hello, this is a test of my text-to-speech server." It's pretty close, and definitely less robotic than the default text-to-speech on Mycroft, so this will be fun to use. You're welcome to use either text-to-speech engine; I'm just showing you how to set up one that's fairly easy to instantiate on a separate server somewhere, and we'll connect Mycroft to it.

Let's Ctrl+C to exit the server. If we run tts --list_models, you can see all of the speech models and the vocoder models. The speech models take text and convert it to what's essentially a spectrogram, which isn't really speech yet; the vocoder then takes that spectrogram and creates the actual sounds. So you need a combination of both. If you only specify the TTS model, it should automatically pick the best vocoder to use, based on whatever the Coqui developers (or the Mozilla developers) found. You can also head to the Mozilla TTS wiki's released-models page to learn about some of the models being used, at least for the Mozilla project, not necessarily the Coqui fork. If you scroll up, you can see that Tacotron2-DDC is the TTS model being used, with English language support, trained on the LJSpeech dataset. The vocoder model is HiFi-GAN v2, a generative adversarial network, also trained on the same dataset for the English language. It's usually good to match the dataset and the language; I can't guarantee it'll work even if you match those, but these two models seem to work very well together.
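Collected together, the TTS server setup just walked through looks roughly like this (a sketch; tts-venv is the environment name used above):

    # separate project folder and environment for the TTS server
    mkdir -p ~/projects/tts
    cd ~/projects/tts
    virtualenv -p python3 tts-venv
    source tts-venv/bin/activate

    python3 -m pip install TTS                     # Coqui TTS, the Mozilla TTS fork

    # manual test: serves a demo page on port 5002 by default
    python3 -m TTS.server.server --use_cuda true   # leave off --use_cuda true without CUDA

    tts --list_models                              # list available TTS and vocoder models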
Let's create the tts-server service, and we'll do that by writing a unit file just as we did before. We'll run the actual command from the virtual environment's directory, tts-venv/bin, and feed it some parameters across multiple lines. For my purposes, I'm going to use it with CUDA, but feel free to leave that off if you're not using CUDA. Instead of 5002, I want this running on port 5009, because it lines up with my 5008 from earlier. If you'd like to specify a particular model, you can do that with the --model_name parameter; here I'm going to use that same model, since it seems to work pretty well. You specify the kind of model (TTS versus vocoder), the language, the dataset it was trained on, and finally the name of the model, all separated by slashes. Again, we have this wait for multi-user.target, so we know networking is running, and both this and the STT server should start up on reboot; feel free to test that, but I'm going to start the service manually for now. I'll call systemctl daemon-reload to re-read the unit files, enable the service we just created, and then start it. Finally, I'll follow it with journalctl so we get updates in real time.

The good news about this is that it does show me we have problems: it's telling me that it failed to execute the command, no such file or directory; it cannot find tts-server. So I must have messed up a name here. Let's get out of this. There is a tts-server, so let's see if I misspelled something. We'll go back into our unit file, and sure enough, the name should not have capital letters. Let's get out of this, and I'll call systemctl restart on the service. Oops, I forgot to call daemon-reload first, because that needs to happen whenever you change unit files. Then we run restart and use journalctl to see what's going on. There we go, it is indeed running. Back in my browser, if I try port 5002, I shouldn't get anything; if I try 5009, I get Coqui TTS again. "Hello, this is another test." It works.
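After those fixes, the unit file comes out along these lines (a sketch; <user> is a placeholder, and the model string is the Tacotron2-DDC/LJSpeech one discussed above):

    [Unit]
    Description=Coqui TTS server

    [Service]
    ExecStart=/home/<user>/projects/tts/tts-venv/bin/tts-server \
        --port 5009 \
        --model_name tts_models/en/ljspeech/tacotron2-DDC \
        --use_cuda true

    [Install]
    WantedBy=multi-user.target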
All right, then we log into the Raspberry Pi running Mycroft, which I've called picroft because that's the name of the distribution. I should get a login prompt; remember, the default username is pi and the default password is mycroft. That brings us into the little CLI GUI they've created (I don't know how graphical it really is). I'm going to take the TTS server output window and put it aside for now; we'll come back to it, and the STT one, once we're done setting up Mycroft. From here, Ctrl+C to exit the CLI client, and I'm going to call mycroft-stop just for good measure. Once that's done, let's clear the console and start from here.

We want to run mycroft-config edit user, which brings us into a temporary copy of the mycroft.conf configuration file. By using that wrapper command, it will also check the file to make sure we've appropriately formatted the JSON; I actually find that very helpful. This should look familiar from the last episode, where we configured the wake word to be "Hey Jorvan" using the Precise wake word engine. We're going to add a couple of entries down here. The first one is the stt module, for speech-to-text. This tells Mycroft where to go for speech-to-text; we're overriding the default by using this configuration file. We're setting up our DeepSpeech server, which we define at the address 192.168.1.209, just like we saw earlier. Remember, we need the /stt on the end, because that's the endpoint where audio data is received on the server, which then gives us text strings back as a result. And I believe this is a Mycroft thing, but you have to specify the module: DeepSpeech is something that's supported in Mycroft, so you just have to tell it where that server is located, and it will no longer use Google STT. Instead it uses its built-in "I know how to work with DeepSpeech" support; it just needs to know where the server is in order to send the audio data, and that's what we're doing here.

As it turns out, Mycroft doesn't have a specific Coqui TTS module we can use. Instead, Coqui TTS should accept the same requests as Mozilla TTS, since those backends are essentially the same, so we're just going to use the Mozilla TTS key values in our JSON here. One of the fields is uri and the other is url; I'm not exactly sure why, but it seems to work. Maybe url would work for both; I have not tested it, so if you try it and it works, let me know. And remember, there's no /tts or anything for this one; the server is just running on port 5009 on our laptop. I probably didn't mention it before, but the laptop and your Raspberry Pi must be on the same network. You can run an ad hoc network if you really want to make this somewhat portable, but you still need something of an internet connection for the Mycroft backend; it needs to reach out for whatever skills and configuration it needs.

With our stt and tts modules set up in Mycroft, things should just point to those servers on our laptop. Ctrl+X and save; if you'd messed up some of the JSON, you'd get a notification here that there was a syntax error. We'll call mycroft-start all, since we stopped those services, and start up the CLI client. I'm also going to turn on the speaker I have plugged into the Pi, and we'll wait while this loads. I'm going to bring in my windows to show what's going on on the laptop: the top is DeepSpeech, which is my STT server, and the bottom is my TTS server. Whenever a request comes in for one of those servers, you should see the output update in those windows. I apologize if the text is somewhat small, but this way you get an idea of what's going on, and you can see that STT and TTS are running whenever Mycroft makes requests. Let's give it a shot. "Hey Jorvan, what's the weather like?" (My speaker takes a moment to power up.) "Right now it's overcast clouds and 23 degrees. Today's forecast is for a high of 27 and a low of 19."
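(A useful trick for the debugging that follows: each server can be exercised directly, without Mycroft in the loop. A hedged sketch, assuming the address and ports configured above and the /stt and /api/tts endpoints these two servers expose:

    # STT: POST a WAV file, expect the recognized text back
    curl -X POST --data-binary @test.wav http://192.168.1.209:5008/stt

    # TTS: request synthesized speech for a string, save the WAV reply
    curl -G --data-urlencode "text=testing one two three" \
        http://192.168.1.209:5009/api/tts -o reply.wav

If both respond, the servers themselves are healthy and the problem is on the Mycroft side.)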
So something isn't quite right. You can see that the STT worked, because you saw the request come in here, "what's the weather like," and go back out. However, nothing came in for TTS, and it sounds like Mycroft is still using the default Mimic, because we don't have the female voice and there are no requests coming in to the TTS server. So let's figure out what's going on. I'm going to stop this server, and let's go into that config file to see if anything has been misspelled; my first instinct when something's wrong is that I misspelled something here, but a quick look doesn't show that. What's nice is that I can look at some of the log files. If you go to /var/log/mycroft, you can see the log files; audio.log usually has things like TTS and STT, so I'm going to look at that. If we scroll up, yeah, you can see it's spitting out errors. Looking through them, we find something related to TTS: the backend couldn't be loaded, falling back to Mimic. Okay. I suppose the better way to do this would be to grep for something like "tts," which isn't giving me matches anyway; in any case, we found it by looking at the raw output. So it has something to do with that. If I go back to Coqui TTS, it says to configure Mycroft much the same as when using the Mozilla TTS module. Oh: we put mozilla_tts, because I got that information from a different site, but the module name should just be mozilla. So let's go in and edit this, update it to mozilla instead of mozilla_tts, and save the buffer. We'll call mycroft-start with the restart option, which should restart all of these services, and start the CLI client once more. Give it a moment to load all of the skills, or at least the ones that can load; I know a few of the skills probably don't work 100%, and this is where some of that backend needs to be connected to the internet. Not only is it trying to update skills and some configuration, you also need things like Wolfram Alpha, weather information, and so forth. Some of those things I'll have to figure out if I want to run this completely offline, but for now we can test our STT and TTS servers. "Hey Jorvan, what's the weather like?" "Overcast clouds and 23 degrees. Today's forecast is for a high of 27 and a low of 19." And sure enough, you can see the speech-to-text server gave me the result, "what's the weather like," just as I asked, and my TTS server gave the response encoded here; well, two responses, which were spoken out through Mycroft. (It's my speaker that takes a moment to kick into gear, because it essentially goes to sleep; it's a battery-powered speaker.) But with this, you can see it working: we have an STT server running locally and a TTS server running locally. And when I say locally, I mean on a separate computer on my own network, but at least I have control of where my voice and words are going, instead of them going across the internet. I hope this gives you a start running your own speech-to-text and text-to-speech servers; I can imagine there are a lot of fun things to do with these outside of just using them for Mycroft. I plan to continue trying to get Mycroft, or maybe some other smart speaker application, running fully offline. Until then, happy hacking. [Music]
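For reference, the user configuration that ends up working in this walkthrough looks roughly like this (a sketch; the wake word entries from the previous episode are omitted, and the address is the one used in the video):

    {
      "stt": {
        "module": "deepspeech_server",
        "deepspeech_server": {
          "uri": "http://192.168.1.209:5008/stt"
        }
      },
      "tts": {
        "module": "mozilla",
        "mozilla": {
          "url": "http://192.168.1.209:5009"
        }
      }
    }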
Info
Channel: DigiKey
Views: 67,750
Keywords: ai, artificial intelligence, chatbot, machine learning, ml, mycroft, python, raspberry pi, robot, robotics, smart speaker, speech to text, stt, text to speech, tts, voice assistant, voice recognition
Id: v-I3imLNxcw
Length: 33min 33sec (2013 seconds)
Published: Mon May 16 2022