Build your own Alexa with the ESP32 and TensorFlow Lite

Video Statistics and Information

Reddit Comments

I've been playing around with TensorFlow Lite and it works pretty well - end-to-end wake word detection takes about 100ms in total, including pre-processing the audio to get the spectrogram.

There's room for quite a bit of optimisation as I'm currently using a floating-point FFT and could switch to a fixed-point version.

It's fairly robust; it only really gets confused by words that are very close to the wake word ("Marvin"), so words like "Marvel" definitely cause false positives.

GitHub repo contains the training notebooks and the firmware: https://github.com/atomic14/diy-alexa

πŸ‘οΈŽ︎ 21 πŸ‘€οΈŽ︎ u/iamflimflam1 πŸ“…οΈŽ︎ Oct 04 2020 πŸ—«︎ replies

Wow super awesome!

πŸ‘οΈŽ︎ 6 πŸ‘€οΈŽ︎ u/kaiomatico πŸ“…οΈŽ︎ Oct 04 2020 πŸ—«︎ replies

Can we make the voice snootier?

πŸ‘οΈŽ︎ 3 πŸ‘€οΈŽ︎ u/SFMissionMark πŸ“…οΈŽ︎ Oct 04 2020 πŸ—«︎ replies

This is great. I’ve been wanting some of the benefits of home assistants for a while, without the always-on bigcorp data logging. What’s your take on the wit.ai developer experience? Do you have any sense of their privacy or identifiability policies?

πŸ‘οΈŽ︎ 2 πŸ‘€οΈŽ︎ u/thicket πŸ“…οΈŽ︎ Oct 04 2020 πŸ—«︎ replies

Pretty interesting! How compatible is it with Jarvis personal assistant and other self-hosted personal assistant software? I was thinking of a centralised Jarvis server and multiple ESPs for audio input/output connected via the network. Would this work for that? Maybe it would be a great idea to design it to be especially compatible with multiple personal assistant solutions (self-hosted and so on). It would be a great step towards a high-performance, high-privacy personal assistant / home assistant.

πŸ‘οΈŽ︎ 1 πŸ‘€οΈŽ︎ u/totalydifferenruser πŸ“…οΈŽ︎ Oct 04 2020 πŸ—«︎ replies

I'm amazed there is enough power on the 32 for that

πŸ‘οΈŽ︎ 1 πŸ‘€οΈŽ︎ u/light24bulbs πŸ“…οΈŽ︎ Oct 04 2020 πŸ—«︎ replies
Captions
"Marvin - turn on the lights" *OK* "Marvin - turn off the bedroom" *OK* "Marvin - turn off the kitchen" *OK* "Marvin - tell me a joke" *What goes up and down but does not move?* *stairs* "Marvin - turn off the lights" *OK* Hey everyone So, if you've been playing along at home you'll have known that we've been building towards something we've covered getting audio into the ESP32 getting audio out of the ESP32 and we've looked at getting some ai running using TensorFlow Lite This has all been building towards building an Alexa type system So, what actually is an Alexa system? What components do we need to plug together to get something working? The first thing we're going to need is some kind of wake word detection system. This will continuously listen to audio waiting for a trigger phrase or word When it hears this word it will wake up the rest of the system and start recording audio to capture whatever instructions the user has Once the audio has been captured it will send it off to a server to be recognized The server processes the audio and works out what the user is asking for The server may process the user's request and may trigger actions in other services In the system we're building we'll just be using the server to work out what the user's intention was This intention is then sent back to the device and the device tries to perform what the user asked it to do So we need three components: A wake word detection Audio capture and intent recognition And intent execution Let's start off with the wake word detection We're going to be using TensorFlow Lite for our wake word detection and as with any machine learning problem our first port of call is to find some data to train against. Now, fortunately, the good folk at Google have already done the heavy lifting for us and collated a speech commands data set. This data set contains over 100 000 audio files consisting of a set of 20 core command words such as up down left right yes no and a set of extra words each of the samples is one second long There's one word in particular that looks like a good candidate for a wake word. 
I've chosen to use the word "Marvin" as my wake word. *Oh God, I'm so depressed.* Let's have a listen to a couple of the files: *Marvin* *Marvin* *Marvin* *Marvin* *Seven* *Seven* *Seven*

I've also recorded a large sample of ambient background noise consisting of TV and radio shows and general office noise.

So now we've got our training data, we need to work out what features to train our neural network against. It's unlikely that feeding in raw audio samples will give us a good result. Reading around and looking at some TensorFlow samples, a good approach seems to be to treat the problem as an image recognition problem. We need to turn our audio samples into something that looks like an image, and to do this we can take a spectrogram of the audio sample. To get a spectrogram we break the sample into small sections and then perform a discrete Fourier transform on each of these sections. This gives us the frequencies that are present in that slice of audio, and putting these frequency slices together gives us a spectrogram of the sample.

I've created a Jupyter notebook to create the training data. As always, the first thing we do is import the libraries we're going to need and set up some constants. We've got a list of words in our training data along with a dummy word for the background noise. I've made some helper functions for getting all the files for a word and for detecting whether a file actually contains voice data, as some of the samples are not exactly one second long and some of them have truncated audio data.

We then have our function for generating the spectrogram for an audio sample. We first make sure the audio sample is normalized, and then we compute the spectrogram. We reduce the result by applying average pooling, and finally we take a log of the spectrogram so that we don't feed extreme values into our neural network, which might make it harder to train.

For each file we collect training data from, we apply some random modifications. We randomly shift the audio sample within its one-second segment; this makes sure that our neural network generalizes around the audio position. We also add in a random sample of background noise; this helps our neural network work out the unique features of our target word and ignore any background noise.

Now, we need to add more samples of the Marvin word to our dataset, as it would otherwise be swamped by the other words in our training data, so we repeat it multiple times. This also helps our neural network generalize, as there will be multiple samples of the word with different background noises and in different positions within the one-second sample. We then add in samples from our background noise: we run through each file of background noise and chop it into one-second segments, and we also generate some random utterances from the background noise. Once again, this should help our network distinguish between the word Marvin and random noises.

During my testing of the system I found that there were some particular noises that seemed to trigger false detections of the word Marvin. These seem to consist of low-frequency humming and strange scraping sounds. I've collected some of these sounds as more negative samples for the training process.

With all this data we end up with a reasonably sized training, validation and testing dataset.
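The feature pipeline just described looks roughly like the minimal NumPy sketch below. The window, step and pooling sizes are placeholder values, and the actual notebook in the repo may use TensorFlow ops and different parameters:

```python
import numpy as np

# Placeholder framing parameters - the real notebook may use different values.
SAMPLE_RATE = 16000   # Speech Commands clips are 16 kHz and one second long
WINDOW_SIZE = 320     # 20 ms per slice
STEP_SIZE = 160       # 10 ms hop between slices
POOLING = 6           # average-pool this many frequency bins together

def get_spectrogram(audio):
    """Turn a one-second audio clip into a log spectrogram 'image'."""
    # Normalize so every clip has a similar overall level.
    audio = audio - np.mean(audio)
    audio = audio / (np.max(np.abs(audio)) + 1e-6)

    rows = []
    for start in range(0, len(audio) - WINDOW_SIZE, STEP_SIZE):
        window = audio[start:start + WINDOW_SIZE] * np.hamming(WINDOW_SIZE)
        # Discrete Fourier transform of this slice -> energy per frequency bin.
        spectrum = np.abs(np.fft.rfft(window)) ** 2
        # Average pooling to shrink the number of frequency bins.
        trimmed = spectrum[:(len(spectrum) // POOLING) * POOLING]
        pooled = trimmed.reshape(-1, POOLING).mean(axis=1)
        rows.append(pooled)

    # Log scale so extreme values don't dominate training.
    return np.log10(np.array(rows) + 1e-6)
```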
So we can save this to disk for use in our training workbook. We can also have a look at the spectrograms for different words in our training data: here are some examples of "Marvin", and here are some examples of the word "yes".

So that's our training data prepared; let's have a look at how we train the model. I've created another Jupyter notebook for training. Once again we pull in the imports we need, and we also set up TensorBoard so that we can visualize the training of our model. We've got our list of words (it's important that this is in the same order as in the training workbook) and the code to load up our training data.

If we plot a histogram of the training data, you can see that we have a lot of examples of the word at position 16 and quite a few at position 35. Combining this with our word list, we can see that this matches up to the word Marvin and to our background noise. Now, for our system we only really care about detecting the word Marvin, so we'll modify our y labels so that they contain a one for Marvin and a zero for everything else. Plotting another histogram, we can see that we now have a fairly balanced set of training data with examples of our positive and negative classes.

We can now feed our data into TensorFlow datasets. We set up our training data to repeat forever, shuffle randomly, and come out in batches.

Now we create our model. I've played around with a few different model architectures and ended up with this as a trade-off between time to train, accuracy, and model size. We have a convolution layer followed by a max-pooling layer, another convolution layer with a max-pooling layer, and the result of this is fed into a densely connected layer and finally to our output neuron. Looking at the summary of our model we can see how many parameters it has, which gives a fairly good indication of how large the model will be when we convert it to TensorFlow Lite. Finally, we compile our model, set up the TensorBoard logging, and kick off the training.

With training completed we can take a look at how well it has performed. Looking at TensorBoard we can see that our training performance is pretty close to our validation performance. There is a bit of noise on the unsmoothed lines; ideally we should probably try to increase the size of our training and validation data.

Let's see how well it does on the testing dataset. I'm going to use the best model that was found during training and work from that. You can see that we get pretty good results. Checking the confusion matrix, we can see how many false positives and false negatives we get, and these are pretty good results as well. I would rather get more false negatives than false positives, as we don't want the device randomly waking up from background noise. Let's try it with a higher threshold and see how that performs; this is probably what we will go for in our code, since we get more false negatives but far fewer false positives.

So, as we don't seem to be overfitting, I'm happy to train the model on our complete dataset, with training, validation and testing all combined into one large dataset. Let's see how this performs on all our data: once again we have pretty good results.

Our next step is to convert the model to TensorFlow Lite for use on the ESP32. Let's jump into another workbook for this. We have our imports to bring in TensorFlow and NumPy, and we're also going to need our data, since the converter uses it to quantize the model accurately. Once the model has been converted we can run a command-line tool to generate the C code, which we can then compile into our project.
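As a rough illustration of that architecture and the conversion step, here is a minimal Keras sketch. The filter counts, kernel sizes, dense-layer width and input shape are placeholders rather than the exact values from the training notebook, and `X_train` is assumed to hold the training spectrograms:

```python
import numpy as np
import tensorflow as tf

# Placeholder input shape: spectrogram rows x pooled frequency bins.
IMG_ROWS, IMG_COLS = 99, 26

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(4, 3, padding="same", activation="relu",
                           input_shape=(IMG_ROWS, IMG_COLS, 1)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(4, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(40, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # 1 = "Marvin", 0 = everything else
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# ... train with model.fit(...), then convert for the ESP32:
def representative_dataset():
    # Yield real spectrograms so the converter can calibrate quantization.
    for example in X_train[:500]:
        yield [example.reshape(1, IMG_ROWS, IMG_COLS, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
# The .tflite file can then be turned into a C array, e.g. with `xxd -i model.tflite`.
```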
We'll take a look at the wake word detection code on the ESP32 side of things later. First, we need another building block: once we've detected the wake word we need to record some audio and work out what the user wants us to do. We're going to need something that can understand speech.

To do the heavy lifting of actually recognizing the text we're going to be using a service from Facebook called wit.ai. This is a free service that will analyze speech and work out the intention behind it. We log in using Facebook, and the first thing we need to do is create a new application. Let's call this one "Marvin" and make it private for now.

Now we need to train our application to work out what it is we're trying to do, so let's add a few sample phrases. Let's try turning something on. We need to create an intent, and then we need to start highlighting some of the bits of text. Let's try to pull out the device that we're trying to turn on, so I'll create an entity. Now we have an entity called "device" and we've highlighted "bedroom" as the piece of text that should correspond to that device. Next we can add a trait for on and off. This is a built-in trait supplied by wit.ai, and we want to say that this phrase means the device should be turned on. So let's train this.

Now let's try adding another piece of text. You can see that it has worked out the device already, and it has worked out that the value should be off, so let's add that to our "turn off and on" intent. Let's try another one: turning on the kitchen. It has worked out that it's an on/off trait with the value on, so we highlight "kitchen", tell it that's the device, and train that as well. One more: "turn off the kitchen". Its understanding has improved; it can now see that the device is "kitchen" and the trait is off, so let's train and validate that. You can keep adding more utterances to improve the performance of your application, but I think for now that should be enough for our use case.

So let's try this out with some real speech. I've made some sample utterances and recorded them to WAV files: a turn-off, a turn-on, and another example turn-on. Let's have a quick listen to these files. "Turn off the bedroom" - hopefully that should turn off the bedroom. Let's try running this through wit.ai. We have a curl command here that will post the WAV file up to the back-end service. You can see it has detected that the device is "bedroom" and that we want to turn it off. Let's try another one, "turn on the lights", which should turn on the light. Once again it has detected the device, it tells us that it's the light, it has found the "turn on device" intent, and it says we want to turn it on. Let's try our last sample, the second turn-on, which should turn on the bedroom as well. Let's check that works: it has found the device and it has worked out that we want to turn it on.

So, I think this wit.ai application will work for us. Let's integrate it into our code. So, that's our building blocks completed.
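As an aside, the same test can be reproduced from Python rather than curl. This is a minimal sketch, assuming you have a server access token from the wit.ai app's settings page; the exact shape of the response depends on the API version you request, so it's simply printed here:

```python
import requests

WIT_TOKEN = "YOUR_SERVER_ACCESS_TOKEN"  # placeholder - taken from the wit.ai app settings

def recognise(wav_path):
    """Send a recorded WAV file to wit.ai and print what it detected."""
    with open(wav_path, "rb") as f:
        response = requests.post(
            "https://api.wit.ai/speech",
            params={"v": "20201004"},  # API version date, adjust as needed
            headers={
                "Authorization": f"Bearer {WIT_TOKEN}",
                "Content-Type": "audio/wav",
            },
            data=f,  # requests streams the file as the request body
        )
    print(response.status_code)
    print(response.text)  # JSON describing the detected intent, entities and traits

recognise("turn_off_bedroom.wav")  # hypothetical sample file name
```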
We have something that will detect a wake word, and we have something that will work out what the user's intention was. Let's have a look at how this is all wired up on the ESP32 side of things.

I've created a set of libraries for the main components of the project. We have the tfmicro library, which includes everything needed to run a TensorFlow Lite model, and a wrapper library to make it slightly easier to use. Here's our trained model converted into C code, and here are the functions we'll use to communicate with it: one to get the input buffer and another to run a prediction on the input data. We've covered this in more detail in a previous video, so I won't go into too many details now.

Moving on, we have a couple of helper libraries for getting audio in and out of the system. We can support both I2S microphones directly and analog microphones using the analog-to-digital converter. Samples from the microphone are read into a circular buffer with room for just over one second's worth of audio. Our audio output library supports playing WAV files from SPIFFS via an I2S amplifier.

We've then got our audio processing code. This needs to recreate the same process that we used for our training data. The first thing we do is work out the mean and max values of the samples so that we can normalize the audio. We then step through the one second of audio, extracting a window of samples on each step. The input samples are normalized and copied into our FFT input buffer. The input to the FFT is a power of two, so there is a blank area that we need to zero out before performing the FFT. We apply a Hamming window, and once we have done the FFT we extract the energy in each frequency bin. We follow that with the same average pooling process as in training, and finally we take the log. This gives us the set of features that our neural network is expecting to see.

Finally, we have the code for talking to wit.ai. To avoid having to buffer the entire audio sample in memory, we perform a chunked upload of the data. We create a connection to wit.ai and then upload chunks of data until we've collected sufficient audio to capture the user's command. We decode the results from wit.ai and extract the pieces of information we're interested in; we only care about the intent, the device, and whether the user wants to turn the device on or off.

That's all the components of our application, so let's see how they're coordinated. In our setup function we do all the normal work of setting up the serial port, connecting to Wi-Fi and starting up SPIFFS. We configure the audio input and the audio output, and we set up some devices and map them onto GPIO ports. Finally, we create a task that delegates onto our application class before we kick off the audio input. Our application task is woken up every time the audio input fills one of the sections of the ring buffer, and every time that happens it services the application.

Our application consists of a very simple state machine: we can either be waiting for the wake word or we can be recognizing a command. Let's have a look at the detect-wake-word state. The first thing we do is get hold of the ring buffer. We rewind by one second's worth of samples and then generate the spectrogram. This spectrogram is fed directly into the neural network's input buffer so we can run the prediction. If the neural network thinks the wake word occurred we move on to the next state; otherwise we stay in the current state.
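The firmware itself is C++, but the control flow of that state machine corresponds roughly to the Python-style sketch below. The class and callable names are made up for illustration and are not the actual firmware API; the feature extraction would be the same pipeline as the spectrogram sketch earlier:

```python
WAITING_FOR_WAKE_WORD = 0
RECOGNISING_COMMAND = 1

class Application:
    """Illustrative stand-in for the C++ application class."""

    def __init__(self, get_features, predict, handle_command, threshold=0.9):
        self.get_features = get_features      # e.g. the get_spectrogram() sketch above
        self.predict = predict                # runs the TFLite model, returns a score 0..1
        self.handle_command = handle_command  # streams audio to wit.ai and acts on the result
        self.threshold = threshold            # higher threshold -> fewer false wake-ups
        self.state = WAITING_FOR_WAKE_WORD

    def service(self, ring_buffer):
        """Called every time the audio input fills a section of the ring buffer."""
        if self.state == WAITING_FOR_WAKE_WORD:
            # Rewind one second into the ring buffer and rebuild the training features.
            features = self.get_features(ring_buffer.last_second())
            if self.predict(features) > self.threshold:
                self.state = RECOGNISING_COMMAND
        else:
            # Capture the command, send it off for recognition, then go back to listening.
            self.handle_command(ring_buffer)
            self.state = WAITING_FOR_WAKE_WORD
```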
For the command recognition state: when we enter this state we make a connection to wit.ai. This can take up to 1.5 seconds, as making an SSL connection on the ESP32 is quite slow. We then start streaming samples to the server. To allow for the SSL connection time we go back one second into the past, so we don't miss too much of what the user said. Once we have streamed three seconds of samples we ask wit.ai what the user said. We could be more clever here and wait until we think the user has stopped speaking, but that's probably work for a future version. wit.ai processes the audio and tells us what the user asked. We pass that on to our intent processor to interpret the request and move on to the next state, which puts us back into waiting for the wake word. Our intent processor simply looks at the intent name that wit.ai provides and carries out the appropriate action.

"Marvin, tell me about life." *Life, don't talk to me about life.*

So, there we have it: a DIY Alexa. How well does it actually work? It works reasonably well. We have a very lightweight wake word detection system; it runs in around 100 milliseconds and there's still room for lots of optimization. Accuracy on the wake word is okay, but we do need more training data to make it really robust. You can easily trick it into activating by using words similar to Marvin, such as "marvellous", "martin" or "marlin". More negative examples would help with this problem. The wit.ai system works very well, and you can easily add your own intents and traits to build a very powerful system. There are also paid alternatives you could use instead; Microsoft, Google and Amazon all have similar and equivalent services.

All the code is on GitHub, and the link is in the description. All you actually need is a microphone to get audio data into the ESP32. You don't necessarily need a speaker; you can just comment out the sections that try to talk to you. Let me know how you get on in the comments section.

As always, thanks for watching. I hope you enjoyed this video as much as I enjoyed making it. Please hit the subscribe button if you did, and I'll keep on making videos.
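For anyone following along with the code, here is a rough Python stand-in for the intent processor described in the transcript. The intent name, response field names and device mapping are illustrative rather than taken from the actual firmware or the wit.ai response format:

```python
def process_intent(result, devices):
    """Map a decoded wit.ai result onto a device action.

    `devices` is a dict of device name -> callable taking True/False,
    e.g. {"bedroom": lambda on: set_gpio(BEDROOM_PIN, on)}  (hypothetical helper).
    """
    intent = result.get("intent")        # e.g. "turn_off_and_on" (name chosen in the wit.ai app)
    device = result.get("device")        # e.g. "bedroom", "kitchen", "light"
    turn_on = result.get("on", False)    # from the built-in on/off trait

    if intent == "turn_off_and_on" and device in devices:
        devices[device](turn_on)
        return "OK"
    return "Sorry, I didn't understand that"
```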
Info
Channel: atomic14
Views: 39,323
Keywords: Alexa, diy alexa, esp32, tensorflow, tensorflow lite, tensorflow lite esp32, esp32 alexa, wake word detection, wit.ai, wit.ai tutorial, amazon alexa, smart home, amazon echo, home automation, iot projects, machine learning, neural network, speech recognition, esp32 projects, esp32 projects 2020, smart device, esp32 tutorial, voice apps
Id: re-dSV_a0tM
Length: 24min 1sec (1441 seconds)
Published: Sun Oct 04 2020