How to Do Speech Recognition with Arduino | Digi-Key Electronics

Video Statistics and Information

Captions
Speech recognition is a fascinating field where we teach computers to recognize spoken words. It is still a relatively novel way to interact with computers, but it has a variety of uses, from speech-to-text to controlling your phone or smart speaker without using your hands. Powerful computers can use natural language processing to understand complete sentences and questions, but we are much more limited when it comes to microcontrollers: for the most part, they can only understand a few spoken words at a time. In this episode, I'm going to show you how to train your Arduino to recognize a couple of custom keywords.

We'll be using the Arduino Nano 33 BLE Sense, as it has a beefy processor and a built-in microphone, and we'll use Edge Impulse to train our neural network. Note that Edge Impulse generates a library for us to include in our Arduino project; right now it only works with the Nano 33 BLE Sense, but I imagine that with some effort it could be ported to other Arduino boards with similar specs.

I've put together a Python script to help us prepare our training data, which you can find in my ei-keyword-spotting repository. There is a notebook you can run in Colab that will walk you through using the Python script, but I'd rather show you how to do it manually on your local machine. Let's download the repository and unzip it. Note that Windows will likely struggle with some of the path lengths, so I recommend using a program like 7-Zip. We'll want to copy the dataset curation Python script and utils.py to someplace on our computer.

Let's take a look at the code. You'll need the numpy, librosa, soundfile, and shutil packages installed to run the script, which I'll talk about later. I've included an example call so you can see what you'll need to run the script. Ideally, you'll want to start with the Google Speech Commands dataset and add your own custom keywords in another folder; the script will mix them together to create an output directory structure that looks like this. The "noise" category is just random snippets of background noise, "unknown" consists of random other keywords that are not the target keywords, and each target keyword gets its own directory inside the main output directory. The script takes care of augmenting the data by mixing random bits of background noise with each keyword to help make the model more robust.

When the script starts, it reads the arguments and sets up the output directory structure. It then goes through each clip of background noise and extracts random segments to create a bunch of smaller noise clips. Next, it takes a random assortment of samples from one of the target keywords and mixes those samples with background noise, repeating this process for all of the target keywords you give it. Finally, it grabs the keyword samples that are not targets, mixes them with background noise, and saves them in the "unknown" category. In essence, the script gathers samples from multiple input directories and curates a dataset that we will send to Edge Impulse: the target keywords we want it to listen for live in separate directories, a random selection of other words makes up the "unknown" category, and random bits of background noise make up the "noise" category. Additionally, the script automatically mixes background noise into each sample to help make the model more robust against noise.
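To make the augmentation step concrete, here is a minimal sketch of the kind of mixing the curation script performs. It is not the script itself; the file paths, the 0.1 noise level, and the simple truncation of long clips are assumptions for illustration.

    import numpy as np
    import librosa
    import soundfile as sf

    SAMPLE_RATE = 16000          # matches the Google Speech Commands clips
    CLIP_SAMPLES = SAMPLE_RATE   # one second per training sample

    # Load one keyword utterance and one long background-noise recording
    word, _ = librosa.load("custom_keywords/digikey/sample_01.wav", sr=SAMPLE_RATE)
    noise, _ = librosa.load("background_noise/white_noise.wav", sr=SAMPLE_RATE)

    # Pad short clips with zeros so every sample is exactly one second
    # (the real script skips clips that run longer; truncating here is a simplification)
    if len(word) < CLIP_SAMPLES:
        word = np.pad(word, (0, CLIP_SAMPLES - len(word)))
    else:
        word = word[:CLIP_SAMPLES]

    # Pull a random one-second segment out of the background noise
    start = np.random.randint(0, len(noise) - CLIP_SAMPLES)
    snippet = noise[start:start + CLIP_SAMPLES]

    # Mix the utterance with the noise at 10% volume and save as 16-bit PCM
    mixed = word + 0.1 * snippet
    sf.write("keywords_curated/digikey/0001.wav", mixed, SAMPLE_RATE, subtype="PCM_16")

The real script repeats this for every sample in every input directory and sorts the results into the noise, unknown, and target-keyword folders.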
Before we run the script, we need to collect some samples. I recommend starting with the Google Speech Commands dataset. Finding it can be a little tricky: if you search for it, you'll likely find a link to Pete Warden's blog post where you can download the first version. To get a more recent version, you'll need to click on the training link, which takes you to the TensorFlow repository. In that speech commands directory, click on the tutorial link in the README; that tutorial links to the more recent version of the dataset, so we'll download that one. When it's done downloading, extract the archive (once again, if you're on Windows, I recommend 7-Zip). Because this is a large dataset, it will take some time to unzip and unpack everything. I recommend unpacking the tar file into a "speech_commands_dataset" folder. When it's done, move that folder to wherever you keep datasets like this, go into it, and move the background noise folder out, as the curation script needs the background noise samples kept separate from the keyword samples.

You're welcome to pick some of the keywords available in the Speech Commands dataset, but I'm going to show you how to make your own custom keyword, because that's way more fun. Create another folder named "custom_keywords". To create a custom keyword, you first need to pick a short word or phrase and then record it a whole bunch of times. The best device to record with is the device you plan to use for speech recognition, because it has the same microphone; in this case that would be the Arduino, but I've found that a computer or a phone works just fine. I recommend getting about 50 samples to start with, but what you really want is a lot of different people, with different voices, genders, inflections, and accents, submitting samples; that will help you train a more robust model that responds to something other than just your voice. For that, I recommend aiming for about a thousand samples, but 50 is a good place to start. All right, what should I use as a keyword? I know: "Digi-Key". Digi-Key. Digi-Key. Digi-Key. Digi-Key.

Once we've recorded a bunch of samples, we first create a new folder in our custom_keywords directory with the name of our keyword. The name of this folder is important: it's what the script will use to assign samples a particular label. Next, we go into our recording device (my phone, in my case) and copy the audio file somewhere on the computer. Then we use a program like Audacity to edit the file. Since our microcontroller only operates on 16 kHz audio, we should change the project rate to 16 kHz; this also matches the sample rate of the Google Speech Commands samples. I recommend resampling the entire recording to match that rate. From there, select one second of audio around each utterance and test playing it ("Digi-Key"). Make sure the sample sounds OK, then export just the selection as a WAV file; 32-bit float works, as that matches the bit depth of the other samples. Repeat this process for every utterance, and drop any samples that run longer than a second or do not contain the spoken keyword. Notice that I'm selecting exactly one second. This is not strictly necessary, as the script will drop anything over one second and pad zeros onto any sample that's shorter, but you don't want to cut off any part of the spoken phrase. I'm also trying to vary when the utterance starts in each sample: the more variation in pitch, pronunciation, and location in the frame, the more robust the model will be.
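If you end up with a lot of recordings, you can also do the resampling and one-second framing in Python instead of trimming every clip by hand in Audacity. This is just a convenience sketch with placeholder file names, not part of the workflow shown in the video.

    import numpy as np
    import librosa
    import soundfile as sf

    SAMPLE_RATE = 16000          # match the dataset and the Nano's audio pipeline
    CLIP_SAMPLES = SAMPLE_RATE   # keep every keyword sample at exactly one second

    # librosa resamples to 16 kHz on load, whatever rate the phone recorded at
    clip, _ = librosa.load("phone_recordings/digikey_raw_07.wav", sr=SAMPLE_RATE)

    # Trim leading/trailing silence so only the utterance remains
    clip, _ = librosa.effects.trim(clip, top_db=30)
    clip = clip[:CLIP_SAMPLES]   # guard against clips that are still too long

    # Pad with zeros to one second, placing the utterance at a random offset
    # so its position in the frame varies from sample to sample
    pad_total = CLIP_SAMPLES - len(clip)
    offset = np.random.randint(0, pad_total + 1)
    clip = np.pad(clip, (offset, pad_total - offset))

    sf.write("custom_keywords/digikey/digikey_07.wav", clip, SAMPLE_RATE)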
Finally, don't worry about the file name of each sample; the script will randomize the files and give them new names anyway. I've also found that phrases with more syllables perform better than short, monosyllabic words. When you're done, you should have a nice collection of samples; I've got 68, which should work for this prototype.

If you're on Windows, I definitely recommend using Anaconda, as it makes managing some of the Python packages, like librosa, much easier. If you haven't done so already, use pip to install librosa, numpy, and soundfile; you shouldn't need to worry about shutil, as that comes with Python. Note that I'm using Python 3.7 for this. When that's done, navigate into your speech recognition directory.

Let's call the curation script with Python and feed it some arguments. First, we need to decide which keywords we want as our targets: we'll obviously go with our custom keyword, and let's pick something else from the Google Speech Commands dataset; I'll go with "go". You can list one or more keywords separated by commas here, though I haven't had much luck with more than about two; you're welcome to try, but you may need to adjust the neural network we use later. Next, we give the number of output samples per label we want the script to produce; I find around 1,500 works pretty well. Then we define the utterance volume, which I'll keep at 1, meaning no amplification. The -g parameter is the background-noise volume; I'll set it to 0.1 so the mixed-in noise is reduced to 10 percent of its original volume. -s is the sample time, which we want to keep at one second, since that's how long all of the samples are. -r is the sample rate, which we'll set to 16000; the script will resample any input audio to this value, so your raw files do not all need to be the same sample rate. -e is the bit depth, which we'll set to 16-bit PCM. That means our input files will lose some resolution compared to 32-bit float, but since the Arduino's microphone samples at 16 bits, we need the output training files to match. Next, we give the location of the background noise samples we pulled out of the Google Speech Commands dataset earlier. -o is the output directory, and then we list all of the input directories containing raw audio samples; there's no parameter label for input directories, so you can list more than one, and the script will mix them all in.
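Putting those arguments together, an invocation might look roughly like the following. Only the -g, -s, -r, -e, and -o flags are called out above; the script file name, the paths, and the -t, -n, -w, and -b flag letters are my guesses, so copy the example call included at the top of the script rather than this one.

    python dataset-curation.py \
        -t "digikey, go" \
        -n 1500 \
        -w 1.0 \
        -g 0.1 \
        -s 1.0 \
        -r 16000 \
        -e PCM_16 \
        -b "../background_noise" \
        -o "../keywords_curated" \
        "../speech_commands_dataset" "../custom_keywords"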
We'll then tell the script to run and go grab a snack. When it's done, feel free to check the results: you should see a set of folders named after the desired labels in the output directory, and the "unknown" directory should contain a random assortment of non-target keywords. Try opening one of the samples and playing it; you should be able to see how the background noise was mixed in with the original utterance.

Head to edgeimpulse.com and make an account if you don't already have one. Log in and create a new project, which I'll call "speech recognition". Go into that project, click on Data Acquisition, click the "Upload existing data" button, and click "Choose files". In the new screen we'll upload samples from one label at a time, so select all of the samples from the noise category and click Open. Let Edge Impulse automatically split between training and testing data, and let it determine the label from the file name, which is just the string before the first period. Click "Begin upload" and let it run, then repeat the process for the unknown, digikey, and go categories. This will take some time, depending on the speed of your internet connection.

When that's done, you should be able to go back into Data Acquisition and see all of the samples you uploaded. I recommend verifying that about 20 percent of your files ended up in the test set and the rest in your training set, and you should also check that the labels were read correctly. Go into Impulse Design and click on the processing block icon. Edge Impulse recommends using Mel-frequency cepstral coefficients (MFCCs) as features for spoken keyword spotting, so let's select that. Click on the flask icon to create a learning block; once again, the default neural network that Edge Impulse recommends is the best fit for this application. Click Add and then Save Impulse. Go to the MFCC part of your design, where you can see how each sample will be converted into its MFCC components; feel free to play the sound samples and look at the MFCC features this block outputs. Click on the "Generate features" tab and then click "Generate features" to let Edge Impulse calculate the MFCCs for all of our audio samples.

When all of the features have been extracted, head to the NN Classifier section. Here you can play with the neural network parameters if you'd like, but I find the defaults work pretty well: we're using a one-dimensional convolutional neural network with a few dropout layers to help reduce overfitting. Click "Start training" to let the model begin training on our data. While that's going, note that you can construct your own neural network using the graphical tool, or you can click on the pop-up menu to switch to expert mode if you know how to use Keras; there you can define the model manually in Python (I'll show a rough sketch of such a model a little later). For now, I'll leave it alone and let the model finish training.

When training is complete, take a look at the console output. You definitely want the loss to be lower than on the first training step, and both accuracy measurements should ideally be above something like 80 percent for a task like this. Scroll down to find a confusion matrix summarizing the performance of the model: the rows are the actual (ground truth) labels and the columns are the predicted labels. The numbers on the diagonal, where the predicted label matches the actual label, should be much higher than the off-diagonal entries, which show the false negatives and false positives. We also get a general accuracy score, and 90 percent looks pretty good.

The real test is trying the model on unseen data, which we can do in the Model Testing section. Select all the samples in the test set, click "Classify selected", and let that run for a minute. As you can see, the model performed much worse on unseen data. Because the model did better on training and validation data than on the test set, it likely overfit the training data. There are a few ways to combat this, but they involve playing with the model's hyperparameters, such as the number of training steps, and trying different neural network configurations, perhaps with more layers or a different number of nodes per layer. However, if we scroll through the test results, we see that "go" performed very poorly while "digikey" did very well, once again confirming my observation that multisyllabic words and phrases perform better as keywords.
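As promised, here is a rough Keras sketch of the kind of one-dimensional convolutional network with dropout layers described above, the sort of thing you could experiment with in expert mode. It is illustrative only: the 13-coefficient MFCC frames, the 49-frame window, the layer sizes, and the learning rate are assumptions, not values pulled from this project.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    NUM_MFCC = 13      # assumed MFCC coefficients per frame
    NUM_FRAMES = 49    # assumed frames per one-second window
    NUM_CLASSES = 4    # digikey, go, noise, unknown

    model = models.Sequential([
        layers.Reshape((NUM_FRAMES, NUM_MFCC), input_shape=(NUM_FRAMES * NUM_MFCC,)),
        layers.Conv1D(8, kernel_size=3, activation="relu", padding="same"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.25),   # dropout layers help reduce overfitting
        layers.Conv1D(16, kernel_size=3, activation="relu", padding="same"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

In expert mode you would adapt the generated training code along these lines and let Edge Impulse handle the rest of the training loop.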
I'm going to call this good enough for now and head to the Deployment page. Edge Impulse gives us a few options for packaging up the model: the generic C++ library is great for general microcontroller use, Cube.AI works for STM32 parts, and WebAssembly is good for JavaScript environments. For our purposes, let's go with Arduino. We could have it build a ready-to-go sketch for the Nano 33 BLE Sense, but I'd rather show you how to use it as a generic Arduino library. Feel free to run the optimization analysis to get an idea of how much space and processing power the model and library will take up: it should use around 5 kilobytes of RAM and 36 kilobytes of flash for the neural network part, and inference with the neural network should take about four milliseconds on an 80 MHz microcontroller, though that does not account for the time required to compute the MFCCs, which is much longer. The confusion matrix here is a summary of the test-set results we looked at earlier. Click Build to generate and download the Arduino library.

In a new Arduino sketch, go to Sketch > Include Library > Add .ZIP Library and select the .zip file you just downloaded from Edge Impulse. Go to File > Examples and locate the library you just installed; note that you might need to reload the Arduino IDE for it to appear. The name should match your Edge Impulse project name, so it's "speech recognition" for me. Open the microphone_continuous example.

Feel free to look through this example sketch to see how everything is set up and which functions are called to make inference happen. Note that we define a slices-per-window number, which determines how often inference is performed; since the window is one second in this case, we run inference three times per second. In setup() you'll notice some ei_printf statements. This function is used to print strings to the console and is needed by some of the back-end library functions for debugging; if you dig into the implementation, you'll find that it uses the Serial print functions on Arduino, so it's not much different from Serial.print. The sketch then calls the Edge Impulse init function, which sets up all of the necessary MFCC and neural network machinery. In loop(), the code first waits for the microphone buffer to fill up, then calls run_classifier_continuous(), which calculates the MFCCs from the captured audio and runs inference using the neural network we created. Once per window, or once per second in our case, the sketch prints out the results of the neural network inference.

To figure out what our target classes are, we need to look at the index used with this classification variable. To find it, go into the Arduino libraries folder on your computer and into the speech recognition library you just installed. Note that this library contains not only all of the necessary TensorFlow Lite functions to perform inference but also the model you created on Edge Impulse; the model is stored in the tflite-model directory as raw bytes. What we're looking for, though, is in model-parameters/model_metadata.h. This file holds information about your model, including the list of labels in an array, and the ordering is what's important here: to identify the "digikey" label, we need to use an index of 2.

Back in our code, let's set up the Nano's onboard LED to blink whenever it hears "digikey". We'll start by making the LED pin an output. We want to compare the output value to a threshold every time inference is performed, so we do this just before the section that prints out the results (which only runs after every three inferences). Remember that to access the "digikey" label we need to use 2 as our index. The output value is essentially a probability between 0 and 1 indicating how likely the model thinks it is that it heard the target keyword, and 0.7 seems like a good place to start: if the value is higher than 0.7, we'll turn the LED on; otherwise, we'll turn it off.
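Here is a minimal sketch of that change, written as a fragment to drop into the microphone_continuous example. The surrounding names come from that example as generated at the time of the video, the index 2 comes from the label ordering in model_metadata.h for this particular project, and 0.7 is just a starting threshold, so adjust all of these to match your own model.

    // In setup(), alongside the example's existing initialization:
    pinMode(LED_BUILTIN, OUTPUT);

    // In loop(), right after run_classifier_continuous(&signal, &result, debug_nn)
    // succeeds ("result" is the ei_impulse_result_t the example already declares):
    const float threshold = 0.7f;  // starting point; tune for your environment

    // "digikey" sits at index 2 of the label array in model-parameters/model_metadata.h
    if (result.classification[2].value > threshold) {
        digitalWrite(LED_BUILTIN, HIGH);   // keyword heard: turn the onboard LED on
    } else {
        digitalWrite(LED_BUILTIN, LOW);    // otherwise keep it off
    }

Because this check runs on every inference, the LED reacts three times per second even though the scores are only printed once per second.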
Select the Arduino Nano 33 BLE as your board, which should be in the nRF528x category, and select the serial port associated with the board. Click Upload and wait. If you're on Windows, you'll probably get an error telling you that some of the file names and extensions are too long; the Edge Impulse documentation mentions this in its Arduino section. The fix is to save the platform.local.txt file in the directory specified on that page, which is C:\Users\<your username>\AppData\Local\Arduino15\packages\arduino\hardware\mbed\1.1.4. Feel free to take a look at what's in that text file. Go back to Arduino and try uploading again; note that building takes quite a while, as there's a lot in the library we downloaded. When it's done, you can see that it uses around 230 kilobytes of flash and around 45 kilobytes of RAM for global variables. The Nano 33 BLE Sense likes to change serial ports during the upload process, so make sure you have the correct serial port selected, then open the serial terminal and watch the output.

The output should tell you that MFCC feature extraction takes about 261 milliseconds and neural network inference takes about 13 milliseconds. That means the processor is fully loaded classifying audio slices for 274 milliseconds out of every 333 milliseconds; since this runs three times per second, that's about 82 percent utilization, which leaves very little processing power for your own code. You'll need to be careful about how you implement your code in this sketch, or you'll find that it starts missing audio samples. Alternatively, you could use the Nano as a coprocessor that just handles audio inference and sends a notification via GPIO or another interface whenever it hears one of the target keywords.

When I say "go", you should see the probability of the go label rise, and when I say "Digi-Key", you should see the probability of that label rise. I've found that for this board you need to be close to the microphone; you could amplify the captured signal in code, but I'll leave that as an exercise for you to try. As I mentioned, I need to be pretty close to the board for it to recognize the target keyword, so when I say "Digi-Key", the yellow LED should light up. Digi-Key. I hope this video has helped you create your own custom keyword spotting and speech recognition device with an Arduino. Please subscribe if you want to see more videos like this one, and happy hacking!
Info
Channel: DigiKey
Views: 46,392
Keywords: DigiKey, speech recognition, keyword spotting, Edge Impulse, TensorFlow, TensorFlow Lite, Machine Learning, AI
Id: fRSVQ4Fkwjc
Length: 20min 51sec (1251 seconds)
Published: Mon Nov 23 2020