Voice Controlled Robot using the ESP32 and TensorFlow Lite

Video Statistics and Information

Reddit Comments

GitHub repo is here - https://github.com/atomic14/voice-controlled-robot

It will work with "left", "right", "forward" and "backward" - and also with other random noises and words... :)

👍 2 · u/iamflimflam1 · Oct 28 2020

That’s incredible! Is it trained on your voice specifically or a few different ones?

👍 1 · u/ReasonablyClever · Oct 28 2020

Wow, this is great! I subscribed!

👍 1 · u/20mcgug · Oct 28 2020

Lol the motors hiss defiantly, “fineeeee huumaaaannn ssssssssss”

👍 1 · u/str8nobull · Oct 30 2020
Captions
"Forward" "Right" "Forward" "Right" "Forward" "Left" "Backward" "Backward" "Left" "Forward" Hey Everyone, We're back with another dive into some speech recognition. In the last video, we built our very own Alexa using wake word detection running on the ESP32. "Marvin" "Tell me joke" "What goes up and down but does not move" "Stairs..." The actual processing of the user's request is performed by a service called Wit.ai which takes speech and converts it into an intention that can be executed by the ESP32. In this video we're going to some limited voice recognition on the ESP32 and build a voice controlled robot! Once again we'll be using the Commands Dataset as our training data. I've selected a set of words that would be suitable for controlling a small robot: "left", "right", "forward", and "backward" We'll train up a neural network to recognise these words and then run that model on the ESP32 using TensorFlow Lite. We're going to be able to reuse a lot of the code from our previous video with some minor modifications. Let's have a quick look at generating our training data. We have our standed set of imports and some constants. In a departure from our previous Alexa work we're going to split the words into two sections, command words and nonsense words. We'll train our model to recognise the command words and reject the nonsense words and background noise. We have the same set of helper functions for getting the list of files and validating the audio and we have our function for extracting the spectrogram from audio data. Once again, we're going to augment our data - we'll randomly reposition the word within the audio segment and we'll add some random background noise to the word. To get sufficient data for our command words we'll repeat them multiple times this will give our neural network more data to train on and should help it to generalise. A couple of the words - forward and backward have fewer examples so I've repeated these more often. For our nonsense words we won't bother repeating them as we have quite a few examples. As before we'll include background noise and we'll also include the same problem noises we identified in the previous project. With the training data generation completed we just save it to disk. Here are some examples of the words in their spectrogram format. In our previous project we just trained to recognise one word, we'll now want to recognise multiple words. Once again we have our usual includes, and we have the lists of words that want to recognise. We load up our data and if we plot a histogram we can see the distribution of words. Ideally we'd have a bit more of a balanced dataset but having more negative examples may actually help us. We have a fairly simple convolutional neural network, with 2 convolution layers followed by a fully connected layer which is then followed by our output layer. As we are now trying to recognise multiple different words we use the "softmax" activation function and we use the "CategoricalCrossentropy" as our loss function. I do have a couple of introductory videos on TensorFlow that explain these terms in a bit more detail. After training our model we get just under 92% accuracy on our training data and just over 92% accuracy on our validation data. Our test dataset gives us a similar level of performance. Looking at the confusion matrix we can see that it's mostly misclassifying our words as invalid. This is probably what we'd prefer as ideally we'd like to err on the side of false negatives instead of false positives. 
Since we don't appear to be overfitting, I've trained the model on the complete dataset. This gives us a final accuracy of around 94%, and the confusion matrix looks a lot better. It's possible that we now have some overfitting, but let's try it in the real world. For that we're going to need a robot!

I'm going to build a very simple two-wheeled robot using two continuous servos and a small power cell. We'll need quite a wide wheelbase, as the breadboard with the ESP32 on it is quite large. After a couple of iterations, I've ended up with something that looks like it will work. Assembly is pretty straightforward: we just bolt the two servos onto the chassis and attach the wheels. The breadboard sits on top of the whole contraption. [motor noises]

Let's have a look at the firmware. We have some helper libraries. The tfmicro library contains all the TensorFlow Lite code, and we have a wrapper around it to make it slightly easier to use. This library contains the trained model exported as C code, along with a helper class to run the neural network prediction. We then have our audio processing, which recreates the code we used when generating the training data: it processes a one-second window of samples and generates the spectrogram that will be fed to the neural network. Finally, we have our audio input library. This reads samples either from the internal ADC for analogue microphones or from the I2S interface for digital microphones.

In the main application code, the setup function creates our command processor and our command detector. The command detector is serviced by a task that waits for audio samples to become available. The detector rewinds the audio data by one second, gets the spectrogram, and then runs the prediction. To improve the robustness of our detection, we sample the prediction over multiple audio segments and reject any detections that happen within one second of a previous detection. If we detect a command, we queue it up for processing by the command processor.

The command processor runs a task that listens on this queue for commands; when a command arrives, it changes the PWM signal being sent to the motors to either stop them or set the required direction. To move forward we drive both motors forward; for backward we drive both motors backward. For left we reverse the left motor and drive the right motor forward, and for right we do the opposite: right motor in reverse and left motor forward. With our continuous servos, a pulse width of 1500µs should hold them stopped; lower than this should reverse them, and higher should drive them forward. I've slightly tweaked the forward value for the right motor, as it wasn't turning as fast as the left motor, which caused the robot to veer off to one side. Note that because the right motor is mounted upside down, to drive it forward we actually run it in reverse, and to drive it backwards we run it forward. You may need to calibrate your own motors to get the robot to go in a straight line.

So, that's the firmware code. Let's see the robot in action again! How well does it actually work? Reasonably well... It's a nice technology demonstration and a fun project. It does occasionally confuse words and mix up left and right, and it has a mind of its own: it will just start wandering around if you don't talk to it.
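As a rough sketch of the command-to-motor mapping just described: the 1500µs stop pulse comes straight from the transcript, while the ±200µs offsets, the trim value and the names are illustrative assumptions.

```python
# Continuous servos: 1500us holds them stopped, below that reverses them,
# above that drives them forward (per the transcript).
STOP_US = 1500
DELTA_US = 200   # assumed drive offset; tune for your servos
TRIM_US = 20     # assumed trim to stop the robot veering to one side

# The right servo is mounted upside down, so driving the robot forward
# means pulsing the right servo in "reverse" (below 1500us), and vice versa.
COMMAND_PULSES = {
    #            (left servo us, right servo us)
    "forward":  (STOP_US + DELTA_US, STOP_US - DELTA_US - TRIM_US),
    "backward": (STOP_US - DELTA_US, STOP_US + DELTA_US),
    "left":     (STOP_US - DELTA_US, STOP_US - DELTA_US),  # left back, right fwd
    "right":    (STOP_US + DELTA_US, STOP_US + DELTA_US),  # left fwd, right back
    "_invalid": (STOP_US, STOP_US),                        # hold both stopped
}
```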
We're starting to reach the limits of what's really possible here. We have a limited amount of RAM to play with, and the models are starting to get very big. We also have a limited amount of CPU: the larger models take longer to process, making real-time detection harder. Having said that, there are still a lot of improvements that could be made. So, thanks for watching. I hope you found the video useful and interesting; please subscribe if you did. All the code is on GitHub, so let me know how you get on in the comments! See you in the next video!
Info
Channel: atomic14
Views: 9,367
Keywords: esp32, esp32 projects, voice recognition, tensorflow, tensorflow lite, tensorflow microcontroller, tensorflow esp32, robots, robot, machine learning, neural network, voice control, arduino voice control, I2S ESP32, I2S Microphone, ESP32 voice control, esp32 voice command, esp32 voice, esp32 voice assistant, speech recognition, tensorflow lite esp32, voice controlled robot
Id: cp2qRrhaZRA
Length: 9min 52sec (592 seconds)
Published: Wed Oct 28 2020