I Built a Personal Speech Recognition System for my AI Assistant

Video Statistics and Information

Captions
Speech: it's the most natural form of human communication. This is my demo of a real-time speech recognition system using deep learning.

Yo, what's up world, Michael here, and today we're going to be talking about speech recognition: why it's hard and how deep learning can help solve it. Later in this video we're going to build our own neural network and train a speech recognition model from scratch.

Humans are really good at understanding speech, so you'd think it would be easy for computers too, right? Speech recognition is actually really hard for computers. Speech is essentially sound waves, which live in the physical world with their own physical properties. For example, a person's age, gender, style, personality, and accent all affect how they speak and the physical properties of the sound. A computer also has to consider the environmental noise around the speaker and the type of microphone used to record. Because there are so many variations and nuances in the physical properties of speech, it's extremely hard to come up with all the possible rules for speech recognition.

Not only do you have to deal with the physical properties of speech, you have to deal with the linguistic properties as well. Consider the sentence "I read a book last night" versus "I like to read." "Read" and "read" are spelled the same, but they sound different. The past tense sounds like the color "red," but in this context it needs to be spelled "read," as in reading. So language itself is really complex, with a lot of nuance and variation, and you'd have to come up with all the possible rules for it too to have effective speech recognition.

So what do you do when you have what seems like an insurmountable number of rules? You use deep learning. Deep learning has changed the game for a lot of complex tasks, and any modern speech recognition system today leverages deep learning in some way.

To build an effective speech recognition system, we need a strategy for tackling the physical properties of speech as well as the linguistic properties. Let's start with the physical properties. To properly deal with the variations and nuances that come with the physicality of speech (age, gender, microphone, environmental conditions, etc.), we'll build an acoustic model. At a high level, our acoustic model will be a neural network that takes in speech waves as input and outputs the transcribed text. For our neural network to learn how to properly transcribe speech waves into text, we'll have to train it with a ton of speech data.

Let's first consider which model we want to use by looking at the type of problem we're trying to solve. Speech is a naturally occurring time sequence, meaning we need a neural network that can process sequential data. The network also needs to be lightweight in terms of memory and compute, because we want it to run in real time on everyday consumer machines. Recurrent neural networks, or RNNs for short, are a natural fit for this task: they excel at processing sequential data even when configured with a small network size. So we'll use an RNN as our acoustic model.

Now let's consider the data. What a neural network can learn depends on the data you train it with. If we want our neural network to learn the nuances of speech, we'll need speech data with a lot of variation in gender, age, accent, environmental noise, and microphone type. Common Voice is an open-source speech dataset initiative led by Mozilla that has many of those variations, so it'll be perfect. Let's listen to a couple of samples. [A few Common Voice clips play.] Each audio sample comes with a label of the transcribed text.
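The video doesn't show the data-loading code on screen, but here's a minimal sketch of pairing Common Voice clips with their transcripts. The folder layout and column names follow the standard Common Voice release (a clips/ directory plus TSV files with "path" and "sentence" columns); the exact paths are assumptions, not code from the video:

import pandas as pd
import torchaudio

CV_ROOT = "data/common_voice"  # wherever you extracted the Common Voice archive

# Each TSV row pairs an audio clip with its transcribed sentence.
df = pd.read_csv(f"{CV_ROOT}/validated.tsv", sep="\t")

for _, row in df.head(3).iterrows():
    # Clips are mp3 files; torchaudio needs an mp3-capable backend (e.g. ffmpeg).
    waveform, sample_rate = torchaudio.load(f"{CV_ROOT}/clips/{row['path']}")
    print(row["sentence"], waveform.shape, sample_rate)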
Okay, I think we have a solid approach: a dataset that contains a lot of the variation and nuance, and a lightweight neural network architecture. Now let's talk about the linguistic aspect of speech.

To inject linguistic features into the transcriptions, we'll use something called a language model alongside a rescoring algorithm. To understand how this works at a high level, let's look at the acoustic model's output. A speech recognition neural network is probabilistic: for each time step it outputs a probability distribution over possible words or characters. You can naively take the most probable output at each time step and emit that as your transcript, but the network can easily make linguistic mistakes, like using "red" as in the color when it should be "read" as in reading. This is where the language model comes into play. A language model can determine which sentence is more likely by building a probability distribution over the sequences of words it trains on. You can use a language model to rescore the probabilities depending on the context of the sentence; you'll get an idea of how all this works in a minute.

For our implementation of a language model, we can use an open-source project called KenLM, an n-gram language model. We want to use KenLM because it's lightweight and super fast, unlike the much heavier neural-network-based language models. Neural language models, like the Transformers from Hugging Face, have been shown to produce better results, but since our goal is low compute, we'll use KenLM, which works well enough.

For the rescoring algorithm, we'll use what's called a CTC beam search. The beam search combined with the language model is how we'll rescore the outputs for better transcriptions. The logic of the beam search algorithm can get pretty complex and very boring, and I'm way too busy to attempt an in-depth explanation here, but here's the basic premise: the beam search traverses the outputs of the acoustic model and uses the language model to build hypotheses, a.k.a. beams, for the final output. During the beam search, if the language model sees that the word "book" exists in the transcription, it will boost the probability of the beams that contain "read" as in reading instead of "red" as in the color, because that makes more sense. This process produces more accurate transcriptions. Thankfully, there's open-source code for the language model and the CTC beam search rescoring algorithm, so we don't have to build them ourselves, which saves us a ton of time.

So to sum it up: using a language model with the CTC beam search algorithm, we can inject language information into the acoustic model's output, which results in more accurate transcriptions.
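To make the decoding step concrete, here's a sketch of the naive approach (argmax at each time step, then the CTC collapse rule), with the language-model-aware beam search noted in the comments. The character set is an assumption for illustration, and pyctcdecode is mentioned as one open-source decoder option, not necessarily the package used in the video:

import numpy as np

# Assumed character set for illustration: index 0 is the CTC blank,
# index 1 is space, then a-z (28 classes total).
BLANK = 0
CHARS = ["", " "] + [chr(c) for c in range(ord("a"), ord("z") + 1)]

def greedy_ctc_decode(log_probs: np.ndarray) -> str:
    """Naive decode: take the most probable character at every time step,
    collapse repeats, and drop blanks. No language model involved."""
    best = log_probs.argmax(axis=-1)  # shape: (time,)
    out, prev = [], BLANK
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(CHARS[idx])
        prev = idx
    return "".join(out)

# Fake per-time-step log-probabilities, just to show the call shape.
fake_log_probs = np.log(np.random.dirichlet(np.ones(len(CHARS)), size=50))
print(greedy_ctc_decode(fake_log_probs))

# For the language-model-aware version, an open-source CTC beam search
# decoder such as pyctcdecode (one option among several) can rescore
# beams with a KenLM model, roughly like this:
#
#   from pyctcdecode import build_ctcdecoder
#   decoder = build_ctcdecoder(CHARS, kenlm_model_path="lm.arpa",
#                              alpha=0.5, beta=1.0)  # LM weight / word bonus
#   text = decoder.decode(log_probs)  # log_probs: (time, vocab) numpy array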
Okay, I think we have a pretty solid strategy for tackling the speech recognition problem: for the physical properties we'll implement an acoustic model, and for the linguistic properties we'll implement a language model with a rescoring algorithm. Let's get to building. [Music]

First, we need to build a data processing pipeline. We'll need to transform the raw audio waves into what are called mel spectrograms; you can think of a mel spectrogram as a pictorial representation of sound. We also need to process our character labels. Our model will be character-based, meaning it outputs characters instead of word probabilities. Decoding character probabilities is more efficient because we only have to worry about around 27 probabilities for each output instead of hundreds or thousands of possible words.

We also need to augment our data so we can effectively have a bigger dataset, so we'll use SpecAugment. SpecAugment is an augmentation technique that cuts out information in the time and frequency domains, effectively destroying pieces of the data. This makes the neural network more robust, because it's forced to learn how to make correct predictions from imperfect data, making it more generalizable to the real world.

Now let's move on to the model. The model consists of a convolutional layer, three dense layers, and an RNN (LSTM) layer. The purpose of the convolutional layer is twofold: it learns to extract better features from the mel spectrogram, and it reduces the time dimension of the data. In theory, the CNN layer should produce more robust features, causing the RNN to make better predictions. We also set the stride of the CNN layer to 2, reducing the number of time steps of the mel spectrogram by half; the RNN then does less work because there are fewer time steps, which makes the overall network faster. We add two more dense layers between the CNN and the RNN; the purpose of these dense layers is also to learn a more robust set of features for the RNN. For the RNN layer we're using an LSTM variant. The RNN takes the features produced by the previous dense layer and, step by step, produces an output that can be used for prediction. We also have a final dense layer with a softmax activation that acts as a classifier: it takes the RNN's output and predicts character probabilities for each time step. We add layer normalization, GELU activation, and dropout between each layer, with the purpose of making the network more generalizable and robust to real-world data. In deep learning, adding more layers can lead to better results, but since we want a lightweight model, we stop at five layers: one CNN layer, one LSTM (RNN) layer, and three dense layers.

For the training script, we'll use PyTorch Lightning. PyTorch Lightning is a library for PyTorch that decouples the science code from the engineering code. For the training objective, I use the CTC loss function, which makes it much easier to train speech recognition models: it can assign probabilities given an input, making it possible to pair each audio sample with its corresponding text label without needing to align the characters to every single frame of the audio.
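Here's a minimal PyTorch sketch of the architecture just described, one stride-2 CNN layer, two dense layers, an LSTM, and a dense classifier, with layer normalization, GELU, and dropout in between, plus the CTC loss call. The layer sizes, number of mel bins, and character-set size are illustrative assumptions; the full implementation is in the GitHub repo linked in the description:

import torch
from torch import nn

class SpeechRecognitionModel(nn.Module):
    """Sketch: 1 CNN layer (stride 2) -> 2 dense layers -> 1 LSTM ->
    dense classifier, with LayerNorm, GELU, and dropout in between.
    Sizes here are illustrative guesses, not the repo's exact values."""

    def __init__(self, n_mels=81, hidden=512, num_classes=28, dropout=0.1):
        super().__init__()
        # Stride-2 convolution over the time axis halves the number of
        # time steps the RNN has to process.
        self.cnn = nn.Conv1d(n_mels, n_mels, kernel_size=10, stride=2, padding=5)
        self.dense = nn.Sequential(
            nn.Linear(n_mels, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(128, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(dropout),
        )
        self.lstm = nn.LSTM(128, hidden, num_layers=1, batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden, num_classes),  # softmax applied in the loss/decoder
        )

    def forward(self, mel):            # mel: (batch, n_mels, time)
        x = self.cnn(mel)              # (batch, n_mels, ~time/2)
        x = x.transpose(1, 2)          # (batch, ~time/2, n_mels)
        x = self.dense(x)
        x, _ = self.lstm(x)
        return self.classifier(x)      # (batch, ~time/2, num_classes)

# CTC loss pairs per-time-step character probabilities with the unaligned
# text labels -- no frame-level alignment needed.
model = SpeechRecognitionModel()
mel = torch.randn(4, 81, 200)                               # fake batch of mel spectrograms
log_probs = model(mel).log_softmax(-1).transpose(0, 1)      # (time, batch, classes)
targets = torch.randint(1, 28, (4, 30))                     # fake character labels
input_lengths = torch.full((4,), log_probs.size(0))
target_lengths = torch.full((4,), 30)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)

In the actual training script, a module like this would be wrapped in a PyTorch Lightning module so Lightning handles the training loop and engineering code.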
Oh hey, just finishing up the rest of the code here. This code is open source, so if you want to see the full implementation details, make sure you check out the GitHub repo in the description below. Now that we have the code, we need to start training. This is the perfect time to introduce my training rig, War Machine. War Machine is the personal deep learning rig I use to train models. If you're following along, I recommend using a GPU; if you don't have one, you can use free alternatives like Google Colab or Kaggle Kernels. Okay, let's get to training. [Music]

All right, training's finally finished. It took a couple of days, but I'm pretty happy with the results. Check it out: the loss curves both look pretty good, and it doesn't seem like anything is overfitting. While everything was training, I implemented the language model and the rescoring algorithm from the open-source packages, and I made a little web demo using Flask to demo the speech recognition model. So let me set everything up and let's test this thing out.

Okay, I've got the demo prepared. The first demo is just the acoustic model, without the language model and the rescoring algorithm, to showcase why they're important. "Hi, this is Michael, demoing my speech recognition system without a language model and a rescoring algorithm." As you can see, it's not very good. Let me set up the second demo. Okay, this is the second demo, with the language model and rescoring algorithm. "Hi, this is Michael, demoing my speech recognition system with a language model and a rescoring algorithm." As you can see, it does a lot better, but it's not perfect.

So the speech recognition system with the language model and the rescoring algorithm works pretty well for me, but let me reveal something I haven't mentioned yet: I collected about an hour of my own speech and trained on it along with the Common Voice dataset, which is about a thousand hours. One thing I forgot to mention when filming this clip is that I also upsampled the one-hour recording of myself to about 50 hours, so it would be better represented in the overall training set. So there's a possibility that that extra hour of my voice has biased the model to work really well for me. To test that theory, I want to try the speech recognition system with other people.

First up is my brother. "All right, just press that start button and say whatever you want." [He tries a few phrases; the transcriptions come out garbled.] Okay, as you can see, it doesn't work very well on his voice, but when I start talking, you can see it starts picking up on what I'm saying. "Oh, that's cool." Okay, thank you. Our next guest is big on privacy, so we'll just skip the introductions and go straight into the demo. "Say anything you want." [She tries a few phrases; the model struggles.] "This thing sucks." [Laughter] Okay, as you can tell, it only works well for me; it doesn't work very well for her, so she thinks it sucks.

All right, so as you can tell from my guests' reactions, the speech recognition system is not that great, at least for them. It works really well for me, and that's because I biased the model by adding my own data. All of this was expected, though. Deep Speech 2, a very famous speech recognition paper from Baidu, claimed that you need about 11,000 hours of audio data to have an effective speech recognition system, and we used, what, a thousand hours or so? Their model also has 70 million parameters, compared to our model's 4 million. I chose a small architecture on purpose, because I wanted it to be small and to run in real time on any consumer device, like my laptop.

So I think overall the system works pretty well if you can collect your own data. If you want to train your own speech recognition system, I recommend you collect your own data using something like the Mimic Recording Studio, which is what I used. I'll also open-source a pre-trained model that you can download and fine-tune on your own data, so you don't have to go through the trouble of training on the thousand hours of Common Voice like I did. I have links to all the goodies in the description, so make sure you check that out.
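The pre-trained checkpoint's file name and format aren't specified in the video, so the details below are assumptions; this is just the general shape of fine-tuning a PyTorch checkpoint on your own recordings, reusing the SpeechRecognitionModel sketch from the earlier block:

import torch
from torch import nn

model = SpeechRecognitionModel()  # sketch class defined in the earlier block
# Hypothetical checkpoint name; the real file comes from the links in the description.
state = torch.load("asr_pretrained.pt", map_location="cpu")
model.load_state_dict(state, strict=False)  # strict=False in case a head differs

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # low LR for fine-tuning
ctc = nn.CTCLoss(blank=0)

# Stand-in batch; in practice this comes from a DataLoader over your own
# recordings (e.g. made with Mimic Recording Studio).
batches = [(torch.randn(2, 81, 200), torch.randint(1, 28, (2, 20)), torch.full((2,), 20))]

model.train()
for mel, targets, target_lengths in batches:
    log_probs = model(mel).log_softmax(-1).transpose(0, 1)   # (time, batch, classes)
    input_lengths = torch.full((mel.size(0),), log_probs.size(0))
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()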
This video is part of a series on how to build your own AI voice assistant using PyTorch. So far I've done the wake word detection, which was the first video, and now I've done the automatic speech recognition. I still have the natural language understanding part, which maps the transcription to some sort of action, like "what's the weather like," and then the speech synthesis part, which is the synthetic voice of the AI voice assistant. If you want to be updated when those videos come out, make sure you hit that like and subscribe button. I also have a Discord server that's getting pretty active; if you want a community of AI enthusiasts, practitioners, and hackers, make sure you join. We want to start planning events in the Discord server, so if you're curious about what those are, make sure you join. Okay, that's it for this video, and as always, thanks for watching. [Music]
Info
Channel: The A.I. Hacker - Michael Phi
Views: 79,546
Rating: 4.933784 out of 5
Keywords: speech recognition, deep learning, machine learning, pytorch
Id: YereI6Gn3bM
Length: 16min 32sec (992 seconds)
Published: Wed Jul 29 2020