Hannes: Hi, I'm Hannes.
Rodolphe: Hey, I'm Rodolphe.
Hannes: Last week we gave a workshop about speech recognition at our job, and one part of that workshop was about the technology behind it. To be honest, it's not our field of expertise at all, so we had a hard time reading a lot of difficult material, and we never found a video that explains it easily. Now that we've put so much effort into it, maybe it's time to put this video online. So here you have it: Speech Recognition for Dummies.
Rodolphe: Oh yeah, and by the way, last week we had the help of two experts called e-Rodi and e-Hanni, and those two guys are also gonna come back today to help us explain it to you. Enjoy!
All right, so it all starts with us humans making sound, making noise, in a normal environment. Technically we call that an analog environment, and the thing is that a computer cannot work with analog data; it needs digital data to be able to work. That's why the first piece of equipment we need is what we call an analog-to-digital converter. Actually, we can also just call it a microphone. Hey, e-Rodi, can you help the people here understand how a microphone works?
E-Rodi: Okay, sure, I'll help. I'll pretend to be a microphone converting a sentence from analog to digital. Which phrase do you want me to use?
Rodolphe: How you doing?
E-Rodi: All right, let's go! In order for you humans to see the conversion, we computers use a visualization called a spectrogram. To create the spectrogram, three main steps are needed. First, I capture the sound wave and place it in a graph showing its amplitude over time. As you can see, the amplitude is measured in decibels, and we can even guess the three words you just said. Second, I chop this wave into blocks of approximately a second each. I'm not really that good of a microphone, to be honest, but I have colleagues that can make these blocks much thinner. As you can see, the height of each block determines its state. To each state we can assign a number, and since a number is something digital, we have successfully converted the sound from analog to digital. Two steps down, one more to go!
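To make e-Rodi's chopping-and-numbering idea concrete, here is a minimal Python sketch of sampling and quantization. The sample rate, the bit depth, and the sine wave standing in for a voice are all illustrative assumptions; a real microphone pipeline works on actual sound pressure.

```python
import numpy as np

# Illustrative parameters: speech is commonly sampled at 16,000
# blocks per second, with each block's height stored in 16 bits.
SAMPLE_RATE = 16_000
BIT_DEPTH = 16

def analog_to_digital(duration_s=0.01, freq_hz=440.0):
    """Sample a pretend 'analog' sine wave and number each block."""
    t = np.arange(0, duration_s, 1.0 / SAMPLE_RATE)
    analog = np.sin(2 * np.pi * freq_hz * t)   # the continuous-looking wave
    levels = 2 ** (BIT_DEPTH - 1) - 1          # how many 'states' we can name
    return np.round(analog * levels).astype(np.int16)

print(analog_to_digital()[:5])  # the first five block heights, now digital
```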
Even though the data is digitized, we are still missing something. In the speech recognition process we actually need three elements of the sound: its frequency, its intensity, and the time it took to make it. Therefore I will use a super complex formula called the Fast Fourier Transform to convert the graph you're currently seeing into what we call a spectrogram. To ease your understanding, I show you here both a hand-drawn version and a computer-made version of the spectrogram. As you can see, the spectrogram shows the frequency on the vertical axis and the time on the horizontal axis, and the colors actually show the energy you use to make the sound: the brighter the color, the more energy was used. The last interesting fact about the spectrogram is the time scale. As you can see, it's way more precise: each vertical line is between 20 and 40 milliseconds long and is called an acoustic frame.
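A minimal sketch of how such a spectrogram can be computed, assuming a 16 kHz sample rate and 25 ms frames (both illustrative; real systems typically also overlap the frames):

```python
import numpy as np

def spectrogram(samples, sample_rate=16_000, frame_ms=25):
    """Chop the signal into ~25 ms acoustic frames and FFT each one."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)                   # soften the frame edges
    # Each row: the energy per frequency bin for one acoustic frame.
    return np.abs(np.fft.rfft(frames * window))

# One second of a 300 Hz tone standing in for a voice.
t = np.linspace(0, 1, 16_000, endpoint=False)
print(spectrogram(np.sin(2 * np.pi * 300 * t)).shape)  # (40, 201): time x frequency
```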
Okay, now it's time to be really honest: I'm a good analog-to-digital converter, but my job actually stops there. Although I managed to make a digital version of the sounds you made, I have no idea whatsoever what they're supposed to mean, if they even mean anything. So I suggest that my colleague e-Hanni tells you all about how computers can understand the meaning of sound. But before that, our real-life versions have to do a little work. Let's split the work like this: you guys explain the concept of phonemes, and we take over the rest of the heavy explanation.
Rodolphe: You got yourself a deal. All right, it's time for a little introduction to linguistics. What is a phoneme? A phoneme is as short as 20 to 40 milliseconds, so it's super, super short, and it's a unit of sound that distinguishes one word from another in a particular language. To put it differently, it's the tiniest part of a word that you can change that also changes the meaning of that word. For instance, 'thumb' and 'dumb' are two different words that you can distinguish by the substitution of one phoneme, 'th', for another, 'd'. Those phonemes can be spoken differently by different people, but it's always the same phoneme that is meant. Those variations are what we call allophones, and the reasons for those variations are the accent of the person, their age, their gender, the position of the phoneme within the word, or even the emotional state of the speaker. Those phonemes are important because they are the very basic building blocks that the speech recognition software can put in the right order to form first a word, then a sentence, and so on. The speech recognition software does that by using two techniques: the Hidden Markov Model and the Neural Network.
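As a tiny illustration of phonemes as building blocks, here is a toy pronunciation dictionary in Python. The phoneme symbols are informal stand-ins; real systems use standardized sets such as ARPAbet.

```python
# Toy pronunciation dictionary: each word is a sequence of phonemes.
LEXICON = {
    "thumb": ["th", "ah", "m"],
    "dumb":  ["d", "ah", "m"],
}

def differing_phonemes(word_a, word_b):
    """List the positions where two same-length words differ in phonemes."""
    pairs = zip(LEXICON[word_a], LEXICON[word_b])
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]

# 'thumb' and 'dumb' differ by exactly one phoneme: a minimal pair.
print(differing_phonemes("thumb", "dumb"))  # [(0, 'th', 'd')]
```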
Hannes: All right, so I'll explain those two, and let's maybe first start with the Hidden Markov Model. As Rodolphe just said, the purpose is to reconstruct the phrase that has just been said by putting the right phonemes after each other, and the Hidden Markov Model does that by using statistical probabilities: it checks how probable it is that one phoneme follows another, and so on. To be precise, the Hidden Markov Model does that using three different layers. Maybe, e-Hanni, you can help us by visualizing this?
E-Hanni: Okay, sure, I'll help. I'll pretend to be a speech recognition system using a Hidden Markov Model. Which phrase would you like me to use?
Hannes: Dolphins swim fast.
E-Hanni: All right, let's go.
Hannes: So first of all, the model has to check, on an acoustic level, the probability that the phoneme it has heard really is that phoneme. As Rodolphe just said, we say phonemes in very different ways depending on emotion, position in the phrase, and so on. So the system first needs to check whether the variation it has heard really is that phoneme.
E-Hanni: Okay, so the first utterance I recorded in Hannes's phrase was a 'd'. Statistically speaking, Hannes could have said 't', 'th', or 'd'. But most likely, most probably, it was a 'd'. So let's take that one.
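A minimal sketch of that first, acoustic layer; the scores here are made-up numbers standing in for what a trained acoustic model would actually output:

```python
# Made-up scores: how well the recorded utterance matched each
# candidate phoneme (a real acoustic model produces these itself).
acoustic_scores = {"t": 0.20, "th": 0.25, "d": 0.55}

# Layer one keeps the candidate with the highest probability.
best = max(acoustic_scores, key=acoustic_scores.get)
print(best)  # 'd'
```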
Hannes: Once the software has reached a decent probability of what the most likely phoneme is, it's time to go to the second layer. In the second layer, the Hidden Markov Model will check whether it's probable that two phonemes are standing next to each other, yes or no. An example in English: if you have the sound 'st', then it's most likely that a vowel will follow, for example an 'a', as in 'stable'. It's less likely, or maybe not even possible in English, to have the sound 'n' after it, because 'stn': I don't think it exists, and if it does, it's not probable.
E-Hanni: Okay, so after the 'd' I heard an 'o'. Statistically speaking, it is actually quite probable that an 'o' follows a 'd', so let's keep it that way. After that I also heard an 'l', and again it's quite probable that an 'l' follows an 'o'. So I think I've put together the first phonemes to make a word: the word 'doll'.
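The second layer, sketched with made-up transition probabilities; a real model estimates these from large amounts of speech data:

```python
# Made-up probabilities that one phoneme follows another.
TRANSITIONS = {
    ("s", "t"): 0.30,
    ("t", "a"): 0.25,   # 'st' + vowel, as in 'stable': likely
    ("t", "n"): 0.00,   # 'stn' essentially never occurs in English
    ("d", "o"): 0.20,
    ("o", "l"): 0.15,
}

def sequence_probability(phonemes):
    """Multiply transition probabilities along a phoneme sequence."""
    prob = 1.0
    for pair in zip(phonemes, phonemes[1:]):
        prob *= TRANSITIONS.get(pair, 0.0)
    return prob

print(round(sequence_probability(["d", "o", "l"]), 3))  # 0.03: plausible, 'doll'
print(sequence_probability(["s", "t", "n"]))            # 0.0: ruled out
```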
Hannes: Let's see about that, e-Hanni! Because in the third layer the software will check on the word level: it will check whether it's probable that certain words stand next to each other and whether the result makes sense. For example, it will check whether there are too many or too few verbs in the phrase, whether it needs adverbs, whether there are enough subjects in it, and so on...
E-Hanni: Well, I think I already have to go back to the second layer, because while you were talking I've put together the second word, and it's 'fins'. But 'doll fins' doesn't really make sense. So let me go back to the second layer and reassess... Ah, you probably said 'dolphins'. All right! Now the next phonemes I've put together made the word 'swim' and the word 'passed'. But now my phrase doesn't really make sense, because I have two verbs. So let me check 'passed' again: I need to find an adverb that sounds like 'passed' so that my phrase is grammatically correct. Let me go back to the previous layers again and... I already see it. It seems like the 'p' in the first layer could also be an 'f', and then it makes 'fast'. Dolphins swim fast!
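The third layer, sketched as a toy word-level check. The part-of-speech table and the sound-alike list are illustrative assumptions, standing in for a real language model:

```python
# Toy word-level knowledge, standing in for a real language model.
PART_OF_SPEECH = {"dolphins": "noun", "swim": "verb",
                  "passed": "verb", "fast": "adverb"}
SOUND_ALIKES = {"passed": ["fast"]}   # near-homophones from layers 1 and 2

def revise(words):
    """If the phrase ends noun-verb-verb, try a sound-alike adverb."""
    tags = [PART_OF_SPEECH[w] for w in words]
    if tags == ["noun", "verb", "verb"]:          # two verbs: suspicious
        for alternative in SOUND_ALIKES.get(words[-1], []):
            if PART_OF_SPEECH[alternative] == "adverb":
                return words[:-1] + [alternative]
    return words

print(revise(["dolphins", "swim", "passed"]))  # ['dolphins', 'swim', 'fast']
```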
Hannes: That's right, e-Hanni. People who sometimes dictate to their phone may already have seen this happening: the more input you give your phone, the more you may see words at the beginning of your phrase start changing, because the system has become wiser. It knows what you're trying to say, or not trying to say, and that's why it changes some words.
E-Hanni: So, in short, about the Hidden Markov Model: it's a great fit for the sequential nature of speech. However, it's not that flexible, and it cannot really grasp all the varieties of the phonemes; that's too much for it.
Hannes: All right, next to the Hidden Markov Model we also have the Neural Network, so let's maybe talk a bit about that one. A good thing about the Neural Network is that it's flexible. As the name itself says, the working of the Neural Network is based on how our brain works: a lot of nodes that are all connected with each other. Maybe let's visualize again. E-Hanni, can you help us?
E-Hanni: Yes! A Neural Network is built up of an input layer, a hidden layer, and an output layer, and the middle part can be composed of many different layers. Now, as you can see, the connections all have different weights, so that only information that passes a certain threshold will be sent through to the next node. It also means that if a node has to choose between two inputs, as node C here has to choose between the input of A or B, then it will choose the input of the node with which it has the strongest connection. So in this case it will take the information from A. Sometimes, in some systems, it can also take both inputs and make a ratio of them. Here you can see that it takes most of the input of A, but also a little bit of the input of B.
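A minimal sketch of one such node in Python; the weights and threshold are illustrative:

```python
import numpy as np

# Node C receives from A and B; its connection with A is strongest.
weights = np.array([0.8, 0.2])
THRESHOLD = 0.5   # information below this is not passed on

def node_c(input_a, input_b):
    """Combine both inputs as a weighted ratio, gated by a threshold."""
    total = float(weights @ np.array([input_a, input_b]))
    return total if total > THRESHOLD else 0.0

print(node_c(1.0, 1.0))  # 1.0: mostly A's signal, a little of B's
print(node_c(0.2, 0.2))  # 0.0: too weak to pass the threshold
```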
Hannes: The interesting thing about the Neural Network is that it's flexible, so it can change over time. This means that in the beginning we have to train the Neural Network, which also means that in the beginning all the different connections have the same weight.
E-Hanni: Yes, indeed! Here you can see an empty neural network, which means that everything has the same weight. We give a certain input to the neural network and we say what the desired output is. Then we let the neural network do its thing, and it will come up with a certain output, which is of course not the same as the desired output, because it's still young and still needs to be trained. The difference between the two is what we call the error. We also tell the Neural Network that there is an error. From that point the Neural Network can start adapting itself so that it can make the error smaller. Now, for the Neural Network to keep improving, it needs a lot, a lot of inputs to make the error go away.
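A minimal sketch of this training loop, using a simple perceptron-style weight update as an illustrative stand-in for the backpropagation real networks use:

```python
import numpy as np

weights = np.array([0.5, 0.5])   # 'empty' network: equal weights everywhere
LEARNING_RATE = 0.1

def train_step(inputs, desired):
    """Tell the network its error and let it adapt its weights."""
    global weights
    output = weights @ inputs                 # the network's current answer
    error = desired - output                  # the difference we report back
    weights = weights + LEARNING_RATE * error * inputs   # adapt to shrink it
    return error

# It takes many inputs before the error fades away.
for _ in range(50):
    error = train_step(np.array([1.0, 0.0]), desired=1.0)
print(round(error, 4))  # close to 0: the network has learned this example
```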
Hannes: And that's a downside. Another downside is that it's a bad fit for the sequential nature of speech. But on the plus side, as already said, it's flexible, and it can also grasp the varieties of the phonemes. By that I mean it can see the difference between unique voices, emotions, phonemes at the beginning or at the end of a phrase, and so on. So that's really good. Now I think it's time for e-Hanni to do the conclusion. Right, e-Hanni?
E-Hanni: These upsides and downsides complement the upsides and downsides of the Hidden Markov Model. That is why the Hidden Markov Model and the Neural Network are often combined nowadays, and that's why we talk about a... hybrid!
So that was it. We tried to put all the difficult parts into one coherent story, and we hope you enjoyed it. If you're interested, please go look it up on the Internet.
Ciao! Bye bye!