Hannes: Hi, I'm Hannes.
Rodolphe: Hey, I'm Rodolphe.
Hannes: Last week we gave a workshop about speech recognition at our job, and one part of that workshop was about the technology behind it. To be honest, it's not our field of expertise at all, so we had a hard time reading a lot of difficult material, and we never found a video that explains it easily. Now that we've put so much effort into it, maybe it's time to put this video online. So here you have it: Speech Recognition for Dummies.
Rodolphe: Oh yeah, and by the way, last week we had the help of two experts called e-Rodi and e-Hanni, and those two guys are also gonna come back today to help us explain it to you. Enjoy!
All right, so it all starts with us humans making sound, making noise, in a normal environment. Technically we call that an analog environment, and the thing is that a computer cannot work with analog data; it needs digital data to be able to work. That's why the first piece of equipment we need is what we call an analog-to-digital converter. Actually, we can also just call it a microphone. Hey, e-Rodi, can you help the people here understand how a microphone works?
E-Rodi: Okay, sure, I'll help. I'll pretend to be a microphone converting a sentence from analog to digital. Which phrase do you want me to use?
Rodolphe: How you doing?
E-Rodi: All right, let's go! In order for you humans to see the conversion, we computers use a visualization called a spectrogram. To create the spectrogram, three main steps are needed. First, I capture the sound wave and place it in a graph showing its amplitude over time. As you can see, the amplitude is measured in decibels, and we can even guess the three words you just said. Second, I chop this wave into blocks of approximately a second each. I'm not really that good of a microphone, to be honest, but I have colleagues that can make these blocks much thinner. As you can see, the height of each block determines its state. To each state we can assign a number, and since a number is something digital, we have successfully converted the sound from analog to digital. Two steps down, one more to go!
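To make e-Rodi's chopping-and-numbering idea concrete, here is a minimal Python sketch of sampling and quantization. The sample rate, the bit depth, and the sine wave standing in for a voice are all illustrative assumptions; a real microphone pipeline works on actual sound pressure.

```python
import numpy as np

# Illustrative parameters: speech is commonly sampled at 16,000
# blocks per second, with each block's height stored in 16 bits.
SAMPLE_RATE = 16_000
BIT_DEPTH = 16

def analog_to_digital(duration_s=0.01, freq_hz=440.0):
    """Sample a pretend 'analog' sine wave and number each block."""
    t = np.arange(0, duration_s, 1.0 / SAMPLE_RATE)
    analog = np.sin(2 * np.pi * freq_hz * t)   # the continuous-looking wave
    levels = 2 ** (BIT_DEPTH - 1) - 1          # how many 'states' we can name
    return np.round(analog * levels).astype(np.int16)

print(analog_to_digital()[:5])  # the first five block heights, now digital
```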
Even though the data is digitized, we are still missing something. In the speech recognition process we actually need three elements of the sound: its frequency, its intensity, and the time it took to make it. Therefore I will use a super complex formula called the Fast Fourier Transform to convert the graph you're currently seeing into what we call a spectrogram. To ease your understanding, I show you here both a hand-drawn version and a computer-made version of the spectrogram. As you can see, the spectrogram shows the frequency on the vertical axis and the time on the horizontal axis, and the colors actually show the energy you use to make the sound: the brighter the color, the more energy was used. The last interesting fact about the spectrogram is the time scale. As you can see, it's way more precise: each vertical line is between 20 and 40 milliseconds long and is called an acoustic frame.
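A minimal sketch of how such a spectrogram can be computed, assuming a 16 kHz sample rate and 25 ms frames (both illustrative; real systems typically also overlap the frames):

```python
import numpy as np

def spectrogram(samples, sample_rate=16_000, frame_ms=25):
    """Chop the signal into ~25 ms acoustic frames and FFT each one."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)                   # soften the frame edges
    # Each row: the energy per frequency bin for one acoustic frame.
    return np.abs(np.fft.rfft(frames * window))

# One second of a 300 Hz tone standing in for a voice.
t = np.linspace(0, 1, 16_000, endpoint=False)
print(spectrogram(np.sin(2 * np.pi * 300 * t)).shape)  # (40, 201): time x frequency
```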
Okay, now it's time to be really honest: I'm a good analog-to-digital converter, but my job actually stops there. Although I managed to make a digital version of the sounds you made, I have no idea whatsoever what they're supposed to mean, if they even mean anything. So I suggest that my colleague e-Hanni tells you all about how computers can understand the meaning of sound. But before that, our real-life versions have to do a little work. Let's split the work like this: you guys explain the concept of phonemes, and we take over the rest of the heavy explanation.
Rodolphe: You got yourself a deal. All right, it's time for a little introduction to linguistics. What is a phoneme? A phoneme is as short as 20 to 40 milliseconds, so it's super, super short, and it's a unit of sound that distinguishes one word from another in a particular language. To put it differently, it's the tiniest part of a word that you can change that also changes the meaning of that word. For instance, 'thumb' and 'dumb' are two different words that you can distinguish by the substitution of one phoneme, 'th', for another, 'd'. Those phonemes can be spoken differently by different people, but it's always the same phoneme that is meant. Those variations are what we call allophones, and the reasons for those variations are the accent of the person, their age, their gender, the position of the phoneme within the word, or even the emotional state of the speaker. Those phonemes are important because they are the very basic building blocks that the speech recognition software can put in the right order to form first a word, then a sentence, and so on. The speech recognition software does that by using two techniques: the Hidden Markov Model and the Neural Network.
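As a tiny illustration of phonemes as building blocks, here is a toy pronunciation dictionary in Python. The phoneme symbols are informal stand-ins; real systems use standardized sets such as ARPAbet.

```python
# Toy pronunciation dictionary: each word is a sequence of phonemes.
LEXICON = {
    "thumb": ["th", "ah", "m"],
    "dumb":  ["d", "ah", "m"],
}

def differing_phonemes(word_a, word_b):
    """List the positions where two same-length words differ in phonemes."""
    pairs = zip(LEXICON[word_a], LEXICON[word_b])
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]

# 'thumb' and 'dumb' differ by exactly one phoneme: a minimal pair.
print(differing_phonemes("thumb", "dumb"))  # [(0, 'th', 'd')]
```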
Hannes: All right, so I'll explain those two, and let's maybe first start with the Hidden Markov Model. As Rodolphe just said, the purpose is to reconstruct the phrase that has just been said by putting the right phonemes after each other, and the Hidden Markov Model does that by using statistical probabilities: it checks how probable it is that one phoneme follows another, and so on. To be precise, the Hidden Markov Model does that using three different layers. Maybe, e-Hanni, you can help us by visualizing this?
E-Hanni: Okay, sure, I'll help. I'll pretend to be a speech recognition system using a Hidden Markov Model. Which phrase would you like me to use?
Hannes: Dolphins swim fast.
E-Hanni: All right, let's go.
Hannes: So first of all, the model has to check, on an acoustic level, the probability that the phoneme it has heard really is that phoneme. As Rodolphe just said, we say phonemes in very different ways depending on emotion, position in the phrase, and so on. So the system first needs to check whether the variation it has heard really is that phoneme.
E-Hanni: Okay, so the first utterance I recorded in Hannes's phrase was a 'd'. Statistically speaking, Hannes could have said 't', 'th', or 'd'. But most likely, most probably, it was a 'd'. So let's take that one.
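A minimal sketch of that first, acoustic layer; the scores here are made-up numbers standing in for what a trained acoustic model would actually output:

```python
# Made-up scores: how well the recorded utterance matched each
# candidate phoneme (a real acoustic model produces these itself).
acoustic_scores = {"t": 0.20, "th": 0.25, "d": 0.55}

# Layer one keeps the candidate with the highest probability.
best = max(acoustic_scores, key=acoustic_scores.get)
print(best)  # 'd'
```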
Hannes: Once the software has reached a decent probability of what the most likely phoneme is, it's time to go to the second layer. In the second layer, the Hidden Markov Model will check whether it's probable that two phonemes are standing next to each other, yes or no. An example in English: if you have the sound 'st', then it's most likely that a vowel will follow, for example an 'a', as in 'stable'. It's less likely, or maybe not even possible in English, to have the sound 'n' after it, because 'stn': I don't think it exists, and if it does, it's not probable.
E-Hanni: Okay, so after the 'd' I heard an 'o'. Statistically speaking, it is actually quite probable that an 'o' follows a 'd', so let's keep it that way. After that I also heard an 'l', and again it's quite probable that an 'l' follows an 'o'. So I think I've put together the first phonemes to make a word: the word 'doll'.
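The second layer, sketched with made-up transition probabilities; a real model estimates these from large amounts of speech data:

```python
# Made-up probabilities that one phoneme follows another.
TRANSITIONS = {
    ("s", "t"): 0.30,
    ("t", "a"): 0.25,   # 'st' + vowel, as in 'stable': likely
    ("t", "n"): 0.00,   # 'stn' essentially never occurs in English
    ("d", "o"): 0.20,
    ("o", "l"): 0.15,
}

def sequence_probability(phonemes):
    """Multiply transition probabilities along a phoneme sequence."""
    prob = 1.0
    for pair in zip(phonemes, phonemes[1:]):
        prob *= TRANSITIONS.get(pair, 0.0)
    return prob

print(round(sequence_probability(["d", "o", "l"]), 3))  # 0.03: plausible, 'doll'
print(sequence_probability(["s", "t", "n"]))            # 0.0: ruled out
```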
Hannes: Let's see about that, e-Hanni! Because in the third layer the software will check on the word level: it will check whether it's probable that certain words stand next to each other and whether the result makes sense. For example, it will check whether there are too many or too few verbs in the phrase, whether it needs adverbs, whether there are enough subjects in it, and so on...
E-Hanni: Well, I think I already have to go back to the second layer, because while you were talking I've put together the second word, and it's 'fins'. But 'doll fins' doesn't really make sense. So let me go back to the second layer and reassess... Ah, you probably said 'dolphins'. All right! Now the next phonemes I've put together made the word 'swim' and the word 'passed'. But now my phrase doesn't really make sense, because I have two verbs. So let me check 'passed' again: I need to find an adverb that sounds like 'passed' so that my phrase is grammatically correct. Let me go back to the previous layers again and... I already see it. It seems like the 'p' in the first layer could also be an 'f', and then it makes 'fast'. Dolphins swim fast!
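The third layer, sketched as a toy word-level check. The part-of-speech table and the sound-alike list are illustrative assumptions, standing in for a real language model:

```python
# Toy word-level knowledge, standing in for a real language model.
PART_OF_SPEECH = {"dolphins": "noun", "swim": "verb",
                  "passed": "verb", "fast": "adverb"}
SOUND_ALIKES = {"passed": ["fast"]}   # near-homophones from layers 1 and 2

def revise(words):
    """If the phrase ends noun-verb-verb, try a sound-alike adverb."""
    tags = [PART_OF_SPEECH[w] for w in words]
    if tags == ["noun", "verb", "verb"]:          # two verbs: suspicious
        for alternative in SOUND_ALIKES.get(words[-1], []):
            if PART_OF_SPEECH[alternative] == "adverb":
                return words[:-1] + [alternative]
    return words

print(revise(["dolphins", "swim", "passed"]))  # ['dolphins', 'swim', 'fast']
```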
Hannes: That's right, e-Hanni. People who sometimes dictate to their phone may already have seen this happening: the more input you give your phone, the more you may see words at the beginning of your phrase start changing, because the system has become wiser. It knows what you're trying to say, or not trying to say, and that's why it changes some words.
E-Hanni: So, in short, about the Hidden Markov Model: it's a great fit for the sequential nature of speech. However, it's not that flexible, and it cannot really grasp all the varieties of the phonemes; that's too much for it.
Hannes: All right, next to the Hidden Markov Model we also have the Neural Network, so let's maybe talk a bit about that one. A good thing about the Neural Network is that it's flexible. As the name itself says, the working of the Neural Network is based on how our brain works: a lot of nodes that are all connected with each other. Maybe let's visualize again. E-Hanni, can you help us?
E-Hanni: Yes! A Neural Network is built up of an input layer, a hidden layer, and an output layer, and the middle part can be composed of many different layers. Now, as you can see, the connections all have different weights, so that only information that passes a certain threshold will be sent through to the next node. It also means that if a node has to choose between two inputs, as node C here has to choose between the input of A or B, then it will choose the input of the node with which it has the strongest connection. So in this case it will take the information from A. Sometimes, in some systems, it can also take both inputs and make a ratio of them. Here you can see that it takes most of the input of A, but also a little bit of the input of B.
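A minimal sketch of one such node in Python; the weights and threshold are illustrative:

```python
import numpy as np

# Node C receives from A and B; its connection with A is strongest.
weights = np.array([0.8, 0.2])
THRESHOLD = 0.5   # information below this is not passed on

def node_c(input_a, input_b):
    """Combine both inputs as a weighted ratio, gated by a threshold."""
    total = float(weights @ np.array([input_a, input_b]))
    return total if total > THRESHOLD else 0.0

print(node_c(1.0, 1.0))  # 1.0: mostly A's signal, a little of B's
print(node_c(0.2, 0.2))  # 0.0: too weak to pass the threshold
```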
Hannes: The interesting thing about the Neural Network is that it's flexible, so it can change over time. This means that in the beginning we have to train the Neural Network, which also means that in the beginning all the different connections have the same weight.
E-Hanni: Yes, indeed! Here you can see an empty neural network, which means that everything has the same weight. We give a certain input to the neural network and we say what the desired output is. Then we let the neural network do its thing, and it will come up with a certain output, which is of course not the same as the desired output, because it's still young and still needs to be trained. The difference between the two is what we call the error. We also tell the Neural Network that there is an error. From that point the Neural Network can start adapting itself so that it can make the error smaller. Now, for the Neural Network to keep improving, it needs a lot, a lot of inputs to make the error go away.
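A minimal sketch of this training loop, using a simple perceptron-style weight update as an illustrative stand-in for the backpropagation real networks use:

```python
import numpy as np

weights = np.array([0.5, 0.5])   # 'empty' network: equal weights everywhere
LEARNING_RATE = 0.1

def train_step(inputs, desired):
    """Tell the network its error and let it adapt its weights."""
    global weights
    output = weights @ inputs                 # the network's current answer
    error = desired - output                  # the difference we report back
    weights = weights + LEARNING_RATE * error * inputs   # adapt to shrink it
    return error

# It takes many inputs before the error fades away.
for _ in range(50):
    error = train_step(np.array([1.0, 0.0]), desired=1.0)
print(round(error, 4))  # close to 0: the network has learned this example
```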
Hannes: And that's a downside. Another downside is that it's a bad fit for the sequential nature of speech. But on the plus side, as already said, it's flexible, and it can also grasp the varieties of the phonemes. By that I mean it can see the difference between unique voices, emotions, phonemes at the beginning or at the end of a phrase, and so on. So that's really good. Now I think it's time for e-Hanni to do the conclusion. Right, e-Hanni?
E-Hanni: These upsides and downsides complement the upsides and downsides of the Hidden Markov Model. That is why the Hidden Markov Model and the Neural Network are often combined nowadays, and that's why we talk about a... hybrid!
So that was it. We tried to put all the difficult parts into one coherent story, and we hope you enjoyed it. If you're interested, please go look it up on the Internet.
Ciao! Bye bye!