So, hi everyone, I'm Preethi, I'm in the department of
Computer Science at IIT Bombay. And so today, my tutorial is
going to be about ASR systems. So I'm just going to give a very
kind of high level overview of how standard ASR
systems work and what are some of the challenges. And hopefully leave you
with some open problems and things to think about. For the uninitiated, what is
an ASR system, it's one which accurately translates spoken
utterances into text. So text can be either
in terms of words or it can be a word sequence, or
it can be in terms of syllables, or it can be any sub-word units
or phones, or even characters. But you're translating speech
into its corresponding text form. So lots of well-known examples,
and most of you must have encountered at least
one of these examples. So YouTube's closed captioning,
so an ASR engine is running, and producing the corresponding
transcripts for the speech, the audio, and the video clips. Then voice mail transcription, even if you've not used it,
you might have looked at it and laughed at it, because it's
usually typically very bad. So the voice mail transcription
is also an ASR engine which is running. Dictation systems were actually
one of the older prototypes of ASR systems. And I think now, it's obviously
gotten much better, but I remember Windows used to come
prepackaged with a dictation system, and
that used to be pretty good. So dictation systems, of course,
you're speaking out, and then you automatically get
the corresponding transcript. Siri, Cortana, Google Voice, so all of their front ends
are ASR engines and so on. This is ASR, but if, I didn't
get a picture of Cortana, I apologize, so this is Siri. But if you were to say,
call me a taxi, and Siri responded from now on,
I'll call you Taxi, okay? This is not the fault
of an ASR system, so the ASR system did its job. But there's also a spoken
language understanding module to which the ASR will feed into. And so, that didn't do
its job very well, and so it got the semantics wrong. So people typically tend to
correlate the understanding and the transcription part as ASR. But ASR is strictly just
translating the spoken utterances into text. So why is ASR desirable, and why would you want to build ASR
systems for maybe all languages? So obviously, speech is the most
natural form of communication. So rather than typing,
which is much more cumbersome, you can speak to your devices. And if ASR systems were good,
then that kind of solves lots of issues, and
also keeps your hands free, which is not always
a good thing. So many car companies like
Toyota and Honda are investing quite a bit to build good
speech recognition systems, because they want you to
be able to drive and talk. I don't know if I entirely
recommend it, but clearly, it leaves other modalities
open to do other things. Also, another very kind of
socially desirable aspect of building ASR systems is that
now you have interfaces for technology which can be
used by both literate and illiterate users. So even users who cannot read or
write in a particular language can interact with technology,
if it is voice driven. And this was, so endangered languages was a point which was brought up, that lots of
languages are currently close to extinction, or they've been
given this endangered status. So if you have technologies
which are built for such languages, it can contribute towards the
preservation of such languages. So that's just one kind of nice
point of why you would want to invest in building ASR systems,
so why is ASR a difficult problem? So, the origins of ASR
actually go way back, it's a longstanding problem
in AI, back in the 50's. So why is it such a difficult
problem, when clearly it's not very difficult for
humans to do speech recognition, if you're familiar
with the language? It's difficult because there are
several sources of variability, so one is just your
style of speech. So apparently, the way I'm
speaking, it's kind of semi spontaneous, but I more or less
know what I'm going to say, or here I'm guided by
the points on the slides. And it's certainly continuous speech, so I'm not speaking in any kind of staccato manner, or
I'm not saying isolated words. So just the style of speech can have a lot to do with how
ASR systems will perform. And intuitively, isolated
words are much easier for ASR systems than
continuous speech. And that's because when you're
speaking spontaneously, words are flowing freely
into one another. And there are lots of
variations which come in due to pronunciations, and there's this Phenomenon of
what's known as coarticulation. So it's just that words, your preceding word affects the
word which is coming, and so on. And so, because of this kinda smooth characteristic of continuous speech, it's quite challenging for ASR systems to handle it. Of course nowadays
if you have lots and lots of data, this is not
really a problem, but this is one prominent
source of variability. Another of course
is environment. So if you're speaking in
very noisy conditions, or if your room acoustics are very
challenging, for example, it's very reverberant, so
you have a lot of echo. Or you're speaking in
the presence of interfering speakers, so
that's actually a real killer. So if you're talking and
there's background noise, but the background noise is not,
like the vehicle noise or something which can be
easily isolated, but there's actually people
talking in the background, that's really hard for
ASR systems to kind of pick out the voice in the foreground and
work on it. So environment is another
important source of variability and something
which people are very actively working on now to build
what are known as robust ASR systems which
are robust to noise. Then of course speaker
characteristics. So all of us have different
ways of speaking, so of course, accent comes into play. Just your rate of speech,
some people speak faster, some people enunciate more. Of course your age also changes the characteristics of your speech. So child speech is going to be very different from adult speech. So there are various characteristics of speakers which also contribute to making this a challenging problem. And of course there are lots of
tasks with specific constraints. For example the number of
words in your vocabulary which you're trying
to recognize. So if you're looking at
Voice Search Applications, they're looking at million
word vocabularies. But if you're looking
at a Command and Control task where you're only
trying to move something, or your grammar is
very constrained, there your vocabulary is much
smaller so that task is simpler. And you might also have
language constraints, so maybe you're trying to recognize
speech in a language which doesn't have a written form. So it doesn't have transcripts at all; then what do you do? Or if you're working with a language which is very morphologically rich, or it has words which have agglutinative properties, then what do you do with your language models? Maybe n-grams are not good enough. So lots of task-specific constraints also
contribute to making this a challenging problem. Okay, so hopefully I
have convinced you that ASR is quite a challenging
problem to work on. So let's kind of just go
through the history of ASR, and this is just a sampling. I'm not going to cover most
of the important systems. There has been a lot of work
since the 50s, this is actually the 20s, but I'll talk
a little about what this is. So the very first kind of ASR,
and I'm air quoting, because it really isn't doing
any recognition of any sort, but it's this charming
prototype called Radio Rex. So it's this tiny dog which is
sitting inside its kennel and it's controlled by a spring, which in turn is controlled
by an electromagnet. And the electromagnet is
sensitive to energies at frequencies around 500 Hertz. And 500 Hertz happens to be
the frequency of the vowel sound e in Rex. So when someone says Rex, the
dog will jump out of the kennel. So this is purely
a frequency detector, so it's not doing any recognition,
but it's a very charming prototype,
but clearly, can anyone kind of, there are so many issues with this, but with this kind of system you're hard coding it to fire at this 500 Hertz. What is an obvious problem
with such a prototype? >> Noise. >> Noise, yeah, of course. >> Different people,
it might not [CROSSTALK] >> Yeah, exactly. So this only works for
adult men. So it's sexist, and it's ageist. It doesn't work on
children's speech, it doesn't work
on female speech. So yeah, that's obviously
an issue when you hard code something. I recently discovered this
is on eBay, and you can- >> [INAUDIBLE]. >> Yeah. >> I wanted one of these. >> Really?
Okay, yeah, yeah, yeah. >> There was one on sale for,
and so I started, then actually created my
account to bid for this thing. And then I noticed the person
who kept bidding on top of me was someone who's
called Ina Kamensky. >> No [LAUGH].
>> He's a professor at James U, because he wanted one as well. >> Okay, so he got it? >> I stopped bidding and asked
him for the video if he gets it. >> [LAUGH]
>> So he did buy it? >> [INAUDIBLE] Giving you
the video, but he did win. >> Wow, okay. Okay, so
I can add that to my slide x. [INAUDIBLE]
>> He has it, yeah. Okay, so that's the very
initial prototype, but that's single word, and
it's a frequency detector. So that's not really
doing recognition. So the next kind of major system
was SHOEBOX, which was in 1962, and it was by IBM. And they actually demoed
the system, it did pretty well. But what it was recognizing
was just connected strings of digits. So it's just purely
a digit recognizer and a few arithmetic operations. So it could do basic arithmetic, you could say 6+5 is, and so on. So it would perform very well, but of course, this is also very limited. So it's just a total of 16
words, so ten digits and six operations, and it's doing
isolated word recognition. So actually, sorry,
this is not connected speech. You would have to say it with a
lot of pause in between each of the individual words. So this was just doing
isolated word recognition. And then in the 70s, there was
kind of a lot of interest in developing speech recognition
systems and AI based systems. And ARPA, which is this
big agency in the US, funded this $3 million
project in 1975, and three teams worked on
this particular project. And the goal was to build
a fairly advanced speech recognition system, which is not
just doing some isolated word recognition and would actually
evaluate continuous speech. And so, the winning system,
from this particular project, was HARPY out of CMU. And HARPY actually
was recognizing connected speech from
a 1,000-word vocabulary. So we are slowly making
progress, but it still didn't use statistical models, which
is kind of the current setting. And this was in
the 1980s, kind of pioneered by Fred Jelinek at IBM
and others around the same time. Statistical models became very, very popular to be used
in speech recognition. And the entire problem was
formulated as a noisy channel. And one of the main machine
learning paradigms, which was used for this particular problem
were hidden Markov models. So I'll come to this not too
much in detail, but at least, at the high level, I'll refer
to those in coming slides. So, the statistical
models were able to generalize much better
than the previous models because the previous models
are all kind of rule-based. And now, we are moving
into 10K vocabulary sizes. So, the vocabulary size
is getting larger, and these are now kind of falling
in what are known as large vocabulary continuous
speech recognition systems. Although of course, now, large
vocabulary is much larger than 10K, but that was in the 80s. And we were in this plateau
phase for a long time. And in 2006, deep neural
networks kind of came to the forefront, and
now all the state-of-the-art systems are powered by
deep neural networks. So any of these systems you
might have used, Cortana, Siri, Voice Search,
at the backend, they're powered by deep
neural network based models. So okay, so this is just a video, which
was actually quite impressive. And this happens to be by the
head of Microsoft Research, Rick Rashid. And this was a really
impressive video. So this came out in 2012 when
Microsoft had completely shifted to deep neural based
models in the back end, I mean, even in the production models. And this, actually, so
he's speaking, and in real time, the speech recognition system is
working and giving transcripts. And you'll see that the quality of the
transcriptions is really good. And while the transcriptions
were being displayed on the screen, they were also
doing translation into Chinese. And so there was another screen,
where the MT system, the machine translation system,
was working and producing real time
Mandarin transcripts. So it was a very impressive
demo, and I'm told it was not, maybe Sirena knows more
about the demo itself, but [CROSSTALK]
>> Came together to develop another breakthrough in
the field of speech recognition research. The idea that they had
was to use a technology in a way patterned after
the way the human brain works. It's called deep
neural networks. And to use that to take in much
more data than had previously been able to be used with
the hidden Markov models and use that to significantly
improve recognition rate. So that one change,
that particular breakthrough, increased recognition rates
by approximately 30%. That's a big deal. That's the difference between-
>> Okay, so you can see the transcripts are
pretty faithful to the speech. And this, it was actually
happening real time. So that's very impressive. Okay, so given all of this, so you might be wondering
what's next. So whenever I tell people I
work on speech recognition, I'm asked this
question many times. Isn't that problem solved? So what are you doing? What are you
continuing to work on? So okay, so just to kind of
motivate this question, and also kind of related to the topic
that our team is working on, which is accent adaptation. I'll show a video of our
12th president of India. So this is Pratibha Patil. So she is sitting in a closed
room, and she's giving a speech. And this is YouTube's automatic
captioning system which is working, so this is their state-of-the-art ASR system. And I got this a few
months back, so let's look at the video. >> It is this that will define
India as a unique country on the world platform. India is also an example of how
economic growth can be achieved within a democratic framework. I believe economic growth should
translate into the happiness and progress of all. Along with it there should
be development of art and culture, literature and
education, science and technology. We have to see how to harness
the many resources of India for achieving common good and
for inclusive group. >> Okay, so the words in red of
course are the erroneous words, and this is not bad at all. So if you look at this metric
which is typically used to evaluate ASR systems, so
it's the word error rate. So you take this entire word sequence and you take the true word sequence, and you compute an edit distance between these. So you just align these two sequences and compute where words actually got swapped for other words, where certain words were inserted, and where certain words were deleted, and you get this error rate.
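To make the metric concrete, here is a minimal sketch of computing word error rate with an edit-distance alignment; the example sentences are invented, and real scoring tools report substitutions, insertions, and deletions separately:

```python
# A minimal word-error-rate sketch via edit-distance alignment (illustrative only).
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# one substituted word out of seven gives roughly 0.14, i.e. 14% WER
print(word_error_rate("we have to see how to harness", "we have to sea how to harness"))
```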
And here it's about 10%, which is fairly respectable, so that's not bad. And that's because
Google has a very, very, well developed Indian English
ASR system, and Google actually calls all of these variants of
English different languages. So they have Indian English,
Australian English, British English,
various variants of English, and their system is extremely
sophisticated for Indian English. >> [INAUDIBLE]
>> Yeah, it's clean, it's quiet, exactly, yeah. So now, let's take speech from
our 11th president, who is Abdul Kalam. And why I picked him is because he
has a more pronounced accent. Most of you must have heard
his speech at some point. So, let's look at how-
>> [INAUDIBLE] Which any human being can ever imagine to fight. And never stop fighting until
you are able to destined place, that is the unique you. Get that unique you. It's a big battle. The back in which you don't
you take a the battlements you have out for unique needs,
for you need to do is you must have that battle,
one is you have to set the goal. The second one is acquire
the knowledge and third one is a hard
work with devotion. And forth is perseverance. >> Okay, so
I'm actually curious, Alan, how do you think you would have
done with recognizing this? >> Well,
I would definitely make errors. >> Yeah.
>> Okay, because he is much
more heavily accented. And what's really hard is,
you can follow the transcript, I'm thinking,
Yeah it looks exactly like that. >> [LAUGH]
>> [CROSSTALK] Listening to it and not being right. >> And I'm assuming most of us
in the room are definitely going to do better than this, right? Definitely better than 39%. So actually, when I've shown
this video before, and some people from the crowd
said that this is not fair, because you cannot entirely attribute it to accent, because he's
in this large room. It's more reverberant
than the previous video. His sentence structure
is quite non standard, like he moves in and
out of various sentences. And the previous speaker was, of course, reading. So the read speech has a very,
very different kind of grammatical structure
than this, right? So I said, okay, let's try
to make it fairer, right? So I chose someone who is
an American English speaker, but has an accent, because my
hypothesis is that accent had a lot to do with this misrecognition rate getting higher. So, this is an American English
speaker who has a strong accent and has very
non-standard word order. And so, an obvious choice was Sarah
Palin, if you know who she is. Most of you know who
Sarah Palin is, okay. >> The illegal immigrants
welcoming them in, even inducing and
seducing them with gift baskets. Come on over the border and here's a gift basket full of
teddy bears and soccer balls. Our kids and our grandkids,
they'll never know then what it is to be rewarded for
that entrepreneurial spirit that God creates within us in order
to work, and to produce, and to strive, and to thrive,
and to really be alive. Wisconsin, Reagan saved the hog
here, your Harley Davidson. It was Reagan who saved-
>> Okay, so if I were to try and predict what the next word is,
given her word context, there's no way I
could reproduce this. It was quite arbitrary, right? You have no idea what
she's gonna say next. >> If you showed me that
transcription, I would say, well, it's clearly mostly wrong. >> [LAUGH] >> Nobody could
actually say that. >> So, here, clearly, language
model can only help so much, cuz this is really,
it's quite arbitrary. She has a strong accent,
but YouTube's ASR engines do have a lot of southern
dialects in their training data. And she's also speaking to
a large crowd in the open. So the acoustic conditions
are somewhat similar to the previous video. And the only difference is that
the speaker in the previous video had a much more
pronounced accent. So I'm not saying that it
entirely had to do with the degradation in performance. But that certainly was a major
factor in why the speech recognition engine
didn't do as well, yes? >> The second time it picked
up correctly gift basket, but the first time it
says your basket, so- >> Is that a- >> The illegal immigrants, welcoming them in
even inducing and seducing them with gift baskets. Come on over the border-
>> So, you see- >> Like it could not [INAUDIBLE] for the first ten minutes,
and it said you all, but the next time, it correctly
identified [CROSSTALK]. >> I see, I see, the second,
yeah, yeah, yeah, yeah. >> What would be the-
>> So, and there are multiple
things here at play. So maybe after I show
the structure of an ASR system, I can come back
to this question. Any other questions so far? Okay. Let's actually move into
what's the structure of a typical ASR system? So this is more or less the pipeline of what's
typically in an ASR system. So you have an acoustic analysis
component, which sees the speech waveform and converts it into
some discrete representation. And, or features, if most of you are familiar
with the term features. So the features are then fed into what's known as the acoustic model. So actually, instead of explaining each of the components here, I'll go into each of them individually and explain it more. So the very first component is
just looking at the raw speech waveform and converting it
into some representation, which your algorithms can use. This is a very high level, so
there is a lot of underlying machinery and I am just
giving you an overview of what is going on, so
you have the raw speech signal, which is then discretized, because you can't really work with a continuous speech signal. So, you sample it and generate these discrete samples. And each of these windows is typically of the order of 10 to 25 milliseconds of speech. And the idea is that once you have each of these what we know as frames, so speech frames, which are around 25 milliseconds, sometimes even larger, but typically 25 milliseconds, then you can extract
acoustic features, which are representative of all
the information in your signal. And another reason why you
discretize at particular sampling rates is that
the assumption is that within each of these frames your
speech signal is stationary. If your speech signal is
not stationary within the speech frames,
then you can't effectively extract features from
that particular slice. So, around 25 milliseconds
of speech, yeah. So, you extract features, and this feature extraction is
a very involved process, and it requires a lot of signal processing know-how. But this is also kind of motivated by how our ears work. So actually, one of the most common acoustic feature representations, which are known as Mel-frequency cepstral coefficients or MFCCs. They're actually motivated by
what goes on in your ear or the filter banks,
which apply in your ear. So that might be
too much detail, so let's just think
of it as follows. So you'll start with your
raw speech wave form. You'll generate these tiny
slices which are also known as speech frames. And now each speech frame is
going to be represented as some real-valued vector of features, which captures all the information in the signal as well as possible and is not redundant. So you don't have different dimensions here which are redundant; you want it to be as compact as possible.
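As a rough illustration of this framing-plus-features step, here is a small sketch assuming the librosa library and a hypothetical 16 kHz recording called utterance.wav; the 25 ms / 10 ms frame sizes and 13 coefficients are just common choices, not prescribed values:

```python
# Sketch of framing a waveform and extracting MFCC features per frame.
import librosa

signal, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file

frame_len = int(0.025 * sr)   # 25 ms analysis frames
hop_len = int(0.010 * sr)     # 10 ms shift between successive frames

# 13 MFCCs per frame; each column is one frame's compact feature vector.
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                             n_fft=frame_len, hop_length=hop_len)
print(mfccs.shape)  # (13, number_of_frames)
```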
So now you have these features for each frame, and these are your inputs to the next
component in the ASR system, which is known as
the acoustic model. And it's a very
important component. So, before I go into what
an acoustic model is, just polling this crowd, everyone knows what a phoneme is? Cuz there are lots of linguists here, and many of you might
have encountered it. But anyway, so a phoneme
is just a discrete unit. It's a speech sound
in a language, which can be used to
differentiate between words. So in ASR,
there is this approach, which is known as
beads-on-a-string approach. And this is my terrible
drawing of beads on a string. But each word can be represented
as a sequence of phonemes. So five here is a sequence of
three of these speech sounds. So the phoneme alphabet is very
much like the alphabet for our languages, so
in text, except now, it's covering the sound space,
rather than the textual space. So the analogy is that phonemes are like the letters in your written texts. And each word can be represented
as a sequence of phonemes. So this mapping, so you might be
wondering, how do you know that five actually maps to this
sequence of phonemes? This is written down by
experts in the language. So this particular mapping
between the pronunciation of a word, so this is actually
giving you how this particular word is pronounced. Because you know how each
individual phoneme is pronounced, and this
pronunciation information is actually given to us by experts. So in English, of course, we have a very well developed
pronunciation dictionaries, where experts have given us
pronunciations corresponding to most of the commonly
occurring words in English. And actually we have CMU
to thank for that, so one of the most popularly used
pronunciation dictionaries in English is CMUdict, which is freely available.
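As a small illustration, the CMU dictionary can be queried through NLTK's copy of cmudict (assuming the NLTK data package has been downloaded); the entries shown in the comments are indicative:

```python
# Looking up word-to-phoneme mappings in the CMU pronouncing dictionary via NLTK.
from nltk.corpus import cmudict

pron = cmudict.dict()      # maps a word to a list of possible phoneme sequences
print(pron["five"])        # e.g. [['F', 'AY1', 'V']]
print(pron["probably"])    # several dictionary pronunciations for one word
```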
And it has around 150,000 words, I think? Yeah, of that order. And that is actually one of
the more extensive dictionaries. So typically, this is not
a very easy resource to create. And it's clear why,
because it's pretty tedious. First of all, you need to
find linguistic experts in the language. And then you need to find what
are the most commonly occurring words in the language,
and so on. So this takes time. And not many languages
have very well developed pronunciation dictionaries. And most languages have
around 20 to 60 phonemes. So the phoneme inventory
size is roughly 20 to 60. And this is also something
which needs to be determined by experts. What are the phonemes which
are applicable to a language? So this is a very
language-dependent characterization, like what are
the phonemes which are relevant to a particular language? Okay, so given we have
these units, okay, so one more slide just to motivate, why do we need phonemes,
even from a modeling standpoint. So not just from
the linguistic standpoint, which is that each of these sounds can be kind of discretized in terms of these phonemes. But even from
a modelling standpoint, why are phonemes useful? Why not just use words, instead of trying to split
it into these subword units? So let's look at
a very simple example. So say that your speech wave
form is this string of digits, five, four, one, nine. And say that during training,
you've seen lots and lots of samples of five,
four and one. So, now you can build models to
identify when someone said five, and when someone said four,
and when someone said one. But when you come to nine, so
this is, say, during test time, when you are evaluating this
particular speech utterance. What do you do then if,
during training, you've never seen nine? Of course, this is not
reasonable here because it is a very small vocabulary, but
you can extrapolate when you move to larger vocabularies,
right? So yeah,
all of these words exist, but what do you do when
you come here? What if you were to represent
each of these words as their corresponding sequences
of phonemes? So now five is this sequence of
phonemes, four is this sequence of phonemes, one is this
sequence of phonemes, and so on. And now you've seen
acoustic speech samples, which correspond to each of
these phonemes in your training. So during test time,
when you see nine, you might be able to put together
this string of phonemes. Because you've seen enough
acoustic evidence for each of these individual
phonemes, is that clear? So it helps you generalize, just to move to a more
fine-grained inventory. But of course, with that said, if you have a very
limited vocabulary task. I mean, say that you're almost
certain that during test time you're only going
to get utterances, which are going to stick to
that particular vocabulary. And if you have enough
samples during training, you're probably well off just
building word level models, and not even moving to
the phoneme level. But that is for
very limited tasks, so most tasks you want to do this. So you want to move to this
slightly more fine-grained representation. Okay, so that hopefully
motivates why we need phonemes. So this is the problem, right? So I mentioned you have acoustic
features which you extract from your raw speech signal. And you want the output on this
acoustic model to be what is a likely phone sequence, which corresponded to this
particular speech utterance. So this model is typically,
so initially in the 80s and even now actually, Hidden Markov
models is a paradigm which is used to learn this mapping. So how do I associate a set
of frames, a set of features? So when I say frames and
features, they're kind of interchangeable,
right? So each of these
acoustic-feature vectors corresponds to a speech frame. So how do I chunk, how many
frames are going to correspond to a particular phone? And here I've shown you
a chain of what are known as hidden states,
in your Hidden Markov Model. So this is just like,
think of it as a graph. And it's a weighted graph, so
here I've not put weights on the arcs, but the weights
are all probabilities. So another thing is here, I've
just shown you a single chain that is just a sequence of
these phones put together. But you never know that right? So at each point you can
transition to any phone, because you don't know which
phone is going to appear next. So the entire model
is probabilistic. So you need to have lots
of hypotheses as to what the next phone could have been? So here I have obviously
simplified it by just showing a single chain. And then you have estimates
which say that okay, I think this initial ten frames
most likely corresponds to a certain phone. And the transition probabilities, the probabilities of transitioning from one phone state to the other, determine how many frames you are going to chunk, how many speech frames are going to correspond to each phone. And once you're in a particular state, there are probabilities for having generated each of these vectors. So once I'm in, let's say, this state, there's a probability for having seen this vector, given that I'm in this state. And those probabilities come
from what are known as Gaussian Mixture Models. So it's a probability
distribution which determines that okay, once I'm in
this particular state, this particular speech
vector could be generated with
a certain probability. And how are these probabilities learned? From training data.
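To give a flavour of what such a model contains, here is a toy sketch of one phone's HMM with invented states and transition probabilities, and a single Gaussian standing in for the Gaussian mixture emission density of each state:

```python
# Minimal sketch of HMM acoustic-model ingredients: states, transitions, emissions.
import numpy as np
from scipy.stats import multivariate_normal

states = ["ay_begin", "ay_mid", "ay_end"]      # a 3-state HMM for the phone "ay"

# transition[i][j] = probability of moving from state i to state j
transition = np.array([[0.7, 0.3, 0.0],
                       [0.0, 0.8, 0.2],
                       [0.0, 0.0, 1.0]])

# Each state emits 13-dim feature vectors; means here are made up.
emission_means = [np.zeros(13), np.ones(13), 2 * np.ones(13)]

def emission_prob(state_idx, feature_vector):
    """Likelihood of seeing this acoustic vector while in the given state."""
    return multivariate_normal.pdf(feature_vector,
                                   mean=emission_means[state_idx],
                                   cov=np.eye(13))

frame = np.random.randn(13)    # stand-in for one frame's MFCC vector
print(emission_prob(1, frame))
```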
So that is a detail I'm not going to go into, yeah. >> [INAUDIBLE]
>> Yeah, this is just
the acoustic vector. >> That's it, yeah. So, any questions so far? Okay, so this has obviously
glossed over lots and lots of details. But this is just to give
you a high level idea of, you have a probabilistic model,
which maps sequences of feature vectors to a sequence
of phonemes. So Hidden Markov Models were
the state-of-the-art for a long time. And now, of course,
we have Deep Neural Networks, which are used for
a similar mapping. So now you have your
speech signal and again, you're extracting six
windows of speech frames. And from these, so say that you're only considering
a speech frame here. And then you have, well,
I have a laser pointer. Here, so say considering
a speech frame here. And then you look at
a fixed window around it, a fixed window of frames around it. And you generate, put all of these features together, and that is your input to a deep neural network. And the output is what is the most likely phone to have been produced, given this particular window of speech frames.
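A toy sketch of this context-window idea is below; all sizes, layer widths, and the random "features" are invented, and in practice the per-frame phone targets would come from an HMM alignment:

```python
# Stack a few frames either side of the current frame, feed them to a small
# network, and read off a distribution over phones for that frame.
import numpy as np
import torch
import torch.nn as nn

num_frames, feat_dim, num_phones, context = 200, 13, 40, 5
features = np.random.randn(num_frames, feat_dim).astype(np.float32)  # stand-in for MFCCs

def stack_context(feats, t, context):
    """Concatenate frames t-context .. t+context (clamped at the edges)."""
    idx = np.clip(np.arange(t - context, t + context + 1), 0, len(feats) - 1)
    return feats[idx].reshape(-1)

model = nn.Sequential(
    nn.Linear((2 * context + 1) * feat_dim, 256),
    nn.ReLU(),
    nn.Linear(256, num_phones),
    nn.Softmax(dim=-1),       # posterior probability over phones for this frame
)

x = torch.from_numpy(stack_context(features, t=100, context=context))
phone_posteriors = model(x)   # one probability per phone, summing to 1
```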
But this is a posterior probability over the phones. So I'm not going to go into
details about what this is. But, again,
here you get an estimate of what is the phone given
a speech frame. What is the most likely phone? So it's actually a probability distribution over all the phones. And there are two ways in which DNNs are typically used in acoustic models. So one is, in the previous slide I mentioned, when you have HMMs, you have these states and you have these probability distributions, which govern how the speech vectors correspond to particular states. And these probability distributions are mixtures of Gaussians. But you could kind of not use mixtures of Gaussians here and instead use probabilities
from your DNN. So that is one way in which the
DNN and the HMM models can be combined and
used within an acoustic model. So please feel free to stop and
ask me any question. So at this point, I'm not
really clear if I'm going too technical, so please feel free
to interrupt at any point. So the idea is that we want to
map acoustic features to phone sequences. And this is all probabilistic,
so we are not giving you
a one-best sequence. We are saying this is the likely probability distribution over phone sequences. Okay, so that is the output
from our acoustic model. Yes? >> Yeah, [INAUDIBLE]. >> Yeah. >> [INAUDIBLE]
acoustic? >> Yes, exactly,
acoustic features, and then you have probability distributions. >> How will [INAUDIBLE]? >> So the probability distribution for the, so you're familiar with HMMs. So you have observation probabilities. So your observation probabilities, either they can be Gaussian mixture models. Or your observation probabilities can be scaled posteriors from your DNNs. So it's just the probability distribution, your observation probability distribution can come from the DNN, yeah. Okay, so
this is the acoustic module, which produces phone sequences. So now, we eventually want to
get a word sequence, right, from the speech utterance. So this is just an intermediate
representation, these phones. So now how do I move
from the phones to words? So I mentioned we use these large pronunciation
dictionaries. So this is the model which provides a link between
these phone sequences and words. So here, typically just a simple
dictionary of pronunciations is maintained. So you have these large
dictionaries which say that these words correspond to
these sequences of phones. And this is the only module
in an ASR system that is not learned. It's not learned from data. So the acoustic model was
learned from training data, and we'll come to the language model
that is also learned from data. But the pronunciation model
is actually expert derived. So an expert gives
you these mappings. So I'll talk a little about some
work I did during my thesis, which was on
pronunciation models. It's kind of hinting
at how restricted this particular
representation is. So I mentioned that each word
can be represented as a sequence of phonemes. So we looked at a very
popular speech corpus called Switchboard, a subset
of Switchboard. It's annotated at a very detailed level. So not only do we have word sequences corresponding to the speech. We also have phonetic sequences. So it's also phonetically
transcribed. Meaning someone listened to all
of the utterances and wrote down the phone sequence corresponding
to what they heard. And this was obviously
done by linguists. And when I say phone, they actually listened to
how the word was pronounced. Not what the word should
have been according to some dictionary. And why I'm saying this
is because there's lot of pronunciation variation when
you actually speak, right? So, certain words, even though
the dictionary says that it should be pronounced a certain
way, because of our accents or just because you're talking fast
and so on, the word actually ends up being pronounced in
an entirely different way. So for
some data from this corpus, we actually had phonetic
transcriptions which are giving us exactly how
people pronounce those words. So one thing that really
stands out from the data, so let's look at just four words: probably, sense, everybody, and don't. And the row in blue shows the phone sequences corresponding to these words, according to a dictionary. So this is how the word should be pronounced according to an American dictionary, so these are the American pronunciations. And these were the actual
pronunciations from the data as transcribed by a linguist. So obviously, what is the first
thing that stands out here? >> There's
>> Yeah, definitely there is no one pronunciation
corresponding to the word. There are lots of
possible pronunciations, and you'll also see,
this is, you'll see lots of, like, entire syllables being
dropped out of words. >> [INAUDIBLE]
>> Yeah, speaking very fast. >> [INAUDIBLE]
>> It's wrong, but it's completely legible
from the speech, just because of the context and
so on. But they're speaking very fast,
so if you actually looked at
the phonetic transcriptions, there are entire syllables
missing, and of course these are also perfectly
legitimate pronunciations. So since, when you say the word, it almost feels like you're
inserting a tuh at the end, before the last suh sound,
sense. So there are lots of alternate
pronunciations for words, and I don't remember
the exact number. But I think the average number
of pronunciations they found for a word was of the order
of four or five, so very far from a single
pronunciation for a word. Okay, so we thought, so clearly there's a lot of
pronunciation variation, and this was from a corpus which
was of conversational speech. So it's not read speech; people are speaking spontaneously, very fast, and so on. Read speech tends to be
clearer and they enunciate, and they tend to stick more to
the dictionary pronunciations. But of course, you want to
recognize conversation speech, you're not always only going to
recognize news broadcasters. You need to recognize
spontaneous, day to day speech, so
how do we kind of try and model this pronunciation
variation? How do we computationally
model this, and so we thought, why not go to the source, so what is creating these
pronunciation variations? So it's your speech production
system, so there are various, so before going into that,
I'll show you these videos from the SPAN group at USC,
which is led by. And they do really good work on
speech production based models, to model pronunciation
variation and so on. So this is an MRI of
the vocal tract and it's synched with the audio, so. >> When it comes to singing,
I love to sing French art songs, it's probably my favorite
type of song to sing. I'm a big fan of Debussy,
I mean, I also love operas, I love singing [INAUDIBLE] and
Mozart and Strauss. But when i listen to music, I tend to listen to hard rock or
classic rock music. One of my favorite
bands is AC/DC, and my favorite song is
probably Back in Black, which I'll listen to over and
over again in my room. >> Okay, so these various
parts of the vocal tract which are moving are known
as articulators. And the articulators move in
various configurations and lead to certain speech
sounds being produced. So they did some really
cool work on actually- >> These are the movements that shape the z sounds. >> Tracking the articulators
automatically, this is really cool. So we didn't actually use
these continuous contours, but we said, let's discretize the space. So all these various articulators move in various configurations, which lead to various speech sounds being produced. So if we can discretize
the space and say that, okay, there are eight vocal tract variables, and each of these variables can take one of n values. Then different value assignments to these tract variables can lead to different sounds.
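Purely as a hypothetical illustration of such a discretized representation, here is what one sound's value assignment might look like; the variable names and values are loosely inspired by articulatory phonology's tract variables, not the exact inventory used in this work:

```python
# Hypothetical discretized vocal-tract variables for a nasal sound such as "m".
tract_variables_for_m = {
    "lip_aperture": "closed",
    "lip_protrusion": "neutral",
    "tongue_tip_location": "alveolar",
    "tongue_tip_degree": "wide",
    "tongue_body_location": "uvular",
    "tongue_body_degree": "wide",
    "velum": "open",           # the open velum is what makes the sound nasal
    "glottis": "voiced",
}
# A word then becomes several quasi-parallel streams of such values over time,
# rather than a single string of phones.
```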
And this didn't just come out of the blue, there's a lot of linguistic
analysis on speech production. And there's lots of very
well-developed linguistic theories on speech production. And we used one of them,
which is known as articulatory phonology, so we'll come
to that in a single slide. So now the word everybody, which used to be a sequence of phones, can now be represented as these streams of articulatory features. So if you have each variable which takes different values, now you will have these
quasi-overlapping streams, articulatory features, which
lead to a particular word and how it's being pronounced. And so why do I say
they are overlapping, because they're not
entirely independent, so one feature can affect how
another feature behaves. So we kind of got inspiration
from this work on this theory called articulatory phonology,
by Browman and Goldstein. But they said that this
representation of speech as just sequences of phones,
it's very, very constrained, and it's very restrictive. So let's think of speech as
being represented as multiple streams of articulatory
movements. And that actually gives you
a much more elegant framework to represent pronunciation
variation. So if I have to go back to the
previous slide where I showed you all the various
pronunciations. To try and kind of motivate how
you went from the dictionary pronunciation of probably
to one of these things, it would require kind of
deleting three phones, inserting some other phone, a huge edit distance in terms of phones. So how do you actually motivate
such a large deviation in pronunciation? It turns out that if you
represent pronunciations as these streams of features, you can explain pronunciation variation in terms of asynchrony between these feature streams. So it's just because certain features are not moving synchronously. Say that you're producing a nasal sound, and the next sound that you're producing is a vowel, but then there are certain remnants of the nasality which hold on. And so your vowel also becomes
a little nasalized, and so on. So there was an example
which I thought I may not have time to go into. So the idea is that this articulatory feature framework gives you more elegant explanations of
pronunciation variation. It's certainly more elegant, but
it's very hard to model, and we learned that the hard way. I did during my thesis. So we use this
representation and we built what in the olden days were called DBNs. Which is not Deep Belief Networks, it's Dynamic Bayesian Networks,
so it's the olden day DBN. So it's just a generalization
of Hidden Markov Model. So the Hidden Markov Model that
I described in the acoustic, when I was talking
about acoustic models. This is just a generalization
of that particular paradigm. So you have various variables
which represent each of these articulatory features. And then you represent
constraints between these variables and so on. >> [INAUDIBLE]
>> Yeah, yeah I wish I had
that slide actually. If you don't mind, can I come to
this at the end, because there's a slide which clearly shows-
>> [INAUDIBLE] >> Yes, yes. >> [INAUDIBLE]
>> Yes. So now, okay, actually we
can take that as an example. So fur. So now I'll say fur. So now I'll break the fur sound
into these eight variables and the values it takes. The values it takes to
produce the sound fur. So you're-
>> But one to one map [INAUDIBLE]
>> It's not, it's actually not
a one to one map. So we left it,
we kept it probabilistic. So it is mostly one to one, but
it's not entirely one to one. So we did allow for-
>> One to one [INAUDIBLE]
>> Yeah, yeah. >> Then wouldn't have the space to-
>> It's not a one-to-one mapping. Yeah. It's not a one-to-one mapping. But even if it was. Yeah, so actually if I show
you that example slide, I can clearly explain it, so please remind me at
the end of the talk. I would like to show that. Okay, so that was
the pronunciation model. And the final model is what's
known as the language model, which many of you might actually
be quite familiar with. So language model
is just saying, so again the pronunciation model,
the output was words. So now you've mapped a phone
sequence to a particular word, and now the language model comes
and says how should these words be ordered, according to
a particular language? So the language model
looks at lots and lots of texts in that
particular language. And it finds occurrences of
words together, and yeah, you have a question? >> [INAUDIBLE]
>> But, so now, now we are coming
to this language model. What about going from
the phoneme sequences to words? >> The pronunciation model. So, this one, right? So the phone sequence. Once I get a phone sequence,
I can start mapping chunks of the phones to valid
words using the pronunciation dictionary. >> [INAUDIBLE]
>> Yes. >> [INAUDIBLE]
>> Yes, absolutely. So the thing is, you're not getting a single
phone sequence, right? So it's probabilistic. So you have probabilities for every phone sequence appearing. And so even if it doesn't exactly match, maybe it will match with a lower probability. But then the language model also comes in, and then when you add up the probabilities, you get kind of the most
likely sequence here. Yes? >> [INAUDIBLE] pronunciation models [INAUDIBLE]
>> Yes, usually they're not, it's deterministic. You just have one sequence of,
typically you just have one sequence of phones which
corresponds to a word. But it can be probabilistic,
also. >> [INAUDIBLE]
>> Yeah, exactly, the one I was building was too probabilistic,
[LAUGH] too many probabilities. Okay, so here of course, so
if you saw the word context, the dog. Obviously the most likely next
word to follow this particular word context is ran, maybe even
can, but definitely not pan. So pan would have a very, very low probability
of following the dog. And the language model
is also coming to, actually related
to your question. The language model is very
crucial because it can be used to disambiguate between
similar acoustics. So say that our utterance is, is the baby crying. It could also very well map to
this particular word sequence, but obviously the first word
sequence is much more likely. Because if you look at large
volumes of English text, is the baby crying is probably
a much more likely word ordering than is the bay bee crying? And then let us pray and
lettuce spray. So if you have identical
acoustic sequences, your language model has to kind
of come in and do its job, then. Okay, so I just wanted to put
this here if you wanted to use language models in your work. So SRILM, so actually Alan also
mentioned about SRI in his talk. So they've put out this toolkit, which is extensively
used in many communities. So it's known as
the SRILM toolkit. It has lots of in-built
functionalities implemented, so this is a good
tool kit to use. Another tool kit which is
getting quite popular now days is KenLM Toolkit, which handles
large volumes of text very, very efficiently. So the data structures which
are used to implement this toolkit are much
more sophisticated. So this is much faster, KenLM,
but probably only need to use this if you're dealing with
very large volumes of data. And there's also this
OpenGrm NGram Library. So if you like finite
state machines, if you like working with
finite state machines, you want to represent everything
as a finite state machine. Then this is the toolkit for
you, so OpenGrm NGram,
it was developed by Google. Okay, so language models,
like I mentioned, it has many applications. So speech recognition
is just one of them. Machine translation is another
application where language models are heavily used. Handwriting recognition, optical
character recognition, all of these also would use language
models on either letters or characters. Spelling correction, again, language models are useful
here because you can have language models over
your character space. Summarization, dialog
generation, information retrieval,
the list is really long. So language models are used in
a large number of applications. So I just want to mention this
one point about language models. So we mentioned that you
look at these word contexts. And you look at counts of these
words, and these word contexts over large text corpora
in a particular language. How often does this
particular set of, how often do these particular
set of words appear? And then you compute
some relative counts. So you see, okay, these
chunks appear so often, and these are the total
number of chunks. And so
you get some relative counts. And it'll give you
some probability of how often you can expect this
particular chunk to appear. So just to kind of slightly
formalize that, so this very, very popular language model
which is used are these NGram language models. So the idea is really
straightforward. So you just look at co-occuring,
either two words, or three words, or four words. So if your n is two,
you're looking at bigrams. If n is three,
you're looking at trigrams. n is four, four-grams and so on. And Alan mentioned yesterday
the five-gram model. If you're already
looking at five-grams, you can pretty much reconstruct
English sentences really well. But of course then you're
running into really, really large number of NGrams, as you
increase the order of the NGram. So here I'm looking
at a four-gram, so the four-gram is
she taught a class. So what is the probability of
this particular four-gram? That is the word class follows, this particular word
context she taught a. So you look at counts of she taught a class in large volumes of English text. And then you normalize it with the count of she taught a, which is the word context. So how often does class come after this particular word context?
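To make the counting concrete, here is a bare-bones sketch using a tiny toy corpus; the probability it produces is a raw maximum-likelihood estimate, with no smoothing yet:

```python
# Estimating a 4-gram probability from raw counts: P(w4 | w1 w2 w3).
from collections import Counter

corpus = "she taught a class yesterday and she taught a seminar today".split()

four_grams = Counter(zip(corpus, corpus[1:], corpus[2:], corpus[3:]))
tri_grams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_next(w1, w2, w3, w4):
    """count(w1 w2 w3 w4) / count(w1 w2 w3)."""
    context_count = tri_grams[(w1, w2, w3)]
    return four_grams[(w1, w2, w3, w4)] / context_count if context_count else 0.0

print(p_next("she", "taught", "a", "class"))   # 0.5 in this toy corpus
```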
So what is the obvious limitation here? >> [INAUDIBLE]
>> Yeah, exactly, so we'll never see enough data. We're always going
to run into NGrams, which we're not going to
see in the text corpus. And this actually
happens far more frequently than one would even expect. Even if you have really, really
large databases of Ngrams, you're going to run
into this issue. So just to make sure
that this is true, I went into this Google Books. So Google Books has accumulated
lots and lots of Ngrams from all the books which
are available on Google. It is in English. So you can actually
plot how Ngrams have appeared in books over
some particular time frame. So you can go and play around with this if
you've not seen this before. I just typed in this
particular four-gram, which hopefully is not very relevant to this crowd. So feeling sleepy right now. And there weren't any valid Ngrams at all. And this is not a very, very rare four-gram, right? And even feeling sleepy, right? None of them appear in text. So this is a problem which
occurs actually very, very frequently. So, even when you work with
this counts from very, very large text corpora. You're always inevitably going
to run into this issue, which is you're gonna have these unseen Ngrams, which never appear in
your training data. And why is this an issue? Because during test time, when
you're trying to reorder words according to your
particular language model. And if any of these unseen
Ngrams appear in your test sentence, then the sentence is going to
be assigned a probability of 0. Because it has no idea how to
deal with this unseen Ngram. So there is this problem with
what are known as un-smoothed Ngram estimates. And I wanted to make
it a point to actually talk about this because Ngrams
are only useful with smoothing. So these unsmoothed Ngram
estimates, like I mentioned, you will always run into
these unseen Ngrams, and then what do you do? So there are a horde
of what are known as these smoothing techniques. So you're gonna reserve some probability mass from the seen Ngrams towards the unseen Ngrams. And then there are questions like how do you distribute that probability mass across the unseen Ngrams? And there are various techniques for that as well, like how do you distribute that remaining probability mass.
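As the simplest possible illustration of reserving probability mass, here is an add-one (Laplace) smoothing sketch for bigrams; real systems use better schemes such as Kneser-Ney, which is part of what the Chen and Goodman study compares:

```python
# Add-one smoothing: every in-vocabulary continuation gets a pseudo-count of 1,
# so unseen bigrams still receive a small nonzero probability.
from collections import Counter

corpus = "the dog ran and the dog can run".split()
vocab = set(corpus)

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram_smoothed(w1, w2):
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

print(p_bigram_smoothed("dog", "ran"))  # seen bigram
print(p_bigram_smoothed("dog", "run"))  # unseen bigram, but not zero
```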
So there is a lot of work on smoothing methods. And it's very useful
to make Ngram models, to make them effective. So for anyone who is interested,
I would highly recommend reading this 1998 paper by Chen and
Goodman. Goodman was at MSR,
I don't know where he is now. So this is an empirical study
of smoothing techniques for LMs. I highly recommend this. It's kind of long but
it really gives you a very deep understanding of
how smoothing techniques help. Don't be fooled by the 1998,
it's still very relevant today because Ngrams are very
relevant even today. So Ngrams are not
going anywhere. So I'm not talking about what
the latest language models are. But these days in speech
recognition systems, we move towards these, what
are known as recurrent neural network based language models. So that's neural network based,
but I believe it's still not folded
into a lot of production systems because it's not very fast. So many of the production level
ASR systems probably still use Ngrams. And then do a rescoring using
recurrent neural network language models but, so
Ngrams models are still very, very much in the picture. Okay, so we've already covered
each of these individual components. But there's this big component
in the middle right, which is the decoder, that's actually
a very important component. So I have all of these parts of the ASR system which are giving
me various estimates of what is the most
likely phoneme sequence. What is the most likely
word sequence and so on. But finally I just want to get
the most likely word sequence corresponding to the speech
utterance, and so then it's a search problem. So I have these various
components, and now I need to search,
putting all of them together, I need to search through
this entire space. So just looking at the very
simple example we started with. This is what a naive search
graph would look like. So you start at a particular
point and say that you only expect it to be nine or
one, just these two words. Then you need to
transition to nine. So here, I haven't drawn a weight on every single arc, but these are all weighted, cuz they all come with their associated probabilities. So from start, you can transition into either
producing the word nine or one. But each nine is a sequence of
phonemes, and each phoneme corresponds to its own HMM, which has its own probabilities. So you can see this is already quite a large graph just for these two words. And to get at least a half-decent system, we'd be looking at at
least 20,000 or 40,000 words. So you can imagine how much
the search graph blows up. So these are really
large search graphs, and I think I have another slide,
yeah. So if you have, say, a network
of words as follows, so the birds are walking,
the boy is walking. This is really simple where
there's not even a real language model, this is highly constrained. So now each of these is going to map to its corresponding phone sequence, so the, the birds, and so on. And each of those phones is going to correspond to its underlying HMM states, and very quickly,
the graph blows up. So if you look at, so
just to give you an estimate, a vocabulary size of around
40,000 gives you search graphs of the order of tens
of millions of states. So these are really
large graphs, and so now we need to search through
these graphs and throw out what is the most likely word sequence
to correspond to the speech. So you might be wondering, can
you do an exact search through this very, very large graph? And the answer is no, you cannot
do an exact search through this graph, because it's
just too large. So you have to resort to
approximate search techniques, and there are a bunch of them,
which do a fairly good job. So none of these speech systems that you work with are actually doing an exact search through this graph.
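To give a flavour of such approximate search, here is a toy beam-search sketch over a tiny invented graph of hypotheses; real decoders search HMM-state-level graphs with millions of states and fold in acoustic and language model scores:

```python
# Toy beam search: keep only the best few partial hypotheses at each step.
import math

# graph[state] = list of (next_state, log_probability) arcs; scores are made up
graph = {
    "<s>": [("nine", math.log(0.6)), ("one", math.log(0.4))],
    "nine": [("</s>", math.log(1.0))],
    "one": [("</s>", math.log(1.0))],
}

def beam_search(start, end, beam_width=2):
    beam = [(0.0, [start])]                      # (total log-prob, path so far)
    while not all(path[-1] == end for _, path in beam):
        candidates = []
        for score, path in beam:
            if path[-1] == end:
                candidates.append((score, path))
                continue
            for nxt, logp in graph[path[-1]]:
                candidates.append((score + logp, path + [nxt]))
        beam = sorted(candidates, reverse=True)[:beam_width]   # prune
    return beam[0]

print(beam_search("<s>", "</s>"))   # best-scoring path through the toy graph
```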
So that's the decoder, so any questions so far? So this is the entire kind
of pipeline of how an ASR system works. Okay, so
everyone is with me, right? So I want to kind of end
with this new direction, which is kind of becoming
very hot nowadays. They are known as these
end-to-end ASR systems, so I showed you all of these
different components which, put together,
make an ASR system. But lots of people are
interested in kind of doing away with all of those components. Let's not worry about how a word
splits into its corresponding phone sequence. Let's just directly learn
a mapping from acoustic features to letters. So this goes straight to characters,
so directly go from speech vectors,
so these acoustic vectors, to a character sequence. And then you can have character
language models which re-score the character sequence,
and so on. So one kind of nice
advantage of this is that, because you're getting rid
of the pronunciation model, which is that you're not now
looking at phones at all, you don't need that mapping. The word to phone mapping,
which typically is written down by experts, and
that changes for each language. So now you want to build a new
system for a new language. If this worked really well, then
all you'd need is speech and the corresponding text. But the catch is, you need
lots and lots of these for this to work, for these kinds of
end to end systems to work well. So just for people in
the crowd who are interested in these kinds of models, I'll just put down a few
references, which you can read. So the first is this paper,
which came out in 2014 and kind of started off this thread of work, which is this end-to-end speech recognition
with recurrent neural networks. So I won't go into details at
all about the model, this is just for you to jot down if you
want to go later and read it up. But I'll put this up, which is
kind of the sample character-level transcripts which they get
out of their end-to-end systems. So here they have a bunch of
target transcriptions, and the output transcriptions. So you can immediately see, so this is without any dictionary,
without any language model. So this is directly mapping
acoustic vectors to letters, characters. So you can see obvious
issues like lexical errors, you can see things where you
have phonetic similarity, so shingle becomes single. Then there are words
like Dukakis and Milan, which are apparently not
appearing in the vocabulary, so that is another advantage
of these character models. So in principle, you don't care
about whether you will see this word in your vocabulary, because you are only predicting
one character at a time. So it should recover
vocabulary words, but this system doesn't
actually do that too well. >> [INAUDIBLE]
>> It does, yeah, so they just had this without a dictionary,
without a language model. But their final numbers are all
with a language model, and a dictionary also, actually. So the second improvement of
this paper was by Maas et al, who again explored a very, very similar structure as in
this previous paper in 2014. And they had this kind
of interesting analysis, which I wanted to show. So on the x-axis you have time. And each of these graphs
correspond to various phones. So remember in their system
there are no phones at all. But they just accumulated a bunch of speech samples which correspond to each of these phones, and averaged the character probabilities corresponding to those particular phones. So here you can see that K obviously comes out, but so does C; the letter C also corresponds to the /k/ sound. And interestingly, for SH, so this is the phone /sh/, S and H definitely come out. But so does T-I, because of TION, T-I-O-N, which you would pronounce as "sh", right? So that actually comes out of the data, which is pretty cool.
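A rough sketch of that averaging step, assuming you already have per-frame character probabilities from the character model and a separately obtained per-frame phone alignment (both hypothetical inputs here):

```python
import numpy as np
from collections import defaultdict

def average_char_probs_by_phone(char_probs, phone_alignment):
    """Average per-frame character probabilities, grouped by aligned phone.

    `char_probs` is a (num_frames, num_characters) array from the character
    model; `phone_alignment` is a length-num_frames list of phone labels
    obtained separately (e.g. from a forced alignment). Returns a dict
    mapping each phone to the mean character distribution over its frames.
    """
    frames_by_phone = defaultdict(list)
    for probs, phone in zip(char_probs, phone_alignment):
        frames_by_phone[phone].append(probs)
    return {phone: np.mean(rows, axis=0) for phone, rows in frames_by_phone.items()}
```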
So yeah, this was a nice analysis; they do only slightly better than the previous paper. And, yeah? >> [INAUDIBLE]
>> So the X-axis is time in frames. You can think of it
in speech frames. And these are just
the character probabilities, yeah. So the last system, which came out in 2016 and significantly improved over these two, uses this very popular paradigm in sequence-to-sequence modeling, known as encoder-decoder networks or sequence-to-sequence networks, which was first used for machine translation. They applied it to this particular problem and also included what is known as attention. And all of these bells and whistles together definitely make a difference.
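As a bare-bones illustration of the attention idea, here is a single dot-product attention step in numpy; real systems add learned projections, recurrent decoders, and often location-aware attention on top of this.

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """One step of (unscaled) dot-product attention, as a minimal sketch.

    `decoder_state` is a (d,) vector, `encoder_states` is a (T, d) matrix of
    encoded acoustic frames. Returns the attention weights over the T frames
    and the context vector the decoder would condition on when predicting
    the next character.
    """
    scores = encoder_states @ decoder_state          # (T,) similarity scores
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over frames
    context = weights @ encoder_states               # (d,) weighted summary
    return weights, context
```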
But I want to mention that end-to-end systems are not yet close to the entire standard
pipeline that I showed you earlier. So, people would really like
to bridge the gap between End-to-End systems and
these whole pipelines. Because clearly these are, at least, easier to understand in some sense, at least from a modeling standpoint, although it's not easier to understand what the model is doing. Yeah, so there's a lot of work going on in this particular area. But these systems require lots of data, lots and lots of data, to train. And that's because not only are you trying to learn what the underlying speech sounds in the speech utterances are, you're also trying to learn spelling. You're trying to figure out what spelling makes sense for a particular word. And clearly for a language like English, where the orthography is so irregular, that's a hard problem. And so these models require large
amounts of data to work well. Okay, so I'm gonna come back to
this question I posed initially, which is what's next? So what are all the kinds of
problems that we could work on, if anyone was interested
in speech recognition? So there are lots of, I think
there are lots of next steps. So one is you need to do more to
make ASR systems robust to variations in age and, of course, accent, which is why we are working on that problem. And also, this is another thing which people are interested in: speaking ability. So there are people, say, with speech impairments or other issues, and they are not able to speak as clearly as maybe all of us in the room. So how can we adapt ASR systems
to work well with those people? And this is a real, very challenging task: how do you handle noisy, real-life settings with many speakers? This goes back to Allen's dream of having a bot which is sitting in a meeting, transcribing and figuring out what is going on. So that would also involve the underlying ASR system in that bot. It would have to figure out that, okay, these are all the interfering speakers, this is the main speaker, this is the speaker I need to transcribe; I need to filter out the other interfering speakers, and so on. And the state of the art for this kind of meeting-speech task is not great; the error rates here are not very low. This one is actually handled pretty well now, if you have lots and lots of labeled speech; the pronunciation variability actually gets captured in the acoustic model itself. But for handling new languages, currently the only way to do a good job is to go and collect lots and lots of data, which, at least personally to me, is unsatisfying. So it seems like, if you have existing models, you should be able to adapt them without huge amounts of labeled speech, at least if the languages are
somewhat related. We should be able to do a half
decent job, by taking existing models and adapting them to
the new language that we want to recognize, or the new
dialect we want to recognize. So there are these problems. So in computer science, we
are always trying to do things faster, and
to be more efficient, right? Both computationally, doing things faster from a computational-power standpoint, but we should also try to be
resource efficient, right? We don't want to keep going and
collecting more and more data, every time we come
up with a new task. So can we do many of
these tasks with less? This is something that I am very
interested in personally, so can we reduce duplicated effort
across domains and languages? And also can we
reduce dependence on language specific resources? And this is of course
the holy grail I think, training with less labeled data. And actually making use
of unlabeled data better. Okay, so I'll also show
this one direction, which Microsoft is working on,
and it's kind of very promising. So this is just
an excerpt from an ad. So this is Skype. >> Can you understand me now? [MUSIC] >> [FOREIGN]
>> [FOREIGN] >> You speak Chinese. >> Now, if that worked
as seamlessly as it worked here, that would be pretty cool. So I'm told this was just set up for the ad. So Microsoft has been working
a lot on speech-to-speech translation. And I think this is a very
interesting problem. Because there can be cues in
speech, which help disambiguate utterances for the machine
translation part, and so on. So I think there is something
which can be leveraged from the speech component,
from the ASR component. So this is something that we
talked about a little bit, which was using speech
production models, and how we can build speech
production inspired models, to handle pronunciation
variability. And that actually in principle, does reduce dependence on
language-specific resources. Because all of us have the same
vocal tract system, right? So there are only so many ways in which our different articulators can form
different configurations and produce sounds. So in some sense,
at least in principle, moving to that kind of a model
does reduce dependence on language-specific resources. So we don't need to come up with
phone sets corresponding to a particular language, if you're
going to represent all of the pronunciations in terms of
these articulatory features.
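As a toy illustration of what "representing pronunciations in terms of articulatory features" could look like, here is a small Python table; the feature inventory and the example phones are simplified assumptions, not a standard feature set.

```python
# Illustrative only: a tiny articulatory feature table. Real systems use richer
# feature sets, but the point is that these dimensions (voicing, place, manner)
# are shared across languages, unlike language-specific phone sets.
ARTICULATORY_FEATURES = {
    #     (voiced, place,       manner)
    "p": (False, "bilabial",  "stop"),
    "b": (True,  "bilabial",  "stop"),
    "s": (False, "alveolar",  "fricative"),
    "z": (True,  "alveolar",  "fricative"),
    "m": (True,  "bilabial",  "nasal"),
}

def features_for_pronunciation(phones):
    """Map a phone sequence to its articulatory-feature representation."""
    return [ARTICULATORY_FEATURES[p] for p in phones]

print(features_for_pronunciation(["s", "p", "z"]))
```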
But there are other problems with that method. And this is another problem which I think is
very interesting. So how do you handle
new languages, and not have to collect loads and
loads of data? So just to tell you
how many languages have ASR support so far. This is actually a year or
two old, maybe this number has
gone up a little. So they support roughly
around 80 languages. But these "languages" include Indian English, Australian English, and British English, which are clearly not separate languages. So that number is even less than 80. And if you look at the
distribution across continents, Europe has the highest
representation in terms of languages which are supported
by speech technologies. America is of course small, also because they're
largely monolingual. But Asia is dismal,
even though there are so many languages spoken in
the Asian subcontinent. So yeah,
we should all do more to build speech recognition technologies,
or language technologies, for various Indian languages and
languages in Asia. And so one thing that we have
looked at, is can we try and crowdsource the labels for
speech? So can we just play speech utterances to crowds who speak the particular language, and then try to get transcriptions from them? It will be a little noisy, but then there are techniques to handle the noise in those crowdsourced transcriptions. But that also has an issue, because it's somewhat unfair to
a large number of languages. So this is just a histogram,
this was of all the workers who were sampled from a large crowdsourcing platform, just MTurk,
Amazon's Mechanical Turk. And this looked at the language
demographic of crowd workers on Mechanical Turk. And the yellow bars are actually
the speakers of those languages in the world. So you can see there's a large distribution mismatch between the language background of the crowdworkers and the language expertise which is needed to complete transcription tasks. I mean, this tail is really, really long, so forget about minority languages: for languages in that tail it's very, very hard to get native speakers on crowdsourcing platforms. So this also may not really
be a viable solution always. So I think there are lots of interesting problems to
think of in that space. So with that, I'm going to stop. I'll kind of leave
you with this slide. Yeah, I think I'm
doing good on time. So thanks a lot. I'm happy to take
more questions. >> [APPLAUSE]
>> Yes. >> [INAUDIBLE] >> Yeah, so with language models, you can back off all the way to a unigram model. So as long as each of the individual words has been seen somewhere in the language model, and if your acoustic model is good, it's going to give you a somewhat reasonable phone sequence corresponding to the underlying speech. You might still recover the word sequence even though the language model doesn't give you too many constraints. So, for example, for the Sarah Palin speech, I don't think the language model was anything more than maybe a bigram model [LAUGH] or maybe a trigram model at the most. So as long as the individual words have been seen in text, in large volumes of text, and your acoustic model is good, you can still recover the word sequence even if there's no continuity between the words.
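As a toy illustration of that back-off behaviour (deliberately ignoring the discounting and smoothing that real language models use):

```python
from collections import Counter

def train_counts(sentences):
    """Collect unigram and bigram counts from tokenized training text."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def backoff_prob(word, prev, unigrams, bigrams, backoff_weight=0.4):
    """Toy back-off: use the bigram estimate if seen, else a scaled unigram.

    Real language models use proper smoothing (e.g. Katz or Kneser-Ney);
    this only illustrates the fall-back behaviour described above.
    """
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return backoff_weight * unigrams[word] / total

sents = [["the", "birds", "are", "walking"], ["the", "boy", "is", "walking"]]
uni, bi = train_counts(sents)
print(backoff_prob("walking", "boy", uni, bi))  # unseen bigram -> backed-off unigram
```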
Does that answer your question? Any other? Okay, yes. >> Based on your working thesis, do you ever feel you need to have more
because the [INAUDIBLE] was obviously [INAUDIBLE]
find some [INAUDIBLE] more than [INAUDIBLE]. >> But it had to have
more than 30 phones? Actually, so
40 is the number of phonemes. Yeah, so the number of phones
in English is more than 40, so even in those
phonetic transcriptions, the number of phones
were almost close to 80. Because it's actually annotating
all the fine-grained variations. And that, of course, helps if you have that kind of annotation. But where are you ever going to get that level of phonetic annotation? Yeah. Yeah. >> [INAUDIBLE]
next feature where character [INAUDIBLE]? >> You need lots of data; other than needing data, you also need lots of computational resources. But other than that, like I said, it doesn't really
work as well as the entire pipeline yet. So there's still a delta in
terms of the performance of your state of the art systems and
these end to end systems. And so currently, all these end-to-end systems
are recurrent neural networks. So there is this issue of how
much context to retain, and whether you retain that
context effectively, which is where these attention mechanisms come in; but attention mechanisms also really fall short. So if you're interested, there is an ICLR paper this year whose title is, I think, something like Frustratingly Short Context or something like that; you can search for it. So the idea is that even if
you just look at the last five output representations,
you can do as well as a really sophisticated
attention mechanism. So attention mechanisms also
need to be kind of improved further. Yeah? >> [INAUDIBLE] systems work for
[INAUDIBLE]. >> Yeah, so the [INAUDIBLE]
system is actually predicting characters. So it predicts a single
letter of the alphabet. So auto vocabulary is not
an issue at all cuz it's predicting one
character at a time. >> [INAUDIBLE] data for
Indian language. >> Yeah, so
that's actually something that's very interesting. So for Indian languages which
are morphologically rich and where you probably cannot
expect to see all the various forms of a word in the vocabulary, end-to-end might actually work really well. But no one has run this yet, because that amount of data
is not available here. >> In fact they didn't call
that particular speech for example in English the classes
>> [INAUDIBLE] [CROSSTALK]
>> Yeah, but that's a good mob- >> English, right? Characters don't [INAUDIBLE]
>> No, not at all. So the entire system actually
needs to learn all of these things: it needs to learn the sound mapping, and it needs to learn spelling. So yeah, because the mapping is so irregular, right. So what is the point
you are saying? Sorry. >> So for example. >> Yeah. >> And this part, but
if I take for example, I need to have a class for and
>> [INAUDIBLE] So you can have, you have [INAUDIBLE]. >> Yeah, you don't need a. >> [INAUDIBLE]
>> Yeah, it's up here, yeah. >> [INAUDIBLE]
>> No, you might have double of 36, because you split it: with Unicode, you would have the composed character, which is the consonant plus the vowel sign. So you would predict the consonant, and then you predict the vowel sign, and so on.
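To illustrate the kind of splitting being discussed, a small Python example: a Devanagari syllable such as kī is already stored as two Unicode code points, a consonant plus a dependent vowel sign, so a character-level model can predict those code points one at a time.

```python
import unicodedata

# The syllable is stored as two code points: the consonant KA plus the
# dependent vowel sign II. A character-level model can emit these one by one.
syllable = "\u0915\u0940"
for ch in syllable:
    print(ch, unicodedata.name(ch))
# prints: क DEVANAGARI LETTER KA
#         ी DEVANAGARI VOWEL SIGN II
```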
>> Might it be enabled for all the single- >> No, you're talking about... So you just predict each of these. >> S-E-E [INAUDIBLE], so you [INAUDIBLE] then E, then E. >> Yeah.
>> [INAUDIBLE] >> Yeah, yeah, that's [INAUDIBLE] but
the label size, of course the label space becomes larger,
but probably [INAUDIBLE] double. I don't think it'll
be more than that. Which if you have enough
data should be okay. >> Quick fix.
>> I think yes, because the mapping
is much more stable. It actually might do even
better than in English. Yes? >> What do you think is the minimum amount of data you would need, if you wanted to [INAUDIBLE]? >> Yes, so I ask this too, in terms of [INAUDIBLE] and so on. So they use like 10,000 hours of speech, [LAUGH] all of you must be using speech [INAUDIBLE]. So I know the standard pipeline. So for Switchboard,
for instance, so switchboard is around
200 hours of speech. And the error rates now are 5%,
the latest was 5%. >> So
of course with a lot of machinery. So- >> [INAUDIBLE]. >> Yeah, end-to-end-
>> What do you think the bare minimum would be if we
really wanted to [INAUDIBLE]. >> To try it out. So the other papers
that I showed you they actually work with [INAUDIBLE]
which is 200 hours of speech. So even if you're in the 100-hours range, and more, I think you can start training end-to-end systems. But again, 10 years, 20 years... But I would still be
interested to see experiments on Indian languages with even
smaller amounts of data. >> Some people are even doing it on fable languages >> Right, right
>> 20 years, >> Correct >> So it's sort of beginning to work, it really has to be
>> Appropriately similar data lectures from [INAUDIBLE],
then maybe. >> Yes [LAUGH] That's true,
that's true. >> [CROSSTALK]
>> Of course, yeah. >> But [INAUDIBLE]
good transcriptions, but don't underestimate that
[INAUDIBLE] is [CROSSTALK] >> Yeah, I think, yeah, that's a very
>> I think that would be very good, just hours of Hindi speech by itself. You have a question? >> Yeah.
Is there any evidence of these working for another language
with this complicated >> No one has that yet. >> Not this language, any
>> No so I'll not from there maybe
be able and to have some. >> The only thing about
that it's funny and it's working this and
that it's not clear. But the argument of doing this
for line is with my apology and probably the most daily token >> [INAUDIBLE] and there are people claiming
that it's not a method. It's not clearly [INAUDIBLE],
okay? But this is the hot research
topic that people like to do. And ultimately, it would be easier because
enunciation [INAUDIBLE] is hard. >> Yes, yes. [LAUGH]
>> So if you can sort of get your way around that and find
people that actually said it >> But, don't underestimate, people keep saying there's
26 letters in English. No, there aren't just 26, because there are numbers and symbols and other things that you have to address. >> Mm-hm. All right. Thanks so much. >> [APPLAUSE]