So, hi everyone, I'm Preethi, I'm in the department of
Computer Science at IIT Bombay. And so today, my tutorial is
going to be about ASR systems. So I'm just going to give a very
kind of high level overview of how standard ASR
systems work and what are some of the challenges. And hopefully leave you
with some open problems and things to think about. For the uninitiated, what is
an ASR system, it's one which accurately translates spoken
utterances into text. So text can be either
in terms of words or it can be a word sequence, or
it can be in terms of syllables, or it can be any sub-word units
or phones, or even characters. But you're translating speech
into its corresponding text form. So lots of well-known examples,
and most of you must have encountered at least
one of these examples. So YouTube's closed captioning,
so an ASR engine is running, and producing the corresponding
transcripts for the speech, the audio, and the video clips. Then voice mail transcription, even if you've not used it,
you might have looked at it and laughed at it, because it's
usually typically very bad. So the voice mail transcription
is also an ASR engine which is running. Dictation systems were actually
one of the older prototypes of ASR systems. And I think now, it's obviously
gotten much better, but I remember Windows used to come
prepackaged with a dictation system, and
that used to be pretty good. So dictation systems, of course,
you're speaking out, and then you automatically get
the corresponding transcript. Siri, Cortana, Google Voice, so all of their front ends
are ASR engines and so on. This is ASR, but if, I didn't
get a picture of Cortana, I apologize, so this is Siri. But if you were to say,
call me a taxi, and Siri responded from now on,
I'll call you Taxi, okay? This is not the fault
of an ASR system, so the ASR system did its job. But there's also a spoken
language understanding module to which the ASR will feed into. And so, that didn't do
its job very well, and so it got the semantics wrong. So people typically tend to
correlate the understanding and the transcription part as ASR. But ASR is strictly just
translating the spoken utterances into text. So why is ASR desirable, and why would you want to build ASR
systems for maybe all languages? So obviously, speech is the most
natural form of communication. So rather than typing,
which is much more cumbersome, you can speak to your devices. And if ASR systems were good,
then that kind of solves lots of issues, and
also keeps your hands free, which is not always
a good thing. So many car companies like
Toyota and Honda are investing quite a bit to build good
speech recognition systems, because they want you to
be able to drive and talk. I don't know if I entirely
recommend it, but clearly, it leaves other modalities
open to do other things. Also, another very kind of
socially desirable aspect of building ASR systems is that
now you have interfaces for technology which can be
used by both literate and illiterate users. So even users who cannot read or
write in a particular language can interact with technology,
if it is voice driven. And this was, so endangered languages was a point which was brought up, that lots of
languages are currently close to extinction, or they've been
given this endangered status. So if you have technologies
which are built for such languages, it can contribute towards the
preservation of such languages. So that's just one kind of nice
point of why you would want to invest in building ASR systems,
so why is ASR a difficult problem? So, the origins of ASR
actually go way back, it's a longstanding problem
in AI, back in the 50's. So why is it such a difficult
problem, when clearly it's not very difficult for
humans to do speech recognition, if you're familiar
with the language? It's difficult because there are
several sources of variability, so one is just your
style of speech. So apparently, the way I'm
speaking, it's kind of semi spontaneous, but I more or less
know what I'm going to say, or here I'm guided by
the points on the slides. And it's certainly continuous speech, so I'm not speaking in any kind of staccato manner, or
I'm not saying isolated words. So just the style of speech can have a lot to do with how
ASR systems will perform. And intuitively, isolated
words are much easier for ASR systems than
continuous speech. And that's because when you're
speaking spontaneously, words are flowing freely
into one another. And there are lots of
variations which come in due to pronunciations, and there's this Phenomenon of
what's known as coarticulation. So it's just that words, your preceding word affects the
word which is coming, and so on. And so, because of this kinda smooth characteristic of continuous speech, it's quite challenging for ASR systems to handle it. Of course nowadays
if you have lots and lots of data, this is not
really a problem, but this is one prominent
source of variability. Another of course
is environment. So if you're speaking in
very noisy conditions, or if your room acoustics are very
challenging, for example, it's very reverberant, so
you have a lot of echo. Or you're speaking in
the presence of interfering speakers, so
that's actually a real killer. So if you're talking and
there's background noise, but the background noise is not,
like the vehicle noise or something which can be
easily isolated, but there's actually people
talking in the background, that's really hard for
ASR systems to kind of pick out the voice in the foreground and
work on it. So environment is another
important source of variability and something
which people are very actively working on now to build
what are known as robust ASR systems which
are robust to noise. Then of course speaker
characteristics. So all of us have different
ways of speaking, so of course, accent comes into play. Just your rate of speech,
some people speak faster, some people enunciate more. Of course your age also changes the characteristics of your speech. So child speech is going to be very different from adult speech. So there are various characteristics of speakers which also contribute to making this a challenging problem. And of course there are lots of
tasks with specific constraints. For example the number of
words in your vocabulary which you're trying
to recognize. So if you're looking at
Voice Search Applications, they're looking at million
word vocabularies. But if you're looking
at a Command and Control task where you're only
trying to move something, or your grammar is
very constrained, there your vocabulary is much
smaller so that task is simpler. And you might also have
language constraints, so maybe you're trying to recognize
speech in a language which doesn't have a written form. So it doesn't have transcripts at all; then what do you do? Or if you're working with a language which is very morphologically rich, or it has words which have agglutinative properties, then what do you do with your language models? Maybe n-grams are not good enough. So lots of task-specific constraints also
contribute to making this a challenging problem. Okay, so hopefully I
have convinced you that ASR is quite a challenging
problem to work on. So let's kind of just go
through the history of ASR, and this is just a sampling. I'm not going to cover most
of the important systems. There has been a lot of work
since the 50s, this is actually the 20s, but I'll talk
a little about what this is. So the very first kind of ASR,
and I'm air quoting, because it really isn't doing
any recognition of any sort, but it's this charming
prototype called Radio Rex. So it's this tiny dog which is
sitting inside its kennel and it's controlled by a spring, which in turn is controlled
by an electromagnet. And the electromagnet is
sensitive to energies at frequencies around 500 Hertz. And 500 Hertz happens to be
the frequency of the vowel sound e in Rex. So when someone says Rex, the
dog will jump out of the kennel. So this is purely
a frequency detector, so it's not doing any recognition,
but it's a very charming prototype,
but clearly, can anyone kind of, there are so many issues with this, but with this kind of system you're hard coding it to fire at this 500 Hertz. What is an obvious problem
with such a prototype? >> Noise. >> Noise, yeah, of course. >> Different people,
it might not [CROSSTALK] >> Yeah, exactly. So this only works for
adult men. So it's sexist, and it's ageist. It doesn't work on
children's speech, it doesn't work
on female speech. So yeah, that's obviously
an issue when you hard code something. I recently discovered this
is on eBay, and you can- >> [INAUDIBLE]. >> Yeah. >> I wanted one of these. >> Really?
Okay, yeah, yeah, yeah. >> There was one on sale for,
and so I started, then actually created my
account to bid for this thing. And then I noticed the person
who kept bidding on top of me was someone who's
called Ina Kamensky. >> No [LAUGH].
>> He's a professor at James U, because he wanted one as well. >> Okay, so he got it? >> I stopped bidding and asked
him for the video if he gets it. >> [LAUGH]
>> So he did buy it? >> [INAUDIBLE] Giving you
the video, but he did win. >> Wow, okay. Okay, so
I can add that to my slide x. [INAUDIBLE]
>> He has it, yeah. Okay, so that's the very
initial prototype, but that's single word, and
it's a frequency detector. So that's not really
doing recognition. So the next kind of major system
was SHOEBOX, which was in 1962, and it was by IBM. And they actually demoed
the system, it did pretty well. But what it was recognizing
was just connected strings of digits. So it's just purely
a digit recognizer and a few arithmetic operations. So it could do basic arithmetic, you could say 6+5 is, and so on. So it would perform very well, but of course, this is also very limited. So it's just a total of 16
words, so ten digits and six operations, and it's doing
isolated word recognition. So actually, sorry,
this is not connected speech. You would have to say it with a
lot of pause in between each of the individual words. So this was just doing
isolated word recognition. And then in the 70s, there was
kind of a lot of interest in developing speech recognition
systems and AI based systems. And ARPA, which is this
big agency in the US, funded this $3 million
project in 1975, and three teams worked on
this particular project. And the goal was to build
a fairly advanced speech recognition system, which is not
just doing some isolated word recognition and would actually
evaluate continuous speech. And so, the winning system,
from this particular project, was HARPY out of CMU. And HARPY actually
was recognizing connected speech from
a 1,000-word vocabulary. So we are slowly making
progress, but it still didn't use statistical models, which
is kind of the current setting. And this was in
the 1980s, kind of pioneered by Fred Jelinek at IBM
and others around the same time. Statistical models became very, very popular to be used
in speech recognition. And the entire problem was
formulated as a noisy channel. And one of the main machine
learning paradigms, which was used for this particular problem
were hidden Markov models. So I'll come to this not too
much in detail, but at least, at the high level, I'll refer
to those in coming slides. So, the statistical
models were able to generalize much better
than the previous models because the previous models
are all kind of rule-based. And now, we are moving
into 10K vocabulary sizes. So, the vocabulary size
is getting larger, and these are now kind of falling
in what are known as large vocabulary continuous
speech recognition systems. Although of course, now, large
vocabulary is much larger than 10K, but that was in the 80s. And we were in this plateau
phase for a long time. And in 2006, deep neural
networks kind of came to the forefront, and
now all the state-of-the-art systems are powered by
deep neural networks. So any of these systems you
might have used, Cortana, Siri, Voice Search,
at the backend, they're powered by deep
neural network based models. So okay, so this is just a video, which
was actually quite impressive. And this happens to be by the
head of Microsoft Research, Rick Rashid. And this was a really
impressive video. So this came out in 2012 when
Microsoft had completely shifted to deep neural based
models in the back end, I mean, even in the production models. And this, actually, so
he's speaking, and in real time, the speech recognition system is
working and giving transcripts. And you'll see that the quality of the
transcriptions is really good. And while the transcriptions
were being displayed on the screen, they were also
doing translation into Chinese. And so there was another screen,
where the MT system, the machine translation system,
was working and producing real time
Mandarin transcripts. So it was a very impressive
demo, and I'm told it was not, maybe Sirena knows more
about the demo itself, but [CROSSTALK]
>> Came together to develop another breakthrough in
the field of speech recognition research. The idea that they had
was to use a technology in a way patterned after
the way the human brain works. It's called deep
neural networks. And to use that to take in much
more data than had previously been able to be used with
the hidden Markov models and use that to significantly
improve recognition rate. So that one change,
that particular breakthrough, increased recognition rates
by approximately 30%. That's a big deal. That's the difference between-
>> Okay, so you can see the transcripts are
pretty faithful to the speech. And this, it was actually
happening real time. So that's very impressive. Okay, so given all of this, so you might be wondering
what's next. So whenever I tell people I
work on speech recognition, I'm asked this
question many times. Isn't that problem solved? So what are you doing? What are you
continuing to work on? So okay, so just to kind of
motivate this question, and also kind of related to the topic
that our team is working on, which is accent adaptation. I'll show a video of our
12th president of India. So this is Pratibha Patil. So she is sitting in a closed
room, and she's giving a speech. And this is YouTube's automatic
captioning system which is working, so this is their state-of-the-art ASR system. And I got this a few
months back, so let's look at the video. >> It is this that will define
India as a unique country on the world platform. India is also an example of how
economic growth can be achieved within a democratic framework. I believe economic growth should
translate into the happiness and progress of all. Along with it there should
be development of art and culture, literature and
education, science and technology. We have to see how to harness
the many resources of India for achieving common good and
for inclusive group. >> Okay, so the words in red of
course are the erroneous words, and this is not bad at all. So if you look at this metric
which is typically used to evaluate ASR systems, so
it's the word error rate. So you take this entire word sequence and you take the true word sequence, and you compute an edit distance between these. So you just align these two sequences and compute where words actually got swapped for other words, where certain words were inserted, and where certain words were deleted, and you get this error rate.
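To make the metric concrete, here is a minimal sketch of computing word error rate with an edit-distance alignment; the example sentences are invented, and real scoring tools report substitutions, insertions, and deletions separately:

```python
# A minimal word-error-rate sketch via edit-distance alignment (illustrative only).
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# one substituted word out of seven gives roughly 0.14, i.e. 14% WER
print(word_error_rate("we have to see how to harness", "we have to sea how to harness"))
```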
And here it's about 10%, which is fairly respectable, so that's not bad. And that's because
Google has a very, very, well developed Indian English
ASR system, and Google actually calls all of these variants of
English different languages. So they have Indian English,
Australian English, British English,
various variants of English, and their system is extremely
sophisticated for Indian English. >> [INAUDIBLE]
>> Yeah, it's clean, it's quiet, exactly, yeah. So now, let's take speech from
our 11th president, who is Abdul Kalam. And why I picked him is because he
has a more pronounced accent. Most of you must have heard
his speech at some point. So, let's look at how-
>> [INAUDIBLE] Which any human being can ever imagine to fight. And never stop fighting until
you are able to destined place, that is the unique you. Get that unique you. It's a big battle. The back in which you don't
you take a the battlements you have out for unique needs,
for you need to do is you must have that battle,
one is you have to set the goal. The second one is acquire
the knowledge and third one is a hard
work with devotion. And forth is perseverance. >> Okay, so
I'm actually curious, Alan, how do you think you would have
done with recognizing this? >> Well,
I would definitely make errors. >> Yeah.
>> Okay, because he is much
more heavily accented. And what's really hard is,
you can follow the transcript, I'm thinking,
Yeah it looks exactly like that. >> [LAUGH]
>> [CROSSTALK] Listening to it and not being right. >> And I'm assuming most of us
in the room are definitely going to do better than this, right? Definitely better than 39%. So actually, when I've shown
this video before, and some people from the crowd
said that this is not fair, because you cannot entirely attribute it to accent, because he's
in this large room. It's more reverberant
than the previous video. His sentence structure
is quite non standard, like he moves in and
out of various sentences. And the previous speaker was, of course, reading. So the read speech has a very,
very different kind of grammatical structure
than this, right? So I said, okay, let's try
to make it fairer, right? So I chose someone who is
an American English speaker, but has an accent, because my
hypothesis is that accent had a lot to do with this misrecognition rate getting higher. So, this is an American English
speaker who has a strong accent and has very
non-standard word order. And so, an obvious choice was Sarah
Palin, if you know who she is. Most of you know who
Sarah Palin is, okay. >> The illegal immigrants
welcoming them in, even inducing and
seducing them with gift baskets. Come on over the border and here's a gift basket full of
teddy bears and soccer balls. Our kids and our grandkids,
they'll never know then what it is to be rewarded for
that entrepreneurial spirit that God creates within us in order
to work, and to produce, and to strive, and to thrive,
and to really be alive. Wisconsin, Reagan saved the hog
here, your Harley Davidson. It was Reagan who saved-
>> Okay, so if I were to try and predict what the next word is,
given her word context, there's no way I
could reproduce this. It was quite arbitrary, right? You have no idea what
she's gonna say next. >> If you showed me that
transcription, I would say, well, it's clearly mostly wrong. >> [LAUGH] >> Nobody could
actually say that. >> So, here, clearly, language
model can only help so much, cuz this is really,
it's quite arbitrary. She has a strong accent,
but YouTube's ASR engines do have a lot of southern
dialects in their training data. And she's also speaking to
a large crowd in the open. So the acoustic conditions
are somewhat similar to the previous video. And the only difference is that
the speaker in the previous video had a much more
pronounced accent. So I'm not saying that it
entirely had to do with the degradation in performance. But that certainly was a major
factor in why the speech recognition engine
didn't do as well, yes? >> The second time it picked
up correctly gift basket, but the first time it
says your basket, so- >> Is that a- >> The illegal immigrants, welcoming them in
even inducing and seducing them with gift baskets. Come on over the border-
>> So, you see- >> Like it could not [INAUDIBLE] for the first ten minutes,
and it said you all, but the next time, it correctly
identified [CROSSTALK]. >> I see, I see, the second,
yeah, yeah, yeah, yeah. >> What would be the-
>> So, and there are multiple
things here at play. So maybe after I show
the structure of an ASR system, I can come back
to this question. Any other questions so far? Okay. Let's actually move into
what's the structure of a typical ASR system? So this is more or less the pipeline of what's
typically in an ASR system. So you have an acoustic analysis
component, which sees the speech waveform and converts it into
some discrete representation. And, or features, if most of you are familiar
with the term features. So the features are then fed into what's known as the acoustic model. So actually, instead of explaining each of the components here, I'll go into each of them individually and explain it more. So the very first component is
just looking at the raw speech waveform and converting it
into some representation, which your algorithms can use. This is a very high level, so
there is a lot of underlying machinery and I am just
giving you an overview of what is going on, so
you have the raw speech signal, which is then discretized, because you can't really work with a continuous speech signal. So, you sample it and generate these discrete samples. And each of these windows is typically of the order of 10 to 25 milliseconds of speech. And the idea is that once you have each of these what we know as frames, so speech frames, which are around 25 milliseconds, sometimes even larger, but typically 25 milliseconds, then you can extract
acoustic features, which are representative of all
the information in your signal. And another reason why you
discretize at particular sampling rates is that
the assumption is that within each of these frames your
speech signal is stationary. If your speech signal is
not stationary within the speech frames,
then you can't effectively extract features from
that particular slice. So, around 25 milliseconds
of speech, yeah. So, you extract features, and this feature extraction is
a very involved process, and it requires a lot of signal processing know-how. But this is also kind of motivated by how our ears work. So actually, one of the most common acoustic feature representations, which are known as Mel-frequency cepstral coefficients or MFCCs. They're actually motivated by
what goes on in your ear or the filter banks,
which apply in your ear. So that might be
too much detail, so let's just think
of it as follows. So you'll start with your
raw speech wave form. You'll generate these tiny
slices which are also known as speech frames. And now each speech frame is
going to be represented as some real-valued vector of features, which captures all the information in the signal as well as possible and is not redundant. So you don't have different dimensions here which are redundant; you want it to be as compact as possible.
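As a rough illustration of this framing-plus-features step, here is a small sketch assuming the librosa library and a hypothetical 16 kHz recording called utterance.wav; the 25 ms / 10 ms frame sizes and 13 coefficients are just common choices, not prescribed values:

```python
# Sketch of framing a waveform and extracting MFCC features per frame.
import librosa

signal, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file

frame_len = int(0.025 * sr)   # 25 ms analysis frames
hop_len = int(0.010 * sr)     # 10 ms shift between successive frames

# 13 MFCCs per frame; each column is one frame's compact feature vector.
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                             n_fft=frame_len, hop_length=hop_len)
print(mfccs.shape)  # (13, number_of_frames)
```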
So now you have these features for each frame, and these are your inputs to the next
component in the ASR system, which is known as
the acoustic model. And it's a very
important component. So, before I go into what
an acoustic model is, just polling this crowd, everyone knows what a phoneme is? Cuz there are lots of linguists here, and many of you might
have encountered it. But anyway, so a phoneme
is just a discrete unit. It's a speech sound
in a language, which can be used to
differentiate between words. So in ASR,
there is this approach, which is known as
beads-on-a-string approach. And this is my terrible
drawing of beads on a string. But each word can be represented
as a sequence of phonemes. So five here is a sequence of
three of these speech sounds. So the phoneme alphabet is very
much like the alphabet for our languages, so
in text, except now, it's covering the sound space,
rather than the textual space. So the analogy is that phonemes are like the letters in your written texts. And each word can be represented
as a sequence of phonemes. So this mapping, so you might be
wondering, how do you know that five actually maps to this
sequence of phonemes? This is written down by
experts in the language. So this particular mapping
between the pronunciation of a word, so this is actually
giving you how this particular word is pronounced. Because you know how each
individual phoneme is pronounced, and this
pronunciation information is actually given to us by experts. So in English, of course, we have a very well developed
pronunciation dictionaries, where experts have given us
pronunciations corresponding to most of the commonly
occurring words in English. And actually we have CMU
to thank for that, so one of the most popularly used
pronunciation dictionaries in English is CMUdict, which is freely available.
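As a small illustration, the CMU dictionary can be queried through NLTK's copy of cmudict (assuming the NLTK data package has been downloaded); the entries shown in the comments are indicative:

```python
# Looking up word-to-phoneme mappings in the CMU pronouncing dictionary via NLTK.
from nltk.corpus import cmudict

pron = cmudict.dict()      # maps a word to a list of possible phoneme sequences
print(pron["five"])        # e.g. [['F', 'AY1', 'V']]
print(pron["probably"])    # several dictionary pronunciations for one word
```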
And it has around 150,000 words, I think? Yeah, of that order. And that is actually one of
the more extensive dictionaries. So typically, this is not
a very easy resource to create. And it's clear why,
because it's pretty tedious. First of all, you need to
find linguistic experts in the language. And then you need to find what
are the most commonly occurring words in the language,
and so on. So this takes time. And not many languages
have very well developed pronunciation dictionaries. And most languages have
around 20 to 60 phonemes. So the phoneme inventory
size is roughly 20 to 60. And this is also something
which needs to be determined by experts. What are the phonemes which
are applicable to a language? So this is a very
language-dependent characterization, like what are
the phonemes which are relevant to a particular language? Okay, so given we have
these units, okay, so one more slide just to motivate, why do we need phonemes,
even from a modeling standpoint. So not just from
the linguistic standpoint, which is that each of these sounds can be kind of discretized in terms of these phonemes. But even from
a modelling standpoint, why are phonemes useful? Why not just use words, instead of trying to split
it into these subword units? So let's look at
a very simple example. So say that your speech wave
form is this string of digits, five, four, one, nine. And say that during training,
you've seen lots and lots of samples of five,
four and one. So, now you can build models to
identify when someone said five, and when someone said four,
and when someone said one. But when you come to nine, so
this is, say, during test time, when you are evaluating this
particular speech utterance. What do you do then if,
during training, you've never seen nine? Of course, this is not
reasonable here because it is a very small vocabulary, but
you can extrapolate when you move to larger vocabularies,
right? So yeah,
all of these words exist, but what do you do when
you come here? What if you were to represent
each of these words as their corresponding sequences
of phonemes? So now five is this sequence of
phonemes, four is this sequence of phonemes, one is this
sequence of phonemes, and so on. And now you've seen
acoustic speech samples, which correspond to each of
these phonemes in your training. So during test time,
when you see nine, you might be able to put together
this string of phonemes. Because you've seen enough
acoustic evidence for each of these individual
phonemes, is that clear? So it helps you generalize, just to move to a more
fine-grained inventory. But of course, with that said, if you have a very
limited vocabulary task. I mean, say that you're almost
certain that during test time you're only going
to get utterances, which are going to stick to
that particular vocabulary. And if you have enough
samples during training, you're probably well off just
building word level models, and not even moving to
the phoneme level. But that is for
very limited tasks, so most tasks you want to do this. So you want to move to this
slightly more fine-grained representation. Okay, so that hopefully
motivates why we need phonemes. So this is the problem, right? So I mentioned you have acoustic
features which you extract from your raw speech signal. And you want the output on this
acoustic model to be what is a likely phone sequence, which corresponded to this
particular speech utterance. So this model is typically,
so initially in the 80s and even now actually, Hidden Markov
models is a paradigm which is used to learn this mapping. So how do I associate a set
of frames, a set of features? So when I say frames and
features, they're kind of interchangeable,
right? So each of these
acoustic-feature vectors corresponds to a speech frame. So how do I chunk, how many
frames are going to correspond to a particular phone? And here I've shown you
a chain of what are known as hidden states,
in your Hidden Markov Model. So this is just like,
think of it as a graph. And it's a weighted graph, so
here I've not put weights on the arcs, but the weights
are all probabilities. So another thing is here, I've
just shown you a single chain that is just a sequence of
these phones put together. But you never know that right? So at each point you can
transition to any phone, because you don't know which
phone is going to appear next. So the entire model
is probabilistic. So you need to have lots
of hypotheses as to what the next phone could have been? So here I have obviously
simplified it by just showing a single chain. And then you have estimates
which say that okay, I think this initial ten frames
most likely corresponds to a certain phone. And the transition probabilities, the probabilities of transitioning from one phone state to the other, determine how many frames you are going to chunk, how many speech frames are going to correspond to each phone. And once you're in a particular state, there are probabilities for having generated each of these vectors. So once I'm in, let's say, this state, there's a probability for having seen this vector, given that I'm in this state. And those probabilities come
from what are known as Gaussian Mixture Models. So it's a probability
distribution which determines that okay, once I'm in
this particular state, this particular speech
vector could be generated with
a certain probability. And how are these probabilities learned? From training data.
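To give a flavour of what such a model contains, here is a toy sketch of one phone's HMM with invented states and transition probabilities, and a single Gaussian standing in for the Gaussian mixture emission density of each state:

```python
# Minimal sketch of HMM acoustic-model ingredients: states, transitions, emissions.
import numpy as np
from scipy.stats import multivariate_normal

states = ["ay_begin", "ay_mid", "ay_end"]      # a 3-state HMM for the phone "ay"

# transition[i][j] = probability of moving from state i to state j
transition = np.array([[0.7, 0.3, 0.0],
                       [0.0, 0.8, 0.2],
                       [0.0, 0.0, 1.0]])

# Each state emits 13-dim feature vectors; means here are made up.
emission_means = [np.zeros(13), np.ones(13), 2 * np.ones(13)]

def emission_prob(state_idx, feature_vector):
    """Likelihood of seeing this acoustic vector while in the given state."""
    return multivariate_normal.pdf(feature_vector,
                                   mean=emission_means[state_idx],
                                   cov=np.eye(13))

frame = np.random.randn(13)    # stand-in for one frame's MFCC vector
print(emission_prob(1, frame))
```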
So that is a detail I'm not going to go into, yeah. >> [INAUDIBLE]
>> Yeah, this is just
the acoustic vector. >> That's it, yeah. So, any questions so far? Okay, so this has obviously
glossed over lots and lots of details. But this is just to give
you a high level idea of, you have a probabilistic model,
which maps sequences of feature vectors to a sequence
of phonemes. So Hidden Markov Models were
the state-of-the-art for a long time. And now, of course,
we have Deep Neural Networks, which are used for
a similar mapping. So now you have your
speech signal and again, you're extracting six
windows of speech frames. And from these, so say that you're only considering
a speech frame here. And then you have, well,
I have a laser pointer. Here, so say considering
a speech frame here. And then you look at
a fixed window around it, a fixed window of frames around it. And you generate, put all of these features together, and that is your input to a deep neural network. And the output is what is the most likely phone to have been produced, given this particular window of speech frames.
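A toy sketch of this context-window idea is below; all sizes, layer widths, and the random "features" are invented, and in practice the per-frame phone targets would come from an HMM alignment:

```python
# Stack a few frames either side of the current frame, feed them to a small
# network, and read off a distribution over phones for that frame.
import numpy as np
import torch
import torch.nn as nn

num_frames, feat_dim, num_phones, context = 200, 13, 40, 5
features = np.random.randn(num_frames, feat_dim).astype(np.float32)  # stand-in for MFCCs

def stack_context(feats, t, context):
    """Concatenate frames t-context .. t+context (clamped at the edges)."""
    idx = np.clip(np.arange(t - context, t + context + 1), 0, len(feats) - 1)
    return feats[idx].reshape(-1)

model = nn.Sequential(
    nn.Linear((2 * context + 1) * feat_dim, 256),
    nn.ReLU(),
    nn.Linear(256, num_phones),
    nn.Softmax(dim=-1),       # posterior probability over phones for this frame
)

x = torch.from_numpy(stack_context(features, t=100, context=context))
phone_posteriors = model(x)   # one probability per phone, summing to 1
```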
But this is a posterior probability over the phones. So I'm not going to go into
details about what this is. But, again,
here you get an estimate of what is the phone given
a speech frame. What is the most likely phone? So it's actually a probability distribution over all the phones. And there are two ways in which DNNs are typically used in acoustic models. So one is, in the previous slide I mentioned, when you have HMMs, you have these states and you have these probability distributions, which govern how the speech vectors correspond to particular states. And these probability distributions are mixtures of Gaussians. But you could kind of not use mixtures of Gaussians here and instead use probabilities
from your DNN. So that is one way in which the
DNN and the HMM models can be combined and
used within an acoustic model. So please feel free to stop and
ask me any question. So at this point, I'm not
really clear if I'm going too technical, so please feel free
to interrupt at any point. So the idea is that we want to
map acoustic features to phone sequences. And this is all probabilistic,
so we are not giving you
a one-best sequence. We are saying this is the likely probability distribution over phone sequences. Okay, so that is the output
from our acoustic model. Yes? >> Yeah, [INAUDIBLE]. >> Yeah. >> [INAUDIBLE]
acoustic? >> Yes, exactly,
acoustic features, and then you have probability distributions. >> How will [INAUDIBLE]? >> So the probability distribution for the, so you're familiar with HMMs. So you have observation probabilities. So your observation probabilities, either they can be Gaussian mixture models. Or your observation probabilities can be scaled posteriors from your DNNs. So it's just the probability distribution, your observation probability distribution can come from the DNN, yeah. Okay, so
this is the acoustic module, which produces phone sequences. So now, we eventually want to
get a word sequence, right, from the speech utterance. So this is just an intermediate
representation, these phones. So now how do I move
from the phones to words? So I mentioned we use these large pronunciation
dictionaries. So this is the model which provides a link between
these phone sequences and words. So here, typically just a simple
dictionary of pronunciations is maintained. So you have these large
dictionaries which say that these words correspond to
these sequences of phones. And this is the only module
in an ASR system that is not learned. It's not learned from data. So the acoustic model was
learned from training data, and we'll come to the language model
that is also learned from data. But the pronunciation model
is actually expert derived. So an expert gives
you these mappings. So I'll talk a little about some
work I did during my thesis, which was on
pronunciation models. It's kind of hinting
at how restricted this particular
representation is. So I mentioned that each word
can be represented as a sequence of phonemes. So we looked at a very
popular speech corpus called Switchboard, a subset
of Switchboard. It's annotated at a very detailed level. So not only do we have word sequences corresponding to the speech. We also have phonetic sequences. So it's also phonetically
transcribed. Meaning someone listened to all
of the utterances and wrote down the phone sequence corresponding
to what they heard. And this was obviously
done by linguists. And when I say phone, they actually listened to
how the word was pronounced. Not what the word should
have been according to some dictionary. And why I'm saying this
is because there's lot of pronunciation variation when
you actually speak, right? So, certain words, even though
the dictionary says that it should be pronounced a certain
way, because of our accents or just because you're talking fast
and so on, the word actually ends up being pronounced in
an entirely different way. So for
some data from this corpus, we actually had phonetic
transcriptions which are giving us exactly how
people pronounce those words. So one thing that really
stands out from the data, so let's look at just four words: probably, sense, everybody, and don't. And the row in blue shows the phone sequences corresponding to these words, according to a dictionary. So this is how the word should be pronounced according to an American dictionary, so these are the American pronunciations. And these were the actual
pronunciations from the data as transcribed by a linguist. So obviously, what is the first
thing that stands out here? >> There's
>> Yeah, definitely there is no one pronunciation
corresponding to the word. There are lots of
possible pronunciations, and you'll also see,
this is, you'll see lots of, like, entire syllables being
dropped out of words. >> [INAUDIBLE]
>> Yeah, speaking very fast. >> [INAUDIBLE]
>> It's wrong, but it's completely legible
from the speech, just because of the context and
so on. But they're speaking very fast,
so if you actually looked at
the phonetic transcriptions, there are entire syllables
missing, and of course these are also perfectly
legitimate pronunciations. So since, when you say the word, it almost feels like you're
inserting a tuh at the end, before the last suh sound,
sense. So there are lots of alternate
pronunciations for words, and I don't remember
the exact number. But I think the average number
of pronunciations they found for a word was of the order
of four or five, so very far from a single
pronunciation for a word. Okay, so we thought, so clearly there's a lot of
pronunciation variation, and this was from a corpus which
was of conversational speech. So it's not read speech; people are speaking spontaneously, very fast, and so on. Read speech tends to be
clearer and they enunciate, and they tend to stick more to
the dictionary pronunciations. But of course, you want to
recognize conversation speech, you're not always only going to
recognize news broadcasters. You need to recognize
spontaneous, day to day speech, so
how do we kind of try and model this pronunciation
variation? How do we computationally
model this, and so we thought, why not go to the source, so what is creating these
pronunciation variations? So it's your speech production
system, so there are various, so before going into that,
I'll show you these videos from the SPAN group at USC,
which is led by. And they do really good work on
speech production based models, to model pronunciation
variation and so on. So this is an MRI of
the vocal tract and it's synched with the audio, so. >> When it comes to singing,
I love to sing French art songs, it's probably my favorite
type of song to sing. I'm a big fan of Debussy,
I mean, I also love operas, I love singing [INAUDIBLE] and
Mozart and Strauss. But when i listen to music, I tend to listen to hard rock or
classic rock music. One of my favorite
bands is AC/DC, and my favorite song is
probably Back in Black, which I'll listen to over and
over again in my room. >> Okay, so these various
parts of the vocal tract which are moving are known
as articulators. And the articulators move in
various configurations and lead to certain speech
sounds being produced. So they did some really
cool work on actually- >> These are the movements that shape the z sounds. >> Tracking the articulators
automatically, this is really cool. So we didn't actually use
these continuous contours, but we said, let's discretize the space. So all these various articulators move in various configurations, which lead to various speech sounds being produced. So if we can discretize
the space and say that, okay, there are eight vocal tract variables, and each of these variables can take one of n values. Then different value assignments to these tract variables can lead to different sounds.
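Purely as a hypothetical illustration of such a discretized representation, here is what one sound's value assignment might look like; the variable names and values are loosely inspired by articulatory phonology's tract variables, not the exact inventory used in this work:

```python
# Hypothetical discretized vocal-tract variables for a nasal sound such as "m".
tract_variables_for_m = {
    "lip_aperture": "closed",
    "lip_protrusion": "neutral",
    "tongue_tip_location": "alveolar",
    "tongue_tip_degree": "wide",
    "tongue_body_location": "uvular",
    "tongue_body_degree": "wide",
    "velum": "open",           # the open velum is what makes the sound nasal
    "glottis": "voiced",
}
# A word then becomes several quasi-parallel streams of such values over time,
# rather than a single string of phones.
```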
And this didn't just come out of the blue, there's a lot of linguistic
analysis on speech production. And there's lots of very
well-developed linguistic theories on speech production. And we used one of them,
which is known as articulatory phonology, so we'll come
to that in a single slide. So now the word everybody, which used to be a sequence of phones, can now be represented as these streams of articulatory features. So if you have each variable which takes different values, now you will have these
quasi-overlapping streams, articulatory features, which
lead to a particular word and how it's being pronounced. And so why do I say
they are overlapping, because they're not
entirely independent, so one feature can affect how
another feature behaves. So we kind of got inspiration
from this work on this theory called articulatory phonology,
by Browman and Goldstein. But they said that this
representation of speech as just sequences of phones,
it's very, very constrained, and it's very restrictive. So let's think of speech as
being represented as multiple streams of articulatory
movements. And that actually gives you
a much more elegant framework to represent pronunciation
variation. So if I have to go back to the
previous slide where I showed you all the various
pronunciations. To try and kind of motivate how
you went from the dictionary pronunciation of probably
to one of these things, it would require kind of
deleting three phones, inserting some other phone, a huge edit distance in terms of phones. So how do you actually motivate
such a large deviation in pronunciation? It turns out that if you
represent pronunciations as these streams of features, you can explain pronunciation variation in terms of asynchrony between these feature streams. So it's just because certain features are not moving synchronously. Say that you're producing a nasal sound, and the next sound that you're producing is a vowel, but then there are certain remnants of the nasality which hold on. And so your vowel also becomes
a little nasalized, and so on. So there was an example
which I thought I may not have time to go into. So the idea is that this articulatory feature framework gives you more elegant explanations of
pronunciation variation. It's certainly more elegant, but
it's very hard to model, and we learned that the hard way. I did during my thesis. So we use this
representation and we built what in the olden days were called DBNs. Which is not Deep Belief Networks, it's Dynamic Bayesian Networks,
so it's the olden day DBN. So it's just a generalization
of Hidden Markov Model. So the Hidden Markov Model that
I described in the acoustic, when I was talking
about acoustic models. This is just a generalization
of that particular paradigm. So you have various variables
which represent each of these articulatory features. And then you represent
constraints between these variables and so on. >> [INAUDIBLE]
>> Yeah, yeah I wish I had
that slide actually. If you don't mind, can I come to
this at the end, because there's a slide which clearly shows-
>> [INAUDIBLE] >> Yes, yes. >> [INAUDIBLE]
>> Yes. So now, okay, actually we
can take that as an example. So fur. So now I'll say fur. So now I'll break the fur sound
into these eight variables and the values it takes. The values it takes to
produce the sound fur. So you're-
>> But one to one map [INAUDIBLE]
>> It's not, it's actually not
a one to one map. So we left it,
we kept it probabilistic. So it is mostly one to one, but
it's not entirely one to one. So we did allow for-
>> One to one [INAUDIBLE]
>> Yeah, yeah. >> Then wouldn't have the space to-
>> It's not a one-to-one mapping. Yeah. It's not a one-to-one mapping. But even if it was. Yeah, so actually if I show
you that example slide, I can clearly explain it, so please remind me at
the end of the talk. I would like to show that. Okay, so that was
the pronunciation model. And the final model is what's
known as the language model, which many of you might actually
be quite familiar with. So language model
is just saying, so again the pronunciation model,
the output was words. So now you've mapped a phone
sequence to a particular word, and now the language model comes
and says how should these words be ordered, according to
a particular language? So the language model
looks at lots and lots of texts in that
particular language. And it finds occurrences of
words together, and yeah, you have a question? >> [INAUDIBLE]
>> But, so now, now we are coming
to this language model. What about going from
the phoneme sequences to words? >> The pronunciation model. So, this one, right? So the phone sequence. Once I get a phone sequence,
I can start mapping chunks of the phones to valid
words using the pronunciation dictionary. >> [INAUDIBLE]
>> Yes. >> [INAUDIBLE]
>> Yes, absolutely. So the thing is, you're not getting a single
phone sequence, right? So it's probabilistic. So you have probabilities for every phone sequence appearing. And so even if it doesn't exactly match, maybe it will match with a lower probability. But then the language model also comes in, and then when you add up the probabilities, you get kind of the most
likely sequence here. Yes? >> [INAUDIBLE] pronunciation models [INAUDIBLE]
>> Yes, usually they're not, it's deterministic. You just have one sequence of,
typically you just have one sequence of phones which
corresponds to a word. But it can be probabilistic,
also. >> [INAUDIBLE]
>> Yeah, exactly, the one I was building was too probabilistic,
[LAUGH] too many probabilities. Okay, so here of course, so
if you saw the word context, the dog. Obviously the most likely next
word to follow this particular word context is ran, maybe even
can, but definitely not pan. So pan would have a very, very low probability
of following the dog. And the language model
is also coming to, actually related
to your question. The language model is very
crucial because it can be used to disambiguate between
similar acoustics. So say that our utterance is, is the baby crying. It could also very well map to
this particular word sequence, but obviously the first word
sequence is much more likely. Because if you look at large
volumes of English text, is the baby crying is probably
a much more likely word ordering than is the bay bee crying? And then let us pray and
lettuce spray. So if you have identical
acoustic sequences, your language model has to kind
of come in and do its job, then. Okay, so I just wanted to put
this here if you wanted to use language models in your work. So SRILM, so actually Alan also
mentioned about SRI in his talk. So they've put out this toolkit, which is extensively
used in many communities. So it's known as
the SRILM toolkit. It has lots of in-built
functionalities implemented, so this is a good
tool kit to use. Another tool kit which is
getting quite popular now days is KenLM Toolkit, which handles
large volumes of text very, very efficiently. So the data structures which
are used to implement this toolkit are much
more sophisticated. So this is much faster, KenLM,
but probably only need to use this if you're dealing with
very large volumes of data. And there's also this
OpenGrm NGram Library. So if you like finite
state machines, if you like working with
finite state machines, you want to represent everything
as a finite state machine. Then this is the toolkit for
you, so OpenGrm NGram,
it was developed by Google. Okay, so language models,
like I mentioned, it has many applications. So speech recognition
is just one of them. Machine translation is another
application where language models are heavily used. Handwriting recognition, optical
character recognition, all of these also would use language
models on either letters or characters. Spelling correction, again, language models are useful
here because you can have language models over
your character space. Summarization, dialog
generation, information retrieval,
the list is really long. So language models are used in
a large number of applications. So I just want to mention this
one point about language models. So we mentioned that you
look at these word contexts. And you look at counts of these
words, and these word contexts over large text corpora
in a particular language. How often does this
particular set of, how often do these particular
set of words appear? And then you compute
some relative counts. So you see, okay, these
chunks appear so often, and these are the total
number of chunks. And so
you get some relative counts. And it'll give you
some probability of how often you can expect this
particular chunk to appear. So just to kind of slightly
formalize that, so this very, very popular language model
which is used are these NGram language models. So the idea is really
straightforward. So you just look at co-occuring,
either two words, or three words, or four words. So if your n is two,
you're looking at bigrams. If n is three,
you're looking at trigrams. n is four, four-grams and so on. And Alan mentioned yesterday
the five-gram model. If you're already
looking at five-grams, you can pretty much reconstruct
English sentences really well. But of course then you're
running into really, really large number of NGrams, as you
increase the order of the NGram. So here I'm looking
at a four-gram, so the four-gram is
she taught a class. So what is the probability of
this particular four-gram? That is the word class follows, this particular word
context she taught a. So you look at counts of she taught a class in large volumes of English text. And then you normalize it with the count of she taught a, which is the word context. So how often does class come after this particular word context?
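To make the counting concrete, here is a bare-bones sketch using a tiny toy corpus; the probability it produces is a raw maximum-likelihood estimate, with no smoothing yet:

```python
# Estimating a 4-gram probability from raw counts: P(w4 | w1 w2 w3).
from collections import Counter

corpus = "she taught a class yesterday and she taught a seminar today".split()

four_grams = Counter(zip(corpus, corpus[1:], corpus[2:], corpus[3:]))
tri_grams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_next(w1, w2, w3, w4):
    """count(w1 w2 w3 w4) / count(w1 w2 w3)."""
    context_count = tri_grams[(w1, w2, w3)]
    return four_grams[(w1, w2, w3, w4)] / context_count if context_count else 0.0

print(p_next("she", "taught", "a", "class"))   # 0.5 in this toy corpus
```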
So what is the obvious limitation here? >> [INAUDIBLE]
>> Yeah, exactly, so we'll never see enough data. We're always going
to run into NGrams, which we're not going to
see in the text corpus. And this actually
happens far more frequently than one would even expect. Even if you have really, really
large databases of Ngrams, you're going to run
into this issue. So just to make sure
that this is true, I went into this Google Books. So Google Books has accumulated
lots and lots of Ngrams from all the books which
are available on Google. It is in English. So you can actually
plot how Ngrams have appeared in books over
some particular time frame. So you can go and play around with this if
you've not seen this before. I just typed in this
particular four-gram, which hopefully is not very relevant to this crowd. So feeling sleepy right now. And there weren't any valid Ngrams at all. And this is not a very, very rare four-gram, right? And even feeling sleepy, right? None of them appear in text. So this is a problem which
occurs actually very, very frequently. So, even when you work with
this counts from very, very large text corpora. You're always inevitably going
to run into this issue, which is you're gonna have these unseen Ngrams, which never appear in
your training data. And why is this an issue? Because during test time, when
you're trying to reorder words according to your
particular language model. And if any of these unseen
Ngrams appear in your test sentence, then the sentence is going to
be assigned a probability of 0. Because it has no idea how to
deal with this unseen Ngram. So there is this problem with
what are known as un-smoothed Ngram estimates. And I wanted to make
it a point to actually talk about this because Ngrams
are only useful with smoothing. So these unsmoothed Ngram
estimates, like I mentioned, you will always run into
these unseen Ngrams, and then what do you do? So there are a horde
of what are known as these smoothing techniques. So you're gonna reserve some probability mass from the seen Ngrams towards the unseen Ngrams. And then there are questions like how do you distribute that probability mass across the unseen Ngrams? And there are various techniques for that as well, like how do you distribute that remaining probability mass.
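As the simplest possible illustration of reserving probability mass, here is an add-one (Laplace) smoothing sketch for bigrams; real systems use better schemes such as Kneser-Ney, which is part of what the Chen and Goodman study compares:

```python
# Add-one smoothing: every in-vocabulary continuation gets a pseudo-count of 1,
# so unseen bigrams still receive a small nonzero probability.
from collections import Counter

corpus = "the dog ran and the dog can run".split()
vocab = set(corpus)

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram_smoothed(w1, w2):
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

print(p_bigram_smoothed("dog", "ran"))  # seen bigram
print(p_bigram_smoothed("dog", "run"))  # unseen bigram, but not zero
```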
So there is a lot of work on smoothing methods. And it's very useful
to make Ngram models, to make them effective. So for anyone who is interested,
I would highly recommend reading this 1998 paper by Chen and
Goodman. Goodman was at MSR,
I don't know where he is now. So this is an empirical study
of smoothing techniques for LMs. I highly recommend this. It's kind of long but
it really gives you a very deep understanding of
how smoothing techniques help. Don't be fooled by the 1998,
it's still very relevant today because Ngrams are very
relevant even today. So Ngrams are not
going anywhere. So I'm not talking about what
the latest language models are. But these days in speech
recognition systems, we move towards these, what
are known as recurrent neural network based language models. So that's neural network based,
but I believe it's still not folded
into a lot of production systems because it's not very fast. So many of the production level
ASR systems probably still use Ngrams. And then do a rescoring using
recurrent neural network language models but, so
Ngrams models are still very, very much in the picture. Okay, so we've already covered
each of these individual components. But there's this big component
in the middle right, which is the decoder, that's actually
a very important component. So I have all of these parts of the ASR system which are giving
me various estimates of what is the most
likely phoneme sequence. What is the most likely
word sequence and so on. But finally I just want to get
the most likely word sequence corresponding to the speech
utterance, and so then it's a search problem. So I have these various
components, and now I need to search,
putting all of them together, I need to search through
this entire space. So just looking at the very
simple example we started with. This is what a naive search
graph would look like. So you start at a particular
point and say that you only expect it to be nine or
one, just these two words. Then you need to
transition to nine. So here, I haven't drawn a weight on every single arc, but these are all weighted, cuz they all come with their associated probabilities. So from start, you can transition into either
producing the word nine or one. But each nine is a sequence of
phonemes, and each phoneme corresponds to its own HMM, which has its own probabilities. So you can see this is already quite a large graph just for these two words. And to get at least a half-decent system, we'd be looking at at
least 20,000 or 40,000 words. So you can imagine how much
the search graph blows up. So these are really
large search graphs, and I think I have another slide,
yeah. So if you have, say, a network
of words as follows, so the birds are walking,
the boy is walking. This is really simple where
there's not even a real language model, this is highly constrained. So now each of these is going to map to its corresponding phone sequence, so the, the birds, and so on. And each of those phones is going to correspond to its underlying HMM states, and very quickly,
the graph blows up. So if you look at, so
just to give you an estimate, a vocabulary size of around
40,000 gives you search graphs of the order of tens
of millions of states. So these are really
large graphs, and so now we need to search through
these graphs and throw out what is the most likely word sequence
to correspond to the speech. So you might be wondering, can
you do an exact search through this very, very large graph? And the answer is no, you cannot
do an exact search through this graph, because it's
just too large. So you have to resort to
approximate search techniques, and there are a bunch of them,
which do a fairly good job. So none of these speech systems that you work with are actually doing an exact search through this graph.
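To give a flavour of such approximate search, here is a toy beam-search sketch over a tiny invented graph of hypotheses; real decoders search HMM-state-level graphs with millions of states and fold in acoustic and language model scores:

```python
# Toy beam search: keep only the best few partial hypotheses at each step.
import math

# graph[state] = list of (next_state, log_probability) arcs; scores are made up
graph = {
    "<s>": [("nine", math.log(0.6)), ("one", math.log(0.4))],
    "nine": [("</s>", math.log(1.0))],
    "one": [("</s>", math.log(1.0))],
}

def beam_search(start, end, beam_width=2):
    beam = [(0.0, [start])]                      # (total log-prob, path so far)
    while not all(path[-1] == end for _, path in beam):
        candidates = []
        for score, path in beam:
            if path[-1] == end:
                candidates.append((score, path))
                continue
            for nxt, logp in graph[path[-1]]:
                candidates.append((score + logp, path + [nxt]))
        beam = sorted(candidates, reverse=True)[:beam_width]   # prune
    return beam[0]

print(beam_search("<s>", "</s>"))   # best-scoring path through the toy graph
```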
So that's the decoder, so any questions so far? So this is the entire kind
of pipeline of how an ASR system works. Okay, so
everyone is with me, right? So I want to kind of end
with this new direction, which is kind of becoming
very hot nowadays. They are known as these
end-to-end ASR systems, so I showed you all of these
different components which, put together,
make an ASR system. But lots of people are
interested in kind of doing away with all of those components. Let's not worry about how a word
splits into its corresponding phone sequence. Let's just directly learn
a mapping from acoustic features to letters. So this goes straight to characters,
so directly go from speech vectors,
so these acoustic vectors, to a character sequence. And then you can have character
language models which re-score the character sequence,
and so on. So one kind of nice
advantage of this is that, because you're getting rid
of the pronunciation model, which is that you're not now
looking at phones at all, you don't need that mapping. The word to phone mapping,
which typically is written down by experts, and
that changes for each language. So now you want to build a new
system for a new language. If this worked really well, then
all you'd need is speech and the corresponding text. But the catch is, you need
lots and lots of these for this to work, for these kinds of
end to end systems to work well. So just for people in
the crowd who are interested in these kinds of models, I'll just put down a few
references, which you can read. So the first is this paper,
which came out in 2014 and kind of started off this thread of work, which is this end-to-end speech recognition
with recurrent neural networks. So I won't go into details at
all about the model, this is just for you to jot down if you
want to go later and read it up. But I'll put this up, which is
kind of the sample character-level transcripts which they get
out of their end-to-end systems. So here they have a bunch of
target transcriptions, and the output transcriptions. So you can immediately see, so this is without any dictionary,
without any language model. So this is directly mapping
acoustic vectors to letters, characters. So you can see obvious
issues like lexical errors, you can see things where you
have phonetic similarity, so shingle becomes single. Then there are words
like Dukakis and Milan, which are apparently not
appearing in the vocabulary, so that is another advantage
of these character models. So in principle, you don't care
about whether you will see this word in your vocabulary, because you are only predicting
one character at a time. So it should recover
vocabulary words, but this system doesn't
actually do that too well. >> [INAUDIBLE]
>> It does, yeah, so they just had this without a dictionary,
without a language model. But their final numbers are all
with a language model, and a dictionary also, actually. So the second improvement of
this paper was by Maas et al, who again explored a very, very similar structure as in
this previous paper in 2014. And they had this kind
of interesting analysis, which I wanted to show. So on the x-axis you have time. And each of these graphs
correspond to various phones. So remember in their system
there are no phones at all. But they just accumulated a bunch of speech samples which correspond to each of these phones, and averaged the character probabilities corresponding to those particular phones. So here you can see that K obviously comes out, but so does C; the letter C also corresponds to the /k/ sound. And interestingly, for SH, so this is the phone /sh/, S and H definitely come out. But so does T-I, because of TION, T-I-O-N, which you would pronounce as "sh", right? So that actually comes out of the data, which is pretty cool.
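A rough sketch of that averaging step, assuming you already have per-frame character probabilities from the character model and a separately obtained per-frame phone alignment (both hypothetical inputs here):

```python
import numpy as np
from collections import defaultdict

def average_char_probs_by_phone(char_probs, phone_alignment):
    """Average per-frame character probabilities, grouped by aligned phone.

    `char_probs` is a (num_frames, num_characters) array from the character
    model; `phone_alignment` is a length-num_frames list of phone labels
    obtained separately (e.g. from a forced alignment). Returns a dict
    mapping each phone to the mean character distribution over its frames.
    """
    frames_by_phone = defaultdict(list)
    for probs, phone in zip(char_probs, phone_alignment):
        frames_by_phone[phone].append(probs)
    return {phone: np.mean(rows, axis=0) for phone, rows in frames_by_phone.items()}
```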
So yeah, this was a nice analysis; they do only slightly better than the previous paper. And, yeah? >> [INAUDIBLE]
>> So the X-axis is time in frames. You can think of it
in speech frames. And these are just
the character probabilities, yeah. So the last system, which came out in 2016 and significantly improved over these two, uses this very popular paradigm in sequence-to-sequence modeling, known as encoder-decoder networks or sequence-to-sequence networks, which was first used for machine translation. They applied it to this particular problem and also included what is known as attention. And all of these bells and whistles together definitely make a difference.
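As a bare-bones illustration of the attention idea, here is a single dot-product attention step in numpy; real systems add learned projections, recurrent decoders, and often location-aware attention on top of this.

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """One step of (unscaled) dot-product attention, as a minimal sketch.

    `decoder_state` is a (d,) vector, `encoder_states` is a (T, d) matrix of
    encoded acoustic frames. Returns the attention weights over the T frames
    and the context vector the decoder would condition on when predicting
    the next character.
    """
    scores = encoder_states @ decoder_state          # (T,) similarity scores
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over frames
    context = weights @ encoder_states               # (d,) weighted summary
    return weights, context
```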
But I want to mention that end-to-end systems are not yet close to the entire standard
pipeline that I showed you earlier. So, people would really like
to bridge the gap between End-to-End systems and
these whole pipelines. Because clearly these are, at least, easier to understand in some sense, at least from a modeling standpoint, although it's not easier to understand what the model is doing. Yeah, so there's a lot of work going on in this particular area. But these systems require lots of data, lots and lots of data, to train. And that's because not only are you trying to learn what the underlying speech sounds in the speech utterances are, you're also trying to learn spelling. You're trying to figure out what spelling makes sense for a particular word. And clearly for a language like English, where the orthography is so irregular, that's a hard problem. And so these models require large
amounts of data to work well. Okay, so I'm gonna come back to
this question I posed initially, which is what's next? So what are all the kinds of
problems that we could work on, if anyone was interested
in speech recognition? So there are lots of, I think
there are lots of next steps. So one is you need to do more to
make ASR systems robust to variations in age and, of course, accent, which is why we are working on that problem. And also, this is another thing which people are interested in: speaking ability. So there are people, say, with speech impairments or other issues, and they are not able to speak as clearly as maybe all of us in the room. So how can we adapt ASR systems
to work well with those people? And this is a real, very challenging task: how do you handle noisy, real-life settings with many speakers? This goes back to Allen's dream of having a bot which is sitting in a meeting, transcribing and figuring out what is going on. So that would also involve the underlying ASR system in that bot. It would have to figure out that, okay, these are all the interfering speakers, this is the main speaker, this is the speaker I need to transcribe; I need to filter out the other interfering speakers, and so on. And the state of the art for this kind of meeting-speech task is not great; the error rates here are not very low. This one is actually handled pretty well now, if you have lots and lots of labeled speech; the pronunciation variability actually gets captured in the acoustic model itself. But for handling new languages, currently the only way to do a good job is to go and collect lots and lots of data, which, at least personally to me, is unsatisfying. So it seems like, if you have existing models, you should be able to adapt them without huge amounts of labeled speech, at least if the languages are
somewhat related. We should be able to do a half
decent job, by taking existing models and adapting them to
the new language that we want to recognize, or the new
dialect we want to recognize. So there are these problems. So in computer science, we
are always trying to do things faster, and
to be more efficient, right? Both computationally, doing things faster from a computational-power standpoint, but we should also try to be
resource efficient, right? We don't want to keep going and
collecting more and more data, every time we come
up with a new task. So can we do many of
these tasks with less? This is something that I am very
interested in personally, so can we reduce duplicated effort
across domains and languages? And also can we
reduce dependence on language specific resources? And this is of course
the holy grail I think, training with less labeled data. And actually making use
of unlabeled data better. Okay, so I'll also show
this one direction, which Microsoft is working on,
and it's kind of very promising. So this is just
an excerpt from an ad. So this is Skype. >> Can you understand me now? [MUSIC] >> [FOREIGN]
>> [FOREIGN] >> You speak Chinese. >> Now, if that worked
as seamlessly as it worked here, that would be pretty cool. So I'm told this was just set up for the ad. So Microsoft has been working
a lot on speech-to-speech translation. And I think this is a very
interesting problem. Because there can be cues in
speech, which help disambiguate utterances for the machine
translation part, and so on. So I think there is something
which can be leveraged from the speech component,
from the ASR component. So this is something that we
talked about a little bit, which was using speech
production models, and how we can build speech
production inspired models, to handle pronunciation
variability. And that actually in principle, does reduce dependence on
language-specific resources. Because all of us have the same
vocal tract system, right? So there are only so many ways in which our different articulators can form
different configurations and produce sounds. So in some sense,
at least in principle, moving to that kind of a model
does reduce dependence on language-specific resources. So we don't need to come up with
phone sets corresponding to a particular language, if you're
going to represent all of the pronunciations in terms of
these articulatory features.
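As a toy illustration of what "representing pronunciations in terms of articulatory features" could look like, here is a small Python table; the feature inventory and the example phones are simplified assumptions, not a standard feature set.

```python
# Illustrative only: a tiny articulatory feature table. Real systems use richer
# feature sets, but the point is that these dimensions (voicing, place, manner)
# are shared across languages, unlike language-specific phone sets.
ARTICULATORY_FEATURES = {
    #     (voiced, place,       manner)
    "p": (False, "bilabial",  "stop"),
    "b": (True,  "bilabial",  "stop"),
    "s": (False, "alveolar",  "fricative"),
    "z": (True,  "alveolar",  "fricative"),
    "m": (True,  "bilabial",  "nasal"),
}

def features_for_pronunciation(phones):
    """Map a phone sequence to its articulatory-feature representation."""
    return [ARTICULATORY_FEATURES[p] for p in phones]

print(features_for_pronunciation(["s", "p", "z"]))
```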
But there are other problems with that method. And this is another problem which I think is
very interesting. So how do you handle
new languages, and not have to collect loads and
loads of data? So just to tell you
how many languages have ASR support so far. This is actually a year or
two old, maybe this number has
gone up a little. So they support roughly
around 80 languages. But these "languages" include Indian English, Australian English, and British English, which are clearly not separate languages. So that number is even less than 80. And if you look at the
distribution across continents, Europe has the highest
representation in terms of languages which are supported
by speech technologies. America is of course small, also because they're
largely monolingual. But Asia is dismal,
even though there are so many languages spoken in
the Asian subcontinent. So yeah,
we should all do more to build speech recognition technologies,
or language technologies, for various Indian languages and
languages in Asia. And so one thing that we have
looked at, is can we try and crowdsource the labels for
speech? So can we just play speech utterances to crowds who speak the particular language, and then try to get transcriptions from them? It will be a little noisy, but then there are techniques to handle the noise in those crowdsourced transcriptions. But that also has an issue, because it's somewhat unfair to
a large number of languages. So this is just a histogram,
this was of all the workers who were sampled from a large crowdsourcing platform, just MTurk,
Amazon's Mechanical Turk. And this looked at the language
demographic of crowd workers on Mechanical Turk. And the yellow bars are actually
the speakers of those languages in the world. So you can see there's a large distribution mismatch between the language background of the crowdworkers and the language expertise which is needed to complete transcription tasks. I mean, this tail is really, really long, so forget about minority languages: for languages in that tail it's very, very hard to get native speakers on crowdsourcing platforms. So this also may not really
be a viable solution always. So I think there are lots of interesting problems to
think of in that space. So with that, I'm going to stop. I'll kind of leave
you with this slide. Yeah, I think I'm
doing good on time. So thanks a lot. I'm happy to take
more questions. >> [APPLAUSE]
>> Yes. >> [INAUDIBLE] >> Yeah, so with language models, you can back off all the way to a unigram model. So as long as each of the individual words has been seen somewhere in the language model, and if your acoustic model is good, it's going to give you a somewhat reasonable phone sequence corresponding to the underlying speech. You might still recover the word sequence even though the language model doesn't give you too many constraints. So, for example, for the Sarah Palin speech, I don't think the language model was anything more than maybe a bigram model [LAUGH] or maybe a trigram model at the most. So as long as the individual words have been seen in text, in large volumes of text, and your acoustic model is good, you can still recover the word sequence even if there's no continuity between the words.
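As a toy illustration of that back-off behaviour (deliberately ignoring the discounting and smoothing that real language models use):

```python
from collections import Counter

def train_counts(sentences):
    """Collect unigram and bigram counts from tokenized training text."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def backoff_prob(word, prev, unigrams, bigrams, backoff_weight=0.4):
    """Toy back-off: use the bigram estimate if seen, else a scaled unigram.

    Real language models use proper smoothing (e.g. Katz or Kneser-Ney);
    this only illustrates the fall-back behaviour described above.
    """
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return backoff_weight * unigrams[word] / total

sents = [["the", "birds", "are", "walking"], ["the", "boy", "is", "walking"]]
uni, bi = train_counts(sents)
print(backoff_prob("walking", "boy", uni, bi))  # unseen bigram -> backed-off unigram
```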
Does that answer your question? Any other? Okay, yes. >> Based on your working thesis, do you ever feel you need to have more
because the [INAUDIBLE] was obviously [INAUDIBLE]
find some [INAUDIBLE] more than [INAUDIBLE]. >> But it had to have
more than 30 phones? Actually, so
40 is the number of phonemes. Yeah, so the number of phones
in English is more than 40, so even in those
phonetic transcriptions, the number of phones
were almost close to 80. Because it's actually annotating
all the fine-grained variations. And that, of course, helps if you have that kind of annotation. But where are you ever going to get that level of phonetic annotation? Yeah. Yeah. >> [INAUDIBLE]
next feature where character [INAUDIBLE]? >> You need lots of data; other than needing data, you also need lots of computational resources. But other than that, like I said, it doesn't really
work as well as the entire pipeline yet. So there's still a delta in
terms of the performance of your state of the art systems and
these end to end systems. And so currently, all these end-to-end systems
are recurrent neural networks. So there is this issue of how
much context to retain, and whether you retain that
context effectively, which is where these attention mechanisms come in; but attention mechanisms also really fall short. So if you're interested, there is an ICLR paper this year whose title is, I think, something like Frustratingly Short Context or something like that; you can search for it. So the idea is that even if
you just look at the last five output representations,
you can do as well as a really sophisticated
attention mechanism. So attention mechanisms also
need to be kind of improved further. Yeah? >> [INAUDIBLE] systems work for
[INAUDIBLE]. >> Yeah, so the [INAUDIBLE]
system is actually predicting characters. So it predicts a single
letter of the alphabet. So auto vocabulary is not
an issue at all cuz it's predicting one
character at a time. >> [INAUDIBLE] data for
Indian language. >> Yeah, so
that's actually something that's very interesting. So for Indian languages which
are morphologically rich and where you probably cannot
expect to see all the various forms of a word in the vocabulary, end-to-end might actually work really well. But no one has run this yet, because that amount of data
is not available here. >> In fact they didn't call
that particular speech for example in English the classes
>> [INAUDIBLE] [CROSSTALK]
>> Yeah, but that's a good mob- >> English, right? Characters don't [INAUDIBLE]
>> No, not at all. So the entire system actually
needs to learn all of these things: it needs to learn the sound mapping, and it needs to learn spelling. So yeah, because the mapping is so irregular, right. So what is the point
you are saying? Sorry. >> So for example. >> Yeah. >> And this part, but
if I take for example, I need to have a class for and
>> [INAUDIBLE] So you can have, you have [INAUDIBLE]. >> Yeah, you don't need a. >> [INAUDIBLE]
>> Yeah, it's up here, yeah. >> [INAUDIBLE]
>> No, you might have double of 36, because you split it: with Unicode, you would have the composed character, which is the consonant plus the vowel sign. So you would predict the consonant, and then you predict the vowel sign, and so on.
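To illustrate the kind of splitting being discussed, a small Python example: a Devanagari syllable such as kī is already stored as two Unicode code points, a consonant plus a dependent vowel sign, so a character-level model can predict those code points one at a time.

```python
import unicodedata

# The syllable is stored as two code points: the consonant KA plus the
# dependent vowel sign II. A character-level model can emit these one by one.
syllable = "\u0915\u0940"
for ch in syllable:
    print(ch, unicodedata.name(ch))
# prints: क DEVANAGARI LETTER KA
#         ी DEVANAGARI VOWEL SIGN II
```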
>> Might it be enabled for all the single- >> No, you're talking about... So you just predict each of these. >> S-E-E [INAUDIBLE], so you [INAUDIBLE] then E, then E. >> Yeah.
>> [INAUDIBLE] >> Yeah, yeah, that's [INAUDIBLE] but
the label size, of course the label space becomes larger,
but probably [INAUDIBLE] double. I don't think it'll
be more than that. Which if you have enough
data should be okay. >> Quick fix.
>> I think yes, because the mapping
is much more stable. It actually might do even
better than in English. Yes? >> What do you think is the minimum amount of data you would need, if you wanted to [INAUDIBLE]? >> Yes, so I ask this too, in terms of [INAUDIBLE] and so on. So they use like 10,000 hours of speech, [LAUGH] all of you must be using speech [INAUDIBLE]. So I know the standard pipeline. So for Switchboard,
for instance, so switchboard is around
200 hours of speech. And the error rates now are 5%,
the latest was 5%. >> So
of course with a lot of machinery. So- >> [INAUDIBLE]. >> Yeah, end-to-end-
>> What do you think the bare minimum would be if we
really wanted to [INAUDIBLE]. >> To try it out. So the other papers
that I showed you they actually work with [INAUDIBLE]
which is 200 hours of speech. So even if you're in the 100-hours range, and more, I think you can start training end-to-end systems. But again, 10 years, 20 years... But I would still be
interested to see experiments on Indian languages with even
smaller amounts of data. >> Some people are even doing it on fable languages >> Right, right
>> 20 years, >> Correct >> So it's sort of beginning to work, it really has to be
>> Appropriately similar data lectures from [INAUDIBLE],
then maybe. >> Yes [LAUGH] That's true,
that's true. >> [CROSSTALK]
>> Of course, yeah. >> But [INAUDIBLE]
good transcriptions, but don't underestimate that
[INAUDIBLE] is [CROSSTALK] >> Yeah, I think, yeah, that's a very
>> I think that would be very good, just hours of Hindi speech by itself. You have a question? >> Yeah.
Is there any evidence of these working for another language
with this complicated >> No one has that yet. >> Not this language, any
>> No so I'll not from there maybe
be able and to have some. >> The only thing about
that it's funny and it's working this and
that it's not clear. But the argument of doing this
for line is with my apology and probably the most daily token >> [INAUDIBLE] and there are people claiming
that it's not a method. It's not clearly [INAUDIBLE],
okay? But this is the hot research
topic that people like to do. And ultimately, it would be easier because
enunciation [INAUDIBLE] is hard. >> Yes, yes. [LAUGH]
>> So if you can sort of get your way around that and find
people that actually said it >> But, don't underestimate, people keep saying there's
26 letters in English. No, there aren't just 26, because there are numbers and symbols and other things that you have to address. >> Mm-hm. All right. Thanks so much. >> [APPLAUSE]