(Old) Lecture 16 | Connectionist Temporal Classification

Let's pick up where we left off in the last class. I'm not going to do the usual business of going over everything we've done so far; I'll just take two steps back, sufficient to maintain continuity.

The problem we're really dealing with is sequence-to-sequence transduction: a sequence of inputs x1 through xN goes in, and a different sequence y1 through yM comes out. Note that N need not equal M. For example, in speech recognition a sequence of speech vectors goes in and out pops a sequence of phonemes or words; the number of vectors that goes in is not directly related to the number of symbols that comes out, other than perhaps providing an upper bound. Similarly, in machine translation a word sequence goes in and a word sequence comes out, again with no one-to-one correspondence between them. In a dialogue system the user's statement goes in and the system's response comes out; in a question-answering system a question goes in and an answer comes out. The magical thing is that we're going to use the same machinery for all of these different tasks, and it just works; all that changes is the training data.

So the key point is that the number of symbols in the input and the number of symbols in the output need not be the same, and there need not even be any synchrony between the two. Consider two examples, speech recognition and machine translation. In both, the number of vectors or symbols in the input is not the same as in the output, but there is a fundamental difference between the two problems. In speech recognition, when you see the signal for "I ate an apple", the signal for "I" precedes the signal for "ate", which precedes "an", which precedes "apple": the output is order-synchronous with the input. It is not time-synchronous, but it occurs in the same order. On the other hand, in machine translation, where "I ate an apple" becomes "Ich habe einen Apfel gegessen", the order of the input and output words is completely mixed up, so you don't even have a notion of synchrony or alignment. Today we'll deal primarily with the first kind of problem, where you have order synchrony: the input and output sequences happen in the same order, although they may not be time-synchronous. Speech recognition is like this: the input is a sequence of vectors, the outputs occur at different times, and you don't necessarily get an output for every single input.

The simplest case of this, which we saw in the last class, is a sequence of inputs producing a single output. In speech recognition, a sequence of vectors goes in, and when you are done seeing the entire sequence you say "I saw the phoneme /R/"; in question answering, a sequence of words goes in, and when you have seen all of them you provide the answer.
During training, we only define the divergence where we have an output, so the divergence is defined only at the very end of the input. Unfortunately, this means that a single divergence has to back-propagate and provide training signals to every single box in the diagram. What this really does is ignore the outputs of the intermediate steps: it ignores the output at time 0, at time 1, and so on. It's not as if the system produces no output at those times; remember, every column is identical to every other column, so there was an output at time 0, at time 1, at time 2, and we simply discarded the earlier ones as uninformative or irrelevant.

This is perhaps suboptimal. In a problem like speech recognition it makes sense to believe that even the intermediate outputs are meaningful, and this is a key concept we'll keep using throughout today's lecture. So instead of assuming that an /R/ comes out only at the very end, I'm going to require the system to output /R/ at every single time. Although the output is nominally only at the end, when we train the system the divergence we compute is going to be a weighted sum of the divergences at the individual time instants. Remember, when you train a recurrent network you are looking at a divergence between sequences; here we assume that divergence is a sum over time of the divergences between the actual output and the target output at each instant, with weights that depend on the problem. For speech recognition it makes sense for all the weights to be 1: if I'm saying "aah", it doesn't matter where you sample the output, it must be /AA/. Whereas for something like question answering, having answers at intermediate times doesn't make sense, so you would use a weight of 1 at the end and 0 elsewhere. Given outputs of this kind, it is obvious how to train: we've reduced the order-synchronous but time-asynchronous problem to a standard time-synchronous input/output problem, and we've already gone through how to do back-propagation for that.

Now here's a more complex problem: I'm given a sequence of inputs and I have to generate a sequence of outputs. This is again like the speech recognition problem, except that now I'm not telling you exactly where in the input the outputs happen. Observe that this is simply a concatenation of many copies of the earlier basic model, one after the other, but that simple concatenation actually makes things much harder. The real issue is that the system outputs something at every time, so which of those outputs must we actually consider, given that the output is order-synchronous but not time-synchronous?
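Before tackling that harder problem, here is a minimal numpy sketch of the weighted time-synchronous divergence just described. The function and variable names are mine, not from the course code, and the network outputs are assumed to already be softmax probabilities.

```python
import numpy as np

def weighted_sequence_divergence(output_probs, targets, weights):
    """
    output_probs: (T, K) matrix of softmax outputs, one row per time step.
    targets:      length-T array of target symbol indices, one per time step.
    weights:      length-T array of per-time weights (all ones for speech
                  recognition, a single 1 at the final step for QA-style tasks).
    Returns the weighted sum over time of the per-step cross-entropies,
    DIV = sum_t w_t * (-log Y_t(target_t)).
    """
    per_step = -np.log(output_probs[np.arange(len(targets)), targets])
    return np.sum(weights * per_step)

# Example: 4 time steps, 3 symbols, target symbol index 2 at every step.
probs = np.array([[0.20, 0.10, 0.70],
                  [0.30, 0.20, 0.50],
                  [0.10, 0.10, 0.80],
                  [0.25, 0.15, 0.60]])
print(weighted_sequence_divergence(probs, np.array([2, 2, 2, 2]), np.ones(4)))
```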
To understand this, first recall what the network actually produces at each time: a probability distribution over all of the possible symbols. If there are, say, seven potential output symbols, then at each time you get a probability distribution over all seven, so what the network really generates is a matrix of probabilities in which each column sums to 1. Given this, what we can think of as the output is the symbol sequence with the greatest probability. We've talked about this before: when you're generating, you have a probability distribution and you pick the most likely symbol, so for a sequence you want the most likely symbol sequence given the input. We're going to be building on this.

Here's a simple way of doing it: at each time, simply select the most probable symbol. Say at the first time G has the highest probability, so I pick G; G again at the next time, then F, and so on. Now remember, we really want order-synchronous but time-asynchronous outputs, so any time you see such a repetition you assume it represents repetitions of a single actual symbol, and you collapse them: any run of the same symbol collapses into one instance. So the actual output here is the collapsed sequence G, F, E, and so on.

What is the problem with doing something of this kind? If the real output contains a genuine repetition, say two F's in a row, this cannot capture it, because the collapsing merges them; the logic is insufficient to identify that there were two distinct F's. And of course the resulting maximum-probability symbol sequence may be meaningless. You can impose external constraints; we won't go into that right now, but you will see towards the end of the lecture, if you have the patience to stay that long, how the external constraints you impose can ensure that the output symbols form valid words and valid word sequences. We'll refer to this process of obtaining an output from the network as decoding: we have decoded the output of the network to get a sequence of symbols.

Now here's the key piece: if I just pick the most likely symbol at each time, that is not necessarily the most likely symbol sequence, because I am being greedy. At each time I pick the most probable symbol, but if there is any external constraint or prior at all, it may tell me, for example, that an E is highly unlikely to follow a B, and that information is not being considered; this is a purely greedy procedure.
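As a concrete illustration of this greedy decode-and-collapse procedure, here is a small sketch; the three-symbol alphabet and the probabilities are invented for illustration.

```python
import numpy as np

def greedy_decode_and_collapse(output_probs, symbols):
    """
    output_probs: (T, K) matrix of per-time symbol probabilities.
    symbols:      list of K symbol labels.
    Picks the most probable symbol at each time (greedy, time-synchronous),
    then collapses runs of repeated symbols into one (order-synchronous).
    """
    best = np.argmax(output_probs, axis=1)          # most likely symbol per frame
    collapsed = [symbols[best[0]]]
    for idx in best[1:]:
        if symbols[idx] != collapsed[-1]:           # keep only changes
            collapsed.append(symbols[idx])
    return collapsed

# Toy example with an invented 3-symbol alphabet.
probs = np.array([[0.7, 0.2, 0.1],   # G most likely
                  [0.6, 0.3, 0.1],   # G again
                  [0.2, 0.7, 0.1],   # F
                  [0.1, 0.2, 0.7]])  # E
print(greedy_decode_and_collapse(probs, ["G", "F", "E"]))   # ['G', 'F', 'E']
```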
More importantly, you are finding the most likely time-synchronous sequence of outputs, and the most likely time-synchronous sequence may not give you the most likely order-synchronous sequence once you collapse it; there is a difference between order-synchronous and time-synchronous, as we saw. So this is a greedy algorithm that might not give you the most likely order-synchronous output. Hold that thought; we'll come back to it much later in the lecture. For now we've at least looked at when to output symbols and which outputs are the real ones.

The bigger issue is how to train the models. In the best case, training is fairly easy: suppose I give you the input, I give you the target output, and I tell you exactly at what time each output symbol must be produced. No big deal: I have the sequence of inputs and I'm informed of which output occurs at which time, so I can compute the divergence between the target output and the network's output at the times where the target is expected to occur, back-propagate those divergences through the network, and learn all my parameters. Or, instead of assuming I only have outputs at the precise instants where I expect them, I can assume that an order-synchronous target symbol, say B, really represents a repetition of B over several time steps. This stretches the training signal you provide to the network, and you can use this much more detailed signal to train it. This is the exact form of training signal we're going to assume for the rest of the lecture: although the training target is order-synchronous but not time-synchronous, we will convert it to a time-synchronous target by repeating symbols.

Once I've done this, what is the actual divergence I minimize? It is the sum of the divergences at the individual instants: I sum over time, computing the cross-entropy between the target output and the output of the network at each time. Here's what happens, and this is something you must remember: the network at each time gives me a probability for every symbol, and the target at each time is a single symbol. If I think of the target as a one-hot vector d, then in the summation over symbols of d_i log y_i all the terms vanish except one, and I am left with just the negative log of the probability of the target symbol at that time. This is very important: the divergence, or loss, that you are minimizing is the negative log of the probability assigned to the true symbol at each time, summed over all time, and that is the final divergence you minimize.
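Written out (in my notation, summarizing the statement above), the divergence for a known time-synchronous target s_0, ..., s_{T-1}, and the derivative we will need in a moment, are:

$$\mathrm{DIV} = \sum_{t} \mathrm{Xent}\big(Y_t, s_t\big) = -\sum_{t} \log Y_t(s_t), \qquad \frac{\partial\, \mathrm{DIV}}{\partial Y_t(s)} = \begin{cases} -\dfrac{1}{Y_t(s_t)} & \text{if } s = s_t,\\[4pt] 0 & \text{otherwise.}\end{cases}$$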
If you were actually doing gradient descent, remember that you need the derivative of the divergence with respect to each of the outputs; that was part of back-propagation. Any single time step contributes only one term to the divergence, the negative log of the probability of the target symbol at that time. So if you take the derivative with respect to the probabilities of all the symbols, meaning with respect to the entire output of the network at that time, you get zeros everywhere except in the component corresponding to the target symbol, and there it is just minus one over the probability assigned to the target symbol. Nothing complex; this is a straight-up divergence computation, but these are all things that come up when we go through the entire training process. So this case is easy: if I give you an input and tell you exactly where each output occurs, I create time-synchronous targets from the order-synchronous targets, I have a symbol at each time, I can compute the divergence, compute its derivatives, and do back-propagation. Everything is good, the world is fine, we're happy.

The much more likely scenario is that when you're given an input, you are not told exactly where each symbol occurs. For example, I might be given a speech recording and told that it represents the word sequence, or the phoneme sequence, of "hello world". No one is going to tell you this is where the /HH/ occurred and this is where the /EH/ occurred; that information is not given. You are only given the sequence of output symbols, with no indication of which one occurs where. In this case how do we compute the divergence? Remember, to compute it we need to assign a unique symbol to every time instant, and that information is no longer provided.

If I toss this problem to you, what would be your best first solution, not the most complicated one, just the first that comes to mind? Randomly assign the symbols: I know the order in which they occur, so I can randomly assign the places where each symbol occurs. This is perfectly reasonable, and once I have done it I can go ahead and train my network. Is this going to work? Not really, because it is a random assignment, but it is a good starting point: it gives me a preliminary model, and I can use that preliminary model to go back and re-estimate my alignment. This assignment of time-synchronous outputs to the input is what we will call an alignment. So what you are really given is something like this: you're given x1, x2, x3, x4, ..., xN, and you're told that this corresponds to the symbol sequence A B C. The output is much shorter than the input, so what we're going to do is come up with a repetition of the individual symbols such that they occur in the right order; that is what we will call our alignment. We are going to stretch the order-synchronous output so that
there is a one-to-one correspondence with the input, and this business of stretching is what I will call alignment. If I have such an alignment, it's easy to train the network; we've already seen that. So here's the plan: start with a random alignment of the symbols to the input, train the network, then go back and re-estimate the alignment for each training instance. How do we do that re-estimation? Using the decoding methods we've already discussed; we'll get back to that. Maybe I started with a random initial guess like A B B B B C C C C, but after training my models, when I re-estimate, I might end up with something like A B B C C C C C C: in the process of training the network and re-estimating, the alignment has changed. So I can go back, retrain my models with the new alignment, and repeat the process.

How exactly do I estimate the alignment? I've sort of magically hidden that step in a few words, so let's see how to do it. The business of finding an alignment is this: you are given an order-synchronous output, and you are trying to assign exactly one of its symbols to each input, such that the symbols occur in the same order as in the target. For notation, you are given the target symbol sequence S(0) through S(K-1); here I have A B C, so K = 3, with S(0) = A, S(1) = B, S(2) = C. You have an input of some length N, and you want to find an expansion of this compressed output. The target symbols have unique indices, and I will write s_t for the symbol assigned to time t; so, for instance, the symbol assigned to time 0 might be S(0), the symbol assigned to time 2 might be S(1), and so on. That's just the notation. What we want is an alignment of the target symbol sequence, A B C in this example, or B IY F IY ("beefy") in the example I'll use on the board, to the inputs.

What's the easiest way of doing this? Keep in mind that what the network really gives me is a vector of probabilities at each time: at t = 0, 1, 2, and so on, I have a vector of probabilities over all the symbols, this entire matrix. Given that, think of how I can align A B C to the input. I have all of these possibilities: I could do A B C C C C C C C, that's one alignment; I could do A B B C C C C C C, that's a different alignment; I could do A A B C C C C C C, and so on. How many such alignments can I get? An exponentially large number.
I have to pick one of those alignments based on this matrix of probabilities. Which one will I pick? As a heuristic, I'm going to pick the most probable one: the most probable alignment of the target output to the input. How do we go about doing this? One simple solution is to just decode the output: pick the most likely symbol at each time. If I do that, am I going to get anything of this kind? Not necessarily. In this example I'm trying to align the input to the word "beefy", but if I pick the most likely symbol at each time, the very first symbol I get may have nothing to do with my target; that's wishful thinking, it's not going to work. So how do I constrain the decode to actually give me an alignment of what I want?

Firstly, if I just decode freely, I'm going to get symbols that are not even in the target output, so the first thing I must do is eliminate all the rows of this table that are not present in my target output. I'm trying to get "beefy", so the only symbols that really matter are B, IY, and F: I run my network, get the vector of probabilities at each time, and pull out the row corresponding to B, the row corresponding to IY, and the row corresponding to F, and that reduced grid is all I work with. But if I just picked the most likely of these three symbols at each time, is that guaranteed to give me "beefy"? No, because although I get the right symbols, the order in which they happen might be quite arbitrary; I might get an F before I have produced the first IY, which is not an alignment of "beefy" to my input. So I need more constraints: it is not sufficient to say I will only output these symbols, I want these symbols to occur in the right order, so that they give me an alignment of "beefy" to the input.

Here's what I'm going to do: I'm going to pull out the appropriate rows, but arrange them so that when I go from top to bottom I get the output I want. The first row is B, the second row is IY, the third row is F, and the fourth row is IY again. Although the network itself only computes the probability of IY once, that row occurs only once in the output of the network, when I pull the rows out and create my table I put in two copies of the IY row. All I did was grab that row and make two copies of it. Why do I want this? Because the upper table is now a matrix of probabilities, and I'm going to thread a path through this matrix with some constraints, and the constraints will ensure that the corresponding symbol sequence is exactly B IY F IY. And what is the constraint I want? I want the path to explicitly start at the top left and explicitly end at the bottom right, because I know the first symbol is B and the last symbol is IY.
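Here is a minimal sketch of that table construction: pull out, in target order and with repeats, the rows of the network's probability matrix that correspond to the target symbols. The symbol indices in the example are placeholders.

```python
import numpy as np

def build_alignment_grid(output_probs, target_sequence):
    """
    output_probs:    (T, K) matrix of per-time symbol probabilities.
    target_sequence: list of symbol indices in target order, e.g. [B, IY, F, IY].
                     A repeated symbol gets its row duplicated.
    Returns a (len(target_sequence), T) grid; a top-left-to-bottom-right path
    through it is a candidate alignment of the target to the input.
    """
    return output_probs[:, target_sequence].T     # one row per target symbol

# Toy example: alphabet of 5 symbols, 6 frames, target uses symbol 2 twice.
probs = np.random.dirichlet(np.ones(5), size=6)   # each row sums to 1
grid = build_alignment_grid(probs, [0, 2, 4, 2])  # stand-in for B, IY, F, IY
print(grid.shape)                                  # (4, 6)
```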
In this example I have nine time instants, indexed 0 through 8, and four symbol rows. How can I enforce the constraint? I know that along my path every row must be visited, and if every row must be visited, then at any time, if I am outputting a particular symbol, at the previous time I must have output either the same symbol or the symbol just before it. This very simple constraint guarantees that all of the symbols occur and that they occur in the right sequence. For example, if at time 0 I output B, then at time 1 I must output either B again or IY; I cannot output F, because then I would have skipped the IY in between and this would no longer represent a valid alignment. The arrows shown there indicate the valid successors of each node: in this particular example each node has only two possible successors, itself at the next instant or the next symbol at the next instant.

So I can think of the entire thing as a graph, and I want to find a path through the graph from the top left to the bottom right; specifically, the path with the highest probability, which gives me the most probable alignment to the input. What is the probability of a particular path? Simply the product of the probabilities of the individual symbols on the path. The path shown in blue, for example, has probability y_0^B · y_1^B · y_2^IY · y_3^IY · y_4^F · ..., where the subscript is the time, the superscript is the symbol, and y is the probability output by the network. Nothing magical; we're just explaining the representation, but everything is going to build off it, so if you miss this the rest of the lecture will be lost on you. To answer a question: yes, you are guaranteed to pick the most probable sequence by doing things this way, because the only constraint is that the path has to start at B and end at IY, so the path with the highest probability is the most probable alignment.

How can I actually find this most probable path? There are an exponential number of such paths, and you would have to evaluate all of them if you did it naively; what we will really do is use the Viterbi algorithm, which is dynamic programming. For those who don't know it, the Viterbi algorithm is based on a very simple observation. When I speak of the best path, I mean the path with the highest probability. If I think of the best path from the source to a given node, there are only two possible predecessors of that node, and here is the key, the magic: the best path to this node is either an extension
of the best path to one predecessor or an extension of the best path to the other. There is no other possibility, because if you chose anything but the best path to a predecessor, it already has a lower probability than the best path to that predecessor, so it could not extend into the best path to the final node. Very simple; this is the basic logic we're going to recurse on. If I write it down: the best path from y_0^B to y_3^F is either the best path from y_0^B to y_2^IY followed by y_3^F, or the best path from y_0^B to y_2^F followed by y_3^F. I can collapse these two and say: among the parents, I pick the parent whose own best path is best, and then I append the final symbol. So the problem reduces to picking the best parent, and the best parent is simply the parent with the highest best-path score. (Terminology: the parents of a node are the nodes from which you can get to it; its children are the nodes you can go to from it.) This logic is straightforward, and the Viterbi algorithm simply builds on it: you dynamically track, for every node in the graph, the best incoming parent and the score of the best path from the source node to the current node through that parent, and eventually you compute the best path from source to sink.

Here is the overall algorithm, using the same notation as before: y_t^{S(r)} is the network's probability at time t for the symbol assigned to the r-th row, so in our example S(0) = B, S(1) = IY, S(2) = F, S(3) = IY. How do we begin? The best path to the top-left node is just the node itself; there is nothing before it, so its best-path score is simply its own node probability, y_0^{S(0)}. The remaining nodes in the first column are impossible: our constraint is that the path starts at the top left, so they get a probability of zero, or minus infinity in the log domain, whichever you prefer. Then I go from left to right. There is a special case for the top row: each node in the top row has only one parent, so its best parent is that one node, and its score is the best score of the parent at time t-1 times the probability of the current symbol; any time you attach a new symbol, you just multiply its probability in.
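In symbols (again my notation), with BestScore(t, r) the probability of the best path from the top-left node to row r at time t, the recursion just described is:

$$\mathrm{BestScore}(0,0) = y_0^{S(0)}, \qquad \mathrm{BestScore}(0,r) = 0 \ \text{ for } r>0,$$
$$\mathrm{BestScore}(t,r) = y_t^{S(r)}\cdot \max\big(\mathrm{BestScore}(t-1,r),\; \mathrm{BestScore}(t-1,r-1)\big),$$

with the best parent of (t, r) recorded as whichever of the two terms achieves the max.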
As you go down the rest of the column, you use the logic from before: for each node, pick the best parent, the one with the highest best-path score, and then multiply the probability of the current symbol by the best parent's score to get the best-path score for the current node. Having computed these terms for the first column, I use the same logic to compute them for the second column, then the third, and so on, looping left to right. At each node you compute two things: the best parent, and the best-path score from the beginning to that node. When you're done you have best-path scores for all the nodes in the final column, and a best parent for each, but we're only interested in the bottom-right node, because we know that at the final instant the symbol must be the last one. Now you trace your way back: find that node's best parent, then that parent's best parent, and so on all the way back, and that gives you the entire best path. So, given the current network, which gives you output probabilities for every symbol, this gives you the most likely alignment of the target symbol sequence to the input.

A question about complexity: no, it is not exponential. In each column you go through the rows; for each row, in the worst case, you look at every entry in the previous column, so a single column takes on the order of K² computations if the target sequence has K symbols. (In our graph each node has only two parents, but even in the horrible case where every node in the previous column is a parent, it is K per node and K² per column.) You do this for T columns, so the total is O(K²T): worst-case quadratic in the length of the target sequence and linear in T, or cubic in T if K is comparable to T. No longer exponential.

So what is the output of this algorithm? A sequence of symbols. There is some pseudocode for the Viterbi algorithm on the slides that you can check later, but the point is that I now have a sequence of symbols, derived from the current model, that I can use as my alignment. With that alignment I can compute my divergence, back-propagate it, and train my network.
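Here is a compact numpy sketch of the Viterbi alignment over the grid built earlier; it works in log probabilities for numerical stability (which the lecture doesn't bother with), and the variable names are mine.

```python
import numpy as np

def viterbi_align(grid):
    """
    grid: (R, T) matrix, grid[r, t] = y_t^{S(r)}, rows in target order.
    Returns the most probable alignment as a length-T list of row indices,
    constrained to start at row 0, end at row R-1, and move down by 0 or 1.
    """
    R, T = grid.shape
    logp = np.log(grid + 1e-30)
    score = np.full((R, T), -np.inf)
    parent = np.zeros((R, T), dtype=int)
    score[0, 0] = logp[0, 0]                          # path must start at the top-left node
    for t in range(1, T):
        for r in range(min(R, t + 1)):                # row r is unreachable before time r
            best_r = r
            if r > 0 and score[r - 1, t - 1] > score[r, t - 1]:
                best_r = r - 1                        # better to come from the previous row
            parent[r, t] = best_r
            score[r, t] = score[best_r, t - 1] + logp[r, t]
    path = [R - 1]                                     # trace back from the bottom-right node
    for t in range(T - 1, 0, -1):
        path.append(parent[path[-1], t])
    return path[::-1]

grid = np.random.dirichlet(np.ones(4), size=9).T       # 4 target rows, 9 frames
print(viterbi_align(grid))                              # e.g. [0, 0, 1, 1, 2, 2, 3, 3, 3]
```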
To answer a question: yes, you are always using the output from the network to compute the alignment. For every training instance, you pass the instance through the network, get the matrix of probabilities, and use that matrix to compute a new alignment. Let me go through a couple of slides and then you can ask the question again, but first hold the thought and assume this is the alignment you got. If so, the divergence is now defined with respect to the symbols on the best path, because the alignment was not given to you; it is the estimated, best-path alignment. Having done that, when I compute the derivative of the divergence at each time, I get minus one over the probability of the symbol in the estimated alignment at that time, and zero for the rest. Nothing in the formula has changed, except that the target symbol now comes from the estimated alignment.

The overall process is going to be something like this: initialize the alignments; train the model with the given alignments; then go back and recompute the alignments for each training instance with the trained model; use the recomputed alignments to train the model again; and keep iterating. Do you train to convergence before recomputing, or not? You can do either: determine the alignment for every training instance and then train the model to convergence before recomputing the alignments, or, for each input, first find the alignment and then immediately use that estimated alignment to update the model. The second approach is clearly far more efficient, and as it turns out it actually works better.

Now, this works, and it makes sense, but if I told you it had a problem, what would you think the problem was? You are guaranteed that you're picking the most likely alignment each time, and when you then optimize the network parameters you increase the probability of the alignment you chose, so the whole thing has a likelihood-maximizing, divergence-minimizing character; it is guaranteed to converge in the same sense that SGD is guaranteed to converge: you pick the most likely alignment, you increase its probability, and it keeps improving. The problem is different: it depends heavily on the initial alignment. We started from a random alignment; what if it was a really bad one, because whoever chose it forgot her morning coffee? Then it gives you a really bad initial estimate, so the procedure is prone to local optima and heavily dependent on the initialization. So what is the alternative? Let's not pick a single alignment; let's consider every possible alignment.
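Before moving to that alternative, here is a sketch of the iterative align-and-train loop just described, reusing build_alignment_grid and viterbi_align from the earlier sketches; forward_fn and update_fn are placeholders for whatever model and optimizer you are using, not a specific framework API.

```python
def train_with_estimated_alignments(forward_fn, update_fn, data, epochs):
    """
    forward_fn(x) -> (T, K) matrix of output probabilities under the current model.
    update_fn(x, frame_targets) -> one gradient step on the time-synchronous
        divergence -sum_t log Y_t(frame_targets[t]).
    data: list of (input_sequence, target_symbol_index_sequence) pairs.
    """
    for _ in range(epochs):
        for x, target in data:
            probs = forward_fn(x)                         # current model's probabilities
            grid = build_alignment_grid(probs, target)    # rows in target order
            rows = viterbi_align(grid)                    # best row index per frame
            frame_targets = [target[r] for r in rows]     # time-synchronous targets
            update_fn(x, frame_targets)                   # re-estimate, then update
```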
How would we consider every possible alignment? First, what are we really doing? We are estimating an alignment, and for the most likely alignment we are computing a divergence. And what is that divergence? The sum over all time instants of the local divergence, which is simply the negative log probability of the aligned symbol; we've seen this, and we are doing it for the most likely alignment only. But here's what I can do instead: I can think of this entire graph as defining a probability distribution. Every path through the graph is a valid, though not necessarily the most probable, alignment. How many such paths are there? An exponential number. So instead of committing to the one alignment that happened to be the most likely, I can compute the probability of each of these alignments; I have the graph, I have all the terms, so I can compute a probability for each alignment, and each alignment also has its own divergence. What we did when we picked the most likely alignment was to go through these probabilities, choose the highest one, and use the corresponding divergence. Why choose just one? Let's take a weighted combination of all of them: instead of picking the most likely alignment and using its divergence, let's use the expected divergence over all paths.

To answer a question: where do the probabilities come from initially? In the beginning, your initial parameter estimates give you probabilities; they are not random numbers, they form a probability distribution. All we have done is change the perspective. Earlier, you initialized your model somehow, picked the most likely alignment based on that model, used the alignment to update the model, and then based on the new model picked the most likely alignment again, repeating the process; in other words, you minimized the divergence for the most likely alignment. What I am suggesting now is that instead of minimizing the divergence for the most likely alignment, we minimize the expected divergence across all alignments given the current model. The expectation is computed using probabilities from the current model; as the model improves, those expectations will change. In the first iteration the current model is whatever initialization you have; earlier we used it to compute a divergence for the best path and used that divergence to update the model, getting a new model, which we then used to estimate a new alignment and a new divergence, and so on. What we are going to do instead is use the current model to compute the expected divergence, so all of the probability terms are going to be functions of the current
model, and that is what gives me the expected divergence. Still lost? It's okay, we have time, so let's walk through it with the figure. Where did those y terms come from in the very beginning? We didn't assign them by hand: we assigned parameters to the network and fed the input through it. We had the initial network, and at each time it gave us probabilities for every output symbol, because there is a softmax at each time computing probabilities over all the symbols. From these I constructed the table, based on my target sequence. For this training instance I know the target output sequence is "beefy", so from the network's output I grabbed the B row and put it on top, grabbed the IY row and put it second, the F row third, and the IY row again fourth; I composed the upper table from the table at the bottom, and all the probabilities in it come from the network. Earlier, I used that table to compute the best alignment and then used that alignment to go back and retrain the network.

Instead of choosing any one alignment, what I'm now saying is that I'm going to use all of the alignments, each with its corresponding probability. Alignments themselves are not differentiable, but the divergence computed from an alignment is differentiable, and each alignment gives its own divergence. Depending on which alignment I choose, I get a different divergence, so I'm going to compute the expected divergence across all alignments, and that is what I will minimize. The apparent problem, of course, is that you would have to sum over an exponential number of alignments to compute that expectation; except that you don't, and here is why.

For any given alignment, the divergence is the sum over time of the negative log of the probability assigned at each time to the symbol the alignment places there. What I want is the expectation of this divergence across all alignments. Now, what is the expectation of a sum? Expectation is
a linear operation, so I can move the expectation inside the sum. I could spend an hour giving you the intuition behind this, but you can take it on faith: the expectation of a sum is the sum of the expectations. (Someone asked whether this is the entropy; not really, it is a cross-entropy.) So here's what we get: at each time you are minimizing the expected value of -log y. You might wonder what exactly that is, so let's unpack it. You are summing over all time, going over all the columns, and within each column you are minimizing this term. What is this term? Look within a column; in this example there are four nodes in it. What we are asking is: in the true alignment, what is the probability that the symbol at this time instant, say time 4, is the second IY, times log y_4^IY; plus the probability that at time 4 the symbol is F, times log y_4^F; and so on. You are just walking down the column and asking, for each box, what is the probability that this was the symbol that occurred at this time in the true alignment, given everything that we know. And what do we know? We know the input and we know the target symbol sequence. So this formula literally sums over the columns, and within each column, at each time, it computes the a posteriori probability of that box given everything we know, and multiplies that posterior probability by the log of the probability the network currently assigns to that box.

To answer another question: yes, we are summing over the whole column, but the graph is still there; I haven't eliminated the arrows. So, for example, what is the probability that in the true alignment the symbol at time 0 is the second IY? Zero; it can never happen, and the graph structure enforces that. So this term is the probability of seeing the specific symbol S(r) at time t, given that the target symbol sequence is an expansion aligned to this input, and you multiply it by the log probability the network assigns to that box and sum over the entire column.

The real challenge is computing this a posteriori probability term, and I'm going to spend two or three slightly convoluted slides on it, but you already understand exactly what it is doing. By Bayes' rule, this posterior is proportional to the joint probability of producing the target symbol sequence and passing through this particular box. In other words, if I did an arbitrary decode of the network's output, what is the probability that the output is "beefy" and that, as I produce "beefy", at this time I am producing the second IY? I've gone from the conditional probability to the joint probability because the joint is easier to compute. In my example, say t = 3 and r = 2, the blue box.
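To keep track of what we are after, the quantity being minimized, written with γ(t, r) for the posterior probability of the box in row r at time t given the input and the target sequence, is:

$$E[\mathrm{DIV}] \;=\; \sum_t E\big[-\log y_t^{s_t}\big] \;=\; -\sum_t \sum_r \underbrace{P\big(s_t = S(r) \mid \mathbf{S}, \mathbf{X}\big)}_{\gamma(t,r)} \,\log y_t^{S(r)}.$$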
What exactly is this probability? It is the probability of going through this box while simultaneously producing the target symbol sequence. I'll skip a couple of slides that you can go through on your own, but the overall probability we're looking at is the total probability of all paths through this graph that go through that node: the probability of producing the symbol sequence "beefy" and, at time 3, producing the symbol IY. So this ugly term we want is simply the probability of the portion of the graph that goes through that box. There is a great deal of arithmetic written on the slide, but what it really says is that I can factor the graph into two portions: the part from the beginning up to and including that box, and the part after the box. The total probability is the probability of the first portion times the probability of the second portion conditioned on the first; this is just the chain rule of probability.

And here is something magical that happens. Think of our recurrent neural network: if I give you the input, where is the information about the past hidden? In the hidden layer, not in the output, unless you feed the output back to the input. So the probability of Y_t given Y_{t-1} and the hidden state is simply the probability of Y_t given the hidden state, because the outputs depend on the hidden layer without any direct dependence on the previous outputs. Direct dependence on previous outputs only happens in architectures where the output is explicitly fed back, which we saw earlier; for the rest, the recurrence is strictly through the hidden layer, and the hidden layer is deterministically dependent on the input. Which means, going back to our graph: if the input X is given, the probability of the later portion of the graph is independent of the earlier portion, because the graph only considers the outputs, and the outputs are conditionally independent given the input. Again, in a recurrent network I have inputs x1, x2, a hidden layer feeding forward in time, and outputs y1, y2; y2 has no direct dependence on y1, because all the information about y1 is already captured by the hidden node, unless I explicitly feed y back in, in which case there is a direct dependence. As a result we get this very beautiful situation: the probability of the second portion of the graph is independent of the first portion given the input, so the probability of the entire graph decomposes into two pieces, the probability of the portion up to and including the node, which I will call the forward probability, and the probability of the portion after the node, which I will call the backward probability.
Both portions have one thing in common: they are anchored at the node for the symbol IY at time 3 in this example. So I will call the forward probability α(t, r) and the backward probability β(t, r), because they are anchored to time t and row r, and the total probability of the portion of the graph passing through that node is α(t, r) times β(t, r). How do I determine these two quantities? If I can do Viterbi, I can do this. Think about it: every node has two children, itself at the next time or the next symbol at the next time, so from any node I can walk back through its parents to get the earlier subgraph and walk forward through its children to get the later subgraph; we'll look at how to do this computationally.

Here is the point: the magic now descends to computing the alpha and beta terms; if I can compute those, I can compute the probability of all paths through the node. Let's look at the first term, the alpha. The alpha is the probability of all paths that start from the source and end at IY at time 3. There are only two ways of arriving at that destination node, through one parent or through the other, so the probability of that subgraph is the sum of the probabilities of two subgraphs: the probability of the entire subgraph that arrives at the first parent, times the probability of stepping from there into this node, plus the probability of the entire subgraph that arrives at the second parent, times stepping from there into this node. Once I've written this down, the problem is solved, because the probability of the entire subgraph arriving at a parent is just that parent's alpha. If the alpha of this node is α_3(IY), then the probability of the subgraph arriving at the first parent is α_2(B) and at the second parent is α_2(IY): I've just defined the alphas at any time in terms of the alphas at the previous time. I can literally write α_3(IY) = α_2(B) · y_3^IY + α_2(IY) · y_3^IY. Any questions about this? It has to be clear, because everything builds on it. Now I can generalize: those were just the two parents of the node, so in general α(t, r) is the sum over all parents (predecessors) q of (t, r) of α(t-1, q), multiplied by y_t^{S(r)}. There is a lot of math on the slides that tries to say this more formally, but you really don't need it; the expression is self-explanatory.
In our particular example, because each node has only two parents, alpha(t, r) is simply alpha(t−1, r) plus alpha(t−1, r−1), times the probability of the current symbol, y(t, r). Very straightforward, and that recursion gives me everything required to compute every alpha at every time. I also have the initial condition: what is alpha(0, 1)? Alpha(0, 1) is just y(0, B), the probability of the very first node, since the entire subgraph going into that first node is that node alone. And alpha(0, r) is 0 for every other row, because our constraint is that at the first time instant you must be in the top-left box; there is no subgraph arriving at any of the other nodes, so by our constraints they have probability 0.

That takes care of the very first column. After that, the arithmetic is simple: for each node you sum the alphas over all of its parents and multiply by the current symbol probability to get the alpha for that node. You do this going down the column and you get the alphas for every entry; then you do the second column, the third column, the fourth column, and eventually you get your alphas all the way to the very end. What is the computational expense? N squared T — exactly the same as the Viterbi algorithm. So we've figured out how to compute the alpha term, and I can replace the first term over here: when I'm computing the total probability of the subgraph that goes through the node, there are two components, and the first, alpha(t, r), I now have an algorithm for.

The term that is left is the probability of the blue subgraph, and that is beta(t, r). I'm going to use the same trick of decomposing it. Let me draw the graph this way: beta(t, r) is the probability of the exposed subgraph, not including the orange box. The reason I've drawn the orange box is that it represents (t, r) — it is my anchor, the node to which the subgraph is attached as a child. Now, what are the two children of the orange box? Looking at this subgraph, I can either start in the first box or in the second box; those are its two children. So this probability is the probability of the subgraph that starts in the green box plus the probability of the subgraph that starts in the red box. Very simple; I haven't done anything fancy, I've just split it into two. Now I'll pull the probabilities y(4, F) and y(4, E) out of those two terms — it's still the same probability I'm writing — but the term inside the parentheses now looks familiar: it looks just like the term on top, only one step down.
You can see the symmetry. So what is this box? It's the beta term corresponding to this orange box, and the whole thing is written in terms of the betas one step to the right. I can generalize this: to compute beta(t, r), I go over all of its children; for each child I take the probability of the child at the next time and multiply it by the beta of that child, and then I sum the whole thing up. In generic terms, beta(t, r) is the sum, over all successors — children — q, of y(t+1, q) times beta(t+1, q). A very nice closed-form recursion, and I can use it in a backward pass.

The way the backward recursion works is this. First, the betas must be 0 for the upper nodes of the last column, because I'm required to arrive at the bottom-right corner when the alignment is done, so I assign a beta of 1 to the bottom-right node. Then I work my way backwards, because each beta is defined in terms of the betas at the next time instant: from these I can compute the betas at the previous time instant, and the one before that, all the way to the beginning. So now I've computed the betas for the entire graph, and the computation is again N squared T.

Having done all of this, I have both the alpha and the beta terms, and the product alpha(t, r) · beta(t, r) is the probability of all paths that go through the colored subgraph. But this is the joint probability of going through this node and producing this symbol sequence. If I want the conditional probability of the node given the symbol sequence, what do I do? I normalize: I divide by the sum, down the column, of the products of alpha and beta. I'm going to call this gamma(t, r), the posterior probability of the node given everything I have.

Once you have this, there's one final step which gives us the entire solution. Gamma(t, r) is the conditional probability of the node given the entire input and the target output sequence. You might be lost by now — what are we talking about, where is all of this coming from? Remember, we were looking at the expected divergence over all possible alignments, and the expected divergence ended up being the sum over all time instants of the expected local divergence within each column. The expected local divergence is simply the sum over the column of gamma(t, r) times the local divergence, which is minus the sum over the column of gamma(t, r) times the log of the probability of the symbol in each box of the column. So we've brought everything back: that is the divergence. Now, if I want to do backpropagation, I have to compute the derivative of this divergence with respect to each of the output symbols at each time.
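Before going on to the derivatives, here is a minimal numpy sketch of the forward-backward computation just described, for the simple graph without blanks. The function name, and the assumption that `y` is already the T × K table of probabilities restricted to the rows of the target sequence, are mine, purely for illustration:

```python
import numpy as np

def forward_backward(y):
    """Forward-backward over the simple alignment graph (no blanks).

    y[t, r] is the probability the network assigns at time t to the r-th
    symbol of the target sequence (rows already pulled out of the full
    softmax output).  Returns alpha, beta and the posterior gamma.
    """
    T, K = y.shape

    # Forward pass: alpha[t, r] = y[t, r] * (alpha[t-1, r] + alpha[t-1, r-1])
    alpha = np.zeros((T, K))
    alpha[0, 0] = y[0, 0]            # must start in the top-left node
    for t in range(1, T):
        for r in range(K):
            total = alpha[t - 1, r]
            if r > 0:
                total += alpha[t - 1, r - 1]
            alpha[t, r] = y[t, r] * total

    # Backward pass: beta[t, r] = sum over children of y[t+1, c] * beta[t+1, c]
    beta = np.zeros((T, K))
    beta[T - 1, K - 1] = 1.0         # must end in the bottom-right node
    for t in range(T - 2, -1, -1):
        for r in range(K):
            total = y[t + 1, r] * beta[t + 1, r]
            if r < K - 1:
                total += y[t + 1, r + 1] * beta[t + 1, r + 1]
            beta[t, r] = total

    # Posterior of each node given the input and the target sequence:
    # normalize alpha*beta down each column (time step).
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return alpha, beta, gamma
```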
If a symbol is not part of the target output, its derivative is zero, so you get nonzero terms only for the symbols that actually occur in the target output. But unlike the earlier case, where only one term was nonzero, here you'll have a whole bunch of nonzero terms — one for every symbol that occurs in the output. These terms must be computed from the derivative of the divergence, and it's easy once you notice one final trick. Take the symbol E: if I want the derivative of the divergence with respect to the probability of E, observe that E occurs in two different rows of the graph. So we sum the derivatives at both of those locations to get the derivative with respect to the symbol E at time t. That's the only extra step: you sum over all rows whose symbol is the one you want, computing the derivative within each row and adding them all up.

And how exactly do you compute that derivative? Straight chain rule. There are two components: one is gamma over y, and the second is the derivative of gamma with respect to log y, because gamma is itself a function of y — y appears in the graph. It turns out that taking the derivative of the second term is not terribly tedious, but for some reason pretty much all of the literature uses only the first term, and you can get away with that: there's a perfectly good justification if you think of it as a maximum-likelihood estimate rather than as minimizing the expected divergence. So now we know how to compute this term, which means we can compute the derivative with respect to every symbol at each time for the network, and we have all the terms required to perform backpropagation.

So here is the overall training procedure for sequence-to-sequence training: I am given input and output sequences without alignment, and I must train my models. Step one: set up the network — typically a many-layer LSTM — and initialize all its parameters. Step two: in the forward pass, compute the probabilities for every symbol at each time. Step three: pull out the appropriate rows from this probability table to compose the smaller table corresponding to the output I actually want from the network. Steps four and five: run the forward-backward computation on this graph to get all the alpha and beta terms. Step six: compute the derivative of the divergence with respect to the outputs of the network at each time, which of course we now know how to do. Then you aggregate these derivatives over mini-batches and pass them through the network by backpropagation. Questions? No? Okay. So, story so far: this whole thing is what we call connectionist temporal — I always forget the final C — classification, thank you.
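To make step six concrete, here is a sketch of the derivative computation just described, keeping only the gamma-over-y term, as the lecture notes the literature usually does. The function name and argument layout are mine, for illustration:

```python
import numpy as np

def divergence_grad(gamma, y_full, target_rows):
    """Step six: derivative of the expected divergence w.r.t. the network's
    outputs, keeping only the gamma / y term.

    gamma       : (T, K) posteriors over the rows of the alignment graph
    y_full      : (T, V) full softmax output of the network at every time
    target_rows : length-K list giving, for each row of the graph, the
                  index of its symbol in the full vocabulary
    """
    T, V = y_full.shape
    grad = np.zeros((T, V))
    for t in range(T):
        for r, s in enumerate(target_rows):
            # Rows that carry the same symbol all contribute to that
            # symbol's derivative at time t, so we sum over them.
            grad[t, s] -= gamma[t, r] / y_full[t, s]
    return grad
```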
So here's the summary so far: sequence-to-sequence networks which irregularly output symbols can be decoded by Viterbi decoding, but for training they require an alignment of the target output sequence to the input, and this alignment is generally not given. So you can either train by iteratively estimating the alignment and then updating the models, or by minimizing the expected divergence over all alignments. You're probably just waiting for me to say we're done and you can go, but I promised you an extra half hour, so here it comes — there's a key problem we haven't solved.

Think of this: suppose your Viterbi decoder gives you R R R O O D. Is this the word ROD or the word ROOD? Both are valid words. How do you know which one it is — should you squish all the O's into a single symbol, or into two separate symbols? There is no way of determining which you must do based just on this output. But can we do something within the model? That is the question, and that's exactly what we're going to do. It's exactly the same problem we were speaking of earlier: do I know whether it's a single F or two F's? I have no way of knowing.

Now, this kind of problem doesn't always occur. Your problem might have natural constraints which say that a symbol cannot repeat in the compressed symbol sequence. For example, if I'm doing speech recognition and recognizing phonemes, but I split each phoneme into three sequential sub-parts, then the second part of a phoneme can never begin anew immediately after the second part of a phoneme: after the second part I must get the third part, and if the second part occurs again it's just an extension of the same basic unit. So there are problems where repetition is not an issue. But in things like spelling, or writing, and other such cases, repetition can indeed be a problem, and that's what we'll have to deal with.

Here's how we will do it. We're going to use an extra symbol whose only purpose in life — you really have to envy this symbol's job, it has exactly one job — is to separate repetitions of other symbols. We'll call it a blank; it's typically represented using either a blank or an underscore, but you can choose any symbol you like — the concept is what's important. I'm going to use "-", and this "-" is an extra character that the network can now output. So if you did a best-path decode, the most likely decode may end up looking something like R R - O O - - D D D. When you have something like this, blanks have no significance by themselves, so it simply collapses to ROD. But now suppose a blank occurs between repetitions of R, like R R - R: this blank separates the run of R's into two, so the repeated R's collapse into a single R and the next R becomes an R by itself — this one collapses to R R followed by the O's; and again, if I have a blank between a D and the following D D, that gives D D. You see what the blank is doing: its only job is to separate repetitions. If a blank occurs between repetitions of a symbol, the two instances of the symbol on either side of the blank must be grouped separately.
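A tiny sketch of that collapse rule — the blank character and the example strings in the comments are just for illustration:

```python
def collapse(decode, blank="-"):
    """Collapse a frame-level decode using the blank rule described above:
    merge runs of the same symbol, then drop the blanks.  The blank's only
    job is to keep repetitions on either side of it from merging."""
    out = []
    prev = None
    for sym in decode:
        if sym != prev:          # a change of symbol starts a new group
            if sym != blank:
                out.append(sym)
        prev = sym
    return out

# e.g. collapse(list("RR-OO--DDD")) -> ['R', 'O', 'D']
#      collapse(list("RR-ROO-D"))   -> ['R', 'R', 'O', 'D']
```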
For example, here I have multiple blanks within the run of D's: if I collapse this, this group becomes one D, these four D's become a second D, and this becomes a third D — three separate D's in the output. Now we're going to use this artifice of the blank to figure out how to deal with repetitions. First, of course, for the blank to even be hypothesized, the network's output vocabulary must include this extra symbol, the blank. It's not just the standard symbols anymore: there's this extra symbol, and you'll be computing a probability for it too at every time.

Now let's say your decode looks like this and never touches the blank. Then it simply collapses as before: the run of E's collapses to a single E and the run of F's collapses to a single F. But suppose my decode is like this: the first blank doesn't matter, so this is still the B; these E's collapse to an E; these blanks no longer matter; these F's collapse to an F; and then the E. And now suppose I have this as my decode: I have the B again, the E, an F, then a blank, and then another F — so that has to give two separate F's. You see what's happening.

So how do we insert the blanks? This has to do with how you compose the graph. Earlier, this is how we composed the graph for B E E F — I've changed the example from BEEFY to BEEF simply because I wanted a repetition. If I composed the graph the old way, then in any path the two E's would get collapsed in sequence, so the graph doesn't distinguish between the single-E and double-E outputs; that graph is inaccurate. So we modify the graph like so: you introduce a blank between every two symbols, a blank before the first symbol, and a blank after the last symbol. And here's how I compose the graph. I'm allowed to have a blank before the first symbol, so at the first time instant I can have either the blank or the first symbol. I'm allowed to have a blank after the last symbol, so at the last time instant I can have either the final symbol or the blank. In between, I may or may not have a blank between two symbols that are different, but if two sequential symbols are the same, I must have a blank in between. All of those constraints are embodied in this graph.

Observe what is happening here. The edges are such that every path from the initial node to the final node unambiguously represents the target symbol sequence once you apply the blank-collapse rule. At the first time you can have either the blank or the initial symbol; at the last time you can have either the final symbol or the blank, as I mentioned. In between, I'm adding arrows that skip a row, but only in very specific places, as shown here: I add an arrow from B to E, because in my decode I'm allowed to produce an E immediately after a B; I also have the arrows from B to blank and blank to E, because that is also valid — both are perfectly valid and both collapse to B E. But from E to E I cannot have a direct arrow, because if I did, the two E's would get collapsed into a single E.
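As a sketch of how those constraints can be encoded — the helper names and the B E E F example are mine, for illustration only:

```python
def extend_with_blanks(target, blank="-"):
    """Insert a blank before the first symbol, after the last symbol and
    between every pair of symbols, as in the modified graph above.
    e.g. ['B','E','E','F'] -> ['-','B','-','E','-','E','-','F','-']"""
    ext = [blank]
    for sym in target:
        ext.extend([sym, blank])
    return ext

def parents(ext, r, blank="-"):
    """Rows of the extended sequence that may precede row r at the previous
    time step.  The skip (r-2 -> r) is allowed only when row r is a real
    symbol that differs from the symbol two rows up; the blank between two
    identical symbols can never be skipped."""
    p = [r, r - 1]
    if r >= 2 and ext[r] != blank and ext[r] != ext[r - 2]:
        p.append(r - 2)
    return [q for q in p if q >= 0]
```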
So between E and E I put a blank which cannot be skipped. Between E and F, again, I may or may not have a blank, so I have a direct connection from the E to the F. And I don't have connections from blank to blank, because if I did, the intermediate symbol could get skipped. Now, if you take any path from the top to the bottom in the graph on the left, it gives you B E E F; it doesn't matter how you take it, every single path gives B E E F. And the graph on the right follows exactly the same rules, except the arrows go to the next time instant: I'm allowed to go from B to E, so I have that arrow; I'm allowed to go from E to F, so I have that arrow; but I'm not allowed to go from E to E, so there is no skip arrow from one E to the other. That's how the graph is formed — nothing fancy, we just composed the graph so that it embodies the constraints about blanks.

And now we have the modified forward algorithm, which is exactly the same as before except that when you speak of parents and children, they are the parents and children with respect to this new graph. You can run the forward algorithm and the backward algorithm on this graph and the same logic still holds. In the forward algorithm you'll find two separate cases for this graph: sometimes a node has two parents, sometimes three, depending on the constraints; look at the slides for the details, but you should be able to work this out yourself. The same goes for the backward algorithm, starting at the end. The one extra thing in the forward algorithm is that at the first time instant you're allowed two nodes, not just one, because you can begin at either the blank or the first symbol; likewise, in the backward algorithm, at the final time instant you're allowed two nodes, because you can end at either the blank or the final symbol. Then you work your way backwards as before.

The overall training process is now very simple. Given input and output sequences without alignments, you set up your network — I keep saying unidirectional LSTMs, but it doesn't have to be one. Whatever computes the probability matrix of Y's can be a unidirectional LSTM, a bidirectional LSTM, or not an LSTM at all: it could just be a stack of MLPs, or a convolutional network, because as far as the CTC algorithm on top is concerned, it only needs the Y's; it doesn't care where they came from. After initializing all the parameters of the network, you include a blank symbol in the vocabulary. Then you pass each training instance through the network and compute all of these probabilities, including the probabilities for the blank; you use the symbol probabilities to compose the graph with the appropriate edges; you perform the forward-backward algorithm to get the alphas and betas; and you compute the derivative of the divergence — the formula is exactly the same as before.
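A sketch of what the forward pass looks like on the blank-augmented graph, reusing the hypothetical `parents` helper from the earlier sketch — again, the names and layout are illustrative, not from the slides:

```python
import numpy as np

def ctc_forward(y_ext, ext, blank="-"):
    """Forward recursion over the blank-augmented graph.

    y_ext[t, r] is the network's probability at time t for the r-th row of
    the extended sequence `ext` (blanks interleaved with the target
    symbols).  The recursion is the same as before; only the parent
    structure changes, and two start nodes are allowed.
    """
    T, K = y_ext.shape
    alpha = np.zeros((T, K))
    alpha[0, 0] = y_ext[0, 0]   # start at the leading blank ...
    alpha[0, 1] = y_ext[0, 1]   # ... or at the first real symbol
    for t in range(1, T):
        for r in range(K):
            alpha[t, r] = y_ext[t, r] * sum(alpha[t - 1, q]
                                            for q in parents(ext, r, blank))
    # The total probability of the target sequence sums the two final rows:
    # ending on the last symbol, or on the trailing blank.
    total = alpha[T - 1, K - 1] + alpha[T - 1, K - 2]
    return alpha, total
```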
Now you can pass the derivative of the divergence back down through the rest of the network. Very straightforward, nothing fancy. This overall framework is what's referred to as connectionist temporal classification, or CTC. It applies when duplicated labels in the output are acceptable and when the output sequence is shorter than the input sequence. Questions? I'm going to spend the next ten minutes on a problem I alluded to at the very beginning of the class, so if you have any questions, ask them now. Or else — Leila, anything on Piazza? No, there's nobody on Piazza; everybody's on spring break.

So, an old decoding problem. I told you that when I use the greedy decoding algorithm to pick the most likely alignment, or the most likely symbol sequence, this is not necessarily optimal. Why? Because the most likely time-synchronous symbol sequence doesn't necessarily represent the most likely order-synchronous symbol sequence: to compute the probability of an order-synchronous output you must sum over all alignments, and the greedy algorithm picks only one of them. It turns out that this business of introducing blanks somehow makes things easier: even the simple greedy algorithm works fine if you introduce blanks and the network is properly trained. What happens is that at most times the most likely output is the blank, except when there's a symbol change, at which point the probability of the appropriate symbol jumps up. So a greedy decode will actually give you a pretty good output using these blanks in CTC.

Nevertheless, if you do the optimal decode you can do better. So how do you do the optimal decode? The actual objective of decoding is to find the most likely order-synchronous but time-asynchronous output, not the most likely time-synchronous output — and we've seen the distinction between the two. What Viterbi finds is the most likely time-synchronous symbol sequence, which you then have to compress. The likelihood of an order-synchronous, time-asynchronous symbol sequence is not given by the Viterbi decode; it's given by the forward probability, because the forward probability sums over all possible alignments. If you go back to the graph we composed: what is the sum of the forward probabilities of these two final nodes? It's the total probability of the entire graph, and that is simply the probability of the symbol sequence on the left-hand side — the total probability of the order-synchronous symbol sequence. So if you really wanted to perform decoding, you'd compose this graph for every possible symbol sequence, compute the alpha at the bottom right, and pick the symbol sequence for which that alpha is highest. Clearly this is not going to work — you have an exponential number of sequences. So how can we get around this problem? Here's how: we are going to try to find the most likely order-synchronous, time-asynchronous symbol sequence directly.
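Before the optimal decoder, here is roughly what the simple greedy (best-path) decode described above looks like, reusing the earlier hypothetical `collapse` helper; `vocab`, an illustrative list mapping softmax columns to symbols, is also my assumption:

```python
def greedy_decode(y_full, vocab, blank="-"):
    """Best-path decode: pick the most likely symbol at every frame, then
    apply the blank-collapse rule.  Works reasonably well in practice
    because the blank dominates most frames, with symbol probabilities
    spiking only around changes."""
    best_path = [vocab[i] for i in y_full.argmax(axis=1)]
    return collapse(best_path, blank)
```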
For the optimal decode, then, we want the symbol sequence S such that alpha_S(K, T) — the alpha for the final symbol of that sequence at the final time instant — is maximized. Understand what we're doing here: we are finding the symbol sequence whose probability, computed for the order-synchronous sequence and summed over all possible alignments, is highest. The problem, of course, is that explicit computation of this would require evaluating an exponential number of sequences. So we'll do something simple: we'll organize all possible symbol sequences in a tree.

Think of it this way. Suppose I have only the blank, s1, and s2. What is the very first symbol that can occur? Let's assume for simplicity that the first symbol can only be a blank. Then what can follow the blank? You can't have another blank — I'm speaking of the order-synchronous symbol sequence — so after the blank I can have either an s1 or an s2. What can follow the s1? I can have an s2 directly, but can I have another s1 directly? No: I must have a blank and then the s1; and if I have a blank, s2 can also follow the blank. Same thing for s2: s1 can follow s2 immediately, but if I want an s2 to follow an s2, I need a blank in between. And from each of these nodes I get the same structure over again: the s2 structure follows every s2, the s1 structure follows every s1. This tree represents every possible symbol sequence that can be produced, and you can keep growing it to any depth.

So here's what I'm going to do: I'm going to compose a graph of this kind. The graph has the tree on the left-hand side — here I've drawn it bottom-to-top simply because that was easier to draw; in the rest of the slides it goes top-to-bottom. (And you, my friend, are totally asleep — we've got five more minutes.) The arrows from one column to the next follow the connections shown in the vertical tree, so every node here represents a unique symbol sequence, which means every row represents a unique symbol sequence, and the alpha computed at any point is the alpha for a unique symbol sequence. So I can compose this graph and compute my alphas over it, going left to right.

The problem, of course, is that this graph blows up very fast: if I have 100 symbols, each step can extend in 100 possible ways. So when you do a beam search, you still compose this graph, but within each column you pick the most likely entry, get rid of everything that is more than some distance below it, and keep expanding only what remains. And when you finally get to the end, remember that each symbol sequence is actually represented by two rows — one ending in the final symbol of the sequence, and one ending in the blank that follows the final symbol — and you'll be summing those two alphas.
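Going back a step, a tiny sketch of that per-column pruning: the data layout — a dict from partial symbol sequence to its alpha — and the default threshold are illustrative; the "ten raised to six" rule it implements is described just below:

```python
def prune_column(scores, threshold=1e-6):
    """Beam pruning for one column of the decoding graph: keep only the
    hypotheses whose alpha is within a factor of 1/threshold of the best
    alpha in the column; the rest are dropped before the column is
    expanded further."""
    best = max(scores.values())
    return {seq: a for seq, a in scores.items() if a >= best * threshold}
```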
Summing those two alphas gives the probability of the symbol sequence; you pick the symbol sequence for which this total alpha is largest, and that's your decode. This is the theoretically correct CTC decoder. In practice the graph gets exponentially large very quickly, so to prevent this we use pruning strategies to keep the graph and the computation manageable. Pruning always introduces some suboptimality, but the fact that the CTC decode naturally concentrates most of the probability at the boundaries — the blanks consume most of the probability in the graph, as we saw — allows for very efficient pruning, so CTC decoding tends to be efficient even though the graph would blow up without it. Yes — you go down the column and throw things away: you pick the highest probability, say you set a pruning threshold of ten raised to six, and then anything less than the highest probability divided by ten raised to six you just throw out.

So here's the final slide. Sequence-to-sequence networks which irregularly produce outputs can be trained either by iteratively aligning the target output to the input and doing time-synchronous training, or by optimizing the expected error — the expected divergence — over all possible alignments; that's CTC training. Distinct repetitions of symbols can be disambiguated from repetitions that merely represent an extension of a single symbol by using blanks. And decoding can be performed either by best-path decoding — Viterbi decoding, which is very easy to implement; as you saw, the pseudocode is only about ten lines and can literally be converted to Python — or by optimal CTC decoding based on the application of the forward algorithm with pruning. That too is simple to implement, but it's not going to work as-is, because naive implementations run out of compute and memory really quickly; so if you want pruning and beam search, use libraries that somebody else wrote for you.

Let me stop here, with a few caveats. The blank structure is only one way to deal with the problem of repeating symbols; there are other variants, like partitioning symbols into two or more sequential sub-units (such as sub-phonemes), or symbol-specific blanks — I can have a different blank for s1 than for s2, for instance — and other such variants are possible. The underlying mechanism that computes the probabilities doesn't have to be a unidirectional LSTM: it could be bidirectional, it could be simple MLPs, it could be anything at all, in any number of combinations, but the basic idea we talked about still holds. I'll stop here; there are a few more slides you can look up on the course page. Questions? Thank you, and thank you for your patience.
Info
Channel: Carnegie Mellon University Deep Learning
Views: 2,915
Rating: 4.9130435 out of 5
Id: A8IhGQCurPc
Length: 113min 32sec (6812 seconds)
Published: Sat Mar 09 2019