(Old) Lecture 15 | (3/3) Recurrent Neural Networks
Recurrent neural networks can be extremely effective at modeling time series, both at making predictions and at analyzing them. We saw the example where a recurrent network actually wrote out a computer program, and the program is syntactically complete: every open parenthesis is closed, the variables are properly referred to, and so on. We also saw that when we want to analyze time series, if you are only interested in short-term time dependencies — if you have time series data where the prediction at any point depends only on a small window of inputs in the past — then you can use iterated structures. What we meant by iterated structures is structures where you iterate over the inputs and produce an output. These are finite-response systems; they have a short-term dependence on the past, and they are basically just time-delay neural networks, or convolutional networks. This also lines up with our earlier perspective that convolutional neural networks are basically just scanning the input with an MLP. But we saw that if you want to model time series with long-term dependence, specifically with arbitrary dependence on the past, then iterated structures are not sufficient; you want recurrent structures — things that are self-referential, that refer to their own values from the past. This is one of the standard structures that we saw: at each point you compute a hidden representation of the input, and that hidden representation depends on the hidden representation computed at the previous time instant. That is how we got our self-reference, and these things are able to refer all the way back into the past: something that happens here influences, down this chain, the output at this time instant. We also saw that recurrent structures can do what static structures cannot. Take the simple problem of training a neural network that learns to add two n-bit numbers. This is a completely static problem — there is nothing recurrent about the problem per se; I give you two numbers and you want to add them — but if you wanted to do it using a static structure, both the size of the network and the amount of training data you would have to provide could end up being exponential in the size of the input, and what you learn at the end is very specific to the size of the input. On the other hand, if I just replace that with a little recurrent structure, I get a much, much smaller network which also learns to generalize much better than the static structure. So here is a surprising result: by introducing self-reference — something that conceptually is a more complex structure — you end up, in practice, with structures that can learn to generalize, potentially even on problems that are solvable using static structures, and this unit can be trained using far fewer examples. There are only three input bits and two output bits; a total of 32 input-output combinations completely specifies everything that this unit must do.
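To make the addition example concrete, here is a minimal hand-coded sketch of the serial-adder idea — a fixed truth table rather than the lecture's trained MLP unit, so the values and names here are purely illustrative. The point is that one tiny recurrent unit with one bit of state (the carry) generalizes to operands of any length, which is exactly what a static, fixed-width network cannot do.

```python
# A hand-coded (not learned) sketch of the serial-adder idea: a tiny recurrent
# unit that sees one bit of each operand per step, plus a single bit of carried
# state. Because the same unit is applied at every step, it works for operands
# of any length.

def serial_add(a_bits, b_bits):
    """Add two equal-length bit lists, least-significant bit first."""
    carry = 0                      # the recurrent "hidden state"
    out = []
    for a, b in zip(a_bits, b_bits):
        s = a + b + carry          # 0..3
        out.append(s % 2)          # sum bit (one of the two outputs)
        carry = s // 2             # carry bit (fed back to the next step)
    out.append(carry)
    return out

# 13 + 11 = 24; bits are least-significant first: 1011 + 1101 -> 00011
print(serial_add([1, 0, 1, 1], [1, 1, 0, 1]))   # [0, 0, 0, 1, 1]
```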
Now, when we want to train these recurrent structures, we saw that they can be trained by minimizing the divergence between the sequence of outputs that the network produces and the sequence of desired outputs. When the network operates on a sequence of inputs, it's going to produce some outputs; the output could be a single value or a sequence, and in the most generic case we will assume that the output is also a sequence. There is a desired output: if you're training, you're given input examples where you have the input and the desired output for that example, and the desired output is also going to be a sequence. We must compute the divergence between the desired output and the actual output of the network, and we can then compute gradients of this divergence. The divergence is not between instantaneous outputs and instantaneous target outputs but over entire sequences — again, to reiterate — and we compute the derivative of this divergence with respect to the individual outputs and then backpropagate everything to train the parameters of the network. So the primary topic for today and the next couple of lectures is how we deal with this business of backpropagating divergences when the divergences themselves are somewhat imprecisely defined — well, the divergences are not imprecisely defined, but there is no clear way of assigning any component of the divergence uniquely to a specific output. But to continue with the story so far: we saw that recurrent networks can be unstable. You've all gone through the quiz, I assume, so you've done your little simulations to see how these things can forget — and I could have given you the reverse problem, where you would have seen how these things can explode. We saw that with linear activations, or even ReLU activations (which are basically just linear activations), for an improper setting of the weights the outputs can just blow up; they can blow up exponentially, so the network is not really remembering anything, it's just going crazy. On the other hand, if you try to control the behavior of the network so that it doesn't explode, then it has this unfortunate habit that for some activations it quickly saturates and forgets whatever it was that initiated the response of the network, while for other inputs and other activations it still ends up saturated. We found that the one where the response was most reasonable was the tanh activation, as opposed to, say, the sigmoid — and ReLUs of course have the tendency to blow up. So these have a stability issue.
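Here is a tiny illustrative simulation of that stability issue — my own toy setup, not the quiz's exact configuration: iterate a one-dimensional recurrence with no further input and watch what the "memory" of the initial input does under different weights and activations.

```python
import numpy as np

# Iterate h_t = f(w * h_{t-1}) from h_0 = 1 with no further input.
def run(f, w, steps=50):
    h = 1.0
    for _ in range(steps):
        h = f(w * h)
    return h

relu = lambda x: max(x, 0.0)
print(run(relu, 1.1))        # |w| > 1: blows up exponentially (~117 after 50 steps)
print(run(relu, 0.9))        # |w| < 1: the memory of h_0 decays to ~0.005
print(run(np.tanh, 2.0))     # tanh with large w: saturates near a fixed point (~0.96)
print(run(np.tanh, 0.9))     # tanh with small w: the response slowly decays toward 0
```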
We also saw a problem that's not specific to recurrent networks but is true for any deep network — and we already know that a recurrent network is basically a very deep network, where the depth is arbitrary: it depends on the length of the input and the length of the output. When networks become really deep, then as you backpropagate the errors, the derivatives either tend to vanish or, very locally, may blow up. Here we saw an illustration. This is not a recurrent network; it is just a straightforward multi-layer perceptron, 14 layers deep, where we were trying to classify MNIST data. This is the output layer, this is the input layer, and the brightness in the image shows you the magnitude of the gradient for each of the parameters of the network at different depths — and this was at initialization, where you expect the gradients to be at their steepest. What we see is that although there are some fairly strong gradients right at the output layer, as you begin walking backwards, by the time you are out at the tenth layer you don't really see the gradients at all: most of them are down to very, very small, insignificant values. A small number remain — obviously the network is responding to changes in the input — but what has happened is that all of the response has been pushed onto a small number of parameters, to which the network now becomes hypersensitive, and the rest of them basically don't matter: you can modify them quite a bit and the output is not going to change. So by the time you go from the fourteenth layer back to the input layer, the gradients are basically all vanishingly small or completely gone. We have this problem of vanishing gradients: as the network becomes deeper, when you try to propagate gradients backwards, the gradients become smaller and smaller, and it also depends on the activation function that you use. We actually saw why this must be true all the way back in lecture number two. In lecture two we said something about the units propagating information on to subsequent layers — does anybody remember what we said about that? We said that if I use threshold units and don't have the required number of units in the earlier layers, no information is passed downstream, and therefore the deeper layers cannot recover the patterns. Remember that? And we said that for the deeper layers to be able to recover the patterns, you want information to be pushed downstream, which means you want activations which are not thresholds but which actually carry some information about the distance from the boundary — and the more information these activations carry, the easier the network will find it to learn the patterns it must discover. That comes back over here. So what are the most informative activations in that sense? The ReLU: when you're on the positive side it just pushes all the information through; it's only on the negative side that it blocks it. Or you can have the softer versions of the ReLU — the leaky ReLU or the ELU — and you'd expect those to give you decent behavior. Other activations like sigmoids or tanh have this saturation behavior: once you are beyond a certain distance from the boundary, information about where you are is lost, and of course you can expect the derivatives to vanish. That is exactly what we have over here. This is an ELU activation — probably the most effective activation function when it comes to pushing derivatives through — and even with an ELU, by the tenth layer the derivatives are pretty much gone. We had other examples in the previous lecture with sigmoids and tanhs, and there we saw that the derivatives basically died much earlier than they did for the ELU.
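A back-of-the-envelope sketch of that arithmetic (this is not the MNIST experiment from the slide, just a worst-case bound): each layer you backpropagate through multiplies the gradient by the activation's derivative, so with saturating activations the product shrinks geometrically with depth.

```python
import numpy as np

# Worst-case gradient attenuation through `depth` layers if each layer
# contributes a factor equal to the activation derivative at its pre-activation.
def attenuation(deriv, depth):
    return deriv ** depth

dtanh_at_1 = 1 - np.tanh(1.0) ** 2          # tanh derivative one unit from the boundary (~0.42)
print(attenuation(0.25, 14))                # sigmoid: derivative is at most 0.25 -> ~3.7e-9 after 14 layers
print(attenuation(dtanh_at_1, 14))          # tanh, mildly saturated: ~5e-6 after 14 layers
print(attenuation(1.00, 14))                # ReLU/ELU on the positive side: passes the gradient unchanged
```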
Then we also saw the long-term dependency problem: the network tends to forget in a manner that depends primarily on the weights, not on the inputs, whereas what we really want is behavior where the system retains memory until it no longer has to. If you are trying to parse a C program, you want to remember that a brace was opened until it is closed, at which point the closure of the brace becomes irrelevant. Or if you are trying to analyze language, you have something like this: "Jane had a quick lunch in the bistro. Then she..." — you want the "she" to be triggered by the input and not by the distance from "Jane". If instead of Jane it had been Tom, you would want the output to be "he". But you don't want the next word to be "she" everywhere: after "had a", the word "she" is inappropriate. "She" only occurs in specific locations in the sentence, which means where it occurs depends on the input, and yet it must refer all the way back to "Jane" to decide whether it must be a "she" or a "he". So you have behavior that is triggered not only by what is remembered but also by what the current input is — the input determines what the next expected input or the next output must be — and so you want an input-triggered response. A vanilla recurrent network is not really input-triggered; it is parameter-triggered. In order to deal with this, we introduced the notion of LSTMs. Although LSTMs are not strictly input-triggered either — if you've gone through the quiz, you will have observed that at least one of the questions refers to the fact that these too tend to forget. Why is that? No matter how the entire thing has been set up, at a conceptual level you can still think of the whole thing as one very complicated function; it's a function which has parameters, and its behavior obviously depends on those parameters. It's just that the function is sufficiently complicated that its memory behavior is somewhat more — how shall I say — fine-grained, somewhat more capable of holding on to prior inputs and concepts that must be remembered, than a simple tanh activation. In this week's quiz, at least a couple of the questions will investigate the memory behavior of the LSTM. Anyway, the LSTM addresses the point of input-dependent memory behavior. The way we set it up, we had a cell C where the memory was not affected by the parameters directly, but was indirectly affected through these additional units — gates — which analyze the input to decide what must happen to the memory. And LSTM-based architectures are exactly identical to standard RNN-based architectures, except that each of these boxes is no longer just a single array of units; it's an LSTM cell, with many arrays of units inside. So if you had something like this, each of these boxes is going to be an array going into the page, and the LSTM cell itself is a complex unit with many components. Other than that, if you think of the array of units as being replaced by an LSTM cell, then this is just a standard recurrent network, where each of the green boxes gets some inputs, produces some outputs, and its behavior depends on its own values at the previous time instant.
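Here is a minimal sketch of a single LSTM cell step, using the standard gating equations with random placeholder weights rather than trained parameters — just to show how the cell state is edited through input-dependent gates instead of being overwritten directly by the recurrent weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b hold the stacked parameters for the forget, input, output and candidate paths.
    z = W @ x + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates (input-dependent)
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g                         # the memory is edited through the gates
    h = o * np.tanh(c)                             # exposed hidden state
    return h, c

d_in, d_h = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * d_h, d_in))
U = rng.normal(size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # run the same cell over a length-5 input sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)                # (4,) (4,)
```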
Now, once we agree that an LSTM is basically just another variant of a standard recurrent structure, you can do other things like bidirectional LSTMs, where the output is produced by analyzing the input both forwards and backwards and combining inferences made from both. And in all of this, the issue really becomes: how do you train the network — which is, how do you define the divergence between the actual output and the target output — and, in the case of bidirectional and standard LSTMs, how do we actually compute the outputs? For today's lecture I'm not going to spend a lot of time on bidirectional LSTMs; I'll be talking about unidirectional structures, but everything we say today and in the next lecture can very trivially be generalized to bidirectional structures, because, as we know from the previous lectures, there's really nothing magical about a bidirectional structure other than the fact that you have to see the entire input before you begin processing it. So what we are going to look at is how to train recurrent networks of different architectures. (In response to a question:) no — for a bidirectional structure, the forward network is completely independent of the backward network. We haven't yet put the modified slides on the course page, but the pseudocode should explain it very clearly: it's just two independent networks that pool their outputs to produce the result. It's like an ensemble, except one component works forwards and the other works backwards. What we are going to spend the rest of this lecture and the next one on is network architectures: what the different kinds of architectures are and how we train them. Specifically, for today and on Friday, we are going to look at synchronous outputs — either time-synchronous outputs, by which I mean the network produces outputs as it processes inputs and for every input it produces an output, or order-synchronous outputs, which means the network output need not be the same length as the network input, but the order in which the outputs occur is strictly related to the order in which the inputs occur. This applies only to some types of nets, and we will look not just at how to train these networks but at how to make predictions and inferences with them. So here are the various variants of networks that we've seen in the past. The basic one is the one-to-one network. The one-to-one network is not a recurrent network at all: it takes an input and produces an output — this is a standard MLP — and if you gave it a sequence of inputs, it would analyze each input individually and produce an output for each of them. Then we had the many-to-many structure, where you get a sequence of inputs and produce a sequence of outputs, but corresponding to each input symbol there is an output symbol, so there is a one-to-one correspondence between input and output: after it has analyzed the k-th input, it produces the k-th output. This is the kind of thing you would be doing, for instance, if you were trying to predict the stock market, making a prediction every single day about what you must do with your portfolio. Then we have the many-to-one: structures where you analyze an entire series of inputs and then produce a single output. One example of this kind of processing is question answering.
You listen to the entire question, then you produce an answer. Or if you are doing speech recognition, you watch the entire audio input and then say "this was the word hello" or "this was the word Alexa". And then there is the order-synchronous version of this. Observe that it simply looks like a concatenation of the basic many-to-one unit, and what is key here is that you get a sequence of outputs in response to a sequence of inputs. What I mean by order-synchronous is that if you stopped the inputs at this point, these two would be the outputs, but if I continued the inputs all the way to here, these three would be the outputs. So there is an order correspondence between the input and the output, but what is absent is the one-to-one correspondence: I have many inputs here and the number of outputs is smaller than the number of inputs, so you also have the problem of deciding when the outputs must occur. It is order-synchronous but not time-synchronous — see the difference? Then there are additional variants, which we won't get into in the next couple of classes, except maybe this one a little bit, which we will touch on today: here you analyze an entire sequence of inputs and then produce an entire sequence of outputs, or you take a single input and produce an entire sequence — the example we saw earlier, where the network produced an entire program, is of that kind. So let's look at these things in a little more detail. The simple one-to-one case: this is a one-to-one network, just a regular MLP, and if I have a sequence of inputs, the entire process is quite trivial — each input is individually processed, and from each input you get an output. So why do I include this in a lecture on recurrent networks, when we've seen it many times before and it's just a straightforward MLP? Here's why: it's about how we actually learn the parameters of the network. If you assume that you have a target output for each input and the actual output for each input, then during learning you can think of this entire collection as just a batch of inputs. On the other hand, if you think of the whole thing as processing a sequence of inputs — where each input is individually processed but what is produced is nevertheless an output sequence — then the divergence that I define is no longer between individual elements; the divergence is between the sequence of outputs and the sequence of target outputs. Do you see the difference? It's in the definition of the divergence. Now, what happened here is that the output at time t is not a function of the input at any other time, or of the output at any other time; the output at each time depends only on that time, so there is a one-to-one correspondence between target outputs and actual outputs. But the divergence can still be computed over the entire sequence, and so when you want to perform training, what you would want to do is take the derivative of this sequence-level divergence with respect to each of these outputs, in order to be able to backpropagate it.
In fact, the most generic case of analyzing time-series data with one-to-one networks is exactly this. Nevertheless, what we will often do is decompose it and say we are going to think of this global divergence as a weighted combination of divergences at individual time instants, and that collapses this back to the standard MLP structure. Once you do that, the derivative of the divergence with respect to any individual output is simply going to be the weight assigned to that output times the derivative of the local divergence for that particular time instant — the divergence between the actual output at that time and the target output at that time. So this really is a simplification, an assumption that we are making, and it makes things simple to work with, but you really want to be thinking of the process as trying to compute the derivative of the divergence between two sequences with respect to the individual outputs. Typically we will set the weights to one, which makes the backpropagation much simpler; but you could change this assumption and you would get some interesting behaviors. And once you decide that what you are really going to do is sum up the individual divergences at each time, the obvious divergence to use in the classification scenario is simply the cross-entropy between the actual output and the target output. Easy enough, right? Now let's look at this guy. We started with this, and we said that even though it's one-to-one and looks like an MLP, it really doesn't have to be treated as an MLP: once you think of it as producing sequences of outputs and you define divergences between sequences, the training paradigm changes. But then, for simplification, I went right back and broke it down to a simple collection of per-instant terms.
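To make that simplification concrete, here is a minimal numpy sketch — the values are illustrative, not anything from the slides — of the weighted per-time-step decomposition and of the derivative with respect to an individual output.

```python
import numpy as np

# DIV(Y, D) = sum_t w_t * Xent(Y_t, d_t), so dDIV/dY_t is just w_t times the
# derivative of that step's local divergence.
def sequence_divergence(probs, targets, weights):
    # probs:   (T, K) per-step output distributions (rows sum to 1)
    # targets: (T,)   integer target class at each step
    # weights: (T,)   importance assigned to each step
    T = len(targets)
    per_step = -np.log(probs[np.arange(T), targets])
    return float(np.sum(weights * per_step))

def grad_wrt_outputs(probs, targets, weights):
    # d(-w_t * log probs[t, target_t]) / d probs[t, k] is nonzero only at k = target_t
    T = len(targets)
    g = np.zeros_like(probs)
    g[np.arange(T), targets] = -weights / probs[np.arange(T), targets]
    return g

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
targets = np.array([0, 1, 2])
weights = np.ones(3)                                   # the usual choice: all ones
print(sequence_divergence(probs, targets, weights))    # ~1.50
print(grad_wrt_outputs(probs, targets, weights))       # nonzero only at the target entries
```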
All right, so now let's switch to the many-to-many scenario. When I switch to the many-to-many scenario, I'm actually going to take a brief detour into modeling language; I may not finish this lecture today, and if I don't, we'll just continue on Friday, so be warned. Here you have a sequence of inputs. Let's think of a scenario where you have a many-to-many relationship: in response to each input you must produce an output. Although I'm showing you a left-to-right network here, in practice you would really be looking at doing this bidirectionally, because very often the future tells you something about what you must output at the current time. Here's an example: part-of-speech tagging for text. You have a sequence of words — "two roads diverged in a yellow wood" — and for each of these words you want to output a part-of-speech tag. You want to be able to look at this and say "two" is a number, "roads" is a noun, "diverged" is a verb, "a" is a determiner, "yellow" is an adjective, "wood" is a noun, and "in" is — what is it, I forget. But the point is, there is a one-to-one correspondence between the inputs and the outputs, so you analyze the input and assign an output to each of these words — and observe that what you assign to any one of them also depends on what you assign to the others, so there is a recurrent relationship, and the model captures it. So the way you would do it: you process the input left to right and produce an output at each time, or you could do it bidirectionally — process the input left to right, also process it right to left, and combine the two hidden representations to assign the actual output value at each time. Both of these are perfectly feasible, and in fact for part-of-speech tagging the bidirectional structure makes a lot more sense than the unidirectional one, because knowing what follows tells you something about what tag to assign to any given word. But for the rest of the lecture I'm not going to consider bidirectional structures; I'll assume a unidirectional structure — again, as I said, this is easily generalized. So in this case, how do you train the network? You're given a collection of sequences of vector inputs. When I speak of a training instance, the training instance is a pair: the input and the target output. The input is actually a sequence of vectors; the output is also a sequence of vectors or scalars, depending on what kind of prediction you're performing — if you're performing classification, the output would be a sequence of one-hot vectors. So for each training input, during the forward pass you pass the entire input sequence through the network and generate outputs; during the backward pass you compute gradients using the divergence that you define between the sequence of outputs and the sequence of target outputs, and you backpropagate the gradients all the way through the network. Again, this is not just the sum of the divergences at individual times unless you explicitly define it that way; it is going to be some abstract function that you define between two sequences. So if you were performing backpropagation, what is the very first thing you would do? Depending on how you define the divergence function, the first thing you would need is the derivative — the gradient — of the divergence with respect to each of these outputs. Once you have that, the rest of backpropagation follows exactly as before: you go down here and multiply by the Jacobian of this box, then multiply by the weights, go down here — we saw the whole process earlier — and things aggregate as you go backwards. The key component in all of this is the definition of the divergence. What is this divergence? We're going to look at some simple examples for now, but these things can get fairly complex, and it turns out that if you end up with non-differentiable error functions that you want to convert into divergences, the problem actually becomes fairly difficult. In fact, we won't cover that in class, simply because it's maybe beyond the depth required for this particular intro class, but it's an interesting topic. One of the simplest assumptions you can make, even in this setting, is to assume that the divergence between the sequence of desired outputs and the sequence of actual outputs is simply the sum, over all of these time instants, of a local divergence computed between the local target output and the local actual output.
This is a simplification, an assumption that you are making, and having something of this kind means that if I want to compute the derivative of the global divergence with respect to, say, this output, it is sufficient for me to compute the derivative of the local divergence with respect to this output, because the global divergence is simply viewed as the sum of the local divergences at each time. And now this becomes simple. A typical divergence for classification problems, when you think of it as a sequence of local divergences, is again the cross-entropy between the target output and the actual output. Now, here is a simple example. We're going to spend a little time on this particular problem, and in particular look at analyzing text and language models for the next few minutes. Think of a model that tries to predict the next character given a sequence of characters, or maybe the next word given a sequence of words: after observing the inputs w0 through wk, it must predict wk+1. How would we do it? This is a nice little example from Andrej Karpathy — I picked it from his web page. The network has to look at a sequence of characters and predict the next character. It will look at "H" and must figure out that the next character is "E"; given "H", "E", it must figure out that the next character is "L"; given "H", "E", "L", it must figure out that the next character is again "L"; given "H", "E", "L", "L", it must figure out that the next character is "O". It's trying to spell — to predict — the word "hello". So here I'm speaking of characters: giving it an input of character sequences and predicting characters. Characters are not numbers, and we know that networks don't work with abstract symbols; they really need to be working with numbers. So what is the very first thing we would do? We're going to convert all of these to one-hot vectors. For example, let's say I have only four characters in my alphabet; the four characters would be H, E, L, O — or rather E, H, L, O if I arrange my characters in alphabetical order. Then the very first input, "H", is going to be [0 1 0 0] — that's the character H. So I've seen the character H; what should the network predict after seeing it? It must predict the character E. In order to predict E, what you really want the system to produce after seeing this input — it goes into whatever box, and out must come, over the positions E, H, L, O, a one over here (at E) and three zeros. That uniquely tells me that the next character must be an E. Now, this is not what it's actually going to produce. What do classification networks produce? We use a softmax at the output, and it produces a probability distribution. So when it produces a probability distribution, what you really want it to do is produce something like this, where the probability of E is high and the rest are lower than E, so that if you pick the most probable character, that character is E. And from here, the next input is the character E, and the character E has the one-hot representation [1 0 0 0] according to our lexical arrangement of the characters.
So having fed that to the network, you want the output at the next time to be small for E and H, large for L, and small for O. Basically, this is the kind of behavior you want from the network. Again, the point is: you're feeding it one-hot representations, the output is a probability distribution over the symbols, and it must ideally peak at the right output. So if I were trying to do this to predict language — to predict words or characters — then the input is a sequence of symbols represented as one-hot vectors, and the output is a probability distribution over the symbols. If I'm going left to right, then after having seen the inputs w0 through wt-1, what the network predicts next is a probability distribution which assigns a probability to each symbol in the vocabulary; here, for example, after seeing these two inputs it assigns different probabilities to E, H, L and O. Now, if I use a simple cross-entropy divergence between the target output and the actual output, then say the target output is [0 0 1 0] (the character L) and the actual output is (y1, y2, y3, y4); the cross-entropy is the negative of the sum over all i of d_i log y_i. The first term is log y1 times 0, so it disappears; the second term is log y2 times 0, so it too disappears; so does the fourth term; the only term that remains is this guy, which is -log y3 in our particular example. More generally — you look puzzled, okay — more generally, at each time the cross-entropy is simply the negative log of the probability assigned to the target output, and the overall divergence is the sum of this term over all time. And this is basically what we are going to use as our divergence to train the network.
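Here is a small sketch of that cross-entropy computation, assuming the alphabet E, H, L, O in that order (the probabilities are made up for illustration): with a one-hot target, every term of the sum vanishes except the one for the target symbol, so the per-step divergence is just the negative log of the probability assigned to the correct next character.

```python
import numpy as np

alphabet = ['E', 'H', 'L', 'O']
y = np.array([0.1, 0.1, 0.7, 0.1])       # network output after seeing "H", "E" (hoping for "L")
d = np.array([0, 0, 1, 0])               # one-hot target for "L"

print(-np.sum(d * np.log(y)))            # the full formula: -sum_i d_i log y_i
print(-np.log(y[alphabet.index('L')]))   # same thing: -log y_3 ~= 0.357
```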
Now, a brief detour into language models. What we really saw when I began trying to model text — look at a sequence of text and predict the next word — was an instance of modeling language. We're looking at modeling language using time-synchronous nets or, more generally, just trying to analyze language, so we'll spend a few minutes on language models and embeddings. It's an important topic, and I'm not spending quite enough time on it; hopefully we'll have a guest lecture that covers some of it, and hopefully also a recitation. Now consider this. Let's go back to the first example that we saw at the very beginning of this series on recurrent neural networks: what looked like a perfectly valid program produced by a recurrent neural network. Now let's look at a different problem and relate the two. Given the sequence of words "four score and seven years", if I ask you to predict the next word, pretty much anybody who's familiar with American history is going to say the word is "ago". Or if I gave you A-B-R-A-H-A-M L-I-N-C-O-L and asked you to predict the next character, you are almost certainly going to give me "N". So there's a very definite predictability of the next character or the next word based on the past, and this is what modeling language is about. Where did those predictions come from? From the fact that you're familiar with language — and, in this particular case, with history — but more importantly, there's the language aspect of it: what you're really doing is making language predictions. Modeling language is all about coming up with structures that can make predictions about language in this manner. The basic formula is going to be something like this: you represent the words as one-hot vectors based on a pre-specified vocabulary of N words in some fixed order — like here, where we used characters and ordered them lexically — so each word is represented by an N-dimensional vector with N-1 zeros and a single one. We're familiar with this; it's the standard one-hot representation. If you work with words rather than characters, the problem is that the number of words in a language can be arbitrarily large. For Russian, a vocabulary of 400,000 words is not at all unusual; for English, you're still going to need a vocabulary of about a hundred thousand words to cover enough of the language to model most of what you'll encounter. So the vocabulary can get very large. On the other hand, if you're looking at things at the character level, 100 symbols can cover pretty much all of the characters you'd encounter in day-to-day English — numbers, period, semicolon, spaces, tabs, the whole lot. So if you're modeling characters, your one-hot representation will have about 100 components; if you're representing words, it will have, for English, about 100,000 components. Now, when you're predicting words, what would you do? You can think of the predictor, the language model, as looking at the past n-1 words and making a prediction about the n-th word. If I were to draw this as a figure, you have some box F, the prediction model; it looks at the past n-1 words and predicts the n-th word. But these words must be represented in some manner, and the way we represent them is that each word is a one-hot vector. The output is also going to be — not necessarily a one-hot vector, it could be a probability distribution over the vocabulary — but the ideal output is a one-hot vector. So what is the problem here? I just spoke of the vocabulary of English: a hundred thousand words. So what is the dimensionality of the one-hot vector? A hundred thousand. What is the dimensionality of the space these vectors live in? A hundred thousand. Even if I'm working with characters, the characters live in a hundred-dimensional space. Now, if I consider a unit cube in a hundred-dimensional space, how many corners does it have? Two raised to the hundred, which is roughly ten to the thirty corners. And of all those possible corners, how many are you occupying with your characters? A hundred, because you have a hundred dimensions. So if you think of this in terms of volume, what volume of your space is your entire collection of possible inputs occupying?
Zero. And why is that? If I just look at my input vectors — think of the one-hot representation [0 1 0 0]: you know exactly what [0 1 0 0] represents; it represents the character H. Can you tell me what [0 1.000001 0 0] represents? What does that vector represent? Anyone? It doesn't represent anything. It doesn't matter what extra bits you add after the one, it doesn't matter how small the epsilon is — it represents nothing; there is no meaning to it. The total volume occupied by that point is zero, and the total volume occupied by all of your training data is zero. So we've come to the fact that we're using these immensely large, high-dimensional spaces, and yet we are occupying exactly zero volume in the space. Does that even begin to make sense? Do distances make sense? Do directions make sense? Do points make sense? None of that makes sense, because if you wiggle just a little bit off from wherever you started, you're off into nowhere — no-man's-land. So let's concede that, okay, we are willing to live with occupying zero volume. How about the density? If I look at a unit cube of side r, the volume of the cube is r raised to the number of dimensions, and the number of points I'm actually using in that space is just N. For example, over here [0 0 1 0] means something and [0 1 0 0] means something, but [0 1 1 0] doesn't mean anything at all. So I'm still at the corners of the cube, but most of my corners — basically all of my corners — are not being used. If I'm living in a hundred-thousand-dimensional space with a volume of r raised to the hundred thousand and I have exactly one hundred thousand points, the density is effectively zero. So if you're making such an inefficient use of the space, why do we use one-hot vectors? Anyone? Not quite — anybody else? Think about it from the perspective of a computer: is the word "floccinaucinihilipilification" any different from the word "duh"? The computer doesn't have language; these are just symbols, and one symbol is not more important than the next. What is the length of every single one-hot vector you are using? One. What is the distance between any two one-hot vectors? The square root of two. So you're coming up with a semantics-agnostic representation. The one-hot representation is the only way you have of not assigning a relative importance to the words or symbols going in; it's also one of the few ways we have of ensuring that the distance between any two words is the same regardless of what the words are, so you are not imposing any notion of closeness or distance on the language. If you used anything other than one-hot representations, you would have to make sure that the vectors you assigned to your symbols were such that the distances between them actually made sense — and how do you come up with that? It's a very subjective thing if you try to do it on your own. So, not knowing anything, you just say everything is exactly the same: the distance between any two points is the same as the distance between any other two points, and every vector is exactly the same as every other vector.
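As a quick check of those geometric claims (a trivial sketch with an assumed 100-character alphabet): every one-hot vector has length one, and the distance between any two distinct one-hot vectors is the square root of two, so no symbol is privileged and no notion of closeness is imposed.

```python
import numpy as np

V = 100                      # e.g. a 100-character alphabet
onehots = np.eye(V)          # each row is one symbol's one-hot vector
print(np.linalg.norm(onehots[3]))                  # 1.0
print(np.linalg.norm(onehots[3] - onehots[42]))    # 1.414... = sqrt(2)
```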
But the problem with this is that although you make no assumptions about the relationships between words or about their relative importance, the usage of the space is horrible. So what we would like to do is project all of our points into a lower-dimensional subspace. Now, when you project everything down to a lower-dimensional subspace, the total volume you use is still zero, because each point is still just a point — the projection of [0 ... 1.000001 ... 0] still doesn't mean anything, so you're not really improving on the volume property of the space — but you are improving in terms of the density of points. The problem then becomes: what is this projection, what is this hyperplane that I project onto, such that everything makes sense — such that the positions, or if not the positions then at least the distances between points, make sense? We're going to learn this plane, and if you learn it properly, the distances between projected points should ideally capture semantic relations; if you learn it from an analysis of language, then the geometric arrangement of all of your words on this projected plane should have some semantic meaning to it. Here's how we're going to do it. I'm going to take these one-hot vectors and project them onto some plane, and the projection is going to be done with some matrix P. We know that projection is just a linear operation: I take my vector, multiply it by P, and I get a projected version of the vector. So, to predict the next word, instead of looking at the words themselves, I'm going to look at projected versions of the words. Here is the same figure I had earlier, except that instead of working directly off the one-hot vectors, it works off the projected versions of the one-hot vectors. The projection takes the N-dimensional vector and zaps it down to some M-dimensional space, so the projection is an M × N matrix; P times W is an M-dimensional vector, and we're going to learn the matrix P using an appropriate objective. Now, how do you do this in the neural network context? It turns out it's actually very simple. Think of it this way: I start with an N-dimensional vector, multiply it by an M × N matrix, and end up with an M-dimensional vector. One way to look at it is that there is an N-dimensional input and an M-dimensional output, and each output is obtained as a linear combination of the N inputs. So I can think of this entire matrix operation — as we've seen before, this is not surprising — as just a simple one-layer network with a linear activation. Once I do that, all of these projections become one-layer networks with linear activations, with one extra constraint: this network is identical to this network is identical to this network, so all of those networks are shared-parameter networks, and we know exactly how to deal with shared-parameter networks. And now all we will do is use this entire structure to jointly learn the prediction and the projection.
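Here is a small sketch of that projection layer (the matrix is random and the sizes are scaled down for illustration; in practice P is learned jointly with the predictor, vocabularies are closer to 100,000 words, and embeddings are a few hundred dimensions): multiplying an M × N matrix by an N-dimensional one-hot vector just selects one column of the matrix, which is why the "embedding layer" is a one-layer linear network whose parameters are shared across all word positions.

```python
import numpy as np

N, M = 10_000, 16                   # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
P = rng.normal(size=(M, N))         # the projection matrix (to be learned in practice)

w = np.zeros(N)                     # one-hot vector for, say, word index 17
w[17] = 1.0
proj = P @ w                        # the explicit matrix multiply...
print(np.allclose(proj, P[:, 17]))  # ...is the same as picking column 17 of P: True
```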
Here's one of the early papers, by Bengio et al. in 2003, where they used a time-delay neural network to predict the next word. They looked at the past — in this example, the past four words — each of the four words is projected down and passed through an MLP to predict the fifth word; then they take w2 through w5 and predict w6, and so on. The dotted line over here just indicates that in their paper they also had direct connections from the outputs of these projections to the output layer — that's kind of irrelevant here — but this is the basic structure you can use: simply a time-delay neural network that looks at the past k words to predict the next word, the one distinction being that you first pass all of these words through the projection. Observe that at the output, the output is still one-hot — you're trying to predict the next word — and you're going to use the same structure as we had over here; the only thing that changed is that at the input you went from this uninformative N-dimensional space to some lower-dimensional space where maybe the arrangement actually makes sense, and you then use those representations to learn the language. You can have other variants. This is the so-called continuous bag of words, where, if you want to predict the word w4, you look at all of the words on either side; all of them are projected down to the lower-dimensional space, and instead of using them independently you do some kind of mean pooling and get a single vector that you use to make the prediction. What this means is that the order, the arrangement, of the words at the lower level doesn't really matter, and it also becomes insensitive to the number of words you use to make the prediction. These are just models that have been proposed. And here's the alternate model, where you use w7 to predict all the words on either side: you use each word to predict the past three words and the next three words, or the past five and the next five, and so on. Those are called skip-grams; the other is the continuous bag of words; there are other variants. Hopefully we'll cover these in a future recitation, but I was just trying to convey the idea of projecting things down and computing so-called word embeddings, or character embeddings.
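Here is a rough sketch of the bag-of-words idea just described — not the exact model from the slide; all names, sizes and the random weights are illustrative stand-ins for parameters that would actually be learned: project each context word with the shared matrix P, mean-pool the projections into a single vector, and score the missing word with a softmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

N, M = 1_000, 16                      # toy vocabulary and embedding sizes
rng = np.random.default_rng(1)
P = rng.normal(size=(M, N)) * 0.1     # shared input projection (would be learned)
W_out = rng.normal(size=(N, M)) * 0.1 # output layer (would be learned)

context_ids = [3, 17, 42, 99]         # words on either side of the missing word
pooled = np.mean([P[:, i] for i in context_ids], axis=0)   # mean pooling discards the order
probs = softmax(W_out @ pooled)       # distribution over the missing word
print(probs.shape, probs.sum())       # (1000,) ~1.0
```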
(A question from the class: could you predict the embedding at the output instead?) The point is, if you are looking at an embedding at the output, what do you mean by predicting a real-valued vector? Remember that L2 distance doesn't make sense here: the projection of [0 1 0 0] makes sense, but the projection of [0 1.000001 0 0] doesn't, so an L2-distance-based error, although differentiable, is not informative in the linguistic sense. Look at it this way — let's go back; we saw this, right? Why do I even need a cross-entropy divergence over here — can I use an L2 divergence? If you put a P over here, that's P going in; the flow of information is left to right. Then what is the output: are you making a prediction of the projection, or of the word itself? If you're predicting the projection of the word, how do you quantify the error? If you quantify it in a cross-entropy sense, then the projection becomes irrelevant; if you quantify it as an L2 divergence, does the L2 divergence make sense? See the point? But what people do find is that these projections are really kind of cool. You find, for example, that after you project things down into this lower-dimensional space, the distances between vectors carry information. For instance, the vector that translates China to Beijing is essentially the same vector that translates Russia to Moscow: take this dotted line and add it to China, and you end up at Beijing; add the same dotted line to Russia, and you end up at Moscow. So that vector seems to capture the concept of "capital of". Or you have these other examples where man plus some correction equals woman; take the same correction, apply it to king, and you get queen. So if you learn these things properly, the vectors in the embedding space seem to make sense. Again, this is cherry-picked: when people show you these examples, they show that you have learned something meaningful, but it doesn't mean you can always use those relationships — like I said, the embedding of [0 1.000001 0 0] is not a meaningful embedding — so this is a kind of post-facto analysis. Anyway, if I want to train a language model using a recurrent network, how do I do it? I just have the recurrent network as before: I use the first four words to predict the fifth word; more generally, I use the first k words to predict the (k+1)-th word. From w1 I predict w2, from w1 and w2 I predict w3, from w1 through w9 I predict w10, and so on. Each of these words is first zapped down using the projection matrix; this could be an entire LSTM structure, unidirectional or bidirectional, and the whole thing can be trained using backpropagation on a lot of text. Now, one of the things I can do with this model is make predictions. Remember what the model is really doing: it takes a bunch of inputs and computes a probability distribution over the next symbol. So, after having trained the model, I could give it the first three words; after having seen the first three words, if you want to predict the fourth word, what you're really going to get is a probability distribution over words in the fourth position. You could just pick the most likely word and call it the next word, and once you call it the next word, it's like saying: I've seen w1, w2 and w3, and I pretend that w4 really was what I chose from my prediction at the previous time. Now, using this input, I can go on and predict the probability distribution for the fifth word, which would be something of this kind, and again I can pick the most likely (it's a probability distribution, so you could also draw from it instead — the output is always a probability distribution). So let's say I give you "four score and"; you're going to get a probability distribution over all the words in the dictionary.
If you're lucky, and the model has learned the Gettysburg Address, the probability for the word "seven" is greater than the probability for any other word, so if you pick the most likely word you're going to pick "seven". At the next instant, you pretend that "seven" really was what you saw, so now you have "four score and seven"; that goes through the network and gives you a probability distribution over all the words, and if you're lucky, the probability for the word "years" is higher than the probability for everything else. Then you pick "years" and feed it back in as the next input. But look at what happens over here: the word goes back in, but it gets embedded first — the P is there, so the input is always the embedded version — and what you get back out is what you must treat as the next input. (In this particular case — no, we haven't gotten to that problem yet; here we're just saying that I see the past n words and predict the next word.) Okay, I'm running out of time really fast, but that's fine — you get the idea of how this is produced; very simple, very straightforward. I take the first three words — "four", "score", "and". "Four" goes through my network and I ignore the output over here; "score" goes in and I ignore that output too; "and" goes in, and over here I get a probability distribution over all words. I pick the highest; that ends up being, say, "seven", so I say the next word is "seven". Same thing again: that goes in, I get a probability distribution over words once more, and maybe this one is the highest — it doesn't have to be "ago", it could have been, you know, "before"; and if it were "before", then "before" is the next word I would read in (which is wrong), and then I go on to make my next prediction. So what happened over here? This was a recurrent neural network trained on character-level prediction — a character-level model trained on all of the Linux source code — and then it was used in exactly this mode: they gave it a first little bit (I'm not sure exactly how much), let it run in generative mode, and observed what it did. It actually generates comments; it closes the comments it opens; it knows how to open a function; it knows that functions have arguments which must be bracketed by parentheses; it knows that functions start with a curly brace and that things within curly braces must be closed. It produces an entire program. Pretty impressive, right? Or here's an even nicer example, from a website where they trained this on a bunch of music and then gave it the first few bars of a piano piece, and it produced the rest. It's MIDI, and it sounds totally authentic, doesn't it? That's pretty impressive.
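The generation procedure just described — prime the model with a few symbols, then repeatedly feed the most probable prediction back in as if it had been observed — looks roughly like this. This is a sketch only: `next_distribution` is a hypothetical stand-in for the trained network's one-step output (here a fake model so the code runs); in practice it would be the recurrent state update followed by a softmax, with the fed-back symbol embedded before it goes in.

```python
import numpy as np

def next_distribution(history, vocab_size, rng):
    # Hypothetical placeholder: a real language model would condition on `history`.
    logits = rng.normal(size=vocab_size) + 0.01 * len(history)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(prime, steps, vocab_size=50, seed=0):
    rng = np.random.default_rng(seed)
    seq = list(prime)
    for _ in range(steps):
        p = next_distribution(seq, vocab_size, rng)
        seq.append(int(np.argmax(p)))   # greedy choice; you could also sample from p
    return seq

print(generate(prime=[4, 8, 15], steps=5))   # 3 primed symbols followed by 5 generated ones
```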
So, no questions? Then let me return to our problem. Returning to our problem, we are going to talk about defining divergences in other scenarios — we have just a few minutes, and I may go over by a few. Consider the problem where I'm trying to perform sequence classification: you're classifying a full input sequence, as in, say, phoneme recognition — I give you a sequence of vectors and you have to tell me which phoneme it was. Or I give you a sequence of words which represents a question, and when you're done seeing the entire question you have to answer it: "color of sky?" — "blue". So a question-answering system: you see the input sequence, and the output is the answer at the end of the question. Question answering and phoneme recognition are the same problem in this sense. Now observe what's happening when I construct a network of this kind. Do you magically construct a network that only produces an output at the end of the sequence? Not really. What is really happening is that it considers the entire input but only produces an output once you have decided that the input is complete — my question is over, produce an output. It's not as if the network automatically knows when the question ended; you stop it, and then you read the output. So in this setting, what this means is that you're actually also producing outputs out here; you're just ignoring them. This is exactly what happened when I was doing "four score and seven years ago": I ignored the first two outputs. You're producing outputs everywhere; you're only considering the output at the final time. Now, if I'm training this network — a question-answering system or a phoneme-recognition system — what do I do? I have input-output pairs: a sequence of input vectors and, at the end of the sequence, a target output. The way I would do the training is: I pass in the sequence of inputs, I get the output at the end of the sequence, I compute the divergence at the end of the sequence, I backpropagate this divergence, and I learn all of my parameters. The divergence must propagate through the net to update the parameters. What is the problem? If the input is very long, you're basically going to run into vanishing and exploding gradient problems anyway. Also, the total amount of feedback you're providing is just one value per entire sequence — basically because you're pretending that nothing useful came out of the other locations and throwing those outputs away. What you could do instead is say that if this sequence of, say, speech vectors represents the sound /AH/, then it was /AH/ everywhere, so I can pretend that the intermediate outputs must also be the symbol /AH/. And once I pretend that, I can actually define a divergence between sequences, where the intuitively most appropriate one is to use basically the same divergence that you used out here at each of these points and sum them up — except you may want weights, so that as you go back into the past you assign less and less importance to the divergence. So what must these weights be? If you're doing speech recognition, weights of one everywhere kind of make sense — if the sound is /AH/, it is /AH/ at every instant. On the other hand, if you're doing question answering, it's kind of stupid to say the answer must be "blue" as soon as you see the word "color", or as soon as you see "color of"; it's only when the word "sky" came in that the answer "blue" made sense. In that case the obvious setting of the weights is one at the end and zero everywhere else. The generic setting is this weighted cross-entropy if you're performing classification — or it could be an L2 divergence if you're doing something else — and this is the divergence that you actually use when you're training.
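As a small sketch of the two weight settings just described (illustrative values only, and the same weighted per-step divergence used earlier): uniform weights for a phoneme-style problem where the target is valid at every step, and a single weight of one at the final step for question answering.

```python
import numpy as np

def weighted_xent(probs, targets, weights):
    # DIV = sum_t w_t * (-log probs[t, target_t])
    T = len(targets)
    return float(np.sum(weights * -np.log(probs[np.arange(T), targets])))

T, K = 6, 4
rng = np.random.default_rng(2)
probs = rng.dirichlet(np.ones(K), size=T)   # stand-in for the network's per-step outputs
targets = np.full(T, 2)                     # e.g. the phoneme /AH/ at every step

print(weighted_xent(probs, targets, np.ones(T)))   # "it was /AH/ everywhere": every step counts

final_only = np.zeros(T)
final_only[-1] = 1.0
print(weighted_xent(probs, targets, final_only))   # question answering: only the final step counts
```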
So we've seen how to deal with this particular problem: we know how to do the inference, and we know how to do the training. The next one — I'll stop right here and we'll cover it in the next class — is a much more challenging problem: order-synchronous, time-asynchronous sequence-to-sequence prediction. Here, for example, I'm getting lots of speech, and each time one phoneme ends I want to produce an output and say: this was the phoneme /AH/, this was the next phoneme, and so on. The problem is that all you see is a sequence of inputs; you don't really know when you must produce these outputs, because you don't know beforehand when each of the phonemes ended. So you have order synchrony: you know that you must produce a sequence of outputs, but the exact locations of the outputs are unknown beforehand, and determining them is also part of the bigger problem. We'll look at this in the next class. Questions? Thank you.