LSTM Networks - The Math of Intelligence (Week 8)

Captions
Hello world, it's Siraj, and our task today is to build a recurrent network, a type of recurrent network called an LSTM, or long short-term memory network, to generate Eminem lyrics. We're going to build this without using any libraries, just numpy for the matrix math, because we want to learn about the math behind LSTM networks. We've talked about recurrent networks earlier in this series, and LSTMs are the next logical step in that progression of neural network learnings. So first we'll do a refresher on what recurrent networks are and how we can improve them, then we'll talk about LSTM networks, why they're an improvement, and the mathematics behind them. Then we'll go over all of the code, and there's quite a lot of it, and I'll code out the forward propagation parts manually, both for the outer recurrent network and for the LSTM cell itself. So strap on every math hat you have, because this is going to be so hard we might as well just give up. No, I'm just kidding; this is actually going to be pretty easy if you got the recurrent net video, although it's a little more complicated. I'm going to make sure you get this. Okay, here we go.

What is a recurrent network? Can you tell me what a recurrent network is in one sentence? I'll give you ten seconds. Go. Okay, that's all the time you get; I'm not going to wait a full ten seconds. Recurrent networks are cool, and they're useful for learning from sequential data: a series of video frames, text, music, anything that is a sequence of data. That's where recurrent networks do really well; that's what they're made for. They're very simple: you have your input data, then a hidden state, and then an output. The difference between recurrent nets and feedforward nets is this: in a normal feedforward net we have an input layer, a hidden layer, and an output layer, with a weight matrix between each pair of layers, and the rule is input times weight, add a bias, activate. Hopefully you said "activate," because that is a rap mnemonic device: input times weight, add a bias, activate, repeated over and over for feedforward networks. For recurrent networks the difference is that we add another weight matrix, called synapse_h in this diagram, which connects the hidden state back to itself. It's just a third weight matrix, and the reason we add it is that during training we're not just feeding in new data points. Say we're trying to train a recurrent net to predict the next number in a series, and the series is 1 through 10. We'd say: given 1, predict 2; given 1 and 2, predict 3; given 1, 2, and 3, predict 4, and so on, at every iteration. But that's not all we'd give it. We wouldn't just give it the input data; we'd also feed in the hidden state. That's why we have a recurrent matrix connecting the hidden state to itself: at every time step the network receives both the previous hidden state, which is just a matrix, and the new input data.
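That recurrence can be sketched in a few lines of numpy. This is a minimal sketch, not the video's actual code; the names `rnn_step`, `W_x`, `W_h`, and `b` are my own placeholders for the input-to-hidden matrix, the hidden-to-hidden ("synapse_h") matrix, and the bias:

```python
import numpy as np

def rnn_step(x, h_prev, W_x, W_h, b):
    # One time step: input times weight, plus the previous hidden state
    # times its own hidden-to-hidden matrix, add a bias, activate.
    return np.tanh(np.dot(W_x, x) + np.dot(W_h, h_prev) + b)

# Toy dimensions: 3 input units, 4 hidden units.
rng = np.random.RandomState(0)
W_x = rng.randn(4, 3) * 0.01
W_h = rng.randn(4, 4) * 0.01
b = np.zeros(4)

h = np.zeros(4)  # initial hidden state
for x in [np.array([1.0, 0.0, 0.0]),
          np.array([0.0, 1.0, 0.0])]:
    h = rnn_step(x, h, W_x, W_h, b)  # the hidden state carries over
```

The only thing that makes this "recurrent" rather than feedforward is that `h` from one step is fed back in at the next step.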
So you can think of a recurrent network unrolled as just a series of feedforward networks: we give it an input data point, we get a hidden state, and at the next time step we don't just give it the new data point x1, we also give it the hidden state from the previous time step, and we continue to do that. A recurrent network is a big chain of feedforward networks. A chain, right? A blockchain? No, not blockchain, although I do want to talk about blockchain. It's coming, don't worry, I'm coming for you, blockchain, but not yet; we're still talking about recurrent networks. Amazing stuff. So we give it the hidden state and the input at every time step, we keep repeating that, and that is a recurrent network.

But there is a problem here. The problem is that there are 99 of these problems; no, I'm just kidding. The problem is actually really interesting: it's called the vanishing gradient problem. Let's say we're trying to predict the next word in a sequence of text, which is actually what we're doing today. Say we're trying to predict the last word in the sentence "the grass is green." All we're given is "the grass is," and we're predicting that last word. Recurrent networks can easily do this, because the distance in the sequence between the word we're trying to predict and the context we need is small: the only thing between "grass," which is the context, and "green" is the word "is." That's easy to predict. And if a previous sentence already said "the grass is green," the network has already made that distinct connection between grass and green; it knows those two things are related, so of course it's easy.

But now let's say we want to predict the next word in a much longer text. Say we have a 30-page, first-person essay about a guy named Jean. He's a French dude, he's got a mustache, unnecessary detail, and the essay starts off with "I am French," as it should. Then there are 2,000-plus other words, and we're trying to predict the last word after all of them: "I speak fluent ___," and the answer is "French." Between "I am French" and "I speak fluent French" there's mention of all these other languages, "this guy is Spanish," "I heard some German on the subway," and all of that is irrelevant to what he speaks fluently, which is French. We've got to keep the context that he is French across 2,000 words; of course he's going to speak fluent French. So we've got to find a way for our network to remember long-term dependencies. The whole idea behind recurrent networks, the whole reason we feed in the hidden state from the previous time step at every new iteration, is so that we have a form of neural memory.
That's why we don't just feed in the previous input; we feed in the previous hidden state, because the hidden state is the matrix that represents the learnings of the network so far. It gives the network a form of memory: what it remembered before, alongside the new data points. You'd think that's all you need; the hidden state keeps getting updated, and the whole idea of recurrence exists precisely so we have neural memory for sequences. But the problem is that when we backpropagate, the gradient tends to vanish. That's why it's called the vanishing gradient problem.

Let me explain. When we forward propagate, we take our input data and try to predict the next word, character, or musical note in the sequence: input times weight, add a bias, activate, repeat, recurring over and over, until we get the output, which is the prediction. Once we have that prediction, we vectorize it into a number and compute the error: the prediction minus the expected output, the actual label or next word, which we know because this is supervised learning. Then we use that error value to compute the partial derivative of the error with respect to our weights, going backwards through the network recursively. That is gradient descent, or in the context of neural networks, backpropagation, because we are propagating an error gradient back across every layer.

But here's what happens. Remember, this is a chain of operations, so the chain rule applies, and as we propagate the gradient backwards across every layer, computing the partial derivative of the error with respect to the weights at each layer recursively, the gradient value gets smaller and smaller and smaller, for linear algebraic reasons. Now, the whole point of gradient descent is to improve our weight values so that our predicted output gets closer to our expected output; we're trying to minimize the error. Gradient descent gives our weight values a direction in which to update so that the error will be minimized on the next forward pass. By direction I mean: which set of numbers in this weight matrix is optimal, such that if we multiply the input by it, over and over, it gives us the right output value? That's why we compute partial derivatives, or gradients as we also call them; calculus is the study of change, whether in moving bodies or in how to update a set of values. And when I say direction, I don't mean a literal direction like up, down, left, right; I mean a direction in which the numbers move closer to the ideal, optimal values they should have in the weight matrix, so that when we multiply the input by them, the output is closer to the output we want. I've probably repeated that several times, but it's good for us. So that's the whole point of performing gradient descent, aka backpropagation. And because the gradient gets smaller, the magnitude of change in the first layers of the network is going to be smaller than the magnitude of change in the last layers at the tail end of the network.
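Here's a tiny numeric illustration of that shrinking, a hypothetical sketch rather than anything from the video's code: each backward step through a layer multiplies the gradient by a local factor (roughly weight times activation derivative), and if that factor is below 1, the gradient shrinks exponentially with depth:

```python
# Assume every layer contributes the same local factor below 1.
factor = 0.5   # assumed per-layer factor for illustration
grad = 1.0     # gradient magnitude at the last layer
history = []
for layer in range(20):
    grad *= factor          # chain rule: one more multiplication per layer
    history.append(grad)

print(history[0])    # one layer back: 0.5
print(history[-1])   # twenty layers back: about 9.5e-07, vanishingly small
```

Twenty "layers" of recurrence is nothing for a 2,000-word essay, which is exactly why the long-range "I am French" signal dies before it reaches the early weights.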
So the last layers are affected more by the update, and the first layers less, because by the time the gradient reaches them it is smaller. There are two factors that affect the magnitude of these gradients: the weights and the activation functions. If each of these factors is smaller than one, the gradients may vanish in time; if larger than one, the gradients may explode, which is called the exploding gradient problem, where the gradient itself gets too big. So it can go either direction, but usually it's a vanishing gradient. Either way, this is a problem. We want to somehow maintain that error gradient as we backpropagate, keep it at the full magnitude it should be, so that it updates our weight values at every layer recursively in the correct way. We want to remember that gradient value as we backpropagate. And how do we remember a gradient value? Remember, maintain, whatever you want to call it.

The solution is called an LSTM cell, a long short-term memory cell. And if you think about it, the knowledge of a plain recurrent network is pretty chaotic. Say we're trying to caption a video, a sequence of frames. There's a guy eating a burger, and the Statue of Liberty is behind him, so the network thinks, okay, he must be in the United States. But then he's eating sushi, and it thinks, oh, he must be in Japan, just because it's seeing sushi; it forgot that he was just in front of the Statue of Liberty, and there exist sushi places in New York City too, right? Then he's riding a boat and it thinks, oh, he must be in the Odyssey or something, but he's still in New York. We need the information to update less chaotically, to account for all the learnings, the memories accumulated over a vast sequence, so it can be more accurate.

So the solution is the LSTM cell, and it replaces the RNN cell. The RNN cell is input times weight, add a bias, activate, with that weight matrix connecting the hidden state to itself. With an LSTM you just take that out and replace it with an LSTM cell, which is simply a more complicated, more extensive series of matrix operations. Let me talk through what it is. An LSTM cell consists of three gates: an input gate right here, an output gate right here, a forget gate right here, and then a cell state. You'll also sometimes see an input modulation gate in diagrams; there are many variants of LSTMs, which is why it appears, but the most used variant has just the input, output, and forget gates plus a cell state, so just forget the input modulation gate exists for now. So: three gate values and a cell state.

Now you might be thinking, what is a gate? A gate is just like a layer: a series of matrix operations, input times weight, add a bias, activate. In a way you can think of an LSTM cell as a mini neural network. Each of these gates has its own set of weight matrices, which means the whole LSTM cell is fully differentiable: we can compute the derivative of each of these components, or gates, which means we can update them over time; we can have them learn. So these are the equations for each of them. For the forget gate, the input gate, and the output gate, it's input times weight, add a bias, activate, where the input consists of the current input and the hidden state from the previous time step, and each gate has its own set of weight values.

What we want is a way for our model to know what to forget, what to remember, and what to pay attention to in what it's learned; what is the relevant part of everything it has learned, and of what it's being fed at this time step. The cell state is the long-term memory: it represents the learnings across all of time. The hidden state is akin to working memory, a kind of current memory. The forget gate, also called the remember vector, learns what to forget and what to remember: a one-or-zero, almost binary outcome. The input gate, also called the save vector, determines how much of the input to let into the cell state, what to save and what not to. And the output gate is akin to an attention mechanism: what part of the data should it focus on? So you might be thinking: I see these equations, I see how input times weight, add a bias, activate is akin to forgetting, input, and output, and I see there's an ordering to this. First we learn what the heck to forget, then we compute the cell state, then we send it to the output, and ultimately we compute the cell state and the hidden state, the two key values the cell outputs. But you might not yet make the connection: how does it know what to forget and what to remember? That's again the amazing thing about neural networks: these gates are essentially perceptrons.
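The gate equations above can be written out in numpy. This is a minimal sketch of the standard LSTM forward step, with my own placeholder names for the weights (`W_f`, `W_i`, `W_c`, `W_o`), not the variable names from the video's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.hstack((h_prev, x))               # previous hidden state + input
    f = sigmoid(np.dot(W_f, z) + b_f)        # forget gate: what to keep of c_prev
    i = sigmoid(np.dot(W_i, z) + b_i)        # input gate: how much new info to save
    c_tilde = np.tanh(np.dot(W_c, z) + b_c)  # candidate cell update
    c = f * c_prev + i * c_tilde             # new cell state (long-term memory)
    o = sigmoid(np.dot(W_o, z) + b_o)        # output gate: what to focus on
    h = o * np.tanh(c)                       # new hidden state (working memory)
    return h, c

# Toy sizes: 2 input units, 3 hidden units, so the concatenated input has 5.
rng = np.random.RandomState(1)
Ws = [rng.randn(3, 5) * 0.1 for _ in range(4)]
bs = [np.zeros(3) for _ in range(4)]
h, c = np.zeros(3), np.zeros(3)
h, c = lstm_step(np.array([1.0, -1.0]), h, c, *Ws, *bs)
```

Notice how each gate is literally input times weight, add a bias, activate, and that the cell state `c` is only ever scaled by `f` and added to, which is what lets the gradient flow back through long sequences without vanishing.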
They're like mini networks, single-layer neural networks with a single node, where the node is the gate itself with a single activation function. All of these gates are perceptrons, and they learn what to forget and what to remember through gradient descent. Again, it's the magic of gradient descent: it learns what is necessary over time. So these components represent mechanisms for forgetting, remembering, and attention, and that's what the LSTM provides us. For a normal recurrent network the equation would be: your input times its weight matrix, plus the hidden state times its own hidden-to-hidden matrix, activate that, and that gives you the hidden state at the current time step. With an LSTM it instead looks like this, a more extensive series of operations. So that's the high level of how it works, and now we're going to look at it in code as well, to help your retention.

What are some use cases? Like I said, all sorts of sequential data: any kind of text, any kind of sequence. It learns what the next values in a sequence are, which means it can both generate and discriminate the type of data you've trained it on; in this case, characters it can draw. This is very popular in NLP. Andrej Karpathy, the famous AI researcher, has a great blog post on this, "The Unreasonable Effectiveness of Recurrent Neural Networks." I assume you've read it; if not, check it out, and I'll link to it in the description. LSTMs are used for translation, sentiment analysis, and all sorts of NLP tasks. They're also used for image classification, if you think of a picture as a sequence of pixels and predict the next one. There are a lot of different ways we can frame our problem, and LSTMs can be applied to almost any of them; very useful. What are some other great examples? I've got two. One is automatic speech recognition with TensorFlow; yes, it's a little abstracted away with TensorFlow, but it's a really cool use case and definitely something to check out, so check out that demo. And then also this repository, which is a visualization of LSTMs and will definitely help your understanding; very cool demo. I think it's in JavaScript... no, it's in Python. Great.

So our steps are going to be: look at the plain old recurrent network class; use an LSTM cell instead, replacing the RNN cell; build that LSTM class; write our data loading functions for the text; and then train our model. Okay, so what is this right here, this jumble of numbers with no explanation? This represents the forward pass, the series of matrix operations we compute to get our predicted output, and we'll look at it in code.

So here's what our recurrent network is going to do: given our Eminem text, predict the next word in the sequence given the previous words. That means the input is going to be every word in the text, and the output is the same thing, just moved over by one. During training we have a series of training iterations, and it'll look like this. Say the text is "my name is slim shady." We'd feed "my" and try to predict "name": compute the error, backpropagate, iteration done, weights updated. Then "my name," and the word we try to predict is "is": we have an actual output and a predicted output, compute the error, backpropagate, repeat. Then "my name is," trying to predict "slim." You see what I'm saying; we just keep doing that, and that's why it's moved over by one.

So our inputs and our outputs are going to be our words. We have the number of recurrences we want, which is the total number of words, because we're going to perform recurrence in our network for as many steps as necessary until we iterate through all the words up to the predicted word we need to be at. Then we have an array of expected outputs, and then our learning rate, which is our tuning knob: too low and it never converges; too high and it overshoots, so it never converges either.

Now let's look at the initialization. Remember, the difference between LSTMs and plain old recurrent nets is that we have more parameters, those gate values, and new operations for those parameters, so we initialize all of them right at the beginning. You might look at this and think, wow, this is very complicated and difficult; but remember, with TensorFlow or Keras you can do this in ten, twenty, thirty, forty lines of code. We're looking at the details here; that's why it looks so long. So: we initialize our first word and its size, the next word and its size, and then our weight matrix between the input and the hidden state, initialized randomly with the size of our predicted output. We'll also initialize a variable G, which is going to be used for RMSProp, a technique for gradient descent that decays the learning rate; I'll talk about why we're using G later on.
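The shift-by-one training pairs described above can be sketched like this; a hypothetical snippet, not the video's actual data loader:

```python
# Build (input, target) training pairs where the target sequence is the
# input sequence moved over by one word.
text = "my name is slim shady"
words = text.split()

inputs = words[:-1]    # every word except the last
targets = words[1:]    # the same words, shifted over by one

pairs = list(zip(inputs, targets))
print(pairs)
# [('my', 'name'), ('name', 'is'), ('is', 'slim'), ('slim', 'shady')]
```

At each training iteration the network has seen everything up to the input word and is asked to produce the target word.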
We're not going to worry about that right now. Next we set the length of the recurrent network, which is the number of words we have, the number of recurrences, and our learning rate; those are all parameters. Then we create arrays for storing our inputs at every time step, an array for storing our cell states, an array for our output values, our hidden states, and then our gate values: the forget, input, and output gates, plus the cell state. The earlier values belonged to the recurrent network; these are the LSTM cell's values. Then we have our array of expected output values, and finally our LSTM cell, which we initialize right here, giving it our inputs, our outputs, the amount of recurrence, and the learning rate, just like we did for the recurrent network. Remember, it's like a mini network, a network in a network; and if you think of the gates as networks, it's three networks inside a network inside a network. Talk about inception: recurrence inception.

So back to this. Those are our initializations, and now we have our sigmoid, our activation function, a simple nonlinearity, and then the derivative of the sigmoid function, which is used to compute gradients; that's why we have the derivative, and we'll talk about it in backpropagation. Then we have forward propagation, our series of matrix operations, which I'm going to code out now.

Forward propagation happens in a loop, because it's a recurrent network. For the number of loops we defined, we set the input for the LSTM cell, which is a combination of the previous output and the previous hidden state; we can build it with numpy's hstack function, so we say self.LSTM.x equals np.hstack of the previous hidden state and the input. That's us setting up the LSTM cell's input, and now we can run forward propagation through the LSTM cell itself. It's going to return our cell state, our hidden state, our forget gate, and the rest, which we compute using the LSTM cell's forward-prop function and then store in this recurrent network's arrays: the computed cell state, the hidden state, the forget gate of course, the input gate, the cell state, and the output gate, all values computed during the cell's forward propagation. Now we can calculate the output by multiplying the hidden state with the weight matrix: input times weight, add a bias, activate, so we say self.sigmoid of the weights times the input. Then we set our input to the next word in the sequence, because we're going to keep going, and when we're done with the loop we return the output prediction. That's forward propagation.
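The outer loop just described, hstack the previous hidden state with the input, run the cell, then squash the hidden state through an output layer, might look roughly like this. It's a sketch with assumed names, and I use a trivial stand-in for the cell so the snippet stays self-contained:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in cell for illustration only: any function mapping
# (concatenated input, previous cell state) -> (new hidden, new cell).
def toy_cell(z, c_prev):
    c = 0.9 * c_prev + 0.1 * np.tanh(z[:3])  # toy long-term memory update
    h = np.tanh(c)
    return h, c

rng = np.random.RandomState(2)
W_out = rng.randn(4, 3) * 0.1   # hidden size 3 -> "vocabulary" size 4

def forward_sequence(xs, hidden_size=3):
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    outputs = []
    for x in xs:
        z = np.hstack((h, x))          # previous hidden state + new input
        h, c = toy_cell(z, c)          # run the (stand-in) LSTM cell
        outputs.append(sigmoid(np.dot(W_out, h)))  # output layer
    return outputs

preds = forward_sequence([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
```

The real network does exactly this shape of loop, just with the full LSTM cell in place of `toy_cell`.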
Now for backward propagation. The point of backpropagation is to update our weight matrices with our learnings. We've computed a predicted next word, and we have our actual next word, and we represent words as numbers, as vectors, so that we can compute the difference between them. How do you compute the difference between words? You convert them to numbers or vectors, and then the difference gives us the error value. So we initialize the error to zero, and then we initialize two empty vectors for the cell state and the hidden state. Remember, the cell state and the hidden state are the two values that the forget, input, and output gates are ultimately used to help compute. Notice how they're used here: the forget gate is multiplied by the previous cell state, and then we add the input gate multiplied by the candidate update, so the cell learns what to forget and what to remember, and that's what's stored in the cell state, the learnings of the forget and input gates. Then we activate the cell state and multiply it by the output gate to get our hidden state. So at the outer level we have the weight matrix between the input and the hidden state, the hidden state itself, and the cell state; those are the key outer-level parameters. Our inner-level parameters are the LSTM-level gradients: the gradient values for the forget gate, the input gate, the cell unit, and the output gate. We're going to fill all of these out.

So we loop backwards, backpropagating through time through our recurrence. We take our calculated output and compute the error between it and the expected output, and then we compute the partial derivative: the error times the derivative of the output times the hidden state. Once we have that, it's time to propagate the error back to the output of the LSTM cell, and the way we do that is three steps: we compute the error times the recurrent network's weight matrix, we set the input values of the LSTM cell for recurrence, and we set the cell state of the LSTM cell for recurrence. Those are the pre-updates, and then we recursively call the cell's backpropagation using these newly computed values. This computes the gradient updates for our forget, input, cell unit, and output gates, and for the high-level cell state and hidden state, those two higher-level parameters. So those are all of our gradients, and this part is just for logging. Now we accumulate the gradient updates by adding them to the empty values we initialized at the start, then we update our LSTM matrices with the average of the accumulated gradient updates, update our outer weight matrix with the average of its accumulated gradient updates, and return the total error of this iteration. That's backpropagation in general.

Notice this update step: it's RMSProp, a way of decaying our learning rate over time, which improves convergence. There are a lot of methodologies for improving on gradient descent: there's Adam, there's RMSProp, there's AdaGrad, a whole bunch of these, and RMSProp is one of them. Here's the formula; I'll just put it up there.
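The RMSProp update keeps a decaying average of squared gradients and divides the step by its square root, which effectively decays the learning rate per parameter. A minimal sketch; the decay rate 0.9 and the epsilon are typical defaults, not necessarily the exact values used in the video's code:

```python
import numpy as np

def rmsprop_update(w, grad, G, lr=0.001, decay=0.9, eps=1e-8):
    # G accumulates a decaying average of squared gradients; dividing the
    # step by sqrt(G) shrinks updates along directions with large gradients.
    G = decay * G + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(G) + eps)
    return w, G

w = np.array([1.0, -2.0])
G = np.zeros(2)
grad = np.array([0.5, -0.5])
w, G = rmsprop_update(w, grad, G)
```

Each weight matrix in the network gets its own `G` accumulator, which is why the class initialized that extra variable earlier.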
thing: we have our input, and for however many words we define, we'll generate or predict words for that many iterations. So it's the same thing, and we can just skip over that. Now for the LSTM. For our LSTM cell, we've given it the same parameters as we did for our recurrent network; it is, after all, a mini network in and of itself. So we gave it the inputs, the outputs, the amount of recurrence, and the learning rate. What we'll do at the start is very similar to what we did for our recurrent network: we'll initialize our input and its size, our output and its size, and then our cell state, which starts out empty; we'll initialize the variable for how often we perform recurrence, and our learning rate as well. Now we're going to create weight matrices. We'll initialize these weight matrices randomly, just like we would for any kind of neural network: the weight matrices for our three gate values, the forget gate, the input gate, and the output gate, as well as for the cell state. The cell state itself, let me go up here, has its own set of weight matrices, just like all neural networks do; that node has its own set of weight matrices that we multiply by to get its output value. It's part of a series of operations, and the weight matrix is the learning part, the non-static, dynamic part of that equation. You can think of these as gates, or you can even think of them as layers; they're called gates, but layers are very similar, you know what I'm saying: input times weight, add a bias, activate. We call them gates to differentiate them, and I don't mean "differentiate" as in the mathematical term, so many terms here, but to distinguish them, right? Okay, back to this. Where were we? We've initialized our gates, and now empty values for the gradients that we're
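The random gate-weight initialization described above could look something like this. A minimal sketch under assumptions of mine: the shapes (each gate seeing the concatenated input and previous hidden state) and the scale factor are illustrative, not necessarily the notebook's layout.

```python
import numpy as np

def init_lstm_weights(input_size, hidden_size, scale=0.2):
    """One randomly initialized weight matrix per LSTM gate, plus the
    cell-state (candidate) weights, centered around zero."""
    concat = input_size + hidden_size  # [input, previous hidden]
    return {
        "forget": np.random.randn(hidden_size, concat) * scale,
        "input":  np.random.randn(hidden_size, concat) * scale,
        "cell":   np.random.randn(hidden_size, concat) * scale,
        "output": np.random.randn(hidden_size, concat) * scale,
    }

W = init_lstm_weights(input_size=10, hidden_size=8)
```

Centering the random weights around zero keeps the gates' pre-activations small at the start of training.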
going to compute for all of these. Remember, all of these gates are differentiable, so that's where the gradient values will go; then we can update them through backpropagation, because we back-propagate through the cell itself. We don't just back-propagate through the recurrent network at a high level; we're computing gradient values for each of these gates. So we're updating weights for the input gate, the forget gate, the cell state, and the output gate, plus the outer-level weight matrix for the recurrent network as well. We're computing gradients for everything, so we're updating everything: it's learning what to forget, what to remember, and what to pay attention to (the attention mechanism), plus the output and the outer-level weight matrix. Then we have our activation function, sigmoid, just like before, and its derivative, just like before. And we add another activation function here; this is good practice in LSTM networks, where you usually see the tanh function applied pretty much all the time. You might be asking, why do we use tanh over other activation functions? I've got a great video on this; it's called "Which Activation Function Should I Use?", so search for that, and you'll know the answer within seven minutes if you watch it. But at a high level, it helps prevent the vanishing gradient problem: the tanh function gives us stronger gradients, since its output is centered around zero, as opposed to the sigmoid's, which is not. So we use that, and we also define its derivative function, because we want to compute gradients; everything is differentiable. Just like before, we'll compute the forward propagation for an LSTM cell. Remember, the forward propagation for an LSTM cell is, drumroll please, this set of operations. Ultimately we want to compute the output, so that's what we'll do. Back to this. So we'll say
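The two activation functions and their derivatives can be written out in a few lines. A small sketch; note that each derivative here is expressed in terms of the function's output, a common convenience in hand-rolled backpropagation.

```python
import numpy as np

def sigmoid(x):
    """Squash to (0, 1); used for the forget, input, and output gates."""
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(y):
    """Derivative of sigmoid, given its output y = sigmoid(x)."""
    return y * (1.0 - y)

def tanh(x):
    """Squash to (-1, 1), centered at zero; used for the cell state."""
    return np.tanh(x)

def dtanh(y):
    """Derivative of tanh, given its output y = tanh(x)."""
    return 1.0 - y ** 2
```

Because tanh is zero-centered, its gradients near the origin are stronger than sigmoid's, which is the vanishing-gradient argument made above.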
the forget gate is going to equal the activated dot product of the input and the forget gate's weight matrix: input times forget weights, then activate, and that gives us our forget gate. Then we update the cell state by multiplying it by that forget gate, so it knows what to forget. Next we compute the input gate, which is again the activated version of the input times its weight matrix. Once we've got that, we can compute the candidate cell value, where we apply that new activation function, tanh, to prevent the vanishing gradient: the cell weights times the input, then activate. Then we update the cell state by adding the input gate times that candidate cell value. For our output gate, we compute the activated dot product between the output weights and the input, and that gives us our output gate. To get our actual predicted output, we multiply our output gate by the activated version of our cell state, and that gives us our predicted output. We can then return our cell state, our predicted output, our forget gate, our input gate, the cell value, and our output gate as well. That's forward propagation, and that's the equation that I showed up there. So now we've got forward propagation, and now we're going to look at backward propagation. Remember, we back-propagate through the cell state as well, not just the higher-level recurrent network. So what this is: first we compute the error, which is the incoming error plus the hidden state's derivative, and
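Here's a minimal numpy sketch of that forward pass, assuming each gate multiplies the concatenated [input, previous hidden state] vector by its own weight matrix; the exact layout in the notebook may differ, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x, h_prev, c_prev, W):
    """One LSTM step: gate, update the cell state, emit a hidden state.
    W is a dict of gate weight matrices (hypothetical layout)."""
    z = np.concatenate([x, h_prev])   # combined input
    f = sigmoid(W["forget"] @ z)      # forget gate: what to drop
    i = sigmoid(W["input"] @ z)       # input gate: what to add
    g = np.tanh(W["cell"] @ z)        # candidate cell value
    o = sigmoid(W["output"] @ z)      # output gate: what to expose
    c = f * c_prev + i * g            # new cell state
    h = o * np.tanh(c)                # new hidden state (the output)
    return h, c, f, i, g, o

# toy usage with random weights
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
W = {k: rng.standard_normal((n_h, n_in + n_h)) * 0.2
     for k in ("forget", "input", "cell", "output")}
h, c, f, i, g, o = lstm_forward(np.ones(n_in), np.zeros(n_h),
                                np.zeros(n_h), W)
```

The line `c = f * c_prev + i * g` is exactly the "multiply by the forget gate, then add the input gate times the candidate" step described above.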
we'll clip those values so they don't get too big; we want to keep the gradients from blowing up, and clipping helps with that. Then we'll multiply the error by the activated cell state to compute the output derivative, and then compute the output update, which is the output derivative times the derivative of the activated output, times the input. Then we compute the derivative of the cell state, which is the error times the output, times the derivative of the cell state, plus the incoming cell derivative. So we're computing derivatives of all of these components in the reverse of the forward order, which means the forget gate's computation, or rather its update, happens near the end instead of near the beginning. Then we compute the derivative of the candidate cell value and the cell update, the derivative of the input gate and the input update, and the derivative of the forget gate and the forget update; you see what I'm saying, we're computing gradients and then their corresponding updates. Then the derivative of the cell state, then the derivative of the hidden state, and finally we can return all of those updated gradient values: for the forget gate, the input gate, the cell, the output gate, the cell state, and the hidden state. So many different parameters that we've computed gradients for, and backpropagation lets us compute all of them, recursively computing the error with respect to our weights for every single component in the network, in the reverse order of forward propagation. It's just more matrix math than before, maybe six to nine more steps than a plain recurrent network. Then our update step is just us applying those forget, input, cell, and output gradients: we update our gates using those gradient values. Okay, so that's it for our LSTM cell and our recurrent network, and now we can look at our load-text and
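The clipping step at the start of the backward pass can be sketched like this. The limit of 1.0 is an illustrative choice of mine, not necessarily the notebook's value.

```python
import numpy as np

def clip_gradient(grad, limit=1.0):
    """Element-wise clip into [-limit, limit] so the backward pass
    stays numerically stable instead of blowing up."""
    return np.clip(grad, -limit, limit)

# e.g. an incoming error plus a hidden-state derivative, clipped
error = np.array([0.3, -2.5, 4.0])
d_hidden = np.array([0.1, 0.1, 0.1])
clipped = clip_gradient(error + d_hidden)  # -> [0.4, -1.0, 1.0]
```

Only the out-of-range components are touched; gradients already inside the window pass through unchanged.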
our export-text functions, which are not as interesting, but here's what they do: we load up our Eminem .txt file of lyrics, just like that, compute the unique words across all of them, and build our sequences, an input sequence and an output sequence. Then there's export-text, so whenever we've sampled new words, we can write them from memory to disk. That's all. So for our main program, we'll say: for five thousand iterations, with a learning rate of 0.001, load the input and output data, then initialize a recurrent network using the hyperparameters we set up before. Then, for that given number of training iterations, five thousand, we compute the predicted next word (that's forward propagation), then perform backpropagation to update all our weight values using the error, and keep doing that. This if statement says: if our error is small enough, within this range here, then we can go ahead and sample, which means predict, or rather generate, new words. Our network is trained, the error is small enough, so let's go ahead and predict new words without having to run backpropagation to update our weights; our weights are already updated enough. So we define a seed word and predict some new text by calling that sample function, which is just forward propagation. That's all, and then we can write it all to disk. All right, so let's run this. It's starting, and it's going to take a while, so I actually have a saved copy, and it's spitting out some pretty dope lyrics. These are some pretty dope lyrics. That's it for this lesson. Definitely look at this Jupyter notebook afterwards, look at the links I've put in the description as well as inside the Jupyter notebook, and make sure that you understand at least why we use LSTM cells: the reason is that they remember long-term
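The training loop described above, train until the error is small, then sample, can be sketched like this. The `forward`/`backward`/`sample` interface here is a stand-in of mine, not the notebook's actual API, and the dummy network exists only to make the sketch runnable.

```python
def train(net, iterations=5000, threshold=0.1):
    """Minimal sketch of the training loop: forward-propagate to predict
    the next word, back-propagate to update weights, and once the error
    is small enough, sample new words without further weight updates."""
    for step in range(iterations):
        net.forward()              # predict the next word
        error = net.backward()     # update weights, return the error
        if abs(error) < threshold: # trained well enough: stop early
            return net.sample()    # generate new words from a seed
    return net.sample()

class DummyNet:
    """Toy network whose error halves each step, for illustration."""
    def __init__(self):
        self.error = 1.0
    def forward(self):
        pass
    def backward(self):
        self.error *= 0.5
        return self.error
    def sample(self):
        return "generated lyrics"

result = train(DummyNet())
```

In the real program, sampling is just the forward pass run repeatedly from a seed word, with no backpropagation afterwards.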
dependencies. That's at least the high-level takeaway you should have gotten from this video: an LSTM cell learns what to forget, what to remember, and what to pay attention to. Those are the three things an LSTM cell, as opposed to a regular recurrent network, lets you do. Okay, so please subscribe for more programming videos, and for now, I've got to learn to forget. Thanks for watching!
Info
Channel: Siraj Raval
Views: 165,335
Rating: undefined out of 5
Keywords: lstm backpropagation, siraj raval lstm, lstm siraj raval, lstm explained, lstm tutorial, lstm, lstm networks, lstm network, long short term memory neural network, lstm neural network, lstm python, long short term memory, lstm rnn, rnn lstm, siraj raval rnn, lstm tensorflow, recurrent neural network, neural network, siraj raval, programming, python, reccurrent network, rnn, siraj, coding, data, AI, neural net, artificial intelligence, machine learning, DL
Id: 9zhrxE5PQgY
Channel Id: undefined
Length: 45min 3sec (2703 seconds)
Published: Wed Aug 09 2017