An Introduction to LSTMs in TensorFlow

Captions
Today we have a computational tutorial on LSTMs in TensorFlow. The two people that will be talking today: the first is Nick LoCascio, a master's student in CSAIL working in Regina Barzilay's group; his current research applies deep learning to diagnostic mammography screening. The second speaker is Harini Suresh; she's also a master's student in CSAIL, and her research uses LSTMs to model physiological time series. Great, okay, so let's get started.

Hi everyone, I'd just like to thank Janelle for inviting us out here and organizing this event; we're very excited for it. So: LSTMs in TensorFlow. The outline of this talk and tutorial series is four parts. We're first going to start with a brief overview and summary of neural networks; Harini is going to talk about sequence modeling with LSTMs; we're going to talk about TensorFlow fundamentals; and then in part four you're going to dive in and actually build LSTMs in TensorFlow to classify airline tweets as positive or negative.

So, a brief overview of neural networks first. I'm sure you've all seen diagrams like this: some nodes connected to some other nodes. Let's dive into one of these nodes, called the perceptron. Essentially, what it does is take an input x, weight it, sum it, and apply a nonlinearity. To break it down: we take our input, some x vector; we multiply it by some weights, a W vector; we sum them; we add a bias term; and then we apply a nonlinearity to this. We can rewrite all of this as a matrix multiplication.

Let's dive a little into this activation function, g. There are a lot of different activation functions. One that you might have seen before is the sigmoid activation, which applies a nonlinear transformation to the data; there's a graph of it on the bottom left. The key thing is that there are lots of different activation functions you might come across, things like sigmoid, tanh, and ReLU, which is very popular now, and the key idea is that they all apply a nonlinear transformation to the data. The reason this is important is that our data is, most of the time, going to be nonlinear, and if we want to make decisions and draw decision boundaries around this data, we need to build models that have nonlinear capacity in them. As you can see, a linear model can only draw straight lines, while a nonlinear one can have more complicated decision boundaries.

So if we just look at the forward pass of a perceptron: we pass in some input numbers, they get multiplied by our weights, we add them up, apply a nonlinearity, and we get some output number. That's just the perceptron, but if we want to build entire neural networks, we need to compose these things together. If we simplify our diagram a little bit, so we just have some inputs that go to some node, we can build multi-output perceptrons, where two perceptrons lead to multiple outputs, and we can even stack them together to create neural networks like this. This is an MLP, the smallest neural network you can technically create. Simplifying a little more to get to our final diagram, we just replace all those connections with an x, and if we want to deepen our network, we just stack many, many of these. That takes you from the perceptron all the way to the diagram that you see a lot.
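As a concrete illustration of the forward pass just described, here is a minimal NumPy sketch; the input, weight, and bias values are made up for illustration:

```python
import numpy as np

def perceptron(x, W, b):
    """Forward pass of one perceptron: weighted sum of the inputs
    plus a bias, followed by a sigmoid nonlinearity g."""
    z = np.dot(W, x) + b                 # sum of weighted inputs plus bias
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation g(z)

x = np.array([1.0, -2.0, 0.5])           # an example input vector
W = np.array([0.3, -0.1, 0.8])           # example weights
b = 0.1                                  # bias term
print(perceptron(x, W, b))               # a single output number
```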
So if we want to train these neural networks to do the tasks we want them to do: in any machine learning model, you have to train your model. We have a loss, some function we'll call J(θ), and we define our objective as minimizing this loss by adjusting our weights, adjusting our θ. Remember, θ is just the weights of our network; that's what parameterizes it. We do this with something called stochastic gradient descent.

Think of our loss function as a function that depends on our weights; here's θ0 and θ1. We pick any random initialization: we start with any θ, any weights, and we have some loss for that. We compute the gradient so we can move in the direction of maximum descent, since we want to minimize this loss, and we iteratively do this, moving our way down the curve until we reach some convergence. This is called SGD, or stochastic gradient descent, and formally there's an update rule where you apply this gradient.

Let's dive a little into how we actually calculate this gradient, dJ(θ)/dθ. We do this with something called backpropagation. With backprop, you want to find how your loss changes with respect to a specific weight. Here we want to look at how our loss changes with respect to weight two, and to do this we apply the chain rule: we know it depends on how our loss changes with respect to the output, and we know the output depends on how weight two changes. Similarly, if we want to do this for weight one, we just apply the chain rule again.
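To make the update rule concrete, here is a minimal sketch of gradient descent on a made-up quadratic loss whose gradient is known in closed form; the loss, target, and learning rate are all hypothetical:

```python
import numpy as np

target = np.array([3.0, -1.0])     # minimizer of our made-up loss

def grad_J(theta):
    """Gradient of the hypothetical loss J(theta) = ||theta - target||^2."""
    return 2.0 * (theta - target)

theta = np.random.randn(2)         # random initialization of the weights
lr = 0.1                           # learning rate
for step in range(100):
    theta = theta - lr * grad_J(theta)   # step in the direction of maximum descent
print(theta)                       # approaches `target` as the loss is minimized
```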
So that was just a brief overview and summary to refresh your memories on neural networks. Harini is going to take over from here, talking about how to build off of standard neural networks to create recurrent neural networks and LSTMs, so I'll hand it off to her now.

[Audience question] Yeah, sure. I think the key idea is that with many, many layers, you're going to learn the weights necessary to draw the decision boundaries that you want. If your nonlinearity were just multiplying by one and you stacked many, many layers, it would still be a fully linear function. No matter how many layers you have, you still need to introduce some nonlinearity in the layers to have a nonlinear model.

[Audience question] Yeah, I think it's a little crazy sometimes. If you look at something like a ReLU, and I'll just go back to this real quick, a ReLU is about the simplest nonlinearity you could think of: it's zero everywhere below zero, and then it's just this linear function. Surprisingly, it actually works really well, and it's the most popular now. It's a bit of an engineering question, but it works really well. I think the key thing is that people used to use sigmoids a lot because they thought it gave a cleaner, smoother gradient, but with something like a ReLU it's not as clean, and you still get good results.

And now I'm going to talk about sequence modeling: basically, why we would want to model sequences with a recurrent neural network in the first place, and then some of the motivation for the LSTM specifically, which is just a specific type of recurrent neural network.

Okay, so first of all, what exactly is a sequence? By sequence I just mean an example that consists of multiple values; this can be a variable-length set of values, and future values depend on previous ones. So this is something like a sentence, a function, or a speech waveform. Neural network models, specifically RNNs or recurrent neural networks, have had a lot of success in this area. You've probably encountered them when asking questions to Siri or Alexa, or when using Google's machine translation, and the state of the art in both machine translation and question answering is based on RNNs.

Okay, so now let's say we want to go about modeling a sequence. The first thing we need to think about is how exactly we're going to represent it. In order to feed something like a sentence into a model, we need to represent it as a vector of numbers. One thing we might think of first is: how about we just represent the sentence as a bag of words? In a bag of words we just have a vector; each slot in the vector represents a word, and the number in that slot is the number of times that word occurs in the sentence. So now we have a fixed-size vector representing our sentence, and we can feed that into a model and make a prediction.

A problem you might notice is that the bag-of-words representation doesn't preserve any kind of order in the sequence. For example, these two sentences, "The food was good, not bad at all" and "The food was bad, not good at all," have the exact same set of words, so they have the exact same bag-of-words representation, yet they mean completely opposite things. You can see how it might be difficult for our model to get semantic meaning out of sentences represented like this, because sentences that mean opposite things can have the exact same representation.

Okay, so we know now that we want to preserve order to encode meaning. The next thing we might think of is just having a longer feature vector in which we maintain order. In this case we have many slots within our feature vector, each corresponding to a word: the first five slots correspond to the first word, the next five to the second word, and so on. Now we're maintaining order within this vector; we just have a longer vector that we feed into a model to make a prediction. However, in this case it's actually difficult to deal with different word orders. In this example, "On Monday it was snowing" versus "It was snowing on Monday" mean the exact same thing, but they have entirely different representations in this model.

The problem here, when we're dealing with sequences, is that sequences often operate according to the same rules across the sequence. In the example of language, there are certain rules of grammar that apply no matter whether we're at the beginning of the sentence or at the end, and since we're not sharing weights, not sharing parameters, across the entire vector, whatever we learn about language at the beginning of the sentence we're going to have to relearn at the end of the sentence. Because our vectors are entirely different for different positions, we're going to have to relearn things across the entire sequence.
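To see the first of these problems concretely, the order loss in the bag-of-words representation, here is a tiny sketch using the speaker's own example sentences; the small fixed vocabulary is just for illustration:

```python
from collections import Counter

vocab = ["the", "food", "was", "good", "bad", "not", "at", "all"]

def bag_of_words(sentence):
    """Count how many times each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

a = "The food was good not bad at all"
b = "The food was bad not good at all"
print(bag_of_words(a))                     # same vector...
print(bag_of_words(b))                     # ...for a sentence meaning the opposite
print(bag_of_words(a) == bag_of_words(b))  # True: the order is gone
```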
Okay, so we want to deal with this problem, and something we might turn to is a Markov model. In this case we don't have to relearn anything, because we have a set of rules, a set of states, and a set of transitions. Here we know that no matter where the word "I" appears in this model, there's a 0.66 chance that it's followed by the word "like"; whether it appears at the beginning of the sentence or the end, we have these rules encoded in the Markov model. The limitation, however, is that we're assuming each state depends only on the previous state, or on a finite set of previous states, which is also quite limiting. For example, in language, if I showed you the sentence "In France, I had a great time and I learned some of the ___ language," you and I could probably figure out what word goes in that blank: I look way far back at the word "France," I look ahead at the word "language," and I can pretty much tell that the word in the blank is "French." But if I just looked at the few previous words, I would get very little information about what the next word should be. In complicated sequences like language, it's really important to be able to model longer-term dependencies and keep track of things that happened far back in the past, so something like a Markov model will often not generalize well to sequences like language.

Okay, so that provides a brief motivation for why we may want to use a recurrent neural network to model sequences. As a review, the things we want to solve are: we want to maintain word order, because it's important to semantic meaning; we want to share parameters across the sequence, so we don't have to relearn rules across the entire sequence; and lastly, we want to keep track of long-term dependencies and remember things that happened far back in the sequence.

Okay, so this is what an RNN looks like. You might notice that it looks really similar to the neural networks that Nick introduced in the previous part, and it is; there's just a slight difference in what the hidden nodes are computing. Let's take a look at one hidden unit to see what it's doing. Before, the hidden unit computed a function of the input times some weights, applied a nonlinearity, and produced an output. Here it's doing the exact same thing, except that in addition to being a function of the input, it's also a function of its own previous output, which we call its previous state. In this way it's able to remember things it saw in the past. If you look at the last equation, you can see that the next output of the hidden unit is really similar to what we saw before: it's some weights multiplied by the input, but it's also some weights multiplied by the unit's own previous output. We just keep doing this across the sequence: at every point in the sequence, we feed in a new word and the previous output, and we compute the new output.

Because of this, a really common way to view a recurrent neural network is by unfolding it across time, and by time I don't necessarily mean actual time, just where we are in the sequence. Here you can see clearly that at each point in the sequence we feed in the next value and the previous cell state, and we compute the next cell state. This is the key to why this is different: we're keeping track of a state within the cell, which was not happening before.
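Here is a minimal NumPy sketch of the recurrent update just described: the new state is a function of the current input and the unit's own previous state, with the same weights W and U reused at every step. The dimensions, scale, and tanh nonlinearity are illustrative choices:

```python
import numpy as np

def rnn_step(x_t, s_prev, W, U):
    """One recurrent update: new state from current input and previous state."""
    return np.tanh(np.dot(W, x_t) + np.dot(U, s_prev))

input_dim, state_dim = 4, 3
W = np.random.randn(state_dim, input_dim) * 0.1  # input weights
U = np.random.randn(state_dim, state_dim) * 0.1  # recurrent weights
s = np.zeros(state_dim)                          # initial state

sequence = [np.random.randn(input_dim) for _ in range(5)]
for x_t in sequence:
    s = rnn_step(x_t, s, W, U)   # the same W and U at every time step
```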
Before, the normal hidden unit only knew something about the current example; it didn't know anything about the examples it had seen before. So some things to notice here: W and U, those parameters, stay the same throughout the sequence. This solves one of our problems, which is that we want to share parameters across the sequence. Once we learn something about grammar at the beginning of the sentence, we know that it also applies later in the sentence, and we don't have to relearn it. The other thing to note is that s_n, the cell state at time n, can contain information from all the past time steps. This solves our problem of keeping track of long-term dependencies, because the state at time 0 feeds into the state at time 1, which feeds into the state at time 2, so at any time n we have computed a function of all the previous states.

[Audience question] Oh, sorry, yes: it also feeds its own previous state in as input again.

All right, so I'm going to go through some possible tasks you could do with this, specifically applied to language, but this is easily extended to other types of sequences. One possible task is a language model. In a language model, we train the recurrent network on a specific corpus of language; this could be something like English or French, or something even more specific, like all the works of Shakespeare. In doing so, we train it to predict the most likely next word given what it's seen before, and we can then take outputs from that network and produce language that looks really similar to the input text. This sample was produced by a language model trained on all the works of Shakespeare, and you can see that it looks pretty similar.

How exactly does it do this? The network would look like this: we have an input at each time step, but in addition to producing a state at each time step, we're also producing an output, which we get just by multiplying the state by another set of weights. Here each output is a probability distribution over the most likely next words, given what the network has seen before. If we're training this network, we have some loss function based on how similar the output sequence is to the sequence we were actually trying to get out; if we've already trained the network, then we can just continue creating outputs to generate sentences like the one you just saw. And you can train a language model on anything. This is a funny example, and you can find more at the URL at the bottom: a language model trained on the King James Bible and the Structure and Interpretation of Computer Programs textbook, which produces things like "has it not been for the singular taste of old eunuchs, new eunuchs would not exist."
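As a sketch of the output step just described: at each time step the state is multiplied by another set of weights (called V here; the name is an assumption) and pushed through a softmax, giving a distribution over the next word that can be sampled from to generate text:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

vocab_size, state_dim = 10, 3
V = np.random.randn(vocab_size, state_dim) * 0.1   # output weights (assumed name)
s_t = np.random.randn(state_dim)                   # the cell state at this step

probs = softmax(V @ s_t)                           # distribution over next words
next_word = np.random.choice(vocab_size, p=probs)  # sample it to generate text
```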
Okay, so another possible task is classification, and this is what's in the tutorial later. This is something like: we have a bunch of tweets, and we want to determine that the tweet on top is negative and the tweet on the bottom is positive. It's really similar; it's just that we have a network producing an output at the very end, and this is a probability distribution over classes, in this case something like positive, negative, and neutral. Something to note here, which will come up again a couple of times, is that we're making this prediction about the entire tweet based on just the cell state at the last time step. That means the network is creating a representation of the entire tweet in that last time step: it's not just information about the last input, it's information about the entire tweet, because from it the network is able to make a prediction about the whole thing.

The final task I want to go over is machine translation, and this is called an encoder-decoder model because it's actually made up of two RNNs. The first one is an encoder: it takes the source sentence and encodes it into some fixed-length representation. Then we take the last cell state and feed it in as input to the second network, which is basically just a language model for a different language, such as French. Given that last cell state from the encoder, the decoder can produce an output sentence in the different language. Notice here again that we're treating the last cell state as a representation of the entire source sentence. Basically, what that cell state has to do is encode the core meaning of the sentence we saw, because from it the decoder is able to produce something in a different language that means the same thing.

Okay, so now I want to talk about how we would actually go about training an RNN. It's really similar to what Nick talked about, just backpropagation, except that there's an additional time dimension. As a quick review: in backpropagation we take the derivative of the loss with respect to each of our parameters, and then we move the parameters in the opposite direction of that derivative so that we can minimize the loss. In an RNN, we have a loss at each time step, because we're moving through time. So it's pretty simple: the total loss is just the sum of the losses at each time step, and since we have a loss at each time step, we also have a gradient at each time step, and the total gradient is just the sum of the gradients at each time step.

So let's try this out for a particular weight, W: for just this one parameter, how would we calculate the gradient? We know that the derivative of the total loss with respect to W is just the sum over time steps of the derivative of the loss at each time step with respect to W, so we can focus on a specific time step, do the same for all the time steps, and then add them up. Let's take time step two: we want the derivative of the loss at time step two, denoted J2, with respect to W. Again, we just use the chain rule: the loss at time step two depends on the output, which is y2; that depends on the cell state, which is s2; and that depends on W. So there we've used the chain rule to get the derivative of the loss with respect to W. But you might notice this last term, the derivative of the cell state at time two with respect to W. If we look a little closer, we can see that the cell state at time two also depends on the cell state at time one, and that also depends on W, so we can't just treat that last term as a constant; we have to expand it even further. To figure out exactly how to expand it, let's look at how s2, the cell state at time two, depends on W. We know that W feeds directly into it; that's one way.
We also know that the state at time two depends on the cell state at time one, and W feeds into that; and the cell state at time two depends on the state at time zero, and W feeds into that as well. So we end up with a summation, where we are counting. The reason we have to do this, remember, is what I said before: the cell state at any time depends on the cell states at all the previous times, and since we're sharing parameters, those weights W contribute to the error at time two based on how they contribute at time one and time zero and all the previous time steps. What we're doing here is basically counting the contributions of W to the error at all the previous time steps. Written in summation form, all we're doing is taking the derivative of the current state with respect to each previous state, times the direct contribution of W to that previous state, and this generalizes to any arbitrary time step.

Okay, so you might guess from the fact that this was kind of complicated that RNNs might be hard to train, and they are, specifically because of a problem called the vanishing gradient problem, and this comes about from that chain-rule product we were just doing. This is the same summation from before, from calculating the derivative of the loss with respect to a specific parameter, in this case W. Let's take a closer look at one term: the derivative of the current state with respect to each of the previous cell states, which lets us count the contribution of W at each of those times. From our previous network, if we're looking at the contribution of W at cell state zero, we know that we have to apply the chain rule back to state zero and then count the contribution of W at that state. That's fine; that's a chain rule of just two steps. But say our sequence is really, really long, and we're calculating the loss at the very last time step. We'd have to backpropagate all the way back to the beginning, which means the chain-rule product becomes really, really long, because we have to chain all the way back to the first input in order to get the contribution of W at the first time step. So as the gap between time steps gets bigger, this chain-rule product just gets longer and longer.

Okay, so you might be saying: okay, fine, why does that matter? Well, look at what each of these terms is. Each term, if you notice, is the derivative of a cell state at time n with respect to the previous cell state. And if you remember, cell states depend on the previous cell states by being multiplied by a weight and then having an activation function applied. So each of those derivatives is going to be a series of weights and derivatives of activation functions, because we're just taking the derivative. What you really want to take away is that those are both small numbers: weights are usually drawn from a standard normal distribution, and the derivatives of our activation functions are pretty much always less than one. So what we're doing in this chain-rule product is multiplying a lot of small numbers together, and remember, the bigger the gap between time steps, the more small numbers we're multiplying. Because of this, errors from time steps further back have smaller and smaller gradients, since they have to pass through this huge chain-rule product in order to be counted as part of the loss.
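Collecting the derivation above in symbols, in the speaker's notation (J_t is the loss at step t, y_t the output, s_t the cell state, W the shared weights), the gradient for step 2 and its troublesome product term are:

```latex
\frac{\partial J_2}{\partial W}
  = \sum_{k=0}^{2}
    \frac{\partial J_2}{\partial y_2}\,
    \frac{\partial y_2}{\partial s_2}\,
    \frac{\partial s_2}{\partial s_k}\,
    \frac{\partial s_k}{\partial W},
\qquad
\frac{\partial s_2}{\partial s_k}
  = \prod_{j=k+1}^{2} \frac{\partial s_j}{\partial s_{j-1}}
```

The product term is exactly the long chain being described: each factor involves the recurrent weights and an activation-function derivative, so the further apart two time steps are, the more small factors get multiplied together.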
So remember that in backpropagation, we shift our weights according to that gradient. If we're shifting our weights according to the gradient, but the errors from far back count less and less in it, then we're going to shift our weights so that they basically only listen to errors from very close by; the errors from very far back will not count as much in the gradient. Because of that, our parameters are going to become biased to capture short-term dependencies, because those errors will be the most prominent in our loss and in our gradient. Does anyone have any questions before I keep going? Okay.

So this is just reiterating what I said: since the parameters are being biased to capture shorter-term dependencies, we run into the same problem we had before. In this sentence, for example, if I'm trying to model language and predict the word that goes in that blank, it's going to be hard, if my parameters are biased toward short-term dependencies, for the model to know that the word is "French," because it has to go all the way back and see the word "France."

Okay, so now I'm going to go over some ways you could deal with this problem, ways that are commonly used in practice. One way is by choosing the right activation function. Nick mentioned there are activation functions like the hyperbolic tangent, the sigmoid, and the ReLU, and these are the derivatives of those functions. If you remember, one of the reasons that chain-rule term was so tiny is that there were a lot of activation-function derivatives in that big long product. If we choose an activation function like the hyperbolic tangent or the sigmoid, you can see why that would be: the derivative is pretty much always less than one, so we're multiplying a lot of small numbers together. But if we choose something like a ReLU, its derivative is zero when the input is less than zero, but otherwise it's just one, so those derivative terms aren't going to shrink the product.

Another thing we could do is initialize our weights differently. The other terms, besides the activation derivatives, were the weights we kept multiplying together, and I mentioned those are small because they're drawn from a standard normal distribution. If we instead initialize our weights to the identity matrix, those multiplications would at least take a step toward preventing that product from shrinking and shrinking very fast.
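Before moving on to the gated-cell solution, here is a tiny simulation of the shrinking chain-rule product just described; the state size, number of steps, and small initialization scale are arbitrary illustrative choices:

```python
import numpy as np

np.random.seed(0)
state_dim, steps = 10, 50
U = np.random.randn(state_dim, state_dim) * 0.1   # recurrent weights, small random init
grad = np.eye(state_dim)                          # running chain-rule product
for t in range(steps):
    z = np.random.randn(state_dim)                # stand-in pre-activations at this step
    D = np.diag(1.0 - np.tanh(z) ** 2)            # tanh derivatives, each at most 1
    grad = D @ U @ grad                           # one more ds_j/ds_{j-1} factor
    if (t + 1) % 10 == 0:
        print(t + 1, np.linalg.norm(grad))        # the norm shrinks toward zero
```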
The final solution, which is really effective, is using something called a gated cell, and this is what an LSTM is: a recurrent unit that uses several logic gates to control the flow of information through it. It helps because, with a gated cell, the gates specify what information flows through in full and what information gets multiplied by weights and activations and scaled down. I'm going to go through a high-level overview of what the LSTM is doing. There are equations that determine exactly what the LSTM does, and if you're interested in seeing those later, just come talk to me and I can direct you to other resources, but right now I'll go over a more conceptual, high-level picture.

Okay, so the first step is that we forget irrelevant parts of the previous state. Say I'm modeling language and I see a new subject: I might choose to forget the previous subject, because I know that the next word will be conjugated according to the new subject. The next step is that I selectively update my cell-state values. Since this is a recurrent network, we're keeping track of a cell state, so if the new subject is masculine singular, I might choose to encode that information in my cell state; but if there's something I think is irrelevant, I might not update the cell state to remember it. Finally, I output the parts of the cell state that I think will be relevant to the output at this time step, but I might not choose to output everything. For example, if my cell state encodes that I have previously seen the word "France," but I don't think that's relevant to the current output, I can keep it in the cell state but not output it. And this is all done with a series of logic gates that multiply each input by some number, to determine how much of that input we keep.

So, as a brief overview of why LSTMs help us with the vanishing gradient problem and make it easier to keep track of long-term dependencies: first, the forget gate. While it allows you to forget things, it also allows you to remember things: you can forget completely the things you think are irrelevant, but you can also remember things completely. We're no longer multiplying at each time step by some tiny set of weights; if there's some important information, the forget gate can learn to remember it entirely, so that weight would be a 1, and we wouldn't be multiplying by a lot of tiny weights. The second thing is that in the update step, the middle step, we update the new cell state through addition. It's an additive function: s_j depends on s_{j-1} through addition, so when we backpropagate, it doesn't turn into this huge long multiplication.

Okay, does anyone have any questions? You can also ask me about it afterwards.

[Audience question: once you take a single training step, aren't the weights already off the identity?] Yeah, so the weights will be updated; they're not going to stay as the identity, you're right. But it's surprising how much a good initialization can help, because if we start moving really slowly at the beginning, it's possible to get into local minima or not train properly. So if our first couple of steps are really good, that can make a big difference. Okay, so I'm just going to wrap up now and go through one example that we saw before.

[Audience question] Yeah, so I forgot to mention that all of these weights, the forget-gate weights and the update weights, are all learned parameters. The LSTM is actually harder to train, in the sense that it takes longer, because there are so many more parameters: if you remember, the simple RNN had just two weight matrices, but here there are a lot more. However, it's also easier to train in the sense that, if we train it for longer, it learns better things about our sequences.

[Audience question] Yeah, so I can show you the formulas if you want to come see. There are several gates and several additions, and basically each gate computes a function of our input and our previous cell state, to determine what to remember and what to forget.
But yeah, they're just multiplications: each part of the input is multiplied by some value between 0 and 1, and the thing is that this value can be completely 1 or it can be 0, if we want to forget certain things or remember certain things entirely.

[Audience question] Um, can we talk after I'm done? Yeah. Actually, maybe once I finish these slides, I'll put those equations up.

All right, so just to go back to our model of machine translation: we can replace all of these units with LSTMs, which is what's typically used in practice now. And one thing that I want to go back to, which I mentioned previously, is the fixed-length encoding. If you remember, we are trying to decode based on this fixed-length encoding, which might be okay for a short sentence like "the dog eats," but if we're trying to translate a long sentence, that's going to be pretty limiting: it's going to be hard to fit a representation of our entire encoder sentence into that single last cell state. So one thing we might use, and this is something that's typically used, is called attention, where we create a weighted combination over all of our previous cell states, paying attention to the cell states we think matter most. For example, when we're decoding the first word, we might choose to place more weight on the first state of the encoder model; and when we're decoding the second word, we might place more weight on the second state of the encoder, while placing some weight on the other states as well, because those might be relevant for context or for tense or something like that. These weights are also just learned parameters of the network.

Okay, so there are lots of different things you can do with LSTMs and recurrent networks. You can extend these models to time series and waveforms, not just language; you can build models to generate text or books or code; self-driving cars, the way that they park, is in some cases modeled with LSTMs for motion; you can predict stock market trends, or summarize things. But yeah. [To Nick] Do you want to go through TensorFlow first and then do that? Why don't you go through this, and then I'll pull the equations up on my own.

So, to prepare for part four, where we're going to actually build LSTMs to classify tweets as positive or negative, I'm going to give a brief introduction to TensorFlow. TensorFlow is a deep learning framework, and one of the reasons you might want to use a deep learning framework, rather than rolling your own, is that you get a few advantages. One is GPU acceleration, so your code runs a lot faster. The second is automatic differentiation: all those derivatives and gradients that you saw in the previous lecture, you don't have to think about them too much; they get handled for you, and as long as you express your function as something differentiable, it can be differentiated automatically. The third is code reusability, and fourth, anything that speeds up your idea-to-implementation-to-result time is just going to help your research. There's a bunch of frameworks out there, like Caffe and Theano, and we're going to be talking about TensorFlow today.

One of the core principles of TensorFlow is, as the name says, the tensor, and if you've ever used NumPy, it's actually pretty similar: in NumPy you can make arrays and initialize them as zeros or as ones.
Very similarly, in TensorFlow you can just do tf.zeros or tf.ones, and you can do operations like sum, get the shape, and pretty much anything you could think of in NumPy you can also express in TensorFlow.

The other core principle of TensorFlow is that it works in roughly three steps. The first is to create a session, which contains all of your graph information and all of your weights. The second is to actually define your computation graph; TensorFlow generalizes to any computation graph, it doesn't have to be a specific neural network. And the third is that once you have your graph set up, you feed your input in and get the output out from your graph; this is called evaluating, or running, your network.

Sessions are encapsulated in a session object, and there are a couple of different ways to make one. In this tutorial you'll be using tf.InteractiveSession, but typically, if you're writing standalone files, you just initialize a session with tf.Session.

The second core part is the graph itself. This is a very simple computation graph, where we have some inputs a and b, and we're going to compute some output e at the end that is a composite of these different functions; we can express any graph this way. To dive a little deeper into what a graph is actually made of, there are different node types in TensorFlow. When we want to feed input into a graph, we reserve a place for it, called a placeholder: here we create placeholders a and b, which accept some input. We can also create constants; a constant is just a number. And then the blue nodes are operations: we can add, we can subtract, we can multiply, we can do any differentiable operation, and we can even do non-differentiable stuff like taking maxes and mins.

Once we have this graph created, we want to actually run it: feed in some input and get our output. The way we do this is session.run: we pass in the node that we want the output for, and we pass in any inputs as part of this feed_dict object. To break this down: in the first one, we evaluate the output e for inputs a and b with certain values. And we don't have to evaluate only the last node; we can evaluate any node, so we can evaluate c, and we can even evaluate e and c at the same time.
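Putting those three steps together, here is a minimal sketch in the TensorFlow 1.x API that was current when this talk was given; the node names and values are illustrative:

```python
import tensorflow as tf  # TensorFlow 1.x API

# Step 2: define the computation graph.
a = tf.placeholder(tf.float32)     # placeholders reserve a place for inputs
b = tf.placeholder(tf.float32)
k = tf.constant(2.0)               # a constant node
c = tf.add(a, b)                   # operation nodes
e = tf.multiply(c, k)

# Steps 1 and 3: create a session, then feed inputs in and fetch outputs out.
sess = tf.Session()
print(sess.run(e, feed_dict={a: 1.0, b: 3.0}))       # evaluate the output node
print(sess.run([c, e], feed_dict={a: 1.0, b: 3.0}))  # or any nodes at once
```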
Moving from a very simple graph to building a neural network isn't that many more steps. The previous example was a pretty constant computation: you fed in some input and got some output that wouldn't change. But if we're building a neural network, we need parts of our graph to change, so we can't use constants anymore. Instead, we use something called a tf.Variable. We can create variables, which are just arrays of numbers, and initialize them as zeros, or from random normal distributions, or as identity matrices; TensorFlow offers a lot of different initialization functions. The key thing here is that we're making this array of numbers, but it's able to change; it's non-constant.

So if we were to build a neural network graph: this right here is just a perceptron classifier, where we have some input x, we multiply that x by our weights W, we add our bias, and we apply some nonlinearity function to it at the end, and this is the code to create it in TensorFlow. Note that the weights and the biases are tf.Variables.

The next step is to add our loss function, because we want to train this. The cool thing about losses in TensorFlow is that they're actually graph computations; they're part of the graph itself, so we can express them in TensorFlow methods. Here we're just using a sigmoid cross-entropy loss on our output. The next part is that we want to actually optimize, to train our network. There are lots of different options for optimizers: there's stochastic gradient descent, there are things like the Adam optimizer. The optimizer is also an operation in the graph, so we do the same thing: we call session.run, we give it the optimizer node, and we pass in our x inputs and then our y label outputs, because we want to compute the loss between the two. So that's a brief overview of creating a graph, running the graph, and training your graph.
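Here is a minimal end-to-end sketch of what was just described: a perceptron classifier with tf.Variables for its weights, a sigmoid cross-entropy loss expressed as part of the graph, and an optimizer node run inside session.run. This uses the TensorFlow 1.x API, and the dimensions and stand-in training data are made up:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

n_inputs = 3
x = tf.placeholder(tf.float32, [None, n_inputs])    # input examples
y = tf.placeholder(tf.float32, [None, 1])           # target labels

W = tf.Variable(tf.random_normal([n_inputs, 1]))    # weights: non-constant, trainable
b = tf.Variable(tf.zeros([1]))                      # bias
logits = tf.matmul(x, W) + b                        # the perceptron itself

# The loss is itself a graph computation.
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)  # the optimizer is a graph op too

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for step in range(100):
    batch_x = np.random.randn(32, n_inputs)         # stand-in training data
    batch_y = (batch_x.sum(axis=1, keepdims=True) > 0).astype(np.float32)
    _, l = sess.run([train_op, loss], feed_dict={x: batch_x, y: batch_y})
```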
I'm also going to touch on some useful features of TensorFlow that are worth knowing about before we start the tutorial. One of the coolest parts of TensorFlow is TensorBoard: once you create your network, you can actually visualize it and see what it looks like. This is really useful in debugging, to check that what you created is actually what you thought you created; it's really good to see the actual computation graph that gets built by your code, and you can dig in, click around, and see what's happening in your network. The second is that it offers lots of support for logging results, like your loss curves, so you can see your training and convergence over time. The way you use TensorBoard is you just run it from the command line, but you have to prepare these outputs using something called summary logs. Each of these plots is a summary log, and each point on it is a graph operation that was run to save a number in the log. We create summary logs with tf.summary: we can use things like scalars, and you can make histograms as well. Essentially, as you run and update your graph, the summary is just another part of your graph that gets run, and it writes out to a log file.

The second really useful feature of TensorFlow is something called name scoping. Here we create a variable v inside the name scope bar, which is inside the name scope foo, and we see that the full name of our variable is foo/bar/ plus the variable's name. This is really useful for a few reasons, one of which is sharing your weights. If you want to build a graph but run it over multiple different inputs at once and then aggregate those results, you don't want to create the same graph over and over with a different set of weights; you want to share the weights between instances of the same graph. You can use tf.get_variable to share your weights across different uses. The second reason for name scoping is that it lets you compose cleaner code: we can take all that code we wrote earlier to build a simple neural network and put it in one function, called make_layer. If we did this in the naive way, without variable scoping, all the variables would clash in their names; instead, we create variable scopes, which keep things clean by giving different layers different scopes. When you use variable scoping, it also gives you cleaner graph visualizations: this conv2 that you see here in the center is a layer somebody made, with the scoping of conv2, and within conv2 are all the inner operations.

[Audience question] Yes; no, it does not change the structure of the computation. It may affect the way your weights are shared, though, because essentially what tf.get_variable does is look up a specific key in the session: it'll get the variable for that key, and if it doesn't exist, it will create it, but if it already exists, it'll grab that exact one.

So it's great that we can now train our networks, but it's also great that we can save them and run them later. There's an idea of checkpointing: basically, you can create a Saver object and save your actual session to a checkpoint, and then when you want to run it again later, you create a new session but restore it. That basically takes your entire graph and all the weights associated with it and dumps it back into the environment, so you have everything you previously had; you can eval your model, dig into your weights, and see offline what was going on.

TensorFlow is also useful as the core of many other frameworks: if you use things like Keras or TFLearn, or many others, TensorFlow is the underlying code that's being used, so it's very useful to understand the core of what you're using, even if you end up using one of these higher-level frameworks like Keras.

So we'll be transitioning into part four, which is the actual tutorial: you'll again be using LSTMs to classify tweets as positive or negative. I think Harini wanted to bring up the slide of LSTM equations first. Is it ready? Okay, thank you. They're pretty complicated.
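The slide itself isn't reproduced in the transcript, so for reference, here is the standard formulation of the LSTM equations being discussed (σ is the sigmoid, ⊙ is element-wise multiplication, and [h_{t-1}, x_t] is the previous output concatenated with the current input):

```latex
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{input (update) gate} \\
\tilde{c}_t &= \tanh(W_c\,[h_{t-1}, x_t] + b_c) && \text{candidate values} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{additive cell-state update} \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(c_t) && \text{output}
\end{aligned}
```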
So basically, what we're doing is: we take some function of the previous cell state and the current input, which is a vector, and we multiply it element-wise by the output of a sigmoid. You can imagine that we're multiplying that vector by a vector of numbers between 0 and 1, because it's a sigmoid, and that's the gating we're talking about: we're multiplying each input by some number between 0 and 1, and in that way we're deciding what to keep.

[Audience question] Yeah, so we add up the previous hidden state and the new input, multiply them by some weights, and then we gate that. We can choose to pass it fully through, if the forget-gate activation is a 1, or we can choose to fully forget it. In that way, the effective activation function for the input is essentially whatever the forget-gate activation is, and that can be the identity function. So we have the gates here: the forget gate, the update gate, and the output gate. The σ is a sigmoid function, so it's a number between 0 and 1.

Look at where we update the values here; this is how we get the new state value. This is the old state, this is the new state, and this is an addition. Before, we were taking the old state, multiplying it by a weight, and getting the new state, so when we took the derivative of the next state with respect to previous states, we got a long product with a lot of weights in it, because we were taking the Jacobian of each weight matrix, the derivative. In this case we're no longer doing that; it's a linear function. That's one of the key parts of the LSTM; if you read about it, you might see it called the linear carousel.

So, in the diagram I mentioned before: this is the forget gate. We're basically multiplying each of the inputs by some value and then adding that to our cell state. It's called a forget gate, but I think it could also be called the remember gate, because you're choosing what to remember; if you don't forget something, you're remembering it. This is the update step: in the update step, we take the new input and add it, in some gated way, to the old state. And in the final step, we're choosing what to output to h_t, which is our output at that time. Our output is different from our cell state, so we can choose to keep things in our cell state but not output them. You might notice that all of these steps look really similar; that's because all of them are deciding what to output or what to forget based on some function of the input and the previous cell state. Based on what I've seen before and what my new input is, I can decide which parts of the previous stuff I want to keep, which parts of the new stuff I want to keep, and what I want to output.

[Audience question] Yeah, so it's because we want to update certain things, but we also want to control the amount by which we update them. In this part we're choosing what we want to update, and in this part we're deciding how much we want to scale each of those things; it's basically a scaling factor, so we can scale them up or down based on how important they are. That's why there's this additional multiplication, and that update term is then added to the cell state. Does that make sense?

[Audience question] So the σ literally means a sigmoid function, so between 0 and 1. And yeah, it's the aggregate of those two vectors; they're concatenated together.

[Audience question] Okay, so first of all, like I mentioned before, our update is a linear, additive update; it's not a multiplicative update. There's not a derivative-of-s_n-with-respect-to-s_{n-1} term, which was where all those tiny numbers were coming from, and there's not a repeated weight matrix in there, because our update is, where is it, the one that says c_t equals this. I'm not sure if that's clear, but yeah. Okay, so the other thing is that there's no activation function here, so let me try to come up with an example. Say that our input at time 0 had a forget-gate activation of 1 for that input, so the identity, basically: we multiplied it by a 1, so we kept it completely in our cell state.
And say that after that, it never got updated. In that case, when we take the derivative of the loss with respect to that input, it'll completely be 1, because we only ever multiplied it by a 1. The activation function, the f function, in that case would be the identity, so that input, or the weight that was multiplied by that input, would contribute fully to the loss at a future time step, because there is no small activation-function derivative in the way. Basically, you can think of the activation function in an LSTM as being whatever the forget-gate activation is; that's what the values in our cell state get multiplied by.

And I guess, since it's past the hour, we should let people start, and you guys can keep asking questions. So part four is the tutorial: if you go to this link, there is a GitHub repo that you should download or clone, and the instructions on what to run are in there. We'll be around answering questions and helping out, but it's a really cool lab and we hope you enjoy it.
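For orientation before you open the repo, here is a minimal sketch of the kind of model the tutorial builds: an LSTM tweet classifier in the TensorFlow 1.x API, reading the class prediction off the final cell state as described earlier. All of the dimensions and names here are hypothetical; the actual repo defines its own:

```python
import tensorflow as tf  # TensorFlow 1.x API

batch_size, max_len, embed_dim = 32, 20, 50    # hypothetical dimensions
hidden_dim, n_classes = 64, 2                  # e.g. positive vs. negative

tweets = tf.placeholder(tf.float32, [batch_size, max_len, embed_dim])
labels = tf.placeholder(tf.int64, [batch_size])

cell = tf.contrib.rnn.BasicLSTMCell(hidden_dim)              # one LSTM cell
outputs, state = tf.nn.dynamic_rnn(cell, tweets, dtype=tf.float32)

# Classify from the state at the last time step, as described earlier.
W = tf.Variable(tf.random_normal([hidden_dim, n_classes]))
b = tf.Variable(tf.zeros([n_classes]))
logits = tf.matmul(state.h, W) + b

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```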
Info
Channel: MITCBMM
Views: 116,419
Rating: 4.8900971 out of 5
Keywords: CBMM, Center for Brains Minds and Machines, Artificial Intelligence
Id: l4X-kZjl1gs
Length: 59min 45sec (3585 seconds)
Published: Fri Apr 28 2017