CS231n Winter 2016: Lecture 10: Recurrent Neural Networks, Image Captioning, LSTM

Test, test. Okay, it works. Okay, good. We should get started soon. Today we'll be talking about recurrent neural networks, which is one of my favorite topics and one of my favorite models to play with and put into neural networks just about everywhere; they are a lot of fun to play with.

In terms of administrative items, recall that your midterm is on Wednesday, this Wednesday. You can tell that I'm really excited; I don't know if you guys are excited, you don't look very excited to me. Assignment 3 will be out this Wednesday. It's due two weeks from now on Monday, but since we had planned to release it today and we're shifting the release to roughly Wednesday, we'll probably defer the deadline by a few days. And assignment 2, if I'm not mistaken, was due on Friday, so if you're using three late days you'd be handing it in today; hopefully not too many of you are doing that. Are people done with assignment 2? How many people are done? Okay, most of you. Okay, good, great. So we're doing well.

Currently in the class we're talking about convolutional neural networks. Last class specifically we looked at visualizing and understanding convolutional neural networks, so we looked at a whole bunch of pretty pictures and videos and had a lot of fun trying to interpret exactly what these convolutional networks are doing, what they're learning, how they're working, and so on, and we debugged this in several ways that you can maybe recall from last lecture. Actually, over the weekend I stumbled upon some other visualizations that are new. I found these on Twitter and they look really cool, and I'm not sure how people made them because there isn't much description to them, but they look really cool: this is turtles and a tarantula, and then this is a chain and some kind of a dog. The way you do this, I think, is something like DeepDream; it is again optimization over images, but they're using a different regularizer on the image. In this case I think they're using a bilateral filter, which is a kind of fancy filter, and if you put that regularization on the image then my impression is that these are the kinds of visualizations you get instead. So that looks pretty cool, but I'm not sure exactly what's going on; I guess we'll find out soon.

Okay, so today we're going to be talking about recurrent neural networks. What's nice about recurrent neural networks is that they offer a lot of flexibility in how you wire up your neural network architectures. Normally, when you're working with neural networks, you're in the case on the very left here, where you're given a fixed-size input vector (in red), you process it with some hidden layers (in green), and you produce a fixed-size output vector (in blue). Say an image comes in, which is a fixed-size image, and we produce a fixed-size vector, which is the class scores. With recurrent neural networks we can actually operate over sequences: sequences at the input, the output, or both at the same time. For example, in the case of image captioning, which we'll see some of today, you're given a fixed-size image and then, through a recurrent neural network, we're going to produce a sequence of words that describe the content of that image; that sentence is the caption for the image. In the case of sentiment classification in NLP, for example, we consume a number of words in sequence and then try to classify whether the sentiment of that sentence is positive or negative.
In the case of machine translation, we can have a recurrent neural network that takes a number of words in, say, English, and is then asked to produce a number of words in French, for example, as a translation. We feed this into a recurrent neural network in what we call a sequence-to-sequence setup, and this recurrent network then performs translation on arbitrary English sentences into French. And in the last case, for example, we have video classification, where you might want to classify every single frame of a video with some number of classes, but crucially you don't want the prediction to be a function only of the current time step, the current frame of the video, but also of all the frames that have come before it. Recurrent neural networks allow you to wire up an architecture where the prediction at every single time step is a function of all the frames that have come in up to that point.

Now, even if you don't have sequences at the input or the output, you can still use recurrent neural networks, even in the case on the very left, because you can process your fixed-size inputs or outputs sequentially. One of my favorite examples of this is a DeepMind paper from a while ago where they were trying to transcribe house numbers. Instead of just feeding the big image into a ConvNet and trying to classify exactly what house numbers are in there, they came up with a recurrent neural network policy where a small ConvNet is steered around the image spatially by a recurrent neural network, and the recurrent network learned to basically read out the house numbers from left to right, sequentially. So we have a fixed-size input and we're processing it sequentially. Conversely, there is also a well-known paper called DRAW. This is a generative model, so what you're seeing here are samples from the model; it's coming up with these digit samples, but crucially we're not predicting the digits in a single shot. We have a recurrent neural network, we think of the output as a canvas, and the network goes in and paints it over time. You're giving yourself more of a chance to do some computation before you actually produce your output, so it's a more powerful form of processing data.

There's a question over there: is every one of these arrows kind of like a dependency? We'll see the specifics of exactly what this means; for now the arrows just indicate functional dependence, so things are a function of the things before them, and we'll go into exactly what that looks like in a bit. And about this one: these are generated house numbers. The network looked at a lot of house numbers and came up with a way of painting them, so these are not in the training data; these are made-up house numbers from the model, none of them are actually in the training set. They look quite real, but they're made up by the model.

So a recurrent neural network is basically this thing here, the box in green. It has a state, and it receives input vectors through time: at every single time step we can feed an input vector into the RNN, and it can modify its internal state as a function of what it receives at every time step. There are of course weights inside the RNN, and as we tune those weights the RNN will show different behavior in terms of how its state evolves as it receives these inputs.
Now, usually we are also interested in producing an output based on the RNN's state, so we can produce these vectors on top of the RNN. You'll see me show pictures like this, but I'd just like to note that the RNN is really just the block in the middle: it has a state, it can receive vectors over time, and then in some applications we can base a prediction on top of its state.

Concretely, the way this looks is that the RNN has some kind of state, which here I'm denoting as a vector h (this can also be a collection of vectors or a more general state), and we compute it as a function of the previous hidden state at time t-1 and the current input vector x_t. This is done through a function which I'll call a recurrence function f, and that function has parameters W. As we change those Ws, the RNN will have different behaviors, and of course we want some specific behavior out of the RNN, so we're going to be training those weights on data; you'll see examples of that soon. For now, I'd like you to note that the same function is used at every single time step: we have a fixed function f with weights W, and we apply that single function at every single time step. That allows us to use the recurrent network on sequences without having to commit to the size of the sequence, because we apply the exact same function at every time step no matter how long the input or output sequences are.

In the specific case of a vanilla recurrent neural network, the simplest recurrence you can use is what I'll refer to as a vanilla RNN. In this case the state of the recurrent neural network is just a single hidden vector h, and we have a recurrence formula that tells you how to update the hidden state h as a function of the previous hidden state and the current input x_t. In the simplest case we have these weight matrices W_hh and W_xh, which project the hidden state from the previous time step and the current input respectively; those get added, and we squish them with a tanh, and that's how we update the hidden state at time t: h_t = tanh(W_hh h_{t-1} + W_xh x_t). So this recurrence tells you how h changes as a function of its history and of the current input at this time step. We can then base predictions on top of h, for example using just another matrix projection on top of the hidden state, y_t = W_hy h_t. This is the simplest complete case in which you can wire up a recurrent neural network.
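Just to make that concrete, here is a minimal numpy sketch of a single step of this vanilla RNN; the sizes and variable names are illustrative choices of my own, not code from the lecture:

```python
import numpy as np

H, D, C = 3, 4, 4                       # hidden size, input size, output size (illustrative)
Wxh = np.random.randn(H, D) * 0.01      # input-to-hidden weights
Whh = np.random.randn(H, H) * 0.01      # hidden-to-hidden weights
Why = np.random.randn(C, H) * 0.01      # hidden-to-output weights

def rnn_step(x, h):
    """One time step: new hidden state from the previous state and the current input."""
    h = np.tanh(Wxh @ x + Whh @ h)      # h_t = tanh(W_xh x_t + W_hh h_{t-1})
    y = Why @ h                         # y_t = W_hy h_t, a prediction based on the new state
    return y, h

h = np.zeros(H)                         # the hidden state starts at zero
y, h = rnn_step(np.array([1.0, 0.0, 0.0, 0.0]), h)
```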
Okay, so to give you an example of how this works: so far I've talked about x, h, and y in the abstract, just as vectors, but we can endow these vectors with semantics. One of the ways we can use a recurrent network is for character-level language models, and this is one of my favorite ways of explaining RNNs because it's intuitive and fun to look at. The way it works is that we feed a sequence of characters into the recurrent neural network, and at every single time step we ask the network to predict the next character in the sequence: a full distribution over what it thinks should come next, given everything it has seen so far.

So suppose in this very simple example we have the training sequence "hello", so we have a vocabulary of four characters in this dataset: h, e, l, and o, and we're going to try to get a recurrent neural network to learn to predict the next character in the sequence on this training data. The way this works is that we feed each of these characters one at a time into the recurrent neural network; you'll see me feed in h at the first time step, and here the horizontal axis is time, so we feed in h, then e, then l, then l. I'm encoding characters with what we call a one-hot representation, where we just turn on the bit that corresponds to that character's position in the vocabulary. Then we use the recurrence formula I showed you: suppose we start with h as all zeros, then we apply the recurrence to compute the hidden state vector at every single time step using that fixed formula. Suppose we have only three numbers in the hidden state; then we end up with a three-dimensional representation that at any point in time summarizes all the characters that have come until then. We apply this recurrence at every single time step, and now we predict, at every single time step, what the next character in the sequence should be. Since we have four characters in this vocabulary, we predict four numbers at every time step.

For example, at the very first time step we fed in the letter h, and the RNN, with its current setting of weights, computed these unnormalized log probabilities for what it thinks should come next: it thinks h is 1.0 likely to come next, e is 2.2 likely, l is -3.0 likely, and o is 4.1 likely, in terms of unnormalized log probabilities. Of course, we know that in this training sequence e should follow h, so the 2.2, which I'm showing in green, is the correct answer in this case; we want that number to be high and all the other numbers to be low. At every single time step we have a target for which character should come next in the sequence, so we want all of those target numbers to be high and all the other numbers to be low, and that of course gets encoded in the gradient signal of the loss function and backpropagated through these connections.

Another way to think about it is that at every single time step we basically have a softmax classifier over the next character, and at every single point we know what the next character should be, so we get all those losses flowing down from the top; they all flow backwards through the graph, through all the arrows, we get gradients on all the weight matrices, and then we know how to shift the matrices so that the correct probabilities come out of the RNN. We're shaping the weights so that the RNN has the correct behavior as you feed it characters, and you can imagine how we can train this over data.

Are there any questions about this diagram? Good, yes, thank you. So, at every single time step we apply, as I mentioned, the same recurrence, the same functions: a single W_xh at every time step, a single W_hy at every time step, and the same W_hh at every time step. So we've used W_xh, W_hy, and W_hh four times in this diagram.
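To make the "hello" example concrete, here is a rough sketch in numpy of one forward step and its loss, with random weights standing in for the trained ones (my own illustration, not the slide's numbers):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
H, V = 3, 4                                  # 3 hidden units, 4 characters, as in the example
Wxh = np.random.randn(H, V) * 0.01
Whh = np.random.randn(H, H) * 0.01
Why = np.random.randn(V, H) * 0.01
h = np.zeros(H)

x = np.zeros(V)                              # one-hot encode the input character 'h'
x[char_to_ix['h']] = 1

h = np.tanh(Wxh @ x + Whh @ h)               # new 3-dimensional hidden state
scores = Why @ h                             # 4 unnormalized log probabilities over the next character
probs = np.exp(scores) / np.sum(np.exp(scores))   # softmax
loss = -np.log(probs[char_to_ix['e']])       # the correct next character is 'e': push its probability up
```

At every time step you get one such softmax loss, and all of them are summed and backpropagated together.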
In backpropagation, when we backprop through this, we of course have to account for the fact that each weight matrix is used at multiple time steps: all of these gradients add up into the same weight matrices. This is also what allows us to process variably sized inputs, because at every time step we're doing the same thing, so the network is not a function of the absolute number of things in your input.

Okay, question: what are common choices for initializing the first h_0? I think setting it to zero is quite common. Good. Does the order in which we feed in the data matter? Are you asking what happens if I plugged these characters in at a different order? If you think about it functionally, the hidden state vector at this time step is a function of everything that has come before it, so the order matters for everything you have fed in so far. We're going to go through some specific examples which I think will clarify these points.

Okay, so let's look at a specific example. In fact, if you want to train a character-level language model, the code is quite short; I wrote a gist that you can find on GitHub, which is a roughly 100-line implementation in numpy of a character-level RNN. I'd actually like to step through it with you so you can see concretely how we train a recurrent neural network in practice, so I'm going to step through this code, going through all the blocks.

In the beginning, as you'll see, the only dependency is numpy. We load in some text data; our input is just one large sequence of characters, in this case a text file (input.txt). We get all the characters in that file, we find all the unique characters, and we create these mapping dictionaries that map from characters to indices and from indices to characters. So we basically order our characters: say we've read in a whole bunch of data and we have a hundred unique characters or something like that; we've ordered them in a sequence, so we associate an index with every character.

Then we do some initializations. First, the hidden size is a hyperparameter, as you'll see with recurrent neural networks; here I'm choosing it to be 100. We have a learning rate. The sequence length here is set to 25, and this is a parameter you'll become aware of with RNNs. The problem is that if your input data is way too large, say millions of time steps, there's no way you can put an RNN on top of all of it, because you'd need to keep all of that intermediate state in memory to do backpropagation. Since we can't keep all of it in memory and backprop through all of it, we go through the input data in chunks; in this case, chunks of 25 characters at a time. As you'll see in a bit, we have this entire dataset but we process 25 characters at a time, and at each step we only backpropagate through those 25 characters, because we can't afford to backpropagate for longer. So we go in chunks of 25, and then we have all these W matrices, which I'm initializing randomly, and some biases; so W_xh, W_hh, and W_hy, and those are all of the parameters that we're going to train with backprop.
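The setup portion of that gist looks roughly like this (a paraphrase of what the lecture describes, not the exact file):

```python
import numpy as np

# data I/O: the input is just one long string of characters
data = open('input.txt', 'r').read()
chars = list(set(data))                          # unique characters in the data
data_size, vocab_size = len(data), len(chars)
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

# hyperparameters
hidden_size = 100      # size of the hidden state vector
seq_length = 25        # how many characters we backpropagate through at a time
learning_rate = 1e-1

# model parameters, initialized with small random numbers
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input to hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden to output
bh = np.zeros((hidden_size, 1))                          # hidden bias
by = np.zeros((vocab_size, 1))                           # output bias
```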
Okay, now I'm going to skip over the loss function for a moment and jump to the bottom of the script, where we have the main loop. There are some initializations of various things to zero at the beginning, and then we loop forever. The first thing we do is sample a batch of data: this is where I take a chunk of 25 characters out of the dataset, and that goes into the list inputs, which just holds 25 integers corresponding to the characters. The targets, as you'll see, are the same characters but offset by one, because those are the indices we're trying to predict at every single time step. So inputs and targets are just lists of 25 character indices, with targets offset by one into the future; that's how we sample a batch of data.

Then there's some sampling code: at every point in time as we're training this RNN, we can try to generate some samples of what it currently thinks these sequences should look like. The way we use character-level RNNs at test time is that we seed with some characters, and the RNN always gives us a distribution over the next character in the sequence, so you can sample from it, feed the sampled character back in, sample from the next distribution, and so on; you keep feeding the samples back into the RNN and you can generate arbitrary text data. That's what this code does, and it calls the sample function, which we'll look at in a bit.

Then I call the loss function. The loss function receives the inputs, the targets, and also this hprev; hprev is short for the hidden state vector from the previous chunk. We're going in chunks of 25, and we keep track of the hidden state vector at the end of those 25 letters, so that when we feed in the next chunk we can pass it in as the initial h. We're making sure the hidden state is correctly propagated from chunk to chunk through that hprev variable, but we only backpropagate through those 25 time steps. We feed that into the loss function and get back the loss and the gradients on all the weight matrices and biases. Here I just print the loss, and then here is the parameter update: the loss function gave us all the gradients, and here we actually perform the update, which you should recognize as an Adagrad update. I have these cache variables for the accumulated squared gradients, and then I perform the Adagrad update.
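As a reminder, the shape of that Adagrad update is roughly the following; this is a generic sketch with made-up variable names, where dparam stands for a gradient computed by the backward pass:

```python
import numpy as np

learning_rate = 1e-1
param = np.random.randn(100, 100) * 0.01   # some parameter matrix
cache = np.zeros_like(param)               # running sum of squared gradients for this parameter

def adagrad_update(param, dparam, cache, learning_rate=1e-1):
    # accumulate the squared gradient, then take a step scaled by 1 / sqrt(accumulated sum)
    cache += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(cache + 1e-8)

dparam = np.random.randn(*param.shape)     # stand-in gradient, just for illustration
adagrad_update(param, dparam, cache)
```

In the script there is one such cache per weight matrix and bias, and the update is applied to all of them in a loop.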
So now let me go into the loss function and what it looks like. The loss function is this block of code, and it consists of a forward and a backward method: we compute the forward pass and then the backward pass, shown in green, so I'll go through those two steps.

In the forward pass, we receive the inputs and targets, those 25 indices, and we iterate through them from 1 to 25. We create this x input vector, which is just zeros, and then we set the one-hot encoding: whatever the index in the inputs is, we turn that bit on with a one, so we're feeding in the character as a one-hot encoding. Then I compute the recurrence formula using the equation we've seen; hs[t] and all these things are dictionaries I'm using to keep track of everything at every single time step. So we compute the hidden state vector and the output vector using the recurrence formula in these two lines, and then over there I compute the softmax, normalizing so that we get probabilities, and the loss is the negative log probability of the correct answer; that's just the softmax classifier loss. So that's the forward pass.

Now we backpropagate through the graph. In the backward pass we go backwards through the sequence, from 25 all the way back to 1, and maybe you'll recognize, I don't know what level of detail to go into here, that I'm backpropagating through the softmax, through the activation functions, through all of it, and adding up all the gradients on all the parameters. One thing to note especially is that the gradients on weight matrices like W_hh use a plus-equals, because at every single time step all of these weight matrices receive a gradient, and we need to accumulate all of it, since we keep using the same weight matrices at every time step; we just backprop into them over time. That gives us the gradients, which we then use in the parameter update.

And then, finally, we have the sampling function. This is where we try to get the RNN to generate new text based on what it has seen in the training data, based on the statistics of how characters follow each other there. We initialize with some random character, and then we go for as long as we like: we compute the recurrence formula, get the probability distribution, sample from that distribution, re-encode the sample as a one-hot vector, and feed it in at the next time step. We keep iterating this until we have a bunch of text.

Are there any questions about the rough layout of how this works? Good, that's right, exactly: we have basically 25 softmax classifiers for every chunk, we backprop through all of them at the same time, and they all add up in the connections going backwards. That's right. Okay, good. Do we use regularization here? You'll see that I probably do not; I guess I skipped it here, but you can in general. I don't think it's as common to use it in recurrent nets as elsewhere; sometimes it gave me worse results, so sometimes I skip it, but it's a hyperparameter. Good. Yes, that's right: in this sequence of 25 characters we are at a very low level, the character level, and we don't actually care about words; the model doesn't know that words exist, it's just indices and sequences of indices, and that's what we're modeling. Good. Could we use spaces as delimiters or something like that instead of constant chunks of 25? Maybe you could, but then you have to make assumptions about language, and we'll see soon why you wouldn't actually want to do that, because you can plug anything into this, and we'll see that we can have a lot of fun with that.
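To make the sampling procedure concrete, here is a rough paraphrase of what such a sample function does (biases omitted, names my own):

```python
import numpy as np

def sample(h, seed_ix, n, Wxh, Whh, Why, vocab_size):
    """Seed with one character index, then repeatedly: run the recurrence, turn the
    scores into a distribution, sample the next character, and feed it back in."""
    x = np.zeros(vocab_size)
    x[seed_ix] = 1
    ixes = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h)          # recurrence
        y = Why @ h                              # unnormalized log probabilities
        p = np.exp(y) / np.sum(np.exp(y))        # softmax
        ix = np.random.choice(vocab_size, p=p)   # sample the next character index
        x = np.zeros(vocab_size)                 # re-encode it as a one-hot vector
        x[ix] = 1
        ixes.append(ix)
    return ixes                                  # character indices; map back with ix_to_char
```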
Okay, so let me show you now what we can do. We can take a whole bunch of text, we don't care where it came from, it's just a sequence of characters, we feed it into the RNN, and we train the RNN to create text like it. For example, you can take all of William Shakespeare's works and concatenate them; it's just a giant sequence of characters, and you put it into the recurrent neural network and try to predict the next character in the sequence for William Shakespeare. When you do this, of course, in the beginning the recurrent neural network has random parameters, so it just produces garbage, random characters. But as you train the RNN it starts to understand that, okay, there are actually things like spaces, there are words, it starts to experiment with quotes, and it learns some of the very short words like "here" or "on" and so on. As you train more and more, this becomes more and more refined, and the network learns that when you open a quote you should close it later, or that sentences end with a dot. It learns all of this statistically, just from the raw patterns, without anyone hand-coding anything, and in the end you can sample entire infinite Shakespeare at the character level. Just to give you an idea of the kind of stuff that comes out: "Alas, I think he shall become approached, and the day where little strain would be attained into being never fed, and who is but the chain and subject of his death, I should not sleep." That's the kind of stuff you get out of this recurrent network. Good, yes, thank you; you're bringing up a very subtle point which I'd like to get back to in a bit.

Okay, so we can run this on Shakespeare, but we can run this on basically anything. We were playing with this with Justin roughly a year ago; Justin found a book on algebraic geometry, which is just a large LaTeX source file, and we took that LaTeX source and fed it into the RNN, and the RNN can learn to generate mathematics. This is a sample: the RNN just spits out LaTeX, and then we compile it. Of course it doesn't work right away, we had to tune it a tiny bit, but after we tweaked some of the mistakes it makes, you can compile it and you get generated mathematics. You'll see that it creates all these proofs, it learns to put little squares at the ends of proofs, it creates lemmas, and so on; sometimes the RNN also tries to create diagrams, with varying amounts of success. My favorite part is that on the top left the proof is omitted; the RNN was just lazy. But otherwise this stuff is, I would say, quite indistinguishable from actual algebraic geometry: "let X be a scheme of X", okay, I'm not sure about that part, but otherwise the gestalt of this looks very good.

So you can throw arbitrary things at it, and I tried to find the hardest arbitrary thing I could throw at a character-level RNN. I decided that source code is actually very difficult, so I took all of the Linux source, which is just C code; you concatenate it and end up with, I think, 700 megabytes of C code and header files, and then you throw it into the RNN, and it can learn to generate code. This is generated code from the RNN: you can see that it creates function declarations, it makes very few syntactic mistakes, it knows about variables and sort of how to use them, it indents the code, and it creates its own bogus comments. Syntactically it's very rare for it to open a bracket and not close it, and so on.
This is actually relatively easy for the RNN to learn. Some of the mistakes it does make are, for example, that it declares variables it never ends up using, or uses variables it never declared, so some of this high-level stuff is still missing, but otherwise it can do code just fine. It also knows how to recite the GNU GPL license character by character, which it has learned from the data, and it knows that after the GNU GPL license there are some include files, some macros, and then some code; that's basically what it has learned.

Good, yes: min-char-rnn, which I've just shown you, is very small, just a toy to show you what's going on; then there's char-rnn, which is a more mature implementation in Torch, basically min-char-rnn scaled up so that it runs on a GPU, and you can play with that yourself. This one in particular, and we'll get to this by the end of the lecture, was a three-layer LSTM; we'll see what that means, it's a more complex form of recurrent neural network.

Okay, just to give you an idea of how this works: this is from a paper we played with, with Justin, last year, where we basically pretended to be neuroscientists. You train a character-level RNN on some test text, and as the RNN reads the text, we look at a specific cell in the hidden state of the RNN and color the text based on whether or not that cell is excited. You can see that many of the hidden state neurons are not very interpretable; they fire on and off in seemingly weird ways, because some of them have to do quite low-level character-level work, like how often something comes after an "h", and stuff like that. But some of the cells are quite interpretable. For example, we find cells like a quote-detection cell: this cell turns on when it sees a quote and stays on until the quote closes, and it keeps track of this quite reliably. This just comes out of backpropagation; the RNN decides that the character-level statistics are different inside and outside of quotes and that this is a useful feature to learn, so it dedicates some of its hidden state to keeping track of whether or not you're inside a quote.

This goes back to the earlier question I wanted to point out here: this RNN was trained with, I think, a sequence length of 100, but if you measure the length of this quote it's actually much more than 100, I think around 250. We only backpropagated through 100 steps, and that's the only horizon over which this cell could learn, because it wouldn't be able to spot dependencies much longer than that. But this seems to show that you can train this quote-detection cell on sequences shorter than 100 and it then generalizes properly to longer sequences; the cell seems to work for more than 100 steps even though it was only able to spot dependencies shorter than 100 during training.

This is another dataset; I think this is Leo Tolstoy's War and Peace. In this dataset there's a newline character roughly every 80 characters, and there's a line-length tracking cell that we found, where it starts off at about 1 and then slowly decays over time.
You might imagine that a cell like this is actually very useful for predicting the newline character at the end of the line, because the RNN needs to count out roughly 80 time steps so that it knows when a newline is likely to come next. So there are line-length tracking cells; we found cells that respond only inside if statements, cells that respond only inside quotes and strings, cells that get more excited the deeper you go inside a nested expression, and all kinds of interesting cells that you can find inside these RNNs, and they come out entirely from backpropagation, which is quite magical, I suppose.

Good, yes: in this LSTM I think there were 2100 cells, so you just go through them; most of them look like this, but I'd say roughly five percent show something interesting, so you go through them manually. Oh no, sorry: we run the entire RNN intact, but we're only looking at the firing of one single cell in the hidden state. So we run the RNN normally but record from one cell in the hidden state, if that makes sense; I'm only visualizing one part of the hidden state, and there are many other hidden cells evolving in different ways, all doing different things inside the RNN. Good. Are you asking about multi-layer RNNs and so on? We'll go into that in a bit; this is, I think, a multi-layer RNN, but you can get similar results with one layer. Good. Yes, these hidden cells are always between negative one and one; they're the output of a tanh. This is from an LSTM, which we haven't covered yet, but the firing of these cells is between negative one and one, and that's the scale that sets this picture. Okay, cool.

Okay, so RNNs are pretty cool and you can train these sequence models over time. Roughly one year ago, several people realized that there's a very neat application of this in the context of computer vision: image captioning. In this setting we take a single image and we'd like to describe it with a sequence of words, and RNNs are very good at modeling how sequences develop over time. The particular model I'm going to describe is work from roughly a year ago; it happens to be my paper, so I have pictures from my paper and I'm going to use those. We feed an image into a convolutional neural network, and you'll see that the full model is made up of just two modules: the ConvNet, which does the processing of the image, and a recurrent net, which is very good at modeling sequences. If you remember my analogy from the very beginning of the course, this is like playing with Lego blocks: we take those two modules and stick them together, which corresponds to the arrow in between. What we're doing, effectively, is conditioning this RNN generative model: we're not just telling it to sample text at random, we're conditioning the generative process on the output of the convolutional network, and I'll show you exactly what that looks like. I'm going to show you the forward pass of this neural net: suppose we have a test image and we're trying to describe it with a sequence of words.
The way the model processes the image is as follows. We take the image and plug it into a convolutional neural network, in this case a VGG net, so we go through a whole bunch of conv and max-pool layers and so on until we arrive at the end. Normally at the end we have the softmax classifier, which gives you a probability distribution over, say, the 1,000 categories of ImageNet. In this case we get rid of that classifier, and instead we redirect the representation at the top of the convolutional network into the recurrent neural network. We begin the RNN's generation with a special start vector; the input to this RNN was, I think, 300-dimensional, and this is a special 300-dimensional vector that we always plug in at the first iteration, which tells the RNN that this is the beginning of the sequence. Then we perform the recurrence formula I showed you before for the vanilla recurrent neural network: normally we compute W_xh times x plus W_hh times h, and now we additionally condition this recurrence not only on the current input and the current hidden state (which we initialize to zero, so that term goes away at the first time step), but also on the image, simply by adding W_ih times v. This v is the top of the ConvNet, and this added interaction, this added weight matrix W_ih, determines how the image information merges into the very first time step of the recurrent neural network. There are many ways to play with this recurrence and many ways to plug the image into the RNN; this is only one of them, and perhaps one of the simpler ones.

At the very first time step, this y_0 vector is the distribution over the first word in the sequence. The way this works, you might imagine, is that the straw textures in the man's hat can be recognized by the convolutional network as straw-like stuff, and then, through this W_ih interaction, that might push the hidden state into a particular state where the probability of the word "straw" is slightly higher. So the straw-like textures can influence one of the numbers inside y_0, the probability of "straw", to be higher, because there are straw textures in the image. From now on the RNN has to juggle two tasks: it has to predict the next word in the sequence, and it has to remember the image information. We sample from that softmax, and suppose the most likely word we sample from that distribution is indeed "straw". We take "straw" and plug it into the recurrent neural network at the bottom again; in this case I think we were using word-level embeddings, so the word "straw" is associated with a 300-dimensional vector which we're going to learn. We learn a 300-dimensional representation for every single unique word in the vocabulary, and we plug those 300 numbers into the RNN and forward it again to get the distribution over the second word of the sequence inside y_1. We get all these probabilities, we sample again, and suppose the word "hat" is likely now; we take hat's 300-dimensional representation, plug it in, get the next distribution, sample again, and we keep sampling until we sample a special END token, which is really the period at the end of the sentence, and that tells us that the RNN is now done generating.
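Here is a rough numpy sketch of that test-time generation loop. The sizes and the weight and token names (W_embed, Wih, START, END) are placeholders of my own, not the paper's; the point is just that the image term enters only at the first step and that each sampled word is fed back in:

```python
import numpy as np

D, H, V, K = 300, 512, 1001, 4096        # embedding, hidden, vocab (+1 for END), CNN feature sizes
START, END = 0, V - 1                    # hypothetical indices for the special tokens
W_embed = np.random.randn(V, D) * 0.01   # a learned 300-d vector per word (plus START)
Wxh = np.random.randn(H, D) * 0.01
Whh = np.random.randn(H, H) * 0.01
Wih = np.random.randn(H, K) * 0.01       # merges the image feature into the first step
Why = np.random.randn(V, H) * 0.01

v = np.random.randn(K)                   # stand-in for the ConvNet's top representation of the image
h = np.zeros(H)
word, caption = START, []
while word != END and len(caption) < 20:
    x = W_embed[word]                                   # 300-d embedding of the previous word
    image_term = Wih @ v if word == START else 0.0      # condition on the image at the first step only
    h = np.tanh(Wxh @ x + Whh @ h + image_term)
    scores = Why @ h
    probs = np.exp(scores) / np.sum(np.exp(scores))
    word = np.random.choice(V, p=probs)                 # e.g. "straw", then "hat", then END
    caption.append(word)
```

At training time the same graph is unrolled over the ground-truth words and a softmax loss is applied at every step.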
At that point the RNN would have described this image as "straw hat", period. So the number of dimensions in this y vector is the number of words in your vocabulary plus one for the special END token, and we always feed in these 300-dimensional vectors that correspond to the different words, plus a special START token. Then we backpropagate through the whole thing at once: you initialize the model at random, or you can initialize the VGG net with weights pre-trained on ImageNet, the recurrent neural network gives you the distributions, you compute the gradient, and you backprop through the whole thing as a single model and train it all jointly, and you get an image captioner. Lots of questions, okay, good.

Yes, these 300-dimensional embeddings are independent of the image; every word just has 300 numbers associated with it, and we backpropagate into them. You initialize them at random and then backpropagate into these vectors, so the embeddings shift around; they're just parameters. Another way to think about it: it's in fact equivalent to having a one-hot representation for every word and a giant W matrix. You multiply W by that one-hot representation, and if W has 300 output dimensions, that effectively plucks out a single row of W, which ends up being your embedding. So if you don't like the embedding view, just think of it as a one-hot representation.

Go ahead: yes, the model learns to output the END token. In the training data, the correct sequence we expect from the RNN is first word, second word, third word, END, so every single training example has the special END token in it. Go ahead: are you asking whether it receives the same output from the VGG net again, whether it sees it twice? Yes, thank you; so the question is where the image goes in. In this example we only plug in the image at the very first time step, not at the other time steps. You can wire this up differently, where you plug it in at every single time step, but it turns out that actually works worse; it works better if you plug it in only at the first time step, and then the RNN has to juggle both tasks: it has to remember what it needs about the image, carried through the RNN, and it also has to produce all these outputs. There are some hand-wavy reasons I can give you after class for why that is.

Next question: so you give it the ConvNet output at the start, and then the subsequent tokens from the labels, and the END token, for the next generation? I think that's not quite it. At training time, a single instance corresponds to an image and a sequence of words, so we plug in those words at the bottom and the image, you unroll the graph, you have your losses, and you backprop. You can do batches of images if you're careful, and if you have batches of images, the training data sometimes contains different-length sequences, so you have to be careful with that: you say, okay, I'm willing to process batches of up to, say, 20 words, and then some of those sentences will be shorter or longer, and you need to worry about that in your code, because some sentences are longer than others.
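One quick aside on the embedding question, since the one-hot view comes up a lot; this tiny check (illustrative sizes, my own example) shows that multiplying by a one-hot vector is exactly a row lookup:

```python
import numpy as np

V, D = 1000, 300
W_embed = np.random.randn(V, D) * 0.01   # a learned 300-d row per word

word_ix = 42                              # some word's index in the vocabulary
one_hot = np.zeros(V)
one_hot[word_ix] = 1

emb_a = one_hot @ W_embed                 # multiply the one-hot vector by the matrix...
emb_b = W_embed[word_ix]                  # ...or just pluck out the corresponding row
assert np.allclose(emb_a, emb_b)          # identical
```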
Okay, we have way too many questions and I have material to get through, so one more. Good, yes, thank you: we backpropagate through everything completely jointly, end-to-end training. You can pre-train the ConvNet on ImageNet and put those weights in, but then you train everything jointly, and that's a big advantage, because the model can figure out which features to look for in order to better describe the images at the end.

Okay, so when you train this in practice, we train on image-sentence datasets. One of the more common ones is called Microsoft COCO; to give you an idea, it's roughly a hundred thousand images with five sentence descriptions for each image, obtained using Amazon Mechanical Turk. You ask people, please give us a sentence description for this image, you record all of it, and that's your dataset. When you train this model, the kinds of results you can expect look roughly like this; this is the RNN describing these images. It says this is a "man in black shirt playing guitar", or "construction worker in orange safety vest working on the road", or "two young girls are playing with lego toy", or "boy is doing backflip on the wakeboard", and of course that's not a wakeboard, but it's close. There are also some very funny failure cases which I like to show: this is a "young boy holding a baseball bat"; this is a "cat sitting on the couch with a remote control"; this is a "woman holding a teddy bear in front of a mirror", and I'm pretty sure the texture is what made it think there's a teddy bear; and the last one is a "horse standing in the middle of a street", and there's a road, but there's no horse, obviously, so I'm not sure what happened there.

So this is just the simplest kind of model, which came out last year. Many people have tried to build on top of these kinds of models and make them more complex; I'd just like to give you an idea of one model that is interesting, to show how people play with this basic architecture. This is a paper from last year. If you noticed, in the current model we only feed in the image a single time, at the beginning, and one way you can play with this is to allow the recurrent neural network to look back at the image and reference parts of it while it's describing the words. As it generates every single word, you allow the RNN to look back into the image for the different features it might want to describe next, and you can do this in a fully trainable way: the RNN not only produces the words but also decides where to look next in the image.

The way this works is that the RNN doesn't only output the probability distribution over the next word in the sequence. The ConvNet also gives you a volume; say we forward the ConvNet and get a 14 by 14 by 512 activation volume, and at every single time step the RNN doesn't just emit the word distribution, it also emits a 512-dimensional vector that acts like a lookup key for what it wants to look for next in the image. Actually, I don't think this is exactly what they did in this particular paper, but it's one way you could wire something like this up. This vector is emitted from the RNN, just predicted with some weights, and then it can be dot-producted with all of the 14 by 14 locations.
We do all these dot products and compute a 14 by 14 compatibility map, and then we put a softmax on it, so we normalize it, and you get what we call an attention over the image: a 14 by 14 probability map over what's interesting to the RNN right now. Then we use this probability mask to do a weighted sum of those features, weighted by this saliency. So the RNN can emit these vectors for what it currently finds interesting, it goes back to the image, and you end up doing a weighted sum of the different features that the LSTM, or the RNN, wants to look at at this point in time. For example, as the RNN is generating, it might decide it would like to look for something object-like now; it emits a 512-dimensional vector for object-like stuff, that interacts with the ConvNet's activation volume, maybe some of the object-like regions of that activation volume light up in the saliency map in this 14 by 14 array, and you end up focusing your attention on that part of the image through this interaction. So you can do lookups into the image while you're describing the sentence. This is something we refer to as soft attention, and we'll actually cover things like this in a few lectures, where the RNN has selective attention over its inputs as it processes them; I just wanted to bring it up now to give you a preview of what that looks like.

Okay, now, if we want to make RNNs more complex, one way to do that is to stack them up in layers; this gives you more depth, and deeper usually works better. One way to stack recurrent neural networks, and there are many ways, but this is one that people use in practice, is to just plug RNNs into each other: the input to one RNN is the hidden state vector of the RNN below it. In this image the time axis goes horizontally, and going upwards we have different RNNs, so in this particular picture there are three separate recurrent neural networks, each with their own set of weights, and these recurrent neural networks feed into each other. This is always trained jointly; there's no train-the-first-one, then-the-second, then-the-third. It's all a single computational graph that we backpropagate through.

Now, the recurrence formula at the top I've rewritten slightly to make it more general, but we're still doing exactly the same thing as before; it's the same formula. We take a vector from below in depth and a vector from before in time, we concatenate them, we put them through a W transformation, and we squash it with a tanh. If you're slightly confused by this: before, there was a W_xh times x plus W_hh times h, and you can rewrite that as a concatenation of x and h multiplied by a single matrix. It's as if you stacked x and h into a single column vector, and then W_xh ends up being the first part of that matrix and W_hh the second part. So this kind of formula can always be rewritten into one where you stack your inputs and have a single W transformation; it's the same formula.
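Here is a small numpy check of that equivalence, plus what one time step of a stacked RNN looks like under this convention (my own sketch, illustrative sizes):

```python
import numpy as np

n = 4                                      # hidden size (illustrative); input also size n here
x = np.random.randn(n)                     # vector from below (the input, or the layer beneath)
h = np.random.randn(n)                     # vector from before (the previous hidden state)
Wxh = np.random.randn(n, n)
Whh = np.random.randn(n, n)

# the two ways of writing the vanilla recurrence are the same thing:
h_a = np.tanh(Wxh @ x + Whh @ h)
W = np.hstack([Wxh, Whh])                  # a single (n x 2n) matrix [Wxh | Whh]
h_b = np.tanh(W @ np.concatenate([x, h]))
assert np.allclose(h_a, h_b)

# stacking: layer l's "vector from below" is layer l-1's new hidden state at this time step
def multilayer_step(x, prev_h, Ws):
    """prev_h[l] is layer l's hidden state from the previous time step; Ws has one (n x 2n) matrix per layer."""
    new_h, inp = [], x
    for l, W_l in enumerate(Ws):
        h_l = np.tanh(W_l @ np.concatenate([inp, prev_h[l]]))
        new_h.append(h_l)
        inp = h_l                          # feeds into the layer above
    return new_h
```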
Okay, so that's how we can stack these RNNs, and they're now indexed both by time and by the layer at which they occur. Another way to make these more complex is not just to stack them, but to use a slightly better recurrence formula. So far we've seen this very simple recurrence formula for the vanilla recurrent neural network; in practice you will actually rarely ever use a formula like this. A basic recurrent network is very rarely used; instead you'll use what we call an LSTM, a long short-term memory. This is basically what's used in all the papers now, and it's the formula you'd be using in your projects as well, if you were to use recurrent neural networks. What I'd like you to notice at this point is that everything is exactly the same as with an RNN; it's just that the recurrence formula is a slightly more complex function. We're still taking the hidden vector from below in depth (your input) and from before in time (the previous hidden state), concatenating them, and putting them through a W transform, but now there's more complexity in how we actually compute the new hidden state at this point in time. We're just being slightly more elaborate in how we combine the vector from below and the vector from before to perform the update of the hidden state; it's a more complex formula. We'll go into some of the details of exactly what motivates this formula and why it might be a better idea to use an LSTM instead of a vanilla RNN. Good, yes, you'll see sigmoids and tanhs in there, and it makes sense, trust me; we'll go through it.

If you look for LSTM online, on Wikipedia or in Google Images, you'll find diagrams like this, which I don't think help anyone. The first time I saw LSTMs, diagrams like this really scared me; I wasn't sure what was going on. I understand LSTMs, and I still don't know what these two diagrams are saying. So I'm going to try to break the LSTM down; it's a tricky thing to put into a diagram, you really have to step through it, so the lecture format is perfect for an LSTM.

Here we have the LSTM equations, and I'm going to first focus on the part at the top, where we take these two vectors, from below and from before: h is the previous hidden state and x is the input, and we map them through that transformation W. If both x and h are of size n, so there are n numbers in each, we're going to produce 4n numbers through this W matrix, which is 4n by 2n. We have these four n-dimensional vectors, i, f, o, and g; they're short for input, forget, and output, and I'm not sure what g is short for, it's just g. The i, f, and o vectors go through sigmoid gates, and g goes through a tanh gate.

Now, the way this actually works: one thing I forgot to mention on the previous slide is that a normal recurrent neural network has just the single h vector at every time step, but an LSTM has two vectors at every time step, the hidden vector h and this vector c, the cell state vector. At every single time step we have both h and c in parallel, with the c vector here shown in yellow; we basically have two vectors at every single point in space and time. What the LSTM does is operate over this cell state: depending on what's before you and below you, which is your context, you end up operating on the cell state with these i, f, o, and g elements.
The way to think about it, and I'm going to go through a lot of this, is: think of i, f, and o as basically binary, either 0 or 1. We want them to have the interpretation of gates, so think of them as zeros or ones; of course we actually make them sigmoids, because we want everything to be differentiable so that we can backpropagate through it, but just think of i, f, and o as binary things we compute based on our context. Then what this formula is doing is: based on what those gates are and what g is, we update the c value, c_t = f * c_{t-1} + i * g, where the products are element-wise. In particular, this f is called the forget gate, and it's used to reset some of the cells to zero. The cells are best thought of as counters: this is an element-wise multiplication there (I think my laser pointer is running out of battery, that's unfortunate), so through this f interaction, if f is zero, you can see that it will zero out a cell, so we can reset the counter. We can also add to a counter, through this i times g interaction: since i is between 0 and 1 and g is between negative 1 and 1, we're basically adding a number between negative 1 and 1 to every cell. So at every single time step we have these counters in all the cells; we can reset a counter to zero with the forget gate, or we can choose to add a number between negative 1 and 1 to it. That's how we perform the cell update, and then the hidden state update ends up being a squashed cell, tanh(c), modulated by the output gate, so only some of the cell state leaks into the hidden state, as modulated by this vector o; we choose to reveal only some of the cells into the hidden state, in a learnable way.

There are several things to highlight here. Maybe the most confusing part is that we're adding a number between negative 1 and 1 via i times g, which is confusing because if we only had g there instead, g is already between negative 1 and 1. So why do we need i times g? What is that actually giving us, when all we want is to increment a cell by a number between negative 1 and 1? That's a subtle part of the LSTM. One answer is that g is a linear function of your context, squashed by a tanh (does anyone have a laser pointer, by any chance? No? Okay). If we were adding just g instead of i times g, that would be a fairly simple function; by adding the i and having a multiplicative interaction, you get a richer function that you can express for what we're adding to the cell state as a function of the previous h and x. Another way to think about it is that we're decoupling two concepts: how much we want to add to the cell state, which is g, and whether we want to add to the cell state at all, which is i. By decoupling these two, the LSTM may also end up with nicer dynamics in terms of how it trains. So those are the LSTM formulas.
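Putting the whole update together, here is a minimal numpy sketch of one LSTM step as just described (biases omitted, and the ordering of the four gate blocks inside the big matrix is an arbitrary choice of mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """x and h_prev are n-dimensional; W is (4n x 2n); returns the new (h, c)."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev])   # 4n numbers from the vector below and the vector before
    i = sigmoid(z[0*n:1*n])               # input gate: do we add to each cell?
    f = sigmoid(z[1*n:2*n])               # forget gate: do we reset each cell (counter) to zero?
    o = sigmoid(z[2*n:3*n])               # output gate: how much of each cell leaks into h?
    g = np.tanh(z[3*n:4*n])               # candidate in [-1, 1] to add to each cell
    c = f * c_prev + i * g                # reset some counters, nudge others by a number in [-1, 1]
    h = o * np.tanh(c)                    # the hidden state is a gated, squashed view of the cell
    return h, c

n = 4
x, h, c = np.random.randn(n), np.zeros(n), np.zeros(n)
W = np.random.randn(4 * n, 2 * n) * 0.01
h, c = lstm_step(x, h, c, W)
```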
Now I'm going to go through this in more detail; maybe I should go through it right now. Think of the cell state as flowing through horizontally. The first interaction is f dot c: f is the output of a sigmoid, so f is gating your cells with a multiplicative interaction, and if f is zero you shut off the cell and reset the counter. The i times g part is adding to the cell state, and then the cell state leaks into the hidden state, but it only leaks through a tanh, and that gets gated by o. So the o vector decides which parts of the cell state to actually reveal into the hidden state. You'll also notice that this hidden state not only goes to the next iteration of the LSTM, it also flows up to higher layers, because this is the hidden state vector that we end up plugging into further LSTMs above us, or that goes into a prediction. When you unroll this it looks something like this, and now I have a confusing diagram of my own, which is I guess where we ended up. You get your input vector from below, you have your hidden state from before, the x and h determine your gates i, f, g, and o, which are all n-dimensional vectors, and they modulate how you operate over the cell state. Once you reset some counters and add numbers between negative one and one to your counters, the cell state leaks out, some of it in a learnable way, and then it can either go up to the prediction or go to the next iteration of the LSTM going forward. So this looks ugly, and the question that's probably on your mind is: why did we go through all of this, why does it look this particular way? I should note at this point that there are many variants of the LSTM, and I'll make this point again by the end of the lecture: people play a lot with these equations, and we've kind of converged on this as being a reasonable thing, but there are many little tweaks you can make that don't deteriorate your performance by much. You can remove some of the gates, like maybe the input gate; it turns out that the tanh of c can just be c and it normally works just fine, though with the tanh of c it works slightly better sometimes, and I don't think we have very good reasons for why. So you end up with a bit of a monster, but I think it actually makes sense in terms of these counters that can be reset to zero, or have small numbers between negative 1 and 1 added to them. And it's actually relatively simple to understand why this is much better than an RNN; we have to go to a slightly different picture to draw the distinction. The recurrent neural network has some state vector, and you're operating over it and completely transforming it through this recurrence formula, so you change your hidden state vector from time step to time step. The LSTM instead has these cell states flowing through, and what we're doing effectively is looking at the cells, some of which leak into the hidden state, and based on the hidden state we decide how to operate over the cell. If you ignore the forget gates, then we end up just tweaking the cell with an additive interaction: there's some stuff that is a function of the cell state,
and then whatever it is, we end up additively changing the cell state instead of just transforming it right away. So it's an additive instead of a transformative interaction, or something like that. Now, this should remind you of something we've already covered in the class. What does it remind you of? ResNets, right. In fact this is basically the same thing we saw with ResNets: normally with a convnet we're transforming the representation, but a ResNet has these skip connections, and you'll see that residual networks have this additive interaction. We have this x, we do some computation based on x, and then we have an additive interaction with x; that's the basic block of a ResNet, and that's in fact what happens in an LSTM as well. We have these additive interactions where the x is basically your cell: we go off, compute some function, and then choose to add to the cell state. Unlike ResNets, the LSTM also has these forget gates, which can choose to shut off some parts of the signal, but otherwise it looks very much like a ResNet. I think it's interesting that we're converging on a very similar-looking architecture that works both in convnets and in recurrent neural networks, where it seems like, dynamically, it's somehow much nicer to have these additive interactions that allow you to backpropagate much more effectively. To that point, think about the backpropagation dynamics of an RNN versus an LSTM. In the LSTM it's very clear that if I inject some gradient signal at the end of this diagram, then these plus interactions are just like a gradient superhighway: the gradients flow through all the additive interactions, because addition distributes the gradient equally, so if I plug in a gradient at any point in time it's just going to flow all the way back. Of course the gradient also flows through the f's, and they contribute their gradients into the flow, but you'll never end up with what we refer to as the RNN's vanishing gradient problem, where the gradients just die off and go to zero as you backpropagate through. I'll show you concretely why that happens in a bit. So in an RNN we have this vanishing gradient problem, while in an LSTM, because of this superhighway of additions, the gradients we inject into the LSTM from above at every single time step just flow through the cells and don't end up vanishing. At this point maybe I'll take some questions: is anything about the LSTM confusing? After that I'll go into why RNNs have the vanishing gradient problem. Good question: is this o gating, multiplying by the o vector, important? It turns out that that one specifically is not super important. There's a paper I'm going to show you called "LSTM: A Search Space Odyssey" where they really played with this, taking stuff out and putting stuff in; there are also these peephole connections you can add, where the cell state is fed in along with the hidden state vector as an input. People have really played with this architecture and tried lots of variations of exactly these equations, and what you end up with is that almost everything works about equally well, with some of it working slightly worse sometimes. So it's kind of confusing in that way.
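As a toy illustration of the distinction drawn above, here is a small numpy contrast between the RNN's full transform and the ResNet/LSTM-style additive update. The shapes and the scaling factor are made up, and this only mirrors the structural difference, not an actual trained network.

```python
import numpy as np

np.random.seed(0)
n = 5
Whh = np.random.randn(n, n) * 0.5
h = np.random.randn(n)   # RNN hidden state
c = np.random.randn(n)   # LSTM-style cell state

# Vanilla RNN: the state is completely transformed at every step.
h_next = np.tanh(Whh.dot(h))

# ResNet / LSTM-style (forget gate ignored): compute something from the
# state, then *add* it, so an identity path is preserved for gradients.
delta = np.tanh(Whh.dot(c))      # stands in for i * g in the LSTM
c_next = c + delta
```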
I'll also show you a paper where they treated these update equations as trees, built trees over the update equations, did random mutations, and tried all kinds of different graphs and updates you could have; some of them break, most of them work about the same, but nothing is really much better than an LSTM. Any other questions? Okay, let's go into why recurrent neural networks have terrible backward flow; I'll show you a cute video as well. This is showing the vanishing gradient problem in recurrent neural networks compared to LSTMs. What we're looking at is an unrolled recurrent neural network over many time steps: we inject gradient at, say, the 128th time step, backpropagate that gradient through the network, and look at what the gradient is for, I think, the input-to-hidden matrix, one of the weight matrices, at every single time step. Remember that to get the full update for the batch we end up adding all of those gradients. So what's being shown here is that we've injected gradient at only one time step, we backpropagate back through time, and this is showing the slices of that backpropagation. What you're seeing is that the LSTM gives you lots of gradient throughout this backpropagation, so there's lots of information flowing through, while the RNN just instantly dies off: the gradient, as we say, vanishes, it just becomes tiny numbers, and after about eight or ten time steps there's no gradient at all. So all the gradient information we injected did not flow through the network, and you can't learn very long dependencies, because all the correlation structure has died down. We'll see why this happens dynamically in a bit. There are some comments on the video which are also funny, it's on YouTube, but anyway. Okay, so let's look at a very simple example, a recurrent neural network that I'm going to unroll for you. In this recurrent neural network I'm not showing any inputs; we only have hidden state updates. I'm initializing a weight matrix Whh, the hidden-to-hidden interaction, and then I'm going to forward a vanilla recurrent network for some T time steps, here T is 50. What I'm doing is Whh times the previous hidden state, and then a ReLU on top of that. So this is just the forward pass of an RNN, ignoring any input vectors coming in: Whh times h, threshold, Whh times h, threshold, and so on. Then I'm going to do the backward pass: we inject a random gradient at the last time step, the 50th, and then I go backwards and backprop. When you backprop through this, you have to backprop through the ReLU (here I'm using a ReLU), then through the Whh multiply, then the ReLU, then the Whh multiply, and so on. So here is where I'm backpropagating through the ReLU, just zeroing the gradient anywhere the inputs were less than zero, and here I'm backpropagating through the Whh times h operation, where we multiply by the Whh matrix before the
nonlinearity. And there's something very funky going on when you look at what happens to these dh's, the gradients on the h's, as you go backwards through time: the way things get chained up in the loop has a structure that is very worrying. What are we doing here across these time steps? Yes, at some time steps maybe the ReLU outputs were all dead, so you may have killed the gradient there, but that's not really the issue; that would be an issue as well, but the more worrying issue that people definitely spot is that we're multiplying by this Whh matrix over and over and over again. Because in the forward pass we multiply by Whh at every single iteration, when we backpropagate through all the hidden states we end up backpropagating through this formula Whh times h, and the backward pass turns out to be that you take your gradient signal and multiply it by the Whh matrix. So the gradient gets multiplied by Whh, then thresholded, then multiplied by Whh, then thresholded, and we end up multiplying by this matrix Whh fifty times. The issue is that two things can happen. Think about working with scalar values: suppose these were scalars, not matrices. If I take a random number, and I have a second number, and I keep multiplying the first number by the second number again and again and again, what does that sequence go to? There are two cases: either it dies to zero or it explodes. If your second number was exactly one you don't explode, but that's the only case; otherwise really bad things happen, we either die or we explode. Here we have matrices instead of a single number, but the generalization of the same thing happens: if the spectral radius of this Whh matrix, which is its largest eigenvalue in magnitude, is greater than one, the gradient signal will explode; if it's lower than one, the gradient signal will completely die. So because of this recurrence formula, the RNN ends up with these terrible dynamics: it's very unstable, and it just dies or explodes. In practice, the exploding gradient was handled with one simple hack: if your gradient is exploding, you clip it. People actually do this in practice; it's a very patchy solution, but if your gradient is above 5 in norm, you clamp it to 5, element-wise or something like that. That's called gradient clipping, and that's how you address the exploding gradient problem, so your gradients don't explode anymore. But the gradients can still vanish in a recurrent neural network, and the LSTM is very good with the vanishing gradient problem because of these highways of cells that are only changed with additive interactions, where the gradients just flow: they don't die down just because you're multiplying by the same matrix again and again. So that's roughly why LSTMs are just better dynamically. We always use LSTMs, and we usually do gradient clipping as well, because the gradients in an LSTM can potentially still explode, but they don't usually vanish.
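The little experiment described above is easy to reproduce. Here is a rough numpy sketch of what that demo presumably looks like; the sizes, seed, scaling, and printout are my own choices, not the lecture's actual code.

```python
import numpy as np

np.random.seed(0)
H, T = 5, 50                           # hidden size, number of time steps
Whh = np.random.randn(H, H) * 0.1      # hidden-to-hidden weights (small scale)
hs, zs = {-1: np.random.randn(H)}, {}

# Forward pass: h_t = relu(Whh @ h_{t-1}), ignoring inputs entirely.
for t in range(T):
    zs[t] = Whh.dot(hs[t - 1])
    hs[t] = np.maximum(0, zs[t])

# Backward pass: inject a random gradient at the last step, then backprop
# through time: through the relu, then the Whh multiply, over and over.
dh = np.random.randn(H)
for t in reversed(range(T)):
    dz = dh * (zs[t] > 0)              # relu backprop: zero where input was < 0
    dh = Whh.T.dot(dz)                 # multiply by Whh, again and again
    print(t, np.linalg.norm(dh))       # watch this norm collapse toward zero
```

With this small scaling the printed norms die off within a few steps; scale Whh up (say by 2.0) and they explode instead. Gradient clipping in this setting is just an element-wise clamp, something like `np.clip(dh, -5, 5)`, applied before the parameter update.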
Good question: here I'm using a ReLU, and people sometimes use tanh in vanilla recurrent neural networks as well. For LSTMs it's not clear where you would plug a ReLU in; it's not clear in this equation exactly where it would go. Maybe instead of the tanh for g? But then the cells would only grow in a single direction, so you couldn't make them smaller, and that's not a great idea, I suppose. So there's basically no clear way to plug a ReLU in here. One thing I'll note is that, in terms of these superhighways of gradients, this viewpoint actually breaks down when you have forget gates: when we can forget some of the cells with a multiplicative interaction, then whenever a forget gate kills the gradient, the backward flow stops. So these superhighways are only really true if you don't have any forget gates; a forget gate can kill the gradient. In practice, when people use LSTMs, they sometimes initialize the forget gate with a positive bias, because that biases the forget gate toward remembering, so forgetting is effectively turned off in the beginning: the gradients flow very well at the start, and then the LSTM can learn to shut them off when it wants to later on. So people play with that bias on the forget gate sometimes. On the last slide here I wanted to mention that many people have played with the LSTM quite a bit. There's the "Search Space Odyssey" paper where they try various changes to the architecture, and there's a paper that does a search over a huge number of potential changes to the LSTM equations; they did a large search and didn't find anything that works substantially better than just an LSTM. And then there's the GRU, which is also relatively popular, and I would actually recommend that you might want to use it. A GRU is a change on the LSTM: it also has these additive interactions, but what's nice about it is that it's a shorter, smaller formula, and it only has the single h vector; it doesn't have an h and a c, just an h. So implementation-wise it's nicer: you only have to keep a single hidden state vector in your forward pass, not two, and it's a smaller, simpler thing that seems to have most of the benefits of an LSTM. It's called the GRU, and it almost always works about as well as the LSTM in my experience, so you might want to use it, or you can use the LSTM; they both do about the same.
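For reference, here is a minimal numpy sketch of the GRU update in one common formulation. The weight names (`Wz`, `Wr`, `Wh`) and the exact gate convention are assumptions on my part rather than something taken from the lecture slides, and biases are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Wr, Wh):
    # x and h_prev are (n,) vectors; each weight matrix is (n, 2n).
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz.dot(hx))          # update gate
    r = sigmoid(Wr.dot(hx))          # reset gate
    h_tilde = np.tanh(Wh.dot(np.concatenate([r * h_prev, x])))  # candidate
    # Blend old state and candidate: still an additive-style interpolation,
    # and there is only one state vector h, no separate cell c.
    return (1 - z) * h_prev + z * h_tilde
```

Note the single state vector in the return value; that's the implementation convenience mentioned above.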
So to summarize: recurrent networks are very nice, but the raw RNN does not actually work very well, so use LSTMs or GRUs instead. What's nice about them is the additive interactions that allow gradients to flow much better, so you don't get a vanishing gradient problem. We still have to worry a bit about the exploding gradient problem, so it's common to see people clip these gradients. And I would say that better, simpler architectures are still to be found: there's something deeper going on in the connection between ResNets and LSTMs, and something deeper about these additive interactions, that I think we're not fully understanding yet, exactly why they work so well and which parts of them matter. So I think we need a much better understanding, both theoretical and empirical, in this space; it's a very wide open area of research. Okay, it's 4:10, so that's the end of the class. Question: with an LSTM, do people still clip gradients? The gradients can, I suppose, still explode; it's not as clear why they would, but you keep injecting gradient into the cell state, so maybe that gradient can sometimes get large, and it's common to clip them, though I think it's maybe not as important as in an RNN; I'm not a hundred percent sure about that one. Question about a biological basis? I have no idea; that's interesting. Okay, I think we should end the class here, but I'm happy to take more questions.
Info
Channel: Andrej Karpathy
Views: 90,456
Rating: 4.9342604 out of 5
Keywords: convnet, convolutional, neural, networks, image, recognition, cs231n, stanford, course, class, andrej, karpathy, recurrent, network, captioning, lstm
Id: yCC09vCHzF8
Length: 69min 54sec (4194 seconds)
Published: Mon Feb 08 2016