CS231n Winter 2016: Lecture 4: Backpropagation, Neural Networks 1

Captions
OK, so let me dive into some administrative points first. Recall that assignment 1 is due next Wednesday, so you have about 150 hours left. I use hours because there's a more imminent sense of doom, and remember that a third of those hours you'll be unconscious, so you don't have that much time; it's really running out. You might think you have late days and so on, but the assignments only get harder over time, so you'll want to save those. Start now.

There are no office hours on Monday. I'll hold makeup office hours on Wednesday instead, because I want you to be able to talk to me, especially about projects. They'll be at 5 p.m. instead of the usual 6 p.m., and in Gates 259 instead of the usual Gates 260, so minus one on both.

Also, when you study for the midterm coming up in a few weeks, make sure you go through the lecture notes as well; they're really part of this class. I pick and choose what I think is most valuable to present in lecture, but there's quite a bit more material in the notes that might pop up on the midterm. The notes are complementary to the lectures, and the midterm will be drawn from both.

Having said all that, let's dive into the material. As a reminder of where we are: we have the score function, we've looked at several loss functions such as the SVM loss last time, and we've looked at the full loss you get for any particular set of weights over your training data. That loss is made up of two components: a data loss and a regularization loss. What we want to do now is derive the gradient of the loss function with respect to the weights, so that we can actually perform the optimization. In optimization we do gradient descent: we iterate, evaluating the gradient on the weights, doing a parameter update, and repeating over and over, so that we converge to a low point of the loss function. Arriving at a low loss is equivalent to making good predictions on the training data in terms of the scores that come out.

We also saw that there are two ways to evaluate the gradient. There's the numerical gradient, which is very easy to write but extremely slow to evaluate, and there's the analytic gradient, which you obtain using calculus and which we'll go into quite a bit more in this lecture. The analytic gradient is fast and exact, but you can get it wrong sometimes, so we always perform what we call a gradient check: we write out all the expressions for the analytic gradient, and then we double-check its correctness against the numerical gradient. You'll definitely see that in the assignments.

Now, seeing this setup, where we just want the gradient of the loss with respect to the weights, you might be tempted to write out the full loss and start taking gradients as you would in a calculus class. But the point I'd like to make is that
you should think of this much more in terms of computational graphs, instead of one giant expression that you derive with pen and paper. Here we're thinking about values flowing through a computational graph, where the operations along the way, drawn as circles, are function pieces that transform your inputs all the way into the loss at the end. We start with our data and our parameters as inputs; they feed through the computational graph, which is just a series of functions along the way, and at the end we get a single number, the loss.

The reason I'd like you to think about it this way is that these expressions look very small right now, and you might be able to derive the gradients directly, but the computational graphs are about to get very big. Convolutional neural networks, for example, will have dozens or hundreds of operations, with images flowing through a pretty big computational graph to get the loss, so it becomes impractical to just write out these expressions. And convolutional networks are not even the worst of it. Once you get to something like the Neural Turing Machine, a paper from DeepMind, the whole thing is a differentiable Turing machine: the entire procedure the computer performs on the tape is made smooth and differentiable, and the computational graph of it is huge. And that's still not the end of it, because (we'll get to recurrent neural networks in a bit) what you end up doing is unrolling that graph: think of it copied over many hundreds of time steps, so you end up with a giant monster of hundreds of thousands of nodes and little computational units. It's impossible to write out the loss of the Neural Turing Machine as one expression; it would take billions of pages. So we have to think about this in terms of data structures: little functions transforming intermediate variables to get the loss at the very end.

So we're going to look specifically at computational graphs and how to derive the gradients on the inputs with respect to the loss function at the very end. Let's start simple and concrete. Consider a very small computational graph with three scalars as inputs, x, y and z, which in this example take the specific values -2, 5 and -4. We have this very small graph, or circuit (you'll hear me refer to these interchangeably as a graph or a circuit), that at the end gives the output -12. What I've done here is prefill what we'll call the forward pass of the graph: I set the inputs and compute the outputs.

What we'd like to do now is derive the gradients of the expression with respect to the inputs. I'll introduce an intermediate variable q after the plus gate; so there's a plus gate and a times gate, as I'll refer to them, and the plus gate computes q = x + y, while f is the multiplication of q and z. What I've written out here is what we want: the derivatives df/dx, df/dy and df/dz,
and I've also written out the little local gradients for each of those two expressions separately. We've performed the forward pass, going from left to right; now we'll do the backward pass, going from the back to the front, computing the gradients of all the intermediates in our circuit, until at the very end we're left with the gradients on the inputs.

We start at the very right. As the base case of this recursive procedure, we consider the gradient of f with respect to f itself. That's just the identity mapping, and the derivative of the identity is 1, so that's our base case: we start with 1 and go backwards through the graph.

Next we want the gradient of f with respect to z. In this computational graph, df/dz = q (we have it written out right here), and q in this example is 3, so the gradient on z becomes 3. I'll write the gradients in red under the lines, with the values in green above the lines; so the gradient at the output is 1, and the gradient on z is 3. Keep in mind the interpretation of a gradient: that 3 is telling you, intuitively, that the influence of z on the final value is positive, with a force of 3. If I increment z by a small amount h, the output of the circuit reacts by increasing by 3h.

The gradient on q will be df/dq = z, which is -4, so we get a gradient of -4 on that part of the circuit. That's saying that if q were to increase, the output of the circuit would decrease: increase q by h, and the output decreases by 4h, because the slope is -4.

Now we continue the recursive process through the plus gate, and this is where things get slightly interesting, I suppose. We'd like to compute the gradient of f with respect to y. Let's just guess first, and then see how it gets derived properly. I hear some murmurs of the right answer: it will be -4. There are many ways to derive it at this point, because the expression is very small and you can kind of glance at it, but the way I'd like you to think about it is by applying the chain rule. The chain rule says df/dy = df/dq × dq/dy, and we've computed both pieces: df/dq, the influence of q on f, is -4, and the local influence of y on q, the local derivative of y for the plus gate as I'll call it, is 1. The chain rule tells us that the correct thing to do, to chain the local gradient of y on q with the global gradient of q on the output of the circuit, is to multiply them, so we get -4 × 1. This is the crux of how backpropagation works, and it's very important to understand: we have these two pieces that we
keep multiplying through as we apply the chain rule. We have q = x + y, and the derivative of that single expression with respect to x and with respect to y is 1 and 1. Keep in mind the interpretation: x and y have a positive influence on q with a slope of 1, so increasing x by h increases q by h. What we eventually want is the influence of y on the final output of the circuit, and the way it works is that we take the influence of y on q, and the influence of q on the final output, which is what we've been recursively computing through the graph, and multiply them: -4 × 1, which is -4. What that says is that the influence of y on the output of the circuit is -4: increasing y should decrease the output by 4 times the small change you made. And that's how it works out: y has a positive influence on q, so increasing y slightly increases q, which in turn decreases the output of the circuit. The chain rule is what gives us this correspondence.

You'll see many, many instantiations of this, and I'll drill it into you by the end of the class. And note that once we're actually implementing this, there won't be any symbolic expressions anywhere; as you'll see in the implementations later, it will always be just vectors and raw numbers flowing through.

Looking at x, a very similar thing happens. We want df/dx, our final objective, and we combine what we know: x's influence on q, and q's influence on the end of the circuit. That again is the chain rule: -4 × 1, which gives -4.

To generalize a bit from this example, the way to think about it is as follows. You are a gate embedded in a very large computational graph or circuit. Some inputs, some particular numbers x and y, come in; you perform some operation on them and compute an output z. This value z goes off into the computational graph and something happens to it, but you're just a gate hanging out in the circuit, and you're not sure what happens; all you know is that by the end of the circuit, the loss gets computed. That's the forward pass. One thing I'd like to point out right away: during the forward pass, when you receive x and y and compute your output z, there's something else you can compute immediately, and that's the local gradients of z on x and y. I can compute those right away, because I'm just a gate and I know what operation I'm performing, say addition or multiplication, so I know the influence that x and y have on my output value. Then, near the end, the loss gets computed, and we proceed recursively in the reverse order. I'll eventually learn what my influence on the final output of the circuit, the loss, is: I'll learn dL/dz, and that gradient will flow into me.
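Before going on, here is the whole small example as a minimal sketch in plain Python (the variable names are mine), so you can see the forward pass and the chained backward pass side by side:

```python
# forward/backward pass for f(x, y, z) = (x + y) * z, with the example values
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass: apply the chain rule from the output back to the inputs
df_df = 1.0          # base case: gradient of f with respect to itself
df_dz = q * df_df    # df/dz = q              -> 3
df_dq = z * df_df    # df/dq = z              -> -4
df_dx = 1.0 * df_dq  # local dq/dx = 1, chained -> -4
df_dy = 1.0 * df_dq  # local dq/dy = 1, chained -> -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```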
What I have to do then is chain that gradient through the operation I performed, and the chain rule says the correct thing to do is to multiply my local gradient by that incoming gradient. That product gives dL/dx, the influence of x on the final output of the circuit. So really, the chain rule is just this extra multiplication, where we take what I'll call the global gradient, the gradient of the final output on this gate, and chain it through the local gradient. The same goes for y: it's just a multiplication of that incoming gradient by your local gradient.

Remember that these x's and y's are themselves coming from other gates, so this process recurses through the entire computational circuit. The gates basically communicate to each other their influence on the final loss: a positive gradient means you're positively influencing the loss, a negative gradient means you're negatively influencing it, and these all get multiplied through the circuit by the local gradients. This process is called backpropagation: a way of computing, through a recursive application of the chain rule through the computational graph, the influence of every single intermediate value in the graph on the final loss function. We'll see many examples of this throughout the lecture, and I'll work through a slightly larger specific example in detail, but are there any questions at this point?

[Student: what if z is used by multiple nodes?] I'm going to come back to that, but the correct thing to do is to add the gradients: if z influences the circuit in multiple places, the backward flows add. [Student asks about squashing functions whose values land on the asymptotes after random initialization.] You're jumping ahead about a hundred slides; we'll get to all of those issues, including what we call vanishing gradient problems.

OK, let's go through another example to make this more concrete. Here we have another circuit; it happens to compute a little two-dimensional sigmoid neuron, but for now don't worry about that interpretation, just think of it as an expression: 1 over 1 plus e to the whatever. There are five inputs, we're computing that function, and we have a single output over there. I've translated the mathematical expression into computational graph form, so we compute the expression recursively from the inside out: first all the little w times x products, then we add them all up, then we take the negative, then we exponentiate, then we add one, and finally we divide and get the result. What we'll do now is backpropagate through this expression: compute the influence of every single input value on the output. [Student asks about the plus nodes in the circles.] For now,
let's just assume that this plus is a binary plus gate, and that we also have a plus-1 gate. I'm making these gates up on the spot; we'll see that what is or isn't a gate is kind of up to you, and I'll come back to that point in a bit.

As we go through this example, let me write out the local derivatives of the several gates we'll use. For every little local gate we can derive the local gradient using calculus: the derivative of e^x is e^x, the derivative of 1/x is -1/x², adding a constant has derivative 1, and so on, along with addition and multiplication, whose gradients I'm assuming you have memorized.

We start at the end of the circuit, and I've already filled in 1.00 at the back, because that's how we always start this recursion: with 1.0, the gradient on the identity function. Now we backpropagate through the 1/x operation. The local gradient of 1/x is -1/x². The 1/x gate received the input 1.37 during the forward pass, and right away it could have computed that local gradient. Now, during backpropagation, by the chain rule it multiplies that local gradient by the gradient of its output on the final output of the circuit, which is easy here because it happens to be at the very end. So what is the expression for the backpropagated gradient from the 1/x gate? The chain rule always has two pieces: the local gradient times the gradient from above. We get -1/x², the local gradient, evaluated as -1/1.37², multiplied by 1.0, the gradient from above, which is just one because we only just started. Applying the chain rule right away, the output is -0.53. That's the gradient on that piece of the wire where this value was flowing: it has a negative effect on the output, and you'd expect that, because if you increase this value and it goes through a 1/x gate, then 1/x gets smaller; that's why you see a negative gradient.

We continue backpropagation. The next gate in the circuit adds a constant of one. If you look at adding a constant to a value, the local gradient with respect to x is, by basic calculus, just 1. So the chained gradient that continues along the wire is the local gradient, 1, times the gradient from above the gate, which it has just learned is -0.53. So -0.53 continues along the wire unchanged, and intuitively that makes sense: this value flows through and has some influence on the final circuit, and if you're just adding one, its rate of change, its slope toward the final value, doesn't change. If you increase the input by some amount, the effect at the end is the same, because the rate of change isn't altered by the plus-one gate; it's just a constant offset.

We continue the derivation. The gradient of e^x is e^x, so to continue backpropagation we're going to
perform the same thing: this exp gate saw an input of -1, so right away it could have computed its local gradient, and now it knows that the gradient from above is -0.53. Applying the chain rule here (these are mostly rhetorical questions), we get e^(-1), which is e^x evaluated at this gate's input, times -0.53, the gradient from above. What is my local effect, and what effect do I have on the final end of the circuit: those always get multiplied. So we end up with -0.20 at this point.

Next we have a times-negative-one gate. What happens when you multiply by -1 in the computational graph? The value flips around: we have a constant multiplier on the input, which happens to be -1, so the input of 1 gave us -1 in the forward pass. Now we multiply the local gradient, which is that constant -1, by the gradient from above, which is -0.2, so we end up with positive 0.2.

Continuing backpropagation, we reach a plus gate, and this plus operation has multiple inputs. The local gradient of the plus gate is 1 on each input, so what gradients flow along the input wires? If you have a function x + y, the gradient on either x or y is just 1, so what you end up with is 1 times 0.2 on each input. In fact, for a plus gate you always see this effect: since the local gradient on all of its inputs is 1, whatever gradient it gets from above just gets distributed equally to all of its inputs, because in the chain rule they get multiplied by 1 and remain unchanged. A plus gate is kind of a gradient distributor: whatever flows in from the top gets spread out equally to all of its children. So one of its inputs has now received the gradient 0.2 with respect to the final output of the circuit, and that influence has been computed through this series of chain rule applications along the way.

There was another plus gate that I skipped over, and its 0.2 likewise distributes equally to both of its inputs, 0.2 and 0.2. Now we backpropagate through a multiply gate. What will the gradients for w0 and x0 be? The gradient on w0 will be x0 times the gradient from above, that is -1 times 0.2, and the gradient on x0 will be w0 times the gradient from above, that is 2 times 0.2, which is 0.4. By the way, there's a bug in the slide that I noticed a few minutes before class: you see 0.39 there, but it should be 0.4. It's a bug in the visualization, because I'm truncating at two decimal digits; that value should be 0.4, since 2 times 0.2 gives 0.4, just as I've written out over there.

So we've backpropagated through this circuit, through this expression. You can imagine that in actual downstream
applications, we'll have the data and all the parameters as inputs, and a loss function at the end: we do a forward pass to evaluate the loss, and then we backpropagate through every piece of computation we did along the way, through every gate, to get the gradients on our inputs. Backpropagation just means: apply the chain rule many, many times. We'll see how that's implemented in a bit. (Sorry, did you have a question? Yes, I'm going to skip the other times gate, since it's the same.) [Student asks about cost.] That's right: the cost of the forward and backward passes is roughly equal. It almost always ends up basically equal when you look at timings, though usually the backward pass is slightly slower.

One thing I want to point out before we move on: the granularity of these gates is arbitrary. For example, some of you may know there's something called the sigmoid function, σ(x) = 1/(1 + e^(-x)). I could have rewritten the expression and collapsed all of the gates that make up the sigmoid into a single sigmoid gate, computing it in one go. What I'd have to do, if I wanted that gate, is work out an expression for its local gradient: the gradient of the sigmoid gate on its input. You go through some math, which I won't do in detail, and you end up with the expression (1 - σ(x)) · σ(x). That's the local gradient, and that's all I need to put this piece into a computational graph, because once I know how to compute the local gradient, everything else is defined through the chain rule and multiplying things together.

So we can backpropagate through this sigmoid gate now. The input to the sigmoid gate was 1.0 (that's what flowed in) and 0.73 came out, so σ(x) = 0.73. The local gradient is, as we've seen from the math, (1 - σ(x)) · σ(x) = (1 - 0.73) · 0.73, times the gradient from above, which happens to be 1.0 because we're at the end of the circuit (so I'm not even writing it), and we end up with 0.2. Of course we get the same answer, 0.2, as before, because calculus works. We could break the expression down and do one piece at a time, or we could have a single sigmoid gate; it's up to us at what level of hierarchy we break these expressions up. Intuitively, you want to cluster expressions into single gates when it's very efficient or easy to derive their local gradients, because then those become your building pieces.

[Student asks whether libraries typically do this.] Yes, I'd say so: if you notice there's some operation you'd like to do over and over again, and it has a very simple local gradient, then it's very appealing to create a single unit out of it.
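Here's a minimal sketch of that collapsed sigmoid gate (plain Python, names mine), reproducing the 0.2 result from the step-by-step version:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.0                        # the input that flowed into the sigmoid gate
s = sigmoid(x)                 # forward pass: ~0.73

upstream = 1.0                 # gradient from above (we happen to be at the end)
dx = (1 - s) * s * upstream    # local gradient (1 - sigma) * sigma, chained
print(dx)                      # ~0.20, matching the gate-by-gate result
```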
We'll see some examples of that shortly. I'd also like to point out the reason I like thinking in computational graphs: it really helps your intuition about how gradients flow in a neural network. You don't want this to be a black box. After a while of looking at computational graphs you develop intuitions about how gradients flow, and that helps you a lot when you debug issues like, say, the vanishing gradient problem, which we'll get to: it's much easier to understand exactly what's going wrong in your optimization if you understand how gradients flow in networks, and it will help you debug these networks much more efficiently.

Some intuitions, for example. We already saw that the add gate has a local gradient of 1 on all of its inputs, so it's just a gradient distributor; that's a nice way to think about it. Wherever you have a plus operation, in your score function, your ConvNet, or anywhere else, it distributes the incoming gradient equally. The max gate, instead, is a gradient router. If you have a simple binary expression max(x, y) as a gate, then the gradient on the larger of the two inputs is 1, and the gradient on the smaller one is 0. Intuitively, that's because if one of them is smaller, wiggling it has no effect on the output, since the other one is larger and is what propagates through the gate. So if I'm a max gate: at backpropagation time I take my gradient from above and route it to whichever of my inputs was largest during the forward pass; that's why a max gate is a gradient router (there's a little sketch of this below). The multiply gate is sometimes called a gradient switcher; I don't actually think that's a very good way to look at it, so never mind that part.

[Student: what happens if the two inputs to a max gate are equal?] You pick one. I don't think it's correct to distribute it to both; you'd have to pick one. But that basically never happens in actual practice.

Here's a concrete max-gate example: z was larger than w, so only z has an influence on the output of the max gate. When the gradient of 2 flows into the max gate, it gets routed to z, and w gets a zero gradient, because its effect on the circuit is nothing: when you change w, nothing happens, since z is the larger value going through the graph.

One more note related to backpropagation, which we already addressed through a question, illustrated with a terribly bad-looking figure: if you have a value that branches out and is used in multiple parts of the circuit, the correct thing to do, by the multivariate chain rule, is to add up the contributions at that operation. Gradients add when they flow backwards through the circuit.
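To recap the max gate's routing behavior in code (plain Python, names mine; on an exact tie we arbitrarily pick the first input, as discussed):

```python
def max_forward(x, y):
    return max(x, y)

def max_backward(x, y, dout):
    # route the upstream gradient to whichever input was larger in the
    # forward pass; the smaller input had no effect, so it gets zero
    dx = dout if x >= y else 0.0
    dy = dout if y > x else 0.0
    return dx, dy

# z = 2 was larger than w = -1, so the incoming gradient of 2 goes to z
print(max_backward(2.0, -1.0, 2.0))   # (2.0, 0.0)
```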
All right, we'll go into implementation very soon; let me just take a couple more questions. [Student: are there ever loops in these graphs?] There will never be loops. You might think that a recurrent neural network has loops, but actually it doesn't, because we take the recurrent network and unfold it through time steps: there is never a loop in the unfolded graph, where we've copy-pasted that small recurrent piece over time. You'll see that when we actually get to it, but these are always DAGs; there are no loops.

OK, awesome. So let's look at how this is actually implemented in practice, which I think will make it more concrete as well. We always have these computational graphs; they're the best way to think about structuring neural networks. What we end up with is all these gates, which we'll see in a bit, but on top of the gates there's something that needs to maintain the connectivity structure of the entire graph: which gates are connected to each other. Usually that's handled by a graph or net object, and the net object has two main pieces: the forward and the backward pass. This is just pseudocode, so it won't run, but roughly the idea is that in the forward pass we iterate over all the gates in the circuit, sorted in topological order. Topological order means that all the inputs to a node arrive before its output is consumed: the gates are ordered from left to right, and we call forward on every single gate along the way, with the net object making sure that happens in the proper connectivity pattern. In the backward pass, we go in the exact reverse order and call backward on every single gate; the gates end up communicating gradients to each other, and they all get chained up, computing the analytic gradient at the back. So really, a net object is a very thin wrapper around all these gates, or, as we'll also call them, layers (layers or gates, I'll use the terms interchangeably), and it just calls forward and backward on them in the right order.
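In the same spirit as the pseudocode on the slide, here is a minimal sketch (assuming each gate object implements forward() and backward(), and the gate list is already topologically sorted; the names are mine):

```python
class Net(object):
    # a very thin wrapper that only maintains connectivity between gates
    def __init__(self, gates):
        self.gates = gates  # assumed sorted in topological order

    def forward(self):
        # iterate left to right, calling forward on every gate
        for gate in self.gates:
            output = gate.forward()
        return output  # the last gate produces the loss

    def backward(self):
        # iterate in exact reverse order; gates chain gradients to each other
        for gate in reversed(self.gates):
            gate.backward()
```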
Now let's look at a specific example of one of the gates and how it might be implemented, and this is not just pseudocode; it's more or less a correct implementation, in the sense that it might actually run. Take a multiply gate. A binary multiply gate receives two inputs x and y, computes their product z = x · y, and returns it. All these gates must satisfy this API of a forward call and a backward call: how do you behave in the forward pass, and how do you behave in the backward pass. In the forward pass we just compute the output. In the backward pass, we eventually learn our gradient on the final loss, dL/dz, represented in the variable dz. Everything here is scalars, so x, y and z are numbers, and dz is also a number, telling us the influence of z on the end of the circuit. What the gate is in charge of in the backward pass is performing its little piece of the chain rule: we have to compute how to chain the gradient dz into the inputs x and y, in other words compute dx and dy, and return those. The computational graph then makes sure these get routed properly to the other gates, and if any edges merge, it adds those gradients together.

So how do we implement dx and dy? What is dx in this case? It's y times dz (sketched in code below). An additional point to make here: note that in the forward pass we have to remember the values x and y, because we end up using them in the backward pass; that's why I assign them to self, so I have access to them later. In general, when you build these networks, every single gate must remember its inputs and any intermediate calculations it performed that it will need in the backward pass. So when you run these networks at runtime, always keep in mind that as you do the forward pass, a huge amount of stuff gets cached in memory, and it all has to stick around, because during backpropagation you might need access to those variables. Your memory balloons up during the forward pass, and then in the backward pass it gets all consumed: you need those intermediates to compute the proper backward pass.

[Student: what if you don't need the backward pass?] Yes, if you know you won't do a backward pass, you can get rid of many of these things and not cache them, so you can save memory for sure. I don't think most implementations actually worry about that; there's not a lot of logic that deals with it, and we usually end up remembering things anyway. But if you're on an embedded device, for example, and you really worry about your memory constraints, this is something you might take advantage of: if you know the network only has to run at test time, you might go into the code and make sure nothing gets cached.

[Student: if we remember the local gradients in the forward pass, do we still need the other intermediates?] That might only be the case in some simple expressions like this one; I'm not sure it's true in general. But you're in charge of remembering whatever you need to perform the backward pass, on a gate-by-gate basis, so you can be clever with that and keep the footprint low.
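Putting the multiply gate together, a minimal sketch (plain Python; the class and method names are mine):

```python
class MultiplyGate(object):
    # binary multiply gate: z = x * y
    def forward(self, x, y):
        # must remember the inputs: the backward pass needs them
        self.x = x
        self.y = y
        return x * y

    def backward(self, dz):
        # dz is dL/dz, the gradient on the final loss flowing in from above;
        # chain rule: local gradient times the upstream gradient
        dx = self.y * dz   # local dz/dx = y
        dy = self.x * dz   # local dz/dy = x
        return [dx, dy]
```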
To give you an example of what this looks like in practice, let's look at Torch. Torch is a deep learning framework, which we might go into a bit near the end of the class, and which some of you might end up using for your projects. If you go into the GitHub repo for Torch, it's basically a giant collection of these layer objects, and layers and gates are the same thing. That's really what a deep learning framework is: a whole bunch of layers, plus a very thin computational-graph thing that keeps track of the layer connectivity. The image to have in mind is that all these layers are your Lego blocks: you build computational graphs out of them, putting them together in various ways depending on what you want to achieve, and you end up building all kinds of stuff. That's how you work with neural networks. Every library is just a whole set of layers you might want to compute, and every layer implements a small function piece that knows how to do a forward pass and a backward pass.

To be specific, let's look at the MulConstant layer in Torch. It performs just a scaling by a scalar: it takes a tensor x, which is not just a scalar but an array of numbers, because when we actually work with these we do a lot of vectorized operations (a tensor is really just an n-dimensional array), and it scales it by a constant. This layer is only about 40 lines. There's some initialization stuff (this is Lua, by the way, if it looks foreign to you), where you pass in the constant a that you want to use for scaling. In the forward pass, which Torch calls updateOutput, all it does is multiply x by a and return it. In the backward pass, which Torch calls updateGradInput, there's an if statement, but the three most important lines show that all it's doing is copying gradOutput, your gradient on the final loss, into gradInput, the gradient it needs to pass on, and multiplying it by the scalar a. That's exactly what it should be doing, because the local gradient is just a: you take the gradient from above and scale it by a. That's your gradInput, and that's what gets returned. So that's one of the hundreds of layers in Torch.

We can also look at examples in Caffe. Caffe is another deep learning framework, specifically for images, that you might be working with. Again, if you go into the layers directory on GitHub, you see all these layers, and all of them implement the forward/backward API. For example, there's a sigmoid layer. The sigmoid layer takes a blob (Caffe likes to call its tensors blobs), which is just an n-dimensional array of numbers, and passes it elementwise through the sigmoid function. In the forward pass it computes the sigmoid: a lot of the code is boilerplate for getting at the data, but you can see that it takes the bottom blob, calls the sigmoid function on it, and that's the sigmoid function right there. In the backward pass there's some boilerplate too, but what's important is the line where the chain rule happens: Caffe calls the gradients "diffs", and the bottom diff is computed as the top diff times this piece, which is the local gradient. That multiplication is the chain rule happening right there. And that's it: every single layer is just a forward/backward API.
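As a sketch of what that MulConstant layer does (a Python analogue I'm writing for illustration, not Torch's actual Lua code):

```python
class MulConstant(object):
    # scales its input by a fixed constant a
    def __init__(self, a):
        self.a = a

    def forward(self, x):
        # works elementwise whether x is a number or a numpy array
        return self.a * x

    def backward(self, grad_output):
        # the local gradient is just a, so scale the upstream gradient by a
        return self.a * grad_output
```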
And on top of that, you have a computational graph, or net object, that keeps track of all the connectivity. Any questions about these implementations? [Student: do you have to go through forward and backward for every update?] Yes, because when you want to do an update you need the gradient. So you forward your sampled mini-batch, you right away do a backward, and now you have your analytic gradient, and then you can do an update, where you take the analytic gradient and change your weights a tiny bit in the direction of the negative gradient. Forward computes the loss, backward computes the gradient, and the update uses the gradient to nudge the weights a bit. That's what keeps happening in a loop when you train a neural network: forward, backward, update; forward, backward, update. We'll see that in a bit.

[Student asks about the for loop in the Caffe code.] Oh, is there a for loop there? I didn't even notice; yes, they have a for loop. You'd like that to be vectorized; I'm not actually sure, but this is C++, so it's probably fine. This is the CPU implementation of the sigmoid layer, by the way, I should mention: there's a second file that implements the sigmoid layer on the GPU, which is CUDA code, in a separate file, something like sigmoid.cu. I'm not showing you that one.

OK, great. One point I'd like to make is that we will of course be working with vectors, so the things flowing along our graphs are not just scalars; they're entire vectors. Nothing changes, except that now that x, y and z are vectors, the local gradient, which used to be just a scalar, is in general, for general expressions, a full Jacobian matrix. A Jacobian is a two-dimensional matrix that tells you the influence of every single element of x on every single element of z. The gradient is the same expression as before, but now dz/dx is an entire Jacobian matrix and dL/dz is a vector, so you end up with a matrix-vector multiply to chain the gradient backwards. I'll come back to this point in a bit: you never actually end up forming the full Jacobian or carrying out this matrix-vector multiply explicitly; it's just the general way of looking at an arbitrary function. Also, I think those two factors are out of order on the slide: dz/dx is the Jacobian, which should be on the left side of the matrix-vector multiply, so that's a mistake in the slide.

Let me show you why you never actually need to form the Jacobian. Let's work with a specific example that's relatively common in neural networks: the nonlinearity max(0, x). What this operation does is receive a vector, say 4096 numbers, which is a typical size, 4096 real values coming in, and compute an elementwise threshold: anything lower than zero gets clamped to zero, and that's the function. The input and output vectors have the same dimension. So the question I'd like to ask is: what is the size
of the Jacobian matrix for this layer? It's 4096 by 4096: in principle, every single input number could have influenced every single output number. But that's not actually the case here, right? So the second question: this is a huge matrix, 16 million numbers, but why would you never form it? What does the Jacobian actually look like? Formally the Jacobian is always a matrix, because each of the 4096 inputs could influence each output, so it's still a giant 4096-by-4096 matrix, but it has special structure. What is that structure? Since this is an elementwise operation, the Jacobian only has entries on the diagonal; and moreover, they're not all ones, because whichever element was less than 0 got clamped to 0 in the forward pass, so the diagonal entries for those elements are zeros. The Jacobian is almost an identity matrix, but with zeros for the clamped elements. So you'd never actually want to form the full Jacobian, because that would be silly, and you'd never carry out the backward pass as a general matrix-vector multiply, because there's special structure we want to take advantage of. In particular, the backward pass for this operation is very, very easy: look at all the dimensions where the input was less than 0, and kill the gradient there, setting it to 0. You take the gradient from above, zero it out wherever the input was negative, and continue the backward pass. Very simple operations in the end, in terms of efficiency.

[Student: so the communication between gates is always just vectors?] That's right. The Jacobian, if you wanted to form it, is internal to you inside the gate, and you could use it to do backprop, but what goes back to the other gates is only the gradient vector. [Student: unless you have multiple outputs?] Then for each output we'd have to do all of this in parallel, but we'll never actually run into that case, because we almost always have a single scalar output at the end: we're interested in loss functions, so there's a single number at the end that we want gradients with respect to. If we had multiple outputs we'd have to keep track of all of them in parallel during backpropagation, but we only have scalar-valued loss functions, so we don't have to worry about it.

I also want to point out that 4096 dimensions isn't even crazy. Usually we use mini-batches, say 100 examples going through at the same time, so you end up with 100 4096-dimensional vectors all coming in in parallel. The examples in a mini-batch are processed independently of each other, so the Jacobian really ends up being something like 409,600 by 409,600: huge. You never form these. You take some care to exploit the sparsity structure of the Jacobian, and you hand-code the operations; you don't write the fully generalized chain rule inside any gate implementation.
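Exploiting that structure, a minimal vectorized sketch of the forward and backward pass for this layer (numpy; the names are mine):

```python
import numpy as np

def relu_forward(x):
    # elementwise threshold at zero
    return np.maximum(0, x)

def relu_backward(x, dout):
    # never form the 4096 x 4096 Jacobian: it is diagonal, with zeros
    # wherever the input was negative, so just zero out those gradients
    dx = dout.copy()
    dx[x < 0] = 0
    return dx

x = np.random.randn(4096)      # forward-pass input
out = relu_forward(x)
dout = np.random.randn(4096)   # gradient coming from above
dx = relu_backward(x, dout)    # gradient passed back to the previous gate
```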
OK, cool. In your assignment you'll be writing SVMs and softmax and so on, and I want to give you a hint on the design of how you should approach the problem. Think of it as backpropagation, even though you're doing linear classification optimization. Roughly, your structure should look like this: stage your computation into units whose local gradients you know, and then do backprop through those stages when you evaluate the gradients. At the top, your code will look something like this. There's no graph structure, because you're doing everything inline, no crazy edges or anything like that; you'll build an actual graph object and implement your layers in the second assignment, but in the first assignment you do it inline, straight-up vanilla. So: compute your scores based on W and X, compute the margins, which are max of zero and the score differences, compute the loss, and then do backprop. In particular, I really advise you to materialize the intermediate scores as a matrix, and to compute the gradient on the scores before you compute the gradient on your weights: use the chain rule. Otherwise you might be tempted to derive "the gradient on W equals ..." in one shot and implement that directly, and that's an unhealthy way of approaching the problem. Stage your computation and backprop through the scores; that will help you out.
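A minimal sketch of that staging for the multiclass SVM loss (numpy; regularization is omitted and the function name and exact interface are my choice, not the assignment's):

```python
import numpy as np

def svm_loss_staged(W, X, y, delta=1.0):
    N = X.shape[0]

    # stage 1: scores, kept as an explicit intermediate matrix
    scores = X.dot(W)                               # (N, C)
    correct = scores[np.arange(N), y][:, None]      # (N, 1) correct-class scores

    # stage 2: margins = max(0, score differences)
    margins = np.maximum(0, scores - correct + delta)
    margins[np.arange(N), y] = 0

    # stage 3: data loss
    loss = margins.sum() / N

    # backprop stage 3 -> 2 -> 1: first the gradient on the scores...
    dscores = (margins > 0).astype(float)           # (N, C)
    dscores[np.arange(N), y] = -dscores.sum(axis=1)
    dscores /= N

    # ...then chain it into the gradient on the weights
    dW = X.T.dot(dscores)                           # (D, C)
    return loss, dW
```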
So, a quick summary of where we are. Neural networks will be hopelessly large, so there's no hope of writing out the gradient expression by hand for every parameter. Instead, we think in terms of computational structures with intermediate nodes, a forward/backward API for both the nodes and the graph structure, where the graph structure is usually a very thin wrapper around all the layers that handles the communication between them. That communication is always just vectors being passed around; in practice what we pass around are n-dimensional tensors, which really just means n-dimensional arrays, like numpy arrays. Those are what go between the gates, and internally every single gate knows what to do in its forward and backward pass.

At this point I'm going to end with backpropagation and go into neural networks. Any questions before we move on? [Student asks about the summation inside L_i.] Yes, there's a sum there, and you'd want that to be a vectorized operation. The challenge in your assignment is largely: how do you make sure you do all of this efficiently and nicely with vectorized operations in numpy? That's going to be some of the brainteaser stuff you'll have to work through. [Student asks about choosing gates.] It's up to you what you want your gates to be. I'm not sure that particular choice would work, but it's up to you to design it and to backprop through it; that's the fun part.

OK, so, neural networks. This is exactly what they look like, and you'll be implementing these; this is just what comes up when you search Google Images for "neural networks", I think it's the first result. Before we dive in, let's first do it without all the brain stuff: forget that they're neural, forget that they have any relation whatsoever to the brain (they don't; and if you thought they did, forget it). Let's just look at score functions. Before, we had f = Wx, which is what we've been working with so far; but now, as I said, we're going to make f more complex. If you want a neural network, you change that equation to f = W2 max(0, W1 x). This is a two-layer neural network, and it's just a more complex mathematical expression of x. What happens is: you receive your input x, and you multiply by a matrix, just like before. What comes next is a nonlinearity, or activation function, and we'll go into several choices you might make for these; in this case I'm using the threshold at zero as the activation function. So we do a matrix multiply, we threshold everything negative to zero, and then we do one more matrix multiply, and that gives us our scores. If I were to draw this, say for CIFAR-10: 3072 numbers go in, the pixel values, and before, we went with a single matrix multiply straight to 10 scores. Now we go through an intermediate representation, a hidden vector; we'll call these hidden layers. Say the hidden vector h is 100 numbers, or whatever you want the size of your network to be; that's a hyperparameter, say 100. A matrix multiply gives us those 100 numbers, we threshold at zero, and one more matrix multiply gives the scores.

Since we have more numbers, we have more wiggle room to do more interesting things. One particular example of something interesting you might think a neural network could do: going back to the example of interpreting linear classifiers on CIFAR-10, we saw that the car class template had this red blur that tries to merge all the modes, cars of different colors facing different directions. One single linear classifier had to cover all those modes, and, for example, cars of different colors weren't very natural to deal with. But now we have 100 numbers in the intermediate representation, so you might imagine that one of them picks up on a red car facing forward, just classifying whether there's a red car facing forward; another on a red car facing slightly to the left; another slightly to the right. Those elements of h would only become positive if they found the thing they're looking for in the image, and otherwise stay at zero. Other elements of h might look for green cars or yellow cars in different orientations. So now we can have a template for each of these different modes: these neurons turn on or off if they find the thing they're looking for, a car of some specific type, and then the W2 matrix can sum across all those little car templates. Say we now have 20 car templates; to compute the score of the car classifier there's an additional matrix multiply, giving us a weighted sum over them. If any one of them turned on, then through the weighted sum, with positive weights presumably, I'd be adding it up and getting a higher score. So now I have a multimodal car classifier through this additional hidden layer in between.
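A minimal numeric sketch of that two-layer score function (numpy; sizes from the example, biases omitted for simplicity, and the variable names are mine):

```python
import numpy as np

# sizes from the example: 3072 input pixels, 100 hidden units, 10 classes
D, H, C = 3072, 100, 10
W1 = 0.01 * np.random.randn(D, H)
W2 = 0.01 * np.random.randn(H, C)

x = np.random.randn(D)           # one CIFAR-10 image, flattened
h = np.maximum(0, x.dot(W1))     # first matrix multiply, then threshold at zero
scores = h.dot(W2)               # second matrix multiply gives 10 class scores
```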
There was a question; the question is: if h had fewer than 10 units, would the network be inferior to a linear classifier? That's actually not obvious to me; it's an interesting question. I think you could make it work, and someone should try that for extra points on the assignment: you'll have a section on the assignment to do something fun or extra, where you get to come up with whatever you think is an interesting experiment, and we'll give you bonus points, so that's a good candidate for something to investigate. Any other questions? Good. [Question about how the hidden units get assigned to modes of the data.] Yep, I think so; you're really asking about the layout of the h vector and how it gets allocated over the different modes of the dataset, and I don't have a good answer for that. Since we're going to train this fully with backpropagation, I think it's naive to expect an exact template for, say, a red car facing left; you probably won't find that. You'll find mixes and weird intermediates and so on; the network will come in and find an optimal way to carve up your data with its linear boundaries, and all these weights get adjusted just to make the output come out right, so it's really hard to say; it all becomes tangled up, I think. Good. [Question: what is that 100?] Right, that's the size of the hidden layer, and it's a hyperparameter; we get to choose it, so I chose 100. Usually, and you'll see this with neural networks, we'll go into this a lot, you want them as big as possible, as big as fits in your computer and so on, so more is better; we'll get into that. Good. [Question: do we always take the max of 0 and h?] We don't, and I'll get to that in about five slides, so maybe I should preemptively just go ahead and take questions near the end. If you wanted this to be a three-layer neural network, by the way, there's a very simple way to extend this: we just keep continuing the same pattern, with all these intermediate hidden nodes, and we can keep making the network deeper and deeper, and you can compute more interesting functions, because, in a hand-wavy way, you're giving yourself more time to compute something interesting. Now, one other slide I wanted to flash: training a two-layer neural network is actually quite simple when it comes down to it. This is a slide borrowed from a blog post I found, and basically roughly 11 lines of Python suffice to implement a two-layer neural network doing binary classification on, what is this, three-dimensional data. So you have a data matrix X, and you have binary labels y, and then syn0 and syn1 are your weight matrices, weight one and weight two; I think they're called syn for synapse, but I'm not sure. Then this is the optimization loop here, and what you're seeing, I should use my pointer more, is that it computes the first layer activations, but using a sigmoid nonlinearity, not max of 0 and x; we'll get into what these nonlinearities might be, sigmoid is one form. So it computes the first layer, and then the second layer.
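For reference, here is a reconstruction along the lines of that widely circulated snippet; I'm not guaranteeing it matches the slide line for line, but it's the same idea, a sigmoid two-layer network trained on a tiny three-dimensional dataset with binary labels:

import numpy as np

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])     # data matrix, four 3-dimensional examples
y = np.array([[0],[1],[1],[0]])                     # binary labels
syn0 = 2 * np.random.random((3, 4)) - 1             # first weight matrix
syn1 = 2 * np.random.random((4, 1)) - 1             # second weight matrix
for j in range(60000):
    l1 = 1 / (1 + np.exp(-(X @ syn0)))              # first layer, sigmoid nonlinearity
    l2 = 1 / (1 + np.exp(-(l1 @ syn1)))             # second layer, sigmoid again
    l2_delta = (y - l2) * (l2 * (1 - l2))           # gradient signal at the output
    l1_delta = l2_delta @ syn1.T * (l1 * (1 - l1))  # backprop through the first layer
    syn1 += l1.T @ l2_delta                         # update fused with the last
    syn0 += X.T @ l1_delta                          # step of backprop

print(l2)  # predictions move toward y as training proceeds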
Then, right away, it computes the backward pass: l2_delta is the gradient on l2, then the gradient on l1, and the update is fused right in there, so it's doing the update at the same time as the final piece of backprop, where it forms the gradient on the weights and adds it in. So really, 11 lines suffice to train a neural network to do binary classification. The reason this loss might look slightly different from what you've seen so far is that this is a logistic regression loss; you saw a generalization of it, the softmax classifier, into multiple dimensions, but this is basically a logistic loss being optimized here, and you can go through it in more detail by yourself; the logistic regression loss looks slightly different, and that's what's inside there. But otherwise, yes, this is not too crazy a computation, and very few lines of code suffice to actually train these networks. Everything else is fluff: how you make it efficient, the cross-validation pipeline you need to have, all the stuff that goes on top is what produces these large code bases, but the kernel of it is quite simple; we compute these layers in a forward pass, we do a backward pass, we do an update, and we keep iterating this over and over again. Good. [Question about the random function.] The random function creates the initial random weights; you need to start somewhere, so you generate a random W. Okay, now I wanted to mention that you'll also be training a two-layer neural network in this class, so you'll be doing something very similar to this, but you won't be using logistic regression, and you might have different activation functions. Again, my advice when you implement this: stage your computation into intermediate results, and then do proper backpropagation into every intermediate result. So you receive the weight matrices and also the biases; I don't believe you had biases in your SVM and your softmax, but here you'll have biases. Take your weight matrices and the biases, compute the first hidden layer, compute your scores, compute your loss, and then do the backward pass: backprop into the scores, then backprop into the second-layer weights, backprop into the h1 vector, and then through h1 backprop into the first weight matrix and the first biases. Do proper backpropagation here; if instead you try to write dW1, the gradient on W1, right away as a single expression, it will be way too large and you'll have headaches, so do it through a series of steps in backpropagation. That's just a hint.
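Here is a minimal sketch of that staging, with made-up shapes and a stand-in gradient where the loss would normally supply one; the point is only to show named intermediate results and the step-by-step backward pass:

import numpy as np

# hypothetical sizes: N examples of dimension D, H hidden units, C classes
N, D, H, C = 4, 3072, 100, 10
X = np.random.randn(N, D)
W1, b1 = 0.01 * np.random.randn(D, H), np.zeros(H)
W2, b2 = 0.01 * np.random.randn(H, C), np.zeros(C)

# forward pass, staged into intermediates
h1 = np.maximum(0, X @ W1 + b1)   # first layer plus threshold at zero
scores = h1 @ W2 + b2             # class scores

# backward pass, one stage at a time; dscores would come out of your loss
# (softmax, SVM, ...), here it's just random so the shapes are concrete
dscores = np.random.randn(N, C)
dW2 = h1.T @ dscores              # backprop into the second-layer weights
db2 = dscores.sum(axis=0)         # ...and biases
dh1 = dscores @ W2.T              # backprop into the h1 vector
dh1[h1 <= 0] = 0                  # backprop through the threshold
dW1 = X.T @ dh1                   # backprop into the first weight matrix
db1 = dh1.sum(axis=0)             # ...and the first biases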
Okay, so that was the presentation of neural networks without all the brain stuff, and it looks fairly simple. Now we're going to make it slightly more insane by folding in all kinds of motivations, mostly historical, about how this came about and how it relates to the brain. We have neural networks, and we have neurons inside these neural networks. This is what neurons look like; this is just what happens when you do an image search for neurons. Your actual biological neurons don't look like this; they look more like that. Just very briefly, to give you an idea of where this is all coming from: a neuron has a cell body, or soma, as people like to call it, and it has all these dendrites that connect to other neurons; there's a cluster of other neurons and cell bodies over here, and the dendrites are really appendages that listen to them, so these are the inputs to the neuron. Then it has a single axon coming out that carries the output of the computation the neuron performs. So the neuron receives inputs, and if many of them align, it can choose to spike: it sends an action potential down the axon, which then diverges out to connect to the dendrites of other neurons downstream. So there are other neurons here whose dendrites connect to the axons of these guys; basically, neurons connect to each other through these synapses in between, with the dendrites as the inputs to a neuron and the axon carrying the output. From this you can come up with a very crude model of a neuron, and it looks something like this: here is the cell body of a neuron, and imagine an axon coming from a different neuron somewhere in the network; this neuron connects to that one through a synapse, and every synapse has a weight associated with it, roughly how much this neuron likes that neuron. The axon carries the value x, it interacts with the synapse, and in this crude model they multiply, so you get w0 times x0 flowing to the soma, and that happens for many neurons, so you have lots of inputs, w times x, flowing in. The cell body performs a sum, offset by a bias, and the result passes through an activation function to compute the output on the axon. In biological models, historically, people liked to use the sigmoid nonlinearity as the activation function, because it gives a number between 0 and 1 that you can interpret as the rate at which the neuron fires for that particular input: if the neuron sees something it likes in the neurons connected to it, it starts to spike a lot, and the rate is f of the input. Okay, so that's the crude model of a neuron. If I wanted to implement it, it would look something like this: a neuron's forward-pass function receives some inputs, a vector; we form the sum at the cell body, just a linear sum; we compute the firing rate as a sigmoid of the cell-body sum and return it; and then this can plug into other neurons. You can see that this looks very similar to a linear classifier: we form a weighted linear sum and pass it through a nonlinearity. So every single neuron in this model is really like a small linear classifier, but these linear classifiers plug into each other and can work together to do interesting things.
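A minimal sketch of that crude neuron model in code, along the lines of what the slide describes; the class name and example weights here are just for illustration:

import numpy as np

class Neuron:
    # crude model: weighted sum of inputs at the cell body, plus a bias,
    # squashed by a sigmoid into a firing rate between 0 and 1
    def __init__(self, weights, bias):
        self.weights = np.asarray(weights, dtype=float)
        self.bias = float(bias)

    def forward(self, inputs):
        cell_body_sum = np.sum(np.asarray(inputs) * self.weights) + self.bias
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))
        return firing_rate

# a neuron that "likes" its first input and "dislikes" its second:
n = Neuron(weights=[2.0, -3.0], bias=-1.0)
print(n.forward([1.0, 0.5]))  # a firing rate between 0 and 1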
Now, one note to make about these neurons: they're not like biological neurons. Biological neurons are super complex, so if you go around saying that neural networks work like the brain, people will start to frown at you, and that's because neurons are complex dynamical systems; there are many different types of neurons and they function differently; the dendrites can perform lots of interesting computation (a good review article is "Dendritic Computation", which I really enjoyed); the synapses are complex dynamical systems too, not just a single weight; and we're not really sure the brain uses a rate code to communicate. So this is a very crude mathematical model; don't push the analogy too much, but it's good for media articles, and I suppose that's why "this works like your brain" keeps coming up again and again. Okay, I'm not going to go deeper into this. To go back to a question asked before: there's an entire set of nonlinearities we can choose from. Historically, the sigmoid has been used quite a bit, and we'll go into much more detail on what these nonlinearities are, their trade-offs, and why you might want one or another; for now I just want to flash them and mention that there are many to choose from. Historically people used sigmoid and tanh; as of 2012, the ReLU became quite popular; it makes your networks converge quite a bit faster, so right now, if you want a default choice for a nonlinearity, use the ReLU; that's the current default recommendation. And then there are a few kind of hipster activation functions: leaky ReLUs were proposed a few years ago, maxout is interesting, and very recently the ELU. You can come up with different activation functions and argue why they might work better or not; it's an active area of research to come up with activation functions with better properties in one way or another, and we'll cover this in much more detail soon in the class. For now, we have these neurons, we have a choice of activation function, and we arrange the neurons into neural networks: we just connect them together so they can talk to each other. Here's an example of a two-layer neural net and a three-layer neural net; when you count the number of layers in a neural net, you count the layers that have weights. The input layer does not count as a layer, because those neurons are just values, they don't do any computation, so this one has two layers with weights; it's a two-layer net, and we call these layers fully connected layers. Remember, I've shown you that a single neuron computes a little weighted sum and passes it through a nonlinearity. The reason we arrange neurons into layers in a neural network is that layers let us perform the computation much more efficiently: instead of an amorphous blob of neurons, every one of which has to be computed independently, having them in layers lets us use vectorized operations, so we can compute an entire set of neurons, a whole hidden layer, at a single time, as just a matrix multiply. That's why we arrange them in layers, where the neurons inside a layer can be evaluated completely in parallel and all see the same input; it's a computational trick. So this is a three-layer neural net, and this is how you compute it: just a bunch of matrix multiplies, each followed by an activation function.
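A minimal sketch of that three-layer forward pass, again with made-up layer sizes; I'm using the threshold-at-zero activation from earlier, though the vectorization trick is the same whichever nonlinearity you pick:

import numpy as np

f = lambda x: np.maximum(0, x)                           # activation, applied elementwise

x = np.random.randn(3072, 1)                             # input column vector
W1, b1 = np.random.randn(100, 3072), np.zeros((100, 1))
W2, b2 = np.random.randn(100, 100),  np.zeros((100, 1))
W3, b3 = np.random.randn(10, 100),   np.zeros((10, 1))

h1  = f(W1 @ x + b1)    # the whole first hidden layer in one matrix multiply
h2  = f(W2 @ h1 + b2)   # the whole second hidden layer, same trick
out = W3 @ h2 + b3      # output scores (no activation on the last layer)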
Now I'd like to show you a demo of how these neural networks work. This is a JavaScript demo that I'll show in a bit; basically it's a two-layer neural network doing a binary classification task. We have two classes, red and green, as points in two dimensions, and I'm drawing the decision boundaries found by the neural network. What you can see is that when I train a neural network on this data, the more hidden neurons I have in my hidden layer, the more wiggle the network has, the crazier the functions it can compute. And just to show the effect of regularization strength as well: this is how much you penalize large Ws. You can see that when you insist your Ws be very small, you end up with very smooth functions; they don't have as much variance, there isn't as much wiggle these neural networks can give you; and as you decrease the regularization, the networks can do more and more complex things; they can get in there and cover even little squeezed-out points in the training data. So let me show you what this looks like during training. Okay, there's some stuff to explain here, and you can play with this yourself, because it's all in JavaScript. All right, what we have here is six neurons and a binary classification dataset with circle data, a little cluster of green dots surrounded by red dots, and we're training a neural network to classify it. If I restart, the neural network starts off random and then converges its decision boundary to classify the data. What I'm showing on the right, which is the cool part, is one interpretation of the neural network: I take this grid and show how the space gets warped by the network. You can interpret the network as using its hidden layer to transform your input data in such a way that the second layer can come in with a linear classifier and classify it. So you see the network arranging your space, warping it, so that the second layer, which is really a linear classifier on top of the first, can put a plane through it and separate the points. Let's look at that again from the initial state; you can roughly see how the space gets warped so that the data becomes linearly classifiable. This is something people sometimes also relate to the kernel trick: changing your data representation to a space where it's linearly separable. Okay, now here's a question: right now we have six neurons in the intermediate layer, and they let us separate these data points; you can actually roughly see those six neurons here, these lines are each kind of a function of one of the neurons. So here's a question for you: what is the minimum number of hidden neurons for which this dataset is separable by the neural network? [I heard some theories; someone says four; someone is doing a binary search.] Intuitively, here's how it works: with four, there's one neuron cutting the plane this way, another that way, and so on; four neurons cutting up the plane, and then an additional layer doing a weighted sum over them. In fact, the lowest number that works is three: with three neurons, one plane, second plane, third plane, three linear functions with a nonlinearity, you can carve out the space with three lines so that the second layer can just combine them, where their outputs are one rather than zero.
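If you want to reproduce that experiment outside the browser demo, here is a rough numpy sketch of a similar setup; everything in it, the synthetic data, the three tanh hidden units, the learning rate, is assumed for illustration, and whether three units find a clean boundary can depend on the random initialization, which is exactly what makes the in-class question interesting:

import numpy as np
np.random.seed(0)

# toy "circle" data: inner cluster is one class, a surrounding ring is the other
N = 100
r = np.concatenate([0.4 * np.random.rand(N), 0.6 + 0.4 * np.random.rand(N)])
t = 2 * np.pi * np.random.rand(2 * N)
X = np.stack([r * np.cos(t), r * np.sin(t)], axis=1)    # (200, 2) points
y = np.concatenate([np.ones(N), np.zeros(N)])[:, None]  # (200, 1) labels

H = 3  # the minimum hidden size discussed above
W1 = 0.5 * np.random.randn(2, H); b1 = np.zeros(H)
W2 = 0.5 * np.random.randn(H, 1); b2 = np.zeros(1)

for step in range(5000):
    h = np.tanh(X @ W1 + b1)                  # hidden layer warps the space
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # linear classifier + sigmoid on top
    dscores = (p - y) / len(X)                # gradient of the mean logistic loss
    dW2 = h.T @ dscores; db2 = dscores.sum(0)
    dh = dscores @ W2.T * (1 - h ** 2)        # backprop through tanh
    dW1 = X.T @ dh; db1 = dh.sum(0)
    for param, grad in [(W1, dW1), (b1, db1), (W2, dW2), (b2, db2)]:
        param -= 1.0 * grad                   # plain gradient descent step

print('train accuracy:', ((p > 0.5) == y).mean())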
At two, this breaks, because two lines are not enough; let's see, this isn't going to look very good. Yeah, with two, it basically finds the optimal way of using just the two lines; they kind of create this tunnel, and that's the best it can do. Okay. [Question: are the boundaries curved because you're using tanh?] Yeah, I'm not sure exactly how that works out, but if I were using ReLU I think you'd see much sharper boundaries, so let me change to ReLU. Yeah. [This is three? It looks like four.] That's because in some of these regions more than one of the ReLUs is active; there really are three lines, one, two, three, but in some corners two ReLU neurons are active at once and their weights add up; it's kind of funky, you have to think about it a bit. Okay, let's look at, say, 20 neurons, so we have lots of capacity, and let's look at different datasets, say the spiral: you can see how, as I run the updates, it just goes in there and figures it out. Or a very simple dataset. Or circle. And then random: this is random data, and it goes in and covers up the green ones apart from the red ones. And with fewer, say five, I'm going to break this now: with five it starts working worse and worse, because you don't have enough capacity to separate this data. So you can play with this yourself. Okay, and as a summary: we arrange these neurons in neural networks into fully connected layers; we've looked at backprop and how the computation gets chained in computational graphs; these networks are not really neural; and, as we'll see soon, the bigger the better, and we'll go into that a lot. I want to take questions before the end; we have two more minutes. [Question: is it always better to have more neurons, or can that be an issue?] So, is it always better to have more neurons in your neural network? The answer is yes: more is always better; it's usually a computational constraint. More will always work better, but then you have to be careful to regularize it properly; the correct way to constrain your neural network so it doesn't overfit your data is not to make the network smaller, the correct way is to increase your regularization. So you always want to use as large a network as you can, but then make sure to regularize it properly; most of the time, though, for computational reasons, you have a finite amount of time and don't want to wait forever to train your networks, so you'll use smaller ones for practical reasons. [Question: do you regularize each layer equally?] Usually you do, as a simplification; most often, networks trained in practice are regularized the same way throughout, but you don't have to, necessarily. Good. [Question: is there any value to using second derivatives, the Hessian, in optimizing neural networks?] There is value sometimes: when your datasets are small you can use things like L-BFGS, which I didn't go into too much; that's a second-order method, but usually the datasets are really large, and that's when L-BFGS doesn't work very well, so with millions of data points you can't do L-BFGS for various reasons. Also, L-BFGS isn't very good with mini-batches; you basically have to do full batch by default.
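On the regularization point above, here is a minimal sketch of how the usual L2 penalty enters the loss and the gradients; the function and names are made up for illustration, not taken from the assignment code:

import numpy as np

def add_l2_regularization(data_loss, grads, weights, reg=1e-3):
    # full loss = data loss + 0.5 * reg * sum of squared weights;
    # each weight matrix picks up an extra gradient contribution reg * W
    loss = data_loss
    for name, W in weights.items():
        loss += 0.5 * reg * np.sum(W * W)
        grads[name] = grads[name] + reg * W
    return loss, grads

# hypothetical usage: cranking reg up is what smooths the demo's boundaries
weights = {'W1': np.random.randn(2, 6), 'W2': np.random.randn(6, 1)}
grads = {name: np.zeros_like(W) for name, W in weights.items()}
loss, grads = add_l2_regularization(1.25, grads, weights, reg=1e-3)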
[Question: what is the trade-off between depth and size; roughly, how do you allocate?] I don't have a good answer for that, unfortunately. Depth is good, but maybe after, say, ten layers, if you have a simple dataset, it's not really adding much. We have one more minute, so I can still take some questions; you've had a question for a while. Yeah, so on the trade-off of where to allocate capacity, deeper or wider, there isn't a very good answer. Usually, especially with images, we find that more layers are critical, but sometimes with simple datasets, like 2D data or some other things, depth is not as critical, so it's somewhat data-dependent. [Question over there: do different activation functions for different layers help?] Usually that's not done; you usually just pick one and go with it. For ConvNets, for example, we'll see that most of them are trained just with ReLUs, and you use that throughout; there's no real benefit to switching them around, and people don't play with that much, but in principle there's nothing preventing you. So, it's 4:20, so we're going to end here, but we'll see lots more of neural networks, and a lot of these questions we'll get to as we go.
Info
Channel: Andrej Karpathy
Views: 220,248
Rating: 4.9684386 out of 5
Keywords: convolutional, neural, networks, class, stanford, backpropagation, backprop, lecture, chain rule, cs231n, visual, recognition, convnet, network
Id: i94OvYb6noo
Length: 79min 38sec (4778 seconds)
Published: Wed Jan 13 2016