Lecture 4 | The Backpropagation Algorithm

Captions
Watch out, there's a wire over here; I've seen six people trip on it. Alright, we're going to start exactly where we left off, with a little recap.

What we've seen so far: MLPs can represent any function. They can be constructed to represent pretty much any Boolean function and any classification boundary, and they can approximate any continuous-valued function to arbitrary precision, provided the network has sufficient capacity. The question was: given a function, how exactly do you construct a network to model it? Even if the network had sufficient architecture, we still had to set its parameters to model the function. One way to do that was to define the expected divergence, a measure of the error between what the network actually produces and what you want it to produce, and estimate the parameters to minimize this expected divergence.

The issue was that you can't actually compute the expected divergence. You would need to know several things: the probability distribution of the data itself, and the exact function you are trying to model. Neither of these is normally known. So instead we said we would sample the data. Provided the samples follow the distribution of the data, the collection of input/output pairs you obtain represents the function you want, and instead of computing the expected divergence you compute the average divergence across your training points. As we'll see in a later class, the average divergence is a perfectly unbiased estimate of the expected divergence, so you can hope that minimizing it will also minimize the expected divergence, at least in expectation.

Once we did that, we had a function to minimize: the loss function, our empirical estimate of the expected divergence. We were going to estimate the parameters W such that this loss is minimized. That is an instance of function minimization, or optimization, and we went through a crash course on it. Where we left off was functions of multivariate inputs, i.e. vector inputs.

The derivative of a scalar function of a multivariate input x is the multiplicative factor that gives the change in f for tiny variations in x. This is basically the same definition we've always used: if y = f(x), then Δy = (df/dx) Δx. Here Δx is a column vector, so df/dx must be a row vector. We also saw that, with things defined this way, the gradient ∇f at any point is, by definition, simply the transpose of the derivative.

So what exactly does the gradient mean? Writing things in vector form, df = (df/dx) dx, the product of the derivative and the incremental change in x. What is this product? dx is a vector, dx = (dx_1, dx_2, ..., dx_D), and the derivative is a vector of partial derivatives, so the incremental change df is just the inner product of the derivative and dx.
So what do we know about the inner product of two vectors? For any two vectors, x · y = |x| |y| cos θ, where θ is the angle between them. So for any u and v, uᵀv = |u| |v| cos θ. Say I fix u, fix the length of v, and rotate v around: at which point is the inner product maximum? When the two point in the same direction, when the angle is zero.

Now go back to our derivative: Δy = ∇f(x)ᵀ Δx. Fix the length of Δx and keep changing its direction. At which point is this product maximum? When Δx points in the direction of the gradient. If I always take a step of the same length, but in different directions, the direction that maximizes Δy is the gradient direction. So the gradient is the direction in which you must travel for the function to increase fastest. Standing at a point, you can take a step Δx in pretty much any direction, but there is one direction in which the function increases fastest, and that is the gradient direction.

There are two things here: the direction of the gradient and its magnitude. The direction is the direction you must walk for the fastest increase in the function; the magnitude is how much the function increases for a unit change in x. Moving in the exact opposite direction has the exact opposite consequence: f decreases fastest, so the negative gradient is the direction of fastest decrease of f(x).

And of course, at any local minimum, a critical point, the gradient is zero: what that says is that for infinitesimally small changes in x the function isn't changing, which is what you mean by a minimum if the function is continuous and smooth.

There's another property of the gradient we will frequently use. For any function I can draw an equal-value locus: if I slice the function at any height, the edges you get are loci of equal value. The gradient at any point is always perpendicular to the equal-value locus through that point. It's easily provable, but I won't go into it; it's something we'll use.
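As a quick sanity check of this claim, here is a small numerical experiment; the quadratic f, the starting point, and the step length are toy choices of mine, not anything from the lecture. Taking a fixed-length step in each of 360 directions, the largest increase in f occurs in the direction of the gradient:

```python
import numpy as np

def f(x):
    return x[0]**2 + 3 * x[1]**2

def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 1.0])
step = 1e-4                          # fixed step length |dx|

best_dir, best_inc = None, -np.inf
for theta in np.linspace(0, 2 * np.pi, 360, endpoint=False):
    d = np.array([np.cos(theta), np.sin(theta)])   # unit-length direction
    inc = f(x + step * d) - f(x)                   # change in f for this step
    if inc > best_inc:
        best_dir, best_inc = d, inc

g = grad_f(x)
print(best_dir)                      # approximately g / |g|
print(g / np.linalg.norm(g))         # the normalized gradient direction
```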
And of course, you all encountered the Hessian in your quiz.

So, going back to our problem: finding the minimum of a scalar function of a multivariate input, like a loss whose input is the set of all parameters of the neural network, requires finding a turning point where the gradient is zero. If I want to perform unconstrained minimization of a function whose formula I know, I can compute the gradient as a function of the variable x and solve ∇f(x) = 0; that gives the set of all critical points. Furthermore, I can compute the Hessian at each of these critical points and pick the locations where the Hessian is positive definite, or positive semi-definite; those are the minima.
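When the function is simple enough to admit a closed form, this recipe can be carried out symbolically. A minimal sketch using sympy, on a toy quadratic of my own rather than an example from the lecture:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
f = x1**2 + x1 * x2 + x2**2 - 4 * x1

# Solve grad f = 0 for the critical points.
grad = [sp.diff(f, v) for v in (x1, x2)]
critical_points = sp.solve(grad, (x1, x2), dict=True)

# Keep the critical points where the Hessian is positive definite: minima.
H = sp.hessian(f, (x1, x2))
for pt in critical_points:
    if all(ev > 0 for ev in H.subs(pt).eigenvals()):
        print('minimum at', pt)      # {x1: 8/3, x2: -4/3}
```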
The problem is that a closed-form solution is not always available. Suppose you have some hideous function: it may have a unique minimum, but if you try to solve ∇f(x) = 0 you'll find it isn't easy; a solution may exist, but not a closed-form one. In these cases you use an iterative solution: begin with a guess for the optimal x and refine it iteratively until the correct value is obtained.

So you start from an initial guess x0. You have some function f(x) and you start off saying: I think my minimum is here. Then you check the gradient. Will it actually be zero? If you're not at the minimum, it may not be. Say the slope at your guess is negative: if I increase x, the function decreases, so I must adjust my guess forward, in the direction of the decreasing function. So you start at an initial guess, check whether it's a minimum by checking the derivative, and if not, take a step in the direction you think the function is decreasing. If the derivative at the current point is negative, going forward decreases the function, so you take a step in the positive direction: x ← x + Δx, with Δx positive. If instead the slope at your current guess is positive, going forward increases the function, so you must go back: x ← x − Δx. In short: a positive derivative means moving left decreases the error; a negative derivative means moving right decreases the error.

This gives a very simple loop: while the derivative is not zero, if its sign is positive, take a step backward; if negative, take a step forward. I can compress both cases into one equation without a check: x ← x − sign(dy/dx) · Δx. These are the same statement: when the derivative is positive I step back; when it is negative, a negative of a negative becomes positive, and I step forward. Δx itself must clearly always be positive. We're going to keep using this logic for the rest of the course; if this doesn't make sense, nothing else will.

But I don't need to use just the sign of the derivative. I can combine the sign and the step size and use the derivative itself, multiplied by a step size η: x ← x − η (dy/dx). It's exactly the same equation; nothing has changed. I have retained the sign, and the length of the step is now the product of η and the magnitude of the derivative. This is your basic gradient descent rule: start with an initial guess, and while the derivative is not zero, take a step against the derivative. If the derivative is negative, the function is increasing behind you, so step forward; if positive, the function is increasing ahead of you, so step backward.

When would you stop? The simple test is to check the derivative itself. But will the derivative always be zero at the minimum? Assuming the function is differentiable and continuous, yes, but suppose I'm only allowed to look for solutions in a restricted range: then the minimum can be at a boundary, where the derivative need not be zero. So a more reasonable thing to test is whether the function value keeps changing as you take steps, as opposed to whether the derivative is zero. That is the gradient descent rule; to find a maximum you would move in the direction of the gradient instead.
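A minimal sketch of this scalar rule; the function, the initial guess, and the step size η are my own toy choices:

```python
def df(x):
    return 2.0 * (x - 3.0)      # derivative of f(x) = (x - 3)^2

x, eta = 0.0, 0.1               # initial guess and step size
for _ in range(200):
    x = x - eta * df(x)         # step against the derivative
print(x)                        # approaches the minimum at x = 3
```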
That was for scalar functions, but it generalizes to functions of multivariate inputs. The gradient is the direction of fastest increase of the function, so if I want to maximize the function I walk in the direction of the gradient; that's how I expect to get there quickest. If I want to minimize the function, I walk in the direction of maximum decrease. In other words, I take a step along the gradient if I'm maximizing, and against the gradient if I'm minimizing.

This gives the basic gradient descent algorithm: initialize x, then iterate x ← x − η ∇f(x)ᵀ until convergence. You can have multiple stopping criteria. You can check whether the magnitude of the gradient has become sufficiently small, but as we saw, that is not sufficient in every case: at a boundary minimum the gradient can be fairly large and you still have a minimum. An alternate criterion is to check whether the function stops decreasing appreciably over subsequent iterations. While your condition is not satisfied, you keep taking a step of size η against the gradient, toward the minimum. The overall algorithm is on the slide for reference.

Okay, we've looked at the general problem; go grab your coffee if you need it, you're better off grabbing a coffee than sleeping here. Now let's go back to our problem. You're given a training set of input-output pairs, and you have a loss function: the empirical average divergence, or error, over your entire training set, which you want to minimize with respect to W. This is function minimization, and we've seen how to do it with gradient descent. But just writing things this way is not sufficient; everything has to be defined. This is math. What are these input/output pairs? What is the function f? The function is our neural network, but how exactly do you characterize it, and what are its parameters? What is the divergence, the error metric we minimize in order to learn the function? All of these have to be defined.

First: what is f, and what are its parameters? This is the easiest bit; we've spent several lectures on it. Our function is a typical network: a directed graph with sets of inputs and outputs, and we assume no loops. Now, some terminology that confused me when I was a kid: we speak of input units, but there are no neurons at the input; the input units are just your inputs. The only actual neurons are those in the hidden units and the output units. The output you really want is the output of the final set of neurons. Everything in between performs partial computations that feed the output; more often than not you're not interested in the outcomes of those partial computations, only in what eventually happens. So the units in the middle can be hidden from view; we call them hidden neurons.

We will also assume a layered network, simply because it's easier to explain. This is not mandatory: your network need not be layered, and pretty much all the logic we'll discuss still holds, except that writing everything in vector form becomes tedious, and sometimes impossible, if things are not properly layered. We refer to the inputs as the input layer (again, no neurons there), the outputs as the output layer, and the intermediate layers as hidden layers.
Next we must define the individual neurons, and we'll use the standard perceptron-like function we've seen. I say "perceptron-like" because the standard definition of the perceptron has a threshold function operating on an affine combination of the inputs; here we are not using a threshold. We have some function f which operates on the affine combination of the inputs. So two different things happen in a neuron: first, an affine combination of the inputs is computed; second, that affine combination goes through an activation function. We assume the activation is continuous and differentiable almost everywhere. I say "almost" because something like a ReLU is not differentiable at zero; later in the lecture you'll see how to deal with those situations too.

What are the parameters of this unit? The parameters of an individual neuron are its weights and its bias, so the parameters of the entire network are the weights and biases of the entire network.

As for the activation functions themselves: we've seen threshold activations, but we will use others, like the sigmoid, the tanh, the ReLU, and the softplus. These, and their derivatives, are on the slide for reference. The key point is that every one of them is differentiable, and you can compute the derivative everywhere, except for the ReLU at zero. The ReLU doesn't have a derivative at zero, but it has something called a subgradient. Think about it in terms of gradients: the gradient is the direction of fastest increase of the function. For the ReLU, the slope is well defined to the left of zero and to the right of zero; exactly at the corner we cannot define how an incremental change in x results in an increment in y, but you can say the function is going to increase, and you can imagine any line between the two slopes as a reasonable approximation to what's happening at that corner. We call these subgradients, and the slope of any such line works just fine as the derivative at the corner. Typically we use the best case and define the derivative at zero to be 1.

In all of these cases we assumed the neuron computes a single output: a single affine combination of its inputs, producing a single scalar output. But there are other cases where, instead of a scalar y as a function of x1 through xD, you have a vector y as a function of x1 through xD: a bunch of inputs goes in and a bunch of outputs comes out. We're all familiar with such functions; your neural network as a whole is a classic case of a vector function. It takes a bunch of inputs and produces a bunch of outputs.
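For reference, here is one way these activations and their derivatives might be written; the only non-obvious line is the ReLU derivative, which uses the subgradient convention just described (derivative at zero defined as 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2          # tanh itself is just np.tanh

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return np.where(z >= 0.0, 1.0, 0.0)   # subgradient: derivative at 0 is 1

def softplus(z):
    return np.log1p(np.exp(z))

def d_softplus(z):
    return sigmoid(z)                     # derivative of softplus is the sigmoid
```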
So you can have what we will call vector activations. We'll deal with these again in a bit, but here's a classic example that we often use: the softmax. You have many inputs and many outputs, and you compute a separate affine combination z_j of the inputs for each output. The actual output is the normalized exponentiation of the affine combinations: y_j = exp(z_j) / Σ_k exp(z_k).

What's different between a vector activation and a mere collection of scalar activations? Consider a collection of independent functions, each going through its own sigmoid: that's one case, just a bank of sigmoid activations. In the other case the affine combinations go through a softmax, producing as many outputs as combinations. Now look at what happens. In the sigmoid bank, if I change one weight, only the corresponding output changes; I can modify individual outputs without touching the others. In the softmax, if I modify a weight so that one affine combination increases, the corresponding output increases, but because all the outputs are required to sum to one, the rest of them decrease. Changing any single parameter changes all of the outputs. That is the fundamental difference between vector and scalar activations, except that you can think of a collection of scalar activations as a degenerate case of a vector activation. You can have other kinds of vector activations too, some crazy stuff.

So in a layered network you can view each layer of neurons (I keep calling them perceptrons when they're really neurons) as a single vector activation: a genuine one if the layer is, say, a softmax; the degenerate case if each neuron operates individually on its inputs, where modifying one output doesn't modify the others.

Now let's introduce some notation. Every neuron has an input and an output, and there are weights on the connections. I have several hidden layers and a final output layer, and I'll number the input as the 0th layer to maintain consistency. I'll use y_j^(k) to denote the output of the j-th neuron of the k-th layer, and I'll write the inputs as the outputs of the 0th layer, so y_i^(0) simply means the i-th input; this lets me use a common notation y across the board. A connection between the i-th neuron of layer k−1 and the j-th neuron of layer k gets the weight w_ij^(k): a connection from the i-th neuron to the j-th neuron, where the j-th neuron belongs to the k-th layer. The superscript denotes the destination layer, the second subscript the destination neuron, and the first subscript the source neuron.
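Here is a sketch of the softmax and its coupling behavior. Subtracting max(z) before exponentiating is a standard numerical-stability trick of my own addition, not something from the lecture; it leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))    # stability shift; cancels in the ratio
    return e / np.sum(e)

z = np.array([1.0, 2.0, 0.5])
print(softmax(z))                # positive values summing to 1

# Perturbing a single affine combination changes *every* output,
# unlike a bank of independent sigmoids: the coupling described above.
z[0] += 0.5
print(softmax(z))
```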
So we've defined our network, the function, and its parameters; we have a representation. But how do we represent the input/output pairs? When we began the course we said things like "voice goes in and a transcription comes out"; that doesn't look like something you can represent as numbers, and we want everything represented numerically.

For the inputs, we assume each input is a D-dimensional vector whose components are real values; the output is some L-dimensional vector. For a training input, the first subscript is the index of the training instance (you can have any number of these) and the second subscript indexes the components, so x_n = (x_{n,1}, ..., x_{n,D}). Then you have the target output: what you want the network to give you when you present this input. We use d to denote this desired output, an L-dimensional vector. The network also has a current output y, which may not be the desired output: remember, you're learning the network, and in any iterative algorithm the current estimate of the parameters gives you a network which, more often than not, is not the one you really want. We denote the output of the current network in response to the current input by y, the same variable we also use for the intermediate outputs.

The input itself is expected to be a vector of numbers, or a scalar if the input has size 1. If you're using images, the input could be a vector of pixel values. For speech, it could be a collection of speech sample values, or spectra, or features such as spectra derived from the speech by some external process. If you're using text, text is not real-valued, so you'll have to convert it to some real-valued vector in order to input it to the network; we'll see how later. Or any other real vectors.

For the output: if the desired output is real-valued, there's nothing to be done; that really is the output you want the network to produce. If the desired output is a real scalar, d is a scalar value; if it's a real vector, d is a multi-component vector. Easy. But if the desired output is binary, as in a classification task (is the input inside the yellow region or not? is the image a cat or not?), the output takes two values, one for each possible outcome. We convert these outcomes to numbers: one outcome is assigned the value 1 and the other the value 0, and now you have a numeric representation of the output. So if I'm building a cat classifier and get a training image of a cat, the target output is "cat" and I represent it as 1; if somebody gives me an image of a tree, a tree is not a cat, so the target output is 0.
So when the desired output is binary we use a simple 1/0 representation. The output activation in these cases is typically a sigmoid, as we saw in the last class, and the output of the sigmoid is really the probability that the input belongs to class 1, given the input. This is what we will be learning, and it reflects the fact that for real data a feature x may in general occur for both classes, but with different probabilities. The nice thing about the sigmoid, of course, is that it's differentiable.

If you have a multi-class output, we use what are called one-hot representations. Say you're building a 5-class classifier: every input is either a dog, a cat, a camel, a hat, or a flower. These are concepts; you have to convert them to numbers. Because there are five possible outputs, we use a five-dimensional representation consisting of five binary values, and for any given output exactly one of them is 1 and the rest are 0. Each component represents one label: say the first is cat, the second dog, the third camel, the fourth hat, and the fifth flower; each component is basically a yes/no for its label. If somebody gives me a picture of a cat, the target output is 1 for the cat component and 0 for everything else; for a dog, 1 for the dog component and 0 elsewhere. Observe that this vector has exactly one component that is 1 and every other component 0, which is why it's called one-hot: one component is hot, the rest are not. For a multi-class classifier with N classes, the one-hot representation is an N-dimensional binary vector.

Ideally, the output of the network would itself be one-hot: it should see a picture of a cat and output 1 for the cat neuron and 0 for the others. In reality what you get is a probability vector: not four zeros and a one, but five positive values that together sum to one. That's what the softmax gives you, and it represents the a posteriori probability of each of the five classes. Questions?
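A small sketch of the one-hot encoding, using the five labels from the example; the helper function is hypothetical, just for illustration:

```python
import numpy as np

labels = ["cat", "dog", "camel", "hat", "flower"]

def one_hot(label, labels=labels):
    d = np.zeros(len(labels))
    d[labels.index(label)] = 1.0    # exactly one "hot" component
    return d

print(one_hot("cat"))               # [1. 0. 0. 0. 0.]
print(one_hot("camel"))             # [0. 0. 1. 0. 0.]
```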
Why the softmax, rather than just taking a max over real-valued outputs? Because you want something that's nice and clearly differentiable, which is what the softmax gives you: exponentiate and normalize so the values sum to one. Remember also that the affine combinations can be both positive and negative; there's no guarantee they're strictly positive. You can design your neurons however you want, but the softmax is mathematically convenient and it is congruent to a probability. You could use something else that satisfies the requirement that the highest value indicates the label, which is basically what we did with Adaline and Madaline, but it doesn't map onto a probability unless you normalize it and impose constraints, and even then you're unlikely to get a function with the kind of curve that reflects how things really happen in life.

So, in a typical problem you're given a number of training instances, say images of digits along with information about which digit each image represents. You can have tasks like binary recognition (is this a 2 or not?) or multi-class recognition (which digit is this?). For binary classification you would convert your training instances to pairs where each training input is a grid of pixel values, i.e. a numeric vector, and the target output is either 1 or 0, say for whether the image is the number 2 or not. For multi-class classification (the slide uses a lazy representation), the target output is a ten-component one-hot vector: for the first image the fifth component would be 1 and the rest 0, for the second image the second component would be 1 and the rest 0, and so on.

So we've defined f, we've defined x, we've defined d: how to represent inputs and outputs. We still have the issue of what the divergence should be. For the loss to be differentiable, the divergence must be differentiable: if it isn't, you don't have a differentiable connection between the loss and the parameters. You want everything to be differentiable, the divergence and the function f, because that's the only way to find out how small changes in W result in small changes in the loss.

The divergence is supposed to quantify the error between what the network actually produces and what you want it to produce, and the simplest divergence you can think of is the L2 divergence: given an output vector y and a desired output vector d, compute half the squared Euclidean distance between the two, Div(y, d) = ½ ‖y − d‖². The reason for the square is convenience: it's differentiable, whereas the plain magnitude is not differentiable everywhere. The scaling factor of ½ is also just convenience: when you take the derivative, what you get is simply a vector whose every component is the difference between the actual output and the target output, dDiv/dy = y − d. This vector is a vector of errors.
In fact, when people originally described the gradient descent algorithm for learning neural networks, the divergence they had in mind was the L2 divergence, and in the L2 divergence the derivative of the divergence is exactly the error. That's where the name "error backpropagation" comes from: the "error" refers to the fact that with an L2 divergence, the derivative is just the error.

For binary classification we like to use something else. We think of the output as the posterior probability of the class, and then we can define a divergence between probability measures, the most common being the Kullback-Leibler divergence. Remember the formula? For distributions P and Q, KL(P‖Q) = Σ_i P_i log(P_i / Q_i). The term outside the log is the true probability; the term in the denominator inside the log is the assigned probability; you want to match the two up. The KL divergence states the error you incur when you assign probability Q to the random variable when the true probability is P. In our case the desired output d is the true distribution and the network output y is the assigned one.

For a binary target, the true probability is just the target, and the divergence becomes Div = −(d log y + (1 − d) log(1 − y)), where y, the output of the network, lies between zero and one. Taking the derivative gives this interesting result: dDiv/dy = −d/y + (1 − d)/(1 − y), which is −1/y when the desired output is 1 and 1/(1 − y) when the desired output is 0. You can work this out; it's fairly trivial. But what happens when the desired output is 1 and the network is perfect, actually computing y = 1? The derivative is −1, not 0. Similarly, if the desired output is 0 and the actual output is 0, the derivative is 1. In neither case is the derivative zero when the network gets the correct answer. Don't always expect the derivative to go to zero when the network is actually correct. Why does this happen? Because you're restricting the output to lie between 0 and 1.
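A sketch of this binary divergence and its derivative, confirming the point about the derivative at the correct answer; the probed values are arbitrary:

```python
import numpy as np

def binary_divergence(y, d):
    return -(d * np.log(y) + (1 - d) * np.log(1 - y))

def d_binary_divergence(y, d):
    return -d / y + (1 - d) / (1 - y)

print(d_binary_divergence(0.999, 1.0))   # close to -1, not 0
print(d_binary_divergence(0.001, 0.0))   # close to +1, not 0
```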
For a multi-class classification problem, I can again use the KL divergence between the target output, which is one-hot, and the output produced by the network; rather than the KL divergence we can use the cross-entropy, since the two become strictly equivalent here. The divergence we'll use is Div = −Σ_i d_i log y_i. Remember one of the basic properties of entropy: if the probability distribution you assume equals the actual distribution, this term is minimized, whereas if y ≠ d it's fairly trivial to prove the term is greater than the minimum.

Now d is a one-hot vector, so let's expand the KL divergence between a one-hot vector and the network output: KL = Σ_i d_i log(d_i / y_i) = Σ_i d_i log d_i − Σ_i d_i log y_i. What is the first term, Σ_i d_i log d_i, given that exactly one of the d's is 1 and all the others are 0? It's 0: the hot component contributes 1 · log 1 = 0 and the rest contribute 0. So that term goes away and you're left with −Σ_i d_i log y_i. But it gets even more interesting: only one of the d's is 1, say d_c = 1 and d_i = 0 for i ≠ c. Then you get 0 · log y_1 + 0 · log y_2 + ... + 1 · log y_c + ...: all the terms for outputs not corresponding to the target class are multiplied by zero, and the only surviving term is the one for the correct class. So this apparently complicated formula collapses to something trivially simple: the divergence is simply the negative of the log of the probability assigned to the correct class by the network, Div = −log y_c.

Look at what happens. If the probability assigned to the correct class is 1, the divergence is 0. If the probability assigned to the correct class goes to 0, the divergence becomes +∞ (log 0 is −∞, and the minus outside makes it +∞). As y_c moves away from 1, the value increases very rapidly and becomes very large; that's why you want to minimize it, and it has the nice property of being exactly 0 at the correct answer.

And if you take the derivative: the collapsed divergence is not a function of the probabilities of the wrong classes at all, which is crazy, because the original formula included all classes. So the derivative has zeros everywhere except the one component corresponding to the correct class, where it is −1/y_c. Questions?
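A quick numerical confirmation of the collapse; the output vector y is an arbitrary probability vector of my own:

```python
import numpy as np

d = np.array([0.0, 1.0, 0.0])            # one-hot target, correct class c = 1
y = np.array([0.2, 0.7, 0.1])            # network's probability output

full_form = -np.sum(d * np.log(y))       # -sum_i d_i log y_i
collapsed = -np.log(y[1])                # -log y_c
print(full_form, collapsed)              # identical: ~0.3567 each

# Derivative wrt y: zero everywhere except the correct class.
print(-d / y)                            # [ 0., -1/0.7, 0. ]
```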
Let me be a little more careful about one step there. In Σ_i d_i log d_i, for all but one term d_i is 0, and 0 · log 0 goes to 0 (the log grows more slowly than linearly); for the one term where d_c = 1, log 1 = 0. So the whole sum is 0. And yes, those d's are the true probabilities, the one-hot representation: when you write these terms, the factor outside the log is the true probability of the data, and the factor inside is what you assigned to it. If you flip the two inside the ratio, plug it in and look at what happens: the expression can become negative, and the case where the distributions match can turn into the worst case rather than the best.

Could you use the squared error instead? It actually turns out that you can. The issue is that the squared error doesn't have this nice property where only one component remains, and if you plot the two functions their shapes are different. Say d is the true value and y is what you get: the KL divergence becomes very steep, while the quadratic flattens out and stops at the boundary. The KL divergence pushes you away from really bad values very quickly, but near the middle, if you have non-binary target outputs, it turns out not to be as good as just using the quadratic. So you can use different divergence functions; this one happens to be what we use because of mathematical convenience, and the KL divergence works particularly well when the target outputs are binary, at those extremes.

Let's try plotting it. Consider a two-dimensional output with target d = (1, 0) and output y = (y1, y2). The KL divergence is simply −log y1. When y1 = 1, the divergence is 0; as y1 goes to 0, it goes to +∞; and the slope at y1 = 1 is −1. So the divergence is really steep as you move away from the correct answer, and if the desired output were flipped the plot mirrors, equally steep. You can see why this is a good divergence here: it pushes you toward the correct answer really quickly. A quadratic divergence would instead look like a gentle bowl, not nearly as steep. Make sense? Any questions about what I just explained? It must have gone over several people's heads. No? Alright.

So once again, observe that the derivative is not 0 when the answer is correct, for the reason we discussed. The derivative would continue beyond the valid range, but those are not valid outputs.
Sometimes, for multi-class classification, we do something called label smoothing. Instead of using 1 and 0 targets, we might use 1 − ε for the correct class and, with K classes, ε/(K − 1) for all the others. Why do that? It has to do with the derivatives again: you don't want things to completely flatten out; you want a nonzero derivative everywhere. If your targets are hard 1/0 values, you can write down the derivative and see that the only parameters that get directly corrected are those for the correct class; the components for the incorrect classes remain untouched. With smoothed labels, the derivative for the incorrect components is nonzero as well, so you get a reasonable derivative even for the non-target classes.

To see why this matters, consider a target output of (1, 0, 0, 0, 0), and suppose the actual output I get is (0.01, 0.1, 0.2, 0.3, 0.39). I have four horrible outputs, but the only one that actually gets corrected, if I use a plain one-hot representation, is the first. You would like to say that the others must also get corrected, they are also bad, and the standard representation will not let you do that, whereas with smoothed labels you actually get a derivative for those components and can make a correction for them.
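A sketch of this smoothing scheme; ε = 0.1 is an illustrative choice. The final line shows that with smoothed targets the cross-entropy derivative is nonzero for every component:

```python
import numpy as np

def smooth_one_hot(correct_class, K, epsilon=0.1):
    d = np.full(K, epsilon / (K - 1))   # epsilon / (K - 1) for wrong classes
    d[correct_class] = 1.0 - epsilon    # 1 - epsilon for the correct class
    return d

d = smooth_one_hot(0, K=5)
print(d)                                # [0.9, 0.025, 0.025, 0.025, 0.025]

y = np.array([0.01, 0.1, 0.2, 0.3, 0.39])
print(-d / y)                           # every component now gets a gradient
```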
So here's the problem setup, finally. You are given a training set of input-output pairs. You define an error on the network's output using the divergence, and the loss is the average divergence across the entire training set. This is what you minimize with respect to all of your parameters, using the gradient descent formula we saw.

Gradient descent literally says: compute the gradient of the loss with respect to all of your parameters and take a step against it. How many parameters do we have? Many; that vector has many components. At the component level, computing the gradient means computing the partial derivative with respect to every single component, and taking a negative step against each of those derivatives. From the neural-network perspective: initialize all of the parameters (I'll assume the bias is also a weight, for convenience), and then, literally for every (i, j) combination, since weights go from a neuron i to a neuron j, compute the derivative of the loss with respect to that weight and do the gradient descent update.

This means we need to be able to compute that derivative. The loss is the average divergence across all training samples, and averaging is a linear operation, so the derivative of the loss with respect to any weight is the average, across the training samples, of the derivative of the divergence with respect to that weight. What we really need to compute, then, is the derivative of the divergence between the actual output of the network and the target output, with respect to each parameter, for each input. And the divergence and this derivative change with the input.

Before we continue, a quick calculus refresher. (I'm way behind today, so I may have to record some of this lecture if I run out of time in class.) For any differentiable function y = f(x) with derivative dy/dx, we know Δy = (dy/dx) Δx. For any differentiable function of many variables with partial derivatives, the corresponding relation also holds; again, this is the definition of the derivative, and we've gone over it a few times.

Now consider y = f(g(x)), a nested function. What is dy/dx? It's a two-step process. First define z = g(x); then by our derivative rule, Δz = (dg/dx) Δx. And y = f(z), which means Δy = (df/dz) Δz, again just the derivative rule. Plugging in Δz from the first step gives Δy = (df/dz)(dg/dx) Δx, i.e. dy/dx = f′(g(x)) · g′(x). So you can see how the chain rule is directly derived from the basic definition of derivatives.
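A quick numerical check of this chain rule on a toy composition of my own, y = sin(x²):

```python
import numpy as np

def g(x):
    return x ** 2

def f(z):
    return np.sin(z)

def dy_dx(x):
    return np.cos(g(x)) * 2 * x          # f'(g(x)) * g'(x)

x, eps = 0.7, 1e-6
numeric = (f(g(x + eps)) - f(g(x))) / eps
print(dy_dx(x), numeric)                 # agree to several decimal places
```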
Now let's go one step further. Say y is itself a nested function of many sub-functions: y = f(g_1(x), g_2(x), ..., g_M(x)), all of which are functions of x. What is dy/dx? The slides may be a bit busy, so let me write this cleanly. Let z_i = g_i(x); then y = f(z_1, z_2, ..., z_M), and we already know that Δy = (∂y/∂z_1) Δz_1 + (∂y/∂z_2) Δz_2 + ..., which is just the standard definition. But I also know that Δz_i = (dg_i/dx) Δx, because z_i is just g_i(x). Combining the two: Δy = (∂f/∂g_1)(dg_1/dx) Δx + (∂f/∂g_2)(dg_2/dx) Δx + ..., where the derivatives of the g's are full derivatives, not partials. Δx is common across the lot, so grouping gives dy/dx = Σ_i (∂f/∂g_i)(dg_i/dx): the well-known distributive law for derivatives.

You have to get really comfortable with this notion of distributing derivatives. There's a very convenient way of thinking about the whole thing, which you will encounter in your quiz just to make sure it's burned into your mind: I can draw the entire dependency between x and y as what I'll call an influence diagram. x influences z_1 through g_1, z_2 through g_2, and so on; x influences each of the z's through the corresponding g, and all of these z's finally influence y. How does an incremental change in x change y? It causes an incremental change in each of the z's, and the incremental change in each z changes y; and we've already seen that these effects are additive. Small perturbations in x cause small perturbations in each of g_1 through g_M, each of which individually affects y, and when many things influence one function, close enough to a point, their contributions are cumulative through the partials. The whole thing ends up being an addition.

So now, how do we actually compute the derivative of the divergence with respect to a weight, for a single input? First, a closer look at the network itself. I've drawn a network for bivariate inputs: the input has two components, x1 and x2, and there's also a bias term. I can expand out the actual computation in the network: at each neuron, two things happen. First, an affine combination of the inputs is computed; second, that affine combination is put through an activation function. Each of the yellow ellipses on the slide represents the totality of one neuron. Now I can assign labels to all the variables: the arrows are labeled with the weights, and at each neuron we must also label the two stages, the affine combination before the activation and the output after the activation. So z_2^(1) is the affine combination that goes into the second neuron of the first layer, and y_2^(1) is the output of the second neuron of the first layer. Finally you have the output Y, to which you also bring the desired output d, and you compute the divergence.
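Here is a minimal sketch of the forward computation in this notation: at each layer k the affine combination z^(k) is computed, then passed through the activation to get y^(k). The layer sizes, random weights, and the choice of sigmoid are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [2, 3, 3, 1]                     # layer 0 (input) through layer 3
W = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [rng.standard_normal(n) for n in sizes[1:]]

y = np.array([0.5, -1.0])                # y^(0): the input
zs, ys = [], [y]
for Wk, bk in zip(W, b):
    z = y @ Wk + bk                      # z^(k): affine combination
    y = sigmoid(z)                       # y^(k): output of layer k
    zs.append(z)
    ys.append(y)
print(y)                                 # the final network output Y
```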
Now, before we go on with the rest in the next class, let's spend a few minutes working out a little of how we'd compute this derivative using the chain rule, for a couple of variables. The divergence is differentiable, so assume I can compute dDiv/dY, the derivative of the divergence with respect to the output of the network; that's straightforward, simply because the divergence is computable and differentiable.

What is the derivative of the divergence with respect to the affine combination that went into the output neuron? (I'm assuming a single output neuron.) That's simply the chain rule: dDiv/dz_1^(3) = (dDiv/dY)(dY/dz_1^(3)), where z_1^(3) is the label we gave that affine combination. Easy; nothing particularly complicated. Then I can take a step back and write out the derivatives of the divergence with respect to the weights of the third layer in the same way.

But hang on, something is missing. I have happily written dY/dz_1^(3), but Y = f(z_1^(3)). So when I compute this derivative, what am I actually computing? The derivative of that activation function, evaluated where? For an arbitrary function the derivative is different at each location; here I'm computing it at z_1^(3). That means it's f′ evaluated at z_1^(3), which means I must have a value for z_1^(3): if I don't know what z_1^(3) is, I cannot compute the derivative.

Similarly, I can keep working my way backwards. Given the derivative with respect to z_1^(3), I can compute the derivatives with respect to all of the third-layer weights, because z_1^(3) is just an affine combination of the previous layer's outputs with those weights; that part is kind of trivial. But if I want to compute derivatives further back, with respect to earlier weights, I have to compute the derivatives with respect to the intermediate y's, which is possible: apply the chain rule and take a step back. And from there I need the derivative of each activation at its own input, which means the corresponding z must be known.

So you have the situation where the location at which each derivative is evaluated must be known before you can compute it. Which means you can't just start off computing derivatives: first you need to compute the intermediate affine combinations for every single neuron in the network, before you can begin computing the derivatives and making your adjustments. We'll see in the next class how the whole process happens, forward and backward. I'll stop right here. Questions? Thank you.
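To make the lecture's final point concrete, here is a one-neuron sketch under assumed values and an L2 divergence: the forward pass computes and stores z, and every factor in the backward step is evaluated at those stored values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_prev = np.array([0.3, 0.8])            # outputs of the previous layer
w, bias, d_target = np.array([0.5, -0.2]), 0.1, 1.0

# Forward pass: compute and *store* the affine combination z.
z = y_prev @ w + bias
y = sigmoid(z)

# Backward pass: every factor is evaluated at the stored forward values.
dDiv_dy = y - d_target                              # derivative of (1/2)(y - d)^2
dDiv_dz = dDiv_dy * sigmoid(z) * (1 - sigmoid(z))   # f'(z) needs the value of z
dDiv_dw = dDiv_dz * y_prev                          # z is affine in y_prev
print(dDiv_dw)
```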
Info
Channel: Carnegie Mellon University Deep Learning
Views: 8,421
Rating: 5 out of 5
Id: lTPg1hhd5Rs
Length: 77min 0sec (4620 seconds)
Published: Mon Sep 09 2019