Lecture 3 | Learning, Empirical Risk Minimization, and Optimization

This is class number three. We have 300 students registered, of which about sixty are in the room right now and about ten on streaming, so I'm not really sure what's happening with the rest. It's a bit of a shame that the semester has hardly begun and people can't even turn up to class; I'd expect somewhat better from CMU students. Anyway, topics for the day: we're going to talk about the problem of learning. Over the next several lectures, one through five, we'll be talking generally about the problem of learning neural networks. Today we'll go over some basic mechanisms: the perceptron rule and its application to multi-layer perceptrons. There are also some slides that I will not show, on Adaline and Madaline; please go over these, because you're going to encounter them in your quiz.

Here's a quick recap of what we've seen so far: neural networks are universal function approximators. They can model any Boolean function, any classification boundary, even any continuous-valued function, to arbitrary precision, provided the network satisfies the minimal architecture constraints. If the network doesn't satisfy the required architecture constraints, it's not going to be able to model the function. For example, in the case of an XOR, if you miss just one bit, then you have 50% wrong: it's a binary classification problem, so if you miss one bit, the error rate is 50%. But under the right conditions, a network can model pretty much any function.

All the tasks that we saw, transcribing voice signals, captioning images, playing games, and so many other things, in each case had a box which took an input and computed an output. This box is a function, and because neural networks are universal approximators, they can model any function. (Do you mind shutting off your phone, please? When you turn your phones and laptops on, you're not just bothering me, you're insulting your classmates who are actually paying attention. Don't do that.)

So here are the questions. We have this box; it takes some input and produces some output. What are all the questions we can ask? If we're speaking of making predictions about a game, for instance, the immediate question is: how do I represent the state of the game? How do you represent the input to the box, and how do you represent the output? Remember, these are functions; they work with numbers, and a game board is not a number. And then, once we figure out how to represent the inputs and outputs, we have to deal with the question of how we actually compose the network that performs the requisite function. For today we're not going to worry about how you represent the input and output; we're going to focus on how one composes functions.

A quick recap again: here was the original perceptron. It was a simple threshold unit: you took a weighted combination of the inputs, and if that exceeded a threshold, the perceptron fired; otherwise it didn't. A different way to look at it is that we first compute an affine combination of the inputs, and then put this affine combination through an activation function. In the basic perceptron, this
activation function was a threshold function, which fired if the input was non-negative. As a generalization, we can replace that activation function with other kinds of functions, for example a sigmoid, or the others we've seen in class like the ReLU and the leaky ReLU.

Now I'm going to redraw this neuron a little bit. Observe how we've represented it: you have a bunch of inputs, you're computing a weighted combination, and there's an extra line going in which represents an additive constant, what we call the bias, that factors into the affine value z. Instead, I can represent exactly the same operation by saying that my input x is augmented by one extra component whose value is always 1, and the weight of the corresponding connection is my bias. Both of these are strictly equivalent. The reason I mention this is that very often, in my diagrams and explanations, I will not explicitly mention the bias. When I don't, it means I'm implicitly using a representation of this kind; we are always looking at affine combinations of inputs, which always have a bias. If I don't mention it, that doesn't mean it doesn't exist; it means it's implicit.

So we have defined the basic unit. The entire function itself, we saw, must be a network of such units, and we will assume the network is strictly feed-forward. What I mean by feed-forward is that all the arrows are directed: when some input comes in, it gets processed by some neurons, which pass it on to other neurons, and so on. Any single neuron observes a given input only once; the signal never loops around and comes back to a neuron. These are directed networks.

Part of the design of a network that computes a function is the architecture of the network itself. We had a fairly long lecture on what it means to get the right architecture, but the fact of the matter is that finding the right architecture for a given problem is quite impossible, and we will often over-design and have too many parameters. For the purpose of this lecture, we're going to assume that the architecture of the network is given and has the capacity to model the function we want to model.

Now, this entire network is a connection of many neurons, but it takes an input X and produces an output Y, so the entire network is just a function. But it is a function with parameters. And what are the parameters of the network? The weights associated with each of the connections, and the biases; observe that I've included this extra 1 to represent the bias. So the parameters of the network are the set of weights and biases for all of the neurons in the network. If I want this network to represent a function properly, I must set the weights and biases just right, so that the resulting network computes the function that I want. When I speak of learning the network, I'm really speaking of figuring out what these weights and biases must be in order for the network to compute a specific function. Questions? No? OK.
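To make this "network = function with parameters" view concrete, here is a minimal sketch of a feed-forward pass, assuming NumPy; the layer sizes and helper names are illustrative, not from the course code:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, params):
    """Strictly feed-forward: each layer's output feeds only the next layer.
    The parameters are just the weights and biases of every neuron."""
    for W, b in params:
        x = sigmoid(W @ x + b)    # affine combination, then activation
    return x

rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 2)), rng.normal(size=3)),   # hidden layer
          (rng.normal(size=(1, 3)), rng.normal(size=1))]   # output layer
y = mlp_forward(np.array([0.5, -1.0]), params)
print(y)   # learning = choosing params so this output matches the target
```

Equivalently, each bias could be folded into the weights by appending a constant 1 to the layer's input, exactly as in the augmented representation above.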
Moving on: we've already seen that the MLP can represent pretty much any function to arbitrary precision. But the question we're facing today is: sure, I know the MLP can model any function; you give me a function, the MLP can model it. But how do I figure out what the weights and biases, what the parameters of the network, must be in order for it to model the particular function I'm trying to model?

The trivial solution: you can do it by hand. Let's say I want a function that produces a 1 when the input is inside the diamond and 0 outside. We spent a couple of classes seeing how we could do this. The diamond has four lines, of slope 1 or minus 1, and literally by inspection I can tell you that this line can be captured by a single perceptron; the input here, of course, is two-dimensional. So this line can be captured by a single perceptron whose incoming weights are 1 and minus 1 and whose bias term is 1 (the bias, not the threshold). This one, because the slope is negative, the weights are going to be minus 1 and minus 1 and the bias is 1; this one again minus 1 and 1 with a bias of 1; and that one 1 and 1 with a bias of 1. I'm not going to derive this, but you know how to do it: you can design the network by hand, and it's easy enough to do for something of this kind (see the sketch below). In previous classes I've also gone through the exercise of trying to get you to design a network that can perform simple addition; it turns out to be very complex, but you can actually do it. The problem is that this construct-by-hand option, where you handcraft a network to compute a specific function, is only feasible for the simplest functions. Anytime the function goes beyond basic, simple operations, it's out the window. So while it's technically feasible, practically it doesn't make sense.
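Here is that hand-built diamond network as code, a sketch assuming NumPy and the unit diamond |x1| + |x2| ≤ 1; the construction (one perceptron per edge, followed by an AND unit) follows the lecture's recipe, but the specific weights are my own worked example:

```python
import numpy as np

step = lambda z: (z >= 0).astype(float)   # threshold activation

def inside_diamond(x1, x2):
    """Two-layer network that fires iff |x1| + |x2| <= 1 (the unit diamond).
    First layer: one perceptron per edge; second layer: an AND of all four."""
    edges = np.array([[ 1,  1],    # fires when  x1 + x2 <= 1
                      [ 1, -1],    # fires when  x1 - x2 <= 1
                      [-1,  1],    # fires when -x1 + x2 <= 1
                      [-1, -1]])   # fires when -x1 - x2 <= 1
    h = step(1.0 - edges @ np.array([x1, x2]))   # bias 1, weights are negated edge normals
    return step(h.sum() - 3.5)                   # AND: all four edge units must fire

print(inside_diamond(0.2, 0.3))   # 1.0: inside the diamond
print(inside_diamond(1.0, 0.5))   # 0.0: outside
```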
What we really need is an automatic technique for estimating these parameters, for estimating an MLP. More generally, what this means is that you are given a function g(X), and we want the network to actually compute this g(X), so we have to derive the parameters of the network such that g(X) is computed. Now, how can you actually learn this network? The way we will do it is to optimize. Again, remember that we are assuming the architecture of the network is given, so we know the parameters.

Here's a one-dimensional illustration. Here's a function of X that you're trying to model. If I set the parameters of the network to arbitrary values, it is going to compute some function, and that function may not be the function I want. So at each X, I can define some measure of the error between what I want the function to compute and what the function actually computes. The total error here, of course, is this entire area, and if I modify the parameters of my function, this area is going to change: with a different set of parameter weights, the function might change so that the error is even smaller. So the question now is: how do I choose my weights such that the area of this error region is minimized? I'm going to quantify the error at any point through a divergence function (which may not be a metric), which compares the value of the function that I really want, g(X), against the value that the network currently gives me, f(X; W). It's this quantity that I'm integrating over the entire space, and I want to minimize this measure of the error between what the network actually computes and what I want it to compute. All clear?

Here's the problem: this assumes the function g(X) is known, and in reality g(X) is not going to be known. If I'm asking you to play chess, nobody has defined the correct move for every single state of the chessboard; it's just not feasible. You don't know the function, so you cannot compute this integral, because the value of the function is not known for nearly all X.

So what we will do is sample the function. Instead of having the value of the function at every point, I'm going to pick a bunch of X's and read the value of g(X) at those X's; basically, we are getting input-output pairs for a number of samples of X. And if you choose the X's at which you compute the outputs properly, you will choose more X's where X's naturally occur more frequently: the sampling of X should follow the actual distribution of X. Gathering such input-output pairs is really trivial; this is just collecting training data. For example, gathering a set of images and their labels or captions, or a set of voice samples and their transcriptions. This is not hard at all.

Now you don't really have the entire function anymore; you just have these training samples, and from these training samples you have to learn the function. So what we want to do is estimate the parameters. At this point, observe that the rest of the function is really not known; although the function exists, you only have these few dots. And what is the best you can do when you only have these dots? You can learn the parameters of the network to fit these values at these X's. If you learn the parameters so that the network correctly computes the g(X) values at the given X's, you can hope that afterwards, over the entire space of X, the function it has learned is actually the correct function; but there is no guarantee. Makes sense? This is really what we're going to do.

So, the story so far: learning a neural network is the same as determining the parameters of the function, the weights and the biases, required for it to model a desired function, and the network itself must have sufficient capacity to model the function we are interested in. Ideally we'd like to optimize the network to represent the function everywhere, as in this example, but in practice that requires knowledge of the entire function, over the entire space, which we don't have. So in practice what we will do is draw some input-output pair samples, optimize our network to compute the target output values at the samples we've drawn, and hope that it fits the function properly.
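A small illustration of this sampling setup; the target function g and the input distribution below are hypothetical, chosen purely for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target function g(x); in reality we can only query it at samples.
g = lambda x: np.sin(3 * x)

# Draw x's according to the input distribution p(x) (here, a Gaussian), so that
# frequent regions of the input space are represented more heavily.
X = rng.normal(loc=0.0, scale=1.0, size=200)
D = g(X)                      # desired outputs: the "training labels"

# These (x, d) pairs are all the learner ever sees of g; the rest of the
# function must be inferred, with no guarantee away from the samples.
train_set = list(zip(X, D))
print(train_set[:3])
```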
OK, so let's start with the simplest possible function. We know these things can model classifiers, multi-class classifiers, real-valued functions, anything. Let's begin with a simple classifier rather than regression; in fact, historically, classification was the earliest problem actually addressed using multi-layer perceptrons. Specifically, we will consider binary classification.

Taking a step back in history: the original multi-layer perceptron, as proposed by Marvin Minsky, was just a network of perceptrons with threshold activations, the basic perceptron, and we want to learn it using only training instances of input-output pairs. Place yourself in 1965. Even with threshold functions, we know this kind of network can model pretty much anything. So now you want to learn it; how are you going to learn it? And what is the simplest multi-layer perceptron you can think of? Something even simpler: the simplest network you can think of has only one unit, a single perceptron, which computes a step function across a hyperplane.

So let's look at the problem of learning this function. Our setup is that we are not going to be given the function itself; we are going to be given a collection of training samples. You're going to be given the information that for all of these red dots the output must be one, and for all of the blue dots the output must be zero. It's a single perceptron, so based only on those dots, you have to learn the parameters: the weights on the inputs and the bias, which means learning the parameters of the hyperplane itself. What this means is that you want to learn the weights and biases such that they represent a hyperplane with all of the red dots on one side and all of the blue dots on the other. The answer may not be unique, because the dots do not span the entire space; which again tells you there's no guarantee that you'll learn exactly this function. You may learn others which have an error with respect to the target function but which are still correct at the dots themselves.

Let me restate the perceptron slightly differently. Instead of having a bias, I'm going to extend the input with an additional component whose value is one, so that I can rewrite the function for the perceptron like this: the output is 1 if the weighted combination of inputs is non-negative, and 0 otherwise. Observe that I have gone from being affine to being linear, and you all know the difference between affine and linear at this point; this is just for convenience, nothing else.

Now, first thing: remember, a perceptron is completely specified by a hyperplane. Let's assume the hyperplane is linear, going through the origin, keeping in mind that this really is no different from an affine hyperplane. If that's the case, what is the equation for the hyperplane? We can write it as Σᵢ wᵢxᵢ = 0. I can put all of the w's in a vector w = (w₁, ..., w_D), and all of the x's in a vector x = (x₁, ..., x_D), D being the dimension (I think I've represented it as N+1 on the slide, but it's the same thing), and now my equation for the hyperplane is wᵀx = 0: the inner product between the two vectors is 0. Now, when I tell you that the inner product between two vectors is 0, what can you tell me about the angle between the vectors? Anyone? You, in the brown shirt?
Yes, 90 degrees. All right, so now, what does it mean to say that wᵀx = 0 is the equation for a hyperplane? Can anybody tell me? Yes: it's the locus of all vectors that are perpendicular to the weight vector. If I give you a weight vector w, say in three-dimensional space, there are many vectors which are orthogonal to this w, and wᵀx = 0 is the equation that represents every x that is orthogonal to w. It's the equation for a hyperplane, and specifically, the weight vector is normal to the hyperplane. This is something you want to remember: wᵀx = 0 is the equation for the set of all vectors orthogonal to w, and the weight vector is normal to that hyperplane.

Now consider any vector on this side of the hyperplane, the same side as the weight vector. What is wᵀx for this one going to be? Let's say this is x. I can write x as x₁ + x₂, correct? What is the relationship of x₁ to w? It's orthogonal, so wᵀx₁ = 0. What is the relationship of x₂ to w? It's parallel, so the sign of wᵀx₂ is positive. Which means that for any point on the same side as w, the inner product between the weight vector and x is going to be positive. What about something on the other side? It can again be written as the sum of two components: one on the plane, orthogonal to w, and one parallel to w but pointing the other way, and the inner product between two vectors which are 180 degrees opposed is negative. Remember, wᵀx = |w| |x| cos θ; if θ is 180 degrees, that's going to be negative. So remember this.

Now, when I say I want to find a hyperplane, when I want to learn the perceptron, I'm saying I want to find a hyperplane that perfectly separates all of the points for which I have labels. In other words, I want to find the weight vector such that all of the positive points, which have label one, have a component that's parallel to the weight vector and in the same direction, and all of the instances that have label minus one have a component that's opposed to the weight vector. So we want to find the weight vector such that wᵀx is positive for all the red points and negative for all the blue ones.
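A quick numerical check of this geometry, assuming NumPy; the vectors are arbitrary examples:

```python
import numpy as np

w = np.array([1.0, 2.0])            # weight vector, normal to the hyperplane w.x = 0

for x in [np.array([2.0, -1.0]),    # lies on the plane:  w.x = 0
          np.array([1.0, 1.0]),     # same side as w:     w.x > 0
          np.array([-1.0, -1.0])]:  # opposite side:      w.x < 0
    print(x, np.sign(w @ x))
```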
How can we do this? It turns out there's a very simple, iterative algorithm. You start off with some w, and you only update w on misclassified instances: if a misclassified instance is a positive instance, you add it to w; if it's a negative instance, you subtract it from w. We'll see this pictorially, but I can also write it mathematically. It's convenient to write the labels as plus one and minus one rather than one and zero, so all the positive instances, the red dots, have a label of plus one, and all the blue dots have a label of minus one. Then I go through all of my training instances, and the classification rule assigns a label that's simply the sign of wᵀx, as we've already seen. If this is incorrect, then to the weight vector you simply add y times x.

Now, this doesn't give you a lot of intuition, so let's do it with figures. I have a collection of these red and blue dots; the red dots have label plus one, the blue dots have label minus one. I start off with an initial weight vector, which represents some arbitrary hyperplane. If I'm treating my weight vector as representing a classifier, the hyperplane the weight vector represents should have the property that all the red dots are in the same direction as w from it and the blue dots are on the other side: all the red dots on the plus side of the hyperplane, all the blue dots on the minus side. When I arbitrarily choose a random hyperplane, this is not going to happen. So here, for instance, is my initialization. (I notice I have flipped my signs in this example: the blue dots are plus one and the red are minus one. Apologies, but let's stay with it.) This blue dot must have a sign of plus one, but it's actually being assigned minus one.

Now consider an individual dot, and think of why that happens. Suppose I have a dot with label plus, and this is the only dot I want to represent. What is the ideal weight vector? It's the vector to the dot itself: that weight vector gives me this hyperplane, and any other weight vector would actually bring the dot closer to the boundary. Suppose instead I have a single dot with label minus; then what is the ideal weight vector? If I had the weight vector pointing toward the dot, it would label it plus one, so I just flip it: the negative of the vector to the dot is my ideal weight vector.

So we recognize this; this is easy. Now let's go back and see what happened here. This dot should be a plus, but it's been labeled as a minus. The obvious thing for me to do is to move my weight vector towards the ideal weight vector for that dot, which is simply to add the ideal weight vector for that dot alone to my current weight. That's going to swing the boundary around, and now I've corrected it. But after having done this, although that point may now be on the correct side of the plane, we're not done: this other dot is misclassified. It's a negative instance which is being classified as a positive instance. What is the ideal weight vector for this dot? One pointing in the exact opposite direction. So we add the ideal weight vector for this dot to our current weight: we flip the vector, add it to the current weight vector, and that's the resultant weight vector and your new boundary. And now everything is correctly classified; we're done, and you can't do any better. The algorithm itself is very simple: you can make it sound complicated when you write it down in formulae, but the intuition is trivial.
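A compact sketch of this update rule, assuming NumPy and labels in {+1, -1}; the toy data and function names are mine:

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Rosenblatt's rule. Rows of X are already augmented with a trailing 1,
    so the bias is folded into the weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:   # misclassified instance
                w += yi * xi            # add positives, subtract negatives
                errors += 1
        if errors == 0:                 # converged: every point on the correct side
            return w
    return w

# Toy separable data: two clusters, with an augmented bias column of ones.
X = np.array([[2.0, 2.0, 1.0], [1.5, 2.5, 1.0], [-2.0, -1.0, 1.0], [-1.0, -2.0, 1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron_train(X, y))
```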
This is the algorithm Rosenblatt came up with way back when. So long as the red and blue dots are perfectly separable by a hyperplane, you can always find a separating hyperplane using the perceptron algorithm; this is guaranteed. And not only is it guaranteed that you can find the hyperplane (again, dependent on the dots being separable by a hyperplane), you actually have an additional guarantee which says that you're going to find it in a finite number of steps.

Let me skip the text and look at the figure. Here are all of these dots. The dot that is farthest from the origin gives the length of the largest training vector, some R. And then there's also the best-case decision boundary between the two classes, where the two classes are as far apart as possible; that distance, gamma, is the margin of your support vector machine, if you are familiar with SVMs: the best-case distance of the worst-case points from the boundary. It turns out that the number of misclassifications, the number of updates that the perceptron algorithm will make before it converges, is upper bounded by (R/γ)², the squared ratio of R and gamma. Because both of these are finite, there's a finite number of steps in which the algorithm will find a solution.

This is very nice: for the simplest possible problem, we found an algorithm that can actually find the boundary, provided the classes are separable, and the way we did it was to start off with an initial guess and keep making small adjustments anytime we got a point that was misclassified. Now, can I take the same algorithm one step further and try to do this for a more complex problem, like this one? Here is the boundary I'm trying to estimate: I want to find this double pentagon. I want the network to fire if the input is inside the yellow region, and to give me a 0 or a -1 if the input is outside. Except I'm not being given the double pentagon; I'm only being given the red and the blue dots. Observe that a perfect solution exists for this setup. So, using the same intuition as in the perceptron algorithm, we should be able to use something like it to find the solution for this problem. Assume that you're using the perfect architecture, so we have a network which we know can solve the problem; we are given the red and the blue dots, and we're going to try something like the perceptron algorithm to learn the solution.

If we do this, what is it we really need to do? What we are really trying to do is learn all these boundaries. We know, for example, that if I have this double pentagon, then this first neuron at the extreme left is probably going to have to model this boundary; the second neuron has to model that boundary, and so on. Each of the neurons in the first layer is going to model one of these lines; the lower-level neurons are all linear classifiers, and it's their combination that gives you the nonlinear output. But here's the problem: the training data you actually got have these labels; the reds are ones and the blues are minus ones. Now, say you want
to learn this boundary (I'm using a different example here). If I want to learn this boundary, what are the requirements on the labels of the dots on either side of it? We said the perceptron algorithm can only learn a linear boundary if the dots are separable, which means I want all of the dots on this side to have the label plus and all of the dots on the other side to have the label minus. If I don't have that, it's not going to be able to learn this boundary. But what labels do the dots actually have? Many of the dots on that side have the label minus. Similarly, if I'm trying to learn this other line, it's even crazier: I want everything on one side to be plus and everything on the other side to be minus, but you have pluses and minuses on both sides. So from the basic labeled data you have been given, you're not going to be able to learn this function, because what this particular perceptron requires is a labeling of a different kind.

So what is the solution? Here's the crazy bit: you're going to have to figure out the actual label you must assign to every single training instance, because you don't really know it, in order to learn every single perceptron. Say you have n training instances: how many ways can you assign binary labels to n training instances? 2 to the n. So you're going to have to evaluate every one of these 2^n possible ways of assigning labels to the instances just to be able to learn one boundary. And how many boundaries do we have here? Ten, in this case. So for a single line you have to try out every possible way of relabeling the dots, not just the blue dots but both the blue and the red dots; and in fact we have to do this for every one of the lines, and figure out how to relabel all of the training data for every one of the lines, such that when they are combined, the resulting output is exactly what you want. And observe that this is true even for a problem where we know the data can be perfectly separated by this network: we have data for which a solution exists, and we have a network of the correct architecture, but using something like the perceptron rule, where we start off with some initial guess and readjust things every time we find a misclassified point, is simply not going to work, because in addition to learning the weights, you're actually going to have to learn how to relabel the points for every single line. The problem is NP-hard: a solution is theoretically possible, but it has exponential time complexity.

So this is where we were in 1968. We knew these things could model functions; we knew solutions could exist; but we really didn't have the mechanism to find those solutions, because finding them had exponential complexity. But obviously people don't give up, so there were greedy algorithms, Adaline and Madaline, proposed by Bernie Widrow. Widrow is a professor over at Stanford, and the learning algorithms he came up with have formed the basis of a lot of machine learning and signal processing. What he figured was that the perceptron learning algorithm cannot be directly used to learn an MLP because of the exponential complexity; so can we use greedy algorithms instead? The problem here was that you had to figure out the labeling for the entire training set for
each perceptron. Instead of doing that, can you do it by flipping one training instance at a time and figuring out what happens? These algorithms were called Adaline and Madaline. We won't go through them in this class, but they are on the slides, and they do feature somewhat extensively in your quiz, so please go over the slides.

So, to summarize everything we've seen so far: learning a network means learning the weights and biases required to compute a target function. In practice, we learn networks by fitting them to match the input-output relationship over a finite number of training instances. A linear decision boundary can be learned by a single perceptron with a threshold activation function in finite time, provided the classes are separable. Nonlinear decision boundaries require networks of neurons. Training an MLP of threshold-activation perceptrons requires knowledge of the input-output relation for every training instance, for every perceptron in the network; and since you don't have this, these relations must be determined as part of training. It turns out the problem is NP-hard, and I believe NP-complete. And here's where things stood: the realization that training an entire MLP was a combinatorial optimization problem stalled the development of neural networks for well over a decade. Paul Werbos finally found a solution in 1974, in his PhD thesis (at Harvard), but the world didn't really discover the solution for another decade.

Now, why do we have this problem? Let's go back and consider the single perceptron. Say I have these pluses and these minuses in my training data, and my initial decision boundary, with this weight vector, is like so. This is wrong; it's got half my points wrong. Now suppose I know it's wrong, so I decide to adjust it. If I adjust it to this position, is my error going to change? No. Suppose I adjust it instead to this other position; is my error going to change? Again no. So simply by inspecting how the classification changes when I modify my parameters a little bit, I will not be able to decide in which direction the parameters must be modified; this is why I cannot use simple adjustment-based algorithms with this network. In fact, you can vary the weights a lot, you can swing the boundary from here all the way out to there, and the output is really not going to change: you can vary the weights a whole lot and observe no change in the output. The same thing happens for the individual neurons in a network: for each one, I can vary the parameters, and hence the decision boundary, a whole lot and still not see any change in the output, which doesn't tell me which way to vary the parameters.

In order for me to be able to learn the network properly, what I need is some setup where, when I change my parameters a little bit, I can see whether the error is increasing or decreasing. If I can do that, then I can keep inspecting these changes and keep adjusting my parameters in the direction where the error decreases. Makes sense? This is really what we want, and it is not going to be feasible with this kind of activation function.
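A small demonstration of this flatness, assuming NumPy; the data and the two loss functions are illustrative. The threshold unit's error count typically doesn't budge under small weight changes, while a continuous-valued unit's loss responds smoothly:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # true boundary: x1 + x2 = 0

def count_errors(w):
    """Classification-error count of a threshold unit: piecewise constant in w."""
    return np.sum(((X @ w) >= 0).astype(float) != y)

def sigmoid_loss(w):
    """Squared error of a sigmoid unit: varies smoothly with w."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return np.sum((p - y) ** 2)

w = np.array([1.0, 0.5])
for eps in [0.0, 1e-3, 2e-3]:               # nudge one weight slightly
    wp = w + np.array([eps, 0.0])
    print(eps, count_errors(wp), round(sigmoid_loss(wp), 6))
# The error count typically stays fixed for small nudges (no direction to go),
# while the sigmoid loss changes, telling us which way to adjust the weight.
```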
There is also a second problem: we assumed all the data are nice and separable. In practice, that's not going to happen; you're going to have blue dots on the red side and red dots on the blue side, so Rosenblatt's perceptron wouldn't even work in the first place for this kind of data.

So what is the solution? We will replace the threshold activation with a continuous-valued activation. Observe what happens in this example. Say I have these pluses and these minuses, and this is my boundary, with this as my positive side, so my sigmoid comes up out of the board like so. When I swing the boundary toward this point, you can see that the positive point moves down the slope, so the value I'm assigning to it moves away from the correct value of one. (I'm getting myself tangled up; that example is a bad one.) But what you would see is that when you have a continuous-valued function of this kind, as you move it, the gap between the label that you want, plus 1, and the actual value that the function outputs, which will be something between minus 1 and plus 1, changes continuously. So this gives you a mechanism. If the activation function is continuous and differentiable, and I have a training instance whose label should be minus 1 but which is being assigned something close to plus 1, then when I move the sigmoid a little bit this way, I can see that the value assigned by the network actually goes down, closer to the minus one that I want it to be. So now I know that the correct thing to do is to move it in this direction, whereas with a hard threshold function, sliding it left and right does not give you that information. So you want continuous functions, with continuous variation, which actually give you this information.

We've seen a bunch of such functions, but one of the more popular ones is the sigmoid, and the sigmoid in particular has a very interesting interpretation. To see what the interpretation might be, let's go back and look at our data, specifically non-linearly separable data. Here is a two-dimensional example: you want to classify the red dots as one and the blue dots as minus one (or zero), but you have some blue dots on the red side and some red dots on the blue side, which means I cannot draw a boundary which cleanly separates the red and the blue dots. To understand this better, let's look at a one-dimensional example. Here is what it looks like in one dimension: you have blue dots, you have red dots, but there's no distinct boundary where the dots go from being blue to red. As you go from left to right, the number of red dots that you encounter keeps increasing, and eventually you see mostly red dots and almost no blue dots. So instead of trying to find a perfect bar representing the data, let's begin counting things: let's look at a small window around each training point and compute the average value of all the training points within that window. What does this average tell you? It tells you the
probability of +1 within that small window. Within that small window, if you have 6 ones and 4 zeros, the average is 0.6, which is basically saying that in this range of X, the probability of the output being +1 is 0.6. As you keep going left to right, what you will find is that initially the window is dominated by the blue dots, so the probability of class 1 is close to 0; then, as you keep going right, the average slides from blue to red, and eventually you have all reds. So this curve actually represents the probability of class value 1 at each X, and this curve typically tends to look like a sigmoid. As it turns out, the sigmoid function, 1/(1 + e^(-x)), models this shape very nicely. Observe that when x is plus infinity, e^(-x) is 0, so the value is 1/1 = 1; when x is minus infinity, e^(-x) is infinity, so the value is 0; and it slides smoothly between the two. And that's not just in one dimension: in higher dimensions too, if X is a two-dimensional variable, you're actually going to get a sheet which has a sigmoidal shape, and if you look at the sigmoid computed over the affine combination of the inputs, that's going to have exactly the shape you want. We'll revisit this topic, of what the sigmoid activation means and the fact that perceptrons with sigmoid activations actually model class probabilities, much later in the course.
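A sketch of this windowed-average view, assuming NumPy and synthetic one-dimensional data whose true class probability is itself a sigmoid:

```python
import numpy as np

rng = np.random.default_rng(2)

# 1-D data: P(class = 1 | x) follows a sigmoid in x by construction.
x = np.sort(rng.uniform(-6, 6, 500))
p_true = 1.0 / (1.0 + np.exp(-x))
labels = (rng.uniform(size=500) < p_true).astype(float)   # 0 = blue, 1 = red

# Average the labels in a small window around each point: the fraction of
# ones in the window estimates P(class = 1) in that range of x.
window = 50
p_hat = np.convolve(labels, np.ones(window) / window, mode="same")

for xi in [-4.0, 0.0, 4.0]:
    i = np.argmin(np.abs(x - xi))
    print(f"x = {xi:+.0f}: windowed average = {p_hat[i]:.2f}, "
          f"sigmoid = {1 / (1 + np.exp(-xi)):.2f}")
```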
For now, let's go back to our basic problem. We've replaced the threshold activation with a sigmoid, and the sigmoid is differentiable. What it means for a function to be differentiable is that it tells you how much a small change in the input is going to affect the output; that is the notion of differentiability. Which means that over here, if I have a sigmoid σ, I can figure out how much a small variation of z is going to vary the output. And z is an affine combination of the inputs, so I can also easily compute how much a small change in W, or a small change in X, is going to change z. So now I can chain these up: I can tell you how much a small change in X is going to change z, and how much a small change in z is going to affect Y, and as a result I can tell you how much a small change in X, or in W, is going to change Y. By making the activation differentiable, we can compute the effect of small variations of either the inputs or the weights on the output.

When you plug this into a complex network, each of the activations is differentiable. For example, suppose I wanted to find out how much a small change in this weight affects the final output. A small change in this weight changes the output of this neuron, and because the activation is differentiable, I can compute that; I can also compute how much a small change in this neuron's output changes the outputs of the two neurons it's connected to, and how small changes in those outputs change Y. As a result, I can compute how much a small change in any single weight changes the output of the network, and I can do this for every weight in the network, simply because the activations are differentiable. In other words, the entire network is a differentiable function, differentiable with respect to every parameter in the network. Differentiability tells you how much small changes are going to affect the output, and once I know that, I have the handle, the ability, to make corrections to each of these parameters to make the output as correct as possible. So this overall function is differentiable with respect to every parameter: small changes in the parameters result in measurable changes in the outputs, and we're going to derive exactly what these changes are later.
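A minimal check of this chaining for a single sigmoid unit, assuming NumPy; the analytic chain-rule derivative is compared against finite differences:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def neuron(x, w):
    """y = sigmoid(w . x): an affine combination through a differentiable activation."""
    return sigmoid(w @ x)

x = np.array([0.3, -0.7, 1.0])   # trailing 1 plays the role of the bias input
w = np.array([0.5, 1.5, -0.2])

# Chain rule: dy/dw_i = sigmoid'(z) * x_i, with sigmoid'(z) = y * (1 - y).
z = w @ x
y = sigmoid(z)
analytic = y * (1 - y) * x

# Finite differences: nudge each weight slightly and watch the output move.
eps = 1e-6
numeric = np.array([(neuron(x, w + eps * np.eye(3)[i]) - y) / eps
                    for i in range(3)])
print(analytic)
print(numeric)   # matches the chain-rule values to several decimal places
```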
Here, then, is the overall setting for the learning algorithm for the MLP. You're given a collection of input-output pairs, X's and d's: X is the input, d is the desired output for that input. We now have to find the network parameters such that the network produces the desired output for every single input; we are assuming the architecture is given to us. Now, remember again that our original objective was to model the function correctly everywhere. If we were given the function, what we would do is compute the area of the error between the actual output of the network and the target function, and in the ideal case this area would be zero. More practically, because we don't really know the complexity of the architecture required, and because our learning algorithms are themselves going to be less than perfect, the error will not go down to zero. In that case we impose an extra requirement: we want the regions of the input space which have higher probability to have lower error than regions of the input space which have lower probability. For example, if you are trying to model this function, and you know that in reality these values of X are never observed, then what would be the point of expending effort trying to make sure the function is perfect over there? There's really no point, because you're never going to use those regions of the function. To account for this, we will weight the error at every value of X with the probability of that X; in other words, we emphasize more frequent values of X over less frequent values when we compute the error. This is the error we're going to try to minimize: we want to find the parameters W that minimize this weighted error. And what is this term? It's simply the expected value of the divergence: any time you compute the integral of some f(X) times P(X), you're really computing the expected value of f(X). So this is the expected divergence, E[div(f(X; W), g(X))], and that's what we're going to try to minimize.

But as we saw, in reality you're not going to get the complete function g(X); you're only going to get input-output samples, and so we have to estimate the function from these samples. So what do we do? Drawing this function again: if this is our target function and this is our current estimate, we want to minimize this area, but we don't have the function everywhere; we only have it at some locations. Maybe we just have these points as our training samples; all we have are these guys, so we cannot compute the area. Instead of computing the entire area that we're trying to minimize, what we will do is find the average of these lengths, and the average of those lengths is going to be our proxy for the area we're trying to minimize. So what we compute is an empirical estimate. We want to ideally minimize the expected error, what is also called the risk, when we learn the parameters of the network; but we cannot actually compute the expected error, so instead we compute an empirical average of the error across the training points, and this empirical average is our proxy for it. Keep in mind, again, that the actual objective is to minimize the expected error; it's just that we cannot compute it, so we minimize the empirical average error, the empirical risk. So the business of training a network is really a case of empirical risk minimization. We have a bunch of training instances; we define the error on the i-th training instance, which is this length over here, or some measure of it, the difference between what the network currently computes at that X and what we wanted it to compute; we average this over all of the training instances to get the empirical average error, the empirical risk; and we estimate the parameters to minimize this empirical risk.

Observe that what you're really trying to minimize is a measure of the error between the actual output of the function and the value you want. The terminology in the literature is to call this a loss, for various reasons that I won't get into. So during the rest of this course, and in your homeworks, you will hear the term loss; the loss is simply the empirical average error across all of the training points. Repeating the problem statement: you're given a training set of input-output pairs of this kind, and you want to learn the parameters W to minimize this loss term, which is simply the empirical average error across all of your training points. This is a problem of function minimization, an optimization problem.
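Pulling that definition into code, as a sketch: the empirical risk is just an average of per-sample divergences. The linear model and squared-error divergence below are illustrative stand-ins:

```python
import numpy as np

def empirical_risk(f, W, X, D, div):
    """Loss(W) = (1/N) * sum_i div(f(x_i; W), d_i): the average divergence over
    the training pairs, used as a proxy for the expected (true) risk."""
    return np.mean([div(f(x, W), d) for x, d in zip(X, D)])

# Illustrative stand-ins: a linear model and a squared-error divergence.
f = lambda x, W: W @ x
sq_div = lambda y, d: (y - d) ** 2

X = [np.array([1.0, 2.0]), np.array([-1.0, 0.5]), np.array([0.0, 1.0])]
D = [3.0, -0.5, 1.0]
W = np.array([1.0, 1.0])
print(empirical_risk(f, W, X, D, sq_div))
```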
So, the story so far: we learn networks by fitting them to training instances drawn from a target function. Learning networks with threshold activations requires solving a hard combinatorial optimization problem, so instead we use continuous activation functions with non-zero derivatives, which enable us to estimate the network parameters. And we define a differentiable divergence function, which computes the error between the actual output of the network and the output that you want. That divergence must be differentiable, because if it's not, everything is lost again. Maybe that's worth explaining. What you really have is this: you have a network, and when we made the activations differentiable, we arranged things so that we could compute how much a small change in a weight changes the output. But that by itself is not sufficient, because what we're really doing is comparing this output to a desired output to compute an error, and you want to figure out how to adjust the weight W such that this error is minimized. In other words, the error must be differentiable with respect to the weight, which means this comparison function, the error function itself, must be differentiable. So it's not sufficient for the activations to be differentiable; even the divergence function, which quantifies the error between what the network actually computes and what you wanted it to compute, must also be differentiable. Having said all of that, the problem can now be stated as a case of empirical risk minimization, specifically as function minimization. Questions? Anything on Piazza? OK.

[Question about sampling] Yes, that's correct; we made that statement early on. If the sampling doesn't follow the distribution of the data, all bets are off. You really want to sample according to the distribution of the data itself, because that's how you make sure you make less error in more populous areas, at the expense of less populous areas.

[Question about other activations] Pardon me, could you repeat that? The question is: what is the situation when we use activation functions besides the sigmoid? It doesn't really matter which you use; the criterion is that the activation function must be such that we can quantify the change in the output of the neuron in response to small changes in the input or the parameters. When you're using something like a ReLU, where the input can fall on the negative side, at that point the derivative becomes zero, and any time the input falls squarely in an area where the output has flattened out, you really cannot figure out how much a small change in the input will change the output. That's one of the deficiencies of the ReLU that we mentioned in the last class; but if you have a large enough network, some other neuron will hopefully capture that information anyhow. No, a sigmoid, we said, is really not a very good function in that, as you go far away from the boundary, it flattens out; something like a leaky ReLU, or many of these other activations which continue to have non-zero derivatives far away from the boundary, is obviously going to give you more information. And no, we're not saying that at all: a sigmoid flattens out far away from the boundary on both sides, whereas a ReLU only flattens out in this one region.

[Question about differentiability at the kink] It doesn't have to be strictly differentiable everywhere. Even at the kink, the odds that you fall exactly on that point are very small; but there, what you really want to know is whether a small change in the input will increase or decrease the output. You can use any line which falls below the function, something called a subgradient, and you can use those values; all of them will inform you that if I increase the input a little bit, the output is going to increase, and anything that informs you of that can be used. So, the question was how you deal with the fact that the ReLU is not differentiable at 0: again, it's not about differentiability per se, it is about the ability to know in which direction a small change in the input will change the output, so we can use subgradients. Anyway, moving on; I'm not going to finish today's topics.
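A sketch of the subgradient convention for the ReLU, assuming NumPy; the value chosen at zero is a free parameter, and frameworks typically just pick 0 or 1:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_subgradient(z, at_zero=0.0):
    """Derivative of ReLU: 1 for z > 0, 0 for z < 0. At z == 0, any value in
    [0, 1] is a valid subgradient; at_zero picks one of them."""
    return np.where(z > 0, 1.0, np.where(z < 0, 0.0, at_zero))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), relu_subgradient(z))
```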
So, what do you understand by the term derivative? Anybody? When I say derivative, how do you understand it? The slope at a point. And what is the slope? A tangent to the locus; the rate of change of the output. Now you're thinking of a scalar function; but if I have multi-dimensional inputs, then what does it mean? Anytime you speak of derivatives, your standard assumption is that you're talking about a scalar function with a scalar output. But a better way, and possibly the only useful way, to think about a derivative is this. Say I have a function y = f(x). If the function is smooth and continuous, then at any point, when you get close enough to it, it looks like a plane. For such a function you can always ask: how much does a small change in the input, Δx, change the output, Δy? And if it's locally planar, that's a multiplicative relationship. So the derivative of a function is this value α that multiplies a small change in the input to tell you what the change in the output is going to be, Δy = α Δx, and this is really the only way to think about derivatives in any useful way. For a scalar function you can talk about the slope: how much does a small change Δx change Δy? Assuming the relationship is locally linear, you can write it as the ratio of Δy to Δx; but you're really speaking of infinitesimal changes, and for many reasons, not least that once you go to multi-dimensional data you cannot write things in this manner, the more appropriate representation is f′(x), which we use all the time. The thing I want you to remember, though, is this: the derivative is a multiplicative term which tells you what you must multiply an incremental change in the input by, to compute the corresponding incremental change in the output.

Why am I insisting on this? Let's go to a multi-dimensional function, say a multivariate scalar function, where y is a scalar and x is a vector, like a bowl as a function of two variables. In that case I can still write Δy = α Δx. x is a vector now; if I assume x is a column vector, what is the shape of Δx? That is also a column vector; it's a small increment. And what is Δy? It's a scalar. So what must this α term be? A vector. What kind of vector? It's got to be a row vector. (There's a business of notation here: if you think of your inputs as row vectors, then the derivative is a column vector and you right-multiply by it; if you think of your inputs as column vectors, the derivative is a row vector and you left-multiply. Just thinking about it in these terms immediately informs you about how you even represent these things, their dimensionality, and so on. I'm going to assume that my inputs are column vectors.) The key piece is this relationship: Δy = α Δx.
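A numerical illustration of Δy = α Δx for a multivariate scalar function, assuming NumPy; the function itself is a made-up example:

```python
import numpy as np

# Multivariate scalar function: y = f(x), x a column vector, y a scalar.
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]

def derivative(x):
    """Row vector alpha of partial derivatives: alpha_i = dy/dx_i."""
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x0 = np.array([1.0, 2.0])
dx = np.array([1e-4, -2e-4])         # a small increment in the input

# Delta y ~ alpha . Delta x: the derivative multiplies the input increment.
predicted = derivative(x0) @ dx
actual = f(x0 + dx) - f(x0)
print(predicted, actual)             # agree to first order in the increment
```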
Now I can ask myself: Δx is a multi-dimensional vector, so what happens if I change only one component? From this formula you can immediately see that if all the other Δx_j are zero, the only thing you're left with is the one corresponding term, α_i Δx_i. And remember, the derivative itself doesn't depend on Δx; α depends on the current position and nothing else. Which means that this α_i is simply the partial derivative of y with respect to x_i, with all the other components held fixed. Makes sense to everybody? We usually write it as ∂y/∂x_i, and then the familiar formula falls out: the increment in y is the sum, over all components, of the partial derivative with respect to that component times the increment in that component, Δy = Σ_i (∂y/∂x_i) Δx_i. This is important; we're going to use it all the time.

Now, terminology. This derivative is a row vector, but you're probably more familiar with the term "gradient." If you're representing the inputs as column vectors, the gradient is by definition the transpose of the derivative, so the gradient has the same dimensionality as x itself.

Having introduced all these ideas: remember, we're trying to minimize the error with respect to the parameters W, but in the next few slides I'm going to use more generic notation and speak of some function y of some variable x. Don't get confused; this x is not the network's input, I'm just abstracting the problem out as some function y of x.

So now let's look at the business of optimization. The problem of minimization is that of finding the value of x where f(x) is at its minimum. You might have a function with multiple locations where it bottoms out, but one location is lower than all the others; that's the global minimum, and that is the location you're trying to find. If the function is a nice bowl, there's only one point at which it bottoms out, so that's what you'll find. If the function is hideous, there are many places with a local bottom, and you want the best one, the lowest one.

Now, if a continuous function bottoms out, it's decreasing before it hits the bottom and increasing after, so at the exact bottom you'd expect the derivative to be zero. Any time in high school or college that you were asked to find the location of the minimum of a function, the way you solved it was to compute the derivative and equate it to zero. It turns out this is not enough. Why? Because there are two different kinds of locations where the derivative is zero: both maxima and minima have zero derivative. For example, when the function approaches a maximum, initially its slope is positive; then it hits the maximum, and the function starts decreasing, so the slope becomes negative. When it approaches a minimum, the slope is negative because the function is falling; then it hits the minimum, and the function begins increasing, so the slope becomes positive.
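Here is a small symbolic sketch of that school procedure, using sympy on a made-up cubic (an illustration, not from the lecture):

    import sympy as sp

    x = sp.symbols('x')
    f = x**3 - 3*x                  # made-up function with one maximum and one minimum

    fprime = sp.diff(f, x)          # the derivative: 3*x**2 - 3
    critical = sp.solve(fprime, x)  # where the derivative is zero: [-1, 1]
    print(critical)

    # The derivative alone cannot tell these apart: x = -1 is a maximum
    # (f(-1) = 2) while x = 1 is a minimum (f(1) = -2).

Both critical points satisfy f′(x) = 0, which is precisely why equating the derivative to zero is not enough on its own.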
So both maxima and minima are locations where the function changes direction; these are called turning points, and the derivative at a turning point is always zero. If you plot the derivative of the curve, you'll find it is zero at both maxima and minima. So merely knowing that the derivative is zero doesn't give you enough information: it doesn't tell you whether you're at a maximum or a minimum.

But look at the derivative itself. At a maximum, the function is initially increasing, hits the maximum, and then decreases: the slope to the left of the maximum is positive and the slope to the right is negative. So if you plotted the slope, it goes from positive to negative at a maximum, and from negative to positive at a minimum. What, then, is the slope of the slope? It's negative at a maximum and positive at a minimum. If you compute the second derivative, that's exactly what you get: the second derivative is negative at maxima and positive at minima. This is the standard test you ran in school: first compute the derivative, equate it to zero to find x, and then check the second derivative. Easy enough. To find the minimum of a function, you find the values of x where f′(x) is zero, the turning points, and then you check the second derivative.

Things can get a little more complex than that, though. Look at a function like the one on top: there are three different types of locations where the derivative can be zero. First, maxima: the function increases, hits a maximum, then decreases. Second, minima: the function decreases, hits a minimum, then increases. And third, inflection points: the function is decreasing, takes a little rest, and then continues to decrease; or it's increasing, pauses for breath, and then continues to increase. At these places, too, the derivative is zero. So what is the second derivative at those locations, which are neither maxima nor minima? Look at the derivative of such a function: initially the slope is negative, the function is decreasing; then it stops decreasing; then it continues to decrease. The derivative has gone from negative, to less negative, to zero, and then back to negative. So what is the second derivative at that inflection point? Zero. In short: the second derivative is positive at minima, negative at maxima, and zero at inflection points.

Now, what about functions of multiple variables? That was all for one variable, but you have the same kind of situation. For a function of many variables, you again try to find the location where the derivative is zero. What do I mean by "the derivative is zero"? Remember that in a multi-dimensional space you can approach a point from many different directions. Regardless of which direction you arrive at that point from, you would hit a minimum; and if you continued past it, no matter in which direction, the next thing that happens is that you begin going back up.
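A small sympy sketch of this classification, run on three made-up one-variable functions whose only critical point is at x = 0 (illustration only):

    import sympy as sp

    x = sp.symbols('x')

    # x**2 has a minimum, -x**2 a maximum, and x**3 an inflection point, all at x = 0.
    for f in (x**2, -x**2, x**3):
        second = sp.diff(f, x, 2)
        for c in sp.solve(sp.diff(f, x), x):
            s = second.subs(x, c)
            if s > 0:
                kind = 'minimum'
            elif s < 0:
                kind = 'maximum'
            else:
                kind = 'second derivative is zero'
            print(f, 'at x =', c, '->', kind)

One caveat worth adding: a zero second derivative is, strictly speaking, inconclusive on its own. For x³ the critical point happens to be an inflection point, but a function like x⁴ also has a zero second derivative at what is a genuine minimum.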
So the optimum is still a turning point, but it has this characteristic: shifting away from it in any direction increases the value; the function goes up. When you are trying to find the minimum of a function, you're really trying to find the location in the input space such that, regardless of which direction you move from that point, the function increases. I'll stop over here; we'll continue with this in the next class.
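One way to picture that characterization is a rough numerical probe; the bowl f and the helper below are made up for illustration. Sample many random directions around a candidate point and check that a small step in every sampled direction fails to decrease the function:

    import numpy as np

    def f(x):
        # A simple two-dimensional bowl, lowest at the origin (made up for illustration).
        return x[0] ** 2 + 2.0 * x[1] ** 2

    def looks_like_minimum(f, x0, radius=1e-3, trials=1000):
        # Probe random directions: at a minimum, a small step in ANY direction
        # should not decrease the function value.
        rng = np.random.default_rng(0)
        for _ in range(trials):
            d = rng.standard_normal(x0.size)
            d *= radius / np.linalg.norm(d)   # step of fixed small length
            if f(x0 + d) < f(x0):
                return False
        return True

    print(looks_like_minimum(f, np.array([0.0, 0.0])))  # True: every direction goes up
    print(looks_like_minimum(f, np.array([0.5, 0.0])))  # False: some direction goes down

This only samples directions rather than proving anything, but it captures exactly the characterization above.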
Info
Channel: Carnegie Mellon University Deep Learning
Views: 10,591
Rating: 4.9402986 out of 5
Id: 9kAQ8Em7SdM
Length: 78min 42sec (4722 seconds)
Published: Wed Sep 04 2019