Lecture 10 - Neural Networks

ANNOUNCER: The following program is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we introduced the third linear model, which is logistic regression. It has the same structure as the linear models, where you have the inputs combined linearly using weights, summed up into a signal, and then the signal passes through something. In this case, it passes through what we refer to as a soft threshold. We labeled it theta. And the model is meant to implement a probability-- it has a genuine probability interpretation. And because of that, the error measure we derived was based on a likelihood measure, which has a probabilistic connotation, in which case we maximized the probability that we would get the data set that we got-- the outputs given the inputs-- based on the hypothesis that is represented by the logistic regression being assumed to be the target function identically. And this makes us able to express the probability in terms of the parameters that define the hypothesis, which are the weights, w. And therefore, we have this quantity that we want to maximize. And then we derived an error measure that very much parallels the error measures that we had before, in terms of the in-sample error for logistic regression that we will minimize. So this is a useful model, and it complements the other models. One of them was for classification, one of them for real-valued function regression. And this one is for a bounded real-valued function that is interpreted as a probability. One of the key issues about logistic regression is that, because the error measure is a little bit more complicated than we had, for example, in linear regression, we were unable to optimize it directly. And therefore, we introduced the method that is meant to minimize an arbitrary nonlinear function that is smooth enough, twice differentiable. And in the case of logistic regression, although we don't have a closed-form solution, the error measure actually has a very nice behavior. It's a convex function and, therefore, when you apply a method like gradient descent or other methods, it is fairly easy to optimize because you just fall into that minimum and stay there, rather than have problems with local minima that we talked about briefly. So the algorithm for gradient descent is the same regardless of the error measure that you are trying to minimize. First, you initialize and, in the case of logistic regression, initializing to all zeros was fine. We will find out today that in neural networks, that will not be fine, and we'll make the point why. And then you keep iterating until termination. And what you do is, you update your weights gradually by going along the negative of the gradient. That would be the steepest descent in the error-- the biggest gain you would get for a fixed-size step. And in this case, we adjusted the step so that its size is proportional to the size of the gradient at that point, with a fixed learning rate. We keep doing this, and then when we arrive at termination we report that as our final hypothesis. And we talked a little bit in the Q&A session about criteria for termination, and also about local minima that will become an issue for today. So today, when I modify gradient descent into the more practical version, which is called stochastic gradient descent, we will talk a little bit about initialization and we'll talk about other aspects that have to do with local minima and whatnot. OK. Today's topic is neural networks.
And historically, neural networks are responsible for the revival of interest in machine learning. They have a biological link that got people very excited and people pursued them. And they were very easy to implement because of the algorithm that I'm going to describe today. And they met a lot of success in practical applications, and got people going. Now, they are not necessarily the model of choice nowadays; people will probably opt for support vector machines or other models. Yet every now and then, neural networks would do the job as well as the other models. And many industries use them as a standard. For example, in banking and credit approval, neural networks are often used. So the outline for today is very simple. First, I'm going to extend gradient descent into the special case of stochastic gradient descent that is used in neural networks. And then, I'm going to talk about the neural network as a model. What is the hypothesis that it is implementing? And I'll motivate it from a biological point of view, and relate it to perceptrons. And then we'll talk about the backpropagation algorithm, the efficient algorithm that goes with neural networks that actually made that model particularly practical. So let's start with stochastic gradient descent. What do we have? We have gradient descent, and gradient descent minimizes an error function. That is a function of w-- it minimizes it with respect to w. And that happens to be an in-sample error in our mind. And it is the in-sample error. And the only thing I would note here that is particular to the derivation of stochastic gradient descent is that, in order for you to compute the error or the gradient of the error, which you need in order to implement gradient descent, you need to evaluate the hypothesis at every point in your sample. So for n equals 1 to N, you need to evaluate those, or you evaluate their gradient. And that will tell you what is the error, or what is the direction you would go to, which is normal because this is the error we are minimizing. You'd better compute it. So if you take the case of logistic regression, we had a very particular form for that. And now you can see that it's an analytic form. In this case, friendly and smooth. And indeed, you can get the gradient with respect to that vector and go down the error surface, along the direction suggested by gradient descent. Now, the steps were iterative, and so we take one step at a time. And one step is a full epoch, in the sense-- we call something an epoch when you have considered all the examples at once, which is the only choice we have so far. And we had this formula that we have seen. The difference now is that, instead of having the movement in the w space based on all of the examples, we are going to try to do it based on one example at a time. That's what will make it stochastic gradient descent. So because now we are going to have another method, we're going to label the standard gradient descent as being batch gradient descent. It takes a batch of all the examples and does a move at once, as opposed to the other mode. So the stochastic aspect is as follows. You pick one example at a time. Think of it as picking one at random. You have N examples, each of them equally probable to be picked. You pick one of them at random. Now you apply gradient descent, not to the in-sample error for all the examples, but to the in-sample error on that point. That looks like a very meager thing to do, because the other examples are not involved at all.
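To put the contrast in symbols, using the in-sample error from the earlier lectures, the batch step and the stochastic step are:

\[ \text{batch:}\quad w \;\leftarrow\; w - \eta\, \nabla E_{\text{in}}(w), \qquad E_{\text{in}}(w) = \frac{1}{N}\sum_{n=1}^{N} e\big(h(x_n),\, y_n\big) \]

\[ \text{stochastic:}\quad \text{pick } n \text{ uniformly at random from } \{1,\dots,N\}, \qquad w \;\leftarrow\; w - \eta\, \nabla_{w}\, e\big(h(x_n),\, y_n\big) \]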
But I think you have seen something like that before. We take one example at a time and worry about it, and don't worry about what the other guys are doing, even if we are interfering with them. Remember the perceptron learning algorithm? That's exactly what it did, and it worked. And in this case, it will also work. Now to argue that it will work, think of the average direction that you're going to descend along. What does that mean? You take the gradient of the error measure that you are going to minimize, which in this case is just for one example, and you take the expected value under the experiment that you pick the example from the entire training set at random. In that case, if you want to get the expected value with respect to the red n, which is now a random variable, this is what you get. And if you evaluate it, it's pretty easy. You simply take this value. Every example has a probability of 1 over N. And the expected value would be 1 over N times the summation of that. So this would be the average direction. So you can think that at every step, I'm going along this direction plus noise. So this is the expected value, but because it's one example or another, there is some stochastic aspect. And if you look at the quantity on the right-hand side, this happens to be identically minus the gradient of the total in-sample error. So it's as if, at least in expected value, we are actually going along the direction we want, except that we now involve one example in the computation, which is a big advantage, and we have a stochastic aspect to the game. So this is the idea, and then you keep repeating. And as you repeat, you always get the expected value in that direction, and you get different noises depending on which example. So the hope now is that by the time you have done it a lot of times, the noise will average out and you actually will be going along the ideal direction. So it's a randomized version of gradient descent, and it's called stochastic gradient descent, SGD for short. Now let's look at the benefits of having that stochastic aspect. The main benefit by far-- that is the motivation for having this-- is that it's a cheaper computation. Think of one step that you're going to do using stochastic gradient descent. What do you need? You take one example, you put in the input and you get the output, and then you compute whatever the gradient is for one example. If you're doing batch gradient descent, you will do this for all the examples before you can declare a single move. Nevertheless, the expected value of your move in the cheaper version is the same as the other one. So it looks like there's a little bit of cheating here. On the other hand, it looks attractive. If this actually works on average, this is an attractive proposition. So this is the number 1 advantage. The second advantage is randomization. There is an aspect of optimization that makes randomization advantageous. So you don't want to be extremely deterministic. You want to have an element of chance. Why would I want an element of chance if I know my goal exactly? Well, because optimization is not exact. It's not like you're going to the minimum for sure. There are all kinds of traps that you can go through, like local minima and whatnot. So let's look at cases where randomization would help. This is an error surface, and it is the typical error surface you will encounter. The one you encountered in logistic regression, which was simply like this-- that was a lucky one, the convex one.
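To record the averaging argument above in one line: with n picked uniformly at random,

\[ \mathbb{E}_{n}\Big[\nabla_{w}\, e\big(h(x_n),\, y_n\big)\Big] \;=\; \frac{1}{N}\sum_{n=1}^{N} \nabla_{w}\, e\big(h(x_n),\, y_n\big) \;=\; \nabla_{w}\, E_{\text{in}}(w), \]

so each stochastic step is, in expectation, the batch gradient descent step.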
In general, and in neural networks for sure, you are going to get lots of hills and valleys in your error surface. So depending on where you start, you may end up in one local minimum or another. You may not get the best one; you may get one or the other. Now, this is inevitable and there is really no foolproof cure for it, as we discussed in the Q&A session. On the other hand, it would be quite a shame if you get stuck in this fellow. You see this small fellow? Because it's really just a shallow local minimum. But according to gradient descent, you go here, the gradient is zero, everybody is happy, and you stop there. So you would love to have an added element that will make you escape at least shallow valleys like that. And the idea now is that, because you are not going in a direction that is deterministic-- in this case, there is a random element-- there is some fluctuation here. So there is a chance, as you go here, that you will escape from the local minimum. Now, it is a practical observation that in reality, stochastic gradient descent does help with this. It definitely doesn't cure it-- far from curing it. On the other hand, it does take care of some aspect of escaping silly local minima. So this is an advantage that basically is a side benefit. We did it for the cheap computation, and we're getting this for free. The other one we also talked about a little bit in the Q&A session, which was the flat regions. So you could have this being very, very flat and then finally going down. So if your termination criterion tells you that here you are OK-- it looks flat, and nothing is happening-- then you will stop. Every now and then when you do the random thing, the fluctuation takes you up and down and the algorithm is still alive. Still, termination is a tricky criterion, because for termination you need to consider all the examples in order to know exactly where you stand. But for some of the flat regions, just the stochastic aspect also helps a little bit with it. So there are basically annoying artifacts of the optimization of a surface that gradient descent will help a little bit with, if you use the stochastic version. So randomization helps. Now, the third advantage you have is that it's very simple. It is the simplest possible optimization you can think of. You take one example, you do something, and you're ready to go. And we will see an example in a moment that applies it. And because it's simple, there are lots of rules of thumb for it. So people have used it a lot, and people use it in different applications. So you can find rules of thumb that are actually very useful. So I'll give you one rule of thumb that will be helpful in practice. Remember the learning rate? The learning rate was telling us how far we go. And we talked about, you know, that if it's too big then you lose the linear approximation. If it's too small, you are moving too slowly. So sometimes you ask, what should I use for eta, the learning rate? Obviously, the exact answer depends on the situation, and it is even dependent on scaling the error up and down. Mathematically, you can't really pin it down. From a practical point of view, if you go for a very wide range of applications, you take a normal application, a normal error function, mean squared or something, and then you take eta equals 0.1. That actually works. So you can always start with this, and then adjust from there. That's for stochastic gradient descent. So this is a theorem, eta equals 0.1!
And the proof is that. These are advantages, so we are now motivated to look into stochastic gradient descent. And let's see it in action. I'll take an example far from the linear models and neural networks. I'll take an example that we looked at before in an informal way, and it will be very easy to formalize and implement this way. Remember movie ratings? What was that? Oh, that was the example where you want a user to look at a movie and do a rating. And you want to look at previous ratings and predict. All of that. Now it looked like this, at least the proposed solution, that we will describe the user by a number of factors, which are basically their taste. They like comedy, they like action, they hate this, et cetera. So there are some values here describing their taste, a profile of the user if you will. And then a movie-- you describe the content with the same factors. Does it have comedy? Does it have-- et cetera. And the idea now is that we are going to reverse-engineer the ratings, the existing ratings in the training set, into factors that explain why this rating is. And hopefully by the time we do that, we will be able to predict future guys. So I do this for the movies that this user saw, and then I will take the factors of the user, the factors of a movie that they haven't seen, and do the same combination that I did here, and hopefully get a prediction for the rating. So all I want to do here is show you this method using stochastic gradient descent, which was actually the method that was used in this solution in the million-dollar prize. So although it is very, very simple, it is actually used. And if you are working for something with the stakes that high, you probably will try your best to get something right. So the fact that actually stochastic gradient descent survived until that late stage tells you that it's not a trivial algorithm. So in order to put some formality on this, we need to give labels for the users and movies. So it would be user i, movie j, and the rating we will call r_ij. Very simple. Now there are factors for the users and factors for the movie. So let's call them something. The factors for the user will be u_1, u_2, u_3, u_K, so it's a vector of numbers that describe the taste of that user. And the corresponding factors for a movie would be v_1, v_2, v_3, up to v_K, which describe the content of that movie. When we said we're going to match the taste of the user to the content of the movie, what we were going to do, we were going to simply take a coordinate, small k, from k equals 1 to K, and multiply these two. So we're taking an inner product between these two guys. And then sum up. And that will tell us the level of matching between the two. At least that is the quantity we are trying to make replicate the rating. So we would like the difference between the rating and this quantity to be small. That's the goal. Now in order to be accurate in the notation, the factors u_1 up to u_k and v_1 up to v_k depend on which user and which movie. Different users have different factors, et cetera. So I'm going to add the label of the user, and the label of the movie. And now, if you look at the picture, i and j will appear. So it's a bit more elaborate notation, but it's not a big deal. And we also introduce it here in the sum. So this will be exactly the case. And for all of the users and all the movies, you have a shuffle of different users rating different movies. So the factors are reused for different ratings that appear in your training set. 
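A minimal sketch of that per-rating stochastic gradient descent in code may make it concrete. The squared per-rating error, the variable names, and the parameter values below are illustrative assumptions; the lecture only specifies the inner-product model and the one-rating-at-a-time updates.

```python
import numpy as np

def sgd_factor_model(ratings, num_users, num_movies, K=10, eta=0.1, epochs=20, seed=0):
    """Per-rating SGD for r_ij ~ sum_k u_ik * v_jk (a sketch, not the prize-winning system).

    ratings: list of (i, j, r_ij) triples observed in the training set.
    """
    rng = np.random.default_rng(seed)
    # Small random initial factors for every user and every movie.
    U = 0.1 * rng.standard_normal((num_users, K))    # user factors u_ik (taste)
    V = 0.1 * rng.standard_normal((num_movies, K))   # movie factors v_jk (content)

    for _ in range(epochs):
        for i, j, r in rng.permutation(ratings):     # one rating at a time, in random order
            i, j = int(i), int(j)
            err = r - U[i] @ V[j]                    # r_ij minus the current inner product
            # Gradient of (r_ij - sum_k u_ik v_jk)^2: nudge both factor vectors
            # toward reproducing this particular rating.
            du = eta * 2.0 * err * V[j]
            dv = eta * 2.0 * err * U[i]
            U[i] += du
            V[j] += dv
    return U, V
```

An unseen rating is then predicted with the same inner product, U[i] @ V[j], applied to a movie the user has not rated. This is only the bare-bones version; the systems that competed seriously added more terms and regularization on top of it.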
And now your idea is, how do I make these guys close to the ratings in the training set, hopefully that they will generalize. And the way you do it is, you define an error on that particular rating, which is the difference between the actual rating and what the current factors suggest. The factors now are your parameters, and you're trying to find the value for the parameters that minimizes this. Because you're taking one example at a time, if you do descend on this one, it will be stochastic gradient descent. If you wanted to do batch gradient descent, you would have to take all the ratings, add up these terms for all the ratings you have, and then descend on those. But the stochastic gradient descent is the one which is used. Could there be anything simpler? You're going to get the partial this by partial every parameter that appears here. And remember in the first one, we said that all we are doing is-- we take these factors and try to nudge them a little bit towards creating the rating. And now we have a principled way of the nudging. The nudging will be proportional to the partial by partial each factor. So I have a bunch of factors. Which factors do I modify in order to get there? Now we have the formula, and the formula will be, as a vector, I'm going to move in the space that now has 2K parameters in this case. And I'm going to move in that very high-dimensional space in a direction that makes me, with a certain size of step, achieve the biggest drop in the error in estimating the rating. So you can implement this. And indeed, if you implement it, you will get a pretty good score. Not a winning score, but a pretty good score for the Netflix competition. And in this case, people started adding terms, and obviously regularizing, which will be an important issue that we'll come up with. But basically, the simplest stochastic gradient descent with very plain squared error on something as simple as that will get you somewhere. So now we know that stochastic gradient descent is good. And stochastic gradient descent is the one we're going to apply to neural networks model, so let's talk about neural networks model. I am going to start with the biological inspiration of neural networks, because it's an important factor. That's where they got their name, and that's how they got the initial excitement that got them to have a critical mass of work. So biological inspiration is a method really we use in engineering applications a number of times. And there is a little bit of a leap of faith there which is: we are interested in replicating the biological function. You know, humans learn. We want machines to learn. So in order to replicate the function, our first order is to replicate the structure. That's what we do. We try to make it look like the biological system, hoping that it will perform the same. It is a legitimate approach because something is working-- there is an existence proof, and it has this structure. Maybe the structure has something to it. So in the case of neural networks, this is the biological system. We have neurons connected by synapses, there are a large number of them. Each of them does a simple job. The job, the action of a particular neuron, depends on the stimuli coming from different synapses. Synapses have weights. Very much similar, if you look at a single neuron, to what we thought of the perceptron. Except, obviously, they are different quantities and not as exact and whatnot, but this is the principle. 
So the idea, now, maybe if we put a bunch of perceptrons together in a big network, we will be able to achieve the intelligence or the learning that a biological system does. And we get to replicate it, and get something like that in engineering, a network of this sort. And indeed, this was the initial thing that we did. Now I'm going to make a single comment about the use of biological inspiration in this way. So I'm going to give you another example, where we had biological inspiration. And we'll get a lesson from it. So the other example is the following. We want to fly. We look around. Birds fly. Let's try to get inspired by birds. And after a long chain of events, we ended up with this. Now, there is no question that the structure, which is what we are going to use, made it. There are wings, there is the tail, et cetera. But once you got the basic structure going, if you are in an engineering discipline-- if you're in biology, your goal is to understand why the structure does the function, and know it. So you want to know how biology does it, regardless. In engineering, you want to do the job. You don't care how you do it. You're just using biology as an inspiration. Completely legitimate approaches to the problem from different perspectives. But once you did the initial thing, you are no longer going for the bird and seeing what organs the bird has. No, no, no. What you went here, and all of a sudden it's all partial differential equations and conformal mappings. And when you get the solution, you get a plane that flies but doesn't flap its wings. So, imitating biology has a limit. You have to get an inspiration for what is relevant, and then on your own derive what you need. So going back to our model here. We will get this. Now if I derive a way to learn, et cetera, I don't need from an engineering point of view to go back and see if it's biologically plausible. If I'm a biologist, I had better because my job is to explain how the biological system is working. So if I tell you that it's doing something that is not biologically plausible, I already violated the premise. Here, as long as I get the job done, I'm OK. So it is fine to take the inspiration, but let's not get carried away. We are actually trying to build something that does a job from an engineering point of view, and whatever works, we will take it. And that is where the neural network is going. So knowing that the building block is the perceptron, and that we are putting perceptrons together in a neural network, let us explore what we can do with combinations of perceptrons rather than a single one. And I'm going to do this pictorially. I will save the math when we define the neural network itself. So we'll just look at pictures of what perceptrons do and how to combine them, and we will get the idea that actually combining this very simple unit does achieve something. So let's look at the famous problem where perceptrons failed. Remember the four points? With the diagonal +1 and -1. If you want something that is plus here and plus here, and minus here and minus here, you're out of luck as far as using a perceptron is concerned. Now we are exploring, can we do this with more than one perceptron arranged in the right way? That's the goal. So we look at this and say, I can get the first-- this thing-- with a perceptron I'm going to call h_1. That's easy. I'm going to get the second one as this. And maybe now I can take the outputs of these perceptrons, and combine them in a way that achieves this particular dependence. 
And you look at it and say, that's actually very plausible. And your building blocks for doing that are your old-fashioned OR's and AND's. The logical OR and AND. So you think, let's say I have two Boolean variables, 0 or 1. Or in this case, +1 or -1. Can I implement an AND, which returns +1 if, and only if, both are +1? Or can I implement an OR, which returns +1 if at least one of them is +1? That would be the AND and OR. Can I implement these using perceptrons? Why? Because I am in the game of trying to use perceptrons to build stuff, and I'm seeing where this can take me. Well, the OR is very simple. I can do this because I realize I already have, because of the constant term that has a weight 1.5, I'm already ahead of the 0. So in order for this to actually go negative, both of these guys have to be -1, right? And therefore, this actually does implement the OR because if either of them is +1, I will get the signal +1. For this one, I'm resisting a negative bias already. So I'd better have both of them to be +1, if I'm going to exceed 0 and report +1. So this actually implements the AND. So indeed I can implement the OR and AND, using a simple perceptron. Now, you create layers of perceptrons based on what you had. So in our case, we had h_1 and h_2 that implemented the surfaces we wanted in the Euclidean space, and we just want to combine them. The combination now, if you look at it, is that you want the AND of h_1 and h_2 bar, the negative of this, and h_1 bar and h_2. Basically, you are implementing an XOR. An XOR wants one of them to be +1, and the other one to be -1. So this is what you want to implement, but that is easy because if this is a variable, if I have that ready-- I don't know whether I have that ready. I know that I have h_1, and I know that I have h_2. I don't know whether I have this funny quantity with the bar, but likely I do. Then all I need to do is combine them this way with the OR function, and then I will get the function I want. So let's expand the first layer, and make it really layers. So now, you do have h_1 and h_2. We already established that these are perceptrons. So what you do, when you have a weight of -1, it's as if you are negating. And a weight of +1, you are leaving it alone. So you have -1 and +1. And then you get the first layer to do the AND. But not the AND of the thing itself, but the AND sometimes of the thing, or sometimes of its negation, in order to implement this guy that I want. So you end up with these. And these guys will be implementing the functions you want here. And now you pass them on to the OR, and you get the function you want. So let's plot the full multilayer perceptron that implemented the function we want. It looks like this. This is your original input space. This is x_1 a real number, x_2 a real number in the Euclidean space, and this is the x_0, the constant 1. This is the perceptron h_1 and h_2 that you implemented in order to get the first picture. So these are the components, and I can implement them using a perceptron. After I implement them using a perceptron, I do the conjunction of one and the negation of the other, in order to get here. And then I do the OR, and get here. So this multilayer perceptron implements the function that a single perceptron failed in. And we have layers. So each layer would be this fellow, the inputs going into it and the neurons themselves, the perceptrons. And this is the second layer, and this is the third layer. So in this case we have three layers. 
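A small sketch of the construction just described, with the 1.5 bias weights from the slide (the function names and the sign convention are just for illustration):

```python
import numpy as np

def perceptron(weights, inputs):
    """A perceptron: sign of the weighted sum, with a constant input x_0 = 1."""
    return np.sign(weights @ np.concatenate(([1.0], inputs)))

def OR(a, b):   # +1 if at least one input is +1
    return perceptron(np.array([1.5, 1.0, 1.0]), np.array([a, b]))

def AND(a, b):  # +1 only if both inputs are +1
    return perceptron(np.array([-1.5, 1.0, 1.0]), np.array([a, b]))

def XOR(h1, h2):
    # OR of (h1 AND not-h2, not-h1 AND h2); negation is just a weight of -1 on that input.
    return OR(AND(h1, -h2), AND(-h1, h2))

# Check the four points: XOR is +1 exactly when h1 and h2 disagree.
for h1 in (+1, -1):
    for h2 in (+1, -1):
        print(h1, h2, XOR(h1, h2))
```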
We have strict rules in the construction, which is feedforward. So it's feedforward, that is, you don't get the output and put it to a previous layer, and you also don't jump layers. It is very hierarchical. You go from this layer to the next layer, and then from the next layer to the next layer. It didn't restrict us very much because you realize that, if you have done logic before, you realize that if you can do the AND's and the OR's and the negations, you can do anything. So I can have a very sophisticated surface and just by having enough of those guys and combining them, I can get a very sophisticated surface under the restriction of this hierarchical thing. So that's pretty good. We now realize that we have a powerful model. And to illustrate the powerful model in a case, let's look at this case. Let's be ambitious, not only just the XOR, I want to implement the circle, which we remember we had to go to a nonlinear transformation, just using perceptrons. So you say, definitely that doesn't look anything like a line. And I'm using lines, there's no transformation here. So what am I going to do? Let me try 8 perceptrons. Just sort of cornering this. If I do this, each of them will be +1 somewhere, -1 somewhere. So I have a pattern of +1's and -1's. And all I need to do is the logical function that will give me where I am inside and where I'm outside. So I end up with a polygon, an octagon in this case, that approximates the circle, using 8. I can go for 16. And then I'm getting closer and closer to the circle. And I can get as close as I want, by having as many perceptrons as I want. And now I have a bigger task of combining the logical results, in order to get the final thing I have. And indeed, you can prove that multilayer perceptrons with enough neurons can approximate any function, which is very good. And for us, being powerful is good, but it raises two red flags. Once I give you, this is a great model. Everybody will be excited, except people in machine learning. Wait a minute, I have been there before. So what are the two red flags? One of them is generalization. I have a powerful model. I have so many perceptrons, so they have so many weights, and the degrees of freedom, VC dimension. I'm in trouble. Well, you are in trouble, but at least you know the trouble you are in now. That is, you can completely evaluate this. I have this model. It has that VC dimension. I need that many examples. Done deal. So this is not going to scare us. It is going to make us careful about matching how sophisticated we can go, to the resources of data we have. So this is not really a deal breaker. The real deal breaker for using multilayer perceptron is the optimization. Even for a single perceptron, we were lucky enough to have this perceptron learning algorithm that applies only in cases of separable. And we say that in the case of non-separable, it's a very hairy optimization problem. It's a combinatorial optimization, and it is very difficult to solve. Can you imagine, now, the problem when I take layers upon layers upon layers, and combine them? And now I'm trying to find what is the combination of weights that matches a function. You don't know what the function is. Here, you looked at it. But I'm just giving you examples. I'm asking you to match. How are you going to adjust the weights, in order to match that? That's an incredibly difficult optimization problem. And that's what neural networks do. That's the only thing they do. They have a way of getting that solution. 
And the way they are going to do it is that, instead of having perceptrons which are hard-threshold, they are going to soften the threshold. Not that they like soft thresholds, but soft thresholds have the advantage of being smooth, twice differentiable. Rings a bell? Oh, maybe we can apply the all-general gradient descent in order to find the solution. And once you find the solution, you can say I know the weights. Soft threshold is almost the same as the hard threshold. Let me hard-threshold the answer, and give you that answer. So that would be the approach. So let's look at neural networks. The neural network will look like this. It has the inputs-- same as inputs before-- and it has layers. And each layer has a nonlinearity. I'm referring to the nonlinearity generically as theta. Remember, theta was used in logistic regression as very specifically the logistic function. I'm using it here generically for any nonlinearity you want. It turns out the nonlinearity we are going to use is very much like the logistic function except it goes from -1 to +1, in order to replicate the hard threshold which goes from -1 to +1. In the case of logistic regression, we weren't replicating that. We were simulating a probability that goes from 0 to 1. So it's very similar to this. And in principle, when you use a neural network, each of these guys could be different. You can have your different nonlinearities and you will see, when we talk about the algorithm, that there is a very minor modification you do in order to accommodate these nonlinearities. So I could have a label for each of these depending on where it happens. And the most famous, different nonlinearity that you get to use is actually to make all of them this soft threshold. And then when you go to the output, make that linear. So this part would be as if it was linear regression. This would be with a view to implementing a real-valued function. So the intermediate things are doing this thing, and then finally you combine them in order to get a real-valued function. But for the purpose of this lecture and the derivation, I'm going to consider all these thetas to be the same. And all of them will be this function that I'm going to describe mathematically in a moment. So this is the neural network. It has the same rules. It's feedforward. There is no going back, there is no jumping forward. And the first column is the input x. So you are going to apply your input x from an actual example to this, follow the rules of derivation from one layer to another until you arrive at the end, and then you are going to declare at the end that this is the value of my hypothesis, the neural-network hypothesis, on that x. The intermediate values we are going to call hidden layers, because the user doesn't see them. You put the input, there's a black box, and then comes output. If you open the box, you'll find that there are layers, and something interesting is happening in the layers that I'm going to comment about later on. But these are the ones. And for a notation, we're going to consider that we have L layers. So in this case, it will be three. This is the first layer with its input. This is the second layer with its input. This is the third layer with its input. This is not really hidden, it's an output layer. So this is the final layer, L. And this is that. The notation here will persist with us for that. Now I'm going to take this, and I'm going to put the mathematical equations that go with it, in order to be able to implement it. 
If you want to code this, the next slide will be the one for you to implement. First thing, I'm going to define the nonlinearity that I described. It's a soft threshold and we are going to use the tanh, the hyperbolic tan-- hyperbolic tangent. And the hyperbolic tangent-- Well, the formula looks more or less like the one we had before for the logistic one. It's again based on e^s. And this one happens to go from -1 to +1. At 0, it's exactly 0. It has a slope 1. It has very interesting properties. And you can see now why we are using it. If you take it this way, it looks like a hard threshold. And in the beginning it looks linear. So it has the combination of both worlds. So if your signal, which is what you have here-- this is the signal and this is the output. If your signal, which is the weighted sum of your inputs, is very small, it's as if you are linear. If your signal is extremely large, it's as if you are hard-threshold, and you get the benefit of one function that is analytic and very well-behaved for the optimization. So this is the one we're going to use. Now what I'm going to do, I'm going to introduce to you the notation of the neural network, because it's all notation. Obviously the notation will be more elaborate than a perceptron, because I have different layers. So I have an index for that. I have different neurons per layer. So I have an index for that. And inputs go to the output. And then the output becomes the input to the next layer. So I just need to get my house in order, to be able to implement this. So although this is mostly a notational viewgraph, it's an important viewgraph to follow because if you decide to implement neural networks, you just print this viewgraph and code it, and you have your neural network. The parameters of a neural network are called w, weights. The weights now happen to belong to any layer to any neuron. And there are three indices that change. One of them, the different layers, the different inputs that feed, and the different outputs I get. I have different inputs and different outputs for every layer. So the weight is connecting one input to one output in a certain layer. So let's have a notation. I'm going to introduce a notation, and then apply it to the w. So I'll denote the layer by l. And l, as you see, appears as a superscript for w. That will be our standard notation. The layer is always a superscript between parentheses, for the quantity we have. And then I have the inputs. The inputs we are going to call i, as an index. And obviously, since the weight connects an input to an output, the i should appear as an index. And the output will be called j. So now my parameters for the network are w, superscript l, sub ij. Although it's more elaborate than we had before, we understand where it came from. Now let's talk about the ranges of values for these three indices. For l as we discussed, l will go from 1 to L. So from the first layer to L, which would be the output layer, the final layer. The outputs go from 1 to d, as if it was-- d as in dimension. So you have-- I'm going to the neuron 1, neuron 2, and neuron d. And because I am in layer l, by definition, then the dimension of the layer that I'm talking about will have that superscript. So d superscript l, the number will differ from one layer to another. And depending on which layer you have, you'll have different number of output units, and therefore the j will depend on that. Now for the inputs, they come from the previous layer. 
You take the outputs of the previous layer to be the inputs in your layer. Therefore, the index for i will go for the size of the previous layer, l minus 1. Now, I left this out because this will not be 1, this will be 0. Anybody knows why? Yeah, yeah, it's that constant x_0 that we always have. Every neuron will have that as an input, and therefore we will have a generic one, which is x_0 to take care of that. So for every value in this array, you will have w_ij l, and these are the parameters you want to determine. Now, let's see the function that is being implemented. What you do is, you get the x's in layer l in terms of the x's in layer l minus 1. Right? And our notation will give this a generic index j, so this is the j-th unit in this layer, and this was the i-th unit in the previous layer. What do you do in order to get that? We do what perceptrons do, or neurons in this case. You combine them with the weights. The weights are connecting the i to the j, and they happen to be the weights of this layer. So when we talk about the weights, the weights will correspond to where the output is. You sum these up. You sum them up from i equals 0, which is the constant variable, up to the maximum, which would be the maximum for that layer, which happens to be d superscript l minus 1. So this is the signal. Now you pass the signal through a threshold, in this case a soft threshold. And you're ready to go. That will be the function you're implementing. And indeed, this would be the value of the output x. And it happens to be theta of-- we are going to call this quantity inside, we are going to call it the signal again. And now the signal corresponds to the output. So the signal is of layer l and the j-th signal in that layer. You pass it through the nonlinearity, and what you get is the output of that. So that wasn't too bad. Now, when you use the network, this a recursive definition. So you do this for the first layer, second, third, et cetera. Every time you use it, you get the new outputs. So the first, you get the outputs of the first layer. These are the inputs to the second. You get the outputs of the second, these are the inputs to the third. And you keep repeating, until you get to the final. Now, how do you start this? You start this by applying your input, the actual input you have, to the input variables of the network. The input variables happen to be in layer 0, if you want. And they happen to be called x_1 up to that, d superscript 0 by definition. Therefore, d superscript 0 is the same as the dimensionality of your input. So this one actually has-- what is the x? x_1 up to x_d. So this guy matches this. Therefore, that is how you construct the network. The number of inputs is the same as the number of inputs you have. Once you leave that, it could be anything. It could be expanding, shrinking, whatever it is, anything it wants. And when it arrives, it should arrive at the value of your output. You have a scalar output, and therefore, after a long iteration, you will end up with one output, which happens to be in layer L, and since I have one output, the j is only 1. So this is my output of the network, and I'm going to declare that my output of the network is the value of my hypothesis. That is the entire operation of a neural network when you feed it. So when you tell me what the weights are, I am going to be able to compute what the hypothesis does. Now our job is to find the weights through learning, so that we match a bunch of input-output examples. 
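Collected in one place, the quantities just defined are:

\[ \theta(s) = \tanh(s) = \frac{e^{s} - e^{-s}}{e^{s} + e^{-s}}, \qquad x^{(l)}_{j} = \theta\big(s^{(l)}_{j}\big), \qquad s^{(l)}_{j} = \sum_{i=0}^{d^{(l-1)}} w^{(l)}_{ij}\, x^{(l-1)}_{i}, \]

for l = 1, ..., L and j = 1, ..., d^{(l)}, starting from the input x^{(0)}_{1}, ..., x^{(0)}_{d^{(0)}} (the example x itself) and ending with h(x) = x^{(L)}_{1}.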
When I put those inputs and look at the target outputs, I find that the network is replicating them well. That is the backpropagation algorithm. So let's do what? We are going to apply stochastic gradient descent. So you take one example at a time, apply it to the network, and then adjust all the weights of the network in the direction of the negative of the gradient, according to that single example. That's what makes it stochastic. So let's do it. Now the parameters are all the weights. This array, which is a funny array, is not quite a complete matrix because you have different number of neurons in different layers. So this is just a funny array. But it's indexed by i j l. It's a legitimate array. And this determines h. Therefore, what I'm doing here is getting the error on a single example: x_n , y_n. And I'm going to-- by definition, I have some error measure. Let's call it e of h-- my error measure between the value of the hypothesis, which is the neural network, and the target label. And this happens to be a function of the weights in the network, why? y_n is a constant, x_n is a constant. This is part of the training example. h is determined by the w. That's why this is w, and I'm putting it in purple, because this is the active quantity now when we are learning. So to implement SGD, all you need to do is implement the gradient of this quantity. And what is gradient of this quantity? Well, the gradient of this quantity is a huge vector. Each component is partial the error, by partial one of the parameters. So we put it down. All you need to do is compute this for every i, j, and l. That's all you need to do! And then you take this entire vector of stuff, and you move in the space along the negative of that gradient. That is the game. There is nothing mysterious about this. If you never heard of backpropagation, you will be able to do this, as we'll see in a moment. The idea is just to do it efficiently. And it makes a big difference when you find an efficient algorithm to do something. For example, those of you who have learned linear systems know FFT, the Fast Fourier Transform. Fast Fourier Transform is, you implement the discrete Fourier transform. What's the big deal? The big deal is because it's faster. You get N logarithm N, instead of the alternative. And that simple factor made the field enormously active, just by that algorithm. And very similar here. Backpropagation, if you look at it, I can brute-force-implement this for every i, j, and l. But now I have one thing that will get me basically all of these guys at once, so to speak. And therefore it's efficient and people get to use it, and that's why neural networks became quite popular. So let's try to compute this. Let me take part of the network. So this is in the layer l minus 1, and this is in the layer l. I'm looking at the output of one neuron in this layer, feeding through some weight, into this guy. So it is contributing to the signal going into the next guy, and the signal goes into the nonlinearity to produce the output. Now this quantity is not mysterious. If you look at it, we can evaluate those one by one. That is for every single weight in the network, we can ask ourselves: what is the error? Well, the error is sitting there. At the output, I have-- Here's my pointer. I have the output. I went further than I should! But the output is sitting somewhere there. Therefore, there is an error. And that error will change, if you change w. And that will tell you what is partial e by partial w. 
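In symbols, the quantity needed for every weight, and the stochastic gradient descent step that uses it, are:

\[ e(w) = e\big(h(x_n),\, y_n\big), \qquad w^{(l)}_{ij} \;\leftarrow\; w^{(l)}_{ij} - \eta\, \frac{\partial\, e(w)}{\partial\, w^{(l)}_{ij}} \quad \text{for every } i, j, l. \]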
So we can do this analytically. There is nothing mysterious. I can get the output as a function of the previous layer, of the previous layer, of the previous layer, until I arrive here. So I have this function that has tons of weights in it, and I'm focusing on one of them. And I can say, what is partial e by partial this fellow? Apply chain rule, get a number. Not a big deal. It's not your favorite activity, but you can do that. Or even you can do it numerically. I can take this fellow, perturb it just a little bit, and see what happens for the error of the output. And therefore, I can get numerical estimate for partial by partial. The problem with those approaches is that I have to do it for each one of them. What I'm going to do now, I'm going to try to find something that will make me get the entire array, which is the full gradient, in almost one shot. So here is the trick. The trick is the following. I'm going to express partial e by partial w_ij, the change in e which is upstairs here, with respect to this particular parameter. I'm going to get it in terms of partial the same quantity by partial an intermediate quantity, this signal, times partial the intermediate quantity by partial what I want. This is just chain rule. But chain rule with partial derivatives, you need to be a little bit careful, because there may be different ways your variable is affecting the output, and you need to sum up all the effects. But here, if you're looking for how does w affect the error? Well, w only affects this sum. w_ij affects only the sum s_j. So if I get partial by partial s_j, this is the only link which w_ij affects the output, and therefore I'm allowed to do this, and there is nothing to sum up with respect to. So I have this chain rule. That's nice. I can probably look at this and say, this is a very simple quantity to get. How does the signal change with the weight? We probably can get an easy one there. But this one is almost as bad as the original one. How does the error change with this signal? It doesn't look like a great progress. But the great progress is that this quantity will be able to be computed recursively. That's the key. So what do we have in this equation? Well, we have the first one. Because if I take this guy, what is partial s by partial w? s is simply the sum of w's times x's. So partial s by partial w is the coefficient, which happens to be the x, and this is the coefficient there. And that is readily available. I already computed all the x's, so that I have. The other guy, this is a troublesome one, so we're just going to call it a name, and see if we can get something going for it. And the name we're going to call it is delta. So now delta goes with a signal. There will be a delta sitting here, if we can compute it. And the interesting is that the derivative of the error with respect to this weight, which will determine how much you change that weight, because when you get that gradient, you move along the direction of the gradient. It means, in each component, you go in proportion to the value of the partial derivative. So since this is the partial derivative, the change in the w will be the product of these two guys-- proportional to that. One of them is x here, and one of them is delta here. So we'll be changing the weight according to two quantities that the weight is sandwiched between. And that's a pretty attractive one. If I get all of those, then I look at the x's and the delta's, and the weight in between will change accordingly. 
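Written out, the decomposition is:

\[ \frac{\partial\, e(w)}{\partial\, w^{(l)}_{ij}} \;=\; \frac{\partial\, e(w)}{\partial\, s^{(l)}_{j}} \times \frac{\partial\, s^{(l)}_{j}}{\partial\, w^{(l)}_{ij}} \;=\; \delta^{(l)}_{j}\; x^{(l-1)}_{i}, \qquad \text{where } \delta^{(l)}_{j} \equiv \frac{\partial\, e(w)}{\partial\, s^{(l)}_{j}}. \]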
Now, let's get delta for the final layer. Why do I get delta for the final layer? When we computed the thing, we got x's for the first layer, the input. And then we propagated forward until we got to the output. The reason why we're going to get it is because the math will tell us that if you know delta later, you're going to be able to detect delta earlier. So this will be propagating backwards, and hence the name backpropagation. So we're going to start with the delta at the output. And it's not a surprise, because I'm trying to get the partial error by partial something. The closer I am to the action, to the output, the easier it is to compute it. And indeed, for the output, it will be very easy to compute. This is the definition of delta for any value of j and l. And when you look at the final layer, the final layer is not mysterious. It's l equals L, and j equals 1. I have a scalar function, so that is the output layer. Therefore, the quantity I'm interested in is exactly-- just substituting with this quantity. I want delta superscript L, subscript 1. That's what I want to compute. Now, can I compute this? Let's look at it. This is e of w, the thing I'm differentiating. What is e of w? e of w is the error measure, whatever you have, between the value of your hypothesis, that is the value of the neural network in its current state, with the weights frozen. You apply x_n. You go forward until you get the output, that is h of x_n. You compare that to the target output, which is the label of the example, y_n. And that error will be your e of w. Why is it e of w? Because h depends on w. So that is not mysterious because h of x_n is what? It's the value of the output. Right? And that happens to be the variable in layer L, variable number 1, that is your output. And, for example, let's say that you are using mean squared error, just for the moment. This can apply for any analytic error measure you put here. But if you're using mean squared error, this would be it. That's a friendly quantity, because now I want partial by partial. I have this, and this fellow is related to the thing I'm differentiating with respect to. This is a constant. I can deal with the squared. So I'm getting closer to being able to evaluate this explicitly. So let's look at x, the output. Well, the output is nothing but-- you pass the signal through the nonlinearity, right? The nonlinearity is the tanh. Not mysterious. The signal is what I'm differentiating with respect to. I'm almost done. So now, all I need to do is realize that when I do this, I will have to know the derivative of theta, because there is a chain rule and I'm differentiating with respect to this. And this is an intermediate quantity, so I need to get theta dash. So what is theta dash? So what is the derivative of the tanh? Happens to be 1 minus the tanh squared. This is for this particular one. If you have another nonlinearity, you just compute what that is. This is good. So we have delta for the final layer. If I put the input, get the output, I go through this and I have an explicit value: delta at the output is the following. So now, the next item is to back-propagate delta down to get the other delta's. This is the essence of the algorithm. So now, I'm taking the network, but now I'm taking the network from-- I'm taking one unit here, and looking at all the units in the next layer, because these guys happen to be affected by x, and therefore, happen to be affected by s. Remember delta is partial something by partial s. 
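For the record, the output-layer delta just described, assuming the squared error used as the example and the tanh nonlinearity, is:

\[ \delta^{(L)}_{1} \;=\; \frac{\partial\, e(w)}{\partial\, s^{(L)}_{1}} \;=\; \frac{\partial}{\partial\, s^{(L)}_{1}} \Big(x^{(L)}_{1} - y_n\Big)^{2} \;=\; 2\,\Big(x^{(L)}_{1} - y_n\Big)\,\theta'\big(s^{(L)}_{1}\big) \;=\; 2\,\Big(x^{(L)}_{1} - y_n\Big)\Big(1 - \big(x^{(L)}_{1}\big)^{2}\Big). \]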
And I want to get partial this by partial s_i, in terms of partial by partial the s's here. I'm going backwards. So I already computed up to here, and now I want to go here. So now, I need to take into consideration all the ways that this affects the output, so I'm drawing the relevant part of the network. This is the quantity that I want. I want to evaluate partial e by partial s_i So I want to compute partial by partial this fellow. So now I'm going to apply the chain rule again. I will get partial e by partial these fellows, which supposedly in my mind, I already know. That's the first part of the chain. Then I'm going to get partial this guy by partial x. Fine. As long as I'm making progress towards the destination, I'm OK. You can do it any way you want. And finally, I'm doing this. Partial x by partial s. So you go through this. This is partial e by partial the final guy, and these guys happens to be intermediate. However, the way this fellow affects the output, it affects all of those guys. So when I do the chain rule, I need to sum up over all the routes that this happens through. And therefore, I need to sum up over all the points here for this quantity. The way e is affected by this guy is through the way e is affected by this fellow through here, or by this fellow through here, et cetera. And therefore, the rule in this case would be the sum. It looks like a very hairy one, but it's no big deal. Now let's collapse it to something very friendly. It's a sum of something. Let's take it one term at a time. We will take this. What is the partial derivative of x_i by s_i. x_i simply happens to be the nonlinearity applied to this one. So all I need to do is just differentiate that nonlinearity, and apply it to the value at hand. So what do you get? You get theta dash applied to the signal. I can have that. How about the next guy? That's an easy one. What is the derivative of this fellow by x_i? Yeah, this is just the sum. I get the coefficient, the coefficient happens to be this thing, so that is what I get. Do I have all of this? Yes. The next guy is the interesting one. How do I get this? Well, you don't get it. You already have it by recursion. This happens to be the old delta. So now I have the lower delta in terms of the upper delta. And I have the top delta in hand. We are done. We just have to keep doing this, and we'll get all the delta's. And the form for the delta is interesting. So this fellow does not depend on the summation index j, right? And this happens to be the derivative of the tanh, so it's 1 minus that squared. So when you get 1 minus that squared, you get this and you can factor it out. The rest of it depends on j and you are summing this up and you're getting this. Now isn't it lovely to have an equation like this? This looks exactly like the forward pass. We're taking something, combining it with the weights, summing up, and getting this. Instead of applying a nonlinearity, which we did in the forward, we're multiplying by this funny guy. So it looks like a very much similar situation. But when you are done, you are going to get a bunch of delta's at every position where an s is. And from our previous experience, then we're ready to go with the delta and the x, and adjust the weight that is sandwiched between them accordingly. So you see the reverse, now we are going down. It's delta's going down, the arrows are going down. It used to go up. So let's do this. 
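The backward recursion just derived, in one line:

\[ \delta^{(l-1)}_{i} \;=\; \Big(1 - \big(x^{(l-1)}_{i}\big)^{2}\Big) \sum_{j=1}^{d^{(l)}} w^{(l)}_{ij}\, \delta^{(l)}_{j}. \]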
And then instead of having theta here, we are multiplying by something, and what we're multiplying by is this quantity. That's what you do in the backward propagation. So here's the algorithm. First, the picture of the algorithm. That's all you do. You take the input, you compute the x's forward, you get the error, you compute the delta's backward. This is supposed to be delta-- delta has disappeared for some reason. The delta and the x determine the weight in between. So if you put the algorithm this way, you initialize the weights and then you pick n at random-- that's what makes stochastic gradient descent. You do the forward run I described. You do the backward run, and then you update the weights according to the input and the delta that are surrounding the weight. You keep this until it's time to stop, and then you return the final weights, and that is your algorithm. Now there are obviously all the questions: the termination criterion, the local minima, all of that. That's the thing we discussed in the Q&A session. There's something specific here that I want to emphasize, which is the initialization. Because it's very tempting to initialize weights to 0, which worked actually very well with logistic regression. If you initialize weights to 0 here, bad things will happen. So let me describe why. First, the prescription is to initialize them at random. Why is initializing 0 bad? If you follow the math, you realize that if I have all the weights being 0, which is what that means, and you have multiple layers, then either the x's or the delta's will be 0. In every possible weight, one of the two guys that are sandwiching it will be 0. And therefore, the adjustment of the weight according to your criterion would be 0. And therefore, nothing will happen. This is just because of the terrible coincidence that you are perfectly at the top of a hill, unable to break the symmetry. So you're not moving. If I just nudge you a little bit, you will be slipping like there's no tomorrow. But as long as you're there, you're not moving. Pretty much like you think of a donkey that is hungry, so they put two sacks of food on top of it. All it needs to do is eat or eat. Unfortunately, it's perfectly symmetric, and the donkey cannot break the symmetry, and it starves to death. So we just want to break the symmetry. So we introduce randomness: shake the food a little bit, which is here to just start with a random thing. Choose weights that are small and random, and you will be OK. One final remark and we'll call it a day, which is about the hidden layers. So let's look at the network again. This is the network. We have an understanding of this fellow, and we have an understanding of the output. And the hidden layers were just a means for us to get more sophisticated dependency. So, if you think what the hidden layers do, they are actually doing a nonlinear transform, aren't they? I have these raw inputs, and I'm passing them through this thing, so I can look at these guys and consider them features. And because they are higher-order features, I'm able to implement a better one. And this one will be features of features, and so on. Now the only difference, and it's a big difference, between the nonlinear transform here and the nonlinear transform we applied explicitly in the case of linear models, is that these are learned features. Remember when I told you not to look at the data before you choose the transform. The network is looking at the data all it wants. 
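Putting the pieces together, here is a minimal sketch of the whole loop in code. This is a hypothetical illustration only: it assumes tanh units in every layer including the output, squared error, one example per update, and it does one epoch at a time, leaving the termination criterion to the caller.

```python
import numpy as np

def init_weights(layer_sizes, scale=0.1, seed=0):
    """layer_sizes = [d_0, d_1, ..., d_L]; small random weights to break the symmetry."""
    rng = np.random.default_rng(seed)
    # One matrix per layer; row 0 multiplies the constant input x_0 = 1.
    return [scale * rng.standard_normal((layer_sizes[l - 1] + 1, layer_sizes[l]))
            for l in range(1, len(layer_sizes))]

def forward(weights, x):
    """Forward pass: returns [x^(0), x^(1), ..., x^(L)] (each without the constant 1)."""
    xs = [np.asarray(x, dtype=float)]
    for W in weights:
        s = np.concatenate(([1.0], xs[-1])) @ W   # the signal s^(l)
        xs.append(np.tanh(s))                     # x^(l) = theta(s^(l))
    return xs

def backward(weights, xs, y):
    """Backward pass: returns [delta^(1), ..., delta^(L)], assuming squared error and tanh units."""
    deltas = [None] * len(weights)
    deltas[-1] = 2.0 * (xs[-1] - y) * (1.0 - xs[-1] ** 2)      # delta at the output layer
    for l in range(len(weights) - 2, -1, -1):
        # One layer back: delta = (1 - x^2) * (next layer's weights, bias row dropped) @ next delta.
        deltas[l] = (1.0 - xs[l + 1] ** 2) * (weights[l + 1][1:] @ deltas[l + 1])
    return deltas

def sgd_epoch(weights, X, Y, eta=0.1, rng=None):
    """One epoch of SGD: one example at a time, in random order; the caller decides when to stop."""
    rng = rng or np.random.default_rng()
    for n in rng.permutation(len(X)):
        xs = forward(weights, X[n])
        deltas = backward(weights, xs, Y[n])
        for l in range(len(weights)):
            # w_ij <- w_ij - eta * x_i * delta_j: each weight moves according to
            # the x below it and the delta above it.
            weights[l] -= eta * np.outer(np.concatenate(([1.0], xs[l])), deltas[l])
    return weights
```

The hidden layers in this loop are the learned nonlinear transform just described.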
One final remark and we'll call it a day, which is about the hidden layers. So let's look at the network again. This is the network. We have an understanding of this fellow, and we have an understanding of the output. And the hidden layers were just a means for us to get a more sophisticated dependency. So, if you think about what the hidden layers do, they are actually doing a nonlinear transform, aren't they? I have these raw inputs, and I'm passing them through this thing, so I can look at these guys and consider them features. And because they are higher-order features, I'm able to implement a better function. And this one will be features of features, and so on. Now the only difference, and it's a big difference, between the nonlinear transform here and the nonlinear transform we applied explicitly in the case of linear models, is that these are learned features. Remember when I told you not to look at the data before you choose the transform? The network is looking at the data all it wants. It is actually adjusting the weights to get the proper transform that fits the data. And this is not bothering me, because I have already charged the network for the proper VC dimension. The weights here that constitute that guy contribute to the VC dimension. The VC dimension is more or less the number of weights. That's the rule of thumb here. So it is completely fine to look at the data, because it's not looking at the data that is bad, it is looking at the data without accounting for it that is bad. And here it's built in that it's accounted for. So this is nice, because now you can see it's not a generic nonlinear transformation, it's a nonlinear transformation with a view to matching very specifically the dependency that I'm after. So that's the source of efficiency there. Now comes the question, can I interpret what the hidden layers are doing? So I'll tell you a story. Early in my career, I was doing a consulting job for a bank, and they wanted to apply neural networks to credit approval. Very easy. Give me the data, we'll do it. We'll take a fairly simple network. So one of the people at the bank that I was dealing with came and asked me: can you please tell me what the hidden layers are doing? So in my mind I think: is he doubting my competence or something? Does he want reassurance or something like that? I mean, the performance is perfect, and he can try it out of sample and whatnot. But then I realized that the reason he is asking for the interpretation has absolutely nothing to do with performance. It's legal. If you deny credit to someone, you have to tell them why. And you cannot send a letter to someone saying: sorry, we denied credit because lambda is less than 0.5. [LAUGHTER] So that's the reason. But the fact that you are not able to interpret what happens in machine learning is very, very common. Go back to the movie example. We get the factors. We predict the ratings. And let's say you apply the system, and you keep recommending movies to someone. And the person is so impressed. You are recommending movies that are on the spot every time. So they come and ask you: how do you do it? You tell them: because factor number 7 is very important in your case. So they say: OK, great. So what is factor number 7? And then you say-- lots of hand waving. You have no idea what factor number 7 is, but factor number 7 is important in your case. Very common in machine learning because, you remember, when the learning algorithm tried to learn, it tried to produce the right hypothesis. It didn't try to explain to you what the right hypothesis is. Producing it was the goal. Let me stop here, and then take questions after a short break. Let's start the Q&A. MODERATOR: The first question is: could you explain what people mean by using a momentum term in neural networks? PROFESSOR: Momentum is used as an enhancement for batch gradient descent, in order to get some effect of the 2nd order. So the idea is that if you use gradient descent, gradient descent is using strictly 1st order, just the slope. And if the surface is changing quickly, which means that the 2nd order is important, you want to get a glimpse of that without having to go through the trouble of computing the Hessian, the 2nd-order quantities. So if you take what's called a momentum term, which means that you take a little bit of the step that you had previously, a bit less of the step before that, and so on, you end up accounting for some aspect of this.
Because if the surface is curved, this goes one way, and if the surface is flat, it goes the other way. So I didn't-- There are lots of heuristics. Momentum is one of them. Stochastic gradient descent, the way I described it, actually works very nicely. And in all honesty, if I have to go to 2nd order, I will just go for conjugate gradient, because it's so principled and really gets the bottom line. I'm not big on using momentum in my own applications, but other people have found it to be useful. MODERATOR: Some people are asking about the popularity of neural networks, which has had its ups and downs. So what's the state of the art in neural networks research, if there's any? PROFESSOR: Initially, neural networks were going to solve the problems of the universe. So the usual hype. And hype in some sense is not bad for research, because it gets people excited and gets enough people to work to get the real results. And then when it settles down, there's a critical mass of work. So I don't think this was a bad thing in hindsight. But what happened is that because of the simplicity of the network and the simplicity of the algorithm, people used them in many applications. And they became a standard tool. And there are lots of tools you will find in all kinds of software, where you just apply a neural network. And until this very day, there are companies that use them very, very regularly. So they are post-research, so to speak. There's very little done in terms of research. The basic questions have been answered. But in terms of being used in commerce and industry, they are used. They have very serious competitors, like, for example, support vector machines and lots of other models, but they're still in use. Not the top choice nowadays, but every now and then, someone will publish something where they used a neural network and got good results. MODERATOR: OK. How to choose the number of layers? PROFESSOR: This is model selection. So neural networks are really a class of models-- a class of hypothesis sets-- and there are obviously a bunch of things to choose. How many layers? And how many units per layer? So if you look from an approximation point of view, because of the sum of products in logic, you can implement anything using a fairly shallow network, provided that you have enough neurons in that layer. But ours is not an approximation question; we are talking about a learning question. So the real question is, how many weights can I afford? And then the question of organizing them is less severe. So how many weights can I afford? Because they reflect directly on the VC dimension and the number of examples I need. And there are actually methods that, given a particular architecture, try to kill some weights in order to reduce the number of parameters, as a method for regularization. And we'll allude to that when we get to regularization. But basically, this is a model selection question, where model selection tools apply. The most profound of them will be validation, which we will have a lecture dedicated to. MODERATOR: Can you-- why was tanh, the hyperbolic tangent, used? PROFESSOR: Why is it used? MODERATOR: Yes. PROFESSOR: OK. I want a soft threshold, and I want it to go from -1 to +1. And I want it to have the nice analytic property that I can differentiate it. These are basically the three reasons. In the other case, it was exactly the same except that I didn't want something to go from -1 to +1.
I wanted something to go from 0 to 1, because in logistic regression, I wanted a probability. Here, I wasn't really interested in the continuity for its own sake. There I was, because it's a probability. Here I was interested in the continuity just because I wanted the analytic property of differentiation, in order to apply gradient descent. But what I care about is going from -1 to +1, which are the hard-decision values. MODERATOR: Will the final weights depend on the order, the way that the samples are being picked? PROFESSOR: Correct. They will depend on the initial condition. They will depend on the order of presentation. They will depend on that, but that is inherent in the game. We are never assured of getting to the perfect minimum, the global minimum. We will get to a local minimum, and anything will affect us. But the whole idea is that you are going to arrive at a minimum. And if you do what we suggested in the last lecture's Q&A session, you just start from different starting points, and use different randomizations for the presentation. This randomization could be: you pick a point at random. You could pick a random permutation, and then go through the examples according to that permutation, and then change the permutation from epoch to epoch. Or you could simply be lazy and just do it sequentially. And all of these, more or less, get you there-- with different results. So if you try a variety of those, let's say 100 tries, and then pick the best minimum you have, you will get a pretty decent minimum, and you will be fairly robust, in the sense of being independent of the particular choices that you made in any of the 100 cases. MODERATOR: Could you go back to slide 12? There. So if you could review the two red flags for generalization and optimization? PROFESSOR: OK. So the top part of the figure showed that we are dealing with a sophisticated model, because in spite of the fact that the unit of it is linear-- the perceptron-- you can implement a circle, just for illustration. You can implement a pretty difficult surface by combining those fellows. When you have a powerful model, it means you can express a lot of things, and therefore the question of generalization comes in because, if you can express a lot of things, you have a big hypothesis set. And then comes the question of zooming in and generalization-- the stuff we handled in theory. But the comment here is that we are going to have the VC dimension of whatever model we have, and the VC dimension summarizes all generalization considerations. We may decide that this is too sophisticated a model, because we look at the VC dimension of it and the resources of data we have, and we decide we just cannot generalize. But at least it's under control, because we have the number that describes it. In terms of optimization, now it's not like I have the target written here and I'm just designing perceptrons. I'm given a data set-- inputs, outputs-- and I have a multilayer perceptron, which is computing a perceptron function of a perceptron function of a perceptron function. And now I want to choose the weights for the different layers, in order to get there. So obviously that's a very difficult combinatorial optimization, because it was difficult even for one perceptron. That's why the optimization here is a red flag. That's why we needed to go for an approximation using a continuous function, where we have something like gradient descent that can work for us.
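A minimal sketch in Python of the recipe from the earlier answer: a fresh random permutation of the examples in each epoch, several random restarts, and keeping the run with the best minimum found. It reuses the hypothetical init_weights, forward, and backward helpers from the sketch above, and the squared in-sample error used to compare runs is likewise an assumption for illustration.

```python
import numpy as np

def train_with_permuted_epochs(X, Y, layer_sizes, eta=0.1, n_epochs=200, seed=0):
    """One training run: a fresh random permutation of the examples in every epoch."""
    rng = np.random.default_rng(seed)
    W = init_weights(layer_sizes, rng=rng)          # small random start breaks the symmetry
    for _ in range(n_epochs):
        for n in rng.permutation(len(X)):           # new presentation order each epoch
            xs = forward(W, X[n])
            deltas = backward(W, xs, Y[n])
            for l in range(len(W)):
                W[l] -= eta * np.outer(xs[l], deltas[l])
    return W

def best_of_restarts(X, Y, layer_sizes, n_restarts=100):
    """Repeat from different random starting points and keep the best local minimum found."""
    def in_sample_error(W):
        return np.mean([(forward(W, x)[-1][1:] - y) ** 2 for x, y in zip(X, Y)])
    runs = (train_with_permuted_epochs(X, Y, layer_sizes, seed=s) for s in range(n_restarts))
    return min(runs, key=in_sample_error)
```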
MODERATOR: You mentioned the VC dimension is roughly the number of parameters. So they want you to comment on it. PROFESSOR: We are not going to be as lucky as in the case of perceptrons, where we got the VC dimension exactly. In this case, there are some analyses. And because the weights are not completely independent in their impact-- you can play around with weights in different layers to compensate for one another, and there are some permutations and whatnot-- they don't each contribute a full degree of freedom. So you can upper-bound it by the number of weights, and lower-bound it by something fairly close to the number of weights, but smaller. So as a rule of thumb, you take the number of weights as being the VC dimension. And that has stood the test of time, in terms of practice. MODERATOR: In terms of the interpretation, by just looking at the first layer, it's not enough to interpret the-- PROFESSOR: Oh, if your interpretation is to say: yeah, I understand perfectly what the first layer does. It gives 0.3 weight to the first input, and 0.7 weight to the second input, and minus 0.4 weight to the third input, and sums them up and then compares to the threshold, which is 0.23. If you take that as an interpretation, then they are interpretable. But an interpretation here, what people mean, is an interpretation that makes sense. For example, in the case of movies, your interpretation is to say this factor is the comedy content. People can relate to that. But what we are saying is that the factor is relevant to the rating, but cannot be articulated in simple terms that people would consider an interpretation. And similarly for the hidden layer here. MODERATOR: Can you say what happened in the end with the bank? What explanation was given in the end? PROFESSOR: No, I can't. [LAUGHING] It's a private consultation. I cannot comment in detail. But basically, the question was raised and it made the point. MODERATOR: Can you explain again why, in the past lecture, you mentioned that data snooping is not a good practice? PROFESSOR: Data snooping is a bad practice if you don't account for it. When we get to data snooping-- we will discuss it in one of the lectures-- we will say that you either avoid it or account for it. The problem is that if you data-snoop, and you don't account for it in terms of its impact on generalization, you end up with something that is extremely optimistic. You go to the bank, if you do a private consulting job for a bank, and tell them: I have something that predicts the stock market great. And then you give it to them. And when they go to the stock market, it falls on its head, and that's the problem. Because you thought it would generalize, and it didn't. So data snooping, in the way I presented it, was the fact that we didn't account for that-- we learned in our mind, but we didn't account for the VC dimension of the space we worked on. That was the problem, rather than looking at the data in and of itself. But since the damage is almost unavoidable, it's a very good practice not to look at the data, because the accounting is difficult in this case. In the case of neural networks, the looking at the data happened in a very prescribed way. A learning algorithm was actually trying to find the weights that constitute the hidden layer. So therefore, it is looking at the data in abundance. On the other hand, the accounting has already been taken into consideration, because, as I mentioned, the weights have been counted as contributing to the VC dimension.
So we know the impact on the generalization behavior. MODERATOR: Does the range of the weights alter the choice of eta? PROFESSOR: Which weight? Repeat the question, please. MODERATOR: Does the range of the weights affect the value of eta? PROFESSOR: Let's say that you're making decisions. Eventually, you will take the output layer and hard-threshold it, so there, it's as if you are just scaling the weights and the decision doesn't change. But for the intermediate weights, the actual value matters, because the actual value of their output will contribute to the next layer. So you cannot just say that I'm scale-invariant or anything like that. But supposedly, the learning rate was only a way to arrive at a minimum of the error function. And the minimum of the error function will happen at a particular combination of the weights. So it shouldn't affect it in a predictable way. Obviously if I change the rate, I may end up in a different spot, but it's not like I will end up in a better spot or a worse spot if I use a reasonable learning rate. Yes, it does affect it, but it affects it in an unpredictable way. Pretty much like you can say: how does the initial condition affect the result? Well, it affects it, but it affects it in a random way, and you're better off just averaging over a number of cases, or picking from a number of cases, in order to immunize against that type of variation. MODERATOR: Is there a relation between neural networks and genetic algorithms? PROFESSOR: I guess both of them appeal to someone who's interested in a biological reflection. Genetic algorithms are optimization techniques, based on taking a generation, having mating, and keeping the good genetic properties, so it doesn't really apply here. Everything in machine learning has been applied to everything. So there were actually people trying to train neural networks using genetic algorithms. You'll find all combinations in the literature. But at a basic level, a neural network is a model. A genetic algorithm is an optimization technique. And therefore, there's really no relationship between them. MODERATOR: Small confusion. Does in-sample training constitute looking at the data? PROFESSOR: The strict answer is yes. You look at the data all too well. You are actually looking at the data, and you're trying to minimize the error on the data, and all of that, which again is fine as long as you have already taken into account that the weight space you're navigating has limited VC dimension. And therefore, when you do that and you get to something, you still have a guarantee of generalization from what you're arriving at to the out-of-sample. So the learning algorithm looks at the data. That's all it does. It looks at the data. But it's already-- before we even turn the learning algorithm loose on the data, we have already chosen the hypothesis set. And we put the generalization checks in place. MODERATOR: What do you recommend, implementing your own neural network or using a package? PROFESSOR: It's-- Honestly, it's a borderline case. For example, if you're doing the perceptron, you just write it down. It's so simple. In neural networks, it's a little bit complicated, and there are some bugs that are typical. I used to have this as an exercise, and then I decided that the logistics of doing it are not worth the benefit of it. So to answer your question, I recommend using a package for neural networks.
MODERATOR: Does analyzing-- performing some sort of sensitivity analysis on the weights give us some information about how the neural network-- PROFESSOR: Yeah, there's actually work on that. There are regularization questions based on that-- on how effective a weight is, on disturbances to it, and whatnot. There are all kinds of analyses. Neural networks have been studied to a great level of detail. And indeed, the choice of the weights, the range of the weights, perturbation of the weights-- all of these have been looked at. MODERATOR: Are there other models that lend themselves more to interpretation? PROFESSOR: If you have a bunch of parameters and the algorithm is going to choose them, then already interpreting those parameters is not clear. You can artificially put constraints in place in order to make sure the parameters are interpretable, or you can start from an initial condition that already has an interpretation, and you're just fine-tuning it. But that's if you are very keen on the interpretation aspect. MODERATOR: Going back to the first examples, where there were logic implementations with the perceptrons. So there was a confusion. Are we still trying to learn weights here, or do we just have them fixed? PROFESSOR: This was an illustration of the fact that when you combine perceptrons, you're able to implement more interesting functions. This didn't touch on learning yet. After we did that, we found that the structure that is multilayered is an interesting model to study. And from then on, it became a learning question. We had a neural network. We are no longer going to look at target functions, and try to choose the neurons. We are just going to put it up as a model, and let the learning algorithm choose the weights, which is backpropagation in this case. MODERATOR: Could you briefly explain early stopping? PROFESSOR: OK. I think it is best described when I talk about regularization and validation. It is basically a way to prevent overfitting, which is the next topic. So I think it will be much better understood in context once we understand what overfitting is, and what the tools for dealing with overfitting are-- regularization, and validation in this case. And then early stopping will be very easily explained. MODERATOR: A question on stochastic gradient descent. When you go through an epoch, you choose points randomly. Only points you have not selected, right? PROFESSOR: There are lots of choices. An epoch is one run. And it's a good idea to have all the examples contribute. So one way to get it to be random and still guarantee that you get all of them is, instead of choosing the point at random, you choose a random permutation from 1 to N, and then go through that in order. And then for the next epoch, you do another permutation. If you instead just keep choosing points at random, eventually every example will contribute about the same, but an epoch will be a little bit more difficult to define. You can define it simply as N iterations, regardless of whether you covered all of them or not. That is valid. And some people simply do a sequential version, no randomness at all. You just go through the examples, or you have a fixed permutation, and you go through the examples in that order, and keep repeating it. And there are some observations about differences, but the differences are not that profound. MODERATOR: Does having layers and no loops limit the power of the neural network? PROFESSOR: Loops as in feedback, I'm assuming.
Once you have feedback, even the definition of what function I'm implementing becomes tricky, because I'm feeding on myself. So it's a completely different type. There are recurrent neural networks, which is actually the model that started the work on neural networks. And it has completely different mathematics and application domains. Here we're implementing a function, and it is clean enough to do it in a layered way, in order to get a nice algorithm like that. And since we showed that you can basically implement anything if you have a big enough model, you are not missing out on something by doing that. One could say: maybe I can get a smaller network if I can jump layers, which is possible. It's an interesting intellectual curiosity, but in terms of practical impact, it has very little. MODERATOR: In terms of the VC dimension, since it roughly depends on the number of parameters, if you had a fixed number of nodes, but you arranged them in different layers, what do you gain or what do you lose in that? PROFESSOR: OK. If you believe in the rule of thumb, and it's just a rule of thumb that is based on upper and lower bounds, then if I rearrange the nodes into different layers, the number of weights will change, because for the number of weights, I say how many neurons here, and how many neurons there, and I multiply the two numbers, and that gives me the number of weights between those layers. So as long as the guiding number you take-- the bottom-line number-- is how many weights I put in the network, you'll be more or less OK. Obviously, you can take extreme cases, where I have one neuron feeding into one neuron, feeding into one neuron-- the example I gave last time. So you have tons of weights that are really not contributing much. But within reason, if you have taken general architectures that are reasonable, then the number of weights is the operative quantity. MODERATOR: OK, we should quit. PROFESSOR: So, we'll see you next week.
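The weight counting in that last answer can be made concrete. A minimal sketch, assuming fully connected layers where each layer also receives a constant bias input:

```python
def count_weights(layer_sizes):
    """Total number of weights in a fully connected, layered network with bias inputs."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# The same 19 nodes arranged differently give different weight counts,
# and hence, by the rule of thumb, different effective VC dimensions:
# count_weights([10, 8, 1])     -> 97
# count_weights([10, 4, 4, 1])  -> 69
```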
Info
Channel: caltech
Views: 408,434
Rating: 4.9450917 out of 5
Keywords: Machine Learning (Field Of Study), Neural Network (Algorithm Family), Artificial Neural Network, SGD, Caltech, MOOC, data, computer, science, course, Data Mining (Technology Class), Big Data, Data Science, learning from data, Stochastic Gradient Descent, Back Propagation, Backprop, Technology (Professional Field), Computer Science (Industry), Learning (Quotation Subject), Lecture (Type Of Public Presentation), California Institute Of Technology (Organization), Abu-Mostafa, Yaser
Id: Ih5Mr93E-2c
Length: 85min 16sec (5116 seconds)
Published: Sun May 06 2012