Lecture 11 - Introduction to Neural Networks | Stanford CS229: Machine Learning (Autumn 2018)

  • Author: stanfordonline
  • Description: Take an adapted version of this course as part of the Stanford Artificial Intelligence Professional Program. Learn more at: https://stanford.io/3bhmLce Kian ...
  • Youtube URL: https://www.youtube.com/watch?v=MfIjxPh6Pys
Captions
Hello everyone, welcome to CS229. Today we're going to talk about deep learning and neural networks. We're going to have two lectures on that, one today and a little more on Monday. Don't hesitate to ask questions during the lecture; stop me if you don't understand something and we'll try to build intuition around neural networks together. We will actually start with an algorithm you have seen previously, logistic regression. Everybody remembers logistic regression? Yes. Okay. Remember, it's a classification algorithm. We're going to explain how logistic regression can be interpreted as a specific case of a neural network, and then we will move on to neural networks. Sounds good? So, a quick intro on deep learning. Deep learning is a set of techniques, let's say a subset of machine learning, and it's one of the growing families of techniques used in industry, specifically for problems in computer vision, natural language processing, and speech recognition. You have a lot of different tools and plug-ins on your smartphones that use this type of algorithm. The reason it came to work so well is primarily new computational methods. One thing we're going to see today is that deep learning is really, really computationally expensive, and people had to find techniques to parallelize the code and use GPUs, graphics processing units, in order to be able to run the computations in deep learning. The second part is that the amount of available data has been growing since the Internet bubble and the digitization of the world. People now have access to large amounts of data, and this type of algorithm has the specific property of being able to learn a lot when there is a lot of data.
So these models are very flexible, and the more data you give them, the better they will be able to understand the salient features of the data. And finally, algorithms: people have come up with new techniques in order to use the data and the computation power to build models. We are going to touch a little bit on all of that, but let's start with logistic regression. Can you see in the back? Yeah? Okay, perfect. So you remember what logistic regression is. We are going to fix a goal for ourselves, a classification goal: let's try to find cats in images. Find cats in images, meaning binary classification. If there is a cat in the image, we want to output a number that is close to 1, presence of the cat, and if there is no cat in the image, we want to output 0. Let's say for now we're constrained to at most one cat per image, no more. If you had to draw the logistic regression model, this is what you would do. You would take a cat; so this is an image of a cat. I'm very bad at that, sorry. In computer science, you know that images can be represented as 3D matrices. So if I tell you that this is a color image of size 64 by 64, how many numbers do I need to represent those pixels? [BACKGROUND] Yeah, I heard it: 64 by 64 by 3. Three for the RGB channels, red, green, blue. Every pixel in an image can be represented by three numbers, one for the red channel, one for the green channel, and one for the blue channel. So this image actually contains 64 times 64 times 3 numbers. Makes sense? So the first thing we will do, in order to use logistic regression to tell whether there is a cat in this image, is flatten it into a vector. I'm going to take all the numbers in this matrix and flatten them into a vector; just an image-to-vector operation, nothing more. And now I can use my logistic regression, because I have a vector input.
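A minimal NumPy sketch of that flattening step (not from the lecture; a random array stands in for the actual cat photo):

```python
import numpy as np

# Random stand-in for a 64 x 64 color image (3 channels: red, green, blue).
image = np.random.rand(64, 64, 3)

# Flatten the 3D pixel array into a single column vector.
x = image.reshape(-1, 1)

print(x.shape)  # (12288, 1), i.e. 64 * 64 * 3 rows and one column
```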
So I'm going to take all of these and push them into an operation we call the logistic operation, which has one part that is wx plus b, where x is the image, and a second part that is the sigmoid. Everybody is familiar with the sigmoid function? It's a function that takes a number between minus infinity and plus infinity and maps it into the interval between 0 and 1; very convenient for classification problems. And this we are going to call y hat, which is the sigmoid of what you've seen in class previously as theta transpose x. But here we will split the notation into w and b. So can someone tell me the shape of w? [BACKGROUND] Yes, related to 64 by 64 by 3, yeah. So you know that this guy here, x, is a column vector with 64 times 64 times 3 entries: the shape of x is 12,288 by 1, since 64 times 64 times 3 is 12,288. And because we want y hat to be 1 by 1, this w has to be 1 by 12,288. Makes sense? So we have a row vector as our parameter. We're just changing the notation of the logistic regression you have seen. And once we have this model, we need to train it, as you know. The process of training is that, first, we initialize our parameters. These are what we call parameters; we will use the specific vocabulary of weights and bias. I believe you have heard this vocabulary before, weights and biases. So we're going to find the right w and the right b in order to use this model properly. Once we have initialized them, we will optimize them, find the optimal w and b, and after we have found the optimal w and b, we will use them to predict. Does this training process make sense? And I think the important part is to understand what this step is: finding the optimal w and b means defining your loss function, which is the objective.
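A sketch of this forward pass in NumPy (the zero initialization of w and b is just for illustration, not a recommendation):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

n = 64 * 64 * 3             # 12,288 input features
x = np.random.rand(n, 1)    # flattened image, shape (12288, 1)
w = np.zeros((1, n))        # row vector of weights, shape (1, 12288)
b = 0.0                     # scalar bias

y_hat = sigmoid(w @ x + b)  # shape (1, 1): the predicted probability of "cat"
```

With all-zero weights, wx + b is 0 and the output is exactly sigmoid(0) = 0.5.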
And in machine learning, we often have this specific problem where you have a function you know you want to find, the network function, but you don't know the values of its parameters. In order to find them, you use a proxy, which is your loss function: if you manage to minimize the loss function, you will find the right parameters. So you define a loss function, which is the logistic loss: L = -[y log(y hat) + (1 - y) log(1 - y hat)]. You have seen this one. Do you remember where it comes from? It comes from maximum likelihood estimation, starting from a probabilistic model. And so the idea is: how can I minimize this function? Minimize, because I've put the minus sign in front. I want to find the w and b that minimize this function, and I'm going to use a gradient descent algorithm, which means I'm going to iteratively compute the derivative of the loss with respect to my parameters, and at every step I will update them to make the loss go down a little at each iteration. In terms of implementation, this is a for loop: you loop for a certain number of iterations, and at every step you compute the derivative of the loss with respect to your parameters. Does everybody remember how to compute this? You take the derivative, use the fact that the sigmoid function has a derivative equal to sigmoid times 1 minus sigmoid, and you work out the result. We're going to do some derivatives later today, but this is just to set up the problem. So, a few things I want to touch on here. First, how many parameters does this logistic regression model have? If you have to, count them. [BACKGROUND] Yeah, correct: 12,288 weights and 1 bias, so 12,289 parameters. Makes sense? And actually, it's funny, because you can count them quickly just by counting the number of edges in the drawing, plus 1. Every circle has a bias.
Every edge has a weight, because ultimately you can rewrite this operation as a sum over the inputs, right? Every weight corresponds to an edge. So that's another way to count parameters; we are going to use it a little further on. We're starting with not too many parameters, actually. And one thing to notice is that the number of parameters of our model depends on the size of the input. We probably don't want that at some point, so we are going to change it later. Two equations I want you to remember. The first one is: neuron equals linear plus activation. This is the vocabulary we will use in neural networks. We define a neuron as an operation with two parts, one linear part and one activation part, and that's exactly what we have here. This is actually a neuron: we have a linear part, wx plus b, and then we take the output of this linear part and put it into an activation, which in this case is the sigmoid function. It can be other functions, okay? So this is the first equation, not too hard. The second equation I want to set up now is: model equals architecture plus parameters. What does that mean? Here, we're trying to train a logistic regression in order to be able to use it. We need an architecture, which is the following, a one-neuron neural network, and the parameters w and b. So basically, when people in industry say they've shipped a model, what they're saying is that they found the right parameters for the right architecture. They have two files, and these two files predict a bunch of things: one parameter file and one architecture file. The architecture will be modified a lot today; we will add neurons all over, and the parameters will always be called w and b, but they will become bigger and bigger. Because we have more data, we want to be able to understand it. You can tell it's going to be hard to understand what a cat is with only that many parameters.
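The initialize, optimize, predict loop described above can be sketched on a toy logistic regression (this toy data, learning rate, and iteration count are illustrative choices, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 examples with 3 features each (one column per example).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
y = np.array([[0.0, 1.0, 1.0, 0.0]])

w = np.zeros((1, 3))  # weights
b = 0.0               # bias
lr = 0.5              # learning rate, an arbitrary choice for this toy problem

for _ in range(500):
    y_hat = sigmoid(w @ X + b)   # forward pass
    # Gradient of the averaged logistic loss; the identity
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) collapses dL/dz to (y_hat - y).
    dz = (y_hat - y) / X.shape[1]
    w -= lr * (dz @ X.T)         # dL/dw
    b -= lr * dz.sum()           # dL/db

loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```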
We want to have more parameters. Any questions so far? So this was just to set up the problem with logistic regression. Let's set a new goal, after the first goal we set before. The second goal is: find cat, lion, or iguana in images. A little different from before; the only thing we changed is that we now want to detect three types of animals. If there's a cat in the image, I want to know there is a cat; if there's an iguana in the image, I want to know there is an iguana; if there is a lion in the image, I want to know that as well. So how would you modify the network we previously had to take this into account? Yeah? Yeah, good idea. Add two more circles, so two more neurons, doing the same thing. So we have our picture here with the cat; the cat is going to the right. It's 64 by 64 by 3, we flatten it into x_1 to x_n; let's say n is 64 times 64 times 3. And what I will do is use three neurons that all compute the same kind of thing. They're all connected to all these inputs, okay? I connect all my inputs x_1 to x_n to each of these neurons, and I will use a specific set of notations here. Okay: y hat_2 equals a_2^[1], which is sigmoid of w_2^[1] x plus b_2^[1]. And similarly, y hat_3 equals a_3^[1], which is sigmoid of w_3^[1] x plus b_3^[1]. I'm introducing a few notations here, and we'll get used to them, don't worry. Just write this down and we're going to go over it. [NOISE] The square brackets represent what we will later call a layer. If you look at this network, it looks like there is one layer here, a layer in which the neurons don't communicate with each other. We could add more neurons in other layers, and we will do that later on. We denote with square brackets the index of the layer, and the subscript on a is the number identifying the neuron inside the layer. So here we have one layer.
We have a_1^[1], a_2^[1], and a_3^[1], with square bracket one to identify the layer. Does that make sense? And then our y hat, instead of being a single number as it was before, is now a vector of size three. So how many parameters does this network have? [BACKGROUND] How much? [BACKGROUND] Okay, how did you come up with that? [BACKGROUND] Yeah, correct: 3n plus 3. We just have three times what we had before, because we added two more neurons and they each have their own set of parameters. This edge is a separate edge from this one, so we have to replicate the parameters for each neuron. So w_1^[1] would be the equivalent of what we had for the cat, but we have to add two more parameter vectors and biases. [NOISE] Another question: when you had to train the logistic regression, what dataset did you need? Can someone try to describe the dataset? Yeah. [BACKGROUND] Yeah, correct. We need images together with labels: 1 for cat, 0 for no cat. It's binary classification with images and labels. Now, what do you think the dataset should be to train this network? Yes. [BACKGROUND] That's a good idea. Just to repeat: a label for an image that has a cat would probably be a vector with a 1 and two 0s, where the first entry represents the presence of a cat, the second the presence of a lion, and the third the presence of an iguana. So let's assume I use this scheme to label my dataset. I train this network using the same techniques as before: initialize all my weights and biases with starting values, minimize a loss function using gradient descent, and then use y hat to predict. What do you think this first neuron is going to be responsible for, if you had to describe its responsibility? [BACKGROUND] Yes, the cat. Well, this one? Lion. Yeah, lion, and this one iguana. So basically the way you, yeah, go for it. [BACKGROUND] That's a good question.
We're going to talk about that now: whether an image can contain multiple different animals or not. Going back to what you said: because we decided to label our dataset like that, after training, this neuron is naturally going to be the one detecting cats. If we had changed the labeling scheme and said that the second entry corresponds to the presence of a cat, then after training you would find that the second neuron is responsible for detecting cats. The network is going to evolve depending on the way you label your dataset. Now, do you think this network can still be robust to different animals in the same picture? So this cat now has a friend [NOISE] that is a lion. Okay, I have no idea how to draw a lion, but let's say there is a lion here, and because there is a lion, I will add a 1 here as well. Do you think this network is robust to this type of labeling? [BACKGROUND] It should be; the neurons aren't talking to each other. That's a good answer, actually. Another answer? [BACKGROUND] That's good intuition, because what the network sees is just 1, 1, 0 and an image. It doesn't see that the first entry corresponds to the cat and the second to the lion. [NOISE] This is a property of neural networks: you don't need to tell them everything. If you have enough data, they're going to figure it out. Because you will also have cats with iguanas, cats alone, lions with iguanas, lions alone, this neuron will ultimately understand what it's looking for, and it will understand that this entry corresponds to the lion. It just needs a lot of data. So yes, it's going to be robust, and for the reason you mentioned: the three neurons aren't communicating with each other, so we can train them completely independently of each other. In fact, the sigmoid here doesn't depend on the sigmoid here, and doesn't depend on the sigmoid here.
It means we can have 1, 1, and 1 as an output. [NOISE] Yes, question? [BACKGROUND] You could think about it as three logistic regressions. We wouldn't really call that a neural network yet; it's essentially three logistic regressions trained side by side. [NOISE] Now, following up on that, yeah, go for it, a question. [BACKGROUND] How are w and b related to theta? Yeah. So usually you would have theta transpose x, which is the sum of theta_i x_i, correct? And I will split it: the sum of theta_i x_i for i from 1 to n, plus theta_0 times 1. Theta_0 corresponds to b, and the other theta_i correspond to the w_i. Makes sense? Okay, one more question and then we move on. [BACKGROUND] Good question. That's the next thing we're going to see. The question is a follow-up on this: are there cases where we have a constraint that only one outcome is possible? It means there is no image with both a cat and a lion; there is either a cat or a lion, either an iguana or a lion. Think about health care. There are many models made to detect whether a skin disease is present based on cell microscopy images. Usually there is no overlap between these: you want to classify one specific disease among a large number of diseases. This model would still work, but it would not be optimal, because it's longer to train. Maybe one disease is super rare, and one of the neurons is never going to be trained. Let's say you're working in a zoo where there is only one iguana and there are thousands of lions and thousands of cats. This neuron will almost never train; it would be super hard to train this one.
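The three independent neurons above can be sketched as a single matrix operation (zero initialization again just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 64 * 64 * 3               # flattened image size
x = np.random.rand(n, 1)

W = np.zeros((3, n))          # one row of weights per neuron: cat, lion, iguana
b = np.zeros((3, 1))          # one bias per neuron

y_hat = sigmoid(W @ x + b)    # shape (3, 1); each entry is an independent sigmoid

# Nothing couples the three outputs, so a prediction like (1, 1, 0), a cat AND
# a lion in the same image, is representable, and there are 3n + 3 parameters.
```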
So you want to start with another model where you put in the constraint that there is only one class to predict, and let all the neurons learn together by creating interaction between them. Have you heard of softmax? Yes? Somebody in the back, I see. [LAUGHTER] Okay, let's look at softmax a little bit together. We set a new goal now: we add the constraint that there is a unique animal in an image, so at most one animal per image. I'm going to modify the network a little bit. We have our cat, and there is no lion in the image; we flatten it, and now I'm going to use the same scheme with the three neurons a_1^[1], a_2^[1], a_3^[1]. But for the output, what I am going to use is a softmax function. So let me be more precise, and let me introduce another notation to make it easier. As you know, a neuron is a linear part plus an activation. We're going to introduce a notation for the linear part: z_1^[1] for the linear part of the first neuron, z_2^[1] for the linear part of the second neuron. So now each neuron has two parts, one which computes z, and one which computes a equals sigmoid of z. Now, I'm going to remove all the sigmoid activations, keep these z's, and use a specific formula. This, if you recall, is exactly the softmax formula. [NOISE] Yeah. Okay. So now, can you see the network or is it too small? Too small? Okay, I'll write this formula bigger and then you can figure out the others: y hat_3 equals e to the z_3^[1], divided by the sum for k from 1 to 3 of e to the z_k^[1]. Okay, can you see this one? So this is for the third output. For the first one, you just change the 3 into a 1, and for the second one into a 2. So why is this formula interesting, and why is the network not robust to the earlier labeling scheme anymore?
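The softmax formula on the board can be sketched directly (the z values here are arbitrary example numbers):

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result.
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # linear outputs z_1, z_2, z_3 of the layer
y_hat = softmax(z)

print(y_hat.sum())  # 1.0: the outputs form a probability distribution
```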
It's because the outputs of this network have to sum to 1. You can try it: if you sum the three outputs, you get the same thing in the numerator as in the denominator, and you get 1. Makes sense? So instead of getting an independent probability for each of y hat_1, y hat_2, y hat_3, we get a probability distribution over all the classes. That means we cannot get 0.7, 0.6, 0.1, telling us roughly that there is probably a cat and a lion but no iguana. The outputs have to sum to 1. So if there is no cat and no lion, it means there is very likely an iguana. The three probabilities are dependent on each other, and for this network we have to label the following way: 1, 0, 0 for a cat, 0, 1, 0 for a lion, and 0, 0, 1 for an iguana. This is called a softmax multi-class network. [inaudible] You assume there is at least one of the three classes; otherwise you have to add a fourth output that represents the absence of an animal. But this way, you assume there is always one of these three animals in every picture. And how many parameters does this network have? The same as the second one. We still have three neurons, and although I didn't write it, z_1^[1] is equal to w_1^[1] x plus b_1^[1], and the same for z_2^[1] and z_3^[1]. So there are 3n plus 3 parameters. One question we didn't talk about: how do we train these 3n plus 3 parameters? Do you think the earlier scheme will work or not? What's wrong with it, what's wrong with the loss function specifically? It was built for only two outcomes. In that loss function, y is either 0 or 1 and y hat is a probability between 0 and 1, so it cannot match this three-class labeling. So we need to modify the loss function. Let's call it the three-neuron loss: what I'm going to do is just sum the logistic loss over the three neurons. Does that make sense?
So I am just doing this loss three times, once for each of the neurons, and summing them together. If you train with this loss function, you should be able to train the three neurons you have. And again, talking about scarcity of one of the classes: if there are not many iguanas, the third term of this sum is not going to help this neuron train towards detecting an iguana; it's going to push it to detect no iguana. Any question on the loss function? Does this one make sense? Yeah? [inaudible] Yeah. Usually what will happen is that the output of this network, once it's trained, is going to be a probability distribution. You will pick the maximum of those, set it to 1 and the others to 0 as your prediction. One more question, yeah. [inaudible] If you use this labeling scheme, like 1, 1, 0, for the softmax network, what do you think will happen? It will probably not work, and the reason is that the sum of these label entries is equal to 2, while the sum of the network's outputs is equal to 1. So you will never be able to match the output to the label. Makes sense? So what the network is probably going to do is send this one to one-half, this one to one-half, and this one to 0, which is not what you want. Okay, let's talk about the loss function for this softmax regression. [NOISE] Because you know what's interesting about the previous loss: if I take the derivative of the three-neuron loss with respect to w_2^[1], do you think it is going to be harder than the single-neuron derivative, or not? It's going to be exactly the same, because only one of the three terms depends on w_2^[1]; the derivatives of the two others are 0. So we are at exactly the same complexity during the derivation. But this one, do you think, if you try to compute it, let's say we define a loss function that corresponds roughly to that:
If you try to compute the derivative of that loss with respect to w_2^[1], it becomes much more complex. The output that impacts the loss function directly depends not only on the parameters w_2^[1]; it also depends on the parameters w_1^[1] and w_3^[1]. And same for this output: this output also depends on the parameters w_2^[1]. Does that make sense? It's because of the denominator. So the softmax regression needs a different loss function and a different derivative. The loss function we'll define is a very common one in deep learning; it's called the softmax cross-entropy loss. I'm not going to go into the details of where it comes from, but you can get the intuition from the formula: minus the sum over the classes k of y_k log(y hat_k). It looks surprisingly like the binary cross-entropy, the logistic loss; the only difference is that we sum it over all the classes. We will take the derivative of something that looks like this later, but if you can try it at home on this one, it would be a good exercise as well. So this cross-entropy loss is the one very likely to be used in classification problems that are multi-class. Okay. So this was the first part, on logistic-regression types of networks, and I think we're ready now, with the notation we introduced, to jump to neural networks. Any question on this first part before we move on? So, one question I would have for you: let's say that instead of trying to predict whether there is a cat or no cat, we were trying to predict the age of the cat based on the image. What would you change in this network? Instead of predicting 1 or 0, you want to predict the age of the cat. What are the things you would change? Yes. [inaudible] Okay, so I repeat: you would basically make several output nodes, where each of them corresponds to one age of cat. So would you use this network or the third one? Would you use the three-neuron network or the softmax regression? The third one. The third one. Why?
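The cross-entropy formula above can be sketched in a few lines (the one-hot label and example predictions are illustrative values, not from the lecture):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # L = -sum_k y_k * log(y_hat_k); eps guards against log(0).
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([1.0, 0.0, 0.0])         # one-hot label: cat
good = np.array([0.9, 0.05, 0.05])    # confident, correct prediction
bad = np.array([0.1, 0.8, 0.1])       # confident, wrong prediction

# Only the predicted probability of the true class enters the loss,
# and the loss is much larger when that probability is small.
```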
Because a cat has a unique age; you cannot have two ages, right. So we would use the softmax one, because we want a probability distribution over the ages. Okay, that makes sense; that's a good approach. There is also another approach, which is directly using a regression to predict the age. An age can be between zero and plus infi- well, not plus infinity [LAUGHTER], between zero and a certain number. [LAUGHTER] So let's say you want to do a regression; how would you modify your network? Change the sigmoid. The sigmoid squashes z between 0 and 1; we don't want that to happen. So I'd say we change the sigmoid. Into what function would you change it? [inaudible] Yeah. And the second one you said was? [inaudible] Oh, to get a Poisson type of distribution, okay. So let's go with linear; you mentioned linear. We could just replace the sigmoid with a linear function, but then the whole network becomes a linear regression. Another activation that is very common in deep learning is the ReLU function. It's a function that is almost linear, but for every input that is negative, it is equal to 0. Because we cannot have a negative age, it makes sense to use this one. Okay, so this is called the rectified linear unit, ReLU; it's a very common one in deep learning. Now, what else would you change? We talked about linear regression. Do you remember the loss function you used in linear regression? What was it? [inaudible] It was probably one of these two: y hat minus y, the comparison between the label and the prediction, or the L2 loss, y hat minus y in L2 norm. So that's what we would use: we would modify our loss function to fit the regression type of problem.
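The ReLU activation just mentioned is one line of code:

```python
import numpy as np

def relu(z):
    # Identity for positive inputs, zero for negative inputs.
    return np.maximum(0.0, z)

z = np.array([-3.0, -0.5, 0.0, 2.0, 7.5])
print(relu(z))  # the negatives become 0; 2.0 and 7.5 pass through unchanged
```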
And the reason we would use this loss for a regression task, instead of the one we had before, is that in optimization the shape of this loss is much easier to optimize for a regression task than the classification loss, and vice versa. I'm not going to go into the details of that, but that's the intuition. [NOISE] Okay, let's go have fun with neural networks. [NOISE] We stick to our first goal: given an image, tell us if there is a cat or no cat; cat is 1, no cat is 0. But now we're going to make the network a little more complex; we're going to add some parameters. So I take my picture of the cat. [NOISE] The cat is moving. Okay. And what I'm going to do is put in more neurons than before, maybe something like this. [NOISE] Using the same notation, you see that my square bracket here is a two, indicating that this is the second layer, while this one is the first layer and this one is the third layer. Is everybody up to speed with the notation? Cool. So now notice that when you make a choice of architecture, you have to be careful about one thing: the output layer has to have as many neurons as you want classes for a classification problem, and one neuron for a regression. So, how many parameters does this network have? Can someone quickly give me the thought process? So how many here? Yeah, 3n plus 3, let's say. [inaudible] Yeah, correct. Here you have 3n weights plus 3 biases; here you have 2 times 3 weights plus 2 biases, because you have three neurons connected to two neurons; and here you have 2 times 1 weights plus 1 bias. Makes sense? So that's the total number of parameters. You see that we didn't add too many parameters; most of the parameters are still in the input layer. Let's define some vocabulary. The first word is layer. A layer denotes a group of neurons that are not connected to each other.
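The layer-by-layer parameter count above (3n + 3, then 2 times 3 plus 2, then 2 times 1 plus 1) can be checked with a short helper, using the edges-plus-biases rule from earlier:

```python
# Each edge between consecutive layers carries one weight,
# and every neuron (circle) carries one bias.
def count_parameters(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights + biases for this layer
    return total

n = 64 * 64 * 3  # 12,288 flattened pixel inputs
print(count_parameters([n, 1]))        # 12289: the one-neuron logistic regression
print(count_parameters([n, 3, 2, 1]))  # 36878: (3n + 3) + (2*3 + 2) + (1*2 + 1)
```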
These two neurons are not connected to each other; these two neurons are not connected to each other. We call such a cluster of neurons a layer, and this network has three layers. We use input layer for the first layer; output layer for the third layer, because it directly sees the output; and we call the second layer a hidden layer. The reason we call it hidden is that the inputs and the outputs are hidden from this layer. The only thing this layer sees as input is what the previous layer gave it; it's an abstraction of the inputs, but it's not the inputs. Does that make sense? And likewise, it doesn't see the output; it just passes what it understood to the last neuron, which will compare its output to the ground truth. So now, why are neural networks interesting, and why do we call this a hidden layer? It's because if you train this network on cat classification with a lot of images of cats, you will notice that the first layers are going to pick up the fundamental concepts of the image, which are the edges. This neuron is going to be able to detect this type of edge, this neuron is probably going to detect some other type of edge, this neuron maybe this type of edge. Then what's going to happen is that these neurons are going to communicate what they found in the image to the next layer's neurons. And this neuron is going to use the edges those neurons found to figure out, oh, there are ears, while this one is going to figure out, oh, there is a mouth, and so on if you have several neurons. And they're going to communicate what they understood to the output neuron, which is going to reconstruct the face of the cat based on what it received and be able to tell whether there is a cat or not. So the reason it's called a hidden layer is that we don't really know what it's going to figure out, but with enough data, it should understand very complex information about the data.
The deeper you go, the more complex the information the neurons are able to understand. Let me give you another example, a house price prediction example. [NOISE] Let's assume that our inputs are the number of bedrooms, the size of the house, the zip code, and the wealth of the neighborhood. What we will build is a network that has three neurons in the first layer and one neuron in the output layer. What's interesting is that, as a human, if you were to build this network and hand-engineer it, you would say that, okay, the zip code and the wealth are able to tell us about the school quality in the neighborhood, the quality of the school next to the house, probably. As a human, you would say these are probably good features to predict that. The zip code is going to tell us whether the neighborhood is walkable or not, probably. The size and the number of bedrooms are going to tell us the size of the family that can fit in this house. And these three, school quality, walkability, and family size, are probably better information than the raw inputs for finally predicting the price. So that's a way to hand-engineer the features, as a human, in order to give human knowledge to the network to figure out the price. In practice, what we do instead is use a fully connected layer. What does that mean? It means we connect every input to every neuron of the first layer, every output of the first layer to every neuron of the next layer, and so on. All the neurons from one layer to the next are connected with each other. What we're saying is that we will let the network figure these features out; we will let the neurons of the first layer figure out what's interesting for the second layer in making the price prediction. So we will not tell this to the network; instead, we will fully connect the network [NOISE] and so on. Okay.
We'll fully connect the network and let it figure out what the interesting features are. And oftentimes, the network is going to be better than humans at finding these representative features. Sometimes you may hear neural networks referred to as black box models. The reason is that we will not understand what this edge corresponds to; it's hard to figure out what exactly this neuron is detecting from a weighted average of the input features. Does that make sense? Another term you might hear is end-to-end learning. The reason we talk about end-to-end learning is because we have an input, a ground truth, and we don't constrain the network in the middle. We let it learn whatever it has to learn, and we call it end-to-end learning because we are just training based on the input and the output. Let's delve more into the math of this network, the neural network that we have here, which has an input layer, a hidden layer, and an output layer. Let's try to write down the equations that take the inputs and propagate them to the output. We first have z_1, the linear part of the first layer, computed as z_1 = W_1 x + b_1. Then this z_1 is given to an activation, let's say sigmoid: a_1 = sigmoid(z_1). z_2 is then the linear part of the second layer, which takes the output of the previous layer, multiplies it by its weights and adds the bias: z_2 = W_2 a_1 + b_2. The second activation takes the sigmoid of z_2: a_2 = sigmoid(z_2). And then we have the third layer, which multiplies its weights with the output of the layer preceding it and adds its bias: z_3 = W_3 a_2 + b_3. And finally, the third activation is simply a_3 = sigmoid(z_3). So what is interesting to notice between these equations and the equations that we wrote here is that we put everything in matrices.
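As a sketch, the forward propagation equations above can be written directly in NumPy. The input size n and the random weights here are assumptions for illustration; the layer sizes 3, 2, 1 match the lecture's network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 5                                   # number of input features (assumed)
rng = np.random.default_rng(0)
x = rng.standard_normal((n, 1))         # one input example, shape (n, 1)

W1, b1 = rng.standard_normal((3, n)), np.zeros((3, 1))
W2, b2 = rng.standard_normal((2, 3)), np.zeros((2, 1))
W3, b3 = rng.standard_normal((1, 2)), np.zeros((1, 1))

z1 = W1 @ x + b1;  a1 = sigmoid(z1)     # z1, a1: (3, 1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)     # z2, a2: (2, 1)
z3 = W3 @ a2 + b3; a3 = sigmoid(z3)     # z3, a3: (1, 1), this is y-hat
print(a3)
```

The final sigmoid squashes the output between 0 and 1, so a_3 can be read as the probability that there is a cat.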
So it means, this a_3 that I have here: for three neurons I wrote three equations here, while for the three neurons in the first layer I just wrote a single equation to summarize them. But the shapes of these things are going to be vectors. So let's go over the shapes, let's try to define them. For z_1 = W_1 x + b_1, x is n by 1, and W_1 has to be 3 by n because it connects three neurons to the inputs, so z_1 has to be 3 by 1. It makes sense because we have three neurons. Now let's go deeper. a_1 is just the sigmoid of z_1, so it doesn't change the shape; it keeps the 3 by 1. z_2, we know it has to be 2 by 1 because there are two neurons in the second layer, and that helps us figure out what W_2 should be. We know a_1 is 3 by 1, so W_2 has to be 2 by 3. And if you count the edges between the first and the second layer here, you will find six edges: 2 times 3. a_2 has the same shape as z_2. z_3 is 1 by 1, a_3 is 1 by 1, and W_3 has to be 1 by 2, because a_2 is 2 by 1. Same reasoning for the biases: each b matches the number of neurons, so 3 by 1, 2 by 1, and finally 1 by 1. So I think it's usually very helpful, even when coding these types of equations, to know all the shapes that are involved. Are you guys totally okay with the shapes, super easy to figure out? Okay, cool. So now what is interesting is that we will try to vectorize the code even more. Does someone remember the difference between stochastic gradient descent and gradient descent? What's the difference? [inaudible] Exactly. Stochastic gradient descent updates the weights and the biases after you see each example. So the direction of the gradient is quite noisy; it doesn't represent the entire batch very well. Gradient descent, or batch gradient descent, updates after you've seen the whole batch of examples, and the gradient is much more precise. It points in the direction you want to go.
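The difference between the two update rules can be sketched on a toy least-squares problem (a made-up example of mine, not from the lecture): stochastic gradient descent updates after every single example, while batch gradient descent takes one step per pass using the average gradient.

```python
import numpy as np

# Toy regression data: y is roughly X @ [2, -1] plus a little noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.standard_normal(100)

lr = 0.05

# Stochastic: one noisy update per example.
w_sgd = np.zeros(2)
for _ in range(3):                      # a few passes over the data
    for xi, yi in zip(X, y):
        grad = (xi @ w_sgd - yi) * xi   # gradient from one example
        w_sgd -= lr * grad

# Batch: one precise update per pass, averaging over all examples.
w_batch = np.zeros(2)
for _ in range(200):
    grad = X.T @ (X @ w_batch - y) / len(y)
    w_batch -= lr * grad

print(w_sgd, w_batch)  # both should end up close to [2, -1]
```

The SGD trajectory wobbles around the solution because each step follows one example, while the batch trajectory follows the average direction.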
So what we're trying to do now is to write down these equations if, instead of giving one single cat image, we had given a bunch of images that either have a cat or not a cat. So what happens for an input batch of m examples? Our x is not a single column vector anymore; it's a matrix with the first column corresponding to x^(1), the second column corresponding to x^(2), and so on until the mth column corresponding to x^(m). And I'm introducing a new notation, which is the parenthesized superscript corresponding to the ID of the example. So square brackets for the layer, round brackets for the ID of the example we are talking about. So just to give more context on what we're trying to do: we know that this is a bunch of operations. We just have a network with input, hidden, and output layers, but we could have a network with 1,000 layers. The more layers we have, the more computation, and it quickly goes up. So what we wanna do is be able to parallelize our code, or our computation, as much as possible by giving batches of inputs and parallelizing these equations. So let's see how these equations are modified when we give the network a batch of m inputs. I will use capital letters to denote the equivalent of the lowercase letters, but for a batch of inputs. So Z_1, as an example, would be W_1 times X plus b_1. Let's analyze what Z_1 would look like. We know that for every input example of the batch we get one z_1, which should look like this. Then we have to figure out the shapes in this equation in order to end up with this. We know that lowercase z_1 was 3 by 1. It means capital Z_1 has to be 3 by m, because each of these column vectors is 3 by 1 and we have m of them. Because for each input we forward propagate through the network, we get these equations. So for the first cat image we get these equations, for the second cat image we get again equations like that, and so on.
So what is the shape of X? We have it above; we know that it's n by m. What is the shape of W_1? It didn't change. W_1 doesn't change: it's not because I give 1,000 inputs to my network that there are going to be more parameters. The number of parameters stays the same even if I give more inputs. And so this has to be 3 by n in order to match Z_1. Now the interesting thing is that there is an algebraic problem here. What is the algebraic problem? We said that the number of parameters doesn't change. It means that W has the same shape as it had before, and b should have the same shape as it had before, right? It should be 3 by 1. What's the problem with this equation? Exactly. We're summing a 3 by m matrix with a 3 by 1 vector. This is not possible in math; the shapes don't match. When you do a summation or subtraction, you need the two terms to be the same shape, because you will do an element-wise addition or an element-wise subtraction. So what's the trick that is used here? It's a technique called broadcasting. Broadcasting comes from the fact that we don't want to change the number of parameters, which should stay the same, but we still want this operation to be written in a parallel version. So we still want to write this equation, because we want to parallelize our code, but we don't want to add more parameters; it doesn't make sense. So what we're going to do is create a vector b-tilde_1, which is going to be b_1 repeated m times. So we keep the same number of parameters but just repeat them, in order to be able to write the code in parallel. This is called broadcasting. And what is convenient is, for those of you doing the homeworks, are they in Matlab or Python? Python. So in Python, there is a package that is often used to code these equations: it's NumPy.
Some people call it num-pee, I'm not sure why. So NumPy, basically numerical Python, directly does the broadcasting. It means that if you sum a 3 by m matrix with a 3 by 1 parameter vector, it's going to automatically replicate the parameter vector m times so that the equation works. It's called broadcasting. Does that make sense? So because we're using this technique, we're able to rewrite all these equations with capital letters. Do you wanna do it together, or do you want to do it on your own? Who wants to do it on their own? Okay. So let's do it on their own [LAUGHTER], on your own. So rewrite these with capital letters and figure out the shapes. I think you can do it at home; we're not going to do it here, but make sure you understand all the shapes. Yeah. [Student question:] How is this different from principal component analysis? So the question is, how is this different from principal component analysis? This is a supervised learning algorithm that will be used to predict the price of a house. Principal component analysis doesn't predict anything. It takes an input matrix X, normalizes it, computes the covariance matrix, and then figures out what the principal components are by doing the eigenvalue decomposition. But the outcome of PCA is knowing that the most important features of your dataset X are these features. Here we're not looking at the features; we're only looking at the output. That is what is important to us. Yes. [Student question:] In the first layer, why did you say that the first layers see the edges? So the question is, can you explain why the first layer would see the edges? Is there an intuition behind it? It's not always going to see the edges, but it's oftentimes going to see edges, because in order to detect a human face, let's say, you will train an algorithm to find out whose face it is. So it has to understand faces very well. You need the network to be complex enough to understand very detailed features of the face.
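A minimal NumPy sketch of the broadcasting trick (the sizes m and n here are assumptions): summing a 3-by-m matrix with a 3-by-1 bias gives exactly the same result as summing with the explicitly repeated "b tilde", without storing any extra parameters.

```python
import numpy as np

m, n = 4, 5                            # assumed batch size and input dimension
rng = np.random.default_rng(0)
X = rng.standard_normal((n, m))        # each column is one example
W1 = rng.standard_normal((3, n))
b1 = np.array([[1.0], [2.0], [3.0]])   # still just 3 parameters

Z1 = W1 @ X + b1                             # b1 broadcast over the m columns
Z1_explicit = W1 @ X + np.tile(b1, (1, m))   # explicit "b tilde", repeated m times
print(np.allclose(Z1, Z1_explicit))          # True: same result, fewer stored numbers
```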
And usually, what this neuron sees as input are pixels. So it means every edge here is the multiplication of a weight by a pixel. So it sees pixels. It cannot understand the face as a whole, because it sees only pixels; it's very granular information for it. So it's going to check if pixels nearby have the same color and understand that there is an edge there, okay? But it's too complicated to understand the whole face in the first layer. However, if it understands a little more than pixel information, it can give that to the next neuron. This neuron will receive more than pixel information; it will receive something a little more complex, like edges, and then it will use this information to build on top of it and build the features of the face. So what I'm trying to sum up is that these neurons only see the pixels, so they're not able to build more than the edges. That's the maximum thing they can do. And it's a complex topic; the interpretation of neural networks is a big research topic, so nobody has figured out exactly how all the neurons evolve. Yeah. One more question and then we move on. [Student question:] How do you decide how many neurons per layer, how many layers, what's the architecture of your neural network? There are two things to take into consideration, I would say. First, nobody knows the right answer, so you have to test it. You guys talked about training set, validation set, and test set. So what we would do is try ten different architectures, train the network on each of them, look at the validation set accuracy of all of them, and decide which one seems to be the best. That's how we figure out the right network size. On top of that, using experience is often valuable. So if you give me a problem, I always try to gauge how complex the problem is.
Like cat classification: do you think it's easier or harder than day-and-night classification? So day-and-night classification is, I give you an image and ask you to predict if it was taken during the day or during the night, and on the other hand you want to tell if there's a cat in the image or not. Which one is easier, which one is harder? Who thinks cat classification is harder? Okay. I think people agree. Cat classification seems harder. Why? Because there are many breeds of cats; they can look like different things. There are not many breeds of nights, I guess. [LAUGHTER] One thing that might be challenging in the day-and-night classification is if you also want it to work indoors: you know, maybe there is a tiny window there and I'm able to tell that it's day, but for a network to understand that, you will need a lot more data than if you only wanted it to work outside. So these problems all have their own complexity. Based on their complexity, I think the network should be deeper. Usually, the more complex the problem, the more data you need in order to figure out the output, and the deeper the network should be. That's an intuition, let's say. Okay. Let's move on, guys, because I think we have about 12 more minutes. Okay. Let's try to write the loss function for this problem. So now we have our network, we have written these propagation equations, and we call it forward propagation because it's going forward, from the input to the output. Later on, when we derive these equations, we will call that backward propagation, because we are starting from the loss and going backwards. So let's talk about the optimization problem: optimizing W_1, W_2, W_3, b_1, b_2, b_3. We have a lot of stuff to optimize, right? We have to find the right values for these, and remember: model equals architecture plus parameters. We have our architecture; if we have our parameters, we're done.
So in order to do that, we have to define an objective function, sometimes called a loss, sometimes called a cost function. Usually we would call it a loss if there is only one example in the batch, and a cost if there are multiple examples in the batch. So let's define the cost function. The cost function J depends on y-hat and y, where y-hat is a_3. We will set it to be the sum of the loss functions L_i, and I will normalize it; it's not mandatory, but normalize it with 1/m. So what does this mean? It means we are going for batch gradient descent. We wanna compute the loss for the whole batch, parallelize our code, and then calculate the cost function, which will then be differentiated to give us the direction of the gradient: that is, the average of the gradients over the whole input batch. And L_i will be the loss corresponding to one example, so what's the error on this one specific input, and it will be the logistic loss. You've already seen these equations, I believe. So now, is it more complex to take a derivative of J with respect to the parameters, or of L? What's the most complex between this one, let's say we're taking the derivative with respect to W_2, compared to this one? Which one is the hardest? Who thinks J is the hardest? Who thinks it doesn't matter? Yeah, it doesn't matter, because differentiation is a linear operation, right? So you can just take the derivative inside, and you will see that if you know this, you just have to take the sum over this. So instead of computing all derivatives on J, we will compute them on L, but it's totally equivalent; there's just one more step at the end. Okay. So now we've defined our loss function, super. The next step is to optimize, so we have to compute a lot of derivatives. And that's called backward propagation.
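As a sketch, the cost, meaning the average logistic loss over a batch of predictions, looks like this in NumPy (the example predictions and labels are made up for illustration):

```python
import numpy as np

def logistic_loss(y_hat, y):
    # Per-example loss: -[y * log(y_hat) + (1 - y) * log(1 - y_hat)]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    # The 1/m normalization: average the per-example losses over the batch.
    return np.mean(logistic_loss(y_hat, y))

y_hat = np.array([0.9, 0.2, 0.7])  # made-up predictions in (0, 1)
y     = np.array([1.0, 0.0, 1.0])  # made-up labels
print(cost(y_hat, y))              # small when predictions match the labels
```

Because the mean is linear, differentiating the cost is just the average of the per-example loss derivatives, which is the point made above.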
So why is it called backward propagation? Because what we want to do ultimately is this: for every l from 1 to 3, we want W_l := W_l minus alpha times the derivative of J with respect to W_l, and b_l := b_l minus alpha times the derivative of J with respect to b_l. So we want to do that for every parameter in layers 1, 2, and 3. It means we have to compute all these derivatives: the derivative of the cost with respect to W_1, W_2, W_3, b_1, b_2, b_3. You've done it with logistic regression; we're going to do it with a neural network, and you're going to understand why it's called backward propagation. Which derivative do you want to start with? The derivative with respect to W_1, W_2, or W_3, let's say? We'll do the biases later. W_1? You think W_1 is a good idea? I don't wanna do W_1. I think we should do W_3, and the reason is: if you look at this loss function, do you think the relation between W_3 and this loss function is easier to understand, or the relation between W_1 and this loss function? It's the relation between W_3 and this loss function, because W_3 happens much later in the network. If you want to understand how much we should move W_1 in order to make the loss move, it's much more complicated than answering how much W_3 should move to move the loss, because there are many more connections if you go through W_1. So that's why we call it backward propagation: we will start with the top layer, the one closest to the loss function, and compute the derivative of J with respect to W_3. Once we've computed this derivative, which we are going to do next week, once we've computed this number, we will be able to compute the next one very easily. Why very easily? Because we can use the chain rule of calculus. So let's see how it works.
I'm just going to give you the one-minute pitch on backprop, but we'll do it next week together. So if I had to compute this derivative, what I would do is separate it into several derivatives that are easier. I will separate it into the derivative of J with respect to something, times the derivative of that something with respect to W_3. And the question is, what should this something be? I look at my equations. I know that J depends on y-hat, and I know that y-hat, which is the same thing as a_3, depends on z_3. So why don't I include z_3 in my equation? I also know that z_3 depends on W_3, and the derivative of z_3 with respect to W_3 is super easy: it's just a_2 transpose. So I will just make a quick decomposition and say that this derivative is the same as taking the derivative of J with respect to a_3, times the derivative of a_3 with respect to z_3, times the derivative of z_3 with respect to W_3. You see? Same derivative, calculated in a different way. And I know these are pretty easy to compute. So that's why we call it backpropagation: I will use the chain rule to compute the derivative with respect to W_3, and then when I want to do it for W_2, I'm going to insert the derivative of J with respect to z_3, times the derivative of z_3 with respect to a_2, times the derivative of a_2 with respect to z_2, times the derivative of z_2 with respect to W_2. Does it make sense that this thing here is the same as this? It means, if I wanna compute the derivative with respect to W_2, I don't need to compute this part anymore; I already did it for W_3. I just need to compute the remaining factors, which are easy ones, and so on. If I wanna compute the derivative of J with respect to W_1, I'm not going to decompose the whole thing again; I'm just going to take the derivative of J with respect to z_2, which is equal to this whole thing.
And then I'm gonna multiply it by the derivative of z_2 with respect to a_1, times the derivative of a_1 with respect to z_1, times the derivative of z_1 with respect to W_1. And again, this first part I know already; I computed it previously for W_2. So what's interesting is that I'm not gonna redo the work I did; I'm just gonna store the right values while back-propagating and continue differentiating. One thing you need to notice, though, is that you need these forward propagation equations in order to remember the path to take in your chain rule. If you look at these equations, z_2 is connected to W_2, but a_1 is not connected to W_2. So you wanna choose the path you're going through in the proper way, so that every factor in the chain rule is well-defined. You cannot compute the derivative of a_1 with respect to W_2, right? You cannot compute that; you don't know it. Okay. So I think we're done for today. One thing that I'd like you to do, if you have time, is just think about the things that can be tweaked in a neural network. When you build a neural network, you are not done: you have to tweak it, you have to tweak the activations, you have to tweak the loss function. There are many things you can tweak, and that's what we're going to see next week. Okay. Thanks.
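To make the chain-rule bookkeeping concrete, here is a sketch of one forward and backward pass for the 3-2-1 network with sigmoid activations and the logistic loss. This is my illustration with assumed sizes and random weights, not the lecture's own code; it uses the standard simplification that, for sigmoid plus logistic loss, dJ/dz_3 = a_3 - y.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 4                                        # assumed number of input features
rng = np.random.default_rng(0)
x, y = rng.standard_normal((n, 1)), np.array([[1.0]])

W1, b1 = rng.standard_normal((3, n)), np.zeros((3, 1))
W2, b2 = rng.standard_normal((2, 3)), np.zeros((2, 1))
W3, b3 = rng.standard_normal((1, 2)), np.zeros((1, 1))

# Forward propagation, caching intermediate values for the backward pass.
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
z3 = W3 @ a2 + b3; a3 = sigmoid(z3)          # y-hat

# Backward propagation: compute dJ/dW3 first, then reuse the stored
# dJ/dz factors for the earlier layers instead of re-deriving them.
dz3 = a3 - y                                 # dJ/dz3 for sigmoid + logistic loss
dW3, db3 = dz3 @ a2.T, dz3                   # dz3/dW3 is a2 transpose

dz2 = (W3.T @ dz3) * a2 * (1 - a2)           # chain rule back through layer 2
dW2, db2 = dz2 @ a1.T, dz2

dz1 = (W2.T @ dz2) * a1 * (1 - a1)           # chain rule back through layer 1
dW1, db1 = dz1 @ x.T, dz1
```

Note how each dz reuses the one from the layer above: that stored value is exactly the "I already did this, I just multiply by the easy remaining factors" step described in the lecture.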