Ali Ghodsi, Lec 7: Backpropagation

Video Statistics and Information

Captions
In the previous lecture we learned about the perceptron, and we also learned the structure of a feed-forward neural network: a feed-forward neural network is layers of perceptrons. We learned how to train a perceptron, and in this lecture we want to learn how to train a feed-forward neural network. The algorithm we are going to study is called backpropagation; it is the dominant algorithm for training feed-forward neural networks.

As a reminder, a feed-forward network has this structure: you have a d-dimensional data point, then a layer of perceptrons, and each perceptron is fed by these inputs. There are weights on the edges, so we take a weighted sum of the inputs, apply a nonlinear function to that weighted sum, and produce an output. That output goes to another layer of perceptrons; we can have many layers, and eventually we reach the output layer, which could be one node or several. For simplicity, assume there is one output node, so y-hat is the output of the network.

We would like to train this network. By training I mean that, given n data points, I want to minimize the difference between y and y-hat. In vector form each of these is a vector of length n, and the objective is E = ||y − ŷ||². I need to learn the weights so that this objective function is minimized. Following the procedure we have used for every other algorithm in this course, I take the derivative of the objective with respect to the parameters I want to optimize, which are the weights: every weight in the network is a parameter of the model, changing any of them changes the output, and I want to know which setting is optimal. So I must be able to differentiate the objective with respect to the weights, and then either set the derivative to zero if there is a closed form, or use a numerical method such as gradient descent, or Newton's method with second derivatives. Either way I need this derivative, and backpropagation, the algorithm we study today, is basically an organized way to compute it.

Let me consider a snapshot of this network, a few layers and nodes, and we will see how to compute the derivative. Say this is layer l, this is layer i, and this is layer j. There are many nodes in each layer, but I pick one node per layer. There are edges between these nodes, with a weight associated to each: u_il is one weight and u_ji is another. Don't forget these are perceptrons, so each node takes its inputs and adds them up. I divide each perceptron into two parts: the first part is the weighted sum of the inputs, which I call a, and the second part is where the nonlinear function is applied, which I call z. So we have a_i and z_i, a_j and z_j, a_l and z_l. Concretely, a_i is the sum over the previous layer of the outputs times the weights, a_i = Σ_l u_il z_l, and z_i is a nonlinear function applied to a_i, z_i = σ(a_i). In the perceptron this function was the sign function, but here we usually use a smooth version of it, for example the logistic function σ(a) = 1 / (1 + e^(−a)). It looks like the sign function, but it is smooth, so it is differentiable. Is that clear so far? Any questions?
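As an aside for these notes, here is a minimal NumPy sketch of this smooth activation and its derivative; the function names are mine, not from the lecture:

```python
import numpy as np

def sigmoid(a):
    """Smooth stand-in for the sign function: sigma(a) = 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """Derivative of the logistic function: sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)
```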
Okay, so that is the structure. Now I need to take the derivative of the objective E with respect to one of the weights, say u_il. Backpropagation is basically the chain rule: you apply the chain rule in a wise way and you eventually get the derivative. To differentiate E with respect to u_il, I write

∂E/∂u_il = (∂E/∂a_i)(∂a_i/∂u_il).

What is ∂a_i/∂u_il? It is z_l, because a_i is just the summation Σ_l u_il z_l, so differentiating with respect to u_il leaves z_l. In other words, this derivative asks: if I perturb u_il a little, how much does a_i change? What about the other factor, ∂E/∂a_i? We don't know this quantity yet, so I just give it a name: δ_i = ∂E/∂a_i. Now apply the chain rule to δ_i itself, and be careful here, remembering what the derivative means: perturbing a_i changes z_i, which changes the a_j of every node j in the next layer, not just one of them, so to be correct we need a summation over all those nodes:

δ_i = ∂E/∂a_i = Σ_j (∂E/∂a_j)(∂a_j/∂a_i) = Σ_j δ_j (∂a_j/∂a_i).

What is ∂a_j/∂a_i? If it were the derivative with respect to z_i it would just be u_ji, but it is with respect to a_i, so in fact we apply the chain rule once more: ∂a_j/∂a_i = (∂a_j/∂z_i)(∂z_i/∂a_i) = u_ji σ'(a_i), since z_i = σ(a_i). Putting it all together, and pulling σ'(a_i) out of the sum because it has nothing to do with the index j:

∂E/∂u_il = δ_i z_l,  with  δ_i = σ'(a_i) Σ_j δ_j u_ji.

So to differentiate the error with respect to a weight u_il I need z_l, which I know, and δ_i, which I don't. But δ_i depends only on known quantities (σ is a known function and I know a_i, so I can evaluate σ'(a_i)) and on the δ_j of the next layer, and by the same argument each δ_j depends on the δ_k of the layer after that: each δ depends on the δ of the next layer. So what happens if I go all the way up to the last layer? Say the last layer is layer K, with a single output node producing ŷ. For simplicity I assume this last node just passes its input through, so output and input are the same; you can always arrange that by putting an identity node after your real output node. Then a_K is just ŷ, and

δ_K = ∂E/∂a_K = ∂E/∂ŷ = ∂(y − ŷ)²/∂ŷ = −2(y − ŷ),

which I can compute. So I can compute δ for the last layer; having the δ of the last layer I can compute the δ of the previous layer, then the layer before that, and so on: I back-propagate and compute the δ of every layer. Knowing all the δ's, I can compute the derivative of E with respect to every weight. These are the two main formulas you need: the recursion for δ, and ∂E/∂u_il = δ_i z_l.

That is how backpropagation works. You have n data points. You initialize the network with randomly selected weights and feed in the first data point. With that forward pass you can compute all the a's, all the z's, and ŷ. Knowing the true y for this point and the ŷ the network generated, you compute δ_K for the last layer, put it into the recursion to find the δ's of the previous layer, then the layer before, and so on for every layer. Multiplying each δ by the corresponding z gives the derivative of E with respect to every weight. Having these derivatives you can do gradient descent: the new weight is the old weight minus some learning rate times the derivative, w ← w − ρ ∂E/∂w.

To summarize the algorithm: apply an input vector x to the network and do a forward pass to find the values of all the hidden and output units, including ŷ. Evaluate δ_K for the last layer, with or without the factor of 2; it is just a scalar times the gradient and does not change the direction. Then you have the two essential formulas: one lets you compute the δ of a layer given the δ of the next layer, and the other gives the derivative of the error with respect to a given weight from the corresponding δ and z. Having the derivative for all weights, apply gradient descent. Do this for the first data point, then the second, then the third, up to the last one, and repeat the whole process many times until it converges. That is backpropagation. Any questions?
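To make the procedure concrete for these notes, here is a minimal sketch of backpropagation with per-point gradient descent for a network with one hidden layer, a linear output node, squared error, and the logistic activation. The tiny architecture, the names, and the hyperparameters are my own illustrative choices, not code from the course:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_backprop(X, y, n_hidden=10, lr=0.1, n_epochs=100, seed=0):
    """X: (n, d) inputs; y: (n,) targets. One hidden layer, squared error E = (y - yhat)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, n_hidden))  # input -> hidden weights, random init
    w2 = rng.normal(scale=0.1, size=n_hidden)       # hidden -> output weights
    for _ in range(n_epochs):
        for i in range(n):                          # one data point at a time, as in the lecture
            x = X[i]
            # forward pass: compute all the a's, z's, and y-hat
            a1 = x @ W1                             # weighted sums at the hidden layer
            z1 = sigmoid(a1)                        # hidden activations z = sigma(a)
            y_hat = z1 @ w2                         # linear output node
            # backward pass: delta at the output, then back-propagate
            delta_out = -2.0 * (y[i] - y_hat)       # delta_K = dE/da at the output
            delta_hid = z1 * (1.0 - z1) * (w2 * delta_out)  # sigma'(a_i) * sum_j delta_j u_ji
            # gradient descent: w <- w - lr * dE/dw, with dE/du_il = delta_i * z_l
            w2 -= lr * delta_out * z1
            W1 -= lr * np.outer(x, delta_hid)
    return W1, w2
```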
As you can see, it is gradient descent, and as usual with gradient descent there is no guarantee of finding the global minimum; we find a local minimum of the function. One reason neural networks were out of favor for a while was exactly this: backpropagation finds a local minimum of the objective function, not the global minimum. It is not considered a serious problem anymore; there is recent theoretical work showing that with very high-dimensional data, local minima are not really that bad: there are many, many local minima that are about as good as each other, so it does not matter much which one you get stuck in. In fact, in some cases you may even benefit from a local minimum. We are going to talk about model selection right after we finish backpropagation, and you will see that in model selection you have to avoid overfitting; in some cases a local minimum helps you avoid overfitting.

So that was one particular structure of neural network, the feed-forward neural network, which is the most dominant form. There are other types: convolutional neural networks, recurrent neural networks, recursive neural networks. In this course we will also cover convolutional neural networks, but not recurrent or recursive ones. If there are no questions, I'm going to start talking about model selection and overfitting, and then we will come back to neural networks.

In any type of learning you have a problem called overfitting, and its opposite, underfitting. Forget about classification for the moment. Suppose I give you a set of points and ask you to fit a function to them: this could be one candidate function, this is another, and this is a third. Which one is better among the three? The second one. Why? Good point: if you want to match all of these data points exactly, the last one is the best; but if you want to capture the overall pattern of the data, the second one is the best. And why do we want to learn the overall pattern rather than memorizing all the points? To predict new points. When I give you the values at these n particular points, there is no process of learning in reproducing them; you already know them. Learning means learning something new: what happens at a new data point, not at the data points you have already seen. If you go with the third function, which essentially memorizes every data point, you have overfitted your model. Overfitting means you have learned the given examples exactly, but you cannot generalize: show me a new point and you cannot handle it.

It's not just function estimation; even in daily life we have this issue. I have students who, given a paper to read, come back with it color-coded in five kinds of marker: this sentence red, this one green, each color with its own meaning. That type of student gets 100 on every exam and an A+ GPA, but usually you cannot expect much from them in terms of innovation or new ideas, because this is overfitting: they want to learn every single detail of the subject, and there is no room left to learn anything new.

Here is another example. If I ask what the difference between a blackboard and a whiteboard is, you can tell me a blackboard is black and a whiteboard is white, and with that small description you can distinguish any blackboard from any whiteboard. But suppose instead you have a dataset of ten blackboards and ten whiteboards, and you answer by listing exact measurements: this blackboard is two meters long, one meter high, with this thickness and this weight, and so on for every board in the set. You can classify every board in that dataset perfectly, but show me a new blackboard that is not in the dataset and I get confused, because I have never seen this exact length or this exact thickness before. The simple "black versus white" description captures the common pattern that actually separates the two classes. So when we are learning a model, we have to make sure we are learning the right model and not overfitting the data.

There is the opposite problem, called underfitting. Underfitting, in this example, means you fit something like a linear function, which is pretty far from the true underlying pattern. Say I'm fitting a polynomial to this data: this is a polynomial of degree 1, this is degree 2, and this is degree n − 1. If you have n points and fit a polynomial of degree n − 1, you can pass through all of them exactly. Now put the degree of the polynomial on the x-axis and the empirical error, what I predict compared to the true values, on the y-axis. On the training set, with a polynomial of degree 1 I have a huge error; with degree 2 some error; with degree 3, 4, 5 less and less, until at some point the curve reaches zero and I pass through every point. So I could continue until I have no training error at all, which, as we said, is not a good thing: that is overfitting. On new data points, on a test set, the pattern is different: with degree 1 you have some error, with degree 2 less, with degree 3 less still, and then at some point the test error starts to increase. That turning point is the optimum: beyond it you cannot predict well; giving more detail to the model does not help, since it has learned enough to capture the general pattern. Before that point you are underfitting; after it you are overfitting. We have to avoid both, which means finding the right model; this is called model selection. There are general methods for model selection that apply to any type of model, including neural networks, perceptrons, and every other method we have learned so far or will learn afterward: any of them can overfit or underfit, and you have to find the right model.
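For these notes, the polynomial-degree experiment is easy to reproduce; the sketch below uses synthetic data I made up (a noisy sine curve) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)                 # assumed "true" function
x_tr = rng.uniform(0, 1, 15)
x_te = rng.uniform(0, 1, 200)
y_tr = f(x_tr) + rng.normal(scale=0.2, size=15)     # noisy training points
y_te = f(x_te) + rng.normal(scale=0.2, size=200)    # held-out points

for deg in [1, 2, 3, 5, 9, 14]:
    coef = np.polyfit(x_tr, y_tr, deg)              # least-squares polynomial fit
    err_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    err_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    print(f"degree {deg:2d}: train {err_tr:.3f}, test {err_te:.3f}")
# training error keeps falling with degree; test error falls, then rises again
```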
In that example complexity was the degree of the polynomial, but in general you can define a complexity for each model class, and the more complex the model, the better it is able to fit the data. In the family of polynomials, a line is a model with low complexity; a quadratic has higher complexity, then cubic, and so on; each time you add to the complexity, the model has more flexibility. For a neural network, a network with two layers is more complex than a network with one, and having more nodes also means a more complex model. So in general we are going to learn about model selection, and then we will see how it applies to the techniques we have learned so far and the techniques we will learn after that. That was the example I gave on the board: with some data points, this is underfitting, this is the right model, and this is overfitting, and we need to avoid both.

More formally, our question is: how can we choose a good classifier, or in general a good model? It is not just about classification; regression, or any type of learning you want to do, has this problem. In terms of classification, you would like a classifier h with minimum true error. Remember what the true error was: the probability that h(x) ≠ y. We can't compute this probability exactly; we estimate it through the training set. But our empirical error, the estimate computed on the training set, is always an underestimate of the true error; it is biased downward, so in many cases it is not a good estimate. What we actually compute is the empirical error, not the true error. If you could minimize the true error directly, fine, but you can't; you usually minimize the empirical error, and the empirical error can be problematic.

To study this problem more carefully, let me remind you of a couple of concepts: bias, variance, and mean squared error. Suppose f is the underlying true function and f̂ is the function we estimate. What we are really interested in is minimizing the distance between the true f and the estimated f̂. The bias is the distance between f and the expectation of the estimate we compute, and the variance measures how unstable, how wiggly, our estimate is. In this example, this model, a fitted line, has high bias and low variance: if I estimate a linear function here ten times and take the average, the individual fits do not differ much from the average, so the variance is low; but the average itself, this line, is far from the reality, so the bias is high. This wiggly function, on the other hand, has very high variance: each time I estimate it I get something different, because there are many different ways to go through all of these data points, so any one estimate can be pretty far from the average of the estimates; but it has low bias.

There is always a trade-off between variance and bias: mean squared error = bias² + variance, so for a fixed mean squared error, if you decrease the bias the variance increases, and vice versa. And there is always the phenomenon we explained on the board: the empirical error, the training error, always decreases when you increase the complexity of the model; a more complex model means less training error. But that is not the case for a test sample, for new data points. On one side of the complexity axis you are in the region of low bias and high variance; on the other side, high bias and low variance. We want to avoid both of these extreme corners and find a model with the right complexity; that is the art of model selection, or complexity control.
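One way to see the trade-off numerically, for these notes: refit the same model class on many independently drawn training sets and measure, at a single point, how far the average prediction is from the truth (bias) and how much the predictions scatter (variance). The setup below, noisy sine data with polynomial fits, is my own illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)
x0 = 0.5                                        # point at which we measure bias and variance

for deg in [1, 3, 9]:
    preds = []
    for _ in range(500):                        # many independent training sets
        x = rng.uniform(0, 1, 20)
        y = f(x) + rng.normal(scale=0.2, size=20)
        preds.append(np.polyval(np.polyfit(x, y, deg), x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2         # squared distance of the average fit from truth
    var = preds.var()                           # spread of the individual fits
    print(f"degree {deg}: bias^2 {bias2:.4f}, variance {var:.4f}")
```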
There are different approaches to model selection and complexity control. One is to estimate the true error rate. I told you the true error is the probability that h(x) ≠ y, which we approximate by an empirical estimate; if you could minimize the true error you would be fine, and the problem starts when we approximate it by the empirical error. So the first approach is to fix the empirical estimate so that it actually represents the true error, for example by adding or subtracting a correction, or by computing it differently. There are two major routes to estimating the true error. One is cross-validation, which we will study in detail in this course; it is a very simple, very general technique for model selection. The other is computing error bounds, of which we will see just a little flavor: there are many inequalities in probability, and you can use them to put a bound on the true error; you claim the true error cannot be more than such-and-such. This is theoretically very interesting, with really elegant techniques, but in practice these bounds are usually loose: your true error is here, the bound is way over there, and the bound only tells you the worst case, that it cannot be worse than this. You can prove that rigorously, and it is true, but in practice the method performs much better than the bound suggests. Semi-supervised learning is a good example: everyone uses semi-supervised learning and gets good results, yet theoretically you can prove that using unlabeled data does not improve the bound, which would mean semi-supervised learning is useless, that labeled data alone gives a better bound than labeled and unlabeled data together. You can show that theoretically, but in practice semi-supervised learning is useful, because the bound is very loose: the bound is here and reality is over there. So these are elegant techniques, interesting to study, but maybe not that useful in practice; we are going to see a little bit of this. There is also regularization, which means adding a penalty function to your cost function when you optimize it. We will see a bit of this too: for neural networks, for example, weight decay, where you add an l2-norm penalty, and why, without going through all the details.

Let's start with cross-validation. Cross-validation is maybe the simplest way to do model selection, and it is quite easy: you leave some of the data out at training time. I have this dataset; I divide it into two parts, use one part for training, and do not use the other part in training at all; I call that part the validation set. So far, training meant minimizing the objective function, for example the difference between the true values and the estimated values over all n points of the training set; that was the empirical error. What I do now is divide my training set, putting m points in the training part and l points in the validation part, and use only the training part to fit the model; but when I want to measure the error, I measure it on the validation part. In our running example, I use the m points to learn the parameters of a linear function, and once I have the linear function I measure its error on the l points that have not been used for training; then I do the same for a quadratic, then a cubic, and so on. Remember the two curves: training error keeps decreasing with complexity, but the error on points we have not observed does not. Since I am now measuring the error on points that were not used for fitting, it is not going to decrease forever; it starts increasing at some point, and that point is my right model. This is the vanilla version of cross-validation, the very simple version: you put part of the data aside, you don't use it in training, and you use it when you want to measure the error; you compute the error over the points of the validation set, not the training set. One bad thing about this is that you lose part of the data you have for training, and data is usually quite valuable: if I have only 100 data points and you tell me to put 30 aside and use just 70, it affects my estimate; it will not be as good as an estimate using all 100 points.
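A minimal sketch of this vanilla hold-out procedure, again on made-up data, choosing the polynomial degree by validation error:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=100)

idx = rng.permutation(100)
tr, va = idx[:70], idx[70:]            # m = 70 points for training, l = 30 held out

def val_error(degree):
    coef = np.polyfit(x[tr], y[tr], degree)                  # fit on the training part only
    return np.mean((np.polyval(coef, x[va]) - y[va]) ** 2)   # error on the validation part

best = min(range(1, 10), key=val_error)
print("degree chosen on the validation set:", best)
```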
One way around this is k-fold cross-validation. In k-fold cross-validation, take all of the data and divide it into K parts. Put the first part aside, train on the other K − 1 parts, and use the held-out part to estimate the error, exactly as we did before. Then repeat the same process, but this time put the second part aside and use the other parts to train; then the third, and so on. With K parts I repeat the process K times, and each time I put one part aside and train on the remaining K − 1 parts. This somehow gets around the problem that part of the data is never used in training: every data point is used for training in some of the rounds, while in each round one portion of the data is held aside. So in k-fold cross-validation you divide your data into K chunks; in each iteration you remove chunk k from the data, compute the classifier from the rest, and compute the error on the chunk you did not use in training; then you take the average of all these errors: I have K such scenarios, one error per held-out chunk, and their average is my error estimate. So to evaluate a linear model I go through the K folds of learning, and then the same for the quadratic, the cubic, and so on, and then I decide which one is best.

A question from the audience: doesn't this give K different models? No; in the end you are going to have one model, but to find that one model you need to repeat the process of learning K times. Where before I fit a line once to my m data points, by least squares say, I now have to solve a least-squares problem K times, once per fold, take the average of the errors, and report that as the error of the linear model; then go to the quadratic, and so on. So for each complexity level I need to do the training K times, and that is the drawback: if training is computationally intensive, and in many cases it is, you may not be able to afford it. For many neural network models, even with high-performance computers, GPUs and CPUs, you may still wait a week for one training run; you cannot afford k-fold cross-validation to choose, say, the architecture; you have to do something different. But if training is not that expensive, it is a very good technique. With logistic regression, for example, it's fine: training is cheap and you can do it. You quite often see in papers, when experiments are reported, "we used 10-fold cross-validation"; K = 10 is quite common.
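A sketch of k-fold cross-validation for the same polynomial example; the code is mine, not the course's:

```python
import numpy as np

def kfold_error(x, y, degree, K=10, seed=0):
    """Average validation error of a degree-`degree` polynomial over K folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        va = folds[k]                                          # chunk k held out
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[tr], y[tr], degree)                # train on the other K-1 chunks
        errs.append(np.mean((np.polyval(coef, x[va]) - y[va]) ** 2))
    return np.mean(errs)                                       # average of the K errors
```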
But some papers say, for instance, "we used 6-fold cross-validation". K could be anything, but do you know why someone would report 6-fold, or 11-fold, instead of 10? Any of your guesses could be the reason, but usually it is because they are cheating. When you see numbers like that, get suspicious: why didn't they use 10-fold cross-validation? Usually they torture the data until it provides good results, and they report the best result in the paper: they try 10 folds and the results are not that good, so they try 7, 8, then 6 works, and they report that with 6-fold cross-validation they beat everyone else in the world.

Another question from the audience, about combining this with PCA: yes, you can do PCA first and then do cross-validation afterwards; actually it makes more sense that way: you learn your features, then you cross-validate. You could do it the other way around; it depends, but in this particular example it is fine to learn the subspace from all the data points you have and then do cross-validation. We are talking about two different models: there is a classifier you want to learn, and there is an assumption you have made about the data, namely that it is close to a subspace. You can separate these two parts: for the first assumption, that the data is close to a subspace, apply PCA to the whole dataset, and then go and learn your classifier on top of the assumption you have already made. You do not need to mix the two; you can if you want, but you do not need to, and most likely the subspace you find using all n data points is a better subspace than one found from only part of them. Any questions on k-fold cross-validation?

There is one variation of k-fold cross-validation called leave-one-out cross-validation. With n data points, assume you have n folds and leave one point out each time, so you have to train your model n times: I put the last data point aside, learn the model with the other n − 1 points, and compute the error on the one point I held out; then I put point n − 1 aside, learn the model using the remaining n − 1 points, compute the error on that point; and so on. It is essentially n-fold cross-validation for n data points, and it is called leave-one-out. It can be quite expensive when the algorithm is computationally intensive, and as an estimator it has high variance. But in some cases you can get around the cost; you will see this in one of your assignments: if the model is linear, you can avoid computing n different models; you fit the model once and compute the leave-one-out error from that single fit.
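The linear-model shortcut alluded to here is, I believe, the standard hat-matrix identity: the leave-one-out residual at point i equals the ordinary residual divided by 1 − H_ii, where H = X(XᵀX)⁻¹Xᵀ. A sketch under that assumption:

```python
import numpy as np

def loo_error_linear(X, y):
    """Leave-one-out MSE for linear least squares from a single fit.

    X: (n, p) design matrix (include a column of ones for an intercept);
    uses e_loo_i = (y_i - yhat_i) / (1 - H_ii) with the hat matrix H.
    """
    H = X @ np.linalg.solve(X.T @ X, X.T)   # H = X (X^T X)^{-1} X^T
    resid = y - H @ y                       # ordinary residuals from one fit
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)
```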
Okay, we are not completely done with neural networks. Now that we have learned about cross-validation, I'm going to talk about another type of neural network, the radial basis function (RBF) network. We will learn about RBF networks, then regularization for neural networks such as weight decay, and then some more theoretical concepts: how to control complexity and how to compute a bound, for RBF networks for example. So keep this rough roadmap in mind for this lecture and the next couple of lectures.

A radial basis function network has this structure: it is a sort of neural network with exactly one hidden layer, never more. You have a d-dimensional input and a layer of m nodes, but these nodes are not perceptrons; they are just nonlinear functions. Each φ is a nonlinear function applied to the point x: it absorbs the d-dimensional point x, applies a nonlinear function to it, and has an output, and you have m of these nonlinear functions. Then, in the last layer, the output is similar to a perceptron: a weighted sum of these outputs. There are no weights in the first layer, from the input layer to the hidden layer; from the hidden layer to the output layer we do have a set of weights, and as usual we need to learn these weights, because they are the parameters of the model.

So I have inputs x_1 to x_d, functions φ_1 to φ_m, and outputs y_1 to y_k, with weights between the hidden layer and the outputs and none before. What happens is that a d-dimensional point comes in, φ_1(x) is computed, φ_2(x) is computed, and so on; these are the hidden outputs, and the final output is a weighted sum. For example, the first output at point x is ŷ_1(x) = w_11 φ_1(x) + w_21 φ_2(x) + … + w_m1 φ_m(x). We need to find this set of weights; any suggestions for how to do it?

Let me write it this way. Suppose I have n data points and form a matrix X such that each column is one data point, so X is d × n: the first column is the vector x⁽¹⁾ and the last is x⁽ⁿ⁾. I define a matrix Φ of all the values that these m functions produce on the n data points: the first data point is fed to φ_1 and gives a value, up to φ_m, and likewise for every point, so in total there are m × n values, φ_1(x⁽¹⁾) through φ_m(x⁽¹⁾), up to φ_1(x⁽ⁿ⁾) through φ_m(x⁽ⁿ⁾); I collect them in an m × n matrix Φ. I also have the outputs as a matrix Y: the output is k-dimensional, with values y_1 to y_k for any given point, and there are n points, so Y is k × n. And I can put all my weights in a matrix: there are m nodes in the hidden layer and k output nodes, so W is m × k, with entries w_11 down to w_m1 in the first column through w_1k down to w_mk in the last. Now I can write the output in matrix notation: Y = WᵀΦ. Check the dimensions: Φ is m × n, W is m × k so Wᵀ is k × m, and WᵀΦ is k × n, which matches Y.

In Y = WᵀΦ, Y is known and Φ is known: φ is a nonlinear function that you choose; it is called your basis function, which is where the name radial basis function comes from. You choose your basis, so you can compute all these values: I know my X, I choose my φ, I apply φ to X and fill the matrix Φ. I have my output and my input; the only thing I don't have is W. How can I compute W? Least squares: minimize ||Y − WᵀΦ||² with respect to W. Should we do gradient descent again, take the derivative and update W? No; this is just least squares, and it has a closed-form solution, so there is no need for gradient descent.

To see it, write the cost as a trace: ||Y − WᵀΦ||² = tr[(Y − WᵀΦ)(Y − WᵀΦ)ᵀ]. Expanding, you get a term tr(YYᵀ) that does not involve W, two cross terms tr(YΦᵀW) and tr(WᵀΦYᵀ), and a quadratic term tr(WᵀΦΦᵀW). Take the derivative with respect to W: the first term gives zero, the quadratic term gives 2ΦΦᵀW, and each cross term gives ΦYᵀ. Setting the derivative to zero gives ΦΦᵀW = ΦYᵀ, so

W = (ΦΦᵀ)⁻¹ Φ Yᵀ.

(On the board I first wrote YΦᵀ on the right-hand side and then corrected the transpose; ΦYᵀ is the right form.) Check the dimensions: Yᵀ is n × k, Φ is m × n, so ΦYᵀ is m × k; ΦΦᵀ is m × m, and so is its inverse; m × m times m × k gives m × k, and W is m × k. That is correct. So the solution is W = (ΦΦᵀ)⁻¹ΦYᵀ, and you can solve it in closed form. Any questions? See you on Thursday.
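As a closing sketch for these notes, here is the RBF construction end to end with Gaussian basis functions; the choice of Gaussians, the centres, and the width are my own illustrative assumptions, since the lecture leaves the basis up to you:

```python
import numpy as np

def rbf_design(X, centers, width):
    """Phi is m x n: entry (i, j) is phi_i applied to data point x_j.

    X: (d, n) with one data point per column, matching the lecture's notation;
    centers: (d, m) locations of the m Gaussian basis functions.
    """
    d2 = ((centers[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # (m, n) squared distances
    return np.exp(-d2 / (2.0 * width ** 2))

def fit_rbf(X, Y, centers, width):
    """Closed-form least squares: W = (Phi Phi^T)^{-1} Phi Y^T, so Y is approximated by W^T Phi."""
    Phi = rbf_design(X, centers, width)             # (m, n)
    return np.linalg.solve(Phi @ Phi.T, Phi @ Y.T)  # (m, k)

# usage: d = 1 input, k = 1 output, m = 5 centres picked from the data
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (1, 50))
Y = np.sin(2 * np.pi * X)                           # (1, 50) targets
centers = X[:, ::10]                                # (1, 5) hypothetical centres
W = fit_rbf(X, Y, centers, width=0.2)
Y_hat = W.T @ rbf_design(X, centers, 0.2)           # predictions, shape (1, 50)
```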
Info
Channel: Data Science Courses
Views: 11,784
Rating: 4.9689922 out of 5
Keywords: Machine Learning (Software Genre), Backpropagation, Neural Network (Field Of Study), deep learning
Id: J6hcu87NZWE
Length: 81min 8sec (4868 seconds)
Published: Tue Oct 13 2015