Ali Ghodsi, Lec [2,1]: Deep Learning, Regularization

Captions
Here is a list of papers I selected; many good papers. I'm going to add ten to twenty more papers to this, and I'll try to categorize them in terms of application or theory. You're going to see important papers in this field. Some of the papers are color coded: we cannot read all of these papers in class, but the papers in blue are the ones we definitely want to read, so make sure those blue papers get selected.

The way we are going to organize the second part of the course, which will be mainly paper presentations, is through wikicoursenote. If you had a course with me in the past, most likely you already have an account on wikicoursenote; if you don't, just type wikicoursenote.com and you will get to the page, where you can see the course notes of my past courses. To be able to participate and contribute to the content of the course notes you need an account, so press "log in"; if you don't have a username and password, request one. You need to give some information, a username, an email address, and so on, and then request an account. I get an email from you, I approve it, and you will have an account. When you choose your username, make sure to choose your UW username, and make sure you give me your UW email address, not Gmail or Hotmail or whatever; otherwise, in terms of marking, I have a hard time matching who is who.

So suppose you have an account on wikicoursenote. You will see there are many pages there, and you will see a new page on deep learning; we are going to use that page for the paper presentations. It works like Wikipedia: when you have an account, you can edit it. You will see a table there, and you have to register for the time of your presentation and for the paper you are going to present. You will select one paper from the list that will be posted on my web page and register for a particular time slot.

I'm going to teach about half of the course, roughly six weeks, and after that we will start the presentations. The rule for presentations is that ahead of time, and I'm going to specify exactly when, the presenter has to write a summary of the paper on wikicoursenote, say one week before the presentation. Everyone else who is officially registered for the course needs to contribute to this summary before we come to class for the presentation. So one week before the presentation the summary is there, and you have one week to contribute to it; this means you have read the paper before you come to the presentation. I'll give you the details later, but the contribution can be of different types: technical or editorial. We are most interested in technical contributions, of course; you may not have a technical contribution for every paper, but you should have technical contributions for a good portion of the papers. You can, for example, add an example, elaborate on some concept, give a reference to another paper that makes a point clearer, or provide a piece of code that clarifies something. Then we come to the lecture for the presentation knowing that everyone has read the paper. That's basically how the paper presentations will work. Your contributions have a portion of the mark, the summary you write has a portion, and your presentation has a portion, which altogether is about 50% of your mark; I'm going to specify the exact portions and post them.

Any questions? (Student: to get information about the course, should we use Piazza?) Yes, I'm going to use Piazza for announcements. I'm going to put the list of papers on my own web page: go to my web page, under teaching there's Deep Learning, and I'll put the link there. Without an account you can see everything, but you can't make contributions; if you want to edit or change something, you need an account. The site is password protected because I'm using it for marking at the end: the wiki keeps a history, and I'm going to look at the history to see who has done what. The presenter writes the summary, but everyone else collaborates and contributes; eventually it's a collaborative summary, but for marking I check the history to see the contribution of each person.

Any other questions? (Student asks about how much to contribute.) Roughly we are going to have 24 papers for presentation, and you are going to contribute to about 20 of them: you write the summary of one paper and contribute to about 20 papers to get the full mark. Each week we have about four papers to present, so to contribute to 20 papers you can miss about four of them over the six weeks, roughly three and a half papers each week. But a contribution doesn't need to be long: it could be two lines of a good example, a clear explanation of a concept that is unclear, or a reference to a missing paper without which we can't understand what's going on.

Okay, if there are no other questions, I'm going to start the lecture. Today's lecture is about regularization. In the previous lecture we learned about backpropagation, which is the most popular algorithm for training neural networks and deep networks, but overfitting is a big issue in neural networks, and we have to take care of it whether we use backpropagation or any other learning technique. Today we are going to learn different techniques for regularization, including weight decay, which is quite an old technique; early stopping; using the
manifold hypothesis to generate more data, or generating fake data; adding noise to the input; adding noise to the weights; dropout; and ensemble methods. There are many different techniques to take care of this problem.

(Before introducing any method, there is a brief aside while the lecturer tries to mute the projector, which was also troublesome last week.)

Let's first get a general idea of regularization, a more in-depth insight about what's happening in general, not just for neural networks: what happens when we use data for training, what sorts of problems we may face, and then we'll see how they can be handled in deep neural networks.

Almost all machine learning tasks amount to estimating a function. Assuming there is a true underlying function f(x), we are trying to estimate f̂. This is classification, this is regression, this is density estimation: almost all of the tasks. The main difference between machine learning techniques is that, when we look for an estimate, we choose different function classes, different ways of ranking the elements of that function class, and different ways of searching the class and selecting one element. So: I have an underlying function, I make some assumptions about the data, and I choose a function class; this function class could be, for example, all linear classifiers, or all functions that can be expressed by a neural network. Then I rank the elements of the class, and my ranking is based on my objective function: who is better, who is worse; the one that minimizes the objective is best. And I have a way to search in this function class and pick one, my optimization technique: I formulate the problem as a convex or non-convex optimization, use gradient descent, and pick one.

This is the general scenario basically everywhere, and we would like f̂ to be as close as possible to f. Roughly speaking, you can always think of a relation between f, f̂, and E[f̂]: they make a triangle. We would like f and f̂ to be close, but between f and E[f̂] there is usually a distance, which is our bias; and between E[f̂] and f̂ there is a distance, which is our variance (or, to be more precise, the square root of the variance). This variance plus the square of the bias is the expected squared distance between f and f̂, which is the mean squared error.
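Written out, this is the standard decomposition (stated for a fixed point x, with the expectation taken over training sets; it follows by adding and subtracting E[f̂(x)] inside the square, after which the cross term vanishes):

```latex
\mathbb{E}\!\left[(f(x)-\hat f(x))^2\right]
  = \underbrace{\left(f(x)-\mathbb{E}[\hat f(x)]\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat f(x)-\mathbb{E}[\hat f(x)]\right)^2\right]}_{\text{variance}}
```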
For example: I have a set of points and I would like to fit a function to them. I may choose to fit a linear function; if I do, I have high bias, I'm far from the observations, but the variance is small, meaning that if I fit many linear functions on different samples and take their average, the average is not far from the one I have fitted right now. But the bias is large: I am far from the observations. If instead I choose a function that is very flexible, I may have a very small bias but a huge variance, in the sense that each time I fit the function it comes out different from the previous one, and any single fit can be pretty far from the average. (Student: so this is the high-variance, low-bias case? Yes; and the linear case was high bias and low variance.)

So there is always a trade-off, and in order to minimize the error we need to understand this trade-off and choose the right model. If you look at the learning curve, in all types of learning you will observe something like this: the x-axis is the complexity of your model (a polynomial of degree 2 is more complex than degree 1; a neural network with more layers or more nodes is more complex than one with fewer). As you add complexity, the training error decreases, and you can even make it zero; but the true error decreases only up to some point, and after that it starts to increase. That point is the right model, and beyond it we start overfitting. Intuitively, we learned what overfitting means in the previous lecture.
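A minimal simulation sketch of this trade-off (my own illustration, not from the lecture; assumes numpy, and all names are mine): repeatedly sample noisy training sets from a fixed true function, fit a rigid and a flexible model, and estimate bias and variance of the prediction at one test point.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # the "true" underlying function
x_train = np.linspace(0, 1, 20)
x0, sigma = 0.25, 0.3                 # test point and noise level

preds = {1: [], 9: []}                # degree 1 = rigid, degree 9 = flexible
for _ in range(500):                  # many independent training sets
    y = f(x_train) + sigma * rng.standard_normal(x_train.size)
    for deg in preds:
        coeffs = np.polyfit(x_train, y, deg)
        preds[deg].append(np.polyval(coeffs, x0))

for deg, p in preds.items():
    p = np.array(p)
    bias2 = (p.mean() - f(x0)) ** 2   # squared distance from E[f_hat] to f
    var = p.var()                     # spread of f_hat around E[f_hat]
    print(f"degree {deg}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

The degree-1 fit shows large bias and small variance; the degree-9 fit shows the opposite, matching the triangle picture above.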
Okay, now let's make these concepts more precise and see mathematically what's going on. I assume the true function is f and the estimated function is f̂. I have a set of n observations, my training set, and I'm going to learn f̂ based on it. I assume my observations are generated by the true function plus some noise: yᵢ = f(xᵢ) + εᵢ, and for simplicity I assume the noise is Gaussian, centered at zero with variance σ². So I have a true function; at different data points xᵢ we observe an output, but the output is not perfect: it has been distorted by noise.

What I'm interested in, for a point x₀, is the expected squared distance between f and f̂; I'll write f(x₀) as f₀ for short, and I want E[(f₀ − f̂₀)²] to be small. But when we train a model we don't have f; what we can minimize is the distance between the observation and the prediction of the model, that is, between y₀ = f₀ + ε₀ and ŷ₀ = f̂₀. Let's expand that (there was some back-and-forth here about the signs, but the result is):

    (ŷ₀ − y₀)² = (f̂₀ − f₀ − ε₀)² = (f̂₀ − f₀)² + ε₀² − 2ε₀(f̂₀ − f₀).

Taking expectations, this has three terms: E[(f̂₀ − f₀)²], which is the term I'm interested in, since I want the distance between the true and estimated functions to be minimal; E[ε₀²], which, since ε₀ is centered at zero, is just the variance σ²; and the cross term −2E[ε₀(f̂₀ − f₀)], which I have to deal with.

Consider first the case where x₀ is not in the training set. Then f̂ is estimated based only on the points that are in the training set, so f̂ is completely independent of ε₀. Conceptually the cross term is like a covariance: look at f₀ as a mean that has been distorted, and we observe y₀; the cross term is the covariance between y₀ and f̂₀, and when they are independent it is zero. So when the point we are checking is not in the training set, the third term is zero. Summing over M points that are not in the training set,

    Σ E[(ŷ₀ − y₀)²] = Σ E[(f̂₀ − f₀)²] + Mσ².

The left-hand side is the error I can actually measure empirically using my data.
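A quick Monte Carlo check of this held-out identity (a sketch under my own setup, assuming numpy): on points outside the training set, the measurable squared error minus the unobservable true error should concentrate near σ² per point.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)
sigma = 0.3
x_tr = np.linspace(0, 1, 30)
x_te = rng.uniform(0, 1, 1000)        # held-out points, never used in fitting

gap = []
for _ in range(200):
    y_tr = f(x_tr) + sigma * rng.standard_normal(x_tr.size)
    y_te = f(x_te) + sigma * rng.standard_normal(x_te.size)
    coeffs = np.polyfit(x_tr, y_tr, 5)
    fhat = np.polyval(coeffs, x_te)
    emp = np.mean((fhat - y_te) ** 2)      # measurable held-out error
    true = np.mean((fhat - f(x_te)) ** 2)  # unobservable true error
    gap.append(emp - true)

print(np.mean(gap), sigma**2)   # the average gap is close to sigma^2
```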
So I can conclude that the error I care about is this empirical error minus Mσ², which is a constant. It means the empirical error is a very good estimate of the true error, up to a constant, provided the points are not in the training set. This is the justification behind the cross-validation technique. In cross-validation we train our model on a training set, but we also have a validation set, a set the model hasn't seen; we are not reading off the error on the training set, we evaluate the error on the validation set, and we claim that the validation error is a good estimate of the true error. That's the reason: the empirical error on unseen points estimates the true error up to a constant. That's not the case when the point is part of the training set, which I'll get to. Any questions? (Student: why minus? Because the Mσ² is on that side of the equation; I take it to the other side.)

So what happens in the more interesting, and most common, scenario where the point is in the training set? We have a data set, we train the model on it, and we try to minimize the error on it. In this case the cross term is not zero anymore: we can't claim that f̂ is independent of the noise ε₀. So we need to estimate this term, and we can, under some conditions, using Stein's lemma. Stein's lemma says: if x comes from N(θ, σ²), that is, x has a Gaussian distribution, and if g is a differentiable function (in fact weak differentiability is enough), then

    E[g(x)(x − θ)] = σ² E[g′(x)].

Let's apply this to our problem, the expectation E[ε₀(f̂₀ − f₀)]. Our ε₀ comes from N(0, σ²), so ε₀ plays the role of x − θ with θ = 0, and f̂₀ − f₀ plays the role of g(x): it is a function of ε₀, because f is not a function of ε₀ but f̂ is; if I change ε₀, f̂ changes. So I can apply Stein's lemma:

    E[ε₀(f̂₀ − f₀)] = σ² E[∂(f̂₀ − f₀)/∂ε₀] = σ² E[∂f̂₀/∂ε₀],

since the derivative of f₀ with respect to ε₀ is zero: f is the true function, it comes before the observations and has nothing to do with my noise, so if I change the noise the true function doesn't change; but my estimate does. In fact you can write this as σ² E[∂f̂₀/∂y₀]: if I perturb the observation y₀, how does the fitted function change? If you're not sure about that step, note that ∂f̂₀/∂ε₀ = (∂f̂₀/∂y₀)(∂y₀/∂ε₀), and since y₀ = f₀ + ε₀, the second factor is one.
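A Monte Carlo check of Stein's lemma itself (my example; g(x) = tanh(x) is chosen only because its derivative is simple):

```python
import numpy as np

# Stein's lemma: for x ~ N(theta, sigma^2) and differentiable g,
# E[g(x) (x - theta)] = sigma^2 * E[g'(x)].
rng = np.random.default_rng(2)
theta, sigma = 0.7, 0.5
x = theta + sigma * rng.standard_normal(1_000_000)

lhs = np.mean(np.tanh(x) * (x - theta))
rhs = sigma**2 * np.mean(1 - np.tanh(x) ** 2)  # d/dx tanh(x) = 1 - tanh(x)^2
print(lhs, rhs)   # the two estimates agree to Monte Carlo error
```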
Let me call this derivative Dᵢ = ∂f̂ᵢ/∂yᵢ. What is this derivative? It tells me: if I perturb one data point, how much does my fitted function change? So it's a measure of the flexibility of your model. If the model is not flexible, say a line, and you perturb one point, how much does the line change? Very little. If it's a quite flexible function and you perturb one point, the function can change a lot. So it's a measure of the complexity of the function you are fitting. If I sum over all n data points in the training set, I get

    Σᵢ E[(ŷᵢ − yᵢ)²] = Σᵢ E[(f̂ᵢ − fᵢ)²] + nσ² − 2σ² Σᵢ Dᵢ.

The first sum on the right is the error I care about, and the left-hand side is my empirical error computed on the training set. So the true error is the empirical error minus a constant plus a term that has to do with the complexity of the model. In this case the empirical error is not a good estimate of the true error: it is biased downward, and even after accounting for the constant it underestimates the true error; to recover the true error you have to add back the component 2σ² Σᵢ Dᵢ, which grows with the flexibility of the model. If the model is not flexible this component is small; if the model is very flexible it is large. And that's why we observe the curve from before: the training error keeps getting smaller as the model gets more complex, but the true error is the training error plus this complexity term, which is small for less complex models and large for complex ones; at some point it is so large that it makes the true error rise again. (Student: are these expectations? Yes, everything here is in expectation.)

So that's the reason for overfitting: what we minimize is the empirical error over the training set, but what we really need to minimize is the true error. One way to avoid overfitting is to add a penalty function to the objective, where the penalty is an approximation of this complexity term; it comes in many different forms. We add a function that is small for models of low complexity and gets larger for models of high complexity. (Student: only for nonlinear models? Not necessarily: even when you fit a linear model you can add such a penalty; linear models are simply less complex.)

This is the basis of regularization; that's why in regularization we add a penalty function. So in regularization we minimize an objective function, a function of the parameters and of the observations, over the training set, because that's the only thing we have; but that's not enough, because if you do only that you will overfit. We add a penalty function that is small for low-complexity models and large for high-complexity ones. We are going to apply regularization to neural networks, where it comes in many different forms. I will show you that weight decay, regularization by the L2 norm, is clearly of this form, and I'm going to show you what the effect of that is on the
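As a concrete instance of "empirical loss plus complexity penalty" (my sketch, not the lecture's example): ridge regression adds α‖w‖² to the squared loss and has a closed-form solution, and larger α visibly shrinks the weights.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Minimize ||X w - y||^2 + alpha * ||w||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 10))
w_true = rng.standard_normal(10)
y = X @ w_true + 0.5 * rng.standard_normal(50)

for alpha in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, alpha)
    print(alpha, np.linalg.norm(w))   # larger alpha -> smaller weights
```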
solution. Then we will see that even things like adding noise to the input are a form of this regularization; dropout is a form of this regularization; even early stopping is a form of this regularization. We are going to see how all of these techniques are forms of regularization and how each method affects the final solution.

Before I switch this slide (I don't want to turn the projector off and on again), let me give you a view of regularization from the Bayesian point of view; what I described was mainly the frequentist point of view. From the Bayesian point of view, what we are doing is finding the posterior of our parameters given the data:

    p(θ | x₁, …, xₙ) = p(x₁, …, xₙ | θ) p(θ) / p(x₁, …, xₙ),

the data given the parameters, times our prior over the parameters, divided by the marginal. From the Bayesian point of view, to find the parameters of your model you maximize this posterior. Instead of maximizing the posterior I can maximize its log:

    log p(θ | x₁, …, xₙ) = log p(x₁, …, xₙ | θ) + log p(θ) − log p(x₁, …, xₙ).

With respect to θ, the marginal is a constant, so it has no effect on the optimization. What is the first term, the data given the parameters? It's the likelihood; actually the log-likelihood. And the second is the log of my prior. Compare this with the penalized objective: you might just work with the likelihood and maximize it, but then you're going to overfit; to avoid overfitting you add a penalty, and this penalty is like the log of your prior. Not every penalty can be interpreted exactly this way; some are exact: if you penalize your objective function with L2, it's like having a Gaussian prior over your parameters, and if you penalize with L1, it's like having a Laplacian prior. But some of the penalties we use do not correspond exactly to a prior. Still, we can view regularization this way, from the Bayesian point of view. Any questions?
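A short derivation of the L2 case (a sketch; τ² is my notation for the variance of the assumed Gaussian prior, not the lecture's):

```latex
% MAP estimation with prior \theta \sim \mathcal{N}(0, \tau^2 I):
\hat\theta_{\mathrm{MAP}}
  = \arg\max_\theta \Big[ \textstyle\sum_{i=1}^{n} \log p(x_i \mid \theta)
      + \log p(\theta) \Big]
  = \arg\min_\theta \Big[ -\textstyle\sum_{i=1}^{n} \log p(x_i \mid \theta)
      + \tfrac{1}{2\tau^2} \lVert \theta \rVert_2^2 \Big]
```

That is, maximizing the posterior is minimizing the negative log-likelihood plus an L2 penalty with coefficient α = 1/τ²: a tighter prior (smaller τ²) means stronger regularization.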
Okay, so I explained that in most machine learning tasks we need to estimate f̂, and that we can add a penalty to our objective function and then optimize that. The most popular penalty, the most popular regularization technique for neural networks, is what's called weight decay. In weight decay we are still finding the weights of the network using backpropagation, but instead of optimizing the cost function J, which is, say, (y − ŷ)², we add the penalty, the squared L2 norm of w, to the objective and optimize that. (It was a bad decision to open these slides, because now I have to go back to the board.) So we penalize the network with L2: my parameters are the weights w of the network, the penalty I'm adding is the L2 norm of w, and I would like the L2 norm to be small afterwards. I call the penalized objective J̃(w; X, y) = J(w; X, y) + (α/2)‖w‖₂²; instead of minimizing J I will minimize J̃. In backpropagation we used to take the derivative of J; now I need the derivative of J̃, which is the derivative of J plus the derivative of the L2 term, and with the usual coefficient of one half that extra term is αw. So in my optimization, where I had w ← w − η × (direction of the error), I just need to add this αw term to the gradient; that's the only thing that changes in the training loop.

So why does this work; why is it going to prevent overfitting? There is an intuitive, short answer and a more mathematical, long answer. The short, intuitive answer: this is my sigmoid function. If you didn't have a sigmoid at the end of each unit, the whole model would be linear: going from one layer to the next is a linear transformation, you take the input and multiply it by a matrix, which is the weights; without these functions the next layer applies another linear transformation, and another, and it doesn't matter if you multiply ten linear transformations, at the end of the day it's a single linear transformation. So without the sigmoid at the end of each unit it's just a linear model; but we do have the sigmoid. If I penalize large w's, it means I want my weights to be close to zero, I prefer small weights; and small weights mean I prefer the behavior of the sigmoid near zero, where it is almost linear; when its input is large it starts to become nonlinear. So I don't let all the sigmoid functions behave completely nonlinearly; I control the nonlinearity. (Student: is that like using a linear kernel? No: if it were completely linear the whole model would be linear, it wouldn't be a neural network, it wouldn't do anything beyond a linear transformation. You definitely need nonlinearity, but you have to control it; you don't want it to be arbitrarily nonlinear. Student asks about normalizing the input: yes, in training neural networks it's good practice to always normalize your input at the beginning; it almost always helps.)
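A minimal sketch of the change to the training loop (names are mine; any gradient routine, e.g. backpropagation, would supply grad_J): weight decay just adds α·w to whatever gradient you already have.

```python
import numpy as np

def sgd_step(w, grad_J, lr=0.05, alpha=0.5):
    """One step on J~(w) = J(w) + (alpha/2)||w||^2:
    the only change vs. plain gradient descent is the alpha * w term."""
    return w - lr * (grad_J + alpha * w)

# Tiny demo on J(w) = ||w - c||^2, whose unpenalized minimum is c.
c = np.array([3.0, -2.0])
w = np.zeros(2)
for _ in range(1000):
    w = sgd_step(w, grad_J=2 * (w - c))
print(w)   # converges to 0.8 * c: shrunk toward zero, as the penalty intends
```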
Now, more precisely, let's see how adding this term changes the solution; not intuitively, but mathematically. To understand this I'm going to write a Taylor expansion of the objective and then see how the solution changes when the penalty term is added. For simplicity I write the objective as a function of w only, J(w), dropping X and y, because the penalty is a function of w alone (in this case; in some cases it isn't).

What was the Taylor expansion? To approximate a function f in the neighborhood of a point a,

    f(x) ≈ f(a) + f′(a)(x − a)/1! + f″(a)(x − a)²/2! + …

Let's expand J around w*, where I assume w* is the optimum solution of J. This assumption implies that the gradient of J at w* is zero. So if I write the expansion, the first term is J(w*); the second term involves the gradient of J at w*, which is zero because w* is the minimizer; and then I have the quadratic term. Don't forget that w is a vector now, not a scalar, so the second derivative means the Hessian H. I could expand further, but let's keep only these terms, so my approximation is

    Ĵ(w) = J(w*) + ½ (w − w*)ᵀ H (w − w*).

Now J̃ is what I need to optimize in weight decay: J̃ is this J plus the penalty, and in backpropagation I need its derivative, the derivative of J (I'll use the derivative of Ĵ instead) plus the derivative of the penalty. What is the derivative of Ĵ? The constant term and the gradient term contribute nothing, and the quadratic term is quadratic, so the factor of two cancels the one half:

    ∇Ĵ(w) = H (w − w*).

(Student: why is that term zero? Because we assumed w* is the optimum of J: if it makes J minimal, it is a point where the derivative of J is zero. And a related aside: earlier I said the empirical error is what we have access to; we can minimize it, but it's not a good estimate of the true error, because you can make it small by increasing the capacity, the flexibility, of the function, and the more flexible the function is, the larger that complexity term becomes.)

So that's my gradient; now I substitute it into the weight-decay condition and set the gradient of the penalized objective to zero: H(w − w*) + αw = 0. Solving this gives the solution of the model after weight decay, provided the quadratic approximation is good. Any questions? Let's solve: Hw − Hw* + αw = 0, so (H + αI)w = Hw*, and the new solution, call it w̃, is

    w̃ = (H + αI)⁻¹ H w*.

That's how the solution changes after weight decay. If α goes to zero, which means no weight decay, then w̃ goes to w*: the αI term vanishes, and H⁻¹H w* = w*.
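A numerical check of this closed form (my sketch; H here is just a random symmetric positive-definite matrix standing in for a Hessian):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
H = A.T @ A + np.eye(5)            # symmetric positive-definite "Hessian"
w_star = rng.standard_normal(5)    # unregularized optimum
alpha = 2.0

w_tilde = np.linalg.solve(H + alpha * np.eye(5), H @ w_star)

# Stationarity of the penalized quadratic: H (w - w*) + alpha w = 0 at w_tilde.
print(np.allclose(H @ (w_tilde - w_star) + alpha * w_tilde, 0))   # True
print(np.linalg.norm(w_tilde), np.linalg.norm(w_star))            # shrunk
```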
So what happens if α is not zero? H is a matrix, and since it's a Hessian it's symmetric, so I can decompose it as H = QΛQᵀ, where the columns of Q are orthonormal eigenvectors and Λ is diagonal with the eigenvalues. Substituting this for H, my w̃ is

    w̃ = (QΛQᵀ + αI)⁻¹ QΛQᵀ w*.

I can write αI as αQQᵀ, which makes no difference since Q is orthonormal, so QᵀQ = I, and as long as Q is not truncated, QQᵀ is also the identity. Then I can factor Q out on the left and Qᵀ out on the right of the bracket: the middle is just Λ + αI, with Q multiplied from the left and Qᵀ from the right, so

    (Q(Λ + αI)Qᵀ)⁻¹ = Q (Λ + αI)⁻¹ Qᵀ,

because the inverse of Qᵀ is Q: for an orthonormal matrix, the transpose is the inverse. Then

    w̃ = Q (Λ + αI)⁻¹ Qᵀ · QΛQᵀ w* = Q (Λ + αI)⁻¹ Λ Qᵀ w*,

where QᵀQ in the middle is the identity. For the moment, suppose α goes to zero: then Λ⁻¹Λ is the identity and w̃ = QQᵀw* = w*, so nothing changes. One way of thinking about this is that Qᵀ, being orthonormal, is like a rotation matrix: Qᵀ rotates my solution, and then Q rotates it back to where it was.

Now I have this term in the middle, and don't forget that Λ is diagonal, Λ = diag(λ₁, λ₂, …), so Λ + αI is diagonal, its inverse is diagonal, and multiplying by Λ gives a diagonal matrix with entries

    λ₁/(λ₁ + α), λ₂/(λ₂ + α), …

So what's happening is: Qᵀ takes my solution and rotates it, and then in each direction i I multiply by λᵢ/(λᵢ + α). If λᵢ is much greater than α, then λᵢ/(λᵢ + α) is close to one: in that direction I do nothing; I rotate the solution and rotate it back. But what if λᵢ is much less than α? Then this coefficient approaches zero; it's a small value if α is large enough. So in the directions where λᵢ is small, I shrink my solution toward zero, and then rotate back.

So what are the λᵢ? They are the eigenvalues of the Hessian matrix. If λᵢ is large, it means I'm considering a direction in which changing the weights changes the objective function significantly, a direction of the Hessian with a large eigenvalue, which contributes significantly to minimizing the objective. If λᵢ is small, it means moving in that direction does not change the function significantly.
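The same rotate-shrink-rotate-back picture, checked numerically against the direct solve (a sketch; numpy assumed, H and w* as in the previous snippet's setup):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 5))
H = A.T @ A + np.eye(5)
w_star = rng.standard_normal(5)
alpha = 2.0

lam, Q = np.linalg.eigh(H)               # H = Q diag(lam) Q^T
scale = lam / (lam + alpha)              # per-direction shrinkage factors
w_tilde = Q @ (scale * (Q.T @ w_star))   # rotate, shrink, rotate back

print(scale)   # close to 1 where lam >> alpha, close to 0 where lam << alpha
print(np.allclose(w_tilde,
                  np.linalg.solve(H + alpha * np.eye(5), H @ w_star)))  # True
```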
So, basically, weight decay keeps the directions that change the objective significantly and shrinks the directions that do not contribute significantly; it keeps only some of the directions. There is a related concept called the effective number of parameters: the effective number of parameters of a neural network under weight decay is

    Σᵢ λᵢ/(λᵢ + α).

You can see that some of these terms are going to be near zero. Originally, with i going from 1 to n, each of these values counted fully; now, for large λᵢ I have roughly one and for small λᵢ roughly zero, so the number of effective parameters is smaller. That's why weight decay helps to avoid overfitting: it reduces the effective number of parameters; it reduces the complexity of the model. The larger α is, the smaller the effective number of parameters: it shrinks more.

(Student: you're cutting out some of the singular vectors, so is using a lower-rank, lower-dimensional parameter vector equivalent to using a high-dimensional parameter vector with regularization? Well, the dimension of the parameter vector has to do with the number of units you have; a lower-dimensional vector means fewer units, which definitely helps to avoid overfitting: fewer units correspond to a less complex model. But the problem with neural networks is that quite often we don't have a systematic way to choose the structure of the model. I decide that the first layer has five units, the next has ten, then twenty, and I don't know whether that's too many or too few. So it's more systematic to fix the structure and still avoid overfitting by controlling, say, the size of the weights, rather than playing with the structure, checking whether it overfits, and changing the number of units. Changing the number of units definitely works and has the same kind of effect, but it's pretty hard to tune. Student: picking the best rank of your solution, meaning the number of connections in the network, is kind of like cross-validating to choose the α parameter; it achieves the same thing but may be harder to do. Yes.)

You remind me of something interesting; maybe I'll talk about it in the next part of the lecture after the break, or we can talk about it offline. The derivation we had at the beginning of the lecture gives me an idea for a final project: controlling the complexity of the network based on that derivative, the derivative of the final output of the model when you perturb one point. Maybe we can compute this derivative with a backpropagation-style technique, the same way we use backpropagation to find the derivative of the objective function with respect to the weights, and use it to estimate the true error rather than just the training error. Let's talk about this later.
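A quick illustration of the effective-parameters formula just given (the eigenvalues here are made up for illustration):

```python
import numpy as np

# Effective number of parameters under weight decay:
# sum_i lam_i / (lam_i + alpha), from the eigenvalues of the Hessian.
lam = np.array([25.0, 9.0, 3.0, 0.5, 0.01])   # illustrative eigenvalues
for alpha in (0.0, 0.1, 1.0, 10.0):
    eff = np.sum(lam / (lam + alpha))
    print(f"alpha = {alpha:5.1f}  ->  effective parameters = {eff:.2f}")
# alpha = 0 counts all 5 parameters; larger alpha progressively turns off
# the directions with small eigenvalues.
```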
(Student asks whether the shrunk weights become exactly zero: no. If you penalize with L1, those weights become exactly zero, but here they are just small. Student relates this to PCA: in the sense that in PCA you rotate your coordinate system toward the directions that are more important, that's true, but I don't immediately see the connection to supervised PCA; maybe you can explain more to me afterwards.) Any other questions? Okay, let's come back at 3:45.
Info
Channel: Data Science Courses
Views: 9,831
Rating: 4.9452057 out of 5
Keywords: Machine Learning (Software Genre), Neural Network (Field Of Study), deep learning, Regularization, weight decay, Ali Ghodsi
Id: 21jL0I6wbns
Length: 89min 0sec (5340 seconds)
Published: Wed Sep 30 2015