Ali Ghodsi, Lec [1,2]: Deep Learning, Perceptron, Backpropagation

Captions
I'm going to start with the perceptron. As I mentioned, I may bore some of you, but it's necessary so that we're all on the same page. The perceptron, as I said in the first part of the lecture, is just a simple linear classifier. Graphically it is usually shown as a single unit: the input is a d-dimensional vector x = (x_1, ..., x_d), where each x_i is a scalar.

You can think of the perceptron as having two parts. The first part computes a weighted sum Σ_i β_i x_i + β_0: there is a weight for each of the input nodes, plus the bias β_0. The second part applies a sign function to this sum, so the output of the perceptron is either +1 or -1. If you have a two-class problem where the labels of one class are +1 and the labels of the other class are -1, the perceptron can assign labels to the two classes.

As I told you, it was not clear at the beginning how to train the perceptron. Now that I explain it, it sounds trivial, but it took a couple of years for the best researchers in the world to figure out how to find these weights correctly. The decision boundary is a line in two dimensions and a hyperplane in d-dimensional space; each side of it is a half-space. I can write it as βᵀx + β_0 = 0; that is the equation of the line, or hyperplane.

We want to use this as a decision boundary, in the sense that all positive points are on one side and all negative points are on the other. Learning means choosing β and β_0 so that the line is in its correct position; that is what learning a perceptron means.

You know how we do learning: we assume a family of hypotheses, here the family of linear discriminant functions, and within this family we search for the best one. Searching means defining a cost function, an objective function, and trying to minimize the error. So I need a notion of cost and a notion of error: I have to define what I mean by error, and then minimize it. Conceptually the cost is clear: I want all positive points on one side and all negative points on the other, so if I misclassify a point, that is my mistake, and I have to minimize misclassification.
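To make the two parts concrete, here is a minimal sketch of the perceptron's forward computation, assuming Python with NumPy; the names (predict, beta, beta0) are mine, not from the lecture, and treating a weighted sum of exactly zero as +1 is an arbitrary convention.

```python
import numpy as np

def predict(x, beta, beta0):
    """Perceptron forward pass for one d-dimensional input x."""
    a = np.dot(beta, x) + beta0   # part 1: weighted sum  sum_i beta_i x_i + beta_0
    return 1 if a >= 0 else -1    # part 2: sign function, output is +1 or -1
```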
So let's look at the geometry of this line, this discriminant function. Take any two points x_1 and x_2 on the hyperplane. For both of them, βᵀx_1 + β_0 = 0 and βᵀx_2 + β_0 = 0, and this is true for any two points on the hyperplane. Subtracting, βᵀ(x_1 - x_2) = 0. Now β is a vector, and x_1 - x_2 is a vector lying in the hyperplane, and if the dot product of two vectors is zero, they are orthogonal. So β is orthogonal to the hyperplane; that's one conclusion I can make. A second conclusion, from simple geometry: for any point x_0 on the hyperplane, βᵀx_0 + β_0 = 0, which means β_0 = -βᵀx_0.

Now suppose I'm given a point x and I'd like to compute the distance of this point from the decision boundary. I told you that β is orthogonal to the hyperplane, so I can take any point x_0 on the hyperplane and project x - x_0 onto β; that gives my distance: βᵀ(x - x_0). For points on one side it's positive, for points on the other side it's negative, so technically it's not a distance but a distance with a sign; the real distance is its absolute value. Expanding, βᵀ(x - x_0) = βᵀx - βᵀx_0 = βᵀx + β_0, since βᵀx_0 = -β_0. So if you have a point x and want the distance between x and your decision boundary, you simply plug the point into the equation of the boundary, and that gives you the (signed) distance.

In training, I assume I have pairs (x_i, y_i), i = 1, ..., n, where x_i is d-dimensional and y_i ∈ {-1, +1}: labels are either positive or negative. So instead of taking the absolute value, I can multiply this quantity by y_i. If a point is on the positive side, βᵀx_i + β_0 is positive, and times +1 it stays positive; if it's on the negative side, βᵀx_i + β_0 is negative, and times -1 it becomes positive. Either way, y_i(βᵀx_i + β_0) is positive, and this is the distance of point i from the hyperplane.

Why did I compute the distance of a point from the hyperplane? Get back to our original goal: I'm going to train this line so that the positive points end up on the positive side and the negative points on the negative side. Given some positive points and some negative points, there could be a perfect separating line.
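Here is a small sketch of the two quantities just derived, with names of my own choosing. Note the division by the norm of β, which the lecture returns to later: βᵀx + β_0 is the true Euclidean distance only when β is a unit vector.

```python
import numpy as np

def signed_distance(x, beta, beta0):
    """Signed distance of x from the hyperplane beta^T x + beta0 = 0.
    If ||beta|| = 1 this is just beta^T x + beta0; in general, divide by the norm."""
    return (np.dot(beta, x) + beta0) / np.linalg.norm(beta)

def label_times_distance(x, y, beta, beta0):
    """y * (beta^T x + beta0): positive exactly when (x, y) is on its correct side."""
    return y * (np.dot(beta, x) + beta0)
```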
But suppose I assign the weights in a way that classifies the points so that some of them are misclassified. I'd like to correct this, and to correct it, as I said, I need a cost function, and then I have to minimize it. In many cases, especially in neural networks, we use gradient descent to minimize the cost function: you have an objective function, you take the derivative of your error, and in each iteration you take one step toward the minimum. I don't know how to get there directly; I just know the direction, and I go one step at a time until I can't improve any more.

So suppose I want to do gradient descent here. Do you have any suggestion for the cost function? I have to quantify the goodness or badness of a candidate line. If I give you this line and I also give you that line, which one is better? The black one, because the number of misclassified points is smaller. So one way of quantifying the goodness of a line is to count the number of points that have been misclassified. But is that a good cost function to minimize? It's not the best, and if I want to use gradient descent, I can't use the number of misclassified points: don't forget that I have to take derivatives, and a count is not differentiable.

That was the reason I computed the distance: I'm going to define my cost function based on it. (A student suggested placing the line at a good distance from both classes: that would be something similar to the support vector machine, which was invented years after the perceptron.) So I define my cost function based on the distances of the misclassified points to the decision boundary. Let M be the set of all misclassified points. My error function, which is a function of my parameters β and β_0, is the sum of the distances of the misclassified points to the decision boundary:

E(β, β_0) = -Σ_{i ∈ M} y_i(βᵀx_i + β_0).

If no points are misclassified, this is zero. Compared with just counting, knowing how far each misclassified point is from the boundary is a good thing: a point that is close is better than one that is far. When a point is far, the decision boundary should rotate more to capture it; when it's close, it should rotate less. So the distance gives a good indication.

Why do I need the negative sign here? Because I'm talking about misclassified points, points on the wrong side. For correctly classified points, y_i times βᵀx_i + β_0 is positive on both sides. But when a point is misclassified, the expression is positive while the label is negative, or vice versa, so the product is negative, and I need the negative sign to make sure these summed terms are positive.
So that's my error function, and having an error function, it's easy to apply gradient descent: take the derivative, and then take one step in its direction. The derivative of this error function with respect to β is -Σ_{i ∈ M} y_i x_i, and the derivative with respect to β_0 is -Σ_{i ∈ M} y_i.

So I have the derivative of my cost function with respect to my parameters. How does gradient descent work in general? It's quite simple: if my parameter is w, I update w ← w - ρ ∂E/∂w. I need the derivative of the error with respect to w, meaning I need to know how the error changes when I perturb w. I initialize the model with some initial value of w, see what the error is, take the derivative, and go one step in the negative direction of the derivative; ρ is the length of my step: the bigger it is, the larger the steps I take toward the minimum of the error.

There's no guarantee this reaches the global minimum if the function is non-convex: if the cost function is convex, we reach the global minimum; otherwise we can get stuck in a local minimum. Say the function is non-convex and I'm at some point on it: I see which direction decreases the error, I go down a little, and down, and down, and I get to a point where the derivative is zero. So I don't move anymore and I stay there, even though the global minimum is somewhere else. That's how gradient descent works; we can't do better than that using gradient descent.

Applying gradient descent to our method: for β the derivative is -Σ_{i ∈ M} y_i x_i, and for β_0 it is -Σ_{i ∈ M} y_i, so in the update rule the two negative signs cancel. In practice, people usually don't even use the sigma. What people usually do to train a perceptron is: given some data points, assign some random weights, feed one of the points to the perceptron, and compute the error according to the current decision boundary. If the error is not zero, so the point is not classified correctly, take the derivative and update the weights according to the gradient-descent step; the weights are then guaranteed to be better than the previous weights. Then you do this for the second point, the third point, and so on, going over the data again and again until there is no error.

And you can prove that if the two sets are linearly separable, this algorithm definitely finds a solution that separates them. Linearly separable means there is a γ such that y_i(βᵀx_i + β_0) ≥ γ for all x_i. If this is the case, the algorithm definitely finds a decision boundary that perfectly separates the two classes. But there could be an infinite number of such solutions, and it will find one of them.
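As a sketch of the gradient just derived, assuming X is an n-by-d NumPy array of points and y a vector of plus/minus-1 labels; the function name and the convention that points exactly on the boundary count as misclassified are my own choices.

```python
import numpy as np

def perceptron_gradient(X, y, beta, beta0):
    """Gradient of E(beta, beta0) = -sum_{i in M} y_i (beta^T x_i + beta0),
    where M is the set of points the current boundary misclassifies."""
    M = y * (X @ beta + beta0) <= 0               # mask of misclassified points
    g_beta = -(y[M][:, None] * X[M]).sum(axis=0)  # dE/dbeta  = -sum_{i in M} y_i x_i
    g_beta0 = -y[M].sum()                         # dE/dbeta0 = -sum_{i in M} y_i
    return g_beta, g_beta0
```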
As I said, there is an infinite number of solutions: this is one solution, that's another one, this is another one, and the algorithm finds one of them depending on the initial value.

So let's get back to this question: why don't I use the sigma here, that is, why do I use the error of one single point rather than the error over all misclassified points? If you use the sigma, that is gradient descent; if you don't use the sigma, that is stochastic gradient descent. We'll see the details of stochastic gradient descent and gradient descent later in this course, and there are also other variations, like mini-batch gradient descent. The stochastic version is faster; it's an approximation. If you want to be exactly correct, you have to put the sigma there, but computationally it can be expensive when you have many points: if you have one million points, each iteration has to go through a summation over one million points, which is quite expensive. So you use the stochastic version: with the sigma you get the exact direction, but you can show that with a single point, stochastically, you get a good direction most of the time. It may happen that using only one point you go wrong in one iteration, but overall you generally move in the right direction, and it's way faster.

(Question: do we reuse the same points?) You have a training set, and you have to go over it again and again; we go through all the points in the training set. When you do stochastic gradient descent, people usually shuffle the points in each pass: we don't keep the order from x_1 to x_n; in the second round we shuffle the points and go through them again.

(Question about the overall procedure:) Yes, exactly. I have this training set, I initialize β with some random value, and it turns out I get some line. It's not a good one; I compute my cost: some points are misclassified, so I add up the distances of the misclassified points to the line. That's my cost function, and I want to make it smaller, so I take the derivative, which tells me in which direction to change β and β_0 to make it smaller. I take one step, which gives me, say, a new line with a smaller cost, and I keep going until it separates the points. So yes, in each iteration I compute the error again, then take the derivative, and so on.

As I said, if the data is linearly separable by this definition, this algorithm finds a solution. If it's not linearly separable, so there is no way to separate the classes by a linear decision boundary, you can still get a decent solution, but in practice you'll see that it starts to oscillate between solutions: the error gets larger and smaller, larger and smaller, and you have to stop it. For the separable case, you can prove that the algorithm finds a solution in a finite number of iterations, and this finite number depends on γ: the larger γ is, the fewer iterations you need; the smaller γ is, the more iterations.
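Putting the pieces together, here is a minimal sketch of the single-point (stochastic) training loop just described, with shuffling on every pass; the fixed learning rate, the epoch cap, and all names are my own choices. The loop stops early once a full pass makes no mistakes, which the convergence argument above guarantees will happen for linearly separable data.

```python
import numpy as np

def train_perceptron(X, y, rho=1.0, max_epochs=100, seed=0):
    """Single-point perceptron updates; X is n-by-d, y holds labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(n):                # shuffle the order each pass
            if y[i] * (X[i] @ beta + beta0) <= 0:   # point i is misclassified
                beta = beta + rho * y[i] * X[i]     # beta  <- beta  + rho y_i x_i
                beta0 = beta0 + rho * y[i]          # beta0 <- beta0 + rho y_i
                mistakes += 1
        if mistakes == 0:                           # a separating line was found
            break
    return beta, beta0
```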
(Question: don't the coefficient vectors have to be normalized?) Yes, otherwise that quantity is not the distance. If β is not normalized, you have to divide by the norm of β; the assumption is that it's a unit vector. Any other questions?

(Question about stochastic gradient descent:) Those were two different things. Stochastic gradient descent, which we'll go through in detail later, means the following: you have an objective function that can be written as a summation of parts, say Φ = Σ_i Φ_i, such that each Φ_i is the error of one point x_i. In stochastic gradient descent, instead of subtracting the learning rate times Σ_i ∇Φ_i, you subtract the learning rate times ∇Φ_r for a single point r: you get rid of the sigma, for computational reasons, because it's expensive to go through all the points when there are many data points. That is stochastic gradient descent. And that was the definition of linearly separable data: the data is linearly separable if you can draw a line such that the positive points are on one side and the negative points on the other; sometimes you can't. This γ I told you about is the gap, the margin between the two classes, how far the two classes are from each other. If the two classes are well separated, γ is large, and with a small number of iterations you find the solution; if the two are very close, γ is small, and you need a huge number of iterations. Besides the mathematical proof, it's quite intuitive.

Okay. So the perceptron, as I explained in the first part of the lecture, is the building block of neural networks. The first models people tried to build out of perceptrons, especially after the realization that the perceptron is not the solution to artificial intelligence, that there is a limitation in its nature, that it cannot, for example, solve the XOR problem, put perceptrons together in layers. You have a number of these perceptrons, you have a d-dimensional data point, and you feed all of these perceptrons with your point; there are weights associated with each of the edges. Then I apply the sign function and have outputs here, and these outputs can be the input of another layer of perceptrons; I can have many layers of this form, and eventually one or more nodes as output. People tried this structure and called it the multilayer perceptron, or neural network: a perceptron is a model of a neuron, and this is a network of neurons, so the claim was that it's a model of the brain. This is the input layer, these are the hidden layers, and this is the output.

The problem was that people didn't know how to train it. By that time we knew how to train a perceptron, because we were able to take the derivative of the error with respect to the weights. But here, suppose we have y as the correct output and ŷ as what the network predicts.
I can define some sort of error, some cost function, which is a function of all of these weights, and I need a way to take the derivative of this error with respect to each of these weights, because I want to correct those weights. It was not clear how to do this until several researchers in different places independently discovered backpropagation. Backpropagation is the main algorithm for training feed-forward neural networks (this structure is called a feed-forward network); training means computing the correct weights.

So now we want to go through the details of backpropagation. I think it's clear how this model works: you have an input, you have some weights, and what you see at each unit, similar to the simple perceptron, is a weighted sum of the features, like wᵀx. In the perceptron we applied a sign function to this; in a neural network we put a smooth function instead of the sign function, like the sigmoid function or the hyperbolic tangent, because we'd like to take the derivative of this function, and you can't take the derivative of the sign function.

(Comment from the audience:) Yes, exactly. That was one interpretation of neural networks: there was the claim that it's a model of the brain and so on, but then some people realized that if the activation is the sigmoid, the network is basically logistic regressions stacked on top of each other.

So we have a smooth function here because we need to take derivatives. The most popular one is the sigmoid; some people use the hyperbolic tangent; and recently other activation functions have become more popular. We're going to see those functions, which are quite recent and faster in training than the sigmoid.

So each unit computes a weighted sum, then the nonlinear function is applied, there's an output, and this output goes on to the next layer. We do need this nonlinear function, otherwise the model doesn't make any sense: each layer is basically a matrix multiplication, a matrix W times x, then another matrix times that, and so on. Each layer is a linear operator, and a composition of linear operators is just a linear operator. If you don't have the nonlinear part, everything eventually collapses to a linear model; the whole network would be linear without that sigmoid, hyperbolic tangent, or whatever nonlinear function.
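Here is a tiny numerical check of that collapse, with made-up random matrices: two stacked linear layers with no nonlinearity between them are exactly one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # layer 1: 3 inputs -> 4 units, no nonlinearity
W2 = rng.standard_normal((2, 4))   # layer 2: 4 units  -> 2 outputs
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)         # "deep" linear network
one_layer = (W2 @ W1) @ x          # a single linear map with matrix W2 W1
print(np.allclose(two_layers, one_layer))   # True: the model stayed linear
```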
I'm going to go through the details of backpropagation, but the idea is this: we start with some initial weights, apply the network to an input, and see what the final error is. Backpropagation tells me how to compute the derivative of this error with respect to each of these weights, and then, using gradient descent or stochastic gradient descent, I can update each particular weight and make it better for the next iteration.

So we're talking about the feed-forward neural network and its backpropagation. What I've drawn is just a snapshot of a huge neural network: this is one node of one layer; let's call this layer l. In layer l there are many nodes, and I've taken just one of them; then I have another layer of nodes, layer i, and this is one of its nodes; and another layer, layer j, with one of its nodes. These are perceptrons, and as I said, in the first half of a perceptron we take a weighted summation of the inputs, and in the other half we apply a nonlinear function to it. I call the first part a and the second part z: this unit has a_l and z_l, this one a_i and z_i, and this one a_j and z_j. For any unit r, z_r = σ(a_r), where σ is the nonlinear function. It could be many different functions: for example the sigmoid σ(a) = 1/(1 + e^(-a)), which takes values between 0 and 1, or the hyperbolic tangent, which is between -1 and 1 with a similar shape. And there are weights: for example, a_i = Σ_l u_{il} z_l; each a is just a weighted linear sum of the outputs of the previous layer.

Now I have to define an error function and take the derivative of that error function with respect to one of these weights. The error function could be many different things; for the moment, for simplicity, let's assume my output is only one unit, so the output is ŷ. To be precise, my training points are (x_i, y_i), i = 1, ..., n, and the network generates ŷ; let's assume for simplicity that my cost function is (y - ŷ)².

So I need to take the derivative of this cost function with respect to one of the weights, say u_{il}. Backpropagation is in fact an application of the chain rule; we're just going to use the chain rule many times. The derivative of the error with respect to u_{il}, meaning how much the final error changes if I perturb u_{il} a little, can be written as

∂E/∂u_{il} = (∂E/∂a_i)(∂a_i/∂u_{il}).

What is ∂a_i/∂u_{il}? It's z_l: a_i is Σ_l u_{il} z_l, the summation of the previous layer's outputs times the corresponding weights, so the derivative with respect to this particular weight is just z_l. What about the derivative of the error with respect to a_i? Do I know this value? No, we don't yet, so for the moment I'll just call it δ_i = ∂E/∂a_i.
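Since the derivative σ'(a) is about to appear in the recursion, here is a minimal sketch of the two activations named above and their derivatives; the helper names are mine.

```python
import numpy as np

def sigmoid(a):
    """sigma(a) = 1 / (1 + e^(-a)), with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """sigma'(a) = sigma(a) (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

def tanh_prime(a):
    """Derivative of the hyperbolic tangent: 1 - tanh(a)^2."""
    return 1.0 - np.tanh(a) ** 2
```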
For this unknown δ_i = ∂E/∂a_i, I apply the chain rule again and write it via a layer that is closer to the output:

δ_i = ∂E/∂a_i = Σ_j (∂E/∂a_j)(∂a_j/∂a_i).

I actually need the summation over j here: I wrote the derivative via ∂E/∂a_j and ∂a_j/∂a_i, but there are many units j. If I perturb a_i, that changes a_j for all the units j in the next layer, and each of those changes the final error, so I have to take the summation over all units in layer j. The quantity ∂E/∂a_j I call δ_j, just as I called ∂E/∂a_i δ_i.

What about ∂a_j/∂a_i? I can apply the chain rule once more: ∂a_j/∂a_i = (∂a_j/∂z_i)(∂z_i/∂a_i). What is ∂a_j/∂z_i? It's u_{ji}: if I perturb z_i, that's the rate at which a_j changes. And ∂z_i/∂a_i? I know the relation between a and z: z is just the nonlinear function applied to a. That's the reason we didn't use the sign function, because we wanted to take derivatives. So ∂z_i/∂a_i = σ'(a_i), the derivative of my nonlinear function σ evaluated at a_i.

Now I can put all of this together: δ_i, the derivative of the error with respect to a_i, is Σ_j δ_j u_{ji} σ'(a_i), and since σ'(a_i) has nothing to do with the index of the summation, I can take it out:

δ_i = σ'(a_i) Σ_j δ_j u_{ji}.

Something interesting has actually happened in this formula: there is a quantity δ which is not known to me, but I can compute δ_i from the δ_j. If you tell me the deltas of this layer, I can tell you the deltas of the previous layer, and so on: δ is unknown, but I can define the δ of one layer based on the δ of the next layer. Why is that useful? Suppose I have a feed-forward network with layers and layers and layers, and at some point I get to the last layer; say this is layer k, and the output here is ŷ. What is δ_k? By definition, δ_k is the derivative of the error with respect to a_k, the input of the output unit. To make it simple (you can do it differently), I assume this last perceptron is a special perceptron that doesn't have a sigmoid: whatever comes in goes out; it's just the summation. In this case I have to take the derivative of the error with respect to ŷ, and I can take this derivative, because my error function was (y - ŷ)²: since ŷ is a scalar I don't need a norm, and the derivative is -2(y - ŷ). So I can compute δ for the last layer, k.
And if I can compute δ for layer k, I can compute δ for the layer before it, and for the layer before that, and so on: the deltas back-propagate, and knowing the last one, I can go back to the first one and find them all. And if I know the deltas, I can find the gradient, because the derivative of the error with respect to any weight is just δ_i z_l. So ∂E/∂u_{il} = δ_i z_l is one important formula, and the recursion for δ is the other important one.

So I can do something quite simple: I feed my network, with some initial weights, a data point, and compute the error. Then I'd like to know how I should change my weights. I can use gradient descent, and to do gradient descent I need the gradient of the error with respect to each weight. To compute it I need δ_i z_l, and z_l is known, because I fed the network forward: I applied one data point, took the sums, applied the nonlinear functions, so I know all the a's and all the z's given the initial weights. I don't know the deltas, but it doesn't matter, because I can compute the delta of the last layer and then back-propagate it to find the deltas of the other layers. That means I can find this derivative for every single weight, and having it for every single weight means I can correct every weight in the right direction.

I'll write down the backpropagation algorithm exactly, but yes: you start with some random initial weights, feed the network one single data point, compute the error, compute all of these derivatives, and correct the weights; then go to another data point, and so on. You have to do this many times: you go over all of your training points many times, the same as the perceptron, until it converges, that is, until you cannot make it better. And if you can't make it better, it doesn't mean you're in a global minimum: this is a pretty non-convex function, so you're in a local minimum.

It was the common belief until very recently that this is a very bad property of backpropagation, that it leads to local minima. It turns out that some researchers now believe that in high-dimensional spaces these local minima are not really bad: there are many local minima, and they are more or less as good as each other, so no matter which initial value you start from, you get to a good solution when you're in a very high-dimensional space. We still don't really know many things about deep networks: why they work, why they work so nicely. There are many open questions, not so much technical ones as questions about the reason behind the success of these models, and it's also a sort of open question in optimization why the solutions are good. Of course it depends on the shape of the cost function. If you look at the early literature on neural networks, there are papers about how to choose the cost function so that the number of local minima is minimal, or so that the function is convex, and there is a literature on how many local minima we're going to have; the number is of the order of an exponential, and so on.
Even in 2006, as I showed you, one of the claims of one of the inventors of backpropagation about its limitations was that it gets trapped in poor local minima. In general that's true, but apparently in high-dimensional spaces these local minima are usually not poor; you get a decent solution. There are many algorithms, but backpropagation is still one of the dominant algorithms for training. (Question: does the data have to be high-dimensional?) Most of the data we work with is high-dimensional. If you work with images, an image taken by a 6-megapixel camera is a vector of that length, so it's pretty high-dimensional; the same for text and so on. So apparently, for such data, this is not a problem. Any questions?

So, to summarize the backpropagation algorithm: step one, choose some random weights. Then apply x to the neural network and compute ŷ. Then compute δ_k for the output, which in this case is simply -2(y_k - ŷ_k). Then compute each δ_i, which is equal to σ'(a_i) Σ_j δ_j u_{ji}. Finally, compute the derivative of the error with respect to each weight as ∂E/∂u_{il} = δ_i z_l, for all weights. When you have this, you can update all the weights; that is basically the plain backpropagation algorithm.
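Here is a minimal sketch of those steps for a network with one hidden sigmoid layer and a linear output unit, matching the cost (y - ŷ)² used above; all names are mine, biases are omitted for brevity, and U1 and u2 are assumed to be float NumPy arrays (a matrix and a vector) that are updated in place.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y, U1, u2, rho=0.1):
    """One stochastic backpropagation update; cost is (y - yhat)^2."""
    # Forward pass: keep the a's and z's, they are needed for the gradients.
    a1 = U1 @ x                    # pre-activations of the hidden layer
    z1 = sigmoid(a1)               # activations of the hidden layer
    yhat = u2 @ z1                 # linear output unit (no sigmoid at the end)

    # Delta at the output: -2(y - yhat).
    delta_k = -2.0 * (y - yhat)
    # Back-propagate: delta_i = sigma'(a_i) * sum_j delta_j u_ji,
    # using sigma'(a) = z (1 - z) for the sigmoid.
    delta_1 = z1 * (1.0 - z1) * (u2 * delta_k)

    # Gradients dE/du = delta_i z_l, followed by one gradient-descent step.
    u2 -= rho * delta_k * z1
    U1 -= rho * np.outer(delta_1, x)
    return yhat
```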
One implementation note about backpropagation: when you initialize the weights of the network, it's better to use small weights close to zero. When weights are large, the model can overfit. I believe you're familiar with the concepts of overfitting and underfitting from previous courses, but overfitting is a real threat in any type of learning: when you have a flexible model, you can torture the model to the point that it learns all the details of your training set, which is quite bad, because it cannot generalize. If you want to teach someone the difference between a person and a tree, you can distinguish them by one feature: people are usually not more than 2 meters tall, and trees usually are. That would be enough to get, say, 90% accuracy on your test set. But if you go into too much detail about the training set (the height is 175 or 170 or 172, the color of the shirt is red or whatever, the type of the shoes is this or that, plus all the details of the trees), then you can't generalize: if I show you a new person who wasn't in this class, and doesn't have that type of shirt in that color and that height, you can't recognize that it's a person, and the same for a tree. You have overfit your model; you shouldn't give too much detail to the model. Each model has a flexibility, a capacity, a complexity, and if the model is too flexible, as neural networks are, the threat is overfitting.

Overfitting can happen with large weights for this reason: look at the sigmoid or the hyperbolic tangent. Close to zero, the behavior of the function is almost linear; far from zero, it becomes nonlinear. So large weights can lead to overfitting, because the model starts to behave quite nonlinearly and becomes very flexible, while small weights collapse the whole model toward a linear model, which restricts its capacity. So start with small weights as initial values. One way to avoid overfitting, which I'll tell you more about later, is weight decay: we add one term to the cost to make sure the weights don't get too large and the model doesn't overfit. One of the problems with neural networks in the past was that we didn't have enough data points, so overfitting kept coming up; now we have so much data to feed them that it doesn't happen as often.

I also wanted to mention that there was a common belief that plain backpropagation cannot be applied to deep networks; now we have learned that it can, and there are many deep-network methods that just use plain backpropagation for training. Any questions?

(Question about RBMs:) RBM stands for restricted Boltzmann machine, which we'll go through later in this course. The idea of RBMs in the 2006 paper was this: suppose you have a deep network with many layers, and you can't do backpropagation; how can you train it? The idea was to train it layer by layer: train these two layers, then the next two layers, and so on. But what does it mean to train two layers on their own? If I have a supervised problem, this is my input and my output is y; what should I use as the target in the middle? The training there has to be unsupervised, and that's where the restricted Boltzmann machine comes into play: in the first papers, the first attempts at deep networks, these layers were restricted Boltzmann machines. Roughly, I have a representation of the data in d-dimensional space and ask what the best representation of the data in d'-dimensional space is; there is a notion of energy that I want to minimize, and I look for the representation that minimizes this energy. You have seen other types of unsupervised learning, like PCA: I give you x in d-dimensional space and you give me some representation of the data in another space. This is exactly the same kind of thing: find one representation of the data in another space, then another on top of it, and so on, up to the last stage; at the last stage I put my output and train it with the real outputs. Then I use these weights as initial values and do some sort of fine-tuning, with an algorithm similar to backpropagation, to correct them.

That was the belief: if you go with backpropagation from scratch, when you have many of these layers, the derivative of the error becomes negligible after a couple of layers, and basically you can't compute it; you can't trust the gradient because it's almost negligible. This is a pretty dangerous problem. There are two problems here: vanishing gradients and exploding gradients.
Exploding gradients are better in a sense, because you realize that the program is broken and not working: there is a bug there. Vanishing gradients are more dangerous: you don't even realize what's happening; the direction is just wrong, you go on and on, it doesn't converge, and you don't know why. That was the problem, but there are solutions for it now, and there are methods that use plain backpropagation to train deep networks. Compared to restricted Boltzmann machines, I can say that most new methods don't use restricted Boltzmann machines anymore; it's plain backpropagation, or convolutional networks, or RNNs, and these are the three most common ones. Up to a couple of years ago the most common one was the restricted Boltzmann machine; not anymore. Any other questions?

(Question about the names:) Backpropagation is the name of the training algorithm; feed-forward is the name of the structure: this structure of network is called a feed-forward network. There are other structures, like the convolutional network, which has a different structure, or the RNN, which is not feed-forward: there are loops from one layer to a previous layer, for example, so it doesn't just go forward.

Now let me quickly tell you about stochastic gradient descent and gradient descent. As I mentioned, you may have a cost function Φ that can be written as a summation Φ = Σ_i Φ_i, where each Φ_i is usually associated with one data point i; there are many cost functions you can write this way. If you have this type of cost function, you can use stochastic gradient descent. Plain gradient descent, or batch gradient descent, has this form of update: w ← w - ρ Σ_i ∇Φ_i(w); you update w by subtracting the learning rate, or step size, times the gradient of your cost function with respect to your parameters. Because of the nature of this function, a summation over data points, computing the sum can be quite expensive in each iteration when you have many data points: with millions of points, you have to go through all of them for every single iteration.

How do you choose the learning rate? In general you can use cross-validation to go for a good learning rate, and in practice there are many tricks, like a decaying learning rate that depends on the number of iterations: you start with a large learning rate, and in each iteration you make it smaller and smaller. If you have a small learning rate, it means that in each iteration you take a very small step, and it takes forever to get to the minimum. But what can go wrong with a large one? You may overshoot: this is my function, I want to minimize it, and I'm here; with a large learning rate, a big step takes me across the minimum, the next step takes me back across, and the next step again, so it doesn't converge. The learning rate should be small enough to make sure it converges, but if it's too small, it takes forever to converge. So one strategy is to start with a large one and make it smaller and smaller in each iteration, and in general you can apply cross-validation to find a good one.
Unfortunately, in the area of neural networks there are many tricks involved in training and optimization; sometimes it feels like the person who wants to train a network needs to have magic hands. That's not quite good: when you have a convex problem, like the support vector machine, it's quite clear what's happening, what the minimum is, and how to optimize it. With a neural network you need a lot of experience: how to design it, how to choose the structure, how to train it, how to choose the initial values, the learning rate, and so on. (Question about optimization methods:) As I mentioned, optimization is quite a large open problem in the area of neural networks, and people have looked at all types of optimization, like Hessian-free methods and line search, many different things. As for a single dominant method, I'm not sure, but optimization is quite a large area.

So what I wrote is plain gradient descent, or batch gradient descent. In stochastic gradient descent, SGD, you consider only a subset of this summation: you have this summation here, and instead of everything you consider a subset of it, which can be quite efficient. Usually people get rid of the summation entirely and go with only one data point, as we did for the perceptron. That gives the simple algorithm for stochastic gradient descent: choose your initial value, and repeat until convergence: randomly shuffle the data points in the training set, then update using the gradient of one data point at a time. You don't have the sigma here, so you don't have the exact gradient, just the gradient according to one data point. There is an example here that I will post; you can look at it later.

And there is mini-batch gradient descent: instead of looking at the summation over all n data points, you look at the summation over b data points, where b, the size of the subset, is a parameter of the method. The algorithm for mini-batch: choose your initial value, choose b, and repeat until convergence: randomly shuffle the data points in the training set and go through them in chunks; if b is 10, you go over points 1 through 10, then 11 through 20, then 21 through 30, and so on, each time with a subset of 10 data points. Sometimes it works better, but using only one point is the most common.

As I mentioned, you can choose the learning rate adaptively; it can depend on the iteration number t. A common practice is a schedule of the form, for example, ρ_t = c_1/(c_2 + t) with constants c_1 and c_2, so you start with a large value and it decreases as t increases in each iteration. That makes sense: the longer you run, the closer you are to the minimum, and when you're close to the minimum you make the steps smaller to make sure you don't overshoot and miss it. A small sketch of all three variants, with such a decaying step size, follows.
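This is a minimal sketch contrasting the three variants together with such a decaying step size; grad_phi_i is assumed to be a user-supplied function returning ∇Φ_i(w) for data point i, and the schedule constants are illustrative, not from the lecture.

```python
import numpy as np

def gradient_descent(w, grad_phi_i, n, batch=1, epochs=50, c1=1.0, c2=10.0, seed=0):
    """batch = n: batch gradient descent; batch = 1: SGD; otherwise mini-batch.
    Step size decays as rho_t = c1 / (c2 + t)."""
    rng = np.random.default_rng(seed)
    t = 0
    for _ in range(epochs):                   # "repeat until convergence"
        order = rng.permutation(n)            # randomly shuffle the training set
        for start in range(0, n, batch):
            idx = order[start:start + batch]  # one point, b points, or all n
            g = sum(grad_phi_i(w, i) for i in idx) / len(idx)
            rho = c1 / (c2 + t)               # large steps early, smaller later
            w = w - rho * g
            t += 1
    return w
```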
Any questions here? Note that this argument is correct not only for convex functions: if you have a non-convex function, it's again correct; I want to get to this minimum, I shouldn't overshoot it, and close to it I have to fine-tune and take small steps. There are many techniques in general for the optimization of non-convex functions, but in practice, as I said, in high-dimensional spaces you often don't need them, because all of these local minima are almost as good as each other.

I was told that seventeen more people registered in the course during the break, so I don't know how many people are going to register eventually, but soon you'll see a list of papers for presentation. Based on the number of people we have, we need to schedule how to present papers in the second part of the course and how to choose projects; I'm going to let you know. If there are no questions, I'll finish here. Have a good weekend.
Info
Channel: Data Science Courses
Views: 19,042
Rating: 4.9298244 out of 5
Keywords: Machine Learning (Software Genre), deep learning, Neural Network (Field Of Study), Backpropagation
Id: AxC40B6KtSQ
Length: 89min 15sec (5355 seconds)
Published: Thu Sep 24 2015