Machine Learning Lecture 36 "Neural Networks / Deep Learning Continued" - Cornell CS4780 SP17

Captions
This is the last lecture of machine learning, and please, everybody, close your laptops unless you're in the very last row. A few logistical things. Project 8 is out and, as I said, don't worry: it's relatively easy, a lot easier than Project 7, which is probably among the harder ones. Also, the last homework assignment is out. It is optional, so you don't have to hand it in, but there are two reasons why you should do it anyway: (a) it's good preparation for the exam, and (b) what you derive in Homework 6 is exactly what you have to implement in Project 8, so it helps you a lot; if you've done it, Project 8 is really trivial. Some people asked about practice exams: I posted them on Piazza under Resources, at the very top ("practice final exams"), so please look at those in preparation. Finally, I believe the early sign-up for the conflict exam is now over, so there will be only very few people who can take the exam because of a university-recognized conflict; we will have an alternate date and will post it this evening. Any questions about logistics? Okay, good.

We're talking about deep learning and neural networks. As we saw last time, in deep learning you have a loss function L of your predictor H, your hypothesis, and this can be some generic function that goes over all the samples and computes a loss. A typical example is the squared loss, the sum over i of (H(x_i) - y_i)^2, which is the regression setting. Neural networks are a simple extension of linear classifiers: H(x) = w^T φ(x). Now comes the trick: φ(x) is itself a mapping, φ(x) = σ(U φ'(x)) for some transition function σ (let's just drop the bias terms). That's a one-layer neural network. If you want more layers, you just nest another function, φ'(x) = σ(U' φ''(x)), and you can add as many layers as you feel are necessary; in this case we have a three-layer neural network. So the input comes in, you multiply it by a matrix and apply a transition function (the transition function can just be the max with zero, so you set everything negative to zero), you take the outcome, which is a vector, multiply it by the next matrix, set everything negative to zero, multiply by the next matrix, and at the end you have your regressor.

What I want to talk about today is how to learn such a neural network. We've seen before that when you have a loss function and a classifier like this, you can just do gradient descent, and with neural networks you do exactly the same thing. The only difference is that you don't just have that one vector w; you also have these matrices in between. So these are your parameters: the w of the very last layer, and all the intermediate layers, which each have a transition matrix. How do you do gradient descent? First you have to take the gradient with respect to every single one of these matrices, and then it's exactly the same as before: you take the gradient and make a small update, U becomes U - α times the gradient. It's just like good old gradient descent.
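To make that concrete, here is a minimal NumPy sketch of the forward pass just described: a "max with zero" (ReLU) network. The layer sizes and random weights are made up for illustration; this is not the interface of the course projects.

```python
import numpy as np

def relu(z):
    # the "transition function": set everything negative to zero
    return np.maximum(z, 0)

def forward(x, Us, w):
    # Us is a list of weight matrices, ordered from the input upward
    phi = x
    for U in Us:
        phi = relu(U @ phi)   # multiply by a matrix, clip negatives to zero
    return w @ phi            # H(x) = w^T phi(x)

# toy sizes and random weights, purely for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=5)
Us = [rng.normal(size=(8, 5)), rng.normal(size=(8, 8)), rng.normal(size=(8, 8))]
w = rng.normal(size=8)
print(forward(x, Us, w))
```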
So far there is nothing special about the update itself. The one thing that's a little tricky is that you have to take these gradients with respect to weight matrices that are hidden deep inside the network: you basically have to differentiate through this function, and then that function, and so on, until you end up at the matrix you care about. It turns out there's a very good trick for doing this, and it's called the chain rule. Who's heard of the chain rule? Awesome. Who's heard of chain smoking? It's about the same, actually. Okay, here's what you do. The first gradient is simple; let me write it as ∂L/∂w. In this case it's just the sum over i = 1 to n of (w^T φ(x_i) - y_i) times φ(x_i); that's just the gradient of the squared loss where H(x) = w^T φ(x). Any questions? Yes: x_i is just a sample, a training example, and we sum over all the training examples; that's the gradient. Good question. Any other questions? So far there's nothing special at all: this is just the squared loss, and you've implemented exactly this already. That was the ERM homework: φ(x) was your feature vector, you minimized the squared loss, you computed the gradient, and the gradient update was just "w becomes w - α times this fellow." Raise your hand if you're with me. All right, awesome.
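As a sketch, that first gradient could be computed like this. It follows the board's formula, which drops the constant factor from differentiating the square; stacking the feature vectors into a matrix `Phi` is my own notational choice, not the course's.

```python
import numpy as np

def grad_w(Phi, y, w):
    # Phi has one row per training point: Phi[i] = phi(x_i); y holds the labels
    # dL/dw = sum_i (w^T phi(x_i) - y_i) * phi(x_i)   (constant factor dropped)
    residuals = Phi @ w - y       # n residuals
    return Phi.T @ residuals      # weighted sum of the feature vectors

# one gradient-descent update, as in the ERM homework:
#   w = w - alpha * grad_w(Phi, y, w)
```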
Now comes this guy, ∂L/∂U, and now it gets a little tricky. How do you get it? We just use the chain rule. You might think you have to work through all these nested functions all the way down, each gradient getting more and more expensive, but it turns out to be very cheap, because you can reuse computation, and that is called back propagation. It actually took a long time for back propagation to be invented, long after neural networks were; at the beginning people did not train them with the chain rule at all, and there were good reasons for that. Okay: you want the gradient with respect to this first weight matrix, which sits inside φ(x). Here comes the trick: we can write ∂L/∂U as ∂L/∂φ times ∂φ/∂U, and to make this clean we introduce one piece of notation. Call this a(x), this a'(x), this a''(x), where a(x) takes the output of the previous layer and just multiplies it by the matrix, and φ(x) = σ(a(x)) is the transition function applied to it. Does that make sense? It's like those Russian nesting dolls: inside one doll is another doll, and what I've done is slip a half-doll in between two of the stages. I don't know if that analogy makes any sense, but you see what I'm saying. Raise your hand if that makes sense. (No, sorry, let me go back to the primes; I prefer the primes.) All I'm saying is that inside each layer, two things happen: I take the output of the previous layer and multiply it by the matrix, and I call that a; then I push it through the transition function, and I call that φ. So each layer is a two-step process, and each step now has a name. By the chain rule I can say: the gradient of the overall loss with respect to U is the gradient of the loss with respect to a, times the gradient of a with respect to U. Raise your hand if that makes sense. Awesome. And if this feels boring, it should be boring. I'm trying to make it really mechanical, because it's going to turn into an algorithm: a very simple algorithm that computes the gradients and pops them out one by one, a new gradient every iteration.

Now let's try the second gradient, with respect to U'. I first go through the first representation; the first representation is a transformation of the second representation, and so on down the network. So by the chain rule, same as before: ∂L/∂U' = ∂L/∂a' times ∂a'/∂U'. The trick is that you always take the gradient with respect to the a that encapsulates your weight, and then the gradient of that a with respect to the weight. Now, can anyone tell me what the gradient of a with respect to U is? We have a(x) = U φ'(x), so ∂a/∂U is just φ'(x). That's really easy to compute: you had to compute it anyway when you pushed the input through the network. If you just save all these φ's from the forward pass, they are exactly the gradients you need here. So those parts are already done; all that's left is the part connecting one a to the next, and it turns out that's very easy too: it's just the derivative of the transition function. I'm not going to go through it in more detail, because it's boring and it's on the homework assignment. (Yes, I just admitted the homework is boring. Never mind.) It doesn't help you much if I do it in front of you; the best thing is to work through it yourself, so please take it seriously. The whole homework assignment just says: show that the gradient with respect to this thing equals the gradient with respect to that thing times some term, and so on, and you derive all these updates. The beautiful thing is that what comes out at the end is a very simple algorithm: to get the gradients with respect to all of these matrices, you compute the first one, multiply by one more term to get the second one, multiply again to get the third one, and so on. It's a very simple formula, and in fact I put the pseudocode in the lecture notes: where it says "backward pass", that's the pseudocode.
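Here is one possible rendering of that recursion in NumPy. It is not the official pseudocode from the notes, just an illustration of the same idea for a ReLU network on a single training point: each layer's gradient reuses the previous one, multiplied by one more term.

```python
import numpy as np

def relu(z): return np.maximum(z, 0)

def backward(x, y, Us, w):
    # sketch of back propagation for one training point on a ReLU network,
    # using the a / phi notation from the board (constant factors dropped)
    phis, As = [x], []
    for U in Us:                            # forward pass: save all a's and phi's
        a = U @ phis[-1]
        As.append(a)
        phis.append(relu(a))
    pred = w @ phis[-1]
    grad_w = (pred - y) * phis[-1]          # dL/dw, exactly as before
    delta = (pred - y) * w                  # dL/dphi at the top layer
    grads = [None] * len(Us)
    for k in reversed(range(len(Us))):
        delta = delta * (As[k] > 0)         # times dphi/da: the ReLU derivative
        grads[k] = np.outer(delta, phis[k]) # dL/dU_k = delta * (input to layer k)^T
        delta = Us[k].T @ delta             # push the gradient one layer down
    return grad_w, grads
```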
In Project 8, that's what you have to implement, and when I said it was easy: yes, it's three lines. Three lines of pseudocode. (Yes, your question is whether we really use plain gradient descent or stochastic gradient descent. Give me two minutes; I'll get to it.) The key point is that when you compute the gradient of something deep down in your network, say you have a thousand layers, all you need is the gradient of the layer just before it: you keep almost all of it, multiply by one more term, and you get the next gradient. So computing all these gradients is actually very, very efficient; that's the key. By the way, taking these gradients by hand is less and less important these days. Even five years ago, PhD students spent most of their time computing and implementing gradients; it was a huge pain, you had to make it fast, and so on. Nowadays, and this is really a very recent development, all these frameworks have automatic differentiation: you just write down the function and say "go ahead, compute the gradient," and it computes the gradient for you. This only became really efficient in the last couple of years, so nowadays if you use TensorFlow or PyTorch and so on, these packages all take the gradient for you. You don't have to do it anymore, but it's important to know what's going on. Any questions? All right, good.
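For example, in PyTorch, one of the packages mentioned, a minimal sketch of automatic differentiation could look like this, with toy sizes and random data; the point is only that `backward()` runs back propagation for you.

```python
import torch

# a tiny one-hidden-layer network; autograd does the chain rule for us
x = torch.randn(5)
y = torch.tensor(1.0)
U = torch.randn(8, 5, requires_grad=True)
w = torch.randn(8, requires_grad=True)

loss = (w @ torch.relu(U @ x) - y) ** 2
loss.backward()                 # back propagation, done automatically
print(U.grad.shape, w.grad.shape)
```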
So now we know how to take the gradient, and especially once you've done the homework, it will be crystal clear to you. Okay, good: we take our neural network, we get all these different gradients, and then all we need is gradient descent. We just say w becomes w - α ∂L/∂w, U becomes U - α ∂L/∂U, and so on; you do this for all of the matrices, and then you start over. Now, there are two more subtleties, two changes to what we've done before, because so far this is just normal gradient descent, which you've already implemented.

Here is a subtle but very important difference, and it was one of the reasons people mocked neural networks for years: the objective function is not convex. All the loss functions we had in the past, when we derived gradient descent for linear classifiers, were convex. A convex function looks like a bowl: you can start anywhere you want (we just started with the all-zeros vector), and every time you take a little step you go further down, so eventually you get to the minimum. That was gradient descent on a convex function. We still use convex loss functions for neural networks, but the overall objective is not convex. Why not? What's happening? The loss L at the very top is convex (the squared loss is a parabola), yet I'm telling you the objective is not convex. Why? Because of φ, because of the transition function: you have a nonlinear function in there. So the objective is convex with respect to w, but it is no longer convex with respect to U, U', and U''. If you take the second derivative with respect to U, it's very easy to show that it's not convex. So you're minimizing a function that looks like a rugged landscape, and you're trying to find the minimum. The first thing is: you may as well forget about finding the global minimum. You will never find it. There are exponentially many local minima, little valleys like these, and you will always get trapped in one of them. So finding the global minimum is out the window; what you want is to find a good local minimum.

The most important change from before is that suddenly initialization matters. Where you start is no longer irrelevant. Before, it was always irrelevant: you just started somewhere, typically the all-zeros vector, because it was convenient. That's no longer the case. If you start here and do gradient descent, you take steps down, down, down, and you converge to this local minimum, because gradient descent converges to a local minimum. If you start over there, you end up over there instead. So initialization is a big deal. (Yes, of course: if you can afford it, train ten different networks and take the best; even better, average their outputs and ensemble them. You're right.) And in fact, if you initialize with the all-zeros matrix, you get horrible results, because then every single dimension is the same. Remember what we said last time about what the low-level features learn: the low level learns very simple functions, and the next level builds more complicated functions out of those little functions. But if they're all initialized identically, then every single function stays the same. So what you want to do is initialize randomly: people initialize these U matrices and the w vector with random, low-magnitude Gaussian noise. That's a really important difference between neural-network optimization and, for example, SVMs, and it was one of the reasons people loved SVMs so much: SVMs are optimized globally, you get the global minimum, and you're guaranteed that everybody who trains the same SVM on the same data gets exactly the same answer. With neural networks, if you and I train on the same data and initialize randomly, we get different answers. That confused people; people didn't like it. Okay, good. Any questions about this?
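A small sketch of that random initialization; the 0.01 scale and the layer sizes are arbitrary choices for illustration, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(d_out, d_in, scale=0.01):
    # low-magnitude Gaussian noise; never the all-zeros matrix, because
    # then every hidden dimension computes the same function forever
    return scale * rng.normal(size=(d_out, d_in))

Us = [init_layer(8, 5), init_layer(8, 8), init_layer(8, 8)]
w = init_layer(1, 8).ravel()
```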
All right, here comes the second thing, which I mentioned last time: stochastic gradient descent turned out to be really important. Here's the problem people ran into. The first time around, when neural networks were popular, people used very aggressive solvers: gradient descent with approximated Hessian steps and so on. These are great optimization algorithms, because they get you to a minimum as quickly as possible. Usually you can't afford a full Hessian, a full Newton method, but people built approximate Newton methods, which are very effective; the optimization community had developed these over a long time, and they converge to the minimum in just a few steps, much faster than plain gradient descent. So people said: don't use gradient descent, use these much faster optimization algorithms. And sure enough: you start here, and within a few steps you're at the minimum. That's awesome, and it worked great as an optimizer, but the networks never did very well. The problem is that you're in a highly non-convex space: wherever you start, there are terrible local minima right next to you, not far away at all. If you use these very aggressive optimization algorithms that immediately converge to the nearest local minimum, that's exactly where you'll go, but often those are not the great minima. Often the good minima are somewhere else, far away from where you started; what you'd like is to end up in a deep, wide minimum, which may be much further away. And that's where stochastic gradient descent comes in.

So let me tell you what stochastic gradient descent is. The gradient with respect to every single weight is a sum over the data points, i = 1 to n: every single data point contributes a little bit to the gradient, and you add those contributions up to get the overall gradient. Now here's the idea behind "stochastic". Imagine you have a function like this, non-convex, with some other holes around. Stochastic gradient descent does something very, very simple. It says: instead of computing the gradient over all data points, which is the correct thing to do, I approximate it, and I approximate it with a single point. I say ∂L/∂U is roughly just the gradient of the loss at one single point. That's totally wrong, right? The true gradient is an average of many little gradients, and I'm just looking at one training sample at a time: I compute the gradient that one sample gives me and pretend it was our only training example. So say I'm standing here, and the true gradient points in this direction; if I just take a single sample, my estimate may point in some other direction. That's terrible, but it's something we've seen before. Where have you seen a lot of terrible gradients before? Boosting. It's exactly the same idea: if you have a lot of noisy gradients, all you do is take a tiny little step size. You take a small step in the estimated direction, and on average these gradients are exactly the correct gradient, sometimes a little to the right, sometimes a little to the left. It brings us back to the drunk-walk version of gradient descent: you take a bunch of noisy steps, and on average you go the right way. And here's the cool thing about the cost: computing the exact gradient takes order-n operations, because you have to go through your entire training set, compute each point's contribution, and sum them up.
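A sketch of one pass of stochastic gradient descent, where `grad_one` is a hypothetical helper (not from the course) that returns the gradient of the loss at a single training point:

```python
import numpy as np

def sgd_epoch(X, y, w, grad_one, alpha=0.01):
    # approximate the full gradient (a sum over all n points) by the
    # gradient at a single random point, and take a tiny step; each
    # step is noisy, but on average it points the right way
    for i in np.random.permutation(len(y)):
        w = w - alpha * grad_one(X[i], y[i], w)
    return w
```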
Now, with stochastic gradient descent, in the time of that one exact gradient you can take n steps: a little step, another little step, another little step, and so on. And as you do this, you move along, and the gradient itself changes: once you're over here, the gradient points more in this direction. So after you've passed through half your data, you've already made some progress, and the remaining half of the data points no longer pull, on average, in the direction of the original gradient; they pull in the direction of the gradient at the point where you currently are, which goes more in this direction. So where the original gradient descent algorithm takes a few large steps, like this, then like this, stochastic gradient descent takes many, many teeny little steps; but because you adapt your direction as you go through the data set, you actually get to the minimum faster. Any questions? Yes. Okay, good: you're saying, wait a second, you switched the story; initially you were talking about local minima, and now you're telling me something about drunk people. (That was not exactly what he said.) So let me make this clear, because this is one reason people didn't take stochastic gradient descent seriously. SGD was known, but people actually thought it was a terrible algorithm, and I can tell you why. If I try to prove how many steps it takes to find the exact minimum, up to, say, 10^-5 accuracy, it turns out stochastic gradient descent takes forever. Why does it take forever? Because initially you get very close to the minimum, and then, once you're there, you take a lot of random steps and just never really hit it. Once again, it's the drunk guy walking home: he gets to his house and then can't find the damn keyhole. You're right there, but because it's too noisy, you never really converge. But that's not a big problem, because in machine learning we don't actually care about the exact minimum; the whole loss function was made up anyway.

But here's the even better thing, and this gets us back to the optimization story. We're walking home, and it turns out there are many little holes along the way: the function is non-convex, with many little attractors, little dips, as you walk downhill. If you use an exact gradient method, you compute the gradient here and you find exactly the nearest minimum. But stochastic gradient descent is so noisy that it walks right past these little holes; it just doesn't find those stupid little minima. So because it's such a "terrible" algorithm, it's actually great: it doesn't fall into every single hole along the way, and you only end up in one of the large holes, the ones it can't escape from. People hadn't thought about this, but it turns out that's exactly what you want: you want the large holes. Why? People used to think you want the really deep hole: you want to minimize the loss and get really, really deep, and that's what gradient descent, or at least a second-order method, is really good at. But that's not actually what you want.
Imagine a function that looks like this: the minimum is here, at the bottom of a very steep, narrow valley. I claim that's a terrible minimum to end up in, and what you really want is to end up over here, in the wide one. Can anyone tell me why? Any ideas? Think about what this function is. It's my loss function, and what is my loss function a function of? It's L(H(x_i), y_i), summed over the training set: a function of two things, my data and my hypothesis. If I fix the data, which is what we do during training, I get a function of my hypothesis alone, and I try to find its minimum. But what are we actually going to do later on? At test time we throw away the training data and switch to test data, and the moment we put a different data set in, the function changes. If we sit in one of these really steep, narrow local minima, what happens? We optimized this function, but at test time we face a slightly different function; maybe it has wiggled a little. I was at the minimum here; now the same point is up there, and suddenly the minimum I found is really, really terrible. Does that make sense? Very narrow minima tend to be very specific to the data set you trained on; very wide minima are unlikely to change much if you change the data. So what you really want is to get as low as possible in your loss function at a really wide minimum, and that's essentially the only thing SGD is capable of finding. That turned out to be crucial. (Yes: if you have a very large data set, then the change in the function from training to test becomes very small, and these effects become less dangerous; that's basically what having a very large data set means. So those aggressive methods may have been okay in that regime. But what counts as a very large data set? These are very high-dimensional spaces, and you're always undersampled.) What people found, initially just empirically, without being able to explain it, is that if you train your network with SGD it suddenly works a lot better, and it took us a long time to understand where that came from. (Yes: is the number of gradient steps bounded by the number of training points? No, no: you go over the data set over and over again. In gradient descent, one step means going over all the data points, computing the gradient, and taking the step. With SGD you take any point at random, compute that little approximation of the gradient, make the update, and just keep iterating.) And no, you can't really guarantee you avoid the narrow minima; it's just that a very narrow minimum in a very high-dimensional space is hard to fall into. In one dimension it seems obvious that you'd fall in, but even in two dimensions, imagine this is my function and here's a tiny little hole: if I get too close, the gradient suddenly points into it rather than past it. But because SGD is so noisy, the next step likely points out of it again; you take a step out, no harm done. So it seems unlikely you'll actually end up in there; you're constantly being shaken out.
So here's what people really do in practice to optimize these things; there are two tricks. The first one is: don't take just one random point, take around 64 or 128 points at a time. These are called mini-batches. The reason is very simple: pushing a single point through the network costs about the same as pushing 64 through, because modern computers have these little parallel vector units, caches, and so on; it's just efficient to do it that way. (One second, yes.) The second trick, and this goes back to your question about how we avoid falling into small local minima: initially you take a pretty large learning rate. Yes, it's really noisy, oh my god, but the good thing is that this prevents you from falling into small local minima, and it gets you close to the region where the good local minimum is. So maybe you run a hundred epochs (an epoch is one pass over your entire data set), a hundred epochs with a large learning rate, and after that you're somewhere near the minimum, just jumping around: your steps are too large, so you bounce back and forth across the valley. Then you lower the learning rate by a factor of 10 and take small steps, and now you converge toward the actual local minimum. Let me draw it one more time. Your function typically looks something like this, with the minimum here. You start out here, take a lot of noisy, essentially random steps with the large learning rate, and that gets you to somewhere around here; now you're just bouncing back and forth, so you keep going until you basically stop making progress. Then, once you're here, you switch to a small learning rate and move down until you reach the minimum. And when you look at the training loss of modern neural networks, it literally looks like this: it flattens out, then you drop the learning rate and it falls sharply and flattens again, and sometimes you drop the learning rate once more. (Yes, you absolutely can. The question is: once you've run SGD for a while and you're pretty confident you're near the local minimum, why not switch to a second-order method? You can; in practice it's typically just not worth implementing, because a few iterations with the small learning rate get you there anyway. But yes, you absolutely could.) (Yes: is there a more rigorous answer for why SGD gives us good local minima? Yes. It would take me a month to explain, but there's a whole body of literature now that analyzes this, and you can formalize it in multiple ways. If you come up after lecture, I can give you a few intuitions for why that is the case.)
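Putting the two tricks together, a sketch might look like this; `grad_batch` is a hypothetical helper for the mini-batch gradient, and all the numbers (batch size 64, drops at epochs 100 and 200) are illustrative, not the course's settings.

```python
import numpy as np

def train(X, y, w, grad_batch, epochs=300, alpha=0.1, batch_size=64):
    # mini-batch SGD with the schedule described above: run with a
    # large learning rate first, then drop it by a factor of 10
    n = len(y)
    for epoch in range(epochs):
        if epoch in (100, 200):
            alpha /= 10.0                     # "drop the learning rate"
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - alpha * grad_batch(X[batch], y[batch], w)
    return w
```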
(Yes, the question is: what if the function looks like this, with a whole bunch of minima? That's exactly what it looks like. You have millions of parameters, so you're in a million-dimensional space; there aren't just a few of those minima, there are billions of them, and you will end up in one of them.) One thing that's nice about this, though: if you want to do bagging with neural networks, you typically have enough randomness that you don't have to subsample your data. Normally when you bag, you subsample n data points with replacement; with neural networks, just initializing the network randomly starts you in a different part of the space, and you end up in a different local minimum. Without even subsampling the data set differently, you get quite differently behaving networks. Surprisingly, they tend to have roughly the same error rates, but they make very different mistakes, and that makes neural networks amazingly well suited for ensembles. So if you really want the lowest error, for example in a Kaggle competition, you train five neural networks and average their results, and that actually makes a huge, huge improvement. Okay, any more questions?
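A sketch of that kind of ensembling, assuming `nets` is a list of independently (randomly) initialized and trained networks, each usable as a callable; the averaging is the whole trick.

```python
import numpy as np

def ensemble_predict(nets, x):
    # random initialization alone gives enough diversity between the
    # networks that no subsampling of the data is needed; average them
    return np.mean([net(x) for net in nets], axis=0)
```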
All right, two last things I want to show you. Actually, before we finish, I want to get to one thing: what is the "neuron" in neural networks? People always ask why these are called neural networks. So far this has all just been functions, and I deliberately explained it that way, because it's really not that different from what we've done so far in the class. But let me quickly explain why people call them neural networks and what the neural view is. I'm not a big fan of it, because it leads to weird articles in Wired magazine and the like, but basically people say: you can draw this as a graph. Your x comes in, and in these pictures people draw, every single dimension of x is a little circle; so this column of circles is the vector x, this is the value of the first dimension, this the second, and so on (say it's a five-dimensional vector). Raise your hand if that makes sense. Okay, awesome. Now I push my x through the first function and get a new representation, call it φ''(x), which has its own bunch of dimensions, and the transition is φ''(x) = σ(U''x). Make sense? Raise your hand if that makes sense. Good. Then again: this is my φ'(x), this is my φ(x), and this here is my H(x). Now, each value in a layer is a function of every single input (the first dimension of this layer is basically the first row of the U matrix times x), so people draw an arrow for each of these "connections"; everything is connected to everything, you do this for every single layer, and now you have a "network". I don't like it, and I can tell you why. People have pushed these analogies with the brain, analyzed these "neurons", asked what each neuron does, and so on, and the reason I don't like it very much is that this is really just a d-dimensional space: you could rotate the space and get a totally different representation with exactly the same output, so in some sense it seems odd to interpret the individual dimensions of the space. I don't see much value in it, and it has also led to a lot of misinterpretations and misunderstandings, with people thinking "this is what the brain does", which is not true, and attributing much more to these neurons than they actually do. It's really just a function φ(x), some d-dimensional vector; each dimension has a value, and when people say "look at the value of a neuron", all they mean is one of the dimensions of this vector. Any questions? (By the way, yes: initially it was inspired by the brain; the original neural networks came from that direction.) Okay, good.

Question: in practice, do people use anything other than rectified linear units? Nowadays rectified linear units are very, very popular, and the reason is actually quite simple. With the functions people used before, people thought you had to use a continuous, differentiable function, and if you paid close attention you noticed that the rectified linear unit is not differentiable, so you'd be doing gradient descent on a non-differentiable function. That seemed crazy, and it leads to all sorts of theoretical problems. It turns out that if you do SGD, it's so noisy anyway that it doesn't really matter. So when people started using SGD, they could actually use rectified linear units, and then the advantages over the tanh and sigmoid functions became apparent: tanh and sigmoid are very, very flat. Remember what they look like: they squash everything to between roughly zero and one, so networks with sigmoid or tanh transition functions saturate very quickly. Rectified linear units tend to be better at not getting trapped. (Yes: weight decay was a big deal when people used these sigmoid functions, because you want to keep the activations in the middle range. Weight decay is just another word for L2 regularization, nothing else; it was invented in parallel and happened to be called weight decay. So yes, people still regularize the weights; it's less important for rectified linear units, but it's still done to avoid overfitting.)
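To see the saturation point numerically, here is a small sketch comparing the gradients of sigmoid and ReLU, with weight decay written out as the L2 penalty it is; the sample points are arbitrary.

```python
import numpy as np

def relu(z):    return np.maximum(z, 0)          # not differentiable at 0
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))  # saturates for large |z|

z = np.linspace(-10, 10, 5)
print(sigmoid(z) * (1 - sigmoid(z)))  # sigmoid's gradient: nearly 0 at both ends
print((z > 0).astype(float))          # ReLU's "gradient": always 0 or 1

# weight decay is just L2 regularization on the weights:
#   loss = data_loss + lam * np.sum(w ** 2)
```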
Okay, quickly: this is the demo I wanted to show you last time that didn't really work. The black dots are my training points, and I'm fitting this function; on the left-hand side is a neural network with rectified linear units. The structure is one input, two hidden nodes, and one output: two hidden nodes means my first function maps my one-dimensional input into a two-dimensional representation, and the next layer maps that two-dimensional representation to a one-dimensional output. So what you control here is the number of hidden dimensions. You can see the rectified linear unit network with two nodes basically has two change points, and tanh is much, much smoother. Now I use four nodes, and one thing you can see here is the reason people thought tanh was much better: tanh almost hits every single training point, the error is much lower, whereas the rectified linear units are kind of struggling. But the problems ReLUs have here, that they're harder to optimize, go away when you have millions of them, because then there's always some unit pointing in the right direction. So rectified linear units turn out to be better on complex problems but worse at these little demos, which was misleading. Now I make an eight-dimensional hidden representation, and so on; as I make it more and more complex, you see the tanh network fits this very, very nicely, while the ReLU network still gives you these piecewise-linear functions. And you can see it hallucinating a little: the function goes up here even though there's no data to support that. That's fine.

One more thing I want to show you real quick: TensorFlow. TensorFlow is a package from Google (I think most people actually prefer PyTorch, which comes from Facebook, but TensorFlow is a little easier to use here). We can take a little data set, here are my data points, positive and negative, and construct a neural network; you can play with this yourself if you type "tensorflow playground" into a search engine, it's a nice little demo they provide. Here's my input: this is the first dimension of my input, this is the second, basically my x-coordinate and my y-coordinate. Then I have a four-dimensional learned representation, this is my φ(x), and this is my output; let's just make it one-dimensional, why not. I can now train this thing, and here's the output: it has trained and gets everything correct. And with a one-layer network I can actually look at the individual "neurons": you see it learned a bunch of functions. Now let me type in XOR, make it a little harder, and optimize. You can see it struggling. Come on. So here's what you do when it struggles: you just add more and more neurons. There we go. Better. And what you see here: if I hover over a unit, the image on the right changes; it shows me the activations of that particular function. You can see it learns these individual little functions, and the final output is a composite of them, which nails exactly the XOR data set. Now I could try something really hard, but first this one, which should be easier; let's see if it nails it in no time. Oh yeah, beautiful. Now the spiral. The spiral is actually very hard, so let's see if this works; it probably won't get it. What is it doing? So one thing you do now is add a lot more neurons; let's see what it does. All right, I'll add another layer. It's like a lasagna: more layers is usually better. Is it doing something? The whole thing is JavaScript running on my laptop, so it's a little unfair. Well, we could add a lot more. I don't know. Okay, let's just max it out, all rectified linear units.
Okay, let's see what happens. No, not yet. It will get there eventually; it's not that it can't do it, it just takes a while. All right, let me interrupt this for one last thing. Last time I mentioned that this has become extremely powerful with images, and here's an example: Clarifai is a startup that sells deep learning; all they've done is train deep nets that take images as input and output the classes the image falls into. So here's an image of a baby, and it says: baby, sunset, water, dawn, dusk, and so on. One thing we can do now is try our own image, so let me go to Google Image Search. What should I search for? An elephant? All right. Here's an image; it's a little small. I could try this one. How about this? This is a picture of me when I was younger, when I still lived in Germany. Okay, never mind. So here's the image. "I'm not a robot," okay. What does it say? It says... success! Oh my god. All right, you've got to leave on a high note, so this is the last one. See you next Wednesday.
Info
Channel: Kilian Weinberger
Views: 7,207
Rating: 4.9718308 out of 5
Keywords: artificial intelligence, machine learning, cornell, cs4780, course, deep learning, neural networks, kilian weinberger
Id: zmu9wR2c7Z4
Length: 51min 57sec (3117 seconds)
Published: Wed Jul 11 2018