Machine Learning Lecture 35 "Neural Networks / Deep Learning" -Cornell CS4780 SP17

Captions
Only three lectures left to go — you think so? A new homework is out, and it's all very simple stuff. One second. Can everybody hear me? All right, better.

OK, so a new homework is out. A lot of people are doing well on the Kaggle competition; I had a look today at the private leaderboard — the private set, which will be the one determining the grades — and it's very similar to the public leaderboard, not identical, but generally people are doing very well. I sincerely apologize for the mistake we had in project seven. Sorry — we do a lot of testing, but sometimes something slips through. For those who didn't catch it on Piazza: there was a small mistake in the last function that we wrote, in the Python version only. There was a mistake in the formula of the splitting function that did not match the lecture notes and did not match the homework. It should now be corrected; in case you still have an old version checked out (we didn't overwrite it), just ignore that equation and take the equation from the homework or from the lecture notes. Basically, it was a normalization constant that was wrong. Any questions about this?

There will also be one more project, being rolled out tonight. It's a very simple project — don't be scared, it's the easiest project you've ever done in your life. Most of the work is just reading through it and understanding what you have to do. It's basically deep learning: what you have to do is implement a neural network and train it, and that's all there is to it. That may have sounded complicated or intimidating a few months ago, but by now this is very simple.

OK, so last time we talked about neural networks. Just a little recap. The idea came from the same person, Frank Rosenblatt, who invented the perceptron. His problem was that the linear classifier he invented was too restrictive, so his question was: how can we make it more complicated, or more powerful? The idea was to map our x to phi(x), such that our classifier becomes h(x) = phi(x)^T w + b. Previously, when we talked about kernels, we chose this phi(x) implicitly, by defining it through an inner-product function — that was a good idea. Here, in this case, we are learning phi(x). So in a sense we are learning two things: we are learning w and b, that's the hyperplane, and we are learning the representation of the data. We are mapping the data into some space and learning that mapping, and then in that space we also have a classifier — that's how you can view it. That's a lot more powerful than just learning a linear classifier by itself.

Neural networks have been around for a long time — in fact, modern machine learning in some sense started out with neural networks — and they recently had this renaissance; everybody has gone bonkers over them, for multiple reasons. But it's not entirely fair to say that it's exactly the same algorithms: there are some subtle changes. Let me write them down here, and we will go through them today; some of them we already mentioned last time.
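To make the recap concrete, here is a minimal sketch of that model form — h(x) = w^T phi(x) + b with a learned feature map phi(x) = sigma(Ux + c). The function and variable names below are illustrative, not the course's project code:

```python
import numpy as np

def phi(x, U, c, sigma):
    # learned representation: a linear map followed by a nonlinearity
    return sigma(U @ x + c)

def h(x, w, b, U, c, sigma=lambda z: np.maximum(z, 0.0)):
    # classifier / regressor in the learned feature space: w^T phi(x) + b
    return w @ phi(x, U, c, sigma) + b
```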
The first change — one thing we mentioned last time — is that people use rectified linear units instead of sigmoid transition functions. Just to remind you, phi(x) has the form phi(x) = sigma(Ax + c), and this function sigma, people used to think, has to be a sigmoidal function that basically switches dimensions on or off. Ax + c gives me a new, transformed vector, and the sigmoidal function would then set things to one or to zero: if an entry is above a certain threshold you set it to one, if it's below you set it to zero. That was considered the way you had to do it in neural networks, maybe in part because "neural networks" is the same term as the neural networks in your brain — this was really brain-inspired, and the brain does something like this: neurons have inputs, and if those are above a certain threshold there's a high probability of the neuron firing itself; that's how you get cascades in your brain.

Ultimately, this is a really bad choice for artificial neural networks. The reason is that we use gradient descent — our brain doesn't use gradient descent. If we train with gradient descent, the problem is that this function is completely flat on one side and completely flat on the other: it has no gradient. So people nowadays use the rectified linear unit, where sigma(z) = max(z, 0). By the way, the rectified linear unit was actually invented 20 years ago; people just didn't catch on, and there's a good reason why — back then it wasn't worth it. But nowadays everybody uses it; rethinking the transition functions made a big difference.

The second thing that made a huge difference — and this is really what it's all about — is GPUs. One thing I will mention today, or maybe next lecture, is that neural networks are actually quite expensive to train; they're very slow. That was one reason people abandoned them when kernel machines came around: SVMs were much, much faster, and they were better, because you couldn't train neural networks all that much. But neural network training has a very specific form: it involves almost only matrix multiplications. It's matrix-matrix multiplication — that's all there is; you spend all your time doing matrix-matrix multiplications — and that used to be really slow. So the argument was: you have to do all these matrix-matrix multiplications, it's a really slow algorithm, just use an SVM.

Then came GPUs. And we really have to thank you guys, because you are the generation who, when you were kids, did not go outside and play — instead you played computer games. Thank you; we can't thank you enough, because that drove the development of really high-performance GPUs, which could then be used for scientific computing. They were really invented for 3D operations, but that's all just matrix multiplications, so what GPUs do really, really well is matrix operations. It's actually very easy to write code that runs on a GPU; it's very hard to write code that runs on the GPU faster than code that runs on the CPU. But matrix multiplication is, in some sense, the one golden thing that works really, really well on a GPU, because it's so parallelizable and you have enough operations per memory access, et cetera.
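To see the flat-gradient problem concretely, here is a small illustrative comparison (not course code) of the sigmoid's gradient against the rectified linear unit's; notice how the sigmoid's gradient is essentially zero once the input is far from the origin, while the ReLU keeps a gradient of one wherever it is active:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # nearly zero once |z| is large: the flat regions

def relu(z):
    return np.maximum(z, 0.0)       # sigma(z) = max(z, 0)

def relu_grad(z):
    return (z > 0).astype(float)    # constant 1 wherever the unit is active

z = np.array([-10.0, -1.0, 1.0, 10.0])
print(sigmoid_grad(z))   # ~[0.00005, 0.197, 0.197, 0.00005]
print(relu_grad(z))      # [0., 0., 1., 1.]
```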
So because GPUs came around, suddenly you could train really big neural networks, which beforehand was completely out of the question — and it turns out that if you train them enough, they actually become really good.

The last thing, and this is a really important one, is SGD, which stands for stochastic gradient descent; this is what we will get to. It's a variant of gradient descent that was considered a really stupid algorithm — machine learning has this history of laughing at things and then later on going, "oh, by the way, that's awesome"; so maybe one day genetic programming will rule everything. SGD really made a huge difference. It wasn't taken seriously because, from an optimization point of view, SGD is actually a terrible algorithm — you can prove it's a terrible algorithm — but it has some properties that people did not think of. So it's those three things together — and oh, the last one: rebranding. Deep learning, yes. That was the final piece: in some sense people figured these things out, but they had to rename it to make people realize that this actually is different from the old neural networks.

Good question — the question is: what's the benefit of deep learning over kernels? One very nice thing about deep learning is that it scales linearly with more data. If you get more and more data — say you go from a hundred thousand data points to a million data points — it's just ten times slower. Whereas with kernels, because you have the kernel matrix, it's actually quadratic in the number of data points. The interesting thing is the improvement that you get: if you draw a figure with the amount of training data on one axis and the accuracy on the other, there are diminishing returns — at the beginning every single data point makes a huge difference, and then the curve slowly asymptotes. In order to get kernel machines to work on more data, you have to do approximations: you can't compute the kernel matrix exactly anymore, and typically the approximations cost you accuracy. The problem is that at some point the improvement you get from more data is roughly the same order of magnitude as what you have to pay for scaling the algorithm up. Does that make sense? In some sense you move up the curve and get an improvement theoretically, but because you approximate, you go back down again — so kernel machines stop improving. People realized that if you added more data, you didn't get the benefit anymore. But with neural networks, you could throw more data at them and do better and better. So companies like Google and Facebook generate these data sets with fifteen million data points that they pay people to label, and the benefit from that basically only became apparent with deep learning; other algorithms could not capitalize on it. There are other approaches too, and I will get into this more when we talk about vision.
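As a hedged preview of the SGD mentioned above (the proper treatment comes next lecture), here is a generic mini-batch loop; grad_fn, the parameter names, and the defaults are illustrative assumptions, but the key point is visible in the code: each update touches only a small random batch, so the cost per step does not grow with the size of the data set — part of why neural-net training scales linearly with data.

```python
import numpy as np

def sgd(grad_fn, w0, X, y, lr=0.01, epochs=10, batch_size=32, seed=0):
    """Minimize a loss whose gradient on a batch is given by grad_fn(w, Xb, yb)."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * grad_fn(w, X[idx], y[idx])   # noisy but cheap gradient step
    return w
```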
So the last thing: deep learning — you may have heard a lot about it lately — works really well in two domains. There are really two domains where it's the killer. It's not the answer to everything; that's a common misconception, that it's simply the modern way of doing machine learning. It really works really well on object recognition in images, and that was something that beforehand was very hard, because before you had a long pipeline: you took images, you had to extract features, you put them into an SVM, maybe you made a special kernel, and so on. The nice thing about neural networks is that they just take the image as input and the prediction is the output, and they are much, much better.

There was — actually, I'm still going to get to this — a famous competition every year on who has the best results in object recognition, called the ImageNet competition, run by Fei-Fei Li. This competition had been going on for many years, and it's exactly like the Kaggle leaderboard of this in-class competition. What happened is that every year they got a little bit better: here's my error on this axis, the best submission is here, the other submissions are there; the next year it got a little better, and the year after that a little better again — things got more and more sophisticated. Then 2012 came around and it looked like the same incremental picture, and suddenly Geoff Hinton's team submitted theirs, and it was way down here: with one submission they basically halved the error. Before that, if you continued the graph, it looked like you would not reach that level for several decades — and suddenly, within one year, they made as much progress as was expected to take decades. That was basically by switching to deep learning, trained on GPUs, using these tricks. The next year the best result was even lower, then lower again, and now they're actually better than human performance — at least on this task, computers outperform humans.

That was really something. CVPR is a computer vision conference, and at that conference Geoff Hinton explained his algorithm, and more people came to that talk than came to the conference — they had to get extra chairs and so on; everybody just wanted to see him explain how the hell that happened. That was what everyone was interested in, and the next year, basically, nobody does anything but deep learning at these conferences.

So the first domain is images. Does anyone know what the second application is, where deep learning is really, really amazing? Speech — that's right. That's what enabled Siri: if you have an iPhone or an Android phone, you can talk to these things now. There was a similar drop there: people had been working on speech recognition for decades, and suddenly deep learning came along, and within a few years they halved the error rate every year until it was good enough that you could actually use it. That started already around 2006, right when the iPhone came out.

OK, so now let me actually show
you a demo today — I promised you a demo because of the screw-up on project seven. All right, so here's the idea. We basically want to learn a classifier h(x) = w^T phi(x), where phi(x) = sigma(Ax + c); A is a matrix and c is a vector — actually, I think I called it U earlier, so let's call it Ux.

The first thing people ask is: how do you learn this? You learn it just the normal way. You have a loss function L = sum_{i=1}^{n} l(h(x_i), y_i), where l can be the squared loss or some other convex loss function that tells you how well you're doing. And how do you minimize it? You take gradient descent with respect to w, with respect to U, and with respect to c. So there's really nothing different here — we've been doing all of these components for several months now.

Let me give you a slightly inside view of what these neural networks are doing. One thing I'm not a big fan of is when people say these are black boxes that just do whatever and nobody understands them — I think they're thinking of the human brain, not of these functions. It's true that at the end we often don't know what the trained function looks like, but if I train a function to recognize pelicans from dogs, well, you also wouldn't know what that function looks like — what function separates pelicans from dogs? So of course you wouldn't know what exact function it is.

So let me give you a slightly different perspective on what exactly is going on inside a neural network. Say I want to do a regression problem, so my loss function in this case is fixed: L = sum_{i=1}^{n} (h(x_i) - y_i)^2. And my h(x) is the following — let me write it as one thing: h(x) = w^T sigma(Ux + c) + b. These are my parameters, and my sigma in this case is the max with zero: sigma(z) = max(z, 0), the rectified linear unit. What does the max with zero do? It operates element-wise: if an entry is positive it just keeps it, and if it's negative it sets it to zero. Any questions at this point? [Answering a question:] Yes — ultimately these are all universal approximators, so you can approximate any function; the choice of transition function is mostly a matter of convenience, of what's best to optimize.

So let's say the function I'm trying to learn — say I have one-dimensional input and one-dimensional output, and I'm trying to learn this function here; this is my x, this is my y. Raise your hand if you're still with me. Good. Now here's what the neural network is really doing. Think of the matrix U as a stack of little vectors u_1, ..., u_H, and of the vector c as (c_1, ..., c_H). Then what is h(x)? If you write it out, it's just a sum over the dimensions of the hidden representation:

h(x) = sum_{d=1}^{H} w_d * max(u_d^T x + c_d, 0) + b
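Before unpacking that sum, here is a minimal sketch of the training side just described — the squared loss for h(x) = w^T max(Ux + c, 0) + b and its gradients with respect to U, c, w, and b via the chain rule. The shapes and names are illustrative, not the project's code:

```python
import numpy as np

def forward(X, U, c, w, b):
    """X: (n, d) inputs. Returns pre-activations a, ReLU activations z, predictions."""
    a = X @ U.T + c                   # (n, H)
    z = np.maximum(a, 0.0)            # sigma = max(., 0)
    return a, z, z @ w + b            # h(x_i) = w^T z_i + b

def loss_and_grads(X, y, U, c, w, b):
    a, z, preds = forward(X, U, c, w, b)
    r = preds - y                      # residuals h(x_i) - y_i
    loss = np.sum(r ** 2)
    dw = 2.0 * z.T @ r                 # dL/dw
    db = 2.0 * np.sum(r)               # dL/db
    mask = (a > 0).astype(float)       # ReLU "switch": gradient only flows where active
    dz = 2.0 * np.outer(r, w) * mask   # (n, H)
    dc = dz.sum(axis=0)                # dL/dc
    dU = dz.T @ X                      # dL/dU, shape (H, d)
    return loss, dU, dc, dw, db
```

Gradient descent then just repeatedly subtracts a small step size times these gradients from U, c, w, and b.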
Does that decomposition make sense? It's the same thing — all I'm doing is writing the matrix-vector product out as a sum over every single dimension of the hidden representation. Raise your hand if you're with me. All I did is decompose the matrix-vector multiplication: what is Ux? Ux is just the vector (u_1^T x, ..., u_H^T x). Does that make sense? OK, thank you.

So what is u_1^T x + c_1? That's a hyperplane, and a hyperplane in a two-dimensional space is a line. So these are lines; where they are positive nothing happens to them, and where they are negative they are set to zero. So each of these things looks like this: you have some linear function, and then suddenly it hits zero and just disappears. This could be u_1^T x + c_1: it's a line, a linear function, and when x is zero its value is exactly c_1. With me? Any questions? Another one could look like this: u_2^T x + c_2, where c_2 is actually down here — at zero I'm at c_2, and then I have this line that goes up.

So what am I doing? I have many of these functions, I add them up, and I give each of them a weight; and at the end I shift the whole thing up or down — that's what b does. So if I add up these two functions, what do I get? In this region the second function is zero, so I just get the first function; but here the second function becomes nonzero, so the sum moves up. The sum of the two functions looks like this: the first function plus the second function. If I now add a third function — say one that starts here and goes this way, that's my u_3 — then at this point, how does the sum continue? (I know the chalk is so faint you can barely see what I'm doing.) This is my sum, u_1 plus u_2, and now the third function comes in here — the sum goes up even more steeply. By changing these functions I could also have a constant piece that I suddenly switch off, or I could make the sum go down again. Does that make sense? You can do anything you want.

So essentially you take the function you're trying to approximate and say: well, this part is pretty straight, so I just take my first function; after this point it gets pretty bad, so I switch on the second function; here I switch on the third, the fourth, the fifth, the sixth, the seventh. And that's all that's going on in a neural network: it approximates functions by piecewise linear components. This is actually quite simple.
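As a concrete 1-D illustration of this sum of "hinges" (the numbers below are made up), each term w_d * max(u_d * x + c_d, 0) is zero until its line crosses zero and linear afterwards, and the weighted sum is a piecewise linear curve whose kinks sit where individual hinges switch on:

```python
import numpy as np

u = np.array([1.0, 1.0, -1.0])   # slopes of the individual hinges (made up)
c = np.array([0.0, -1.0, 2.0])   # offsets: each hinge switches on where u*x + c crosses zero
w = np.array([1.0, 2.0, -0.5])   # weights: a negative weight pulls the sum down
b = 0.0

def h(x):
    return w @ np.maximum(u * x + c, 0.0) + b

for x in np.linspace(-3.0, 3.0, 7):
    print(f"x = {x:+.1f}   h(x) = {h(x):+.2f}")
```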
Internally, if you look at these u's: the first function here is this one, plus some b that shifts the whole thing up; then here comes the second one, another one that goes up, and you add these two up. And if you actually want the sum to go down, you give one of them a negative weight — with a negative weight you just subtract it, and then the sum goes down. You can have another one that you're subtracting, and so on. Any questions?

[Answering a question:] Each line corresponds to a row of my U matrix, and U is an element of R^{H x D}, where D is the input dimension of your data and H is the number of hidden units — something that you choose. So you have a free parameter: how many of these linear components you allow yourself. If you just have one, what you get is just a line; if you have two, you get something like this; and as you switch on more and more, you get a more and more complex fit of the function.

[Another question:] Oh, you don't decide where the pieces go — the neural network decides, because it minimizes the squared loss. If you imagine it switched on a piece in the wrong place, it would incur quite a large loss, because you square the difference between your prediction and the true value; the squared loss penalizes you for being too far off. So it's going to shift that function over to reduce the gap. How do you switch them on and off? You have these little functions, and the moment they are negative they are zero. You can shift them around however you want: change the slope by changing u, and shift them sideways by changing the c value. If I lower the c value, the whole function moves down and it gets switched on at a different point — it's switched on the moment it becomes positive, and then I can add it or subtract it.

[Another question:] Yes, you could have a constant function that is just always on — and that's true, we don't really need the second b. The question was: why do we need that second b? In some sense you could just model it with one more function that's always positive and constant, and you're exactly right: neural networks don't really need it. And actually, in the project I took the constant out — that's why the project is so easy. Are you laughing? One thing people really struggled with last time was that constant. Any more questions?

It turns out it's actually very simple to prove that with a neural network you can approximate any smooth function arbitrarily closely. How do you prove it? It's quite simple: if I make H arbitrarily large — if I'm allowed as many of these linear functions as I want — then I can make the error as small as you like. If you say the error should not be larger than delta, for some very small delta, then I can tell you: OK, I can do this, I just need so-and-so many million linear components — and then I can show that the sum of all the errors is less than delta. Does that make sense? That's essentially how you argue that these are universal approximators.
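To make that counting argument concrete in one dimension, here is a small constructive sketch (an illustration of my own, not from the lecture): one hidden ReLU unit per knot is enough to pass exactly through any given set of target points, so with enough knots the piecewise linear fit can track a continuous function as closely as you like.

```python
import numpy as np

def relu_interpolant(xs, ys):
    """Weights of h(x) = b + sum_k w_k * max(x - xs[k], 0) that interpolates (xs, ys).
    xs must be sorted; each knot contributes one hidden unit."""
    slopes = np.diff(ys) / np.diff(xs)        # slope of each linear segment
    w = np.diff(slopes, prepend=0.0)          # change of slope at each knot
    b = ys[0]
    knots = xs[:-1]
    return lambda x: b + w @ np.maximum(x - knots, 0.0)

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.0, 1.0, 0.5, 2.0])
h = relu_interpolant(xs, ys)
print([round(float(h(x)), 3) for x in xs])    # [0.0, 1.0, 0.5, 2.0]
```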
Being a universal approximator used to be a big deal; nowadays many algorithms can be shown to have this property — they can approximate any continuous function arbitrarily closely. For neural networks it's true, but H may have to become very, very large, so you basically train a lot of parameters. [Answering a question:] No, you select H yourself, it's something you set — that's right, and the more hidden units you have, the more powerful your network.

Oh, I haven't said anything about layers yet — layers, what are you talking about? There are no layers... no, you're right, we'll do layers; in five minutes you will do layers. OK, fine, here come layers, just out of order. People have heard of these layers, so here's the idea. We have h(x) = w^T phi(x), where phi(x) = sigma(Ux + c). Now, what are layers? People wanted to make this more powerful. One way to make it more powerful is to make this matrix U really, really rectangular — increase H a lot, so you have many, many linear components. Another way to make it more powerful is an idea that comes from Russia. I don't know if you know these little dolls: you open one, there's another doll inside; you open that one, there's another doll inside (Matryoshka dolls, I think they're called). So here's what you do. You say: well, actually that's not x, that's phi'(x), and phi'(x) = sigma(U'x + c'). So now you basically have one more doll inside the doll — you have two layers: first one transformation, and then another transformation. And if you want, you can say that one is really phi''(x), and you can do as many of these as you want — you can spend your whole life making layers. That's actually the one thing that is new these days: people use many layers, and that's where the word "deep" in deep learning comes from.

There's a very good reason why people did not do this in the past. Beforehand — when I was a PhD student — everybody just had one layer. Everybody knew you could have more layers; in fact Frank Rosenblatt himself, around 1963, came up with the idea that you could have more layers if you wanted to. But nobody did it. Why not? There are two reasons, a practical one and a theoretical one. The practical reason is that it's very expensive: you always have to multiply with these matrices, matrix multiplications are very expensive, and it took a long time. The theoretical reason is that you could prove there is no function in the world that you can learn with a deep network that you cannot also learn with a shallow network with just one layer. It's a very famous theorem that says all you ever need is one layer — all these other layers don't improve the
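A minimal sketch of that nesting (illustrative names and shapes, not the course code): each layer is a (U, c) pair followed by the ReLU, and the feature maps are simply composed, phi''(phi'(x)).

```python
import numpy as np

def deep_phi(x, layers):
    """Compose several learned feature maps: each layer is (U, c) followed by a ReLU."""
    z = x
    for U, c in layers:
        z = np.maximum(U @ z + c, 0.0)
    return z

def h(x, layers, w, b):
    return w @ deep_phi(x, layers) + b

# illustrative shapes: input dimension 5, hidden widths 8 then 4
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((8, 5)), np.zeros(8)),
          (rng.standard_normal((4, 8)), np.zeros(4))]
w, b = rng.standard_normal(4), 0.0
print(h(rng.standard_normal(5), layers, w, b))
```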
expressivity of your function class. So that was great: people said multiple layers are complicated and expensive anyway, and I can prove to you that you don't need them, so there's no reason to do it — and no one did it. In some sense it was all but forgotten; no one worked on multiple layers, it was considered a silly thing to do. We could prove theoretically that there's no benefit, and it's really slow, so why would you do it?

It really came down to Geoff Hinton at the University of Toronto, who realized that there actually is a benefit — that in practice it makes quite a difference. He bought a whole bunch of GPUs — at the time they were very expensive, so he raised his own money, started a company, and bought tons of GPUs — and trained these networks, and that's how he won that ImageNet competition. That company was then bought by Google. It took him, in some sense, to show the whole world that everybody was wrong about this. And the funny thing is, the theory is not wrong: the theory is right, you can go through the proof, it's correct. It just says nothing about how big that one-layer matrix would have to be — and that's something people didn't really look into.

So let me show you a good analogy, and then I will show you a demo, I promise. Here's the analogy for what I just showed you: the way neural networks work is that they basically build these piecewise linear functions. Now pay attention — you have to see this; this paper is my prop (it's actually not my notes, it's a prop). The piecewise linear functions are these little hyperplanes, so think of it as folding this paper: I have my first piecewise linear piece, so I make a fold here; and now I want the function to go down, so I make a slightly different fold here that goes in a different direction. If you look at it, the function goes up and then goes down — these are my two piecewise linear pieces that I use to approximate a nonlinear function.

Now here comes the cool thing. If you have layers, you can do the following: let me fold this — there's my first layer; this becomes my second layer; now two layers, now three layers. And in that deepest layer I define a piecewise linear function: I make this fold here — you see this — and now another one here. And out comes a beautiful thing: when I unfold the whole thing, there are folds all over the paper. I only folded twice at the top, but I reused the same folds over and over across the entire space. And that is the power of multiple layers. The first layer does exactly the same thing as before, it learns these piecewise linear functions; but what does the second layer do? The second layer takes this entire thing and reuses it — it doesn't work with simple straight lines anymore; it takes the entire construct and basically uses it as a building
block, and the third layer takes constructs of constructs of piecewise linear functions and builds with those. Does that make sense? So you get an exponential explosion in your capacity by having more layers, and that is the powerful thing. The theorem that says one-layer networks are enough requires that matrix to be exponentially wide; but you also get that exponential effect by having multiple layers — just a few more layers multiplies the number of possible lines, or hyperplanes, that you're putting together exponentially with the depth. That's what's so powerful: before, you would have needed this massive matrix that you could never afford to use, whereas now you just have a few small matrices with the same expressive power. Does that make sense? Raise your hand if that makes sense. Awesome. Any questions?

[Answering a question:] So what's the trade-off between having a wide network and a deep network? That is a question people argue about a lot, and the answer is: your network should not be too wide and not be too deep. You're welcome — as consulting, that would have cost you a thousand bucks. That is kind of where we are right now. In some sense this took the whole community by storm, because we used to think we had this theoretical understanding, and now we're going, oh, actually we have to rethink our theory — and we're still working on it; the theory is not mature yet.

[Another question:] But if the first layer only has one function, then it's also hard — yes, you're right. Currently the trend is deeper and deeper and deeper. People have trained networks — in fact I myself am guilty; I had a paper last year that trained networks 1,300 layers deep. A few years ago people would have thrown stones at me for that. Just to make this clear: seven years ago, if you had three layers, that was considered ridiculous — five layers, nobody even tried; it was like, why are you wasting energy, global warming and all. Now people train thousand-layer networks, so it's really changed completely.

Ah, good question: why do these not overfit? That is maybe the biggest question right now — why do these neural networks not overfit so drastically? We have a pretty good understanding now, but it's well beyond this course. Actually, let me tell you: it comes down to stochastic gradient descent, which I didn't have time for today — I will get to it next lecture, I promise. Stochastic gradient descent is a variant of gradient descent that's so bad that it doesn't optimize the function properly, and because of that you don't overfit — that's the short version for now. People originally just used SGD because it was faster — they knew it wasn't very good, but it was fast — and then it turned out that suddenly their networks didn't overfit anymore. We're now starting to understand this theory pretty well; I will hopefully get into it more next lecture. OK, a small demo. So, um,
where are we? OK, the 1-D case is the first thing. I can now take a network — let me just define a few things — this is a regression network. I define a function kind of like the one I just showed you; these are my training points, there's my x-axis, my y-axis, and now I'm going to fit a network through it. And what do you see? There are actually two layers here, so as it trains, it builds these piecewise linear functions — I have 40 components here and another 40 in the second layer, so it fits this absolutely perfectly.

I can also do this with fewer components — actually, I think I prepared this; one second, is this the class demo? Oh sorry, this is not rectified linear units... let me just fix this. OK, it's the 1-D demo; number of layers — let's make that one — and how many hidden nodes do you want? Four? OK, let's do four. So this is the same kind of function, but now I only have four piecewise components. If you look at it — well, that looks horrible. These are the rectified linear units, and it actually depends on how you optimize them: because you have a lot of zero gradients, it sometimes gets stuck. This one got stuck; that doesn't happen if you have more units. Let me do eight. Here we go... why does it say that — oh sorry, I just edited the demo, which was a huge mistake; the transition function is set to softplus, sorry, I should never change my demos. OK, here we go. There you go. Good, now it works.

So this is just two hidden components. If you count carefully — if you have very good eyes — you can see one, two, three pieces. Why three pieces with only two hidden units, when I only have two functions? Yes, that's exactly right: one of them becomes zero where the other starts, so what you actually get is two turning points — here's the first turning point, here's the second turning point. In between, both of them are switched off, then the first one switches on, then the second one switches on. So you get two turning points.

Now I can take the same function and — OK, here — now we have four. Here's what it does: it approximates this better; the loss goes down until it's flat. That's because the rectified linear units have zero gradient when they're off — and I ran it for too many iterations; oh, I ran 4,000, oh my gosh. Anyway, you get the picture.

OK, you know what, let me go to the better demo, I know this is boring — a much cooler demo. You can actually learn really, really complicated things with this. These functions work really well for images, and the amazing thing about neural networks is that you can learn concepts that are quite complicated. So I thought about what's a really complicated concept, and one is whether people are good-looking or not, because
that's something I really have no idea about. So what I did is I took a data set from a web page called Hot or Not, where you can upload an image of yourself and people rate whether you're good-looking or not. I did not upload an image of myself, but I took the images and labeled them all: are they above-average good-looking or below average? So I just have two labels — 50% men, 50% women — and the idea is a binary label: either +1, above-average good-looking, or -1, below-average good-looking. I just cropped out the faces. This is arguably a really hard problem: given just a face, is the person above or below average? When you run it, the training accuracy is of course a hundred percent — it memorized the training set — but the testing accuracy is 74%. It's a balanced data set, so that's a lot better than a coin toss on a really hard problem.

So let me test it for you. Let me get a few test points that we can all agree on. The first one is Heidi Klum — I love Heidi Klum, she's from Germany, she's a supermodel, and she's arguably very, very good-looking, by the fact that she makes gazillions with her looks. I can now take her image, put it through the network, and see: -1 is bad-looking, +1 is good-looking. We get a score around two — that's off the charts; the neural network has the hots for Heidi Klum. Not bad.

Then I thought, what about a guy? I don't know which guys are good-looking, so I asked my wife, and she didn't skip a beat: George Clooney. So here's George Clooney — and I've got to be honest, it bugs me a little that I look nothing like him, but that's a different topic. Anyway, we can put George Clooney through the neural network and see what happens... and here he is. Oh wow — I told you so; not even close.

Then, of course, I couldn't resist, so here's my own picture — maybe there's a second career for me as a swimsuit model, you never know. At this point I was very curious: that's pretty bad, actually — so how do I do compared to the ugliest person on the internet? And I found the ugliest person on the internet very easily: I typed "ugliest person on the internet" into Google, and that's the first image that comes up. All right, let's try this as the lower bound. Let's go... All right. See y'all on Monday.
Info
Channel: Kilian Weinberger
Views: 10,506
Rating: 5 out of 5
Keywords: artificial intelligence, machine learning, cornell, cs4780, course, kilian weinberger
Id: kPXxbmBsFxs
Channel Id: undefined
Length: 49min 39sec (2979 seconds)
Published: Wed Jul 11 2018