Lecture 2 "Supervised Learning Setup Continued" -Cornell CS4780 SP17

Captions
All right, a quick word about the exam — no, that was just to get you quiet. Please, those of you who haven't, drop off the placement exam at the table at the end of class. And now we have a three-minute advertisement for the Cornell Data Science project team:

"We have some new competition resources this semester. We have a new cluster — for example, we had a 2.7 gigabyte text file that basically crashed most of our teams last week; on the cluster, using Apache Spark, you can handle files like the one I uploaded and do these kinds of really fun data science things comfortably. You can learn more about it on Wednesday, and if you have any questions, email cornelldatascience@gmail.com. Thank you." [Applause]

All right, terrific. Who still needs a handout from last time? Let's pass them around — could you hand some backwards? If you don't have one, ask the people in front of you or behind you; at least every other person should have one. I printed 370, but some new people arrived, so I'm sorry if I don't have any more.

The placement exam will be graded tonight, so hopefully tonight or tomorrow you will get an email inviting you onto the course web page, or an email saying that maybe you should take the course again later — by the way, this class is likely to be offered again next semester. Also, I got new chalk — the big chalk.

Last time we talked about the machine learning setup. We said we have a data set D that contains n data points, or n pairs: a feature vector and a label. These live in the space 𝒳 × 𝒴, where 𝒳 is the space of all feature vectors and 𝒴 is the space of labels, and the pairs are drawn from a distribution P which is elusive to us. People always ask me afterwards, "What is that distribution P? Distributions usually are Gaussian or something." But this is a distribution you cannot write down — it is basically the distribution of Mother Nature. If I went outside on the Cornell campus and took pictures of students' faces — say I want to build a face classifier, so I ask the first person I meet, "Could I take a picture of your face?" — that defines a distribution: a random distribution over who I run into. If I did the same thing somewhere else, in some other country — say in Beijing — I would get different people, so it would be a different distribution P. There is always a distribution; we don't know it and we don't control it. If we knew this distribution, everything would be really easy, because then you could just compute
P(Y|X) directly: stick in our x and predict the label y. But we don't know P, and that is why we have to do this whole machine learning business.

And by the way, here is an important point: when you collect your data, you have to collect it from the right distribution. Now that I mention the face example — Nokia is a great example. I don't know if people remember; Nokia used to be a company that made phones, and one thing they wanted was a face recognition system: when people take a picture with the camera, it automatically detects the face, because usually the face is what you want to zoom in on and make sharp, and so on. They trained their face recognition system on random people — essentially random people from the streets of Helsinki. The problem, of course, is that Nokia phones are sold everywhere in the world, and what they forgot was that they basically had no people in their data who were not Caucasian. Then the phone was used in the United States, and African-American users would stand in front of the camera and it simply wouldn't see them. There are YouTube videos of this; the problem was that the system was trained on a different distribution than the one it was applied to. It is very, very important that your training data comes from the same distribution that you actually intend to deploy on afterwards. In this case it was quite embarrassing for the company and they had to update the software.

OK. So we basically assume this data is given to us. Getting the data is not always easy, but in this course we won't go into much detail about how to actually collect it; we just assume that somehow you can sample it. Last time I gave a few examples: one from hospitals, where you have patient data — x could be a vector that describes a patient, and y could answer the question "If I released this patient now, would he or she return to the hospital within six weeks?" That's a typical good question. Another one is text documents: x could be an email, and y the question "Is this email spam or is it not spam?"

How far did we get last time? OK, good, so let me do one more: images. X could be an image — this is exactly the face problem, face detection for example. An image is really a bunch of pixels; if I zoom in it looks like a grid, and each pixel has a certain color. If you have a six megapixel camera — I think the latest iPhone has about six megapixels — that means you have six million pixels, and each pixel consists of three numbers: the red value, the green value, and the blue value. Usually you have around 255 possible values for each of them, 24 bits in total. So you can write your vector x_i by stacking all these pixels: the red, green, and blue value of this pixel, then the red, green, and blue value of the next pixel, and so
on. So this might be, for example, 0.3, 0.2, 0.8 — those three numbers describe the pixel here in the top left. With six megapixels, this leads to a representation with 18 million dimensions. That is often not a very good representation of images; for a long time people had much, much better representations, but in the last couple of years people actually went back to representing images simply as pixels, and we will get to this at the very end of the class when we talk about convolutional neural networks. Any questions about the feature space 𝒳 and the representation of data as vectors? The key point is: whatever your data is, we assume at this point that you can take each data instance and represent it as a vector x, and we assume that every vector has the same dimensionality — so a text document becomes a bag of words, and so on.

One more distinction I want to make — let me just annotate my notes, I just realized this — is between dense and sparse vectors. A dense representation means that for almost every example, essentially every single one of these dimensions has some value. A sparse representation means that the majority of the dimensions are 0 and only a few are filled in. You see this with text documents, for example. Last time I said the way we represent them is with a huge vector where each dimension corresponds to one word in the dictionary — say a million words in the English dictionary. If you have a short email, almost every word in the English vocabulary will not be in your email; if you pick a word at random, it is almost certainly not in your email. In fact, most of our language uses the same thousand words over and over again. So almost all of these dimensions will be zero, and there are just a few counts for the words that actually occur. That is a sparse representation, and it matters because if you store such a vector, you don't store all the zeros — you only store the entries that are nonzero — and that sometimes allows you to do machine learning on very high dimensional data, because it is sparse and still fits in memory. So the important thing is to understand what is dense and what is sparse: sparse means you have many, many zeros in your feature vector, and dense means you typically don't.
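To make the dense-versus-sparse distinction concrete, here is a minimal sketch in Python (the tiny image, the five-word vocabulary, and the example email are all made up for illustration): a dense representation flattens an RGB image into one long vector, while a sparse bag-of-words representation stores only the nonzero word counts.

```python
import numpy as np

# Dense representation: flatten an RGB image into one long feature vector.
# A toy 4x3 image with 3 color channels gives 4 * 3 * 3 = 36 entries;
# a real 6-megapixel photo would give 6,000,000 * 3 = 18 million entries,
# almost all of them nonzero.
image = np.random.rand(4, 3, 3)            # stand-in for a real photo
x_dense = image.reshape(-1)                # shape: (36,)

# Sparse representation: a bag-of-words vector over a huge vocabulary.
# Conceptually the vector has one dimension per dictionary word, but we
# store only the entries that are nonzero.
vocabulary = {"dear": 0, "meeting": 1, "million": 2, "dollars": 3, "cornell": 4}
email = "dear friend you won a million dollars a million dollars"

x_sparse = {}                              # dimension index -> word count
for word in email.split():
    if word in vocabulary:
        idx = vocabulary[word]
        x_sparse[idx] = x_sparse.get(idx, 0) + 1

print(x_dense.shape)   # (36,)              dense: every entry is stored
print(x_sparse)        # {0: 1, 2: 2, 3: 2} sparse: only nonzero entries stored
```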
In the first lecture I had this cute little picture of the computer where you write a program and so on. Let me go back to that picture and use the notation I just introduced to make it a little more formal — it's still a cartoon. This here is my computer; I stick in the data and I stick in the desired output. Raise your hand if you remember what I'm talking about. All right, good. So my data here is x_1 to x_n — these are, for example, emails — and for each of them, since this is my training data, I know what the answer is: I know this one was a spam email, this one was not spam, and so on. So I also stick in y_1 to y_n, and they correspond to each other. Basically, what I would like is an algorithm such that if this is the input, this should be the output. My machine learning algorithm — the ML algorithm — is then going to crunch these numbers and output a program, which we call h. Mathematically, h is just a function, and hopefully, if you take h(x_i), you get y_i back in almost all cases; that means the learning did a good job. That is what we call training. After we are done, we move to testing: we take a new instance x for which we do not know the label, we take this program h, put it on a computer, and the computer computes h(x), which is the prediction — and hopefully that is actually the y that corresponds to that x. Any questions? Raise your hand if you understood it. All right, awesome, moving on — I don't want to lose anyone in the first week.

One thing is important — and I will keep saying this until you can't hear it anymore: this only works if this new x is drawn from the same distribution. These x's here, the training data, are drawn from the distribution P; the test point has to be drawn from that same distribution, otherwise it will not work.

Good, so where are we now? You're trying to do machine learning. When you want to do machine learning, the first thing you have to decide is: what is my label space and what is my feature space? You take the data and generate these vectors — we just gave a bunch of examples — and you generate your labels, and you have to understand what kind of problem you are in. Is it a regression problem? Does anyone remember what defines a regression problem? Right: you do have labels, but they are real numbers, not discrete labels. So in regression you have real numbers; in binary classification you have 1 or 0, just two options; in multi-class classification you have K options that you want to predict. You also have to pick a feature representation: you have to come up with a good way of capturing your data and make sure you include everything. And there is one more thing: you have to decide what the function h can be. So far I just said the algorithm learns a program h, but it's not going to be magic; there are different kinds of programs that we could learn, and most of this class is about what kinds of options there are and when algorithm A is the right choice, when algorithm B is, and so on. So we define a set 𝓗, called the hypothesis class — that is why the function is called h, for hypothesis. We say: we have some set of possible programs, possible functions, that we could learn, and we try to find the best one in that set. Choosing 𝓗 is usually not automated — some people are trying to automate it — it is your job as a data scientist: given a problem, make the right choice of what your 𝓗 is, what kind of functions you should try to learn for that particular data and that particular problem. The job of the machine learning algorithm is then: out of all these possibly infinitely many functions, take the best one that fits your data. That is the learning.
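Here is a minimal sketch of that whole training-and-testing protocol (the toy emails and labels are made up, and the "learner" is deliberately trivial — it returns a function h that just predicts the most common training label — standing in for a real machine learning algorithm):

```python
from collections import Counter

def train(xs, ys):
    """A stand-in ML algorithm: crunch the training pairs and return a
    program h. Here h simply predicts the most common training label."""
    most_common_label = Counter(ys).most_common(1)[0][0]

    def h(x):
        return most_common_label

    return h

# Training: feed in (x_1, y_1), ..., (x_n, y_n); get back a function h.
emails = ["cheap pills now", "meeting at 3pm", "win money fast", "notes attached"]
labels = ["spam", "ham", "spam", "ham"]
h = train(emails, labels)

# Testing: apply h to a new instance (drawn from the same distribution P!).
print(h("free money today"))   # the prediction h(x) for an unseen email
```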
Let me give you a few examples of hypothesis classes — we will go through all of these in the next couple of lectures. One is decision trees. Just to give you a little sneak preview of what a decision tree is: take the hospital data, where you want to predict whether someone comes back within the next six weeks. What you do is repeatedly split on some feature. For example: is the age greater than 65? If yes, go to the right; if no, go to the left. Then on this side you might ask: is the blood pressure very high? If yes, maybe the probability of returning is very high, say 90 percent. Over here you might ask: is the age greater than 30? If no, then maybe it is unlikely that the person returns, and so on. So you build a tree that you push your data through; ultimately you are partitioning your space into different regions, and you say: if you fall into this leaf down here, you are likely to return to the hospital; if you fall into that leaf, you are unlikely to return. That is a decision tree. The question is how you learn these decision trees so that they actually have provable guarantees, and so on — we will do this in a couple of lectures. Another one is linear classifiers — the perceptron that I mentioned last time is exactly a linear classifier. Then we have artificial neural networks — also called deep learning these days — and support vector machines. These are probably the most common and most famous ones.

One thing I really want to hammer into you is that there is no best algorithm. That is a big mistake that I see: a few years ago, support vector machines were super fashionable and everyone wanted to do everything with support vector machines; now it is the same thing with deep learning. People come to my office and say, "I have this data and I want to use deep learning." That's not how it works. It really depends on the problem and on your data which one is the right algorithm, and hopefully by the end of this class you will have a good understanding — there is actually a very principled way of deciding — so it is not just trying the one that you like the most.

All right, let's come up with our own algorithm — a really terrible algorithm; it's only the third lecture. Terrible algorithm number one — someone is going to take a screenshot of this and put it on Twitter: "this is what Professor Weinberger is teaching." Here we go: pick any h out of 𝓗 randomly, and hope it works well. You can always evaluate it, right — you can check on your data whether it seems to make good predictions. So you just pick a random decision tree, for example, and evaluate it. Well, that is clearly not going to work very well. What's the problem? The space 𝓗 can be very, very large — there can be many, many possible functions you could learn — and it is very unlikely that the one you pick is the right one for the data.
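Terrible algorithm number one in code — a minimal sketch under made-up assumptions (a tiny hypothesis class of one-feature threshold rules and a toy version of the hospital data): draw one h from 𝓗 at random and see how it does.

```python
import random

# A tiny, made-up hypothesis class H: rules of the form
#   "predict 1 if the feature value exceeds t, else 0", one per threshold t.
def make_threshold_rule(t):
    return lambda x: 1 if x > t else 0

H = [make_threshold_rule(t) for t in range(0, 100, 5)]   # 20 candidate functions

# Toy data: x = patient age, y = 1 if the patient returned within six weeks.
data = [(23, 0), (67, 1), (45, 0), (71, 1), (34, 0), (80, 1)]

# Terrible algorithm #1: pick any h in H at random and hope it works well.
h = random.choice(H)

errors = sum(1 for x, y in data if h(x) != y)
print(f"the random h gets {errors} out of {len(data)} points wrong")
```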
So can you tell me what I would see if I deployed this? There is an obvious answer — it's not a trick question. Anyone brave enough to say the obvious? It would be terrible: you would get a very high error, because it is basically like running a random program on your data; it would not spit out the right answers. Can anyone think of a scenario where this could actually still work? That's right: if your 𝓗 is restricted enough that it basically doesn't matter much which h you pick. That sounds weird, but in some settings it can actually be the case, and then you have a huge advantage: it is very, very fast. Picking one h randomly is about as fast as it gets.

OK, let me give you another example — another terrible algorithm that is going to do much, much better. Terrible algorithm number two: we try out every single h in my hypothesis class and pick the one that works best on the data. So I go through my set of functions that I could learn and try every single one — assume for now that it is finite; the number of programs is effectively finite if you bound how complex they can be — and then I use the one that is best. What is the problem with that one? Why is it a terrible algorithm? Right: it would take forever. There may be more programs in that set than there are electrons in the universe. So that doesn't do the job either. We need some better way of picking a function out of this set of possible functions, and that is exactly where loss functions come in.

So let me explain loss functions. A loss function evaluates, for any h in 𝓗, how well that h works on my data set. Let me give you a very simple and very common example: the zero-one loss. It is called zero-one because it is either 0 or 1. Here is how it works. We evaluate a function h that we want to test on some data set D as follows: we go over every single data point (x_i, y_i) in D and apply the delta function, δ(h(x_i) = y_i). Who knows the delta function? Not many, OK. It is a function — essentially the Kronecker delta for discrete variables — that outputs 1 if the statement inside is true and 0 otherwise. So this term is 1 if h(x_i) equals y_i and 0 otherwise. Actually — let me fix my notes — it should be "not equal," because I claimed this is an error, not the accuracy. Can someone describe in English what this function measures? Right: with the equality it would be the accuracy; with "not equal," every time h(x_i) does not equal y_i I count a one, otherwise a zero, and I divide by the total:

ℓ_{0/1}(h, D) = (1/n) · Σ over (x_i, y_i) in D of δ(h(x_i) ≠ y_i).

Say I have a hundred examples and I get three of them wrong: three of them contribute a one, the other 97 contribute a zero, and I divide by a hundred — so three percent of my training samples are wrong. Does that make sense?
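The zero-one loss translated directly into code (a minimal sketch; the hundred-point data set and the hypothesis that gets exactly three points wrong are made up to mirror the example above):

```python
def zero_one_loss(h, data):
    """Fraction of points (x_i, y_i) in the data set on which h(x_i) != y_i."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

# Example: 100 labeled points, and an h that gets exactly 3 of them wrong.
data = [(i, 1) for i in range(100)]      # every true label is 1
h = lambda x: 0 if x < 3 else 1          # wrong on x = 0, 1, 2 only

print(zero_one_loss(h, data))            # 0.03 -- three percent error
```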
You can already see that lower is better. We want to drive the loss to zero — that's why it's called a loss; it is a bad thing — and loss functions are always non-negative. If you drive the loss down to zero, it means you are making no more mistakes. That is universal, across all loss functions; it is a convention: lower is always better, and zero means you made no mistakes. Any questions about this?

All right, let me give you another one: the squared loss. The squared loss is again an average over all my data points: (1/n) · Σ over i of (h(x_i) − y_i)². Can anyone tell me an example of when we would want to use that one? Linear regression, that's right — really, any regression problem. If I am trying to estimate a house price: I say here is how much I predict the house is worth, and here is what it is really worth; if I am off, I square the difference, so in a sense the more I am off, the more I have to pay. And if the difference is zero — I hit it exactly — then I have zero loss and I am doing really well. Any questions?

Let me do one more: the absolute loss, (1/n) · Σ over i of |h(x_i) − y_i|. Again, I am asking how close my prediction is to the true value, and whether it is a little higher or a little lower, I have to pay for it. Can anyone think of an example where I might prefer the squared loss over the absolute loss, and one where I would prefer the absolute loss over the squared loss? I'll give you one minute — discuss it with your neighbor and see if you can come up with an example.

All right, any suggestions for when you would prefer one over the other? Right — typically this comes down to outliers. With the squared loss, having one point that is off by a lot is much worse than having a lot of points that are each off by a little. If you have ten points each off by one, that gives a penalty of ten; if you have one point off by ten, ten squared is a hundred — much, much worse. So the squared loss makes sure everything is roughly right and there is no example that is really, really wrong, and usually that is good. Except — can anyone think of a scenario where that is really, really bad? Right, here is one: suppose I am pricing all these houses, and Bill Gates decides to move in with a mansion worth, say, five hundred million dollars. All your function will try to do is get his price exactly right, and if the student housing is off by a couple hundred thousand, who cares — it barely matters compared to the mansion. So many, many people will get an inaccurate prediction just for the sake of getting that one price right. In that case you may want to go with the absolute loss instead.
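Here is a small sketch contrasting the two regression losses on made-up house prices (in units of $100k, with one Bill-Gates-style mansion at the end): the single outlier dominates the squared loss, while it only enters the absolute loss linearly.

```python
def squared_loss(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def absolute_loss(preds, ys):
    return sum(abs(p - y) for p, y in zip(preds, ys)) / len(ys)

# True prices (in $100k); the last entry is the five-hundred-million-dollar mansion.
ys    = [2.1, 2.4, 1.9, 2.2, 5000.0]
preds = [2.0, 2.5, 2.0, 2.1, 2.0]     # predictions that ignore the mansion

print(squared_loss(preds, ys))        # ~5,000,000 -- dominated by the one outlier
print(absolute_loss(preds, ys))       # ~1,000 -- the outlier hurts, but only linearly
```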
Another setting is when some of the values can simply be wrong. Say someone sells their house to their brother-in-law: they sell a big mansion for one dollar because of some tax benefit, or because they are about to go bankrupt and don't want to own it anymore, but they are really selling it to their spouse or something. Now the tax record says there is a mansion worth one dollar. The squared loss would try hard to get that prediction right — it is really far off, and squaring makes it even bigger — and it would screw up everything else, whereas the absolute loss would not amplify it nearly as much. So it is a trade-off, and we will get to this more in a few lectures — but these were great examples.

OK, good. So learning is basically the process of choosing a function — choosing a little h in my hypothesis class 𝓗. That is the idea behind machine learning algorithms; you say "I want to learn neural networks" or whatever, and the question is which particular function is the best — that is the learning process. The loss functions are there so that we can evaluate how well different functions do and let them guide us: essentially we have to find some clever way of cutting through this very, very large set of possible functions, such that the little h we find has a small loss on our data set.

Now someone might come along and say: wait a second — if what we really want is a function with a small loss on our data set, I can come up with a really terrible idea. Terrible algorithm number three. Here is how we define our function h(x): if this x equals x_i for some (x_i, y_i) in our data set — that is, if there exists a data point in our data set that matches the input exactly — then we output y_i; otherwise we just output y_1. Can anyone tell me what the loss of that algorithm is on the training data set? If I take this h and apply any of these loss functions on my training set — assume all my data points are unique, so each x_i appears only once — what would the loss be? Zero, that's right. It wouldn't make any mistakes: for every single data point in my training set, I would find it in the data set and output exactly its label, so the loss would be exactly zero. So that seems like a great algorithm, right? Are we done — is that the end of the class, can you go home, have we solved machine learning with an algorithm that always gives us zero loss no matter what? What is the problem? That's right — it is a terrible algorithm, because if you stick in anything that was not in your data set, it doesn't know anything; it just outputs y_1. It has simply memorized the training data.
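Terrible algorithm number three as code — a minimal sketch with made-up toy data. On the training set its zero-one loss is exactly zero, which is precisely why the training loss alone can be misleading:

```python
def memorizer(train_data):
    """Terrible algorithm #3: memorize the training set."""
    lookup = {x: y for x, y in train_data}
    default = train_data[0][1]            # y_1, returned for any unseen input

    def h(x):
        return lookup.get(x, default)

    return h

train_data = [("cheap pills", "spam"), ("meeting at 3", "ham"), ("win money", "spam")]
h = memorizer(train_data)

train_errors = sum(1 for x, y in train_data if h(x) != y)
print(train_errors)                 # 0 -- perfect on the training set
print(h("lecture notes attached"))  # "spam" -- it just falls back to y_1
```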
It is like students before an exam — not everybody, but some — who just memorize everything and write it down on the exam; but when a new problem comes along, they don't know the answer. That is exactly what you are trying to avoid: memorization.

Yes — what you are saying is: our training data comes from some distribution, and our test data should come from the same distribution, so how do we know that is actually the case? That is trust that we place in the data scientist. And yes, that is a problem — a very big problem actually — and I'll give you a famous example of someone who made exactly such a mistake: the US Army. They trained a classifier to distinguish civilian vehicles from military vehicles. What they did was take pictures: they sent someone out, and the guy drove around the base and took pictures of tanks and jeeps and so on; then they drove downtown and took pictures of normal people's cars, civilian cars, and trained a classifier. They got zero percent error, so they felt really good — "Awesome, we can use this for targeting" and all that. They felt really good about themselves for a couple of days, and then they deployed it — and it turned out it didn't work at all: fifty percent error, like a coin toss. What was wrong? The guy had first driven around the base and taken pictures of the military vehicles, and then a few hours later taken the pictures of the civilian cars. By that point the sun had set a little; it had gotten a bit darker. So all the classifier had learned was: bright images are military vehicles and dark images are civilian cars. It worked really, really well on the training data, but it is again an example of the distributions not being the same — the training data was sampled from a very different distribution than the one it was deployed on. So yes, you have to be careful.

OK, good. So there is that problem, but we still have this guy here — terrible algorithm number three, the memorizer — which would give us zero loss. If we only use these loss functions to measure how well we are doing on the training set, this would still look like a really good idea, since it gives us zero loss. So we have to make sure we don't arrive at something like this. And actually, one thing you will see later on is that most machine learning algorithms are very much capable of memorizing the training data set — and that's OK, as long as they still generalize. In this case, the memorizer does not generalize at all. So let me tell you what generalization is — any questions, by the way, about this? All right: generalization. Here is what went wrong, or rather what we would really like to have: a function h such that for any (x, y) drawn from the distribution P — in case you don't know the notation, the upside-down A, ∀, means "for all" — we want h(x) ≈ y. "Roughly equal" — I am not going to formalize that here. What went wrong is that the memorization algorithm only works for the data points in the data set D; it does not work in the general case, for an arbitrary (x, y) drawn from P. And this is what we call generalization.
So what do we really want? What we have up here is the loss measured on the training data set: we took the sum over all the data points in the training set, and for each one we asked — here is our x, here is our prediction h(x), here is the true label y — are they close (for the absolute loss and the squared loss), or are they literally the same (for the zero-one loss)? What we would really want is something else: for any (x, y) drawn from this distribution P — this elusive distribution P that we do not have access to — we would like h(x) to match y. So what we would really like is the following: if we take our function h and compute the loss on any (x, y) drawn from P, the expected loss should be very small. This is called the generalization loss:

E over (x, y) drawn from P of [ ℓ(h(x), y) ].

Let me go through this one more time. It says: what is the expected value of the loss function if it is applied to an (x, y) drawn from this distribution P? If you went out there, took a new data point (x, y), stuck it into your function, and computed the loss — what value would you expect? The expectation here is over (x, y) drawn from P. Any questions about this? Sorry — OK, one more time. What we computed up here is the average loss over all the data points in our training set: we sum over all our training points, ask what the loss is for each particular training point, and average. You can also view it as: if you pick a point randomly from your training set, what is the expected loss? But here is the issue: with that, we can run into the memorization problem, where we only do well on the training set but not on data points that are not in it. What we really want is this: if you draw any (x, y) — it does not have to be from your training data set, but it has to be the same kind of data, drawn from the same distribution that your data set was made up from — and stick it into the loss function, then this h should have a low loss. This expression computes the loss that you would get in expectation. Raise your hand if that makes sense. OK.

So this is great — why don't we just minimize that? And the answer is: we cannot, because we don't have P. This is what we would like to minimize, but we can't, because we don't have access to P. All we have are these measly n data points drawn from P. That's all we have. So what do we do? And here is the good thing — do you want to hear the good thing, with one minute left? All right, I'll give you the good thing in one minute. The good thing is: we can estimate it. We cannot compute it, but we can approximate it. Here is how. Take the data set — imagine it is x_1, y_1, x_2, y_2, and so on. I take my data set and split it into two portions: training and test. The moment I collect the data, I split off some part that I call the test set, and the remaining part is the training set. The test set I put on some disk that goes into a safe in a vault in the basement — no one has access to it.
Then this part here, the training set, goes to the data scientist. The data scientist now learns an algorithm, and as far as the data scientist is concerned, he or she doesn't know about the test set; he or she only knows this training set and trains the algorithm to do well on this data set — and maybe gets a really low loss there, which is very encouraging. The moment the data scientist comes back with some function h, we can test how well it really does. How do we do this? We apply it to the test data set. Does that make sense? Because the data scientist has never seen this data set, this is a fair estimate — and it is sampled from the distribution P; we know this because all the data was sampled from P. So then we have an estimate of how well we would do out in the real world. And I do this literally: when I have a data set, I don't give my grad students everything; I keep something back, and then I can always evaluate later on. So this part here, the test loss, is there to estimate what we really want, and this part, the training loss, is there to guide our algorithm. And if you do this, you will realize that the memorization algorithm — terrible algorithm number three — doesn't work: it would give you zero error here on the training set but a really high loss here on the test set. So now you know it doesn't work; you fire your data scientist and collect new data. OK, good — we'll pick up on that on Wednesday.
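To close the loop, here is a minimal sketch of that whole idea on made-up data (with the memorizer from above standing in for whatever the data scientist trains): split the data once, lock the test portion away, train only on the training portion, and use the test portion to estimate the generalization loss.

```python
import random

def zero_one_loss(h, data):
    return sum(1 for x, y in data if h(x) != y) / len(data)

def memorizer(train_data):
    lookup = {x: y for x, y in train_data}
    default = train_data[0][1]
    return lambda x: lookup.get(x, default)

# Pretend these 1000 points were all sampled i.i.d. from the unknown distribution P.
data = [(x, x % 2) for x in random.sample(range(10_000), 1000)]

# Split once, up front: the test set goes into the vault and is never trained on.
random.shuffle(data)
test_set, train_set = data[:200], data[200:]

h = memorizer(train_set)            # the "data scientist" only ever sees train_set

print(zero_one_loss(h, train_set))  # 0.0  -- looks perfect on the training data
print(zero_one_loss(h, test_set))   # ~0.5 -- the honest estimate of how h generalizes
```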
Info
Channel: Kilian Weinberger
Views: 34,028
Rating: 4.9832983 out of 5
Keywords: supervised learning
Id: zj-5nkNKAow
Length: 48min 44sec (2924 seconds)
Published: Mon Jul 09 2018