Data Analysis: Clustering and Classification (Lec. 1, part 3)

Video Statistics and Information

Captions
This is the third part of the first lecture, and in this last piece I want to talk about one of the easiest, and perhaps among the most common, algorithms out there. Very much like k-means, it is extremely simple to think about and to implement, so we're going to use it here. It's called KNN, or k-nearest neighbors. It is very much like the k-means algorithm: what k-means did, remember, is build clusters based on who is closest to whom; you ask which center-of-mass point you are closest to, and that's the cluster you belong to. K-nearest neighbors is similar, but with one key difference: it is a supervised learning algorithm, which means you provide it with labeled data. So let's demonstrate the concept on a little toy example, and then we'll go over to MATLAB and show how it's done.

For the toy example, we go back to the simple problem of classifying squares and circles. Notice that just by drawing them as squares and circles I have essentially labeled them myself: here are circles, here are squares. I'm using expert-in-the-loop knowledge; I've labeled them, so this is already a supervised setting. So let's put down some green squares and some orange circles. Suppose this is my data, and as you've seen, each point is labeled as a green square or an orange circle. K-nearest neighbors says: that labeled set is my training data. Now a new point arrives, a new piece of data, and your job is to label it. Is it an orange circle or a green square? That's the objective, and we're going to see that objective through.

Here's how k-nearest neighbors works. Take a new point, which I'll draw as a triangle. Very much like k-means, the algorithm looks at distances to points, and the K in k-nearest neighbors is the number of neighbors you consult. Generically, if you don't specify anything, MATLAB's knnsearch algorithm just looks at whoever is closest to the new point, and that nearest point's label is the label it assigns. So in this case the new point is closest to an orange circle, and one-nearest-neighbor says: of all the data points available, I am closest to that one, so I am an orange circle. Now take another point somewhere more interesting. You have several distances to compare, and it's a bit like playing bocce ball: you eyeball who is closest to whom. I'd say this one is the winner, and since the winner is an orange circle, this point is again labeled an orange circle.
That's how the labeling works if you're using one nearest neighbor. Now take a more interesting case: suppose the new piece of data sits right here. This one is interesting because its nearest point is clearly a green square, so with one nearest neighbor you would label it a green square; but notice that it also has a neighbor here, here, and here, all of which are orange. What k-nearest neighbors says is: rather than listening to a single point, choose K, the number of nearest neighbors, and have those neighbors vote. You always pick an odd number, or else you can get a tie. Suppose I choose the three nearest neighbors. The single nearest neighbor is a green square, but I'm not going to listen only to that point, because then the answer would be a green square; I ask who the next two closest points are. That gives one green square and two orange circles: two votes for circle, one for square, so the point becomes a circle because the vote is in the circle's favor. That's the idea: you let the nearest neighbors make the decision, not just one of them but three, five, and so on; again, you don't want an even number, because you could get a tie and then you'd have to make some extra decision. And for a point over here, if I use its three nearest neighbors, all three are squares, so it is clearly a square.

So that's how this works, and it is very simple. In the supervision step I give you green squares and orange circles. When someone hands you new data, your job is to say whether it is a green square or an orange circle, and all you do is calculate the distances from the new point to the neighboring points, take the K nearest, and you're done. That's the concept we want to convey here. The distance is the completely standard metric; this goes back to junior high and the Pythagorean theorem: the distance is sqrt((Δx)^2 + (Δy)^2). If the data live in a higher-dimensional space, the distance becomes sqrt((Δx1)^2 + (Δx2)^2 + ...). So these are simple concepts: distances. That's how knnsearch works, and it's also how k-means works; you just compute distances between everybody, and from that alone you make a decision: with k-means a clustering decision, and here a classification decision.
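To make that vote concrete, here is a minimal MATLAB sketch of the idea just described. The coordinates, labels, and variable names are made up for illustration; this is not the code used in the lecture.

    % Minimal k-nearest-neighbor vote on a made-up version of the toy example.
    % Labels: 1 = green square, 2 = orange circle (hypothetical data).
    pts    = [1 1; 2 1; 1 2; 6 6; 7 6; 6 7];    % labeled training points, one per row
    labels = [1; 1; 1; 2; 2; 2];
    newpt  = [5 5.5];                            % the new point to classify
    k      = 3;                                  % odd, so the vote cannot tie

    d = sqrt(sum(bsxfun(@minus, pts, newpt).^2, 2));  % Pythagorean distance to every labeled point
    [~, order] = sort(d);                             % closest points first
    label = mode(labels(order(1:k)));                 % majority vote among the k nearest

With k = 1 the vote reduces to taking the label of the single closest point, which is the default behavior described above.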
Here's one other thing I want to point out that is maybe not so obvious, but will become important as we go along. Normally when you do these data reductions, you look at the clusters and draw a line: over there is one type, over here is another. It's not clear that knnsearch does that for you, and if I start to think about this picture I can see that right away. First of all, knnsearch is not going to draw lines between my data sets, between my data points, or construct some kind of separating plane. What it does can actually follow curved lines through your data. So let's try to draw what the surface of separation would be. If you're at a point here, your nearest points are these three, so you're essentially a green square; but move over a little and this orange point becomes closer than that one, so there is some line that runs roughly through here and through here where the vote changes, and once you account for this point down here the separation line ends up looking something like this. In other words, knnsearch is capable of generating nonlinear separation curves through your data, and those curves are based entirely on the training set you give it.

This is where the concept of cross-validation, which we talked about before, becomes important. It could be that the separation curve is completely overfit: it fits the very specific data you gave it. What you want to understand is, if I do many trials with random draws of the data, how much does that curve change? Very complicated curves like this tend to change a lot, because they are typically generated by overfitting the data. So it is really important to do some task afterwards in which you evaluate the classifier by cross-validation. Never forget the cross-validation; it's extremely important. Always, always, always do cross-validation.
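As one concrete way to run that check, here is a hedged sketch of a simple random-split cross-validation loop over the choice of K, written around MATLAB's knnsearch (Statistics and Machine Learning Toolbox), which the lecture turns to below. The data matrix X, the label column y, the 80/20 split, and the trial count are assumptions made for illustration, not the lecture's code.

    % Assumed inputs: X is an n-by-2 data matrix, y is an n-by-1 column of class labels.
    ks     = [1 3 5 7 9];               % candidate numbers of neighbors (odd, to avoid ties)
    ntrial = 50;                        % number of random train/test draws
    n      = size(X, 1);
    err    = zeros(length(ks), 1);
    for j = 1:length(ks)
        for t = 1:ntrial
            perm = randperm(n);
            itr  = perm(1:round(0.8*n));                        % random 80% training split
            ite  = perm(round(0.8*n)+1:end);                    % remaining 20% held out
            idx  = knnsearch(X(itr,:), X(ite,:), 'K', ks(j));   % neighbors drawn from the training split only
            ytr  = y(itr);
            yhat = mode(ytr(idx), 2);                           % vote among the ks(j) nearest training points
            err(j) = err(j) + mean(yhat ~= y(ite)) / ntrial;    % accumulate the average held-out error
        end
    end

A value of K whose held-out error stays stable across the random draws is less likely to correspond to an overfit separation curve.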
So that's the simple example. Now let's go to the MATLAB code and look at some real data, or rather manufactured data that we construct, and talk about what knnsearch does with it. I'm going to go back to my desktop and pull in the data we started with, the black and red dots from the last example. Here's the data, and suppose it is my training set: since we're doing a supervised algorithm, the data comes in already labeled. I've labeled these points black and those points red. In this configuration k-nearest neighbors is pretty easy: give me a new point, say here, and all I have to do is ask who its neighbors are. It's surrounded by black, so it's labeled a black point; down here it would be a red point; somewhere in between is more complicated. So let's bring the two sets of data closer together, back to what we had last time. This is more interesting, and it's where you start to see that supervised algorithms have to make choices, and they are going to get some of them wrong, partly because statistically some of your data is indistinguishable, or some outliers look like they belong in the other set of data.

For instance, take this red point right here. If you use K = 3 and a new point lands anywhere around it, the red point only votes once against two black points. But with one nearest neighbor, you could put a new point up here, closest to that red point, and it would be labeled red even though it is clearly sitting in a sea of black. That's why you would weight the decision with three, five, or seven neighbors, something like that. It gets more complicated in this middle region, though: how do you label points here? The decision boundary becomes very complicated, because you have a sea of black and red points all mixing, so whenever a new piece of data sits near this decision region, it could be labeled black or red depending on small changes in where it sits. This whole region is going to be difficult to make a decision about which cluster a point belongs to; but up here it will be labeled a black point, and down here a red point. It's as simple as that.

Now, how does this work in practice in MATLAB? Just like before, we took two random variables and stacked them on top of each other: all the x values and all the y values, so the data sits in two columns, the x values and the y values. The columns are the features of the data. Each row is one data point; here each point is represented in a two-dimensional space, but you could imagine each row being three-dimensional, fifty-dimensional, or a thousand-dimensional. Each row is one piece of data, and each column is how you express that data in some kind of feature space, as we might call it. Since these are just two-dimensional objects, the features are simply the x position and the y position. Notice also that I know the first 240 points belong to the black circles and the next 240 points belong to the red circles.

The knnsearch call itself is very simple. You write something like [idx, D] = knnsearch(X, Y, 'K', 3). knnsearch is the function you call; X is your data matrix, the training data, and you know its labels: the first 240 rows belong to one group, the next 240 to the other. Y is new data, new x and y positions, and that new data is tested against the training data you gave it. The 'K' argument says how many neighbors to use: three, five, whatever you like. If you leave it out, the search just looks for the single nearest point, so the default is K = 1: give me a point, find its nearest neighbor, and whatever that neighbor is, that's what I am; with K = 3, whatever the vote of the nearest three neighbors says, that's what I am. That's the basic algorithm.
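Here is a hedged reconstruction of that setup. The random point clouds, the variable names, and the new query points are stand-ins invented for illustration; only the knnsearch call pattern follows the lecture, and knnsearch itself requires the Statistics and Machine Learning Toolbox.

    % Two made-up clusters standing in for the black and red points.
    x1 = randn(240,1);        y1 = randn(240,1);         % "black" cluster
    x2 = randn(240,1) + 3;    y2 = randn(240,1) + 3;     % "red" cluster, shifted away
    Xtrain = [x1 y1; x2 y2];                             % 480-by-2: each row a point, each column a feature
    labels = [ones(240,1); 2*ones(240,1)];               % rows 1-240 are black (1), rows 241-480 are red (2)

    Xnew = [0.5 -0.2; 2.8 3.1];                          % two new points to classify
    [idx, D] = knnsearch(Xtrain, Xnew, 'K', 3);          % indices of the 3 nearest training rows, and their distances
    % Omitting the 'K' argument gives the default K = 1: only the single closest training row is returned.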
Now, what does it return? It returns two things: an index, idx, and a distance, D. For every new data point, the index tells you which training point it is nearest to; that is the metric: out of all the data you trained with, which point am I closest to? It also returns the distance to that point. When you use multiple neighbors, say K = 3, it returns three indices, the three closest training points, and the distances to those three points. That's it: it hands you the indices and the distances, and since you already know the labels of the training points, it's easy to say whether the new point is a black point or a red point.

In the simplest incarnation, the single nearest neighbor, suppose you run this and the index comes out as, say, point 100. Remember that the first 240 points came from the first pair of variables (x, y) and the next 240 from (x2, y2). Since 100 falls within the first 240, the new point belongs to the first grouping, the black circles, and if you want to know how close it was to the nearest of those black circles, D gives the distance to that nearest point. So that's the basic algorithm, and notice the structure, because we're going to see it over and over again: you provide training data, the labeled data you built, and then you give it new data to carry out some classification task.

So again, k-nearest neighbors and k-means are very simple, geometrical ways to think about data, and they can actually work out quite nicely. They are two of the top-ten algorithms people have identified as workhorse algorithms that you can apply to generic data sets, and I find they have fairly good success. We're going to keep building on this idea and develop some higher-end algorithms that have been even more successful, and those will be coming up in the next couple of lectures.
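To tie the pieces above together, here is one hedged way to turn the indices returned by knnsearch into class labels, continuing the hypothetical Xtrain, labels, and Xnew from the earlier sketch; it mirrors the reasoning that an index of 100 falls within the first 240 rows and is therefore black, rather than reproducing the lecture's own code.

    % With K = 3: look up the label of each returned training row and take a majority vote.
    votes = labels(idx);                   % same shape as idx: one label per neighbor per new point
    yhat  = mode(votes, 2);                % row-wise vote: 1 = black, 2 = red

    % With the default K = 1, the group can be read straight off the index,
    % since rows 1-240 are black and rows 241-480 are red.
    idx1  = knnsearch(Xtrain, Xnew);       % the single nearest training row for each new point
    isRed = idx1 > 240;                    % an index such as 100 falls in the first 240, so that point is black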
Info
Channel: Nathan Kutz
Views: 20,704
Rating: 4.9694657 out of 5
Keywords: data science, supervised learning, unsupervised learning, clustering, classification, k-nearest neighbors, Nathan Kutz
Id: kolT2HbWigM
Length: 19min 12sec (1152 seconds)
Published: Fri Feb 19 2016