Data Analysis: Clustering and Classification (Lec. 1, part 2)

Video Statistics and Information

Captions
This is part two of the first lecture on clustering, classification, and machine learning algorithms. What I want to do now is switch over to MATLAB, and what we're going to learn to do is program some of the ideas behind the k-means algorithm. k-means is one of the most important algorithms in the unsupervised learning literature. It's simple, and that's one of its primary benefits: it's very easy to program, it's very fast, and for the most part it's the workhorse of unsupervised learning. Unsupervised learning, as I mentioned, is a little bit tricky, because what an unsupervised learning algorithm tries to do is find clusters for you without any prior information about whether there are even clusters to be found, or how many. So what I want to do is manufacture some data. I'm going to generate two data sets, call them data set one and data set two. They're going to be Gaussian-distributed random variables, and when we look at them we'll see, oh look, I have two clusters. The question is: how well does k-means back that out for us in some principled, algorithmic way? It's a toy problem, but the idea is to illustrate the concept and write a little code so we can play around with it.

Okay, so let's go ahead and build this thing out. I'm going to switch over to MATLAB, and here it is; we've got a programming box here. There is actually a k-means algorithm in MATLAB, and I'll show it to you at the end, but the first thing I want to do is illustrate the process of how k-means works, and we'll do that by first constructing some data. So I'm first going to make up some random variables, and the random variables are going to be Gaussian distributed. I'm going to say x = randn(240,1). What randn does is make Gaussian-distributed random variables with mean zero and unit variance, and I'll have 240 points lined up into a vector of 240 rows and one column. Those are the x values I want to look at. My y values are going to be similar, but I'll make the distribution a little bit narrower in the y direction. Then I can just say plot x versus y, with black squares and a line width of two, because that makes them a little thicker. Okay, so let me pop the figure out and run this code, and there they are. What this figure shows is my distribution of randomly generated points. You can't quite see it here, but the x-axis goes from negative four to four, whereas the y-axis goes from negative 1.5 to 1.5, so with a fixed axis, which I'll show you in a minute, it would look like an elliptical distribution of points; because the axes are squished, it just looks like some random distribution. So I have 240 points; that's my first random set of variables. Now let's make a second set. Suppose this x and y come from some kind of data, say these are cats, and then we'll have x2 and y2, which let's say are dogs, with some other distribution, and our job is to try to separate them. So here we go: we're going to make two more variables, and I can just copy the lines from before. Let's call these x2 and y2, and I don't want the same distribution, so I'm going to make this one another ellipse and center it around minus one.
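The data-generation steps just described can be sketched in Python/NumPy (a stand-in for the lecture's MATLAB code); the 0.5 vertical scale and the (1, -1) center of the second cloud are assumed values for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible

# First cloud ("cats"): 240 Gaussian points, unit variance in x,
# narrower in y (the 0.5 scale is an assumed value)
x = rng.normal(0.0, 1.0, size=240)
y = 0.5 * rng.normal(0.0, 1.0, size=240)

# Second cloud ("dogs"): another ellipse, shifted to be centered
# near (1, -1) (assumed center, per the lecture's description)
x2 = rng.normal(0.0, 1.0, size=240) + 1.0
y2 = 0.5 * rng.normal(0.0, 1.0, size=240) - 1.0

# To visualize, roughly matching the lecture's black/red squares:
# import matplotlib.pyplot as plt
# plt.plot(x, y, 'ks'); plt.plot(x2, y2, 'rs'); plt.show()
```

With equal axes the two clouds show up as two overlapping ellipses, which is exactly the toy problem the lecture sets up.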
So I can run this, and now what we want to do is hold on to the plot and plot x2 versus y2 in red. Okay, so now we have two sets of data, and there they are. This is the data set we're going to start playing with: you have the red dots down here and the black dots up there, and part of what we want to do is think about how to separate them. In this case it's very close to the example I showed previously: most of the black points are up here and most of the red points are down here, but there is certainly mixing in between. If I wanted better separation, notice what I can do: I can just shift the y variable of the second set down to minus two, and you'll see when I run this... okay, what's it complaining about? All right, let's have the figure pop out every time. Oh, and you know what, it's plotting over the old data, so we have to do a clear all, close all, and clear the screen. Boom, there we go; that's more like it. If I spread the data further apart, notice that the black and the red are very well separated, and in fact with your naked eye you can already see what we want to do: draw the classification line right here, or probably a little bit further up, which is easy to fix. Something like that would work quite nicely; this is very well separated data. So this one's easy, but notice there is still a mistake even here: see that black dot right there? There is always some probability that a point will sit far away from where its cluster sits.

Now I want to walk you through what the k-means algorithm does. It's very simple. What you get to do with the algorithm is pick the number of clusters you want it to look for; k is typically that number.
In other words, with k-means you're looking for k cluster means. In this case we're going to try for two means, that is, two clusters. What the algorithm does is say: okay, fine, I'll pick two initial spots for where the means are. In this data here, and I'm going to blow this up just so we can look at it more carefully, I can just pick two random points, say here with a star and over here with a plus. These of course are wrong; the clusters aren't even near those points. But here's what the algorithm does. It says: every point that is closest to the star, where closest just means the distance to the point, belongs to the star, and every point that is closest to the plus belongs to the plus. So in some sense there would be a line running down like this, and it would say these points belong to the star and those points belong to the plus. That separation is the first step of the iteration. So now I've labeled all these points to the star and all those points to the plus. Then what k-means does is say: okay, for everybody who belongs to this cluster over here, where is the center of mass of all those points? So iteration one is to guess; iteration two is to find the center of mass of all the assigned points. If you have this weight here plus these points up here, maybe this is your new point. What about the other one? Again, weight it according to the points assigned to it; I'm just eyeballing this, but you have a big mass of black points here, fewer right here, and a lot of red down here, so maybe it lands there. So iteration one started with some guesses, and now at the next iteration these are your new points. In fact, I'm going to erase my old points, they're gone, and I can start with these updated points.

Now let's repeat the algorithm: whoever's closest to this point belongs to this point, and whoever's closest to that point belongs to that point, so I might draw a line something like this. The algorithm then says: great, for all the points belonging to the star I'm going to find the center of mass, which might be somewhere like here, so the star moves to that point. How about the other one? Find the center of mass of all its points; you get a big weighting down here with the red, so it moves down to there. Okay, you have some new points, and again I'm going to erase the old ones, as well as the old decision line. Here are my new points, and I repeat the process yet again. Now you can see I have these points here and here, so my splitting might look something like this: all the points above belong to the star, and all the points below belong to the cross. I again find the center of mass on each side; this one might move slightly, and the other one moves to a center of mass that is maybe not too far from where it was. What ends up happening, and notice this is after only three iterations, is that these things converge.

So what k-means does is iterate: you give it the number of clusters, say two, so it makes two guesses, and then it works on those guesses. Whoever's closest to this one belongs to this one, and whoever's closest to that one belongs to that one. Once you find all the points belonging to a center, you compute the center of mass of those points, and that's your new center; do the same for the other center; then repeat the whole assign-and-update process. When it converges, you actually have your two clusters.
In fact, in this case k-means would basically reproduce what we have here, which is your naked-eye best separation between the points up here and the points down there. So that's the whole k-means algorithm. It's really simple, because all you have to do, from any candidate center, say this guess out here, is calculate distances to the points. I didn't draw the straight lines, but from here down to here, and from here down to here, these are just distances: the square root of delta x squared plus delta y squared, the Euclidean distance, which is easy to calculate. So you go through every point and associate it with the nearest center, which is kind of trivial. Okay, so that's how k-means works.

Now let's take these two data sets. We could certainly write an algorithm to do this ourselves; alternatively, we can tap into the power of MATLAB, because MATLAB has k-means built in. So let's look at what the k-means command looks like. Let me close this down, bring the command window up, and say help kmeans. Let me make this bigger and much wider... all right, I'll fix this up in a moment; this computer does not currently have the Statistics Toolbox loaded, but let me tell you the basic structure of what the program would do. The kmeans command is as simple as putting in all your data. For instance, if I was going to put all this data in, I might build the input as follows: call it xtotal, which is my x variables stacked on top of my x2 variables, and ytotal, which is my y variables stacked on top of my y2 variables. I know there are two clusters, so this is a little bit fake, but I put them all together, all the x's and all the y's. In the real data, each point is represented by two pieces, an x value and a y value, so let me put this in as data = [xtotal ytotal]. What is this? Each row of this matrix data is a point: one of those black or red points, with an x value and a y value, so it's a two-column row. It may be that your data is not two-dimensional; it could be ten-dimensional, in which case each row would have ten values, or you might have a thousand-dimensional space, which would mean each row has a thousand values. But each row is an individual point, a piece of data: here's this point, here's this point, here's this point, row by row, and I'll have a total of 480 of them. What I'd really want to do is take, say, 200 points from each set and then cross-validate with the remaining 80 points, but for now I'm just going to say this is my data. The kmeans command then works as simply as out = kmeans(data,2), where that 2 is the number of clusters you're going to look for. You throw in your data and a number of clusters, and what kmeans does is exactly the algorithm I illustrated on the board. It will go systematically through your data, pick two initial guesses (there are actually some smart ways to make the initial guesses), and simply iterate and update the centers of mass until they converge. When they converge, you get the variable out, which holds, for instance, a 1 or a 2 associated with every row: if it's a 1, that row is part of cluster 1, and if it's a 2, it's part of cluster 2. So it's going to label every single row of your data as either a 1 or a 2, because that's the number of clusters you told it to find.
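In Python, the same stack-and-cluster workflow might look like the following, using SciPy's `kmeans2` as a stand-in for MATLAB's `kmeans` (note SciPy labels clusters 0 and 1 rather than 1 and 2; the cluster centers and spreads are assumed values):

```python
import numpy as np
from scipy.cluster.vq import kmeans2  # stand-in for MATLAB's kmeans

rng = np.random.default_rng(0)
# Rebuild two clouds like the lecture's (centers/spreads are assumptions)
x,  y  = rng.normal(0, 1, 240),       0.5 * rng.normal(0, 1, 240)
x2, y2 = rng.normal(0, 1, 240) + 1.0, 0.5 * rng.normal(0, 1, 240) - 2.0

# Stack everything: each ROW of `data` is one point (x, y) -- 480 rows
xtotal = np.concatenate([x, x2])
ytotal = np.concatenate([y, y2])
data = np.column_stack([xtotal, ytotal])   # shape (480, 2)

# Ask for 2 clusters; `out` labels each row with its cluster index
centers, out = kmeans2(data, 2, minit='++', seed=1)

# Collect the rows belonging to each cluster, as in the lecture
cluster1 = data[out == 0]
cluster2 = data[out == 1]
```

As in the lecture, each row of `data` is one observation; a ten-dimensional data set would simply make `data` a 480-by-10 matrix, and the call would be unchanged.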
This is very simple, and then what you can do is collect all the rows whose label came out as 1; those belong to one class, and that's my data associated with cluster 1. Everything that comes out labeled 2, all the rows labeled 2, is now in the other cluster. So now you have done this clustering very fast. Again, I'll remind you that the most important piece of this, and the hardest piece, is getting the number of clusters right. In many cases you might have some idea what to pick, but in a lot of cases, where you're really looking for structure in data mining, picking the k is often the art of making this whole thing work. Okay, that's it for k-means, and at this point we're going to continue with the next piece, which is supervised learning using one of the simplest methods: k-nearest neighbor search.
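The lecture closes by noting that picking k is the hard part. One common heuristic, not covered in the lecture and sketched here under the assumption that SciPy is available, is the "elbow" method: run k-means for several candidate k and watch where the within-cluster sum of squares (WCSS) stops dropping sharply:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)

def blob(cx, cy):
    """100 tight Gaussian points around an (assumed) center."""
    return rng.normal([cx, cy], 0.4, size=(100, 2))

data = np.vstack([blob(0, 0), blob(3, 0), blob(0, 3)])  # really 3 clusters

# Elbow heuristic: within-cluster sum of squares for each candidate k
for k in range(1, 7):
    centers, labels = kmeans2(data, k, minit='++', seed=1)
    wcss = sum(((data[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    print(k, wcss)
# WCSS typically drops sharply up to the true k (3 here), then flattens
```

This is only a heuristic; on real data-mining problems the elbow can be ambiguous, which is exactly the "art" the lecture alludes to.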
Info
Channel: Nathan Kutz
Views: 32,053
Rating: 4.9693875 out of 5
Keywords: data science, supervised learning, unsupervised learning, clustering, classification, k-means, Nathan Kutz
Id: G5l6xi_D64Y
Length: 18min 41sec (1121 seconds)
Published: Fri Feb 19 2016