Machine Learning Tutorial Python - 13: K Means Clustering Algorithm

Captions
Machine learning algorithms fall into three main categories: supervised, unsupervised, and reinforcement learning. Up till now we have looked at supervised learning, where the given data set has a class label or target variable. In unsupervised learning all you have is a set of features; you don't know your target variable or class label. Using this data set we try to identify the underlying structure in the data, or we try to find the clusters in it, and we can make useful predictions out of that. K-means is a very popular clustering algorithm, and that's what we are going to look into today. As usual the tutorial is in three parts: first theory, then coding, and then an exercise.

Let's say you have a data set like this, where the x and y axes represent two different features, and you want to identify clusters in it. When the data set is given to you, you don't have any information on target variables, so you don't know what you're looking for. All you're trying to do is identify some structure in it. One way of looking at it is these two clusters: just by visual examination we can say that this data set has these two clusters, and k-means helps you identify them.

Now, k in k-means is a free parameter: before you start the algorithm, you have to tell it the value of k you are looking for. Here our k is equal to two. So you start with k equal to two, and the first step is to pick two random points which you consider the centers of those two clusters. We call them centroids as well. If your k was, let's say, three, then you would put three random points. These can be placed anywhere in this 2D plane, it doesn't matter. The next step is to compute the distance of each data point from these centroids. For example, this data point is nearer to this centroid,
hence we'll say it belongs to the red cluster, whereas this data point is nearer to green, so we'll say it belongs to the green cluster. The simple mathematical way to do this is to draw a line connecting the two centroids and then draw a perpendicular line through its midpoint: anything on the left-hand side is the red cluster, and anything on the right-hand side is the green cluster.

So there you go, you already have your two imperfect, clunky clusters, and now we try to improve them at every stage. The way you do that is to adjust the centroids for these two clusters. For example, for this red cluster, which is these four data points, you find the center of gravity, almost, and you put the red center there, and you do the same thing for the green one. So you get this when you make the adjustment. Now you repeat the same process: you recompute the distance of each of these points from the centroids, and if a point is nearer to red you put it in the red cluster, otherwise you put it in the green cluster. See, now these points changed from green to red; they're nearer to red, which is why they're in the red cluster.

You keep repeating this process: recalculate your centroids, then recalculate the distance of individual data points from these centroids, and readjust the clusters, until none of the data points change cluster. Right now there is only one green point changing its cluster, so now it's red, but after this we are done. Even if you recompute everything, none of these data points will change their cluster, hence we can say these are my final clusters.

Now, the most important point here is that you need to supply k to your algorithm, but what is a good number for k? Here we have a two
dimensional space; in reality you will have many features, and it is hard to visualize that data on a scatter plot. So which k should you start with? There is a technique called the elbow method, and we'll look into it. Just looking at our data set: we started with two clusters, but someone might say no, these are actually four clusters, and a third person might say they are actually six clusters. Different people might interpret these things in different ways, and your job is to find the best possible k.

The way the elbow method works is that you start with some k, let's say k equal to 2, and you compute the sum of squared errors (SSE). For each cluster, you compute the distance of each data point from the centroid, square it, and sum it up. For this cluster we get SSE 1, for the second cluster SSE 2, you do that for all your clusters, and in the end you add them up to get the total sum of squared errors. We square just to handle negative values; there is nothing more to it than that.

So now we have computed SSE for k equal to 2. You repeat the same process for k equal to 3, 4, and so on, and once you have those numbers you draw a plot like this. Here I have k going from 1 to 11 on the x axis, and on the y axis I have the sum of squared errors. You'll see that as you increase the number of clusters, the error decreases. It's kind of intuitive if you think about it: at some point you can treat each data point as its own cluster, where the sum of squared errors becomes almost zero. Let's assume we have only 11 data points: at k equal to 11 the error will be zero. So the error keeps on reducing, and the general guideline is to find the elbow. On this chart, this point is sort of like an elbow, so this is a good cluster number. For example, for
whatever data set this chart represents, a good k would be four. So that was the elbow technique. Let's get into Python coding now.

The problem we are going to solve today is to cluster this particular data set, where you have the age and income of different people. By clustering these data points into groups, you're trying to find some characteristics of those groups: maybe a group belongs to a particular region in the U.S. where salaries are higher or lower, or maybe it belongs to a certain profession where salaries are higher or lower. Right now we have just name, age, and income, and the first thing I'm going to do is import the data set into a pandas DataFrame. Here you can see that I imported the essential libraries, and I have my data frame ready.

Since the data set is simple enough, I will first plot it on a scatter plot. Of course I don't want to include name; I just want to plot age against income, so df.Age and df['Income($)']. You can use dot notation too, but since there's a bracket in this column name I'll use the bracket convention. When you plot this on a scatter chart you can kind of see three clusters: one, two, and three. So for this particular case, choosing k is pretty straightforward.

I will use KMeans, which is imported here, and of course you need to specify your k, which is n_clusters. By the way, in a Jupyter notebook, when you type something and hit Tab, it will autocomplete. So it creates this KMeans object for you, and it has all these default parameters. You can tweak them later, but I'm just trusting the defaults. The second step is fit and predict. In the previous supervised learning algorithms we used to do fit and then
calculate the score; here I'm directly doing fit_predict. Fit and predict what? I'm going to fit and predict the data frame excluding the name column, because the name column is a string and it's not going to be useful in our numeric computation, so I want to ignore it. You do fit_predict and what you get back is y_predicted. What this statement did is run the k-means algorithm on age and income, which is this scatter plot, and compute the clusters as per our criteria, where we told the algorithm to identify three clusters. It assigned them different labels, so you can see three clusters: 0, 1, and 2.

Visualizing this array is not much fun, so we want to plot it on a scatter plot again to see what kind of clustering result it produced. In my data frame I am going to append this as a column, so that my data frame looks like this. This is a little better: I can see these two guys belong to the same group, these two belong to the same group, and so on, but it is still not as good as a scatter plot.

So let's do plt.scatter. What we need to do is separate these three clusters into three different data frames. So df1 = df[df.cluster == 0]: this returns all the rows from the data frame where cluster is zero. The second one will be this, and the third one will be this. Now we have three different data frames, each belonging to one cluster, and I want to plot all three onto one scatter plot. Just to save some time, let me copy-paste the code here. I will come back to this a little later, but see, there are three different data frames and we are plotting them in different colors: cluster zero is green, then red, and black. Let's see how that looks. Oh, I made a mistake here, I had a typo.
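The loop from the theory part (assign each point to its nearest centroid, move each centroid to the mean of its points, repeat until no point changes cluster) can be sketched in plain Python. This is only an illustration of the idea, not what scikit-learn's KMeans runs internally. The function name kmeans, the toy points, and the hand-picked starting centroids are assumptions made for the example; the video places the starting centroids at random.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain-Python sketch of the k-means loop (hypothetical helper).

    points:    list of (x, y) tuples
    centroids: list of k starting (x, y) tuples
    """
    centroids = list(centroids)  # work on a copy, leave the caller's list alone
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Toy data (made up): two obvious groups, starting centroids chosen by hand.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

With random starting centroids the cluster numbers can come out swapped, which is also why the colors assigned to clusters can differ from run to run.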
Good. All right, so I see a scatter plot here, but there's a little problem. This red cluster looks okay, but these two clusters are not grouped correctly. This happened because our scaling is not right: our y axis goes from, let's say, 40,000 to 160,000, and the range of the x axis is pretty narrow. See, it's hardly 20, versus 120,000 here. When you don't scale your features properly, you might run into this problem. That's why we need to do some preprocessing and use MinMaxScaler to scale these two features, and only then run our algorithm.

So we are going to use MinMaxScaler. The way you do it is you say scaler is MinMaxScaler(), which, if you noticed, we already imported here. Then scaler.fit on df; I want to fit the income first. MinMaxScaler will rescale to the range 0 to 1, so after I'm done with my scaling I will have a scale of 0 to 1 on the y axis as well as the x axis. Then the income column, let me just copy-paste this guy here, is equal to scaler.transform, so now the scaler will scale the income feature. Let's see how that did. You can see that the income is scaled, it's like 0.38 and so on, so it is in the range zero to one; you will not see any value outside that range.

We want to do the same thing for age as well. So scaler.fit on the age column, then df.Age is equal to scaler.transform of the age column, and then we print our df, and you can see the age is also scaled. I have this extra column because I made a mistake previously, but you can ignore that; you can ignore cluster also. So we have the age and income features properly scaled now, and if you plot these on a scatter plot, structure-wise at least they will look like this. The next step is to use the k-means algorithm once again to train on our scaled data set, so it's
gonna be fun now. Let's see what scaling can give us. As usual, y_predicted is equal to km.fit_predict. Again I started with three clusters, and I'm fitting my scaled age and income. Let's see my y_predicted: it predicted some values, and we don't yet know how good they are. So I will just set the cluster column equal to y_predicted. I will also drop the column that we typoed, and then let's look at df. Okay, inplace is equal to True. So this is my new clustering result. Let's plot it on our scatter plot; I'm just going to remove this for now. Now you can see that I have pretty good clusters: black, green, and red, they look very nicely formed.

One of the things we studied in the theory section was centroids. If you look at km, which is your trained k-means model, it has an attribute called cluster_centers_, and these centers are basically your centroids. This is x, this is y; so this is the centroid of your first cluster, then the second centroid and the third centroid. If you plot these on the scatter plot, it gives us a nice visualization. So plt.scatter, and first the x axis. What will the x axis be? Using this syntax you say: go through all the rows, which is three rows here, and the 0 means the first column, which is this. Your y is column 1, the second column. Just to differentiate them from regular data points I will use a special marker and color. You can see that these are the centers of my clusters.

Now let's look at the elbow plot method. This data set was simple, but when you're trying to solve a real-life problem you will come across data sets with, like, 20 features. It will be hard to plot on a scatter plot, it will just get messy, and you will be like, what do I do now? Well, you use the elbow plot method. In the elbow plot, as we saw in theory, we go through a number of k values. So let's say we'll
go from k equal to 1 to 10 in our case, and we calculate SSE, which is the sum of squared errors, plot the values, and try to find the elbow. Let's define our k range; let's say I want to do 1 to 10. This will actually be 1 to 9, but whatever. Then sum of squared errors is an array: for k equal to 1 you find SSE, for k equal to 2 you find SSE, you store all of that in this array, and then use matplotlib to plot the result.

So, for k in k range: I'm going through 1 to 9, and in each iteration I create a new model with n_clusters equal to k and then call fit. What do I fit? I fit my data frame, but I use this syntax because my data frame has a name column and I don't want to use it. You'll be like, what the heck is this guy doing, using this crazy syntax all the time, but that's to avoid name. If you want, you can create a new data frame and just drop the name column; that is fine too. Now, what is my sum of squared errors, and how do I get it? After you call km.fit, there is an attribute on your k-means model called inertia_ that gives you the sum of squared errors, and we just append that error to our array. That was pretty fast because our data set is very small.

Let's see what sse is. You can see that the sum of squared errors was very high initially and then kept on reducing. Now let's plot this guy into a nice chart. When you do that, you get our elbow plot. Where is my elbow? Here is my elbow: you can see that k is equal to 3 at my elbow, and that's what happened, see, I have three clusters.

For the exercise, we are going to use the iris flower data set from the sklearn library. What you have to do is use the petal length and width features. Just drop sepal length and width, because they make your clustering a little bit difficult; so drop those two features for simplicity, use the petal
length and width features, and try to form clusters in that data set. That data set has a class label in the target variable, but you should just ignore it; you can use it just to confirm your results. In the end, draw an elbow plot to find the optimal value of k.

All right, so do the exercise and post your results in the video comments below. I have also provided a link to the Jupyter notebook used in this tutorial in the video description, so look at it; towards the end you will find the exercise section. Don't forget to give it a thumbs up if you like the content of this tutorial. You can also share it with your friends.
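Two pieces of the coding part are easy to sketch without any library: min-max scaling and the sum of squared errors behind the elbow plot. The snippet below is a rough illustration, not the notebook's code. The helper names (min_max_scale, centroid, sse), the ages and incomes, and the hand-picked groupings are all made up for the example; in the video the groupings come from KMeans and the error is read from inertia_.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the 0..1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def centroid(members):
    """Mean point of a non-empty list of (x, y) tuples."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its own cluster's centroid."""
    total = 0.0
    for p, lab in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
    return total


# Made-up ages and incomes, for illustration only.
ages = [25, 30, 35, 40]
incomes = [40000, 80000, 120000, 160000]

# After scaling, both features live on the same 0..1 scale.
points = list(zip(min_max_scale(ages), min_max_scale(incomes)))

# Hand-picked groupings for k = 1, 2 and 4 (assumed, not produced by k-means).
groupings = {1: [0, 0, 0, 0], 2: [0, 0, 1, 1], 4: [0, 1, 2, 3]}
errors = {}
for k, labels in groupings.items():
    centroids = [
        centroid([p for p, lab in zip(points, labels) if lab == c])
        for c in range(k)
    ]
    errors[k] = sse(points, labels, centroids)

print(errors)  # the error shrinks as k grows and is zero when every point is its own cluster
```

Plotting errors against k gives the elbow chart. For the exercise, the same idea applies with the two petal features in place of age and income.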
Info
Channel: codebasics
Views: 184,858
Rating: 4.9449415 out of 5
Keywords: kmeans python, kmeans sklearn tutorial, kmeans example, k means elbow method python, unsupervised learning, k means clustering python, k means clustering, k means clustering algorithm, k means python, clustering in machine learning, clustering python, k means, k means algorithm in machine learning, kmeans, k-means clustering, k means clustering in machine learning, k means algorithm, kmeans algorithm in machine learning, k-means clustering algorithm, k-means in python
Id: EItlUEPCIzM
Length: 25min 15sec (1515 seconds)
Published: Mon Feb 04 2019