K-means Cluster Analysis With Excel - A Tutorial

Captions
In this video I'm going to teach you how to perform a k-means cluster analysis with Excel. Cluster analysis is a wildly useful skill for any professional. Whether you work in HR, supply chain, product management, finance, or marketing, cluster analysis is used all over the business, and the great news is that you can conduct k-means clustering using plain old Excel. Some VLOOKUPs, a few calculations, and just a little bit of Power Query are all that you'll need, as you will see in this video.

Now, I need to apologize: this video is quite long. The reason is that I've received feedback from my viewers that they prefer more in-depth tutorials rather than shallower ones, so if this is too much detail, feel free to skip around and pick out just the content you need for whatever task you have at hand.

Let's get started with the basics of the k-means algorithm and how it works. By the way, the k in k-means is the number of clusters in the analysis, so if k is equal to three, you're looking for three clusters in your data.

I'm going to use a contrived example to teach the basics of the k-means clustering technique, and what you can clearly see from this example is that there are some obvious clusters going on, so you wouldn't even really need k-means in this scenario. Think about it this way: what this chart depicts is two columns of data in an Excel worksheet. One column corresponds to the x-axis and the other corresponds to the y-axis, and if you plot them with a scatter plot in Excel, you clearly see the clusters. But here's the thing: interesting data that you cluster typically has many columns, and clusters that are obvious in a chart like this would not be obvious in the real world. If you've got 7, 8, 9, 25, or 37 columns of data, can you envision a chart in 37 dimensions? This is a two-dimensional chart built from two columns of data; with 37 columns, the human brain just can't do it. So we're going to use a very simple contrived example to understand the base concepts, which then lets us extrapolate to more complicated examples using the knowledge we've gained. We've got two obvious clusters here, and as we go through the slides I want you to pay particular attention to one data point in particular, because that's where the k-means algorithm gets interesting.

So let's get started with how k-means works. First up, k-means doesn't know anything about your data when it starts. It doesn't know what the clusters can be or will be. So, in order to have a starting place that makes sense, it uses randomness, and that's a very common theme in analytical techniques, because randomness gets rid of any sort of bias. While it's not guaranteed to give you the best result, it certainly helps in getting a good or useful result in any number of scenarios. The first step in getting started with k-means is to tell the algorithm how many clusters it should find. We tell k-means; k-means doesn't figure it out on its own. Just to be clear, there are other types of clustering algorithms that will find the number of clusters on their own, but k-means is not one of them. You need to tell it how many clusters to look for. Now you might be asking yourself, and rightly so, "Great, Dave, but how many clusters should I tell k-means to look for?"
I will cover that question later in this video; the answer is non-trivial. But for the sake of argument right now, just for our learning, let's say we tell k-means to find two clusters.

So k-means says: okay, great, you want me to find two clusters. I don't know where to start, so I'm going to use randomness and just throw out some starting points. In this case we've asked for two clusters, so k-means throws out two random starting points, denoted on this chart by the orange dot and the green dot. K-means is essentially saying: I don't know what the optimal clusters are at the beginning, I need some place to start, so I'll toss out two random places. Then I'll go through the data over and over again and refine the clustering: I'll move the cluster centers around and reassign the clusters, and eventually I'll reach a stopping point, and then you'll have your clusters. I know that sounds a bit abstract; we're going to go through it step by step in these slides so it makes total sense.

So we've got our starting point: a random cluster one located here and a random cluster two located there. Once we've got our starting points, we need to assign the existing data to the clusters, and that's pretty easy: whatever data points are closest to the current cluster centers belong to that cluster. Let me show you what I mean. These points up here are all obviously closer to the orange center than to the green one, so we say all of this data now belongs to cluster number one. Similarly, all of these data points, including the one I told you to pay attention to, are closest to cluster number two, so we assign all of them to cluster number two, just to start with. Notice we're at the very beginning of the clustering process and nowhere close to being done yet, but we need to start somewhere, and this is how k-means starts.

We can then flip the colors of the dots: everything that was closest to cluster one is now colored orange, and these points down here are all now green. One thing I need to mention: there's a reason why the two cluster-center dots are bigger than the other dots. It's not only so they're easy to see, it's also to signal that these points are not part of the original data set. K-means just keeps these points off to the side; they're not actual data points, they're part of the algorithm. That's why they're sized differently, so you know they're synthetic rather than real: the smaller dots are actual data that would be in your worksheet, and the big ones are not, as you will see later when we go through the Excel workbook.

So that's our first step, but we're nowhere near done yet, because we now need to recognize that these randomly tossed-out cluster centers should sit at the center of their assigned data points. That's how k-means works: it tries to find as many cluster centers as you tell it to find and place them in the right position within the data.
And what you can clearly tell is that this orange dot is not in the center of its cloud of points, and the green dot is also not in the center of the cloud of green points down here. So, not surprisingly, we need to move the clusters. This is where the "means" in the name comes from, because you need some way of saying: I've got a cloud of points here, how do I move this cluster center to the middle of it? It's actually kind of easy (there's a small sketch of this below). You take the average of the horizontal values of all these points, which lands approximately right about here, and you take the average of the vertical values, which moves the center up or down. Combine the two and you get the cluster center roughly in the middle of the cloud of points. The mean is just another term for the average, so if you're picturing Excel's AVERAGE function in your mind's eye, you're spot on. We take the average of the vertical, the average of the horizontal, and use those as the coordinates to move the cluster center from wherever it is right now to where it needs to be. And here's what that looks like, approximately: boom. It's not exactly centered, but you get the general idea; it's definitely more in the center of the cloud of points than it was before. So we move it.

Not surprisingly, we need to do this for the green cluster as well: take the average of the vertical and the average of the horizontal. But notice we've got that one point way over to the side, and a little higher up than the rest of the cluster, so it's going to affect the movement of the green center. I don't want to use the term outlier because that has certain implications, but if you want to think of it that way, that's okay. When we move the green center, notice how it ends up higher than you might expect, because that point is pulling the center to the left and up a bit compared to the rest of the points.

But we're not done yet, because remember that special point. It probably shouldn't be green anymore, right? Look at it: it's farther from the green cluster center than it is from the orange cluster center. This is a key aspect of understanding k-means: it is an iterative process. All the steps we're going through in these slides, the algorithm does over and over again until it reaches a stopping condition, and as we'll see in a second, that stopping condition is essentially "if I don't need to reassign any data points and I don't need to move any cluster centers, then I'm done." So we should flip that point from green to orange. And once we've flipped it, we need to move both cluster centers again, because they're no longer centered: the green center needs to move into the middle of its now-smaller cloud, and because that point now belongs to the orange cluster, the orange center needs to move as well.
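Before moving on, here's the small sketch I mentioned of the "take the average" step in Excel terms. The cell ranges are hypothetical; they just assume the x values of the points currently assigned to one cluster sit in A2:A11 and the y values in B2:B11:

    New x-coordinate for this cluster center:  =AVERAGE(A2:A11)
    New y-coordinate for this cluster center:  =AVERAGE(B2:B11)

For example, if a cluster currently holds the points (2, 3), (4, 5), and (3, 7), the new center is ((2 + 4 + 3) / 3, (3 + 5 + 7) / 3) = (3, 5).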
So we just keep repeating this process: over and over, we take the average of the vertical and the average of the horizontal for both the green and the orange clusters, calculate the new coordinates, and move the centers. As I mentioned earlier, and it bears repeating, this keeps going until no data points are reassigned. If no data point needs to be assigned to a different cluster, then obviously you don't need to move the cluster centers anymore, and the algorithm is done. The reason I'm emphasizing this is that when we go over to Excel and look at the implementation in the worksheet, you're going to see this process repeated over and over again in the tables of data, and if you're going to do k-means in Excel, you have to implement this iterative process manually yourself, as you will see.

Now, don't panic, but we're going to talk a little bit about math here. The reason is that we've defined this idea of assigning data points to the closest cluster center, but that begs the question: how do you define "close"? You use math for that, not surprisingly, and by default the most commonly used method k-means employs to calculate how close a data point is to a cluster center is what's known as Euclidean distance. It's actually a lot easier than you might think. Set the wayback machine and go back to when you were in school and learned a little introductory geometry, in particular the Pythagorean theorem. I'm sure this will jog your memory: if you've got a right triangle, the Pythagorean theorem says you can calculate how long the hypotenuse c is from the two legs a and b. This is the famous equation a^2 + b^2 = c^2, or, taking the square root, c = SQRT(a^2 + b^2). If I have a point here and a point there, the Pythagorean theorem tells you how far apart they are in a straight line. Now, this is in two-dimensional space, but as we'll see on the next slide, it scales to three dimensions, four, five, six, all the way to 37 dimensions, which is why it is the most commonly used way of defining how close a data point is to a cluster center in k-means.

So we've got our math; we just need to refine it a little bit on the next slide. We've got the same triangle and the Pythagorean theorem, a^2 + b^2 = c^2. What we really need to understand is how to calculate the length of a, because we need that length, squared, as part of finding c^2, which is the distance between the two points; similarly, we need the length (the magnitude) of b, because it's also part of the calculation. If we think about this in terms of the x-axis and the y-axis, the length of a is one x value minus the other, so a = x2 - x1. Similarly, in the vertical direction, the length of b is the top y value minus the bottom one, so b = y2 - y1.
Substituting those in, you get the distance formula: c = SQRT((x2 - x1)^2 + (y2 - y1)^2). This is the format of the math you're going to see when I get over to the Excel workbook. As I said on the previous slide, and it bears repeating: this is two-dimensional space, but the formula works for a third dimension too (you often see a three-dimensional chart with x, y, and z), and for four dimensions, five dimensions, up to 37, however many dimensions you want. This is how k-means determines closeness by default, and it is implemented in the Excel workbook that you can download from my GitHub repo if you'd like.

One thing you need to keep in mind: because k-means relies on Euclidean distance to calculate how close a data point is to a cluster center, it doesn't work with categorical data out of the box. What you need to do is transform your data using a technique known as dummy encoding, and let me show you what I mean. Say I have some hypothetical data in Excel for customer type, which is very common in business data. Values like gold, silver, and bronze are not numbers, so you can't calculate Euclidean distance with them. But you can transform the data into binary indicators, where one means "yes, I have this thing" and zero means "no, I don't." A gold customer gets a one in the gold column and zeros for silver and bronze; a silver customer gets a zero for gold, a one for silver, and a zero for bronze; and so on. Since these are now numeric columns, they can be used in the Euclidean distance calculation, which means you can incorporate categorical data into your clustering process. That's extremely important when you're analyzing business data with k-means. (I'll sketch the formulas for building these indicator columns just below.)

There is a caveat, though. Given the nature of Euclidean distance and the way k-means works with your data, k-means generally works best when you have a large number of purely numeric columns, like height, weight, and sales, things with decimal points in them, as opposed to a large number of categoricals, because those indicators are only ones and zeros. Yes, you can have a few categorical columns in there, but if your data is mostly categorical, or even half categorical, you need to ask yourself whether k-means is really the right technique for you, because there are many other types of clustering techniques. The reason is pretty simple and intuitive: remember that the process for moving a cluster center in k-means is to take the average of all of the columns of data. In this deck we're only looking at two columns, x and y, but if you've got 37 columns and half of them are categorical like this, what does it really mean to take the average of the gold column, or the silver column, or the bronze column? It doesn't make as much sense as taking the average of a height column, a weight column, or a sales column. Just keep that in the back of your mind. There is no clear-cut rule for when you've got too many dummy-encoded columns in your data; just know that, generally speaking, you want more numeric columns than categorical columns when you use k-means.
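Here's the indicator-column sketch I promised, assuming (hypothetically) that the customer type text, "Gold", "Silver", or "Bronze", sits in column B starting at row 2:

    Gold indicator:    =IF($B2="Gold", 1, 0)
    Silver indicator:  =IF($B2="Silver", 1, 0)
    Bronze indicator:  =IF($B2="Bronze", 1, 0)

Fill those down and you get three numeric 0/1 columns that can take part in the Euclidean distance calculation.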
All right, enough of the slides. Let's flip over to Excel, where you can see the workbook in which I've implemented the k-means algorithm I just covered, using nothing more than typical Excel calculations, VLOOKUPs, and just a little bit of Power Query.

So here I am, finally, in the Excel workbook, and there is a lot going on in it, and I apologize for that. However, as we just saw in the slides, the k-means process has a lot going on, so you need to put a lot in your workbook to make k-means happen if you're only using techniques like VLOOKUPs and a little bit of Power Query.

First up, what I've got here is a very famous data set known as the iris data set. I don't have any categorical variables in this particular worksheet, because it would be trivial to add them based on what you saw in the slides; I wanted to keep this nice and simple so you can focus on the 80 percent of what happens when you do k-means. We have 150 rows of data, so this is not a very big data set, and I've added an Observation column, which is just the number of the observation, basically the row number in the data; we'll need it for the VLOOKUPs, as you'll see. The iris data set is famous because it's used a lot in machine learning and cluster analysis as an example data set, and we know there are basically three clusters in it. Like I said earlier, and I'll repeat, finding the optimal number of clusters for k-means is a non-trivial problem, and I will cover that before I end the video. Here, we know up front that we're going to look for three clusters.

As we know from the slides, the first thing k-means asks is: okay, Dave, you want me to find three clusters, great, where should I start? And I tell k-means: I don't know. So k-means says: all right, I'm going to throw out three random cluster centers just to start with. What you can see here is how I did that. I just said we'll pick three rows at random from this table of data, which, by the way, is named iris_data, and use those as starting points, because, as we saw in the slides, the cluster centers will get moved around through the process anyway. You can see the formula I used to do this: I have the number of observations, 150, stored in H4, and I'm using a simple calculation with Excel's random function to say, pick me a value between 1 and 150, and that's the row of data we're going to use. I've got three random values because I've repeated that same function call three times.

Now, here's the problem you need to be aware of if you're not familiar with the RAND function: every time you update the worksheet, RAND generates a new number. So what you need to do is grab a set of these random values (not ones you like, just random ones) and copy and paste the values over here, because the formula results will change every time you update the worksheet.
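The workbook's exact formula isn't spelled out here, but a minimal sketch of picking a random observation number, assuming the observation count (150) is stored in H4, could be either of these:

    Using RANDBETWEEN:                          =RANDBETWEEN(1, $H$4)
    Using RAND, the function the video mentions:  =INT(RAND() * $H$4) + 1

Either one returns a whole number from 1 to 150. Just remember to copy the results and paste them as values, because RAND and RANDBETWEEN recalculate every time the worksheet updates, which is exactly the problem described next.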
Let me show you what I mean: if I shrink this column and then double-click it, notice how I get new random values, and that's annoying. There is a way to shut that off, but it totally jacks up everything else, so you just can't do it. So: you pick some random rows and copy and paste the values over here. That's the first thing you do. Let me scroll over so we can see the starting point.

Okay, so here are our initial three random cluster centers for k-means. Once you've identified the rows (observation number 44, observation number 92, observation number 24), you can copy over the actual individual data values. Notice that we have four columns of data, so what we're doing here is a four-dimensional clustering problem, but we're going to use everything we learned in the slides, and it will scale to five, six, 27 columns, whatever it might be. These are our current starting cluster centers, and notice that I copied the values over because, as I said in the slides, the cluster centers are not real data. They don't show up in your original table; they're just placeholders for the algorithm. Going forward we'll have tables like this one that hold the cluster centers, and they're not part of the original 150 rows of data.

So we've tossed out three cluster centers at random, and now we need to go through all 150 observations and ask: for each row, which cluster are you closest to right now, given where the cluster centers are located? The cluster centers are located right here, in these 12 values: cluster center one, cluster center two, cluster center three. How we answer that is with Euclidean distance. We've got our Observation column, 1 through 150, and for each observation we ask: what's your distance to cluster one, to cluster two, and to cluster three? We know that from the slides, and intuitively it's not that big a deal. So let's take a look at the math. Oh man. All of this goodness right here is nothing more than the Euclidean distance calculation: the square root of a squared plus b squared plus all the other terms. To calculate each term, we take one coordinate of the data point and subtract the corresponding coordinate of the cluster center, x2 minus x1, y2 minus y1, and so on; as you add more dimensions, you just keep doing that, and that's what these pieces of the formula represent. To make it a little easier to see, let me pull it up in Notepad. Here are the actual Excel formulas: I'm using a VLOOKUP to fetch one value and another VLOOKUP to fetch the other value, then I take the difference and square it, just like we saw in the slides, and all of it is wrapped up in the square root function, that radical, the little house-looking thingy that covers everything. This is basically the meat of the distance calculation.
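I can't reproduce the workbook's exact cell references here, so here's a hedged sketch of what one of those distance formulas looks like. Assume the original data is in a table named iris_data (Observation in column 1, then Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width in columns 2 through 5), the current row's observation number is in $A2, and the starting centers sit in a hypothetical table named centers_iter1 laid out the same way (cluster number in column 1, the four coordinates in columns 2 through 5):

    Distance from the current observation to cluster center 1:
    =SQRT((VLOOKUP($A2,iris_data,2,FALSE)-VLOOKUP(1,centers_iter1,2,FALSE))^2
         +(VLOOKUP($A2,iris_data,3,FALSE)-VLOOKUP(1,centers_iter1,3,FALSE))^2
         +(VLOOKUP($A2,iris_data,4,FALSE)-VLOOKUP(1,centers_iter1,4,FALSE))^2
         +(VLOOKUP($A2,iris_data,5,FALSE)-VLOOKUP(1,centers_iter1,5,FALSE))^2)

The first VLOOKUP in each pair pulls one measurement for the current observation, the second pulls the matching coordinate of cluster center 1, the difference is squared, and the square root of the sum is the Euclidean distance from the slides. Swap the 1 in the second VLOOKUP for a 2 or a 3 to get the distances to the other two centers.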
We're going to reuse these base formulas over and over again as we iterate, and as I go through the worksheet you'll see what I mean. The calculations aren't particularly complicated; unfortunately, they are kind of a pain to do in Excel, and if you're an Excel user with any sort of experience, this kind of ugliness is not unfamiliar to you.

Okay, so we've got our distance calculations: the distance from observation one to cluster one, cluster two, and cluster three. Then all we do over here is ask which of the distances is the smallest, because that's the cluster this row of data is closest to. That's pretty easy: a little Excel magic looks through the three distances, finds the smallest one, and reports which of the three it is, one, two, or three (there's a sketch of this formula below). For this row it happens to be cluster one, not by a lot, but cluster one is closest, so we get back a cluster assignment of one, then three, three, three, one, one, one, and so on. If I scroll down you can see it's a lot of threes and ones and a whole bunch of twos. That's iteration one: we pick three random cluster centers to start with, go through all 150 rows of data, calculate all the distances, find out which cluster center each row is currently closest to, and make that the current cluster assignment. That's analogous to the color coding we saw on the slides, except we're using one, two, or three instead of orange and green.

As we know from the slides, once we've got this table we need to move the cluster centers to the actual centers of the clouds of points. I'll demonstrate how we do that with Power Query in a bit, but conceptually: we take this table, pull out all of the rows currently assigned to cluster number one, and calculate the average of the sepal length, the average of the sepal width, the average of the petal length, and the average of the petal width. Those averages are the coordinates where we move cluster center one. We do the same thing for cluster two and for cluster three.

Now notice this: these were the original starting values for cluster center one, and they are now different. Not always by a lot, though in some cases quite a bit: this one went from 0.6 down to 0.26, while this one only moved from 5 to just over 5.09. But notice that it's moving in four dimensions simultaneously, which boggles the brain. Similarly, cluster two moved a little and cluster three moved a little, and that's typically what you see: the cluster centers move as the process iterates. Now that we've got the iteration two cluster centers, the new locations, we need to go through all of the data once again and ask whether we need to flip any of the data points, because a row of data may now be closer to a different cluster than it was before.
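Here's the sketch of that "which distance is smallest" step. Again, the layout is hypothetical: assume the three distances for the current row sit in F2, G2, and H2:

    Smallest of the three distances:   =MIN(F2:H2)
    Assigned cluster (1, 2, or 3):     =MATCH(MIN(F2:H2), F2:H2, 0)

MATCH with a match type of 0 returns the position of the smallest value within the three-cell range, and that position doubles as the cluster number.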
Let me scooch over so my face isn't covering things up. You can see iteration two here: we're just doing the same thing again, and if I click into this, notice that we're using the same type of calculation, the same stuff I showed you in Notepad, but we have to adjust the VLOOKUPs, because they now need to point at the new table of cluster centers instead of the table we used for the first iteration. We've got new cluster centers, so the VLOOKUPs for the distance calculation have to look into the new table. What that means is that you have to go update those big, long, hairy calculations for each iteration. It's not too complicated; I just use Notepad and do a search and replace (see the sketch just below for what actually changes). It's a bit error prone, because it's a manual process, and it's a bit tedious, but it's not conceptually difficult.

So we go through and calculate all of the assigned clusters, and let me scroll over so you can see that, for example, one of them changed: in the very first iteration, observation number seven, the seventh row of the original data table, was assigned to cluster one, but now it's closer to cluster number three, so in iteration two we flipped it. To go back to the slides, this is the same idea as turning that green, outlier-ish point orange. Cluster assignments are going to change as the algorithm iterates.

Once we've made the new cluster assignments, we need to calculate new cluster centers, because if we flip a bunch of points, the centers might not be in the middle of their clouds anymore. As I mentioned, we do this with Power Query, and I'll demonstrate how that works soon, but once again, conceptually: these were the coordinates for cluster center one at iteration two, and notice that they're different at iteration three, so cluster center one moved a little bit, and similarly cluster two and cluster three have moved around as we flipped assignments. Now that we've got new cluster centers, we go through the table of data again and calculate which cluster each observation should belong to now that the centers have moved. It's not super complicated, just tedious and a little error prone. You can see that if I click into the calculation, I'm now using this lookup table for the distances instead of that one. You're noticing the pattern here: not too complicated, but tedious, I won't lie. And then we take Power Query, mash this data up, and create the iteration 4 cluster centers.
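To make that search and replace concrete: suppose iteration one's distance formula looked into the hypothetical centers_iter1 table from the earlier sketch, and the new centers landed in a table named centers_iter2. The next iteration's formula is identical except for the table the center coordinates come from, and that table name is the only thing the Notepad search and replace needs to swap:

    Iteration two, distance from the current observation to cluster center 1:
    =SQRT((VLOOKUP($A2,iris_data,2,FALSE)-VLOOKUP(1,centers_iter2,2,FALSE))^2
         +(VLOOKUP($A2,iris_data,3,FALSE)-VLOOKUP(1,centers_iter2,3,FALSE))^2
         +(VLOOKUP($A2,iris_data,4,FALSE)-VLOOKUP(1,centers_iter2,4,FALSE))^2
         +(VLOOKUP($A2,iris_data,5,FALSE)-VLOOKUP(1,centers_iter2,5,FALSE))^2)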
Now notice this, because it's really important: cluster 2 didn't move. What that means is that on the last iteration, none of the rows assigned to cluster two changed. One and three are still moving around, but two is fixed, which is progress: it means we're getting closer to being done with this iterative process. By the way, we're not going to finish the complete clustering in this video, because I have no idea how many iterations it's going to take; I'm just showing you how you can do it in Excel, and if you'd like to complete it as a take-home exercise, feel free.

What we need now is our iteration 5 cluster centers, and I'm going to show you how to produce them using Power Query. We click into this table right here, and you'll notice it has a name: it's called iteration four, which is just a naming convention I'm using. Then we go up to Data in the ribbon, to Get & Transform Data, for some Power Query awesomeness: we hit From Table/Range, and that loads up Power Query (mine was running a bit slow, which is why I did a jump cut). All we need are two columns of data, the cluster assignment and the observation number, so I click on the first unneeded column, hold down Shift, click the last one, right-click, and choose Remove Columns. Remember, the original table of data is all the way over on the left of the worksheet, and we're still using it; what I need now are the values from that original table. So I need to do a join, essentially a VLOOKUP-ish kind of thing but with more power, and that's what Power Query gives me (way to go, Microsoft, with the naming).

So we're going to Merge, which is how you join tables of data in Power Query. It asks what I want to merge: here's the iteration 4 table, the one I'm currently working with, and I want to merge it with iris_data, the original values, because to calculate the cluster centers I need to say: give me all of the rows assigned to cluster center number one and calculate the average of all four columns from the original data table. I'll do an inner join on Observation, so each cluster assignment gets matched up with its original row of data, and that's what allows me to calculate the average of the four columns by cluster assignment, which is how we move the cluster centers around. It tells me the selection matches 150 rows, which is what you would expect, so I click OK, and now I've got my joined data: iris_data joined up to the iteration four table. I can expand it by clicking on this little icon, and it asks which columns I want.
I don't need the observation number, because I've already got it right here, so I'll untick that; I just want the four measurement columns, and I don't want to use the original column name as a prefix, because that's annoying. Click OK and boom: I've got my cluster assignment, my observation number, and the individual values from the original data table. Oh, sweetness. But I'm not done, because what I need now is, for each cluster (one, two, and three), the average of each of these columns by cluster number. We can do that very easily using the Group By functionality, which is essentially a way to pivot data, very similar to a pivot table in Excel but more powerful (that's why they call it Power Query, right?).

So I click Group By, go to Advanced, and I don't want to group by petal length, I want to group by the cluster assignment. Then it's pretty easy: I ask for the average of the sepal length and call it Sepal.Length, then add another aggregation, the average of the next numeric column, Sepal.Width, then another for Petal.Length, and another for Petal.Width. Notice this is exactly the process I did for each of the green cluster-center tables at the top of the worksheet: tedious, but not difficult. Click OK and boom, look at that: given the current state of the data in the worksheet, these are the new cluster centers. For some reason the rows come out of order, so I just sort ascending.

Now I need to put this in my worksheet, and that's pretty easy: I go to Close & Load, click the little down arrow, choose Close & Load To, tell it to put the results in my existing worksheet right underneath here, click OK, and close the editor down. Look at that: my updated cluster centers. Once again, notice that cluster two moved just a skosh, a teeny tiny bit, so it stayed still for two iterations and then moved a little. So we're not done. That's what this is telling you: you keep doing this process over and over until the clusters don't move at all, all three of them. As soon as one of these tables looks exactly like the previous green table, you're good to go, you know you're done, and you have your cluster assignments.
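As a side note, if you'd rather not round-trip through Power Query, the same "join, then average by cluster" step can be approximated with plain worksheet formulas. This is a substitute for the Merge and Group By shown in the video, not what the workbook actually does. Assume, hypothetically, that the iteration table has the cluster assignments in column B and that the four original measurements have been pulled in alongside them in columns C through F (for example with VLOOKUPs against iris_data):

    New center 1, Sepal.Length:  =AVERAGEIFS($C:$C, $B:$B, 1)
    New center 1, Sepal.Width:   =AVERAGEIFS($D:$D, $B:$B, 1)
    New center 1, Petal.Length:  =AVERAGEIFS($E:$E, $B:$B, 1)
    New center 1, Petal.Width:   =AVERAGEIFS($F:$F, $B:$B, 1)

Repeat with a criterion of 2 and 3 for the other two centers. AVERAGEIFS averages only the rows whose cluster assignment matches the criterion, which is the same calculation the Group By step performs.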
Now, I'm not going to bother filling out the whole next table down here, because it's tedious, but, what the heck, I'll show you the basic idea: you copy this entire table of data, paste it right here, and call it iteration 5. But notice that if I click in, these lookups are pointing at the wrong table, so you have to go and adjust the VLOOKUPs to look at the new table. I won't walk through that, because it's a simple search and replace using that Notepad file. Maybe I'll put the Notepad file up on the GitHub as well; you can get this workbook from the GitHub as I mentioned earlier, and I'll put the text file in there too in case you want to use it for the search-and-replace step. Then you just calculate everything again and repeat this process over and over. That's basically it.

Now, if you're thinking to yourself, "Geez, Dave, this is a lot of work," the answer is: yeah, it is a lot of work. This is a very computationally focused way of doing clustering, which Excel out of the box isn't very good at. If you really wanted to do this seriously in Excel, you'd typically either buy an add-in or write some VBA (Visual Basic for Applications) code. But there's an easier way: everything we've just seen can be done, for example, in one line of code in the R programming language. I'm a big fan of R myself. I use Excel all the time, but R is what you should probably go to if you need more power. And here's the cool thing: if you know how to write formulas in Excel like the ones in this workbook (make no mistake, that stuff is code), then learning R programming is very, very easy. I don't expect you to just believe me; I've got a video about it that I'll link in the description and at the end of this video. Any Excel user can learn R programming, and you can do this whole process in one line of R code instead of everything we just walked through.

Okay, I know this has been a long video, but I wanted to make it maximally useful to the most people, which is why I incorporated so many elements. If you're finding the content useful, would you mind helping me out with the YouTube algorithm and giving the video a like? That tells YouTube this content might be useful to other folks interested in doing k-means clustering with Excel.

Now that we've talked about how the algorithm works and how you can implement it in Excel, let's talk a little about the cons of using k-means clustering. As I mentioned at the very beginning, k-means is probably the world's most commonly used clustering technique, but it's not necessarily the best in every situation. It's a very good general-purpose clustering algorithm, which is why it's so widely used, kind of the default, but it's not perfect everywhere, so let's go through some of the cons, or at least things you need to be aware of if you use k-means.

First up, and I've talked about this before: as the data analyst, you need to tell k-means up front how many clusters you want it to find. The problem is that you don't really know the optimal number of clusters at the beginning, and the more complicated your data set, the more rows and columns you have, the less likely it is that you know how many clusters k-means should look for.
The most common technique people use to find the optimal number of clusters with k-means is what's known as a scree plot, and it's actually quite simple: it's basically a brute-force mechanism. For example, say you have a few thousand rows of data and 21 columns, so a fairly complicated data set. You would literally run k-means with two clusters, then with three clusters, four, five, six, all the way up to maybe 15 clusters, and for each k you calculate how good the resulting clusters are and plot it: two is here, three is here and a little better than two, four is better still, then five, six, seven, and so on. Eventually the plot shows diminishing returns in goodness as the number of clusters goes up, and that tells you where to stop: if you picture the curve, you usually stop at what's called the elbow, the point of diminishing returns, a bit like the elbow of somebody's arm.

You should be thinking to yourself right now: wait a second, Dave just walked through all the work needed to do one k-means clustering with a single k, so you're telling me I need to run that whole Excel process multiple times, for different numbers of clusters? The answer is yes. Which is why, although you can certainly do k-means clustering in Excel as I just demonstrated (it's mathematically possible and not that complicated), it takes an inordinate amount of time to do it well. The only exception would be if you already know up front how many clusters you have, but then that begs the question of why you're doing cluster analysis in the first place. If you want more information on the scree plot technique, I'll put a link down in the description below to a blog article about it. Once again, if you're really going to do k-means cluster analysis, you probably want a more sophisticated tool than Excel; my favorite, of course, is R, where you can do a k-means clustering in one line of code and it's trivial to explore multiple values of k and produce a scree plot to find the optimal number of clusters.
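One more note on the scree plot: the video doesn't show the "how good are the clusters" calculation, but the number most commonly plotted on the vertical axis for k-means is the total within-cluster sum of squares: for each row, take its distance to its assigned cluster center, square it, and add them all up. In the layout of this workbook that's easy to get, assuming (hypothetically) that the minimum of the three distance columns, i.e. the distance to the assigned center, sits in column K for rows 2 through 151:

    Total within-cluster sum of squares for this k:  =SUMSQ(K2:K151)

Run the whole clustering for k = 2, 3, 4, and so on, record this number for each k, plot it against k, and look for the elbow where the curve stops dropping quickly.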
Okay, what's another con? This one goes to the nature of the algorithm: because it starts with random cluster centers, running k-means multiple times will often give you different results. You will literally get different clusters because you're starting from different random points. The common practice to get around this is to run it multiple times, and you're probably seeing a trend here: first you run multiple rounds of k-means to find the optimal k using something like a scree plot, and then, once you've picked that number, say your scree plot says seven is the optimal number of clusters, you run k-means multiple times with a k of seven, because the random starting points give you different clusterings. So if you're thinking to yourself, "Oh my goodness, if I want to do robust k-means cluster analysis with Excel, that's a lot of work," yes, it's a lot of work, but it can be done.

Next up, and we've talked about this already: k-means works best when you have fewer categorical columns, fewer of those dummy-encoded ones-and-zeros columns. It bears repeating because categorical data is very common in the business world, so you really have to ask yourself whether you have too many categoricals to be using k-means. If the vast majority of your columns are numeric, sales and measurements like height, weight, and width, you're good to go; if you've got a lot of categoricals, you really have to ask whether k-means is the right technique.

Lastly, and this makes sense when you think about how the Euclidean distance calculation works: the clusters k-means produces look like clouds, globular chunks of data points, if you plot them on a chart, and your data doesn't always look that way. For example, and this is purely hypothetical, imagine that if you charted your data you saw a ring of points shaped like a circle, with another circle of points inside it. K-means will not pick out the outer ring as one cluster, even though logically that's probably what you'd want, and it won't pick out the inner circle as another cluster either. That's just not how the algorithm works, because of the way it does its distance calculations. If you have complex shapes like that in your data, k-means doesn't work well; there are other algorithms, one called DBSCAN for example, that can handle that type of situation. So k-means is a good general-purpose clustering algorithm, but if your data doesn't form those nice clouds of points, or if that doesn't fit your scenario, you might want to look into a different algorithm like DBSCAN.

Those are the cons of using k-means. All that said, it is arguably the world's most popular clustering algorithm, and it's very good at many things. Clustering is wildly useful for any professional, and if you're interested in upleveling your game in cluster analysis, you're probably going to need to go beyond Excel, so check out these videos, because in my experience nothing is easier for an Excel user than learning R programming, and as I've mentioned, R makes robust cluster analysis very, very easy. Until next time, please stay healthy, and I wish you very happy data sleuthing.
Info
Channel: David Langer
Views: 4,232
Rating: 5 out of 5
Keywords: k-means cluster analysis with Excel, k means cluster analysis with Excel, k-means cluster analysis using Excel, k means cluster analysis using excel, k-means clustering, k means clustering, k-means cluster analysis, k means cluster analysis, cluster analysis, k-means, k means, k-means cluster analysis excel, k means cluster analysis excel, cluster analysis excel, cluster analysis with excel, cluster analysis using excel, k means excel, k-means excel, clustering
Id: YuxwkchSAW4
Length: 48min 35sec (2915 seconds)
Published: Thu Jun 17 2021