Data Analysis: Clustering and Classification (Lec. 1, part 1)

Captions
Okay, welcome to this first set of lectures, basically on data analysis using clustering, classification, and regression. There are a lot of things we're going to learn, a lot of mathematical tools here that form the basis of what you might call data mining or machine learning. Part of what we want to address is: what are some principled ways to think about data, and how do you go about taking data in and looking for patterns in it so that you can do a variety of things? Not only do you want to do things like classification tasks, where you take a new piece of data and classify it as something that you know; you may also want to look for clusters in your data. Maybe you don't know much about the data and you just want to understand whether there are patterns in it which you don't know about and are trying to discover. And there's also the idea of regression: can I take my data and think about doing curve fitting to it, so that I can make predictions based upon that data and apply them to new data?

Okay, so we're going to talk about some of the key ideas, and they revolve around this idea of learning from your data. You might think about this generically as learning, or you might just call it data mining. The idea is to take that data and look for structure in it, and it comes in two broad techniques, and I'm going to go through both of them here. In this series of lectures, what we're going to do is cover a lot of the key methodologies around these methods.

The first thing I'd like to highlight is this idea of unsupervised learning. The idea here, just as it sounds, is that you're going to write an algorithm on your computer that's going to look for patterns in the data. You're not going to label any of the data; you're not going to tell the computer what kind of data it is. The goal of the algorithm itself is to discover these
patterns in the data without you being the architect of that discovery process.

Another type of data analysis is called supervised learning. Supervised learning is a little different. It assumes that you know something about your data, and the idea is to capitalize on what you know; perhaps you can even classify some of your data. You might think about pictures of dogs and pictures of cats. Let's say you have a hundred dog pictures and a hundred cat pictures. I'd like to train an algorithm so that if I give you a new picture of a dog or a cat, your computer could actually classify it as a dog or a cat. That's the idea of supervised learning: you're going to label data ahead of time in this process so that you can make future decisions based upon that learning process.

Now, interestingly enough, these two kinds of algorithms are trying to do different things. In unsupervised learning, the goal is to look for patterns: to take your data and discover clustering in the data in some kind of space. So unsupervised learning is really about patterns in your data, or looking for clusters in your data. We'll show some examples of this in a little bit. So this is the data mining operation: you don't know anything, you're just going to put the data in there, go look for things, and hope that you can do a good job of finding them.

Supervised learning is different because in some sense it's already assuming that you have clusters. The cat and dog example is one where you say: I have pictures of cats and pictures of dogs, so I already have clusters. I already know that I have dogs; I already know that I have cats. So you are the expert in the loop, and in this idea you are providing labels. You're going to label the data ahead of time: okay, I have dogs, I have cats, I labeled them all, I know
what they are. And now the idea is to make use of that: if I give you a new dog or a new cat, can you label it correctly yourself, or how can I write an algorithm to do that? So this is going to involve the idea of providing labels for data, and then you're going to see what you can do with new data: how do I label new data? So there's going to be a training phase and then a test phase. The idea behind training is to use the fact that you know what cats and dogs are, so you can train an algorithm to recognize what the key structures of the data are based upon your expert-in-the-loop opinion, and then you can go test new data and see how well you did.

So one idea, then, is clustering, which is happening here in unsupervised learning: you're looking for clusters, you're looking for patterns in the data. In contrast, here in supervised learning you're not necessarily looking for clusters; you've already told it what the clusters are by providing the labels. So what you're trying to do in this case is classification, and also regression. The idea behind classification and regression is that if I can do classification, I can find some kind of regression curve, basically some kind of best least-squares-fit type curve to the data (it doesn't have to be linear), that actually explains this classification of the data. So that's the big-picture thinking behind these two kinds of learning algorithms, and we're going to hit a lot of the common algorithms used for both.

Now, before we get to that, what are the advantages and disadvantages? Well, with unsupervised learning, the disadvantage is that you don't give any expert information or knowledge to your algorithm, so it has to figure everything out all by itself. It may not even know the right number of clusters. Suppose, for instance, you have dogs and cats. In an unsupervised algorithm, if you're just letting this thing look
for patterns in the data, it may not know that I'm looking for two distinct clusters, dogs and cats. So that can be a drawback here, and it's typically the Achilles heel of this methodology. This is why you want supervised learning, because there you're telling it what to look for ahead of time, or you'd like to; unsupervised learning doesn't have that capability unless you enforce it yourself.

Supervised learning, on the other hand, already assumes the kind of things that you're looking for: you're going to give it a training set and you're going to provide labels. So maybe you already have two data types, dogs and cats, and it doesn't have to ask how many clusters it is looking for in the data; it already has that information, and you gave it to it. Not only that, you gave it expert opinion to start building a training set from. So there are some advantages with supervised learning. But the disadvantage of supervised learning, of course, is that there's a long training stage. Sometimes it takes a lot of data, and if you want good classifiers, oftentimes you have to go through a long training stage to do the classification.

Okay, so let me show you an example of what this might look like, and then we're going to program in MATLAB. Let me come over here and put some data up for you. So for instance, this could be some kind of data set in some abstract representation. Here's a bunch of green squares, and they could be, let's say, dogs or cats, or some kind of projection of dogs and cats into a two-dimensional space. So it's kind of nice: they're here and clustered, and each point would represent a different dog, for instance. So that's a sampling of data that you might have, and then you could also have a different type
of data, and let's represent this in orange. Suppose those are your data sets. So now what you see is that you have these two data types: you have the green, you have the orange. And of course here it's almost trivial to say that they're different. You can say: well, there's the green, there's the orange, I clearly see the difference. And the idea then would be to say: what I will do is form a cluster around the green and a cluster around the orange. So that's one operation that you might start to think about doing; you can easily do this classification task.

Now let's talk about how it relates to unsupervised and supervised learning. If you're in an unsupervised learning algorithm, then you don't even know that there are two things, two kinds of objects, let's say green and orange dots. So ahead of time you're going to have to make some decision. You're going to say: hey, I'll train my algorithm and tell it to go look for two clusters, for instance. In that case, if you pick two, and clearly there are two, what it would do is start to figure out whether it can form two clusters, and the idea would be: here's cluster one, and here's cluster two. Now, that's essentially all unsupervised learning does: it provides a principled way to do that clustering. What's interesting is that you can use unsupervised learning and then feed it into supervised learning, because once you have the clusters, it's equivalent to providing labels.

Now, what supervised learning tries to do is say: well, I have this data, and I've given expert-in-the-loop opinions. For instance, I could say: hey, these are green squares, these are orange dots. I've labeled them, so you know what they are; you already know there are two clusters, two types of data. And so the question is what happens if I have new data in the system. For instance, let's take a look at something like this. Suppose I have a new piece of data right there. So this is something where I've
already got my clustering, so now I want to go down to my supervised learning algorithm, which is to say: what is this? If I give you a new piece of data and it sits right here, what is it? And of course, by the naked eye, you would say: well, it's up here with all the greens, so it's probably a green. That's what supervised learning would do; it would go right down to this classification and regression. In fact, you might even say that what classification and regression are really going to do for you here, in providing a training, is to decide how to separate this data. So for instance, you might separate the data like this: anything above here is green, anything below there is an orange dot.

This example is very simple. If I classify that circle here, it's clearly above here with the greens, so I would classify it as a green square. And if I had a piece of data down here, let's say this X, and I ask what that X is, well, I'd say: look, it's actually sitting down there with the cluster of orange balls, so I would label it as an orange ball. So that's the idea here: you have a training set, let's say the green and the orange, and your test set is these new data points. So for instance, if this was dogs and this was cats, and I gave you this new point, I'd say: oh, that's a cat. Let's make this up: let's say these are cats down here and these are dogs up here. Then any dots down here would be cats and any dots up here would be dogs. Cats down here, dogs up here, and that's the simple classification.

And again, you can get at these through these different types of methodologies. In this case, if you knew it was dogs and cats, you would say: look, I already know that there are two types of data in there, so I'm going to go ahead and classify it as such. So this is kind of the general idea, and again, unsupervised
learning (this is a really key concept), for data that you don't know anything about, provides you a way of getting labels on your data. It's looking for clusters. You have to pick the number of clusters, but once you've done that, it provides you with labels: this point belongs to cluster 1 versus cluster 2. Then you could do things like use those as training sets to do classification and regression. And in fact, ultimately what this allows us to do is the very important thing of drawing this line right here. This line is essentially going to be like my regression line, because it's giving me a mathematical formula for distinguishing one type of data from another. Ultimately, a lot of the goal of data mining is to first look for clusters and then provide these lines of separation between the data.

Now, this case is fairly easy, right? If we look at this case, everything separates nicely. Let's take a more sophisticated case and see what that would give us. I'm going to erase this data. There's probably a faster way than how I'm going to do it, but fine, I'm going to erase all this, and we're going to come up with a new data set. The new data set is not going to be as clean as that one, where I had everything separated. In fact, let me also erase this classification line that I drew. Oftentimes in textbooks you see these nice cases where everything separates out nicely and there's no problem, and almost by the naked eye you can make these distinctions about what things are. But in a lot of cases it's a little more complicated than that.

So let's go ahead and draw some of these orange circles, and let's just say that's my data. And then I could say: what if I had another type of data? Then I could ask: do you think you have a shot at having a nice regression line that would separate the orange from the green? A lot of times you get data where the data
analysis tricks don't work so well. Sometimes it's all about how I get the data down into that representation in the first place, and this is where something like principal components plays a big role: in this projection space where I put my data, is there maybe a more clever way to represent the data where it does separate out? Versus something like this, where clearly there's no way. If I gave you a new piece of data, let's say right here, and I asked you what it is, you would not be able to say with any kind of accuracy that this was an orange ball or a green square, because in fact it sits there in a sea of orange and green. These are the ambiguous kinds of things where your clustering isn't going to give you a good separation. Even if you label these data, it's clear it's not going to give you a good separation. So in data analysis there are serious drawbacks that you might encounter with this type of data. But this can represent realistic data that you would have in practice, so it's important to understand that a lot of times you're going to see things like this.

Okay, so we've given you one case where there's a nice separation and one case where there's not. Let's do what's more typical, which is some hybrid in between, where you get some separation but you also get some mixing. I'm going to show you one example of that, and I always think of this as being more what you get in practice than anything else.

All right, I've got all that erased; let's put some more dots down. We're going to go back to green squares, so I might have some things out here, a nice cluster of green squares. And then I might have my orange data as well; there's the orange. Okay, let's look at this example for a moment. In this case you clearly see that most of the greens are below and most of the oranges are above, so
there is this nice separation. If you were to do some kind of clustering analysis and you didn't know the number of clusters, but you did manage to pick two clusters, what you would probably find is something bunched up there in the orange and something bunched down at the bottom in the green. Or, if you were to do supervised learning, which is if you labeled these things, you would say: here's a bunch of orange balls and green squares, and what I want to do at the end of the day is these two things, classify and regress. In other words, figure out how I want to classify new data, and how I draw a regression line, which ultimately leads me to a way to separate the data. In this case, you might think that the best separation line, I don't know, let's just draw something, might be something like this. So again, anything above here is an orange ball and anything below here is a green square.

So in this scenario you get a nice separation, but notice you're going to make mistakes. And by the way, that's the general thing. What you hope to do in data analysis, even in data mining and looking for patterns in your data, is get the right trends, maybe even get accuracies in the 80s or 90s percent, which would be fantastic. Sometimes even getting above a coin flip is pretty good. If you do better than just guessing at 50%, maybe you find your accuracy is only 55%, but that's still better than a pure guess.

But with something like this, you already see right here that I have this green square, and that green square gets misclassified. In this algorithm it would say that this green square is actually an orange circle, and that's wrong, so you've made a mistake. However, you got most of the other ones right. Now, on this side over here you get a lot of the green, but look at this guy right here, which gets misclassified as well: this one gets misclassified as a green square. So you've made a mistake on this side, you've made a mistake on this side,
and also there are things like this one here, which sits right on that boundary line, and the algorithm doesn't quite know what to do. Basically the algorithm has double-precision accuracy, so at some level of 10 to the minus 16 the point is going to be on one side of the line or the other, and it's simply going to get classified as being one of these or one of these. But these points are very interesting because they're right at the border of your decision space: on one side you're going to go one way, on the other side you're going to go another way. So that's the more generic thing: you're going to make mistakes; you're not going to get perfect classification.

Now, the one thing I'm going to highlight here is that I think one of the biggest dangers in data analysis is the danger of overfitting your data. This can happen a lot in supervised learning algorithms, and you have to be really careful. I want to talk about this concept of cross-validation, and I'm going to illustrate it here. If I wanted to do some kind of classification on this data, and this is my training data, and I want to do a perfect classification task, I might draw something like this. Let me see if I can do it: go around here. And I could say: hey, look at that, if that is my decision line, I'm going to get everything perfectly right. The problem with something like this is that I can take any training data, and if I generate some sufficiently complicated curve, I'll get 100% accuracy.

So I'm going to start talking about the idea of cross-validation, which is: how well does this hold up when I put new data to it? The idea of cross-validation is extremely important. Perhaps the most important thing to take out of all of this is that you had better cross-validate. How would a cross-validation procedure work? And by the way: always cross-validate. Never miss doing this step. Never, ever miss doing this step. If there's anything you want to write down here, it's this: always cross-validate. So how do
we think about it? Well, cross-validation says the following. Since I'm going to provide labels to the data and it's going to be my training set, what I'm going to do is take all the data that I have. Suppose I have 100 orange balls and 100 green squares. One thing I could certainly do is take 80 out of the 100, randomly chosen, here, and 80 out of the 100, randomly chosen, here. So I'm going to take 80% and have that be my training set. I'm not going to train with all the data I have; I'm going to train with 80% of it. And to test on new data, the 20% I withheld is going to be my test data. So I train with 80%, I test with 20%, and I'm going to draw a decision line every time I do this. I'm going to test it and get some kind of accuracy score.

But once I get an accuracy score, let's suppose I do this and I get some decision line and an accuracy of 95%, maybe I just got lucky. Maybe when I picked, I didn't pick this problematic guy here and this problematic guy over here. So I'm going to shuffle: I'm going to take a new 80% and test on the remaining 20%. And I'm going to do this randomly again and again: take a new random 80%, test on the other 20%; a new random 80%, test on the other 20%. What I want to do is keep track of my accuracy as a function of these random trials. And what ends up happening is that you no longer get curves like this one. This curve was very specific to this exact data pull; if I take another random sampling, this curve will typically be a bad predictor.

That's how you do cross-validation: for every decision curve you make, you evaluate how accurate it is and how robust it is if I take a new set of training and test data. So this cross-validation step is absolutely critical. Never present your results on data unless you've done this,
ever. Okay, it's really, really important. You're just being lazy if you don't do it, and nobody will believe you if you don't cross-validate anyway.

So those are the highlight ideas. Unsupervised learning: look for clusters in data. Supervised learning: once I have labels or clusters in my data, which is what I'm hoping for, provide some kind of classification or regression on that; in other words, draw those decision curves. And whatever you do in this process, whatever classification or regression curve you get, always cross-validate: always check what kind of accuracy you get if you do a random trial and you shuffle. Those are the major highlight concepts that we want to test in the pieces of code that we have next.

So that's the snapshot. In the rest of this lecture I'm going to present two methods. One is essentially the bread-and-butter unsupervised learning algorithm, called k-means. And in supervised learning I'll present one of the simplest ones, called k-nearest neighbors. They're very intuitive. I'm going to write some code to go through them, and we'll start with that next.
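
The shuffle-and-retest procedure described in the captions can be sketched in a few lines. This is only an illustration, written in Python rather than the MATLAB used in the lecture, and it is not the lecturer's code. The synthetic two-blob data, the function names (`make_data`, `knn_predict`, `split_by_class`, `cross_validate`), the random seed, and the choice of 20 trials are all assumptions made for the example. It uses a nearest-neighbor classifier with k = 1, which, like the hand-drawn curve that wraps around every point, gets the training data perfectly right and so makes the gap between training accuracy and held-out accuracy easy to see.

```python
import random

def make_data(n_per_class, rng):
    """Two overlapping 2-D blobs: label 0 ("green") sits low, label 1 ("orange") sits high."""
    points = []
    for label, center_y in ((0, 0.0), (1, 2.0)):
        for _ in range(n_per_class):
            points.append(((rng.gauss(0.0, 1.0), rng.gauss(center_y, 1.0)), label))
    return points

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    nearest = sorted(train, key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2)[:k]
    votes_for_1 = sum(label for _, label in nearest)
    return 1 if 2 * votes_for_1 > k else 0

def accuracy(train, test, k):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == label for x, label in test)
    return correct / len(test)

def split_by_class(data, rng, train_fraction=0.8):
    """Randomly keep train_fraction of each class for training; withhold the rest for testing."""
    train, test = [], []
    for label in (0, 1):
        members = [p for p in data if p[1] == label]
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

def cross_validate(data, k, trials, rng):
    """Repeat the random 80/20 split and record the held-out accuracy of each trial."""
    scores = []
    for _ in range(trials):
        train, test = split_by_class(data, rng)
        scores.append(accuracy(train, test, k))
    return scores

rng = random.Random(0)          # fixed seed so the run is repeatable
data = make_data(100, rng)      # 100 "green" and 100 "orange" points

# Testing on the same points used for training: with k = 1 each point is its
# own nearest neighbor, so this score says nothing about new data.
train_acc = accuracy(data, data, k=1)

# Held-out accuracy over 20 random 80/20 splits.
scores = cross_validate(data, k=1, trials=20, rng=rng)
mean_score = sum(scores) / len(scores)

print("accuracy on the training data:", train_acc)
print("held-out accuracy per trial:", scores)
print("mean held-out accuracy:", mean_score)
```

The number to report is the spread of the held-out scores across trials, not the score on the training data. The split here is done per class to mirror the "80 of the 100 here, 80 of the 100 here" description; shuffling the whole data set before splitting would be an equally reasonable variant.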
Info
Channel: Nathan Kutz
Views: 109,257
Rating: 4.9486079 out of 5
Keywords: data science, supervised learning, unsupervised learning, clustering, classification, Nathan Kutz
Id: B0TI2q7wgIQ
Length: 26min 59sec (1619 seconds)
Published: Fri Feb 19 2016