Data Analysis: Clustering and Classification (Lec. 2, part 1)

Video Statistics and Information

Captions
Okay, today is filled with awesomeness: we're going to talk about some advanced clustering and classification techniques. The only way it could be more awesome is if I were wearing a Kam Chancellor jersey. If you don't know who Kam Chancellor is, look up some YouTube videos on him; he is like the principal component of awesome. Not only is his number 31, whose digits are the first two digits of pi, it's also a prime number, and he's also our Seahawk. I figured that would help me give a better lecture today.

So I want to talk about some new methods and start doing data analysis on some data; in fact, we're going to look at dog and cat pictures. You've already seen principal components, so we're going to look at some dogs and cats, see how they naturally cluster, and then, within that context, start applying some of these algorithms and learning new ones.

Let's come over to the laptop and open MATLAB; we'll be in MATLAB for most of this lecture. We start by loading a particular data set of dog and cat pictures: load dogData.mat and load catData.mat. Part of what we want to do is write an algorithm that can distinguish a cat from a dog. That is the simplest classification task we can start with, and we want to see whether this data actually produces clusters. We'll do that using principal component analysis of the pictures. Running the load commands produces two variables, one called dog and one called cat.

If we look at size(dog), it is 4096 rows by 80 columns. The way the data matrix is arranged, each column is one picture of a dog, and there are 80 dogs. The 4096 rows are the pixels: to display a picture, I need to reshape each column into a 64-by-64 image. The same goes for the cats: I have 80 pictures of cats to go with my 80 pictures of dogs. What I'd really like to do is use these pictures as my training set and figure out whether I can write an algorithm that will, in a principled, systematic way, identify the differences between dogs and cats just from the pictures. That's the goal for today, so let's start programming with the loaded data.

First, let's look at what these cats and dogs look like. We loop for j = 1:9 over the first nine dogs, make a figure, and call subplot(3,3,j) to build a three-by-three grid of the first nine pictures. Each dog is the reshape of one column of the dog matrix into a 64-by-64 image.
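The column-per-picture layout and the reshape step can be sketched in Python/NumPy as an illustrative equivalent of the MATLAB code (the lecture itself uses MATLAB's reshape; here a random matrix stands in for the real dogData.mat contents):

```python
import numpy as np

# Random stand-in for the lecture's dog matrix: 4096 rows (64*64 pixels),
# 80 columns (one picture per column). The real data comes from dogData.mat.
rng = np.random.default_rng(0)
dog = rng.integers(0, 256, size=(4096, 80), dtype=np.uint8)

# MATLAB's reshape(dog(:,j), 64, 64) is column-major, which order="F" mirrors.
first_dog = dog[:, 0].reshape(64, 64, order="F")

# Round trip: flattening column-major recovers the original column exactly.
assert np.array_equal(first_dog.flatten(order="F"), dog[:, 0])
print(first_dog.shape)  # each column unpacks to one 64x64 image
```

Each reshaped image can then be displayed in a 3-by-3 grid, as the lecture does with subplot and imshow.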
With that reshape in hand, we can just imshow each dog (I called the variable dg in the code). So we grab the data, open a figure, set the subplot, take each of the first nine columns, and imshow it; that gives us pictures of the first nine dogs. Run it, and there they are: some exemplary dogs.

There are a couple of interesting things about how we're doing this. We're taking the raw images with no other processing. One thing that is extremely helpful, and actually yields a quite a bit better classification algorithm, is to look not at the raw images but at their edges, using an edge-detection-type algorithm such as a wavelet transform, to pull out the edges that matter in each picture. For now, though, we'll work with these raw headshots of dogs and try to figure out the principal components that represent a dog.

We can do the same for the cats: call the variable ct instead of dg, index into the cat data, and plot ct. These are the first nine cats, and there they are. Comparing the cats and the dogs shows one of the problems if you skip edge detection: look at the backgrounds. This cat has a white background, this one gray, this one a darker gray, so the background ends up playing a role. An edge-detection algorithm essentially removes the background altogether. Either way, we're going to take these ideas, look at the data in a principal component space, and right away you'll start seeing things we might use for clustering and classification.

Okay, so those are pictures of dogs and cats. I'll now delete the plotting code, since it was only there to show what the images look like. Next we put the dog data into a matrix D. The dog variable is currently in uint8 format, basically an image format, and we need to convert it to double-precision numbers because we're going to do math on it: D = double(dog), and C = double(cat) for the cats.

Now we look at the correlation structure. I build a data matrix X by stacking the dog data next to the cat data: X = [D C]. Remember, D had 80 columns, each one a dog, and C had 80 columns of cats, so X has 160 columns of dogs and cats: the first 80 are dogs, the last 80 are cats. To examine the correlation structure among those dogs and cats, we take the SVD, which is essentially principal components: the economy-size SVD of the X matrix. Then we open a new figure and plot the diagonal of S, in black circles with a line width of two, to see whether the data has low-dimensional structure and how many principal components we'll need.
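A NumPy sketch of the stacking and economy-size SVD just described (variable names mirror the MATLAB ones; the matrices here are random stand-ins for the real 4096-by-80 image data):

```python
import numpy as np

# Random stand-ins for double(dog) and double(cat); the real
# matrices are 4096 x 80 (pixels x pictures).
rng = np.random.default_rng(1)
D = rng.random((4096, 80))
C = rng.random((4096, 80))

# Stack side by side: columns 1-80 are dogs, 81-160 are cats.
X = np.hstack([D, C])

# Economy-size SVD, MATLAB's [u,s,v] = svd(X,'econ'). Columns of U are
# the principal-component "eigenfaces" (each reshapes to 64x64), s holds
# the singular values whose decay we plot, and rows of V give each
# picture's projection onto the modes.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)  # (4096, 160) (160,) (160, 160)
```

NumPy returns the singular values as a vector already sorted in decreasing order, so plotting s directly corresponds to plotting diag(S) in MATLAB.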
This gives us a diagnostic view of the data, and there it is: the singular value decay. You can see one really big dominant mode, then modes two, three, four, and then it tails off. There are 160 pictures in there, and there's a lot of variance in this data set, right? Each dog and each cat is a little bit different, so there's a long tail, but you do have some dominant features up front.

What we do next is ask how these dogs and cats project onto the principal components. We can look at the first four principal components; in fact, I want to see what each individual dog and cat looks like in the space of the first few principal components. So I remove the singular-value plot and look at the principal components themselves, which are the columns of u. In a loop over j = 1:4, I call subplot(2,2,j), take column j of u (one of these eigenfaces, essentially), reshape it to 64-by-64, and display it with pcolor, axis off, shading interp, and colormap hot. Column one is the first principal component, column two the second, column three the third, column four the fourth, and so forth; each starts as a long vector and gets reshaped into a 64-by-64 image.

Just in case you were wondering, the first principal component of awesomeness is Kam Chancellor; it's a well-known fact in the right circles. Here, though, we're just looking at dogs and cats.

So there are the first four principal components. When you do the edge detection these look much better, but it's okay; these are the top ways the pictures correlate. Remember, this number-one mode is, in some sense, the average face represented by all those dogs and cats combined. You can see some structure: the nose and mouth region, the eyes right there, some structure around the ears. Those are your dominant feature sets; again you can see eye-type structures and things happening around the ears. We're going to project every dog and cat onto this basis set, and each dog and each cat has a unique representation in that basis.

Now that we have the basis, let's use what we know about the data: in X, the first 80 columns were dogs and the second 80 were cats. The matrix v tells you how each picture projects onto u: for dog one, how it projects onto all the principal components; for dog two, the same; and so on. What I'm going to look at is just principal components two, three, and four. The first component is the dominant average face shared by everything, so instead I want the variance around that average face; we could look at any components, but two, three, and four are what I'll choose.

So we use plot3. First pull out the first 80 rows of v, which are the dogs (rows 81 to 160 are the cats), and plot the second column versus the third versus the fourth, in black circles with a line width of two to make it a little thicker. Hold on, and on the same axes plot rows 81 to 160 (I could write end, it doesn't matter), again column two versus three versus four, in red circles with line width two. Let's take a look; this is a very important figure for us, and here it is, enlarged. Black are dogs, red are cats. Rotating the view interactively, you can see that at certain angles there's a lot of red living over here and a lot of black over here. So the data itself is quite mixed, but there is also the potential for some separation: call it a cluster of blacks and a cluster of reds, but the fact is they overlap. Here's maybe another viewpoint, again just rotating. We're kind of lucky here, and this is also why I picked three dimensions: it's a little easier to visualize. So we have principal components two, three, and four on the axes, and this is the kind of picture you get. Of course, you're not going to do very well with just basic clustering, because the red and black are quite mixed in this region. Part of our job is to figure out how to use our data analysis skills and clustering algorithms to give a predictor, a principled separation, so that if you give me a new picture, my classification algorithm can label it dog or cat.

To do that, we set up one last step. I know these are pictures of dogs and cats, so I want to set aside a training set and a test set; this goes toward the idea of cross-validation. Out of the 80 dogs and 80 cats, I'll take 50 dogs and 50 cats at random and hold those out; I'll train my classifier to recognize dogs from cats on those 50. The remaining 30 dogs and 30 cats I'll then run through the algorithm and ask, for each one: is this a dog or a cat? It will label each of those 30 dogs and 30 cats, I count how many I got right and how many wrong, and I get a percentage accuracy. It's as simple as that. Then, in cross-validation, you would repeat this random
selection many times: take a random 50, train on them, test on the remaining 30; take another random 50, train, test on its 30; and repeat until you get an average score and understand how the variance in the accuracy behaves.

How do we do that? Make two variables: q1 = randperm(80), and q2 the same. Why q1? The way randperm works, randperm(80) takes the integers one through 80 and randomly permutes them, so q1 comes back as a random shuffling of 1 to 80; run it again and you get a different random shuffling. So the first 50 entries of q1 give me random indices of dogs to pick out, and I use q2 to pick out my random cats. That's it; that gives us our training set and our test set.

First, my dog data: xdog = v(1:80, 2:4), and my cat data: xcat = v(81:160, 2:4). I've taken the first 80 rows of v, which I know are dogs, and columns two to four, in other words their projections onto principal components two, three, and four. By the way, these columns define the feature space I want to work in; the only reason we took components two, three, and four is that they were easy to visualize. You could take many more features: 20 features, ten features, whichever you want. Here we'll see how well we can classify with three features, namely the second, third, and fourth principal components. For the cats, same thing: rows 81 to 160, columns two to four. So now I have three features for cats and three features for dogs, all from this principal component space.

Now for the classification setup. Let's build a training set: xtrain = [xdog(q1(1:50),:); xcat(q2(1:50),:)]. Let's walk through this: I stack the 50 randomly chosen dogs, the rows of xdog indexed by the first 50 entries of the permutation q1, with all three feature columns, on top of 50 randomly chosen cats indexed the same way by q2. So the training set is 50 random dogs and 50 random cats. The test set is the rest; copy and paste, but instead of q1(1:50) use q1(51:end): xtest = [xdog(q1(51:end),:); xcat(q2(51:end),:)]. Since end is 80, this grabs the remaining 30 permuted dogs, and likewise the remaining cats via q2, which is a different random permutation.

So the first part of this lecture is just setting this up: a training set and a test set. These are the components you need for any kind of supervised learning algorithm, and the goal will be to build training algorithms that make use of these training sets of dogs and cats. Let me just run this to make sure there are no issues; there's figure 1, and then you can see here
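the split in action. As a sketch, the same split can be written in NumPy, where np.random.permutation plays the role of MATLAB's randperm and a random matrix stands in for the real v from the SVD:

```python
import numpy as np

# Stand-in for v from the SVD: 160 rows (80 dogs, then 80 cats),
# columns are projections onto the principal-component modes.
rng = np.random.default_rng(2)
v = rng.random((160, 160))

# Features: projections onto principal components 2-4
# (MATLAB v(1:80,2:4) and v(81:160,2:4); NumPy indexing is zero-based).
xdog = v[:80, 1:4]
xcat = v[80:160, 1:4]

# Random index shuffles, like q1 = randperm(80) and q2 = randperm(80).
q1 = rng.permutation(80)
q2 = rng.permutation(80)

# Train on 50 random dogs stacked on 50 random cats;
# test on the remaining 30 of each.
xtrain = np.vstack([xdog[q1[:50]], xcat[q2[:50]]])
xtest = np.vstack([xdog[q1[50:]], xcat[q2[50:]]])
print(xtrain.shape, xtest.shape)  # (100, 3) (60, 3)
```

Because q1 and q2 are permutations, the 50 training indices and 30 test indices for each class are guaranteed to be disjoint, which is exactly what a held-out test set requires.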
that I've got q1 and q2 in the workspace. By the way, let's take a look at what q1 looks like: there it is, just a random shuffling of the numbers 1 through 80. The size of my training set is 100 by 3: 50 dogs and 50 cats, with three features each. My test set is 60 by 3: 30 dogs and 30 cats, three features, and that's what I'm going to test on. So we are now set up to make a go at some of these algorithms beyond k-means and k-nearest neighbors.
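The cross-validation procedure described above (repeat the random 50/30 split, retrain, score the held-out 30, and average) can be sketched as below. The classifier here is a simple nearest-centroid rule used purely as a placeholder, since the lecture has not yet introduced its training algorithms; the point of the sketch is the resampling loop, and later algorithms slot into the classify step. The Gaussian features are also a stand-in, not the real projection data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-ins for the 80 x 3 dog and cat feature matrices
# (projections onto principal components 2-4).
xdog = rng.normal(loc=-1.0, size=(80, 3))
xcat = rng.normal(loc=+1.0, size=(80, 3))

scores = []
for trial in range(100):                      # repeat the random split
    q1, q2 = rng.permutation(80), rng.permutation(80)
    train_dog, test_dog = xdog[q1[:50]], xdog[q1[50:]]
    train_cat, test_cat = xcat[q2[:50]], xcat[q2[50:]]

    # Placeholder classifier: label a test point by the nearer
    # class centroid of the training data.
    c_dog = train_dog.mean(axis=0)
    c_cat = train_cat.mean(axis=0)

    def classify(pts):                        # 0 = dog, 1 = cat
        d_dog = np.linalg.norm(pts - c_dog, axis=1)
        d_cat = np.linalg.norm(pts - c_cat, axis=1)
        return (d_cat < d_dog).astype(int)

    correct = (classify(test_dog) == 0).sum() + (classify(test_cat) == 1).sum()
    scores.append(correct / 60)               # accuracy on the 60 test pictures

print(f"mean accuracy over 100 splits: {np.mean(scores):.3f}")
```

Averaging over many random splits is what turns a single lucky (or unlucky) 50/30 draw into a stable accuracy estimate with an associated variance, which is the point the lecture makes about cross-validation.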
Info
Channel: Nathan Kutz
Views: 12,982
Rating: 4.9076924 out of 5
Keywords: k-means, clustering, classification, data science, machine learning, Nathan Kutz
Id: Tk7KrWOVgYc
Length: 26min 52sec (1612 seconds)
Published: Fri Feb 19 2016