Feature Selection in Python | Machine Learning Basics | Boston Housing Data

Video Statistics and Information

Captions
hello everyone and welcome back to my channel. Today I'm going to be going through a machine learning tutorial in Python on feature selection. Previously we've been going through a general introduction to machine learning and the difference between supervised and unsupervised learning, which we worked through in an example using k-nearest neighbors, with both clustering and classification examples. Today, however, we're going to try to improve on our previous k-nearest neighbors regression using feature selection.

If you are new here, my name is Kira and I'm a first-year computer science / machine learning PhD student. My channel here on YouTube is mostly about academic lifestyle and productivity, but I also like to make these videos about my own research and what I've been learning in computer science, so that I can help many of you out as well. I do hope that you will stick around and subscribe to see more content like this, and I really hope that you enjoy this video.

Obviously today we're going through feature selection, and I'm just going to run you very briefly through what we did last week, well, a couple of weeks ago. What I'm going to show you today can be done with any learning algorithm really, but last time we were looking at k-nearest neighbors because we'd been talking about the difference between supervised and unsupervised learning, so we went through k-NN for both of those tasks, classification and clustering. I've also been talking a bit about case-based reasoning, because that's what I mostly use in my research, so I'll have all of those previous videos linked down below.

Basically, one of the things we did last time was a k-nearest neighbors regression on the Boston housing data, which is one of the toy datasets that is often used for different regression tasks. That's where we got to last time, and today we want to look more closely at the different features that we have, the different variables, and use some feature selection methods to reduce the number of features.

Very briefly, the main purpose of feature selection is that lots of learning algorithms tend to perform very poorly on high-dimensional data, meaning data that has a ton of different features, a ton of different explanatory variables. Some learning algorithms just perform very badly on this kind of data, which is known as the curse of dimensionality. There are also other reasons we might want to reduce the number of features. One is reducing the computational cost: running learning algorithms over a ton of different features is going to take much longer than running them over a few. There can also be a cost reduction in collecting the data: in cases where it's either very time-consuming or actually expensive to collect data, having to collect data for a ton of different features is a lot more difficult than collecting just a few. And lastly, reducing the number of features can often improve the interpretability of the model: an outcome explained by a couple of features makes much more sense to people than one explained by a very long list, so fewer features is generally preferred. So now I'm just going to briefly go through the Boston housing dataset.
We didn't really go into much detail on the actual variables last week, so here they are. MEDV is the median value of owner-occupied homes in thousands of dollars, so a value of 23 means $23,000; this dataset was collected, I think, in the 1980s. That's our dependent variable. Then we have our independent, or explanatory, variables: the crime rate (CRIM); the proportion of residential land zoned for lots over 25,000 square feet (ZN); the proportion of non-retail business acres per town (INDUS); the Charles River dummy variable (CHAS), which is 1 if the tract bounds the river and 0 otherwise; the nitric oxide concentration (NOX); the average number of rooms per dwelling (RM); the age variable (AGE), which is the proportion of owner-occupied units built prior to 1940; the weighted distance to five Boston employment centers (DIS); the radial variable (RAD), an index of accessibility to radial highways; the full-value property tax rate per $10,000 (TAX); the pupil-teacher ratio by town (PTRATIO); B, which has something to do with the proportion of Black residents by town — I'm not sure why that's reflective of housing price, though perhaps that was more the case in 1980s Boston, and I'm guessing it's definitely not today; and LSTAT, the proportion of the population of lower status.

So those are all of the variables. One thing that I didn't realize last time is that the radial variable, because it's an index, doesn't really take the values 1, 2, 3, 4 up to 8 and then 24 as numbers; it's an index, so it's more like a label than an actual numerical variable. It's important that a variable like this is coded as a categorical variable, because an actual numeric variable and a categorical variable are treated very differently, so you need to make sure you incorporate that.

So I'm just going to show you now: I'm loading in the dataset and getting everything ready, and you can see here I have this line, Boston RAD astype category, which codes it as a categorical variable. Then we've got our dummy variables, and that's what's happening here: these lines add in all of the dummy variables (actually I don't even think we need the first one, because the next two lines do the same thing), with the dummy variables derived from the Boston data. Let's see if that works — yes, and you can see here we've got all of our dummy variables, so instead of having the single radial column we've got one column for each level, 1, 2, 3 up to 8 and then 24. We need it this way, otherwise the way it's coded doesn't actually make sense.

Anyway, this is roughly what we did last time. Obviously the values here are going to be different from the ones we got last time, because I hadn't coded RAD as a categorical variable then, but apart from that everything else should be the same. You can see we're working with a pretty high root mean squared error: this is essentially saying that, on average, the amount we're off by is six and a half thousand dollars, which is a lot considering the highest value in the dataset was fifty thousand dollars. The R squared is 0.5, and we'd like it to be closer to 1, so today we're going to be doing different things to try to get it up towards 1.
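For reference, here is a minimal sketch of what this setup might look like, not the exact notebook from the video — the file name boston.csv, the train/test split, and k = 5 are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical CSV holding the standard Boston housing columns (CRIM ... LSTAT, MEDV)
boston = pd.read_csv("boston.csv")

# Code RAD as categorical, then expand it into one dummy (0/1) column per level
boston["RAD"] = boston["RAD"].astype("category")
boston = pd.get_dummies(boston, columns=["RAD"], dtype=int)

X = boston.drop(columns="MEDV")
y = boston["MEDV"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline k-nearest neighbors regression
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
pred = knn.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R^2 :", r2_score(y_test, pred))
```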
The first thing we can do is filter the features by variation, or variability — by variance. Looking at the variance of all the different variables, what we'd like to see is whether some have a very low variance. What's meant by that is that they don't really change much in their values: if a variable has a variance close to zero, its values are pretty much always the same, so it's unlikely to be a good predictor, because unless all of the output values are also the same it doesn't really reflect the changes in the data, whereas variables with a higher variance can often explain what's going on in the data.

So one thing you might want to do is get rid of these two low-variance variables and see whether that makes a difference. Often it doesn't really, to be honest — it's not the best thing to do on its own — but you can see here that all of these dummy columns have quite a low variance as well, except for level 24, so that might be one that's actually relevant, while the other radial levels don't really carry anything. One thing we can do is drop those two, and we can see that has actually improved us a little bit. Can we drop all of these other ones as well? Suppose we set the threshold at 0.1. The problem is that with a categorical variable like this you can't just go around deleting lots of its levels. Another thing you might want to do instead is combine levels — say levels 1 to 8 together and then 24 on its own — so that there is more of an explanation there; you can't just delete some levels and not others, because that doesn't really make sense. I think oftentimes in machine learning, especially if you don't have a good understanding of general statistical practice, people do this kind of thing, and it just isn't really the right thing to do. So that's something we're not going to do; we're going to leave those in. We've just dropped these two, and it did give us a little bit of an improvement.
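A minimal sketch of this variance filter, reusing the X from the previous snippet; the 0.1 threshold is illustrative, and, as noted above, you would not normally drop individual dummy levels of a single categorical this way:

```python
# Variance of every feature, lowest first
variances = X.var().sort_values()
print(variances)

# Drop columns whose variance falls below the chosen threshold (0.1 here),
# remembering that dummy levels of one categorical should be treated together
low_var_cols = variances[variances < 0.1].index
X_reduced = X.drop(columns=low_var_cols)
print("Dropped:", list(low_var_cols))
```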
The next thing we're going to do is look at the correlation between the features and the dependent variable, price. Here we're importing seaborn and matplotlib, just to give us some plots that can explain things a bit better, and we're plotting the correlations as a heat map. This is quite a lot to take in. All of these rows and columns are our categorical variable, and that's why it looks very blurry there: the dummy levels have very low correlation with a lot of the variables and with each other, in some sense because of the overlap between them. Basically, what we want to look out for is anything that's a very dark or a very light color, because that can be an indication of high correlation between our explanatory variables, which we don't want. One of the assumptions for doing any form of regression, if you don't know, is that there isn't too much correlation between the explanatory variables; generally a value of 0.8 or above is considered high correlation between explanatory variables, so that's something we want to avoid. Similarly, we'd like the features in the dataset that we're going to use to have a high correlation with our median value for price.

So first of all let's just look at the correlations with the price. Some are obviously positive and some negative, but here we're just taking the absolute value, so they're all positive. One way you could filter out features — and I haven't really mentioned this much — is a way to get rid of features that doesn't involve your classifier at all, known as filtering. You select features without using a specific classifier, which means the features you select can be used with any classifier. That's essentially what we're doing here by removing the features with low variance and keeping the ones with very high correlation.

Here you can see I've decided to keep everything that has a high correlation, above 0.5, and we just have three features that are actually above 0.5. I also decided to go ahead and try this for all of the different thresholds: getting rid of everything with correlation below 0.1, then below 0.2, below 0.3, and so on, so this is trying out different versions. Obviously this isn't really correct, because we wouldn't just get rid of some of the radial dummy variables, and you can see we didn't really improve there. Here we have slight improvements, and this is starting to get up there — this is actually a pretty good improvement, even just by keeping level 24 — so that might give us the idea that we might want to combine levels 1 to 8, say, have 24 on its own, and keep them both in there. But as we go along, the best value we end up getting is a root mean squared error of 4.73, with an R squared of 0.74. That's a pretty good improvement considering it was an R squared of 0.5, and now with just three variables we're up to 0.74. So that's the best we've done so far, and it was done, I suppose, in a way that doesn't use a classifier; this method could have been implemented just as easily with something else.
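A sketch of this correlation-based filtering, again assuming the boston DataFrame from the earlier snippet; the 0.5 cutoff matches the one used above, while the plot size and color map are arbitrary:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap of pairwise correlations between all variables
corr = boston.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()

# Absolute correlation of each feature with the target, sorted high to low
price_corr = corr["MEDV"].drop("MEDV").abs().sort_values(ascending=False)
print(price_corr)

# Keep only the features whose |correlation| with MEDV exceeds 0.5
selected = price_corr[price_corr > 0.5].index.tolist()
print("Selected features:", selected)
```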
Now we're going to look at an option that actually decides to keep variables based on their performance with a given classifier. Up to here we were essentially just ranking features by correlation and knocking out the ones that were not as highly correlated with the price variable, but that doesn't always give you the same results as using something else, because sometimes features interact with each other in some way even if they don't have high correlation, and they interact with the classifier: different features can be differently suited to a classifier and can give better results with that classifier than they do with others. The selection technique we're going to use is a wrapper, which means it takes the classifier into account — the method for selecting the features works directly on the classifier. So I'm just taking in the dataset again, because we have obviously removed some variables and everything; here we're just getting the target vector and the dataset in again.

What we're going to do is known as sequential feature selection. One option would be to test out every possible subset of features and see which one performs best with the classifier, but the issue with that is that testing every single subset of features would take a really long time if you've got, say, a hundred features. The way sequential feature selection works is that you can choose to go forwards or backwards, or a kind of forwards-and-backwards along the way. Let's say we're going forwards here. With forward feature selection we start with a completely empty subset, and then we make a classifier with each individual feature — a single-feature model. Say we've got all of these features, twenty-something or maybe fifteen variables: we make an individual model with each of them, see which one has the best performance, and that feature is selected. Then it moves on to the next step and makes lots of two-variable models, testing each remaining feature added to the best model we have so far, and sees which one performs the best. At the next step we add in a third variable, and so on — at each step we test which variable is the best one to add to the model. That's how forward selection works. Backward selection is the opposite: we start with a model that has all of the features, and at each step we remove one variable, deciding which variable to remove based on how much it affects the model. The performance may go down a little bit, because R squared generally continues to rise or fall depending on how many variables there are, even if a variable doesn't actually contribute much to the model. Anyway, those are the two main ways it works.

So that's what we're going to do now. We're going to use the sequential feature selector, which you can get from mlxtend.feature_selection — it's called SequentialFeatureSelector in Python. Here we're setting the limit of features to 13, since that's how many variables we have, I guess, and we're going to use the negative mean squared error as our measure of model performance; there are lots of different scoring options you could use, that's just what we decided on here. We have forward equals true, so we're doing forward stepwise selection; the other option would be to do backwards. Let's run that — it usually takes a little bit of time, because it's testing out all of the different subsets at each step.

Okay, going through this we can see that we start off with just one feature, and our best one seems to be LSTAT, which makes sense from our correlations. The next one is the number of rooms, then crime, and then the pupil-teacher ratio. The lowest value we get is just about minus 20, and after that it starts to increase again, which means the models after that point perform worse. So let's just change forward to false and see whether we get the same result.
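A sketch of this wrapper step using mlxtend's SequentialFeatureSelector with the k-nearest neighbors regressor; the cross-validation setting and k = 5 neighbors are assumptions:

```python
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5)

# Forward stepwise selection: grow the subset one feature at a time,
# scoring each candidate subset by cross-validated (negative) mean squared error
sfs = SFS(knn,
          k_features=13,
          forward=True,      # set to False for backward elimination
          floating=False,
          scoring="neg_mean_squared_error",
          cv=5)
sfs = sfs.fit(X_train, y_train)

print(sfs.k_feature_names_)            # feature names at the final subset size
for size, info in sfs.subsets_.items():
    print(size, info["avg_score"])     # score at each subset size
```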
Sometimes with backwards feature selection we actually end up with a different subset. Here we have crime first, so you can see already that the output is different — it's in the opposite order — so we want to start reading from the bottom here. Okay, so can we find our best subset now? Again we have LSTAT, just the opposite way around: LSTAT is our top one, then number of rooms, crime, and then the pupil-teacher ratio. So it is actually the same, and we've got the same value here, but it just depends — you don't really know in advance going up versus going down, and I'm sure some of the intermediate subsets are actually different from the other direction. Anyway, we've essentially chosen our top four variables, so let's build a model with those and see what we get. You can see we've improved our root mean squared error and the R squared by a lot so far, which is pretty good.

Just to double-check, we want to see what the correlation is like: do we have any high correlations between our explanatory variables? The only one that is a bit risky is LSTAT and the number of rooms. One thing we can do is add in an interaction term, which I think could make sense, because this is the only pair that's above 0.5, so it's the only one that's relatively significant apart from, obviously, the correlations with price. Let's just do that and see whether it helps. What we're adding here is an interaction term: we multiply the two columns together and add the result as a new column to our Boston dataset. Now let's run this again — actually, let's just copy this so that we can have it separately. Okay, so now we're adding in our new variable, and you can see it has improved everything a little bit; not a huge improvement, but it has made a difference. Our R squared improved only by a very small amount, so we might wonder whether it was really worth it. You can compute something called the adjusted R squared, which basically penalizes the R squared score by the number of variables, and in this case I think you probably wouldn't want to keep the interaction term, because it doesn't affect the score enough to justify it; it's probably just a very small improvement that comes from adding any new feature. So I think we're pretty happy with this model so far.
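A sketch of the interaction term and the adjusted R squared check; the four selected columns follow the transcript (LSTAT, RM, CRIM, PTRATIO), while the new column name, the split, and k = 5 are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# Interaction term: multiply the two correlated predictors together
boston["LSTAT_x_RM"] = boston["LSTAT"] * boston["RM"]

features = ["LSTAT", "RM", "CRIM", "PTRATIO", "LSTAT_x_RM"]
X = boston[features]
y = boston["MEDV"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
r2 = r2_score(y_test, knn.predict(X_test))

# Adjusted R^2 penalizes the score for the number of predictors used
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print("R^2:", r2, "adjusted R^2:", adj_r2)
```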
Another thing we can do now is start to look at the actual relationships between the variables themselves, which, to be honest, is something you should do more at the beginning, straight away. But when you've got a dataset with hundreds and hundreds of features you don't necessarily have the time to be looking at all of the data — not that I'm recommending skipping that step — it's just a bit easier now to look at these plots and see the relationships between the variables, particularly between the price and all the other variables. One thing we can notice is that there are these values at the top, which have the maximum price for housing, and they seem to be outliers in some way: they're all bunched together, but across the different plots they don't really follow the pattern. I think the reason for this is that the housing price was actually capped at 50, so they should really be somewhere higher up; these houses are probably worth a lot more money. It really depends with these kinds of datasets what you want to do, because it is possible that they are outliers, and it's always difficult to know when you should remove outliers or not. But for the sake of this, let's just say we are removing them and see what happens. So in the Boston housing dataset, let's drop those values — and we don't need this line, that was just so we could see them. Once we've done that, it does actually increase our R squared by a good bit, so we're much closer to 0.8 now. And if we add in again the interaction term I was talking about before, RM times LSTAT, and see whether that makes a difference, our R squared is now more like 0.801, which I think is pretty decent for this dataset: lots of linear regression models typically get somewhere around 0.66, so to get up to 0.8 is quite an achievement.

Other things we could try at this stage would be to look at the relationships between the variables again. Now that I've removed those 16 points we can see the difference in the variables themselves: this one looks a lot smoother now, and so does this one, but one thing we might see is that there could be a bit of a curve here, so maybe this is a nonlinear relationship. I'm pretty happy with how it looks, but I'll keep this in if you want to see how to incorporate a nonlinear relationship, so I'll show you that now as well. Here we've just squared one of our columns to bring in a nonlinear term; again, it doesn't really make much of a difference, but that's just an option if you want to start looking at things like that. And we don't need this.

Okay, so that's it for this tutorial. I really hope that you enjoyed it and that it was useful. Let me know if this is something you've seen before or something you've looked into doing before, or if it was new for you in Python. I used to know how to do this quite well in R, but transitioning to Python it was actually quite difficult to learn, especially for k-nearest neighbors — I wasn't sure how this would work initially, but now it's something I'm quite familiar with. Anyway, I really hope you enjoyed it. I have the Python notebook linked down below, and if you want to see more videos like this, be sure to give this one a thumbs up and subscribe to see more content. Thanks so much for watching, and I'll see you in the next video.
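For completeness, a short sketch of the last two steps described near the end of the transcript — dropping the capped observations and adding a squared term — again assuming the boston DataFrame and column names used in the earlier snippets:

```python
# Drop the rows where the price was capped at the $50k ceiling
boston_trimmed = boston[boston["MEDV"] < 50].copy()
print("Removed", len(boston) - len(boston_trimmed), "capped rows")

# Optional nonlinear term: square one of the predictors before refitting
boston_trimmed["LSTAT_sq"] = boston_trimmed["LSTAT"] ** 2
```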
Info
Channel: PhD and Productivity
Views: 22,716
Keywords: machine learning in python, feature selection, machine learning, what is feature selection, boston housing data eda, correlation analysis and feature selection, data science, what is feature selection in machine learning, feature selection in python, recursive feature elimination, feature selection in machine learning, scikit learn, sequential feature selection, stepwise feature selection, knn regression, stepwise regression, python tutorial, machine learning basics, beginner
Id: iJ5c-XoHPFo
Length: 27min 45sec (1665 seconds)
Published: Thu May 21 2020