Pre-Modeling: Data Preprocessing and Feature Exploration in Python

Captions
April Chen is going to talk about data preprocessing and feature exploration. She will demonstrate how to use Python libraries such as scikit-learn, statsmodels, and matplotlib to perform pre-modeling steps.

Thank you very much. Hi, everyone. Today I'm going to go over pre-modeling. Since pre-modeling isn't as flashy as actually doing the modeling, it doesn't get enough attention, but if you're working on a modeling project, a huge portion of it is the data processing you do before you ever build the models, so I want that to be the focus today. The goal of this talk is to go over data preprocessing, meaning data cleaning, feature exploration, and feature engineering, and to show the impact it has on model performance, and then on top of that to cover a few other pre-modeling steps. That's quite a lot of material, so I'm going to go over it at a fairly high level, but if you want to talk about any of these topics in more detail, come find me afterwards; I'll be very happy to talk about it.

I've set this up very much tutorial style: I'm going to walk through a Jupyter notebook and cover the machine learning concepts, the code, and the Python libraries involved, since this is a Python conference after all. I'm going to use an edited version of the Adult dataset to do all the preprocessing, and the objective is to build a binary classification model from this data. I use a handful of Python libraries, but the main ones are numpy, pandas, scikit-learn, and matplotlib; those four cover a decent chunk of the workflow.

Since there's probably a huge range of modeling experience in the room, here's the plan: a quick, high-level modeling overview; an introduction to the data; data cleaning, which means dealing with data types and handling missing data; data exploration, which covers detecting and handling outliers and plotting distributions; feature engineering, both increasing dimensionality and decreasing it, and the benefits of each; and finally feature selection and model building. At the end I'll contrast a dataset that has been preprocessed with one that hasn't, to show the impact preprocessing actually has on model performance.

To give a very brief overview of modeling: a model is a statistical technique to predict a given outcome. Today's example is a binary classification model, which means you're trying to predict the probability that an observation belongs to one of two groups. This kind of problem comes up all the time in real life: predicting whether a given person will vote for Clinton or Trump in the next campaign, whether someone will be diagnosed with diabetes in the next year, or whether a credit card company should flag a certain transaction as fraudulent. In terms of your data, you divide it into inputs and outputs. The inputs are the predictors, also called independent variables or features; I'm going to use the word "features" because I like that better. The output is the variable you're trying to predict, called the outcome or the dependent variable.
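For reference, here is a minimal sketch of the imports behind that toolkit. This is not from the talk's notebook; it simply collects the libraries named above, using module paths from a current scikit-learn release.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score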
Essentially, what a model does is explain the effect those features have on the given outcome. To assess how well your model is doing, you divide your data into train and test sets, build the model on the train set, make predictions on the test set, and then check whether those predictions match the actual outcomes. There are tons of metrics for doing this; the one I'm going to use, which is very commonly accepted, is the area under the ROC curve (AUC), which essentially measures the true positive rate against the false positive rate. There are also tons of models that can be used for binary classification: logistic regression, random forests, gradient boosted trees, support vector machines, and so on. We're going to keep it simple today and use a logistic regression in scikit-learn.

Before we get started, a little introduction to the data. The task is: given attributes about a person, predict whether their income is less than or greater than $50k. If you import the data and take a look at it, you have a bunch of different features: age, work class, education level, number of years of education, hours worked per week, race, gender, and then the outcome variable, income, that is, whether somebody makes less than or greater than $50k a year. If you look at income, you'll notice that more people make less than $50k than greater than $50k. For ease of modeling we transform this outcome variable into zeros and ones: if somebody makes less than $50k that will be a 0, and if they make more it will be a 1. Then we divide the outcome and the features into two objects: X is a DataFrame of all the features, and y is a Series containing the outcome. X looks exactly like the data did before, just without the income column, and y holds the income zeros and ones.
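A minimal sketch of that setup, assuming the data lives in a local CSV with the usual Adult column names (the file name and the exact label strings are assumptions, not from the talk):

import pandas as pd

# Load the edited Adult dataset (file name assumed for illustration)
df = pd.read_csv('adult_edited.csv')

# Recode the outcome: 0 for income <= 50K, 1 for income > 50K
df['income'] = (df['income'] == '>50K').astype(int)

# Split into a feature DataFrame X and an outcome Series y
y = df['income']
X = df.drop('income', axis=1)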
The first thing we do is some basic data cleaning, and the first issue is data types. There are three main types of data. Numeric data is usually represented as an int or a float and can be anything like income or age, just a number. Categorical variables are things like gender or nationality; they can't really be represented as numbers, because what numerical value would "female" or "male" have? It doesn't make sense. The last type is ordinal, which is similar to categorical in that it has a scale, but the scale is fairly arbitrary: something like low/medium/high or short/medium/tall, where the distance between categories can differ. Models can only handle numeric features, so if you want to use categorical or ordinal features in a model you have to convert them into numeric form somehow. You can do this by transforming each categorical variable into a set of dummies, one per unique category, filled with ones and zeros: female is 1 or 0, male is also 1 or 0.

Here's an example. Education is a categorical feature: you might have a bachelor's degree, a high school degree, or maybe you didn't finish high school and your highest level is 11th grade. One great thing about Python is that it has tools that make this part of the pre-modeling workflow really easy: pandas has get_dummies, so instead of building something yourself to make the dummies you can just call it, and scikit-learn has a very similar tool, OneHotEncoder, which does pretty much the same thing. If I call get_dummies on education, it takes each unique category and assigns zeros and ones: observations 0 and 1 have a bachelor's degree, so in the Bachelors column those rows are 1.

You don't necessarily want to dummy up every categorical variable, though. If you have a thousand observations but 999 different categories, dummies aren't useful. One thing you can do is check how many unique categories each feature has. For most of these features there aren't that many, but native country has about 40 unique categories, so let's take a look at it. For the most part the observations belong to a single category. When most of the categories are very low frequency, you don't have to create dummies for all of them; you can create an "Other" category that represents the low-frequency bins. Here we bucket everything that isn't United States into Other, since most people fall into United States, and now we have only two categories: US and Other.
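A sketch of both steps using pandas (the column name 'native-country' and the label 'United-States' follow the standard Adult dataset spellings and are assumptions here):

import pandas as pd

# One dummy column per unique education level
edu_dummies = pd.get_dummies(X['education'])

# How many unique categories does this feature have, and how frequent are they?
print(X['native-country'].value_counts())

# Bucket every country other than the United States into 'Other'
X['native-country'] = X['native-country'].where(
    X['native-country'] == 'United-States', 'Other')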
You don't want to do this by hand for every single variable, so it's easier to create a "to dummy" list, which is just the names of all the features you want to dummy up, and write a simple function that loops through each column in the list, dummies it, drops the original feature (which can't be used in a model), and adds the dummied version back. This is what it looks like applied to X: work class, relationship, race, sex, and native country are all dummied up, so now everything is numeric.

Models also can't really handle missing data. The simplest solution you might think of is to just remove observations or features that have missing values, but that's actually a pretty problematic approach. If the data is missing at random, you could lose a lot of your data, which is harmful to your model. It's even worse if the data is not missing at random: in addition to losing data, you're introducing bias into your dataset, so it's no longer representative of the full population. So dropping rows is usually a poor solution. A good but really simple alternative is imputation, which means you take the missing value and replace it with another value based on the non-missing values for that feature. There are several strategies for this; the most common are the mean, the median, or the most frequent value of the feature. If we look at our dataset again and count how much data is missing, we see a few features with 107, 57, and 48 missing values. Again scikit-learn comes to the rescue: it has an Imputer that lets you fill in all the missing values. You tell it what your missing values look like, which strategy to use (median here), and that it should work along columns; after imputing, a recount shows there are no more missing values. (To answer the question from the audience: yes, you can use a pandas DataFrame with scikit-learn, and I'll show later that I'm feeding a pandas DataFrame into a scikit-learn model.)
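A sketch of the dummying loop and the imputation step. The talk uses scikit-learn's Imputer; in current scikit-learn releases the equivalent class is SimpleImputer, which is what appears below. The function and list names are illustrative, not the talk's exact code.

import pandas as pd
from sklearn.impute import SimpleImputer  # Imputer in older scikit-learn releases

def dummy_df(df, todummy_list):
    """Replace each listed categorical column with its dummy columns."""
    for col in todummy_list:
        dummies = pd.get_dummies(df[col], prefix=col)
        df = df.drop(col, axis=1)
        df = pd.concat([df, dummies], axis=1)
    return df

todummy_list = ['workclass', 'relationship', 'race', 'sex', 'native-country']
X = dummy_df(X, todummy_list)

# Median imputation for any remaining columns with missing values
imputer = SimpleImputer(strategy='median')
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns, index=X.index)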
At this stage your data is in a place where you could actually build a model, but data cleaning isn't the only part of pre-modeling; there's a lot more you can do. So far, much of the workflow I've shown can be generalized and automated, which is why I wrote functions for it, but understanding the problem, the domain, and your data is extremely important for building high-performing models. This section gives you some tools to explore your data and make smarter decisions.

The first is outlier detection. A simple definition of an outlier is an observation that deviates drastically from the other observations in a dataset, and this can happen in one of two ways. It can occur naturally: in a dataset of incomes, Mark Zuckerberg would be an outlier just because he's much wealthier than most of us. But it can also occur because of some kind of error. That could be measurement error, because your machine is having an issue that day, or it could be human error, like somebody entering their weight and adding an extra zero so that 200 pounds becomes 2,000 pounds, which is unlikely to be an actual person's weight.

Why are outliers problematic? If they're naturally occurring they're not necessarily a problem, but they can skew the model by affecting the slope; the slide shows a dataset with no outliers, and once a few outliers are added the fitted slope changes. If the outlier is an error, it's problematic because it indicates a data quality issue, and that's not information you want to be using. You can honestly treat those the same way you treat missing values, for example with imputation.

There are many, many approaches to detecting outliers; there are entire textbooks dedicated to just this topic, but I'm going to go quickly over two: the Tukey interquartile range method and kernel density estimation. Tukey IQR identifies extreme values in your data. The interquartile range is quartile 3 minus quartile 1, and values below Q1 minus 1.5 times the IQR, or above Q3 plus 1.5 times the IQR, are flagged. The 1.5 is a somewhat arbitrary number; if you only want the most extreme values you can change it to a larger multiplier. A lot of people like to use standard deviations from the mean to detect outliers, but I prefer Tukey because it doesn't make assumptions about normality and is less sensitive to very extreme values. The slide demonstrates it on a small vector of data: you find the first quartile and the third quartile, the IQR is the difference between the two, you apply the 1.5 multiplier, and you get a floor (Q1 minus that amount) and a ceiling (Q3 plus that amount); the handful of very small and very large values that fall outside the floor and ceiling are your outliers. You can write a pretty simple function to do this for a given feature: find quartiles 1 and 3 and their difference, compute the floor and ceiling, and return anything below the floor or above the ceiling. If we do this for age, the outliers are people who are older and still earning income. The reason I also return the indices is that it's usually helpful to have them if you later want to impute those values or do something else with them; the indices give you a way to access those data points.
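A sketch of that Tukey IQR function, applied to the age column (the function name mirrors what the notebook appears to use, but the exact code is a reconstruction):

import numpy as np

def find_outliers_tukey(x):
    """Return the indices and values that fall outside the Tukey fences."""
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    mask = (x < floor) | (x > ceiling)
    return list(x.index[mask]), list(x[mask])

tukey_indices, tukey_values = find_outliers_tukey(X['age'])
print(np.sort(tukey_values))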
The other method is kernel density estimation, which is a nonparametric way to estimate the probability density function of a given feature. It runs a bit slower than Tukey since it's not as simple, but whereas Tukey looks at extreme values, kernel density estimation can capture things like bimodal distributions that an extreme-value approach would never catch. This is also a pretty simple function; scikit-learn has an implementation, but what I'm using here is the statsmodels implementation, and they're actually quite similar. It's a univariate kernel density estimator: you fit it on a scaled version of the input feature, and you end up with something pretty similar to before, which is mostly people who are older and still working.

The other thing you can look at is the distribution of your features, and the simplest way to do that is a histogram. If you're not familiar with histograms, the x-axis represents value bins and the y-axis represents the frequency of observations falling into those bins. It can also be interesting to look at histograms broken out by your outcome categories, so the distribution when somebody makes less than $50k versus more than $50k, or by some other category, like the age distribution split by gender. matplotlib makes it really easy to plot a simple histogram; here's one for age, and here's a function that does something similar but breaks it down by the outcome variable, so you can see the distribution of age when the outcome is 0, meaning somebody makes less than $50k (in blue), and when it's 1, meaning they make more (in green).
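Two small sketches of those steps. The first uses the statsmodels univariate KDE, scaling the feature with scikit-learn and treating the lowest-density points as candidate outliers (the 5 percent cutoff and the interpolation onto the fitted grid are choices made here, not necessarily the talk's); the second overlays histograms of a feature by outcome class with matplotlib.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from statsmodels.nonparametric.kde import KDEUnivariate

def find_outliers_kde(x):
    """Flag the lowest-density observations of a numeric Series."""
    x_scaled = scale(x.astype(float).values)
    kde = KDEUnivariate(x_scaled)
    kde.fit(bw="scott", fft=True)
    # Estimated density at each observation, read off the fitted grid
    density = np.interp(x_scaled, kde.support, kde.density)
    mask = density < np.percentile(density, 5)  # bottom 5% of density
    return list(x.index[mask]), list(x[mask])

kde_indices, kde_values = find_outliers_kde(X['age'])

def plot_histogram_dv(x, y):
    """Overlay histograms of feature x for each outcome class."""
    plt.hist(x[y == 0], alpha=0.5, label='income <= 50K')
    plt.hist(x[y == 1], alpha=0.5, label='income > 50K')
    plt.xlabel(x.name)
    plt.ylabel('count')
    plt.legend(loc='upper right')
    plt.show()

plot_histogram_dv(X['age'], y)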
Once you've done your exploration and data cleaning, the next important piece is feature engineering, which I like to divide into two classes: increasing dimensionality and decreasing dimensionality. It's advantageous to do both. By increasing dimensionality I mean creating new features. If you're very familiar with your data you probably have a lot of domain knowledge for building features, but a good automated way to do it is looking for interactions among your features, which can be two-way or three-way interactions. A simple two-way interaction is x3 = x1 * x2, where x3 is the interaction of features x1 and x2.

This is easier to explain with an example. Say you're interested in predicting somebody's sentiment toward climate change, whether or not they care about it, and you have two variables: education and political ideology. You could obviously put both features into your model, but the interaction between them is also interesting. What you see is that if somebody is extremely liberal or moderate, as their education increases they become more concerned about climate change, while if somebody is extremely conservative, increasing education has the opposite effect and they tend to care less. If you look at education on its own, those effects almost cancel each other out in the aggregate, but by creating an interaction term you can separate them, so the interaction of these two features provides more information than the features alone.

One thing to note is that interaction terms really blow up your data: with ten features you have 45 two-way interactions; with 50 features you have over a thousand; with 100 features it's almost five thousand; and with 500 features you get over a hundred thousand. You don't necessarily want your dataset to grow that big, and it's not really recommended to just dump a hundred thousand interaction terms into your data. If you know your data well and you know the likely areas for interactions, it's better to use domain knowledge than to generate 100,000 interactions blindly. Dimensionality has both benefits and costs: it's beneficial because you're adding new information, but it's costly because it's computationally inefficient and because it can lead to overfitting, so you're always trying to balance adding information against keeping the dimensionality manageable. scikit-learn has a tool that lets you build interaction terms across all your variables, and since our dataset isn't too high dimensional, this is something I feel reasonably okay doing here. This is a function called add_interactions: you pass in your set of features and it creates two-way interactions between all of them, and since the interaction between two dummies from the same categorical feature is always zero, those get removed. After adding interactions to X you can see features like race White times sex Female, race White times sex Male, and so on. The dataset is now pretty big, about 1,700 features.

The opposite of finding interactions or building new features is to go the other way and do dimensionality reduction. One of the most common methods is principal component analysis, or PCA for short. It's a technique that transforms your dataset from many, many features into a handful of principal components that summarize the variance underlying the data. It works by finding the linear combinations of your features that maximize variance while having zero correlation with the previously calculated principal components. There are lots of use cases: if you have very high dimensional data, it's a good way to reduce the number of dimensions; if you have a poor observation-to-feature ratio, it's a good way to fix that; and it's also useful if you have highly correlated variables, because it pools the variance from those variables so the correlation goes away. The main downside of PCA is that it makes the model hard to interpret. If you're predicting income, age makes sense and education makes sense, but principal components one, two, and three don't really mean anything. The slide shows what it looks like: you take features such as education, health, housing, and climate, and in this case collapse them into two components. scikit-learn has a really nice implementation where you just tell it how many components you want and it transforms your entire dataset into that many new features. Again, the problem is interpretability: these are fairly arbitrary principal components, so it's hard to talk to, say, a client or somebody using the model and explain the drivers of the target outcome, because all you can tell them is that it's principal component one.
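A sketch of both directions. The interaction builder below uses scikit-learn's PolynomialFeatures with interaction_only=True, which matches what the talk describes; get_feature_names_out is the current naming API (older releases used get_feature_names), and the choice of 10 PCA components is an assumption, since the exact number used in the talk isn't clear from the captions.

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA

def add_interactions(df):
    """Add all two-way interaction terms, dropping ones that are identically zero."""
    poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    arr = poly.fit_transform(df)
    names = poly.get_feature_names_out(df.columns)
    out = pd.DataFrame(arr, columns=names, index=df.index)
    # Interactions between dummies of the same categorical are always zero
    zero_cols = [c for c in out.columns if (out[c] == 0).all()]
    return out.drop(zero_cols, axis=1)

X = add_interactions(X)
print(X.shape)  # roughly 1,700 columns after adding interactions

# Dimensionality reduction with PCA (component count assumed)
pca = PCA(n_components=10)
X_pca = pd.DataFrame(pca.fit_transform(X), index=X.index)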
The last part: once you've built your new features, cleaned your data, and gotten it into a form ready for modeling, you can actually build your models. The first step is to split your data into train and test sets, and there's a really easy way to do that with train_test_split in scikit-learn. You'll notice that originally we had about 14 features, one of which was income, and now we have more than 1,700. With only about 5,000 observations that can be a problem: compute time is slower and you run into overfitting. So you also want to select a subset of your features rather than using all of them. There are entire textbooks on feature importance and feature selection, but I'm going to use a really simple method: SelectKBest in scikit-learn. It's a univariate method; it looks at the relationship between the outcome and each individual feature and selects the k best features. If you look at the list of selected features, a lot of them are interactions: things like age times hours per week, age times education number, education number times hours per week, so there are tons of interaction terms among the selected features. Then you can build a simple model. I'm using a logistic regression in scikit-learn, and I'm using the area under the ROC curve to judge how good the model is. Here the AUC is 0.88, which is a pretty decent model.

It's important to have a point of comparison, though, since the claim is that preprocessing helps improve model performance. So let's build a second model on the unprocessed data. We still have to process it a little, since the model can't handle categorical variables or missing data, but let's do it using the approaches I warned you against: just get rid of anything that can't go into the model, dropping the non-numeric columns and dropping the rows that have missing data, and then build the model. You can see the AUC went down significantly, to 0.61, which is close to awful given that 0.5 is random. So for comparison: with data preprocessing, and notice it was very simple preprocessing, nothing elaborate and no domain knowledge at all, just very automated techniques, the AUC was 0.88; without it, 0.61. That's roughly a 44 percent improvement using pretty automated techniques.
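A sketch of that final modeling step: hold out a test set, pick the k best features, fit a logistic regression, and score it with AUC. The split ratio, random seed, and k=20 are assumptions; the overall shape follows what the talk describes.

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hold out a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Univariate feature selection: keep the k features most related to the outcome
selector = SelectKBest(f_classif, k=20)
selector.fit(X_train, y_train)
selected = X_train.columns[selector.get_support()]

# Fit a logistic regression on the selected features and score with AUC
model = LogisticRegression(max_iter=1000)
model.fit(X_train[selected], y_train)
probs = model.predict_proba(X_test[selected])[:, 1]
print('AUC:', roc_auc_score(y_test, probs))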
I think I still have five minutes for questions; I can't see anyone, so just shout.

Question about regularization versus PCA: in general, regularization is better than PCA in almost all the cases I've worked with. The cases where PCA has been good for me are really high dimensional ones, for example text data, where you use something like bag of words and end up with very high dimensional data; there I've found it really helpful. But in most cases, if you have something like 500 features, regularization is usually better and more practical, in my experience.

Sorry, what was that? Yes, I'm going to post the notebook and the dataset. Will I post this notebook to GitHub? Yes, I definitely will, and I'll send out an email with a link, so everything from this notebook will go up on GitHub. Any other questions? Just shout; I can't see anyone.

What do you mean, the processing time? These steps are actually pretty fast. A lot of the things in my functions here aren't parallelized since this is a smaller dataset, but if I had a bigger dataset it's pretty easy to parallelize many of them as well. Unless you're working with datasets that are much, much bigger, the compute time isn't terrible. The longest I've had was working on data that didn't fit into memory, so I was processing a single row at a time on a dataset of about seventeen billion rows, and that took a week or maybe longer, so that was pretty slow. But in normal situations these techniques are actually quite efficient; for example, for outlier detection, kernel density estimation is much faster than some of the more advanced techniques like one-class SVMs, so these are pretty fast compared to more complex methods, though they're also a little simpler. Okay, thank you.
Info
Channel: Next Day Video
Views: 135,663
Keywords: depy, depy_2016, AprilChen
Id: V0u6bxQOUJ8
Length: 35min 36sec (2136 seconds)
Published: Mon May 09 2016