TidyTuesday: Feature Engineering with Recipes and TidyModels

Captions
Hey y'all, Andrew Couch here. In this TidyTuesday video I'm actually not going to work on this week's data set; instead I'm going to focus on feature engineering with the recipes package. When I was looking at recipes I saw that the documentation has an "ordering of steps" article, which gives you a checklist for how to use the package in a sensible workflow, so I thought it would be fun to go through that checklist using the famous Titanic data set. In the Titanic data set, each row represents a passenger and whether they survived or not, so it's a great predictive modeling problem and you can learn a lot from it. In fact, the Titanic data set was one of the first things I ever analyzed in R, back when I was doing everything in base R, and that was a lot of fun.

So let's open an R Markdown file from the template document; I'll call it "TidyTuesday Titanic". We'll load the tidyverse, the tidymodels package, and our data set: df <- read_csv("titanic.csv"). First, a basic summary. We have eight columns, including Survived, which is what we're trying to predict. It appears to be a binary variable of zeros and ones; I'm going to assume 1 means the passenger survived and 0 means they did not. In addition we have seven other columns: Pclass, Name, Sex, Age, and so on.

When looking at a data set for predictive modeling, it's important to check for data quality issues. The biggest thing is seeing how many NAs, or missing values, there are, and summary() shows each feature's total NAs. That's helpful because if you notice that one feature is missing, say, half of its values, it's worth considering dropping the entire feature. The other thing we should check is the row-wise total of NAs: if a sample is missing half of its features, it's worth considering dropping that sample from your modeling or analysis too. And if the NAs are concentrated around a certain feature — say high fares have more NAs than low fares — it's worth asking why that's happening. In my case I know these NAs are random, because I went into the data set and removed some values myself.

So let's find the row-wise count of NAs. I don't actually know an elegant solution for row-wise counts; I know there's a rowwise() function, but I'm not sure exactly what it does. It seems to be paired with do(), so it's some kind of mapping-style operation. One way I would do it is to make a row key. In this case we could use Name, but sometimes you won't have a unique identifier, so we'll do mutate(row_key = row_number()), then gather(key = "key", value = "value", -row_key), then filter for is.na(value), and finally count(row_key, sort = TRUE). We're just counting the missing values in each row, and the most missing features in any row is two. I'm comfortable with that: at most a third of the seven features, and only for two samples. How many missing features you tolerate per row definitely depends on the person and the problem.

Now let's look at the actual recipes steps. The first item on the checklist is imputation, so let's do that. Run summary() again and focus first on Pclass. For summary statistics we'll group_by(Pclass) and summarize(count = n(), average_age = mean(Age, na.rm = TRUE), average_fare = mean(Fare, na.rm = TRUE)). We're missing a few Pclass values. I think of Pclass as the payment class, because the average fare shows first-class passengers paying much more, third-class passengers paying the least, and second class paying a little over the third-class fare. The counts show that most passengers are in third class, then first class, then second class. Age centers around 25, with no huge difference between third and second class, though first-class passengers are a bit older. For the NA rows, the average age is about 39, so they're much older, yet they have a much lower fare. That could mean a few things: one person in this group could be much older, and another could have a much lower fare.

Let's filter for Pclass being NA and look at the individual rows. One passenger, Mr. Francis Davis, is older and paying more; if you're older and paying more you probably fit in first class, or maybe upper second class. The younger passengers paying much less are probably in third class. These aren't hard rules, but when you compare the similarities there appears to be a pattern, and we can exploit it by imputing with the k-nearest-neighbors algorithm: recipes has step_knnimpute(). We could use the mean, median, or mode instead, but since there are these small, intuitive indicators, a machine learning algorithm should give a more robust estimate. If you want to be safe, use the median.

First, though, some data cleaning. We don't need Name, which basically acts as a primary key, so drop it. We should also mutate Sex to as.factor(Sex) and Survived to as.factor(Survived). Pclass is an integer, but a payment class can definitely be treated as a factor too, so Pclass becomes as.factor(Pclass). That gives us three factors, and the rest of the columns are continuous variables, so we don't need to worry about them.

Now let's define the recipe. We're trying to predict Survived, so the formula is Survived ~ ., with data = df. For this example we're not going to do any modeling; we're focusing on the feature engineering. That means we're using the entire data set, which would cause leakage if you used this for modeling; it would be simple to do a training/testing split and use only the training data, but since we want tactile feedback, we'll apply it to everything to build intuition about what it's doing. We'll call the recipe titanic_rec, and the first step is the imputation: step_knnimpute(Pclass), which applies k-nearest-neighbors imputation to Pclass.

Next, Sex. If we do count(Sex) we see there are five missing values and that the majority of passengers are male. Group by Sex and summarize the average age and average fare, with na.rm = TRUE. What are the missing values showing us? On average they're older, and the fares point toward the male distribution. Filtering for Sex being NA: they're definitely older, they're paying more, one has a sibling aboard, and two of the five survived. I'm not really certain how this would play out, and since Sex is a factor — a categorical variable — we can't use the median or mean. One option is modal imputation, imputing the most common value, so that's what we'll do: step_modeimpute(Sex).

Run summary again to see what else needs imputing: Age next. I'm not even going to look at crosstabs here, just a simple density plot with geom_density(). Age is centered around one big hump, which begs the question of mean versus median; comparing the two, the difference is about one and a half years. When I think of age I always think median, so we'll impute with the median: step_medianimpute(Age).

Now siblings/spouses aboard and parents/children aboard. Let's gather the two columns and plot: ggplot(aes(x = value, fill = key)) + geom_histogram() + facet_wrap(~key, scales = "free"). Most of the time both are zero, but siblings/spouses has a little bump at one, and intuitively that makes sense: if you have a spouse on board, the spouse is also counted and will also report a spouse on board. Whereas with parents, you might only have a mom or a dad aboard, and there appear to have been fewer children in general; spouses were probably more common among Titanic travelers. So what do we impute with? Most values are zero, so the median would be zero, and we could use a mean, but since you can't have half a sibling or half a parent aboard, I think the mode is the smartest choice. Obviously you could just use the median or mean, but we'll keep it simple: step_modeimpute() on the siblings/spouses and parents/children columns.

Lastly, Fare: ggplot(aes(x = Fare)) + geom_density() + scale_x_log10(), so the axis moves in increments of ten; I like log base 10. There are three humps, which is interesting; I bet they represent the three classes. Setting fill = Pclass (or color = Pclass) confirms it. The missing fares center around the third-class hump but overlap the second, so they could be third or second class. Looking at the summary statistics for Fare, the median and mean for second and third class are fairly close together, around 13 and 25. I also computed the minimum fare per class, min(Fare, na.rm = TRUE), after filtering out fares of zero, just to get a feel for the ranges. My first instinct was to impute Fare with the median on the strong suspicion that these passengers are third or second class, but if we filter, the rows with missing fares are all second class, which means we can use the actual mean. So scratch that: impute with the mean, since the Pclass is second class: step_meanimpute(Fare).

That finishes imputation. Going back to the ordering of steps, next is individual transformations for skewness. Let's see whether any of these values are highly skewed in a way that might cause complications. Notice in the summary that our minimum values don't go below zero; the documentation warns that if you use the Box-Cox transformation, don't center the data first or do any operations that might make the data non-positive — or alternatively use Yeo-Johnson so you don't have to worry about it. We could use center, scale, Yeo-Johnson, or Box-Cox; I'd choose Box-Cox because I like it, especially since we don't have negative numbers. Keep that in mind for most of the individual transformations.

To eyeball the distributions I want to plot only the non-factor columns. I fumbled around for a while trying to remember the selector; it's select_if(negate(is.factor)), which turns out to be really easy. Equivalently: select everything except Survived, Pclass, and Sex, gather, and ggplot with aes(x = value, fill = key) + geom_histogram() + facet_wrap(~key, scales = "free"). Age shows a roughly normal distribution, while the others are piled up on the left; I don't know the exact name of that distribution, maybe it's exponential-ish. I tried scale_x_log10() to see if it helped, and it didn't change much, so we'll leave the data as is. None of the distributions are super concerning, so I'm checking this step off.

The next item is discretization, which is binning. For example, we could bin Fare to essentially recreate the payment class: fares between zero and 100 are non-upper-class, a hundred dollars and above is upper class. We don't need that right now, so on to the next item: creating dummy variables. We have three factors, but since Survived is the outcome, we only have to convert Pclass and Sex: step_dummy(Pclass, Sex). I'm not going to do any one-hot encoding here, just the default dummy variables.

Next we could create interactions, but I don't feel comfortable making any here. There could be an interaction between, say, siblings and fare; I'm not sure, and finding out would require a bigger EDA, and I don't want to make this video too long. So, on to the global normalization steps.

Most models benefit from some type of standardization or normalization; I honestly don't know of a model that's actually hurt by it. The one thing I might think about is a model — say a Stan model — that really wants to preserve specific distributions, but I don't have a deep understanding of that. A good rule of thumb is to always normalize: it almost certainly can't hurt, and if you know when not to normalize, you're probably beyond the scope of this video. Looking at the reference, there's a decent number of normalization steps: step_center; step_normalize, which does both centering and scaling; step_range; and step_scale. We could also do a Box-Cox transformation, which I enjoy, or Yeo-Johnson, but if you don't really know what to do, I think step_normalize is the best default since it applies both centering and scaling. So: step_normalize(all_numeric()); I believe all_numeric() is the go-to selector, and after checking the docs, it is. Applied.

That finishes step six, and now we're on step seven, the multivariate transformations. I'm not very familiar with a lot of these, but there's obviously PCA, and spatial sign preprocessing; they work off distances between columns and the relative number of samples, and act as a kind of feature reduction. There's also removing highly correlated variables, which is definitely necessary for something like linear regression because of multicollinearity, and there are the near-zero-variance (nzv) and zero-variance filters, which basically say that if a value never changes across the entire data set, remove it because it's redundant; near-zero variance gives that some wiggle room. Then there are row operations: filtering, lag features, shuffling, sampling like down-sampling and up-sampling, and what I'd call window functions. We don't need any of those.

Finally, we prep and bake the recipe: titanic_prep <- prep(titanic_rec). And... I get an error: "data should be character or factor to compute the mode." Let me see what I'm supposed to change here. Very odd; let me look this up. I'm going to pause the video and see what's going on.

All right, I was running into errors with the siblings-aboard and parents-aboard mode imputation; I guess step_modeimpute() requires a character or factor column. So instead of the mode impute, we'll just use a median impute for Siblings.Spouses.Aboard and Parents.Children.Aboard, and then put the recipe through prep() again. That works, and printing titanic_prep shows the trained steps.

Finally we bake it, which applies the whole preprocessing pipeline to the data set: bake(titanic_prep, df). We can see that Age is centered and scaled, along with siblings, parents, and fare, plus Survived: that's all seven columns we need. Running a summary to check for anything still missing, Survived is still missing one value, and we can simply drop that NA. So processed_df is the baked data with the remaining NAs dropped, and boom: we have a processed data frame we can finally use for actual modeling with tidymodels. You usually won't see these inner workings when you use tidymodels, but this video was meant to show how it all works under the hood. I know this was a pretty quick video, but I wanted to demystify the recipes package. I'm still learning it myself, and I still think there's probably a way to get around swapping out the modal imputation for siblings and spouses aboard, but it's easy enough to throw in a median instead of a mode. I'll see you guys next week on TidyTuesday.
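The row-wise NA count walked through in the video can be sketched roughly like this. The file name and column set are assumptions from context (the 8-column Titanic CSV), and `gather()` is the reshaping verb current when the video was made; newer tidyr supersedes it with `pivot_longer()`:

```r
library(tidyverse)

# Assumed file name from the video
df <- read_csv("titanic.csv")

# Count missing values per row by reshaping to long format
df %>%
  mutate(row_key = row_number()) %>%                 # surrogate key per passenger
  mutate(across(everything(), as.character)) %>%     # avoid type clashes when gathering
  gather(key = "key", value = "value", -row_key) %>% # one row per (passenger, feature)
  filter(is.na(value)) %>%                           # keep only the missing cells
  count(row_key, sort = TRUE)                        # NAs per row, most-missing first
```

A terser alternative that skips the reshape entirely is `df %>% mutate(n_missing = rowSums(is.na(.)))`.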
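The per-class summary used to reason about the missing `Pclass` values might look like the sketch below (column names `Pclass`, `Age`, and `Fare` are assumed from the video). Because `group_by()` keeps NA as its own group, the missing rows can be compared against each class directly:

```r
library(tidyverse)

df <- read_csv("titanic.csv")

# Compare the NA group's age and fare against each class
df %>%
  group_by(Pclass) %>%                      # NA forms its own group
  summarize(count        = n(),
            average_age  = mean(Age,  na.rm = TRUE),
            average_fare = mean(Fare, na.rm = TRUE))

# Inspect the individual rows whose class is missing
df %>% filter(is.na(Pclass))
```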
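The fare density plot from the video — the one whose three humps line up with the three passenger classes — could be sketched as follows (column names assumed):

```r
library(tidyverse)

df <- read_csv("titanic.csv")

# Fare density on a log10 axis, filled by class
df %>%
  filter(Fare > 0, !is.na(Pclass)) %>%      # log10 is undefined at a fare of 0
  ggplot(aes(x = Fare, fill = as.factor(Pclass))) +
  geom_density(alpha = 0.5) +
  scale_x_log10()
```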
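The skewness check — faceted histograms of only the continuous columns — might look like this sketch. Dropping the categoricals by name is equivalent to the `select_if(negate(is.factor))` trick from the video once the factors have been created:

```r
library(tidyverse)

df <- read_csv("titanic.csv")

# Histograms of the continuous columns, to eyeball skewness before
# deciding whether a Box-Cox or Yeo-Johnson transformation is needed
df %>%
  select(-Survived, -Pclass, -Sex, -Name) %>%  # drop outcome and categoricals
  gather(key = "key", value = "value") %>%
  ggplot(aes(x = value, fill = key)) +
  geom_histogram() +
  facet_wrap(~key, scales = "free")
```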
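Putting the whole pipeline together, the finished recipe might look like the sketch below. The step names (`step_knnimpute()`, `step_modeimpute()`, `step_medianimpute()`, `step_meanimpute()`) are from the recipes version current in mid-2020; newer releases rename them `step_impute_knn()`, `step_impute_mode()`, and so on. The column names, including `Siblings.Spouses.Aboard` and `Parents.Children.Aboard`, are assumed from the video:

```r
library(tidyverse)
library(tidymodels)

# Clean: drop the primary-key-like Name column and convert the categoricals
df <- read_csv("titanic.csv") %>%
  select(-Name) %>%
  mutate(Sex      = as.factor(Sex),
         Survived = as.factor(Survived),
         Pclass   = as.factor(Pclass))

titanic_rec <- recipe(Survived ~ ., data = df) %>%
  step_knnimpute(Pclass) %>%            # k-NN imputation for the missing classes
  step_modeimpute(Sex) %>%              # most common level for the factor
  step_medianimpute(Age) %>%            # median for the single-humped age
  step_medianimpute(Siblings.Spouses.Aboard,
                    Parents.Children.Aboard) %>%  # mode errored on integer columns
  step_meanimpute(Fare) %>%             # the NA fares all looked second class
  step_dummy(Pclass, Sex) %>%           # dummy variables for the predictors
  step_normalize(all_numeric())         # center and scale every numeric column

# prep() estimates the steps from the data; bake() applies them
titanic_prep <- prep(titanic_rec)

processed_df <- bake(titanic_prep, df) %>%
  drop_na()   # Survived still has one NA, so drop the remaining rows
```

Since no train/test split is used here, the recipe is trained and applied on the full data — fine for building intuition, but a source of leakage in real modeling, as the video notes.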
Info
Channel: Andrew Couch
Views: 1,476
Rating: 5 out of 5
Keywords: Rstudio, Tidyverse, R Programming, Data Science, Analytics, University of Iowa, Statistics, ggplot2, ggplot, data wrangling, Dplyr, Data Visualization, Data Viz, EDA, TidyTuesday, RStats, TidyModels, ML, Machine Learning, Data, Data Modeling, Black-Box, Recipes, TidyModels Package, CRISP-DM, Recipes Package, Feature Engineering, Box-Cox, Yeo-Johnson, Imputation, Missing Value, Preprocessing, Data Preprocessing, Data Manipulation, Data Pipeline
Id: OeHz3bkOo5c
Length: 35min 20sec (2120 seconds)
Published: Tue Jun 02 2020