Episode 3: Handling Categorical Features in Machine Learning Problems

Captions
Hi everyone, and welcome to the new episode, episode 3: categorical variables. It's been a while since the last episode because I have been quite busy, but that also means I'm going to have some really cool videos for you in the coming days. I've tried a lot of new things, like training models on GPUs, and I'm going to explain how to do that in some short tips-and-tricks videos. And since I was also requested to have a cup of tea, here it is. In this episode I'm going to talk about categorical variables. I've seen this quite a lot in industry: even people with a lot of experience, with PhDs, are not able to handle categorical variables properly. It's not something very difficult — there are just some tips and tricks on how to handle categorical data. So in this video I'm going to talk about the different kinds of categorical variables and how you can handle them in a very easy manner and build a model based on that. Everything I do in my videos I do from scratch, as you can see, so there is no script and there will probably be some mistakes, but we will figure them out. There will be a couple more videos after this one on categorical variables, because it's not possible to fit everything into one single video — if I did that, the length would be huge. So let's start with categorical variables and see what the different types are. Categorical variables can be broadly divided into two major types: nominal and ordinal. Nominal variables are categorical variables with two or more different categories and no inherent order, while ordinal variables have a fixed order to their categories. Say you have data about salaries with three different levels — low, medium, and high — so that becomes your ordinal variable.
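The nominal/ordinal distinction can be sketched with a toy pandas frame (the column names here are invented for illustration; pandas' ordered `Categorical` is one way to declare an ordinal variable explicitly):

```python
import pandas as pd

# Hypothetical toy frame illustrating the two kinds of categorical variables.
df = pd.DataFrame({
    "salary_level": ["low", "high", "medium", "low"],   # ordinal: has a natural order
    "sex": ["male", "female", "female", "male"],        # nominal: no order
})

# An ordinal variable can be declared with an explicit category order.
df["salary_level"] = pd.Categorical(
    df["salary_level"], categories=["low", "medium", "high"], ordered=True
)

print(df["salary_level"].min())  # "low" is the smallest level
```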
A nominal variable will be something like the sex of the person, male or female, for example. Note that not all categorical variables are the same, and they cannot all be handled the same way. To start, I first need some kind of data, and that's what we have been doing: we just take a dataset and try to approach the problem. We're lucky that there is an ongoing Kaggle competition with categorical data — let me see if I can find it. Here it is. I discussed this in one of my previous videos, but that was the old version of this competition; today we will look at the new version, which has more categorical variables. This data consists of binary features, nominal and ordinal features, and some cyclical features like day of the week and month. Let me quickly download the data. I have downloaded it into my input folder as train_cat.csv, and it looks like a huge dataset. You can see there are many different kinds of variables in it, and I want to first see what they look like. For that I'll use something I don't generally use: I make a new folder called notebooks and start Jupyter inside it. We create a new Python 3 notebook, import pandas, and read the data frame with read_csv on input/train_cat.csv. That was quite fast. You can see the frame has binary variables (bin_ plus a number), nominal and ordinal variables, and also day and month. Now we can just check: if I do df.bin_1.value_counts(), it gives me all the different values this column has — it has 0 and 1, so it's a binary column. You should check another one too; one of them has values for yes and no, hopefully. You can try the same for the nominal columns: nom_0, say, has three different categories, Red, Green, and Blue. If we check one of the ordinal variables, it has 1, 2, 3 — probably three different levels which, as I said, have an order. This one, ord_2, seems interesting: different levels of temperature. Then you have day and month, so we can also check df.day.value_counts() — you get the days of the week — and month will have 12 values, but let's check just for the sake of it. Okay, month, that's true. There isn't much more exploration to do with this data. You can see that some values appear much less often than others, and that's quite possible; it might also be that some categories appear in the training set we downloaded but not in the test data, so we'll have to handle that. The counts are quite low for some values, but it's not like they appear only ten times, so we don't have truly rare values. Okay, back to our workspace. In earlier episodes we created quite a lot of files, including a class for cross-validation, and we're going to do something similar today: a class for categorical features, so that in the future, whenever you have categorical features, you just specify what kind of processing you want and you're done. That's why the file is called categorical_features — or should I just call it categorical? There are many, many different ways of handling categorical data, so let's look at this data first. This is my ordinal column, ord_2; its levels start with Freezing, Warm, and Cold.
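The value_counts exploration above looks roughly like this on a toy frame (columns invented here to mimic the competition data):

```python
import pandas as pd

# Toy stand-in for the competition data; the real columns are bin_*, nom_*, ord_*.
df = pd.DataFrame({
    "bin_1": [0, 1, 0, 0, 1],
    "nom_0": ["Red", "Green", "Blue", "Red", "Red"],
})

# value_counts lists each distinct value with its frequency, most common first.
print(df["bin_1"].value_counts())
print(df["nom_0"].value_counts())
```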
Then there are Boiling Hot, Hot, and Lava Hot. One very basic way to handle categorical data is to just represent the levels as numbers: 0 for Freezing, say, then 1, 2, 3, 4, and 5. This works, and most of the time it just works: you represent the categories as numbers and then you throw a model at them that handles categories well — a random forest, extra trees, or gradient boosted trees like XGBoost or LightGBM, all kinds of tree-based models — and you don't need to care about the categorical variables beyond this step. This is called label encoding: you encode all the categorical values to a certain kind of label. Then there is another way: you can represent the labels as binary variables. If you have many different levels, you take the numbers from the label encoding and write them in binary, so 0 becomes 000, 1 becomes 001, 2 becomes 010, 3 becomes 011, 4 becomes 100, and 5 becomes 101. Sometimes you need to represent categories as binary variables like this, mostly when the data is in an ordinal format. Yet another way is one-hot encoding. In one-hot encoding you have a vector whose size is the number of different levels: for the zeroth level you put a 1 in the zeroth position and zeros everywhere else, and similarly for the others — only one bit is 1 and all the rest are 0. So in one-hot encoding you have fewer ones than in binary encoding, and that is sometimes good.
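A minimal sketch of the two encodings just described — label encoding the six temperature levels, then writing each label in binary (the level names come from the dataset; the small helper function is my own):

```python
# Label-encode six ordinal temperature levels, then show the binary
# representation described above.
levels = ["Freezing", "Cold", "Warm", "Hot", "Boiling Hot", "Lava Hot"]
label_map = {name: i for i, name in enumerate(levels)}

def to_binary(label, width=3):
    # integer label -> list of bits, e.g. 5 -> [1, 0, 1]
    return [int(b) for b in format(label, f"0{width}b")]

print(label_map["Warm"])                  # 2
print(to_binary(label_map["Lava Hot"]))   # [1, 0, 1]
```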
Why is that good? Because what you get for six levels is six new binary features — the first level becomes 100000, the next 010000, and so on down to 000001 — so your data now consists of columns that are almost entirely zeros. A matrix with very few ones can be stored as a sparse matrix, and when one column has tens of thousands of different categories, representing them in sparse one-hot format requires far less memory than the dense binary representation. And once you have a sparse matrix, you can also use simple models like logistic regression. So what we are going to do is probably avoid binary encoding and convert everything with one-hot encoding or label encoding. Going back to our dataset: it has a lot of different variables, an id column, and a target, so this looks like a binary classification problem — and yes, it is; about twenty percent of the target values are ones and the rest are zeros. Now, how do we start coding this? I want things to be very generic. We have different ways of handling the data — label encoding and one-hot encoding, and we can probably add binarization, and I will add some more things as we progress — but for now we start by making a class. The class name is CategoricalFeatures, and what should it expect as inputs? We have to define an __init__ function. Do we want a data frame, or some kind of processed data? Let's say we just want a data frame, and we want to know beforehand what the categorical features are and the encoding type. I will just save those as attributes.
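One aside before building the class — the memory argument for sparse one-hot from a moment ago can be checked directly. A one-hot matrix has exactly one 1 per row, so it stores n non-zeros for n rows no matter how many levels there are (scipy's CSR format is one way to hold it):

```python
import numpy as np
from scipy import sparse

# Five label-encoded rows drawn from six possible levels.
labels = np.array([0, 2, 1, 5, 3])
n_levels = 6

# One-hot as a sparse matrix: row i has a single 1 in column labels[i].
rows = np.arange(labels.size)
ohe = sparse.csr_matrix(
    (np.ones(labels.size), (rows, labels)), shape=(labels.size, n_levels)
)
print(ohe.nnz)  # 5 stored elements for 5 rows, regardless of n_levels
```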
So self.cat_feats holds the categorical features and self.enc_type the encoding type. Note that df is a pandas DataFrame, categorical_features is a list of column names — strings like ord_1 or nom_0 — and encoding_type, for us, can be "label" for label encoding, "binary", or "ohe" for one-hot encoding. I will build one function for each of these kinds of encoding. First, the label encoding function: it should take the dataset and the categorical columns and convert all the categories to numbers. Do we need anything else here? Maybe filling NaN values — let's look at that later. If I search for label encoding, scikit-learn offers a LabelEncoder class: you initialize the label encoder, fit it on the categorical values you have, and the transform function converts them to encoded labels. In the documentation example there are three different values, 1, 2, and 6; you fit on them, so you have three classes, and when you transform you are left with 0, 1, and 2 — 1 maps to 0, 2 maps to 1, and 6 maps to 2. You can also do an inverse_transform, and it also works with strings. Now, I saw that bin_0 has some NaN values in it, so I just want to check whether label encoding works with NaN values. From sklearn I import preprocessing, initialize the LabelEncoder class, and try to fit it on bin_0, using the column's .values — and fitting works perfectly fine; we already know the column has NaNs, and you can see some there. But if we now call transform on df.bin_0.values, it shows an error. So the fit goes through, but with NaN values the transform throws this error, and what you want to do is fill the NaN values. Say I know the column has only zeros and ones: I fillna with -1, then fit, and the transform uses that same -1 for the NaN values — you can see the fourth value, which was NaN, now gets an encoding. So we want to handle NaN values here. Should we leave that to the user to decide? If we leave it to the user, the input data frame must contain no NaN values. Instead, I'll add a handle_na argument, defaulting to False, so the class can handle NaN values if you ask it to. What we have to do then is go through every categorical feature and handle its NaN values, and the simplest way is to convert everything to a string: for each column name c, I take df[c], fill the NaNs with some imaginary value like "-9999999", and cast with astype(str). You have to be careful here: if the value you choose already exists in the dataset, you have a problem — but hopefully nobody puts a value like that in. Now you have adjusted your data frame.
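The NaN trick just described can be sketched like this (I fill before casting to string, so the placeholder actually replaces the NaNs; the sentinel value itself is arbitrary):

```python
import pandas as pd
from sklearn import preprocessing

# Column with a missing value, like bin_0 in the competition data.
df = pd.DataFrame({"bin_0": [0.0, 1.0, None, 0.0]})

# Fill NaNs with a sentinel, then cast everything to string so the
# label encoder sees one consistent dtype.
df["bin_0"] = df["bin_0"].fillna(-1).astype(str)

lbl = preprocessing.LabelEncoder()
lbl.fit(df["bin_0"].values)
encoded = lbl.transform(df["bin_0"].values)
print(list(encoded))  # the former NaN gets its own label
```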
So, label encoding: you create a preprocessing.LabelEncoder and fit it on the column. To fit it on every column you loop — for c in self.cat_feats, define a new label encoder for each of them — and store them in self.label_encoders; not a list, let's make it a dictionary. Then you call lbl.fit(self.df[c].values). Let me also create a new data frame, self.output_df, which is self.df.copy(deep=True); this makes a deep copy of the data frame, so when you change something in the original, the output frame does not change. We fit on the original data frame and then set self.output_df.loc[:, c] = lbl.transform(self.df[c].values). Now we have transformed a single column. And why the dictionary? We made it to save our encoders, because in case we want to do some transformations later, we want to reuse them: self.label_encoders[c] = lbl. I think we're done with label encoding — every value has been label encoded. Next we make a dispatch function; we could also pass a data frame into it, but I would like to keep things on the class. It says: if self.enc_type equals "label", call the label encoding function. Do we want to return anything? Let's return the output data frame; otherwise, raise an exception saying "Encoding type not understood". So far we have implemented only the label encoder, so how do I test it? I'll just add an if __name__ == "__main__" block where df is read with pd.read_csv.
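Putting the pieces together, here is a stripped-down sketch of the class as built so far (method and attribute names follow the video where it states them; anything else is my guess):

```python
import pandas as pd
from sklearn import preprocessing

class CategoricalFeatures:
    def __init__(self, df, categorical_features, encoding_type, handle_na=False):
        self.df = df
        self.cat_feats = categorical_features
        self.enc_type = encoding_type
        self.label_encoders = {}
        if handle_na:
            for c in self.cat_feats:
                # fill first, then cast, so the sentinel replaces the NaNs
                self.df[c] = self.df[c].fillna("-9999999").astype(str)
        self.output_df = self.df.copy(deep=True)

    def _label_encoding(self):
        for c in self.cat_feats:
            lbl = preprocessing.LabelEncoder()
            lbl.fit(self.df[c].values)
            self.output_df[c] = lbl.transform(self.df[c].values)
            self.label_encoders[c] = lbl  # saved for transforming new data later
        return self.output_df

    def fit_transform(self):
        if self.enc_type == "label":
            return self._label_encoding()
        raise Exception("Encoding type not understood")

df = pd.DataFrame({"ord_2": ["Freezing", "Hot", "Freezing", "Warm"]})
out = CategoricalFeatures(df, ["ord_2"], "label").fit_transform()
print(out["ord_2"].tolist())
```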
But I don't have pandas imported yet — do we need to import pandas here? I don't want to import pandas everywhere, so I'm just going to import it in this file. We read the CSV from input/train_cat.csv and create a CategoricalFeatures object with the data frame — it also needs the columns. Now, this data frame has everything: it has the id column and the target column, and we don't want to encode those, only the categorical feature columns. So I say cols is all the column names in df.columns that are not "id" or "target". We also need to fix one thing: only if handle_na is set do we do the NaN-filling step; otherwise we don't care. So we pass df, categorical_features=cols, the encoding type — which will be "label", because we have not implemented anything else yet — and handle_na=True, since I already know we have NaN values. My output_df is the result of the transform; let's print output_df.head() and run it. I start a new terminal — it has already activated the machine-learning environment we created — go into the folder, and run the script. Wait, I don't have it there; did I create it at some other place? Yes, I created it at a different location, so let me just move it. Now it's there, and I run python categorical.py. Okay, an error, obviously: my list comprehension had the wrong syntax — it has to be [c for c in df.columns if c not in ("id", "target")]. Run again and see what we get: it prints all the columns I have. It's taking a while, and that's pretty obvious — it has to take a while — but now you can see it has converted everything: it leaves id and target as they are and converts everything else. If you remember, bin_0 had NaNs too. We could also check the result in Jupyter, but we won't for now; we'll do it later. Now we can move to the next type — binarization, and then one-hot encoding. Label encoding is done, so let's look at label binarization now. In binarization you are basically creating a lot of new columns, right? For binarization there is also a module in scikit-learn. Here is something I didn't know: there is an OrdinalEncoder that encodes categorical features as an integer array — we don't need that — and there is LabelBinarizer, which binarizes labels in a one-vs-all fashion. Is this what we are looking for? Yes, this is what we are looking for. Binarization should be done the same way as label encoding; it's just that binarization gives you a lot of different columns, so we need to handle them. So I write label_binarization, and it goes the same way: for c in self.cat_feats, we initialize lbl as preprocessing.LabelBinarizer this time. I copy the label-encoding loop and adapt it, because it's a little different now: val = lbl.transform(self.df[c].values) will be a numpy array. Let's first take a look at how LabelBinarizer works: I fit it on one column, and you can see it takes the data frame column but produces three new columns. Back in our code, val is that array with many columns, and what you have to do now is remove the column you fit the label binarizer on and replace it with the new columns. So: for j in range(val.shape[1]) — but before the loop, drop the original column with drop(c, axis=1). The new column name should be something recognizable. In this dataset all the binary columns are already called bin_ something, but we have no other choice, so I'll name it the original column name plus "__bin_" plus the index, with an f-string: the new column name is the original column name, two underscores, "bin", underscore, and the number. You have to make sure no column with that name exists in the original dataset, but the probability of that is quite low. Then self.output_df[new_col_name] = val[:, j] — all rows, j-th column of val. Did we do something wrong here? Yes: we need to use the output data frame, not the original one, so we drop from output_df and assign to output_df, and now the function looks fine. One more thing: we need another dictionary for the binary encoders — let's call it self.binary_encoders — and we store self.binary_encoders[c] = lbl, then return self.output_df. To recap: we go through each categorical feature, initialize a LabelBinarizer, fit it on the column, transform it to an array, drop the column from the output data frame, create a new column name for every column of the val array and add it to the output data frame, and at the end we save the encoder. I hope this works.
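The binarization step for a single column can be sketched like this — fit a LabelBinarizer, drop the original column, and append one `<col>__bin_<j>` column per output bit:

```python
import pandas as pd
from sklearn import preprocessing

# Toy ordinal column with three distinct levels.
df = pd.DataFrame({"ord_2": ["Freezing", "Hot", "Warm", "Hot"]})
output_df = df.copy(deep=True)

lbl = preprocessing.LabelBinarizer()
val = lbl.fit_transform(df["ord_2"].values)   # shape (4, 3): one column per class

# Replace the original column with one new binary column per bit.
output_df = output_df.drop("ord_2", axis=1)
for j in range(val.shape[1]):
    output_df[f"ord_2__bin_{j}"] = val[:, j]

print(output_df.columns.tolist())
```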
To run this we need another branch in the dispatch that calls the binarization function when the encoding type is "binary" — what did we call the function? label_binarization, okay. This should work fine, in my opinion; let's try it in the terminal. It's probably going to take a while... okay, it doesn't work: in the loop I called transform before fitting — it should be fit, then transform. Hopefully it's not taking very long now... no. So what does it print? It doesn't look like it worked: you have the id and target columns, that's fine, and all these other columns, which should be binary, but I don't see them as binary — something weird has happened somewhere. Let's check... oh, we didn't change the encoding type to "binary" in the main block. Run it again and see how long it takes. While it finishes, we can look at one new function, a transform function for new data — I've already written this one, which I don't usually do. It accepts a data frame, and it first checks whether it has to handle NaN values; if so, it handles them the same way as before, but on the passed dataframe instead of self.df. Then there are the two encoding types: if the encoding type is "label", it goes through the label_encoders dictionary and transforms each column; similarly for "binary", except we iterate over binary_encoders.items(). But we cannot just assign the result directly, because a binary encoder produces an array with several columns, so we need the same column-expansion logic from before: I copy that part, and instead of the output_df I use the passed data frame — drop the column from it and add the new columns to it. Now it looks good. Why do we need this transform function at all? Because when you have a test dataset, you don't want to fit on it again — you only want to transform it. It transforms the data frame, but we also need to return something, so both branches return the data frame. As you can see, the run is now taking way longer, so I'm going to stop it — it is actually working, but I'll stop it and retry with head(500), just to make sure it works, because it's taking longer than I expected. Okay, this works: you have the id column and everything else, but now every original binary column has become several columns like bin_0__bin_0, and everything is binary. So this works for us — but if it's going to take this much memory and be this slow, why even bother using it? The next thing to try is one-hot encoding. Before that, though: we just created this transform function with a label branch and a binary branch, and in this same competition we also have a test dataset, which I downloaded as test_cat.csv. Now, I have already seen the test dataset, but we're just going to give it a try: I read it as df_test, rename output_df to train_transformed, and test_transformed is cat_feats.transform(df_test). In theory this should work; it runs the fitting again, and I'm just checking whether the transform goes through. Okay, it works — but there is another problem. Let me switch the encoding back to "label" and use all of the data; label encoding is quite fast, since it creates no new columns, so it should give me something quickly. It's working, working... and now it gets meaner: we see the error "y contains previously unseen labels". We saw this kind of error when we had NaN values, but in this case we have entirely unknown categories in the test set. The easy way to handle that is to combine the train and test data frames into one: full_data = pd.concat([df, df_test]). But df_test doesn't have a target column, so we create a fake target and assign it a value of -1. You also want to keep track of which rows are training data and which are test data, because you have to split them apart later, so you save train_idx = df["id"].values and test_idx = df_test["id"].values. With the train ids, the test ids, and a fake target on the test data, you concatenate train and test and pass full_data to the class instead of df, getting full_data_transformed. Then train_df is the subset of full_data_transformed whose id is in train_idx, with reset_index(drop=True), and test_df is the subset whose id is in test_idx. Let's print the shapes of the two. Keep in mind there is now a fake target column, so when you build a machine learning model you need to remove it — I will come back to that later. I have fixed things up, and I'm no longer using the transform function; I'm just using fit_transform on the whole data at once, including the test part. We run it — it's probably going to take a while — and it's done: six hundred thousand samples in the training set and four hundred thousand in the test set.
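The concat trick above, sketched on a tiny example (ids and values invented): an unseen category in the test set no longer breaks the encoder, because the encoder would be fit on the combined frame.

```python
import pandas as pd

train = pd.DataFrame({"id": [1, 2, 3],
                      "nom_0": ["Red", "Green", "Blue"],
                      "target": [0, 1, 0]})
test = pd.DataFrame({"id": [4, 5],
                     "nom_0": ["Green", "Purple"]})  # "Purple" is unseen in train

# Fake target so the two frames have the same columns, plus saved ids
# so we can split the combined frame back apart.
test["target"] = -1
train_idx = train["id"].values
test_idx = test["id"].values

full_data = pd.concat([train, test]).reset_index(drop=True)
# ... fit and apply the encoders on full_data here ...

train_df = full_data[full_data["id"].isin(train_idx)].reset_index(drop=True)
test_df = full_data[full_data["id"].isin(test_idx)].reset_index(drop=True)
print(train_df.shape, test_df.shape)
```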
We can quickly check that: I count the number of lines in train_cat.csv and get six hundred thousand and one — one extra for the header — and four hundred thousand and one in the test file, again one for the header. So that works, and now we have a working solution for categorical variables: we did label encoding and we did binarization, though not in a very efficient way. One way to do binarization more efficiently would be to not create data frame columns at all but to stack the binary arrays for the different columns together. I'm not going to do that in this video, but if you do it, you can send a pull request to the ML framework repository and I will merge the best ones — I know I still have to look at the exercises you did for the previous video, and I'm going to merge those too. It's quite easy: go to the label binarization function and, instead of creating new columns, keep track of which columns you saw and stack the values together into a new array; instead of returning a data frame, you then return a numpy array. Next is the third encoding, one-hot encoding, where you actually create a sparse matrix. Looking at scikit-learn's OneHotEncoder: okay, at first it doesn't work — "expected 2D array, got 1D array instead" — so I reshape the values with reshape(-1, 1) and fit on .values, and it creates a sparse matrix. We don't want to turn the sparse matrix into a data frame, because then you have to make it dense and give a column name to each and every column, which is going to use a lot of resources — we don't want that. Instead, we create one sparse matrix from the whole data frame: if you supply multiple columns at once, it should work — right? Hopefully. Okay, I was missing a bracket... and it works like this, so you don't have to fit a OneHotEncoder for each and every column separately, and we don't need the reshape anymore. We get a six hundred thousand by six sparse matrix with 1.2 million stored elements. Let's incorporate that in our class: I add self.ohe, assign it None, and create a new function, _one_hot. Previously we were going through each column name in cat_feats, but here we just use all the columns at once: ohe = preprocessing.OneHotEncoder(), fit it on self.df over self.cat_feats — I think that's what I called the attribute — taking .values, and then return ohe.transform on the output data frame over the same features; transforming the same features here is just a sanity check. This returns the sparse dataset. Now we add the branch: if the encoding type is "ohe", call _one_hot. Let's see whether this works. You can no longer split by the id column, since it's gone from the sparse output, so instead use the lengths: train_len is len(df) and test_len is len(df_test). full_data_transformed is still computed the same way, but instead of indexing train_df by id, you take full_data_transformed[:train_len, :] — all rows up to train_len — and for the test set the same thing from train_len onward. I hope that works.
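A minimal sketch of fitting one OneHotEncoder on several columns at once; the result is a scipy sparse matrix rather than a dataframe (toy data: two binary values plus three colors give five one-hot columns):

```python
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({
    "bin_0": ["0", "1", "0", "1"],
    "nom_0": ["Red", "Green", "Blue", "Red"],
})

# One encoder over all the categorical columns at once; the output is sparse.
ohe = preprocessing.OneHotEncoder()
sparse_out = ohe.fit_transform(df[["bin_0", "nom_0"]].values)
print(sparse_out.shape)  # (4, 5): 2 + 3 one-hot columns
```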
That's right, let's see if that works. I know how much time it takes. It seems to be giving us an error: no attribute '_one_hot'. Why not, we just added it? Okay, okay, let me make this smaller... binary, okay... so this is not a function, self._one_hot, it is there... oh, I called it ohe, huh. Let's run it again. So it gives us an error, and we have made a big, big blunder. It's telling me that the input contains NaN values, and it's in the transformation function. The big blunder that we have made, in a few different places, is this: we should always do the transformation on output_df, but here in label binarization we are not doing the transformation on output_df. So first let's fix that: do the transformation on output_df... and you can see now... okay, probably an easier way to fix it would be to just use df everywhere: do the transformation on df, do the transformation on df, and then here the same, so transform df... okay, now we're done. One more thing that you can do is just move this from here to here, so if you're handling NaN values you make the copy after that. So this should probably be working fine now, let's try... yeah, it seems to be working fine, and you can see now you have five thousand seven hundred twenty-four columns in the training and test data. So we're almost done with handling categorical features. I mean, this is not everything there is to categorical features, but since we are doing this, I was wondering: why not just make a very simple model, try to submit it and see what happens? We already have the train dataframe and the test dataframe, we have everything that we need. What we want to do is train a simple model, so I will say, okay, my simple model is logistic regression, and I need to import that. Let me make this smaller for you... just disappear for now... okay, so from scikit-learn we import linear_model. And there is one more thing that we need. I always like doing it this way; there are probably many other ways of doing this. We need a sample submission dataframe, because we are just going to check how it performs on Kaggle. It's a very simple model, I don't expect it to perform very well. So I'm just going to download sample_submission.csv and let's see if it works... yeah, okay, and let me just drop it into the input folder. Okay, I should have sample_submission... I have sample_submission here, so now I'm reading the sample submission, and I want to just train a very simple model. What I'm going to do: I'm not even going to do any kind of cross-validation, and I'm not going to tune any parameters. And this is not a dataframe anymore, so I'm just going to fix the names: X and X_test. So you have X now. It's very important to note here that in this whole class we have not changed the order of the training examples, we have not done any kind of shuffling. If you have to do something like that, do it here, right after you read the data, because when you combine train and test and then use the stored lengths to go back to the original order, it's very important that you keep the order. So you have logistic regression now, and then you want to fit on X, and my target will be df.target.values, and then I just generate the predictions: clf.predict_proba on X_test, taking everything for class one. I've already seen the metric of this competition, it's AUC, so you want to have probabilities for class one. Then let's see the sample submission file, how it looks: you have ID and target. Okay, so I'm going to say sample.loc, all rows, target, equals these preds, and I'm just going to save it as a CSV, submission.csv, with index=False, because we don't want the index column. Okay, I'm going to run this and see if it works, and hopefully it should. Let's see... going to the terminal, python categorical.py... so it worked. It gave me some warning, total number of iterations reached, okay, that's fine.
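Putting the baseline together, here is a hedged sketch of what the training snippet described above might look like. The matrices here are random stand-ins for the one-hot-encoded train/test data, and the `sample_submission.csv` path in the comments is an assumption about the folder layout:

```python
import numpy as np
from scipy import sparse
from sklearn import linear_model

# stand-in for the one-hot encoded train/test matrices from the class
rng = np.random.default_rng(42)
X = sparse.random(200, 30, density=0.2, random_state=0, format="csr")
X_test = sparse.random(80, 30, density=0.2, random_state=1, format="csr")
y = rng.integers(0, 2, size=200)

# no cross-validation, no parameter tuning: just a quick baseline
clf = linear_model.LogisticRegression()
clf.fit(X, y)

# the competition metric is AUC, so keep the probability of class 1
preds = clf.predict_proba(X_test)[:, 1]
print(preds.shape)  # (80,)

# writing the submission would then look roughly like:
# sample = pd.read_csv("../input/sample_submission.csv")
# sample.loc[:, "target"] = preds
# sample.to_csv("submission.csv", index=False)  # no index column
```

Logistic regression accepts the scipy sparse matrix directly, which is why the one-hot output never needs to be densified.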
I'm not doing any kind of parameter tuning at the moment, so I don't care about that. So I have the sample submission file... sorry, the final submission file; it looks like this, looks fine to me, so let's try submitting it. I'm uploading the submission file now, let's see how it performs. Maybe we did something very wrong, we don't know, because we are not doing any kind of cross-validation at the moment, we are just training a model... and the score is 0.78437. Okay, 0.784, it's quite low... yeah, okay, so it's quite low, but 0.78437 should be around rank one hundred something, which is not bad to be honest, and this was our first try, right? That's quite good in my opinion. There are a lot more things that you can do with categorical variables. In this video I did not talk about handling rare values, and while we found a way to work around the problem of new categories appearing in the test dataset, that can also be handled in a much better way. Then you have many different options, like replacing categorical variables with their frequency counts, or replacing them with some kind of target encoding, which I don't like to be honest, because it leads to overfitting, but if you do it in a correct way then it's very useful and brings a lot of value. There are also many other techniques for categorical variables, like feature engineering techniques, that I have not discussed, because there was not enough time for this video and I think covering three concepts is good enough. And there is one more thing called entity embeddings for handling categorical data, which is also quite interesting and works quite a lot of the time; for me it almost always works. So I'm not going to skip these topics, I'm going to create separate videos on them, and they will be part of this handling categorical data video series. So in this episode I think we are done: we have seen that you can use label encoding, binarization and one hot encoding. But we have one thing missing from here, and that's the one hot encoding branch in the transformation function, so I'm just going to add that. I say, okay, elif the encoding type is "ohe", one hot encoding: if it's one hot encoding, what you do is, you have the dataframe and you just want to return self.ohe, I think I'm calling it ohe, yeah, so self.ohe.transform on the dataframe with self.cat_feats, just to ensure that the column order is the same, and .values, and that's it, you're done. Else, raise an exception saying "encoding type not understood". So this is the first part of categorical features; there is a lot more cool stuff coming soon, so stay tuned. If you have any improvements for whatever I have done in this class, feel free to send a pull request and we will merge all the improvements, and if there are some mistakes that I have made, let me know in the comments. If you want any kind of improvements or something else, let me know. For this specific part we have used the Categorical Feature Encoding Challenge II, so take a look, and stay tuned for the other parts on categorical features, it's going to be amazing. See you, bye!
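For reference, the transformation branch added at the end can be sketched as a free function over an already-fitted encoder. The names (`ohe`, `cat_feats`, `enc_type`, the toy `bin_0` column) are reconstructions from the audio, not the exact code in the video:

```python
import pandas as pd
from sklearn import preprocessing

def transform(ohe, cat_feats, enc_type, dataframe):
    """The transformation branch from the video, written as a free
    function over a fitted encoder instead of a class method."""
    if enc_type == "ohe":
        # select cat_feats explicitly so the column order matches
        # exactly what the encoder was fitted on
        return ohe.transform(dataframe[cat_feats].values)
    raise Exception("Encoding type not understood")

# fit on one dataframe, then transform another with the same columns
ohe = preprocessing.OneHotEncoder()
ohe.fit(pd.DataFrame({"bin_0": ["a", "b"]})[["bin_0"]].values)
out = transform(ohe, ["bin_0"], "ohe", pd.DataFrame({"bin_0": ["b", "a"]}))
print(out.shape)  # (2, 2)
```

Selecting `dataframe[cat_feats]` inside the function is the sanity check mentioned in the video: it guards against a caller passing columns in a different order.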
Info
Channel: Abhishek Thakur
Views: 11,013
Rating: 4.9148936 out of 5
Keywords: machine learning, categorical data, kaggle, sklearn, preprocessing, feature engineering
Id: vkXEHpuu03A
Length: 84min 0sec (5040 seconds)
Published: Sun Jan 19 2020