Dython Library in Python for Correlation Analysis between multiple Categorical variables.

Video Statistics and Information

Captions
Hope you're doing good. Today, for the sake of my software presentation, I will be presenting a library in Python that I found very interesting. Basically, I'll walk you through my exploration of the library, how it is useful, and what you can do with it.

I found the library through an article posted on Towards Data Science on Medium by Shaked Zychlinski; I hope I'm pronouncing his name right. He wrote a very good article, The Search for Categorical Correlation, on the library that he built himself. Let me walk you through a little bit of his article. Basically, the library aims to make it easier to calculate correlations, especially between categorical variables and for a mixture of continuous and categorical variables.

The article starts by talking about what correlation is. In human language, correlation is the measure of how well two features are correlated. Mathematically, this is the definition of correlation, and one of the common correlation measures is Pearson's r. But the article points out that Pearson's r doesn't work for correlation analysis between variables that are categorical. It goes on to say that one way you can calculate correlation between categorical variables is by creating dummy variables, that is, one-hot encoding them. The curse of that is that even if you have only 15 to 22 categorical variables, one-hot encoding them blows up the number of features: in this example, the 22 categorical variables of the mushroom dataset were converted into 112 features. And this is what they got, which is not very easy to visualize or even read if you have the numerical correlations.

So, when going categorical, there are multiple existing ways
to check categorical correlations. These can include distance measures, but in this article, and generally in this presentation, we're going to focus on Pearson's chi-square test, which determines if there is an association between two categorical variables, and then on Cramér's V, which is based on Pearson's chi-square test. Basically, it measures the strength of the association. Unlike Pearson's r for numerical variables, in the case of Cramér's V the strength of the association is always going to be from zero to one, where zero means there is no association and one means full association between the two categorical variables.

Then it talks about how Cramér's V is symmetrical, so it is insensitive to swapping x with y. Later on, they introduce another method to check the strength of the association, I think called Theil's U, which takes care of the symmetry problem. This is just a basic function that was used for the library, one of the smaller versions of it. And then what you're going to see is those 22 variables from the mushroom dataset displayed, or visualized, using the Python library dython, which is built upon some dependencies, matplotlib and a few others, and I'll show you that later. And then you can see the strength of the association between different variables.

The article later talks about the curse of symmetry that I was just talking about: based on the value of the x column you may not be able to tell the value of the y column, but based on the y variable you might be able to tell the value of the x variable. For that they introduced this method, Theil's U, also called the uncertainty coefficient, which is based on the conditional entropy between x and y. I won't go into the statistics of that right now, but the article is
here, it's called The Search for Categorical Correlation, and you can always go ahead and read the article online. Going forward, it shows you how Theil's U gives a different value in each direction: let's say odor has a very strong association with class, but that does not necessarily mean that class has a very strong association with odor. You can see 0.91 for odor and class, and for class and odor it's 0.39, if I'm not wrong.

So I just wanted to give you a quick overview of the library, the background behind who built it, and what its purpose is. The article later talks about how you can check correlation between categorical and numerical variables, so for a mixture of both.

Going to the library: the link is attached to the article too, and it's called dython. It's built by Shaked; I might not be pronouncing his name correctly. But anyway, let me talk to you a little bit about the library. The dython library was designed with analysis use in mind, so for your general plotting of accuracy scores and these categorical correlation functions. The main aim behind the library was ease of use and functionality; production-grade performance was not taken into account that much. The modules can be found here, and I'll walk you through the steps.

There are two ways to install the library: either using pip, which I recommend, or using pip install directly from the source code on GitHub. I personally use this one. What you have to do is just copy the command and paste it in your terminal. I have done that already, so I won't be going through it. My shell is running with Anaconda, so that's why I open a new window. You just open that, paste that command, and hit
Enter, and it will install the library directly. Once you have done that, you can just import the library.

So let me walk you through a little bit of the library itself and the source code that is available on GitHub. The best way is to start from the docs, which tell you where to get started. There is an installation file there that helps you walk through the installation process. Make sure that you're running Python 3.5 or higher and that you have the following packages installed: NumPy, pandas, seaborn, matplotlib, scikit-learn, and SciPy. I call that "spicy", but yeah, SciPy.

Going back to the documentation, if you go to the docs again you can see the modules, and for this part of the presentation I will be focusing on the nominal module. Just make sure you have all the dependencies, NumPy and everything, because it's going to use the plotting features from matplotlib and seaborn, and SciPy as well.

The main function in this module, the one that calculates the correlations, is the associations function, and I'll walk you through that. This function accepts a lot of parameters, which can be optional based on what kind of analysis you want to do. Generally, it allows you to calculate Pearson's r for continuous-continuous pairs, a correlation between continuous and categorical variables, and finally the one that I just showed you, for two categorical variables. This is the one we are going to be focusing on for the sake of our presentation.

What you need to do is import the libraries and the function. This function accepts a dataset, which is going to be your pandas data frame, and a list of columns; if it's not given, the parameter is going to
be 'auto', but if you want to specify which columns are categorical, you should do that; that is the best way. You might hit some nuances if you have mixed data or your data is not properly converted to the right type, so I'll show you an example later. Then mark_columns controls whether the plot displays if each variable is categorical or numeric. Then theil_u equals False; I think this is the default. Let me see: yes, the default is False. That is for when you want to use Theil's U coefficient, which takes care of the symmetry problem for us. For plot, the default is going to be True. There are other parameters that can be put in there, and I'm not going to walk through all of them right now, but the main ones were the ones that I just walked you through.

The library itself has some examples. I think they are in the docs, under getting started and then examples. There is another file that also has examples, but we'll just focus on this one. They're doing this on the iris data. They import the libraries and the function from the dython package, create the data frame, format the target variable, and convert it to strings to allow the association; I'll come back to this point later. Then they're passing the data frame and then the target column, which is categorical, as the nominal column. And then you can see the association between them. I'm not sure if the column-marking option is on; that might show you the type of each variable. Some more complex examples are there too: you can just pass a list of columns to say that these are categoricals, and then you can just pass that in
there. There is also a ks_abc example, a function that plots the area between curves of a simple binary classifier for the breast cancer dataset. I won't be going into that too much; I'm just going to focus on the categorical variables.

So let me give you a real-world example. I'm using this library for the correlation analysis in my capstone project. It is basically a dataset from the NHANES website, and I'm going to be focusing on the questionnaire dataset. We have a few target variables, and we have to decide which ones to keep, but we want to check the correlations of these target variables with the other categorical variables in the dataset. So just let me load the required libraries, and we can do that right now.

So the libraries are loaded. For this part I've just imported four libraries that I think are going to be necessary. Here is my dataset; I'm going to be using that. This is something I did earlier, so just ignore that. Check the data types: right now they're integers. You can convert them into factors. When I did that, it gave me some errors; it was still considering them as numerical and then giving me a correlation. One thing that distinguishes the categorical-categorical case from continuous-continuous is that with Cramér's V the association is always going to be from 0 to 1; it's never going to be negative. So if you see a negative value, this means that either you don't have the right data type or it's treating the column as numeric. Just double check whether there are any null values. Then I passed in my data frame; theil_u, for now we can just set that to False; and then I'm passing this. So if you don't want to
convert them, the best way is to pass the column names. Since for my dataset everything is going to be treated as categorical, I'm going to do that, and then the figure size. Let it run; my computer might be a little bit slow. So yeah, you can see it tells you the correlation between different variables, and obviously the one between each variable and itself. We can see that oversleepy and stop breathing have a moderate relationship. It depends on how you want to interpret the value of Cramér's V; that's a debate for another day. But this library is just that simple: it gives you the numeric association as well as the plot. If you don't want the plot, there is a description of how to turn it off.

I'm going to drop this column because it didn't have the right number of categories to calculate Cramér's V, and then we are going to run another plot where I'm going to pass in a cmap parameter, as one of the other palettes to plot with.

One quick hack for this library: if you had just done it with this figsize, it would not be very easy to export the plot or download it at the right figure size. So what you might want to do is define a figure and an axis, and then pass that into your parameters, ax equal to the axis defined here. Then you save that figure, give it a name, and, if you want, give it a path, the way you generally give a path. So this one, if we plot it, is going to look something like this. I have two variables, ignore that, alcohol and alcohol category. They are both categorical, and the library is treating them both as categorical; however, one of them has a different format and
the levels are encoded differently, one as one, two, and so on, and the other as different strings. Anyway, they're the same thing. And you can see we can visualize the correlation between how often you snore and stop breathing. This library is very handy when you actually want to perform a correlation analysis, especially when it comes to categorical-categorical variables. There are many other ways to do that, but this gives a bigger picture and an easier way to visualize, and it's very simple, as you can see.

I have another dataset, the same dataset but with different variables. You just import the libraries, load your data, and check your data for any missing values or weird data types or something like that. That's something that I was doing while converting it to a string; you don't have to do that if you're passing the columns in here. Then you just pass the data frame, the parameter theil_u equal to False or True, and then ax, and you should be able to get the plot.

Now there's one thing that I want to show you. My variables are supposed to be categorical; that's how they are in the dataset. Here it is treating them as numerical, which I don't want. It's still going to do the job, because my data has discrete values from 1 to 4 or 1 to 5, and it's treating them as numerical. So what I should do here is pass the parameter nominal_columns equal to 'all', or a list of columns; since everything is nominal, we want 'all', and we can do that. Then one more parameter that we can pass is, let's see, going back to the example. You can always come here to the modules to see what parameters are there. We'll go back and look at the module for the Python library, and let's see, there is this parameter mark_columns, so we're going to
set mark_columns equal to True to give us an idea. So let's do that, mark_columns equal to True, and then try to run it again. See, now it's treating everything as nominal, or categorical, and you can see the values are not negative anymore, and they're way different from the ones we were getting earlier. There were a lot of high correlations before, but in this one there are not a lot of high correlations, since they're categorical. We can see that there are some strong ones: smoking cigarettes has a very high correlation with hard drugs.

So this library is very handy when you want to perform a correlation analysis. It also has a lot of other functions that you might want to look at. There are other modules; the sampling one might have some other functions. I haven't explored that a lot, but it is handy. I'm assuming this one is for sampling, and there is this one for the accuracy measures, if you want to plot the area under the curve and stuff like that. So feel free to explore the library. I found it really interesting when I wanted to perform a correlation analysis on a dataset that might be very, very huge, or a subset of the data, just to see which of these variables were very strongly correlated. If you do it this way, with the fig and ax, you might not see the numerical values, as you can see here, and there are a lot of other formatting options that you might want to use.

So that's mainly what I wanted to show you about this library. If you have questions, you can post them, I think, in the discussions on the GitHub page. I have been using it for a few days, and I see that a lot of the questions and problems that I had were already discussed in the pull requests section or in the issues, which were closed. You can look there to get an idea if you already have a question that is there. And yeah, explore the library,
download it, play around with it, and have some fun.
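
The contrast the talk draws between Cramér's V (symmetric, always between 0 and 1) and Theil's U (asymmetric) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not dython's implementation: the function names `cramers_v`, `conditional_entropy`, and `theils_u` and the toy columns are made up for this example, no bias correction is applied to Cramér's V, and the library's own results may differ for that reason.

```python
import math
from collections import Counter


def cramers_v(x, y):
    """Cramér's V (no bias correction) for two equal-length label sequences."""
    n = len(x)
    x_counts = Counter(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    # Pearson's chi-squared statistic over every cell of the contingency table.
    chi2 = 0.0
    for xv, nx in x_counts.items():
        for yv, ny in y_counts.items():
            expected = nx * ny / n
            observed = xy_counts.get((xv, yv), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:
        # A column with a single category: V is undefined; 0.0 is a choice
        # made for this sketch only.
        return 0.0
    return math.sqrt(chi2 / (n * k))


def conditional_entropy(x, y):
    """H(X | Y) in nats: uncertainty left in X once Y is known."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    h = 0.0
    for (_, yv), nxy in xy_counts.items():
        p_xy = nxy / n
        p_y = y_counts[yv] / n
        h -= p_xy * math.log(p_xy / p_y)
    return h


def theils_u(x, y):
    """U(X | Y): fraction of the uncertainty in X explained by knowing Y."""
    n = len(x)
    h_x = -sum((c / n) * math.log(c / n) for c in Counter(x).values())
    if h_x == 0:
        return 1.0  # X is constant, so there is nothing left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


# Hypothetical toy columns: y determines x, but x does not determine y.
x = ["a", "a", "b", "b"]
y = ["p", "q", "r", "s"]

print("V(x, y)  =", cramers_v(x, y))
print("V(y, x)  =", cramers_v(y, x))
print("U(x | y) =", theils_u(x, y))
print("U(y | x) =", theils_u(y, x))
```

On these toy columns Cramér's V is the same in both directions, while Theil's U is larger for predicting x from y than for predicting y from x, which is the asymmetry described in the talk. In dython itself, the talk's approach is to call the associations function on a pandas data frame with nominal_columns set to 'all' (or a list of columns) rather than computing the measures by hand.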
Info
Channel: Maaz Rana
Views: 892
Keywords: #Data Science #Python #Categorical Variables #Correlation #Cramers' V
Id: sYZ2KfT7Ryc
Length: 24min 46sec (1486 seconds)
Published: Sun May 02 2021