Correcting Skewed Data with Scipy and Numpy

Video Statistics and Information

Captions
Welcome back to the channel. In today's video we're going to discuss correcting skewed data. Skewed data can impact your data analysis in ways you might be less familiar with, but in short, most of the machine learning models you might be building are based on the assumption that the data is somewhat symmetrical. What do I mean? Suppose we have a data set that follows a Gaussian distribution: the mean is in the middle, and roughly half the values fall on each side of the mean. Now suppose we have skewed data instead, something that looks more like a log-normal distribution with a long tail. That tail can really impact the performance of your model. So when we talk about correcting skewed data, what we really want to do is compress it: the larger values out in the tail should shift in toward the mean. If we do that, you will sometimes see a shift of the mean itself; that's okay, because the smaller values shift much less. Say we have a value out in the tail of 16, and a value near the mean of 2. The distance between them is just 16 minus 2, which equals 14.
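As a quick sanity check of that arithmetic, here are the two example values and the distance between them before and after a square-root correction:

```python
import numpy as np

# The two example values from the discussion: one in the tail, one near the mean.
tail, near = 16, 2
print(tail - near)                    # uncorrected distance: 14
print(np.sqrt(tail) - np.sqrt(near))  # sqrt-corrected distance: ~2.59
```

The tail value collapses from 14 units away to about 2.6, while values already near the mean barely move.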
That would be the uncorrected data. Now let's apply a square-root corrector: we take the square root of 16 minus the square root of 2, and that represents our square-root-corrected distance. So now we have 4 minus 1.4, which is approximately 2.6. We've gone from values 14 units apart to values only 2.6 apart, so the values out in the tail move much more than the values near the mean. That's really what we're doing when we talk about rescaling the data. We could extend this to the cube root or the fourth root, or apply the log of the values, and that's what we're going to do in our notebook right now.

Here we have a notebook with a few modules that we're importing: pandas, numpy, matplotlib and seaborn for plotting, as well as scipy.stats; we'll talk about this Yeo-Johnson corrector in a little bit. In the next cell we create two distributions: one is a log-normal distribution with a thousand samples, and the other is a Gaussian distribution to anchor us and show what a normal distribution looks like. We set the random seed to 3 so that each time we run this we get the same output; if you want to follow along on your own, set your seed to 3 and you'll see exactly the same output I get. We're also going to calculate the skewness of each data set — the asymmetry of the data — as we apply these correctors, so you have a quantitative sense of how it's moving. If we run the cell, you can see that the skewness of our normal distribution is close to zero, while our log-normal distribution has a skew of about 7.2. As a rule of thumb, we're hoping to get the skew between negative 0.5 and positive 0.5, which is generally considered symmetric enough. The closer to that range we can get, the better, though there may be trade-offs depending on the model,
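A minimal reconstruction of that setup cell follows; the exact distribution parameters aren't shown in the video, so `mean=0, sigma=1` for the log-normal and a standard normal are assumptions.

```python
import numpy as np
import pandas as pd

# Reproducible draws: the video uses 1,000 samples and random seed 3.
np.random.seed(3)
lnorm = pd.Series(np.random.lognormal(mean=0, sigma=1, size=1000))
norm = pd.Series(np.random.normal(loc=0, scale=1, size=1000))

# pandas Series expose a .skew() method (bias-corrected sample skewness).
print(f"normal skew:     {norm.skew():.2f}")   # close to zero
print(f"log-normal skew: {lnorm.skew():.2f}")  # strongly positive
```

Wrapping the arrays in `pd.Series` is what makes the `.skew()` method available later in the notebook.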
but that's beyond the scope of what we'll discuss today. In the next cell we plot the kernel density of the two distributions; this is basically a smoothed histogram. We can see that the blue trace, the Gaussian, is very symmetric, while the orange trace is the log-normal distribution with a long tail trailing out near 40. Our hope now is to compress that tail so the high values sit closer to the mean and the skewness of the data is reduced.

So let's begin. Instead of writing everything out, I'm just going to copy this cell and paste it below. The first skew corrector I want to show is the square root. If we apply the square root using numpy and plot the data again — the skewed data is shown in orange — we can tell from the x-axis that we've removed a lot of the skew: instead of large values near 40, the values are now compressed to between roughly zero and six. You can see how the larger values moved more than the smaller ones. Let's look at the skew: we copy that expression and use the skew method, and you can see that instead of a skew value near 7.2, our skew is now 1.9. We'll save that into a dictionary, which we'll call skew_dict, under a "sqrt" key, so we can create a bar plot at the end comparing all the methods.

The next correction to apply is the cube root. If the square root does some compression, we might expect the cube root to do even more. There isn't a direct cube-root method on a Series, but we can raise the data to the power one over three, which mathematically gives us the cube root. Looking at the result, we've compressed the data even further, and it's looking significantly more Gaussian than even the square root. Before we look at the skew, note that the output is a pandas Series — that's how we're able to use the skew method; just make sure you wrap
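The square-root step can be sketched like this (distribution parameters assumed, as before):

```python
import numpy as np
import pandas as pd

np.random.seed(3)
lnorm = pd.Series(np.random.lognormal(size=1000))  # assumed parameters

skew_dict = {}

# Square-root correction: tail values shrink much more than near-mean values.
sqrt_corrected = np.sqrt(lnorm)
skew_dict["sqrt"] = sqrt_corrected.skew()
print(f"sqrt-corrected skew: {skew_dict['sqrt']:.2f}")
```

Because `np.sqrt` applied to a Series returns a Series, `.skew()` can be chained directly onto the result.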
the entire statement in parentheses so you can chain .skew() onto it. You can see our skew has been further improved, down to 1.1. One more in this series: let's look at the fourth root, raising the data to the power one over four. Again we've compressed it even further, and repeating the analysis, we've gotten the skewness down to less than one. You can continue to explore this; the root is one of the parameters you might hope to optimize as you evaluate your models. We'll save this result as well.

Now, I mentioned in an earlier video that we know we have a log-normal distribution, which tends to be the case for a lot of measurement data, where most of your values are small and a few are large. One of the best ways to correct a log-normally distributed array is to take its log, so let's look at what that does now that we have a better sense of what skew correction looks like. If we erase this and apply the np.log method instead, we've now corrected our log-normal distribution to have a mean of zero, and it looks very similar to our Gaussian reference in blue. If we copy this and look at the skew, it's very small, near zero. This is ideally what we were hoping to achieve by applying the other compressions, but because we understand the distribution is log-normal, we'd expect the log to be the best of the methods we've tried so far. So let's record this too.

Next we're going to look at one more skew corrector, called a power transformer. The math is a little more sophisticated than the others, but I still want you to see it, because if you have data that's severely skewed or has some other abnormality in its symmetry, this can be a good approach. That's where the Yeo-Johnson transform comes in. It's in the scipy.stats module, and there are lots of other correctors there too, so I advise you to go take a look and get familiar with them. We'll use it at
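The cube-root, fourth-root, and log steps together look roughly like this (same assumed distribution parameters):

```python
import numpy as np
import pandas as pd

np.random.seed(3)
lnorm = pd.Series(np.random.lognormal(size=1000))  # assumed parameters

# No dedicated root methods on a Series, but fractional powers work;
# parentheses around the expression let us chain .skew() directly.
cube_skew = (lnorm ** (1 / 3)).skew()
fourth_skew = (lnorm ** (1 / 4)).skew()

# Since the data is log-normal, np.log is the natural corrector:
# the log of a log-normal variable is exactly normal.
log_skew = np.log(lnorm).skew()

print(f"cube root: {cube_skew:.2f}")
print(f"fourth root: {fourth_skew:.2f}")
print(f"log: {log_skew:.2f}")
```

Each higher root compresses the tail a bit more, and the log correction should land very close to zero skew for this particular distribution.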
the bottom here. All we need to do is call this method. If I shift-tab so you can see the signature, you see we have a number of parameters, and the most important is this lambda parameter. If you don't pass a lambda argument, the method will actually optimize for the proper lambda, and we can see what it chose once we run it. For x we just pass in our data set, lnorm, and we receive a tuple: the first element is our transformed array — the data we've been looking at, as a numpy array — and the second value is that lambda. I'm going to copy that optimal parameter, and when I pass it back in, the method returns just the 1D array, since we're now telling it which lambda to compute with.

Now let's look at the skew. This data is a numpy array, so I can't just use the .skew() method like I did before — if I try, we get an AttributeError, because we're dealing with a numpy array. The way around that is to use the skew function from scipy.stats. I know this seems a little confusing namespace-wise, but I imported skew at the top of the notebook for exactly this reason. Now we have another skew value that's between our -0.5 and 0.5 cutoff, so this could be a suitable approach for your machine learning algorithm. There's a similar transformer called Box-Cox, which you can use if your values are all positive; Yeo-Johnson is more flexible and allows negative input values as well, which I know this data set has.

Next let's plot this last result and see what it looks like. You can see that with this particular power transformer the data is severely compressed: the range now runs from just below zero to around 1.5 or 1.6, and for most machine learning models this will significantly improve performance. So depending on how your data is skewed, this is a really powerful technique. However, which
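A sketch of the Yeo-Johnson step, using `scipy.stats.yeojohnson` (distribution parameters assumed, as before):

```python
import numpy as np
from scipy.stats import yeojohnson, skew

np.random.seed(3)
lnorm = np.random.lognormal(size=1000)  # assumed parameters

# With no lmbda argument, scipy optimizes lambda via maximum likelihood
# and returns it alongside the transformed array.
transformed, lam = yeojohnson(lnorm)
print(f"optimal lambda: {lam:.3f}")

# The result is a plain NumPy array, so pandas' .skew() is unavailable;
# scipy.stats.skew works on arrays directly.
print(f"Yeo-Johnson skew: {skew(transformed):.3f}")

# Passing the lambda back in returns only the 1D transformed array (no tuple).
same = yeojohnson(lnorm, lmbda=lam)
```

Calling it again with an explicit `lmbda` is how the notebook locks in the optimized parameter on a later run.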
technique is ideal really depends on the scenario, and I recommend trying these out and checking model performance before making a final call. We have this dictionary called skew_dict that I want to convert to a Series, and next we'll plot it so we can see how each of these skew-correction methods performed. Here we can see that the log and Yeo-Johnson transforms get closest to the normal distribution, and that the results keep improving as we increase the power of the root. So there are many ways to correct the skewness of your data, and it really comes down to what you want to do with it next. Sometimes these really powerful transformers are good; sometimes they can also make the story a little harder to interpret. So depending on context, I advise you to try them all. In any case, if you enjoyed this video: like, share, subscribe, and I'll see you in the next one. Peace.
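The final comparison cell can be reconstructed roughly as follows; the key names and the headless `Agg` backend are assumptions for a self-contained sketch.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: render without a display
import matplotlib.pyplot as plt
from scipy.stats import yeojohnson, skew

np.random.seed(3)
lnorm = pd.Series(np.random.lognormal(size=1000))  # assumed parameters

# Skewness left behind by each corrector.
skew_dict = {
    "none": lnorm.skew(),
    "sqrt": np.sqrt(lnorm).skew(),
    "cube root": (lnorm ** (1 / 3)).skew(),
    "fourth root": (lnorm ** (1 / 4)).skew(),
    "log": np.log(lnorm).skew(),
    "yeo-johnson": skew(yeojohnson(lnorm.to_numpy())[0]),
}

# Convert the dictionary to a Series and bar-plot the comparison.
skew_series = pd.Series(skew_dict)
skew_series.plot(kind="bar", ylabel="skewness")
plt.tight_layout()
plt.savefig("skew_comparison.png")
print(skew_series.round(2))
```

The bar chart makes the trend easy to read: each higher root compresses more, with log and Yeo-Johnson ending up closest to zero.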
Info
Channel: Christopher Pulliam, PhD
Views: 5,955
Keywords: log normal distribution, scale log-normal distribution, numpy log, yeo johnson transformation, scipy yeo johnson transformation, scipy yeo johnson, scipy yeo johnson python, nth root data transformation, square root python, data cleaning numpy, data cleaning scipy, scipy, numpy, cleaning log normal data tutorial, np sqrt, dataframe skew, how to measure skew of dataframe, pandas skew method, python, log normal distribution explained
Id: ngBxYo6FQiI
Length: 11min 35sec (695 seconds)
Published: Sat Mar 04 2023