11 Visualising Correlations with a Heatmap

In the meantime, let's do a little bit of work in our Python code to make this table clearer. Let's visualize our correlations in a way that we could put into a really snazzy report. To do this, we're going to represent our correlations as a triangle instead of this whole table. We don't need to show all these duplicate values; they don't add anything, and they just make the whole thing look really busy. So my goal is to hide half of this table.

To accomplish this, I'll create an array which will help me filter out the values that I don't want to show. I'm going to call this filter array mask, and I'm going to set it equal to an array that's identical in size to the correlation matrix we've got up here. The module that will help me do this is called numpy, so I'll have to add it to my notebook imports at the top: import numpy as np. I'll hit Shift+Enter to import the module and scroll back down.

Next, I'll use the zeros_like function from the numpy module, np.zeros_like(). This function creates an array of zeros that is shaped like whatever array is passed in as a parameter; in our case, that's the return value from calling the corr() method on our DataFrame. Let's have a look at what this mask array looks like at the moment. Hitting Shift+Enter, we can see that we have an array of, well, just zeros.

Now I need to make another modification to filter on the values in the top triangle. First, I need to know the indices of those cells in my array, and thankfully there's another numpy function that will find these for me. I'll say triangle_indices, which is going to hold on to all the indices in the top triangle of my array, and set it equal to np.triu_indices_from(), passing in my mask. This retrieves the indices for the top triangle of the array. Now that I've got my indices, I can use my mask array to select just those cells and change their values: mask[triangle_indices] = True. Let me show you what our filter looks like now. Typing mask and hitting Shift+Enter, you can see that the top triangle of the array has the value 1 and the bottom triangle has the value 0. That's because True maps to the numeric value 1 and False maps to 0.

With this in hand, I can now move on to creating the beautiful visualization I keep talking about. We're going to use our old friends Seaborn and Matplotlib to accomplish this. The first thing I'll do is set the size of our figure: plt.figure(figsize=(16, 10)). Then I'll use Seaborn's heatmap function to generate a heat map of our correlations. We imported the Seaborn module as sns, so I can put a dot after it and write heatmap, and within the parentheses I'll provide our correlations, the value returned by calling the corr() method on our DataFrame. So it reads sns.heatmap(data.corr()), and then I'll show the plot with plt.show(). Let me hit Shift+Enter to see what this looks like. Voila! Look at that, we're almost there.
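To keep everything in one place, here is a minimal sketch of what this cell builds up, assuming the DataFrame is named data as in the earlier lessons:

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Correlation matrix of all the features in the DataFrame
    corr = data.corr()

    # Array of zeros the same shape as the correlation matrix; setting the
    # upper triangle to True stores 1s there, which seaborn will later
    # interpret as "hide this cell"
    mask = np.zeros_like(corr)
    triangle_indices = np.triu_indices_from(mask)
    mask[triangle_indices] = True

    # First pass at the heatmap, with no mask applied yet
    plt.figure(figsize=(16, 10))
    sns.heatmap(data.corr())
    plt.show()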
What we can see already is that the different colors carry a lot of information: strong positive correlations have a dark red color, strong negative correlations have a dark blue color, and anything close to zero is pale or white. This color scheme conveys quite a lot on the visualization front already, which is really neat.

Now, if you're having trouble reading what it says down the side and along the bottom of this chart, we can increase the font size of those labels with plt.xticks(fontsize=14), and do the same for the y-axis with plt.yticks(fontsize=14). Hitting Shift+Enter, we see it update, and now it's a bit easier to read.

Now it's time to add that mask we created; we want to hide the correlations on this chart that are duplicates. Coming back up here, inside the heatmap call we're going to add another argument: mask=mask. This might look very confusing, but the mask on the right refers to our variable from the cell above, while the mask on the left is the name of the keyword argument in this function. Let me hit Shift+Enter and show you what this looks like. Voila! Now we've effectively hidden half of our chart.

Let me modify this even further. I'm going to add the actual values of our correlations onto the heat map, because I want to display these numbers on the chart alongside the colors. So I'll say annot=True and hit Shift+Enter, and now you'll see the values of the correlations displayed in the heat map. Of course, by default these numbers are really small and difficult to read; I don't know why, it's just how it is. We can increase their font size with another keyword argument: annot_kws={"size": 14}. The value of this annot_kws argument is given as a Python dictionary, and you can always spot Python dictionaries by the curly bracket notation with one or more key-value pairs inside. Here the key is the string "size", the value is 14 (the font size of our annotations), and the two are always separated by a colon. Let me hit Shift+Enter and update the heat map. Voila, brilliant!

Now, the only thing I find a little bit strange is that the background here is not all white; I expected the styling to be a little different, with a white background instead of this gray. If you're also seeing something unexpected like this on the styling front, you can always set Seaborn's style manually with sns.set_style() and the name of a style. I'm going to go with "white". Hitting Shift+Enter, that line of code should force the background color to be set to white.
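Here is a sketch of what the finished cell looks like with all of these tweaks applied, again assuming the DataFrame is called data and that mask was built in the cell above:

    plt.figure(figsize=(16, 10))
    sns.set_style('white')                 # force a plain white background

    sns.heatmap(data.corr(),
                mask=mask,                 # hide the duplicate upper triangle
                annot=True,                # print the correlation value in each cell
                annot_kws={"size": 14})    # font size for those annotations

    plt.xticks(fontsize=14)
    plt.yticks(fontsize=14)
    plt.show()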
But you know, the thing is, all in all, writing this Python code with the mask and Seaborn and the heat map is kind of the easy part. The much harder part is making sense of what it is that we're actually looking at. What can we learn from this correlation matrix? First off, we said we were going to be looking at two things: strength and direction.

An example of a strong positive correlation would be something like NOX and INDUS. The INDUS feature measures the proportion of non-retail business acres per town, and the NOX feature measures the nitric oxide concentration in parts per 10 million; at least, that's me reading it off the documentation on the feature descriptions. These two features have a correlation of 0.76, so the question is: does this make sense? And I think, yeah, it does. I would expect pollution to be higher in industrial areas; the amount of industry and the amount of pollution should be positively correlated. Looking at this table a little more, you know what I found quite interesting? The correlation of TAX and the industry variable. Higher tax levels are apparently associated with more industrial areas, which I actually found quite surprising. Coming across these kinds of relationships is why the correlation matrix is a useful tool for data exploration.

But there are, of course, as with everything, some limitations. Looking at this heat map, we can see that the highest correlation of all is between TAX and RAD, the accessibility to radial highways. This is a positive correlation of 0.91, which seems super high. Now, remember how we looked at the documentation of this correlation function? We went up here, hit Shift+Tab, and learned that the default method for calculating the correlation is the Pearson method. It turns out that one of the things you have to know about this type of correlation is that it makes some assumptions about the kind of data it's running on: the calculation is only valid for continuous variables. That means it's not valid for, say, a dummy variable like whether a property is on the Charles River or not, because that's not a continuous variable; it only has two values, 0 and 1. And looking back up at the histogram we created for accessibility to radial highways, we can see that this is not a continuous variable either. This feature was an index, if you remember, which means our correlation calculation is actually not valid for the RAD feature. This goes to show that it's very important to know how the individual features are measured, what units they're in, and what the distribution of the data looks like, because we can only use statistical tools that are appropriate for the kind of data we're working with.

Okay, so let's look at the last row down here, the row that reads PRICE, which is our target value. On this row you see the correlation of every feature in our model with the price, our target. One of the things I'm interested in here is the features for which we don't find a relationship, the features whose correlation is close to zero. The lowest correlation, of course, is with the Charles River dummy variable, but as we've just said, CHAS is a dummy variable with only the values 0 and 1, so the correlation measure isn't appropriate there. What about the next lowest one? That's the one called DIS, which is defined as the distance from employment centers. Now that's interesting: DIS is not very correlated with price, but DIS is very highly correlated with the industry feature. Looking here, we see a correlation of -0.71 between DIS and INDUS. The reason, I suspect, is that many industrial areas are probably employment centers, so being far away from an employment center is associated with a low amount of industry. And this discovery adds something to my to-do list for the regression analysis stage.
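If you ever want to double-check an individual pair from the heat map, pandas can compute a single Pearson correlation directly. A small sketch, assuming the Boston housing column names used in this lesson (NOX, INDUS, TAX, RAD):

    # Pearson correlation between two individual columns
    print(data['NOX'].corr(data['INDUS']))   # roughly 0.76
    print(data['TAX'].corr(data['RAD']))     # roughly 0.91, but remember that RAD
                                             # is an index rather than a continuous
                                             # variable, so treat this number
                                             # with suspicion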
What we should probably do is check whether our distance feature adds explanatory value to our regression model. In other words, does having both the industry feature and the distance feature included in the regression make our model better, or can we get away with just having the industry feature? Because the thing is, if a feature is not adding any explanatory value, it's often better to exclude it and try running the regression without it. By excluding features you might end up with a simpler model, and simplicity is usually a good thing.

Okay, so where does this leave us? The correlation matrix is no silver data-exploration bullet; it may not answer all our questions, but it can give us a bit more perspective. The correlation matrix has its pros and cons, its strengths and its limitations, just like every other tool. Regarding the pros: we've learned something about our data, namely that the amount of tax and the amount of industry are correlated, and we've added something to our to-do list for later, namely that we should investigate whether we really need the DIS feature in our model. Another pro is that we've learned that certain features with high correlations are possible sources of multicollinearity. I emphasize the word possible, and this is another thing for our to-do list: high correlations don't necessarily imply the problem of multicollinearity, but we will revisit this issue during the regression analysis stage by running a formal test for it.

We're also learning a few things about the weaknesses of looking at correlations. For example, we've learned that the correlation calculation assumes continuous data: the Pearson correlation we've looked at is not valid if the data is not continuous, as is the case with our accessibility index or our Charles River dummy variable. A second limitation, which everybody likes to harp on about, is that correlation does not imply causation. Just because two things move together doesn't mean that one thing causes the other. In other words, everybody who drank water in 1850 is now dead, but this doesn't mean that drinking water will kill you. In fact, if you look at enough data and you look hard enough, you'll find all sorts of weird correlations out there. Just Google "funny correlations" or "spurious correlations" and you'll find a bunch of great examples of completely unrelated things that move together purely by chance. If you do this, you'll probably come across Tyler Vigen's website, which uses census data and data from the US Department of Agriculture to show that the divorce rate in Maine and margarine consumption are in fact highly correlated. So the earlier chart of mine showing a zero correlation between these two things was in fact a lie; Tyler's chart shows how it actually works.

Now, another limitation of correlations is that they only check for linear relationships, and it turns out that a low Pearson correlation coefficient does not mean that there is no relationship between two variables. Let me show you some examples so you can actually see what I mean. Here's some fictional data on a chart showing x and y values: x and y have a correlation of 0.816. Now let me show you a different chart. This is some more fictional data, and the correlation between x2 and y2 is in fact also 0.816. On this third chart, you guessed it, the correlation is also 0.816, and the same goes for the fourth chart: x4 and y4 also have a correlation of 0.816.
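As a quick aside that isn't in the lesson's notebook: these four datasets happen to ship with Seaborn as a built-in example, so a short sketch can reproduce both the scatter plots and the matching correlations yourself:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # The four famous fictional datasets, bundled with seaborn as "anscombe"
    anscombe = sns.load_dataset("anscombe")

    # One scatter plot (with a fitted regression line) per dataset
    sns.lmplot(x="x", y="y", col="dataset", data=anscombe,
               col_wrap=2, ci=None, height=3)
    plt.show()

    # Pearson correlation of x and y within each dataset: all roughly 0.816
    for name, group in anscombe.groupby("dataset"):
        print(name, round(group["x"].corr(group["y"]), 3))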
In fact, these four graphs are very famous: they're called Anscombe's quartet, named after the English statistician who came up with them. The four graphs have very, very similar descriptive statistics and a very similar regression line, but of course they're showing us completely different relationships. They demonstrate that outliers and non-linear relationships often only become apparent after visualizing the data. And here's what this implies: it's important to look at correlations and descriptive statistics in conjunction with some charts. With this in mind, we're going to be complementing our analysis of the correlations with some more graphical analysis; that way we can discover whether there are any hidden non-linear relationships or outliers in our data. As such, we're going to be visiting our old friend again: the scatter plot.

But before we move on, I can't resist showing you this infamous comic strip from xkcd. If this is the kind of humor that appeals to you more than you'd care to admit, then I highly recommend subscribing to xkcd's RSS feed to get your dose of geeky web comics on a regular basis. I'll see you in the next lesson. Take care!