Correlation and Regression in Excel 2016

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hello there this video will show you how to do the first part of your fourth Excel assignment which involves running a correlation analysis and creating a correlation table and I'll show you how to do that in APA style also how to generate a regression equation to make a prediction about a single dependent variable based on a single independent variable and then also how to create a scatter diagram that has your regression line in it there'll be a separate video to show you how to use pivot tables and the chi-square function to do a chi-square analysis but for now we're just going to focus on basic correlation and regression so in this assignment of having you look at a study that examined how somebody's income level and the perceived income level of their social circle has an impact on perceived wealth inequality in the United States also on how likely they are to be willing to redistribute wealth in terms of political policy you know helping out the poorer less fortunate people in the country and then also they looked at their political leanings liberal versus conservative so I encourage you to read the study overview on your own and also look at the journal article that accompanies this assignment it's really interesting stuff so let's quickly take a look at the data just to see how you're going to end up answering all these fun questions so if we look at the data here we've got our subject ID that's our first column so as you typically see in Excel each row represents an individual and this is all their data and then each column represents a variable that was measured along that individual so we have their household income then the average or the mean of their social circles estimated incomes they were asking or what have urghhh how much do those in your social circle make agir they were also asked to estimate the population income the mean population income at large so for everybody in the united states they also have some really unique ways to measure these two things but i'm going with the most simple version of this data so if you read the methods section you'll see that was it all kinds of stuff with these two variables that i'm keeping it simple here also you'll see that they asked them about the fairness and how satisfied they were with the current distribution of wealth in the country and those if you just look at the little common here there was two items so these numbers down here represent the mean of those two items for each person each item was rated on a one to six scale with higher scores indicating more perceived fairness and more satisfaction with the current distribution of wealth you also see this redistribute income variable here and this was the mean across four items that were rated on the one to six scale or higher scores indicate being more in favor of you know social welfare and distribution of income and resources then here you'll see political preference they were simply asked to rate how what their political leanings were on a scale for more an extremely liberal to nine extremely conservative and then I use these ratings to create a political category so anybody that had a rating of one two three they were labeled liberal four to six moderate and seventy nine on this political preference variable they were labeled conservative you also have their sexes male or female and then their age as well so before you can jump in and start doing all the fun things with correlation regression make sure that you have your data analysis toolpak installed so you want to go up to file then go down to options go to add-ins over here on the left then you want to look for analysis toolpak go down here where it says go click that make sure analysis toolpak has a check box check mark next to it and then click OK and then when you go to the data tab you'll see that there as an option so let's go ahead and take a look at the assignment so we just installed the tool pack now we're going to run a correlation analysis for all numeric variables and let's be specific with that here all numeric variables because you can't really run the same type of correlation with non numeric data we're going to stick with Pearson's R linear correlation among numeric variables we're also going to create an apa-style correlation table I'll show you how to do that step by step okay so let's go ahead and go under data the tab click data analysis and actually real quick I want to talk about a couple things with correlation because I notice something here so with correlation it's kind of finicky in Excel notice that this says none missing if you have anybody with missing data in your data set and you try to run a correlation analysis it will not run for you so I had to go through and clean this data to get rid of any participant that have any missing data for any of these values so you'll notice when we compare our correlation table to the correlation table presented in the article there are a couple of slight differences and that's because they probably ran it in SPSS where you don't have to have everybody having complete data to do that so we have a few less participants in our data set than what they ran because we got rid of anybody with missing data another thing that's important to note if your variables are not adjacent it will not run the correlation analysis so I want to include age in this analysis but age is kind of separated from all the numeric variables by these two categorical variables so in order to run the analysis with age in it I'm going to just right click here and click insert and then I am going to grab age just click the K right click cut click the H right click paste so now I've got age right next to all my other numeric variables so again for correlation and regression in Excel you can't have any missing data and if every variable but you want to look at needs to be adjacent so now we can do this all right data analysis find correlation and then click OK now we're going to select our input range here and we're basically just going to highlight all of the labels and the values for all the numeric value variables in our data set let's go ahead and do that you can click and drag to do it or you can click the top left so for the data you want to highlight and then scroll all the way down and go to the bottom right hold the shift keys I'm holding shift on my keyboard and I click it and it just highlights everything from that point to this point which is really handy if you just want to click and drag to highlight you're welcome to do that as well all right groups by columns yep every variable has its own column labels are in the first row and I want to make a new work sheet so I'm just going to go ahead and label this correlations so correlations all right so now let's just click OK and you'll see this new worksheet pops up at the bottom here I kind of like to have my data at the beginning so I'm just going to click this and scoot it over so your little black arrow is you don't have to do that that's just a personal preference but now you know how to move tabs around in your worksheet now this is a hot mess we got to fix this before I even want to look at it so I'm going to right click make sure everything's still highlighted go to format cells we're going to round these values to the hundreds place because that's typically how we report correlations being flow though okay here we go number two decimal places click okay it's a little better but it's still cutting off some of the words right so if we go to the Home tab and then go to format we want to autofit the column width so that these cells will hug the data that's around it now we can actually see all the things so if we look at this let's go ahead and turn this into a very beautiful APA style table and we'll go ahead and do that within Microsoft Word so let's go ahead and copy this ctrl C to copy or you can right click and hit copy let's go ahead and go into our assignment control vo let's hit backspace it read that number to ctrl V to paste and look at this this is crazy so let's go ahead and do something really quick to see if that'll this will be fixed so what we want to do we don't want to have the full labels across the top here and for now so it'll fit let's just click on this little four corner arrow up here go to layout autofit auto fit window so now it's shrunken so that we can see everything so let's go ahead and clean this up a little bit so we're going to put a 1 dot period household income and then two and you'll see what I'm going to do this here in a minute hope not 36 the heck yeah they're alright four five six and seven for age okay and now what we can do instead of having all these things labeled up here just delete those one two three four five six well we've got a little crazy there six and then seven here so now I'm starting to look a little bit better um this is kind of ugly as well so maybe just get rid of the little underscore so just household income social circle mean income you know just get rid of those underscores can't really see them right now because of all the red that's happening okay income distribution so maybe fairness and satisfaction oh that's my ice ball satisfaction satisfaction with wealth distribution we'll just abbreviate that redistribute income favorability of redistribution of income political preference is pretty straightforward and then we've got age as well now there's a couple of things that I want to point out before we start to really form at this table and make it even more beautiful if you notice along this diagonal here you just see once and that's because of course household income is perfectly correlated with household income because it's the exact same variable so you can keep those ones there or you can just get rid of them typically that's where you would put something called cronbach's alpha which is a different type of statistic that you compute when you have a multi-item scale that you're using kind of like the redistribution fairness favorability redistribution item and cronbach's alpha just tells you how well all the items on that scale are correlated but I'm not going to make you do that so you just leave it blank also you can just get rid of this column 7 just highlight it and then backspace to get rid of it if you hit delete it won't do anything all right we're almost there we almost have a really beautiful perfect AP style correlation table but we're not quite there just yet so I'm going to hit ctrl I done italicize these and now we want to label this table so if we're in the top row under the layout in table tools you can hit insert above and we want to merge those cells so let's do that and then we want it to be left aligned here so you just click this right here in your layout then we're going to say table 1 and then metallics 0 order correlation coefficient for all variables in the study all right now at this point we don't have any indication of whether or not any of these are statistically significant now our sample size in this study is look so it's 296 it goes down to 297 that's because our first row are the variable labels and the critical value for that is point 1 1 so you'll see that reflected on your assignment right here so what I want you to do next is just go through and bold all of the correlation coefficients that are statistically significant so anything with a value of 0.1 1 or higher or negative 0.1 one or lower is significant so go ahead and pause the video and do that all right now once you've got that accomplished you can go back to the bottom row of your table under the layout tool and click insert below merge those cells left-aligned it right here and then in italic so control I for italics capital in and then OTE for note on the towel size bold coefficients are significant and then we'll put cap control I for italics P less than 0.05 because that critical value I gave you is a two-tailed test with an alpha level of 0.05 now look at this we see the grid line that's cuz I have you grid line selected you may not see that on your end but if you're under layout you've got view grid lines here let's go ahead and make sure that's unselected so we can see what this is actually going to look like now the lines are almost good but we need to mess with them a little bit to make it APA style so let's go ahead go to the bottom left the last row here where we put our note and wait until you get a little right pointing white arrow and click it now highlights that whole row go to the design tab under table tools and make sure that we have a bottom border and then we get rid of our sorry that we have a top border but we get rid of the bottom border and then here do the same thing to the top row white arrow and let's get rid of that top border there I clicked it twice to make it go away and technically there's supposed to be a space right here as well alright now we have a gorgeous APA style correlation table there we go beautiful and if you want to just because it might be bothering you like it is me you can put little dashes here if you want and let's go ahead and Center our values as well so we want to highlight all the numbers like that go up to layout and Center those now we're cookin gorgeous and your table should also have all of the significant correlations in bold so for instance this one is definitely significant it exceeds that critical value of 0.1 one so I can hit control B to bold it or I could just click bold up here under the Home tab to bold it make sure you do that on your own now I want you to look at these and pick out which relationships are not statistically significant so I'll give you one answer and then whenever that happens they try to automatically do a3 just hit stop Adam and automatically creating numbered lists and they'll stop it for you all right so the first one I see that's not statistically significant is right here and all you have to do to read this correlation table is to figure out where the two variables are connected right so if we look at the number this is variable 1 which is household income and then age this one is a relationship for one household income and political preference and so on and so forth so I know this one is not significant it doesn't reach that criteria for significance down here so I would put age and household income on number two for one of my answers age and if you want to be lazy you can just highlight household income ctrl C to copy put it down here ctrl V to paste and there's one of your answers for that if you want to make my life easier you can even highlight this under home and you can bullet those answers and then just hit enter and give me your next answer in there all right now next I want you to pick a significant relationship to interest you so go back up to the table find one of the bold values that's interesting to you what two variables are kind of interesting so for me if I did and you can't do this one because I am if I looked at household income and social circle mean income right this one right here I would put household income social circle mean income for a and B and then the correlation coefficient would be 0.49 and if I wanted to explain that relationship in plain English since its positive and it's also strong and significant I could say those who have a higher household income tend to have friends with a higher household income or at least they think they have friends with a higher house conversely remember positive relationship X&Y change in the same direction so you can have higher x times to relate to higher Y or lower X attends release lower Y you could say those with a lower household income tend to perceive that their social circle makes less money be really careful when you're explaining these relationships in plain English always avoid statistical terminology be specific higher higher lower lower higher lower well however it goes in terms of direction and don't overstate the causality of that relationship you wouldn't say those of the higher household income have friends with a higher mean income they tend to have friends with a higher mean income all right number four so what is the correlation coefficient between estimated population mean income and household income so you'd need to go up here household income is number one so you could just look here and then find population mean income and there's that correlation then I want you to explain that relationship in plain English so before we move on to the regression analysis I want to take a look at this table and see how it matches up with the journal article for the study where the data came from so let's go ahead I'm going to just do kind of a split screen situation here and see how all that works make this little bit all right now you're welcome to look at um let's see what page is it if you 1393 up here but page 5 of 12 on your data or on your article and kind of follow along with me here let's see if we got something similar to what they got now remember there they had a slightly different sample in theirs because I had to delete everybody who had missing data and they did not so let's take a look at this social circle mean income and household income 0.48 up 0.49 right we've done here population mean is income and household income point 1 9 pretty close with 0.18 we've got these two I did include so they were a little bit complicated so let's skip to fairness and satisfaction with household income huh that's exactly the same support for redistribution in household income negative 0.21 and then political preferences in household income 0.15 I also included age in the scenario so just because I thought it was interesting and notice that your table looks like this table here the only difference is that they distinguish between these different levels of significance and they use asterisks instead of bold they also included their mean and standard deviation here as well which I didn't have you do but if you look here whenever you're trying to interpret a correlation table in a journal article this is what it's going to look like and you see you've got the numbers here and then they're reflected here they don't have anything here they didn't even give you cronbach's also for multi-item scale scores shame on them but anyways you see this right P less than 0.05 means that it's significant with an alpha of 0.05 to Astrix means it's significant with an alpha level 0.01 and 3 means it's significant with an off level of point zero zero one so the more asterisks you have the less of a probability there is that that relationship is just due to chance but for our purposes we just stuck with one alpha level 0.05 and we bolded them which is totally fine as well just want to show you how what you created is exactly what researchers are doing in the real world with their data all right now we're ready to move on to regression so let's go ahead and go to excel and we are going to predict estimated population mean income based on household income so in this case our independent variable our X variable is going to be their household income and that's going to predict Y hat the estimated population mean income so the theory is that if you make more you're going to assume that other people in the country make more as well you're going to be less attuned to the struggles of those with less money that's kind of the theory of the paper so let's go to data analysis make sure the data tab is selected up here data analysis and then go down to regression and click OK and I already did this before so it's already programmed but let me get rid of this alright so make sure that your input Y range is the thing you're trying to make predictions about your Y hat so to speak your dependent variable and in this case we think that the estimated population mean income depends on your household income so our Y range our Y variable is population mean income just make sure your cursor selects that box click and scroll down including the label and the values below it so notice the label is included here as well and make sure that little box next to labels is checked now I want you to click in the box next to input X range that's our independent variable that's our predictor and we think that household income is the predictor of population mean income estimations so with that box clicked I'm going to click household income I'm just going to do it this way scroll down hold shift and click the last value in my data set and highlights everything in between labels are selected make sure you get a new worksheet and type in regression and that will pop it up down here we're ready I'll give you a second to type alright let's head okay okay again I'm going to scoot this down I'm bothering me now these are kind of messed up values right now there's some a messages and some weird stuff so what we need to do is right-click make sure all the things are highlighted go to format cells number and let's go ahead and round to three decimal places and click OK you also want to make sure that you're able to see everything here some of the stuff is cut off so go up to the Home tab go to format autofit column width we're golden let me show you a couple things in here before we figure out what the regression equation is and make some predictions with that so this is your multiple our since you only had one correlation coefficient this is the same as your raw correlation and if we go and we look at that correlation between household income and population mean income its 0.18 here is 0.18 5 but it's pretty much the same our squared is your effect size so what this tells you is that 3.4 percent of all the differences in people's estimated population mean income 3.4 percent of all the differences in those estimates of the population mean income are attributed to their household income the adjusted r-squared accounts for sampling error we won't talk about that here's your standard error here's how many people were in your data set this is the analysis of regression that you learned about to see if there is actually a significant relationship or in fact if household income is a significant predictor of assumed population mean income and you see here that it is right this is less than 0.05 this F value is rather large as well this is the most important piece really this is the most important piece of the regression output and this is what tells you how to interpret the regression equation and what values to plug in so your intercept is the same as a the last part of your regression equation and then this is your B this is your coefficient for household income so if we just let's see just copy this piece ctrl C to copy put it into my control the paste there's our coefficients and what we could do is we could write out our regression equation as instead of writing Y hat you want to want write predicted mean household income in the population that's our Y hat equals remember the coefficient for household income is point zero six seven times whatever their household income is plus our constant our intercept which in this case is fifty-five thousand three hundred thirty three point one two one that's our regression equation so with this equation we can plug in anybody's household income and make a prediction of how much they think everybody in the population makes on average I'm going to go ahead and copy this ctrl C and paste it into my worksheet to show you how to use Excel to to rate those predictions so in the assignment I asked you using this equation predict the estimated population mean income for all these different values so 40,000 year one hundred forty thousand dollars a year and two hundred forty thousand dollars a year so if we say household income equals and we put let's see forty forty thousand yep forty one two three don't put a comma in there just put it a forty thousand one hundred and forty one two three thousand and two hundred and forty one two three four thousand predicted mean top income just kind of abbreviate it there I'm going to do a little double click action so woah that's too much of a double click will shrink it down with this equals so we're going to use the formulas in Excel to generate these predictions without having to actually bust out a calculator and the nice thing is once you do it once you can copy the formula across for all those different values so check this out equals remember the equation up here right so we're going to put and we want to put this whole thing in parenthesis so a left bracket so equals left bracket click on the coefficient for household income hit f4 to lock it in so that it is fixed no matter which cells we move to with our formula then put a little star for multiplication and then we're going to multiply it by the household income right here forty thousand close it with a right bracket so that's this piece right here the coefficient for household income times the actual household income we want to make predictions with and then we're going to do plus the intercept or the constant which is right here let's just click that hit f4 to lock it in and then hit enter now you can click this go down to the bottom right corner with the little plus sign drag it over what the heck let's see there we go double click it so it's big enough and just turn into weird numbers because it was such a big number and I'm also going to right-click these make sure they're all highlighted right click format cells let's make these currency and round to two decimal places that looks a lot nicer now check this out remember the regression equation is linear so for every one dollar increase in household income we expect their predicted mean household income to go up by point zero six seven so if you think about this this is constantly increasing by a hundred thousand dollars and if you take one hundred thousand times let's put it equals here equals one hundred thousand times point zero six seven you should see that for each hundred thousand dollar increase in household income you're expecting their predicted mean household income in the population to go up by sixty seven hundred dollars so let's see if that's the case right because this is one hundred thousand dollar difference it's one hundred thousand dollar difference one hundred thousand times point zero six seven is sixty seven one hundred so if we do this equals this minus this we it that's okay equals this minus this and we get basically within rounding error that's $6,700 neat I'm gonna get rid of this I just did that to show you how that works okay so now you have the values that you need to plug in to your assignment right here to see those predictions play out now I'm going to show you how to create a scatter plot of the raw data and make a regression line so we're going to go into Excel go back to your data and yeah we want to make sure that the two variables are adjacent so we're looking at population mean income and social circle mean incomes just right click here and click insert then click that little e to highlight all this hit control X or right-click cut go here control V or right-click paste and I'm just going to delete this so now these two things are next to each other so go ahead and highlight the B and C column your two variables do you want to make a scatter diagram for go up to insert you're going to choose the scatter go ahead and choose the one that looks like this first and now you can kind of see what's going on here right now this is not very beautiful so let's make a couple edits to this let's go to click layout and let's see if this looks a little better um nope we're going to do it manually so go to add chart element access titles primary horizontal add chart element access titles primary vertical here we're going to change this to estimated population mean income and down here for this axis title just click it and highlight it household income let's also get rid of this population mean income thing right here and so now let's make it a little bit bigger too so we can see all those values along the bottom that's better now if we look at this and we need our regression line so under design chart tools add chart element check this out trendline linear there's our regression line let's go ahead and copy this and paste it into here and take a look at it so this is neat so here's our X here's our Y if we go in just real quick too ferm this regression line let's see if it works out so if we go to our regression output for somebody that made 240 grams last year we predicted that they would estimate the population mean and come to be about 71 so if we go here here's about 240 and look that's about 71 right that's where that line is we can also see that this is a moderate positive relationship it's pretty spread out but we can tell that it's going upward and you also notice that regression lines always in the middle of your dots all right so I hope that this helps you understand correlation and regression a little bit better and also talk to you some really useful tools in Excel for how to generate that without doing all those fun hand calculations
Info
Channel: Statistics with Dr. M. and Dr. B.
Views: 55,341
Rating: 4.8679867 out of 5
Keywords: correlation, excel, 2016, regression
Id: Ub9fsKot62c
Channel Id: undefined
Length: 34min 5sec (2045 seconds)
Published: Mon Apr 03 2017
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.