Introduction to Correlation & Regression, Part 1

Captions
Hey folks, it's Alex here, and I'm here to talk to you today about linear regression and correlation analysis. We've had some discussions about that in previous screencasts, so this should be dovetailing with some of those other screencasts and some of the other information. But I wanted to specifically talk today about how we can, in some ways, visualize the relationships that we're trying to either describe or estimate, what some of the calculations look like in terms of the math behind them and what that means, and also how we can use that kind of information to better interpret the results that we get when we run these analyses with statistical software programs. So hopefully this will be a very helpful set of slides and descriptions that we're going to go through.

I have some pretty straightforward goals for our screencasts on correlation and linear regression today. After working through these with me, you should be able to calculate and interpret the simple correlation between two variables; determine whether the correlation is significant; calculate and interpret the simple linear regression equation for a set of data; understand the assumptions behind regression analysis; and finally determine whether a regression model is significant.

Now, I have some other goals which, if we have time, we'll get to, although I can't guarantee it, and perhaps this will be in a different set of screencasts. There are also ways that we can look at interpreting confidence intervals, ways that we can recognize different applications for prediction and description, some ways that we can identify potential problems in using a regression analysis, and then what we do if we have nonlinear relationships. For the most part today, though, we're going to stick with the basics.

One of the first ways to think about what a correlation in particular does is to think about visualizing the data using a scatterplot, and it
usually shows the relationship or the association between two different variables. Now, when we look at a scatterplot, we're really visualizing a correlation analysis, which measures the strength of association, or the linear relationship, between those two variables. We are only concerned with the strength of the relationship and its direction, whether it's positive or negative, and, quite frankly, I'd also add the statistical significance of the relationship. We are not trying to apply any sort of causal effect or implication to that. We are not trying to say that there's a causal relationship.

So what does it look like? Well, if you have a linear relationship, you can see on the left over here there are some really interesting forms. Usually the points will group together either in a positive way or in a negative way. These are simply the variables on the x axis and the y axis. For each case, you plot the point where the value for one variable, X, and the value for another variable, Y, intersect. So all of these show the plotted points where one variable and another variable intersect for a given set of data. Linear relationships are simply relationships that you can plot a straight line through, whereas curvilinear relationships are those that have curved lines: as the X variable increases, Y increases at first and then shifts and decreases at the end.

We can also think about the strength of the relationships. It essentially means that the tighter the plotted intersections between these two variables are, the stronger the relationship is. You can see that these are narrower, tighter plots than these over here on the weak side, which are much wider distributions of the scores on the scatterplot. And no relationship is when essentially you have a flat line, right? One of the variables does not change significantly or
at all when the other variable changes. That can be a flat line or it can be just sort of an amorphous blob, but that's usually what it looks like.

So we've got an idea of what a linear versus a curvilinear relationship looks like, and remember we're focusing on linear. We have an idea of what a strong relationship and a weak relationship would look like among linear correlations or linear regression. And then we have an idea of what no relationship looks like: it's when there's no variation on one of the variables' axes when we have variation in the other, or we can't predict it; we have no way of knowing where the points will fall.

Okay, so what is a correlation coefficient? Well, we can talk about the population and we can talk about the sample; we've done that before. We're often trying to get a statistic as a way to either describe the sample or infer something from it. If we're looking at a population correlation coefficient, rho, it's going to measure the strength of the association between those two variables in the population. But what we are more often doing is getting the sample correlation coefficient, which is simply an estimate of rho and is used to measure the strength of the linear relationship in the sample observations. We often designate that as r.

Now, there are some features of rho and r, and remember we're thinking about r in particular, that are unique. One is that they're unit free. It's a set of numbers that ranges between zero, I'm sorry, between negative one and positive one, but there's no unit associated with that. There might be some when we interpret it, but the values themselves of rho and r are unit free. They do range from negative one to positive one. The closer to negative one, the stronger the negative linear relationship, and the closer to positive one, the stronger the positive linear relationship. And then, you might have guessed, the closer to zero, the weaker the linear relationship is.

So let's take a look at what that looks like. If we have a perfectly negative relationship,
in other words r equals negative 1, here is what that plotted relationship looks like: every point falls exactly on a downward-sloping straight line, so for every one unit of increase in X we have a fixed decrease in Y (one unit, if both variables are standardized). That's if it's perfectly negatively associated. And if we went to the other extreme and looked at a perfect positive relationship, this is it: a straight line. For every one unit of increase in X we have one unit of increase in Y, again in standardized units. You'll notice that the way I'm talking about the relationship uses the term unit: one unit of increase in X associates with one unit of increase in Y, for example, in this positive scatterplot. But the value itself, this plus one, is unit free.

And then if we wanted to look at no relationship, here we go: r equals zero. There is the flat line that can be plotted. In other words, for every one unit of increase in X there is an estimated zero units of increase in Y.

And then what does it look like in between? For example, if we had a plus 0.3 for our correlation coefficient, it could come from this kind of a grouping. This would be kind of weak; you can see how wide this distribution is. But there is a straight line that can be plotted through it, and for every one unit of X you have 0.3 units of Y increasing on average, with both in standardized units. For our sample of a negative correlation, we have r equals negative 0.6. Now this is a little tighter, and so this coefficient is a little stronger, a little larger in magnitude, but it is negative. And so you can see how, when we plot these, we can visualize the association that the r coefficients represent.

Now, if we wanted to calculate it out, it really is less complicated than it looks in this equation. This is simply looking at a couple of different relationships. This is r. You know that sigma stands for the sum, so this is adding. This right here is an individual score for the variable X minus the mean of that variable, and this is an individual score for the variable Y minus the mean of that variable. So you're looking at the differences between each
individual score and the mean for both variables, multiplied by each other and summed. And then on the bottom here you're simply taking the square root of the summed squared deviations of X times the summed squared deviations of Y, another way of thinking about variation between each individual score and the mean for that variable. And if you wrote it out at a little more length, you would get this algebraic equivalent.

Now let's do an example, an easy one. We're not even going to talk about education; we're just going to talk about a straightforward calculation example. Say we wanted to look at the relationship between the height of a tree and the trunk diameter of a tree. Here's a tree, just so you remember what we're talking about. Let's say the tree height is Y and trunk diameter is X, and we have all the values for tree height listed here and all the values for trunk diameter listed here for a sample of one, two, three, four, five, six, seven, eight trees. Okay, and we're simply going to calculate that. If we wanted to write it out, we have Y, we have X, we have X times Y, we have Y squared, and we have X squared, and then we have the sums of each of those down here at the bottom.

And if we wanted to, we can plot it. So for example, Y is 35 and X is 8, so that's that one case right there at the intersection. If we plotted all of them, this is what we get. If we calculate out the equation using the simple table that we set up here, we get this, and if you solve it, it's 0.886. This is a relatively strong positive linear association between X and Y. If we plotted that, it would look like that. That's all it is: a measure of the relationship, its strength, its direction, and eventually we'll get to its statistical significance. So we know that this is positive, we know that it is relatively strong because it's very close to one, and when we look at it, it looks like a tight linear distribution.

Now, if we wanted to, we could calculate this using Excel or SPSS or many other different ways, and the way that it usually looks is they'll have a table
where each of the variables is listed in both the rows and the columns. Obviously each variable is perfectly associated with itself, right? And then where you have the correlation between the two variables, there it is: there's the coefficient that we had right back here, the same one.

Now, how do we think about significance testing for that? We're going to pause, and we'll come back to it in the next screencast in this series.
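
The tree calculation described in the captions can be sketched in Python. This is a minimal illustration, not the presenter's own worksheet. The captions do not record the slide's table, so the eight pairs below are assumed values, chosen to be consistent with the one case that is mentioned (Y = 35, X = 8) and with the reported coefficient of about 0.886. The function names `pearson_r_deviations` and `pearson_r_sums` are hypothetical. The first follows the deviations-from-the-mean form; the second follows the longer algebraic equivalent built from the column sums.

```python
import math

# Assumed data (not recorded in the captions): eight trees.
tree_height = [35, 49, 27, 33, 60, 21, 45, 51]      # Y
trunk_diameter = [8, 9, 7, 6, 13, 7, 11, 12]        # X


def pearson_r_deviations(x, y):
    """r from deviations: sum((x - mean_x)(y - mean_y)) over
    sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return numerator / math.sqrt(ss_x * ss_y)


def pearson_r_sums(x, y):
    """r from the column sums of the table: sum of x, y, xy, x^2, y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r_dev = pearson_r_deviations(trunk_diameter, tree_height)
r_sum = pearson_r_sums(trunk_diameter, tree_height)
print(round(r_dev, 3), round(r_sum, 3))
```

Both forms give the same value because the second is only an algebraic rearrangement of the first. The sums form is the one that matches a hand-built table with X, Y, XY, Y squared, and X squared columns.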
Info
Channel: Alexander W. Wiseman
Views: 231,178
Rating: 4.8391485 out of 5
Keywords: Screencast-O-Matic.com
Id: z7kMeJQWr4Y
Length: 12min 55sec (775 seconds)
Published: Sun Mar 02 2014