Bias in Linear Models (Regression Part II)

Captions
So what we're going to do today really follows on from last week's lecture. Last week we were talking about linear models and how we can include lots of predictors in them; this week we'll move that on a bit to look at how models are specified and how we can use them, and then we're going to go back to my favourite subject of bias and look at how we can assess whether the models we create are actually any good. We touched on this a bit at the end of last week's lecture, because we talked about F and R squared, which are, broadly speaking, measures of fit, but we're going to go into a whole other world of pain when it comes to model fit in this lecture.

Which, for some reason, reminds me: the class test happens sometime in the week after next week's lecture. If your practical class is on a Thursday, your test — sorry, it's not a test, it's an exercise — your exercise will be next Thursday. People were getting a bit confused about when it is; I've posted something on the forum, but I thought I ought to mention it anyway.

So, regression and linear models. Last week we looked at an example where we were trying to predict album sales from how much money a record company had spent advertising the product. With one predictor that's a simple linear model: a straight line. Then we extended the model and added a second predictor, how many times the record company managed to get the record played on the radio, and we saw that if you add another predictor then rather than a regression line you get a regression plane, so it all becomes a bit three-dimensional. You can add in even more predictors if you like, but it becomes quite difficult to visualise — I don't have any graphs for that — and the equation just extends in the same way.

Moving on from the pretty pictures, we're now going to look at how we actually use the model. It's the same example: trying to predict how many albums we can sell based on how much money we spend advertising and how many times we get the record played on the radio. Hopefully you did a bit of this in SPSS in your practical class last week. We also talked a little about estimation; I'm not going to go into detail about that because it's complicated and we don't really need to know it, but we have these things called parameters that define the model, they get estimated, and SPSS produces a table that tells us the estimates. We're going to focus for the time being on what these betas represent.

You'll also notice — and I'll come back to this later — that you can build models up systematically. In this case I've put advertising budget in first, estimated the parameters and seen what's going on, and then in a second model I've added in the number of times the record was played on the radio. I don't want to get sidetracked into that just now; it's just why the table looks a bit complicated.
We just want to focus on the bottom part of the table, which is the model that has b0, the constant, in it, and has advertising budget and plays on radio as predictors. The values in the parameter-estimates column are what I keep going on about: they tell us something about the relationship between each predictor variable and the thing we're trying to predict, record sales. So we're going to start by looking in a bit more detail at how we interpret them, and it's reasonably straightforward.

The other thing worth noting, as I mentioned last week, is that once we've estimated these parameters we can test whether they're different from zero. You may remember I said that if a beta is zero, that's like a completely flat line or a completely flat plane. So we can test whether each beta is significantly different from zero, and if it is, the relationship isn't flat: it's sloping up or dipping down. The t-test at the end of the row is that test — it tests whether the estimate is significantly different from zero — and you get the significance value next to it. The basic idea is that if the significance value is less than 0.05 (if you want to be that black and white about it), we treat that variable as a significant predictor of album sales, because the slope, the relationship, is different from zero, and zero represents no effect at all.

So how do we use the betas to actually construct a model? This is the worst slide of the lecture, I think. We saw last week that with multiple predictors we generally denote them with Xs, but you can also replace them with variable names, which is what I've done: X1 becomes adverts (the advertising budget) and X2 becomes plays (plays on the radio), so the model is

sales = b0 + b1(adverts) + b2(plays) + error

All the SPSS table is giving us is the values of these betas, so we can just plug them in. Our constant, b0, was 41,124 — I've rounded it off, but that's the value we get; we're literally taking the numbers out of the SPSS table and plugging them into the model. Similarly, b1 was the parameter estimate for the relationship between advertising and album sales, so we can replace b1 with the value from the table, 0.087, and b2, for plays on the radio, was 3,589.

Now, you might think: why would we do that? It's bad enough that we have to look at equations; why make them worse by putting numbers into them? It's because it allows us to make predictions beyond our data. If you were working for a record company and wanted to know how many sales to expect given the money you had to spend on advertising and the number of times you thought you could get the record played on the radio, you can plug those numbers into the equation and get an answer, and that answer might be genuinely useful. For example, if we had a million pounds to spend on advertising, we can replace the predictor adverts with the value 1,000,000, and if we think we'll get played 15 times a week on the radio we can replace plays with 15. When you work all of that out it gives you an answer: if you spend a million pounds on advertising and get the record played 15 times a week, you can expect, rounding off, about 182,000 sales.
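To make that arithmetic concrete, here is a minimal sketch in Python that plugs the parameter estimates quoted above into the model and reproduces the prediction; the function name is just illustrative.

```python
# Parameter estimates taken from the SPSS output quoted above (rounded).
b0 = 41124      # constant: predicted sales with no advertising and no airplay
b1 = 0.087      # extra album sales per extra pound of advertising
b2 = 3589       # extra album sales per extra weekly play on the radio

def predicted_sales(adverts, plays):
    """Predicted album sales from advertising spend (in pounds) and weekly radio plays."""
    return b0 + b1 * adverts + b2 * plays

print(predicted_sales(adverts=1_000_000, plays=15))  # roughly 182,000 albums
```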
There's an interesting point here. Bear in mind from the previous table that advertising budget is a significant predictor — it has a non-zero relationship — and plays on the radio is a significant predictor too. But do you think 182,000 sales is a good return on spending a million pounds on advertising? No, it's rubbish. This would be like trying to sell Robbie Williams records or something: you'd practically be paying people to buy the record — sorry Robbie, I'm sure he's a lovely man. But that is useful information, because if I were working for a record company, which I often am in my dreams, I'd be sitting there thinking: don't spend the money on advertising; maybe we'd be better off focusing on getting the record played on the radio, because perhaps that will give us a better return. The point is that you can pick apart interesting things like "don't waste a million pounds on advertising, because it isn't going to increase your sales enough to cover the cost". That's why we use these parameter estimates: we construct models and start making predictions, and it gives us useful information. It's used in business all the time, as well as in psychology.

So how do you interpret these beta values? The parameter estimate, the beta, is literally the change in the outcome associated with a unit change in the predictor. The word "unit" always potentially confuses people; all it refers to is the fact that different predictors are measured in different units — advertising is measured in pounds, plays on the radio in number of times per week — so to make a general statement we just say that if the predictor changes by one unit, the outcome changes by b units. In other words: if we spend an extra pound on advertising, how many more albums will we sell?

We can also standardise these parameters, and that can be useful because it gets rid of the whole issue of units. A standardised beta is the same thing — the relationship between the predictor and the outcome — but, a bit like a z-score, it's expressed in standard deviation units. That means we can compare them across different variables: it doesn't matter that one is measured in pounds and the other in plays per week, because both standardised betas are expressed in standard deviations.
Standardised betas are quite often more useful, in a way, because you can compare the strength of the relationship across predictors measured in different ways. Let's look at a concrete example. For our normal, unstandardised beta for advertising budget we got 0.087. What this means is that if our advertising spend increases by a pound, record sales increase by 0.087 units: for every pound we spend on advertising we sell an extra 0.087 of an album. That's not a lot at all — in fact if you flip it on its head and work out how much you would need to spend to sell one extra album, it comes out at about eleven pounds fifty. That tells us something interesting: although advertising budget is a significant predictor, its practical utility is close to zero. You'd never spend an extra pound to get only 0.087 of an extra album sale — that would be a really bad idea, unless you want to go bankrupt, in which case it's a great idea.

What about plays on the radio? This had a rather larger value: every additional time the song gets played per week on Radio 1, sales increase by 3,589 units. Now, you might have to bribe the DJs — I really don't know how Radio 1 works these days, and I think Chris Moyles has left, so I don't even know who does the breakfast show any more — but if getting an extra play were actually cost-free, this would be a great thing to do: one extra play on the radio gets you about three and a half thousand extra sales. Fantastic.

So those are your regular betas. The standardised betas, like I said, are expressed in standard deviation units, which is as tedious as it sounds. For advertising budget the standardised beta is about a half, which means that as advertising increases by one standard deviation — whatever that may be — record sales increase by about half a standard deviation. We'll look at this on the next slide, because to interpret it we need to know what the standard deviations of the variables actually are, but essentially this is a standardised measure of the strength of the relationship, and it's actually quite a strong relationship. Standardised betas behave a bit like a correlation coefficient: because they're standardised they vary, hypothetically at least, between zero and one, where zero would be no relationship at all and one would be a perfect predictor. This one is about halfway, so it's a reasonably strong effect — which again shows that you can have a strong predictor that in real life has very little practical utility. Advertising budget is a reasonably strong predictor statistically, but in the real world you would never spend that much money to get so few extra sales.

If we look at plays on the radio, the standardised beta is about the same — about a half — so statistically speaking the strength of these two predictors is about the same.
Again, this tells us that if we increase the number of plays on the radio by one standard deviation, album sales will increase by about half a standard deviation. So if we were going to compare which is, statistically speaking, the strongest predictor, there's basically nothing between them: they're both about as strong as each other. But coming back to the earlier point, you can have predictors that are equivalent statistically but very different in terms of their real-world utility: in this case, if you were trying to increase sales you'd spend your time trying to get the record played on the radio, and probably not pump money into advertising.

Just to make these standardised betas really concrete — you wouldn't normally do this in real life, you'd just interpret them as they are — here are the standard deviations of the three variables. Our outcome, sales, has a standard deviation of about 80,000; advertising budget has a standard deviation of about half a million; and the standard deviation of radio plays is 12. So what we're saying with these standardised betas is that as advertising goes up by one standard deviation — in other words, by about half a million pounds — sales go up by about half of the sales standard deviation, because the standardised beta is about 0.5. Similarly, the standardised beta for radio plays is about the same, so if we get the record played 12 extra times a week — one standard deviation — we'd again expect about half the sales standard deviation in extra sales. That's what the next slide says: if we increase advertising by a standard deviation, which turns out to be about half a million quid, record sales increase by half a standard deviation, about 42,000 sales; radio play gives us about the same number of extra sales, but from just 12 extra plays a week. It's really just different ways of looking at the same thing.

So our normal betas — our parameter estimates — are expressed in the units the variables are measured in, which means we can't compare them directly (in fact we can't compare them at all), but they still tell us about the strength of the relationship between predictor and outcome. The standardised betas give us the same information in a standard form, which means we can directly compare their size across predictors and know, statistically, which predictor is stronger than the others without having to worry about what units they're measured in.
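As a rough illustration of where those standardised values come from, here is a minimal sketch using the approximate standard deviations quoted above: a standardised beta is just the unstandardised beta rescaled by the ratio of the predictor's standard deviation to the outcome's standard deviation. The numbers are the rounded lecture values, not exact ones.

```python
# Approximate values quoted in the lecture; real software computes these from the raw data.
sd_sales   = 80_000    # SD of the outcome (album sales)
sd_adverts = 500_000   # SD of advertising budget (pounds)
sd_plays   = 12        # SD of plays per week on the radio

b_adverts = 0.087      # unstandardised betas from the SPSS table
b_plays   = 3589

# standardised beta = b * (SD of predictor / SD of outcome)
beta_adverts = b_adverts * sd_adverts / sd_sales   # about 0.5
beta_plays   = b_plays   * sd_plays   / sd_sales   # about 0.5

print(round(beta_adverts, 2), round(beta_plays, 2))
```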
OK, so now we turn our attention to bias. We're going to look at different things that might bias the model, and one of the simplest is how you specify the model in the first place. You can specify models in a number of different ways — that's why the output earlier on had two models in it. If you've only got one predictor there's only one way to do things: you put the predictor in and Bob's your uncle. If you've got several predictors, you can choose between several different ways of fitting the model.

One of them is known as hierarchical regression. This is where you, as the experimenter or researcher, make informed decisions based on theory about the order in which to enter predictors into the model. You might put in what you know theoretically to be the best predictor first, and then start adding other predictors later on, building the model up hierarchically, step by step, one or two predictors at a time. In SPSS this tends to be called blockwise entry, because it enters things as blocks. Hierarchical regression, as I'll explain in a minute, is in a way the best way to do it: if you've got theory on which to base the order in which you enter predictors, then you should use that theory.

Another way is the rather unpleasantly termed forced entry. This is where you take all the predictors you've measured and literally bung them in, all at the same time, and see what happens.

The final way is what's known as stepwise. This is where SPSS decides the order in which to enter the variables based on a statistical criterion — it tends to be the semi-partial correlation, but that doesn't particularly matter; all you really need to know is that it enters predictors based on how good they are at predicting the outcome. It starts off by putting in the best predictor, based on its correlation with the outcome; once that's in, it looks around at the other predictors and asks which of them adds a significant benefit to the model, and throws that in; then it looks at what's left over and asks whether any other variable would make the model better, and so on. So you're building the model up using statistics, which is not always a good idea.

Hierarchical entry is generally a good idea because you're making the decisions, you're human, you've got a brain, and you can make those decisions in an informed, educated way. Forced entry is not necessarily so good — there can be uses for it, but it's a bit like making a cake by bunging everything into the bowl at the same time: for some cakes that will be fine and you'll get fairly nice results, but for others it will be a disaster, it won't rise in the oven, Christmas will be ruined and your family will never speak to you again. Stepwise, because it's based on statistical decisions, can be OK if you've got no other way of deciding which order to put things in, but it can also be a bad idea, because there may be only a very tiny difference between two predictors: one of them is a slightly better predictor than the other, it gets entered first, and because it's been entered first, that affects which other variables get entered afterwards. I've got a hopefully good analogy for this.
Imagine I'm getting dressed in the morning — not the most pleasant image, I'll give you that, but it's quite useful. I have a jacket: imagine the jacket is a predictor variable. I have a waistcoat as well — don't worry, this isn't going too far — so that's another predictor variable. I've got shoes, quite elaborate shoes, which are predictor variables too, and socks, also predictor variables. And, because I'm not actually going to strip in front of you, we've got pants and a t-shirt as well. So imagine I'm getting up in the morning and I want to dress myself: I've got these predictors, I can choose any of them to put on, I'm the model if you like, and we're going to use these predictors to make me ready for the world.

We could do it hierarchically, in which case I'm making the decisions. I look at my predictors and think: theoretically it makes sense to put my socks on first, so a sock goes in. Now that my socks are in the model, it makes sense theoretically to put my shoes on next. I've already put my pants on, obviously, so then I might put my waistcoat on. I'm making informed decisions about which predictors go into the model and in which order, and the consequence is that I come to work looking vaguely respectable.

Now imagine I let a stepwise regression dress me in the morning, so it makes the decisions about which predictors go into the model. SPSS looks around at my predictors and asks: what's the best predictor? Your trousers — so it puts my trousers on first. Then it asks what's a good predictor to add next: shoes are a really good predictor, that's going to improve the fit of the model, so it puts a shoe on. Then it looks around at what's left over: there's a sock, that will improve the fit of the model, so it tries to put that on too. Then it decides the pants will improve the model, so they go on as well; hang on, the jacket, that improves the fit, that goes on next; oh, and the t-shirt, that can improve the model, stick that on over the top. Fantastic model — it fits really, really well. This is basically what happens if you do a stepwise regression: by letting SPSS make the decisions for you, you end up with socks over your shoes, a t-shirt over your jacket and a weird Superman look with your pant arrangement, because it's making decisions based on statistics, not on sensible information. Maybe there was only a really small difference between whether my t-shirt or my jacket was the better predictor to enter next, and all SPSS does is look at the numbers. It doesn't look at the situation and say "no, it's stupid to put a sock on after your shoe, that's a ridiculous thing to do"; it just says "the shoe is a good predictor, bung it in, see what happens", and once it's in, it affects all the other decisions SPSS makes. So stepwise regression is not always a great idea, because the decision-making is out of your hands: you're leaving SPSS to make decisions for you, and it's going to make them without any knowledge whatsoever of what is theoretically meaningful.

So that's the first way you can get bias into your model: the way you select predictors affects the parameter estimates that you get.
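Just to show the mechanics the analogy is poking fun at, here is a minimal sketch of forward stepwise entry, assuming a pandas DataFrame df with an outcome column and a list of candidate predictor names. It only illustrates the idea of letting a statistical criterion choose the entry order; it is not SPSS's exact algorithm, which uses semi-partial correlations and can also remove predictors again.

```python
import statsmodels.api as sm

def forward_select(df, outcome, candidates, alpha=0.05):
    """Toy forward stepwise selection: enter the predictor with the smallest p-value each round."""
    selected = []
    remaining = list(candidates)
    while remaining:
        best_p, best_var = 1.0, None
        for var in remaining:
            X = sm.add_constant(df[selected + [var]])
            p = sm.OLS(df[outcome], X).fit().pvalues[var]
            if p < best_p:
                best_p, best_var = p, var
        if best_var is None or best_p >= alpha:
            break                        # nothing left adds a significant benefit
        selected.append(best_var)        # enter the best remaining predictor
        remaining.remove(best_var)
    return selected
```

Notice that a tiny difference in p-values on the first pass changes everything entered afterwards, which is exactly the sensitivity the dressing analogy is warning about.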
You can take the same set of variables you've measured, the same predictors, and if you put them in hierarchically you'll get slightly different answers than if you put them in stepwise, and slightly different answers again than if you use forced entry. So think about your method — that's the moral of the story.

We can also look at how well models fit the data. We've already covered this a bit: we looked at F, which is a sort of overall measure of whether the model has improved our ability to predict the outcome, and at R squared, which tells us the proportion of variability in the outcome that can be explained by the predictors — by the model as a whole. They're both very useful, but we can look at other things too. We can look at what are known as residuals: you may remember from earlier in the course that residuals are just the errors between what the model predicts and the data points you've observed. You can look at residuals in lots of different ways, but I'm going to focus on standardised residuals, because they're like z-scores — they're in a standard unit of measurement, so we can make some general rules about what they should or shouldn't be. We can also look at whether there are individual cases that are influential — this is like the outliers we talked about before — cases that have affected the beta parameters, that have affected the model.

Just to remind you, these residuals or errors are the distances between the data we observe and what the model predicts: on the slide, the green line is what the model predicts and the yellow dots are what we actually observed. You always get these errors, so they're not in themselves something to worry about, but we need to know that they're not too big and that there aren't too many big ones. Standardised residuals are basically z-scores, and we know a few things about z-scores. One is that 95% of them ought to lie between plus and minus 2 (it's actually 1.96, but 2 is easier to remember). So we can look at the residuals for all the cases of data we have and simply count whether roughly 95% of them fall between plus and minus 2; if they do, the error is about what we would expect in a normal sample. You can go to a more extreme level: we also know that about 99% of z-scores should fall between roughly plus and minus 2.5 (it's 2.58 or so, but 2.5 will do). If 99% of your residuals fall between those values, that's fine, that's what you'd expect; but if a lot more than 1% fall outside plus or minus 2.5, you've got too many residuals that are too big. You can also look for individual cases with huge standardised residuals: if an individual case has a standardised residual above about 3, that might be something to go and look at, to see whether there's a good reason for it.
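As a rough illustration of those rules of thumb, here is a minimal sketch assuming you already have an array of standardised residuals (for example from a fitted regression); the thresholds are the ones quoted above.

```python
import numpy as np

def check_standardised_residuals(z_resid):
    """Apply the lecture's rough rules of thumb to a set of standardised residuals."""
    z = np.asarray(z_resid)
    pct_outside_2   = np.mean(np.abs(z) > 2)   * 100   # expect roughly 5%
    pct_outside_2_5 = np.mean(np.abs(z) > 2.5) * 100   # expect roughly 1%
    big_cases = np.where(np.abs(z) > 3)[0]             # individual cases worth a closer look
    print(f"{pct_outside_2:.1f}% of residuals lie outside +/-2 (expect about 5%)")
    print(f"{pct_outside_2_5:.1f}% lie outside +/-2.5 (expect about 1%)")
    print("cases with |standardised residual| > 3:", big_cases)
```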
Looking at residuals by itself is not enough, though, useful as it is, because you can have cases that do not themselves have big standardised residuals — there isn't much error associated with them — but the reason there isn't much error is that they've had a massive impact on the model.

Here's an example with some real data, which record the number of pubs in districts of London and the mortality rates in those districts; each dot is a district. What you see is a beautiful straight-line relationship between number of pubs and number of deaths, and then one massive outlier, which is actually the City of London — the central business district, where there's an enormous number of pubs but, relatively speaking, the mortality rate is not particularly high. So it's a massive outlier in terms of how many pubs there are, but not a particular outlier in terms of mortality. What it does do, though, is have a massive effect on the model. If you fit a linear model to these data, the line you get is actually a really good predictor of that data point — there's so little difference between the dot and the line that you can barely detect it — whereas the other districts all have some residual, some error, associated with them. But if we get rid of that data point and fit the model without it, we get a completely different line: having that one point in the data set shifts the whole line — the slope, the intercept, everything. This is what's known as an influential case, and influential cases are not necessarily outliers as such: if you looked at the standardised residual for that point you would not conclude it was an outlier, because it doesn't have much error attached to it, but the reason for that is that it's so extreme that it has influenced the model and changed the whole shape of the line you fitted. So residuals are not enough, because cases like this look completely fine if you only look at the residuals, and actually they're not.

What you can do — and SPSS offers a really tedious array of influence statistics — is, at a very simple level, fit the model with the data point included and estimate the beta parameters (one for the constant, one for the slope), then delete the point, work out the new regression line with its new constant and slope, and compare the two. The difference between the parameter estimates when you include the data point and when you don't is essentially a measure of influence. Obviously you don't do that manually — SPSS does it for you — and that difference between beta values with and without a case included is known as DFBeta in SPSS. With DFBetas, though, you get the influence on each parameter in the model one by one, so if you had three predictors you'd end up with three DFBetas plus one for the constant. Cook's distance works on the same principle — comparing the model with the data point included against the model without it — but it wraps everything up into a single value, which is convenient, not least from a teaching point of view. As a general rule of thumb, if Cook's distance for a case is greater than 1, that's a data point you should look at as possibly being too influential. That's a nice, easy rule to remember: Cook's distance greater than 1 is potentially a problem.
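If you wanted these diagnostics outside SPSS, here is a minimal sketch using statsmodels, assuming a DataFrame df with pubs and mortality columns like the example above; the greater-than-1 cut-off is the rule of thumb quoted in the lecture.

```python
import statsmodels.api as sm

def influence_check(df):
    """Cook's distance and DFBetas for a simple pubs-predicting-mortality model."""
    X = sm.add_constant(df[["pubs"]])
    fit = sm.OLS(df["mortality"], X).fit()
    influence = fit.get_influence()
    cooks_d = influence.cooks_distance[0]    # one Cook's distance per case
    dfbetas = influence.dfbetas              # one column per parameter (constant, slope)
    flagged = df.index[cooks_d > 1]          # rule of thumb: Cook's distance > 1
    print("potentially influential cases:", list(flagged))
    return cooks_d, dfbetas
```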
What about other sources of bias? A few weeks ago we had a whole lecture on bias, and I'm going to return to some of those ideas now. We talked about outliers, and we've just talked about them again, so I'm not going to go on about that. There was an assumption of linearity and additivity, which again I'm not going to cover in detail; I'll just remind you that if we want to assess bias in a linear model we have to ask whether a linear relationship is the best way to model the relationship between the variables we have. What I am going to focus on a little is the assumption about normally distributed something-or-others. I'll come back to this in a minute, but essentially, if we're doing significance tests and confidence intervals for our betas, it's the sampling distribution of those betas that needs to be normal; and if we want the estimates of the betas to be optimal, it's the residuals — the errors — that need to be normally distributed. The other assumption we looked at was homogeneity of variance, which in linear models tends to be called homoscedasticity. They're really the same thing: it tends to be called homogeneity of variance when you have groups of scores and homoscedasticity when you don't have groups, just lots of continuous variables. We'll come back to that as well. So that's just flagging that we've talked about these things before.

There are a couple of other things we haven't talked about before, none of which takes long to explain. First, variable types. When we fit linear models the outcome ought to be a continuous score — you can fit linear models where the outcome is categorical, but those models are more complicated and we don't cover them on this module. The predictors can be continuous — most of the examples we've dealt with so far have had continuous predictors, like advertising budget, which ranges continuously on a scale — but it's also possible to have categorical predictors, and that's what I'm going to talk about in next week's lecture. You'll also find some textbooks saying there needs to be non-zero variance, which is really stating the obvious: if you're going to put a predictor in, your scores have to vary to some degree or another.
Independence: the scores you get — or, in particular, actually, the errors — should be independent. The residuals should be uncorrelated with each other; in other words, there shouldn't be a systematic pattern between them. The best way to make sure of that is normally to collect data from different people, but you can still sometimes get correlated errors. This one is quite important, but it's quite easy to check.

The other one is multicollinearity, which just means that the predictors in your model shouldn't be very, very highly correlated with each other, and that's for fairly obvious reasons: if you put two variables into a model and they're very similar and correlate very highly with each other, it becomes very difficult to estimate the betas for them, because they're both doing much the same job. My slightly dubious analogy for this is going on a first date with identical twins, with both twins there: it would be quite hard for you to distinguish which one was the significant one for you — obviously if you knew them a bit better you'd be able to tell them apart. I do actually know some male twins, whom I haven't dated, and even though I can tell them apart really easily I still call them by the wrong name all the time, which I think annoys them quite a lot; I don't do it on purpose. Anyway, multicollinearity is when your predictors are so similar that you just can't distinguish between them, so when SPSS tries to estimate the betas it runs into trouble, and you want to try to avoid that.

On a musical theme, I've written a little song to help you remember when normality is important. I wrote it in the shower yesterday, so this is a world premiere; depending on your reaction I may or may not use it again next year. It goes like this:

If SPSS says your data's a mess, and you're going as mad as a hatter,
because you really want p to reflect reality — that's when normality matters.
When you've been up for days in a statistical haze, you're tired and emotionally shattered,
you don't want to be fooled by your confidence intervals — that's when normality matters.
If your life has got skew and you're wondering what to do, because you feel like your brain has been battered,
if your samples are small then remember the rule — that's when normality matters.
If the scores you collect are distributional wrecks, remember that this doesn't matter,
because for the CIs and ps you need normality of the sampling distribution of the parameter.

Thanks — it'll be on iTunes tomorrow. So: normality matters, but not within the scores themselves, even though that's typically what we test. It matters when we do significance testing: if you want to trust your confidence intervals and your p-values, it's the sampling distribution of the betas that needs to be normal. And that matters more in small samples, because the central limit theorem — which I couldn't fit into the song, for some peculiar reason — tells us that in big samples the sampling distribution ought to be normal anyway. But we do need to look at the residuals as well.
The residuals are important, at least in telling us that the estimates we have are optimal given the method of least squares. SPSS will chug out a nice histogram and a P-P plot for you, and you may remember from the lecture on bias that on the P-P plot we want the dots to fall basically along the diagonal. This is actually the album sales data, and it's fairly beautifully normal; you can see from the histogram that it's pretty normal too. So this would be a situation where — happy days — if we saw this at the bottom of our SPSS output we'd be smiling (I appear to have turned into an apple on the slide, for some reason). If, however, we see something like this other example — quite a skewed distribution, with the dots on the P-P plot starting to deviate and snake around the diagonal — then the betas you have are not the optimal values; maybe the method of least squares wasn't the best one to use, although that's the only one we teach you, so there you go. If we see something like that, we pull the sad face.

What about heteroscedasticity? I've got a song for that as well — this one is not a world premiere — and it goes like this:

Heteroscedasticity is hard to say; if you get it you'll hope that it goes away,
or perhaps that it's syphilis — it's hard to tell — but syphilis won't leave you in statistics hell.
If your residuals are fanning out, you'd better get ready to scream and shout,
because if your data are heteroscedastic, your model is a lousy fit.

The important bit — and in your exam, please don't write anything about syphilis — is the bit about the residuals fanning out. We've talked about this very briefly before, and I've shown this diagram before: it's a plot of ZRESID, your standardised residuals, which I've already talked about, against ZPRED, your standardised predicted values — the values the model predicts, converted to z-scores (there's a lot of converting to z-scores this week). What we're hoping to see on this plot of ZRESID versus ZPRED is basically a pretty random array of dots. If it's a random array of dots, that first of all means our residuals are independent, which ticks the box of independent errors; if they're reasonably evenly spread out, that means we've got homoscedasticity, which is a good thing; and if it's a fairly random, evenly spaced pattern we can normally assume linearity as well. So if you see something like that, you're very, very happy. Heteroscedasticity shows up as the residuals funnelling out — that's what the song was about — so you're looking for that characteristic funnel shape. It may not be this way round; the funnelling might go the other way, and that's the same problem. Basically, somewhere along the x-axis the errors have a smaller variance, a smaller range, than they do at the other end, so you're looking for the plot being wider at one end than the other, and that's the sign of heteroscedasticity.
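If you wanted to produce the same kind of diagnostic plots outside SPSS, here is a minimal sketch using statsmodels, scipy and matplotlib, assuming a fitted OLS results object called fit: a histogram of the standardised residuals, a normal Q-Q plot (serving the same purpose as SPSS's P-P plot), and the ZRESID-versus-ZPRED scatter described above.

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

def diagnostic_plots(fit):
    """Residual diagnostics for a fitted statsmodels OLS results object."""
    zresid = fit.get_influence().resid_studentized_internal  # standardised residuals
    zpred = stats.zscore(fit.fittedvalues)                   # standardised predicted values

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    axes[0].hist(zresid, bins=20)                            # should look roughly normal
    axes[0].set_title("Standardised residuals")

    stats.probplot(zresid, dist="norm", plot=axes[1])        # dots should hug the diagonal
    axes[1].set_title("Normal Q-Q plot")

    axes[2].scatter(zpred, zresid, s=10)                     # want a random, even spread, no funnel
    axes[2].axhline(0, linestyle="--")
    axes[2].set_xlabel("ZPRED"); axes[2].set_ylabel("ZRESID")
    axes[2].set_title("ZRESID vs ZPRED")
    plt.tight_layout()
    plt.show()
```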
Non-linearity will show up on this plot as a sort of sausage shape, like the one down here — that would be classic non-linearity. Like I said, if the dots are fairly evenly spread out you can probably assume linearity, but if you get something sausage-y — it won't look exactly like that sausage, but it will have some kind of bend in it somewhere — that's non-linearity. And this last example is one where your model really ought to just give up the ghost, because you've got heteroscedasticity and non-linearity at the same time: at one end the errors are very widely spread out and at the other end they're very narrow (that's the heteroscedasticity), and the non-linearity comes out in the fact that there's a bit of a sausage going on as well. So if you compare the two plots, one has just heteroscedasticity and the other has heteroscedasticity plus a bit of a sausage.

These assumptions are, in a way, very easy to test: you just look at the graphs and make a call, and heteroscedasticity is normally quite obvious — it's not usually subtle; if you have it, you tend to get quite a clear funnel. For our actual album sales data, this is ZPRED versus ZRESID, and we've got quite a nice pattern: there's no real obvious funnelling, so we're probably all right as far as homoscedasticity goes, and it looks like a nice random jumble, which is exactly the sort of thing we're looking for. Happy days. I've also put up what SPSS calls partial plots, which give you a plot for each predictor — advertising budget, plays on the radio, and actually a third predictor I had, the attractiveness of the band, which I've included just because you get to see some funnelling. If you want to see which variable is creating a problem, these partial plots are useful. Here you can see a nice linear trend for advertising budget, and the spread of scores is reasonably even along its whole length; this one shows a nice linear pattern as well, so those two predictors are pretty much fine — the spread around the line is reasonably even. I threw the third one into the mix just so you can hopefully see that there's some funnelling going on; if we were looking at a model with that predictor in as well, it might be a reason to think about not including it.

So, multicollinearity, which is the last thing we need to worry about. Again it's quite easy to check, because SPSS throws out some statistics for you: you'll see how to do this in your practical class, but basically if you tick a box you get two extra columns in your output, tolerance and VIF. They're actually versions of the same thing, which is an interesting thing to remember: tolerance is 1 divided by the VIF, and the VIF is 1 divided by tolerance, so they're completely related statistics — I don't really know why it gives you both. There are some interestingly contradictory recommendations about tolerance, but one recommendation is that tolerance should be more than 0.2; in this case all our tolerances are bigger than 0.2, so that shows no multicollinearity, which is good. The VIF rule is easier to remember because it's a nice round number: the VIF should be less than 10. So you just look down that column, and if any values are greater than 10 then that potentially indicates multicollinearity, which is a problem. These are all smaller than 10, so based on this criterion there's no problem at all.
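Outside SPSS you can compute the same numbers yourself. Here is a minimal sketch using statsmodels' variance_inflation_factor, assuming a DataFrame X containing just the predictor columns, with tolerance computed as 1 divided by the VIF as described above.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def collinearity_table(X):
    """VIF and tolerance for each predictor column in the DataFrame X."""
    Xc = sm.add_constant(X)                      # include a constant, as the regression does
    rows = []
    for i, name in enumerate(Xc.columns):
        if name == "const":
            continue
        vif = variance_inflation_factor(Xc.values, i)
        rows.append({"predictor": name, "VIF": vif, "tolerance": 1 / vif})
    return pd.DataFrame(rows)                    # worry if VIF > 10 or tolerance < 0.2
```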
So, to summarise all of this: once you've fitted a model you need to assess its accuracy and look for sources of bias — you need some idea of whether it's actually a decent fit or whether something is distorting it. We can look at how well the model fits the sample by looking at residuals and influential cases. We need, before any of that, to think about the order in which we put predictors into the model and whether we do it stepwise or hierarchically. And there's also the issue of how well the model fits the population, which is where confidence intervals come in: what we're really trying to do is estimate the true values of the betas in the population. We've seen that normality matters for significance tests and confidence intervals, but also that the normality of the residuals — which we hadn't talked about much before today — matters for knowing that the betas we've estimated are optimal given the method we've used. We also need to look for homoscedasticity, check that there really are linear relationships between the things we're modelling, and check that our residuals are independent. Next week is the final lecture in this segment, and it will look at categorical predictors. And that's it — thank you.
Info
Channel: Andy Field
Views: 37,007
Rating: 4.9431281 out of 5
Keywords: Regression, Multiple Regression, Normality, Heteroscedasticity, Linearity, Assumptions, Statistics, P-P plots, Stepwise regression, b coefficients
Id: ywvhmYNvbyI
Length: 51min 31sec (3091 seconds)
Published: Tue Jan 29 2013