Spatial Regression in R 1: The Four Simplest Models

Video Statistics and Information

Captions
Welcome, everybody! Today we're going to run our first set of spatial regression models in R, covering the four simplest versions.

We'll start with OLS, which is a non-spatial model: y = Xβ + ε, where we assume ε is just a normal stochastic error term, with a set of slopes and a set of X's. After testing that OLS model a couple of different ways, to see if there might be spatial relationships we should be accounting for, we'll run the spatially lagged X (SLX) model. The SLX model adds the average value of the neighboring X's. For example, if we were explaining crime rates (our y), our X's might be things like income, education, and unemployment rates, variables we might think have something to do with crime rates. In the SLX model we don't only include our own region's education, income, and unemployment; we also include the values of those variables in neighboring regions. The idea is that neighboring unemployment, say, might affect crime here: people who are unemployed or desperate in a neighboring region might come over and commit crimes in ours.

The second model is the spatial lag model, sometimes called a SAR model, although Roger Bivand objects to that usage: he says SAR should stand for simultaneous autoregressive only, not spatial autoregressive, the way we sometimes use it in econometrics. Since it's a common term I'll use SAR here, but know that some people will take offense at using it for this model. In this model we do not have spatially lagged X's; we have spatially lagged y's, the ρWy term. The idea is that higher crime rates in neighboring areas may also affect our own crime rate: some of the crime going on next door spills over into our area. This is a global spatial model, whereas the SLX model is a local spatial model. Local means our neighbors' X values affect us, but it stops there: there is no global spillover where our neighbors' values affect us, that in turn affects our neighbors, and so on. The spatial lag model is global: anything that happens to our neighbors' y affects us, and then feeds back to the original neighbors and beyond, so that every region in the network feels some effect, like a stone dropped in a pond whose ripples go on forever.

In the spatial error model, the thing we envision spilling over is the residual: our residual, the unexplained value, is a function not only of our own stochastic error term but also of our neighbors' residuals. One interpretation is that some missing variable is spatially correlated: perhaps an explanatory variable we left out of the model is causing our residual to be high. Say we leave unemployment out of the model, but unemployment rates are spatially related; then there will tend to be spatially related clusters of unexplained residuals, higher-than-expected crime rates in one set of regions and lower-than-expected rates in another, in a spatially localized pattern. Note that the spatial error model (SEM) is also a global model, in the sense that a shock, a high residual in one area, ripples through all of the regions in the model.
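To summarize, with W the spatial weights matrix and ρ, θ, and λ the spatial parameters named above, the four specifications can be written compactly (a notation sketch following the descriptions in this video):

```latex
\begin{aligned}
\text{OLS:} \quad & y = X\beta + \varepsilon \\
\text{SLX:} \quad & y = X\beta + WX\theta + \varepsilon \\
\text{Spatial lag (SAR):} \quad & y = \rho W y + X\beta + \varepsilon \\
\text{Spatial error (SEM):} \quad & y = X\beta + u, \quad u = \lambda W u + \varepsilon
\end{aligned}
```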
Now, let's go into R and actually do this. Before we do, let me show you something I've been working on that's included in today's download. It's a first rough version, call it version 0.5, of the mother of all R spatial econometrics handouts, where I list many of the commands you might need: reading in weights, creating weights (we covered that in the previous video), and estimating various models. Today we're going through the first four models we just discussed; in later videos we'll get into the more complicated models listed on the front and back of that sheet, such as the SARMA.

To follow along, here's what you need to do. I'll include a link in the description below, or you can go to my website, burkeyacademy.com, click on GIS and Spatial Stats, and find the file under the download links: R Spatial Regression 1.zip. Once you download and unzip it, you'll find the cheat sheet I mentioned (at least the current version), a description of the variables in the data set, a shapefile that has both the maps and the data we'll be working with (created in an earlier video), a GAL neighbor file created in GeoDa in case you want to practice reading that in (we did that in a previous video), and the day's commands in a text file.

Here's what to do next: install R and RStudio (I have links on my spatial webpage if you want to find those easily). Open RStudio; these instructions are also in the text file I showed you. Extract the zip to a directory, then in RStudio go to File > New Project, click Existing Directory, select the directory where you extracted the zip, and click Create Project. Once RStudio opens you'll see the files from that directory over on the right. Next, go to File > New File > R Script, which opens a window at the bottom where we can copy and paste all of the commands from the text file; that lets us edit and run them without any trouble.

If this is your first time running any kind of spatial regression in R, run the three commands that install the packages spdep, rgdal, and rgeos; if you already have them installed, you can skip that. Then load the data set from the shapefile: run library(rgdal) to load that library, then run the line that reads in the shapefile, then the line that shows the names of the variables, and finally the summary line, which shows summary statistics for the data in the shapefile along with the names.
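A sketch of those setup commands (the layer name here is a placeholder; substitute the actual name of the .shp file from the downloaded zip, without the extension):

```r
# Run once if the packages are not installed yet
install.packages(c("spdep", "rgdal", "rgeos"))

library(rgdal)                                     # readOGR() lives here
spat.data <- readOGR(dsn = ".", layer = "NCVACO")  # hypothetical layer name
names(spat.data)     # list the variable names in the attribute table
summary(spat.data)   # summary statistics for every variable
```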
Now, one problem you're going to run into in R quite often, so we might as well cover it, concerns the quantitative variables. What you're looking for in the summary is that each one has a minimum, median, mean, and maximum; that means R is treating it as a quantitative variable. But here's one we're interested in, PCI (per capita income), and instead of seeing the min and the mean you see a frequency distribution. That means R is treating it as a qualitative, or categorical, variable, and you have to fix that. I have a little command here you can use: give it the name of the data set, which we've named spat.data, and the name of the variable, PCI, and it will convert the variable and re-save it as quantitative. Click Run, and if you want to verify that it worked, go back up to the summary line, run it again, and PCI now shows minimums, means, and maximums (these are in US dollars).

All right, next you might want to create a map of some of your variables. Click Run and this makes a little map of liquor sales per capita. There are lots of ways to customize this, but that's not the point of this video; maybe we'll do that later if you want me to go through it.

Now let's load the spdep library, which contains most of the spatial statistics commands we'll be using. First, we take the data set with the map in it and create a queen neighbor file, a version of the W matrix, the spatial weights matrix that tells us who is a neighbor to whom. Click at the end of the line and Run. Then create a rook weights matrix too, just in case you want it. To use these matrices, note that R stores them in several forms: one is called a neighbor (nb) object, another is called a listw object. For most kinds of spatial regressions you'll want the listw version, and you convert with the nb2listw command: click at the end of each line and Run, and that saves a new version. As we create new objects, they're listed in the Values pane on the right. To make things easier, so we don't have to type queen.listw every time, I'll save the one we're using under another name, listw1. If we wanted to use the rook weights, we'd just replace "queen" here with "rook"; this makes it easy to switch between weights matrices if you want to experiment.

Similarly, so we don't have to type our regression equation every time, let's name it. The equation explains the DUI rate (arrests for driving under the influence of alcohol) using: liquor sales per capita; college enrollment percent (the percentage of people in the county who are college students); the distance from where people live (block groups) to the closest ABC store; the percentage of people who are Baptists (historically anti-alcohol, so let's see if that has any impact); the distance from where people live to the closest bar where they could buy a mixed drink (a mixed drink being something like orange juice and vodka); and the percentage of people in the county who work in the entertainment and recreation industries (these might be tourist areas). So let's define that equation. One other thing before we get going: turn off scientific notation. The 7 in this command basically says don't use scientific notation to represent a number unless it would have more than about seven leading zeros, or is similarly large. Seven is a reasonable value; you can play with it, and the larger the number, the more reluctant R will be to use scientific notation. This makes the results a little easier to read. Run it.
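Roughly, those commands look like this; the right-hand-side variable names in the formula are placeholders I've made up from the descriptions (only spat.data, PCI, queen.listw, and listw1 are named in the video):

```r
# PCI was read in as a factor; convert it back to numeric.
# Going through as.character() avoids getting the factor level codes.
spat.data$PCI <- as.numeric(as.character(spat.data$PCI))

library(spdep)

queen.nb <- poly2nb(spat.data, queen = TRUE)   # queen contiguity neighbors
rook.nb  <- poly2nb(spat.data, queen = FALSE)  # rook contiguity neighbors

queen.listw <- nb2listw(queen.nb)  # convert nb -> listw for the regressions
rook.listw  <- nb2listw(rook.nb)
listw1 <- queen.listw              # swap in rook.listw here to change weights

# Name the equation once (variable names are hypothetical)
reg.eq1 <- DUI ~ SALESPC + COLLENRP + DISTABC + PCTBAPT + DISTBAR + PCTENT

options(scipen = 7)  # discourage scientific notation in printed output
```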
OK, let's do OLS first: just lm, our regression equation, and the name of the data set. Run, and let's look at the results. We have one barely statistically significant variable here, Baptists: the more Baptists, the more driving under the influence there is. And we have an R-squared of about 0.058. Not too great.

OK, let's see if there's any spatial dependence in our residuals. This next command is a Moran's I test designed for regression residuals; you can't use a regular Moran test, since it won't give you the right results. lm.morantest looks at the residuals of that OLS regression, and you give it the kind of spatial relationship matrix you want it to use, the queen weights we made. Click at the end and Run. The null hypothesis is no spatial correlation in the residuals, and this p-value is very small, so we reject it. This is telling us that something funny is going on spatially in our residuals; maybe we should investigate some kind of spatial model.

Another way to see whether we might want a spatial model, just looking at the OLS, is to run the Lagrange multiplier tests we mentioned in a previous video when we were looking at GeoDa. Let's run those tests in R. Here's the command. Well, good grief: the documentation says you can just say test = "all" and it'll run all the tests, but that isn't working, so I guess we have to go back to the old way, which is to give it a list of all five Lagrange multiplier tests you want. I'll include this corrected command in the text file download.

The first piece of output tests how much our model would improve if we ran a spatial error model; the p-value suggests a spatial error model would improve the fit. The next test looks at whether a spatial lag model would improve the fit. Then come the robust Lagrange multiplier tests: the robust error test checks whether an error model would improve the fit while trying to filter out false positives, because a lag model can show up as a false positive for an error model and vice versa, since they have some similarities in their data generating structure. This is suggesting maybe an error model, and the same for the lag. So what are we to do? Speaking with Luc Anselin many years ago, he suggested picking the one with the lower p-value; the lag model has the lower p-value here, so Anselin would suggest the lag model. Now, there is a fifth test here, for the SARMA model. As far as I've been able to tell, first, R is not capable of estimating the SARMA model; second, this test was Anselin's idea, and Anselin suggests that the SARMA model is probably never the right model; he includes the test for completeness. So Anselin's approach says: let's look at the lag.
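The diagnostics just described, as a sketch:

```r
reg1 <- lm(reg.eq1, data = spat.data)  # plain, non-spatial OLS
summary(reg1)

# Moran's I test built for regression residuals (an ordinary Moran
# test on the residuals would not give valid inference)
lm.morantest(reg1, listw1)

# The five Lagrange multiplier tests, listed explicitly since
# test = "all" was not working at the time of recording
lm.LMtests(reg1, listw1,
           test = c("LMerr", "LMlag", "RLMerr", "RLMlag", "SARMA"))
```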
So now let's run the SLX model, spatially lagged X's, where again we take the average values of our neighbors' X's and throw them in to see if they explain our arrest rate for drunk driving. What we get are two sets of slope coefficients: one for our own X's, which we'll call the betas as usual, and then WXθ, where the thetas are a set of slopes, one for each explanatory variable, telling us how our neighbors' values of the explanatory variables affect our own outcome. There are two ways to do this; one is the lmSLX command in spdep, so let's run that, then look at what it gives us: summary(reg2), click Run.

As you can see, we have two sets: the first set are the X's and the second are the lagged X's. To interpret these, find a couple that are statistically significant. Here's one: the distance between the block groups where people live and the nearest bar where they could buy a drink. If that distance is higher where we live, we get a lower rate of drunk driving arrests, significant at the 0.05 level; that's the effect of our own value. However, if we look at the neighboring value: when the distance between homes and bars is higher in the neighboring county, that leads to a higher drunk driving arrest rate in our county. Now, I'm not saying this model makes sense; we're just throwing some variables together to see what happens. But it illustrates an interesting thing that can happen with an SLX model: the same variable can have opposite signs on the own and neighbor terms, in this case negative for our own distance and positive for our neighbors'. This makes you think: what's the best policy? Should you try to make this distance higher by restricting the number of bars, or not? There are two offsetting effects: making the distance higher lowers drunk driving arrests in the own county but tends to increase drunk driving arrests in the neighboring county, if we can believe these results and if this is a correctly specified model. That comes with a lot of caveats, but it's an interesting result.

So given these offsetting effects, what's the overall policy effect? What if the entire region put in a policy to restrict the number of bars, making this distance higher everywhere: what would the total effect be? We call the own-county effect the direct effect and the neighbor effect the indirect effect. There's a command that can help us out here. If you want to figure out the marginal effects while really taking the spatial nature of things into account, the fact that there are direct and indirect effects, you can look at the coefficients, and for SLX they are the marginal effects; but if you're interested in what happens if every region increased the distance, what the overall impact on drunk driving arrests would be, you need the impacts command, and you tell it what the weights matrix is. So let's run that.
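The SLX estimation and the basic effects call, as a sketch:

```r
reg2 <- lmSLX(reg.eq1, data = spat.data, listw1)  # y = Xb + WX0 + e
summary(reg2)   # two coefficient sets: the X's and the lagged X's

impacts(reg2, listw = listw1)  # direct, indirect, and total effects
```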
This gives us the direct effects, which are just those coefficients: the 0.0154 for sales per capita comes straight from the coefficient table, and the indirect effect for sales per capita, 0.02649, comes from the lagged coefficient. Then we can look at the total effect, 0.0419: basically, in this case, they're just adding the two together. When one is positive and one is negative, as with our distance variable, the negative and positive offset in the total.

But here's where it gets interesting: what if you want to know whether that total effect is statistically significant? Then you need to make your impacts command a little more complicated: wrap it in summary and tell it zstats = TRUE. The R argument here tells it how many simulation repetitions to use; for an SLX model you don't actually need it, and it will be ignored in this case, but later, when we do this for the lagged y, you have to tell it how many repetitions you want, because those effects truly are simulated, whereas here they are directly calculated. Run this command, and it not only lists the total effects but also calculates the standard errors, z-values, and p-values for them. Why is this important? You might have direct and indirect effects that are each statistically insignificant, and yet the total effect turns out to be statistically significant. I thought that's what we saw for sales per capita, but I'm sorry, I'm wrong: its direct effect p-value is 0.06, so not quite. Still, it can happen that the total is significant when the pieces are not, once you calculate its z-stat and p-value. For the distance variable, the direct effect is statistically significant and negative, the indirect is statistically significant and positive, and the total is also statistically significant: the direct was negative, the indirect positive, and the total is positive. What this tells us is that if we instituted a policy that increased this distance everywhere, it would increase drunk driving arrests overall, assuming we can trust this model and it's correctly specified, and all that. Right, OK.

Let's move on. I have some code here if you want to try running an SLX model by hand. Let me note that you can run a spatially lagged X model using simple OLS; it doesn't require any fancy maximum likelihood estimation or anything like that. The only really good reason to run it with a spatial package is to get these direct, indirect, and total effects and their p-values calculated correctly. But if you want to run the SLX model by hand, you calculate the average values of your neighbors' X's yourself; I show you the commands to do that here, creating the WX variables, and then you run a regular, simple OLS. What I found confusing, and I've had to ask some people about this, is that I got different R-squared values. Let me show you those results side by side.
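The significance command for the effects, plus the by-hand approach (again with placeholder variable names; only one lagged X is shown, but the same lag.listw call builds the rest):

```r
# z-stats and p-values for the direct/indirect/total effects.
# R sets the number of simulation draws; it is ignored for SLX
# (those effects are calculated analytically) but matters for the
# lag model later on.
summary(impacts(reg2, listw = listw1, R = 500), zstats = TRUE)

# SLX "by hand": build the lagged X's with lag.listw, then plain OLS
spat.data$lag.SALESPC <- lag.listw(listw1, spat.data$SALESPC)
# ...repeat for each explanatory variable, then:
reg2b <- lm(DUI ~ SALESPC + lag.SALESPC, data = spat.data)  # sketch
summary(reg2b)
```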
Here are the two sets of results side by side: spdep's lmSLX command versus doing it more or less by hand, calculating those neighbor averages yourself. You get exactly the same coefficient estimates, exactly the same p-values and standard errors, everything the same, including the residual standard error. The big difference is that the R-squared is much higher using spdep and much lower using OLS, for some reason; and for some reason we also get different degrees of freedom for the F statistic (13 and 221 in one, 12 and 221 in the other) and a different p-value for the F statistic. So let's figure out which of the two is right, if either one is. Run the by-hand commands: create the lagged variables, create a data set, run the regression using OLS, and there we see the lower R-squared. Now let's calculate the R-squared by hand, 1 minus the sum of squared residuals over the total sum of squares, and we get 0.1539, which is the same thing OLS is giving us. That makes me wonder why the multiple R-squared in the SLX output is so high. I'm going to have to ask around about that and see if it might be a bug in the SLX estimation procedure; we'll try to track it down and get back to you.

Well, due to the magic of making videos, here it is a few days later and I got it figured out: we found a bug. I say "we" because I would not have found this bug if I weren't making this video for you, so I guess you deserve a bit of the credit as well. You got to witness the process of finding a bug: try something, do it a different way, and if the answers don't match, probably something strange is going on. Here's my name (sorry, I couldn't list all of your names; I don't know who you are) enshrined in the R code: I reported the bug to Roger Bivand, and a day later he created a fix for the spdep package. It's January 2018 as I record this, so the bug fix won't immediately be in spdep if you download it today, but very soon, within the next few weeks or months, the fix will roll out, so by the time you follow along with this video you may get the right answer all along.

OK, enough of that; let's go run the other two models. Let's run the SAR model, although as I said, Roger Bivand does not like calling it that; he says SAR should stand for simultaneous autoregressive instead of spatial autoregressive. So let's call it the spatial lag model, the one with the lagged y. Here's the command, lagsarlm, etc. Let's run it and look at what it gives us: summary(reg3), and Run.

It gives us the slope estimates plus rho, and remember rho is the spatial lag parameter: it tells us to what degree our neighbors' values of y affect our own value of y, and whether the effect is positive or negative. Here we see a positive 0.38, and it is statistically significant. But here's the rub: when you estimate a spatial lag model, you cannot look at these slope estimates and interpret them as marginal effects. Why not? Because of the global feedback effect. Whenever we change something in our own region, like sales per capita of alcohol, that not only affects our y; when our y goes up it affects our neighbors' y, and when our neighbors' y goes up it affects our y again. There's this infinite feedback. So don't look at these slope estimates, and don't look at whether they're statistically significant; it's nonsense. What you have to do here is run the impacts command.
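A sketch of the spatial lag estimation (note that in R installations newer than this video, lagsarlm and its companion estimators live in the spatialreg package rather than spdep):

```r
reg3 <- lagsarlm(reg.eq1, data = spat.data, listw1)  # y = rho*W*y + X*beta + e
summary(reg3)  # reports rho along with the (non-marginal) coefficients
```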
command so let's do that we already did the impacts for a regression - let's skip down here to regression 3 which is that lag Y and if we just run this it tells us the direct indirect and total effects so the direct effect if we were to increase our own sales per capita by 1 what would happen to our arrest rate for DUI this is if all of our neighbors change their sales per capita how would it affect ours this indirect effect can also be interpreted as we increased our own sales per capita what's the total impact on all of our neighbors and then this is the total effect kind of the direct and the indirect both going back and forth what's the total now you also want to run this next version here with the R in it and the Z stats equals true to see which of these impacts are statistically significant run that this takes a few seconds because it's doing a lot of simulations to calculate these now one warning while we're waiting to these to come back caution I've noticed when you run these simulations multiple times this is simulating 500 if we were to simulate these another 500 times we're gonna get different Z stats and P values running another 500 times you get different Z stats and P days even if you crank this up to 5,000 times each time you do these simulations there's a lot of variants there so here we see these p-values for the direct indirect and total effects this one is 0.07 if we were to run this again whether it's 500 or 5000 times this p value could be 0.02 or 0.03 or it could go up to 0.1 1 or 0.15 I've seen a lot of variability here so be cautious not only might you want to choose a high repetition value here for these simulations you might want to run these simulations multiple times just to get some confidence that what you're seeing here is stable so that's the lag model let's go to the spatial error model and run it and look at a summary drag for here and here we're going to get an estimate for this lambda parameter and the lambda parameter here is positive 0.4 and that's this lambda right here that tells us if there is a stochastic shock to our neighbors how does it affect the value of our stochastic error term and we get a statistically significant value now unlike the spatial lag model you can look at these estimates and you can interpret these as marginal effects because the only thing that's happening here is involving the residual terms now there is an interesting test we'll come back to this and we'll look at a test that I think it was pace and Lesage have a paper where theoretically since all we're seeing here with a spatial error model theoretically if all we have is this spatial relationship in our errors it's a kind of spatial autocorrelation problem or you could think about it as a spatial heteroscedasticity I know I'm weird and saying that but in any case the residuals are non spherical non random alright there's one last thing we need to do and then we're going to wrap this video up and that is a spatial Hal's bend test so let me show you paper so this is a paper in economics letters by a Kelly pace and Jim Lesage where they developed this spatial Hausman test now what is a spatial Hausman test you might be asking yourself well in order to understand a spatial Hausman test you really need to understand a Hausman test if you haven't ever read well when I say read I really mean flip through this paper by Jerry Houseman in November 1978 please get a copy of this paper and read through the first couple of pages at least I mean I know whenever I start reading the 
All right, there's one last thing we need to do before we wrap this video up, and that is a spatial Hausman test. Let me show you the paper: it's a paper in Economics Letters by R. Kelley Pace and James LeSage, where they develop this spatial Hausman test. What is a spatial Hausman test, you might be asking yourself? Well, to understand a spatial Hausman test, you really need to understand the Hausman test. If you haven't ever read (and when I say read, I really mean flipped through) the paper by Jerry Hausman from November 1978, please get a copy and read at least the first couple of pages. I know that when I start reading the real theory and the proofs, my eyes start to glaze over as well, but this is a brilliant, Nobel-prize-worthy paper. Most of us are familiar with the Hausman test in the panel data sense, but what Hausman's paper really says is this: suppose you have a choice between two models (he states it for panel data, but generalizes it to all kinds of cases). In a panel data model you can either include fixed effects or not; the problem is that the fixed effects might be correlated with the variables you include, and leaving them out would cause bias. So the Hausman test for panel data asks: if I leave out the fixed effects, does it look like the other coefficients are biased? It compares the coefficient estimates from the model without fixed effects (sometimes called a random effects model) with those from the model with fixed effects; if the parameter estimates are very close, perhaps we're not biasing the parameters too much, and we gain efficiency by leaving out the fixed effects.

The spatial Hausman test, on the other hand, compares the estimates of two models: model one is OLS, model two is the spatial error model. If the model you should be estimating really is a spatial error model, then since the spatial structure is only in the residual term, a kind of spatial heteroskedasticity or spatial autocorrelation, it should not, in theory, bias the parameters if you estimate with OLS; the big benefit of estimating a spatial error model is that your standard errors will be estimated correctly. So the spatial Hausman test compares the parameters from OLS with the parameters from the spatial error model, and the estimates should not be too different. If they are, that's a sign of trouble. Here's the quote: "For a given set of variables, a divergence between the coefficient estimates from a spatial error model and OLS suggests that neither is yielding regression parameter estimates matching the underlying parameters in the data generating process. This calls into question use of either OLS or SEM for that set of variables." In other words, what Pace and LeSage are saying is that if the spatial Hausman test detects a significant difference between the OLS and SEM parameter estimates, maybe neither of those two models is correct: maybe we do have spatial dependence in the model, but a spatial error model is not the right way to capture it.

For the spatial error model we just ran, the lambda parameter was highly statistically significant, and other models and tests also suggested some spatial relationship in the data we're modeling. Let's see whether the Hausman test suggests that a spatial error model is not the proper way to capture it. Run it, and the Hausman test p-value is 0.057. If we were using an alpha of 0.05, this p-value is a little higher, so based on this alone we cannot reject the null hypothesis that the spatial error model could be the right one. But at a more lenient level, like 0.1, we would reject the null hypothesis.
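The test itself is one line, via the method provided for the spatial error fit:

```r
# Compares the SEM coefficient estimates against OLS; a significant
# difference suggests neither OLS nor SEM matches the data generating process
Hausman.test(reg4)
```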
We might use that as evidence that neither OLS nor a spatial error model is the right model for estimating these coefficients: there's enough of a difference to say that perhaps we should explore another model. Of course, there are many other reasons to explore other spatial and non-spatial models for this data, the first being: what does your intuition say the right model should be?

We've covered four models in this video: OLS, SLX, the spatial lag model, and the spatial error model. In the next video we'll cover some of the more complicated spatial models, and perhaps those tests and results will give us some intuition that one of them fits this data on drinking-and-driving arrest rates better than what we've done so far. So I'm going to call it quits for this video. I hope you've learned a lot, and if you have any questions about what we've done so far, please contact me. Leave a question or comment in the comment section below, like this video if you found it informative, and subscribe for more. I look forward to meeting with you next time, when we build on our results here and explore some more spatial models. BurkeyAcademy, signing out. Bye-bye now!
Info
Channel: BurkeyAcademy
Views: 41,803
Keywords: spatial econometrics, spatial Hausman, slx, spdep, spatial lag
Id: b3HtV2Mhmvk
Length: 40min 36sec (2436 seconds)
Published: Mon Jan 29 2018
Related Videos