Day 7 - WR674 - Simple features in R (Spatial Data Part 1)

Video Statistics and Information

Captions
Alright, here we are at day seven of our environmental data science class, and today we're finally going to learn a little bit about using spatial data in R. There's a phenomenal resource for this: a book by Robin Lovelace, Jakub Nowosad, and (I don't know how to say his name) Jannes Muenchow. It's free online, and you can also buy it as a book, and it does a really thorough job of answering any question you might have about working with geospatial data in R. I encourage you to read it, and at least before class on Tuesday, read the introduction, "Geographic data in R", and read the vector data section.

For this section of the workflow we're going to make a new repository, with "LAGOS analysis" as both the name and the description. If you're wondering what LAGOS is, LAGOS is a pretty sweet dataset made by Patricia Soranno, Emily Stanley, and a bunch of other really awesome scientists. It's basically a bunch of lake datasets joined together, I think over 400 different lake datasets, and it's really well documented. It includes data like the clarity of the lake, algae blooms, or information on nutrients, chlorophyll a, and other things that can indicate algae blooms. So it's a really useful dataset, and the wonderful thing about it is that they have a package that makes it really easy to download and work with their data directly in R. So that's what LAGOS is, and we're going to analyze some of that data for class.

We'll keep this repository public. Actually, I'm going to make it private just for this. Eh, whatever. OK, create the repository. I'm going to clone it with SSH, since that's what I use; you'll clone it with HTTPS. Don't fork it yet, because I'm going to add data in here. While you're watching the video, just watch along, and then in class we'll fork it and do a separate analysis. I'll open a new RStudio session and go back to New
Project. This should feel familiar now: Version Control, Git, and here's the URL. It's already in my environmental data science folder; I just want to make sure it's parallel. Yep. Open that, and that's what I want. Like this is fine.

Now that we have this here, we're going to go to File, New File, R Markdown. I'd prefer that you work in R Markdown from now on in class, now that we've gone over it. I'll call this LAGOS spatial analysis. When I make a new R Markdown document, I often just delete everything except for the setup chunk. You'll notice there's an option in the setup chunk that says include = FALSE. What that means is that it's not going to include any outputs, even though it's going to run the code in there. That's kind of nice, because it's another way to suppress the messages from all the packages you load, so I like loading packages there. I'm just saving it; we'll call this one LAGOS spatial analysis.

So we're going to load tidyverse, and we're going to load a very important package called sf; sf is the main spatial package we're going to use. We're going to load another really sweet package called mapview that makes it easy to look at spatial data interactively. We'll also load this package called LAGOSNE. You're probably going to have to install mapview, LAGOSNE, and sf before you load them. I'll go ahead and comment what they are. Spatial package: sf can read shapefiles and it can create them, and shapefiles are just spatial datasets. Then this one: interactive maps. And the last one: lots and lots of clean data.

The critical thing about LAGOS data is that the dataset is what's called analysis ready, so the data in here is actually pretty clean, unlike the Water Quality Portal or other datasets like that, where you have to do a lot of the cleaning yourself. One of the first things you're going to have to look up is the LAGOS
commands. So we'll load these packages, and a nice thing that they do when you load LAGOSNE is tell you where to get more information and how to cite the data. If you use this in a paper, you always want to make sure you cite it. I'll write this question mark, because I'm trying to get at lagosne_get. lagosne_get is a way to get LAGOS data onto your computer. Here they just run the defaults, and we want the default, the latest version. We're going to say LAGOS Northeast (they've only published the data for the northeastern United States, which is why it's called LAGOS-NE), and we're just going to say that it goes into the common path that it normally goes into. We'll go ahead and run this, and it's going to download essentially all of the LAGOS data directly from the internet and store it in a folder on your computer. It's always going to know where it is, because it's associated with the dataset.

Normally I talk about putting a data folder in the project, but in this case we're only going to be working, to start at least, with data that's already in the LAGOS dataset, so we don't have to move that data around. This download might take a while.

Once that file is finally downloaded, you should get a return that looks like this, and it tells you all the different pieces of information it has downloaded. After you've downloaded that data, you need to load it into R, so we're going to have to run another command. I'm just going to comment that this is the download script. If you run this again after you've downloaded it, it'll say that you've already got it and that you have to force it to overwrite. In this case it doesn't matter if we leave that in; it's nice to know that we already downloaded it and that it's fresh. Then I just asked how to load this data in, because I kind of forgot. So we'll go to lagosne_load, and basically you can just say
load. It looks like if you say that the version is NULL, it'll just pick up the newest one, so let's see if we can do that. We'll call it lagos, and we'll do lagosne_load, and what it's going to do is open up the newest version. It's good that we left the version argument empty, because if we had filled it in like in the help file on the bottom right, we would have gotten an old version, and that's not what we want.

When you load in LAGOS data, you get a list of 37 different objects. A list like that can hold data frames, vectors, spatial objects; it can hold all kinds of stuff. So the first thing we might want to do is check out what's in that list. OK, it looks like there's some county information, there's land use/land cover change, there's all kinds of different stuff in here. I know from experience that the thing we want, to look at where these lakes are, is this thing called the locus. The locus is the lat/long of the center of a bunch of lakes in the US. So we'll say lake_centers, and here we'll do lagos$locus. The little dollar sign is just calling an element of that named list, so now we're just getting this thing called locus. Now we have a very large data frame with 141,000 observations of 18 variables. With that, we have most of the data we'll play with loaded in.

So let's make a new chunk, and we can write some stuff up here. I'm going to split this out so that we can see it. What I just did to split that out: I went up into the code itself, hit Enter, and then hit Ctrl+Alt+I, and R does this clever thing where it knows to cut off this chunk so I can make a new one. I inserted a chunk inside of a chunk, and it knew that meant I wanted to close off the chunk. I'm not going to call this the setup chunk; I'll call it the data read. And I'll say the title of this whole thing, the first level of
analysis, is that this is the LAGOS analysis. Then we'll say: loading in data, specifically grab the locus, so this gives us location. We've already run that. And: grab the lake centroids. OK, so now that's commented and a little bit more understandable, and we have this thing called lake_centers.

Something I always like to do when I have a new dataset is look at the names, so I'll do names on lake_centers. OK, it looks like it's got a bunch of different columns. It has lagoslakeid, so that's going to give us an ID for this dataset. It has an NHD ID; NHD is just the National Hydrography Dataset, which has IDs for all the lakes and rivers in the country. Then it has a bunch of other information that we might or might not use. It tells us where the lake is geographically and what HUC it's in, which is basically what watershed it's in. The pieces that we're going to play with first are nhd_lat and nhd_long, which are just the lat and long of each lake. So here I'm going to say: convert to spatial data.

That just gave us the names, but maybe we should look at the structure before we start manipulating the data, so it's str on lake_centers. OK, so nhd_lat and nhd_long are numeric, and they're lat and long in decimal degrees. That's helpful; that's good to know. I'm going to say "look at the structure", and I'm going to comment the call out, because I don't want to look at the structure every time, but at least you'll know how to look at the structure. And then "look at the column names", and it's just going to automatically comment it out. Finally, if you really wanted to see the data, you could view the full dataset with View on lake_centers. When you view large datasets like this, you want to be careful that you don't load too much, because it can bog down R. If this was a million points, it could take a while. So one nice way to look at less
data is that you can index it in a base R style, where you say, hey, I want the first 100 rows. In R the first index is rows and the second one is columns, so I'm asking for rows 1 through 100, and if I leave the second index blank it means all columns. I'll go ahead and do that, and now I just have a hundred rows instead of a hundred thousand. In the exact same way, you could do View on lake_centers with slice 1 through 100; slice just means it's going to grab the first 100 rows. So we'll do that; that's the tidyverse version. Now we can look and actually see what the names of these things are. There's a bunch of different ponds and lakes, lat and long, lake area, et cetera.

Then we want to take what is essentially a data frame (again, when we did str, it said this is a data frame) and convert it to a spatial object. How might we do that? Well, there's a sweet function in the sf package called st_as_sf. It's saying, I want to take a table and turn it into a simple feature, which is a spatial object. You can do it with a bunch of different things: you can do it with polygons, you can do it with points. Here we have point data, and I happen to remember that the way you do it with point data is you assign the coordinates, the long and lat. So we just need to take our dataset; we'll call the result spatial_lakes, and we'll do st_as_sf, then lake_centers, and then coords. Here we need to figure out whether long or lat comes first, so we'll look for coords in the help file. So coords is, "in case of point data: names or numbers of the numeric columns holding coordinates". OK, well, it still doesn't tell us the order. It looks like in their example you have x and y. I always get this confused, but essentially that means that we want to have lat first, so latitudes
are the x, so that's going to be this column called nhd_lat, and then we want to have long next. We'll go ahead and run this command, and now we have this thing called spatial_lakes. Maybe we want to look at a subset of the data, so we'll say subset_spatial, and we'll do spatial_lakes and again slice this 1 through 100. Let's just see what the first 100 rows look like. Here we'll use another new function called mapview, a dynamic map viewer. I can't see what I'm writing because of that thing. So we'll do mapview on subset_spatial.

OK, so what we can see here is that it's plotting these points, but notice that I should be able to add some base maps to this, and there's no context. We should see something like Minnesota or Wisconsin or New England; somewhere in here we should see other states. So something went wrong when we read this data in. If we look at st_as_sf again, back in our help file, we have the coordinates in there, but we're missing what is called the projection. We need to know how this data is projected, and the way you specify that is with what's called a CRS. We can look whether that's in here. Yeah, so these dots mean additional options that it's not going to go over here, but they could include named arguments like crs or precision.

This is not a GIS class, so I'm not going to fully explain it, but a CRS (coordinate reference system) is basically the projection system for the data, and if you've taken a GIS class you know that knowing your projection system is really important. Lat/longs in decimal degrees are often in WGS84, which I happen to know the code for: 4326. If you don't know the code, you can go ahead and look it up; you can search for WGS84. The code that we're entering is called an EPSG code, and it's just a way to simplify projection information. You can go to
these spatial reference sites. Normally you can go to those... huh, well, that one's broken. Oh, there it is. So you can go to these spatial projection sites and it'll tell you, OK, the code is 4326, and it's in lat/long in degrees, so it looks like the data we have, which is nice. And then you can see where it covers on the globe... I think. Maybe you can't; normally there's a map over here, but this site looks weird today. Anyway, that's how you can figure out your projection. Frequently it's going to be that you know your data is in NAD83, so you could search for something like "NAD83 Colorado EPSG" and you'll get these different projections; that EPSG would be 2232. So there are all these different projections. Again, this is not a GIS class, and this is a bit of "here be dragons", so you really, really need to know your coordinate system before you play around. If you're working directly with shapefiles, shapefiles include the coordinate system, but here we'll add this coordinate information so that the simple features function knows that this is in WGS84, which is a global coordinate system.

So I'll rerun that, we'll take the subset again, and now let's try the mapview again. OK, it turns out the first hundred points are somehow in Antarctica, and that's not very likely, right? Look at the world imagery... OK, no, that's definitely not right. So, as I frequently do, I got the long and lat backwards, I would assume. So I swap them, rerun this, re-subset it, and now we have some sensible data. Now we can look: we're in Minnesota, and we can see if these points are indeed at about the center of these lakes, and that's exactly what they are. You can quickly change the background imagery, so you can see, does this look like an algae bloom, or is it a clear lake? We just have the first 100, but this is a large dataset of over a hundred thousand points. I wouldn't recommend using this mapview function for more than 50,000 points; it can get really slow. But this is just a first
step, and you usually want to subset when you're plotting big datasets, so I'll comment this: just a subset for plotting. But indeed, now we have it right: spatial_lakes looks like it's working, and our subset just helps us map the first 100 points.

All right, it's a new day, but we're still in here coding on this LAGOS dataset. We just plotted this subset of spatial data, and we subsetted it kind of arbitrarily with the slice function. But what if we wanted to subset it by location, say we only want data from Iowa or Minnesota? I'm going to show you how to do that, but first we need to load in data that includes shapefiles of the United States. So we're going to load this package called USAboundaries, and it has what it sounds like: USA states and counties. You'll probably have to install it to get this to work. We'll add another section here. What USAboundaries allows us to do is immediately load a dataset of USA polygons, for states for example. So we'll do states, and we'll call this function us_states, and it returns this object called states. Let's look at the structure of states. So states is an sf object, so it's already a spatial object, and it's a data frame, which means it has two classes. The data frame part is filled with things like the state FIPS, which is just a code for identifying states, and it has some other geo codes. What we might want to filter on is something like name. So we have this object called states, but we want just Minnesota, so I'm going to call it minnesota, and I'll take states, and, just like a regular data frame or tibble, we can filter this spatial data directly in R. So we can say filter, name equals-equals "Minnesota". Right, and now we have just one object called minnesota. If we want to check what's in these
files, we can always do a mapview. states only has 52 observations, so let's do mapview of states, and there we go, we got it. We have Hawaii over here, we've got Puerto Rico, so we have all these states, and if you click on them some of the information pops up. OK, that's great. I'm going to add a comment, "check if they loaded", and then comment the call out. Now let's check whether, when we filtered on Minnesota, we indeed got Minnesota back. Great.

So now we have an object called minnesota, and we also have this object called spatial_lakes, so what we can do is filter spatial_lakes to only Minnesota lakes. There are two different ways to do that; one is a really simple way where we subset with square brackets. So we do minnesota first... sorry, not minnesota, this is going to be spatial_lakes. We're going to subset spatial_lakes (I lost my tick marks; there we go), spatial_lakes with square brackets, and we're going to say we only want the spatial lakes inside of Minnesota: minnesota, then a comma. That says, hey, I only want the rows of this spatial_lakes dataset that are physically inside of this polygon.

Here it throws a warning: "although coordinates are longitude/latitude, st_intersects assumes that they are planar". This is an important warning, and you should really understand your dataset before you assume that this actually worked. That's because if we look at the projection system of minnesota, it's still in EPSG 4326, which is a global coordinate system, so it's not actually projected onto a flat surface; it's on a globe. When it says planar, it's saying, we assume you've already done the work to make this a planar thing. To fix that, and to make sure that our data actually matches up, we can run another command where we transform the projection. So now we're
going to project it onto what's called an equal-area USA projection, and I just happen to know the EPSG for that is 2163. So now minnesota will be in this new projection, 2163, and then we can try to subset. But here we're going to get an error, because they're in two different projections, and you can't subset when the two objects are in different projections. Here it says the two st_crs don't match, so it doesn't work. To fix that, we can go back up to spatial_lakes. Here we told it the original projection is 4326, but we want to transform it to this equal-area projection, so, 2163, and now they'll be in the exact same projection. This command will take a little bit longer, because it has to project 141,000 different points, but it's actually still pretty fast. We still have minnesota in the same projection, so now we'll do the subsetting, and you can see we've lost that error.

This minnesota_lakes is actually still 29,000 observations, and that's right at the limit of what mapview can handle. So I'm just going to comment this, "subset lakes based on spatial position", and then here we'll still subset the Minnesota lakes, just to not plot a million of them, but we'll keep a thousand this time, and I'm going to pipe that through. So here I just called this minnesota_lakes object, I'm saying keep the first 1,000 rows, and then pipe that forward to mapview. And there we go: we've got a thousand lakes plotted all over Minnesota, and they have all this information if you click on them.

Another thing we might want to do when we're plotting this many lakes is use some of the lake information to color the different bubbles. So let's color it by lake area, which is called lake_area_ha. There's this function inside of mapview
where you can set (not a function, an option) zcol equals lake area, and now it's going to go through and color all these lakes by area. Basically, the vast majority of them are less than 5,000 hectares, and then there are some big ones in the midst of that. If we wanted to make that plot a little bit nicer, we can also arrange the data by lake area, and we would want that to be descending... no, actually we want it to be ascending, so that the smaller lakes are plotted first and the big lakes plot on top of them. Hopefully that'll make some of the bigger lakes pop out from the background of lots and lots of small lakes. There are just not that many big lakes, but you can see that this dot here is plotting over the ones adjacent to it, and same with the yellow dot; they're plotted last, so they show up better on the map.

OK, I think this video is at about 30 minutes, so we'll stop here. In class we'll try to link some of this lake spatial data directly to lake water quality data and see if there are any spatial patterns of lake water quality in Minnesota.
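
The workflow described in the captions (data frame to spatial points, reprojection, spatial subsetting, ordering for plotting) can be sketched roughly as follows. This is a minimal illustration, not the code typed in the video: the four lake rows are made-up values standing in for the LAGOS locus table, and the rectangle is a hypothetical stand-in for the Minnesota polygon from USAboundaries. Only the sf package is assumed to be installed; the column names nhd_long, nhd_lat, and lake_area_ha follow the ones mentioned in the captions.

```r
library(sf)

# Made-up stand-in for lagos$locus: lake centers in decimal degrees.
lake_centers <- data.frame(
  lagoslakeid  = 1:4,
  nhd_long     = c(-94.5, -93.2, -89.4, -95.9),
  nhd_lat      = c(46.3, 44.9, 43.1, 47.8),
  lake_area_ha = c(120, 15, 3900, 48)
)

# coords is given as x then y, i.e. longitude first, then latitude.
# crs = 4326 declares that the numbers are WGS84 lat/long.
spatial_lakes <- st_as_sf(lake_centers,
                          coords = c("nhd_long", "nhd_lat"),
                          crs = 4326)

# Hypothetical rectangle roughly around Minnesota (not the real state outline).
region <- st_as_sfc(st_bbox(
  c(xmin = -97.5, ymin = 43.5, xmax = -91.5, ymax = 49),
  crs = st_crs(4326)
))

# Put both layers in the same projected CRS (EPSG:2163, the equal-area
# projection named in the video) before subsetting.
lakes_2163  <- st_transform(spatial_lakes, 2163)
region_2163 <- st_transform(region, 2163)

# Square-bracket subsetting keeps the rows that intersect the polygon.
lakes_in_region <- lakes_2163[region_2163, ]

# Sort ascending by area so that larger lakes are drawn last (on top);
# this mirrors arrange(lake_area_ha) from the tidyverse.
lakes_in_region <- lakes_in_region[order(lakes_in_region$lake_area_ha), ]
print(lakes_in_region)
```

If mapview is installed, `mapview(lakes_in_region, zcol = "lake_area_ha")` should give the interactive, area-colored view described above. Swapping the two names in `coords` reproduces the mistake from the video, where the points land in the wrong part of the world.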
Info
Channel: Matthew Ross
Views: 238
Rating: 5 out of 5
Id: KxNW72CS1Ys
Length: 28min 25sec (1705 seconds)
Published: Thu Sep 12 2019