Make Your Own Interactive Map of COVID-19 Spread Using R Shiny

Captions
- Hi everyone, I'm Caitlin. Today I'm going to do a coding tutorial based in R and Shiny. For those who haven't worked with R or Shiny, we've linked some other videos on the page introducing R and Shiny separately, so you'll want a little background in both to feel comfortable with this video. We've also linked a bunch of really fabulous free R resources if you want to dive into R not just for data viz but for data science and programming in general. I wanted to focus on a hands-on example of how you can work with actual COVID-19 data that's out there and public. There's a lot of it, and unfortunately not all of it is created equal; some of it is so messy it's hard to even explain how messy it is. We're going to start with a friendlier example from the New York City Health Department. Since I'm in New York City and Rockefeller is in New York City, this is a more personal look at COVID-19. We'll start on GitHub, with the source of the data I'll be using today. For those who haven't worked with GitHub before, it's a phenomenal place for the broader data community: code and data are shared and stored in what are called repositories, or repos. This is the Health Department's username; I actually didn't know whether they had other data here, so we can take a quick peek. Oh, it turns out it's just COVID data. A tough thing about COVID for a lot of health departments was that the urgent need to share data with the public arrived really fast, and many health departments had no capacity to do this in any way, let alone to handle and collect the data in the first place.
New York City in general has pretty good open data resources across its various departments, and I totally recommend checking out those data sets, because there's a lot of really fun stuff you can learn about the city that way. COVID hit everyone very fast, and the need for information, and then to visualize and analyze that information, came just as fast. New York City had the capacity to ramp up, but a lot of places didn't, and it took them a while to get to a point where they could share data. As a scavenger hunt on your own time, try looking at data from other states and counties to get a sense of how variable this looks. This is the repository we're going to work with today. There's a lot in here, and we're not mapping every single thing. The game plan is to create a map of the city broken down by a modified form of zip code, which we'll talk about in a bit, and on top of that map we'll layer COVID data and the spread of COVID-19. We'll look at a few different metrics, the most basic being case rate: cases appearing per a certain number of people in the population. Take a minute to look at the data itself and what it is. I talked a lot in the last video about how you really have to be familiar with your data before you even begin to think about visualizing it. The NYC Health Department has its own set of visualizations on its data webpage. We're going to make something similar, but focusing on just a handful of metrics, whereas they show a bit more. Already you can see, and this is mirrored on the GitHub as well, how they're defining cases. What is a confirmed case? People with a positive molecular test. Probable cases? People with a positive antigen test. So right away, there are different ways of assessing whether someone has COVID.
You're probably familiar with the different types of tests available: ones that detect the presence of an antigen, ones based on PCR. There are many ways to actually detect COVID-19, and because of that, there are many ways to define whether someone has a case or not. So we're going to make a version of this. This is a static map, what's known as a choropleth map, which just means that regions are colored by a data value. In this case, it's the seven-day percent positive. What does that mean? It's the percent of people who had a test who tested positive, which is actually a quite strict definition. This is not the percent of the full population that tested positive; it's the percent of the people who actually got tested who tested positive. And they have this really important note that all data are incomplete, because we're not testing every single person. The other feature here is that it's a seven-day figure, meaning these numbers are aggregated on a weekly basis. Individual days can really mess up data collection, so to correct for that noise, and to avoid attributing too much interpretation or weight to any one day, you average across seven days to calculate a percent positivity rate. Imagine this were a daily percent positive. Say for one of these zip codes down in the Lower East Side, three days of data were input on one day instead of being split across those three days. Someone coming back to this map from the day before would see the region suddenly become a much, much darker color, and they could interpret that as something happening on that day to cause a huge burst in percent positivity, which wouldn't be the truth. The disease moves on a slower division of time than that.
All those new cases would just be dropping in at once. This is why we have the weekly averages, or aggregations I should say, of percent positive: they're more stable and better reflect change happening in real time. Some people extend this even further to reduce potential noise, to a 14-day window for example. We just had the holidays in December, and that was actually a huge problem for reporting in a lot of states and counties. Even the weekly averages were susceptible, because there was a longish period, some people take more than a week of vacation, when people weren't getting tested and there were backlogs on the reporting end. That produced artificial drops in the numbers, which people had to clarify for their audiences when publishing data around this time. Hopefully they did; some of them I'm sure did not. This map is roughly what we're going to make, but we're going to do it over the course of several months, starting in August, since this seven-day calculation didn't actually start happening in New York City until early August. Jumping back over to the GitHub repo: in any GitHub repository you'll often encounter a README Markdown page (that's what the .md stands for). This one is pretty awesome in terms of how well the data is explained. It even goes into detail about how to download the data and how to make requests; if you have questions, you can submit an issue on GitHub. It also documents updates to the way the data is described and the new metrics that get included at different points, because these things happen. For example, they changed from "PCR test" to "molecular test". It's a technical distinction, but they get into this stuff. There are so many different facets to talking about COVID, and they really do a nice job here of explaining the data.
On your own time you can definitely explore the different types of data here, but we're going to focus on the trends folder, since we're looking over the course of several months. In this folder we have lots of CSVs, comma-separated values files. The department recommends against interpreting daily changes as one day's worth of data, due to the difference between the date of an event and the date it's reported. That's really key, and something I didn't actually mention: people associate an event being recorded on a given day with it actually happening on that day. But think about real life. If you've tested for COVID or have had COVID, there can be a lag between the testing event and when it's actually reported to the system. It's not instantaneous, where you get a positive test and somewhere on a server a record is updated to indicate your positive COVID case. That's an inherent limitation of this type of reporting, and it's part of why we're focusing on weekly data. There's a lot of data in here: antibody testing; case rate, which is one of the ones we're going to look at; cases by day, which has all of these different metrics. The prefixes here sort the data by borough; people wanted to compare, for New York City, the Bronx versus Manhattan versus Queens versus Staten Island. Those were important, especially for getting broader-area aggregations of hotspots and things like that. There's also deaths by day and hospitalizations by day. Percent positive, again, is one we're going to use, and then we'll also take test rate. If you notice, the three data sets I'm mentioning are the ones suffixed with MODZCTA. ZCTA is a specific type of geography: we can talk about the whole city, we can talk about boroughs, sub-regions within the city as I mentioned before, but then we can also talk about zip codes.
MODZCTA is not exactly zip codes, which I'll talk about in a bit, but it's a way to map regions defined by different zip codes. It gives a much smaller region, so we get a sense of COVID spread at that finer scale, much like this map: these are the MODZCTA regions I was talking about. This is much better than just comparing full boroughs. Why? For those who know their New York City boroughs, you can look within these big boroughs and see that there's actually a lot of variability, even across adjacent zip codes. That's really important for informing a public concerned about travel, quarantines, and transmission of the virus, people wanting to be informed about the severity of the disease. It's also helpful for people on the government side who are trying to figure out where to prioritize resources if COVID is spreading more severely in one region than another. This affects hospitals too: hospitals are definitely not distributed evenly across the city, so a hospital here is probably going to need more resources than a hospital there. That's why we're looking at these smaller regions. Again, I'm filtering and making decisions about the viz we're going to make today, since this is a tutorial, but there's a lot of data here, and you don't have to do exactly what I'm doing. This is meant to teach how to make a specific type of visualization, and it's going to be interactive too: in this one, when you hover over a zip code, you get some information about it. Whatever you do, whether you're working with this data or a different dataset, make sure you understand what the data is. This documentation is super detailed. We're doing the seven-day aggregations as well.
So people with molecular tests are aggregated by week, from each Sunday to the following Saturday; that's how they're classified and logged. Again, these are seven-day aggregations. We want to download this data, and there are a couple of different ways to do it. If you have experience with the command line, you can interface directly with GitHub, using Git commands (G-I-T) to pull this repository. You can also download it manually: literally click Code here and download the zip file. Additionally, you can clone the repo, and if you're working within GitHub yourself and have your own GitHub account, you can fork the data, creating essentially a copy of it and branching it off into your own space to work with. The way we're going to do it today, and I'll show two different ways, is using R. You can copy the link address where it says Download ZIP by right-clicking (or control-clicking) and copying the link address. The other thing you can do with RStudio is use the address ending in .git when you load RStudio. So now that we're in R: if you don't know some of the terminology I'm using about R or Shiny, definitely check out those other intro videos I've done introducing the language and how to work with some basic stuff. I'm starting off with a pretty much clean workspace. One thing you can do, if you like to use R projects, is start a session from a GitHub repo. For example, if you go here, I'm not actually going to do this, but when you create a project you can give it its own directory and set up a new working directory. If you want to start from a Git repo, you go to Version Control, which gets at the main benefit of using GitHub: you can track versions over time. You click Git, then repository URL, and that's the .git address I was showing you on the NYC Health GitHub page.
You drop the URL in here, give it a name, and put it wherever you want on your computer. So that's an option too. The way I'm doing it here is a little more direct, within the script. I'm calling this script nyc_covid_prep_data.R, and you'll see why in a second. This is how I'm downloading the file. First, of course, I'm setting my working directory: wherever you want to work on this project, make a clean folder for it and point RStudio at that working directory. Once you do that, you download the file from that URL for the zip, and you set a destfile, which will be the name of the downloaded file; the folder that pops up is named after the repo, ending in -master. I can run this super quickly and it downloads that zip, and then we can unzip it now that it's there. I've already moved to my directory and done this before, so you'll see the files already in place, but the data is updated all the time, so if you wanted to do it over again, you'd re-download the zip, which is what I just did, and then unzip it in the same location. It did that, so let's take a look to make sure it's all there. Yes, it is; we have our folders. The trends folder is where we're going to grab percent positive by MODZCTA, test rate, again by MODZCTA, and case rate; those are the three we're going to grab from this folder. On top of any R Markdown or R scripts, whatever you're using, we want to load some libraries. The tidyverse is how I work with data, recommended in general for data wrangling, and it includes ggplot2, dplyr, and all that good stuff. Vroom is a much newer package.
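Here's a minimal sketch of that download-and-unzip step. The URL and file name are assumptions based on GitHub's standard "Download ZIP" link for the nychealth/coronavirus-data repo; the download is wrapped defensively so a missing network connection doesn't stop the whole script.

```r
# Download the zipped repo into the working directory and unpack it.
# URL assumed from the repo's "Download ZIP" link; adjust if the repo moves.
url      <- "https://github.com/nychealth/coronavirus-data/archive/master.zip"
destfile <- "coronavirus-data-master.zip"

ok <- tryCatch({
  download.file(url, destfile, mode = "wb")  # "wb" keeps the zip binary-safe
  unzip(destfile)                            # unpacks into coronavirus-data-master/
  TRUE
}, error = function(e) FALSE)                # FALSE if offline or URL has changed
```

After this runs successfully, the trends/ folder referenced below sits inside the unpacked directory.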
It allows really fast reading and importing of data, specifically CSVs for the most part; it can recognize delimiters and import things pretty quickly. So that's awesome. sf and tigris are packages for working with spatial, geographic data, so those are important. I'm listing here why we're importing all of this. Let's quickly load all of these libraries. Now that they're set up: vroom, like I said before, is a way to read in data, and I like it for CSV files that are pretty straightforward. Usually I look at the delimiter first. What I'm doing here is reading in these data frames, and as you can see, I vroomed in these CSVs: it recognizes each is a comma-separated file and picks up all of the different columns. So these are three separate data frames I've just generated. Let's take a look at case rate first, just to see what this data looks like; that's always the first thing you want to do. There are a lot of columns here. This is the raw data from the city. Remember, I mentioned that this seven-day aggregation didn't really start until August of this year, and a lot has changed since August, so it's fun, fun or depressing, to look at the trends over even these past few months. We have week_ending, which means the cases reported in the subsequent columns are reported from the Sunday before through the end date listed here; that's why it's "week ending". We have data all the way up to a fresh start in 2021, so that's our last week for comparison. Going across, these are pretty nicely labeled columns. We have the case rate over the whole city. Why is it a rate and not just a raw number of cases? That data does exist in the repo.
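To show vroom's delimiter detection in a self-contained way, here's a toy read from an inline string standing in for one of the trend CSVs; the commented file paths are assumptions about the repo's trends/ folder layout.

```r
library(vroom)

# vroom() guesses the delimiter and column types automatically.
# Inline toy data standing in for e.g. trends/caserate-by-modzcta.csv:
toy <- vroom(I("week_ending,CASERATE_CITY,CASERATE_10001\n01/09/2021,250.3,180.1\n"))

# The real reads would look something like (paths assumed):
# case_rates  <- vroom("trends/caserate-by-modzcta.csv")
# percent_pos <- vroom("trends/percentpositive-by-modzcta.csv")
# test_rates  <- vroom("trends/testrate-by-modzcta.csv")
```

The result is a tibble with one row and three columns, with the numeric columns parsed as doubles.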
If you want to take a look at the raw numbers of cases, you can, but this rate, as the GitHub README mentions, is calculated as the number of cases per 100,000 people. If you remember from my last video: why do this? We want a rate that includes information about population, some sort of per-capita measurement, because we really need to normalize the numbers. There are different numbers of people in different parts of the city; the population is concentrated differently. You can't compare one region to another unless you do some sort of normalization for population. That's why we're looking at case rates here as opposed to raw case counts. So we have case rates across the city. New York City does not have uniform COVID spread; it's super regional. We have borough breakdowns, and then we get into the zip codes, from 10001 all the way to 10280. This is cool; this is a lot of data, and it looks clean. There aren't a lot of weird formatting errors. A good thing to check whenever you get a new data set is the class of the different variables. We have doubles here, which is good; those are just numeric data. The date, however, is actually a character, which is fine; there are a couple of things we'll do in the code to change that. So double-check what your data structure is before you do any analysis. The other point I wanted to make: remember, this data file was by MODZCTA. Let's talk about what MODZCTA is, because it's a little confusing. Again, if you have a question about the data, going into the actual README for the GitHub repo is a really good place to start. The data here is accompanied by this lovely explanation.
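The per-capita normalization itself is simple arithmetic; here is a toy example with invented populations and counts to show why the same raw count means very different things in different places.

```r
# Two hypothetical zip codes with the same raw count but different populations:
cases      <- c(zip_a = 150,   zip_b = 150)
population <- c(zip_a = 20000, zip_b = 100000)

# Rate per 100,000 residents normalizes away the population difference:
rate_per_100k <- cases / population * 1e5
rate_per_100k
#> zip_a zip_b
#>   750   150
```

Zip A's 150 cases are five times more alarming per resident than Zip B's, which raw counts would hide.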
So MODZCTA stands for Modified Zip Code Tabulation Area, and ZCTA is just Zip Code Tabulation Area, because a zip code, as it's explained here, isn't actually a region. It's not like a county or a state; it's actually a collection of points that make up a mail delivery route, which is not something we typically think about. But when we're talking about regions in space, when we're dealing with mapping in R, or really any viz software, we need to understand what the regions are bounded by. There are also issues with some areas, like single buildings, having their own zip code, and weird stuff like that. Essentially what the Health Department did is create modifications to these ZCTAs that take those tiny zip codes, the ones affiliated with maybe a single building, and combine them into slightly bigger regions, to make the population-size estimates more legitimate. Dividing by a hundred thousand is a good normalization across the board, but doing it for a place with a tiny population might not actually capture the regional spread of a disease, so you want to keep your geographies fairly comparable in size. That's a fairly deep level of detail for mapping, but it's the kind of thing you run into when you're dealing with geographical data. To deal with it, they offer this ZCTA-to-MODZCTA file. Take a quick look at it and you'll notice that for the most part it's a one-to-one mapping, and you might wonder why they even bothered. It's pretty much uniform for everything until you get down here; look at this. The MODZCTAs are on this side.
This one MODZCTA encapsulates several different zip codes, because those are tiny, tiny zip codes, not comparable to a more standard one. The way we're going to handle this is to use the MODZCTAs, but if you wanted to make a more enhanced visualization, you could build a mapping to convert back to ZCTAs. With a little added complexity, the way we'll do it in our final app is to include this table for lookups too, so if you live in a zip code that's one of these smaller ones, you can go check that it's mapping over correctly yourself. So these are the files we're actually going to read in. The geography resources here are exactly the same as what's on the GitHub repo, just now in the comfort of my little RStudio space. There are a couple of different ways shape (geography) data can be stored. The one we're going to work with today is the shapefile, the .shp files, from 2010, the last time these domains were defined. This gets into infrastructure stuff in the city and how these lines are drawn, things like precincts being redistricted, county lines, congressional districts; depending on what you end up doing with this in the future, you can get really, really far into shapefile data. But for now, very simply, we use the st_read function to read in this shapefile. That's what's needed to read in geographic data, and the function is part of the sf package I loaded earlier. It comes in as a simple feature collection with 170 features and two fields. This is how geometry data is stored: these are literally the lines along which the boundaries are drawn on a map, and this is how R is storing that data right now.
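The st_read() call itself is one line. Since the NYC shapefile isn't bundled anywhere standard, here's the same pattern on the North Carolina demo shapefile that ships with the sf package, as a stand-in.

```r
library(sf)

# st_read() works the same way for the MODZCTA shapefile; here we read the
# demo shapefile bundled with sf.
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# The result is a data frame with a list-column of geometries:
class(nc)  # includes "sf" alongside "data.frame"
```

For the tutorial data, you'd point st_read() at the .shp file inside the downloaded geography folder instead.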
Now that we have our data, we want to work with it a bit to get it into a shape where we can visualize it. Checking this out: we have all the data we could ever want, but it's not exactly in the right format to work with. Why? We want to map by zip code, or ZCTA I should say, but right now that zip code data is spread across this data frame. In a classic data cleaning, or I should say data wrangling, situation, we're going to shift the format of this data frame into a long format. What does that mean? Instead of having all of these variables spread out in a wide format, where for every week the different zip codes go across this way, we want a column that is zip code, with the case rates stacked lengthwise, so we end up with a few columns instead of well over a hundred. The first thing we do is remove what we're not interested in: the borough data and the citywide data. So we remove those first few columns, and I'm calling the result case_rates just to give it a new variable name. Let's take a look at what that is. Now we've gotten rid of the citywide and borough-wide columns, and we just have zip code data. Awesome. Next is the reshaping. The tidyverse actually introduced, very recently, a new pair of functions, pivot_wider() and pivot_longer(); it used to be gather() and spread(), and I think the new language is a bit more intuitive. Essentially what I'm doing is moving into the long-form format, pivoting it longer. I'm taking all of these zip code columns (I think I said 22 before; that was wrong) and putting them all into one column called MODZCTA. The prefix for those columns is "caserate_".
That prefix precedes all the zip code data I want, and it's going to be removed: we don't want column names that say "caserate_" over and over, we just want the actual zip code itself. What will the values be called? They go into a new column called caserate. So, switching from individual case rates to this long format as case_rates_long: just like that, we have our weeks, we have our MODZCTA, and we have our case rates, and this is a lot easier to work with when we're trying to visualize it. Now we're going to do the exact same thing to the percent positive and test rate data frames. Just to prove to you that these are almost identical in format (definitely check before you do this yourself), let's look at the head of this. Instead of caserate we have pctpos, percent positive, with the citywide column here in the front, then the boroughs, and then the zip codes. It's literally the same format, which means we can use almost the same code. Definitely check first, of course, and the same goes for test rate. I should say again: the reason it's a rate is that these numbers are calculated per hundred thousand people, and percent positive is the number of molecular tests that came back positive out of all the people who got those tests. Okay, I'm going to run this next batch of code, and then we'll have our reshaped, long-format data frames. Now that we have those three data frames, we're going to merge them together. To do so, the tidyverse again gives us some pretty cool tools: left_join. In a left join, if you imagine A on the left and B on the right and we want to join the two data frames, we match based on values, and all the rows that exist in the left data frame are preserved.
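The reshaping pattern looks like this on a toy version of the wide table. The column names here are modeled on the repo's CASERATE_ prefix; assume the real frame has one such column per MODZCTA.

```r
library(tidyr)

# Toy wide table: one row per week, one column per zip code.
case_rates <- data.frame(
  week_ending    = c("01/02/2021", "01/09/2021"),
  CASERATE_10001 = c(250.3, 270.1),
  CASERATE_10002 = c(310.7, 295.4)
)

# Gather every CASERATE_* column into (modzcta, caserate) pairs,
# stripping the prefix so only the zip code remains.
case_rates_long <- pivot_longer(
  case_rates,
  cols         = starts_with("CASERATE_"),
  names_to     = "modzcta",
  names_prefix = "CASERATE_",
  values_to    = "caserate"
)
```

Two weeks times two zip codes gives four rows, each carrying week_ending, modzcta, and caserate.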
Then the matching rows on the right are brought over. In this case we already know that all three data frames agree on week_ending and MODZCTA, but definitely double-check before you do this, because it affects the type of join you should do. Here we're matching to the leftmost frame, and of course you can pipe this: first we match case_rates_long to percent_pos_long, and then on top of that we join in test_rates_long. This is a really easy place to mess up your data frame, depending on what values you're matching on. Once I run that and everything is merged, let's take a look: this is exactly what I wanted. We have all three in one lovely little data frame, merged on week_ending and MODZCTA, with case rate, percent positive, and test rate. But it's missing something: the MODZCTA shapefile. That geography data, if you remember, looked a little something like this: a feature collection, with polygon geometries, a multipolygon for each MODZCTA. The two data frames have one thing in common, and that's MODZCTA. To join them while preserving the geometry data, we can't just use a standard tidyverse join; we have to use something called geo_join, a cool function from the tigris package. Running this, we've now joined the two data frames: we have a geometry file that also includes our week_ending, our case rate, our percent positive, and our test rate. This is the data frame we really wanted. We're going to make one more change before saving it, but this is looking really great. Again, I added a note in the code here, and we're going to put this code up on the website.
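The chained joins look like this on toy data. The column names are assumptions mirroring the reshaped frames; the final geo_join to the shapefile follows the same matching idea while keeping the geometry column intact.

```r
library(dplyr)

case_rates_long  <- data.frame(week_ending = "01/09/2021",
                               modzcta  = c("10001", "10002"),
                               caserate = c(270.1, 295.4))
percent_pos_long <- data.frame(week_ending = "01/09/2021",
                               modzcta = c("10001", "10002"),
                               pctpos  = c(5.2, 6.8))
test_rates_long  <- data.frame(week_ending = "01/09/2021",
                               modzcta  = c("10001", "10002"),
                               testrate = c(1200.5, 980.3))

# Keep every row of the left frame, pulling matching columns from the right.
all_long <- case_rates_long %>%
  left_join(percent_pos_long, by = c("week_ending", "modzcta")) %>%
  left_join(test_rates_long,  by = c("week_ending", "modzcta"))
```

The result has one row per week per MODZCTA, with caserate, pctpos, and testrate side by side.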
We're using MODZCTA here, not technically zip codes, even though the two are often used interchangeably; the way we're handling it is to leave the lookup table in for people to double-check. Remember how I mentioned that the week_ending column in all_modzcta is a character? We're going to make that a proper date, because it lets us track things over time if needed. Typically, if dates are there, you want them stored as dates unless you have a specific reason to class them otherwise, and this is easy to do with the as.Date function. Finally, I'm going to save this in RDS format. This is the full, final data frame, stored as an .RDS file. Before we build our app, we want one more check on what's happening within this data: what is its story going to be? Focusing on case rate first, it's a good idea to check the distribution of anything numeric. For case rates, all I'm doing is plotting caserate, as a numeric of course, in a simple histogram to look at the range of the data. You can see the range of case rates extends out quite far; the tail is actually imperceptibly small on the plot, but the fact that the x-axis runs out toward 1,500 tells you the extent of the case range. Another way to check this is min and max. Looking at max first: the largest MODZCTA case rate is in the 1,400s, which is why the axis extends all the way out here. Just eyeballing it, the vast majority of the data sits in this lower region, but it does extend out to a lot of cases, which is not great. So now we've gotten a sense of the range here, and again, definitely do this for the other metrics as well.
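Both finishing steps are one-liners in base R. Here's a sketch; the MM/DD/YYYY format string is an assumption about how the dates appear in the CSV.

```r
# Convert the character week_ending column to a proper Date.
week_ending <- c("01/02/2021", "01/09/2021")
week_dates  <- as.Date(week_ending, format = "%m/%d/%Y")

# Dates sort and subtract sensibly, which character strings don't:
diff(week_dates)   # Time difference of 7 days

# saveRDS/readRDS round-trip any R object to disk unchanged,
# classes and all (unlike writing back out to CSV).
tmp <- tempfile(fileext = ".RDS")
saveRDS(week_dates, tmp)
restored <- readRDS(tmp)
```

That round-trip preservation of the Date class is exactly why RDS is a nice format for handing the prepared data frame to the Shiny app.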
For example, take the max of the other columns and check their ranges too; you can repeat the same thing to get a sense of the spread across everything. This is important: getting a sense of the distribution of the data tells you how to visualize it. When we're choosing colors and making design choices, we want the encoding to actually cover the range of the data. So when we think about making a simple first version of what the app will eventually be built around, we want to use leaflet. Leaflet is a really phenomenal JavaScript-based interactive mapping library, and the leaflet package for R wraps it. Checking out the documentation for leaflet, which I totally recommend, there's so much you can do with it. Using the package from R gives you access to a whole suite of interactive tools you'd typically have to code in JavaScript; this R package lets you get around that. We want to set labels, using the htmltools package's HTML function together with sprintf, which lets you set the formatting for the pop-up that appears over each region when we hover. What we're doing here is putting the MODZCTA itself in strong (bold) formatting on top, then a line break, followed by the case rate number itself. The regions will be shaded by the aggregated values, but if you want to see the actual case rate (that's cases per hundred thousand people), you can hover over a region and get the number for that MODZCTA. Let's run this to initialize the labels, and I'll show you where they come out when we build the map.
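The label-building step is plain sprintf() over vectors, with htmltools::HTML() marking the result as HTML so leaflet renders the tags rather than escaping them. A sketch with made-up values:

```r
# Bold MODZCTA on the first line, case rate on the second (toy values).
modzcta  <- c("10001", "10002")
caserate <- c(270.1, 295.4)

label_text <- sprintf("<strong>%s</strong><br/>%g cases per 100,000",
                      modzcta, caserate)
label_text[1]
#> "<strong>10001</strong><br/>270.1 cases per 100,000"

# labels <- lapply(label_text, htmltools::HTML)  # what leaflet's label= expects
```

sprintf() recycles across the vectors, so one call produces a label per region.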
And then we're going to set our color palette. For leaflet, this is actually a function we set, and we have a couple of different options: colorBin, colorNumeric, and so on. I'm using colorBin for this. This is really an aesthetic choice, not something with only one right way to do it; I'm using colorBin just because it's less work than guessing at different ranges by hand. I'm taking my range of data, the full range of case rate data, and binning it according to a set number I give it, so colorBin will split a palette that I'm defining. This palette comes from RColorBrewer. If you don't know about RColorBrewer, definitely Google it and check it out; there are a lot of ready-made palettes you can choose from. Orange-red, OrRd, is the one I'm working with here as an example. We're going to split all possible color values across this palette into nine bins, and that's the max for this palette. This is really just an example for now; I'm not saying this is the one correct thing to do. Especially for data viz, this is somewhere you can spend a lot of time tinkering and playing with different settings to see what works best for the data you have. Something you could do, especially because the data is skewed like this, is bin it such that you have smaller bins at the lower end of the spectrum and maybe larger ranges up here. But for this case, I'm going to make them all equal size just to keep it as simple as possible. So I'm going to initialize this, and that defines that pal function. And here's the code that I've written to actually make this interactive map. It's going into this variable I'm calling map_interactive, which will be the actual object that is the interactive map.
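Sketching that out, assuming the leaflet package and the same assumed column name:

```r
library(leaflet)  # provides colorBin()

# Split the full range of case rates into 9 equal-width bins of the
# RColorBrewer "OrRd" palette (9 classes is the max for this palette)
pal <- colorBin(
  palette = "OrRd",
  bins    = 9,
  domain  = all_modzcta$caserate  # assumed column name for case rate
)
```

Swapping `colorBin` for `colorNumeric`, or passing a vector of cut points as `bins`, are the easy variations to experiment with here.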
I'm taking my data frame, all_modzcta, and piping it into several things. First, this is kind of a weird R cartography thing: it's part of the sf package, and it essentially transforms the data we have into a specific coordinate system. That's going to orient the polygons a specific way. There's a lot to explain here, but CRS stands for coordinate reference system, and we're transforming our current geography data into that format. Then we're going to run the leaflet function. Taking that data and running leaflet is enough to generate a map, but we then have to set the options within that map. To do so, you can add layers called tiles, or in this case provider tiles, and this is really the type of base map you're using. I'm choosing the one called CartoDB.Positron, but this can be a lot of things. Just to show you an example, I'm going to run this part. Oops, you get this warning message, but you can pretty much ignore that. I haven't put any data on here yet because I haven't actually defined anything about it; that's why you don't see any data. All we have is this leaflet map using OpenStreetMap, and it's the entire world. We're not even looking at New York City at this point, but this is what that leaflet instance generates, and CartoDB.Positron is this type of map. You can change the way the map looks, like different types of projections you can play with. As always, if you don't know what a function does, check the help section. For the map providers, there are lots of types of maps out there that you can play with. The actual help-guide example is Stamen.Watercolor, so let's just pop that in here and see what it does, just to give you a sense of the range. Stamen.Watercolor... I'm going to run this. And I got this crazy, surreal watercolor thing, but this is cool, right?
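The base-map part of the pipeline might look like this, assuming all_modzcta is an sf object with the joined geometry (leaflet expects longitude/latitude, i.e. EPSG:4326):

```r
library(sf)       # st_transform() for reprojecting the polygons
library(leaflet)  # also re-exports the %>% pipe

all_modzcta %>%
  st_transform(crs = 4326) %>%                     # put the geometry in the lon/lat CRS leaflet wants
  leaflet() %>%
  addProviderTiles(provider = "CartoDB.Positron")  # try "Stamen.Watercolor" here for fun
```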
You can imagine whole types of maps here. There are a lot of types you can grab; different map providers interface with leaflet and offer these up as tiles you can play with. I think that's pretty cool, but remember the purpose of this data viz: we're not trying to showcase some awesome art, even though in another situation that might be what we're trying to accomplish. This is a very cool map, but we are using color in this case to highlight the data we care about. We're mapping case rate in New York City, so we actually want the cases to be the thing that's colored, and that's what the eye is going to be drawn to. As these load, you can play with them: you can zoom in on locations, you're getting all kinds of stuff here. Oh boy, it's actually quite beautiful. You can play with this and look at how the different maps render at different resolutions and zoom settings. Let me reset the map to go back to something that looks fairly boring; we're going to put the pretty stuff on top of it. So again, just running that, and now we get to the data. There's a lot happening here, so we're going to walk through this slowly. Now we're in the addPolygons section. There are lots of things you can add to a leaflet map: you can add circle markers, you can add pop-up points. We're working with polygons because of the shapefile; that's why we went through all that work of joining in the geography data. They're considered polygons, sort of like these irregular shapes. If you wanted to add standard little dots or circles, or some uniform marker across the map, there's a leaflet function for that. But in this case we're adding polygons because we have very specific geometries: those are the shapefile data showing those MODZCTA regions. So we're using addPolygons.
Within this function there are a lot of arguments you can play with. The very first one is making sure we incorporate the labels we wrote up over here; this is what will happen when we hover over any given polygon. So label = labels, and that's the labels object up here. Some of the design factors are really not hard and fast; these are just things you can play with. The stroke of the actual polygon shape, I'm just turning that off. I'm adding a little bit of smoothing because some of the edges look a little bit not-super-straight, so I just upped the smooth factor a bit. The lines aren't super jagged, and aesthetically it just looks a little bit nicer; it doesn't change anything about the polygon itself. Opacity is one and fill opacity is 0.7, and these are two different features, right? Opacity applies to the polygon outline, and fill opacity is what's inside the polygon itself. The most important part here actually ends up being fillColor, because we're trying to make essentially a choropleth map. What that is, again, is coloring regions based on a data metric, in this case case rate. But it's not just case rate, because we want to apply that color function we defined up here: for all case-rate values, we want to assign a specific color on the scale we set up. That scale goes from orange to red colors; it actually looks a little more yellowy, a light yellow color, up to a darker red. And the tilde here is the notation for that, the function notation I should say.
So the polygon will be filled in according to the case rate there. The other option, and again this is another place where you can fiddle around and see how different features affect the shape and look, is the highlight options we're adding, because we actually want to single out the polygon you're hovering over. Thinking for the user: if you're hovering over a region of the interactive map, you want there to be a little bit of feedback saying, okay, this is the one you're hovering over. By default, the polygons are colored in a static way and don't change color when you hover over them, so this just gives the user a bit more confidence that they're hovering over the right thing and know what they're hovering over. So I'm adding in a very light, kind of grayscale highlight option. And bringToFront is important here so that you actually see the highlight when you hover over, and the fill is almost totally opaque. So, you know, you hover over a region, it turns a little bit gray-whitish, and when you move on to another one it returns to its original color. Already there's a huge range of possibilities to play with here. I'm going to keep saying this, but making data viz involves a lot of tinkering; it's kind of like photo editing sometimes, where you're playing around with different aesthetic features. In this case I'm making choices based on my aesthetic preferences. Some of them have to do with the data itself: I want to highlight certain features about the data, like I mentioned about the color, for example. But this is not the one right way to do this, and it's also not the one right set of data to play with; there's a lot you can visualize from this dataset.
If you have questions, or you don't know the right way to visualize something, there are a lot of people you can chat with in different forums online, and you can always reach out to me if you have questions too. There is no one right data viz for something, right? There are some best practices, which we've talked about, but a lot of this is just getting a feel for making your own visualizations. So don't feel like you need to stick with exactly what I'm doing; feel free to experiment, because that's where it gets fun. The last thing we're going to do in this leaflet chain (again, every single one of these steps is piped: addProviderTiles, addPolygons, very much the pipe format) is add our legend, and we're going to put its position on the bottom right. I'm giving it some opacity so it looks like it's hovering a bit over the map. And then again we have to match our color scheme, so it uses that pal function we defined before, same as up here, with the values again being case rate, and we give the legend a title as well. So let's run this whole thing, and you'll see how the code lines up with what we see. I can run map_interactive, which is the one we just made. One thing you'll probably remember is that our big data frame, all_modzcta, had a lot of weeks in it; we went through a lot of work to get all those weeks in there. This code doesn't actually specify a week. If you notice, I'm not pointing it to a specific week, so what it's doing is taking all of it and stacking all the polygons on top of each other. That's why it took a minute to load, which is not what we want in the final product. But just for visualizing the aesthetics of the whole piece, this is roughly what our final product will look like. Think of it as a rough draft; this is an interactive map.
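All together, the standalone map chain might look roughly like this (a sketch, not the exact code from the video; `labels` and `pal` are the objects built earlier, and the column name `caserate` is an assumption):

```r
library(sf)
library(leaflet)

map_interactive <- all_modzcta %>%
  st_transform(crs = 4326) %>%
  leaflet() %>%
  addProviderTiles(provider = "CartoDB.Positron") %>%
  addPolygons(
    label = labels,              # hover pop-ups built with sprintf/HTML
    stroke = FALSE,              # no polygon outlines
    smoothFactor = 0.5,          # smooth jagged edges a bit (a taste call)
    opacity = 1,
    fillOpacity = 0.7,
    fillColor = ~ pal(caserate), # choropleth fill via the colorBin palette
    highlightOptions = highlightOptions(
      weight = 5,
      fillOpacity = 1,           # almost fully opaque on hover
      color = "#666",            # light grayscale highlight
      opacity = 1,
      bringToFront = TRUE        # so the hovered polygon sits on top
    )
  ) %>%
  addLegend(
    "bottomright",
    pal = pal,
    values = ~ caserate,
    title = "Cases per 100,000",
    opacity = 0.7
  )
```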
So I can, you know, play around and drag, zoom in, zoom out. There's a lot of data on this, actually. You can see the structure of the different boroughs, and this is colored in based on case rate. As you can see, every time you hover over something you get a little bit of feedback here from the coloring, and you also get that pop-up information we worked to construct: you have the zip code, and you also have the number of cases per hundred thousand people. And this is nice because you have that backdrop from OpenStreetMap with the actual city itself behind it, but then you also have your polygons stacked on top, with the color mapped across all possible case rates. Our highest case rate was around 1400-something, so that's in this top range. We set it to automatically cut the full spectrum into nine evenly sized bins, and that's why we have these 200-wide bins. But of course you can play with this: you could set it such that each bin has a very defined range, like zero to 50, then 50 to a hundred. If you do that, though, you have to remind the viewer that the bins are not equally sized. You might want to do that in this case just because of how the case numbers are distributed, but here it's simpler to cut automatically into this number of bins. Especially with cases on the rise (adjusting the scale as cases rise is a kind of depressing thing to do), you always have to check that your data actually falls within the scope of your color scale. Okay, so now that we have our interactive map, the kind of standalone leaflet, you can save it using the saveWidget function. This is the object name, so this one's called map_interactive, and then you give it an actual file name, as an actual HTML file. You can then load this up wherever.
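Exporting it is one call, assuming the htmlwidgets package (the file name here is my placeholder):

```r
library(htmlwidgets)

# Writes a self-contained HTML file you can open or share anywhere
saveWidget(map_interactive, file = "nyc_covid_map.html")
```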
So this HTML file, you can drag into any browser, load it up there, and it's an interactive map you can scroll around, view, and share, which is really cool. That's how you export a leaflet map. Now we're going to flip over to building this into a Shiny app. Again, if you have no idea what Shiny is or what I'm talking about, definitely check out the other video where I explain what Shiny is and how it works. Very fundamentally, a Shiny app is a way to use R to create a web app. If you want to make a new Shiny app in R, you can do this pretty easily: New File, and there are lots of options you can pick here; I'm going to say Shiny Web App. When you do that, you give it a name; in my case, I called it NYC_COVID. You also have the option of multiple formats. I'm doing a single file, which is just my preference, but if you want you can do multiple files, where you have a UI file and a server file, because for every Shiny app you have to code both halves of the app. There's the UI, which is the user interface, and there's the server side, where you include your functions and computations that connect the inputs and the outputs the user is experiencing. I already did this, so I'm not going to reinitialize it. This is my app.R script. Anytime you do this, RStudio gives you some really nice commented-out information here, and to run it you can literally just hit Run App. I'm not going to do that just yet, because we're going to walk through how to make this. RStudio has a bunch of really great tutorials on how to work with Shiny, and I also included a couple of resources specific to Shiny in the links on the page, so definitely check those out too. The reason I made a new script for this, and didn't do it over here, is just code and project organization. You could in theory do this all in one script, but in my opinion that gets a little bit messy.
So just for clarity, for any Shiny app I usually make a directory, in this case NYC_COVID, where I have my app.R script; this is the actual script for the Shiny web app. The other thing you'll notice here is that I put the data frame we're going to work with in there as well (actually, this is the wrong RDS file at the moment). The reason I did that: let me show you over here, our data frame is all_modzcta, and I want to save it as an RDS file that can then be loaded by the app itself. That's how I'm connecting the two R scripts. You can do this in multiple ways; you can actually source one script within another script and do a lot of cool connected-script stuff, which is fun. But in this case it's just one data frame that I want to reference, the one we already worked to build, so I'm just going to change the path to be inside this folder. I'll redo this, and you'll see it pop up over here, and I can get rid of this other one since we don't need it. That's the RDS we're going to use for the app. It's really good to keep your app scripts and data in one place, just so you keep everything together; it's good practice for project organization when you're working with lots of code. So now that we're here in our app, just remember: we're trying to build something like this, this type of map, but make it such that the user determines what they're looking at. I have my libraries I'm initializing as before: shiny for the app itself, leaflet for the interactive map, tidyverse for data wrangling, and htmlwidgets again for the labels on the interactive map. Setting the working directory, I'm going to read in that file from my app script; it should already be in the directory you've set. Now we have the data frame ready to go for the app, and we have to define a UI first for the application.
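That hand-off between the two scripts is just saveRDS and readRDS (the folder and file names here are my placeholders):

```r
# In the data-prep script: save the final data frame into the app's folder
saveRDS(all_modzcta, "NYC_COVID/all_modzcta.RDS")

# At the top of app.R (with the working directory set to the app folder):
library(shiny)
library(leaflet)
library(tidyverse)
library(htmlwidgets)

all_modzcta <- readRDS("all_modzcta.RDS")
```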
Again, this is what the user experiences when interfacing with the app. In terms of the experience of an app, you could either be inputting information, like making a selection, or you could be experiencing something, like viewing a graph or playing with an interactive graph. That's what we're going to build here in a simple fashion. For the UI, it's always a fluidPage function. The structure I've decided to build here uses a sidebar layout, which just means there's going to be a panel on the left and a panel on the right, as well as an easy title panel at the top. When you're making a Shiny app, the first thing you should always do, at least for the UI, is think about where things are going to exist within the app. I'm going to run this quickly just so you have a sense of what the interface will look like. Let's hit Run App, the little play button. Okay, so this is the final end product we're building. When you run the app, it creates a sort of local window tied to RStudio; you can also click 'Open in Browser' if you want, but for now we'll just stay here. This is the layout I've chosen. When the window first popped out, it looked like the sidebar was actually on top, but that's just because of the window dimensions. So in this case I've used a sidebar layout: I have the sidebar panel here, where I'm putting specific features, and my output is going to be over here, which is where the leaflet map is going to live. I'm putting my title here, which I now realize I need to fix, because remember, this is not actually ZCTA, it's MODZCTA; I'm going to fix that in a second. I have a URL here, and then I also have some text about the data itself.
I'm also going to tinker with that in a second, but the main thing here is this: the issue we had with our single leaflet map was that it took all that data, all those polygon shapefiles for every week ending, and just stacked them all on top of each other, across all the weeks, all the time, which isn't really helpful when talking about a trend; you're just looking at one static stack. So in this case, we're letting the user choose the week to look at. It initializes at the week when data collection for this metric started, so that's the week ending August 8th of last year. And beyond the case rate here, I'm also building tabs for test rate and percent positive, since we have the data for those in the frame. Again, this is the beginning of the data collection, so this is in August; that's why the colors look a little bit lower. I'm using three different color schemes here. Part of the reason is that I want each metric to have its own kind of unique identity; they're not the same thing. These are different metrics, and if I used the same color scheme, someone could come in and think these things were all tied to each other. Of course they're connected, but we don't want people to misinterpret what they mean. For case rate, I actually ended up changing this to a yellow-green palette, just because I wanted some distinction from percent positive, which I think is arguably the most concerning metric and carries a sense of urgency; I added that red color to be associated with it. You always have to think about how colors are perceived by people. It's not just an accessibility thing, like someone being color blind, which is very important to consider; it's also the associations people have with color. Green, for example, people tend to associate with being a calmer, less alarming color, which is why I used it for the cases here. You can really put whatever color you want here.
I wanted something that was distinct for each metric, with the scale running from lighter being less to darker being more; I think doing the reverse would be confusing, in my opinion. But again, there's no science to this; it's very much an art in terms of designing the graphics. And I chose a blue palette for test rate. So now that you have a sense of it, let's go back into the nitty gritty. Always remember to close the app or stop running it; otherwise RStudio isn't really functional, so make sure you do that. If you remember, I said I was going to fix the title panel, so now I'm going to make it say modified ZCTA. I'm not hiding at all, when I'm showing this, that these do line up with zip codes; they encapsulate zip codes and represent them, but they're not always individual actual zip codes. We talked about those nuances. Within the sidebar panel, in that clump of text I had, the first thing was actually a URL. The way you do this, if you have experience with HTML, is taken directly from that to work within R. What Shiny lets you do is add in this tags$a, which lets you create a URL link. First you write in the URL itself, and then I want the displayed text to just say "Data Repository," you know, the data that we pulled from, so anyone should be able to replicate this. I'm also putting target = "_blank", which makes the link open in a new browser tab; instead of refreshing and losing your app, it just creates a new tab, a nice thing to add for the user's experience. And then h5: this is another way to add in text, borrowed from HTML and CSS. Here I'm literally just writing some notes about the data, so that someone who stumbled upon my app, looking at it for the first time, has a sense of where the data's coming from.
The very important thing I wanted to highlight here is that the data metrics are aggregated by week, categorized by the week ending on a date, so viewers have some reference point. Before they go down to the scrolling: what is percent positive? Because that's not explicitly obvious from the data. So I'm saying it indicates the percentage of people tested for COVID-19 with a molecular test who tested positive. The denominator is people who actually had a molecular test for COVID, and the numerator is those who tested positive; from that you get the percent. All data is sourced from the New York City Department of Health. The other thing I'm going to do here is add a note about MODZCTA. If you wanted, you could add in another tag link here with the direct URL to that one folder inside the repo that shows you the conversion table; you can do that if you want, or you can just say it's also in that data repo. I'm adding this as, again, more context. I think the more you can explain, the better. You don't want to overwhelm people, but being as upfront as possible about what you're showing is always important. Okay. And then the most important part for the user is the actual choice. We're adding a selectInput for this, where I'm making a list of choices based on every possible week ending in the data frame. I'm doing that with unique on the week-ending column, so every unique week-ending date is propagated into a list, and the text prompt in that selectInput is just going to be "Select a date (week ending)." That's what the user sees. Then, organizing the actual output side, which we're going to write the code for below, there's going to be a main panel. This is again within the sidebar layout; now I'm working on the main panel part of it. And as you remember, I had three tabs there that you could select across.
So these are going to be our three leaflet maps, and the way you do this is to nest in a tabsetPanel. I have three tabs here: case rate, test rate, and percent positive. The most important thing in any Shiny app is the names you give the different inputs and outputs. Always remember that in a Shiny app, the server stitches together the inputs and the outputs; that's why the server always has the format function(input, output). So we need to remember what we named things in the UI. For the selectInput it's "date," and we're going to talk about where that shows up. And for the actual plots, the names we're giving them are cases, tests, and pctpos. These could be literally anything; I'm giving them something informative so I don't forget what the inputs and outputs are named. If you ever forget, this is where you have to reference the names, and those names have to match if you want any interaction between inputs and outputs in your app. Jumping to the server now: again, we have the format function(input, output), and I'm going to open these up in a second, but just so you have a sense of the general structure, we're defining the server function, and then to run the app you just run your UI and server. The first thing I'm doing here is creating a reactive. Part of the power of using this reactive function, of making this reactive function I should say, is that it's dependent on the choice the user is making: this calculation only happens when the user actually makes a selection. Within this reactive, I'm taking my entire data frame and filtering it down to a specific week ending, which was the goal here. And how am I defining that?
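The UI half might be sketched like this (the title text, repository URL, note wording, and input/output ids are mine, based on the description; double-check them against your own app):

```r
library(shiny)
library(leaflet)

ui <- fluidPage(
  titlePanel("COVID-19 NYC Trends by Modified ZCTA"),
  sidebarLayout(
    sidebarPanel(
      # link to the source data, opening in a new tab
      tags$a(href = "https://github.com/nychealth/coronavirus-data",
             "Data Repository", target = "_blank"),
      h5("All metrics are aggregated by week (categorized by week ending)."),
      # one choice per unique week ending in the data frame
      selectInput("date", "Select a date (week ending):",
                  choices = unique(all_modzcta$week_ending))
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Case Rate",        leafletOutput("cases")),
        tabPanel("Test Rate",        leafletOutput("tests")),
        tabPanel("Percent Positive", leafletOutput("pctpos"))
      )
    )
  )
)
```

The ids "date", "cases", "tests", and "pctpos" are the names the server will have to match exactly.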
I'm saying input$..., which tells the server, okay, we're looking for something we named in the input, and that name is date, right? So input$date matches the "date" input. If I had more inputs, that's how you would call the different inputs; in this case I just have one. So this is how I'm selecting the specific date I care about. The way we named this, it's a list of week endings, one of which is being selected, and I'm telling this reactive function to filter my big data frame down to just that week ending, so I have that set of polygons to play with. I'm going to close this really quickly. Then we want to build our three tabs. Each tab is just going to have a leaflet map on it; we're not really adding anything else. You could, of course, if you wanted to, but for this I'm just making a leaflet map for each, and this is going to look very familiar to the single leaflet map we already made. For cases, I have a pal color function, and that is using colorBin once again; in this case I'm choosing a yellow-green palette. Again, check out RColorBrewer for a full list of built-in color palettes. You can make your own if you want; there's a whole bunch you can do with color palettes in R, and with color in general. The key thing here is that I'm setting the domain for the colorBin to be the entire range of case rates, so no matter what week you jump to, it's one fixed color scale. That's really important, because the whole point of this app is for people to compare the severity of case rate over time. If you don't have a constant color scale across all of your weeks, that becomes a very difficult visual comparison. The numbers might tell you a story, but you want the color and the color scale to match that experience.
And just like everything else, we're making those pop-up labels just as we did before, so MODZCTA and case rate are going to show up there and we know where we are. But if you notice, here I'm not taking the entire data frame anymore. What I'm doing is taking week_zcta(). What was week_zcta again? It was the reactive we defined. It's no longer a plain data frame, and that's why I have this extra set of parentheses here: I'm calling a function and pulling something from its result. The reason it's a function is that this happens only when someone picks a week ending. It looks a little weird, like it's not the normal kind of data frame column selection, but this will work; this is the inside-Shiny nitty gritty. Now we get into the actual leaflet map, which is almost identical to what we had before. The big difference is that I'm not starting from my entire big all_modzcta data frame; I'm starting from week_zcta(). Again, this is a reactive, so that's why we have these closed parentheses. We start from that because it has the data we need, the geography data, all of that, but now we're running off the filtered set of polygons, as it were. With leaflet, again, we're adding our provider tiles. I'm actually doing one more thing here, which is sort of optional, but it makes the work a little bit faster for Shiny. Remember, these are all processes happening every time you make a choice in the app. What I'm doing is setting the scope of the map from the outset, before I even add any polygons. These are coordinates for New York City. It takes a little bit of trial and error to get the right initial viewpoint, but we're taking the longitude and latitude of New York City (you can Google this and look it up for wherever you're mapping) and also setting a zoom so I'm focused on this space.
That way you have a view of all the boroughs without too much extra from, you know, surrounding states and things like that. That's just a fun leaflet trick. Then we're adding in our polygons, and this is again identical to what we've done before. The only difference is that I'm using week_zcta() instead of the full data frame; it's still a reactive, so we're using the closed parentheses. Then we add a legend. The legend isn't tied to the reactive; it's tied to the broader data frame, and that's why I just have the full data frame's case rate here. So this is how you modify your leaflet code to make it into something interactive in a Shiny app. Note also that to generate a leaflet map within Shiny, you have to wrap it in the renderLeaflet function; that's how you create the output. It's the exact same thing for tests and percent positive. I'll just show you: I chose a purple-blue scale for tests, and for percent positive I used the orange-red color scheme for the palette. I'm also going to harp again on how we identify the specific output tabs: we have to tie them to the UI. Just as we had the input "date" object, we want to call specific output names. As I said before, we have three leaflet outputs we coded for, and they have the names cases, tests, and pctpos; that's what they're called here, and this is how the UI and server communicate with each other. So let's run it one more time. Now we have our updated text here, where I have a little bit more of a rundown; you can click "Data Repository" and it'll take you straight to the GitHub. And we have our three tabs, which look really nice. So let's jump around a bit: let's go to the end of November, and you'll see the maps update with the new data.
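And the server half, showing just the cases tab (tests and pctpos follow the same pattern with their own palettes; the column names, NYC coordinates, and helper objects are my assumptions based on the walkthrough, not the exact code):

```r
library(shiny)
library(leaflet)
library(dplyr)
library(sf)
library(htmltools)

server <- function(input, output) {

  # Reactive: refilters only when the user picks a new week ending;
  # "date" matches the selectInput id in the UI
  week_zcta <- reactive({
    all_modzcta %>% filter(week_ending == input$date)
  })

  # Color scale built from ALL weeks, so shading is comparable over time
  pal <- colorBin("YlGn", bins = 9, domain = all_modzcta$caserate)

  output$cases <- renderLeaflet({
    # note the (): week_zcta is a reactive, not a plain data frame
    labels <- sprintf("<strong>%s</strong><br/>%g cases per 100,000",
                      week_zcta()$modzcta, week_zcta()$caserate) %>%
      lapply(HTML)

    week_zcta() %>%
      st_transform(crs = 4326) %>%
      leaflet() %>%
      addProviderTiles(provider = "CartoDB.Positron") %>%
      # rough NYC center and zoom, found by trial and error
      setView(lng = -73.9, lat = 40.7, zoom = 10) %>%
      addPolygons(label = labels,
                  stroke = FALSE,
                  fillOpacity = 0.7,
                  fillColor = ~ pal(caserate)) %>%
      # legend tied to the full data frame, not the reactive
      addLegend("bottomright", pal = pal,
                values = all_modzcta$caserate,
                title = "Cases per 100,000")
  })
}

shinyApp(ui = ui, server = server)
```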
Already we're seeing some darker colors here, suggesting higher numbers for cases as well as test rate and percent positive. And our most recent data point is not great; at this time, cases have really been taking off across the city and the state, and that's reflected in a couple of metrics. For the case data, you can see that the number of cases in this specific week has increased quite a bit, and the number of tests as well, but they're telling somewhat different stories too, right? The interpretation changes somewhat. And this is a good checkpoint: you have a first pass at a Shiny app, and you look at it and ask, okay, does this make sense? Does this jibe with what I was seeing when I was inspecting the data on the front end? Is this the sort of story I want to tell? At this point, imagine you're writing an essay and this is the first draft: you go and ask, what are the interpretations from this? What can be improved to better drive home a specific story? Can we add any more context, annotations, things like that? But bare bones, this is a pretty good first effort. Are there places where you can improve on it? Definitely, and you should try. For example, people might ask why cases, tests, and percent positive look out of sync with one another: this map is dark, but that one isn't, so why is that? That's a complicated thing to answer, and it might require another text blurb or an explanation of what's been happening.
If you notice discrepancies in the data, this is also a cool opportunity to reach out beyond the GitHub to the New York City Department of Health and ask, hey, did you experience some data recording or reporting issues? As I said before, the holidays hit a lot of people hard in the sense of COVID data lagging across pretty much all fronts. So this is a very cool way for people to experience COVID data: you have this level of interactivity, so people living in New York City might say, well, I live in this zip code, let's see how it looks for cases, for tests, for percent positive, and maybe how to interpret different levels of cases too. I'm just going to jump around a bit. This is right before the holidays, which might be a precarious time for reporting. And if you're wondering why the scale looks the way it does when most of the data sits in this lower zone, it's because we had a couple of pretty intense weeks where certain zip codes and regions had very high numbers of cases per hundred thousand. An easy way to improve this would be to change the bin structure, for example to bin in a way that better covers the variability in these lower levels of cases. This is a good starting-off point for people getting into data viz, and I've talked about a bunch of really cool tools that R and RStudio offer. Shiny and leaflet are cool ways to create interactive visualizations without going through more front-end tools like HTML and CSS, and they let you really have fun with the data. There are a lot of possibilities: you can make all kinds of Shiny apps, and you don't even have to make a Shiny app.
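Changing the bin structure is a one-line tweak to the palette. A sketch of what finer low-end bins might look like (the cut points and the `all_modzcta$case_rate` column are illustrative assumptions, not values from the talk):

```r
# Rebinning the color scale so low case-rate weeks aren't all one shade:
# colorBin() accepts explicit cut points via the bins argument.
library(leaflet)

pal <- colorBin(
  palette = "OrRd",
  domain  = all_modzcta$case_rate,
  bins    = c(0, 100, 250, 500, 1000, 2000, Inf)  # finer bins at the low end
)
```

With equal-width default bins, a few extreme weeks compress most of the data into one or two colors; hand-picked cut points like these spread the common range across more of the palette.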
You can just play with a single leaflet interactive map. I hope this was helpful, and I hope you end up making some even better visualizations than this, showing all kinds of creative stuff. You can take this and apply it not just to COVID data, of course, but to whatever your favorite data set is. When we're talking about COVID-19, the most important thing is transparency and understanding of the data. And again, as I said in the other video, when putting out data viz for COVID-19, always ask yourself first: what new information is this data viz showing that hasn't already been shown? Does this serve a public need? Will this be wildly misinterpreted? Have I done everything possible to make sure enough context has been provided for this visualization, and that the data is solid? The stakes are unfortunately too high for something like COVID viz to be careless with that, so always keep it in mind. But again, the data is out there to play with and practice on, on your own time. Whether or not you decide to publish anything, there's a lot you can do with the data, even just to understand for yourself how these visualizations are made and to think about the best ways you would want to visualize it.
Info
Channel: RockEDU Science Outreach
Views: 3,889
Rating: 5 out of 5
Id: eIpiL6y1oQQ
Length: 77min 5sec (4625 seconds)
Published: Fri Jan 22 2021