Analyzing COVID-19: Can the Data Community Help?

Captions
- So hi everyone. My name is Karen. I'm on the devrel team at Databricks, and we're happy to be able to bring this online meetup to you so quickly. Thank you all for joining. We hope that you and your families are safe and healthy, and we hope you enjoy the session. I'm going to pass it along to my colleague, Denny Lee. He's also on the devrel team as a developer advocate, so he's going to kick things off and get it started here. - Hi there. Thanks very much, Karen, for the introduction. My name is Denny Lee. I'm a developer advocate here at Databricks, but before that I was actually a pseudo biostatistician. I was doing a masters in biostats at the University of Washington, working on the ADRA project, which was doing HIV and AIDS research with the Fred Hutch cancer research center and the University of Washington biology lab here in the Pacific Northwest. The reason we decided to shift gears from our original session today, which was about addressing GDPR and CCPA, is that due to the current health crisis we thought this would be very good timing for a session like this. Even though I never completed my biostats degree, just as a quick call out, the reason I didn't is just because Microsoft offered me money to go work versus me paying for a degree. So (laughing) yeah, I took the money. Nevertheless, I do have a masters of biomedical informatics, I do have a background in medicine, and I have a degree in physiology. That does not, I repeat, does not make me an expert, as much as my parents wanted me to be a doctor, Asian parents. So let's not fall into that pretense, but I do have many friends and colleagues here in the Seattle area who are currently helping to fight the coronavirus and helping care for patients. So before we go into the data science side of things, I want to be very clear: the number one thing you can do to help everybody is wash your hands. I cannot overemphasize that enough. The number two thing you can do is social isolation. This is the reason I'm currently sitting in my laundry room as opposed to being out and about, because it's the right thing to do. Those are the two most important things for anyone who wants to help. We put a panel of really cool data scientists together today to talk about the different data sets we have currently. There are probably more, by the way, but these are the ones we happen to be working with: the South Korean dataset out of Kaggle, the CORD-19 dataset that's also on Kaggle, and the Johns Hopkins dataset that's sitting in GitHub. We're going to be showcasing these notebooks and saving them. We're running them on Databricks right now, but we're going to save them as IPython notebooks so you can run them locally on your Jupyter instance. So what'll happen is that after this session is done, we're going to post links to those notebooks on both the Spark online meetup and on YouTube, where this video is playing, so you can download them and work with these public data sets yourself.
So what we're doing here is simply trying to encourage you folks to take your data science knowledge and see if you can possibly help, but obviously, understand that the number one thing you can do is still wash your hands, and the number two thing is social distancing. If you have time, try to be on the lookout for your friends, your elderly parents, or your elderly colleagues. And on top of that, just like a lot of folks here in the Pacific Northwest, and I'm sure it's happening everywhere right now, go ahead and donate to the various meal deliveries for your healthcare providers, just to help them out. These are the number one, two, and three things you should do, by the way. Now, since we're bored and we're at home and we want to do a little data science, it is also a fun thing to do, to try to analyze this data. So saying that, I wanted to first introduce Vini. Vini, you're going to introduce yourself obviously, but I'd like to start with you and have you present your session. So let's start with you, Vini, and please take it away (laughing). - Thanks Denny. Hi guys, Vini Jaiswal, customer success engineer at Databricks. I have been with Databricks since late 2018, and I serve as a trusted advocate for our customers to make sure they are happy with our platform. I have been working in the big data industry in general for about seven years, I have a masters degree in Information Technology and Management from UT Dallas, and I also worked with Citibank and Southwest Airlines before. So that's a little bit about me. Today I am going to show you a quick notebook; alright, let me share my screen. I have analyzed the South Korea dataset, and as Denny mentioned, this is available publicly. I started analyzing some quick points, since there is a lot of worry about how the cases have been found, and I'm going to walk you through some of the insights I have seen in the data. I'm using a Databricks notebook here, and you can do the same thing in Jupyter or other platforms, as you wish. Here, as you can see, I have the patients CSV, and I'm just writing some SQL queries to understand what the data is. This is how my dataset looks. You can see that there is patient ID, gender, birth year, country, of course Korea, but there have also been cases from other countries of people who visited South Korea, and we have other details like confirmed date, released date, and things like that. The first thing I want to show you is the number of patients in each city. You can see that there has been a rise in cases, starting with the capital city, which has the most patients, and then you can see the other regions as well. I'm just running a simple SQL query to find the patient count for each region, so we can see the trend there. Another insightful thing I noticed: how about we filter the cases based on infection reason? So what I did is I calculated, month over month, what type of infection reason each patient was affected by. As you can see, in January a lot of cases were from visiting the Wuhan region, and then cases eventually developed through contact with existing infected people. So you can see that's where the trend started; in the month of February all these cases, which you see in orange, are contact with patient, and it just grew organically.
Then people started visiting other cities, and the second biggest infection reason was visits to Daegu, which eventually spiked up. So you can see the most common trend is contact with other people; that's why we have the wash-your-hands guidance and all of these lockdowns in place. Another interesting thing about this data is the confirmed count, so let's see what kind of trend we have seen by date and number of patients. It started slowly in January and then picked up for many reasons: travelers coming in contact with other patients, and then testing; a lot of people were not even aware that they had been infected. So it started slowly and kept trending upward. This is the trend over a period of three months. Now, let's talk about recovered patients and fatalities. Out of the cases that we know about, these are the numbers of recovered patients, and I'm trying to analyze where the recoveries are coming from. From the graph, it looks like most patients recover in the capital area region; you can see the highest numbers of patients plus recoveries happening there. Now let's talk about fatalities. What I did was take the number of patients and their timelines from the confirmed date to the reported dates, and filter by region. Most of the fatalities were from Daegu, which was the church event, then Gyeongsangbuk, and then the capital area. Next, the percentage of recovered patients by infection reason. It looks like most patients who were in contact with other people recovered; 40% of the total reported confirmed cases recovered. The second most recoveries, 22% of the total, were among patients who had visited Wuhan. So these are some of the insights which can be derived from the dataset. Then, the number of fatality cases: here you can see that the main fatalities came from the church and the Daenam hospital, and there are a lot of cases with an unconfirmed date, maybe because they were not tested, or not reported, or they were not in a state where they could explain how the disease got to them. So these are some of the insights we saw from the data that we had from South Korea. The main takeaway is that over a period of time it just kept happening, and it was mainly about coming into contact with existing patients. The second takeaway is that if we limit contact, the cases can be controlled. Those are some of the things we are also experiencing in our current timeframe. So that's the analysis. Denny, do you want to add anything? - No, this is great. Thanks very much. Actually, do me a small favor and scroll right to the bottom, just so we can show that call out. - Yes, percentage of fatalities by infection reason. - So as you can tell, and remember, this is the South Korean dataset from Kaggle, which is relatively complete because South Korea has a very high rate of testing. The fact that they can specifically call out the various hospitals or the church, which was the main epicenter for South Korea, is pretty substantial. But the fact that they also cannot explain the reasons for the vast majority of the group is very telling: it's basically calling out that yes, once it spreads, it spreads. And we can't really do that network analysis anymore to figure out whether there's a single cause, because the network effect has kicked in. So this is really awesome work by Vini; she did it over a few days, I guess, which is pretty cool.
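To make the walkthrough concrete, here is a minimal sketch of the kind of Spark SQL queries Vini describes, written so it runs in a Databricks or Jupyter PySpark notebook (a SparkSession named spark is assumed). The file path and the column names (region, infection_reason, confirmed_date) are assumptions based on the Kaggle South Korea dataset described above and may differ between dataset versions.

# Load the Kaggle patient CSV (path is illustrative) and register it for SQL.
patients = (spark.read
            .option("header", True)
            .option("inferSchema", True)
            .csv("/tmp/coronavirusdataset/patient.csv"))
patients.createOrReplaceTempView("patients")

# Patient count per region.
spark.sql("""
    SELECT region, COUNT(*) AS patient_count
    FROM patients
    GROUP BY region
    ORDER BY patient_count DESC
""").show()

# Month-over-month case counts broken down by infection reason.
spark.sql("""
    SELECT date_format(confirmed_date, 'yyyy-MM') AS month,
           infection_reason,
           COUNT(*) AS cases
    FROM patients
    GROUP BY date_format(confirmed_date, 'yyyy-MM'), infection_reason
    ORDER BY month, cases DESC
""").show()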
So there are some quick questions here that I thought would be good for you, Vini, if you could answer them. I guess the first question is: as you were looking at the South Korean dataset, what led you to ask these particular questions? Why did you decide to write the queries that you did? - Yeah, so this was based on the news channels. I have been looking at the numbers and the reasons, and I talked to my friends about some of the common questions that have been asked, or that people want to know about. That led me to decide on some of the questions. Also, the schema I had in the dataset led me to questions where I could find very relevant answers. - Cool, cool, alright. And related to that, as a domain analyst with the background that you have, is that the reason why, for example, you decided to write this in SQL? Is there a reason why you chose to write the queries the way you wrote them? - Yeah, it was easier to do in SQL. I could do it in Python as well, and I can also mix and match different languages within this notebook. But the reason I selected SQL is that a lot of the people we will share this notebook with may be more comfortable with SQL, so they can do easy analysis in SQL as well. - Perfect, perfect. So just as a finishing note, there are some other questions that popped up in the Q&A, and I'll take care of them separately, if that is okay. But as Vini called out, she's got this awesome notebook that she created. We're going to be providing it as an IPython notebook shortly, so that you can run it locally against the Kaggle dataset yourself. Just as a quick call out, we are working on getting those links up; right now we have them as Databricks notebooks, but like I said, we are going to convert them over. In fact, in the final session today, Dhruv is going to showcase the fact that he was ahead of the curve of all of us and had already taken care of that. So there you go. Oh, and there is an interesting question that I will take over. So Vini, you can stop presenting, and I will present my screen now. Alright, assuming I learn how to present my screen; that's a different story altogether. - Thanks Denny. - Thank you very much, Vini, I appreciate it. Alright, cool. So there are a couple of questions. By the way, I noticed there are some questions popping up in the chat; please put them in the Q&A, just because it's easier for us to organize. We're not necessarily going to answer the questions right away, because each person is trying to present what they're doing first.
So the first question I'm going to answer is: any tips for developers who are relatively new to data science? The quick answer is that everybody, regardless of your experience, can contribute; your experience could be in data engineering, it could be SQL, it could be algorithms, and you can come in from a different angle. This is actually very apropos for this particular dataset, because I'm going to show you the CORD-19 dataset from more of a data engineering perspective, and Chengyin, right afterwards, is going to show it much more from a data science perspective. They're both equally important, and it will become very apparent very shortly why. Now, if you're really new to data science and have never worked with it before, I would suggest starting with pandas and scikit-learn. That's probably the easiest way to get on board. The scikit-learn website has a lot of good material that gets you jump-started quickly, and pandas is relatively easy; you can install it right on your laptop, whether it's Windows, Mac, or Linux for that matter (chuckles). So you're not going out and trying to do something brand new; just use your own laptop, download the data sets, and start working with them. And then there's another question which I'm going to answer live, which is: how often are these data sets being refreshed? Different organizations are refreshing them on different timelines. For example, the South Korean dataset was updated at least twice in the last week. The CORD-19 dataset here, which we're going to be referring to as part of the COVID-19 Open Research Dataset Challenge on Kaggle, has also been updated; right now they're on version three. It's my understanding that they're updating it on a weekly basis, though I'm sure they're updating it as fast as they can. If you want to check it out, by the way, there's a subreddit on COVID projects that has various people who are trying to analyze, organize, and cleanse the data even faster. I'm not going to point to any particular project, just because there are many, whether it's data science or 3D printing (chuckling). So there's a wide variety of projects out there; take whatever challenge you want to do, because our attitude is: go out, do some analysis, see if you can help out, but again, wash your hands first.
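Following up on the tip above for developers who are new to data science, here is a minimal local sketch using pandas. The file name and column names are assumptions based on the South Korea Kaggle dataset described earlier and may differ between dataset versions.

import pandas as pd

# Load the patient-level CSV downloaded from Kaggle (file name is illustrative).
patients = pd.read_csv("patient.csv", parse_dates=["confirmed_date"])

# A first look at the data, then two quick aggregations.
print(patients.head())
print(patients["region"].value_counts())             # cases per region
print(patients["infection_reason"].value_counts())   # cases per infection reason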
Okay, now that I'm jumping off the soapbox (chuckles), pun intended, let's talk about the CORD-19 dataset. What's cool about this one is that it's a collaboration with the Allen Institute, the Chan Zuckerberg Initiative, Microsoft Research, Georgetown, the National Institutes of Health, and the White House. So we've got a bunch of collaborators, as you can tell right away. The dataset itself is primarily JSON files, broken down into a commercial use subset, a non-commercial use subset, and pre-prints that have not been peer reviewed. These are all papers about the coronavirus. And so we're going to be using this particular notebook on Databricks, but these notebooks, like I said, are free to share, and we're going to be converting this into an IPython notebook as well for folks to work with. At the beginning of this notebook I'm going to go a little differently than Vini did. She showed you what you can do with the analysis; I'm going to be a little bit more boring and show you how you're going to read all of this JSON, because that's what the CORD-19 dataset is. And then hopefully you won't fall asleep, because Chengyin will follow up and do something much more interesting with this dataset; I want to do the boring data engineering stuff first. Alright. So let's look at the schema of this JSON. This is just the json_schema.txt that they included within the data sets; all three or four subsets they have as part of CORD-19 are covered here. You'll notice it has things like the paper_id; the metadata, as in the title, the authors, what the abstract is; the body text; and also all the bibliography references, and there are many, many references. So it's pretty cool stuff to have, but how do you make sense of it? What I did is say, okay, I'm going to take all the JSONs. We did actually upload this already to Databricks datasets, so if you happen to be using Databricks you can use them right away. If you don't, like I said, you can just go to Kaggle; I include the link to Kaggle in the notebook to download the dataset itself. It's not very big, about two gigabytes, so it's not a gigantic dataset yet. Now, I'll explain in a second why I have these parquet path variables, but the context is that I want to convert the files from JSON to parquet. So why did I want to do that? Well, just to have Spark, with the four nodes I have for this cluster, read the 9,000 JSON files in the commercial use subset, where each JSON contains the information of the paper, the authors, the title, the body text, and the bibliography, and it's a pretty long-winded JSON as you can tell, just doing the initial read, which isn't doing any real processing, took almost six minutes on four nodes. So it's pretty tiring. The good thing about reading it in Spark, though, is that I simply put all 9,000 JSON files in one directory and it was able to infer the schema anyway. That's also why it took so much time: it looked at all the JSON, figured out what the schema was, and automatically created the schema for me from the files. That's why you see what you see here. Important to note: these JSONs are multi-line, i.e. they contain carriage returns. So if you don't specify the multi-line option when you try to read the JSON files, all you'll end up seeing is corrupted records, simply because Spark did not recognize or understand what these JSON files were. It is what it is. But mark it as multi-line and you should be good to go. Alright.
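Here is a minimal sketch of the multi-line JSON read described above, assuming a SparkSession named spark (as in a Databricks notebook); the path is illustrative, and the files can equally be downloaded from Kaggle and read from a local directory.

# Read all ~9,000 commercial use subset papers in one pass; Spark infers the
# schema by scanning the files, which is what makes this first read slow.
comm_use_json_path = "/databricks-datasets/COVID/CORD-19/comm_use_subset/*.json"

comm_use = (spark.read
            .option("multiLine", True)   # the papers are pretty-printed, multi-line JSON
            .json(comm_use_json_path))

comm_use.printSchema()
print(comm_use.count())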
So as we said, here's the count, and it tells you right away that there are 9,000 files, so cool stuff. I also verified it by simply listing the path, doing an ls on the file system and doing a count. Alright, now here's where the fun part happens (laughing). The fun part is where I go ahead and get the number of partitions, 286. Like I said, the initial read took about five, almost six minutes, but when I wanted to write it out, it took almost 14 minutes. So what I did, because I have four nodes, is repartition the data and save it as parquet. Instead of reading the individual JSON files, I saved it as four parquet files; in fact, if you look here, these are the four parquet files you're seeing. Because I did repartition as opposed to coalesce, I've minimized the skew. In other words, if I wrote coalesce here instead of repartition, the coalesce basically would have been faster, but the repartitioning made sure the file sizes were roughly the same. When I ran the coalesce the first time, one of the files was massively bigger and the other three files were smaller. What that means is that when I'm trying to run a query against the data distributed across four nodes, one node would take more of a hit than the others. Now, let's roll back a second: why did I do all this? I already told you there are 9,000 files, and it took me about six minutes just to read them and do a count. Well, because I saved this as parquet, you'll notice that now, when I run the query, reading the files finished in three seconds, and the count took less than two seconds, 1.5 seconds. So this is the data engineering aspect. For all you folks who are working on the CORD-19 dataset, this notebook is going to be made available to you, so hopefully this will help you charge up and go through your data faster, because you'll be able to convert the JSON into parquet and run your queries significantly faster. When I was running queries before on the JSON, each one took me five or six minutes, but because I converted to parquet, now they're happening in seconds. From a data engineering and data science perspective, that's significantly faster, hence the reason we went through this process. The next step, which I'm not really going to dive too much into, is basically the same thing: we did the same conversion for the non-commercial dataset. Same idea: it took longer before, and once I saved it as parquet the queries ran significantly faster, and the same goes for the peer-reviewed data sets. Alright.
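A sketch of the JSON-to-Parquet conversion described above, assuming the comm_use DataFrame from the previous step; the output path is illustrative.

comm_use_parquet_path = "/tmp/cord19/comm_use_subset.parquet"

# repartition(4) does a full shuffle so the four output files come out roughly
# equal in size; coalesce(4) would skip the shuffle but can leave one file much
# larger than the others, which skews later queries.
(comm_use
 .repartition(4)
 .write
 .mode("overwrite")
 .parquet(comm_use_parquet_path))

# Subsequent reads and counts now take seconds rather than minutes.
comm_use_pq = spark.read.parquet(comm_use_parquet_path)
print(comm_use_pq.count())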
Alright, so Noel asked a great question: what's the benefit of getting the number of partitions? In this particular case there's no real true benefit; I just wanted to call out what the number was, in other words, that there were around 280 partitions before. What it does tell me, though, is that because we shoved the data in, I have a lot of partitions, and remember the overall size: amazingly enough, the commercial dataset was the largest, and the whole data size is only about two gigabytes. So we're not talking massive scale here; we're just talking about a small set of files, but that small set of files was spread across 283 partitions. So if I saved that out in Spark, or for that matter if I was doing this in pandas, same problem by the way, I would save it out as 283 parquet files, and that's bad, because then there's the overhead of trying to read all of those little parquet files, even though there's not that much data in each one. Instead, like I said, I chose to do it as four, just because the cluster setup I have here happens to have four nodes, so I saved it as four. You could also just as easily tell me, dude, it's a pretty small dataset, parquet uses the snappy codec as well so it does a little bit of compression, you could have just saved it as one file and been done for the day, and that's valid. In fact, that's what I'm planning to do for the folks who are going to be using pandas, because you can actually, in Spark, read those files and save them as a single parquet file, and provided you remove the commit and metadata files, pandas can read that file too. So that's the real quick call out. And finally, the other call out: does this tell you whether the files are splittable? Yes and no. In this case there were 9,000 files to begin with, so with the idea of partitions in memory, Spark took those 9,000 files and shoved them into the 283 partitions it created in memory. By definition, if I write those partitions out without reorganizing them, even if I specify a partitioning column, or even with a single partition column, what usually ends up happening is that I'll get something like 280 files, or at least a heck of a lot more files than I specified. That's the reason I didn't want to do that, just because the data size is relatively small in the first place. Alright, perfect. And we did have another question: are we going to talk about Delta Lake? We are not today, but perhaps in a future session if you're interested, because we're thinking about diving deeper into each and every one of these notebooks: Vini's notebook, this notebook, Chengyin's notebook, and also Dhruv's notebook that will be coming very shortly. We're thinking about diving deeper into situations where we will be doing streaming and updating the data sets; in those cases I may decide to include things like Delta Lake. But right now I'm just focusing on the best way for every single data scientist to make use of the data. And there's a sizable chunk of folks here who, I'm sure, would rather do this in IPython notebooks locally with pandas. That's fine; we're not asking you to use one version or the other. That's why we're going to save these notebooks in that way.
So that way you can run this as an IPython notebook with pandas as well, okay. Alright. So I did want to call out some quick analysis of the data prior to Chengyin showing the cool stuff. Right now, like I said, I went ahead and read the data: okay, cool, let me go read the parquet files. That's what's also great: because I saved it as parquet, any subsequent notebook I have can just go read those parquet files, bam, I've got myself a nice data frame and I'm back up and running. The first few cells here are very similar to the previous cells, so there's nothing really new here; I'm just showing the schema again, and it's pretty complicated. So, how do I want to make sense of the data? In this case, I'm going to use Spark SQL to help me look at the data, and I know from the JSON schema text which are the important things I want to look at, because in this case I just wanted to map it out and see which geo has what papers. So I can look at the paper_id, which gives me the count of the papers, and the metadata, which tells me things like the title, but more importantly the authors and their affiliation, and inside the affiliation, the location, which basically tells you where the author is from. So let's go do that. I'm going to do a select of paper_id and metadata title. So this is an example, the evolution of poxvirus vaccines, and here is the paper information. As you can tell, here is the array of authors and all their affiliations. So this is the author, and some of them will indicate exactly where they're affiliated: there's the affiliation, for example, and the location, and it says Spain right here. Alright, perfect. So this tells you a little bit about that information; we're diving into the JSON, trying to make sense of it. So now I'm saying, okay, let me break that out. In this case I'm going to do an explode, because what I care about specifically is just the author information; I don't care about all the other information. So let me explode that out. What that basically means is that, as you can tell, there are many, many authors for this first paper. Alright, so that's good information. Instead of having one row that contains all of these different authors, where there are six of them, I want to have a separate row for each author, so that I can actually understand the author location. That's what this column is, and that's why I exploded it. So I can say, okay, author one for the same paper, the evolution of poxvirus vaccines: here are the six different rows for that particular paper, the same paper_id, the same title. I don't really need the title, but it's easier to read for everybody, so that's why I kept it in here. And I can see what the affiliation is and where the location is: same idea, country of Spain, country of Spain. So, relatively straightforward, perfect.
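Here is a sketch of the explode step described above, assuming the comm_use_pq DataFrame from earlier and the CORD-19 schema in which metadata.authors is an array of structs carrying an affiliation.location.country field.

from pyspark.sql import functions as F

# One output row per (paper, author), with the author's country pulled out.
authors = (comm_use_pq
           .select("paper_id",
                   F.col("metadata.title").alias("title"),
                   F.explode("metadata.authors").alias("author"))
           .withColumn("country", F.col("author.affiliation.location.country")))

authors.select("paper_id", "title", "country").show(10, truncate=False)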
So that means, if I'm lucky, in order to map this out I should be able to take the author's location country, map it to its paper_id, do a count, and be good to go. Except, problem number one: there are multiple authors, right? So which one do I choose? I'm simply going to choose the minimum geo, because for some papers there are actually multiple geos, so literally I'm just going to take the minimum one. That's probably not the best choice, by the way. In fact, I probably should have run a rank query to order by first author, second author, third author, and chosen just the first author, and if you're interested, that's exactly what I'll do in a subsequent session. So, just as a call out, I'm showing my mistakes, but for now I'm just doing a min. That's what I did here: I basically said, give me the min country for each paper. So this is perfect: each paper_id and the minimum country, as opposed to the rank, i.e. the first author listed, which is probably the one we should be working with. So: each paper_id, author, country, and that should be good to go, right? I should be able to map it out. Except this data can get dirty. For example, instead of saying China it says PR China, which obviously stands for People's Republic of China, but that's not a standard code for us to work with. And as you scroll through the data, you'll notice that there are mistakes in spelling. Keep that in mind as you're doing your NLP analysis, because, like I said, the CORD-19 dataset has a bunch of tasks and you're trying to figure out what to do with them; the data is funky as heck. If you look at the dataset, you'll notice that if I break it down, I have lots of things like "country of Spain" and "USA". Oh, and here's a funny one: there's literally a country recorded as "USA USA USA USA" inside one of the papers. So, to finish off before switching over to Chengyin, what I did is basically go ahead and build the mappings. I created a map manually, which I will have given you; it's at the bottom of this notebook. This map is boring as heck, but all it is, is a mapping from each of these author country values to the two-letter alpha-2 and three-letter alpha-3 codes. I literally listed it out and did this manually; fortunately there are only about 260 of them, so it wasn't that big of a deal. Long story short, there's a lot of dirty data in there, even in something that has been cleansed and organized really well. But now that I've organized it, I can map it out nicely, and as you can probably guess, the vast majority of papers in this commercial subset came from China or from the US. Not that surprising, but still good to know. There are also papers from Germany and France as well.
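A sketch of the per-paper country rollup and the manual cleanup described above, assuming the authors DataFrame from the previous step. The hand-built alpha-2/alpha-3 map is expressed here as a small lookup-table join, and the handful of entries shown are only illustrative.

from pyspark.sql import functions as F

lookup = spark.createDataFrame(
    [("PR China", "CHN"), ("China", "CHN"),
     ("USA", "USA"), ("United States", "USA"), ("Spain", "ESP")],
    ["country", "alpha3"])

paper_country = (authors
                 .groupBy("paper_id")
                 .agg(F.min("country").alias("country"))   # simple (imperfect) tie-break across authors
                 .join(lookup, on="country", how="left"))

# Papers per cleaned country code -- the basis for the map visualization.
paper_country.groupBy("alpha3").count().orderBy(F.desc("count")).show()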
So now that I've shown you the complexities of just trying to make sense of the data and how you have to cleanse it, let's switch over to Chengyin to talk a little bit about how we can do some cool NLP, which is the main thing about the CORD-19 dataset. - Thanks Denny, for the great introduction to the data sets. My name is Chengyin Eng; you can call me Chengyin. I'm currently a data science consultant at Databricks. I work with customers by delivering data science trainings and also professional services projects where I help implement data science solutions. I am currently based in Chicago, but I spent most of my undergrad and masters years in Massachusetts: I did my masters in computer science at UMass Amherst, and my undergrad was in statistics and environmental studies at Mount Holyoke College. I'm going to go ahead and share my screen. Okay. So as you can see here, the very top part, reading the dataset, is identical to what Denny has shown before. I have a bunch of parquet path variables here, and then I'm reading the data in parquet format. As you already know, the commercial use subset has 9,000 files, the non-commercial use subset has just under 2,000, and the bioRxiv/medRxiv subset has under 900. For the purposes of this NLP analysis, I'm going to be using just the commercial use subset. Just to show you what it looks like: if I don't do any cleaning, it looks like this, and it's pretty messy. You can see the abstract, back matter, and metadata entries. Something fortunate about this dataset is that even though it's not perfectly clean, I can get a lot of information from the metadata column, and recall from what Denny has shown before, you can see the schema; this is really just a nested JSON file. So when people think about NLP, the first thing that comes to most people's minds is: let's do some deep learning. But I'm going to show you two methods, one using deep learning and one not. I'm going to start with the non-deep-learning method: we're going to try to generate a word cloud from all the titles of the papers that were submitted to these organizations. First you need the wordcloud library to be installed. So let's take a look at what the metadata title looks like; you can see three examples of them here. And here I'm just writing a really simple function to draw a word cloud. You can see that I'm importing WordCloud and STOPWORDS from this library, and also using matplotlib. Here I'm splitting all the sentences into individual words, and I'm going to pass these cleaned words into this function right here, where the stop words argument, what it is really doing under the hood, is removing all the common stop words, words like "is", "are", and "of". What's nice about this generate method is that it actually calls the function generate_from_frequencies under the hood, which means that the size of a word that you will see later in your word cloud correlates with the frequency of that word across all the titles. And you can see here I'm just embedding the matplotlib plotting in the function over here. Then I'm going to use two functions from the PySpark SQL functions, concat and collect_list. What I'm doing here is concatenating all the titles available in the dataset. For example, there are 9,000 rows, 9,000 different papers in this dataset; rather than reading them one by one, I'm going to concatenate them all together so I can just pass the entire string into my word cloud function, and I'm going to separate them by commas. So you can see here that I'm creating a new data frame and aggregating all the titles.
I'm going to create a new column called "all titles", and now I can just pass my "all titles" string into this word cloud function. As you can see here, and probably not very surprisingly, infection, protein, virus, cell, and human show up as the top words in all the titles. But what we can do next is remove some of the non-meaningful words from the word cloud. For example, we don't really learn much from "using", or from "based", or even from "viruses"; we already know that coronavirus is a virus. So let's go ahead and remove them. Here I'm just making some really minor modifications to the function I've already written and updating the stop word set: I'm removing the words "using", "based", "analysis", "study", "research", and "viruses". Let's see what it looks like. Now you can see that there's no more "using", no more "based", no more "viruses", though there is still "virus". So this is the overall picture of all the titles available in the dataset. As you can see, I haven't really done much cleaning, but I was already able to do some really quick and dirty visualization of the dataset. So you don't even need to know deep learning to start doing something with a dataset, even though it's pretty messy.
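Here is a minimal sketch of the title word cloud walked through above, assuming the comm_use_pq DataFrame from earlier and that the wordcloud library is installed (pip install wordcloud); the extra stop words mirror the ones removed above.

from pyspark.sql import functions as F
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Concatenate every paper title into one long string on the driver.
all_titles = (comm_use_pq
              .agg(F.concat_ws(", ", F.collect_list("metadata.title")).alias("all_titles"))
              .first()["all_titles"])

stop_words = STOPWORDS.union({"using", "based", "analysis", "study", "research", "viruses"})

# Word size is driven by word frequency across all titles (generate_from_frequencies).
wc = WordCloud(stopwords=stop_words, background_color="white", width=800, height=400)
wc.generate(all_titles)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()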
So now I'm going to show you a deep learning method, which is to generate summaries of the abstracts. We know that there are 9,000 papers, and let's say I really don't have time, or I'm too lazy, to read through all of them; I just really want to know the summary, the important points of each paper. So I'm going to generate a summary from each abstract, so that the words you have to read are even fewer. Here I'm going to use a summarizer model that's trained on BERT and was initially used to summarize lectures. There's a link over here to the original paper that published this model, but what it does under the hood is utilize the BERT model for text embeddings, and also K-means clustering to identify the sentences that are closest together, to generate a summary. To use this library, you just need to install bert-extractive-summarizer, and if you're using Databricks, you will need to install it from PyPI. Here I'm doing a really simple import of the summarizer model, and I'm going to take the first abstract in the dataset and convert it to a string. You can see that this is the abstract you're reading here in the first paper; at a really quick glance, this is probably about 10 rows. And just to show you a longer abstract, this second example is probably about 20 rows, and you can scroll down even more, so it's a really long abstract. So I'm going to try the summarizer model first using the minimum length parameter. What the minimum length parameter does is remove any sentences that have fewer characters than the number you specify, here 20. You can see the code here is really simple and really concise, just one line calling Summarizer, which I define as the model, and then I'm passing my abstract into this model object and specifying the minimum length to be 20. So I'm going to generate the first abstract's summary, and you can see here that this is now two sentences, compared to maybe 10 sentences over here. And looking at abstract two, you can see that, rather than needing to click into the cell and scroll down, this is significantly shorter. There's another parameter that you can use for this model, which is the maximum length. What it means is that you remove any sentences that have more than 250 characters. So let's take a look at the first example again. If you recall, this is the original abstract, and I'm going to go down and take a look at the minimum length version; you can see that it's now two sentences. When I specified a maximum length, I suspected that the summary would get even shorter, because now all the longer sentences in the abstract are removed. But if I generate this again, you can see that it's also about two sentences long. What this means is, well, the first abstract is not that long to begin with, so playing around with the length parameters doesn't make that much of a difference. But let's take a look at the longer abstract. Here, just to call out, the maximum number of characters in maximum length is actually the same for both cases, and you can see that the result is a tiny bit shorter than the one above. So hopefully this provides you, you know, just an example of what you can do with NLP. You don't really have to know a ton about NLP, or even data science, to start playing around with this dataset. So hopefully this will empower you to do something a little more complicated, or even just help you feel good about what you can do with data science as well. That's all for me.
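And here is a sketch of the extractive summarization step, assuming bert-extractive-summarizer is installed (pip install bert-extractive-summarizer) and the comm_use_pq DataFrame from earlier; in the CORD-19 schema the abstract is an array of structs with a text field, so its pieces are joined into one string first.

from pyspark.sql import functions as F
from summarizer import Summarizer

# Pull one non-empty abstract out of the DataFrame as plain text.
first_paper = comm_use_pq.where(F.size("abstract") > 0).select("abstract").first()
abstract_text = " ".join(part["text"] for part in first_paper["abstract"])

model = Summarizer()

# min_length ignores sentences shorter than 20 characters; max_length
# additionally drops sentences longer than 250 characters.
print(model(abstract_text, min_length=20))
print(model(abstract_text, min_length=20, max_length=250))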
- Cool, thanks very much. So for all you folks who want to do the CORD-19, the COVID-19 Open Research Dataset Challenge, this excellent notebook from Chengyin will help you kick-start your NLP process. There are lots of really cool little examples there; it's not going to tell you how to do those exact tasks, obviously, but it should at least give you a pretty good idea of how it works. So Chengyin, thanks very much. Let's finish off with, last but certainly not least, Dhruv. Please go ahead and showcase your stuff. - Thanks Denny. Hey guys, I am Dhruv Kumar. I'm a senior solutions architect at Databricks. I've been with the company for two years, and before that I worked in a variety of big data roles. I'm very grateful to be here and happy to speak to the intelligent audience here, because there is a pandemic on our hands, and I feel like we, as skilled practitioners of this field, can contribute a lot back to the research and also help each other out. So as part of that initiative, what we have done at Databricks is take some of these open data sets and put them into our repository so that you can start analyzing and, hopefully, inferring some good research from them. Now, my goal today is to show you where these data sets are located, how to get started with them, and some of the analysis I have done so far. Spoiler alert: there's not much going on; I've just loaded the data sets and come up with a very few basic visualizations, but the hope is that this motivates you to go back and do your own experiments. In the spirit of keeping everything community- and research-based, most of the stuff here is built around open source tools, IPython, Jupyter, et cetera. So I'm going to be talking about how you can work in the Databricks environment but also with just open source techniques, so you can download this IPython notebook and work on your own systems as well. That said, what we have done, thanks to Denny Lee and our legal counsel, is this: Johns Hopkins University has been publishing an aggregated dataset on the coronavirus outbreak on GitHub, and there's a link for it; they refresh it every day, and it's a nicely formatted dataset, but it has some problems that we'll come to in a minute. What we've done is take this dataset and put it into Databricks Community Edition. There's a /databricks-datasets folder in which all this data is located, and because it's located over there, you can easily start analyzing it. You can also download this dataset onto your laptop, run Jupyter notebooks, and start mining it, for sure. But where I feel platforms like open source cloud platforms can help is in the flexibility they give you to bring multiple data sets into one place. So while the Johns Hopkins dataset is great, it only takes you so far; to get to more interesting analysis, you can look at some other public data sets and also bring them into the cloud repository. I have some research ideas at the end of this presentation if people want to collaborate; Denny will be in touch. So what are we trying to do here? Well, what does the outbreak look like on a global scale? Let's play with some data and see how far we get. As I said, we have already loaded this into the databricks-datasets folder, so let's see what that folder looks like. Right now this notebook is attached to a machine learning cluster which we have created. If I just do %fs ls, which is, by the way, a Databricks-specific magic command, and hit control + enter, you see I have all those folders here, which are just a mirror copy of whatever's going on in GitHub. Now let's look at some worldwide statistics; we're going to try to find out what the epidemic looks like at a global level. Because we downloaded this dataset on March 17th, that's the most recent data we have so far, so we will use that as our reference date. So over here I'm just trying to get to this particular file, the 03-17-2020.csv; you see all those files inside this GitHub repository are in CSV format, and the most recent ones are the 17th and the 18th. So this is what I'm drawing on, because it's March 17th. If I do that, I create that file path, and now I can just create a Spark data frame out of it. For folks who have been using Spark, this must feel very familiar: it's a CSV, we can read it with the header and infer the schema, and let's just see what comes up. So right now it's running the Spark job. You see, this is a PySpark data frame; because we are still in Spark land, a PySpark data frame is sort of an extension of a regular data frame with other bells and whistles. But as a data scientist, you can interoperate between the two; there are also other bindings like Koalas which allow you to go back and forth between these two environments. For simplicity, what I'll do is the following: when I first looked at this dataset, I was like, hey, this slash here does not look very good. Who names columns with slashes, right? Because it's very difficult to handle those characters later on.
So the first thing we're going to do is remove these columns and rename them as country and state. Let's go ahead and do that. Okay, great. Now, if this seems really trivial, I'm keeping it simple so that you guys can understand how this journey goes, and for folks who are not so familiar with Spark or data science programming, my point is that it's not that difficult; you can also start doing some cool analysis. So now I have renamed the state and country columns. Finally, I have two options: I could just continue going down the Spark, PySpark data frames path, or I can choose to convert to pandas as well. I tried both approaches, and there's another notebook I have which goes all the way in Spark, but I thought for this particular presentation I would use pandas, so that you guys can interoperate with other tools as well. So we can just quickly convert this PySpark data frame to pandas using the toPandas API, and let's see what happens when we do that. Alright, great. So now, if you see, this must look very familiar to folks who are using IPython in Jupyter, because it's formatted like that. Now, you must also notice one thing: although I was able to change the country and state column names, there's still some weirdness in the state column. You see, why is this particular row giving me another value for the state in Italy? So this is a classic big data problem, where you are trying to clean and massage the dataset. 80% of the problem, or rather 80% of the blockage to doing some cool analysis on data sets, is not ML; it's basically data engineering, ETL, and cleaning, and that's what we are doing right now. Now, how do you handle this? Well, there's a simple path: you can actually just go to OpenCage Geocode, or the other geocode API services available, where you can just pass in the latitude and longitude and get back the states. I tried that and it was giving me the right values, but to keep things simple, I'll omit that step right now.
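Here is a sketch of the daily-report load and column cleanup described above, assuming a SparkSession named spark; the /databricks-datasets path follows the mirror layout described above (it may differ), and the column names match the March 2020 Johns Hopkins daily-report format.

jhu_path = ("/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/"
            "csse_covid_19_daily_reports/03-17-2020.csv")

daily = (spark.read
         .option("header", True)
         .option("inferSchema", True)
         .csv(jhu_path)
         .withColumnRenamed("Country/Region", "country")   # drop the awkward slashes
         .withColumnRenamed("Province/State", "state"))

# Hand off to pandas for plotting with Plotly.
daily_pd = daily.toPandas()
print(daily_pd.head())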
Okay, so we have this dataset with latitude and longitude; how do we go about plotting it? Well, there are a bunch of different ways. One way is that we can convert these country names, China, Italy, Iran, to the ISO three-letter country codes; there's a spec that takes you from the country name to the three-letter code, and then you can pass that into the Databricks environment and get a nice looking map, which is what the other presenters, Denny Lee included, were showing you. There's also another way, which is Plotly. Plotly is available open source; you can just download it and install it on Databricks clusters. I have already done this, so if I look at my cluster, at the libraries which are installed, I've installed plotly and plotly-geo, and I was also hacking with keplergl, Uber's open source map library, the other day. But the point is that you can easily install any new libraries on these clusters and start working with them. So going back to our experiment: alright, Plotly, let's try and plot this guy. Because we already have latitude and longitude information, it's fairly simple and straightforward to plot. All we're going to do is pass it into this scatter-over-Mapbox call, and we're going to ask Plotly to give us a map with the data sized and colored by the confirmed cases. So let's see what happens. There you go. What this is doing is essentially taking that same pandas data frame, and Plotly was able to plot it properly; mind you, the same thing can be done in Databricks as well, but since we want to push these notebooks out for you to run locally, this is another approach to doing it. You see Hubei, as we expected; this is just really tragic, they had around 67,800 confirmed cases and 2,111 deaths. If we zoom in to what's happening in the United States and our neighborhood, let's see... it's just not giving me... yeah, there we go. So mind you, this is still aggregated at the state level, but we can drill further down into the county level as well, and that you can do using the geocoder lookup API; it gives you not only the state, but also the county and the city name. So anyway, these are just some of the things you can do with these data sets, showing you how to go about finding APIs online and using them in the same workflow.
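A sketch of the Plotly map described above, assuming the daily_pd pandas DataFrame from the previous step and the March 2020 Johns Hopkins column names (Latitude, Longitude, Confirmed, Deaths, Recovered); the open-street-map style is used so no Mapbox access token is needed.

import plotly.express as px

fig = px.scatter_mapbox(
    daily_pd.dropna(subset=["Latitude", "Longitude"]),
    lat="Latitude",
    lon="Longitude",
    size="Confirmed",                      # bubble size tracks confirmed cases
    hover_name="country",
    hover_data=["state", "Confirmed", "Deaths", "Recovered"],
    zoom=1,
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()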
Lastly, once we're done with that, what else can we do? Well, Johns Hopkins also gives us time series data in which they tell us, every day, how the disease has been progressing. All of those data sets are also located in our Community Edition open data folders, and I've shown you an example of how to open them. So we can again quickly read them through Spark, convert them to a data frame, and look at how the disease has been progressing per country over time, and using Plotly or Databricks' built-in visualizations, we can come up with some analysis and understanding. Now, this is all mechanics; what are we trying to accomplish? What can we do here? Well, one of the things is that we're trying to find out whether weather can make a difference in the disease transmission. We are not sure yet, none of us are experts, and correlation is not causation, so let's not confuse the two. But for your own visual analytics purposes, to understand and give back, you can download the NCDC dataset, which is a temperature dataset, relate it to these data sets here, and see what those trends look like. Another idea, which I was thinking about last night, is that we are right now isolating ourselves from others, staying six feet away, washing hands, as Denny mentioned. How effective is that? Number one, is it actually happening or not? First we have to understand that. Secondly, if it is happening, how effective has it been in containing the disease spread? One could actually use Caltrans traffic data; it's published freely by Caltrans for the Bay Area, and if you go to this link, you can subscribe, register, and get these files, which are basically five-minute-interval files of traffic information. So essentially you could use highway traffic data as a proxy measure of social isolation and see how it correlates with the disease transmission. Again, this goes back to our first point: while the Johns Hopkins dataset gives us a lot of rich info, for sure, the magic happens when you start combining other data sets, like the weather data or the transportation data, to come up with richer analytics. So that was short and sweet. We are going to be refining this notebook and publishing it later today in the other channels. And back to you, Denny. - Perfect, thanks very much. I want to be cognizant of the time right now, since it's a little after the hour and we ran a little bit long; I'm sure it's my fault, so no worries everybody (Denny clears throat). Dhruv, that was a wonderful session, very helpful. So, to everybody on this session, I want to remind you that we will be putting these notebooks online at the Spark Global Online Meetup, as well as the Seattle Spark + AI Meetup, as well as the YouTube channel, along with the links to the notebooks. We'll probably publish the Databricks notebooks first, just because we have them already done, but by the same token, we are going to be publishing the IPython notebooks so you can run them locally in your own environment. Hopefully this gives you a good starting point to be able to make sense of the data. Just like Dhruv called out, you can join other data sets together to make things really interesting; just like Chengyin called out, there's some amazing NLP you can do against these data sets, especially the papers; and just like Vini called out, there are some amazing visualizations and amazing data, whether it's the South Korean dataset she was working with or any of the other data sets. So go ahead and give things a try and see what you can get out of the data, because you can probably find some pretty interesting things. And as you do these challenges, don't just work with the data sets you've got; see if you can join some data sets together to find some interesting patterns. And that's it for us. I did want to finish by saying, like I said before, hopefully you all stay safe, shelter at home, wash your hands, and do social distancing. Otherwise, thank you very much. Karen, is there anything else we need to do to close this off?
Info
Channel: Databricks
Views: 10,392
Rating: 4.9069767 out of 5
Id: A0uBdY4Crlg
Length: 62min 23sec (3743 seconds)
Published: Thu Mar 19 2020