- So hi everyone. My name is Karen. I am on the devrel team at Databricks, and we're happy to be able to bring this online meetup to you so quickly, and thank you all for joining. And we all hope that you
and your families are safe and healthy, and we hope
you enjoy the session. So I'm gonna pass it along
to my colleague, Denny Lee. He's also on the devrel team. He's a developer advocate, so he's gonna kick things off
and get it started here. - So hi there. Again thanks very much,
Karen, for the introduction. My name is Denny Lee. I'm a developer advocate
here at Databricks, but before that I was actually a pseudo biostatistician. I was doing a masters in biostats at the University of Washington, working on the ADRA project, which was doing HIV and AIDS research with the Fred Hutchinson Cancer Research Center and the University of Washington biology lab here in the Pacific Northwest. The reason we decided to shift gears from our original session today, which was about addressing GDPR and CCPA, is that due to the current health crisis and health concerns, we thought this might be a very good time for us to go ahead and talk about a session like this. Even though I never completed my biostats degree, just as a quick call out, the reason I didn't is simply that Microsoft offered me money to go work versus me paying for a degree. So (laughing) yeah, I took the money, but (laughing) nevertheless, I do actually have a masters in biomedical informatics and I do have a background in medicine. I actually have a degree in physiology. That does not, I repeat, does not make me an expert, as much as my parents wanted me to be a doctor (Asian parents). No, I'm not, okay. So let's not fall into that pretense here, but I do have many friends
and many colleagues here in the Seattle
area that are currently helping to fight the coronavirus
and helping care for patients. So, before we go into the
data science side of things, I wanna be very clear. The number one thing you
can do to help everybody is wash your hands. Okay. I cannot overemphasize that enough. The number two thing you
can do is social isolation. This is the reason why
I'm currently sitting in my laundry room as opposed
to being out and about because it's the right thing to do, okay? So those are the two most important things for those who want to see
if they can possibly help. We put a panel of really cool
data scientists together today to talk about the
different data sets we have currently. There are actually probably more, by the way, but these are the ones that we happen to be working with: the South Korean dataset out of Kaggle, the CORD-19 dataset that's also on Kaggle, and the Johns Hopkins dataset that's actually sitting in GitHub, okay. And we're going to be showcasing these notebooks. We're gonna be saving them. We're running this on Databricks right now, but we're actually
gonna be saving things as IPython notebooks. So you can run them locally
on your Jupyter instance. So what'll happen is that
after this session is done, we're gonna post, on both
the Spark online meetup, as well as YouTube, where
this video is playing as well, links to those notebooks
that you can download and work with yourself with
these public data sets, okay. So what we're doing here is simply
just trying to encourage you folks to go ahead and take
your data science knowledge and see if you can possibly
help, but obviously, you know, understand that still the number one thing you can do is wash your hands. And the number two thing you can do is practice social distancing. And if you have time,
try to be on the lookout for your friends, for your elderly parents or your elderly colleagues,
and on top of that, just like a lot of the folks
here in the Pacific Northwest, and I'm sure it's happening
everywhere right now. Go ahead and donate to the
various meal deliveries for your healthcare providers, okay, just to help them out. These are the number one, two, and three things you should do, by the way, okay. Now, since we're bored and we're at home and we wanna do a little data science, this is actually also a fun thing to do, to try to analyze the data, okay. So saying that, I did want
to first introduce Vini. Vini, you're gonna introduce
yourself obviously, but I'd like to start with you to go ahead and actually present your session, okay. So let's start with you Vini and please take it away (laughing). - Thanks Denny. Hi guys, Vini Jaiswal,
customer success engineer at Databricks. I have been with
Databricks since late 2018 and I serve as a trusted
advocate for our customers to make sure they are
happy with our platform. I have been working in the big data industry in general for about seven years, and I have a masters degree in Information Technology and Management from UT Dallas. And I also worked with Citibank and Southwest Airlines before. So that's a little bit about me. Today I am going to show you some of my analysis. Alright, let me share my screen. So I'm gonna share with
you a quick notebook. I have analyzed the South Korea dataset, and as Denny mentioned, you can find this dataset later; it's available publicly. I just started analyzing some quick points, since there is a lot of, you know, worry about how the cases have been found. And I'm just gonna walk you
through some of the insights that I have seen in the data. So, I'm using Databricks notebook here, and you can do the same
thing in either Jupyter or other platforms, as you wish. Here, as you can see, I have the patients CSV and I'm just writing some SQL queries to understand what the data is. So this is what my dataset looks like. You can see that there is
patient ID, gender, birth year, country (of course mostly Korea, but there have been cases from people from other countries who visited South Korea), and we have other details like confirmed date, released date, and things like that. So the first thing I want to show you is the number of patients in each city. You can see that there has been a rise in cases, and it started with the capital city, which has the most patients. And then you can see other,
other regions as well. I'm just running a simple SQL query to find the patient counts for each region, so we can see the trend here.
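For anyone who wants to follow along at home, here is a minimal sketch of that kind of region-count query, assuming the Kaggle South Korea (DS4C) PatientInfo file and its patient_id and province columns; the path and column names are assumptions, so check them against the version you download.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("south-korea-covid").getOrCreate()

# Load the Kaggle PatientInfo CSV (path and column names are assumptions).
patients = (spark.read
            .option("header", True)
            .option("inferSchema", True)
            .csv("/path/to/PatientInfo.csv"))
patients.createOrReplaceTempView("patients")

# Count patients per region, largest first.
patient_counts = spark.sql("""
    SELECT province, COUNT(patient_id) AS patient_count
    FROM patients
    GROUP BY province
    ORDER BY patient_count DESC
""")
patient_counts.show()
```

The same query runs unchanged in a Databricks SQL cell once the table is registered.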
Another insightful thing I noticed: how about we filter the cases based on infection reasons? So, what I did is I
calculated month over month, what type of infection reason
the patient was affected by. So as you can see in January, a lot of cases were from
visiting the Wuhan region, and then cases eventually developed through contact with existing infected people. So, you can see that's where the trend started. In the month of February, all these cases which you see in orange are contact with patients, and it just grew organically. Then people started visiting other cities. And the second biggest infection reason was visits to Daegu, and it just eventually spiked up. So you can see the
most common trend is contact with other people. So that's why it's wash your hands, and why we have all these lockdowns in place. Another interesting thing about this data is the confirmed count. So let's see what kind of trend we have seen by date and number of patients. It started slowly in January. It picked up because of many reasons: travelers coming in contact with other patients and then, you know, testing. A lot of people were not even aware that infection had happened to them. So it just slowly started trending. So this is the trend over
a period of three months. Now, let's talk about
recovered and fatalities. So out of the cases that we know about, these are the numbers of recovered patients, and I'm trying to analyze them based on where the recoveries are coming from. From the graph, it looks like most patients recover in the capital area region. So you can see that there is the highest number of patients plus recoveries happening there. Now let's talk about fatalities. What I did was, I took the number of patients and their timelines from the confirmed date to the reported dates and just filtered by region. And most of the fatalities that happened were from Daegu, which was the church event, and then Gyeongsangbuk-do, and then the capital area. Now, let's talk about
the percentage of recovered patients by infection reason. So, it looks like most patients who were in contact with other people recovered: 40% of the total reported confirmed cases recovered. The second most recoveries, 22% of the total, happened among patients who had visited Wuhan.
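If you want to reproduce that kind of percentage breakdown, here is a hedged sketch. It assumes the patients table registered earlier and the infection_case and state columns (with 'released' marking recovered patients) from the Kaggle schema; the notebook shown in the session may compute it differently.

```python
# Share of patients per infection reason who ended up released (recovered).
# Column names and the 'released' state value are assumptions from the Kaggle schema.
recovered_pct = spark.sql("""
    SELECT infection_case,
           COUNT(*) AS total_cases,
           ROUND(100.0 * SUM(CASE WHEN state = 'released' THEN 1 ELSE 0 END)
                 / COUNT(*), 1) AS pct_recovered
    FROM patients
    GROUP BY infection_case
    ORDER BY pct_recovered DESC
""")
recovered_pct.show(truncate=False)
```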
So these are some of the insights which can be derived from our dataset. Next, the number of fatality cases. Here you can see that the main fatalities came from the church and the Daenam hospital, and a lot of these cases have unconfirmed dates, maybe for reasons like they were not being tested, or they were not reported, or maybe they were not in a state where they could explain how the disease got to them. So these are some of the
insights which we saw from the data that we had from South Korea. And a lot of it looks like the main takeaway from this is that over a period of time it just started happening, and it was mainly about coming into contact with existing patients. The second takeaway is that if we limit contact, the cases can be controlled. So those are some of the things which we are also experiencing in our current timeframe. So that's about the analysis. Denny, you wanna add anything? - No, this is great. Thanks very much. Actually, do me a small favor and scroll right to the bottom, so we can show
that call out actually. - Uh, yes, fatalities presented by infection reasons. So as you can tell, and remember, this is actually from Kaggle, the South Korean dataset, okay. And the South Korean dataset actually is relatively complete, because they actually have a very high rate of testing in South Korea. And so the fact that they can go ahead and specifically call out the various hospitals or the church, which was the main epicenter for South Korea, is pretty substantial. But the fact that they also cannot explain the reasons for the vast majority of the group is actually very telling, which is basically calling out that yes, once it spreads, it spreads. And we can't really do that network analysis to figure out whether there is a single cause anymore, because that network effect has kicked in. And so, this is really awesome work by Vini. She did it over a few days, I guess, which is pretty cool. So, there are actually
some quick questions here that I thought were pretty cool for you. Vini, if you could go ahead and sort of answer these questions here. So Vini, I guess the first question is, why did you, what, basically, as you were looking at the
South Korean dataset here, what led you to actually ask
these particular questions? Like, why did you decide to write the queries that you did? - Yeah. So this was based
on the news channels. I have been looking at the numbers and the reasons, and I talked to my friends about what some of the common questions are that have been asked or that people want to know about. So that led me to decide on
some of the questions. Also the dataset present, the
scheme I had in the dataset led me to find out those questions so I can find very relevant answers. - Cool. Cool, cool, alright. And then relate to that. So then basically, you know, you yourself as a domain
analyst or yourself as an expert with the background that you
have, like basically is that, is that the reason why, for example, you decide to write this SQL
or is there a reason why you chose to go ahead and write the queries the way you wrote them? - Yeah, it was easier to do in SQL. I could do it in Python as well. So, I can also mix and
match different languages within this notebook. But the reason I selected SQL
is because a lot of people whom we both share this notebook with, they may be more comfortable with SQL, so they can do easy
analysis in SQL as well. - Perfect. Perfect. So, just as a finishing note, there are some other questions that have popped up in the Q&A, and I'll actually take care of them in a separate session, if that is okay. But as Vini had called out, she's got this awesome
notebook that she created. We're going to be providing
this as an IPython notebook shortly. So that way you can run this locally against the Kaggle dataset yourself, okay. So, just as a quick call out, we are working on getting those links up. And so right now we have
them currently as Databricks notebooks, but we are
going to, like I said, convert them over. In fact, the final
session today, with Dhruv, is gonna showcase the fact that he actually was ahead of
the curve of all of us and had already taken care of that, okay. So, there you go. So, Oh, and there is
a interesting question that I will take over. So actually Vini, you can
stop, you can stop presenting. I will present my screen now, okay. Alright. Assuming I learn how to present my screen. That's a different story altogether. All right.
- Thanks, Denny.
Vini, I appreciate it. All right. Cool. So there are, there's a,
there's a couple of questions by the way, can you, I
noticed there's some questions popping up in the chat,
please put them in the Q&A, just because it's actually
easier for us to go ahead and organize. We're not gonna necessarily
answer the questions right away because each person is actually
trying to present their or what they're doing first, okay. So, the first question
I'm gonna go answer is any tips for developers
who are relatively new to data science and the
quick question is that everybody who irrelevant of
your experience, you could be, your experience could be
from a data engineering, could be SQL, could be algorithms. You can come in from a different angle. And actually this is very apropos for this particular dataset, because I'm actually gonna show you, from this dataset, the CORD-19 dataset from more of a data
engineering perspective, okay. And Chengyin right afterwards
is gonna show it much more from a data science perspective. They're both equally important, right? If I can, if this will become
very apparent very shortly, why they're both really important, but the call out is that I
would just start with that. Now, if you're really,
really new to data science, and never worked with it before. I would actually suggest to
go ahead and work with Pandas and Scikit-learn. This is probably the
easiest way to get on board. The, the Scikit-learn library. The website actually
has a lot of good stuff that gets you jump-started to learn how to do the stuff really quickly. Pandas is a relatively easy, you can install right on your laptop. So that way you're not actually
going out and trying to do something brand new, it's
just use your own laptop, whether it's windows or Mac
doesn't matter (chuckles), Linux for that matter. And you can just download the data sets and start working with it, okay. Alright. And then finally, there's another question
which I'm gonna go ahead and also answer live, which is, oh, for these data sets are
how often being refreshed. Actually, there are
different organizations that are refreshing them
on a different timelines. So for example, the South Korean dataset was updated at least
twice in the last week, the current CORD-19 working books, dataset here that we're
gonna be referring to as part of the COVID-19 open
research dataset challenge on Kaggle, that data sets
actually been updated. Right now they're on version three. It's my understanding
that they're updating it on a weekly basis though I'm sure they're updating
as fast as they can. If you wanna check out, by the way, there's actually a subreddit
on COVID projects called, that actually goes ahead
and has various people who are trying to go ahead
and analyze, organize, cleanse the data even faster. So I'm not gonna point
to any particular project just because there are many, whether it's actually data science or 3D printing, by the way (chuckling) . So, there's actually a
wide variety of them. But the call out basically is that there's actually a lot of
projects out there, as well, to take whatever challenge
you wanna go and do, because our attitude is like, you know, go out, do some analysis,
see if you can help out, but again, wash your hands first. Okay. Now, now that I've, I'm jumping off the soapbox,
(chuckles) pun intended, let's go ahead and talk
about the CORD-19 dataset. So, the what's cool about this one, the CORD-19 dataset, it's
actually a combination with Allen Institute,
the Chan Zuckerberg Initiative, Microsoft Research, Georgetown, the National Institutes of
Health and the White House. So, we've got a bunch of collaborators, as you can tell right away. The data set here is
actually primarily JSON files, okay. It's broken down into the commercial use subset, the non-commercial use subset, and pre-prints that have not been peer reviewed. These are all papers that are
about the coronavirus, okay. And so, we're gonna be using
this particular notebook on Databricks, okay. But these is notebooks,
like I said, free to share, and we're gonna be also
converting this into an IPython notebook as well for folks to
go ahead and work with, okay. So the beginning of this notebook, I'm actually gonna go a
little different. First, Vini showed you: here's what we can do with the analysis of it. Now, I'm gonna be a little
bit more boring and show you how you're gonna go ahead and try to read all of this JSON, okay. Because that's what
this CORD-19 dataset is. And then hopefully you
guys won't fall asleep because Chengyin will follow
up and she will do something much more interesting
with this dataset, okay. Because I wanna do the boring
data engineering stuff first. All right. So, if I look at the schema of this JSON, and this is just the json_schema.txt that they included with the data sets (all three or four data sets that they have as part of CORD-19 are here), you'll notice it says
things like the paper_id, metadata, like as in what's
the title, the authors of it, what the abstract is, the
body text, the text of it. And also all the bibliography references. And there's actually many,
many, many references. Okay. So, it's pretty cool stuff to have, but how do you make sense of it? Okay. So, what I did is I said, okay, I'm gonna go ahead and
actually take all the JSON's, which actually are, we
actually did upload this, if you happened to using Databricks, we did upload this already
to Databricks data sets. So you can just use them right away. If you don't, like I say, you
can just go to the Kaggle. I actually include the link
in the notebook to Kaggle to download the dataset itself. It's not very big, it's
two gigabytes, okay. So it's not like a
gigantic dataset yet, okay. Now, I'll explain in a second why I have these parquet
variables because, but the context is that I want
to go ahead and convert them from JSON to parquet. Okay, so why did I wanna do that? Well, the reason why is because in order for me to just have Spark and I've got four notes,
by the way, for this, for the Spark cluster. Just to read the 9,000
commercials use subset of, JSON files, each JSON contains
information of the paper, like the set, the author, the
title of the, the body text and the bibliography. It's a pretty long winded
JSON, as you can tell here. Okay. And so I went ahead and read it to read just to go ahead
and do the initial read, which isn't actually doing
a real process, okay. Basically it took me almost six minutes just to read those 9,000
on four notes, okay. So it's pretty, it's pretty tiring. The good thing about using
reading and Spark though is because I went ahead
and actually just simply put all 9,000 JSON files in one directory and it was actually able to
infer the schema anyways. So, that's why it took so much
time because it was actually able to look at all the JSON,
figure out what the schema was and just automatically create
the schema for me on the file. That's why you see what
you see here, okay. Important to note, these
JSON files are multi-line, i.e. they contain carriage returns. So if you don't specify this when you try to read the JSON files, what will end up happening is you'll see a corrupted record; that's all you'll see. And that simply means Spark did not recognize or understand what these JSON files were. So, it is what it is. There you go. But set the multi-line option and then you should be good to go, alright.
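Here is a hedged sketch of that load. The Databricks dataset path is an assumption (point it at wherever you unpacked the Kaggle download), and the important part is the multiLine option, since each CORD-19 paper is one multi-line JSON document.

```python
# Read the ~9,000 commercial-use JSON papers; the schema is inferred across
# all the files, which is why this first read is slow.
comm_use_path = "/databricks-datasets/COVID/CORD-19/comm_use_subset/"  # assumed path

comm_use = (spark.read
            .option("multiLine", True)   # without this you only get _corrupt_record
            .json(comm_use_path))

print(comm_use.count())
comm_use.printSchema()
```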
So, as we said, here's the count, and it tells you right away: okay, there's 9,000 files, so cool stuff, all right. All right. And also, I just verified
it by just simply doing it. Let me look at the path. So just do an LS on the
system and just do a count. All right, now, here's
where the fun part happens, okay (laughing). The fun part basically happens
here where you're going ahead and saying, let me go ahead and get the number of partitions to 286 in order to be able to,
like I said, to read it, it took, initially about five minutes, sort of almost six minutes. But then when I wanted to write it out, it actually took me
almost 14 minutes, okay. So, what I did is because
I have four nodes, I repartitioned the data and saved it as parquet. So instead of reading the individual JSON files, I save four parquet files. In fact, if you look here, here are the four parquet
files that you're seeing, okay, these are the four
parquet files, all right. So, because I did repartition as opposed to coalesce, I've minimized the skew. In other words, if I wrote coalesce here instead of repartition, which basically would have been faster, the repartitioning actually went ahead and made sure the file sizes were roughly the same. Coalesce, on the other hand, when I ran it
the first time, it actually, one of the files was massively bigger and the other three files were smaller. What that meant is that when
I was trying to do a query against the data, if I
was distributing this across four nodes, one node
would take on more of a hit than the others.
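A hedged sketch of the conversion step looks like this; the output path is an assumption, repartition(4) matches the four-worker cluster described here and yields roughly equal files, and coalesce(4) would be faster to write but can leave one file much larger, which is exactly the skew being described.

```python
comm_use_parquet_path = "/tmp/cord19/comm_use_subset.parquet"  # assumed path

# Write four roughly equal, snappy-compressed parquet files.
(comm_use
 .repartition(4)
 .write
 .mode("overwrite")
 .parquet(comm_use_parquet_path))

# Subsequent reads hit compact parquet instead of 9,000 small JSON files.
comm_use_pq = spark.read.parquet(comm_use_parquet_path)
print(comm_use_pq.count())
```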
Now, let's roll back a second. Why did I do all this? Okay. So I already told you: with 9,000 files, it took me about six minutes to go ahead and run the queries, just to read them and do a count, okay. Well, because I went ahead
and did this as parquet. You'll notice that now,
when I run the query, it actually finished
in three seconds, okay. And to read the files and do the count, it took less than two seconds, about 1.5 seconds, alright. So, this is the data engineering aspect. And so for all you folks
that are actually doing the CORD-19 dataset again, this
notebook is going to be made available to you so you, so follow up and hopefully this'll help
you charge up and go through your data faster, because
you're gonna be able to go ahead and convert the JSON into parquet. And you're gonna be able to go ahead and run your query significantly faster. When I was running queries
before on the JSON, each one took me five or six minutes, but because I went ahead
and converted to parquet, now they're happening in seconds, okay. So, from my data engineering
and data science perspective, that's significantly faster. Okay. So, hence the reason why we went through the process of doing this, okay. And the next step, which I'm not really gonna dive too much into is
basically the same thing. We just simply went ahead and provided, do the same conversion of
the non-commercial dataset. Same idea. It took longer before when
I saved it as a parquet, I went ahead and show the Cory went down significantly faster and also
for the peer review data sets. Alright. All right, so Noel asked a great question. What's the benefit of running to get the number of partitions. Okay. In this particular case, I wasn't really, there's no real true benefit to getting, I just wanted to call
out what the number was. In other words, that there
were 280 partitions before. What it does tell me though, is that basically because we
shoved the data in, right. I actually have a lot of partitions and if I was to save it out, remember the, the overall size and
this amazingly enough, the commercial dataset was the largest, but the old data size is only
about two gigabytes, right? So we're not talking massive scale here. We're just simply talking
about like, you know, a small set of files, right. But, I have the small set
of files that spread across 283 partitions. So if I saved the, in Spark or for that matter, if I
was doing this in Panda, same problem by the way, okay. I was to save it, I would
actually save it out as, 283 Parquet files, right. And that's bad, right, because
then there's the overhead of trying to read all of
those little parquet files, even though there's actually
not that much data in each one. Instead, because like I
said, I actually chose to do it as four just because the cluster set up the role set-up here that I have happens to have four notes. So I've saved us four. You could also just as easily
turn around and tell me, dude, it's a pretty small thing. This parquet has snappy codec as well. So it does a little bit of compression. You could have just saved as a one file and be done for the day and that's valid. And in fact, that's actually
what I'm planning to do for those folks who are
gonna be doing Pandas because you can actually in
Spark go read those files, save it as a pen, save as a parquet file and provided you remove
the commit and the, the metadata files then
Pandas can actually go read that file too, okay. So there, so basically that's
the real quick call out. And finally, the other call out from does not tell you whether
the files are splittable, yes and no. In this case, basically
there were 9,000 files to begin with. So basically the idea
of partitions in memory, we're basically taking
those 9,000 and we put, we created to Spark grade
283 partitions in memory, and basically shoved everything, those 9,000 files into
283 partitions in memory. By definition, if I write
those partitions out without actually organizing it. And if I was to actually
specify partition, or if I even did it with single partition, usually what ends up happening is, I have to I'll get something like 280, or at least a heck of a lot more files than before that I specified. And that's the reason why
I didn't wanna do that just because it's, the data
size is relatively small in the first place, okay. All right. Perfect. And we did have another
first first and asked, are we gonna talk about Delta Lake? We are not in a future session
if you guys are interested because we actually we're
thinking about diving deeper on the each and every one these notebooks. So Vini's notebook, this notebook and Chengyin stuff, and
also a Dhruv's notebook that will be coming very shortly. We actually are thinking about
going ahead and diving deeper into those situations where
we will be doing streaming and we will be updating the data sets. In those cases, I may go
ahead and decide to go ahead and include things like Delta Lake, but right now I'm just
focusing on what's the best way for every single data scientist,
to make use of the data. And there's a sizable chunk of folks here that I'm sure they're going
to go ahead and tell me, I'd rather do this in IPython
notebooks locally with Pandas. That's fine. We're not actually asking
you to use one version or the other. That's why we're actually
going to save the stuff, these notebooks in that way. So that way you can go ahead and run this as an IPython notebook
in Pandas as well, okay. All right. So, I did want to call out
some quick analysis of the data prior to Chengyin showing the cool stuff, okay. So, right now, like I said, I went ahead and read the
data and I said, okay, cool. Let me go ahead and read the parquet file. So that's what's great also is because, because I saved as parquet,
then any subsequent notebook that I have, I can just go ahead and say, Hey, let me just go read those
files, parquet files, bam. I've got myself a nice data frame and I'm back up and running, okay. So, the first few cells here are very similar to the previous cells so there's nothing really new here, okay. I'm just, again, showing the schema. It's pretty complicated, all right. So, how do I wanna make sense of the data? So in this case, I'm
gonna just use Spark SQL to help me try to look at the data. And then I know from this
schema, JSON schema texts, that the important things
that I wanna look at, because in this case, I was
just wanting to map it out a little and see which geo has what papers. Okay, so I can look at the paper_id, which gives me
the count of the papers, metadata, which tells me
things like, okay, the title, but more importantly, the
authors and their affiliation and their location, inside
the affiliation location; excuse me, inside the affiliation there's the location, and that location basically tells you where the author is from, okay. So, let's go do that. I'm gonna go ahead and do select paper_id, metadata.title. So this is an example: the evolution of poxvirus vaccines. And here is the paper information. And as you can tell, here
are all the array of authors and all their affiliations, okay. So, this is the author and
some of them will indicate exactly where they're affiliated. There's the affiliation for
example, and the location. And we're gonna say Spain, right here. Alright, perfect, alright. So, this tells you a little
bit about that information. So I'm diving, we're diving into the JSON trying to make sense of it. So, now I'm saying, okay,
let me break that out. So in this case, I'm gonna go do an explode, because what I care about specifically is just the author information. I don't care about all the other information; I just care about the author information. So let me explode that out. What that basically means
is that as you remember, as you can tell, there are
many, many, many authors for this first paper, okay. All right, so that's great. That's actually good information, okay. See, all these authors here,
I wanna have one row for each author. Oh, sorry, instead of having one row that contains all of these different authors inside here, where there are six of them, I wanna have separate rows for each author. So that way I can actually
understand the author location. All right. So that's what this column is. And so that's why I exploded it. So I can say, okay, author
one for the same paper, the evolution of pox
virus vaccines, right? Here are the six different rows for that particular paper: the same paper_id, the same name. I don't really need the title, but it's easier to read for everybody. So that's why I kept it in here, okay. And I can see where the
affiliation is, okay, and where the location is, okay. So, same idea: country of Spain, country of Spain. So, relatively straightforward, okay, perfect.
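For folks following along, here is a hedged sketch of that select-and-explode step. The nested field names (metadata.title, metadata.authors, author.affiliation.location.country) follow the CORD-19 json_schema.txt as described above, and the parquet path is the assumed output of the earlier conversion, so adjust both to your copy of the data.

```python
from pyspark.sql import functions as F

papers = spark.read.parquet("/tmp/cord19/comm_use_subset.parquet")  # assumed path

# One row per (paper, author), with the author's affiliation country pulled out.
authors = (papers
           .select("paper_id",
                   F.col("metadata.title").alias("title"),
                   F.explode("metadata.authors").alias("author"))
           .select("paper_id", "title",
                   F.col("author.last").alias("last_name"),
                   F.col("author.affiliation.location.country").alias("country")))

authors.show(10, truncate=False)
```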
So that means, if I'm lucky, I should just be able, in
order to be able to go ahead and map this out, I should
be able to go ahead and say, let me go ahead and take
the authors location, the country and map its paper_id and then I can do a
count and I'm good to go. Except, okay, problem number one. I don't know where there's
multiple authors, right? So which one do I choose? So I'm actually gonna simply choose the, the minimum geo. Because in some authors
there're actually multiple geos. So literally I'm just
gonna take the minimum one. That's actually, probably
not the best one, by the way. In fact, I probably should
have ran a rank query to get to do by first author,
second author, third author, and the chosen just the first author. And in fact, actually a for, if you're interested in
for a subsequent session, that's exactly what I'll do, okay. But so just as a call out, I'm
actually showing my mistakes. Okay. But for now I'm just doing a min. So, that's what I did here, okay. I basically said, okay, give
me the min country from this. All right. So this is perfect. So, each paper_id, the minimum
country, as opposed to rank, which I said, like I said before the rank I.e the first author listed there, that's probably the one
we should be working with. The each paper_id, author, country. And that should be good to go. Right. I should be able to map out. Except, this data can get dirty. Now, for example, instead of
saying China it says PR China obviously stands for
People's Republic of China, but that's not a standard code
for us to work with, okay. And as you scroll through the data, you'll notice that there are
mistakes in spelling, okay. So as you're doing your NLP
analysis to try to find, because like I said, the CORD-19 days that has a bunch of tasks and they're actually trying to figure out what to do with it. Yeah, it's funky as heck. So, what if you actually
look at the dataset, you'll notice that if I break it down, I actually have lots of things like, country of Spain, USA, okay. So, to finish off and then
switch over to Chengyin what I did basic, oh, here's a funny one. Like there's actually
literally a country called USA, USA, USA, USA that
literally was inside the paper. Okay. So, what I did is basically
I went ahead and said, let me just go ahead and get the mappings. And I actually created a map manually, which I will have given you, it's actually at the bottom of this, okay. So this literally this map
here, it's boring as heck, but basically all it is, is a mapping of here's the author country
for each one of these values. There's a two digit alpha
two code and an alpha-three code. So, I literally listed it out and just did this manually. Fortunately, there are only about 260 countries, so it wasn't that big of a deal.
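Here is a hedged sketch of that manual cleanup. The dictionary below is only a tiny illustrative subset of the roughly 260-entry mapping described here, and authors is the exploded DataFrame from the earlier sketch; the min() aggregation mirrors the stand-in discussed above (a rank by first author would be better).

```python
from pyspark.sql import functions as F

# Illustrative subset only; the real mapping covers every country spelling seen.
country_map = {
    "PR China": "CN", "China": "CN",
    "USA": "US", "United States": "US",
    "Spain": "ES", "Germany": "DE", "France": "FR",
}
mapping = F.create_map([F.lit(x) for kv in country_map.items() for x in kv])

papers_per_country = (authors
                      .groupBy("paper_id")
                      .agg(F.min("country").alias("country"))   # min() as a stand-in
                      .withColumn("alpha2", mapping[F.col("country")])
                      .groupBy("alpha2")
                      .count()
                      .orderBy(F.desc("count")))
papers_per_country.show()
```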
But the point is, long story short: there's a lot of dirty data in there, even in a dataset that has been cleansed and organized really well. But now that I've organized it, I can map it out nicely. And as you can probably guess,
the vast majority of papers for this commercial subset
came from China or from the US. Okay, not that surprising,
but still good to know. There's also papers from Germany
and France as well, okay. So, now that I've shown you
the complexities of just trying to make sense of data and
how you have to cleanse it let's switch over to Chengyin
to go ahead and show, talk a little bit about how
we can do some cool NLP, which is the main thing
about the CORD-19 dataset. - Thanks Denny, for the great introduction about the data sets. My name's Chengyin Eng,
you can call me Chengyin. I'm currently a data science
consultant at Databricks. I work with customers by
delivering data science trainings, and also professional services projects where I help implement
data science solutions. I am currently based in Chicago, but I did most of my
undergrad or masters years in Massachusetts. I did my masters in computer
science at UMass Amherst and my undergrad was in statistics
and environmental studies at Mount Holyke College. I'm gonna go ahead and share my screen. It does. Okay. So as you can see here, the very top part of reading
a dataset is identical to what Denny has shown before. I have a bunch of data
parquet path variables here, and then I'm gonna reading
the data in a parquet format. And as you already know, the commercial use subset has 9,000 files, and the noncommercial use
subset has just under 2,000, and then the bioRxiv/medRxiv subset has under 900. And for the purposes of these
NLP methods of analysis, I'm gonna be just using the
commercial use subset data. Just to show you how it looks like. If I don't do any of the
cleaning, it looks like this. And it's pretty messy. Can see that abstract back matter of the entries metadata. Something fortunate about
this data set is that even though it's not perfectly clean, I can assess a lot of
information from the metadata column. And recall from what Denny has shown before, you can see that this is the schema. So this is really just a nested JSON file. So when people think about NLP, the first thing that comes
to most people's mind is let's do something deep learning, but I'm gonna show you two methods, one is using deep learning, one is not. I'm gonna start with
non-deep learning method. So we're gonna try to generate a Wordcloud from all the titles of papers. They are submitted to, you
know, to the organizations. And first you need the word
cloud library to be installed. So let's go ahead and take a look at what the metadata title looks like. Can see that here there's
three examples of them. And here I'm just writing
a really simple function to draw a Word Cloud. Can see that I'm importing Word Cloud and stoppers from this library and also using that POL lib as well. And here I'm just
splitting all the sentences into individual words. And I'm gonna pass this clean words into this function right here. Where you can see here the argument for stop word here is what is really doing under hood is just removing
all the comments stop words. So for example, like Es are off, and so those words will be removed. And what's nice about this
generate method is that you actually cause the function generate from frequency under the hood. What it means is that the size of the work
that you will see later in your word cloud actually correlates with the frequency of
the word that you see in the all the titles. So you can see here, I'm just embedding the
macro function over here. So here I'm gonna use two functions from the pipework SQL
functions called CONCAT and also collect this. What I'm doing here is that
I'm going to concatenate all the titles available in a dataset. So for example, there are 9,000 rows, 9,000 different papers
over here in this data set, rather than reading them one by one , I'm going to concatenate
them into altogether. So then I can just pass
in this entire string to my word cloud function over here. And I'm going to separate them by comma. So you can see here that I'm creating new data frame over here. And I'm aggregating all
the titles possible. I'm gonna create a new
column called "all titles." So now I can just parse
in my "all titles," the string into this workup. As you can see here, probably not very surprising for you is what infection, protein,
virus, cell, human shows up as the top words in all the titles. But what we can do later
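A hedged sketch of that word-cloud step is below. It assumes the papers DataFrame from earlier plus the wordcloud and matplotlib libraries named in the session; the helper function in the actual notebook differs in its details, and collecting all the titles to the driver is fine here only because the dataset is small.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
from pyspark.sql import functions as F

# Pull every paper title back to the driver and join them into one string.
titles = [row["title"] for row in
          papers.select(F.col("metadata.title").alias("title")).collect()
          if row["title"]]
all_titles = " ".join(titles)

# Drop standard stop words plus a few non-meaningful title words.
stopwords = set(STOPWORDS) | {"using", "based", "study", "analysis", "viruses"}
wc = WordCloud(stopwords=stopwords, background_color="white",
               width=800, height=400).generate(all_titles)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```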
But what we can do next is remove some of the non-meaningful words from the word cloud. For example, here, we don't really learn much from "using", and we don't learn much
about based or even viruses. We already know that
Corona virus is a virus. So let's go ahead and remove them. So here I'm just gonna do
some really minor modification to the function I've already written here. And then just update the stop word set. So here I'm removing the
words using base analysis, study research viruses, and
let's see what it looks like. So now you can see that
there is no more words. There's no more using
there's no more based, there's no more viruses,
but there's still virus. But so here you can see that
this is the overall picture, of all the titles they are
in available in the data set. So as you can see here, I haven't
really done much cleaning, but I was already able to do
some really quick and dirty visualization about the
data set that you have. So you don't even need
to know deep learning to even start doing
something with a data set, even though it's priority test. So am gonna show you now
a deep learning method, which is to generate some reason extracts. So we know that there are 9,000 papers, and I said that I really don't have time, or I'm lazy to read
through all the papers. I just really want to
know what is summary, what is the important
points about each paper? So I'm gonna generate a
summary from each extract so that the worst that you
have to read is even less. So here, I'm gonna use a summarizer model. That's straight on BERT
that initially was used to summarize lectures. There's a link over here that links to the original
paper that published this model. But what it does under the hood is that it utilizes the BERT
model for text embeddings and also K-means clustering
to identify sentences that are closest together
to generate a summary. And to use this library, you just need to install
bert-extractive summarizer. And if you're using Databricks, you will need to install it using PyPy. So here am doing a really
simple import over here, summarizer model. And I'm gonna just take a
first extract in a dataset and convert that to string. So you can see that this is the abstract that you're reading
here in the first paper, just by really quick
glance this is probably about 10 rolls over here. And just to show you a longer abstract, this is a second example
of a longer abstract. This is probably like 20 rows. You can even scroll down even more. So this is a really long abstract. So am gonna train of
summarizer model first using minimum length parameter. And what minimum length parameter does is that whatever number you specify here, you remove any sentences as
fewer than 20 characters. So we can see the quotes
here are really simple and are really concise, just one line of calling summarizer. And I defined it as a model. And then I'm just parsing in my abstract into this model function, model object and I'm specifying the
minimum length to be 20. So I'm gonna generate the
first extract summary. And you can see here that
this is now two sentences compared to maybe 10 sentences over here. And I'm gonna look at abstract two. Can see here rather than having needing to like click into the
cell and scroll down, you can see that this is significantly shorter.
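Here is a hedged sketch of that summarization step, using the bert-extractive-summarizer package mentioned in the session (installed from PyPI). Pulling the abstract text out of the first row assumes the CORD-19 abstract field is an array of text sections; adjust if your schema version differs.

```python
from summarizer import Summarizer
from pyspark.sql import functions as F

# Join the abstract's text sections into one string for the first paper.
first_abstract = (papers
                  .select(F.concat_ws(" ", F.col("abstract.text")).alias("abstract_text"))
                  .first()["abstract_text"])

model = Summarizer()

# Ignore very short sentences when building the summary.
summary_min = model(first_abstract, min_length=20)

# Additionally drop very long sentences from consideration.
summary_min_max = model(first_abstract, min_length=20, max_length=250)

print(summary_min)
```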
There's another parameter that you can use for this model, which is the maximum length. What it means is that you would remove any sentences that have more than 250 characters.
a first example again. If you recall, this is why we do this
is the original extract, and I'm gonna go down and take a look at the minimum length. Can see that it's not two sentences. When I specified a maximum length, I was suspect that the
summary would get even shorter because now all the longer sentences in the extract is already removed. But if I generate this again, you can see that it actually is also about two sentences long. So what this means is that, well, the first abstract is
not that long to begin with. So playing around with the
minimum length parameter actually doesn't make
that much of a difference. But let's take a look
at the longer abstract. So here I'm recalling
so just to call here, you can see that the number
of the maximum number of characters in maximum
length is actually the same for both cases. And here you can see that it's
actually a tiny bit shorter than the one that you can see above. So hopefully this provides you , you know, just like showing you example
of what you can do with NLP. You don't really have
to know a ton about NLP or even data science
to start playing around with this dataset. So hopefully that this will
empower you to do, you know, do something a little more complicated, or even just help you to you
feel good about what you can do with data science as well. That's all for me. - Cool, thanks very much. So for all you folks
who actually are wanting to do the CORD 19, the COVID-19 corovinus, COVID-19 open research data challenge, this excellent set up a
notebook from Chengyin actually help you kick
start your NLP process. Lots of really cool little examples there that it's not gonna tell you
how to do those exact tasks, obviously, but at least should
give you a pretty good idea of how it works, okay. So Chengyin, thanks very much. Let's finish off last but
certainly not the least. Dhruv please go ahead
and showcase your stuff. - Thanks Denny. Hey guys, I am Dhruv Kumar I'm a senior solutions
architect at Databricks. I've been with the company for two years and before that I've worked
in variety of big data roles. So I'm very grateful to
be here and happy to speak to the intelligent audience here because there is a
pandemic up on our hands. And, you know, I feel like
we as skilled practitioners of this field, we can contribute a lot
back to the research and also help each other out. So as part of that initiative, what we have done at Databricks has taken some of these open data sets and put them into our repository so that you guys can like start analyzing and inferring some good
research from it, hopefully. Now my goal today would be to show you a where these
data sets are located. How to get started with them and see some of the
analysis I have done so far. Please, spoiler alert,
there's not much going on. I've just created data sets and, come up with very very
few basic visualizations. But the hope is that this motivates you to go back and do your own experiments. In the spirit of keeping everything, community research- based, most of the stuff here is around
open source tools, IPython, and Jupyter, et cetera. So I'm gonna be talking about how you can fork in database environment also with just open source techniques. So you can download this IPython notebook, and work on your own systems as well. But that said what we have
done and thanks to Denny Lee and our legal console. What we did was Johns Hopkins university has been publishing an aggregated data set on the coronavirus outbreak on GitHub. And the link for it,
and in there, you know, they going to refresh it every day. And it's a nicely formatted data set, but it has some problems
we'll come to it in a minute. But what we've done is
we've taken this data set and put that into Databricks'
community edition. So there's a Databricks'
flash data sets folder in which all this data is located, okay. Now, because it's located
over there you can go and easily start analyzing it. You can also download this
data set on your laptop and you know, run Jupyter Notebooks and, you know, start mining it for sure. But where I feel like platforms like opensource, cloud
platforms can help is because the ability for these platforms to give you flexibility in
downloading multiple data sets into one area, right? So while Johns Hopkins data
set is great, you know, it only gives you a, it
only takes you so far. To get to some more interesting analysis, you can look at some
other public data sets and also bring them into
the cloud repository. So I have some research ideas, you know, if people wanna collaborate at
the end of this presentation, Denny will be in touch. So what are we trying to do here? Well, what does the outbreak
look like on a global scale? Let's play with some data. Let's see how far we go, okay. So let's quickly. I have, we have, as I said, we are already loaded into
Databricks data sets folder. Let's see what that folder looks like. So right now Databricks is not connected to a cluster machine learning cluster, which we have created. If I just do FSLS this is by the way, a database specific magic command. But if I do this, if I control + enter, you see, I have all
those other folders here, which are just a mirror copy of whatever's going on in GitHub, okay. Now let's look at some
worldwide statistics, you know, so we're gonna try
to find out at a global level, how does the epidemic look like, okay? So, because we had
downloaded this data set on 17th March that's the
most latest we have so far, but so we will use that as
our reference date, okay. So over here, I'm just trying to get to this particular file, you know, on 31720.CSV, you see all those files, inside this GitHub repository, they are in CSV format and you know, the most recent ones are 17 and 18, okay. So this is what I'm drawing because it's on March 17th, okay. So if I do that, I create that file path. And now I can just create a
Spark data frame out of it for folks who have been using Spark, this must feel very familiar. It's a CSV. We can read its header to infer schema, and let's just see what comes up. So right now it's running Sparked off. So you see this is a
pyspark data frame, right? So this is what, because we
are still in the Spark land, so a pyspark is an extension of pyspark data frames sort of extension of just regular data frame
with other bells and whistles. But as a data scientist, you can just interpolate between the two. You know, there are also
other bindings like Koalas, which allow you to go back and forth between these two environments. For simplicity, what
I'll do is I will just, you know, I look at this dataset
when I first downloaded it, I was like, hey, this slash
here does not look very good. Like who names the columns
with slashes, right? Because you know, it's very difficult to handle these characters later on. So first thing we're gonna
do is remove these columns and rename them as
country and state, okay. So let's go ahead and do that. Okay, great, now we have clean country and state columns.
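Here is a hedged sketch of that load-and-rename step. The DBFS path mirrors the Johns Hopkins daily-report layout described in the session but is an assumption, and the Country/Region and Province/State column names come from the JHU CSV format of that period.

```python
# Load the March 17, 2020 daily report and rename the slash-containing columns.
file_path = ("/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/"
             "csse_covid_19_daily_reports/03-17-2020.csv")  # assumed path

daily = (spark.read
         .option("header", True)
         .option("inferSchema", True)
         .csv(file_path)
         .withColumnRenamed("Country/Region", "country")
         .withColumnRenamed("Province/State", "state"))

daily.printSchema()
```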
Now, if this seems really trivial, I'm keeping it simple so that, you know, you guys can understand how this journey goes. And for folks who are not
so familiar with Spark or data science programming, my motivation is it's not that difficult. You can also actually start
doing some cool analysis, okay? So now I have renamed this
state and country columns. I'm now finally, what I'm gonna do is I have two options right now. I could just continue
going down the Spark, pyspark data frames path, or I can choose to
convert to Pandas as well. So I tried both approaches and there's another notebook which I have, which goes all the way into Spark. But I thought for this
particular presentation, I will be using pandas so that, you know, you guys can interpolate
with other tools as well. So we have, we can just quickly convert this pyspark data frame to
pandas using two Pandas API, and then let's see what
happens when we do that, okay. All right, great. So now, if you see, this must look very familiar to folks that are using IPython in Jupyter because it's a formatted like that, right? Now, you must also notice
one thing that the data set, although I was able to change the country and state column names, there's still some weirdness
in the state column. You see. Why is this particular row State in Italy giving me another value? So this is classic big data problem where you are trying to clean
and massage this data set. 80% of the problems, or rather 80% of the blockage
to doing some cool analysis on data sets is not ML. It's basically data engineering and ETL and cleaning that whole thing. So this is what we are doing right now. Now, how do you handle this? Well, there's a simple path. And you know, you can actually
just go to OpenCage Geocode you know, the other geocode
API services available where you can just parsing
the latitude, longitude and lunatone and the states. So, I tried that and it was
giving me the right values, but to keep things simple,
I'll omit that step right now. Okay, we have this data set
now in latitude, longitude. How do we go about plotting it? Well, there are a bunch of different ways. One way is that we can
convert these country names, China, Italy, Iran, to
something called an ISO 3166 spec, which means that from the country name you can get to the three-letter code, and then you can pass it
into Databricks environment, and then we have a map, a nice
looking map that comes up, this is what you know, other
presenters were showing you. Denny Lee was showing you as well. Well, there's also another
way you think it's Plotly, Plotly is available opensource. You can just download it and install it on Databricks clusters. So that way I have already done this. So if I look at my cluster, if I look at the libraries
which are installed, I've installed plotly, plotly geo, and I was also hacking with keplergl, which is Uber's open source
map library, the other day. But the point is you can easily
install any new libraries on these clusters and, you
know, start working with them. So going back to our expense. So, all right, so we were at, okay. So plotly, let's try and plot this guy. So because we already have latitude and longitude information,
it's fairly simple and straightforward to plot. All we're gonna do is
pass it into this API, into this call for Mapbox. And, you know, we are gonna ask Plotly to give us a map and, you know, give us hover data around confirmed cases, okay. So let's see what happens. There you go.
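A hedged sketch of that Plotly map is below. It converts the Spark DataFrame from the earlier step to pandas and assumes the Latitude, Longitude, Confirmed, and Deaths columns from the JHU daily report of that period; the open-street-map style needs no Mapbox token.

```python
import plotly.express as px

pdf = daily.toPandas()

# Bubble map sized by confirmed cases, with hover details per location.
fig = px.scatter_mapbox(
    pdf,
    lat="Latitude", lon="Longitude",
    size="Confirmed",
    hover_name="country",
    hover_data=["state", "Confirmed", "Deaths"],
    zoom=1, height=500)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
```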
Okay, so what this is doing is essentially telling me, you know, it has taken that same data frame, the pandas data frame, and Plotly was able to properly plot it. Mind you, the same thing can be done in Databricks as well.
But yeah, because we wanna be pushing these out as downloadable notebooks, it's another approach to doing that. You see Hubei, as predicted, as we expected (and this is just really tragic), had around 67,800 confirmed cases and 2,111 deaths. If we zoom in to what's
happening in United States and our neighborhood, let's see. Just not giving me. Yeah, there we go. So mind you, this is still
aggregated at the state level, but we can go further drill down into the county level as well. And that you can do using
the geocoder lookup API. You know, it gives you not only the state, but also the county and the city name. So anyway, just some of the things you can do with data sets, you know, showing you how to
go about finding APIs online and using them in the same workflow. Lastly, once we're done with
that, what else can we do? Well, Johns Hopkins is also
giving us a time series data in which they are telling us every day how has the disease been progressing? So you can also, all those data sets are also located in our community editions,
you know, open data set, open data folders, and I've shown you an
example of how to open it. So we can just, again, quickly read it through Spark, convert it to a data frame, and look at how the disease has been progressing per country over time.
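Here is a hedged sketch of that time-series step. The file name matches the Johns Hopkins repo layout from around March 2020 and may have changed since; the wide date columns are melted in pandas so Plotly can draw one line per country.

```python
import plotly.express as px

ts_path = ("/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/"
           "csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")  # assumed path

ts = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv(ts_path))

# The time-series file is wide (one column per date); reshape it for plotting.
pdf = ts.toPandas().rename(columns={"Country/Region": "country"})
date_cols = [c for c in pdf.columns
             if c not in ("Province/State", "country", "Lat", "Long")]
long_df = (pdf.melt(id_vars=["country"], value_vars=date_cols,
                    var_name="date", value_name="confirmed")
              .groupby(["country", "date"], as_index=False)["confirmed"].sum())

fig = px.line(long_df, x="date", y="confirmed", color="country")
fig.show()
```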
And using Plotly or Databricks' internal visualizations, we can come up with some analysis and understanding. Now, this is all mechanics; what
are we trying to accomplish? What can we do here? You know, one of the things is that we're trying to find out
if weather can make a difference in the disease transmission. We are not sure yet. None of us are experts, and
correlation is not causation. Let's not confuse with that. But just for your own
visual analytics purposes, to understand and donate back, you know, you can actually download
NCDC data set, you know, it's a temperature dataset
and then relate it with this, these data sets here and see
how those trends look like. Another one, which I was
thinking last night was that, hey, we are right now,
isolating ourselves, you know, from others, staying six feet away, washing hands, as Denny mentioned, how effective is that number one? And number two, how effective is it, whether it's actually
happening or not first of all, we have to understand that. Secondly, if it is happening, how has it been effective in
containing the disease spread? So one could actually use
Caltrans traffic data. You know, it's published
freely by the department of Caltrans in Bay area. And if you go to this link, you can subscribe and register. And, you know, you can
come up with these files, which are basically five
minute interval files on the traffic information. So essentially use proxy of highway data as a measure of social isolation
and see what it's coming, how it's been correlated with
the disease transmission. Again, this you can, this goes back to our first point while Johns Hopkins data
set gives us a lot of rich info, for sure, the magic happens when you start combining other data sets, like the weather data or DOT transportation data, to come up with richer analytics. So that was short and sweet. We are gonna be refining this notebook and publishing it later today, and that will be in the other channels. And back to you, Denny. - Perfect. Thanks very much. I wanna be cognizant
of the time right now, since it's a little after the hour; we ran a little bit long. I'm sure it's my fault, so no worries, everybody. (Denny clears throat) Dhruv, that was a wonderful session, very helpful. So, to everybody on this session, I wanna remind you, we will be
putting these notebooks online at the Spark Global Online Meetup, as well as Seattle Spark+ AI Meetup as well as the YouTube channel. So that way the links to the notebooks, we'll probably publish the
Databricks notebooks first, just because we have them already done, but by the same token, we are gonna be publishing
the IPython notebooks? So you can run them locally
on your own environment. If you, hopefully this gives
you a good starting point to be able to go ahead and
make sense of the data, just like Dhruv called out. You can go ahead and join
other data sets together to make things really interesting, just like Chengyin called out. There's some amazing NLP that you can do against the data sets, especially the papers there
just like Vini called out. There's some amazing visualizations, amazing data on whether it's
the South Korean data set that she was working with or
any of the other data sets. But go ahead and give things a try. See what you can get out of that data because you can probably find some pretty interesting things. And as you do these challenges, don't just work with the
data sets you've got. See as noted. See if you can join
some data sets together to find some interesting patterns. And that's it for us. I did wanna finish it
saying, like I said before, hopefully you guys stay
safe, shelter at home, wash your hands, and practice social distancing. Other than that, I thank you very much. Karen, is there anything else that we need to do to close this off?