Screencast: Cleaning and exploring the COVID-19 Open Research Dataset (CORD-19)

Captions
Hi, I'm Dave Robinson, and welcome to a screencast where I'll be using R and RStudio to analyze data I haven't seen before. This dataset is a little different from the datasets I usually analyze on this channel: it's CORD-19, a challenge around a large set of scientific papers released by the White House and a group of research organizations, containing many thousands of articles about the coronavirus. I don't know about you, but I've been feeling kind of helpless throughout this global crisis, and I've tried to think of what I could do that might, in even the smallest way, help out. So in this screencast I'm going to download this dataset, look through it, explore it, and especially get it into a form that you can use other tools within R to analyze. The disclaimer is that while I am a data scientist, I'm not an epidemiologist. While I do have a PhD in computational biology, I am NOT a virologist. And while I have a good amount of experience in text mining, I'm not the kind of expert who dives in and does research-level projects in terms of understanding scientific papers and using natural language processing to gain those kinds of insights. I'm just seeing what I can do with the dataset I have here. Kaggle has provided a set of tasks that they're recommending people go through; I definitely encourage people watching to take a look at the data and maybe try out some of those tasks. I don't know whether I'm going to look at one of those tasks myself; I might just spend a bit of time cleaning and formatting the data and maybe answer a couple of exploratory questions. I also want to take a little bit of a look at what's called scispaCy, which is a Python package for working with scientific literature; that's something I can actually show how to
use with tidytext. All right, so I'm going to get started, and hopefully this can be helpful and educational to other people who might be interested in diving into this data. I've already downloaded the data and decompressed it, and I'll show you what it looks like; it's in my downloads folder. The archive has a metadata CSV, which we'll definitely take a look at, and it's got folders. I took a brief look before I started, just enough to see that I thought I could get something out of this session. And if we take a look at one of these JSON files, it looks like each one has a paper ID; it has some metadata, like the title and authors; it has an abstract; it has body text; and it has what look like references, bibliography entries. So there's a lot we can pull out. Of course we can take a look at the text; I'm especially going to want to take a look at the abstracts. We're definitely not going to be able to use text mining to gain really deep insights, but we can definitely look at what the topics of these papers are. But we're going to need to parse the JSON first, which maybe not everyone has experience doing, and that's why I wanted to show it in this screencast. So I'm going to start just like I usually would. This is going to be called cord19, the dataset. I'll say library(tidyverse); I'm going to need tidytext; and I'm going to need jsonlite, to be specific. I'm not going to be working much with that yet; what we're going to start with is read_csv. I'm going to read in the metadata, all_sources_metadata_2020. All right, so here we have the metadata. And what it looks like we have is: we have a source; we have the paper's title; we have an identifier for the paper;
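The setup just described might look roughly like this. This is a sketch: the local path and the exact CSV filename are assumptions, since they vary between CORD-19 releases.

```r
library(tidyverse)
library(tidytext)
library(jsonlite)

# Assumed local path; the metadata filename differs between releases
metadata <- read_csv("~/Downloads/cord-19/all_sources_metadata_2020-03-13.csv")

# A first look at the columns: source, title, identifiers, license, abstract...
glimpse(metadata)
```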
some will have PubMed IDs, which is really useful as a standard way to refer to a paper and to get other kinds of metadata (we might do that today); the license is going to be really helpful; and, good, it has the abstract, which is definitely going to be helpful; and the publication year. A lot of them are missing values, a lot of them are missing the abstract, and some of the licenses look a little confusing. So rather than just browsing through, I'm going to count a couple of these. Those who watch my screencasts know that I love counting things, so I'll say count(license). I don't know what I'm going to do about that; I'm probably not going to do too much with particular licenses anyway. Then count(publish_time); I could make a graph. It looks like it's mostly 2020, which makes sense: COVID-19 really only showed up as a threat around early January, when it started being really understood and documented. But there's a lot of missing data, so I'm not going to do much with publish_time; it doesn't have anything like a month, which would have been useful. There's a has_full_text column, which we can filter on, and we see there are 13,000 that do have full text if we filter down to those. Okay, so that's also useful to know. Am I going to do much with the titles? Am I going to do much with the abstracts? I'm deciding that now. I think I'm going to come back to them; I want to show how I might go about parsing this JSON data, even if at the end I might mostly be looking at the abstracts. Let me also take a look at what the README has. Okay, it describes those papers. There's definitely some missing data in here. Okay, I'm going to start taking a look at extracting text from all the full papers; there might be additional abstracts in there. Something I'm curious about is how many abstracts there are. With filter(!is.na(abstract)): twenty-six thousand
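The counting and filtering steps just described might look like this, assuming a `metadata` table read from the metadata CSV (the column names `license`, `publish_time`, `has_full_text`, and `abstract` match the 2020-03-13 release, but may differ in later releases):

```r
# Count categorical fields to get a feel for the data
metadata %>% count(license, sort = TRUE)
metadata %>% count(publish_time, sort = TRUE)

# How many papers have full text, and how many have an abstract?
metadata %>% filter(has_full_text) %>% nrow()
metadata %>% filter(!is.na(abstract)) %>% nrow()
```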
of them have abstracts. All right, that's good to know. Am I going to start analyzing them, take a first look at the abstracts? No, I'm going to start by parsing the full papers. The abstract is going to have the most important things in terms of the topic; it's going to describe some of the conclusions and show what directions of research people have been going in. The authors might be interesting too; I might take a little bit of a look at that. But first I really want to show how we're going to get the text out of these papers. What we're going to do is work with all the files; I have them in downloads, 2020-03-13. Let's use the commercial use subset (and there's a folder within that) and set full.names = TRUE. These are the paths to each of them. Now what I'm going to do is use purrr's map() along with read_json(), and that's going to parse every single one of those into JSON objects. There are about nine thousand of those in this commercial-use subset. Let's explore one of these objects first. We can see we're going to care about the paper ID. The metadata could be useful because we have last names parsed separately here; maybe I should pull that out. We have the abstract, and most importantly we have the body text, and we have all the references. Why have I been saying that I'm interested in the references? Because I think I might want to look at what papers are cited a lot, and I might want to look at whether these papers cite each other; those are all things I'm interested in. (I was stalling for time too, because I'm still waiting on this to finish.) But let's see: I'm going to care about the paper ID, I care about the body text, and then, since the body text appears
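A sketch of the file-listing and parsing steps just described. The nested folder path is an assumption about how the commercial-use subset decompresses locally; parsing all ~9,000 files takes a while and a lot of memory.

```r
# Assumed local path to the decompressed commercial-use subset
json_paths <- dir("~/Downloads/cord-19/comm_use_subset/comm_use_subset",
                  full.names = TRUE)

# Parse every paper into a named list (slow and memory-hungry for ~9,000 files)
json_objects <- map(json_paths, read_json)

# Peek at one parsed paper: paper_id, metadata, abstract, body_text, bib_entries
json_objects[[1]]$paper_id
str(json_objects[[1]], max.level = 1)
```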
in multiple paragraphs in a row, I'm going to join all those texts together. Okay, so let's look at the JSON objects, just the first one for a moment. It gets turned into a named list, and we can grab out the paper ID; we can grab out the abstract, which is itself a list (no abstract in this one). Is there a text item under body_text in this one? There is. So this one has body text, but the text is just one figure; this doesn't look like a great example. This one's great: this one has 21 paragraphs, and we can read through them; there we go, about 21 paragraphs. Here's our text: each of them has a text field. Okay, I'm going to want to pull out particular items from every one of these objects. There's a great verb I've only recently started mastering called hoist(), from the tidyr package. This is for when you have a nested list, just like this JSON output, that you want to turn into a rectangular form. The first step is to turn it into a table: this is a 9,000-row table where each object is one of these named lists. Okay, that's not so useful; we can't do anything with that. What I would do is pipe this into hoist() and say: hoist from the JSON object, I want paper_id = "paper_id", pulling out the paper ID from every single one. It just hit me that I probably want to assign this table object; does this take a while to run? Yeah, it takes a second; if it keeps taking time, I'm going to want to cache it. I just have a lot of stuff in memory; I think that's the reason it's being a little slow. All right, the point is that I hoisted this out: the paper ID is no longer nested, it's up at the top level now. And you know what else I can grab out? Here it is: the metadata. That also gets pulled to the top level, but as a list. And what were the items in it? Oh, there
was a title and authors. Actually, I think I only need the title and authors: hoist() allows me to dig down into metadata and grab the title, and now here we go: title, and metadata's authors. So now you've got your paper ID, your title, your nested list of all authors, and you've still got your JSON column. All right, notice that this is getting a little more helpful. Now, the important one: I could say abstract =, then "abstract", "text". I think that was right; let's just take one more look at this. Within the abstract, well, abstract is a list, and each element contains a text item. Does that work out? No, it does not. Can I somehow join them together? By the way, this is called a pluck specification, which allows me to give a set of accessors, like an integer index or a string name. Ooh, does an accessor function work? I wonder: I'm going to try something out here, whether ~ map(., "text") works. No, it doesn't look like it. What if it was an actual function? Okay, so it needs to be a function; it can't be a tilde formula. Oh my goodness, what is this? I've never seen this before: three question marks. I have no idea what that is, literally no idea. Let me pull out one whole abstract: "unspecified", okay. Something I guess we could do with pluck: pull(abstract), pluck(1). What is the class of this object? I just really want a vector. Okay, this is a vctrs thing, and I really don't know my way around the vctrs package; is there something I can do about "unspecified"? Hmm, well, I'm going to try this anyway. Is this silly? It's a little silly, whatever: I'm going to grab out the abstract; that's not what we're here for. Why am I doing the abstract with a map()? Because there are multiple elements, and I'll need to do the same thing for the body text: grab out body_text, map_chr with "text", and there we have it: we have
character vectors, each of which contains the text of the paper. There's a lot more we could get out; why don't I pull out the references too? So I'm going to select(-json), but first let me do one more hoist(). What is it called? There are bib_entries and ref_entries; I wonder what the difference between them is. Oh, ref entries are things like figures, which we can probably skip; they could be interesting, but we might not get anything out of them today. The important thing is we're seeing how to grab these out, so as part of this we're going to grab what's called bib_entries. The trouble with the bib entries is that we're going to need to do some fuzzy matching on titles if we want to match them together, so I'm just going to pull out the bib entries and keep them as they are for now. And I'm going to call this article_data. I'm curious about how big this json_objects is; is it so huge that that's the reason everything is slow? Not reading decimal points, what is that in megabytes: 480 megabytes? No, it's four point eight gigabytes, so yeah, that's big. So I'm actually not going to read all the papers in this time, but you could combine together the lists from a couple of these subfolders, or work through them recursively, and you'd be able to work with all of it yourself. Okay, one last thing to do here: let me say mutate(abstract = map_chr(abstract, str_c, collapse = " ")) to paste together the abstracts. Okay, I'm glad this at least worked: I'm combining the abstracts from their individual paragraphs, and then if I say pull(), and let's throw in a filter(!is.na(abstract)) — you know, now that I've seen it, let's do that for the body text too, with str_c. So I'm changing my accessor function in the hoist() to combine all the text
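Putting the hoisting steps just described together, the pipeline might look like the sketch below. The function accessors inside the pluck specifications (to collapse the paragraph lists into single strings) are the part worked out on screen; whether hoist() accepts them in this exact form may depend on the tidyr version.

```r
article_data <- tibble(json = json_objects) %>%
  hoist(json,
        paper_id = "paper_id",
        title = c("metadata", "title"),
        authors = c("metadata", "authors"),
        # Function accessors: collapse each list of paragraphs into one string
        abstract = list("abstract",
                        function(x) str_c(map_chr(x, "text"), collapse = " ")),
        body_text = list("body_text",
                         function(x) str_c(map_chr(x, "text"), collapse = " ")),
        bib_entries = "bib_entries") %>%
  select(-json) %>%
  filter(!is.na(abstract))
```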
elements of the abstract into one. So this hoist() is doing a lot of work in terms of data rectangling: it's turning these into text fields. And then we say article_data — gosh, this object is big — and we're leaving authors as a list; I haven't decided exactly what to do with it yet. And oh yeah, let's filter for !is.na(abstract), and here we have our papers. I'm actually going to throw in that filter: technically a paper could have text and no abstract, but I don't feel like that's all that likely. So here we go: title, abstract, and we have the body text. Now, I'm curious about a lot of things we could do with this. One is: do we want to separate out the sections? At some point I'll probably separate out things like, what's the introduction, what are the individual sections, and what kind of science can we do with that? But first I'm going to do a little bit of text mining. All right, so I don't need this anymore; here's my article_data. What I'm going to do for text mining, for starters, is use the tidytext package, by me and Julia Silge, and what we see here is a count of each word, and in the titles the word "virus" appears a lot, which makes sense for a description of coronavirus papers. But I should remove stop words, so I'll say anti_join with stop_words; these are uninformative words. So for starters I just wanted to explore what the most common words are. I could make a graph that looks something like this: we'll call this title_words, and I've just split the titles into one word per line. I'll take the 20 most common words, mutate(word = fct_reorder(word, n)), coord_flip(), and get a little graph that goes: virus, respiratory, infection, influenza. This is not super exciting. "Porcine" is relevant because I think it has something to do with pigs, and maybe it shows relationships to swine flu, could be. We see
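The word-count graph just described could be sketched like this, tokenizing the titles from the metadata table with tidytext:

```r
# One row per word per title, with uninformative stop words removed
title_words <- metadata %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words, by = "word")

# Bar chart of the 20 most common title words
title_words %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of titles containing this word")
```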
the word "coronavirus" appears a lot; I don't see the word "COVID" here. But yeah, these are words that appear in many titles. Again, that's not doing much; it's not telling us much; it's really not exciting at all. Just for kicks, I'm going to do that again for the abstracts; it never hurts to take a quick look to see that things make sense. These stop words aren't biology-specific or science-specific, so it's not perfect, but we'll take a look at the abstracts. Oops; oh, I didn't run it on... here it is. Okay, so very similar results: RNA, patients, host, data, health. Nothing here; this is like a word cloud; it's not really giving us any particular insight. Something that I've been thinking about, though, is: how can we tokenize in a way that's a bit more intelligent? And I want to show scispaCy. So spaCy is a Python package for natural language processing that includes a lot of things, including named entity recognition. Named entity recognition is when it finds snippets that are meaningful in a real context: it might be a person's name, it might be a company's name; in this case it might be a biological term. And scispaCy provides a set of these packages — these models, I should say — and here's one to load, en_core_sci_sm, which is a small model. In the example they give, they parse a passage with this function and then find the entities extracted by a mention detector, and it looks like these are bits of text that might have an entry in UMLS, the Unified Medical Language System. I took a little bit of a look at scispaCy before I started; I installed it in advance to make sure I wouldn't spend the whole time installing. I
haven't looked at it with this data yet. But the story here is: if we have "myeloid derived suppressor cells" and "the MDSC are immature", the point is that these are actually one big term: "myeloid derived suppressor cells" is a term, "myeloid cells" is a term. Similarly here, we might have something like "COVID-19" as a term, or "coronavirus" or "infection"; I'm not really sure yet. But the story is we might have terms that should span multiple tokens, and our tokenizer in unnest_tokens, the default English one, might not be well suited to this problem. So we're going to use spaCy, but we're not going to use Python; we're going to do it in R with spacyr. spacyr is a package developed by Ken Benoit; it's really amazing, really a terrific wrapper around Python. The way it works is that we load spacyr and we have to initialize it with our Python installation. As it happens, let me see: is this the right Python? Is that my Python? Hmm, I don't know; the right Python is the one you see in the terminal. That's okay, I'll grab it from here; I think it'll still work. I installed Miniconda, and I think if I say spacy_initialize — let me see — then we say python_executable, I remember this, python_executable is here, and we give it the name of the model. I installed the model in advance; I'm going to use the medium model now and we'll try the larger one later. There are three sizes of model here, and I don't know what the difference is really going to be on this data, but here we go: I'm calling spacy_initialize with a particular model. That's taking some time. I just know that if I then do spacy_extract_entity — this just takes all day to run — I should be able to run it on a string, a little bit like what they show here. Hey,
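The initialization step just described might be sketched as below. Both the Python path and the chosen model name are machine-specific assumptions; the scispaCy model (en_core_sci_sm / _md / _lg) has to be pip-installed into that Python environment beforehand.

```r
library(spacyr)

# Point spacyr at the conda Python where the scispaCy model is installed
spacy_initialize(model = "en_core_sci_md",
                 python_executable = "~/miniconda3/bin/python")

# Named entity recognition on a test string from the scispaCy docs
spacy_extract_entity(
  "Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity."
)
```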
it initialized. So the story is: I just ran it on this string, and it pulled these entities out. That does not look the same as their example; oh, that wasn't the small model. Hmm, what if I restarted? (I've got all the data backed up, I promise.) I don't think I can; that didn't work, so we're going to try restarting. I'm just curious whether the small model shows their version. It's definitely similar; we can actually see some of the same terms, but this one separates "myeloid" and "suppressor cells", and then "MDSC". All right, so I'm set up; we've initialized. I was curious whether I could reproduce what's in that example; no, I can't. I'll keep to the small one and then switch to the large one later, and we'll see if it makes any difference; small might be more reproducible, might be faster. I haven't tried it on a dataset even remotely this large before. But the story is: notice that it doesn't just break the text down into individual words; it has "immunosuppressive activity" as an entity. It's a little weird that it leaves in the newline there; I'm not crazy about that. But all right, I am working with this, and I need to do it in a tidy format; I'm going to want to apply this through unnest_tokens. This is actually really fun: something you might not know about tidytext is that unnest_tokens can take a custom tokenization function. So what I'm going to define is tokenize_scispacy_entities — that's not too long a name at all — which takes text and applies spacy_extract_entity to that text. Oh, I just murdered my session; I forgot to reload after running this. Do I still have my data? Do I still have article_data? I do not; I'll rerun that section, which is going to take a minute. But the story is that it's going to extract the entities, and notice that it actually still has everything under one doc_id. That's not
going to be good: I need to do group_by(doc_id) and nest(). The point is that all of these are text1; if I give it two strings, two lines here, I want to end up with a list, and each element of the list should contain the tokens within that document. By the way, something cool here is that it actually shows the position of each token within the string, which is kind of cool: you can actually say, oh, this one starts at position one, position three, position six. The more I think about it, the more I think we should build some spaCy support into tidytext, make them work really well together; I think other people might want something like unnest_entities in tidytext. Does that exist? (A note to Julia: unnest_entities in tidytext? Nope.) Well, that sounds like a pretty good idea to me. But the story is: we're going to extract entities here, we group by doc_id, and then we nest, and then we pull out — I'll show what this looks like; it's going to be a little bit of an adventure. (I'm trying to keep the video up with what I'm up to here, but it's a little slow because of the size of that json_objects. I'm going to remove json_objects for a while; just for good luck I'll run gc(), garbage collection. Sometimes removing a huge object can help — not all that often, but I think this needed it.) All right, the story is: right now I want to end up with a list of character vectors, and it's not doing that yet. So what we're going to do is also pull(data) — now you have the list of data frames — then map "text". Here we go, and now I get lists of the tokens within each document. All right, so I've got this; this is now a tokenizing function. In the tokenizers package, for
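The custom tokenizer being assembled above might be sketched like this: spacy_extract_entity() returns one data frame with a doc_id column, so it gets split back into one character vector of entities per input document.

```r
tokenize_scispacy_entities <- function(text) {
  spacy_extract_entity(text) %>%
    group_by(doc_id) %>%     # one group per input string (text1, text2, ...)
    nest() %>%
    pull(data) %>%           # list of data frames, one per document
    map("text")              # keep just the entity text: list of character vectors
}
```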
example, you can find tokenize_words, and if I applied tokenize_words to this same vector, you would have gotten this out; that's what unnest_tokens is doing under the hood. Well, not anymore: instead of abstract words, I'm going to create abstract entities. Here we go: unnest_tokens(entity, abstract). I'm not going to do any stop-word removal; notice that this process actually removed stop words, and it did it in a scientifically intelligent way, because scispaCy actually knows that the word "the" is not meaningful, but it's able to pick out the other entities that are meaningful. Let's take a look at these abstract entities; it should be one line for each entity within each article. Oh hey, this got really big; I bet this became a real issue, so big that R can't even print it. That's great; I was hoping that would happen, right on schedule. I'll tell you what I think happened: I left the full body text in there. That was a mistake, because when you unnest the abstract, it ends up duplicating the full text of the whole paper for every single word in the abstract. I'm going to do something we professionals call "bailing out": I'm just going to get out of here. Oh boy, where am I? I'm in repositories, data-screencasts. I'll open up a whole new RStudio while this one shuts down. Hey everybody, I can do Big Data too — I'm not, but you wouldn't know it to look at me. Here we go: cord19; I copied and pasted everything. Oh boy, that's really having a fun time; force-quitting this, which one is it, was it this one? Okay, and I'm going to have to rerun that code. So folks, I'll tell you what the next step is going to be: instead, I'm really going to need to take the article data, and I can unnest the abstract, but I need to include only the ID. Hmm, I guess I
could keep the title, can't I? I can use the title to refer to them. Or I'll use the DOI, and the abstract. The story is: drop all these other columns that are about to get duplicated over and over and over. Oops, that's the metadata; it's not the one I wanted. Oh yeah, paper_id; it was called paper_id, fantastic. All right, that was the metadata. There's actually some wisdom here, which is: generally don't start by analyzing your full dataset, because if you do something even a little bit wrong, it all crashes, and you find yourself waiting a lot, when we could have just done a thousand of these. For now I'm going to remove json_objects and run gc(). And notice I'm not bringing in the other papers; if you're doing this at home, you absolutely can bring in all the papers and analyze them. Just make sure you follow the license in terms of how you share them and whether you use them for commercial purposes. What I'm doing here is just trying to clear up a little memory. So what I'm doing is tokenizing — and look, I did not reinitialize spacyr. I'm going to start with head(100), tokenizing just a hundred of these. Here we go... oh, I left it in as the regular unnest_tokens. Here's what I need: I have to add token = and give it this function, tokenize_scispacy_entities. So now it'll apply that function to the vector of abstracts instead of applying the default English tokenizer. So notice a couple of things. One is it says "dendritic cells (DCs) are specialized antigen presenting cells", and we also see "APCs", "immune responses", and "adaptive arms". What's "adaptive arms", and "viruses"? Is "adaptive arms" a thing? Hmm, what was the title of that one? I'm going to leave in the
title, and now if I look at title one — oh wait, that's probably the next paper; this doesn't match. If I did this to abstract_entities... nope; article_data... I'm doing something wrong; let's find out what. Okay, these have six... oh, it's the abstracts, not the titles. Okay, I was a little bit off; here it is. All right, here we go: "of the immune system; they mature upon recognition of pathogens and upregulate MHC molecules." The point is that we've got things pulled out like "transiently exposed neutralization epitopes" here. I think what it's mostly doing is combining adjectives and nouns — "neutralization epitopes", not to be confused with all the other types of epitopes — and we do have plain nouns as well. The story is we're actually getting scientifically meaningful tokens out of this. The only difference is I really want to make these lowercase: map over them, pull the text out of each, map str_to_lower. All right, instead of the first hundred, I'm going to sample_n; it's really handy to be able to say sample_n(1000). Do I still need the title? I don't really need the title. So now what it's doing is performing this tokenization and named entity recognition on every one of these abstracts, and from those abstracts we can start to find out a little bit more. I could use the full text as well, but it's already taking long enough on just the abstracts; this is not a one-hour thing. It's applying the tokenization and entity-extraction steps to every one of those; it's just a wrapper around the Python version. All right, so I'm going to do count(entity, sort = TRUE): what are the most common entities? These are in abstracts, not in titles. We see that "viruses" moved down the list; I think the reason is it's probably split off
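Putting the pieces above together, the entity tokenization might be sketched as below, assuming `article_data` from the earlier JSON-parsing step and the `tokenize_scispacy_entities()` function. Lowercasing here is done after unnesting; in the session it was folded into the tokenizer itself.

```r
# Keep only the ID and abstract so unnesting doesn't duplicate the body text
abstract_entities <- article_data %>%
  select(paper_id, abstract) %>%
  sample_n(1000) %>%
  unnest_tokens(entity, abstract,
                token = tokenize_scispacy_entities) %>%
  mutate(entity = str_to_lower(entity))

# Most common scientifically meaningful entities across the sampled abstracts
abstract_entities %>%
  count(entity, sort = TRUE)
```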
into other things like "coronavirus" or "SARS", or other types of viruses that might have an adjective attached to them; I'll show what I mean by that. Let's see, I want to count entity and... I'm doing the same thing I did before; somebody helpful wrote a shortcut for this graphic which I have not yet installed. Okay, shown here are the common named entities, and "MERS-CoV" — yep, MERS, what was that, the 2012 outbreak? All right, and we see "the study", "human infections", "viruses", "data", "host", "associated with", "background", "virus". I don't see the word "COVID". Hmm, I don't know. Let me take a look; I'm curious: is COVID just not being recognized in these? If I looked at an abstract and picked a random one, would it be talking about COVID? "The antigenic"... I think a lot of these just aren't necessarily about COVID-19. Okay, that's something worth knowing. All right, another abstract: "bats", yeah. All right, so what am I going to do with these abstract entities? I have a thousand; I might try bumping it to two thousand later; we could grab all of them if we like, out of the nine thousand we have the full text for. What we might want to do is look at a couple of things: we could look at commonly co-occurring entities, or we could take a look at topics and try classifying these into topics. I'm going to start with co-occurring entities. I'm going to want to do a little bit of what we call unsupervised learning, just a little bit of clustering: what are groups of words that tend to appear together? These entities are going to be particularly meaningful terms, and okay, well, here are the categories of papers that are included in this
Well, I don't expect to discover anything new here — this is more about understanding the data, the shapes of what kinds of papers have been included. What I'll probably do is use add_count on the entities and say: if you don't appear at least 50 times, you don't even count. There are about 166,000 entities, but I'm going to remove the ones below a certain frequency — let's make it 100. add_count tells me how many times each entity appears, and then I filter on that. Now I'll load up the widyr package, which can find correlations among these entities: pairwise_cor on entity, based on paper_id — notice this just uses those two columns — with sort = TRUE. This is the correlation between one entity appearing in an abstract and another entity appearing in the same abstract. We see pairs like sensitivity/specificity and in vitro/in vivo — pairs of things that are obviously meaningful together. We also see pairs like MERS and MERS-CoV. Some of these, like PEDV — presumably a virus found in pigs — I'd have to look up. OK, so we're starting to get a feel for some of these frequencies. I'm going to call this entity_correlations, and — notice every pair appears twice — I'll just pick the top 400 (I'm making that number up) and try looking at those pairs. So what am I doing now? I'm going to load ggraph — ggraph is really helpful, though I also sometimes use igraph — and use graph_from_data_frame on the entity correlations. There it is: I've turned this into a graph. Then I say ggraph — there's a layout I like — and I'm going to build a network out of it.
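The counting-and-correlation step described above might look something like this — a sketch, assuming a data frame `abstract_entities` with columns `paper_id` and `entity` (those names are my reconstruction, not taken verbatim from the screen):

```r
library(dplyr)
library(widyr)

entity_correlations <- abstract_entities %>%
  add_count(entity) %>%                  # how many times each entity appears overall
  filter(n >= 100) %>%                   # drop entities below the frequency cutoff
  pairwise_cor(entity, paper_id, sort = TRUE)

# Every pair appears twice (item1/item2 swapped), so taking the top
# rows gives the most correlated pairs
head(entity_correlations, 400)
```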
I'll add geom_edge_link with aes(edge_alpha = correlation), then geom_node_point. I've done this with text so many times, which is why I'm doing it quickly without spending a lot of time describing it. This is a correlation network of words — despite how it looks, it has nothing to do with, say, the contagiousness of the disease; it's just how often words appear together in abstracts. Then I'll add geom_node_text with aes(label = name), repel = TRUE, and theme_void — I just kind of like it without the grey background, though your mileage may vary.

OK, what is this? These are, loosely, topics — clusters of words. We might have a few too many connections here... no, this might actually be kind of meaningful. What groups of things do we see? We see SARS and MERS, with MERS linked closely to bats and MERS-CoV. Here's a section on vaccines — and remember, this is just "did the word appear in the abstract at all": vaccine, vaccines, vaccination, protection, and related terms — a cluster about vaccines. There's one about proteins: binding, functions, interactions, and so on. There's one around patients. This section is more about detection: negative, prevalence, samples, sensitivity, specificity — sensitivity and specificity are measures of a test's accuracy. And then we see a section on epidemics: outbreak, SARS, MERS, coronaviruses. So this is kind of mapping out the topics of the corpus. One more thing — legend.position: I don't quite want the legend here.
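Putting the network plot together roughly as described — a sketch, assuming `entity_correlations` from the previous step with the `item1`/`item2`/`correlation` columns that `widyr::pairwise_cor` returns:

```r
library(igraph)
library(ggraph)   # attaches ggplot2

entity_correlations %>%
  head(400) %>%
  graph_from_data_frame() %>%           # the correlation column becomes an edge attribute
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none")
```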
I'll also set a title, "Words that often appear together in abstracts" — that goes in labs(), not theme(). When you have a graph you'll often want to set a seed, like set.seed(2020). Why? So that if I render the graph multiple times, the same clusters appear in the same areas — it's a random layout, so otherwise it's the same graph just plotted in a different arrangement each time. Here's the pig section again, and so on. Why do this? Because it starts to give a sense of what types of topics are described in these papers.

One thing we might wonder about is journals — do we have journal in the metadata? Let's find out: count(journal, sort = TRUE). Looks like we do... but most articles don't have one. That's frustrating. What else do we have? paper_id, title, authors, abstract, back_matter — not all that much.

Last thing: this is a first step toward saying what the common topics are in the abstracts of this collection of coronavirus-related papers — we've literally just scratched the surface, but it's something we learned from text mining in just the first hour. Other things we could have done with the abstracts: topic modeling, which would have clustered them into a particular number of topics. We probably should do that at some point, and I'm sure we will. And note these aren't plain words — they're entities that appear together in abstracts, based on the scispacy named entity recognition model. So we could keep going on the topics of these papers, basically adding extra metadata to them.
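The topic-modeling direction mentioned here isn't actually shown in the screencast, but a hedged sketch with tidytext and topicmodels might look like this (the data frame, column names, and k = 6 are all assumptions):

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Cast the per-paper entity counts into a document-term matrix
abstract_dtm <- abstract_entities %>%
  count(paper_id, entity) %>%
  cast_dtm(paper_id, entity, n)

# Fit an LDA model with an arbitrary number of topics
abstract_lda <- LDA(abstract_dtm, k = 6, control = list(seed = 2020))

# Top entities within each fitted topic
tidy(abstract_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10)
```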
All right, that's one direction: what topics are discussed. The next direction — and I'm going to time-limit this to about 15 minutes, to keep the whole thing to not much more than an hour — is to look at references. Let's take a look at our article data again. We have our bibliography entries; I'm going to select just the paper_id and the bib_entries. The bib_entries are a list where every item has a name — not the most interesting names, but they show what each entry is associated with. We can also see ref_spans, though this one doesn't have any spans associated with it — I think a ref_span is supposed to say "this passage has this citation." I'm not necessarily going to look at that right now, since it looks like it's missing in at least some articles. Let's look at the actual references instead: title, authors, and so on.

First thing to do is work on a small sample, because you never know when something is going to take a lot longer than you expect. Now, here I could hoist it — but do I want to hoist it, or is there another thing I can do? What I'd love is to get every field up to the top level. What if I try unnest_longer on bib_entries? OK, this is cool: it shows, in ref_id there's this, in title there's this, in authors there's this. But I don't want that — I want unnest_wider. Oh, this is great. I actually haven't used this function before, but I was hoping to because it's related to hoist — what a great function. I'm going to call this article_references. Remember, what I just did was only for a hundred articles; now I'm doing it for all of them — all nine thousand that are available for commercial use, which I filtered to earlier.
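The flattening step just described, as a sketch — assuming a data frame `article_data` with a list-column `bib_entries` (the field names follow the CORD-19 JSON; the data frame name is my reconstruction):

```r
library(dplyr)
library(tidyr)

article_references <- article_data %>%
  select(paper_id, bib_entries) %>%
  unnest_longer(bib_entries) %>%   # one row per bibliography entry
  unnest_wider(bib_entries)        # one column per field (ref_id, title, year, ...)
```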
Note that I needed to unnest the list first — unnest_longer to get every reference onto a separate row — and then unnest_wider to turn each element of the list column into its own column. That's pretty cool, though it's taking its time. This is something I sometimes do in my own work — your mileage may vary — but this is not going to be a huge dataset once it's processed, and if I remove the authors and the other IDs there's nothing to stop me from saving it. I'm not going to commit it to GitHub, but nothing stops me from saving it locally. I do wonder whether this is slower or faster than if we'd written it a different way — as I said, I hadn't tried unnest_wider before. I'll let it run a minute longer.

What might we want to do with these references? The most common thing I'd want to do is take article_references and count which titles appear most often. These are not the titles of the articles themselves — they're the titles of the articles being referenced. In other words: what articles are most referenced? I can keep coding while this runs — I make this kind of graph all the time. I also wonder whether there's a way to publish the cleaned versions — cleaned not in the sense of tokenization, but in the sense of this reference parsing. I'm going to pull the cleaning code up a little earlier in the document, so that if other people want to try this themselves, my exploratory analysis doesn't get in the way — I'll add an "exploratory data analysis" section and move all of that below it. Wow, this is a lot slower than I expected.
Once it finishes we'll be able to see a lot of things: what years people were publishing in, and so on — there's just so much that can be done with this dataset... if it ever finishes loading. I should have done it on a thousand first. That's a thing you do: OK, it works on a hundred, so try it on maybe three hundred or a thousand before you run it on the full nine thousand. I don't know how many references are in here, and I don't know whether some of them have weird fields — it's always possible. Should I kill it? I'm going to kill it. Instead I'll say sample_n(500) — 500 random articles — and pull out the bibliography entries, keeping the paper_id so I can still join back later if we want. (Later we could look at which referenced articles are associated with which papers — I'm actually really interested in whether any of these articles reference each other; presumably some do.)

All right, so now I have article_references. Some of these titles are really, really way too long — the function for that is str_trunc. But actually, the bigger problem is that some of these are not real references at all. So I'm going to quickly filter them out with filter(!str_detect(title, ...)) — I'll put this into the cleaning step; the other advantage of cleaning there is not making the exploratory code too long. I may do more cleaning later, but for now I'll filter out titles that are publisher boilerplate, things like "Submit your next",
"This article", or "Springer Nature remains" — those aren't references to papers. OK, so one of the top results is a paper like "Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia." What we're seeing is that a couple of articles are really highly referenced — and by highly referenced I mean that out of these 500 papers, they show up in something like 15. In fact, what I can do is count, then mutate(percent = n / n_distinct(article_references$paper_id)) — we know there are 500 papers, but not every one may have references — which tells me what percentage of articles reference each one. Throw in labels = scales::percent_format(), and this shows that about three percent of articles cite this paper — though with 500 samples that could just be noise. There are a couple of other common ones. One thing we see is that there could be duplicates across some of these names — these two, I'm not sure. I truncated to 50 characters, which is not nearly enough; I think those two got separated, or maybe differ in case. Here's one: "Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia." That was with 500 — can I do it with 2,000? I'll be a little patient and find out. "Bats are natural reservoirs" — OK, that's actually a good sign for what we're looking at.

I also wonder about years. We don't have the years of all these articles, but it looks like we have the years of, if not all, then a lot of their references. It's not going to be a random sample, and that's a complication, but it's still something we can look at. I'll take article_references, unnest the titles with the function I wrote earlier — tokenize_scispacy_entities — and first take distinct(title, year), even though a few are missing the year, and then filter for !is.na(year). I'm calling this referenced_articles.
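The "most referenced papers" plot described above might be sketched like this — assuming `article_references` with `paper_id` and `title` columns (names are reconstructions, and the boilerplate filtering is already done):

```r
library(tidyverse)

article_references %>%
  count(title, sort = TRUE) %>%
  mutate(percent = n / n_distinct(article_references$paper_id)) %>%
  head(20) %>%
  mutate(title = str_trunc(title, 50),       # long titles overwhelm the axis
         title = fct_reorder(title, percent)) %>%
  ggplot(aes(percent, title)) +
  geom_col() +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(x = "% of articles that reference this paper", y = NULL)
```

Note the caveat from the screencast: truncating titles (here to 50 characters) can split or merge near-duplicate references.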
Man, I do not like it when I let something run and it takes longer than I expected — don't love that. I wonder — I don't actually need all of the columns. What do I need? Let me go back to the article data. unnest_wider is fantastic, but it looks like it might be a little slow. Instead I'll hoist the bib_entries: I want to grab title = "title" and year = "year" — those are the two things I really want. Is there anything else? Not volume... year is the solid one. I'd probably want authors too, but that's going to be a list, which is a bit of an issue, so that's for later. If I do this on 100, it's easy. I think specifying which columns I want makes this easier than unnest_wider, and then I can rerun it on the 8,000-plus articles. Notice how I said this would be very fast, and then it wasn't — welcome to the party; that took much longer than expected.

But this is really interesting: I can look at the articles that are referenced. I'm going to use glue for the subtitle — something like "Based on the {n} articles open for commercial use" — I like putting details like that on plots. There we go. So what do we see as the most referenced articles? The top one is still "Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia." This one is BLAST — a really common bioinformatics tool. This one could be COVID-19, I'm not sure, and this one I think is SARS, severe acute respiratory syndrome.
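The faster hoist-based version, as a sketch with the same assumed names — `hoist` pulls only the named fields out of each list element instead of widening everything:

```r
library(dplyr)
library(tidyr)

article_references <- article_data %>%
  select(paper_id, bib_entries) %>%
  unnest_longer(bib_entries) %>%
  hoist(bib_entries,
        title = "title",
        year  = "year") %>%
  select(paper_id, title, year)
```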
So these are, at least, the most referenced papers in the field. Now, I wanted to ask something different. Let me take article_references, look only at the ones that have years, and take distinct(title, year). There are around 300,000 references in total — but how many distinct articles? 8,146. OK, I'll sample 500 of those and try this out: referenced_article_entities — running the named entity recognition on the referenced article titles. Then we can see which entities are used in which years: count(year, entity), and now I see that in a given year there were this many mentions. I'm going to filter for year 1950 or later, or something. Then I only want to keep the somewhat common entities: group_by(entity), then filter(sum(n) >= ...) — at least 10 papers? 20? No... 2 papers, 3, 4 — I don't know, man. All right, so there are about 50 entities that appear in at least 4 papers. Why am I doing this? Because one thing we can then do is ask how often a particular entity appears in a given year — count by entity and year. So I can filter, say, entity == "bats". Here's an example — it's not mentioned a lot, even in the referenced articles, but this was just a sample; it will probably improve when we look at everything. I was wondering why there were so few papers each year. So what I'm doing is taking the individual entities and looking at their mentions.
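The entity-by-year counting step might look like this — a sketch, assuming a data frame `referenced_article_entities` with columns `year` and `entity` (hypothetical names):

```r
library(dplyr)

entities_by_year <- referenced_article_entities %>%
  filter(year >= 1950) %>%
  count(year, entity) %>%
  group_by(entity) %>%
  filter(sum(n) >= 4) %>%   # keep entities mentioned at least 4 times overall
  ungroup()
```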
Then I say how often these are mentioned in a particular year. Note this is not normalized — it's not divided by the total number of mentions that year. And it's a funky set — it's not even really a set of coronavirus papers; it's the set of papers that those coronavirus papers are citing. But then I can look at how often something is mentioned by year. Instead of single years I could combine pairs of years — while I'm waiting for this to run: year = 2 * (year %/% 2), using integer division, so that pairs of years get combined. There's always some paper that it thinks was published in 1900, or in year negative-two-thousand or whatever.

This may take a while — it's tokenizing 323,000 titles using the scispacy model. The titles are shorter than the abstracts, but there are many more of them. I'm going to plug in my computer and let this run. [Music] I've got a cat that's meowing. Did it finish? Nope. Let me sample 3,000 — does that run in a reasonable amount of time? This is one of the issues with text mining. Oh boy — I bet it found some titles that don't have any tokens at all, which means I'd have to rewrite this function. I'm not going to do that now. Instead I'll do something a little simpler: I'll just look at words, treating the titles as plain English text, just so I can show the approach — we already have a sense of how we'd go about doing this with scispacy. So: count by word and year, then filter word == "bats".
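The simpler word-level fallback, treating the referenced titles as plain English text with tidytext — a sketch, assuming `referenced_articles` with `title` and `year` columns (names are reconstructions):

```r
library(dplyr)
library(tidytext)

by_word_year <- referenced_articles %>%
  filter(!is.na(year)) %>%
  mutate(year = 2 * (year %/% 2),   # combine pairs of years
         id = row_number()) %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words, by = "word") %>%
  distinct(id, word, year) %>%      # count each word at most once per title
  count(word, year)
```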
All right, by word and year — here we go; I think I'll even include both. What this gives us is a way to ask: how much are bats talked about in citations over time? Where would 2020 be — I guess there. The fact that there's a spike doesn't necessarily mean discussion of bats has gone up or down; it could just be that the overall number of references per year has changed. How can we tell? We can look at the totals: take the referenced articles and count(year, name = "total"). One thing I should also do is distinct(paper_id, word, year), because if the same word appears twice in one title we don't want to count it twice — oh, I neglected to include the paper_id. That's OK: the same word appearing twice in a title seems pretty rare, so it's not a big concern. All right, then what I want to do is join in the totals and graph what percentage of articles from that year mention this word. Hmm, that seems a bit high — and it is too high. How can it be higher than the year total? Oh, because I needed to count on the merged year totals. Wait, how does a word appear 28,000 times? Oh my goodness — n divided by total, not n divided by year. Oh gosh. And filter for year <= 2020. There we go.

So what this shows is not that the literature used to be crowded with discussion of bats. Rather, it shows that when a paper in this corpus of COVID-19-related papers cites an old paper, that old paper is usually about bats. The point is: how much do the referenced papers refer to bats in the title? Not a particularly special graph, but I just wanted to show there are things you can do with this dataset of article references.
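The normalized version — percent of referenced titles per year that mention a given word — might be sketched like this, assuming `by_word_year` from the earlier step (with years binned into pairs) and `referenced_articles` (hypothetical names; the year totals are binned the same way so the join lines up):

```r
library(dplyr)
library(ggplot2)

year_totals <- referenced_articles %>%
  filter(!is.na(year)) %>%
  mutate(year = 2 * (year %/% 2)) %>%
  count(year, name = "total")

by_word_year %>%
  inner_join(year_totals, by = "year") %>%
  filter(year <= 2020, word == "bats") %>%   # n / total, not n / year!
  ggplot(aes(year, n / total)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "% of referenced titles mentioning word")
```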
Another thing we can ask is what venues — what journals — these referenced articles come from. It looks like it's Journal of Virology, followed by Nature, Virology, and Science. There's more we could do with this data: we could have found the most-published authors; we could have done a lot of work with this.

OK, so what did we do today? I took a look at the metadata — we didn't do much with it. I showed how to use the hoist function from tidyr, and a few other little tricks, to extract the data from each of these JSON objects in these files. I showed how to pull out the details of the article references. Then we did some exploratory analysis of words in titles and abstracts and, most of all, of scispacy entities, where we ran named entity recognition. That needs a bit more work to be production-ready on larger datasets, but the story is that we didn't just do English tokenization — we also used a model specifically trained for this purpose. And finally, I took a look through the article references and saw that the articles in this dataset tend to cite certain papers the most, though citations are still pretty well spread out — the top papers are cited by only a couple percent of articles.

So that was a shallow look into this data, focused on the titles, the abstracts, and a little on the references — a bit of organization, cleaning, and exploration. I really wanted to get something out there in terms of how someone could pick up this dataset and start analyzing it. Lastly: by far the most important thing that I or anyone can do is probably to stay home, to wash our hands, to spread that message, and to protect everyone we love — and to really hope that, through the work of the scientists and health
care professionals and everyone else, together we can get through this crisis. All right, thanks very much, and I'll see you for future screencasts.
Info
Channel: David Robinson
Views: 8,809
Rating: 5 out of 5
Keywords:
Id: -5HYdBq_PTM
Length: 81min 35sec (4895 seconds)
Published: Wed Mar 18 2020