Data Visualization Workshop with HoloViz for the Kaggle COVID-19 Dataset

Captions
All right, cool. So yes, welcome everyone to today's workshop on doing data visualization with the COVID-19 data set that's being hosted on Kaggle and multiple other platforms.

To give you a little background on how this even came to be: back in, I think, the end of April, we had the CTO from ClosedLoop.ai share a really, really cool data science tool that they created for healthcare professionals. It basically helped doctors and hospitals assign a vulnerability index to people who contract this virus, to help them make quicker, better decisions. The turnout was phenomenal. We had a massive turnout and a lot of interest from the tech community, and that inspired me to do more. But I wanted to do things a little bit differently: rather than just sharing tools that other data scientists have created, I wanted to also empower our data science community to partake in this whole challenge of how you can use data science to create tools or create insights. I was really excited when I found out that Kaggle was already hosting a really awesome challenge, and their goal is that whatever comes out of this Kaggle challenge can be used by researchers and scientists who are having a really difficult time figuring out what papers or research reports are best suited for their specific task. So I'm really excited that we're able to do this event today. We also have one on machine learning for COVID-19, a workshop that's at the end of this month, so I can share more details about that later.

For those of you who are unfamiliar with Galvanize itself, I am their developer evangelist, and we offer tech education programs in both software engineering and data science. I had the pleasure of working with Jim on a different workshop a few weeks back. He had a really incredible hands-on workshop teaching the community how to build dashboards in Python very quickly. It was awesome, and I knew we needed to bring Jim back to teach this workshop today, so I'm
super thankful and grateful to Jim for creating this workshop literally from scratch. If you want a recording of Jim's prior workshop on how to create a dashboard, send me a message afterwards and I'd be more than happy to share the recording with you. Some ground rules for today: please keep yourselves on mute and post your questions in the chat box. I'll be sharing the recording post-event for those who want to review it. So without further ado, Jim, I will let you take over.

Okay, thank you very much for inviting me. I'm happy to help out in whatever way we can with getting us all out of this situation humanity has found itself in. I work at Anaconda, and I'm here on my own time for this, but at Anaconda my team works with corporate, government, and academic clients to help them analyze and visualize their data. In the course of that, and in fact prior to that when I was a professor, we've been building tools to help in that process, and each time we work with another client we make the tools better. What I'll show you today is where the tools are today. We've been hard at work on them for, I guess, a decade now. Some of them are brand-new tools, some are very old, and together they address a lot of problems about how to visualize and understand data.

The reason why they're relevant here is that when you're trying to participate in something like a Kaggle challenge, you quickly go from one thing to another: maybe I should do it this way, maybe I should visualize it that way, maybe I need to interactively select this or this. If those steps are painful and difficult, then you can't make much progress. So what we'll try to show you here is how to make those steps quite straightforward where possible, so that you can go back to focusing on the underlying task that you're doing: hopefully helping researchers with COVID. Well, hopefully not
researchers with COVID, but helping researchers.

All right, so just as a background... actually no, not as a background, just to make sure everybody is getting started: in the chat window you can cut and paste anything you need from these instructions. They should be on your screen right now. Basically you need to install conda. There's a package here that has the entire contents of everything that I'll show today, so you can run along; you should run the same things during the talk. If you have problems, raise them in chat. I'm talking, so if there are serious problems she'll break in and stop me; otherwise hopefully you can help each other. If you see that, "oh well, I had that problem but I fixed it," please chime in on the chat to show people how to do that. Inside this project there are some Jupyter notebooks, there is a specification for a conda environment, and specifications for the data file you'll need from Kaggle. Once you have it, these are the instructions for actually running it. What I'm displaying here is the result after step 5. I'll also show a couple of other projects, but the main one is what you get in step 5 [unclear].

Can you all please place yourselves on mute if you aren't already?

All right, I'll give a brief overview up front while everyone's getting set up, but I just want to make one point at the beginning, and that is that this is a global crisis that in part has had visualization... Okay, I'm sorry, I think I was muted briefly, but this is a crisis that has had visualization...

Sorry about that. I was trying to mute everyone and accidentally muted you. Sorry.

Yeah. So probably a good portion of the people in the world have seen this plot on the left. That's actually just a diagram, not data, but it corresponds to data as it's being collected, and this plot here has directly driven policymaking. People have collected data and compared it to this plot and tried to understand whether they have been
successfully flattening the curve. You have a lot of people across the world talking about data visualization in ways they have never done before. A lot of people have also seen this Johns Hopkins dashboard showing alarming levels of infection across the world as they've appeared. A lot of people have also seen comparisons between countries and tried to say, is my country doing well, or at least somehow doing better? And then other people have really tried to put it in perspective: "it's just like the flu." Well, it's not just like the flu; it's off the charts in so many ways. So there's just a whole lot of visualization going on. It's like a renaissance of visualization, and that's good.

But you have to be really careful, because people all over the world are saying, "wow, I can make a visualization, I could analyze some data, I could throw something out there," and people put it out on the web and it's completely misleading people. It's easy to not quite analyze the data in the right way, confuse everybody, and help people come to the wrong conclusions. This is particularly important if you are a researcher or a scientist or an engineer but you're not in your own field. If you're working in some other field, you don't know the pitfalls that people fall into as soon as they step into that field. If they try to visualize geographic data, or try to understand disease incidence, there are a lot of ways to misunderstand this data.

So I just want to urge you: whenever you're visualizing anything here, try to keep it simple, so that it can be understood and so that you understand it. If you're modeling, please don't model something and then tweet about it and pretend that this is what's actually happening. This is a useful model, but you have to stress that this is just a model; it's just an understanding. You have to be very careful with predictions in particular. People have been throwing
predictions all over the place and confusing everybody. Really just try to be humble, and try to remember that whatever you're an expert in, you're going to be in an area that involves things that you are not expert in. So try to reach out to the people who are experts there, and get some feedback before you present this as what the country should do. So really try to be careful. But that said, there's a lot of good that you can do, because we need a lot of eyeballs looking at this data, a lot of people coming at it from different perspectives.

Hey Jim, quick question. Someone said they tried to go to the download page at Anaconda and it says they need permission to access the data.

Let's see if I can see the download page at Anaconda. Is that for conda? If so, just do a web search for "install Miniconda" and you'll find a page that helps you install it, if that's what that's for. If it's for the other one, then let me just double-check and try it in an incognito window so that I don't have any special permissions. Okay, that link is working right now, so hopefully that takes care of the first question.

All right, so what are we here for today? Specifically, the Kaggle CORD-19 challenge. That's the COVID-19 Open Research Dataset Challenge; it comes with the data set by that name. Anyone who's done work in science or engineering knows that there are lots of papers published by researchers all the time, and since COVID-19 arrived people have been studying it from every direction with every technique, really trying to work out what's going on. This data set includes tens of thousands of papers, which is obviously a lot to sort through, and the task is to try to help people who need to sort through them. That's essentially the instruction for this task.

What you have to work with is a metadata file and some data files. The metadata file is one file with 63,000-some-odd rows in it, and this is basically one row per entry. An entry is a single publication or
sometimes a part of a publication that has a certain author, title, abstract, and so on. Then, for the cases where full texts of the papers are available, those will be in the data directory. To make sure this is feasible for everyone to run, I'm only focusing on the metadata. It's not that big a file, it can be downloaded, and it shouldn't disrupt your network, but the same techniques can be used when working with the full data.

So what tools are we going to use? We run a project called HoloViz. HoloViz is a meta-project. We came up with HoloViz when we realized we had produced several different independent tools that are built on other tools, and it became very hard for people to make sense of how it all fits together. HoloViz is the one-stop shop for explaining how all these pieces fit together. If you go to holoviz.org, there is a lot of tutorial material there. These tools are generally designed with some principles in mind: trying to make it easy to work with data, trying to remove limitations so that you can plot any size of data, and trying to make data more interactive in web browsers. These general principles drive the design of each of these tools, and as you'll see, they help you easily make interactive visualizations of large data or small data.

If you click on this link here, it'll take you to one of the packages I'll cover, Panel, and if you look at the "Used by" for Panel, it'll show you the people who have recently published packages that depend on Panel. If you look at that, there are a lot of people. I don't know any of these people or what they are doing, but they are already using it to explore COVID data of some sort. COVID is the only common term that I see in these; there are a lot of people using it for COVID. So it's clearly something that's useful to people in the public, and we'll show you how to do what
it is that they're doing.

Okay, so first off I should pause. Is there anyone else who has completely failed to get anything running but still thinks they might have a chance? If so, now is a good time to ask. Otherwise, from this point on I'll assume that you have something running, or that you're watching this during your lunch break and you don't need to have it running. You can always come back to it; these materials will always be there. That's a permanent repository. Well, maybe not always, but for as long as I can imagine.

Okay, so let's get started. We're in the Jupyter notebook here. It looks different because this is meant for giving a presentation, but when you run it, it will look like a normal Jupyter notebook. First off, we just need to load the data and look at it. If we do so, we'll see it has this many rows in it, and we see it has these various columns: the title of a paper; the DOI, the canonical link to it; a unique ID field; what type of license it has, which is basically whether it's available for free; the full text of the abstract; the date it was published; the authors; the journal it was published in; and a few other fields, such as the URL where you can download it and also the file that's on disk with the full text. It's got a pointer to the full text corresponding to this document, but that's not in this data frame; it's just available on disk if you separately downloaded the entire data set.

So here's our starting point. We assume that if you want to participate in this competition, or just in general if you want to help people understand some data, you'll have it loaded into some sort of data structure like this. Then we need to do a little bit of cleanup on it. For one thing, the CSV format doesn't specify the types, so we'll go ahead and specify that some of these columns are strings and some of them are categories, where you choose from a fixed number of options, such as the license; there are only a fixed number
of licenses that are defined. One of them is a date, and the rest are strings. Having set that up, we will be able to work with this data set with nice clean types, since by default they would just be objects, and not many operations are available on objects unless you specify the types.

All right, so let's dive into it. As soon as you've got a pandas DataFrame, you can do plotting just like this. This is not using any HoloViz tools; this is just out-of-the-box plotting. It's based on Matplotlib and it's available to any pandas user. So if you have a pandas DataFrame, you can do some pandas code. In this case what we're doing is counting the number of each type of license, sorting it from the most common to the least common in this visualization, and then doing a horizontal bar plot. What this suggests is that the most common type of license in this data set is actually Elsevier's COVID special license. I'm going to just assume that is specifically papers that are normally behind a paywall but are available for free during coronavirus. And then there are a few other license types. Okay, that helps.

Now I'll show the same thing but using our tools. You change two characters of that: instead of .plot you use .hvplot. This is our library called hvPlot. The "hv" stands for HoloViews, which is basically another one of our libraries that we won't really explain much about here; you can learn all about it if you want to. The advantage of using hvPlot is that it will give you an interactive plot. Part of the point of this challenge is to build a tool that people can use to explore the data themselves, not just you but the people you share this with, so you're going to want to be able to provide interactivity to your users. In this case the interactivity is minimal: you can drag things around, and hover over any particular thing and see the exact
values. But we can have more interactivity if we need to, or when that's appropriate. In this case we're following the pandas .plot API, and if you are familiar with that, all you have to do in most cases is just change plot to hvplot. There's also a way you can register hvPlot as the default plot, but we find that confusing for some users, so we try to make sure that we're explicit about what's doing the plotting; in this case it's hvPlot. So immediately, if we cared about licenses and about whether these are things we can access, this would let us know whether this data set is covering anything we care about. So far that's a little bit useful.

There's another field called source_x, so we could do a similar plot. In that case you can see that the most common source for these records is PMC, which I believe is PubMed Central. Other records come directly from Elsevier and from various archive sites, and all the rest of the options are very low in how common they are. What we're doing here is just illustrating a little bit of what hvPlot can do: you can customize it, and you can also use this operator here (*), which means overlay. So when you've got an hvPlot, in this case if we had just that all by itself it would...

I didn't execute the previous cells. I should start from the top just in case, so that we have a good starting point that actually is the same starting point that you will have. So I'm just executing each of these cells in turn. All right, and we need to go to this line as well.

Okay, so in this case I pulled out one of these plots so you can see what I mean. This is doing a line plot. Instead I could have pulled out the other plot, which is the scatter plot. If we look at that one all by itself, you'll see that's a scatter plot. And then if you use this overlay operator, which is available for any HoloViews object, including all of these objects that come from
hvPlot, then you get this combined plot where we've overlaid the scatter on top of the line. Hopefully that's clear.

This also illustrates something else: this bit right here is not following the pandas .plot API. Everything before it is normally exactly the same as the pandas .plot API; these are extra options. There's probably a way to put that information in here too, but in this case I happen to know how to do it with HoloViews. Whenever you end up with a HoloViews object, you can pass HoloViews options, so some things are able to be customized in a way that isn't allowed in the pandas API.

All right, now's a good time to ask questions and start trying something out. You should see an empty slot for you to try things. These objects generally support tab completion, so if I have hvplot, there's an area plot, a bar plot, a horizontal bar plot, a bivariate plot, a box plot, a density plot (I'm not even sure, I think that's a KDE plot), and then error bars plots, heatmaps, hexagonal bin plots, histograms, KDE (which I believe is a synonym for density plot), text labels, lines, paths, points. There are a lot of options. Go ahead.

There is a general question that came up in the chat earlier, and that was from someone saying, "I'm new to data science. How would I know how to distinguish between valuable analysis and crank science? Just because a post uses Panel doesn't make it quality info, correct?"

That's correct. It's easy to create garbage. Panel makes it very easy to create garbage, so you can create garbage more easily than ever before. So you do have to think about what it is that you're doing here, and the only way to avoid creating garbage is to first think about what you're doing. Try to understand exactly what you're plotting, and then also get feedback from people who know more than you do. They might not know anything about Panel, but they can tell you about your plot. They can say
what is on that axis? What are you plotting there? How has it been normalized? Are you comparing it against a reference value, or is this raw numbers? If it's raw numbers, is that appropriate? Are you comparing raw numbers from a small country to raw numbers from a big country? If so, that's probably going to lead to incorrect conclusions. And so on. Are you plotting things on a map, but your divisions of the map are zip codes or states that vary a lot in their size? Well, then you need to think about whether you want to plot normalized by area or normalized by population. There are so many options, and you really have to think about what story you are trying to tell. Is that an accurate story? Is that a story that you have come to by understanding the data and are ready to share with someone, or is it that you just typed some stuff, got a pretty picture, and shared it? If it's the latter, you should be scared.

This should be a process where you're engaging with the data, and we're trying to make it as easy as possible for you to see the data from this side and from that side, and what if I do this, what if I do that. That's not so you can generate a ton of plots; it's so you can generate a ton of understanding, so you can see how everything fits together. In fact, that's one of the major driving principles behind the HoloViews project: it's for visualizing multidimensional data. The screen has two dimensions, x and y, and then you can add color and a few other dimensions, but real data has a very large number of dimensions. Each one of those rows already has, say, ten different columns in it, and you can easily derive other columns, which I'll show you, that summarize it in different ways. That's many more dimensions than you can ever show at any one time. So you need to show: what if you look from this direction, what if you collapse along this dimension, collapse along that dimension, select one dimension and ignore the others, and do the opposite with other things. And you have
to really have this process of engaging with the data set and becoming one with it, in a mystical way I suppose, where you are no longer surprised by the data because you've seen it from every angle. You know all its warts. You've gotten rid of outliers as soon as you understood what the outliers were. You did not get rid of the outliers because they were ugly; you got rid of them because you know exactly, "okay, I understand that." Or for some you might stop and say, "we've got to investigate those. Why is that one data point fifty thousand times bigger than the others? I don't know, but I'd better take it out because it can change all my results, and I'd better investigate it because maybe it was right." There are a lot of things like that that come up.

For this particular data set, they warn you that it has not been cleaned. They say that it's not horrible, but it's also not been cleaned, and if you look, there are missing data fields, and there are abstracts that are 50,000 words long. That's not an abstract. I don't know what's in there; I need to find out what those are. You'll see those as you go through. So it's basically a process of human understanding and engagement with the data set, and that's what data science is.

And you have to add data science plus science. Sometimes the numbers alone and your fancy tools are not enough. You actually have to pay attention to what the scientists in that field know. They know how to interpret these numbers in a way that you can't by just looking at the numbers. There are very common things that will completely mislead you about the data, and if you're in that field, you learned those when you were young. If you walk up to that field and say, "I'm a super smart data scientist, I'm going to do it," you don't know the problems that the people there solved a long time ago and then steer carefully around. So that's where I'm asking people to have humility. You can come up with a great plot, but try to engage with the
people who know more than you do, try to engage with the data, and make the world a better place with more understanding rather than throwing some noise out there.

Thank you for that amazing explanation. What I took away from it is: be one with the force of data, and be really careful about the question you're answering. It's very easy to be misleading or manipulative. The data itself is harmless; it's how you combine different factors and tell a story, depending on the story you're trying to tell, and you can manipulate almost however you want that story to steer.

It's very easy to come in there with a preconceived notion and put up a plot that looks like what you expected, and completely sail over the fact that, yeah, it looks like what you expected, but there are two bad data points in there that completely changed all the conclusions. You were happy with it, so you didn't look further, and you go on. Don't be that guy.

There are a couple of questions that came up in the chat. One is from Matthew at the 2:27 p.m.
timestamp: is impact factor of journal publications still a thing? And then right after that, some people are having some technical difficulties.

Okay, all right. So in order of the chat that I see on my screen, it says, "what do the tools at the right side of the plot do?" If you hover over them you'll see what their names are, and if you click on them you can try them out. Basically there's a pan tool that's shaped like a cross, there are zoom tools that you can select to zoom in and out of the data, and various other tools depending on what that particular plot shows. Mostly in this we'll use three tools, zoom, pan, and hover, and you don't have to worry about any of the rest. And hover you don't have to select to start; it's already available.

On the question about impact factor of journal publications: that is very much still a thing, but there are people who fight against that and say that publications are not where it's at, software is where it's at, or running applications on the internet is where it's at. It depends what your goal is. If you want to have an impact, those may be true. If you want to get tenure, you probably still need it. So I think for any of the publications it's possible to look up an impact factor, but it's not in the data set itself. You certainly can go look that up.

And then I see there's a question about "NaTType is not convertible to datetime." That usually means you're trying to do something with a column by treating it as a date, and one of the values in that column is a not-a-time item. I don't know what you're doing... I think they've figured it out. He said you don't have to answer. Okay, good, that's hard to answer.

And then, "the value at the end of each bar": I do not know. There probably is an option you can pass to print the actual value at the end of each bar, but I don't know it off the top of my head.

Okay, well, hopefully you either were able to
replicate what I had, which is the start, or you're able to see that there's lots more, and maybe you're able to get something working. It's okay; the next section doesn't depend on whether you got something working here, so we'll go on.

The first plot here is pretty simple, but it's something useful to illustrate. If you want to look at the journals and you try to plot all of them, there are seven thousand four hundred different journals, and if you did that you wouldn't be able to read it. That's not too much data; you can plot it, there's no problem, but unless you limit it somehow it'll be very ugly. I think we can show what happens when there's a lot of them. There you go. That's not very useful. So I limited it to the top twenty so that it would be readable.

That's something you can do in general, and it's one of the big advantages of hvPlot: you stay in the pandas namespace with everything pandas does. Here, pandas supports this notation, so I was able to tell it, just give me the top twenty. Well, in this case it's kind of the bottom 20, but when they're plotted they're plotted top to bottom, so it's a little confusing. Anyway, this gives me the top 20. You can use any kind of syntax up to this point. This particular one is saying: count the number in each journal, sort it, and take the top 20. You can put anything there, anything pandas supports, and then if it's been selected down to a small enough set of data that is actually meaningful to plot, then when you call hvplot you'll be able to plot it. So that's a general thing to keep in mind: use pandas. pandas will help you figure out what you want to see here, and it's up to you to use those tools to explore and find out what you want to see.

Here's another example of the kind of analysis. Here we're doing a group-by on year and trying to understand: okay, we've got this data set; if I wanted to tell somebody
how many of these are very recent papers and how many are very old, well, the answer is there are very few old papers. In fact this is the whole range: there are some papers from around 1950, but probably only one, and then a few up until the 2000s. Since the 2000s, obviously interest in this type of disease must have picked up, and then most recently, in 2020, it exploded. This should not be surprising, but it's useful to know what the bias of this data set is. The bias is heavily towards recent publications.

All right, so these are things that you can extract directly from pandas and plot. Now let's look at things where we have to do a little bit more work. If you're in a machine learning context, this is generally called feature engineering, and what that means is doing some analysis and then putting it back into the data set. In this case it's not a very complicated analysis, but at this stage you can do anything you want. If you search for natural language processing or dimensionality reduction, all sorts of techniques can be used at this stage to transform what is essentially a string of words, like the abstract, into something that has semantic content, something that allows you to compare how similar two things are. I'll show you examples of that, but we'll do the very stupidest thing, which is: let's get some numbers out of this data set. It's almost all strings, so let's count how long the abstract is and how many unique words there are in the abstract. That's just Python code.

Then, we don't have to do this, but we're going to clip this data to ignore outliers. I mentioned this earlier. I'm ignoring them here; I don't know what they're for. You can just see that there are outliers. There are lots of things going on in this data set. I don't know what all those very long abstracts are, but at a first pass I don't want to see them. You'll actually see them show up in the plot, but you won't understand them because of how they're aggregated. I'll
show you what that means. What we're going to end up with is two more columns. It used to be that we had no numeric columns and one date column; now we'll have two numeric columns.

So if we plot that, you can actually see some interesting things. As a former academic, I can immediately spot some patterns here. This is the abstract word count. Right here is where we clipped it, and this height tells me how many: there are 398 papers that were clipped. They have values above 400 words, some of them extremely large. So this little bin here represents all the data that we have no information about; all we know is that it's larger than 400. So always take note of that. What about this guy? Well, this is all the ones that have no abstract. What about these guys? There are papers that have six words in their abstract. I don't know what's being said there; you'd have to figure out what those papers are. "We found it and it was good"? I don't know.

But the interesting bits are these, where there are these little spikes. If you've ever submitted a journal paper, they tell you how long the abstract can be. It can be 50 words, 100 words, 150 words, 200 words (or barely over, and they got away with it), or 250 words, or 300. You see very clear little patterns in this data. This other one, the number of unique words, is a nice normally distributed one. So this is optimized: people cram their abstracts in to exactly fit the limit, and it's immediately apparent when you plot this data. This is not interesting for solving anything to do with disease, but it is correctly revealing patterns of human behavior from data, so I enjoyed seeing that pop out when I plotted this last night.

Now let's start looking at relationships. Again, these are not going to be particularly interesting relationships; it's up to you to figure out what's interesting. I'm just showing you how the plotting works. In this case let's look at the scatter plot of the word count against the unique words. Now if you just
think about it, a longer abstract is going to have more unique words, probably, right? And we immediately see that linear relationship here: the longer the abstract, the more unique words. It's not quite linear; it kind of falls off a little, and that makes sense too, because the longer your abstract, the more likely you are to repeat words. Eventually you would expect it to tail off, at maybe 10,000, the number of words that you'd ever see there. So it's linear where the first few words are all unique, and then eventually there are repeated words and therefore fewer unique words, and then it gets a little bit crazy. Notice here that I used pandas to make this be a tiny plot. I could take that out; it's not too much data, but it's a little bit slow if I plot all of the data. Oh, again, I didn't run the preceding cell, because I had loaded this in from disk, so I have to make sure to actually run the code; these are just saved outputs from the last time I ran it. In this case it ran through and counted all of the words. I need this box. OK, and having updated it, I can now remove that, and you can see that it's a little bit slow; it's now plotting all of the data. If you do that, you can still hover and see things, but it's a bit of a mess. So, just to make sure that the saved file isn't too big, we put a smaller number of data points in it, but up to a hundred thousand or two hundred thousand, maybe, you should be able to handle with no special techniques. So what if you have a lot more data? Then this ecosystem comes with a tool called Datashader. Datashader is a completely different way to plot data, and it's designed to do two things: show you the underlying distribution completely faithfully, and do that for very large data; the larger the data set, the better it does. So here we've given it all the data, and if you look really closely (let me recenter this slide; there we go), if you look really
closely, you can see that same pattern in the abstract word count against the unique values. You can see it's denser right around 200, 150, 110, and 250. This is structure you can't really see here; maybe you can see it around 300. This is called overplotting: each of these data points is drawn on top of the other data points, and you can't see the structure that is clear in the Datashader one. Datashader tries to be very faithful about exactly how dense things are in that local region, and in this case you can see that it's more dense around 150 and more dense around 200, in ways that are not visible at all in the raw plot. They're subtle, but they're there. Also, the way Datashader works is that when you zoom in, it'll re-render dynamically, so that you can see all the data all the way down, whereas if you zoom way out, eventually all your data will be in one pixel. In this case, on the left, all the data points are shipped to your browser, and the one on the right is just about 20 by 20 pixels; here all of the data is on the same pixel in the Datashader case, because it's basically sending pixels, whereas in the other case it's sending individual data points. What this lets you do is work with data sets that are much larger than you could otherwise use, basically anything you can fit, including data distributed across multiple CPUs and so on. Now that was... well, I got a notice about something else that appears to have locked up my system here. Let me see if I can recover it. This is a lousy place in the talk; let me see if I can get back to my page. OK, so my browser has gotten confused, but hopefully I can get it unconfused. Meanwhile, we were about to get on to the next block. There we go. OK, so you can learn more about Datashader at datashader.org, but the meta-message is: if you're dealing with a large batch of data and you want to see the subtle patterns,
then Datashader is a good option, and it's actually pretty easy to invoke with hvPlot. So let's go back to the presentation. hvPlot supports other things besides Datashader; for instance, you can use hexbins, and this one is a bivariate kernel density plot. You can do these, but if you notice, they're not going to show you these subtle patterns in the data; it's pretty hard to see them, and that's because they're aggregating at quite a coarse level, whereas Datashader is aggregating at the level of the pixels on your screen, so you can see all the data up to the limit of your screen and your human visual system. Whereas if you take that same data and visualize it as a hexbin, you can see the clipping, but then you can't really make out the hundred and two hundred and three hundred bits of data. There's nothing wrong with this type of plot if you want to ignore local details like that; then these are appropriate, because this one bins on a local scale and this one, on the right, smooths on the local scale. "Can I interrupt you for a couple of questions that have come up? This question is: is there a relationship between Bokeh and HoloViews?" There are various relationships between HoloViews and lots of things. You'll notice this symbol here means it's a Bokeh plot. Everything but the very first plot I showed you, which was Matplotlib, has been Bokeh. hvPlot generates Bokeh plots. It can actually be tricked into generating Matplotlib plots, and that's because the underlying library it's built on, HoloViews, supports Bokeh, Plotly, and Matplotlib. hvPlot has so far only been built to translate options from pandas and xarray into Bokeh, so everything I show you is going to be Bokeh-based, but that's because of this convenient interface that we have; the underlying tools can handle multiple libraries. These are all sort of one step above plotting: these are ways that you can construct things that can be plotted,
conveniently mapping from what's in your data frame into something that can be plotted, but doing so primarily without being tied to a particular underlying plotting library. So Bokeh was created at Anaconda, but HoloViews was created when I was a professor, and that was before I came to Anaconda. Originally HoloViews was built entirely around Matplotlib, but then we wanted to do interactive work, and pretty much everything since me and most of my colleagues came to Anaconda has been built on Bokeh, but it's not required. "All right, one more question: somebody is trying to install using the anaconda-project, and this person got a message that it's downgrading some packages, upgrading some packages, and installing one new package. Is that going to interfere with this person's other projects?" Yeah, if you use the anaconda-project command that I suggested, the advantage of that is that it is a fully independent environment from everything else you have on your system, so if you follow the instructions to the letter, that should not interfere. That said, there is information in there that specifies an environment, and you can easily install those packages into your own environment. If you do that, you will get what you've described there, which is that it will overwrite some of the things you have installed and put in this stuff that's at different versions. It shouldn't install a lot of packages, but it'll definitely change the versions, because these are the versions that have been tested. Other versions will probably work; nothing I'm showing here is new or experimental. In any case, you should be able to run it exactly as specified with anaconda-project, and you can probably run it in your own environment with just whatever versions you have, but if you try to put the project into your own environment, you'll end up with different versions. Well, my
daughter has interrupted slightly. Yeah, all right. OK, so following on the theme of getting progressively more complex plots, this is a particularly useless plot, but here we've taken the first 10,000 data points and we're going to do a scatter plot by license. This particular plot is of word count versus unique words, just like we had before, but now it has this extra "by" clause here, and what that does is create an overlay where each different license has been given a different color. In this case it's almost impossible to understand; if you zoom in, you'll be able to see a little bit more, but it's still pretty hard to follow anything, so this is not a very useful one. In many cases there would be clear structure; what this one indicates is that people write papers of every license at every size of abstract, so there's no obvious relationship between abstract size and license chosen, and you wouldn't expect one. That's why this is kind of a boring plot, but if there were relationships here, you would see them. But you can see them better if, instead of doing "by", you do "groupby". This is something that is particularly useful when you're exploring and you haven't found out exactly what you want to look at. In this case (I want to run it, and I probably needed to run something before it, let's see), it'll give you widgets to select a particular license. So if you're doing "by" license, that means to do it as an overlay; if you do "groupby", it's basically the same operation, but instead of squashing them all into one plot with different colors, it says: punt, and let the user figure out which one they want to see. So in this case you can see this relationship for any value of a categorical. That's why we declared this column categorical back at the beginning of the notebook, and that's so that we can do things like
this: break things down by category and look at the results. And then, if you go back to the "by": by default (no pun intended) a "by" will overlay; instead you can pass an option for doing small multiples, or subplots. If you do that, in this case it's a lot of data; in many cases it'll be useful, but in this case there's really no difference. Basically it validates the assumption that you should probably have, which is that the relationship of abstract length to unique values is not different depending on the license. Hopefully, when you look at it, you'll be looking at some interesting pattern; this one is kind of a boring pattern. But maybe you've already thought of something you'd really love to plot. If you think of some derived value, you could put it in the data set, or you can just take the previous plots and put them together in different combinations. You can use these operators to do that: if you've got something that displays on its own and something else that displays on its own, just put + between them and they'll display side by side, and if they have the same axes, or at least compatible axes, you can also use * and that will make them overlay. You can also have any combination of them, which is two overlaid objects laid out next to two more overlaid objects, and so on. So that's what this example is: some object next to some other object overlaid with some other object, and you can have any combination of these. If you get too many in there and they're off your screen, you can put parentheses around the whole thing and then call .cols(3) or .cols(1) or .cols(2) to put it back on your page if it's gotten wider than your window. So basically that's the exercise: go back and find something that you either don't like and want to change, or something you do like and want to lay out side by side or overlay, just like these are. I've never tried this; I don't know whether these are compatible. Let's see if they overlay, and see what error we get. The x and y axes are the
same; the color axis is different. OK, well, they overlaid. You get two color bars in that case, for these two independent overlaid color axes, and you probably want width, which you'd probably need to put on each one of those overlaid items. And you'll notice that this is slow. These particular ones here are actually not just plotting; they are first binning the data, so they're going through all of the data and building up this data structure, and the plotting happens at the very end. So the actual plot is not very expensive, because there's not much data involved, but creating the plot is expensive, because that touches all of your data. "A question for you: what is star-star sopts?" Yes, **sopts; a lot of people are not familiar with that, so here I'll just give you an example. Let's say what sopts is: it was defined earlier, and it's just some parameters. In particular, this is why we're plotting word count versus unique words, because I hid it inside the sopts object. What the star-star does is take a dictionary and substitute the values into this function call as keyword arguments, so it's exactly the same as typing them in directly; it just saves a lot of typing. If we did this by hand, we would need to type x equals abstract word count, and then y equals, etc. So all it does is substitute the values of a dictionary into a function call. I tend to use that a lot, and it tends to confuse people when I use it, but hey, otherwise you have to type all this stuff all the time. OK, all right, now we should check the time. Yeah, we have time to go through a little bit more. I'm going to go ahead, in case people are already done with things, and show a few examples; well, actually, I'll just mention other websites that you might want to look at for inspiration. If you're building plots, you should probably check out hvPlot's website, and if you go there, it's got a user guide. You probably want the
introduction, or the plotting, or the customization sections; this basically tells you how to do plots, and you'll recognize some of these plot types from this page. There are area plots, bar plots, and others that are not on this page; there's the hexbin plot; and others that are not in here, such as heat maps, overlaid bar plots, and so on; box plots, violins. So if you want to explore further, that's where you go to find out what's available. And I'll show it later, but again, there's a big tutorial on everything at holoviz.org. That particular tutorial is based around some geographic data, and a lot of it is the same type of plots that you would see here. These are earthquakes, but those could be COVID cases, so you could do exactly the same type of plot to show the location of COVID cases, and it'll take you through all the steps for that, but that's a much longer tutorial than this one. OK, so are there any other questions? If not, I'll move on from this exercise. OK, well, feel free to break in if there's a good question or issue, but now we need to get to the meat of this challenge, which is that you're working with words, and we're showing a lot of pictures here, and you have to somehow be able to go back and forth between data and words: understanding things at a numerical level, presenting them at a numerical level, presenting them at a word level, and basically going back and forth between text as words and text as data. So let's start to try to do that, and then we'll be leading up to building a dashboard based on this type of work. As mentioned, my previous webinar a few weeks ago was all about dashboards, so basically we'll get you as far as being ready to watch that one by the end of this. We'll briefly touch on dashboards, but we have this other resource, and also at panel.holoviz.org and at holoviz.org there are a lot of materials on going further with developing apps and dashboards. Meanwhile, let's
basically start to work with the words. In this case we're going to take all the words in the titles of all of these 60-some-odd thousand papers: take all the titles, split them by space, which gives each of the individual words, then concatenate them all together and eliminate all the ones that are four characters or smaller, like "the", "and", and "of". That gets rid of most of what are called stop words: words that don't convey meaning and are just there for their function, and those are usually ignored when you do natural language processing work of this type. In any case, I didn't want to load a separate library for it, so I just did the cheapest, easiest thing I could think of and got rid of the short ones. If you do it for real, you'll want to load a list of stop words, because there aren't that many, and in particular you want to think of stop words as being both for English text and also for papers. The word "paper" is a stop word in this case: it would convey no meaning to say "paper" or "author" or "abstract" or any of the things that are about the function of science writing as opposed to the underlying disease. So you'll want to get a list of those words and get rid of all those as well, but in any case we're not doing that; we're doing the simplest thing here. Let's go back to full screen so you can see all of it. So here, let's just get the most common five-letter-or-longer words in the titles, and lo and behold, what is the top word that's common across all these? COVID-19; there's probably some reason for that. After that, it's coronavirus, respiratory. Notice here that I have not collapsed by case, and that's basically because I didn't want to lowercase things like this. Depending on what you're doing, you either want to collapse by case or not; I didn't. I'm just showing you how to do things, not what to do, so you can do what you like. There are even some clear misspellings; this is not perfect data. But now we do
have a list of the most common words in the data set, so maybe they'll come in handy. That was the titles; we could do the same thing for the abstracts, but it would take a little longer. So, anticipating building up an app, I'm building a couple of useful utilities, and that particular useful utility was a list of common words that'll come in handy. Now let's write a little function that filters the data set and essentially does a lookup. You can write very complicated lookups: all sorts of fuzzy matches, or natural language processing to substitute words that are similar in meaning even though they're spelled differently. You can do anything you want here; I'm doing it very simply. I'm saying that this little function will search in this field for this search term, and it searches it as a "contains", which just means that if that string appears anywhere in it, it's considered a match, and it will return the top num matches. So in this case let's search the abstract for the word "bovine" and return the top two results, and this is what we get. We've got all of the data available: we have all of the columns but only two of the rows, because I only asked for two; you can ask for as many as you want. Let me make sure we have actually run that previous one. I think we did; yes, we did. OK, so now let's write a little function for concisely displaying the records that we just got. In this case I wrote a bit of HTML. I'm an old hacker and I know HTML without ever having to look it up. You don't have to use HTML for this; you could use pandas' own DataFrame display. I used HTML because I didn't want to look up how to make a link clickable in a pandas DataFrame. Do what you like here; this is just one way of displaying things. And if you look at the bottom of the screen, this is now a link to that specific article. So the previous function gives you a way to search, and the next gives you a way to display the results. OK, these are all useful, but these have nothing to do with
dashboarding per se. Here I happen to know a command from IPython that lets me display an HTML string; we won't use that again, but it just shows you what the output of this function is, in this case for that cows query. OK, and now let's build a dashboard. This is the entire dashboard. Let's first show you the dashboard, and then we'll go back to the code, because it doesn't make any sense without the code. Well, here's the dashboard; it's called the lit searcher in this case, and if you look at this you can see what it does: it searches the given field. I'll have to execute it again; sorry for leaving that confusing. I should have just cleared all the outputs before I started, and now it's too late. OK, so in this case we're looking at a little dashboard. Here it's embedded in the Jupyter notebook, but I'll show you that it doesn't have to be. Let's see what we have here: we have a header, and we have some widgets. There's a widget you can use to select whether you're searching abstract or title, or if you search in the authors, it looks like a couple of authors are named COVID-19. How about we search for an ID? OK, no IDs are named COVID-19, but there are authors, it turns out, named COVID-19; that's funny. And there are certainly abstracts that mention COVID-19. We can limit that if we want to: we have a little slider here for controlling the number of results that are returned. And this particular widget is pre-populated with those hundred words that I found were most common, so you don't have to think about what might be in this data set; let's just explore what is in this data set. So, "vaccine"; that's probably not going to be very interesting. And alongside this table, which we already showed you the code for, is a plot, and this is a plot for the selected data. Now, we can put any plots here; I just randomly grabbed one of my plots from the previous part of the talk and stuck it here to the right. It's not even a very insightful one, because it's not showing
anything very useful on hover; it's showing the word counts. We should set it up so that it shows something useful on hover, which is the author and the title and the year or something like that. That's a little configuration that would need to be done so that this would be useful in some way. And you could put another plot over here, or any combination of plots. When you put plots together, they'll be linked to each other if they share any axes, but you can also call a special command on them called link_selections, if you have a recent version of this code, and that way, when you specifically select some points (that was actually a zoom, but if I added a selection tool, you'd be able to select some), it will then cross-filter across all of the other plots. That's just something that comes for free if you call link_selections, and it'll allow you to see how everything is related to everything else. Anyway, there are arbitrarily many complex things you can do there. You can pretty easily say that when you've clicked on one of these data points, it should update another plot, or spawn another plot, or just go to that URL; just like these links will go to the URL, you can make the data points go to the URL. The basic operation of this little app is hopefully clear. Also, I'll just show literally what happens when you run it as a standalone app, with .show(). What that does is launch a separate server (we'll get out of full screen so that you can see the URL). What that did is launch a separate server that's listening on a port on my local machine here, and this is the final outcome of this notebook, which is shareable: if I passed that around, anyone else who was logged in would be able to visit that same URL and see this app, and they would get their own copy of it, with their own state and their own things that they could drag and explore, and they would see it like a website, basically. Now let's go back; let's remove that and then try to explain how this works, so that you can see how you'd
build it yourself. So this is the result, and we're looking at this object here. This part, .servable(), is optional, but I called it because if I later go to the command line and type into my terminal "panel serve exercises.ipynb", magic will happen: it will run that file, and anything that says .servable() becomes its own app. That way you can explore in the notebook, and you can also run it as an app; if you have a deployment in a container or on a cloud somewhere, all you need to do is run that one command and it'll launch that server, and you can share it with anybody. So you can ignore that for now. Let's talk about this object here: how did you get this object, "literature"? Well, let's see, this is all of the code for "literature". What it does is a little bit of imports: it's using our library called Panel, and using some widgets from Panel; in fact I could have used pnw here and saved some text. What this does is instantiate three widgets. If we look at our next page, there are three widgets: one, two, three. So let's go back to see the definitions of those. There's a widget for the number of results; that's a slider: it starts at 1, ends at 15, and has an initial value of 10, so it starts out there. OK, there's another one, a Select widget, that is on the field. If I go to the field, that's this one: it selects on all the values that are available in this data set. This is basically the data set's columns, df.columns; actually it's all the string-type columns, because those are the only ones that it makes sense to search a string on. So the field widget is given a list of options, and that list is constructed by querying the data frame; it's basically anything searchable, but you can put anything you want here. If you only want to support two fields, you just list those two fields. And then the final one here: this could have been a free text box. We'd just put a pn.widgets dot input,
I think it's called; I can't remember the name of it. But anyway, you can put any type of widget you want. I chose to do a Select and to feed it the first hundred most common words; you can put anything you want. Because I chose that, it becomes this selection, so I don't have to type; I enjoy just clicking around rather than typing, but you could totally have put an arbitrary text field there and it would search on it. OK, so far, for this, all you have to do is go to the Panel website and instantiate any widget you want, and it'll just be there. In fact, if you take one of these widgets, let's say like this one (let's get out of here; the Zoom stuff is always in the way) and we go back to the notebook, if we just look at that widget, this is the number-of-results widget right here. So all by itself you can easily just instantiate anything and debug it and do what you like. At this point it's just a matter of looking it up on the Panel site and putting the right parameters in, and you can get a widget, and that's true for any of these widgets. So now, what about the rest? Let me skip to the end; the end is easier to understand. At the end we're going to create a Column object that contains this widget, this widget, and this widget, so "widgets" is just a column of widgets. Again, if I went back here and inserted a cell and typed "widgets", that would be all three widgets now, and they're all live. So that's a collection of widgets, right? One last bit: I'm using a package called RISE to do this slideshow presentation. You'll see me switch from the standard notebook; this is the RISE presentation mode, and that's why everything gets really big and you can actually read it. OK, so here we created this widgets object, and then finally we've created this "literature" object, which is a column that has some text in it. It's a combination: in this case I just stuck in some HTML and some Markdown, and that works. So it's a column of some
text, some widgets, and the results, this object search_results, which is here. You can see that there's a column of widgets and a column of these other things, and that's what we end up with in our app, which I have now messed with outside of this; that app is still paying attention to those widgets I was just messing with. All right, this is the last bit, in fact the last part of the talk, and this is where the meat of your work comes in. You can easily do this and just put in any widgets you want; just think of what you want to control. Then you write a function: for each set of widgets you just write a function like this. Every widget should show up in this depends list, in the same order as in this function's argument list, and all this does is say that this parameter of this function depends on this widget, so it ties that widget to this function. Then we're going to pass that function in here, and all that's going to happen is that Panel looks at that function, realizes it's a function, and knows that it's able to call it, because it's been declared that each of these arguments is coming from one of those widgets. So what we've done here is define some widgets, define a mapping between widgets and a function call, and then pass that function to Panel, and Panel does the rest. Whenever one of those widgets changes, it calls this function again, because it knows that this function depends on this widget and this widget and this widget, and it knows exactly what to do with those values. When any one of them changes, it calls this function, and whenever it does, it will call this "top" function we called earlier to get a few rows; it'll create an HTML table, which is what I had the code for before; it creates another copy of sopts, which is this dictionary of options; and it'll plot using those options, so here it's expanding that dictionary of options in the call. Then it returns it all as a row, where the table is on the left and the plot is
on the right, and then the final result is this app you saw: table on the left, plot on the right. All of that is underneath the widgets, and all of that is underneath the label, and you can see that right here: the label, the widgets, the search results. The search results is a row of a table and a plot; the plot is constructed from the arguments and the table is constructed from the arguments, so it all fits together. Your job, if you want to make an app, is to figure out what you want to vary and make some widgets; figure out some function of those widget values that returns something displayable, no matter what it is, whether it's an SVG, a PNG, an interactive plot, a video, or, I don't know, an actual PDF viewer of the paper embedded in the app (all of these things are possible); and then you connect the two. You've got some function that can display something, you connect it to arbitrary widgets, whatever widgets you want, and then you make your app. Basically, your initial app should be very easy: it's easy to put widgets up there, easy to put a plot, easy to put a table. What's hard is tying everything together so that you completely custom-control what happens when you click here and how that relates to everything else. That's hard work, but you don't have to do it; it's only work you need to do if you want to, and we've tried to make it easy in certain cases, such as the cross-filtering case, and there are other cases that we think we can make easy. Otherwise you have to make this plot subscribe to another plot, or another plot subscribe to this one. These are not terribly hard, but you have to decide what exactly you want to do and then look up how to express it. Go ahead. "I see a couple of questions that have come up. One is a general question: someone wants to know if you have any recommended techniques for feature engineering using plots that you use regularly." I hesitate to mention anything, because if you look at the Kaggle website you'll see submissions from lots of people who have spent
way more time with this data set than I have. I started on this data set yesterday around 4:00 in the afternoon, and this is the result; there are people who have spent a month working on it and really thought about how to represent it and how to derive data from it. We have only a few minutes here, but I'll just point out something somebody had done. I don't have it up here, but somebody else had created... oh, it's actually linked at the top of this file. This is somebody else's response to this challenge, and this is somebody who is also using Bokeh; they created a dashboard with Bokeh directly, not with HoloViz tools. You could make a similar dashboard with HoloViz tools. What this does is use massive feature engineering using clustering, two different types of clustering, comparing them to each other: forming little groups of documents that are related based on their full text, positioning them based on one type of clustering and coloring them based on another type of clustering, to show how the two types of clustering are related, and then overlaying the actual data. My colleague Philipp Rudiger, the author of Panel, did do that with our tools as well, just in a very quick try with a different algorithm called UMAP. If you use that one, it's again a clustering technique, and what this plot shows is that there's not a whole lot of relationship between the journal and the clusters that are found by UMAP. So this one is not very successful; the other one is clearly showing some structure, a very interesting structure there. Our first pass of just throwing something at it, clustering based on the title or the abstract, not even the full text, doesn't come up with much that's useful, but this is the sort of thing where you are taking a word-based data set and trying to come up with a way to get a handle on it, and to relate the words that are in it to the meaning of it, finding similarity between
completely different documents, and trying to group them and help researchers find particular clusters: oh, if you like this paper, you'll probably like these papers, and so on. And it's easy: the plotting that this particular tool, UMAP, comes with is based on Bokeh and HoloViews, so it fits nicely into our system here. It's also very fast; it's built on the same very fast tools that we are using. So that's our tool of choice, but we haven't spent any time trying to optimize it, and we also don't have access to the full-text data here, because that's an order of magnitude more data than I wanted people to download.

This notebook here is included in your packet, basically whatever you unpacked; there's one called UMAP.ipynb. That's a starting point. Try that; that's where I would start. I'd try to figure out what to pass in as the thing that it's trying to cluster by, and you probably need some parameters on that. You need to think about that, and it'll map things into this abstract space of similarity. There are lots and lots of other things, but don't trust me; trust the other Kaggle participants who have gone before you. Look at other similar Kaggle challenges that have to do with text similarity. There's a lot of information out there. I briefly was doing work like this in my Master's, but since then I've left that all behind; we work on numbers, not words, nowadays.

I know we're almost out of time, but there were additional questions that popped up, right around the 3:17 timestamp mark. I don't know, five minutes left?

OK. [Music] I'll just take them in the order I see them, if I can get to them quickly.

How does RISE differ from reveal.js? RISE is built on reveal.js; it's just a very convenient Jupyter interface to it. You're welcome. Similar techniques can be used with reveal.js. I like RISE; do what you like.

Can you have a responsive web design? Yes. In fact, if you saw the underlying app, it was changing the shape of that plot as I
changed the table; it was actually responsive to the shape of the table. You can have it be responsive to the shape of your page. It's not always straightforward, because if you've laid out a bunch of things and any one of them happens to come from some toolkit that doesn't support responsiveness, you might get stuck. But in principle Bokeh and Plotly both support it, so you can lay out Plotly and Bokeh things and they'll all respond responsively.

Yes, the question here is about... Basically, what we did is make one callback that returns a whole big amount of data in the table and a whole brand-new plot, so whenever you change one of the widgets you get a whole big chunk of data. It is possible to specifically update individual parts of the plots. It takes a little bit more code, but if you do that you'll have much more responsive plots. If you go to examples.pyviz.org, there are going to be lots of examples there of cool stuff built using these tools; it'll say Panel on it if it's illustrating these. In particular, if you go to the Datashader dashboard, you'll see a 30-line program that specifically ties every single one of these widgets to exactly the one change that's necessary in the plot, and you'll see how to make that very precise mapping, doing only the minimal amount of work that's determined by that particular widget. So yes, it's just beyond what we can explain in this intro talk, but you can totally do that; it's possible.

OK, and yes, there's a free service to run this online. A lot of people use Heroku; if you search for Heroku and Panel, or maybe Heroku Panel HoloViz, you should see some examples. The user guide on the Panel website also has a server deployment section. You can embed Panel apps in Django; if you want a full-featured website with, like, a shopping cart, you can do that. So deployment is about deploying it on AWS or a free Heroku app, or you can use MyBinder; those are both free.

Sorry, I was just mentioning Heroku: super easy to use, and it's free. The only thing is there might be a few seconds' lag before it loads, because it's free, but there is a paid version, like six or seven dollars a month, and your website will pull up right away.

Yeah, and the same is true for Binder. It definitely is a tragedy of the commons; it depends how many other people are using it.

There's one last question I have time to answer. The star is an overlay operator, so if you have two plots that each individually display fine, you use the star to overlay... Oh, somebody already answered that. Thank you. OK, yeah, and the dropdown option is groupby instead of by.

All right, Anaconda alongside pip: you can. It used to be that that was tricky, because conda didn't know about the pip things and pip didn't know about conda, but nowadays they play much more nicely together. In fact, for that UMAP notebook, at the top of it, I didn't want everyone to have to install things, so I installed one thing with conda and then I installed UMAP with pip, and it should just work.

All right, I think that's everything that I can manage to convey today. Oh, the last bit is that I had some final points at the end of the notebook. There are some pointers for you to go on from here. There it is: things about PyViz, and how you can learn about HoloViz, such as all of our other videos, which are longer and about specific topics.

Thank you so, so much. I will send out a recording of this in the next couple of days. I'll also share again the previous dashboard recording; I know many of you have requested that, and he was phenomenal as always. And Jim, do you have a special YouTube channel that you would like to share? I know lots of people have been posting that they really want to see more presentations from you, if you want to post it in the chat.

I looked over everything I had on YouTube and I pulled out these. These are the ones that are up to date and are not going to be confusing and
really cover everything. There are other ones out there; I would ignore all those. These are the ones that matter, so you have a link to the specific ones that I would recommend. The rest use older versions. We changed our name from PyViz to HoloViz, just to make things work more nicely with the rest of the community and be more inclusive with PyViz, so you'll find some confusing things that call our work PyViz; it's really HoloViz. So these are the ones that keep everything straight and won't confuse anyone.

Thank you again. I'll share Jim's information as well. If anyone has follow-up questions, we'll do our best to get back to you on that. Thank you, thank you, thank you for sharing your Friday afternoon with us. I'm sure you'll have a Friday Zoom happy hour somewhere. And Jim, thank you for staying up probably till the wee hours last night pulling all of this together for us. I really, really appreciate it. Everyone's had a fantastic time learning from you today.

Right, thank you for inviting me.

Absolutely. Have a wonderful weekend, everyone. I'm going to be logging off now. Goodbye.
Info
Channel: James Bednar
Views: 572
Rating: 5 out of 5
Id: REW-QsG-Y5Y
Length: 89min 29sec (5369 seconds)
Published: Tue May 19 2020