Jeffrey Heer - Interactive Data Analysis: Visualization and Beyond

Captions
So, good afternoon everyone. I'm Jeff Heer, a professor from across the lake here at the University of Washington. I'm very interested in building tools to help people explore and understand data more effectively, so let's start with a couple of examples.

Here, as a kind of homage to Hans Rosling, who sadly passed away this last year, is a scatter plot of world health and economic statistics. Along the x-axis is fertility; along the y-axis, life expectancy; and countries are represented as individual dots. This is a snapshot of the world around 1980. But to understand data more effectively, we don't want just static imagery; we'd like to interact with it to understand trends, patterns, changes, and so on. In this example I might want to see something about individual countries: mousing over a point, I see the label for Bangladesh, but I also see its trajectory, how this country has moved in terms of its health statistics over time. I can do that for other countries as well, such as Egypt or Bolivia. But rather than just a static capture of these time traces, I want to see how everything moves together. Using direct manipulation, I might grab this point and drag it through time: as Bolivia moves along its timeline, I see how the other countries move in response, and in this way, backwards or forwards, I can trace out decades of global development, in this case using just a simple tabular data set.

There are a variety of visualizations we might consider, and as we move into more complex data types there's a richer array to choose from. This is a not-very-informative visualization showing all the direct flights between airports in the United States; obviously, plotting all the data simultaneously doesn't lead to much insight. So in addition to plotting tools, we need some knowledge of what makes different visualizations more effective. In this case I might instead use interactivity to show subsets of the data at a time: as I mouse around, I can see the direct flights from individual airports, for example from SeaTac to other airports across the US. As I do this, you might notice that points are selected as I merely mouse near them, so we have to worry about not just the quality of visual encodings but the quality of interaction techniques as well. If I double-click, I reveal the hidden visualization: a Voronoi diagram that is being used to accelerate mouse selection. As soon as I move toward the point closest to my cursor, it is automatically selected, which helps me pick out individual data points much more easily.

I think the real value of interactivity comes in understanding multi-dimensional patterns. Looking at further data about flights, this is a slice of over 500,000 flights from the FAA showing their on-time performance. The first histogram shows arrival delay: how early or late was the flight? The second shows local departure time: what was the clock time in the local area when the flight left? And finally there is a histogram of flight distances. It's useful to look at these univariate summaries initially to get an overview of the data.
One thing you might notice is that the mode of the first histogram is actually in the negative numbers, which says that the bulk of flights arrive early; that may tell you more about the scheduling practices of the airlines than about actual flight times. Nonetheless, having gotten this overview, we can explore further to see deeper patterns. For example, I might take a selection region and, using a technique called brushing and linking, cross-filter these views: as I move the window around, the data in the other plots is automatically re-aggregated so I can see how the other variables depend on the selected range. I might ask, what makes flights late? As I drag the selection out to the right, toward later and later arrivals, I can see the departure-time distribution shift: late flights are much more likely to have left later in the day. If you fly a lot like I do, this is probably unsurprising: delays early in the day propagate, causing flights to be later and later as the day goes on, and we can see that reflected in the data. We might also ask what allows flights to arrive early. As we shift the selection to the far left, toward earlier and earlier arrivals, we see a change in the distance histogram: longer and longer flights, perhaps unsurprisingly, are more likely to arrive early, because they have more distance over which to make up time. In this way we can both get an overview of the data and ask questions along different dimensions, interactively gaining insight into multi-dimensional behavior even though we're only using simple one-dimensional plots. This is one instance of what I mean by interactive data analysis.

To support these types of interactions I've been building visualization tools for a number of years, now really over 15 years of work. It started with toolkits written in the Java programming language, such as Prefuse, then moved on to Flash, and now to web-based systems; I had the honor of collaborating with my former Stanford student Mike Bostock on the Protovis and D3 frameworks, for example. We've since been focusing on higher-level declarative languages for visualization, such as Vega. One problem with all of these tools, if you've done much in the JavaScript world for example, is that they may be very powerful but they're also very verbose. That's great for building bespoke visualizations of the kind you might find on the front page of the New York Times, but fairly unhelpful if you want to rapidly build a visualization for exploratory analysis. So one important question is: how do we offer interactive graphics in a way that lets us build not just static plots but interactive exploration tools in the midst of analysis? One approach we've been exploring is a JavaScript library and higher-level language called Vega-Lite. This is a grammar of interactive graphics: a closed, formal, declarative language for specifying visual encodings, but also interaction techniques that loop back into them. It is meant to support very concise and rapid specification, starting with basic statistical graphics such as histograms and line charts.
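To make the grammar concrete, here is a minimal sketch of the kind of linked-histogram brushing shown in the flight-delay demo, written with Altair, the Python API for Vega-Lite discussed below. The `flights` DataFrame and its column names are hypothetical placeholders, Altair 5 syntax is assumed, and the real demo used a richer interaction, so treat this as an illustration rather than a reproduction.

```python
# Minimal sketch of linked histograms with brushing, assuming Altair 5
# and a hypothetical `flights` table with delay / hour / distance columns.
import altair as alt
import pandas as pd

flights = pd.DataFrame({
    "delay": [-12, -5, 3, 25, 70, -8, 15],   # arrival delay in minutes (toy values)
    "hour": [6, 7, 9, 13, 18, 8, 21],        # local departure hour
    "distance": [2400, 300, 950, 450, 700, 1800, 500],
})

brush = alt.selection_interval(encodings=["x"])  # 1-D brush along the x-axis

base = alt.Chart(flights).mark_bar().encode(y="count()")

# The delay histogram hosts the brush; the other views are filtered by it.
delay = base.encode(x=alt.X("delay:Q", bin=alt.Bin(maxbins=20))).add_params(brush)
hour = base.encode(x=alt.X("hour:Q", bin=alt.Bin(maxbins=24))).transform_filter(brush)
dist = base.encode(x=alt.X("distance:Q", bin=alt.Bin(maxbins=20))).transform_filter(brush)

(delay & hour & dist).save("flights_brush.html")  # vertical concatenation
```

Dragging a region on the delay histogram then re-filters the hour and distance histograms, the simplest form of the brushing-and-linking idea described above.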
But Vega-Lite is also meant for building multi-view visualizations like the one in that earlier flight-delay example: not individual charts strung together separately, but a combined specification for multi-view graphics. That might be a scatter plot matrix, where we look at all pairwise projections of a data set as a set of scatter plots; concatenated and layered views that repeat variables so we can see how they differ; or faceted views, where we subdivide the data and create multiple plots for the different subsets. These are critical parts of creating effective exploratory graphics. We also want interaction to be a first-class citizen of these visualizations: not just selections, panning, and zooming, which are of course very important, but interactively specified transformations of the data, whether that's re-indexing data as in a stock chart, creating filtered views, or the cross-filtering we saw earlier. In this way we want to support not just static plots but the rapid creation of interactive graphics.

Fortunately we've had some support in doing this: we're very honored to partner with Jake VanderPlas and Brian Granger in creating Altair, a Python library for using the Vega-Lite language. I would love to give a keynote just on the intricacies of Vega-Lite and Altair, but others have already done so. My students gave a talk at OpenVis Conf, so if you want to know more about interactive specification using a declarative grammar, please see that talk; and if you'd like to learn more about the motivation behind Altair, you can rewind one year to Brian Granger's talk here at PyData.

While these things are great and I'm excited about them, I want to spend my keynote on a slightly different question. Rather than getting into the nuts and bolts of how to build a visualization, let's pop the stack and ask: how might our tools help us become better analysts? That is, getting beyond the nuts and bolts of how these tools work, how do they function within the larger process of making sense of data? One question that unfortunately goes unasked all too often is not how to build a visualization, but whether I should even build that visualization in the first place. How do I know whether a visualization is effective or not? So a starting point for our tour today is the question: what makes a visualization good? Let's start thinking about that and use it as a foundation for thinking about how we might build more powerful data tools.

To start answering this question, and to give you a sense of some of the research in this space, let's do a quick experiment. I'm going to show you two shapes, and your job is to make a quick value judgment: how much larger is the big shape compared to the small shape? Don't cheat by using your thumb or a pencil to measure; just make a quick visual estimate, keep it to yourself, and we'll take a poll. First off, here are two circles. Make a quick estimate: how much larger is the big circle relative to the smaller one? In other words, by how much would you have to magnify the area of the smaller circle to arrive at the larger circle? Raise your hands if you think the big circle is four times larger.
All right, looking around the room... how about five times larger? More? Six? Okay, seven. Now eight? Nine? Ten? Eleven or higher? So what's wrong with your eyes? That was a very long-tailed distribution: we got every number from four to eleven-plus, so basically we're talking about a lot of entropy here, not a lot of consensus. Now let's look at a different example. Here are two bars; do the same exercise. How much larger is the big bar relative to the small bar? Make your visual estimate and we'll take another poll. How many people think the large bar is four times bigger? No one. Five times? No one that I can see. Six? Okay. Seven? All right. Eight? Nine? Whoa, big drop-off. Ten? Eleven or higher? Hey, you voted twice! So this is already a much tighter distribution: everyone is clustered around six, seven, and eight, as opposed to being spread out between four and eleven and a half. In case you're wondering, the answer is the same in both cases: it's seven. Obviously, short of melting the circles down to see how the smaller one fills the larger, it's hard to judge precisely, but it's seven here and seven there.

So why is this so much easier in one case than the other? There are nonlinearities in human visual processing, and there's an entire research area of graphical perception dedicated to studying these differences in what we actually visually decode from information graphics. For example, here are the results of studies we ran using experiments deployed on Mechanical Turk, looking at estimation error when reading different graphics: bar charts with bars either adjacent or separated by some distance, in terms of position and length, and also angle, circular area, and rectangular area. You can see how these fall out; the plot shows 95% confidence intervals of the error. By looking at these results we can gain guidance for more effective visualization design. If I want to support the task of quantitative comparison, essentially proportion judgments, I should prefer position encodings to area encodings, as we all just experienced. We can turn this into rankings that are useful for design: given the task of comparing proportions, I can consult these kinds of resources, or even create algorithms that use them as a resource for making informed decisions about which charts might be better to show in a given situation.
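As a toy illustration of using such a ranking as a machine-readable resource, here is a hypothetical sketch; the ordering and scoring below are a simplified, hand-written stand-in for the kind of ranking these studies suggest, not the published experimental results or any shipping recommender.

```python
# Toy sketch: choose a visual encoding channel from an effectiveness ranking.
# The ranking is a simplified stand-in for what graphical perception studies
# suggest for quantitative comparison (position > length > angle > area > color).

QUANTITATIVE_RANKING = [
    "position_x", "position_y", "length", "angle", "area", "color_hue",
]

def choose_channel(taken):
    """Return the highest-ranked channel that is not already in use."""
    for channel in QUANTITATIVE_RANKING:
        if channel not in taken:
            return channel
    raise ValueError("no encoding channels left")

# On a map, x and y are taken by geography; length and angle aren't practical
# for point symbols, so the recommender falls back to area (symbol size)
# before color hue -- mirroring the choropleth redesign discussed below.
print(choose_channel({"position_x", "position_y", "length", "angle"}))  # -> "area"
```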
For example, here's a type of map you see all too often, a choropleth map: it uses color over geographic regions, in this case US states in a geographic projection, to communicate some quantity. Here the data is about political donations and how many were made in different states. You can see that Texas and California gave more, which probably isn't surprising given their larger populations, but other things may be quite difficult to see. Not only is color a less effective encoding channel, as you saw in those rankings, it's also confounded with shape and sometimes with the projection, which makes things even harder to see: what happened in Delaware, or DC, or Rhode Island may be much harder to make out simply because of their relative sizes. So we might consult these kinds of resources and ask: instead of using color hue, what other visual channels might I consider? X and y are already taken if we want to keep the map, so maybe we'll instead use something like area, the size of a symbol, to communicate the quantity. In that case we go from the color map to one using sized symbols. These dark filled circles show the same quantities that were shown in the prior display, and you can see that, yes, California and Texas do have high values, as we saw, but DC actually has a huge number of contributions that was invisible in the previous map. In addition, we get more latitude for interaction. I said the filled circles show the data from the previous map; the empty circles show totals, so the dark circles are actually showing a subset, and we can now see subset relationships by having symbols within symbols. Not only did we gain some perceptual clarity, we gained more degrees of freedom for showing interactive selections and how they compare across different states. What to visualize is, again, a more important question overall than which tool you use to show it.

To drive this point home, I want to share my favorite example of a perceptual redesign. This was done by Michelle Borkin and colleagues while she was a PhD student at Harvard, working in collaboration with doctors at Massachusetts General Hospital to explore different visualizations of arterial shear stress. In the bottom left is what was the state-of-the-art display at the time they started their research: an anatomically correct 3D model of an arterial tract, colored with a default rainbow color palette to show shear stress. This is something a doctor would look at to make diagnostic decisions, for example whether to administer certain blood-thinning drugs or, in an extreme case, as part of the evidence going into an operation decision. In the other quadrants you can see alternative visualizations. One design move goes from a 3D model to a 2D model, representing an artery by cutting it open and unrolling it, ditching anatomical exactness to instead show the topological connections between arterial tracts. The other design move is changing the color scheme, going from a rainbow palette to a more perceptually motivated diverging palette that diverges into red for the areas most worrisome to a doctor. The really interesting part, though, is the evaluation: they put these displays in front of real doctors with real diagnostic tasks to measure accuracy under the different visual encodings. The results, each statistically significant, were quite shocking to me. With the state-of-the-art visualization of the time, diagnostic accuracy was roughly 40 percent. Simply changing the color scheme moved accuracy up about 30 points, to 71 percent, in the bottom right. And moving from a 3D display to a 2D display gave roughly an additional 20 percent boost, almost as an independent factor, in people's ability to spot trouble areas and make appropriate diagnoses. So these choices of visual encoding really do change what we see, how easily we see it, and how well we make valid comparisons. Knowing something about what makes a visualization effective really matters: I think most of you would not have a hard time deciding which of these displays you would prefer your doctor to use.
Right, so these aren't just matters of intellectual interest; they have real-world consequences for what we do and don't see in data. We can use this as a basis to start figuring out how our tools might make us slightly better analysts, one way being to provide some smarts about appropriate visual encodings. With that foundation, let's look at a slightly broader question: how might we support more effective data exploration overall? Given a new data set we're trying to get familiar with, we might have some hypotheses to begin with, but others may be nascent or only discovered as we start to interact with the data. How do we make that process more effective overall? That's a hard question, but one where I think tools can play a role.

Just as one example, here's a data set from the juvenile corrections department of the state of Maryland, from a couple of decades ago. The most important thing to know is that this is about juvenile criminal offenders who have been processed by the system, and the other important thing to note is that the x-axis is age. As a father of two young children, I can tell you I am very concerned about the rise in violent infants. Clearly something is going on; maybe it's just the state of Maryland, maybe the West Coast is different, I don't know. And as the grandson of a wonderful 95-year-old grandfather, I'm also very worried about these marauding centenarians over here. As you may have guessed, these are actually data quality issues. When the age of a juvenile was unknown, different administrators had different policies for dealing with the fact that the integrity constraints on the database did not allow them to enter "unknown", so they opted for either zero, the minimum, or the maximum value allowed by their interface. That explains a lot of this data; unfortunately I have no good explanation for why a 35-year-old is in the middle of it. Maybe they were particularly baby-faced, I can't say for certain. Nevertheless, data quality issues are one thing that obviously undermines exploratory analysis, and something our tools can do a lot to help us identify.
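A check like the following often catches issues of exactly this kind before they distort an analysis. This is a hedged pandas sketch: the DataFrame, the `age` column, and the thresholds are hypothetical placeholders, and the heuristic simply looks for suspicious piles of records at the extreme allowed values, a common signature of "unknown" being coerced to a sentinel.

```python
# Sketch: flag sentinel-looking values in an age column before deeper analysis.
# The DataFrame, column name, and plausible range are hypothetical.
import pandas as pd

df = pd.DataFrame({"age": [0, 0, 0, 14, 15, 16, 17, 17, 35, 99, 99, 99]})

counts = df["age"].value_counts()
lo, hi = df["age"].min(), df["age"].max()

# Suspicion heuristic: the extreme allowed values account for an outsized
# share of records, or fall outside a plausible range for the population.
plausible = df["age"].between(10, 21)          # juvenile offenders, roughly
print("records at min/max:", counts.get(lo, 0), counts.get(hi, 0))
print("share outside plausible range: %.1f%%" % (100 * (~plausible).mean()))
print(df.loc[~plausible, "age"].value_counts())  # inspect the suspect values
```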
But I think a deeper and more pernicious problem is what we might refer to as blinder vision. After years of teaching visualization and data science classes, a recurring trend I see is this: you give people an exploratory analysis project, they pick an interesting data set, they have their candidate hypotheses ready to go, and then they deep-dive straight into the data, often jumping into multivariate views without ever assessing simple univariate summaries, without asking "what if" questions about the ways the data collection might be in error. They may even overlook latent factors that are actually more explanatory, because they were so focused on their pet hypothesis. So thinking about how to make exploratory analysis appropriately broad and comprehensive is another interesting problem. There are many pitfalls in analysis; I'm sure if we brainstormed in this room we could come up with a very long list. I've shared just two: overlooking data quality issues and fixating on specific relationships. And of course we know from the intelligence analysis and psychology literatures that we, as limited human beings with constrained cognitive capacities, have many other biases that affect what we see and how we think about things.

So maybe our tools can play a bit of a role in nudging us away from some of these unfortunate tendencies. That's one thing we're starting to explore in projects such as Voyager: the idea is to support rapid interactive analysis using visualization, but in a way that also encourages breadth of consideration and helps you not overlook data quality issues. Let me jump into a demo of this tool. Here is the Voyager 2 UI. Those of you familiar with tools like Tableau might notice some similar elements. We have a data set about cars, automobiles over a number of years, with data fields of different types, string data and numerical data, laid out here with our schema, and we have visual encoding channels: if we want, we can drag and drop a variable to make it the x-axis, the y-axis, a color encoding, and so on, all of which under the hood are being transformed into Vega-Lite visualizations. The UI is really just a way of specifying formal statements in the underlying Vega-Lite language. But probably what drew your eye first is the gallery of visualizations on the right. At this point all we've done is load the data set; we haven't specified a visualization, yet the gallery has automatically been populated with recommended views. This includes univariate summaries for each variable, so we can get a quick overview of the shape and structure of the data: histograms of all our discrete measures, then on to our continuous quantitative measures, where we can see that some have roughly normal distributions while others look more lognormal. In doing so, we can hopefully spot any unexpected values or outliers that might undermine the subsequent analysis.

Looking at this, I notice that mileage, miles per gallon, is one of the properties in the data set, so if I'm interested in that I might start to build up views on my own. I grab miles per gallon and drag it to the x channel, and I get a horizontally oriented dot plot; I can change the axis, log-scale it if I want, or transpose the view so it runs along the y-axis. Notice that while I've specified a focus view, I'm still getting recommendations, but now they're based on the view I'm currently looking at: the tool takes the Vega-Lite specification for this view, automatically identifies ways to generalize it, and presents related visualizations. I might see summary views such as a histogram, or the global average, which in this case is about 23.5 miles per gallon. I also see what happens when I start adding fields, basically a search frontier: what happens if I look one step forward in my analysis, but do so in a comprehensive way so I don't overlook potential relationships of interest? Here I see that displacement, horsepower, and weight all follow very similar, rather quadratic-looking curves, but the relationship between acceleration and miles per gallon is quite different. So if I want a car that's fast but also has good mileage, built somewhere between the seventies and eighties (it's an old data set), I can see that at the time the top cars for those criteria were Volkswagens: there's a VW pickup, a Dasher, and so on.
So I can see these outliers that may be of interest if I'm looking for a fast but fuel-efficient car. As I scroll down I see other relationships automatically as well: as the number of cylinders increases, mileage tends to decrease; mileage also differs by region of origin, and it seems like the USA is producing the least fuel-efficient cars. Is that true, or are there latent factors? That's something we'll come back to shortly. Finally, I see mileage plotted over time, and it appears to be improving across the years included in this data set. If I want to learn more about that, I can mouse over this icon in the upper right, which automatically fills in the specification view with what would be needed to build this chart myself; we scaffold learning by example, so you can quickly see how to build a chart yourself from the preview. I click to make this the new focus view, and then I get recommendations conditioned on this being my focus chart. Here I see raw data, but perhaps I'm interested in aggregates instead; this is the average mileage plotted here, so maybe I'll switch to that view. Then I see breakdowns by different categorical fields, including the number of cylinders and, getting back to what we saw a moment ago, the different origins, and indeed the USA looks to be doing poorly on mileage relative to the European and Japanese markets across time. Let's look at that more closely. As we go to this other view we get alternative visualizations: here I have a layered line plot, but if I had lots of series maybe it makes more sense to break them out, so I see alternative encodings recommended, such as showing Europe, Japan, and the USA in individual facets. I can also filter the types of recommendations I get, for example showing only views that add new categorical fields to the display. Doing so, we see a plot that breaks cylinders up into facets and encodes origin as color, and we see that while the USA does do poorly overall, it's also the only region producing eight-cylinder cars, which by their nature have worse mileage. A fairer comparison of these markets would look at the four- and six-cylinder cars, so we aren't subject to ignoring latent factors. In this way we want to explore topics of interest but also be exposed to the breadth of the data, and that's what the Voyager tool is trying to help accomplish.
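That final faceted view corresponds to a compact Vega-Lite specification. A rough sketch of it in Altair, assuming the classic cars dataset shipped with the `vega_datasets` package (column names as in that package), might look like the following; it approximates the view described, not the exact chart Voyager generates.

```python
# Sketch: average mileage over time, colored by origin and faceted by
# cylinder count, assuming the cars dataset from the vega_datasets package.
import altair as alt
from vega_datasets import data

cars = data.cars()

chart = alt.Chart(cars).mark_line().encode(
    x="year(Year):O",                      # model year on the x-axis
    y="mean(Miles_per_Gallon):Q",          # average mileage per group
    color="Origin:N",                      # Europe / Japan / USA
    column="Cylinders:O",                  # one facet per cylinder count
)
chart.save("mpg_by_cylinders.html")
```

Restricting the comparison to the four- and six-cylinder facets is then a matter of adding a filter transform before faceting.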
Going back to the slide deck: we ran a study to see how people's data exploration patterns change when using a tool like this versus a UI modeled after Tableau. We found that, compared to the existing tools, Voyager led to over four times more variable sets seen, so people saw a much broader swath of the multivariate combinations in their data, and they interacted with over two times more of them. So it's not just that these views are rendered on screen and ignored; people actually engaged with at least twice as many distinct plots, with unique data variables driving them. Looking at the qualitative feedback, people said things like "the related view suggestions accelerate exploration a lot"; they found it really sped up getting a comprehensive overview of a data set. It also aided learning: one participant said, "I like that it shows me what fields to include in order to see a specific graph; otherwise I have to do a lot of trial and error and can't express what I want to see." But we also saw that these things can, in certain instances, work maybe too well: one person said, "These related views are so good, but it's also spoiling that I start thinking less; I'm not sure if that's really a good thing." I think this is really interesting. Given the scale of data, not just the number of records but the number of variables, and given human cognitive biases, some amount of automation can be really helpful. But there's clearly a delicate balance: if we go whole hog and insist on automating data analysis, we'll throw the baby out with the bathwater and run the risk of people becoming passive receptacles of whatever some pre-configured set of algorithms says, rather than driving the analysis themselves. Getting the right balance between the analyst and computation in these interactive sessions is, I think, critical, and we look forward to exploring this further as we refine the system. I'm also excited to share that we're working with Brian Granger and others to bring this as an integrated component within the nascent JupyterLab environment, bringing both Altair and Voyager, things we've been developing largely in the JavaScript ecosystem, into the Python data science world where they can hopefully be valuable as well. I'm really excited about that and very thankful for the collaborations we have there.

Of course, all the examples I've shown so far focus on visualization, which isn't surprising given how passionate I am about it, but also given how highly developed our visual senses are: vision is a high-bandwidth communication channel. Nevertheless, visualization is just one component in a much larger process of data analysis, which includes how we acquire data, clean it up, integrate diverse datasets, and build models, and then also social functions: how we share results, get feedback, and disseminate things more broadly. It would be lovely if we simply moved from one step to the next in an uninterrupted flow, but of course that is a fantasy, and the real world looks more like this, as I'm sure you've all experienced. Visualization, for example, is often the first line of defense against incomplete or bad data: I visualize something and realize I might have to acquire new data sets or engage in additional cleaning. These feedback loops are at the heart of the interactive process; even with automated machine learning algorithms, we spend an immense amount of time on human-driven feature engineering, model validation, iteration, and so on. Understanding the interactive nature of actual data science practice is critical to identifying pitfalls and making our tools better. To understand this more broadly, my students, collaborators, and I have also conducted a number of interview studies. Here's one quote we really like to share, from a study we did back in 2012, where a data scientist in industry said:
"I spend more than half of my time integrating, cleansing, and transforming data without doing any actual analysis. Most of the time I'm lucky if I get to do any analysis at all." I've shown this quote a lot, and every time I do I get pushback; people say the quote is just not right. Can anyone guess why? Yeah, exactly: a gentleman just yelled out "80 percent." Everyone says: half? This is the luckiest data scientist I've ever met, only spending half their time cleaning up data. At the time we first heard this, what surprised us was that it wasn't treated as a bigger issue in terms of tooling for data wrangling, preparation, integration, and so on; it felt like the elephant in the room of data science research. Since then the topic has gotten a lot more attention, due to our efforts and many others', basically because as practicing data science became more popular and better known, you just couldn't ignore the magnitude of these issues. Nowadays we have the likes of Big Data Borat opining: "In data science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data."

So obviously a lot of interaction with data happens even before building models or creating informative visualizations, and that leads to other questions in the realm of interactive analysis, such as: how might we support interactive data wrangling? This is something I've been working on for over half a decade now, and I'll show you a now-classic result that my students and I worked on back in 2011. Before I do, I want to point out that the problem isn't just nasty data in the form of bad values or horrendously designed log files that you have to map back into a table. There's something I like to call should-be-structured data: data in some weird text structure that really had no need to be in that idiosyncratic format. Even well-curated data often has these issues. For example, here are your US taxpayer dollars at work: this is data you can download from the Bureau of Justice showing crime statistics, and it's designed for human consumption in a spreadsheet environment and nothing else. It's well curated and loads fine into Excel, but if you try to load it into pandas, or into R data frames, or into a relational database, it will break on import, because it's just not formatted appropriately.

To explore these issues we built a set of interactive tools for data preparation, and one of the early research systems was called Data Wrangler. I'll show you a quick demo of that now. Here's the Data Wrangler UI, using the same data set from the previous slide: the crime data is broken up by different states, each with its own sub-matrix. We've loaded it into the tool, and we see a somewhat familiar spreadsheet-style UI. A couple of operations have already been performed: the tool recognized some delimiters and split the data into rows and columns, in this case based on tabs. Now, I could imagine having a set of commands to invoke from a menu; that might be one way to clean up this data. But we wanted to support interactive transformation. We initially played with a gesture language, but our gestures rapidly became ambiguous: the same movement that someone demonstrates might mean different things depending on context.
So that took us down more of an automated recommendation and machine learning approach. Now I indicate what I'm interested in, for example row two, and the system asks: what might I want to do, given that I selected that row? One inference is to delete that row, but the system can also examine its contents, see that the row is empty, and so suggest deleting all empty rows. Choosing that, I get a visualization that highlights what the change to the table will be if I apply that recommendation; visualizing the effect of a transform turned out to be a key part of helping people make sense of this tool. In this case it does what I want, so I hit enter, the rows are removed, and I can go on cleaning up the data. My column headers are buried within the data, so I'd like to extract that metadata. I click that row and get several options, including promoting it to a header row, so I do that. Of course, that header row is repeated throughout the file, so I want to get rid of the additional copies; one suggestion is to delete those rows based on exact value matching, and that works here. Next I want to derive something, which is a more complex operation in this chain: extracting the state names to make them part of the data, since currently they're buried inside a text label. To indicate my interest I select the text "Alabama", and the system infers that I'm trying to do an extraction, which indeed I am. The initial inferences are fairly simple, matching by position within the string or by exact content, and I can see immediately in the preview that they don't work: Alaska is not selected, Arkansas is cut off. Rather than wasting time with those suggestions, I can give the system more examples so it can better generalize its inference, and now the selection looks good down the column, and the top suggestion is to extract from the column after the text "in", which also looks right for this data set. I hit enter, and under the hood we've learned a regular expression for performing this extraction. As I'm doing this, you may have noticed some other features of the display. This little numerical icon indicates that the tool has inferred the column is largely numbers, and in red I can select all the values that fail to parse as numbers, so I'm starting to get some type safety and data quality feedback. Meanwhile, over here in gray, the tool shows the proportion of cells that are missing or empty. If I want suggestions for what I can do at the column level, I click the header and get suggestions that include interpolation or filling, in this case filling down the empty cells based on the observed values. I do that, and now these cells are filled in with the state name. Next I'd like to get rid of rows I no longer need, such as "reported crime in Alabama". I could just throw away everything that doesn't parse as a number, but that could be brittle: many spreadsheets have a value plus an annotation, which might break type inference but might be something I actually want to keep.
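For comparison, this same sequence of steps, dropping empty rows, promoting the buried header, extracting the state name, filling it down, and removing the label rows (the step described next), can be written by hand. Here is a rough pandas sketch over a toy table imitating the Bureau of Justice layout; the toy values and column structure are hypothetical, and this is not the script Wrangler itself generates.

```python
# Sketch: hand-written pandas version of the wrangling steps shown in the demo,
# over a toy table shaped like the crime spreadsheet (hypothetical values).
import pandas as pd

raw = pd.DataFrame([
    ["reported crime in Alabama", None],
    ["Year", "Property crime rate"],
    ["2004", "4029.3"],
    ["2005", "3900.0"],
    [None, None],
    ["reported crime in Alaska", None],
    ["Year", "Property crime rate"],
    ["2004", "3370.9"],
])

df = raw.dropna(how="all")                       # delete empty rows
df.columns = ["Year", "Property crime rate"]     # promote the buried header names
df = df[df["Year"] != "Year"].copy()             # drop repeated header rows

# Extract the state from the "reported crime in <State>" rows, then fill down.
df["State"] = df["Year"].str.extract(r"^reported crime in (.*)$", expand=False)
df["State"] = df["State"].ffill()
df = df[~df["Year"].str.startswith("reported crime in")]  # drop the label rows

print(df.reset_index(drop=True))
```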
So, to be a bit more specific, say I want to get rid of the rows that contain the text "reported crime in". The system initially infers that I'm doing an extraction based on a text selection, but I can also give it a hint that I'm interested in deletions, which limits the space of transformations it considers, and then my top suggestion is to delete all rows that begin with the text "reported crime in". That's exactly what I want, and it's a nice query, so I execute it, and now I have a relational data table I could load into any standard analysis tool. Sparing you the minor detail of renaming a column, at this point I might click export, and because this data set is relatively small and fits in memory in this browser-based application, I can just emit all of the data, in CSV or TSV or JSON, pick your format. But in doing all of this, what we've actually learned is a script: the history down here is just a rendering of underlying programming-language statements that were learned through direct manipulation. I can use this to cross-compile to different languages. In the initial research prototype we generated Python code that ran in a quick-and-dirty Python runtime we implemented as a proof of concept; we've since taken the research much further and can generate scripts that run at scale, as Hadoop jobs or on Spark, for example. So I interactively demonstrate a transformation, get feedback, and then turn that into a program I can execute at scale. That was the idea behind the Data Wrangler project.

Since then we've commercialized this: we started a company called Trifacta, which releases a free tool you can download, called Wrangler, that supports a number of similar operations. Here you're seeing a data table of contributions to political campaigns in the 2016 election cycle. In addition to the table you see histograms: preview visualizations that aid data quality assessment all along the way. As I demonstrated before, interactions like selecting text lead to visual previews of possible transformations, but we've also been exploring automated visualization. For example, I can ask, given these candidate names, to profile this particular column of my data set: tell me statistical summaries of different properties of the data, type issues, outliers, and so on. In this case, for string data, to decide what counts as an outlier we actually look at the length of the strings, which is a simple but strangely effective heuristic for spotting untoward values in a data set.
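A hedged sketch of that heuristic in pandas follows; the names are toy data, and the simple IQR fence stands in for whatever rule the tool actually applies.

```python
# Sketch: flag string values whose length is unusually large, a cheap proxy
# for "untoward" entries in a name column. Toy data; the real tool's rule
# may differ.
import pandas as pd

names = pd.Series([
    "SMITH, JANE", "DOE, JOHN", "LEE, KIM",
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "REMO THE CUTEST DOG EVER THE MINI SCHNAUZER",
])

lengths = names.str.len()
q1, q3 = lengths.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)          # classic IQR fence for high outliers

print(names[lengths > upper])          # isolate the suspicious entries
```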
If I want, I can interact directly with this visualization: on the box plot below, I can select the bar corresponding to the high outliers and get transformations relative to those values, including filtering them out or isolating them for further analysis. For example, if I elect to keep only the outlier string values among the candidate names, I get this data table subset, and, getting away from the political morass of today, we can look at some of the candidates you've never heard of who didn't make it. These are all people with legitimate financial transactions behind their names. That includes a candidate whose name is basically the letter A with the caps lock key stuck down; Lindsay Lohan; Emperor Goku; Remo the Cutest Dog Ever, the mini schnauzer from the Dog Party; Alexander Soy Sauce and Taters; and Master First Gourd from the Nap Party. There's this whole strange world of fringe candidates that, had you only been dealing with this data through symbolic tools, you probably wouldn't have known existed, and that could have biased your analysis in certain ways. You might prefer to remove this data, or, if you're a journalist, maybe there's a whole interesting story in investigating these strange fringe candidates who are actually registered with the Federal Election Commission. In this way, interaction can also make us aware of things we might otherwise have overlooked, and the balance between human-driven exploration and automation again proves quite interesting.

More generally, I want to contrast this with a typical approach to data analysis. That might go like this: you start with a data set and write scripts for transformation and visualization; if the data is very large, you might also have to extract a sample to work with; you run your scripts on that data and then visualize the output, firing up matplotlib and creating plots from the output of the transformation script you've already run; and you iterate until you finally arrive at a suitable set of analysis procedures. What we're exploring across these projects, both Voyager and Wrangler, is a way to meaningfully invert this process: start with a visual representation of the data, whether text tables, summary visualizations, or something else, enable interactions with those, and use that as evidence to seed a search process. Under the hood, in both cases, we have a language model, whether a language for representing visualizations or a language for representing data transformations, and then we do search and ranking: we use enumeration and machine learning procedures to identify what we think are good responses to what the user has indicated they're interested in, and we close the loop through visual feedback, whether new visualizations or previews of what transformations would do. The aim is a technique that's hopefully faster for experts but also much more accessible to people who know their data but aren't necessarily programming wizards. That's the kind of thing we're trying to explore in supporting interactive analysis.
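As a toy picture of that enumerate-then-rank architecture, consider generating candidate next views from a base specification and ordering them with a crude score. The spec format and scoring rule below are made up for illustration and bear no relation to the actual internals of Voyager or Wrangler.

```python
# Toy sketch of the enumerate-and-rank idea behind view recommendation:
# generate candidate specs that add one unused field to a free channel,
# then order them with a simplistic hand-written score. Purely illustrative.
from itertools import product

CHANNEL_SCORE = {"x": 3, "y": 3, "color": 2, "size": 1}   # crude effectiveness

def candidates(base_spec, unused_fields):
    """Yield specs that extend base_spec by one field on one free channel."""
    free = [c for c in CHANNEL_SCORE if c not in base_spec]
    for field, channel in product(unused_fields, free):
        spec = dict(base_spec)
        spec[channel] = field
        yield spec

def rank(specs):
    """Higher total channel score first (a stand-in for a real ranking model)."""
    return sorted(specs, key=lambda s: sum(CHANNEL_SCORE[c] for c in s), reverse=True)

base = {"x": "Miles_per_Gallon"}
for spec in rank(candidates(base, ["Horsepower", "Origin"]))[:3]:
    print(spec)
```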
So, to wrap up, let me share some parting thoughts as we continue to explore this space of tools. I know many of you are avid data scientists, and many of you are probably also data tool builders, so some of these may be things to consider as you build tools that people beyond yourself will use to conduct analysis. What are some of the considerations that have come up in this short talk? One is a careful balance of automation and control. By using automatic recommendations or machine learning, we may be able to expedite the process of analysis and also make it broader, but in doing so we want to maintain the analyst's control, keeping the analyst at the center of the process. There are many pitfalls to automation, as I'm sure you're aware, such as loss of agency, intuition, and domain expertise. We want people's understanding of the domain, which typically vastly outstrips the data resources at hand, to help guide where the analysis should go: which findings are genuinely surprising, and which are just noise? These judgments often require a human, and if we black-box too much of this in automation, we also run the risk of letting poor models loose in the wild, which could have devastating consequences for organizations in government, in industry, and certainly in health as well. But of course we humans have limitations too: we have our cognitive biases, our blinder vision, and we make mistakes. So in what ways can our tools help offset those? Having a better understanding of human capabilities, whether that's what makes a visualization effective or what common errors people make in their reasoning and exploration of data, and designing tools in ways that help counter them, is, I think, a critical concern.

To do that, what we've explored in these projects is enhancing interfaces with underlying models of capabilities. What do I mean by that? In these cases it was twofold. To enable these applications we worked at the language level, with visualization or data transformation languages that gave us a basis for formally reasoning about the steps a user might take. Then, within those kinds of steps, we thought about how to rank or recommend them effectively, whether by using visual perception guidelines that help pick more effective visualizations, or by knowing, based on the data types and relationships, which transformations seem most needed at the time. Now, as we build these things, the models that power these applications will themselves require curation, and an interesting problem we're all going to face going forward is how our interfaces will learn from us over time. I think data science is a particularly rich petri dish in which to explore these issues, and they will have much larger resonance across many areas of software. How are these interfaces going to learn, and how can we help shape them so that the models they learn really do help a larger swath of people get their jobs done effectively? Along the way there will be a lot of interesting meta-challenges as well: this is going to require the means to inspect, monitor, and audit the models that underlie an increasing number of the interactive sessions we experience today. I think the types of tools and expertise being explored by this community are going to be critical to these issues going forward, and I'd be very happy to engage with you all in conversation afterwards on any of the topics that resonated with you.

With that, I'd like to wrap up. I also want to thank all of my collaborators and students over the years; this is just a small sampling of them, along with the different funding agencies. And I'd like to thank all of you in this room: my group, my students, and I have made immense progress building on a number of the open-source tools the Python community has developed, so thank you for all that you produce and all the examples you provide. Thank you. [Applause]
Info
Channel: PyData
Views: 22,171
Id: hsfWtPH2kDg
Length: 41min 34sec (2494 seconds)
Published: Mon Jul 24 2017