Stanford Seminar - Expressing yourself in R

Captions
So what I want to talk about today is, broadly, expressing yourself in R. I don't mean how you express your feelings in R, but how you express a data analysis, which to me is the process where raw data comes in one side and understanding, knowledge, and insight come out the other. I want to start with a little bit of justification: I strongly believe that if you're doing data analysis regularly, it's really important that you learn how to program. How many of you here are programmers? That's pretty much everyone. How many of you build tools that other programmers use? A few. Interesting. I think it's really important that you program if you're a data analyst. It's a little strange to me that when I talk to statistics departments I don't need to tell them that everyone should be programming, but when I come to a computer science department I have to make that pitch a little bit more. I think it works the same way in both directions: statisticians strongly believe that if you're doing statistics and you're not a statistician, you're doing it wrong; similarly, computer scientists believe that if you're programming and you're not a computer scientist, you're doing it wrong too. That's a little bit of a caricature.

So if you're going to program, why should you use R as a programming language? I'll talk a little bit about that, and then about two of the projects I've been working on, iterations of two of the tools that Michael mentioned: dplyr, which makes it easier to express your data manipulations, and ggvis, which allows you to express a new class of visualizations.

Why program? To me there are three main benefits. First, reproducibility: it allows you to redo the past and the present. This is really important for good science; you have to be able to recreate what you did, and I really like the idea of provenance, that you can track your data from creation to the final artifact that influences someone to make a decision. Reproducibility is about recreating the past and the present; automation is about preparing for the future. Programming matters because as new data comes in, as it does all the time, you need to be able to rerun your existing analyses, and to do new analyses that are closely based on existing ones. The final benefit, and the one I think is most underrated, is that code is a vehicle of communication. It's obviously a vehicle of communication between you and the computer, but it's also a really powerful tool of communication between you and other people, because code is just text. It's very easy to put code in an email to ask someone else how to fix your problem, to Google for the answer, or to post your problem on Stack Overflow. This is a really important part of coding; I think it's fundamentally a community process, and this is particularly so for R, since many R users are very task-oriented: they want to solve a specific problem, they don't want to learn the beauty and the peril of programming, they just want to get their jobs done. So being able to quickly find an answer, whether it's in a book, on the web, or on Stack Overflow, is really important.

So if you're going to program, why use R? To answer that question you need to think about what the bottlenecks in the data analysis process are,
and I think there are two main categories. First of all, you have to think about what you want to do and figure out what the next step is. Then you need to describe it precisely, in such a way that the computer can understand what you want; in other words, you have to program it. And then finally the computer has to go away and crunch some numbers. In my experience, when you're doing a data analysis the biggest bottleneck is cognitive: you spend way more time thinking about the problem than you do actually computing on it. And if that's the bottleneck, you don't want to choose a programming language that's optimized for performance. It doesn't matter if you can do the computation ten times faster if the computation takes a second and you have to spend a minute thinking about it. You want to pick a language that helps you think about the problem and helps you express that in code, and I think R is really well suited for that.

The other thing is that when you're doing a data analysis you need the tools of data analysis, of which I think there are four main parts. First, what I call data tidying, or data wrangling. This is a little bit like data cleaning, but a bit simpler: it's just getting the data into a form that you can work with. The arrow sort of goes here, but in reality it's maybe all the way over here; often the hardest part of any data analysis project is just getting the data into a form you can work with. Once you've got it into your data analysis environment, you're going to iterate between three main sets of tools. Tools of transformation, which include things like creating new variables that are functions of existing variables, like creating a density variable from weight and volume, or simple aggregations like grouping and sums. You're also going to use visualization. The strengths of visualization are, I think, twofold: first, visualizations are really useful because they uncover the unexpected; visualizations can surprise you. They are also really helpful for refining your questions; another big part of the data analysis process is just making your question sufficiently precise that you can answer it. Visualizations surprise you, but they fundamentally don't scale, because a human has to look at every single visualization. So to me the complement of visualization is models, which I take very broadly to include all of statistical models, machine learning, and data mining: basically, whenever you can make a question sufficiently precise that you can answer it with a handful of summary statistics or a simple algorithm, I think of that as a model. Models are great because they scale very well; it's almost always cheaper to buy more computers than it is to buy more brains. But models fundamentally do not surprise you. A model is never going to tell you something that you did not expect; a linear model will never say "your data is nonlinear" or "you've missed an interaction". So visualizations surprise you but they don't scale; models scale but they don't surprise you. In any real analysis you're going to iterate between these tools many, many times, and I think R, as a language and as a community for data analysis, provides many or all of the tools for all of these pieces of the problem. Now there are lots of reasons
not to use R, of course, and I certainly don't claim that R is the best programming language in the world. It is a very unique and quirky language, but I think it is very well suited to its domain. The two main drawbacks of R that people talk about are, first, that it's slow, which is true; but that's evaluating it as a programming language, and that shouldn't be what you care about. You shouldn't be trying to optimize the speed of your programming language; you should be trying to optimize the speed of your data analysis. Also, R has always been designed as a language to glue together other high-performance languages; it was originally designed to glue together command-line Fortran and C programs. Many of the computational bottlenecks in R have already been rewritten in high-performance languages, and R provides tools that make it very easy to connect to them as well.

The other downside of R is that all of the data must fit in memory. It is fundamentally an interactive exploration environment, and to do that, really, all of the data must fit in memory. People talk about big data pretty loosely now without, I think, realizing how big memory is these days. For example, on Amazon EC2 you can get a machine with 250 gigabytes of RAM pretty easily, and you can fit a hundred million to a billion observations in memory, for about $3.50 an hour. By and large, companies have big data problems, but to answer a specific question it's usually easy to get the data into memory through some combination of simple subsetting, sampling, or aggregating. There are certainly lots of problems for which this is not true, but probably 95 percent of people's problems involve far fewer than 100 million observations.

So what I want to talk about first today is dplyr, which is a tool to make data transformation easier, and I want to talk about it in the context of a little data set I put together of R package downloads in 2013, which comes from our CRAN mirror. Here, using the dplyr package, I'm going to load this data set into R, calling it logs, and then print it out. If you've ever used R before and printed out a data frame, it has some rather interesting defaults, like printing the first 10,000 rows for you; typically, looking at 10,000 rows of data on your screen is not very informative. dplyr does some simple, thoughtful things, like only showing you ten rows by default: it lets you see the data and get some idea of what's going on without overwhelming you with detail. As you start to use bigger and bigger data it also gets really helpful to put commas in the numbers, so here we can see there are around 23 million rows of data; this represents the 23 million packages that were downloaded from our mirror in 2013. This isn't big data, it's data that fits easily in memory, but it's still 1.6 gigabytes, so hopefully by anyone's definition that's a reasonable amount of data.
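A rough sketch of what that first step might look like; the file name and exact column layout of the log data are assumptions, not the ones used in the talk:

```r
library(dplyr)

# Read the 2013 CRAN download logs (hypothetical file name) and wrap them
# in a tbl_df so that printing is well behaved.
logs <- tbl_df(read.csv("cran-logs-2013.csv", stringsAsFactors = FALSE))

# Printing a tbl_df shows the dimensions and only the first ten rows,
# rather than flooding the console with thousands of lines.
logs
```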
The goal of dplyr is to tackle both sides of this bottleneck: to make it easier to think about the data manipulation, what you can do and what you should do, and then to make it faster to actually do it. The key insight for making the cognitive bottleneck smaller is that there are really only a few key data manipulation verbs that you need for almost all problems, and they're the same regardless of where the data lives.

To me there are five key verbs for manipulating a single table of data: select, where you pick a subset of variables of interest; filter, where you pick a subset of rows of interest; mutate, where you add new columns that are functions of existing columns; summarise, where you reduce multiple numbers down to a single number; and arrange, where you change the order of the rows. One thing I think is interesting about this is that it's easy to think of a data frame or a table as a symmetric matrix, but that's really not the case. Data frames are fundamentally not symmetric: you have variables in the columns and observations in the rows, and so the most important operations work on rows and columns differently. Along with these verbs you need some kind of adjective, which is group_by: many of these things you want to do by group, summaries by group, transformations by group, subsets by group, and I'll show you some examples shortly.

So what I want to do is find out which packages are most frequently downloaded. First, I'm going to group the logs by the package variable; I'm basically saying I want the unit of analysis of this data set to be the package. Next I'm going to summarise it; since this is a grouped object, that means I want to summarise it by package, and I want to do that by counting the number of observations in each group. Then finally I'm going to arrange it in descending order and take the first 20. I'm going to run this code live, hopefully, and make it a little bigger so you can read it. I've loaded the data in already, because that takes about a minute; unfortunately one of the current bottlenecks in R is that just getting the data in can take a while. Now I'll run through those lines of code, using the function system.time just to show you about how long each of these things takes. All up, doing this kind of grouped summary on 20 million observations takes about two seconds, so this is very much an interactive speed; you can iterate. Typically the first summary you do is not going to be the right one, so you need to be able to rapidly change your mind and try out other things. And we can run this code just to see what it returns; hopefully it doesn't seem too self-serving.
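A sketch of that sequence of steps, written with intermediate variables; the package column name follows the talk, and system.time() is just used to time each stage:

```r
# Count downloads per package, one verb at a time.
by_pkg <- group_by(logs, package)                    # unit of analysis: the package
system.time(counts <- summarise(by_pkg, n = n()))    # number of downloads per package
counts <- arrange(counts, desc(n))                   # most downloaded first
head(counts, 20)                                     # top 20 packages
```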
One of the things about the design of dplyr is that it's very functional: the goal is to have lots of little functions, each of which does one thing and does it well, takes inputs in the standard way, and never modifies anything in place; it always returns a new output. The downside of that is that when you're writing sequences of operations you either have to write a lot of intermediate variables, or you have to nest things deeply in parentheses. So one thing I sort of discovered, or invented, is this operator in R, the infix function %.%. What it basically does is a little language manipulation: it takes an operation like this and turns it into an operation like this, so instead of expressing nested operations, it lets us write them from left to right. Here is the same code I showed you before, but written using %.%, which is rather like the pipe operator in F#: we take the logs, we group it by package, we summarise it, creating a new variable called n which counts the number of observations in each group, we arrange it in descending order of n, and then we take the top 20 observations. The goal is to create something that we can read very naturally; that should make it easier to write the code, and when we come back to it months later we can still look at it and understand what the heck we were thinking. Just to illustrate that it actually works, I can run that; it has to repeat all those operations again, which takes about two seconds, and we see the top 20 downloaded packages.

(An audience member asks whether this is harder to debug because there are no intermediate variables.) Yes, definitely. When I discovered I could do this, I was a little uncertain whether people would like the idea, because the disadvantage of this code is that there's quite a lot of magic going on behind the scenes; it's actually creating a call that has everything nested pretty deeply, and it does make debugging a little more painful. But if you prefer easier debugging you can certainly write it the other way. The way I normally work is that I write one line and check that it works; once I've verified that line works, I write the next line, and so on. So I don't imagine, by and large, people writing big blocks of code like this and then running it. Well, actually, I do very much imagine people doing that, because my experience teaching people R is that that's the biggest problem: if you say "go and write a function", people write a function that does everything, then they run it, it doesn't work, and there are 15 possible places where the error could be, which is very hard to debug. Any analysis in R is going to be very iterative: you play around with things on the command line, try things out, and iterate your way to a solution, rather than writing one big expression.

Of course, you're typically not doing data analysis with just one table of data; you've got multiple tables, and the verbs here come pretty directly from SQL and relational algebra. The most useful are the left join and the inner join, which most people know about; the other two that are really useful are the semi join and the anti join. These joins are kind of like filters: they don't add any new columns, they just restrict the rows based on whether or not they match rows in the other table.

As well as the cognitive side, dplyr aims to make the computational side more efficient. For local data frames it doesn't do anything particularly magical; it just uses some efficient C++ code to avoid some of the many copies that R generally makes when you're working with data. One cool thing we do, however, is that we have implemented a little miniature evaluator for R expressions of the kind you commonly see in subsetting and mutating operations, usually logical comparisons or summary functions like mean or min, and that allows us, when we're summarising over thousands or hundreds of thousands of groups, to avoid the overhead of R's function calls. This is joint work with Romain François, who is a vastly superior C++ programmer to me.
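A sketch of the piped form of the same query, plus the filtering joins just mentioned. %.% was dplyr's original pipe, later superseded by %>%; the flights and planes tables and the tailnum key in the join examples are hypothetical:

```r
# Same query, written left to right: each verb's result feeds into the next.
logs %.%
  group_by(package) %.%
  summarise(n = n()) %.%
  arrange(desc(n)) %.%
  head(20)

# Filtering joins: keep or drop rows of one table based on matches in another,
# without adding any new columns.
# semi_join(flights, planes, by = "tailnum")   # flights whose plane appears in planes
# anti_join(flights, planes, by = "tailnum")   # flights whose plane does not
```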
All of this is so obviously, trivially parallelizable: you can always split up the data and have different threads working on different pieces at the same time, and we're currently working on exploiting that. My vision is that with one of these big EC2 instances, with 300 gigabytes of RAM and 32 cores, you should be able to do interactive exploration of data with a hundred million to a billion rows. Compared to plyr, if you've used plyr before, dplyr is generally a hundred to 10,000 times faster for some operations. I've had some gratifying tweets from people saying they switched from plyr to dplyr and their code went from taking three hours to run to taking two seconds, which is more a testament to how bad plyr was than to how good dplyr is. But the goal of plyr was always more about the cognitive side than the computational side.

The other thing that's really important when you start working with bigger data sets is that you don't want to move the data around: the data is big and it's expensive to move. Instead of moving the data to where the computation is, you want to send the computation to where the data is. dplyr has this idea of a back end, trying to abstract over the idea of a table of data, with rows and columns, which basically looks the same regardless of what system you're using to store it. So as well as local sources like a data frame, a data table, and a sort of experimental data cube, you can also work with tables that are stored in a relational database, or in things that are kind of like relational databases but aren't quite, like Google's BigQuery. One thing I think is really interesting is that many of the platforms for storing big data are by and large standardizing on SQL as the language for getting that data out. If you wanted to make a bet on what languages people will still be using in 50 years' time to do data analysis, I would bet you a lot of money that we will still be using SQL, which is only slightly horrifying. So what dplyr does is let you keep expressing what you want in R, and it translates to SQL code for you. The goal is not to save you from understanding how databases or SQL work, but to avoid some of the cognitive overhead of switching between R and SQL; I think that overhead exists not because the languages are so different, but because they're so similar, with some very subtle distinctions, which creates bugs that are very painful to figure out.

I'm going to show you a live example with a slightly different data set. Here we're looking at the flight delay data set, or a subset of it called hflights (not short for Hadley flights, but Houston flights). If we print this out, it tells us this data set is coming from Postgres: this is a data set that lives in a database, and we're going to interact with it like we can interact with any other data source in R, and dplyr is going to automatically generate the SQL to talk to it for us. You can see it's about two hundred thousand rows, which is not particularly big, but it illustrates talking to databases.
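A minimal sketch of connecting to such a table, assuming a local Postgres database called "hflights" with a table of the same name; the connection details are placeholders:

```r
# Connect to the database and reference the table; no data is pulled yet.
hflights_db <- src_postgres(dbname = "hflights", host = "localhost")
flights <- tbl(hflights_db, "hflights")

# Printing shows the source (postgres), the dimensions, and just enough rows
# to get a feel for the data.
flights
```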
So I'm going to do an operation which I think is very natural but which is actually quite hard to express in SQL. I'm going to take this database table and group it by the tail number, which is the unique identifier for each plane, so the unit of analysis is going to be the plane. Then I'm going to mutate it, adding a new column called rank, where I rank the flights made by each plane based on how much they were delayed, and in my final result I'm just going to show three variables: the tail number, the arrival delay, and the rank. This is something that I think is fairly natural to express. I'm going to run it, and it executes instantly because it doesn't do anything; the key here is to be as lazy as possible, so it's not going to fetch any data until I explicitly ask for it. If you look at this object you can see there's a SQL query there; it uses window functions, which are an extension in SQL:2003 that allow you to express some of the things that are very natural in a vectorized programming language like R but not in SQL. We can also do lots of other standard SQL things: we might be worried this is going to take a really long time to execute, so we could ask Postgres to explain its query plan, and if you understood Postgres query plans you could look at this and tell me how long it's going to take. I know a little bit about them, but not that much, so I'm just going to run it. It prints out the first ten rows, which unfortunately are not very interesting, because there's either a bug in the code or some planes without tail numbers and no arrival delays, so we've got a problem with the rank. But the point is that dplyr never does any work until you force it to by asking for the data; it always tries to do the minimum amount possible, working in concert with the database.

Here's another operation: that was a grouped transformation, a grouped mutate; this one is a grouped filter. Again we take the data set and group it by the tail number, so the unit of analysis is the plane. We filter it so that the arrival delay equals the maximum arrival delay; so for each plane, we find the flight that was most delayed, and then I'm just going to show the tail number and the arrival delay. Maybe we should add a couple more; let me look at the data set again just to make sure I remember the variable names: origin, dest, distance, okay. This is an operation that I think is very natural; I can explain it to you as "for every plane, give me the flight that was most delayed", and if you knew SQL you could probably eventually iterate your way to it, maybe with a little bit of googling and Stack Overflow; you'd get there in the end. Here dplyr wraps it up so that you express something that's natural in a data analysis, and it's my problem to convert that into useful SQL, not yours.
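A sketch of both operations against the database-backed table, using the released dplyr API (which may differ in small ways from the pre-release version demoed here); TailNum and ArrDelay are the hflights column names:

```r
# Grouped mutate: rank each plane's flights by arrival delay. Nothing runs
# until the result is printed or collected; explain() shows the generated SQL.
ranked <- flights %.%
  group_by(TailNum) %.%
  mutate(rank = min_rank(desc(ArrDelay))) %.%
  select(TailNum, ArrDelay, rank)
explain(ranked)   # inspect the SQL (uses window functions) and the query plan
ranked            # printing fetches only the first few rows

# Grouped filter: for each plane, keep only its most delayed flight.
worst <- flights %.%
  group_by(TailNum) %.%
  filter(ArrDelay == max(ArrDelay)) %.%
  select(TailNum, ArrDelay)
worst
```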
Again we can explain it, or we can run it and see the first ten rows, which takes a few seconds; generally, for anything you can fit in memory, working locally is going to be orders of magnitude faster than using a database, since the database has to go to disk. You can see all of these flights left George Bush Intercontinental, and the worst delays are all in minutes, so 300 minutes is almost a five-hour delay. By and large you don't see many delays much higher than that, because airlines just cancel highly delayed flights; otherwise their flight delay statistics look very bad. As with any real data where someone cares about the numbers, there's gaming that goes on, and there's some interesting stuff: if you take these tail numbers and match them up with the FAA database of tail numbers, there's a surprisingly high number of hot-air balloons that travel more than 300 miles an hour and serve commercial air traffic in the United States. One of the things about working with data is that you can basically never trust the raw data; you always want to be looking for these strange behaviors. If you want to learn more about dplyr you can Google for it; there's a GitHub page, it's a package on CRAN that you can use, and there are a number of vignettes that go into more detail about the underlying principles.

The other project I wanted to talk about today is called ggvis. This is joint work with Winston Chang; it's tackling the visualization problem, and it's aimed squarely at the cognitive space. ggvis has three goals. First, you should be able to describe visualizations declaratively, like you do in ggplot2. Second, the graphics should be of the web, not just on the web: it produces things that are fundamentally web graphics, HTML, CSS, and JavaScript, so you can make a fantastic visualization, show it to your boss or your advisor on their iPad, and they're amazed at how awesome you are. Third, it's built out of reactive components, reactive in the sense of functional reactive programming, which allows a declarative specification of interactive and dynamic behavior; that's easier to show you, as I will shortly, than it is to describe. As I said, it's aimed squarely at the cognitive domain. My feeling is that, by and large, big data is not a visualization problem: big data is a modeling and summarization problem, which you then visualize. There are only so many pixels on your screen, and you're going to have far more data points than pixels, so when you're thinking about visualizing large data sets you have to think about how to summarize them down to something useful that you can gain insight from with the few million pixels you have on your computer.

I'm going to show you this with a little demo, again using some data from the CRAN downloads. This time I'm going to group it by date, and I'm going to summarise it to count the number of packages downloaded on each day and also the number of distinct IP addresses; in some sense, the number of packages and how many people downloaded those packages. We run that, it takes a couple of seconds to compute, and then I'm going to start doing some plots. The main visualization function is called qvis, for quick visualization. You give it a data set (the data set comes first because it's the most important component of any visualization), then you tell it what variable should go on the x-axis and what variable should go on the y-axis. So we have date on the x-axis and the number of downloads on the y-axis, and because we've specified two variables we get a scatterplot, which isn't very useful here, so next I'm going to override the default choice and use a line plot instead.
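A sketch of that summary and the first plots, written with the released ggvis syntax (which uses the %>% pipe and layer functions rather than the pre-release qvis shown in the talk); the ip_id column name is an assumption:

```r
library(dplyr)
library(ggvis)

# One row per day: number of downloads and number of distinct IP addresses.
downloads <- summarise(group_by(logs, date),
                       n     = n(),               # packages downloaded that day
                       n_ips = n_distinct(ip_id)) # rough proxy for "people"

# Two variables mapped to x and y give a scatterplot...
downloads %>% ggvis(~date, ~n) %>% layer_points()

# ...or override the default and draw a line instead.
downloads %>% ggvis(~date, ~n) %>% layer_lines()
```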
You can drag this out, and if we pop it out into another window you'll see it just opens up in Chrome; it's just some HTML, some CSS, some JavaScript. If you're familiar at all with Vega, which is a declarative plot specification from Jeff Heer's group, we're using Vega as the rendering system: we just send this declarative JSON to the browser and the browser renders it for us. So this is a plot of the number of packages downloaded each day. What's the pattern that jumps out most clearly here? Weekends, right: fewer packages are downloaded on the weekend, which tells you something not terribly surprising, that most people use R for work, not for fun, though there are still a lot of downloads on the weekends. One interesting thing we've seen with some other data is that on the weekend there tends to be a slightly higher proportion of Macs doing the downloading than during the week, which I think is a little interesting.

Instead of showing the number of downloads per day, we could show our approximation of the number of people downloading. This is interesting: when we look at the number of people downloading, the pattern is much more regular. I'm not sure why that's the case. What we could also do is look at the average number of packages downloaded per person, and this just illustrates that when you're doing a visualization you want to be able to do inline transformations: I want to look at the number of packages divided by the number of IP addresses. There's some interesting variability here. Some of these spikes I can explain: I think this one and this one represent new releases of R, because when you download a new version of R you have to download all of your packages again. I have no explanation for these other spikes. These ones are particularly interesting, like the first of September; it may be that they represent big online R courses starting, because typically if you download even just ggplot2 you have to download about 30 other packages that it depends on.

Visualization is a very iterative process. If we go back to this original plot of the number of unique IP addresses per day: once you've seen this weekend pattern, we're probably not that interested in it any more; we might be more interested in the long-term trend in package downloads, so we might want to add a smooth line overlaying it. Here I'm just providing a vector saying I want both lines and a smooth overall fit, and it's going to add a loess curve to the plot. Of course, whenever I add a smooth to a plot I always think "well, that's not the right amount of smoothing", so let me show you a slightly more explicit way of writing that exact plot, and then show you how you can control the amount of smoothing. Here I'm using a slightly fuller specification; qvis is a little bit magical and tries to guess some things for you, while ggvis is a little more explicit. We say the data we want is this downloads data set; for the visual properties, put date on the x-axis and the number of IP addresses on the y-axis; add a layer of lines; and add a smooth layer on top of that. Now we can control the amount of smoothness using the span parameter. I could set the span to 0.9 (span is a number approximately between zero and one), or I could make it 0.8 and run that again. Well, that's not wiggly enough.
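A sketch of that fuller specification in released ggvis syntax; layer_smooths() fits a loess curve whose wiggliness is controlled by span:

```r
downloads %>%
  ggvis(~date, ~n_ips) %>%     # data first, then the x and y properties
  layer_lines() %>%            # the raw daily series
  layer_smooths(span = 0.9)    # loess fit; smaller span = wigglier curve
```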
I could make it 0.7, but parameter-space exploration by changing a number and rerunning code is obviously really painful. So what I'm going to do is define a new object which represents a slider: this is a slider that goes from 0.1 to 1 and starts at 0.5, and then in my plot I'm going to say, map the span of the smoother to the slider. When I do that (let me pop this out so it's a little easier to see), I get a slider, and as I drag it I can control how wiggly that smoother is. Notice that the plot updates in response to the slider, but we haven't written any explicit observers saying "update the plot when the slider changes". I should mention that if we pop this out, you'll notice the URL is a local web server: every time I drag this, it asks R to recompute the smoother. This is designed fundamentally so that for every visualization you have an instance of R running in the back end. If I gave this to the New York Times that would clearly be an awful idea; you'd need hundreds of thousands of R instances running in the background. But for our goal, which is to make exploratory data analysis easier, this is absolutely what we want, because you're only limited by what R can do, and that's not much of a limitation when it comes to statistical modeling. I'm going to close that down. Of course, you can map pretty much any other HTML input component as well; this is a pretty lame example, but you can see I'm changing the color of the line, and the system is intelligent enough that when I change the color it's not going to ask R to recompute the line; it knows the color is independent of the computation.

So ggvis is built on this idea of reactive components: components which don't have a single value fixed in time, but values that change over time; they're fundamentally reactive. The thing that's neat about building on top of this reactive framework is that it doesn't really matter why those values are changing. Rather than binding a parameter to a control, we could bind it to this thing I called waggle, which just animates values between 0.2 and 1. Doing exactly the same thing here, I'm saying the span is just this waggle component, and now I have an animation exploring the space. This is not a terribly interesting example, but the neat thing about the framework is that this animation is just a stream of values coming in, and that's exactly the same as if you had streaming data with new values coming in: the framework is fundamentally reactive, so that as your data changes, so too does your plot. Another source of reactivity is the keyboard: here I've bound the span to the left and right arrow keys, so as I push the keys I can interactively experiment with the smoothness. (In response to an audience question:) That's not currently there, but it's something that's achievable within this framework; we just haven't implemented it yet.

As well as binding simple controls or simple streams of values to parameters of the plot, we can also add slightly higher-level interactions. For example, in this case I'm going to add a tooltip. A tooltip takes a callback function: it's given some data about the point under the mouse, and it returns some HTML, which in this case is just going to be a number.
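A sketch of binding the span to an interactive input and adding a tooltip, again in released ggvis syntax; input_slider() and add_tooltip() are the released helpers, and the tooltip text is an assumption about what the callback might show:

```r
downloads %>%
  ggvis(~date, ~n_ips) %>%
  layer_lines() %>%
  layer_smooths(span = input_slider(0.1, 1, value = 0.5, label = "span"))

downloads %>%
  ggvis(~date, ~n_ips) %>%
  layer_points() %>%
  add_tooltip(function(df) paste0(df$n_ips, " unique IPs"), "hover")
```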
So when we mouse over a point we can see exactly how many IPs downloaded packages on that day. Or, instead of a tooltip that works with a single value, we can make a tooltip that works with a brush, here just showing the total number of IP addresses under the brush, which updates as you move it; although you'll see there are some bugs, and it currently doesn't un-brush the points. So ggvis has so far mostly been a proof of concept, convincing me that a declarative specification of a visualization plus reactivity is, I think, a grammar of interactive graphics: it allows you to specify a very large variety of interactions, and when I say it allows you, I mean that I don't have to pre-specify every possible interaction; you can come up with new interactions that make sense for your data. Similarly, if you want to learn more about ggvis you can Google for it; there's a GitHub page, and there are some vignettes that describe more about the underlying philosophy and about how the reactivity works at a fairly deep level. It is usable, but the demo I just showed you was carefully constructed to make it look as impressive as possible, and there are currently a lot of little things that make it a bit frustrating to use; we're working very hard to make those not such a problem.

In conclusion, I think it's really important, when you're thinking about how to make better tools for data analysis, better tools to understand data, that you think about all of the bottlenecks, both computational and cognitive. You have to make tools that make it easier to think about a problem, easier to express that problem programmatically, and then easier to compute on it. Certainly in my experience the biggest bottleneck is cognitive; you spend way more time thinking about the problem than computing on it, which means, by and large, that research should be focused on making it easier to think about data analysis, not just making it easier to do computationally. You want tools that help you define the problem and make it easier to express your ideas about how to solve it, and I think R is really well suited for this as a host language for developing domain-specific languages. Data analysis in many ways is a combination of a sequence of domain-specific languages: you want a domain-specific language to express how to get your data out of whatever format it's in and into R, you want a domain-specific language to express data manipulation, you want one for visualization, and then you want one to describe your models. The rest of the programming language just helps you do things like for loops: you've got six variables, you're going to explore them all the same way, so you write a for loop rather than copying and pasting your code multiple times. If you'd like a copy of the slides, which I presume I'll also give to Michael to put on the website, you can download them from this URL. Thank you.

Audience: You started out with the idea that programming is the right way to interact with these systems, and I want to push on that a little, because I think there's a bit of a tension between that claim and this conclusion about domain-specific languages, in the sense that what you're doing is creating small-scale ways of interacting with, for example, visualization in ggvis that don't require much programming work, or verbs in dplyr that are much closer to natural-language
semantics. In some sense these seem like perfect candidates for saying, okay, now let's put a natural-language interface on top of that, or let's put a click-and-drag interface on this, so you can add your tooltip just by clicking. Why not do that?

I think there are two reasons not to do that. The first is that these tools are fantastic for the most common 80% of tasks; the other 20% of tasks don't require five tools to solve them, they require like 500 different tools, and if you're locked into a box where you can only express the most common 80% of operations, there are always lots of things you can't do because you need to handle some special case. But also, I think the advantage of embedding everything in a programming language is that you're never limited by what I think should be possible; you're not locked into an environment where you can only do what Hadley thinks is the right thing to do. You can always break out and do what you think is the right thing, or what someone else does, and you can integrate this with other people's code. I also think that even if you were going to create a GUI or some kind of nice interface around this, it can make sense to develop the language first. To me, developing the language is sort of the easy part; developing a toolkit to express it is another hard problem, and if you later discover you've missed a really important piece, you've got to totally rethink your toolkit. And even if you're arguing for natural language, you still want text in the end, whether that's an embedded DSL or an external DSL; it doesn't really matter, but having it embedded in a programming language is so important for breaking out of that box when you need to.

Audience: I was just wondering, longer term, where do shiny, ggplot2, and ggvis fit; how are they going to fit into each other eventually?

Sorry, I should have mentioned that. ggvis uses shiny under the hood: all of the reactivity comes from shiny, and whenever you create an interactive plot, that's actually a shiny web application running behind the scenes. In terms of ggvis and ggplot2, in some ways they're complementary: ggplot2 is aimed squarely at print graphics, ggvis is aimed at web graphics. If you want to make a PDF or a printout or a book, having something interactive doesn't matter; you just want something you can easily get into a PDF and do what you want with. But there's only one of me, and I can't work simultaneously on ggplot2 and ggvis and make progress on both, so lately, with ggvis, ggplot2 has been rather fallow. I think what we're going to do in the near future is basically say that ggplot2 is now in maintenance mode: we'll fix major bugs, we'll accept pull requests, but we're not going to do any active development. One thing that's really challenging with ggplot2 now is that there's such a large community of people using it that there are well-known bugs that people have come up with workarounds for, and if I were to fix the original bug it would break all of the workarounds, so the net effect of fixing a bug is actually to break more code than not fixing it. That ties in a little bit with the tooling around making reproducible code in R that
keeps running, which we're working on in other ways as well.

Audience: On this notion of what you want to do with a PDF versus web graphics: you could imagine that the future of science, in some sense, should not be PDFs, right? We should be creating papers that have these graphics that I can interact with. You can imagine creating (I know you shuddered at this notion) something like a managed, Heroku-style service, an "RHeroku" or whatever, where I can specify these things and then part of my research result, or whatever I'm doing for my company, gets hosted by the site, and anyone else who looks at it can actually play with it, get down to the actual code, fork it, and work on it themselves. What's stopping that?

I would say PDFs are generally the past, and interactive web stuff is the future; I don't think you'd argue with me that interactive HTML is what that's going to be. It's not going to be Wolfram's computable document format, for example. The other thing I've been working on, which is sort of similar to this (let me see if I can connect to the internet quickly and show you), is that I've been writing a book in the open. I'm writing it in R Markdown, which gets rendered into Markdown, which gets rendered into HTML, and the book is just a GitHub repo. It's all published on the website, and it has an "edit this page" button, much like a wiki, but instead of editing the page directly, it edits it on GitHub, and when you click save it submits a pull request to me, so I can review the pull request and integrate it. That's firstly a really fantastic collaborative writing environment; it used to be a wiki, and since I switched to this system where I review every change, people actually submit more changes, because they know I'm going to look at them. The other piece of technology in that which is really cool is Travis, a continuous integration system: any time a change gets pushed to GitHub, all of the code gets rerun, all of the HTML gets regenerated and pushed to the book website. We're certainly imagining a future where you're not just pushing static graphics but somehow pushing the live interaction there as well, and being able to write books that have interactive components which people can play with is really exciting.

Audience: I've played around a lot with IPython, and I'm wondering if you can contrast, say, RStudio with IPython; my understanding is IPython has a pluggable back end, so you can imagine putting R behind it.

So IPython is this notebook interface where you have cells of input and output, and you can intermingle rich text and your code. You can put an R back end into IPython, but by and large the tool that the R community uses is R Markdown, which let me just quickly pull up. R Markdown is a similar idea: you can intermingle rich text, written in Markdown, with R code, and when you execute the document the R code is run and the code, text, and results are interleaved. The main difference is that the IPython philosophy is that you have one file which contains the code and the results of running that code, whereas in the R Markdown philosophy you have one file that's the input, which contains the code, and then one file which is the output. The advantage of the IPython approach is that you've got one file to email, but for someone to read it they need a special viewer, because the document is just a JSON data structure. The advantage of the R Markdown approach is that you can just email someone an HTML file or a PDF, but then they can't go and run the code in as easy a way as with IPython. I think R Markdown plus an IDE is effectively equivalent in expressive power to a notebook; it gives you very similar tools. It's a really interesting space to explore, and with the different ways of intermingling documentation, or text, and code to explain things, we're just starting to see what the different approaches are and which are going to win out.
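A minimal sketch of what such an R Markdown source file looks like (the chunk contents are hypothetical); rendering it runs the R chunks and interleaves the prose, code, and results in the HTML output:

````
A day-of-week effect
====================

Downloads drop sharply on weekends.

```{r}
library(dplyr)
daily <- summarise(group_by(logs, date), n = n())
plot(daily$date, daily$n, type = "l")
```
````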
Audience: For that slider that you could set the level of smoothing with: what's the scope for this kind of interaction? Could you have, say, a drop-down bound to a global variable, or multiple of these, so that a sequence of drop-downs each progressively filters the data?

What I showed you was the very concise form, using components that we've built, like the drop-down or the slider. You can also write your own components using any HTML: if you wanted a date picker, you could write an HTML date picker (you're going to have to know some HTML and JavaScript to do that) and hook it up to an R object that represents it. So if you know a little bit about HTML you can add these kinds of structures yourself; some of them you can get by writing a shiny app and using our existing pieces; otherwise you are a little bit constrained by the fundamental input elements we've already written.

Audience: We discussed before that debugging is very hard in this kind of declarative setting. Is it part of your vision, if you're doing this multiple piping of data together, that you could tell the user immediately if any of these steps is failing or something?

Yeah, I think there's a lot of scope for debugging tools. Another problem you can imagine: now you've got a graphic that's dynamically changing over time, and at some point it breaks; that is going to be an incredibly frustrating debugging experience, and I think there's still a lot of thinking to do about what kind of tools you could come up with. One thing that I can imagine very easily is some of the Bret Victor ideas: maybe as you're typing this, when I hit enter you could imagine getting a little preview that shows the first few rows, so I can check immediately whether I've done the right thing or not. That feedback loop, finding out as quickly as possible whether you've done the right thing, is really important, and since I'm working with the people writing the IDE, there's a lot of scope to add that kind of interaction.

Audience: Related to provenance, where a result came from: it seems like in this model you create a graphic and then publish it to the web,
and someone goes and plays with it, but the graphic is now disconnected from the code that made it; it doesn't usually have anything where the graphic knows how it got created.

Honestly, this is not something we've thought much about. I think generally our model is that if you did want people to understand how something was created, you would also publish the source code along with it. One thing we have thought about is the ability to publish an interactive graphic by pre-rendering a number of the computations, so it doesn't have to hit R every time; it just stores everything in the JSON, and there, at least, you'd have the data stored in the document, even if you wouldn't have all of the calculations. I think it's similar to reproducibility in general: there's only ever so much you can reproduce. You can reproduce the actions that someone took, but you can't reproduce the thought process which led to those actions, which is what you'd need in order to redo the data analysis with a new data set. So there are always limitations, and we'll keep pushing it as far as we can. That's more the notebook style, which is not the way we've been thinking currently, but it's certainly a valid way to do it, and I imagine it wouldn't be too hard to hook up IPython so you could have ggvis graphics working interactively in the notebook as well.

Audience: You made this claim, which I agree with, that you want to take advantage of the things that people are good at: the proximity of the language's description to how you think about your problem. But if that's the case, and you look at what people are learning first, what platforms people start scripting in, you're not looking at R, to be honest; you're looking at HTML and JavaScript, which is where many, many people start. I wonder what this would look like if you flipped it around and said: let's take these same verbs and make JavaScript massively data-parallel in the same ways, by embedding them in the script; and anything someone wants to do on the visualization side, there's already a very rich visualization ecosystem in JavaScript.

I think your question is essentially: why use R as a home for data analysis, as opposed to JavaScript? I agree, and I think there are two main contributions of this work. The first is the intellectual framework, that these are the five verbs, and if you wanted to go and implement a JavaScript version you could use those ideas; and if you did that, that's success for me as a researcher, because I found something that's useful. But for me to go and do that, I'd have to go and learn JavaScript, so it's all about relative costs. I think JavaScript could make a more approachable data analysis platform for the many people who have learned JavaScript already. On the other hand, the thing JavaScript is weakest at is the modeling side, so if you're going to do this in JavaScript, well, you can do the visualization and the transformation, but now you're also going to have to implement a whole lot of statistical models. And that's basically how it is in every language:
every language provides some of the pieces, and you have to do a whole lot of work to provide the final one. I've done a lot of work to date building these pieces in R, and it seems silly to throw that away just because lots of people know JavaScript. JavaScript is interesting, though, because I think it's very similar to R in many ways, both as a programming language (it's lexically scoped, and there's a lot of interest in functional programming now) and as a community, where most JavaScript code is crap and most R code is really bad too. Douglas Crockford's "JavaScript: The Good Parts" is sort of like "R: the good parts" as well.
Info
Channel: stanfordonline
Views: 74,028
Rating: 4.8899674 out of 5
Keywords: stanford, stanford university, stanfordonline, hadley wickham, cs547, HCI, Human-computer Interaction (Field Of Study), seminar
Id: wki0BqlztCo
Length: 61min 9sec (3669 seconds)
Published: Tue Feb 18 2014