Hadley Wickham – “You can't do data science in a GUI”

Captions
Thank you for coming to the meeting today on data science and GUIs with Hadley Wickham, chief data scientist at RStudio, as well as a member of the R Foundation and an adjunct professor at Stanford and at the University of Auckland. He builds both computational and cognitive tools to make data science easier, faster, and more fun. His work includes various packages as well as principled software development. He also enjoys writing, educating, and speaking to promote the use of R for data science, and you can learn more from his website. So please help me give him a very warm welcome. [Applause]

Good evening. Tonight I'm giving you a deliberately provocatively titled talk to hopefully entice you in here, but hopefully there won't be too much that you disagree with. I'm going to talk about my beliefs about data science and some of the tools that I've worked on. But before we can do that, I think it's really useful to talk about what data science is. There are now approximately a billion different Venn diagrams showing you what each author says data science is, but to me the definition is pretty simple: whenever you're struggling with data, whenever you're trying to understand what's going on with data, whenever you're trying to turn raw data into insight, understanding, or discoveries, I think that's data science. I'm not that interested in the philosophy of data science. What I am more interested in is the tools of data science: what do you need in order to actually do data science?

Before I go on, I have a few questions for you all, just to get a little sense of what sort of people you are. How many of you would identify as data scientists right now? How many of you would say you are programmers? How many of you have used Python before? So I'm going to talk about the tools of data science. We're going to talk a little bit about why. I think you've
mostly all used R before, so in some sense I think you're maybe already convinced, at least a little bit. I'll talk a little bit about some of the reasons that I think using a programming language to do data science is so important, so much more important than using a GUI, and then I'll talk a little bit about my favorite programming language, R, and some of the tools that I have worked on, these domain-specific languages which help you express the ideas of data science more easily.

To me there are a few main tools of data science. The first, quite obviously: before you can do anything, you have to get the data out of whatever crazy format it comes in and into your data science environment. Sometimes this is reading a CSV, sometimes it's loading from a database, but this might be scraping websites, calling APIs, whatever. You're always going to have to get the data in. I'm not going to talk too much about import. In my experience, data import is, like 80% of the time, just incredibly boring, and the other 20% of the time it's like anguished screaming, and neither of those makes for a fantastic talk. So I'll say a little bit about that, but I'm not going to focus too much on it.

One thing that I think is really important once you have imported your data is to do what I call tidying your data. This is not cleaning your data, where you care about whether all the values are accurate. Tidying your data is just about getting it into a structure that is going to enable the rest of your analysis. Once you've done that, your job as a data scientist is to understand what is going on in the data, and I think there are three main tools that help you do that. You're going to need to do some fairly mechanical transformations: you might summarize multiple values down to a single summary like a mean or a count, or you might create a new variable as a function of existing variables. These transformations are really, really useful, but they're not super exciting. So to me there really are two main engines
that help you understand what's going on in your data: visualization and modeling. Visualization is a fundamentally human activity. This is figuring out how you can take the best advantage of your innate skills at looking at things and understanding what's going on. So visualization is fundamentally a human activity: you look at a plot and it gives you some insight into what's going on. Visualizations are particularly great for two reasons. First of all, you can go to a visualization with a pretty vague question and use the visualization to help you refine it. Often a big part of the challenge of data science is taking that vague, ill-formed question in your head and trying to make it sufficiently precise that you can answer it quantitatively. The other great thing about visualizations is that you can see something you didn't expect; visualizations can surprise you. But the downside of visualization is that, because there is a human in the loop, it doesn't scale particularly well. As you get more observations, as you get more variables, you simply cannot look at every possible thing.

So to me the complementary tool to visualization is modeling. I think of modeling very broadly, but basically, whenever you can make your question sufficiently precise that you can answer it with an algorithm or some summary statistic, I think of that as a model. Models are great because they are fundamentally computational, and even if they're slow, it's way easier to throw more computers at the problem than it is to throw more brains at it. But the problem with models is that every model makes assumptions, and a model by its very nature cannot question those assumptions. So at some fundamental level, a model cannot surprise you. This is why I think visualization and modeling are such great complementary tools: visualizations can surprise you, but they can't scale; models scale much better, but they can't fundamentally surprise you. So in any data analysis, in any data science project, you're going to
loop through these things again and again. You're going to use visualizations to look at maybe a small subset of your data, use that to generate hypotheses, make those concrete, quantitative, and precise, and then use models to scale them up to much larger datasets. You're going to walk through this loop again and again and again until you decide that you're done. Hopefully you haven't just tortured your data until it told you something; hopefully you've found some real signal in the data. And then you're going to do one of two things. Either you have to communicate those results to another human being, whether that's your boss or your supervisor or a decision-maker inside your organization, or you're going to want to communicate those results to another computer, and you want to think about how to deploy this in some larger system.

I think this is a little bit of the division, in my mind, between classical statistics and data science. Classically, statisticians have been more concerned with how we get humans to update their beliefs, and a big part of data science these days is how we put models into these computational pipelines, where the model or the visualization is no longer the end point but the beginning of something else. A lot of my work has been developing tools to make each of the parts of this process easier in R; that's the so-called tidyverse, which I'll come back to.

So if you're going to do data science, why should you program? To frame that, I think it's useful to think a little bit about what you have to do when you're solving a data science problem. Obviously you have to use a computer in some way, right? Maybe thirty years ago you could do it on paper; that's just no longer possible today. There has to be a computer involved somewhere. So first you have to think about what you're going to do; the computer isn't going to tell you that. Then you need to describe what you want precisely, in such a way that the computer can
understand it, and then the computer has to go and actually do it. I think there are two extremes to this. One extreme is the GUI type of approach, where you just point and click. Everything is laid out in front of you, all of the options are laid out in front of you, which is great because you can see everything you can do, but it's also terrible because you're constrained: you can only do what the inventors of SAS or Excel, or whatever it is, wanted you to do. A programming language is kind of the opposite. All you get is this blinking cursor. It's telling you that you can do literally anything, but it's not going to give you much help.

To me, the important thing about programming languages like R and Python is the language part: they give you a language to express your ideas. They give you very few constraints, which makes it tough when you're learning, or if you're only doing data science occasionally, but the payoff for investing in a programming language is that you get this whole new language in which you can express your ideas.

Now, the other thing that I think is really great about programming languages is that the way you interact with a programming language is code, and code is just text. There are two amazingly powerful workflows that text gives you. The first is copy and paste. [Applause] Now, you laugh, right? Copy and paste is not a strategy that you want to carry you throughout your entire career, but it is an incredibly powerful strategy for repeating yourself: when you need to do something multiple times, you copy and paste. It's not the best solution in the world, but it's going to get you started; it's a great scaffold. The other great workflow text gets you is Stack Overflow. Because code is just text, you can take your error message, stick it into Google, and Google will lead you to Stack Overflow, which will solve your problem. Again, this is a little humorous, but I think this is important, right? Because code is
just text, this means you can put it in an email, you can tweet it, you can google it. There's a huge amount of power associated with expressing your ideas in text, because there are so many tools for working with it. There's also a bunch of great tools around the provenance of text. Text is reproducible: once you've expressed your ideas in code, you can rerun that code with new data later on and get updated results. You can use lots of great techniques for seeing the differences between two pieces of text. You can read it: it's not some crazy opaque binary format. You can look at it, and even if you don't know the programming language, even if you don't know exactly what that code is doing, you can parse a lot of it and understand it. And that also allows us to post code in all sorts of places, to make it open and share it.

So I wanted to give you just a couple of little examples of that from GitHub. The first one is a project from British Columbia that produces population estimates. They have done all of this work in the open, in a GitHub repo. One of the neat things about putting your data analysis into a system like Git, using a tool like GitHub, is that not only do you get to see where the analysis is now, you get to see how it has evolved over time. You can see this series of commits showing what has gone on over the course of the analysis, and you can drill down into one of those commits. This is probably hard to read, but you can see exactly what has changed. In this case we're just changing the colors of a bar chart; it's not super exciting, but this ability to see what has changed is incredibly, incredibly powerful. You can see not only where the data analysis is now, but how it has evolved.

Related to this are tools like R Markdown, which allow you to mingle R code and text to produce a single output. There are plenty of similar tools which exist in other languages. I want to show you a
little example using R Markdown. This is from my colleague Jenny Bryan, who's doing a data analysis of the Calaveras Jumping Frog Jubilee. This is a county fair, I think, where they have a frog jump. The purpose of the analysis is mostly to be an example to teach with, but one of the things I think is neat about this is that in the README, which GitHub exposes very nicely, Jenny has shown you some of the summary statistics of the data. So this is the README: when you go to this GitHub site, you get this nicely formatted HTML web page and you can see what's going on. How was that generated? Well, this comes from an R Markdown document. The R Markdown document has text describing what we know about each frog at this point, and we've got some R code in there as well. So we can run that R code, mingle it with the text, and get a beautifully polished HTML document like this. And we can go back even further: how do we get this nice formatting, how do we get these headings? Well, R Markdown uses a plain text format called Markdown. You just write text with a few pretty simple conventions that give you nice formatting.

This whole R Markdown workflow, I think, is really important because you're no longer copying and pasting graphics from one thing and sticking them into your Word document, and then your data changes, and you've got to carefully rerun your analysis and then carefully copy and paste exactly the right images into exactly the right place. Every time you have to do something by hand, every time you have to copy and paste by hand, you've got that possibility of error, and ideally you really want to avoid that. On the flip side, one of the things that worries me is this dialog box. What's happened is I've selected a column in Excel and I'm going to sort by that column, and Excel gives me this option: do I want to expand the selection, or
do I want to continue with the current selection? What these options really should say is: do you want to do the right thing, or do you want to randomize your data? It's mildly horrifying that Excel will randomize your data, basically, if you click the wrong thing. What is truly horrifying is that there's no provenance. If you do this accidentally, there's no way to see that you've done it, and there's no way to roll back, and I think that's incredibly, incredibly dangerous. There are plenty of examples in real life, in data analyses and scientific papers, where people have made these errors, including well-documented cases around the financial crisis that were traced back to Excel spreadsheets. It's just very, very difficult to see what's happened, to figure out how you got there, and that, I think, is crucial.

So hopefully I've convinced you that you should be programming. What language should you use? Well, R. I'm not going to sugarcoat it: R is a quirky language. Some of the quirks, I think, are really powerful, and some are just literally utterly bizarre, but some of the quirks I think are really well suited to working with data, to doing data science. So I'll talk about a few of my favorite features of R. The first one is that R is a vector language: everything you work with in R is a vector. R does not have scalars; a single number is just a vector of length one. This has some serious performance implications, but the neat thing is you can express ideas very simply. Here I'm creating a vector by randomly sampling numbers between 1 and 100. I can then ask which of those numbers are greater than 50, and I get a logical vector back. There's no need to write a for loop; I'm just going to apply this operation to the complete vector. And then I can say, well, I'm going to sum up this vector. When you sum a logical vector, the FALSEs become zeros,
and the TRUEs become ones, so the sum of a logical vector is just the number of TRUEs. This very simply and elegantly tells me how many of the numbers in the vector that I produced are over 50.

Another thing that's really, really important when you're doing data science or statistics is the idea that some of the values in your data are going to be missing; you're not going to know what they are. That is the reality of working with real-world data: some of the values are simply absent, and rather than just omitting them, it's often actually important to track that missingness. So R has missing values built in, with this NA value. Missing values work very similarly to NULLs in SQL: you get this kind of ternary logic. Any time you have a missing value, that missing value is going to propagate through the operation. So here I've got a vector with the numbers between one and five, randomly rearranged, and a missing value, and I want to ask of each of those values: is it greater than two? Is a missing value greater than two? You don't know, so the result is missing.

Now, the thing that's a little bit weird about missing values: you might say, well, tell me which values of my vector are equal to a missing value. If you do that, the missing value is propagated and you just get missing values back. This is a little bit confusing at first: why does a missing value not equal a missing value? I think you can better understand that by thinking about a concrete example. Here we've got a variable that we use to store the age of John. How old is he? We don't know. And we've got another variable that we're going to use to store the age of Mary. How old is Mary? We don't know. Are John and Mary the same age? We don't know, right? It's not like there's just one missing value, where if you knew that value everything would be fine. It's like there's an infinite number of missing values, and there's no reason to believe a priori that any two missing values are the same. And so instead
you have to use is.na() to ask whether a value is missing.

Another thing that's quite unusual about R is that one of the main data structures in the language itself is this relational table, what R calls a data frame or tibble. It's basically this table structure with variables that have names, and it has this rectangular structure. The other thing that I think is great about R, though I cannot really articulate why, is that it is basically a functional programming language. R is not primarily an object-oriented programming language: you do not generally solve problems in R by creating new classes and instantiating objects of those classes. Instead you tend to solve problems by composing functions in various ways. One of the things that makes R a functional programming language is that functions in R can take functions as arguments, and they can also return functions as well. And this seems to fit in some way, I think, because generally when you're doing data science there aren't a bunch of different data structures; there is just one data structure, this data frame or relational table, that you use again and again. The data structure stays the same while you do different things to it, and functional programming seems to be a good fit for that.

The other thing that's quite different again is that you can do a little bit of metaprogramming. So here's an example. This is a little snippet of code that I remember first encountering as an undergraduate. I was majoring in computer science, I had learned Java and PHP programming, and then I came to R and saw this code, which is just plotting x and sin(x). When you look at the plot, it somehow magically knows what to put on the axis labels: the inputs to that function. This seems like something natural to me now, and if you've used R a lot before you may never have thought about this, but I still vividly remember encountering this, and it
just blew my mind; it worked in a way that was very different from any other programming language I had used. The thing that's neat about this is that R gives you an incredible amount of power to look at the structure of the code that you are working with. In every programming language you can think about code as a tree-like structure, often called the abstract syntax tree. In R, like many languages with a Lisp heritage, you can actually look inside that tree and manipulate it. Here, this is a function I've written; it takes an R expression and draws a nicely formatted console visualization of what that tree looks like. So in R we can look at code, we can introspect it, and we can modify it in various ways.

The thing that I think is so neat about this is that it gives us this incredible power to create little domain-specific languages that are tailored to certain parts of the data science process. This is where a lot of my work has been; my main claim to fame is ggplot2, a visualization package that provides a domain-specific language for visualization, one that allows you to express the relationship between variables in your data and aesthetic properties in a plot. We're going to talk a little bit about that and a little bit about the so-called tidyverse, which is a collection of packages that think about data in a similar way and provide a similar set of conventions. The theme that underlies all of these packages is this incredibly powerful way of solving complex problems, which is breaking them down into small pieces and solving each piece independently. I think this is a really powerful strategy for two reasons. First of all, it allows you to take little steps, and after each little step you can check: have I ended up in a good place? If you work in much bigger pieces, you might be wandering around; say you're fitting some crazy deep learning model and it
takes a week of computing, and a week later it's finished and you're like, huh, I've ended up in kind of a bad place; I've just wasted the last week. But if you can break a problem down into small pieces, you can get rapid feedback, and that's incredibly powerful. The other thing that's great is that because you've got these little pieces that you can recombine in different ways, you can often take pieces from one project that you're already familiar with, where you understand all those pieces, and recombine them in a new way, maybe with a few new pieces, to solve a new problem. So it's much easier to generalize a solution from one problem to the next.

So I'm going to show you a little example of a data analysis using some tidyverse packages. This is not my data analysis; like an increasing proportion of my slides these days, it comes from someone else, in this case Kyle Walker. This is using Kyle's package called tidycensus. This is one of the neat things about R: because lots of people who care about data use R, there are a bunch of packages for getting data from various data sources, and, as you might guess from the name, this one gets data from the US Census. I'm going to run through the code. I'm going to get this dataset, and this is the type of data import that's just boring, because it just works. And then I'm going to look at the data.

Here there are about five hundred and twenty rows altogether. Each row represents a statistical area, which could be a micropolitan statistical area or a metropolitan statistical area. We have the name of the variable, which we've extracted from the Census. I'm not going to tell you what that is; I'm going to give you a challenge to figure out what it is. Then we have the estimate and the margin of error. What's a little awkward with this is that these names are kind of ugly, right? Because of the way the Census works, they have aggregated big cities together in
a region, so we have these long metropolitan area names; it would be nice to have the pieces separate, and so on. So even when you get pretty nice data, you almost always have to do some data transformation, and I'm going to do a little bit of that. I'm going to say, well, I only want to look at the big metropolitan areas, the areas that have more than three million people in them. That variable is constant, so I'm going to get rid of it. I'm going to remove this "Metro Area" suffix, because it's on every single row and doesn't teach me anything. Then I'm going to split the name into city and state, and extract just the first city and the first state. The other thing that's a little odd, and you can't quite see it here, is that the margin of error is sometimes recorded as negative 555555, which is a little suspicious, and in fact that's actually how this dataset encodes missing values. So we're going to turn those weird values into actual missing values. When I do that, and I don't expect you to read this code as we go through, hopefully you can see some of these major verbs: we're filtering, we're selecting, we're mutating, we're creating new variables, and we're expressing ourselves here with code. If we do that, we get this nice format. Now we've got the city and the state, and I can combine these together to give me a simple city and state combination rather than the full, admittedly more accurate, metropolitan statistical area name.

Then I'm going to plot that. It's a really, really good idea to always look at your data. I'm going to make a visualization. My rule of thumb is that the first visualization you look at will always reveal a data quality error, and if it does not reveal a data quality error, that just means you haven't found it yet. So we'll plot this, and this is what the plot looks like. Here's the variable, the estimated value from the census, on the x-axis, with these cities on the y-axis. See if you can guess what this variable is. It might be a
little bit hard to read, but the x-axis scale goes from 0 to 50. At the top we've got New York, San Francisco, Washington DC, Boston, and Chicago, and at the bottom we've got Indianapolis, Dallas, Tampa, and Detroit. Any guesses? [Audience guesses, inaudible] It's not population. [Audience guesses, inaudible] That's pretty, pretty close. This is the percentage of people who take public transportation to work, so New York is basically out there on its own.

I bring this up to illustrate these two plots. I think it's useful to distinguish exploratory graphics from expository graphics. An exploratory graphic is for you, the analyst, and the goal of the exploratory graphic is just to get the graphic as quickly as possible, get the insight, and get on to the next visualization. You don't need a lot of scaffolding, because you already know this data. But when you go to share this data with other people, it really helps to start thinking about good axis labels, about having titles, subtitles, and sources that explain what's going on. Often, going from this to this is not an easy task. Now you have to get out of your own head and think about people who have never seen this data before: what do they need to know? That communication part is so, so important in data science and so difficult to do, and I have no real advice about it, but I wanted to talk a little bit about it.

The thing that makes this easier, I think, when you look at this code: one of the things I think about a lot is not just the individual components of the problem, or the individual components of the programming language, but how you join them all together. I really like this quote from Hal Abelson, that it's not just the individual components, it's the glue that sticks them together. So I spend a lot of time thinking about the glue: once you've learned one part of the tidyverse, one
package, how can we make sure that makes moving to the next package a little bit easier? I really like this idea of the pit of success. You're probably familiar with the pinnacle of success, right? But to get to a pinnacle of success, or a peak of success, you have to strive; you have to try really hard. The goal of the pit of success is that you just kind of fall into it. Now, this is a little grandiose; where we are at the moment is more like a pothole of success. But this is very much my goal: I want to help you as data scientists get to the point where your fingers are just typing R code without you even consciously thinking about it, so you can spend your precious cognitive resources on the big questions of the data, not on figuring out how to get the computer to understand you.

But all that said, there are also things that are really hard to express with code, and I think this is a really simple example. This is a simple scatterplot. You look at the scatterplot and you might say, well, these points here are a little bit different from the others. It's really easy to say "these points here", but to express those points in code you have to figure out, what's the slope of this line? It's really painful. A thing that's very natural to express directly on the data is very tiresome to express in a program. Another thing that I do a lot that's kind of painful: sometimes you get a dataset and the variable names are totally wacky, and you have to rename them all. Doing that in code just feels like a chore. You can do it, but you have to type all the old names and all the new names, and it feels like something where just typing into a GUI would be so much easier.

One thing that's really exciting to me is that now in RStudio you can create these add-ins, basically web pages that you can embed inside
RStudio, the IDE. You can use them to interact with data in a way that feels more natural, but then you can generate the R code as output. So I think it's fine; it's not all or nothing. I think most of the code you create, you will create by typing with your fingers, but there are things that are just difficult to express that way, and if we can create user interfaces that allow you to express those operations more naturally, that's great, particularly if you can turn those actions into code, because then you get all of the advantages of code we talked about earlier: you get all this provenance, you get this reproducibility, and so on.

And I think there's a whole bunch of other interesting questions. When we start working with these pipelines in code, maybe there's a typical pattern, where a filter very often predicts what comes next. Can we prompt you: 90% of the time, people like you use this other function next? What can we do with autocomplete? I think autocomplete is a really interesting space, where the primary mode of interaction is typing but autocomplete makes your life easier by giving you a bunch of options. How far can we take that? How much can we help? How can we guide you towards useful things while still giving you the freedom to express whatever you want?

There's also a bunch of really cool work about learning things from examples. I think learning regular expressions from examples is particularly neat work. With regular expressions, it takes a while to learn the language, and it can be really difficult to check that you've correctly written the right expression. Can we just give the computer a bunch of positive and negative examples and have it learn the regular expression? And this one is kind of crazy, and I don't know how generally useful it is, but it's cool: with the Mathpix service you can take a photo of a mathematical equation, say one that someone has tattooed on their arm, and
thing will give you the LaTeX for it, so you can express it that way. So I think this is kind of what I believe: code is the primary artifact from a data analysis, but I think we've still got a lot to learn about how we can generate that code other than by typing. So to sum up, I think there are just so many advantages to using code over a GUI, because it's just text. So you unlock copying and pasting, and googling, and Stack Overflow questions, all those simple but powerful workflows. But you also get this reproducibility, you get provenance, and you get to track what's going on. I think R is a fantastic programming language for data science. Yes, it's quirky, but a lot of the quirks turn out to be pretty good ideas that help you solve data science challenges. I should remind you, when I say R is great, it does not imply anything about any other language. When I say R is great, please do not take that to mean that Python sucks. I think Python is great too. But I particularly really like this idea of domain-specific languages: figuring out how to carve out little pieces of the whole data science problem and provide little tools, little miniature languages, that give you some degree of flexibility and hopefully can guide you into this pit of success, where success is a little bit easier, the path of least resistance. And then finally, while I gave you this very provocatively titled talk, "You can't do data science in a GUI", what I believe most passionately is that code should be the primary artifact. I think most of the time you should be creating code by typing, but there's a lot to discover about other ways of producing it. [Applause] [Music] [Audience question, mostly inaudible] ...and it's challenging for them to decide. Yeah, I think there are two challenges. So first of all, there's this problem, right: you haven't used R in two weeks and you open it up and the cursor is just, like [unclear]. And I think that
if you are only doing data analysis occasionally, using a programming language is really, really hard, because you lose the language, you lose the context, whereas with a GUI the context is encoded: you can see what your options are, and it's very obvious what to do. So I think some of that is something we just have to accept: there's a very wide group of people for whom a programming language is not going to be the right fit. But there are also a lot of people who are doing data science every day who, I think, tend to be very resistant to learning programming. And I think a lot of that is because programming, in people's heads, is this really big, complicated thing. I think if you can trick them into using R without talking about it being a programming language, that can help. I think the key is to find a few things that they can do much more easily, that they can express more elegantly, or tedious things that they hate doing and have to do again and again and again. Like they're doing the monthly TPS report, which involves copying and pasting 15 different spreadsheets together, which is an Excel workflow typical of thousands of people. If you can show them how to automate that in R, they will love you, and they will think R is awesome. [Audience question, beginning inaudible] ...when I get a mistake or an error that I have no idea about, and I google it, and there's a Stack Overflow answer, but the errors there have nothing to do with, are not recognizably related to, my situation, even though the message seems to be the same. There's nothing like a log file that I can go to where it spells out, well, this is what's happening. And warnings don't always help, and traceback and all that. How do you, how can
you get past that? Because for me, I'm a magnet for errors like that [unclear]. [Answer] Yeah, I think that is a big challenge. One thing that seems so obvious now, but that I only realized recently, is that one of the things an experienced programmer can do that a newcomer cannot is look at error messages that look completely different. You see someone else's error message, and the error is the same but the situation is different, and figuring out how that relates to you, that is a tricky skill. Concretely, I think in some sense you need to find a friend, you need to find a community of people that can struggle along with you. They won't always be able to help you, but sometimes all you need is a fresh set of eyes. I think one of the challenges with error messages in R is that R is a very permissive language. You do something weird and R is like, okay. You do something weird to the result of that and R is like, okay. And then three steps later R is like, no, I'm not doing that. And the problem is not actually where the error message is; the source of the error is actually earlier. The other thing I still have to do a lot is just take very simple steps. The thing that's easy to do is a whole bunch of steps all at once, and obviously I still do this all the time. I'm like, you know, I've been programming in R for 15 years, I'm going to write this function in one go, and then it doesn't work and I've got no idea why. And I have to just stop, do one step, and then look at the result. You know, one of the things that I've tried to do, in the tidyverse in particular,
is use tibbles instead of data frames. One of the things tibbles do is tell you the type of every column, because if you think you're working with strings and you've actually got a factor, you get these weird error messages. So finding tools that let you figure out "what the heck do I actually have? Is that what I expect?" is really useful. So I think tibbles will help. Also, in the latest version of RStudio, the View pane [unclear]: if you're working with a list, you can now drill down into it from the environment pane. I wish I had a good answer for you, but some of it is just that it is challenging, and it's something that we've been thinking about. [Applause] One thing that you can often do: when you install packages and there's no binary version available, particularly if you're using Ubuntu, you can often say type equals "source" [unclear]. [Audience question, inaudible] So I've written this book, R for Data Science. This is my attempt to get you to the good parts of data science as quickly as possible. The goal is that you could come and read this book with no R or programming experience. I have not achieved that goal yet, but this is my attempt. There are a few other people I would look at: [name unclear], who also has experience teaching high school students. I think the key thing is, how do you shorten the path? How do you figure out the shortest way to do something that they think is awesome? If you can do that, you build everything around it. Figure out the shortest possible path to get from knowing nothing to doing something awesome. Don't worry about the
theory, don't worry about anything, just worry about how you get them to do something cool. And then once they can do that, they've got the motivation to struggle through the rest [unclear]. The key is finding that motivation. One person who's done some cool stuff is Thomas Pedersen. [unclear] So he's done some cool stuff, finding cool things that hook people, like exploring deep learning, how can you visualize it, [unclear], how you make these animations. [unclear] [Audience question] With your university work, what kind of trends are you seeing in the students coming through now? [Answer] My university work is structured in such a way that I do [unclear], so I'm not going to speak too loudly to that. To me the trend that is the craziest is that people are coming out of high school and they are excited about statistics and data science. Ten years ago we'd have, like, four statistics majors who came out of high school really excited about statistics, and now you've got, like, a hundred and fifty people. I think that's the biggest difference. I didn't see a big change in the skills coming in [unclear]. I think people are more familiar with computational things like iPads, but [unclear]. And I think one thing is, ten years ago you could start programming, you could create something, and your first attempt did not look that far away from the standard. But now your first app is not going to look like a really polished one; the distance is much greater, and it's kind of, how do you kind of
get over that? [unclear] Which is sort of the challenge of the internet: no matter what you do, you can find someone on the internet who makes what you've done not seem that great, right? And you've got to figure out some motivation. [Music] [Audience question, inaudible] I think the thing that throughout my career has been most successful for me is deliberately carving out some time, every day or every week, when you are focused not on solving your short-term problems but on thinking about what are the things I need to learn about that are not going to pay off today but might pay off later; explicitly seeking out new things to master. I think one trap that's easy to fall into if you do that is that you look at Twitter or you read blogs or Hacker News or whatever, and there are always amazing things. You're like, well, this is cool, this is cool, this is cool, and then you end up just skimming. So you also have to go deep into something, and you have to just ignore all these other amazing things that are going on and just focus and keep on going. [Audience question, inaudible] So I think part of that problem is never going to go away, because I think the best and worst thing about R is that most R users are not programmers. And that is awesome, because the diversity of people using R is critical, and it gives a very, very wide range of people the ability to understand data. But it also means that most people writing packages have never had any formal training in software engineering, and that can lead to some bad patterns. My goal, I think, is just to figure out... it often takes me a long time to externalize the things that I believe about programming, all the ways that I think you can structure code. And so
one of the things that I'm thinking about more and more is, now we've got the tidyverse, now we've got a collection of packages that fit together pretty well, why do they actually fit together? What are the properties? And then how can you use those ideas in your own code? Currently I'm in the phase where I'm giving workshops and talking about that, and slowly refining those ideas. But I see that as the next book that I want to write. R for Data Science is about how you do data science. Advanced R was about [unclear], R Packages is about how to create a package. But I want to write a book that's about how you take a big messy problem and decompose it into small functions that work well in isolation and adhere to existing principles, which means they fit more naturally into the tidyverse or that ecosystem. I feel like I can do that myself, but it takes me a long time, and figuring out how to explain it to other people [unclear]. [Audience question, inaudible] Yes, I think it's great. [unclear] I think with a lot of the coolest work, we're kind of only up to the mid-1980s. One thing I think is really neat is that the Statistical Computing section of the American Statistical Association has this video library, which is a bunch of videos of interactive tools that people developed. In one of them [unclear] they were developing these interactive tools, and the sophistication: they were rotating 3D displays in, like, 1973. Now admittedly this was at the Stanford Linear Accelerator, which had the biggest and most powerful computer at the time. I think the golden age of interactive statistical graphics was the 1980s, and a lot of
those tools we don't have today, particularly what you would describe as direct manipulation of the plot. There's some interesting work [unclear]. [Audience question about putting R into production, mostly inaudible] Yeah, this is a hard question. So with a production environment, part of the problem is just that R is unfamiliar. Most people who are running production environments come from a traditional software engineering background and they're very, very comfortable with [unclear]. So part of the problem is just educating those people that R is a real programming language. Sure, people can write really crappy code in R, but they can also write really good code. So there's a little bit of that. There's also the fact that, again, R is a community that does not have many full-time software developers in it, so there are some pieces of the puzzle that are missing because no one in the R community has been that interested in them yet. One of the things that made me excited to join RStudio was to have a company that can hire full-time software engineers to work on a lot of boring infrastructure stuff. So I think we're starting to see that. Some of these things are not directly production related, but sparklyr, I think, is now a pretty nice interface to talk to Spark from R. A bunch of production problems are just about getting the infrastructure right. We've also invested a lot in databases, and having performant, safe, correct connections to databases. And over time, slowly, as we find out about them, we'll knock off more of these. I think R is always going to be higher friction to put in production because, again, most R users are not software engineers [unclear]. [Audience question, inaudible] Pretty much everything, the vast majority of what RStudio does is on GitHub. So even the IDE. There are a few main
organizations: there's the RStudio organization, that's where the IDE lives, and Shiny; there's also the tidyverse organization, where all the tidyverse packages are. You can just go in and [unclear]. [Audience question, inaudible] I think it's good to have a basic understanding of both Python and R, but I think you're better off specializing in one, going deep in one and mastering it. You don't want to silo yourself, you want to have, you know, a broad, shallow understanding of all the things, but I think generally you want to master something, so you can be really, really good at it and build expertise. Because often, I think, when you go deep, if you go deep into R and learn about how the language works and how functional programming works, you do learn these ideas that you can translate. You can take what you've learned about functional programming in R and translate that into Python. But only if you've gone deep enough [unclear]. [Music] [Audience question, inaudible] Yeah, I think that is one of the explicit goals of R Markdown [unclear], that you can export to whatever it needs to be [unclear]. And then the thing about Markdown, and R Markdown in general, is that it's sufficiently simple that you don't have to worry about it going away. First of all, it's not a binary format, so you can always just read it as plain text [unclear], and you can still see the data. And then the other thing, which I think is a really good practice, and this is something Jenny [unclear], is that you don't just, you don't
just save the source code of everything, you also save the rendered documents. So when you have an R Markdown document, saving the rendered Markdown is actually really useful, [Music] because you can actually see the data. And if you're checking this into Git and your data changes, in the diff you see exactly what changed, not just in your code but in the inputs and outputs as well. And that means having these multiple layers [unclear]. [Audience question] I know that a lot of authors do things with bookdown where you can render it, but the PDF output has been disabled and it defaults to HTML. My question is, a lot of the time when I try to render it to PDF it fails; it fails in pandoc [unclear], and I can't find where to change it, because there's always the error message. And I want to know, is that on purpose? Should I stop and just accept the HTML that I'm offered? [Answer] I think part of it is, fundamentally, creating something that works with both HTML and PDF is more work, and it's easier to disable [the PDF output], so that's what I do. Even when I go and actually create the PDF when I send the book to the publisher, that is a lot of pain and frustration for me, and I don't want to go through that frustration too frequently. [unclear] I don't think it's a deep philosophical objection. [Audience, partly inaudible] ...it produces a LaTeX command with pandoc; where's that stored, so that I could change it? [Answer, mostly unclear; the suggestions appear to be to look in the working directory and to extract smaller chunks.] [Audience question] You guys kind of started doing this in RStudio with the add-ins. At what point are you
satisfied, or do you end up with another pipelining program? [Answer] I really dislike those kinds of pipelining programs where you solve a problem by dragging and dropping things [unclear]. I dislike them because pretty much what they try to sell you is that the hard part is typing the code. The hard part is not typing the code. The hard part is figuring out which inputs should be plugged into which outputs, and which components you need [unclear]. And there's [product name unclear], or whatever it's called, that does something similar. They just don't make the problem that much easier, because you've still got all this basic complexity, but because you don't get code, you lose all the benefits of the tools that software developers have spent the last 50 years developing. They make the easy problem easier and make the hard [problem harder]. [Applause]
Info
Channel: Association for Computing Machinery (ACM)
Views: 25,870
Rating: 4.8531075 out of 5
Keywords: Hadley Wickham, Data Science, GUI, RStudio, Graphical User Interface
Id: cpbtcsGE0OA
Length: 75min 18sec (4518 seconds)
Published: Mon Mar 12 2018