Hadley Wickham: Managing many models with R

Captions
So what I want to talk about today is really the culmination of a few years of thinking about the integration of data manipulation, visualization, and modeling in R. How many of you have seen Hans Rosling's talks? If you haven't, your homework for today is to go google Hans Rosling and watch one of his presentations on data. Hans Rosling is an absolutely fantastic presenter who talks about data that looks like this: on the x-axis we've got GDP, on the y-axis we've got life expectancy, and each of these dots is a year, so you can see over time what's happening with all the countries in the world. Now, I'm not going to show you any cool animated graphics like this; I'm going to show you some really boring line graphs. And this is by no means a big data set, right, this is like 1,700 observations in total, but even when you start plotting this much data you immediately have some problems. What can you tell from this plot? What are the main takeaway messages? Maybe you can kind of see that the pattern is generally upward. I should mention this is just two of those variables: we've got year on the x-axis, life expectancy on the y-axis, and each of these lines represents one of 142 countries. You can kind of see that the trend is generally pretty positive, and there are some weird lines going in the opposite direction down here. This is a really common problem: even with just a couple of thousand data points, you can't plot them all, or you can't see them all. So it turns out that, even just to do visualization, doing some modeling is really, really useful.

What we're going to work our way towards is a plot like this. Each of these dots is a country, and what I've done is fit a very simple linear model to each country. I've summarized the quality of that model with the R-squared on the x-axis, and then the main parameter of that model, which is roughly the average yearly increase in life expectancy, on the y-axis. When we look at this, you can see that for a lot of these countries the model fits really, really well, with R-squareds up at around 0.99. But then we've got a whole bunch of countries where the model doesn't fit so well, and what do those countries have in common? Well, they're all red: they're all from Africa. We'll see shortly why those countries don't fit this model so well.

Now, I'm going to show you this idea with a very simple data set, basically two variables and 1,700 observations, but the neat thing about this approach is that it scales to arbitrarily complex models, and you can use it with big data because it's trivially parallelizable. And there are three simple underlying ideas that you can apply in other situations as well, each coupled with a package. The first thing I'm going to talk about is the idea of nested data. This is going to appear a little crazy at first: we're going to make a data frame where one of the columns is a list of other data frames, and I'm going to do that using the tidyr package. Then I'll talk a little bit about functional programming, basically as an alternative to for loops. A lot of people say for loops are bad; I'll talk about why they're not really that bad, but also why you should be learning some other techniques as well. We'll be looking at that with the purrr package. And finally, a really powerful idea: whenever you've got a model and you want to visualize it in some way, it's really helpful to turn that model into tidy data, and to do that we're going to use the broom package by David Robinson. These three ideas are going to help us understand this data set; I think they're pretty simple and pretty general, and will hopefully help you solve other problems in R. So what is nested data?
Well, we've got these 142 countries, and what I want to do is fit a linear model to each of them, and that's going to be much easier if I have 142 data frames. Currently my data looks like this: I've got a variable for the country, a variable for the year, and a variable for the life expectancy, and each of these rows is an observation. But if I want to fit 142 linear models, it's going to be useful to have the data in a different form, and that's what I call a nested data frame. We're going to make a data frame with one row per country, where one of the columns is a list of data frames. For example, the first data frame in that list is going to be the data for Afghanistan, the second will be the data for Albania, and so on and so forth. I'm going to call this a nested data frame, hopefully for an obvious reason: we've got data frames inside a data frame. When I first discovered this, I thought it might just be a sign of madness, putting data frames inside data frames, but it actually turns out to be quite a powerful and useful idea, because as well as these data frames, of which we have one per country, we can have lots of other things that we have one of per country. Let's take a look at the code behind this. The data I'm showing you comes from the gapminder package by Jenny Bryan, and I've just done one little transformation: I've created a new variable called year1950, which is the number of years since 1950. Can everyone read that code okay? Okay, a little bigger. So what I'm going to do now is create this nested data frame using a combination of a dplyr function and a tidyr function: I'm going to take the gapminder data, then I'm going to group it by continent and country, and then I'm going to nest it. Let's see what this looks like. Now, instead of the 1,700-row data frame we had before, with one row per country per year, we have just one row per country.
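As a sketch, the nesting step described above might look like this in code (assuming the gapminder, dplyr, and tidyr packages are installed; in tidyr 1.0+, nest() keeps the grouping columns automatically):

```r
library(dplyr)
library(tidyr)
library(gapminder)

# Add "years since 1950" so the model's intercept is interpretable
by_country <- gapminder %>%
  mutate(year1950 = year - 1950) %>%
  group_by(continent, country) %>%
  nest()

by_country              # one row per country; `data` is a list of data frames
by_country$data[[1]]    # drill in: all the rows for the first country, Afghanistan
```

Each element of the `data` list-column is itself a tibble holding that country's yearly observations.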
If we want to get at the specific data for one country, we have to dive into this list. Now, if you've used R for a while, str() might be your go-to method for understanding the structure of an object. Unfortunately, str() is not super useful here, because it gives us a lot of output. So what I'm going to do instead is just drill into that data variable and look at the first value. This is just a data frame: if we look by country, the first row is Afghanistan, so this is all the data for Afghanistan, for each year, what's the life expectancy, plus a few other variables that we're not going to talk about, like the population and GDP per capita. Does that make sense? What we've done is gone from one big data frame to one data frame per country, and instead of just having those lying around as a list, we've stored them as a column of our data frame, which makes them easy to identify, because the other columns give the name of the country and the name of the continent.

How many of you have seen this guy before, the pipe? If you haven't seen it before, this operation is just the same as this: the pipe, which is often pronounced "then", so "x, then f(y)", just takes the thing on the left-hand side and feeds it in as the first argument of the function on the right-hand side. The pipe is just syntactic sugar; it doesn't change what the code does or what it means, but it basically makes it easier for us humans to read. When you read piped code, you can read it as an imperative sequence of operations: take the gapminder data, then group it by these variables, then nest it. Behind the scenes, that's exactly the same as this code. Now, functions are great, because computer scientists have studied functions for a very long time; we understand how functions work and how to program with them. But the problem is that to read these nested function calls, you have to read them from the inside out.
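The equivalence being described can be shown with a tiny example (using the %>% pipe exported by the magrittr package):

```r
library(magrittr)

# These two expressions are exactly equivalent:
sqrt(sum(1:10))            # nested: read from the inside out
1:10 %>% sum() %>% sqrt()  # piped:  read left to right, as "then"
```

The same rewriting applies to the nesting pipeline: `gapminder %>% group_by(...) %>% nest()` is just `nest(group_by(gapminder, ...))` rearranged for human reading.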
That's a little challenging to do, so all the pipe does is rearrange the code in a way that is more optimal for humans. And I think this is really important, because code is a medium of communication, not just between you and the computer, but between you and other people. On every project you work on, you're always working with at least one other person, and that's future you. You really don't want to be in a situation where future you is cursing present you because you did a terrible job of writing your code and now you've got no idea what's going on. I've experienced this many times: I look at my code and it's not only as if a stranger has written it, but as if a potentially insane stranger has written it; I have no idea what it does. Even if you think it's going to be throwaway code that you'll never look at again (and that's always the code you end up relying on later), it's very, very important to invest time in making sure your code is easy to understand, and the pipe is a great way of doing that.

Okay. So now we've got a data frame where one column gives us the name of the country and another column holds a whole lot of data frames, each one giving us the data for an individual country, and what we want to do is fit a linear model to each of those countries. It's a very, very simple linear model: I'm just going to predict life expectancy based on the number of years since 1950. This is not necessarily a model that I believe, and I don't think this model is true, but I think this model is useful, because when I fit it and look at what remains, I've removed any linear trend. Models don't have to be true in order to be useful. When I do this, I've got 142 countries and 142 data frames, so I'm going to end up with 142 models. So why not put those in a column of the data frame as well?
This all works because a data frame is, by definition, a list of vectors that all have the same length. A list is a vector, so you can have a column that is a list, and a list can contain anything in R, so you can easily have a list of data frames or a list of linear models. This is all perfectly legitimate R code; tidyr and dplyr just provide a little bit of infrastructure to make your life easier. I think the important thing here is that using these list columns keeps related things together. Say this is your data, this is the name of the country, this is your model, and this is the continent. If you decide you just want to focus on one continent, say Africa, you have to subset this vector, subset this vector, subset this vector, and subset this vector, and if you forget to subset one of them, then at some point you're going to try to combine them and they're not going to line up. If you're unlucky, you won't get an error message; it'll just silently do the wrong thing. This is very much like the experience of working in Excel, where you go to sort a table by a column and instead Excel just sorts that individual column, effectively randomizing your data. You want to keep related things together, and you want to keep them in a data frame, because then you act on them as a whole. You never have to think "oh, I changed the order of this vector, so I've got to remember to reorder the others"; the data frame provides the infrastructure to do that for you automatically.

So, to fit the models, first of all I'm going to define a little function. It takes a data frame as input and returns a linear model as output. Then I'm going to take the data, which is split up by country, and use mutate() to add a new variable called model.
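Put together, the model-fitting step sketched above might look like this (again assuming gapminder and the tidyverse packages; `country_model` is the little helper function described in the talk):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(gapminder)

# A data frame in, a linear model out
country_model <- function(df) {
  lm(lifeExp ~ year1950, data = df)
}

by_country <- gapminder %>%
  mutate(year1950 = year - 1950) %>%
  group_by(continent, country) %>%
  nest() %>%
  # map() turns the list of data frames into a list of linear models
  mutate(model = map(data, country_model))

by_country$model[[1]]   # the fitted model for the first country
```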
This is what that looks like. We'll explain the map() function in a little more detail shortly, but the basic idea is that map() takes each element of this list of data frames and applies this function to it, returning a list. So it's going to transform, or map, a list of data frames into a list of linear models, using this function as the translator. Now, you might wonder: why use this fancy map function that no one's ever heard of? Why not just use a good old-fashioned for loop? If you've used R for any amount of time, you've probably heard someone tell you that for loops are bad and you're a bad person for using them. This is absolutely not true, but there are some advantages to taking the next step beyond for loops and learning these techniques of functional programming, which is basically this: instead of writing the for loop yourself every single time, you use a function that does the for loop for you. In some sense this is all about being lazy: instead of doing the work yourself, you rely on work that someone else has done once.

I'm going to try to illustrate these ideas using some cupcake recipes as motivation. Here is a real cupcake recipe from the Hummingbird Bakery cookbook. How many of you know about the Hummingbird Bakery? It's a fairly famous bakery here. How many of you have made a cupcake in your lives? Okay. If you've ever made a cupcake before and you read this recipe, you'll notice it's quite explicit: "put the flour, sugar, baking powder, salt and butter in a free-standing electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined". My favorite instruction here is "continue mixing for a couple more minutes until the batter is smooth, but do not overmix". Worst advice ever, "don't overmix"; you can take it as given that you shouldn't undermix either, and don't
overbake your cookies. So this recipe is great the first time you make cupcakes: it's very explicit and spells everything out in detail. But now imagine that you've mastered vanilla cupcakes and you want to go on to make chocolate cupcakes. What's the difference between a vanilla cupcake and a chocolate cupcake? If you switch back and forth, you might notice that you put cocoa in a chocolate cupcake. You can imagine a hundred-page cookbook that tells you how to make cupcakes, where every page repeats this information and changes a few small things. This matters because, once you've mastered vanilla and chocolate cupcakes, maybe you want to go on to invent your own cupcake recipe, and it's really useful to be able to easily see which of the parameters are the same and which are different. But that's hard to do when there's so much text. So what I'm going to do is refactor this recipe: I'm going to rewrite it, as if I were rewriting a function, to be easier to understand.

One unfortunate story about this cookbook: a friend in New Zealand recommended it, saying it was a really great cookbook, and then I bought it and baked some cupcakes from it and they turned out terribly. I complained to her that it was a terrible cookbook, but it turns out, unfortunately, that I had bought the American version, which uses ridiculous measures like "a scant 3/4 cup of sugar" instead of reasonable weights and metric measures. So the first thing I do with this recipe is convert it to reasonable units. The next thing I do is rely on some domain knowledge: I'm going to assume that you've made cupcakes before, that you've done some baking before, and I can reduce the recipe to this: mix these things together until sandy, then beat in the wet ingredients. The amusing historical anecdote here is that when I was a kid growing up, I did quite a lot of baking with my mom, and her recipes always really annoyed me because they were just a list of ingredients followed by something like "bake, hot oven, ten minutes". There were never any instructions; it was just assumed that you would know, based on the ingredients, what to do with them. At the time that really frustrated me, so I rewrote all those recipes with very explicit instructions, and now I'm rewriting recipes in the opposite direction.

There's one other thing we can do here that's going to make it easier to generalize: I'm going to use some variables. Let's just say "beat the dry ingredients together, then mix in the wet ingredients". Now that I've done that, I can put multiple recipes on a single page, and this is useful because we can now see precisely what the difference is between a vanilla cupcake and a chocolate cupcake. It's not that you add cocoa; it's that you substitute cocoa for some of the flour. That's useful because if you made this cupcake and it wasn't chocolatey enough, it's obvious what to do: put in more cocoa and less flour; and if it was too chocolatey, more flour and less cocoa. The next step is to start thinking about these recipes as data. I can now put them in a data frame and fit even more recipes on the page. Here I've added a recipe for lemon cupcakes. What's the difference between a lemon cupcake and a vanilla cupcake? Instead of adding vanilla, you add some lemon zest. This is important because if you want to go on and generalize, to create new types of cupcake recipes, you can now see what's normally the same and what's different. Here's another recipe, red velvet cupcakes, and you can see immediately that these are a little different: they have more flour, no baking powder, more sugar, and more butter. So right away you know this
is going to make a bigger mixture, but it's probably not going to rise as much. One other interesting thing: there's still one egg, because eggs come in integer quantities, although if you really get into baking, an integer number of eggs is not accurate enough and it's even better to weigh out, say, 40 grams of egg, which I assure you is quite frustrating.

So what's this got to do with programming? Here are a couple of for loops, and I want you to look at them for a couple of seconds and see if you can figure out what they do: what's the same between these for loops and what's different. There are three parts to every for loop. Your for loop should always start by allocating space for the output; here I'm going to make a vector of doubles with one element for each column of the mtcars data set. You always want to do this, or your for loops are going to be terribly slow, because they'll have to grow the output at every iteration. Then we iterate: we say "for i in seq_along(x)". How many of you have heard of seq_along() before? Let me quickly demonstrate it. Say we've got a vector with three elements: seq_along() gives me the sequence 1, 2, 3, and 1:length(x) gives the same thing. What happens if I have just two elements in the vector? I get the numbers 1 and 2. What happens if I have one element? And what should happen if I have a vector with no elements in it? I should get no indices, and that's what I get with seq_along(): an integer vector of length 0. But if I use 1:length(x), R is going to "helpfully" count backwards from 1 to 0. Now, normally you don't deliberately use zero-length vectors in your code, but you want to protect against accidentally using one, because that's not going to give you an informative error message; you're just going to get some weird error somewhere later down the line, and you'll have no idea what caused it. So seq_along() is just a little protection.
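The zero-length edge case being described is easy to see at the console:

```r
x <- c(10, 20, 30)
seq_along(x)   # 1 2 3
1:length(x)    # 1 2 3, same here

x <- double()  # a zero-length vector
seq_along(x)   # integer(0): a loop over this runs zero times, as it should
1:length(x)    # 1 0: counts backwards, so the loop body runs twice!
```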
Okay. So we've got the output, we've got what we're looping over, and then we have the operation: extract the i-th column of mtcars, compute the mean, and save it into the i-th element of the output vector. So what's the difference between this for loop and that for loop? We're just computing the mean in this one and the median in that one. The problem with for loops is that they tend to emphasize the objects you're working with, whereas what's actually important is the action. That's the critical difference between these two for loops, but it's hard to see, because it's like 5% of the total characters typed. It's hard to see what's different when there are so many things that stay the same. So what I think you should do instead, or what you should learn to do over time, is use a function that wraps up that for loop. I'm going to show you the functions from the purrr package; these have many analogues in base R, but the functions in purrr are more consistent and have a few shortcuts that tend to help you out. This code does exactly the same thing: we're going to take each element of mtcars, apply mean() to it, and we expect the output to go into a double vector. And indeed, if we look at the code for that map_dbl() function, this is exactly what we see: the same code we used before, but instead of using a specific function like mean or median, the function is now an argument. In R, functions can take other functions as arguments, and this allows us to write not a specific for loop but a generic for loop. Now, this is actually somewhat of a convenient lie: the source code of map_dbl() doesn't really look like this, because for various not-very-important reasons I decided to write it in C rather than R. But this is basically what it's doing
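The generic-for-loop idea might be sketched like this (the `my_map_dbl` helper here is an illustrative reimplementation, not purrr's actual C source):

```r
library(purrr)

# The action is now the argument; everything else stays the same
map_dbl(mtcars, mean)
map_dbl(mtcars, median)

# A base-R sketch of what map_dbl does behind the scenes
my_map_dbl <- function(x, f, ...) {
  out <- vector("double", length(x))  # 1. allocate the output up front
  for (i in seq_along(x)) {           # 2. iterate safely, even over length zero
    out[[i]] <- f(x[[i]], ...)        # 3. apply the function and store the result
  }
  out
}
```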
behind the scenes: it's still doing the same thing. It creates space for the output, iterates over each element of the input, calls the function on it, saves the result into the output, and then returns it. There are lots of other functions like this. If we've got a function that returns doubles, you might imagine we also have a function that returns integers, a function for characters, and one for logicals, and then the most general form of all is map(), which returns a list. It's the most general because anything at all can go inside a list. Those functions all varied in their outputs; these functions can also vary in their inputs. For example, map2(), instead of taking a single vector x, takes vectors x and y and loops through them in parallel: it calls f with the i-th value of x and the i-th value of y. You can imagine map3() and map4() and so on, which don't exist, because there's a more general form, pmap(), for parallel map. And going back to the cupcakes analogy, once we start thinking about recipes as data, you can even think about functions as data. Here I've put the functions I want to apply in a list, and I can use some pretty terse purrr code to compute, for each of those functions, its value for each column of this data frame. So we're moving higher and higher up this tower of abstraction, and maybe you'll never get to this point, but the idea that you can write a function that wraps up a common pattern of for loop is a really, really powerful one.

So let's get back to the gapminder data. This is where we started: one line per country. And this is where we want to end up: one model per country, with that model summarized in a couple of different ways. The first thing we did was to nest the data using tidyr, so we go from the form with one observation per row to one group, or country, per row, with the individual data now stored in a list of data frames.
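Those input-varying forms, and the functions-as-data idea, might be sketched like so (the specific numbers here are made-up illustrations):

```r
library(purrr)

mu    <- list(1, 10, 100)
sigma <- list(0.1, 1, 10)

# map2() iterates over two vectors in parallel
map2(mu, sigma, ~ rnorm(3, mean = .x, sd = .y))

# pmap() generalizes to any number of parallel inputs
pmap(list(n = c(1, 2, 3), mean = mu, sd = sigma), rnorm)

# Functions themselves can be data: a list of actions,
# each applied to every column of a data frame
funs <- list(mean = mean, median = median, sd = sd)
map(funs, ~ map_dbl(mtcars, .x))
```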
The next thing we do is fit the linear model to each country. Now you can understand what this map() call does: it takes each element of this list of data frames, sends it into this function, and saves the results into a list. So let's do that, just to confirm I'm not lying and this really does all work. If we print this out, you can see we get a data frame with four columns: the continent, the country, this list of data frames, and now this list of linear models. Again, str() is going to get progressively less and less helpful here, unfortunately, and you'll notice it takes some time to print all that. One of the things we hope to do, working with the RStudio IDE team, is make it easier to explore these objects without being overwhelmed with output. But I just want to point out one of the reasons we're doing this, coming back to this idea of a data frame with list columns: now I can say "I want all of the things to do with Africa", all the countries in Africa, and this just carries everything along for the ride. I get the data and I get the model, and I don't need to think about subsetting all these pieces separately.

But now that we've got this list of linear models, what can you actually do with a list of linear models? Not a lot. So what we're going to do instead is take these linear models and turn them into tidy data frames. The great thing about tidy data frames is that you already have lots and lots of tools for dealing with them: you can visualize them with ggplot2, you can manipulate them with dplyr, you could even think about modeling them if you wanted to go really crazy. To do this I'm going to use the broom package by David Robinson. It works for a very wide variety of model objects in R, and it's going to convert them into different forms of tidy data.
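The "carries everything along for the ride" point might look like this in code: one filter() call subsets the countries, their data, and their models all at once (a sketch assuming the nested pipeline from earlier):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(gapminder)

by_country <- gapminder %>%
  mutate(year1950 = year - 1950) %>%
  group_by(continent, country) %>%
  nest() %>%
  mutate(model = map(data, ~ lm(lifeExp ~ year1950, data = .x)))

# Subsetting the data frame subsets every column at once,
# so the data and models for Africa stay perfectly in sync
africa <- by_country %>% filter(continent == "Africa")
africa
```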
So what sort of data can we get from a model? Here I'll just pull out the data for New Zealand and fit a linear model to it. What can I get out of that linear model? There are basically three things. First, I can pull out model summaries, like the R-squared. Model summaries tend to fall into a few camps: they're either a measure of model quality, like the R-squared; or a measure of model complexity, like the degrees of freedom; or some attempt to compromise between quality and complexity, like the adjusted R-squared, the AIC, or the BIC. For these, you've got one row per model. Then we also have data about the parameters of the model: the estimates of the parameters, and, although I'm not showing them here, each of those estimates has an associated standard error, t-statistic, p-value, and so on. And then we have values at the observation level, one row for each row in the original data set: things like the fitted values, the residuals, and so on. The broom package gives us tools to extract each of these: glance() gives us the model summaries, one row per model; tidy() gives us the model estimates, one row per estimate per model; and finally augment() gives us the observation-level values, one row per observation per model. So let's compute all of those. Again we're going to use mutate() and map(): I'm going to take each of these models and tidy it, take each of these models and glance at it, and take each of these models and augment it. We end up with a data frame that looks like this: my original data as a list of data frames, a list of linear models, and lists of the model statistics, the parameter statistics, and the observation statistics. And you might wonder, well, what can you do with a list of data frames? Hopefully we can do the
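The three broom verbs on the New Zealand model might look like this (a sketch; the `I(year - 1950)` term stands in for the year1950 variable from earlier):

```r
library(dplyr)
library(broom)
library(gapminder)

nz  <- gapminder %>% filter(country == "New Zealand")
mod <- lm(lifeExp ~ I(year - 1950), data = nz)

glance(mod)   # one row per model: r.squared, adj.r.squared, AIC, BIC, ...
tidy(mod)     # one row per coefficient: estimate, std.error, statistic, p.value
augment(mod)  # one row per observation: .fitted, .resid, and other diagnostics
```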
opposite of what we did before. Remember, earlier we took a regular data frame and nested it to get a nested data frame; now we want to go from the nested form back to the regular, unnested form. Let's have a look at that. I'm going to calculate those statistics, and I'm going to do one other thing, which is to extract the r.squared value from the glance output and add it as a regular column. You can see here the R-squared of each model; this is just a regular column of doubles, nothing special, one R-squared per model. And we can do things like sort all the models by their R-squared, and we can see Brazil has a 0.998 R-squared, which is pretty high indeed. Now, looking at those R-squareds in a table is not that useful, so we can instead plot them; that plot's not very useful either, so I'm going to zoom in. Here I've got R-squared on the x-axis and the countries on the y-axis. You can't read those labels, but that doesn't really matter, because you can see the main story: for most of these countries the model fits really well, we've got a bunch down here where the model is terribly bad, another bunch that are progressively getting better but still pretty awful, and then maybe another break around 0.75. I made a little shiny app to explore these, just so you can see. These are plots of the countries that had an R-squared between zero and 0.25. So why do these countries not fit our linear model very well? The obvious statistical reason is that these are not straight lines; something else is going on in these countries, and unfortunately, at this level, these are all bad things that have happened to countries. We have civil wars, but for most of these countries the main thing
that's had this massive effect is the HIV/AIDS epidemic. We could look at the other group, between 0.25 and 0.5, where you see a lot more African countries. If we go between 0.5 and 0.75: I have no idea what's going on with Bulgaria, but it has a different pattern to every other country; we can see civil war in Cambodia, HIV/AIDS, the Iraq war, North Korea. I don't know anything about Liberia, and there aren't as many data points here, but something bad happened there as well.

So how did we make those plots? The key idea is to unnest the data to get it back to a regular data frame. If you just unnest that data column, which, if you remember, holds our original data frames, that basically gives us back our original data, except that we now have this additional r.squared column. Remember, we've got one R-squared per model, and each model is for one country. So what happens when you unnest the data to get one row per observation per country? There's still only one R-squared per model, so that R-squared is constant within a country; it just gets duplicated out to fill in the rows. Or, if we unnest the glance data, which gives us the model summaries, then for each country we get the R-squared, the adjusted R-squared, sigma, p-values, degrees of freedom, AIC, BIC, and so on. Or, if we unnest the tidy data, we get the coefficients from the model. Each model has two coefficients, so each country gets two rows here: we see the term (the intercept or the slope), the estimates, and the other variables that didn't get printed out, including the standard error, t-statistic, and p-value. Now, I want to do a plot with intercept on the x-axis and slope on the y-axis. This data is not quite in the right form for that, but it's a data frame, so I can use, for example, the tools in the tidyr package to get it the right way around, and then I can make this plot.
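The unnesting step described above might be sketched like this (assuming tidyr 1.0+ semantics, where unnest() takes the list-column to unnest and leaves the other list-columns alone; the column names `glanced` and `tidied` are my own):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(gapminder)

by_country <- gapminder %>%
  mutate(year1950 = year - 1950) %>%
  group_by(continent, country) %>%
  nest() %>%
  mutate(
    model   = map(data, ~ lm(lifeExp ~ year1950, data = .x)),
    glanced = map(model, glance),
    tidied  = map(model, tidy)
  )

by_country %>% unnest(glanced)  # one row per country: r.squared, AIC, ...
by_country %>% unnest(tidied)   # two rows per country: intercept and slope
```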
We've got the expected life expectancy in 1950, which is the intercept, and the average yearly improvement, which is the slope. One thing I've done here is size the points according to their r-squared, and that's because you want to be careful about interpreting the coefficients of models that fit very poorly. A very poor r-squared suggests that a straight line is not a good model for that country's data, so there's no point trying to interpret the coefficients; they don't do a good job of explaining what's going on. There are a few interesting things here. There's no data up here in this triangle: a point in that triangle would be a country that had a very good life expectancy in 1950 and has improved rapidly since. So this is actually quite positive: it's saying the countries that were worst off in 1950 have by and large improved the most; the worst-off countries are catching up to the best-off countries, although we have this cluster, basically Africa, which is the HIV/AIDS epidemic. We could also unnest the augmented data. Remember, that gives us one row per observation per model, so we can see the fitted values, the standard errors, the residuals and various other residual diagnostics, and we can use this to plot a sort of overall summary. I'll do this one first: on the y-axis are the residuals from the model, on the x-axis is the year, each line is a country, and then I fit an overall smooth curve through all these residuals. If you saw this sort of residual plot when fitting a model, normally you'd think: something looks not so good here; I've got some clear non-linearity; I probably need to go back and improve my model. Now, I'm not actually going to do that here, because I don't do real data analyses; I can just stop whenever it gets too complicated.
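A sketch of that residual plot, again assuming the `by_country` nested data frame from earlier (hedged, not the exact talk code):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(ggplot2)

# One row per original observation, with fitted values and residuals
resids <- by_country %>%
  mutate(augmented = map(model, broom::augment)) %>%
  unnest(augmented)

# One faint line per country plus an overall smooth curve;
# add facet_wrap(~ continent) to see the by-continent version
ggplot(resids, aes(year, .resid)) +
  geom_line(aes(group = country), alpha = 1/3) +
  geom_smooth(se = FALSE)
```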
But hopefully you can see how the same ideas would apply: you could fit a new model and then proceed through these same steps again. In this case I could also facet that residual plot by continent, and you can see that most of that unusual pattern seems to be coming from Africa, but maybe we're also seeing a little bit in Asia as well. So even though that linear model did a really good job, with r-squareds of 0.99, maybe there's still something we're missing, and we could go back and create a better model. So there are three main ideas that I showed you today. The first is that you can put anything in a list, and you can put a list in a data frame, which means you can store anything in a data frame. This idea of a list column is really powerful, because it allows you to store related things together: it keeps the data frame connected to the model that was fit to it. You can imagine lots of situations where this occurs; in this case I was splitting up by country, but you get similar combinations of data and models when you're doing bootstrapping or cross-validation. It's a really useful technique for keeping related things together. My second piece of advice is that it's really, really useful to learn a little bit about functional programming. You don't have to: for loops are great, and if they solve the problem, that's fine. But there are some big advantages to learning these techniques. It's not that you can now do something you couldn't do before; it's that you should find you can do things with much greater ease. It's easier to iterate over these things: instead of writing all the code for a for loop, you can just call one of these functions, and you know that those functions are well tested and used by lots of people.
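The point about for loops versus well-tested iteration functions can be made concrete with a tiny example. This uses base R's vapply so the sketch is self-contained; purrr::map_dbl(df, mean) would be the equivalent call from the toolkit discussed here:

```r
# A for loop computing the mean of each column...
col_means_loop <- function(df) {
  out <- numeric(ncol(df))
  for (i in seq_along(df)) {
    out[i] <- mean(df[[i]])
  }
  names(out) <- names(df)
  out
}

# ...versus a single call to a well-tested iteration function.
# (vapply is base R; purrr::map_dbl(df, mean) is the tidyverse equivalent.)
col_means_fun <- function(df) {
  vapply(df, mean, numeric(1))
}

df <- data.frame(x = 1:4, y = c(2, 4, 6, 8))
col_means_loop(df)  # x = 2.5, y = 5
col_means_fun(df)   # identical result, far less room for error
```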
So it's unlikely that there's an error in them; the chances of you accidentally making a mistake in your own for loop are much higher. Then finally, if you want to understand what's going on with your models, a really powerful technique is to turn them into tidy datasets, because you have lots of tools for working with tidy datasets, and the best way to do that, in my opinion, is to use the broom package by David Robinson. So I wanted to finish with one other little thing I've been working on recently. This is a collaborative project between me and Wes McKinney, who's the author of the pandas project for Python, and it's a new project called feather. The goal of feather is to solve this problem: here I've taken a kind of large CSV file that's just lying around on my hard drive, about 600 megabytes, and if I read it in using, say, read_csv, that's going to take about 14 seconds. The goal of feather is to provide a binary file format that can be read and written much more efficiently than CSV files: if I write that file out to disk, it's going to take about three seconds. Currently feather files are often actually bigger than CSV files, but they can be read in much, much more efficiently. The other advantage of feather files is that you can share them with people using Python and, in the near future, Julia and Go and lots of other languages. They also retain all the type information: when you save a CSV file, you lose information like "this is a date" or "this is a factor rather than a string"; feather preserves all of that, so you can read the data back in without losing any important metadata. Now, you may have used RDS for this in the past, with readRDS and saveRDS; they're a similar idea. The two problems are that they're quite a lot slower than feather files, and of course they're limited to R: you can't load them from another programming language.
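In code, the comparison looks roughly like this. `flights.csv` is a hypothetical stand-in for the ~600 MB file mentioned, and the timings are the ones quoted in the talk, not guarantees:

```r
library(readr)
library(feather)

# Reading a large CSV means re-parsing text every time
flights <- read_csv("flights.csv")          # ~14 seconds in the talk

# feather is a binary, typed, columnar format shared with Python
write_feather(flights, "flights.feather")   # ~3 seconds in the talk
flights2 <- read_feather("flights.feather") # fast, column types preserved
```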
programming language. So I wanted to finish off by giving you the names of the most important things I talked about today, because I think in the age of Google the most important thing you need to know about something is its name: as soon as you know the name of something, you can google it and find the information you need, and if you don't know the names of things, it's much more challenging. So I used dplyr today for working with data frames, and purrr primarily for working with lists, and I used tidyr to convert between the regular data frame and the nested data frame, which contained a list column of smaller data frames. If you have models and you want to turn them into tidy data frames, broom is your tool of choice. If you have used plyr before, you might notice some similarities between this and what you have done with ldply and dlply, or if you've used dplyr and done some stuff with do() and rowwise(): what I've talked about today is basically an alternative workflow for solving those same problems, and I think it is by and large a much better workflow. If you'd like to learn more about any of these techniques, I'm currently writing a book with Garrett Grolemund called R for Data Science. The goal of that book is to take you from knowing nothing about R or programming to being a reasonably well-tooled data scientist. It is a work in progress, but everything is available online, and if you have any feedback I'd love to hear it. Finally, if you'd like to learn more about feather, you can go to Wes McKinney's repo, where he and I have been working. Thank you. [Host remarks, partly inaudible: there are a couple of microphones for questions, and for anyone frantically scribbling notes, the slides and video will be made available.]
[Host announcements, largely inaudible: a description of the hosting meetup group, a data science and technology meetup in Scotland that brings together students, academics, commercial practitioners and anyone interested in data science, regularly hosts speakers on applications of data science, and welcomes new members and prospective speakers, followed by a word from the sponsoring companies.] Audience question, paraphrased: usually each column of a data frame has an enforced class, such as double; when a column is a list, is each element checked, or do you just ignore that? So let's make this concrete. When you have these list columns, you'll notice here that each of these elements is a linear model, and there's nothing stopping you from putting something else in there: you could put a table in this one, a data frame in this one, a function in this one. There's nothing to stop you from doing that except your own common sense. So basically I'm going to
rely on the fact that, by and large, you're going to work with these things using a map function, and I think it's going to be fairly difficult to accidentally end up with different kinds of things in there. So that's my belief currently; I don't believe I need to worry about it, so for now I'm not going to, and if it turns out that people really do like to do bad things, then I'll think about it. But this is in line with the philosophy of R: it lets you take the loaded gun, point it at your foot and pull the trigger if you want, and it's up to you to avoid doing that to yourself. [Next question, from the back on the left.] Audience question, paraphrased: the workflow you're describing here is really different to how a lot of my colleagues work; they have a dataset and in their mind they fit the one big model to rule them all, and it's really challenging to break them of the notion that that's how it should be done. So, two questions. First, with broom, what kinds of models does it work with, and is it extensible, so that other kinds of models might fit in? Second, with this functional stuff creating these columns that are lists, I occasionally have issues where some subset of the data breaks the function and it crashes the whole thing, rather than storing some kind of null or error. So I'm going to answer both of your questions with Google searches. For the first one, if you go to the broom homepage (you can't, unfortunately, just google for "broom"; you have to google for "broom" and "R"), you'll see the list of model types it currently supports, which is quite comprehensive, and David, the maintainer, is fairly active at adding new ones; if you do want to add one yourself, it's a fairly simple process. The other challenge is that when you're doing these map functions, either the whole thing succeeds or the whole thing fails, which can be frustrating.
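The fix the talk turns to next is purrr's safely(); a minimal sketch of how it behaves, with output shapes as described in the purrr documentation:

```r
library(purrr)

safe_log <- safely(log)

# Success: $result is filled in, $error is NULL
safe_log(10)

# Failure: $result is NULL, $error holds the condition object
safe_log("a")

# Inside map, the iteration never dies part-way through
out <- map(list(10, "a", 100), safe_log)
ok  <- map_lgl(out, ~ is.null(.x$error))
successes     <- map(out[ok], "result")      # results that succeeded
failed_inputs <- list(10, "a", 100)[!ok]     # inputs to investigate
```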
So purrr provides another function, called safely. safely works a little bit like try() in base R, but it's a little different. First of all, it's an adverb: safely takes a function as input and returns a modified function, one that always succeeds. It changes the output so that the function now always returns a list of two things: the first is the result and the second is the error. Every time you run it, either the error is NULL or the result is NULL, so you can work with the output relatively easily, and you can see some examples on this page; this is again from R for Data Science. Typically you want to do two things when you have errors: extract the results that succeeded and work with those, and then find the inputs that failed, and purrr provides functions to make both of those as easy as possible. Another question: the data frames you're talking about, are those base R data frames or data.tables? The answer is: slightly neither. These are what I call tibbles. Tibbles are basically 90 percent regular data frames, but data frames were invented about 20 years ago, and since they were invented we've determined that some parts of what data frames do are good and some parts are annoying and stupid. Basically, what tibble does is take the good parts and, in my opinion, make the bad parts go away. For example, there's a function called data_frame, which is a replacement for
data.frame, and the main difference is that it does less: it tries to be less helpful. For example, it does not convert your character vectors into factors for you. Tibbles also have some other behaviours around subsetting that make them a little more consistent: if you try to access a variable that does not exist, you get an error rather than a NULL, and whereas subsetting a regular data frame down to a single column can give you a bare vector back, no matter how you subset a tibble, you always get another tibble back. So it's basically a thin wrapper around data frames that makes life generally easier and errors more obvious. [Audience question about running these pipelines in parallel.] There is nothing currently within the map functions, but there is a package, still mostly a proof of concept, called multidplyr, which allows you to take a data frame and spread it across multiple cores on your computer; all of the dplyr functions then work in exactly the way you're used to, but transparently the requests are spread across multiple cores. So you could use multidplyr, and because the workflow I showed you uses mutate, you kind of get parallelization for free. At some point I think purrr will support some parallelization built in as well; I'm just waiting for the distributed R project to get a little further and provide a common API across not just multiple cores but cluster APIs and Spark and all that kind of stuff. So it's definitely in the works: you can play around with multidplyr now, but there'll be more in the future. Yeah, so I think you're looking for this page: "hipster", about educating people who learned R before it was cool. It's written by Karl Broman, who also learned R a long time ago, and it's his notes on what you should learn to update yourself. I don't think you'll find it if you just google for "hipster", but if you google "broman hipster", that seems to reliably get me to this page. Obviously I like this page, because it tells people to use my stuff.
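A small illustration of those tibble behaviours (assumes the tibble package; note that since R 4.0, base data.frame() no longer converts strings to factors by default either):

```r
library(tibble)

# tibble() never converts character vectors to factors
tb <- tibble(x = letters[1:3])
class(tb$x)       # "character"

# Single-column subsetting keeps a tibble...
class(tb[, "x"])  # "tbl_df" "tbl" "data.frame"

# ...whereas a base data frame drops to a bare vector
df <- data.frame(x = letters[1:3], stringsAsFactors = FALSE)
class(df[, "x"])  # "character"
```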
Also, tibbles were originally inside of dplyr, and the tibble package is now a separate thing, so if you use dplyr you get them for free; if you want to use them outside of dplyr, you can use the tibble package, which is available on CRAN. [Audience question, paraphrased: like your colleague, I like to start by looking at all of the data together and fitting one model to everything; what about that approach?] So I would argue that doing that as the first thing would be a mistake, because one thing that is challenging with this data set is that for a lot of the countries a linear model fits very well, but for other countries you need some kind of quadratic. I think working your way up to a single model is a really good idea; my approach is always to start with the simplest possible thing and then work your way up, because it's easier to diagnose problems along the way to a complicated model. Yeah, I agree; I just think you're better off starting simple. It might be absolutely necessary to have a single model, to pool enough information to get reliable estimates; I just worry that people fit very complicated mixed-effects models that they don't actually understand. And I think the other problem is that you pick the model you want to fit, one that's really complicated and includes all the scientific effects and all the interactions you believe in, and then it doesn't converge. So you think, well, I'll just drop this variable, and it still doesn't converge, and, well, I'll just drop that variable, and now it converges and you're happy. So, you know, I'm not saying you can't make mistakes with this,
but I think thinking about going in both directions is really valuable. Thanks. [Audience question, largely inaudible: with so many packages and approaches on offer, isn't there a danger of fragmentation? Wouldn't one right way be better?] So, one thing: I had a really interesting conversation once with John Chambers, who designed the S language, which R is based on, and I put basically the same idea to him: I want there to be one right way of solving a problem, a best practice; if you're going to do this, there should be one right way of doing it. And to John that idea was really unappealing: he wants you to be able to solve problems using multiple different approaches. Regardless of where you personally fall on that spectrum, between having many different ways to tackle the same problem and always having one right way, the reality is that in R there will always be many different ways to solve a problem, and I think you just have to accept that; if you rail against it, you'll just become unhappy. S was designed as a language of freedom: lots of people can approach problems in lots of different ways. I think there's a small danger of R regressing into mutually incomprehensible dialects, but there's so much shared code, and increasingly people are doing more and more in public, that I think the dialects will be like Scottish English and English English: you're still going to be able to understand each other most of the time; they're not going to drift too far apart. So yeah, personally I
would like to have a very tidy language, but that's just not going to happen. [Audience question: every generation of programmers rediscovers functional programming; why do you think it's so hard?] So I think, for one, in many, many domains functional programming is not a terribly good fit. You often see people arguing online about whether object-oriented programming or functional programming is better; the functional programming people say it's obviously better, because for every problem they've worked on it's been better, and the object-oriented people say exactly the same, and both are completely true, because they work on different problems. For data analysis in particular, I think functional programming is a good fit, because you're often doing the same thing to many different examples. The other thing is that functional programming is not really the most accessible discipline in computer science. It's taken me a long time, this sort of cupcake analogy has taken me a long time to get to, and because the reasons why functional programming is useful are so deeply internalized in me, they're hard to explain to other people. When the benefit is so clear to you, it's hard not to come across as evangelical, like "learn functional programming and be saved", which is not generally a terribly successful conversion technique. So I don't know; I'm hopeful. I think functional programming is a big part of R, and the goal of purrr is to take the parts that don't work so well, provide wrappers around them and give a consistent interface, and we'll see how it goes. [Audience question: data frames are two-dimensional, but with multi-dimensional modelling, arrays and so on, we run into some performance issues; how do you feel about that?] Yeah, absolutely. So I think there are
lots of different ways of thinking about these types of structures, and one thing I have found helpful is talking to people in different domains and disciplines, who think about this very differently. If you talk to physicists, these are tensors, or multi-dimensional arrays, and some of the art of making this stuff useful is putting it into a framework that is already familiar; if you've used R, that by and large means data frames. Certainly R does have matrices, and it does have arrays, or tensors, but by and large they're not terribly well supported in the language. There's no reason they couldn't be: someone could sit down and build out all of those tools, but it just hasn't happened, whereas there are lots and lots of tools for data frames, and they're very useful. One thing along those lines: Jan de Leeuw, a now-retired UCLA statistics professor, has been writing about APL. I don't know if you've heard of APL before, but it's kind of the original programming language for manipulating high-dimensional arrays. People make fun of Perl for being unreadable, but APL originally had a special keyboard that you needed to use with it; it's incredibly, incredibly terse, but the ideas are really well thought out, and so Jan has been pulling some of those ideas out and writing R functions around them. I don't know; thinking in arrays doesn't come naturally to me, because I just use data frames all the time, but there's absolutely an equivalent set of tools you could build around arrays; it just hasn't happened. Most people use R with data frames, and if you do work with arrays, you're almost always better off converting to data frames, because things become much simpler. [Audience question about data.table.] I think basically you should ignore data.table until you have
more than 10 gigabytes of data. data.table is incredibly, incredibly fast, and I personally believe it has sacrificed some stuff on the altar of that speed; that's my personal opinion, and some people don't agree. One big difference between data.table and dplyr is that data.table does everything basically using subsetting: in data.table there's one function that does everything that fifteen functions do in dplyr. And I think you can tell whether you're going to prefer data.table or dplyr by thinking about that. Some people think: one function that does everything? Why would you want fifteen verbose, wordy functions? That's so much typing; data.table is way better. And other people think: well, you've got one function that does everything; how do you possibly remember how it works? If you're that type of person, dplyr is going to feel more natural. So I think there's some kind of breakdown, whether it's psychological or experiential, but some people are very, very passionate lovers of data.table and other people are very passionate lovers of dplyr. Personally, the type of problems I work on do not need the speed of data.table, and I just find dplyr easier. Obviously it's easier for me to understand because I wrote it, but I do genuinely believe that for most people it's easier to understand, because it is more verbose: you can read it as a sequence of operations, and you've got all of these hints there, so that when you come back to code you wrote three months ago, you've got much more to remind you of what's going on. data.table is very, very concise: you can probably write code faster if you're using it every day, and you can read it very efficiently, but I personally believe it's going to be harder to understand in the long run, and harder to communicate with
other people. Okay, thank you.
Info
Channel: Psychology at the University of Edinburgh
Views: 60,086
Rating: 4.9694991 out of 5
Keywords:
Id: rz3_FDVt9eg
Length: 69min 38sec (4178 seconds)
Published: Wed May 11 2016