Hadley Wickham | State of the Tidyverse 2020 | RStudio (2020)

Captions
[Music]

This is Hadley Wickham, and take it away.

Thanks, Mara. So I wanted to talk, unfortunately not about visualization, but a little bit about the tidyverse generally. I wanted to give you a bit of an update: where are we today and where are things going in the tidyverse. I'll talk a little bit about tidy eval, which I know has caused some unhappiness in various places in our community, and then a little bit about what we're doing to address some of those problems in the future.

I made a few plots to show how things are going. The first one is just the cumulative number of package downloads, 2018 on the left, 2019 on the right. This data is not super trustworthy. I would love to believe that three quarters of a million people decided to use the tidyverse, I guess basically on my birthday, but I think the main takeaway here is that the numbers are continuing to grow. It really just continues to amaze me how many people find these tools useful. The other thing that I think is really fascinating is that 2019 was the first year that more people downloaded dplyr than ggplot2. I don't really know what that means, but I think that's kind of interesting.

So who works on these packages? My team at RStudio encompasses a bunch of people: Jim Hester, Thomas Lin Pedersen, Gábor Csárdi, Romain François, Jenny Bryan, Max Kuhn, Lionel Henry, me, Mara Averick, and Davis Vaughan, and very, very soon we'll be joined by Julia Silge, on Monday. But as well as us, there's also a huge number of people in the community who contribute to the tidyverse. Last year, in fact, we had 236 unique contributors, which is just so amazing and feels so wonderful to me. I also made this chart of the number of unique contributors, comparing 2018 to last year. We ended up with quite a few more unique contributors last year because of two main events. These are the tidyverse
developer days, which we did for the first time last year, and we're going to keep doing them in the future. These happen after rstudio::conf and after useR!. They're basically a chance to get a bunch of people interested in the tidyverse, or in contributing to the tidyverse, in a room with helpers, and we do a bunch of PRs to fix a bunch of issues. You might also notice that there was a big jump in May of 2018. I did a little investigation to figure out what this was: it was actually me going into a PR-merging frenzy in ggplot2. There were a lot of PRs that had just been sitting there, so we got about 20 new contributors in one day because I merged all those PRs.

A few highlights of this year. I'm really excited that we now have a paper that you can cite if you want to cite the entire tidyverse, which you can get by using the citation() function. The idea of that paper is that rather than having to cite every package you might use individually, there's just one place you can cite. It's a published article in the Journal of Open Source Software. I have to admit that I'm slightly addicted to checking the citation count. Normally when you check the citation count of a paper, it changes maybe once every six months or something, but we've already had 17 citations in the two months it's been published, which is really, really amazing.

We also introduced embracing, the double curly braces ({{ }}). I'll talk about that a little bit later, but it's part of our efforts to make tidy evaluation easier and more approachable, so you don't have to learn about all the theory in order to use it.

Then there's the vctrs package, which is a bit of a mystifying package in some ways, because if vctrs is successful in its goals, you will never know that it exists. The goal of vctrs really is just to make things work more consistently behind the scenes, so that when you use a function in one
package, your predictions about functions in other packages are more likely to be correct. That's the main impact for data scientists. If you're a developer as well, it makes it much, much easier to create new types of S3 vectors. So if you have a new type of thing, something that makes sense as a column in a data frame, vctrs provides a really nice set of tools for this. If you want to learn more, you can check out, when the video comes out, Jesse Sadler's really nice talk about how you can create new types of vectors using this package.

Another really cool package, which I think reflects some development processes and coding practices that we'll see more of in the future, is the vroom package. vroom is a really, really fast way of getting data off disk, from CSV files or TSV files or whatever, into R. I think there are two things that are particularly interesting about it. The first is that it makes heavy use of C++11 behind the scenes. This is our first major use of C++11 in the tidyverse, because we've been a little bit worried about whether enough people can use C++11, even though it is now nine years after 2011.
But the main advantage here is that it gives us really easy access to multi-threaded computation, so vroom will use many, if not all, of the cores available in your computer to speed up data ingest. The other thing that's really interesting about vroom is that it makes extensive use of a new technology in base R called ALTREP, which basically allows us to lazily load the data in the file. If you're working with a very large data set, vroom is only going to read in the data that you actually touch. So if you read in a 10-gigabyte file but you only look at one column or a small set of rows, it's only going to read that data into memory, which of course makes things much faster and saves memory.

And then last but certainly not least, another big thing that's been happening this year is that Max and Davis have been working on tidymodels, a collection of packages to bring modeling into the tidyverse. I think that's really coming up to speed this year with Julia joining that team. People are already using tidymodels effectively, but I think in the next year that's really going to become a killer set of tools.

A few things to look for that I'm excited about for 2020. I don't want to make any predictions that turn out to be famously wrong, and I will not be declaring that 2020 is the year of anything unless I want to kill that thing. But a few things that I'm excited about: we've been putting a bunch of work into dplyr 1.0. This is one of the first times we've done a full-court press, so there have been four or five people from my team working on this in various ways. There's a completely new implementation, which makes it much, much easier to add new features, and it's going to provide a really, really solid base for future extensions. We're also working on adding more problem-oriented documentation. I think we've always had pretty decent tutorials and reference
documentation, and we're now working on more documentation that helps you solve specific problems.

We're also going to see less purrr, I think, or at least, if you're teaching an intro data science course, you'll be able to teach purrr much, much later, as we provide tools that eliminate the need for purrr in some of its most important uses. vroom, for example, allows you to slurp up an entire directory of CSV files in a single call. tidyr has new functions for rectangling complicated or deeply nested JSON data. And dplyr is going to be bringing back the rowwise() verb. So there should be many, many fewer times that you need to use purrr when you're doing data science. I still really, really strongly believe in purrr and functional programming as a programming toolkit, but you won't be forced to learn it just to do data science.

I'm also really excited about googlesheets4. We've got some really great ideas, I think, for making that even more flexible. Jenny is working hard on that, and I am personally really hopeful that Google Sheets will be our primary data source for rstudio::conf next year, which will avoid a bunch of problems that we had this year with keeping data synchronized in various places.

What I did really want to talk about, though, is tidy evaluation, because I think we did make some mistakes. Here's a provocative question that someone asked a little while ago: will tidy eval kill the tidyverse? I'm pretty confident the answer is no.

[Music]

But if you haven't seen tidy evaluation before, one of the challenges of programming with functions from dplyr and ggplot2, for example, is introducing indirection. In dplyr and ggplot2 you normally provide the name of the variable directly, but what do you do if you want the user to supply the name of the variable in a function? Well, previously
we had this technique where you used bang-bang (!!) and enquo(). Bang-bang required that you learn about the theory of quasiquotation, and enquo() made you think about these complicated things called quosures. Now we have a system called embracing, which, along with a few other techniques, should allow you to solve the vast majority of problems, which means that you don't have to learn the theory if you don't want to. We've also been working on lots of articles to explain this. I think I have these up again at the end if you want to grab them.

But I wanted to talk about the mistakes we made. I think the first mistake is that this problem as a whole was much, much harder than we thought. I know there have been at least five points where I was like, "Yes, we finally understand how all of this should work," and at every point I was wrong. Except for today: I'm reasonably confident, more confident than I've ever been before, basically because we've been doing a much better job of creating this problem-oriented documentation, and we can see from it that we can solve the problems that most people are having.

The next mistake we made is that the theory really is beautiful and elegant. It is, I think, pretty much unique amongst programming languages, because very few programming languages have a combination of first-class environments and computing on the language. It's really interesting, and I think it's really cool, and Lionel, who has worked on this a lot with me, thinks it's really cool, but most other people do not think it's really cool. I think it's totally still worth it if you want to learn about the theory because you want to learn more about how these things work. But generally, if you're a data scientist just trying to solve some problem, the cost-benefit for spending the time to learn the
theory just wasn't there. And then I think we also ended up introducing too much vocabulary. We wanted to be precise, and so we ended up creating a lot of vocabulary that just ended up overwhelming people.

So, a couple of our takeaway messages from this. One is that just being aware of the problem is important. We are now more aware that there are things that are really exciting to us, but that the rest of the R community is not going to be very excited about. So we're going to try to make it clearer what the status is of the various things we're working on. Is this something that we think is really cool, but it's pretty experimental and maybe it's going to change radically? Or is this something that we now have grave doubts about, and maybe it's on its way out in the future? And we're thinking about how we can get feedback on ideas that need more feedback without forcing everyone to think about those issues. One of the ways we're doing that is by trying to be clearer about where functions live in the life cycle of a function.

So what is the life cycle of a function? Well, the place where most functions live is this stable stage. We're pretty confident that this is a function that does what it says on the can, it's useful, and we don't think it's going to change majorly. Some functions, when we introduce them, we're like, "Seems like a good idea." When we first started introducing the pipe, I was like, "Well, this seems like a really cool idea, but no one's going to understand how it works." And it turns out that doesn't actually matter. The pipe is useful because it allows you to write code in a really clear way. It doesn't actually matter that you don't understand how it works; you can still use it.

There are other things that we have kind
of... So sometimes we think we have a really good idea, and then later on we're like, [Music] "Maybe that wasn't such a good idea." So we're starting to label those functions now as questioning, when we're not sure if those functions are a good idea or not. Now, sometimes we will think about it more and they'll end up back in stable. Other times it'll be, "Oh, okay, we've finally figured out a better way to implement this function, or a better way to solve this problem," and those functions become superseded.

Just to give you an example of this: in dplyr 1.0 there were two functions, rowwise() and do(), that a lot of people really liked and that have been in questioning for a while. We figured out that rowwise() can actually go back to being stable. I figured out why I didn't like it and fixed that problem, so it's stable again. For do(), we've figured out a better way of solving the problem, and so that becomes superseded.

Superseded functions, which I'll talk about a little bit more shortly, are functions that we don't think are the best solution anymore, but they're not going away. We've got a better approach, but you don't need to worry that if you're using these functions we're going to yank the rug out from under you. We're not going to. When we want to be clear that a function is going away, we'll tell you it's deprecated, and then when it eventually gets removed altogether, it becomes defunct.

So just to contrast deprecated and superseded a little more clearly: a deprecated function is clearly on its way out in the near future, which probably means in the next year or two. You'll be warned when you use it, so whenever you use a deprecated function you will get a warning. Currently you get that warning once per session. I think maybe that's not warning you quite enough. We don't want to warn you every time you use it, because if we warn you every time you use it, it's basically just as
annoying as taking the function away in the first place. So we're still working on getting that balance right: how do we gently nudge you that maybe you should be moving away from this function without getting in your face all the time? Two examples of that: the tbl_df() function in dplyr, which is one way you used to create tibbles before we called them tibbles, is on its way out. Another is dplyr's, um, do(). Generally, unless it's a very niche function, a function will not go from stable to deprecated immediately. Normally it will go through a questioning phase first, just to make sure you've got plenty of notice if we're uncertain about something.

Now, to contrast this, there's the superseded life cycle stage. We've also called this retired in the past. My thinking was, when you retire you're not actively working anymore, but you're still a productive member of society. But when people heard that a function was retired, they were like, "Oh, they're going to take it out back and shoot it in the head."

[Music]

So we're changing the name to superseded, to make it clear that we think there's a better alternative, one that's maybe easier to use, or faster, or easier to learn, or more powerful, or something, and we think you should learn that new approach when you've got some free time. These superseded functions are not going anywhere. They're going to hang around for a long time, but they're not going to get any new features, and they'll only receive really critical bug fixes. A really good example of this is spread() and gather(), for reshaping in tidyr. They've been around for a long time, and hundreds of thousands of people rely on them. We think we've got a better approach with pivot_longer() and pivot_wider(), an approach that hopefully you can actually remember how to use. But spread()
and gather() are not going away. They will probably go away eventually, but in the case of spread() and gather() that's probably at least five years away. So we do encourage you to switch. We will change them in our documentation and in our books so that we don't teach them and we don't advise them, but those functions will live on for a long time.

And then on the other end of the spectrum we've got these experimental functions. These are functions that we have played around with internally, and we think they're kind of interesting or kind of cool, but we're not 100% sure about them. Maybe they'll go away, maybe they'll stay; we just don't know. So if you are adventurous, we want you to try these functions out and tell us: do you love this function, do you hate this function, what's going on? We can use that to inform our decisions. But don't use them in critical code. They may get removed in the future, so don't rely on them 100%. If they are useful to a lot of people, we'll keep them around and hopefully remove that experimental label.

Really, all of this is in service of this tension: we want to give you a stable foundation that you can build upon and rely upon, but at the same time, because design is fundamentally iterative, we know we can't get it right the first time, and sometimes we want to be able to have a go at it a few times before we are confident. So we really just want to make it clear where functions are on this spectrum, so we can continue to build the stable foundation while trying to get it right in the long term.

So just to finish up, I think the big messages here are: the tidyverse seems to be doing pretty well. Lots more contributors; I'm really excited about how the community is contributing to the tidyverse. We messed up with tidy eval, but I think we've learned from it, and one of the things we've learned is that making it more
clear what the life cycle is will hopefully make it easier for other people to know what's going on. Thank you.

Thanks, Hadley. We have time for a couple of questions, but if yours doesn't get asked, I'm sure Hadley's really easy to find and has free time for you to ask him afterwards.

Is vroom eventually replacing readr?

Yeah, so vroom will not replace readr so much as readr will use vroom under the hood. So in the long run you would just use readr, and it'll use the fast vroom code. That should hopefully happen in the next year.

Why less purrr? It's awesome.

Yeah, that's the statement. purrr absolutely is awesome, but one thing we see when teaching newcomers to data science is that there are a few things that you just want to be able to deal with. Say you've got a directory full of CSV files. That is such a common problem that it's really nice to be able to deal with it before teaching about functions or functional programming or iteration. So for the most important tools, I think having some high-level way to express what you want is a little bit easier. It's not about not teaching purrr, it's about teaching purrr later in the curriculum. It's still really, really powerful. I still love purrr, and it's not going away, but we're not forcing quite as many people to use it.

Will there be a tidyverse solution to interactive web graphics? ggplot3, question mark?

Thomas and I are arguing about that already, so maybe. Interactive graphics is still something that's very near and dear to my heart. I think we now understand some of the data structures, and understand how we could fix some of the mistakes in ggplot2, but it still doesn't feel to me like quite the bottleneck, the critical problem that we need to solve next. But hopefully in the long term.

And then the last one I think we have time for is: will
vroom be faster than fread() in data.table?

vroom is already faster than data.table in some circumstances, basically because it's lazy: it doesn't do as much work as data.table's fread() if you don't have to work with all of the data in the data set. If you go to the vroom website, Jim Hester, who's the author of vroom, put together a bunch of benchmarks, so you can see where it does better than fread() and where it does worse than fread().

Well, thank you so much, Hadley.

Thank you.
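
For readers who have not seen the two styles of indirection contrasted in the talk, the following is a minimal sketch in R and is not taken from the slides. It assumes dplyr 1.0 or later is installed and uses the built-in mtcars data; the function names mean_by_old and mean_by are hypothetical, chosen only for illustration.

```r
library(dplyr)

# Older style: capture each argument with enquo(), then unquote it with !!.
# (mean_by_old is a hypothetical name used only for this sketch.)
mean_by_old <- function(data, group_var, value_var) {
  group_var <- enquo(group_var)
  value_var <- enquo(value_var)
  data %>%
    group_by(!!group_var) %>%
    summarise(avg = mean(!!value_var))
}

# Embracing: wrap the argument in {{ }} at the point of use.
# (mean_by is likewise a hypothetical name.)
mean_by <- function(data, group_var, value_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(avg = mean({{ value_var }}))
}

# The caller passes bare column names in both cases.
old <- mean_by_old(mtcars, cyl, mpg)
new <- mean_by(mtcars, cyl, mpg)
```

Both calls should return the same table, with one row per distinct value of cyl. The embraced version does the capture and the unquoting in a single step, which is why it can be used without learning the underlying theory.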
Info
Channel: RStudio
Views: 7,559
Rating: 4.9414635 out of 5
Keywords: rstudio::conf(2020), Hadley Wickham
Id: OwwYfxB8CA0
Length: 23min 22sec (1402 seconds)
Published: Sun Dec 20 2020