Programming with dplyr and the tidyverse

Video Statistics and Information

Captions
Good evening. Before I start: tonight's topic is going to be programming with dplyr, so how you can write functions that use dplyr functions inside of them. But before that, I would just like to ask you to fill out the survey that I started today. It's about typing speed and keyboard layouts, and I will link it in the description below. It will take you two minutes: there's a one-minute typing test, you input your words-per-minute score, and you answer some basic questions. What keyboard layout do you use? What language do you type in most, English, French or whatever? And what's your job, if you want. The job is optional, but it'd be great because it would allow me to look at whether there is any difference depending on the job. So I will link that in the description, and it would be great if you could fill out that survey.

Another thing is that we are approaching 500 subscribers. I will make a very, very special video once we hit 500, maybe I'll even release it a little bit before. If you're not subscribed yet and you find this video and the info I share useful, maybe consider subscribing.

Something else before we start: the two or three videos I did just before this one, and the blog post I wrote on quosures, attracted a lot of attention. I got a lot of feedback, which is great, and I will address that. I've already answered the people in writing, but I think it would be interesting to make a new video where I discuss the feedback that I got, because there is a way of actually doing what I did without quosures. I still think that there is an advantage in using quosures, and I just want to explore that and maybe get a new discussion going. I think it was the first time that one of my videos and blog posts attracted so much feedback, so it's really a great thing. I'm
really enjoying that.

So, tonight's topic: programming with dplyr. If you've watched this channel since the beginning, you've seen me write a lot of code where I use tidyverse functions: functions from dplyr, functions from purrr, etc. I think I've always used them in interactive, data-exploration ways: I write code, I run it, I look at the results, I rewrite some code, I run it, etc. However, instead of running your code interactively, you could of course write some functions that you call at very specific points in time, maybe once a week or every day or whatever. These are functions that run something that is always the same and that you need to run very often. If you have code that needs to run a lot of times, it's best to put that into a function and then just call that function. Maybe you want to let that function run on some kind of server that will extract some data from a database and run your R code; maybe it's running a model or it's plotting some stuff. There are many use cases for that.

I will show you how you can write a function using dplyr functions. More generally, I don't think every tidyverse function would work yet, but dplyr for sure, tidyr for sure, purrr as well I think, and ggplot2. So you have, I think, the most important functions, but probably the others as well by now.

Let's take a look at what I want to do. Let's imagine the problem I have is the following. I have a population and I know the characteristics of this population: I know how many men and women there are, I know, in this case here, what their religion is, and maybe some other variables. I know that because, we can imagine, it's census data. I know that from the national statistical institute; they run these censuses every ten years or so,
or five years or whatever, and I have these general characteristics of my population. Now imagine that I run a program, and people can voluntarily apply to this program, and every week I get a new batch of people that applied and that went through this program. Every week I want to look at my sample and see how different it is from my population, and I want to do that according to several variables: race, religiosity or religion, or whatever. It could be anything else: age, citizenship, a job title, whatever you want and whatever you need.

For that I use the gss_cat data. Maybe let's take a look at it first. So this just gives, for several types... maybe let me put my face up here. So this is one person. That person is white, 26 years old, never married; the year this was recorded was the year 2000; has an income between 8,000 and something, doesn't really matter; is an independent; Protestant, Southern Baptist I guess, I don't know, I'm not very familiar with American denominations; and then how many hours that person watches, I guess, I hope, per week, because if that's per day, that's a lot. It doesn't really matter. We can imagine that these 21,000 persons are our whole sample, or rather, let's assume this is our population. In this case it's micro data, so I have the micro data of the whole population; it could happen in some instances.

And imagine that these are two samples. This is the sample of the year 2000, this is the sample of the year 2002, but let's assume it's a weekly sample, week one, week two. Maybe I'll rename that. It's just to illustrate my code, so it doesn't really matter if it's week one, week two or whatever. Now, this here is the frequency of the race, and the frequency of, actually this is not very... the frequency of the religion. So these are the
characteristics that interest me in the whole population. So I know that in the whole population I have this many white, black and other people. Just as a parenthesis, this race type of variable is always very confusing to us Europeans, because we don't have such variables in our data. If you're a social scientist or an economist, you never have a variable like that. Maybe in the UK, I guess, they have something like that, but in continental Europe you never see that. Anyway. And religiosity as well.

So maybe let's add a little bit more. Let's add the total, which will simply be the sum of n, and maybe also add the frequency, just to illustrate, and maybe let's call that frequency in pop. Yeah, I guess that's not a bad idea to add. Let's do the same down here. Let me just correct this. Yes, okay, so this should now look better. Oh yeah, that's because I wrongly named this thing. So now I have my religion, and most people are Protestants, so I guess, yeah, it's probably US data, then Catholics, then none, and so on. Very interesting. Same for the race. Now I also have my percentages here. That's great.

Now I have two weeks of data, my week one and my week two, but maybe this is a program that lasts for a whole year, so I'll have 52 data sets, and every week I have to compute this thing, but in my samples. So I wrote a little function, but before looking at the function, let's just look at what I need to do. I could simply copy and paste this thing, and this should give me what I want. So this is the frequency in that week, and I want to compare that to, let's say, the religiosity in my population. It should be very similar, I guess. Well, there are some differences, but it shouldn't change too much. But you
know, maybe if this was a real example, every week you'd have very different types of people joining your program. Maybe because of holidays you'll have younger people coming, students maybe, and maybe in winter you'll have fewer people just because it's cold outside, something like that. So maybe you'll have something very different every week. It doesn't matter. Again, if I want to do that every week, I would need to copy and paste this code every time, and I would also have to copy and paste it for every variable that I want. Here I only have two, but maybe I have dozens of variables that I need to look at, or maybe I just have a much more complex example than this.

So you could be tempted to write a function that would look something like this. What this function does is take as input my sample data frame and a variable. Let's see what happens if I try to run this. I take my gss_cat week one, and the variable I want is relig, for example. And it's not working. R is complaining that variable is not found. This is weird, because I gave my variable here, it's well defined, there isn't a typo here, so this should work.

The reason this isn't working is the following. count here, and it could be any dplyr function, could be group_by, could be summarise, select, filter, whatever, this function is not looking for relig. It's looking for a variable literally called variable in my data frame sample_df, and this variable does not exist. So I need count to understand that this is not the variable's name, or this is not the variable itself; rather, the variable is what I give here. I'm not sure if this is going to work, but let's
try. I think if I look at environment... it's not exactly, or is it? I don't know. There is a way, I think, to look at the environment. I think if you do ls, this is just going to list, yeah, the variables. I think there's a way to look at the data frame as an environment, because it's kind of what's happening here. count is not looking at, let's say, your global environment or what you defined here; it's looking, as I said, for a variable called variable inside sample_df, and this variable does not exist.

So you have to make a little change here, which is to say that variable must be quoted. This goes back a little bit to the quo function that I showed in my previous videos. But this is not quo, this is enquo, and I'll explain the difference in a bit. Let's first try to run this. Now this is working. So I did two things. First, I used enquo on my variable. Second, I used this, which is also a function, called bang-bang, that unquotes this variable. To be quite honest, I don't know the exact mechanics under the hood, what's really going on; I think this is probably very complex. But in a sense, what's happening here is that this variable now is a quosure, and it gets evaluated here, or I guess instantiated, if I want to use the same words as in my previous video. Here count knows: oh, I have to replace this variable name by what the user gave me, being religiosity or religion, and I know that this is something that is here, because I'm looking for relig now. I'm not looking for variable, I'm looking for relig.

Let's try outside of a function. If you're doing that outside of a function, you have to use quo. quo is outside of a function, enquo is inside of a function. So if I just type something like quo variable, this is a quosure. And by the way, variable is
not defined in my global environment. variable does not exist, but this doesn't matter for quo; it just becomes a quosure. Now if I evaluate that, let's try, I think I can use parentheses and it should, oops, should work. Yeah, so they can only be... okay. Well, it wouldn't work because variable does not exist, but this can only be unquoted in, they say, a quasiquotation context, basically inside another dplyr function. So if you run this in the environment it's not going to work; this is what they explain here. If you run this inside mutate, it's going to work. Anyway, what's important to remember is that if you want to work with variables from your data frame, you need this enquo and bang-bang trick, or workflow.

There's something else I want to show you... yeah, and then, just to finish, you could run this with, what was the other one, with race. So this is now doing the race thing. Oh, and by the way, because this is not exactly computing what I wanted, I could now just run this, and maybe instead of in pop, in sample. And yeah, I get my frequency in my sample, and if I compare that to the frequency in the population, I see some slight differences. So now I could merge these data sets. I could actually write a new function that would take both these data sets as input to merge them and maybe show that in a plot, etc.

Something else I want to show you: maybe I need something a bit more complex. Let's paste that down here. Maybe I need something with two variables; maybe I want to count over two dimensions. So I will just rename this here, and instead I will now use quos. This will create a list of quosures, and I will unquote that with three bangs. So let's see how this goes: compute freq in sample two, gss_cat week one, let's go with race and with
relig, and it's not working. Oh yeah, it's not working because I actually need to call this dots, and I will explain why I need to put in dots and what the dots are. And now it's working. So why didn't it work before, and why is it working now? Let's go back to before. Before, I had one variable. Well, I used the plural, but it's still one variable, called simply variables. This is a single variable, and quos expects several variables. So the dots, what the dot
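
Code sketch (not part of the captions): the captions describe the code on screen but do not reproduce it, so what follows is only a minimal reconstruction in R of the three steps discussed, under stated assumptions. The function names compute_freq_naive, compute_freq_in_sample and compute_freq_in_sample2, the object gss_cat_week1, and the column names total and freq_in_sample are guesses made for illustration. The dots version here uses enquos(), which may not be the exact call shown in the video.

```r
library(dplyr)
library(forcats)  # provides the gss_cat data set

# Assumption: the year-2000 rows stand in for the "week one" sample.
gss_cat_week1 <- gss_cat %>% filter(year == 2000)

# Naive attempt: count() receives the symbol `variable`, not the column the
# caller passed, so the call fails.
compute_freq_naive <- function(sample_df, variable) {
  sample_df %>% count(variable)
}

# One variable: capture the argument with enquo(), unquote it with !!.
compute_freq_in_sample <- function(sample_df, variable) {
  variable <- enquo(variable)
  sample_df %>%
    count(!!variable) %>%
    mutate(total = sum(n), freq_in_sample = n / total)
}

# Several variables: capture the dots with enquos(), splice them with !!!.
compute_freq_in_sample2 <- function(sample_df, ...) {
  variables <- enquos(...)
  sample_df %>%
    count(!!!variables) %>%
    mutate(total = sum(n), freq_in_sample = n / total)
}

# The naive version errors; try() keeps the script running.
naive_result <- try(compute_freq_naive(gss_cat_week1, relig), silent = TRUE)
inherits(naive_result, "try-error")

compute_freq_in_sample(gss_cat_week1, relig)
compute_freq_in_sample2(gss_cat_week1, race, relig)
```

The one-variable and dots versions differ only in how the argument is captured and unquoted, which is why the same body can count over one dimension or two.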
Info
Channel: Bruno Rodrigues
Views: 939
Rating: 5 out of 5
Keywords: tidyverse, rstats, dplyr, quosures
Id: W3e8qMBypSE
Length: 18min 44sec (1124 seconds)
Published: Fri Mar 26 2021