Dplyr Essentials (easy data manipulation in R): select, mutate, filter, group_by, summarise, & more

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments

This is the best introduction to dplyr I've seen, thank you.

πŸ‘οΈŽ︎ 5 πŸ‘€οΈŽ︎ u/Prospects πŸ“…οΈŽ︎ May 10 2021 πŸ—«︎ replies

Great video! I find the .$ syntax in a dplyr pipe to a bit more jarring than than pull(), but that's just my taste.

πŸ‘οΈŽ︎ 5 πŸ‘€οΈŽ︎ u/YepYepYepYepYepUhHuh πŸ“…οΈŽ︎ May 10 2021 πŸ—«︎ replies

Really nice intro, I recently started using dplyr instead of data tables and haven’t looked back. Dplyr is so much easier to read!

πŸ‘οΈŽ︎ 3 πŸ‘€οΈŽ︎ u/nilliewelson106 πŸ“…οΈŽ︎ May 10 2021 πŸ—«︎ replies

Very nicely done - especially the pacing.

Are you looking to cover other packages in the future?

πŸ‘οΈŽ︎ 5 πŸ‘€οΈŽ︎ u/zebulo πŸ“…οΈŽ︎ May 10 2021 πŸ—«︎ replies

Great little dplyr intro πŸ‘

πŸ‘οΈŽ︎ 2 πŸ‘€οΈŽ︎ u/quant_ape πŸ“…οΈŽ︎ May 10 2021 πŸ—«︎ replies

This is great, you have mad (ie, really great) teaching skills. Subscribed!

πŸ‘οΈŽ︎ 2 πŸ‘€οΈŽ︎ u/davidharper2 πŸ“…οΈŽ︎ May 10 2021 πŸ—«︎ replies

Dplyr goes so hard

πŸ‘οΈŽ︎ 2 πŸ‘€οΈŽ︎ u/MrCoachKleinSaidICan πŸ“…οΈŽ︎ May 11 2021 πŸ—«︎ replies

This is excellent, love it!

πŸ‘οΈŽ︎ 2 πŸ‘€οΈŽ︎ u/PFC-Qc πŸ“…οΈŽ︎ May 11 2021 πŸ—«︎ replies

Outstanding video. Well done.

I have already referred to others.

πŸ‘οΈŽ︎ 1 πŸ‘€οΈŽ︎ u/Ratmanman1 πŸ“…οΈŽ︎ May 12 2021 πŸ—«︎ replies
Captions
so you've gotten familiar with the basar syntax and you're starting to realize it takes a lot of effort to manipulate and work with data frames even basic things like filtering rows and selecting columns may start out simple but the more you want to do the more variables you have to make and the more unreadable your code becomes enter dplyr an art package for data wrangling that allows you to easily chain together operations speeding up your workflow and it makes your code much more readable in this video i'll cover the essentials of dplyr and go over some common data manipulation scenarios you'll likely face the first step is going to be to install the dplyr package once with install.packagesd plier and then load it into your project with library dplyr the data frame we're going to be working with is from kaggle and it contains the billboard hot 100 songs of the week for each week dating back to 1958 i trimmed the data to only include the last 10 years but i'll leave a link in the description if you want to play with the data set looking at the data set there's a column for the date that week the song title and its rank for that week along with some more metrics about the song's rank history and how many weeks it's been on the billboard hot 100 so let's start learning some deep player the first concept we're going to cover is something you'll see being used in deployer a lot the percent greater than percent is used for piping which is when everything that comes before the pipe is evaluated and then passed in as the first argument into the function that comes right after the pipe so if we want to grab the top 10 rows of our data frame in base r we might do something like head billboard 100 and then 10. but with piping it would be billboard 100 piped into head 10 and that gives us the same thing it obviously doesn't make much sense to pipe in this case but it does end up making our life easier when we start chaining functions together another quick thing to note is that when you want to pass what comes before the pipe into your function as something other than the first argument you can explicitly specify that with a period and piping doesn't only apply to data frames so we could write the same expression as 10 piped into head of billboard 100 and then comma period so the 10 would go in that spot and we can run that now let's start working with our data frame there are a few functions in dbplier i want to go over and we can group them into a few different categories there are the column operations select and mutate the row operations filter distinct arrange and then the group operations group by summarize and count the first the player function we're going to look at is select which is pretty self-explanatory it selects the columns of your data frame that you specify and then drops the rest so for this example we want to select the date the rank the song the artist and the week's on board columns and we can do this by passing in our column names as arguments into select so after billboard 100 we will call the select function with our pipe and we'll select the date the rank the song the artist and weeks on board because it's using these special characters these hyphens we actually have to put it in either quotes or back ticks so we'll write weeks on board and we can run this and now we just have those five columns to make it a little easier we can actually use the colon between two column names and it'll grab both specified columns and then the ones in between so we'll copy this code and paste it right under and instead of date rank song and artist we can just do date colon artist and this is great if you have a lot of columns and i also want to rename this week's on board column to weeks popular and i can do that in the same select function just by writing weeks popular and setting it equal to weeks on board and if i run that i get the same data frame as before but now this column is renamed another way of selecting would be by explicitly writing the names of the columns we don't want to select and then adding a minus in front of it so in this case if we look at our original data frame we're getting all the columns except for last week in peak rank so going back to our code for select we can put in minus last week minus peak rank and that'll give us the same thing except we want to use that other function to rename the week's on board column also one quick thing to note is that each time i'm running this code i'm not actually rewriting or overwriting the billboard 100 data frame i'm just printing out what the result is if you wanted to save it you'd actually have to reassign that variable but as a general rule i'd actually recommend not using the same variable name because you might run into a situation where you can run your code once and it works fine but if you run it again you'll get an error because we renamed this column weeks on board to weeks popular and when we run it the second time that weeks on board column no longer exists so i'm actually going to go back to the top and reset my data frame by reading in the original csv and i'll get rid of this but going back to our select function there are actually a few more ways of selecting columns because you can use helper functions i'm not going to cover them in this video but i will leave a link to this cheat sheet down in the description and it includes more info about the select function but also all the other dplyr functions so next we're going to move on to mutate if you want to add new columns to our data frame we can use the mute command similar to select it can take any number of arguments and it allows you to specify what you want the new name of the column to be so if we go back to our billboard 100 data frame we see that there are a lot of rows where there's a featuring artist and i think it'd be interesting to have a new column that tells us whether or not the song was a collaboration or in other words whether or not the word featuring appears in the artist column so we're going to pipe again and call the mutate function and we'll create a new column called is collab which is just whether or not the song was a collaboration and we're going to use the grp function here which is built into base r and it takes in a pattern to search for and then the text to search within and it'll return either true or false for each of these rows so we're searching for the word featuring and we're searching in the artist column and if we run that we get this column at the end called iscolab and a quick little trick if we want to have the artist and is collab columns to be the first two columns in our data frame we can actually create another select and then select the artist column is collab column and then everything which will just grab the rest of the columns and now you can see that it's reordered and let's just go through a couple pages just to make sure that his collab looks right so it looks good there yep these look good to me so those are the two main column functions now we're gonna go on to the row functions and we're gonna start with filter so filter can extract rows that meet one or more criteria so if we wanted to select all the songs where the number of weeks popular is greater than 20 we can go down here we'll paste this code in just to get our data frame and we're going to filter for weeks popular is greater than or equal to 20 and run that and now we only get songs in our data frame where it meets that condition and as i mentioned we can pass multiple arguments into filter if we want multiple conditions so we want the songs where weeks popular is greater than or equal to 20 and also the artist is equal to drake and now we get only his popular songs if we want to use the or conditions so we want where the songs were popular for more than 20 weeks and the artist is either drake or taylor swift we could use the vertical bar and say that the artist is equal to drake or artist is equal to taylor swift and now we get both of their songs but it still meets that first condition that week's popular is greater than or equal to 20. now another quick note is that these are only going to grab the songs where the artist name is exactly equal to taylor swift or drake but not the ones where there might be a featuring artist just because the strings won't be equal so now let's say we want only the song titles of these drake songs if we were to filter him and select the song row we'd actually have a lot of duplicates as you can see here because a lot of these songs were on the hot 100 for multiple weeks so that's where the distinct function comes in and you can pass whichever columns you want and it'll give you distinct rows so we'll paste the code that we wrote before and we just want the drake songs we'll add another function onto the end of this called distinct song and i want all the drake songs so i'm going to get rid of this other filter and we can run that and now we get 95 of his songs but it's still in a data frame if we wanted this in a vector format we can use that period that i mentioned earlier for piping and essentially we can use it as a placeholder for the data frame so now it's signifying that this is the data frame and then we can grab that column called song so if we print this out it'll give us a character vector with the song names now on to the grouping operations these are a little trickier to work with but they become really powerful for doing data analysis if we wanted to get all the drag songs and order them by the week's popular column it'd be a little difficult because the week's popular column can be different for the same song as you see here this is because billboard releases its data each week so if a song is on their hot 100 again a sweet popular column will be higher this week than all the previous weeks now one solution would be to group the songs by song name and then determine the max number of weeks popular for each of those songs so our end goal is to essentially get a data frame of distinct song names along with the number of total weeks they showed up in the hot 100 so we'll first start by calling the group by function and we want to group by the song name now this group by function if we run the code it actually won't do anything to begin with it's implicitly grouping our data but you need to call one of the group functions on it afterwards so we're going to call the summarize function which essentially summarizes data into a single row of values for each of the groups and we need to pass in a summary function so we're trying to find out the highest number that weeks popular is for each of these song titles so we're going to use the max function so we'll call max week's popular and just like before with select and mutate we can rename this column so we'll call it total week's popular and set it equal to this max week's popular and we can run that and now we get each of the song names as you can see they're 95 songs just like how there were 95 distinct songs before and we get the max number that they were weeks popular now there are other summary functions too like mean min and again these are on that cheat sheet but i'll put a link to that in the description it looks like these songs are in alphabetical order but let's say we want to sort them by this total week's popular column so we're going to use the arrange function and i've grabbed the same code from before except now i'm just going to pipe at the end arrange and this will take in one or more columns so we can pass in total weeks popular by default it'll do this in ascending order but we want this in descending order so we can just wrap this in d-e-s-c for descending and we also want it to default to being in alphabetical order if the total week's popular numbers are the same so we can pass in a second column into a range for the song title and now we run this and we can see that where the song is thai it's still in alphabetical order but now grabbing the top 10 or top 15 would be really easy we could just call head 10 right after and it would just give us the top 10 rows of the songs that were the most popular so the last function we're going to go over which is similar to group by is count which will count the number of rows for each unique value in the column or columns that you specify so if we wanted to find the total number of times that each artist shows up in our data frame we can simply write count artist which will create a default column called n with each of the artist names and you can see here the data is not very clean because obviously two chains is being listed here multiple times but we're going to ignore that for now but if we want to get this into descending order by n we can just arrange descending n and now we can see the artist that showed up in our data set the most number of times so even though there are a lot more d plier functions these are the ones you'll likely find yourself using the most in the next video i'm going to cover a few more scenarios where dplyr can really come in handy including cleaning up a data set piping your data directly into a graph and merging data sets together to gain further insights if you found this video helpful i'd definitely appreciate if you liked the video and subscribe to the channel thanks for watching and i'll catch you in the next one
Info
Channel: dataslice
Views: 3,132
Rating: 5 out of 5
Keywords: dplyr cheatsheet, dplyr summarise, dplyr data, dplyr group_by, group_by, dplyr library, dplyr commands, dplyr filter, data manipulation with r, dplyr guide, dplyr, dplyr in r, data manipulation, Data manipulation, filter, select, r tutorial, dplyr tutorial, tidyverse, tidyverse tutorial, tidyverse r, r programming, r packages, dplyr package, data science, data wrangling, mutate, arrange, piping, rstudio, data analysis, dplyr functions, data tutorial, rstats, stringr, grepl, lubridate
Id: Gvhkp-Yw65U
Channel Id: undefined
Length: 11min 21sec (681 seconds)
Published: Mon May 10 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.