Tibble Joins with dplyr (STAT 545 Episode 11-A)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
[Music] hi i'm vincenzo coya and welcome to stat545 the video series designed to help you write a clean and modern data analysis so far in the course we've only ever been working with a single tibble for doing data analysis but whether deliberate or not you'll probably end up with more than one table at some point and how do you put those tables together into a single data frame for example take this imaginary guest list for a wedding containing the rsvp status of everyone invited here's another table down below of the party email addresses how can you go about adding the email addresses from this table up to this table without manually copying them over that's one type of task that you might want to do for joining two tables today we'll see four of the more common types of tasks for joining two tables together we'll use the dplyr package for doing this and the data you can find available on the stat 545 github repository and i'll add a link to the description where you can find the data the first type of task is to add columns using a reference table now this is the first example that we saw where we're wanting to add an email column to our guest list by referring to a different table in this case we're making a new email column in the guest list by looking up the party email in the reference table so the first four rows of the guest list will look up the party1 email and for the next two we'll look up party2 email etc use the left join function to accomplish this using the buy argument to indicate what variable will be matched between the two typicals let's move the email column next to the party column for easier comparison notice that some email addresses are missing specifically for party 5. that's because party 5 can't be found in the reference table party underscore emails on the flip side the reference typical is allowed to have parties that aren't present on the guest list for example down here we have 16 and party 17 which aren't found on the guest list can also achieve the same task using the right underscore join function and swapping the two tables these functions are so named because for left join you're adding data to the left data set and for right join you're adding data to the right data set the left and right joins both add columns to your table and because of this they're called mutating joins similarly to how the dplyr mutate function adds columns to a tibble let's take a look at another type of mutating joint the second type of task is to pool two tibbles together now with the left and right join we were adding to a main table by referencing a reference tibble but what if we wanted to just take two tables and pull them together to start let's suppose that the guest list was created based on your partner's friends and family now you're tasked with adding all of your friends and family to help you download a list of your contacts how can we combine these two tables together so that your friends and family are in the guest list without making any duplicates just use full join to keep all information from both tibbles this time we're matching a rows according to the name column because we're comparing people not parties at first the full join tibble looks the same as the left and right join tibbles but so far there's one thing hinting that this is different there are 33 rows here whereas with left and right joins we had 30 rows scrolling through the tibble shows the reason you've added people from your contact list that weren't previously on the guest list and people from the guest list that weren't on the contact list resulting in one big happy table and what about these nas well that's because these three people have been imported from the contact list and so we don't have attendance nor male information about them and another thing about the full join function is that your mutual friends won't be duplicated but what if you wanted to do the opposite task and only add mutual friends to the guest list in this case we only want guests that are in common from the original guest list and the contacts list use inner join to achieve this also matching rows in the two tables by the name columns now there are only 25 rows the five people on the guest list but were not on the contact list were removed and the three people that were on the contact list but not the guest list were also removed this also means that we have complete information for everybody unlike the left and right join functions like we saw earlier which have a main table that we add to by referring to a reference table the inner join and the full join functions treat both tables equally and pool them together these are all still mutating joins because they add new columns to your tibbles but what about a different type of join altogether the third type of task is to filter your rows according to a reference table so we've already seen that the two tables that we work with sometimes have entries in common and sometimes don't what if we wanted to find out who is in common and and what entries are not in common remember that there are five people on the guest list that weren't on the list of contacts there's a way to access these rows directly [Music] to find out who those five people are we can use the anti-join function as before we'll match rows using the name column to do the opposite that is filter our guest list down to all guests where we do have an email address in the contact list just use semi-join notice that we're back to the situation where each table is treated differently the guest list goes into the function and a modified guest list comes out of the function where the table in the second position is only used as a reference because new columns aren't being created here with this type of join these are called filtering joins because we only reduce the amount of rows based on a reference table and sure while it's true that you could do the same thing by doing a left or right join or maybe a full join and then filtering that resulting tipple this approach is just more direct and simpler the fourth type of task is to bind two tables together and this is literally just taking two tables and smashing them together either by columns or by rows and there's probably no need to do this here's the guest list again what if we wanted to add another person to the guest list you could bind the rows of these two tables together using bind underscore roads but you might as well be more deliberate and use full join in addition to the bind rows function there's also a bind underscore calls function for binding columns and this is especially error prone because you'd have to make sure that both tables that you're binding the columns together by you have to make sure that the rows all match up instead if your task is to actually just make new columns why not just use the dplyr mutate function or one of the mutating joins that we saw earlier well that pretty much covers joining two typicals with dplyar to summarize starting with mutating joins first add a column to a tibble by referring to a reference table and second join two tables together moving on to filtering joins we have task three which is filter a main tibble by referring to a reference table and lastly task 4 is to bind rows and columns with care i hope you enjoyed today's episode as always join us next class where we discuss file input and output [Music] don't forget to subscribe to this channel so that you can stay up to date with more data analysis goodies
Info
Channel: STAT 545
Views: 134
Rating: 5 out of 5
Keywords: left_join, right_join, full_join, inner_join, anti_join, semi_join, dplyr, joins, tibble joins, bind_rows, rbind, bind_cols, cbind, vincenzo coia, stat 545, rstats, tidyverse, mutating join, filtering join
Id: YAdX9MVRY1c
Channel Id: undefined
Length: 9min 30sec (570 seconds)
Published: Tue Oct 13 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.