Data Analysis using Pandas - Joining a Dataset

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
in this video we're going to talk about joins using pandas if you've used relational databases before you've probably worked with joins the join is common in sequel for joining two different tables together but we'll use it for joining data frames to one another if you haven't used joins before let's quickly review what they do and what types of joins are available Christopher Moffat has a great article on code project comm that breaks down different joins visually I recommend to read through it if you're interested but we'll just focus on his summary graphic there are some major group distinctions when it comes to joints the two major questions to think about is is the joint inner or outer and is it left or right in the middle of the graphic we can see what represents an inner join as you can see inner joins retrieve only matching rows so only data that is shared between the two tables or data frames on the top left we have what is commonly called a left join or a left outer join as you can see a left join takes all of the data from the left table or data frame and any matching data from the right table or a data frame on the top right we have our right joint or right outer joint it takes all data from the right table and any matching data from the left table so he's also outlined some more complex joins that leave out overlapping data or include all data you can see them by scrolling down these are also possible using pandas but less common it's good to know they exist in case you need to use these types of joins but you'll probably be using the main three that we covered so now we know that ordering of joins matters left versus right and that inner and outer joins give us much different data sets first let's open up a new Jupiter notebook to do some join work I'm going to import pandas in pandas as we discussed instead of using tables we used data frames and instead of writing sequel we use a join method in pandas left versus right is determined by which data frame you use first in the syntax the first data frame is the left one and the second data frame is the right one I've created some very simple data sets for us to use so we can easily see what's going on normally your data will be much Messier and we'll continue to learn lots of ways to clean it so when you're doing joins in the wild make sure that you do any pre-processing you might need to clean it up before your joins to begin let's import the main data set will be using employees CSV since I know that this data set is small I'm just going to print the entire data frame I have the four rows with the employees I see that I have an ID column so I'm going to set that as my index I can do so using the set index method great now I have my ID as my index so now I can see that I have a few columns that end in underscore ID this is a common convention in databases to point to other tables depending on your data it might be more opaque on how to join them but regardless if you can find shared columns and you know how they're named you can join them together I also have a database for titles let's create a new data frame of the titles so here I knew that the ID was the first column so I can just set it as the ID column and so I have a few different titles here so let's see if I can join this titles data frame with my employees data frame so here I'm going to use our suffix because I know that my titles database and my employees database both have a column called name if I didn't supply a suffix then this would fail our suffix means to put the suffix on the right data frame so here I'm using underscore title to remind myself that that's the name that came from the titles data frame you can also use L suffix which appends it to the left data frame let's take a look so now that I've run my join I see that it's only matching one row if I look back up in my data I see that I have one two three IDs for my titles and if I take a look at my employees you see that the title ID column should all be matching so what's happening here by default pandas attempts to join on index so here it matched according to those index IDs because I have a two ID in my titles table and a two ID in my employees table it matched Chris to the lead engineer this is a great option if your indices match however this is not the case for our data set joining on index is however one of the fastest and least painful joins but for our data set this means we'd have to reset the index and then set it back again since we likely don't want the title ID to be our index so it would just add a bunch of unnecessary steps there's a much easier way that we can use to join this data I'm gonna copy this cell here we're going to use the keyword argument on which allows us to pass a list or a lookup it will attempt to look for the title in the left data frame and match that to the index on the right let's run it now and see if we can get proper matches perfect this is looking a lot better so now let's see if we can pull in our department data okay so we properly have our data so I'm gonna save that to a data frame I'm gonna scroll up and instead of outputting here I'm gonna save this and I'm just going to call it employ with title so now I'm gonna see if I can join my departments with my employees with title data frame here again I'm going to use our suffix again depending on your data frames that you're joining you may want to rename your columns first because it can get confusing if you have these long suffixes but it's gonna depend on your data set so really you're the one to judge what's readable and what's not you always have the rename method if you want to rename the columns first and then join them so let's run this and see if it works okay so it looks like we have some name departments but it looks like sandy has a null value in panda's n a n is our null values if we look here at Sandy's row we can see that her department ID is 7 when we scroll up to our departments we see we only have 1 through 4 represented so what pandas is doing here is that left outer join that we talked about it's gonna take all of the rows from the employees with title and look for matches in his department's data frame when it doesn't find matches it's just gonna simply put null placeholder values so if we wanted to avoid null values we could use an inner join so I can add an extra argument called how when I run this inner join I see that the row including sandy is now gone what it's giving me is again that intersection so it's saying show me only rows from employees with title that have a match in departments let's take a look at a right outer join here I can see I again lost Sandy's row because here I'm taking all of the rows of the department data frame and I'm just putting in null values when the left data frame doesn't match so we can see here that the bottom row has a bunch of null values because it represents the sales department which doesn't match any employees we can also see that it has some unintended consequences on our index which we might want to correct for so now we have two number twelve IDs I hope this video has helped you feel more confident handling joins and determining how to join your data frames using pandas I recommend taking time to use some messy data or importing some sequel data to work on joins and see which ones are easiest to manipulate keep in mind that joining on a shared index is the least painful and the fastest way so try and resort to that when possible thanks for watching this riley training video if you like more information on this topic click on learn more don't forget to subscribe to the o'reilly video training youtube channel for more tutorials and be sure to like us on facebook
Info
Channel: O'Reilly - Video Training
Views: 39,711
Rating: 4.9122581 out of 5
Keywords: tutorials, video courses, training video, lessons, video, how to, guides, tutorial, video classes, how-to, software, how to use, video tutorial, data, python, Pandas, Python Pandas, data analysis, RegEx, data visualization, Matplotlib, Bokeh, Katherine Jarmul, data library, dataset, data wrangling, data tutorial, python tutorial, pandas tutorial, relational database, joins, relational databases, Jupyter, data frame
Id: 8K8Bs7z6d2M
Channel Id: undefined
Length: 8min 42sec (522 seconds)
Published: Wed Aug 24 2016
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.