Python Pandas Tutorial (Part 4): Filtering - Using Conditionals to Filter Rows and Columns

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hey there how's it going everybody in this video we're going to go over the basics of filtering data from data frames and series objects so for example if we wanted to look at our survey data and only look at people who know Python then we can filter that data out or maybe we only want to see results from a specific country or people that have a specific salary range anything like that we can do all of that by filtering out data from our series and data frame objects and we'll learn how to do that in this video so filtering is one of the main things to learn with pandas because it's basically how we begin every project by filtering the data that we want from the data that we don't now I'd also like to mention that we do have a sponsor for this video and that is brilliant org so I'd really like to thank brilliant for sponsoring this series and it would be great if you all could check them out using the link in the description section below and support the sponsors and I'll talk more about their services in just a bit so with that said let's go ahead and get started ok so first of all let me show you something that's going to be the basis for our filters if we perform some basic comparisons using our data frames then we'll get back some interesting results so let's see what this looks like so right now I have my snippets notebook open that we've seen a couple of times throughout the series and at the bottom here I'm going to make a basic comparison and see what the result looks like so let's say that I wanted everyone that has the last name of DOE from our data frame there should be two since we have a Jane Doe and a John Doe here in our data frame so in our bottom cell here I'm going to put a comparison that says DF and then we want to access that last name column and now we can just say if that last name equals equals doe so if I run this then what we get back is a series object and this might not be what you expected so maybe you thought we would just get a data frame back with all of the values that met our criteria but what we got back is a series with a bunch of true/false values now these true/false values actually correspond to our original data frame and the true values are the rows that met our filter criteria and the false values are the rows that didn't meet our filter criteria so you can basically think of like a mask so we can see here up here in our simple data frame the last name of Schaefer is false and these two last names here with Doe are true so this is a filter mask when and when you apply it to a data frame it will give you all of the rows that meet that filter criteria so now let's apply this filter to our data frame so first I'm going to assign this this return a series here to a variable and I'm just going to call this variable filt so I'll say filt is equal to and then this comparison here now filter is a built-in Python keyword so be sure to use something else anytime I assign these two variables I usually just use this filt keyword here we're not keyword by variable name now I also usually like to wrap my entire filter in parentheses because I find it easier to read so this isn't going to change anything but I'm just going to put this entire thing here within parentheses so that I can see that this assignment here is separate from this equal sign okay so now we have this filter here and remember that this this filt variable here is equal to this series here of true and false values and now let's apply this filter to our data frame and we can do this in a couple of ways so you might see some people do it like this we can just pass that directly in like we are searching for a column we can pass in a filter there and if I run this whoops and I got an error there because I did not run this cell to set that variable so I'll rerun that and now if I run this then now we can see that we get a data frame back where it returned all of the rows that have the last name of doh now we only assigned the filter on a different line because I think that's easier to read but you might see some people put these comparisons directly in the brackets for the data frame so you might see something like this so I'm just going to comment that out right now and just grab this entire filter here you might see some people do it like this and just paste it or put it directly in there and we can see that that works now I think that that's a little more difficult to read than just assigning this to a variable so I'm just going to use that as a variable but you should know that it is a possibility since you will see some people put the filter directly in there okay so that's one way of narrowing down those rows now another way that you can do this and the way that I prefer to do it is to use the dot Lok indexer that we've seen a few times now and I'll talk a bit about why I prefer this in a second but if we pass in that series of boolean x' and - dot lok then it should give us the same results that we currently have so if I say D F dot Lok and pass in that same filter then we can see that we get the exact same thing this here gave us the same results as our dot Lok down here so this is one reason why pandas can be a bit confusing to people because there are multiple things that you can pass into these different brackets to get different results so like I said before dot lok is used to look up rows and columns by label but if you pass in a series of boolean x' like we did here then you can also filter data out now the reason that I like using dot lok for this is because we can still grab these specific columns that we want as well so for example if I wanted the email column then I could simply say pass in a second value here and 2 dot loke and just say that we want that email column if I run that then we can see that we get the emails that match those last names so remember the first value to dot lok are the rows that we want and the second value here are the columns that we want so since I wanted the email column it gives me a series of those emails where the last name is equal to dou okay so real quick let's go over some other ways to filter data out here so first let's go over the and and or operators now we can't use the Python built-in + 1 + or keywords for our filters so we're going to be using some other symbols and the syllables that we're going to use here are the ampersand for and let me write these out here the ampersand for an and and this vertical bar for an or so these symbols carry over from other programming conventions so you may have seen them before so for example let's say that we wanted our filter to get all of the rows where the last name is Doe and the first name is John which should only return this one result so in order to do this we can go back up to our filter here and I'm just going to get rid of this cell here because I'm just going to be using dot lokrum now on so now with our filter here let's say that we want a lastname equal to dough and which we're going to use the ampersand and we want the first name equal to John so I will put that in here and I also let me actually separate these out as well I like to have each one wrapped in parentheses I think it makes it a bit easier to read so now if I rerun that filter and then rerun our dot loke then we could see that now we're just getting that single email address and that email again is where all of the last names were equal to dough and the first things were equal to John so in this case it's just one result so now let's look at an example using the or operator now for this we can use the vertical bar character so let's say that I wanted a filter where the last name is equal to Schaefer or the first name is equal to John okay so if I wanted the last name to be equal to Schaefer or the first name is equal to John for the or we're gonna do a vertical bar character there so now if I rerun this then we can see that we get back all of the rows or all the email addresses where the last name is equal to Schaefer or the first name is equal to John so in this case it was these two rows here now we can also get the opposite of a filter so let's say that I wanted to get the complete opposite of this filter where we want all of the rows where the last name isn't Schaefer and the first name isn't John now I could go in here and fiddle around with this query that I currently have trying to get everything right or I could simply add in a tilde at the beginning of this filter and it will give me everything that didn't match that filter so if I just come in here and put a tilde there then that is going to negate that filter and give me the opposite of those results so we can see here that we get Jane Doe because that our that is all the results where the last name was not Schaefer or the first name wasn't John so this goes back to truth tables that some of us may have studied in school now if anyone is unfamiliar with truth tables then let me know and maybe I can make a future video on that subject but it's more mathematical related than programming related although the two do overlap very frequently okay so that kind of covers the basics of filtering on a small data frame but now let's go back to our larger data set of survey data and look at some real world examples of some filters that we might want to take a look at so I'm gonna bring up my other notebook here and here we have the stackoverflow survey data that we've been using throughout the series and again if you'd like to follow along then you can find a link to this code and the data in the description section below okay so now that we've learned about some basic filtering let's see if we can run some filters on our survey data here so for example let's say that we wanted to look at the data for people who are making a salary over a certain amount now maybe we want to want to take a look at what languages are earning the higher salaries or something like that so in order to do that I'm going to first create a filter now if you don't know which column and the data frame it gives the salary then you can always find that using the schema data frame that we've seen throughout the series that tells us with what each of these columns here means but for the sake of time here I'll just tell you that the column for salary if I go over here I think I can find it here pretty quick it is this converted comp right here so if we scroll down through some of these survey results we can see that this person says that they make about 9000 this person makes sixty one thousand and these are converted to USD and real quick let me reload this page here just to make sure that I have run all of these cells and that I'm that all of these variables here have been initialized okay so if we wanted to get a filter with all of the high salaries then let me come over here and again I will copy this column name here and now let's make a filter and I'll call this filter high salary and I'll set this equal to a conditional here where we're saying that we want the DF of this converted comp column of our data frame we want that to be over let's say a high salary is over 70,000 you know this is kind of subjective but we'll do that as a filter here and now let's apply that filter to our data frame so just like we saw before I can say DF loke and I can pass in that high salary filter there and now we can see that we get some results here and this isn't all of our respondents we can see that now it's respondent 6 and 9 and 13 so if I scroll over to our to our converted comp then all of these salaries here should be over 70,000 and it looks like they are now in order to narrow these columns down a bit let's just grab a certain number of columns so I'll grab the country the programming languages and the salary so to do this remember we can just pass these into dot Lok so up here where we're doing dot Lok I'll put in a comma here and now I'll put in a list for the columns that we want and let's say that we want to get the country we also want to get the programming languages that they that the these people have worked with and this here and like I said you can look all of these up in the schema but this is under languages worked with we can see that this person knows Java our SQL and so on so I'll paste that in as one of the columns that we want and also I will get the converted comp so I will paste that in as well and now let's run that now we can see that we get not only the people who meet this high salary standard filter that we set but also we're getting the we're only getting the country the languages that they know and their salary so our filter does seem like it is being applied correctly and that this is working nice so now that I'm actually seeing the countries here that reminds me that we might want to do some filtering with multiple values so for example you know my YouTube audience comes mainly from the United States India the United Kingdom Germany and Canada well that's where the largest percentages of the audience of people who are watching the videos come from so let's say that I wanted to filter out the survey results here so that I only see the results from those five countries that I mentioned now I could create a super-long filter up here where I say you know if the country is equal to the United States or if the country is equal to India or if the country is equal to the United Kingdom but that would take up a lot of space instead let's just create a list of those countries and then I'll show you a neat trick that we can do to filter those out so now up here instead of using this high salary filter here I'm just going to call make a list of countries here and I'll set this equal to I'm kind of a slow typer here so let me type these out really quick okay so now if we want all of these survey results where the respondents said that they were from one of these five countries then I can simply say I'll set a new filter here actually let me do that up here I'll set a new filter and I'll set this equal to DF of country so where the country dot is in and we want the country to be in this list of countries here so I'm going to run that cell and now let's apply that filter to our data frame so if I run this and just to narrow down these results to make sure to where we can see the country right off the bat I'm just going to only grab the country there so if I run this then we can see okay we get United Kingdom United States Canada India so it looks like that applied correctly as well now let me show you one more common filter operation that you'll probably use a lot so we can actually use string methods within pandas as well to do some alterations to our data frame or in this case to help with a conditional so let me show you what I mean so let's say that we only want to look at people who answered that they knew Python as a programming language so let's see how we do this so first of all the column that lists the programming languages that each person said that they know is that language worked with column so let's see what this looks like so in this here I'm just going to grab that language worked with column oops and I spelled language wrong so language worked with is that right yep okay so we can see that we get some programming languages here and each different language is separated by a semicolon so we can't just do a query where we say you know if the language worked with equals Python because the actual value of that column will be any combination of the languages listed on the survey so one way that we could do this is to instead use a string method and see if python is within this string of these languages that they know so to do this I'm going to rewrite my filter here and I'm just going to say that my filter is going to be equal to and we want to grab the languages worked with for each person and then we're going to use some string methods so I'm going to say dot STR dot contains and then I will pass in Python now also we can see that we have some na n values here but not a number now we need to also set a fill value for those or else we're going to probably get an error so that is part of the contains method here I can just say na is equal to false we're just not going to do anything with those so let me explain this one more time here so this filter that we're putting in place here it's saying OK for this column here the languages worked with which are all of these results here we are saying the string in that column does it contain Python so that's my filter I'm going to run that cell and now let's apply that filter to our data frame by saying dot Lok and I will pass in that filter and now let's also just grab the languages worked with column to make sure that that worked correctly so if I run that then all of these results here are all of these rows that it gave us back should I have Python in their language is known so we can see that we have Python there they're there so yes it does seem to be in all of these results now this one here number eight we can't actually see it here but we have these ellipses here so it's probably just being truncated here now in my last video I kept saying that these were being concatenated I meant truncated a few people pointed that out in the comments so yeah these are being truncated here so we just can't see the Python value there but they are there and I'll probably do a complete video on string methods here in the future since there's so much more that we can do with we can use them to replace text to split values and all kinds of different stuff so it's very these string methods are extremely useful in pandas so that's a basic overview of doing some filters with our data frames again the key takeaways are that we can do these conditionals here and have them as a filter and again let me print out just the filter without applying it to data frame we can see that the filters return a series of true/false values and when we apply that to our data frame those true/false values basically apply a mask to our data frames where we get all of the true values back and not the false so if I look down here we have respondent one two four and five as true and three was false so if I look at my data frame here we can see that we get 1 2 4 & 5 & 3 is not there because that value was false so that's how filters work and again it's an essential part of pandas because this is what the one of the first things that we do with our data whenever we are loading it into pandas okay so before we end here I would like to mention that we do have a sponsor for this video and that is brilliant org brilliant is a problem-solving website that helps you understand underlying concepts by actively working through guided lessons and brilliant would be an excellent way to supplement what you learn here with their hands-on courses they have some excellent courses and lessons on data science that do a deep dive on how to think about and analyze data correctly so if you're watching my panda series because you're getting into the data science field then I would highly recommend also checking out brilliant and seeing what other data science skills you can learn they even use Python in their statistics course and will quiz you on how to correctly analyze the data within the language they're guided lessons will challenge you but you'll also have the ability to get hints or even solutions if you need them it's really tailored towards understanding the material so to support my channel and learn more about brilliant you can go to brilliant org forge slash cm/s to sign up for free and also the first 200 people to go to that link will get 20% off the annual premium subscription and you can find that link in the description section below again that's brilliant dot org Forge /c M s okay so I think that's going to do it for this pandas video I hope you feel like you got a good idea for how to filter the data within our data frames to find the information that you're looking for like I said this is a fundamental skill and pandas which is usually one of the first things that we do with our data in the next video we'll be learning how to alter the data in our data frames and make changes so we'll learn how to make changes to specific values and also how to make multiple changes at once across the entire data frame so for example maybe you want to make it so that email addresses are all cast to lowercase so that they're easier to search or maybe you want to take out any spaces of your column names and replace them with underscores so all of that will be covered in the next video so be sure to stick around for that but if anyone has any questions about what we covered in this video then feel free to ask in the comment section below and I'll do my best to answer those and if you enjoy these tutorials and would like to support them then there are several ways you can do that the easiest ways to simply like the video and give it a thumbs up and also it's a huge help to share these videos with anyone who you think would find them useful and if you have the means you can contribute through patreon and there's a link to that page in the description section below be sure to subscribe for future videos and thank you all for watching you
Info
Channel: Corey Schafer
Views: 185,696
Rating: 4.983758 out of 5
Keywords: pandas, python, pandas filter rows by condition, pandas filtering data, pandas filter columns, pandas filter, pandas filtering, python pandas tutorial, python pandas, pandas condition, pandas conditional, data science, data analysis, loc, pandas loc, corey schafer, python programming, python (programming language)
Id: Lw2rlcxScZY
Channel Id: undefined
Length: 23min 4sec (1384 seconds)
Published: Fri Jan 17 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.