Daniel Chen: Cleaning and Tidying Data in Pandas | PyData DC 2018

Captions
So, those of you who are coming in: this is the GitHub repository for everything we'll be working on, so even after the tutorial, the notebooks will all be published here. What I do want you to do is go to that link highlighted up here, and if you don't know your way around GitHub, you can click this green button here and click "Download ZIP", and that will at least download everything. The most important part is that the datasets we'll be working with are all stored in this repository, so if you want to follow along, at least you can follow along that way.

I'll just keep talking a little bit about setting up. When you download this, I'll assume you downloaded the ZIP version of it, since I'm not going to assume everyone in the room knows Git. Unzip it somewhere convenient; the desktop is probably a useful place, at least for the tutorial. For those of you on Windows and Mac, I sort of assume that you're running the Anaconda distribution of Python, so if you open up the Anaconda Navigator (if this is the first time you're opening it, it might take a little bit of time), there is a button in there that says Jupyter Notebook or JupyterLab, and once you click that, a browser should open up. What I'm going to show is a little bit different because this is a Linux machine, but let me just hide all this random other stuff before something I regret shows up on the screen. [Music]

All right, so if you use the Anaconda Navigator and you click on JupyterLab, or Jupyter Notebook if you want, eventually you'll get a screen like this. There is a newish frontend that you can play with called JupyterLab, and I'll probably end up using JupyterLab for this tutorial. You can think of JupyterLab as the next iteration of the notebook system. It's pretty handy because you'll have a lot more
convenient access to everything: you can browse your files right away. What's also really cool is that CSV files can be rendered as spreadsheets in JupyterLab, which is really useful if you're just playing around with data. And you can have multiple notebooks open at the same time, even two views of the same notebook, which was one of the things people were complaining about all the time. So JupyterLab, if you haven't played with it yet, I would highly suggest looking into it. Hopefully everyone has that up and running. And I totally just realized, in this whole chaos, I forgot to pull up my notes for the whole course. For those of you who have questions, feel free to raise your hand and someone will run to you with the microphone; you can totally interrupt in the middle, I don't mind that at all. Sorry, give me one second, I am pulling up my notes. All right.

So this is a very introductory tutorial: if you've never worked with pandas before, or worked with tabular data in Python, that's generally the target audience I'm aiming for. The main difference in this one is that I'll skip over a lot of stuff and get right to the actual tidying-data portion, which I've found to be the most liked part. That was the reason I switched over to using Python or R to do analysis: doing these data manipulations was just sort of magical for me. So that's the general pace we'll go at. If you've seen other tutorials I've given, you'll notice that I pretty much jump from chapter one to chapter six, but we'll cover the holes along the way, so don't worry too much about it. Hopefully you have the datasets downloaded and you have this notebook system open, and we'll just live-code along and talk through the commands as I go through them.

Python comes with a bunch of libraries, and by
default, just to make things faster to load, it doesn't load everything in one go, so we have to import our libraries (oops, let me make this bigger). The library we're going to use is called pandas, and this is our library for reading in tabular data. Pandas has a special function in there called read_csv to read in CSV files, and we can assign what we load to a variable; we'll call it df, for data frame. In Python, when you load in a library and you want to call a function within that library, you use this dot notation: we're saying, in this pandas library, we want to use a really specific function out of it, and the function we want is called read_csv. If you're wondering how I got that thing to pop up: you hit the Tab key and it will do the auto-complete for you. And in read_csv, if you hit Shift-Tab in the notebook, it will pull up the function signature for you. The first thing it takes is the file path or buffer, so this is pointing to the file we want to use.

Ooh, I forgot, I need to do one more important thing. In the Jupyter notebook system, right now I am in the data folder: you saw before that we were originally here, I clicked data, and then I created a new notebook in there. So when I type dot slash, meaning look in my current directory, and hit Tab, all the datasets I have are in there. In the next notebook I'll show you what happens when you move your notebook into another folder.

The first dataset we're going to look at is called gapminder.tsv, and you can tell by the extension that this is a tab-separated-values file, not comma-separated. So we actually need to give read_csv a different parameter, sep='\t' (backslash t). This is how we tell pandas: by default you read CSV files, comma-delimited files, but in this example we actually have a
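Put together, the read step looks roughly like this. Since the repository's gapminder.tsv may not be on your disk, this sketch inlines a couple of tab-separated rows as a string; the column names match the real file, but the frame here is just a stand-in.

```python
import io

import pandas as pd

# In the tutorial the call is simply: df = pd.read_csv('gapminder.tsv', sep='\t')
# Here the tab-separated contents are inlined so the snippet runs on its own.
tsv_text = (
    "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Afghanistan\tAsia\t1952\t28.801\t8425333\t779.4453145\n"
    "Afghanistan\tAsia\t1957\t30.332\t9240934\t820.8530296\n"
)

# sep='\t' tells pandas the file is tab-delimited, not comma-delimited
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df.head())
```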
tab-delimited file. This data frame thing in Python is an object, df, and if we want, we can print out all of df and it will just dump the entire dataset to the screen. But one really convenient way to look at it is this method called head: if we say df.head() with round brackets, it gives us just the first five rows. When you're loading in datasets for the first time this is really useful, because you really want to make sure your columns got loaded properly. Sometimes the first row of a dataset is empty, in which case it might not load properly, or maybe you don't have column headers in your dataset at all; pandas will by default treat the first row as column names, so you might end up with something wonky. You always want to look at head just to make sure everything loaded properly.

If we look at our dataset, a data frame in pandas really has three different components: there are the column names on the top, there is this index on the left (in this case the index is really the row name, the row number), and then there's the body of the data frame. You can access those three components however you want. We can say df.columns, and that gives us just the columns of our dataset; you'll notice that columns does not have round brackets, it's just an attribute of the data frame, not really a function being called. If we want to access the quote-unquote row names, it's df.index, and that's how you get the index portion. If you end up going further down pandas and you work with a lot of time-series data, you'll end up setting the datetime as the index, so you can access different rows by the time of day that way. And then in the middle, the actual body of our data frame, that's called the values. This is one way, if you need to take this data frame object and pump it into
some other library that doesn't understand what a data frame is and works with NumPy arrays, that you can essentially get the NumPy array version of your dataset. It used to be that you had to do this when you were working with scikit-learn all the time, but they've since handled the ability to work with pandas DataFrame objects. Occasionally, though, especially if you work with a library that assumes everything is a NumPy array, you will end up having to use .values just to get the NumPy representation of your dataset.

One other thing you'll use very often when working with Python is this type function. It's really useful when you get some weird error and you just want to know what the types of your objects are. You can see here this is a pandas DataFrame object; you see pandas DataFrame right there.

All right, so what else do you do when you first load up your dataset? There's this attribute called shape, and it gives you the number of rows and columns in your dataset. This Gapminder dataset has 1704 rows and six columns. You can imagine, if you have a million rows or hundreds of columns, instead of inspecting each and every one of them, this is a quick way to make sure you got your data properly. If we're reading in data from instruments, this stuff usually ends up being really specific, like: every time this instrument spits out a dataset, it's 300 rows and five columns, and this is one way you can check that the instrument didn't dump out a bad dataset for you.

What ends up happening sometimes (I forget as well) is that with attributes like columns or shape, you might put round brackets after them. If you end up doing something like df.shape() with round brackets, meaning you're calling shape as a function on the data frame and not just accessing the
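The inspection steps just described can be sketched like this, with a tiny stand-in frame in place of the full Gapminder data:

```python
import pandas as pd

# A small stand-in for the Gapminder data frame
df = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "year": [1952, 1952],
    "pop": [8425333, 1282697],
})

print(df.head())      # first five rows (here, both rows)
print(df.columns)     # column names -- an attribute, so no round brackets
print(df.index)       # the row labels (a RangeIndex by default)
print(df.values)      # the body of the frame as a NumPy array
print(type(df))       # <class 'pandas.core.frame.DataFrame'>
print(df.shape)       # (2, 3): rows, columns -- also an attribute

# df.shape() would raise: TypeError: 'tuple' object is not callable
```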
shape attribute, you'll get an error message something like: 'tuple' object is not callable. What got returned back is a tuple (think of that as a Python list that you can't change), and Python doesn't know what calling it as a function means; that's why you get an error like that.

Other stuff you do: if you don't have too many columns, one really useful thing is to say, give me the info() of this data frame. This gives you the type on the first row, so it tells us it's a DataFrame; we have 1704 rows, and they're labeled 0 to 1703, so that label is the index we saw earlier. Then down in the middle it gives us all of the columns. It tells us there are six columns; on the left-hand side is the column name; in the center it gives the number of non-null objects, so in this case you can see we don't have any missing data in our dataset; and on the right-hand side it tells us the type of information stored in that column. If you see "object" in pandas, that means it's a regular string, but you can see in this dataset we also have integers and floating-point numbers, and there are other types in pandas as well. If you work with other statistical languages like R, something like a factor you'll see encoded as a "category" in pandas.

All right, so let's actually do something with our dataset: let's start filtering and slicing our data, just to get over the very, very intro stuff. If we just want a single column of our dataset, we can take our dataset and use square brackets; square brackets are in general how Python subsets objects. If we want, for example, the country column, we can just put the string 'country' in square brackets, and it gives us just the country column. And what we can do is take this and save it to a variable, so we can say this country data
frame is just the command we just saw before. If we want to look at just the first couple of rows instead of all of it, we can also say .head() on that. One important thing: if we look at the type of this, a single column out of a data frame comes back as a Series. You can think of a data frame as a bunch of columns, where each column is a Series; that's just how those two words work together. And you can think of a Series as almost like a one-dimensional NumPy array.

So what happens if we want more columns out of our dataset? Sometimes what you want to do is say: I have this giant dataset, but I don't care about all of these other columns, I only care about this subset of columns. What you can do in Python, if you're trying to specify multiple things by name, is put them in a Python list. So we have this outer set of square brackets, which is really telling Python, I want to subset something, and then a list in Python is another set of square brackets. That's why you'll see double square brackets if you're trying to pull out multiple columns at the same time. Here, if we want country, continent, and year (oops, that's because I spelled it wrong), we can put those column names in a Python list and then put that list object into how we subset data frames.

One other thing to keep in mind: if we do pandas.__version__ (double underscore, version, double underscore), that's how you get the current version of pandas you're working with. This becomes really important when you start googling and trying to get help from Stack Overflow, because sometimes APIs break, and hopefully the Stack Overflow answer gives the version of pandas it's using, or hopefully there's another solution that uses something more modern.

So that's how we subset columns. What
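As a runnable sketch of the column subsetting just shown (again with a stand-in frame, not the real Gapminder file):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "continent": ["Asia", "Europe"],
    "year": [1952, 1952],
})

country = df["country"]                        # one name -> a Series
subset = df[["country", "continent", "year"]]  # a list of names -> a DataFrame

print(type(country).__name__)  # Series
print(type(subset).__name__)   # DataFrame
print(pd.__version__)          # handy when comparing against Stack Overflow answers
```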
about rows? There are three different ways in the pandas API to subset rows: two of them are recommended, and one of them is deprecated (or being deprecated; you'll get a big warning) and not advised. I'll cover all three of them, but you really only need two of them, and you really should work with one of them.

So how do we get specific rows of our dataset? You can see here that there is this index thing we talked about, and this index plays an important role in how we subset rows, because it's how we refer to a specific row in our dataset. If, for example, we wanted the row labeled 2, which is the third row in our dataset, and we're trying to reference it by the actual label "2" that you see on the screen, there is a way to subset data using loc, for essentially "location". If we give it a value, for example 2, it will look in that index and see what matches this thing called 2. It's not really looking at the second index position; it's looking for the thing that matches the label 2. This has some consequences we won't really cover in this tutorial, but if you concatenate data, start row-binding things together, you can get into a situation where a row index label is duplicated, and if you say loc[2] you actually get two rows back instead of one, because it's really looking at what is being printed there, that character 2. If we want multiple rows back, just like we did with columns, we put the labels in a Python list: if we want 2 and 0, for example, we get a data frame back with rows 2 and 0, in the order we asked for. Again, this is loc, and it's really doing almost character matching: I'm looking for the thing called 2.

The other way we can subset data frames is using something called iloc, where the i stands for index. This is where we're actually
counting from 0 and saying: that is the row we're trying to get. So if I say iloc[2], it's really saying: count 0, 1, 2, that is the index position we want, and that is the row that gets pulled back out. So there's a subtle difference between the two; in this example, because the index labels are the same as the row numbers, it's a little hard to see, but you can imagine, once you start working with datasets and combining them, sometimes you want the thing labeled 2 and sometimes you want the actual third position, index 2, in your dataset.

The last way, if you look in the documentation, is this way to access rows called ix, and if you try to do anything with ix it will give you a big, giant deprecation warning that also says: please use loc or iloc instead. I don't remember the exact order, but I think it works as if it tries loc first, and if that can't be done, it drops down into iloc. So you can see it's kind of ambiguous, and that's sort of why they're saying, don't do this at all. You might run into old code bases that actually use ix, and when you see it, try to fix it, or talk to the author: did you mean the label of the row or the actual index position of the row? It's just forcing your code to be a little bit more readable: this thing is ambiguous, so let's just get rid of it.

So how do we get multiple rows and columns, say we want to subset our columns and also start filtering by rows? You end up using loc to subset both rows and columns. Under the hood, if you really want to be really verbose about how you type this stuff, loc (if you've worked in R, the notation is really similar) can take a comma, and there's a portion
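The loc-versus-iloc distinction is easiest to see with a non-default index; a small sketch:

```python
import pandas as pd

# Give the frame a shuffled index so labels and positions disagree
df = pd.DataFrame({"pop": [8425333, 1282697, 9279525]}, index=[2, 0, 1])

print(df.loc[2])       # the row whose LABEL is 2 -> pop 8425333 (first row)
print(df.iloc[2])      # the row at POSITION 2 (count 0, 1, 2) -> pop 9279525
print(df.loc[[2, 0]])  # multiple labels, returned in the order requested
```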
to the left of the comma and a portion to the right of the comma: how you specify your rows goes to the left of the comma, and how you specify or subset your columns goes to the right of the comma. So in Python, if we, for example, want to keep all of the rows, we can put in this slicing notation, a colon, that just says: give me everything here, give me all of the rows. Then on the right-hand side, if we want multiple columns, remember we can put them in a Python list, so we get all of the rows in our dataset and just filter the columns that way. This notation is pretty silly in this case, because we could have just used the double-square-bracket notation to begin with, but it has advantages, because we can now also do boolean subsetting of our rows at the same time.

So what does that look like? Say we want all the rows where the year is 1967. How do we specify that? We can really say: hey df, we're taking the year column, and we want to make sure it is equal to 1967. Then, if we want just the year and population, we can do a filter for all of the rows where our year column is 1967 and then subset specific columns. Any questions so far?

We'll do one more example. If you're trying to do this boolean subsetting and you want, for example, multiple conditions, multiple boolean cases, the trick is you have to wrap each individual statement in a set of round brackets. Let's say we want year equal to 1967, and also df pop greater than, I don't know, what is this, half a million? That seems kind of low; maybe this will work. Another cool thing about Python numbers, like a thousand (let's use a thousand):
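The rows-comma-columns loc notation and the boolean row filter can be sketched like this, again on a stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Albania"],
    "year": [1952, 1967, 1967],
    "pop": [8425333, 11537966, 1984060],
})

everything = df.loc[:, ["year", "pop"]]             # all rows, two columns
rows_1967 = df.loc[df["year"] == 1967]              # boolean row filter
both = df.loc[df["year"] == 1967, ["year", "pop"]]  # filter rows AND pick columns

print(both)
```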
you can actually use this underscore as a visual marker inside integers, and it represents the same number. So if you need the equivalent of a comma to help separate out your digits, Python sees 1_000 as the same thing as 1000; again, hopefully it makes your code a little easier to read. So I have these two boolean cases, and if we want both of them to be true, we use the ampersand, &; if you want "or", you use this vertical pipe, |. The reason the notation ends up looking like this, where you have to use round brackets and an ampersand instead of the regular Python "and", is that it's really doing bitwise comparison, and that's why the code looks like that.

All right, cool. So now to the actual portion, because we are now done with the intro. If you downloaded that folder, there is a folder called notebooks; let's go into the notebooks folder and then create a new Jupyter notebook in there. I can rename the first notebook and call it intro, and the second one I will call tidy. The key here is that I am now in a different folder from where all the datasets are stored. That just has some implications for how we load or find our datasets, but before we go there (I'll put this link on the GitHub repo so you can find it), let's just talk about tidy data, and what it means when your dataset is quote-unquote clean.

There is this really nice paper by Hadley Wickham; he comes from the R world, but stuff like this is totally relevant, it's language-agnostic. What does it mean to have clean datasets? That's sort of what this paper talks about: we have these two datasets, defined by Table 1 and Table 2, and which one is better? If we talk about rows and columns, it's really ambiguous, and that's what this paper talks about:
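Multiple conditions, with the round brackets, the ampersand, and the underscore digit separator, look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [1952, 1967, 1967],
    "pop": [8_425_333, 11_537_966, 1_984_060],  # 1_000_000 == 1000000
})

# Wrap each condition in round brackets and join with & (and) or | (or);
# the plain Python keywords 'and'/'or' raise an error on a Series.
both = df[(df["year"] == 1967) & (df["pop"] > 1_000_000)]
print(both)
```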
we actually need better words for these things. He also says, okay, these two things contain the same bit of information, but then here's another version of this dataset in Table 3, and this is the actual tidy or clean version of this dataset. Why? You can imagine you want to fit a statistical model, some result based off of treatment. You can now actually fit a model where result is your y and treatment is your x; you have a column for that. Versus a dataset like the first two, where you don't really have a column that is your response variable and a column for your inputs. So that's one visual cue for what it means when data is clean. The other thing is, once you get your data into a tidy format, if you're just processing data, this is a pretty good target for what your final dataset should look like, because if you get your data in a format like this, it's pretty easy to convert it to whatever you need. This tidy version of our dataset might be good for statistical models, but if you're trying to show a table of results, one of the other two forms is probably way better: for a given person, you can really easily check treatment A and treatment B, whereas it's a little harder in the tidy one, where you're looking for Jane Doe's result and you have to scroll down to find Jane Doe. So there are different versions of datasets, and no one version is better than another, they all serve different purposes; but the key is, once you get your data in the clean or tidy format, it's pretty easy to convert one to another, and that's what we'll talk about.

The formal definition of tidy data that Hadley Wickham talks about in his paper: each variable forms a column, each observation forms a row, each type of
observational unit forms a table. We'll probably really only talk about the first two. What does it mean for each variable to form a column and each observation to form a row? Look at the tidy version up here: we have a column called name, a column called treatment, a column called result. Those are three separate variables, which means three separate potential inputs into a model we want to define. And each observation forms a row: for a given person, we have his or her treatment value and their result.

A little bit further down in the paper: data is messy, and it's messy in many, many different ways, but if you go through the process of cleaning or tidying your datasets, you'll probably end up going through the first three problems listed. That's what we'll go through in this tutorial: we'll walk through those first three things as defined in this paper. And if you do end up reading the paper, we'll actually work with the exact same datasets, except for one of them. You really don't need to read the entire thing: by the time you get to the end of section 4, that's probably all you need to know; the rest is really specific to R. If you just want an understanding of what it means for data to be quote-unquote good, the first three sections are really all you need to read.

All right, so we have this other notebook; let's start processing and tidying our data. This is a new notebook, so we do have to import our libraries again. One other way you can import libraries is using this "as" keyword: we can say import pandas as pd, and what this allows us to do is, instead of writing pandas-dot-something, we can now say pd-dot. This is called an alias, and it's just to help simplify your typing, so you don't have to type so much. The first dataset will come from the Pew Research Center, and we'll now use pd.
read_csv instead of pandas.read_csv to load our first dataset. In this example, if we look at my folder structure, I have my stuff in the notebooks folder, but the dataset, relative to this notebook, is one folder up and then down into a data folder; that's where my datasets are. So instead of just typing the filename like we did before, we now have to say two dots, which says: go one folder up, then go down into the data folder, and then look for the pew.csv dataset.

If we look at the head of this dataset, this is exactly the same dataset shown in that paper, and you can see it's an example of a dataset that's not tidy, but it's really great as something you'd show in a report: for a given religion and a given income bracket, I think this is the count of people. So if we go back up to the paper, what is the first of the three problems we will cover? "Column headers are values, not variable names." What does that mean? Well, if you look at our columns, you can see we have a bunch of columns that are actually income brackets, and those really should be in one column called something like income, and then on the right-hand side we should probably have a count column, so that for a given religion and income bracket there's a count.

So how do we fix this type of problem? There is this notion of long data and wide data. If you read the paper, that terminology is also kind of ambiguous, but I still need to use it, because that's how database people talk about things too. You can think of long data as a lot more rows, since it's long, versus wide data, which is just a bunch of columns. In this case, this is an example of wide data, because we have a bunch of columns that we really want in long format. Hopefully, if I show you the code and run it, it will make a little bit more sense. We're going from wide data to long data, and this is the reason
the function is named the way it is: you can think about having a sword and melting the sword down; it goes from wide to long. So the function we're using is called melt. Pandas has a function called melt because we're going from wide to long, and if we look at the function signature (hit Shift-Tab), the first object it takes is a data frame, and the data frame we want to melt is called pew. Then there are a bunch of other parameters in there, and one of them is called id_vars. With melt, essentially what you're trying to do is take a bunch of columns and turn them into one column, going from wide to long, and id_vars is essentially saying: what are the column or columns that you don't want to touch? In this example we want to leave the religion column alone; we actually want to melt down all of the other columns. So we can say id_vars='religion', meaning that is the column we don't want changed. Let me save that into a variable called pew_long, and if we look at it, we now took all of the columns that weren't specified by id_vars (so everything except religion) and converted them, made them long, into columns that by default are called variable and value. So if I don't use head, hopefully this shows: you can see for a given religion, we have an income bracket and then a count. That is the tidy version of this dataset: each column is a variable (religion, variable, and value in this case), and each row is now an observation.

[Audience question about whether the index is unique] I believe the indices are unique; I've never actually run into a situation where I was playing with the indices when doing this, so I don't know the actual answer to your question, but hold on, let me see. So there isn't a reset of the index here; I don't know, I might play around with it afterwards.

All right, so if we
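A runnable sketch of that melt call, with a two-bracket stand-in for the Pew table (the real pew.csv has many more income columns):

```python
import pandas as pd

# Income brackets as column headers: values, not variable names
pew = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
})

# id_vars: the column(s) to leave untouched; everything else is melted down
pew_long = pd.melt(pew, id_vars="religion")
print(pew_long)
```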
look at this: anything that you didn't specify in id_vars gets thrown into this other parameter called value_vars. The function really takes two things, id_vars and value_vars, and whatever you don't specify in id_vars ends up in value_vars, and vice versa. So whichever is easier or shorter to type, that's the one you use; you can also use both of them if you're also trying to subset columns at the same time. The other two parameters in here, var_name and value_name, are how we change the default variable and value column names. So if we take the same exact piece of code and say var_name='income' and value_name='count', now if we look at our dataset, we actually get some sensible variable names. So if you ever end up in this situation where you have a bunch of columns that are clearly values of a variable you want to put into a model, that's all you have to do: melt your dataset, specify the columns you want fixed or the columns you want melted, and then you have your tidy dataset. Now you can do something like, for a given religion, predict the count or something; probably a terrible model to fit, but it's something you could do.

All right, so what happens if we want to work with more columns? There is another dataset in the data folder called billboard.csv, and here we really have the same problem, we just have a lot more columns. If we look at any individual row, we have year, artist, track, time, date entered; these are all songs in 2000, and then we have each song's Billboard rating from week one all the way out to, I think, week 72. So this is the same problem we had before: what we really want is a column called something like week number, going from week 1 to 72, and then its actual rating.
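The same melt with sensible names via var_name and value_name, on the same stand-in table:

```python
import pandas as pd

pew = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
})

# var_name/value_name rename the default 'variable' and 'value' columns
pew_long = pd.melt(pew, id_vars="religion", var_name="income", value_name="count")
print(pew_long.head())
```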
right, so then you can do something like, for a given week, predict the actual rating. This is the same problem as before: a bunch of columns that are values instead of variables. If you go back to the paper, we're still in the "column headers are values, not variable names" problem — this is just a bigger version of it. So how do we melt this? We use the same melt function. I'll be generous with my line breaks so the code doesn't flow off the screen. The code is exactly the same as before: the first thing you say is, hey, what is the data set you want to melt, so the first parameter is the data set. Before, when we specified id_vars, we only put in religion, but in this case we really want to hold year, artist, track, time, and date entered the same, and melt everything else. So here we put in year, artist, track, time, date entered — these are the columns we don't want to touch — and all of the others, week 1 to 72, are slated to be melted down. We don't need the value_vars parameter in this case. And just to round off the example, let's create sensible names for these things: var_name — those column names are now going to be in one column called week — and value_name we'll call rating. Now if we look at our data set, we get year, artist, track, time, date entered, and, for a given week, that song's rating. Why might you want to do something like this? Again, you can do it for modeling purposes — this is actually the definition of tidy data — but if you work with databases, databases prefer data in long format, so before you dump something into a database you'll probably end up doing something like this. I had
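What he types probably looks roughly like this — a two-row, two-week toy version of billboard.csv is invented here so the sketch is self-contained (the real file has 76 week columns):

```python
import pandas as pd

# Toy stand-in for billboard.csv: one row per song, one column per week
billboard = pd.DataFrame({
    "year": [2000, 2000],
    "artist": ["2 Pac", "3 Doors Down"],
    "track": ["Baby Don't Cry", "Loser"],
    "time": ["4:22", "4:24"],
    "date.entered": ["2000-02-26", "2000-10-21"],
    "wk1": [87, 76],
    "wk2": [82, 76],
})

# Hold the five song-level columns fixed; everything else (the week
# columns) gets melted down into week/rating pairs.
billboard_melt = pd.melt(
    billboard,
    id_vars=["year", "artist", "track", "time", "date.entered"],
    var_name="week",
    value_name="rating",
)
```

Two songs times two week columns gives four long rows: the five id columns plus the new week and rating columns.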
this example where we had a 1,000-by-1,000 data set, and we tried to dump it into Postgres, and it really didn't like the fact that we were trying to put in a table that had a thousand rows and a thousand columns. What I had to do was turn it into a many-thousand-row object with two columns, and then it was fine. So different things need different forms, but if you're thinking about databases, this is also the format databases prefer. All right, remember shape? billboard.shape — this billboard data set originally had 317 rows and 81 columns. [Audience question — I'll repeat what was said.] If you look at the shapes of the two data sets, you'll notice one has way more rows than the other. Sometimes that's just what happens when you tidy data: you only had a hundred rows before, and now you have a million. It's really the same number of cells, but one is definitely a more favorable format for data storage and for modeling in general. All right, so that's pd.melt — that's what you do when you have that first problem: I have a bunch of columns and I just want to melt them down. What happens if you have multiple variables stored in one column? We'll be using a different data set from the paper, but I just want to show you what the paper has. These are, I think, TB counts, and you can see we actually have the same problem as before: these column names are actually values, and they store two bits of information. The way we read m014 is male, 0 to 14, and it goes all the way up to male 65 and above, and then we have the same thing for female 0 to 14, etc. So we know the first part of the problem — we need to do something with melt — because
what we would really want is a column for sex and a column for age group, but the column name itself is storing two bits of information, so that's the problem. So how do we work with that? In our data set — sorry, I'll load this first — we have a data set called country_timeseries. When I was just about to join grad school, I met another graduate student in the same program I was going into; her name is Caitlin Rivers, and she currently works at the Bloomberg School of Public Health, I believe at Hopkins. She curated this data set from the 2014 Ebola outbreak. At the time, Ebola was having its first big outbreak in West Africa; countries were giving reports, and she was reading those reports and recording the case and death counts the countries reported into a data set. There's a repository for all of this, and I'll link it as well. You can see this data set has the same exact problem as before: we have Cases_Guinea and a number of cases, and if we scroll to the right we also have Deaths_Guinea and a number of deaths. What we would really want is a column for date, a column for day, a column that somehow distinguishes whether this is a case count or a death count, a column for the country the count is for, and then another column with the actual count. That's the end goal. So let's take a five-minute break — go to the bathroom, get water, etc. It's now 2:16; let's come back at 2:21. [In response to a question:] Yes, you'd have to write your own code for it, but you could totally slice columns that way — you just have to write other Python code to do it. All right, 2:20 — that went fast, I guess — but we have a lot to cover, so let's keep going. So we have the
data set, and what we want are those specific columns: date, day, country, whether it's a case count or a death count, and then the actual value. (This goes until four o'clock? All right.) So we know the first thing we want to do: if we melt those columns down, that gets us part of the way there. Let's do that first. ebola_long is going to be pd.melt — exactly the same as before — and id_vars, the columns we want to hold the same, are in this case Date and Day. Cool, that gets us part of the way there: we have Date, Day, variable, and value — those are the default names — and the only problem is that this variable column now contains two bits of information. So we've done the first part; now let's work on the second part. If I were to ask you, without thinking about programming at all, "there are case counts, death counts, and a country in here — how would you split this?", you might say, hey, there's a really convenient underscore there: if we break the column down by the underscore, the part to the left is whether this is a case count or a death count, and the part to the right is our country. So we'll do exactly that. Just to step out of the pandas world a little and back into pure, regular Python: if we have the regular string "Cases_Guinea", Python has a lot of string manipulation functions, and one of them is called split. By default, split splits on a space, but we can say, hey, take the string and break it up by some other delimiter — an underscore, a word, whatever you want — and what comes back is a Python list, where each element is one of the components after the split. So we need something like this to work on our column: we want to be able to break our
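The melt plus the plain-Python split warm-up might look like this — a two-column toy version of the Ebola data is invented here so it runs on its own:

```python
import pandas as pd

# Toy stand-in for the country_timeseries Ebola data set
ebola = pd.DataFrame({
    "Date": ["1/5/2015", "1/4/2015"],
    "Day": [289, 288],
    "Cases_Guinea": [2776.0, 2775.0],
    "Deaths_Guinea": [1786.0, 1781.0],
})

# Step 1: hold Date and Day fixed, melt everything else
ebola_long = pd.melt(ebola, id_vars=["Date", "Day"])

# Plain-Python warm-up: split() breaks a string on a delimiter
# and returns a list of the pieces
parts = "Cases_Guinea".split("_")
```

After the melt, the default "variable" column holds strings like "Cases_Guinea" — two bits of information fused together, which the split is about to separate.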
variable column, in this case, on the underscore between "Cases" and "Guinea". The way we do that: the column we're looking for is the variable column — I didn't give these any special names — and the key part is there's this thing called an accessor, str, which lets you take a column and do string things with it. str stands for string, so now we can treat this column as a string, and we've opened up a bunch of methods from Python that work on strings. One of them, which I just showed you, is called split, and what we want is to split on that underscore. Now if we look at variable_split — we took this column, accessed it as if it were a string via str, and ran the split method — what we get back is this vector, this column, of our strings split into two components. Let's look at this a little. What we get back is a Series — again, if you see Series, think of it as a column of data. Because this is a Series, a single vector, we don't need to use loc or iloc; if we want the first element we can just use bracket zero to get the first element out of the Series. If we take the type of that, notice it's a regular Python list. So from this list — forget about working in the data frame world, just think: how do I get things out of a list? — we use square brackets. If I want the first element, we can get "Cases" by just pulling out element zero, and its type is just a regular Python string. This is why, if you ever get confused, just run type() on everything — it helps you contextualize what your problem is. Before, I was like, what the heck is this thing, I've
never seen it before — it's a Series, so forget about what's in it, treat it as a Series, and you can slowly break down your problem that way. All right, so how do we get this back into our data? The first entity in each list is what we want for status — call it cases or deaths — and the second one is going to be country. So how do we get all of the first elements in one go? If we take variable_split and use the str accessor again with bracket zero — treat this thing as a string and get the first element out — that works. I like to be a little more explicit and use get, just so I can say, hey, treat this thing as a string, and get the first element out. People have asked me how that's any different from using square bracket zero — there isn't really a difference; I just like the word get, it reads a little nicer to me. So that's how we get the first column, which is what we want, and if we want the second element we can just say get(1) and we get the countries. We can use those two components to break up that column. The data set we've been working with this whole time is called ebola_long, and if you want to create a new column, just pretend that column always existed: we can subset that column just like before and say status is now equal to that thing. Remember, when we ran get, it gave us back a vector of whether each row was a case count or a death count. Then we can copy that line of code and say we want another column called country, which is the second element. Now if we look at our data set: with the code I just selected, we broke up that original variable column and reassigned it to two new
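Here's a minimal sketch of that .str.split / .str.get workflow, starting from an already-melted toy frame (invented here, standing in for ebola_long):

```python
import pandas as pd

# An already-melted toy version of the Ebola data
ebola_long = pd.DataFrame({
    "Date": ["1/5/2015", "1/5/2015"],
    "Day": [289, 289],
    "variable": ["Cases_Guinea", "Deaths_Guinea"],
    "value": [2776.0, 1786.0],
})

# The .str accessor applies string methods element-wise;
# split returns a Series where each element is a Python list
variable_split = ebola_long["variable"].str.split("_")

# .str.get(i) pulls element i out of every list in the Series;
# pretend the new columns always existed and just assign to them
ebola_long["status"] = variable_split.str.get(0)
ebola_long["country"] = variable_split.str.get(1)
```

`.str.get(0)` and `.str[0]` do the same thing here; get is just the more explicit spelling he prefers.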
columns. So if you ever end up in a situation where a given column is storing two bits of information, what you end up doing is a melt and then some kind of string manipulation to break the thing up. That's the very manual way of doing all of this; there's another way to do it essentially in one go, which I'll show you. If we go back and look at the original split command: split actually has a parameter called expand, and if we set it to True, instead of giving back a vector of lists, it gives us an actual data frame. You can see here it gave us a data frame — it didn't know what to call these columns, so it just named them 0 and 1 — but that's something we can work with. So if you really want to do what I just showed you in one go, you can do something like this with ebola_long. Remember, if you want to assign to a new column, pretend that column always existed, and remember that selecting multiple columns in a data frame uses two sets of square brackets. So if we just pretend the columns status_expanded and country_expanded always existed, we can assign both columns that way, all in one go. One other thing you'll see really often: if you have really long bits of code, the Python community tends to wrap the whole thing in a set of round brackets — I put one here before ebola_long and one at the very end — and then between the dot notations you can break the line and start a new one. You'll see that very often in the Python documentation when code starts to run off the screen. All right, cool. So we've worked with the first two problems: column headers are values, not variable names; and multiple variables stored in one column. What happens when variables are stored in both rows and columns? What does that look like? We have a data
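The one-go version with expand=True might look like this (toy data invented here; the column names status and country are his choices from the talk):

```python
import pandas as pd

ebola_long = pd.DataFrame({
    "variable": ["Cases_Guinea", "Deaths_Liberia"],
    "value": [2776.0, 3496.0],
})

# expand=True returns a DataFrame (columns 0 and 1) instead of a
# Series of lists, so both new columns can be assigned at once;
# two sets of square brackets select/assign multiple columns
ebola_long[["status", "country"]] = (
    ebola_long["variable"].str.split("_", expand=True)
)
```

The wrapping round brackets on the right-hand side are the line-breaking convention he mentions: inside parentheses you can break after each dot without a backslash.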
set for that. All right, let's look at this data set called weather. The first thing you'll notice: we have d1, d2, d3 — we're probably going to melt something, which is something we've seen already — but how do we know a data set has data stored, essentially, everywhere? One of the symptoms: if we look at these two rows right here, and you end up in a situation where pretty much the vast majority of a row is duplicated except for one value, you probably have this problem — what was the symptom called? — variables are stored in both rows and columns. So if you have rows of data where a lot of the information is repeated except for a few numbers, that's how you know variables are actually being stored in a row; that's one of the symptoms. Let's look at this data set. We have, I think, a temperature — I want to say it's in Fahrenheit, but I also don't know what country this is from, so that doesn't help. What we'd really want is some ID, the year, the month, the day — the day information is stored in the columns — and then a column for maximum temperature and another for minimum temperature. That's what we want at the end, but we have this problem where the tmin and tmax stuff is actually in this column called element. So in reality, what we're trying to do is the opposite of melting: I just spent all this time melting data — how do you un-melt it? What we really want is to take this element column and pivot those values back up into columns themselves, individual columns. We want a column called tmax and a column called tmin; what we don't want is a column called element holding the values tmin and tmax. All right, so how do we fix this problem? Well, the first thing we've seen over and over again: we have these days stored in columns, so let's fix that part first. We'll
be using the melt function again — this is the third time we're seeing it. We pass in our weather data, and the columns we want to hold consistent, not touch, are the id, the year, the month, and element. Let's actually give these things sensible names: when we melt those d1, d2, d3 columns down, we'll just call that column day, and value_name we'll call temp, for temperature. And I clearly spelled something wrong — oh okay, that's not too bad. Live coding is great. All right, just like before, that got us part of the way there, and you can see the symptom is even more pronounced here. We fixed the days-being-columns problem, but now we have id, year, month, day, temp, and the only thing that differs is whether this is a min temperature or a max temperature. Pretty much everything except one value is duplicated in this data set, and that's the sign that, hey, something's a little weird. Another way to think about what the problem is now: say you wanted to fit a model off of year, month, and day, with temperature as your response — you'd want one of the predictors to be the min or max temperature. It's actually a terrible example, but the point is you don't have any way to specify min temperature or max temperature; you just have this column called element, and that's not really helpful. In the end, a lot of people want to fit a model, so thinking about how you would fit one is another way to diagnose what your current problem is. All right, what's the next part? The next part uses a function called pivot — I use pivot_table, and the example will show you pivot_table, but I want to show you that there are two things going on. In this example you could also use the method
or function pivot. The difference between pivot and pivot_table: for example, let's just look at the first two rows, and let's say I'm a researcher out in the Arctic taking temperature readings. In this example I have one temperature for min and one for max, but say I have a collaborator who's also taking temperature readings, with the same year, month, and day, and another set of min and max temperature values. Now, for a given year, month, and day, I have duplicate values for a specific temperature. If you use pivot, the function doesn't know how to handle duplicate values. For 2010, January 1st, the maximum temperature — well, it's missing, but let's say there were two values for it. When I do a pivot, just like a pivot table in Excel, it can't handle multiple values. In Excel, when you do a pivot table, I think by default it takes the mean or something, and that's what pivot_table gives you: you get to specify an aggregation function, so you can say I want the sum, the mean, the min, the max, etc. Those are the subtle differences between pivot and pivot_table. In this case I'll just show you pivot_table, but note that if you're using pivot, you're making the assumption that there are no duplicates — if there are duplicates, your code is going to crash; it's going to throw an error. So how do we work with this function? Oh — the newer version of pandas actually had a small API change. I've been using pd.melt this whole time, but you could just as easily have used weather.melt, calling it on the data frame directly, and it would have done the same thing. That was a minor change: you can use pd.melt or the data frame method; they both call the same exact code
and the reason I bring that up is that pivot_table used to only exist as a data frame method — you'd work on weather_long.pivot_table — and people asked, why do I have to use pd.melt in one case and DataFrame.pivot_table in the other? So they sort of unified the interface: you can use the data frame method or the pd function in both cases. All right, how does this pivot_table function work? It's actually easier for me to just explain it as I go. Just like before, where we had id_vars, here we have index: what are the columns that we don't want to move? We have to use a Python list because we're using multiple columns: we don't want to move id, we don't want to move year, and we want to keep month and day the same. Then the columns parameter: what is the column we actually want to pivot? Here we're saying pivot this element column — what it's going to do is look at the element column and, for each unique value in it, create a totally separate column. And when we do this pivot, we now have to say: we just created, in this case, two new columns — what values go in them? That's the values parameter: use this temp column, the temperature, to populate the new columns after the manipulation. Now if we look at this, we get what we need — it just prints out a little wonky. We have a column called id, a column called year, columns called month and day, and then we also have a max temperature and a min temperature. Depending on what version of pandas you're using, I think there's also a parameter called dropna — it drops missing values, which is why you don't see a bunch of NAs; for January, for example, it's only showing day 30. Depending on your version of pandas, you might just
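A minimal runnable sketch of that pivot_table call, on an invented long-format toy frame standing in for weather_long (the station id "MX17004" is the one from the paper; temperatures are made up):

```python
import pandas as pd

# Toy long-format weather data: one row per (day, element) pair
weather_long = pd.DataFrame({
    "id": ["MX17004"] * 4,
    "year": [2010] * 4,
    "month": [1, 1, 2, 2],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "temp": [27.8, 14.5, 27.3, 14.4],
})

# index: columns held fixed; columns: each unique value of `element`
# becomes its own column; values: what fills those new columns.
# aggfunc (mean by default) is what handles any duplicate readings.
weather_tidy = weather_long.pivot_table(
    index=["id", "year", "month"],
    columns="element",
    values="temp",
)
```

The result has tmax and tmin as real columns, but with id/year/month living in a hierarchical row index — which is what the reset_index step afterward flattens out.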
have all of them showing as missing. All right, so you have this data set, and it's almost exactly what we want, but it prints out differently. This is a hierarchical index — there's a hierarchy in the columns and rows going on here. In my day to day, for any given hour or day of the week I'm using either R or Python, so I try to bring everything down to the lowest common denominator: I just want to work with a regular, flat data frame. The way we do that: there's a method called reset_index. If you have this weird hierarchy thing going on — and it's a bit hard to tell just from the way it's printed — if you pretend to treat it as a regular data frame, it'll work most of the time, but there are some weird edge cases. If you just want a regular flat data frame, run reset_index and it'll flatten everything down for you. Sometimes you have to run reset_index twice, so I wouldn't worry too much about that. And now we have a nice, tidy data frame for our weather data set. All right, last part: what do we do when multiple types of observational units are stored in the same table? An example of that is our billboard data set. Think about how databases work — if you work with databases, or you're a database administrator — and look at the tidy version of our billboard data set: year, artist, track, time, date entered. There are 72 weeks, so for any given song there are 72 rows of information, and the stuff I have highlighted is repeated 72 times. In terms of data storage, this is now leaving the realm of working with data for modeling purposes — because this is the
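The reset_index step he describes can be sketched like this (toy data invented here just to produce a hierarchical index):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "col": ["x", "y", "x"],
    "val": [1, 2, 3],
})

# pivot_table leaves `key` as a row index, not an ordinary column
pivoted = df.pivot_table(index="key", columns="col", values="val")

# reset_index moves the row labels back into regular columns,
# leaving a plain flat DataFrame
flat = pivoted.reset_index()
```

After reset_index, `key` is an ordinary column again, so the frame behaves like the flat data frames used everywhere else in the tutorial.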
format you need for modeling — but if you're thinking about just storing data, you have this stuff repeated 72 times, and that could get really bad very quickly. Let's look at the data set filtered so that the track is the song "Loser" — this song by 3 Doors Down is repeated in our data set a bunch of times, 76 times in fact. That's not good. So how do we fix this up so our data sets are a little more storage- and database-person-friendly? I showed you how to filter your data set, and this is really just a combination of filtering your data set and drop_duplicates. So we want to create a songs data set that just holds information about our songs. How do we do that? Well, we take our billboard_melt, and the information stored in songs is year, artist, track, time — I'll leave date entered separate, but it's fine. So we have our billboard_songs — this is the stuff that was repeated 76 times. If we look at its shape, we have something like 24,000 rows, but if we run this command called drop_duplicates, it returns another data frame with all the duplicate rows removed. Let's reassign that back to our data set, and now if we look at shape, it's really 317. You can clearly see we went from 24,000 rows to 317, and for data storage alone this can save you a bunch of space. All right — I haven't talked about merging data sets together, but think about it: I have a songs data set, great, and I'm eventually going to create a ratings data set, great — how do I combine those two data sets? How do I join those two tables? If you know anything about joining tables, you need an actual key. So let's generate a key. How do we generate a key? Well, one way is really
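The filter-then-deduplicate step might look like this (a three-row toy version of billboard_melt is invented here):

```python
import pandas as pd

# Toy long-format billboard data: song info repeated once per week
billboard_melt = pd.DataFrame({
    "artist": ["2 Pac", "2 Pac", "3 Doors Down"],
    "track": ["Baby Don't Cry", "Baby Don't Cry", "Loser"],
    "week": ["wk1", "wk2", "wk1"],
    "rating": [87, 82, 76],
})

# Keep only the song-level columns, then drop the repeated rows
billboard_songs = billboard_melt[["artist", "track"]]
billboard_songs = billboard_songs.drop_duplicates()
```

The two "Baby Don't Cry" rows collapse into one, which is exactly the 24,000-to-317 shrink he shows on the real data set.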
to make a new column called id, so every song will have an ID in our data set, and we can use the Python function range and say: hey, generate an ID based on how many rows are in our songs data. What does this look like? It creates a column called id. range is a function in Python where you give it a number and it generates values — it's really a generator, but think of it as a list from zero up to that number. So it created 0 to 316 for our 317 songs. That just created our ID for us. Now, if you want, you can say songs.to_csv and save this out as billboard_songs.csv. When you save this stuff, there's a parameter called index — set it to False, otherwise you get this nameless column in your data set, which you probably don't need; we're not using the index in this case, so we don't need it. That's how you save out your songs data set, using to_csv. So now we need to backtrack a little and get another, totally separate data set that says: for a given song ID, what is its week and its rank for that week? We need to go back and join these two tables together. We eventually want to create a data set called billboard_ratings, and we're going to take our original full data set, from before we filtered the songs out, and merge these two tables together — the songs table and the original one. Why do we want to do this? Because we actually need this ID column we just created; otherwise we just have an ID column in the songs data set that we can't really do anything with. So we're going to do a merge — okay, I'm pretty sure I can just do this on the fly — and the data set we want to merge is this songs data set, using merge. You can see we have a left data set:
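Generating the key with range, sketched on invented toy data (the to_csv call is shown as a comment so the sketch doesn't write files):

```python
import pandas as pd

billboard_songs = pd.DataFrame({
    "artist": ["2 Pac", "3 Doors Down"],
    "track": ["Baby Don't Cry", "Loser"],
})

# range(n) yields 0..n-1, giving every song a unique integer key
billboard_songs["id"] = range(len(billboard_songs))

# When saving, index=False keeps the nameless row index out of the file:
# billboard_songs.to_csv("billboard_songs.csv", index=False)
```

For 317 songs this produces ids 0 through 316, which is the join key used in the merge that follows.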
if you read your code from left to right, the left is the data frame that shows up first — so left will be billboard_melt, and right will be billboard_songs, which is what we just created. how, by default, is inner, meaning only the rows that found matches are returned; you can do left, right, inner, outer, etc. if you want. There's a parameter called on: if both tables have the same exact column names, you can use on; if the column names on the left data frame aren't the same as the column names on the right, you can say left_on and right_on. In this example our column names are the same, so we can use on, and we're merging on year, artist, track, and time. Now if we look at billboard_ratings, all we did was a really fancy way of getting this one ID column — but that's good, because now, for ID 0 — if we go back, you can see ID 0 is this song by Tupac. Now that we have the ID in our ratings, we do the same thing we did with songs: filter the columns we want, drop duplicates, and then save it out. So ratings is billboard_ratings, and the columns we care about in this case are id, date entered, week, and rating. Just like at the very beginning, all we're doing is filtering our columns. [Question: the IDs are all zero?] They aren't all zero — it just happens to be zero at the top. Let me take a sample — I think this works — yeah, if I pick a random sample, you can see the IDs are actually different, and with these ID numbers we can now go back to our songs data set, look up an ID, and know exactly what song it is. And that's it. If you want, you could drop duplicates — I very much doubt there are duplicates in this example — and then we can say to_csv, save this out to
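The merge step, as a self-contained sketch (toy frames invented here; the real call merges on year, artist, track, and time — the toy data only shares artist and track):

```python
import pandas as pd

# Left table: the full long-format data, one row per song per week
billboard_melt = pd.DataFrame({
    "artist": ["2 Pac", "2 Pac"],
    "track": ["Baby Don't Cry", "Baby Don't Cry"],
    "week": ["wk1", "wk2"],
    "rating": [87, 82],
})

# Right table: the de-duplicated songs table with its generated key
billboard_songs = pd.DataFrame({
    "artist": ["2 Pac"],
    "track": ["Baby Don't Cry"],
    "id": [0],
})

# An inner merge (the default `how`) on the shared key columns
# carries the id over to every weekly row for that song
billboard_ratings = billboard_melt.merge(
    billboard_songs, on=["artist", "track"]
)
billboard_ratings = billboard_ratings[["id", "week", "rating"]]
```

The merge is "a really fancy way of getting this one ID column": after it, the ratings table only needs id, week, and rating, and the song details live in the songs table alone.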
billboard_ratings.csv, set index equal to False, and then we'll have our two data sets. If we look at info for ratings and songs — I think the order is just messed up — one of the things down at the bottom is the actual memory usage of the data frame. If you look at our original data frame, it's 1.3+ megabytes, and the separate ones are actually much smaller in terms of data set size. So again, the last point of tidy data is more about data storage: if you're actually going to combine data, fit a model, or something, you'll do the merge, combine the data sets, and get the full 1.3-megabyte data set loaded in memory, but in terms of storage this is way easier to store. You've cleaned your data set up and broken it into separate pieces, so if you have to zip it up and email it to someone, or you're at a conference and the Wi-Fi is really bad, this stuff just downloads faster, and then part of your data process is assembling and combining the data sets you need for analysis. All right, going back to the paper, that's all five points. Column headers are values, not variable names: that's essentially just melt the thing. Multiple variables stored in one column: that's really some kind of melt and then some kind of string manipulation or string processing; that's usually the general workflow. Variables stored in both rows and columns: you treat that as two separate pieces — treat it as variables stored in columns, so that's a melt, and then whatever you need, you pivot back up, the opposite of melt. For the next one, the technical term is normalization: you're normalizing your data, breaking information up into logical pieces — make your database person happy; the databases themselves will be happier — and then you do a combination of joins when
you're ready to actually work with your data sets. That way, someone doesn't have to load up a massive thing if they only care about two of the components, versus the two you already have combined together. And the last one was a single observational unit stored in multiple tables: that's essentially our original billboard data set — that's this last problem. When you're ready to do your analysis, you're kind of not adhering to this last point, because you're combining your data sets together — the combined data set contains song data and rating data, but that's what you need for analysis. So that is it; I'm on time. If you have any questions, there are microphones and people running around. Hopefully that gave you the big intro into pandas: this was sort of, show you how to work with pandas, load a data set, subset stuff, and then — I really like talking about this paper, because my job is processing and cleaning data, and this is a pretty good paper about what it means for data to be clean. If you end up learning, or having to learn, another language, this paper, or just thinking about how to clean and tidy your data, is a pretty good way to learn it, because you don't care about what the function is called; you just need to know, I need to do this step — how do I do a melt in R? That makes things a lot easier. [Question:] The question is about how you chain things and make your code look like R. In general, I think that's what you're trying to reach. Let me find the example of code — all right, here it is. R and Python are really fundamentally very different languages in the way they're handled: R is very functional, and Python is very object-oriented. So
how do you make your code look like it's chaining together? First you wrap your commands in round brackets. And what is the type of the result of all this? It's just a regular data frame, so you can treat all this code as a regular data frame. In this example I also ended up having to run reset_index, so how would you chain that? You literally write dot reset_index, because the data frame itself knows how to reset an index. That is the closest thing to chaining there is in pandas. If you want — I think it's called select, it could be — yeah, you can also use this method called select to filter rows and columns, so you'll end up doing something like dot select, dot something, dot something. That's the pandas, Python way of chaining things together.

Just in general — teach a man how to fish — in the pandas documentation there is a bunch of stuff, but there's a section called API reference at the bottom, and that is everything there is to know: all of the functions available to you in pandas. For example, I showed you guys the dot str accessor, and in here if you look up str dot — was it split? — there it is; that's how you find the definition for split. If you're curious, you can also look up what a data frame can do: if you end up on the page for the pandas DataFrame and scroll down a little, you can see all the attributes a data frame has — like T, if you just need to transpose, it's dot capital T. I showed you ix and dot loc, and then you have all of the methods down here — melt, reset_index, etc. All of that stuff is here; this is everything a data frame can do, and if you can find what a data frame can do, then you can just
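The chaining style described here — wrapping the expression in round brackets and calling methods back to back, since each step returns a DataFrame — can be sketched like this (the column names are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["alice smith", "bob jones"],
    "score": [10, 20],
})

# Wrap the whole expression in parentheses so each method call can sit
# on its own line; every step returns a DataFrame, so the next dot works.
result = (
    df
    .assign(score2=lambda d: d["score"] * 2)   # derive a new column
    .sort_values("score2", ascending=False)    # reorder rows
    .reset_index(drop=True)                    # the reset_index step from the talk
)

# The .str accessor mentioned above works the same way, e.g. str.split:
first_names = df["name"].str.split(" ").str[0]
print(result)
print(first_names.tolist())  # ['alice', 'bob']
```

Because each intermediate result is just a regular DataFrame, any method listed in the API reference can be tacked onto the chain with another dot.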
dot-chain them together, and that's the equivalent of chaining in R. Any other questions?

[Question] Sort of a follow-up on that question — I'm curious whether in your shop you've seen, or use, a preference or best practice around working in pandas: if you know you want to index particular columns but you also do a lot of chaining for readability, is there a preference between accessing with brackets versus the dot notation if you know you're going to be chaining extensively?

I don't know if there's an actual best practice. I use the square bracket notation a lot — that's just what my R scripts look like — so I wouldn't say there is one. What really ends up happening, at least speaking for myself, is that you write a bunch of code just to make stuff work, and then hopefully you go back and clean up your code later. If it's one giant data processing pipeline, you might want to end up writing everything in the dot notation just so it looks a little clearer, but in terms of how you write your script originally, from the beginning: if I'm assigning a new column, I'll just write square bracket, new column, equals this other thing. I haven't really written my stuff using the chaining notation very often, other than, you know, if I'm running a group by — that's usually where you'll see a lot of chaining: data frame, group by this, select these columns, and then plot this thing. When I personally write my code, the chained part ends up being something that's trying to do one thing. I'm not trying to run my entire data analysis pipeline in one chain, mainly because if something goes wrong you have to unchain the whole thing, and sometimes I just write things out so that if something goes wrong, at least I don't have to undo 50 lines of code. So, yeah — and then you might
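The two access styles from the question, and the short single-purpose groupby chain the answer describes, can be illustrated briefly (the column names here are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "value": [1, 2, 3],
})

# Square-bracket access always works, even for column names with spaces
# or names that collide with DataFrame methods:
col = df["value"]

# Dot (attribute) access is shorter but only safe for valid identifiers
# that don't shadow an existing DataFrame attribute:
same_col = df.value

# A short, single-purpose chain: group, aggregate, then tidy the index
# back up — the kind of small chain mentioned in the answer.
summary = (
    df.groupby("group")["value"]
    .mean()
    .reset_index()
)
print(summary)
```

Note that assigning a new column with dot notation (`df.new_col = ...`) does not work reliably, which is one practical reason the square-bracket form is the safer default.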
end up in a situation where a data frame just doesn't have a method for something, so you simply can't chain it.

Okay, so that's time, I believe. I'll be around today and also around the conference for the next two days, so if you want, say hello. Tomorrow I have a big stack of books that I will be giving away, so hopefully you guys win one — come find me. [Applause]
Info
Channel: PyData
Views: 150,606
Keywords: Python, Tutorial, Education, NumFOCUS, PyData, Opensource, download, learn, syntax, software, python 3, data scientist, data science, data analytics, analytics, coding, PyCon, Notebooks, pandas
Id: iYie42M1ZyU
Length: 87min 22sec (5242 seconds)
Published: Thu Jan 03 2019