Pandas for Data Analysis | SciPy 2017 Tutorial | Daniel Chen

Captions
All right, so as people are trickling in I'll start the introductions. First, if you haven't done so, please download the repository. What got added, along with a bunch of datasets, is all of the lesson material we'll go over today. I opted to do that just in case we start running out of time: I don't have to type everything, I can just slowly scroll through the page, depending on how we are with pacing, because there is a lot to cover.

A little bit about myself: I am a graduate student at Virginia Tech. I just finished my coursework, so I'm a PhD student but not yet a candidate. I am also an instructor for Software Carpentry, so a lot of our datasets actually come from them. And I think that's it.

So, if you've downloaded the repository, either through git or as a zip file, and you open up the Jupyter notebook (hopefully there's someone to help you if you have issues with that): all of our data is in the data folder, and the notes I'm going to be working from are in this 01-notes folder. Just for me, so I don't start overwriting my notes during this lesson, I'm going to create a second folder to dump everything I'll be presenting into; you can do that too if you want to follow along. And just as a backup, if things really start falling behind, I'll go through the contents of the notes folder directly instead of live coding everything.

Yes — the address? On this page there'll be a little green button that says "Clone or download". If you click it and you know your way around git, there's a git URL that comes with it; if you don't, there's a "Download ZIP" button you can click, and it will download everything as a regular zip file that you can unzip. Just raise your hand if you still need this link up and I'll leave it up.

All right, so I'll keep doing intro stuff. This is the introduction to pandas tutorial. Essentially, what pandas is: if you come from numpy, the easiest way to describe it is that it's numpy but you're allowed to have heterogeneous data. If you come from the R world, it's pretty much the R data frame. And if you don't come from either of those, it's a programmable way to run your Excel spreadsheets. Is there anyone here who has programmed in R, or who has touched R? I know there's at least one. Cool — so I will be making a lot of R references, and that should help you bridge the gap. I learned how to do all this stuff in R, and a lot of the data manipulation tasks carried straight over to Python; after looking at both for at least a year, they look kind of similar. We'll see.

Am I good — does anyone still need this link up? Just raise your hand and I'll think of something else to say. OK, what else to say. Oh — I am using the latest version of pandas, which is 0.21 as of, I guess, last night. If you're running something older than that, there are certain things that don't work anymore; I found that out yesterday when I was putting the lesson plan together, so I'll mention those as they come up, because that's also in my notes: "this used to work — why doesn't it work?" There are a few things that no longer work or are getting deprecated before they make the 1.0 release, and I have no idea when that will be; they keep saying soon. As for Python 2 versus 3,
at least for this tutorial there is only one case where there's an actual difference, and that's when I write a function that uses the division sign; I'll point it out if you're on the Python 2 version. Otherwise everything should just work. The last time I gave this tutorial, at PyData last year, I literally had to ask someone in the audience for a Windows computer to install Anaconda on — so the installation shouldn't be too bad; there are very, very few dependencies if you just installed the entire Anaconda distribution. All right — can I take this link down now? Anyone else need the link? Because I'm running out of things to say. Oh, I can write it on the whiteboard — now I have a whiteboard.

All right, so here's the Jupyter notebook. pandas is a library in Python, and for those of you who are absolutely new to Python: Python comes with a lot of tools already — some you have to download, some just come with a regular Python install — but packages like this you actually have to import. The package we are importing is called pandas. And for those who are new to the Jupyter notebook: each one of these cells is a code cell, so you can run regular Python code in them (you can also do some other stuff that I won't go over). To run a cell you hit Shift-Enter, which runs that block of code and gives you a new one underneath; if you hit Ctrl-Enter it just keeps re-running that block and won't move on to a next cell. If you need to check the version of pandas you're using, like I had to yesterday, it's pandas.__version__ — double underscore, version, double underscore — and I'm running 0.21 right now.

In Python, if you're trying to use a function that's part of a package, unlike R you have to actually tell Python where exactly to find it. pandas has a nice function called read_csv that lets you read CSV files, but to actually use it we have to say pandas — the library, or package, we're using — dot read_csv. For the first dataset we're using: the way my folder structure is set up right now, I am in this folder called 02-lesson, I have a separate folder for my data, and all my notes are in the 01-notes folder. For me to get to my data, because I'm in a separate folder, I have to say two dots — meaning go one directory up — then go into the data folder, and our first dataset is the gapminder.tsv file. Just by looking at the file extension, it's a TSV, so it's tab-separated, while the function defaults to comma-separated files. The way we read tab-separated files is to add another parameter to the function: there's a delimiter parameter, and we pass '\t' to say, hey, each new piece of data isn't delimited by a comma but by a tab, and '\t' is how you write a tab.

You can see that all pandas did was dump our dataset out to the screen. That's not particularly useful, other than confirming we loaded it and seeing its contents. So what we can do is save it to an actual variable. I'm going to pick a variable called df and run the same function again: whatever is on the right-hand side of the equals sign gets saved, or assigned, to the thing on the left side, so now if I type df in a cell I get everything back. A useful thing you're going to use very often is the head method of a data frame: instead of dumping out the entire contents (or most of the contents) of the data frame, it just gives you the first five rows, which makes it a little more manageable to see what's going on.
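To recap the commands so far in one place — a minimal sketch, assuming the repository's folder layout with gapminder.tsv one directory up in the data folder:

    import pandas

    # the file is tab-separated, so override the default comma delimiter
    df = pandas.read_csv('../data/gapminder.tsv', delimiter='\t')

    # first five rows only, instead of dumping the whole thing to the screen
    print(df.head())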
A lot of times you're going to see people write their import statement as "import pandas as pd". What this allows you to do is say the same exact thing as before, but instead of typing the word pandas you type pd. It's called an alias; some package names end up being a little too long, or you're going to be using a lot of functions from them, and this just saves you a little typing. Just to show you, it does exactly the same thing.

There's a generic function in Python called type. If you ever get confused as to what's going on — or if there's an error saying a function expected one thing and you thought that's what you gave it — run the regular type function on your object. You can see here it says this df variable is a pandas DataFrame object.

So after you load your dataset, what do you do? Probably look at the head of it; that usually helps you see whether your column names got loaded properly, or whether your data got loaded at all. Another thing you might want to know is the number of rows and columns of your dataset, and you can use the shape attribute for that: the number of rows shows up first — let me make this bigger — and the second number is the number of columns. shape is an attribute, so if you had, for example, thought it was a function or a method and called it with round brackets, you would get an error. That's something you learn over time — which things require a set of round brackets and which don't. I forget things like this all the time.

The next thing you'll do a lot after loading a dataset is run the method called info. The first line info gives you is essentially the type of the object you gave it — this is a DataFrame. It will say this data frame has 1704 rows and six columns, and then for each column: the first part is the column name, the second part tells you how many non-missing values you have (in this dataset we don't have any missing values), and the last part is the actual data type of that column. In numpy everything is some kind of number, but in pandas each column can be a different data type — that's sort of the benefit, that's what pandas gives you.

So the first thing we're going to go over is the absolute basics of how you pull rows and columns out of your data frame. If we just want the country column, we can say df, a set of square brackets, and then, in quotes, the column that we want — 'country' — and that gives us just the country data out. We can save that to a variable, country_df, and then if we want to use it again, we use it like that. Another common syntax you'll see all over the internet: if you're just trying to pull out one column, you can use a dot and then name the single column you want directly. There are caveats — if the column name has spaces, for example, you have to use the square-bracket way, otherwise this won't work. So the dot syntax is really just a convenience if you need one column; don't rely on it behaving exactly the same as the square brackets every single time, and one example of that is deleting a column, which we'll get to in a moment.
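A quick sketch of those inspection steps, assuming the gapminder frame from above (the column really is named country in that file):

    import pandas as pd

    df = pd.read_csv('../data/gapminder.tsv', delimiter='\t')

    print(type(df))   # <class 'pandas.core.frame.DataFrame'>
    print(df.shape)   # (1704, 6) -- an attribute, so no round brackets
    df.info()         # rows, columns, non-null counts, and dtypes

    country_df = df['country']    # square brackets: always works
    also_country = df.country     # dot syntax: convenience only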
If you need multiple columns it's the same syntax — your data frame, a set of square brackets — but in order to provide multiple values you have to give it a list of the columns you want, and a Python list is created using square brackets. That's why, when you need multiple columns, you'll see two sets of square brackets. So if I wanted the country column, the continent column, and the year column, plus just the first five rows, that's how I'd subset the columns.

Yes — how I've been creating cells: if I hit Shift-Enter it will run the cell and then create a new one underneath; if you want to manually create a new cell you can say Insert Cell Above or Below. And yes, if you want just the column names, or headers, there's an attribute called columns — we'll play around with columns in a little bit, but if you just need a list of all the columns, that's how you'd get it. Typically I'd run head on my data frame just to see the column names, but if you have thousands of columns you'll probably end up pulling them out like that. And yes — it used to be you could write something like df[0] to say "give me column 0"; you can't use it that way anymore. I'll show you other ways you can get things out by index; that used to work.

If you need to delete a column, there's a general command in Python called del, for delete. If you wanted to delete the country column from our data frame, you'd say del df['country']. Note that del with the dot syntax would actually not work, and that's why people on Stack Overflow say you should probably get used to writing your code with the square brackets — the dot form really is just for convenience, and Python actually handles, or reinterprets, the two syntaxes a little differently, which is why the del version only works with the brackets.

The other way you can delete a column is the drop method. Here you say drop, the column that you want, and then you have to say axis=1 to say you're dropping a column — we'll go over what axes are in a bit. The one thing with this is that if you run it, it drops the column, but if you look at your data frame again, the change didn't get saved; you actually have to assign the result back if you want to drop a column this way. Or there's another parameter called inplace that you can set to True, and then it drops the column in place, so you don't have to do the reassignment or have the data frame copied in memory. And yes, I believe you do need the axis=1, because by default the axis is set to drop rows — these row names can have labels too, so it would go looking for a row labeled 'continent', and that's probably not going to work.

All right, so we're going to reload that dataset, since I just messed it up — pd.read_csv again. So that's pretty much how you would subset and remove different columns.
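Here's a sketch of both ways to remove a column; I'm working on a throwaway copy since these actually change the frame:

    df2 = df.copy()

    del df2['country']                     # works; del df2.country would not

    df2 = df2.drop('continent', axis=1)    # axis=1 means "drop a column"...
    df2.drop('pop', axis=1, inplace=True)  # ...and inplace=True skips the reassignment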
If you have a very large dataset, probably the first thing you'll do is pull out the columns you actually care about rather than use everything. But another thing you'll end up doing is slicing, or looking at, different rows of your data — maybe you only care about a certain range, or you just want to spot-check some row to make sure your dataset got loaded properly. There are three ways you can subset rows; two of them are going to keep moving forward, and the last one is sort of deprecated.

The first method is this loc attribute. loc lets you pull out different rows of your data, but it pulls them out by what the row is labeled. In this example, when I say pull out row 0, it happens to be the same thing as index 0, just because our data frame is labeled in order from 0 to 1700-something. This also shows you Python is zero-indexed: if I want the first row I pass 0; if I want the hundredth row I pass 99. Those of you who have programmed in Python before will know that if you want to count backwards from a container of some sort you can use negative numbers, but in this case loc won't work, because it goes looking for the label — the row labeled -1 — and that doesn't exist in our dataset.

So how do we do it if we really wanted to use -1? There's another way you can slice rows, and that's using iloc, for "index location", I'm guessing. If we give it 0, that gives us the same exact thing as before, but this 0 is actually the position — the first position in our data frame. If we say iloc[99], that also gives us the same exact information. The difference here is that -1 will work, because it's saying "pull from position -1", counting from the end, not "find the row labeled -1".

Yes — you can label your rows however you want, and you can have gaps. I'm not going to go over time-series things, but sometimes you have some period of time and you need to fill in the dates in the middle, so you can insert dates as well and it will create rows for you; you can relabel any part of this data frame. And just to reiterate: if you start sorting these things, the label follows the row, so you then have to be very careful if you're using iloc — are you always referring to the first element, or to the label that just got shifted after the sort?

There's another way that's getting deprecated, which I found out yesterday when I ran this: ix. I'm really just showing you this for people who are on an older version, and because you'll definitely see it on the internet. It does exactly the same kind of thing, but if you're running a newer version it's going to say ix is deprecated, please use loc or iloc, with a bit of documentation. Essentially what's going on with ix is that it behaves like loc first — it looks for the row index label, the name — and if it can't find the name, it treats what you gave it as an index position, the actual row position. So ix[0], ix[99] — and ix[-1]... why doesn't that work? OK, I don't know why that's not working; I'll come back to that.
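A side-by-side sketch of the label versus position behavior:

    df.loc[0]      # the row *labelled* 0
    df.loc[99]     # the row labelled 99 (the hundredth row here)
    # df.loc[-1]   # KeyError: nothing is labelled -1

    df.iloc[0]     # the first row by *position*
    df.iloc[99]    # the hundredth row by position
    df.iloc[-1]    # the last row: positions can count from the end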
So far I've just shown you how to pull out one row at a time. If you want, you can give it another set of square brackets, just like we did when pulling out columns, and pull out several rows that way as well. The way ix behaves, it looks for the label first, so we get the same labels out; it just so happens that in this dataset the label and the index are the same, so we can use this notation too.

Staying with ix a little longer: this notation can also be used to subset rows and columns at the same time, and this is where it looks very much like how you subset data frames in R. Inside the set of square brackets you have a comma: things to the left of the comma define how you subset the rows, and things to the right of the comma define how you subset the columns. So if I want the rows labeled 0, 99, and 999, and the country, life expectancy, and GDP-per-capita columns, that's how I'd subset both rows and columns at once. You can do the same exact notation using loc, because this is basing everything off of the names. But had I done this using iloc it would not work, because I specified my columns using labels, not positions; with iloc I'd specify 0, 3, and 5 to get the same information.

Using this type of notation you can also do what's called boolean subsetting. Let's say the life expectancy mean is going to be df['lifeExp'].mean(). Each column of a data frame ends up being what's known as a pandas Series — just think of a Series as a column — and these Series all have various built-in methods; mean is one of them, so you can easily pull out a column and run a basic summary statistic on it. In this example I'm calculating the mean of the life expectancy, which comes out to, I guess, 59 years old; that's the life expectancy across all countries and all years in our dataset. Probably not a very useful number — though actually that's pretty high, given how far back the data goes.

You can use this loc method to subset your data based on some other condition. Remember, with loc, things to the left of the comma are how you subset rows, things to the right are how you subset columns. If I want all the columns, I can put a colon there — that's slicing syntax for "give me everything". So I'm saying: give me all the columns, but filter the rows to where life expectancy is greater than the mean value I calculated. That gives me back all of the rows in my data where the life expectancy for that row was greater than the average life expectancy calculated across the entire dataset.
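A sketch of the row-and-column subsetting plus the boolean filter, assuming the gapminder column names (lifeExp, gdpPercap) and their positions:

    # label-based rows and columns
    df.loc[[0, 99, 999], ['country', 'lifeExp', 'gdpPercap']]

    # the positional equivalent: those columns sit at positions 0, 3, 5
    df.iloc[[0, 99, 999], [0, 3, 5]]

    # boolean subsetting: rows above the overall mean, all columns
    le_mean = df['lifeExp'].mean()   # roughly 59 here
    df.loc[df['lifeExp'] > le_mean, :]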
As the end of part one, one other really cool thing you can do with pandas is grouped, or aggregate, statistics. This is a concept that's really powerful and carries over to the other tutorials going on, like the Dask one down the hall. Let's say we wanted to calculate the average life expectancy for each year in our dataset — that's the question we have about this data. We don't want the average life expectancy over the entire dataset; we want a year-by-year breakdown. What a groupby statement does is let you pass in this variable, year, and essentially subset your data frame by each unique value of it: in your mind, each unique value of year gets its own separate data frame, then you pull out the life expectancy values within each one and calculate the mean that way, so you get an average for each year.

The way we do that: there's a function called groupby. In the R world this is sort of like tapply in base R, and if you're in the tidyverse world of things, it's actually called group_by there as well. So we're going to group by each year, and you can see all this does is give you back a GroupBy object — nothing actually got calculated yet; it just tells pandas that if you do make a calculation, make it on each subset of year. So in our example here I say: break my dataset up by each unique value of year, pull the life expectancy column out of it, and calculate the mean. This did roughly the same thing as before — if you just ignore the part I highlighted, what we did before was calculate the average life expectancy over the whole dataset; when we say groupby('year'), it parses, or partitions, our data by each unique value of year, calculates the mean there, and then stitches it back together.

Another common term you'll hear for groupby statements is split-apply-combine: you split your dataset into various parts, each part essentially independent from the others; you apply some function — in our case the function we're applying is the mean, so for each partition it calculates the mean — and then you combine it back together, so you get one final dataset at the end. This paradigm of splitting your data into isolated partitions to run a calculation, then putting it back together, is sort of the cornerstone of how distributed computing works, so this kind of method is very scalable and carries on as you progress in your data analytics career.

OK, so — this groupby('year'): essentially, under the hood, pandas just knows that if you make a calculation, it has to break things up by year first. You can think of it as a regular pandas DataFrame that has some special marker attached saying "if you do anything, split me up by this variable first". And since I essentially have a regular pandas DataFrame, I can pull columns out of it just like I pulled columns before, using regular square brackets. This part of the syntax I've highlighted — ignoring the groupby part — is really the same as "I have this data frame, I'm pulling out the life expectancy column and calculating the mean"; the groupby part I inserted just says, right before you make that calculation, break it up by the variables I specified — in this case, year. And yes, this notation here is just like what I did up above: pulling out columns is square brackets and then, in quotes, the column you want.

That's actually my next example, but — you can pull out multiple columns too, and remember, if you're pulling out multiple columns you have a second set of square brackets. So say I take life expectancy and GDP per capita: now I've broken my dataset up by each year, and I'm pulling out life expectancy and GDP per capita and calculating the average of each.
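A sketch of the grouped calculations just described:

    # lazily marks the frame as split by year; nothing is computed yet
    grouped = df.groupby('year')

    # split by year, take each partition's mean, combine the results
    df.groupby('year')['lifeExp'].mean()

    # several columns at once -- note the second, inner set of brackets
    df.groupby('year')[['lifeExp', 'gdpPercap']].mean()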
If you want to put multiple variables in the groupby, you can; the syntax is just like before — if you ever need more than one of something, you surround it in a set of square brackets. So we can group by year and continent, and this says: for each year and each continent, give me the average life expectancy and the average GDP per capita. If you have a relatively clean dataset, this is a really good way to explore your data quickly.

Another thing you'll see people do — because you can already see this line of code running off the screen a little — is use line continuation. In Python, if you put a backslash at the end of a line, it actually tells Python, hey, the line is continuing onto the next line. So you'll see people write their code like this — I'll use fewer spaces, my syntax highlighting is turning red, and get this lined up — and it makes things a little easier to read too: I have my data frame; the first thing I do is a groupby; then I take the mean of whatever I grouped by. You can just read down the lines and get a rough idea of what's going on. Another way you'll see people write it, instead of using the backslash, is to surround everything with round brackets — that's another way Python knows to treat everything between a set of round brackets as one command, so you'll see people write their code like that as well. (Yes — it should be backslash, then Return; I'll ask a helper to help you, it sounds like an editor quirk. I think the break goes after the dot... yes.) And yes, I'll get to that way later — this part is just supposed to get you interested and convince you that this is a useful tool.

One other thing we can do — that's not what I'm looking at, right — we can group by continent, pull out the country column, and then there's a method called nunique that counts the number of unique values. So here we're counting, for each continent, how many countries there are — Oceania is, I guess, just Australia and New Zealand. These are the kinds of checks you do when you first load a dataset: is your dataset complete? I'm pretty sure this doesn't have every single country in the world, but it looks complete enough from what little knowledge I have, and there aren't any glaring, very obvious errors jumping out at me at the moment. And yes — I'll get there; for now we're just going to assume we don't have any missing values, and the first time I show you a step that involves a missing value, we'll go over it.
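A sketch of the multi-key groupby written in both continuation styles, plus the nunique count:

    # backslash continuation
    df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']] \
        .mean()

    # round-bracket continuation: everything inside the parentheses is one command
    (df
        .groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']]
        .mean())

    # how many distinct countries show up in each continent?
    df.groupby('continent')['country'].nunique()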
The last thing I want to show you, if you aren't convinced enough: let's group by year and get the life expectancy, so all I've done is set a variable to the average life expectancy for each year. pandas is pretty cool in that there are plotting methods built right in. That's not working because of matplotlib — I think I'm going to have to import it. So, these two lines of code — I'm pretty sure I have to import matplotlib to use the second one — but essentially what the first line is doing is saying: whatever plot gets rendered in the Jupyter notebook, please display it underneath my cell. If you don't do that, you're going to end up writing something like plt.show(), and plots show up as pop-up windows. So again: pandas is pretty cool, there are plotting methods built straight in; since we gave it a series of values, it knows how to plot them without much more code.

Yes — how do I deal with the hierarchy? One way I do it: I come from a statistics background — even though I'm not a statistician, I work with a lot of statistical datasets — and a lot of our datasets are actually flat; they don't have this hierarchy. If you look at the output of this, it looks a little different — there's this weird indentation thing going on. One thing I'll do if I want a completely flat dataset again is use a method called reset_index, and that will completely flatten the dataset, and then I can use it in my downstream processes. That's usually what I do; with the kind of data I work with, I prefer it to look flat and not like some weird hierarchy, and the syntax for the hierarchical version is a little different, so that's how I avoid it. Usually when I'm done doing some kind of group or aggregate, I run reset_index, and then it actually becomes a regular data frame like we've been working with this entire time.

All right, so the next thing — the last thing for part one — is how you save these things out. I have this vector of average life expectancies per year, and you can see it's actually a pandas Series object, not a DataFrame anymore, because of the way it was calculated: the labels are now the year and the values are the average life expectancy. If I want this to be a regular data frame again, I run reset_index, and then I get my regular data frame with a column for year and a column for life expectancy — a regular data frame like I expect — so I can save that to a variable, and that variable looks great.

So now I want to save this out. Each data frame has a family of to_-something methods that save to a given format: you can save to a CSV file, you can save to an Excel file — there was an HDF5 class yesterday, so you can do that too — there are a bunch of different ways to save things out, depending on what you care about. The most universal file type is comma-separated values. What you pass to to_csv is the location you want to save to. I'm currently working in the 02-lesson folder, so if I say two dots it brings me to the folder above, then I have a folder called output, and I can say life_expectancy_by_year.csv. That will save out our CSV file — if I show you... oh, I just spelled it wrong; it's up here, let me spell it correctly, and also delete everything stale.

So I have this life-expectancy-by-year dataset, and looking at it, it kind of looks right: I have my year column and my life expectancy column, but I also have some empty column, and each row has three elements — the row index label actually got saved along with our dataset. That's just the default behavior of to_csv.
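A sketch of the plot and the flatten-then-save step; the variable name and output path are mine, assuming the repository has an output folder next to data:

    %matplotlib inline               # notebook magic: render plots under the cell
    import matplotlib.pyplot as plt

    gyle = df.groupby('year')['lifeExp'].mean()
    gyle.plot()                      # Series and DataFrames have .plot built in

    flat = gyle.reset_index()        # Series -> two-column DataFrame (year, lifeExp)
    flat.to_csv('../output/life_expectancy_by_year.csv')  # saves the index too, by default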
If you don't want that — and chances are you don't — you pass index=False, and that will leave the index out. So if your index isn't anything special — all it's doing is keeping count of the row number — pass in index=False; it makes your file sizes a little smaller, and the index won't get loaded back when you reload the dataset.

For the people that work back and forth between R and Python: you actually have to install this separately, but there is a file type called feather. It's a binary format that will save out from Python and give you a feather file you can load straight into R; being binary, it'll be smaller and load a lot faster than a plain CSV file. The rule of thumb is: you can totally use it if you're just bouncing between the two languages, just don't rely on feather for long-term storage, because it's not that stable yet — so I've been told. And this one — yeah, I actually don't know; oh, it looks like you give it a database connection and it just dumps the frame to a table. That is what it does, yeah.

Also, the people at Continuum — well, it used to be part of the Blaze project — have a library called Odo; just Google "odo" (for the Star Trek fans: Odo, the shapeshifter). You give Odo an incoming data type and an outgoing data type, and it figures out how to traverse a graph of conversions to make that happen. Your mileage will vary with how well it works — CSV to something is an easy example, but you can go from, say, SQLite to a JSON file and it will figure out all the weird intermediate conversions. The only issue is that in some of those steps you're going to lose information, and there's nothing you can do about that; but if you just need the thing at the end, you can use Odo as a library to get there without writing, or figuring out, the intermediate steps yourself.

All right — that's part one; we're doing well on time.
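For completeness, a sketch of those two interchange routes. Both live outside core pandas — the first needs the feather library installed separately, the second the odo package — and the file paths and table name here are made up for illustration:

    # feather: fast binary interchange with R (not for long-term storage)
    flat.to_feather('../output/life_exp.feather')

    # Odo: name a source and a target and it works out the conversion path
    from odo import odo
    odo('../output/life_expectancy_by_year.csv',
        'sqlite:///../output/lessons.db::life_exp')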
The next thing is data assembly. You have a dataset, and typically — at least with the datasets I work with — sometimes they're large enough that they live in separate files, so your next job will be to put them together so you have a single dataset. This should be a new notebook, so in the new notebook we have to import pandas again. I have three datasets for you, and these are toy datasets: concat 1, 2, and 3. This is what they look like — they're all pretty much the same, same columns, same row labels; they're just three data frames that we're going to concatenate together. All of our data frames look like df1; just the insides are a little different, so you can tell them apart.

This is an example of concatenating your data row-wise, which is actually pretty common: a dataset that's five million rows might be broken up into five sets of a million rows each, just because that might be the limit you can send as an email attachment or something. However your data comes in pieces — time-series data, logs you're collecting — when you finally do your analysis you're going to have to concatenate the files together so you have a single dataset to work with. (Just raise your hand when I can move on and scroll down.)

So, we've been using the read_csv function, but pandas has another function called concat that lets you concatenate datasets. concat takes a container — a list — of the data frames you want to concatenate. We have our three data frames, df1, df2, df3, and if we run it you'll see that all it did was concatenate our data together. I'm trying very hard not to say "stacks", because stack is an actual pandas function — if that word helps your mental model of what's going on, fine, just realize it's a real function name in pandas. All this did was take each data frame and append it to the end of the previous one, and so on. They don't have to match, and I'll show you that; they don't even have to be in the same order — it will know, and will realign them.

Which index — like this? These row labels just come with pandas; that's just the row number by default. If you never want them, can you get rid of them? You can when you save out, but they're always going to be there; whether or not they're useful to you is up to you. Those are row labels, and that's my next point: you'll remember that loc subsets by row label, so if I do loc[0] now, it pulls out all the rows labeled 0 — I get three of them, because that's what happened when I concatenated our data. If you really wanted the first row, you'd have to say iloc[0]. Yes — reset_index, there we go... I forget, there's a way to do this without making it look funky, and I'll look it up during a break. What I did kind of made it worse but fixed the problem; there's a much cleaner way to do this, I know that — I'll just get rid of that cell, even though it's recorded forever.

Another thing you'll hit, especially if you're creating data or building a data-collecting pipeline: you might end up creating a vector — a row of data — that you want to stack, or append, or concatenate onto existing data. We can create a Series by saying pd.Series with the values 'n1', 'n2', 'n3', 'n4' — "n" for new. So this is our series, and if we expected to be able to concatenate it on as a row, you'll see that it didn't exactly do that — but this behavior is exactly what concatenating does. If you turn our new row into a data frame, it ends up with a single column called 0, just because it never got a label, so pandas numbers it 0, 1, 2, etc. And all concat does by default is bind things to the end, row-wise: we had this new data frame with a column named 0, and because there was no column named 0 already, it created a new column and just tacked our data on at the end, with missing values everywhere else. Great — so that's not exactly what we want.
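A sketch of the row-wise concatenation, assuming the toy files are named concat_1.csv and so on in the data folder; the last line is pandas' built-in way to get fresh row labels, which I'd guess is the cleaner fix alluded to above:

    df1 = pd.read_csv('../data/concat_1.csv')
    df2 = pd.read_csv('../data/concat_2.csv')
    df3 = pd.read_csv('../data/concat_3.csv')

    row_concat = pd.concat([df1, df2, df3])
    row_concat.loc[0]    # three rows: each original frame contributed a label 0
    row_concat.iloc[0]   # unambiguous: the first row by position

    pd.concat([df1, df2, df3], ignore_index=True)  # renumber 0..n-1 while concatenating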
If we want to actually create a row of data and have it appended — concatenated — the way we expect, we create a data frame. The way you create a data frame is pd.DataFrame, capital D, capital F, in the pandas library. You give it the values that you want — 'n1', 'n2', 'n3', 'n4' — and then, because concatenating will automatically align the columns, you have to specify the column names with the columns parameter, so we say A, B, C... actually, I'll flip the last two, just to show you that it will automatically align them. I forgot a bracket somewhere — oh, I need double square brackets, sorry; I even have that in my notes. OK — so we have a new data frame whose columns are A, B, D, C, and if we run pd.concat on df1 and the new row, you'll see that it concatenates our data and also realigns the columns — it knows which columns to line up based on the column names. So you don't have to worry about whether the data you're appending is in the right order; it'll know, and do it accordingly.

Yes — when you end up with a single row, it comes back as a Series, so they are different objects. That's the behavior you see in R as well: when you get a single vector of something, it comes back as just a vector — here, a Series object, not a DataFrame object. This thing here is actually a DataFrame object; essentially a Series is always one row or one column, so if you ever have more than one, it's probably a DataFrame.

Last example: this happens a lot if you're web scraping or something — you're collecting data, and not every single web page on the site has the same amount of information, so you might collect data with more or fewer columns than your final data frame. That works as well: all that happens is the columns that don't match get a missing value. Yes — if you end up with extra or fewer values, something like that, it'll just fill in missing values. (Oh yeah — that's the next part. At least I know I'm pacing the lesson well if your next question is exactly what I'm doing next.) And can you fill them with something like 999 in the same step — which is very common in health datasets? I don't think you can do it in one go; you'd have to re-replace them with 999 afterwards, so when I go over missing values I'll show you how.

All right — concatenating columns. If you want to concatenate columns, the R way is cbind (rbind is what we were just doing): it's exactly the same function, and the syntax is pretty much exactly the same; the only difference is there's another parameter called axis, and we set it to 1 if we want things to operate column-wise. Again, you'll see that all it's doing is mashing these things together side by side, so if you have repeated column names, it's going to give you exactly that. And no — none of these are unique identifiers; the index position is the unique identifier, so if you need something absolute, like "give me the first one", that's one way you'd be able to get it. (If you turn your head... I'll probably fix that before I go on.)
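A sketch of the new-row and column-wise cases, assuming the toy frames have columns A through D:

    # a one-row DataFrame: note the double brackets -- a list containing one row
    new_row = pd.DataFrame([['n1', 'n2', 'n3', 'n4']],
                           columns=['A', 'B', 'D', 'C'])  # last two flipped on purpose

    pd.concat([df1, new_row])      # columns realign by name, so the order is fine

    pd.concat([df1, df2], axis=1)  # column-wise: the cbind-style side-by-side mash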
That's a segue into the next part: how do you rename your columns. I showed you before that .columns gives you the column names in your dataset; you can easily reassign them just by setting .columns equal to whatever you want the names to be. I was going to reassign all of them — actually, I kind of went through this example already, so I'll skip that one; it's exactly the same thing. So if I want, I can reassign my column names just like that, and if you then try to concatenate the frames together, it'll align the columns just like before. If you want to rename the rows — the row index — the way you do that is: instead of columns, it's the index attribute, and you can give it whatever you want, 0, 2, and so on. So you can rename these as well, and this is where you end up using the difference between loc and iloc, if you want to start differentiating "the first row" from "the row actually labeled 0". And just to finish off the examples, the rows will also realign themselves. This is one of the cool things if you do work with time-series data: if your row index is a date, or some datetime object, and you start concatenating things together, it will align itself, so you don't end up with a mess of a dataset — very useful for time-series things, but I'm not going to show you time-series things. Yes — how do you reorder columns, sort them? The way I've been doing it is essentially column subsetting, but specifying the columns in the order I want; that's what I use as a column sorter.
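A sketch of relabelling and the alignment it buys you; the new labels here are arbitrary ones I picked:

    df1.columns = ['A', 'B', 'C', 'D']   # assign a full list of new column names
    df2.index = [0, 2, 5, 7]             # same idea for row labels; gaps are fine

    pd.concat([df1, df2], axis=1)        # rows align on the (new) labels, NaN elsewhere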
So, the next thing. Concatenation is for when you have one large dataset broken up into pieces, but another common task is merging data — when you have different datasets that you want to merge together. For the SQL people, these are like doing table joins, and you can do that in pandas as well. We have four datasets to load: person, site, survey, and visited. These actually come from the Software Carpentry SQL lesson — there it's a SQLite database, but I sort of just wrote it all out to CSV files. There's a person dataset, about the people on this Arctic exploration; a site dataset, of where they've taken readings; a survey dataset, where they actually performed a set of readings at a particular site; and a visited dataset, which tells you who visited which sites on what date. A very common task, if you're going to do some kind of analysis, is to put all these tables together into one dataset.

The way we do that — actually, the first step is that I'm going to subset this visited dataset down to a few rows, just so I don't have duplicates, so I can show you the next example. If any of you are familiar with joining tables: there are a few different kinds of join, and they all end up using the same exact syntax; what differs is what happens. There's something called a one-to-one merge: if you have a series of values, and another series of values, and there are no duplicates in either of them, and you just want to combine the two datasets, that's essentially what a one-to-one merge is — you just get a one-to-one alignment of your data. Another type of merge is when one of the two datasets has duplicate values in the join column: then you're doing a many-to-one, or one-to-many, type of join, and what happens there is that the data from the side without duplicates gets duplicated into the side that has them. And the other thing that can happen is duplicates in both datasets you're trying to join: what you end up with is known as a Cartesian product, where every pairwise combination gets created and you end up with way more rows than you expected — though sometimes that's what you want. The syntax for all of these is exactly the same; you just have to be mindful of what your data contains, and whether you're doing one-to-one or many-to-one, et cetera, really just depends on what's in your dataset.

So how do we actually do these merges? pandas has a function called merge, and there are two ways you can get at its documentation. First, in the Jupyter notebook, if you hit Shift-Tab it gives you the function signature — that's how I've been seeing these. Another trick: if you type pd.merge followed by a question mark, it gives you the actual help docs. If you just need something to jog your memory without Google, or you don't have internet access, that's there for you.

When you're doing these merges — this language is pretty much universal — there's a left dataset and a right dataset (and on camera, it's literally left and right). You specify a left dataset, a right dataset, and how you want to join them. An inner join only keeps the values that have matches. A left join keeps all the values on the left side and merges in values from the right, so you'll end up with missing values where a match isn't found; you can do a right join; you can do an outer join — there are different ways of doing joins. The next parameter is on: you can use the on parameter if the columns you want to match on have the same name in both frames. If they don't have the same name — which I think is most datasets — you can say left_on and right_on; your datasets don't have to share column names for you to perform a merge. So: our left data frame is going to be site, our right one is going to be visited_subset, and we say left_on='name' and right_on='site' — and there's the joined data frame at the end. Yes — inner is the default way it behaves; for all my examples I'm just doing inner joins, but depending on what you need you can do left, right, or outer. And the left_on/right_on is simply specifying that on the site data frame I'm using the name column as the key, and on the visited_subset data frame I'm using the site column as the key. Yes, you can join on multiple keys — like everything else in Python, if you have to give it multiple of something, you surround them in a set of square brackets, comma-delimited, and it will do a multiple-column match for you. (I think I took that example out — I was so panicked about finishing the material that I figured you'd get the point.)

Just to show you a many-to-one, or one-to-many, merge — again, this depends on what your data looks like: another way you can spell merge is as a DataFrame method, df.merge, so you don't have to say pd.merge; I think under the hood it just calls the same function. Written this way, the left data frame is the one you call .merge on, and the right one is the one you pass in — the one actually closer to the right-hand side of the screen. This performed the merge, and one of our datasets had a missing value, so you can see that missing value getting propagated when you perform these merges.
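A sketch of the one-to-one merge; the file names follow the repository's survey_*.csv naming, and the exact rows kept in the subset are my guess at a duplicate-free slice:

    person  = pd.read_csv('../data/survey_person.csv')
    site    = pd.read_csv('../data/survey_site.csv')
    survey  = pd.read_csv('../data/survey_survey.csv')
    visited = pd.read_csv('../data/survey_visited.csv')

    # drop the duplicated sites so the join really is one-to-one
    visited_subset = visited.loc[[0, 2, 6]]

    # match site['name'] against visited_subset['site']; how='inner' is the default
    o2o = site.merge(visited_subset, left_on='name', right_on='site')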
To show you what the Cartesian product looks like, I'm going to make two data frames. You can make a data frame using a Python dictionary — a set of curly brackets — and the way that works is that the key becomes the column name and the values become the entire contents of that column. Before, I specified the information row-wise; with a dictionary, you specify the information column-wise. So this becomes a column called 'a' that has 1, 1, 2, 2 in it, and so on. This is our data frame one — you can see, if I actually display it, that I specified this data frame using a dictionary: the keys became the column names and the values became the column contents. And if you look at data frame one and data frame two, you can see that in column a — this first column — there are duplicate values in both sets. So this is the many-to-many case: duplicates on both sides. If I do df1.merge(df2) with the left on 'a' and the right on 'a', you'll notice I got a lot more rows back than what I started with — every row in data frame one with a given key got paired with every row in data frame two that has the same key, so it's creating every pairwise combination. Any questions about what's going on there? Just keep in mind that this is something that can happen when you're performing a merge: if you suddenly end up with a million-row dataset and you're wondering, "where did this come from? the two things I joined weren't anywhere near that size" — this is probably what's going on.

All right, last topic before the first break. To prevent the Cartesian-product problem — if you wanted one-to-one instead of many-to-many — I think you're going to have to get rid of the duplicates first. Right: you're going to have to deduplicate, because even if you say a left join, all "keep everything on the left (or right) side" does is make sure no observations on that side get dropped. The easiest way to tell you have duplicates is if your resulting number of rows is greater than either of the datasets you started with: if, after the join, you end up with a dataset of a hundred rows joined to one of fifty producing five thousand — or even a hundred and one — you know something got duplicated. That's something for you to do more data exploration on; I don't think there's anything you can do in the actual merge call itself that will prevent it. Another way to check would be to take the key column and do a frequency count — which I'll show you — and if a count is greater than one, you know you have duplicates.
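A sketch of the many-to-many blow-up with toy numbers of my own choosing:

    # duplicate keys on both sides of the join
    df_a = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [10, 20, 30, 40]})
    df_b = pd.DataFrame({'a': [1, 1, 2, 2], 'c': [100, 200, 300, 400]})

    both = df_a.merge(df_b, left_on='a', right_on='a')
    both.shape   # (8, 3): more rows than either input -- the telltale sign

    df_a['a'].value_counts()   # frequency count: anything above 1 is a duplicate key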
All right — so, as promised, we've seen missing values, and now we're going to work with NaN for a little bit. This is a new notebook, so I'm going to have to import things again. The missing value actually comes from numpy, and there are three different ways numpy spells it — they're all exactly the same: NaN with capital N's, NAN all capitals, or nan all lowercase. It doesn't matter which; if you want to insert missing values into your own data, you can import the missing value from numpy and just use it.

The thing with missing values — this is probably faster to type than to explain — is that they don't really equal anything: they're not True, they're not False, and they're not equal to each other. The reason behind that is that a missing value could be anything — it could be 3, it could be 5, you actually have no idea. That's why they aren't equal to each other either: there's no way to be certain they're actually equal; they're just missing, and you have no idea what they are. So if you actually need to test for missingness, there's a function called isnull: if you pass in something that is missing, that's when it returns True, and if I give it something else — something that's not a null or missing value — it returns False. Pretty sure that includes the string, too — yes, as long as it's not NaN, it gives you False. It's showing all three here because I imported all three versions; I imported them all just to show you there are three spellings, but in data frames you'll see it printed as NaN with capital N's. So pandas will represent missing values like that, and if you want to use one in your own work you can import any one of the three — whichever you assign into a data frame, I think it will always print the same way. Which one you type depends on how far the Shift key is from your pinkie; they would be the same — the same missingness, with no distinction between the three at all. You'll probably see NaN a lot around the internet, just because the data frames themselves always represent missing values that way, so you end up typing what you see; I pretty much type it like that most of the time. I showed you the others just for completeness' sake.

So where do missing values come from? First, from loading. In our read_csv function — (that's not supposed to be there, should I... OK, I'll move it later; and yes, you can run regular bash commands in these cells, that's part of what IPython, which this runs on, gives you) — when we loaded our visited dataset, there was a missing value in there. So one way missing values come about is that they're just in the actual raw dataset you're working with. If you look at read_csv, there are a few relevant parameters — there are a lot of parameters, OK — like na_values. By default, an empty string is read in as a missing value, but you can change that: if you set the na_values parameter to something like 999 or 888, then any value in the file that says 999 or 888 will automatically get read in as a missing value, as NaN. There's also a parameter for when you want an empty string read in as an actual empty string and not a missing value — you can change that in the read_csv call as well. So depending on what you need, it's either the na_values parameter or some combination of that and the keep_default_na parameter that you can use to change how your dataset gets loaded.
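A sketch of the missingness rules and the loading parameters; the 999/888 sentinels are just the example values from the discussion:

    from numpy import NaN, NAN, nan   # three spellings of the same missing value

    NaN == NaN        # False: a missing value could be anything
    pd.isnull(NaN)    # True
    pd.isnull(42)     # False

    # treat sentinel codes as missing while loading
    pd.read_csv('../data/survey_visited.csv', na_values=[999, 888])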
The other place we saw missing values was when we performed the merge: because one of our data sets had a missing value, that value got propagated when we did the merge. This is exactly what we went through before, just with a different example. And because it was a many-to-one merge, that missing value got duplicated in our result.

So one thing you'll want to do is count your missing values. I'm changing data sets a lot, but here's another one that's kind of fun: an Ebola data set, about the 2014 West Africa Ebola outbreak. This was curated by an at-the-time graduate student named Caitlin Rivers, who went to Virginia Tech before I entered, and who also gave a talk here at SciPy a few years ago on epipy. She curated it directly from the various ministries of health that were reporting cases.

Just like any new data set we load, we can look at info(), like I showed you during lesson one. This is a data frame with 122 rows and 18 columns. The column names are on the left-hand side, and here in the center, because there are missing values in our data, not every one of those counts says 122. This is one way you can get a sense of how much missing data you have; just keep in mind these are non-null counts, so you have to subtract them from 122 to get the actual missing-value count. On the right-hand side are the actual data types stored in each column. We'll go over data types in a little bit, but floats are floating-point numbers, ints are integers, and object is pandas' generic way of saying "this is a string".

If you have a column with a lot of missing values, say this Cases_Guinea column, there is a method called value_counts. (As an aside, you can chain head onto a Series or a pandas data frame if you want.) value_counts gives you the tally of whatever is in the column, and it sorts by the counts, so the most frequent value gets sorted to the top. That makes it another quick way to check for duplicates: you can say "hey, I have duplicates in this data set" just because the first count is not a one. Because this data set has a fair amount of missing values, you can also say dropna=False and actually count how many missing values there are. In this example the missing count shows up at the top, but that's only because there are twenty-nine missing values in this column, which is more than any other frequency count in there. So just keep in mind that value_counts returns the tallies in descending order, but the number of missing values will not always be the first row; it just so happens that here the most common observation is "missing".

And to the question about unique: there is a method called unique that gives you back the unique values, there is nunique that gives you the count of how many unique values there are, and then value_counts will, for each unique value, give you a tally. So that's the difference between the three: what the unique values are, how many there are, and what the tallies are.
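Here's a quick sketch of those three methods; the file name and column name are what I believe the tutorial repository uses, so treat them as assumptions:

```python
ebola = pd.read_csv('../data/country_timeseries.csv')
ebola.info()

ebola['Cases_Guinea'].unique()    # the distinct values themselves
ebola['Cases_Guinea'].nunique()   # how many distinct values there are

# the tally for each value, with the NaN count included
ebola['Cases_Guinea'].value_counts(dropna=False).head()
```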
(Sorry, I need to turn tap-to-click off; my mouse is bouncing around the screen and scrolling, and it won't know when I'm trying to tab-complete. And to the other question: if you wanted to chain all of that together, you'd have to save each intermediate result you're creating, which is probably not a good idea.)

All right, so the next thing I want to go over is how you recode missing values. The question earlier was "let's say I have 88s and 99s and I want to turn them missing", or actually the opposite, so I'll show you how you can recode missing values first and then the other way around.

So we have our Ebola data set with missing values in it. There is a method called fillna, and if you give it something like 0, it will take every missing value in your data set and replace it with 0. This works on a single column as well; here I'm just applying it across the entire data frame. Whatever value you give it, it will turn all of the missing values into that value. That's not always the best thing to do, depending on your use case, so there are other ways. There is a parameter called method, and if you look at the documentation, there are two ways you can fill your missing values, though there are four different ways you can specify them, which is kind of weird, and I don't remember all of them.

You can forward fill your missing values, meaning that for the current missing value, it takes the last known value before it. So let's look at Cases_Guinea: we have this NaN, and if we do a forward fill, what it does is look back to the last known value and use that value to fill itself in. A forward fill is method='ffill', and if we look at the head of this, you'll see that 276.9 got filled in from the last known value. You can also see that if a column already begins with a missing value, it stays missing, because there's nothing before it to fill from. So when you do a forward fill, you're pretty much guaranteed that the end of your data set will not have missing values, but at the beginning you're not guaranteed anything. The polar opposite of that is a back fill, method='bfill': you can see that the 273.69 we saw before got back-filled, because this one starts looking from the bottom of the data set up. Now the missing values at the beginning of our data frame get filled, but the opposite problem happens, where data at the end of your data frame can end up staying missing. So if you actually need it filled both ways, you'll probably do a forward and a back fill, and you have to decide which one to do first.

And then the last thing before break (I'll look up a few things while break is running) is calculations with missing values.
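A minimal sketch of fillna; note that in newer pandas the method= keyword is deprecated in favor of the dedicated .ffill() and .bfill() methods:

```python
ebola.fillna(0).head()               # every NaN becomes 0

ebola.fillna(method='ffill').head()  # carry the last known value forward
ebola.fillna(method='bfill').head()  # pull the next known value backward

# equivalent spellings in newer pandas:
# ebola.ffill().head()
# ebola.bfill().head()
```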
So this Cases_Guinea column does have missing values in it; let me show you. (This is still the original data, because I never saved the forward-fill and back-fill results to a variable, so none of that actually got saved.) You can take the sum of a column even if it has missing values; that's something you can totally do. That is not the default behavior in R, but if you want pandas to throw a quote-unquote error when there is a missing value, there is a parameter called skipna, and with skipna=False, if there is a missing value in that column, sum will just give you back a missing value. That is the default behavior in R, but pandas just skips over NaNs, so be mindful when you're taking a sum: you're not totally guaranteed that there are no missing values in there. (To the numpy question: I don't really use numpy much, so I can't make that reference, but I'm pretty sure this is exactly how numpy works, since pandas is built on it.)

Oh, and to the other question: we'll go over this in a few more chapters, but in pandas each column has to be exactly one type, so if you do have text in a column, even a column that's mostly numbers, the entire column becomes an object column, which is a string column, and you can't really take the sum of a bunch of strings. You'd actually have to convert the column into a numeric column first, which I'll show you in a bit. All right, so be back at 9:55; that's about ten minutes.

Just to backtrack a little: if you google "pandas dataframe" or "pandas series", the first link is going to be the official documentation for a DataFrame or a Series. Someone had mentioned this and I definitely misspoke: you shouldn't really think of each pandas Series as a numpy array. They're not exactly the same, mainly because the methods and all of that are different. A pandas Series is a one-dimensional array, typically a column in your data frame (or a row), and if you scroll down the docs page you can see everything that this Series object can do. First come the attributes; those are the things that don't need a set of round brackets, so with iloc, you'll notice that when I've used it I just typed .iloc and started start-stop-stepping right away without round brackets. Below that are all the various methods. The reason I was able to call mean earlier is that there is a method called mean on the Series object. So if you ever have a question like "hey, I have this column, can it do this?", take a look at the official documentation; you can search for mean and see that yes, it calculates the mean, so you don't always have to write something from scratch. The same thing applies for data frames: everything a DataFrame or a Series can do is listed on its page (the minute details you'll have to check), so you definitely don't need to remember what can do what. If you're thinking "all right, Dan got counts somehow, what were the different ways to get counts?", you can Ctrl+F for "count" on that page and see all the related things, like value_counts.
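Picking up the skipna point from before the break, a quick sketch:

```python
ebola['Cases_Guinea'].sum()              # NaNs are skipped by default
ebola['Cases_Guinea'].sum(skipna=False)  # propagate the NaN, like R's default
```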
So the next part of our pandas tutorial borrows very heavily from the R world: this notion of tidy data. Hadley Wickham has written a paper about it, which is also linked on the tutorial page so you don't have to remember how I'm googling this. The examples in the paper are in R, but I'm going to show you the concepts in Python, and it's actually a very good read in terms of what your first steps are in the data cleaning process, especially if you're starting off with data processing or analytics. So far all I've done is work with relatively clean data: we did frequency counts, which is important, and we looked at unique values to make sure there isn't anything too obscure. But when you actually need to start doing quote-unquote data munging, this paper is a pretty good read about what to expect. I actually had the data set loaded up, but I'll just use the paper as an example, since that's easier than typing it out for everyone.

He goes through some of the language about what makes data quote-unquote clean, and he has these two example tables. In data set one, you have people's names, two different types of treatments, and a value for each treatment. Then you can do a transpose of this data set and get the same information: treatments as your row identifier, each individual person as their own column, and their treatment values inside. So if you're trying to describe a data set like this, it's pretty ambiguous what a column is and what a row is, and he tries to formalize those definitions: a row is a single observation of data, and a column contains a variable in your data set.

So how do we quote-unquote make this data tidy or clean? One: each column should be a variable, so we should have a variable called person, a variable called treatment, and a variable called result, each as a separate column. And then each row becomes a single observational unit: a person, with a treatment, and a result. This is what Hadley describes as a tidy or clean data set. It isn't a table you would put in a paper, because it's a little hard to compare different treatments by eye, but when you're cleaning data for analysis, this is a pretty good target, because then you can say "group by each treatment and give me the average result". It becomes very easy to work with for data tasks, and when your data is tidy in this format, you can very easily switch to any of those other layouts. So the process of getting your data tidy also fixes a lot of other problems along the way, and those are the very common data-set manipulation tasks he describes in this paper.

So what is tidy data? He says that in tidy data, each variable forms a column, each observation forms a row, and, the last bit, each type of observational unit forms a table. We sort of already dealt with this last one: most of the time when you're doing data analytics, you violate this last principle, because you're merging and combining data sets. In our survey data, each individual table was a separate observational unit; we had a table about a person, and we had a table of the readings, but when you actually perform the analysis, you merge it all together.
So this last bit is really more about how you store data and how you give data to other people, not really about data analysis use. He goes on in this paper to describe some very common problems you'll have with your data, and what I ended up doing is this: the data for the paper and the tidyr package are all on GitHub, so I essentially downloaded those and put them into our tutorial, and I'm going to go through the paper with you using Python syntax instead of R. We'll go over these very common problems (and this isn't the final list). One of them is "column headers are values, not variable names": in the example before, where John Doe was one column and Jane Doe was another column, those column headers are values, they're not really variable names; that's what that one refers to. The rest I'll describe as you see the various data sets, because it's much easier to describe them that way, but I would highly recommend reading the paper just to get you thinking about what you have to do when you start processing and cleaning your data.

Oh, sorry, one last thing first: someone over there had asked how you replace all the 88s or 99s with a missing value. There is a method called replace. It takes a to_replace parameter, where you give it all the values you want replaced, and then the value you want to replace them with. In this case I'm replacing with the missing value, but you could replace with an empty string or a zero if you wanted. Just be a little bit mindful, because I've done this in my own code, where I replaced 88s and 99s across an entire data set (those codes are very common in health data sets), and you might have a column of ID numbers that run from zero up to something; if you do this, you will replace the ID numbers 88 and 99 with missing values too. Keep that in mind. That's the nice thing about doing analytics in a programming language of some sort: I was able to find that mistake. Had I just done a find-and-replace in Excel before processing my data, that would have been a much, much harder error to fix. So you can replace a value with whatever value you want; just be mindful of what's in each column.

So we're going to load our library again, because I'm working in a new notebook. There is a data set called pew, for the Pew Research Center; this all comes from that tidy data / tidyr package. If you're used to looking at academic papers, this is probably Table 1 of some paper: some kind of demographic and a count of some sort, just to give a high-level overview of what's in the data. Again, tidying your data is different from presenting your data, but if your data is tidy, you can transform it very quickly into a format that is easily presentable. From a data analytics point of view, this is not a clean data set: yes, we have a column called religion, and that's fine, but these other columns right here are actually values, they're not individual variables. What we would want is a variable column that says income range, and then the count next to it.
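Circling back to the replace question, here's a minimal sketch with a made-up data frame, showing why you usually want to target a single column:

```python
import numpy as np

# hypothetical data: 88/99 are "missing" codes in score but real values in id
df = pd.DataFrame({'id': [86, 88, 99], 'score': [88, 3.5, 99]})

df['score'] = df['score'].replace([88, 99], np.nan)  # safe: one column only
# df.replace([88, 99], np.nan) would also wipe out the legitimate ids 88 and 99
```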
A rule of thumb is that you'll know your data isn't tidy if you can't do some kind of group-by aggregate. Here I can't say "hey, for each income bracket, give me the average count per religion", so that's the kind of mental check you can run to see whether something is tidy or not. What we want to do in this case is leave the religion column alone, but take all of the other columns and (the Excel term is pivot) pivot them down into a single column, with the values in the table body becoming some other column of values.

So how do we do that? There is a function in pandas called melt. If you're using Hadley's reshape2 package in R, that's exactly the same verb being used; in the tidyr package it's called gather. So how does melt work? The first parameter, which the documentation calls frame, is our data set, so pew is our data set. The next thing is id_vars; I like to think of id_vars as the columns I want to keep fixed, the ones I'm not trying to pivot at all, and in our case that's religion. And, because someone's probably going to ask, if you want more than one, put them in square brackets. If we run this, it will pivot our entire data set into something quote-unquote tidy. It might not be as easy to get a high-level overview of, but this is a shape that's much better for general data analytics. We have our religion column, which is what we want; all of those income columns got pivoted down into a column with the default name "variable" (so we have our under-10k, our 10-20k, our over-50k, and don't-know/refused); and the body of the table became another column, which by default gets the name "value".

Now, if you're already doing the data processing step, you might as well make these column names useful: you are cleaning data, so you should have a self-explanatory data set if you can. There are a few other parameters for that. There is a parameter called value_vars: whatever you don't specify as id_vars gets passed into value_vars, and those are the columns that get pivoted down, so you could even do some kind of subsetting of your data in this one step as well. Typically, depending on what's easier, you either specify the columns you want to keep fixed and pivot everything else, or use some combination of the two. Then there is another parameter called var_name: when those columns get pivoted down, by default the new column gets the name "variable", and var_name lets you change that, so we can call this one income. The same goes for value_name: by default the values column gets the name "value", so we can say count. If we run this, you can see that all those two extra parameters did was rename the columns of our melted data frame.
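The pew melt as a sketch (assuming the data set lives at ../data/pew.csv in the tutorial repository):

```python
pew = pd.read_csv('../data/pew.csv')

pew_long = pd.melt(pew,
                   id_vars='religion',   # the column we keep fixed
                   var_name='income',    # name for the pivoted-down headers
                   value_name='count')   # name for the table body
pew_long.head()
```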
So what if we want to keep multiple columns fixed? There's another data set in here called billboard, and it has, for a particular year, an artist, a track name, the length of the song, the date the song got entered into the Billboard ranking, and then week-by-week columns with that particular song's ranking. So this is sort of the same issue we had before: week is really a variable, but it's being shown as separate columns. Again, there's some balance between a data set that is, for you, very easy to visualize and a data set that's made for doing data analytics tasks. The way the data is shaped right now, for any given song you can very easily just glance across and see how a particular song changes rank over time (why does it go to week 76? I don't know), and that's why this data set is in its current shape. But if you think about it from an analytics point of view, you won't be able to ask "does a certain week number relate to the actual ranking?"; you can't really fit a model on the data in its current shape.

So if we want to melt this data set, it's pd.melt, exactly the same thing as before, except now we want multiple id_vars. Here's an example where it's much easier to specify five columns than 72: we want to keep year, artist, track, time, and date.entered the same and pivot the rest of the columns down. (The period in date.entered is also how you can tell this came from an R data set.) The value_name is going to be 'rank', and the var_name is going to be 'week'. (Give me a second, this looks kind of weird... there we go.)

Sorry, what was your question? Right: that wouldn't make sense with how this function works, because what it's doing is taking all the columns you didn't specify, and all of that becomes a single column, so you really only have one column being created; you wouldn't be able to specify multiple variable names there. That's a separate data cleaning problem that we're going to address later on, and if I don't cover it, let me know. All right, cool, any questions? So if you have multiple columns that you want to keep static and pivot everything else, this is one way you can do that.

All right, here's another data set that has a different problem: the TB data set. If you ever look at health data, this is actually very common in how health data is reported, and the problem here is that multiple variables are stored in a single column. We have a given country and year, and then over here we have columns that represent male 0-4, male 5-14, male 0-14, male 15-24, male 25-34, and then the same thing for females: female 35-44, female unknown, female 65 and above, etc. So that's the problem: we have column labels that each store two bits of information, and we need to separate that somehow. Although this might be okay for a table in a paper (actually, it's a huge table, so it probably wouldn't be in a paper to begin with), we have to do something about it, because we can't ask something like "what is the average TB count across males?" when it's all put together in one column. How would data end up like this? Well, if you're entering data manually in a spreadsheet, this is probably the format that made the most sense for whoever was collecting it during the data collection process.
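And here's the billboard melt with multiple id_vars, again with an assumed file name:

```python
billboard = pd.read_csv('../data/billboard.csv')

billboard_long = pd.melt(
    billboard,
    id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
    var_name='week',
    value_name='rank')
billboard_long.head()
```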
So there's nothing actually wrong with this data set; it has some kind of purpose to it. It's just that you, as an analyst, need to deal with making it a little more useful for your analytics, and again, once you clean it up, you can always come back to this layout. Another data set that has the same exact problem is our Ebola data set: there we had whether a number was a case count or a death count, plus which country it was associated with, for a given date. The Day column, I think, starts counting from when the very first outbreak happened; there's a global count from the first outbreak in West Africa versus the actual date, because different countries had outbreaks at different times, and that's also why there are a lot of missing values in this data set. I'll just go through the Ebola data, since it's a little more complicated, and hopefully the knowledge transfers to the other one.

All right, so what's the first thing we have to do? We have all of these columns, so the first step is going to be to pivot them down; that gets us one step closer to having a column for the status, whether it's a case or a death, and a column for the country. So step one, let's melt our data: ebola_melt is pd.melt, our data frame is our Ebola data set, our id_vars are Date and Day (it is case-sensitive, so I do have capital letters there), the value_name is going to be 'count', and the var_name... I called it cd_country. Why did I call it cd? Case/death, that's why. If we look at this, it's one step closer to a clean data set: we have our cases-and-country column and a count. It's not fully clean yet, but this looks like something we can deal with.

So how do we deal with this problem now? The column we need to work on is cd_country. If you want to take a column and treat it as a vector of strings, there is an accessor called str that takes the column and lets you run regular Python string methods on it. The one we're going to use is the built-in Python method split, which splits a string on a certain character. This data set is at least in a useful, easy-to-clean format, because we have a status on the left, a country on the right, and the thing separating the two is a single underscore, so we can use that pattern to break up the column. This gets us into the realm of string manipulation, which is going to be pretty much most of your data cleaning tasks, just because the easiest way to collect data is to have unstructured text. So, if you think about what string methods are available in Python, you can look up the Python string methods docs, go to whatever version you're on, and under the string operations there's str.split; that's essentially the function we'll be using.
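The Ebola melt step as a sketch:

```python
ebola_melt = pd.melt(ebola,
                     id_vars=['Date', 'Day'],  # case-sensitive column names
                     var_name='cd_country',    # e.g. 'Cases_Guinea'
                     value_name='count')
ebola_melt.head()
```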
For example, if we have the string '1,2,3' and we say .split(','), we get a list back, ['1', '2', '3'], split on that comma. That's exactly what we're going to do, except in our case, instead of a comma, we're going to split on the underscore: .str.split('_'). And that gives us something closer to what we want. This looks like a list; if we get the type of the first element, for example, it is a list, so that's good, we got lists back and we can work with lists. The first element of each list is the cases-or-deaths status, and the second element of the list is the country.

(A list is a regular built-in Python data type; it's a container that stores an arbitrary collection of things. Yes, it can be heterogeneous; it's just a generic container to store stuff. I wouldn't call it an array, both because it's heterogeneous and because you can't, without doing other things, perform vectorized operations on it: you can't take the sum of a list, and you can't say "list plus one" and have each element add one to itself.) So these are pure Python lists, and we have a bunch of them together in a pandas Series: a Series where each element is a Python list. That's what we ended up with, and I'm going to save it to a variable, because it's something we want to keep working with.

Now, if we have a series where each element is a list and we want, for example, to get all of the index-zero items out at once, we can use that str accessor again with its get method: get(0) grabs the first element of each list, and get(1) grabs the second. So we can use that to our advantage by saying status_values is the get(0) result, and country_values is the same thing but with 1 as the argument. If we look at status_values, it's all of the statuses, and country_values is just the countries. Much better: we've totally separated that column into the parts we need. The last thing we want to do is assign them back to our data frame. How do we assign a new column? You can name a new column directly inside the square brackets and assign to it, so we do that for status and country: give status the status values and give country the country values. Now if we look at that, we've completely cleaned out that column: we have a separate column for cases/deaths and a separate column for country.

(Question: does plain indexing give you the same thing? What ends up happening, I think, is that you need the str accessor there to make get act in a vectorized manner; it needs some way to know it should do this for each element, not just pull the first one. That's my hand-wavy explanation, but that's generally what's happening.)

And that leads into the next step, which is that you can actually do all of this in fewer steps.
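Here's that split-and-get sequence as a sketch:

```python
variable_split = ebola_melt['cd_country'].str.split('_')  # Series of lists

status_values = variable_split.str.get(0)   # 'Cases' or 'Deaths'
country_values = variable_split.str.get(1)  # the country

ebola_melt['status'] = status_values
ebola_melt['country'] = country_values
```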
We do the same split as before, but there is a parameter called expand, and if we set expand=True, what we get back, instead of a series where each element is a list, is a data frame that's already broken up for us. What's going on behind the hood is that if you give pd.DataFrame a bunch of lists, it will create a row for each list you give it; that's roughly what's happening behind the scenes. So now that we have this data frame, let's clean it up a little: we can rename its columns by assigning to .columns. All we did there was rename the data frame, and then, just like the example from the data assembly chapter, we can concatenate these two things together: our final ebola_clean is pd.concat of ebola_melt and variable_split. Whoops, axis=1; I meant to concatenate the columns, not the rows.

Okay, so another type of data cleaning problem, and I think this addresses the question from earlier. There is a data set called weather, and this is the example of variables being in both the rows and the columns of your data. How do you know you have this problem? First of all, if you ever end up with a data set where two or more rows are exactly the same except for a few details, that's one litmus test that will give it away. More specifically, we have this element column holding tmax and tmin, and then a bunch of day columns. The first thing we'll have to address is that those days should probably be in some column called day, and the values should be in some column called value. But the other problem is this tmin/tmax: what we should have is a separate column for the minimum temperature and a separate column for the maximum temperature, whereas in this example those labels are values sitting in the same column, and that's why we end up with a bunch of duplicates in our data set. For example, look here: we have id, year, and month duplicated, and the only difference is a tmax of 27.3 for day 2 versus a tmin of 14.4 for day 2. So there's a lot of duplication going on in this data set, and that's the problem we're going to have to address.

How do we fix this? The first step is going to be a melt, so we'll do that first: the id_vars are id, year, and month, and we're going to keep element fixed as well for now; our var_name is going to be 'day' and our value_name is going to be 'temp'. We look at that: all right, cool, one step closer to what we want. Now you'll definitely see the amount of duplication we have: for each day, pretty much everything except element and temp is duplicated. So the last step, which is pretty much the opposite of melt, is pivoting a column up. What we want is to take each unique value of element and turn it into a separate column, so our final data set will have id, year, month, day, and then a tmax and a tmin column. The way we do that is with a function called pivot_table. And just to show you the different syntaxes: you can call melt directly on the data frame, so instead of pd.melt(weather, ...) you can say weather.melt(...), and it does the same thing, and likewise for pivot_table. In this case, instead of pd.pivot_table, I'll run DataFrame.pivot_table; all that does is let you skip specifying the data frame inside the actual function call.
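Before the pivot, here's the shorter expand-and-concat version of the Ebola split as a sketch; the new column names are my own choice, to avoid clashing with the columns assigned earlier:

```python
variable_split = ebola_melt['cd_country'].str.split('_', expand=True)
variable_split.columns = ['status_e', 'country_e']  # rename the two new columns

ebola_clean = pd.concat([ebola_melt, variable_split], axis=1)  # glue columns on
```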
So how does this work? It's a very similar concept: you have an index, and that index refers to the columns you want to keep the same, so when we pivot up our element column, we want id, year, month, and day to stay exactly where they are. The columns parameter is for which column you want to pivot up, which is element. And as you're pivoting, new columns are being created, so you have to specify which column to populate those new columns with; that's the values parameter, and it's temp. Print the head of this, and that's pretty much what we wanted: for a given id, year, month, and day, we have a tmax and a tmin, and the values contain whatever temp was before. For a given set of days there's a tmax of 27.8 and a tmin of 10.5. And just like before, what I like to do is flatten the data set, so I'll call reset_index to get something that looks more like the regular data frame I'm used to.

Any questions? Pretty much between melting and pivoting, you're going to end up doing a lot of your data cleaning: melting, cleaning, and pivoting back. It's a very important skill, and if you can see your data cleaning pipeline in these terms, it'll probably make your life a little bit easier. (Does that answer your question about the data problem? Mm-hmm. I'm going to have you draw this out on the next break.)

Oh, and yes: you're not totally done cleaning this data set yet, because you need to get rid of this 'd' in the day column and then turn it into an integer. That's some combination of what we did with the Ebola data set, except instead of doing a string split, you're doing string slicing: you have a thing like 'd10', so on that string you'd say start from the second character and go to the end, do a data cleaning step something like that, and then turn it into an integer. So yeah, it's not fully cleaned yet; you really can't use this day column, because it's a string, not a number. I'll go over data types next, and then how you can write your own functions to clean data.

Oh, and what's cool about the pivot_table function: there is a function called pivot and one called pivot_table, and they do almost the same thing. What pivot_table allows you to do is, if there are duplicates that happen when you do the pivot, specify an aggregation function; by default it takes the mean, but you can change it to min, max, whatever. If you try to pivot something that does have duplicates with the regular pivot function, it'll give you an error saying "I have no idea how you want to reconcile these multiple values". So I typically end up just using the pivot_table function, because it can handle multiple values and will calculate an average by default.
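The whole weather melt-then-pivot as a sketch (assumed file name):

```python
weather = pd.read_csv('../data/weather.csv')

weather_melt = weather.melt(id_vars=['id', 'year', 'month', 'element'],
                            var_name='day', value_name='temp')

weather_tidy = weather_melt.pivot_table(
    index=['id', 'year', 'month', 'day'],  # columns that stay put
    columns='element',                     # unique values become new columns
    values='temp')                         # what fills those new columns

weather_tidy.reset_index().head()          # flatten back to a regular frame
```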
(To the question of whether you can keep all the values:) I don't think so, because what ends up happening after a pivot is that each result is a single cell, and you can't put three cells into a single cell. But if you need to do that, there are other methods, which I'll go over; essentially you'd have to write your own function to do those calculations. Cool. All right, time pressures!

Next up, and hopefully this one will be quick, I want to talk about data types. We'll be using data from the seaborn library. seaborn is a plotting library in Python; I personally like it because it looks more like how I plot in R, and that's really the only reason, but it also comes with a bunch of data sets that are pretty cool. Well, not cool if you've been doing data analytics for a while, because they're pretty much used to death. We can say sns.load_dataset, and there is a tips data set: essentially, for a given total bill, what was the tip, given some characteristics of the table. (Yes, it doesn't make 100% sense, because you can have a table size of six people but still have one gender associated with it; I don't really know what's going on there.)

So how do you look at data types? There's the info() view we've been using, and the last column there shows the data type of each column. You'll see a new one here called category; I think pandas added the category data type around version 0.16 or so. This is like an R factor: the values look like strings, but under the hood they're coded as integers. It makes doing analytics later on a little easier, because you have this notion of a dummy variable, and because it's stored under the hood as an integer, it can make larger data sets smaller in memory: you have the string 'Dinner', but it's really getting coded as something like a 1, so if you have a million of these, it saves a great deal of space in memory. (Yes, I think the function is pd.Categorical, and if you give it a column, it will turn it into a categorical column. We're about to go over how you convert data types, and that's how you'd convert a column into the categorical data type.)

Most of the time, though, especially during cleaning, I prefer working with strings rather than categories if you can. Dealing with a category means there's essentially another layer you have to deal with, because of the mapping going on underneath: you can't just do a string split or a string replace and assume the category got recoded. There are a few hoops I'm not going to go over, but when you're actually working with categories in your data cleaning process, I would almost say convert to a string and your life will be easier, and that's what I'm going to show you. Then, typically, when you're preparing for analysis, a lot of libraries like statsmodels or scikit-learn can handle the categorical data type, so when you are done cleaning everything, that's when you'd use pd.Categorical.

So, how do you change data types? Let's take this tips data set and create a total-bill string column from the total_bill column.
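A sketch of loading tips and moving a column between string and category; this previews the astype call coming up next, and the new column names are my own:

```python
import seaborn as sns

tips = sns.load_dataset('tips')
tips.info()  # note the 'category' dtype columns

tips['sex_str'] = tips['sex'].astype(str)          # category -> plain string
tips['sex_cat'] = pd.Categorical(tips['sex_str'])  # string -> category again
```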
The way you recode a column to a different type is with a method called astype, and you give it the type you want. str is Python's designation for text, so astype(str) converts to a string; if you wanted to convert into a floating-point number, you could say float instead. You can't really see the difference when it's printed, but if we look at info, this column is now an object, which is how pandas labels strings. So you can use astype to convert anything from one type to another.

When you're working with numeric types, and I think this answers someone else's question: let's take the first ten rows of our tips data set, and I'm going to (not randomly) set a few of the total_bill values to the string 'missing'. (This warning is essentially saying that I'm working on a copy; it's just a warning that you might not be doing what you expect, but I totally know what I'm doing in this example.) So I have this subset, and I converted a few of these values into the character string 'missing'. What happens if we look at info, or, another way to look at the types, just run .dtypes? Our total_bill is now an object. That makes sense: it used to be all floating-point numbers, but because we inserted strings into it, pandas has to convert everything into the lowest common denominator, which is a string, since each column has to be a single type. That coercion happened the moment I ran that assignment line.

If we now say something like astype(float) on total_bill, that raises an error, because pandas has no idea how to convert 'm-i-s-s-i-n-g' into a floating-point number. In R this is kind of nice: if you call as.numeric on something, it will just automatically coerce anything it can't parse into an NA. pandas has something like that: there is a pd.to_numeric function. If you run it plainly, we get the same error; it says it's unable to parse 'missing', it has no idea what to do with it. That's because to_numeric has a parameter called errors, and by default it's set to 'raise': if it tries to convert something to numeric and encounters something it can't convert, it raises that ValueError. But there are other ways you can change this behavior. You can say errors='ignore', in which case it just ignores the problem, which kind of doesn't help, because your dtype is still object, but at least it didn't give an error. And the last thing you can give it is 'coerce', which will force everything into a number, and anything it can't convert becomes a NaN value. This is the default behavior in R. So if you have a column that you know is only supposed to have numeric values, and for some weird reason there's a 'missing' in there, you can say to_numeric with coerce and convert the whole thing into floating-point numbers, with everything that couldn't convert blanked out to missing values. (And no, 'coerce' is not a British spelling; maybe it's just such a common spelling mistake.) Anyway, with 'ignore' it will just go through without giving you an error.
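A sketch of the to_numeric behavior just described (the row positions chosen are arbitrary):

```python
tips_sub = tips.head(10).copy()              # .copy() avoids the copy warning
tips_sub.loc[[1, 3, 5], 'total_bill'] = 'missing'
tips_sub.dtypes                              # total_bill is now object

# tips_sub['total_bill'].astype(float)       # ValueError: can't parse 'missing'

pd.to_numeric(tips_sub['total_bill'], errors='coerce')  # 'missing' -> NaN
pd.to_numeric(tips_sub['total_bill'], errors='ignore')  # no error, still object
                                             # ('ignore' is deprecated in newer pandas)
```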
I don't know how useful 'ignore' is; it's just what's provided in the documentation. Actually, I can see it being useful, because you kind of don't want your code to just break if you're giving it to someone else. So yeah, 'ignore' will silently let it go, but the type never got changed; I feel two ways about it. (And yes, I tend in general to mix my Zs and Ss on a whim in the things I type. Good for them for catching my spelling mistakes.)

Okay, so where am I? What's left is group by; all right, I'm pretty sure I have enough time. The second-to-last thing I want to talk about is this notion of apply. This is for when pivoting tables isn't enough to clean your data, or when, for each row of your data, you want to perform a more complex calculation than adding two columns or taking the mean of the row. Something like calculating a z-score is a little more complicated, so you might end up writing your own function for it. To show you how apply works, I'm going to use very simple examples, but then we should have enough time to do a full-blown example at the end.

So first, here's a quick few seconds on how to write a function: you say def, the name of your function, you indent four spaces, and then you have your function body. If you have an empty function, you just have pass; that's a function that does absolutely nothing, but it lets me show the structure. That's it, that's the entire how-to-write-a-function tutorial. If you want a function that's a little more useful, like squaring a number, you can write my_sq, which squares a given value. You can write a docstring, because that's good to have, and all it's doing is returning x raised to the second power. If we call my_sq(4), it will give the squared number back to us. If you want a function that takes two parameters, say one that calculates the average of two numbers, that's (x + y) divided by 2. This is one difference between Python 2 and 3: if you're in Python 2, that denominator should be 2.0, so I'll just note that there. Then avg_2(10, 20) should give you 15.

So how does this play into the world of apply? Typically, let's say you want to perform a calculation on each row of your data, or each column of your data. Depending on your programming background, you might say "hey, I'm going to iterate through rows or columns, let's write a for loop". Don't do that. When you're working with data sets, that's typically about the slowest thing you can do, writing a for loop to iteratively run a calculation across columns or rows, especially when each calculation is completely independent of the others. That's where apply comes in. It's more of a functional-paradigm kind of thing: you write a function, and then that function gets quote-unquote applied across every single column or every single row, essentially at the same time.

So I'm going to make up a data set: a simple data frame where column a is 10, 20, 30 and column b is 20, 30, 40. Not a fancy data set at all.
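The toy functions and data frame as a sketch:

```python
def my_sq(x):
    """Square a given value."""
    return x ** 2

def avg_2(x, y):
    """Average two numbers (use 2.0 in Python 2)."""
    return (x + y) / 2

my_sq(4)       # 16
avg_2(10, 20)  # 15.0

df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})
```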
Now, if we wanted to calculate the square of column a, I could have just said: take column a and square it. This is what I meant by vectorized, the difference between a vector and a list: you can't do that to a list, but because the Series is a vector, it knows that "square this" means square each element, element by element. So how do we do that with apply? We have this function we just wrote, my_sq, which takes a single value and squares it, and we have this vector of single values, 10, 20, 30. If we want to apply a function that we wrote, we say .apply: we use the apply method and give that method the function we want. Those two lines do exactly the same thing, but my_sq is a function that we created on our own, so it could be any kind of crazy function that does some calculation on a given number.

All right, cool. So what about the other example, avg_2, which takes two values, x and y? Can we use this to say, for each row, calculate the average: give it 10 and 20, it should be 15; 20 and 30 should be 25; 30 and 40 should be 35? (I'm skipping ahead a bit.) That presents a problem, or at least something else to think about: because we're now working on a data frame, there are two different ways apply could operate, row by row or column by column. Just to see what happens when we run apply on a data frame, I'm going to write a function, print_me, that simply prints whatever you give it. If I run df.apply and give it this print_me function, you can see it printed some things for us: it printed 10, 20, 30 and then it printed 20, 30, 40. Comparing that to our original data frame, apply on a data frame is, by default, operating on a column-by-column basis. The results are None, because our function doesn't actually return anything, and the fact that the return has exactly two None values, when we have two columns, is another clue that it went column by column. So by default, apply works column by column; this is pretty much the same as calling the function on column a and then on column b.

So let's write another function, avg_3, with x, y, and z, that calculates the average of three numbers: x plus y plus z, divided by 3. Say we now run df.apply with this avg_3. We want to calculate the average for each column: 10, 20, 30 should average to 20, and 20, 30, 40 should average to 30. But if we run this, it's going to be a problem. And it's a problem because of what actually happens when you run an apply: the entire row, or in this example the entire column, is passed into the first argument of the function. What really got passed into this function was the Series 10, 20, 30, all of it into the variable x; nothing got passed into y and z, which is why the error says it's missing two arguments. So if you're going to write a function for an apply, what typically happens, or the easiest way to fix it, is to make your function take one parameter, because the entire row or the entire column is going to be passed in.
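A sketch of the column-wise behavior and the avg_3 failure:

```python
df['a'] ** 2           # vectorized square
df['a'].apply(my_sq)   # same result via our own function

def print_me(x):
    print(x)

df.apply(print_me)     # prints column a, then column b: column-wise by default

def avg_3(x, y, z):
    return (x + y + z) / 3

# df.apply(avg_3)      # TypeError: the whole column went into x; y and z missing
```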
Since the entire column comes in as one Series, you can subset it inside the function just like before: x is column element 0, y is column element 1, z is column element 2. Then if we say df.apply with this version of avg_3, it will work. That's actually one of the most confusing things when you first start working with apply: "I wrote this function, why doesn't it work?" It's because, yes, when you write your function the way I did before, it's very easy to test: you give it three inputs, it gives you your result. But when you use it in an apply, what actually ends up happening is that the entire axis, in this case the entire column, all gets passed into the first argument, so you have to change how your function is written if you want to use it in an apply. (Yes, that should be x plus y plus z; that's why you unit test.)

All right, so if we want to do something on a row-by-row basis, we say df.apply with the axis parameter. Right now our function won't actually work: axis defaults to 0, which means work on columns, and if we say axis=1, it will work on rows. But avg_3 doesn't make any sense on a row-by-row basis, because there are only two values in each row and we're computing x plus y plus z, so the error will be something like "index out of bounds"; we're trying to get a third element that doesn't exist. How do we fix this? Well, we can rewrite the function as avg_2: it now takes some kind of row, and computes row element 0 plus row element 1, divided by two. Then we can say df.apply with avg_2 and axis=1, and that will work. So that's how you write an apply function that operates row-wise or column-wise.

So let's use this on an actual data set. From seaborn there's a Titanic data set: the data set of the famous Titanic, of whether some individual survived, yes or no, what their gender was, what their age was, some other stuff about the type of ticket they bought, and whether or not they traveled alone. Let's look at the info of this. Again, I'm not going to show you string parsing here; you can do really complicated string manipulations and write them in an apply function, and that's something you can do. But I am going to show you how you can, for example, get counts of missing values, or what proportion of a row or column has missing values, or what proportion of a row is complete. When you end up working with pandas, so many operations are going to be vectorized, working on arrays, that you'll probably end up loading numpy as well, because numpy functions work on arrays and then you don't have to write these vectorized pieces yourself. So I'm going to write three functions: the first counts the number of missing values, the next gives the proportion of missing values, and the last gives the proportion of complete values. This is probably something you'll end up doing when you're working with your own data sets.
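Before the Titanic functions, here's a sketch of the one-parameter fix and the row-wise version, using the toy df from above (the _apply names are my own):

```python
def avg_3_apply(col):
    # the entire column arrives as a single Series in the first argument
    return (col.iloc[0] + col.iloc[1] + col.iloc[2]) / 3

df.apply(avg_3_apply)            # column-wise: a -> 20.0, b -> 30.0

def avg_2_apply(row):
    return (row.iloc[0] + row.iloc[1]) / 2

df.apply(avg_2_apply, axis=1)    # row-wise: 15.0, 25.0, 35.0
```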
So, count_missing: we take a vector of some sort (I'm going to forgo the docstring), and we use isnull to get back a vector of True/False values for whether each element is missing, i.e. NaN. The trick is that True evaluates to a 1 and False evaluates to a 0, so if you take the sum of a boolean vector, that gives you how many True values there are, which here is the total number of missing values. I'll write another function, prop_missing, that takes a vector: the numerator is exactly the count_missing result we wrote before, the denominator is how many elements there are in the vector, and the proportion is the numerator divided by the denominator. (This is also a lesson in functions and how you can reuse functions as building blocks, so you don't have to keep calculating the same thing over and over again.) The last one, prop_complete, is really just 1 minus whatever the proportion of missing was.

All right, so now let's actually use these functions on our data set: titanic.apply(count_missing). So what did this do? It went column by column and ran our count_missing function, so we can see that age has 177 missing values, embarked has two, deck has 688, and the town the person embarked from has two.

And then, just as an aside: if you're working with missing values, you should really make sure those values are actually missing at random; there shouldn't be a pattern in your missing values. One way you can check is pd.isnull inside a loc. Remember, loc has a comma: the left side is how you subset the rows, and the right side is how you get columns. So I'm getting all of the columns with that colon, and for the rows I'm taking this embark_town column and seeing if it is null; if it is, return me those rows. That's how this is working. When you're working with missing values, it's very important to look at what's actually missing, because you could end up with a data set where it just so happens that 98% of your male population is all missing, and whatever model you fit is not going to make any sense. So these are all the rows in our Titanic data set such that the embark_town value is missing; that's all this is saying, I'm not trying to make any inference from it. (Oh, that number is the row index.) So we only have two missing values in embark_town. This isn't the entire data set, this is just that one column: this one column has two missing values, and these are the rows they're in. I could show you the column with 688 missing rows, but that just wouldn't fit on the screen, which is essentially why I picked this one.

We can do the same thing row-wise, because our function is vectorized, and it's vectorized because we're using np.sum, so it can take any vector. We can take our Titanic data, apply, and change the axis to 1, and this will give us, for each row, how many missing values there are: the first row has one missing value, the second has none, the third has one missing value, index number five has two missing values, etc. What you can then do is run value_counts on that, and you can see that 549 of our rows have one missing value, 182 of our rows don't have any missing values, and 160 of our rows have two missing values. And you can substitute in any of the other functions we wrote.
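The three missingness helpers and their use on the Titanic data, as a sketch:

```python
import numpy as np
import seaborn as sns

titanic = sns.load_dataset('titanic')

def count_missing(vec):
    """Number of missing values in a vector."""
    return np.sum(pd.isnull(vec))

def prop_missing(vec):
    """Proportion of a vector that is missing."""
    return count_missing(vec) / vec.size

def prop_complete(vec):
    """Proportion of a vector that is not missing."""
    return 1 - prop_missing(vec)

titanic.apply(count_missing)                          # per column
titanic.apply(count_missing, axis=1).value_counts()   # per row, tallied

# eyeball the rows where embark_town is missing
titanic.loc[pd.isnull(titanic['embark_town']), :]
```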
So we can say prop_complete instead, and here are the proportions of complete values: when a row doesn't have any missing values that's 100 percent, with one missing it's 93.3 percent, with two it's 86.6 percent, and so on. Okay, any questions about apply?

So this function that you write can be as complicated as you need. Like that question from before: if you have a column that is, say, a letter and a number, you can write a function that parses it out. Just remember that when you apply the function row-wise, the entire row is coming with it, so the first thing you'll have to do is subset the row for the particular column that you need. The other way of doing it (this will just be more pseudocode) is to write a clean_one_column function that takes a single value. Remember, you can also apply on a Series on its own, so if all you care about is cleaning one column, you don't have to write an apply that runs on the entire DataFrame. You can write a function that takes in a single value and returns a single value, the same way our earlier example took 42 and gave us back 42, and then run apply on that individual column. You don't always have to run apply on the entire DataFrame: if you're really just writing a function to clean one column, extract the column and run the apply on the column. There's no need to say "hey, now I'm applying to the whole data frame," because then you have to worry about the entire row coming in, figuring out which position in that row holds the information your function should work on, and so on. So just think about how you're writing these functions and how you're going to be using them.

If we have time we'll circle back, but the last thing is going back to the gapminder dataset. So we have our dataset that I showed you before; we can group by year, pull out the life expectancy column from that, and then take the mean. That's what we did all the way at the beginning, the first thing we did. What's actually going on under the hood, or at least how you can think about it, is that it breaks up the data by each unique value of the variable we specified. This isn't literally what's going on under the hood, but in your mind you can say it created a totally separate DataFrame: 1952 is gapminder.loc such that the year is equal to 1952. So we have just the 1952 data; from there we take the lifeExp column, so we just have the life expectancy column, and then we run the mean; and then we do that for every single unique value of year. If you're used to thinking in for loops, you can think of it as: get all the unique values of year, iterate through them in a for loop, run this calculation, and at the end stitch the results together. That's essentially what's going on under the hood when you run a groupby.

And mean isn't the only thing you can do: for each group you can get a count, a size, a mean, a standard deviation, etc. A cool thing I'll show you is describe: you can say gapminder, group by year, give me the life expectancy, and then describe it, and that will give you, for each year, the count, mean, standard deviation, minimum, and so on for that column. It's a really quick way to get summary statistics that break your dataset down into separate pieces.
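Two quick sketches of the ideas above: applying a single-value function to one column rather than the whole frame, and what a groupby conceptually unrolls to. The my_sq name and the gapminder file path are illustrative assumptions, not necessarily what the notes use:

```python
import pandas as pd

# apply on a single Series when you only need to clean one column
def my_sq(x):
    """Takes one value, returns one value."""
    return x ** 2

df = pd.DataFrame({'a': [10, 20, 30]})
df['a'].apply(my_sq)

# the groupby one-liner...
gapminder = pd.read_csv('data/gapminder.tsv', sep='\t')
gapminder.groupby('year')['lifeExp'].mean()

# ...and what it is conceptually doing for one group
gapminder.loc[gapminder['year'] == 1952, 'lifeExp'].mean()

# per-group summary statistics in one call
gapminder.groupby('year')['lifeExp'].describe()
```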
Yeah, so this colon is generic Python slicing notation. I can have a regular list, and I can get the first element by saying 0, I can say give me all of the elements up until the third element by saying :3, or I can say give me all the elements by just saying :. So the colon is shorthand for all elements, and because it's to the right of the comma, that means all columns. That's not Python-the-pandas-library, that's not pandas-specific; it's in the realm of plain slicing.

Yeah, I don't cover this in the tutorial, but there are other things a groupby can do. The way we've been doing it now is we're grouping by a variable and calculating an aggregate statistic (in R the term is summarize): a bunch of values gets collapsed into a single value. Groupbys also have something called transform, which groups by some variable but returns the same number of elements; calculating a z-score is something you would do with a transform. And then there's something called filter, where for each group you apply a condition and filter your data. So there are other ways you can use groupby, and time permitting I can show you, or we can do it offline or in person. (Yes? If you had done it that way... I'll have to think about that one.)

All right, so we can also write our own functions and use them in these groupby operations. It doesn't have to be one of the built-in summary statistics, although I'm going to use a summary statistic as the example; you can use any vectorized function. For example, group by continent and get the life expectancy. If you need to run some other function that isn't already built into pandas, you can use agg or aggregate; they are exactly the same, agg is just shorter to type. In there you can say, hey, I want to use the numpy mean function, or, exactly the same way, the numpy standard deviation function. You can put any vectorized function in there; it doesn't have to be one of the ones I just listed that come directly from pandas. You can write your own: define my_mean, which takes a set of values. How do you calculate a mean? You get the length of the values, you get the sum (using the numpy sum function, just because it's vectorized, so you don't have to write a for loop to iterate through each element), and then it's the sum divided by the count. So this is a function that I wrote, it works in a vectorized manner because I'm using np.sum, and we can use it right here in agg. You can create your own weird aggregate statistic, or some data transformation like parsing a string; your limit is your imagination.

And with these aggregations you can also provide multiple functions at once: you give agg a list of functions, for example np.count_nonzero, np.mean, and np.std, and for each group it'll give you the non-zero count, the mean, and the standard deviation, whatever you want to calculate, all in one go. And again, you can put a set of square brackets here and break down your dataset as far as you need.
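A sketch of the slicing answer from the start of this stretch, plus the agg examples just described; my_mean follows the definition given above, and the gapminder frame is the same one as before:

```python
import numpy as np

# the colon is plain Python slicing, not pandas-specific
vals = [10, 20, 30, 40]
vals[0]    # first element
vals[:3]   # everything up to, but not including, index 3
vals[:]    # all elements; the same role ':' plays in .loc

# any vectorized function can go into agg / aggregate
gapminder.groupby('continent')['lifeExp'].agg(np.mean)
gapminder.groupby('continent')['lifeExp'].agg(np.std)

def my_mean(values):
    """Hand-rolled mean: vectorized because it leans on np.sum."""
    n = len(values)
    s = np.sum(values)
    return s / n

gapminder.groupby('continent')['lifeExp'].agg(my_mean)

# several statistics at once: pass agg a list of functions
gapminder.groupby('year')['lifeExp'].agg([np.count_nonzero, np.mean, np.std])
```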
If you want these column names to actually be meaningful, the old way of doing it, which will no longer work, is that instead of a list you give agg a dictionary mapping new names to functions, and that renames the columns. The only problem is that right now it warns you, "hey, this won't work in the future." So last night I had to go figure out how this will work in the future: you give it the list notation with the square brackets, and then you actually call rename to rename your columns, passing columns= with how you want each one renamed, so count_nonzero becomes count (I don't know why they deprecated the dictionary, because this is way more typing), mean becomes avg, std becomes std_dev. All right, there we go. And then you can clean this up, because it's pretty messy: you can also reset the index at the end, just so you get a nice flat table, which is what I'm very used to looking at.

The last thing on groupby, which for time constraints I'm not going to live-code: there is a page in the pandas documentation called "Group by: split-apply-combine". I just showed you aggregation, which takes a bunch of values and computes a single value as output. There's transform, where the number of outputs is equal to the number of inputs, and there's filter, which is somewhere in between; you should look through that page. Calculating a z-score is an example of a transform: the z-score is some function that requires the average, but the result is that each element gets rescaled such that the mean is 0 and the standard deviation is 1. Each value gets a new value, so it's a one-to-one mapping, and you don't have to calculate the z-score over the entire dataset; you can break it down by year or continent, etc. And filter is: for each group, give me the values where, say, the life expectancy is greater than 50. So you can use aggregate statistics like that to decide what comes back to you.

One of the reasons for having all of this hang off the groupby is that if you think longer-term about distributed systems, the syntax will be almost exactly the same. You can say, hey, I have this five-billion-row dataset sharded across five computers, I can group by this, and then the syntax for doing these aggregate statistics is exactly the same. So the more familiar you are with groupby (I wouldn't say it becomes easy) the less difficult grasping how distributed systems work will be, because at least your code won't really change.
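Hedged sketches of the three pieces just mentioned: the list-then-rename pattern that replaces the deprecated dict renaming, a transform computing a within-group z-score, and a filter. The new column names here are my choices, not necessarily the ones in the notes:

```python
import numpy as np

# rename the aggregate columns, then flatten with reset_index
(gapminder.groupby('year')['lifeExp']
          .agg([np.count_nonzero, np.mean, np.std])
          .rename(columns={'count_nonzero': 'count',
                           'mean': 'avg',
                           'std': 'std_dev'})
          .reset_index())

# transform: one output value per input value
def zscore(x):
    """Rescale so each group has mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

gapminder.groupby('year')['lifeExp'].transform(zscore)

# filter: keep only the groups whose mean life expectancy exceeds 50
gapminder.groupby('continent').filter(lambda g: g['lifeExp'].mean() > 50)
```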
All right, so the last bit is stuff I am not going to type, but it's essentially a segue. Hopefully you've learned something about data cleaning, and at the end of your data cleaning process you typically want to fit a model of some sort, so this is a segue into how you would fit a model. There are two major packages used to fit models, statsmodels and scikit-learn, both of which I'm pretty sure had a tutorial at this conference at some point, and this is generally how it works. We're importing pandas because we're working with DataFrames, and I'm using seaborn to load up a dataset. All of these packages come with built-in datasets, but I ended up using the seaborn datasets because at least I definitely know they're regular pandas DataFrames. A lot of the datasets, especially the ones that come in scikit-learn, are not regular DataFrames, so their examples are a little different; it's not as simple as plug and play. So I'm trying to give you an example of what it would look like if you worked with pandas DataFrames from the beginning and now you want to fit a model.

There are two different ways you can use statsmodels: there's a regular API, and, for the R-like people, there's a formula syntax, and this formula syntax will look very familiar if you've fit a model in R before. This is our tips dataset that we've seen. To fit a regular linear regression, statsmodels has a function called OLS, for doing ordinary least squares regression. Their naming convention is a little weird: they use endog for the response variable, sometimes called the dependent variable. I personally hate the words "dependent" and "independent" variables, just because that conveys that the two things are related, and when you fit a linear regression one of the assumptions is that the predictors are not dependent, so I don't like using those terms. And there's another argument, exog, where you specify all of the predictors of your model. Here we're going to ask: does the total bill affect, or predict, the tip? That goes into the model object, and all that does is set up the model; you actually have to call model.fit(). Typically you'll see people chain .fit() directly onto it, but a lot of tutorials do it in multiple steps. You get a set of results after you fit your model, and to look at them you say results.summary(). If you've ever seen statistical output before, this is what you get when you run summary, and this is sort of the benefit of using statsmodels: you actually get this table of results. If you end up using scikit-learn you don't get this, and that's just because there are two schools of thought about how these things work. With statsmodels you're actually fitting models like a statistician, whereas scikit-learn is more about internal validation through cross-validation and things like that. So scikit-learn is really good if you're trying to do cross-validation and all of that; if you're just trying to fit a model like a regular statistician, it's easier to get these tables out. (Yeah, no, this is not an exhaustive list of the models. Oh, I need to run this first; that way you can type results. and hit tab and see everything that's in there. I would use tab-complete to help you find everything.)

So here's the lowdown on how you interpret this. You have a table of coefficients, in this case for total_bill, and because this is linear regression the way you interpret it is: for every one-unit increase in total bill, so in this case for every dollar increase, the tip the person can expect will increase by essentially 14.3 cents. So I guess that's good: people are roughly tipping fifteen percent, although tipping could be a whole other debate about how we want people to be paid in this country. If you just want the parameters, and you don't want the entire table, you can say results.params to get just the coefficients.
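A sketch of both statsmodels routes on the tips data, under the setup described above; note that the array API's sm.OLS does not add an intercept unless you add one yourself:

```python
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

tips = sns.load_dataset('tips')

# array API: endog is the response, exog the predictor(s)
model = sm.OLS(endog=tips['tip'], exog=tips['total_bill'])
results = model.fit()
results.summary()   # the full statistical table
results.params      # just the coefficients

# formula API, familiar from R, with .fit() chained directly
smf.ols('tip ~ total_bill', data=tips).fit().summary()
```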
If you need multiple variables, it's pretty much the same thing. In this case it's multiple continuous variables; I haven't shown you a categorical variable yet. For multiple continuous variables, again put them in a set of square brackets, and just realize that in this case you need two sets of square brackets (the double brackets keep the selection as a DataFrame rather than a Series), but the rest of the model looks exactly the same: you create the model, you fit it, you ask for the summary, and you get your summary statistics. You can see the coefficients change, because we've essentially adjusted for another variable, size, but the interpretation for each variable is roughly the same. For each one-unit increase in size, so for each additional person that joins the party, the tip will increase by 36 cents, assuming the total bill stays the same; and for total_bill, for each $1 increase in total bill the tip will increase by 10 cents, assuming the size is constant. (Oh, you can totally get the residuals and all of the other diagnostics; yes, it's all in there somewhere.)

If we look at tips.info() you'll see that we have a bunch of categorical variables. It's almost easier to just jump to the formula syntax if you're working with categorical variables, because the formula syntax will understand that a column is categorical and create the dummy variables for you. So what is a dummy variable? Say I have a categorical variable like sex, with values male and female. It would make no sense in a model to code, say, female as 0 and male as 1 and then say "for every one-unit increase in sex, something happens"; what would that even mean? So it has to stay a categorical variable, not a continuous one. But how do you actually put that into your model? You create what's called a dummy variable: you create a new column for each unique value in the category, so for sex that's one column for male and one for female, at least in this dataset. If the person is male, the male column gets a 1 and the female column gets a 0; if the person is female, the female column gets a 1 and the male column gets a 0. So you've created a new set of columns, one per category. Typically with dummy variables you end up dropping one of them, because of multicollinearity: instead of having two columns for male and female, you really just need the one column for female, because if the female column is 0 you can already infer the person is male. That's why, when you start fitting models (and this isn't really a stats class), you'll hear that you create n minus 1 dummy variables, and that's what this model will actually give you.

So, using the formula syntax: if you've used R before, this is exactly how you specify a formula. You have a tilde; the thing to the left of the tilde is your response variable, tip, and on the right you literally add the variables together. We're predicting on total_bill, sex, smoker, and size; total_bill and size are continuous variables, sex and smoker are categorical. If you're using the formula syntax, be mindful that it's ols in lowercase; if you use OLS in capitals it will be using the previous array-based syntax.
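Sketches of the two models just described: the array API with multiple continuous predictors, and the formula API handling the categorical columns automatically:

```python
# multiple continuous predictors: tips[['total_bill', 'size']] is a
# DataFrame, which is why the double square brackets are needed
results = sm.OLS(tips['tip'], tips[['total_bill', 'size']]).fit()
results.summary()

# the formula API builds the n - 1 dummy variables for sex and smoker
results = smf.ols('tip ~ total_bill + sex + smoker + size', data=tips).fit()
results.summary()
```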
So you specify the data, then you fit it, you run the summary, and you get this nice table. Here you see sex[T.Female]: male got dropped, because it's n minus 1 dummy variables. This is saying that if the person is female, compared to male, the tip will increase by 2.7 cents; so females, assuming everything else is the same, are tipping 2.7 cents more than males. There are issues with interpreting that, which you'll see by looking at the p-value and the t-statistic, but this isn't a stats class. You can see the same for smoker (is the person a smoker, yes or no), and it's actually nice that the table tells you what the reference level is, which here is yes: when the person is not a smoker, the tip changes by about 8.4 cents. So that's what fitting a model looks like using statsmodels.

All right, you can do the same thing with logistic regression. Linear regression is for when your response variable is an actual continuous variable; logistic regression is one method for when your outcome is binary, so in health data, did the person die, yes or no. If you have a binary variable, that's when you would use logistic regression. In our Titanic dataset we have a column, survived: did the person survive, yes or no. I'm just using the formula syntax here; I tried to use the non-formula syntax and it became something I didn't want to figure out at 2:00 in the morning. One other caveat: this is totally not a stats course I'm trying to give you, because there are things like having to create an intercept that this isn't doing, and a whole bunch of other issues, so this isn't a fully valid statistical analysis; I'm just showing you how you can continue on with your pandas knowledge. The formula syntax is exactly the same as before, with the response variable to the left of the tilde: did this person survive, given sex, age, fare, and deck. And now you can see it creates this results table. Deck had a lot of different values in it; there were seven values for deck but six in our table, for the same n-minus-1 reason.

When you're doing logistic regression, it's much harder to just take this table and interpret it directly. This isn't a stats exam, so I'm not going to explain why, but you have to exponentiate your values: you can run the numpy exponential function on your parameters. Then you can say something like, for age: for every one-unit increase in age, the odds of you dying are 0.95 times the odds of you not dying. (My first phrasing of that was wrong; you're interpreting odds, not probabilities.) When the ratio is less than one it's actually protective, so apparently the older you are, the less likely you are to die, though if you look at that coefficient it's probably not something you can interpret very meaningfully here. Just keep in mind that when you're looking at logistic regression you have to exponentiate the results, and you're actually interpreting them as odds ratios, not the way we did with linear regression.
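A sketch of the logistic regression and the odds-ratio step, with the same caveat as above that this is a syntax demo rather than a defensible model:

```python
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf

titanic = sns.load_dataset('titanic')

# binary outcome (survived: 0/1), so logit instead of ols
results = smf.logit('survived ~ sex + age + fare + deck', data=titanic).fit()
results.summary()

# exponentiate the coefficients to read them as odds ratios
np.exp(results.params)
```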
Last example: how would you do all of this in scikit-learn? I only really had time to do this for one example, tips with linear regression. So in scikit-learn you import linear_model if you're trying to do a linear regression. Because scikit-learn essentially expects arrays of ones and zeros, you have to manually create the dummy variables. In pandas there's a function called get_dummies; you give it the columns you want to create dummy variables from. I am dropping the first one, because you only need n minus 1 dummy variables; certain statistical methods actually keep all of them, which is why the parameter is there for you. So we get a new column for sex, whether this person is female: if the person is female they get a 1, if male they get a 0. To create the linear regression in scikit-learn, we initialize this LinearRegression object and we give it the X and y: the X, our predictors, are the total bill, the sex dummy, and the size, and the y is the tips column. We feed the X and y into fit, so we fit the model. And then we don't really get that nice table of results anymore with scikit-learn: you can get the coefficients by reading the coef_ attribute, and because that vector of numbers is kind of useless without variable names associated with it, you essentially have to tie the variable names back to it manually. Then, if you want, you can say: for every one-unit increase in people at the table, the tip will increase by about 19.3 cents.
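A sketch of the scikit-learn version: build the dummy column with pd.get_dummies, fit a LinearRegression, then tie the coefficients back to the column names by hand:

```python
import pandas as pd
import seaborn as sns
from sklearn import linear_model

tips = sns.load_dataset('tips')

# scikit-learn wants numeric arrays, so create the dummy column manually;
# drop_first gives the n - 1 dummy coding discussed above
X = pd.get_dummies(tips[['total_bill', 'size', 'sex']], drop_first=True)
y = tips['tip']

lr = linear_model.LinearRegression()
lr.fit(X, y)

# no summary table here, just the raw coefficients, so attach names
pd.DataFrame({'variable': X.columns, 'coefficient': lr.coef_})
```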
All right, that's all I have for you; I guess I can circle back, but are there any other questions before we move on?

Yeah, I guess I didn't test that, and I can't explain why it didn't work in Python 2 or 3. The general rule of thumb, for anyone listening: if you're starting a new project, start it in Python 3. If you already have an existing project in Python 2, you don't have to convert it over to Python 3, but just know that Python 2 has a very real end date; there's literally a webpage counting down to when Python 2 ends. And, you know, if there's a website counting down to something, you should probably listen and do something about it.

What has Python let me do? So I got my master's in public health, in epidemiology, and a lot of the datasets we worked with there are these big nationally representative surveys, so a lot of the data cleaning was really just recoding in SAS. What Python allowed me to do (and that's also just because of the order I learned things, so it's not really Python-specific) was things like getting a random dataset off the internet or scraping a piece of data. That's something I don't think I would have been able to do in SAS, and if I could have, I don't think there would have been documentation on how to do it, and fast. We work with a lot of GIS data in my lab, and so there's going to be: oh, we got this dataset with state names fully typed out, but this other dataset has two-letter state codes; how do we find a table in between to link them together? Wikipedia is great for that type of information: you scrape the table off of Wikipedia and use it as the intermediate to join all your data together. That's the sort of thing Python has allowed me to do.

What not to do? Don't write for loops if you can help it. A few years ago, at PyGotham I think, one of the contributors to pandas gave a talk that was essentially "how to make your code slower": it started off with "write an apply", and by the end it was "use iterrows on a DataFrame to iterate through the rows and write a for loop". This apply business is really important, because chances are when you're cleaning data you're not cleaning one cell, you're cleaning an entire row or column at once. So the thing not to do is write a for loop, especially when each row is completely independent of the others; that's usually a good sign you don't need one. Sometimes you have to, okay, but try to understand how apply works, and that will make life much easier down the line, because all of this stuff like numba and the other optimization tools that make your code faster work on functions: if your code is already written as a function that you apply to your dataset, you can do those speed optimizations much more easily.

The last thing I wanted to quickly go over: we had that function that takes two values, x and y, but if a function isn't vectorized you can't just call it on two whole columns. If you end up with a function that isn't vectorized for whatever reason (my example is admittedly a little contrived, because its body actually does do a vectorized calculation), numpy has a way to turn any function into a vectorized function, using a piece of Python notation called a decorator. A decorator is an @-something placed right in front of a function definition, and what @np.vectorize will do is take the function and vectorize it. So if I had manually written, say, an average function using a for loop, something not made to work on arrays, I can put @np.vectorize on it and then it should work on an array of data, and I can just pass the columns in as arguments; there's a small sketch of this at the very end, below. So if you have a function that's not vectorized and you don't want to go the apply route, rewriting it so the first thing it takes is the entire row and reworking the code completely, you can say np.vectorize and just pass in your parameters that way.

Yeah, that's all I wanted to go over; I'm happy I got through everything, because it was a lot of stuff. And keep with it: I'm guessing all of you are here because you're very new to pandas, and about three and a half years ago I was essentially sitting in my own Software Carpentry tutorial learning how to do this kind of analysis. So practice, practice, practice, and you'll get it eventually. The more I think about it, all of this stuff really is kind of an art: you've been taught all of the core fundamental pieces, and with practice you'll get used to chaining them together to get what you need. All right.
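The np.vectorize sketch promised above; avg_2 is a made-up stand-in for a scalar function that was never written to accept arrays:

```python
import numpy as np
import pandas as pd

@np.vectorize
def avg_2(x, y):
    """Average two scalars; the decorator broadcasts it over arrays."""
    return (x + y) / 2

df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})

avg_2(df['a'], df['b'])   # now works element-wise across two columns
```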
Info
Channel: Enthought
Views: 84,286
Keywords: Pandas, Python, SciPy, SciPy 2017
Id: oGzU688xCUs
Length: 225min 16sec (13516 seconds)
Published: Sat Aug 05 2017