How to Reshape Dataframes | Pivot, Stack, Melt and More

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hey welcome in this video we will learn how to reshape panda's data frames in eight different ways so here is a data frame i'm going to use to show you some of the examples but for some of them also i might need to use dummy examples just to explain it a little bit better and more clear clearly so this data set is a data set i got from new york city open data it is a data set of all open positions in new york city government we have the job id civil service title so basically the job title the agency that has opened this position whether it's external or internal position job category and the expected range the salary range so from the lower limit to the upper limit the first way is to use pivot and to explain pivot a little bit better i will actually use a image this is an image i got from the official pandas documentation so all credits to them basically as you can see here when we do pivot what happens is let's say we have all of these columns and we have an index when we do pivot we can choose uh one column to serve as the column values one column to serve as the index values and one column to serve as the actual values of the data frame so here for example you get the foo column to be the index so the index will only be one or two or however many however many options there are and then you choose the bar column to be the column name so then you see how many options there are there are a b and c so then you have three columns a b and c and you choose the buzz column to be the value so then you see 1 and a correspond to 1 and then you put fill in the values accordingly but just to show you how to do it also in code here really quickly so if i have this data frame and if i basically do more or less the same thing as in the example above then we will have a whole new data frame uh we will have constructed a whole new data frame as you can see we don't have all of the information from the previous data frame because we choose one index one column and one set of values alternatively i could have used a pivot table as you can see we will basically achieve the same thing with that so i'll just even copy and paste the same values and it gives me the same result the reason that you might want to opt for using pivot tables is two things so first of all with pivot tables you can choose to have more than one index or more than one column so instead of saying index should be foo i can say index should be full and bar and column should be zoo let's say that's the the fourth column that i have here and it is able to handle the situation another reason that you might want to use a pivot table instead of pivot is that it can handle duplicate values too so i will show you on the example data set that we have so this one is called data right so if i say data pivot i want the values to be sally range from and index to be job id and column to be agency so for each job id i want to see the lower limit of salary based on the agency right if i try to do this with pivot it's going to give me an error because it cannot handle duplicate entries and here in my data set i have duplicate job ids some job ids are mentioned more than once maybe because they have different agencies that they refer to or different types of posting type but instead i can use pivot table and i will be able to do this then i get this uh information as you can see i do not have as many rows as there are rows in my data data frame so we can see how many arrows that one has that one has 3773 rows here we only have 2073 because my index is job id right so you would expect there to be as many rows as job ids and some of them are duplicates so pivot table combines these duplicates together to come up with one value and while doing that what it does is to aggregate these values by default it uses the numpy mean function but you can specify it to use something else next i want to show you the stack function so let's say you have a data frame again it has multiple index doesn't really matter it could also have one index and you have a and b columns if you say stack it turns these columns into turns these column names into one column so here instead of saying this row corresponds to bar 1 and the corresponding column where it is a then the value is 1. instead of this it will create a redundant information and put all of it in another index basically so let's do this also i'll just use the example again from the pandas documentation here is an example of a data frame where we have a and b columns and i want to use stack stack on this so it's this is a little bit hard to understand so i'll make turn this into a proper data frame so okay yeah as we seen also in the example above individualization we are able to put that column uh the column names into a whole new column of themselves so right now effectively we have three indices right um what else i can do with this if i call this uh stacked data frame i can unstack it so basically i can do basically we undo what we just did so this is still the stacked id a stacked data frame instead of just undoing what i just did i can also specify which of these indices i want to unstack so i can say zero for example and then it would unstack the first column or the first index here or i can even call them by name so let's say i want to unstuck second one then i can specify that in a second now becomes a column information the name of the columns the values for the second indic index becomes the name of the columns another thing you can use in pandas is called the melt function so to do that i call it with pandas and then i pass my data set to it i specify which column it should use for ids let's say i want to use job id right now and then the value information where should the values come from let's say agency and posting type okay so basically what this does is it chooses one of the or i choose one of the columns as the index again and for the value variables the very value columns it looks at all of them and then gives me that information in a very redundant and exploded way so let me show you the result so basically for each job id this could obviously there are a lot of duplicates here right each piece of information is given to me in a different row basically that's a better way of saying it so i have for this job id i have the agency information i also have the posting type information it gives it all of those informations to me on a separate line so i have this job id the agency of it is this so maybe i can separate one job id and see what it looks like so if i call this the melted data frame and in here i will see the information for this job id and here it tells me for this job id the agency information is this and the posting type information is internal next we can use group by it is a big function and sometimes it could be a little bit confusing to first starters so i will make a whole separate video for a group by for the group by function the best places to use it and the best practices to follow when you're using it but just show you here more or less how you can use it so let's say i want to get the average minimum where is my data set yeah average minimum salary for all positions uh in each agency right so for that i would need to group by agency and i would use the uh mean aggregate function on the salary range from column so let's do that which one i want to do uh agency right i said agency all right so i will say data group by agency and i want the mean it will give me the mean of job id until i range from ansel salary range 2 but i don't really want the job id1 so again what we can do like we did before is to get a subset of the data frame so what i want is only the agency and salary range from and then it gives me for each agency what is the mean salary range from and yeah this is one way to reshape your data set it's also part of analyzing your data set too but we will as i said talk more about group by in a different video this next one is called crosstab as you can understand from the information you basically create a cross tab table of information and how many times they occur together so for that i will call pandas crosstab and then i can choose two of the columns let's say i want to see um [Music] agency and post posting type probably okay let's let's see i need to pass data agency and data posting type let's see what it gives me yeah so basically this is a cross table of how many times in the agency of admin for children's svcs i don't really know what that means but okay did we see external positions open how many times we see internal positions open for board of corrections how many times we've seen external positions open how many times we've seen internal positions open well this is a really nice way of manipulating your data sometimes you just need to add this information to your data also just to kind of create a different perspective to your whole data frame and that brings us to the last function which is explode so i will again create a dummy data frame for that what explode does is literally explode one of your columns and make the information that is in there one column one row at a time so especially it works with lists so let's say i have this data frame keys data frames panda 1 panda 2.3 but the values data frame involves it includes a list and as you might seen before experienced before working with lists inside data frames is not that easy so for that what you can do is to say data frame explode values and then it will return a data frame to you where all of this information the list information has been exploded not in columns but in rows and you have one row for all of the values inside these lists alternatively what you can do is to call the explode function on the column itself so say data frame values explode but then that would just return to you a list i think or maybe a series object let's see what it is yeah it's a series object so you can do with that series object whatever you want but if in turn you just want a whole already structured data frame you can use to call the function on the data frame immediately but that's all today about reshaping pandas data frames if you have any questions don't forget to leave a comment below this video again you can find the code and all the information that you need about this tutorial in the description below if you want to learn more about pandas function also other pandas functions how they work and what are the best times to use them i've made a free pandas cheat sheet and you can find the link to that also in the description below so i hope it was all clear i hope you enjoyed this video have a great day and i will see you in the next video [Music] you
Info
Channel: Mısra Turp
Views: 4,168
Rating: undefined out of 5
Keywords: machine learning, data science, deep learning, artificial intelligence
Id: M3oB2urOHXY
Channel Id: undefined
Length: 12min 49sec (769 seconds)
Published: Mon Jul 18 2022
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.