PYTHON: PIVOT TABLE | GROUP BY | PANDAS PLOT

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
Welcome everybody; you're watching Mr. Fugu Data Science. We're featuring pivot tables, cross tab, and group by using Python today. The next two videos I was considering Excel and R studio doing the same thing for those who may be interested just let me know. Let's get into this. I have some Berkley business data that I generated; and we are interested in what company's, industry, postal code longitude and latitude we have. And let's figure out how we can manipulate these data into slightly nested tables. First, the pivot table uses a data frame and your saying DF dot pivot. You're basically reshaping this data based on columns. Whereas the pivot table is allowing you to create a pivot table as a data frame and it's allowing you to tabulate this using aggregates. Now, let's look at each industry and what companies there are by zip code. So how would I do this? I would take my businesses and I would say let's do a pivot table for this. Where we're taking in the postal code followed by the bracket notation: where we're going to call industry as well as the company. Let's get the head of that so we're not printing everything off. So what's going on with this? These are your your indexes here, and this is based on a hierarchy. So this is the outermost part here for the postal code, alright. Just think of this like if you were doing like a row index and just think of this as your column. You can explicitly call this and I'll show you in a second. So let's get our trusty old businesses and this look at this a little differently. Let's feature this by the actual zip code; you could also call it PD and then pivot if you're not explicitly calling like we did above and you could use that as your first argument. Now, we're going to explicitly call our index like I just said just think of that as working row wise: where we're going to take our postal code and we're going to then take the industry. Let's print that off and see what this looks like. Okay, that's fine. So we're grouping these data by the postal code and inside this postal code you have all of these industries and then it just prints out everything else. Well there's an issue here: this number and this unnamed column. This unnamed column was an artifact when saving the original CSV and I didn't pay attention, and it made an index. So here's the index, right here that you should get rid of. And what's going on with this and this number? This is the street number but ,what's going on with this? It looks like you're actually doing sum. So you have to pay attention to what you're actually doing, all right. So let's look at this a little more clearly, and change it slightly okay. So instead let's call our values but, what do we want for our values? Let's only take our latitude and our longitude, all right. Let's find out what this looks like. All right that's a little better. Something what's really useful here; is you can take this latitude, longitude. And if you work for instance doing a geospatial lookup: like Google map. Where you could say well I'm in this postal code and I see these businesses in this postal code maybe I have an idea of the latitude and longitude we could plug into an interface with an API and do like a Google map and find the proximity you are from one business to another, you know. Maybe you were looking at particular industries within a zip code and you plug in your latitude and longitude and wanted to find out how far you are from this industry to the next industry in the zip code, or within another zip code. So that would be pretty cool, if you wanted to work on that. What else can we do? We can look at doing a expansion onto this once again. And let's add company, and see how that looks. Now, we have by zip code industry and company with the latitude and longitude and you could just pluck this and put that in there. And there you go. It just depends on what your use case is. You could slice this, do anything you want with your heart's content. It just depends on what's your use case is. I'm just letting you see some examples. Here's some CPU purchase data, I created. And what you need to know is this is not stored as a date. It says an object, and this object is a string. We have to do some converting; if we're going to use this in our pivot plots. So what we need to do here, is take these data and you need to do a convert to your date/time okay. Then what I did was I coerce this. So if any errors came up all right. Then I use something called a pandas grouper; and the grouper is really interesting and highly useful. And you're going to see why in a second. Here, I decided let's not use a pivot table or pivot: instead this user group by. A group by as more of a manual technique can achieve the same thing. Now, what I wanted to do was group by the CPU, here. Which is basically like the index that we were looking at before. But, then now you have nested all your dates. Where this particular CPU was purchased and then the fake credit card information that was available, Now, what's going on with this grouper like I said, Well let's do this real quick and notice the dates right here, This is seven ten eight ten of 2017 and eighteen. Now, if you pay attention to what happened with this grouper: I put in the frequency by month and it changed these So then you can go in and look at the grouper. What this grouper is really doing; is its allowing you to take your column as your key parameter or you can work by levels, all right. And you can do your group by with this. So for me I chose to use the frequency, and here's all of the parameters that you have available. For me I chose the month but you could work on the week, you can do the start of the month, whatever you want to do. So those are some of your options. Just depends on how you want to mess with these data. You could also do some string splitting by month and year and further nest these, just depends on what you. We further do a group by. But, instead switch this around. Where we're now looking at by a date what CPUs were purchased. This would be really good for plotting if you were wanting to do sales reports for your company. So think about that; and something I didn't mention. Why did I use this sum? When there's no column that I'm using for sum. I did this as a little trick. Since, I didn't have anything that I was summing. Allowing me to basically force this to print. Otherwise, if you don't do this you get this you get this pandas object thing. And the only other way you can deal with this is iterating through. But, I wanted to preserve a table format. So then what you can do is if you needed to convert this into like a JSON format: you can take your data frame and you reset the index and then you just convert it to a JSON and set it as records. That's something that's really useful if you need to do that. Now, let's look at a last kind of situation what the data set were looking at. Before we get to the cross tab. So I took my sales data for online retail, and I excluded all of the canceled items. From there I subset these data and what I wanted to do was first do a pivot table. And look at sales by country where we're looking at specific items and the quantity of items up are sold. Then I wanted to take those data and convert it into another column or create another column. Where we're just multiplying the quantity and the unit price and I used this specific method called assign. Where it's basically, just allowing me to use a function creating a new column and retaining all of the old columns. That's just the the gist of that. You can do that here where I was using this subset data. But, you end up getting this stupid error and you have to do some special formatting. And I just was lazy and wanted to show something new. Then I did my pivot table. Where my indexes are my country and my item, and then I'm just showing the prices or the amount of money that was spent per item for these countries. And we could utilize plotting through pandas. Which is very useful as well. From there we could do a simple plot: where I'm looking at the items sold versus the quantity. Since, this is nested: you have to pay attention. Because, the outermost part is your order quantity here then you have to decide where are you going inside. For me I decided I wanted the country of Australia. So from Australia I decided I also wanted: only if you pay attention here, the first five items. Since there is thousands. I take those first five items I create a plot size. Just by how I wanted to fit on the screen. I rotated labels on the x-axis and then I created a width , so we can have separation between this. So it's a little more visible and then I just created all of my labels and I sized them. So here we are with our plot. So that's really useful. We basically took our so we basically took our pivot table and created a plot from that. Which is something realistic that you can do at work. While this is a very simple plot you can get more advanced. But, this is proof of concept. You can also do further pivot plots based on the invoice. Just depends on what you're looking for. You can look at the quantity, the unit price the description. And then you see like this coincides with these two. You can do a cross tab which is interesting. And it looks like this doesn't work and just full of zeros. But, remember you have hundreds of industries and you have thousands of companies, assuming right. So how did a house this even relevant? Let's look at this real quick and you'll see all right so we're basing this by industry. Where we have academic librarians and there's two of those people. And then we have some kind of accountant and there's one accountant, all right. So this is going by column wise whereas; this is going by row wise. And we can look at that. Where you're saying: all of the rows you're only seeing a company once. So it looks like these companies here that we're looking at there aren't duplicate company names, okay. Depending on your circumstance you may need to do a cross table. Okay, so as always I would like to say thank you for watching this video. And as always I would like to say thank you for watching please subscribe and see you in the next video. Bye
Info
Channel: Mr Fugu Data Science
Views: 7,202
Rating: undefined out of 5
Keywords: mr fugu, mr fugu data science, Mr Fugu Data Science, Mr Fugu, pivot tables python, python pivot table, python groupby, python group by, python pandas plot
Id: jOrys_zgjpc
Channel Id: undefined
Length: 11min 15sec (675 seconds)
Published: Tue Jul 07 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.