Welcome, everybody; you're watching Mr. Fugu Data Science. Today we're featuring pivot tables,
crosstab, and groupby in Python. For the next two videos I was considering doing the
same thing in Excel and RStudio for anyone who may be interested;
just let me know. Let's get into this. I have some Berkeley business data that I
generated, and we're interested in which companies, industries, postal codes, longitudes,
and latitudes we have. Let's figure out how we can manipulate these data
into slightly nested tables. First, the distinction: plain pivot takes a DataFrame,
you say df.pivot, and you're basically reshaping the data based on columns.
The pivot table, pivot_table, lets you create a pivot table as a DataFrame
and tabulate it using aggregate functions.
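As a rough sketch of that difference, on a tiny made-up frame (not the actual Berkeley file):

```python
import pandas as pd

# Toy data, only to show pivot vs. pivot_table
df = pd.DataFrame({
    "company":  ["A", "A", "B", "B"],
    "industry": ["Food", "Food", "Retail", "Retail"],
    "year":     [2018, 2019, 2018, 2019],
    "revenue":  [10, 12, 7, 9],
})

# pivot: pure reshape, no aggregation (each index/column pair must be unique)
wide = df.pivot(index="company", columns="year", values="revenue")

# pivot_table: reshape AND aggregate duplicates
by_industry = df.pivot_table(index="industry", columns="year",
                             values="revenue", aggfunc="sum")
print(wide, by_industry, sep="\n\n")
```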
Now, let's look at each industry and what companies there are, by zip code. How would I
do this? I take my businesses DataFrame and say, let's build a pivot table from it,
passing in the postal code followed by the bracket notation where we call
industry as well as company. Then let's get the head of that so we're not
printing everything off.
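The call isn't spelled out on screen, so here is only a minimal sketch of what that looks like; the file name and the column names "postal code", "industry", and "company" are assumptions from the description:

```python
import pandas as pd

businesses = pd.read_csv("berkeley_business_data.csv")  # hypothetical file name

# Pivot keyed on the postal code, with industry and company passed in brackets.
# (Assumes the postal code column is numeric; default aggfunc is mean.)
pt = businesses.pivot_table(values="postal code", index=["industry", "company"])
print(pt.head())
```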
So what's going on with this? These are your indexes here, and they're based on a
hierarchy; this is the outermost part here. For the postal code, just think of it
like a row index, and think of this as your column. You can call the index
explicitly, and I'll show you that in a second. So let's take our trusty old businesses
and look at this a little differently, featuring it by the actual zip code. You could
also call it as pd.pivot_table instead of the method form we used above, and pass
the DataFrame as your first argument. Now we're going to explicitly call our index;
like I just said, think of that as working row-wise: we take our postal code and
then the industry. Let's print that off and see what it looks like.
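Something like this; I'm spelling out the numeric value columns so it runs cleanly on newer pandas, and again the column names are assumed:

```python
import pandas as pd

businesses = pd.read_csv("berkeley_business_data.csv")  # hypothetical file name

# Explicit index this time: postal code outermost, industry nested inside.
# Every remaining numeric column gets aggregated (mean by default).
num_cols = businesses.select_dtypes("number").columns.difference(["postal code"])
pt = pd.pivot_table(businesses, index=["postal code", "industry"], values=list(num_cols))
print(pt.head())
```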
Okay, that's fine. So we're grouping these data by the
postal code, and inside each postal code you have all of these industries, and
then it just prints out everything else. But there's an issue here:
this number and this unnamed column. The unnamed column was an artifact from
saving the original CSV; I didn't pay attention, and it saved the index as a column.
So that's the old index, right there, which you should get rid of. And what's going
on with this other number? That's the street number, and it looks like it's getting
aggregated along with everything else. So you have to pay attention to what you're
actually doing, all right. Let's look at this a little more clearly and change it
slightly: instead, let's pass values explicitly. What do we want for
our values? Let's only take our latitude and our longitude, all right.
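Still assuming those column names, that change is just:

```python
import pandas as pd

businesses = pd.read_csv("berkeley_business_data.csv")  # hypothetical file name

# Only aggregate the columns we actually care about.
pt = pd.pivot_table(businesses,
                    index=["postal code", "industry"],
                    values=["latitude", "longitude"])
print(pt.head())
```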
Let's see what this looks like. All right, that's a little better.
Something that's really useful here is that you can take this latitude and longitude
and use them for a geospatial lookup, like Google Maps. You could say, well, I'm in
this postal code and I see these businesses in it; I have their latitude and
longitude, so I could plug those into an API, something like the Google Maps
interface, and find the proximity from one business to another. Maybe you're looking
at particular industries within a zip code, and you plug in your latitude and
longitude and want to find out how far you are from one industry to the next,
in that zip code or another one. That would be pretty cool if you wanted to work
on it. What else can we do? We can expand on this once again: let's add company
and see how that looks.
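One more level in the index:

```python
import pandas as pd

businesses = pd.read_csv("berkeley_business_data.csv")  # hypothetical file name

# Hierarchy is now postal code -> industry -> company, with the coordinates as values.
pt = pd.pivot_table(businesses,
                    index=["postal code", "industry", "company"],
                    values=["latitude", "longitude"])
print(pt.head())
```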
Now we have, by zip code, industry, and company, the latitude and longitude, and you
could just pluck those out and put them into that kind of lookup. There you go.
It just depends on what your use case is; you could slice this and do whatever you
want to your heart's content. I'm just showing you some examples.
Here's some CPU purchase data I created.
What you need to know is that the date is not stored as a date: it says
object, and that object is a string. We have to do some converting if we're
going to use this in our pivots and plots. So what we need to do is take these
data and convert that column to datetime. And I coerce the errors,
so if anything can't be parsed it doesn't blow up.
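A minimal sketch of that conversion; the file name and the column name purchase_date are placeholders for whatever yours are called:

```python
import pandas as pd

cpu = pd.read_csv("cpu_purchases.csv")        # hypothetical file name
print(cpu["purchase_date"].dtype)             # object: the dates are strings

# errors='coerce' turns anything unparseable into NaT instead of raising.
cpu["purchase_date"] = pd.to_datetime(cpu["purchase_date"], errors="coerce")
print(cpu["purchase_date"].dtype)             # datetime64[ns]
```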
Then I use something called a pandas Grouper, which is really interesting and highly
useful; you'll see why in a second. Here I decided not to use a pivot
table or pivot: instead, let's use groupby. A groupby, as a more manual
technique, can achieve the same thing. What I wanted to do was group by the
CPU, which is basically like the index we were looking at before, but now you
have all your dates nested underneath it: when this particular CPU was purchased,
and then the fake credit card information that goes with it.
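Roughly like this; the column names cpu, purchase_date, and credit_card are assumptions about how the generated file is laid out:

```python
import pandas as pd

cpu = pd.read_csv("cpu_purchases.csv")        # hypothetical file name
cpu["purchase_date"] = pd.to_datetime(cpu["purchase_date"], errors="coerce")

# CPU model as the outer level, with purchase dates and the fake card info
# nested underneath. The trailing .sum() just forces a printable table
# (more on that trick in a minute).
by_cpu = cpu.groupby(["cpu", "purchase_date", "credit_card"]).sum()
print(by_cpu.head())
```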
Now, what's going on with this Grouper I mentioned? Let's do this real quick and
notice the dates right here: things like 7/10 and 8/10 of 2017 and 2018.
Pay attention to what happens with the Grouper: I put in a
frequency by month and it changed those dates. Then you can go look at the
Grouper itself. What the Grouper is really doing is letting you pass your
column as the key parameter, or you can work by index levels, and then do
your groupby with it. I chose to use the frequency, and here are
all of the parameters you have available. I chose the month, but
you could work by week, or by the start of the month, whatever you want
to do. Those are some of your options; it just depends on how you want to mess
with these data. You could also do some string splitting by month and year and
nest these further; it just depends on what you need.
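The month bucketing with pd.Grouper looks roughly like this (same assumed column names as above; newer pandas spells the month-end frequency "ME" instead of "M"):

```python
import pandas as pd

cpu = pd.read_csv("cpu_purchases.csv")        # hypothetical file name
cpu["purchase_date"] = pd.to_datetime(cpu["purchase_date"], errors="coerce")

# pd.Grouper takes the datetime column as its key and buckets it by the given
# frequency, so individual purchase dates roll up to month-end.
by_cpu_month = cpu.groupby(["cpu", pd.Grouper(key="purchase_date", freq="M")]).sum()
print(by_cpu_month.head())
```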
We can do a further groupby, but switch it around, so now we're looking, by date,
at which CPUs were purchased. This would be really good for plotting
if you wanted to do sales reports for your company, so think about that. And
something I didn't mention: why did I use sum when there's no column
I'm really summing? I did that as a little trick. Since I didn't have
anything meaningful to sum, it just lets me force the result to print.
Otherwise, if you don't aggregate, you get back a lazy groupby object, and the only
other way to deal with that is iterating through it; I wanted to
preserve a table format. Then, if you need to convert this
into something like JSON, you can take your DataFrame, reset the index,
and convert it to JSON with the orient set to records. That's
really useful if you ever need it.
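Putting those two pieces together, a sketch of the flipped grouping and the JSON export:

```python
import pandas as pd

cpu = pd.read_csv("cpu_purchases.csv")        # hypothetical file name
cpu["purchase_date"] = pd.to_datetime(cpu["purchase_date"], errors="coerce")

# Month first, then which CPUs sold in that month; .sum() forces a real table
# instead of a lazy DataFrameGroupBy object.
monthly = cpu.groupby([pd.Grouper(key="purchase_date", freq="M"), "cpu"]).sum()

# Flatten the MultiIndex back into columns, then dump one JSON object per row.
records = monthly.reset_index().to_json(orient="records")
print(records[:200])
```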
Now, let's look at one last kind of situation and dataset before we get to the crosstab.
I took my online retail sales data and excluded all of the
cancelled items. From there I subset these data, and the first thing I wanted
was a pivot table looking at sales by country: specific items and the quantity
of each item sold.
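A sketch of that first pivot, assuming the usual online-retail column names (Country, Description, Quantity) and a placeholder file name:

```python
import pandas as pd

retail = pd.read_csv("online_retail_no_cancellations.csv")  # hypothetical file name

# Quantity of each item sold, per country: country outermost, item nested inside.
qty = retail.pivot_table(index=["Country", "Description"],
                         values="Quantity", aggfunc="sum")
print(qty.head())
```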
Then I wanted to take those data and create another column, where we're
just multiplying the quantity by the unit price. I used a specific
method called assign, which basically lets me apply a
function to create a new column while retaining all of the old columns.
That's the gist of it.
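That step, as a sketch (the new column name total_price is just my placeholder):

```python
import pandas as pd

retail = pd.read_csv("online_retail_no_cancellations.csv")  # hypothetical file name

# assign: derive a new column from existing ones, keeping every old column.
retail = retail.assign(total_price=lambda d: d["Quantity"] * d["UnitPrice"])
print(retail[["Quantity", "UnitPrice", "total_price"]].head())
```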
data. But, you end up getting this stupid error and you have to do some special
formatting. And I just was lazy and wanted to show something new. Then I did
my pivot table. Where my indexes are my country and my item, and then I'm just
showing the prices or the amount of money that was spent per item for these
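Using the derived column from above, that second pivot might look like:

```python
import pandas as pd

retail = pd.read_csv("online_retail_no_cancellations.csv")  # hypothetical file name
retail = retail.assign(total_price=lambda d: d["Quantity"] * d["UnitPrice"])

# Money spent per item, per country.
spend = retail.pivot_table(index=["Country", "Description"],
                           values="total_price", aggfunc="sum")
print(spend.head())
```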
And we can also utilize plotting through pandas, which is very useful as well.
From there we can do a simple plot
where I'm looking at the items sold versus the quantity. Since this is
nested, you have to pay attention: the outermost part is the ordering here,
and you have to decide where you're going inside it. I decided I wanted the
country of Australia, and from Australia, if you pay attention here, only the
first five items, since there are thousands. I take those first five items and
create a plot size, just so it fits on the screen how I wanted. I rotated the
labels on the x-axis, set a bar width so we have some separation and it's a
little more visible, and then I created all of my labels and sized them.
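A sketch of that plot, building on the quantity pivot from before (the country, item count, and figure settings follow the description, not exact values from the notebook):

```python
import pandas as pd
import matplotlib.pyplot as plt

retail = pd.read_csv("online_retail_no_cancellations.csv")  # hypothetical file name
qty = retail.pivot_table(index=["Country", "Description"],
                         values="Quantity", aggfunc="sum")

# Slice one country out of the nested index, keep the first five items,
# and plot straight from pandas.
aus = qty.loc["Australia"].head(5)
ax = aus.plot(kind="bar", figsize=(10, 5), width=0.4, legend=False)
ax.set_xlabel("Item", fontsize=12)
ax.set_ylabel("Quantity sold", fontsize=12)
ax.set_title("First five items sold in Australia", fontsize=14)
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```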
So here we are with our plot. That's really useful: we basically took our
pivot table and created a plot from it, which is
something realistic you could do at work. While this is a very simple plot,
you can get more advanced; this is just proof of concept. You can also do further
pivots and plots based on the invoice, depending on what you're looking for: you
can look at the quantity, the unit price, the description, and then see how this
coincides with those. Back on the business data, you can do a crosstab, which is
interesting. At first it looks like it doesn't work and is just full of zeros,
but remember you have hundreds of industries and thousands of companies, so how
is this even relevant? Let's look at it real quick and you'll see.
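A minimal crosstab sketch on the business data (same assumed column names as earlier):

```python
import pandas as pd

businesses = pd.read_csv("berkeley_business_data.csv")  # hypothetical file name

# Counts for every industry/company combination; mostly zeros, since most
# companies only ever appear under one industry.
ct = pd.crosstab(businesses["industry"], businesses["company"])
print(ct.iloc[:5, :5])

print(ct.sum(axis=1).head())   # how many rows fall in each industry
print(ct.sum(axis=0).head())   # how often each company appears (all 1s = no duplicate names)
```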
We're basing this by industry: we have academic librarians, and there are
two of those people, and then we have some kind of accountant, and there's one
accountant, all right. That one is going column-wise, whereas this one goes
row-wise, and we can look at that too: across all of the rows you only see each
company once, so it looks like the companies we're looking at don't have
duplicate names, okay.
Depending on your circumstance, you may need to do a cross table. Okay, so as always,
I'd like to say thank you for watching this video. Please subscribe, and see you
in the next video.
Bye