pandas .head() to .tail() (Beginner) | SciPy 2018 Tutorial | Niederhut, Augspurger, Van den Bossche

Captions
Good afternoon everyone, and thank you. It certainly is, because we're here to talk about a library that's pretty phenomenal in the world and ecosystem of Python. If you deal with any kind of data at all, you're going to be using pandas at some point in your workflow, so it's good that you're here; we're going to learn some really practical, really useful stuff. To help me think about the language I should use in talking to you: could I see, by a show of hands, how many of us would feel comfortable explaining to the other people in the room what slicing is? I'm not going to ask you to actually do it, just a show of hands. Okay, that looks like about half of us. How many people would feel comfortable explaining what method chaining is to the other people in the room? Okay, very good. And if I were to ask you to demonstrate two-dimensional indexing in numpy, how many of us would feel comfortable explaining that? Okay, maybe a third. Excellent. We'll cover all of these topics as we go along; you don't need to know any of them to get started. We're going to have a good time in here, and we're going to start by talking about the fundamental data structures we'll be using during this class. There are two of them: one is called a Series and the other is called a DataFrame. A DataFrame is composed of Series, so one thing is built out of the other, but we want to make sure we understand what we're dealing with. When you're learning to program, one of the first things you start with is your data structure: how you put stuff into it and, of course, how you get stuff back out. We call this our access pattern. So I'm going to do some doodling on the board here, and we'll play a little game of Pictionary while we talk about what these data structures are; then we'll use the things we've talked about in the notebook together, and then you'll practice using them on your own with little exercises while we wander around the room to help you work through them. All right. In the world of pandas there are two kinds of data structures we'll be dealing with. One of them looks much like a one-dimensional numpy array, and what that means is that it is homogeneous in its data type: if I have integers in there, it can only be integers; if I have one floating-point number in there, it can only be floating-point numbers; and if I have one Python string in there, it can only be Python strings. So we are homogeneous in our data type, and we have a compact on-disk representation. Those two things together also give us access to numpy's fast vectorized mathematical operations. Now, one of the key differences between a pandas Series object, which is what this is, and a numpy array is that when I'm working inside of numpy I'm required to know the position of everything. Everything is about shape and location by index: are you the zeroth thing, are you the first thing, are you the second thing? But frequently, when we're using real-world data, what we would really like is some sort of label-based access into that data. I don't want to have to remember that Tom is in position 3 in my Series, and I don't want to have to look it up again every time I sort my Series and Tom moves. I would like to be able to just ask where Tom is; I want Tom's data. The data structure that provides this for us in the world of pandas is something called an index.
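To make that whiteboard picture concrete, here is a minimal sketch of positional versus label-based access on a Series; the names and numbers are invented purely for illustration.

```python
import pandas as pd

# A Series pairs a homogeneous array of values with an index of labels.
ages = pd.Series([29, 35, 41, 27], index=["Ana", "Bo", "Carla", "Tom"])

ages.iloc[3]     # positional, "the numpy way"  -> 27
ages.loc["Tom"]  # label-based, "the pandas way" -> 27

# Sorting reorders positions, but each label stays attached to its value.
ages.sort_values().loc["Tom"]  # still 27
```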
Any time you create any data structure in pandas, you get at least one index attached to that data structure, and more specifically you get one index for each dimension. So if I have a one-dimensional pandas Series object, I get a one-dimensional index, and this provides my label-based access. When we're working in pandas we have two options: we can get things the numpy way, by knowing something is in position 12 or in position zero, and we can also get things the pandas way, by knowing that I've labeled that piece of data "Tom". Now, typically when we're importing data into pandas it's not going to come in as a single series of data, a one-dimensional stream of data; it's usually going to come in the form of a table, and a table is a very particular kind of two-dimensional data structure that is homogeneous within a single column but heterogeneous across columns. So I might have one column in this data frame that has integer data, another column that has floating-point data, and another column that has string data; if you look at the data types in pandas you'll see that last one is called object, and we may or may not explain to you why that's true. Just like before, when I had my one-dimensional pandas Series object, in my data frame I also have an index here on the left-hand side, and this index gives me label-based access across a row of my data frame. Because I have a two-dimensional data structure, though, I now have two indices: I have the normal index we're used to thinking about, which we'll call either the index index or the row index, but we also get a second index for our second dimension, up here. This is also, strictly speaking, an Index object, but we will frequently call it the columns, and this columns index gives us labeled access down into a single column of our data frame. We can combine indexing operations across the rows and down the columns to extract very exacting pieces of information, in all the flexible ways we're used to from Python and numpy. Now, there are a lot of ways we could do this; pandas supports many, many different ways of indexing into my data, and we're going to talk about the four most common ones, or, since I'm feeling extravagant, maybe five. The first way we're going to think about extracting data from our pandas data frames is something called dictionary-like access. How many of us have used a dictionary in Python? Excellent, this is everybody. If I want to extract a value from a dictionary, given that I have the key, what little symbol on my keyboard do I have to use? Square brackets, exactly. In Python we call this the getitem syntax, and this is what we mean when we say dictionary-like syntax in pandas: if I have some data frame, which we'll abbreviate df for short, I can use my square-bracket syntax to extract one column given the name of that column. Now, there's a second way I can interact with this syntax that's much different from what you would do with a dictionary: in pandas I'm allowed to pass in more than one column name, so inside of my square brackets I can pass in a list of things, like column 1 and column 2, and this returns a data frame that has those two columns, one and two. In the first instance, when I'm extracting a single column, the type of the data that gets returned to me is a pandas Series object; in the second case it is a pandas DataFrame.
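As a small sketch of that dictionary-like access, with a made-up DataFrame whose column names are purely illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "carrier": ["AA", "DL", "UA"],
    "dep_delay": [12.0, -3.0, 0.0],
    "arr_delay": [8.0, -5.0, 2.0],
})

one_col = df["dep_delay"]                  # a single name   -> pandas Series
two_cols = df[["dep_delay", "arr_delay"]]  # a list of names -> pandas DataFrame

type(one_col)   # pandas.core.series.Series
type(two_cols)  # pandas.core.frame.DataFrame
```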
Now, for those of you who are used to working in numpy: in pandas we follow the convention that getitem calls collapse the dimension they come from. So if I apply this getitem call to the column index of my data frame, that column index disappears, and my two-dimensional data structure goes down to a one-dimensional data structure, a one-dimensional pandas Series object. The second way I have of extracting data, of accessing data, inside of my pandas data frame we're going to call numpy-like. This strategy for putting data into my data frame, or getting some data back out, is indicated by the use of a special accessor, .iloc, and this is how we provide integer-based access, in one dimension or in two dimensions, into my data frame, just like we would inside of numpy. So again we're going to use our square-bracket syntax, and if I supply only one number, or only one slice, that slice operates across the rows of my data frame. On the other hand, if I provide two integers, two getitem calls, or two things that have a colon in them, two slicing calls, the first one operates across the rows and the second one operates across the columns. There is no way I can supply one slicing operator and have it apply across the columns. If you're used to working in numpy, this follows all the same rules that numpy does: if I have a slice here, it preserves that dimension; if it is a single integer, a getitem call, we collapse that dimension, and what gets returned back to me is a one-dimensional pandas Series object. And I promise I'm almost done talking. That's a great question. There's this really famous George Box quote that all models are wrong, but some models are useful. What we're presenting to you here is not the way any of this is actually implemented under the hood; this is what pandas exposes to you as its user-facing interface, and it mimics a lot of what you see in numpy, and that is very much on purpose. Consistency in our APIs is how we make our libraries more accessible to other people. So it is not quite exactly the same thing in its implementation as numpy slicing, but as far as you are concerned its behavior is going to be close enough that you can pretend that it is. There will be a couple of hiccups, and we'll see one of those in just a second, where these rules don't quite line up, but for the most part we can pretend it's true. It's a great question, and if I'm throwing out words that people don't recognize, feel free to go ahead and ask. All right, so we have a third way of accessing data from our pandas data frame, and this is something that is unique to pandas: label-based access. This is what's going to be using the row index, the labels that are inside of there, and it's going to be using the labels that are inside of the column index. We know that we're using label-based indexing every time we see this .loc accessor. This is a way of saying to pandas that I would like something that looks kind of like numpy slicing, so the syntax is going to look the same: I can have one square bracket with a colon, and this is applied across only the row index, or if I have two, they're applied across both the rows, for the first one, and the columns, for the second one. But the difference is that it's going to look in that index, and only in that index, for where things are. So if my row index consists of a bunch of people's names, maybe Tom is in there and Joris is in there, and I'm looking for position number one, I'm going to get something that should look very familiar to you from Python: I'm going to get a KeyError, saying there is no 1 inside of this index.
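Here is a rough sketch of the .iloc and .loc behavior just described, on an invented frame indexed by people's names as in the whiteboard example.

```python
import pandas as pd

df = pd.DataFrame(
    {"height": [180, 165, 172], "weight": [80, 62, 70]},
    index=["Tom", "Joris", "Ana"],
)

df.iloc[0]        # single integer: first row, dimension collapsed to a Series
df.iloc[0:2, 1]   # slice preserves the row dimension, integer collapses the column

df.loc["Tom"]                    # label lookup in the row index
df.loc["Tom":"Joris", "height"]  # label-based slicing (more on its bounds later)

# df.loc[1] would raise a KeyError here: there is no label 1 in this index.
```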
Now, before we get to some code to see what this looks like in practice, I'd like to say one more thing, and maybe we'll write this in another color so it stands out: in pandas, indices are not the same thing as keys. When we work with dictionaries in Python, there's one very particular rule about the keys supplied to that dictionary: those keys have to be unique. For a pandas index this is not true. You can have repeated index labels both across the columns, which I don't recommend because it's super annoying, and, very commonly, across the rows. So for example, if I had four rows all of which were labeled Joris, and I used my .loc accessor looking for just Joris, even though I might expect that dimension to be collapsed, I will get all four of those rows with the name Joris attached to them. The proper way to think about this is less like keys in a dictionary and more like a database index: it's a fast way to find and retrieve data. All right, now that we've had some fun doodling, let's start looking at code in our notebooks. We're going to start with some imports that we see up here at the top, and in this import you'll notice that we're importing pandas with something called an alias: we don't just import pandas, we don't from pandas import DataFrame, we import pandas as pd. We call this a canonical import. This is not required by Python, it's not required by the Python syntax, but it's such a strong — colloquialism is not quite the word I'm looking for — convention, thank you — it's such a strong convention in the scientific programming community that if you do not do this, when other people look at your code they're going to be very confused about what's happening. So it's not syntactically required, but it's very important. There are a couple more setup things here at the top of the notebook. One of these is some matplotlib magic; we're not going to talk about the internals of matplotlib or threads of control in this particular class, and all this does for us is let us look at plots inside the notebook itself. And we're going to set some options inside of pandas, and also inside of matplotlib, to make things appropriately sized for the notebook we're working in. For those of you who haven't used Jupyter before, the way we're executing these cells is that I hold down the shift key and hit enter: one, two, three. For our example data, for our practice indexing, we're going to be reading in some data about airplane flights: they have a flight code, they have an arrival destination, they have a departure, they have a time they leave and a time they arrive, and we have the ID number of the airplane. What we have right here in the next cell is some code that's going to read that in for us, and we can take a look at the table that's output, just to give ourselves a sense of what it is we're working with. We see that we have this row index here on the left-hand side, all the numbers that are in bold. Because we have not put in our row index, or asked for one to be created for us, we get the default index in pandas, which is a sequential integer: 0, 1, 2, 3, 4. Now, this is not the same thing as the position of these data, so if I were to sort this data frame, maybe by the tail number of the aircraft, the positions, so the .iloc calls, for each of these
would change but the labels would stay the same so my labels follow my datum when I move them um for those of you who are coming from a language like like auro or Stata this is going to feel a bit weird at first but I promise you it's going to make your lives easier in the long run for our column index we have these are actually some pretty nice column names and so we have the the data the flight now the characters this is American Airlines an airline ID which is a little redundant and we'll look at that more later tail numbers for the aircraft flight coming from JFK going to LAX the departure time and this is in hundreds of hours so this is 914 so that's in the morning and if we sort of scroll over there are some more columns that were sort of hidden off the screen here we have the actual like date/time timestamp objects for the arrival and departure if we're doing things like time-stamped operations this is going to be the kind of data that we want to work with and then finally we also have the delays both it's doing departure delay in arrival I mean whether or not the flight ended up being cancelled all right so we talked about the structure of a data frame so a data frame is a collection of series objects that share a common row index and we talked about ways we can get some data out of there so for example if I wanted to extract just the column that has the departure delay for all of these aircrafts I can use my square bracket syntax might get item syntax for dictionary like access into this data frame and extract just the column with my departure delays I'd like to ask you what is the data structure that is being returned back to me here this is a serious very very good well notice it has a length so this is the number of elements that are in this series we have 20,000 data points here it also has a D type so this is float64 and for those of you who have not used numpy for quite a while numpy has its own data types separate from the Python native data types pandas uses these data type so this is telling you that I have a 64-bit floating-point number now if I want to extract just those column labels or just those row labels I have access to these as attributes that are defined on my data frame object so for example if I wanted to pull in only the columns of this data frame I can ask for the columns of the flight we'll see this is an index type object it also has a D type and this is object which for our purposes here just means that it's a Python string in here are the names of all of my columns so we can extract the column index we can do the same thing for the row index and this is an attribute called index now we're going to use both of these when we want to have label based access into this data frame and that's what we're going to work on next almost next before we're gonna have like a little teaser I guess is what this is so let's say that I would like to know now the average number of flights that are leaving the New York metropolitan area per hour I'm not going to ask you to raise your hands to show us how you might do this now this is here in the code cell for us but we can execute this and to get a sense of the kind of things that are available to us when we're using pandas so I know that doing all this up front talk about like indexing and how you get data by labels or how you get data by location can be a little boring but at some point we're going to be generating plots like this which is hopefully the motivation to keep listening to me talk and not switch over to 
Facebook we can do things like plot the number of flights per carrier we can do things like look at the correlation between the delay from the departing airport in the arrival delay as you might expect this is pretty close to two a one to one linear relationship and however flights tend to be a little bit later and on arrival and then they were under Archer all right so the idea behind indexing is that it is very very handy to do things like subset our data so I might want to only perform some set of comparisons or some set of tests on data from a certain time frame from planes from a certain carrier I might only want to be able to look at flights that have been cancelled for example or compare canceled flights and non canceled flights and we're going to look at each one of these things in turn to start with we're going to make a new data frame and this is going to be one that does not have the zero one two three four as its default row index it's going to be a sampling of flights that are indexed by their carrier so this is the first plates the first set of flights that are leaving out of New York in 2014 from Airlines like American Airlines Delta not sure who f9 is United u.s. Virgin Southwest now let's say to start with and we like to grab just two columns now those two columns we had to give us information about how delay the flight was now one for departure is one for arrivals we can do this using that getitem syntax for the square brackets this time instead of just asking for for one particular column we're going to ask for two and this means we have to pass in a list to that gate item call what's going to get returned back to us is a two-dimensional data structure this is a penas data frame that's been sort of subsetted down to just those two columns for those of you who are used to working in a language like sequel this is very much like a Select call if it's a little bit easier for us we can do this in two separate steps we can combine those strings the departure delay in the arrival delay inside of one list and then pass that list by name into my indexing operation and we'll see that I get the same thing yes that's a great question and so the question was in input 15 here I had these double brackets after the name of my data structure and the reason for that is these inner brackets is pythons lists literal syntax so I'm constructing a new list object the outer brackets are getting translated into a get item call and this is with providing me that dictionary like access into that data frame so in Python you can think about when I have just the the square brackets on their own and there's not stuff around them now this is constructing a new a new Python object when I have an opening square bracket that is either immediately after an object literal or immediately after an object name this is providing me some sort of label based access into that object all right so we're going to have our first little code exercise to make sure that everybody is on the same page and what I would like you to do is to extract from the first data frame the columns named origin and destination which is a shorthand just de St now the way this works is you're going to put your solution here and you can hold down the shift key and hit enter to see if it does what you expect afterwards you can compare your answer to Toms answer by going to the cell below and again evaluating it so holding down the shift key and hit enter and it's going to import some Python code you can look at for comparison and just a 
reminder once you think you've solved it go ahead and put your green or red one up or probably a green one and then if you need more time we'll ask you and put your red one up and then if is there anyone who's still having trouble installing the environments okay we have a cluster for later on I need to get the IP address real quick all right so it looks like most of us have got there congratulations on solving your first exercise and the panas tutorial we're gonna come back together as a group and think about how we might solve this particular problem so I know I had this data frame first so I'm gonna go ahead and type the name here I'm probably going to be performing some actions on it my name what I would like to do is extract two columns specifically now there's a lot of ways I could do that but so far in the notebooks we've only gone through one and that one way is to use the dictionary like access for dictionary like access I'm going to use my open and close square bracket and this is my Python getitem syntax but I don't just want one column I want to and so what this means is I'm going to pass in a list of column names and those names are origin and desk and if I evaluate this cell what we'll see is here I have reduced the number of columns in my data frame down to two do we all feel okay about this yes and that is a good question typically tuples and lists are interchangeable I know that there are some places even in numpy where they perform two different functions and there's an explicit type check on the data structure I don't know why the choice is made for the gate item syntax in pandas specifically so for posterity here the question was I couldn't use a tuple here and why is that and the answer was that in pandas there is something called a multi level index and this is much like like a joint key inside of sequel the idea is I can have a row that has sort of two labels attached to it and when paniced as a type check for the tuple to see if it should be using the lookup on some sort of multi level index with both labels other questions about this all right seeing none we're gonna go on to sort of the fourth one which I did not write up on the board here but we can do this now the fourth way I can access data that we'll be doing actually quite a lot in this class is in an attribute like way so what that means is when I'm working with objects inside of Python I have this dot syntax that allows me to access public data and public methods that are defined on that object now in pandas the columns themselves are not technically speaking public data they're defined as attributes but what pandas does is when it can it will create an attribute on that data frame that gives me direct access to the column now given that we already have our square bracket getitem syntax it might seem a little like why is there this additional option and the reason for that is to do to make it easier for ipython to support attribute lookups on penas data structures so when we're working in pandas it's very common to see syntax that looks like this where I have some data frame and then I'm going to grab some column by name maybe I am selecting some data from that column and I'm chaining on a method and what you'll notice is every time you have these square brackets inside of a Python the tab-completion that you're used to using stops working and the reason for that is I python has no way of knowing in advance the type of the data that comes out of that call without evaluating it on its own that would be very 
expensive to evaluate in advance every single thing that you're doing as you're typing so we can do instead is use this dot attribute lookup and i python does have the ability to see through these attribute lookups so if I type in value and then hit tab you'll see that that option for value counts does auto expand as I was extracting this column with the square bracket get item syntax this would not work now there are a couple ways where this will not work where you will not be able to have this attribute based lookups in two columns in your data frame one of these is if your column name is not a valid Python identifier the most common thing is if it has a space in it for example you cannot use it as an attribute the second thing is if your column name collides with the name of a public method that's already defined on a penas data frame object that your column access will not overwrite the public method so for example if I create a column on some data frame called mean and I try to access the column named mean via that attribute lookup what we'll see here is its returning to me that public method it's not returning to me my data so as a recommendation when you're working in real life and you have code that you're putting out into production it's always safer to use a square bracket syntax that is guaranteed to always return to you a column given that you have provided the correct name but this dot attribute lookup can be very nice when you're working interactively one more thing to be aware of um is you cannot assign a new column in to a data frame through that attribute lookup so for example in this data frame X if I try to insert a column named wrong using my dot attribute syntax what we'll see when we look at the column is that this failed and the reason is that most objects in Python if they have a class that's to find at the python level and it's not something imported from C allow you to assign arbitrary attributes on that object so we've done here is we've created a new attribute attention to my data frame and it's an attribute named wrong all right so so far we've seen our dictionary like access using that getitem syntax and we've seen the attribute like access into columns that sometimes works now the next thing that we're going to look at is label based access so this is access using that dot loke method so let's take as an example what we would like to do is to sort of filter our data frame so that we are only extracting data from a small number of carriers so only from is that American Delta US and Southwest I can use my dontlook access supplying only the row index in the data frame that gets returned back to me has all the columns of the original data frame but only a subset of those rows now if I supply two different sets of indexers so one for the rows and one for the columns where the rows is always the first one the columns is always the second one I can subset in both dimensions at the same time so in this case we're extracting just the same for Airlines so American Delta US and Southwest but in this case only the origin another destination I mean how long that departure was delayed and as a reminder just like in I'm just like a numpy whenever I pass in a single label so I'm not slicing em supplying some sort of get item call that dimension gets collapsed what gets returned back to me is a one-dimensional data structure so since I'm only asking for the tail number column that columns dimension gets collapsed what's returned to me as a pianist series where the index of 
the series shares the index of that original data frame now we can do this another way where for example we ask for just one single row label but we're also slicing across all the columns of my data frame and in this case we still get that one dimensional data structure it's still a pain a series object but in this case the index of the series that's returned to us is being pulled from the column index of the data frame so just like a numpy where my one dimensional vectors are neither row vectors and not column vectors they only exist in one dimensional space in pandas a series of jecht is not always a column it can correspond to things like data from a row yes we create a new object first or that's just the view that is an excellent question so the question was when I supply this getitem call am i creating a new copy of the data structure or am I being returned of you into the original data structure and this gentleman was at the numpy lectures yesterday which is possibly motivating the question and the answer here is it depends on what it is you're doing with it so when I am setting data into this object so let's say for example that I don't want this whole column now what I would like is something like the origin and let's say that I know for a fact that this particular flight didn't come from JFK it came from like LaGuardia when I execute this this data will be inserted into the original data structure so this dot local call is referring to some location in memory and I'm allowed to modify that location in memory so if I execute this and look at that American Airlines again we see that I modified the data in place I have not generated a new copy of my data anywhere down here where I'm asking for the specific object penas is going to create some new object in memory and return a handle to that object for me now this is going to be like next notebook or the notebook after what's particularly handy about this is that in Python there is no such thing as a piping operator so there are a lot of statistical software languages that allow you to pipe data structures across a series of transforms this does not exist in the world of Python what we can do in pandas though is chain methods together one after the other by virtue of the fact that these calls are all returning the data structures as new objects to us now one of the things we may or may not talk about later is that most of the methods I think almost all the methods now that are defined on pandas data frames have an optional argument called in-place and these are operations that will occur in memory without generating a new copy of the data so if something like memory efficiency is something that you're concerned about will say with an asterisk to be followed up on during the break alright and to finish up this section if we supply in two of these get item calls I collapse both of my dimensions and what gets returned back to me here is a single scalar value it's okay it's okay and what we'll talk about bullying asking in a bit yeah so yes if I can interrupt the camera might not get you unless you're using the microphone an important distinction here is you need with the boolean indexing that dalamud explained that directly you make a copy if you select a subset but you can still if you directly use it is in one construct and assign a new value you can still update the original data frame so what you get if you would assign this to another object it can often be a copy but that doesn't mean that you can use the syntax to modify the 
original data frame and the same goes for numpy so the question was is this data frame first some new thing or is it a view into the original data frame that we read in way back when now this is a new data structure with its own data correct so by inserting this LGA value in here we have not modified the original data frame so not in so the question was that I had said earlier that there was a way to not get a copy and there is a way to avoid generating copies when calling methods on data frames so this kind of sub sitting will return a copy terms I'm not sure I understand the question absolutely and if I understand your question correctly yes so that the data frame first when we modified it to include LGA as the departure Airport that is now true forever unless we go back all the way to the top of the notebook and we import the data and do not run the cell sorry first was created with the crew by operation it calculated a full counting we come later on the goodbye but indeed it's just its own object correct so we have permanently modified that position in first the original data frame that we read in has remained unchanged mm-hmm correct so that's a good question so the question was when I look at first I have this new data frame where is the original first and the answer is that there was only one data frame in memory called first so when we have this assignment operator were inserting new data into a memory location that's owned by this data frame so I think the point of confusion might be that when we're referring to like the other data frame is unchanged the other data frame we're referring to is like way way up here if I scroll all the way to the top is this one plates of course all right so mostly we've been doing one of two things we've either been supplying just some single label using that getitem call or we've been supplying a list of those labels we have another option though and that option is sort of the numpy strategy of slicing ranges of data out of this data frame and it works exactly the same way as it doesn't numpy but with like one one asterisk and the asterisk is that when I'm working with labels in pandas so when I'm using my indices that the slices that I use are inclusive at the upper bound so in the rest of Python and enum pie I include the lowest value and I stop before the ending value and what I'm using labels inside of pandas I include both that lower bound and that upper bound the rest of it though including the syntax is going to look the same so for example on that first data frame I can slice from American Airlines down to and then also including Delta and we see that we've excluded here airlines like Southwest that we're below Delta in that data frame now this is also a cue to you that when I'm slicing of course it depends on the order of the labels in my data frame typically you want to make sure that your data frame has been sorted on its index first particularly because if you have repeated index labels that are in different locations the slicing operation will fail all right and now it's time for another exercise we've been talking about indexing using label based access into rows and columns now what I would like you to do is to select the columns that are called tail number origin and destination but only for the carriers US air Virgin America and Southwest alright I'm seeing green flags everywhere so that's good news maybe we'll ask you this time so I have this task here I want to select columns tail number origin and destination specifically from 
US Air, Virgin America, and Southwest. What do I need to type into my cell to make that happen? Yes, I need the data frame, first, dot loc. Very good. So I have rows that are called what? US, VX, and WN. And what else do I need? Another list, for the columns. Perfect. So we have tail number, we have origin, and we have destination. Now, this is — okay, at least two of you asked me about this, so we can also have this as a discussion as a class. I'm seeing this message here that says if I pass in a missing label, I'm going to get a KeyError in the future, and it might not be super clear what's happening until I look down at my data frame and see that I have this column here that's filled with NaNs. We haven't talked about missingness yet, but NaNs are what pandas uses to represent missing values, and the reason I have this whole column of missing data is that I've asked for a column that does not exist anywhere. So if I come up here and correct my typo in the tail-number column name, now we get the expected solution. So the question was: that error message means that right now I'm getting a bunch of missing data, and in the future it's going to raise an exception? That is exactly correct. All right, so we have a couple more topics we want to cover before we move on, and one of these is boolean indexing. Like many statistical software languages, in pandas we have the ability to filter our data by applying some conditional expression to a column to get a boolean mask back. This is a pandas Series object that is filled with Trues and Falses, and we can use it to select just certain rows from our data frame. To get us started, let's look at flights that are leaving the top five airports in the state of New York. This is returning to me a series where my row index is the key code for the airport, so LGA for LaGuardia, JFK for John F. Kennedy, and what we would like to do now is to extract just the most common labels, so that's the first five. Now we can create a mask from these data: from our original flights data frame we can ask whether the airport, the origin airport, is in one of these top five, by using the isin public method that's defined on pandas Series objects. So flights is the data frame, origin is one specific column, and isin is going to return a pandas Series object the same length as the origin column, with the same row index as the origin column, filled with True and False values, where I have True in all the locations where the data in that cell is in my list of top five. And it looks like this. Now I can take this mask, this pandas Series of True and False values, and I can pass it into this .loc call to select out just the rows that are coming from one of those top five airports, the five busiest airports in New York. So here is my .loc accessor, here's that mask, and we're also selecting out the origin and destination here; we can remove those and see that I can use this just to subset the rows of this data frame. The safest way to do this is using .loc. We haven't talked about alignment yet, but one of the coolest things about pandas is that it does this matching based on index labels and column labels for you. You can also use these masks with something like .iloc, keeping in mind that your mask has to be the same length as the data frame you're trying to index into. Now, in pandas we have a whole list of methods, and functions too, that are going to return True and False values.
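Before we get to that list, here is a compact sketch of the isin mask and the .loc filtering just described. The real notebook builds this on the flights frame read from CSV; the tiny stand-in frame here, and the top-two cutoff, are just for illustration, with the column names origin and dest taken from the earlier description of the data.

```python
import pandas as pd

flights = pd.DataFrame({
    "origin": ["JFK", "LGA", "EWR", "JFK", "LGA"],
    "dest":   ["LAX", "ORD", "SFO", "ATL", "MIA"],
    "dep_delay": [14.0, -2.0, None, 30.0, 5.0],
})

top = flights["origin"].value_counts().index[:2]  # the most common origin codes
mask = flights["origin"].isin(top)                # boolean Series, same row index as flights

flights.loc[mask, ["origin", "dest"]]             # keep only the rows where the mask is True
```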
These include all of the rich comparison operators you're used to using in Python, like less than, greater than, strictly equal, greater than or equal, and less than or equal. There are also a couple in pandas that do some special behavior. In Python, every numerical library has its own built-in helper for finding where the nulls are in your data set, where the NaNs are; in pandas this is called either isnull and notnull, or isna and notna, and these return a boolean mask with True in the locations where I have missing data. I can use this, for example, to extract all the flights from this data set where the departure time is missing. And I think I saw a hand — yes, that's a great question; the question is, what is mask? We can scroll up to where we created it. When I ask for mask to be output, we see what's returned to me here: this is a pandas Series object, so it's a one-dimensional data structure, and it has a data type, my boolean data type, because it only has True and False values. And the crucial thing about the masks that get returned in pandas is that they share — I can't just highlight the index here — but they share the row index from the data frame. So when we use that label-based access, like .loc, it's going to rely on these labels, and what that means is that if your data frame has gotten mis-sorted at some point between when you created this mask and when you apply it, those True and False values still go to the correct rows; you don't have to worry about realigning things yourself. That's a great question, and the question was: why are they all True, are there False values somewhere that we don't see? And yes, that's exactly right; there are False values in this mask, they're just between where these dots are. The dots are meant to signify that some of the rows have been elided for space; if this printed all 20,000 data points it would blow up our browser. Another question — I think there are two different questions in here, so maybe I'll try repeating the question as I've understood it. The question was: if I have two kinds of masks that I'm creating and I would like to combine them together, how do I do that? You are correct that you are going to use "or" and "and", but with a very strict caveat: when we combine masks in pandas, we cannot use Python's logical operators, the English words and, or, and not. What we do instead is rely on the same behavior that numpy has, and we use the bitwise operators: the ampersand is and, the pipe operator is or, and the tilde isn't exactly not, it's the inverse, but for boolean arrays it gives you the same thing. And these are performed not bitwise but element-wise across the elements. Yeah, the non-inclusive or is its own operator in Python; I think I would need to see the syntax you're referring to to give you a definitive answer about this, so can we push this offline? Okay, thank you.
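Here is a small sketch of those null-detection helpers and the element-wise mask combination just discussed; the departure-time values are invented, in the same hundreds-of-hours format the flights data uses.

```python
import pandas as pd
import numpy as np

dep_time = pd.Series([914.0, np.nan, 545.0, 1830.0, np.nan])

dep_time.isna()          # True where the value is missing
dep_time.isna().sum()    # 2 -- True counts as 1, False as 0

early = dep_time < 600
late = dep_time > 1800

# Combine masks element-wise with &, | and ~, never the keywords and/or/not.
dep_time[early | late]        # before 6:00 or after 18:00
dep_time[~dep_time.isna()]    # only the non-missing entries
```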
All right, so now that we've seen some boolean indexing, where we're using .loc with a mask, and that mask is a pandas Series object that shares the row index of our data frame and contains the boolean values True and False, we'd like to give you a chance to practice this. We have two exercises here for using boolean indexing in pandas. For the first one, I would like to see all of the flights that were cancelled, and in the description here is a hint about the code you would need to make this happen. For the second one, we're going to ask you to find all the flights departing before the hour of 6:00 a.m. or after the hour of 6:00 p.m., so this is one of the ones where you'd need to combine masks with one of those bitwise operators. All right, so somebody's going to check — yes, to see if the building will actually fall down on our heads — and in the meantime the rest of us are going to blithely pretend that nothing is happening and move along with the exercise. We have this data frame, flights, and we want to know all the locations, so we're going to use .loc, specifically those locations where the cancelled value is equal to one. I can ask for all the locations where my cancelled column is equal to one, and this gives me all of the cancelled flights. Now, this is not the only way you could have solved this problem. Instead of using the square-bracket, dictionary-like access, how many of us used the dot-attribute lookup? Awesome, you are spot-on; that is also 100% correct. Did somebody do something that was in neither of these two categories? Oh, that's a good one: instead of using the operator, we can ask very explicitly for the method instead. Perfect, and that would work too. Is that what you did, for things that are greater than zero? This also works, because in this case the values are only one and zero. Awesome.
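A sketch of those different solutions side by side; the toy frame and the column name cancelled are assumptions based on the description above rather than the tutorial's actual file.

```python
import pandas as pd

flights = pd.DataFrame({"flight": [1, 2, 3], "cancelled": [0, 1, 0]})

flights.loc[flights["cancelled"] == 1]   # comparison operator on a column
flights.loc[flights.cancelled == 1]      # attribute lookup, fine interactively
flights.loc[flights["cancelled"].eq(1)]  # the explicit method instead of ==
flights.loc[flights["cancelled"] > 0]    # also works here, since values are only 0 and 1
```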
We had one more, and this one was a little trickier. In exercise number two we want to find all the flights that are leaving before 6 a.m. or after 6 p.m., and there were a couple of steps to this one. Just like before, we're asking for some locations from our data frame, so this is again going to be flights.loc, but in this case we're subsetting based on two masks that we'd like to combine together. To make things a little easier for ourselves, we're going to put the masks we generate inside of parentheses; we use those to force the order of evaluation to be the way we want, so that the masks are created before the operator is applied. The operator we're going to use is this pipe operator — if you haven't seen it before, it's at the top right-hand side of your keyboard. In Python this is the bitwise or operator, and in pandas it is the element-wise or, so we want the rows where either one of these two masks is true: either before 6:00 a.m. or after 6:00 p.m. For the things that are before 6:00 a.m., we ask for the departure time, and then there's one little extra step in here, which is .dt.hour. When we were working on this exercise, quite a few of you asked me what this .dt is, and this is a place where pandas can hide some functionality that's specific to one type of data: this .dt namespace contains a lot of attributes and methods that are specifically for working with timestamp data. As you get on in pandas and do more complicated stuff, you'll see there is also text processing, things like regular expression patterns, which you'll find inside the .str namespace. So we have the departure time's .dt.hour less than six, or flights' .dt.hour greater than 18 — and I did forget to put the dep in there. All right, when I've combined those two masks, I get all the flights that are either leaving early in the morning or leaving late in the evening. There were some of you who composed these two masks separately, so you gave each one of them a name, combined those names together, and used the final name inside of the indexing operation; that is also 100% correct, you are on the right track. Now, we do need to move on, but I would like to pause for a minute to appreciate that the ground is shaking, and to ask if there are any last-minute questions before we move away from the indexing notebook to our next topic. Yes — because we are still looking in that row index. And there was another question, yes; that's a good question, and we can really quickly take a look at just the top couple of rows from this data frame. There is this departure-time column that has, in hundreds of hours, when flights are leaving, and the question was: why couldn't I just ask for things that are less than 600 or greater than 1800? The answer is you also could have done that, and there were quite a few people who did; this .dt.hour was a way to sneak some advanced pandas functionality into our indexing section, so you all learned something extra new. Yes — they don't match up, it's just that 1902 is 19 hours, and I'm not sure why, so we can look at this offline and I'll get back to you; from what I see these match up fairly one-to-one, so 852 for 852 hours — oh, but for the dep_time ones those seemed to match up — yeah, we'll look into this later. Any other last-minute questions? In the back, yes. Yeah, I think so; the question was, we're using this index with strings as the labels, or, as we'll see later, timestamps as the labels — can I have just integers as the labels? And yes, and you can think about that as representing the original ordering of your data, which is separate from .iloc, the current ordering of your data. And we'll make this the last one — yeah, the question was: does the datetime functionality deal with non-standard or non-Gregorian calendars, and the answer is not as far as I know. You may or may not at some point in your life have the need to count from one business quarter to the next business quarter, and that is something datetime operations in pandas will handle; if you want support for something like a lunar calendar, you can do that with a custom datetime offset, but that's something you would need to write yourself. All right, so it's now time for us to move on to the alignment notebook; ladies and gentlemen, may I present Tom. So if you still have that README notebook open, you can go back to that and click on Alignment, or in the
main tree here you can click on alignment either want to work give you a second to get that open all right and yeah I should say we had to cut that last notebook short but there's a bit more on I'll oak which we talked about briefly and then a bit on indexing with date times so if we get to the day time section we'll come back here and take a look at it but from now yeah for now we're gonna move on to alignment okay so alignment is all about making your job as an analyst or whatever whenever you're working with data making that easier just increasing your the the flow of your your data as you're as you're working with data okay so it's not a thing that you actually do explicitly really it's gonna be happening in the background pandas is going to be doing it for you but it has a some implications for what operations do okay so we're gonna start off by doing our usual imports here numpy pandas and a few others and then we're gonna take a look at doing things the wrong way first okay which I hear is a bad teaching strategy but oh well so we have some data from Fred on GDP and CPI CPI is a measure of inflation and each of the two data sets are in a separate CSV and there's two columns in each CSV there's a date and then there's the actual values either the CPI in this case or GDP and the other one so we're gonna go ahead and read those in we saw a PD beat CSV earlier parse dates is new we're saying these columns they're you know they're strings in a CSV everything's a string so we tell tell pandas that these strings that are actually dates and we'll take a look at those first two so we have the GDP date and the actual value the GDP and for CPI they're the values and the date any questions on reading in the CSV s ok all right so let's do a little task here our goal is to compute real GDP so for those of you who've forgotten their economics courses we have nominal GDP is just some number measured in dollars in this case in the u.s. 
we measure in dollars; it's the total output of the country. Then real GDP is nominal GDP adjusted for inflation, so it's in constant dollars. To compute that, you take nominal GDP and divide by some measure of inflation, typically the GDP deflator. We're not doing that, so that I can make a teaching point: we're going to use CPI instead of the GDP deflator. So that is our little task here, to compute real GDP, and the operation we need to do is division. If you're familiar with SQL, maybe you have two tables here, one for GDP and one for CPI, and this is basically how you would write it: you would select the date column, you'd have the operation GDP divided by CPI, maybe give it a name, and you would have to explicitly say, I'm going to join these two tables using the date column. Okay, who's familiar with SQL? Okay. I think you would do this in R as well — I'm most familiar with R — but I think this is roughly how you do things. And we can do it in pandas that way too. To do joins, joining multiple tables, you can use pd.merge: you give it your left table and your right table, you say on, which is kind of like the USING column keyword there, and then we're going to do an inner join, just to be explicit about it. Then we can go ahead and do the join, and once we have that table that's joined together, with all three columns, we can go ahead and do the division: we select the GDP column and divide it by the CPI column. Yeah — then you would do left_on equal to one and right_on equal to the other; sorry, the question was what to do if they represent the same thing but have different names, and that's how you do it. Okay, so we've done the merge and we've computed that, and there are a couple of things that aren't great about this output. First of all, we lost the dates: in both of our original data frames we have the date next to our values, the GDP or the CPI, and after we've done the computation we've lost the date, which is kind of annoying. It's not the end of the world, we can add it back, but it's annoying. The second thing is that we had to manually think about joining these two together, and that's just an extra step. Our goal was to compute real GDP, which involves taking nominal GDP divided by CPI; nowhere in "nominal divided by CPI" is there anything about merging, so why do we have to think about that? Okay, so there's a better way to do this with pandas, and it relies on using your index labels to do alignment, so we're going to take a look at that. Again we're going to read in these CSVs, reading the data set in fresh, and we're making a couple of changes. The first one is that we're saying index_col equals date; this is the first important change, and it's saying I want my row labels to be the date. Everyone see that? Remember, this is the Series output here: we have the index, the row labels, and then the values. Previously, when we read it in, we had just our default index of zero through the length of the data frame minus one, and then we had two columns; now our row labels are the actual dates. The second thing we changed was the squeeze. If you leave that off, you get a data frame — you see the HTML output here, the nice pretty table — and with the squeeze, it squeezes that one-column data frame down to a Series. So this is a Series, and I'll talk about why we did that in a second. All right, so we have one Series called GDP, and we have a second Series called CPI — again the same things, index_col equals date, and squeeze it down. So now we have two Series.
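Roughly what those two approaches look like in code, as a sketch. The file paths and the GDP/CPI column names here are assumptions, not the tutorial's actual files, and the squeeze keyword the tutorial passes to read_csv has since been removed from pandas, so in 2.x you call .squeeze("columns") on the result instead, as below.

```python
import pandas as pd

# Merge-based version: join explicitly on the date, then divide.
gdp = pd.read_csv("data/gdp.csv", parse_dates=["date"])
cpi = pd.read_csv("data/cpi.csv", parse_dates=["date"])

m = pd.merge(gdp, cpi, on="date", how="inner")
real_gdp = m["GDP"] / m["CPI"]   # works, but the dates are lost from the result

# Index-based version: make the date the row label and squeeze each
# one-column DataFrame down to a Series.
gdp = pd.read_csv("data/gdp.csv", parse_dates=["date"], index_col="date").squeeze("columns")
cpi = pd.read_csv("data/cpi.csv", parse_dates=["date"], index_col="date").squeeze("columns")
```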
We want to compute real GDP, and the formula for that is nominal GDP divided by CPI; we just write that, we do it, and we get our output. Let's reflect on that for a second: there's nothing involving joins here, okay? We're just writing our formula for real GDP and going ahead with it, and the key to that is that we used the date as the row labels. So before we take questions, first of all let's talk about all these NaNs. We saw these earlier on; this is not-a-number, pandas' missing-value indicator for numeric data. So where did these missing values come from? If we look back at the original data, we can look at the head of GDP and the head of CPI, and you'll notice that the two sets of dates don't align exactly: GDP is quarterly, so there's month one, four, seven, whereas CPI is monthly, so one, two, three, four. That's why we have to go through this whole business of merging or aligning in the first place: they're at different frequencies, they don't have the same index. With SQL you would explicitly say, I want to join these two on this field, and with pandas we're going to let pandas do it for us, by setting those dates as our row labels. Like I said, you don't typically do this kind of alignment explicitly; typically it's happening in the background for you. But just to see what pandas is doing for us, let's actually write it out. We're going to take GDP's index — remember, this is the one where the date is our row labels — and union it, you can use the pipe operator or you can use union, with CPI's index. This is the full index, and it's monthly, right? It has all of the observations from either the GDP or the CPI series. Any questions on that union? It's like a set union there. Okay, so we have this new full index with all of the dates from either one, and then we do a reindex. Again, typically you won't be doing this manually, but this is basically what pandas is doing in the background: it's doing a union of the row labels, and it's reindexing GDP and CPI. So you can take a look here — reindex with the full index — and this will be the same output as what you get when you just write GDP divided by CPI, but that is nicer, right, than having to do the reindexing yourself. Any questions on that? All right, this comes up in a lot of places, and I think it's unique to pandas: if you're familiar with R, when you have differing row labels I don't think it does this kind of alignment for you, I think it ignores them, so it's a kind of weird thing, but it ends up being quite useful. Okay, let's do an exercise. We're going to be computing real GDP in 2009 dollars, and this is going to involve a couple of things. Previously, I think the base for CPI was the average of 1982 to 1984 — CPI is an index, so we've arbitrarily said that some value is equal to 100 and everything else is relative to that — and we're going to shift that arbitrary value to be the average of 2005. So we want to find the average CPI for 2005, which I'm now realizing we don't know how to do, because we skipped that section; apologies, I'll show you quickly. How do I select the values for just 2005 from CPI? This is the datetime indexing section that we skipped. Typically, with loc, we look up by labels, so maybe you're thinking, okay, I need something like 2005, maybe something like this, but this is
Okay, let's do an exercise. We're going to compute real GDP in 2009 dollars, and this involves a couple of things. Previously, I think, the base for CPI was the average of 1982 to 1984; CPI is an index, so we've arbitrarily said that some value is equal to 100 and everything else is relative to that, and we're going to shift that arbitrary base to be the average of 2005. So we want to find the average CPI for 2005, which I'm now realizing we don't know how to do because we skipped that section, apologies, so I'll show you quickly. How do I select the values for just 2005 from cpi? This is the datetime indexing section that we skipped. Typically, .loc looks things up by label, so maybe you're thinking, okay, I need something like cpi.loc[2005], but that would look up the integer 2005. So what do we do here? You could do something like pd.Timestamp, but we're not going to do that; we're just going to do the easy thing and put it in a string. pandas makes a special case for datetime indexes: this is not looking up the string '2005', it's looking up the year 2005, or you can do the month, 2005 January. Everyone okay with that? That's a bit of magic we skipped over, but everyone see what's going on? Makes sense? Okay.

All right, so now we have the CPI for 2005; sorry, we want 2009 for the exercise. So we have the CPI for 2009, and now we need to rescale the entire CPI series to be base 2009: we want the average value in 2009 to be one hundred. If you're familiar with numpy you might know how to take an average, a mean perhaps, so I'll let you work on this exercise now. We'll give it maybe three or four minutes, and when you're done put up your green flag, or if you get stuck put up your red one; I'd recommend struggling a little bit if you need to, then we can take a look at the solution, and we'll do a break at the end of this notebook in about 15 minutes or so, at 3:30.

Okay, let's start on the first step. We saw how to select 2009, and maybe this part wasn't clear; I have a warped brain because of my economics background, but the average value for 2009 is just some number, it's about 214, arbitrary. What we want to do is take this number and do the actual rescaling; it's something we do as economists, so sorry if the instructions weren't clear. To rescale a series to a new base, the idea is that you take the original series, CPI, and divide it by the new thing that you want to call 100, and we're making 100 equal to the average for 2009. So this is our new base: 214 is essentially going to become 100. We say cpi divided by the average, and if you look at the scale it's a bit off, right? Our numbers are basically centered around 1, everyone see that, a little less than 1, a little more than 1. That's because when we divided by that mean we got shifted down by a factor of a hundred, so to rescale it back up to a proper index we multiply by a hundred, and this is our new CPI. Again, breaking that down: the average is just some number, and notice that when you divide a Series by a scalar like this, pandas has broadcasting like numpy, so you can divide arrays by scalars and it works how you would expect, and then we rescale it to make it an index. Any questions on that? Okay.
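Pulling the exercise solution just walked through into one sketch; cpi is the Series read in earlier with a DatetimeIndex, and 214 is only the approximate value mentioned on the slide, not an exact figure.

```python
# Partial-string indexing on a DatetimeIndex: '2009' selects the whole year.
avg_2009 = cpi.loc["2009"].mean()      # just a scalar, roughly 214

# Rescale so the 2009 average becomes the new base of 100;
# dividing a Series by a scalar broadcasts, just like numpy.
cpi_2009_base = cpi / avg_2009 * 100
```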
Okay, for the next bit, remember how earlier when we did the read_csv we had that squeeze at the end, and I promised we would come back to it? This is why. We're going to make a new thing called gdp_, and this is the same as before, only we've removed the squeeze. Remember, squeeze takes a DataFrame with one column and squeezes it down to a Series, so without it this is a DataFrame, gdp_ and cpi_. At some point this will just happen to you: you're going along doing your analysis and all of a sudden all of your data is missing, and the likely cause is that pandas aligned something when you didn't expect it to. So what do you think happened here? Yes: data frames have two indexes, right, the row index and the columns index, and pandas has aligned on both of them, which is sometimes what you want, not always, but pandas is consistent, it always aligns on both. So if you end up with all missing data at some point, maybe check your indices and see if you had a DataFrame where you expected a Series. The reason it worked earlier is that a Series only has one index, just the row index. Questions on that? This will happen to you at some point.

Okay, missing data. We've seen that in both notebooks so far, I think, and it happens often depending on your data set, so pandas has a lot of tools for working with missing data. First, for detecting missing data you can use .isna() or .isnull(), same thing; this gives you a boolean mask of where the values are missing: True where a value is missing, False where it's not. .notna() is the same but reversed: True where it's not missing, False where it is. We saw boolean masking earlier: give me the rows where GDP is missing; you can use .loc, or, since pandas considers boolean masking special, you can also pass the mask into getitem, we had a question about that earlier. And with boolean masks there's a nice little numpy thing: when you do operations on them, sum is probably the most useful one, True counts as one and False counts as zero, so if you want to count the number of True values you can do .sum(), or .mean() if you want the fraction of values that are True. In this case, how many missing values do we have? We have 552. Questions there?

Okay, so that's detecting missing data. Maybe you've detected some and you want to do something with it. One option is to drop it; that's the preferred option if you can make your problems go away with .dropna(). In this case, remember real GDP: we introduced missing values when pandas did the alignment, because there wasn't an original GDP observation for that month, and when we did the division pandas aligned and we got missing values. We can drop them with .dropna(). One common point of confusion: what do we think will happen when I execute this cell, will the missing values still be there or not? They will still be there. dropna does not mutate data in place; it's going to give you a new result.
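The detection and dropping tools as code, assuming rgdp is the real GDP Series with the alignment-induced NaNs from above.

```python
rgdp = gdp / cpi            # the NaNs come from the index alignment

rgdp.isna()                 # boolean mask, True where missing (isnull is the same)
rgdp.notna()                # the reverse
rgdp.loc[rgdp.isna()]       # boolean masking: only the missing rows
rgdp.isna().sum()           # number of missing values (True counts as 1)
rgdp.isna().mean()          # fraction of values that are missing

rgdp.dropna()               # returns a new Series; rgdp itself is unchanged
rgdp = rgdp.dropna()        # prefer reassignment over inplace=True
```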
Yes, it does, yep: the question was does dropna have inplace equal to True. It does, and I recommend never using it. It's confusing because inplace has two meanings. There are two ways to change what my variable rgdp refers to: there's the concept of variable names, and there's the concept of copying memory, and inplace currently conflates those two. When you do something like dropna with inplace equal to True, it updates rgdp and we don't have to reassign it; we could also update our variable called rgdp by reassigning it like this. So that takes care of the "what does my variable rgdp refer to" part. The second thing is copying. Whether dropna copies here is a bit of a mystery: I don't know if this copies or not, we would have to look at the implementation, and it's going to be horribly confusing, so it may or may not. Many of the times when you have inplace equal to True, what pandas does is make a copy of your data internally, update that copy, and then rebind your reference to the newly updated thing. So if you're trying to avoid memory usage, inplace will not always do that for you; sorry, we're working on it, someday. Any other questions?

All righty, we just saw dropna. When you have a DataFrame, things are a bit more complex, because you have to specify what counts as missing enough to drop: if just one value is missing, do I drop the row, or do all the values have to be missing? Do you want to drop along the columns or the index? So DataFrame dropna has a few extra parameters. These are the defaults, so if you get rid of all the arguments, this is what it's going to do: drop any row that has any missing data. Another way to say that is axis equals 'index', so I'm dropping values along the index, vertically, and how I'm dropping is if any values are missing. An alternative is to drop the row only if all of the values are missing, and the difference there is specifying how equal to 'all'. In this case no rows were dropped, because we didn't have any rows that were entirely missing.

All right, I think we have an exercise here, so go ahead and take a look at it: drop any columns from df that have at least one missing value. Maybe take one minute or so for this; I'll put the docstring for dropna up on the screen. Okay, I have two parameters I can specify. The first one is axis: what should that be? Columns, thank you. I could just rely on the default for the other one, but let's be explicit: what should how be? Any, right: drop the columns that have any missing values. Questions there? Okay, cool.

All right, filling missing values. We've covered detecting missing data and dropping it; sometimes you might want to fill it with some strategy instead. One strategy is to forward fill values: take the last valid observation and fill it forward. You can do that; the result is a Series, and just to take a look at it, this is a Series with no missing values, everything's been filled forward, and then we can plot it so you can see what that fillna did. There are other strategies: you can fill in a scalar value, pass a dictionary or a Series to specify a specific value per column, or use bfill and ffill. We'll cover those when we get to them.

So we've seen both of these already; I really quickly sped past concat when we made df, sorry about that, but let's talk about merge in a bit more detail now. We saw merge when we wanted to replicate that SQL example, so the SQL experts in the room have an advantage here, but go ahead and see if you can join the two data frames, gdp_bad and cpi_bad, with an outer join; earlier on we did an inner join, and if you want a little hint you can scroll up to one of the first few cells in the notebook. So we're going to join these two and then we're going to see how to sort them as well. Take a minute to work on that, and if you get stuck, take a look at our first example.
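A sketch of the DataFrame variants, the forward fill, and the outer-join exercise; df, rgdp, gdp_bad, and cpi_bad refer to objects built in the notebook, and 'date' is assumed to be the join column.

```python
# DataFrame.dropna defaults, spelled out: drop a row if ANY of its values are missing.
df.dropna(axis="index", how="any")
df.dropna(axis="index", how="all")      # only drop rows that are entirely missing
df.dropna(axis="columns", how="any")    # the exercise: drop columns with any NaN

# Filling: carry the last valid observation forward, then plot the result.
rgdp.fillna(method="ffill").plot()

# The outer-join exercise, sorted by date afterwards.
pd.merge(gdp_bad, cpi_bad, on="date", how="outer").sort_values("date")
```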
Okay, we're going to start again, and the next topic, in the next notebook that you can open, is the groupby notebook, number three. For this notebook, again I'm going to import pandas and some other libraries, and this time we are going to use a data set on beers. These are beer reviews, apparently all the reviews by the top beer reviewers. So what is inside the data set? There are beer names, the beer style, and then reviews, scores from 0 to 5 on some different things: appearance, aroma, palate, I'm not sure where that one is, the taste, and the overall score. We also have a textual review, so a real review in text, when they did the review, the name of the reviewer, and the alcohol percentage; so a bunch of data about people who reviewed a lot of different beers. Here you also see the different columns. I'm going to skip those two exercises; we will use that later on and we will explain it when we need it in the exercises.

So I will start by explaining the concept of groupby and why we would need it. If I take a look at the data frame, a typical question you would like to answer, given that many different people reviewed the same beer, could be: what is the average review score for a certain beer? What we could do, with what we already learned today, is select with a boolean mask all the rows where the beer name is equal to, say, the first one here, and make a boolean selection of that. So now, of all the beer reviews, apparently 17 people reviewed this particular beer. If that is my selection, I can take the overall review of that subset and calculate the mean, and I can repeat this for another beer; but of course, if you want to do this for all the beers in the data set, you don't want to do it manually, you want to do it automatically. So this pattern, where you want to do a certain operation, in this case taking the average of the overall review, for each beer, for each group, is where a groupby operation will help you.

You can see a groupby operation as different steps. We are going to split up our data frame into different groups; that's what I did manually by taking a subset of the data frame, so that would be one group, and we split into all those different groups, for example all the different beer names. We apply a function to each group, in my example the mean. And in the end we want to combine those results, all those average overall reviews for each of the beers, into a single data frame again. Those are the three steps of a groupby operation, and pandas provides the groupby method on a data frame to do this.

The first step is splitting. How does this work? Inside the groupby method you specify the column name on which you want to group, so this will look at the beer name column, calculate all the unique beer names, and create a group for each beer. If you do this, it gives us a DataFrameGroupBy object; it's not yet a new data frame, because we first need to apply a function and combine the results again. The apply and combine happen together, and for example, to do what we did above, we can use aggregate mean here. So what is happening in this line? This part we have already seen: we want to group my data frame by the beer name and calculate the average, and that is what this is doing. I will just revert this to show you: this would, for each beer name, give me the average of all the columns that were originally there. But maybe I'm not interested in the average of every column, for example the average ID or the average alcohol percentage, so I can select columns, similarly to how you would select columns on a normal data frame, using a list inside the getitem, the dictionary-access brackets, and we only calculate the average on those. So now this line, for each beer name, calculates the average of those review columns and combines the result in a data frame.
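The manual boolean-mask version next to the groupby version, as a sketch; the DataFrame name (reviews) and the column names (beer_name, review_overall, and so on) are assumptions about the beer data set, not necessarily the notebook's exact spellings.

```python
# Manual: one beer at a time, with a boolean mask.
one_beer = reviews.loc[reviews["beer_name"] == "Some Beer", "review_overall"]
one_beer.mean()

# groupby: split by beer name, apply the mean, combine the results.
review_cols = ["review_overall", "review_appearance", "review_aroma"]
reviews.groupby("beer_name")[review_cols].mean()     # or .agg("mean")
```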
There are some different ways to specify the groups; they are listed here above. Typically you provide a single string referring to one of the columns of your data frame, but it can also be a full Series that has the same length as your data frame, or a function that is first applied to the index, where the result of that function is what gets grouped on. If you have multiple levels in your index you can provide the levels, but we didn't see that yet, so that's not important for this notebook.

How to apply the function can also be done in different ways. The aggregate method is a very generic method: if you want to do an aggregation, calculating a single value for each group for each column, you can use aggregate. You also have an alternative syntax where you simply use the mean method, and this gives you the same result. And whether through aggregate or directly as a method, the typical aggregation methods that work on a normal data frame also work on this groupby object: for example the minimum, the minimum review for each beer, or maybe we want a median; those typical aggregation functions are all provided on this object. Yes, so that's why it says here that using aggregate or the mean method is, in this case, equivalent.

Okay, let's try that yourself: a small exercise to find the beer style, so that's another column with bigger groups, where there is the greatest variance in the alcohol percentage within the group.

Okay, I will go quickly through the exercise. What do we want? The beer style with the greatest variance in alcohol percentage. So how do we do that with groupby? We first specify which column we want to group by; we want to know something for each beer style, so we group by beer style. Then we want to calculate something for the ABV column, so we select ABV, oops, that's not it, there. And what do we want to calculate? The variance in each group, so we can use either var or std, the variance or the standard deviation; we just want to know which one is biggest, so it doesn't matter which of the two you take. This now calculates, for each beer style, the standard deviation or the variance of the alcohol percentage over all the reviews in the data frame for that group. The question was which beer style has the greatest variance, and one way to find that is by sorting the values so we can inspect the result. I can give this a name for a moment and then call sort_values on it, and you should see that American Double / Imperial Stout has the largest variance. If that's enough for you, you just want to know that one; if you want it at the top of your data frame, don't sort ascending but descending: use the ascending keyword and specify False, so now that value is at the top.
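The variance exercise as code, first with an intermediate variable and then as one method chain (which is exactly the method chaining discussed next); beer_style and abv are assumed column names.

```python
# Spread of alcohol content per beer style.
spread = reviews.groupby("beer_style")["abv"].std()
spread.sort_values(ascending=False)          # largest spread at the top

# The same thing as a single method chain, no intermediate variable.
reviews.groupby("beer_style")["abv"].std().sort_values(ascending=False)
```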
Another thing I wanted to note: in this case I assigned the result to a new variable, and on that variable I called sort_values, and that's fine, but we could also have done it like this, putting everything on one line. That's what someone asked about at the beginning, method chaining, and this is what method chaining is: because each part returns a new pandas object, a Series in this case, we can always directly call the next method that we want on that object, so this one-liner also gives us the result we want.

Okay, that was a basic example of groupby. Groupby has a lot more power, to group by multiple things or calculate multiple things at the same time, so I'm going to give you just a small taste of that. We don't explain everything, because we don't have time to go very deeply into MultiIndexes, but: here I group by a single grouper and calculate something on a single column, and I get a Series back. If I do it on multiple columns, either by not selecting a single column or by putting multiple names inside the square brackets, I get back a DataFrame, and you'll notice that the beer style has become my index. But with groupby you can also group by multiple columns: if I then simply take the review overall and the mean, you will see I now have two levels in the index, so both groupers, both columns that I grouped by, have become part of my index. That's called a MultiIndex. You can do a lot of powerful things with those indexes, but that's a bit out of scope for the current tutorial. Not only can you group by multiple columns, you can also calculate multiple functions in one go, by giving a list to the aggregate function, and do that for different columns; in that case you get multiple levels in the index, from the multiple groupers, and also multiple levels in the columns: for each column, review appearance, review aroma, we calculated the different functions. That way you can create very complex data frames, but in many cases the pattern we have seen up to now is what you will use most often.

Okay, we're first going to do some extra exercises on groupby. The first one, what is the question? You want to make a visualization, plots of the relationship between the review length, the number of characters of the review, and the average overall review. Just to give you a hint for the first one: what we are actually interested in, if you look at the data frame, is the column called text, this one, which has the textual review, and we want to know the length of that review. That's something I want to explain now, as I skipped it in the beginning. If you have a text column, a column with strings, pandas provides some additional functionality to work on those strings, similar to what we saw in one of the previous notebooks: if you have a datetime column, you can use the .dt accessor to, for example, get the hour. On a text column, a column with strings, we have something very similar, .str, which groups all the typical string methods, like capitalize, count, find, converting to lower or upper case, stripping characters, and for example the length is also available. With that you should be able to start this exercise.

Okay, I will quickly go through the exercises. I already gave away that you can use .str.len(), so let's look at what that gives: the length of each text review. Then we want to group by this length, so that we calculate the average review for each possible length. Up to now, when we did a groupby, we always gave it a certain column by its name, but you can also directly pass a Series.
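A sketch of the .str accessor and of grouping by a computed Series; text and review_overall are assumed column names.

```python
# Number of characters in each textual review.
review_len = reviews["text"].str.len()

# Group by that Series directly (same length as the DataFrame) and take the
# average overall review for each possible review length.
reviews.groupby(review_len)["review_overall"].mean()
```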
So I can group by this length Series, and then for the review overall I want the average. This calculates, for each text length that is in there, the average; of course there are many possible lengths, so they won't be very big groups, but it does the calculation. Now we want to make a plot of this, to get a visual idea of the relationship, and we can use plot for this. There is a small hint here to use style equal to 'k.'; what does that do? The k means make it black, and I suppose they use k because b was already taken for blue, so that's something to remember, k is black; the dot means I want points, not the default of lines. This gives us this plot where you can see, here in the middle, some relationship; I'm not fully sure about that, it's rather some variation on the tails. So here we grouped by a Series, something we calculated on the fly.

The next exercise is very similar to the one above, but instead of calculating the number of characters in the review, we count the number of words. It's a bit of a tricky one, so I will just show it: instead of the length that we had before, we now have a count. What does this syntax do? It counts the number of times we match this pattern; I'm also not fully sure whether this counts a word or a whitespace-delimited thing, so it's a pattern for words, I suppose, surrounded by whitespace or newlines. So this counts the number of words, but that's something you need to know, and there is a link here to the documentation about regular expressions. For the rest this is completely the same; apparently it takes a bit more time to calculate, but we get a very similar plot.

Maybe something I think we didn't explain yet: using the brackets around it. What I did above is put this all on a single line, but if you have even more things you want to calculate on that result, and if you like this method chaining instead of defining an intermediate variable each time and then going further with that one, the line can get a bit too long, and typically, for example in style guides for Python, we want to limit the line length, say to 80 characters. The way shown below is one way to wrap such a single method chain over multiple lines: we just put each call on the next line, like this. The only thing is that you need to tell Python to keep looking on the next line, instead of treating what it has as a complete statement, and putting brackets around the whole thing is one way to say to Python: this is a single expression that it needs to evaluate in one go.
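The word-count variant, wrapped in parentheses so the chain can span several lines; the \w+ pattern is one plausible reading of "a word", as hedged above.

```python
# Average overall review as a function of the number of words in the review,
# plotted as black dots.
(reviews
    .groupby(reviews["text"].str.count(r"\w+"))["review_overall"]
    .mean()
    .plot(style="k."))
```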
The last exercise for groupby: here the question was to find the relationship between the number of reviews for a beer and the average overall review. So, for a beer in general, I will group by the beer name, and I want to calculate something for the review overall. One of the things I want to calculate is the average, the mean, but there's actually not just a single thing I want for each beer; I also want to know the number of reviews there are, so the size of each group. We mentioned a little bit before that we can use the aggregate syntax, which is equivalent to what we just used, and in this case I can give multiple things to calculate at once. That's something to remember: you can use count to indicate that, apart from taking the average of the overall reviews, I also want to count the number of values in that group. So now I have a mean and a count for each beer.

If you then want to actually look at the result and create a visualization for it, one way to do this is with plot: similarly to the .str and .dt special attributes, .plot is also such an attribute, and we can use plot.scatter to create a scatter plot, where I say x is the count and y is the mean, like that. So now you have a scatter plot of the count, the number of reviews, against the average overall review. Plot is a bit of a special case, because it can also be called as a normal method, but then the x and y in this case won't work like that, so you can use .plot.scatter to create scatter plots, and there are also bar plots, box plots, all the different plotting options that pandas provides. So what we have seen in this example is that we calculated two things at the same time by providing a list to aggregate.

Any more questions about groupby? Then, for the rest of the notebook, there is still some material on transforming with groupby, where you only want to transform your data and not aggregate it. Transforming could be: I want to normalize each group, for example subtract the mean of that group, but keep the original shape of my data frame; that's the transform operation. And apply, with groupby, is very generic: you can apply any function you want on each group. So if you want, you can always look at that part; there are some exercises, and solutions are provided.
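Before leaving groupby, the last exercise as a sketch: two aggregations at once and a scatter plot of them; column names are, as before, assumptions.

```python
# Mean overall review and number of reviews, per beer, in one go.
per_beer = reviews.groupby("beer_name")["review_overall"].agg(["mean", "count"])

# .plot is an accessor like .str and .dt; scatter the count against the mean.
per_beer.plot.scatter(x="count", y="mean")
```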
Time-wise, though, we will now go to the next notebook, and that is the one on tidy data; you probably want to switch. [Music]

All right, so we have at least two more topics that we've planned to fit into the next hour here, so we're going to pick up the pace a little bit. The next topic is what is called here tidy data. If you studied computer science like ten years ago, you would have called this normalization. I would bet that most of you have not heard of this at all, not because I think something special about you, but because nobody ever taught this stuff to me when I was in school. Could I see, by a show of hands, how many people have been told that there is a wrong way and a right way to make a table? That was a lot more than I was expecting, actually, about 10 or 12 maybe. For the rest of you, this is super exciting, because I get to be that person who steps in and says: this is the way to do things.

There are a couple of rules for how to make a table that works well; there are actually about four of them, and we're going to focus on the first two. The first rule is that every row in your table should contain all of the data, and only the data, from one observation, and we'll see a concrete example of this in just a second. Rule number two is that every column in your table should contain all of, and only, one source of information. What we mean by this is that when I am recording a table, there is some distinct moment in time where I am recording a set of values that describe something. Maybe what I'm recording is some particular contest: in that contest I have this many participants, and it occurred in this location; these are all different sources of information. Everything about that contest goes into one row, and each one of those columns contains one of those different kinds of things: how many people were there, where did it occur.

I think the easiest way to see what this looks like is to see a table that is not following these rules, and then see, in Python, in pandas, how we can transform it into one that's a lot easier to work with. So let's imagine we have a table showing the relationship between your tutorial presenters and the number of pies they can eat in a hypothetical pie-eating contest. Here we have Dylan, we have Tom, and we have Joris, and let's say in contest number one I ate like 10 pies, Tom ate five, and Joris, say, three. This didn't actually happen, in case that's not clear; we are making up numbers. And in contest number two I got really sick, because I ate ten pies, and I didn't show up, so this one is missing; Tom had another strong showing, this time with six pies; Joris, though, skipped ahead to the lead in contest number two with nine, so congrats to Joris on winning the second one.

So here is a table, and we've all seen tables that look like this, and they're nice: in this format it is very, very human readable. The problem is that this is a little harder for a machine to understand, and the reason is that it's breaking these particular rules. If I want to subset this as a pandas data frame, say I want to find all the participants who have eaten more than, let's say, five pies in any contest, that's a lot of steps in pandas with this table formatted the way it is now.

So let's start with rule number one, that every row should contain all of the data from one observation, and only data from a single observation. Am I breaking that? Yeah, some of you are nodding, yes I am breaking this. What specifically is happening in this table that breaks that rule? Two contests per line, that's exactly right. If we think about me eating pies in one contest as one observation, we have snuck two different observations into this one particular line, and it's also true for Tom and for Joris.

We're also breaking rule number two, that each column should contain all of the data from one source of information, and only the data from one source of information; here we're actually breaking it in two separate ways, one of which is easy to see and one of which is really tricky unless you've seen this kind of thing before. This column right here is only people, this column here is only pies, this column here is only pies; so how many columns contain the "number of pies" source of information? Two. How many do we want? One. That's one of the ways we're breaking this rule. The other way is that there is actually a secret column in this table: you can't see it, but it's there; there is a second source of information that I'm recording, and I've hidden it. Time, exactly right. Time is another source of information that I'm recording, but I have put it up here as text in my column labels, which is absolutely terrible. Again, this is great for a human to read; we can all look at this and instantly know who won contest one, and we can instantly see whose scores are improving over time, but this is not going to be an easy table for a computer to work with. A computer would much rather see a table that looks like this.
Here I have three columns: the name of my contestant, the contest number, and the number of pies, where Dylan in contest one had 10, Tom in contest one had 5, Tom in contest two had 6, and so on. You will often hear these called the wide form and the long form, where the one on the left is the wide one and the one on the right is the long one, long because it goes sort of up and down like this. That's not to say the first one is bad, and in fact, even in Python there are some functions that will expect you to pass in data frames that look like that. So what we need to know is how to transform from one into the other and back, and we're going to do it using one of two strategies. The first one we're going to call reshaping; I don't know that the rest of the world groups it this way, I've made this up just now. The second one we're going to call transposing. In the first strategy, and this is the first function we'll look at, I'm going to do one of two things to my data frame: I'm going to melt it, and that's going to make it long, or, if I have a long one and I want a wide one, I'm going to pivot it, and that's going to make it wide.

Let's look at a concrete example, and let's put in code, in Python, the same thing we've just written on the screen up here. I'm going to put some empty cells at the top of this notebook, and together we're going to make a data frame that matches this one on the screen; there are two up here, and we want the one on the left-hand side to start with. I'm going to use the dictionary way of initializing a data frame with some data, which means we're going to have a name column, a contest one column, and a contest two column. For the names we have Dylan, Tom, and Joris; in contest one I had 10 pies, Tom ate 5, and Joris ate 3; in contest two I didn't eat any, Tom had six, and Joris had 9. So here on the screen we have a data frame that looks just like the one we wrote up on the whiteboard, and our task together is to go through the steps of transforming it into the one that looks more like the one on the right-hand side.

To do this we're going to use a function in pandas called melt. Melt is a function that lives in the main pandas namespace, so it's pd.melt, and the first thing we have to pass in is the data frame we've just created, the pies data frame. What melt is going to do is take some arbitrarily large number of columns in my data frame and smoosh them down into exactly two, where one of those columns contains all of the column names that were there before, and the other contains all the values that were below them. If we execute this right now, it's not going to do what we want: we end up with one column named variable and one named value, with all of my column names on the left-hand side and all the values associated with those columns on the right-hand side. We don't need to smush this whole table; the names are fine the way they are, so the next piece of information we pass in to this melt call is which variables I'm considering as IDs. This id_vars means: don't mess with these ones. I can put in a list of as many of these as I want; right now we've only got one column that we have to keep the way it is, and that's the name column. Now this is getting pretty close to the data frame that we wanted.
The shape is actually right, so I could say my work is done, but this looks kind of ugly. First off, variable is a terrible name for a variable, and value is also a terrible name for a variable, so we can fix both of those, and there are a couple more things we might want to do: personally, I would drop the NaN value from Dylan and contest two, and we might also want to use some of those fancy vectorized string methods we just saw to change this from being contest1, contest1, contest1, contest2 to just being 1, 1, 1, 2, 2. We'll do all of those things, because why not, I'm the one with the microphone.

The first thing we're going to do is specify our own variable name. By var_name we mean the name of the thing that is currently called variable; what does it represent? Which contest, sure, so we could call it contest. The next thing is to specify the name of the value column, which is the value_name argument; what do those values represent? The number of pies, very good, so we could say n_pies or just pies. Now this is already looking better: I can look at this table and make sense of what is there, just by virtue of having column names that make sense. We'll do one more thing here, and that is to add on a dropna call, and since I'm running out of width here, I will use that super fancy put-everything-inside-parens trick we just learned from Joris. So here I'm melting this down into a new data frame, and we're dropping the value for me and contest two, because I wasn't even there, and this looks pretty decent. Now we want to give it a name so we can do some work with it, and we'll call it long; so here is my long data frame.
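The whiteboard example as code, built with the dictionary initializer and then melted with the names, the dropna, and the wrap-in-parens trick just described.

```python
import numpy as np
import pandas as pd

pies = pd.DataFrame({
    "name": ["Dylan", "Tom", "Joris"],
    "contest1": [10, 5, 3],
    "contest2": [np.nan, 6, 9],
})

long = (pd.melt(pies,
                id_vars=["name"],        # leave the name column alone
                var_name="contest",      # instead of the default 'variable'
                value_name="pies")       # instead of the default 'value'
          .dropna())                     # Dylan skipped contest two
```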
Our next task is to fix up this contest column to remove the string data from it, and there are at least three different ways we could do this; we're going to skip to the third one, which is the best one, and that's to grab this contest variable and look at the vectorized string methods on it. In particular we want one called extract, and this is going to allow us to supply a regular expression pattern, a pattern that's going to let us pull out just the data we want, and specifically we just want anything that's a number. Inside of our regular expression string, the r in front of the opening quote stands for raw; if you work with regular expressions in Python you'll see this a lot. We're going to have parens to say I want to extract the whole group; inside the parens we're going to have square brackets, which is a character set; and inside the character set we're going to say I want to extract anything between the digits 0 and 9, so 0-9. After the closing bracket, but before the closing parenthesis, a plus: as many numbers in a row as you can get. If you want to impress your friends later, this is called a greedy operator; it's also a great band name. And we're going to do one more thing: set expand equal to False. Because pandas doesn't know how many sequences of numbers exist in this column, by default it passes back a data frame with one column for each possible number; we know there's only one number in each of these, so we're okay getting back just a Series. If we run this, here are those ones and those twos.

Now, I haven't done anything with this Series; it's just dumped out to the screen here, and because I haven't given it a name it's sort of disappearing into the heat noise of my computer. What we would like to do is reinsert it back into our data frame, and specifically we would like to overwrite the column called contest that already exists. We can do this with dictionary-like access, saying that the column called contest should be this thing I've just done all this chaining to, and when I look at my data frame now, this is just what I wanted; it's nice, and things work out.

Now that we've gone this way, we want to know how to get back the old table I used to have; say I'm printing it out to put on the bulletin board at work so I can brag to my colleagues about, in this case I guess, how I ate too many pies and got sick, but that's fine. The way we're going to do this is with a method defined on data frames in pandas called pivot. When I type pivot and hit tab, you'll notice there are two options that show up, one called pivot and one called pivot_table. There is a slight difference between the two: we're okay using pivot here; if you have data that you're flattening out into a table where you might need to aggregate, with something like "I want the mean of all these values", you would use pivot_table for that, but for us the regular pivot is fine. What we're going to supply is what the index of the new data frame should be, and what the columns of the new data frame should be, and it's going to fill in all the values there for me. In this case our index column before wasn't actually the index, it was just a regular column, but it was serving the same purpose: the names of my contestants. And the columns before were whether I'm at contest one or contest number two, so for columns here we have contest, c-o-n-t-e-s-t. When we execute this, here is something that is not quite exactly, but almost, the same as the table we were looking at before. Yes, so if you pass in a list, what you'll get is a multi-level index. Okay, so do we feel okay with this?
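A sketch of the extract-and-pivot steps; as in the walkthrough, pivot is called without values=, which is why the result carries the two-level column index discussed next.

```python
# Keep only the digits from 'contest1' / 'contest2', overwriting the column.
long["contest"] = long["contest"].str.extract(r"([0-9]+)", expand=False)

# Back to the wide layout: contestant names as the index, contests as columns.
wide = long.pivot(index="name", columns="contest")
```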
Sure. So regular expressions are their own whole language; if you deal with text data a lot, they're worth the day it takes to learn for the amount of time you get back when you do text processing. What this is doing is starting off with a different kind of Python string literal, the raw, or r, string literal, which is a string with no escaping behavior. That makes our lives easier when we use regular expressions, because the regular expression escape character is a backslash and Python's escape character is also a backslash, so to get a literal backslash into a regular expression without the raw notation you need extra backslashes in a row, some to escape Python and then one to escape the regular expression. In this case it's not really required, but it's a good habit to get into. When I'm extracting regular expression groups, a group is defined as all the stuff inside those parentheses, and inside the parentheses we're saying I want to match any character in this range: the square brackets represent a character set. Sometimes you'll see ones that look like lowercase a through z, or uppercase A through uppercase E; for us we want the numerals, 0 through 9, and for those of you who know regular expressions, we also could have used \d for this, but we won't worry about that for now. Then the plus here is that greedy operator that means grab as many in a row as you can, and what this does for us is that if we have, say, contest number 12, because these pie-eating contests go on for a long time, it will grab the 1 from the string and the 2 from the string and return them to us together as the number 12.

And I think there was one more hand, yes. That's a great question: the question was, when I do the .info() lookup on this table, I see there are actually two column indices, and that's correct. The one we care about is the innermost column index, which is contest number one and contest number two; the outermost level just says that these are all data about pies. In a less well-thought-out library this "pies" information would disappear into the ether, and I wouldn't know what these numbers are anymore, which would make it easier for me to make the case that the long data format is better; pandas here is letting us know that it remembered what these numbers actually represent, and it's storing that for us in the multi-level index. Unfortunately we're not going to have time to get into multi-level indexes much; there is a method for cross-sections that will let you extract by inner labels instead of outer labels, we could also just drop that outermost level, and we can point you to a place to read up on it in the documentation.

All right, so that was one strategy for making long tables that go down this way and wide tables that go across this way. There's another one, though, that depends on manipulating the index objects themselves, and these are two methods: one is called stack and the other is called unstack. In this particular example, where I have my now re-widened data frame, if I wanted to make it long again, in the long format, we could add on a call to stack. Now, I wouldn't leave the data frame like this, I would probably do some index resetting here, but to illustrate: before, I had this data frame with one and two as my column labels; when I call stack, that moves the column index from being a column index into being a row index, with the values repeated as necessary across the levels of the index that was already there. Unstack is just the opposite operation: it moves the innermost row index back up into the column index. Anything you can do with melt and pivot you can also do with stack and unstack, and whichever you use is whichever makes more sense to you. I feel very strongly that melting and pivoting is more intuitive, but I have a co-worker who feels just as strongly in the opposite direction, that stacking and unstacking makes more sense to him intuitively, so whichever way works for you is the way that you should go.
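The same round trip through the index objects, assuming the wide frame from the pivot above.

```python
stacked = wide.stack()      # innermost column level moves into the row index
stacked.unstack()           # and back again: innermost row level becomes columns

# In practice you would usually follow stack() with reset_index()
# to get ordinary columns back.
```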
Now, that being said, we have some time to go through some of the material in the notebook, so we're going to look at just one example of the complex stuff you can do with data when it's in the correct shape. We won't talk about this in detail, but running through it quickly: we can pull in some data that represents sporting events. These are basketball games that occurred a couple of years ago, and for each one of these games, each one of these contests, I have my home team, my away team, and the number of points scored by each. This data is already in the right format if the relationship I'm interested in is the contest, if that's my level of observation; if I want to know how many people were at that particular basketball game, this is perfect. If I want to generate summaries based on the teams, it's not so great: from a team's point of view, each observation should have who the opposing team was, whether they were home or away, how many points they scored, did they win. So we can use melt to get the data into a format where the team is the level of observation, and now that we've done this, we can do things like calculate how many days these teams spend on the road in between games, so two days between games, three days between games. We have some really cool stuff coming up on how you can use a library called Dask to do some crazy cool, super fast, out-of-core computations on data frames. Do we have any questions before we leave the tidy data section to move into the nifty stuff?

Yes. So, probably not, not quickly anyway, so I'll hop on the Slack channel and put up a link to the Wikipedia page on normalization, but there's about 50 years of research into how you make good tables. Principally, what they're concerned about is not necessarily the ease of subsetting, like we are here; they're more strongly concerned with things like update anomalies, so if I update data once it should update everywhere, which means only one piece of information stored relating to some set of attributes for a particular key, which is a bit of a mouthful. It has its own language, database stuff does, so sorry if that was disappointing; it's super cool stuff, though, and if you do database work you should look into it, because you definitely need to know it, and then you get all the jokes about normalization and about Codd. Other questions? Yes. It's okay, keep going, keep going, this one, now do the stack; yeah, that's a great question, it has more than one index, multiple indices, correct. The question was: when I look at this index I have Dylan and one, and that's fine, I have Joris and one, and that's fine, and then I get down here and there's a blank and then a two, so what's going on? The answer is that the blank is Joris; it's Joris repeated, just elided for nice presentation purposes, but this is definitely still data that belongs to Joris even though it's not labelled on the table that way; it's there in the index. All right, seeing no further questions, I give you once again Tom.

All righty, let's see. If you go back to the GitHub repository there is now a URL here for the cluster. If you go ahead and click on that link, it'll drop you into, I think I have this open, a JupyterHub login, where you'll be presented with a username and password box. You can put in whatever you want there, we don't actually check it, so put in whatever you want for your username and your password. Once again, this is the URL for the README and this is the URL for the cluster; I'm going to put that right here and make it markdown and a bit bigger. So if you go to this URL you'll be dropped into a JupyterHub login, and once you do that I want you to do two things: scroll down to where you see a cell called KubeCluster and run it.
I'm going to copy and paste this up here; all right, there's the URL again, and look for a cell that says KubeCluster. We've got to get some resources allocated for you, so I'll step back for a second and explain what's going on, but go ahead and go to this URL, log in with whatever credentials, then find the KubeCluster cell and execute those cells, and that will get you some workers spinning up in the background. You can just put in this URL; can you read it, is that big enough? There will be a green button that says "Start My Server"; go ahead and click that after you log in. Yes, so I'll talk about what's going on in the background; ah, I forgot to, yes, here, sorry, so that's why it's taking so long.

What's going on in the background is that a whole bunch of pods are spinning up on this Kubernetes cluster. It should come up here in a second; maybe the cluster is also being resized right now as we all log in, so I'll talk a bit in general about Dask while you're logging in. There should be pods coming up in the background, so if it's spinning, don't worry, it'll come up eventually, in theory.

Okay, so Dask. Dask is a library for parallel and distributed computing. Everything we've seen so far with pandas has been working with relatively small data sets, the beer reviews data set and its number of observations, but pandas is fundamentally an in-memory library, so you can't have a data frame that's larger than your machine's RAM. One of the questions we get as pandas developers is how do you scale pandas to large data sets, and one option is to use Dask. Dask is not just about pandas, it's about scaling Python in general, but pandas is one of the libraries that Dask scales.

Okay, so I'm going to start working through this notebook, and then when your pods come online, which is happening for some people, yep, I'll talk a bit more about what's going on, but after we've talked about Dask a bit. If your pod is still coming up, it should show up eventually; when it does, search for KubeCluster and run those two cells to get some workers. Oh, I should do that myself, so I get workers before you all take them. All righty, so run those two cells; look for KubeCluster.

Okay, so Dask has a few components to it. At its heart it's all about taking big problems, breaking them into smaller problems, and then running those small problems in parallel; that's the key idea at the heart of Dask. In this case we're going to use Dask DataFrame, and we're going to use from_pandas. Usually you won't be using this, because if you think about it, Dask is about scaling out to large data sets and pandas is about small, in-memory data sets, so usually you're not making a Dask DataFrame from pandas, but I just want to show the API it provides, and later on we'll see how to read in from a cloud file system.

Okay, so we have this df; it's a Dask DataFrame, and it kind of looks like pandas. It's a little different, but same idea: we have columns, we have an index, and we have some values that aren't shown right now; we'll talk about why in a second. But first let's see some operations. We just have some random data here, some strings and some floating-point numbers for b and c, and we do an operation with that data frame: select the column b and add it to another column c.
With pandas, what would this be? Adding two Series, you get back a Series, right. Same sort of thing with Dask, only instead of a pandas Series it's a Dask Series, and we still don't have any values here, so something a little weird is going on, but let's keep going for a bit. We'll select b and c and take a sum; again, if you have a pandas DataFrame and take a sum, what do you get back? We've collapsed the rows, and we get back a Series where the index is the original columns. Same sort of thing here, and again we don't have a value, which is getting even stranger, but let's keep going and see what's happening. We're selecting these two columns, we call sum, and Dask hasn't computed anything yet; it's building up this chain of operations, and we can look at that chain. I'm going to make this a bit smaller so you can see what's going on; in this case, this is not right, oh sorry, here, I'll just do .visualize(), sorry, I think I overwrote my graph. Okay, so you can see there's this getitem call, remember getitem is the square bracket stuff, so we're selecting some columns; then we're doing a reduction, and then another reduction, an aggregate. By writing this sum we've built up this series of steps to compute the actual value, which will be a Series, in parallel, and we can go ahead and compute it by using the compute method. So the .sum() result, remember, is a Dask Series, and compute actually runs those things in parallel and gets the concrete results. That's the idea behind Dask DataFrame; there's a similar thing for Dask Array, where you write numpy-like code, it builds up a task graph of things to do in parallel, and then runs it in parallel. Any questions on that? It's kind of weird right now, but we'll see an example of using it on some real data, which might help.

How are people doing on pods? A lot of running, and it's looking pretty good, I don't see any errors, so I'm happy to see that. Okay, so when you're ready for a concrete result, you go ahead and call compute.

Okay, so we've asked for a cluster. Dask is nice: you can use it on a single machine if you want. If you want to analyze a data set that's on your hard drive but larger than memory, Dask can take care of doing these kinds of computations out of core, so you don't read all of the data in at once; Dask just reads the chunks it needs to do the actual computation. So you can analyze tens, twenties, up into hundreds of gigabytes of data on your laptop, in parallel, just out of core; Dask is able to do that. It's also able to do distributed computations: all this task-graph building works well for out-of-core and also works well for distributed computing.

When you run this command, let me make this bigger again, sorry about that, asking for a cluster, it sets up a Dask cluster for you, and you'll have eight machines ready to do your work. They're kind of small machines; each machine has two cores and some number of gigs of memory, so in total our Dask cluster has, in this case, 16 cores and 48 gigs of memory. Not huge, but you can have bigger machines and many, many more workers to do your work. Then we'll go ahead and connect our client; this came from Dask as well, and I'm not going to talk about it too much, just so we can get back to pandas-like stuff, but it's there. I'm also going to skip over this; it's just some text about things that are different when you're working distributed.
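Pulling the from_pandas demo together as a minimal sketch before moving on to real data; the column names and values are made up, and the graph drawing assumes graphviz is installed.

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": list("abcd") * 25,
                    "b": range(100),
                    "c": range(100)})
ddf = dd.from_pandas(pdf, npartitions=4)   # normally you would not start from pandas

ddf["b"] + ddf["c"]            # a Dask Series, no values computed yet
lazy = ddf[["b", "c"]].sum()   # still lazy: a graph of getitem plus reductions
lazy.visualize()               # draw the task graph (needs graphviz installed)
lazy.compute()                 # run it in parallel and get a concrete pandas Series
```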
This is a pandas course, though, so let's go back to pandas stuff. pandas has read_csv; Dask DataFrame has something similar, dd.read_csv. In this case we're going to be reading from Google Cloud Storage, which is like S3, some public data that you can download, and the main thing that's different here is that we pass in a glob string. With pandas read_csv you pass in a single file; Dask is all about working on large data sets, so you pass in a glob of files, and we can go ahead and look at that. One thing I would recommend is opening up the dashboard: below your cluster there's a link to a dashboard that you'll want to have open, and hopefully mine comes up here.

Okay, so we've done this dd.read_csv. If you do this you'll notice it finishes pretty quickly, basically instantly; we have not read in all the data, remember Dask is about lazily building up these graphs of stuff to do in parallel. We can take a look at the first few rows with head, and you'll see some stuff happen over here, I'll talk about what's going on in a second, but we've read in the first five rows of data. This is similar to the data set we saw before, with a few more columns, so a bit larger, and it spans a longer time period; before we saw just 2014 or 2015.

All right, the first thing we'll do is persist our data in distributed memory. Again, this is kind of like reading stuff into memory with pandas, except in this case we have a cluster's worth of memory where we can keep our data. So I'm going to call df = df.persist(), and what that does is kick off a computation where, in the background, all of our workers, eight in this case, start doing stuff. This isn't something that's available in pandas, because it doesn't make sense: pandas data is already in memory.

So what's going on on the side here? Vertically, on this main task stream plot, we have each of the cores of our workers, so 16 bars, and each of these colored blocks is an individual task; think of it like a function. If you think about reading a CSV from some cloud file system, you're reading bytes off the network, you're parsing those bytes into some actual thing, and in this case pandas is doing the actual parsing for Dask, and then constructing a data frame; that's basically what we're doing. So we've just read this 12 or so gigs worth of data into distributed memory, and one thing to point out is that each of our workers has 4 or 8 gigs of RAM, I can't remember which, so this data set is larger than any individual worker's RAM; it's sitting in our combined distributed memory.

Once you've done that, things will feel a lot like pandas, and that's what we want to do for the rest of this notebook: just play around with pandas-like stuff on these large data sets. As an example, we can do something like: give me the number of non-cancelled flights. Cancelled is a boolean, and I want to filter out the ones that are true or not true, so I take that Cancelled column and then count it; you could count it, you could sum it, do whatever. Again we're building up this lazy evaluation; we could try to visualize that, it might be a bit big, and then you compute it.
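A sketch of the cloud read, the persist, and the non-cancelled count; the bucket path is a placeholder, not the tutorial's real location, and Cancelled is assumed to be a boolean column in the airline data.

```python
import dask.dataframe as dd

# Glob of CSVs on Google Cloud Storage (placeholder path; needs gcsfs installed).
df = dd.read_csv("gcs://some-public-bucket/airline/*.csv")

df.head()           # reads just enough data to show the first rows
df = df.persist()   # load the whole data set into the cluster's distributed memory

# Number of non-cancelled flights: summing a boolean column counts the Trues.
(~df["Cancelled"]).sum().compute()
```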
If you think about what pandas — and what Dask — is doing here, it's doing a whole bunch of indexing, a whole bunch of boolean filtering, and then some counting. In red here we have communication between workers: we do these small counts on each of our workers and then communicate those to a single machine, which sums the intermediate counts. If you think about how you'd do a count on many machines, that's how you would do it. All right, so we've got about 10 minutes or so. I forgot to comment out this exercise — ignore this. Try to get the average departure delay for each airport. Okay, when you see "for each", what do you immediately turn to if you're doing this with pandas? Group by, right? So give that a shot, we'll check in in about 5 minutes, and then we'll take a look at a bit of time series stuff. If you have trouble making a cluster, it could be that you have a space in your username — I'm not sure, someone else had an issue with that. If you do, you can go to just the IP URL /hub/home, which will let you log out, and then you can log back in with a different username without a space.

Okay, so if we were doing this with pandas and we want the average departure delay per airport, we'll say df.groupby — it's just Origin — and then it's going to be DepDelay. Again, this is just like pandas; we'll take the average with mean, again just like pandas, only we haven't computed anything yet. Now that we're ready for the result, we can go ahead and say .compute(), and that kicks off the computation — it's pretty to look at. Some notable things: there's communication here, because to get that result we have to move data between workers, which is kind of expensive on a distributed machine, so it's worth highlighting. But that's our result: doing a computation on a larger-than-memory data set with pandas-like syntax, and pretty quickly. I think that's pretty cool.

Okay, we don't really have time to go into details, but if you're interested you can go to dask.pydata.org; it has links to various notebooks so you can try out Dask on your own little cluster. There are some details to working in a distributed environment that are interesting — things like looking up a key can be very expensive depending on how your data is laid out; that's what this section goes into. We're not really going to go into it, but I'm going to run this cell here and call out set_index. I don't think we saw that in the tutorial, but set_index moves a column into the index — and that's all it does in pandas. With Dask it kicks off a bunch more computation. I think my cluster is too small; I should have asked for more workers — we're getting a bit close to the memory limit here, but that's okay. With Dask, set_index does more: it actually triggers a sort — a partial sort, anyway — and now Dask knows a bit about the divisions. We kind of skipped over the dask.dataframe data model (that's hard to say), but Dask is about breaking large problems into small ones, so a Dask data frame is made up of a bunch of pandas data frames, and after we've done the set_index, Dask knows the start and end point of each of these partitions. So for all of the pandas data frames that make up this Dask data frame, we know the start and the end.
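Roughly, the group-by and the set_index step might look like the sketch below — again assuming the column names Origin, DepDelay, and Date, which are stand-ins rather than confirmed names from the notebook:

```python
# Average departure delay per origin airport -- pandas-style, but lazy.
avg_delay = df.groupby("Origin").DepDelay.mean()
avg_delay.compute()   # kicks off the distributed computation

# set_index moves a column into the index. With Dask this triggers a
# (partial) sort, after which Dask knows the divisions -- the start and
# end value of every underlying pandas partition.
df = df.set_index("Date").persist()
df.divisions[:5]      # the first few partition boundaries
```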
Now we can do things like: give me all the flights for 1990-02-10 — or let's say the 15th, all the flights on February 15th, 1990 — and this operation will be quite quick to compute, because Dask knows exactly where this data is, thanks to that set_index. If we hadn't done that, we would have to do something like df[df.Date == "1990-02-15"], and if you think about it, when this is just a regular column in our data frame, to do this computation we have to look through the entire data set, which takes a while when it's distributed. All right, and just for fun — the Dask data frame supports a lot of the pandas API. We didn't get to resample, but there's a notebook on time series which I would highly encourage you to check out, and dask.dataframe does some of that as well. Resample is like a group by, but for time series data: group this into months, take the average, fill forward, compute that, and plot it (see the short code sketch below). This would be the monthly average departure delay, and you can see there's maybe some monthly stuff going on, but also perhaps a longer-term trend towards more delays. Okay, I think with that — you're of course welcome to keep playing around on the cluster, but otherwise I think we're out of time, so thank you very much.

[Audience question] Yeah, so just a brief bit of background if you care: Dask works with a bunch of different resource managers, and this one is working with Google Cloud Platform's Kubernetes Engine. We have a Kubernetes cluster set up with a whole bunch of machines — I'm not sure how many right now — but that's basically what's going on: there's a bunch of virtual machines and then a Kubernetes application. If you're interested, you can go to pangeo.pydata.org; this is a group that developed this whole workflow for working with large datasets on the cloud, and that's what we're using here for the JupyterHub and some of the other stuff we've built out.

[Audience question] Yeah, so it's both — in this case all of this computation is happening in parallel on all of these processors. It's all explicit; there's nothing special happening at the processor level. [Follow-up] Yeah, gotcha — you're saying the vectorized operations that Intel might be parallelizing: same thing here. If pandas runs faster because of something Intel is doing, then Dask will as well. All right, any other questions? [Audience question] Yes, Dask works with Kubernetes, YARN, and all the HPC systems like Slurm — a whole bunch of them. All right, well, thanks to Dillon and Joris. [Applause]
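To pin down the two patterns mentioned above — the fast .loc lookup after set_index and the monthly resample — here is a rough sketch under the same assumed column names (Date as the index, DepDelay as the delay column):

```python
# With "Date" as the sorted index, Dask only touches the partitions whose
# divisions cover that day, so this selection is fast:
feb_15 = df.loc["1990-02-15"].compute()

# If "Date" were still an ordinary column, the equivalent boolean filter
# would have to scan every partition of the whole data set:
#     df[df.Date == "1990-02-15"].compute()

# Monthly average departure delay: resample is like a group-by over time.
monthly = df.DepDelay.resample("1M").mean().ffill()
monthly.compute().plot()
```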
Info
Channel: Enthought
Views: 34,527
Rating: 4.9271255 out of 5
Keywords: python, Pandas, Data Analysis, SciPy 2018, Python Training
Id: lkLl_QKLgcA
Length: 172min 3sec (10323 seconds)
Published: Tue Jul 17 2018