James Powell - PyData 2021 Talk "How to Be a Pandas Expert"

Captions
Hello everyone, my name is James Powell, and this is my PyData 2021 presentation. Today is Thursday, October 28th, and this is a pre-recorded presentation. So without further ado, I present to you: How to Be a Pandas Expert.

When we talk about developing expertise in a particular tool, it is generally not the case that expertise is gained through studious memorization of an API or of details, but instead through judicious understanding, judicious discovery, of the core concepts, the fundamental intuition, underlying the tool. When it comes to a tool like pandas, the API is quite large: there are many different methods, many different data types, and a lot of detail. I hope you'll see in this presentation that there is just one core concept that brings together almost all of the pandas API and all of these details, and that core concept is the index and index alignment. Through this presentation we'll try to motivate what the index is all about, why it's so important, and why it's so intimately related to everything going on in pandas, and we'll do that via some live coding. So I will share my screen with you, I'll hide in the corner here, and we'll jump right into what we want to cover. We're going to go through these notes in a live-coded fashion, and bear in mind these notes will be available to you after the presentation, in both code form and PDF form.

But to get us started, let's talk about why pandas in the first place. Why do we even bother with pandas? Because in my opinion pandas is a very one-dimensional tool: there's basically only one reason to use it (and there's a pun in there, too). If you think about all the different reasons you might have landed on pandas, a lot of reasons come up, but they're not very good reasons.

For example, maybe you want to use pandas because you like the Python dictionary and the Python list, and you're looking for something that's even less convenient and even more perplexing. Here we can see this thing called a pandas Series, and it prints out a little differently than the list, and we can iterate over its contents using .iteritems(), but we get that 0, 1, 2, 3, 4 numbering of the rows; I don't know why that's useful. So maybe we'll just unpack that into a variable we don't use, or we could iterate over it directly, but I don't really see what this is getting us over a list. Or, more likely, if we're using pandas we're familiar with the pandas DataFrame, which kind of looks like a dictionary of Python lists, or a bunch of lists stacked next to each other, or some sort of matrix. We can iterate over this thing too, but when we do, we get something really weird: we get the column names. How is that useful to anybody? So I guess we could do .iterrows(), but I have no idea what this is giving me: it's giving me the row index and each one of the columns. Or .iteritems(), which is kind of like a dictionary, right? That gives me each of the columns independently; I guess I can throw away the column name and then I get the columns, but I don't know why somebody would want that. Or I could do .itertuples(), and there I'm starting to get close to something I can deal with: tuples, lists, dictionaries. I understand those things.
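To make those iteration behaviors concrete, here's a minimal sketch with made-up values; note that .iteritems() was deprecated and later removed in favor of .items() in recent pandas versions:

```python
import pandas as pd

s = pd.Series([10, 20, 30])
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# iterating a Series yields (index label, value) pairs
for label, value in s.items():       # .iteritems() in older pandas
    print(label, value)

# iterating a DataFrame directly yields just the column names
for col in df:
    print(col)

# .iterrows() yields (index label, row-as-Series) pairs
for label, row in df.iterrows():
    print(label, row['a'], row['b'])

# .items() yields (column name, column-as-Series) pairs
for name, column in df.items():
    print(name, list(column))

# .itertuples() yields one namedtuple per row
for row in df.itertuples():
    print(row.Index, row.a, row.b)
```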
Because if you compare something like a pandas DataFrame or a pandas Series to the built-in data types in Python, the Python list, the Python set, the Python dictionary, I'd say in terms of sheer straightforwardness the built-in types really win out. Here we have a Python list, and it's just a bunch of numbers in some structure that we can iterate over, and we can get each of the individual numbers and do something with them. Here we have a Python dictionary, and it's just some sort of key-value pairing, some labeled set of values, and we can iterate over each of the labels, each of the keys, or do it explicitly with d.keys(), or go over each of the individual values, or go over the pairing of keys and values. That seems pretty straightforward, and a very simple API.

So if we think about why we might use pandas, clearly it's not for the strange namings of all these things, .itertuples(), .iterrows(). Maybe it's for the really bizarre errors that come up when we use pandas. Here we have a pandas Series, which seems to be just a couple of numeric values, and here we have a pandas DataFrame, which seems to be three sequences of numeric values, 0, 1, and 2. We can multiply these, but we get a bunch of NaNs on the end. How is that useful? Maybe I'm multiplying in the wrong way, kind of like how matrix multiplication isn't commutative, but here we can see it doesn't actually matter which direction I multiply: s times df1, df1 times s, makes no difference. Maybe I'll introduce another DataFrame into the story and see if anything else happens. If I add it to the first DataFrame, I get a bunch of NaNs at the beginning. Well, no big deal, at least I can just drop the NaNs; I guess that's what I do all the time with pandas, just dropna, because they're popping up all over the place. And I could pretty much join these, and I get a bunch of NaNs there too; I guess I could drop them. Let me introduce another DataFrame, which doesn't look altogether too dissimilar from the second DataFrame we were looking at, and let me try to join these. When I join them, it says: cannot join with no overlapping index names. Hold on a second, joining should just be like stacking them next to each other, kind of like what that plus operation seemed to be trying to do, so why is it telling me "no overlapping index names"? What the heck does that even mean? And if we're unlucky enough to ask one of our co-workers why pandas is giving us this error, they will helpfully tell us: oh, you've got to rename_axis, just rename the axis to something on both sides, and then the join will work as you expect. And you'll say: what does that even mean? How is that helpful? That's total nonsense. rename_axis, my goodness.

Compare that to our built-in data types: the list just makes sense. Here we have a list called xs, here we have a list called ys, and these represent some opaque collection of items that we iterate over. So if we add them together, that's some sort of sequence-level operation, or rather some sort of container-level operation: it concatenates them. If we want some special syntax for this, we can unpack these lists into a list literal, so we can concatenate them in a different fashion. And if we actually want to add them up, lining them up, we just write an explicit for loop, and we can see exactly what's happening: oh, you're zipping the xs and the ys, taking each pair of values, and adding up those pairs. That just makes sense. There's no rename_axis in this, my goodness.
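A minimal sketch of that contrast, with made-up values:

```python
xs = [1, 2, 3]
ys = [4, 5, 6]

# container-level: addition concatenates opaque collections
print(xs + ys)          # [1, 2, 3, 4, 5, 6]
print([*xs, *ys])       # same concatenation via unpacking syntax

# element-wise addition is spelled out explicitly, so it's obvious
print([x + y for x, y in zip(xs, ys)])   # [5, 7, 9]
```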
And the dictionary also makes a whole lot of sense. Here we have a dictionary with some key-value pairings, and another dictionary, and we can use the unpacking syntax to merge them. Yeah, merging dictionaries: you take all the key-value pairings of one and all the key-value pairings of the other, and there's some order of precedence if there happens to be an overlap. If we happen to be using Python 3.9, we can even do this with just the pipe syntax. And if we want to do some sort of arithmetic operation, we can do that explicitly: take the keys of one into a set, take the keys of the other into a set, find the set union, look up the values, and create a new dictionary with the keys and the values obtained from one side or the other. If we don't see one of these keys on one of the two sides we're adding, we can substitute a zero value using dict.get.

Even if we go into the Python standard library and look at the other collection types it provides, they just make sense. collections.Counter totally makes sense: it's kind of like a dictionary, except it specifies that the values have to be some sort of integers, some sort of numeric value, some counts. So if we have two Counters, we can add them together, and I guess that makes sense: you add the corresponding counts, so if you have three a's in one and four a's in the other, you have seven a's in the result. We can even find the intersection or the union of the counts, and we might scratch our heads a little and say, okay, that's stretching it a bit, what does that actually mean? But it actually kind of makes sense, because if you had two bags of things you had counted, sometimes you want to say: what's the minimum I can rely on having in either bag, or what's the maximum I can rely on having across the two bags? And if we think a little, these map to our intersection and our union, our ampersand and our pipe operators. So that doesn't quite give us a good motivation for why we might want to use pandas.
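A small sketch of those dictionary and Counter operations, with made-up values:

```python
from collections import Counter

d1 = {'a': 1, 'b': 2}
d2 = {'b': 10, 'c': 20}

# merging: later pairings win on overlapping keys
merged = {**d1, **d2}        # {'a': 1, 'b': 10, 'c': 20}
merged = d1 | d2             # same thing on Python 3.9+

# explicit key-aligned addition, substituting 0 for missing keys
summed = {k: d1.get(k, 0) + d2.get(k, 0) for k in d1.keys() | d2.keys()}

c1, c2 = Counter('aaab'), Counter('aaaac')
print(c1 + c2)   # adds corresponding counts: seven a's
print(c1 & c2)   # intersection: the minimum of each count
print(c1 | c2)   # union: the maximum of each count
```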
Maybe it's that pandas is so frustrating we end up doing a .values all over the place, just to get back to, say, a NumPy ndarray, something we know how to deal with. For example, here we have a Series that seems to contain some numeric values, and a DataFrame that contains some other numeric values, and we can multiply these and we get all NaNs. What is going on here? If I dropna here, that's not going to help me at all, because, oh my goodness, I'll just drop the whole thing. Okay, let me see what happens if I take the DataFrame I started with and another DataFrame and add them together. These two DataFrames seem to be about the same size, but when I add them I get all NaNs as well. Oh boy, this dropna trick is not going to help, because I'll have no data left over. So instead maybe I'll just throw away the pandas parts, just do the .values here, and I guess I could do the same with my s.values and my df1.values, except: broadcasting error. Oh geez, even in the NumPy universe our life isn't that easy. And if we ask our helpful co-worker, they'll say: oh, you've got to use np.newaxis, you've got to add an axis here to satisfy the broadcasting rules, and you say: that is just not helpful, you're speaking a different language, I have no idea what you're talking about. And if you do manage to coerce this to work, and you .values your way to something that actually gives you the answer you want, but you still want that DataFrame for whatever reason, you can always take the result and stick it back into a DataFrame. I guess that's not too bad, but it doesn't seem like a very compelling reason to use pandas.

Maybe it's that, in addition to that .values to basically run away from pandas, we can do a .reset_index to kind of stay within pandas and throw away whatever that index is, because that seems to be the source of our problems; maybe that's the compelling reason, that we can just reset_index our way to success. For example, if we have a DataFrame that looks like this, and we want to group by that a column, and that a column happens to be unique values, so we should see exactly as many groups as there are rows, we can sum all the values, which in this case isn't going to do much; it will sort by the a values. And if we take that and, say, add it to another DataFrame that looks like this, well, we can see that again we end up with a bunch of NaN values except for a tiny little sliver. dropna is not going to help, but you know what, I don't have to do a .values to make this work: I can at least reset the index on one side, and then it gives me something that kind of looks like what I want. Or in the groupby I might be able to pass a keyword argument to tell it: don't set the index.

And maybe it's that pandas is so unpredictable in terms of what it gives us when we perform an operation. Good tools should be guessable: we should be able to write some code and have a sense for what it's going to do before we run it. With pandas that doesn't seem to be the case, because even with something simple, like trying to figure out how many rows I get when I perform an operation on two pandas objects, I can't guess the answer. If I have these two Series, which both contain four values, and I sum them, okay, the result has four values, that's not too bad. But if I swap them out for these two Series, which still have four values each, I get six values in the result. What? There are four values and four values that I'm adding together, and I'm getting six values out. And if I swap in these two Series, which also have four values each, I get eight values in the result. My goodness, I can't even predict how many values I'm going to get from adding two things together. That would never happen with the Python dictionary or the Python list: I can explicitly state how I want to combine the two structures, and I can see immediately from that code, which might be a little clumsy, exactly what's going on. So this is a pretty terrible reason to want to use pandas; we have very little predictability over what's going on. Well, maybe what I might do is just reset_index myself to success here: I'll take this result and reset_index, and I end up with something that kind of makes sense, except having that extra column called index isn't very useful. So even better than reset_index, I'll reset_index and throw away the index completely, just throw it in the garbage: not helpful at all, you're really not adding anything to my life.
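A minimal sketch of those two escape hatches, with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': list('xyz'), 'b': [1, 2, 3]})

# grouping moves the grouping column into the index...
print(df.groupby('a').sum())

# ...which we can throw away afterwards, keeping or dropping the column
print(df.groupby('a').sum().reset_index())
print(df.groupby('a').sum().reset_index(drop=True))

# or we can ask groupby not to set the index in the first place
print(df.groupby('a', as_index=False).sum())
```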
Maybe the reason we use pandas is that it makes very weird and incomprehensible distinctions in its library. For example, say we have a DataFrame and we want to do a groupby, and we want to group by this column here, which is no longer unique values but just two values, true and false: we want to find all the places where a is true and all the places where a is false. When I group by and do a sum, that works. But if we want to do another operation, say the kurtosis, that's not built into the groupby; we can't do a groupby .kurt. So we might do a groupby .transform, and that seems to give me this result with eight rows; what on earth is going on here? Or maybe we'll do a .apply, and that gives me a and b, like that; that's bizarre. And maybe eventually we'll come back to doing a .agg, and that seems to kind of give me the right thing. But why on earth did they create .transform, .apply, and .agg? Why not just have one? Why not make this easy for me? And what on earth does this even mean?
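A small sketch of the distinction those three methods draw, with made-up data (using mean rather than kurtosis to keep it short):

```python
import pandas as pd

df = pd.DataFrame({'a': [True, True, False, False],
                   'b': [1.0, 2.0, 3.0, 4.0]})
g = df.groupby('a')['b']

# .agg: one row per group, a reduction
print(g.agg('mean'))

# .transform: one row per original row, the group value broadcast back
print(g.transform('mean'))

# .apply: an arbitrary per-group function; the shape of the result
# depends on what the function returns
print(g.apply(lambda s: s - s.mean()))
```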
Maybe the reason we use pandas is that it's full of minor conveniences that allow us to eliminate more or less one line of code each. For example, if we have a Series, we could take that Series and shift it by one value, or shift it and subtract it from itself, or do Series.sum, or Series.mean, or Series.skew and Series.kurtosis. Between you and me, I don't even know what skew or kurtosis means; sure, kurtosis is the scaled fourth moment of a distribution, but I have no clue what that actually means. I do know that I could have just imported these from scipy.stats, used a NumPy ndarray, and used indexing on the ndarray to chop off the first element (if I'm going to drop the NaN anyway, what's the difference?), and I could have just done a subtraction. That seems much clearer: subtract these elements, removing the first element from one side and going up to just before the last element on the other. And for the rest of these operations, .sum and .mean are already provided by NumPy, and for skew and kurtosis, great, I got to eliminate one scipy.stats import. That doesn't seem like a very compelling reason to choose pandas.

Maybe it's that we have this DataFrame thing in pandas, which seems to be the ability to work with grouped data sets, and it only costs us about 250,000 lines of code of complexity in the pandas library to have pretty much a dictionary of NumPy ndarrays. Here is a DataFrame like many of the DataFrames you've seen before, and I guess I can do operations down the DataFrame, like computing the column sums, and that would probably take me as many as two or three lines of code, maybe a dictionary comprehension, if I wanted to do it in pure NumPy with a dictionary. And I guess I can sum across the columns, and maybe that takes me just one more line of code. And for all this complexity I'm removing: groupby? There's a groupby in the itertools module, isn't there? All I have to pay is 250,000 lines of additional code complexity in one of my dependencies; that's not such a high price.

And by the way, 250,000 lines of code? Yeah. If we take the pandas source code, take every single file in the pandas directory of that source distribution, and put the results of executing wc -l on each file into a pandas Series, we can see every file in the pandas source distribution and how many lines of code are in each one. Now, I don't care about all those files, because some of them are tests, so maybe I'll index this thing and remove every entry that's from a testing directory or the tests directory. That gives me a mask, and I'll take that mask and do a .loc operation to pick out just the files that don't belong to a testing or tests directory; that's what's remaining. Then I'll take that result and group it by the parent-most directory and the suffix of the file, so I get something like the following, where I have every file location and its suffix, and I'll unstack this, filling with zeros where there doesn't happen to be anything, so I can get a sense for what's there. Here I have all the .c files (boy, that's quite a few lines of .c code), and look, .pxd code, .pyx code. But if I look at that .py column: 86,000 lines of code in core, and 69,000 lines of code in libs. And if I sum this, it gives me the sum on a per-file-type basis: 194,000 lines of Python code. And if I sum it all up, I see this is a lot of complexity that I'm paying for.

And what am I paying that complexity for? A bunch of code that makes no sense, that I'm struggling with all the time? That does not seem like a good bet. Because here's the thing: here's my DataFrame, and I want to update it, something I do all the time with Python lists, Python dictionaries, NumPy ndarrays. So I'll do something very simple: go into that a column, go into the zeroth row of that a column, and multiply that value by ten, or maybe by a thousand, or ten thousand. And it seems to work, except if my DataFrame is very slightly different, with just one more column, I get a SettingWithCopyWarning. 250,000 lines of code and I still have no idea what's going on. This does not seem like a compelling reason to use pandas.

So our goal here is to figure out how to become a pandas expert by understanding the core concepts, and I think the core idea is understanding why we use pandas in the first place. And so, since we are not here to talk about NumPy, let's talk about NumPy. Why do we even bother with NumPy in the first place? Well, that makes sense: the Python list is slow, NumPy is fast. As part of our conceit here, we have a simple little timer, and we'll time some code: if I sleep for one second, it takes a little more than a second to run, so you can see our timer is not a great timer, but it's approximately a decent timer. If I use pure Python, the Python list I'm familiar with, and I create two Python lists of size one hundred thousand, it takes about half a second; if I do the same thing with the NumPy API, it takes about a hundred times less. Okay, that's a benefit: fast code is good code. And if I take these and compute their dot product, I can see it takes about 0.02 seconds to do the dot product in pure Python, and I can totally understand what's going on here: take each of the values of xs, each of the values of ys, pair them up, multiply them, and sum the result. That's a dot product. But if I do the same thing with NumPy, I get an incredible improvement in the speed: that's 70 times faster. A 70-times speedup on some code, that's definitely worth it.
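A rough sketch of that timing experiment; exact numbers will vary by machine, and time.perf_counter stands in for whatever timer the talk's notes use:

```python
import time
import numpy as np

n = 100_000
xs, ys = list(range(n)), list(range(n))
axs, ays = np.arange(n), np.arange(n)

t0 = time.perf_counter()
dot_py = sum(x * y for x, y in zip(xs, ys))       # pure-python dot product
t1 = time.perf_counter()
dot_np = axs @ ays                                # numpy dot product
t2 = time.perf_counter()
dot_mixed = sum(x * y for x, y in zip(axs, ays))  # python loop over numpy data
t3 = time.perf_counter()

# crossing the boundary of the restricted computation domain (the mixed
# version) is typically slower than staying on either side of it
print(f'pure python: {t1 - t0:.4f}s')
print(f'numpy:       {t2 - t1:.4f}s')
print(f'mixed:       {t3 - t2:.4f}s')
```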
And one of the core ideas I have when I think about NumPy, when I try to motivate why I use NumPy, is that NumPy provides us with the ability to do numeric operations faster because NumPy is a restricted computation domain. Namely, it's the ability to put a manager class around some Python data, to intermediate between the Python layer of our code and some code whose implementation is perhaps in C or Fortran. Because we have that implementation barrier, we can do things like unbox values, make them contiguous, have exact control of memory, and eliminate dynamic dispatch, and as a consequence we're getting 100-times speedups, 80-times speedups, 70-times speedups: a significant increase in the speed of the code. The reason we should think of NumPy in terms of this restricted computation domain idea is that it is a domain, and we have to stay inside that domain or we lose all that performance. If I take my pure-Python dot product and apply it to my NumPy data, crossing the boundary of that domain, it's actually slower than if I had done it all in pure Python. This restricted computation domain idea is very important because it gives us the fundamental motivation for why we stay within NumPy, and why we stay within pandas: it is a domain which intermediates between the pure-Python level and some implementation level. As long as you stay inside it, everything is fine, but if you cross that boundary, you're incurring a number of additional costs that are worse than if you had just stayed on one side or the other. It's a domain that imposes certain restrictions on you to allow you to do computations faster.

In fact, if we think about what NumPy really is, it's really just an interpreted view of raw memory. Here we have a NumPy ndarray, and we can dig into it and see that this is actually raw memory at some location, memory that we're interpreting in some fashion: we're saying some block of memory at this location contains int64 values, and we interpret it to be three values in a one-dimensional structure, just three values, and we have some mechanism, a striding mechanism, by which we can move through these values in constant time. All of these pieces fit together when we think about NumPy.

It's also that NumPy provides us with something that's missing from Python: a vector/matrix mathematical type. In other words, our Python list is some sort of opaque collection of items, so when we add two lists together, because they're opaque collections, just bags of stuff that we iterate over, the addition of the two should just be all the stuff from both: it should just concatenate. And if we multiply a list, we're just saying: give me that bag of stuff three times over, repeat this thing. Whereas if we do the same on a NumPy ndarray, we get what we expect to be mathematical operations, vector multiplication, vector addition. We're saying: this is not opaque, we know what's inside it, it's numeric values; when you add these two together, match them up and add them element-wise; when you multiply, go to each value and multiply element-wise. Okay, that kind of makes sense. And if we think about what the Python list provides versus what the NumPy ndarray provides, the ndarray provides us with some sort of fixed-size, dynamic-shape, higher-dimensional structure.
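A minimal sketch of both halves of that idea: container semantics versus mathematical semantics, and the ndarray as an interpreted view of memory:

```python
import numpy as np

xs = [1, 2, 3]
arr = np.array([1, 2, 3])

print(xs + xs)       # container semantics: concatenation
print(xs * 3)        # container semantics: repetition
print(arr + arr)     # mathematical semantics: element-wise addition
print(arr * 3)       # mathematical semantics: element-wise multiplication

# the ndarray is a raw block of memory plus an interpretation of it
print(arr.__array_interface__['data'])    # (address of the raw block, flag)
print(arr.dtype, arr.shape, arr.strides)  # how that block is interpreted
```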
And it is important that we talk about dimensionality, because this is related to one of the core concepts we have to understand about pandas: pandas is one-dimensional data. Even though the documentation says two-dimensional data, it isn't really; it's really aligned one-dimensional data. Let's think about what that means. Here we have a Python list, and a lot of people would say: oh, this is a two-dimensional list, and the reason they'd say so is that if I want to access any one individual element, I need two coordinates. You could say two coordinates to access something makes it two-dimensional, but in fact this is not really two-dimensional, because while I can access just one row, I can't access one column natively, without performing some sort of looping; there's no native, first-class way to access one column. So yes, there are two coordinates for every value, but it's not a two-dimensional list: it is a nested list. It's not that I'm providing two coordinates at once; I'm providing one coordinate that gives me a nested structure, and then providing another coordinate. This is two operations, not one: each layer of that list takes only one coordinate, and this is doing two lookups on two lists.

Additionally, because it's a nested structure, something like this makes total sense in Python: here is a list that contains lists, and one of those lists contains lists which contain lists. If we think about what this contains, we can get one value this way and one value this way, and we can say: well, this doesn't have uniform dimensionality; some of the rows of this have additional dimensions. That doesn't make sense mathematically, because when we think about the dimensionality of a structure, dimensionality tends to be a property of the structure as a whole, and it tends to be a scalar: everything is three-dimensional, everything can be accessed via three coordinates. This is nested data, data where, depending on where you are in the structure, the lookup may require a different number of coordinates than at another layer. So it's not quite a two-dimensional structure; in fact, a Python list is never two-dimensional data. It is always one-dimensional data that just has the ability to nest. Note this structure here: a completely reasonable, completely valid Python list, with no mathematical meaning, impossible to represent in NumPy. What does this even mean mathematically? It's a matrix with an element sticking off the side of one edge, nested like a hypercube in one part but not the rest. No actual mathematical meaning.

Whereas if we think about what a Python list gives us that a NumPy ndarray doesn't: the Python list gives us the ability to update the size; there's an indirection between the references and the actual storage of the data. I can take a list and append items to the end, and it just works, and if I have two references to that list, both of them see the update. Whereas with a NumPy ndarray, sure, I can do an hstack, but if I have two references to an ndarray, that hstack is not updating the ndarray in place, because an ndarray is an interpreted view of memory. And because it's an interpreted view of memory, a raw block of memory somewhere with an interpretive layer on top, I can't just resize that raw block of memory: who knows what might be behind it?
The only thing I can do is reallocate new memory and copy things over, and if I have two references, I've got to keep the old one around; so only one of these references to the ndarray reflects the amendment. What we can see from this is that a NumPy ndarray cannot be resized, but it can be reshaped, and if we look at these two ndarrays, we'll see two completely separate blocks of memory. This gives us very clear reasoning for why we use NumPy: NumPy gives us a fixed-size, dynamic-shape structure, which is a nice correspondence to our Python list, which is fixed-shape, dynamic-size (fixed shape meaning always linear, always one-dimensional). NumPy is a mathematical type; the Python list is some sort of container type. And NumPy gives us fast code. That's great, but we're not here to talk about NumPy, we're here to talk about pandas, so let's talk about xarray instead, because we're definitely not here to talk about xarray.

Let's say we have some sort of real two-dimensional data, like image data. Image data is definitely two-dimensional: I can access any one pixel with two coordinates, I can rotate it and the data means the same thing, I can look at any column or row slice or diagonal slice and that's a meaningful operation, and the whole thing is homogeneous. Pictures are two-dimensional. And you can see, if I put this into a NumPy ndarray, I can access one individual row, and I can access one individual column, without needing any additional mechanism. There are many different ways for me to perform these operations, but ultimately the ndarray is a way for me to take n-dimensional data, in this case two-dimensional data, and operate with it.

Now, the reason I might want to introduce xarray into the story is that even though rotations of this data don't really change what the data is, as a human being I want to be able to access this data like a human being. I want to talk about the x coordinate and the y coordinate, because that's something I want in my code, something people will understand: they'll look at it and say, oh yeah, x equals 3 and y equals 4, and they can map that to the physical reality of where this data was actually captured. So what I'll do is take my NumPy ndarray and lift it up into an xarray DataArray, and I'll name the dimensions. Now I have the exact same NumPy data, but I've said: one of these dimensions is the x dimension and one is the y dimension. Then I can do things like select x equals 0, and it is incredibly clear what that means; or select x equals 1; or select x equals 0 and y equals 1, one value, and I get an array of a single value, but that makes sense. And I can select arbitrary subsets of this, arbitrary sub-squares or diagonals: select x equals 1 and x equals 0, and y equals 1 and y equals 0.
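A minimal sketch of that lift, including the coordinate-based selection discussed next (assuming xarray is installed; the .interp call additionally needs scipy):

```python
import numpy as np
import xarray as xr

data = np.arange(9).reshape(3, 3)
da = xr.DataArray(data, dims=('x', 'y'),
                  coords={'x': [10, 15, 20], 'y': [10, 15, 20]})

print(da.sel(x=10))                          # one whole row, by name
print(da.sel(x=10, y=15))                    # a single value
print(da.sel(x=[10, 15], y=[10, 15]))        # an arbitrary sub-square
print(da.sel(x=14, y=16, method='nearest'))  # nearest-neighbour lookup
print(da.interp(x=15, y=12.5))               # linear interpolation
```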
xarray is quite useful: it gives me the ability to look up my data by some sort of human metadata, namely the name of the coordinate. And you can think: when you're using NumPy, sometimes you have to transpose things because the way the data was read in wasn't quite the way your code operates on it, and you have code that hard-codes "look up this axis and this axis and this axis" and you need a transpose to make that work. With xarray you can name these axes, name these dimensions, and so (absent consideration of any performance consequences) any code you write can just refer to those axes by name and work, irrespective of whether the data happens to be transposed or not.

If I happen to have this data and it was captured on something with a coordinate system, say this is physical data I'm capturing on some sort of detector, and that detector has tick marks, 10 inches, 15 inches, 20 inches, I can add a coordinate system as well, and that allows me to select my data using that coordinate system: don't select the first data value that was captured, select the data value at 15. That is very useful. With xarray I can even say: select the data at x=15, y=13, and if you can't find that exactly, look for the nearest value. I can even say: select the data at x equals 15 and y from 12 to 18, and if you're missing data in between, do a linear interpolation. Incredibly useful, because somebody looking at that says: oh yeah, 15, I know exactly where I put that marker on my detector when I was capturing this particular image.

And it is the case that when we want to work in NumPy, because we want to stay within that restricted computation domain, we will sometimes invent artificial axes that do not represent aspects of the individual data items. For example, here we have not one file but multiple files, and we put them into one larger structure with one additional dimension, one additional axis, representing the file name we're operating on. One of the requirements here is that every file has to be exactly the same size, three by three, and they have to have a shared coordinate system. Okay, fair enough: every file was captured on the same coordinate system, from 10 to 20 on some stepping, and I just have the file names associated with each of these files. That allows me to say: select x=15, y=15 from the file called a.bitmap. Incredibly useful; I can actually understand what the heck this is doing. xarray is a very useful tool if I have raw mathematical data and I want to interact with it using additional metadata that makes it human-understandable, namely the dimensions and the coordinate system on those dimensions.

But we're not here to talk about xarray, we're here to talk about pandas. So: why pandas? Here I have a pandas Series, and if I look deeper into it, I will see that under the covers a pandas Series has a pandas array; it has a .array here, and this .array is one small level of indirection from the raw values that are stored. If I go a little deeper, I can probably even pull out the NumPy values stored within. This one level of indirection, the pandas array, this extension array idea, gives me the ability to do things like masked integer values: obviously if I have floating-point values I can store NaNs to represent missing data, but I can't quite do that with integers, because there isn't really an unambiguous in-band encoding for missingness there, so I need some sort of out-of-band encoding, some sort of masking mechanism.
Well, the thing that we call the index, probably the most interesting thing about pandas, is just like that coordinate system in xarray: it's a way for us to refer to the values. A very common indexing we might use is datetime indexes: this isn't just the values 7, 2, 0, -5, and -4; it's 7 that I captured on the first of January, 2 that I captured on the second of January, and so on. If I look at this pandas Series, I can see it consists of the raw mathematical data stored in the pandas array, plus some index that helps me translate my human understanding of what this data is, or where it was captured, to where the data is actually stored in memory. And I can do things like: give me the value at the first of January; give me the values from the first of January to the end of the year, or to whatever has been captured. That works, and it's a lot more convenient than having to do this in terms of raw matrix operations, especially if I'm working with multiple data sets that weren't captured over the same time span but have an overlapping span that I want to consider. It turns out that the pandas index is a way we label data in order to access it, and in order to give additional meaning to the manipulations we perform on that data.
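A minimal sketch of that datetime-indexed lookup, using the values mentioned above:

```python
import pandas as pd

s = pd.Series([7, 2, 0, -5, -4],
              index=pd.date_range('2021-01-01', periods=5))

print(s.loc['2021-01-01'])               # the value captured on Jan 1
print(s.loc['2021-01-01':'2021-12-31'])  # everything captured in the year
```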
So here I have a pandas Series, and you can see I've indexed it with values that appear to be some sort of measurement, from -20 to 20 with some stepping. I can say: give me the value at 0, and here this gives me the value at the label 0. I can say: give me the values from 0 up to 3, and it gives me the values from the label 0 up to the label 3, which is just that value at 0. Now I can look at what happens if I ask for the values from 0 up to 3 from the pandas array. The pandas array doesn't have any indexing on it, so this gives me the physical values from physical position zero to physical position three, versus the values from label 0 to label 3. So one of these gives me three values and one gives me one.

Well, it turns out that one unfortunate thing about the pandas Series is that square-bracket notation can be a little ambiguous, because if I change this indexing very slightly, so that it is an integer indexing, suddenly this square bracket seems to give me the values from physical position zero to physical position three, not from label 0 to label 3. And if I make these floating-point values, I see something similar to what I had before: it's working on the labels again. You can see this mismatch occur. Now, pandas has addressed this, and we have an unambiguous, specific way to access the contents of a Series: we can use .loc if we want to use the labeling, the index, and we can use .iloc if we don't want to use the index. The .loc, because it uses the index, is going to be affected by the index; the .iloc will not be affected by the index, because it doesn't use it. And if we compare the .iloc, the access that doesn't use the index, to the .array lookup, we'll see they give us something very similar: the output looks a little different, one gives us a Series with the original index and the other gives us the raw values, but in terms of which values they access, they access the same values.

There are a couple of questions that might show up here. One: why does that square-bracket indexing by itself act the way it does? In a moment you'll see why that actually makes sense. Another might be: hold on a second, this slicing is using an interval notation, and almost everywhere else I see interval notation in Python it doesn't include the endpoint, but this seems to include the endpoint. Why on earth is this including the endpoint? If I ask for label 0 up to 3 using .loc, it gives me four rows, if the indexing happens to match up for that; it gives me everything up to and including 3. That's a little bizarre. But before we look at that, let's change our index very slightly: if we go into this structure and change it so it's an index of letters from a to d, then we can actually begin to reason about this. The .loc operation uses the index to figure out which values you want, but the index doesn't have to be numeric; it could just be labels, a, b, c, d, e. It may be the case that we want to include that endpoint, and where on normal slice indexing we could have just done a plus-one to go up to and including that point, we cannot do a plus-one on an arbitrary index: that is not well defined. So you can see that .loc including the endpoint makes total sense: .loc uses the index, and the index is not necessarily going to have an ordering associated with it, and it is not necessarily going to have a successor operation. As a consequence, .loc has to include that endpoint, because otherwise you would not necessarily have any way to include it.
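A minimal sketch of the label/position distinction and the endpoint rule, with a made-up index:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=[-2, 0, 2, 4])

print(s.loc[0])      # label-based lookup: the value labelled 0 -> 20
print(s.iloc[0])     # position-based lookup: the first value -> 10
print(s.loc[0:2])    # label slice, endpoint included -> labels 0 and 2
print(s.iloc[0:2])   # positional slice, endpoint excluded -> positions 0, 1
print(s.array[0:2])  # the raw backing array, also positional
```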
What does plus one mean if there's a datetime index? One day? One hour? One minute? And if we look at our various indexings, we can see some interesting things about how these indices may operate. If we ask for everything from a to the end, or from a to c, the result depends on the index. If our index looks like this, with a bunch of nonsense in between, it gives me all the stuff from a to the end, or from a to c, with that nonsense included. If the a and the c are repeated, you can see it's pretty smart: it gives me everything from the very first a to the end, or from the very first a to the very last c. But if they're interleaved, I get this weird error: cannot get left slice bound for non-unique label. That actually kind of makes sense; this isn't a bizarre, crazy error. It is saying: you're asking for the values from a to c, but there's an a and a c, and then another a and another c; which a did you want, and which c did you want? Did you want the very first a and the very last c? That's not quite clear. If the index is in sorted order, there's an unambiguous choice we can make, but if the index is not in sorted order, there may be some ambiguity here. Many of the errors you see with pandas are actually warning you against ambiguities that do not have well-defined semantics, and it turns out these ambiguities can arise in different circumstances depending on the index.

There are in fact many different types of indices you may encounter. There are your simple range indices; if you look at a RangeIndex and look under the covers, you can see there's a .array, and it's just some values, from 0 up to and including 4; and there's a RangeIndex that looks like this, with some sort of stepping. There's an Int64Index, which is just a bunch of integers representing "this integer maps to this particular physical location". There's a DatetimeIndex, which can have a set periodicity, a set frequency, and you can even change what that frequency is. There's an IntervalIndex, which says: if you're looking for a value between 0 and 1, between 1 and 2, between 2 and 3, or between 0 and 2 and between 2 and 4, with a particular notation for which side is closed. There's a PeriodIndex, for when you're looking for a value in Q1 of this year or Q2 of this year. There are all these possibilities, and this may lead you to believe, especially if you do a .array on all of these, that the index is data. But the pandas index isn't actually data: it is something that is convertible to data, something that is often backed by data, but the pandas index is actually mechanism.

Here's my proof. If I have a range index with all the values from 0 up to 100, that's what it looks like. If I have a range index with all the values up to one quintillion, and this were actually data, I would run out of memory: there is no way for me to store all the values from 0 up to one quadrillion, up to one quintillion. And yet I can create this range index. The range index is a mechanism by which we take a human description of where the data is, the label, the key, and map it to the physical location. A range index is a very simple mechanism: whatever you ask for, say, give me the element at 24, it just gives that right back to you, or, if it starts from a different starting point or has a different stepping, it does a little bit of arithmetic to figure out what the corresponding physical location would be.
And when you look at the index, the index has some information about itself: it knows things like whether it contains duplicates, whether it is in sorted order, and whether the values are unique, because that determines whether operations that use the index, like that .loc, are well defined or not. The index is centered around a method called get_loc, and get_loc says: I will translate for you what one key turns into when you want to find the physical position. So get_loc called with 10, for this particular index, a range index from 10 up to 20 stepping by 5, says: the value 10 is at position 0. This is the translation between the human description of where the data is and the physical description of where the data is in some underlying storage. There are also operations like slice_locs: if you say, give me all the values from 10 up to 13, in the human description, that's equivalent to a slice from 0 to 1 on the physical locations of the data. And there are other things, like get_indexer: you can give it another index and it can tell you how to find those values, so if you say, give me all the values from 10 to 30 stepping by 5, it'll say: get the value at the zeroth position, the value at the first position, and those negative ones indicate that that data is not located in the original data set, which is maybe why we might need to do some NaN filling for those negative-one positions.

The indices in fact support a bunch of common operations, a common API. Here are a bunch of different types of indices, and if we iterate through each one and check whether it has a get_loc, whether it's monotonic, whether it's unique, whether it has duplicates: they all do. And if we take all of these index types, look at all the methods they support, minus the ones that begin with underscores, and find the intersection of all of these, we can see it's actually a fairly large surface area that the common indices you interact with implement. It's more than just is_monotonic; there's is_monotonic_increasing, is_monotonic_decreasing; there's quite a bit that these commonly support.
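A minimal sketch of that translation machinery, plus the index-is-mechanism point from the previous paragraph (a RangeIndex does this arithmetic lazily, so the huge one below materializes no values):

```python
import pandas as pd

idx = pd.RangeIndex(start=10, stop=20, step=5)   # labels 10 and 15

print(idx.get_loc(10))         # label 10 lives at physical position 0
print(idx.slice_locs(10, 13))  # label interval -> physical slice (0, 1)
print(idx.get_indexer(pd.RangeIndex(10, 30, 5)))  # [0, 1, -1, -1]

print(idx.is_monotonic_increasing, idx.is_unique, idx.has_duplicates)

# mechanism, not data: no quintillion labels are ever stored
big = pd.RangeIndex(10**18)
print(big.get_loc(24))         # computed arithmetically -> 24
```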
Unfortunately, this does make it somewhat difficult for you to create your own mechanisms for looking up your data, to extend the index, especially if you want to do something a little unusual. Here is an attempt at a very simple symmetric index: an index where you can say, you know what, according to my analysis, the human's description of where the data is located is invariant to sign; if I say -2, that's the same thing as asking for the element at 2, and I want all my logic to represent that, so somebody can understand: it doesn't matter if you ask for the -2 element, that's the same as the element at 2. But I don't want to double the amount of data I store, especially since I'd just be storing redundant data. So I'll create a symmetric index, kind of like the Int64Index, that says: if you give me a particular location to look up, I'll just take the absolute value of it and give you the corresponding location. And this very simple code actually mostly works: here's my index, here's my Series, you can see my symmetric index in place, and I can ask get_loc: give me the element at 2, give me the element at -2, and it says: that corresponds to the element at position 1. I can even do my lookups, and it gives me the same value. The unfortunate thing, however, is that if I had another Series object and I wanted to add the two together, it won't add them in an index-aligned fashion; there's a lot more mechanics that needs to be done to make that work. So there's a bunch of NaNs here, and that kind of makes sense, because this second series had a bunch of locations the other one didn't have, but you'd expect this one location and this one location to be filled in, and they're not.

In fact, it turns out it's a little bit tricky to create your own custom indices, because there's a lot of mechanism happening behind the scenes to make these indices work. I want to show you what happens when you do a simple s1 + s2 operation. What I'm going to do is use the sys module to set a tracer: I'm going to trace every function call and every function return, and put the results into a pandas DataFrame with the file name, the line number, the name of the function, and the depth in the function call stack. I'll take some very simple code, take this series here and this series here and add them together, and let's look at what these two series are. They're not that unusual: just two datetime-indexed series, where one is from the first of the year to the fifth of the year and the other is from the second of the year to the sixth, so they're slightly offset. Well, I can perform this operation and look at the DataFrame that results, and there are a lot of function calls: 677 function calls that I was able to capture from just that s1 + s2. Look at all the function calls at a depth of less than five: a __new__ method and a bunch of checks, okay, that kind of makes sense; an arith_method, I guess that's launching me into the plus; and you can see there's quite a bit happening here, ensure_wrapped_if_datetimelike, there's a lot of extra mechanism in pandas. And if I look for all the function calls whose file location is under core/indexes, you can see there are about 155 calls into something related to the index here. So there's quite a bit going on if you need to extend your own indices; this is one area where I think pandas could be improved: make it easier to extend some of these built-in structures, like making your own custom indices. But critically, our goal here is not to extend indices; it's to become experts at using pandas.
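A rough sketch of that tracing experiment; it assumes nothing about the talk's exact helper code, and since sys.settrace only sees Python-level calls, the counts you get will vary with the pandas version:

```python
import sys
import pandas as pd

calls, depth = [], 0

def tracer(frame, event, arg):
    global depth
    if event == 'call':
        depth += 1
        code = frame.f_code
        calls.append((code.co_filename, code.co_firstlineno,
                      code.co_name, depth))
    elif event == 'return':
        depth -= 1
    return tracer   # keep tracing inside nested calls

s1 = pd.Series([1, 2, 3], index=pd.date_range('2021-01-01', periods=3))
s2 = pd.Series([1, 2, 3], index=pd.date_range('2021-01-02', periods=3))

sys.settrace(tracer)
_ = s1 + s2
sys.settrace(None)

trace = pd.DataFrame(calls, columns=['filename', 'lineno', 'name', 'depth'])
print(len(trace), 'python-level calls captured')
print((trace['filename'].str.contains('core/indexes')).sum(),
      'calls under core/indexes')
```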
So, just to get a sense of what this index is: we can think of it as operative metadata. It's metadata that has well-defined semantics in terms of the data operations we use; metadata that is used as part of the operations themselves. Let me show you what I mean by that. If you have a pandas Series, you can actually attach metadata to it, like the source of this data or the author of this data, using the .attrs, and this will actually be preserved through operations. Here we have two series, two data sets, and they have different sources; one stored the author, the other stored the date when it was captured. I can look at these data sets and look at what those attributes are, and I can perform a groupby on one of them, and if I perform that groupby and some operation, you can see: it appears that I have lost my metadata. Well, that's a shame. Different versions of pandas are better or worse at, or getting better at, preserving this metadata, but you can see that when you did the groupby, pandas said "I'm not quite sure what to do with that metadata" and threw it away. If I take these two series and add them together, it preserved the metadata, but it preserved the metadata of the first series; if I add them in the other direction, it preserves the metadata of the second series. And that's not something you usually consider, that the order of operations will affect where the metadata comes from.

But if you really think about it, this kind of metadata has no well-defined semantics in terms of the operations on the Series. In other words, if you add these two series together, what do you do with the metadata? Do you merge it? If you merge it, what does merging mean? If there are two sources, each a string with a file name, do you return a set of the two file names? A list of them? The bigger of the two? If there are two dates, what do you do? Two authors? If each had a numeric value, like a number of bytes, do you sum them? Put them in a list? What happens? In the absence of well-defined semantics for this metadata, pandas is doing the best it can, and the best it can do is just grab the metadata from one side or the other. Now, the metadata we have on a pandas Series is not just the .attrs; there's also a field called .name, and you can see that the .name of these two is not being propagated when I add them. Different versions of pandas may or may not propagate the .name here; in some cases you may get the name from one side or the other. The core idea is that there are operations we can perform on a Series where this metadata, the attributes and the name, may or may not be preserved, and it's not well defined. Whereas the pandas index is always well defined in terms of the operation, because the operations are defined in terms of the pandas index. The index is metadata that will be preserved through the transformations you perform on the data, because those transformations are defined in terms of how they change the index. It's a sort of persistent metadata that's very closely tied to the data itself.
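A minimal sketch of that order-dependence, with hypothetical metadata; .attrs is documented as experimental, so exactly what propagates varies across pandas versions:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s1.attrs = {'source': 'detector-a.csv'}   # hypothetical metadata
s2 = pd.Series([4, 5, 6])
s2.attrs = {'author': 'jp'}

print((s1 + s2).attrs)   # metadata grabbed from one operand...
print((s2 + s1).attrs)   # ...so operand order changes what survives
```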
Well, if that's the case, what are the rules of the index? We're here to talk about pandas, so let's talk about NumPy, and if we talk about NumPy, we want to figure out what the rules in NumPy are: the rules of NumPy broadcasting, or the rules of NumPy promotion. As an example of what it means to develop expertise by looking at the rules: if we wanted to understand NumPy, one of the things we'd want to understand is the rules of the NumPy ndarray. So if we have an ndarray that looks like this, four values, we might be curious about the rules by which the dtype of this data set changes, where something that is integer data may turn into non-integer data. For example, if we take these values, take the logarithm, and then exponentiate, we get floating-point values back out, and it doesn't round-trip back to the original result. This might not seem like such a big deal, floats, integers, doesn't matter, but look at these floating-point values: is every value in this thing equal to the first value? No, because they're all different numeric values. Is every value equal to every other value? No. Is every value close to every other value? No. These are four distinct values, ...992, ...993, ...994, ...995. But once some promotion happens behind the scenes and these turn into floating-point values, the first value is equal to all the other values, every value is equal to every other value, and every value is close to every other value, because as a consequence of this promotion we lost the ability to store the distinction between these four values. You can see, at 2 to the 53, since we're presumably working with IEEE 754 double-precision floating-point values, we don't have the ability to precisely store that last digit. So the rules of promotion would be very important if we wanted to learn about NumPy.

Now, I don't want to go into the rules of promotion for NumPy, but I will say that unfortunately this is one area where you may want to use a tool other than pandas, because the rules of promotion, and the associated loss of precision, can kick in very easily in pandas. If we have these two series, which both contain the values ...992 to ...995 but are differently indexed, and we ask the two series whether they contain the same values, they definitely do; we ask whether the indices are the same, they're not. If we sum them, it will do an index-aligned sum, which will pop up NaNs where things are missing, and as a consequence of filling in those NaNs, well, we can't store a NaN in an integer array, so it's going to promote everything to float64. And we might say: but there's a .add call which takes a fill_value, so maybe I can fill with an integer 0 and it'll keep everything integers. But no, it won't. So unfortunately pandas will sometimes get you into that floating-point dtype even when the API suggests it might not have to, and as a consequence this loss of precision can sometimes happen a little more easily in pandas than in NumPy.
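A minimal sketch of both the NumPy round-trip and the pandas alignment promotion, using values near 2**53 where float64 loses integer precision:

```python
import numpy as np
import pandas as pd

xs = np.array([2**53 + i for i in range(4)])  # ...992, ...993, ...994, ...995
print(xs.dtype, (xs == xs[0]).all())          # int64, False: distinct values

ys = np.exp(np.log(xs))                       # log/exp promotes to float64
print(ys.dtype, (ys == ys[0]).all())          # float64, True: distinction lost

a = pd.Series([1, 2], index=[0, 1])
b = pd.Series([3, 4], index=[1, 2])
print((a + b).dtype)                 # float64: the NaNs forced promotion
print(a.add(b, fill_value=0).dtype)  # still float64, despite the integer fill
```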
Our goal here is not to talk about numpy promotion, and it's not really to talk about numpy broadcasting either, but let's talk about numpy broadcasting anyway, because it helps us understand what index alignment is, this thing we've mentioned a couple of times so far. Numpy broadcasting is about taking two dissimilarly shaped data sets and figuring out how to match them up so that an operation performed on the two of them makes sense. In other words, if my xs looks like this and my ys looks like this, I might say: how do I add these together? ys has two values and xs has two levels, so maybe it adds one value to one level and the other value to the other level. But when I add them together I get a broadcasting error: cannot broadcast these operands together. If I swap out ys for something that's two by three, it still complains. But if I use something that's three by four, it works. Why did it decide that a three by four against a two by three by four worked, but a two by three against a two by three by four didn't? And if we ask a co-worker, well, how do I make this work with just the two values, they'll say, oh, use newaxis. What on earth is newaxis all about? It turns out that if we understand the rules of broadcasting, all of this makes sense: it becomes clear why the errors we run into are the way they are, and the motivation for things like newaxis becomes fairly obvious. Note, however, when you're becoming an expert at a tool like numpy or pandas, it is not guaranteed that the rules will always make sense, not guaranteed that the rules will actually drive intuition or conceptual understanding. One thing you might wonder: I had this very nice box drawing character here, and you might say, he used .format, and that's one of the first times I've seen him use .format since the f-string was introduced; why on earth didn't he use an f-string? Well, if you take a dash and multiply it by 40, you get a line of 40 characters, but it doesn't look that nice; the unicode BOX DRAWINGS LIGHT HORIZONTAL character looks much better. If you put this into an f-string, everything works, and here I can have an f-string with a newline before and after; but the rules of the f-string say you cannot have backslash escapes inside the expression part, so I cannot write this with an f-string; I had to use .format. That rule doesn't really give you a good sense of why f-string versus .format; it doesn't give you a broader sense of anything but a restriction that came up when they were designing this feature. In a corner case, when they wanted to parse f-strings and make it possible for additional tooling to syntax-highlight them, they just couldn't figure out how to get the backslashes to work in a consistent fashion easily and cleanly, and as a consequence they ruled against it: backslashes are forbidden in f-string expressions. That rules exist doesn't necessarily mean that rules are intuitive or drive conceptual understanding. But the rules we're going to talk about, the broadcasting rules and the index alignment rules, despite producing errors as incomprehensible as "SyntaxError: f-string expression part cannot include a backslash", make a lot of sense, and why they are the way they are is something that can help you build better conceptual, intuitive understanding of a tool like pandas. So here are the rules of broadcasting: you take the dimensionality of the structures you're operating on, and you right-align the shapes. I'm going to take each of the shapes I tried and right-align them, and then you try to match them up. When you match them up, you're looking either for an exact match, four against four, or for one side to be missing a dimension (like a zip run in reverse, stopping at the shorter of the two), or for a one. Two by three by four against two: doesn't match, because four doesn't match two. Two by three by four against two by three: doesn't match. Two by three by four against three by four: that matches, the four against the four, the three against the three, and the missing dimension broadcasts.
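A small sketch of those shape rules (the names here are illustrative):

    import numpy as np

    xs = np.ones((2, 3, 4))

    # right-align the shapes and compare from the right:
    #   (2, 3, 4) vs       (2,)  -> 4 != 2: error
    #   (2, 3, 4) vs    (2, 3)   -> 4 != 3: error
    #   (2, 3, 4) vs    (3, 4)   -> 4 == 4, 3 == 3, missing dim repeats: ok
    try:
        xs + np.ones(2)
    except ValueError as e:
        print(e)                          # operands could not be broadcast together
    print((xs + np.ones((3, 4))).shape)   # (2, 3, 4)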
So it's just going to take those ys and apply them to this level here, once per layer. For that ys of two where I did the newaxis (newaxis, or indexing with None, same thing), it's just about nesting this a little deeper: adding one more layer, where that layer holds just one element, because a two by two structure is exactly the same thing as a two by two by one by one by one structure. The newaxis is just my way to obey the broadcasting rules, to make these shapes line up. In broadcasting, you right-align, then match up the dimensions, looking either for an exact match or for one of the two to be equal to one; if one of the two is one, you just broadcast against it. That's how broadcasting works, and you can even double-check why right alignment rather than left alignment makes sense in terms of what we know about numpy. A numpy ndarray is an interpretive view of raw memory, and it has a shape. Look at the shape of this thing, two by three by four, and look at its strides. The shape is how we interpret this data, as a rectangular prism with two layers, three rows, and four columns each; the strides are how we move between the values: to go from layer to layer we skip 96 bytes, to go from row to row we skip 32 bytes, to go from column to column we skip 8 bytes, because it's int64 data, of course eight bytes. What you can see from the strides is that the rightmost of these axes is the most tightly packed; those individual column values are the most contiguous in memory. That's probably why you would choose right alignment: you want to work from the things most contiguous in memory toward the things least contiguous, so you're not jumping around memory too much. And when you perform operations on this numpy ndarray and think about what an indexing does, you can understand it in terms of strides. When you ask for xs at zero, indexing works from the outermost to the innermost axis: you're drilling down into the data, and you always want to grab the thing that's as contiguous as possible, because that's as fast as possible. That's why broadcasting works from the right-hand side but indexing works from the left-hand side. Every time you do an indexing, you're just saying: give me a little view of that underlying memory, this particular contiguous block; and even if you index out a single column, that still works, you just end up skipping over more values. When I take my ys and look at its striding, you can see that the ys with newaxis applied twice over is something whose added axes have a stride of zero: it turns out you can increase the dimensionality of a numpy ndarray at almost no cost, just by telling it, you have one more axis, that axis contains one value, and the stride for it is zero; you don't skip around at all. You can even debug your broadcasting issues using broadcast_to from numpy: if I take zs and ask how it would be turned into something compatible with xs, I can see it becomes something with zero strides at the start; it just repeats the value as many times as it needs to along those axes. It all kind of makes sense.
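A sketch of the strides and the zero-cost newaxis (values illustrative):

    import numpy as np

    xs = np.arange(2 * 3 * 4, dtype='int64').reshape(2, 3, 4)
    print(xs.shape, xs.strides)                  # (2, 3, 4) (96, 32, 8)

    ys = np.array([10, 20])
    print(ys[:, np.newaxis, np.newaxis].shape)   # (2, 1, 1): broadcastable against xs

    # broadcast_to makes the free repetition visible in the strides
    zs = np.broadcast_to(np.arange(4), xs.shape)
    print(zs.strides)                            # (0, 0, 8): repeats without copying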
And so, without further ado, let us talk about the rules of index alignment, because it turns out they're all about just matching things up. The rules have a couple of small parts to them, but let's start with the simplest case, where there are no duplicate values in your index. Here I have two pandas series, both indexed a, b, c, d: no duplicate values, everything matches up, and the indexes know they're sorted, know they're monotonic, and know they don't contain duplicates. When you add s1 plus s2, it just matches things up exactly: the value at a plus the value at a, the value at b plus the value at b, the value at c plus the value at c. Didn't I tell you that collections.Counter is a first-order approximation to a pandas series? Well, it is: this is just adding up the corresponding keys to get the values, with a slightly simpler syntax. If these things do not line up exactly, it will still line up as much as it possibly can, but it'll figure out that one side doesn't have a and the other side doesn't have e; where collections.Counter fills in with zero, pandas fills in with NaN, and that's why you have those NaNs surrounding this result. If you want to look at the indexing mechanism yourself, you can play with the index directly: hey, index, if I do an operation with you against this other index, what would happen? You can see, for these dissimilar indices, it's going to look up the first value, the second value, and the third value, and fill in a blank for the other one; and presumably, as part of this operation, it does this get_indexer on both sides and takes the union of the two, so the result is the consequence of the lookup of these values and the lookup of those values merged together. Now, if we look at these indexers, we can see they have some additional behavior. We can backfill the indexers: if we have b, c, d, e and a, b, c, d, we can say, if you're missing something, just pick the next value after it. We can forward fill: if you're b, c, d and c, d, e, f, and you're missing values and this is monotonic, just fill going forward. And we can even look for the nearest value: if the indices don't match exactly, just find the nearest one. When you think about that together with get_indexer, you can see that this is a mechanism, and the mechanism has rules for how it matches things up, and different types of indices may have more sophisticated mechanisms than just an exact match. Index alignment is about matching things up, but it's not necessarily about an exact match: it's about how the index considers the match to be made, whether it considers one of these methods, nearest match say, to be the most appropriate method. This is why you can put actual datetime values in a pandas DatetimeIndex and then look them up with a string: the index knows how to convert the string into a datetime value and then perform that matching.
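A small sketch of poking at that mechanism directly, assuming monotonic indexes (the labels are illustrative):

    import pandas as pd

    ix    = pd.Index(list('abcd'))
    other = pd.Index(list('bcde'))

    print(ix.get_indexer(other))                  # [ 1  2  3 -1]: 'e' is missing
    print(ix.get_indexer(other, method='pad'))    # [ 1  2  3  3]: forward fill to 'd'
    print(ix.union(other))                        # the aligned result's labels: a..e

    num = pd.Index([10, 20, 30])
    print(num.get_indexer([11, 29], method='nearest'))  # [0 2]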
Unfortunately, these different indexing modalities are not available on most of the operations you perform on a pandas series directly. If you look at s1.add, you can really only specify the level, the fill value, and the axis; you can't specify how to do the indexing, a forward fill or a backfill. But those operations are available on the series itself, so either you first transform the series into shape and then let the add be a very simple thing, or you do index operations, and once you have the index operations you apply them to the data. I often find that a lot of the very complex pandas code I write does quite a bit of work to first figure out what values I want to merge or operate on, via index operations, and only once I've figured out how to whittle down to the data I care about do I go to the raw data, pull it out, and perform some operation on it. collections.Counter is a first-order approximation to a pandas series, and you can see all of these behaviors in play: if I have two series and I add them together, it's just like a collections.Counter matching keys up. Now, if there are duplicates in the index, what would be the reasonable way to match up values? If one of these indices had two a's in it and the other one didn't, how would we match them up? Well, probably we'd match each a in the first against each a in the second: we'd do a cartesian product. If we look at these indices, we can see one side has duplicates and the other does not, so how do you match things up when there are duplicates on one side? You match each entry on one side against every matching entry on the other. That's why there were more rows in that previous example: it was doing the cartesian product wherever there were duplicates, and because the index knows it has duplicates, it uses that as part of the operation. You can see a 3 and a negative 3: that's because the a of 7 in the first was matched against both the a of negative 4 and the a of negative 10.
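Roughly what that looks like (values chosen to mirror the example in the talk):

    import pandas as pd

    s1 = pd.Series([7, 1],       index=['a', 'b'])
    s2 = pd.Series([-4, -10, 5], index=['a', 'a', 'b'])

    # the single 'a' on the left pairs with every 'a' on the right
    print(s1 + s2)
    # a    3
    # a   -3
    # b    6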
Index alignment also performs this cartesian product when there are duplicates on both sides: then you do the cartesian product everywhere you see duplicates. If there are two a's on one side and two a's on the other side, you get four values in the result for a, because you match every one on one side against every one on the other. And that's the reasonable way to match things up: if you collected some data and said, I collected two measurements labeled a in one data set and two measurements labeled a in the other, you probably want to pair them up in all the different possible combinations; and if you didn't want that, you probably would have used unique labels. Your choice of whether these labels are unique is you telling pandas: consider these to be the same conceptual entity, and so please perform the cartesian product. It's not that this is an obscure thing pandas has done to confuse you; it's that you've asked pandas to do the wrong thing by telling it something not quite correct about the metadata related to your data. Remember, and it is critical to understand this: the pandas index is not like a primary key. It is not guaranteed to be unique; it is not guaranteed to be sorted. It is just some mechanism for looking up the data, full stop: just a mechanism. And in fact, non-unique, non-sorted indices come up all the time. Let me show you an example of a non-unique index. Here we have a series on a datetime index, and you can see I have captured these non-uniform measurements, down to the second, from the first of the month to the 22nd of the month. For my purposes, I might want to resample this in terms of every minute, and here you can see I have every minute, everything rounded nicely, one row corresponding to every minute, and a bunch of NaNs, because I don't have measurements for every single minute; there are just too many missing. This is not too different from the idea of grouping these values by their times rounded to the minute and then taking the mean. And if we were curious about the difference between this resample and this groupby, we could do a little bit of index mathematics, a little index manipulation, to figure out what that would be. We could take the first result, resampled on one-minute intervals, and for every bucket collect, not the values themselves, but the original index contents into a set; and here you can see, for this resampling, that the original data sample at two seconds past the minute got bucketed into 51 minutes past the hour. I can do the same thing on my groupby, and where they don't match up will be the cases where the groupby on the rounded time series differs from the resample. In fact, I can take these two results and put them into a data frame, and here you can see this data frame, a collection of like-indexed one-dimensional series, contains all of the measurements from the original data set that correspond to 51 minutes past the hour, for the resample case and for the groupby case.
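A rough sketch of those two bucketings, assuming a seconds-resolution datetime-indexed series (the data here is synthetic):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    ts = pd.to_datetime('2021-10-01') + pd.to_timedelta(
        np.sort(rng.integers(0, 21 * 24 * 3600, size=200)), unit='s')
    s = pd.Series(rng.normal(size=200), index=ts)

    by_resample = s.resample('1min').mean()              # one row per minute, NaN if empty
    by_groupby  = s.groupby(s.index.floor('min')).mean() # only the minutes that had data

    print(len(by_resample), len(by_groupby))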
For that resample and that groupby case, the actual values should be the same anywhere they bucketed the same samples, because I'm performing the mean on the same values; anywhere they differ, the result is going to be different. You can see the resample gives me entries for every uniform minute, whereas the groupby gives me no row at all if there was nothing to measure: the groupby simply omits those values from the result, whereas the resample has a NaN value, because there weren't any values to perform a mean on. If I take this result and look for all the places where they're in common, these are all the places where the groupby and the resample matched up in terms of the samples they were going to feed into the mean. If I capture the index of that result all by itself, I can index this thing where it's true, and these are all the samples where things matched up. I can compute what the common indices are, look up the resample and the groupby at those indices, and then see that, exactly as I said, anywhere they grouped the same values together in that mean, they produce the exact same value. That gives me the common index; if I want the differing index, I do another index operation: take the original indices of everything they both have, subtract out everywhere they're the same, and I can see all the places where they differ. What you can see here is that index mathematics comes up pretty commonly. Any time you have a very large data set, you're working with another data set, and you've captured this metadata, it's not uncommon to do these index operations: okay, take this data, then subset it, and subset it, and subset it. If you're not using the index, you're probably doing this on the raw data, so you're making lots of copies of the raw data, which may have many, many columns, and so that could be a very expensive copy, and it's probably many lines of code with lots of for loops that nobody can make heads or tails of. Here, what I'm saying is: take these two structures, figure out what samples were included in each bucket, find all the buckets that are the same, compute all the places where they used the same bucket and all the places where they used a different bucket, and then actually go look in the original data to find the corresponding calculation for those buckets, and verify that the groupby and the resample are sometimes the same in some places and sometimes different in others, where the only difference is how they bucketed, how they performed this windowing operation.
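The index algebra itself is just set operations; a minimal sketch (names illustrative):

    import pandas as pd

    ix1 = pd.Index(['a', 'b', 'c', 'd'])
    ix2 = pd.Index(['b', 'c', 'e'])

    print(ix1.intersection(ix2))           # buckets both agree on
    print(ix1.union(ix2))                  # everything either produced
    print(ix1.symmetric_difference(ix2))   # where they differ
    print(ix1.difference(ix2))             # only on the left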
If you think beyond the pandas series: we often don't use the series directly if we're not that familiar with pandas; we'll often just use a data frame all the time, though the pandas series is a very useful tool all by itself. Well, here is an example of a pandas data frame you may have worked with before: it has dates, it has tickers, it has prices, it has volumes. We can do operations like a .loc: give me the element at the label 0 and the ticker column. We can do a .iloc: give me the element at position 0 and column 0, so I guess that would be the date. Give me the element at label 0... oh, maybe that's why, when I had the series with the square bracket indexing, there was some ambiguity: when you look at something like a data frame and perform an operation on it, you get a row or a column, and if that's a single row or a single column, it's a pandas series, and on that series you're going to want to do another lookup, maybe a square bracket lookup. And what are you most likely to do? You're most likely to have string-named columns, so you're most likely to want to operate on the labeling in the subsequent operations you perform once the data frame lookup has been reduced to a series. If you look at these .loc and .iloc operations, you can see there are quite a few of them; you can see this gave us a series object, and that's why the square bracket has this implicit behavior: it's just trying to guess what's most common, because nobody really wants to chain .loc after .loc like this, even though that's less ambiguous. And you can see .loc even has the ability to take two parameters: give me the ticker column for every particular row. Now, let me remind you: pandas is very one-dimensional. The main reason to use it is the index, and it only stores one-dimensional data. A pandas series is a one-dimensional data set with one index. A data frame is doubly indexed. What a pandas data frame is: you have collected multiple measurements, those measurements are collected against the same metadata, the same indexing mechanism, and you want to store them together and line them up; and as part of storing them together and lining them up, you also want to refer to those measurements by another index. A data frame is a doubly indexed structure that contains like-indexed one-dimensional data. Pandas is only one-dimensional data. Yes, the pandas documentation calls it two-dimensional data, because there are two coordinates for a .loc or a .iloc; but if we compare a pandas data frame like this one to our x-ray data, where the x-ray data was clearly two-dimensional, this is not really two-dimensional data, because I can only look down the rows or down the columns to perform a meaningful operation; I cannot look down the diagonals. Yes, I can access this data via two coordinates, but this data is not homogeneous. If I perform an arbitrary rotation of this data, the data fundamentally changes what it means, whereas with an image it doesn't really change what it means; it just means you're tilting your head a little sideways. This is actually four data sets I've collected, the date, the ticker, the price, and the volume, all collected with the same indexing. And in fact the date and the ticker, that's not data, that's not something I measured, that's a labeling; what I really should do with this data set is index it on the date and the ticker. It's two data sets I've collected, one price, one volume, collected against the same date and ticker, and I want to line them up and perform operations making use of their being lined up, operations not only on this data frame itself, but combining this data frame with others. This is what a pandas data frame is all about, and this is what pandas is all about: indexed one-dimensional data, whether it is one indexed one-dimensional data set, a pandas series, or a collection of like-indexed one-dimensional data, a data frame.
Let us take a look. Here I have a data frame, and you can see, if I access one of these rows using .loc, it gives me the row as a pandas series; if I access one column, it gives me the column, and if I look at what that column is, it's a series. A pandas series is not "a row or a column of a pandas data frame"; a pandas series is just a one-dimensional data set. When you go into a pandas data frame and ask for a subset of it, and that subset happens to be one-dimensional, you get a series back. That's the relationship between a pandas series and a pandas data frame: one is for collections of like-indexed one-dimensional data, and the other is just one indexed one-dimensional data set; and for the data frame, since you're collecting these together, you add one more index just to be able to refer to what those individual items are. Let me show you something that happens quite often in pandas. Here, if I look up the element at label 0 or at label 2, because I'm using .loc, this is index-aware: if I change what the indexing is, it changes what the result is. A .loc operation is not guaranteed to give you a series or a data frame unless you are absolutely certain what the contents of that index are, because if there are repeated values in the index, .loc gives you two rows, the two rows where the label is 0, and that's a data frame; whereas if you ask for entry 2, where there are no duplicates, it gives you just one, a series. The .loc operation can give you anything, well, a data frame or a series, I guess those are the two options, but it can give you either one, and you need to understand what the indexing is in order to know which. Where .iloc is helpful, and .iloc is a very useful tool even if you usually think about your data in terms of this metadata, this indexing, is that .iloc is unambiguous: it always represents the physical position. .iloc of 0 always means physical position 0, and there can't be two physical positions 0; that is unique. You can think of .iloc as indexing with a unique range index starting from 0 up to the size of the structure; that's what .iloc really is. Now, when you do these operations, asking a pandas data frame for one column or one row, one of the things you can see is that the indexing of the result corresponds to the indexing on the other side. In other words, if the original data frame was indexed 0, 1, 1 on the columns and 0, 0, 1, 1, 2 on the rows, and I ask for row 0, the indexing of the result is taken from the column index; this is why it's doubly indexed. When you take a particular slice down a row or down a column, you know how to index the result: with the column index if you took a row, with the row index if you took a column. And you can see the name of the resulting series is set to the value of the index that you looked up to get this particular data set: if I look up entry 2, the name is set to 2, and the index is set to whatever the corresponding columns were; and you'll see this whether you look at a column or a row.
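A quick sketch of .loc's duplicate-label behavior versus .iloc (the frame is illustrative):

    import pandas as pd

    df = pd.DataFrame({'x': [1, 2, 3, 4]}, index=[0, 0, 2, 3])

    print(type(df.loc[0]))    # DataFrame: two rows share the label 0
    print(type(df.loc[2]))    # Series: the label 2 is unique
    print(type(df.iloc[0]))   # Series, always: position 0 is unique by construction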
So now it kind of makes sense why we have this implicit behavior: the implicit behavior on the pandas series with the square bracket all by itself is basically trying to guess what it thinks you're trying to do when you do these chained operations. Do you really want to look at the column names? Do you really want to look at the row names? Depending on the circumstances, you may want one or the other, and sometimes this works, sometimes it doesn't; but of course you can always .iloc or .loc your way into explicitness. The SettingWithCopyWarning actually makes sense too. Consider this: if you have a pandas data frame and you ask it for one row, because this is not two-dimensional data but multiple data sets that are like-indexed, there is no guarantee that those columns contain the same type of data, meaning the result you get out of this retrieval of one row is going to be a pandas series with, potentially, an object dtype, because here this row contained a string, a float, and an int. How do I store that contiguously? Well, I have to use the object dtype; I can't store these as int64s, because what do I do with the float, what do I do with the string? And if the pandas series is built on top of, say, a numpy ndarray, and the ndarray wants these values to be contiguous, then this lookup is going to make a copy: it's going to create a series backed by some new data, and that new data is going to be the PyObject pointer values, the references to those underlying entities in memory, and it may even make a copy to bring these back into the boxed universe, depending on what needs to be done. But whatever the case, a copy is being made: you looked up a row, and it had to make a copy because this row contained non-homogeneous data. And because it had to make a copy, if you try to do a setting on that copy, the setting might fail, because the setting changes the copy, not the original data. The addition of that extra column in the previous example, the one that changed everything from working to failing with that SettingWithCopyWarning, was because in one of those cases I had only integer data: with only integer data, the series didn't have to make a copy, because I could just get a series of integers; here, because this is heterogeneous data, I have to make a copy, and this is why that warning happens. So the solution to this is pretty clear: don't make a copy. And that's why .loc and .iloc have the ability to take two coordinates. This lets you say: give me the value at 0 on the positions for the rows and c on the positions for the columns; oops, that doesn't work. Give me the value at 0 on the labels for the rows and c on the labels for the columns; okay, that worked. Or give me the element at 0 on the positions for the rows and at the position where c occurs in the columns, using get_loc; you can see direct use of the index there. Unfortunately, the .loc and .iloc operators do not allow you to mix and match label and positional indexing, which is oftentimes why people do the multi-stage indexing; but in these cases, what you might need to do is use the index directly, to figure out where that element is, so you can consistently use .loc with just labels or .iloc with just positions.
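A short sketch of setting a single cell without the intermediate copy (the column name 'c' is illustrative):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})

    df.loc[0, 'c'] = 'z'                          # labels on both axes: writes in place
    df.iloc[0, df.columns.get_loc('c')] = 'w'     # positions on both, via the index

    # whereas df.loc[0]['c'] = ... chains through a possibly-copied row
    print(df)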
So if we think about a pandas data frame and what this thing really is: a pandas data frame is typified by the BlockManager, the data that it contains, in this case a couple of integers, a couple of floating point values, and a couple of string values. That BlockManager is a topic for another day; it is very complex in how it works, and it can be very opaque in its operation. Closer to the surface, however, we have the index, which is our way to refer to the rows of this structure, and the columns, which is another way to refer into the structure, and these are both indices. Sometimes people call the index the row index or the major axis, and the columns the column index or the minor axis; "columns" can be somewhat misleading, in part because they're both Index objects, both mechanisms by which we refer to the underlying data, and everything just kind of makes sense when you see how the index works. So let's wrap up our discussion of the rules of index alignment by talking about what those rules are on the structure we commonly use, the data frame. If we happen to have two pandas data frames that both look like this, doubly indexed collections of like-indexed one-dimensional data, what we can think is: an operation performed on the two of them matches things up on both the row index and the column index. It matches up the first through the fifth of the year against the second through the sixth on the row index, so I get NaNs at the top and NaNs at the bottom, because each is missing one row the other has; and since the columns are the same, it just matches the columns up by name, and that's exactly what I expect. If there is an extra column in one, we get an all-NaN column, because it matches things up on both of these indices. What are the rules for combining a series with a data frame? When you have a pandas data frame that looks like this and a pandas series that looks like this, it matches the columns of the data frame against the rows of the series. Notice, even though it might seem obvious that it should just take that series, that 4, 2, 0, 1, 8, and multiply it down the frame, that's not actually what happens: when I perform this, it gives me a weird result with a bunch of NaNs, because it took the row index of the series and tried to match it against the columns, and because they didn't match up, I just get all NaNs. It doesn't matter in which order I perform the operation; it always matches the rows of the series against the columns of the data frame. And I can't transpose my way out of this, by the way, because a pandas series is not something that can be transposed: it's not a column vector or a row vector, it's just a one-dimensional data set, so it has no notion of orientation; the transpose of a pandas series is just the series itself. But this rule actually makes sense if you think about how we actually use pandas series and the operations we actually perform. If we take a look at these two structures, where I have a series that matches the column indexing of the data frame, I can add these together, and I can immediately see how this is useful if I go back to my real-ish world data.
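A minimal sketch of that series-against-columns rule (the contents are illustrative):

    import pandas as pd

    df = pd.DataFrame({'price': [10.0, 11.0], 'volume': [100, 200]})

    s_rows = pd.Series([2, 3], index=[0, 1])               # indexed like the rows
    s_cols = pd.Series([2, 3], index=['price', 'volume'])  # indexed like the columns

    print(df * s_rows)   # all NaN: row labels were matched against column labels
    print(df * s_cols)   # scales price by 2 and volume by 3, down each column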
Here is my data frame containing dates, tickers, prices, and volumes; I've already set the index to the date and the ticker. I have some factors, indexed by the ticker: for each one of these tickers, I have some multiplicative factor that I want to multiply through the original data. If you think about it, that multiplicative factor is unlikely to affect both the price and the volume; if you imagine it's an additive factor, it makes even more sense why it's unlikely to affect both: the price can't be negative but the volume can, the volume is integer data and the price is floating point data. So it's unlikely for that factor to apply to both of these columns; it's more likely to apply to just one. What I'm likely to do is take the data frame, pull out the one column I want, which gives me a pandas series, and do a series-based alignment; that makes sense. Or I'm likely to take two data frames, one with the factors for the prices and one with the factors for the volumes, and apply those data frame by data frame, matching the columns of one against the columns of the other and the rows of one against the rows of the other. If I mismatch them, what pandas thinks I want to do is match up the columns: do something to all the prices and something to all the factors moving down that data frame, because most of the operations performed on a pandas series or a pandas data frame tend to be performed down the structure, down the rows. Now, you can see here this is a multi-index structure, so I can do things like IndexSlice this data frame to get all the rows for one ticker, and this index-aligned operation is aware of that: even though I have all of these prices, indexed by this multi-index of the date and the ticker, and I have this factor that only includes the ticker, pandas is smart enough to match each factor against every one of those dates. Think about how much pure python looping code it would take for you to make that work in the same fashion. So what is this multi-index, and what are its alignment rules? Well, the alignment rules are pretty straightforward: the exact same alignment rules you see on the series and the data frame; it's just that pandas is able to use the names of the index levels to figure out how things match up. This is why you sometimes get these errors about things being misnamed, and this is why you can see I always set the name of my series and the names of my indices: that's the metadata pandas uses to figure out how to do the matching.
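A sketch of aligning a (date, ticker)-indexed column against a ticker-indexed factor; in recent pandas this can be spelled with the level argument (the names and numbers here are illustrative):

    import pandas as pd

    df = pd.DataFrame(
        {'price': [10.0, 20.0, 11.0, 21.0], 'volume': [100, 200, 110, 210]},
        index=pd.MultiIndex.from_product(
            [pd.to_datetime(['2021-10-01', '2021-10-02']), ['abc', 'xyz']],
            names=['date', 'ticker']))

    factor = pd.Series([2.0, 0.5], index=pd.Index(['abc', 'xyz'], name='ticker'))

    # the factor repeats across every date for its ticker
    print(df['price'].mul(factor, level='ticker'))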
Let me show you a series with a multi-index: here's a series with a bunch of symbols and a bunch of quarters that I've collected this data over. If I look at this, I can see this is clearly a multi-index, and if I do a .loc operation, I can see it's hierarchical: I can look at everything for q1, or I can drill down and look at everything for q1 and abc. You can see, if I did this with another index, if this were our data frame and I had another coordinate here representing the column I'm looking for: is this one-dimensional? Is this two-dimensional? Is this like nested one-dimensional? You can see why the multi-index definitely breaks our usual sense of one-dimensional, because there are two coordinates used for one axis and one coordinate used for the other; but they're not flattened out together, because that would be nonsensical, a three-dimensional data set, and a pandas data frame isn't really even two-dimensional; it's actually like-indexed one-dimensional data. But here you can see I can drill down, and I can even use IndexSlice to take really smart cross-sections, like: give me the abc and the xyz for q1 2021. And because it's a very smart index, I can even give it an actual date, like the 15th of March forward, and it figures out which quarters are affected and then slices on those to give me the corresponding tickers; and the index alignment here is just smart: it figures out which levels match up against which levels and does the index alignment on those. Note: the multi-index is a topic for another day, and it is a very big topic. People are often afraid of the multi-index, especially if they don't like using the index itself, because it has these hierarchical behaviors. But the multi-index is not the only hierarchical index, because, the index being just a mechanism, hierarchical indices pop up all the time. A datetime index is hierarchical: if you ask for just the year, it'll give you all the rows for that year, just as if you had asked the multi-index for one layer of the index, and there's a drill-down possible with the datetime index just like the drill-down possible with the multi-index. The syntax is very slightly different, because the separation between layers happens when the string is parsed by the datetime index rather than in what you pass to the multi-index; but if you think about it, a datetime index is conceptually equivalent to a multi-index where each level is one fidelity of time measurement, year, month, day, and so forth, as deep as you happen to have fidelity; it's just stored as one index, without some of the syntax of the multi-index. And so, if we take all of this and ask: indices, are they useful? Very. Do we need to understand how they work and use them ourselves? Yes. If we think about indices and index alignment and how they work, can we become a pandas expert? Yes, because it turns out that almost everything you can do on a pandas series or a pandas data frame can be determined, or discussed, or defined, in terms of operations on that index.
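A sketch of those hierarchical lookups (the quarter labels and values are illustrative):

    import pandas as pd

    ix = pd.MultiIndex.from_product([['q1', 'q2'], ['abc', 'xyz']],
                                    names=['quarter', 'symbol'])
    s = pd.Series([1, 2, 3, 4], index=ix)

    print(s.loc['q1'])                        # one layer: both symbols for q1
    print(s.loc[('q1', 'abc')])               # drill all the way down
    idx = pd.IndexSlice
    print(s.loc[idx['q1', ['abc', 'xyz']]])   # a smart cross-section

    # and a datetime index drills down the same way, by parsing the string
    t = pd.Series(range(90), index=pd.date_range('2021-01-01', periods=90, freq='D'))
    print(t.loc['2021-03'])                   # just march, one "layer" of the date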
For example, here is a series indexed on individual days. I can group it by month and take a mean, but you can see I get an index that's the month number, which is probably not what I want. What I can do instead is have the groupby operate on that index turned into a period, and with this result I have a nicer index, so I can look up exact dates, have those exact dates translate to the months, and everything works nicely. This is way easier than just doing a groupby on index.month, where I'm basically back to something akin to my range index, because I have to figure out how to map actual dates to months myself, whereas I could have let the index do that for me. All the different operations I can do on a groupby make sense in terms of the index. If I have a structure and I want to group it by something, and I want the result to reflect indexing based on the groups, I want my groups to become the new index: that is .agg. What .agg is about is taking the groups and collapsing each to one scalar value, so you can produce, as the end result, something that is indexed on each of the groups, with one value for each, computed by whatever user-defined function you've provided; so here you can see agg with the skew, or agg with the kurtosis. Transform is about preserving the original indexing: what transform is about is saying, take some structure with some indexing, perform some transformation on it, but keep what the original indexing looked like; and so the UDF passed to transform should preserve that indexing. I can't use kurtosis or skew there, because those are reductions, those will collapse; but I could use something like a z-score, and you can see the z-score did what I wanted. Apply is about saying: you know what, I don't want either of those; I want a brand new indexing. Apply is about doing any operation that can result in any new index, with the pieces concatenated back into a pandas structure afterward. So if I wanted to group these by their month and then take the cumulative sum of just the positive values, well, that's going to drop rows: the resulting index is neither just the groups nor the original index; it's the original index with pieces missing. I have to use apply, and when you look at the result of apply, you can see it tacks the indexing created by the UDF onto the indexing implied by the grouping, because you kind of want to know, oh, these were the dates that were grouped together to form this piece. This is the main case where you might want to drop one of these levels, do a little bit of reset_index, just to throw away the part you do or do not care about; but that's what apply is all about. There is one more additional dimension to this, which is that aggregate and transform work on a series basis and apply works on a data frame basis; but unfortunately we're not here to talk about that, we're here to talk about the index, and so we'll have to leave it for a future presentation.
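A compact sketch of the three, under the "what should the result's index be?" framing (data synthetic):

    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.default_rng(0).normal(size=90),
                  index=pd.date_range('2021-01-01', periods=90, freq='D'))
    g = s.groupby(s.index.to_period('M'))

    print(g.agg('skew'))          # one row per group: the groups become the index
    print(g.transform(lambda x: (x - x.mean()) / x.std()).head())  # original index kept
    print(g.apply(lambda x: x[x > 0].cumsum()).head())             # a brand new index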
If we look at about 80% of the pandas data frame API, almost all of these operations can be thought of in terms of something related to the index. Here is our data frame with our time series data, and I'm going to talk about as much of the pandas API as I can fit. There are some parts I won't talk about. I won't talk about things like ndim and shape, the number of dimensions and the shape of this thing: the number of dimensions of a pandas data frame is always two, the number of dimensions of a pandas series is always one, and the shape is always the number of rows by the number of columns, or the number of entries by one; not particularly interesting. I'm not going to talk about the metadata, like the flags and the attrs; not that interesting. I'm not going to talk about the index, the columns, and the axes attributes; axes is just the index and the columns in a tuple, not that interesting. I do want to talk about operations like stack and unstack. Here's what unstack is: unstack takes a layer of a multi-index and pivots it up into a layer of the columns; you take the innermost layer of the multi-index on the rows and pivot it up to be the innermost layer of the multi-index on the columns. Why might you do this? Because if you were to do matplotlib plotting directly from pandas, each column usually produces an independent line, and sometimes you might want one line per ticker for each price and each volume. Is this useful from an analytical perspective? Maybe, maybe not; but definitely from a plotting perspective: unstack, plot, and you get one line per ticker. Stack is just the opposite operation: it takes one layer of the column index and pivots it down into the innermost layer of the row index. Is that useful in this particular case? Probably not, because most operations you perform on a pandas data frame run down the rows, and I don't know what operation you could perform on prices and volumes together; that's not clear to me. But you can think of them as duals, and there may be a case where you want to take something and stack it or unstack it; in general, these are useful operations. Things like melt and pivot: if you look at a different data frame, a data frame that looks like this, this is something that might have been read from an excel file, because you can imagine how somebody laid it out in excel: each row corresponds to one ticker, the tickers have some sectors associated with them, and there are columns for the dates where they captured some measurement. When you look at that, you say, okay, that kind of makes sense; but what I want to do is operate on this in a consistent fashion. I actually want to take part of that column index and pull it down into the data. That's what melt is all about: melt says, take part of my columns, namely everything but the sector column, and move them down into my data. Here you can see the column labels become this "variable" column and the corresponding values this "value" column; it's a somewhat bizarre set of names, but there's var_name and value_name to control what these columns are called, if you don't ignore_index you preserve the original indexing, and if you set the id_vars you duplicate any columns that are identifier variables. (This is really two data sets that have been stored in the same structure.) Here, what I probably want to do is set the index on this thing and sort the index, so now I have each ticker and each date in the index; I probably even want to swap the levels so it's date, then ticker, then the sector, and sort the index on that; and here's what my result looks like. If we think about melt in that fashion, pivot is kind of the dual of melt, and when you think about something like pivot, again you can think about it in terms of operations performed on the index: a pivot is just about doing some sort of groupby and then an unstack.
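A small sketch of unstack and melt (the layout is illustrative):

    import pandas as pd

    # (date, ticker) rows -> one column per ticker, ready for plotting
    ix = pd.MultiIndex.from_product(
        [pd.date_range('2021-10-01', periods=2), ['abc', 'xyz']],
        names=['date', 'ticker'])
    prices = pd.Series([10.0, 20.0, 11.0, 21.0], index=ix, name='price')
    print(prices.unstack())        # innermost row level pivots up into the columns

    # a spreadsheet-shaped frame: wide date columns melted down into the data
    wide = pd.DataFrame({'ticker': ['abc', 'xyz'], 'sector': ['tech', 'energy'],
                         '2021-10-01': [10.0, 20.0], '2021-10-02': [11.0, 21.0]})
    print(wide.melt(id_vars=['ticker', 'sector'], var_name='date', value_name='price'))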
So if I took these values and grouped them by the month in which they occurred, took the sum of the values for each month, and then unstacked, you can see I'm doing a pivoting operation: taking what was the innermost layer of the row index and moving it up into the innermost layer of the column index. A groupby plus an unstack is basically a pivot table operation; a pivot table operation is just you explicitly saying: what do I want the row index to look like, what do I want the column index to look like, and how do I want to compute the cell values? So here I pivot this by saying: I want my index to be the dates by month, I want the columns to be whatever was in that column called ticker, and I want the aggregation function to be the sum of everything that matched up. That's all a pivot table is: an operation where you set what the columns are and what the index is. sort_index: obviously, sort by the index, and you can sort by the index on either axis; on the columns, this just alphabetizes them, ascending or descending, because it sorts on that column index, and the columns are just strings. Why does sort_index use ascending equals True and False instead of reverse, like the sorted function in python? I don't know; there are cases where the pandas API diverges from what we might expect, probably influenced by some non-python tool. I can sort by values; okay, that's pretty obvious. reset_index takes a layer of the index and pops it into the data. Is that useful? Well, in some cases you do want to throw away the index, or you may have preserved the index as metadata deep into an operation, and at that layer you know the index only contains one value, so it's no longer valuable to keep it, say as part of some subsetting where everything is the same ticker; or maybe you did a groupby, some sort of nested groupby-apply, and you want to pop the index off rather than have it percolate back up; you can drop it as well. set_index is basically the dual of reset_index: it says, take a column, take the data, and move it into the index. All of these operations are just moving things around: from the column index to the row index, from the data to the index, from the index to the data. Here you can set the index with append, and it'll just set it as an extra level. droplevel is just about dropping one level of the index: take the part of the index that has ticker in it and drop it, by name, and that's another reason why naming these makes sense. reindex says: take the data set, don't change what's in the data's meaning; give me a new structure, and look up all the values using this new index. So reindex here is just saying: give me all the values from the 1st of February, every 14 days, for four periods, so up until about mid-March. set_axis, very strangely named, is about setting the values of the index wholesale, given the actual values themselves: so taking the columns, upper-casing or title-casing them, and setting that as the column axis. This is just setting the values of the index, mutating what the index contains, presuming the index is backed by some data; what this would do for a non-data-backed index might be ambiguous. I'm not sure what this will end up calling; I think it will actually end up doing set-item operations on the index, and how the index mediates that and figures out what to do is up to the index mechanics itself.
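A sketch of the groupby-plus-unstack versus pivot_table equivalence described above (the frame is illustrative):

    import pandas as pd

    df = pd.DataFrame({
        'date': pd.to_datetime(['2021-01-05', '2021-01-20',
                                '2021-02-03', '2021-02-17']),
        'ticker': ['abc', 'xyz', 'abc', 'xyz'],
        'value': [1.0, 2.0, 3.0, 4.0]})

    months = df['date'].dt.to_period('M')
    by_hand = df.groupby([months, 'ticker'])['value'].sum().unstack()
    via_pivot = df.pivot_table(index=months, columns='ticker',
                               values='value', aggfunc='sum')
    print(by_hand)
    print(via_pivot)   # the same table, two spellings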
This is one place where the index API can be a little bit large, and maybe this has increased the size of the API for the index. You know what rename does? All it does is set the values of one of the two indices, the column index or the row index, using a mapper: it's basically set_axis, but using a mapper instead of the data. So here, instead of actually performing the computation to title-case all these columns, I can just give it a function to title-case them for me. And what rename can also be used for: if I didn't want to do this on the columns but on the rows, maybe I want a uniform transformation of part of this index, taking each date and pushing it forward or back one day, I can do a rename on this; and you can see this pulled everything back one day, visiting every one of those index values and performing an operation on them. swaplevel just swaps levels of the index; reorder_levels just reorders the levels of the index given some ordering. rename_axis renames what the axis is called, not the contents of the axis but what the axis is called, because if you look at .columns, that .columns is an Index that has a name field. As you do stack and unstack operations, that name field is used to figure out what to name things when you pull something from a layer of the column index into the row index; we did that with our previous stack operation, and you can see it was blank there, because it wasn't named originally. So rename_axis is about naming these, and it's valuable to name them so that when you do multi-indexed index alignment, pandas knows how to match things up. That's why that rename_axis came up before: pandas didn't know how to match things up, so you named the layers of that multi-index in the previous example, and pandas said, oh, now I know what matches up; index alignment, and you're off to the races. And you see, when you stack this thing and look at the index, that one layer of the index didn't have a name because I hadn't named it; that's all this is about. swapaxes is just about transposing; I don't really understand why swapaxes exists in addition to transpose; sometimes the pandas API isn't always clear, and there may be some use case for swapaxes that's not the same as transpose. squeeze is just about taking something that possibly could be a series but is currently a data frame and dropping it to a series. It doesn't force something into a series, and here you can see it didn't do anything to my data frame; but if my data frame happened to be just one column, squeeze says, you know what, I'm just going to make this a series. And notice what happened to the name: the name is whatever the name of the original column was. That's where squeeze comes up, and you can squeeze on either axis if you want, the index or the columns. explode is not quite the opposite; explode is the idea that one of the values in your data might be some sort of nested information, like, say, a list, as a consequence of how you read the data in; maybe it was a csv file where each entry had some sub-entries. explode just explodes, and notice it explodes down the rows, and notice how it preserves the indexing. See, duplicated values in an index come up all the time: the seven stays attached to the three, the three stays attached to the zero; it's just exploding something nested into additional rows, duplicating the index labels as it goes.
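A small sketch of explode preserving (and duplicating) the index:

    import pandas as pd

    s = pd.Series([[1, 2], [3], [4, 5, 6]], index=[0, 3, 7])
    print(s.explode())
    # 0    1
    # 0    2
    # 3    3
    # 7    4
    # 7    5
    # 7    6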
Common operations like .at and .iat, and .loc and .iloc, are just about accessing things. With .iat you can access a single value by its position, and with .at a single value by its label, very similar to .loc and .iloc; the one tiny difference is that .at and .iat will always give you the scalar value, whereas .loc can potentially give you a series or a data frame, a pandas structure, back. Other operations: .head is basically just an iloc giving me the first three, or the first n, entries; .tail is just an iloc giving me the last n; these are index-unaware operations; .sample is just an iloc giving me some random choice of positions. All of the binary operators, like .add, are index-aligned: they take the index from one and the index from the other and line them up; here I'm adding this to itself, so obviously it'll already be index-aligned. .sub is the same; sub is just because people don't want to write subtract, though some people really want to be very clear that this is about subtraction, not about submarines. .mul and .multiply, same thing, index-aligned operations; .div and .divide, also index-aligned operations. Because pandas is built on the numpy ndarray, which is a restricted computation domain and an interpretive view of raw memory, the division scenario we had in python 2, where there was a difference between dividing two floats and dividing two ints, persists in the numpy and pandas universe, and so numpy and pandas make it explicit whether you want a true division or a floor division. Things like .pow are basically exponentiation: you can see here, if I had two columns and wanted to exponentiate the price column by one thing and the volume column by something else, I'd want to align the series on its index against the data frame on its columns; same thing with modulo. Other operations like .dot: a dot product is basically a multiply and a sum, again index-aligned. Your comparison operations, all of them, are index-aligned: they all just match things up on the index and check whether the value at the corresponding index is equal, not equal, or whatnot; they're all index-aware operations. If I have two data frames, df1 and df2, .align just tells me how they would be aligned: you can see .align is just telling me what the result is going to look like if I try to align these, so it's maybe like our broadcast_to, a way to peek into what an operation will perform; and I can choose whether that alignment is done only on the index or only on the columns, if I'm curious: if I align these just on the index, what do I get? Just on the columns, what do I get?
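A sketch of .align as a preview of what an operation would produce (frames illustrative):

    import pandas as pd

    df1 = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
    df2 = pd.DataFrame({'a': [3, 4], 'b': [5, 6]}, index=[1, 2])

    left, right = df1.align(df2)          # both axes unioned, NaNs where missing
    print(left)
    print(right)
    left, right = df1.align(df2, axis=0)  # align only on the row index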
and merge is just a generalization of this join idea. i don't think join should take the suffixes; i think what join should actually do is this: you take one of the data frames and you do a set_axis to add another layer to the multi-index that is the columns, and that gives you one more layer, there. you do the same to the other side, and then you join them. if you join things with multi-index column axes, you can see it joined nicely; you can see where the a is and the b is for each of the two sides. and once you have that, you can rename what this axis is, so you know: the left and the right, that's the side, and a, b, a, b, those are the columns. and then you can do group-by operations across the columns to explicitly identify, do i want the max of one side, the maximum. i think that's how join really should have worked: it shouldn't have taken a suffix, it should have created a multi-index instead. abs: absolute values, obvious; it just takes every value in there and computes the absolute value. there are a lot of these operations that just operate on every appropriate numerical value, like abs, or clip, which just goes to every value and clips it by something, and those are maybe not index aware. things like min are index aware: a min reducing down the rows (the default) gives you a result indexed by whatever the columns were, with the minimum value of each column; a min across the columns gives you a result indexed by the rows, with the minimum value of each row. same thing with max; same thing with sum, prod and product; same thing with any and all; same thing with mean, standard deviation, variance; same thing with kurt, kurtosis, and skew. for median, the same idea: the indexing of the result is going to be the indexing of the axis against which you operate. mode is a little bit different, because mode gives you the most common values, and if there are ties it will give you all of those values. here this is a range index, but this range index makes sense, because that's the zeroth most common, the first, second, third, fourth, and so forth most common value, and so you can see there are a lot of nans. if you index align this back together, you can do some very nonsensical things, like a mode on the columns here, but we'll skip that. count just counts the number of values on a particular axis, just like min and max. nunique counts the number of unique values, same thing. mad: i'm so mad that i didn't realize what the index was; that's what mad does, it reminds you that you should be mad that you didn't pay more attention to the index when you first started using pandas (it's actually the mean absolute deviation). value_counts just counts the number of unique values; here this is being done on both of the columns, so this is counting the total number of times each combination, each tuple of these columns, appears, and if you look at the index of this result you'd expect it to be a multi-index with the values of the price column and the values of the volume column.
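the talk sketches that alternative with set_axis; one minimal way to get the same multi-index-column layout, using pd.concat with keys rather than set_axis, might look like this (the frames and labels are made up for illustration):

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]}, index=["y", "z"])

# what join does today: overlapping column names get suffixes bolted on
joined = df1.join(df2, lsuffix="_l", rsuffix="_r", how="outer")

# the multi-index alternative: each frame sits under its own layer of
# the column index, and that layer can be named with rename_axis
combined = pd.concat([df1, df2], axis=1, keys=["left", "right"])
combined = combined.rename_axis(columns=["side", None])
print(combined["left"])  # select one whole side cleanly, no suffix parsing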
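and a sketch of how the reductions are indexed by whichever axis you operate against (made-up labels again):

import pandas as pd

df = pd.DataFrame(
    {"price": [100.0, 101.5, 99.75], "volume": [10, 20, 30]},
    index=["AAPL", "MSFT", "IBM"],
)

# reducing down the rows (the default): the result is indexed by the columns
print(df.min())        # price 99.75, volume 10

# reducing across the columns: the result is indexed by the rows
print(df.min(axis=1))  # one minimum per ticker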
nsmallest tells you the smallest values for a particular column: nsmallest(3), what are the three smallest values for a particular column. and when i say for a particular column, i mean using that particular column as the predicate to determine which is smallest. so nlargest says: using the price column, what is the first largest value, the second, the third, and so on. this is useful, and the indexing is useful, because oftentimes it's not the smallest value in a particular column that you want, but the corresponding value in another column. in other words, this line says: tell me what the three smallest values in the price column were, tell me what the index labels for those are, and then look up what's in the volume column, the volumes for those three smallest prices. this is where the index is helpful; imagine how much code you'd have to write to do this if you didn't have an index and had to recapitulate the index yourself with a bunch of for loops. it would be terrible. things like idxmax and idxmin are very similar: you find the index label where the max or min value sits, and then you use that label to go look up the actual values. things like cumulative min, cumulative max, cumulative sum, and cumulative product just perform operations down the rows, and you can see here cummin just found the running minimum going down. now this isn't particularly useful as it stands, because you can see it's mixing tickers against each other; there should be some sort of grouped operation where you first group by the ticker and then perform a cumulative min. it should be something like: group by ticker, and then, preserving the indexing, take each series and do a series cummin; that's probably what you want, the cumulative min keeping the tickers separate. largely you can see these are just window operations; in fact they're largely equivalent to expanding operations on this, except there's no expanding prod: a cumulative operation is just an expanding window operation. speaking of window operations, there's also ewm, and this ewm, like the other window operations, gives you something that is indexed by the original structure: ewm preserves the original indexing, and expanding preserves the original indexing. operations like shift and diff just move things around on the index: they move the data but keep the original indexing the same; diff moves the data and performs a subtraction as well. rolling says give me three windowed values, and it will always preserve the original indexing, because it looks at three rows at a time and doesn't know how to collapse those into a single index label. and here's another grouping operation: group by, do a mean; that creates the indexing based off whatever you're grouping by. groupby agg, same thing, whether with a built-in function or a user-defined function. groupby transform preserves the original indexing; groupby transform with an iloc also preserves the original indexing, and then you look at what the result is. groupby agg, operating on a series all by itself, changes the indexing of the result to whatever you're grouping by. resample, well, resample doesn't work on this directly, because it's not a datetime index; resample, telling it which level of that index to use, works, but this is like a group by where you're grouping by the resampled index. and so probably what you want here is something that first groups by the ticker and then resamples within that, preserving the original outer indexing, so you can do further index operations.
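a minimal sketch of the nsmallest-then-look-up pattern; the prices and volumes are made up for illustration:

import pandas as pd

df = pd.DataFrame(
    {"price": [100.0, 95.0, 102.0, 98.0], "volume": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

# the three smallest prices, index labels preserved
cheapest = df["price"].nsmallest(3)

# use those labels to look up the corresponding volumes; without a
# shared index, this is the part you'd hand-write with for loops
print(df["volume"].loc[cheapest.index])

# or in one step, using price purely as the ordering predicate
print(df.nsmallest(3, "price")["volume"])

# idxmin gives the label of the minimum, for the same kind of lookup
print(df["volume"].loc[df["price"].idxmin()])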
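and a sketch of the grouped versions of these operations the talk gestures at; the tickers, dates, and the 2-day resample frequency are made up for illustration:

import pandas as pd

df = pd.DataFrame(
    {"ticker": ["AAPL", "MSFT"] * 3, "price": [3.0, 7.0, 1.0, 9.0, 2.0, 5.0]},
    index=pd.date_range("2021-10-25", periods=6, name="date"),
)

# a bare cummin would mix the tickers together; grouping first keeps
# each ticker's running minimum separate and preserves the original index
print(df.groupby("ticker")["price"].cummin())

# resample needs a datetime index; grouping by ticker and then
# resampling aggregates per ticker per period, with the group key as an
# outer layer of the result's index
print(df.groupby("ticker").resample("2D")["price"].mean())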
as you can see from this very tedious walkthrough of almost all of the operations on a pandas data frame, they all have some relationship to the index: they're either index aware, they operate on the index, or they move data into or out of one of the two indices. everything comes back to the index in the end, and rather than studiously memorizing all of these operations, it's probably better to understand the rules of indexing and then apply them to the operation you need to use. and so, in conclusion, if we think about why we use pandas in the first place: is it because we have a container type that's less convenient and more perplexing than a simple list or dictionary? no, because a list and a dictionary can't do what a pandas series or data frame can do, which is that really nice matching up of things for operations, with that additional associated metadata. is it for the bizarre errors? no, because most of these errors actually make sense once you understand how the indexing works. is it because we constantly want to use .values to force it to do what we want, because we constantly want to do a .set_index? no, because usually if we're doing a .values or a .set_index, we're doing it because we don't care about the indexing; if we're aware of index alignment, we generally don't need .values or .set_index that much. is it because of all these weird, incomprehensible decisions in the api? sure, things are named kind of weirdly in some places, but for the most part a lot of these decisions, like the differences between the group-by operations, make sense in terms of how they preserve the indexing. is it for the minor conveniences? definitely not, because the convenience that the index provides is a major convenience, way more valuable than not having to import something like scipy.stats. is it for the 250,000 lines of complexity pandas introduces into our lives, as a dependency that could break, that needs to be updated, that we have to keep up with? it's a lot of code; this is a big tool and a big project, and using pandas is a big investment in your tools. but as part of that investment, as part of the investment of becoming a pandas expert, understand what the index is all about, understand what index alignment is all about: those 250,000 lines of code are there to support extremely convenient, extremely sophisticated index-aligned operations on single-indexed structures, and on collections of like-indexed, doubly-indexed structures, the pandas data frame. and so, why do we use pandas? is it for any of these reasons? is it for the index? no, it's probably because we have a co-worker who wrote 50,000 lines of pandas and then switched jobs and left us with it, and we're not rewriting that from scratch. that's probably why we use pandas. but if we're going to use it, we might as well know what we're doing; we might as well try to become pandas experts. thank you so much for watching this presentation; i hope to see you in the
follow-up session, where we'll take a look at all of these details and show you how almost everything that happens in pandas can be understood in terms of the index and index alignment. but with that, i'm james powell; it's been a pleasure presenting to you. thanks, everybody.
Info
Channel: Don't Use This Code - Training
Views: 2,053
Rating: 4.775281 out of 5
Id: oazUQPrs8nw
Length: 121min 20sec (7280 seconds)
Published: Fri Oct 29 2021