Keynote: What Does It All Really Mean - James Powell

Good morning, PyCon India. Thank you so much for inviting me to keynote. I wish we could have done this in person; however, it's shortly after 4 a.m. where I am, so if you give me just a couple of moments, we'll get started with our keynote for this morning. Okay, let's get started. This is "What Does It All Really Mean". We're at PyCon India, it's Saturday, October 3rd, 2020, and I'm James Powell. If you like this talk, you can follow me on Twitter at @dontusethiscode. I give a lot of talks of a similar nature, and hopefully, if this is something that you're interested in, you'll have a chance to see more of this in future.

Now, I wanted to give you a little bit of context for what this talk is about. Some people have always asked me, "Why 'don't use this code'? What does this mean?" You might think it's because in the past I've given a lot of talks that were kind of gimmicky or frivolous, about small little niche details or doing things you're not supposed to do. And it is very much the case that "don't use this code" wasn't intended to be a disclaimer in terms of, you know what, don't take this too seriously, don't use this code. However, originally "don't use this code" was really more about a particular approach that I was taking in thinking through particular problems in order to derive actual practical benefit for myself. To give you a little bit of context around what that procedure was: I often do consulting work or coaching work or training work with people who have to do programming but aren't actually programmers. Think an optical engineer or a network engineer or a physical scientist or a data scientist. Programming is a part of the work that they do, but they never call themselves a programmer. As a consequence, sometimes they're hesitant to employ a certain degree of technical sophistication, or they're even hesitant to focus on certain details, because they wonder: do the details matter? Is this really giving me value in my life? Is it really helping me
out to learn all the nitty-gritty details of a language or an API or a tool set? It's just a means to an end for myself. And it is very much the case that for a lot of their work, the details don't really matter: just make it work, get the results you want, move on to the next thing. But it's also the case that sometimes the details do matter. I think all of you have had a situation where you've seen some code that was pushed to production by somebody who was a little bit sloppy with the details or didn't employ a sufficient degree of technical sophistication, and the work had some failure or some deficiency that led you to have to go and fix it later or rewrite it from scratch. So we know that it's not quite the case that the details strictly don't matter, or that they strictly do matter. What I want to show you is a particular approach that I take, where I try to motivate meaning through investigation of the details. This is really the ethos of why I started with "don't use this code". The whole idea is this: if you take a really, really close look at the details, and you go as deep as you possibly can, then on the way down and on the way back up you're going to find meaning along the way. That meaning is not going to be gimmicky or niche or inapplicable to practical problems. It's going to be something that gives you actual value in actual things that you solve, or actual things that you have to deal with in your daily work.

In order to really show you what I mean, I want to start with an example. We're going to presume that we're doing some kind of code review; this is a code review example. We've run across a little bit of code, and there are some details in that code. In fact, the details in this code are very, very small. I want to show you how these tiny details, these almost trivial details, can actually give quite a significant amount of meaning and understanding to the work that might not be visible otherwise. Along the way, I want to kind of introduce
you to the thinking process that I take. So let's get started. Let's say that you run across a line of code that looks like this somewhere in a code review. There's not a whole lot you can say about this; it's very simple. Okay, maybe you could complain about the variable name, but beyond that there's really not anything that's interesting there. But what's interesting here is that instead of writing xs as a list of the numbers one, two, and three, somebody could have chosen to write it as a set of the numbers one, two, and three. Somebody could have chosen to write it as a tuple of the numbers one, two, and three. Somebody could have chosen to use a numpy ndarray in order to store this data. Somebody could have chosen to use a pandas Series. They could have chosen to use a pandas array, if they were particularly clever and using an up-to-date version of pandas. Or they might have even chosen to use a pandas DataFrame, but it's likely that in this particular case this is a little bit gratuitous.

Now, what's the difference between these? Why did somebody make one choice? Does it really matter? Who cares? Genuinely, who cares? Why did one of the choices that somebody made, to use a set here versus a list versus a tuple, make any difference to anybody's life? Well, if you think about it, if the goal here was to store a couple of numbers and then just compute something on them, like their sum: you store them in a list, the sum is six. You store them in a set, the sum is six. You store them in a tuple, the sum is still six. You store them in a numpy ndarray, the sum is still six, and you just have two ways to compute the sum, this API and the numpy API. You store them in a pandas Series, again the sum is still six. You put them in a pandas array, surprise, the sum is still six. You put them into a DataFrame, the sum is still six, with the small caveat that this doesn't actually work; you have to write this instead. But this is three characters different, and they're characters that you'd find the
difference between these two, you'd look it up on Stack Overflow, you'd ask somebody to help you out. What does it really matter? Who genuinely cares? Why does anybody care about the difference between these types? This is a very small, very minor detail and a very small question. But as we go through this exercise, I want you to see how this question can drive an enormous amount of meaning and understanding for what Python is about and what these tools are about. So let's get started.

Now, our goal here is to determine what this all means, and not only what it all means, but what it all means in the context of things that will actually help us write better code, that'll help us improve our lives in some measurable way. What we're going to do is embark upon a thinking exercise, and this is the thinking exercise that I employ as part of all the talks that I give and as part of all the consulting work, all the training work, all the coaching work that I do. Now, you might look at this term "thinking exercise" and wonder, well, why did they say "thinking exercise"? Why didn't they say "thought experiment"? Hold on a moment: this isn't so fancy that we call it a thought experiment. It's just a thinking exercise.

So the very first place we start is: let's just compare some of these options one on one. Let's compare the option where they encode this data as a list and the option where they encode this data as a set. And we'll make this example even simpler: we'll just have an empty list and an empty set. So here we have an empty list, here we have an empty set. What's the difference? It's two characters different, slightly different in terms of whether I do or do not need the shift key on my keyboard; they're actually the same key on this particular keyboard. There doesn't seem to be anything particularly interesting here, and it doesn't seem to be a detail that somebody who's not a programmer is going to really, really care about: "Oh, use the square brackets, not the curly braces." Now, one
thing to clarify: this is not actually a list and a set. I hope you all were able to catch it. This is actually an empty list and an empty dictionary. This is an empty list, and this is an empty set. I hope you all caught that.

Now, going back to our example, we might say, well, the difference between these two is that the list is a sequence and a collection, whereas the set is a set. And one interesting thing that we're going to see in a moment is that what's actually meaningful here about the list is not that it's a Collection with a capital C, but that it's a collection with a lowercase c. When you look at that, this is an even smaller difference than the choice of punctuation that I'm using in order to create this type. Whether it's a capital C or a lowercase c, this is a very small distinction, and yet it is surprisingly meaningful.

Now, what do we mean when we say that this is a sequence? Well, we might know that in the collections module in the standard library, in the abc submodule, there's a capital-S Sequence object. We can import it, and we can say, is a list an instance of a Sequence? And Python will say yes, a list is an instance of a Sequence, or this particular list is an instance of a Sequence. We could do the same thing with the set. We could say, is this set an instance of this collections.abc.Set? That doesn't seem to be particularly interesting. It seems to be a formality; it seems to be something that somebody might care about at the edges. But we still haven't motivated why it matters that one is a set and one is a sequence. We haven't motivated how this has changed my life. How does this help me do anything that I want to do better? We've just added in some jargon and some vocabulary. And in fact, if we try to take this a little bit further and we say, well, hold on a second, it's not just that it's a Sequence with a capital S, it's that because it's a sequence it supports these indexing operations. You have this __getitem__ protocol. You can say, give me the zeroth element, the first element, the
second element. And in fact you can even do some kind of subsetting of this data: you can say, give me a slice from the zeroth up to but not including the first element. That's a little bit more interesting; that's an operation that you can perform on one of these but can't perform on the other. But again, somebody who knows a little bit of Python might look at that and say, well, you know, a dictionary is not a sequence, but I can also perform that square-bracket indexing. So the core meaning here is not that you can put a square bracket after this, that you can use the __getitem__ protocol. There's something more here, because it's not just the __getitem__ protocol that's interesting about one being a sequence and the other one being a non-sequence. We could even maybe argue that if somebody somewhere had chosen to make slice objects hashable, we could even do something very similar, visually and formally in terms of the syntax, to the slicing operation on the dictionary. Unfortunately, this doesn't actually work, because the slice object isn't hashable, and so you can't look up the key of the slice object in this dictionary. But ultimately we still haven't gotten to anything that's actually helping us in our day-to-day lives.

Now, if we look closely at the Sequence object, we can see it has a register function, and we could even tell Python to register the dictionary as a Sequence. Then we ask Python, is a dictionary a Sequence? It'll say yes. So this Sequence object in collections.abc really isn't that particularly interesting. It's answering a question for us that we might ask at some point, once we have already determined why it is interesting which of these two choices somebody has made. But just that we can arbitrarily register something, just because it has some particular syntax, isn't quite getting to what the distinction is here. If we wanted to go even further, we could even say, well, you know what, I can see that there's a Sequence and a
MutableSequence, and we'll talk about this a little bit later, and we could register the dictionary as both a Sequence and a MutableSequence at the same time. But again, we're not really getting to the core of what this actually is, in order to drive something that is of practical benefit.

With the set, we can try the same thought experiment, we can try the same exercise, but we'll see that the answer is a little bit closer to the surface. We can ask this set object, are you a Set? It'll say yes. We can say, well, what does it mean to ask whether this thing is a Set? Obviously it means that it's been registered as a Set, but setting that aside, we can say maybe it kind of means that you have this ability to perform certain operations on it, the operations encoded by the ampersand, the pipe, the caret, and the hyphen. And it's not just that you can apply this particular syntax. Certain things in the collections.abc module, like Callable, really just mean: can you apply the open-close parenthesis syntax after the object? And you can see the syntax itself is not what's interesting here, because, you know what, I can ampersand, pipe, caret, and hyphen an integer as well, but it's not a set. It's that when I do the ampersand, pipe, caret, and hyphen, I actually mean something. By ampersand, I mean find me the elements in common, do a set intersection. By pipe, I mean find me all the elements, the set union. By the caret, I mean find me the unique elements on one side or the other, the set symmetric difference. By the hyphen, I mean find me the set difference: take all the elements of this and subtract the elements of that. And here we can see, oh, is something a set? That's interesting. Well, it's kind of akin to a mathematical set: I can perform set-like operations. In other words, if I have some data and I need to figure out from that data what are the elements in common, what are the elements that are unique, what are the elements in one but not in the other and vice versa, oh, I probably want to
use a set. That's actually valuable to my life, because if I see a problem of that nature, then I know what I want to choose. But we're still not quite sure why this list is interesting as a sequence. By the way, when we talk about the set, we can also say that there's another notion here of, oh, there are unique elements, and with the list we don't have this guarantee that all the elements will be unique. So if it were the case that we had some problem and we wanted to say, I don't want to worry about duplicates, automatically elide duplicates, the set type would be very valuable for us. But the real question here is, is this meaningful? And I think it is meaningful that we typify this thing as being a set. We can see that there are certain operations that mean something to a human being, that help a human being solve a human problem they want to solve, that help a human being model data in a particular way they want to model it. But we're not quite sure how that applies to the notion of the list being a sequence.

Well, if we step back a bit and we separate ourselves from the notion of the sequence just being the indexing operation, and we think about this for a moment, we can say the set isn't a sequence but the list is. One of the things that's implicit about being a sequence, what makes the list a sequence but not the dictionary, is that it has an ordering, a human ordering, an ordering that a human being cares about. And just like I can say I have some data and some problem where my problem is to find the common elements or the disjoint elements, maybe I have some data where what's important to model that data is when the data arrived, when the data will be serviced, whether this is first in, first out or last in, first out. So a sequence, or an ordering, is an interesting property that may or may not exist in my data. If I need to model, for example, something that you might call a queue, then maybe a list might be valuable, or maybe even a collections.deque. If I need to model something akin to
a stack, maybe a list type is valuable, because that ordering principle that I get from it being a sequence is actually what I'm looking for. And it's surprising: in my experience working with semi-technical and non-technical users, a lot of them actually very much understand notions of, oh, this is a first-in-first-out or a last-in-first-out algorithm. So when you tell them, oh, we'll use a set here, or sorry, we'll use a list here, because we need that sequencing behavior, it's something that they might be able to understand and appreciate. So I would argue that the list being a sequence, if we dig a little bit deeper, is actually something that's meaningful.

Now, the distinction here is not super meaningful; they're not particularly that similar, the list and the set. But you can see that through this comparative exercise we can try and tease apart some differences here. If we want to really drive for a little bit more meaning, and we want to make this even more meaningful, let's take a look at another distinction. Let's try and distinguish the list versus the tuple. Now, I bring this up and add this example here for one or two reasons. The first is this: I often ask, as a benchmark to try and get an idea whether somebody really has a clue when they're using Python, what's the difference between a list and a tuple? In fact, I ask them a three- or four-part question: what's the difference between a list, a tuple, and a numpy ndarray? And I try and get a sense for what their answer is, and from their answer you can kind of see: are they stuck at the surface details, or do they really understand what these things are, and are they able to employ these tools in a fluent way? You can see syntactically the distinction is quite minor. One has square brackets, one doesn't. I could put optional parentheses around this tuple, but it's not a syntactical distinction. Oftentimes when I ask
people what's the difference between a list and a tuple, I get some trivial answers about some behaviors of them. But let's try and go through the exercise in a similar fashion as before. If we take a look at the list and the tuple, we might say they're both sequences and they're both collections. They both have indexing operations available to them, they both have some kind of ordering, they can both be sliced. They're both capital-C Collections, and I told you before, you know, I actually care a little bit more about the lowercase-c collection. Oh, by the way, if we check this using the collections.abc module, we can see that they're both Sequences and Collections. And if we ask, do they support the syntax that we want, can they be indexed, can they be sliced: they both can be indexed and can be sliced. But one thing that we might notice, if we go back to this notion of the Sequence versus the MutableSequence: it is very much the case that the list is a MutableSequence but the tuple is not a MutableSequence. So here we might have teased apart something that might appear to be a difference: one is mutable and one is immutable. And we might say, that's the meaning, that's the choice I'll make. A list and a tuple, they're basically the same thing, except one can be changed and the other one can't be changed. So if I need the data to be mutable I use a list; if I need the data to be immutable I use a tuple.

And I think that's missing the point. I think the distinction between, oh, we have this list, we can change the elements in place, we have this tuple, we can't change the elements in place: it's an important difference, and it's definitely something that will affect the code that you write. It's not completely trivial. But it doesn't really get to the meaning, because we haven't gone deep enough. So let's go a little bit deeper and think a little bit deeper. When we talk about mutating a list, if we look around at how we mutate a list, one of the most
common ways we mutate a list is not by changing individual elements so much as adding elements or removing elements, appending elements at the end. Now, if we were to do a list append operation and it looked like this, and we were to slightly change our syntax and say somebody has written a function called append that exists somewhere, and that's what appends to the end of the list, I don't think there's a really major difference between these two syntaxes; it's a very minor difference. Now, if we take this syntax and extend it very slightly by just doing one more assignment, and then actually implement append, we see something very interesting here. Here we have a list, but we're treating this list as though it were an immutable type, and nothing really major about how this code might be subsequently used has been affected. There may be some early-versus-late-binding, live-versus-snapshot-view changes that might affect us, because fundamentally people might expect the list to be mutable, and they might expect multiple references to all be updated in sync, versus a tuple being immutable, so that every time you need to perform some transformation you're making a copy.

But if you think about it, and you try and extend your thinking beyond just the limits of Python, you might say, well, hold on a second, I also write a little bit of JavaScript, because, you know, I need to create a front end for the data science reports that I create, and I like to use Immutable.js in JavaScript. In Immutable.js they also have a List type, but the Immutable.js types are all immutable types. And you can see, you know what, Immutable.js probably kind of works like this. What makes the list a list, what makes this list a meaningful thing, is not really the mutability versus immutability. It's got to be something else. It's definitely an important characteristic, and it's definitely something that affects your program, but there's something more here.

Now, if we were to try to play the same game with the tuple, and this is the
second reason I wanted to show this to you, we could think about this a little bit. Fundamentally, the tuple is some memory that sits somewhere, that had to be allocated by somebody, and the memory was uninitialized at the beginning, and you had to put the elements into it. So it stands to reason that somewhere inside the CPython source code there's a way to mutate a tuple; it's necessary in order to create the tuple in the first place. And it turns out that happens to be the function PyTuple_SetItem. And it turns out, if you look at PyTuple_SetItem, it's actually exported, because third-party extension modules may also need to create tuples to interact with other Python code. So as a consequence you can think, well, you know what, it's actually quite trivial to mutate a tuple: just write a C extension module, or write some code in Cython, call PyTuple_SetItem, and change the tuple in place. But you might back up there and say, you know what, this is actually quite trivial. You're talking about an implementation detail, and we can't really derive that much meaning from implementation details, because those details could change over time. Who knows, maybe some initiative to simplify the CPython C API might come into play and we no longer have access to something like PyTuple_SetItem. It's entirely possible. And would that fundamentally change the meaning of the code we've already written? I'm not sure.

Now, if we were to take this example a little bit further, and we're trying to think, well, can I find a way to make that tuple mutable, and can I do that in a way that drives additional meaning, we might say, you know what, I'll define a function, and in the function I'll do an assignment, and I'll look at the disassembly for that function. I'll see it looks something like this: you load a constant, the Ellipsis object; you store it to the x variable; you load the constant None; and you return that value. This function
just happens to return None. And if you squint a little bit and you zoom in a bit, you might look at the STORE_FAST instruction and wonder, what does that thing do? So you might go a little bit deeper and say, well, in the C evaluator, in ceval.c, the Python main loop, we have this implementation for the bytecode that stores the thing. And you might squint at this again and say, what's interesting about this is the SETLOCAL operation. It turns out that SETLOCAL is a C macro, and if you squint at that a little bit harder you might say, well, that's actually calling the GETLOCAL C macro. And if you look at the GETLOCAL C macro, you can see there's something in the CPython implementation called fastlocals, and it's doing some kind of array lookup. So what it looks like is that whenever you're trying to store a variable in Python, in the end it turns into an array access in C.

But here's something interesting: there are no bounds checks. Because there are no bounds checks, what that means is that if you're doing a STORE_FAST and you can create bytecode, you can tell it to write to that array anywhere, and it doesn't even check that the index is non-negative, meaning STORE_FAST is an easy way for you to do arbitrary memory access. What that means is, if you could construct some poisoned bytecode, and in that poisoned bytecode you could say, don't STORE_FAST to an actual place where an actual variable is, but compute some offset that allows us to write to arbitrary memory, and that arbitrary memory happens to be a tuple, you just made tuples mutable. Now, I would share the proof of concept of this. We wrote a proof of concept of this a couple of years back. It's way too big to put onto the screen. It involves splaying generators into your heap until you find one that's at the right distance from the thing, or rather, it's splaying coroutines into your heap at the right distance. It's enormously complicated to create, and we're still
investigating ways to just take this STORE_FAST and use CodeType in order to create poisoned bytecode. There's at least one proof of concept of this in play. Now, setting that proof of concept aside, we might say again, this is really weird, it's really niche, this does not seem to benefit my life. I just want to write some code that works. You're going to tell me about a list and a tuple, and you promised me that it was something that would be valuable to my life. You haven't made good on that promise yet. Well, let me see if I can come up with a different way. And here you can see why I wanted to talk about list versus tuple: I secretly wanted to show you a couple of different ways to mutate tuples. Let's talk about way number three.

We know there's a library called numpy, and numpy provides you a data type called numpy ndarray. We might not be sure what the numpy ndarray type is. In fact, that's one of the choices that we had in how to model this data, and because that's one of the choices that we had in modeling this data, it's going to be important for us to try and figure out how this thing is different from the types we've seen so far. Now, if we dig around a little bit in numpy, we might say, well, numpy not only provides this type, but it has a library; it has things like linear algebra operations. If we dig around a little bit more, it has stride_tricks, and inside stride_tricks there's a function called as_strided. And it's still not quite clear; we're not even sure what the numpy ndarray is, and we certainly don't know what as_strided means. But our fundamental goal here is to mutate the tuple. Why? Not just because it's fun, not just because it's gimmicky, but because we're going to see something very interesting here. If we were to try to mutate a tuple, we might try and write a function called tuple set item, and it might take a tuple, an index, and a value. And what we're going to do is, why don't we create an empty numpy ndarray. And when we create the numpy ndarray, we
might say, tell numpy that what you're doing is creating an empty array, and what's contained in this array are uint64s, eight-byte structures. Now, it might be interesting for us to think a little bit about what numpy is. Well, numpy is actually a way for us to take a memory view, a view of memory, and to look at it in different ways. So it's important for numpy to know: is this a uint64? What's the size of this thing? What are the strides? What's the dimensionality of this thing? And that's what as_strided does: it lets you tell numpy, oh, this block of memory, just look at it a little bit differently. This is akin to something like a C cast. You're not making a copy of some data; you're just saying, oh, an int64, it's actually eight int8s. And you can do that with numpy, because numpy is just this notion of: take some arbitrary block of memory, look at it, and perform operations on it.

So what you can do is go into numpy and say, hey numpy, in your array interface, tell me where the actual data for this empty array would be stored. What's that memory address? And it turns out that when we're talking about memory addresses, we have another way to look up memory addresses: the id function in Python is a way, at least in CPython up to today, that you can get the memory address of an object. Now, does the id function mean "give me the memory address of the object"? No, it doesn't, and I can prove that to you very simply. Alternate Python implementations like IronPython use a monotonic counter for the id, and so id meaning memory address is not correct. id means unique identifier. It just happens to be the case that in CPython id means memory address.

Now, there's one other interesting thing here. If you know a little bit about numpy and a little bit about Python, you might know that Python is going to heap-allocate your tuples. You might know that numpy uses raw malloc, not pymalloc. You might be able to guess that any numpy ndarray is going to be allocated at lower memory
addresses than any Python tuples, meaning if you look at the distance between where this tuple is and where the numpy ndarray is, it's always going to be a positive number. Meaning you can go to numpy and say, you know that empty array? I was wrong. It's not an empty array, and it's not an array of eight-byte ints; it's actually single-byte values, and its size is exactly from where in low memory the array was all the way up to where in high memory the tuple was. Uh-oh: you told numpy, now I own all the memory in between these two. And you can tell numpy, you know what, I want you to give me a new numpy ndarray called ys, and what that is, is a little offset to right at the beginning of that tuple object that you found in memory, that's just of size four, that's a bunch of eight-byte elements. Why eight-byte elements? Because likely this tuple object, as it's stored and represented in CPython, has a couple of ints, a couple of pointers, things like that. You might also say, numpy, give me a zs, and what zs is, is an offset into this tuple object, this raw memory layout for this tuple object, and it's the size of the actual data that the tuple stores, because the tuple object actually stores its size and the actual underlying references to the things that it contains all in one contiguous memory block. When you do that, well, it's easy-peasy: you tell Python, or rather you tell Python via numpy, take that little memory address where you store the size of the tuple, change that; take that memory address where the actual underlying ids, the actual underlying references are, and add something else in there. And when you have this in place, you can take a tuple that has a little gap in it, it should be 0, 1, 2, None, 4, and you can mutate a tuple using numpy. How about that?

Now, if you take a look at all three of these, they're all very gimmicky, they're all very niche, they're all dependent on implementation details. This last one is the most
dependent on implementation details despite being surprisingly safe and easy to do in part because unfortunately not a lot of people use ironpython everybody's using cpython and in part because numpy's available on all platforms and some of the assumptions that we made along the way are surprisingly common and safe assumptions to make but even though none of these individually are particularly compelling i think when you take them in combination you can say this mutability versus immutability is not altogether that interesting i can make a list immutable or i can treat it as though it's an immutable type i can make a tuple mutable but i'm not really changing how i'm using this thing i'm not changing what this thing means it is the case that one is a mutable sequence and one is an immutable sequence and it is the case that there are implications to this for example if you need to store one of these as the keys of a dictionary well the keys of a dictionary have to be hashable and there is a relationship between mutability and hashability and it may be the case that if you need to have some kind of structured key for your dictionary your choice is a tuple and a tuple only even if you wanted to use a list you couldn't but i want to get to a deeper meaning and the deeper meaning relates to this capital c versus lowercase c i told you that these are both capital c collections but i told you that the list is actually interestingly a lowercase c collection that's where the meaning is and so the question would be what is the tuple it's not a lowercase c collection it's something else it's a record let me show you what i mean by that when we think about the thing that we do to a list most often what do we do we append to it we pop from it well what that means is if you gave me some data and that data happened to be a list i might not know how many times you appended and popped from it i might not know the size of the thing and what i'm
going to typically do with this is i'm going to do a for loop i'm going to iterate over every element and i'm going to perform some operation on them well because i'm going to perform the same operation f on every single element every element of this list should support that operation as a consequence of that i can kind of think it's also important that if there are no elements this code doesn't break if there are many elements this code just runs over all of them and as a consequence of that if i try and compare this to how i conventionally use the tuple i might say you know what i don't really loop over a tuple in a for loop i'm usually unpacking a tuple and i'm doing different things with the different parts of that tuple you can see in both cases they're both sequences they're both ordered in fact it turns out even the set is ordered the distinction that's important between the list and the set was not that one was ordered and the other one was unordered if you were to iterate over the set multiple times you'd find the elements come out in the same order you might not be able to predict it the difference between the list and the set was that the list was not machine ordered the set was the set had some ordering that facilitated fast lookup operations that was decided by the machine i.e. decided by the ordinal value of the hash value subject to the open addressing policy of the you know perturbative probing hash table implementation or the split hash table implementation sorry the non-split table implementation since split table was only added in python 3.6 for dictionaries alone but what was interesting about the list was not that it was ordered versus unordered but that it was human ordered it had a human order and it happens to be the case that both the tuple and the list have a human order it's just that we're using that human ordering differently we're looping over it in one case and we're unpacking it in the other case well if we try and
split that difference and we zoom in a little closer we could say a consequence of this is going to be that in the list adjacent elements are semantically similar or conceptually similar because we're performing the same operation on everything that's contained in that list they all have to kind of be the same thing right they all have to be a bunch of numbers a bunch of personnel records a bunch of components but in the tuple we unpack and do different things with each of the components and so they're semantically or conceptually dissimilar and if we think about a python collection type the capital c collection type it is the case that python capital c collections are always capable of being heterogeneous in type and even going back before the pep 484 days you know we were always a little bit loose about you know what the type of something was what does that mean well it turns out that if you have a list and you're performing the same operation on every element there it's very likely that every element has to be homogeneous in type sort of even with the pep 484 work we still consider say an int to be interchangeable with the float or interchangeable with the complex or interchangeable with the bool but we generally kind of say you know what everything in the list is semantically or conceptually similar and so that's how we decide this notion of homogeneity whereas in the tuple we say everything is structurally conceptually semantically dissimilar and so we'd say this also would typically lead to heterogeneity in terms of the types that are associated with what's stored in it and so when we look at this again we might say okay that gives us a notion the list is a collection with the lowercase c it's just a bag of stuff it's a bag of all kind of the same stuff and it's important which is the first thing and which is the last thing and it's important the exact ordering of the things but there's not a fundamental difference between the first element and
the last element whereas the tuple is a record it's a bunch of fields it's very important what the first element or the last element is because the position indicates what the thing is the first element has some particular meaning that's not applicable to the second or third element or may not be applicable to other elements it is very much the case that as we said before the tuple exists as this immutable type in order to be used in a dictionary it is very much the case that mutable versus immutable is important but it's not the fundamental meaning here the fundamental meaning here is one is a collection type and one is a record now i wish we had a little bit more time for this presentation because we could talk in greater depth about this hashability immutability thing this is another area where i see sometimes there's a little bit of confusion oftentimes i see people say oh hashability implies immutability or immutability implies hashability and it actually turns out to be the case that hashability strongly suggests immutability assuming one criterion which is you need some kind of random or non-intermediated access and it turns out that you make mutable objects hashable very often in cases where the lookup has some intermediation a very common example would be in a networkx digraph where you're never really indexing you're never doing a getitem into the structure directly you're always doing some kind of intermediated access to the elements via you know graph.nodes or graph.edges you're asking the object itself to enumerate what is contained within it you're not saying oh just give me this particular element randomly but unfortunately we don't have time for that so we'll go back to our example and we'll talk about this list this set and this tuple and let's make this example a little bit more specific a little bit more concrete let's say that we're storing not just numbers we're storing host names if we were to intentionally make a choice
between the list the set and the tuple we could convey a lot of meaning in that and there could be a lot of meaning behind this there could be something really there in terms of the choice that we make it's obviously the case that we could choose one of them and maybe the code might work might not work and it might not affect the underlying functionality but if we were to make this choice intentionally there might be something there let's see what that might be so let's take a look at what differences we might see if we were to make one of these three choices if we were to choose the list formulation one of the things that we might try to convey to somebody is we care about connecting to each of these machines in a particular order that ordering is important which machine you connect to first and which machine you connect to last is very important now in terms of the difference between the machines there might be some modality hidden in there there may be some predicate you perform this operation on this machine versus that operation on that machine but fundamentally they should all be mostly similar if you think about the set formulation you're basically saying i don't care about the ordering i care about connecting to a machine if there is a duplicate i only want to connect to the machine once make sure you connect to the machine but there's no real difference between if you connect to the machine first or last additionally if you use the set formulation you may implicitly be saying that you know what i've got a bunch of different host names what are the ones in common what are the ones that are not in common perform some set like operations and you can see this choice even though it's a very small one and it's driven around a very small detail really closely ties itself to how you go about solving this problem and what you'll be able to easily do using the structure that you've chosen
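The list and set formulations described above could be sketched roughly as follows. This is only an illustration: the host names and the `connect` helper are hypothetical stand-ins, not code from the talk.

```python
# Hypothetical stand-in for real connection logic (an assumption for illustration).
def connect(host):
    return f"connected to {host}"

# List formulation: a human-ordered collection. Order matters, duplicates
# are kept, and the same operation is applied to every element.
hosts_in_order = ["alpha.example", "beta.example", "alpha.example"]
list_results = [connect(h) for h in hosts_in_order]

# Set formulation: no human ordering. Each machine is contacted once, and
# set-like operations such as intersection and difference are available.
site_a = {"alpha.example", "beta.example"}
site_b = {"beta.example", "gamma.example"}
unique_hosts = set(hosts_in_order)   # duplicates collapse
in_common = site_a & site_b          # hosts present in both groups
only_in_a = site_a - site_b          # hosts present only in the first group
```

Nothing here changes what the machine can do with the data; the choice of container mostly signals to a reader whether order and duplicates are meant to matter.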
now if you choose the tuple as we saw the tuple is a different type of ordered structuring it is human ordered just like the list but the human ordering has some notion of the different elements being contextually conceptually semantically distinct in other words it's a record not a collection and so here if you have these two host names you might be saying well you know what this host name is the prod host name this host name is the dev host name or you might even be saying this is the primary host name that you connect to and this is the backup host name and you may actually do fundamentally different operations there are certain things that you do in prod that you wouldn't do in dev there's certain operations you might do on the primary and not do on the backup now if you think about it with this primary backup example you could also model this as a list but there's a meaningful difference here the meaningful difference is if it happened to be that there were n backups there's some primary machine at the very beginning and then if you don't hit it you hit the next one and the next one and the next one that's probably a list but if it's strictly the case that there is a fixed modality of either primary or backup and you're guaranteed that everybody has either a primary or a backup and that there's a very stark kind of discrete difference between these two as opposed to the continuous modeling of the list well maybe that's a tuple versus a list now what you can see here is that the details are very interesting because we can dive very deep into the details and we can use the meaning that we get out of them to clarify a choice we can use these details in order to convey an intention and so it's very important that we say if you were to make this choice intentionally then this is the result because it's very much the case that you might not be intentionally choosing between these three choices and your code might still kind of work it turns out the
machine doesn't really care about this meaning it's the human being that cares about this meaning this meaning is valuable in terms of what it conveys to somebody else and also in terms of how it helps guide you in the work that you're doing but ultimately the choice that you make may not actually make that much of a difference in terms of what's executed by the machine additionally it's very important that when we talk about things like this list versus tuple view this is an interpretation this is my interpretation i find that the deeper you go into thinking about the list and the tuple type and the set type and all the built-in types this interpretation really holds up there are other places where this interpretation is further corroborated but ultimately it's an interpretation and it could very well be the case that you choose to just look at a list as an immutable or sorry a mutable tuple however i think that you lose a lot of fidelity in terms of what you can convey to somebody and i think you find yourself going astray from how these things are typically used in practice now that was a view of the thinking exercise this don't use this code exercise you go deep deep deep into something and then on your way back out you try to figure out how is this meaningful why does this matter i want to complete this exercise for you and i want to talk about the other choices that we had and in going to the other choices i want to give you a conceptualization of how all the pieces of numpy and pandas fit together and we're going to try and follow the same exercise and the same steps that we did before and those steps are collect the facts collect the details look at as many deep niche details as you can find look for differences look for similarities but then review conventions and review expectations an example of an expectation versus a convention would be a convention might be how does somebody actually use this an expectation might be when somebody looks at this
what are they really thinking yeah maybe a difference is here but does somebody even notice that here's an example there are multiple import styles you can say import x you can say from x import y you can say import x as some alias and when you look at those three the majority of python programmers don't really see a strong distinction between those and yet there is a fundamental distinction import x versus from x import y is an early versus late binding distinction when you create that name are you creating that as a live view or a snapshot view or rather when you access the thing by that name are you accessing it as a live view or a snapshot view is it early bound or is it late bound most python programmers don't really see that distinction and so their expectation is not going to be this is a very important distinction and so for the most part when you're trying to do things like compare import styles it's really going to come down to how much typing and if i'm a data scientist you know what other two letter alias can i find for the new package so i can minimize the amount of typing that i have to do collecting all of these pieces you need to apply some judgment and this is a very difficult thing to do because you have to really sit back and say what can i justify can i put together a convincing argument for this you might be right you might be wrong there may be a case where there is no right or wrong and finally you derive some interpretation which leads directly to some meaning the reason that you want to do this is you want to be able to look at something and say what is this thing really what does this thing really mean and so let's take a look at all these pieces of numpy and pandas and let's figure out what are these things really what do they really mean we've seen the numpy ndarray already and we talked about it as a memory view we said it's one way for us to access some raw memory and we said that numpy has the ability to do
something akin to a c cast you can look at that memory differently so you can say oh this memory is one linear sequence of nine elements or you can say oh it's actually a three by three matrix you can look at that and say oh you know what it's a bunch of int64s no no it's actually eight times as many int8s that's kind of what the numpy ndarray is but the meaning goes even deeper than that because when you think about it versus the list it's a sequence just like the list in fact it's a mutable sequence just like the list and when we try and drive a difference between the two of them we might say well the list is dynamically sized and the numpy ndarray is fixed size but you can make arguments for well the numpy ndarray is fixed size because it represents some raw memory at some location and you're not really going to shrink or grow that that might require a new allocation but the list is some kind of grouping of references to objects and so yeah i can see there's a difference here but we're still not quite getting to a core meaning we could say that the list has a fixed shape it's always some linear sequence where the numpy ndarray has a dynamic shape it could be linear it could be two-dimensional could be three-dimensional we could even say you know what the list can only actually ever be linear and the numpy ndarray can be any number of dimensions such that when you multiply out the size of the dimensions you get the total size of the thing that you're looking at but there's something a little bit deeper than that and if you look at what you put into the list and what you put into the ndarray and you think about what those things are and how python works you'll see a very interesting distinction up here let's say we have a list containing some integer objects and what is an integer in python well it's not like an integer in c or c++ because it can't overflow it doesn't have a bit width it doesn't
have a signedness it's not signed or unsigned if i put some integers into a list and i operate on all of them i kind of expect each one of these integers to perform the same operation but i don't really carefully look at the homogeneity of that thing because you know if there were two integers and a floating point value i might very well expect to be able to add one to all of them and when i look at that integer i can see you know it has this nice auto promotion behavior it's what i might think of as a boxed type unlike in a language like java where you have boxed versus unboxed types in python everything is boxed what that means is the list doesn't store the actual underlying data it stores references to the data therefore the list is non-contiguous as a consequence operations on that list might need to jump around memory and so they can't benefit from cache coherence additionally because it's a boxed type that boxed type can have behavior associated with it so that might not be a list of integers it might be a list of subclasses of integers and when you perform an add operation on one of those integers it might perform some stateful operation to mutate another one and so the actual underlying behavior of these things is unconstrained this dynamic dispatch has to happen at runtime and what it actually does there's no limits to that what does that mean well it means that if i have a list that i'm processing can i auto parallelize the processing of the list no because the things that are contained in the list have some arbitrary unconstrained behavior and as a consequence traversing the list from the front to the back versus from the back to the front might be meaningfully different however if i think about the numpy ndarray what do i typically put into the numpy ndarray it is the case that i could put pyobjects in an ndarray if i weren't worried about things like i don't know potential memory leaks from circular references there's some
long-standing bugs related to this in numpy and it is the case that there are places where you often do put python objects into numpy ndarrays for example you know a pandas interval might show up in an index so might end up showing up in an ndarray but ultimately when you're using a numpy ndarray directly what are you putting into it you're not putting into it an integer you're putting into it an int64 well an int64 is an unboxed type because the memory for that int64 is managed by the numpy ndarray therefore that memory is contiguous therefore if you want to perform some operation on an ndarray you get cache coherence and because it's a machine type it has constrained behavior it's an int64 you add one to it you know exactly what's going to happen you can't subclass it there's no notion of any type system there it's just some bits that have some operation that's understandable by the computer and can be represented without dynamic dispatch as probably a single you know low cycle count assembly instruction when you operate on the numpy ndarray you can operate on it just like the python list you can go through each element and apply an operation but you typically don't do that you typically ask the ndarray to perform the operation for you and so you have this distinction in syntax a very minor distinction of syntax but what that immediately leads to is the notion that the numpy ndarray beyond just being some memory view is actually a restricted computation domain it's a way where you can come to terms with the fundamental inefficiency of python python is simply too dynamic and has certain limitations around no ability to control memory layout and as a consequence if you need to get certain optimizations out of python what you do is you take computationally intensive parts of your program you draw a line around them you call that your computation domain you build a manager type that's what the numpy ndarray is that manages
everything inside that instead of allocating a bunch of python objects you have the manager type allocate raw memory and then manage it itself and box and unbox on the boundaries and because you have control over that domain you add some restrictions to do things faster to eliminate dynamic dispatch to add in optimizations that's what the numpy ndarray is it's some computational domain sitting within a bunch of python code that does program structuring so when you think about that you can say well hold on a second if i need to store a bunch of numbers and those numbers are being stored in order to be able to do some mathematical work some computational work put them into a numpy ndarray but if i'm doing that for some kind of program structuring like i'm printing them to the screen or they're deciding some mode for what i do here or there probably put them into a python list now we can draw this distinction even clearer we could say that if a list versus a tuple is a collection versus a record a list versus an ndarray is a collection just some opaque bag of stuff versus a mathematical vector or matrix or tensor or something along the lines of that and so you can think a list is for storing a bunch of stuff a numpy ndarray is for storing some mathematical stuff now if we think about pandas arrays versus numpy ndarrays we can start to create a conceptualization of what pandas is all about if we think about a numpy ndarray that stores floating point values we can store three values here and each of these are valid values we've taken a measurement one two and three now we can store another thing that's not quite a number and you might say this isn't a value it's a nan and this means we took three measurements and the third measurement was not applicable was missing was erroneous in some fashion so we have two actual values and one error condition here we have a modality the data that we're storing can either be of this class or that
class it can either be a value or an error condition it can either be one two or a nan and we have to find some way to encode that well when we're encoding the actual values the one and the two we're using ieee 754 double precision binary floating point and if we were to try and encode the nan we'd find that in ieee 754 the bit patterns of how it's stored on disk reserve certain bit patterns for things like infinities and things like nans so if we're talking about not a double precision but a single precision ieee 754 floating point type this would be the bit pattern for a nan and this is not a valid value this is a nan it's some sign bit and then a pattern of a bunch of ones and then some payload it's surprising how few applications use the sign bit or the payload there's actually very few bits that identify the nan and a bunch of bits that you can encode a bunch of other stuff into now if you think about encoding this modality how would you do the same thing in an integer well what integer value what bit pattern would you choose in order to encode an error well you can't choose zero because that would be ambiguous with an actual zero you can't choose negative one because it'll be obviously negative one if you chose like the highest value then you reduce the range of your integer and historically integers have had very limited range you know a 32-bit integer can only get so big a 16-bit integer only gets so big and you really might want that range and also historically people did not encode inside the integer type any bit patterns that were reserved for anything but values themselves so as a consequence if you need to actually work with real code and real data that somebody has already encoded and written for you you can't do anything with integer and so what happens is if you happen to put a nan into a numpy ndarray that stores integers numpy promotes everything to float64.
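A small sketch of the two points above, assuming NumPy is installed: the single precision NaN bit layout (exponent bits all ones, non-zero payload) and the promotion to float64 when a NaN is mixed with integers at array construction. The variable names are illustrative only, and the last three lines show one possible precision consequence of that promotion.

```python
import struct
import numpy as np

# Single precision NaN: 1 sign bit, 8 exponent bits all set to one,
# and a non-zero 23-bit payload (mantissa).
bits = struct.unpack(">I", struct.pack(">f", float("nan")))[0]
exponent = (bits >> 23) & 0xFF
mantissa = bits & 0x7FFFFF

# An array built from integers alone gets an integer dtype ...
ints = np.array([1, 2, 3])

# ... but adding a NaN at construction time promotes everything to float64.
with_nan = np.array([1, 2, np.nan])

# float64 cannot represent every large integer exactly, so the promotion
# can silently change a value: 2**53 + 1 does not survive the round trip.
big = 2**53 + 1
promoted = np.array([big, np.nan])
round_trip = int(promoted[0])
```

Note that this covers construction only; assigning a NaN into an array that already has an integer dtype raises an error rather than promoting.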
and that might not be what you want and that might actually result in certain problems accuracy precision problems that you have well if you think about this a pandas array can store an na type while still being an integer array and the way that it does it is it doesn't encode the modality into the type itself it encodes the modality out of band it stores a mask and the mask is is this a false value or true value is this a nan or is this not a nan and it stores the data separately and if you look at that data that data happens to actually be a numpy ndarray and so you can see it's just an indirection on top of the numpy ndarray to allow you to store out-of-band information what other out-of-band information other than these modalities might you want to store well a very common case for the pandas array would be categorical you're not trying to store some modality but what you're trying to do is you're trying to say let me try and compress the data that i'm storing instead of storing a bunch of strings when there's only three options for the strings store a bunch of integers that are very compact and then map those integers to those strings because it's some enumerated type that's a pandas categorical and you can see how the numpy or sorry how the pandas array makes that possible now if you think about the pandas series what is the pandas series on top of the array it is an indirection on top of an array but let's ignore that for a moment and think about it as an indirection on top of a numpy ndarray let's skip the middle and if we think about that and we look deep into the pandas series we'll find a numpy ndarray there is that indirection it is that ndarray plus something else and in fact if we were to construct a series from a numpy ndarray we'd find that it turns out that it really is just a wrapping of that ndarray to the point where if you mutated the original ndarray you'd end up mutating the series this is by the way one of
the reasons why sometimes things like memory management are hard to assess in pandas it's not always clear does pandas make a copy of this data or not and even within the value constructors for series and data frames sometimes a copy is made sometimes a copy is not made and the api is not always that clear the documentation definitely isn't clear about this what makes this series interesting well it's not the same thing that makes the array interesting it's not that indirection what makes it interesting is the indexing if you think about a numpy ndarray as a sequence it has some human ordering associated with it you say this is the zeroth element the first element the second element but what if you want to address those elements differently what if you want to describe the position of these elements differently what if you want to say give this element a label 10 20 and 30 or give it an alphabetical label a b and c or instead of taking the natural number label 0 1 and 2 just swap them around well what a pandas series gives you is a lookup modality and the lookup modality is a distinction between being able to look something up by its integer position is this row 0 is this row 1 is this row 2 and some other lookup mechanism encoded in the index is this the element with label 0 is this the element with label 1 is this the element with label 2 the pandas index is very very interesting and there's a lot of details to it that give you a lot of meaning for how to use pandas correctly pandas indices can have hierarchy they can have implicit versus explicit hierarchy people who are scared of things like multi-index don't always realize that datetime indices in pandas are also implicitly hierarchical and so there are certain operations that you might perform where you might not be able to guess does this give you a series or a data frame without knowing a lot about the index the deeper you go the more you find that even things
like the monotonicity of the index is very interesting in terms of what it will return to you when you perform a lookup operation unfortunately we don't have time to go into this but maybe if i have an opportunity to attend pycon india next year in person i'll tell you a little bit more about the pandas index instead let's wrap this up and talk about the pandas data frame i told you this is a little bit gratuitous you have a data frame with three elements in it more likely you might have a data frame with two groups of elements two columns oftentimes when we think about the data frame we think about it like an excel spreadsheet a bunch of cells there's like rows there's columns we might think about it as tabular data but there's something a little bit more to it and we can really formulate it in terms of the series we shouldn't necessarily think of the data frame as another layer of indirection on top of the series because it's not guaranteed to be the case the data frame is composed of series objects you can take a data frame and extract series from it you can turn a series into a data frame but it's not like one's necessarily built on top of the other that's not the conceptual structure here instead what you can think is a pandas series is a notion of having some data and having some alternate way to access that data by some kind of index label a data frame is the idea of having two-dimensional data and having an alternate way to access it on a major axis and an alternate way to access it on a minor axis in other words the index in pandas is an index and the columns are also an index i call this a major and a minor axis because there's operations you can easily perform on the pandas data frame index that you can't perform on the columns for example you look up columns by label but it's actually a little tricky to look up columns by their position and in fact that's not something that's generally meaningful in pandas this is
column zero column one column two it's not something that people expect to be meaningful so there's not really an ordering of them there's just the labels there and so the monotonicity of the columns is not something that people think about and slicing the columns is not something that people typically do although i've actually had cases where i wanted to represent a breakdown of data so i had columns that represented geographic regions and i wanted to be able to do a multi-index on them and it's actually quite powerful and it gives you the ability to really slice and dice data very nicely now when you think about the pandas data frame instead of thinking about these silly data frames that just have two columns let's think about one that has two named columns and an index and we'll try and see what the meaning behind all this is because i've told you a couple of facts i've told you a couple of details to try and give you a conceptualization of what these things are but here's the thing what really improves your life is being able to fiddle less with pandas it's being able to get your analyses done quicker it's being able to represent these analyses more tersely or more directly when you think of the pandas data frame as this almost geometric structure with a major axis and a minor axis and you think about all the transformations you typically want to do to a pandas data frame well you can kind of think about it geometrically do i want to collapse multiple rows into one row do i want to collapse multiple columns into one column do i want to take the columns and make them the index do i want to take the index and make them the columns and so for example here if we think of this visually we have a pandas data frame it's got columns a and b with values 1 2 3 4 5 6 at index labels x y and z if i wanted to take that a and b up here and just pivot them to be part of the index i want to take the minor axis and pivot it to be part of the major axis that's just the stack
operation if i wanted to do the opposite that's just an unstack operation now when you stack and you unstack you leave the existing major axis in place so you're appending onto that major axis and so you get a multi-index if you want to throw away the major axis then you do a drop level and so here when i look at this i can immediately see in my mind because i understand what this thing is i can immediately see in my mind what happened you had a and b x y and z you took that a and b and you pulled it down here so you ended up with a b a b a b and then you threw away the top level so you ended up with a b a b a b so you have a pandas series with the numbers 1 4 2 5 3 6 and the indices a b a b a b and when you think about it this way you realize all that fiddling and all that searching through stack overflow to manipulate my pandas data frames it really is owing to a misunderstanding of what this thing means and an inability to grasp the fundamental structure of this thing and so i hope you've enjoyed this thinking exercise this is the thinking exercise that underlies almost all of the consulting corporate training all the talks i give i go as deep as i can on the way back out i try and find meaning i try and find applicable meaning because ultimately it is not quite clear do the details matter i think the details do matter but i think the details matter insofar as they allow us to understand this meaning meaning is about what as a human being can i derive from this that allows me to convey something to somebody else or make a decision better or allow me to quickly structure a bunch of details into things that i really need to know and things that are just things i look up for example i understand fundamentally stack versus unstack pivot versus melt but all the keyword arguments that they take in pandas i gotta look that up every single time does it take in place does it not take in place i look that up every time
but the meaning here allows me to say i know what this thing does structurally so i know exactly where to look and then yeah i look up the details i look up you know is it a keyword argument is it a positional argument what are the names of the arguments does it have in place or not in place does it make a copy or is it a view and so forth and so on because ultimately when you're talking to people who are programmers but only because programming is what they need to do to do their job they're data scientists they're optical engineers they're network engineers they're physical scientists the first question on their mind whenever they look at the details or look at technical sophistication is why should i care how does this improve my life how does this give me something better why does going into those little gimmicks and these details how is it not just a dalliance something fun if i have some spare time how is this something that actually is valuable for me and the answer to how this improves my life is you go all the way down there and on the way back out you start to see the bigger picture you start to see the meaning behind these things you start to see how to structure this information so it's not just a bunch of details but there's a clear path for you being able to distinguish the underlying ideas from the details that corroborate or supplement that because in the end what's really most important is what does it all really mean i'm james powell thank you very much yeah yeah so again that was really an amazing talk so thank you so much for your great insights so here you have the first question so how do you structure an advanced conference talk that still suits all kinds of audience i wasn't intending to do that what i was intending to do here was to share with you the only real gimmick i have which is how do you figure these things out how do you derive some meaning some understanding of these things how do you move beyond just a bunch of details because
in reviewing a lot of educational material around python i've found it's often very poor it just overwhelms a person with facts and there's no structure behind those facts there's no meaning behind those facts it doesn't give them any greater guidance and so i thought what i would share with you is how you might be able to develop that yourself or some of the steps along the way and i think that what you often see is when you really have this deep-seated intuitive understanding of the thing it's not actually very difficult um one of the things that we do is we have a corporate training curriculum and it's called fundamentals of programming and it actually goes into much more sophisticated things than the actual advanced python training that we do because turns out that when you talk about you know modalities or early binding or late binding mutability immutability laziness versus eagerness root level code versus leaf level code those are all things that people can actually pretty much intuit they're pretty straightforward you can motivate them very easily and what really traps people is all the millions of little details if you talk about python's object model in terms of protocols you can within a couple of minutes figure out okay this is how the thing is designed this is how the thing is supposed to be used and then you can spend the next two weeks of your time trying to memorize all the different underscore methods that exist so what i would say is to try to answer this question as directly as possible i actually think that some of the really interesting advanced stuff is surprisingly accessible to a novice audience if you can find that right meaning and that right structure so the question was um an example of code in production where an issue was caused where the code was working but not as explicitly intended it kind of works what you should do is you should see if you have any friends who are data scientists and ask to look at their code
and i say that in jest because the truth of the matter is are all people who are data scientists or physical scientists or non-programmers really motivated to write good code maybe maybe not they're motivated in so far as if the code breaks in production maybe that's something that affects them but they're not typically within their organizations valued they're not promoted based on writing the best code you know they don't publish more papers when they write good code and so as a consequence it's very difficult to really convey to these people okay why does it matter and so hopefully part of this talk was yeah you have all these little niche details all these little pieces but when you come back out the other side of it and you look at these things yeah it kind of matters right if i choose a tuple versus a list it says something different it means something different some things are easier some things are harder the distinction is actually very clear very stark and maybe next time i make that choice i can make that choice right and so what i would say is the amount of bad code out there that's bad because the person didn't really put in the effort because the person just couldn't write good code this is not that high the amount of bad code that's out there because a person wasn't incentivized or motivated to write better code is where the main problem is and that motivation that incentive and that meaning is surprisingly easy to convey and surprisingly easy to get somebody to look at so hopefully for any of you out there who are trying to figure out some of the distinctions yourself in code that you're writing maybe you might be able to get it right the next time great answer james uh we are over time so i really want to thank you no no no so thanks a lot
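The stack, unstack and drop level example described in the talk (columns a and b, values 1 through 6, index labels x, y and z) can be sketched in pandas roughly as follows. This is a minimal reconstruction from the spoken description, not the code shown on the speaker's slides; the variable names df, stacked, flat and roundtrip are assumptions made for illustration.

```python
import pandas as pd

# Data frame as described in the talk: columns a and b, index labels x, y, z.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# Both axes are Index objects: the index (major axis) and the columns (minor axis).
assert isinstance(df.index, pd.Index)
assert isinstance(df.columns, pd.Index)

# stack pivots the column labels into the index, appending them as a new
# innermost level, so the result is a Series with a two-level MultiIndex.
stacked = df.stack()

# droplevel(0) throws away the original major-axis level (x, y, z),
# leaving only the former column labels as the index.
flat = stacked.droplevel(0)

# unstack is the opposite move: the innermost index level goes back to columns.
roundtrip = stacked.unstack()

print(flat)
```

Reading the result geometrically, the labels a and b are pulled down next to each of x, y and z, and dropping the outer level leaves a series whose values run row by row through the original frame.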
Info
Channel: Python India
Views: 756
Rating: 5 out of 5
Id: clRGJ6jMbLw
Length: 67min 10sec (4030 seconds)
Published: Fri Dec 11 2020