James Powell: Why do I need to know Python? I'm a pandas user | PyData NYC 2022

Captions
This is "Why do I need to know Python? I'm a pandas user." We're at PyData London, it is June 18, 2022. Actually, no, that's not correct: we're at PyData New York City, and it is Wednesday, November 9, 2022. The reason I got the date wrong is that this was a talk I had originally prepared for PyData London. Pro tip: do not change the password on the only laptop that has a copy of your presentation thirty minutes before your presentation. Pro tip number two: great password choice, p$$w0rd. Now let's see if I can actually give this presentation and have it be effective.

The presentation is about this: you're a pandas user, so why do you need to know anything about Python? Sure, pandas is written in Python, but how much do I really need to know? Tuple versus list? Come on, I'll just try one, and if it doesn't work I'll try the other. To set the tone, let's say you're a pandas user and the majority of your work looks like this: you have some time series data, maybe a multi-index of dates and entities, and some signal in some DataFrame you're working with. Most of the work you do is grouping by the entity, maybe doing a transform; maybe you saw this lambda as part of some chaining syntax, but for the most part you just memorized the syntax. What do you really need to know Python for if you're a pandas user? I want to try to answer that by answering four separate questions.

The first question I want to tackle is: why do the basics of Python even matter if you're just a pandas user doing groupby operations, rolling operations, manipulations on columns in a DataFrame? Well, let's say you're working with Excel data, and you know there is no guarantee that the table you actually want is in the top-left corner of the sheet. So when you read it in using pandas.read_excel (how hard can it be?), it turns out you get garbage like this, because the person who created the Excel file put the data on the second row, and there are actually two datasets in here: you can see a first dataset and a second dataset. What are you going to do, go into this DataFrame, drop the NaNs, and try to extract the data? No. What you're going to do is use openpyxl to manipulate this data and pull out exactly what you need, because good luck telling the original author, "please fix your dataset, it's really screwing up my process." And when you do that, you'll notice, for example, that openpyxl does not implement the context manager protocol when you open the file, so you have to learn about contextlib.closing, because if you write a script that analyzes that file and the script does not close the file, you will find that you cannot rename it, you can't move it to the trash, and you have to reboot your computer just to change the name of the file. So maybe it's worth learning just a little bit of Python; maybe it's worth learning about things like contextlib.closing.
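A minimal sketch of the openpyxl pattern being described here, assuming a workbook named data.xlsx whose table starts on the second row (the filename and row number are placeholders, not the speaker's actual file):

    # per the talk, the workbook doesn't act as a context manager here, so
    # contextlib.closing guarantees the file handle is released even if the
    # analysis raises partway through
    from contextlib import closing
    import openpyxl

    with closing(openpyxl.load_workbook('data.xlsx', read_only=True)) as wb:
        ws = wb.active
        for row in ws.iter_rows(min_row=2, values_only=True):
            print(row)   # each row of the table, skipping the junk above it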
When you take a look at the file, you might say, okay, I want to go into that file and figure out where the data is. I may actually know where it is; maybe it's consistently in some place, so I want to represent where the view area for the data is. You could represent this as a tuple or a list of lists, but a slightly more readable way would be a collections.namedtuple, where you can say: this is the minimum and maximum row, and the minimum and maximum column, where I expect my data to be. Maybe it's worth learning about collections.namedtuple, because it gives you a little guidance for how to extract the data: you can go to openpyxl and say, give me just the cells in this particular location, and if the dataset changes, maybe you toggle these values, or maybe you even write code that infers where the data is supposed to be by scanning through the spreadsheet. Whatever the case may be, it might be valuable to learn a little Python to make this very routine task a bit easier.

Once you've started down this path, you might say: okay, I've got my namedtuple, I've got my openpyxl, I want to work with this data, but what's the guarantee that the data actually has the columns in that order? They might not move the table around on the spreadsheet, but they might change the order of the columns or add extra columns. How do we know it's name, group, and value? Maybe it would be advantageous to learn a little Python syntax, maybe what that assert is, because you could add a very simple line saying: assert that the data looks the way it's supposed to look, and warn me if it doesn't, so that if they swap the columns your analyses don't just break for no good reason. And if you're going to do that, you might wonder whether this should be an assert or not, and it might be worth learning a little about how the Python internals work, to understand, for example, that asserts are used for programmatic documentation. They are not for checking business logic; they're for documenting, so that somebody knows that as of this line of code, we know the cells look a certain way, we know our data has been read correctly. That may require understanding things like __debug__, a variable with two underscores before and after that is automatically defined by your Python interpreter and is True by default; there's a -O flag that sets it to False, and Python's constant folding allows it to elide blocks of code that it knows will never run, including assert statements. Those asserts are not run if you run Python with the -O flag, which means you can put very expensive data checks behind an if __debug__ block or in an assert: when you're running in debug mode you run those checks, but when you deploy to production you don't have to, because maybe by then you're a little more convinced the dataset looks right. In fact, it might be worth learning about things like the dis module so you can see this yourself: I can see the assert when I look at the compiled code, and when I run this with python -O, it vanishes. Which means we now have a way in Python, without learning that much Python, to add very expensive checks to our code to make sure our data is correct, and all it took was learning about __debug__ and asserts. Might be worth learning about that.
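A sketch combining the two ideas above: a namedtuple that records where the table lives, and an assert (skippable with python -O) that documents what the header should look like. The coordinates and column names are assumptions for illustration:

    from collections import namedtuple
    from contextlib import closing
    import openpyxl

    # where we expect the table to live inside the sheet
    Area = namedtuple('Area', 'min_row max_row min_col max_col')
    AREA = Area(min_row=2, max_row=200, min_col=1, max_col=3)

    with closing(openpyxl.load_workbook('data.xlsx', read_only=True)) as wb:
        ws = wb.active
        rows = [[cell.value for cell in row]
                for row in ws.iter_rows(min_row=AREA.min_row, max_row=AREA.max_row,
                                        min_col=AREA.min_col, max_col=AREA.max_col)]

    # under `python -O`, __debug__ is False and this assert is compiled away,
    # so an expensive data check costs nothing in production
    assert rows[0] == ['name', 'group', 'value'], f'unexpected header: {rows[0]}'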
Now, when you are doing your analysis, you might say: okay, I have my dataset, I load it in, I defined what the area is, I checked a couple of things, and I'm going to create a DataFrame from it. That's very likely what you're going to do with this data: create a DataFrame, maybe use something like pd.to_numeric, and at this point, no big deal, use everything you know about pandas, read in that DataFrame, do your groupby, and everything is fine. Except: what if you have more than one data file? What if you have a bunch of data files in some directory, not just one file whose path is hard-coded, but any of the data files that happen to be in the directory? In this case, this data directory has a bunch of data files for the US, for the EU, for pricing, for demand information, and so forth, and it even has one manifest file that maybe I don't want to read in. It might be worth learning just a little bit of Python like pathlib.Path, because it gives us the ability to very easily manipulate paths on disk: for example, iterating through the paths and finding everything that has a .csv suffix, processing only the CSV files and ignoring any of those metadata files that aren't relevant for the analytical use case. In fact, it might even be worth learning about the Python set, because you could put all of the files you want to read into a set, and then mark any files that you know should not be read (like a bad file that contains corrupted data that we'll fix later, but that I just don't want to read for my analysis) in another set, do set subtraction, and only process the files you want. If somebody says "oh, we fixed that file," no big deal: make that an empty set and everything's good; or if there are other files you want to skip, you can do that too, in a very readable way. Once you do that, it might be worth learning about comprehension syntax, because we can express finding all the files on disk and removing the bad files in literally two lines of code: go to my directory, find every file that's a CSV, ignore this one file for now, and these are the files I'm going to process. Putting it all together, you can go through the files, read them in, and there we go, we have our analysis over each of the files.

Additionally, you'll notice there's an in-band encoding in use here: the file names themselves include information about what they contain. It's in the name, eu.pricing. Maybe it's worth learning a little about string operations so you can do something with that string, like split it into two pieces, one of which tells you the region the file contains data for, and the other the metric it has. In fact, you might say it's worth your time to learn about regular expressions, because you can do that a little more safely: I can write a simple regular expression that says every file name I want to process ends with the region and the column I'm going to create, followed by .csv, and thus I can go through all my files and match only the ones that fit this pattern, of course still removing any bad files I know are problematic. With just a little bit of Python, I can do things that would otherwise be extraordinarily manual and extraordinarily difficult.
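One possible rendering of the file-discovery pattern just described. The directory name, the file marked as bad, and the region.metric.csv naming scheme are assumptions, not the speaker's actual files:

    from pathlib import Path
    from re import fullmatch
    import pandas as pd

    # every CSV in the data directory...
    files = {p for p in Path('data').iterdir() if p.suffix == '.csv'}
    # ...minus any files known to be bad (empty this set once they're fixed)
    bad_files = {Path('data/eu.pricing.csv')}
    files -= bad_files

    # the filename is an in-band encoding: <region>.<metric>.csv
    pattern = r'(?P<region>\w+)\.(?P<metric>\w+)\.csv'
    frames = {(m['region'], m['metric']): pd.read_csv(p)
              for p in sorted(files)
              if (m := fullmatch(pattern, p.name))}

    # dict keys become index levels when everything is concatenated at the end
    combined = pd.concat(frames)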
Well, once you start doing that, you might say: okay, I want to read in these files and do something with them. I know that the data types I'm given by pandas and numpy do not grow in place nicely; they are generally dynamically shaped but fixed-size, which means if I have to read a bunch of data, I should probably first read it into a list and do a concatenation at the very tail end. So here it might be worth reading the data files into a Python dictionary using dictionary comprehension syntax, then doing a concatenation over all the files based on what the regions are, in order to create my final product: here are the data file names, and here are the DataFrames created from those files. I was able to do this in only a couple of lines of code. It might just be worth your time to learn a little Python, even if you happen to be a pandas user.

Let's take a look at our next question: why does looping even matter in my life? I'm a pandas user; what do I need to know about looping? After all, I learned everything I need to know about looping when I learned numpy: I learned about broadcasting and vectorized operations. I learned that when you take two numpy ndarrays, there are rules that dictate how they interrelate, whether they support a particular operation. For example, here I'm doing an element-wise multiplication of these two structures, and it's going to broadcast one to the other. You might even have taken the time to learn that broadcasting works by first right-aligning the sizes and then matching them up, either exactly or looking for a one, so here it's going to repeat this structure three times along that axis. And you say: that's all I need to know about looping; in fact, I'm not even supposed to loop if I'm using pandas or numpy. Let the restricted computation domain handle that; I want to do everything with vectorized operations. You've probably said the word "vectorized" sixteen times already. So why do we need to know about looping? Where does this come up, even for a pandas user?

Well, it turns out that maybe we have some pandas Series, and we want to do operations like resampling that data, and it may not be the case that we want to resample it in some consistent fashion: maybe one user wants to resample it one way and another wants to resample it another way; maybe for our analysis we want to do some kind of rolling, and one person wants to smooth the data by seven days and somebody else wants to smooth it by three days. There could be a lot of hidden modalities here, but largely you can see that much of the looping is handled by pandas itself, as long as the looping we're talking about is the computational looping. Here are some examples of manipulating this pandas Series and performing operations like resampling the structure on the periods and then finding the mean, and we can even see a little bit of piping syntax. But it's still not clear to me, you say: the looping that I do computationally is done in pandas, so what else is there other than computational looping?
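A small sketch of the per-user "hidden modalities" being described: the same Series resampled or smoothed differently depending on who is asking. The dates and frequencies are illustrative:

    import numpy as np
    import pandas as pd

    idx = pd.date_range('2022-01-01', periods=90, freq='D')
    s = pd.Series(np.random.default_rng(0).normal(size=len(idx)), index=idx)

    monthly = s.resample('MS').mean()     # one user resamples to month starts
    weekly  = s.resample('W').mean()      # another resamples weekly
    smooth7 = s.rolling('7D').mean()      # one smooths over seven days
    smooth3 = s.rolling('3D').mean()      # another smooths over three days
    report  = s.pipe(lambda ser: ser.resample('W').mean().describe())   # a bit of piping syntax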
Well, it turns out that when we're using something like a pandas DataFrame, a pandas Series, or a numpy ndarray, we are acknowledging that Python has two separate domains: the domain where you do your computations, via these broadcasted operations, and the domain where you describe the problem you're trying to solve, what I like to call program structuring. That's where Python lives, and that's where knowing a little more about Python looping might help you, even if you're a pandas user. Because when you have your DataFrame, you might want to do some scenario analysis on it: you have some low-risk situation and some high-risk situation, and these correspond to multipliers on the columns you have. In the low situation you multiply the a column and the b column by these values; in the high situation you do something else; maybe this is some kind of perturbative risk analysis. Whatever the case may be, you definitely know everything there is to know about index alignment, but you might not realize that we can express this scenario analysis very nicely using simple Python mechanisms. For example, here I'm defining the scenarios I'm going to apply, I'm going to apply them to my dataset, and I have a nice simple little comprehension that says: for each of my scenarios, apply it to my dataset using index alignment and give me the results; and there's my low-risk and high-risk analysis. How would you do this without that looping syntax? You'd have a bunch of cut-and-paste code.

But, you say, I'm a pandas user, why do I care? Well, scenario analysis tends to be a lot more complicated than that. It may be that I want to dictate my scenarios and print out the results, but the scenarios aren't just this one, this one, this one in a row; there may be some complex interrelationship between them. For example, I have multiple scenarios I want to compare against each other: I have regimes and adjustments. I'm in a low-risk or a high-risk regime, I'm doing a low-overhead or a high-overhead adjustment, and I want to look at the Cartesian product: low-risk low-overhead, low-risk high-overhead, high-risk low-overhead, high-risk high-overhead. In this case, do I really want to cut and paste two times two, four times? No, because I could use itertools.product to set up the structure of the analysis I'm trying to do and compute all of these scenarios at once, in one simple Python loop. It might just be worth your time to learn about the itertools module, because it allows you to express how you want to structure your analysis in terms of what you're actually going to analyze, even though the underlying computational work will still be handled by numpy or pandas. This may be an even more effective approach than you think, because it lets you separately specify what the analysis is going to look like: independently, these are the four scenarios I'm going to run, and if I put them into a Python set, I can even subtract out scenarios that nobody cares about; maybe they want the Cartesian product minus the one case we don't think is actually going to happen. I can express all of this in a fluent fashion by learning just a tiny bit of Python. In fact, when I do this, I may discover that I have a very complex set of scenarios to analyze, and I want to fluently express: don't just do this one scenario, do all of these scenarios, and make it possible for me to casually add extra scenarios. I'll make my business users happy when they say "oh, and one more thing," and it's just one line of code that makes use of these looping mechanisms provided by the standard library, by the itertools module.
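A hedged sketch of the scenario analysis being described: scenarios are Series of per-column multipliers applied through index alignment, and the regime-by-adjustment grid comes from itertools.product. Column names and multiplier values are invented for illustration:

    from itertools import product
    import pandas as pd

    df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

    regimes = {'low-risk':  pd.Series({'a': 0.90, 'b': 0.95}),
               'high-risk': pd.Series({'a': 1.20, 'b': 1.10})}
    adjustments = {'low-overhead':  pd.Series({'a': 1.00, 'b': 1.00}),
                   'high-overhead': pd.Series({'a': 1.05, 'b': 1.08})}

    # the Cartesian product of regimes and adjustments, one result per scenario;
    # Series multipliers align against the DataFrame's columns automatically
    results = {(regime, adjustment): df * regimes[regime] * adjustments[adjustment]
               for regime, adjustment in product(regimes, adjustments)}

Building the product into a set first would also allow the set-subtraction step mentioned above, dropping combinations nobody cares about before any computation happens.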
But it goes a little further than that, because for my scenario analysis (here you can see the result of my previous computation) I may want to do something even more complicated. For example, I might have some scenarios that are ordered in a linear fashion: a happens, then b happens, then c, then d, then e, and I want to compare what happened from a to b, from b to c, from c to d, from d to e. I can do that using the itertools module: recently, pairwise was added to itertools in the Python standard library, which gives us the ability to look at pairs of elements. Here I can do an analysis of my data where I compare a against b, b against c, c against d, d against e, preserving that linear ordering and looking at overlapping pairs, very similar to a pandas rolling with a window size of two. Now, I know some people in the audience are looking at this slide and saying: pairwise? What on earth is going on? For those of you, here's nwise, my generalization of pairwise that looks at windows of any arbitrary size, built from the building blocks in the itertools module into something even more powerful than what it gives me itself. With nwise, my favorite little iteration helper, I can do the exact same thing pairwise does, but if I want to look at scenarios with some other window size, for example the previous scenario, the current scenario, and the next scenario, I can do that as well. A little bit of looping and a little bit of pure Python gets me very, very far: here I'm comparing a against b against c, b against c against d, c against d against e, and so forth. And if I take that one little idea further, I might write other iteration helpers, helpers like nwise_longest or first_and_last, which then allow me to fluently express exactly what I want to do in my analysis: look at the first scenario all by itself, then at what happens between the first and the second scenario, the second and the third, the third and the fourth, and then look at the last scenario all by itself, in isolation. It's the combination of these tools from the itertools module that allows me to set up a very fluent loop which says: look through the previous and the current scenario, identify which is the very first in the order and which is the very last; if you're on the first one, do something here (maybe it's part of your analysis, maybe you apply some factor or perform the analysis slightly differently); for everything else, do this; and for the last scenario, do this. It might just be worth your time to learn a little more about looping, to learn about things like the itertools module, even if you happen to be a pandas user.
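itertools.pairwise (Python 3.10 and later) and one possible reconstruction of the nwise helper described above; the actual definition isn't shown in the transcript, so this version built from tee and islice is an assumption:

    from itertools import pairwise, tee, islice

    def nwise(iterable, n=2):
        """Overlapping windows of size n: nwise('abcde', 3) -> abc, bcd, cde."""
        iterators = tee(iterable, n)
        for shift, it in enumerate(iterators):
            next(islice(it, shift, shift), None)   # advance the k-th copy by k items
        return zip(*iterators)

    scenarios = ['a', 'b', 'c', 'd', 'e']

    for prev, curr in pairwise(scenarios):        # a-b, b-c, c-d, d-e
        print(f'compare {prev} -> {curr}')

    for prev, curr, nxt in nwise(scenarios, 3):   # a-b-c, b-c-d, c-d-e
        print(f'window {prev}, {curr}, {nxt}')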
Well, if we're going to talk about looping, we'd better talk about generators, because you saw in the previous example a generator expression: that's the thing with the round brackets around the thing with the for in it. You're a pandas user, who cares? When we talk about generators, you might say: well, that's great if I have iterators or iterables, but I don't; it's a very mechanical thing; who cares about generators, I'm a pandas user. It turns out that generators are probably the most useful feature in all of Python; that you also aren't using today. Generators aren't just some mechanical thing that makes Python programmers very happy to talk about iterables and iterators. Generators are your mechanism for expressing non-closed-form operations without specifying the ending modality. That sounds like a lot of verbiage, so I'll tell you what it means.

Let's say we have some analysis, and that analysis is done in numpy. We know that performing operations inside numpy is definitely a better idea than performing those operations in some loop at the pure-Python level; in fact, simple performance analysis shows that's the case. Here we have a dot product implemented in pure Python and a dot product implemented in numpy, and there is an order of magnitude difference between the two. So when I tell you about generators, you might say: hold on, it sounds a lot like you want me to loop at the pure-Python level, which is going to be a terrible performance hit. But when we're talking about generators, we're generally talking about a particular regime, a regime you may not even recognize in your code until I show it to you. Here's how it works: in computation, or at least in Python, we have two ways to represent a computation that doesn't have a closed form, that can't be done in a single shot. We have eager representations of computation; that's what a function is: it does all the work before you ask, completes the entire computation, and gives you the result. If you have a computation you can break into parts, and the incremental part of that computation is way faster than pausing to ask somebody "should I keep going?", then you do not want to use generators. You want numpy, you want pandas, because numpy and pandas are fundamentally eager computational tools. What does that look like? In the previous example, the incremental computation was a multiply. How fast is a multiply? Very, very fast. Asking "do I keep going or not?" requires going up into the Python level, which takes on the order of microseconds, and I can do a multiply in much, much less than a microsecond.

When we look at examples where we're doing operations on computational data, we often see perplexing results like this one. Here I have a numpy ndarray containing values drawn from a normal distribution, and I am timing how long it takes to find all of the positive values and square them, with two syntaxes. In the latter syntax, I index into the structure and square only the values that are greater than zero. In the former syntax, I square every value and then multiply by a mask that is one if the value is greater than zero and zero otherwise. The first line of code does more work than the second line of code, and yet it's faster; it's over two times faster. If the incremental computation is that blazingly fast, then it's often worth doing extra work and just throwing it away.
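A reconstruction of the timing comparison just described: the masked version does strictly more arithmetic than the boolean-indexed version, yet often wins because each incremental operation is so cheap. Exact numbers will vary by machine:

    from timeit import timeit
    import numpy as np

    xs = np.random.default_rng(0).normal(size=1_000_000)

    masked  = lambda: xs**2 * (xs > 0)   # square everything, zero out the negatives
    indexed = lambda: xs[xs > 0]**2      # select the positives first, then square

    print('masked :', timeit(masked,  number=100))
    print('indexed:', timeit(indexed, number=100))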
That's because the time to ask, or the time to mask, is going to be bigger than the time to do a bunch of computations that you throw away. This is the regime you're used to, but it is not the only regime our computations sit in, because the other regime is lazy computations. A lazy computation is any computation where the time to do an incremental step is significantly slower than the time to make a decision. If it takes one second to compute a step and one microsecond to ask "do you want me to keep going?", I'd better ask each time, because otherwise I could waste a lot of computational resources. In fact, any time I have a simulation, a solver, an optimizer, anything with some ending modality (simulate for how many steps? simulate until what condition is met?), if the underlying step itself is slow, and by that I mean more than maybe a microsecond, it's probably the case that you want to write it as a generator.

Let me show you an example. Here I have a simulate, and it runs forever, and it operates by performing some fairly slow step on some DataFrame. When I think about how I formulate this computation, I might say: well, my users want to be able to simulate for 100 steps, 200 steps, 300 steps. That's a modality; I'll add a num_steps parameter, and that'll be fine. Until some user says: "I can't tell a priori the best number of steps to run for my machine learning training; I can't tell unless I look at the data, so can you give me a threshold parameter? I want to run until I see that threshold and then stop, because otherwise I might be doing extra work." And some other user comes to you and says: "you know what, I don't care about the exact values; I only have a somewhat loose time budget of maybe one second to train this before I have to deploy it, so I don't know the number of steps, I don't know the threshold, but I know the amount of clock time I want to spend in the simulation." When you want to satisfy all of those users, this is the nonsense you're going to write for them: three different modalities, three different parameters, to make everybody happy. But you will never be able to make everybody happy, because the moment you show them you're willing to accommodate such a ridiculous request, they're going to come back and say: "well, actually, I want to train my machine learning algorithm until the relative improvement between every two steps falls under some threshold," or "I want to change that threshold from an absolute threshold to a relative threshold," or "we want to support both modes." This grows and grows without bound; your misery will grow without bound, if you believe that any non-closed-form operation which may have different ending modalities (a simulator, a solver, a machine learning trainer, whatever it may be) needs to accommodate all of them itself. But if you use a generator, what you're doing is representing that computation as an unbounded sequence of steps and externalizing the control of that computation. You are not preferring any modality over any other; you just run this forever, and externally from the way the computation is modeled, your users decide: "we actually want to run for n steps," or "actually, I want to run until some threshold is met," or "actually, I want to run for some total amount of time."
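A sketch of the generator formulation being argued for: the simulation is written as an unbounded stream of steps, and each caller imposes their own stopping rule from the outside. slow_step, the loss column, and df0 are stand-ins for the real update and state, not the speaker's code:

    from itertools import islice
    from time import monotonic
    import pandas as pd

    def slow_step(df):
        # stand-in for one expensive update of the simulation state
        return df.assign(loss=df['loss'] * 0.9)

    def simulate(df):
        while True:              # no num_steps, no threshold, no time budget
            df = slow_step(df)
            yield df

    df0 = pd.DataFrame({'loss': [1.0]})

    # caller 1: a fixed number of steps
    results = list(islice(simulate(df0), 100))

    # caller 2: run until a convergence threshold is met
    for df in simulate(df0):
        if df['loss'].iloc[-1] < 1e-6:
            break

    # caller 3: run within a one-second wall-clock budget
    deadline = monotonic() + 1.0
    for df in simulate(df0):
        if monotonic() >= deadline:
            break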
In fact, you can write simpler code that has less of a maintenance burden for you and is more flexible for your end users; you can satisfy all of these modalities by externalizing the condition under which the simulation stops. That's all a generator is: the idea that a computation has multiple steps, but I won't tell you ahead of time how many steps I want to run. I will just structure the computation and let you decide externally, via the itertools module, which has come up twice already, hasn't it?

It turns out this is a mistake you see throughout libraries that are computational in nature, and you might say: no, people couldn't be missing out on such a big win. Consider scipy.optimize.newton: Newton's method runs for a number of steps to find the zeros of a function. Look at the help text: scipy.optimize.newton takes a max number of iterations, or a relative tolerance, or an absolute tolerance. Those all look like modalities trying to decide when you stop optimizing. And you might say: hold on, it's SciPy, it's probably written in C. Nope: scipy.optimize.newton is just a big old Python for loop. They should have written it as a generator, because if they had, they wouldn't need those arguments at all: they would just run the optimizer forever and let somebody externally decide to islice it down to the steps they want to run. All the generator is doing is exposing the computational steps to the interpreter, allowing you to decide whether to keep going or not, and this works whenever the incremental computation is slower than the time it takes to make a decision, which is likely to be the case for an optimizer. Maybe, maybe not, but you're already at the Python level, so you're already paying the cost anyway.

From all of this you might see that it's worth your time to learn about generators. Tools like generators are a fantastic way to simplify your APIs while making them broadly more powerful, and this is one of my favorite examples to share with people, the genetic algorithms example. Here I am modeling a genetic algorithm where the incremental steps are considered slow enough, because each one does a lot of numpy manipulation, and I'm doing it by modeling the genetic algorithm step as an infinite sequence of steps, so that somebody externally can say "run this until I hit a target fitness," or "run this for a fixed number of steps," or "run until the fitness is no longer increasing." What I like about this example is that hidden within it, in order to handle things like the single-point crossover, you may even use tricks like the masking trick, because if the incremental computation is fast enough, it may be worth doing extra computations and throwing them away; but if the overall computation is slow enough, then you want to model it as a generator. In fact, anything you have that is a multi-step computation where the step takes more than maybe a microsecond, you probably want to model as a generator, to eliminate all of those modalities that make your code harder to use and that preclude certain ways of stopping the computation.
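The scipy.optimize.newton point above, restated as a sketch: write the iteration as a generator and let the caller pick the stopping modality from outside. The function and derivative here are an arbitrary example, not SciPy's implementation:

    from itertools import islice

    def newton(f, fprime, x0):
        x = x0
        while True:
            x = x - f(x) / fprime(x)   # one Newton step per iteration
            yield x

    f      = lambda x: x**2 - 2        # find sqrt(2)
    fprime = lambda x: 2 * x

    # modality 1: a fixed number of iterations, imposed externally
    *_, root = islice(newton(f, fprime, 1.0), 10)

    # modality 2: an absolute tolerance, also imposed externally
    for x in newton(f, fprime, 1.0):
        if abs(f(x)) < 1e-12:
            root = x
            break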
Now, I want to answer the last question for you: why does object orientation even matter? Well, I guess a pandas DataFrame is an object; it's got methods on it. And the numpy ndarray is an object, and it sometimes has methods and sometimes doesn't: why did they put mean on the ndarray but not median? Why is that a function you have to import? Why is it that in pandas you have mean, standard deviation, skew, and kurtosis as methods, but in numpy you have to import them, and some of them actually from scipy? Object orientation is something we can get very deep in the weeds about, but fundamentally the core idea of object orientation is: how do we find better ways to structure the code in our program? And you might look at this and say: why do I care, I'm a pandas user, and I'm more than happy to write some pandas code where I'm basically cutting and pasting things. I have two DataFrames I'm going to operate on, I'm going to do some operations on them, maybe clip them within some particular region from -0.25 to 0.25, then I'm going to subset those DataFrames and do some grouping. Yeah, it's a little bit of cut and paste, but come on, what more do I need? In fact, I can see somebody looking at this going, "yeah, I just did that this morning."

But if I see all this duplication in my code, does it really end? Because we want to subset in different ways, and you can see a kind of geometric growth of the cutting and pasting: here I'm first clipping the DataFrames, then I'm subsetting them on greater-than-zero and on less-than-zero, then performing all the different groupbys on each one of those subsets, and every time I add another aspect to this, I double the amount of cut and paste I have to do. Almost certainly someone in the audience has a Jupyter notebook with df1, df2, df3_temp, df4, and it's basically the same code cut and pasted all over the place. And here's where that really burns you: what if I decide to change the column I'm grouping on? I have to change it in four places, and it's very easy to forget to change one of them. A little bit of data structuring might help.

But do I really need data structuring? Because I could write these as functions, right? I could simplify the logic, create a single source of truth for how to perform this operation, and then take the grouping and write that as a function, so I only have to change it in one place. So I don't know, I still don't see the case for object orientation; this feels like a case where I just need functions, and I'm a pandas user, but I'm at least willing to learn functions because those are the things that get my work done. Well, it turns out that simple object orientation is a very effective way to simplify the structure of our code and gain real benefits, because if we take this function idea a little further, we'll see that we start growing modalities and it starts getting very clumsy: we start making certain assumptions about how the DataFrame looks, and we have all these functions floating around with all sorts of assumptions about how many columns there are and what the columns look like. And I don't know that I'm really gaining anything, because even though I have a function, I still have these two modalities, whether I do the greater-than-zero or the less-than-zero, so I don't know that I'm really gaining anything or reducing that much cut and paste.
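A minimal illustration of the single-source-of-truth step being described, before the move to objects: the clip-then-subset-then-group logic lives in one function instead of being pasted per DataFrame. Column names and parameters are illustrative:

    import numpy as np
    import pandas as pd

    def summarize(df, *, by='group', lower=-0.25, upper=0.25, positive=True):
        clipped = df.assign(value=df['value'].clip(lower, upper))
        subset = clipped[clipped['value'] > 0] if positive else clipped[clipped['value'] <= 0]
        return subset.groupby(by)['value'].mean()

    rng = np.random.default_rng(0)
    df1 = pd.DataFrame({'group': list('abab'), 'value': rng.normal(size=4)})
    df2 = pd.DataFrame({'group': list('abab'), 'value': rng.normal(size=4)})

    # changing the grouping column now happens in exactly one place
    results = {name: summarize(frame) for name, frame in {'df1': df1, 'df2': df2}.items()}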
Well, what I might do is make use of the absolute bare minimum of object orientation in Python: a boilerplate-elimination tool like collections.namedtuple or dataclasses.dataclass. I might create some structure that I can give a DataFrame, which stores that DataFrame subject to some sort of data cleaning; I may give it some additional operations: given the DataFrame you have, give me the subsetted version; here's another operation, given the DataFrame, give me the grouped version. And I can represent the overall analysis as some entity in my program, where previously that analysis was represented in the program only in the form of multiple disconnected lines that were cut and pasted all over the place. I can group things and structure things a little bit better.

And when I do that, I might find it has real benefits. Here is a block of code I wrote a while back to demonstrate something I thought was very interesting: when you look at where tuples are allocated in a Python process, and where numpy ndarrays are allocated in a Python process, the ndarrays are always allocated at lower memory addresses than the tuples, and I was really curious why that might be. So I went and collected some data, all the locations of all the allocations over a sample program, and I said: okay, I want to compare all the different ways these things are allocated to try to understand what's going on; I want to find the pattern programmatically. And as I started cutting and pasting my way through all of this code, cutting and pasting the read_csv code, cutting and pasting all the manipulation code, I said: I need some data structuring, because I'm losing my mind cutting and pasting all over the place. And I reached for the most simplistic structure I could have: a namedtuple with a couple of methods. I wrote a class that lets me instantiate an analysis from a CSV file by reading in that file; I wrote something that lets me instantiate the analysis from a raw DataFrame, and it goes into that DataFrame and does some data parsing, all of my little operations to extract the columns I want, to clean up the data, to parse it, to split it into multi-index columns (which I wish we had time to talk about; that's another talk, "Why do I need to learn how to use pandas? I'm a pandas user," but we'll set that aside for later). For this analysis I wanted to see where these different entities are located, so I wanted to be able to compare them, and I took a slightly more nuanced approach to the object orientation: I wanted these to be named methods, not dunder methods, because, for example, I don't want somebody to be unable to distinguish this thing from an actual pandas DataFrame. I want this to really just be an analysis entity that represents the work I'm doing, that programmatically structures how I'm performing that analysis.
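A hedged sketch of the "bare minimum object orientation" being described: a small namedtuple-based class wrapping a DataFrame, with alternate constructors and named operations. The field names, columns, and cleaning steps are assumptions, not the speaker's actual allocation-analysis code:

    from collections import namedtuple
    import pandas as pd

    class Analysis(namedtuple('AnalysisBase', 'data')):
        @classmethod
        def from_csv(cls, path):
            return cls.from_raw(pd.read_csv(path))

        @classmethod
        def from_raw(cls, raw):
            # single source of truth for the cleaning/parsing steps
            cleaned = (raw
                       .dropna(subset=['address'])
                       .assign(address=lambda d: d['address'].astype('int64')))
            return cls(data=cleaned)

        def subsetted(self, kind):
            return Analysis(self.data[self.data['kind'] == kind])

        def grouped(self, by='kind'):
            return self.data.groupby(by)['address'].agg(['min', 'max', 'mean', 'std', 'nunique'])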
And that allows me to write this code, some of my favorite code I wrote over the last year, because you can read it and understand exactly what it's doing: load in my data for the tuples, load in my data for the ndarrays; the tuple allocations cover a large range. Tell me what the addresses of the tuples are, how many unique addresses there are, the mean and the standard deviation of where the tuple locations are, so we can see where these are located; we can even graph them, and here's the output. Okay, given that, do the same thing for the ndarrays: repeat the exact same analysis and tell me the max location where an ndarray was allocated, the min location, the mean, the standard deviation, and the unique addresses, and you can see I have that analysis there, cut and pasted into this file. Then: give me the range of addresses comparatively, how far apart they are, the min and the max, the max and the min. You can see the kind of geometric expansion where, with two characteristics against two characteristics, I'd have four things to go cut and paste, but here I just have the two lines answering those questions directly. Then: see if I can discover where the discontinuity is in the tuple allocations, because when you do this, you actually see that tuples under 59 elements are allocated in one place and tuples of 60 elements and over are allocated in another, and you can show that discontinuity programmatically by taking the structure, computing the group means, looking at the index minimum of those two groups, and doing a little bit of index-aligned arithmetic. We can see tuples up to size 59 are allocated here and tuples from size 60 are allocated there; you can see it visually and show it programmatically. Then, for one group of tuples, assert that they are always before the ndarray memory irrespective of allocation order, and for the other group, assert that they may be before or they may be after. And you can see that the majority of this code is not the cut-and-paste to do the analysis; the majority of this code is telling the story of what I'm trying to accomplish. This particular example is somewhat mechanical, but I think it's still broadly applicable to the kinds of problems you want to solve, because ultimately, when we have some computational structure, when we have a DataFrame, we want to be able to do operations on it, operations like assigning new columns, and we want to do that in such a way that if the data changes, that column is updated. We might try to do this as a function, but that doesn't quite work out for us; we might try doing this as some kind of analysis object, and we get a little bit of benefit there; we might even be tempted to do this using something like the registered DataFrame accessors, to extend a pandas DataFrame with extra methods, like a .computed accessor where you can always recompute values dynamically.

But let's be realistic: you're a pandas user. You don't care about object orientation, you don't care about generators, you don't care about looping, you don't care about Python basics. Here's something you might care about, though: this is a little bit of Python, just a couple of little mechanics, that allows you to alias columns, so you can give a column two names: a really pretty name, so it shows up nicely on your matplotlib graphs, and a really easy name, so you can use dotted access.
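A hedged sketch of the column-aliasing convenience just mentioned, built on pandas' registered-accessor mechanism; the accessor name "aliased" and the alias mapping are inventions for illustration, not the speaker's code:

    import pandas as pd

    @pd.api.extensions.register_dataframe_accessor('aliased')
    class AliasedAccessor:
        # short, dotted-access-friendly name  ->  pretty name used for plots
        ALIASES = {'px': 'Price (USD)', 'vol': 'Traded Volume'}

        def __init__(self, df):
            self._df = df

        def __getattr__(self, name):
            try:
                return self._df[self.ALIASES[name]]
            except KeyError:
                raise AttributeError(name) from None

    df = pd.DataFrame({'Price (USD)': [80.1, 79.5], 'Traded Volume': [1200, 950]})
    print(df.aliased.px)   # easy name for dotted access; the pretty name still
                           # appears on matplotlib axes when the column is plotted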
And maybe, if you happen to be a data scientist, you might be motivated by the convenience of the things you can do, and by the capability to do things you didn't think possible without cutting and pasting code, without writing code that nobody can make heads or tails of later: writing code that somebody can understand and amend, that they can extend and grow. Maybe you might do that. And if that happens to be you, well, I'd encourage you to follow me on Twitter or LinkedIn; my name is James Powell. But in all truth, why do you need to know about Python? You're a pandas user. Thank you so much. [Applause]

[Q&A] We have a couple of minutes. Question: Thank you very much for the presentation. Coming back to the example you gave with scipy.optimize.newton, which I really liked: sometimes it's hard to push that complexity onto less sophisticated users, and I have felt that firsthand in several positions. How can we bridge that gap, apart from telling our users "watch James's talk and then learn some Python"? Answer: I've learned that a really good way to motivate people is to tell them they're wrong and to criticize them in front of others; apparently they never bother you again after that. Another, potentially more successful, way is to understand that almost all scientific or analytical users are under extreme stress to deliver analytical results, and in today's market environment they are under even more stress to deliver. If you cannot present this in terms of what will save them time, you're just another programmer chattering in their ear. Has anybody ever tried to tell a data scientist to write a unit test? They say no, they're not going to, right? We have programmers in the room who tell their data scientists to write unit tests and documentation; does anybody here write documentation, and do the data scientists they tell actually write it? No. So the way I think you do this: for every part of Python, down to maybe thirty or forty deep into the language, there is a very clear motivation for why it exists. You can map it out: look, you can take the path you want to take, and if you take that path there will be these consequences; if you're okay with that, fine, you are a highly paid professional and you know what risk you're willing to take; but there are alternatives you can try, and if you're willing to entertain those alternatives, there is material benefit. There is a lot in the Python language that doesn't have much material benefit for analytical users; if you read through the changelogs, there's a bunch of stuff they add where you're just like, come on, nobody cares. The statistics module: does anybody ever use the statistics module for anything real? No, they use scipy, they use numpy. The truth of the matter is, if you can tease apart what actually matters in their life, you can present it to them, and you have to pick and choose. These are the four examples I chose, very deliberately, because I know that in my own analytical work I have two paths, and one path is painful; I want to show you that it doesn't have to be that way. Question: With that same example of the Newton method: let's say they're aware of that, but they don't want to cause a breaking change. Answer: You write a convenience layer; you write a convenience layer that stubs out the intermediary. So you write something like newton_..., and I'm going to say something you're going to hate, but generally, for these non-closed-form methods that have some kind of ending modality, the modalities are exclusive, bounded modalities: you either are one, or the other, or the other; and as some of us know, there exists a decomposition of a bounded modality. In other words... yeah, nobody knows what that means.
But anyway: you can take something like that Newton's method and write for them newton_num_steps, or something slightly nicer named, that just hard-codes the islice operation, or newton_atol, and you can create that intermediary layer, one that doesn't require a lot of maintenance, which allows them to just import something that works the way they want to work, without compromising the integrity of the underlying core code. In fact, raise your hand if you are a user of the core layer of a specific tool called matplotlib: has anybody here ever imported matplotlib without the pyplot? One person in the back, three people over here, right. You provide them with convenience layers; you provide them with the pyplot layer; they live in the universe they want to live in, but you keep the core layer sane for the people who, like our colleagues in the back, actually want that flexibility. Does that satisfy you? No? Okay, I need to wrap up. Fantastic, thank you so much. [Applause]
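A sketch of the convenience-layer idea from the Q&A: keep the generator as the core API and ship thin wrappers that each hard-code one stopping modality. The wrapper names newton_num_steps and newton_atol are the hypothetical ones mentioned above:

    from itertools import islice

    def newton(f, fprime, x0):
        # core layer: the unbounded generator
        x = x0
        while True:
            x = x - f(x) / fprime(x)
            yield x

    def newton_num_steps(f, fprime, x0, num_steps):
        # convenience layer: hard-codes the fixed-number-of-steps modality
        *_, x = islice(newton(f, fprime, x0), num_steps)
        return x

    def newton_atol(f, fprime, x0, atol=1e-12):
        # convenience layer: hard-codes the absolute-tolerance modality
        for x in newton(f, fprime, x0):
            if abs(f(x)) < atol:
                return x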
Info
Channel: PyData
Views: 7,099
Keywords: Python, Tutorial, Education, NumFOCUS, PyData, Opensource, learn, software, python 3, Julia, coding, learn to code, how to program, scientific programming
Id: iE5QLrzkGBU
Length: 43min 21sec (2601 seconds)
Published: Tue Jan 24 2023