James Powell: I Just Inherited 50,000 Lines of Code! What Now? — A Practical Guide | PyData LA 2018

Reddit Comments

Only 50k? Luxury!

👍 13 · u/lelanthran · Apr 17 2019

Yes, this talk is long, but it's honestly one of the best talks I've ever seen. While it's from a Python conference, it's absolutely not Python-specific.

👍 3 · u/Arve · Apr 17 2019
Captions
I'm James Powell. You can tweet at me at @dontusethiscode if you like this presentation or if you have any questions about what I'm about to present.

First, a disclaimer. I don't know if any of you have seen talks I've given before — this is, I think, the 26th or 27th PyData I've presented at, maybe the 34th; I've given a lot of PyData talks — and usually they're complete nonsense: something fun but very silly. This talk, as you can see from the title, is intended to be useful, and I didn't think I'd be able to convey "please don't use this code" through the material itself, so instead I decided to grow an absolutely absurd mustache, because nobody can take this seriously with this thing in place. That's my disclaimer; let's get started.

This talk is about a very simple scenario that I think we've all encountered at some point in our working careers: we're working somewhere, one of our colleagues left, and they left behind some large code base — maybe fifty thousand lines of code — and now it's your problem to deal with. What I want to go through in this talk is a simple guide for what to do when you're put into that situation. This is a very complex and very deep topic, and in the span of 45 minutes I don't know that I can give you everything you need to know about it, but I wanted to touch upon a couple of common points, because this is something I face all the time in my work, and I wanted to give it to you as a mixture of a few simple techniques and a little bit of theorizing around the topic, to try to understand how we can break down a problem that immense. So we'll do this very basically: step one, find a new job. There is no step two. Thank you very much — anybody have any questions?

Okay, I heard a bunch of questions in there; let me see if I can answer some of them. One of them is: why lines of code? It's a very imperfect metric, but it's a very understandable one. When we say fifty thousand lines of code, we're referring to a very specific thing: it's not five million lines of code, it's not five lines of code. It's something that's a sizable problem — one you might expect to be within your capability to pick up, but which in practice turns out to be very difficult.

In fact, I heard somebody say — this is more a comment than a question — "doesn't it really depend on what fifty thousand lines of code we're dealing with?" A colleague of mine told me, when we were discussing this talk, that they had written a hundred thousand lines of code in a month. That's ridiculous — that's enormous productivity. The industry average that people have been quoting since the '70s is that the average programmer writes about a hundred lines of code a day, and if you figure there are about two hundred working days in the year, that means the average programmer outputs about twenty thousand lines of code a year. That statistic, albeit a very flawed one, is a statistic that's used to promote languages like Python, because the thought is: would you rather write 20,000 lines of Python in one year, or 20,000 lines of C in one year?

But obviously it does depend on what the 50,000 lines of code are. If, for example, you're given a 50,000-line code base, you run cloc on it, and it shows 20,000 lines of HTML, 20,000 lines of React, and 10,000 lines of Python, that's a very different problem to solve than if it's 50,000 lines of pure Python. Obviously, for some types of code, like HTML, you have other techniques to refactor it — and since it's not active code, there's a certain limit to the complexity that an individual line of HTML can contain, so it's a lot easier to deal with. So for the purposes of the further questions we'll be answering in the remaining 42 minutes of this talk, let's just talk about Python. We'll talk about the kinds of code bases that we interact with day to day, which are large Python code bases.
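The speaker's actual tool here is `cloc`, run from the shell. As a rough, purely illustrative Python sketch of the same idea — tally non-blank lines per file extension to see what a 50,000-line code base is actually made of — something like this hypothetical helper (`lines_by_extension` is my name, not the speaker's) captures the spirit; real `cloc` also strips comments and recognizes hundreds of languages:

```python
from collections import Counter
from pathlib import Path

def lines_by_extension(root):
    """Count non-blank lines per file extension under `root`."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            try:
                # Decode leniently; binary files contribute garbage lines
                # but real cloc would skip them by language detection.
                text = path.read_text(errors="ignore")
            except OSError:
                continue
            counts[path.suffix or "(none)"] += sum(
                1 for line in text.splitlines() if line.strip()
            )
    return counts
```

Running it over an inherited tree immediately tells you whether you're facing 50,000 lines of pure Python or a mixture dominated by HTML and JavaScript.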
Usually that's what you might call an application or a system, as opposed to a core library. For some of the analyses I'll talk about in the further questions I'll be answering, I ran them on code bases I've used for work and on some common libraries like matplotlib; depending on whether it's application code, system code, or library code, some of the metrics and techniques we look at will or will not work.

One really interesting question that I did hear on the cab ride over here this morning was: why don't you use a machine learning algorithm to try to understand the code that you have? I thought that was a very interesting topic. Using the rich data-analytical tools we have in order to understand the code we're producing is something that doesn't really get a lot of coverage, and I think one reason is this: when you're writing code, you're essentially modeling an irreducible complexity — the unending profundity of the real world — into a semi-mathematical model. When you're using a machine learning model, you're dealing with a much more concrete mathematical entity. Machine learning models can give us very interesting results where those mathematical entities correlate closely to tangible real-world things — image recognition, say, where pixel values can correlate to recognizing an entity or not. But it would be very difficult to thoroughly mathematize the functionality, the requirements, and the complexity of a code base in such a way that you could feed it into a machine learning model and get something out of it. It would be an interesting exercise to try, though — if somebody wants to tweet at me after, I don't know, uploading a code base into a machine learning model and trying to predict what the next line of code would be, that would be very interesting, if you could figure out a way to actually structure that problem.

Now, one question that I think I heard among all the questions was: has this actually happened? This has happened to me a number of times. Usually I just found a new project or a new job; in a couple of cases I wasn't able to do that, and so I've had to deal with it.

Think about it at different scales. If somebody sends you an email with five lines of code, you just use those five lines of code. There's a generally incompressible amount of complexity you could see in five lines of code — maybe you could refactor those five lines into one library call, maybe you could fix the syntax a little bit, maybe there's a corner case and those five lines turn into seven or ten lines — but five lines is really not that much to deal with; you can read through it, decide to use it or refactor it all, in the span of about an hour. If somebody gave you 50 lines of code in an email saying "hey, we want you to use these 50 lines," you could probably just rewrite that code. You might not even read through it — you might just ask "what does this do?" and, in the span of about a day, rewrite it from scratch and ignore what you were provided. Another question I think I heard was: what's the big deal — it's only fifty thousand lines of code? Well, continue the thought process: what happens when somebody gives you five hundred lines of code? That might have some hidden complexity, some hidden state; it might require access to resources like a database; it might have a very complex runtime context. Most likely you'd read through it, you'd rewrite it, and the entire thing would take about a week.

You begin to see an interesting expansion of effort, because in a very productive day almost certainly all of you in this room have written two, three, four, five hundred lines of code — but I would doubt that in a very productive day it would be easy for you to rewrite that in just a single day. What took you maybe one or two days to write may end up taking somebody else one week to rewrite, one week to refactor, one week to understand. If we extend that further, 5,000 lines of code becomes maybe a month-long process, where you'd read through every line, try to understand what's going on, maybe write some tests around it. And I think the 5,000-line target is about what you see when people give talks about refactoring via test-driven development: typically they're talking about small libraries where you can easily identify what the functionality is and where the complexity generally lies, write a couple of tests around it, and over the course of a week or two start to dismantle that complexity and understand what's going on.

Another common question that comes up is: why not ask for help with this? Well, if you were given five hundred thousand or five million lines of code, it is very unlikely that a single person wrote it — and honestly, if you're given five million lines of code written by one person, not only should you quit your job, you should short the stock of that company, run as far away as possible, and convince your friends to quit that job too. Or, alternatively, start a consulting company and just sit around and wait until the company comes to you, because they're in a hopeless situation, having to deal with five hundred thousand or five million lines of code written by one person — maybe hire the original author who wrote it, and cash in. But generally, large code bases like that are the work product of multiple people, and so there's usually some form of documentation: a mailing list, a SharePoint site, a wiki; there may be in-code documentation in the form of comments, or some remaining artifacts of the development process.

The problem — let's see if we can get back to where we were; there we go, live-editing our slides — the problem is when we deal with something in the middle: this 50,000-line-of-code level. When you're dealing with 50,000 lines of code, it's possible that it's the work product of a single person, done over the course of maybe one to three years — if you think about that metric of a hundred lines of code per day, that's about two to three years' worth of work. Because it's the work product of a single person doesn't necessarily mean it's not completely critical to the operation of the business unit or the company itself, but it does mean that that person could have encoded a number of hidden requirements into the code. It means there could be a lot of hotspots where some small corner case in the business model is being addressed, and it's entirely possible that there is no documentation and no tests whatsoever in the code base, because since it was the work product of a single person, they could keep all of the work in their mind — they didn't need documentation, they didn't need tests, they just kind of knew how the thing worked. And the most cynical and unfortunate part is that they may have set unreasonable expectations: the one person who produced these 50,000 lines could turn around code fixes or new features in the span of maybe a week or a couple of weeks. When you inherit this code base, it's beyond what you can read through in the course of a week or a month — it may take you as much as a year to actually read through and understand all the complexity in that code. From a business perspective, you go from somebody who was able to turn around a feature request or a bug fix in anywhere between a couple of days and a week, to somebody who has to say, "well, I can't do any work on this system until I've spent a couple of months reading through and rewriting it." You have to be incredibly strategic in order to manage these expectations; if you're not, the business suddenly sees enormous degradation, because you can't respond as effectively as the original author — and yet you have none of the resources, none of the tools, none of the built-in knowledge that the original author had.

So I think the only other reasonable question is: we know that if we're put into this situation we should find a new job, but even in as good a job market as we're in, you may not be able to find a job that quickly. So what do you do while you're job searching? Well, if you don't mind just getting fired, I would say take your vacation time, send a couple of emails, slag off the old author as much as you can. But if you feel a little bit of obligation to your company and you actually want to do something that maybe helps them before you book it out the door, there are a couple of things you can do. One of the main things is to fix your expectations. The original author was able to work within this very degraded software development lifecycle — this very degraded software development process where there might not be documentation, there might not be tests — and still be effective, and they set a set of expectations that you will not be able to meet.

One of the most basic things you can do to reset those expectations is to find something to run, and run it. This is absolutely critical: if you cannot run the code, there's very little you're going to be able to do other than slag off the old author and update your resume. And one of the unfortunate things is that I have been given code bases where I could not run them. "I'll just run the test suite" — there may not be a test suite. "I'll just run a sample app" — there may not even be one. I've been in situations where the code runs on a remote machine that has access to specific databases or specific resources, and I can't just sit at my console, import a module, and run it. So the absolute most critical thing to do is find some runnable context in which this thing works — in which it runs to completion, or, if it's a long-running tool, in which you can at least start it up — because once you're able to run it, you can begin to dismantle its complexity. Unfortunately, Python is such a dynamic language that you're going to find it very difficult to understand a Python code base unless you apply dynamic instrumentation and dynamic analysis techniques, especially in the hotspots where something kind of hinky is going on — and those are the hotspots where you might be obligated to fix a bug or extend functionality.

One of the other things I try to do is iteratively narrow my focus as much as possible. When I was first given an assignment like this — when I was very junior, at one of my first jobs — my goal was to thoroughly understand the code base and be able to confidently say everything about how it worked. Within a week of no reportable progress, I realized that was not going to work at all. In this degraded scenario, where you've been given a lot of code with no support, you need to disabuse yourself of the notion that you're going to have mastery over the code base, and you need to find as narrow a portion of it as possible that you can do something with. You may never be able to understand the entirety of it.

There are a couple of general rules of thumb that I've seen with regard to these kinds of code bases. In most production code bases — especially when you're talking about an application and not a library — less than 20% of the code is interesting; the rest is just the application of some core library or some technique to a business problem, and you don't really need to touch it. And generally, in most of these code bases, about 50% or more of the code works. So there is some core of the code base that's both interesting and either needs new features or is buggy and needs to be fixed — the problem is how to find it.

What I find often is that if you want to break down a large code base like this, there's some portion of the code you might call dumb code, boring code, and some portion that's smart code, interesting code. The boring code is generally very static and very repetitive — typically applications of some core tools or libraries — and there tends to be a lot of it. When I say static, what I mean is this: Python is an incredibly dynamic language. We can do monkey patching, we have first-class functions — we do all sorts of things that make it very difficult to statically analyze our code, which is why tools like mypy, or PEP 484 type hinting, have limits to what they can provide us. Similarly, it's relatively difficult to build the kind of generalized refactoring tools for Python that the Java community uses all the time. But typically the majority of the code base is static and boring: it's just people calling functions where we know lexically what the function is, where you can trace the call back to its import, and there really isn't the dynamic dispatch or indirection that you'd expect from the interesting code.

The smart code, the interesting code — the minority of the code base, and very likely the place where you need to dig in and do something — tends to be very dynamic. There might be code which uses configuration in order to generate which functions get called; that dispatch may not be determinable statically, and you may need a runtime context to even figure out where the code goes. For example, look at how setup.py and distutils work, and try to figure out the path from the entry point of a setup.py to the actual code that runs. Or look at a tool like Click, or any of these other command-line frameworks, and try to figure out how much code runs between you passing in a single argument and the underlying function for that command being invoked. Similarly, for a web framework like Flask or Django, try, as an exercise, to see how far it is between the web app actually receiving an HTTP request and actually dispatching it to a view function or a view class — there tends to be a lot of code in between, and many decisions that can guide the direction of that code differently. This smart code often tends to be very core to the operation of the system; it tends to be fairly minimal in terms of where it is, and it also has a tendency to be scattered throughout the code base, so it may not be easy to find. Sometimes I've seen code bases where I struggled to even figure out where the meat is — it's all bread, all pasta, and the little bits of meat inside are very difficult to find. So one of the things I find necessary is figuring out how to find that.

There's another core dichotomy that I've seen: it's very difficult for us to analyze smart code statically — it's just too complex; there's too much dispatch, too much going on, too much state at the same time — and we often cannot analyze dumb code statically either, because there's just too much of it in terms of lines of code. We can't read through it all; we might be able to apply some static techniques, but we won't see anything that interesting, and the interesting patterns there are not particularly visible. And unfortunately, trying to distinguish where the dumb code is and where the smart code is is also very difficult. I would encourage you to take a library like matplotlib or numpy, if you've never looked at those libraries before, and actually try to find where ndarray is defined — that's the core of numpy — or actually try to find, in matplotlib, where the drawn graph is generated and how that's processed. That's also very difficult to find in the depths of all the other functionality that wraps these core data models.

Now, when I say you need to find something to run: obviously, if there's a test suite, run it. Part of the reason to run a test suite is not only to have that foothold into runnable code, but also to set your expectations, because for the work product of a single person there is no guarantee the test suite actually works. I've seen test suites where the author didn't need to run the tests all the time, so half of them fail simply because they haven't been updated — it was just too much effort for them to do. Imagine your own work product, where you're not sharing the code with anybody else and don't expect to: how often do you really write, maintain, and pay a lot of attention to your tests, especially if you're actioning requests for additional functionality at a fairly fast pace? But you can set expectations by at least telling your boss, "look, I can't do anything with this code base — the test suite doesn't even run." Ideally there's some sample code somewhere — just being able to run even a dumb sample app is useful — and ideally there's also some production code that you can run and observe in operation, if it's more of a system or an application than a library.

The idea here is that you want to set a baseline — not only in terms of the expectations you provide to whoever you're responsible to for the maintenance of that code, but also a baseline for what the code requires in order to run. It's very often the case that you can't just run this code in a Docker container or a virtualenv: you may need a very complex set of environment variables, access to database resources; it may not be isolatable at all. A very good goal, once you have a foothold in the code base and you're more comfortable — once you've been able to get it to run in any fashion whatsoever — is to try to reduce the complexity of the runtime context in which it runs: to isolate the code as much as possible. Because if you can get the code to run in a Docker container, with no access to any other folders, without network access, with only an enumerated list of library requirements, then it's going to be a lot easier to start teasing apart that complexity. A lot of very complex code bases end up being complex because they rely on state that isn't tracked in the code itself — not in the requirements.txt, not in the setup.py. Additionally, that runtime context is very important because the only effective way you can really understand what's going on in a large Python code base is via dynamic analysis, and unless you have a runtime context to throw your dynamic instrumentation tools into, you're not going to be able to get anywhere.

One big caveat: what if the software is data-driven? I've been given large code bases which are data-driven — the code base itself relies on some data in order to determine the modalities in the code. You're somewhat out of luck here, because with a very data-driven app written by a single person, unless you have a sample set of data that exposes all the possible modalities of the code base, you're going to have a lot of difficulty really managing it — there's going to be functionality that you won't even be able to ascertain exists until you see the sample data. I've seen cases where you have an application that relies on an object database — people storing pickled Python objects in some NoSQL object database, so the objects have functionality associated with them. Imagine it's a finance application: it loads a portfolio, each object represents a single position in that portfolio, and each object has a price method. Unless you know what a sample portfolio looks like, you don't know what the price methods look like; you have no idea of the actual extent of what that code can do, and you'll never really be able to get a hold on what's happening.

So a really important question is: how do you find the smart part of the code? You're given this enormous code base, and you cannot read through 50,000 lines of code in a week unless it happens to be exclusively dumb code — and the monotony of reading through 50,000 lines makes it very easy to skip over the interesting or the smart parts; your eyes just roll to the back of your head and you zone out while you're reading. Another important question is: why is it important to analyze this code base dynamically — why is it difficult to analyze it statically? To start with, let's think about some very simple static analysis techniques. When I was putting this talk together, I asked a couple of my colleagues who are core developers on different Python libraries about the techniques and tools they've used, both the static techniques and the dynamic ones.

A very simple static technique is what we saw before: just run cloc to figure out what languages are used in this code base — how much of it is Python, how much is C. If you run that on something like matplotlib, you'll see there are portions of that code in Fortran and portions in C. For the most part, if you were inaugurated as the new BDFL of matplotlib, it may be the case that the C code or the Fortran code is so old, and has been around for so long, that you don't have to look at it immediately — you may only eventually have to look at it. You might want to focus your attention on the parts that are more user-facing, which would be the Python code. Similarly, if you see a code base with a lot of HTML or CSS, you can narrow yourself and start to whittle down how much of the code you actually care about.

Another approach — one person said: just take all the code, print it out, and sort it by line, and you may begin to see certain patterns. I ran this on matplotlib, sorting by the most common line: the most common line was an empty line, and then there were a couple of very common documentation lines. That doesn't give you a whole lot of information for a library. But imagine you have an application where certain function calls are made over and over again, or a code base where users write little scripts or plug-ins: you'll start to see common lines — database access, setting up the context — and you'll begin to see a couple of hints about what the general pattern is. It's not that useful a tool unless the code base is written in a very repetitive fashion. Another approach might be: just take every file, split it on words, and find the most common words.
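The "print it all out and sort it by line" idea is easy to sketch. This is an illustrative Python version (the helper name `most_common_lines` is mine): histogram every stripped source line under a directory, so that in repetitive application code the recurring calls — database setup, context creation — float to the top, while in a library you mostly get blank lines and docstring boilerplate, just as described for matplotlib:

```python
from collections import Counter
from pathlib import Path

def most_common_lines(root, n=10):
    """Return the n most common (stripped) lines across all .py files."""
    counts = Counter()
    for path in Path(root).rglob("*.py"):
        for line in path.read_text(errors="ignore").splitlines():
            counts[line.strip()] += 1
    return counts.most_common(n)
```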
Because you can think: if the code base was written very statically — if there's a `from module import function_a` and then a bunch of lines that call function_a, call function_a, call function_a — you can at least find which functions are commonly called. Again, in library-type code this is not that useful, because if you think about the way that code would be written: if people were repeatedly calling the same set of functions in the same order, somebody would add an abstraction where that's extracted out, in order to minimize the amount of repetition. We're told as programmers: don't repeat yourself; we're told that this normalization of our code makes it easier to maintain. And so the desire to normalize our code and to not repeat ourselves ends up defying our ability to analyze it statically.

Now, we can improve the static analysis by doing things like using Python's tokenize module to actually tokenize the code. If you do that, you need to be a little bit more sophisticated — filter out keywords, remove punctuation — and you can at least find certain general themes of the code base, for example which variable names are used often. When I ran this on matplotlib, I found a couple of variable names referring to, you know, rcParams coming up as one of the most common tokens in matplotlib, and for those of us who are matplotlib users, we can kind of understand why that would be. Does it give us that much insight into the code? No, but it gives us a little bit of a foothold — so when the boss asks, "it's been one week, where are you?", you can say, "well, I know rcParams is really interesting." I don't know how useful that will be, but it gives you a little bit of a foothold for understanding what's important or what's common. Ideally, if the code base is very static, you might see common function names or common module names percolate to the top of that list, but you might have to do a lot more filtering — and if you look at the first 200 items of that output, it might take you an hour, but it might give you something worth grabbing onto.
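The tokenize-based histogram can be sketched in a few lines. This is my own minimal version of the technique (the helper name `common_names` is hypothetical): feed source text through Python's `tokenize` module, keep only identifier (NAME) tokens, filter out keywords, and count — run over a whole code base, names like matplotlib's `rcParams` percolate to the top:

```python
import keyword
import tokenize
from collections import Counter
from io import BytesIO

def common_names(source, n=10):
    """Histogram identifier tokens in Python source, excluding keywords."""
    names = Counter()
    # tokenize.tokenize wants a bytes readline callable
    for tok in tokenize.tokenize(BytesIO(source.encode()).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            names[tok.string] += 1
    return names.most_common(n)
```

Punctuation, operators, strings, and comments are distinct token types, so they fall out automatically — the extra filtering the talk mentions is just the keyword check.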
So, unfortunately, where we're guided as programmers is to defy the ability to statically analyze our code — which means we need techniques to analyze our code dynamically, and this tends to be much harder to do in Python than it is in Java or C. I want to go through just a couple of simple techniques that I've used to dynamically analyze code.

First, we can talk about these as what you might call instrumentation mechanisms: mechanisms which take some code and add a little bit of information, given some context in which you can run the code — sample code, a test suite, or production code. In many cases, if the production code is doing a lot — it's some kind of high-frequency mechanism responding to requests very quickly — the amount of output you generate from these instrumentation mechanisms may not be understandable by a human being, so you might have to take that output and apply data analysis techniques to it: generating histograms of the most common output, or writing analysis tools to actually analyze the output of your instrumentation. That's not uncommon. But let's assume, at least for the purpose of looking through these instrumentation mechanisms, that their output is within what you can look at in an afternoon and understand as a human being, without applying another meta-layer of instrumentation analysis.

A very simple starting point is holistic instrumentation: you could use a tool like strace and try to find all the files that are opened — all the files that are touched, read, or written — by your tool. That at least gives you the ability to understand the runtime context: what configuration files does this look at? Is there a database or a socket that's opened?
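strace watches system calls from outside the process; the talk doesn't show code for this, but as an in-Python analog (my own sketch, not the speaker's), an audit hook on Python 3.8+ gives a similar "which files does this thing touch?" view — every `open()` raises an `"open"` audit event you can record. One caveat: audit hooks cannot be removed once added.

```python
import sys

opened_files = []

def record_opens(event, args):
    # The "open" audit event fires for open(), io.open(), os.open(), etc.,
    # with the path (or fd) as the first argument.
    if event == "open" and args and args[0] is not None:
        opened_files.append(str(args[0]))

sys.addaudithook(record_opens)
```

After installing the hook, run the sample app or test suite and inspect `opened_files` to see config files, data files, and module paths being read.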
It's often the case that if this has not been deployed correctly, the modules that are used may not be isolated into a conda environment or a virtual environment; they may be scattered around the file system, and you may end up looking at one version of a library that you think is being used and be misled, because the one actually used is stored somewhere else. So this could at least tell you generally where all the files are that this thing reads. How useful is that? Well, it gives you at least one baseline to figure out what the runtime context is. If you want to figure out what the runtime context is, another thing that I see in complex code bases is that they'll make use of environment variables, and those environment variables will determine how the thing runs; it won't just be things that are passed in on the command line. So you can always go into the os module and just patch out os.environ, and you can see what environment variables are actually being looked at. Think about a code base that you've run where it has, like, a UAT, a QA, and a production environment: there may be environment variables that cause modalities in that code base, that cause you to follow one code path or another. Unless you can actually figure out which environment variables are actually looked at, you'll never be able to pare down the runtime context. For many of these apps, it's like a shell script that sets some environment variables, that calls another shell script, that calls another shell script, that finally calls a Python system. If you can at least track, at the end, what those environment variables are, you might be able to start paring down the execution context so you can eventually isolate it into a Docker container, so you have some fixed boundary around what this thing does and what this thing touches.
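The os.environ patch mentioned above might look something like this minimal sketch, where RecordingEnviron is a made-up name:

```python
# Sketch: swap os.environ for a wrapper that records every lookup,
# so you can see which environment variables the code actually reads.
import collections.abc
import os

class RecordingEnviron(collections.abc.MutableMapping):
    """Wraps a mapping (e.g. os.environ) and records accessed keys."""
    def __init__(self, real):
        self._real = real
        self.accessed = set()
    def __getitem__(self, key):
        self.accessed.add(key)  # record the lookup, hit or miss
        return self._real[key]
    def __setitem__(self, key, value):
        self._real[key] = value
    def __delitem__(self, key):
        del self._real[key]
    def __iter__(self):
        return iter(self._real)
    def __len__(self):
        return len(self._real)

def patch_environ():
    """Replace os.environ so code doing os.environ[...] gets recorded."""
    os.environ = RecordingEnviron(os.environ)
    return os.environ
```

Note that only code going through the os.environ attribute is caught; anything that grabbed a reference to the mapping earlier can slip past.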
Another dynamic approach: Python is actually full of dynamic analysis techniques and tools. The sys module has a line tracer: sys.settrace. One thing that I've done is: you do a settrace, you run your test suite, and you create a histogram of every filename and every line number that's touched. Think about the static analysis technique where you sort the lines of code that you see, or you tokenize and histogram the tokens: that's gonna mislead you, because if somebody has refactored that code, suddenly something that was touched a hundred times is now only touched five times, because there's now some layer of abstraction that reduces the repetition. However, in a runtime context that repetition still exists, and so if you look at every line of code that's executed as part of the execution of your test suite, you might actually be able to find out where the hotspots are. You can see: oh, when I run my test suite over this code base, this one particular line of code in this one particular module gets called 50,000 times, so that's probably close to where the interesting part of the code is. Or you can start to graph out, more or less, what modules are interesting, what files are interesting, and what lines are interesting; you could do the same thing just looking at the file name, or just at the package or the module. I did this for matplotlib with a very simple pyplot histogram to try and see what that would give me, and I was able to pick out certain parts of matplotlib that I didn't know were that interesting, where those lines of code were being executed, not 10,000 times but like 50 or 60 times, for just a simple execution of a simple graph. Now, the information that I was able to glean from that did not give me enough to understand how matplotlib worked, but if I ran it over the entire test suite, I might see certain areas, like, you know, general configuration dispatchers which every piece of code goes through, start to percolate towards the top, and that would give me a little bit more of an indication of where I need to start looking.
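A minimal sketch of that line-histogram tracer (the names here are illustrative, not from the talk):

```python
# Sketch: a sys.settrace line tracer that histograms
# (filename, line number) pairs to expose runtime hotspots.
import sys
from collections import Counter

class LineHistogram:
    """Context manager: counts every traced line executed inside it."""
    def __init__(self):
        self.counts = Counter()
    def _trace(self, frame, event, arg):
        if event == "line":
            self.counts[(frame.f_code.co_filename, frame.f_lineno)] += 1
        return self._trace  # keep tracing lines inside this frame
    def __enter__(self):
        sys.settrace(self._trace)
        return self
    def __exit__(self, *exc):
        sys.settrace(None)
        return False
```

Something like `with LineHistogram() as h: run_your_tests()` and then `h.counts.most_common(20)`, where run_your_tests is whatever entry point you have. Note the frame that installs the tracer isn't itself traced, only the functions called from it.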
By the way, we haven't even gotten to maintaining it; we're still trying to figure out where to look. This is what happens when you're put into this situation: you don't even know where to start. Now, another technique that I have employed successfully: in Python you have sys.meta_path, where you can insert hooks into the import mechanism. What I've done before is I've used NetworkX and a custom meta-path import hook in order to log what the import machinery does. Very simple: every time Python tries to import a module, my import hook says, "yeah, I can import it"; internal to that import hook, it just re-invokes the importing mechanism, and it keeps track of what imported what. This actually works surprisingly well in Python, because the importing mechanism in Python has a lock around it, so it's impossible for multi-threaded code to actually call this thing in parallel, and so you can write some fairly simple mechanisms, about 20 lines of code, that assume everything is single-threaded irrespective of how the application actually runs, and that construct a graph of: this module imports this module, which imports this module, which imports this module. When I did that, I began to see, in that application, one module that everything imported, and that was a place where I wanted to start looking. Because if you think about our application code, it's often the case that if it's written well there's some core, and it's a hub-and-spoke model, where some core is used by everybody and everything is kind of a spoke off of it. However, if you do that and you see a spiderweb, where there's no one place where everybody convenes, then you really know you'd better hurry up in your job search, because this is gonna be a real mess.
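One way to sketch that import-graph hook with a plain edge set (hand the edges to networkx.DiGraph later if you like); rather than re-invoking the import as the talk describes, this simplified version just records the edge and returns None so the normal machinery does the work. The frame-walking heuristic for finding the importer is my own assumption:

```python
# Sketch: a sys.meta_path finder that records (importer, imported)
# edges. Returning None defers actual loading to the regular finders.
import sys

class ImportGraphFinder:
    def __init__(self):
        self.edges = set()
    def find_spec(self, name, path=None, target=None):
        # The frame that triggered the import is several frames up,
        # inside importlib; walk outward until we leave importlib.
        frame = sys._getframe(1)
        while frame and frame.f_globals.get("__name__", "").startswith("importlib"):
            frame = frame.f_back
        importer = frame.f_globals.get("__name__", "?") if frame else "?"
        self.edges.add((importer, name))
        return None  # let the normal import machinery load the module

finder = ImportGraphFinder()
sys.meta_path.insert(0, finder)
```

After the program runs, finder.edges is exactly the "who imports whom" graph you'd feed to a graph library to look for the hub.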
It's not unusual for codebases to end up in that pathological case, where it's no longer a hub-and-spoke but just a web where everything imports everything, and it's difficult to identify what the core part of the module is. There are other tools; there are quite a few tools that you can download, and one that I like is objgraph. You can give objgraph a Python object and it'll actually graph the references for that object. So, for example, if it's a data-driven application, you might be able to take one of the pieces of data that's driving the modalities in that program and ask Python's gc module to actually tell you: this object holds a reference onto these objects, and these objects hold a reference onto this object. You take one of the data items that you have, you can see who's holding a reference, and you'll begin to see some of the collection structuring. In a financial example, if you're trying to figure out how a portfolio analysis tool works, you might start with a single position, and you'll see from that: well, there's a portfolio that holds a reference onto the position, and there's some larger object, and a larger object above that. In a GUI program, you might see some modal dialog held by some window, held by some other window, and it'll give you an idea of the structuring of the project. Because remember, at this 50,000-lines-of-code middle ground, it's very possible that the code is not well structured; there may be some structuring to the code base, but it may not actually reflect how the code base works. At 500,000 lines or five million lines that starts to become untenable, because new people can't be onboarded onto the project and figure out where things are, but at the 50,000-line code base it's very possible to just remember in your mind where all the interesting parts are, and that doesn't have to be reflected in the code structure. The problem is, all of these techniques may then require further analysis.
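Even without installing objgraph, the gc module alone answers the "who holds a reference to this?" question; a tiny sketch:

```python
# Sketch: objgraph-style exploration with only the stdlib gc module:
# what kinds of objects hold a reference to this data item?
import gc

def referrer_types(obj):
    """Sorted type names of the objects that refer to obj."""
    return sorted({type(r).__name__ for r in gc.get_referrers(obj)})
```

Starting from a single position object, you'd expect the list or dict inside the portfolio to show up here, and you can walk outward one layer at a time. (Frames and temporary locals show up too, so expect some noise.)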
So instead of actually looking at this information itself, one thing that I always do is log it to disk, because you can generate megabytes and megabytes and megabytes of information from the line tracer, since it's telling you every single line that's run. Above, I made a histogram of it, which is a compression of the information; instead of generating the histogram directly, if you just log every line, and you have the ordering there, and you know that it's a single-threaded application, then you might be able to do larger-window analysis. Remember, the histogram is only looking at one data item at a time. Imagine, once you have this on disk, you might ask: what are the bigrams of function calls, or of lines that are executed in some fashion? You can start to apply some of the analysis techniques that you're used to from your actual work to the code itself: finding which two function calls are likely to appear together. You might even create a machine learning model: given this line of code, what's the most likely line of code to follow it? What would that tell you? I don't know, but it might tell you something kind of interesting, and at worst it'll be a really cool PyData talk that'll help you find that next job. So, one thing that I often used to try when I was given these kinds of code bases is to find the entry point. I am very much on the fence about how useful it is to find the entry point of an application. It's useful in one context: if you have the time, it's very useful to find it, because if you can find the entry point of a Python application, as in wherever the main function is, or wherever the thing is launched from, it can very much help you with some dynamic instrumentation techniques where you need to inject or wrap something, and you need to make sure that your injecting or wrapping code occurs first in the execution of the program. Otherwise, finding the entry point and trying to trace it to the interesting parts of the code base can send you down rabbit holes.
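The bigram idea mentioned a moment ago is nearly a one-liner once the ordered log exists; a sketch, with a made-up log format:

```python
# Sketch: given an ordered log of executed records (e.g. "file:lineno"
# strings from a single-threaded run), count which record follows which.
from collections import Counter

def line_bigrams(log):
    """Histogram of adjacent pairs in an ordered execution log."""
    return Counter(zip(log, log[1:]))
```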
I've wasted so much time going down that rabbit hole of: here's the main function, here's what it calls, here's what that calls, here's what that calls; oh, now I need to know some dynamic input to figure out where it goes next, and then the number of possibilities, those modalities, explodes. However, if you have the entry point, you might be able to do some injection or wrapping that allows you to inject instrumentation to find the core functionality. So, one of the other questions I heard was: how do we find the functionality? (Who's impressed I'm still holding on to the conceit that we started 35 minutes ago?) Get into a debugger. Because you have to analyze this code dynamically. Yes, you could emit print statements or logging statements and then load those into a pandas DataFrame and do some analysis, but oftentimes the code that you care about is small enough that you just want to find it, and it's difficult to find. Sometimes I've signed myself up for lightning talks where I promised to make this change or that change in the Python interpreter, without actually knowing where that code is in the Python interpreter. The Python interpreter is, from memory, about 150,000 lines of C and Python code altogether, so it's a fairly large code base; it's well designed and it's well laid out, but even then it's difficult to find things. For the talk I did on Thursday, one of the things I wanted to find was where the Python interpreter performed NFKD normalization on tokens, so I could patch that out. I asked one of the core developers who was there at the conference, and he didn't know off the top of his head, so I had to use debugging techniques to find a point which would allow me to trace where in the interpreter the code is that performs that action. There's a couple of things you can do. If you're using Python 3.7, there's a new breakpoint function that's been added that will drop you into a debugger.
So if you can find some code, you just type breakpoint as a function call, and you can drop yourself into a debugger. Wouldn't it be amazing if the horrible code base you were given happened to also be up to date with all of the libraries and all the versions of Python? Maybe this is a little bit speculative, because there's a very good chance this terrible code base we're talking about is written in Python 2.6, and if you use Python 2.6.2 it breaks for some unknown reason, and it's using some version of matplotlib that was only released on one platform fifteen years ago. That's a possibility. There are a couple of other tools that have been in Python for a very long time. The pdb module gives you a debugger, and there's post-mortem: if you can find the entry point, one of the things you can try doing is triggering the code to fail. Wrap the entry point in a try/except; once you get into the except, after you've triggered it to fail, maybe by doing Ctrl-C to kill it, raising a KeyboardInterrupt, the post-mortem will launch pdb into something where you can go up and down the backtrace to figure out where you killed it. So: run the application, instrument the entry point or create a new entry point that launches into the original entry point, instrument that with try/except and post-mortem, find some way to kill the application, and you get a backtrace that tells you at least the stack trace of where you are, which might give you an indication of where to look next in order to find where some functionality occurs. Similarly, pdb also has set_trace, so if you kind of know something adjacent to the interesting part of the code, just say "from pdb import set_trace" and call set_trace. But pdb is a pretty weak tool.
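That try/except plus post-mortem recipe might look something like this sketch; the function name and the debug flag are my additions:

```python
# Sketch: wrap an entry point so that any failure, including Ctrl-C,
# drops you into pdb's post-mortem on the traceback.
import pdb
import sys

def run_with_postmortem(entry_point, *args, debug=True, **kwargs):
    """Call entry_point; on failure, optionally open pdb, then re-raise."""
    try:
        return entry_point(*args, **kwargs)
    except BaseException:  # BaseException so KeyboardInterrupt counts too
        if debug:
            pdb.post_mortem(sys.exc_info()[2])
        raise
```

From the post-mortem prompt you can then walk up and down the stack with `u` and `d` to see where you killed it.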
Sometimes what I've done is: in the code module in the standard library, there's something called InteractiveConsole. You can use InteractiveConsole to actually drop your code into an interactive console, just like what you get when you type python on the command line, and it's about four lines of code; I've done this often enough that I can type it from memory. What you do is: you find an interesting part of the code base, and you drop into an interactive console there. It's not as rich as pdb, because you can't go up and down the stack trace to figure out where this was called from, though you can use the inspect module to do some of those things. But this gives you an interactive console, so you can start to do things like trying to mutate the current state. Here, this is taking a copy of the globals and locals, but if you take the original locals, you can mutate that state and see what happens. One of the techniques that I used to do in C code all the time is to instrument the C code with something that looks like this: int 3, which is a trap to the debugger. I'd run my C code under a debugger, trap to the debugger at the interesting point, and that would give me just enough of a hook: I'm in the debugger at a point which I kind of know something about, I can see the backtrace, and from the backtrace I can see the next place where I'm gonna try to put this debugger hook. You can do the same thing in Python using the signal module and os: if you add these three lines, you'll actually trap to the debugger. The problem is, in a normal run of Python, this is just gonna cause the Python program to segfault; it'll just core dump on you, so you need to be running Python under a debugger for this to work. However, unfortunately, these techniques are kind of invasive, and all of them kind of presume that you have the ability to take the code base and change it. Imagine this code base is deployed, and we have some production deployment mechanism: you can't really just deploy code into production that has breakpoints all over the place. And so, if you don't have the ability to invasively change the code, by adding "from pdb import set_trace" all over the place or adding breakpoints, you're a little bit limited.
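The "four lines from memory" InteractiveConsole trick looks roughly like this sketch (the helper names are mine):

```python
# Sketch: drop into an interactive console at an interesting point in
# the code, seeded with the calling scope.
import code
import inspect

def caller_namespace():
    """A copy of the calling frame's globals merged with its locals."""
    frame = inspect.currentframe().f_back
    return {**frame.f_globals, **frame.f_locals}

def console_here():
    """Start an interactive console where this function was called."""
    frame = inspect.currentframe().f_back
    code.InteractiveConsole({**frame.f_globals, **frame.f_locals}).interact()
```

As the talk says, this hands the console a copy; pass frame.f_locals itself if you want to experiment with mutating live state, with the usual caveats about writing to locals.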
There are a couple of techniques you can approach that are similar to this. Unfortunately, one of the major limitations is that Python has no ability to create what you might call a pure proxy: you can't take a data item in Python and wrap it in something else, either via composition or inheritance, to make it a pure proxy. One of the reasons is that if you have a subclass of the type of object that you have, somebody could have written "type(x) is ...", and you can't fake that in Python without going down to the C level. So in the absence of a pure proxy, the most you can do in Python is subclass something, implement the dunder methods, and try to capture accesses, like I did in the previous slide with os.environ. This is just about the best you can do in Python; you can't completely fake out an object in order to catch where it's used. I've had this happen before, where one piece of data in a large data-driven pipeline application was causing the application to terminate, and I didn't know which piece it was. I loaded all the data that I knew the program was going to use, I wrapped it in proxy objects, and then I watched for the moment one of those proxy objects had a method called on it, in order to figure out where in the codebase the method was being called that was causing an exception, like a division error. Because this was embedded in a runtime context where I couldn't see the traceback and I couldn't get any debugging information, I needed to do this myself. There are some significant limits to what you can do with that, but it's at least a good start. Now, one other thing that you could do: none of this code still works, but a while back, as a joke, I created a module called rwatch, and what rwatch allows you to do is put a hook on the Python internal virtual machine every time a variable is pushed onto the internal stack.
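A sketch of that subclass-and-capture approach; TracingDict is a made-up name, and it illustrates the limitation the talk describes: isinstance still passes, but "type(x) is dict" does not:

```python
# Sketch: not a pure proxy (impossible in pure Python), but a subclass
# that records attribute/method accesses on a data item.
class TracingDict(dict):
    accesses = []  # shared log of attribute names looked up
    def __getattribute__(self, name):
        if not name.startswith("_"):
            TracingDict.accesses.append(name)
        return super().__getattribute__(name)
```

Implicit dunder calls like d[key] bypass __getattribute__ entirely, which is part of why this is only "just about the best you can do."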
So every time pure Python code is going to look at a variable, you can put a breakpoint; it looks something like this, where you can basically tell Python: every time you run a Python instruction that's going to look at a variable, give me a breakpoint. Unfortunately this code is not maintained, but it's one approach where you can actually start to get something akin to a pure proxy, and debug these situations where you have data-driven applications that just don't work, or where one piece of data is kind of gumming up the works. Now, one other thing that you could do, similar to the debugging trick: you could, in your entry point, register a trap handler. You could tell Python: when I get a SIGTRAP, like when somebody sends that kill signal to me, go into this handler. That would allow you to have at least one point where you could use the inspect module to get the current frame and the outer frames; then, external to the program, you just do "kill -TRAP" on the pid of the program. What you do in that case is: inside that handler that you registered, you use the inspect module's currentframe and getouterframes to figure out what the backtrace is, what stack frames led you to that point, and you'll be able to get a little bit of information about where you are in the program and what's going on. One thing that I've done is run pure Python applications under gdb and then just hit Ctrl-C, which would trap to my debugger. This can be helpful because from the debugger I can put breakpoints, but only on C code, so it's only really useful if I'm trying to debug some Python application that has a Cython extension or a C component as well as a Python component, and I'm trying to figure out what's happening on the boundary between the two. But you could use this to maybe try to find the entry point: in C, you do a break on main and figure out where main is implemented.
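That signal-handler trick might be sketched like this, using SIGUSR1 rather than SIGTRAP since SIGTRAP collides with debuggers; the function names are mine:

```python
# Sketch: register a handler so that `kill -USR1 <pid>` from outside
# dumps the program's current Python stack, no code changes needed.
import inspect
import signal

def format_stack(frame):
    """Outer frames of `frame`, innermost first, as file:line in func."""
    return [f"{info.filename}:{info.lineno} in {info.function}"
            for info in inspect.getouterframes(frame)]

def _dump_stack(signum, frame):
    print("\n".join(format_stack(frame)))

def install_stack_dumper(sig=None):
    if sig is None:
        sig = signal.SIGUSR1  # the talk uses SIGTRAP; POSIX-only either way
    signal.signal(sig, _dump_stack)
```

Call install_stack_dumper() once in the entry point; afterwards, sending the signal from another terminal prints where the program currently is.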
This is one of the first things I did in Python: figure out where the main function for Python itself is implemented. It's in pythonrun.c, I believe; no, it's in the Programs directory, under python.c; I can't remember. But what you could do is use gdb to break on, say, PyEval_EvalFrameEx, and in a source distribution of Python there is a set of gdb macros, under Tools/gdb/libpython.py, that allow you to introspect the Python layer from gdb, so you can see from a gdb C-level stack trace what the actual Python functions are that got called, and that can start to give you some information about what frames were evaluated and where you are. This is a very advanced technique that you would only use if you had a Python application that called into C which called back into Python; that's not that unusual for data science code. So I think one of the last things we might think about is: how do we avoid inflicting this pain upon others? And I think this grab bag of different techniques, this general discussion, kind of leads us to a couple of takeaways in terms of how we avoid this problem. Make it easy to run: if you're writing code, make it easy for somebody to just run it. What that means is: create test code, and make sure that test code is up to date, because that test code is how they're going to exercise the functionality of your program, in order to get a hook into that program and be able to find where things are. Even if you don't document "this is the part of the module that's interesting, this is where the core is," if they can at least run it, they can at least start to use some of these dynamic analysis techniques to figure out where things are. If not test code, give them sample code, and make sure you keep it up to date, because that can also help them exercise the modalities, the modules, the functions that are used in this code base, and see how they are structured and how they interrelate with each other.
If it's a data-driven application, please give them sample data, because a lot of the complexity and the functionality of that application may be hidden in the data that's used to drive it. If you only provide the code but not sample data, they may have no ability to really understand what's going on. Make your code runnable in the most isolated circumstance possible, ideally in some Docker container using some default version of, like, Ubuntu LTS, with no network access and without access to any shared folders, so that they can isolate what it is this program needs in order to run. Because in the absence of that, they're gonna find it difficult to even get started with your codebase, to find a way to run the thing: oh, it needs this config file in this directory, it needs access to the network in this place, it loads this from the database; and that's before they even try to figure out what modules they have to install to get the thing to run. Ideally, make your code importable as a module. I've seen many code bases where the only way to actually get a runtime context is to run the application by calling it on the command line. If you're gonna write your code, make it importable as a module, so somebody inside an interactive console can just import your code and start to play around with small pieces of it. When your code is only runnable as a full program, that immediately requires that they isolate the functionality down to "this function with this data produces this result" themselves; if you make it importable as a module, they can just start importing pieces of the codebase and immediately whittle down what they have to care about to just one piece. We talked about, you know, flat is better than nested: in the structuring of your program, you really want to have a fairly flat structure, or a hub-and-spoke.
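The "importable as a module" advice is mostly the familiar __main__ guard discipline; a sketch, with an invented main:

```python
# Sketch: keep module level side-effect free so `import yourmodule`
# works from a console; running only happens behind the guard.
import sys

def main(argv=None):
    """All the work lives here, so pieces can be exercised interactively."""
    argv = sys.argv[1:] if argv is None else argv
    return f"ran with {argv!r}"

if __name__ == "__main__":
    main()
```

With this shape, someone inheriting the code can import the module in a console and poke at individual functions without launching the whole program.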
Because otherwise, when they start to look at the stack frames, the tracebacks for that program, they're gonna find that the traceback where some configuration parameter is used is here, and parallel to that there's some code path that set the environment variable, and the further up and the deeper the point where those two code paths join, the harder it is for them to understand. So what you want is a very flat code base, where there's one controller that calls a bunch of stuff, and they can see in that controller: oh, load this configuration, get this user input, and then actually perform some operation on it. Ideally there's some centralized hub where everything is run from, and you're not going, you know, 50 frames deep before the code is actually in its main loop, where it's trying to perform the core of the action. Don't bother writing out documentation; nobody's gonna do it. I could tell you to write documentation; don't even bother. But asserts: the assert statement in Python can be elided from running Python code. If you run Python with -O (capital O), assert statements get elided from the AST. In fact, if you use the __debug__ variable, with two underscores before and after, and you have a line that says "if __debug__", that code branch actually gets elided; it gets removed as part of the compilation of your Python code. So instead of writing documentation, at least put a couple of assertions there, so that somebody reading through your code can say: well, they said "assert x > 0"; under -O that's not going to be a runtime check, but at least at this point in the code I can assume that x is greater than 0, because if this were run in a non-optimized context, without -O, the thing would have failed. And so you can start to give people the ability to constrain how many possibilities they have to consider when they read through your code, by adding asserts.
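A small sketch of assertions-as-documentation; the function and its invariants are invented for illustration:

```python
# Sketch: asserts as executable documentation. Under `python -O` both
# the asserts and the `if __debug__:` branch are compiled away.
def to_percentages(weights):
    assert len(weights) > 0, "must have at least one weight"
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    if __debug__:
        # potentially expensive sanity check, elided under -O
        assert all(w >= 0 for w in weights), "no negative weights expected"
    return [round(w * 100, 2) for w in weights]
```

A reader now knows the preconditions at this point in the code without any prose documentation, and a normal (non -O) run enforces them for free.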
So, at the minimum, if you're not going to write documentation (and don't bother, because nobody's gonna read it anyway), use asserts as a form of documentation, or use the "if __debug__" trick to put into your code expectations that somebody reading through it can use to limit how much they have to keep in their head when trying to understand your codebase. And so, among all of those, I hope you now have a little bit more of a tool set for trying to understand this problem. This is a very deep and complex problem; any one of these techniques we could go into in great depth. I hope that was interesting to you, and I hope it gave you a couple of techniques and some guidance on how to maybe avoid inflicting this pain upon your co-workers. But, you know, why do you care? You're leaving the company anyway. Thank you so much. I think we're out of time for any real questions; did anyone have any real questions? Who's actually had this happen to them before? Really? No? A bunch of you have never been given some work process that wasn't documented, wasn't tested, and now it's your responsibility? Were there any techniques that you used here that might be of relevance to the rest of the audience? Absolutely. And we can also see something about the world, which is that some people, like David here, are the kind of people who drive down the street and get every green light, and some of those of you who raised your hands are the kind of people who get every red light, so you're always the one who receives the code that David wrote. Any other comments? Really? And that's good, because when I leave places and they ask me to hand over my work to somebody else, I always give them wrong documentation, just to stick it to them; but if you videotape me, then you've caught me at that. But if that's everything... yeah? That would be amazing, but I think what happens when people leave organizations is their email inbox just disappears.
And so all the conversations... nobody records those conversations. When I've been given these code bases, oftentimes I've been told, "oh, here's a thread where people talked about it," and the thread has two comments and three links, and it's really unfortunate. So maybe part of it is: if you're building a codebase, you're building something, have an inbox where you can just CC in important, relevant email threads for later archiving. Because it's absolutely the case that when you leave an organization, all your stuff gets archived, you know, in case the SEC comes looking for it, but nobody else ever gets a chance to look at it. Absolutely. Is anyone hiring? Yeah, that's a problem: you're handing it off, and you might generate reams of documentation that the next person just ignores, because so often it's not useful at all. Okay, I think we have time for one more question or one more comment. And the problem with Python is that sometimes you can't put in the breakpoints: suppose you have a physical point in the code, but you're trying to put a breakpoint on a function that's dynamically created through some other process; you can't put a breakpoint in the IDE on that. Okay, I think that's all we have time for; we might have run a little bit over time, but thank you so much. [Applause]
Info
Channel: PyData
Views: 11,976
Rating: 4.9163179 out of 5
Keywords: Python, Tutorial, Education, NumFOCUS, PyData, Opensource, download, learn, syntax, software, python 3, data scientist, data science, data analytics, analytics, coding, PyCon, Jupyter, Notebooks
Id: mr2SE_drU5o
Length: 58min 44sec (3524 seconds)
Published: Thu Nov 29 2018