Advanced Python Debugging Techniques Using GDB

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
where is mr. Collins dolly with advanced python debugging techniques using gdb cool Thanks should be a fun talk I hope so I'm Colin stylee I work at sprout social Chicago for sure talk to me afterwards we are hiring and if this kind of stuff excites you I have some really nice people for you to talk to you later also note that all this stuff is on github and it actually might even be helpful to pull it up and look at it so I'm sorry there's no better way for me to communicate that to you then just put it up there but if you want to take a look there's we're doing some example at the end looking at the code might be helpful so cool so um sorry and I keep looking at the screen because I can't pull the screen up on my laptop so so sorry this is kind of awkward but so typically when you have bugs in your code your programs in a crash that's kind of the happy scenario yeah I got a crash I can fix it it's easily fixed unfortunately many times I'm not that lucky my code has bugs and I don't get a nice clean crash I don't get an error message in the logs it's just not working and a lot of times these bugs are only you only see them under weird circumstances particularly circumstances that you didn't see while you were developing or when you when you were touching your code in a staging environment almost always in production and for some reason always around 3 a.m. often on a weekday so debugging these programs in these kind of environments is extremely difficult what you typically have is log statements print statements a lot of times that can be very helpful one of our other developers Jeff son Ryan gave a great talk yesterday on logging and how important it is and that's exactly true having good logging is helpful PDB is another great tool to use for debugging Python programs and mutters same thing these are all helpful tools however all these tools require you have to have instrumentation in place before the bug occurs and the bug has to be relative readily reproducible what I mean by that is you have to have log statements and in place in advance before the bug occurs in order to catch it right or at the very least you need to have a bug where you can do something make the bug happen and sort of dig into a little bit more many types of bugs at least the kinds of bugs that I have confronted this is actually not feasible right because if I had known the bug was gonna be there I would have just fixed it right these always take me by surprise so we're gonna look at gdb as an option for debugging live processes that haven't been instrumented and we want to do is attach to the process see what's going on and sort of catch these things in the act the beauty of gdb of course is that you can attach to any running process it could even be a Java Virtual Machine it could be a go program or in our case what we're looking at today is the the Python interpreter the beauty of gdb 7 is it actually has built-in integrations that enable you to actually directly go into the Python code not just at the sea level where we're typically used to seeing in gdb 6 and before actually gdb 6 did offer some really helpful macros in a file called gdb in it and the C Python repo you can still use those if you're a gdb 6 person but the integration in seven is much better and it's actually also more reliable there's always a rub the rub of course you have to have debug symbols so the deal there is in order for the debugger to know what's going on you have to have a wealth of metadata in there to inform it as to what the state is these debugging symbols are generated by the C compiler to compile time the problem is of course that in probably 95 percent or better of Python installations out there these symbols are stripped out I don't fully understand why that is I think it's probably to conserve bandwidth in the day and age of terabyte hard drives I don't know why we're worried about disk space but perhaps to conserve bandwidth we strip them out typically they're embedded in binaries they can also be stored in external files and so you can get a hold of these things by either so the - G flag on GCC if you're compiling a C program just toss that - G in there it'll give them to you just embed them right in the executable you can also get them so in specifically in the case of Python if you're doing an app to get 1 2 or Debian style that's the Python gdbg you'll see a lot of those actually I think in the in the Linux distributions these dbg this and DB g that these are all just the debug symbols for these C programs to facilitate this kind of debugging and likewise if you're a Red Hat person yum install debug info same same idea you want the debug symbols for both the C library the Lib Python library as well as the the executable for the Python interpreter so likewise if you have any C extension modules you'll want to do the same thing toss that - G on there in the build step this is you have debug symbols there - so if you need to dig into the C extension you can see what's going on there - if you don't have extension excuse me if you don't have debug symbols for your extension modules it'll just kind of give you this relatively unhelpful oh yeah it's in some module but I don't really know what's going on and of course if you're trying to get at something that's not very helpful couple other notes I use OpenBSD a lot so I have to set this little sysctl thing for global P trace that's a security feature with something similar also in Linux for something called llaman and this essentially the idea here is you can't trace processes even if they're owned by you if they're not your direct descendants and there's well-documented security issues on why that's important people have been able to go into SSH and sort of attach to that process and then do cool things or maybe even nefarious things when when using this mechanism so it's typically disabled by default but you can of course enable it be very careful when you enable is I just warned there are definite exploits available for these for these settings which is why they're turned off by default so if you do end up doing some production at worst set them back to disabled after you're done debugging but yeah they are definitely dangerous of course if the process is running as a user other than you you're going to need route as simple as that so this happens all the time people ask me ok well that's great but my production server is is not outfitted with gdb doesn't have debug symbols and furthermore my dev ops team is never gonna let me do this so what can I do there or perhaps maybe the the bug you're trying to chase down happens before you get a chance to install the bug symbols RGB on there the beauty of this is of course if you can have a lab machine where the lib and binary versions are the same you can actually drop a core on your production machine copy that core file to your laboratory machine bring the core file back up on your laboratory machine preserving the state and then debug from that point without ever having to actually go on to production to see what's going on and I've done this on a number of occasions simply because I can get on to get the core file but I'm not gonna sit there and bug the process while all my production is sitting there hurting because the process is not working get the process running again but take a snapshot of the state so I can devote it later the one trick oddly enough if you do have a laboratory machine you're just sort of poor file dumping and copying make sure your Python code is in there and it's the same path that was on production otherwise I get all kinds of weird errors and won't make any sense it's like well yeah you kind of have to have your Python code there because the Python interpreter debugging symbols do not contain your code and nor do the core files so yeah so once you attach G core is your friend you can just say G core and then a filename it'll just dump it out for you this sometimes doesn't work I don't know why but never fear there's always another way you can also send a sigmoid or six AV seg viii signal to terminate the process as well as drop a core that way so get around in different ways something like that perfect finally you're also going to need Python gdb PI this is in your C Python I assume if you're going this deep into it you're gonna have a copy of this C Python source lying around grab that module you're gonna need it what that module contains read it sometime it's quite interesting has a lot of tools for basically all the implementation for all the Python specific tools in gdb and you're going to need that available so that those tools are exposed when you're in your gdb debugging session I typically just place it in my virtual environment bin directory you basically want it where the Python binary is living I always use virtual environments so it's easy enough to just stick it in there if you don't really know where to put it you can also also just add a gdb init file that sources it works great as well so just some basics and again I'm not assuming everyone here has used gdb before C programmers probably use this all the time maybe a lot of people in your haven't so we'll just talk about some of the just the basics of like what this thing does and how it works getting core file I mentioned it already G core for looking at back traces this is probably the most useful command I can tell you about today and good old PI BT so what that's going to do is it just gonna actually run through a back trace and we'll go through some examples you can actually see how this thing works but the idea is this is a Python specific back trace and back trace full will give you some additional state information for each frame as it gives you the back trace and then if you want to go to the sea level for whatever reason you can also use BT which is the standard gdb BT back trace but of course it's not gonna it's going to give you the back trace at the C level not at your Python code level and that's a and that's a pattern you're gonna see all over the place same idea here if you want to walk up and down the call stack PI up PI down these C analogues being up and down same idea PI locals and PI print the idea here is I want to take a look the local variables that are defined one sort of annoying limitation of Pi print and PI locals is if you want to look at like a an attribute of a local variable like self dot foo you can't actually do that very easily with PI print I've been searching and trying to find a way to do that if anyone knows a way to do that let me know but generally you can only sort of look at the the base level objects in the in the local scope info locals and info args will also tell you again at the sea level what variables you're looking at typically not as not nearly as useful if you're debugging a Python program as PI locals and PI print but but those are there as well for inspection this finally is extremely useful PI list or list will give you the source code of where the debugger is currently sitting incredibly useful kind of surprised me at first that I got actually like literally pull up the source code and have a pointer as to what line is sitting at I do a lot of threading work so I want to know what threads are available and what they're up to info threads will tell you all the information very useful if I want to select thread three for example simple as that and gets direct traces for all threats so sometimes when I'm doing a process I have 10 different threads and I don't want to necessarily iterate through each one by hand this one does it for you I can get a back trace of every single thread that's currently in the process another very useful command so typically it's going to look something like this when you're going about debugging this stuff you're going to attach to the process with a pin I'm gonna give it the executable so it knows where to load the symbols from then you're going to see what threads are available you're gonna pull up I a back trace and you're probably gonna pop around at different threads and see what's going on at least that's how it typically runs for me pretty simple stuff maybe so let's let's take a look at some examples now I've simplified these for brevity but these are actual code examples from production code not necessarily at sprout social although some of them have come from sprout social but others them have come just from places wherever and actual code bugs that exist so hopefully it will get some useful knowledge out of just like don't write code that does this kind of stuff in addition to finding out how gdb works that's my hope anyway so the first one to look at is something that we do a lot these are HTTP Polar's we had a situation where we basically have a process that repeatedly fetches data from a bunch of URLs this is a very common thing it's Prout we're constantly pulling in data from social networks we do this a lot and just to increase throughput we're gonna we're going to use threads of course threads are of limited utility in Python as we all know but when it comes to i/o they're incredibly useful because you're gonna have a lot of blocking right when you're pulling points double up in network so let's take a look at the code really quick if I can pull this up just to kind of give you a sense of what what we're up to here so I hope this code is reasonably straightforward for us to see the idea here is again we're setting up a thread pool we're going to basically dispatch just a crap-ton of jobs into the pool and then each job is basically this fetch this fetch method fetch method is gonna build a URL it's gonna do some timing calculations but ultimately is just gonna bottom out in this requests call I presume everyone is familiar with requests if not it's just a wrapper around URL Lib that's very convenient to use it's doing HTTP GET request there on the URL and returning its JSON we log what happened and then we return whatever it got pretty much all it's going to do and the idea here at the end so we're going to instantiate it we're gonna kick off say 40,000 or so of these jobs and and then we're as they're completing we're gonna log them out and this is what happened so that's basically what the the example code does if you see the bug already don't tell anybody this is a surprise yeah so let's just take a look here I also okay so I'm sort of mocking out the HTTP requests in this case I have and again it's probably on my github an example of an HTTP server that does bad things so it's going to maybe it'll drop a connection once in a while maybe it'll do something stupid like that maybe just won't respond right away these are the client of course the kinds of things that happen all the time in in everyday life but you know for the purposes of this demo it's gonna be particularly bad right so that we don't just sit and wait for three hours for it to fail so I think it's already running if not this will be hilarious I think it's not right all right maybe we'll have liftoff here all right let's give this another shot all right so excellent so it's at a pretty pretty reasonable rate of speed here we're pulling down stuff off of this if this URL it's going pretty quick see the counter going up really fast that's always good to see now what I'm gonna do get over here I'm gonna curl this crazy thing and I'm gonna say okay just let's let's let's make it busted let's just certain misbehaving like crazy you're gonna see that throughput kind of diminish it's getting slower and slower it's trying to get really sad now notice the request time is roughly about the same about 10 a lot 100 milliseconds or so I think something like that getting pretty stinking slow at this point kind of hope and you'll die pretty soon that's where we're headed so while we're waiting let's get the process ID good old P grep there's our process ID oops I have no idea why I can't highlight that but right so it looks like we're in a dead stop here now so this actually happened and I actually want to point out that if I curl the server the server is doing just fine by the way so yeah I can I can I can curl it all I want it's it's it's still responding it's just fine but our polar has completely seized up also note if you like a top you can't look at top if you look at top it's doing nothing it's just sitting there well virtually nothing so so something something's going on right so so let's see if we can find out what's going on so what what's that so let's go ahead and fire up the the gdb debugger here we'll go 13 was it 829 or so cool so what just happened we attached to the Python interpreter process it traced through all the libraries that LinkedIn loaded the symbols so now we're pretty much good to go here right off the bat we're sitting and received from so some people should maybe start to think maybe this is a network issue we'll go ahead and fire off PI DBT and see what's see what's happening a lot of stuff going on here let's see if I can page down a bit we'll get a little bit lower into the code again this is all a Python back trace right this isn't c-level stuff so I want to make sure that's clear that this is this is this is an actual stack trace of Python code not sea level code now what we're actually looking at here so this is where we're sitting we're making the get request and and obviously nothing is happening we're actually in our I can't find it here's our example 1 code line 24 we're sitting here waiting for the rest was to come back okay so let's take a look at info threads see how many threads we have going right now we have several thread apply all pi BT and what you're going to see I know this is a lot it's like an entire wall of text here but what you're gonna see over and over again is that every single one of these threads is sitting in requests get waiting for a response every single one so now we could dig into this more I kind of want to move relatively quickly here so we'll get we'll sort of cut to the chase but if we go back to our source code and this is where audience participation gets fun can anyone wager a guess since we're sitting stuck at this line anyone wager a guess as to why so this is actually one of the deficiencies of the requests API that I think personally is the timeout is is defaults to none so there's no timeout of this at all this will sit forever and wait for a response or in many cases it will sit and wait for a read there excuse me will sit and wait for a response it may well fail on connection but if it succeeds to connect but never gets a response it will still wait forever for that response to come back so what you have is you have a full saturated pool of threads all sitting here waiting for a response that don't never get very sad so obviously a solution specify timeout is a key word hard but by default and I see junior programmers make this mistake all the time there is no timeout it will wait forever and particularly in the context of something like this where you have a lot of activity if there's a timeout just fail don't don't sit and wait for it but by default that's what it does so let's check out example two okay cool so this is this is actually another very common thing we have a queue of work we're gonna fill up we're going to spawn a bunch of threads to consume until the work is gone pretty simple idea so let's take a look at that code real quick so we have a worker pool class starting to look a lot like the other one maybe okay so you've got you've got to work you in a result queue we're gonna pull jobs out of the work queue we're gonna do the work we'll stick the result on the result queue excuse me and and then we'll return all the results as a dictionary I hope that's kind of clear here so we pull up pull out the jobs we add them to the queue we fire off a bunch of threads apparently ten those ten threads will start them up wait for them to finish when they're done pulling off work and that will return the result so this is just a little you know kind of a work queue thing we could have potentially used the thread pool executor here to avoid having to kind of implement some of the basics that it does for us but we didn't for whatever reason someone in thought probably thought they worked smart enough to do it themselves and then the other thing here so here we're actually looking at the queue if it's not empty we'll pull off some work we'll do the work and then of course we'll put the result on the queue so let's give that guy a shot this one requires no so it worked that's good great I'm glad it worked worked again that's great let's try it again that worked just fine and it actually works fine a lot of the time I'll bet you it won't fail today yeah oh it did okay so what the heck right you'd think it would be deterministic but of course when we're using threads things aren't always deterministic so so we got a problem here right okay so let's let's pop back to gdb take a look you'll see what's happening good Python and whoa look at that example one still running isn't that something all right so yeah so there's our pit we'll give it the Python executable we'll fire it up the same thing as before so let's go to sector eights what's going on so we're waiting on a join here this is actually fairly straightforward I think this is actually waiting on all these threads to finish not terribly interesting take a look at info threads see what else is going on so there's only one other thread apparently running so let's go ahead and look at that one thread too and we'll see what he's up to so he's sitting here check this out this is our little program he's sitting waiting on work yet and apparently he's been sitting there for a long time now let's have a look at PI locals live debugging for the win all right we're probably not this is moving up and down the stack now this is perhaps somewhat interesting let's move one more up here but again I want to drive home the point here that now we're actually walking through the call stack of the Python code I realized this is a ridiculous wall of text but the idea here is we're actually getting here's the arguments for this class this self hears self way up here you see my pointer we have access to the result queue here all tasks done and of course some of this stuff is unintelligible but if we follow along here I'd like you to take a look and clearly I've debug this before but for purposes of the demo that's fine where are we add unfinished tax tests 2000 however I think that's actually in the result queue but the idea here I can't actually see it somewhere in here there should be a unfinished tasks is zero I'm just not seeing it because it's such an unintelligible wall of text but the idea here is that we can actually take a look if we're patient enough to go through the text to see what's going on inside the code and that's sort of the whole point actually now that I notice it here's work.you right here so if we look here whoops that didn't work move down a bit max sighs anyway that's fine but the idea is is that it's actually exhausted the queue really which I can see where that thing is it's in there somewhere but the idea is that it's completely exhausted the queue and it's just sitting there Owen get now I don't know if anyone's familiar with the with the queue API but if we go back to the code I'm actually just gonna if we go back to the code can anyone tell me so we're sitting here right I think that's what it was I should go back and double-check yeah so we're we're actually let's just bring this home pile list is telling us we're on line 32 here there's our code that's what's going on so if I can actually get T mugs to cooperate so we'll go to line 32 so what excuse me so can anyone hazard a guess as to what's going on here yes exactly right that's exactly right so the idea here is that we are checking the queue if it's empty first saying no then we go to grab a job but oh look some other thread came in grab the job before I got a chance to get at it so now I'm blocking waiting for the queue to empty out and that's where I'm stuck excellent work thank you so yeah so that is yet another example let's try it one more I mean I'm hoping we'll have enough time to cover all for this one this is a C tub types issue in this case we are exposing some C functions to Python using C types this was actually lifted predominantly out of Pi chef if anyone's these that not to pick on them but I also haven't filed a bug a bug report on this perhaps I should Thanks so let's let's take a look at what's going on there maybe yeah cool so this is this is your typical kind of c-types thing the idea here is we're just exposing some C functions and Lib crypto to Python and so C types kind of gives you gives you the ability to wrap these things in Python objects and then call it's actually quite powerful so we're setting up RSA sighs private encrypt generate key there's probably a dozen other ones that I just simply snipped out for purposes of this demo but the idea here really salient point anyway is we're just wrapping some C functions with Python so that we can access them so we have this little key class again this is just adapted from what is actually in PI chef but the mechanics are the same so we can initialize it will generate a key with this class method we can encrypt stuff just give it a value and some padding and it'll just kind of figure that out again all calling underlying C functions that's all it's doing and I just have some example code here just to sort of make that happen so what happened in real life here was the nops guy ran this stuff he's like oh I got this great chef recipe it's great but look at that I'm so skillful I actually got the Python interpreter to cordoned I was like very good very nice so why did it core dump so again gdb to the rescue as you might have guessed we can absolutely take a look at that just so we're clear here we have the core file conveniently located right in this directory so we'll go ahead and load that up with - C cool now immediately like right off the bat don't even have to do anything right take a look RSA sighs getting an argument 0 hex 0 then highlight this so that we're clear we're dereferencing a null pointer well of course we got a segmentation fault over D ref dereferencing a null pointer so right off the bat we're like okay cool so this this actually shouldn't be too difficult to debug good ol pi BT tell us what's going on so inside our example we're calling RSA sighs that argument is apparently null or none so I've kind of given given away most of that just by just by just by running through that stack trace will do a PI locals here just to sort of expose the the local variables but again ki is equal to none here so unsurprisingly if ki is equal to none and we're calling RSA sighs on self ki that's going to get a none just going to be a null it's going to be a null pointer it's going to blow up so I've pretty much given it away at this point but let's go ahead and just return back to the code oops we're not so what we have here is simply a failure to set key in the initialization you have to call this generate thing first and although otherwise you're just going to crash Python so not too surprising there nothing really that jarring but to be able to pinpoint that error without having to you know wonder why the heck did generally like when people see the Python interpreter explode like that you know they start screaming and tears and things are involved so very nice way to just immediately pinpoint what's going on finally plastic thread deadlock problem here this is sort of a standard garden-variety readers writers problem we have a pool of threads that might read or write and we're gonna use locks to serialize the writes but we want to allow multiple reads to execute concurrently so let's take a look at that code and again I hope this is reasonably reduced so that we can follow what's actually going on so we have readers writers we have a write method and a read method we grab a file lock we write to the DB file and then we just return the string right for reads we grab a mutex we increment our reader count we acquire the file lock if we're the only reader do our read and then likewise release it for the last reader to to have the lock so the idea here is if I come in and say yeah I'm a reader I want to you know if someone else has already got the read lock great we'll increment the count with you you're reading concurrently with everybody else no problem and then will log if anything bad happens so I've basically just set up a test case for that will instantiate the class we'll set up another thread pool we'll pass it oh I don't know 50,000 or so test jobs to just write out I am biasing this so that we're kind of read heavy so it has a two out of three chance of picking a read three chance of picking a right and we'll just submit those off to the pool using our read and write functions and then we'll wait from the finish and we'll log them out and then we'll just rinse and repeat when that's done so let's see how many more team up sessions I can stack onto here that's crazy yeah all right we'll just reuse this one then it's incredibly hard to do when you're not in front of the screen yeah so let's go ahead and run that and an amazing pace we're actually right reading and writing right and it's it's pretty decent no problems so I'm going to cheat a bit I'm gonna create some problems I'm going to make some IO problems happen on the on the actual machine so as long as nothing bad happens this code works great it's always when something bad happens that we run any trouble so so let's go ahead and run my little chaos shell script here which is basically they're just gonna pick at random pint time so just just remove the DB file now this is sounds kind of silly right just like deleting the dozens of database file from underneath it but in fact in these days of clouds and I scuzzy and hypervisor pauses and things like this you absolutely can have a situation where the file might temporarily not be writable or might just disappear or maybe there's something else going on so obviously is a little crude but let's just go ahead and run it and what do you know you guessed it it froze up no stack trace no errors just looking stupid that's all we have so by this time we're familiar with the drill let's take a look at gdb if I can type the word P grep I mean quite a collection of stalled Python processes on here 57:17 alright so let's take a look at the threads first see what's going on there everyone's sleeping so that's always fine let's take a look at where everyone's sitting right now waiting for a file lock we're trying to write we're waiting for the lock ok let me see if I can actually get up there is it like totally obvious yeah waiting for the lock let's page up the next one what do you know waiting for the lock by this time suspense is no longer a thing waiting for the lock and finally we have one that's apparently sitting here waiting for everybody else to finish so what's going on so we have a bunch of writers sitting waiting for the lock but apparently no readers that actually are holding the lock right so with only a fraction of time left let's uh let's go back to the code you know what this anybody care to offer a gasp as to what's going on here and we're very short on time so I'm going to rush I'll give you a hint it's in this method first of all some dope made this a debug level in production you're never gonna use debug so even if there was an error exception it wouldn't been logged anyway so that's that's bad don't do that the other thing is here is that like if an exception happens it's gonna bomb out and it'll never let go of this lock and that's exactly what happened when I was screwing with the DB file right is that like an exception happened it never released the lock everybody else gets jammed up so cool so in conclusion gdb is a sophisticated tool I've scarcely touched the surface here but this sort of gives you a flavor of the kinds of things you can do and there's like way more sophisticated stuff you can do I'm just trying to give you the sort of basics of what's going on I would advise you to resist the temptation to restart and fix it later a blame that goes our solar flares this is what people do all the time oh who knows just restarted it's fine well the idea is like we should actually find out what's going on when you have these crazy problems you might learn new things and you'll definitely improve your code so that's my advice to you thanks a lot don't laugh that's my twitter handle but i never post to it but you can absolutely send me stuff but yeah it's like probably three years old so yeah thank you thank you for for watching
Info
Channel: Next Day Video
Views: 9,281
Rating: undefined out of 5
Keywords: pyohio, pyohio_2016, ColinStolley
Id: rB9rPdMRxIA
Channel Id: undefined
Length: 40min 44sec (2444 seconds)
Published: Thu Aug 04 2016
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.