A Deep Dive into Python Stack Frames

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
thank you everyone for attending this talk I'm Nick Hill and I'm going to talk about how Python does call stack evaluation and I would implement stack frames so I've used Python for small tasks for over a decade but it's only been about a year since it became my primary language at Dropbox so at Dropbox I work on the desktop client and most of it is written in Python even though it functions as a native app so the way we achieve that is that a lot of our libraries eventually use native code to talk to operating systems and native code can do bad things like it can cause seg faults it can do weird thread race conditions and so on and all of these lead to programs crashing so this is obviously not a great experience for our users so we do need some way to be able to know when users are crashing and why they're crashing so the way we have done this for several years at Dropbox is to have a signal handler and a signal handler allows us to trap certain kinds of conditions within the process and then run some code to you know capture what went wrong and upload it to our servers the problem is there are some limitations with a signal handler for one thing it has to be in Python because all our networking stack and stuff is in Python so if the crash happens before Python even started or before the signal handler was installed then we don't get any crash reports similarly I don't know how familiar people are with UNIX internals but signal handlers interrupt the program indiscriminately which means you try to do anything fancy within them you can end up in deadlock or all sorts of bad stuff so in late 2016 we started to migrate to this thing called crash pad so crash pad is a piece of software written by the Google Chrome team and used by chrome it runs as a separate process from your application and so you can start it like very very early in your application startup which means anything that goes around after like the first five lines of your code you'll show up in the crash pad crash handler and because this runs out of process it can do a lot more complicated things like it can like capture all the stacks in your program it can capture more extra information you can add some extra things like Oh what version of the app is running you can assign some unique ID per crash or something like that so it's a very good mature solution and once it captures the crash it uploads it to the server where our tool chain like processes them and we can do aggregation to find out like which crashes are being hit the most and so on the problem is crash pad is designed for native applications so Chrome for example is written mostly in C++ which means it doesn't really understand Python so you know how in Python when the program does something wrong you get a nice trace back like that telling you exactly what went wrong well crash pad sees something like this and this isn't very useful at all like you can see that something went wrong in C but then all of this code is just like C and C Python interpreter details so if we wanted to like actually get this shipping to users we would need to teach crash pad about Python so is it possible to get to some something like this in or like that we can teach crash by to do this so I'll spare you the suspense yes we can so the way it works is that when Dropbox crashes its it ends up in this suspended State and then crash pad can go and do basically access all of the process of memory and it can read like registers it can read thread states and so on and so we can leverage this by understanding how the C Python interpreter is evaluating our code because if we know what kind of data structures it is using to run our code then crash back and like just read a memory of those data structures and capture the information that we need so this is where my journey started into digging into this the Python interpreter is one of the easier interpreters out there to understand because it doesn't have like a just-in-time interpreter or something and there's the global interpreter lock which a lot of people don't like but it definitely makes reasoning about the code a lot easier so oops what did I just do so here's the code that created the C stack that we shop saw before it's very simple all it does asleep so that I can attach gdb to it and get the stack and here's the cleaned up stack so does anybody notice any parallels between the two sides so there are these three functions that appear in the C stack for every Python function here so we can deduce that by eval underscore eval frame X is something important for the rest of the stock I'm assuming Python 3.5 some of these details change in python versions so just be aware of that so PI eval frame X is the workhorse of the Python evaluation loop it is actually written in C and is about three thousand lines long so this is the simplified version in Python so the wave Python works is that your source code is translated into something called byte code if you want to know more about that James Bennett gave a talk at PyCon this year about that so basically each of your Python source lines is translated into some simple instruction like binary addition or call function and call function here is the important one like every time you actually call a function that's what this gets translated into and then when it encounters the call function it actually like calls the C function call function that eventually degenerates into fast function and that is how we get the stack that we saw before where there were like these three things that get interleaving with each other and now you've been noticing that there's these frames and this PI thread state and so on and this these are all the data structures that these three functions manipulate to evaluate Python code so let's dig into that text so lots of C structures coming up this is literally code from the Python interpreter copied here and I've removed some things to make it easier so every Python interpreter has one PI interpreter States this stores a lot of information about the overall state of the program and then for every thread every Python thread there is a associated Python thread state structure by the inci Python Python threads and native threads map one is to one so essentially every native thread has one Python thread state so let's look at the interesting bits in this structure for one thing we have the thread ID this is dependent on the operating system and uniquely identifies the thread we see that there's a pointer to the interpreter State this is like a BAC reference to all of them pointing the same interpreter state and then there's the struct frame which we're going to see in just a bit and finally you'll see these two pointers struct previous and next both of which are of type thread state so anybody who's done c programming might immediately recognize this as something called intrusive linkless and what this means is that this is just a data structure that has a linked list within itself pointing to the previous and next thread structures so in you can see that if when the main thread starts it has a reference to the interpreter state and then when you create a new thread thread one it's going to use these previous/next pointers to create a link between main thread and itself and similarly for thread two and the interpreter States thread state head maintains the head of this linked list so you can just iterate through them to see all the thread States and then finally we have the frame so each thread state is responsible for evaluating one byte on thread right and so each function call is represented by what's called a frame denominator comes from native code execution and as you make nested function calls we keep building up these linked lists of frames and so this F underscore BAC can be thought of as a linked list pointer to the next element and so if you go back here oops so the thread States frame pointer that we saw is going to point to the very first like topmost thread in the evaluation so when we called main a called foo which called bar and then bars thread states frame pointer now points to bass and then bass this F underscore back field points to bar and so on so again we can like once given a frame pointer we can walk the link list and get all the frames that we care about all right so the other thing to notice is that there's the spy object where head macro here this evaluates to something that allows basically Python code to access this frame and access certain attributes on the frame so clearly the frame structure is really big and initializing or allocating this can get pretty expensive once you start making like thousands of these function calls a minute so Python interpreter does some interesting things to avoid doing that when possible I recommend you read the code there's something called a zombie frame which I found very funny so fortunately most of this is bookkeeping information so exception types and so on we can ignore this it is all used for loops exception handling trace backs and so on let's focus on the simplest case which is a frame that only evaluates a really simple function in this case again we have the F underscore back pointer then most importantly we have the F code variable and this is a code object so what I mean is when you write a Python function and it is evaluated it is translated into this bytecode and this bytecode and some other associated information is all stored in a code object and so even if you are like multiple calls through the same function they're all going to show us share this code object then we have the locals plus array for those not familiar with C syntax this means this is a array of by object pointers that can be arbitrarily sized because it's at the end of the structure and this is the execution area for each frame this is where local variables are going to get operated upon State is going to change and so on the F value stack is a double pointer which means it kind of points into the locals Plus array and I'll show diagrams of this in just a bit and it's used to indicate which is the bottom most operating area within this the F underscore locals plus stack the F underscore last I variable is a reference to the instruct bytecode instruction that was last executed by this frame and we'll see how it's useful for doing some reverse engineering so this is an example of the frame let's take the simple example where there's collar function that calls the collie with just one numerical argument and the collie has a one local variable so the collars frame looks like we saw the whole frame structure since it's the first function called F underscore back points to the module and then the thread States frame pointer points to this frame finally since it doesn't do anything particularly complicated its stack size is only two both of which are empty right now so we can use the disk module in the standard library to see the actual bytecode when this is being executed so it's using the two instructions load global unload constant to load the collie function itself and the argument 7 onto the stack and then we encountered the call function of code that we have seen before this obviously causes it to enter the call function and the fast function function so here I folded all the arguments into the diagram for readability but remember that each of these are pointers to PI objects here when when the caller calls the collie a new frame is created by fast function for it and we can see that it hasn't started executing yet and then it allocates two slots in the locals plus array because there's the argument itself and then there's the local variable that we are going to assign and then there's a stack size of two this stack size calculation is done by the interpreter when it compiles your code into bytecode and we are going to ignore how it does that and then finally you can see that the value stack is pointing to the index - because that's the stacks offering area and we don't want to go below it an F underscore back has been set up to point to the caller from here execution can proceed so first of all fast function is going to copy the one argument that is passed to the function into the first slot in the local splice array and then we can see the disassembly for the Cali function the first thing that happens is we load a constant 5 so it gets pushed onto the stack and then we store if we use the store instruction to put it as a local variable and then we again load both those variables back onto our stack this seems kind of redundant like it had 7 5 there and then it has 75 again but the entire Python evaluation loop can only operate on the stack space and so you have to do this like copying even if it doesn't make any sense finally we encountered the binary add instruction which is really simple it pops the first one out does the addition and replaces the second one with the answer 12 and then we hit the return value we pop it off the stack and we return it to the caller and that's it like Colley's evaluation is done it'll be cleared up destroy de-allocated whatever needs to happen huh so that was a lot of things happening so but that basically is the journey from like starting with a function call to how it gets executed so the inspect and trace back modules allow you to dig into these frames if you need to and based on everything that we just saw we can replicate a lot of this functionality ourselves so for example here it is we can access the the top frame in each thread using the cistern frames function and then we can access various attributes to print stack like that so that's what we need to do to get it working in our crash reporting except we have to do all of this in C code outside the interpreter so there's only one more thing of remaining we know how to walk the stack we know how what data structures are involved to find the stack for each thread so what's the starting point the interpreter the thread states and the frames are all being dynamically allocated so we don't have a good way to find the beginning of them fortunately it's all just code one thing I realized in my career is that it's all software and any time like something isn't providing you enough information you can just make a change and it'll work only remember that when you change the interpreter you're taking dependencies on stuff that is guaranteed to break at some point so the one thing we did was we used thread-local storage so python uses thread-local storage to store the thread States what I mean is that try local storage is the operating system provided structure where you can have multiple threads sharing the same key in this case five that points different values in each thread and so you can see that each thread State is preserved separately and this is already done by the interpreter all we needed is a good way to access auto TLS key so we use something called a section which is a all modern OSS they're executables allow you to like name regions of memory with a well-known name and that way you can just like later oh give me the section that's named this and it'll give you the range of memory you need to look at and with that we can go and walk our entire data structure so eventually we'll reach the frames and we already saw how to get there once we get there we can use the file name and the name from the code object these are just bite on strings Python decoding Python strings can be a talk in itself but we just assume they're ascii and read out the bytes the other thing we need to do is the code object has the first line number and the code object also instead of like wasting space mapping every byte code to absolute line number has a lookup table called L no tab that stores like a farm mapping and so we can decode that and then use F underscore last I to look into the table and then just add the first line number to it to get the absolute line number in our source code and there's the documentation in the Python interpreter on how that works so whoo that's it that's like a lot of details but when you put all of them together you can go from that to that and once you get this for every user who's ever hitting a crash solving bugs becomes like that for Python engineers right so this kind of low-level tinkering is not only fun it makes our engineers significantly more productive if you haven't read the blog post Dropbox moved from Python 2 to Python 3 over like several million lines of code shipping to hundreds of millions of clients we wrote the blog post two weeks ago so I recommend you read that the crash pad effort was very instrumental to that because without accurate crash reporting we wouldn't know if we broke any users while doing this migration so to recap the Python interpreter stores a list of threads and so on and there's frames on each of those threads and a frame is the basic unit of evaluation and we learned that we can leverage these internals to ship more Khali's software there's some related works around profiling but I'm out of time so I'm not going to go into that but basically you can use the same techniques to get a sense of how fast your code is running using two profilers called PI flame and PI Spy they use exactly what we did trait to just do it faster so I would like to thank my colleagues Jonathan Chen and Max Bollinger who are not here for helping with this and doing code reviews and so on there's some resources you can use crash pad is a really powerful piece of software if you're shipping apps to remote users because you get crash reports and of course there's the Python source code which I spend most of my time reading for this talk so with that yeah thanks again for coming here and that's [Applause]
Info
Channel: PyGotham 2018
Views: 2,577
Rating: 4.8367348 out of 5
Keywords:
Id: smiL_aV1SOc
Channel Id: undefined
Length: 20min 12sec (1212 seconds)
Published: Wed Nov 14 2018
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.