using memray to debug (and fix) a memory leak in krb5! (advanced) anthony explains #567

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hello and welcome to another video in this one we're going to show you how I debugged a memory leak in a production application at Sentry uh initially thought it was a memory leak in post itself but it ended up being in keros which is auth authentication something like basically everything uses krb5 in some way or another um and I found a memory leak in it and I'll show you how I found it uh and what tools I use and eventually how I fixed the bug uh but anyway let's jump into it all right so in order to set the stage we got to run a worker uh a real production worker but we're going to run it locally uh this is a celery worker familiar with celery it's sort of a task CU management thingy um you can't see the this over here but it's Max tasks per child I'm basically just setting that to a really large number because we want to profile the process we don't want it to restart on us so just keep running forever basically uh we're also doing concurrency one cuz we want to trace a single process and not bother with all the other ones uh this will still Fork a child and so we'll have to manage that um but let's run this I have another script which feeds data into this uh it's not really important what it does but it is basically just uh sending data into there this inor function eventually calls uh the celery task and defers its work uh Sentry exact t.p U make sure that this started yeah there's all sorts of warnings we need to change our cellary set actually I think CER is broken celer is doing it wrong but whatever that's neither here nor there okay anyway we're running the worker we're feeding data into it and I noticed when I uh looked in htop at these particular processes you'll see here we have uh this is the parent process this 250 Meg and this is the child process and you'll notice that the resident memory is creeping up over time uh and so I was like oh okay cool we've reproduced the memory leak that's you know that's exciting in its own uh pretty often you chase these bugs in prod never never find them in Dev um so this this was I was already excited when I got to this point but I'm going to show you how I used a tool that I had not used before uh and sort of learned how to use in order to reproduce this uh and that tool is called memay uh memor is a memory profiler uh it's from Bloomberg some of my some of my buddies actually work on it so that's kind of cool uh but basically you can attach it to running processes or you can WP processes and it'll look at allocations in them uh so I'm going to show you the commands and how I learned to use the tool from failing at the commands over time uh so the first thing that I'm going to do is I'm going to run a specific chunk of time for this worker and then stop it after that uh and I did this mostly to make sure things were normalized cuz I was actually compar comparing 310 and 311 for an unrelated thing uh and so I'm going to use theout command in order to do that I actually did a video on timeout so I will I think I did I'll try and Link that if I if I made video but basically I'm saying uh send Sig after 10 seconds uh and then it will run some other process the process we're going to run is memor run and I'm just going to do it with no arguments uh which is well I'm going to do with quiet we'll show you that one argument otherwise memor runs and in impr like a foreground uh top like thing which I didn't find super useful personally I found it a little bit overwhelming and difficult to read but uh we'll get to the parts that I think are important um this is actually the wrong command and uh the hint about celery forking will come up later um but after that you just pass in your command that you want to run so centry run worker cues counters blah blah blah the same command as before this is going to run for 10 seconds and it's going to profile all of the all allocations in the program uh we're then going to be able to generate some HTML output from uh memory and basically see what happened there and so the command for that is memory flame graph uh and this has a bunch of options we're specifically looking at the leaks option because we want to know we want to know what memory leaks happened here we do leaks and then we can pass in our memory binary format that we got out so that's what this me memory Sentry blah blah blah. bin is this will do some crunch some numbers and spit out an HTML file I actually need to download that so I can view it on this computer um and if we open that up memory gives us a nice pretty oh don't need the rest of of these uh things here just look at this one so memory gives us a nice pretty graph of all of the memory allocations here and I actually noticed two things went wrong when I did this first time first they give you a nice helpful warning up at the top that we are using Pik and without going too much into detail pallic is a block allocator and uh doesn't really play Super nicely with normal memory profilers you can of course force it to use Malik and we will do that or you can use the trace python allocators I found that both of these work pretty well uh but we're actually going to use this one because um it'll help us profile native functions later I think they both do basically the same thing um but the other thing that I noticed is this is the parent process this is not the child process so this is not actually what we want to profile and so we kind of had two problems there so let's fix both of those at the same time go back to our little timeout command here and we're going to put python Malik equals Mal actually first we're going to delete all the memory files so I don't first time I went through these I got really confused by all the files that were there I'm going to delete those uh so this time oops run here we're going to set python m m equals malic this forces python to use just normal GB C malic instead of its own memory allocator uh and the other thing we're going to do is we're going to set follow Forks down here this is going to make memor follow any you know subprocesses that are created uh or forked processes in this case and it'll generate a little binary file for every single one in those Forks okay so let's run this again uh memay so what is follow Fork not follow Forks oops I'm pretty sure that's how it works all right so this will run again it'll profile all of the processes and all the child processes and spit out a bunch of binary files um and so if we look here uh we should get a bunch of memr centry blah blah blah and the one we're particularly looking for I happen to know is the largest P ID that we get here which is this one here and so we're going to do memor flame graph leaks on this guy and that will generate a new flame [Music] graph takes a little bit but it's going through a whole bunch of uh data so it doesn't surprise me uh let's download that and open it again and this told me a few things so uh there's a lot going on here but uh some of is unimportant this is some thread some background thread probably inherited from the parent process um and actually I'm pretty sure this is Century itself tracing this the SDK um and the actual code that we care about is in the I want to say it's in this work Loop here um nope not that one uh where's the actual leak that I noticed huh maybe I need to run more hold on I will just find the the particular one CU this actually isn't this doesn't actually show the leak here maybe I didn't run it for long enough uh the data feeder was running all right we'll we'll run it for a little bit longer uh time out man this worked so well the first time I demoed it but then my power went out so we're getting round two all right let's do this again uh so generating our little graph um while this is generating U basically what I ended up doing is finding a particular python call that was uh the leakiest One memay Flame graph uh leaks and I took that um that particular function call isolated it to its own function and so we'll show that next generate this one's taking a little bit longer okay so I think I think this will probably give us a better result here surely okay now that's completed we'll download it again we'll open it up and no this is still the reddest connection dang it oh I remember what this is there's another thing that's leaking that's not the one I'm tracing here anyway I will skip this for now and I'll just show I'll just tell you what happened here um here is actually the Leaky tree this one right here um the SQL connection somewhere in here anyway way basically found that postris was connecting and leaking memory and so what I did is I isolated that to its own little function um so we CD into example postp by itself and we have this little little function here which is connecting to postris all it does is connect and close so it shouldn't leak any memory we're creating a connection and then immediately closing the connection uh so python Malik equals Malik same thing as before uh and if we run this python file should generate a new profile and this will show us where the leak happens um but unfortunately this is not going to show us much it's just going to show us that connect is leaking and so I'll show you the next piece of the puzzle after this one runs uh we're going to decrease the number of iterations here just for the sake of the video um it's easier to show the like memory ticking up over time with 10,000 iterations but a th should be plenty for us to demo this um and if we do [Music] memory flake craft leaks uh oh dang it we generated two of these uh we generate another flame graph that down there so that I can just rerun this SCP command and we'll notice that there's a leak in postris Connect Now this doesn't really show us much this is the Imports over here so they're taking some memory and then this is the actual execution where um connect takes a bunch of memory now this doesn't tell us much and so I wanted to dig further into this unfortunately uh memory has a helpful utility for profiling all functions including C functions uh by just passing native to oops to the Run command uh Native and so this will profile all C functions it works a lot better if you compile your python with debug mode and have symbols and all sorts of stuff uh and we'll actually see that um that this doesn't generate the best symbol output and uh I end up figuring it out in a different way later [Music] m481 uh no symbol information yep that's fine and we'll do that we will download again and then we will open it up one uh and we get a whole bunch more information here all of the C functions well not all of the C functions but a good chunk of them um this part isn't the interesting part here this is the Imports over here this is the actual connect um and you'll notice we have a bunch of calls here including PQ connect DB this is basically the postris connection um but all of them are leaking memory in this unknown function uh and that unknown function is called by get primary name um and so that's probably what we're looking for I noticed at this point that get primary name actually this is really small for you so let's make it bigger uh get primary name was called by krb5 which I knew was uh keros or keros 5 which is an MIT Library uh and so I knew that that was probably where I wanted to look for this um and that it probably wasn't in postgress however by this time I had actually reported this bug to uh psychop pg2 which is the post responding that I was using cuz I hadn't gone this far and figured this much out yet uh I gave them a nice little minimal reproduction and showed memory leaking and that it came from this function um ended up dgg further and jumped into the keros code and in order to do that I actually wrote a little C code uh I copied the code out of postris to kind of demo this example C code um and basically uh basically uh postris calls this GSS acquir cred to see if credentials exist at all on my machine they always failed so it was always in this case um but that's not super important to this basically I called this GSS acquir thread and if we looked at Firefox here that's this GSS acquir thread here acquir cred I didn't get the rest of the stuff but we got to that later uh so compiling this function or this code uh show you my little compilation um basically just using the Homebrew version of this version 1212 uh VI use if you watch this video in the future this is already fixed so you probably won't see this problem anymore uh on newer versions of kb5 I don't know that it's gotten released yet um but if we run this program all it does is it tries that 10 times and prints this out not super useful on its own but we're going to use a Mac OS tool called leaks uh and this is uh kind of like Val Grine but it's Mac OS specific and it ships with the developer tools uh and so if we run this it spits out a whole bunch of garbage you can ignore everything down at the bottom the important stuff is somewhere up here there's like a stack of leaks that it prints out yeah here we go so there's two stacks of leaks that happen here um one of them is then this function here which is not actually important uh the way kb5 has error messages is it basically has a global variable where it stores the last error that occurred this is not actually a memory well I mean it is a memory leak but it's not actually an important memory leak because it only leaks once and it's meant to be you know so you can show error messages uh python has a similar thing where the last exception is stored in the CIS module this is not the not the interesting one the interesting one comes from this one up here and you'll notice that it's repeat a bunch of times uh and that's because we called the function a bunch of times and so we expect to see the thing repeated um this told us a few things uh first the call stack here so you can see um that git primary name function that we saw before as well as four or five well four functions and then Malik below it which if you remember from our little graph here we had four unknown functions and then our Malik uh and so this kind of matched it matched for me up with the uh profile that I had already found which is nice oh we can't see get primary name there you go um and it tells us kind of what the object types are in this case it's an OS xpc connection uh so that kind of gives me what to look for um and it shows us like how it was allocated and um why why it didn't get freed and that sort of thing so from there I jumped into the code for that get primary name function and I started hunting around for things that created the connection and ultimately didn't free them so I hopped into the code here this get primary name and we have this Con this Con variable is what we're going to look for go away get hub no one likes that feature uh okay so we find con we create the connection here great cool if it's none we go to clean up blah blah or null go to cleanup blah blah blah so we should expect some cleanup down here um and if we look at our connection here uh one of these things is not like the other uh the cleanup code for connection just calls xpc connection cancel and I'll spare you the details but I look look through the docs they're not very good because they come from Apple um I look through the docs and in some of the xpc connection Creations they tell you to use xpc release but this one specifically did not for whatever reason however it creates the same type as all the other ones and all the other ones are freed with xpc release and so my guess was that this connection cancel was the wrong call here and it should instead use xpc release so what I did from that is I built a version of car B5 that has that fix in it uh and I've pre-built that into this location down here don't don't super worry about that um but if we recompile my little little C code here and I run it through leaks again uh leaks at Exit a out run it through leaks again got to scroll through all this garbage okay uh we run through leaks again we notice that there is still a leak here uh but it's only in that error message and we don't see any other leaks and so this tells me that I have fixed the memory leak oh a bit of a whirlwind um and then I sent that patch to fix the leak and it's already been committed so got rid of the connection cancel and it's now correctly using xbc release unfortunately the sad part of this is this is an Mac OS specific bug and so the memory leak in prod that I was chasing I did not find instead I found a ma OS specific PG but I still fixed it I still learned a cool new tool and hopefully you've learned a cool new tool too anyway hopefully you found this useful if there are additional things you would like me to explain leave a comment below or reach out to me on the various platforms but thank you all for watching and I will see you in the next one
Info
Channel: anthonywritescode
Views: 4,884
Rating: undefined out of 5
Keywords:
Id: bw5AHdZA7e4
Channel Id: undefined
Length: 20min 8sec (1208 seconds)
Published: Wed Jan 24 2024
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.