Surge 2016 - Abel Mathew - Post-mortem Debugging: could you be the one?

Captions
In terms of the talk, what I wanted to do is start off by laying out where post-mortem debugging fits into the different set of methodologies that people use for debugging today. Then I wanted to go over some basic tools that people use for post-mortem debugging across languages and what they look like, and talk about why a lot of people just don't use it and some of the problems around it. Finally, I wanted to end the talk with a case study: a paper published by Microsoft about a post-mortem debugging system they built and some of the interesting things they found.

So, debugging methodologies. Let's start off with the first and easy one: the interactive debugger, which in some cases people call an in-situ debugger. It's basically the typical REPL you're used to: next, print, and so on. Then there's instrumentation and tracing, which is stuff like perf and strace, and we'll talk about why you'd want to use instrumentation in a second. There's the typical logging, the classic printf statement that people use. And then finally there's the star of the show today, which is post-mortem debugging.

Interactive debugging is great for development and reproducible bugs, but it's terrible everywhere else. You don't want to use an interactive debugger in production, and you don't want to attach a debugger to a live database: it's going to slow everything down, and it's going to ruin your night and your day.

The next methodology is instrumentation. Instrumentation is really powerful when you have a live, running system that isn't logging what you want, and you want to extract coarse data out of it. It allows you to observe internal state, things like execution paths and events over time. I was talking before about perf, the perf profiler: you run perf against a process and it will give you a frequency distribution of the functions it observed over that time. You can also use perf to observe PMCs and other system events, and that's really powerful when you're trying to guess what's going on. I'm sure most of you have dealt with perf; it's a really powerful tool. The problem with instrumentation and tracing is that there's a lot of overhead on the live system, the system where the workload is actually happening, and that overhead includes both extraction and capture as well as the post-processing. It's possible to take those captured assets and post-process them on different systems, but typically you want the debug symbols or other assets available on the system in order to report on the data you capture. And the idea of instrumentation is really centered around one-off investigation: you typically use it when bad stuff happens, when you're not seeing the kind of behavior you want, so you launch your profiler and try to go at it.

Logging is one of our favorite ways of debugging. It's easy to use, there are a lot of existing systems to ingest, aggregate, and analyze logs (I'm sure all of you have heard of ELK, Splunk, Logstash), and it allows you to be explicit about errors. One of the nice side effects of logging is that it documents the code a little bit, or at least that's what a lot of developers tell me when they don't document their code. The cons are that it's really rigid and inflexible, meaning you only get the information that you explicitly elect to log, which could come through your own internal accounting.
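As a rough illustration of that rigidity, here's a minimal Python sketch (the `worker` and `job` names are made up for the example): the log line only carries the two fields the developer chose ahead of time, and the rest of the process state is simply not there.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("worker")  # hypothetical component name

def process(job):
    # You only ever see what you explicitly elected to log: the job id and the
    # payload size. The full job object, other threads, and the heap are gone
    # by the time you're reading this line in your log aggregator.
    log.info("processing job id=%s payload_bytes=%d", job["id"], len(job["payload"]))
```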
Logging can be very powerful, but it's still only what you want to log, and it's not easy to serialize and represent highly dimensional, complex data. Not just a stack trace, but things like a linked list, or variables, or, say, a slice: it's not easy to log all of them out and have them in an easy-to-represent, standard way that's structured and universal across your program. Unfortunately, and you may not be able to read this slide that well, this is code that I've written multiple times: it's basically a comment saying "this should never happen" and then a log statement printing that something bad happened when it does. That's the way I view logging in a lot of cases: I anticipate what's going to happen, and if it does happen, I go ahead and log it.

These different methodologies each cover a particular type of bug, or a particular set of use cases. What I wanted to talk about now is post-mortem debugging, which tries to solve another set of use cases, though you can also mix and match. Before I do that, I wanted to give a little history on post-mortem debugging, which dates back to the dawn of computing, in 1951. This is a picture of the Whirlwind I computer out of MIT. The idea of post-mortem debugging is that at the time of error, the complete state of the application is captured and saved for later investigation. I should underline or italicize "complete", because there are post-mortem debugging systems that don't necessarily give you the complete state of the application; what they give you is a much richer, much more robust set of information than logging or instrumentation. The cool thing about the Whirlwind I computer is that it's also where we get the idea of the core dump, because its memory was magnetic core. What they would literally do is output the information stored there, in octal form, to a CRT monitor, have a camera take a picture of that monitor at that moment, and then have developers debug via a photograph, essentially. So the idea of a core dump representing a snapshot of your application comes from that process.

So post-mortem debugging: what are its pros and cons? It gives you a very rich data set. When I say the complete state of the application, I quite literally mean, in most cases, a complete replica of the memory space of the process. It's a robust data capture method: it doesn't rely on an internal library, it doesn't rely on your logs getting somewhere; you're relying on facilities provided either by the language runtime or by the operating system itself. There's overhead only at the time of error, only when you actually want to capture the complete state of the application, which is different from, say, instrumentation or logging. And in most cases the capture is self-contained enough that you can do the analysis asynchronously, which enables some very powerful tooling as well; we'll get into that later in the presentation. The cons are that if you serialize the full state of the application, it's obviously going to be a huge dump of information, and for languages like Python and Java, which we'll go over, there's a real lack of tooling and documentation on how to make use of that dump.
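As a small, hedged example of opting into that capture path anyway, assuming Linux and CPython: raise the core-file limit so the kernel is allowed to write a full dump on a fatal signal, and turn on faulthandler so the Python-level stacks are printed too. A minimal sketch, not the speaker's setup:

```python
import faulthandler
import resource

# Allow core files up to the hard limit (many distros default the soft limit to 0).
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

# On SIGSEGV, SIGFPE, SIGABRT, etc., also dump the Python tracebacks of all
# threads to stderr. Costs nothing until something actually dies.
faulthandler.enable(all_threads=True)
```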
I stole this next slide from one of Bryan Cantrill's talks about debugging, and I think it does a really good job of setting up the space of error types. We have an x-axis going from non-fatal to fatal, and then we have implicit versus explicit errors. The terminology is pretty self-explanatory: a fatal error is something that's going to cause your application to stop executing, non-fatal is something that doesn't; explicit is something you know about beforehand and can explicitly check for, whereas implicit is not. Typically, the way I've seen it talked about is that the right side, everything that's a fatal error, is where you want to use post-mortem debugging, and where post-mortem debugging can be very powerful. Logging is typically for your non-fatal, explicit errors, and instrumentation is for the top left-hand corner, implicit and non-fatal errors. That said, it's not as clear-cut as that: you'll see cases where people use post-mortem debugging for explicit, non-fatal errors, cases where people use post-mortem debugging for implicit, non-fatal errors, and cases where people use logging for fatal, explicit errors and just hope the logs get through.

I wanted to share this little gem, because I was doing a lot of research on Golang before this presentation and I found this on Stack Overflow. The best part is that he says he put this at the beginning of every function "in order to prevent my program from crashing", and then asks, "now I'm wondering, is this really the way to go?" And the best part is that when you look at the code, all he's doing is recovering the error and printing it; that's all he does when he anticipates a crash. That always sounds like a fun debugging experiment.
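The snippet in question was Go, presumably a deferred recover() that just prints the panic, but the pattern translates. Here's a hypothetical Python rendering of the same idea, shown only to make the point and not as a recommendation: everything gets caught and printed, and all the state you'd want for a real post-mortem is thrown away.

```python
import functools
import traceback

def dont_crash(fn):
    """Wrap a function so it 'never crashes': catch everything, print it, move on."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            traceback.print_exc()   # print the error... and that's all
    return wrapper
```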
Another thing that I love to do is pick on people, and people who write Go typically say that their stuff doesn't crash. So I took three pretty popular open source projects in Golang, which I'm not going to call out, and I searched for the word "panic"; a panic is a fatal error in Go. I saw things like 43 open issues, the oldest one being May fifth. This is indicative not only of the fact that even in your higher-level languages you do see these fatal errors, but also that they're pretty gnarly errors. When you see these Go panics, or these uncaught exceptions, they can be very difficult to debug, because you can't necessarily rely on the tooling available within the language or the system to operate in some of these scenarios.

Going back to that chart again: like I said before, the bottom left-hand quadrant is where logging really shines, but I've seen logging used for things like fatal, explicit errors as well, and tracing is obviously used in a lot of different cases too. So is there a clear choice on when to use post-mortem debugging versus logging versus instrumentation? The fact of the matter is, there's not. People use logging for fatal errors; people tend to use what's familiar to them, and if they have a tool they really love and it's available to them, they're going to use it. That's part of the problem with post-mortem debugging and why we don't see a lot of it today: the familiarity and availability of tools.

With C and C++ you have essentially full, native support for this stuff in your common tools like GDB and LLDB, and there are a lot of core dump analysis tools in the form of either debugger extensions, such as Python scripts for GDB, or standalone products. For something like Java, you typically have jstack and jdb, which do have support for Java core dumps, but in my experience they're very difficult to use. You do have a really cool tool for Java heap dumps with Eclipse MAT, the Memory Analyzer Tool, and it's cool because you can do a couple of things with it. It provides a query language for you to query the state of the heap dump, so you can say things like "show me all of the objects that reference this specific object", which is crucial in a GC'd language, because you might have live references and not know where they're coming from. You can also generate what Eclipse MAT calls a dominator tree, which gives you a hierarchy of references for investigating GC issues. I put up the flags you want to pass to the JVM to get a heap dump from Java that you can feed into Eclipse MAT.

For Python, unfortunately, the story is not as good. There's a set of GDB scripts that allow you to inspect Python core dumps. If you're not familiar with gcore: on a running Python process you can actually call gcore, and I believe there are some Python utilities as well to invoke and cause a core dump, and then you can load that core dump into GDB with these Python scripts. There's also pyheapdump by Anselm Kruis; he gives a really good presentation on this, so I'd recommend you look up Anselm Kruis and pyheapdump to investigate some of this stuff. This is also really powerful because, like the Java tooling, it allows you to inspect what's on the heap and investigate leaked resources.
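To make the gcore workflow concrete, here's a small sketch under stated assumptions (gdb's gcore utility installed, permission to attach to your own process; the path prefix is arbitrary): snapshot a live Python process to a core file without killing it, then inspect the file later with GDB plus CPython's gdb extension commands (py-bt, py-locals, and friends).

```python
import os
import subprocess

def snapshot(prefix="/tmp/py-core"):
    """Write <prefix>.<pid> using gdb's gcore, without terminating this process."""
    subprocess.run(["gcore", "-o", prefix, str(os.getpid())], check=True)

snapshot()
# Later, offline:
#   gdb /usr/bin/python3 /tmp/py-core.<pid>
#   (gdb) py-bt       # Python-level backtrace, via the libpython gdb scripts
#   (gdb) py-locals   # Python-level locals for the selected frame
```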
For Node.js, and I actually didn't see anyone say they do any JavaScript or Node.js here, the folks at Joyent have done a lot of really good work with an MDB debugging module. MDB is available on SmartOS and illumos; I'm not sure if it works on Linux yet, but they provide a lot of extensions to MDB to inspect JavaScript post-mortem from core dumps: powerful stack and heap inspection. There's a really good presentation by the folks at Netflix on how they used these tools to find memory leaks through their core dumps, and that's one of the reasons post-mortem debugging can be really powerful, which we'll talk about in a second: having the complete state of the application gives you a point-in-time reference for what your application was doing, and having the full state of the heap lets you do a kind of A/B analysis. If I have another core dump, I can compare the two and see the differences in the number of objects and in allocations.

With Go, there has been some support from the Joyent team that was added to MDB, and we at Backtrace actually have something available, but unfortunately, according to the docs, GDB "does not understand Go programs well", and this is from the Golang docs themselves. Basically they're saying that they're not really focused on this right now and will think about it later when it becomes a bigger problem.

So we have this awesome debugging methodology, this thing that says I don't need to explicitly log information, I don't need to add overhead when I'm trying to do tracing, I can get the complete state of the application for a fatal error. But how come we don't use this more in practice? How come we don't actually try to get core dumps from our processes and try to inspect them?

This is the part of the talk where I was going to tell an awesome war story about some C web server that I wrote that had a pretty gnarly bug and how I had to debug it, but I figured that doesn't really get to the point of why people don't use post-mortem debugging; it mostly says that I am not the best at programming and I have bugs in my code. The reason people don't use post-mortem debugging is that it's very difficult to actually do detection and retrieval of this data. When you're running a Python process and you want to generate a core dump, in a lot of cases you have to do it manually through gcore; you have to add additional flags to your JVM to get the heap dump. People don't do this because it's not their first tool of choice. The next part is that you need the tooling and the knowledge to interpret this information; if it's not something you do over and over again, it's going to be very difficult to add it to your repertoire of tool sets. And the last thing is that post-mortem analysis is typically only reached for when the program has crashed, and only to find the root cause of the crash and nothing else, even though there's a wealth of information in those post-mortem assets that people just don't use. For these reasons, many people choose the more readily available methodologies.

The way I see it, though, and the reason I have this kind of love-hate relationship with post-mortem debugging, is that these are all solvable operational problems. This isn't a problem with the act of post-mortem debugging and the power that it has; these are things we can solve by changing the way we go about collecting and operating on this data.
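For example, one way to chip away at the detection-and-retrieval problem is to treat core files like any other telemetry and ship them off the host automatically. This is a hypothetical sketch, not anything the talk prescribes: the crash directory and upload endpoint are invented, and a real pipeline would want compression, retries, and symbol handling.

```python
import glob
import os
import time
import urllib.request

CORE_DIR = "/var/crash"                               # assumes kernel.core_pattern points here
UPLOAD_URL = "https://dumps.example.internal/upload"  # hypothetical central collector

def ship_new_cores(seen):
    """Upload any core file we haven't shipped yet."""
    for path in glob.glob(os.path.join(CORE_DIR, "core.*")):
        if path in seen:
            continue
        with open(path, "rb") as f:
            req = urllib.request.Request(
                UPLOAD_URL, data=f.read(),
                headers={"Content-Type": "application/octet-stream",
                         "X-Hostname": os.uname().nodename})
            urllib.request.urlopen(req)
        seen.add(path)

if __name__ == "__main__":
    seen = set()
    while True:
        ship_new_cores(seen)
        time.sleep(30)
```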
But why does that matter? We live in this world today and no one's really doing this. Do we have an example to look to, to see the kind of power of post-mortem debugging at scale and of working through these operational problems? Fortunately, in a very ironic way, we have something to look to, and that's Microsoft: a new hope. If anyone didn't get that joke, it was a very subtle Star Wars reference to "A New Hope", because people kind of think of Microsoft as the dark side, so it's a little ironic. I'm glad I had to explain that joke.

So Microsoft published an excellent paper in 2009 called "Debugging in the (Very) Large", and the way they describe the system in that paper is as a post-mortem debugging system. This was the post-mortem debugging system they built for Microsoft Windows and Microsoft Office, so you can think of it as a massively deployed system: obviously there's not a lot of coordination between nodes, but they have quite a bit of volume and quite a bit of complexity in both Windows and Office. The way this system came about is that Microsoft Windows had a web page where, every time you got the blue screen of death, an asset would be dumped to disk, you would go to this website and upload that core dump or kernel information, and a back-end system would do automated analysis on top of that dump, and they thought that was great. Microsoft Office had this other thing where they didn't require you to upload the dump: a small prompt, I think it was the little paperclip thing or whatever, would say "do you want to submit this error report?", you'd say yes, and it would be submitted, but they did no automated analysis on the back end. So Microsoft said, hey, why don't we combine these two systems, and that's how the Windows Error Reporting solution, WER, came about.

At Microsoft they had thousands of developers, and they were trying to scale out their development team and specifically scale out debugging. This is complex software with a large volume of installations, millions of computers running Windows and Office. The system was guided by three goals. The first goal was automated error diagnosis, which is really awesome if you think about it: think about your software crashing and then being able to automatically diagnose what the error is. It wasn't perfect, and they said a large part of the system was actually intuition-based, but it really dug deep and exposed a real-world use case of algorithmic debugging used by a pretty massive company. The second goal was progressive data collection: they didn't want to collect a full core dump every time you got the blue screen of death or every time Microsoft Office crashed. What they wanted was: if I've seen this error before, don't give me the full dump; generate some sort of signature so I can identify its uniqueness, and then have a feedback protocol so that if I want more information, I can gather it. The third goal, which I think is one of the coolest, is statistics-based debugging: flipping, or shifting, the way we think about debugging away from intuition and towards a statistics problem, thinking about ways of correlating variables and using the depth of information in the dump to isolate variability. The really interesting part is that Windows Error Reporting was not used just for crashes: it was used for hangs, it was used for install failures, and they provided an API for developers to generate a dump from a running application, so essentially imagine an API call saying "trace me". It allowed for runtime, non-fatal errors and the capture of information about them as well.

Before I dig deep into the operation of the system beyond these three goals, I wanted to show a quick diagram from the paper. The laptop at the top, assuming you're running Microsoft Office or Windows, generates what they call a label, or a bucket, and sends it off to their front end, which is just IIS web servers, to ask: have I seen this error before, and what do I want to do with this type of error? It then hits their online job servers, which apply what they call labeling heuristics. The point of the labeling heuristics, of generating this signature, is to determine uniqueness at a finer level: just because you have the same stack trace from an error does not mean it's the same error; it could be that different inputs actually produced a different failure. So they used labeling as a way of determining this uniqueness.
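To illustrate the label/bucket idea in miniature (the field choice here is mine, not the paper's exact algorithm, and the values are invented): the client builds a cheap signature out of a few coarse attributes of the failure, so the server can answer "have I seen this before?" without ever seeing the full dump.

```python
import hashlib

def bucket_label(app, app_version, module, module_version, offset):
    """Collapse a failure into a short signature suitable for server-side bucketing."""
    key = "|".join([app, app_version, module, module_version, hex(offset)])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]

# Illustrative values only.
label = bucket_label("excel.exe", "12.0.6514", "mso.dll", "12.0.6425", 0x2A8F3)
print(label)
```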
The offline job servers did what they called classification. The idea of classification was automated error diagnosis: can I get the developer, or the operations person looking at this, five steps ahead in what might be a ten-step debugging process? How much of the debugging process can I automate? The CAB files are where they would store the full dump when they wanted that information from an application, and getting those to developers who wanted the full state of the application, like I was talking about before, was a completely separate process.

In the automated error diagnosis system, labeling was about identifying uniqueness, and it was a powerful way to scale data capture, which I'll talk about in a second, along with server-side bucketing. The offline job servers doing classification used a system they called !analyze, and its purpose was to classify errors to maximize programmer efficiency. They used this classification to be able to say things like "show me the last time I've seen this sort of memory corruption in this part of the system" or "show me the last time I've seen this type of bug", and then to do higher-level statistics and intuition on top of the system. It was derived mainly empirically: they did not come in with all of these algorithms up front; instead they said, "these are the kinds of problems we've seen time and time again, can we write algorithms to automatically classify them?"

The second goal of the system was progressive data collection. Stage one: they would generate the label and send it to the front-end servers, which would say whether it was unique or not. Stage two: if they hadn't seen this error before, they would collect a minidump, which was an abbreviated stack with some of the memory as well. Stage three was the full memory dump into a CAB file, so if the developers wanted more information and requested it, they would get it. The really awesome part about progressive data collection is that they could do this at massive scale: they only needed one pair of SQL servers, only two systems, to record every error on every Windows system worldwide, and this was in 2009.
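A toy sketch of that staged decision, with invented function and set names just to show the shape of it: the server only escalates to heavier payloads when a bucket is brand new or a developer has explicitly asked for full dumps.

```python
from collections import Counter

report_counts = Counter()   # how many times each bucket label has been seen
wants_full_dump = set()     # buckets a developer has flagged for CAB collection

def on_report(label):
    """Return which payload the client should send for this bucket."""
    report_counts[label] += 1
    if label in wants_full_dump:
        return "send_full_dump"     # stage 3: full memory dump packaged as a CAB
    if report_counts[label] == 1:
        return "send_minidump"      # stage 2: abbreviated stack plus some memory
    return "ack"                    # stage 1: just count the label
```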
The third goal was statistics-based debugging. With such a large number of error reports, you can actually do data analysis across the data set. They used it for prioritization and for finding hidden causes through correlation: imagine being able to ask, "is this crash in Windows actually coming, most of the time, from a third-party device?", or, for a little more context, "is the input, say an HTTP request that comes in, always of this type right before we see this error?" And the last part, which is really cool, is testing root-cause hypotheses, which I'll talk about in a second. The really awesome thing is that it shifts debugging into a data analysis problem, which I talked about before. The other great part about statistics-based debugging is that they could release a fix and then measure whether the error continued to happen or whether there were any regressions. The benefits of this system were by and large very influential at Microsoft, according to the paper: it helped them do prioritization, it helped them test hypotheses, and it had these other benefits.

So what does prioritization look like? According to their statistics, the naive way is: let me sort my errors by the volume of reports I see and knock out the error I see the most, essentially; and this was an effective way for them to prioritize. On the right here you see a CDF, for each of these different applications, of the percentage of errors that fell into the top 20 error types, and you see this very nice distribution where the top error types account for quite a large share of the total, so by prioritizing that way you can knock out a large number of errors immediately. The other thing they pointed out in the paper was an alternate way they did prioritization, centered around the idea of debugging locality: if I'm changing this function, say I'm addressing a bug in it, let me see all of the errors that are potentially within this function, prioritize based on that, fix them all at once, and go from there. That provided a very powerful way to get more efficiency out of the team.

With correlation, the minidumps they captured contain not just the state of the application but also characterization information, what they call WMI data, which includes hardware info. One example: they saw a specific error where ninety-six percent of error reports came from computers running a specific third-party device. It was important for them to realize that this bug was not simply occurring wherever the device was deployed the most, so they compared what they expected, the actual deployment of this third-party device, against what was observed. If ninety percent of computers are running that third-party device, seeing it in most reports isn't surprising; but if only a small percentage of machines have it and it shows up in 96% of this type of error, they could prioritize effectively. The last thing correlation feeds into is the idea of stack sampling: tell me the function where I've seen the most errors, essentially, which is a really powerful thing when you're trying to determine the stability of a subsystem.

The !analyze system also allowed engineers to write hypothesis-test functions over real application state. One example is a plug-and-play lock in the Windows I/O subsystem: they constructed an expression to extract the current holder of the lock from the memory dump and then ran it across 10,000 memory dumps to see if their intuition was right. Maybe sometimes you've been out there debugging and said to yourself, "if X is happening, then it must be Y". The problem with that is that it's very much intuition-based, so having this kind of data store of all this error data was powerful for actually testing their hypotheses.
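Two tiny sketches of those ideas in code, with made-up numbers and a hypothetical extraction function: the first compares how often a suspect appears in error reports against how widely it is deployed; the second runs an extraction expression over a pile of dumps and tallies the answers, in the spirit of the lock-holder example.

```python
from collections import Counter

def lift(observed_share, deployed_share):
    """How over-represented a suspect is in error reports versus in the field."""
    return observed_share / deployed_share

print(lift(0.96, 0.05))   # ~19x over-represented: a strong suspect

def test_hypothesis(dumps, extract):
    """Run an extraction function (e.g. 'who holds the PnP lock?') over many dumps."""
    return Counter(extract(d) for d in dumps)
```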
Bringing all of this together, I wanted to share a war story from this paper that was really cool. Basically, in 2007 there was this really bad piece of malware that they called Renos, and what this malware would do is cause your GUI to crash; then, when it attempted to restart, before it could fully come back up, it would crash again, so you were stuck in this endless loop of crashing. Luckily for them, the GUI came up far enough that Windows Error Reporting could report the error back, so they were seeing this repeated crashing on everyone who was infected by this malware. The system itself was useless to the user, but once they started detecting this, they released a fix. The problem with Windows is that you have to download a Windows update to actually get the fix, so for people experiencing this error, they would show a prompt to download the fix automatically. They would automatically identify the error the user was seeing and then propose a resolution path to the end user, which was really cool because it brings up the idea of automated error resolution as well: not just diagnosis, but now a resolution path. You can begin to see the building blocks they were putting together with the Windows Error Reporting solution.

I called out this other paper as well because they took this to the next level in 2012 with "Performance Debugging in the Large via Mining Millions of Stack Traces": taking this post-mortem debugging system and, not retrofitting it, but enhancing it to do performance analysis, which is a really cool idea if you think about it. They're capturing all these minidumps that have stack traces in them, on the scale of millions, and they take that data and turn it into a very large-scale profiler, trying to see where people are spending most of their time, within certain functions and certain execution paths. I recommend reading this paper; it's very math-heavy, but it's a really cool paper.
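The core trick, stripped down to a sketch (the stack lists and frame names here are invented): fold the call stacks pulled out of many minidumps into a per-frame frequency count, which is the same shape of data a sampling profiler would hand you.

```python
from collections import Counter

def hot_frames(stacks, top_n=20):
    """Count, for each frame, how many dumps it appears in, and return the hottest."""
    counts = Counter()
    for stack in stacks:
        counts.update(set(stack))   # count each frame at most once per dump
    return counts.most_common(top_n)

# Illustrative input: one list of frame names per minidump.
stacks = [
    ["ntdll!RtlAllocateHeap", "mso!Ordinal100", "excel!Recalc"],
    ["ntdll!RtlAllocateHeap", "excel!Recalc"],
]
print(hot_frames(stacks))
```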
So this whole story about post-mortem debugging doesn't really end on a very neat note, and I'm not going to try to convince you to use post-mortem debugging all the time; there are certain cases in which you want to use it. But I will say it's a very powerful method for certain types and classes of errors: if you have a memory leak or leaked resources, it can be a very powerful tool to capture the state and compare it to a future state, as an example. These data assets can also be leveraged to gain further insight into your applications, like Microsoft did: it's not just about determining the root cause of a specific error; you can actually gain insight into what your application does and how it operates. The problem is that post-mortem debugging is incredibly painful and hindered by a number of operational problems. So if we think about this and have a conversation about ways to open up the idea of post-mortem debugging by building debugging infrastructure at our companies and in the places where we work, not just using something like ELK but also leveraging things like these core dumps, uploading them to centralized systems, and doing centralized analysis as well, it can be incredibly powerful. The verdict is still out; let's see if post-mortem debugging can be a thing that catches on. But that's it for the talk I had today. I don't know if I'm supposed to ask for questions, so you guys can leave if you need to.

[Audience question, inaudible] Sure. For the DevOps guys and gals out there, it's interesting, because databases are notoriously supposed to be very resilient to these kinds of things, but if you look at MySQL, or if you like Postgres, a common thing is actually seeing core dumps from them, and then people don't use and leverage that information the way Microsoft did. So post-mortem debugging is a very powerful thing beyond root cause analysis there too. Cool. And since I saw a lot of raised hands for Python, I recommend looking at pyheapdump; it's an incredibly powerful, pretty useful tool, and it can really help out, especially for the Python developers out there. Awesome, no more questions. If you guys want to talk about this, I'll be around here at Surge. At Backtrace we work on things like this, and we're actually working on a post-mortem debugging platform, trying to build something so that we can bring Windows Error Reporting to the masses.
Info
Channel: OmniTI
Views: 241
Rating: 5 out of 5
Keywords: OmniTI, Surge 2016, abel mathew, debugging, scalability, backtrace
Id: WHhorNLa934
Length: 33min 42sec (2022 seconds)
Published: Thu Oct 20 2016