CppCon 2018: Greg Law “Debugging Linux C++”

Captions
- So, welcome everyone. Hope it was a good lunch. So we're gonna talk about lots of cool debugging stuff. All right, I saw some people coming in. This is a big room and these lights are kind of scary, but anyway, I'll try my best.

Okay, so just to start off and set the scene, the scope of what I want to talk about today: I'm gonna show a bunch of specific tools, some of them freely available, some of them commercial. I'm gonna either show or talk about just a random set. There's no particular rhyme or reason why I chose what I chose. If your favorite thing isn't here, then, well, deal with it. And it's really about detecting, about root causing bugs. It's debugging.

So often these scopes are best set, I think, by saying what they're not. So I'm gonna steer away from generic advice. I mean, there's some really useful stuff we can do, and I think a lot of what makes experienced programmers valuable is the tips and tricks they learn themselves for how to debug stuff, not necessarily using tools, but anyway, this talk isn't about that. It's not about testing or testing tools, right? There are loads of great testing tools, there are loads of great testing talks, that's all really important stuff, it's just not what I want to talk about today. It's not about how to avoid writing bugs. I'm gonna assume that nobody in the audience is perfect. And it's not about performance profiling. Well, not per se. In a sense, looking at a performance anomaly is debugging, right? If the program doesn't comply with the spec, and the spec might be that this thing needs to respond within so many milliseconds, then if it's not, that's a bug. So there's clearly overlap between performance profiling and debugging, and sometimes the two are the same. But I'm looking at this from the point of view of debugging, not performance profiling. It's certainly not exhaustive, as I kind of said at the beginning. Oh yeah, and also it's not a workshop, right? So for all the tools I show, nobody's gonna come away an expert in those tools who wasn't already an expert when they walked in. The point is that, hopefully, you'll see some stuff that you just haven't really seen before, or that you might have heard of but weren't quite sure what it was, and this gives you enough that you can then Google it and read the man page or whatever. You'll never become an expert without actually using the stuff.

But I do want to talk a little bit about why I care, and give it a little bit of context. And I'm gonna start at the beginning. So this chap, a chap called Maurice Wilkes, who I think has as good a claim as anybody in the world to being the world's first programmer: he's the first person to write code on a computer to do a real job, to do something other than just test the experimental machines that people were building just after the Second World War. I can't remember exactly what it was; it was some biological problem with very complicated mathematics behind it, so he wrote a program on one of the first computers. And he said in his memoirs that he remembers the moment "the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs." I kind of remember that feeling as well, when I first started to program, and I think we've all been there. We've all had that realization that it's just really not as easy as you might think.
I mean, it's not that people who've not seen programming before think it's easy, but I don't think you realize just quite how impossible it is to get a program right. I like to think of myself as a reasonable programmer, at least, perhaps, I was before I got out of practice. What's the longest program I can write that will just work first time? 20 lines, 30 maybe if I try really hard. And I think that programming is dominated by debugging. There have been studies that show most programmers spend more than half their time debugging. That might not be debugging some in-production failure, might not be what we think of as debugging, but again, just ask yourself that question: how often does it work first time? I remember a couple of years ago when my daughter started programming in Scratch, and she said, "Dad, it's lots of fun, but I have to try lots of times to get it to do the right thing," and I was like, yup, there's that feeling, still happening. And it's kind of cool 'cause then she could understand a little bit more about what this company with a funny name that Dad started actually does.

So anyway, that's kind of why I think debugging is really, underrated's not the word, but it's underappreciated. I guess there's the obvious thing: well, it's better not to write the bugs in the first place. Yeah, well, duh, of course it is. But none of us is perfect. I think a nice test of whether a statement is worth making is: would the opposite of that statement be in any way sensible? And clearly with this one, obviously it's better to avoid writing the bugs in the first place, and prevention is always better than cure. But whatever prevention you do, you're going to need cure, and when it comes to programming, you're gonna need quite a lot of it.

And here's a nice quote from a splendid chap, Brian Kernighan, that probably many of you know: everyone knows that debugging is twice as hard as writing a program in the first place, so if you're as clever as you can be when you write it, how will you ever debug it? Good quote, makes you think. And I think that's another of the differences with experienced programmers. Someone likened it to learning to ride motorcycles: these 18-year-old guys get on their first motorbikes and tear around at 100 miles an hour, and the ones that survive, the ones that don't get selected out of the gene pool, go on to be middle-aged men riding around on motorcycles nice and slow, well within their margin. They've got lots of margin for error 'cause they know they're gonna make mistakes, and as programmers, we learn this as well.

It's interesting, if you think about what that statement implies: it's that debuggability is the limiting factor in how good our programs can be. So whatever metric for good you have, whether it's how fast it goes, whether it's how extensible and maintainable it is, whether it's how small it is, or how many features it has, whatever the metric for good is, if you could make debugging twice as easy or half as hard, then you could make those metrics twice as good. It's the bottleneck, it's the limiting factor in how good our programs are. Yet it gets remarkably little attention. You look at the ecosystem out there: the number of talks available, the number of tools, the number of books, the meetups and conferences on things like performance profiling or testing is huge.
There's loads of it, but there's comparatively hardly any on debugging. Now, I appreciate that to some extent I am preaching to the choir, 'cause you guys all turned up today, but still, that's my little soapbox bit.

And if you just think for a moment about the magnitude of the task: we do this every day, this debugging bit, just as normal practice. Modern computers issue billions of instructions every second, and that's if you have one thread in one process, so that's as simple as it gets. And you're looking for that one bad instruction, often. It is the ultimate needle-in-a-haystack challenge.

But here's another quote from that same chap, Brian Kernighan, who says that the most effective debugging tool is still careful thought, coupled with judiciously placed print statements. Now, splendid chap, but he did say that in 1979, and the world has moved on, I think, a bit. In fact, when he said that in 1979, interactive terminals were kind of new. Print statements, more often than not, meant printing things out on a dot matrix printer or a line printer or something. So I think the world has moved on since 1979, and we do have much better tools at our disposal. There are times, we all know there are times, when good old printf debugging just is the tool for the job, so like all these tools, you have to choose the right tool at the right time. But yeah, the world has moved on, so where are we now?

I think there are really two kinds of debugging tool. You can categorize things how you like, but I categorize them like this. You've got the kind of checkers, so dynamic and static analysis, and mostly what you're doing there is looking for "did my code do a particular instance of a bad thing?", a buffer overrun, for example. And then there are the other kinds of more general-purpose debuggers, which are really about code exploration and trying to work out "what did my code do?". Actually, you can spend time in a good debugger not actually debugging, just trying to work out what some piece of code that you've inherited does. Perhaps somebody has been very clever, and you need to find out just how clever they've been. Perhaps it was you, a few months ago. But it's just a general "what did the code do?" kind of debugger. I'm gonna cover both types a little bit today. They definitely both have their place. Static analysis probably falls into that prevention thing, the "isn't prevention better than cure?", but it's still about root causing and extracting bugs from your code.

So I'd say this is roughly what I'm gonna try and touch on today. Lots of stuff. I'm not gonna cut exactly to schedule, but as I say, the point is that hopefully we'll see enough of these things that you'll say, oh yeah, okay, that's cool, I see how that could be useful for me, and maybe they're not too intimidating as well. With a lot of these things, you hear about them somewhere, probably one of the smartest people you know talks about using it, and you think, gee, that sounds complicated. Actually, lots of these things really aren't that complicated. But as per the title of this talk, there's going to be a bit of GDB wizardry, focusing on GDB, just touching some of the stuff. I gave a talk last year, two years ago, on some advanced GDB stuff. I'm gonna cover some different stuff, a little bit of overlap, a little bit of different stuff. I'm certainly not assuming that you saw any of my previous talks already.
And all the rest of the stuff is new: new live demos to go wrong in all sorts of new and exciting ways.

Okay, let's start then with GDB. So, it was in the abstract anyway: this is all Linux-specific, and obviously C and C++ specific, 'cause we're here. It's not really about C++ per se; it's about how you debug binary code, compiled code. So yeah, GDB: it's certainly not intuitive. It can be very intimidating. And having said that some of these tools sound complicated but are easy to use or easy to learn, I think GDB probably is a good example of one that isn't. I think it is easy to use, just not so easy to learn. But once you've got the hang of it, it is pretty powerful.

So the first thing I'm gonna talk about, and I'm always amazed by this, is the feature of GDB that I think has the best combination of least known and most useful. So here is a program, Hello, World with just a tiny little bit of extra stuff, and I'm gonna compile it. Now, you compile with gcc with -g. Actually, it's better to say -ggdb3: that will generate more, richer debugging information, and GDB can do a much better job with inlined functions, optimized-away data, templates and all that kind of good stuff. So unless you're using a very ancient debugger, that's probably the better argument to give it. Sorry? (man mumbling) Can I increase the font? Yeah, like that? Good.

Okay, so I've made my little program, so now I'm gonna run it in GDB, and I'm gonna type start, which basically sets a temporary breakpoint on main and runs, and then continue, and here we are. So this is definitely better than Kernighan's 1979 world, but then really not that much. I can look at my program, I'll type list. Yeah, this is feeling an awful lot like 1979, actually. So let's bring GDB forward, screaming forward into the 80s, with Ctrl-X A, and I get my nice curses interface, and now this is much more useful, and now I can next, and it's much more like being in a debugger. So that's a very, very useful feature. It is a bit temperamental, to be honest, like most curses applications. It's worse than most curses applications if you're running your program inside it like I'm doing here. I think it behaves better when you attach to a running process, because here it's kind of fighting for the terminal with the process that you're debugging, with what GDB calls the inferior. But nonetheless, we can get multiple windows, so here I can step through the disassembly as well as the source code, for example, so all cool stuff. I'm not gonna spend too long on that. Ctrl-L is very useful in TUI mode because it refreshes the screen, and you need to do that more often than you might hope. Sometimes there's no way around it; you just have to start again. Yeah, terminals get messed up.

Very, very briefly: my good friend Jeff Trull is gonna be talking about Python and GDB later on, on Thursday, is that right? I say good friend; I only met him about half an hour ago, but we're at a conference, so that counts. But just very, very briefly then, to introduce the Python built into GDB. It's really powerful. You can do all kinds of cool stuff with it. So I can go like that, python print, or I can import os and print my pid. It's pretty, pretty, pretty complete.
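A minimal version of the session being demoed, for reference (hello.c stands in for the demo program, and the python lines are a reconstruction of what's described):

    $ gcc -ggdb3 -o hello hello.c
    $ gdb ./hello
    (gdb) start            # temporary breakpoint on main, then run to it
    (gdb) list             # look at the source, 1979-style
    (gdb) next             # step over one line
    # Ctrl-X A toggles the curses TUI; Ctrl-L redraws it when the terminal gets messed up
    (gdb) python print("hello from GDB's Python")
    (gdb) python import os; print(os.getpid())   # prints GDB's own pid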
So it's not just shelling off a Python process and running that. If I want to shell off a process, I can do that from the prompt with shell, and if I look at the processes here, I can see the pids: GDB's pid is 3905, and sure enough, that's what it printed up there. They've bound the Python interface to what's being debugged really quite tightly. There are all kinds of things you can do with breakpoints and exploring the data, and I'm gonna leave it to Jeff's talk to go into that in more detail, but those are just some of the commands to get you started. You can do a heck of a lot with that scripting. It's very powerful. So I'm not gonna go into any of the details, such as the pretty-printers for the STL.

The only thing I will say, just one little note of advice from things I've seen before: generally, GDB will debug arbitrary binaries that you've made anywhere, and it works well for that. But if you start trying to debug things like the STL using its pretty-printers, you need to have used a similar, probably the same or a similar, distro to compile your program as the one on which you are now debugging it. Otherwise, it all gets very confused with the pretty-printers that live on the machine on which you're debugging. You can actually take a copy of the GDB binary and move that around quite easily. But then, that's the other thing I see quite a bit with GDB and Python: if you're just using your nice Ubuntu or Fedora or whatever, and it's all packaged, it all works really well. If you start trying to take a GDB binary and run it on another distro, it will kind of work, and even the Python integration will appear to work, but then it will try to use some of the Python libraries, and find that there's a version mismatch between the GDB binary, the Python interpreter that's inside the GDB, and the libraries on the system that it's trying to use. So those are some of the more commonly messed-up configs that I've seen that cause all kinds of issues.

My one piece of advice, and I said this wasn't gonna be any general advice, this is perhaps straying dangerously close to it, but the one bit of advice, again from pitfalls that I've seen before: keep your .gdbinit nice and simple. I remember years ago, we had to help a customer who had some really weird behavior, and all sorts of mad stuff was happening, and it turned out that they had a run command inside their .gdbinit, which was something we didn't think to test for ahead of time and didn't quite handle properly. I mean, it can work quite well to put a nice .gdbinit, with all kinds of functions and things, in your source control, and then source that from the GDB command line. That works quite well. And set history save on is good, because that means your nice up-arrow, get-my-commands-from-before history persists across sessions. So history save has nothing to do with reversible debugging or anything like that; it just saves the commands, but often it's much nicer to type up-arrow + Enter than to have to type the thing in again. And set pagination off and set confirm off, because if you're not living life on the edge, then you're taking up too much space.
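So a minimal .gdbinit in the spirit of that advice might be just:

    # ~/.gdbinit -- keep it simple; in particular, no 'run' commands in here
    set history save on     # up-arrow history persists across sessions
    set pagination off
    set confirm off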
Okay, a little bit about how GDB is implemented, because I think this is useful to really get the most out of it. It's like most stuff: you can understand the layer at the top and you can go so far, but the more layers down you understand, the more you can get out of it. And particularly, the way that GDB interacts with signals is, I think, kind of surprising at first. It really makes sense when you understand it, but it's surprising at first.

So the thing you need to know is that GDB is built on top of ptrace, which is a really horrible API in the Linux kernel. I think it was inherited from old Unix, possibly from longer ago than that. And yeah, it's an awful API, but it works, and there have been a couple of attempts to replace it over the years, but none of them have really got traction. So when GDB is running the inferior, whether you've attached to a running process or whether you ran it from the GDB command prompt, it's doing that under the control of ptrace, and the way ptrace works is that when the inferior process, as GDB calls it, when the tracee process receives a signal, it doesn't actually receive that signal. It stops at that point. Control is returned to the tracing process, which is GDB in this case, which will pick it up through a waitpid return, and then GDB can decide what to do: it can decide to just continue the program and throw that signal away, or feed the signal in.

So let's have a look at info signals. For most of the signals, we've got Stop, Print, Pass to program, and it'll do all those things: so when the inferior gets a SIGHUP, it'll stop, control returns to the GDB prompt, and it will say, got a SIGHUP, and you can type continue, and if you do, that SIGHUP will then be passed into the program, and if it has a handler, that handler will run, or if it doesn't, then it will do whatever the default action for that signal is, usually terminate, sometimes ignore, whatever. But some of them don't work like that, so SIGINT, for example, gets treated specially. When your debuggee, your inferior process, is inside GDB and you hit Control + C to get control back, at least if you've run the program from the GDB prompt itself, GDB isn't doing anything special with that. It's just that when you type Control + C at the terminal, it will generate a SIGINT and deliver that to the program that's being run. Actually, to every process inside the process group of which the terminal is the controlling terminal, I think. And so the normal thing happens: the process stops, GDB gets the notification that SIGINT has arrived, and it returns to the prompt. But you'll notice the Pass-to-program entry for SIGINT is No, so if you type continue, that SIGINT will not be delivered to your program. So if the program you're debugging has a handler for SIGINT and relies on that handler being called, then you'll need to change that: you'd need to say handle SIGINT stop print pass. And if I want to go back to the original behavior, I do the same with nopass.
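In commands, that exchange is roughly as follows (stop/print/nopass is SIGINT's default disposition):

    (gdb) info signals                       # the Stop / Print / Pass-to-program table
    (gdb) handle SIGINT stop print pass      # deliver Ctrl-C into the inferior too
    (gdb) handle SIGINT stop print nopass    # back to the default behavior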
Okay, and SIGTRAP likewise. So when you hit a breakpoint, it'll just generate a SIGTRAP, and what GDB does for a breakpoint is change the code. I think it's architecture-specific, but certainly on x86, and probably most architectures, it will change the code: it will write the opcode to generate a trap. In the case of x86, it's the 0xCC opcode, which is a single-byte opcode that generates a trap. Other architectures might generate different instructions, but it will just literally plonk that in the text section, so when the program gets to it, the program receives a SIGTRAP, GDB stops, and returns to the prompt. And again, if we look, we'll see SIGTRAP, which is here, again does not get passed to the program when you continue. Okay. So yeah, SIGINT and SIGTRAP are used when you're normally debugging, but GDB doesn't do anything particularly special with them, other than responding to the SIGTRAP in the right way when it hits what it knows is a breakpoint.

Watchpoints are super cool. Really cool with reversible debugging, which we'll show in a bit. So, watch foo; I'm sure most people will have had experience with this. You watch and you continue, and, foo being a variable in this example, when foo is modified, it will stop, so you can run forward to the next time foo is modified. It tries to be quite clever, so if foo is a local variable, when it goes out of scope GDB will actually set a breakpoint internally at the end of that scope, and then say, okay, that's no longer being watched because it's out of scope. Actually, if you're debugging compiled code, usually what you care about is: I want to watch that address, 'cause I've got some stray pointer somewhere that's stamping on this or something. So watch -l, which is new-ish, well, anything within a few years counts as new, is short for watch -location, and that won't try to do the clever "when it goes out of scope, stop watching it" thing, so if you've got some local variable that's being trashed, it'll just watch that address.

Read watchpoints: so generally, if the variable foo is written to, let's say foo is an integer and it contained 42, and it's updated with the same value as before, with 42, then it won't stop. That's not considered a change; the variable hasn't changed, even though you actually physically wrote to that piece of memory. It's just waiting for it to change. And rwatch is a read watchpoint, and if the architecture supports it, x86 does, then it can stop whenever that variable's being read, which is useful. And we can have thread-specific watchpoints, and we can apply conditions, and we can combine these in all kinds of useful ways.

So thread apply, I think, is another useful command. Most commonly used, I think, with backtrace. Often, especially if someone else sends you an error report or some sort of bug report, you just want to say thread apply all backtrace full. Let me show you that. We're at a multi-threaded program now, so here's one I made earlier, and it's just got these 10 threads which are just running around updating those values. So if I run that... All good. So as you probably know, info threads tells me all of the threads in my process and where they are, and thread apply all backtrace gives me a backtrace for all of my threads. Thread apply all backtrace full includes all the local variables, so that's kind of useful. I've only ever seen thread apply used with those options, but you can do other things, so I can say thread apply 1-4 print $sp, and $sp is a convenience variable for the stack pointer. And so, thread apply. (chuckles) There we go.
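Collected together, the commands from that section (foo standing in for whatever variable you care about):

    (gdb) watch foo          # stop when foo changes value
    (gdb) watch -l foo       # watch foo's address; no out-of-scope cleverness
    (gdb) rwatch foo         # stop when foo is read (needs hardware support; x86 has it)
    (gdb) info threads
    (gdb) thread apply all backtrace full
    (gdb) thread apply 1-4 print $sp    # $sp is the stack-pointer convenience variable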
Dynamic printf. So, much-maligned printf: printf is the worst debugging tool in the world, except, of course, it's quite useful. But obviously the worst thing about printf is that you have to think in advance where to put the printf, and you have to put the right one in, printing out the right arguments. Otherwise, you need to recompile your program and deploy it again and run it again with the printf that you wish you'd put in the first place. Now, dynamic printf is halfway to solving that problem. So it's kind of neat. So let's do dprintf; the syntax is a little bit arcane. Here it goes, right, yeah, so mutex_lock is my function, and that's where I would put a breakpoint. I've got a feeling you have to do this without spaces, I can't remember. So I've got these mutex things in the little threaded program I just showed, and I've got my format string in there, and... Okay, so kinda crummy, but my program exited. But that's okay, because I can just start it again and run, and there we are. Okay, so I've... That's... I wonder how many times that particular bug is made in the world every second. All right, there we go. Cool, so dprintf, it's cool.

It's a bit slow. I mean, it's fast enough in this kind of case that we don't care at all, but particularly if you're remote debugging, which we probably won't get time to look at but will touch on, it's very slow, because what's happening internally is GDB is hitting a breakpoint on mutex_lock, control is returning to GDB, it's then running printf commands, like calling those inside the inferior to do the printing that it needs to do, getting control back, removing that breakpoint and continuing, all of which is slow, and all of which is really slow if you are remote debugging. So you can set the dprintf style. Sorry, I lied about the first one: the first one, gdb, means GDB will just do the printf itself. Call will call printf inside your program. Agent, if you're doing remote debugging, so you've got a gdbserver on some kind of target, will do the printing inside that agent, and it can save a lot of time. And it's reasonably configurable, as you can see. So dynamic printf is cool. I mean, you still need to put the dprintf in before the actual bug has happened, you still need to catch it in the act, but at least you don't have to change your code and recompile it to get more printf info out.
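A sketch of the dprintf part (mutex_lock is the function from the demo; the m argument is made up for illustration):

    (gdb) dprintf mutex_lock,"locking %p\n",m   # printf on every hit, no recompile
    (gdb) set dprintf-style gdb      # GDB does the printing itself (the default)
    (gdb) set dprintf-style call     # call printf inside the inferior
    (gdb) set dprintf-style agent    # gdbserver does it -- much faster for remote debugging
    (gdb) run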
So I just touched on calling inferior functions, and this is very useful. You can just type call foo from the command line and it will call the function foo. It can be surprising: print foo+bar, if you're in C++, might have the plus operator overloaded, and GDB is smart enough to figure that out. Well, "smart enough", and so sometimes it can be surprising what that might call. Print errno will call a function in your inferior, because errno is a thread-local and it's actually defined as a function called __errno_location or something, and GDB will just call that when you type print errno. And this one caught me out. Does my little pointer thing work? Good, I think it does. This one caught me out: passing literal strings. So from the GDB prompt, I type call strcpy(buffer, "Hello, world!"). The first thing it will do is call malloc inside my program, to malloc a buffer into which it can put "Hello, world!", and if you're debugging your own malloc implementation, then yeah, that can get interesting.

Catchpoints are very cool. I'm not gonna go into them in detail. They're kinda like breakpoints, but they stop on a nominated system call, if you say catch syscall, or you can catch exceptions, which is also useful. So yeah, kind of like breakpoints, but rather than giving a line of code on which to stop, they give some kind of condition, something your program might do, on which to stop.

Remote debugging, I touched on. I'm gonna just put that up there. It's quite simple to use on the same machine, so here we're debugging over a socket. You need to run this gdbserver, which is a little stub application that GDB will connect to over a socket or whatever, and gdbserver will itself then debug the inferior using ptrace.

You can do multiprocess debugging, which is good. We're kind of running out of time, but very quickly: I can debug multiple processes at the same time, and it looks very like debugging a multi-threaded application. So I can set follow-fork-mode to child or parent, but actually the key one is set detach-on-fork. By default, GDB will detach on a fork from one of the parent or the child process, depending on what you set the follow-fork mode to, but if you say set detach-on-fork off, then it will continue to debug both the parent and the child process after a fork. And just like you say info threads to see all the running threads, you can go info inferiors and see all the running processes and switch between them: like you say thread 1 to switch to thread one, you say inferior 1 or inferior 2. So that can be kind of handy if you've got lots of processes to debug and want to keep them in your head all at once, or you could just start two copies of GDB, whatever floats your boat.

You can create your own commands in Python. I think Jeff is going to talk about this in more detail, so I won't go into that. You can have little stop handlers, Python hooks that get called when certain things happen. Also very useful. You can do temporary breakpoints, and you can have breakpoints on a regular expression, which is really neat: if you've got some library whose functions all start mylib_, you can go rbreak mylib_.*, and it will put a breakpoint on every function in your library's API.
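Roughly, the remote and multiprocess pieces look like this (the port number and the mylib_ prefix are illustrative):

    $ gdbserver localhost:2000 ./myprog    # terminal 1: the little stub, next to the inferior
    $ gdb ./myprog                         # terminal 2
    (gdb) target remote localhost:2000

    (gdb) set detach-on-fork off           # keep debugging both sides of a fork
    (gdb) set follow-fork-mode child       # or parent
    (gdb) info inferiors                   # like info threads, but for processes
    (gdb) inferior 2                       # switch, like 'thread 2'
    (gdb) rbreak ^mylib_                   # breakpoint on every function matching the regex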
One little note, because people often get confused about this. Typically, we have debug builds and release builds, and in debug builds we run with low optimization and with NDEBUG not defined, so you've got all your assertions in and everything else, and it can go a lot slower. I've heard of applications going like 10 times slower in the debug build than the release build, and so people go, oh, I can't run GDB on my program because debug builds are too slow. But the world is more complicated than simply having debug or release; those are just conventions. Whatever optimization level you have and what debug info you're generating are completely orthogonal, so you can have -O3 and -ggdb3: lots of debug info, very optimized code. It'll be kind of weird when you debug it, because you think you're stepping forward a line and the compiler has laid out code in a way you didn't expect, so you need to be aware of that, but it will work, and there will be absolutely no runtime performance impact. In fact, the only thing you'll use is a bit more disk, and if you're not debugging it, you won't even page in the debug info sections from disk. So, just to correct that common misunderstanding.

All right, enough GDB. Let's move on to other things. So Valgrind: everyone calls it val-grind, but actually it's pronounced val-grinned, I think, so I'm told. It's a platform; you have all these different tools. The most common one is memcheck, so they're kind of synonymous, valgrind and memcheck. When people say run valgrind on it, they often mean run valgrind with memcheck, which is fine. Then you've got these others, and actually, it does strike me: it's definitely called val-grinned, not val-grind, but I don't know how you say cachegrind, 'cause cache-grinned doesn't sound right. Cachegrind and callgrind, anyway, there are these other tools that you can run within valgrind. It can be rather slow, but it just works, which is really neat. You don't need to recompile your program, you don't need to link against any libraries. It's in most distros, so you can just apt install valgrind or whatever, and then just use it.

I'm reminded of a real-world story of using this from when I worked at my last proper job before I started doing Undo. We had an LD_PRELOAD library which was doing kernel bypass, kind of before that became a common thing, and very often we'd get this one thing that became consistent in my old life and in working at Undo: customers would often say, well, your stuff's broken, it's definitely broken, 'cause I run my program without your kernel bypass library, or without Live Recorder, and it works just fine, and I run it with your kernel bypass in, and it's broken. And a good chunk of the time, they're right: it is our stuff that's broken, 'cause as I said, programming is hard. But some of the time, actually, their program's broken and they just didn't notice. And so the guy I was working with at the time, a very smart guy, couldn't understand why our stuff was broken, so he just got a copy of their program and ran it with valgrind, and sure enough, there was some uninitialized data that was being accessed, and we could point them at that.

So let's show that in practice. Here's my little canned version of that bug. So here's a nice simple program. Of course, I compile it as normal. Run it, and there's nothing obviously wrong; that's legal, it's just undefined. So, let's run that inside valgrind, and see what happens. And, oh yeah, look. Now you'll see what it's saying. It's an instruction-level thing, so what it's doing, and actually Undo works a bit like this, is translating the machine code as it runs, in a sort of JIT fashion, and doing analysis on that code, so it's not simulated, but it is translated. And of course, if you printf an undefined value, the first thing you notice is there's a jump, a conditional jump, based on the uninitialized data, 'cause printf is trying to turn your number into a string.
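A reconstruction of that canned demo (uninit.c is a made-up name; the last line is memcheck's characteristic message, heavily abridged):

    /* uninit.c */
    #include <stdio.h>

    int main(void)
    {
        int x;                /* never initialized */
        printf("%d\n", x);    /* compiles and runs; the value is undefined */
        return 0;
    }

    $ gcc -g uninit.c
    $ valgrind ./a.out
    ==12345== Conditional jump or move depends on uninitialised value(s)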
So it's kind of useful like that, but even more useful is that you can combine it with GDB. Now, the thing you've got to remember is what I just said: valgrind is doing this binary translation of the code, so the code the CPU is executing is functionally identical to the original program, but it's got extra stuff in it; it's different code. And so if you try to debug it through GDB in the normal way, you'll just see nonsense, because when GDB looks through ptrace, what it'll see is what the CPU sees, which is not what it was expecting to see. But valgrind has built into it a GDB server, which you can connect to, and then you can start to do all the GDB stuff. And then the other thing we want to say here is that if we just run it like that, it'll just run to the end, so you can give it an error count, like that: stop after zero errors, which is gonna stop at the beginning, or I could say stop after 10 errors. Sorry, there are different vgdb modes. Full is the one I'm gonna use, 'cause it doesn't have any surprises. There's yes as well, which trades off performance; it's not incorrect, but it just gets a bit weird at times, and especially, it can miss watchpoints and things.

So now I start valgrind like this, and now it's nicely telling me exactly what I need to type somewhere else to get GDB to connect to the server inside valgrind. So it says I'm gonna run gdb ./a.out, so I can run that without copy/paste. But the next bit, I can't run without copy/paste, 'cause it's more than nine characters, so, that line there. Okay, so here I am at the beginning of time, and then I can continue. As you can see, it's a little bit slow, but it works just fine, and here I am inside this printf, and I can get a backtrace, and I can see what's wrong with my program in frame 2, and here we are, accessing this uninitialized memory. So I can do all the GDB stuff to walk around and explore and try and get a bit more information. Now, unfortunately, you can't combine valgrind with any kind of reversible debugging. That would be super cool. You can do it with AddressSanitizer and stuff, which we'll get to in a minute.

Okay. Kind of getting on for time, so let me try and speed on through. So yeah, there are a whole bunch of different tools. The default is memcheck, which I think is what most people think of; as I say, there are all different things you can do with valgrind.
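The vgdb dance from that demo, for reference (two terminals, same a.out):

    $ valgrind --vgdb=full --vgdb-error=0 ./a.out   # terminal 1: stops before the first error, i.e. at the start

    $ gdb ./a.out                                   # terminal 2
    (gdb) target remote | vgdb                      # the line valgrind tells you to paste
    (gdb) continue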
Right, on to sanitizers, which are kind of like valgrind but different. So unlike valgrind, which will work on an unmodified program, the sanitizers are built into the compiler. They originally came in clang, and have been available in GCC for some time now as well. Slightly more arcane typing is needed for GCC than for clang. I don't quite know why, but that's what Google told me I had to do, so that's what I did, and it seems to work. Anyway, it's much faster than valgrind. Valgrind will slow you down by, it can be anything up to like 100x. It can be more like 10, but it can be 100x; it can be very slow. AddressSanitizer, and the other kinds of sanitizers, are much quicker: typically 2x. So there's still an overhead, because the compiler, basically, is instrumenting all your memory accesses, but it's doing it at compile time, so pros and cons. Also, they find different types of errors; there's sort of an overlapping set of bugs that they'll find.

So let's see. I've got this out_of_bounds function, which is nicely written so that if I give it a number, it'll use that directly to index this array here. So what did I say? -fsanitize=address, and (groans) what's the other one, -static-libasan, out_of_bounds. fsanitize, not fsanitizer. Okay, so if I run this like this, it's fine, 'cause there are enough elements in my array for that access. If I do it like this, it's a write to the array out of bounds, and now my program has told me. Now, because this is actually running, this isn't doing the JIT binary translation stuff, this is really running on the hardware, you can combine this with other debugging tools just fine. So particularly with reversible debugging, which is something close to my heart, it just works.

So let's first show some reversible debugging, for those of you who've not seen it. So this is a bubble sort program which contains a bug, which is one of those non-deterministic bugs, so if I run it in a loop like that, it runs just fine until eventually it doesn't. Now, I wanted to actually do this a different way. Anyway, this one has got the stack-smashing detection, but that actually slightly messes up this demo, so I'm gonna... I think that failed first time that time. No, it failed every time, okay, 'cause it's the wrong binary. So... Non-deterministic bug: eventually, that will fail, and I'd try to look at a core file and find out the core file is useless. You'll have to trust me.

So, gdb bubble_sort. I quite like this little trick. I'm gonna run it with process record, which is the built-in GDB reversible debugging stuff. Now, I need to run it a whole bunch of times, because as we've seen, it doesn't usually fail. So I'm gonna put a breakpoint on main, and I'm gonna do commands 1, which is gonna run a bunch of commands every time it hits breakpoint 1: I'm gonna type record, to turn on process record, which is the bit that we need for reversible debugging, and I'm going to continue. And I'm gonna put my other breakpoint on _exit, so commands 2 is just gonna rerun. That's it, all set. I think I probably would've done this before, but I need set confirm off for that to work, and off we go. And this will keep running it, printing as it goes, and you can see it's slow. It's a very, very simple program, and the slowdown of GDB process record is kind of like tens of thousands of times. What it's doing is single-stepping every single instruction and recording what changed, so it uses lots of memory and it goes very slowly, but come on. It definitely will crash. And when it does, we can step back. You know, I'm gonna leave that... Oh, damn. Let me try something in parallel. Let's also try it with rr, which is record and replay, so let's do the same: rr record, my bubble_sort program. This saves a trace that I can subsequently go and debug. So let's just keep doing it. Oh no. What the (groan), it's crashed. Right, so both crashed.
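My reconstruction of the two setups just shown: the sanitizer compile line, and the keep-rerunning-under-process-record trick (out_of_bounds and bubble_sort are the demo programs; the exact commands are a sketch):

    $ gcc -g -fsanitize=address -static-libasan -o out_of_bounds out_of_bounds.c
    $ ./out_of_bounds 4      # in bounds: fine
    $ ./out_of_bounds 400    # out of bounds: ASan reports the bad access

    $ gdb ./bubble_sort
    (gdb) set confirm off
    (gdb) break main
    (gdb) commands 1
    > record                 # turn on process record at the start of each run
    > continue
    > end
    (gdb) break _exit
    (gdb) commands 2
    > run                    # exited cleanly; just start another run
    > end
    (gdb) run                # loops until it finally crashes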
So here we are inside the process record, and the backtrace is garbage. I'm in hyperspace. I can look at the program counter: that doesn't even look like a sensible address. x to examine the contents: no, there's no memory there. But I can reverse-stepi, or rsi, and that will just go back one instruction, and now I'm back in sane land. I can kind of see where I am. Okay, so I'm at the end of the function. Let's have a look: this little arrow here tells me, yeah, I'm at the end here. There's the return instruction. A return on x86 will pop whatever is on the stack and jump to that, so that's the stack pointer, and if I examine that, whoop, sure enough, that's that garbage address, isn't it? So I've got garbage on my stack. So I'm gonna set a watchpoint. I'm gonna watch that address, like that, and then I'm gonna reverse-continue, and sure enough, we've gone back in time, unsurprisingly, to when this array is being written, and the index i is 35. And if I go, show type info, no, ptype array, I can see the array is actually only 32 elements long, and obviously my bug is the rand() modulo sizeof(array), which is the size in bytes rather than the number of elements.

I ran it concurrently with rr. Considerably quicker, and non-interactive; it works more like strace, which we'll cover in just a minute if we get time. So, rr replay. Now, this was the last one that it ran, and so this is gonna look kind of similar. Now I'm at the beginning of time. Continue to the end. Here we are, we've got this crash at this random point, and I can do all the stuff that I just did. So reverse-step. I think for some reason I actually needed two reverse-steps, but anyway, reverse-step there, and so I can watch. So the stack is now that, and I can watch just like I did before, then reverse-continue, and there we go. Right, so it's come down to the same thing, only it's much quicker, and it's got this separate record and replay step. It does need to be running on the right system: it must be x86, must be a relatively new Intel CPU that has the right support. It doesn't work on AMD, doesn't work on ARM or anything else. And I need to be running not in a virtual machine, not in a cloud environment like AWS or something, because they don't have the performance counters exposed that it needs. But if it's got all the bits it needs, it's very cool, it's very fast, and it works well.
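The shape of that bug, as I understand it from the description (a sketch, not the actual demo source):

    #include <stdlib.h>

    void stamp(void)
    {
        int array[32];   /* lives on the stack, just below the return address */

        /* BUG: sizeof(array) is 128 (bytes), not 32 (elements), so most of
           these writes land past the end -- one of them eventually stamps
           on the return address. */
        array[rand() % sizeof(array)] = rand();

        /* The fix: modulo the element count, not the byte count. */
        array[rand() % (sizeof(array) / sizeof(array[0]))] = rand();
    }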
So I want to go quickly, I'm kinda gonna run out of time, but quickly on to ftrace, which is a different thing. It's a function tracer. It's one of those part-profiler, part-debugger tools. And I'm going to talk through a little case study that we did at Undo just a little while ago. So we have this thing that's kind of like rr but works in a different way, so it doesn't have some of the restrictions, and we had a customer who was integrating it into their test suite. It's very useful in that: when you've got those failures that happen one in a thousand times, the intermittent, not-reproducible failures, you've got a recording and you can just debug that. Very, very useful. So we were helping our customer integrate it into their test suite, and we got to an exchange where the customer came to us and said, yeah, Live Recorder keeps dying, keeps getting SIGKILL. I think actually the process being recorded was getting SIGKILL; things were dying with SIGKILL. So we said, okay, well, you've got this quite complex test suite. Are you sure you don't have some kind of process killer running around, killing stuff? And they said, yeah, no, definitely not doing that. Okay, so we'd take a look. So we had a look around, and after a while, we came back to them and said, you're really sure you don't have some kind of process killer? And they said, yeah, 100%. Okay, so we said, can you run this script, which was an ftrace script, and have a look? And they ran that, and then we said, oh, we've seen a process called watchdog that's sending SIGKILL. What's that? And they went, hmm, have a look around. Oh yeah, that's this process killer we have in our test suite.

So the script we sent them was this, and I won't go through the whole script, but the interesting stuff is here. So ftrace is controlled, well, there's a wrapper tool you can use that's quite cool, but at the low level it's controlled through this /sys/kernel/debug filesystem, and you poke different things into that, so let me show you essentially what we did here. So you can just look at the trace, and if I want to clear that trace, under /sys/kernel/debug/tracing you just echo something at it, and it resets the trace. And so what we did here was, first of all, tell it to enable signal tracing, and then, this is the filter: we're interested in any signal, basically, and then you echo 1 into tracing_on, and that starts tracing. And now if I look at my trace file, there we go, we've started to see some signals.

So everything's looking quite normal here. Just some signal 17, sorry, that's just SIGCHLD, but yeah, you can see these two events are the generation of the signal, and this is the process that generated the signal. Oh look, cat and bash, so this is me. So I think I'm only tracing, last time I did this I did it for the whole system, and somehow I thought I'd set this to only trace this process group or something. Oh no, no, no, there are other things happening as well here, Xorg and the like. So, yeah, here you are, signal 14, that is SIGALRM, I think, and you can see these two events, the generation and the delivery of the signal. Typically, you'll see the two paired, though not necessarily always; it depends, maybe the process has masked the signal or something like that. By the way, that's another thing: if your program masks SIGINT and you're in GDB and you hit Control + C, it won't stop; the program doesn't receive the SIGINT from the kernel. Cool, so that is ftrace.
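The gist of that script, as described (run as root; these are the standard tracefs paths):

    # cd /sys/kernel/debug/tracing
    # echo > trace                      # clear the trace buffer
    # echo 1 > events/signal/enable     # enable the signal_generate / signal_deliver events
    # echo 1 > tracing_on               # start tracing
    # cat trace                         # who sent which signal to whom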
Oh no, now we have to go through this whole story again. Strace is probably better known than ftrace, and you can trace all the system calls. Strace also is built on top of ptrace: so with ptrace, the process gets interrupted every time there's a signal, but you can also configure ptrace to return when there's a system call, and strace will output all the system calls being issued by the command. You can do all kinds of neat things: you can follow forks, you can write the output to an output file, et cetera. Ltrace is like strace, but rather than the system calls, it's for library calls, and actually it can do system calls as well if you want, and you can configure it to print out certain useful things. Oh, I forgot the really cool thing in strace: -k. strace -k gives you a little backtrace for each system call that's issued, which is super useful, because usually the system call itself is not directly what you want. Anyway, sorry. So yeah, that's ltrace, so shall we simply show you one of these? Can't remember what program I did. Okay, let's do date: ltrace date, and you can see all the library calls and the system calls, and obviously, you can be more selective if you want.

Perf trace is a new one to me. It's like strace but better and worse. Better because it's a lot faster: strace will slow down the program being traced quite a lot, particularly if it's making lots of system calls, because every time the process makes a system call, it stops, strace gets control, it does some stuff, it prints out that the system call has happened, and then it continues, and that takes time. Perf trace is built on perf rather than ptrace, so it's much, much faster. It's quite flexible as well: you can actually get it to trace all the perf events, so all kinds of events like cache misses and all kinds of stuff. You do need to be root. And it also, I think this will probably change over time, but right now, with a string argument, so if you're doing a write of a string to a file descriptor, strace will follow the pointer and show you the string, whereas perf trace will just show you the raw pointer.

That's me running out of time. Fortify, _FORTIFY_SOURCE, is very useful: that gets the compiler to check certain things that it can. Reversible debugging, we've done. Oh, look at that. Just a bit quick at the end, but I got there at the end, so thank you very much for that. Whizzed through all the things. Could have taken a bit more time. I don't think I have time for questions in this session, but I'm happy to answer any questions at the end. How strict are we on time? Can we do questions? Just a couple. No one's saying no. I'm just seeing everybody and suddenly thinking, do I want questions in front of everybody? Probably I don't, because I probably won't know the answer. I think that'll be it. All right, thank you very much, everyone. (audience applauding)
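For reference, the shape of the tracing commands from that last section (myprog is a stand-in):

    $ strace -k ./myprog              # -k: print a stack trace with every system call
    $ strace -f -o out.txt ./myprog   # follow forks; write the output to a file
    $ ltrace -S date                  # library calls; -S shows the system calls too
    # perf trace ./myprog             # strace-alike built on perf events; needs root, much faster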