So I'm going to turn it over to our headliner: Chandler Carruth. Thank you. (applause)

CHANDLER: Morning, everybody! How are you guys doing? Excellent, excellent! So everyone at the back, you're going to want to scoot up a bit, OK? This is fair warning - you might want to start that shuffle process now, because you're going to want to be able to read very little bitty text up here for the whole talk.

So, my name is Chandler Carruth. I'm here from Google, where I'm the C++ and LLVM tech lead. I do a lot of fun compiler stuff and a lot of fun programming language stuff at Google, and I'm here to talk to you today about tuning C++. When I say tuning C++, I mean tuning C++ for performance - runtime performance. We want really fast and efficient code. One of the primary reasons we still use C++ today is to get the last one or two percent of performance. I'm going to try to talk to you a little about how you can do that. I'm going to cover things like benchmarking, some of the crazy things we see with CPUs, compilers (which are my bread and butter) - all kinds of crazy stuff. Sound good?

OK, there are a couple of caveats. The first caveat is - and I actually do feel a little bit bad about this - that this talk is going to be very Linux focused. I'm hopeful that the concepts carry over to other operating systems, and I think there's similar software on almost every other operating system, but I'm a Linux expert and I've got to teach from what I know. It's also going to be x86 focused. Most of my performance work is on x86 these days, for better or worse, and that means I'm going to talk a lot about x86. There's going to be a lot of assembly and other stuff up on the projector, so prepare yourself for that part.

The last thing I really want to caution you about is that this is not going to be a talk like Herb's talk or Bjarne's talk. I don't have the right answer here, and that's not the point of this talk. You're going to see things that I'm not 100% proud of. You're going to see things that are very questionable, but they're going to work (that's at least the hope). And to make sure you believe me when I tell you that this stuff actually works and that you can do it on your own, I'm going to do almost the entire talk live. So prepare yourself for that as well.

Alright, let's get started. Performance does actually matter. How many folks here think that performance still matters today, that it's still relevant? Good - you guys have been paying attention! For anyone who is questioning: really quickly, do you have any guesses why performance still matters today? Anyone? You can shout stuff out, it's OK. (Audience murmurs) Energy - I heard it, yeah!

So, this guy - Nikola Tesla. We have this whole electricity thing; it's really cool, super fun. And you know, electricity's great, but this picture gives kind of the wrong impression about electricity, because here electricity is just everywhere, right? It's just crazy... there's plenty of it. We don't even know what to do with it; we're just studying it. What we really have is a scarcity of electricity, and that scarcity shows up in almost every place where performance matters today: from the tiniest of little devices we've been introduced to, with their (sometimes) very frustrating batteries, to the largest of data centers. All of these things need electricity, and they are all struggling to get it.
That's the primary motivation behind performance work today, and it's what's in the back of my head constantly when I'm thinking about performance.

OK, the second thing I really want to talk about at a high level is exactly what C++ code you should spend your time tuning. Because spending your time tuning C++ code introduces complexity, it's a hard task, you may make mistakes, you may even introduce bugs - we don't want to tune just anything. This actually became much more of a pressing concern for me when I first started at Google. You have to understand, I started at Google almost eight years ago, right out of university. I had no idea what I was doing. I'm not kidding - I literally had no idea what I was doing. I thought I was crazy for even going to Google...

So on week one, they take you through this lovely orientation thing. They say "Welcome to Google!", they give you a funny hat to wear (which doesn't go over too well), you get made fun of by Sergey Brin on stage. It's a good experience on the whole; I was very happy. Then week two came around... (fumbles with mic) Sorry, my microphone doesn't like me. Hopefully it stays on this time.

On week two, things get a little bit more interesting. I started to actually talk to my co-workers and figure out what team I was on. That was fun; I got to learn some stuff about what Google was actually doing. And then I had a meeting with my manager at the time - great guy. He came to me with an interesting thing: "I've got a problem... I've got a problem because we have some code that we need someone to take a look at... and we thought you might be the perfect person to take a look at it." Which, by the way, for those of you who don't understand the translation, means: "You're the new guy!" I didn't understand this at all. He was like, "We thought you'd be perfect." I was flattered... I was like, "Aw - of course I am!"

And so he tells me: "We have this problem with some of our systems; they're really running into trouble. The latency on this one system that we use internally for our internal development work" (I was on our internal developer tools team) "is too slow; it's actually the bottleneck for almost all of our developers." That's a BIG problem! And this problem had bubbled up several levels of the company, across several different parts of the company, and then down and back up. You know the whisper game you play with your kids? That's true in corporate politics as well. At some point it reached a totally different part of the company, having lost some small elements of truth along the way, and that part of the company had an extremely senior engineer who decided to have a crack at this problem. And he worked the problem, but without talking to us - which is fine, he didn't even know he needed to talk to us, because by the time he heard about the problem it was a totally differently described problem. And then he sent my manager this code.

My manager pulls me in to ask me to take a look at it. He says: "Can you take a look at this code? Ken Thompson sent it to us and he says it will make this service twenty times faster, but we don't really understand how. We need someone to step through it." Now, I look at my manager - dead serious - and say: "Uh, sure, but you know... who's Ken Thompson?" (audience laughter) I mentioned I didn't know anything in my second week at Google. It was bad!
I didn't actually know who Ken Thompson was, and my manager very politely (perhaps too politely) smothered his burst of laughter and said: "A very senior engineer." He gave me the code, I went back, I went home, I got on Wikipedia, and looked into this... "Oh, Ken Thompson! Oh, OK, so the inventor of UNIX has sent me some code to look at..." That'll go over well... So I sat down and figured: it's my second week at Google, I've got to do this, I can't fail now.

So I sit down, stepping through this code. Now, the first thing I've got to say - and I know this is a C++ conference - is that Ken Thompson's code is actually what taught me that there is such a thing as beautiful C code. I had always thought that C code was terrible. Like, TERRIBLE! What I didn't know was that I had been looking at terrible C code - it's an easy mistake to make! Terrible C code is terrible not because it's in C, but because it's terrible code. Ken Thompson knew how to write beautiful C code. It was a wonderful experience; I got to read through all this code and understood the problem.

The system we were trying to fix was essentially serving access control rules out of a big, long table of rules, based on some pattern you had to match in the table to say what access you were allowed to have. Straightforward problem; everybody has this - firewalls have this, filesystems have this - it's a really common problem. Ken Thompson heard about this totally in the abstract, and he noticed that it's actually very similar to regular expressions and finite automata construction. That's one approach you can take to solving this problem. And he's written a few regular expression engines in his life (a few!). He has some domain knowledge, so he figured he'd help us out and write us a finite-automata-based implementation of this rule matching. It used some of the regular expression finite automata code he had written before, and it was twenty times faster. In his email he says: "This is twenty times faster. We could make it another twenty times faster, but it would require substantially more memory; this seems like the sweet spot in the tradeoff between time and space." Very reasonable.

So I go through and figure all of this out, I learn all this stuff - it's wonderful! And then I send it to another one of my colleagues, the one actually responsible for the service, and I explain to him how it all works. He says: "Well, that's interesting. So it really does get twenty times faster... The funny thing is, I was looking at our data the other day - at the queries we actually receive and which rules match them - and essentially every query matches against a single rule in the table. So I put that one first, and it's one hundred times faster now." (audience laughter) Well, that's unfortunate...

So the moral of this story is: measure first, and tune what matters. Ken's code actually solved the problem - it did exactly what he set out to do. Unfortunately, he didn't have all of the information, and if you don't have all of the information you CAN'T make the right decision - and you can't have the right information until you measure. So, that's my preachy bit of this talk.

OK! I thought about spending a whole lot of time here, and then I attended Bryce's talk [Benchmarking C++ Code - Bryce Adelstein Lelbach] yesterday... Bryce, where are you? Somewhere around here... We're going to talk about micro-benchmarks. So, micro-benchmarks: who here knows what micro-benchmarks even are?
Got a few people... OK, so micro-benchmarks are when you're benchmarking some really isolated part of your system - something whose performance matters desperately but is very, very narrow in scope. But who here really loves slides? Anyone? Yeah, so I figured - let's just do it live! (audience laughter) The entire rest of this talk is going to be live. Please bear with me; hopefully it's all going to work out well.

What I'm going to try to do is show you some micro-benchmarking. But before I do that, I have to show you how we're going to write benchmarks, if I can get this to work... There's this wonderful library - I can't take much credit for it, but my company did this great thing: we had the benchmarking library internally and we open-sourced it (this is all on GitHub). If you just search GitHub for google/benchmark you'll find it - it's super easy. It's a great little micro-benchmark library. I don't want to bore you with a detailed walkthrough, so I'm just going to skip all the fun stuff and point at some example usage. This is actually all on the webpage; you can go and find it.

This is how it works: you write a little function, it takes a state object, you run it for a while - an unspecified amount of time - you do something in it, and this benchmarks it. Could not be simpler, right? This is super simple; this is an entire benchmark. Anybody have big questions about how this works? Please interrupt me at any point.

Let's look at a little bit more fanciness just so we understand how things work. You can also do some other fun stuff with this benchmark library: you can have ranges of inputs. So here we've got a range_x accessor on the state, and what this is doing is allowing you to control exactly what input space you benchmark: you've got a bunch of arguments down at the bottom, and these go into x. Make sense? Not rocket science here, right? We'll see lots of the fun output.

So we want to get a std::vector<int>, like that, and we're just going to push_back(42). Alright? Just to let us benchmark how long it takes to push_back an integer into a vector. How many folks like this? Anyone like this? We've got a good benchmark going here. Hmm, no one's really happy about this... What's wrong with this? Well, we definitely need to fix our typos first. Alright, that seems plausible to me at least.

The other thing we need to do is figure out how to compile it. Unfortunately, compiling this thing requires a bunch of flags, for which I'm kind of sorry. We've got optimization flags - I'm trying to turn on all the latest, greatest features - and we've got standard library flags, but none of these really matter. The only thing I'm really going to point out is that I am turning off exceptions and RTTI. Please don't throw anything at the stage! But this is the environment I'm used to benchmarking in, and I'm trying to give you a faithful representation of what I see day in and day out. Alright, we've got these lovely flags... So all I really need to do now (I think I have a copy of this somewhere... here we go) is compile our code... Hopefully it compiles... Cool, that worked. And we can run it. And it's running, fantastic!
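[For reference, here is roughly what a benchmark like the ones being described looks like. This is a sketch against the current google/benchmark API - the talk-era library spelled the loop as while (state.KeepRunning()) and the argument accessor as state.range_x() - and the file/flag names are placeholders, not the exact code and flags shown on screen.]

    #include <vector>
    #include "benchmark/benchmark.h"

    // The simplest possible benchmark: construct a vector and push_back one int.
    static void bench_push_back(benchmark::State& state) {
      for (auto _ : state) {   // the framework runs this body an unspecified number of times
        std::vector<int> v;
        v.push_back(42);
      }
    }
    BENCHMARK(bench_push_back);

    // A parameterized benchmark: each Arg below is fed in through state.range(0).
    // (Nothing here stops the optimizer from deleting the work - that problem
    // comes up later in the talk.)
    static void bench_sized_fill(benchmark::State& state) {
      for (auto _ : state) {
        std::vector<int> v(state.range(0), 42);
      }
    }
    BENCHMARK(bench_sized_fill)->Arg(8)->Arg(64)->Arg(512);

    BENCHMARK_MAIN();

    // Built with something along the lines of (the talk's exact flags aren't reproduced here):
    //   clang++ -O3 -std=c++14 -fno-exceptions -fno-rtti bench.cc -lbenchmark -lpthread -o bench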
So now this runs and does a bunch of cool stuff under the hood. Let's look at what we actually get out of it. The first thing is this nice column layout that tells us which benchmark we ran and how long that particular benchmark took, both in wall time - that's the first column here - and in CPU time. CPU time is interesting because we're discarding any overhead where our thing got taken off the CPU. And to get some of what Bryce talked about in his benchmarking talk, this whole library has a pile of internal logic to essentially run just amazing numbers of iterations until we get something reasonably stable. We've got the iterations over here... So it ran - oh, that's a big number - 22 million iterations in order to conclude that this operation takes 32 nanoseconds.

Now - I hate to rain on your parade - but if I run it again you're going to see some fluctuations. The wall time here fluctuated by three nanoseconds, and you'll see the CPU time fluctuate as well. Even with tens of millions of runs, we still have a noisy system. If you want to understand more about how to cope with that, that's what Bryce was really talking about: that last little bit of statistical confidence about how fast things are.

Everybody happy with this benchmark now? How many folks are happy now? Why are you guys not happy with my benchmark? (audience member shouts "malloc") Malloc! Oh man! So I'm not benchmarking push_back. You're totally right... So how do I actually compensate for that? Well, compensating for that is actually hard. One thing that's a little bit annoying: if we just take the vector and pull it out of the loop, then we're going to benchmark push_back, but we're going to run out of memory when we push_back too many millions and millions of integers - we're going to see other artifacts of the system because we're never actually clearing the memory. So we need to do something else.

Now, a good way to handle this is often to set up a baseline. Let's look at an example of this. Here's a baseline benchmark: instead of push_back, we're going to benchmark just the creation of the vector. Alright, is this making sense? So here we've got a new benchmark - this one benchmarks constructing the vector - and then in the second one we benchmark constructing it plus push_back. We'll be able to look at the delta between the two to understand the performance impact of push_back. So, any predictions on how fast the first one is going to run? What percentage of the time is this first one going to take? 90%? 0%! Everybody see the bottom here? Zero nanoseconds? ZERO nanoseconds! It's infinitely fast... It's amazing, right? I love code like this! Infinite speed... It's great! It's just the default constructor; it didn't do anything...

So we've got to work a little bit harder to isolate this. What do we actually have to do? We've got this nice create benchmark - that's great, let's keep that around - and we need something a little bit fancier to really benchmark this. What if we use reserve? So here we can say reserve, and that's going to take the malloc and actually benchmark the malloc. That'll work, that'll work. How many folks think this is going to actually work? Yeah, you're catching on to the theme of this talk. So now we reserve in our push_back benchmark, and we also do just a blind reserve, so that we should be able to see the delta. How much of the performance is the reserve going to take?
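[Before the answer, here is roughly the shape of the three benchmarks being compared - a sketch, not the literal code on screen.]

    // Baseline: just default-construct the vector. The optimizer can delete the
    // whole body, which is why this one reports ~0 ns.
    static void bench_create(benchmark::State& state) {
      for (auto _ : state) {
        std::vector<int> v;
        (void)v;
      }
    }
    BENCHMARK(bench_create);

    // Baseline that actually allocates: reserve(1) forces the malloc.
    static void bench_reserve(benchmark::State& state) {
      for (auto _ : state) {
        std::vector<int> v;
        v.reserve(1);
      }
    }
    BENCHMARK(bench_reserve);

    // reserve + push_back: the delta against bench_reserve should isolate push_back.
    static void bench_push_back(benchmark::State& state) {
      for (auto _ : state) {
        std::vector<int> v;
        v.reserve(1);
        v.push_back(42);
      }
    }
    BENCHMARK(bench_push_back);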
Is the reserve going to be 0% or more than 0% of the performance? Go on, shout it! (audience: "4%!") 4%? More! More than 90%! OK, there we go... So the reserve is going to be essentially all of the performance. Let's find out. So we call our thing, run it... Wait, now this is a little bit weird. I know reserve is good and all, and I've read Scott Meyers' books and all these other books - I even watched my own talk from last year - so I know I should reserve whenever I can. But I did not expect reserve to make push_back TEN TIMES faster. That's a little bit weird...

So, my FAVOURITE profiler in the world (and I'm actually not being facetious here) is this little tool called perf. Things like being scheduled off the CPU don't go into the task-clock it reports, so the wall time shown at the bottom here is not always exactly equal to the task-clock. It also tells you how much precision it has - it doesn't necessarily have perfect precision - so this is a coarse-grained look at exactly what performance I'm seeing. It's a fantastic tool. But this doesn't actually tell us what on earth our code is doing that made it ten times faster when we reserved. We don't yet have the information; we need to do more detailed performance analysis.

To do that, you can record with perf. This does a more traditional profiling run: it records what takes place as your program executes and dumps out a very dense, very strangely formatted data file at the end that contains the entire profile. Once you've done this, there's a tool to analyze it: you run "perf report", which creates a nice little report. perf report looks something like this - it's showing you, from the hardware, exactly which functions were executing and for how much of the time. This is your most basic profile - a flat profile, right? Here are the functions; here is how long we were actually inside them. Making some sense? Alright!

That's great, but there's one kind of problem here: there's no context to understand "is this function calling some other function?" We were trying to isolate malloc - the allocation of memory - out of this benchmark, but we don't actually get that information here. We don't know whether it's memory allocation or something else going on. I see malloc down here - we're spending some time in it - but there are no connections between this stuff. What we really want is a call graph: a dynamic call graph that tells us "this function calls this other function, which calls this other function, and that's where the performance problem really lies."

When perf interrupts your program, it looks at the current state of the stack of the thread that is running and tries to figure out the call stack from that position. Well, the problem here is that it doesn't have any way of finding that call stack, because I've optimized all of the information about the call graph out of the binary. You can actually fix this, OK? I know this is really ghetto - I'm using a text file for flag management - but it fits in my presentation, and I don't like Make. There's this magical flag. This flag is the performance analyzer's most endearing flag, OK? Now, what does it do?
-fno-omit-frame-pointer is a very opaque flag name. What we're actually telling the compiler is to stop deleting the "frame pointer". The frame pointer, on x86 and most other architectures, is a register which tells you where the bottom of the stack frame is - or the top of the stack frame, depending on which way it grows on your architecture. With the frame pointer you can orient yourself: at any point in the program's execution you can understand how to walk up the stack and discover the call sequence that led to the place you're currently at. This is AWESOME!

From the perf tool you can hit the 'a' key to annotate... actually, hold on, before I do that, let's look at perf a little bit more. Before we dive deep, there's one other thing that's a little bit confusing here. You'll see, if I expand this benchmark thing... hold on, if I expand this benchmark thing, it's not calling push_back... or create... or reserve... So we don't actually understand this yet. This is my LEAST favourite thing about the perf tool: this call graph is not the call graph you expected, not what I naively expect it to look like. And now when I expand this, I see we start in the benchmark framework - cool - and it calls three functions. Ah yes, because we have three benchmark functions. Is that making sense to people now (that nice call graph)? OK, good, phew!

Now, you can keep expanding here, but you're going to see strange artifacts if you do. I strongly encourage you not to just keep expanding up here, but to drop down, find where this function is rooted right down here, and look at that - you'll get slightly better information. You'll still be confused, though, because it says it does weird stuff: it seems to be calling this apic_timer_interrupt, which doesn't make any sense - but that's because it's in the kernel. This stuff looks really confusing when you first see it, and the thing you have to understand is that the perf profiling tool isn't precise. What it's doing is sampling your program in the least intrusive way it can, so that your program can keep running forward at blinding speed. But it's not infinitely precise; in fact there are a lot of artifacts, and you're going to see them, so it's really important to keep that in mind.

OK, so let's back up and get back to what we were trying to do. We're trying to understand something that should be simple: why is this push_back function ten times faster when we reserve first and then push_back than when we just push_back? Right? Everybody back on track? You know how to use a profiler? OK, let's dig deep. The way we dig deep in the perf tool is we hit the lovely 'a' key, and it's going to annotate this function. I told you we were going to look at assembly, so you've got to prepare yourself. And that's going to make the next few instructions look really suspicious: we start here, we execute down, we do some computation which is basically just loading new data... and that's it. That's why it's ten times faster. So what happened here? What happened was the optimizer took one look at it and said, "pff, this code doesn't do anything, I'm going to delete it." That's its job, right?
It's doing its job. Unfortunately, it's making your job as a benchmarker or performance analyzer tremendously harder. We've run into the first real problem here, and that's the optimizer. How do we defeat an optimizer? Now, first off, we've got to pause for a minute. I'm about to tell you all how to defeat the optimizer. As a compiler and optimizer writer: please use these powers for good. Don't defeat the optimizer because you have undefined behaviour in your source code and you're like, "No, no, no, optimizer, you're not going to miscompile MY code today!" Uh-uh, that's not actually going to work for you... it goes very badly.

OK, but how do we defeat it when we need to, for good reasons - for performance analysis reasons? We need two functions. These are kind of magical functions. I would never expect anyone to know how to write them; you might reasonably argue that I shouldn't be writing them, that they should be in my library, but I want to show you how these things actually work. OK, we're already in trouble: I'm using void*s - I know, I know... type safety and all that... but there's actually a reason why I'm using void*s. And it's going to get a lot worse: I'm going to next type the two scariest tokens from my compiler-implementor world. Literally the two scariest tokens... (audience laughter)

OK! And now I'm going to do something that doesn't make any sense to anyone. What on earth have I done?! So what is this? This is a magical escape function - it is truly magical! It has these unique and mysterious properties... The compiler is going to assemble it, it's going to build a representation for it, and then it's going to notice it doesn't DO anything - or, in my case, that it's EMPTY. That doesn't do what I need, so I have to tell the compiler: "No, no, no, this assembly has some crazy observable side effect." In my case the observable side effect is the benchmarking - the benchmark numbers. So it's actually doing what I want here. This is a correct usage of volatile and, in my opinion, a correct usage of inline assembly.

So, the syntax for inline assembly - GCC came up with the syntax (I think), and Clang uses the same syntax - is kind of opaque, a little bit confusing. You have a string literal that is your inline assembly, you have a colon, then a comma-separated list of outputs, then a colon, then a comma-separated list of inputs, then a colon, and then a list of what are called "clobbers".
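[What he is typing out is, near enough, the following pair of helpers; this is the widely circulated form of them. The "g" constraint means the pointer may live in a register, in memory, or as an immediate, and the "memory" clobber tells the optimizer that any memory may be read or written by this (empty) assembly.]

    // Force the pointed-to data to be considered "used": the empty asm has an
    // observable side effect as far as the optimizer is concerned.
    static void escape(void* p) {
      asm volatile("" : : "g"(p) : "memory");
    }

    // Pretend all of memory was touched, so pending stores can't be elided or
    // hoisted out of the benchmark loop.
    static void clobber() {
      asm volatile("" : : : "memory");
    }

[The google/benchmark library now ships essentially these under the names benchmark::DoNotOptimize() and benchmark::ClobberMemory().]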
So the fact that create takes more time here is totally fine - that's not what's interesting. What I actually want to look at is: what is reserve doing? This is actually pretty straightforward... Where's our loop? Here's our loop. We come in, we allocate some memory - not too surprising - we delete some memory, and then we loop. Now, one thing that might frustrate you (it very much frustrates me) is: why are these things still in my program? Ah! Because we MADE them be! Silently, in between, in this empty space, is our escape, right? That's the cool thing about escape: we can force something that the optimizer would love to just delete to stick around, without adding even a single instruction in there. So we actually get the lowest-cost measurements we can - awesome!

Sad story: when you explicitly write reserve and then explicitly write push_back, the optimizer does something different than when you just write push_back, even though push_back, behind a conditional, calls reserve. The reason it does something different is that it now inlines at different points. Unfortunately, it makes a different inlining decision, and this is just a horrible, horrible deficiency of today's optimizers: this seemingly meaningless change causes a dramatic performance swing. It's not because the code is faster or slower. It's because in one case the optimizer got very lucky and managed to inline all of these things and delete all the code, and in the other case the inliner didn't quite connect A-B-C-D - even though theoretically it could - so it didn't delete everything, and we saw the real performance. Yes?

AUDIENCE MEMBER: Is that because of the order in which the compiler sees those functions? I know from my own experience that I believe this is true - visibility while it is running (inaudible), it always looks back, it doesn't look forward (inaudible).

CHANDLER: No, no, no - not at all relevant; the compiler doesn't care. The inliner - and I'm speaking as its author, I'm not trying to denigrate it; it's a very, very fantastic thing - just gets tripped up easily.

OK, this has been fun. I want to briefly look at one other crazy example, because I thought, you know, I should have some fun stuff to look at - but I'm going to benchmark it first, because I know I'm supposed to measure things. So we have this generate_arg_pairs(). This is going to generate our input space: we're going to walk from 32 - I think... no, 16 - up to 1024, and then on the other dimension we're going to get the numbers 32, 128, 224. These are kind of arbitrary; it's just a two-dimensional space we're going to explore in our benchmark, and we'll come back to them. We pick a ceiling, which is 32, 128 or 224, and we create some vectors for our input and our output. Mark this down, OK? We're going to take the input, and if it's greater than or equal to my ceiling, I'm going to do a modulo; if it's not, I'm just going to use the input directly. See what I did? It's awesome! It's fantastic! (I'm assuming all of these are positive numbers; that's not relevant to this example.) Who thinks this is even going to compile? Hey, look, it compiles! It even runs, cool...
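[A sketch of the benchmark being described - the names and exact structure are my reconstruction, not the code on screen. It reuses the clobber() helper sketched above, and uses the current google/benchmark API (Args/Apply and state.range(n)) rather than the talk-era ArgPair/range_x/range_y spellings.]

    #include <cstdint>
    #include <random>
    #include <vector>
    #include "benchmark/benchmark.h"

    // Values at or above the ceiling get reduced with %, the rest pass through.
    static void bench_fastmod(benchmark::State& state) {
      const std::size_t size = static_cast<std::size_t>(state.range(0));
      const std::uint32_t ceiling = static_cast<std::uint32_t>(state.range(1));

      std::vector<std::uint32_t> input(size), output(size);
      std::mt19937 rng(42);  // fixed seed so runs stay comparable
      std::uniform_int_distribution<std::uint32_t> dist(0, 255);
      for (auto& v : input) v = dist(rng);

      for (auto _ : state) {
        for (std::size_t i = 0; i < size; ++i)
          output[i] = input[i] >= ceiling ? input[i] % ceiling : input[i];
        clobber();  // force the stores to output to really happen every iteration
      }
    }

    // Two-dimensional input space: sizes 16..1024, ceilings 32/128/224.
    static void generate_arg_pairs(benchmark::internal::Benchmark* b) {
      for (int size = 16; size <= 1024; size *= 2)
        for (int ceiling : {32, 128, 224})
          b->Args({size, ceiling});
    }
    BENCHMARK(bench_fastmod)->Apply(generate_arg_pairs);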
What if 80% of the numbers are below the ceiling? I'm going to vary my numbers between 0 and 255, and then I'm going to pick the modulus point at 32, 128, 224 - they'll be like 20%, 50% and 80%, right? So that should give me kind of a different... but all the numbers are the same. Anyone spot why?

In the profile, the actual sample for a very slow operation gets attributed after the operation completes. So we're attributing all the cost of the actual modulo operation onto the store into memory, OK? All of this cost - this whole thing that was super slow - we're actually accounting for it here. This can really throw you and confuse you. So now we actually understand how this profile works. Cool.

Before we go on, though, the other thing that's a bit weird is: why are there two copies here? Well, the answer is of course very simple: we have a nice optimizing compiler! It was like, "You have a loop, it has like five instructions in it - that's a waste of my time. I'm going to duplicate it so that we can actually do more things at a go." Well, cool, I like it when my optimizing compiler does work for me. But if two is good, then I've got to figure that four is better, right? Who's with me? C'mon, four is better? Yeah! Thank you, thank you! There's one person still awake... (audience laughter)

So four is clearly better; let's see what four looks like. Unfortunately, to get four (because I'm a lazy typist) I'm going to do something terrible. Yeah, I know - I'm sorry - I'm using a macro. How many folks here have used clang-format? Not really. So, why is this not crazy faster? Let's get a look at what's going on. For the record, we have these numbers; they're good, right? But I DOUBLED how many times we did the work on each iteration, so it should be twice as fast - and it's a little faster, but not a lot faster. What's going on here? This is, of course, where you break out your profiling tool.

So we want to profile this thing, but when we do, we actually want to look more specifically at one of these benchmarks. You can do this with the benchmark_filter thing (the benchmark library has nice filtering), and we can filter down to just bench_fastmod... Alright? Cool! So it just ran the one, and this time it will be nice and consistent. So we can look at this: there's only one function. Alright! What's actually taking place here? Let's go up and get ourselves oriented - you'll just have to trust me that this is the start of the loop. We load some input data, we compare it, and if it's less we jump over the mod, alright? Then we load some data, compare it, and jump over THAT mod... We just keep doing this.

Now, why is this slow? The reason it's slow is that we've got these very small runs of instructions and then we jump over even more instructions - we're constantly jumping forward. Now, the primary reason modern processors are slow is cache misses. The processor is faster than any program you've ever run on it; unfortunately, it's spending all its time waiting for your program to have data available for it. And we're going to see a really large number here - like fifteen - just for jumping back up. But this is not actually the jump back up. Again, our measurements are imprecise: you'll note the utter lack of samples next to the idiv over here. We're actually accounting here for the cost of computing the modulo. Even though we only spent 20% of our time computing modulos, that's hotter than the rest of our code... OK?
Only 20% of our "numbers" needed the modulo computed, but we actually spent most of our time in those instructions, which is really frustrating, right? But as we get denser and denser distributions below the ceiling, this is going to get better and better, because we reach that code less and less often. Make some sense? Fantastic!

Now, before I stop and take some questions, I bet you all have one burning question. I spent a bunch of time talking about modulo and all this other stuff. Any questions? What's the most important question right now? What have I failed utterly to do? (audience member says "baseline this") Baseline! What does it look like if I just do this? Why do I need all this complexity? OK, so we've got to actually do our homework here. We've kind of been playing fast and loose, because I was excited - I invented something, folks, it was awesome! But now you're going to make me be honest. OK, so I have, kicking around here, a benchmark that includes both... (rapid typing)

(inaudible question from the audience) In that sense we're also getting different samples as well, but you can fix the seed if you want - there are lots of techniques to pin this down harder. Yes?

AUDIENCE MEMBER: I was wondering - instead of using the Google Benchmark library, could you use std::chrono to measure exactly what you really want, in a fixed distribution? So if you're doing something, you can actually test how much time it's actually running...

AUDIENCE MEMBER: Two questions. First, why omit the frame pointer and not use the LBR unwind on Intel, or DWARF unwinding?

CHANDLER: So, why am I using -fno-omit-frame-pointer? The first alternative to the frame pointer is debug information - DWARF unwinding. Unfortunately, I have had terrible success with DWARF unwinding: it tends to miss frames and misattribute frames really, really often, so I don't trust it. I just don't trust it. The frame pointer always gives me consistent results; even when the profile is a little bit hard to stare at, it's much, much more consistent than DWARF. LBR is a fantastic tool, but LBR magnifies the statistical random errors that you see with frame unwinding. LBR is a fantastic way to do statistical profiling; it's somewhat less useful for humans, because it adds even more imprecision to your measurements. That tends to be my preference - and I don't think you can use LBR together with frame pointers; LBR has to be the unwind information, and it has too many error cases.

AUDIENCE MEMBER: OK, thanks. And speaking of precision, why not use the precise counters which perf supports, or even better, correlate more counters - like cycles with instruction counts and cache misses? You had an example where, probably because of the sampling rate, it was missing the div and making the jump look expensive, so increasing the sampling rate would help - and I personally don't think it would affect the benchmark, because you run it enough times, etc.

CHANDLER: The sampling rate won't help here. The reason it's missing the div isn't the rate of sampling; it's the imprecision of actually collecting the sample data (inaudible).
AUDIENCE MEMBER: That to me is a solved problem, but one of the more bothersome problems I've seen, especially with vectorized instructions, is skid - where basically the CPU counters are a little bit off and you can't quite trust them, so you kind of have to do low-level, microcode-y type things to the CPU to eliminate skid. What strategies or tactics do you have for doing that in perf versus a more complicated tool? Or is perf not capable of doing that? Can you just talk about that a little bit?

CHANDLER: I have no idea if perf is capable of doing that. The way I get around this is I read perf with a very, very large grain of salt, because it's essentially giving me raw data and it's not correcting for it, and I try to mentally correct for it: looking at the instructions, reasoning about what their latency should be, and then seeing how the samples actually match up with that. The other tool I'm a huge fan of, because it's very easy to use and gives you a lot of this data, is the Intel Architecture Code Analyzer. You can essentially give it a snippet of instructions and it will explain Intel's accurate model of how those instructions should execute; if you look at that, and you look at your profile and the samples you're seeing in the wild, you can get a pretty clear idea of where the samples came from, even when they're skewed. So I do more of a mental mapping, because I don't really trust any of the tools to undo the fundamental skew that's inherent in the counters.

[Answering a question about escape and clobber versus volatile:] They are very, very general - you can almost always get them to preserve the piece of code you want - and they are, unfortunately, just a bit more powerful than a volatile variable. They're also a little bit easier to explain, in my opinion; I actually think that reasoning about things like escape and clobbering memory is fairly reasonable and pretty easy for folks to think about. One of the other nice things about having a clobber at the end of a loop iteration is that it prevents the compiler from doing loop computation analysis - from just figuring out the sequence of things you're going to store into this volatile memory, preparing them ahead of time, and then just storing them. And there are optimizers that will do this. So the clobber isn't just helping with the escape; it also tends to help with the inputs actually being new each time around the loop iteration, rather than being computed once and cached.
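[Putting the two helpers back into the earlier vector benchmarks gives roughly this shape - again a sketch under the same assumptions as above, not the exact code from the talk.]

    static void bench_create(benchmark::State& state) {
      for (auto _ : state) {
        std::vector<int> v;
        v.reserve(1);
        escape(v.data());   // the allocation now has an observable use
      }
    }
    BENCHMARK(bench_create);

    static void bench_push_back(benchmark::State& state) {
      for (auto _ : state) {
        std::vector<int> v;
        v.reserve(1);
        escape(v.data());
        v.push_back(42);
        clobber();          // the stored 42 must genuinely be written each iteration
      }
    }
    BENCHMARK(bench_push_back);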