CppCon 2017: Chandler Carruth “Going Nowhere Faster”

Video Statistics and Information

Reddit Comments

Can we just give Chandler like twice as much time? Or like a day? I would go to that...

👍︎︎ 14 👤︎︎ u/sphere991 📅︎︎ Nov 03 2017 🗫︎ replies

I bet "jl .myfuncLabel; mov $255, (%rax)" works faster than "cmovge %ebx,%edx; mov %edx, (%rax)" simply because latter uses two extra registers (ebx/edx) with dependency between them. I.e. half of this (decent) presentation is about a problem in optimizer.

👍︎︎ 4 👤︎︎ u/crusader_mike 📅︎︎ Nov 03 2017 🗫︎ replies

Does anyone have an answer to the question asked at the very end about how the processor avoids essentially invalid code?

👍︎︎ 3 👤︎︎ u/Planecrazy1191 📅︎︎ Nov 02 2017 🗫︎ replies

I installed the YouTube addon in Kodi yesterday, so I could spend my evening watching Chandler Carruth talks.

👍︎︎ 1 👤︎︎ u/jnyrup 📅︎︎ Nov 04 2017 🗫︎ replies
Captions
All right, folks, come on in. I wanna give you all a fair warning. You probably want to move towards the front of the room. I know it's scary up here. I'm not gonna throw anything. But there's gonna be a lot of code. It's gonna be kind of small. Because I'm gonna actually be coding it live, and so I can't use super big slide fonts, sorry. But yeah, people way in the back, you're gonna have a hard time. You might wanna move up a little bit. We have enough space here. All right, so my name's Chandler Carruth. I'm here to talk to you about going absolutely nowhere faster, which is probably a little bit of a confusing title. Sorry for that. But it seemed like a good funny joke. This talk is gonna be a little bit unusual. So a lot of my optimization talks, I actually try and tell you how to make things go faster. I've given several optimization talks at CppCon. This one's not going to do that. If that's why you're here, sorry. I'm trying to set expectations, so. I am gonna try and tell you how to understand why your code is going fast or isn't going fast. But today we're gonna focus on when your code is actually going nowhere at all, but either fast or not fast. We're gonna talk about loops. We talked about data structures, talked about a lot of other things around performance. This time we're gonna talk about loops. There's gonna be a decent amount of live demo. I'm gonna show you stuff. I'm gonna try and show you stuff, hopefully. Please don't do anything particularly mean to the wifi. That would be much appreciated. The other thing I do wanna mention is please be thinking of questions. One of the challenges of this topic is that there, and I'm a terrible speaker. I'm gonna give away the punchline on the title slide. There is no one good answer when it comes to loops. I'm gonna talk about two specific things, mostly. There's some bonus stuff if we get to it, but I'm not sure we're gonna get to it. I'm gonna talk about two big things today. But there's a laundry list of things I could talk about. We could keep going for hours and hours. And there's no telling which one's gonna be the most important for you. I'm focusing on the thing that I think is the most common for people to run into and the thing that I think is the hardest to understand. That's how I picked my two. But I don't want you to get the idea that, oh, well, those are the two things that go wrong with your loops. Okay, we're done now, cool. It's not quite like that. And just a little bit of background information in case, how many folks have been to a talk from me at CppCon before? Oh, this is a bad question. How many folks have not been to a talk from me about optimization before? That's a lot of hands too. Okay, so for reference, I work at Google. I work on Google's C++ team. I'm the tech lead for our C++ programming language platform. And when I'm actually coding, the thing I like to code on is LLVM, which is our preferred optimizing compiler. I do a lot of work on LLVM and compilers and optimization, cause I really like performance and I'm totally obsessed with that. And I come to CppCon to get to actually talk to you all about cool things you can do to make programs run faster. That's really why we're here. So with that, let's get started. Briefly, I have given three talks at CppCon before. You should go and watch them, not because I'm super proud of them, necessarily, but because they're way more important than this talk. (laughing) I'm just throwing that out there, these are way more important. 
You've gotta use efficient algorithms, you've gotta use fast data structures, you have to follow idioms that make your code fast. You can't just let your code slide down everywhere. That doesn't mean you add complexity to do it. There are simple idioms that are fast. And you wanna make sure that you follow those really, really carefully. Always, always, always benchmark the code that matters and understand why its performance matters, right? Understand how it performed, in detail. Otherwise there's no hope. And you gotta use hybrid data structures to optimize your memory allocations and cache locality. These are way more important, but but we gotta talk about some other stuff, too. So I just have to remind everybody you only care about the performance that you have benchmarked. If you didn't write a benchmark for your code, you do not care about its performance. That's not an opinion, okay? That's fact. This is how you show you care about performance. So in every case that I talk about, I'm gonna assume that you have a good benchmark. And the thing that I haven't talked about before is what happens when you have a loop in your benchmark and it's actually this loop that is the critical thing to your performance? You've got good data structures, you've tried to do other things, and you don't understand why this loop is such a bottleneck for your performance. Well, turns out that it's still probably data. I'd like to say there's this brilliant trick to loop optimization, but 99 times out of 100, it's actually still data. If you want to get lots of high-level ideas about how to fix this, there are a bunch of great talks out there. Mike Acton talked about data-oriented design. Several other people have talked about how to design for cache locality and try to minimize your working set of data. I gave a talk about profiling, but I also gave a talk about hybrid data structures that are really trying to help you solve this problem. And we're hoping in the future you can use a tool called Efficiency Sanitizer. How many folks here are familiar with one of the sanitizers? Okay, how many folks here are asleep? (laughing) Fair enough. You just had lunch, I understand. But we actually looked at adding an Efficiency Sanitizer. Instead of looking for correctness bugs, what if you just found really bad performance patterns? It's not really very fleshed out. If you're interested in this, it'd be great to help out some. But that's actually something I hope you can use in the future. But let's actually go back to profiling for a minute, because profiling tends to work pretty well. But there are some tricks to profiling that I didn't even get into the last time I talked about it that are very specifically useful when you have a loop and it's looping over lots of data, right, and that data is the source of your performance problem. All right, let's see if we can switch to live demo here. Nothing could possibly go wrong. All right, so that doesn't look entirely great. That looks better, okay. So let's actually look at a benchmark that specifically is kind of designed around understanding bad data structure or cache locality problems. And writing this benchmark is actually a little bit tricky. So I'm not gonna try and live code it on you. I think that would just make all of us really unhappy. So I've gone ahead and I've written this benchmark up. It's called cache bench, right, I'm very specifically targeting kind of cache locality benchmarks. 
How do we understand the performance of your cache and of your working set? And since we're targeting this kind of data-based performance problem, we're going to actually parameterize this benchmark on a particular size, okay? And a size in bytes, not on how many elements or something. But we're actually gonna go after very specific byte sizes. So we take in this s.range zero to kind of grab a size from the benchmarking framework. We shift one by it, and this just lets us get a power of two. I don't want to type out really big numbers. I'm bad at it. So we're always gonna use a power of two bytes, okay? Then we're gonna figure out how many elements we want in our data structures in order to use roughly this many bytes. We're gonna build two data structures. They're both gonna be vectors. We're gonna put ints in them. And so we can kind of guess at the amount of elements by dividing the number of bytes by the size of an int and then dividing that into two, cause we're gonna have two vectors, cool. We fill up the first vector here with random data. All right, I have my little random number generator class here to make this code fit on a screen. It's super fancy. All it does is produce something I can iterate in a range-based for loop. I can show it to you if you're desperate to see it. But it's not really that awesome. But it gives me so many random numbers. That's what we got here. And we fill up our vector with it. And then we go and we actually, the next vector's a little bit more interesting. So this is called this indices vector, right? So our indices vector we're going to fill up with indexes into the other vector, but random indexes. And we'll talk about why we're gonna do that in just a second. So with this one we go down, we restrict our range for the random numbers to zero to the count, right? We're gonna put a bunch of stuff into this. And then we're gonna go into our benchmark, we're gonna keep running the benchmark until we get enough samples to kind of have accurate measurements. We're going to sum things. And what we're going to do, and this is the tricky part here, so we're gonna loop over the indices, these random indices, and then for each of these random indices, we're gonna go load the data out of the first vector at that offset and add it to our sum. What does this mean? It means we have two access patterns to memory. We have one vector where we're just walking through memory linearly. And half of our working set is doing that. And the other half of our working set is randomly bouncing all over the place. No way to predict where it goes next. So this is designed to undermine cache locality. This is exactly the kind of access pattern you'd have, for example, if you have a lot of linked list data structures and your allocations are randomized. Then you're gonna be bouncing around all over your data structure. In particular, this happens to be exactly the kind of data structure access you see if you have something like an unordered map or an unordered set out of the standard library, because it stores all of its elements behind one layer of indirection, which means you essentially have a vector of random pointers off into memory, and you see access patterns that look very, very similar to this. This is just a little easier to code and think about. Everybody happy with my benchmark? Please, if someone is confused, if you don't know what I'm talking about, if none of what's on the screen makes sense, shout, raise your hand, jump up and down, find a microphone.
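The benchmark itself is never reproduced in the transcript, so here is a minimal sketch of the shape being described, written against the Google Benchmark API. The names, the RNG, and the exact argument range are assumptions, not the code shown in the talk:

    #include <benchmark/benchmark.h>
    #include <cstdint>
    #include <random>
    #include <vector>

    // Minimal sketch of the "cache bench" described above. The working-set size
    // in bytes comes in as a power of two; half of it is walked linearly (the
    // indices vector), half of it is probed at random (the data vector).
    static void BM_CacheSweep(benchmark::State& s) {
      const std::size_t bytes = std::size_t(1) << s.range(0);
      const std::size_t count = bytes / sizeof(int) / 2;   // two vectors share the budget

      std::mt19937_64 rng(42);
      std::vector<int> data(count);
      for (int& v : data) v = static_cast<int>(rng());     // random payload

      std::uniform_int_distribution<std::size_t> pick(0, count - 1);
      std::vector<int> indices(count);
      for (int& i : indices) i = static_cast<int>(pick(rng));  // random probe order

      for (auto _ : s) {
        std::int64_t sum = 0;
        for (int i : indices)           // linear walk over the indices...
          sum += data[i];               // ...random loads from the data
        benchmark::DoNotOptimize(sum);  // keep the work from being deleted
      }
      s.SetBytesProcessed(std::int64_t(s.iterations()) * std::int64_t(bytes));
    }
    BENCHMARK(BM_CacheSweep)->DenseRange(13, 24);  // 8 KB up through 16 MB
    BENCHMARK_MAIN();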
I've gotta explain it, I've gotta actually help everyone understand it or we're all lost here. All right, so this is my lovely benchmark. Let's run it. That seems like the most prudent thing to do next. So fortunately, I went and built a whole build system for this particular demo. I didn't wanna sit around here hunting for compiler command line flags. And so I can run this. And when we run this, we're gonna see some lovely numbers come out and some stuff that I've tried to do here just to make this a little bit easier for us to think about when we see these numbers. I've tried to really carefully pick the range of values we're benchmarking, and I've tried to print out kind of why those values are particularly interesting for us to look at, okay? All right, so let's dig in just a little bit here. All right, so we start off with eight kilobytes of data. With eight kilobytes of data, we see things are fast. With 16 kilobytes of data, things are still fast. With 32 kilobytes of data, things are still fast. With 64 kilobytes of data though, things start to slow down a little bit. We didn't change what we were doing, we just grew the amount of data that we're kind of randomly probing. All I'm doing here is just measuring time, and you can already see this. And then by the time you get to 128 kilobytes, we're really starting to slow down. We've gone from 18 gigabytes per second of memory accesses, essentially, and sums, to 15 gigabytes per second. That's a pretty substantial slowdown, given the fact that we've just gone from a very small amount of data, 32 kilobytes, it's tiny, right, to 128 kilobytes. This is also still kind of tiny. But it happens that I know something about this processor and this machine. That's gonna become pretty important. So this is a AMD processor. This is a little server that I built recently. And it's running the new AMD Epyc chips. So I don't know if you all heard of them. They're super fancy little chips. But they have a 64 kilobyte L1D cache. And wouldn't you know, the performance starts to teeter off right as we get up to 64 kilobytes and we're actually getting cache contention. And because of the way caches work, when you're doing randomized access, you're gonna have collisions in your cache. So you don't get exactly 64 kilobytes. You get close to that, you start to see performance fall off, you go past it, performance degrades substantially. And so doing nothing more than measuring time, we can actually observe the cache size. So I think that's pretty cool. Maybe everyone else is like, this is so boring. But I think this is awesome. So then let's look further down here, because we got more numbers here. So if you keep going down, we see we got 128 kilobytes. We've got 256 kilobytes. The performance is kind of degrading, but there's not really another kind of bend in the curve here. We get to 512 kilobytes, still going down, not a huge bend, but it starts, it kind of knocks off a bit after 512 kilobytes, which I found really interesting. Dropped down to 10 gigabytes per second there. It's not a huge drop, but it's not trivial either. And then it starts to drop, drops faster and faster until we get to eight megabytes. And then from eight megabytes to 16 megabytes, the performance just craters, absolutely craters. So does anyone have a guess about the other two cache sizes of my processor? (laughing) Anyone? Anyone? Come on. No, no, come on. Yes? (person talking quietly) 16 megabytes, I heard. (person talking quietly) Eight, why eight? Why do you think it's eight? 
Look at the jump between four megabytes and eight megabytes. It's not substantially different between two megabytes and four megabytes. And we shouldn't expect to be able to fully use an eight-megabyte cache. Cause there's gonna be collisions in that cache. It's when we hit 16 megabytes that we see a precipitous drop off. And that likely means we're missing cache. And it's really interesting, this one's precipitous. And it makes sense that one of these is way more dramatic than all of the others, cause that's the last level cache. When we miss here, we hit memory. When we miss here, we take orders of magnitude more time. And so we should expect this precipitous drop off, and we should expect it around the actual size of the cache, not before it, not after it, right at the size of the cache, when it's full, not when it's too full. And then it stays down. But there's another cache layer in here. This processor has L1 cache, has L2 cache, has L3 cache. L1 cache is 64 kilobytes. L3 is 16 megabytes. Any guesses what L2 is? How much? (person talking quietly) One megabyte. Any other guesses? 512 kilobytes. Someone's looking up on Wikipedia. It is 512 kilobytes. It's exactly 512 kilobytes. And this one I don't understand. I don't know how it actually has this good a performance at 512 kilobytes of working set. So that's a great question. We should figure out why this has good performance. Okay, so we can actually do something a lot more interesting than just running this benchmark and looking at timing. It's cool that I can build a benchmark to show me the cache and how it behaves. But I probably am not trying to fix the performance of that benchmark. That's not likely an exciting prospect. I built it to perform badly. What I actually wanna be able to do is I wanna be able to recognize when a real program is doing that and I didn't intend for it to be doing that. So we need a better measurement tool. So the first thing we need to do here is we need to come up with a way of getting better data. So the benchmark framework I'm using has some cool flags here. You can filter down to one benchmark measurement, and you could also ask it to run that benchmark measurement for more time. And this gives you a lot more data. So I've asked it to run for probably too much time. Okay, that's just silly. Let's run it for two seconds here. And this'll give us pretty stable numbers. And so this is right at 64 kilobytes. So I probably want to back off one step here so we can look at one step before this. So this is in cache. We expect this to be super fast. This is L1 cache. Cool, it is. Now, how do we actually tell that this is staying in L1 cache? Well, that's where our performance monitoring tools come in, perf. Last time I talked about perf, I showed perf stat. I said, this is really cool, but we're not gonna talk about it a whole lot and ran off to do profiling. Today, we're only gonna talk about perf stat, because there's no interesting source code here. What we're trying to understand is what the processor and the machine is doing. So I run perf stat. I get a bunch of data here. But nothing looks bad. I have very few stalled cycles. My processor's running fast. I don't have any context switches. This looks great. This is about as healthy as you could get a benchmark to run. But as soon as I go up to 64K, I start to see a different result. All of a sudden, I have this 25% of my cycles stalled, right there. So that's your first signal that you have some kind of memory bandwidth problem. 
It might be something else, though. There are other things that cause front end stalls. So how do we drill down even further? Let's look for a better tool. If you run the perf list command, it actually will list out all of the awesome things perf can measure. And it can measure amazing, amazing things. And right at the top is this beautiful list of things, L1D cache load misses. That looks terribly promising. Okay, we gotta try that one. All right, so if you wanna measure, ah, no. So if you wanna actually measure a specific counter, you have to tell perf to do that instead of its typical counters. So we wanna measure L1D cache loads. Well, actually, we don't know that it's D cache. It might be I cache. And you can actually measure more than one thing at a time, so you can say show me the loads. Also show me the load misses, probably good. Show me the D cache loads and the D cache load misses. Did I manage to typo any of that? I don't think so. We'll see. And this, look at that. Okay, sweet. So we run this, we get some information here about the fraction of L1D cache queries were cache hits. Sorry, sorry, were cache misses. So 0.04%. Now, the thing that's just amazing to me is if we back this up one step, if we go back one step to the one that was super fast, you're just gonna be just staggered by this, I think. So the difference between four to five percent front end stalls and 25% front end stalled cycles was a difference of 0.01% L1 cache misses to 0.04% L1 cache misses. That's why everyone is so obsessed about cache locality. We regressed cache locality by 0.03%, and we regressed performance by like 20%. That's nuts. This is crazy talk, right? Well, that's processors for you. And this works all, like we can keep going down here. So if we hop down a bit, I don't remember the exact one. You're gonna see this number just keep growing. So here at one megabyte, we have 12% L1D cache misses. The L2 isn't terribly interesting, so I'm just jumping ahead on you people, sorry. If we go a little bit bigger, we're gonna see even more. So at eight megabytes, we have 1.6% L1D cache misses, 1.6% L1D cache misses. So we have 100 times the D cache miss rate. Because we have preposterously more working set. But now you know how to actually tell am I blowing up my cache? I would love to show you all of the other hardware performance counters that measure different parts of the cache. Unfortunately, most of them don't work. And that's a little concerning to me, cause I work on performance stuff, but hardware performance counters are actually still really finicky and hard to get to work. And so this is a very new processor. The Linux kernel and the profiling tools don't appear to have caught up. So the L1 is the best one I can probe here. Any questions about probing for cache misses before I move on? No, maybe? Everybody happy? Everybody's just like, I don't even know what you're doing anymore. (laughing) You're playing with these toys that don't make sense. All right, let's briefly return to our presentation here, see what else we wanna talk about. Yes? (person talking quietly) Is there something to say about the do not optimize function call in my code? That is the benchmark library's beautiful packaging of that horrible, horrible trick I showed you two years ago to get the compiler to not optimize away your benchmark code. And you can thank Eric Fiselier, who's over here. He added it to the benchmark library like at the same time as I was presenting the horrible stuff. But now I get a nice API.
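The demo drives these counters from the command line (something like perf stat -e L1-dcache-loads,L1-dcache-load-misses on the benchmark binary). Purely as a reference for what those events are, the same L1D counters can also be read in-process on Linux through the perf_event_open interface; this is a sketch, not something shown in the talk:

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstring>

    // Open a hardware L1D cache counter for the calling thread. Pass
    // PERF_COUNT_HW_CACHE_RESULT_ACCESS for loads, _RESULT_MISS for misses.
    static int open_l1d_counter(std::uint64_t result_id) {
      perf_event_attr attr;
      std::memset(&attr, 0, sizeof(attr));
      attr.type = PERF_TYPE_HW_CACHE;
      attr.size = sizeof(attr);
      attr.config = PERF_COUNT_HW_CACHE_L1D |
                    (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                    (result_id << 16);
      attr.disabled = 1;        // start stopped; enable around the region of interest
      attr.exclude_kernel = 1;  // count user space only
      return static_cast<int>(
          syscall(__NR_perf_event_open, &attr, 0 /*this thread*/, -1 /*any cpu*/,
                  -1 /*no group*/, 0));
    }

    // Typical use: ioctl(fd, PERF_EVENT_IOC_RESET, 0) and PERF_EVENT_IOC_ENABLE
    // before the hot loop, PERF_EVENT_IOC_DISABLE after it, then read(fd, ...)
    // one 64-bit count per counter and divide misses by loads.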
But you can go look at the benchmark library. It's Google Benchmark, like Google slash Benchmark on GitHub, and it has documentation for the function. All right, so that's great. This is all great. I've shown you one clever trick that'll let you debug all of your cache locality problems. Well, not really. It'll tell you that you have cache locality problems. And then you have to go do a bunch of work. And we've had a lot of great talks about how to actually fix those. I don't wanna repeat that. So now I'm gonna jump to the other topic for today. The other topic is, I think, the hardest single topic to really internalize in your brain about optimizing loops, okay? And this one applies to tiny, microscopic, tiny little loops that you sometimes end up with. And we still need to make those fast. And we're just gonna go right back to live demo. I hate slides. Live demo's better. All right, let's take a look at a tiny little loop. So this is my least favorite loop in the world. I hate this loop. I'm not actually being facetious. This loop has caused me more pain than any other loop, really. This is kind of a silly version of it. I've really simplified it a bit. This is what's called the clamp loop, where you run over a big pile of integers and we see if any of those integers are bigger than some upper bound, and if they are bigger, we clamp it to that upper bound. This loop, in various forms, not really this form, but in various forms, this loop shows up in unimportant code like every compression algorithm based on zlib. And every decompression algorithm based on zlib. It also shows up in most image compression algorithms. This thing is everywhere. You'd be just amazed at how much of your CPU on your phone and your laptop, how much of your CPU is dedicated to running this annoying little loop. Now, I've also turned off a bunch of fancy, fancy loop optimizations that the compiler does, because I actually wanna look at the simple version of this loop. And in a bunch of cases, we actually see the simple version in real code because in a bunch of cases, we don't see one loop on its own in the middle of a micro-benchmark designed to test it. And so much like we have to turn off the optimizer for micro-benchmarks to keep them from deleting my code, I also kind of need to dial them down a little bit so that I can understand what the behavior of this loop is without it being vectorized and unrolled and peeled and turned into some kind of monstrosity. All right, so I've turned all that off, just to simplify things. And I've got my little benchmark. Any questions about this benchmark? If anyone doesn't understand it, I'm happy to clarify. Otherwise, I just want to get onto showing you the craziness here. All right, we're gonna see if we can do this. So, again, I've got a build system. I'm up to date, excellent. So let's try running my clamp benchmark. I just gave it some random sizes. It doesn't really matter. Right, it's running, it's doing 1.4 billion items per second. It's running pretty fast. This is solid. This is a four-byte item, so it's actually churning through a reasonable amount of data. This looks great, this looks fantastic. But we should actually take a look at what it's doing and see if it's doing something that makes sense to us. And we looked at how to do that last time, so let's just quickly do something like this. So I'm gonna run it for a little while to make sure I get enough sample data to have something useful. Recording it, recording it. I told you it was live. 
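A sketch of the clamp micro-benchmark being discussed. The bound and clamp constants are inferred from the assembly walked through below, and refreshing the input on every pass is an assumption about how the benchmark keeps clamping real work (it adds a copy to each pass); this is not the code from the talk:

    #include <benchmark/benchmark.h>
    #include <cstdint>
    #include <limits>
    #include <random>
    #include <vector>

    // The clamp kernel under discussion: anything at or above the bound gets
    // pulled down to the clamp value.
    static void clamp(std::vector<int>& v) {
      for (int& x : v)
        if (x >= 256) x = 255;
    }

    static void BM_Clamp(benchmark::State& s) {
      std::mt19937 rng(42);
      std::uniform_int_distribution<int> dist(0, std::numeric_limits<int>::max());
      std::vector<int> source(s.range(0));
      for (int& x : source) x = dist(rng);   // almost every element will need clamping
      std::vector<int> data;

      for (auto _ : s) {
        data = source;                 // assumption: refresh so every pass really clamps
        clamp(data);
        benchmark::ClobberMemory();    // keep the stores observable
      }
      s.SetItemsProcessed(std::int64_t(s.iterations()) * std::int64_t(source.size()));
    }
    BENCHMARK(BM_Clamp)->Arg(1 << 16);
    BENCHMARK_MAIN();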
Okay, so this is exactly the profiling steps that I showed you last time I was up here doing live demos with profiling. We wanna annotate this thing and look at the code. Okay, so here's the code. So this is the loop. Hopefully folks can read this. We jump back up to the top, kind of compare these things, we do this fancy cmovge thing. All right, so let's figure out what this actually means. Step one, move. This is loading the data we're gonna clamp. Step two, comparing it with 0x100 in hex. This is checking to see whether we need to clamp it at all. Then we do this cmovge thing. This is a special X86 instruction. It conditionally moves a value between registers. And so that GE is a condition predicate. If that comparison found that the value we had was greater than or equal to the thing we compared it with, we wanna replace it, we wanna clamp it. And it happens that we've actually put the replacement into a register already, so we have that in ebx over here, and we move it over top of it. Then we store the result back into memory, so we actually have done our full clamp. We add four to our pointer, because we've gotta go to the next entry. Then we check to see if we're done. We're not done, we branch back up. Make sense? That's the loop. Everybody understand the clamp loop? It's super simple, right? Everybody's like, this is super simple. Okay, so X86 has this awesome instruction called cmovge. It seems designed to implement this clamp loop. Aren't you glad the compiler took care of generating this magical code for you? Well, see, I work on compilers. And I'm suspicious. And so I'm like, I don't know. Last time I got up here I invented a faster mod. We can do something here, okay. So maybe we're crazy and we think we can totally do better than this. This is totally within our power. So let's see if we can get at this thing. So I went ahead and compiled this thing down so I could get the assembly, because I don't really like staring at object code or looking at build systems all that much. So this should give me the assembly. All right, so here's my loop. You can see the cmovge. There's a bunch of other stuff here. Actually, I can clear the other stuff we don't need. We don't need debug information. We're good without debug information. Where did you go? Yeah, we don't need this stuff. I'll make it easier to read. If I can type. Okay, cool. So there's my cmovge. This looks like the code we just saw. This looks the same. I mean, it's using a decimal number. I don't know why it's using a decimal number. But it looks the same, right? Okay, so if I'm sitting here thinking, look, I'm way more clever than my compiler is. I don't need this cmov thing. I think that's kinda lame. Because maybe I'm thinking, this cmov thing seems kinda clever, but what if we just didn't do that move at all? What we could do here is we could just branch, we could branch to something. But I need something to branch to, so I have to make up a name down here. But we could branch to, I don't know, seems fine. All right. So do you all see what I'm doing here? So we're still doing the comparison, but we're jumping if not greater than or equal to. If it's not greater than or equal to, we're gonna jump over it. And if it is greater than or equal to, we're gonna have to clamp it down. Is that in ebx, I think? Have I made any bugs here? I think we're good. Yeah, so if it's greater than or equal to, we're gonna go and we're gonna store our constant into it. If it's not greater than or equal to, it's already below our clamp.
We're just gonna skip over the store. How many folks here think this is gonna be faster? (person talking quietly) Depends on my data, depends on my data. So do we not remember what the data is? Okay, here, let me go look at my data. That's a great question. So if you look up here, it's kinda hard to read, hold on. So this should be easier to read. So I'm picking random numbers between zero and int max. So there're not gonna be a whole lot of them below 256. So anyone think this is gonna be faster? Why do you think this is gonna be faster? (person talking quietly) Branch prediction, why? (person talking quietly) It's all the same branch. But you still have an extra branch. (person talking quietly) One more time? (person talking quietly) Branch prediction? (person talking quietly) So branch prediction guesses the jump forwards. Not true in modern processors, it turns out. Used to be true. Pentium six had a forward-based predictor, but modern processors don't anymore. Any other theories? So for the record, my theory was this is not gonna be faster. We had one branch, now we have two. They're super close together. This isn't good for performance, I was thinking. This is a terrible idea, because we're never actually gonna take the branch. We're just gonna consider taking the branch, take up an entry in our branch predictor for no good reason, we're still gonna store the same number of times to memory. This is the perfect case for a cmov. It's a cmov that actually always happens. That's the only thing that made sense to me. I was actually pretty convinced that this was a bad idea. I was so convinced I actually spent a bunch of time making LLVM produce that cmov. (laughing) It's a true story, I'm not making that up, okay? I really cared about this. I thought this was a bad way to go. But we should find out whether this is actually a bad idea or not. What do I wanna do? None of that, none of that, none of that, come on. There we go, I wanted it to look like that. Let's make myself a fancy clamp. Oh, goodness, I'm not waiting five seconds. I'm an impatient soul. All right, so uh oh, uh oh, hold on. I don't remember what the other one was. This is the problem with doing it live. So we have our fancy clamp, and we have our not fancy clamp. And it turns out that cmov is faster. So the really funny thing is that the cmov wasn't faster earlier today. (laughing) (applause) Which is rather surprising. And now of course you all know the gag I was trying to play on you and failing just really miserably. Did I get this wrong? Did you people let me down and let me program this wrong? Let's do it the other way, cause that's how I did it when I tested this. I think that's right. (person talking quietly) Oh, ebx is set way, way, way, way, way, oh, but I bet I know what's going on here. See, when I did this, I was kind of annoyed at that ebx anyways, and so I did this, cause that seems obvious. Oh, I have to rebuild it. So somewhat hopefully this will reproduce results. There we go, cool. (laughing and applause) In case you were wondering whether this was live. Okay, so lo and behold, a branch can be faster than cmov. And I have to say, I did not expect that at all. It's not a lot faster, but it's faster. And I wanted to understand why, not just in this case, but actually we see this a lot. We actually see it a bunch of places where you would kind of expect cmov to be better, and it doesn't end up being a significant win. And it's kind of interesting to try and figure out why that's happening. 
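At the source level, the two shapes being raced above look roughly like this. Note that an optimizing compiler will usually canonicalize both into the same select/cmov form, which is exactly why the branchy version in the demo has to be produced by editing the assembly by hand rather than the C++ (a sketch, not the talk's code):

    #include <vector>

    // cmov shape: the value is recomputed and stored on every element;
    // compilers readily emit cmovge for this.
    void clamp_cmov_shape(std::vector<int>& data) {
      for (int& x : data)
        x = (x >= 256) ? 255 : x;
    }

    // branch shape: the store is guarded by a branch. The branch gets predicted
    // instead of creating a data dependency the way cmov does, which is why it
    // ends up slightly faster here even though the clamp is almost always taken.
    void clamp_branch_shape(std::vector<int>& data) {
      for (int& x : data)
        if (x >= 256) x = 255;
    }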
But to explain that, I actually, I can't use a tool to really explain that. It turns out that this is really hard to explain. We're also running a little bit short on time. So I want to prepare all of you. This is going to be a bit of crazy. So we're gonna go to slides here, because I can't do anything but slides. I'm gonna try and relay to you a talk that one of the really senior engineers who worked on LLVM for a lot of years gave as a keynote for the LLVM developers meeting a bunch of years back, back in 2013. He used a different example than clamp, and there's a good reason to use this example. Branches make thinking about this even harder, and this, it turns out, remember what I said at the beginning, this is one of the hardest to think about problems we have with loop performance. So we're gonna simplify this as something that doesn't have branches. We're gonna look at a dot product, all right? And this is exactly the code you would write to compute dot product. There's nothing fancy going on here. And if you compile this code, that inner loop is gonna turn into this X86 instruction sequence. It's just gonna turn directly into that. And I promise, I actually did compile it. I'm not making this up. All right. But this is not what the processor is going to execute. The X86 processor doesn't actually execute X86 instructions. X86 instructions are very high level. The X86 processor executes what's called microcode. It decodes the X86 instructions into smaller, more precise operations. And we can also kind of use a better notation, which actually shows you where the destination is and separates an input that might also happen to be a destination. And so I'd like to use the code on the right here, not the code on the left. The right is kind of what X86's micro ops are. Let's step through this. So we start off with a load. A load is actually just a direct mapping into microcode. No problem there. It's just one micro op, sorry, not microcode, micro op. But this imull has two things going on. It's loading, but it's also doing a multiply. And so it's doing those as two separate micro ops. The first one is just gonna load. And then the second one is gonna do the multiply. Then you have an add. That translates very naturally. You have another add for kind of the array index. Then you have a comparison. And you shake your head, because I've been lying to you this entire time. Sometimes micro ops are actually bigger than the X86 instruction. We have this thing called macro-op fusion in every modern X86 processor, and it says, well, I see you're doing a comparison and then you're branching on the result. I'm just gonna bake in an operation to do exactly that. And so we get this as a single operation in X86 when we're actually executing it. Make sense? Everybody happy with my little code sample here? Wonderful, okay. So there's one other thing the processor knows about this that's really important. It knows that this branch is going to be taken. It is confident this branch is gonna be taken. Of course it is, we're looping. It's taken every time until we stop caring. And so the branch is gonna be predicted as taken. Now, because this branch is predicted as taken, there's this fall-through path that's actually not interesting to the processor. And so notionally, we can think about this inverted. We can think about it the other way around, where we actually continue executing the loop body and we branch out of this loop if we're done. Okay?
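For reference, the dot product he steps through is just the textbook loop. A sketch, with the rough x86 shape noted in comments (illustrative register choices, not the exact listing from the slides):

    // Textbook dot product: one load per array, a multiply, and an accumulate.
    int dot(const int* a, const int* b, int n) {
      int sum = 0;
      for (int i = 0; i < n; ++i) {
        // movl  (%rdi,%rcx,4), %edx    ; load a[i]
        // imull (%rsi,%rcx,4), %edx    ; load b[i] and multiply -> two micro-ops
        // addl  %edx, %eax             ; sum += a[i] * b[i]
        // addq  $1, %rcx               ; ++i
        // cmpq  %r8, %rcx              ; i != n ?
        // jne   .loop                  ; fused with the cmp into one macro-op
        sum += a[i] * b[i];
      }
      return sum;
    }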
And that inversion is really key, because now we're gonna kind of notionally, this doesn't actually happen, but think about it as if you unrolled an iteration of that loop. So you have a second iteration of the same loop. And then another iteration after that. You can kind of stack these iterations up. It turns out this is exactly what the processor does, but not when it's executing it. It stacks up the iterations for the result of each computation. It has what's called a reorder buffer that looks pretty much like this and captures the result of every computation and can capture several iterations' worth of computation, not just one. Now, that's hard because we've got these registers in here that are reused over and over again. And so the processor also has an amazing feature called register renaming. Modern processors don't have the number of physical registers that you can write in assembly. They have many times that number of physical registers. And so when you write percent ecx here, it's actually gonna go and rename that to something else. And so we can kind of rename this entire set to have unique names so that we can execute any of these and capture any of their results, provided the dependencies are satisfied. Make some sense? So this is how the processor wants to think about finishing the execution of this loop. It thinks about finishing multiple iterations concurrently, all right? It's called speculative execution. It's incredibly powerful. Let's see exactly how powerful. To understand how this executes, you have to think about the parts of the CPU that are doing the execution. These are the functional units, or the execution units inside the processor. We have a load unit that's loading from memory. We have arithmetic units that are doing things like addition, multiplication. And we might have multiple units in the processor. Most modern processors, ARM, X86, most modern processors have multiple units that can all run at the same time. So I'm gonna kind of use a hypothetical processor with one load unit and two arithmetic units here, okay? And we're gonna kind of step through, cycle by cycle, how this executes on the processor or how it could execute on the processor. First, we have kind of a simple chain of events where we have no really interesting choices. We need to load two values. There's one unit that can do that, so we have to go there. We're loading from memory, and so there's latency involved in that. And so the next operation can't execute immediately. But at some point, the results become available and we can do the multiply. And then at some point the multiply results become available, and we can do this accumulate. And multiply's a relatively slow operation, so it might take a few cycles. That's what this is trying to represent. Don't get the idea that this is precisely accurate for X86. This is just illustrative. Okay, and then there are two other instructions that execute in that first iteration, but they're actually completely independent once you do register renaming. We can execute the increment on that array index right away because we can just remember the old value and use it if we need it. And so here we're seeing out of order execution and speculative execution from the processor. Sorry, sorry, not speculative, just out of order. So this is great, right? This means we've already handled those two instructions. But what happens next? We have this branch, but we predicted that it would be taken.
And so the processor just assumes that this is gonna be taken and continues executing this stream of instructions. And so it starts off on the next iteration, which it can pack in very densely because there's a lot of remaining space. So it can go ahead and execute the next increment. It can execute the next increment before any of the loads have finished, okay? Before any of the loads have finished. And then it can actually do the next test. And because it's already doing the next test, it doesn't have to work very hard to actually continue speculating, past this iteration into the third iteration. It's going to finish the third increment before the first load becomes available, okay? But look at what we've done here. Look at what we've done here. The latency to get from the very first operation to the actual result, the eax1 there, was eight cycles. The latency to get to the next end result was two cycles. This is why, this is how you get nowhere faster. This is actually probably the most important thing in modern CPUs for performance. It's just unbelievable what this does to your execution performance. And just to be fair, let's roll this whole thing forward until we've actually fully packed the execution units. And you can see there's actually no space to fully pack these execution units. We had one stall cycle here because we can't compute flag six. We can't compute the comparison for flag six on the sixth iteration because the increment to the index hasn't finished yet. And we can't quite do the increment to the index earlier because the previous increment to the index hadn't finished yet, because we're on the sixth iteration. And look at where the sixth iteration starts. The same cycle as the first iteration's result. Okay, so we have over six iterations of this loop executing in parallel on a single processor. All the way around the back edge. How many folks here realized that this is what executing this loop would look like in the processor? You guys pay more attention than I thought, good. So this is what makes it really hard to reason about performance. Now, let's think about something else with this. We have this cmov. Why is cmov weird? Because we can't predict cmov. We simply have to evaluate both inputs. See, we didn't wait to find out anything about the flag. When we were scheduling this, we didn't have to wait for flag zero to finish resolving to start the next operation. We just keep going, right? But with cmov, you can't do that. You can't speculate through cmov in the same way you can through a branch. And this is why branches, in a bunch of cases, end up faster than cmov. Making some sense? Looks like I have a question. Or you can speak up, I'll try and listen better, sorry. (person talking quietly) So the question is, why doesn't cmov get microcode translated into the right thing? Because that's hard. That's a very expensive transformation to do in a very constrained part of the processor. Knowing when it's safe to do that, when it's correct to do that is very, very hard. And there are cases where cmov is absolutely correct. There are cases where there's no advantage to speculative execution, all of the inputs are going to be ready long before the cmov has to be executed, and where actually doing the speculation might be hard. And so how does the processor know? And the answer is it doesn't. And so sadly, this falls on us compiler folks. All right. So this seems really complicated. I would like tools to help me reason about this kind of pipelining.
And I showed you kind of a fictitious processor, but it would be nice if there were better tools. I was gonna try and live demo a bunch more tools for you, but unfortunately we only have enough time for a Q and A session. And so instead, I'm gonna mention the tool, and I'm happy to show you guys some other time. It's called the Intel Architecture Code Analyzer. It's an amazing, awesome tool. It's just a piece of magic. You should all play with it. It's gonna give you the exact graph I did, except it'll be real for the code you give it. It's just amazing. I could try and do it, but it would take me like way more than the amount of time we have here. But let's do the other live demo, which I think is faster. And then folks, get ready for your questions cause we only have a few minutes. So one thing that I think is useful to do here is actually to look at what the performance counters tell you when you run these two different versions of the benchmark. Because the performance counters actually give you some good clues as to what's going on here. So if we look at the original one with cmov in it, one thing that we're going to notice is we have a tremendous number of backend stalls here. And that's not a good sign. We are actually struggling to execute that clamp loop. But this is also probably coming from storing to memory, and sadly, we are storing to memory on every iteration. And if I switch over to my fancy clamp loop, we're gonna see a pretty substantially different profile. So it finishes faster, but now you see something really confusing. Even more backend stalls. And this should confuse you, right? Like why did it get worse? But think about what we just showed. The processor is going to speculate that branch. And it's going to have a store into memory ready on every single cycle. All this benchmark is doing is storing to memory. It should be stalled every cycle, trying to store to memory. The fact that the other one wasn't is because it was doing something else. It was waiting on cmov. And I mentioned this was hard to tease out. It looks very counterintuitive. But this is the reality of X86 processors and most modern CPUs. All right, with that we should wrap it up here. Let you guys ask me the really hard questions that you've been holding back on, cause I know you have. It's good, I would too. Like I said, hopefully this lets you understand why your loops are going fast or not, lets you reason about at least two of the interesting cases. There are a million more. Unfortunately, I can't stand up here and monopolize the entire conference. So I'll have to come back and talk about more fun versions of this next time. All right, so questions? I don't know which side was up first. Let's go over here. (applause) - [Man] Thank you, great demo. And you showed it on AMD processor, right? Did you see any difference on Intel processors? I will freely admit, I haven't tried this on Intel processors very recently. However, in the past, I have not seen substantive differences. Most modern processors are very similar here. You'll find all of what I'm talking about here generally holds for modern desktop or server-oriented and even the high end of mobile-oriented Intel processors, AMD processors, ARM processors, and PowerPC processors. It's really ubiquitous. This is not actually specific to one architecture at all. For example, the out-of-order execution talk, the original version of that was given talking about an ARM processor, not an X86 processor. But it translates perfectly. 
The processors are much more similar than you might imagine. - [Audience Member] Working through SG14 and EWG is a proposal for likely/unlikely attributes to standardize things like __builtin_expect. We actually noticed the same thing while validating the use of that: when you put likely/unlikely on, it turns the cmov into the jump. Yes. - [Audience Member] And we saw it got better; around one tenth or a hundredth of a percent of the time, it would sharply drop off. Do you have any thoughts on using notations like that and a good way of dealing with them? I love them, they're great. I have one problem with them. Sometimes they get old and out of date and incorrect. What I would really, really, really like, and sadly I've never had the opportunity to go and build, is a tool that takes an actual profile of the binary and checks to see, so, did this profile and your static hints actually agree at all? Were they even close? Were they actually contradicting? And actually tells you, the programmer, hey, maybe you should go check these two annotations over here because when you ran your binary, those annotations weren't just unimportant. They were dead wrong. But I don't have that tool yet. And so the only hesitation I have is that if you're not careful, they can get out of date. Otherwise, I love them. Oh, one more thing about them. Those are especially great when you have some fundamental reason for them to be correct. Best example in the world, vectors growing during push_back, right? We actually have an algorithmic guarantee that vectors growing during push_back is cold. Because it's amortized away. That kind of guarantee, fantastic. - [Man] Hey, Chandler. You actually showed us live that comparing to a literal is faster than comparing to a register, which seemed, well, quite unnatural to me. Comparing? - [Man] Yeah, the cmp and the move, no, the move, sorry. In the move, yeah. - [Man] Yeah, it seems strange to me that moving a literal is faster than moving from another register where you already have the value. Yeah, I don't know. (laughing) I mean, I can sit up here and try to figure it out, but I don't think I'm gonna succeed. It might be entertaining, but I don't know that I'm gonna have an answer. My best guess is actually that it may have gotten in the way of register renaming. It does mean we have another register live, right? I don't know why that would be the case. It seems very strange for that to be the case. This loop doesn't seem complicated for the processor to handle. But I can't argue with your conclusion. - [Man] All right, thank you.
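For reference, the hints being discussed look like this: __builtin_expect is the GCC/Clang builtin, and the [[likely]]/[[unlikely]] attributes that came out of that proposal landed in C++20. Whether the hint is truthful for the data in the earlier demo is a separate question, as the exchange above notes. A sketch, reusing the clamp loop:

    #include <vector>

    // GCC/Clang builtin form: the second argument is the truth value we expect.
    void clamp_expect(std::vector<int>& data) {
      for (int& x : data)
        if (__builtin_expect(x >= 256, 0))
          x = 255;
    }

    // C++20 attribute form of the same hint.
    void clamp_unlikely(std::vector<int>& data) {
      for (int& x : data)
        if (x >= 256) [[unlikely]]
          x = 255;
    }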
If we had more time, I would, cause that's actually really interesting, to try and understand whether it's stalling because of synchronization, like establishing the TSO, because it does have to establish TSO for these stores, sadly. And we could do something to try and actually see that. For example, we could switch them to non-temporal stores. Because we happen to know we're not going to read them again, and see if that changes the behavior of the processor. And it might. It might actually make a big difference. - [Man] So instead of asking that, I'll piggyback on Paul Hansom's question about the __builtin_expect annotations. And I think a couple of the HFT talks here mentioned that sometimes you don't expect something to happen but you still need it to be fast. Got any opinions on that? No, it's absolutely relevant. We see this actually a lot with more genuine profile-based optimizations because there're kind of very boring reasons why a profile is sometimes wrong. However, there is some good news here. The compiler doesn't just go home when it sees something is cold. It tries to make the path that isn't cold as fast as possible. And it will make the cold path slower when doing so makes the other one faster. But it won't insert arbitrary slowness there. And so I would not expect this to be that bad. I mean, it could be, but that seems kind of weird. The way we actually talk about it inside of the compiler, inside of the optimizer is whenever we have a choice between making one side faster or making both sides faster, we always prioritize. But it's about priority. It's not really about pessimizing. I don't know if that really helps though. - [Audience Member] Given that cmov doesn't seem to do speculative execution... Yes. - [Audience Member] In the loops, what are the cases where it makes sense to use cmov? So it makes a lot of sense to use cmov when your dependencies are already going to be required. That's the classic way. In particular, for example, if you have to materialize the values to compute the condition, it can just not matter. And so then you get rid of a branch, branch predictor gets one less branch it has to keep track of. It can be a win. But I'm sympathetic. I've never met a cmov I liked. I just, it hasn't happened to me yet. I don't know why. But there is a theory behind the instruction. - [Man] On the first slide, when you were populating the micro ops into your pretend processor, you did a load on I think two, and then you didn't use the result until six. I was surprised to see that you put another load right on three. I was thinking, are the loads sort of fire and forget on a real processor? You can just load, load, load, load, load, and not worry about how long it's gonna take to come back? It depends. But yes, in many cases there is overlap there. There's two sources of delay. One is figuring out what to load. The unit's busy, it can't take more. But there is also just waiting for the cache and memory subsystem to produce the value. That's writing into a register file. The unit's not really involved in that. You're just waiting for a dependency to be satisfied, and then you keep executing. - [Man] So you definitely don't have to wait until six to put another load line in. Almost certainly not. - [Man] Okay, thank you. Almost certainly not. - [Audience Member] In this first case, in the cmov, have you tried profile-guided optimizations? Did it help? Yeah, LLVM kills the cmov and switches to a jump if you give it a profile. - [Audience Member] Oh, okay.
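The non-temporal store experiment suggested above would look roughly like this with the SSE2 intrinsics; whether it actually changes the stall profile on that machine is exactly the open question in the exchange (a sketch, not something shown in the talk):

    #include <emmintrin.h>   // SSE2: _mm_stream_si32, _mm_sfence

    // Clamp with a non-temporal store: the clamped value is written with movnti,
    // bypassing the caches, on the theory that we won't read it again soon.
    void clamp_nt(int* p, int* end) {
      for (; p != end; ++p)
        if (*p >= 256)
          _mm_stream_si32(p, 255);   // non-temporal store of the clamp value
      _mm_sfence();                  // order the streamed stores before returning
    }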
- [Man] In my amateurish view of things, isn't it clear that when you store the literal directly, it's faster than doing this because it doesn't have to, it already knows, when it decodes the instruction, it knows what the literal is. It knows what it has to store. So it doesn't have to wait for the register to actually have the value. I would not expect the processor to work that way. I would actually expect that the wires that do the store come from a register file entry. - [Man] Okay. But I'm also not a hardware architect. I tried to give you enough understanding of the hardware to work with it. But I don't build processors. You should track down someone who does. They might be able to give you a much more detailed answer. I know there's one person here who might have a clue, from Intel, at least. - [Man] Okay, thanks. - [Audience Member] So my question is about going nowhere faster. So if you're gonna speculate the instructions, and let's say the processor is gonna speculate doing some loads out of the bounds of the array. How does it work in the hardware that it doesn't crash? I don't even know. Magic and unicorns? Anyways, sadly, the session's over. I'm happy to try and find folks and answer more questions later. Thank you all. (applause)
Info
Channel: CppCon
Views: 76,138
Keywords: Chandler Carruth, CppCon 2017, Computer Science (Field), C++ (Programming Language)
Id: 2EWejmkKlxs
Length: 60min 57sec (3657 seconds)
Published: Thu Nov 02 2017