All right, folks, come on in. I wanna give you all a fair warning. You probably want to move
towards the front of the room. I know it's scary up here. I'm not gonna throw anything. But there's gonna be a lot of code. It's gonna be kind of small. Because I'm gonna actually
be coding it live, and so I can't use super
big slide fonts, sorry. But yeah, people way in the back, you're gonna have a hard time. You might wanna move up a little bit. We have enough space here. All right, so my name's Chandler Carruth. I'm here to talk to you about going absolutely nowhere faster, which is probably a little
bit of a confusing title. Sorry for that. But it seemed like a good funny joke. This talk is gonna be
a little bit unusual. So a lot of my optimization talks, I actually try and tell you how to make things go faster. I've given several
optimization talks at CppCon. This one's not going to do that. If that's why you're here, sorry. I'm trying to set expectations, so. I am gonna try and tell
you how to understand why your code is going
fast or isn't going fast. But today we're gonna
focus on when your code is actually going nowhere at all, but either fast or not fast. We're gonna talk about loops. We talked about data structures, talked about a lot of other
things around performance. This time we're gonna talk about loops. There's gonna be a decent
amount of live demo. I'm gonna show you stuff. I'm gonna try and show
you stuff, hopefully. Please don't do anything
particularly mean to the wifi. That would be much appreciated. The other thing I do wanna mention is please be thinking of questions. One of the challenges of this topic is that, and I'm a terrible speaker so I'm gonna give away the punchline on the title slide, there is no one good answer
when it comes to loops. I'm gonna talk about two
specific things, mostly. There's some bonus stuff if we get to it, but I'm not sure we're gonna get to it. I'm gonna talk about two big things today. But there's a laundry list
of things I could talk about. We could keep going for hours and hours. And there's no telling
which one's gonna be the most important for you. I'm focusing on the thing that I think is the most common for people to run into and the thing that I think
is the hardest to understand. That's how I picked my two. But I don't want you to get the idea that, oh, well, those are the two things that go wrong with your loops. Okay, we're done now, cool. It's not quite like that. And just a little bit of
background information in case, how many folks have been to a talk from me at CppCon before? Oh, this is a bad question. How many folks have not
been to a talk from me about optimization before? That's a lot of hands too. Okay, so for reference, I work at Google. I work on Google's C++ team. I'm the tech lead for our C++
programming language platform. And when I'm actually coding, the thing I like to code on is LLVM, which is our preferred
optimizing compiler. I do a lot of work on LLVM and compilers and optimization, cause
I really like performance and I'm totally obsessed with that. And I come to CppCon to get to actually talk to you all about
cool things you can do to make programs run faster. That's really why we're here. So with that, let's get started. Briefly, I have given three
talks at CppCon before. You should go and watch them, not because I'm super proud of them, necessarily, but because they're way more
important than this talk. (laughing) I'm just throwing that out there, these are way more important. You've gotta use efficient algorithms, you've gotta use fast data structures, you have to follow idioms
that make your code fast. You can't just let your
code slide down everywhere. That doesn't mean you
add complexity to do it. There are simple idioms that are fast. And you wanna make sure
that you follow those really, really carefully. Always, always, always
benchmark the code that matters and understand why its
performance matters, right? Understand how it performed, in detail. Otherwise there's no hope. And you gotta use hybrid data structures to optimize your memory
allocations and cache locality. These are way more important, but we gotta talk about
some other stuff, too. So I just have to remind everybody you only care about the performance that you have benchmarked. If you didn't write a
benchmark for your code, you do not care about its performance. That's not an opinion, okay? That's fact. This is how you show you
care about performance. So in every case that I talk about, I'm gonna assume that you
have a good benchmark. And the thing that I
haven't talked about before is what happens when you
have a loop in your benchmark and it's actually this loop that is the critical thing to your performance? You've got good data structures, you've tried to do other things, and you don't understand why this loop is such a bottleneck for your performance. Well, turns out that
it's still probably data. I'd like to say there's
this brilliant trick to loop optimization,
but 99 times out of 100, it's actually still data. If you want to get lots
of high-level ideas about how to fix this, there are a bunch of great talks out there. Mike Acton talked about
data-oriented design. Several other people have talked about how to design for cache locality and try to minimize your
working set of data. I gave a talk about profiling, but I also gave a talk
about hybrid data structures that are really trying to
help you solve this problem. And we're hoping in the future you can use a tool called Efficiency Sanitizer. How many folks here are familiar with one of the sanitizers? Okay, how many folks here are asleep? (laughing) Fair enough. You just had lunch, I understand. But we actually looked at
adding an Efficiency Sanitizer. Instead of looking for correctness bugs, what if you just found really
bad performance patterns? It's not really very fleshed out. If you're interested
in this, it'd be great to help out some. But that's actually something I hope you can use in the future. But let's actually go back
to profiling for a minute, because profiling tends
to work pretty well. But there are some tricks to profiling that I didn't even get into the last time I talked about it that are
very specifically useful when you have a loop and it's looping over lots of data, right, and that data is the source
of your performance problem. All right, let's see if we can switch to live demo here. Nothing could possibly go wrong. All right, so that doesn't look entirely great. That looks better, okay. So let's actually look at a benchmark that specifically is kind of designed around understanding bad data structure or cache locality problems. And writing this benchmark is actually a little bit tricky. So I'm not gonna try
and live code it on you. I think that would just make
all of us really unhappy. So I've gone ahead and I've
written this benchmark up. It's called cache bench, right, I'm very specifically
targeting kind of cache locality benchmarks. How do we understand the performance of your cache and of your working set? And since we're targeting
this kind of data-based performance problem,
we're going to actually parameterize this benchmark
on a particular size, okay? And a size in bytes,
not on how many elements or something. But we're actually gonna go
after very specific byte sizes. So we take in this s.range(0) to kind of grab a size from the benchmarking framework. We shift one by it, and
this just lets us get a power of two. I don't want to type
out really big numbers. I'm bad at it. So we're always gonna use
a power of two bytes, okay? Then we're gonna figure
out how many elements we want in our data structures in order to use roughly this many bytes. We're gonna build two data structures. They're both gonna be vectors. We're gonna put ints in them. And so we can kind of
guess at the amount of elements by dividing the number of bytes by the size of an int and
then dividing that into two, cause we're gonna have two vectors, cool. We fill up the first vector
here with random data. All right, I have my little random number generator class here to make this code fit on a screen. It's super fancy. All it does is produce
something I can iterate in a range-based for loop. I can show it to you if
you're desperate to see it. But it's not really that awesome. But it gives me so many random numbers. That's what we got here. And we fill up our vector with it. And then we go and we actually, the next vector's a little
bit more interesting. So this is called this
indices vector, right? So our indices vector
we're going to fill up with indexes into the other vector, but random indexes. And we'll talk about
why we're gonna do that in just a second. So with this one we go down, we restrict our range for the random numbers to zero to the count, right? We're gonna put a bunch
of stuff into this. And then we're gonna
go into our benchmark, we're gonna keep running the benchmark until we get enough samples to kind of have accurate measurements. We're going to sum things. And what we're going to do, and this is the tricky part here, so we're gonna loop over the indices, these random indices, and then for each of these random indices, we're gonna go load the data out of the first vector at that offset and add it to our sum. What does this mean? It means we have two
access patterns to memory. We have one vector where we're just walking through memory linearly. And half of our working set is doing that. And the other half of our working set is randomly bouncing all over the place. No way to predict where it goes next. So this is designed to
undermine cache locality. This is exactly the kind of
access pattern you'd have, for example, if you have
a lot of linked list data structures and your
allocations are randomized. Then you're gonna be bouncing around all over your data structure. In particular, this happens to be exactly the kind of data structure access you see if you have something
like an unordered map or an unordered set out
of the standard library, because it stores all of its elements behind one layer of indirection, which means you essentially have a vector of random pointers off into memory, and you see access
patterns that look very, very similar to this. This is just a little easier
to code and think about.
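Roughly, the benchmark being described looks like the sketch below, written against Google Benchmark; the helper names, the seed, and the exact size range are assumptions rather than the code on the screen.

```cpp
#include <cstdint>
#include <random>
#include <vector>
#include "benchmark/benchmark.h"

// A sketch of the cache benchmark described above (assumed details, not the
// exact demo code): one vector is walked linearly, the other supplies random
// indices into it, which is what defeats cache locality.
static void BM_CacheBench(benchmark::State& s) {
  // The benchmark argument is an exponent: shift 1 by it to get a
  // power-of-two working-set size in bytes.
  const std::size_t bytes = std::size_t{1} << s.range(0);
  // Two vectors of int share that byte budget, so divide by sizeof(int) and by 2.
  const std::size_t count = bytes / sizeof(int) / 2;

  std::mt19937_64 rng(42);
  std::vector<int> data(count);              // random data, walked linearly
  for (int& d : data) d = static_cast<int>(rng());
  std::vector<int> indices(count);           // random indices into `data`
  for (int& i : indices) i = static_cast<int>(rng() % count);

  for (auto _ : s) {
    std::int64_t sum = 0;
    for (int i : indices)                    // linear walk over `indices`...
      sum += data[i];                        // ...random probe into `data`
    benchmark::DoNotOptimize(sum);           // keep the compiler from deleting the loop
  }
  // Report throughput so the output reads as bytes (and GB) per second.
  s.SetBytesProcessed(static_cast<std::int64_t>(s.iterations()) * bytes);
}
// Exponents 13..24 cover 8 KB up to 16 MB of working set.
BENCHMARK(BM_CacheBench)->DenseRange(13, 24);
BENCHMARK_MAIN();
```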
Everybody happy with my benchmark? Please, if someone is confused, if you don't know what I'm talking about, if none of what's on the screen makes sense, shout, raise your hand, jump up and down, find a microphone. I've gotta explain it, I've gotta actually help everyone understand
it or we're all lost here. All right, so this is my lovely benchmark. Let's run it. That seems like the most
prudent thing to do next. So fortunately, I went and built a whole build system for this particular demo. I didn't wanna sit around here hunting for compiler command line flags. And so I can run this. And when we run this, we're gonna see some lovely numbers
come out and some stuff that I've tried to do
here just to make this a little bit easier for us to think about when we see these numbers. I've tried to really carefully pick the range of values we're benchmarking, and I've tried to print out kind of why those values are
particularly interesting for us to look at, okay? All right, so let's dig
in just a little bit here. All right, so we start off
with eight kilobytes of data. With eight kilobytes of data, we see things are fast. With 16 kilobytes of data,
things are still fast. With 32 kilobytes of data,
things are still fast. With 64 kilobytes of data though, things start to slow down a little bit. We didn't change what we were doing, we just grew the amount of data that we're kind of randomly probing. All I'm doing here is just measuring time, and you can already see this. And then by the time you
get to 128 kilobytes, we're really starting to slow down. We've gone from 18 gigabytes per second of memory accesses, essentially, and sums, to 15 gigabytes per second. That's a pretty substantial slowdown, given the fact that we've just gone from a very small amount of data, 32 kilobytes, it's tiny, right, to 128 kilobytes. This is also still kind of tiny. But it happens that I know something about this processor and this machine. That's gonna become pretty important. So this is an AMD processor. This is a little server that I built recently. And it's running the new AMD EPYC chips. So I don't know if you all heard of them. They're super fancy little chips. But they have a 64 kilobyte L1D cache. And wouldn't you know, the performance starts to taper off right as we get up to 64 kilobytes and we're actually getting cache contention. And because of the way caches work, when you're doing randomized access, you're gonna have
collisions in your cache. So you don't get exactly 64 kilobytes. You get close to that, you start to see performance fall off, you go past it, performance degrades substantially. And so doing nothing
more than measuring time, we can actually observe the cache size. So I think that's pretty cool. Maybe everyone else is
like, this is so boring. But I think this is awesome. So then let's look further down here, because we got more numbers here. So if you keep going down, we see we got 128 kilobytes. We've got 256 kilobytes. The performance is kind of degrading, but there's not really
another kind of bend in the curve here. We get to 512 kilobytes, still going down, not a huge bend, but it starts, it kind of knocks off a
bit after 512 kilobytes, which I found really interesting. Dropped down to 10
gigabytes per second there. It's not a huge drop, but
it's not trivial either. And then it starts to drop,
drops faster and faster until we get to eight megabytes. And then from eight
megabytes to 16 megabytes, the performance just
craters, absolutely craters. So does anyone have a
guess about the other two cache sizes of my processor? (laughing) Anyone? Anyone? Come on. No, no, come on. Yes? (person talking quietly) 16 megabytes, I heard. (person talking quietly) Eight, why eight? Why do you think it's eight? Look at the jump between four
megabytes and eight megabytes. It's not substantially different between two megabytes and four megabytes. And we shouldn't expect to be able to fully use an eight-megabyte cache. Cause there's gonna be
collisions in that cache. It's when we hit 16 megabytes that we see a precipitous drop off. And that likely means we're missing cache. And it's really interesting,
this one's precipitous. And it makes sense that one of these is way more dramatic
than all of the others, cause that's the last level cache. When we miss here, we hit memory. When we miss here, we take orders of magnitude more time. And so we should expect
this precipitous drop off, and we should expect it
around the actual size of the cache, not before it, not after it, right at the size of the
cache, when it's full, not when it's too full. And then it stays down. But there's another cache layer in here. This processor has L1 cache, has L2 cache, has L3 cache. L1 cache is 64 kilobytes. L3 is 16 megabytes. Any guesses what L2 is? How much? (person talking quietly) One megabyte. Any other guesses? 512 kilobytes. Someone's looking up on Wikipedia. It is 512 kilobytes. It's exactly 512 kilobytes. And this one I don't understand. I don't know how it actually has this good a performance at
512 kilobytes of working set. So that's a great question. We should figure out why
this has good performance. Okay, so we can actually do something a lot more interesting than just running this benchmark and looking at timing. It's cool that I can build a benchmark to show me the cache and how it behaves. But I probably am not trying to fix the performance of that benchmark. That's not likely an exciting prospect. I built it to perform badly. What I actually wanna be able to do is I wanna be able to recognize when a real program is doing that and I didn't intend for
it to be doing that. So we need a better measurement tool. So the first thing we need to do here is we need to come up with a
way of getting better data. So the benchmark framework I'm using has some cool flags here. You can filter down to
one benchmark measurement, and you could also ask it to run that benchmark measurement for more time. And this gives you a lot more data. So I've asked it to run
for probably too much time. Okay, that's just silly. Let's run it for two seconds here. And this'll give us pretty stable numbers. And so this is right at 64 kilobytes. So I probably want to
back off one step here so we can look at one step before this. So this is in cache. We expect this to be super fast. This is L1 cache. Cool, it is. Now, how do we actually tell that this is staying in L1 cache? Well, that's where our
performance monitoring tools come in, perf. Last time I talked about perf, I showed perf stat. I said, this is really cool, but we're not gonna talk
about it a whole lot and ran off to do profiling. Today, we're only gonna
talk about perf stat, because there's no
interesting source code here. What we're trying to understand is what the processor
and the machine is doing. So I run perf stat. I get a bunch of data here. But nothing looks bad. I have very few stalled cycles. My processor's running fast. I don't have any context switches. This looks great. This is about as healthy as you could get a benchmark to run. But as soon as I go up to 64K, I start to see a different result. All of a sudden, I have this
25% of my cycles stalled, right there. So that's your first signal that you have some kind of memory bandwidth problem. It might be something else, though. There are other things that
cause front end stalls. So how do we drill down even further? Let's look for a better tool. If you run the perf list command, it actually will list
out all of the awesome things perf can measure. And it can measure
amazing, amazing things. And right at the top
is this beautiful list of things, L1D cache load misses. That looks terribly promising. Okay, we gotta try that one. All right, so if you
wanna measure, ah, no. So if you wanna actually
measure a specific counter, you have to tell perf to do that instead of its typical counters. So we wanna measure L1D cache loads. Well, actually, we don't
know that it's D cache. It might be I cache. And you can actually
measure more than one thing at a time, so you can say show me the loads. Also show me the load
misses, probably good. Show me the D cache loads and the D cache load misses. Did I manage to typo any of that? I don't think so. We'll see. And this, look at that. Okay, sweet. So we run this, we get
some information here about the fraction of L1D cache queries were cache hits. Sorry, sorry, were cache misses. So 0.04%. Now, the thing that's just amazing to me is if we back this up one step, if we go back one step to the one that was super fast, you're just gonna be just
staggered by this, I think. So the difference between four or five percent front end stalls and 25%
front end stalled cycles was a difference of 0.01% L1 cache misses to 0.04% L1 cache misses. That's why everyone is so obsessed about cache locality. We regressed cache locality by 0.03%, and we regressed performance by like 20%. That's nuts. This is crazy talk, right? Well, that's processors for you. And this works all, like we
can keep going down here. So if we hop down a bit, I don't remember the exact one. You're gonna see this
number just keep growing. So here at one megabyte, we
have 12% L1D cache misses. The L2 isn't terribly interesting, so I'm just jumping ahead
on you people, sorry. If we go a little bit bigger,
we're gonna see even more. So at eight megabytes, we
have 1.6% L1D cache misses, 1.6% L1D cache misses. So we have 100 times
the D cache miss rate. Because we have preposterously
more working set. But now you know how to actually tell am I blowing up my cache? I would love to show you all of the other hardware performance counters that measure different parts of the cache. Unfortunately, most of them don't work. And that's a little concerning to me, cause I work on performance stuff, but hardware performance
counters are actually still really finicky
and hard to get to work. And so this is a very new processor. The Linux kernel and the profiling tools don't appear to have caught up. So the L1 is the best
one I can probe here. Any questions about
probing for cache misses before I move on? No, maybe? Everybody happy? Everybody's just like, I don't even know what you're doing anymore. (laughing) You're playing with these
toys that don't make sense. All right, let's briefly return to our presentation here, see what else we wanna talk about. Yes? (person talking quietly) Is there something to say
about the DoNotOptimize function call in my code? That is the benchmark library's beautiful packaging of that
horrible, horrible trick I showed you two years
ago to get the compiler to not optimize away your benchmark code. And you can thank Eric
Fiselier, who's over here. He added it to the benchmark library at about the same time as I was presenting the horrible stuff. But now I get a nice API. But you can go look at
the benchmark library. It's Google Benchmark,
google/benchmark on GitHub, and it has
documentation for the function.
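As a quick illustration (a toy sketch, not the benchmark from the demo), benchmark::DoNotOptimize is used like this:

```cpp
#include "benchmark/benchmark.h"

// Toy usage sketch of benchmark::DoNotOptimize from google/benchmark: it makes
// the compiler treat the value as used, so the work being measured can't
// simply be optimized away.
static void BM_Example(benchmark::State& s) {
  int x = 42;
  for (auto _ : s) {
    int y = x * 2 + 1;             // the work we actually want to time
    benchmark::DoNotOptimize(y);   // force the result to be considered live
  }
}
BENCHMARK(BM_Example);
```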
All right, so that's great. This is all great. I've shown you one clever trick that'll let you debug all of your cache locality problems. Well, not really. It'll tell you that you have
cache locality problems. And then you have to
go do a bunch of work. And we've had a lot of great talks about how to actually fix those. I don't wanna repeat that. So now I'm gonna jump to
the other topic for today. The other topic is, I think, the hardest single topic to really
internalize in your brain about optimizing loops, okay? And this one applies to tiny, microscopic, tiny little loops that
you sometimes end up with. And we still need to make those fast. And we're just gonna go
right back to live demo. I hate slides. Live demo's better. All right, let's take a look at a tiny little loop. So this is my least
favorite loop in the world. I hate this loop. I'm not actually being facetious. This loop has caused me more pain than any other loop, really. This is kind of a silly version of it. I've really simplified it a bit. This is what's called the clamp loop, where you run over a big pile of integers and we see if any of those integers are bigger than some upper bound, and if they are bigger, we clamp it to that upper bound. This loop, in various forms, not really this form,
but in various forms, this loop shows up in
unimportant code like every compression algorithm based on zlib. And every decompression
algorithm based on zlib. It also shows up in most
image compression algorithms. This thing is everywhere. You'd be just amazed
at how much of your CPU on your phone and your laptop, how much of your CPU
is dedicated to running this annoying little loop. Now, I've also turned
off a bunch of fancy, fancy loop optimizations
that the compiler does, because I actually wanna look at the simple version of this loop. And in a bunch of cases, we actually see the simple version in real code because in a bunch of cases, we don't see one loop on its own in the middle of a micro-benchmark designed to test it. And so much like we have to turn off the optimizer for micro-benchmarks to keep them from deleting my code, I also kind of need to
dial them down a little bit so that I can understand what the behavior of this loop is without
it being vectorized and unrolled and peeled and turned into some kind of monstrosity. All right, so I've turned all that off, just to simplify things. And I've got my little benchmark.
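Roughly, the clamp benchmark looks like the sketch below; the bound, sizes, and reporting details are assumptions rather than the slide's exact code, and the demo additionally disables the loop vectorizer and unroller via compiler flags, which isn't shown here.

```cpp
#include <climits>
#include <cstdint>
#include <random>
#include <vector>
#include "benchmark/benchmark.h"

// A sketch of the clamp benchmark (assumed constants): random ints in
// [0, INT_MAX], so nearly every element is at or above the bound and the
// clamping store happens on almost every iteration.
static void BM_Clamp(benchmark::State& s) {
  const std::size_t count = static_cast<std::size_t>(s.range(0));
  std::mt19937_64 rng(42);
  std::uniform_int_distribution<int> dist(0, INT_MAX);
  std::vector<int> data(count);
  for (int& d : data) d = dist(rng);

  for (auto _ : s) {
    for (int& d : data)
      if (d >= 256)              // the clamp: compare against an upper bound...
        d = 256;                 // ...and clamp when it's reached
    benchmark::ClobberMemory();  // make the stores observable to the benchmark
  }
  s.SetItemsProcessed(static_cast<std::int64_t>(s.iterations()) * count);
}
BENCHMARK(BM_Clamp)->Arg(64 << 10);
```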
Any questions about this benchmark? If anyone doesn't understand it, I'm happy to clarify. Otherwise, I just want to get on to showing you the craziness here. All right, we're gonna see if we can do this. So, again, I've got a build system. I'm up to date, excellent. So let's try running my clamp benchmark. I just gave it some random sizes. It doesn't really matter. Right, it's running, it's doing 1.4 billion items per second. It's running pretty fast. This is solid. This is a four-byte item, so it's actually churning through a
reasonable amount of data. This looks great, this looks fantastic. But we should actually take a look at what it's doing and see
if it's doing something that makes sense to us. And we looked at how to do that last time, so let's just quickly do something like this. So I'm gonna run it for a little while to make sure I get enough sample data to have something useful. Recording it, recording it. I told you it was live. Okay, so this is exactly
the profiling steps that I showed you last time I was up here doing live demos with profiling. We wanna annotate this
thing and look at the code. Okay, so here's the code. So this is the loop. Hopefully folks can read this. We jump back up to the top,
kind of compare these things, we do this fancy cmovge thing. All right, so let's figure out what this actually means. Step one, move. This is loading the
data we're gonna clamp. Step two, comparing it with 0x100 in hex. This is checking to see whether we need to clamp it at all. Then we do this cmovge thing. This is a special X86 instruction. It conditionally moves a
value between registers. And so that GE is a condition predicate. If that comparison found
that the value we had was greater than or equal to the thing we compared it with, we wanna replace it, we wanna clamp it. And it happens that we've actually put the replacement into a register already, so we have that in ebx over here, and we move it over top of it. Then we store the result back into memory, so we actually have done our full clamp. We add four to our pointer, because we've gotta go to the next entry. Then we check to see if we're done. We're not done, we branch back up. Make sense? That's the loop. Everybody understand the clamp loop? It's super simple, right? Everybody's like, this is super simple. Okay, so X86 has this awesome
instruction called cmovge. It seems designed to
implement this clamp loop. Aren't you glad the compiler took care of generating this magical code for you? Well, see, I work on compilers. And I'm suspicious. And so I'm like, I don't know. Last time I got up here
I invented a faster mod. We can do something here, okay. So maybe we're crazy and we think we can totally do better than this. This is totally within our power. So let's see if we can get at this thing. So I went ahead and compiled this thing down so
I could get the assembly, because I don't really
like staring at object code or looking at build systems all that much. So this should give me the assembly. All right, so here's my loop. You can see the cmovge. There's a bunch of other stuff here. Actually, I can clear the
other stuff we don't need. We don't need debug information. We're good without debug information. Where did you go? Yeah, we don't need this stuff. I'll make it easier to read. If I can type. Okay, cool. So there's my cmovge. This looks like the code we just saw. This looks the same. I mean, it's using a decimal number. I don't know why it's
using a decimal number. But it looks the same, right? Okay, so if I'm sitting
here thinking, look, I'm way more clever than my compiler is. I don't need this cmov thing. I think that's kinda lame. Because maybe I'm thinking, this cmov thing seems kinda clever, but what if we just didn't
do that move at all? What we could do here
is we could just branch, we could branch to something. But I need something to branch to, so I have to make up a name down here. But we could branch to, I don't know, seems fine. All right. So do you all see what I'm doing here? So we're still doing the comparison, but we're jumping if not
greater than or equal to. If it's not greater than or equal to, we're gonna jump over it. And if it is greater than or equal to, we're gonna have to clamp it down. Is that in ebx, I think? Have I made any bugs here? I think we're good. Yeah, so if it's greater than or equal to, we're gonna go and we're gonna
store our constant into it. If it's not greater than or equal to, it's already below our clamp. We're just gonna skip over the store. How many folks here think
this is gonna be faster? (person talking quietly) Depends on my data, depends on my data. So do we not remember what the data is? Okay, here, let me go look at my data. That's a great question. So if you look up here, it's
kinda hard to read, hold on. So this should be easier to read. So I'm picking random numbers between zero and int max. So there're not gonna be a
whole lot of them below 256. So anyone think this is gonna be faster? Why do you think this is gonna be faster? (person talking quietly) Branch prediction, why? (person talking quietly) It's all the same branch. But you still have an extra branch. (person talking quietly) One more time? (person talking quietly) Branch prediction? (person talking quietly) So branch prediction
guesses the jump forwards. Not true in modern
processors, it turns out. Used to be true. The P6 had a static forward/backward predictor, but modern processors don't anymore. Any other theories? So for the record, my theory was this is not gonna be faster. We had one branch, now we have two. They're super close together. This isn't good for
performance, I was thinking. This is a terrible idea,
because we're never actually gonna take the branch. We're just gonna consider
taking the branch, take up an entry in our branch predictor for no good reason,
we're still gonna store the same number of times to memory. This is the perfect case for a cmov. It's a cmov that actually always happens. That's the only thing
that made sense to me. I was actually pretty convinced that this was a bad idea. I was so convinced I actually spent a bunch of time making
LLVM produce that cmov. (laughing) It's a true story, I'm
not making that up, okay? I really cared about this. I thought this was a bad way to go. But we should find out whether this is actually a bad idea or not. What do I wanna do? None of that, none of that,
none of that, come on. There we go, I wanted
it to look like that. Let's make myself a fancy clamp. Oh, goodness, I'm not
waiting five seconds. I'm an impatient soul. All right, so uh oh, uh oh, hold on. I don't remember what the other one was. This is the problem with doing it live. So we have our fancy clamp, and we have our not fancy clamp. And it turns out that cmov is faster. So the really funny thing is that the cmov wasn't faster earlier today. (laughing) (applause) Which is rather surprising. And now of course you all know the gag I was trying to play on you and failing just really miserably. Did I get this wrong? Did you people let me down and let me program this wrong? Let's do it the other way,
cause that's how I did it when I tested this. I think that's right. (person talking quietly) Oh, ebx is set way, way, way, way, way, oh, but I bet I know what's going on here. See, when I did this,
I was kind of annoyed at that ebx anyways, and so I did this, cause that seems obvious. Oh, I have to rebuild it. So somewhat hopefully this
will reproduce results. There we go, cool. (laughing and applause) In case you were wondering
whether this was live. Okay, so lo and behold, a
branch can be faster than cmov. And I have to say, I did
not expect that at all. It's not a lot faster, but it's faster. And I wanted to understand why, not just in this case, but
actually we see this a lot. We actually see it a bunch of places where you would kind of
expect cmov to be better, and it doesn't end up
being a significant win.
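At the source level, the two machine-code shapes correspond roughly to two spellings of the clamp body, sketched below; note that the optimizer is free to canonicalize them, which is exactly why the demo edits the generated assembly by hand rather than the C++.

```cpp
// Two source-level spellings of the same clamp (a sketch, not the demo code).
// The first writes back unconditionally, a natural fit for cmovge plus an
// unconditional store; the second only stores when it clamps, a natural fit
// for a compare-and-branch around the store. Which machine code you actually
// get depends on the compiler, its version, and the flags.
void clamp_select(int* data, long count, int bound) {
  for (long i = 0; i < count; ++i)
    data[i] = data[i] >= bound ? bound : data[i];
}

void clamp_branch(int* data, long count, int bound) {
  for (long i = 0; i < count; ++i)
    if (data[i] >= bound)
      data[i] = bound;
}
```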
And it's kind of interesting to try and figure out why that's happening. But to explain that, I actually, I can't use a tool to really explain that. It turns out that this is really hard to explain. We're also running a
little bit short on time. So I want to prepare all of you. This is going to be a bit of crazy. So we're gonna go to slides here, because I can't do anything but slides. I'm gonna try and relay to you a talk that one of the really senior engineers who worked on LLVM for a lot of years gave as a keynote for the
LLVM developers meeting a bunch of years back, back in 2013. He used a different example than clamp, and there's a good reason
to use this example. Branches make thinking
about this even harder, and this, it turns out, remember what I said at the beginning, this is one of the hardest to think about problems we have with loop performance. So we're gonna simplify this as something that doesn't have branches. We're gonna look at a
dot product, all right? And this is exactly the
code you would write to compute a dot product. There's nothing fancy going on here.
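The slide code isn't reproduced in this transcript, but the loop in question is essentially the following sketch (the exact spelling on the slide may differ):

```cpp
// A minimal sketch of the dot-product loop being described.
int dot_product(const int* a, const int* b, long count) {
  int sum = 0;
  for (long i = 0; i < count; ++i)
    sum += a[i] * b[i];  // per iteration: two loads, a multiply, an accumulate,
                         // an index increment, and a compare-and-branch
  return sum;
}
```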
And if you compile this code, that inner loop is gonna turn into this
X86 instruction sequence. It's just gonna turn directly into that. And I promise, I actually did compile it. I'm not making this up. All right. But this is not what the
processor is going to execute. The X86 processor doesn't actually execute X86 instructions. X86 instructions are very high level. The X86 processor executes
what's called microcode. It decodes the X86
instructions into smaller, more precise operations. And we can also kind of
use a better notation, which actually shows you
where the destination is
and separates an input that might also happen
to be a destination. And so I'd like to use the
code on the right here, not the code on the left. The right is kind of
what X86's micro ops are. Let's step through this. So we start off with a load. A load is actually just a
direct mapping into microcode. No problem there. It's just one micro op, sorry,
not microcode, micro op. But this imull has two things going on. It's loading, but it's
also doing a multiply. And so it's doing those
as two separate micro ops. The first one is just gonna load. And then the second one
is gonna do the multiply. Then you have an add. That translates very naturally. You have another add for
kind of the array index. Then you have a comparison. And you shake your head,
because I've been lying to you this entire time. Sometimes micro ops are
actually bigger than the X86 instruction. We have this thing
called macro-op fusion in every modern X86 processor, and it says, well, I see
you're doing a comparison and then you're branching on the result. I'm just gonna bake in an operation to do exactly that. And so we get this as a
single operation in X86 when we're actually executing it. Make sense? Everybody happy with my
little code sample here? Wonderful, okay. So there's one other thing
the processor knows about this that's really important. It knows that this branch
is going to be taken. It is confident this
branch is gonna be taken. Of course it is, we're looping. It's taken every time
until we stop caring. And so the branch is gonna
be predicted as taken. Now, because this branch
is predicted as taken, there's this fall-through path that's actually not
interesting to the processor. And so notionally, we can
think about this inverted. We can think about it
the other way around, where we actually continue executing the loop body and we branch
out of this loop if we're done. Okay? And that inversion is really key, because now we're gonna
kind of notionally, this doesn't actually happen, but think about it as if you unrolled an iteration of that loop. So you have a second
iteration of the same loop. And then another iteration after that. You can kind of stack these iterations up. It turns out this is exactly
what the processor does, but not when it's executing it. It stacks up the iterations for the result of each computation. It has what's called a reorder buffer that looks pretty much like this and captures the result
of every computation and can capture several
iterations' worth of computation, not just one. Now, that's hard because
we've got these registers in here that are reused
over and over again. And so the processor also
has an amazing feature called register renaming. Modern processors don't have the number of physical registers that
you can write in assembly. They have many times that
number of physical registers. And so when you write percent ecx here, it's actually gonna go and rename that to something else. And so we can kind of
rename this entire set to have unique names
so that we can execute any of these and capture
any of their results, provided the dependencies are satisfied. Make some sense? So this is how the
processor wants to think about finishing the
execution of this loop. It thinks about finishing
multiple iterations concurrently, all right? It's called speculative execution. It's incredibly powerful. Let's see exactly how powerful. To understand how this executes, you have to think about
the parts of the CPU that are doing the execution. These are the functional units, or the execution units
inside the processor. We have a load unit that's
loading from memory. We have arithmetic units
that are doing things like addition, multiplication. And we might have multiple
units in the processor. Most modern processors, ARM, X86, most modern processors have multiple units that can all run at the same time. So I'm gonna kind of use
a hypothetical processor with one load unit and two
arithmetic units here, okay? And we're gonna kind of step through, cycle by cycle, how this
executes on the processor or how it could execute on the processor. First, we have kind of
a simple chain of events where we have no really
interesting choices. We need to load two values. There's one unit that can do that, so we have to go there. We're loading from memory, and so there's latency involved in that. And so the next operation
can't execute immediately. But at some point, the
results become available and we can do the multiply. And then at some point
the multiply results become available, and
we can do this accumulate. And multiply's a
relatively slow operation, so it might take a few cycles. That's what this is trying to represent. Don't get the idea that this
is precisely accurate for X86. This is just illustrative. Okay, and then there are
two other instructions that execute in that first iteration, but they're actually
completely independent once you do register renaming. We can execute the increment
on that array index right away because we can
just remember the old value and use it if we need it. And so here we're seeing
out of order execution and speculative execution
from the processor. Sorry, sorry, not speculative,
just out of order. So this is great, right? This means we've already handled those two instructions. But what happens next? We have this branch, but we predicted that it would be taken. And so the processor just assumes that this is gonna be taken and continues executing this stream of instructions. And so it starts off
on the next iteration, which it can pack in very densely because there's a lot of remaining space. So it can go ahead and
execute the next increment. It can execute the next increment before any of the loads
have finished, okay? Before any of the loads have finished. And then it can actually do the next test. And because it's already
doing the next test, it doesn't have to work very hard to actually continue speculating, past this iteration into
the third iteration. It's going to finish the third increment before the first load
becomes available, okay? But look at what we've done here. Look at what we've done here. The latency to get from
the very first operation to the actual result, the ax1 there, was eight cycles. The latency to get to the next
end result was two cycles. This is how
you get nowhere faster. This is actually probably
the most important thing in modern CPUs for performance. It's just unbelievable what this does to your execution performance. And just to be fair, let's
roll this whole thing forward until we've actually fully
packed the execution units. And you can see we can't actually quite fully pack these execution units. We had one stall cycle here because we can't compute flag six. We can't compute the
comparison for flag six on the sixth iteration
because the increment to the index hasn't finished yet. And we can't quite do the
increment to the index earlier because the previous
increment to the index hadn't finished yet, because we're on the sixth iteration. And look at where the
sixth iteration starts. The same cycle as the
first iteration's result. Okay, so we have over six iterations of this loop executing in parallel on a single processor. All the way around the back edge. How many folks here realized that this is what executing this loop would look like in the processor? You guys pay more attention
than I thought, good. So this is what makes it really hard to reason about performance. Now, let's think about
something else with this. We have this cmov. Why is cmov weird? Because we can't predict cmov. We simply have to evaluate both inputs. See, we didn't wait to find
out anything about the flag. When we were scheduling
this, we didn't have to wait for flag zero to finish resolving to start the next operation. We just keep going, right? But with cmov, you can't do that. You can't speculate through cmov in the same way you can through a branch. And this is why branches,
in a bunch of cases end up faster than cmov. Making some sense? Looks like I have a question. Or you can speak up, I'll
try and listen better, sorry. (person talking quietly) So the question is, why doesn't cmov get microcode translated
into the right thing? Because that's hard. That's a very expensive
transformation to do in a very constrained
part of the processor. Knowing when it's safe to do that, when it's correct to do
that is very, very hard. And there are cases where
cmov is absolutely correct. There are cases where there's no advantage to speculative execution,
all of the inputs are going to be ready long before the cmov has to be executed, and where actually doing the speculation might be hard. And so how does the processor know? And the answer is it doesn't. And so sadly, this falls
on us compiler folks. All right. So this seems really complicated. I would like tools to help me reason about this kind of pipelining. And I showed you kind of
a fictitious processor, but it would be nice if
there were better tools. I was gonna try and live demo
a bunch more tools for you, but unfortunately we only have enough time for a Q and A session. And so instead, I'm
gonna mention the tool, and I'm happy to show
you guys some other time. It's called the Intel
Architecture Code Analyzer. It's an amazing, awesome tool. It's just a piece of magic. You should all play with it. It's gonna give you the exact graph I did, except it'll be real for
the code you give it. It's just amazing. I could try and do it,
but it would take me like way more than the
amount of time we have here. But let's do the other live demo, which I think is faster. And then folks, get
ready for your questions cause we only have a few minutes. So one thing that I think
is useful to do here is actually to look at what
the performance counters tell you when you run these two different versions of the benchmark. Because the performance counters actually give you some good clues
as to what's going on here. So if we look at the
original one with cmov in it, one thing that we're going to notice is we have a tremendous number
of backend stalls here. And that's not a good sign. We are actually struggling
to execute that clamp loop. But this is also probably
coming from storing to memory, and sadly, we are storing to
memory on every iteration. And if I switch over
to my fancy clamp loop, we're gonna see a pretty
substantially different profile. So it finishes faster, but now you see something really confusing. Even more backend stalls. And this should confuse you, right? Like why did it get worse? But think about what we just showed. The processor is going
to speculate that branch. And it's going to have a store into memory ready on every single cycle. All this benchmark is
doing is storing to memory. It should be stalled every cycle, trying to store to memory. The fact that the other one wasn't is because it was doing something else. It was waiting on cmov. And I mentioned this
was hard to tease out. It looks very counterintuitive. But this is the reality of X86 processors and most modern CPUs. All right, with that we
should wrap it up here. Let you guys ask me the
really hard questions that you've been holding back on, cause I know you have. It's good, I would too. Like I said, hopefully
this lets you understand why your loops are going fast or not, lets you reason about at least two of the interesting cases. There are a million more. Unfortunately, I can't stand up here and monopolize the entire conference. So I'll have to come back and talk about more fun versions of this next time. All right, so questions? I don't know which side was up first. Let's go over here. (applause) - [Man] Thank you, great demo. And you showed it on an AMD processor, right? Did you see any difference
on Intel processors? I will freely admit, I haven't tried this on Intel processors very recently. However, in the past, I have not seen substantive differences. Most modern processors
are very similar here. You'll find all of what
I'm talking about here generally holds for modern desktop or server-oriented and even the high end of mobile-oriented Intel processors, AMD processors, ARM processors,
and PowerPC processors. It's really ubiquitous. This is not actually specific
to one architecture at all. For example, the
out-of-order execution talk, the original version of that was given talking about an ARM processor,
not an X86 processor. But it translates perfectly. The processors are much more similar than you might imagine. - [Audience Member] Working through SG14 and EWG is a proposal for
likely/unlikely attributes to standardize things
like __builtin_expect. We actually noticed the same thing while validating the use of that: when you put likely/unlikely on, it turns the cmov into the jump. Yes. - [Audience Member] And
we saw it got better around one tenth or a hundredth
of a percent of the time, it would sharply drop off. Do you have any thoughts on
using notations like that and a good way of dealing with them? I love them, they're great. I have one problem with them. Sometimes they get old and
out of date and incorrect. What I would really, really, really like, and sadly I've never had the
opportunity to go and build, is a tool that takes an
actual profile of the binary and checks to see, so, did this profile and your static hints actually agree at all? Were they even close? Were they actually contradicting? And actually tells you, the programmer, hey, maybe you should go check these two annotations over here because when you ran your binary, those annotations
weren't just unimportant. They were dead wrong. But I don't have that tool yet. And so the only hesitation I have is that if you're not careful,
they can get out of date. Otherwise, I love them. Oh, one more thing about them. Those are especially great when you have some fundamental reason for them to be correct. Best example in the world, vectors grow during push_back, right? We actually have an algorithmic guarantee that the vector-growth path in push_back is cold, because it's amortized away. That kind of guarantee, fantastic.
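For reference, the kind of annotation being discussed looks roughly like the sketch below, using the GCC/Clang __builtin_expect intrinsic on the clamp branch; C++20 later standardized the same idea as the [[likely]] and [[unlikely]] attributes. The constants and the direction of the hint are assumptions based on the demo's random data.

```cpp
// Sketch of a static branch hint (an illustration, not code from the talk).
// For the random data in the demo, clamping is overwhelmingly the common case,
// so we hint that the condition is expected to be true. If the data changes
// and the hint goes stale, codegen can get worse rather than better.
void clamp_hinted(int* data, long count, int bound) {
  for (long i = 0; i < count; ++i)
    if (__builtin_expect(data[i] >= bound, 1))
      data[i] = bound;
}
```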
- [Man] Hey, Chandler. You actually showed us live that comparing to a literal is faster than comparing to a register, which seemed, well, quite unnatural to me. Comparing? - [Man] Yeah, the cmp and the
move, no, the move, sorry. In the move, yeah. - [Man] Yeah, it seems strange to me that moving a literal
is faster than moving from another register where
you already have the value. Yeah, I don't know. (laughing) I mean, I can sit up here and try to figure it
out, but I don't think I'm gonna succeed. It might be entertaining, but I don't know that I'm
gonna have an answer. My best guess is actually that it may have gotten in the
way of register renaming. It does mean we have another
register live, right? I don't know why that would be the case. It seems very strange
for that to be the case. This loop doesn't seem complicated for the processor to handle. But I can't argue with your conclusion. - [Man] All right, thank you. - [Audience Member] Hey, great talk. So I had a question about the last loop that was faster but it had more stalls. Yes. - [Audience Member] So you
said that it's stalling because we are storing to memory, right? So what about the store buffer? How does it... So the store buffer, first off, for folks who don't understand, there's a buffer inside the processor that kind of takes stores before they actually arrive at memory. That's true. However, we're also reading from memory. And we're not reading from
the same memory location. And unfortunately, X86 makes guarantees about memory orderings. And so we're not waiting on the store, the store buffer can't help us here when we actually need to let the store flush all the way out. - [Audience Member] Okay, thanks. - [Man] I actually got up here to ask that same question
about ebx and the literal. I mean, also I should
be clear, I'm guessing. Again, I can sit up here
and try and prove it. If we had more time, I would, cause that's actually really interesting, to try and understand
whether it's stalling because of synchronization,
like establishing the TSO, because it does have to establish TSO for these stores, sadly. And we could do something to try and actually see that. For example, we could switch them to non-temporal stores. Because we happen to know we're not going to read them again, and see if that changes the
behavior of the processor. And it might. It might actually make a big difference. - [Man] So instead of asking that, I'll piggyback on Paul Hansom's question about the __builtin_expect annotations. And I think a couple of the HFT talks here mentioned that sometimes you don't expect something to happen but you still need it to be fast. Got any opinions on that? No, it's absolutely relevant. We see this actually a lot with more genuine profile-based optimizations because there're kind
of very boring reasons why a profile is sometimes wrong. However, there is some good news here. The compiler doesn't just go home when it sees something is cold. It tries to make the path that isn't cold as fast as possible. And it will make the cold path slower when doing so makes the other one faster. But it won't insert
arbitrary slowness there. And so I would not expect
this to be that bad. I mean, it could be, but
that seems kind of weird. The way we actually talk about it inside of the compiler,
inside of the optimizer is whenever we have a choice between making one side faster or
making both sides faster, we always prioritize. But it's about priority. It's not really about pessimizing. I don't know if that really helps though. - [Audience Member] Given that cmov doesn't seem to do speculative execution... Yes. - [Audience Member] In the loops, what are the cases where
it makes sense to use cmov? So it makes a lot of sense to use cmov when your dependencies are already going to be required. That's the classic way. In particular, for example, if you have to materialize the values
to compute the condition, it can just not matter. And so then you get rid of a branch, branch predictor gets one less branch it has to keep track of. It can be a win. But I'm sympathetic. I've never met a cmov I liked. I just, it hasn't happened to me yet. I don't know why. But there is a theory
behind the instruction.
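A classic shape where that reasoning applies is sketched below (an illustration, not code from the talk): both operands are needed anyway, so there is nothing useful to speculate past, and a select spares the branch predictor an entry.

```cpp
// Sketch: both a and b must be computed regardless of the comparison, so a
// cmov-style select costs little and removes a branch; compilers commonly
// emit cmp + cmov for code like this on x86, though they aren't required to.
int min_of(int a, int b) {
  return a < b ? a : b;
}
```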
- [Man] On the first live, when you were populating the micro ops into your pretend processor, you did a load on, I think, two, and then you didn't use
the result until six. I was surprised to see that you put another load right on three. I was thinking, are the loads sort of fire and forget on a real processor? You can just load, load, load, load, load, and not worry about how long
it's gonna take to come back? It depends. But yes, in many cases
there is overlap there. There are two sources of delay. One is figuring out what to load. The unit's busy, it can't take more. But there is also just waiting for the cache and memory
subsystem to produce the value. That's writing into a register file. The unit's not really involved in that. You're just waiting for a
dependency to be satisfied, and then you keep executing. - [Man] So you definitely
don't have to wait until six to put another load line in. Almost certainly not. - [Man] Okay, thank you. Almost certainly not. - [Audience Member] In this first case, in the cmov, have you
tried profile-guided optimization? Did it help? Yeah, LLVM kills the cmov
and switches to a jump if you give it a profile. - [Audience Member] Oh, okay. - [Man] In my amateurish view of things, isn't it clear that when you
store the literal directly, it's faster than doing this because it doesn't have to, it already knows, when it decodes the instruction, it knows what the literal is. It knows what it has to store. So it doesn't have to
wait for the register to actually have the value. I would not expect the
processor to work that way. I would actually expect that the wires that do the store come
from a register file entry. - [Man] Okay. But I'm also not a hardware architect. I tried to give you enough understanding of the hardware to work with it. But I don't build processors. You should track down someone who does. They might be able to give you a much more detailed answer. I know there's one person here who might have a clue,
from Intel, at least. - [Man] Okay, thanks. - [Audience Member] So
my question is about going nowhere faster. So if you're gonna
speculate the instructions, and let's say the
processor's gonna speculate doing some loads out of the bounds of the array. How does it work in the hardware that it doesn't crash? I don't even know. Magic and unicorns? Anyways, sadly, the session's over. I'm happy to try and find folks and answer more questions later. Thank you all. (applause)
Slides: http://chandlerc.github.io/talks/cppcon2017/going_nowhere_faster.html
I bet "jl .myfuncLabel; mov $255, (%rax)" works faster than "cmovge %ebx,%edx; mov %edx, (%rax)" simply because latter uses two extra registers (ebx/edx) with dependency between them. I.e. half of this (decent) presentation is about a problem in optimizer.
Does anyone have an answer to the question asked at the very end about how the processor avoids essentially invalid code?
I installed the YouTube addon in Kodi yesterday, so I could spend my evening watching Chandler Carruth talks.