Branchless Programming in C++ - Fedor Pikus - CppCon 2021

Captions
This one in particular is at a beautiful venue, and pretty much everybody is staying in the hotel, so after the sessions people go out, get a beer, and continue the conversation. It's just a great way to experience the conference.

Please come on in and take seats. For a talk on performance, the closer you sit, the better the performance.

Performance: the observation itself, the data, is what it is, but what you do with it is subjective; it depends on your point of view. To demonstrate the importance of the point of view, I need two volunteers from the audience. Two people, please come on up: one can stand here, and someone else on the other side; just face each other. Do you see the arrow? Which way is it pointing? Okay, there is some disagreement on this subject. So that everybody sees what they're seeing, I'll also show it on screen. Does everybody see the arrow? Which way is it pointing? All right, now you watch this one and everybody else watch the screen: I'm going to flip the arrow. Which way now? So you see the difference the point of view makes. All right, now we can start the talk.

Well, thank you for coming. Let's talk about performance in C++. If you program in C++, you're probably interested in performance; that's usually one of the reasons. This talk is about a specific area of performance optimization. If you want more information about performance in C++ from me, I have these two books. The one on the right, The Art of Writing Efficient Programs, just came out, and the publisher sent me a discount code for conference participants, so I'll share that.

Let's go straight into the talk. What's our plan for the next hour? We're going to talk about what allows good performance, in particular about using the hardware capabilities efficiently, because the better you use the hardware resources, the more computing you can do with them. We'll focus on conditional code and branches, and then I'll show you the optimizations of branches, including branchless programming, which sounds like removing the branches, and that's exactly what it is. Toward the end you'll see the actual benchmarks.

To set you up for what kind of results can be expected: the first example is some very common code, and in this function there is a very simple, equivalent transformation that speeds it up, on this machine, by about 3x. Here is another very common piece of code; again, there is a very simple transformation that speeds it up by more than 3x. I'll show you the full code and the benchmarks, but this is the kind of thing you can expect when these optimizations really work.

So what do the branchless optimizations fundamentally owe their efficiency to? The idea that modern hardware is very capable; it can do a lot of stuff. Computing hardware and your performance: first, there is the algorithm, so get the result you want with minimal work. Second, there is the efficient use of the language: don't do any work that you don't need done. And then, once you have a certain amount of computing to do, you have the computing resources of your processor and other computing capabilities, and ideally you want to use all of these resources, all of the time, to get through the computations that you need done.
What do the processors have? This diagram is actually a pretty old one, but still: processors aren't simple. Additions, multiplications: they have to do a lot of other stuff to get the results back. The execution units themselves, and there are actually a lot of them, aren't even the biggest part of the silicon; there is a lot of other machinery. Pretty much every block of this diagram can be used more efficiently with certain optimization techniques; for those you'd have to find a book on performance. I talk about it in my book, and there are of course others. We're going to focus on the execution units.

Very simple code: we have some arrays, maybe vectors, whatever; they are sequences, and we're doing some repetitive computation in a loop. This loads up the CPU. But the truth is, it's a very inefficient use of the CPU; it doesn't load up much of it. What it does is still actually pretty complex: you have some data in memory, and the CPU needs it in registers, so first we load the data, then we load more data, then we actually do the operation, say a multiplication, and that gets us the result. We're doing one multiplication per iteration. That's very inefficient as far as the CPU is concerned. If you ask the CPU to do two multiplications, or a multiplication and an addition, or a division, it will do that in exactly the same time. So this code and the previous code would run in the same time on most high-performance CPUs: on x86, on ARM. You get the second operation basically for free.

And that's not even the limit. Depending on what CPU you have, you can fit in a lot more, and you can write your benchmarks like this to see, on your particular CPU, just how much more you can stuff into it. A lot: somewhere between five and ten operations, easily. And that's before we get to specialized hardware units, like the vector units, which have their own math components, completely separate from the general arithmetic units.
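(A minimal sketch of the kind of throughput microbenchmark described above; the function name and the exact operations are my own, not the code from the talk. The point is that the accumulators form independent dependency chains, so a superscalar CPU can issue the several operations of one iteration in the same cycles.)

```cpp
#include <cstddef>

// Hypothetical illustration: several independent operations per iteration.
double ilp_demo(const double* v1, const double* v2, std::size_t n) {
    double s1 = 0, s2 = 0, s3 = 0, s4 = 0;  // independent accumulators
    for (std::size_t i = 0; i < n; ++i) {
        s1 += v1[i] * v2[i];  // each line depends only on its own
        s2 += v1[i] + v2[i];  // accumulator, so the hardware can
        s3 += v1[i] - v2[i];  // execute these in parallel
        s4 += v1[i] * 0.5;
    }
    return s1 + s2 + s3 + s4;  // combine so nothing is optimized away
}
```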
Now, this all sounds very good: I have computations, and I'll just have my CPU do all of them at once. That's great. The problem is, you almost never can, because you have data dependencies. In order to do the next computation, you have to have the results of the previous computation, which means all of these compute units basically wait for one of them to complete; then the result goes to the next one, which does the next operation, and so forth. That's data dependency. Things get even worse if you have code dependency: you may have branches, conditions, and now not only must the CPU wait to execute some instructions because it doesn't have the data, it also doesn't know which instruction has to be executed until it computes that result.

There wouldn't be much point in putting so much compute power on a single processor if we didn't have some workaround for these problems. So let's start with data dependency. The workaround, provided by clever CPU designers, is the pipeline. What does the pipeline do? We have our iteration of the loop, so we can do some operations on the data we have: say, for v1 and v2, for a particular iteration, we can do the addition and the multiplication. There is also the final product in that same iteration; we can't do it yet, because we don't have the results of the addition and the multiplication. What can we do in the same time? The multiplier unit is just sitting there; it's available. What could we do with it? The multiplication from the previous iteration. So if we're adding and multiplying the array elements for some value of i, we could at the same time be multiplying the results of the addition and multiplication for i minus one.

This is how the pipeline works: we have identified in our computation streams of instructions with data dependencies within each stream, so the instructions are ordered, but the two separate streams are interleaved, and there is no data dependency between them at any given moment in time. We can add elements i minus 1 while we're multiplying for elements i minus 2, and so on.

Any questions on the pipeline? [Audience question: why are the results for i minus 1 and i minus 2 not a data dependency for the multiplication?] They are, but not for the same iteration. The addition for i minus 1 is a data dependency for the multiplication for i minus 1, but not for the multiplication for i minus 2. [Follow-up about the loop index.] In reality the increment is encoded into the instructions, but conceptually there is another stream of instructions for the index: it starts from zero and computes, in some register, the value one; then, if it needs a register to index with, it computes the value two in another register while keeping the value one. So now you have two streams of instructions: one goes to three, two goes to four, and as many as needed to feed the parallel streams. There are some tricks in hardware (I'm oversimplifying) to optimize indexing, but if they weren't there, it would just cost you two more streams in the pipeline. Any more questions?

Okay, so that's the pipeline. The result of pipelining is that the entire pipeline is busy with parallel streams of execution, and the end goal is to increase CPU utilization. Ideally all of the CPU compute units would be doing something, not waiting for anybody's data, and if the pipeline is wide enough, you may even achieve that.

So far so good, until we get to conditional code. We handled data dependency by waiting for the data to be computed while doing something else. Now we have the problem that we don't know what to do until we know the result of, let's say, this comparison: which is greater, v1 or v2? We don't even know what the next instruction is. This is the bane of all pipelines: now you have to wait. Conditional instructions take a while to actually give you the answer; you have to wait, and you don't know what to put into the pipeline, so you create bubbles in the pipeline, during which you do nothing. This negates our wonderful achievement of fully loading the CPU with computations.

In truth, that isn't 100 percent accurate: the CPU can do some conditional operations in a single instruction, without jumping. We're going to set that aside. In my code I'll use simple conditional operations, so if you see a conditional operation that your compiler replaces with a CMOV or something like that, just ignore it, pretend it isn't there. There are some limited things compilers and processors can do to avoid certain branches; I'm showing examples that might be optimized that way, so if you know your compiler optimizes one, just go with it. You can always get to more complex conditional code that isn't optimized, even if this simple one was.
Now, even with that aside, our loop actually has a hidden conditional that should have ruined our pipelining to begin with: this wonderful pipeline goes on with i plus 1, i plus 2, and so on, only if i plus 2 was valid in the first place. There is a condition, i less than n, at the end of the loop. So we shouldn't have been able to pipeline at all; what I told you should not be possible, because before starting the second and third streams of instructions we actually have to check whether i plus 2 is even valid.

There is, of course, an antidote to that problem as well, again courtesy of the CPU designers, and for this loop it's something very natural. It would be a great shame to lose all the benefits of the pipeline for a long loop where the condition fails exactly once, at the end; we would be paying for the possibility that i is greater than n on every iteration of a loop of size, I don't know, millions, when in reality it happens only once. So what do they do? They make a guess. The processor itself, not the compiler, makes a guess about which way the condition will evaluate. In the case of the loop, one branch is "stay in the loop": i is less than n. The processor will very quickly figure out that that's what usually happens, it's going to assume that it will keep happening, and the other branch, the one that exits the loop, will just sit idle.

Again, very important: this is not a compiler thing. Your source code and your machine code look exactly as written, with the loop; the machine code also has the loop. But what your processor executes looks like this, with the loop effectively unrolled. It's hardware unrolling; it has nothing to do with the compiler. The compiler may decide to unroll some loops; that has nothing to do with this. This is hardware loop unrolling, and as soon as the hardware unrolls the loop, it immediately pipelines it, and we're back to what we already figured out about keeping the CPU loaded.

Now, if your machine code looks like a loop and your hardware unrolls it, how does that work? In your machine code you say: load v1 into, I don't know, EAX; load v2 into EBX; do the add. On the next iteration it's the same machine code, so again EAX and EBX. But they're busy; they're busy with the first stream, and you're telling the processor to use them for the second stream. How does that work? Like this: EAX doesn't really mean EAX. The CPU has physical registers, whose names you normally don't know, and the architectural registers, which are the ones we know about (for x86: RCX, EBX, and so on), and there is a mapping between the logical names and the physical names. There are many more physical registers than logical ones. So if you see EAX being used twice in this hardware-unrolled loop, it's not the same register. This is called register renaming, and it's what lets the hardware pipelining work.

Okay, any questions on successful prediction and pipelining? [Audience question: what happens if the hardware speculates down the wrong path and something catastrophic happens?] Next slide; we'll get there. Any other questions? Online? No? Okay.
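(To make the hardware unrolling concrete, here is a rough software analogy; this is my own sketch, not what any compiler emits and not code from the talk. The hardware keeps several iterations in flight at once, as if the loop had been rewritten like this, with register renaming supplying the distinct temporaries.)

```cpp
#include <cstddef>

// What the processor effectively executes for
//     for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + c[i];
// is conceptually similar to this two-stream form:
void unrolled_analogy(double* a, const double* b, const double* c,
                      std::size_t n) {
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        double t0 = b[i] + c[i];          // stream 1
        double t1 = b[i + 1] + c[i + 1];  // stream 2: independent of t0
        a[i] = t0;
        a[i + 1] = t1;
    }
    for (; i < n; ++i) a[i] = b[i] + c[i];  // leftover element
}
```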
So far, we're assuming that the guess is successful. What did we get? The pipeline relies on a continuous stream of known instructions: they're fetched, decoded, and executed in parallel. Conditional branches disrupt this great ability, because we don't know what to fetch and execute. The CPU just assumes that it knows, so all the gray stuff: we'll pretend it's not even there. Instead of "jump if true" we'll just say "jump", and immediately execute the next instruction. Now we can pipeline.

That's if it all works. The success of this critically depends on how well you can predict the future. As long as the future is predicted with 100 percent accuracy, you're fine; you can pretend that "jump if true" is really "jump". What happens if you're wrong? If you're wrong, the first problem is that you won't find out for a while, and when you do find out, you need to do several things: everything you computed must be thrown away; you need to start the other branch, load up those instructions, and get them running. That's expensive. But there is an even bigger problem: anything you've done speculatively has to be discarded. What if you've done something that cannot be undone?

What could that be? Here is a very common prediction: if p is not null, dereference p, and p is almost never null. What does the processor do? It says: look, it's almost never null, so let's pretend it's never null; that's our prediction for the future. And the next thing it does is dereference p. If p was null, it still dereferences p. So any errors that happen in speculative execution must be held suspended. They become real and get reported only if the evaluation of the condition confirms that the speculative branch should have been taken; otherwise they have to disappear. If a particular error cannot be handled this way, you can't proceed with the speculative evaluation; you have to stop and wait for reality to catch up with your guesses.

Another interesting point: there is one thing that absolutely cannot be undone, and that's a write to memory. Once you write to memory, the old content is gone, so writes must be held back. CPUs have write buffers exactly for this purpose; you can think of them as registers, buffers sitting between the processor and memory, where the values are queued up until confirmation comes that yes, the condition was indeed what we guessed, and now we can commit. Did that answer your question? Okay.

The bottom line: recovering from that mess is really, really expensive, which means you really don't want to guess wrong. So how do we guess? Fundamentally, it's predicting the future based on the past. The predictors are actually not that trivial; they're pretty smart. But fundamentally it's predicting the future from the past. So the loop itself is going to be very easily predictable: it's a long loop, and predicting that it will continue looping is a good bet. At the end there will be one pipeline flush, but it's amortized over many, many iterations, so that's a good prediction. The branch inside the loop is where it gets interesting: if that boolean is always false or always true, it will be well predicted; if it varies, that depends on how predictably it varies.

I'm going to be using Google Benchmark and the standard perf profiler on Linux. You can use VTune or whatever profiler you want; the capabilities I'm going to show are not exotic. And any benchmark framework will do for what I'm about to do.
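(For readers following along, here is a minimal Google Benchmark sketch in the spirit of the experiment that follows; the names and the loop body are my reconstruction, not the code on the slides. Switching which initialization of c[i] is commented in toggles between the well-predicted and the mispredicted version.)

```cpp
#include <benchmark/benchmark.h>
#include <cstddef>
#include <cstdlib>
#include <vector>

static void BM_branch(benchmark::State& state) {
    constexpr std::size_t N = 1 << 16;
    std::vector<unsigned long> v1(N), v2(N);
    std::vector<int> c(N);
    for (std::size_t i = 0; i < N; ++i) {
        v1[i] = rand();
        v2[i] = rand();
        c[i] = rand() >= 0;   // always true, but opaque to the compiler
        // c[i] = rand() & 1; // ~50/50 random: badly mispredicted
    }
    unsigned long a1 = 0, a2 = 0;
    for (auto _ : state) {
        for (std::size_t i = 0; i < N; ++i) {
            if (c[i]) a1 += v1[i];   // the branch under test
            else      a2 += v2[i];
        }
        benchmark::DoNotOptimize(a1);
        benchmark::DoNotOptimize(a2);
    }
    state.SetItemsProcessed(N * state.iterations());
}
BENCHMARK(BM_branch);
BENCHMARK_MAIN();
```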
Let's see some code; I'll just show you the difference. Is the code visible? Okay. Before I show the code: any questions on what I promised to prove to you? I'm now going to prove it with code, but is it clear what I set out to prove? [Request to reiterate the claim.] Okay. What I set out to prove is that pipelining improves the utilization of the hardware and therefore gives you better performance; that the CPU will figure out a way around the data dependency present in code like this and pipeline it; and that if there is a branch, in which case it can't pipeline because it doesn't know which instructions come next, then if it can predict the branch it will still pipeline successfully, and if it can't, you'll pay a huge penalty in performance. That's what I set out to prove, and the next thing, of course, is to show you how to optimize so the huge penalty doesn't happen.

All right. By the way, since we're talking about prediction by the processor: one thing we must absolutely avoid in these tests is prediction by the compiler, and compilers are damn clever. Well, almost. In particular, no compiler that I know of knows that rand() returns non-negative values. So this defeats every compiler I know: the condition on the left is always true, but the compilers do not optimize it out at compile time, even though it actually could be. You really don't want compile-time optimizations to happen here; you want them in a real program, but not while you're doing the experiment. So the difference is: one condition is always true, the other is randomly true or false, more or less randomly. Otherwise the two versions do exactly the same work.

Let's see what happens. The one that was predicted runs, on this laptop, at about 1.2 billion iterations per second. The other one: one sixth of that. That's the cost of constantly mispredicting the branch right there. Roughly half the time, whatever the predictor did on that branch, it had to recover from a false prediction: flush the pipeline, flush the write buffers if it wrote into the wrong place, then load the instructions that should have been executed and execute them. That takes time.

Now, in this case you know why it's slow, because we started from the code, suspected that one version might be slow, and confirmed it. What if it were a real program and you didn't know? If we started from the slow version with no idea why it's slow, how would we get to the fast version? We need to somehow find out where the branch misprediction happens. So now I'll show you how to profile for this. Again, I'll be using perf, and I'll just run perf stat, which gives the aggregate profile for the entire program: it tells me that the program executed this many instructions, took this many cycles, and so on. The interesting bit is here: 10 percent of all branches were mispredicted, in the entire program. That includes the branch at the end of the loop, which is almost always predicted correctly, and whatever other branches the benchmark harness itself sets up. So this is for everything: it tells you that you have a lot of misprediction, but not where. In our case it's easy to find; in general you would ask for annotated source or annotated assembly, and it would show you the line in the code where the branch mispredictions happen. I won't show that, in the interest of time, but come see me after the talk and I will; it just takes a little while to set up, run, and get through the menus.
Now, ten percent of all branches mispredicted: is that a lot or not? On its own, we don't know what 10 percent means. Well, let's see what the good program does: less than one tenth of a percent. So 10 percent is a lot. As you keep doing these experiments you develop a feel for it: ten percent is really bad, and even one percent is actually pretty bad. The reason one percent is already bad is that mispredictions are really, really expensive; there may be few of them, but each one costs a lot of cycles.

Any questions on this basic example, where I've shown you the cost of misprediction? [Audience question about disabling the branch predictor.] Well, first of all, I can't disable it, in general. Okay, that's not entirely true, but the ways to disable it involve severely messing with the CPU: if you keep generating interrupts all the time, you keep resetting the predictor's history. That's actually an interesting question; in this case the answer is probably yes, and we have some data on it, because one of the CPU vendors made an embarrassing mistake at one point in history: their branch predictor did the exact opposite of what it should have done, and the performance of that particular CPU could be increased by almost an order of magnitude if you kept hitting it with interrupts as fast as you could. True story.

Summarizing our results: a small fraction of a percent of branch misses is good; 10 percent is really bad. Now, it's very important to point out that even though I've shown you that mispredictions are expensive, you always have to make sure you really have them. Misguided optimizations are also expensive. You have to know that you have the mispredictions; you have to measure, every time, and I'll show you an example of why.

The next example is very similar: the code is the same, and the only difference is the conditions. I start from a random true or false and then alternate every time: true, false, true, false, true, false. If we think the CPU predicts based on the last outcome, it will get it wrong every time. If we think the CPU predicts based on some average, it will just be confused, because 50 percent of the time it's one way and 50 percent the other. Let's see what it actually does. First, this version is slower than the fastest one, but much faster than the slow one. Second, the misprediction rate is, as you can see, still less than one tenth of one percent. The CPU predictor figured out the alternating pattern, strictly from the history, and predicted that every other outcome is true, and which phase it's in. So do not try to predict what the CPU predictor will do: it is pretty good at guessing the future, and we aren't, so you always have to do these measurements.
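(A sketch of the alternating-condition setup just described; my reconstruction, with hypothetical names. The condition strictly alternates starting from a random phase, and the predictor learns the pattern from history alone.)

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Build the strictly alternating condition array: true, false, true, ...
std::vector<int> alternating_conditions(std::size_t n) {
    std::vector<int> c(n);
    bool value = rand() & 1;  // random starting phase
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = value;
        value = !value;       // flip every iteration
    }
    return c;                 // the predictor tracks this pattern easily
}
```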
Any questions so far? We do have a few online questions; all right, let's take those. The first one: who made the drawings? That would be my sister-in-law; there's an acknowledgement at the end. Then a couple of questions about the likely and unlikely attributes.

Let me explain what they're talking about. Most compilers have built-ins for this, and C++ now has attributes, so there is a way to attach a prediction to a conditional, which translates into something in the generated code that guides the CPU predictor toward a certain decision. What exactly it does varies by hardware; it may completely override the predictor and just force the choice, and if you're always wrong, it's going to be pretty bad. The problem with these hints is that, in reality, programmers are much worse at guessing the future than they think they are; you have to benchmark your code to see how good you are at it. It always pays off in trivial cases. Say you have debug asserts: if this condition is false, just abort. Well, (a) you hope it doesn't happen often, and (b) you don't really care how slow abort is. So yes, mark the abort path as unlikely; the abort isn't going to happen, and it makes that assert, which hopefully never fires, much cheaper. A well-predicted branch is very inexpensive. So if you're very sure, the hints can pay off; in any non-trivial case, the odds of getting it wrong are unfortunately high.
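(A small sketch of the assert-style case described above, using the standard C++20 [[unlikely]] attribute; the pre-C++20 GCC/Clang builtin is shown in a comment. The function and its name are mine, not from the talk.)

```cpp
#include <cstdlib>

// The "debug assert" pattern: the failure path should essentially never
// be taken, and we don't care how slow it is when it is taken.
long checked_read(const long* p) {
    if (p == nullptr) [[unlikely]] {  // hint: this branch is cold
        std::abort();
    }
    return *p;
}

// Pre-C++20 equivalent with the compiler builtin:
//     if (__builtin_expect(p == nullptr, 0)) std::abort();
```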
[Next online question: do the side-channel and branch-prediction exploit mitigations of recent years interact with these optimizations?] This is basically a question about Spectre, Meltdown, and all their friends. I have a whole section in my book on that; it's long, and it would take the rest of this talk to explain how it works. In short: speculative execution, which is supposed to be completely undone, leaves very subtle traces that can be observed. That's the fundamental issue. I'll leave it at that; find me after the talk and I'll explain all the details if you want, including working code. The question specifically was whether the mitigations mess with the pipeline optimizations. The mitigations vary depending on the vulnerability: Spectre v1, Spectre v2, and so on. Some of them don't mess with anything unless you actually try to run a real exploit. I can show you code that, in the presence of the Spectre v1 mitigations, still does a Spectre-v1-style read on data you have legitimate access to but just aren't accessing, and that doesn't hurt anything. So it depends on which exact vulnerability you're trying to exploit and which of the preventions we're talking about: some of them affect performance, some of them don't anymore.

There are two more online questions; well, let's see where we are on time. Let me go a little further and then we'll try to pick up more questions. Any questions in the audience? Yes, please. [Audience question: the profiler showed you a badly predicted branch; what do you do next?] That's what I'm going to show you next. In general, if there is a simple way to deduce which way a branch will go, the hardware will do it for you, and you won't even see the branch in the profile. So if the CPU can predict it, it will handle it without your help; and if the branch does show up in the profiler, messing with the prediction is probably not going to help you. You have to do something fairly drastic, which is why I say these optimizations are invasive. You have to be sure you need them, which means you have to run that profile. So now I'll show you the invasive optimizations. As we said: you have to measure; you have to confirm with a profiler.

Let's look at several cases. There are lots of variations; I'll show you the basic types, and you can apply them in many similar situations in your code. Here is a very common one, and it gets us to the point of using conditions: if x or y, do something; otherwise do something else. Here is how I, as a programmer, look at it: there is one condition; if it's true I take one path, otherwise the other. That's not what the compiler sees, and not what the processor sees; they see it quite differently. What they see is: x is one condition; if it's true, we take the true branch; if it's false, we look at y, and if that is true, we take the true branch; and if both are false, we take the false branch. So what's the subtle difference? I think of it as one condition, x or y; the processor thinks of it as two conditions. Which means the branch that's finally taken may be predictable, but how you get there is not. You think this is a predictable condition; your processor disagrees with you completely, because half the time it's x that got you there and half the time it's y. And the two tests, for x and for y, must be done separately, because logical OR is defined with short-circuit evaluation: if x was true, we don't even look at y. This is very important; it lets you do things like dereferencing a pointer that could be null, where you never reach the dereference because the previous condition protects you. But it may have a cost.

I'll show you the cost. Same body of the loop, but now I have two branches. There are two boolean values here, and this is how I initialize them: if one of them is true, the other is always false. I'm demonstrating the worst case. In the best case everything is always true, everything is predictable, and a predicted branch is very cheap; that case isn't interesting, because the profiler will show you nothing bad happened, and you're good. I'm teaching you about optimization, so I'll show you the worst case: you always end up in the true branch, but you have a 50 percent chance of getting there through x or through y. Result: the same ten percent of mispredictions, and the total rate was still about 200 million operations per second, on the same loop that should run at over a billion when perfectly pipelined. Because of the short circuit, the processor has to evaluate x; predict the result, which fundamentally fails because it's random; depending on the result, jump to evaluating y or not; and predict that result, which fails again, because y is also random. So: two branch mispredictions if x was false, one misprediction on x if x was true, and then you go do the computation. The annoying part is that the final result is actually well predicted: you always go into the true branch. You're paying for nothing. There should be a way to figure out that the whole thing evaluates to true without hitting the two mispredictions along the way.
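(For concreteness, the worst-case initialization just described, as a sketch with my own names: exactly one of the two booleans is true on every iteration, chosen at random, so x || y is always true but the path to it is a coin flip.)

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Worst case for short-circuited x || y: the result is always true,
// but which term made it true is random.
void init_conditions(std::vector<int>& b1, std::vector<int>& b2,
                     std::size_t n) {
    b1.resize(n);
    b2.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        b1[i] = rand() & 1;  // random
        b2[i] = !b1[i];      // exactly the other one is true
    }
}
```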
[Audience suggestion: would it help to use bitwise OR?] Next slide. So what can you do? Some people try to introduce a temporary variable. Utter failure, let me tell you right away, for two reasons: one, it still has to do the short-circuit evaluation; and two, the compiler is really good at eliminating those, so it will undo your temporary variable. Don't even try. What does work is replacing the short-circuited logical evaluation with something that has no short circuit. If we're dealing with booleans, which are zeros and ones, addition can handle OR (and multiplication can handle AND), and bitwise logic can handle it too.

Does this actually work? Let's see: the only difference is plus versus logical OR. Very few mispredictions; not quite as fast as our fastest loop, since we do more work here (there are two conditions to evaluate), but much faster than the version plagued by mispredictions. So yes, it worked: it got the mispredictions down and became much faster. The main danger here is that some compilers, in some of these loops, will figure out that you're really trying to do a logical OR and will helpfully optimize it for you, generating the logical OR back in the code.

[Audience question: why doesn't the optimizer do the reverse and replace the logical OR with a bitwise OR?] It can, in very specific cases. First, it needs to know that your bools are true bools: remember, 3 is "true", and a bitwise OR of 2 and 1 gives you 3, which is still true, but a bitwise AND of 2 and 1 gives you zero. So OR happens to work and AND doesn't, unless the values really are zero and one. Second, it can only do this without your consent if the evaluation cannot fail: if you're dereferencing *p, the compiler usually can't prove the pointer is never null, so in general it can't make the second term unconditional. And third, it doesn't always pay off, which doesn't necessarily stop the compiler from doing it. It is an optimization you have to confirm, and I'll demonstrate why.

Any other questions before I do? [Audience question: you replaced two branches with one, yet branches per second went up; what happened to the total number of branches?] Remember, these numbers are per second. I'm running faster, so I'm taking more branches per second. This is the optimized one and this is the non-optimized one: yes, more branches per second, because the computation runs a lot faster. The second line from the bottom is branches per second; the bottom line is the percentage of all branches that were mispredicted. I'm executing a lot more instructions per second in general: I was doing 0.78 instructions per cycle, and now I'm doing 2.41, and some of those instructions are branches. So more instructions per second, including more branches per second, which is a good thing, and I'm mispredicting far fewer of them. And I did replace two branches with one, because instead of checking x and y separately, I'm checking the result.
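(The transformation in miniature, as a hedged sketch with my own names rather than the benchmark code itself. The first version has two conditional branches because of short-circuit evaluation; the second always reads both booleans and branches only on their sum.)

```cpp
// Two branches: y is even examined only when x is false.
unsigned long short_circuit(bool x, bool y,
                            unsigned long a, unsigned long b) {
    if (x || y) return a;
    return b;
}

// One branch on the combined result: both terms are always evaluated.
// Valid because a genuine bool is exactly 0 or 1; (x | y) works the same.
unsigned long no_short_circuit(bool x, bool y,
                               unsigned long a, unsigned long b) {
    if (x + y) return a;
    return b;
}
```

As the talk warns, a compiler may canonicalize one form into the other in either direction, so the effect has to be confirmed with a profiler.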
Now, if the result were also random, this would not have paid off at all: if the overall outcome sent you into the true branch 50 percent of the time and the false branch the other 50, you would still have that misprediction. The reason it pays off is that the total result is well predictable (here, perfectly predictable), while the way you get there isn't. I'm removing the mispredictions on the way to a very predictable result. That's the optimization, and it works great when the total result is predictable. Any more questions? All right, let's get to the next one.

There is one other thing to notice in this optimization: removing that branch has a cost, and the cost is doing more work. The short-circuit evaluation spared us from evaluating y half the time; with plus or bitwise OR, we read x, read y, and do the operation every time, regardless of the outcome. So we do more work on the CPU. That's the trade-off, and it's actually very general: typically, replacing branches with any of the optimizations I'm showing you results in doing more work. Why is that okay, and why was it faster anyway? Because the CPU has a lot of spare computing capability. You can slot in one or two additions almost any time for free; those units were sitting idle anyway, so getting one or two extra adds evaluated alongside what you're already doing usually costs nothing. Not always, but most of the time, because you're typically nowhere close to loading the CPU at 100 percent; that's just really hard to do.

This matters if your metric of performance is throughput: how much can I compute, how fast can I finish. If your metric is how quickly the battery drains, you probably don't want to run unnecessary computations through the processor, and now you have a trade-off to make. But we're talking about throughput for now. So: optimizing branches away typically means more work; often, but not always, that work fits into computing capacity that would otherwise have been idle. Not always, which means, again, you have to measure that you're gaining. In particular, if x and y were themselves very complex expressions, you probably would not have gained by evaluating both of them every time.

Let me show you one more optimization. I've shown you one for short-circuited expressions; it applies to the ternary operator and other things as well. This next one is the foundation of the other type of common branchless optimizations. We went from two branches to one, and that was good. What's better than one? Zero. Going from two branches to one and then to zero should be even better. The fundamental trick is always the same, and it's right here on the slide; everything else is syntactic sugar around it: you use booleans or conditions as indices into an array. Instead of computing the result conditionally as expression one or expression two, you compute both of them, put them into an array, and fetch the right element using the condition as the index. This removes the condition, and it costs you unconditional computation of both expressions. If the expressions are short and the branch was poorly predicted, you often win. If the branch was well predicted, you always lose. If the expressions are long, you have to measure. I'll show you the example; see the sketch below.
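(The shape of the trick, sketched under my own names: compute both results unconditionally and index with the condition, exploiting the fact that a bool converts to 0 or 1.)

```cpp
// Branch version: the jump is mispredicted when c is random.
unsigned long with_branch(bool c, unsigned long x, unsigned long y) {
    if (c) return x + y;
    return x - y;
}

// Branchless version: both expressions are always computed.
unsigned long without_branch(bool c, unsigned long x, unsigned long y) {
    unsigned long r[2] = { x - y, x + y };  // r[0]: false case, r[1]: true
    return r[c];                            // bool indexes as 0 or 1
}
```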
This will be about all we have time for, but here is that example. It's a branchless optimization; the result of the computation is exactly the same. On the left is the branch version: I just take the branch. On the right, I put both results into an array and choose one using the index. The branch itself is going to be mispredicted; it's random (this is where I determine the condition), so it's going to be mispredicted badly. That's the branch version, and that's the branchless version: about four times faster, something like that. The branchless version wins even though I'm computing both expressions.

Now, what if the branch were perfectly predictable? The exact same code, but I make the branch always true. With the branch, without the branch: now my branchless optimization is losing, and not slightly; it's considerable, because I'm evaluating both expressions all the time. So before you do this, look with the profiler and confirm that the branch was badly mispredicted in the first place.

The other thing: sometimes, even when the source code makes it look like your branch will be mispredicted, the compiler will do this optimization for you. In particular, a ternary in a return statement usually gets optimized; not always, but usually. What makes it worse is that different compilers do different things. I have a very short example where Clang and GCC optimize differently: GCC does its own branchless transformation, and you can't beat that; Clang doesn't, and you can beat it. So you have to profile the results of your code; sometimes, for very similar code, the compiler does the branchless transformation in one place and not in another. These optimizations aren't very well developed yet. I'll show you that example.

This is actually the function I showed you on the second slide. I'm calling this function, and the rest of the code is exactly the same: I'm adding the result of that function to some value. But how do I compute the return value? In one version I say: if b, return the value with something added. In the other version I use the boolean as an integer: multiply by zero if I don't want to change anything, by one if I want to add the number. The branchless version costs a little extra computation. Again, this is a mispredicted branch: the branch version with the misprediction versus the branchless optimization, where I don't care whether the condition is predictable or not. But if the branch were perfectly predicted, that number would be around 700. So: a perfectly predicted branch doesn't pay off; a mispredicted branch pays off spectacularly.
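(The same idea in its multiplier form, sketched with hypothetical names: the boolean becomes a factor, so the "do nothing" case multiplies the increment by zero.)

```cpp
// With a branch: may be mispredicted when b is random.
unsigned long add_if(bool b, unsigned long x, unsigned long delta) {
    if (b) return x + delta;
    return x;
}

// Branchless: b promotes to 0 or 1, so this adds either 0 or delta.
unsigned long add_if_branchless(bool b, unsigned long x,
                                unsigned long delta) {
    return x + b * delta;
}
```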
Okay, we're basically done, but let me caution you against one more thing, because there is a very similar optimization you can do with functions. If you don't know which function you're going to call, you may be tempted to do a jump through a function pointer, which is exactly the same idea as the branchless optimization with the array. It almost never works, and if those functions could have been inlined, it's a spectacular failure. I'm not saying it never works, but be really careful: always profile it, and even then, if you move to a better compiler that can inline the calls, it could still turn into a spectacular failure.

Let me restate the train of thought. The more of your computing capability you can use, the faster the code executes. Processors have a lot of computing capability, and usually they're waiting on data. The designers came up with a clever workaround, pipelining: detect multiple independent instruction streams and execute them interleaved. That works great until you don't know what to do next, when not just the data but the code depends on previous results. That's conditional code; that's branches. They came up with a workaround for that too: make a guess about the outcome of the condition, pick that code, and run with it. Very expensive if the guess is wrong, and there is no hardware workaround for that; now it's on us. We optimize by replacing an unpredictable guess with a predictable guess, or by removing the guess, the condition, entirely.

So what did we learn? Predicted branches cost you almost nothing. Mispredicted branches are terribly expensive. The optimization: use fewer of them. It has a cost, so don't do it unless the problem is confirmed, and verify that the optimization actually works. Both of those are done with the profiler.

As I mentioned, the illustrations were done by my sister-in-law, and two more members of the family deserve credit for letting me use their laptop. If you want to read a lot more about performance in C++, including more about the branch optimizations and the hardware behind them, it's in my book, and the link at the bottom gives all conference participants 25 percent off. Now let's take all the questions. [Applause]

[Audience question: In the more common case, where I do have some idea about the booleans I'm checking, I can order them by how likely each is to fail. Also, I often care more about the performance of one branch than the other: the good path matters, and the bad path throws an exception, where I'm going to pay a lot anyway. Does it make sense to do these optimizations in such a case?] There are really two questions in there. First: for a composite logical expression that is sufficiently complex, what matters is how predictable each term is, not how predictable the result is. If you have terms that are much less predictable than the overall result, then yes, it pays off to replace the short circuit with bitwise ORs and the like. As for the overall expression: the fact that one of the branches is an emergency branch doesn't matter by itself; what matters is that you rarely get there. Once you've replaced the individually unpredictable terms with the overall result, which is well predictable, you're not going to be throwing exceptions often,
so you're effectively postulating that the overall result is well predictable, and at that point you're done, because a well-predicted branch is cheap. You can't optimize your way past that; you just can't beat it. This is the case the vendors try really hard to optimize their processors for. [Follow-up: which terms count as less predictable than the whole expression?] Right, individual terms within the expression that are much less predictable than the overall result. Say x or y: x is poorly predictable, y is poorly predictable, but the result is well predictable, because at least one of them is almost always true. For example, 60 percent of the time x is true and 60 percent of the time y is true, but they're strongly anti-correlated, so there is maybe a one-tenth-of-a-percent chance that both are false. [I see; but if there is no such correlation, I can keep the short circuit.] Yes, exactly. Okay, let's go to the other side, and we'll come back here... The session is over. Thank you. [Applause]
Info
Channel: CppCon
Views: 117,046
Id: g-WPhYREFjk
Length: 63min 56sec (3836 seconds)
Published: Wed Jan 05 2022