Branchless Programming: Why "If" is Sloowww... and what we can do about it!

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments

Semi-related: the Xbox One's security processor's boot ROM uses branchless programming for security-critical functionality in order to mitigate fault injection attacks. More info here (31:20): https://youtu.be/U7VwtOrwceo?t=1875

The details are mostly in the next slide, but the slide I linked to is highly relevant.

πŸ‘οΈŽ︎ 259 πŸ‘€οΈŽ︎ u/anxxa πŸ“…οΈŽ︎ Apr 08 2021 πŸ—«︎ replies

I coded xbox 360 and ps3, every if removed was accompanied with jingles and bells.

Nowadays though, barely any changes in the profiler.

Hardware designers work really hard to remove a lot of the fun of low level programming.

πŸ‘οΈŽ︎ 93 πŸ‘€οΈŽ︎ u/polytopey πŸ“…οΈŽ︎ Apr 08 2021 πŸ—«︎ replies

Avoiding branches is still common advice in GPU programming, shader code really tries to squeeze out every last bit of performance. Also GPU architecture is massively parallel so it doesn't behave the same with branches.

πŸ‘οΈŽ︎ 74 πŸ‘€οΈŽ︎ u/0x00000000 πŸ“…οΈŽ︎ Apr 08 2021 πŸ—«︎ replies

This engaging lecturer could have summarized the whole video at the front by saying "if you want to improve performance at the expense of readability, you can do it with branchless programming. And if you need yet more performance, you can get it by using processor-specific instructions in assembly."

I think we know there's room for improvement if we're willing to do bespoke assembly-code. In 20 years I've never seen anyone need to do that at the application level.

What's more interesting is that his branchless programming is just one of many tricks, tricks that are worth knowing about because they allow an application programmer to get some performance benefit without having to throw away platform-independent code. And his specific branchless programming trick is still brittle, because it depends on one feature of modern processors: the fact that doing extra math is inexpensive relative to the cost of branching in a pipelined CPU.

He SHOULD have said this: if you want to do TOUPPER(c), you can do one of

  • arithmetic expression of the form result = (conditional) * R1 + (!conditional) * R2, which does extra arithmetic to avoid an if statement
  • local array of the form result = R[(conditional)], which does the same thing but uses extra stack memory to save a conditional expression
  • table lookup, which precomputes results early in the program so no comparisons have to be done at all
  • result cacheing, which computes results on the fly but uses data structures to get some of the benefit of a table lookup by re-using values instead of generating a rainbow table ahead of time

I haven't done a test yet, but I'll bet that paying the cost of precomputing a 256-byte array will run faster than his ToUpperBranchless and be both platform-independent and also clear and easily maintainable.

(edit: I did some tests and on my machine a ternary is faster than a plain if statement, his branchless expression was slower, and a precomputed array was fastest of all...but not by much.)

πŸ‘οΈŽ︎ 159 πŸ‘€οΈŽ︎ u/leberkrieger πŸ“…οΈŽ︎ Apr 08 2021 πŸ—«︎ replies

Reminder: For 95% of you, your code is slow because of I/O or network calls. You don't need branchless code, you need a caching layer and make sure you never do two network calls where one could work.

πŸ‘οΈŽ︎ 10 πŸ‘€οΈŽ︎ u/GrinningPariah πŸ“…οΈŽ︎ Apr 08 2021 πŸ—«︎ replies

With a modern branch predictor, if usually doesn't add a significant amount of time. With that said, you should consider alternatives if (sorry) trying to optimize inner loops. Things like case statements can evaluate faster (unconditional vs conditional jump) for example. Loop unrolling is another obvious example.

πŸ‘οΈŽ︎ 108 πŸ‘€οΈŽ︎ u/gimpwiz πŸ“…οΈŽ︎ Apr 08 2021 πŸ—«︎ replies

If people are interested in extreme optimisation, this guide is a really good start: https://www.agner.org/optimize/optimizing_cpp.pdf

There is more here: https://www.agner.org/optimize/

Some people here questioning why you'd ever need to optimize down to this level. Well, I do some real-time DSP programming as a hobby (maybe soon a job if my wannabe startup goes well :), dabbled in chess engines, and have worked professionally on high frequency trading algorithms; this is basically what I do day in day out :)

πŸ‘οΈŽ︎ 22 πŸ‘€οΈŽ︎ u/Vedorias πŸ“…οΈŽ︎ Apr 08 2021 πŸ—«︎ replies

Please don’t start programming like this unless it is extremely important with the performance gain. Otherwise it will take a week for the next developer just to figure out what you’re actually trying to accomplish, and also, you might just as well write your code in an assembly language instead.

πŸ‘οΈŽ︎ 9 πŸ‘€οΈŽ︎ u/einord πŸ“…οΈŽ︎ Apr 08 2021 πŸ—«︎ replies

I wonder what it would take for the compiler to do this for me.

πŸ‘οΈŽ︎ 24 πŸ‘€οΈŽ︎ u/realbrokenlantern πŸ“…οΈŽ︎ Apr 08 2021 πŸ—«︎ replies
Captions
okay you know all and welcome to another videos so the topic of today is branchless programming yeah we're gonna have a bit of a description of branchless programming and then we're gonna have a look at a couple of examples so one really simple example and then after that something's slightly more complicated alright so I will be putting up a companion ebook for the patreon as well as the source code so if you'd like to become a patron and support the channel then jump over to patreon I'll put a link up above and another one down below and that is ok branchless programming branchless programming what is branchless programming well first of all what is a branch a branch in programming is something like an if statement the switch statements the conditional operator a branch occurs whenever the CPU has more than one path that it can take so usually the CPU just executes one statement after another but a branch occurs when there's a condition and the CPU could potentially take one of two paths so the reason that that's slow is because the CPU actually tries to stay ahead of where it's got to execute so it tries to read the upcoming instructions so that it can prepare them for when it's going to execute them but the trouble is if there's a branch the CPU doesn't necessarily know which path that's going to take so it doesn't know which instructions to load and what it does is it guesses their guesses which path to load and it loads a bunch of instructions which might be good that might be the instructions that it needs to execute but on the other hand it might actually be the wrong instructions they'll have to flush all of those instructions out and it'll have to read the real path that flushing of the wrong path and the loading of the real path actually takes a lot of time so it's really really slow and in branchless programming what we're trying to do is is minimize or reduce the number of branches so we want to get rid of things like if statements in the middle of our loops and yeah hopefully the CPU will run our code much faster since it doesn't have to flush when it reads the wrong path ok so that's the basics of a branch but let's have a look at a bit of an example just yes so here I've got the first example this is really really simple stuff but we will get into more sophisticated stuff and eventually jump into assembly but right here we've got a small function it's just called smaller and it takes two integer parameters a and be and it returns the smaller of the two so this is a pretty logical way to code it just here would just say something like if a is smaller than B then return a else return B whenever we call this function if a and B are sort of randomly distributed then you could expect that 50% of the time this function will return a and 50% of the time this function will return B the CPU doesn't actually know without performing the comparison which one it's going to return okay so that's actually branching code that's perfectly normal C++ code that's a natural way to code and honestly it's a fairly good way to code so particularly for this function this is a very small simple function and as we'll see in just a minute when we disassemble this thing out the compiler actually knows exactly what we're doing now all right but this is second example just here is an effort to make that branchless so this is just a little introduction to the concepts it's not a good way to go for this particular example but let's just have a bit of look at it so we've got here smaller branch lists it takes two parameters a and B and it returns an int just the same as before but the expression that it returns doesn't have an if statement in at all we say return a multiplied by a smaller than B plus B multiplied by B smaller than or equal to a that my friends is a branchless smaller function so this actually shows a really common technique in branchless programming and that is arithmetic using the conditional operators so the conditional operators just here this less than just here and the same with this less than or equal to over here they actually return one when the condition is true and zero when it's false so they do actually return that numerical values and we can do arithmetic on that so this expression just here if a happens to be smaller than B then this little return statement just here will become return a times 1 plus B times 0 and if BAE happens to be the smaller of the two or if they're both equal then we'll get something like return a times zero plus B times 1 yeah you see because the conditional operators just return 1 for true and 0 for false so you can actually string together a whole bunch of these additions and multiplications you just have to do something like this pattern just here we've got a times the comparison that leads to a being returned plus B times the comparison that leads to B being returned plus C times the comparison that leads to C being returned so you can actually make a you know a whole string of these things together you can do more than now just two options so in this particular example this is actually going to be much much slower so let's have a bit of a look at why that might be modern compilers actually pretty good at optimizing and they know a whole bunch of optimization tricks include a branchless programming ok so the compiler actually saw what we were trying to do in the first example and it said well you're just trying to return the smallest int and I know a branch this way to do that so the compiler optimized the branch out completely and it led to something like this assembly go down here so this is the assembly code from the first example we've got to move EB XD word pointer B so just move B into the register EBX then compare that value with a and see Bob L seam of L conditionally move L yeah move a into EBX if if it's less so the branch just here has been replaced by this instruction seam of L so the C++ compiler has actually recognized what we're trying to do it's made of branchless version using a really good instruction for this particular code if we have a look at the second example so the second example I was trying to be tricky I was trying to be clever I was trying to help think the compiler and I put in a branchless code myself hoping that it would go faster but it did so if we have a look at the disassembly of the second version we see that things are a lot more complicated so remembering that the first listing up above was really really fast there's like three instructions but this one here is much much longer so this is actually branchless as well so you'll see in here the this one here set L a so that's actually a branchless condition there as well set le is going to set the bite to one if some condition less for equal in this particular instance it's going to set BL to one if that's true and it's going to set it to zero if that's false yeah so then it performs the I miles another set L down here now that I'm all add the results together so you'll see looking at that code there that that's more or less a word-for-word assembly translation of the original c++ the compiler wasn't able to see what we were trying to do it didn't recognize the pattern that we were trying to achieve here so in this particular example our branchless code is much much slower and often it really pays to have a bit of a look at the disassembly and see what the compiler is doing with your code just so that you can get a bit of an idea if the compiler is actually doing things well or if it's just gonna be really really slow branchless programming is not always beneficial especially in circumstances where the compiler is able to optimize the code itself it's able to realize what we're trying to do and out code us now so you do have to be careful where you apply this sort of stuff okay moving on to the second example so this one I think is slightly more interesting than the first only slightly module let us now consider a more complex example we want to write a function that takes a string of ASCII characters and converts all the lowercase letters to uppercase yeah so we're writing a to papa method in other words you don't want to convert uppercase letters obviously they're already uppercase and you also don't want to convert bytes that happen to be things like numbers or punctuation marks okay so it might help to mention that the ASCII numbers or codes I actually made so that the lowercase versions of letters are 32 above the uppercase versions now so the uppercase letters have of codes from 65 to 90 with 65 being a capital A and the lowercase versions of the letters are from 97 to 122 so 65 is an uppercase a and 97 is a lowercase a which means that all you've got to do to convert is first of all make sure that it is within the range of 97 222 but if it is in that range then subtract 2030 from it now you'll have a lower case version okay so a little bit about the timings here I've actually run these two upper functions and million times with strings of a thousand and twenty four characters so it's going to be minimal cache misses and we're really getting the best possible circumstance here in practice this is probably not going to work as fast as it does in this benchmark okay but let's have a look at the first example so the first example is just your basic C++ code I equals 0 all the way up to count and all we say is if D I is greater than or equal to lowercase a and if di is less than or equal to lowercase ed then subtract 32 from D PI all righty so that's our C++ function let's disassemble that bad boy and see what the compiler come up with would you look at that ok so you don't have to look at this very long to see that there are a bunch of instructions called job J a jump if above my friends those are branches yeah those are conditional branches jump if above is a conditional branch and that is what we're trying to get rid of and what I found when I ran this a million times was that this code takes something like three point three three seconds to completes our little benchmark of a million iterations there so what we got to do is have a bit of a look at some branchless techniques and see if we can improve that more ID so here's my first branchless version just here this code here is completely branchless nary a branch in sight so let's let's break it down a little bit so it's in two halves if D is greater than or equal to a and D is less than or equal to Z it's gonna give us one if di is a lowercase letter and it's going to give us zero if it's not and then we've flipped that with a little not just yet and I'm multiplying that by di yeah so that'll actually maintain all of the uppercase letters or the punctuation marks or the numbers or that sort of thing and the second part down here is the lowercase part yeah so instead of di equaling itself multiplied by you know that expression we've got di minus 32 multiplied by now the same expression so di is greater than equal to a and D is less than or equal to Z yeah so we just compute the expression these are brackets ISTE and there's an imaginary expression right and multiplied by the value that we want and if we run this code here on my particular little machine it actually comes out to around about three times faster than the original to upper so that's a pretty good speedgame we go from about three and a third seconds down to about one second yeah so that's a nice game but let's have a bit of a look at how we can improve things more so in this second version just here to offer branchless we've got di minus equals thirty-two times that expression just there yeah so once again that expression is going to come out as zero if di is not a lowercase letter and it's going to come out as one if it is a lowercase letter then we're multiplying that by 32 and subtracting it from di so this is actually much faster than our first branchless programming version I think this is something like six or seven times faster than the original C++ and the running time there that I got on my particular machine just here was about 471 milliseconds yeah so we're down to less than half a second now so seven times speed gain on a to upper that's not bad really yeah the code is a little bit tricky to read but we're gonna get a whole lot more tricky to read so let's just buckle our seatbelts all right but that's all safe bus Plus so now we're going to jump over to assembly and have a look at some similar things in assembly and see just how much speed we can get okay so the first Assembly version just here is not branchless this is branching code perfectly normal branching code we've got two upper as someone just here JG is a branch JL is a branch and also the bottom of the loop is a branch to J&Z but often you can't get rid of all the branches you know you'll have to have a bottom of your loop at some point or not just have all the view code written in red one after another okay so that's just a simple branching version in assembly so the runtime there is not great so the speed of that one there is about 3.6 seconds so that's actually slower than the original c++ which is about 3 and 1/3 seconds so it just goes to show that you don't you don't automatically get speed from something over to assembly you have to code it well let's have a look at a second example so this is Brad this assembly right here we're going to use the Siemens the conditional moves because we're great and we know how to do it alright so once again we read a bytes from our CX we store it in our just here LD Meola then we more negative 1 into our 18 and negative 1 is 32 ones all in a line and we 0 90 so I could have X or just there X or our 90 our 90 but I wasn't feeling cool I was feeling like no old school so just anyway come out Z all right so if al is greater than Z then this lines stsc mob G or conditionally move on greater than condition is going to move on 90 which is 0 into D which at the moment has negative 1 so this will actually clear our ad if X is greater than Z yeah so we're just making a bit of a bit mask here with our 8d fill it up with ones and then 0 it if al is outside the range of the lowercase letters that's pretty much all we're doing okay so this version to steer is branchless and it runs in about one point four seven seven milliseconds now which is not too bad really yeah not too bad but it's not as fast as our branchless versions in c++ that we saw at the very start I'm a branchless programming is a Sindhi technique isn't it really yeah I'll just yeah I can do China people let's let's do this for real shall we so branchless programming and Cindy go hand in hand Cindy bracelets programming is really how you put the pedal to the metal okay so Cindy single instruction multiple data you could do this in C++ using the intrinsics this CPU and a whole bunch of them out there in the heart in the wild world actually has something called a VX in them advanced vector extensions and these can actually do 32 bytes operations at once now so we might as well use the AVX registers and see if we can get a little bit of extra speed so this final version just here is an AV X branchless version of our little - upper function all right pause RB x save the callers are the X I'll be X's not scratch people we got to save it I guess a lot of this stuff at the top is just to figure out if there's residuals or AVX deals with 32 bytes at once and if the counts isn't actually divisible by 32 bytes then we might have some extra at the end Vica equal B okay so that's assembly way to set one of the AVX registers Y mm3 just here - all once across the entire register now if you ever got to one's complement things in a V X then you can just XOR with another register that is filled with ones alright then we read a bunch of little constants just here 32 G T equal a array and 0 ray those are just defined at the very top of the file 32 is just a whole bunch of 32 s GT equal a array is actually filled with a minus 1 and the cetera is filled with the SEDs so those are just for performing their comparisons but let's have a look at out the main sim D loop works so we read 32 bytes into ymm 5 so all of this business just here is all about figuring out whether or not we've got bytes that are within the range of lowercase letters yeah we've got to get a mask a bit masked so that we want once for all of the letters of a lowercase and zeros for all of the letters that are uppercase all right then we copy the original characters over to ymm for so that at the moment they're just in that white mm5 yeah we want to complement the mark that we made just before so that instead of having all of the letters that are lowercase we've got another mask which is all of the letters that are not lowercase yeah that's why we set up that white nm3 with all ones before and we and that not lowercase mask with y mm 5 so y mm 5 was actually the data that we read from the array and anding that with the not lowercase mask will result in all of the letters that were not lowercase o the uppercase letters and the digits or punctuation now they're all going to stay exactly the same but any of the characters that were lowercase will actually be zeroed yeah they'll become nothing and the next thing that we've got to do is that subtract 32 from the other values these are the values that are going to become the lowercase converted to uppercase yeah so subtract 32 from those values and that with the lowercase mask so this is the other mask that we made and what we'll end up with there is 0 for all of the values that were not lowercase and we'll end up with uppercase versions of all of the values that were lowercase there and then finally we can all outer results together so that's the two results from our masks and store everything after that is just dealing with residuals yeah so the residual loop just here just deals with things in scalar one at a time yeah okay so the important thing about this code and the reason why you might want to write this really stupid code reason why we would do something like branchless MD is this code actually runs at somewhere between say 25 and 40 times faster than the original C++ so that's really really worth doing and I will say that I've actually had a lot of trouble timing this code it tends to pretty much take no time at all and the time to randomize the data takes longer than the time to send it to upper now if you do tighten this at home you'll find just wild times it'll report a hundred times faster 200 times faster yeah I don't think it's anything like that but it is much much faster than the c-plus class it's fairly generic so in this particular example we're converting to uppercase but really all we're doing here is an hour we're we're conditionally subtracting some value from bytes if they're within a range yeah so this is really quite flexible you could use this for a lot of different things I mean you wouldn't have to subtract you could add or whatever sort of operation you want to do but conditionally perform some operation on bytes if they're within some range okay so the conclusion branches are slow now to improve your code by using branchless techniques this works in high-level languages it even works in c-sharp Java those sorts of things yeah you'll find that the compiler often knows a bunch of bracelets techniques and if you disassemble your compiled code sometimes you get a better idea of exactly what the compiler is doing it's really worth looking into Brad's list techniques are not always going to help but they're really really interesting it's just one of those techniques that I think is largely lost in today's programming education fields I was never taught this I was never taught this I said to study it from my various sources online you have Wikipedia and things like that performance manuals and things like that from from AMD Intel yeah but it does make for a really really interesting study and really good fun trying to out code the compiler with your branchless techniques anyway as I said they'll be putting up a companion ebook for this video up on patreon and you can also get the source code on patreon other than that I want to say thank you very much happy branchless coding people have a good one
Info
Channel: Creel
Views: 594,458
Rating: 4.9064674 out of 5
Keywords: branchless, programming, tutorial, demonstration, explained, how to, examples, c++ asm, x64 assembly, avx, simd, toupper, upper case, uppse case algorithm, performance programming, optimization, optimisation, convert to upper case, ascii, code, bit hacking, if statements, assembly, smaller than, comparison, branchless if, branchless comparisons, computer science, compiler, disassembly, conditional moves, cmov, Creel, what’s a creel, whatsacreel
Id: bVJ-mWWL7cE
Channel Id: undefined
Length: 19min 25sec (1165 seconds)
Published: Tue Jun 30 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.