Branchless Programming: Why "If" is Sloowww... and what we can do about it!
Video Statistics and Information
Channel: Creel
Views: 594,458
Rating: 4.9064674 out of 5
Keywords: branchless, programming, tutorial, demonstration, explained, how to, examples, c++ asm, x64 assembly, avx, simd, toupper, upper case, uppse case algorithm, performance programming, optimization, optimisation, convert to upper case, ascii, code, bit hacking, if statements, assembly, smaller than, comparison, branchless if, branchless comparisons, computer science, compiler, disassembly, conditional moves, cmov, Creel, whatβs a creel, whatsacreel
Id: bVJ-mWWL7cE
Channel Id: undefined
Length: 19min 25sec (1165 seconds)
Published: Tue Jun 30 2020
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.
Semi-related: the Xbox One's security processor's boot ROM uses branchless programming for security-critical functionality in order to mitigate fault injection attacks. More info here (31:20): https://youtu.be/U7VwtOrwceo?t=1875
The details are mostly in the next slide, but the slide I linked to is highly relevant.
I coded xbox 360 and ps3, every if removed was accompanied with jingles and bells.
Nowadays though, barely any changes in the profiler.
Hardware designers work really hard to remove a lot of the fun of low level programming.
Avoiding branches is still common advice in GPU programming, shader code really tries to squeeze out every last bit of performance. Also GPU architecture is massively parallel so it doesn't behave the same with branches.
This engaging lecturer could have summarized the whole video at the front by saying "if you want to improve performance at the expense of readability, you can do it with branchless programming. And if you need yet more performance, you can get it by using processor-specific instructions in assembly."
I think we know there's room for improvement if we're willing to do bespoke assembly-code. In 20 years I've never seen anyone need to do that at the application level.
What's more interesting is that his branchless programming is just one of many tricks, tricks that are worth knowing about because they allow an application programmer to get some performance benefit without having to throw away platform-independent code. And his specific branchless programming trick is still brittle, because it depends on one feature of modern processors: the fact that doing extra math is inexpensive relative to the cost of branching in a pipelined CPU.
He SHOULD have said this: if you want to do TOUPPER(c), you can do one of
result = (conditional) * R1 + (!conditional) * R2
, which does extra arithmetic to avoid an if statementresult = R[(conditional)]
, which does the same thing but uses extra stack memory to save a conditional expressionI haven't done a test yet, but I'll bet that paying the cost of precomputing a 256-byte array will run faster than his ToUpperBranchless and be both platform-independent and also clear and easily maintainable.
(edit: I did some tests and on my machine a ternary is faster than a plain if statement, his branchless expression was slower, and a precomputed array was fastest of all...but not by much.)
Reminder: For 95% of you, your code is slow because of I/O or network calls. You don't need branchless code, you need a caching layer and make sure you never do two network calls where one could work.
With a modern branch predictor,
if
usually doesn't add a significant amount of time. With that said, you should consider alternatives if (sorry) trying to optimize inner loops. Things like case statements can evaluate faster (unconditional vs conditional jump) for example. Loop unrolling is another obvious example.If people are interested in extreme optimisation, this guide is a really good start: https://www.agner.org/optimize/optimizing_cpp.pdf
There is more here: https://www.agner.org/optimize/
Some people here questioning why you'd ever need to optimize down to this level. Well, I do some real-time DSP programming as a hobby (maybe soon a job if my wannabe startup goes well :), dabbled in chess engines, and have worked professionally on high frequency trading algorithms; this is basically what I do day in day out :)
Please donβt start programming like this unless it is extremely important with the performance gain. Otherwise it will take a week for the next developer just to figure out what youβre actually trying to accomplish, and also, you might just as well write your code in an assembly language instead.
I wonder what it would take for the compiler to do this for me.