Fastware - Andrei Alexandrescu

Captions
good morning everyone — morning, you're late — all right, we've got one of these little batteries, an itty-bitty battery; you're engineers, you're supposed to have one on you at all times... I could use a clicker, thank you. So sorry, I'm noticing only now this one is not working — the laser does work, so if I point it in my eye it's probably going to hurt. Thank you, Hubert. All right, I've got it from here. Terrific. Good morning, my name is Andrei, and I'm seeing a couple of familiar faces and a few who might not have seen me before. Who's seen me talk before? A couple of familiar faces — thank you. So, my name is Andrei and I'm going to talk about fastware, which is to say, a talk on efficiency in systems-level languages. I had a great time getting here, and I hope you had an easier time: it was two hours thirty-five to fly from Bucharest to London, and two hours thirty-five to get from the airport to here, so that was pretty awesome. And then I wasn't wearing my badge, so the security guard is like, what do you do, do you work here? I said, I don't work, I mostly talk — and that didn't help either. OK. So I figured out the following fact, which is that I have too many things to say, so I'm going to do something I've never done before. Let me tell you the way I imagined it: there would be like 500 people here, I'd say "choose your own adventure," I'd give you the choices, and you'd vote by means of applause. Given that there are like 30 of us here, we're going to do a raise of hands instead. So we have two choices — I'm not kidding, you choose. Choice number one is a down-to-earth discussion. We'd discuss a few close-to-silicon optimizations: strength reduction, which is a traditional compiler optimization but in fact can be applied to redesign algorithms; improving instruction-level parallelism of code; and a bit of the arithmetic underpinnings of optimization — what are the links between the tricks we do in optimization and the actual fundamentals of math that make them possible. We'd get to rethink a few simple algorithms, such as converting integers to strings and back, and we'd see quite a bit of code and measurements. The disadvantages: it's a fairly low-level talk, some of the advice I'm giving may not age very well with improvements in compiler and CPU technology, and there's some nonstandard code, which not everybody likes. The second choice is new research. That would be a more research-y kind of topic, focused on two core computer science algorithms — there are only maybe a few dozen essential algorithms in computer science, of which two are Hoare partition, which divides an array into two halves, one smaller than a given element and one larger, and quickselect, which is how to find the n-th smallest element in an array. That talk would focus on improving these two algorithms by means of sentinels in unexpected places, which is an interesting topic; I'd discuss layout and iteration strategy, and I'd show significant new results, academically speaking, in an area considered closed — because partition and quickselect are literally older than everyone in this room except me; these are from '61 and '65, and ever since then they've been considered optimal, nobody worked on improving them anymore. On the downside, it's a complex topic: if you've never heard of partition and quickselect before, this talk is unlikely to cause you a lot of joy and happiness, it's difficult to appreciate without that background, and it doesn't translate into obvious practical advice for your own work that you can take home tomorrow. It is, however, a nice research-y topic. So now, by means of a show of
hands, please vote — who votes for this one? All right. And who votes for this one? Well, I've got to tell you, both are interesting — so we chose this one, the dangerous one. And I have to tell you, I did a bit of branch prediction here, because the following slides assume this choice; otherwise I would have had to change the slide set, and that would kill the AV guys. Nice. So, let's talk about strength reduction. Strength reduction is a traditional compiler optimization; it means trying to replace, in any algorithm, stronger operations with weaker, cheaper operations. The classic example — when your fifty-year-old coworker who thinks he's great talks about strength reduction, he says: oh, you should replace that division by two with a shift, and I'm awesome, right? It's not that awesome, because compilers have been doing this since about 1980, last time I looked. So this is the classic example of strength reduction, but I wouldn't recommend you sit down and do it by hand, because you open yourself up to liability: if one operand is unsigned and the other is signed, you're unclear on the sign of the result, depending on the sizes and types of the operands, whereas the division always gets it right. But this is the idea of strength reduction: you want to replace the stronger operations with simpler ones. And let me give you the hierarchy of strength here. We have comparisons — those are the cheapest. I'm saying comparison without a jump: a = b < c, that's possible, and it's actually the fastest — less than a cycle, you can think of it. Then come integral add, subtract, bit operations, and shift, each of which takes one cycle. The most awesome thing you can think of: one cycle to do a 64-bit addition, which is quite the feat if you think about it — adding two 64-bit numbers, with all the carries involved, is not easy.
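To make the signedness caveat he mentions concrete, here is a small sketch (function names are mine, for illustration): for unsigned values the compiler already emits the shift, and for signed values the hand-written shift is not even the same operation as division.

```cpp
#include <cassert>
#include <cstdint>

// Division by a power of two: for unsigned operands the compiler already
// turns this into a plain right shift, so writing the shift by hand buys
// nothing.
uint32_t half_unsigned(uint32_t x) { return x / 2; }

// For signed values, x / 2 and x >> 1 are NOT the same operation:
// division truncates toward zero, while the arithmetic shift (on typical
// two's-complement targets) rounds toward negative infinity.
int32_t half_div(int32_t x)   { return x / 2; }
int32_t half_shift(int32_t x) { return x >> 1; }
```

For example, `half_div(-3)` is -1 but `half_shift(-3)` is -2 — which is exactly the liability he warns about.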
To wit, in 2004 that was the topic of a hardware paper — how to do 64-bit addition in one cycle — and not everybody does it. But right now, whenever you have an addition to do, you can assume it's going to take one cycle, less than a nanosecond, so that's pretty awesome. Then — we're going from weakest to strongest operations — comes floating-point add and subtract, and those are a separate unit altogether; floating point is its own hardware. What follows after that is multiplication, integral and floating-point; that's more costly, stronger. Then come floating-point division and remainder — there is such a thing as a floating-point remainder; did you know about this? Some of you are taking a photo like, I've got to show this on Twitter right now. And the most expensive primitive operation in a CPU is actually integral division, by far — we're looking at a big margin between that last one and everything else. Now, this is kind of weird, because consider this: integral division takes more time than floating-point division — which is, you know, real numbers; I mean, come on. So there's got to be an explanation for that, and for this I'm going to open it to the floor: please let me know why you think integral division takes more time than floating-point division. All right, I'll give you a hint. They told me never to do this, but I'm going to walk around a bit here at my own peril. Floating-point division is faster than integral division — so let's think of the floating-point number format: 64-bit floating point versus 64-bit integer. In a 64-bit floating-point number, what do we find, how is it laid out, how is it structured? Floating point — mantissa, yes, thank you. So there's an exponent and there's a mantissa: 1 bit for the sign, 11 bits for the exponent, and 52 stored bits for the mantissa — 53 significant bits once you count the implicit leading one. The actual number you're looking at is the mantissa times 2 to the power of the exponent minus some constant. That's a floating-point number for you; it just so happens that's the way it's stored. So the number is actually some 53-bit number times 2 to the power of whatever. Now I have two numbers in this format and I want to divide them. At this point you've got to desperately reach for that high-school-level algebra you forgot most of: when you divide a to the power of b by a to the power of c, what do you get? You subtract the exponents — thank you. So when you divide two floating-point numbers, it turns out you need two much simpler operations: divide two 53-bit numbers, and take the difference of two 11-bit exponents. The floating-point format is already rigged to favor fast multiplication and division, which is pretty awesome — it's amazing they thought of that, right? This is really cool. In contrast, integral division not only has 64 bits of payload to deal with, it also needs to get to the ultimate precision — you can't settle for an approximate result when you divide two integers; you've got to go the whole nine yards, down to the last bit. On a more philosophical note, I don't agree with division in principle. I don't like division; it's unnatural. In nature, only division by two occurs — there's no natural phenomenon that is going to divide by 17; nothing in this world naturally performs a division by 17. In contrast, a lot of the other arithmetic operations do occur in nature.
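The exponent trick he describes can be written out as a short derivation (a sketch, using the IEEE-754 binary64 layout just mentioned):

```latex
a = m_a \cdot 2^{e_a}, \qquad b = m_b \cdot 2^{e_b}
\qquad\Longrightarrow\qquad
\frac{a}{b} = \frac{m_a}{m_b} \cdot 2^{\,e_a - e_b}
```

One division of two 53-bit significands plus one subtraction of two 11-bit exponents — instead of a full-width, exact 64-bit integer division.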
All right — so much for the philosophy of the day. Now, by means of an example, let us discuss strength reduction on a simple function that goes the following way: I want to count the number of digits in a number. What could be simpler? It's a very nice interview question — or a very nice question for those of you who have kids aged six to ten: you know, Johnny, write a little JavaScript program to figure out how many digits are in a number. The natural implementation would be: you have a loop that increments the running result and divides the number by 10 until you run out of digits — that's the way it works, right? All right. Now, the problem with this guy is, of course, it's going to use a bunch of divisions. So let's optimize it — let's make it as fast as possible; and actually there's a reason for it, because even though it doesn't look like a terribly useful function, it actually is. So let's make this a lot faster, presumably by reducing the strength of that division there. Actually, it's already a multiplication, because there's a paper — I think from the nineties — that showed how division by a constant can be translated into multiplication by a different, weird, very large constant, with overflow, followed by a fix-up shift or add. To wit, let me actually show you — I'm going to take advantage of the very nice tech guys who let me switch slides here, which is a non-trivial thing. OK, let's do this, I'm going to change the example here. You should all use this site, gcc.godbolt.org — it allows you to type code in C++ and see the resulting assembly in real time. So let's actually return the value divided by 10 — and you don't need to know assembler to draw very interesting conclusions from this. Take a look at the generated assembly: you're going to see that we have this weird negative constant being multiplied — mul is multiply, you don't need a PhD for this kind of stuff — followed by a shift, which is the fix-up. So, very interestingly, the division by ten got translated into a multiplication, which is the compiler at work doing strength reduction for you, and that's pretty awesome — there's a lot of work happening behind the scenes there. Terrific, very nice. All right, so let's make this faster — ideas? I'm opening this to the room: ideas to make this faster, get rid of the divisions if you can. Sorry? Use a lookup table — excellent idea, actually a good option to explore. How big would the table be? It would be fairly large, so unless we find a way to carve the space better, we'd need a very large table, or we'd have to use a table for only subsets of the digits. What else? Yes — binary search on the number, excellent: take the middle, like 32 bits, that kind of stuff. That's actually a very good idea, but it runs a bit afoul of the law of small numbers. Oh, I didn't mention that — it's all part of a play, I paid that guy. So the point is, the law of small numbers is this: most numbers in a program are small. There are a lot of big numbers out there, but that's a completely different statistic. The law of small numbers says that in most programs you're dealing with numbers between, like, 1 and 100 — and most are 0; like 90% of numbers are 0, and 90% of all strings are empty. That's your baseline; then you have a few small numbers, and then very few very large numbers. So the binary search is going to assume that the most likely thing you're looking for is a number in the billions, which is unlikely. So although it's a great idea to get us excited here — other ideas? And I kind of destroyed the mood here.
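For reference, the baseline he is describing — one integer division per digit — looks something like this (the slide itself isn't in the captions, so this is a reconstruction; the name `digits10_naive` is mine):

```cpp
#include <cassert>
#include <cstdint>

// The natural implementation: increment the running result and divide
// the value by 10 until we run out of digits. One expensive integral
// division per digit.
uint32_t digits10_naive(uint64_t v) {
    uint32_t result = 0;
    do {
        ++result;
        v /= 10;          // the division we want to strength-reduce away
    } while (v != 0);
    return result;
}
```

On Godbolt you can watch the compiler turn that `v /= 10` into a multiply-and-shift — but it's still one such operation per digit.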
Yes? OK, all right — so that gets us to the following point, and here I have a few things to say about this function. Number one: it features an infinite loop — an ostensibly infinite loop, for (;;), which spells "forever". And let me posit the following: if you want to write fast code, you always have to start with an infinite loop. The first thought is, I want to make this function fast — so the first thing you write is this "forever". I know it's paradoxical, because infinite loops take a very long time, right? But an infinite loop is the least work to loop, because the jump at the beginning of the loop does no work there. I've been working at Facebook for almost six years and I've interviewed literally hundreds of people — I held the record for the guy who interviewed the most people at Facebook for a while, because I was at these hiring events at universities and whatnot. I would ask people: please implement strstr for me — brute force, but just don't do any unnecessary work. strstr: find a substring in a string, right, it's a C function, what could be easier? And it's amazing — look at the social phenomenon underlying this question: like 20% of people can't get it working. Programmers! 20% cannot get it off the ground, cannot implement it under interview pressure to start with, which I found amazing. It's very interesting: it tells you that interview pressure is huge, and also that even for an experienced programmer it's difficult to get this rolling. The second thing is, a lot of people would start with a structured loop: for (i = 0; i < strlen(...) — oh, wait a second; no, no, that's not going to get you there, because it's quadratic, it's going to evaluate strlen every time. So as soon as I pointed that out, they would see it — oh, hoist the strlen — and they would iterate themselves, and the problem remained that with strstr, if you start with a structured loop — a classic for loop with initialization, condition, increment — it's just not going to work well; you're going to do a lot of unnecessary work. So I would say: if you want to write fast code, start with that "forever", and then if you get to a structured loop naturally, you can rewrite it — but don't start with a structured loop, because it sets up your mind the wrong way. To wit, this function cannot be structured into a classic for loop, and the way it works is the following — the law of small numbers at work. If the value is less than 10, I know what to do: I just return the result, done. Then less than 100, 1,000, 10,000, and each time I do an addition — remember, each addition here is one cycle. And if I've got something greater than or equal to 10,000, I advance my calculation by four digits at a time, dividing by 10,000, adding four to the result, and continuing. So what is this? It's not unrolling — many people look at this and say, oh, this is unrolling, of course, a classic. No, it's not unrolling, because unrolling is doing the same thing many times, and this is just different; it only looks like it. So what is the strength reduction we've done here — what did we get rid of, and what did we pay? Yes, right: we have more comparisons and a few more additions, but we do 4× fewer divisions, and for small numbers we're in really good shape. By the way, another mistake in the original code — just statically looking at it, you can tell it's inefficient by that zero: there is no number that has zero digits, so starting the count at zero is already a loss, you're wasting time. You should start from the absolute minimum number of digits a number can have, which is one. So the baseline starts at zero and then increments like an idiot — it's just a mess.
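The strength-reduced version he walks through can be sketched like this (a reconstruction of the slide; small numbers return without ever dividing, and each division that does happen retires four digits at once):

```cpp
#include <cassert>
#include <cstdint>

// Count decimal digits: the "forever" loop peels off four digits per
// division. Comparisons and additions (one cycle each) replace most of
// the expensive integral divisions.
uint32_t digits10(uint64_t v) {
    uint32_t result = 1;              // every number has at least 1 digit
    for (;;) {
        if (v < 10)    return result;
        if (v < 100)   return result + 1;
        if (v < 1000)  return result + 2;
        if (v < 10000) return result + 3;
        v /= 10000U;                  // one division handles four digits
        result += 4;
    }
}
```

Note the result starts at 1, not 0 — the minimum number of digits a number can have — which is exactly the static inefficiency he calls out in the baseline.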
In contrast, this version clearly goes the minimum road to whatever it needs to be. Great. How much faster do you think it's going to be? Who gives me, I don't know, five percent — a five percent speed improvement or more, raise your hands. Five percent or more? More? OK — twenty percent or more, who gives me twenty percent? Fifty percent? One-and-a-half-times speed improvement or more, 1.5×? A few hands — all right, thank you; we have an auction going here, I should have the fast talk of an American auctioneer. Actually, it's a lot better than that: we're looking at an improvement of between two and six times in speed, which is staggering. And by the way, about these drops here — how can we explain them? There's a drop here, a drop here, and a drop here. Sorry? Yes — it's exactly where we pay the toll of a division: we do the division here, here, and here, so it's exactly at eight-to-nine digits and twelve-to-thirteen digits, that kind of thing. Why is it not happening at four? Excellent question — I don't know, I'm not kidding. I've looked at this for a long time trying to identify the pattern at the beginning and I can't; apparently other noise in the cost of the function influences things to the extent that you don't see the price of the first division. But it's a very pertinent thing to ask yourself. Awesome. So we have a very nice gain here on a useless function — terrific, I'm very pleased: we improved a function that nobody has a use for. However, we're going to see soon that it's usable within another context. Now I'm going to talk about the minimization of indirect writes. An indirect write is a write through a pointer — like when you have an array element and you write through the index of the array. It's not a write to a local variable with a name, like a = 5 or a++ — that's a direct write. Indirect writes are your enemy; you should be very, very, very parsimonious about giving them away. Why? Because indirect writes are next to impossible to enregister — enregistering means the compiler automatically takes things like named variables and values and puts them in registers, and it all goes amazingly well. Another thing about indirect writes: an indirect write is really a read and a write. Whenever you write something, you actually do two things — a read, and then a write. Who knows why? Yes, Hubert — the cache line. More explanation? That's right, thank you: transfers in memory do not occur at word-size level. People who are not hardware-savvy tend to think that all transfers occur at — you know, it's a 64-bit machine, so 64-bit transfers between memory and the machine. It's actually not 64 bits, it's 64 bytes. So they're right about one thing: the number is the same, it's just a different unit of measure. That was a joke — I'm going to raise my hand whenever I make one, so you know when to laugh; there should be a guy with cue cards saying "laugh". So whenever you want to write a word — 64 bits, or four bytes, or eight bytes — well, what would you do as an engineer if you could only read and write 64 bytes at a time? You read the whole goddamn thing, you modify the one little piece, and you write the whole goddamn thing back. And that's the way it works: a write is actually a read followed by a write. There's no instruction in the Intel set that says, please write me a whole cache line right now — I've looked for one for a while and I couldn't find it. There are some other architectures that do allow it, and those experience something like 2× faster writes, because you don't need to read anymore.
And writing also makes the cache dirty. The problem with a cache line being dirty after a write is this: in a classic cache architecture, whenever you replace pages in the cache with new ones, if a cache line is dirty you need to write that particular line back to memory; if it's not dirty, you just throw it away. So it's better not to dirty the cache. All right — let's see how to apply this awesome piece of knowledge. We're going to implement u64toa, and our baseline looks like the following: we have a value and we want to deposit it at some address in the form of ASCII characters — the classic integral-to-ASCII conversion routine. If you crack open any library or any book on how to do these things, you're going to see that the way it works is called the Horner scheme, and it goes: I deposit the number starting from the least significant digit, which is ASCII '0' plus the remainder of dividing the value by 10 — that's my last digit. The value modulo 10 is my least significant digit, and in order to bring it into the ASCII realm I add ASCII '0' to it: if it's 0 I get ASCII '0', and if it's 9 I get ASCII '9'. This is obvious, right? OK, great — if not for you, please smile and nod. Awesome. Then I make progress by dividing the value by 10, which gives me the next digit, and I do this in a loop until I run out of value. The problem with this scheme is that I'm obtaining the number in reverse: I get the least significant digit first and the most significant digit last, which is terrible, because then I need to reverse the whole thing — that's that little fix-up loop there, which is essentially a std::reverse on the destination; I've just spelled the loop out explicitly to make it clear what's going on. Question for everybody: how do we improve this function, how do we make it faster? Yes — count the digits first. That's what we need, because right now we're discovering the number of digits simultaneously with actually writing them. So one approach that was suggested: how about we figure out how many digits there are, so we know how many we need to write, and we write them backwards in the first place. Amazing — we actually have a function that does exactly that; remember digits10? All right. Now, you've got to understand, it's unclear that this is going to make things any better, because it's a gambit: we go through the number twice, doing twice as many primitive operations on it. But we take that risk because we just said: minimize indirect writes — and that whole fix-up loop at the end of the function is indirect writes, writes through pointers. So the gambit is: I do more calculations by looking at the number twice, but I save on all of those indirect writes; the fix-up loop I throw away entirely. Awesome. So the first line in the new function is digits10 — yay, we got to use it, it's not useless. We say auto const result = digits10(v), then I position myself at result minus 1, which is the position of the last digit, and I write into the buffer at that position and decrement the position — I write the number backwards in the first place, and when I'm done, I'm done. Even better — remember the law of small numbers — the loop condition is while v is greater than or equal to 10, because if I have a number that's only one digit, I can do it a lot simpler: look at the last two lines of that thing — I just take the last digit, add ASCII '0' to it, and that's my last digit. So if it just so happens that I have a small number, I don't go through that whole loop at all; I fall straight through to the very end. Makes sense? All right, it's not very complicated. Awesome. So my question to you is: is this faster? How many percentage points is it going to give me? And this is a function that is actually useful — it's a library function that can be very intensively used in things like databases that only do ASCII transfers, and there's a bunch of web frameworks that only transport numbers across their interfaces as ASCII, so you have to do a lot of conversion to ASCII and a lot of conversion from ASCII. To wit, the Hive database — a very large-scale database used at Facebook — had this problem: all transfers occurred in ASCII format, and whenever you read a bunch of data you'd have to go through this conversion and back many times, and it would be like 20% of the whole runtime of an interesting application — just the conversion — which is amazing. Anyhow, getting back here: what is the improvement we're looking at? I don't know — 25 percent, anyone? Or more? Everyone's like, fool me once, huh? All right, not a lot of believers here. Let me tell you what I believe the sources of possible improvement are: fewer indirect writes; regular access patterns — memory is traversed just once, backwards, which means prefetching and good things; it's fast on small numbers; and data dependencies are reduced. (You're late! All right, you're not missing out — I've said nothing interesting so far.) We're looking at an amazing two-times-plus improvement for small numbers, down to something like a 14-to-20-percent improvement for very large numbers — on a function that was considered at its optimum. If you crack open GCC's or Visual Studio's implementation of this primitive, you're going to see that everybody does it the first way, the baseline; nobody does it this way.
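Putting the two pieces together, the rewritten conversion can be sketched as follows (a reconstruction from his description, not the exact slide; `u64toa` writes raw characters with no terminating NUL and returns the digit count):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// digits10 from earlier: count digits with at most one division per
// four digits.
uint32_t digits10(uint64_t v) {
    uint32_t result = 1;
    for (;;) {
        if (v < 10)    return result;
        if (v < 100)   return result + 1;
        if (v < 1000)  return result + 2;
        if (v < 10000) return result + 3;
        v /= 10000U;
        result += 4;
    }
}

// The gambit: pass over the number twice (count, then emit) so the
// digits land backwards in their final places -- no reversal fix-up
// loop, and fewer indirect writes overall.
uint32_t u64toa(uint64_t value, char* dst) {
    const uint32_t result = digits10(value);
    uint32_t pos = result - 1;                 // position of the last digit
    while (value >= 10) {
        dst[pos--] = static_cast<char>('0' + value % 10);
        value /= 10;
    }
    // Law of small numbers: a one-digit value skips the loop entirely
    // and falls straight through to here.
    dst[0] = static_cast<char>('0' + value);
    return result;
}
```

A one-digit value executes no division at all — just the digits10 comparison and the final store.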
And the fact that we applied two simple principles — strength reduction and minimizing indirect writes — to completely redesign and rethink an algorithm that was considered unimprovable: this should get you very excited; this is why you're here, presumably. Look at it: a very significant improvement which can actually make a difference in the bottom line of an intensive application. So, switching gears a bit, I'm now going to talk about a topic that is very dear to me, which I call the forgotten parallelism — because parallelism is a very hot topic right now, parallel programming and all that, and there is a kind of parallelism that we tend to forget exists. To wit, let me bring up — can anyone see this? I'm thinking about the distance to the projector and the resolution here. Thanks — squinting is not going to help. Does anyone see the letters? If you can see the letters, you're tailgating. So: it's a CPU, and the way it works is that data flows top to bottom — this is a traditional CPU diagram — and we're looking at the lifetime of an instruction. The reddish stuff is the fetch buffer and the instruction decoder — that's the instruction's childhood; then it gets to its adolescence, where there's register allocation and some more slicing and dicing. But what I'm trying to get to is where the action happens, which is the ALU, the arithmetic logic unit. And — you don't need to see the letters to figure this out — there's a lot of stuff going on in parallel there: a lot of things with arrows flowing into them and out of them, scaled out horizontally, and presumably I get to feed them all at once if I organize my ducks in a row. Then there's the memory order buffer to properly rearrange the writes and such; the data goes out from execution, the instructions get retired, and it's all good — and on the side we have caches and whatnot. The interesting thing here is that this is just one core, not the whole CPU, and you have several of those — literally in your pocket, in your laptop, on your desktop machine, on your blades in the data center. And if you don't light these units up, if you don't get to use them, they remain unlit forever — no matter how many cores you have, no matter how much you scale in the traditional parallelism sense. If you don't get to use them, they remain unused, and you're wasting hardware — and actually wasting power as well, because, I've got to tell you, even when they're not used they still need to be clocked, and the clock accounts for something like thirty percent of the circuitry and the power in a CPU. So there's a bunch of stuff going on, and essentially, if they stay dark, there's no way to recover that parallelism you lost. OK, let me actually read to you what's in there: store data unit, store address unit — that's for address computation — integer ALU, branch unit, add and move, SSE integer, floating-point add, SSE add, floating-point multiplication, multiplication, division. You have all of these things that could actually happen simultaneously, and the question now, for us as engineers, is: what are the instructions that take me to that beautiful world inside the core? I have no idea what symbols I need to use to get those units lit up. Yeah, I know — the concept sounds very fun, but what can I do here? All I know is i++. What happens if I say a[i] += ..., et cetera — how are these instructions going to translate into lighting up those awesome units? Well, actually, there is a way. This is it — this is the secret to everything:
if you want better instruction-level parallelism, all you have to do at the source-code level is have fewer data dependencies — which means we get to define those first. What is a data dependency? It simply means that the start of an operation depends on the result of an operation preceding it: this particular operation is dependent upon the previous operation. We want to make as few of those as humanly possible — minimize data dependencies. And my favorite method of teaching is by means of examples, so let's actually do the opposite function: ASCII to integer. The way it goes: there's an enforce — like an assert; you can think of it as throwing an exception if the condition is not true — so I'm validating that I'm getting a valid pointer range, STL-style begin and end, and I want to convert that particular string into a number. Awesome. I start with a result, and again, if you look at any book, it's going to do it the following way: result = result times 10 plus the current digit, and I build up the result by going through the digits from most significant to least significant, and at the end I get the actual number. There's that enforce inside the loop, verifying that I'm getting only digits and nothing else. It's just a classic, right? I'm sure any of us could sit down and write this, and it works — it's the classic approach. And of course it's suboptimal; that's the theme of the day. Actually, I hate it — it's terrible, and by the end of this talk I'm going to prove it: you're going to hate it too, you're going to have a physical aversion toward this function, because I'm going to have you look at the underlying math happening inside it, and when you write it out in math form, it just looks awful. Consider, right off the bat:
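The textbook left-to-right (Horner) conversion he's describing looks roughly like this (a sketch; the name `atou64` is mine, and a plain `assert` stands in for the enforce he mentions). Notice that every iteration consumes the previous iteration's `result` — one long serial dependency chain:

```cpp
#include <cassert>
#include <cstdint>

// Classic Horner-scheme ASCII-to-integer: result = result * 10 + digit.
// Each step depends on the step before it, so the hardware's parallel
// units sit idle.
uint64_t atou64(const char* b, const char* e) {
    uint64_t result = 0;
    for (; b != e; ++b) {
        assert(*b >= '0' && *b <= '9');   // stand-in for enforce()
        result = result * 10 + static_cast<uint64_t>(*b - '0');
    }
    return result;
}
```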
Consider it: I'm multiplying by zero like an idiot — zero times ten plus five, that's my most significant digit; all of that times ten plus two, which is the second digit; all of that times ten plus three, which is the third digit; everything multiplied by ten again, because I'm making progress through the digits, plus nine; everything times ten... I'm not going to continue, because I don't want to have a heart attack right now. In essence, when you write the process out, it looks ugly on the page — you can't format it right. And that's not just taste. The only place parentheses are great is in Lisp, because there they have a different meaning: they make a one-dimensional structure represent a two-dimensional tree, which is completely different. In math, what do parentheses do? What is the purpose of inserting a parenthesis? You force the order of operations to something that's not natural — because if it were natural, you wouldn't need parentheses. So parentheses are actually dependencies: each one means "I need to do this before I can do anything else." Essentially everything depends on what's just inside it in this approach, and you can see that by writing out the arithmetic behind the computation. It looks ugly, so one of my goals in improving this was: get rid of the parentheses, make it beautiful — and beautiful is going to make it fast. There were a number of lines of thinking I tried, and one was a divide-and-conquer idea, like a binary search: divide the string into two sections. That was an interesting start, because I saw that, say, 523924 is 523 times 1000 plus 924. That gave me the following thought: how about a representation where each digit is multiplied by its own power of ten — what if I implement an algorithm that computes this instead of that? This is beautiful, because it has no parentheses.
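A sketch of what such an implementation might look like: index into a table of powers of ten at the position of the most significant digit, then accumulate digit-times-power products, which are independent of one another. The function and variable names here are my own, not the talk's, and error handling is simplified.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Parenthesis-free conversion: result = d0*10^(n-1) + d1*10^(n-2) + ... + dn*1.
// The products do not depend on each other, so the CPU can overlap them.
std::uint64_t atou_ilp(const char* b, const char* e) {
    // Powers of ten from 10^19 down to 10^0 (20 entries cover 64-bit range).
    static const std::uint64_t pow10[20] = {
        10000000000000000000ULL, 1000000000000000000ULL, 100000000000000000ULL,
        10000000000000000ULL, 1000000000000000ULL, 100000000000000ULL,
        10000000000000ULL, 1000000000000ULL, 100000000000ULL, 10000000000ULL,
        1000000000ULL, 100000000ULL, 10000000ULL, 1000000ULL, 100000ULL,
        10000ULL, 1000ULL, 100ULL, 10ULL, 1ULL};
    if (b == nullptr || e == nullptr || b >= e || e - b > 20)
        throw std::invalid_argument("invalid range");
    // Start at the table slot of the most significant digit; for a six-digit
    // input this is the 100,000 slot.
    std::size_t i = 20 - static_cast<std::size_t>(e - b);
    std::uint64_t result = 0;
    for (; b != e; ++b, ++i) {
        if (*b < '0' || *b > '9') throw std::invalid_argument("not a digit");
        result += static_cast<std::uint64_t>(*b - '0') * pow10[i];
    }
    return result;
}
```

Each `digit * pow10[i]` product depends only on its own input character, so the additions form a reduction rather than a serial chain.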
And you can actually format it nicely — it does look beautiful, and it does no unnecessary operations: there's no "times 1", no "times 0", nothing. It's just the minimum amount of work that gets you there. The number written vertically is the same number as before, against a set of constants that are multiples of 10. So let us implement this version, which has far fewer data dependencies. Let me show you the code. I'm going to write a very large lookup table — 20 powers of 10; I'm very good at writing long runs of zeros. It's going to be a triangle, right? That large number, followed by numbers with one fewer zero each, down to ten to the power zero, which is 1. Then I'm going to index into this array of constants — don't hold it against me, the 20 there is just the length of the array — at 20 minus (e minus b). What is that index doing? Where is it placing me in the array? Exactly at the position of the most significant digit: in this particular example it positions me at 100,000. That's my starting point, and then I iterate from there, incrementing the index. So this function implements the underlying math that is simpler — and again the question is: is it going to be faster, and if so, significantly faster? Because we're interested in significant results here. I think it's going to be better — I wouldn't waste your time otherwise. So let's look at the results. Whoa — awesome: a 3.3x improvement for one-digit numbers, decaying to about 2x for longer ones. And 2x is a 100% improvement on a function that was considered unimprovable. I hope the style is
not too awkward. That's kind of a cool thing, yes. All that thinking going on in your minds right now — 3.3x down to 2x, and nicely enough, the biggest improvements are for small numbers, which is what I care about. Awesome. So don't forget: associative means parallelizable. If you give the CPU something associative, it can exploit that — the whole arithmetic machinery of compilers and CPUs is rigged toward this; they know addition is associative, so they can do it in any order. That's the reason we got a great improvement at a low level by using a very high-level approach. There's no assembly here; we don't need to do anything clever, no "there's this secret instruction nobody knows about." Simple principles get you there. All right, now let's attack that red line — the enforce, the verification that I'm getting the right thing — and make it faster, because everything that's inside a loop counts a lot. What does this verification boil down to? If *b is less than '0', or *b is greater than '9', the input is incorrect; otherwise I continue with the computation. This is the core of my loop: make that test faster. (Not too much blood in the caffeine yet, right?) Well, I'm not doing this in real time — I've been thinking about it — and the way I thought of it first: the double bar means you try the left-hand side first; if it's true you're done, and only if it's false do you try the second. That's the short-circuit evaluation in C++ we all know and love.
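The short-circuit test just described, together with its bitwise-or variant, can be written side by side. This is an illustration, not the talk's exact code; the two agree here because neither comparison has side effects.

```cpp
// Short-circuit form: the right-hand comparison waits on the left-hand one.
inline bool out_of_range_shortcircuit(char c) {
    return c < '0' || c > '9';
}

// Bitwise form: both comparisons may be evaluated independently.
inline bool out_of_range_bitwise(char c) {
    return (c < '0') | (c > '9');
}
```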
But it actually imposes a data dependency, so it should be faster to write it with a pipe — the bitwise or — because then both comparisons can run in parallel. It's semantically the same thing; it just gets there faster. So I tried this in GCC, and guess what happened: which do you think is faster, this one or that one? They're exactly the same — GCC was way ahead of me. There I was thinking "one must be smart, I'll use a pipe," and GCC is like, "yeah, I'm already there, buddy." GCC generates the parallel code from the short-circuit source: even though I specified short-circuiting, it figured out that these two tests can be evaluated in parallel without affecting the semantics of the program. (It could affect them if there were, say, a function call in there, but this is simple enough that the compiler just figured it out.) But then, I'm a human, not a compiler — I can be clever. I can use a table-driven approach: isdigit is actually backed by a large table and gets you there in constant time. So which do you think is faster, isdigit or the comparisons? Well, those two turn out to be equivalent — I was amazed: GCC sees right through the call to isdigit. It looks at it and says, "isdigit — I know what you're doing there, buddy," and generates optimal code directly, even though the library's isdigit itself does some extra testing and is not optimal. So I gave up on that line of reasoning. But I'm still bothered by the two tests — I'm testing the same thing twice, against '0' and against '9'. So what I did was use the unsigned wraparound rule — that's a scientific term you may not have heard of — which is that when you take a subtraction of small unsigned numbers, it can become a
very large unsigned number, right? Like 1 minus 2: it becomes close to the maximum unsigned value. So I did this: take unsigned(*b - '0'), and if that is 10 or more, the character is out of range — it's either below '0' or above '9'. So I replaced the two tests with one test, and even better, I get to reuse that subtraction, because I need the difference anyway — the same computation is used twice. Now, there's a profound reason why this whole unsigned replacement works. You have to wonder: I take the difference of two numbers with the wraparound rule going on — couldn't I land in the valid range by mistake? How can you be sure I'm never going to reach a "valid digit" result by accident? Yes — you're not going to wrap that far; that's the intuition, exactly. But how do you prove mathematically that you're not going to wrap that far? The way it goes is the following: subtraction with wraparound — modular subtraction — is an injective function. At this point everyone's like, "okay, you lost me, what's going on here." An injective function is a function that maps different values to different values: it is never going to map two different inputs to the same output. And it turns out subtraction in modular arithmetic has exactly this property, which means it's never going to land on a right result by mistake. So we have this very interesting principle in basic modular arithmetic that takes us to an optimization: I save a comparison and a subtraction, which is awesome. These are not tricks — we're doing basic math.
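The single-test version described above, as a sketch: subtract '0' in unsigned arithmetic, and any character outside '0'..'9' wraps to a value of at least 10, so one comparison covers both bounds and the difference doubles as the digit's value. The helper name is mine, not the talk's.

```cpp
// One unsigned comparison replaces the pair of signed ones, and the
// subtraction result is reused as the digit value. Because modular
// subtraction is injective, no out-of-range character can collide with a
// genuine digit value.
inline bool digit_value(char c, unsigned& value) {
    value = static_cast<unsigned char>(c) - static_cast<unsigned>('0');
    return value < 10;  // single test covers both "< '0'" and "> '9'"
}
```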
And to wit, the improvement is measurable and significant: the ILP line is our previous variant, and the green line is the unsigned improvement, which becomes significant for larger numbers. I also plotted this against several baselines. On the horizontal axis, at 1, I have my baseline, the built-in library function atoul; the yellow line is the sort of handwritten baseline function we showed; and the ILP-plus-unsigned version is the fastest compared with atoul. And what do you think that red thing is — five times slower than anything else, including itself? That's boost::lexical_cast, which uses streams. If you want something really, really, really slow, use stringstreams and iostreams, because they are super slow — it would be faster to sit down with pen and paper. Terrible, right? (That's another joke right there.) So we have very good improvements on a classic algorithm, and now we're going to apply one of the classical optimizations: loop unrolling. With loop unrolling we simply duplicate the body of the loop several times and then add everything up at the bottom — a very well-known optimization technique. And guess what: it helps, especially for large values, and you can see ripples in the plot depending on how many times I unrolled. Pretty awesome. So don't forget: we talked about strength reduction today, and it's not only a compiler optimization that replaces a division with a multiplication or a shift — it's a way of thinking that allows you to rethink classic algorithms into better versions of themselves.
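Loop unrolling as just described, shown on a deliberately simple summation rather than the talk's conversion function: the body is duplicated, the partial results live in separate accumulators to break the single dependency chain, and everything is added at the bottom. A sketch under those assumptions.

```cpp
#include <cstddef>

// 4-way unrolled sum: four independent accumulators, combined at the end.
unsigned long long sum_unrolled(const unsigned* p, std::size_t n) {
    unsigned long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {    // main unrolled body
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; ++i) s0 += p[i];  // leftover elements
    return s0 + s1 + s2 + s3;       // add everything at the bottom
}
```

The separate accumulators are what make the unrolling pay off: it is the same associativity argument as before, applied by hand.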
We've also discussed minimizing writes, which is a very good technique — be extremely sparing with your writes; give away writes like you give away money, don't spend them on material goods. (There's another joke, just because today is not going well. We've got to get some coffee. Who's coming to my second talk right after this? All right — get some coffee. And those who weren't planning to come: come, because it's awesome, much better than this one.) So don't forget to minimize indirect writes — make as few writes to array elements as you can. Writes to named variables are very different, because those become writes to registers, which have a much better efficiency profile. And don't forget the forgotten parallelism — instruction-level parallelism — which you exploit by minimizing the data dependencies in your code. Now, I've shown you a few examples, and there's actually a library embodying them: the Facebook Folly library, of which I wrote a fair fraction, and you're going to see a bunch of these in there. There are high-performance implementations of some classic algorithms and a few data structures, so I highly recommend you take a look at it. With this I'm done — thank you so much, you've been awesome. [Applause] Questions? Yes: which tools do I use to find bottlenecks in a program? I'm going to redirect that question to the audience, because I'm sure there are quite a few ideas out there already. What tools are we using? Right — the classic profiling tools, you know, gprof and the like; Intel has another tool — I forgot the name, actually. I just got a license, because somebody saw my talk and asked, "do you want a free license for the Intel tools?", and I said sure, it's free, why not. And also the classic roll-your-own printf-
based benchmarking and such. It shouldn't be very hard to get to the core of where the bottlenecks are, but you definitely need to measure, because I've seen so many people who optimized, amazingly well, things like command-line processing — that's another joke — squeezing literally one microsecond out of an hour-long running program. People spend time on that; it's amazing. Engineers have this spirit — like in that joke about the guillotine that wasn't working: anybody it failed on would be graciously pardoned, and the engineer who was about to be executed said, "I think I know the problem with that thing — I can fix it for you." Not quite similarly, but relatedly, we have a tendency to work on problems we find fun and interesting even when they're completely irrelevant, or actually work against our best interest. So definitely the first step is to measure, and you will find the most surprising results while measuring. I've been at this for a long time, and the one thing I can say is that silicon is not human: your brain can't develop its pattern-detecting intuitions for it, because silicon is just weird — it doesn't work that way. Intuition is not a good way to understand the performance of code; the gap between what people think is fast code and what actually is fast code is amazing. More questions? I'm glad the measurement point was brought up. All right — I'll be around here during the break. Thanks again, and see you around.
Info
Channel: NDC Conferences
Views: 15,861
Rating: 4.9459457 out of 5
Keywords: Fastware, software, c++, fastware, ndc, ndc london
Id: o4-CwDo2zpg
Length: 59min 59sec (3599 seconds)
Published: Fri Apr 28 2017