CppCon 2016: Timur Doumler “Want fast C++? Know your hardware!”

Captions
Hello, my name is Timur Doumler. I work at ROLI in London, UK. We make audio software, audio hardware, and music technology, and we also make JUCE, an open-source, cross-platform framework written in C++. It runs on all major desktop and mobile platforms, it does pretty much everything you need, and it's the leading framework in the pro-audio and music-production software industry, though you can use it for many other things too. So that's my background.

In audio and music we are essentially in a very low-latency, real-time environment. What does that mean? Some of you do games, where you render 60 frames per second or so; we have to render 44,000 audio samples per second, in real time. That's a fascinating problem, and if you want to know more about how it actually works, check out my talk from last year's CppCon, "C++ in the Audio Industry", where I go into more detail. The bottom line is: we use C++ and we care about performance.

One thing I've found is that if you care about performance, you learn things like: quicksort is quicker than bubble sort, or the erase-remove idiom is faster than removing elements from a vector one by one. You can understand all of that by knowing the C++ language very well and knowing your data structures very well. But there's something I find really frustrating: no matter how well you know the language and the data structures, sometimes that's just not enough. You also need to know how the hardware works in order to explain why something performs or doesn't perform, and I find that notoriously hard.

Our code runs on something like this — this is what an actual chip looks like. But if you're a C++ programmer, what does that mean for you? What do you need to know? You start googling and you find diagrams of, say, some Intel architecture, threads on Stack Overflow discussing this stuff, optimization manuals that are very hard to read, computer-science books about hardware. But why do C++ books not talk about this? They say things about being cache-friendly and about data structures, but normally that's not very useful if you just want to write C++ code. That's where this talk fits in: I want to do a little bit better than that.

So I want to talk about performance and hardware. To be clear, this is an introductory talk — an overview. If you know all this stuff already, maybe it's useful as a refresher, or at least I hope it will be entertaining. I'm going to present a few hardware-related effects, a few simple C++ snippets that demonstrate what's going on, and some benchmarks. The goal is to end up with a mental model of the hardware that is simpler than that slide and useful for your daily C++ job, if you're in any of the industries that do number crunching in C++: audio, games, finance, scientific computing, and so on.

So let me start. For a long time after I started programming, my mental model of the computer was something like this: you have memory, and your numbers
live in that memory and have an address; and you have a CPU that does all the computation, fetching data from memory and writing it back. You can get quite far with this model — you can learn C++ and do everything with it — but it turns out that sometimes it's not quite enough. I remember, as a student, the first situation where it wasn't enough: traversing a 2D array. I think that's the classic example, so it's my first code snippet. You have a two-dimensional array, and you need to traverse it and touch every element. There are two ways you can loop through it: in row-major order or in column-major order. In C++, row-major order matches how a 2D array is actually laid out in memory — it's one contiguous data structure — whereas column-major traversal has you jumping back and forth in memory. (A sketch of both loops follows below.)

For the rest of the talk I'm going to show snippets like that, together with micro-benchmarks. A word about micro-benchmarks — measuring, say, how fast a single loop runs: they are dangerous. There's confirmation bias, where you tend to find exactly the thing you were looking for; or the optimizer optimizes away the very thing you were trying to measure; and so on. So don't do that, or be very, very careful when you do. Today I'll be the guy who does it anyway, but please don't — measure your real code, because that's the only way to actually find out where your problem is.

The last time I gave a similar talk I used my own hand-written macros with timers, but since then I've discovered Google Benchmark, which is really great. At last year's CppCon, Chandler Carruth gave a keynote that covered, among other things, Google Benchmark and how to use it — thanks, Chandler. I've since switched to it and can really recommend it for micro-benchmarks.

For these benchmarks I use the three compilers I work with — Clang, GCC, and Visual Studio, in their most recent versions — with default settings and full optimizations on, so something like -O3. Of course there's a lot you can tweak with compiler flags and the results will change a lot, but I think 90% of people just open Xcode or Visual Studio and hit Build in Release, so that's what I'm going to do as well. I ran the benchmarks on this very MacBook I'm presenting from, and also on an older one with a different processor architecture. I could have included ARM and other architectures too, but that's not the point; the point is to give a quick overview of different aspects of hardware and performance.
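In code, the two traversal orders look roughly like this — a minimal sketch, assuming the matrix is stored row-major in one contiguous std::vector (the function and variable names are mine, not from the talk's slides):

```cpp
#include <cstddef>
#include <vector>

// Sum all elements of a size x size matrix stored contiguously in row-major order.
long long sumRowMajor(const std::vector<int>& m, std::size_t size)
{
    long long sum = 0;
    for (std::size_t row = 0; row < size; ++row)
        for (std::size_t col = 0; col < size; ++col)
            sum += m[row * size + col];   // walks memory sequentially
    return sum;
}

// Same work, but column-major: every access jumps `size` ints ahead.
long long sumColumnMajor(const std::vector<int>& m, std::size_t size)
{
    long long sum = 0;
    for (std::size_t col = 0; col < size; ++col)
        for (std::size_t row = 0; row < size; ++row)
            sum += m[row * size + col];   // strided, cache-unfriendly access
    return sum;
}
```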
I'm only going to call out the different compilers and machines where there's actually a significant difference in the results; otherwise, if I show just one bar, you can assume it looks similar on all of them.

OK, so back to this array. Let's take an array that's maybe 10 megabytes — say, a picture of your cat — and loop through it the right way and the wrong way. If you loop through it the wrong way around, it's about 30–35 times slower on this particular machine; on the other one it was about 40 times slower. That's a typical number, and obviously very bad for performance. The first time I saw this, many years ago, my supervisor told me: that's because processors are really good at looping contiguously through memory and very bad at jumping around, which is what happens when you loop the wrong way. And that's true — processors are much better at scanning through contiguous memory. This is the reason std::vector and std::array are so much better than the other containers, and also the reason why, if you need a sorted container, a sorted vector or something like boost::flat_set usually outperforms a std::set.

But to really understand this, it's not enough to know that the memory is contiguous; we have to revise the mental model. As you will know, because main memory is slow, there is a cache in between: it holds less data, but it's much faster and closer to the CPU. If a structure like this array doesn't fit into the cache, then looping through it the wrong way gives you cache misses, and you have to retrieve data from main memory, which is very slow.

Here's another example. Take an array that's really big — 512 megabytes, so it doesn't fit into the cache — and loop through it touching only every nth int: every int, every second int, every fourth, and so on. You would expect that the fewer ints you touch, the less work it is and the less time it takes. That's mostly true, except that everything up to a stride of 16 ints — 64 bytes, which is the cache line size on this particular machine — costs about the same: touching every int is essentially as cheap as touching every 16th int. What we're seeing is that memory comes in cache lines. You don't fetch an individual int; the cache fetches a whole cache line at a time, so once you touch one int on a line, all the other data on that cache line comes for free.
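A minimal sketch of that strided loop, assuming the only "work" per element is adding it to a sum (the function name is illustrative):

```cpp
#include <cstddef>
#include <vector>

// Touch every `stride`-th int of a large array. For strides up to 16 ints
// (64 bytes, one cache line on the machine used in the talk) the run time
// stays roughly constant, because each access pulls in a whole cache line anyway.
long long touchEveryNth(const std::vector<int>& data, std::size_t stride)
{
    long long sum = 0;
    for (std::size_t i = 0; i < data.size(); i += stride)
        sum += data[i];
    return sum;
}
```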
Now let's come back to the case where we loop through the array the wrong way around. That was a 10-megabyte array, but we can see how things change if we vary the size of the array. Same plot, varying the size: the solid line is looping the right way around, the dashed line the wrong way. You see there's a base penalty just because you don't loop contiguously — probably the optimizer can't vectorize it, for example — but you also get these steps. The steps occur at the points where the array gets so big that it no longer fits into the level-2 cache, so you get level-2 cache misses and the penalty goes up; and when it gets bigger still and no longer fits into the level-3 cache, you get level-3 cache misses, which are much more expensive, and the penalty jumps again.

So the cache is hierarchical. On a desktop machine you typically have level-1, level-2, and level-3 caches, and between each level there's roughly one order of magnitude in size — and the bigger they are, the slower they are, also by roughly an order of magnitude. By "slower" I mean latency, not throughput: it's not about how much data per second you can move between the levels, it's about how long it initially takes, when you get a cache miss, to fetch one int from the next level up.

Let's go back to the 2D array for a minute. This benchmark is really boring, because we don't actually do anything: we just loop through the array and do one addition. That's not a very realistic case, so let's add some work. This is just made-up stuff: each time you access a number in the array you do some computation — a multiplication, an addition, a hash, a square root — which doesn't mean anything, it just keeps the CPU busy. Now run the same benchmark, looping the right way and the wrong way around, and it looks like this. I think this is interesting: for most array sizes you don't see a difference, but once you get to very large arrays, the version that loops the wrong way around — the dashed one — suddenly gets slower. You can basically draw a line here. To the left of that line, the execution time of your program is bound by the computation you're doing — the hash, the square root, whatever number crunching your program does — so it's determined by how fast the CPU computes. To the right of that line, the execution time is entirely determined by how fast you can get the data out of the cache; it no longer matters how fast the computation is, because you're bound by data access.
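For reference, a sketch of the kind of per-element "work" used in this comparison — the exact operations on the slide don't matter, the point is just to keep the CPU busy, and these particular ones are made up:

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Made-up per-element work: a multiply, an add, a hash, and a square root.
inline double doSomeWork(int x)
{
    double v = 2.0 * x + 1.0;
    v += static_cast<double>(std::hash<int>{}(x) % 17);
    return std::sqrt(std::fabs(v));
}

// Same traversal as before, but now each touched element costs real CPU time.
double traverseWithWork(const std::vector<int>& data)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i)
        sum += doSomeWork(data[i]);
    return sum;
}
```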
You can see this even more clearly if you overlay the two plots: the black bars are the plain loop where we do nothing but one addition, and the red bars are the version with all the number crunching on top. For small arrays the number-crunching version is, of course, much slower. But as the array gets bigger, with the bad traversal order — draw that line again — the version that computes a lot and the version that computes nothing run at the same speed, because you're no longer bound by computation time, you're bound by data access. That's really important to keep in mind, especially when you have a performance problem in your code.

What do people do when they have a performance problem? They profile. But that's not always helpful. If you time-profile this code — the red version — this is, for example, what Xcode shows you (Visual Studio would look very similar). The time profile tells you that, unsurprisingly, you spend 100% of your time in main, and that 13% of that time is spent in some template stuff which is essentially the square root. The hash is missing, so you might say: OK, maybe the compiler inlined the hash; that's the only other computationally expensive thing going on, so you go off and optimize your hash function. You come back with a hash function that's twice as fast, run your program again, and it runs exactly as slowly as before — because it doesn't matter how fast your computation is when you're bound by data access. And the worst thing is that time profilers don't show you this: they show you how much time you spend in each function, not that you actually spent most of that time waiting for data. Xcode does have a counter, quite hidden away, that tells you how many cache misses per second you have — but again, that's just a number. One million cache misses per second: is that a lot? I don't know. I find it genuinely difficult to profile this stuff and work out what's going on.

The key is to have the right data structure and to access the data you need in the right order. If you want to traverse a 2D array, traverse it so that you scan sequentially through memory. If you have an image file and you typically edit small regions of it, maybe a flat row-by-row layout is not the best way to store the image; maybe a tiled layout is better, so that whenever you change something in a tile, everything you touch is close together in memory. For another problem, some other traversal order might be best. It doesn't actually have to be optimal — it just has to be good enough that you're not bound by waiting for data to be fetched. That's what you really want.

That leads to things like this (a made-up example): if you have two different classes, Foo and Bar, you have arrays of both, and you always tend to use one Foo and one Bar together in some doSomething() function, then it's probably a better idea to stick that Foo and that Bar into one struct, so they sit right next to each other in memory. Then the data you actually use together is close together, your program is less likely to be bound by the time it takes to fetch it, and you can go off and optimize your actual computation.
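Roughly, that idea looks like this — a sketch where Foo, Bar, their members, and the computation inside doSomething() are all placeholders, not the talk's actual code:

```cpp
#include <vector>

struct Foo { float a, b; };
struct Bar { float c, d; };

// If doSomething() always uses one Foo together with the matching Bar,
// keeping them in two separate arrays means every call touches two
// unrelated regions of memory, so pack the pair into one struct instead.
struct FooBar { Foo foo; Bar bar; };

float doSomething(const FooBar& fb)
{
    return fb.foo.a * fb.bar.c + fb.foo.b * fb.bar.d;   // illustrative computation
}

float processAll(const std::vector<FooBar>& items)
{
    float sum = 0.0f;
    for (const auto& fb : items)   // one contiguous, cache-friendly scan
        sum += doSomething(fb);
    return sum;
}
```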
The same holds in time — temporal cache locality. If you use the green piece of data, then the red and the blue ones, and then the green one again, by the time you come back to it it may already have been evicted from the cache. What you ideally want is to work with the green data, then the red data, then the blue data: if you generate some data, use it immediately, rather than letting it sit in the cache and hoping it will still be around later — because maybe it won't.

One last thing about traversing arrays. We've seen the correct, contiguous way, and we've seen jumping with a constant stride. What about jumping around randomly? That's often what traversal looks like with a std::list or a std::map. At last year's Meeting C++, Klaus Iglberger gave a great talk about taming the performance beast, in which he benchmarks a lot of different STL containers — I can really recommend it, so I'm not going to talk much about STL containers here. Just one thing: to avoid that situation, if you do have to use a list or a map, don't keep inserting and deleting, because you end up with nodes scattered all over memory. It's better to allocate the nodes close together up front and then reuse them, for example with a custom allocator. But let's talk about arrays, because in scientific computing you use arrays most of the time, and in audio we use arrays most of the time, so it's really important to understand them.

So, the same array again, and the third example here does the same thing, except that instead of traversing the array the wrong way around, I'm accessing random elements each time. You might think the second and third cases would perform similarly, because both jump around a lot, both access the data in the wrong order, and both get cache misses all the time. That's true — but if you benchmark it, the random one turns out to be much worse even than the wrong, column-major order: random access is more than a hundred times slower than accessing the array in the right order. And if you look at how fast caches are and what their latencies look like, a factor of about a hundred is actually what you would expect; that's roughly the price of a really bad access pattern.
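One way to reproduce that worst case — a sketch; the talk's actual benchmark may differ in details — is to pre-shuffle an index array and chase it:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Visit every element exactly once, but in a random order.
// Each access lands on an unpredictable cache line.
long long sumInRandomOrder(const std::vector<int>& data)
{
    std::vector<std::size_t> order(data.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937{42});

    long long sum = 0;
    for (std::size_t idx : order)
        sum += data[idx];
    return sum;
}
```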
But now the question is: why is the column-major case not quite as bad as the random one? There are probably several reasons, but one of them is that even though column-major order is a bad way of visiting an array, you still use the same stride every time — you jump, but you jump by the same number of bytes on each access. Modern hardware has a thing called the prefetcher, which helps a lot with performance without people even noticing. The prefetcher sits between the CPU and the cache and listens to all the traffic between the two, and whenever it notices a pattern — for example, a constant stride — it latches onto that pattern and starts prefetching. That's why a regular access pattern, even one you'd expect to be slow, can still be sped up somewhat. I read somewhere that in practice the prefetcher speeds up typical high-performance code by something like ten to thirty percent, which I find really interesting: the hardware helps you and you don't even notice.

Let's look at one last example about the cache. Again we have a big array and we access every nth element, but this time the number of ints I access stays the same: I access 10,000 ints every time, whether that's every second int or every twentieth. You'd expect that, at least once every int lands on a different cache line, 10,000 accesses would take roughly the same amount of time. And that's the case: touching every 16th int takes the same time as touching every 128th int. But when you get to a stride of 256, performance suddenly drops by a factor of twenty. Then it recovers and is fine again until you get to a stride of 512, where it again drops by a factor of twenty. When I first saw this I found it really surprising, because for some reason choosing a stride that's a large power of two seems to be a very bad idea — which is counter-intuitive, since computers are supposed to be really good with powers of two. Who can tell me why this happens? Right, exactly. Every int is on a different cache line anyway; what you're referring to is called cache associativity. Let me try to explain that in one minute — let's see how well I can do this.
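The experiment looks roughly like this — a sketch assuming the array is big enough that `count * stride` never runs off the end:

```cpp
#include <cstddef>
#include <vector>

// Touch a fixed number of ints (e.g. 10,000), spaced `stride` ints apart.
// The amount of work is identical for every stride, yet on the machine from
// the talk strides of 256, 512, ... ints were roughly 20x slower: the touched
// addresses all map to the same few cache sets and keep evicting each other.
// Precondition (assumed): data.size() > count * stride.
long long touchFixedCount(const std::vector<int>& data,
                          std::size_t stride, std::size_t count)
{
    long long sum = 0;
    std::size_t idx = 0;
    for (std::size_t i = 0; i < count; ++i, idx += stride)
        sum += data[idx];
    return sum;
}
```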
Imagine all the bytes in your memory are cars, and you have a hundred thousand of them — that's your main memory. Every car has a license plate; that's its memory address. Your cache is a parking lot with ten parking spaces, so you can park ten cars there. Now, there are different ways to build this cache. One way is to say any car can park anywhere on the lot — that's a fully associative cache. It's really good because it's flexible: you can use the cache in a very optimal way and you don't suffer that many misses. The problem is that it's harder to build — it needs more circuitry — so it's more expensive and slower. The other extreme is a direct-mapped cache: if the license plate ends in 2, the car can only go to parking space number 2; if it ends in 3, only to space number 3. That's much easier to build in silicon, because it's trivial to route each piece of data to its place, so it's faster and cheaper — but you suffer more cache misses, because you can't always keep the things you need in the cache. Most desktop machines use a compromise, an n-way set associative cache, where the cache is divided into sets and each set individually behaves like a small direct-mapped cache; in this picture there are two ways, so every car can go into one of two slots.

Now do the math. You can look up your cache parameters — there's usually a command-line tool for it. Say your level-2 cache is 128 kilobytes and 8-way associative, which is fairly typical: that means there are 8 ways of 16 kilobytes each, so addresses that are 16 kilobytes apart map to the same set. Now imagine you have 20 ints, each 16 kilobytes apart, and you want to access those 20 ints. It's not much — just 20 numbers — but those 20 numbers end up competing for the same eight slots in the cache. There are only eight places they can go, so they keep evicting each other all the time, you get a cache miss on every single access, and performance drops by a factor of 20. So that's cache associativity. I found a web page called "A Gallery of Processor Cache Effects", by Igor Ostrovsky — I really recommend googling it. His examples are in C#, so not all of it translates to C++ in the same way, but there are a lot more cache effects there and it's quite interesting.

One last note about the cache: there's a data cache, and there's also an instruction cache. When you execute code, the instructions have to be fetched too, and they go through the same kind of cache hierarchy. In practice the instruction cache isn't that much of an issue; code that's bad for the instruction cache is essentially code that jumps and branches a lot, and we're going to talk about branches later, because there are bigger problems there. For now, let's talk about memory itself.

We've talked about the cache, but here's another thing I believed for a long time: that memory is essentially a bunch of slots, and whatever data you have — different data types, an image, anything — can simply sit anywhere
in that memory. Then I found out that that's not actually how memory works. Memory looks much more like this: every type has an alignment requirement, there's a word size, and there are slots. Take an int, for example — in C++ you can query this with alignof, the keyword that gives you the alignment requirement — an int is not only four bytes big, it can also only live at an address that's a multiple of four bytes. The word size is four bytes on a 32-bit machine and eight bytes on a 64-bit machine, and memory is organized around that.

This leads to a curious property: if you declare a struct with various members, those members are laid out in memory respecting their alignments — and if you change the order of the members, the size of your struct can change. That's the memory layout of a struct, and the compiler does it for you; Clang, for example, has a flag, -fdump-record-layouts, that displays the memory layout of your classes. If you write portable C++ code, this is what you get — it's just the way memory is organized. You can override it, in platform-specific ways: Clang has __attribute__((packed)), and Visual Studio has #pragma pack, which tell the compiler to produce packed structs, and sometimes that's actually required in certain applications. What the hardware then does with that really depends on the architecture, because fundamentally memory is organized in this aligned way. If you force an unaligned memory access, one thing that can happen is that the hardware simply can't do it in one go. In this picture, the eight red squares are the eight bytes of a double, and it's unaligned — so to read it, some architectures have to read one aligned eight-byte chunk, then the next aligned eight-byte chunk, shift both, and combine them into the double; that's the only way to access that unaligned double, and a write has to do the same thing in reverse. On some architectures the CPU does this for you; on others it can't, and the operating system or software has to step in, which is obviously slow. It really depends on your architecture.

This, by the way, is one of the reasons reinterpret_cast can be undefined behavior. That's something I struggled to understand for a while: there are eight bytes right there, why can't you just take them and interpret them as a double — that's what computers do, they interpret bits and bytes, right? Well, one of the reasons it's undefined behavior is that the bytes might not be aligned such that you can actually read a double from them. (The other reason is aliasing, but that's another topic.) The point is that alignment is an important concept, and we'll come back to it.
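A small sketch of both points — alignof and the fact that member order changes padding and therefore sizeof. The exact sizes depend on the platform; the values in the comments assume a typical 64-bit desktop ABI:

```cpp
#include <iostream>

// Each member must sit at an address that is a multiple of its alignment
// requirement, so the member order determines how much padding is inserted.
struct BadOrder  { char c1; double d; char c2; };   // typically 24 bytes
struct GoodOrder { double d; char c1; char c2; };   // typically 16 bytes

int main()
{
    std::cout << alignof(int)      << '\n'   // usually 4
              << alignof(double)   << '\n'   // usually 8
              << sizeof(BadOrder)  << '\n'
              << sizeof(GoodOrder) << '\n';
}
```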
I did a benchmark with this as well; I'll skip over the details because they're not important. What I found is that on this machine, a modern MacBook, it's not an issue: the newest generation of Intel processors does unaligned and aligned accesses at the same speed. In fact, in this particular benchmark the packed, unaligned version was even a bit faster, because the packed data structure is more compact. But when I ran the same test on the slightly older MacBook with the Core 2 Duo architecture, unaligned memory access was about three times slower — that's the two-reads-and-combine dance — which is a massive performance impact. The exact same binary, running on two different MacBooks, three times slower on one of them. So the message is: alignment matters. We'll come back to that.

Now let's look at something more fun, because alignment is maybe a bit dry. Here's the example: a vector of 32,000 floats — small enough that it fits into the cache — filled with random values that are either -1 or 1; that's what the std::generate call does. So I take the vector and fill it with -1, 1, 1, 1, -1, -1, 1, ... at random. (This is really silly code, by the way.) Then I want to count how many 1s I have — how many elements are greater than zero; that's the std::count_if. And the thing I'm going to do next is sort the vector before counting, and measure how long the count_if takes. Whether we sort first or not, there's obviously the same number of positive elements, so the count_if does exactly the same number of comparisons and additions in both cases. Does it make a difference if I sort first? Is it slower or faster, and by how much? 1.5 times? Someone thinks more than that. Who thinks it will be five times faster? OK, three or four people — let's see. I benchmarked it, and here are the results. First of all: it depends on the compiler. With Clang and GCC it doesn't matter. With Visual Studio, the sorted version runs six times faster — that's the red bar, six times shorter. If the data is unsorted and random, it's six times slower, even though you execute exactly the same number of comparisons and additions. Why is that?
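A rough reconstruction of that setup — the seed, the distribution, and the helper names are mine, but the shape matches the description above:

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Fill a vector with random +1.0f / -1.0f values.
std::vector<float> makeRandomSigns(std::size_t n)
{
    std::vector<float> v(n);
    std::mt19937 gen{42};
    std::bernoulli_distribution coin{0.5};
    std::generate(v.begin(), v.end(),
                  [&] { return coin(gen) ? 1.0f : -1.0f; });
    return v;
}

// Count the positive elements: the same number of comparisons whether or not
// the data is sorted, but the branch is only predictable when it is sorted.
std::ptrdiff_t countPositive(const std::vector<float>& v)
{
    return std::count_if(v.begin(), v.end(),
                         [](float x) { return x > 0.0f; });
}

// Usage: countPositive(v)
//    vs. std::sort(v.begin(), v.end()); countPositive(v);
```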
Let's look at the loop — it's basically a fancy way of writing "for each x: if (x > 0) ++count". When different compilers give different results, one way to find out what's going on is to look at the assembly output, so I did that. The Visual Studio output looks like this: the highlighted section compares x against zero, and if it's smaller it jumps, otherwise it increments a counter — a compare and a conditional jump. The GCC output is slightly different: it also compares x against zero, but then, depending on the result of that comparison, it does a conditional move — the cmov instruction — and carries on; there's no jump. And Clang produces SSE instructions; I don't even think there's a counter in there. It somehow works magically and does what I want — I'm sure if you stared at it for an hour you could understand how, and I'm sure someone here can read this stuff, but I wasn't patient enough. Anyway, they all do the same thing.

So why does the Microsoft version show this difference between sorted and unsorted? The reason is the jump, and the reason for that lies in the pipeline, the part of the CPU that your computations go through. This is the picture from Wikipedia — google "CPU pipeline" and you'll see it — showing how instructions flow through the stages. Unfortunately, reality isn't quite like that: you don't have four stages but something more like twenty on modern CPUs, and instructions are decomposed further into micro-operations; it's really, really complicated. (You could actually use this benchmark to estimate the length of the pipeline, by the way.) The point is this: the blue instruction goes through the pipeline and the red one follows. If the blue instruction is a jump, then which instruction comes next depends on its result — and that can't work as drawn, because by the time you've evaluated the result of the blue instruction, the red instruction is already in the pipeline. You need to know much earlier what the next instruction is, and if you don't know, you have to wait: the pipeline stalls and you get what's called a pipeline bubble. Maybe you can do something else in the meantime, but in a hot loop there's often nothing else to do, so you just wait — and that hurts performance.

One thing hardware designers do to solve this is the branch predictor, a unit in your CPU that guesses what the next instruction will be and speculatively executes what it thinks comes next. How it works is also really complicated, but most of the time it works really well — especially for branches that aren't problematic, like a loop condition that's true every time except the last iteration, or an error check that almost never fires. The branch predictor easily learns which way the condition almost always goes and inserts the right instruction. If it fails to predict correctly, it has to flush the pipeline and fetch the right instruction, which takes a while — but most of the time that doesn't happen.
Branch predictors these days are even better than that — they can even pick up patterns. But the one thing they can't do is predict random numbers. If it's genuinely random whether your branch is taken, no branch predictor in the world can figure out what to do. So, for example, when I write an audio algorithm, I don't put an if in the inner loop; I use bit-mask operations instead — floating-point numbers have a sign bit, and you can do some arithmetic with that — to avoid the conditional branch, because there are real-world cases where it really does hurt performance.

Another place where branches happen is virtual function calls. In this example we have a very simple, made-up class hierarchy — mammals, dogs, and cats, the popular example from programming books — with a virtual function that just returns a different number for each class; very simple. We take a vector of mammals, put 10,000 dogs in it, then 10,000 cats, loop over all of them, and add up the numbers returned by the virtual function. That's one way of doing it; the other way is to shuffle the vector first. (By the way, I'm really sad that C++ deprecated std::random_shuffle — for easy things like this it's a very handy function; I don't care whether it's a Mersenne Twister underneath, I just want to shuffle stuff.) Either way it's the same work: you always have ten thousand dogs and ten thousand cats, and you always call the virtual function once per element in the loop. But if you benchmark it, the shuffled version suffers the branch-predictor penalty again — and this time on all of the compilers, because there's no way the compiler can optimize away the virtual call. (Sometimes it can actually devirtualize, but most of the time that's not possible, and it doesn't happen here.) People sometimes claim that virtual functions and vtables are bad for performance. I'd say that 99% of the time that's not a problem — but when it is, the problem is usually not the indirection and the offset, it's the additional, hard-to-predict branch you introduce by having to figure out which function to call. OK, so that's the branch predictor.
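A sketch of that experiment — the class names follow the talk, but the return values, counts, and seed are illustrative:

```cpp
#include <algorithm>
#include <memory>
#include <random>
#include <vector>

struct Mammal { virtual ~Mammal() = default; virtual int id() const { return 0; } };
struct Dog : Mammal { int id() const override { return 1; } };
struct Cat : Mammal { int id() const override { return 2; } };

// Build 10,000 dogs followed by 10,000 cats; optionally shuffle, then sum the
// results of the virtual calls. Shuffling makes the call target unpredictable.
int sumIds(bool shuffled)
{
    std::vector<std::unique_ptr<Mammal>> animals;
    for (int i = 0; i < 10'000; ++i) animals.push_back(std::make_unique<Dog>());
    for (int i = 0; i < 10'000; ++i) animals.push_back(std::make_unique<Cat>());

    if (shuffled)
        std::shuffle(animals.begin(), animals.end(), std::mt19937{123});

    int sum = 0;
    for (const auto& m : animals)
        sum += m->id();   // virtual call: target depends on the dynamic type
    return sum;
}
```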
Let me jump to the next example — let's see how much time we have left, fifteen minutes, OK — let me skip ahead and quickly go through the false sharing one. Caches are actually shared between cores. In this example we have an atomic counter and a function work() that does something silly with it: it just keeps incrementing it. We access this counter from four different threads, all doing the same work, and it turns out that the more threads you have, the slower the whole thing gets — in this case it's actually faster to run the threads one after the other than in parallel. The reason is what happens when a piece of data is shared between cores: the variable a is present in the cache levels of both cores, and when core 1 modifies it, that modification has to be flushed down to the shared cache level, which invalidates the copy of a that core 2 holds. The next time core 2 wants to access a, it takes a cache miss and has to fetch it from the higher level, and that's a performance penalty. This is also why std::atomic is sometimes surprisingly slow, even compared to mutexes. So far this is well known.

But there's one interesting effect here that you can sometimes observe in real code, called false sharing. In this example you don't have one a, you have four different variables, and each thread works on its own one — so no data at all is shared between the threads. Yet if you run this, you still suffer an immense performance penalty: running the four threads in parallel is about 12 times slower, and it's actually faster to run them one after the other, even though they share no data whatsoever. Surprising, right? There's an article by Herb Sutter called "Effective Concurrency: Eliminate False Sharing" with an example of how this happens in real code. The reason goes back to alignment and how memory is organized. Remember that data comes in cache lines — a cache line is like one shelf of that bookcase, and memory is basically many of those shelves. Well, guess what: a cache line is itself aligned, by the size of a cache line. Which means that in this example all four variables end up on the same cache line: if the cache line size is 64 bytes, then everything with an address from ...00 up to ...3f lives on the same line. And the coherence machinery works in cache lines, so four different variables that happen to share a cache line suffer exactly the same penalty as if four cores were hammering the same variable. One way to prevent this is to align your data structures by the size of a cache line, forcing data used by different cores onto different cache lines. If you do that, you get the red bars, where that performance penalty basically disappears.
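A minimal sketch of the fix — assuming a 64-byte cache line, as on the machines in the talk (C++17 offers std::hardware_destructive_interference_size, but this talk predates it); the counter type, iteration count, and names are mine:

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Four counters used by four threads. Without the alignas they would likely
// share one 64-byte cache line and keep invalidating each other's copies
// (false sharing), even though no data is logically shared.
struct alignas(64) PaddedCounter
{
    std::atomic<int> value{0};
};

void workOn(PaddedCounter& counter)
{
    for (int i = 0; i < 1'000'000; ++i)
        counter.value.fetch_add(1, std::memory_order_relaxed);
}

void runFourThreads()
{
    PaddedCounter counters[4];
    std::vector<std::thread> threads;
    for (auto& c : counters)
        threads.emplace_back(workOn, std::ref(c));
    for (auto& t : threads)
        t.join();
}
```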
OK, let's look at the pipeline again. The thing we saw earlier with the branch predictor — not knowing which instruction to fetch next — is called a branch hazard. There's another kind of hazard, the data hazard: the blue instruction goes through the pipeline, and its result is used by the red instruction right behind it. You can see how that's also a problem, because the red instruction has to be fetched and executed well before the result of the instruction ahead of it is available. This is about instruction-level parallelism, which is one of the reasons CPUs are so fast, and most of the time it isn't a problem: even though you write your code so that one instruction follows the other, the optimizer will spot the data dependency and reorder your instructions, and then the hardware will reorder them again — a whole extra level of complexity going on there. But sometimes the only thing there is to execute is the blue instruction and then the red instruction, you do that in a hot loop, and you have a performance problem right there.

Here's one example: a function that takes a string and converts it to an integer. It's from a talk by Andrei Alexandrescu from last year called "Writing Fast Code", which I really recommend. The problem is that the only thing the loop does is this one update, and every new value depends on the result of the previous one — a classic data dependency. Andrei spends something like half an hour of his talk on how to optimize this, how to change the implementation to get rid of the data dependency and make it about three times faster, so I'm not going to repeat that — go watch his talk.
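The shape of that function is roughly the following — a sketch of the idea, not Alexandrescu's exact code (no error handling, digits only):

```cpp
#include <cstdint>
#include <string>

// Convert a string of decimal digits to an integer. Each iteration needs the
// result of the previous one (result * 10 + digit), so consecutive iterations
// cannot overlap: a classic data dependency chain in a hot loop.
std::uint64_t toUInt(const std::string& s)
{
    std::uint64_t result = 0;
    for (char c : s)
        result = result * 10 + static_cast<std::uint64_t>(c - '0');
    return result;
}
```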
Instead, I'm going to present another example: a loop — a bit of a silly loop — over a few arrays, where you compute a, then compute b, and the next iteration uses the b from the previous one. So there's a data dependency across iterations: you compute b, and the next time around the loop you need it. How can we write this exact same loop in a better way? One suggestion: split it into two or three loops — but then you have several loops, which will probably be slower than one. Another suggestion: repeat the body several times inside the loop — that's loop unrolling, which an optimizing compiler already does for you, although sometimes doing it by hand helps. In this particular case the best fix is to rewrite it only slightly: the right-hand version does the same thing, but instead of having the first and second statements together in one iteration and the third and fourth in the next, you have the first statement on its own and then pair the second with the third, the fourth with the fifth, and so on — you just shift where the loop boundary falls by one line. Can everyone see that the right-hand version is the same as the left-hand one? (Sorry, there's a typo on the slide: the loop bounds on the right-hand version are off by one iteration.)

Now, you still have the data dependency there, right? In the right-hand loop, b[i+1] is set and then used again in the very next line of code, so you could argue this isn't actually better. So why is the right-hand version better — does anyone know? "The values will be in a register"? The computations are the same in both cases, so that's not what I observed here. "Locality"? No, the data is in the same place each time. "The compiler can see that these are the same statements"? Hmm, maybe — I don't quite understand that one. The actual reason is this: yes, it's the exact same work, but now each iteration of the loop no longer depends on the previous iteration. And if one iteration doesn't depend on the previous one, what can the compiler do? It can vectorize it. Surprisingly, none of the compilers I checked did this automatically with the loop written the "wrong" way — but written the right way, they can vectorize it. Modern CPUs have SIMD — single instruction, multiple data — so you have things like SSE and AVX, and on ARM you have NEON, where you can do, say, four additions with one instruction. And indeed, the right-hand version of the loop runs three times faster than the left-hand one. Just to double-check, here's the assembly Clang produces: the left version just does some adds and movs, while the right one gets nice SSE instructions.

There's one caveat. If you use SSE, remember that memory is aligned — you also want your data aligned to the SIMD register size. If an SSE instruction consumes four ints, that's 16 bytes, and you really want those 16 bytes to sit on a 16-byte boundary. Here's another example demonstrating that. I have two buffers of floats, and I do a multiply-add — a really common operation in audio code, in mixing for example — where you multiply one buffer and add the result into the other, in a loop. Every compiler will vectorize that. But now I do the same thing with the buffers offset: one array offset by one int and the other by two, so the data isn't misaligned for its type, but it's no longer aligned to a SIMD register boundary. Measure it: yes, it's slower. On the new Mac it's about 20% slower, which is very significant; on the slightly older Mac it's about 2.5 to 3 times slower, which completely negates the benefit you'd get from vectorization.
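A sketch of such a multiply-add loop — the exact form on the slide may differ, and the gain-based mixing variant shown here is an assumption on my part:

```cpp
#include <cstddef>

// Multiply-add over two buffers, as in audio mixing: out[i] += in[i] * gain.
// Compilers vectorize this readily; if the pointers are offset so the data no
// longer starts on a SIMD register boundary, the vectorized loop gets slower.
void mixInto(float* out, const float* in, float gain, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] += in[i] * gain;
}

// e.g. mixInto(bufferA,     bufferB,     0.5f, 4096)   // aligned buffers
// vs.  mixInto(bufferA + 1, bufferB + 2, 0.5f, 4094)   // provoke misalignment
```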
OK, I have three minutes left, so let me get to my very last example — this is a fun one. I take a float and do 10,000 multiplications on it with a number that doesn't change its value very much, and I do that with different starting floats: with 1; with 1/0, which is inf; with 0/0, which is NaN; with -0, which is actually something different from +0; and with a very small number. Is any one of those going to give us a problem? The very small number, right. And how much slower will it be? About thirty times slower on this particular Mac. The reason is that it's a denormal. With floating-point numbers, the denormals are the ones where the exponent is zero and the fraction is nonzero — the red portion on the slide — and the way hardware handles floats simply doesn't have a fast path for them. (I'm not sure exactly why it's so slow in hardware, but it is.) From software you can disable denormals: if you know about this, there's a switch to flush them to zero, and then you don't have the problem — but if you're not aware of it, you will. In my job in audio software I know of two occasions where this caused significant performance problems in production code: some algorithm was generating denormals, performance would plummet, and you'd then spend two days trying to figure out what was actually going on. So those are denormals, and that was my last example. (A small sketch of this one follows below.)

That concludes my talk; here's the summary slide. If you want fast C++, be nice to your hardware. Be conscious of whether you're bound by data access or by computation — don't be bound by data access; ideally you want to be in the sweet spot where data access and computation are balanced. Prefer contiguous data; if you can't have that, prefer constant strides to randomness. Keep data that's used together close together in space and time. Avoid dependencies between successive computations. Avoid hard-to-predict branches. Be aware of cache lines and alignment. Minimize the number of cache lines accessed by multiple threads. And don't be surprised if your hardware does weird stuff. That's the summary — thank you very much.
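To round off the denormals example above, a minimal sketch: the starting value 1e-39f is a denormal for 32-bit floats, the multiplier and counts are illustrative, and the flush-to-zero switch shown (commented out) is the x86/SSE one from xmmintrin.h.

```cpp
#include <cstdio>
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (x86/SSE-specific)

// Multiply a float 10,000 times by a factor very close to 1. If the starting
// value is a denormal, every multiplication stays denormal and can be an order
// of magnitude slower on many CPUs.
float multiplyManyTimes(float x)
{
    for (int i = 0; i < 10'000; ++i)
        x *= 0.999999f;
    return x;
}

int main()
{
    // Optional: flush denormal results to zero to avoid the slowdown.
    // _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    std::printf("%g\n", multiplyManyTimes(1e-39f));   // denormal input: slow path
    std::printf("%g\n", multiplyManyTimes(1.0f));     // normal input: fast path
}
```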
Info
Channel: CppCon
Views: 130,353
Keywords: Timur Doumler, CppCon 2016, Computer Science (Field), C++ (Programming Language), Bash Films, conference video recording services, conference recording services, nationwide conference recording services, conference videography services, conference video recording, conference filming services, conference services, conference recording, conference live streaming, event videographers, capture presentation slides, record presentation slides, event video recording, video services
Id: BP6NxVxDQIs
Length: 59min 44sec (3584 seconds)
Published: Sun Oct 02 2016