Intrinsic Functions - Vector Processing Extensions

Captions
Hello! In this video we're going to look at what are called compiler intrinsic functions. Sometimes they're just called intrinsics, sometimes they're called compiler intrinsics, it depends on who you're talking to, but the basic premise is this: there exists a set of functions that the C and C++ compiler can use to more or less program directly in assembly language, and the reason they're wrapped in functions is that the compiler can go ahead and perform optimizations on these intrinsic functions. So in a way I guess it's useful to use intrinsic functions because we're really sort of manipulating the processor at a very low level, but we can rely on the compiler to help us out where necessary; they're a good go-between between full-on assembly language and C and C++ code. When you talk with somebody about intrinsic functions, usually they are referring to the extended feature sets of your processor.

Now, my last video was about optimizing the processing of a Mandelbrot fractal, and it's probably worth watching some of that video to understand the second half of this video, at least to put it into some context, because I pulled in a load of intrinsic code that I'd written to prove a point about parallelism, and I said, well, if anybody's interested in how I got to this code I will make a separate video about it, and this is that video. But before we dive into specifically the Mandelbrot calculation code, I'm just going to give a very brief and simplistic overview of intrinsics. We did cover this slightly in the last video, but I feel that to make this a complete video I'll include a little bit more about how they get set up and how they're implemented.

Before we do, I just want to show a little demonstration of how the compiler itself is actually pretty good at using intrinsics without being told to use them. Here I have a very simple program, and I create three reasonably large arrays, each containing 16384 floating-point elements. The first thing my program will do is initialize each of these arrays to a constant value. The second thing it'll do is add array 1 to array 2 and write the result into array 3, and finally, because people write in and complain, I'm deleting the arrays. In the properties for this project I'm going to explicitly set the "enable enhanced instruction set" option to "no enhanced instructions": I don't want it to exploit any of the additional processing capabilities of the processor just yet. I'm also going to compile in debug mode, and I'm going to instruct Visual Studio, because we're in debug mode, to run to the cursor, and the bit that I'm interested in is this calculation: big array 3 equals big array 1 plus big array 2. I've told it to show me the processor registers, and we can see the execution pointer has stopped at the location I required. Now don't forget we're in debug mode, and one of the things I want to look at in debug mode is, if I go up here and go to Windows, the disassembly. It might be quite confusing to look at this assembly at this stage in the video, but it does highlight something quite interesting. If I scroll up a little bit, well, here we've got the assembly language for the implementation of the for loop, and here, for the line beneath it, we've got the assembly language for calculating the, well, the expression, and even though it looks horribly complicated we can make some educated guesses at what's going on. Here we're moving some values from memory into the processor: here we're moving in a value from big array 1, and this is the floating-point load command; then we move in data from big array 2 and we call the floating-point add command; we then use the floating-point store command to, well, write to big array 3.
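For reference, the little test program being stepped through here is something along these lines; the variable names are my own approximation rather than the exact code shown on screen:

#include <cstddef>

int main()
{
    const size_t nSize = 16384;

    // Three reasonably large arrays of floats
    float* pBigArray1 = new float[nSize];
    float* pBigArray2 = new float[nSize];
    float* pBigArray3 = new float[nSize];

    // Initialise each array to a constant value
    for (size_t i = 0; i < nSize; i++) pBigArray1[i] = 20.0f;
    for (size_t i = 0; i < nSize; i++) pBigArray2[i] = 50.0f;

    // Add array 1 to array 2, writing the result into array 3
    for (size_t i = 0; i < nSize; i++)
        pBigArray3[i] = pBigArray1[i] + pBigArray2[i];

    // ...and tidy up, because people write in and complain otherwise
    delete[] pBigArray1;
    delete[] pBigArray2;
    delete[] pBigArray3;
    return 0;
}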
So it's a very simple set of assembly instructions. I can see why assembly has a bit of a hard time, because when you look at this and you see all of these different registers and offsets it can be quite confusing, but we're not that interested in those. What we're interested in is these fundamental mnemonics down here: move, move, floating-point load; move, move, floating-point add; move, move, floating-point store; and then we jump back round to the start of our for loop.

Now, I've deliberately chosen to run this in debug mode, and it's not for the reason you might expect, but I'd also like to demonstrate why debug mode is a little bit different to release mode. In debug mode we expect our program to behave in the particular semantic way we've implied with the code that we've written. I've just nipped out of the assembly language view so we can see our program, and what I was saying was, it's important that the program behaves the way we expect it to, so when I press F10 to step through the code, I'm going to be stepping through this loop 16384 times, and the assembly reflects this. There's nothing particularly special going on here: we're just performing floating-point load, floating-point add and floating-point store, and then we jump back round to the start of our loop.

I'll stop the program, but now I want to run it in release mode, and I'm going to run to the same point with a breakpoint. Now you might think, oh hang on, you can't use breakpoints in release mode. Actually you can, it's just you've got to be aware of what's going on when you do. So here I am in release mode at the correct location, and now I'm going to look at the disassembly for it, and we can see our for loop up here. There's a bit more stuff involved in our for loop now, but here we've got the fundamental expression we're interested in, and what we see it doing is floating-point load, add, store; load, add, store; load, add, store; load, add, store, and it's doing this over and over again. It's unrolled it, so now one iteration through our loop no longer corresponds to one sequence of loading, adding and storing. The compiler has recognized that what we're doing here can be performed faster by sequentially issuing these instructions, so every iteration of our loop no longer corresponds to the increment of just one single element. The compiler has made some decisions for us here and it's optimized the code appropriately: it's doing eight additions per loop, and I think that's quite a clever thing. The compiler has looked at this and thought, all we're doing is adding together two arrays, I can really optimize this, and it will have chosen to do it like this because, well, processors are actually a lot more complicated than we give them credit for. The last thing we want to do is stall the thing, so it knows that by keeping this particular part of the processor nice and active we can get the most speed out of it, whereas in debug mode, well, we were unnecessarily branching in and out of our loop.

The point I'm trying to make here is that the compiler will often make decisions for us which it thinks are better in a given scenario, and this is why there is some truth when people say that you can't really outsmart the compiler, it knows better than you do, and that's getting very, very true these days. But what we'll demonstrate in this video is that there are situations where it simply can't: it doesn't have a large enough or broad enough view of the entire problem to make a sensible decision about how to optimize it.
So we'll come back to all of this in a minute. In the fractal video, the objective was to number-crunch as much as possible, and we managed to render quite a high-resolution Mandelbrot to a great level of depth by abusing the parallelism available on your computer. Not only did we use lots of threads in a thread pool, we also used what's known as vector processing, and accessing the vector processing components of your CPU requires intrinsic functions.

Now, I apologize for a little bit of repetition here from the fractal video, but it makes a good point. Inside your central processing unit you have something called an arithmetic logic unit, an ALU. It's the thing that does the stuff, it's the maths: it takes in an input, it might take in a bunch of inputs, and it takes in an operation, and hopefully gives us a result. It's the silicon part of your chip that can do the plus, minus, multiply, divide and other things too. Your processor has additional processing instruction sets beyond the simple ALU we're looking at here; in fact the code we've just seen in the disassembly view was using just the basic ALU. These additional instruction sets can allow us to process more inputs simultaneously, but there's a catch: even though the enhanced instruction sets are effectively like having additional ALUs on your CPU, and again these can take in inputs and produce results, they can't have unique operations. The operation specified is applied to all of the ALUs simultaneously, and this gives way to what is known as SIMD: single instruction, multiple data. So all of the data going into these ALU components can be unique and wonderful, but the instruction they're all operating on is the same: single instruction, but working on multiple data.

And so the next natural question to ask is, well, how many of these ALU "cores", in quotations, are available to us? Well, if we first look at a slightly older technology, SSE, which was streaming SIMD extensions, or SIMD streaming extensions, I can never remember, these provided a special register which held 128 bits, and the nature of these bits was customizable. For example, we could easily store inside this four 32-bit floats; we could also store four of the other 32-bit data types, such as integers; if we're working with, say, a 64-bit double we can store two of those; and we can even have a more fine-grained parallelism too, we could store for example eight 16-bit values such as short integers. Therefore it's not directly possible to say there's a specific number of additional ALU "cores"; it really depends on how these registers are structured. Taking the first line as an example, it would allow us to operate on four floating-point values simultaneously, but the operation would need to be the same for all of those values.

For various business, political and legal reasons SSE stopped growing in size, and eventually a new technology called AVX came out, and currently most processors will ship with AVX2, but not all of them, and this is one of the problems with intrinsics: even though you think you might have bought the most fantastic whizzy processor you can afford, it's always worth checking its feature set on the data sheets before you buy it, it's not always about clock cycles. Different processor models will have different feature sets available, and some of the more budget processors, for example, won't support this feature set. But in principle the only real difference between AVX2 and SSE is that the registers are now 256 bits wide, which of course allows us to fit in eight 32-bit floating-point values or four 64-bit values, and it was this storage of four 64-bit doubles which I exploited to help me render the Mandelbrot fractal faster.
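To make the SIMD idea concrete, here's a tiny sketch of my own, not code from the video, using real SSE and AVX intrinsics: one add instruction operates across every lane of a register at once.

#include <immintrin.h>

// One SSE register (128 bits) holds four 32-bit floats, so a single add
// instruction performs four additions simultaneously.
__m128 AddFourFloats(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);        // four float additions in one instruction
}

// One AVX register (256 bits) holds four 64-bit doubles (or eight 32-bit
// floats as __m256), so the same idea applies at twice the width.
__m256d AddFourDoubles(__m256d a, __m256d b)
{
    return _mm256_add_pd(a, b);     // four double additions in one instruction
}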
The reason we need intrinsic functions to exist in the C++ language is, well, C and C++ have no way of really knowing how to handle the maths between these types of registers, these extended registers, and that's because the language itself strives to be sort of implementation independent. So when we want to use and abuse these extended feature sets, the compiler needs to be aware that they exist, because ultimately it's got to produce unique assembly code which can actually use these instructions, because these instructions will be unique to the processor: they are not a standard set of instructions in the x86 format, they are an extension, and therefore we need them to be defined somewhere. So compilers do need to be aware that these platforms exist, but the language itself doesn't really have any mechanism for expressing operating on eight simultaneous floating-point values, for example, so we need to dip into the intrinsic functions to allow us to do that. And, as I'm about to demonstrate, your C++ compiler is actually quite proficient at identifying that, if it is allowed to use some of these extension sets, it can do so in certain situations.

I've come back to our very simple program again, and this time in the project properties I'm going to enable the application of AVX2, Advanced Vector Extensions 2. Here you can see the others: SSE, SSE2, and, even though my processor doesn't support it, the latest version of the Visual Studio compiler does support AVX-512; I'll give you one guess if you can identify what the 512 really means. So I'll choose AVX2 and click apply, and just as we did before I'm going to run in debug mode and we'll have a look at the assembly that's produced. Here we've got our for loop, and as before we don't really need to understand everything that we can see here. What we are interested in are these two instructions: here we've got vmovss and vaddss. These are the vector-extension-specific assembly language instructions, and we can see that it's moving data into register xmm0. If I put my cursor over xmm0 (our array, don't forget, is a floating-point array, that's 32-bit floats), and on the left here I've pulled in the full 128-bit register for xmm0, as I step through the code, well, it's only pulled in one value. This is the hex for the entire register, so we could split that into four 32-bit values, but there's also something else strange going on here: I thought we chose AVX2, so why are we using these bog-standard 128-bit registers, where are my 256-bit ones? I've enabled them to be viewed down here, and, well, something interesting is already visible: we can see that they're actually shared, that this xmm0, which is part of the SSE instruction set, is the lower half of this ymm0, which is the new AVX2 256-bit register. Just to keep things a little bit clearer I'm going to get rid of this, there we go. But disappointingly, what we see is that it's really only loaded in one value, so this is the hexadecimal equivalent of one floating-point value, and if I put my cursor over the register (I'll zoom in in the video), what we can see is that one of the elements has been set to 20. Well, that's kind of what we expected, because we know that our big array 1 is filled with 20s and big array 2 is filled with 50s, and we're going to add them together. So let's have a look: xmm0 just has 20 in it. I'll execute the next line.
Well, it's added the 50, directly from memory, to xmm0, and then it's going to write that back out to memory, and it does this each time. So even though it's identified that it can use an extension to do the floating-point calculations, it's not doing anything in parallel, and this is because we're running in debug mode, and as before, debug mode needs to behave how the programmer expected it to behave. So even though we've specified we can use these extension sets, it's not using them fully, because as the programmer steps through this loop he expects to step through it 16384 times. So I guess the next question to naturally ask is, well, why has it chosen to use an extension at all? Well, typically because these extensions are dedicated floating-point processing systems, they are fast in their own right, and they're potentially faster than the standard floating-point processing capabilities supplied by the ALU, and so the compiler has thought, well, let's use the fastest resources we can, even though we're not going to fully exploit their capabilities.

So what I need to do now is compile this in release mode and see what decisions the compiler makes then. So again, I select release mode, set a breakpoint, and we'll have a look at the disassembly. Well, this looks a little better. Here we see the expression, and underneath we're moving some data around into these ymm0 registers; don't forget these ymm registers are the ones down here that are twice as big as the xmm0 registers. So let's see what happens when I execute this line. Brilliant! What we see are our floating-point hexadecimal values being loaded into the different locations within this ymm register; we've effectively loaded 8 32-bit floating-point values into this single register. I then add to that register, and all of the values changed in one single instruction, which simultaneously performed 8 32-bit floating-point additions, and we can see the values in the register when I put my cursor over it. Now, yet again, the compiler has done something else: it's also unrolled this sequence of calculations, so we've not left the loop yet, but it's decided that it would be more efficient to do multiple numbers of these additions before jumping back round in the loop again, and this is for the same reasons outlined before: once you get these extensions up and running, you really want them to run as fast as possible, you don't want anything to interrupt them, and you don't want them to stall.
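Written out by hand with intrinsics, the vectorised loop the compiler produced here is roughly equivalent to something like this; it's a sketch assuming the array length is a multiple of 8, not the compiler's actual output:

#include <immintrin.h>
#include <cstddef>

// A hand-written sketch of what the auto-vectorised release build is
// effectively doing: load 8 floats, add 8 floats, store 8 floats per pass.
// Assumes nSize is a multiple of 8; unaligned loads/stores used for safety.
void AddArraysAVX(const float* p1, const float* p2, float* p3, size_t nSize)
{
    for (size_t i = 0; i < nSize; i += 8)
    {
        __m256 a = _mm256_loadu_ps(p1 + i);   // 8 floats from big array 1
        __m256 b = _mm256_loadu_ps(p2 + i);   // 8 floats from big array 2
        __m256 r = _mm256_add_ps(a, b);       // 8 additions in one instruction
        _mm256_storeu_ps(p3 + i, r);          // 8 results into big array 3
    }
}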
It's important when you're working with code at this level that you think about data locality. We know that in our CPU we have registers, we've just seen them: we've seen the xmm registers and we've got the ymm registers. Of course data needs to be loaded into these registers, and the CPU will attempt to get that from its cache. Now, its cache could be split into several different layers, but cache is very fast, so when it's trying to access data, the first thing it will do is check: is the data it wants in its cache? Cache also sits on the CPU. If the data it's looking for isn't in the cache, that's known as a cache miss, and you'll remember all of this from Computer Science 101. If we miss the cache, we've then got to go and look in RAM. Now, it's not as common on systems today, but it's very possible that the memory we're trying to load into the registers isn't even in RAM; it could exist in a page file on your hard drive. So if the CPU looks for data in cache and misses, that's a cache miss; it could look for a particular page in RAM and miss, that's known as a page fault, in which case it then goes to see whether the page exists in some virtual memory on the hard disk drive.

So this is a really important thing to think about when you're starting to write code at this very low level: how close is the data to my CPU? If we wanted to extract something from cache we might be looking at, and I'm picking numbers out of thin air here just to prove a point, maybe something like 1 to 10 clock cycles. If it isn't in the cache, well, then it's got to go and find it in the RAM via the cache, and that could easily be ten to a hundred clock cycles, if not a thousand clock cycles maybe, but certainly getting data from a hard drive into a register, that's going to be in the millions of clock cycles. So what we want to make sure is that the data we're working with is as close to the CPU as possible, so that we're not stalling any of the instructions that we're trying to execute. Keeping your programs optimized for cache is actually quite a complicated process in its own right; I would say don't worry about it unless you have to, and with modern computer systems with an abundance of RAM it's not very often now that you see things dipping to the hard drive, so things are getting better. This is why it's good to have lots of RAM. It always makes me laugh a bit when you think about just how primitive this really is: if you've got to go and find something, your CPU is blazing along here at multiple gigahertz and then suddenly goes, oh, I need this location from memory, and that memory happens to be on a hard drive, so you've literally got to wait for a motor to spin up on your hard disk and physically move a read head to the right location. In terms of the processor's timing this is hundreds of millennia of waiting just for that bit of data to arrive. But, as I say, less of a problem these days.

And that's one of the reasons why the compiler unrolls things where it can: yes, it generates more code, it does mean your executable is bigger, but it's far more effective at just pulling in the data, because data tends to move around in groups. You don't just read one byte from the hard drive, you'll read in a section of bytes, and it's usually typical that you access bytes in a sequential order, and that's certainly implied by the nature of our program, it's very simple, we're just looping through the elements of an array. So rather than having to deal with the loop every single time it does an addition, it's worked out it's much faster to just try and do as many operations as we can. Something else to take away from this is that actually looking at the assembly language, well, it's a nightmare, isn't it? You do need to have quite a degree of skill and patience, and probably a little bit of insanity too, to actively program in this form, certainly on a desktop computer. Put simply, it's not very readable: even though we can quite happily understand what it's doing, trying to reverse-engineer all of these offsets and different registers is quite a challenge.

But now I want to demonstrate the problem. Up until this point the compiler has been able to easily assume what our intentions are, we're just adding two arrays together, but what if we add in a degree of unpredictability? Well, to do this I'm going to specify that array 1 is primed with some random numbers (I'm not bothered about the quality of that random number, it's just something different), and what if I want to change the nature of our expression from simply adding the arrays together to mostly adding the arrays together, unless the value of one of the arrays happens to be 23, in which case I don't want to perform the addition, I just want to write the value 23 into our target array.
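The modified loop is roughly this, a sketch with my own names rather than the exact code from the video:

#include <cstddef>

// Roughly the "unpredictable" version described: add the arrays element by
// element, unless the value in array 1 happens to be 23, in which case just
// write 23 into the target array.
void AddUnless23(const float* p1, const float* p2, float* p3, size_t nSize)
{
    for (size_t i = 0; i < nSize; i++)
    {
        if (p1[i] == 23.0f)
            p3[i] = 23.0f;
        else
            p3[i] = p1[i] + p2[i];
    }
}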
Totally pointless program, but it does demonstrate a point. Let's compile this and have a look at the disassembly. Up here we see the start of the for loop, fine, it's priming some information ready for us to use, and down here we see the C++ code handling that conditional expression. Rather disappointingly, the first thing we can notice is that it has gone back to using these 128-bit registers; it's not using the new 256-bit registers, so that's already a warning sign that it's not going to be doing things as optimally as it can. If we step through the code a little bit we can see it's also only changing one value in these registers at a time, so we're not even using the parallel processing capabilities of these fantastically wide registers. The compiler has completely failed to understand how to solve this problem using the technologies it's got available to it, but it knows it must solve the problem, and so it's resorted to just doing things element by element. Even though it's using the faster processing capabilities, it's not really using them in the parallel form like it managed to do before. And so it is in situations like this where we might need to start writing code using intrinsic functions ourselves: the compiler just isn't aware of the bigger picture, it doesn't know how to solve this problem. It might do in the future, compiler technology is getting much, much better, and I really do believe one day we'll have compilers which can solve these sorts of things by doing a really thorough analysis of the code, but right now, certainly for this Visual Studio compiler, it can't do it. I've not tried it using other compilers, maybe they can in certain situations, but on the whole this is too challenging a problem for the compiler to understand how to optimize and use all of its available features appropriately. So it's time to do it by hand.

In the fractal video I created different functions to handle different levels of, sort of, hand-tuning the optimization, but the reference function was this one, CreateFractalBasic. It took in a pixel coordinate on the screen that represented the top left of where we wanted to draw, and we took in the bottom right of where we wanted to draw, so that was always set to the full screen. But then we also looked at the fractal space, so this was the top left and bottom right of the zoomed-in rectangular region of the fractal that we wanted to map to the whole screen, and we passed in the maximum number of iterations it's allowed to go up to. Essentially we worked out two values, x scale and y scale, which tell us how much fractal space is represented by one screen pixel. Then we went ahead and iterated through all of the pixels on the screen and calculated the Mandelbrot equation using the complex number type. Now don't let this suddenly alienate you, we're not going to talk about fractals particularly, and if you've not watched that video it's not entirely relevant here, but what I chose to do was to turn this set of maths into the intrinsic form. For each stage of that video we looked at a different type of representation of the same thing, so we created a second function called CreateFractalNoComplex, under the assumption that the complex component of the standard library wasn't very well optimised. It turned out that wasn't the case, and there were a few comments later asking what was the point in doing that at all. Well, the point was to break it out into the simplest form possible, because that's the first step when we want to start trying to write things in assembly language or, in our case, intrinsic functions.
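For context, that broken-down scalar form boils down to something like this per pixel; this is a sketch in the spirit of the video's CreateFractalNoComplex, not a verbatim copy of it:

// The "manual" scalar Mandelbrot iteration for one pixel: only multiplies,
// additions and subtractions, no std::complex and no divisions.
int MandelbrotPixel(double cr, double ci, int nMaxIterations)
{
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (zr * zr + zi * zi < 4.0 && n < nMaxIterations)
    {
        double re = zr * zr - zi * zi + cr;  // real part of z*z + c
        double im = zr * zi * 2.0 + ci;      // imaginary part of z*z + c
        zr = re;
        zi = im;
        n++;
    }
    return n;
}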
So this performs exactly the same set of calculations as the basic function up here, but it does it with the most minimal required operations, and it's nice because everything is explicitly defined as being multiplies, additions and subtractions; there are no divisions in this. With that simplified form we can start to build up the intrinsic equivalent.

Rather oddly, I'll be doing some of this out of order, simply because the narrative is better, but the principle is quite similar: we're still going to iterate through all of the visible pixels. The first thing to note is that in the x direction we're going to be increasing by 4. Now, there were some additional comments about whether it would be more optimal to do this in the y-axis or the x-axis for this particular application, and this is just to satisfy those people who had asked: it doesn't really matter, because I'm not ever reading anything from memory, this is purely a data-generative approach, so there might be some very minor gains, but on the whole it really isn't a problem. So why are we doing x plus equals 4? Well, the first thing is I know I'm going to be using AVX2, which gives me a 256-bit register into which I can fit four 64-bit doubles, and that was important because we needed that precision in order to go deeper into the fractal. So I'm going to always work with 4 pixels simultaneously: I've got my entire fractal image, and at any given location I'm always processing 4 pixels, so on the next iteration I'm processing the next 4, and so forth, because my goal is to process all four pixels in parallel. Now, you might be quite right to ask, well, is this not something the compiler can just sort out for us? Well, not really. As I've just demonstrated, the algorithm isn't that simple, there are conditions in there, and we'll see that branching and conditions are the real problem when it comes to doing intrinsic calculations like this.

As I mentioned, I'm going to be doing things slightly out of order, so the first bit I'm going to attack is the while loop component. This is the bit that just repeatedly performs calculations until either it's resolved as part of the fractal or we've hit the maximum number of iterations we're allowing, because it could just go on and on and on forever. So I might just keep this down here as a quick reference; I'll worry about the loop in a moment, let's first handle this component. To try and keep things clear for this video I'm going to be bringing in little comment examples like this, and I'll also provide some source code as well for you to look at, because I've gone to town on trying to develop the comments in such a way that they make a good follow-along guide for this video. So that was the standard complex form that we're trying to resolve, and for the second approach we broke it down into the most minimal mathematical form; I've called that the manual form in this case, and so it is these expressions that we want to try and implement using intrinsic functions.

First things first, though, we're going to need some registers. Before we can do anything with intrinsic functions we need to actually include the header file, in this case immintrin.h, and it does seem to be reasonably cross-platform, but as soon as we start delving into the world of intrinsics you can't assume anything, and in good intrinsic code the first thing you should do is test your host system: does it actually have the capabilities to use these intrinsic functions? It may not have AVX2, in which case you need to provide a fallback or some other solution, or just notify the user to buy a new processor.
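As an aside, one way to do that runtime check is via CPUID; this sketch uses MSVC's __cpuidex (GCC and Clang offer __builtin_cpu_supports("avx2") instead), and a production check would also confirm operating-system support for the wider registers:

#include <intrin.h>   // MSVC-specific header providing __cpuidex

// Rough AVX2 capability check on MSVC. CPUID leaf 7, sub-leaf 0 reports
// AVX2 in bit 5 of EBX. (A full check would also verify OS support of the
// YMM state via XGETBV; this is just a sketch.)
bool HasAVX2()
{
    int info[4] = {};
    __cpuidex(info, 7, 0);              // EAX=7, ECX=0
    return (info[1] & (1 << 5)) != 0;   // info[1] is EBX; bit 5 = AVX2
}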
There are also slight differences between the Linux and Windows versions of the intrinsics library, though fortunately they're much better than they used to be, and I've found that the code was on the whole entirely compatible across both operating systems with one minor change, and we'll see that later.

So, looking at this bit of code, already I can see I'm going to need registers to hold the real components of z, the imaginary components of z, the real components of c and the imaginary components of c, and I'm also going to need these a's and b's as well, so let's pull those in. To define something as being one of these 256-bit types we use the __m256d type, and yes, everything's going to get a bit of a mouthful here. When I'm working with intrinsics I like to prefix the variables with an underscore, because it just sends that little mental flag to me that, hang on, these are not normal, there's something a bit strange going on here. So we know that we need z real, we know we're going to need z imaginary, we know we're going to need c real and c imaginary, and we know also we're going to need a and b. This type is saying: please use these 256-bit registers, and we're going to store doubles in them. Looking back at the code, I can see here I've got zr squared and zi squared; I know that I need those elsewhere later on, so it might be useful to just store those as well, so I'll put in a zr squared and a zi squared. It might be tempting to think, let's just create as many registers as we possibly can, but it's probably better practice to use as few as you possibly can, simply to maintain that data locality we were talking about earlier: we want to ensure that everything can actually reside in the processor, and we also know that there is a limited number of these special registers available. So yes, we can declare as many as we like, but at any one time I think only 8 can be resident in the CPU registers; the rest will go out to cache, and of course then you've got the potential for cache misses. So that's why I would recommend only using what you need to use; don't be too trigger-happy at creating special registers.

So let's do our first instruction, let's actually calculate zr times zr, and yes, it's going to be a copy-and-paste video, this, because these instructions are actually quite simple. So here's zr squared equals zr times zr. The intrinsic instruction we're going to use is this mouthful here, _mm256, so we're saying: let's use 256-bit registers, I want to perform a multiply between these two registers, I'm going to do them in parallel for all of them, and they are of type double. In fact, if I put my cursor over it, IntelliSense is quite happily telling me what's really going on here: it says it performs a SIMD multiply of the four packed double-precision floating-point values from the first operand and the second operand and stores the result in the destination, which is precisely what we want. So the two operands I'm going to provide are simply zr and zr, I want to square them, and I'm putting the result in this zr squared register. That's done it to four values simultaneously, and this is where I think people will start to see that this is quite simple, this intrinsics stuff, and the nice part is, yes, on the whole it is, so I can move along quite quickly. I also now have zi squared, calculated in exactly the same way.
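In code, that first step looks roughly like this; the names are my own, mirroring the video's style:

#include <immintrin.h>

// A minimal sketch of the first step: __m256d holds four packed doubles,
// and _mm256_mul_pd squares all four lanes in one instruction.
__m256d SquareFourDoubles(__m256d _zr)
{
    __m256d _zr2 = _mm256_mul_pd(_zr, _zr);   // zr * zr for four pixels at once
    return _zr2;
}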
The next part to look at is this: we're subtracting zi squared from zr squared, and the instructions look very familiar already. I'm storing this into our temporary value a, and instead of multiply it's subtract this time; it really is that simple. On the right here I'm trying to express what we have calculated in various forms, just to try and keep things as obvious as possible: when you're working with intrinsics or assembly language, that's one of those rare times where comments really do help keep things clear, because we'll see later on, when we start doing the branching and the conditional stuff, it can become quite a minefield. The next instruction to think about is this addition of c real to what we've just calculated. Fortunately we've stored that in a already, so all I need to do is add c real to a, so in this case I've called the addition function and added c real to a. So this has performed all of these calculations on four values simultaneously; we've now completed this line.

So let's calculate a value for b: z real times z imaginary times 2, plus c imaginary. Well, the first part should be quite obvious now, but the second part, well, this is going to introduce something new: we've got a constant here. What is a constant in a parallel vector? Our aim here is to calculate this expression, b times 2 plus ci, but that 2 needs to be represented in a parallel form. Now, I've adopted this notation to say we've got four doubles of value two in one of our vectors. So how do we go about setting that up? Well, the simplest way is to have a register predefined with the values you want, so I'm going to call this register "two", and I need to initialize it appropriately with the four 2 values, and to do that I can use this intrinsic function called set1, parallel double again. This means you're going to provide one argument and it will set all of the vector elements to that one argument, so now my "two" contains 2, 2, 2, 2. With that in mind I can now multiply b and then add c imaginary to it. But this is a multiply-add, and that's quite a unique thing: we could go and do it by hand like we did up here, call multiply and call add, but it's always worth checking the documentation for what intrinsics are available to you, because the developers of the extension set have realized, hang on, it's all very well having this parallel ability, but what people really want are these common compound operations, and so the extensions also provide lots of additional functions which will save significant numbers of clock cycles, and they have provided one which will multiply and add simultaneously. It's called floating-point multiply-add, parallel doubles, so we provide it b, our register filled with twos, and ci, and it will go and calculate all of that in one hit. Multiply-accumulates like this are quite an important thing in digital signal processing, so it's no wonder that it's provided here as an intrinsic function. One of the features of intrinsic functions is that, if the extension didn't have an FMA-style instruction, the compiler could go away and synthesize it using the instructions that are available on that particular piece of hardware, and that's why these functions are a little friendlier to use than writing directly in the assembly language itself.

So calculating this component was much simpler. We've now just got this very simple assignment; there's nothing special to do here, we simply assign them. Looking back at our little reference implementation, we must always remind ourselves that this is happening per element, but now we're starting to do things in groups of four elements at a time.
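Putting the pieces of the last few paragraphs together, one whole iteration of z = z*z + c for four pixels might look like this sketch; the variable names are my own approximation of the video's:

#include <immintrin.h>

// One iteration of z = z*z + c for four pixels at once. _zr/_zi are updated
// in place; _cr/_ci hold the four constants. Requires AVX2/FMA support.
void IterateOnce(__m256d& _zr, __m256d& _zi, __m256d _cr, __m256d _ci)
{
    __m256d _two = _mm256_set1_pd(2.0);          // the constant 2, four times over

    __m256d _zr2 = _mm256_mul_pd(_zr, _zr);      // zr * zr
    __m256d _zi2 = _mm256_mul_pd(_zi, _zi);      // zi * zi

    __m256d _a = _mm256_sub_pd(_zr2, _zi2);      // zr*zr - zi*zi
    _a = _mm256_add_pd(_a, _cr);                 // ... + cr

    __m256d _b = _mm256_mul_pd(_zr, _zi);        // zr * zi
    _b = _mm256_fmadd_pd(_b, _two, _ci);         // (zr*zi) * 2 + ci in one hit

    _zr = _a;                                    // simple assignments to finish
    _zi = _b;
}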
We've calculated the complex number component of this algorithm, and if we were thinking of this as a single element, we would then increase the number of iterations and then perform this condition. Conditions are a problem for intrinsics: as we saw at the start of this video, the compiler simply can't work out how to solve this problem, but fortunately the intrinsic functions work in such a way that we can emulate conditions without disrupting the flow of data through the processor. It does mean you have to think a little bit differently. This is no longer n++, because it's not just a single element we're operating on: for some of them it will be n++, and for some of them we don't want to increase n at all. So one of those elements might need an increment but the others don't; how do we handle that kind of situation? It isn't possible to simply switch off an element, because it follows the SIMD paradigm we talked about earlier: all of the elements get the same operation every single time. But since we can't switch off a particular element, we can do the next best thing, we can mask it out. We can make it so the operation is still performed on it, but ultimately it doesn't write anything out to the memory, the data doesn't get changed, and that's what makes conditions and branching a little bit tricky. We're going to have to start thinking in terms of masks, and to think in terms of masks you're going to need to be familiar with bitwise operations; if you're not familiar with bitwise operations then I have a video all about them, it's the first part of the NES emulator series, and that entire video is dedicated to nothing but bitwise operations.

So let's start thinking about how we handle this while loop. As before, I've written out what is the purest form and what was the reduced form, so we know what components to calculate. To begin with we've got some simple calculations to perform; well, we know how to do those now. We can see we've got zr squared plus zi squared; fortunately we have zr squared and zi squared already calculated, so we can reuse those, let's do that: in this case they're just added together. And what we're checking for is which elements of our newly created a vector, the result of this calculation, are less than four. Well, we can see we've got another constant here, it's four, and so as before I'm going to create a special register just to hold that constant. The next step is to actually evaluate this inequality, and there are intrinsic functions to do that, and the outputs of these functions are what are known as masks, and we can use these masks to effectively, well, emulate the switching on and off of individual vector elements. I need to store my mask somewhere, so I'm going to create another register called mask 1, and I'm creating it as the same type as the doubles, because I'm going to be comparing doubles, and that's just the way intrinsics work; we'll see there may be a little bit of casting involved later on to sort all this out, but fundamentally, if we're comparing two doubles in a 256d register then our mask will also be a 256d type. So now let's go through the inequality. In its simplest form, what I'm going to calculate now is: our mask 1 is going to be the equivalent of "if a is less than 4", and if we expand that out (and I apologize for doing this in code and not in slides, it's just simpler this way), what we're actually going to check, in each of the element locations, is: is this element less than 4?
Is this element less than 4, is this element less than 4, and is this element less than 4? We're going to do all of this in parallel. The result of a condition like this yields what's known as a mask, and for all of the elements that are true as a result of this condition, all of the bits, in this case all 64 bits, will be set to 1; if it's false, all 64 bits are set to 0. Now, I've put in these three dots here because I'm not typing out that many bits, but you don't just get a true or false, you don't get a single zero or one, you get the entire element's bit content set to 1s or 0s depending on the result of the condition, and this is very useful. I think, just to reduce the typing even further, I'm going to stick with this notation now to show exactly the same thing: all of these elements have their bits set to 1, and all of these elements have their bits set to 0. I will now go and use the intrinsic compare function to produce this mask register for me, so it can compare, parallel double, our a value with four, and what I'm comparing is less than, so this is compare less than, and this will have populated my mask with the appropriate ones and zeros in the corresponding element locations.

So now we've generated a mask for this part of our condition, and we now need to look at a mask for the other part of our condition, and things are a little bit different on this side: n and iterations are integer types, they're not doubles, and in fact they're 64-bit integer types, and when you want to work with intrinsic functions with 64-bit integers you use the __m256i type. So far I know I need one that represents n, one that represents the iterations, and I'm going to create another mask which is working with integers. Iterations here is the constant value we pass in with the function, so I'll need to expand that out, and we have a similar way of doing that with a constant: we call this set1 function again, as you can see we did it for two and four, but this time we're specifying, please use a 64-bit integer. Some of the suffixes to these functions can be a little bit cryptic, but I'll show you a documentation page for the intrinsics later on; you'll actually see they're documented quite well. So this is going to take our number of iterations and expand it across all four elements of our 64-bit integer vector register. It's important that this second mask is of type 256i, because as before, when we were comparing doubles we ended up with a double mask; we're now comparing integers, so we're going to end up with an integer mask, and even though it makes little sense, when all of your bits are zeros or ones, to be a double or an integer, that's just how it is.

For the second part of our while condition we're checking if n is less than iterations. Now you start getting into some of the quirks of the language here: for my particular install of the intrinsics header I don't have a less-than comparison function for integers, but what I do have is a greater-than, so I can do exactly the same thing here, I just need to flip things around: is iterations greater than n? And I want to populate my mask 2 with the appropriate result. The intrinsic function I'm going to use in this case is compare greater than, integer 64 types, so whereas before we could specify as an argument which particular inequality we wanted, we can't do that here; our integer types have a specific function for each comparison. So now I have two registers filled with nothing but zeros and ones in the corresponding locations, and our condition ANDs these together, so I'm going to do exactly that.
I'm going to take my mask 2 and logically AND it with mask 1, which will perform this combined condition, and I'm just going to stick with the comments here because it can help us understand what we're doing. If that's the contents of my mask 2 register (and of course these numbers are made up, but that could be the contents of it) and that's the contents of my mask 1 register, bitwise logically ANDing them together produces the result you might expect: here we've got false AND true, and the result of course is false; here we've got true AND false, which is equal to false; but here we've got true AND true, so our result is equal to true. So we've managed to reduce our number of active elements in this large register down to just the one that satisfies both parts of this condition. To perform the logical AND I'm going to use this compiler intrinsic function, and this wants to perform the operation between two 256-bit integer registers. Now, one of my registers happens to be of type double, so I can use this intrinsic function which casts a parallel double vector to a single 256-bit-wide register, like a 256-bit integer, almost. Fortunately this cast doesn't cost anything, it's there purely to keep the compiler happy, and I'm glad it exists, because it does make you think about the domains of the registers that you're using; but of course both of these registers are just 256 bits filled with zeros and ones, there's nothing stopping you doing a logical AND, and the fact that some of these zeros and ones may represent other number types doesn't matter. So this cast doesn't cost anything to do, but it does keep your code sane.

So after all of this, our mask 2 contains just the active elements that satisfy the whole condition, and if this condition is true then we are effectively incrementing our n in those locations. So now we want to check: wherever these locations are equal to all ones, that's where we want to increment our n for that pixel of the fractal. To perform this increment I'm going to add another integer-type variable, which is just going to be a temporary, c, and I'm going to use c to store a numeric integer one in the correct locations of our vector, and I'll do that by taking a predefined constant vector that stores nothing but integer ones, 64-bit integer ones in this case, and ANDing it with our mask register up here. Just to follow through with the style of the comments: our c gets filled with a genuine numeric integer one, so in this case it's all zeros but we've got a 1 in the least significant bit of that element. If I logically AND that with the mask we've calculated, I then end up with a register that contains mostly zeros, but contains a numeric 1 in the locations where we need it. This of course implies that I need a constant one somewhere in my code, and so, just as before, I've used the same set1 function to pad out all of the elements of this register with an integer 1. Now I can perform the increment for the correct elements of the vector, and that's quite simply a case of adding c to n, because c contains the ones in the right locations: if it's got zeros in an element, of course that value isn't going to change, we can't add 0 to it and expect it to change, but we can increment the others by adding 1 appropriately. And just to pad out the comments again, so you can download this source code and see what I've sort of visually described going on here: n could contain various numbers of iterations depending on how that pixel has been calculated, our c contains 1 in those locations that need to be incremented, and n, of course, is just n plus c, so after that we only see the increment in the pixels that need to be incremented.
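Gathering that up, the masked condition and increment might look something like this sketch; again it's my own approximation, using the same intrinsics described above:

#include <immintrin.h>

// Which of the four pixels still satisfy (zr*zr + zi*zi < 4 && n < iterations),
// and bump n only in those lanes. Requires AVX2.
void MaskedIncrement(__m256d _zr2, __m256d _zi2, __m256i& _n, __m256i _iterations)
{
    __m256d _four = _mm256_set1_pd(4.0);
    __m256i _one  = _mm256_set1_epi64x(1);

    __m256d _a     = _mm256_add_pd(_zr2, _zi2);             // zr*zr + zi*zi
    __m256d _mask1 = _mm256_cmp_pd(_a, _four, _CMP_LT_OQ);  // all 1s where a < 4

    __m256i _mask2 = _mm256_cmpgt_epi64(_iterations, _n);   // all 1s where n < iterations
    _mask2 = _mm256_and_si256(_mask2, _mm256_castpd_si256(_mask1)); // both conditions

    __m256i _c = _mm256_and_si256(_one, _mask2);            // numeric 1 in active lanes only
    _n = _mm256_add_epi64(_n, _c);                           // n += 1 where still active
}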
Looking at our loop, we use the condition to stop our while loop when we're working with things element by element. When we're working in groups of 4 elements at a time, we don't want to stop until all four have satisfied that condition. We've arranged it in such a way with masks that, even though elements may have already satisfied that condition, they still get computed, but they can't physically change, because we never allow the values to get written out and we don't allow any further incrementing. What we want to avoid is a situation where we terminate our loop early simply because one of the elements has satisfied the condition. Fortunately, intrinsic functions are there to help us again. Instead of using a while loop in this program I've chosen to use goto repeat, and so I'll stick a repeat label at the top; I don't think there's anything wrong with using goto and labels when fundamentally your program is an assembly language program, after all goto is a fairly fundamental assembly language instruction. We only want to goto repeat, i.e. emulate our while loop, when this condition is true for any one of the elements. We know which elements in our vector are active because the mask in the corresponding location will be set to all ones; if any of them are active, they're still processing, and we still need to keep looping. It would be very useful if there were a function that could reduce our 256-bit register into a simpler form that we can evaluate, and fortunately there is: we can use the intrinsic function move mask, parallel double. Now, our mask 2 was of type integer, so I'm casting it again, which again doesn't cost anything, and this move mask function will create a single integer value where the bits in that integer are extracted from the most significant bits of the individual elements of the vector; effectively they're the sign bits of the double values that are being used. And so in this example we've got one active element, and when we combine all of these serially into a single integer, that integer has the value 2. Well, that's greater than zero, because zero would be all of these set to zero, and if it is zero then we know there's no further computation required for that group of four pixels, so we don't have to goto repeat.

We now have an extension-set vector, n, that contains the number of iterations for each pixel, and that is of course what goes on to be visualized, but it's difficult for our visualizer to work with this funky type of __m256i; instead I'd rather they just be standard integers in a standard integer array for visualization, therefore I'll unpack our complicated vector type into four integers. For Windows, this is how you do it: you take your __m256i type and you can extract an individual 64-bit integer just by accessing it like an array. I'm not interested in 64-bit integers in this case; I know that it's not going to have gone beyond 4 billion iterations, because my computer would have melted, so I'm quite happy with casting it to an int in this case, performing a truncation; that's completely satisfactory for me. There were a few comments raised on the GitHub about people getting the fractal code to compile; there is a slight difference on Linux platforms, so I'll just include that here for completeness: you don't do it this way, you just simply access the register directly like an array.
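A sketch of that loop-control test and the final unpacking, with the Windows and Linux variants side by side; the names and structure are mine, not the repository's:

#include <immintrin.h>

// If any lane of the combined mask is still all 1s, movemask returns a
// non-zero integer and we keep iterating; afterwards the four iteration
// counts are pulled out into ordinary ints.
void UnpackSketch(__m256i _mask2, __m256i _n, int* pOut, bool& bKeepGoing)
{
    // Collapse the sign bit of each lane into one small integer
    bKeepGoing = _mm256_movemask_pd(_mm256_castsi256_pd(_mask2)) > 0;

#if defined(_MSC_VER)
    // MSVC exposes the register's elements through named union members
    pOut[0] = int(_n.m256i_i64[0]);
    pOut[1] = int(_n.m256i_i64[1]);
    pOut[2] = int(_n.m256i_i64[2]);
    pOut[3] = int(_n.m256i_i64[3]);
#else
    // GCC/Clang let you index the vector type directly like an array
    pOut[0] = int(_n[0]);
    pOut[1] = int(_n[1]);
    pOut[2] = int(_n[2]);
    pOut[3] = int(_n[3]);
#endif
}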
There's only one thing left to do, and compared to everything we've done so far it's very, very simple. Of course, these computations need to know where they are in fractal space; at the moment they're just calculating something, but we've not actually told them where they are, we've not initialized any of our complex number vector registers. When I wrote this code originally I chose to increment along the x-axis, so my registers contain groups of 4 pixels aligned horizontally, therefore I only need to concern myself with using vector registers along the x-axis; the y-axis can remain exactly the same as it did before, so I'm just going to pull in the code from the previous function, none of that's going to change, we don't need to use vector extensions to compute it. But I do need something for the x-axis, and I'm going to add in a bunch of additional registers just to handle the x-axis calculations. The x scale value here I want to be a constant, and don't forget this represents the amount of fractal space occupied by a single pixel. The x jump register I want to populate with how far in fractal space I move when I go from one cluster of 4 pixels to the next cluster of 4 pixels; well, of course, that's simply 4 times the scale. So going back to this original diagram, let's assume this location in the x-axis is 5.0, and I know that per pixel in fractal space I'm moving 0.1, so this eventually becomes 5.0, 5.1, 5.2 and 5.3, and I then jump, knowing that this location here is 5.4. To handle that, I'm going to fill my x position offsets register with the 0.1 equivalents, in this case depending on what the scale is, and we're going to use a new intrinsic function here, which is set, parallel double. It's not set1 parallel double; this allows you to set the four values explicitly, so I'm just setting those to 0, 1, 2 and 3, and then I use the multiply intrinsic to multiply these values by our scale. Ultimately this will give me a vector that contains the x positional offsets of each pixel in that vector, in fractal space. Getting complicated, this, isn't it? Using this information, and knowing where we are in pixel space, I can populate a final vector containing those pixel positions, and I need to do that in each iteration of my loop. When I'm iterating in the y-axis, I know that my x is at the start of the screen, so I can reset my x position. I'm going to hijack my temporary value a here, set its value to the left-hand side of the fractal space, and then add to that value the x positional offsets; that will give me my final x position value. The imaginary component of our fractal, ci, is simply the y location, and the real component, cr, is our x location as calculated.

Finally, we've got some initializations to do; this is why I said things would be a little bit out of order. We just want to initialize a few things to zero, so before we do any calculations, let's make sure our registers contain zero, and there is an intrinsic function called set zero; it does what it says on the tin. So I've zeroed out my z real, zeroed out my z imaginary, and I also want to zero out the number of iterations, our n value that we've been calculating; there I have to use set zero with some funky integer type suffix at the end.
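A sketch of that positional setup and the zero initialisation for one group of four pixels; the names and parameters here are my own illustration:

#include <immintrin.h>

// Prime the per-pixel x positions in fractal space for a group of four
// horizontally adjacent pixels, and zero the working registers.
void PrimePositions(double dFractalLeft, double dScaleX, __m256d& _cr,
                    __m256d& _zr, __m256d& _zi, __m256i& _n)
{
    // Offsets of the four pixels within the group: 0, 1, 2, 3 pixels * scale
    __m256d _pixelOffsets = _mm256_set_pd(3.0, 2.0, 1.0, 0.0);
    __m256d _scale        = _mm256_set1_pd(dScaleX);
    __m256d _xOffsets     = _mm256_mul_pd(_pixelOffsets, _scale);

    // cr = left-hand edge of this group in fractal space + per-pixel offsets
    __m256d _xPos = _mm256_set1_pd(dFractalLeft);
    _cr = _mm256_add_pd(_xPos, _xOffsets);

    // Zero the working registers before iterating
    _zr = _mm256_setzero_pd();
    _zi = _mm256_setzero_pd();
    _n  = _mm256_setzero_si256();
}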
If you ever do want to know what intrinsic functions are available to you, you can simply google it, but there's a good list on the Microsoft website for this particular compiler, and you can actually use this list when looking at other compilers too, because they're quite similar between the different compilers, and it's an extensive list. That's the really beautiful thing about intrinsics, there are loads of them, loads and loads and loads and loads, and it tells you which particular feature set is required in order to use each intrinsic. So for example, here is one we've been using quite a bit, _mm256_and_pd, AND parallel double: it says it needs AVX, it doesn't need AVX2 for that specific one; it tells us which header file contains the definition of that intrinsic, and gives us a prototype for the function. In fact we can click on this and it takes us straight to the Intel website, and this is actually a much better resource, because it allows you to search for the types of intrinsic that you want, and so it's found specifically that one, and gives us a little example of what it does in pseudocode, and that can be quite useful for some of the more esoteric instructions that are available.

And so there you have it, potentially the most hardcore video we have had on the One Lone Coder channel ever: how to calculate the Mandelbrot fractal using intrinsics as part of the AVX2 extension set. It is complicated stuff initially, but it really only gets complicated when it comes to this masking side of things; the rest of it, if you've got simple array operations you want to do, the instructions are quite simple. It does require that you have a slightly intuitive understanding of what your hardware is doing, and that is to be expected, you're going lower level using this stuff, and so you really do need to understand what your hardware is doing and how you can best exploit intrinsic functions to calculate what you need; the compiler can't always do it for you. As usual I'll provide this code on the GitHub so you can explore it with all of the comments at your leisure; there isn't a sort of executable program to provide with this one, it was already done as part of the fractal video. If you've enjoyed this video, and you know it's been a pretty tough one, please give me a big thumbs up, do think about subscribing, come and have a chat on the Discord, and I'll see you next time. I think next time might be the OLC Beat the Boredom Game Jam review video, so a nice community showcase one. Good stuff, take care, see you later!
Info
Channel: javidx9
Views: 77,321
Rating: 4.9791856 out of 5
Keywords: one lone coder, onelonecoder, learning, programming, tutorial, c++, beginner, olcconsolegameengine, command prompt, ascii, game, game engine, pixelgameengine, olc::pixelgameengine
Id: x9Scb5Mku1g
Length: 55min 39sec (3339 seconds)
Published: Sat May 30 2020