Writing Code That Runs FAST on a GPU

Captions
Break out your 3090s, boys, because today we're mining some Bitcoin. Just kidding. Today we're talking about NVIDIA CUDA core programming. The reason I want to talk about this is that I think it's a really interesting topic when it comes to how to use different parts of your computer to do different things.

So why would we do GPU programming? GPUs are able to do extremely high-throughput parallelized processing. This is in comparison to your CPU, which is designed for serialized execution in a low-latency, fault-tolerant kind of way, and the reason comes down to how the CPU and GPU are designed. If you look here on the left, you have a CPU, and it has maybe 8 to 16 cores. A core consists of a control unit and an ALU: the control unit does the logic control, and the ALU does the math. It also has cache, where it can very quickly look up memory and data as it needs it. It's meant to be serialized and very fault tolerant. A GPU, on the other hand, is meant to do things extremely fast and extremely parallel; it's designed with parallelization in mind, and a modern GPU like the 3090 has more than 10,000 cores. So if you're able to do your data processing in a parallelized way, you want to be using a GPU, and you'll get way faster results out of it.

So what problems are good for a GPU? Like I said before, any problem that can benefit from mass parallelization. The common example used when talking about CUDA programming is adding a large set of vector elements. Let's run through this example using the CPU model. We have the vectors over here: 1 and 2 form our first pair, where the left column is a, the middle column is b, the destination goes into c, and there are up to n elements. How would we do this on a CPU? We would write for (int i = 0; ...), iterate over the loop, and just add the elements up until we get to the very end. What actually happens in execution is very linear: look up element i of a, look up element i of b, add them into c, and do that for the first, second, and up to the nth element. Over time we keep incurring the cost of the lookups and of running the program that does this data lookup and math, and it becomes less efficient to do this serialized as opposed to using multiple cores. That's where the GPU comes in.

The way we do this on a GPU is, instead of serializing it and doing it one by one, we write a function that is meant to be used in a parallelized fashion, and we assign that function an index for every vector element that we have. We'll figure out where that index comes from in a second; it's a pretty cool way that the CUDA framework sets this up. Basically, we create this function and execute it in parallel for every element, and the index i that we use to index into our vectors comes from the CUDA framework. As you can see here in the execution timeline, how long it takes gets cut, in this case by a factor of three, and your gains scale with the size of n: if your serialized loop runs over 1024 elements and you can do this with perfect parallelization across 1024 cores, you get a speedup of 1024x.
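As a rough sketch of the contrast being described here (the full walkthrough, including the memory transfers, comes later), the same element-wise addition written serially and as a per-element GPU function looks something like the following; the function names are placeholders of mine:

    #include "cuda_runtime.h"

    // Serial CPU version: one core walks all n elements, one after another.
    void vectorAddCpu(const int *a, const int *b, int *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // GPU version: the loop body becomes a function that runs once per
    // element, and the index i is handed to each copy by the CUDA framework.
    __global__ void vectorAddGpu(const int *a, const int *b, int *c)
    {
        int i = threadIdx.x;   // this thread's index, supplied by CUDA
        c[i] = a[i] + b[i];
    }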
So how do we determine where this i comes from? I know this graphic is a little complicated, but it's actually not that bad. The CUDA framework is designed to take away the complexity of parallelization. The abstraction from the CPU to the GPU is what's known as a kernel. A kernel consists of these things called grids, grids consist of blocks, and blocks consist of threads. When we execute a kernel, we tell the GPU how many grids, blocks, and threads we want to have in our program. So for the case before, where we have this n value, we can tell the GPU, hey, instantiate n threads for this program, and within that function we get an index value that represents which thread we are when the program executes. With that number we can index into the array, and each thread knows which element to go to and do the math for.

So what we're going to do now is go into Visual Studio. I'm going to show you how to set up the CUDA framework, and I'm going to run you through this very example, and we're going to do it in a way where NVIDIA parallelizes the problem for us and does it way faster than a CPU could. Let's get into it right now.

Okay, so I'm going to assume you already have Visual Studio 2019 installed. If you don't, go ahead and pause the video (also, like the video) and install Visual Studio 2019. Then we'll go to this URL right here, developer.nvidia.com/cuda-downloads, and you just download for whatever OS you have. I have Windows, x86_64, Windows 10, and I'd like the local installer, so you go ahead and hit download; I'm going to cancel this because I already have it. Once that executable gets downloaded, you just go ahead and run it, and I'll quickly show you how that works. Let me pull it up. Okay, great, we have the installer right there. Boom, it's going to self-extract and run, and it should give you the option to just do a full install. I'm not going to walk you through the install itself; I'm sure you can figure that out, and it's very painless.

Once you get that installed, you should be able to open up Visual Studio, and in New Project you should have the option, way at the bottom, to create a CUDA 11.3 Runtime project. Hit Next a few times, blah blah blah, and we'll call it MyVectorProject. That template actually gives you a lot of code you don't need; I open the project and delete all of it except the include lines for the CUDA runtime header files, obviously standard I/O, and I leave a blank main function.

So once we have that, we need to start writing some code. To go back to our example, we're going to create three arrays to represent our vectors. So 1, 2, 3 is our a, and remember, that's the left-hand column of those vectors; we'll do 4, 5, 6 for b; and we want our destination array c, which is going to be of size sizeof(a) divided by sizeof(int), and we'll set it all to zero. Okay, great. Now I'm just going to write the CPU example, which is really simple: for (int i = 0; i < sizeof(c) / sizeof(int); i++) c[i] = a[i] + b[i], and then return. We're going to want to break right here to prove that c actually got set; it should be 5, 7, 9. We'll run that real quick. Yep, let me see here, we got 5... I'll zoom in for you guys, it's pretty small... 5, 7, and 9 in our vector. Very, very cool.
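For reference, the CPU-only version being described would look roughly like this as a complete program (the video sets a breakpoint on the return instead of printing the result):

    #include <stdio.h>

    int main()
    {
        // Three arrays representing the vectors: a is the left column,
        // b is the middle column, and c is the destination.
        int a[] = { 1, 2, 3 };
        int b[] = { 4, 5, 6 };
        int c[sizeof(a) / sizeof(int)] = { 0 };

        // Plain CPU version: add the vectors element by element.
        for (int i = 0; i < sizeof(c) / sizeof(int); i++)
            c[i] = a[i] + b[i];

        printf("%d %d %d\n", c[0], c[1], c[2]);   // expect 5 7 9
        return 0;
    }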
Okay, well, that's boring. That's the CPU example; we want to use the GPU to do this. How do we do that? There are a couple of things we have to do. First, we need a pointer that points into memory controlled by the GPU, so we're going to write int *cudaA = 0; this is going to be a pointer into the GPU's memory. We're going to do that three times, and the others will be cudaB and cudaC. Next we want to actually allocate memory on the GPU to copy our data out to, and the way we do that is with cudaMalloc. cudaMalloc takes a pointer to a pointer, because it's going to overwrite cudaA on the stack with the new value, so we pass the address of cudaA. And how much do we allocate? We allocate sizeof(a) bytes, which is correct because that's the size of a. We can just copy this and do it three times, so we allocate for a, b, and c. Okay, great. If this all happens correctly (we should be checking the return values, but today we're being bad programmers), we get pointers into the GPU that have enough room to hold all of our vectors.

Now that we have room to hold our vectors, we need to put the vectors into the GPU for processing, and we use cudaMemcpy to do that. The destination is going to be cudaA, the source is going to be a, and the count is sizeof(a); this is your standard memcpy, right, with destination, source, and size. The final thing you have to do is tell the CUDA framework which direction the data is going: from the host to the device, or from the device to the host. The host in this case is the CPU, and the device is the GPU, and here we're going from host to device, so cudaMemcpyHostToDevice. We can again copy this three times, boom, boom, boom, for a, b, and c.

We're going to start commenting to make this a little easier to follow: create pointers into the GPU, allocate memory on the GPU, and then copy the vectors into the GPU. Pretty straightforward, right? Create the pointers, malloc memory for them, and then copy into them. Great. And we're going to get rid of our CPU example, because again, that's hyper boring; no one cares about CPU vectors. Actually, we can just skip this line, since it's zero at this point, so we don't have to do that.

Now we want to run the code that adds these vectors together. How do we do that? Well, first we have to write the function that gets run on the GPU. The way you do that is with __global__. This __global__ tells the compiler that this is a function that actually gets run on the GPU, so it prepares it that way and sets up the memory mapping so the GPU knows how to pick it up. Then we say void, because we don't want to return anything, then vectorAdd, and it takes three parameters, a, b, and c. Note these are all pointers, because we're handing a function that can be parallelized over to the GPU, and the GPU is going to handle it using the CUDA grid, block, and thread framework. Then we need to access the elements based on the index of the thread that we are. The way that works is we say int i, which is the index we'll use to index into the arrays, equals threadIdx.x. What this means is that we're going to create a list of threads, and this function will get called once for every element in the vectors; this x value represents which thread we are, and we use it to index into the arrays and add them up.
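As a sketch of the memory-management steps just described (error checking is skipped here, as in the video, and the kernel definition and launch come in the next step):

    #include "cuda_runtime.h"

    int main()
    {
        int a[] = { 1, 2, 3 };
        int b[] = { 4, 5, 6 };

        // Pointers that will point into the GPU's (device) memory.
        int *cudaA = 0, *cudaB = 0, *cudaC = 0;

        // cudaMalloc takes a pointer to a pointer: it overwrites cudaA on the
        // stack with the address of freshly allocated device memory.
        cudaMalloc((void **)&cudaA, sizeof(a));
        cudaMalloc((void **)&cudaB, sizeof(b));
        cudaMalloc((void **)&cudaC, sizeof(a));

        // cudaMemcpy takes destination, source, size, plus a flag telling the
        // framework which way the data moves: host (CPU) to device (GPU) here.
        cudaMemcpy(cudaA, a, sizeof(a), cudaMemcpyHostToDevice);
        cudaMemcpy(cudaB, b, sizeof(b), cudaMemcpyHostToDevice);

        // ... kernel definition and launch go here (next step) ...

        cudaFree(cudaA);
        cudaFree(cudaB);
        cudaFree(cudaC);
        return 0;
    }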
Boom, there we go. And then return, but it's void, so it'll return automatically. Okay, great, so now we need to actually invoke this. How do we do that? We write the name of our function, vectorAdd, but then we add the special syntax, the triple angle brackets, and the values that go in there are the grid size and the block size. We're not just going to leave this as-is; I want to explain it first. The grid size is the size of the grid, meaning the number of blocks that we have, and the block size says how many threads exist per block. I'm going to say we only want one block, because it's not a very big program and I don't want to over-parallelize it; anything beyond one block would actually cause issues with the way this is written. And then the block size is going to be the number of vector elements, and what is that? It's sizeof(a) over sizeof(int). Then I call it with the parameters cudaA, cudaB, and cudaC. At this point Visual Studio is going to flag an error on this syntax; it will actually compile just fine, so you can ignore it.

So what happens here is the CPU tells the GPU: hey, run this function, vectorAdd, as a CUDA kernel, with a grid that has one block, that block has this many threads, and call it with these parameters. That gets run, and every thread does its thing and adds its values together. Then finally, once that happens (I'm going to delete the CPU part first, because I want to confirm to you that this really happens on the GPU), we want to cudaMemcpy the result out of cudaC. The destination is going to be c, the source is going to be cudaC, the size is going to be sizeof(c), and instead of host to device, this one is device to host. Again, we're going to break right here to prove that it happened. So we'll go ahead and run this. We're at the return here, and here is c: you can see that c is 5, 7, 9. It got parallelized, the GPU actually added them, and then the result came back out.

What we can do now is make our vectors a little bigger. I'll just go ahead and copy this bad boy, make b the same way, and c will grow dynamically with that, since all of these sizes are derived from a. So we can stop and run this again, and I'll show you how it works: based on that size it should properly instantiate the threads and add our vectors together. Yep, boom. You can see that what happened in c is that it added all of these together.

So again, pretty cool, right? What's happening: we create these vectors locally on our computer, we create a destination array on our computer, we declare pointers that will point into the GPU, we allocate memory on the GPU and overwrite those pointers, we then memcpy from our computer to the GPU using the host-to-device flag, and we say run this function this many times in parallel with these parameters. Once that's been executed, we copy that memory back out of the GPU and display it here.

Guys, I hope that was useful and that you learned something. The power of the GPU is very, very high; you just need to learn how to write a program that parallelizes in a way that's useful when you're doing CUDA core programming. If you enjoyed this video, do me a favor: hit like, hit subscribe, and I'll see you in the next one.
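Putting the whole walkthrough together, a minimal end-to-end version of the program described above would look something like this (the cudaFree calls and the printf are additions of mine; the video uses a breakpoint to inspect c instead):

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    #include <stdio.h>

    // Runs on the GPU: each thread adds one element of the vectors.
    __global__ void vectorAdd(int *a, int *b, int *c)
    {
        int i = threadIdx.x;   // this thread's index within its block
        c[i] = a[i] + b[i];
    }

    int main()
    {
        int a[] = { 1, 2, 3 };
        int b[] = { 4, 5, 6 };
        int c[sizeof(a) / sizeof(int)] = { 0 };
        int n = sizeof(a) / sizeof(int);

        // Create pointers into the GPU, allocate memory there,
        // and copy the input vectors over to the device.
        int *cudaA = 0, *cudaB = 0, *cudaC = 0;
        cudaMalloc((void **)&cudaA, sizeof(a));
        cudaMalloc((void **)&cudaB, sizeof(b));
        cudaMalloc((void **)&cudaC, sizeof(c));
        cudaMemcpy(cudaA, a, sizeof(a), cudaMemcpyHostToDevice);
        cudaMemcpy(cudaB, b, sizeof(b), cudaMemcpyHostToDevice);

        // Launch the kernel: a grid of 1 block, with n threads in that block.
        vectorAdd<<<1, n>>>(cudaA, cudaB, cudaC);

        // Copy the result back from the device (GPU) to the host (CPU).
        cudaMemcpy(c, cudaC, sizeof(c), cudaMemcpyDeviceToHost);

        printf("%d %d %d\n", c[0], c[1], c[2]);   // expect 5 7 9

        cudaFree(cudaA);
        cudaFree(cudaB);
        cudaFree(cudaC);
        return 0;
    }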
Info
Channel: Low Level Learning
Views: 545,497
Keywords: cuda core programming, gpu programming, nvidia programming, nvidia c, nvidia c++, parallel programming, multithreading
Id: 8sDg-lD1fZQ
Length: 15min 32sec (932 seconds)
Published: Sat Jul 10 2021