Getting Started With CUDA for Python Programmers

Captions
Hi there, I'm Jeremy Howard from Answer.AI and this is Getting Started with CUDA. CUDA is of course what we use to program NVIDIA GPUs if we want them to go super fast and we want maximum flexibility. And it has a reputation of being very hard to get started with. The truth is, it's actually not so bad. You just have to know some tricks. And so in this video I'm going to show you some of those tricks. So let's switch to the screen and take a look. So I'm going to be doing all of the work today in notebooks. This might surprise you. You might be thinking that to do work with CUDA we have to do stuff with compilers and terminals and things like that. And the truth is actually it turns out we really don't, thanks to some magic that is provided by PyTorch. You can follow along in all of these steps and I strongly suggest you do so on your own computer. You can go to the CUDA MODE organization on GitHub, find the lectures repo there and you'll see there is a lecture 3 folder. This is lecture 3 of the CUDA MODE series. You don't need to have seen any of the previous ones, however, to follow along. In the readme there you'll see there's a lecture 3 section and at the bottom there is a click to go to the Colab version. Yep, you can run all of this in Colab for free. You don't even have to have a GPU available to run the whole thing. We're going to be following along with some of the examples from the book Programming Massively Parallel Processors. Programming Massively Parallel Processors is a really great book to read and once you've completed today's lesson you should be able to make a great start on it. It goes into a lot more detail about some of the things that we're going to cover fairly quickly. It's okay if you don't have the book, but if you want to go deeper I strongly suggest you get it. In fact you'll see in the repo that lecture 2 in this series actually was a deep dive into chapters 1 to 3 of that book. So actually you might want to do lecture 2, confusingly enough, after this one, lecture 3, to get more details about some of what we're talking about. Okay, so let's dive into the notebook. So what we're going to be doing today is we're going to be doing a whole lot of stuff with plain PyTorch first to make sure that we get all the ideas, and then we will try to convert each of these things into CUDA. So in order to do this we're going to start by importing a bunch of stuff. In fact let's do all of this in Colab. So here we are in Colab and you should make sure that you set your Colab runtime to the T4 GPU. This one you can use plenty of for free and it's easily good enough to run everything we're doing today. And once you've got that running we can import the libraries we're going to need and we can start on our first exercise. So the first exercise actually comes from chapter 2 of the book, and chapter 2 of the book teaches how to do this problem, which is converting an RGB color picture into a grayscale picture. And it turns out that the recommended formula for this is to take 0.21 of the red pixel, 0.72 of the green pixel, 0.07 of the blue pixel and add them up together, and that creates the luminance value, which is what we're seeing here. That's a common way, kind of the standard way, to go from RGB to grayscale. So we're going to do this, we're going to make a CUDA kernel to do this.
So the first thing we're going to need is a picture, and anytime you need a picture I recommend going for a picture of a puppy. So we've got here a URL to a picture of a puppy. So we'll just go ahead and download it and then we can use torchvision.io to load that. So this is already part of Colab. If you're interested in running stuff on your own machine or a server in the cloud I'll show you how to set that up at the end of this lecture. So let's read in the image, and if we have a look at the shape of it, it says it's 3 by 1066 by 1600. So I'm going to assume that you know the basics of PyTorch here. If you don't know the basics of PyTorch, I am a bit biased but I highly recommend my course which covers exactly that. You can go to course.fast.ai and you get the benefit also of having seen some very cute bunnies, and along with the very cute bunnies it basically takes you through everything you need to be an effective practitioner of modern deep learning. So finish part one if you want to go right into those details, but even if you just do the first two or three lessons that will give you more than enough to understand this kind of code and these kinds of outputs. So I'm assuming you've done all that. So you'll see here we've got a rank 3 tensor. There are three channels, so they're like the faces of a cube if you like. There are 1066 rows on each face, so that's the height, and then there are 1600 columns in each row, so that's the width. So if we then look at the first couple of channels and the first three rows and the first four columns, you can see here that these are unsigned 8-bit integers. So they're bytes, and so here they are. So that's what an image looks like. Hopefully you know all that already. So let's take a look at our image. To do that I'm just going to create a simple little function show image that will create a matplotlib plot. Remove the axes. If it's color, which this one is, it will change the order of the axes from channel by height by width, which is what PyTorch uses, to height by width by channel, which is what matplotlib (I'm having trouble saying it today) expects. So we change the order of the axes to be 1, 2, 0 and then we can show the image, putting it on the CPU if necessary. Now we're going to be working with this image in Python, which is going to be just pure Python to start with before we switch to CUDA. It's going to be really slow, so we'll resize it to have the smallest dimension be 150. So that's the height in this case. So we end up with a 150 by 225 shape, which is a rectangle of 33,750 pixels, each one with R, G and B values, and there is our puppy. So you see, it was a good idea to make this a puppy. Okay, so how do we convert that to grayscale? Well the book has told us the formula to use: go through every pixel and do that to it. Alright, so here is the loop: we're going to go through every pixel, do that to it, and stick the result in the output. So that's the basic idea. So what are the details of this? Well, here we've got channel by row by column. So how do we loop through every pixel? Well the first thing we need to know is how many pixels there are. So we can say channel by height by width is the shape. So now we've defined those three variables. So the number of pixels is the height times the width. And so to loop through all those pixels, an easy way to do that is to flatten them all out into a vector.
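Before looking at what flattening does, here is a minimal sketch of the loading and display steps just described. It is a reconstruction, not the original notebook code: the URL is a placeholder, and the helper name show_img is my own choice.

```python
# A sketch of the image loading / display steps described above.
import torch, torchvision as tv
import torchvision.transforms.functional as tvf
import matplotlib.pyplot as plt
from urllib.request import urlretrieve
from pathlib import Path

url = "https://example.com/puppy.jpg"        # placeholder URL -- any RGB image will do
path = Path("puppy.jpg")
if not path.exists(): urlretrieve(url, path)

img = tv.io.read_image(str(path))            # uint8 tensor, shape (3, H, W)
print(img.shape)                             # e.g. torch.Size([3, 1066, 1600])

def show_img(x, figsize=(4, 3)):
    plt.figure(figsize=figsize); plt.axis('off')
    if len(x.shape) == 3: x = x.permute(1, 2, 0)   # CHW -> HWC for matplotlib
    plt.imshow(x.cpu())

img_small = tvf.resize(img, 150, antialias=True)   # shortest side -> 150, pure Python is slow
ch, h, w = img_small.shape
print(ch, h, w, h * w)                             # 3, 150, 225, 33750
show_img(img_small)
```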
Now what happens when you flatten them all out into a vector? Well, as we saw, they're currently stored in this format where we've got one face and then another face, and then there's a third face, which we haven't got printed here. Within each face there is one row (we're just showing the first few), and then the next row, and then the next row, and then within each row you've got column, column, column. So let's say we had a small image. In fact we can do it like this. We could say here's our red, so we've got the pixels 0, 1, 2, 3, 4, 5. So let's say this was a height 2, width 3, 3 channel image. So then there'll be 6, 7, 8, 9, 10, 11 for green, and 12, 13, 14, 15, 16, 17 for blue. So let's say these are the pixels. So when these are flattened out it's going to turn into a single vector, just like so: 0 to 5, then 6, 7, 8 and so on, then 12, 13, 14 and so on. So actually, when we talk about an image we initially see it as a bunch of pixels. We can think of it as having 3 channels, but in practice in our computer the memory is all laid out linearly. Everything just has an address in memory. You can think of your computer's memory as one giant vector. And so when we say flatten, what that's actually doing is turning our channel by height by width into a big vector like this. Okay, so now that we've done that we can say, alright, the place we're going to be putting this into, the result, we're going to start out with just an empty vector of length n. We'll go through all of the n values from 0 to n minus 1, and we're going to put in the output value 0.29-ish times the input value at x i. So this will be here in the red bit. And then 0.59 times x i plus n. So n here is this distance: it's the number of pixels, 1, 2, 3, 4, 5, 6. So that's why to get to green we have to jump up to i plus n, and then to get to blue we have to jump to i plus 2n. And so that's how this works. We've flattened everything out and we're indexing into this flattened out thing directly. And so at the end of that we're going to have our grayscale all done. So we can then just reshape that into height by width. And there it is. There's our grayscale puppy. And you can see here the flattened image is just a single vector with all those channel values flattened out as we described. Okay, now that is incredibly slow. It's nearly two seconds to do something with only 34,000 pixels in it. So to speed it up we are going to want to use CUDA. How come CUDA is able to speed things up? Well, the reason CUDA is able to speed things up is because it is set up in a very different way to how a normal CPU is set up. And we can actually see that if we look at some of this information about what is in an RTX 3090 card, for example. Now an RTX 3090 card is a fantastic GPU. You can get them secondhand, pretty good value. So a really good choice, particularly for hobbyists. What is inside a 3090? It has 82 SMs. What's an SM? An SM is a streaming multiprocessor. So you can think of this as almost like a separate CPU in your computer. And so there's 82 of these. So that's already a lot more than you have CPUs in your computer. But then each one of these has 128 CUDA cores. So these CUDA cores are all able to operate at the same time. These multiprocessors are all able to operate at the same time. So that gives us 128 times 82, which is about 10,500 CUDA cores in total that can all work at the same time.
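Here is the pure-Python grayscale loop just described, as a runnable sketch; it assumes the img_small tensor from the earlier sketch, and uses the common 0.29-ish / 0.59 / 0.11 coefficients mentioned in the walkthrough. (We'll come back to those 10,500 CUDA cores in a moment.)

```python
# Pure-Python grayscale conversion over the flattened image, as described above.
import torch

def rgb2grey_py(x):
    c, h, w = x.shape
    n = h * w
    x = x.flatten()                      # red, then green, then blue, laid out end to end
    res = torch.empty(n, dtype=x.dtype, device=x.device)
    for i in range(n):
        # red lives at i, green at i+n, blue at i+2n in the flattened vector
        res[i] = 0.2989 * x[i] + 0.5870 * x[i + n] + 0.1140 * x[i + 2 * n]
    return res.view(h, w)

img_grey = rgb2grey_py(img_small)        # slow: a couple of seconds for ~34k pixels
```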
So that's a lot more than any CPU we're familiar with can do. And the 3090 isn't even at the very top end. It's really a very good GPU, but there are some with even more CUDA cores. So how do we use them all? Well, we need to be able to set up our code in such a way that we can say: here is a piece of code that you can run on lots of different pieces of data, lots of different pieces of memory, at the same time, so that you can do 10,000 things at the same time. And so CUDA does this in a really simple and pretty elegant way, which is it basically says: okay, take out the inner loop. So here's our inner loop, the stuff where you can run 10,000 of these at the same time. They're not going to influence each other at all. So you see, these do not influence each other at all. All they do is they stick something into some output memory. So it doesn't even return something. You can't return something from these CUDA kernels, as they're called. All you can do is modify memory, in such a way that you don't know what order they're going to run in. They could all run at the same time, some could run a little bit before another one, and so forth. So the way that CUDA does this is it says: okay, write a function, and in your function write a line of code which I'm going to call as many dozens, hundreds, thousands, millions of times as necessary to do all the work that's needed. And I'm going to do this in parallel for you as much as I can. In the case of running on a 3090, up to about 10,000 things all at once. And I will get this done as fast as possible. So all you have to do is basically write the line of code you want to be called lots of times. And then the second thing you have to do is say how many times to call that code. And so what will happen is that piece of code, called the kernel, will be called for you. It'll be passed whatever arguments you ask to be passed in, which in this case will be the input tensor, the output tensor, and the size, i.e. how many pixels are in each channel. And it'll tell you: okay, this is the ith time I've called it. Now we can simulate that in Python very, very simply with a single for loop. Now this doesn't happen in parallel, so it's not going to speed it up, but the results, the semantics, are going to be identical to CUDA. So here is a function we've called run kernel. We're going to pass it a function, we're going to say how many times to run the function, and what arguments to call the function with. And so each time it will call the function, passing in the index, i.e. which iteration we're up to, and the arguments that we've requested. Okay, so we can now create something to call that. So, just like before, let's get the number of channels, height and width, the number of pixels, flatten it out, and create the result tensor that we're going to put things in. So this time, rather than calling the loop directly, we will call run kernel. We will pass in the name of the function to be called as f. We will pass in the number of times, which is the number of pixels, for the loop. And we'll pass in the arguments that are going to be required inside our kernel. So we're going to need out, we're going to need x, and we're going to need n. So you can see here we're using no external libraries at all. We have just plain Python and a tiny bit of PyTorch, just enough to create a tensor and index into tensors. And that's all that's being used.
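A minimal sketch of that single-loop "kernel runner" and the grayscale kernel it calls, following the description above (names like run_kernel and grey_k are my own; the logic mirrors the walkthrough, with no parallelism at all):

```python
# Simulate a CUDA launch in Python: call the kernel n times, passing the index.
def run_kernel(f, times, *args):
    for i in range(times):
        f(i, *args)

def grey_k(i, out, x, n):
    out[i] = 0.2989 * x[i] + 0.5870 * x[i + n] + 0.1140 * x[i + 2 * n]

def rgb2grey_pyk(x):
    c, h, w = x.shape
    n = h * w
    x = x.flatten()
    res = torch.empty(n, dtype=x.dtype, device=x.device)
    run_kernel(grey_k, n, res, x, n)     # out, x, n are the kernel's arguments
    return res.view(h, w)
```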
Conceptually, it's doing the same thing as a CUDA kernel would do, nearly. And we'll get to the nearly in just a moment. But conceptually, you could see that, if you knew this was running a bunch of things totally independently of each other, you could now pretty easily parallelize it. And that's what CUDA does. However, it's not quite that simple. It does not simply create a single list of numbers like range n does in Python and pass each one in turn into your kernel. Instead, it actually splits the range of numbers into what's called blocks. So in this case, maybe there's like 1,000 pixels we wanted to get through; it's going to group them into blocks of 256 at a time. And so in Python, it looks like this. In practice, a CUDA kernel runner is not a single for loop that loops n times. Instead, it is a pair of nested for loops. So you don't just pass in a single number and say this is the number of pixels; you pass in two numbers, the number of blocks and the number of threads. We'll get into that in a moment. But these are just numbers. You can put any numbers you like here. And if you choose two numbers that multiply to give the thing that we want, which is the n times we want to call it, then this can do exactly the same thing, because we're now going to pass in the index of the outer loop we're up to, the index of the inner loop we're up to, and how many things we go through in the inner loop. And therefore, inside the kernel, we can find out what index we're up to by multiplying the block index by the block dimension, that is to say the i by the number of threads, and adding the inner loop index, the j. So that's what we pass in with i, j and threads. But inside the kernel, we call them block index, thread index and block dimension. So if you look at the CUDA book, you'll see this is exactly what they do. They say the index is equal to the block index times the block dimension plus the thread index. There's a .x thing here that we can ignore for now; we'll look at that in a moment. But in practice, this is actually how CUDA works. So it has all these blocks and inside there are threads, and you can just think of them as numbers. You can see these blocks, they just have numbers, 0, 1, and so forth. Now that does mean something a little bit tricky though, which is, well, the first thing I'll say is: how do we pick these numbers, the number of blocks and the number of threads? For now, in practice, we're just always going to say the number of threads is 256. And that's a perfectly fine number to use as a default. You can't go too far wrong just always picking 256, nearly always. So don't worry too much for now about optimizing that number. So if we say, okay, we want to have 256 threads, so remember that's the inner loop, or if we look inside our kernel runner here, that's our inner loop, then this inner piece is going to be called 256 times. So how many times do you have to call the outer loop? Well, you're going to have to call it n, the number of pixels, divided by 256 times. Now that might not be an integer, so you'll have to round that up. So that's how we can calculate the number of blocks we need to make sure that our kernel is called enough times.
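Here's a sketch of that block/thread runner and the grayscale kernel rewritten for it, including the bounds check that's discussed in the next paragraph. Again, the names (blk_kernel_run, grey_bk) are my own reconstruction of what's described.

```python
# A nested-loop "block kernel runner": outer loop over blocks, inner over threads.
import math

def blk_kernel_run(f, blocks, threads, *args):
    for i in range(blocks):              # blockIdx
        for j in range(threads):         # threadIdx
            f(i, j, threads, *args)      # the kernel also gets blockDim (= threads)

def grey_bk(blockidx, threadidx, blockdim, out, x, n):
    i = blockidx * blockdim + threadidx  # which element this "thread" handles
    if i < n:                            # the guard, explained just below
        out[i] = 0.2989 * x[i] + 0.5870 * x[i + n] + 0.1140 * x[i + 2 * n]

def rgb2grey_bk(x):
    c, h, w = x.shape
    n = h * w
    x = x.flatten()
    res = torch.empty(n, dtype=x.dtype, device=x.device)
    threads = 256                        # the safe default discussed above
    blocks = math.ceil(n / threads)      # round up so every pixel is covered
    blk_kernel_run(grey_bk, blocks, threads, res, x, n)
    return res.view(h, w)
```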
Now we do have a problem though, which is that the number of times we would have liked to have called it, which previously was equal to the number of pixels, might not be a multiple of 256. So we might end up going too far. And so that's why we also need in our kernel now this if statement. And so this is making sure that the index that we're up to does not go past the number of pixels we have. And this appears in basically every CUDA kernel you'll see, and it's called the guard, or the guard block. So this is our guard to make sure we don't go out of bounds. So this is the same line of code we had before; now we've also just added this thing to calculate the index, and we've added the guard. And these are the pretty standard first lines of just about any CUDA kernel. So we can now run those, and they'll do exactly the same thing as before. And so the obvious question is, well, why? Why do CUDA kernels work in this weird block and thread way? Why don't we just tell them the number of times to run it? Why do we have to do it by blocks and threads? And the reason why is because of some of this detail that we've got here, which is that CUDA sets things up for us so that everything in the same block — or to say it more completely, the same thread block — will all be given some shared memory, and they'll also all be given the opportunity to synchronize, which is to basically say, okay, everything in this block has to get to this point before you can move on. All of the threads in a block will be executed on the same streaming multiprocessor. And so we'll see in later lectures — they won't be taught by me — that by using blocks smartly, you can make your code run more quickly. And the shared memory is particularly important. So shared memory is a little bit of memory in the GPU that all the threads in a block share, and it's fast. It's super, super, super fast. Now, there's not very much of it: on a 3090, it's 128K per SM. So very small. This is basically the same as a cache in a CPU. The difference, though, is that on a CPU, you're not going to be manually deciding what goes into your cache. But on the GPU, you do. It's all up to you. So at the moment, this cache is not going to be used when we create our CUDA code, because we're just getting started, and so we're not going to worry about that optimization. But to go fast, you want to use that cache. And also, you want to use the register file. Something a lot of people don't realize is that there's actually quite a lot of register memory, even more register memory than shared memory. So anyway, those are all things to worry about down the track, not needed for getting started. So how do we go about using CUDA? There is a basically standard setup block that I would add, and we are going to add. And what happens in this setup block is we're going to set an environment variable. You wouldn't use this in production or for going fast, but this says: if you get an error, stop right away, basically, rather than carrying on and telling you about it later. That way, you can tell exactly when an error occurs and where it happens. So that slows things down, but it's good for development. We're also going to install two modules. One is a build tool (ninja), which is required by PyTorch to compile your CUDA C++ code. The second is a very handy little thing called wurlitzer.
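For reference, the setup cell described above looks roughly like this in a notebook; the CUDA_LAUNCH_BLOCKING environment variable is the standard way to get errors reported immediately, and the two installs are the build tool and wurlitzer just mentioned. Exact cell contents are a reconstruction.

```python
# Development-only setup: report CUDA errors at the point they happen,
# install the build tool PyTorch needs, and capture C/CUDA printf output.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# In the notebook you'd run something like:
#   %pip install -q ninja wurlitzer
#   %load_ext wurlitzer     # so printf from CUDA / C++ code shows up in the notebook
```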
And the only place you're going to see that used is in this line here, where we load this extension called wurlitzer. Without this, anything you print from your CUDA code, in fact from your C++ code, won't appear in a notebook. So you always want to do this in a notebook where you're doing stuff in CUDA, so that you can use print statements to debug things. OK. So if you've got some CUDA code, how do you use it from Python? The answer is that PyTorch comes with a very handy thing called load_inline, which is inside torch.utils.cpp_extension. load_inline is a marvelous function: you just pass in a list of any CUDA code strings that you want to compile, any plain C++ strings that you want to compile, and any functions in that C++ you want to make available to Python, and it will go and compile it all, turn it into a Python module and make it available right away, which is pretty amazing. I've just created a tiny little wrapper for that called load_cuda, just to streamline it a tiny bit. Behind the scenes, it's just going to call load_inline. The other thing I've done is I've created a string that contains some C++ code. This is all C code, I think, but it's compiled as C++ code, so we'll call it C++ code. It's code we want included in all of our CUDA files. We need to include this header file to make sure that we can access PyTorch tensor stuff, we want to be able to use I/O, and we want to be able to check for exceptions. And then I also define three macros. The first macro just checks that a tensor is a CUDA tensor. The second one checks that it's contiguous in memory, because sometimes PyTorch can actually split things up over different memory pieces, and then if we try to access that in this flattened out form, it won't work. And then the one we're actually going to use, check input, just checks both of those things. So if something's not on CUDA or it's not contiguous, we aren't going to be able to use it. So we always have this. And then the third thing we do here is we define ceiling division. Ceiling division is just this, although you can implement it a different way, like this. And so this is what we're going to call in order to figure out how many blocks we need. So you don't have to worry about the details of this too much; it's just a standard setup we're going to use. OK, so now we need to write our CUDA kernel. Now, how do you write the CUDA kernel? Well, all I did, and I recommend you do, is take your Python kernel and paste it into ChatGPT and say: convert this to equivalent C code using the same names, formatting, etc. where possible. Paste it in and ChatGPT will do it for you. Unless you're very comfortable with C, in which case just writing it yourself is fine. But this way, since you've already got the Python, why not just do this? It was basically pretty much perfect, I found, although it did assume that these were floats, and they're actually not floats. We had to change a couple of data types, but basically I was able to use it almost as is. And so particularly, you know, for people who are much more Python programmers nowadays, like me, this is a nice way to write 95% of the code you need. What else do we have to change? Well, as we saw in our picture earlier, it's not just called blockIdx, it's blockIdx.x, blockDim.x, threadIdx.x. So we have to add the .x there.
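Before comparing the two versions side by side, here's a sketch of the helpers just described: a thin load_cuda wrapper around torch's load_inline, and the cuda_begin string with the header, the check macros and ceiling division. The names follow the transcript, but the exact contents are my reconstruction, not a copy of the original notebook.

```python
from torch.utils.cpp_extension import load_inline

def load_cuda(cuda_src, cpp_src, funcs, opt=False, verbose=False):
    # Compile the CUDA + C++ sources and expose the listed functions as a module.
    return load_inline(
        name="inline_ext",
        cpp_sources=[cpp_src],
        cuda_sources=[cuda_src],
        functions=funcs,
        extra_cuda_cflags=["-O2"] if opt else [],
        verbose=verbose,
    )

# Boilerplate included at the top of every CUDA source string we write.
cuda_begin = r'''
#include <torch/extension.h>
#include <stdio.h>
#include <c10/cuda/CUDAException.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.device().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

// Ceiling division: how many blocks of size b cover a elements.
inline unsigned int cdiv(unsigned int a, unsigned int b) { return (a + b - 1) / b; }
'''
```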
Other than that, if we compare — as you can see, these two pieces of code look nearly identical. We've had to add data types to them, we've had to add semicolons, we had to get rid of the colons, we had to add curly brackets. That's about it. So it's not very different at all. So if you haven't done much C programming, yeah, don't worry about it too much, because the truth is it's actually not that different for this kind of calculation-intensive work. One thing we should talk about is this: what's unsigned char*? This is just how you write uint8 in C. If you're not sure how to change a data type between the PyTorch spelling and the C spelling, you can ask ChatGPT or you can Google it. But this is how you write byte. The star, in practice, is basically how you say this is an array. So this says that x is an array of bytes. It actually means it's a pointer, but pointers are treated, as you can see here, as arrays by C. So you don't really have to worry about the fact that it's a pointer; it just means for us that it's an array. But in C, the only kind of arrays it knows how to deal with in this way are these one-dimensional arrays. And that's why we always have to flatten things out. We can't really use multidimensional tensors directly in these CUDA kernels in this way. So we're going to end up with these one-dimensional C arrays. Yeah, other than that, it's going to look — in fact, because we wrote our Python like that, it's going to look nearly identical. The void here just means it doesn't return anything. And then the __global__ here (dunder global, i.e. double underscores on each side) is a special thing added by CUDA. There's three things that can appear, and this simply says: what should I compile this to do? And so you can put __device__, and that means compile it so that you can only call it on the GPU. You can say __global__, and that says: okay, you can call it from the CPU or GPU and it'll run on the GPU. Or you can write __host__, which you don't have to, and that just means it's a normal C or C++ function that runs on the CPU side. So any time we want to call something from the CPU side to run something on the GPU, which is basically almost always when we're doing kernels, you write __global__. So here we've got __global__, we've got our kernel, and that's it. So then we need the thing to call that kernel. So earlier, to call the kernel, we called this block kernel runner function, passed in the kernel, and passed in the blocks and threads and the arguments. With CUDA, we don't have to use a special function; there is a weird special syntax built into CUDA to do it for us. To use the weird special syntax, you say: okay, what's the kernel, the function that I want to call? And then you use these weird triple angle brackets. So the triple angle brackets are a special CUDA extension to the C++ language, and they mean: this is a kernel, please call it on the GPU. And between the triple angle brackets, there's a number of things you can pass, but you have to pass at least the first two things, which are how many blocks and how many threads. So how many blocks: ceiling division, number of pixels divided by threads. And how many threads: as we said before, let's just pick 256 all the time and not worry about it. So that says call this function as a GPU kernel, and then pass in these arguments. We have to pass in our input tensor, our output tensor and how many pixels.
And you'll see that for each of these tensors, we have to use a special method, .data_ptr, and that's going to convert it into a C pointer to the tensor. So that's why, by the time it arrives in our kernel, it's a C pointer. You also have to tell it what data type you want it to be treated as. This says treat it as uint8. So this is a C++ template parameter here, and this is a method. The other thing you need to know is that in C++, dot means call a method of an object, whereas colon colon is basically like, in Python, calling a method of a class. So you don't say torch.empty, you say torch::empty to create our output. Whereas back when we did it in Python, we said torch.empty. Oh, that's right — in Python we just created a length-n vector and then did a .view. It doesn't really matter how we do it. But in this case, we actually create a two-dimensional tensor by passing in this thing in curly brackets here. This is called a C++ list initializer, and it's just basically a little list containing height comma width. So this tells it to create a two-dimensional matrix, which is why we don't need .view at the end. We could have done it the .view way as well; it would probably be better to keep it consistent, but this is what I wrote at the time. The other interesting thing when we create the output is that we pass in input.options(). So this is our input tensor, and that just says: use the same data type and the same device, the CUDA device, as our input has. This is a really convenient way, which I don't even think we have in Python, to say make sure this is the same data type and on the same device. If you say auto here, this is quite convenient: you don't have to specify what type this is. We could have written torch::Tensor, but by writing auto, it just says figure it out yourself, which is another convenient little C++ thing. After we call the kernel, if there's an error in it, we won't necessarily get told. So to tell it to check for an error, you have to write this. This is a macro that's again provided by PyTorch. The details don't matter; you should just always call it after you call a kernel to make sure it works. And then you can return the tensor that you allocated, passed in as a pointer, and filled in. OK. Now, as well as the CUDA source, you also need C++ source. And the C++ source is just something that says: here is a list of all of the details of the functions that I want you to make available to the outside world, in this case Python. And so this is basically your header, effectively. So you can just copy and paste the full line here from your function definition and stick a semicolon on the end. So that's something you can always do. And so then we call our load_cuda function that we looked at earlier, passing in the CUDA source code, the C++ source code, and then a list of the names of the functions defined there that you want to make available to Python. So we just have one, which is the RGB to grayscale function. And believe it or not, that's all you have to do. This will automatically — you can see it running in the background now — compile our files with a hugely long command line. It's created a main.cpp for us, it's going to compile it into a main.o, compile everything up, link it all together and create a module.
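Putting the pieces together, here is a sketch of the full CUDA grayscale source string, the one-line C++ "header", and the load_cuda call described above (it assumes the load_cuda and cuda_begin helpers from the earlier sketch; the code follows the walkthrough but is a reconstruction, not the original notebook file):

```python
cuda_src = cuda_begin + r'''
__global__ void rgb_to_grayscale_kernel(unsigned char* x, unsigned char* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // which pixel this thread handles
    if (i < n) out[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n];   // guard + formula
}

torch::Tensor rgb_to_grayscale(torch::Tensor input) {
    CHECK_INPUT(input);
    int h = input.size(1);
    int w = input.size(2);
    int n = h * w;
    auto output = torch::empty({h, w}, input.options());   // same dtype & device as input
    int threads = 256;
    rgb_to_grayscale_kernel<<<cdiv(n, threads), threads>>>(
        input.data_ptr<unsigned char>(), output.data_ptr<unsigned char>(), n);
    C10_CUDA_KERNEL_LAUNCH_CHECK();                         // surface any launch error
    return output;
}
'''

# The C++ "header": just the signature of the function we want exported to Python.
cpp_src = "torch::Tensor rgb_to_grayscale(torch::Tensor input);"

module = load_cuda(cuda_src, cpp_src, ['rgb_to_grayscale'])
```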
And you can see here, we then take that module that's been passed back and put it into a variable called module. And then when it's done, it will load that module. And if we look inside the module that we just created, you'll see now that, apart from the normal auto-generated stuff Python adds, it's got a function in it, rgb_to_grayscale. Okay, so that's amazing. We now have a CUDA function that's been made available from Python. And we can even see, if we want to, where it put it all. So we can have a look, and there it is. You can see it's created a main.cpp, it's compiled it into a main.o, it's created a library that we can load up, it's created a CUDA file, it's created a build script. And we could have a look at that build script if we wanted to, and there it is. So none of this matters too much; it's just nice to know that PyTorch is doing all this stuff for us and we don't have to worry about it. So that's pretty cool. So in order to pass a tensor to this, we're going to be checking that it's contiguous and on CUDA, so we'd better make sure it is. So we're going to create an image C variable, which is the image made contiguous and put onto the CUDA device. Now we can actually run this on the full-sized image, not on the tiny little minimized image we created before. This has got many more pixels: it's got 1.7 million pixels, whereas before we had, I think it was, 34,000. And it's gone down from one and a half seconds to one millisecond. So that is amazing. So it's dramatically faster, both because it's now running in compiled code and because it's running on the GPU. The step of putting the data onto the GPU is not part of what we timed, and that's probably fair enough, because normally you do that once and then you run a whole lot of CUDA things on it. We have, though, included the step of moving it off the GPU and putting it onto the CPU as part of what we're timing. And one key reason for that is that if we didn't do that, our Python code could actually run at the same time that the CUDA code is still running, and so the amount of time shown could be dramatically less because it hasn't finished synchronizing. So by adding this, it forces it to complete the CUDA run and to put the data back onto the CPU. That kind of synchronization you can also trigger by printing a value from it, or you can synchronize it manually. So after we've done that, we can have a look, and we should get exactly the same grayscale puppy. Okay. So we have successfully created our first real working CUDA kernel from Python. This approach of writing it in Python and then converting it to CUDA is not particularly common, but I'm not just doing it as an educational exercise. That's how I like to write my CUDA kernels, at least as much of them as I can, because it's much easier to debug in Python. It's much easier to see exactly what's going on. And so I don't have to worry about compiling — it takes about 45 or 50 seconds to compile even our simple example here — I can just run it straight away. And once it's working, converting it into C, as I mentioned, ChatGPT can do most of it for us. So I think this is actually a fantastically good way of writing CUDA kernels, even as you start to get somewhat familiar with them, because it lets you debug and develop much more quickly. A lot of people avoid writing CUDA just because that process is so painful.
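For completeness, the calling sequence just described looks roughly like this, assuming the module, img and show_img names from the earlier sketches:

```python
# The compiled kernel needs a contiguous CUDA tensor as input.
imgc = img.contiguous().cuda()

# .cpu() at the end forces synchronization, so timing this line measures the
# whole kernel run plus the copy back (but not the copy onto the GPU).
res = module.rgb_to_grayscale(imgc).cpu()
show_img(res)          # same grayscale puppy, about a millisecond on a T4
```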
And so here's a way that we can make that process less painful. So let's do it again, and this time we're going to do it to implement something very important, which is matrix multiplication. So matrix multiplication, as you probably know, is fundamentally critical for deep learning. It's like the most basic linear algebra operation we have. And the way it works is that you have an input matrix M and a second input matrix N, and we go through every row of M — so we go through every row of M until, here we are, we're up to this one — and every column of N — and here we are, up to this one. And then we take the dot product of that row with that column, and this here is the dot product of those two things. And that is what matrix multiplication is. So it's a very simple operation conceptually, and it's one that we do many, many, many times in deep learning; basically every neural network has this as its most fundamental operation. Of course, we don't actually need to implement matrix multiplication from scratch because it's done for us in libraries, but we will often do things where we have to fuse in some kind of matrix-multiplication-like pieces. And, of course, it's also just a good exercise. So let's take a look at how to do matrix multiplication, first of all in pure Python. So in the fast.ai course that I mentioned, there's a very complete in-depth dive into matrix multiplication in part two, lesson 11, where we spend like an hour or two talking about nothing but matrix multiplication, so I'm not going to go into that much detail here. But what we do in that lesson is use the MNIST dataset, and so we're going to do the same thing here. We're going to grab the MNIST dataset of handwritten digits. They are 28 by 28 digits; they look like this. 28 by 28 is 784. So to basically do a single layer of a neural net, without the activation function, we would do a matrix multiplication of the image flattened out by a weight matrix with 784 rows and however many columns we like. And since we're going to go straight to the output — so this would be a linear function, a linear model — we'd have 10 columns, one for each digit. So here, this is our weights. We're not actually going to do any learning here; this is not any deep learning or logistic regression training, it's just for an example. OK, so we've got our weights and we've got our input data, x_train and x_valid. And so we're going to start off by implementing this in Python. Now, again, Python is really slow, so let's make this smaller. So matrix one will just be five rows; matrix two will be all the weights. So that's going to be a five by 784 matrix multiplied by a 784 by 10 matrix. Now these two have to match — of course they have to match, because otherwise this dot product won't work: the columns of the first have to match the rows of the second. OK, so let's pull that out into A rows, A columns, B rows, B columns. And obviously A columns and B rows are the things that have to match. And then the output will be A rows by B columns, so five by 10. So let's create an output full of zeros with that many rows and columns in it.
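A sketch of that setup, with the caveat that the transcript doesn't show how MNIST is loaded: x_train is assumed here to already be a (50000, 784) float tensor of flattened digit images, and the variable names are my own.

```python
# Matrix-multiply setup as described above.
torch.manual_seed(1)
weights = torch.randn(784, 10)      # 784 rows, one output column per digit

m1 = x_train[:5]                    # keep it tiny for pure Python: 5 rows of 784
m2 = weights                        # 784 x 10

ar, ac = m1.shape                   # A rows, A columns
br, bc = m2.shape                   # B rows, B columns
assert ac == br                     # inner dimensions must match

t1 = torch.zeros(ar, bc)            # 5 x 10 output, filled in below
```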
And so now we can go ahead and go through every row of A, every column of B, and do the dot product, which involves going through every item in the innermost dimension, all 784 of them, multiplying together the equivalent things from m1 and m2 and summing them up into the output tensor that we created. So that's going to give us, as we said, a five by 10 output. And here it is. OK, so this is how I always create things in Python. I basically almost never have to debug; I almost never have unexpected errors in my code, because I've written every single line one step at a time in Python, and I've checked them all as I go. And then I copy all the cells and merge them together, stick a function header on like so, and here is matmul. So this is exactly the code we've already seen. And we can call it, and we'll see that for 39,200 innermost operations, it took us about a second. So that's pretty slow. OK, so now that we've done that, you might not be surprised to hear that we now need to do the innermost loop as a kernel call, in such a way that it can be run in parallel. Now, in this case, the innermost loop is not this line of code; it's actually this line of code. I mean, we can choose it to be whatever we want it to be, but in this case, this is how we're going to do it. We're going to say that every — well, it's not pixels this time — every cell in the output tensor, like this one here, is going to be one CUDA thread. So one CUDA thread is going to do the dot product. So this is the bit that does the dot product, and that'll be our kernel. So we can write that: the matmul block kernel is going to contain that. OK, so that's exactly the same thing that we just copied from above. And so now we're going to need something to run this kernel. And you might not be surprised to hear that in CUDA, we are going to call this using blocks and threads. But something that's rather handy in CUDA is that the blocks and threads don't have to be just a 1D vector; they can be a 2D or even 3D tensor. So in this case, you can see we've got one, two — it's a little hard to see exactly where they stop — three, four blocks, and that's in one dimension, and then there's also one, two, three, four, five blocks in the other dimension. And so each of these blocks has an index. So this one here is going to be zero, zero — a little bit hard to see. This one here is going to be one, three, and so forth. And this one over here is going to be three, four. So rather than just having an integer block index, we're going to have a tuple block index. And then within a block — to pick, let's say, this exact spot here, I didn't do that very well — there's going to be a thread index. And again, the thread index won't be a single index into a vector; it'll be two elements. So in this case, it'll be one, two, three, four, five, six rows down, and one, two, three, four, five, six, seven, eight, nine, ten — is that 11, 12? I can't count — maybe 12 across. So this spot here is actually going to be defined by two things: one is the block, and the block is 3, 4, and the thread is 6, 12. So that's how CUDA lets us index into two-dimensional grids using blocks and threads. We don't have to; it's just a convenience if we want to. And in fact, we can use up to three dimensions.
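Before building the 2D runner, here's the triple-loop Python matmul just described, merged into a function (the names are mine; the logic follows the walkthrough):

```python
# Pure-Python matrix multiplication: every row of A, every column of B, dot product.
def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):              # every row of A
        for j in range(bc):          # every column of B
            for k in range(ac):      # dot product over the inner dimension
                c[i, j] += a[i, k] * b[k, j]
    return c

res = matmul(m1, m2)                 # 5 * 10 * 784 = 39,200 inner steps, about a second
```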
So to create our kernel runner, rather than just having two nested loops for blocks and threads, we're going to have two lots of two nested loops, for both our x and y blocks and threads, or our rows-and-columns blocks and threads. So it ends up looking a bit messy, because we now have four nested for loops. So we'll go through our blocks on the y-axis and then through our blocks on the x-axis, and then through our threads on the y-axis and then through our threads on the x-axis. And so you can think of this Cartesian product as being: for each block, for each thread. Now to get the .y and the .x, we'll use this handy little Python standard library thing called SimpleNamespace. I use it so much I just give it the name ns, because I use namespaces all the time in my quick and dirty code. So we go through all those four loops, we then call our kernel, and we pass in an object containing the y and x coordinates, and that's going to be our block. And we also pass in our thread, which is an object with the y and x coordinates of our thread. And it's going to eventually do all possible block and thread numbers for each of those blocks. And we also need to tell it how big each block is, how high and how wide, and so that's what this is. This is going to be a SimpleNamespace, an object with an x and y, as you can see. So we need to know how big they are, just like earlier on we had to know the block dimension; that's why we passed in threads. So remember, this is all pure PyTorch. We're not actually calling out to any CUDA; we're not calling out to any libraries other than a tiny bit of PyTorch for the indexing and tensor creation. So you can run all of this by hand, make sure you understand it, put it in the debugger, step through it. And so it's going to call our function. So here's our matrix multiplication function. As we said, it's a kernel that contains the dot product that we wrote earlier. So now the guard is going to have to check that the row number we're up to is not past the height we have, and the column number we're up to is not past the width we have. And we also need to know what row number we're up to, and this is exactly the same — actually, I should say the column is exactly the same as we've seen before. And in fact, you might remember in the CUDA we had blockIdx.x. This is why, right? Because CUDA always gives you these three-dimensional dim3 structures, so you have to put this .x. So we can find out the column this way, and then we can find out the row by seeing how many blocks we've gone through, how big each block is in the y-axis, and how many threads we've gone through in the y-axis. So what row number are we up to? What column number are we up to? Is that inside the bounds of our tensor? If not, then just stop, and otherwise do our dot product and put it into our output tensor. So that's all pure Python. And so now we can call it by getting the height and width of our first input, the height and width of our second input, and so then k and k2, the inner dimensions, ought to match. We can then create our output. And so now threads per block is not just the number 256, but a pair of numbers, an x and a y, and we've selected two numbers that multiply together to give 256.
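Here is a sketch of the 2D runner and the kernel it calls, mirroring the four nested loops, the SimpleNamespace objects and the 2D guard described above (again a reconstruction with my own names):

```python
from types import SimpleNamespace as ns

def blk_kernel2d_run(f, blocks, threads, *args):
    for i0 in range(blocks.y):
        for i1 in range(blocks.x):
            for j0 in range(threads.y):
                for j1 in range(threads.x):
                    f(ns(x=i1, y=i0), ns(x=j1, y=j0), threads, *args)

def matmul_bk(blockidx, threadidx, blockdim, m, n, out, h, w, k):
    r = blockidx.y * blockdim.y + threadidx.y    # which output row
    c = blockidx.x * blockdim.x + threadidx.x    # which output column
    if r >= h or c >= w: return                  # the 2D guard
    o = 0.
    for i in range(k):                           # the dot product
        o += m[r*k + i] * n[i*w + c]
    out[r*w + c] = o

def matmul_2d(m, n):
    h, k = m.shape
    k2, w = n.shape
    assert k == k2, "size mismatch"
    out = torch.zeros(h * w, dtype=m.dtype)
    tpb = ns(x=16, y=16)                         # 16 * 16 = 256 threads per block
    blocks = ns(x=math.ceil(w / tpb.x), y=math.ceil(h / tpb.y))
    blk_kernel2d_run(matmul_bk, blocks, tpb, m.flatten(), n.flatten(), out, h, w, k)
    return out.view(h, w)

# Same answer as the triple-loop version, because it runs the same dot product.
assert torch.isclose(matmul_2d(m1, m2), matmul(m1, m2)).all()
```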
So again, this is a reasonable choice if you've got two-dimensional inputs, to spread it out nicely. One thing to be aware of here is that your threads per block can't be bigger than 1024. So we're using 256, which is safe, and notice that you have to multiply these together: 16 times 16 is going to be the number of threads per block. So these are safe numbers to use. You're not going to run out of blocks though: 2 to the 31 is the maximum number of blocks for dimension 0, and then 2 to the 16 for dimensions 1 and 2 (I think it's actually minus 1, but don't worry about that). So don't have too many threads, but you can have lots of blocks. But of course, each streaming multiprocessor is going to run all of the threads in a block on the same device, and they're also going to have access to the same shared memory, so that's why you have a decent number of threads per block. So our blocks: for the x, we're going to use ceiling division, and for the y, we're going to use the same ceiling division. So if any of this is unfamiliar, go back to our earlier example, because the code's all copied from there. And now we can call our 2D block kernel runner, passing in the kernel, the number of blocks, the number of threads per block, our input matrices flattened out, our output matrix flattened out, and the dimensions that it needs, because they all get used here, and return the result. And so we call that matmul with a 2D block, and we can check that the results are close to what we got in our original manual loops. And of course they are, because it's running the same code. So now that we've done that, we can do the CUDA version. Now the CUDA version is going to be so much faster, we do not need to use this slimmed-down matrix anymore; we can use the whole thing. So to check that it's correct, I want a fast CPU-based approach that I can compare to. Previously it took about a second to do 39,000 elements. So, I'm not going to explain how this works, but I'm going to use a broadcasting approach to get a fast CPU-based version. In the fast.ai course we teach you how to do this broadcasting approach. It's still a pure Python approach, but it manages to do it all in a single loop rather than three nested loops. It gives the same answer for the cut-down tensors, but much faster, only four milliseconds. So it's fast enough that we can now run it on the whole input matrices, and it takes about 1.3 seconds. And so this broadcast-optimized version, as you can see, is much faster, and now we've got 392 million additions going on in the middle of our, effectively, three loops, but we're broadcasting them. So this is much faster. But the reason I'm really doing this is so that we can store this result to compare to, so that we can make sure that our CUDA version is correct. Okay, so how do we convert this to CUDA? You might not be surprised to hear that what I did was grab this function and pass it over to ChatGPT and say, please rewrite this in C. And it gave me something I could basically use first time. And here it is. This time I don't have unsigned char star, I have float star. Other than that, this looks almost exactly like the Python we had, with exactly the same changes we saw before. We've now got the .y and .x versions. Once again, we've got __global__, which says: please run this on the GPU when we call it from the CPU. So for the kernel, I don't think there's anything more to talk about there.
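One aside before the host-side caller: the transcript doesn't show the broadcasting CPU version it uses as the reference result, so here is one common way to write it (the single-loop, fast.ai-style broadcast matmul); treat it as an illustrative sketch rather than the lecture's exact code. With that reference stored, we can get back to the kernel's caller.

```python
# Broadcast matmul: one Python loop instead of three nested loops.
def matmul_bc(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # row a[i] broadcast against every column of b, summed over the inner dim
        c[i] = (a[i, :, None] * b).sum(dim=0)
    return c

res_cpu = matmul_bc(x_train, weights)    # ~1.3s for the full MNIST training set
```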
And then the thing that calls the kernel, matmul, is going to be passed two tensors. We're going to check that they're both contiguous and check that they are on the CUDA device. We'll grab the height and width of the first and second tensors, we're going to grab the inner dimension, and we'll make sure that the inner dimensions of the two matrices match, just like before. And this is how you do an assertion in PyTorch CUDA code: you call TORCH_CHECK, pass in the thing to check, and pass in the message to pop up if there's a problem. These are a really good thing to spread all through your CUDA code to make sure that everything is as you thought it was going to be. Just like before, we create an output. Now, when we create the number of threads, we don't say threads is 256. We instead use this special thing provided by CUDA for us, dim3. So this is basically a tuple with three elements. So we're going to create a dim3 called tpb; it's going to be 16 by 16. Now I said it has three elements — where's the third one? That's okay, it just treats the third one as being one, so we can ignore it. So that's the number of threads per block. And then how many blocks will there be? Well, in the x dimension, it'll be w divided by tpb.x, ceiling division. In the y dimension, it'll be h divided by tpb.y, ceiling division. And so that's the number of blocks we have. So just like before, we call our kernel just by calling it like a normal function, but then we add this weird triple angle bracket thing telling it how many blocks and how many threads. So these aren't ints anymore; these are now dim3 structures, and that's what we use, these dim3 structures. And in fact, even before, what actually happened behind the scenes when we did the grayscale thing is that even though we passed in 256, for instance, we actually ended up with a dim3 structure, in which case indices one and two — the .y and .z values — would have been set to one automatically. So we've actually already used a dim3 structure without quite realizing it. And then, just like before, we pass in all of the tensors we want, casting them to pointers — well, not just casting, converting them to pointers of some particular data type — and passing in any other information that the kernel will need. Okay, so then we call load_cuda again, and that'll compile this into a module. We make sure that our inputs are both contiguous and on the CUDA device, and then we call module.matmul, passing those in, putting the result on the CPU and checking that the values are all close — and it says yes, they are. This is now running not just on the first five rows, but on the entire MNIST dataset. And on the entire MNIST dataset, using an optimized CPU approach, it took 1.3 seconds; using CUDA, it takes 6 milliseconds. So that is quite a big improvement. Cool. The other thing I will mention, of course, is that PyTorch can do a matrix multiplication for us just by using @. How long does that take to run? It obviously gives the same answer, and it takes 2 milliseconds, so three times faster. And in many situations, it'll be much more than three times faster. So why are we still pretty slow compared to PyTorch? I mean, this isn't bad, to do 392 million of these calculations in 6 milliseconds. But if PyTorch can do it so much faster, what are they doing?
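Before getting to that question, here is a sketch of the whole CUDA matmul source and its compile-and-call sequence as described above, again assuming the load_cuda and cuda_begin helpers, plus the x_train, weights and res_cpu names from the earlier sketches; it follows the walkthrough rather than reproducing the original file.

```python
cuda_src = cuda_begin + r'''
__global__ void matmul_k(float* m, float* n, float* out, int h, int w, int k) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= h || c >= w) return;                       // guard
    float o = 0;
    for (int i = 0; i < k; ++i) o += m[r*k + i] * n[i*w + c];
    out[r*w + c] = o;
}

torch::Tensor matmul(torch::Tensor m, torch::Tensor n) {
    CHECK_INPUT(m); CHECK_INPUT(n);
    int h = m.size(0), k = m.size(1);
    int k2 = n.size(0), w = n.size(1);
    TORCH_CHECK(k == k2, "Size mismatch!");
    auto output = torch::zeros({h, w}, m.options());
    dim3 tpb(16, 16);                                   // threads per block (z defaults to 1)
    dim3 blocks(cdiv(w, tpb.x), cdiv(h, tpb.y));
    matmul_k<<<blocks, tpb>>>(
        m.data_ptr<float>(), n.data_ptr<float>(), output.data_ptr<float>(), h, w, k);
    C10_CUDA_KERNEL_LAUNCH_CHECK();
    return output;
}
'''
cpp_src = "torch::Tensor matmul(torch::Tensor m, torch::Tensor n);"

mm_module = load_cuda(cuda_src, cpp_src, ['matmul'])

m1c, m2c = x_train.contiguous().cuda(), weights.contiguous().cuda()
res = mm_module.matmul(m1c, m2c).cpu()                  # ~6ms vs ~1.3s on the CPU
assert torch.isclose(res, res_cpu, atol=1e-4).all()     # and m1c @ m2c is faster still
```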
Well, the trick is that they are taking advantage, in particular, of this shared memory. So shared memory is a small memory space that is shared amongst the threads in a block, and it is much faster than global memory. In our matrix multiplication, when we have one of these blocks — and it's going to do each block all in the same SM — it's going to be reusing the same 16 rows and columns again and again and again, each time with access to the same shared memory. So you can see how you could potentially cache a lot of the information you need and reuse it, rather than going back to the slower memory again and again. So this is an example of the kinds of things that you could potentially optimize once you get to that point. The only other thing that I wanted to mention here is that this 2D block idea is totally optional. You can do everything with 1D blocks, or with 2D blocks, or with 3D blocks and threads. And just to show that, I've actually got an example at the end here which converts RGB to grayscale using the 2D blocks — because remember, earlier when we did this, it was with 1D blocks. It gives exactly the same result. And if we compare the code, the version that was done with 1D threads and blocks is quite a bit shorter than the version that uses 2D threads and blocks. And so in this case, even though we're manipulating pixels, where you might think that using the 2D approach would be neater and more convenient, in this particular case it wasn't really. It's still pretty simple code, but we have to deal with the columns and rows, the .x and .y, separately. The guards are a little bit more complex, and we have to find out what index we're actually up to here. Whereas this kernel was just much more direct, just two lines of code. And then calling the kernel, again, is a little bit more complex with the threads-per-block stuff rather than this. The key thing I wanted to point out is that these two pieces of code do exactly the same thing. So don't feel like you have to use a 2D or 3D block and thread structure if you don't want to — you can just use a 1D one. The 2D stuff is only there if it's convenient for you and you want to use it; don't feel like you have to. So yeah, I think that's basically all the key things that I wanted to show you all today. The main thing I hope you take from this is that even for Python programmers, for data scientists, it's not way outside our comfort zone. We can write these things in Python, we can convert them pretty much automatically, and we end up with code that looks reasonably familiar even though it's now in a different language. We can do everything inside notebooks, we can test everything as we go, we can print things from our kernels. And so, you know, it's hopefully feeling a little bit less beyond our capabilities than we might have previously imagined. So I'd say, yeah, go for it. I also think it's increasingly important to be able to write CUDA code nowadays, because things like FlashAttention, or things like quantization — GPTQ, AWQ, bitsandbytes — these are all things you can't write in PyTorch. You know, our models are getting more sophisticated.
The kinds of assumptions that libraries like PyTorch make about what we want to do are, you know, increasingly less accurate. So we're having to do more and more of this stuff ourselves nowadays in CUDA, and so I think it's a really valuable capability to have. Now, the other thing I mentioned is that we did it all in Colab today, but we can also do things on our own machines if you have a GPU, or on a cloud machine. And getting set up for this, again, is much less complicated than you might expect. In fact, I can show you: it's basically three or four lines of bash script to get it all set up. It'll run on Windows under WSL, and it'll also run on Linux. Of course, CUDA stuff doesn't really work on Macs, so they're out. Actually, I'll put a link to this in the video notes, but for now I'm just going to jump to a Twitter thread where I wrote this all down to show you all the steps. So the way to do it is to use something called Conda. Conda is something that very, very, very few people understand. A lot of people think it's like a replacement for pip or Poetry or something. It's not. It's better to think of it as a replacement for Docker. You can literally have multiple different versions of Python, multiple different versions of CUDA, multiple different C++ compilation systems all in parallel at the same time on your machine and switch between them. You can only do this with Conda, and everything just works. So you don't have to worry about all the confusing stuff around .run files or Ubuntu packages or anything like that. You can do everything with just Conda. You need to install Conda. I've actually got a script — you just run the script. It's a tiny script, as you see. If you just run the script, it'll automatically figure out which Miniconda you need, it'll automatically figure out what shell you're on, and it'll just go ahead and download it and install it for you. OK, so run that script, restart your terminal. Now you've got Conda. Step two is to find out what version of CUDA PyTorch wants you to have. So if I click Linux, Conda: CUDA 12.1 is the latest. So then step three is to run this shell command, replacing 12.1 with whatever version PyTorch currently wants. It's actually still 12.1 for me at this point. And that will install everything: all the stuff you need to profile, debug, build, etc. All the NVIDIA tools you need, the full suite, will be installed, and it's coming directly from NVIDIA so you'll have the proper versions. As I said, you can have multiple versions installed at once in different environments, no problem at all. And then finally, install PyTorch. This command here will install PyTorch. For some reason I wrote nightly here; you don't need the nightly, just remove -nightly. So this will install the latest version of PyTorch using the NVIDIA CUDA stuff that you just installed. If you've used Conda before and it was really slow, that's because it used to use a different solver, which was thousands or tens of thousands of times slower than the modern one. The new one has just been added and made the default in the last couple of months, so nowadays this should all run very fast. And as I said, it'll run under WSL on Windows, it'll run on Ubuntu, it'll run on Fedora, it'll run on Debian. It'll all just work.
So that's how I strongly recommend getting yourself set up for local development. You don't need to worry about using Docker. As I said, you can switch between different CUDA versions, different Python versions, different compilers and so forth without having to worry about any of the Docker stuff. And it's also efficient enough that if you've got the same libraries and so forth installed in multiple environments, it'll hard-link them, so it won't even use additional hard drive space. So it's also very efficient. Great. So that's how you can get started on your own machine, or on the cloud, or whatever. Hopefully you'll find that helpful as well. All right. Thanks very much for watching. I hope you found this useful and I look forward to hearing about what you create with CUDA. In terms of next steps, check out the other CUDA MODE lectures — I will link to them — and I would also recommend trying out some projects of your own. So for example, you could try to implement something like 4-bit quantization or FlashAttention or anything like that. Now those are pretty big projects, but you can try to break them up into smaller things that you build up one step at a time. And of course, look at other people's code. So look at the implementation of FlashAttention, look at the implementation of bitsandbytes, look at the implementation of GPTQ and so forth. The more time you spend reading other people's code, the better. All right. I hope you found this useful, and thank you very much for watching.
Info
Channel: Jeremy Howard
Views: 49,773
Keywords: deep learning, fastai
Id: nOxKexn3iBo
Length: 77min 56sec (4676 seconds)
Published: Sun Jan 28 2024