Hi there, I'm Jeremy Howard from Answer.AI and
this is Getting Started with CUDA. CUDA is of course what we use to program NVIDIA GPUs if we
want them to go super fast and we want maximum flexibility. And it has a reputation of being
very hard to get started with. The truth is, it's actually not so bad. You just have to know some
tricks. And so in this video I'm going to show you some of those tricks. So let's switch to the
screen and take a look. So I'm going to be doing all of the work today in notebooks. This might
surprise you. You might be thinking that to do work with CUDA we have to do stuff with compilers
and terminals and things like that. And the truth is actually it turns out we really don't, thanks
to some magic that is provided by PyTorch. You can follow along in all of these steps and I strongly
suggest you do so on your own computer. You can go to the CUDA mode organization on GitHub, find
the lectures repo there and you'll see there is a Lecture 3 folder. This is Lecture 3 of the CUDA
mode series. You don't need to have seen any of the previous ones, however, to follow along. In
the readme there you'll see there's a Lecture 3 section and at the bottom there is a click to
go to the Colab version. Yep, you can run all of this in Colab for free. You don't even have
to have a GPU available to run the whole thing. We're going to be following along with some of
the examples from this book, Programming Massively Parallel Processors. Programming Massively Parallel
Processors is a really great book to read, and once you've completed today's lesson you should be able
to make a great start on this book. It goes into a lot more details about some of the things that
we're going to cover fairly quickly. It's okay if you don't have the book, but if you want to
go deeper I strongly suggest you get it. In fact you'll see in the repo that Lecture 2 in this
series actually was a deep dive into chapters 1 to 3 of that book. So actually you might want to
do Lecture 2, confusingly enough, after this one, Lecture 3, to get more details about some of what
we're talking about. Okay, so let's dive into the notebook. So what we're going to be doing today
is we're going to be doing a whole lot of stuff with plain PyTorch first to make sure that we get
all the ideas and then we will try to convert each of these things into CUDA. So in order to do
this we're going to start by importing a bunch of stuff. In fact let's do all of this in Colab.
So here we are in Colab and you should make sure that you set in Colab your runtime to the T4
GPU. You can use this one plenty for free, and it's easily good enough to run everything we're
doing today. And once you've got that running we can import the libraries we're going to need and
we can start on our first exercise. So the first exercise actually comes from Chapter 2 of the
book and Chapter 2 of the book teaches how to do this problem which is converting an RGB color
picture into a grayscale picture. And it turns out that the recommended formula for this is to take
0.21 of the red pixel, 0.72 of the green pixel, 0.07 of the blue pixel and add them up together
and that creates the luminance value which is what we're seeing here. That's a common way,
the kind of the standard way to go from RGB to grayscale. So we're going to do this, we're
going to make a CUDA kernel to do this. So the first thing we're going to need is a picture and
anytime you need a picture I recommend going for a picture of a puppy. So we've got here a URL to
a picture of a puppy. So we'll just go ahead and download it and then we can use torchvision.io to
load that. So this is already part of Colab. If you're interested in running stuff on your own
machine or a server in the cloud I'll show you how to set that up at the end of this lecture. So
let's read in the image and if we have a look at the shape of it it says it's 3 by 1066 by 1600.
So I'm going to assume that you know the basics of PyTorch here. If you don't know the basics of
PyTorch I am a bit biased but I highly recommend my course which covers exactly that. You can go
to course.fast.ai and you get the benefit also of having seen some very cute bunnies and along
with the very cute bunnies it basically takes you through everything you need to be an
effective practitioner of modern deep learning. So finish part one if you want to go right into
those details but even if you just do the first two or three lessons that will give you more than
enough of what you need to know to understand this kind of code and these kinds of outputs. So I'm assuming
you've done all that. So you'll see here we've got a rank 3 tensor. There are three channels, so
they're like the faces of a cube if you like. There are 1066 rows on each face so that's the
height, and then there are 1600 columns in each row, so that's the width. So if we then look at
the first couple of channels and the first three rows and the first four columns you can see here
that these are unsigned 8-bit integers. So they're bytes and so here they are. So that's what an
image looks like. Hopefully you know all that already. So let's take a look at our image. To
do that I'm just going to create a simple little function, show image, that will create a matplotlib
plot and remove the axes. If it's colour, which this one is, it will change the order of the axes from
channel by height by width, which is what PyTorch uses, to height by width by channel, which is what
matplotlib (I'm having trouble saying it today) expects. So we change the order of the axes to be 1, 2, 0,
and then we can show the image, putting it on the CPU if necessary.
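Roughly, a helper like that can look as follows; this is a sketch based on the description, so the notebook's actual helper may differ in its name and details:

```python
import matplotlib.pyplot as plt

def show_img(x, figsize=(4, 3), **kwargs):
    # Show a tensor as an image: drop the axes, and for colour images reorder
    # channel-by-height-by-width (PyTorch) to height-by-width-by-channel (matplotlib).
    plt.figure(figsize=figsize)
    plt.axis('off')
    if len(x.shape) == 3: x = x.permute(1, 2, 0)   # CHW -> HWC
    plt.imshow(x.cpu(), **kwargs)
```

Now we're going to be working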
with this image in Python which is going to be just pure Python to start with before we switch to
CUDA. It's going to be really slow, so we'll resize it to have the smallest dimension be 150.
So that's the height in this case. So we end up with a 150 by 225 shape, which is a
rectangle with 33,750 pixels, each one with R, G and B values, and there is our puppy. So you see,
wasn't it a good idea to make this a puppy? Okay, so how do we convert that to grayscale? Well, the
book has told us the formula to use. Go through every pixel and do that to it. Alright so here is
the loop we're going to go through every pixel and do that to it and stick that in the output. So
that's the basic idea. So what are the details of this? Well here we've got channel by row by
column. So how do we loop through every pixel? Well the first thing we need to know is how many
pixels are there. So we can say channel by height by width is the shape. So now we've defined those
three variables. So the number of pixels is the height times the width. And so to loop through all
those pixels an easy way to do them is to flatten them all out into a vector. Now what happens when
you flatten them all out into a vector? Well as we saw they're currently stored in this format
where we've got one face, and then another face, and then (we haven't got it printed here, but)
there's a third face. Within each face there is one row (we're just showing the first few),
and then the next row, and then the next row, and within each row you've got column, column,
column. So let's say we had a small image; in fact we can do it like this. We could say
here's our red, so we've got the pixels 0, 1, 2, 3, 4, 5. So let's say this was a height 2, width
3, 3 channel image. So then there'll be 6, 7, 8, 9, 10, 11 for green, and 12, 13, 14, 15, 16, 17
for blue. So let's say these are the pixels. So when these are flattened out, it's going to turn into
a single vector just like so: the first face, then 6, 7, 8 and so on, then 12, 13, 14 and so on. So actually when we talk about an image we initially see it as
a bunch of pixels. We can think of it as having 3 channels but in practice in our computer the
memory is all laid out linearly. Everything has just an address in memory. It's just a whole
bunch. You can think of it as your computer's memory is one giant vector. And so when we say,
when we say flatten then what that's actually doing is it's turning our channel by height
by width into a big vector like this. Okay, so now that we've done that we can say, alright,
the place we're going to be putting this into, the result, we're going to start out with just
an empty vector of length n. We'll go through all of the n values from 0 to n minus 1 and we're
going to put in the output value roughly 0.29 times the input value at x[i]. So this will be here in
the red bit. And then 0.59 times x[i + n]. So n here is this distance: it's the number of
pixels, 1, 2, 3, 4, 5, 6. So that's why to get to green we have to jump up to i plus n. And then to
get to blue we have to jump to i plus 2n. And so that's how this works. We've flattened everything
out and we're indexing into this flattened out thing directly. And so at the end of that we're
going to have our grayscale all done. So we can then just reshape that into height by width, and
there it is: there's our grayscale puppy. And you can see here the flattened image is just a single
vector with all those channel values flattened out, as we described.
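To make that concrete, here is a minimal sketch of the pure-Python kernel and loop being described; the function names and exact luminance coefficients are my reconstruction, so treat them as illustrative:

```python
import torch

def rgb2grey_k(i, out, x, n):
    # one "kernel call": compute one grey pixel from the flattened RGB vector
    out[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n]

def rgb2grey_py(x):
    c, h, w = x.shape
    n = h * w
    x = x.flatten()
    res = torch.empty(n, dtype=x.dtype, device=x.device)
    for i in range(n):          # loop over every pixel position
        rgb2grey_k(i, res, x, n)
    return res.view(h, w)
```

Okay, now, a loop like that is incredibly slow: it's nearly two seconds for an image with only about 34,000 pixels in it. So to speed it up we are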
going to want to use CUDA. How come CUDA is able to speed things up? Well the reason CUDA is able
to speed things up is because it is set up in a very different way to how a normal CPU is set up.
And we can actually see that if we look at some of this information about what is in an RTX 3090 card
for example. Now an RTX 3090 card is a fantastic GPU. You can get them secondhand, pretty good
value. So a really good choice particularly for hobbyists. What is inside a 3090? It has 82 SMs.
What's an SM? An SM is a streaming multiprocessor. So you can think of this as almost like a separate
CPU in your computer. And so there's 82 of these. So that's already a lot more than you have CPUs in
your computer. But then each one of these has 128 CUDA cores. So these CUDA cores are all able to
operate at the same time. These multiprocessors are all able to operate at the same time. So that
gives us 128 times 82, roughly 10,500 CUDA cores in total that can all work at the same time. So that's a
lot more than any CPU we're familiar with can do. And the 3090 isn't even at the very top end. It's
really a very good GPU but there are some with even more CUDA cores. So how do we use them all?
Well we need to be able to set up our code in such a way that we can say here is a piece of code that
you can run on lots of different pieces of data, lots of different pieces of memory at the same
time so that you can do 10,000 things at the same time. And so CUDA does this in a really simple and
pretty elegant way which is it basically says okay take out the kind of the inner loop. So here's our
inner loop. The stuff where you can run 10,000 of these at the same time. They're not going to
influence each other at all. So you see these do not influence each other at all. All they do is
they stick something into some output memory. So it doesn't even return something. You can't return
something from these CUDA kernels, as they're going to be called. All you can do is modify
memory. And you don't know what order they're going to run in: they could all
run at the same time. Some could run a little bit before another one and so forth. So the way that
CUDA does this is it says okay write a function. And in your function write a line of code which
I'm going to call as many dozens, hundreds, thousands, millions of times as necessary to do
all the work that's needed. And I'm going to do this in parallel for you as much as I can. In the
case of running on a 3090, up to 10,000 times, up to 10,000 things all at once. And I will get
this done as fast as possible. So all you have to do is basically write the line of code you want
to be called lots of times. And then the second thing you have to do is say how many times to call
that code. And so what will happen is that piece of code called the kernel will be called for you.
It'll be passed in whatever arguments you ask to be passed in, which in this case will be the input
array tensor, the output tensor, and the size of how many pixels are in each channel. And it'll
tell you, okay, this is the ith time I've called it. Now we can simulate that in Python very
simply, with a single for loop. This doesn't happen in parallel, so it's not going to speed
it up, but the results, the semantics, are going to be identical to CUDA. So here is a
function we've called run_kernel. We pass it a function, we say how many
times to run the function, and we give it the arguments to call the function with. And so each time it will
call the function, passing in the index and the arguments that we've requested.
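As a sketch (assuming a name like run_kernel, which may not match the notebook exactly), the simulated launcher is just:

```python
def run_kernel(f, times, *args):
    # sequential stand-in for a CUDA launch: call the "kernel" f once per index
    for i in range(times):
        f(i, *args)
```

Okay,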
so we can now create something to call that. So let's get the, just like before, get the number of
channels, height and width, the number of pixels, flatten it out, create the result tensor that
we're going to put things in. So this time, rather than calling the loop directly, we will
call run kernel. We will pass in the name of the function to be called as f. We will pass in the
number of times, which is the number of pixels for the loop. And we'll pass in the arguments that
are going to be required inside our kernel. So we're going to need out, we're going to need
x, and we're going to need n. So you can see here we're using no external libraries at all. We
have just plain Python and a tiny bit of PyTorch, just enough to create a tensor index into tensors.
And that's all that's being used. Conceptually, it's doing the same thing as a CUDA kernel would
do, nearly. And we'll get to the nearly in just a moment. But conceptually, you could see that you
could now potentially write something, which if you knew that this was running a bunch of things
totally independently of each other, conceptually, you could now pretty easily parallelize
that. And that's what CUDA does. However, it's not quite that simple. It does not simply
create a single list of numbers like range n does in Python and pass each one in turn into your
kernel. But instead, it actually splits the range of numbers into what's called blocks. So in this
case, maybe there's like 1,000 pixels we wanted to get through. It's going to group them into blocks
of 256 at a time. And so in Python, it looks like this. In practice, a CUDA kernel runner is not
a single for loop that loops n times. Instead, it is a pair of nested for loops. So you don't
just pass in a single number and say this is the number of pixels, but you pass in two numbers,
number of blocks and the number of threads. We'll get into that in a moment. But these are just
numbers. They're just you can put any numbers you like here. And if you choose two numbers
that multiply to get the thing that we want, which is the n times we want to call it, then this
can do exactly the same thing because we're now going to pass in which of the what's the index of
the outer loop we're up to, what's the index in the inner loop we're up to, how many things do we
go through in the inner loop, and therefore inside the kernel, we can find out what index we're up
to by multiplying the block index times the block dimension. So that is to say the i by the threads
and add the inner loop index, the j. So that's what we pass in with the i, j threads. But inside
the kernel, we call it block index, thread index and block dimension. So if you look at the CUDA
book, you'll see here this is exactly what they do. They say the index is equal to the block index
times the block dimension plus the thread index. There's a dot x thing here that we can ignore
for now. We'll look at that in a moment. But in practice, this is actually how CUDA works. So it
has all these blocks and inside there are threads and you can just think of them as numbers. You
can see these blocks, they just have numbers: 0, 1, and so forth. Now that
does mean something a little bit tricky though, which is, well, the first thing I'll say is how
do we pick these numbers, the number of blocks and the number of threads? For now in practice, we're
just always going to say the number of threads is 256. And that's a perfectly fine number to use
as a default. You can't go too far wrong just always picking 256, or nearly always. So don't
worry too much for now about optimizing that number. So if we say, okay, we want to have 256
threads, so remember that's the inner loop, or if we look inside our kernel runner here, that's
our inner loop. So we're going to call each of, this is going to be called 256 times. So how many
times do you have to call this? Well, you're going to have to call it n, number of pixels, divided
by 256 times. Now that might not be an integer, so you'll have to round that up. So that's
how we can calculate the number of blocks we need to make sure that our kernel is called
enough times. Now we do have a problem though, which is that the number of times we would have
liked to have called it, which previously was equal to the number of pixels, might not be a
multiple of 256. So we might end up going too far. And so that's why we also need in our kernel
now this if statement. And so this is making sure that the index that we're up to does not go past
the number of pixels we have. And this appears in basically every CUDA kernel you'll see, and
it's called the guard, or the guard block. So this is our guard to make sure we don't go out of
bounds. So this is the same line of code we had before. Now we've also just added this thing to
calculate the index, and we've added the guard. And this is like the pretty standard first lines
from any CUDA kernel. So we can now run those, and they'll do exactly the same thing as before.
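Putting those pieces together, the block-and-thread simulation described here looks roughly like this (again a sketch, with names of my own choosing):

```python
import math, torch

def blk_kernel_run(f, blocks, threads, *args):
    # simulate a CUDA launch with two nested loops: one over blocks, one over threads
    for i in range(blocks):
        for j in range(threads):
            f(i, j, threads, *args)

def rgb2grey_bk(blockidx, threadidx, blockdim, out, x, n):
    i = blockidx*blockdim + threadidx     # index = block index * block dimension + thread index
    if i < n:                             # the guard: don't go past the number of pixels
        out[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n]

def rgb2grey_pybk(x, threads=256):
    c, h, w = x.shape
    n = h*w
    x = x.flatten()
    res = torch.empty(n, dtype=x.dtype, device=x.device)
    blk_kernel_run(rgb2grey_bk, math.ceil(n/threads), threads, res, x, n)
    return res.view(h, w)
```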
And so the obvious question is, well, why? Why do CUDA kernels work in this weird block and thread
way? Why don't we just tell them the number of times to run it? Why do we have to do it by blocks
and threads? And the reason why is because of some of this detail that we've got here, which is that
CUDA sets things up for us so that everything in the same block (or, to give it its full name, the same
thread block) will all be given some shared memory. And they'll
also all be given the opportunity to synchronize, which is to basically say, OK, everything in
this block has to get to this point before you can move on. All of the threads in a block will be
executed on the same streaming multiprocessor. And so we'll see in later lectures (which won't
be taught by me) that by using blocks smartly, you can make your code run more quickly. And the
shared memory is particularly important. So shared memory is a little bit of memory in the GPU that
all the threads in a block share, and it's fast, super, super fast. Now, there's not
very much of it: on a 3090 it's 128K. So, very small. This is basically the same as a cache in
a CPU. The difference, though, is that on a CPU, you're not going to be manually deciding what goes
into your cache. But on the GPU, you do. It's all up to you. So at the moment, this cache is not
going to be used when we create our CUDA code because we're just getting started. And so we're
not going to worry about that optimization. But to go fast, you want to use that cache. And also,
you want to use the register file. Something a lot of people don't realize is that there's actually
quite a lot of register memory, even more register memory than shared memory. So anyway, those
are all things to worry about down the track, not needed for getting started. So how do we go
about using CUDA? There is a basically standard setup block that I would add, and we are going to
add. And what happens in this setup block is we're going to set an environment variable. You wouldn't
use this in production or for going fast. But this says: if you get an error, stop right away,
basically, rather than carrying on and waiting to see how things go. That way, it can tell us exactly when an error
occurs and where it happens. So that slows things down, but it's good for development. We're also
going to install two modules. One is a build tool, which is required by PyTorch to compile your C++
CUDA code. The second is a very handy little thing called wurlitzer. And the only place you're
going to see that used is in this line here, where we load this extension called wurlitzer.
Without this, anything you print from your CUDA code, in fact from your C++ code, won't appear
in a notebook. So you always want to do this in a notebook where you're doing stuff in CUDA, so that
you can use print statements to debug things.
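In notebook form, that setup cell looks something like the following; I'm assuming the build tool is ninja (which is what load_inline needs), so treat the details as a sketch rather than the exact cell from the lecture:

```python
import os
# Report CUDA errors at the point they happen instead of asynchronously (slow; development only)
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# In the notebook you would also run something like:
#   %pip install -q ninja wurlitzer    # ninja: build tool for compiling; wurlitzer: capture C-level output
#   %load_ext wurlitzer                # so printf output from C++/CUDA shows up in the notebook
```

OK. So if you've got some CUDA code, how do you use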
it from Python? The answer is that PyTorch comes with a very handy thing called load_inline, which
is inside torch.utils.cpp_extension. load_inline is a marvelous function that you just pass in a list
of any of the CUDA code strings that you want to compile, any of the plain C++ strings that you
want to compile, any functions in that C++ you want to make available to PyTorch. And it will go
and compile it all, turn it into a Python module and make it available right away, which is pretty
amazing. I've just created a tiny little wrapper for that called load_cuda, just to streamline it
a tiny bit. Behind the scenes, it's just going to call load_inline. The other thing I've done is
I've created a string that contains some C++ code. This is all C code, I think, but it's compiled
as C++ code. We'll call it C++ code. C++ code we want included in all of our CUDA files. We
need to include this header file to make sure that we can access PyTorch tensor stuff. We want
to be able to use I/O, and we want to be able to check for exceptions. And then I also define
three macros. The first macro just checks that a tensor is CUDA. The second one checks that it's
contiguous in memory, because sometimes PyTorch can actually split things up over different memory
pieces. And then if we try to access that in this flattened out form, it won't work. And then the
way we're actually going to use it, check input, we'll just check both of those things. So if
something's not on CUDA and it's not contiguous, we aren't going to be able to use it. So we
always have this. And then the third thing we do here is we define ceiling division. Ceiling
division is just this, although you can implement it a different way like this. And this is what we're
going to call in order to figure out how many blocks we need. So you don't have to worry about the details
of this too much; it's just a standard setup we're going to use.
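Written out the way the notebook does it, that shared boilerplate and the little wrapper look roughly like this; the macro names, the cdiv helper, and the load_cuda signature are my reconstruction from the description, so don't treat them as the exact source:

```python
from torch.utils.cpp_extension import load_inline

# names below are illustrative reconstructions of the helpers described in the lecture
cuda_begin = r'''
#include <torch/extension.h>
#include <stdio.h>
#include <c10/cuda/CUDAException.h>

// the three macros: is it on CUDA, is it contiguous, and check_input does both
#define CHECK_CUDA(x) TORCH_CHECK(x.device().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

// ceiling division: how many blocks of size b are needed to cover a elements
inline unsigned int cdiv(unsigned int a, unsigned int b) { return (a + b - 1) / b; }
'''

def load_cuda(cuda_src, cpp_src, funcs, opt=False, verbose=False):
    # thin wrapper over torch.utils.cpp_extension.load_inline: compile the source, return a module
    return load_inline(name="inline_ext", cuda_sources=[cuda_src], cpp_sources=[cpp_src],
                       functions=funcs, extra_cuda_cflags=["-O2"] if opt else [], verbose=verbose)
```

OK, so now we need to write our CUDA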
kernel. Now, how do you write the CUDA kernel? Well, all I did, and I recommend you do, is take
your Python kernel and paste it into ChatGPT and say: convert this to equivalent C code, using
the same names, formatting, etc., where possible. Paste it in and ChatGPT will do it for
you. Unless you're very comfortable with C, in which case just writing it yourself is fine. But
this way, since you've already got the Python, why not just do this? It basically was pretty
much perfect. I found, although it did assume that these were floats, they're actually not
floats. We had to change a couple of data types, but basically I was able to use it almost as is.
And so particularly, you know, for people who are much more Python programmers nowadays like me,
this is a nice way to write 95% of the code you need. What else do we have to change? Well, as we
saw in our picture earlier, it's not called blockIdx, it's called blockIdx.x, blockDim.x, threadIdx.x.
So we have to add the .x there. Other than that, if we compare, as you can see, these
two pieces of code look nearly identical. We've had to add data types to them. We've had to add
semicolons. We had to get rid of the colon. We had to add curly brackets. That's about it. So it's
not very different at all. So if you haven't done much C programming, yeah, don't worry about it too
much because, you know, the truth is actually it's not that different for this kind of calculation
intensive work. One thing we should talk about is this: what's unsigned char*? This is just how
you write uint8 in C. If you're not sure how to change a data type between the
PyTorch spelling and the C spelling, you can ask ChatGPT or you can Google it. But this is how you
write byte. The star, in practice, is basically how you say this is an array. So this says that
X is an array of bytes. It actually means it's a pointer, but pointers are treated, as you can
see here, as arrays by C. So you don't really have to worry about the fact that the pointer, it
just means for us that it's an array. But in C, the only kind of arrays that it knows how to deal
with these one dimensional arrays. And that's why we always have to flatten things out. We can't
use multidimensional tensors really directly in these CUDA kernels in this way. So we're going to
end up with these one dimensional C arrays. Yeah, other than that, it's going to look exactly in
fact, I mean, even because we did our Python like that, it's going to look identical. The void here
just means it doesn't return anything. And then the __global__ here ("dunder global") is a special thing added
by CUDA. There's three things that can appear, and this simply says: what should I compile this
to do? You can put __device__, and that means compile it so that you can only call it on
the GPU. You can say __global__, and that says, OK, you can call it from the CPU or GPU and it'll
run on the GPU. Or you can write __host__, which you don't have to. And that just means
it's a normal C or C++ program that runs on the CPU side. So any time we want to call something
from the CPU side to run something on the GPU, which is basically almost always when we're
doing kernels, you write __global__. So here we've got __global__, we've got
our kernel and that's it. So then we need the thing to call that kernel. So earlier to
call the kernel, we called this block kernel function passed in the kernel and passed in the
blocks and threads. And the arguments with CUDA, we don't have to use a special function. There
is a weird special syntax built into kernel to do it for us. To use the weird special syntax, you
say, OK, what's the kernel, the function that I want to call? And then you use these weird triple
angle brackets. So the triple angle brackets is a special CUDA extension to the C++ language. And it
means this is a kernel, please call it on the GPU. And between the triple angle brackets, there's
a number of things you can pass, but you have to pass at least the first two things, which is how
many blocks, how many threads. So how many blocks, ceiling division, number of pixels divided by
threads. And how many threads, as we said before, let's just pick 256 all the time and not
worry about it. So that says call this function as a GPU kernel and then passing in these
arguments. We have to pass in our input tensor, our output tensor and how many pixels. And
you'll see that for each of these tensors, we have to use a special method, .data_ptr.
And that's going to convert it into a C pointer to the tensor. So that's why by the time it arrives
in our kernel, it's a C pointer. You also have to tell it what data type you want it to be treated
as. This says treat it as uint8s. So this is a C++ template parameter here, and this is
a method. The other thing you need to know is that in C++, dot means calling a method of an object,
whereas colon-colon (::) is roughly like calling a method of a class in Python. So you
don't say torch.empty, you say torch::empty to create our output. Actually, back
when we did it in Python, that's right, we just created a length-n vector and then did a .view.
It doesn't really matter how we do it. But in this case, we actually create a two dimensional tensor by passing, we pass in
this thing in curly brackets here. This is called a C++ list initializer. And it's just basically
a little list containing height comma width. So this tells it to create a two dimensional matrix,
which is why we don't need dot view at the end. We could have done it the dot view way as well.
Probably be better to keep it consistent. But this is what I wrote at the time. The other
interesting thing when we create the output is that you pass in input.options(). So this is
our input tensor, and that just says: use the same data type and the same device, the CUDA device, as our
input has. This is a nice, really convenient way, which I don't even think we have in Python to say,
make sure that this is the same data type in the same device. If you say auto here, this is quite
convenient. You don't have to specify what type this is. We could have written torch colon colon
tensor. But by writing auto, it just says figure it out yourself. Which is another convenient
little C++ thing. After we call the kernel, if there's an error in it, we won't necessarily
get told. So to tell it to check for an error, you have to write this. This is a macro that's
again provided by PyTorch. The details don't matter. You should just always call it after you
call a kernel to make sure it works. And then you can return the tensor that you allocated, passed in as a pointer, and filled in.
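As a sketch, the CUDA source string being described (using the cuda_begin boilerplate from earlier; the kernel and function names are assumptions based on the walkthrough) looks something like this:

```python
cuda_src = cuda_begin + r'''
// names here are illustrative; the notebook's actual identifiers may differ
__global__ void rgb_to_grayscale_kernel(unsigned char* x, unsigned char* out, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n];   // guard, then the work
}

torch::Tensor rgb_to_grayscale(torch::Tensor input) {
    CHECK_INPUT(input);
    int h = input.size(1), w = input.size(2);
    auto output = torch::empty({h, w}, input.options());   // same dtype and device as the input
    int threads = 256;
    rgb_to_grayscale_kernel<<<cdiv(w*h, threads), threads>>>(
        input.data_ptr<unsigned char>(), output.data_ptr<unsigned char>(), w*h);
    C10_CUDA_KERNEL_LAUNCH_CHECK();   // surface any launch error
    return output;
}
'''
```

OK. Now, as well as the CUDA source, you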
also need C++ source. And the C++ source is just something that says, here is a list of all of the
details of the functions that I want you to make available to the outside world. In this case,
Python. And so this is basically your header, effectively. So you can just copy and paste the
full line here from your function definition and stick a semicolon on the end. So that's something
you can always do. And so then we call our load CUDA function that we looked at earlier, passing
in the CUDA source code, the C++ source code, and then a list of the names of the functions that are
defined there that you want to make available to Python. So we just have one, which is the RGB to
grayscale function. And believe it or not, that's all you have to do. This will automatically,
you can see it running in the background now, compiling our files with a hugely long command line;
so it's created a main.cpp for us, and it's going to compile it into a main.o for us and compile
everything up, link it all together and create a module. And you can see here, we then take that
module that's been passed back and put it into a variable called module. And then when it's done,
it will load that module. And if we look inside the module that we just created, you'll see now
that apart from the normal auto-generated stuff, Python adds, it's got a function in it, RGB to
grayscale. Okay, so that's amazing. We now have a CUDA function that's been made available from
Python. And we can even see if we want to, this is where it put it all. So we can have a look. And
there it is. You can see it's created a main.cpp, it's compiled it into a main.o, it's created a
library that we can load up, it's created a CUDA file, it's created a build script. And we could
have a look at that build script if we wanted to. And there it is. So none of this matters too
much, it's just nice to know that PyTorch is doing all this stuff for us. And we don't have
to worry about it. So that's pretty cool. So in order to pass a tensor to this, we're going to
be checking that it's contiguous and on CUDA. So we better make sure it is. So we're going to
create an imageC variable, which is the image made contiguous and put onto the CUDA device. Now
we can actually run this on the full sized image, not on the tiny little minimized image we
created before. This has got many more pixels: it's got 1.7 million pixels, whereas before we had
I think it was 35,000, 34,000. And it's gone down from one and a half seconds to one millisecond.
So that is amazing. So it's dramatically faster, both because it's now running in compiled code
and because it's running on the GPU. The step of putting the data onto the GPU is not part of what
we timed. And that's probably fair enough because normally you do that once and then you run a whole
lot of CUDA things on it. We have though included the step of moving it off the GPU and putting it
onto the CPU as part of what we're timing. And one key reason for that is that if we didn't do
that, it can actually run our Python code at the same time that the CUDA code is still running. And
so the amount of time shown could be dramatically less because it hasn't finished synchronizing. So
by adding this, it forces it to complete the CUDA run and to put the data back onto the CPU. That
kind of synchronization you can also trigger by printing a value from it or you can synchronize
it manually. So after we've done that we can have a look, and we should get exactly the same
grayscale puppy.
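Pulled together, the compile-and-call step looks roughly like this (a sketch that follows the naming assumptions above, with img being the full-size image tensor from earlier):

```python
# the C++ "header": declarations of the functions we want exposed to Python
cpp_src = "torch::Tensor rgb_to_grayscale(torch::Tensor input);"

module = load_cuda(cuda_src, cpp_src, ['rgb_to_grayscale'])

imgc = img.contiguous().cuda()               # the kernel requires a contiguous tensor on the CUDA device
grey = module.rgb_to_grayscale(imgc).cpu()   # .cpu() forces a sync, so a timing covers the whole run
```

Okay. So we have successfully created our first real working CUDA kernel from Python. This approach of writing it in Python and then converting it to CUDA is not particularly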
common. But I'm not just doing it as an educational exercise. That's how I like to write
my CUDA kernels, at least as much of it as I can, because it's much easier to debug in Python. It's
much easier to see exactly what's going on. And so I don't have to worry about compiling. It takes
about 45 or 50 seconds to compile even our simple example here. I can just run it straight away.
And once it's working to convert that into C, as I mentioned, you know, chat GPT can do most
of it for us. So I think this is actually a fantastically good way of writing CUDA kernels
even as you start to get somewhat familiar with them. Because it lets you debug and develop much
more quickly. A lot of people avoid writing CUDA just because that process is so painful. And so
here's a way that we can make that process less painful. So let's do it again. And this time
we're going to do it to implement something very important, which is matrix multiplication.
So matrix multiplication, as you probably know, is fundamentally critical for deep learning. It's
like the most basic linear algebra operation we have. And the way it works is that you have a
input matrix M and a second input matrix N. And we go through every row of M. So we go through
every row of M until we get to here we are up to this one and every column of N. And here we are
up to this one. And then we take the dot product at each point of that row with that column. And
this here is the dot product of those two things. And that is what matrix multiplication is. So
it's a very simple operation conceptually. And it's one that we do many, many, many times in deep
learning. And basically every deep learning, every neural network has this as its most fundamental
operation. Of course, we don't actually need to implement matrix multiplication from scratch
because it's done for us in libraries. But we will often do things where we have to kind of fuse
in some kind of matrix multiplication like pieces. And so, you know, and of course, it's also just a
good exercise. So let's take a look at how to do matrix multiplication first of all in pure Python.
So in the, actually in the fast AI course that I mentioned, there's a very complete in-depth dive
into matrix multiplication in part two, lesson 11, where we spend like an hour or two talking about
nothing but matrix multiplication, so I won't go into that much detail here. But what we do in that lesson is we use the MNIST
dataset to do this. And so we're going to do the same thing here. We're going to grab the MNIST
dataset of handwritten digits. And they are 28 by 28 digits. They look like this. 28 by 28 is 784.
So to do a, you know, to basically do a single layer of a neural net or without the activation
function, we would do a matrix multiplication of the image flattened out by a weight matrix with
784 rows and however many columns we like, and we'd need 10 if we're going to go straight
to the output. So this would be a linear function, a linear model: we'd have 10 outputs, one for each
digit. So this is our weights. We're not actually going to do any learning here; no deep
learning, not even logistic regression training. This is just for an example. OK, so
we've got our weights and we've got our input data. X train and X valid. And so we're going to
start off by implementing this in Python. Now, again, Python is really slow, so let's make this
smaller. So matrix one will just be five rows. Matrix two will be all the weights. So that's
going to be a five by 784 matrix multiplied by a 784 by 10 matrix. Now these two have to match.
Of course they have to match, because otherwise this dot product won't work: the row has
to match the column. OK, so let's pull that out into A rows, A columns,
B rows, B columns. And obviously A columns and B rows are the things that have to match. And
then the output will be A rows by B columns. So five by 10. So let's create an output full
of zeros with rows by columns in it. And so now we can go ahead and go through every row of
A, every column of B, and do the dot product, which involves going through every item in the
innermost dimension, or 784 of them, multiplying together the equivalent things from M1 and M2 and
summing them up into the output tensor that we created. So that's going to give us, as we said, a
five by 10. Five by 10 output. And here it is. OK, so this is how I always create things in Python.
I basically almost never have to debug. I almost never have like errors, unexpected errors in my
code, because I've written every single line one step at a time in Python. I've checked them all as
they go. And then I copy all the cells and merge them together, stick a function header on like
so. And so here is matmul. So this is exactly the code we've already seen. And we can call it. And
we'll see that for 39,200 innermost operations, it took us about a second. So that's pretty slow.
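For reference, the triple-loop version being described is essentially the following sketch (variable names are mine):

```python
import torch

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    assert ac == br, "inner dimensions must match"
    c = torch.zeros(ar, bc)
    for i in range(ar):            # every row of a
        for j in range(bc):        # every column of b
            for k in range(ac):    # the dot product over the inner dimension
                c[i, j] += a[i, k] * b[k, j]
    return c
```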
OK, so now that we've done that, you might not be surprised to hear that we now need to do the
innermost loop as a kernel call, in such a way that it can be run in parallel. Now, in this case,
the innermost loop is not this line of code, it's actually this line of code. I mean, we can choose
it to be whatever we want it to be, but in this case this is how we're going to do it. We're going to
say: for every pixel, well, not every pixel, for every cell in the output tensor, like
this one here, there is going to be one CUDA thread. So one CUDA thread is going to do the dot product.
So this is the bit that does the dot product. So that'll be our kernel. So we can write that
matmul block kernel is going to contain that. OK, so that's exactly the same thing that we just
copied from above. And so now we're going to need something to run this kernel. And you might not
be surprised to hear that in CUDA, we are going to call this using blocks and threads. But something
that's rather handy in CUDA is that the blocks and threads don't have to be just a 1D vector. They
can be a 2D or even 3D tensor. So in this case, you can see we've got one, two, it's a little hard
to see exactly where they stop, two, three, four blocks. And so then for each block, and that's
kind of in one dimension, and then there's also one, two, three, four, five blocks in the other
dimension. And so each of these blocks has an index. So this one here is going to be zero, zero,
a little bit hard to see. This one here is going to be one, three, and so forth. And this one over
here is going to be three, four. So rather than just having a integer block index, we're going to
have a tuple block index. And then within a block, there's going to be, to pick, let's say, this
exact spot here, didn't do that very well, there's going to be a thread index. And again, the thread
index won't be a single index into a vector. It'll be two elements. So in this case, it'll be 0,
1, 2, 3, 4, 5, 6 rows down. And 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, is that 11, 12? I can't count. 12
maybe across. So this here is actually going to be defined by two things. One is by the block. And
so the block is 3, 4. And the thread is 6, 12. So that's how CUDA lets us index into two-dimensional
grids using blocks and threads. We don't have to. It's just a convenience if we want to. And in
fact, we can use up to three dimensions. So to create our kernel runner, now rather than just
having two nested loops for blocks and threads, we're going to have to have two lots of two nested
loops for both of our x and y blocks and threads, or our rows and columns blocks and threads. So it
ends up looking a bit messy because we now have four nested for loops. So we'll go through our
blocks on the y-axis and then through our blocks on the x-axis and then through our threads on the
y-axis and then through our threads on the x-axis. And so what that means is that you can think
of this Cartesian product as being: for each block, for each thread. Now to get the .y and the .x,
we'll use this handy little Python standard library thing called SimpleNamespace. I use
it so much I just give it the name ns, because I use namespaces all the time in my quick and
dirty code. So we go through all those four. We then call our kernel and we pass in an object
containing the y and x coordinates. And that's going to be our block. And we also pass in
our thread, which is an object with the y and x coordinates of our thread. And it's going to
eventually do all possible blocks and all possible threads numbers for each of those blocks. And
we also need to tell it how big is each block, how high and how wide. And so that's what this
is. This is going to be a simple namespace, an object with an x and y, as you can see. So we need
to know how big they are. Just like earlier on, we had to know the block dimension. That's why we
passed in threads. So remember this is all pure PyTorch. We're not actually calling any out to any
CUDA. We're not calling out to any libraries other than just a tiny bit of PyTorch for the indexing
and tensor creation. So you can run all of this by hand. Make sure you understand. You can put
it in the debugger. You can step through it. And so it's going to call our function. So here's
our matrix multiplication function. As we said, it's a kernel that contains the dot product
that we wrote earlier. So now the guard is going to have to check that the row number we're
up to is not taller than we have and the column number we're up to is not wider than we have.
And we also need to know what row number we're up to. And this is exactly the same. Actually, I
should say the column is exactly the same as we've seen before. And in fact, you might remember
in the CUDA we had blockIdx.x. This is why, right? Because CUDA always gives you
these three-dimensional dim3 structures. So you have to put this .x. So we can find out
the column this way. And then we can find out the row by seeing how many blocks have we gone
through, how big is each block in the y-axis, and how many threads have we gone through in the
y-axis. So what row number are we up to? What column number are we up to? Is that inside the
bounds of our tensor? If not, then just stop and then otherwise do our dot product and put it into
our output tensor. So that's all pure Python. And so now we can call it by getting the height and
width of our first input, the height and width of our second input. And so then k and k2, the inner
dimensions ought to match. We can then create our output. And so now threads per block is not just
the number 256, but it's a pair of numbers. It's an x and a y. And we've selected two numbers that
multiply together to create 256. So again, this is a reasonable choice if you've got two dimensional
inputs to spread it out nicely. One thing to be aware of here is that your threads per block can't
be bigger than 1024. So we're using 256, which is safe. And notice that you have to multiply these
together. 16 times 16 is going to be the number of threads per block. So these are safe numbers to
use. You're not going to run out of blocks though: 2 to the 31 is the maximum number of blocks for
dimension 0 and then 2 to the 16 for dimensions 1 and 2. I think it's actually minus 1, but don't
worry about that. So don't have too many threads, but you can have lots of blocks. But of course,
each streaming multiprocessor is going to run all of these on the same device and they're
also going to have access to shared memory. So that's why you use a few threads per block. So
our blocks, the x, we're going to use the ceiling division. The y, we're going to use the same
ceiling division. So if any of this is unfamiliar, go back to our earlier example because the code's
all copied from there. And now we can call our 2D block kernel runner passing in the kernel. The
number of blocks, the number of threads per block, our input matrices flattened out, our output
matrix flattened out, and the dimensions that it needs because they get all used here and return
the result. And so if we call that matmul with a 2D block, we can check that the results are close
to what we got in our original manual loops. And of course they are, because it's running
the same code.
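Here's roughly what that pure-Python 2D simulation looks like as a whole; I'm reconstructing the runner and kernel names from the description, so treat this as illustrative rather than the notebook's exact code:

```python
import math, torch
from types import SimpleNamespace as ns

def blk_kernel2d_run(f, blocks, tpb, *args):
    # simulate a 2D CUDA launch: four nested loops over block y/x and thread y/x
    for by in range(blocks.y):
        for bx in range(blocks.x):
            for ty in range(tpb.y):
                for tx in range(tpb.x):
                    f(ns(x=bx, y=by), ns(x=tx, y=ty), tpb, *args)

def matmul_bk(blockidx, threadidx, blockdim, m, n, out, h, w, k):
    r = blockidx.y*blockdim.y + threadidx.y   # which row of the output
    c = blockidx.x*blockdim.x + threadidx.x   # which column of the output
    if r >= h or c >= w: return               # guard: stay inside the output matrix
    o = 0.
    for i in range(k): o += m[r*k + i] * n[i*w + c]
    out[r*w + c] = o

def matmul_2d(m, n):
    h, k = m.shape
    k2, w = n.shape
    assert k == k2, "inner dimensions must match"
    out = torch.zeros(h*w)
    tpb = ns(x=16, y=16)                                      # 16 * 16 = 256 threads per block
    blocks = ns(x=math.ceil(w/tpb.x), y=math.ceil(h/tpb.y))   # ceiling division in each dimension
    blk_kernel2d_run(matmul_bk, blocks, tpb, m.flatten(), n.flatten(), out, h, w, k)
    return out.view(h, w)
```

So now that we've done that, we can do the CUDA version. Now the CUDA version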
is going to be so much faster, we do not need to use this slimmed down matrix anymore. We can use
the whole thing. So to check that it's correct, I want a fast CPU based approach that I can compare
to. So previously it took about a second to do 39,000 elements. So I'm not going to explain how
this works, but I'm going to use a broadcasting approach to get a fast CPU based approach. If
you check the fast AI course, we teach you how to do this broadcasting approach. But it's a pure
Python approach, which manages to do it all in a single loop rather than three nested loops. It
gives the same answer for the cut down tensors. But much faster, only four milliseconds. So
it's fast enough that we can now run it on the whole input matrices. And it takes about 1.3
seconds. And so this broadcast optimized version, as you can see, it's much faster. And now we've
got 392 million additions going on in the middle of our three loops, effectively three loops, but
we're broadcasting them. So this is much faster. But the reason I'm really doing this is so that
we can store this result to compare to, to make sure that our CUDA version is correct.
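Since the details aren't covered here, just for completeness, one way such a broadcast version can be written is the following sketch (my own, not necessarily the course's exact code):

```python
import torch

def matmul_broadcast(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # one Python loop over rows; each row's dot products are done at once by broadcasting
        c[i] = (a[i, :, None] * b).sum(dim=0)   # (ac, 1) * (ac, bc), summed over the inner dimension
    return c
```

Okay,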
so how do we convert this to CUDA? You might not be surprised to hear that what I did was I grabbed
this function and I passed it over to ChatGPT and said, please rewrite this in C. And it gave me
something basically that I could use first time. And here it is. This time I don't have unsigned
char*, I have float*. Other than that, this looks almost exactly like the Python
we had with exactly the same changes we saw before. We've now got the .y and .x versions.
Once again, we've got __global__, which says: please run this on the GPU when we call it from
the CPU. So the kernel, I don't think there's anything to talk about there. And then the thing
that calls the kernel, matmul, is going to be passed in two tensors. We're going to check that
they're both contiguous and check that they are on the CUDA device. We'll grab the height and width
of the first and second tensors. We're going to grab the inner dimension. We'll make sure that the
inner dimensions of the two matrices match just like before. And this is how you do an assertion
in PyTorch CUDA code. You call TORCH_CHECK, pass in the thing to check, pass in the message to pop
up if there's a problem. These are a really good thing to spread around all through your CUDA code
to make sure that everything is as you thought it was going to be. Just like before, we create an
output. So now when we create a number of threads, we don't say threads is 256. We instead say, this
is a special thing provided by CUDA for us, dim3. So this is basically a tuple with three elements.
So we're going to create a dim3 called tpb. It's going to be 16 by 16. Now I said it has three
elements. Where's the third one? That's okay. It just treats the third one as being one. So we
can ignore it. So that's the number of threads per block. And then how many blocks will there be?
Well, in the X dimension, it'll be W divided by X, ceiling division. In the Y dimension, it will be H
divided by Y, ceiling division. And so that's the number of blocks we have. So just like before, we
call our kernel just by calling it like a normal function. But then we add this weird triple angle
bracket thing telling it how many blocks and how many threads. So these aren't ints anymore. These
are now dim3 structures. And that's what we use, these dim3 structures. And in fact, even before,
what actually happened behind the scenes when we did the grayscale thing is even though we passed
in 256, for instance, we actually ended up with a dim3 structure, in which case indices one and two
(the .y and .z values) would have been set to one automatically. So
we've actually already used a dim3 structure without quite realizing it. And then just like
before, pass in all of the tensors we want, casting them, well, not just casting, converting
them to pointers of some particular data type, and passing in any other
information that our function will need, that kernel will need. Okay, so then we call load CUDA
again, that'll compile this into a module, make sure that they're both contiguous and on the CUDA
device. And then we call module.matmul, passing those in, putting the result on the CPU and checking
that they're all close, and it says yes, they are.
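As before, here's a sketch of what that CUDA source and the call can look like, reusing the cuda_begin and load_cuda helpers assumed earlier (so, again, a reconstruction rather than the exact notebook code):

```python
cuda_src = cuda_begin + r'''
__global__ void matmul_k(float* m, float* n, float* out, int h, int w, int k) {
    int r = blockIdx.y*blockDim.y + threadIdx.y;
    int c = blockIdx.x*blockDim.x + threadIdx.x;
    if (r >= h || c >= w) return;             // the guard
    float o = 0;
    for (int i = 0; i < k; ++i) o += m[r*k + i] * n[i*w + c];
    out[r*w + c] = o;
}

torch::Tensor matmul(torch::Tensor m, torch::Tensor n) {
    CHECK_INPUT(m); CHECK_INPUT(n);
    int h = m.size(0), w = n.size(1), k = m.size(1);
    TORCH_CHECK(k == n.size(0), "Size mismatch!");
    auto output = torch::zeros({h, w}, m.options());
    dim3 tpb(16, 16);                                 // threads per block: 16 x 16 = 256
    dim3 blocks(cdiv(w, tpb.x), cdiv(h, tpb.y));      // ceiling division in each dimension
    matmul_k<<<blocks, tpb>>>(
        m.data_ptr<float>(), n.data_ptr<float>(), output.data_ptr<float>(), h, w, k);
    C10_CUDA_KERNEL_LAUNCH_CHECK();
    return output;
}
'''
cpp_src = "torch::Tensor matmul(torch::Tensor m, torch::Tensor n);"
module = load_cuda(cuda_src, cpp_src, ['matmul'])
```

This is now running not on just the first five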
rows, but on the entire MNIST data set. And on the entire MNIST data set, using a optimized
CPU approach, it took 1.3 seconds. Using CUDA, it takes 6 milliseconds. So that is quite a big
improvement. Cool. The other thing I will mention, of course, is PyTorch can do a matrix
multiplication for us just by using @. It obviously gives the same answer, and how long does
that take to run? That takes 2 milliseconds. So, three times faster. And in many
situations, it'll be much more than three times faster. So why are we still pretty slow compared
to PyTorch? I mean, this isn't bad to do 392 million of these calculations in 6 milliseconds.
But if PyTorch can do it so much faster, what are they doing? Well, the trick is that they
are taking advantage in particular of this shared memory. So shared memory is a small memory space
that is shared amongst the threads in a block, and it is much faster than global memory. In our
matrix multiplication, when we have one of these blocks, and so it's going to do one block at
a time all in the same SM, it's going to be reusing the same 16 by 16 block. It's going to
be using the same 16 rows and columns again and again and again, each time with access to the same
shared memory. So you can see how you could really potentially cache the information, a lot of the
information you need, and reuse it rather than going back to the slower memory again and again.
So this is an example of the kinds of things that you could optimize potentially once you get to
that point. The only other thing that I wanted to mention here is that this 2D block idea is totally
optional. You can do everything with 1D blocks or with 2D blocks or with 3D blocks and threads. And
just to show that, I've actually got an example at the end here which converts RGB to grayscale using
the 2D blocks. Because remember earlier when we did this, it was with 1D blocks. It gives exactly
the same result. And if we compare the code, the version
that was done with 1D threads and blocks is quite a bit shorter than the version that uses
2D threads and blocks. And so in this case, even though we're manipulating pixels, where you
might think that using the 2D approach would be neater and more convenient, in this particular
case it wasn't really. It's still pretty simple code, but we have to deal with the columns and
rows, the .x and the .y, separately. The guards are a little bit more complex: we have to find out what index we're
actually up to here. Whereas this kernel was just much more direct, just two lines of code.
And then calling the kernel, again it's a little bit more complex with the threads per block stuff
rather than this. The key thing I wanted to point out is that these two pieces of code do exactly
the same thing. So don't feel like if you don't want to use a 2D or 3D block thread structure,
you don't have to. You can just use a 1D one. The 2D stuff is only there if it's convenient
for you to use and you want to use it. Don't feel like you have to. So yeah, I think that's
basically all the key things that I wanted to show you all today. The main thing I hope you take
from this is that even for Python programmers, for data scientists, it's not way outside our
comfort zone. We can write these things in Python, we can convert them pretty much automatically, we
end up with code that doesn't look, you know, it looks reasonably familiar even though it's now in
a different language. We can do everything inside notebooks, we can test everything as we go, we can
print things from our kernels. And so, you know, it's hopefully feeling a little bit less beyond
our capabilities than we might have previously imagined. So I'd say, yeah, you know, go for it.
I think it's also like, I think it's increasingly important to be able to write CUDA code nowadays
because for things like flash attention or for things like quantization, GPTQ, AWQ, bits and
bytes, these are all things you can't write in PyTorch. You know, our models are getting more
sophisticated. The kind of assumptions that libraries like PyTorch make about what we want to
do, you know, increasingly less and less accurate. So we're having to do more and more of this stuff
ourselves nowadays in CUDA. And so I think it's a really valuable capability to have. Now, the other
thing I mentioned is we did it all in Colab today, but we can also do things on our own machines
if you have a GPU or on a cloud machine. And getting set up for this, again, it's much less
complicated than you might expect. And in fact, I can show you: it's basically three or four lines
of bash script to get it all set up. It'll run on Windows
or under WSL. It'll also run on Linux. Of course, CUDA stuff doesn't really work
on Macs, so this isn't for Mac. Actually, I'll put a link to
this into the video notes. But for now, I'm just going to jump to a Twitter thread
where I wrote this all down to show you all the steps. So the way to do it is to use something
called Conda. Conda is something that very, very, very few people understand. A lot of people think
it's like a replacement for like PIP or poetry or something. It's not. It's better to think of it as
a replacement for Docker. You can literally have multiple different versions of Python, multiple
different versions of CUDA, multiple different C++ compilation systems all in parallel at the same
time on your machine and switch between them. You can only do this with Conda. And everything just
works. So you don't have to worry about all the confusing stuff around .run files or Ubuntu
packages or anything like that. You can do everything with just Conda. You need to install
Conda. I've actually got a script which you just run the script. It's a tiny script, as you see.
If you just run the script, it'll automatically figure out which mini Conda you need. It'll
automatically figure out what shell you're on, and it'll just go ahead and download it and
install it for you. OK, so run that script, restart your terminal. Now you've got Conda. Step
two is find out what version of CUDA PyTorch wants you to have. So if I click Linux, Conda, CUDA
12.1 is the latest. So then step three is run this shell command, replacing 12.1 with whatever
version the current PyTorch wants. It's actually still 12.1 for me at this point. And that will install
everything. All the stuff you need to profile, debug, build, etc. All the NVIDIA tools you
need, the full suite, will all be installed, and it's coming directly from NVIDIA so you'll
have the proper versions. As I said, you can have multiple versions installed at once in different
environments. No problem at all. And then finally, install PyTorch. And this command here will
install PyTorch. For some reason I wrote nightly here. You don't need the nightly. Just remove dash
nightly. So this will install the latest version of PyTorch using the NVIDIA CUDA stuff that you
just installed. If you've used Conda before and it was really slow, that's because it used to
use a different solver which was thousands or tens of thousands of times slower than the modern
one. It's just been added and made default in the last couple of months. So nowadays this should
all run very fast. And as I said, it'll run under WSL on Windows. It'll run on Ubuntu. It'll run
on Fedora. It'll run on Debian. It'll all just work. So that's how I strongly recommend getting
yourself set up for local development. You don't need to worry about using Docker. As I said,
you can switch between different CUDA versions, different Python versions, different compilers and
so forth without having to worry about any of the Docker stuff. And it's also efficient enough
that if you've got the same libraries and so forth installed in multiple environments, it'll
hard link them. So it won't even use additional hard drive space. So it's also very efficient.
Great. So that's how you can get started on your own machine or on the cloud or whatever. So
hopefully you'll find that helpful as well. All right. Thanks very much for watching. I hope you
found this useful and I look forward to hearing about what you create with CUDA. In terms of going
to the next steps, check out the other CUDA mode lectures. I will link to them. And I would also
recommend trying out some projects of your own. So for example, you could try to implement something
like 4-bit quantization or Flash Attention or anything like that. Now those are pretty big
projects, but you can try to break them up into smaller things that you build up one step at a
time. And of course, look at other people's code. So look at the implementation of Flash Attention.
Look at the implementation of bits and bytes. Look at the implementation of GPTQ and so forth. The
more time you spend reading other people's code, the better. All right. I hope you found this
useful and thank you very much for watching.