CUDA Programming in Python

Captions
Today we'll go over how to speed up a Python function with only one line of code. Then we'll rewrite the function to take advantage of the GPU. Take a guess right now on how much faster we can get it to go.

For our demonstration today we will implement a Sobel filter. A Sobel filter is useful for detecting edges in images; you might use this in computer vision applications, for example. We'll talk about the algorithm in more detail later. For now, note that there are nested for loops in the main function. We read the file, then we convert it to grayscale, and finally we run the Sobel filter on the image. Using a timer function, this prints out the elapsed time in milliseconds. We'll use this image. Let's run the script. One minute later... well, actually 72 seconds later, we have our filtered image. Let's compare the two. The good news: it appears to work. The bad news: it takes 72 seconds to work. Let's work on cutting that time down.

Numba is an open-source just-in-time compiler for Python. It translates a subset of Python and NumPy code into fast machine code. We'll talk about how it does that in the last section of the video. Let's import Numba. Now we add a decorator. This tells the Python interpreter to call Numba to convert the Sobel filter function into machine code and execute it. Numba will do this the first time the function is called; subsequent calls will go to the machine code directly. Let's run it and see what happens. That cuts the runtime down to about 3.3 seconds. That's a good little improvement.

Let's compare this to a C implementation. OpenCV has a Sobel function; it's a Python wrapper around a C function. I expect this to be pretty fast, so let's run it an extra 50 times. Let's run it. The first run is 56 milliseconds, then it averages 50 milliseconds thereafter. Now I know what you're saying: it would be much faster if you just didn't write shitty code. How dare you suggest that I write code. I have ChatGPT do it for me. Let's see if we can speed this puppy up.

Python uses what is called duck typing. This means that Numba has to look up the type of an object before it can compile a function that uses that object. We can define the object types in the decorator to help speed up the compile process. Here we tell Numba that we are passing in a two-dimensional array of float32, and that we expect the function to return a two-dimensional array of 8-bit unsigned integers. The closer you get to the metal, the more you've got to know. We also have to import uint8 and float32 from Numba. Let's run it 11 times altogether. That's somewhat better, 682 milliseconds or so; previously it was 3.2 seconds. But we're still well off the 50 milliseconds of the OpenCV version.

If a function is parallelizable (that's a funny word, parallelizable), we can tell Numba to run it on multiple CPU cores. We add parallel=True to the @jit decorator, and we change the range statement to prange, which allows our for loops to be split up and run in parallel. We also have to import prange from Numba. Let's run it 100 times. Huh, 28 milliseconds the first time. This machine has 12 CPU cores; let's watch their activity when we run the script. The interesting point here is that we haven't changed any of the code. We've added one directive, and the way it actually works is pretty transparent to us.
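The captions don't include the source code itself, so here is a minimal sketch of the kind of Sobel function and timer being described, assuming the image arrives as a two-dimensional float32 NumPy array. The function name, the timer helper, and the stand-in image are illustrative, not from the video:

    import math
    import time

    import numpy as np
    from numba import jit

    # The common 3x3 Sobel kernels: horizontal and vertical.
    KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
    KY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=np.float32)

    @jit(nopython=True)  # the one-line change; delete it to get pure Python
    def sobel_filter(image):
        rows, cols = image.shape
        output = np.zeros((rows, cols), dtype=np.uint8)
        # Outer loops: visit every interior pixel of the image.
        for x in range(1, rows - 1):
            for y in range(1, cols - 1):
                gx = 0.0
                gy = 0.0
                # Inner loops: the convolution (multiply-accumulate).
                for i in range(3):
                    for j in range(3):
                        pixel = image[x + i - 1, y + j - 1]
                        gx += KX[i, j] * pixel
                        gy += KY[i, j] * pixel
                # Gradient magnitude, clamped to the uint8 range.
                output[x, y] = min(255.0, math.sqrt(gx * gx + gy * gy))
        return output

    def timed(label, fn, *args):
        # Prints the elapsed time in milliseconds, like the video's timer.
        start = time.perf_counter()
        result = fn(*args)
        print(f"{label}: {(time.perf_counter() - start) * 1000.0:.1f} ms")
        return result

    image = np.random.rand(1080, 1920).astype(np.float32) * 255  # stand-in
    edges = timed("sobel_filter", sobel_filter, image)

Without the @jit line this is the slow pure-Python baseline; with it, the first call pays the compile cost and later calls go straight to the machine code.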
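The typed-signature and parallel steps only change the decorator and the outer loop. A sketch of the final CPU variant, again with illustrative names; the signature says the function takes a 2-D float32 array and returns a 2-D uint8 array:

    import math

    import numpy as np
    from numba import jit, prange, uint8, float32

    KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
    KY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=np.float32)

    # Eager compilation: declaring the types up front means Numba compiles
    # when the module loads instead of on the first call; parallel=True
    # enables multicore execution.
    @jit(uint8[:, :](float32[:, :]), nopython=True, parallel=True)
    def sobel_filter(image):
        rows, cols = image.shape
        output = np.zeros((rows, cols), dtype=np.uint8)
        # prange instead of range lets Numba split this loop across cores.
        for x in prange(1, rows - 1):
            for y in range(1, cols - 1):
                gx = 0.0
                gy = 0.0
                for i in range(3):
                    for j in range(3):
                        pixel = image[x + i - 1, y + j - 1]
                        gx += KX[i, j] * pixel
                        gy += KY[i, j] * pixel
                output[x, y] = min(255.0, math.sqrt(gx * gx + gy * gy))
        return output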
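For the OpenCV comparison, the exact calls aren't shown in the video; one common way to build a Sobel edge image with OpenCV's C-backed functions looks like this (the file name is a placeholder):

    import cv2
    import numpy as np

    # Read the image and convert it to grayscale.
    gray = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2GRAY)

    # cv2.Sobel computes one derivative direction per call; combine the
    # horizontal and vertical responses into a gradient magnitude.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    edges = np.clip(cv2.magnitude(gx, gy), 0, 255).astype(np.uint8)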
There are two things that GPUs are really good at: math and lightweight multitasking. Numba can generate CUDA code directly from the @cuda.jit decorator. The mechanism is the same as the one that generates CPU machine code, but it uses a different code-generator back end. In the next section we'll go over how we wrote this CUDA kernel; for now, we'll run it and time the results. It takes 29 milliseconds to run the first time, and subsequent runs are 17 milliseconds. Let's run it a thousand times and watch the GPU activity. It looks like we're using about 75% of the GPU. Here's a table of the measurements against the pure Python implementation. These numbers don't mean much by themselves, but under the right circumstances you can make things a lot faster. Of the numbers, I found just adding the @jit decorator, speeding up the code more than 100 times, the most surprising. Let's go over writing the CUDA kernel next.

Let's see what Wikipedia has to say about this. The Sobel operator uses two 3x3 kernels which are convolved with the original image to calculate approximations of the derivatives, one for horizontal changes and one for vertical. Let's scroll down a bit. Oh good, here's some code; that's better. There are two sets of for loops. In the first set we go through all of the pixels in the image. In the second set we do the actual convolution; this lets us measure a weighted difference in intensity between a pixel and its neighbors. We take that and store it in an output array. We do this twice, once left to right and once top to bottom, which gives us two matrices. We combine both of them into one by using the gradient magnitude: we take the square root of the sum of the squares, G = sqrt(Gx^2 + Gy^2).

With that nonsense out of the way, we can look at some code. Here's our CUDA kernel. We only have to worry about working with one pixel here. To get the pixel offset in the image we use cuda.grid. Next we check to see that the pixels are actually in the image. We're making the assumption here that the kernels are the same size. In these nested for loops we do the convolution; that's the multiply-accumulate part. Finally, we set the pixel value of the output image; that's the nasty square-root-of-the-sum-of-the-squares bit.

Let's scroll down a little bit. Here we read our image, convert it to grayscale, and then make sure that it's float32. Now we define a grid of blocks. The CUDA compiler will use this grid to schedule the GPU resources. Between 8 and 32 threads per block is usually a good number, and the size of the image helps determine the number of blocks. Then we define our kernels; this is the common 3x3 case. Finally, we allocate our output image. The CUDA kernel call is wrapped in a function, which makes it easier for us to time. The bracketed Python list after the function name tells CUDA the grid layout; the parameters follow. Numba will take care of moving the parameters back and forth between the CUDA device and the host. Let's run this a thousand times. The first run takes 55 milliseconds to compile and run; subsequent runs take 20.7 milliseconds.

By copying the memory between the host and the device manually, you can save some time. This requires a lot more housekeeping on your part, but it saves Numba from trying to figure it out on the fly. For the setup, first we copy the image over to the CUDA device, next we allocate room for the output image on the device, and finally we copy the Sobel kernels over to the device. Now we call the Sobel filter CUDA kernel. We make sure that the CUDA kernel finishes execution, then we copy the result back to the host. Then, in our teardown, we remove everything we placed on the device. Let's give that a run. It takes about 33 milliseconds to compile and run the first time; subsequent runs take about 16 milliseconds, a reduction of around 4 milliseconds. If you're running on a video stream, that's a pretty good little pickup.
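The kernel source isn't reproduced in the captions either; here is a minimal sketch of a @cuda.jit Sobel kernel along the lines described: one thread per pixel, cuda.grid for the offset, a bounds check, multiply-accumulate loops, and the gradient magnitude at the end. Names and the stand-in image are illustrative:

    import math

    import numpy as np
    from numba import cuda

    @cuda.jit
    def sobel_kernel(image, kx, ky, output):
        # Each CUDA thread handles one pixel; cuda.grid(2) gives this
        # thread's (x, y) offset in the image.
        x, y = cuda.grid(2)
        rows, cols = image.shape
        # Check that the pixel is actually in the image, leaving a border
        # so the 3x3 window fits. We assume kx and ky are the same size.
        if x < 1 or y < 1 or x >= rows - 1 or y >= cols - 1:
            return
        gx = 0.0
        gy = 0.0
        # The nested for loops do the convolution (multiply-accumulate).
        for i in range(3):
            for j in range(3):
                pixel = image[x + i - 1, y + j - 1]
                gx += kx[i, j] * pixel
                gy += ky[i, j] * pixel
        # The square root of the sum of the squares, clamped to uint8.
        output[x, y] = min(255.0, math.sqrt(gx * gx + gy * gy))

    # A stand-in for the grayscale float32 image read from disk.
    image = (np.random.rand(1080, 1920) * 255).astype(np.float32)

    # The grid of blocks: threads per block in the 8-32 range the video
    # suggests, with the image size determining the number of blocks.
    threads_per_block = (16, 16)
    blocks_per_grid = (math.ceil(image.shape[0] / threads_per_block[0]),
                       math.ceil(image.shape[1] / threads_per_block[1]))

    # The common 3x3 Sobel kernels, and the output image.
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
    ky = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=np.float32)
    output = np.zeros(image.shape, dtype=np.uint8)

    # The bracketed pair tells CUDA the grid layout; the parameters follow.
    # Numba moves these host arrays to the device and back automatically.
    sobel_kernel[blocks_per_grid, threads_per_block](image, kx, ky, output)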
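The manual host-to-device copies described above might look like the following, continuing from the previous sketch. cuda.to_device, cuda.device_array, copy_to_host, and cuda.synchronize are the Numba calls that cover the setup, execution, and teardown steps:

    import numpy as np
    from numba import cuda

    # Setup: copy the image and the Sobel kernels to the device, and
    # allocate room for the output image there.
    d_image = cuda.to_device(image)
    d_kx = cuda.to_device(kx)
    d_ky = cuda.to_device(ky)
    d_output = cuda.device_array(image.shape, dtype=np.uint8)

    # Call the Sobel filter CUDA kernel on device arrays; no implicit
    # host/device transfers happen now.
    sobel_kernel[blocks_per_grid, threads_per_block](d_image, d_kx, d_ky, d_output)
    cuda.synchronize()  # make sure the kernel finishes execution

    # Copy the result back to the host.
    output = d_output.copy_to_host()

    # Teardown: drop the device arrays so their memory can be released.
    del d_image, d_kx, d_ky, d_output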
Here's a side-by-side comparison of the two versions. It's not as easy to read, but it is faster.

How much do you know about interpreters? Python is a bytecode interpreter. When Python loads a script or module, it compiles it into bytecodes. The Python virtual machine reads the bytecode instructions one by one and performs the operation each one describes. On the left we have the Python source code; on the right we have the bytecodes to which it compiles. For example, here we create an np.array from a list and assign it to gx; that compiles into these bytecodes. It's pretty straightforward, all low-level stuff. This one function turns into about 150 lines of bytecode.

When the interpreter encounters the Numba @jit decorator, it translates the function and calls the LLVM compiler. LLVM compiles the code into machine language, then links and loads it. This is all done while the program is running. Here's the assembly language produced for the function; in this case it's ARM64 assembler. Once the code is loaded and ready to go, it is executed, and subsequent calls to the function will call the machine code version. In this case, the 150 lines of bytecode translate into about 1,000 lines of assembler. Bytecodes are compact; this is one of the advantages they have over compiled code.

When the interpreter encounters the Numba @cuda.jit directive, it goes through a similar process as the previous JIT, except the CUDA compiler gets called during this process to generate CUDA code. You can see that our CUDA function produced around 450 lines of code.

This was a very high-level overview of using a JIT compiler and CUDA code with Python. There was no attempt to write faster code; we just played with some parameters. Notice that we picked an algorithm that lends itself to running in parallel and has a good bit of math in it; that's a great match for CUDA. But engineering is about trade-offs, and you have to make sure that the juice is worth the squeeze when optimizing code. Thanks for watching.
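As an aside, you can print bytecode like the listing shown in the video with the standard library's dis module; a small sketch mirroring the gx example:

    import dis

    import numpy as np

    def build_gx():
        # Mirrors the video's example: create an np.array from a list
        # and assign it to gx.
        gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
        return gx

    # Print the bytecode instructions the Python virtual machine runs
    # for this function.
    dis.dis(build_gx)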
Info
Channel: JetsonHacks
Views: 5,469
Keywords: tutorial, demonstration, cuda, nvidia jetson, nvidia jetson nano, nvidia jetson xavier, nvidia jetson orin, cuda programming, cuda programming in python, cuda python, just in time compiler, jit, cuda gpu programming, cuda gpu, examples, gpu programming, gpu programming python, nvidia gpu, nvidia gpu programming, numba, numba tutorial, numba python, numba jit compiler
Id: C_WrbBmiTf4
Length: 9min 12sec (552 seconds)
Published: Mon Jan 15 2024