CUDA Programming on Python

Captions
What is up, you guys! In this one I'll show you everything you need to know about CUDA programming, so that you can make use of GPU parallelization through simple modifications to your already existing code running on a boring CPU. The following video was recorded on NVIDIA's Jetson Orin supercomputer in Abu Dhabi, the UAE, and under a Creative Commons license. I'm Ahmad Bazzi, signing in.

On this one I'm using Sublime Text as an editor; I love experimenting with different code editors. First I'll start by writing a simple function that does a vector multiplication, which will first run on a CPU, so my Python code contains nothing fancy and is very classical. I will start by importing the numpy package, then I will create a function called multiply_my_vectors that takes three vectors in, where a and b are treated as inputs and c is our storage array, also given as an input, so its values will be affected. Next I will create a main function that initializes three vectors a, b and c of type float32 and of size 64 million each, which is enormous. Before calling multiply_my_vectors on the three vectors I will start the timer; then, after executing multiply_my_vectors on the CPU, I will compute the amount of time taken to multiply two vectors, each of size 64 million and of type float32. Then, to show that multiply_my_vectors executes properly, I will print the first and last six elements of c to ensure we're getting the right values, and print out how long it took.

So let's go ahead and run this pure Python version, which will be executed on a CPU. Oops, there seems to be an error in this code, since I did not import the timer package, so from timeit I will import default_timer as timer. Another error is that I'm using the old xrange, which should now be range. Running this, all seems to work well, and as you can see it is taking a lot of time. This is natural, as we are multiplying 64 million float32 numbers on a CPU. You can see we get all ones, as expected, in about 31.4 seconds. Again, this is a lot of time for many online applications.
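The captions describe this script but never show it, so here is a minimal sketch of the pure-CPU version. The function and variable names (multiply_my_vectors, a, b, c) follow the narration, and the all-ones initialization is inferred from the "all ones" output; treat it as an approximation rather than the exact code from the video:

```python
import numpy as np
from timeit import default_timer as timer

def multiply_my_vectors(a, b, c):
    # Element-wise multiply on the CPU: c[i] = a[i] * b[i].
    for i in range(a.size):
        c[i] = a[i] * b[i]

def main():
    n = 64_000_000                       # 64 million elements, as in the video
    a = np.ones(n, dtype=np.float32)
    b = np.ones(n, dtype=np.float32)
    c = np.zeros(n, dtype=np.float32)    # output array, filled in place

    start = timer()
    multiply_my_vectors(a, b, c)
    elapsed = timer() - start

    print(c[:6])       # first six elements, all ones expected
    print(c[-6:])      # last six elements
    print(f"CPU time: {elapsed:.2f} s")

if __name__ == "__main__":
    main()
```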
Now, one possible acceleration technique I'm going to show in this video is to simply tell the Numba compiler that I have a function I want it to parallelize for me, and it then automatically compiles and moves that function to the GPU. We will be doing this using NumbaPro's vectorize capability and applying it to our multiply_my_vectors function. The first trick when using vectorize-based parallelization is that the multiply_my_vectors function must be a scalar function. This means that all input and output parameters must be scalar values recognized by numpy, such as float32, float64 and so on. Currently our multiply_my_vectors function is set up to receive all three arrays as input parameters and not return any values. The vectorize decorator expects multiply_my_vectors to accept some number of scalar inputs and return a single scalar output, so our first step is to take our current multiply_my_vectors function and cast it into a scalar function. To do that we return the result of the scalar a times b, and we no longer need to pass in c. Now the Numba compiler can apply the scalar function automatically across our numpy arrays on the GPU. Our last step is to tweak the call site and change how multiply_my_vectors is called: we are now returning c instead of passing it in as a parameter. Now, to use vectorize, we first need to import it from NumbaPro.

One last thing I need to do is declare a Python function decorator, which goes on the line immediately above our function and begins with the at symbol. The first input parameter to this decorator is a list of strings containing the signature of the function that is to be accelerated; think of this as the blueprint of the C function. This function will be compiled to GPU machine code, so the compiler needs to know the data types to expect for both input and output parameters. Our multiply_my_vectors function is called with float32, so let's create that signature. The first entry is the output data type expected from the function, and the remaining entries are the types of the input parameters: for a we have float32, and for b we also have float32. By default the vectorized function will be a compiled, single-threaded CPU version of the function, but that's not any fun, so we're going to create a massively parallelized GPU version, and I'll set the target equal to gpu. Note that some of you may have to set this to cuda. And that's all there is to it.

Running this, we get an error because I have not installed the numba package, so let's hop on over to the terminal and run pip3 install numba. This might take some time depending on your internet speed. Now let's change numbapro to numba and change the target to cuda instead of gpu, as I already mentioned. As expected, running this massive multiplication on a GPU takes about 0.64 seconds, as opposed to 31.4 seconds when running on a CPU. This gain translates to roughly a 50x speedup, thanks to the parallelization across such a huge number of cores. Amazing. This simply means that a complex program taking about a month on a CPU could be executed in around 14 hours on a GPU, and it could be done even faster if you were given more cores. Keep in mind that GPUs have more cores than CPUs, and hence, when it comes to parallel computing of data, GPUs perform exceptionally better than CPUs, even though GPUs have a lower clock speed and lack several core-management features compared to CPUs.
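NumbaPro has since been folded into the open-source numba package, so a sketch of the vectorized GPU version in current Numba might look like the following. The float32 signature and array sizes follow the narration; everything else is an assumption:

```python
import numpy as np
from timeit import default_timer as timer
from numba import vectorize   # NumbaPro's vectorize now lives in numba

# Scalar function: one float32 per input, one float32 out.
# Numba broadcasts it element-wise over whole arrays on the GPU.
@vectorize(['float32(float32, float32)'], target='cuda')
def multiply_my_vectors(a, b):
    return a * b

def main():
    n = 64_000_000
    a = np.ones(n, dtype=np.float32)
    b = np.ones(n, dtype=np.float32)

    start = timer()
    c = multiply_my_vectors(a, b)   # result is returned instead of passed in
    elapsed = timer() - start

    print(c[:6], c[-6:])
    print(f"GPU time: {elapsed:.2f} s")

if __name__ == "__main__":
    main()
```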
Now I'll use VS Code as my editor, so I'll hop on over to the terminal and run code. Here's me running the previous multiply_vectors.py script, which ran in 0.32 seconds. Let's create another script called fill_array.py, where the main purpose is to show the gains from the GPU when simply filling arrays. I'll start by importing numpy and the timer package to compute the time taken to execute the functions. First let's create a function called fill_array_without_gpu, which will run on a pure CPU. What fill_array_without_gpu simply does is take an array in and fill up 10 million entries of this array following a simple incremental equation; remember that this function will run only on a CPU. Next let's create a similar function that is not similar at all, meaning that this function, called fill_array_with_gpu, will get the same job done, but this time on a GPU. The content of fill_array_with_gpu is exactly the same as that of fill_array_without_gpu; however, as we did in the multiply_vectors.py example, we will first import, this time, cuda and jit. JIT, or just-in-time compilation, is a compiler feature that allows code to be compiled during runtime rather than ahead of execution. Using jit, I will declare a decorator specifying the target backend as CUDA. Before implementing the body of fill_array.py, I will run fill_array to make sure we have no errors. Perfect, all is set.

Now, in the main, let's initialize an all-ones array called a of size 10 million, where each entry is a float64. Let's start the timer, call the first function, that is fill_array_without_gpu, then print the amount of time it took to execute fill_array_without_gpu, which is running only on a simple, boring CPU. Similarly, let's start a timer, call the second function, that is fill_array_with_gpu, then print the amount of time it took to execute fill_array_with_gpu, which is running on a GPU. Let's run the fill_array.py script, and as we can see, the amount of time it took to fill the array on a CPU is about 2.58 seconds, as opposed to 0.39 seconds on a GPU, which is a gain of about 6.6x. Again, a massive gain.
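The captions again leave out the code, so below is a rough sketch of fill_array.py. Two things are assumptions: the "incremental equation" is taken to be a[i] = i + 1, and the GPU variant is written as an explicit @cuda.jit kernel, since the jit-with-CUDA-target decorator described in the video is no longer available in current Numba releases:

```python
import numpy as np
from timeit import default_timer as timer
from numba import cuda

N = 10_000_000

def fill_array_without_gpu(a):
    # Plain Python loop on the CPU; a[i] = i + 1 is an assumed formula.
    for i in range(a.shape[0]):
        a[i] = i + 1

@cuda.jit
def fill_array_with_gpu(a):
    # One GPU thread per element.
    i = cuda.grid(1)
    if i < a.shape[0]:
        a[i] = i + 1

def main():
    a = np.ones(N, dtype=np.float64)

    start = timer()
    fill_array_without_gpu(a)
    print(f"Without GPU: {timer() - start:.2f} s")

    start = timer()
    threads_per_block = 256
    blocks = (N + threads_per_block - 1) // threads_per_block
    # The host array is copied to the device and back automatically.
    fill_array_with_gpu[blocks, threads_per_block](a)
    cuda.synchronize()
    print(f"With GPU:    {timer() - start:.2f} s")

if __name__ == "__main__":
    main()
```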
The following code will demonstrate why you see some film producers or movie makers rendering and editing their content on a GPU. GPU rendering hands the work to a graphics card rather than a CPU, which may substantially speed up the rendering process, because GPUs are primarily built for fast picture rendering; GPUs were developed in response to graphically intensive applications, as opposed to the slow processing speed of CPUs. I will create a Mandelbrot set. For those into mathematics, the Mandelbrot set is the set of complex numbers c for which the function f(z) = z^2 + c does not diverge to infinity when iterated starting from z = 0. Intuitively speaking, it's all numbers, including complex ones, that you can keep squaring over and over again without blowing up. For example, z = 2 does not work, since if I keep squaring it, it explodes up to infinity, in contrast to z = 1, which remains one, and in contrast to z = 1/2, which shrinks down to zero. Now, with real numbers alone we do not have a surface to visualize, but things get really interesting when working with complex numbers, since we can visualize them on the I/Q, or real and imaginary, plane, and we can see the beautiful fractal curves on the boundary of the Mandelbrot set.

Anyway, back to GPUs. I'll create a script called mandelbrot_on_cpu, which will simply plot the Mandelbrot set rendered on a CPU, and we'll see how much time it takes. We start off by importing numpy as np, matplotlib to visualize the Mandelbrot set, and the timer to compute the time taken to render. I'll create a mandelbrot function that takes in the x and y values of the complex number z, namely the real and imaginary parts of z, as well as the maximum number of iterations we are willing to run. We keep iterating z = z^2 + c, and the rule here is that a point belongs to the Mandelbrot set only as long as its squared magnitude stays below 4, or equivalently as long as its magnitude is less than 2. Now, a function called create_fractal will render our Mandelbrot image. Simply put, all this function does is set the color per pixel: the input image will be modified during the execution of the function, and its pixels will be adjusted according to its width and height. We first read the width and height from the dimensions of the image, then we compute the size of a pixel along the x and y axes. Next we create a 2D loop so that we can walk over all the pixels in our x-y plane. Each pixel's x value specifies the real part and its y value specifies the imaginary part of the complex value to be evaluated by the mandelbrot function, and the output of the mandelbrot function specifies the color of that pixel. After we are done with the create_fractal function, we initialize a blank image of size 5000 by 7500, where the image entries are uint8s, or unsigned 8-bit integers. Before calling the create_fractal function, do not forget to wrap it with timers so that we can evaluate the time taken to execute create_fractal, that is, to render the image. After execution we print the amount of time it took to execute that function, then finally we plot the image using the imshow function of matplotlib. Now here we're going to wait a bit, since we're rendering the image on a CPU. Here's the Mandelbrot set, and as we can see this took a lot of time on a CPU, about 110 seconds. That's around two minutes for a Mandelbrot set, which does not seem so simple on a CPU.

Now I'll be creating another script called mandelbrot_on_gpu, which could be seen as a sister script of mandelbrot_on_cpu; however, we're going to run this one on a GPU. I will copy the same packages and the mandelbrot function, since those remain unaffected. As we did in the fill_array.py script, I will include the cuda module from numba and, using jit, declare a decorator specifying the target backend as CUDA. Also, for the create_fractal function, I will add the cuda.jit decorator just before the create_fractal function. The pixel-size computations are the same as in the previous function, so I'll just copy and paste them here. Now we shall make use of the cuda.grid feature: it returns the absolute position of the current thread in the entire grid of blocks. Since we specified two dimensions, this corresponds to the two dimensions declared when instantiating the kernel, and we should expect a couple, the x and y points, as the output of cuda.grid. After that, the loop logic remains the same as that of create_fractal, but now we use the output of cuda.grid; that is, we will loop over all pixels, where for each pixel we call the mandelbrot function to figure out the color of that pixel. I will copy and paste the image initialization, and to be even harder on the GPU I will double the image size per axis. In other words, instead of rendering a 5000 by 7500 image, I will have a 10,000 by 15,000 image. So think about it: instead of rendering a 4K-resolution video, I will render an 8K-resolution video. Of course that would take even more time on a CPU, but the point here is to show you that the GPU still outperforms the CPU by orders of magnitude. Now I compute the number of pixels by just multiplying the length by the width of that image, I specify 32 threads and the number of blocks along the x and y axes, and then I call create_fractal by specifying the number of blocks and threads per dimension, with the same inputs as in the CPU version. Then I print the amount of time taken to render the Mandelbrot image, and finally we plot it. Oops, we forgot to define s, so let's define it just above the create_fractal call. Running this, we see only 1.4 seconds of execution, as opposed to 110 seconds on a CPU, which is a 78x gain. This simply means that instead of rendering a 4K-resolution video over a week on a CPU, you could get the same video in 8K resolution rendered in two hours on a GPU, if you are using 32 threads.
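Here is a sketch of the GPU renderer described above; the CPU script is essentially the same mandelbrot routine plus an explicit double loop over the pixels, without the CUDA decorators. The plotting-window bounds, the iteration count of 20, and the (32, 32) threads-per-block layout are assumptions, since the narration only mentions "32 threads":

```python
import numpy as np
from matplotlib import pyplot as plt
from timeit import default_timer as timer
from numba import cuda

@cuda.jit(device=True)
def mandelbrot(x, y, max_iters):
    # Count iterations of z = z*z + c before |z|^2 exceeds 4.
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i
    return max_iters

@cuda.jit
def create_fractal(min_x, max_x, min_y, max_y, image, iters):
    height = image.shape[0]
    width = image.shape[1]
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height

    # Absolute (x, y) position of this thread in the whole 2-D grid of blocks.
    x, y = cuda.grid(2)
    if x < width and y < height:
        real = min_x + x * pixel_size_x
        imag = min_y + y * pixel_size_y
        image[y, x] = mandelbrot(real, imag, iters)

def main():
    image = np.zeros((10_000, 15_000), dtype=np.uint8)   # the doubled, "8K-style" image

    threadsperblock = (32, 32)                           # assumed layout
    blockspergrid_x = (image.shape[1] + threadsperblock[0] - 1) // threadsperblock[0]
    blockspergrid_y = (image.shape[0] + threadsperblock[1] - 1) // threadsperblock[1]
    blockspergrid = (blockspergrid_x, blockspergrid_y)

    start = timer()
    create_fractal[blockspergrid, threadsperblock](-2.0, 1.0, -1.0, 1.0, image, 20)
    cuda.synchronize()
    print(f"GPU render time: {timer() - start:.2f} s")

    plt.imshow(image)
    plt.show()

if __name__ == "__main__":
    main()
```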
So imagine if you doubled the number of threads and blocks involved in the GPU computation. Wow! So that's it for the CUDA programming video. I hope you enjoyed it and found it useful; if you did, please leave a like on the video and subscribe to the channel so that the YouTube algorithm can show this video to more people. I'll be leaving some cryptocurrency wallets down below, so any donation of any amount is highly appreciated. I also have a Patreon account, so if you want to support me on Patreon, feel free to do so. If you have any questions whatsoever, leave them down in the comment section below so that other people can respond to you, or I can get to them as soon as possible. I will see you then.
Info
Channel: Ahmad Bazzi
Views: 1,166,007
Keywords: cuda programming book, cuda programming tutorial, cuda programming python, cuda programming pdf, cuda programming model, cuda programming course, cuda programming c, cuda programming projects, ahmad bazzi, Cuda download, Cuda install, Cuda python, Cuda tutorial, Cuda code, Nvidia cuda, cuda 11 7 download, cuda problem, cuda projects, cuda premiere pro, cuda programming in hindi, cuda profiling, cuda pravoslavlja, cuda gpu programming, cuda gpu, cuda gpu setup, orin
Id: -lcWV4wkHsk
Length: 21min 33sec (1293 seconds)
Published: Sat Oct 01 2022