Striding CUDA like i'm Johnnie Walker

Captions
What is up, you guys! In this one I'll show you how to use CUDA kernels and striding on Jupyter Notebook or Google Colab. But before I do so, I want to tell you that GTC 23 is taking place from March 20 till 23, 2023. Perfect. The following video was recorded on NVIDIA's Orin supercomputer in Abu Dhabi and under a Creative Commons license. I'm Ahmad Bazzi, signing in on this one.

So, for those of you who don't know GTC, it's NVIDIA's annual GPU Technology Conference, focusing on GPU technology and its applications. The conference brings together experts and founders like Demis Hassabis, industry leaders, visionaries, and researchers like Anima Anandkumar and many more. You can get to meet professionals in various fields, such as artificial intelligence, high-performance computing, data science, and autonomous vehicles, who showcase the latest innovations in GPU technology and share best practices. At GTC, NVIDIA and its partners demonstrate the latest advancements in hardware and software technology, including GPUs, system architectures, AI frameworks, libraries, tools, and many more. And the best part about it is that it's open to everyone, including you. Perfect.

But I'm also hosting an awesome RTX 4080 GPU giveaway, with supreme ray-tracing performance thanks to its 76 dedicated ray-tracing cores. If you're a gamer, designer, or even an editor, you will love it. To participate in this giveaway, all you gotta do is sign up to GTC using the link below, attend Jensen's keynote on March 21st at 8 AM PDT (4 PM CET) and some sessions of your own choice, then send a screenshot as proof that you attended the sessions directly to my email, found in the description below. I will then pick a winner at random and ship him or her the GPU. Perfect. You are tuned into another Ahmad Bazzi video.

Now, after we open Jupyter Notebook or Google Colab, let's find out about the GPU we are using. Please note that your GPU, whether you're using Google Colab or your own custom GPU, could be different from mine. Of course, the numbers I'm giving below and the picture are only valid for the GPU dedicated to me. We can see that I'm on an NVIDIA Orin supercomputer with a given universally unique identifier (UUID), and that the device is supported on my Jupyter notebook.

Let's get started by implementing a first CUDA kernel to compute the square root of each value in an array. First, here's our float32 NumPy array of size 4096. Perfect. Now we can simply use Numba's vectorize decorator to compute the square root of all elements in parallel on the GPU, as follows.

We'll do the same with a custom CUDA kernel. We first define our kernel using the cuda.jit decorator. Here we have an input array of 4096 values, so we will use 4096 threads on the GPU. Our input and output arrays are one-dimensional, so we will use a one-dimensional grid of threads. cuda.grid(1) returns the unique index of the current thread in the whole grid. With 4096 threads, the index, or idx, will range from 0 to 4095, that is, 4096 minus one. Then we see that each thread is going to deal with a single element of the input array to produce a single element in the output array. This element is determined for each thread by the thread index idx.

Now that we have our kernel, we copy our input array to the GPU device, create an output array on the device with the same shape, and finally launch the kernel. Here the 4096 threads are arranged into a grid of 32 blocks, where each block has 128 threads. In general, the CUDA kernel launch overhead increases with the number of blocks, so going for a large number of blocks would hurt performance. In the following, I will show you how to use striding to solve this problem. Perfect.

Now, the simple kernel deals with a single element of the input array. When the kernel is deployed, the GPU therefore needs to create as many threads as there are elements in the array, which potentially results in many blocks if the array is large. On the contrary, a striding kernel deals with several elements of the input array, using a loop, as follows. In this way, a given thread deals with several elements and the number of threads is kept under control. Threads keep doing work in a coordinated way, and the GPU is not wasting time creating and scheduling threads.

Now let's consider a small example with an input data array of size 8 and blocks with four threads each. Without striding, we need to use two blocks, so eight threads in total, each dealing with a single element in the array. With striding, we could launch one block of four threads, as follows. Here are the elements processed by each thread. A useful way to think about this is to imagine that the grid is moving to process all elements in the input array.

Now let's do some performance analysis, where we study the influence of striding and of the execution configuration parameters on the processing of a large array. So let's redefine our kernels, the simple one and the striding one. Now we create a big array, we ship it to the device, and we create the output array on the device as usual.

First, let's see how fast we can process this array sequentially, in a single thread on the GPU. We use the striding version here since, obviously, the simple version would only be able to process one element in a single thread. Processing these 256 million values in a single thread on the GPU took about a minute and 17 seconds. Let's see how parallel processing can help us. For that, we choose an execution configuration by following our simple rules, and we use the non-striding kernel. The parallel version is much faster, as expected. Perfect. Now let's try and see if the striding version brings any performance improvement. The gain over the non-striding kernel is not very significant, but it is indeed much faster than the sequential single-threaded run.

So that's it, guys. In this one we simply showed you how to use CUDA kernels and striding on Jupyter Notebook, or on Google Colab, or even on your own local machine with a custom GPU. Don't forget to attend GTC so that you can participate in the RTX 4080 giveaway. If you enjoyed this video, please make sure to leave a like on the video and subscribe to the channel, so that you can help me produce more content and I know that this video was beneficial to you. This is Ahmad Bazzi, and I'm signing out. You are watching a master at work.
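
The small example from the captions (an array of size 8, blocks of four threads) relies on a picture that is not in this transcript. The plain-Python sketch below, which runs on the CPU and does not launch any GPU kernel, reproduces which elements each thread would process in a grid-stride loop. The helper name stride_assignment is hypothetical and only mimics the index arithmetic of cuda.grid(1) and cuda.gridsize(1).

```python
def stride_assignment(n_elements, n_blocks, threads_per_block):
    """Return {global thread index: [element indices]} for a grid-stride loop."""
    # Total number of threads in the grid, i.e. what cuda.gridsize(1) reports.
    grid_size = n_blocks * threads_per_block
    assignment = {}
    for block in range(n_blocks):
        for thread in range(threads_per_block):
            # Global thread index, i.e. what cuda.grid(1) reports.
            start = block * threads_per_block + thread
            # The thread starts at its own index and jumps by the grid size.
            assignment[start] = list(range(start, n_elements, grid_size))
    return assignment


# Without striding: two blocks of four threads, one element per thread.
without_striding = stride_assignment(8, n_blocks=2, threads_per_block=4)

# With striding: one block of four threads, two elements per thread.
with_striding = stride_assignment(8, n_blocks=1, threads_per_block=4)

for tid, elements in with_striding.items():
    print(f"thread {tid}: elements {elements}")
# thread 0: elements [0, 4]
# thread 1: elements [1, 5]
# thread 2: elements [2, 6]
# thread 3: elements [3, 7]
```

This matches the picture of the grid "moving" over the array: the four threads first cover elements 0 to 3, then the whole grid shifts by four and covers elements 4 to 7.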
Info
Channel: Ahmad Bazzi
Views: 545,302
Keywords: cuda programming book, cuda programming tutorial, cuda programming python, cuda programming pdf, cuda programming model, cuda programming course, cuda programming c, cuda programming projects, ahmad bazzi, Cuda download, Cuda install, Cuda python, Cuda tutorial, Cuda code, Nvidia cuda, cuda 11 7 download, cuda problem, cuda projects, cuda premiere pro, cuda programming in hindi, cuda profiling, cuda pravoslavlja, cuda gpu programming, cuda gpu, cuda gpu setup, orin
Id: BwP09FwnQwc
Length: 11min 6sec (666 seconds)
Published: Sun Feb 19 2023