CUDA Simply Explained - GPU vs CPU Parallel Computing for Beginners

Captions
Hi everyone! Today we will talk about CUDA, a powerful software platform that helps computer programs run faster. We often use it to solve performance-intensive problems such as cryptocurrency mining, video rendering and, of course, machine learning. But CUDA is not just software: it is also embedded in hardware, so only those of us with an NVIDIA graphics card can access it.

But wait a second: isn't our processor in charge of running computer programs? How come our graphics card can do it, and do it better? That's not even its official job. So before we dive any deeper, let's quickly talk about how processors and graphics cards operate.

A processor, or central processing unit (CPU), is in charge of all the mathematical and logical calculations of our computer. Its main purpose is to run instructions, which we commonly know as code. Every time we interact with a program, every time we copy or delete files, or even when we type a single letter inside a text document, all these tasks are performed by the CPU. Additionally, all the communication between the different computer components also relies on the CPU. The components do not communicate with each other directly; they have the CPU as a middleman in between. For example, the hard drive knows nothing of the keyboard, but thanks to the CPU we can combine their powers and use the keyboard to rename files on the hard drive, and things of that sort.

CPUs, then, must be really good at multitasking, because we are always able to run a bunch of applications at once: we can download files through the browser while listening to Spotify, while scanning for malware, while using our mouse and keyboard for a thousand different things. Surprisingly, this is not exactly the case. A single core of our CPU can only handle one task at a time. Each core is an independent processing unit, and our multitasking abilities directly depend on how many cores our hardware has. And here's the thing: the most advanced and newest CPU on the market only has 16 cores, and I mean that in terms of products that regular people can buy in a store, not industrial-grade, rocket-launching products. In fact, most of us are working with as little as 2 and up to 8 cores per CPU. Yet we never notice any issues, and the reason is that CPUs are incredibly fast. They run so fast that humans can't even notice that our tasks are being executed in a sequence instead of all at once. So CPUs are generally not so good at multitasking after all: they're fast, but they're also quite limited when it comes to running things in parallel.

How about graphics cards? A graphics card, or graphics processing unit (GPU), is in charge of displaying images on your screen. Modern GPUs are extremely powerful because they often come with their very own memory and their very own processor. GPUs help execute code in parallel, as well as offload processing from the CPU. When you're playing a video game, for example, instead of sending every little frame and every little movement to your CPU, where a bunch of calculations are made before being sent back to your GPU so that you can see the changes on your screen, the GPU can skip this unnecessary communication and provide you with a much better, much faster gaming experience.

But how come GPUs are faster at running instructions than CPUs? It's not really their primary purpose, so how can they do it better? Can you guess how many cores a modern GPU has? Please be extra generous, because what I'm about to say will shock you: if the most advanced CPU at the moment has 16 cores, the most advanced GPU has 10,496 of them.
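(To put numbers to this on your own machine: the short Python sketch below is my addition, not from the video, and assumes the PyTorch installation covered later on. It prints your logical CPU core count and some basic facts about your primary GPU.)

```python
import os
import torch

# Logical CPU cores the operating system exposes
print("CPU logical cores:", os.cpu_count())

# Basic facts about the primary CUDA device, if one is present
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    props = torch.cuda.get_device_properties(0)
    # Each streaming multiprocessor (SM) contains many CUDA cores
    print("Streaming multiprocessors:", props.multi_processor_count)
```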
Yes, specifically, NVIDIA's RTX 3090 has 656 times as many cores as Intel's 12th-gen i9 processor. Can you imagine what that means in terms of multitasking? We can essentially run over 10,000 more processes on our graphics card than on our CPU at any given point in time, at least in terms of these two product lines.

So if that's the case, why do we even need CPUs? Let's just get rid of them and use GPUs instead! There's just one problem: sometimes multitasking is not the best solution. In some cases, running tasks in parallel takes much more time and many more resources than just solving the problem one step at a time. That's why having CUDA run code in parallel is so handy: it allows you to switch between CPU processing and GPU processing for very specific tasks. That way, when writing programs, you can pick and choose exactly when to use which piece of hardware, so we essentially gain much more control over how our computer operates. You may have already been using CUDA without even knowing it, as applications such as the Adobe Creative Suite already utilize it. But how exactly can we use it in our own applications? How can we access it and combine it with our Python code?

Let's start by installing CUDA. As a pre-installation step, we will first verify that our GPU is indeed capable of CUDA. To do this, we will type nvidia-smi --list-gpus inside the terminal. In my case, I'm working with a GeForce RTX 3090, so we will just make a quick note of that and navigate to the link I've provided in the description. On that page you can find all the CUDA-enabled GPUs that NVIDIA has to offer, and if your GPU is anywhere on that list, everything is perfect: you can just move on with the CUDA installation. In my case, I'll navigate to the GeForce section, where we can indeed find my GPU. Awesome, now let's move on with the installation.

There are several ways to install CUDA, but in this tutorial we will do it inside an Anaconda working environment. If you are not familiar with Anaconda, I'm including a special tutorial in the description which will get you up to speed, and if you are not a big fan of Anaconda, I'm also including equivalent venv commands, so that we all have a nice alternative. In our case, we would like to start from zero with a brand new working environment, so let's create it first. We will type conda create -n followed by a name for our environment (in my case I'll call it ml, as in machine learning) and we will install Python 3.9 inside. We will confirm with y, and we will activate this environment with conda activate ml. Perfect.

Once we are inside our working environment, we can go ahead and install CUDA. Many of us will be tempted to install CUDA directly with the following command: conda install -c anaconda cudatoolkit. However, this version of the CUDA toolkit may not be compatible with other packages which we would also like to install in this environment, for example PyTorch. So let's try something else: instead of installing CUDA first, we will begin by installing PyTorch. We can do this by typing conda install -c pytorch pytorch, and look at that: it seems that by installing PyTorch we are also automatically installing the CUDA toolkit. The best part is that PyTorch already knows which version of the CUDA toolkit will work best for it, so this is a win-win situation. Let's scroll down, confirm with y, and congratulations, you have just installed CUDA.
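(For reference, here are the terminal commands from this part of the video collected in one place; the exact spellings are my reconstruction of the narration.)

```
nvidia-smi --list-gpus             # verify your GPU is CUDA-capable
conda create -n ml python=3.9      # new environment named "ml"
conda activate ml
conda install -c pytorch pytorch   # also pulls in a compatible cudatoolkit
```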
Now we can go ahead and start coding. In my case, I'm going to do this through Jupyter Notebook, but feel free to use any code editor or IDE; it doesn't really matter, so this step is absolutely optional. I'm installing Jupyter with conda install -c anaconda jupyter, and I'm going to run it with jupyter notebook. Cool.

Now I'm going to create a brand new Python 3 file, and the first thing I'm going to do is double-check that our CUDA installation was successful. To do this, we will import torch (as in PyTorch) and then check torch.cuda.is_available(). If everything worked and our installation was successful, this line of code will return True. Let's run it with Shift+Enter, and there you go: we have successfully installed CUDA. Perfect.

Actually, this command on its own is slightly ambiguous. Usually you will see something along the lines of: if torch.cuda.is_available(), then device = torch.device("cuda"), otherwise device = torch.device("cpu"), and right below there will be some kind of print statement saying "using device" followed by the device. Let's rerun it. Okay, now this is way more informative. Let's move on.

Our plan is to perform a quick speed test: we will create two extremely large data structures, and we will then measure how much time it takes each device to multiply them. To do this, we will first import the time module, which will help us with the timing, of course, and we will then create a new variable called matrix_size, which we will set to 32 * 512. If this size is a bit too big for your computer, you can always adjust it to 32 * 64 or pretty much any other number, as long as you multiply that number by 32, because the 32 here represents something called a batch size. We will talk about it in detail in future tutorials, but for now just make sure you include it.

Next, we will create the first data structure, completely out of random numbers. We will call it x and assign it to torch.randn (as in random numbers); inside the round brackets we will set the size to be matrix_size by matrix_size, which is an enormous amount of values. We will then just copy this line of code, paste it on the next line, and create our second data structure, which we will call y. Cool.

Now let's move on with the speed test, and we will begin with the speed of our CPU. We will print "CPU speed", and in the next line of code we can actually start timing: we will create a variable called start and set it to time.time(), which represents the current time. Next, we can multiply our matrices by typing torch.matmul (as in matrix multiplication) and passing our x and y data structures into the round brackets; actually, let's assign this expression to a variable called result. Once we are done calculating, we can simply print the elapsed time with time.time() minus the start time. Lastly, we will verify that we're using the correct device, so we will print "verify device" along with result.device. That's pretty much it for the CPU.
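(Assembled from the narration above, the notebook code up to this point looks roughly like this; it is my reconstruction, not a verbatim copy of the on-screen notebook.)

```python
import time
import torch

# Use the GPU if CUDA is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("using device:", device)

matrix_size = 32 * 512  # shrink to 32 * 64 if this is too big for your machine

# Two large square matrices of random numbers
x = torch.randn(matrix_size, matrix_size)
y = torch.randn(matrix_size, matrix_size)

print("CPU speed:")
start = time.time()
result = torch.matmul(x, y)  # matrix multiplication on the CPU
print(time.time() - start)
print("verify device:", result.device)
```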
Now let's move on to the GPU. Before we start timing, we need to load our data structures into CUDA. To do this, we will type x.to(device) at the very bottom of our code; if you remember, we have already set our device to cuda in the cell above. To be extra cautious, I'm actually going to assign this expression to x_gpu instead of just reassigning it back to x, and we can do the exact same thing with y: we will copy this line of code, paste it below, and adjust x_gpu to y_gpu and x to y.

Now we can start timing CUDA. We will copy the print statements from above, including the timing commands, and paste them at the very bottom of our code. We can then adjust "CPU speed" to "GPU speed", adjust result to result_gpu, and do the same thing for x and y, so x becomes x_gpu and y becomes y_gpu. The last thing we'll need to change is inside our verify-device command, where we change result to result_gpu.

But that's not all. We need to keep in mind that our CPU doesn't just stop and wait for the GPU to finish processing; it actually keeps executing the rest of our program while the GPU is still calculating. So we need a way to politely ask our CPU to hold on until the GPU has finished; otherwise, we'll be printing the speed results before we've even finished multiplying our matrices. There's actually a very easy way to do this: below our loading-to-CUDA commands we will type torch.cuda.synchronize(), a command that basically freezes our CPU up until the moment our GPU has finished loading both of our data structures. Then we will need to call it once again, so let's copy this line of code and paste it after we are done multiplying our matrices.

Lastly, we will wrap our GPU speed test in a for loop: for i in range(3), so we will repeat this speed test three times, and we will indent the following lines of code. The reason we do this is that the very first time we perform our matrix multiplication with CUDA, there is an extra process happening in the background, a sort of setup step which takes up time. So the fairest thing to do is to repeat the speed test three times and make sure we're getting the best result.

Let's go ahead and run this code: we will press Kernel and then Restart & Run All. Alright, let's take a look at the results. The speed of my CPU is 8.46 seconds, while the speed of my GPU is 0.26 seconds. Wow, we're looking at a ratio of 1 for the CPU to 32.5 for the GPU, which is incredible! Keep in mind, though, that this was a relatively simple task, so even though the results are very impressive, we're not even close to maximizing the capabilities of our GPU. We will of course learn how to do that in future tutorials; this one is just an introduction, so we're sticking to simple things.

Now, I actually have a very cool idea: what if you post your speed results along with what kind of CPU and what kind of GPU you're working with (and if you changed anything in the matrix size, please mention that too)? Then within a few weeks I'll be able to go over the entire comment section, copy your speed results, and organize them into some kind of table or database which we can all access, and we will have this super convenient list of equipment benchmarks, let's call it. So if you want to participate, please comment your speed results below along with any additional information about your hardware; I will really, really appreciate it.

I ran the speed test a few more times just to make sure these results are not a coincidence, and if you're curious how it looks for a 32 * 64 matrix, let's rerun the cell: this time we're looking at roughly a 1-to-6 or 1-to-7 ratio. So it seems that when we're dealing with extremely complex operations on extremely large data, our GPU really shines: the more complicated the task, the better the GPU performance we get.
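(Again as my reconstruction of the on-screen notebook, the GPU half of the speed test then looks roughly like this, continuing from the CPU snippet above.)

```python
# Move both matrices onto the GPU; new names keep the CPU copies intact
x_gpu = x.to(device)
y_gpu = y.to(device)
torch.cuda.synchronize()  # wait until both tensors have finished loading

# The first CUDA matmul includes one-off setup cost, so time it three times
for i in range(3):
    print("GPU speed:")
    start = time.time()
    result_gpu = torch.matmul(x_gpu, y_gpu)
    torch.cuda.synchronize()  # let the GPU finish before reading the clock
    print(time.time() - start)
    print("verify device:", result_gpu.device)
```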
The last thing we will talk about is why our device shows up as cuda:0 instead of just cuda. The zero represents our primary GPU. In my case I'm working with only a single GPU, but some of us have systems with multiple GPUs, so whenever we specify our device, instead of just specifying cuda we can actually specify cuda at a certain index: the primary GPU, as I mentioned, is index 0, the secondary GPU is index 1, and so on (there's a short sketch of this right below these captions). Cool.

This tutorial was a brief introduction to CUDA. In the next few tutorials we will dive in much deeper and mostly focus on machine learning: we will talk about pre-trained neural networks, and we will also talk about inference with TensorRT, so I am very excited! Thank you so much for watching. If you found this tutorial helpful, please share it with millions upon millions of people, or at least a few, it's a good start. If you liked this video, please leave it a like; if you have anything to say, please leave me a comment; maybe subscribe to my channel, and of course turn on the notification bell. I will see you very soon in a brand new tutorial, so don't go very far!
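(A minimal sketch of that device-indexing idea; the tensor shape here is just a placeholder of my choosing.)

```python
import torch

# "cuda" with no index defaults to the primary GPU, i.e. "cuda:0"
device = torch.device("cuda:0")

# On a multi-GPU system you could target the second card instead:
# device = torch.device("cuda:1")

x = torch.randn(4, 4).to(device)
print(x.device)  # prints: cuda:0
```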
Info
Channel: Python Simplified
Views: 213,644
Keywords: cuda, cudatoolkit, cuda toolkit, parallel computing, parallel programming, multitasking, multiprocessing, multi processing, gpu, cpu, graphics card, graphic card, processor, computer hardware
Id: r9IqwpMR9TE
Length: 19min 11sec (1151 seconds)
Published: Sat Dec 25 2021