PyTorch on the GPU - Training Neural Networks with CUDA

Video Statistics and Information

Captions
Welcome to deeplizard. My name is Chris. In this episode, we're going to learn how to use the GPU with PyTorch. We'll start by learning how to use the GPU in general, and then we'll see how we can apply these general techniques to training our neural network. If you haven't seen the video where we talked about GPUs in general and why we would even use them in deep learning, be sure to see that video as well, because it will help you get the best understanding of these concepts. This video is more practical: we'll look at exactly how we use the GPU rather than why we would use it.

For now, I want to hit the ground running with some examples. We have all of our code set up, and we're ready to go. If you're not familiar with this code setup, be sure to check the previous episodes in this course, where we built all the code up to this point.

PyTorch allows us to seamlessly move data to and from the GPU as we do tensor computations. To show our first example in action, we're going to create a tensor and a network, then move both the tensor and the network to the GPU, and finally pass the tensor to the network and get a prediction. Here we can see that we've gotten a GPU prediction back, and if we print out its device, we can see that this prediction tensor is indeed on a device of type cuda, which is the GPU. Now let's move the tensor and the network back to the CPU and perform the same task. We can see that the CPU prediction we got back from the network has a device of type cpu.

This is how we can work with tensors and networks on the GPU: we can move them back and forth in this way, and it gives us a hint at how this is done in the training loop. In the training loop, we just need to make sure that our data, which is a tensor in this case, and the network are both on the same device. This, in a nutshell, is how we can utilize the capabilities of PyTorch when it comes to the GPU.
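The workflow just described can be sketched like this. The actual Network class from the series isn't shown in the transcript, so a small nn.Linear module stands in for it here, and the cuda() calls are guarded so the sketch also runs on a CPU-only machine:

```python
import torch
import torch.nn as nn

# A small stand-in for the series' Network class.
network = nn.Linear(4, 2)
t = torch.ones(1, 4)

# Move the tensor and the network to the GPU (if one is available),
# then get a prediction on that device.
if torch.cuda.is_available():
    t = t.cuda()
    network = network.cuda()
gpu_pred = network(t)
print(gpu_pred.device)  # cuda:0 on a GPU machine, cpu otherwise

# Move both back to the CPU and perform the same task.
t = t.cpu()
network = network.cpu()
cpu_pred = network(t)
print(cpu_pred.device)  # cpu
```

Note that both the tensor and the module support cuda() and cpu(), even though, as discussed next, they are different types with different method implementations.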
Now, something interesting to note here is that we called the cuda method on both a tensor and on a PyTorch network. If you think about that for just a second, you realize that something's odd, because these two objects are not the same type. So even though the method name is the same, cuda in this case, and cpu when we're moving to the CPU, these are actually different methods, and they work differently under the hood. By the end of this video, we're going to understand exactly what those differences are. But before we get there, let's look at a couple more examples that will highlight some of them.

I want to start by creating two tensors. Checking the device of each, we see that both tensors were created on the CPU; this is the default behavior of PyTorch. Now what I want to do is move one of these tensors, and only one of them, to the GPU. So here we've created one tensor, t1, and another tensor, t2, and we've moved t1 to cuda. After checking the device, we can see that t1's device is indeed cuda. Since we're expecting that we might get an error, we're going to wrap the next call in a try block and catch the exception. We do t1 + t2, and we can see that we indeed get an error: expected device cuda:0 but got device cpu. By reversing the order of the operands, we can see that the error message also changes. Both of these errors are telling us that the binary plus operation expects its second argument to be on the same device as its first argument. Finally, for completeness' sake, let's move the second tensor to the cuda device and see that the operation succeeds. Now the operation has succeeded, and we see that the result is indeed on the cuda device.

Note the use of the to method here. In this case, instead of using the cuda method or the cpu method, we used the to method and passed in a parameter that specifies which device we want to move to. The to method is the preferred method to use when we're moving tensors to and from devices.
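A minimal sketch of that mismatch experiment, again guarded so it runs on machines without a GPU:

```python
import torch

t1 = torch.tensor([[1., 2.], [3., 4.]])
t2 = torch.tensor([[5., 6.], [7., 8.]])
print(t1.device, t2.device)  # both default to cpu

if torch.cuda.is_available():
    t1 = t1.to('cuda')  # to() is the preferred way to move tensors
    try:
        t1 + t2         # operands live on different devices
    except RuntimeError as e:
        print(e)        # complains that the second operand is on cpu
    t2 = t2.to('cuda')  # move the second tensor as well
    print((t1 + t2).device)  # cuda:0 - the operation now succeeds
```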
One last thing to notice in the output: whenever our device is cuda, the GPU, we also see an index. This is because PyTorch supports multiple GPUs. By default, if you only have one GPU on your system, it will default to index 0. Using multiple GPUs is out of scope for this lesson, so we won't cover it in any more detail; just know that the index specifies which GPU you're using.

We just covered how tensors can be moved to and from devices. Now I want to turn our attention to how we move networks to and from devices, and more generally, to what it even means to move a network to or from a device. This is the essential thing to grasp here, PyTorch aside; it applies no matter which framework or programming language we're using. To understand it, let's create a PyTorch network and then take a look inside at the network's parameters. Here we'll create a network, and then we'll iterate through its named parameters, printing out the name and the shape of each parameter. I want to do this same iteration again, but this time printing the device of each parameter along with its name. When we print the device of each of these parameters, it is indeed the CPU. This shows us that, by default, when we create a PyTorch network, all of its parameters, which are tensors under the hood, are initialized on the CPU. An important consequence is that this is why neural network modules like networks don't actually have a device attribute. The network isn't the thing that's on a device, technically speaking; it's the network's parameters, in other words the tensors that live inside the network, that are actually on any given device.
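A sketch of that parameter inspection. The series' Network class isn't reproduced in the transcript, so a small conv-plus-linear module stands in for it:

```python
import torch.nn as nn

# A small stand-in for the series' Network class.
network = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),
    nn.Flatten(),
    nn.Linear(6 * 24 * 24, 10),
)

# Print the name and shape of each parameter tensor.
for name, param in network.named_parameters():
    print(name, '\t', param.shape)

# Modules have no device attribute; their parameter tensors do,
# and by default they are initialized on the CPU.
for name, param in network.named_parameters():
    print(name, '\t', param.device)  # cpu
```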
Now let's see what happens when we ask a network to be moved to the GPU. We move this network to cuda, check the named parameters, and see that indeed the network has been moved to cuda; specifically, all of the network's parameters, which are tensors, have been moved to cuda. Next, let's create a sample and pass it to the network. We've created a sample tensor; checking the shape, it's 1 x 1 x 28 x 28. We'll use a try block and catch the exception that is sure to come, because this sample is initialized on the CPU while our network is now on the GPU. And we do get an error: expected object of device type cuda but got device type cpu for argument #1 'self', along with a reference to our first conv layer in the forward method. This error is very similar to what we saw before when we were simply adding tensors. Let's see this computation succeed by moving our sample to cuda. After moving the sample to cuda, we can pass it to the network and get a result.

The next thing is to take a look at how we can detect whether CUDA is available on our system. The reason we want to be able to do this has to do with something called device-agnostic code, which means we want the programs we write in PyTorch to be agnostic to which device they run on. We don't want to write a program that just calls cuda everywhere, and then hand it to someone who will run it on a machine that doesn't have CUDA, because our program wouldn't work in that case. One way to write device-agnostic code is to use the torch.cuda.is_available() call. This tells us whether or not CUDA is available on our system, and we can set our device for the rest of our program based on the output of this call.
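Putting the device-agnostic pattern together with the sample-through-the-network example (a single conv layer stands in for the series' network here):

```python
import torch
import torch.nn as nn

# Pick the device once, based on availability, and use it everywhere.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

network = nn.Conv2d(1, 6, kernel_size=5).to(device)
sample = torch.ones(1, 1, 28, 28)  # one 1-channel 28x28 image

# Passing a CPU sample to a CUDA network raises a RuntimeError,
# so move the sample to the network's device first.
pred = network(sample.to(device))
print(pred.shape)  # torch.Size([1, 6, 24, 24])
```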
On the blog post for this episode, there's a little more detail about writing device-agnostic code. The only reason I'm using that kind of language is that you may see it in the PyTorch documentation or on other blogs. It basically just means writing code that works anywhere; you don't want to write code that only works on one device.

Now we're ready to take a look at using the GPU in our training loop, and we want to do a performance test to understand exactly how much of a speed-up we actually get when we use a GPU versus a CPU. To do this, we're going to build on the code that we've been developing over the last few episodes in the series, which is where we configure our runs and our training loop. Before I show you the modifications that have to happen inside that code, there's one change we have to make to the RunManager class in order for all of this to work. (I misspoke earlier and said RunBuilder when I meant RunManager.) The place where the modification needs to happen is in the begin_run method, where we add a graph to the TensorBoard instance. We need to check which device this run is using: is it running on the CPU or on the GPU? This makes the code backward compatible with what we've already written. We check whether our run has a device attribute; if it does, we use that value, and if it doesn't, we default to the CPU. Make sure to update your code with this change, and then you'll be backward compatible and ready to move forward.

So let's jump back to our training loop now and see what modifications we need to make to see this in action. The first order of business is that we need to put a device inside of our run configurations.
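The backward-compatible check can be done with getattr and a default. A minimal sketch of just that check, using illustrative run tuples rather than the series' actual RunBuilder output:

```python
from collections import namedtuple

# An old run definition with no device field, and a new one that has it.
OldRun = namedtuple('OldRun', ['lr', 'batch_size'])
NewRun = namedtuple('NewRun', ['lr', 'batch_size', 'device'])

old_run = OldRun(lr=0.01, batch_size=1000)
new_run = NewRun(lr=0.01, batch_size=1000, device='cuda')

# getattr falls back to 'cpu' when the run has no device attribute,
# which is what keeps begin_run backward compatible with older configs.
print(getattr(old_run, 'device', 'cpu'))  # cpu
print(getattr(new_run, 'device', 'cpu'))  # cuda
```

Inside begin_run, the device obtained this way is what the images tensor would be moved to before the add_graph call.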
In this case, I have device set to the two values we want to try: cuda and cpu. What this does is expose these values inside of our runs. There are a few places inside the training loop where we need to access these values. The first is right at the top of the run, where we get our device: we take run.device and create a PyTorch device from it. This is one way you can create a PyTorch device: call torch.device and pass in 'cpu' or 'cuda', and you get back that particular device. We can pass this object around throughout our program, and it will be the device that is used.

The first place that will use the device is the network. We initialize our network, and right away, before we even assign it to a variable, we chain the to call onto it and pass in the device. So we'll initialize our network on this particular device, whichever one it may be; we're going to be swapping between cuda and cpu. The next thing that has to change is just below, where we unpack our images and labels tensors. Instead of unpacking them all at once from the batch, we now do it one at a time using indexing, and as we do that, we chain the to call on to send both the images and labels tensors to the device that the network is also on. That's all there is to it to get our training loop running against these two different devices.

Now, with all the work we've done before, we can use this to do a performance test and see how much of a speed-up we get when we run on cuda versus cpu. Let's run this code and see what we get. You can see that we're going to use the same learning rate every time, because that shouldn't change anything.
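Those loop modifications can be sketched like this, with a stand-in linear network, a single illustrative batch, and an inline RunBuilder-style expansion of the parameter dictionary (the real training loop from the series has more to it):

```python
import torch
import torch.nn as nn
from collections import namedtuple
from itertools import product

# Run configurations now include a device parameter.
params = dict(lr=[0.01], batch_size=[1000], device=['cuda', 'cpu'])
Run = namedtuple('Run', params.keys())
runs = [Run(*v) for v in product(*params.values())]

for run in runs:
    device = torch.device(run.device)  # build the torch.device once per run
    if device.type == 'cuda' and not torch.cuda.is_available():
        continue  # skip CUDA runs on a CPU-only machine

    # Chain .to(device) onto the network right as it's created.
    network = nn.Linear(28 * 28, 10).to(device)

    # Inside the batch loop, unpack the tensors one at a time by indexing,
    # chaining .to(device) so they land on the same device as the network.
    batch = (torch.ones(run.batch_size, 28 * 28),
             torch.zeros(run.batch_size, dtype=torch.long))
    images = batch[0].to(device)
    labels = batch[1].to(device)
    preds = network(images)
    print(run.device, preds.shape)
```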
Then we have three different batch sizes that we'll try: 1,000, 10,000, and 20,000. For num_workers, we'll iterate over two variations. OK, so I'm going to run this code now. We can see that we finished all of our runs, and just down here, so we can make better sense of it, we're going to query the information and sort it by the epoch duration. We access the run data from within the RunManager by doing m.run_data; then we orient along the columns and sort the values based on epoch duration, using a pandas DataFrame to do this. We run this, and here we can see, sorted by epoch duration, that cuda blew away the cpu every time; it looks like it was about twice as fast in this case. When you get your results, post them in the comments and let us know how you did. What was your speed-up?

That covers how we can use the GPU with PyTorch. All of this comes baked in; there's no need to do any additional installs or anything like that, so PyTorch makes it really easy to use the GPU. I hope this video helps you understand what it actually means to move a network to a device.

I'm actually starting to run out of light here. This is one of the first videos we've ever recorded face to face, so we're still trying to work our way through this new way of doing things, and it's actually kind of hard, because we're using the sunlight coming in from outside and it's starting to get dark, so I'm losing all the precious light that I need. A fun fact is that we're actually recording this video from Vietnam. We've been traveling for about the last year and a half, and we document all of that on our other channel, deeplizard vlog, so if you're interested in following us over there, go check it out; we basically document everything we do in terms of travel. And if you haven't already, be sure to check out the deeplizard hivemind, where you can get exclusive perks and rewards.
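As a recap, the results query from earlier in the episode can be sketched like this, with made-up run records standing in for the RunManager's m.run_data list (the durations here are illustrative, chosen to reflect the roughly 2x speed-up reported in the episode):

```python
import pandas as pd

# Illustrative run data; in the series this comes from the RunManager's
# run_data list (m.run_data), with one dict per run.
run_data = [
    {'run': 1, 'device': 'cuda', 'epoch duration': 7.4},
    {'run': 2, 'device': 'cpu', 'epoch duration': 15.1},
]

df = pd.DataFrame.from_dict(run_data, orient='columns')
print(df.sort_values('epoch duration'))  # fastest runs sort to the top
```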
Thanks for contributing to collective intelligence. I'll see you next time!
Info
Channel: deeplizard
Views: 28,583
Keywords: deep learning, pytorch, cuda, gpu, cudnn, nvidia, training, train, activation function, AI, artificial intelligence, artificial neural network, autoencoders, batch normalization, clustering, CNN, convolutional neural network, data augmentation, education, Tensorflow.js, fine-tune, image classification, Keras, learning, machine learning, neural net, neural network, Python, relu, Sequential model, SGD, supervised learning, Tensorflow, transfer learning, tutorial, unsupervised learning, TFJS
Id: Bs1mdHZiAS8
Length: 16min 39sec (999 seconds)
Published: Tue May 19 2020