How to train a deep learning model using docker?

Captions

Hello everyone. In one of the previous videos I showed you how you can use Docker to containerize your Flask-based web application. There were a couple of videos on my channel previously: one of them was training a deep learning model for skin cancer detection, then we created a web app, then we wrapped the web app inside Docker. A lot of people asked me how you would use Docker to train the model instead, so this is what I'm going to show you in this video. If you're not familiar with the previous videos, I would highly recommend that you go and take a look at how to train the skin cancer detection model, how to build a web app, and how to dockerize the web app. You should not skip dockerizing the web app, because I'm using some of the things from there in this video.

In this video we are training the deep learning model. As some of you already know, I created a simple library called wtfml (Well That's Fantastic Machine Learning), and it sits like a wrapper: simple functions that I normally use. This is what I'm going to be using, but since I'm changing it actively quite a lot, I have to tell you that you must use a given version.

So let's see the code. First of all, we will create the training code. We have already done that previously in one of the videos, so here I'm just walking you through the training code quite quickly. I have changed a few things, but not a lot. The first thing that I have modified here is that I have removed NVIDIA Apex, just to keep this video a little bit simpler. Then I have added this data path; this folder on my local machine contains all the data. Then I have the model; it's the same model from one of the previous videos, which you have already seen. Most of the stuff remains the same. What changes is that, since I've removed Apex, I removed the line where I was wrapping the model for automatic
mixed precision, so that's not happening anymore, and fp16 is False. That's the only change. I must tell you that I'm using wtfml 0.0.2. Now I think we have everything here that should be sufficient to train the model, without Apex obviously. What I'm going to do is take a head of 1000, just to make sure the model trains quite fast, and then we are going to train this model and see if it's training or not. So I am here, and, let me make it a little bit bigger for you, I can just write python and then the name of the file, which was main.py, and I hope this trains the model. Let's wait for it. Okay, so the model is training, and since it's only 1000 samples in training it should be quite fast, and it is. Now we have the validation step going on, and everything is done by this wrapper library that I'm using. We can see it's generating some kind of score and saving the model, so everything is working fine without Apex.

Now we go back, and here I have to reduce the size of this. What we have here is a Dockerfile that we created for deploying, so for wrapping the Flask app. We will take everything from here, create a new file called Dockerfile_train, and paste it there. One thing that you have to remember: we are using the base image of Ubuntu 18.04, and Ubuntu 18.04 does not come with CUDA and cuDNN drivers, so you have to install them yourself, or you can do something else. But before we move there, one of the things that you must do if you want to train a deep learning model using NVIDIA's drivers is install the Docker runtime that NVIDIA provides. It's very simple. You go to NVIDIA's nvidia-docker repository and scroll down a little bit; you will find the README, and there are instructions for Ubuntu and other distributions of Linux. What you do is you can just
copy-paste all these commands and run them in your terminal. It's not very difficult, and it takes quite a few minutes. After that you're done; it's a one-time step.

Once you're done with that, we go back to our Dockerfile for training, and here instead of Ubuntu we will use something else. You can get anything that you want from Docker Hub: go to hub.docker.com and search for whatever kind of image you want. nvidia/cuda already has a lot of different kinds of images, and you can see, for example, that this one is 10.2 with the cuDNN 7 runtime on Ubuntu 18.04. This is what we are going to use, or maybe not this one; maybe we will use 10.1, because why not. Let's see if there is a 10.1 image here. Okay, not on this page, so I can just filter here. Okay, so we have some 10.1 images: this is 10.1 with cuDNN 7 on Ubuntu 16.04, and 10.1 with cuDNN 7 on Ubuntu 18.04, so this is what we are going to use. You have this inside nvidia/cuda, right, so you go to your Dockerfile, modify the base image to this one, and add nvidia/cuda in front. Now you grab the image from there, and it comes with all the runtimes that you need.

The next step is to build the Docker image. We have already done it once, so we can just do docker build, provide the file name, then -t, which is the tag, so this time I'm using melanoma:train, and then a dot, and enter. This is going to build the Docker image. These steps take some time, and you can see Docker is also using the cache for different steps, because I have already run it once. Now it's grabbing torch 1.5, so this step is going to take some more time. It will install everything and the image will be ready. When the image is ready, we are ready to train. As you can see, it has successfully tagged melanoma:train, so now we are ready to train our model. What we are going to do is docker run the train command, which is also very
simple, and we have already seen how to do that previously. But before we do that, let's try to run something first. We do docker run, then we specify the image, so that's melanoma:train, and then you define the command. Let's use the command nvidia-smi and see what happens. You can see that even though I have GPUs on my machine, Docker is not able to detect them in its own container, and that's a problem. That can be fixed by using the NVIDIA Docker runtime. We already have that runtime; all you need to do is add another parameter, --gpus 1 (two hyphens, then gpus 1). Now you can see it's identifying one GPU on my machine, and it's running inside Docker. You can also change it: I have two GPUs, so I can change it to --gpus 2, and now it's identifying both GPUs inside Docker. This is what makes our stuff quite simple.

What we can do now is just run python3 main.py to train our model, so let's see if we can. Okay, it's complaining about pandas. That's one of the very important things that you must look at: requirements.txt. I will add pandas. You shouldn't add packages without versions; I'm just doing it for the video. Then we need to build the container again, which is again going to take some time. Whenever you change something in the code, you have to build the container again. Let's see how much time it takes now; it's probably going to take a lot of time, because it's downloading torch again.

It seems like it is done now, so let's try training the model again using the same docker run command and see what happens. Now it gives me an error that this file is not found. That's because the Docker container cannot access files on your host machine, and that's the problem. You have to change a few things again.
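
For reference, the build and GPU-check steps described so far can be sketched as a small Python helper that only assembles the command lines. This is a minimal sketch, not the code from the video: the function names are hypothetical, the file name Dockerfile_train and the tag melanoma:train follow the narration, and actually running the commands assumes Docker and NVIDIA's container runtime are installed.

```python
import shlex


def build_command(dockerfile, tag, context="."):
    # -f selects a Dockerfile with a non-default name; -t tags the built image.
    return ["docker", "build", "-f", dockerfile, "-t", tag, context]


def gpu_check_command(image, gpus=1):
    # --gpus exposes host GPUs to the container; nvidia-smi lists what it can see.
    return ["docker", "run", "--gpus", str(gpus), image, "nvidia-smi"]


if __name__ == "__main__":
    # Print the commands instead of running them, so the flags can be inspected.
    print(shlex.join(build_command("Dockerfile_train", "melanoma:train")))
    print(shlex.join(gpu_check_command("melanoma:train", gpus=2)))
```

Passing either list to subprocess.run(..., check=True) would execute it; printing first makes it easy to confirm the flags before a long build.
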
Before we go there, I would also like to explain this Dockerfile very quickly. We grab this image; that's your FROM command, which is the first command most of the time. Then the RUN command runs any commands, so here you can run Ubuntu commands. We are installing python3 and python3-pip. We are adding a user called abhishek, and we are changing the owner of the user's home folder to abhishek. We are copying everything from here on the left-hand side to a folder called app. We are changing the user to abhishek, going inside the app folder and installing all the libraries from requirements.txt, and we are changing our working directory to the app folder.

Now, whenever you're running a Docker container you can create as many files as you want, but they won't be saved: as soon as your program ends, Docker is going to delete everything. So what we do is change the data path a little bit here. Instead of /home/abhishek/workspace, which is my local address, I'm just going to use /home/abhishek/data, and this is my data path inside the Docker container now. Since we changed this, we need to build the Docker container again, and again we need to wait for some time. So let's try building it again; let me save this one. Okay, so now it's building the container again.

Let me see if I have something that I should explain to you here. All this is old stuff; you have already seen all of it in previous videos, so you must go through the previous videos once, and you must take care of the version of wtfml if you are planning to use it, because I change it quite a lot. I think we do have everything. Once we have built this Docker container, we will run it again, but now with the concept of volumes. So what are volumes? Volumes are like locations on your host machine that you can map to the Docker
container, and the Docker container can then modify them. You can mount multiple volumes if you want, or you can just mount one volume. You can think of it as a folder shared with the container: the container can modify it, and when it exits, whatever it changed stays in the folder that you mounted. Now we wait a little bit more for this build to finish. I think it's going quite okay, so it should finish very fast now. Okay, it's done, so now we can train our model again. Let's see what we had in the code: we still have the head of 1000, so it's only going to train on 1000 samples.

Now, in the command, what you can do is use the parameter -v and then specify which folder goes where. So my home workspace data folder goes to /home/abhishek/data on the Docker container. Let's run it, and just so it's visible, let me show you again: docker run, then specify the number of GPUs, then the -v parameter, which mounts this folder from my local machine to this folder on the Docker container, then you specify the image, which is melanoma:train, and then the command, python3 main.py. And now you wait.

As you can see, it was able to read the data, but now it also has to grab the pretrained model checkpoint, the ImageNet pretrained model, so this step is going to take some time. I could have also put this in my local data folder and, instead of downloading the weights, just loaded the state, but I'm not doing that; I'm just being a bit lazy here. You can do that, and then you don't have to wait every time. So now we wait.

Now it runs, and we get another error, and it says something about insufficient memory. That can happen because of how Docker shares memory with the host machine. To avoid it, you can set the number of workers to zero in your data loaders, but
that's obviously not a good choice. What you can do instead is use another parameter called --ipc and set it to host. Now when you run it again, your model will work fine. Let's run it again and see what's happening. As I mentioned, Docker had already downloaded this file and cached it, but now it's downloading it again, because it doesn't save anything when it exits. So let's try one more time. You can see now that the model is training fine and everything is going fine. It finished the first validation step, so let me stop it first.

Now we will go and see if the model file was saved. In the code I save the model in the same data path, which should be /home/abhishek/data, so we can go back to our terminal, and here I can check if I have the model 0 .bin file. If I just do ls, I'm getting the model 0 .bin file, so I have the file.

So this is how you can train using Docker. We used PyTorch, but you can also use TensorFlow, obviously. I hope you liked the video and enjoyed it. If you have some suggestions, let me know in the comment section; if you liked it, click on the like button, and do subscribe if you have not subscribed yet. Feel free to share it with your friends. My book is also out, so if you like, you can also buy the book; some of what I explained today is covered in the book as well. Thank you very much, and see you next time. Goodbye.
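
The final training run described above combines three flags: GPU access, a mounted data folder, and host IPC. A minimal sketch of how that command line could be assembled in Python follows. The function name and the host path /path/to/local/data are hypothetical placeholders; the image tag, the container path /home/abhishek/data and the script name main.py follow the narration.

```python
import shlex


def train_command(image, host_data, container_data, gpus=1,
                  script="main.py", share_host_ipc=True):
    cmd = [
        "docker", "run",
        "--gpus", str(gpus),                    # expose this many host GPUs
        "-v", f"{host_data}:{container_data}",  # mount host folder into the container
    ]
    if share_host_ipc:
        # Use the host IPC namespace so data-loader workers are not limited
        # by the container's small default shared-memory segment.
        cmd.append("--ipc=host")
    cmd += [image, "python3", script]
    return cmd


if __name__ == "__main__":
    # "/path/to/local/data" is a placeholder for the data folder on the host.
    cmd = train_command("melanoma:train", "/path/to/local/data",
                        "/home/abhishek/data", gpus=2)
    print(shlex.join(cmd))
```

Because the data folder is mounted rather than copied into the image, anything the training script writes under the container path, such as the saved model file, remains in the host folder after the container exits.
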

Info

Channel: Abhishek Thakur

Views: 34,617

Keywords: machine learning, deep learning, artificial intelligence, kaggle, abhishek thakur, docker pytorch, how to use docker to train models, docker nvidia runtime, using docker with pytorch, deep learning using docker, docker for deep learning, cuda cudnn docker, nvidia docker, how to train a model using docker

Id: Kzrfw-tAZew

Length: 20min 31sec (1231 seconds)

Published: Thu Jul 09 2020