Let's code on cloud GPUs with VSCode and Jupyter notebooks

Captions
Hey guys, it's Will. Today I'm going to show you how to code on cloud GPUs using either JupyterLab or VS Code, or even connect from your own local VS Code. The advantage of doing it this way is that you get a fresh new environment in a few seconds and can start coding right away; from there you can add GPUs, install whatever you want, and be coding pretty quickly.

First, to get started, you just go to lightning.ai and click "Launch a free Studio". This will ask you to sign up; there's no credit card required or anything. Once you sign up, you'll end up in a Studio just like this one. This is free, and you can see that I'm running on a CPU, for example. I can run on other GPUs and different machines, which I'll show you in a few minutes, but this is just like your laptop now: I have VS Code running here, and I'm just printing something. I can also run using JupyterLab, so if you prefer notebooks or the Jupyter experience, you have that here too. You can just create a new notebook, we'll call it "hi", and you're good to go, start doing what you're going to do. Great, that's it, and everything's automatically saved for you.

Now, you have different apps in the Studio. These are the two that I showed you, VS Code and Jupyter; there are other apps that I'll show throughout this video, and you can install your own apps as well. Personally, I prefer to code through VS Code, and sometimes I don't necessarily want to use the browser. If I don't want to do that, I can just connect my own local VS Code: I click "Code from local VS Code", copy the command, and then paste it into a new terminal here. It says it's adding a new SSH host, I press enter, and this takes about three to five seconds to set up. And now I am using that remote Studio, but I'm on my laptop now, so I
don't need the browser to be open anymore. So here, this is the Studio in the browser, and this is my local VS Code connected to that Studio. I have to click yes, and then if I create anything or write any code, you'll see it immediately reflected in the browser as well. Okay, now that I've shown you how to do this, I'm actually going to use the browser for a little bit, and then we'll come back and code locally. That's really all it takes.

This works for any kind of development, but it's really designed for AI and machine learning, because you have GPUs and a bunch of other things. I'm just going to run this Python hello world here. You also have a standalone terminal if you want, so I can run things manually, and you have full control: you can install whatever you want. Let's say I want to use something called cowsay. cowsay is this silly program that's not installed on here, but I can just go ahead and install it. Great, and now I can use it: cowsay hi. So now this is installed, and when I turn off the Studio and turn it back on, everything that I've set up will stay there: it's a persistent cloud environment. By persistent we mean that all the dependencies, all the files you've put on here, and everything else are available whenever you restart it, or even change GPUs.

Okay, that's kind of a boring example. If you're in AI or ML, you're likely going to be testing models, debugging, prototyping, that kind of thing. Just to show you how some of this works, I'm going to paste in some PyTorch Lightning code. PyTorch Lightning is our deep learning framework for fine-tuning and pre-training models; it's been the standard for doing this for many years. You can use any open source model, or your own, so you can bring in models that
are PyTorch, could be OpenAI, could be Hugging Face, it doesn't matter, drop them in here. So you could set model = BERT, or whatever you're going to use, and then you define what happens during the forward pass, and then the training step here as well. This gives you a lot of control: you have full control over how you're going to fine-tune and pre-train.

I'm going to run this thing right now, and this is going to run on the CPU specifically. The workflow that I recommend is that you start training or debugging your code on a CPU first, so you can profile how things work (I can see my CPU and RAM here, everything's good) before you go scale to something more expensive. Now I want to run on a GPU, so I'm just going to pick this A10G, for example. You get 15 free credits with the platform as well, which means you can run roughly 22 hours on T4 GPUs, and probably around nine or ten hours on an A10G. Okay, I'm going to Ctrl-C this thing real quick because I don't need it. If you want to start a brand-new Studio, it's very straightforward: I just type studio.lightning.ai into the browser.
That just creates a brand-new Studio. So this is by far the simplest, fastest way to get an online environment very quickly that's persistent, where you can prototype, and from there you can really scale. You can do a lot more than just coding; this is kind of just the beginning.

Okay, my A10G is ready, so I'm going to switch to it. And if you notice, nothing's really changing here: it's exactly the same code, everything is the same, I've just got these new metrics up here. So, my connection did cut out, but I can just reload it and it should connect again. All right, now it's back, and my local VS Code is still connected to the Studio, even though under the hood they're different machines now, which is great. Again, you don't have to mess with environments or any of that stuff here.

Okay, remember cowsay? Well, it's always going to be available in your Studio going forward. And if I go to the other Studio that I started, that's a brand-new, fresh environment: I don't have any of that stuff, it's not installed. It's like you got a brand-new laptop with everything set up for you. So this is the fastest way to create a new environment: do whatever you want, prototype, and if you like it, keep it; if not, delete it, it's not a problem.

Okay, cowsay is working, so I'm going to run this on a GPU now and just press play. If you haven't tried PyTorch Lightning, one of the magic things it provides is that you don't have to deal with changing your code for different hardware. We were the first ones to introduce this ability, in 2019, through this Trainer; if you've seen a trainer in other frameworks, this is kind of where it's coming from. This is the world's first trainer, and it's also the most robust and scalable trainer because of that; you can even go all the way to multi-node.
Some of the world's largest models today are trained using PyTorch Lightning specifically, and you can see why: it's very simple and it just works. It's been production-ready for a few years now, so it works really, really well. That said, in the Studio you can run whatever you want: JAX, TensorFlow, your own PyTorch code, we don't really care; it's up to you.

Okay, so this thing is running now. Something that I like to do before I go scale things up is a hyperparameter sweep. In this particular case, let's say I wanted to train this model. Before I go do that, maybe I want to tune the learning rate, and I'll show you what I mean. By the way, you guys can duplicate this Studio yourselves: when you find the Studio template, you just hit Run, and it basically forks it and gives you a full copy of everything.

So say I run a hyperparameter sweep. I'm training one model, and the learning rate is how fast that model is going to learn. On the left you have a curve with a high learning rate: the loss drops very quickly, so theoretically it could finish very fast, but the loss doesn't really go all the way to the bottom. Maybe this orange one would if we let it run longer, but it won't actually give you the best performance. If your learning rate is too low, you'll see this blue curve up here, where it just takes forever to get lower and lower. Now, when you're training something for ten minutes or an hour, it's really not a big difference, and I don't know if I'd waste a lot of time on this. But if you're going to train over multiple days or months, or even multi-node, you definitely need to run a sweep for a few hours to understand which learning rate gives you the better curve, and then run that for the
rest of the actual training. So here, this is probably the one I would pick, whichever run gave me this curve, because it drops the loss fairly quickly and it's also the lowest loss. So I know that I probably wouldn't need to spend a lot of money on this, because I could probably cut it off earlier.

So how do you find that? Well, you can experiment yourself, or write something like a sweep, which is what we're going to do here. In Lightning there's no special syntax for a sweep; it's just a for loop over what you want to run. The Studio that you're on has an SDK attached to it, and you can also use the SDK outside of a Studio, so if you're building pipelines and that kind of thing, you can do that too. So I can reference a Studio here: you grab a reference to the current Studio, you use this plugin called Jobs, which is this guy here, and then I'm just going to search over these three learning rates and for-loop over them. My script is called main.py, and I need to make sure it can take a learning-rate argument. Right now it can't: if I run main.py and pass lr 1, it's going to complain that it doesn't know what lr is. Yeah, it definitely does not. Okay, so here's what I need to do: I'm going to add an argument parser to this. There are many ways of doing this; I just prefer this one: import ArgumentParser. Then I'm going to put this into a main function, which is totally optional but good practice.

Notice here I'm using a VS Code extension called Vim. Once you install extensions into your Studios, they're available in all your Studios: if I go to this other Studio, I can still use them there, because I've installed them globally for all my Studios. And
if I code from my local machine, I have all my local extensions that I normally use, so I'd also have Copilot, or whatever else I'm using locally, available in the Studios. I sometimes prefer to just code locally like this: again, this is my local VS Code, but the code is actually on the Studio in the cloud, on a remote server.

Okay, so I create an ArgumentParser, then add an argument to it: add_argument, it's going to be lr, and the type is going to be float, and that's it, and then parser.parse_args(). So this is going to give me a learning rate; it allows me to push parameters into the script. I'm just going to open this terminal, maybe make it a little bigger, and run python main.py to make sure this is currently working. Okay, now you see that it gives me the arg here, lr=1, so I can pass that in and everything should work. Well, that argument isn't actually being used anywhere yet, so we need to pass it in somewhere. We're going to pass it into the model, into the fine-tuning sequence: I'll default it to whatever the value is today, register the argument here, and then just pass it in here. So I've parameterized the script. This is called a hyperparameter: it's a parameter for the script itself.

Now what I want to do is run the same script multiple times in parallel. That's called a sweep, and it's how you figure out which learning rate is the best one. All right, this is all set up, and my sweep script is here. I'm going to name it so it's clear what it is, and I'm going to try three different values for this hyperparameter, three learning rates.
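The argparse setup described above can be sketched like this (the default value is an assumption; the video defaults to whatever the model currently uses):

```python
# Sketch of the main.py argument parsing: expose the learning rate as a
# command-line flag so the script can be parameterized for a sweep.
from argparse import ArgumentParser


def parse_lr(argv=None):
    parser = ArgumentParser()
    # --lr becomes a hyperparameter: a parameter for the script itself.
    parser.add_argument("--lr", type=float, default=1e-3)
    args = parser.parse_args(argv)
    return args.lr


if __name__ == "__main__":
    # e.g. `python main.py --lr 0.01` -> the parsed lr gets passed to the model
    print(f"lr={parse_lr()}")
```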
They're roughly one order of magnitude apart; that's usually how you do this. From experience, I know that we probably also want to try another one that's faster, so this is a super fast one and this one is extremely slow; probably one of these two will be the best one. Okay, I'm going to call this sweep feb-20-13, where 13 is the time I'm submitting it, I'm going to use an A10G to run it, and now I'm just going to run this sweep script, and that's it.

The sweep script takes my current Studio, and the Jobs plugin submits four different jobs, one job per value of the learning rate, and you see that it gives me the links to all of them. If I go back to my Studio, I see the four jobs here. What a Studio job does is fork the Studio fully, all the dependencies, files, and code, and run it on its own machine. So now I've got four versions of the Studio running, each on its own machine, and I can actually test how the model will converge under different learning rates.

This gives you scale: Studios are not just about coding interactively, they're about scaling out. Once you've set up an interactive environment, you can blow it up: spin up 100 versions of it, do multi-node training, which I covered in a different video, and many other things. In this particular case I'm going to keep it short and simple: I've got four jobs, all running on their own A10G machines, which is great.

So while that's happening, why am I doing this? Because I want to find the best learning rate before I go scale up to an expensive machine. I really should wait until the sweep is finished before doing this, but for brevity I'm going to do it right now: I'm going to scale to a machine with four GPUs.
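The sweep he describes (a plain for loop, one job per learning-rate value) can be sketched like this; `sweep_commands` and the job-submission comment are illustrative stand-ins, not the actual Studio SDK calls:

```python
# Sketch of a sweep: no special syntax, just a for loop over the values to try.
def sweep_commands(learning_rates):
    # One command per learning rate, each running the parameterized main.py.
    return [f"python main.py --lr {lr}" for lr in learning_rates]


for cmd in sweep_commands([1e-4, 1e-3, 1e-2, 1e-1]):
    # In the Studio, this is where the Jobs plugin would fork the Studio
    # (code, files, dependencies) and run `cmd` on its own A10G machine.
    print(cmd)
```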
Now, normally before you do this, make sure you have debugged on a CPU, then debugged on a one-GPU machine, and then you can go scale to multiple GPUs. If you've never run on multiple GPUs, it's probably because you don't have the right tools for it, if you're using just regular PyTorch or something else. One of the advantages of PyTorch Lightning is that it can just scale automatically, no matter how many machines you have and how many GPUs are on them.

Once this Studio gives me the four GPUs, which takes a few seconds, notice that I have four little bars up here, one for each GPU. Wait for this to be fully green, because it's setting up everything for you, and once it's ready I'll start running again. Now, the jobs are still going; I haven't done anything to them. These four jobs are running on their own, in parallel, and I can see the machines here and what's happening. This script is very short, so I suspect it will complete very quickly; these are still setting up, a minute and a half in, and so on.

Okay, let's see: this one is using the learning rate 0.01. I can SSH into the machine that's running this job to make sure everything's good to go. So here I'm in that machine, and I can just monitor things. This is not an interactive environment; it's really for debugging and understanding what's going on. This is also how you run things in production, for example when you develop and then want to scale out.

Okay, so now I'm going to run the script again. So far, what I've done is: I went from CPU to one GPU; on that GPU I made sure nothing failed; then I submitted a sweep to find the best learning rate; and while that's running, I'm switching to a more expensive GPU machine to do multi-GPU training and speed things up. This is already probably four or five times faster. Think about it this way: if you've configured the model correctly and you're feeding it enough data, so the right batch size and everything, and my model takes, I
don't know, 32 hours on one GPU, then if I go to two GPUs, what you want to see is that it takes 16 hours: you cut the training time in half because you're using two GPUs. If I go to four GPUs, I expect it to halve again, so that should be eight hours, roughly 4x faster. You can do the math if you want to continue scaling: go to 8 GPUs, 16 GPUs, 32, 64, and you'll keep cutting the time in half, and half, and half. You can do that here by using the multi-node app, which I have other videos for: I just install it and submit this job multi-node, and that will continue to scale.

So in this case I'm actually doing the training, and before I go scale this up, I can go find the best learning rate. Normally you'd need an experiment manager or something to do that; in Lightning you have a few built-in options, which I'll show you in a second with TensorBoard. But if you notice, this model is training on the four GPUs right now. I wouldn't scale this up to multi-node yet, because I see that one GPU is not being utilized correctly. You can pop this chart out on its own, and I have it here, which is great, and then I'm good to go.

Okay, so in this case I definitely want to wait for those jobs to tell me which one's the best learning rate, so let's actually go see what they are. I can go here in the file system, go up to Jobs, and I see the ones that I submitted just now, so I'm just going to symlink them here. Great, and you see that I've brought over all of the experiments that I'm running right now, all four of them. If I go to my TensorBoard, so this basically runs TensorBoard on the Studio, I'm able to open this TensorBoard here, see all the experiments, and then actually share this link with anyone, so anyone can see, live in TensorBoard, the experiments I'm running.
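The linear-scaling arithmetic above (32 hours on one GPU, 16 on two, 8 on four) can be sketched as:

```python
# Ideal linear scaling: doubling the GPU count halves the wall-clock time.
# (Real multi-GPU jobs only approximate this; communication overhead and
# under-utilized GPUs, like the one noted above, eat into the speedup.)
def ideal_training_hours(single_gpu_hours, num_gpus):
    return single_gpu_hours / num_gpus


for gpus in (1, 2, 4, 8):
    print(gpus, "GPUs ->", ideal_training_hours(32, gpus), "hours")
```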
Then, you know, I would let it run and find a learning rate that I actually liked. So I'm just showing a few of the more advanced uses of the Studios here. I'm going to turn off these jobs because I don't need them running anymore; this shuts down all four independent machines, and then I'm back to my Studio here.

Now, let's say I'm pretty happy with what's going on. If you notice, when I ran this, it started running on the four GPUs, so I can keep iterating on this, and that's fine. As long as you're happy with how the model's doing, I would probably only be on the four GPUs when I'm training something for real or running something long-term. Otherwise, I'd switch back to a CPU here and just continue developing and debugging on the CPU, because I don't necessarily need to be spending GPU hours, or wasting money, on run-of-the-mill development before I go start to scale up again.

This is great. Now, if you notice, the other Studio that I started automatically went to sleep. If your Studio is inactive, so it's not running code or anything, it automatically goes to sleep, which saves you a ton of money. The one that I'm running on right now is going to switch back to a CPU machine pretty soon here. And I can set up as many of these environments as I want; you can keep them around forever and delete them when you're not using them. And yeah, that's it.

Okay, I've switched to a CPU and I'm good to go, so now I can continue developing on the CPU. I just wanted to give you a very quick, high-level look at one of the simplest ways, I think, to get an environment on the cloud with CPUs and GPUs, connect to it from your local VS Code, and be able to code from JupyterLab or VS Code as well. And again, you can start a new one by just
typing studio.lightning.ai and letting it start up again; I've got another one setting up here. Anyway, I hope this was helpful and that you get a lot of value out of it. Let me know in the comments if you want to see other videos; I'll be making a lot more on more advanced use cases, focusing on machine learning and things like that. Thank you for watching!
Info
Channel: william falcon
Views: 1,672
Id: eK6ft51OCTc
Length: 21min 52sec (1312 seconds)
Published: Wed Feb 21 2024