Auto-Tuning Hyperparameters with Optuna and PyTorch

Captions
Hello, and thanks for listening to this educational video about how to use Optuna and PyTorch. I'll go ahead and get right into it. I'm the speaker for this. I live in Japan, and I've been working for a company called Preferred Networks; you might know us as the company that originally created Chainer, which was a precursor to PyTorch. I worked on the Chainer team for quite a while, and currently I'm working on automated machine learning.

Let me start by giving you an overview of what we're going to talk about today. First I want to focus a bit on hyperparameters themselves and what they are, and give you a glance at the code we'll be using with Optuna. Then I'll talk about the gears of the machine, meaning what makes Optuna work and how it does what it does, then some of the bonus benefits you get by working with a framework like Optuna, which automates the process for you, and finally, more specifically, how you would apply all of that to PyTorch.

So first off, what are hyperparameters? Hyperparameters are the variables, if you will, that control the behavior of algorithms. They're very important because they often strongly determine an algorithm's performance, and they're usually set manually by the programmer, who will just put in a certain learning rate, a number of layers, which optimizer to use, which batch size to use, and so on. The really important point is that they can determine the success or failure of your algorithms and programs.

Here's an example from object detection, taken from our entry in the Google image recognition competition on Kaggle. With a bad suppression-threshold hyperparameter you get a whole jumble of overlapping bounding boxes; once we tuned it, we got a clean picture with one box per object.

The thing is, hyperparameters really are everywhere. We just mentioned the suppression method and suppression threshold, which are what gave us those nice boxes, but start at the beginning of the pipeline. For the image itself: which augmentation method will you use, in what order, and at what magnitude? What image size, what image format, and how will you use the JPEG decoder? Within the detector model: will you use VGG, ResNet, ResNeXt, or NASNet? How many ResBlocks, what kernel size, what batch-normalization order, how many FPN layers, and so on. For the network trainer: the batch size, which optimizer to use (SGD, Momentum, Adam, Alpha), and what the learning rate or learning-rate schedule should be. Even down at the hardware level, which you might not think of, there's the choice of FP16 or FP32 floating-point precision, or mixed precision, and even the CUDA kernel parameters. So there is a huge number of things that have to be tuned.

Usually, though, people start by doing this by hand. They take the hyperparameters they know about, say the learning rate or the dropout, try a setting (a 0.1 learning rate, a 0.5 dropout rate), run it, and get a certain accuracy. Then they try again with a 0.01 learning rate and 0.2 dropout and get a certain accuracy; then 0.05 and 0.3 and get a different result. But this is manually intensive, and the progress we want to make is to have Optuna do it for you automatically, more quickly than could usually be done by hand, or perhaps in parallel; I'll talk a bit about that later. The loop people write by hand looks something like the sketch below.
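A minimal sketch of that by-hand search, assuming a hypothetical train_and_evaluate() helper standing in for your training code:

```python
# Hand-tuned grid over two hyperparameters; every combination is trained,
# with no focus on promising regions and no early stopping.
best_acc, best_params = 0.0, None
for lr in (0.1, 0.01, 0.05):            # learning rates tried by hand
    for dropout in (0.5, 0.2, 0.3):     # dropout rates tried by hand
        acc = train_and_evaluate(lr=lr, dropout=dropout)  # hypothetical helper
        if acc > best_acc:
            best_acc, best_params = acc, {"lr": lr, "dropout": dropout}
```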
Usually, as people begin to work with deep learning algorithms, they go through a sort of hyperparameter evolution. The first step is "hyperparameters? what hyperparameters?": basically not tuning them at all, just taking the default values or perhaps the values from the research papers. In the next step, as people begin to realize how important they are, they start manually fiddling with the hyperparameters to see which values would be appropriate. Then, since manual fiddling doesn't seem to cover the full search space, maybe they try grid search. But there are problems with grid search: it doesn't focus on areas of higher benefit, and there's redundancy when similar values of a particular hyperparameter are tried across multiple attempts. After this presentation, we hope you'll have the confidence to start using Optuna for your hyperparameter tuning.

Let me give you a glance at what the code looks like. First I'd like to compare Optuna with existing frameworks. Existing frameworks for hyperparameter tuning typically have their own syntax, which is different from Python and individual to each platform or framework, and you need to learn it. In Optuna, I'm happy to say, the parameters are defined within the actual program itself, so the code takes up much less space and is much more intuitive, because you're defining the search space during optimization using the Python language. I think of this as similar to the difference between the deep learning frameworks that existed before Chainer and Chainer itself, which introduced eager mode; eager mode was adopted by PyTorch, and TensorFlow now uses it too. It's the same sort of revolution: previously the configuration was predefined, outside the program, whereas Optuna brings it into the program itself. It's there, available to you, debugging becomes more straightforward, and you can use natural Python constructs, such as loops, within the definitions.

A fast look at what this means in an actual program: first we import torch and import optuna; then we make an objective function, basically wrapping a function inside an objective function; we change the parameters so that they're sampled; we return the accuracy from the objective function; and then we add two lines to have Optuna optimize it. That's the fast version, sketched below; toward the end of this presentation we'll go into more detail on how to apply it to PyTorch.
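A minimal sketch of that template, assuming the same hypothetical train_and_evaluate() helper as above (older Optuna versions spell these calls suggest_uniform and suggest_loguniform):

```python
import optuna

def objective(trial):
    # The search space is defined inline, in plain Python, as each trial runs.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.2, 0.5)
    return train_and_evaluate(lr=lr, dropout=dropout)  # accuracy for this trial

# The two added lines that hand the search over to Optuna.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
```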
Next, let's talk about the gears of the machine: how Optuna does what it does. Basically, two parts of a hyperparameter optimizer work together to quickly find the best hyperparameter values within a certain time. The first is a sampling strategy, which decides where to look. The second is a pruning strategy: if a particular trial is not looking promising, it can be terminated early to leave more time for better trials.

As I said, the samplers decide where to look. As opposed to a grid search, which is very methodical, or a random search, Optuna (with most of its samplers) uses Bayesian methods to focus in on the places where it has had the best results and continues to look there. You can see this in the graph: the left side shows a random search, and the right side shows the areas Optuna chose to search. As it tries to minimize the function, it focuses on the lowest point and runs more trials there to find a better result.

There are a number of different samplers that can be used with Optuna. The model-based samplers available include TPE (Tree-structured Parzen Estimator), a Bayesian method based on kernel fitting; Gaussian processes, another form of Bayesian optimization; and CMA-ES (Covariance Matrix Adaptation Evolution Strategy), a meta-heuristic algorithm for continuous spaces. There are other choices as well: if you need a purely random search to explore the whole space equally, that can be done, and grid search is also available. And if you want a different style of sampling, Optuna has facilities for user-defined algorithms too.

So now we have several samplers, and you have to choose one. How do the samplers compare? In a way, choosing the sampler is a hyper-hyperparameter: something Optuna needs to be told in order to decide what to do. So I've made an algorithm cheat sheet. From our experience working with Optuna, if you have more than a thousand trials, use CMA-ES; it handles extreme volumes of trials the best of the three. Otherwise, if the parameters are correlated, try the Gaussian process sampler, and if not, try TPE. The default for Optuna is TPE, so if you don't know which of these is the best choice, TPE is generally a solid option.

Next, pruners and stopping trials early. The pruning strategy is based on the observation that some trials get off to a slow start and will never make up for it, so Optuna offers a number of pruning algorithms that terminate those unpromising trials early, dedicating the compute time to more promising ones. As an example, we worked with the Street View House Numbers (SVHN) dataset for a bit, and using successive halving as the pruning algorithm, we found optimization with pruning to be twice as fast as without it.

Pruning does require a bit more work in the code itself and is a little less black-box. For more automated frameworks we have integrations available, for example PyTorch Lightning, PyTorch Ignite, or fastai, where some of the work inside the training loop is automated for you and may live in a subroutine you don't access directly. To save you the trouble of digging into that training loop, we've made integrations available for Optuna so that it's easy to implement pruning along with any of these frameworks; see the sketch below.
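A sketch of how you might select the samplers and pruners named above; the choices follow the cheat sheet rather than a hard rule, and the integration callback shown assumes a logged "val_acc" metric:

```python
import optuna

# CMA-ES for very large trial budgets, with successive halving pruning
# unpromising trials early (CmaEsSampler requires the `cmaes` package).
study = optuna.create_study(
    sampler=optuna.samplers.CmaEsSampler(),
    pruner=optuna.pruners.SuccessiveHalvingPruner(),
)

# The default sampler is TPESampler(); RandomSampler() and
# GridSampler(search_space) are also built in.

# With PyTorch Lightning, reporting and pruning are handled by a callback:
# callback = optuna.integration.PyTorchLightningPruningCallback(trial, monitor="val_acc")
```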
There are also bonus benefits to using an automated framework like Optuna to find the best hyperparameters. One is that it enables you to scale up fairly easily. Here's the example: we create a study, giving it the name "example-study" and some storage, and if the database already exists, we use the existing database instead of creating a new one. We executed basically this same script on six different machines, and with access to that common database, Optuna was able to use all six machines to simultaneously search the space for hyperparameters. This is very scalable: the machines don't need to be the same size or spec, and Optuna is basically able to use whatever compute you have available, which is very powerful for quickly finding the best hyperparameters you can. This is asynchronous parallelization of trials, so another trial can be kicked off even while the first is still running, and it gives near-linear scaling.

Another benefit of Optuna is access to visualization. One visualization you could use, for example, is a contour plot, so you can see what the search space looks like as Optuna works through the various values, such as the number of layers or the number of samples.

One of the newest things we've implemented in Optuna is a way to see what matters most. As I mentioned on one of the first slides, there's an enormous number of values that could be considered hyperparameters, but honestly, if you set up an Optuna and PyTorch program and made Optuna try to optimize all of them, it wouldn't work: the curse of dimensionality would slow the optimization down to a major degree. What you want to know is which hyperparameters matter most. After running a small starter test, Optuna can give you the hyperparameter importances. Here's a graphical display of them for a standard MNIST PyTorch run. As we would expect, the most important hyperparameter for a fully connected MNIST model is the learning rate, followed by the number of units in the first layer, then the optimizer and the dropout rate, and further down the batch size, the number of units and dropout of another layer, and the activation. This gives you a strong hint, quickly telling you where you should have Optuna focus its attention to get the best possible results within the time you have. It's a big step forward, and the graphical view can really shorten the time you need to decide where to focus your compute. The scale-up and visualization calls are sketched below.
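A sketch of the shared-storage pattern and the visualization calls, assuming the objective function from the earlier sketches and a reasonably recent Optuna (plot_param_importances arrived in version 1.5); the parameter names passed to plot_contour are from the MNIST example later in the talk:

```python
import optuna

# Run this same script on each machine; the trials coordinate through the
# shared database (for multiple machines, use a MySQL/PostgreSQL URL rather
# than SQLite).
study = optuna.create_study(
    study_name="example-study",
    storage="sqlite:///example.db",
    load_if_exists=True,  # reuse the study if it already exists
)
study.optimize(objective, n_trials=100)

# Visualizations (both return Plotly figures).
optuna.visualization.plot_contour(study, params=["n_layers", "lr"]).show()
optuna.visualization.plot_param_importances(study).show()
```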
Now let's take a look at how we could apply this to PyTorch, starting at the beginning with installation, which is fairly straightforward. On Linux you run pip install optuna. On macOS it's similar: also pip install optuna. And on a Windows machine, you know it's got to be a little different, so the system prompt changes; yes, that's very different. So that's how you install Optuna.

For the actual code, here's a high-level view of the entire change you need to make to run Optuna. Optuna needs to be imported, and then you need an objective function: a function that takes in the trial object (which is passed to it by Optuna when it's called), runs your code, and then returns some kind of evaluation score. The default is that this score is minimized, but it can also be maximized; either way, it's something to tell Optuna whether this was a good or a bad trial. Then there are the last two lines: we create an Optuna study, and we optimize it using the objective function, setting the number of trials to run. That's the general template for using Optuna.

Let's apply it more specifically to PyTorch. On the imports, basically all that's been added is import optuna, but I did want to bring your attention to the very bottom, where in italics I've included the link to the entire code. I'll be skipping over the sections of the PyTorch code that don't really change for Optuna, so I can focus on the areas that are different instead of the entire program. This is just a simple MNIST fully connected demonstration, and if you want to see the full code, please take a look at the link below.

The first thing to do is to define the model, as you would in PyTorch once you have the data and everything else ready. One thing I want to draw your attention to is that the model function is also being passed the trial object. The trial object is passed in by Optuna and allows the features of Optuna to be accessed within the function. Right at the first line we have n_layers = trial.suggest_int("n_layers", 1, 3), which is basically telling Optuna that the n_layers label refers to an integer between 1 and 3, so this MNIST model will have between one and three layers. A little further down we have for i in range(n_layers), demonstrating how you can use Python's looping structure to define your Optuna labels and variables, the hyperparameters that will be tuned. The next line again uses trial.suggest_int, this time for the number of units, which can range from 4 to 128; this is done for each of the up-to-three layers, or just the first layer if there's only one. Then each layer gets its own dropout rate, labeled according to the number of the layer, ranging from 0.2 to 0.5. The rest is a standard PyTorch model definition.

The next function is the objective function itself, which calls the model we just defined. Right there in the first line we generate the model with model = define_model(trial); this is where the trial object is passed up into the model so that it can be used there to define the number of layers. Since we have the trial object here, the next line uses it to select an optimizer from three possibilities (Adam, RMSprop, or SGD; take your pick), and then we have the learning rate, where we suggest a float. You might have noticed that the optimizer above is categorical, just a list of choices; Optuna can handle categorical, float, log-uniform, or other kinds of spaces for the hyperparameters. We then use that learning rate and optimizer name to actually construct the optimizer for the model. The rest of the code in the objective function is the training loop and is fairly standard, until we get to the end, where you report the accuracy and the epoch back to Optuna. A condensed sketch of the whole pattern follows.
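A condensed sketch of that pattern, modeled on Optuna's pytorch_simple.py example; the data loading and the inner train/evaluate steps are elided behind a hypothetical run_epoch() helper, so see the linked full code for those details:

```python
import torch.nn as nn
import torch.optim as optim
import optuna


def define_model(trial):
    n_layers = trial.suggest_int("n_layers", 1, 3)  # one to three hidden layers
    layers, in_features = [], 28 * 28               # flattened MNIST input
    for i in range(n_layers):                       # a plain Python loop defines the space
        out_features = trial.suggest_int("n_units_l{}".format(i), 4, 128)
        layers += [nn.Linear(in_features, out_features), nn.ReLU()]
        p = trial.suggest_float("dropout_l{}".format(i), 0.2, 0.5)
        layers.append(nn.Dropout(p))
        in_features = out_features
    layers.append(nn.Linear(in_features, 10))       # ten MNIST classes
    return nn.Sequential(*layers)


def objective(trial):
    model = define_model(trial)                     # the trial object flows into the model
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "RMSprop", "SGD"])
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    optimizer = getattr(optim, optimizer_name)(model.parameters(), lr=lr)

    accuracy = 0.0
    for epoch in range(10):
        accuracy = run_epoch(model, optimizer)      # hypothetical train-and-evaluate helper
        trial.report(accuracy, epoch)               # progress report back to Optuna
        if trial.should_prune():                    # is this trial worth finishing?
            raise optuna.TrialPruned()              # exit gracefully, free the resources
    return accuracy


if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")   # accuracy: higher is better
    study.optimize(objective, n_trials=100, timeout=600)
    print("Best value:", study.best_trial.value)
    print("Best params:", study.best_trial.params)
```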
Those next four lines, trial.report with the accuracy and epoch, and the trial.should_prune check, are for the pruning function. While Optuna is largely a black-box optimizer, in order to do pruning it needs information before the function is done, so it can decide whether the trial should be terminated early or not. The trial.report call tells Optuna how far into the trial we are (the epoch) and how well it's doing (the accuracy), and Optuna feeds back whether this is a sufficiently good trial to continue. If the answer is that it should be pruned, we raise an exception so that the function exits gracefully and hands its resources over to another, hopefully more promising, trial.

Then, for actually running the program, we have if __name__ == "__main__". Here are the two lines you might remember from above: optuna.create_study, with the direction chosen to maximize this time (as opposed to the default, which is to minimize), and then study.optimize with the objective function we defined above, the number of trials set to 100, and a timeout. The lines below that are basically there to give a nice printout, so at the end of the program we can see the result Optuna found: the number of finished trials, the numbers pruned and completed, and the best trial, meaning the value it delivered and the parameters that delivered that value.

Going ahead and running it with python pytorch_simple.py, it downloads the MNIST dataset, and then we see trial 0 start and finish with a certain value, trial 1 finish with a slightly better value, the hyperparameters for each of those, and so on through the following trials. Then, around trial 5, Optuna realizes from the history it has that trial 5 isn't as good as some of the trials it's already done, so instead of running trial 5 to completion, it prunes it midway through, along with trials 6 and 7, until it finds some better values. We ran the default of 100 trials, and skipping ahead to the final answer, you can see the summary statistics: the number of finished trials was 100, of which 67 were pruned (almost exactly two-thirds) and 33 completed. The best value was about 0.95 accuracy, and Optuna actually found that the other layers were not required; it got just as good a result with one layer of 115 units and a dropout rate of 0.4, and the best optimizer was Adam with a learning rate of 0.02.

So that completes my presentation, an educational video on how to use Optuna and PyTorch. For more information you can go to optuna.org, or come see us directly on GitHub at github.com/optuna/optuna. I wanted to say thank you very much for your attention, and please let us know if you have any questions; we welcome issues, pull requests, and anything else on GitHub. Thanks for listening.
Info
Channel: PyTorch
Views: 12,330
Rating: 5 out of 5
Keywords: PyTorch, AI, Artificial Intelligence, Machine learning, Optuna, Crissman Loomis, Preferred Networks, Hyperparameters, ML
Id: P6NwZVl8ttc
Length: 24min 4sec (1444 seconds)
Published: Tue Oct 06 2020