Hyperparameter Optimization: This Tutorial Is All You Need

Video Statistics and Information

Captions
Hello everyone and welcome to this new video, in which I'm going to talk about hyperparameter optimization and show you some Python libraries you can use for it. This is important: you've invested a lot of time in feature engineering and have come up with a model that fits your problem, and now you want to find the best hyperparameters without spending a lot of time on manual tuning. Hyperparameter optimization can be done in many different ways. I've dedicated a chapter to it in my book, so if you've bought the book you can also go through the hyperparameter optimization chapter there; I'll be borrowing some material from it in this video, and I'll also show you some new things that are not discussed in the book. The setting is: you've built certain features for your dataset, you've found a model, and now you want to find, say, the best learning rate — that would be a hyperparameter. You can also use this for neural networks, where you have to find the appropriate number of layers or the parameters inside the layers. One of the easiest approaches is to search over a grid of values, known as grid search, so let's start with that. Before we begin we also need a dataset. We'll use one called Mobile Price Classification: you're given different features of mobile phones and you have to predict a price range, which is 0, 1, 2, or 3 — low cost, medium cost, high cost, very high cost. Those are the categories you have to predict, so it's a classification problem.
If I look at the distribution of labels, it's the same for all classes — the number of samples per class is equal. So we're going to take this dataset, use different hyperparameter optimization libraries on it, and see what happens. It's an easy dataset, so we should be able to find good hyperparameters without much trouble. I've downloaded the file, mobile_train.csv, which contains all the features; we won't be doing any feature engineering in this tutorial, only hyperparameter optimization. I put the file in an input folder and created another folder called src (source), and inside it a file called grid_search.py. One thing you have to remember: do everything inside a cross-validation loop. That's it — as long as you do that, you are not overfitting the model. Let's import a few libraries: pandas to read the CSV, numpy (we may need it, so let's just import it), and some kind of model. In this tutorial we'll use a random forest, just for the sake of it — you can use any model you want, and maybe at the end I'll show you how to tune some parameters of a neural network. So: from sklearn import ensemble; we also need metrics, so from sklearn import metrics; and we need to create cross-validation folds, so from sklearn import model_selection. Now we can start by reading the dataset, mobile_train.csv. We don't have to do much preprocessing on it.
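The "everything inside a cross-validation loop" rule can be sketched as follows. Note this is a minimal sketch: a synthetic dataset from make_classification stands in for mobile_train.csv so the snippet is self-contained.

```python
# Sketch of the cross-validation rule: every score we report comes from
# held-out folds, never from data the model was fitted on.
# A synthetic dataset stands in for mobile_train.csv.
from sklearn import ensemble, model_selection
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=4, random_state=42
)

model = ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1)
kf = model_selection.StratifiedKFold(n_splits=5)

# one accuracy per held-out fold
scores = model_selection.cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print(scores.mean())
```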
What we have to do is create training features and targets. X is the training features: the dataset has no ID column, just the target column, with everything else being features, so I can do df.drop("price_range", axis=1).values to get a numpy array of all the features; the targets are df.price_range.values. Now you can run a simple grid search over some parameters of the model you choose. Let's pick the model: the classifier is ensemble.RandomForestClassifier, and I'll set n_jobs=-1 so it uses all the cores of my machine. Then you specify a grid of parameters, param_grid, which is a dictionary of the different parameters. One parameter of a random forest is the number of estimators, so we'll set some values there — 100, 200, 300, 400 — and it's going to try all of them. Another parameter is max_depth, so set some values there too: 1, 3, 5, 7. And let's tune one more parameter, criterion, which takes categorical values: "gini" or "entropy". So we have the parameter grid, and now we want to run a grid search over it. scikit-learn provides grid search with cross-validation; you can also supply your own cross-validation to the function, but we won't do that — we'll just use the default. First import GridSearchCV, which does grid search with cross-validation. It has an estimator parameter, which here is our classifier, and a param_grid parameter, which takes our grid of parameters.
Then you have a scoring strategy: since all the classes have the same number of samples, we can use accuracy. Then set verbose, say to 10, and n_jobs — we already set n_jobs=-1 for the random forest, so here we use only one core — and cv=5, which runs five-fold CV. One thing you should know: even if you don't specify cv=5, it's still five-fold CV by default, and if your targets are categorical — or it's binary classification — it automatically uses StratifiedKFold, so you don't have to worry about stratification yourself. This is one way of doing hyperparameter optimization; now you just call model.fit(X, y), and afterwards it can return the best parameters. Let's train it and see if it works: go to the src folder and run python grid_search.py. It seems to be working, and it prints a score — that's your accuracy. You can also replace the scoring with a custom function using make_scorer from scikit-learn, so you can change the scoring to whatever you want. It seems to be going pretty well — we've already seen 0.865. The next thing you want is to print the best parameters: model.best_score_ gives you the best score, and model.best_estimator_.get_params() gives you the best parameters of the estimator.
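Putting the grid search pieces together, a minimal sketch might look like this — a synthetic dataset stands in for mobile_train.csv, and the grid is smaller than the video's so it runs quickly:

```python
# Minimal GridSearchCV sketch; synthetic data replaces mobile_train.csv,
# and the grid is deliberately small so the search finishes quickly.
from sklearn import ensemble, model_selection
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=4, random_state=42
)

classifier = ensemble.RandomForestClassifier(n_jobs=-1)
param_grid = {
    "n_estimators": [100, 200],          # video uses 100..400
    "max_depth": [3, 7],                 # video uses 1, 3, 5, 7
    "criterion": ["gini", "entropy"],
}
model = model_selection.GridSearchCV(
    estimator=classifier,
    param_grid=param_grid,
    scoring="accuracy",   # classes are balanced, so accuracy is reasonable
    cv=5,                 # stratified 5-fold for classification targets
)
model.fit(X, y)
print(model.best_score_)
print(model.best_params_)
```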
This is going to take a while, though, because what it's doing is essentially a nested loop: for every combination of values it runs five-fold CV. So let me reduce things: maybe just two values for n_estimators and two for max_depth, and remove criterion for now, and it should be faster. Oh — it's done; it took 1.5 minutes. Let's run it again and see how long it takes now. You can see it went quite fast this time, but the accuracy is also lower. This is the best score it printed, and these are the best parameters of the model it found. It prints everything, but you only need to care about the three parameters we tuned: criterion, max_depth, and n_estimators. So that's one way of doing hyperparameter tuning. Another way is random search: in random search we select combinations of parameters randomly and calculate the cross-validation score for each. Everything we've discussed stays the same, except that instead of GridSearchCV we now use RandomizedSearchCV, and instead of param_grid you specify a distribution of parameters. So instead of the list we can write np.arange(100, 1500, 100) — everything between 100 and 1500 with a step of 100 (and we forgot a comma) — and np.arange(1, 20) for max_depth. You also need to specify the number of iterations: if I specify n_iter=10, it does random search 10 times, and that's it — everything else stays the same. So let's try this, with n_iter=10.
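The randomized search just described can be sketched like so — again with synthetic stand-in data, and with the ranges shrunk from the video's values to keep it fast:

```python
# RandomizedSearchCV sketch: distributions instead of a fixed grid, and
# only n_iter randomly drawn combinations are evaluated. Synthetic data
# and shrunken ranges keep the snippet quick.
import numpy as np
from sklearn import ensemble, model_selection
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=4, random_state=42
)

classifier = ensemble.RandomForestClassifier(n_jobs=-1)
param_distributions = {
    "n_estimators": np.arange(100, 400, 100),  # video: np.arange(100, 1500, 100)
    "max_depth": np.arange(1, 20),
    "criterion": ["gini", "entropy"],
}
model = model_selection.RandomizedSearchCV(
    estimator=classifier,
    param_distributions=param_distributions,
    n_iter=5,             # video uses n_iter=10
    scoring="accuracy",
    cv=5,
    random_state=42,
)
model.fit(X, y)
print(model.best_score_)
print(model.best_params_)
```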
It says: "Fitting 5 folds for each of 10 candidates, totalling 50 fits". So this is also somewhat expensive, but not as expensive as grid search. Grid search would take much longer because it has to evaluate everything in the grid, and the grid is now huge — many more values than our previous one — while random search only evaluates some randomly chosen combinations. Let's see how long this takes. It took 1.3 minutes and gave a score of 0.885: max_depth was 15, criterion was entropy, and the number of estimators was 600 — that's what we got from our distribution of parameters. So that was random search, and the previous one was grid search; all you need to do is specify a grid (or distribution) of parameters. There are many things you can do from here. One is custom scoring: you can use a scoring method that isn't available in scikit-learn — your own method. You can also use pipelines. I don't really need to scale the data for a random forest, but let's say I scale it using sklearn's preprocessing module, and also apply some kind of decomposition. So let's do that: we import decomposition, and we scale before the decomposition — so there we do need to scale the data — and we import one more thing, pipeline. Now everything is imported and we can create the pipeline. Let's define a variable scl, which is preprocessing.StandardScaler(), and pca, which applies PCA decomposition to the data — decomposition.PCA(), without specifying the number of components — and then the classifier itself.
Now, when you have these three things, your classifier — let's call the random forest rf — is no longer the random forest itself but a pipeline, a sequence of these three steps. So I do pipeline.Pipeline() and pass it a list of tuples: you first name each step — "scaling", which uses scl; then "pca", which uses pca; and third the model itself, "rf", which uses rf. That becomes your pipeline, and now you have to change the param grid a little. Say we also want to tune the number of components in PCA: you add the step name in front of the parameter — the name you chose, then two underscores — so "pca__n_components", and the components can be anywhere in some range, say from 5 to 10. The other parameters get the same treatment with "rf__" — that "rf" is the key in the list of tuples. You could also write your own custom scoring here, but we're not going to. Everything else stays the same, so let's run it and see how long it takes (yes, I'm still calling the file grid_search, but it's not grid search anymore, it's random search). Now it has finished and given us a very bad score — obviously, because this pipeline is not a good fit. You can see the different steps, and inside each one a bunch of parameters: scaling has parameters of its own, then PCA, then rf at the end. So this is what you do when using pipelines with multiple steps; the score isn't good here because I used a transformation I shouldn't have, but this is how it's done.
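A sketch of the pipeline version, using the step names "scaling", "pca", and "rf" chosen above and the double-underscore naming that ties each search parameter to its step (synthetic data again stands in for the mobile dataset):

```python
# Pipeline sketch: scaler -> PCA -> random forest, with search parameters
# addressed as "<step_name>__<parameter>".
import numpy as np
from sklearn import decomposition, ensemble, model_selection, pipeline, preprocessing
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=4, random_state=42
)

scl = preprocessing.StandardScaler()
pca = decomposition.PCA()
rf = ensemble.RandomForestClassifier(n_jobs=-1)

clf = pipeline.Pipeline([("scaling", scl), ("pca", pca), ("rf", rf)])

param_distributions = {
    "pca__n_components": np.arange(5, 10),        # step name + two underscores
    "rf__n_estimators": np.arange(100, 400, 100),
    "rf__max_depth": np.arange(1, 10),
    "rf__criterion": ["gini", "entropy"],
}
model = model_selection.RandomizedSearchCV(
    estimator=clf,
    param_distributions=param_distributions,
    n_iter=5,
    scoring="accuracy",
    cv=5,
    random_state=42,
)
model.fit(X, y)
print(model.best_params_)
```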
Now we can move on to a much better way of doing hyperparameter optimization: Bayesian optimization. One library, scikit-optimize (skopt), provides Bayesian optimization with Gaussian processes, and that's what we'll use. (I considered creating a new file for this, but let's do it in the same file.) What you need to do is create a function called optimize. Take a look at the function gp_minimize — that's what we'll be using. It takes a callable, the function to minimize: that's our optimize function, and it should take a single list of parameter values and return the objective value. Instead of taking only the list of values (params), we can make it take a few more arguments: param_names (the names of the parameters), x, and y, where x is our features and y is our targets. Inside, I convert params to a dictionary — dict(zip(param_names, params)) — where each key is a parameter name and each value is the parameter value itself. Now we can have a model built from this dictionary. One caveat: this is probably something you cannot use as-is if you're tuning parameters of multiple things at once, but in our case we're only tuning one single model, which takes the dictionary directly, so you don't need to specify the parameters one by one — that's already taken care of.
Then you have to do all the k-fold work yourself: model_selection.StratifiedKFold with n_splits=5. Create a list for the metrics — here we'll collect accuracies — and loop: for train_idx, test_idx in kf.split(X=x, y=y). (You can also call the second set the validation indices — the "test" set here is really validation data.) So you have x_train = x[train_idx] and y_train = y[train_idx], and similarly x_test and y_test from test_idx. We've split the data; now fit the model — model.fit(x_train, y_train) — and create predictions. We're very flexible here: we could also use predicted probabilities instead of hard predictions, but here we just need predictions, since we assume a 0.5 threshold — so model.predict(x_test). Then calculate the accuracy of this fold — metrics.accuracy_score(y_test, preds) — and append it to the accuracies list. At the end, you don't return the raw value: remember the function is called gp_minimize, so you have to return something to minimize. So: return -1 * np.mean(accuracies), the negative of the mean accuracy. (If you were minimizing log loss instead, you would not multiply by -1, because lower is already better there.)
What I've done here is a little interesting: I converted the parameters to a dictionary and fed it to the model directly. But if you have a number of different steps, you have to filter the parameter names and values accordingly, otherwise this won't work all the time — you have to handle that yourself. Now we need to define the space of parameters to search, so let me get rid of the old grid — we don't need it. Our optimization function should accept only one argument, so we can build it with partial: from functools import partial, then optimization_function = partial(optimize, param_names=param_names, x=X, y=y) — we'll define param_names in a moment. What are param_names? They're the names of the different parameters we want to optimize, and they come in a sequence. First we define the parameter space: when you're using skopt (scikit-optimize), you define a space that can hold different kinds of variables — real numbers, integers, or categorical values. If you look at the skopt documentation, you have space.Categorical, space.Integer, and space.Real — and that's all we actually need. So: from skopt import space, and then define the space. Order is important — the order in which you define these variables — and that will become clear after a few more lines of code. First, space.Integer(3, 15, name="max_depth"); then space.Integer(100, 600, name="n_estimators").
Then we can define a real number, space.Real, ranging from 0.01 to 1, for the fraction of features — the parameter is called max_features — and you need some kind of prior, say sampling from a uniform distribution. We can also add space.Categorical with the choices "gini" and "entropy" and the name "criterion". So you've defined this list of parameter spaces, and in the same sequence you define param_names — that's how we're using it here (you don't strictly have to do it this way if you don't want to): max_depth, n_estimators, criterion, max_features. If you have a StandardScaler or some other preprocessing, or another model, you can define its parameters here sequentially too; you just have to remember to take them back out of the params argument in the optimize function in the same sequence. And param_names here is param_names. So we have the optimization function, and now we just run gp_minimize: result = gp_minimize(...), importing it first with from skopt import gp_minimize. The first argument is the optimization function; the second is dimensions, which is your parameter space; then you specify the number of calls, say n_calls=15; the number of random starts, say n_random_starts=10; and verbosity, which I'll set to maximum. Once you have that, you can print the best parameters as a dictionary with param_names as keys and result.x as values. Let's run this and see what happens. In the verbose output you can see it evaluating different parameter combinations, and we're already at a current minimum of 0.89, which is much better than what we got earlier using grid search or random search. So let's let it run for some time and see where it ends up.
It has finished, but we got an error at the end, so let's take a look at it — OK, I think it just needs a closing bracket. Let's run it again to see the last bit; it shouldn't take long. Now we see that max_features is 0.69 — no, not 69 percent of the features, just the value of this argument — criterion is gini, the number of estimators is 339, the maximum depth is 8, and it reaches an accuracy of 89.75%. So that's one more way of doing hyperparameter tuning. Just remember to create those two lists — the space and the names — and build the dictionary from them. Things will change a little if you're doing something more involved: not complicated exactly, but, say, a pipeline, or trying to find the parameters of another model and combine it with this one — then you have to take care of those details yourself. But this is how it works in general.
Another important library for hyperparameter optimization is hyperopt, so let's take a look at it. Hyperopt uses tree-structured Parzen estimators (TPE), if I'm not wrong, to find the most optimal hyperparameters. First we import a few things: from hyperopt import hp, fmin (the function for minimization), tpe, and Trials. One thing to remember is that with hyperopt, the params argument is a dictionary itself, so we don't need param_names at all, and everything else inside the optimize function stays the same — moving from skopt to hyperopt is as easy as that. Now you define the parameter space, and here you actually use a dictionary, keyed by the parameter names — criterion, max_features, and so on — so let's convert our space to a simple dictionary first and then use hyperopt's parameter expressions. Here things change a little: instead of Integer you have hp.quniform. You can read about the many different types available in the hyperopt documentation; I won't go into the details. For integers you can use quniform: you specify the label ("max_depth") and then the range, 3 to 15, with a q of 1. Looking at the documentation for quniform: you give the low value, the high value, and the value of q — that's what I was checking; I'd forgotten. Similarly, n_estimators is also an integer, so use hp.quniform again. For criterion, use hp.choice — the choices stay the same, and the label comes first. And for max_features you can use hp.uniform instead of quniform, with the same range as before. Everything else stays the same. So now we have our optimization function, which is unchanged except that there are no param_names anymore, so we remove that. For the result, we first need to instantiate a Trials object: trials = Trials().
You can pass certain parameters there, but we really don't need them now. The result is fmin(): you provide the function, which is your optimization function; the space, which is your parameter space; trials=trials; and instead of n_calls you have max_evals, which is 15. There's no verbose parameter here. What else do you need? The algorithm — you specify which algorithm to use, and we'll use tpe.suggest. Then just print the result and let's hope it works. "hp.choice takes two positional arguments but three were given" — OK, that's a problem (and it was the problem earlier too): you have to pass the choices in a list. Let's try again. "n_estimators must be an integer, got float" — another of the usual problems: hyperopt is returning float values, and you have to convert them to int. To fix that, import scope from hyperopt and wrap the expression in scope.int(); do the same for n_estimators and you're done. "Cannot import name scope from hyperopt" — right, it's not in the top-level hyperopt package; import it from hyperopt's pyll module instead. Now we have 15 trials, and it has already reached 89.2% accuracy, so let's let it run through all 15 and see what happens. It has finished, and it found some optimal parameters: criterion is 1 — remember that's an index into the choices, so 0 is gini and 1 is entropy — and then you have max_depth, max_features, and the number of estimators. So we've found the best parameters, and our accuracy is more than 90%, which is awesome.
Now, there is one more very good library, and it's something I have not discussed in my book: Optuna. We'll use Optuna for the same thing we've been doing and see how it performs. Optuna is also a very good hyperparameter optimization library; I was recently talking to one of the core developers of Optuna, and he suggested I take a look at it — I did, and I found it very good. So let's see how we can use Optuna for the same function we've been optimizing. First, import optuna (you can install it with pip install optuna). With Optuna you don't define the parameter space outside the objective — you define it inside — so we don't need the external space anymore. Instead, I create a study: optuna.create_study(direction="minimize"), so it will minimize the function. You can also use direction="maximize", in which case you don't need the negative sign, but we've been multiplying by -1 for everything, so let's keep it that way. Then: study.optimize(optimization_function, n_trials=15). Now go inside your objective function: it should take an argument called trial, plus x and y — you don't have params anymore, and x and y we've already defined. We still need to create the partial function: optimization_function = partial(optimize, x=X, y=y), and that's what goes into study.optimize. And now you define what you're selecting inside the function — the grid we've been defining until now has to be defined inside.
So you can have something like criterion = trial.suggest_categorical("criterion", ["gini", "entropy"]) — the name, and inside that the different categories. That's one kind of parameter, but you also have to do it for integers and real numbers. So n_estimators becomes trial.suggest_int — with any name you want — and you give it a min and max value, say 100 to 1500. What was the third thing we had? We had criterion, we had the number of estimators, and we had max_depth, which is also an int, so use suggest_int with the same name, between 3 and 15. And there was one more thing, max_features, which we can take from suggest_uniform — a uniform distribution — with the name first and then the range 0.01 to 1.0. So you've defined every variable inside the function, which is very nice and compact; you can do a lot here and it won't be complicated (hyperopt isn't very complicated either, by the way — use whatever you want, it just depends on your preference). Then pass them to the model: n_estimators is n_estimators, max_depth is max_depth, max_features is max_features, and criterion is criterion — I should keep the names consistent. I'm missing a comma, and that's all you need, so let's try to run it and see what happens. "X is not defined" — sorry, x should be capital X. "optimize got an unexpected keyword argument x" — that one should be lowercase. Let's try again. Yes, it's running: finished trial 0 with value -0.85, showing that criterion was gini, n_estimators was such-and-such, and so on; then trial 1 finished with a value of -0.837. The values are negative because we're minimizing the negative of the accuracy — you can use direction="maximize" directly if you want. But this is so cool.
I think I like Optuna more than anything else, and this is actually the first time I'm using it. I think you can do a lot with it, so let's take a look at Optuna's documentation in the meantime and see what else you can do. They have a website — optuna.org — where they provide some examples, which I think are quite nice: you have an objective function, you suggest values from a uniform distribution (let me increase the font size for you), you return something, then you say whether you want to maximize or minimize, and you set the number of trials. Awesome. Let's see the PyTorch example: define an objective function to be maximized, then suggest the hyperparameter values using the trial object — for instance the number of layers, which is quite nice. So this is how you can also use it for PyTorch models. In the end it's just about an objective function that returns some objective value you want to maximize or minimize, and it's the same for all kinds of hyperparameter tuning libraries, so you don't have to change much between them. Let's see what's happening with our run: it has reached 0.899 — I think it's a little slow, or maybe it's taking about the same time. Anyway, it reaches a value of 0.908 — 90.8% accuracy — which is quite good. So in this video I've discussed grid search, randomized search, scikit-optimize (skopt), hyperopt, and Optuna; now it's up to you which method you choose. I like to try tuning hyperparameters manually first, then choose a range of values and throw an optimization algorithm at it — that's something you can do, but it's totally up to you.
I know this has been a bit of a longer video than usual, but I hope you liked it — and if you did, click the like button, subscribe, and share it with your friends if you feel they should learn this too; I hope they'll like it as well. That's it for today's video. If you have any suggestions or comments, do let me know. I won't be sharing the code on GitHub for this — most of it is also available in my book, so grab the book if you want — and it's more about learning by coding: watch the video and code along yourself, and you'll learn a lot more than by just copy-pasting code from somewhere. Thank you very much, see you next time, and goodbye.
Info
Channel: Abhishek Thakur
Views: 41,826
Rating: 4.9454093 out of 5
Keywords: machine learning, deep learning, artificial intelligence, kaggle, abhishek thakur, hyperparameter optimization, grid search, random search, how to do hyperparameter optimization using sklearn, how to do hyperparameter optimization using optuna, how to do hyperparameter optimization using hyperopt, hyperopt, optuna, skopt, how to do hyperparameter tuning using scikit optimize, bayesian optimization, gp minimize, minimization function, model tuning, model hyperparameter tuning
Id: 5nYqK-HaoKY
Length: 59min 33sec (3573 seconds)
Published: Sun Jul 19 2020