Custom optimizer in PyTorch

Captions
Hey guys and welcome to the video! Today I will talk about how we can implement a custom optimizer in PyTorch. I'm on the official GitHub repository of PyTorch and I'm going to show you where exactly you can find the relevant source code for implementing custom optimizers. The module we're interested in is under the subpackage optim and is called optimizer.py. Here we have a class called Optimizer that is supposed to be subclassed each time we want to implement a custom optimizer.

The constructor has two arguments: one is params and the second one is defaults. The params argument is an iterable, and there are two possibilities when it comes to its elements. One can either pass an iterable of tensors or an iterable of dictionaries. If you go for an iterable of tensors, you can only have one set of hyperparameters for all of the parameters. However, if you go for an iterable of dictionaries, you can specify different hyperparameters for different parameter groups.

Let us now focus on what is inside of the constructor. As you can see, we define two different attributes: the first of them is state, which is a default dictionary, and the second one is param_groups, which starts out as an empty list. The state attribute is really useful because it allows us to store any information and then use it in future steps. One specific example of when the state might be useful is when we implement momentum optimizers that need to store information like the parameters from the previous step. The second attribute, param_groups, serves as a convenient way to access all your parameter groups with the same formatting of the parameter dictionaries. The rest of the constructor focuses on populating this param_groups list.

Here we have the zero_grad method of the optimizer, and it is supposed to be called just before calling backward on our loss. The idea behind it is to set all the gradients of the leaf parameters to zero. Finally, we have the step method, which is supposed to be implemented in the child class and called just after calling the backward method on the loss. In the most general terms, the goal of this method is to update all the parameters. To do this update one can use a lot of different information; specifically, you probably want to use things like the inner state that we just saw or the gradients of your parameters. Anyway, I think those are the most important things that you need to know, so let me show you an example (a bare-bones skeleton illustrating this interface is sketched a little further down).

Before I start with the example, though, let me quickly introduce the Rosenbrock function. It is a real function of two variables and it has a unique global minimum (at the point (1, 1) for the parameter values we will use). The reason why we want to use it is that converging to the global minimum is not that easy. I will write an example where I define a custom optimizer, and I will also try to see how this optimizer performs on the Rosenbrock function; to do that I will also need to write some visualization code. All right, let's get started.

I imported matplotlib to create nice plots and animations. Then I have numpy and torch to work with tensors and arrays. I imported two already implemented optimizers: one is Adam and the other one is SGD. I will use them to benchmark our custom optimizer. Finally, I imported tqdm to get progress bars. Let us now implement the Rosenbrock function.
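(As a quick aside before the Rosenbrock code: here is a bare-bones sketch of what such a subclass typically looks like. The class name PlainSGD and the vanilla gradient-descent update are my own illustration of the interface described above, not code from the video.)

```python
import torch
from torch.optim import Optimizer


class PlainSGD(Optimizer):
    """Illustrative subclass: one hyperparameter, plain gradient-descent update."""

    def __init__(self, params, lr=1e-3):
        # `defaults` holds the hyperparameters that every parameter group falls back to
        defaults = {"lr": lr}
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        # `param_groups` gives uniform access to all groups and their hyperparameters
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # `state` can cache per-parameter history (e.g. momentum buffers)
                state = self.state[p]
                state["step"] = state.get("step", 0) + 1
                p.add_(p.grad, alpha=-group["lr"])

        return loss
```

Because params can also be an iterable of dictionaries, the same class could be instantiated with per-group hyperparameters, e.g. PlainSGD([{"params": encoder_params, "lr": 1e-3}, {"params": head_params, "lr": 1e-1}]), where the two parameter lists are hypothetical.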
First of all, we fix the a and b parameters, namely a is set to 1 and b is set to 100. Also, let me point out that the way we wrote this function allows us to use it both with PyTorch and NumPy. Now we would like to implement a function that runs the optimization given a selected optimizer and its hyperparameters.

Okay, so this run_optimization function accepts multiple parameters. First of all, we need to provide xy_init, which represents the starting point of the optimization algorithm. Then we need to provide the class of the optimizer, which will only get instantiated inside of the function. We need to provide the number of iterations, and finally we can pass any additional keyword arguments (hyperparameters) to the optimizer. What this function returns is a path, which is a 2D numpy array where the rows are different iterations of the optimizer and the columns are the xy coordinates. (A sketch of both the Rosenbrock function and this driver follows below, after the description of the visualization code.)

First of all, we create a tensor, set its value to xy_init and make sure that it requires the gradient. We instantiate an optimizer using the optimizer_class and the optimizer_kwargs. Note that the first argument that we're passing is an iterable of tensors, namely an iterable containing a single tensor. That means that we are not using the possibility of creating groups. However, in our example it doesn't really make sense to split up our parameters into multiple groups, since our parameter space only has dimension 2.

We initialize the path array and we populate its first row with the initial x and y. We then compute the loss, which is just the value of the Rosenbrock function at a given point, and we compute the gradients by calling the backward method. We take all our gradients and we clip their norm to one; we do this to guarantee that the gradients are not exploding. Finally, we call the step method of our optimizer, and, as discussed, it will update the value of our parameters in some way. At each iteration we cache the corresponding value of the parameters inside of the path array, and finally we return it.

Now we would like to write a function that visualizes the paths of different optimizers; to do that we can create a small animation. Let me go through the arguments. We start with paths, which is a list of all the different paths representing different optimizers. Then, for each of them, we can specify what color we want to visualize it with, and also (to have a nice legend) what the name of that optimizer is. Of course, we can also specify the figure size and the x and y limits of our figure. We return an animation that is going to show the paths of all the optimizers. Let's implement it.

First of all, we make sure that paths, colors and names have the same length, because they correspond to the same set of optimizers. We compute the maximum path length across the different optimizers; however, in our example all of them will have the same length. We prepare a grid of x and y coordinates and then we evaluate the Rosenbrock function at all of them. Here we just save the x and y coordinates of the global minimum, because we know what it is and we want to visualize it. We create a figure and then we create a contour plot of our Rosenbrock function. We predefine a scatter plot for each optimizer. We plot a legend and we also plot the coordinates of the global minimum. We define a function called animate and we will use it to generate the different frames of our animation.
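(As promised above, here is a minimal sketch of the Rosenbrock function and of the run_optimization driver as narrated. The exact code shown in the video may differ; the signatures, the path layout and the clipping threshold simply follow the description above.)

```python
import numpy as np
import torch


def rosenbrock(xy):
    """Rosenbrock function; works for both torch tensors and numpy arrays."""
    x, y = xy
    a, b = 1, 100
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def run_optimization(xy_init, optimizer_class, n_iter, **optimizer_kwargs):
    """Run `n_iter` steps of the given optimizer on the Rosenbrock function.

    Returns a (n_iter + 1, 2) numpy array whose rows are the visited xy points.
    """
    xy_t = torch.tensor(xy_init, requires_grad=True)
    optimizer = optimizer_class([xy_t], **optimizer_kwargs)

    path = np.empty((n_iter + 1, 2))
    path[0, :] = xy_init

    for i in range(1, n_iter + 1):
        optimizer.zero_grad()
        loss = rosenbrock(xy_t)
        loss.backward()
        # Clip the gradient norm to 1 to avoid exploding updates
        torch.nn.utils.clip_grad_norm_([xy_t], max_norm=1.0)
        optimizer.step()
        path[i, :] = xy_t.detach().numpy()

    return path
```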
Finally, we create a FuncAnimation object, providing our animate function, our figure and other parameters influencing how the final animation is going to look. Now that we have written all the boilerplate code, we can test whether it works.

First of all, we specify the initial x and y coordinates. We are going to run the optimizer for 1000 steps. We run two different optimizations: the first is going to use the Adam optimizer and the second one is going to use vanilla SGD, stochastic gradient descent. I define the frequency for the final animation; the idea is that we don't necessarily want to have 1000 frames and instead we want to take every 10th frame. Finally, I prepare all the inputs for the create_animation function and I run it. The last step is to save the animation as a GIF on our computer. Now we can try to run it.

We can see that both of the optimizers go directly to the valley, and once they are in there they just continue towards the global minimum. Let us now implement a custom optimizer. Just for the record, we're not trying to beat the state of the art or anything; the goal is to show some features of the Optimizer class and how you can use them.

Here we import numpy and torch and the parent class Optimizer that we will need to subclass. We name our optimizer WeirdDescent, and the way it behaves is very similar to the coordinate descent algorithm; however, in our case we're going to choose the coordinate randomly and not necessarily cycle through all the coordinates. Additionally, on every 100th step we will multiply the learning rate by a very big constant.

The only hyperparameter that this optimizer is going to have is a learning rate. What we do is insert it into the defaults dictionary and then use it to call the constructor of the parent. We're not going to use the concept of a closure in our example. However, you might want to use it when you need to evaluate your loss function multiple times, and some existing optimizers actually use it.

Here we want to define a key "step" inside of our state dictionary. If it's the first step we set it equal to 1, and on any subsequent step we just keep on incrementing it. We need to keep track of the number of steps in order to recognize whether the current step is a multiple of 100. If we happen to be at step 100, 200, 300 and so on, we define c to be something really big, for example 100; otherwise we keep it equal to 1.

Let me explain what this block is trying to do. First, we want to randomly select a group out of all the available parameter groups. Given this parameter group, we then want to select a random tensor that belongs to it. Once we have this random tensor, we try to access its gradient. If it's defined, great; if it's not defined, we repeat the above process until we hit a tensor that has gradients computed on it. Also, let me point out that in our example with the Rosenbrock function we will only have one parameter group, and inside of this parameter group we will only have a single tensor.

Given our selected tensor, we want to pick one element at random. Once we have our element, we create a mask tensor that has the same shape as our input tensor and is equal to zero everywhere except for one entry. Finally, we perform the parameter update. The way we do it is by taking the gradient tensor and the mask tensor and multiplying them elementwise. (A full sketch of this optimizer follows below.)
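(Putting these pieces together, here is one way the optimizer just described might look. Treat it as a sketch under the assumptions stated in the narration, namely a random coordinate choice and a constant c = 100 applied on every 100th step; the video's actual code may differ in details.)

```python
import random

import torch
from torch.optim import Optimizer


class WeirdDescent(Optimizer):
    """Coordinate-descent-like optimizer: updates one random coordinate per step
    and inflates the learning rate on every 100th step."""

    def __init__(self, params, lr=1e-3):
        defaults = {"lr": lr}
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        # Keep a "step" counter in the optimizer state to spot every 100th step
        self.state["step"] = self.state.get("step", 0) + 1
        c = 100 if self.state["step"] % 100 == 0 else 1

        # Randomly pick a group, then a tensor in it; retry until one has a gradient
        while True:
            group = random.choice(self.param_groups)
            p = random.choice(group["params"])
            if p.grad is not None:
                break

        # Mask that is zero everywhere except one randomly chosen entry
        mask = torch.zeros_like(p)
        idx = random.randrange(p.numel())
        mask.view(-1)[idx] = 1.0

        # The elementwise product keeps only one coordinate of the gradient;
        # alpha combines the learning rate with the occasional big constant c
        p.add_(p.grad * mask, alpha=-group["lr"] * c)

        return loss
```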
What this essentially results in is an update along one single coordinate only. The alpha parameter defines the multiplier we are going to multiply the update with; as you can see, it combines the learning rate and our constant. Finally, we return the loss. All right, so let's test out our optimizer.

I just directly modified our source script to include the WeirdDescent optimizer, and at the end I'm also printing a few rows from the path array (a hypothetical version of this driver is sketched after the closing remarks). It is always just one coordinate that is changing its value at a time. As you can see, our optimizer is jumping every 100 steps, which is something we expected. When it comes to moving along a single coordinate at a time, unfortunately, we cannot see it clearly in this animation; the reason for this is that we only show every 10th frame.

Anyway, that's it for the video. I hope I managed to cover the most important things, and if you're interested in the topic I would definitely encourage you to read through the official PyTorch source code and documentation, or even look for implementations of custom optimizers in third-party packages. Thank you very much for watching and see you next time!
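(For completeness, here is a hypothetical version of the driver script described above. It reuses the run_optimization helper and the WeirdDescent class from the earlier sketches; the starting point and learning rates are illustrative values, not necessarily the ones used in the video.)

```python
import torch

# Assumes rosenbrock, run_optimization and WeirdDescent from the sketches above.
xy_init = (0.3, 0.8)   # illustrative starting point
n_iter = 1000          # number of optimization steps
freq = 10              # keep every 10th point so the animation stays small

path_adam = run_optimization(xy_init, torch.optim.Adam, n_iter, lr=1e-2)
path_sgd = run_optimization(xy_init, torch.optim.SGD, n_iter, lr=1e-3)
path_weird = run_optimization(xy_init, WeirdDescent, n_iter, lr=1e-3)

# With WeirdDescent, only one coordinate changes between consecutive rows
print(path_weird[:5])

# Subsampled paths that a visualization/animation helper could consume
paths = [path_adam[::freq], path_sgd[::freq], path_weird[::freq]]
```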
Info
Channel: mildlyoverfitted
Views: 1,280
Rating: 5 out of 5
Keywords: pytorch, optimizer, programming, custom optimizer, Adam, stochastic gradient descent, tutorial, custom optimizer in pytorch, mini tutotial pytorch, mini tutorial mildlyoverfitted, programming custom optimizer, learn custom optimizer, custom optimizer deep learning, optimizer class, rosenbrock function pytorch, visualization logic implementation custom optimizer, optimization logic implementation custom optimizer, adam and sgd example custom optimizer
Id: zvp8K4iX2Cs
Length: 17min 59sec (1079 seconds)
Published: Sat Jan 30 2021