Training more effective learned optimizers, and using them to train themselves (Paper Explained)

Video Statistics and Information

Captions
Hi there! Today we'll look at "Tasks, Stability, Architecture, and Compute: Training More Effective Learned Optimizers, and Using Them to Train Themselves" by Luke Metz, Niru Maheswaranathan, C. Daniel Freeman, Ben Poole and Jascha Sohl-Dickstein. On a high level, this paper deals with a sort of meta problem: it deals with learning optimizers that train machine learning models. Learned optimizers are a fairly new field of research, and the goal is to obtain an optimization function that can be used to train all kinds of machine learning models. This paper builds on a line of research and extends it; it's not the first to do this, but it is so far the largest, most compute-intensive and most task-encompassing attempt at learned optimizers, and the optimizer they end up with has some nice properties, as they're going to show. It can also be used to train itself, iteratively, ending up with an even better learned optimizer. We're going to go through the paper and find out how much of these claims are wishful thinking and how much is actually true. I have mixed feelings about this paper, though in all of this, remember, my opinion is my opinion. They are very open about their results, which is something I really appreciate; I feel that if more papers were as open as these people are about what worked and what didn't, we would be in a better place as a research community. That being said, I do have some mixed feelings about the statements being made here and about how the results are interpreted, so stick around if you're interested in that. I also find the broader impact statement to be a bit funny, but I will come to that at the very end. If you like content like this, as always, don't hesitate to share it out. I've been on a bit of a break; it feels good to be back making videos after paper deadlines. Let's dive in. They say: much as replacing hand-designed features with learned functions has revolutionized how we solve perceptual tasks, we believe learned algorithms will transform how we train models. There's a lot packed into this sentence. For you young kids who have been growing up with deep learning: there was a time before deep learning, and basically what we would do is use hand-designed features. That works really well if you have something like a database of customer data; it worked only moderately well if you have, say, a picture. So if you have a picture of your cat, what people used to do was run very handcrafted detectors, feature extractors, over it. These might be fixed filters, like 3x3 Sobel filters, gradient filters and so on: run them over the image, try to detect corners, try to detect very small things, and once you had a couple of features like this, you would feed them into a classic classification algorithm like logistic regression. There were sophisticated approaches, but most required hand-engineering of features. Of course, deep learning transformed all of this. If you want to take a cynical look at deep learning, it simply replaces the part that creates the features; the classifier is still something like a logistic regression. However, deep learning learns by itself how to extract good features, in fact better features than humans ever could, for perceptual tasks: for images, for sound, and in the latest iterations also for language.
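To make the "hand-designed features" point concrete, here is a minimal sketch of the kind of fixed feature extractor being described: a 3x3 Sobel filter slid over an image. The image is just a random stand-in and the whole snippet is illustrative, not anything from the paper.

```python
import numpy as np

# A fixed, human-designed kernel that responds to horizontal intensity edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

image = np.random.rand(28, 28)          # stand-in for a real grayscale image
H, W = image.shape
edges = np.zeros((H - 2, W - 2))

# Slide the kernel over the image and record its response at every position.
for i in range(H - 2):
    for j in range(W - 2):
        edges[i, j] = np.sum(image[i:i + 3, j:j + 3] * sobel_x)

# "edges" would then be fed into a classic classifier, e.g. logistic regression.
print(edges.shape)  # (26, 26)
```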
These people say that this kind of thinking can also be applied to optimization algorithms. In optimization, what you want to do is train your deep network: whatever goes from your image right here to your final output, you want to train it, and we train it using gradient descent. Usually there are many, many layers in your deep neural network, and each one has parameters, call them theta 1, theta 2 and so on; these are all vectors or matrices, your convolutional filters, your batch norm parameters and so on. We can collect all of these into a big parameter vector, let's call it theta, and the task is now to find the best theta. So in optimization you have a theta, you feed an example x through the network, you get some sort of output f, that gives you some sort of loss, you backpropagate that loss, and what you end up with is a gradient with respect to theta. If we were just doing gradient descent, we would update theta right here: theta becomes theta minus the gradient of theta times some step size. This is classic gradient descent, and most algorithms are something like this. For example, gradient descent with momentum has an additional term where it considers the last steps; Adagrad has a factor in the denominator, where it divides by something like the square root of the accumulated past squared gradients, summed or averaged. There are many variants; you can also do this averaging in a decaying way, with momentum. There are all sorts of algorithms to optimize these functions, and the sense behind them is that ultimately deep learning is a non-convex problem. With your classic classifiers, the loss as a function of the parameters looks like a nice bowl, say in 2D, and you can just do gradient descent and walk to the optimum. In deep learning it's a different situation: you might have many different local optima, and we know by now that we can go to any one of them and that should be fine. So if you draw some level sets, you can see you have multiple optima where these dots are, but in between it's kind of shaky: you might have a large flat area, but then as you get close to an optimum the steepness increases. If you look at a cross-section, there might be a flat region and then it increases again, and you want an optimization algorithm to automatically adjust to the steepness and to changes in steepness; that's what these modifications to gradient descent are supposed to do. Adagrad, for example, adjusts automatically to a landscape like this: even if it's convex, the scale of this parameter might be much flatter than that of this parameter, and Adagrad would automatically stretch one out and shrink the other, transforming the problem into one where all dimensions behave similarly, because you effectively get one learning rate per dimension. If you go further, into the regime of Adam or RMSProp, these can also adapt over time (Adagrad can too, to a degree, but much less so), so these algorithms can adapt to changes in steepness: once the landscape goes flat again, they can recognize it and take bigger steps.
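As a reference point for what "hand-designed" means here, below is a minimal sketch of the three classic update rules just discussed: plain SGD, SGD with momentum, and Adagrad. The hyperparameter values are illustrative defaults, not anything from the paper.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # plain gradient descent: step against the gradient
    return theta - lr * grad

def momentum_step(theta, grad, velocity, lr=0.01, beta=0.9):
    # keep a decaying memory of past gradients to counter stochasticity
    velocity = beta * velocity + grad
    return theta - lr * velocity, velocity

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    # per-dimension sum of squared gradients: flat dimensions get
    # stretched out, steep dimensions get shrunk
    accum = accum + grad ** 2
    return theta - lr * grad / (np.sqrt(accum) + eps), accum
```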
Once it goes steep again, they're like: okay, I should probably be a bit careful right here. There's also the notion of momentum, which is really useful and kind of counters the stochasticity of stochastic gradient descent. It's a big field, but what all of these methods have in common is that it's humans sitting down and coming up with a particular formula, because they feel that if you do this thing, it might, say, stretch out these dimensions, and that might be beneficial. These are humans sitting down. Now, the analogy these people make is: we used to do this for classifiers too, we used to hand-design features that we felt made sense, like image gradients, or the FFT for, say, sound, and that worked so far, but it worked better when we let deep learning do its thing. The goal here is likewise to let machine learning come up with the optimization procedure. So if we try to update theta, we might update it not with a fixed formula; we might take the old theta, the gradient of theta, and a bunch of features that we calculate from these things, things like the sum over the norms of old gradients and so on, and we put all of this into a big function f. Classically, f is what the humans define, but now the goal is to learn f. So we have a set of meta-parameters, let's call them psi, and we parameterize f as a neural network that learns to output the next weights for the underlying neural network. The f itself, of course, has to be learned somehow, but the idea is that since it's a meta-algorithm, and meta-algorithms tend to be much more general and much more smooth, it can itself be optimized fairly generally; and once we have a good f, we can apply it to all sorts of tasks. That's exactly what they do. They consider three problems in learning optimizers. First of all, computational scale: learning optimizers is hard, and this paper invests a lot of compute into learning one meta-optimizer. Second, training tasks, and this, I feel, is the core here. You have to pay attention, because if we talk about datasets it gets confusing: on one hand you have datasets like MNIST and CIFAR-10. These are datasets in the classic sense: in MNIST the samples are images of digits, in CIFAR-10 a sample is, say, this airplane right here (it's an airplane, believe me) or this truck, and so on. However, in this paper a "dataset" consists of the following, and the dataset they use here is called TaskSet. One sample in the TaskSet dataset is: I take the MNIST dataset, I use a five-layer CNN on MNIST, I use a batch size of 32, and I let it run for 10k steps. That's one sample. The next sample could be: I take CIFAR-10, I use a ResNet-50 on it, my batch size is 64, and I let it run for 50k steps. These are now samples in this TaskSet dataset, and the TaskSet dataset consists of a wide variety of tasks, I believe over 6,000 different samples.
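To make the "a sample is an entire training problem" idea concrete, here is a sketch of what two such task samples might look like as configuration records. The field names are hypothetical, chosen to mirror the two examples just given; the real TaskSet-style tasks are defined in code, not as dicts like this.

```python
# Two "samples" from the meta-training distribution: each one is a full
# inner training problem, not a single labelled example.
task_a = {"dataset": "mnist",   "architecture": "5-layer CNN",
          "batch_size": 32, "train_steps": 10_000}
task_b = {"dataset": "cifar10", "architecture": "ResNet-50",
          "batch_size": 64, "train_steps": 50_000}

# The learned optimizer is meta-trained so that one set of meta-parameters
# psi performs well across thousands of such tasks, and is then evaluated
# on held-out tasks it has never seen.
task_distribution = [task_a, task_b]  # ~6,000 tasks in the paper
```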
These include things like RNN tasks, image recognition tasks, very simple 2D or, rather, quadratic optimization tasks, and so on; all kinds of different tasks. And the goal is this: when we learn MNIST, the goal is that our output is a CNN that we can input any sort of digit into and it gives us the label "2". The goal here in TaskSet is: if we find an optimizer f that works for all of these samples in the dataset, then we can give it any sort of new sample. Let's say we have a new problem, our medical dataset, and we have this ResNet-101 that we want to train on it (not a pre-trained one, one we want to train from scratch), with a batch size of 64, and so on. We can input that, and the optimizer will spit out good parameters for that particular ResNet-101; the optimizer will be good. It's important to stress that we are looking for one single optimizer, one single function, that can optimize all these kinds of different tasks. That's a challenge, of course, and that's what this paper attempts. The last thing they mention is the inductive bias of optimizer architecture: the parameterization of the learned optimizer and the task information fed to it strongly affect performance; in this work, they say, we propose a new hierarchical learned optimizer architecture that incorporates additional task information such as validation loss, and show that it outperforms previous learned optimizer architectures. So I think you get the overview; let's actually jump right in. What does their optimizer look like? Their optimizer here is kind of the contrast to previous work: each parameter is associated with one LSTM and one feed-forward network. So what do these output? They say inputs such as training loss and validation loss are normalized to have a relatively consistent scale, and that to compute the weight update, the per-parameter MLP outputs two values, a and b, which are used to update the inner parameters. Their formula to update theta, what we call theta right here, is a step built from a and b, where one of them goes through an exponential, roughly a times exp(b). So for each parameter, their optimizer outputs a and b; that's this feed-forward network. As far as I can tell (and this paper is very confusing, there are multiple points where it's not clear what they do, and the differences in notation don't help), if I had to guess, I would say they don't output delta w directly, they actually output a and b. Into their feed-forward network goes, most importantly, the gradient. If this network were to do something very trivial, it would essentially just reproduce the gradient: output a and b such that the step is simply the gradient times some fixed step size, and then you get plain gradient descent back. But we also want to feed it with information it could use to make better decisions, such as momentum: then it could technically reproduce SGD with momentum. If we give it the second moment, it can now do things like Adagrad, because Adagrad uses the second moment.
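To pin down the update being described, here is a hedged sketch of the per-parameter rule: an MLP looks at the features for one parameter and emits two numbers a and b, and the step is roughly a times exp(b). The exact scaling constants and feature set in the paper may differ; the dummy MLP below is purely illustrative and just shows that plain gradient descent is one special case of this parameterization.

```python
import numpy as np

def dummy_mlp(feats):
    # Stand-in for the learned per-parameter MLP: it returns a = gradient
    # (assumed to be feats[..., 0]) and b = log(0.01), which makes the
    # rule below collapse to plain SGD with learning rate 0.01.
    a = feats[..., 0]
    b = np.full_like(a, np.log(0.01))
    return a, b

def per_parameter_step(w, feats, mlp=dummy_mlp):
    a, b = mlp(feats)                 # two outputs per parameter
    return w - a * np.exp(b)          # exp(b) acts like a learned (log-scale) step size

w = np.array([0.5, -1.2])
grad = np.array([0.1, -0.3])
feats = np.stack([grad], axis=-1)     # in the paper, many more features go in
print(per_parameter_step(w, feats))   # [0.499, -1.197], i.e. w - 0.01 * grad
```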
Note that this algorithm doesn't do any of this symbolically. There are other papers that try to come up with a symbolic expression for a better optimizer; like Adam, you can write such a thing down as a symbolic expression. This is not that paper. Here, the output of the feed-forward network really is a number, or two numbers, per parameter, or two vectors if you want to look at it that way. This is a numerical procedure: this f is really a function where a vector goes in and a vector goes out. And these are the features: gradient, momentum, second moment, and so on. There are more features that go into the model, namely training and validation loss. Since you are training an underlying model, you have access to the labels at all times; this holds even at test time, because when you test your f on a test task, that task will have an associated training dataset, so you have the loss on that training data, and you also have a validation loss (I guess you could make the split yourself if you wanted to). We'll come to how exactly f is optimized and what the loss for us is, but intuitively you want to train your f such that the validation loss of the inner task is as small as possible, and we're going to see how that works. The tensor shape also goes in, so it could technically do something like implicit batch norm, depending on how big the current tensor it optimizes is; and the gradient norm, the total norm of the gradient. They just feed all this kind of information in, and you can already see my first gripe with this: if this were really modeled after classic deep learning, what you would input is two things. You would input the current weight, the w that you're changing, and you would input the gradient that you get from backprop through the underlying system. Since the LSTM goes over time, in each step it technically remembers the last steps; it's a neural network, a universal function approximator, so it could technically calculate the momentum and the second moment of these things by itself. I agree there are some things, like the validation loss, that it conceivably couldn't compute, but these other things it could calculate. So we're back in the business of feature engineering, and they say this at the beginning; as I said, this paper is quite honest. They say that these features they feed in matter a lot for the final performance of this model. So this somewhat clashes with the analogy of "hey, remember when we replaced handcrafted features with learned features in computer vision, let's do the same": it's only halfway there, because yes, we are replacing the symbolic operation, but we are still inputting a lot of the handcrafted features that we think are useful. So, as you can see, there's an LSTM going over the time steps, and for each parameter there is a small feed-forward network whose output is sent back to the next step of the LSTM; the LSTM, of course, is recurrent.
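To illustrate the feature-engineering point, here is a hedged sketch contrasting the minimal input he argues a truly learned optimizer should need with the kind of hand-engineered per-parameter feature vector the paper feeds in. The exact feature set and any normalization are my own illustrative choices, not the paper's precise list.

```python
import numpy as np

def minimal_features(w, grad):
    # What he argues should suffice: a recurrent optimizer could, in
    # principle, compute momentum, second moments, etc. internally.
    return np.stack([w, grad], axis=-1)

def hand_engineered_features(w, grad, momentum, second_moment,
                             train_loss, valid_loss, grad_norm):
    # The kind of extra, human-chosen signals the paper feeds in.
    per_param = np.stack([w, grad, momentum, second_moment,
                          grad / (np.sqrt(second_moment) + 1e-8)], axis=-1)
    global_info = np.array([train_loss, valid_loss, grad_norm])
    # Broadcast scalar task-level information to every parameter.
    return np.concatenate(
        [per_param, np.tile(global_info, (w.shape[0], 1))], axis=-1)

n = 4
w, grad = np.random.randn(n), np.random.randn(n)
mom, sec = np.zeros(n), np.ones(n)
print(minimal_features(w, grad).shape)                                   # (4, 2)
print(hand_engineered_features(w, grad, mom, sec, 0.7, 0.9, 1.3).shape)  # (4, 8)
```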
I hope you can see how this works. What this does is: you have a neural network that you input a dataset into, you let the dataset run through it, it gives you a loss, and you are using f to optimize that loss. f is a function that takes in the w of the current neural network and outputs the w at the next step, t plus one. You do this for a bunch of steps, say n steps, then you take the validation dataset of the inner task and you calculate your final loss: the loss on the validation data given w. And what you want is to optimize the psi of f such that this validation loss is as small as possible. I hope you can see the problem with this: even if all of it is differentiable, which it can be, you are going to have to backpropagate through n inner steps of optimization, since each of these steps is a forward propagation through f, and only at the end do you have an actual loss, a validation loss. So you would have to backprop through all these n steps, which is simply not possible currently; we can't backprop through thousands of steps, and we need thousands of steps to optimize deep learning architectures. So they opt for something different. We have this model, the model is acting as an optimizer, at the end there's a validation loss, and we are wondering how we should optimize this model to make the validation loss as small as possible, given an n-step rollout of the underlying thing, while we can't backpropagate through the entire rollout. If you guessed reinforcement learning, you're almost correct. The answer here is going to be evolution strategies. They say it right here: we deal with these issues by using derivative-free optimization, specifically evolutionary strategies, to minimize the outer loss, obviating the need to compute derivatives through the unrolled optimization process; previous work has used unrolled derivatives and was thus limited to short numbers of unrolled steps, yada yada; using evolution strategies we are able to use considerably longer unrolls. So they use these evolution strategies, and later persistent evolution strategies, which are a modification. Evolution strategies, really briefly: there are many, many variants, but ultimately what you do is this. You are here with your current guess of the best parameters, and you perturb these parameters by a little bit in multiple directions. There are many kinds of evolutionary strategies, and what they do here is, I feel, sort of the weakest form; I've had people flame me before, saying these are not really evolution strategies, and I agree, it's basically glorified random search. You perturb the parameters in several directions so that you end up with a population, then you evaluate each of these new points, and maybe you find that these ones are actually good, this one is meh, and these ones are really bad, or at least worse. Then you shift your guess of the best parameters in the direction of the good ones and away from the direction of the bad ones, and you can see this green arrow here as a pseudo-gradient; it's kind of a finite-difference method if you really think about it.
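A minimal sketch of this flavor of evolution strategies, assuming an outer_loss(psi) function that unrolls the learned optimizer on a task and returns the final validation loss. The function names and all constants are placeholders, not the paper's implementation; the pseudo-gradient here is a finite-difference-style average over random perturbations, and (per the guess discussed below) it could then be handed to an outer optimizer such as Adam rather than applied directly.

```python
import numpy as np

def es_pseudo_gradient(psi, outer_loss, num_samples=8, sigma=0.01):
    """Antithetic ES estimate of the gradient of outer_loss at psi."""
    grad = np.zeros_like(psi)
    for _ in range(num_samples):
        eps = np.random.randn(*psi.shape)          # random perturbation direction
        # Evaluate the full unrolled training run at psi +/- sigma * eps.
        loss_plus = outer_loss(psi + sigma * eps)
        loss_minus = outer_loss(psi - sigma * eps)
        # Weight each direction by the finite-difference change in loss.
        grad += (loss_plus - loss_minus) / (2 * sigma) * eps
    return grad / num_samples

# The estimate could then drive an outer update, e.g. (hypothetical helper):
# psi = adam_update(psi, es_pseudo_gradient(psi, outer_loss))
```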
And I know evolutionary strategies and so on contain things like crossover and whatnot, inspired by biology. Honestly, they don't say much here, but I have read, or at least looked at, their other papers, and it looks to me like they're doing something like this, and using the same trick to calculate this pseudo-gradient as the REINFORCE algorithm, i.e. the log-derivative trick for differentiating something that is not differentiable. And again, this is not really written well, because here I would expect that they just take a step in the direction of these good perturbed points, but from the text it seems they optimize everything using Adam. In terms of the outer gradient, I can actually show you; again, not to rag on them, maybe I'm just a poor reader, but this is a wildly confusing paper to read, and I still don't really have a clue what's going on, because things are described vaguely, and then there's this pseudocode, which doesn't help: it basically just specifies how they named their variables, it doesn't show you most of the actually important logic, at least that's how it feels to me. Okay, so under "outer optimization details" they write: we optimize all models with Adam, we swept the learning rates, yada yada, we find the optimal learning rate is very sensitive and changes depending on how long the outer training occurs. So they clearly say "outer training" and "Adam", which means they use Adam for the outer training; but earlier they say they use derivative-free methods, i.e. evolution strategies, and they don't say anything about Adam up there. So my guess is that they use the evolution strategies to obtain these pseudo-gradients, because in their own older paper, which I looked up, they use evolution strategies to obtain a gradient; and then I'm going to guess they take this gradient and feed it into Adam, and use Adam to optimize the outer thing, but instead of backpropping to get the gradient, they use ES to get it. I'm guessing that's what's happening. Then, task distributions: as we said, they have this task dataset of 6,000 tasks designed after the TaskSet dataset; it's not exactly TaskSet, I think it's inspired by it. These tasks include RNNs, CNNs, masked autoregressive flows, fully connected networks, language modeling, variational autoencoders, simple 2D test functions, quadratic bowls, and more. For tasks that require them, they additionally sample a dataset, batch size, network architecture and initialization scheme. There are multiple issues here. One issue is in the very next sentence: to keep outer training efficient, they ensure that all tasks take less than 100 milliseconds per training step. For each task that makes use of a dataset, they create four splits to prevent data leakage. This is very cool: they really separate inner training, inner validation, outer validation, and an outer test set that they only look at at the end (the outer training set is the inner task itself). But you can see that even Google Research doesn't really have enough compute here to thoroughly survey deep learning as a field and take all the tasks into consideration, so they have to settle for
rather small tasks like CIFAR-10, MNIST and so on, and the various small architectures that go along with them. If you know much about deep learning, you know that there are considerable effects of scale in these things. Optimization, honestly, has kind of gone back a step in terms of complexity: it used to be much more of a debate, should you use this optimization algorithm or that one; now most people use Adam, and a lot of people just use SGD with momentum, and especially in the larger models, say BERT or even bigger, SGD with momentum seems to be the way to go, not only because it's easy to implement but because it actually performs well, especially in large models with large data. So there are considerable effects of scale, and only training on small models and data is a very big hindrance; we're going to see in the results that this is limited to that domain. They also say up here: unfortunately, directly utilizing these large-scale models is computationally infeasible, therefore we opt to train on proxy tasks for speed. Yeah, not really representative in terms of how optimization interacts with the task. So that's my comment right here, and what I see as the biggest weakness of this paper. Okay, with that out of the way, let's jump into the results. Here they compare with various handcrafted optimizers, and let me just say: this is a very big and very hard engineering task, because they have to implement all of these tasks, the losses are of different scales, you have to take care of that, and so on. This is considerable engineering effort, and I don't want to diss the work; I just want to point out where the limits are, in places where they might not have pointed them out so much. So here they compare to different things. The top rows are algorithms with a fixed learning rate, like Adam with the suggested 3e-4; if that doesn't work at least a little bit, you're screwed. So that's one trial. Then you might want to use Adam but search over the learning rate, so they do 14 trials to find a good learning rate for Adam, and it goes on: this one here is 2,000 trials, trying out different parameter combinations. Their learned optimizer, on the other hand, only ever has one trial, because it's learned; it has no hyperparameters. That's one thing they point out: once they have learned their optimizer, it itself has no hyperparameters, it's a learned function, so there's nothing to search over, and that's something you save. You can see that if a point is beyond this middle line, the learned optimizer improves over the other optimizer, on the train and test sets in solid and shaded respectively, and for most things there is a bit of a movement to the right, except in these very grid-searchy settings. So if you grid search heavily and you have lots of parameters to tune, it seems you can outperform this thing, but it can outperform things where you do not grid search, at least on these kinds of tasks, which is pretty cool. That said, it does use more memory.
I don't know exactly whether it uses more time; it certainly uses something like five times as much memory as Adam, I think they say. On time, I don't know; Adam is doing a considerable amount of work as well, so don't underestimate that compared to one LSTM forward pass. Next, they analyze what their learned optimizer does. Remember, this is one learned optimizer: out of all of this, they have one task dataset and they end up with one learned optimizer, and now they look at it. They feed it this loss function right here, (x minus y) squared. If you look at the trajectories of the Adam optimizer, if you start here it'll go this way, and if you start here it'll go that way, of course, because this whole line here is a global optimum of this function. So Adam seems to be doing something sensible, and in fact I've tried them in a little Colab: all of the classic algorithms do this. However, the learned optimizer does something else: it pulls towards (0, 0), towards the origin. So they claim that this optimizer has learned something like implicit regularization, which does make sense: this optimizer is optimized for giving as good a validation loss as possible. And what do we know about the validation loss, especially with small tasks, small datasets, small architectures in deep learning? That a little bit of regularization might be a good idea, because overfitting in these regimes is still a problem. So it makes sense that something trained to achieve as low a validation loss as possible will learn to implicitly regularize the parameters; I think that's sensible. They analyze this right here and show that this optimizer has in fact learned by itself to pull the weights towards zero. That's one take on it. The other take could be that, simply in the tasks it was given, setting most weights close to zero was just a good idea per se, and maybe the scale or the shape of this loss function is too broad and it pulls towards zero for other reasons; ultimately we can't know, though the explanation seems somewhat plausible. I have to say there's one exception: the AdamW optimizer will explicitly do the same thing. If you start with AdamW here, let's do that in a different color, then depending on the step size it can go like this or like this, but it will pull towards zero, because it has that built in. So it's cool to see that the learned optimizer has learned this, though in a chapter titled "understanding optimizer behavior" I would honestly expect something more interesting than something we have clearly already come up with in AdamW; the notion that pulling weights towards zero might be a good idea as regularization isn't new to humans. What I would have expected here is for them to say: wow, our learned optimizer has learned a complex but sensible way to deal with steepness changes in the landscape, or something like that, which is not achievable, or not easily achievable, by these classic algorithms; more complex, but it makes sense. That's what I want a learned optimizer for. I don't want a learned optimizer that tells me, well, maybe you should add a bit of the norm to the loss; like, gee, thanks. So again, they don't make claims about superior behavior of their optimizer, but still, that's kind of what I would expect from a learned function.
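Going back to the 2D test function he mentions: here is a minimal sketch (my own, not from the paper) of running a textbook Adam on the loss (x - y)^2. It settles onto the line x = y near where it started, rather than being pulled to the origin, which is the contrast drawn with the learned optimizer.

```python
import numpy as np

def loss_grad(p):
    x, y = p
    d = x - y
    return np.array([2 * d, -2 * d])   # gradient of (x - y)^2

p = np.array([1.5, -0.5])              # arbitrary starting point
m, v = np.zeros(2), np.zeros(2)
lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    g = loss_grad(p)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)

# Ends up near [0.5, 0.5]: a point on the optimal line x = y,
# not at the origin (0, 0).
print(p)
```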
Again, if you look at the generalization along different axes, the gray band here is where the training tasks lie in terms of number of hidden units, batch size and dataset size, and they say their learned optimizer, in red, sometimes generalizes. Yeah, sometimes it does, but sometimes it just screws up completely, and more often than not, it seems: here, here, okay, here it's better, but then here it's worse. So I would not yet take this off the shelf, though I agree it has some promising value. Lastly they say, okay, now that we've done this on all these small models, let's go bigger, and bigger for them actually means a small ResNet on CIFAR-10, which is like a 14-layer ResNet, and a small ResNet on resized ImageNet. So these are still small things, and I don't know exactly why, once they have the optimizer, they can only feed these; maybe because the LSTM itself also has an internal memory constraint when you have to feed in all of the weights of the network. However, look at this: this is CIFAR-10 on a ResNet, so this is fairly big, and you can see Adam and momentum overfit; here is the training loss, and I'm going to guess this is the validation loss, and they overfit. Well, the learned optimizer, wow, it doesn't overfit. But look: it ends up here, and when Adam and momentum were here, their validation loss was here, which is pretty much where this ends up. So: better? Nah. And then you can make two claims. You can say this is because it's implicitly regularizing, but you can also say this is because it's crap: it doesn't actually manage to get the training loss down, and at the very least your optimizer should be able to get the training loss down. I get it, they say it's implicitly regularizing, but I'd rather have explicit regularization and an optimizer that actually gets the training loss down as far as I want when I run it longer; I don't care about overfitting, it should peg down the training loss, and this one doesn't do it. I think the explanation here isn't that it's super-duper regularizing; it's just crap here. And again, not to say that the paper is crap, but the learned function they get isn't as good as Adam or momentum here. The same thing on a bigger model, this is ImageNet on a bigger ResNet, I believe, and you can maybe say that the learned optimizer is on par with the others, but you see a trend: on small problems, the learned optimizer outperforms; on slightly bigger problems, it still outperforms in validation loss; when it's even bigger, the learned optimizer is about on par; and here you can see that if you grid search, you can outperform the learned optimizer: 3e-4, look at that, it's like jackpot. So my suspicion is that if you go to even bigger problems, this learned optimizer will just get worse and worse, and this is the ultimate dichotomy in this paper. It says: look, there are no hyperparameters in our learned optimizer, you don't have to do grid search. Well, where can I do grid search? On small problems. Where can't I do grid search? On big problems. Where does this learned optimizer work? On small problems. I don't care whether I can or can't do grid search on small problems; I care about big problems,
which have fundamentally different optimization properties than small models. So the last experiment is where they take this learned optimizer and use it to train itself: they train it once and then apply it to itself; the analogy is the compiler that can compile itself. You can see that at the beginning it's kind of faster, but then it flattens out, and you can see that it can't really train itself; that's the answer. Because this early part doesn't really matter, except in very limited circumstances where you want to train to okay-ish performance really fast; what matters is whether it ends up in the same place, and you can clearly see here that it's not going to end up in the same place. I'm going to show you the full graph in a second, but even from this you can see that it cannot train itself; in fact, Adam can train this optimizer better than it can train itself. Just take that for what it is. They have the full, longer plot in the appendix right here, so you decide whether this algorithm can be used to train itself or not; I get it, it's pixelated right now, it's going to load in a second. All right, and as I said, there's this giant pseudocode in the appendix of the paper that is supposed to be helpful, I guess, but what it actually shows is their variables and how they interact. And again, I find it correct when they say there are no hyperparameters once you've trained the optimizer, but gee, are there a giant number of hyperparameters in actually training that learned optimizer: just deciding which features go in, the embeddings, this whole list. Okay, "there are no hyperparameters in this procedure"; I get it, and I'm being a bit hyperbolic here, but there are no other parameters except, you know, this list, the fact that you use a sine function, these gradient clipping values right here, this clipping thing right here, the fact that you use a square root right here, whatever constant you scale that by, the fact that you use a log with an epsilon here; you could have all kinds of things. And it goes on: the g norm, again, is clipped by something that is completely arbitrary; you can see that in the architecture there's another clipping value that is just set to five. The way you train this optimizer itself is riddled with arbitrary choices and hyperparameters. I get it, the idea is that this only has to be done once, but given the results I feel there's lots of room here, and whatever you input into these rolling features is going to have a giant amount of influence over what optimizer comes out, which again is something they admit. There's so much code in this. Okay, lastly, let's go to the broader impact statement, which I find amusing for a simple reason. The broader impact statement: what is it supposed to do? I maintain (and I don't agree that these things have to be in papers, but if you want to put one in) that the way the people who require it frame it is: you think about your method, the thing you have suggested, and you think about its ethical and societal implications, and you really think about the good and the bad implications of it.
My meme for it is: the broader impact statement is "technology good, technology bad, technology biased". I say good, bad, biased because you want to think about what's good, you want to think about what's bad, and then it's really in fashion to say that everything is biased and that of course your model, or your method, is as a result also biased; this is a fashion at the moment, and I expect it to maybe go away in a couple of years. The other part of the meme is the technology part. I say technology because what people usually do is this: they've just presented a method, they don't want to trash it, you're not going to say "my method is potentially bad". What you want to do is make it easy for yourself and say, well, my method is part of machine learning; or, if you have something for optimizing GANs, you say, well, GANs can be used for good and bad and are biased. So you make it easier for yourself and you take yourself out of the crosshairs by simply going one or two layers up, and the ultimate layer up, of course, is just the statement "technology". I intended this to be a meme, until I read: improving technology to do machine learning will accelerate its impact, for better or worse; we believe machine learning technologies will be beneficial to humanity on the whole; thus improving the ability to optimize models is moving us towards that. Like, literally, the meme has become reality, by them explicitly saying, well, this is part of technology, and technology can be good or bad. None of this is actually about the specifics of their method. In my mind, if you are seriously doing this, you should think about what differentiates your particular paper from other papers, and how that particular differentiation manifests as good or bad consequences. Instead: technology good, technology bad, technology is of course biased. So yeah, that's that. All right, I hope this was useful. I think this is cool work, and Google is one of the very few places where it can even be done. It is certainly a paper that fully admits its own limitations, and that's also extremely cool and interesting, even though it's written very unclearly at times, honestly. But yeah, that was my commentary. I hope you enjoyed this; if you did, share it out, leave a comment, tell me what you think, including if you have a different opinion, and I'll see you next time. Bye.
Info
Channel: Yannic Kilcher
Views: 16,815
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, optimization, lstm, taskset, google, google research, compute, outer optimization, adam, adamw, sgd, momentum, learning rate, gradient, learned optimizer, second moment, cnn, rnn, paper explained, neural network, gradient descent, hyper parameters, grid search, mnist, cifar10, imagenet
Id: 3baFTP0uYOc
Length: 53min 35sec (3215 seconds)
Published: Sat Oct 03 2020