Frank Hutter and Joaquin Vanschoren: Automatic Machine Learning (NeurIPS 2018 Tutorial)

Captions
If you can do so quietly, that would be better, because we want to stay on time. Today it's a great pleasure to have two great speakers for the tutorial on automatic machine learning. The first speaker is Frank Hutter, who is a professor at the University of Freiburg, and the second speaker is Joaquin Vanschoren, assistant professor at Eindhoven University of Technology and co-founder of OpenML.org. Together they organized the AutoML workshop at ICML for the past five years, and they are also organizing the meta-learning workshop later this week, on Saturday. Frank will start, and depending on time there may be a 5-minute break during which you can also ask questions. There are two microphones, one on each side of the central aisle; please line up behind them and we will call you.

[Frank Hutter] Thank you very much for the introduction. While people are trickling in, let me make one point: we have slides available for this tutorial. If you go to your browser and type automl.org, you can get the slides there. Now that there are a lot of people, the internet may not be so strong anymore, so even if this worked ten minutes ago it may not work for you now, but the slides are available online: go to "AutoML tutorial" on automl.org. All the references in both my slide set and Joaquin's are clickable links, so you can use the slides directly to explore the literature.

All right, with that said, let's start the tutorial. This tutorial is on AutoML, which is the quest for using machine learning and optimization to make machine learning better: more robust, and easier to use for people without expert knowledge in machine learning.

To briefly give some motivation: we live in very exciting times for deep learning, of course. We've seen tremendous progress in speech recognition, in computer vision for self-driving cars, in reasoning and games with deep reinforcement learning, and so on. But all these advances are based on the expert knowledge of some very brilliant people who have gathered a lot of expertise in deep learning and know how to set all the hyperparameters. If you want to apply deep learning to a new problem domain, the problem is that deep learning is very sensitive to many different hyperparameters, such as architectural choices, choices in the optimization pipeline, and the way you regularize your network so that it generalizes to new, unseen data. You easily get between twenty and fifty design decisions that you need to make before you can even start to train your network.

That leads to the current state of deep learning practice: given a new dataset, you need an expert who chooses the architecture and the hyperparameters that work well for this dataset, and given that architecture and those hyperparameters, you can apply deep learning in an end-to-end fashion to learn features from the raw data. Unfortunately, even experts don't always know exactly how to choose the architecture and hyperparameters for a new dataset, so you end up with an iterative loop with a human in the loop: the human expert chooses architectures and different hyperparameters, exploring that space
in order to eventually arrive at a nice deep learning system that gets state-of-the-art performance.

What AutoML tries to do is to provide a one-stop shop: you put in the data, and there is one big box that does end-to-end learning. It learns features from raw data, but it also decides which architecture to use, which hyperparameters to use, and so on. One way to achieve this is to mimic the manual approach: you have some learning box, which could just be a deep learning framework, and you replace the human expert who chooses architecture and hyperparameters with a meta-level learning and optimization process that reasons about which architectures and which hyperparameters work well on this particular dataset, and that can also reason across datasets. One note: this learning box is not constrained to be a deep neural network. It can also be a traditional machine learning pipeline, where you clean and preprocess your data, select and engineer features, select the model family and its hyperparameters, and construct ensembles of these models; and you can also do meta-learning, i.e., meta-reasoning about which types of hyperparameters work well in these frameworks.

The outline of the tutorial is as follows. We'll first talk about modern approaches to hyperparameter optimization, which might sound like a somewhat boring topic, but if you don't have automated hyperparameter optimization in your inner loop, then you can't really do AutoML; it's really important to have a robust and efficient core for your machine learning pipeline. Then we'll talk about neural architecture search, and then about meta-learning. I'll cover the first two, and Joaquin is going to talk about meta-learning. All of these are based on a book that we edited; in particular, the first three chapters of that book are review articles about these three fields.

The outline for my part: first I'll talk about hyperparameter optimization, phrasing AutoML as a hyperparameter optimization problem; then black-box optimization methods for hyperparameter optimization; and then going beyond black-box optimization in order to gain efficiency. For neural architecture search the overview will mimic that: first the search space design, i.e., how you write this as an optimization problem, and then again black-box optimization and speed-up techniques. The first part is based on a review article that I wrote together with my brilliant PhD student Matthias Feurer, who deserves a lot of credit for this.

So let's get straight to it: AutoML as hyperparameter optimization. Let's define the hyperparameter optimization (HPO) problem. We have some machine learning algorithm A, and it has hyperparameters lambda with domain capital Lambda (we'll use lambda throughout the tutorial for hyperparameters). You have a loss that depends on your algorithm with its hyperparameters, trained on the training set and validated on a validation set, and the HPO problem is simply to find the hyperparameter setting that minimizes this validation loss.
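In symbols, the problem just stated is (a transcription into the notation of the talk, with Lambda the domain of lambda):

```latex
\lambda^{*} \in \operatorname*{arg\,min}_{\lambda \in \Lambda}
  \mathcal{L}\bigl(A_{\lambda},\, D_{\mathrm{train}},\, D_{\mathrm{valid}}\bigr)
```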
What are these hyperparameters? There are continuous hyperparameters, of course, such as learning rates, and integer hyperparameters, such as the number of units. There are also categorical hyperparameters, which are maybe a bit more interesting than the continuous and integer ones: they are discrete, so we can't differentiate through them, and they have a finite, unordered domain. Some examples: you might choose between different machine learning algorithms (support vector machines, random forests, neural networks), you might choose the activation functions in your neural net (ReLU, leaky ReLU, tanh, and so on), and you might also choose operators that act on your latent feature representation, such as convolutions, separable convolutions, max pooling, etc. These categorical hyperparameters will come in very handy later on, when we do neural architecture search and want to write neural architecture search as a hyperparameter optimization problem with this fairly generic notion of what constitutes a hyperparameter.

Another type of hyperparameter is the conditional hyperparameter. A conditional hyperparameter B is only active if some other hyperparameter A is set in a certain way, and these conditional hyperparameters allow us to write pretty generic search spaces as a hyperparameter optimization problem. One example is Adam's second momentum parameter: if you don't use Adam, then this hyperparameter is not active; it's not even going to be inspected, and it is provably unimportant. Likewise, if you have a hyperparameter that is the convolution kernel size of a certain layer, and the type of that layer is not "convolution", then this hyperparameter is not active. As a final example, you only need to choose a support vector machine's kernel if you actually use a support vector machine; if you use a random forest instead, then provably that SVM kernel parameter is never inspected.

With that notion, we can actually write AutoML as a hyperparameter optimization problem, as follows. There is the combined algorithm selection and hyperparameter optimization (CASH) problem, where you have several algorithms A_1 to A_n, each with its own hyperparameter space, and a loss function; the problem is to find the combination of algorithm and hyperparameters that jointly minimizes this loss. You can write this combined problem as just a hyperparameter optimization problem in which one top-level hyperparameter is the choice of machine learning algorithm, and all the other hyperparameters are conditional on that choice. You don't have to stop at one level of conditionality: the first AutoML system that we worked on, Auto-WEKA, had four levels of conditionality and a total of 768 hyperparameters. That's a pretty big space that wouldn't typically be called hyperparameter optimization, but the same algorithms apply to it as apply to a low-dimensional continuous space, so I am referring to all of this as hyperparameter optimization just for simplicity. I hope to have convinced you that hyperparameter optimization, with this general notion of what constitutes a hyperparameter, is a pretty powerful beast.
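As a concrete illustration, here is a minimal sketch of such a conditional CASH-style space, written with the ConfigSpace library (maintained by the speaker's group); the particular algorithms, ranges, and names are made up for illustration:

```python
from ConfigSpace import ConfigurationSpace
from ConfigSpace.conditions import EqualsCondition
from ConfigSpace.hyperparameters import (
    CategoricalHyperparameter, UniformFloatHyperparameter, UniformIntegerHyperparameter)

cs = ConfigurationSpace()

# Top-level hyperparameter: the choice of machine learning algorithm
algo = CategoricalHyperparameter("algorithm", ["svm", "random_forest"])

# SVM hyperparameters, only active when algorithm == "svm"
svm_c = UniformFloatHyperparameter("svm_C", 0.01, 1000.0, log=True)
svm_kernel = CategoricalHyperparameter("svm_kernel", ["rbf", "poly", "linear"])

# Random forest hyperparameter, only active when algorithm == "random_forest"
rf_trees = UniformIntegerHyperparameter("rf_n_estimators", 10, 1000, log=True)

cs.add_hyperparameters([algo, svm_c, svm_kernel, rf_trees])
cs.add_condition(EqualsCondition(svm_c, algo, "svm"))
cs.add_condition(EqualsCondition(svm_kernel, algo, "svm"))
cs.add_condition(EqualsCondition(rf_trees, algo, "random_forest"))

# Inactive hyperparameters are simply absent from a sampled configuration
print(cs.sample_configuration())
```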
So let's talk about some black-box optimization methods for doing hyperparameter optimization. If you have, for example, a deep neural network, then different hyperparameter settings go into a black box that trains the deep neural network and validates it, and you get out a validation performance. Black-box optimization treats this as a black box and searches for the input lambda that maximizes the validation accuracy: you just search for the lambda that maximizes f(lambda). With that, you throw away all information about what happens inside the black box; you give up a lot by doing that, but it gives you a generic interface for applying this to all kinds of different problems. This black-box function is of course expensive to evaluate, so sample efficiency is important; we can't evaluate it too often.

Arguably the simplest approach for hyperparameter optimization is grid search. This is of course exponential in the number of hyperparameters: it's useful if you have one or two hyperparameters, but if you have many, it's probably not the best method. Random search, in contrast, is only exponential in the number of important hyperparameters. If you have, for example, one unimportant hyperparameter and one important hyperparameter, then with a budget of nine function evaluations random search tries nine different values of the important parameter, and can therefore use the budget much more effectively than grid search, which would only try three values of the important parameter, each evaluated three times (and the unimportant parameter doesn't matter). So random search definitely performs better than grid search if you have some unimportant hyperparameters, and oftentimes your parameters are not completely unimportant, but you definitely have some hyperparameters, such as learning rates, that are far more important than others. Random search is a useful baseline because of that, but it has the problem that it doesn't follow the information; it doesn't actually learn anything about the space. If performance in one part of the space is dramatically better than in other parts, random search just does not care about that at all, and that is of course suboptimal.

One approach that does care about this is Bayesian optimization, which is a very popular approach for black-box function optimization. It works as follows. You fit a probabilistic model to the function evaluations that you have made. Let's say we have a single hyperparameter (just because I can't make a nice figure in n dimensions, but this directly translates to that case), the true function is shown dashed, and we have evaluated it at two points. Bayesian optimization fits a probabilistic model to these function evaluations in order to predict, for unseen hyperparameter settings, what the performance would be like. Typically you use Gaussian processes for this, which give you a mean function (the solid line) and an uncertainty estimate (the blue funnel), and you use this uncertainty estimate to trade off exploration versus exploitation: exploitation in areas where the function is predicted to be high (in this case we're maximizing), and exploration in parts of the space where you haven't evaluated yet. This trade-off is formalized by means of a so-called acquisition function, of which there are various kinds; one popular one is the expected improvement over the best point seen so far. You then optimize this acquisition function over your hyperparameter space to find its maximizer, evaluate your function at that point, and refit your model, which in particular gives low uncertainty estimates around the new point and also a somewhat different fit globally. Then you recompute the acquisition function, optimize it again, evaluate the next point, and iterate until you're out of time.
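A minimal sketch of this loop for a single hyperparameter, using scikit-learn's Gaussian process and the expected improvement acquisition; the objective f here is a cheap stand-in for the real train-and-validate black box, and the dense candidate grid replaces a proper inner optimizer:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):  # stand-in black box: pretend this is validation accuracy
    return -(x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

X = list(np.random.uniform(0, 1, 3))   # a few initial random evaluations
y = [f(x) for x in X]

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.array(X).reshape(-1, 1), y)

    cand = np.linspace(0, 1, 1000).reshape(-1, 1)
    mu, sigma = gp.predict(cand, return_std=True)

    # Expected improvement over the incumbent (we are maximizing)
    best = max(y)
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    x_next = float(cand[np.argmax(ei)])  # maximize acquisition, then evaluate
    X.append(x_next)
    y.append(f(x_next))

print("best found:", X[int(np.argmax(y))], max(y))
```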
Now, this is a very popular approach; it's been around for over 40 years, and it's very sample efficient. It also works when the objective is non-convex, noisy, and has unknown derivatives, which are exactly the characteristics we actually face in hyperparameter optimization, and there are also recent convergence results, depending on some assumptions on the smoothness of your response surface, and so on.

One example for Bayesian optimization: I got an email today from Nando de Freitas, who knew that I'm giving this tutorial, and he told me that finally he can tell the world that Bayesian optimization was actually very crucial inside of AlphaGo. Here are some quotes from his email, which are quotes from a forthcoming paper of his with Yutian Chen; I'll just read this to you: "During the development of AlphaGo, its many hyperparameters were tuned with Bayesian optimization multiple times. This automatic tuning process resulted in substantial improvements in playing strength. For example, prior to the match with Lee Sedol, we tuned the latest AlphaGo agent, and this improved its win rate from 50% to 66.5% in self-play games. This tuned version was deployed in the final match. Of course, since we tuned AlphaGo many times during its development cycle, the compounded contribution was even higher than this percentage." I think this goes to show nicely that in practice Bayesian optimization definitely can have impact, and I would have been very surprised if Bayesian optimization were not used in AlphaGo, knowing that Nando is at DeepMind and was involved in the project.

Good, so Bayesian optimization is a very powerful method. However, there are some problems with the standard Gaussian process approach in this general AutoML framework. We have a lot of hyperparameters and highly conditional spaces; there are mixed discrete-continuous hyperparameters; you have high dimensionality with low effective dimensionality. For each of these problems there are fixes, but they are non-standard fixes. The noise is also sometimes heteroscedastic, meaning different hyperparameter settings yield different noise, and it might not be Gaussian; it might be multimodal, for example if some runs diverge and some runs converge under different learning rate settings. Gaussian processes are also not necessarily the most robust method with respect to their own internal hyperparameters: you need to set hyperpriors to get the best performance, and you need to pick the right kernel for your application. There is actually some work on picking the right kernel automatically, which pushes Gaussian processes towards more robustness, so I'm looking forward to that being applied in a lot of different packages. Also, the model overhead can be a problem, because Gaussian processes scale cubically in the number of data points. If the black-box function is truly expensive, this typically doesn't matter too much, but in multi-fidelity approaches you can actually get quite a lot of function evaluations, and then this model overhead can matter a lot and you need to resort to approximate Gaussian processes.
A different model that has also been used, in particular in my own work, is a random forest. This is a simple approach that works out of the box: it's not very sensitive to its own hyperparameters, it can directly pick out the important features, it can directly work in high dimensions with mixed continuous-discrete hyperparameters, and so on. The only thing that's not so nice about it is that it's not a great probabilistic model; it doesn't have a nice probabilistic interpretation, and we do need uncertainty estimates for Bayesian optimization. What we use are typically frequentist uncertainty estimates: we model the variance across the individual trees' predictions. If all the trees predict the same thing, we are certain; if the trees predict different things, we are uncertain. It's a simple model, and it's what we actually used in that 768-dimensional space.

There's also work on Bayesian optimization with neural networks; in particular, there are two types of work. One fits a standard neural network, then takes the last layer and does a Bayesian linear regression on the learned features in that last layer; it's called DNGO, "Deep Networks for Global Optimization". The follow-up work that we did uses fully Bayesian neural networks trained with stochastic gradient HMC, which gives a fully Bayesian interpretation. These networks can give you very nice predictions: in the regime where there's lots of data, they give very good predictions with low uncertainty, and in areas of the space where there's not a lot of data, they give large uncertainty. That's precisely what you need in order to do Bayesian optimization. However, so far, to the best of my knowledge, this hasn't been studied for high-dimensional spaces, conditional hyperparameter spaces, discrete spaces, etc., so that is definitely an important area to follow up on.

A final model for Bayesian optimization I wanted to talk about is the Tree of Parzen Estimators (TPE). It's actually quite different: it does not model the probability of the function value given the hyperparameters; instead, it fits density estimates of the probability that a hyperparameter setting is good and the probability that it is bad. To make an example: let's say we have evaluated this function four times (we want to minimize); there are two bad points and two good points. You fit these two density estimators, and then you use the probability of being good divided by the probability of being bad as an acquisition function. This acquisition function has been proven by Bergstra et al. to be equivalent to expected improvement, which is why people use this as a Bayesian optimization algorithm. You then optimize the acquisition function, get a new point, refit the kernel density estimators, recompute your acquisition function, maximize it again, and evaluate the next point. By evaluating there, you now have another good point, and a point that was previously "good" may now fall on the "bad" side; there is a hyperparameter saying which quantile of points you call good and which you call bad. The pros of this method are that it is very efficient (these kernel density estimators can be fit very quickly), very parallelizable, and very robust; it just works for all kinds of different dimensionalities. One con is that it is typically less sample efficient than Gaussian processes when you have the right kernel for the Gaussian process.
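A toy sketch of one TPE step for a single continuous hyperparameter, with scipy's kernel density estimator standing in for the Parzen estimators; the quantile, candidate count, and history below are illustrative, and candidates are drawn uniformly here whereas real TPE samples them from the "good" density:

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest(lambdas, losses, low, high, gamma=0.25, n_candidates=256):
    """Propose the next hyperparameter value by maximizing l(x) / g(x)."""
    lambdas, losses = np.asarray(lambdas, float), np.asarray(losses, float)
    threshold = np.quantile(losses, gamma)       # which quantile counts as "good"
    good = lambdas[losses <= threshold]          # density l(x) of good settings
    bad = lambdas[losses > threshold]            # density g(x) of bad settings
    l, g = gaussian_kde(good), gaussian_kde(bad)
    candidates = np.random.uniform(low, high, n_candidates)
    return candidates[np.argmax(l(candidates) / np.maximum(g(candidates), 1e-12))]

# Usage: after a few (lambda, validation-loss) observations...
history_l = [0.001, 0.01, 0.1, 1.0, 3.0]
history_y = [0.40, 0.25, 0.22, 0.35, 0.60]
print(tpe_suggest(history_l, history_y, low=1e-4, high=10.0))
```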
A final family of methods, not Bayesian optimization, that I wanted to mention is population-based methods, because they are also used in neural architecture search. There you have a population of different configurations; you want to maintain diversity and avoid stagnation (you don't want all the configurations to be the same), but you also want your population to improve over time. One example of population-based methods are evolution strategies, which work as follows: you have one incumbent point, you sample (for example from a normal distribution) around that point, evaluate all these sampled points, pick the best ones, and move your incumbent to a weighted average of the good points. A popular variant of evolution strategies is CMA-ES, the covariance matrix adaptation evolution strategy, which basically wins the black-box optimization competition every year. That competition is about cheap function evaluations: if you have a budget of something like a million function evaluations, then these evolution strategies are a lot better than Bayesian optimization, which is just not targeted at that setting and would also be far too slow there. Recently we also looked at what CMA-ES actually does for hyperparameter optimization and showed that it is quite competitive; if you have parallel resources, it can actually be better than all the Bayesian optimization methods we tried. However, it only works on purely continuous spaces.

All right, that was an overview of different black-box optimization methods; now I'll talk about how to speed things up. Why do we need to speed things up? Because this black-box view is just really too slow for deep learning and for big datasets: if a single function evaluation takes a week or so, then getting 50 samples would take a year, and obviously we don't have a year for our hyperparameter optimization. There are four types of approaches for going beyond black-box optimization. Meta-learning I will not talk about, because that is part three of this tutorial and Joaquin will cover it later. I will briefly talk about the first two approaches and then focus on multi-fidelity optimization, because that is the most mature method right now and is at a stage where you can use it as a tool.

Hyperparameter gradient descent is the first one I want to talk about. This formulates hyperparameter optimization as a bi-level program: you have an outer objective, minimizing the validation loss as a function of your hyperparameters lambda and your inner model parameters w, and the w's are set by optimizing an inner loss, namely the training loss given a hyperparameter setting lambda. For example, you optimize with SGD to get w*, and then you evaluate the validation loss at w*. You can obtain gradients of the validation loss with respect to the hyperparameters by, for example, differentiating through the entire optimization process that leads to w*. This sounds expensive to start with, and it is indeed expensive in terms of memory or time (you can choose which), but if you have a lot of hyperparameters it can actually be very useful, because you don't just get a single function evaluation out, you get a whole gradient out.
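Written out, the bi-level program just described is (with lambda the hyperparameters and w the model weights):

```latex
\lambda^{*} = \operatorname*{arg\,min}_{\lambda}\;
  \mathcal{L}_{\mathrm{valid}}\bigl(w^{*}(\lambda),\, \lambda\bigr)
\quad \text{s.t.} \quad
w^{*}(\lambda) = \operatorname*{arg\,min}_{w}\; \mathcal{L}_{\mathrm{train}}(w, \lambda)
```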
However, this approach is somewhat expensive, and an alternative is to interleave optimization steps: you do a gradient step on the hyperparameters with respect to the validation performance, and then a gradient step on the parameters with respect to the training loss. Interestingly enough, this actually works quite well in practice, but there's no theory about why it works and where it might fail, so this is a very interesting area of research that I would encourage people to work on. In general, if you can do gradient-based optimization, there are of course a lot of cool things you can do, and neural architecture search is actually also going in this direction with gradient-based architecture search; I'll talk about that in a bit.

The second way of going beyond black-box optimization is to do probabilistic extrapolation of your learning curves in order to do early stopping. You have an initial learning curve and you want to know where it is going: will this do well or not? You can learn to extrapolate; for example, you can use parametric learning curve models fit with MCMC to make these predictions, or you can use Bayesian neural networks for the extrapolations.

The final way of going beyond the black box is multi-fidelity optimization. What is multi-fidelity optimization? It's about using cheap approximations of the black-box function whose performance correlates with the actual black box: you want a cheap approximation where, if a hyperparameter setting does well there, it typically also does well on the expensive black box. How can you get these cheap approximations? In a variety of ways: you could use subsets of your data, fewer epochs of SGD, shorter MCMC chains in Bayesian deep learning, or fewer trials in deep reinforcement learning; we've actually done all of these. You could also use downsampled images in object recognition; we haven't done that, but I'm pretty sure it would also work. This approach is also applicable in a lot of other domains, such as fluid simulations, where you could use fewer particles or shorter simulations; it's really a generic optimization trick that applies beyond hyperparameter optimization.

How useful are multiple fidelities? Here's an example: an SVM fit on MNIST. On one side is the entire MNIST dataset, 50,000 data points; on the other, roughly 400 data points, over the two most important hyperparameters, C and gamma. You can see that the response surfaces look quite similar. In particular, you don't want to do a lot of function evaluations on the full dataset just to figure out that some area of the space is bad, if you can already figure that out on the much cheaper approximation; and importantly, function evaluations on the small subset are about ten thousand times cheaper, because SVMs scale super-linearly in the number of data points. So you want to do many cheap evaluations on the small subsets and few expensive evaluations on the large dataset, and if you do that, you can get 10- to 1,000-fold speedups over doing Bayesian optimization on the expensive black-box function directly.
How do you do that? You can fit a Gaussian process model that takes as input the hyperparameters and the budget, predicts how well that will work, and then chooses both lambda and the budget in order to maximize the bang for the buck; for example, you maximize information gain per time spent, information gain about where the global optimum lies. There is a variety of methods that all fall into this general family of multi-fidelity Bayesian optimization. However, this is not trivial: there are quite a few approximations involved, it's hard to code up, and you need to get the kernel right; if you have high-dimensional and conditional spaces, this might not actually be the best approach.

A much simpler approach is to just sample a bunch of random configurations at the cheapest fidelity, take the best fraction thereof and move them to the next budget, take the best fraction of those and move them to the next budget, and so on up to the maximum budget. This is the approach called successive halving. Here's another visualization of successive halving for a different fidelity, namely wall-clock time: you evaluate a bunch of hyperparameter settings, all for the same initial wall-clock time, take the top-performing ones and continue them to the next fidelity, take the top-performing ones of those and continue them, and so on; voilà, you get something like what you would do as a human: you wouldn't continue a poor learning curve all the way for a week just to see that it didn't work; you would cut it off early.

There is one potential problem, however: if a learning curve that looks poor early on actually does keep going up and would give you the best final performance, then successive halving would never find it, even with an infinite amount of compute. This is the problem that the extension called Hyperband fixes. Hyperband calls successive halving iteratively: the first time in the most aggressive way, starting from the lowest fidelity... [at this point the presentation crashes and is restarted] ...so, we were talking about Hyperband. Hyperband first calls successive halving in the most aggressive setting, then calls it again starting from a less aggressive setting, then again starting from a higher fidelity still, and finally calls it using only the full black box. So, given an infinite compute budget, you would find the best configuration even if it's one whose learning curve only pays off at the full budget. (I'll stop using the laser pointer, because then my presentation does not want to continue.)

Hyperband has a lot of advantages: you get strong anytime performance thanks to the multiple fidelities; it's general purpose (low- and high-dimensional spaces, conditionality, and so on are not a problem); it's very easy to implement; and it's scalable and easily parallelizable. However, it is based on random search, so it doesn't exploit knowledge about which hyperparameter settings work well, and that's where Bayesian optimization is strong. Therefore we combined them to get the best of both worlds: we use Bayesian optimization to pick the configurations and then use Hyperband to allocate their budgets. This is the BOHB approach.
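A minimal sketch of successive halving as just described, assuming hypothetical `sample_config` and `evaluate(config, budget)` callables that sample a configuration and return its validation loss at a given fidelity; the halving rate and budgets are illustrative:

```python
import numpy as np

def successive_halving(sample_config, evaluate, n_configs=81,
                       min_budget=1, max_budget=81, eta=3):
    """Keep the best 1/eta of configurations at each rung, raising the budget eta-fold."""
    configs = [sample_config() for _ in range(n_configs)]
    budget = min_budget
    while budget <= max_budget and configs:
        losses = [evaluate(c, budget) for c in configs]
        survivors = max(1, len(configs) // eta)       # best fraction moves on
        order = np.argsort(losses)
        configs = [configs[i] for i in order[:survivors]]
        budget *= eta
    return configs[0]

# Hyperband would call this routine repeatedly, starting from less and less
# aggressive (i.e., larger) minimum budgets, so that a slow-starting
# configuration still gets a chance on the full budget.
```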
To visualize how this works: comparing Hyperband against random search, Hyperband is initially much faster, but given enough time it's not much faster than random search. Bayesian optimization is the reverse: initially there's no speedup over random search, because you need to explore the space first in order to figure out where the good points lie, but given enough time you converge faster to the optimum. (You can speed this up with meta-learning, and Joaquin will talk about that, but if you learn from scratch, it's not going to beat random search to start with.) Using multiple fidelities, you can make it better than random search both in the beginning and at the end. This BOHB approach also has almost linear speedups from parallelization, so that is not a problem either.

That is why, for practitioners asking which tool to use right now: if you have multiple fidelities, I would recommend BOHB, because it combines the advantages of TPE and Hyperband; it's the tool that, if you don't want to think about it, will work quite well. Other multi-fidelity Bayesian optimization methods may also work well if you fit the kernel correctly. If you don't have access to multiple fidelities: for low-dimensional continuous spaces I would recommend Bayesian optimization with Gaussian processes, for example Spearmint; for high-dimensional categorical conditional spaces I'd recommend Bayesian optimization with random forests, as in SMAC, or the TPE method based on kernel density estimates; and if your space is purely continuous and you have a relatively large budget, and maybe want to parallelize nicely, then CMA-ES can be very competitive.

There are several open-source tools based on hyperparameter optimization. Auto-WEKA I already mentioned; it is based on the random-forest-based Bayesian optimizer SMAC. There is Hyperopt-sklearn, which is based on scikit-learn and TPE. There is auto-sklearn, also based on scikit-learn, using SMAC (or, in a newer version, BOHB); it also uses meta-learning to jumpstart the optimization, plus post-hoc ensembling, and it won the first and second AutoML competitions (a third one is going on right now, and there's going to be a workshop on challenges in machine learning where that will be discussed). There is TPOT, which is also based on scikit-learn, uses evolutionary algorithms, and focuses more on pipeline construction. And finally there is H2O AutoML, which is based mostly on random search, plus stacking and very efficient implementations.

I would like to mention that AutoML can act as a democratization of machine learning. Auto-sklearn didn't only win against other AutoML systems; it also won against human experts in a Kaggle-like competition, in particular performing better than up to 130 different teams. And it's really easy to use: it's BSD-licensed, it's a plug-in estimator for scikit-learn, and it already has a lot of community adoption. This really opens a door for everyone to use effective machine learning, which can in some cases work better than human experts, at least when the human experts don't have domain knowledge; domain knowledge beats everything, and if you can generate better features, that is of course very helpful.
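Since auto-sklearn is a drop-in scikit-learn estimator, usage looks roughly like this (a minimal sketch; the dataset is arbitrary, and the 30-minute budget mirrors the anecdote below):

```python
import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Search over pipelines + hyperparameters for 30 minutes, then ensemble the best models
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=1800)
automl.fit(X_train, y_train)
print(accuracy_score(y_test, automl.predict(X_test)))
```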
One quick example application of auto-sklearn: in a collaboration with the Freiburg robotics group, they were interested in a binary classification task. The robot had an object in its hand, wanted to place it down, and was wondering: will the object fall over or not? We used a dataset of 30,000 data points for which a bachelor student had manually created 50 features, and back in the day (this was 2015) he used Caffe and spent about three months to get to an error rate of 2%, which was a very nice bachelor thesis; everybody was very happy. After that, auto-sklearn came along, and we actually got to a 0.6% error rate in 30 minutes. This is not to diss the bachelor student: he generated the features without which auto-sklearn wouldn't have done anything. But it goes to show that deep learning, if you don't have expertise and haven't played with it a lot before, is not necessarily the best tool for a given application, so you should also look at simple baselines, in particular if you have featurized data. Again, this is not to diss deep learning: all the rest of the talk will be about deep learning, because deep learning is of course awesome and can generate abstract representations of the data from raw features.

The second part is on neural architecture search. This is based on a review article I wrote with my PhD student Thomas Elsken and Jan Hendrik Metzen, a research scientist at Bosch, and it is also chapter 3 of the AutoML book.

Let's first talk about search space design. The simplest search space is a chain-structured space: you just decide how many layers you want and, for each layer, what its type is (is it going to be a convolutional layer, max pooling, fully connected, etc.). This was historically the first type of space people looked at, but over time people moved to more complex spaces with multiple branches and skip connections, inspired by ResNets and the like. Since last year, the type of search space pretty much everyone is using is the cell search space, in which you parameterize a building block that takes some input and yields some output, and then you stack these building blocks, very much like in residual networks or Inception networks. Typically people have two different cells: one regular cell, and one reduction cell that reduces the spatial resolution of the image. Typically the macro-architecture is just a chain structure; you could also think of doing multi-branch connections or skip connections in the macro-architecture, but nobody really does that; the multi-branch structure and the skip connections live inside the individual cells that you stack.

These cell search spaces have two advantages. The first is that they are very small compared to a general space where you parameterize the entire network, so you can search them more efficiently. The second is that if you have found a cell on a small dataset like CIFAR, you can reuse that cell on a larger dataset such as ImageNet; this does give you better performance and has improved the state of the art for ImageNet and other large datasets. One disadvantage of the cell search space is that you will not find entirely new architectures; you will only find cells to stack into a manually defined macro-architecture. So if you want to find completely new architectures, you might want to go beyond the cell search space again.
I promised I'd talk about how you can phrase neural architecture search as hyperparameter optimization, so here's the cell search space that Zoph et al. used. They have a recurrent neural network controller which picks, for each block of the cell: the first hidden state that goes into the block; the second hidden state; the operation to apply to the first hidden state; the operation for the second hidden state; and finally a combination operator, i.e., how to combine the results of the two branches, which could be an addition or a concatenation. So you end up with five categorical choices for each block. The first two categorical choices have domain {0, ..., n-1} in the n-th block, because those are the hidden states available to pick from; the operation choices range over different operations, such as max pooling, separable convolutions, and so on; and the last categorical choice has very few options. You have 5 categorical choices per block and B blocks, with B typically set to 5; with B = 5 this gives you 25 hyperparameters that fully define the cell, and there isn't even any conditionality: these are just categorical hyperparameters, and standard hyperparameter optimization methods can be used in this space.

If you have an unrestricted search space, you can still write it as a hyperparameter optimization problem, but then you have conditionalities, and you have to impose a limit on the maximum number of layers. For example, if you have a chain-structured space that you don't want to restrict much, then as a hyperparameter optimization problem you still need to set some maximum number of layers, say 10,000; you have a top-level hyperparameter that is the number of layers, and for every layer beyond that number, all of its hyperparameters are simply inactive. So that's how you can write neural architecture search as hyperparameter optimization.
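As a sketch, the 5 x B categorical choices of one such cell could be written out as below; the operation list follows the talk's examples, while the exact hidden-state indexing (two cell inputs plus the outputs of earlier blocks) is my assumption about the encoding, not something stated on the slides:

```python
# Encoding of one cell with B = 5 blocks as 25 categorical hyperparameters.
OPS = ["identity", "3x3 conv", "3x3 sep. conv", "5x5 sep. conv", "max pooling"]
COMBINE = ["add", "concat"]

search_space = []
for b in range(5):
    # Block b may consume the 2 cell inputs or any earlier block's output
    hidden_states = list(range(2 + b))
    search_space += [
        ("input_1_block_%d" % b, hidden_states),
        ("input_2_block_%d" % b, hidden_states),
        ("op_1_block_%d" % b, OPS),
        ("op_2_block_%d" % b, OPS),
        ("combine_block_%d" % b, COMBINE),
    ]

print(len(search_space))  # 25 categorical choices, no conditionality
```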
Next, let's talk about black-box optimization methods for neural architecture search. The method that really popularized neural architecture search in the community was neural architecture search with reinforcement learning, by Zoph and Le. They used a recurrent neural network controller to sample the architecture, one block at a time, then trained the child network with that architecture to get an accuracy, and used REINFORCE to train the parameters of the controller. This became really popular because they were the first to achieve state-of-the-art results on CIFAR-10 and also on a language dataset, Penn Treebank. However, they used very large computational resources: 800 GPUs for three to four weeks, 12,800 architectures evaluated, and so on. That is why recent work really looks at going beyond the black-box function; but if you do have these large computational resources, this was the first paper that could really achieve new state-of-the-art performance with an automated pipeline.

There's lots of other work on black-box optimization. Neuroevolution methods, for example, actually go back to the 1990s; typically, they optimized both the architecture and the weights with evolutionary methods, but that doesn't scale to very large networks as well as stochastic gradient descent does, which is why more recent evolution-based methods only use evolution to get the architecture and then train its weights with stochastic gradient descent. You can see nicely how you can start with a very simple architecture and evolve it over time to become more and more complex and better performing. These evolutionary algorithms (not evolution strategies) are actually the best-known approach on CIFAR-10: they were compared head to head by Quoc Le's group, and evolution actually did better than reinforcement learning. This evolution approach really just uses a fixed-length search space; there is no neural network controller, just a simple evolutionary algorithm picking 50 different hyperparameters (50 because it's two cells, each with 25 hyperparameters).

Bayesian optimization was also used for neural architecture search before it became mainstream. Back in 2013, James Bergstra did joint optimization of a computer vision pipeline with 238 hyperparameters using TPE, and in 2016 we had Auto-Net, which was part of the AutoML challenge and was the first AutoDL system to win a competition dataset against human experts. Bayesian optimization with Gaussian processes is a bit harder to use for neural architecture search, because you need the right kernel, but there are kernels for neural architecture search: already in 2013 there was the Arc kernel, and in 2018 there's the nice NASBOT kernel. A method related to Bayesian optimization is sequential model-based optimization, which also uses a surrogate model to pick the next architecture to evaluate; it also recently yielded state-of-the-art results compared to reinforcement learning.

All right, that was black-box optimization for neural architecture search; we had already seen most of black-box optimization for hyperparameter optimization anyway. Now let's look at going beyond the black box. There are four different approaches, and again I will not talk about the last two in depth. On meta-learning I'll say two sentences: there is actually only one paper, here at NeurIPS, which basically learns a controller across different datasets, so that given a new dataset it already has a warm-started controller and can find good architectures for the new dataset faster than when learning from scratch. There are two papers using multi-fidelity optimization with BOHB: one doing standard joint architecture and hyperparameter search, and one doing the same in a reinforcement learning context, where you optimize the policy network together with the hyperparameters of the reinforcement learning algorithm. These papers don't do neural architecture search in a really big space; they have only a couple of options for the convolutional neural networks, but also some hyperparameters that are optimized jointly.

I will talk about the two approaches we didn't see in hyperparameter optimization, because they are special to neural architecture search: weight inheritance with network morphisms, and weight sharing with one-shot models. The first is based on network morphisms. Network morphisms are operators in architecture space that change the network structure but not the function the model computes. For example, if you take a network and add one layer that you initialize as an identity mapping, then you have a new architecture, but the same function. This allows for efficient moves in architecture space: you take some architecture with a given performance, apply some network morphisms (for example, widen a layer, add a layer, or add a skip connection), then train for a little bit, just a few epochs, to optimize the new weights and typically improve performance; then you pick the best resulting model and iterate. It's a very simple approach that yields a very efficient architecture search: you can do architecture search in about a day on one GPU using this.
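The add-a-layer morphism is easy to sketch in PyTorch: the new layer is initialized to compute the identity, so the enlarged network starts out representing exactly the same function. This is a toy fully connected version under that assumption; real implementations also handle convolutions, widening, and skip connections:

```python
import torch
import torch.nn as nn

def insert_identity_layer(net: nn.Sequential, position: int, width: int):
    """Network morphism: insert a linear layer initialized as the identity map."""
    new_layer = nn.Linear(width, width)
    with torch.no_grad():
        new_layer.weight.copy_(torch.eye(width))   # identity weight matrix
        new_layer.bias.zero_()                     # zero bias
    layers = list(net.children())
    layers.insert(position, new_layer)
    return nn.Sequential(*layers)

net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
bigger = insert_identity_layer(net, position=2, width=16)

x = torch.randn(4, 8)
assert torch.allclose(net(x), bigger(x))  # new architecture, same function
```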
The second approach I want to talk about is weight sharing and one-shot models. In 2016 there was a paper on convolutional neural fabrics, which embed an exponentially large number of architectures in a so-called fabric, where each path through the fabric is one neural network. The idea is somewhat similar to dropout: you train the neural fabric, which trains all of the different paths at the same time, and basically you get an exponential ensemble of different architectures. The relationship to dropout wasn't that clear there yet; it became clearer in the paper on understanding and simplifying one-shot architecture search, which calls the fabric-like network the one-shot model and uses path dropout on it, to make sure the individual architectures don't rely too much on the rest of the model, so that you don't just do well as an ensemble, but the individual architectures also do well on their own. There is a related approach that uses reinforcement learning on top of this one-shot model: it samples one path at a time in the one-shot model and trains its weights, which thereby also trains the weights of all the other architectures that share some part of the path. And there is one other, somewhat different approach to weight sharing, which trains a hypernetwork that generates the weights of different architectures: the architecture is an input to the hypernetwork, and the hypernetwork outputs the parameters to use for that particular architecture. What's shared is the single hypernetwork used to generate the weights for all the different architectures, so this can also be understood in this weight-sharing view.

One of the currently most popular approaches for neural architecture search is DARTS, which stands for Differentiable Architecture Search. It relaxes the discrete problem: you look at the one-shot model, but each operator gets a weight alpha that multiplies the result of that operator. So you have the cell architecture, and on an edge where you don't know which operator to use, you could apply, say, three different ones; you initialize the alphas uniformly, giving every operator the same weight, and just add up the weighted results. Then you do gradient-based optimization in this space, using a very similar approach to the one I mentioned before for hyperparameter optimization (by Luketina et al., if I decode the reference correctly): you interleave optimization steps with respect to the validation error for the architecture parameters and with respect to the training error for the weights. In the end, some architecture parameters are much stronger than the others, and you just take an argmax to pick the most promising architecture. Because you only interleave single gradient steps, one with respect to validation error and one with respect to training error, you're only maybe a factor of two or three slower than a normal optimization of your training error, and therefore you can now do neural architecture search very efficiently.
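A sketch of the core DARTS relaxation for a single edge of the cell: each candidate operation gets an architecture weight alpha, and the edge outputs a softmax-weighted sum over all operations. The operation list is illustrative; real DARTS does this for every edge and alternates optimizer steps on alpha (validation loss) and on the weights (training loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge of the relaxed cell: a weighted sum over candidate operations."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.MaxPool2d(3, stride=1, padding=1),
        ])
        # Architecture parameters alpha, initialized uniformly (all zeros -> uniform softmax)
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

edge = MixedOp(channels=16)
out = edge(torch.randn(2, 16, 32, 32))

# Training interleaves one step on the weights (training loss) with one step
# on alpha (validation loss); at the end, argmax over alpha picks the operator.
print(edge.alpha.softmax(0), out.shape)
```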
There are a lot of open questions here, such as why this works if you just interleave these steps, and these are currently being studied. There is a lot of promising work under review; for time reasons I'll just give a very brief overview: there are several ICLR submissions following up on DARTS, there's interesting follow-up work on hypernetworks, and there's an interesting line of work on multi-objective neural architecture search, to very explicitly give you neural networks that work well under resource constraints.

I do want to leave you with a couple of remarks on experimentation in neural architecture search. The final results of different methods are often quite incomparable, for a variety of reasons: different training pipelines are used, there is often no source code available, there are different hyperparameter choices, and there are different search spaces and initial models; all of that makes it very hard to compare methods. It's nice when people release the final architectures, but that doesn't help for comparing the algorithms to each other. On hyperparameter choices: there are very different hyperparameters during the search and during the final evaluation, and it matters a lot where these hyperparameters come from. If you have to run hyperparameter optimization on your method before it does well, then it doesn't really work as an AutoML system, because an AutoML system should work off the shelf given a new dataset, without being tuned first. It also matters whether you start from random or whether you start from a state-of-the-art neural network and only improve it locally. So if you review neural architecture search papers, I would ask you to look beyond just the error numbers on CIFAR, look at the method itself, and ask for these details. We also need benchmarks that are not just "CIFAR", but "CIFAR with this search space, with this training pipeline, with these hyperparameters"; then we can compare different neural architecture search methods on a fair basis. Finally, experiments are often very expensive, so there's also a need for cheap benchmarks that still give us information about which methods work well, so we can do a lot of runs and get statistical significance.

All right, to wrap up: hyperparameter optimization and neural architecture search are really exciting fields, and there are several ways to go beyond black-box optimization. One shameless advertisement: we are building an AutoML / automatic deep learning team. We want to build a research library of building blocks for efficient neural architecture search and an open-source framework, Auto-PyTorch, and we have several openings at all levels. With that, I'm at the end of my part, and we'll now take a five-minute break; I encourage you to stretch your legs, maybe go to the washroom if you need to, but we'll also take some questions during this period.

[Q&A] Are there questions? There are mics on each side; if you're at the mic, just tap on it and ask away.

[Audience] I have a question about searching over these continuous hyperparameters: there always seems to be the extra question of the range, and maybe the scale, of the hyperparameter. How do you handle that sort of thing?
[Frank Hutter] Very good question. If you have a very large range, then it's going to be very hard to find the optimum, so the search is going to be less efficient; if you have a small range, then you might not include the optimum for this particular dataset, so this definitely is an issue. What you can do with TPE is specify a prior over what you expect to be a good hyperparameter setting. Priors work much more nicely in TPE than in standard Bayesian optimization: a prior in standard Bayesian optimization would be "I expect the function value to be this, given this hyperparameter", whereas in TPE the prior is "I expect the learning rate to typically be good around 1e-3 or 1e-2". That's a prior that's much easier for humans to give: you just specify a Gaussian centered where you think it should be centered, while still giving the search the opportunity to overwhelm that prior if there is enough data suggesting otherwise.

[Audience] In the recent advances in neural architecture search, we've seen methods using reinforcement learning or genetic algorithms, while for hyperparameter optimization we tend to see more Bayesian optimization. Do you see any reason for this difference, or is it a matter of taste of the different authors?

[Frank Hutter] I think the reason is clearly that Bayesian optimization works really nicely for low-dimensional continuous spaces with Gaussian processes, and most of the Bayesian optimization community really comes from Gaussian processes. Those are the types of problems they deal with all the time, where Gaussian processes do well, and that's the setting in which Bayesian optimization methods are being developed. With new kernels, they also apply to neural architecture search, and we are starting to see some such approaches. We've used random forests in these high-dimensional spaces before, but we are kind of an outlier in the Bayesian optimization research community in that respect; most other people use Gaussian processes. I also think Bayesian neural networks are extremely promising, and they could appeal to a whole different community, namely the Bayesian deep learning crowd; when they start using Bayesian optimization with Bayesian neural networks, I think we'll also see a lot more of that type of work for neural architecture search. The flip side, why haven't we seen a lot of reinforcement learning for hyperparameter optimization, is a good question; that might actually work quite well. And we do see evolutionary algorithms for hyperparameter optimization, just not at NeurIPS, because those papers tend to be rejected at NeurIPS; they appear at GECCO and other venues of the evolutionary algorithms community.

[Audience] Hi, I was curious if you have any thoughts on, I guess I'd call it "auto-AutoML": learning a neural network that knows how to optimize the hyperparameters of other neural networks, generically. Obviously this is slower, but it has the theoretical advantage that you run the outer loop exactly once for everything, and afterwards you only run the normal two-stage optimization. Is there any work in this area?

[Frank Hutter] This is a very good question. In fact, there will be an AutoDL competition where you have to submit a system that just does everything, and the optimal system there would essentially be the result of the process that generates such an AutoDL framework. What you want to do is, of course, use a lot of different datasets and reason across datasets, so we'll see a lot more
Question: Perhaps a slightly naive question, but do you have high expectations that these search methods will find genuinely interesting and novel architectures? For instance, NASNet worked a little better on ImageNet, but it wasn't a massive improvement, and it doesn't really seem we learned anything about building better architectures, compared to, say, ResNet, where we did learn something.

Answer: That's a good question, and I'll really leave it to time to tell. Now that we have more efficient neural architecture search methods that a lot of people can use without access to Google-scale computational infrastructure, I think we'll see thousands of people starting to use these methods, and maybe someone will come up with something cooler and we'll really understand more about this space. We're not there yet, but I would expect we'll get there in the next couple of years.

Question: How do you compare the reinforcement learning approach and the Bayesian methods in terms of sample efficiency and computational efficiency on different kinds of problems?

Answer: I would generally say Bayesian optimization tends to be more sample efficient, but there haven't been any head-to-head comparisons. We're actually working on one, so after that I'll have some data; my expectation would be that Bayesian optimization is going to be more efficient. It tends to work particularly well for these low-dimensional spaces, whereas reinforcement learning is not geared towards that as much.

Thank you; we have to cut the questions a bit short because we're running short on time, and there will be another question period later. Next up is Joaquin.

Good afternoon, my name is Joaquin Vanschoren, and I will talk about learning how to learn. Frank has told us a lot about what you can do when you're given a new task and have to start from scratch, but as humans we basically never run into that situation: we always have some prior experience that we can use to solve a task more efficiently than if we had to start from scratch. Learning is a never-ending process. Imagine that you are a small baby: the first time, learning to walk is very hard, you fall down a lot, and you need many, many tries before you can pull it off. But the next time, when you learn to ride a bike or pick up some other motor skill, it becomes a lot easier, because you first learned how to walk; babies don't start out riding bicycles. Every time we encounter a new task, we learn to do it more efficiently based on prior experience, a process we call meta-learning: we learn with less trial and error and with less data. What is actually happening is that we transfer an inductive bias from prior learning episodes to the new task.
An inductive bias is any assumption or prior that you put into your learning system beyond the training data. If we can extract useful information, such as constraints, beliefs, or representations, from previous tasks, the new task becomes much easier, and we'll see a range of techniques that do exactly that. The underlying caveat is that the prior tasks have to be similar: you can't just transfer assumptions from tasks that are very different. If they aren't at least somewhat similar, you may actually harm learning.

We call this field meta-learning because we collect metadata about prior learning episodes and transfer it to a meta-learner. The meta-learner gets a bunch of metadata and has to make sense of it and use it in a useful way to construct a base learner, which then does the actual modeling. Sometimes these are squashed together and the meta-learner builds models directly; we'll come back to that.

I'm subdividing the field into three levels, each requiring more and more similar tasks. In the first type of problem, the tasks can be very different from each other, and we just learn general knowledge about tasks. This is typically what humans do: when you're confronted with an unfamiliar task, you first try whatever worked well in the past in general, nothing very specific, just something that always works. Later, once we have more information about the task and can compare it to prior tasks, we can reason about how this task differs from prior tasks and transfer information in a much more targeted way. And finally we get to tasks that are so similar that we can take a trained model from a prior task and repurpose it for the new task.

For the first type, where tasks can be quite different, we look at the performances of previous approaches, and we encode exactly what we did as a configuration: the whole set of choices, such as the neural architecture, the algorithm, the pipeline of steps it builds, and the hyperparameters you changed, everything that uniquely defines your model. That configuration is the lambda here. Given a configuration and a task, you get a performance out, and the meta-learner has to take these configurations and performances and figure out how to use them to solve a new task. The first and simplest way is to simply remember what worked well in the past: you build a general ranking of the best approaches, and on a new task you just go top to bottom; you try the method that works best in general, then the second, then the third. This is very general, and it's also useful as a warm-starting technique: if you're doing Bayesian optimization, you can start with the overall top-10 configurations to warm-start that search. (A minimal sketch of this global top-k idea follows below.)
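A minimal sketch of that global top-k ranking, assuming a hypothetical file of prior evaluations with tasks as rows and configurations as columns:

import pandas as pd

# scores: rows = prior tasks, columns = candidate configurations,
# values = e.g. accuracy (hypothetical file from a meta-database)
scores = pd.read_csv("prior_evaluations.csv", index_col=0)

# rank configurations within each task (1 = best), then average across tasks
avg_rank = scores.rank(axis=1, ascending=False).mean(axis=0)
top10 = avg_rank.nsmallest(10).index.tolist()
print(top10)   # try these first, or use them to warm-start Bayesian optimization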
Another thing you can do is configure your search space with the meta-learner. When you face a problem with many variables, the first thing you want to do is eliminate the variables that make the problem harder: which hyperparameters are really important to tune, and which can you just leave at their defaults? One way of doing this is functional ANOVA: you take every hyperparameter individually, you vary it, and you see the effect on performance. If a hyperparameter has a large effect on performance, it's one you want to keep; if not, you throw it out of your search space. Now, sometimes a hyperparameter causes a lot of variance but also has a very good default, so if you just keep the default you'll be fine. That is captured by tunability: you first learn a good default, then you measure how much improvement you can still get by tuning that hyperparameter, you rank the hyperparameters by their tunability, and you use only the top ones. A final thing you can do is look at which prior tasks were similar and remember what did not work, because you won't try that again; it's a learning experience: you tried something before and it failed, so you probably don't try it again.

You can also judge how similar tasks are in hindsight. You have a new problem, you try a couple of approaches, and you see how they perform; then you can start reasoning: given that this worked and that didn't, which tasks do I know from the past where the same was true? If two tasks show the same performance pattern over similar configurations, those tasks must be somehow similar to each other. One way to express this is with relative landmarks: you just remember performance differences between pairs of configurations, and if two tasks have correlating relative landmarks, those tasks are likely similar. (A small sketch of this similarity measure appears below.) The way you use this: you first evaluate one good configuration; based on the outcome you ask which prior tasks are similar; on those similar tasks you look up which configurations did better than your current one, and you try the best of those. That gives you one new evaluation point, you update your task similarities, you pick a new competitor based on what worked well on the similar tasks, and you repeat until you stop. This is a very effective way of searching the space when you have no other information to work with.
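A small sketch of that similarity measure; y_a and y_b are assumed to hold the scores of the same n configurations evaluated on task A and task B:

import numpy as np
from scipy.stats import spearmanr

def relative_landmark_similarity(y_a, y_b):
    # all pairwise performance differences ("relative landmarks")
    d_a = y_a[:, None] - y_a[None, :]
    d_b = y_b[:, None] - y_b[None, :]
    iu = np.triu_indices(len(y_a), k=1)        # each pair once
    corr, _ = spearmanr(d_a[iu], d_b[iu])      # correlated differences -> similar tasks
    return corr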
Now, we've seen Bayesian optimization, and a lot builds on that as well. Within a single task you try a number of configurations, you get performances, you fit a surrogate regression model, and you use it to choose the next configuration. Unfortunately, when the task is done and you're given a new task, you forget all of this useful information. What if we could take surrogate models that learned very well what worked on a specific task, and transfer them to new tasks, so we don't have to start from scratch? One way is surrogate model transfer: the assumption is that if a task is similar, a surrogate learned on it will again be useful. You build and store a surrogate model per task, and on a new task you combine all these surrogate models. There are basically two ways to do this. One is to first measure a distance between tasks and then weight the contribution of each surrogate by how similar the tasks are: a very similar task gets a lot of weight; a very distant task gets ignored. A nicer way, I think, is to build a Gaussian process mixture weighted by performance on the current task: you let all of these GPs make predictions, you see how well each of them does on the new task, and you increase the weights of the ones doing well and decrease the weights of the ones doing badly. It's like having a bunch of advisors: if some give you good advice, you listen more to them; if not, you listen less. (A rough sketch of this weighting follows below.)

With GPs there is always the question of scalability, and if I give you millions of data points from previous tasks, it's a pity you cannot use them all in one large surrogate model. This was addressed by Perrone and colleagues at Amazon. They say: I can't train a Gaussian process on this, but I can train a Bayesian linear regression surrogate, which is linear in the number of observations. Of course, a linear surrogate is probably not a good approximation of how a hyperparameter behaves, but if you add second-degree and third-degree polynomial terms, you can represent its behavior much better. Then the question is which basis functions to use, because some hyperparameters behave very linearly and some very nonlinearly, so you have to learn that per hyperparameter; in this case it's learned with a neural network: you give the network all the configurations and performances, and it learns the optimal basis expansion, the optimal set of hidden dimensions. On top of that you simply fit a Bayesian linear regression model, you can use it perfectly well for predictions, and it scales much better.

Another thing you can do is combine the surrogate models from different tasks using multi-task Bayesian optimization. It works like in the figure here: assume you have three tasks, the green, the red, and the blue, and at some point you want to evaluate on the blue task but you have a lot of uncertainty. If you can include information from the red and the green tasks, you get more information about this hyperparameter, and thus a much better estimate of performance; you actually use information from the other tasks for this specific blue task. This is very nice but not so scalable; that was solved not so long ago by Frank's group using Bayesian neural networks, which are neural networks that output a mean and an uncertainty, and you can use these in a multi-task way and be efficient as well.
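A rough sketch of the advisor-style weighting with scikit-learn GPs; previous_task_runs is an assumed list of (configurations, scores) arrays from prior tasks, and the exponential weighting is just one illustrative choice:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# one fitted surrogate per previous task (previous_task_runs is assumed given)
surrogates = [GaussianProcessRegressor().fit(X_prev, y_prev)
              for X_prev, y_prev in previous_task_runs]

def ensemble_predict(X_new_obs, y_new_obs, X_query):
    # weight each prior surrogate by how well it predicts the few
    # observations we already have on the new task
    errs = np.array([np.mean((gp.predict(X_new_obs) - y_new_obs) ** 2)
                     for gp in surrogates])
    w = np.exp(-errs)
    w /= w.sum()                               # bad advisors get small weights
    preds = np.stack([gp.predict(X_query) for gp in surrogates])
    return w @ preds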
A final approach, by people at Google this year, is an algorithm where you assume the tasks come in sequentially, and every task is somewhat similar to the previous one. You look at your previous task and see where your model was wrong; as in this graph, sometimes you were underestimating and sometimes overestimating. You then transfer that to the new task by starting with a prior fitted to these residuals: if you were previously overestimating a point, you now start with a prior that corrects for it, and that way you carry useful information from previous tasks over to the new task. Very cool.

All of this worked without any information about the task itself. What if we could look at a task and tell how similar or dissimilar it is to other tasks? For that we need some way to characterize tasks, which we call meta-features: measurable properties of the task. The meta-learner now gets performances, configurations, and the meta-features of the tasks, and it can use the meta-features to measure how similar two tasks are. That is very useful information: with a good estimate of how similar a previous task is, you can meaningfully transfer more information from that task to the new one.

One option is hand-crafted meta-features, which have been in the literature for a while. These are simple things like the number of instances, because more instances make learning easier, and the number of features, because more features typically make it harder; likewise the numbers of missing values and outliers, which also make learning harder. There are statistical measures such as skewness and kurtosis, and feature correlations: do I have many features that correlate with my target (good), or features that strongly correlate with each other (not so good)? There are information-theoretic measures, such as the class entropy or the mutual information between a feature and the target, which tell you how much information there is in the task. There are model-based features, where you build something simple like a decision tree and look at how complex the tree is, which says something about how complex the task is. And there are landmarkers, where you simply run a number of simple algorithms and use their performance as a mark of how difficult the task is for that type of algorithm. (A small sketch of a few of these appears below.)

Besides those, you can also learn a representation, which is especially useful for data like images and sound. One way is deep metric learning, commonly with Siamese networks: you take two image datasets, say MNIST and CIFAR, you push both through the same feature extractor, which can be a convolutional network, then through some further layers, and at the end you get a vector for each dataset. If the datasets are similar, those vectors are supposed to be similar. You use an external ground-truth measure, for instance based on meta-features, to get an error signal: if the two vectors are very dissimilar while the tasks are actually very similar, you backpropagate that error through the network, and that's how you learn the representation. After training, you can feed in a new dataset, get a vector back, and use it to measure distances between this task and any other task of that type.
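A small sketch of a few such hand-crafted meta-features with scikit-learn; the selection here is illustrative, not a standard set:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def meta_features(X, y):
    n, p = X.shape
    _, counts = np.unique(y, return_counts=True)
    probs = counts / n
    class_entropy = -np.sum(probs * np.log2(probs))   # information-theoretic
    # landmarker: the score of a fast, simple model probes task difficulty
    stump = cross_val_score(DecisionTreeClassifier(max_depth=1), X, y, cv=3).mean()
    return {"n_instances": n, "n_features": p,
            "class_entropy": class_entropy, "stump_landmarker": stump}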
Now, if you know how similar two tasks are, one thing you want to do is reuse what worked well before. One way is warm-starting evolutionary optimization: you start the evolutionary search from pipelines or configurations that worked well before. You look at the meta-features, you use something like the L1 norm to measure which tasks are similar, and you seed the initial population with the pipelines or configurations that worked well on those similar problems. (A minimal sketch of this nearest-tasks warm start is shown below.) You can do the same for Bayesian optimization: you initialize it with a number of configurations that worked well on similar tasks. Frank already explained that this gives you a significant boost in performance, and it's the same general recipe: you learn which tasks are similar, and you transfer information from those similar tasks to the new task.

Another nice approach, by Fusi and colleagues at Microsoft, uses collaborative filtering. You consider the configurations to be rated by the tasks, just as users rate movies: if a configuration does well on a task, it gets a high rating. You start with a matrix of tasks by configurations, filled with all the evaluations you have, which is a very sparse matrix, and you use matrix factorization to learn a latent representation of your tasks and your configurations. Every configuration, pipeline, or architecture then becomes one point in this latent space: you've compressed the information from a very high-dimensional space into a small-dimensional one, and in that small space you can easily fit a Bayesian optimizer and use it to find better pipelines. You do need probabilistic matrix factorization here, because you need the uncertainty, but otherwise it's a very cool technique.

You can also directly learn a mapping from meta-features to what you want to use. You can do similar things as before: instead of warm-starting by looking up similar tasks, you can build a meta-learner that takes the meta-features and outputs a ranking of well-performing configurations, basically replacing the kNN-type approach with a meta-model, for instance a random forest or gradient boosting, which can learn more complex patterns. You can also use this to look at the meta-features and decide which hyperparameters need tuning for this kind of dataset, or to configure your search space, or you can even train meta-learners that take the meta-features plus a configuration and predict the performance on the task: this method will perform this well. That is somewhat like a surrogate model, but more general, and you can use any model for it. You can embed these meta-models in other AutoML pipelines. One often very useful instance is a meta-learner that predicts how long an algorithm will take to run: if two algorithms have roughly equal estimated performance but one is much faster, you probably want to try the faster one first. It helps you search the space more efficiently.
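A minimal sketch of the nearest-tasks warm start; prior_mf and best_configs are assumed to come from a meta-database of earlier experiments:

import numpy as np

def warm_start_candidates(mf_new, prior_mf, best_configs, k=3, per_task=5):
    # prior_mf: (n_tasks, n_metafeatures) matrix; best_configs[t]: configs
    # of prior task t, sorted from best to worst (both assumed available)
    dists = np.abs(prior_mf - mf_new).sum(axis=1)    # L1 distance between tasks
    nearest = np.argsort(dists)[:k]
    # take the top configurations of the k most similar tasks as seeds
    return [c for t in nearest for c in best_configs[t][:per_task]]

# use the result as the initial population of an evolutionary search,
# or as the initial design of a Bayesian optimizer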
Now, we seldom learn a complete algorithm from scratch: we usually decompose the problem into parts and solve the parts individually. One way we do this is with pipelines. We don't write an algorithm from scratch; we first preprocess the data using existing algorithms and then learn from the data using existing algorithms. This pipeline structure is something we all use, and it helps to significantly reduce the search space. So, if you want to explore a space defined by pipelines, how do you do that efficiently? If you can, learning becomes easier, because the individual parts are easier to learn, you can transfer components from one pipeline to another, and it's also more robust.

One way of constraining this space is to discretize it and only try a fixed set of pipelines; that is what some systems do, discretizing the space beforehand, and it apparently works quite well. You can also impose a fixed structure on the pipeline, which is what auto-sklearn does: a number of preprocessing steps, things like PCA and feature selection, and then a classifier; again this constrains the space to only the pipelines you can build that way. Another option is hierarchical task planning, where you break the task into subtasks, and those again into subtasks: for instance, first you choose whether to classify with a pipeline or end-to-end with a neural network, and depending on that choice you get other choices; it again helps to prune the search. Here you can use meta-learning to make the decisions: you can warm-start with pipelines that worked well before, and when you have to decide between including or excluding components, or which branch to take in the search tree, you can use meta-models that advise you, and again search the space more efficiently.

One particular approach is evolving pipelines. You don't start with very complex pipelines; you start with very simple ones and evolve them as much as needed, so their complexity is driven by the complexity of the problem. The mechanisms are the usual ones: you can take a pipeline and mutate it, changing a component, removing one, adding one, or tuning the hyperparameters; or you can do crossover, where you take two pipelines, cross them over, and get two new pipelines that you hope do better than the ones you had before. This again lets you search the space more efficiently, and if the search discovers something like a crucial sub-pipeline, that part gets propagated through the population more and more often, so useful information spreads.

There are different systems for this. The best known is TPOT, which builds tree-shaped pipelines: you start with a dataset, you take multiple copies, you build sub-pipelines, say PCA or polynomial features, and at some point these sub-pipelines can come together in a feature join, and that becomes your new pipeline. These tree-shaped pipelines let you build very complex pipelines quite efficiently. (A short TPOT usage sketch follows below.) Another, very new method is GAMA, which does much the same but asynchronously, so you're not stuck waiting for stragglers within a generation, and RECIPE takes a grammar-based approach.
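For illustration, a typical TPOT usage sketch (tiny budgets just to keep the run short; real runs would use larger ones):

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# evolve tree-shaped sklearn pipelines for a few generations
tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=0)
tpot.fit(X_tr, y_tr)
print(tpot.score(X_te, y_te))
tpot.export("best_pipeline.py")   # the winning pipeline as plain sklearn code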
In these systems you can again use meta-learning, although it's not done much yet, for instance for warm-starting the search, or for meta-models that help you choose between components. Another interesting approach is building pipelines through self-play. Here you again build pipelines by inserting and deleting components and so on, but you start with a number of randomly generated pipelines, you evaluate them, and you train a neural network to predict how well a configuration is going to work. The network also predicts actions to take, like adding or removing a component, and that information goes to Monte Carlo tree search, which runs a number of simulations: it builds pipelines, asks the network whether they are going to work well, and in the end uses the network's feedback to generate good pipelines, while the network receives the evaluated pipelines and learns more about which patterns work well. This way the algorithm essentially plays against itself: nothing comes in from before; it just does self-play and explores by itself.

Finally, you may want to learn from trained models. Here the tasks have to be very similar; you cannot transfer from very different tasks. The meta-learner gets performances, configurations (exactly how the network was built, for instance), and the parameters of the trained models, and it has to use those to learn more efficiently. The form we all know is transfer learning: you have a source task and a target task, you need a way to find which source task is similar, and if you find one, you can transfer models from the source task to the new task. Typically this works by using that model as a starting point and tweaking it, or by freezing certain parts, say keeping the architecture but updating the weights. There has been a lot of work here, for instance on Bayesian networks, where you start from a network trained on a similar task and make minor changes until you find a good network for your task, and in reinforcement learning, where you move from one task to a similar task by taking a policy and changing it only a little.

And then there's what we've probably all been doing at our universities these days: transfer learning with deep networks, because it works very well. In Keras, for instance, we take a convolutional neural network, and if you look at the filters in each block, you see it first learns simple things like lines and colors, then more complex things like dots and textures, and later on it detects eyes and shapes and so on. So you can use this as the part that learns useful features, and learn other tasks on top just by adding to it. Here it's very important how similar the tasks are and how much data you have: if the new dataset is small and similar, you typically freeze the network, add some layers at the end, and train only those; if the data is large and similar, you add layers at the end and retrain jointly; if it's large but different, you want to unfreeze part of the network and retrain it, which needs more data. (A minimal Keras-style sketch of the freeze-and-add-a-head recipe is given below.) But even with all these approaches, transfer will fail if the tasks are not similar enough, and that is something we don't really understand right now: we don't know when a task is similar enough for transferring; until we do, we just try it and adapt when it doesn't work.
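A minimal Keras-style sketch of the small-and-similar case, using an ImageNet-pretrained base; the 10-class head is just an example:

import tensorflow as tf

# pretrained feature extractor, with the classification head removed
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      pooling="avg", input_shape=(224, 224, 3))
base.trainable = False                                # freeze the learned features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(10, activation="softmax"),  # new head for the new task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# with more (or less similar) data: unfreeze the top blocks of `base`
# and fine-tune with a small learning rate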
There is also a whole line of work on really learning to learn: learning new algorithms from data. One seminal example is from Yoshua and Samy Bengio, who considered replacing backpropagation itself with a simple parametric update rule inspired by neurology: the rule has a learning rate, some presynaptic activity, a reinforcing signal, and so on. You parameterize it and then learn what the optimal settings of this update rule are, so you can search over different update rules. Later, Runarsson and Jonsson replaced the parametric rule with a neural network, so the weights of that network become your hyperparameters and you optimize those; after optimization you end up with a new update rule that you can use. Afterwards this was replaced by an LSTM, so you use an LSTM to learn a good update rule for training neural networks; that turned out not to be so scalable, and then not much happened for a while, until Andrychowicz and colleagues at DeepMind replaced it with a coordinatewise LSTM, which is much more scalable and flexible. That's the green block here: this is your meta-learner, the optimizer, and it has to learn the update rule for a model, the optimizee. The LSTM gets an error signal from the optimizee and, together with its current state, predicts how the weights should be updated; that's the g_t here, and the x-axis is time. The original weights come in, you add the weight update, you get new weights, you get a new error signal, the LSTM gives you a new update, and you keep moving to the right. At some point there is an evaluation, you get a loss, you backpropagate the error through time, and that is how the LSTM gets trained. The LSTM thus learns an update rule that is specifically good for the type of task it's trained on: you can remove the optimizee, plug in another one or a new task, and you have an optimizer that is very good at doing the weight updates for a whole range of similar tasks. It's also nice that this now lives in one single network, like the brain. (A compact sketch of this unrolled setup appears below.)
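A compact, much-simplified PyTorch sketch of that unrolled setup: a single LSTM cell proposes per-coordinate updates for random quadratic optimizees; the gradient preprocessing and other details of the actual paper are omitted:

import torch

lstm = torch.nn.LSTMCell(1, 20)    # the learned optimizer
head = torch.nn.Linear(20, 1)      # maps hidden state to a weight update
meta_opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=1e-3)

for meta_step in range(500):
    # optimizee: minimize a random quadratic f(theta) = mean((A theta - b)^2)
    A, b = torch.randn(10, 5), torch.randn(10)
    theta = torch.zeros(5, requires_grad=True)
    h = torch.zeros(5, 20)
    c = torch.zeros(5, 20)                    # one LSTM state per coordinate
    total_loss = 0.0
    for t in range(20):                       # unroll the inner optimization
        loss = ((A @ theta - b) ** 2).mean()
        total_loss = total_loss + loss
        g, = torch.autograd.grad(loss, theta, retain_graph=True)
        h, c = lstm(g.detach().unsqueeze(1), (h, c))   # gradient in, state update
        theta = theta + head(h).squeeze(1)             # learned per-coordinate update
    meta_opt.zero_grad()
    total_loss.backward()                     # backprop through the whole trajectory
    meta_opt.step()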
Of course, if you can transfer information from previous tasks, it would be great to also use this to learn from very little data, because training normally takes a lot of data. What can we do to learn from very few examples? This image is from Hugo Larochelle, who is actually giving a talk at the meta-learning workshop about few-shot learning; I think it will be very good, and I encourage you all to go. The problem setting is this: you are given one task that consists of five classes, you get one example of each class, and then you have to predict which class each test image belongs to. But you don't get just one such task; you get a lot of them, and your meta-model has to learn, across tasks, how to parameterize a model and set its parameters so that it can solve a new task of this kind efficiently. Basically, you look at the loss of the resulting model over a number of tasks, that is the cost signal for the meta-model, and the meta-model uses it to find better parameters; that way it learns how to build a model for this kind of task.

There is a whole range of techniques for this problem, with different ways of building the meta-model. One thing you can do is start from an existing algorithm as the meta-learner, for instance an LSTM, and do gradient descent with it; I'll say more about this in a moment. Another very popular approach is MAML, where you just learn the initialization of your network and then do gradient descent from there. And there is a whole family of techniques I can't go into, which all have some kind of memory component: they remember instances from previous tasks and use them in the new task, either kNN-like techniques, or learned embeddings with a classifier on top, or black-box models, typically neural networks with a memory component that can apply experience from previous tasks to new ones; these are actually still the state of the art right now.

One very general and elegant thing you can do, even if it's not the most performant, is again to use an LSTM as the meta-learner. The observation is this: look at the gradient descent update rule, where the new weights equal the previous weights minus the learning rate times the gradient, and look at what the LSTM cell update does, where the cell state equals the previous cell state times the forget gate plus the candidate state times the input gate. If you equate the cell state with the weights, the forget gate with one, the input gate with the learning rate, and the candidate state with the negative gradient, then training the LSTM also gives you the update rule, which I think is very cool. (The correspondence is spelled out below.) You then solve the few-shot problem by starting from some initialization, training your model on the first task, and getting an error back; the LSTM uses its current hidden state to generate new weights; you train the model again, get an error back, and keep doing this until you reach the test set, where you evaluate the cost of the LSTM; then you backpropagate through time, update all the parameters of the LSTM, get a new theta-zero, and go back and forth, back and forth. So you train the LSTM to be very good at updating the weights of these base models, and you learn a good initialization at the same time.
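In symbols (LaTeX), the correspondence just described:

\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t
\qquad\text{vs.}\qquad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

% setting c_t \equiv \theta_t, \; f_t \equiv 1, \; i_t \equiv \alpha_t, \;
% \tilde{c}_t \equiv -\nabla_{\theta_{t-1}} \mathcal{L}_t
% makes the LSTM cell update mimic gradient descent, so training the LSTM
% (its gates) amounts to learning the update rule itself.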
Another very popular approach is model-agnostic meta-learning, MAML. Here you don't bother with building complex LSTMs: you learn new skills quickly simply by starting from a very well-chosen initialization of your network. The goal is to train a neural network on a bunch of similar tasks so that you get an initialization from which, whenever a similar task needs to be solved, you reach a good model much faster. Looking at the figure: you have the current initialization theta and, in this case, three tasks; for each task you take K examples, evaluate the gradients, and update the weights for that task; then you update theta to minimize the sum of these per-task losses, which moves you to a new initialization that is hopefully closer to the optimal weights theta-1, theta-2, theta-3 of the individual tasks. It's a very cool loop: you learn the initialization over a number of tasks, and whenever you have to solve a similar task, you just start from that optimized initialization. (A compact MAML sketch is given below.) This has been shown to be very resilient to overfitting, it generalizes better than the LSTM approaches, and there is even a proof of universality: learning a good initialization and then doing gradient descent has no theoretical downsides; it is never worse than learning a very complex update rule. More recently Reptile was released, a more scalable variant built on stochastic gradient descent: instead of updating the meta-initialization at every step, it does K steps on one task and only then updates the initialization.

Another useful tool is meta-reinforcement-learning, which comes very naturally: you can have a reinforcement learner that has to create a new algorithm or model, you evaluate that model, and you use its performance as the reward, so that over time the reinforcement learner learns how to build good models. A main goal here is to solve reinforcement learning problems much faster, because humans happen to be much better at picking up new games quickly than reinforcement learning algorithms are. The idea is to build a meta-reinforcement-learning algorithm, typically a deep RNN, and train it on a large number of environments; this yields an agent that can learn a policy quickly in similar environments. You look at the performance, feed it back, and it learns a policy over many environments; this way the meta-agent learns how to create a reinforcement learning agent that is much faster at solving similar tasks. This also works for few-shot imitation: in a more recent paper by Duan and colleagues, you do one-shot imitation learning by conditioning not only on the observation but also on an upcoming demonstration, so over a bunch of demonstrations per task you train the meta-learner to build agents that are maximally prepared to solve tasks similar to the ones they were trained on.

And there is a whole range of other problems you can solve with meta-learning. One is active learning: you build a deep network that learns a representation of your data, plus a policy network that, given a state and reward, tells you which points to query next, and you use meta-learning to train that policy network. You can do density estimation, learning a distribution over a small set of images, using a MAML-based few-shot learner to learn densities much faster than otherwise possible. You can even do matrix factorization this way. The bottom line is that you can take basically any algorithm, whether active learning, density estimation, or matrix factorization, and replace the hand-crafted algorithm with a learned one. It's very powerful.
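A compact second-order MAML sketch on toy sine-regression tasks, assuming PyTorch 2.x for torch.func.functional_call; the task distribution follows the regression example in the MAML paper:

import math
import torch
from torch.func import functional_call

net = torch.nn.Sequential(
    torch.nn.Linear(1, 40), torch.nn.ReLU(),
    torch.nn.Linear(40, 40), torch.nn.ReLU(),
    torch.nn.Linear(40, 1))
meta_opt = torch.optim.Adam(net.parameters(), lr=1e-3)
inner_lr, K = 0.01, 10
mse = torch.nn.functional.mse_loss

def sample_task():
    # a random sine wave; support set (adapt) and query set (evaluate)
    amp = torch.rand(1) * 4.9 + 0.1
    phase = torch.rand(1) * math.pi
    x = torch.rand(2 * K, 1) * 10 - 5
    y = amp * torch.sin(x + phase)
    return x[:K], y[:K], x[K:], y[K:]

for step in range(10000):
    meta_opt.zero_grad()
    for _ in range(4):                      # meta-batch of tasks
        xs, ys, xq, yq = sample_task()
        params = dict(net.named_parameters())
        # inner loop: one gradient step on the support set
        inner = mse(functional_call(net, params, (xs,)), ys)
        grads = torch.autograd.grad(inner, tuple(params.values()), create_graph=True)
        fast = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}
        # outer loop: the query loss of the adapted weights drives the meta-update
        mse(functional_call(net, fast, (xq,)), yq).backward()
    meta_opt.step()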
OK, that's almost the end. Finally, you may be wondering: I want to do meta-learning, but how do I get useful metadata to train my models? Well, we've been working on a platform for this called OpenML. You can go to openml.org: it has thousands of datasets, all in the same format, so you can download them and run experiments on all of them simultaneously. Each dataset comes with around 100 of these meta-features, so you can measure distances between tasks. You get millions of evaluated runs, all evaluated with the same splits and with different metrics, and for a large portion of them you also get the traces: whenever a model was optimized, you get all the sub-models evaluated along the way, and for some of them the models that were built. There are APIs in Python and Java. The Python API, for instance, is a very simple interface: you download a task by its ID, you build whatever classifier you want, in this case just a scikit-learn classifier, you transform it into a flow (a flow is OpenML's representation of a pipeline or learning algorithm), you run the flow on the task, and you get a run that you can store locally or publish. (A small example follows below.) These five lines let you download lots of datasets, evaluate models, and share those models with anybody: you download data, run algorithms locally, and share results globally. This allows you to do never-ending learning: you can build an algorithm that works against this API, downloads datasets, learns how to learn across tasks, and shares its results and models with the server, and then other agents, other bots, can download those models and use them. So this gives you a platform to exchange information for meta-learning, and if you want to experiment with meta-learning, this is the best way to start: you can easily write your own bots that learn across large numbers of tasks. There are also benchmarks; in this visualization, every dot is a model, with performance on one axis (higher is better) and time on the other. The colored dots are models built by humans, and the orange dots are models built by bots; the bots can learn by themselves, or they can look at the models from previous bots or previous humans and use those, via meta-learning, to build new models faster. And by the way, we have some openings for a programmer and a PhD student, if you're interested in building this very cool platform.
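A sketch of those five lines with the openml-python package; task 31 (the credit-g dataset) is just an example ID, and publishing requires an OpenML account and API key:

import openml
from sklearn import ensemble

task = openml.tasks.get_task(31)                  # download a task by ID
clf = ensemble.RandomForestClassifier(n_estimators=100)
# wraps the sklearn estimator into an OpenML flow and runs it
# with the task's predefined splits
run = openml.runs.run_model_on_task(clf, task)
run.publish()                                     # share the results on the server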
A final note. I think we have made a lot of progress in the last years towards making real, human-like learning-to-learn happen, and I think it's a crucial aspect of science, because it gives you a significant advantage: if you can solve one task really well, that's fine, but if you can learn how to solve any task, that's much more powerful. It's something we should definitely put more effort into. It's also a universal aspect of life: humans learn, trees learn, life in general learns all the time, so it's time we understand this process. It's a very exciting field, with many unexplored possibilities and many aspects that are not understood at all. Task similarity, for instance: when can I transfer a task or a model from a previous task to a new one? We just don't really know, and I think we need a lot more experiments and a lot more science going into that.

So here is a final challenge: can we actually start building learners that never stop learning? They would go from task to task, where some tasks are very similar and some very different; they never stop learning, they learn across these tasks, and the agents also learn from each other: if one learner has learned a very good feature representation or a very good pipeline, another learner should be able to reuse that pipeline or that representation. We could build a global memory where these systems can talk to each other and share information, and then let them explore by themselves, using active learning to exploit what everybody collectively knows and to build new models. That would bring us to truly automatic machine learning, where these systems efficiently use any information available from before to build new models for new problems. Thank you. [Applause]

All right, questions. You can queue behind the microphones along the central aisle.

Question: Thank you very much for your inspiring talk. My question is about tasks. With meta-learning or transfer learning, the agent can generalize from previous experience to unseen situations or tasks, but when you look at human beings, their behavior might be driven by curiosity or desires, with no task specificity. If we continue working on transfer and meta-learning, do you think we can create truly intelligent agents that spontaneously create behaviors like that?

Answer: I think this notion of creativity is super important. It partly defines us as humans and intelligent beings: we seek out information, and when we don't know something we're curious about it, and we start reading up, or learning more, or experimenting and gathering more information. It would be very good if we could build algorithms with this kind of curiosity that learn actively, gathering experience that might be useful for future tasks; maybe the agent doesn't yet know how to use that experience, but in the future it might. And as you say, it may be important to store this in memory: just as we have the Internet, we should give machine learning agents an Internet where they can find tasks and models in an organized way that they can learn from.

Question: Do you think that in the distant future there will be just one intelligent model that does everything, or will we have many dedicated models for specific purposes?

Answer: You can't really have one model for all tasks, but I think you can have a system, a meta-learner, that knows so much about the world that it can build models efficiently. There is this thing called the no-free-lunch theorem: you can learn a good model for a task, you can meta-learn good models for a group of tasks, but one model that solves all tasks is essentially not possible, because you don't have the necessary inductive bias to learn efficiently. So maybe the one model isn't directly possible, but it's much more efficient to learn how to learn models for certain types of tasks.

Question: Are there techniques to overcome high performance variance, where repeated training runs of the same configuration give very different results? Will Bayesian optimization still work well under high variance?
Answer (Frank): Sorry, can you repeat the question? I understood it's about high variance and Bayesian optimization, but that's about all. ... OK, yes: if you have a high-variance black box that gives very different performance in repeated runs, then one approach you can use is again a multi-fidelity approach, where you evaluate every configuration only once at first, but evaluate the best configurations more and more, because the configuration you finally return is one you want to be very sure about. In reinforcement learning, for example, you want multiple trials because there is a lot of variance across runs, so for the best configurations you do a lot of runs, but you don't want to waste, say, 100 runs on every configuration. So you can use a standard multi-fidelity approach for that.

Question: These AutoML models can be pretty complicated, so how do you make sure the model actually learns something that makes sense?

Answer: You set up a testing environment where you test for what you want it to do, and of course you need a generic set of many benchmarks, because you don't want to overfit to one particular benchmark. But those are standard questions of experimental science.

Question: How about problems like domain shift, where your model fits data from a certain time period but future data doesn't look like that anymore?

Answer: Yes, this is called concept drift, and there are different ways of handling it. Some algorithms naturally adapt: most gradient-based algorithms keep on descending, and if the data changes, the loss surface changes and they just optimize further, so they adapt automatically. For other algorithms you have to retrain every time period, so that they adapt to the new data. And you will overfit to the types of datasets you have seen: if very different datasets arrive in the future, you can add a component that looks at the dataset and, for each type of dataset, uses a different type of model.

Question: Is there metadata I could use for object detection?

Answer: Is there metadata for object detection, that is, are there lots of datasets for it to train on? OpenML has a lot of datasets, though maybe not things like ImageNet; perhaps it's a call to arms to the computer vision community to create a nice repository of image datasets. There might be some, definitely for object recognition, but I'm not entirely sure; I'll look into it and get back to you.

All right, this concludes the tutorial on AutoML. Let's thank the speakers again. [Applause]
Info
Channel: Steven Van Vaerenbergh
Views: 6,332
Keywords: nips, neurips, 2018, tutorial
Id: 0eBR8a4MQ30
Length: 122min 25sec (7345 seconds)
Published: Tue Dec 11 2018