Hyperparameter Optimization - The Math of Intelligence #7

Video Statistics and Information

Reddit Comments

For super newbies watch siraj's intro to deep learning

👍 1 · u/TheOtherGuy9603 · Aug 03 2017

I downloaded all of his videos on my phone to watch it while I am travelling, really good stuff.

👍 1 · u/jermteam · Aug 03 2017
Captions
Hello world, it's Siraj! What hyperparameters should you use to train your models? You'll see these magic numbers a lot. They are the model values that are set before you train on any dataset. A machine learning model is just a formula with a number of parameters that need to be learned from data, but there are also parameters that can't be learned directly from the regular training process. We call these higher-level properties hyperparameters. They could be the number of trees in a random forest, the number of hidden layers in a neural network, or the learning rate for logistic regression. Choosing them is a process of trial and error, and it's not very intuitive, since we're not great at interpreting high-dimensional data. Researchers consider the possibility space of hyperparameters their canvas. But what if we could have these parameters learn the optimal values for themselves? That would make life easier, right? Let's see if we can figure out a really basic strategy ourselves and then try to improve it.

I've got a dataset of tweets labeled as positive or negative, perfect for a binary classification problem, and let's say I build a support vector machine to learn this mapping so it can then classify a new tweet immediately. This is called sentiment analysis, and it's a really popular task in natural language processing. If we mapped these tweet vectors out in 2D space, we could imagine a curvy line that separates the positive tweets from the negative ones. This is a decision boundary, and a support vector machine can help us define it. Since it's non-linear, our SVM will use what is known as the kernel trick: instead of trying to fit a non-linear model, we map the data from the input space to a new, higher-dimensional space called the feature space by applying a non-linear transformation with a kernel, or similarity function, and then fit a linear model in that feature space. We define our kernel between tweet vectors as the radial basis function, which takes two vectors as input and outputs a similarity score, K(x, x') = exp(-gamma * ||x - x'||^2). So the more similar two tweets are, the higher the output of our function.

There are two hyperparameters that govern how our boundary gets drawn, C and gamma. Both need to be selected very carefully, and they depend on each other in unknown ways, so we can't just optimize one parameter at a time and then combine the results. What if we just tried every single combination of hyperparameters? Assuming we've built our SVM already, we choose a set of possible values for each of them and create a variable to store our model's best accuracy. Then we write a nested for loop: for every value of C, try every value of gamma. Inside the loop we initialize our SVM with the hyperparameters at that iteration, train it, score it, then compare its score to our best score; if it's better, we update our best values accordingly. This process runs over every hyperparameter combination we listed until it finds the optimal one. This technique is called grid search: we essentially lay a grid over our search space and evaluate the model at each point of the grid, for as many dimensions as necessary. It's a pretty easy strategy to implement, but it scales poorly with every hyperparameter, or dimension, we add.
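As a concrete sketch of that nested loop, here is roughly what a grid search over C and gamma could look like in Python with scikit-learn's SVC. The synthetic data, the candidate value lists, and the train/validation split are assumptions made for illustration; they are not the video's actual dataset or code.

```python
# Minimal grid-search sketch (illustrative data and candidate values, not the video's code)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for vectorized tweets: 500 samples, 20 features, binary labels
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

C_values = [0.1, 1, 10, 100]          # assumed candidate regularization strengths
gamma_values = [0.001, 0.01, 0.1, 1]  # assumed candidate RBF kernel widths

best_score, best_params = 0.0, None
for C in C_values:                    # every value of C ...
    for gamma in gamma_values:        # ... paired with every value of gamma
        model = SVC(kernel="rbf", C=C, gamma=gamma)
        model.fit(X_train, y_train)
        score = model.score(X_val, y_val)   # validation accuracy
        if score > best_score:              # keep the best setting seen so far
            best_score, best_params = score, {"C": C, "gamma": gamma}

print(best_params, best_score)
```

scikit-learn also ships GridSearchCV, which wraps this same search with cross-validation, but the explicit loop mirrors the steps described above.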
This is also known as the curse of dimensionality. I think we can do better than an exhaustive search. With grid search we tried every combination from a preset list of values for our hyperparameters. But what if instead we tried random combinations sampled from a range of values, for a number of iterations that we define? This won't guarantee that we find the best hyperparameter combination the way grid search does, but it will take a lot less time.
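A random-search version of the same experiment might look like the sketch below. The log-uniform sampling ranges for C and gamma and the 20-iteration budget are illustrative assumptions rather than values from the video.

```python
# Minimal random-search sketch (assumed ranges and iteration budget)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_iter = 20                           # fixed budget instead of an exhaustive grid
best_score, best_params = 0.0, None
for _ in range(n_iter):
    # sample C and gamma log-uniformly over broad (assumed) ranges
    C = 10 ** rng.uniform(-2, 2)
    gamma = 10 ** rng.uniform(-4, 0)
    model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_params = score, {"C": C, "gamma": gamma}

print(best_params, best_score)
```

Here the iteration budget, not the grid size, controls the cost, which is why random search holds up better as more hyperparameters are added.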
So manual search, grid search, and random search are fine and dandy, but there's got to be a more intelligent way of doing this that incorporates learning. One technique that is very popular right now is called Bayesian optimization. Last episode we talked about how Bayes' theorem is a way to determine conditional probabilities; it shows us how to update an existing prediction given new evidence. This forms the basis of the Bayesian way of thinking, as opposed to the frequentist approach. These are the two different approaches to probability; basically it's like a mathematical gang war between applied statisticians. Bayesian means probabilistic: it focuses on the probability of the hypothesis given the data, which means the data is fixed and the hypothesis is random. The frequentist approach focuses on the probability of the data given the hypothesis: the data is random (if we repeat the study, the data might come out differently), but the hypothesis is fixed. We can apply frequentist or Bayesian methods to pretty much any learning algorithm; they just have different aims. In the context of hyperparameter optimization, a Bayesian approach takes advantage of the information our model learns during the optimization process. The idea is that we pick some prior belief about how our hyperparameters behave, and then search the parameter space by enforcing and updating that prior belief based on our ongoing measurements.

So the tradeoff between exploration (making sure we've visited the relevant corners of our space) and exploitation (once we've found a promising region of the space, homing in on the optimal value inside it) is handled in a more intelligent way.

Bayesian optimization uses previously evaluated points to compute a posterior expectation of what the loss f looks like. Then it samples the loss at a new point that maximizes some utility of the expectation of f; that utility tells us which regions of the domain of f are best to sample from. This two-step process is repeated until convergence.

For the prior distribution, we assume that f can be described by a Gaussian process. A Gaussian distribution, often called a normal distribution, is described by a bell-shaped curve. Distributions are equations that link the outcomes of a statistical experiment with their probability of occurrence. The Gaussian is quite popular: half of the data falls to the left of the mean, half falls to the right, and this is useful in many situations. A Gaussian process is a generalization of the Gaussian distribution from random variables to functions. While Gaussian distributions are specified by their mean and variance, Gaussian processes are specified by their mean function and covariance function.

The way we find the best point to sample f at next is to pick the point that maximizes an acquisition function. This is a function of the posterior distribution over f that describes the utility of every candidate set of hyperparameters; the values with the highest utility are the ones we compute the loss for next. We'll use the popular expected improvement function, EI(x) = E[max(0, f(x*) - f(x))], where x* is the current optimal set of hyperparameters. Maximizing it gives us the point that improves on f the most. So given the observed values f(x), we update the posterior expectation of f using the GP model, then we find the new x that maximizes the acquisition function (the expected improvement), and finally we compute the value of f at that new x. Initially the algorithm explores the parameter space, but it quickly discovers the region with the best performance and samples points in that region.

To summarize, we can optimize our hyperparameters using several strategies, but Bayesian optimization looks the most promising. It picks a prior belief about how the hyperparameters will behave, then searches the parameter space by enforcing and updating that prior belief based on ongoing measurements. So Bayesians let their prior beliefs influence their predictions; frequentists don't.
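To make that two-step loop concrete, here is a rough Bayesian-optimization sketch over a single hyperparameter (log10 of gamma, with C held fixed), using scikit-learn's GaussianProcessRegressor for the posterior and a hand-rolled expected-improvement acquisition. The objective, search range, starting points, and iteration count are all illustrative assumptions, not the implementation used in the video.

```python
# Bayesian-optimization sketch: GP posterior + expected improvement (illustrative only)
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

def loss(log_gamma):
    # Objective f: cross-validated error of an RBF SVM at this gamma (C fixed for simplicity)
    model = SVC(kernel="rbf", C=1.0, gamma=10 ** log_gamma)
    return 1.0 - cross_val_score(model, X, y, cv=3).mean()

bounds = (-4.0, 0.0)                                    # search log10(gamma) in [1e-4, 1]
rng = np.random.default_rng(0)
sampled_x = list(rng.uniform(bounds[0], bounds[1], 3))  # a few random points to start
sampled_y = [loss(x) for x in sampled_x]

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):                                     # repeat the two-step process
    # Step 1: update the posterior expectation of f with everything observed so far
    gp.fit(np.array(sampled_x).reshape(-1, 1), np.array(sampled_y))
    candidates = np.linspace(bounds[0], bounds[1], 200).reshape(-1, 1)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Step 2: pick the candidate that maximizes expected improvement over the best loss so far
    best = min(sampled_y)
    imp = best - mu
    z = np.divide(imp, sigma, out=np.zeros_like(sigma), where=sigma > 0)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)
    ei[sigma == 0] = 0.0
    next_x = float(candidates[np.argmax(ei), 0])
    sampled_x.append(next_x)
    sampled_y.append(loss(next_x))

best_i = int(np.argmin(sampled_y))
print("best gamma:", 10 ** sampled_x[best_i], "cv error:", sampled_y[best_i])
```

Libraries such as scikit-optimize wrap this kind of loop so you don't have to write the acquisition step by hand.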
Info
Channel: Siraj Raval
Views: 87,613
Rating: 4.7854252 out of 5
Keywords: bayesian, frequentist, probability, math, mathematics, hyperparameter, optimization, machine learning, deep learning, AI, artificial intelligence, python, programming, coding, learn programming, learn coding, educational, data, data science, analytics, code, svm, support vector machine, scikit-learn
Id: ttE0F7fghfk
Length: 9min 50sec (590 seconds)
Published: Fri Jul 28 2017