Vanishing Gradients and Activation Functions - Intro to Deep Learning using TensorFlow #8

Video Statistics and Information

Captions
Hello and welcome back. Right now we're going to go over the last thing you need to learn before building your own neural network, and that is the activation function, which forms the layers. Recall what we've gone through so far: you've learned how to build a linear classifier, how to score the outputs of that classifier, and how to update the parameters W in order to minimize the loss of that classifier. That is the whole fundamental basis of machine learning. Now we are going to go over the magic part of the learning, and that is how to introduce non-linearity into your classifier and turn it into a neural network.

Activation functions, in different contexts, are also called neurons. What they do is take your input and give you an output of either 0 or some value, basically in order to create this non-linearity and allow your classifier to take on a whole wide range of nonlinear values, so that you have to spend less time on feature engineering and so on. Here are four common activation functions. I'm going to go through the sigmoid in more depth, and I'm going to explain how the choice of activation function influences the outcome and trainability of your models. There are some that are always better than others, and there are more activation functions being created all the time; it's a significant area of research. If you are doing ML research, one thing you might want to do is design a new activation function.

So let's look at the classic sigmoid function. It's not commonly used anymore, but it's definitely the most classic activation function; this is what you'll see quite often when people talk about neurons. What this activation function does is take some input x and return the value 1 / (1 + e^(-x)). What this does is squash any number fed into it into the range between 0 and 1.

One thing to be aware of with neural networks is that you can saturate neurons, and that kills the gradients during backpropagation, which you definitely do not want. So let's think about how a signal might vanish if you pass it through too many layers of a neural network. Recall from the backpropagation exercise that you pass some inputs through a sigmoid gate and calculate the local gradients backwards, in the other direction. You have an analytic value for the derivative d sigmoid(x)/dx, which is sigmoid(x) * (1 - sigmoid(x)). Let's work through the numbers. What happens when x = 10? The derivative is a value very close to zero. What happens when x = 0? You get a value of 0.25. When x = -10? You again get a value very close to zero. So this graph shows you the possible values that the derivative d sigmoid(x)/dx can take, and it's really between 0 and 0.25.

If you keep multiplying sigmoid layers by each other, that is, pass signals through multiple sigmoid layers, what happens is that any number you multiply by a fraction of 1 gets smaller and smaller. Think about how we multiply local gradients with the signal passed backwards from the last neuron, in this direction: you may have a gradient value that's high here, but it gets squashed smaller and smaller and smaller until it's dead. Why is this such a bad thing? Recall our weight-update function: if your gradient is dead, then your weights will no longer update. If the gradient at each step is too small, the training can converge without ever reaching the global minimum of the loss function; small gradients mean training stops.
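To make those numbers concrete, here is a minimal sketch (not from the video) that evaluates the sigmoid derivative at the points mentioned above and then chains it across several layers; the depth of 10 layers is an arbitrary choice for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # bounded above by 0.25, reached at x == 0

# The derivative values worked through in the transcript
for x in (10.0, 0.0, -10.0):
    print(f"d_sigmoid({x:+.0f}) = {d_sigmoid(x):.6f}")

# Chaining gradients backwards through several sigmoid layers: every step
# multiplies by a local gradient of at most 0.25, so the signal shrinks fast.
grad = 1.0
for _ in range(10):
    grad *= d_sigmoid(0.0)        # best case: the largest possible local gradient
print(f"gradient after 10 sigmoid layers: {grad:.2e}")   # about 9.5e-07
```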
This small-gradient problem is called the vanishing gradient problem, and there are a number of ways to address it; one is your choice of activation function. For people working with language, LSTMs tackle this problem nicely for recurrent neural networks, so if you're running into problems with a recurrent neural network that's taking forever to train, you might want to switch to an LSTM.

Something to think about for a minute: is a gradient more likely to vanish closer to the input layer or closer to the output layer? It is more likely to vanish closer to the input layer. Think about how this happens: if you are multiplying gradients from the output layer backwards, then the product is more likely to vanish closer to the input. This means that if you are able to look at the gradient of each layer in the neural network, you can really just focus on the first half; if those gradients are vanishing, then you know that this is the source of the issue.

The sigmoid function also has another issue: its outputs are not zero-centered. I'm not going to go into too much detail on why that's a disadvantage, only that it definitely is one; if you want to think it through, look at the CS231n slides and work it out. Another problem with the sigmoid function is that calculating e^x is computationally expensive. Think about how you would compute e, a floating-point number, raised to the power x; it's kind of messy.

So you have another activation here, tanh, which is strictly better than sigmoid. It has all the good parts of sigmoid while being zero-centered. It still kills the gradients, but it doesn't kill them quite as quickly, because its derivative can take values from 0 to 1. Always pick tanh over sigmoid: if you are using a sigmoid activation function, one way you can immediately improve performance is to swap it out.

Now let's talk about ReLUs. ReLUs are very interesting because they do not saturate in the positive region and they are computationally efficient: it's much easier to calculate max(0, x) than it is to calculate 1 / (1 + e^(-x)) across millions and millions of operations, and in practice it converges faster than sigmoid and tanh. For any value we feed into it, it will give you either 0 or a positive value. It can still die: look here, it can still die if the inputs you keep feeding it are negative, it just snaps off to zero. But for the most part it is better than sigmoid and tanh: any gradient that is nonzero gets maintained. Anything less than zero can still die, but the positive side is preserved, so it's still better.

Now, leaky ReLU is strictly better than ReLU. Think about the ReLU dying-neuron problem; you can solve it. Someone figured this out by, instead of outputting zero for negative inputs, outputting 0.1x, where the slope is an arbitrary value you can set to anything. What you don't want to do is set it to 1, because then obviously it stops being a nonlinear function and becomes another linear operation.

So in practice, use ReLU, and try leaky ReLU to see if that helps you out. Don't use sigmoid anymore: older textbooks still use sigmoid as the primary example, but in recent years it's really not as effective as the ReLU variants. All right, now let's move on to coding our first two-layer neural network.
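The short TensorFlow sketch below is not from the video; it just applies the four activations discussed above to the same inputs, and shows a two-layer Keras model whose hidden activation is easy to swap out, in the spirit of the "two-layer network" the next video builds. The layer sizes (784, 128, 10) are placeholder assumptions, not values given in the transcript.

```python
import tensorflow as tf

# The four activations discussed above, applied to the same inputs.
x = tf.constant([-10.0, -1.0, 0.0, 1.0, 10.0])
print("sigmoid:   ", tf.math.sigmoid(x).numpy())
print("tanh:      ", tf.math.tanh(x).numpy())
print("relu:      ", tf.nn.relu(x).numpy())
print("leaky_relu:", tf.nn.leaky_relu(x, alpha=0.1).numpy())

# A small two-layer network where the hidden activation is a single argument,
# so sigmoid, tanh, relu, or a leaky variant can be compared directly.
def build_model(hidden_activation="relu"):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),                      # placeholder input size
        tf.keras.layers.Dense(128, activation=hidden_activation),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

model = build_model("relu")       # try "tanh" or tf.nn.leaky_relu to compare
model.summary()
```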
Info
Channel: DOST PCIEERD
Views: 7,117
Id: kGAo32JgY48
Length: 8min 41sec (521 seconds)
Published: Wed Dec 20 2017