Deep Learning - Activation Functions - ELU, PReLU, Softmax, Swish and Softplus

Video Statistics and Information

Captions
Hello all, my name is Krish Naik and welcome to my YouTube channel. If you have been following my complete deep learning playlist, I have already discussed some of the activation functions like sigmoid, tanh, ReLU and leaky ReLU, but there are still other variants of ReLU, and there are other activation functions which are pretty important. So in this video, which is a long one of more than 35 minutes, I will discuss in depth ELU (the exponential linear unit), then PReLU, then the softmax activation function — as we say, in the last layer we use sigmoid for a binary classification problem and softmax for a multi-class classification problem, but how does softmax actually work? There is also an activation function called Swish, which came out of Google Brain, and another one called softplus, which is again a kind of variant of ReLU. All of these will be discussed in this session. Again, this session is going to be a little long, more than 35 minutes, so please make sure you watch this video till the end. This video will be part of the deep learning playlist, and I will put it in the proper order, right after the ReLU and leaky ReLU video, because it adds some activation functions that were missing from the playlist. There is also an optimizer that we missed, the Adam optimizer; that recording has been done and that video will come right after this one, probably tomorrow. So please do enjoy this video and watch the whole thing, because I have explained it properly. Let's go ahead.

We have discussed the sigmoid function. In sigmoid, I told you that the output value ranges between 0 and 1. That means after the calculation of the weights multiplied by the input plus the bias, we apply the sigmoid activation function on top of it, and the value we get is between 0 and 1. But I also told you that whenever I try to find the derivative of sigmoid during backpropagation, that value ranges between 0 and 0.25. All of this was covered in the previous class. I also told you the major problem with the sigmoid function: when we are doing backpropagation there is a concept called the vanishing gradient problem, which basically means the weights get updated by a very small, almost negligible amount. We have discussed all of this already, and I would suggest you go through the previous tutorials, because if I start explaining it again it will take time.
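As a quick illustration of that recap (not from the video), here is a minimal NumPy sketch of the sigmoid function and its derivative; you can check that the derivative never exceeds 0.25, which is what feeds the vanishing gradient problem.

```python
import numpy as np

def sigmoid(x):
    # sigmoid squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # derivative is sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-6, 6, 7)
print(sigmoid(x))             # values between 0 and 1
print(sigmoid_derivative(x))  # values between 0 and 0.25
```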
Then we understood that sigmoid is prone to the vanishing gradient problem and that its output is not zero-centred. Now what does "not zero-centred" mean? Remember, if I take normally distributed data — say a standard normal distribution of some feature — then my mean is zero and my standard deviation is one, so my data is centred around zero. A value may be positive or negative, but most of the data falls in a region centred on zero. That is what zero-centred means: the data is spread around zero, like the bell curve of a Gaussian. So if somebody asks you what is most specific about normally or Gaussian distributed data, you can say the data is zero-centred. But if I go and look at the sigmoid function, its output is obviously not centred around zero — the outputs only range between 0 and 1. Why does this matter? If the data is not zero-centred, more computation time is taken, and because of that, reaching the convergence point — the global minimum — also takes more time. This is the point I missed when I was explaining sigmoid. Any queries you have, just note them down in your notebook and we will try to solve them at the end.

Now coming to the tanh function. I told you that in tanh, the output of x — that is, the weights multiplied by the input plus the bias — ranges between −1 and +1, and when I try to find its derivative, the value is between 0 and 1. So here also you have a vanishing gradient problem; it is almost like sigmoid, just with a different range. But unlike sigmoid, the tanh output is zero-centred: the curve comes up from −1, passes through zero and goes to +1, so the data revolves around zero. Any curve whose output passes through and is spread around zero is called zero-centred.
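A similar sketch for tanh (again, not from the video): the output stays between −1 and 1 and is zero-centred, while the derivative 1 − tanh²(x) stays between 0 and 1, so the vanishing gradient problem does not go away.

```python
import numpy as np

def tanh(x):
    # output is zero-centred and bounded between -1 and 1
    return np.tanh(x)

def tanh_derivative(x):
    # derivative is 1 - tanh(x)^2, which lies between 0 and 1
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-4, 4, 9)
print(tanh(x))
print(tanh_derivative(x))
```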
As the slide says, the output interval of tanh is (−1, 1) and the whole function is zero-centred, which is better than sigmoid. But the one problem we still face with tanh is the vanishing gradient problem, because the derivative of tanh always ranges between 0 and 1.

Now coming to the ReLU activation function. In ReLU the function is max(0, x). This says that whenever your x value is greater than zero, the output you get is that x itself; whenever your x value is less than or equal to zero, you get zero. Now remember, when I try to find the derivative with respect to x: the derivative of x with respect to x is just 1, so wherever x is greater than zero the derivative is 1, and wherever x is less than zero the output is the constant 0, so the derivative is 0. There is one more condition: strictly speaking, you cannot find the derivative exactly at zero. I will tell you how that is handled in practice in a moment.

Now see what happens during backpropagation. Suppose you have some negative weights, so some inputs to ReLU are negative; the derivative there is going to be zero. According to the chain rule, if one of my local derivatives is zero and the others are, say, one, the whole product becomes zero. And in the weight updation formula, w_new = w_old − learning rate × (derivative of the loss with respect to w_old), when that gradient becomes zero, w_new becomes equal to w_old. So the weight update does not happen, and this scenario is called the dying ReLU or dead activation (dead neuron) problem.

Even so, ReLU is the most commonly used activation function in the hidden layers. Why is it used? Because it solves the vanishing gradient problem.
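A minimal sketch of ReLU and its derivative (my own example, not from the video); in practice, frameworks simply define the derivative at exactly 0 to be 0 (or 1), which is how the "derivative of zero" issue is usually handled.

```python
import numpy as np

def relu(x):
    # max(0, x): passes positive values through, clips negatives to 0
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 where x > 0, else 0; the gradient for negative inputs is exactly 0,
    # which is what causes the dying ReLU problem during backpropagation
    return (x > 0).astype(float)

x = np.array([-3.0, -1.0, 0.0, 2.0, 5.0])
print(relu(x))             # [0. 0. 0. 2. 5.]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```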
How does it solve it? The derivative of the ReLU function is either 0 or 1 — remember, not between 0 and 1, but exactly 0 or exactly 1. Whenever x is greater than zero the derivative is 1; whenever x is less than zero it is 0. But because of that zero we have the dying ReLU condition I just described. I hope everybody remembers this from my previous session — if you have not seen it, you will not be able to follow this part, because I explained all of this clearly there, so please check out that video.

Now coming to the next thing: the leaky ReLU activation function. In ReLU we had max(0, x), which means if x is greater than zero it gets selected as the maximum. But I told you that during backpropagation the zero derivative gives us the dying ReLU, or dead activation, problem: if one local derivative is zero, the whole gradient becomes zero. So to prevent this we use leaky ReLU. In leaky ReLU, instead of zero we multiply a small constant with x on the negative side, typically 0.01. So for a positive x the function is x, and for a negative x it is 0.01·x. This is interesting: if I find the derivative of x with respect to x, it gives me 1; if I find the derivative of 0.01·x with respect to x, it gives me 0.01. So where ReLU's derivative was exactly zero on the negative side, leaky ReLU gives 0.01 instead — you can see that small positive slope just above the x-axis on the plot. This solves the dying ReLU problem, because whenever my inputs are negative due to negative weights, the derivative comes out as 0.01, which is greater than zero, so the whole derivative of the loss with respect to w will not be equal to zero: instead of multiplying by 0 in the chain rule we are multiplying by 0.01. That is how we solve the dead activation function problem — just by multiplying a small constant with x. In normal ReLU the negative side was effectively 0·x; here it is 0.01·x, so the derivative of this function is either 1 or 0.01.
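Leaky ReLU as a sketch (assuming the usual 0.01 slope mentioned above): the only change from ReLU is the small slope on the negative side, so the gradient is never exactly zero and the weight update can still move.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # x for positive inputs, 0.01 * x for negative inputs
    return np.where(x > 0, x, slope * x)

def leaky_relu_derivative(x, slope=0.01):
    # 1 for positive inputs, 0.01 for negative inputs -- never exactly 0
    return np.where(x > 0, 1.0, slope)

x = np.array([-3.0, -0.5, 2.0])
print(leaky_relu(x))             # [-0.03  -0.005  2.   ]
print(leaky_relu_derivative(x))  # [0.01  0.01  1.  ]
```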
We discussed this and worked through the math already — and remember, guys, I am still revising the previous concepts here; don't think I am teaching something new yet.

Now, coming to the next one. With leaky ReLU I told you that for negative values we at least have something like 0.01. Let me show you — I think we had discussed this in the earlier session itself. When we compute the derivatives with leaky ReLU, if some of the weights are negative, the corresponding local derivatives may come out as 0.01. So suppose in the chain rule one value is 0.01, another is 1, another is 1, then again 0.01, and so on — and sometimes, as we go deeper through backpropagation, a factor may even become less than 0.01. When we multiply all of these together, will the result be a small number? Obviously, yes. So what usually happens is that even with leaky ReLU there is a chance of the vanishing gradient problem: if most of my weights are negative, then in the derivative I will be multiplying 0.01 × 0.01 × 0.01 everywhere, and the product definitely becomes a very small number. If that becomes small, the overall gradient becomes small, and when we update with the learning rate, w_new will be approximately equal to w_old. So in some scenarios researchers have said that leaky ReLU can still bring up this kind of vanishing gradient problem, as the small example below illustrates. Let's see what we can do next to address it.
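To make that point concrete, here is a small hypothetical calculation (my own example, not from the video): if the chain rule multiplies several 0.01 factors together in a deep network, the overall gradient quickly becomes negligible, so w_new ≈ w_old.

```python
# Hypothetical chain of local derivatives in a deep network where most
# activations fall on the negative side of leaky ReLU (slope 0.01).
local_derivatives = [0.01, 1.0, 0.01, 0.01, 1.0, 0.01]

gradient = 1.0
for d in local_derivatives:
    gradient *= d
print(gradient)  # ~1e-08 -- vanishingly small

learning_rate = 0.01
w_old = 0.5
w_new = w_old - learning_rate * gradient
print(w_new)     # ~0.5 -- effectively no weight update
```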
Now we come to the next activation function, which is called ELU — the exponential linear unit. ELU handles the negative values in a more efficient way: whenever your x value is greater than 0, the output is simply x, and whenever x is less than or equal to 0, it applies α(e^x − 1). This α can be a fixed value such as 0.01, 0.1, 0.2 or 0.3 — it depends on the problem statement, so basically it is a hyperparameter. Once we fix some α and plot ELU, the curve looks like ReLU on the positive side, but on the negative side, instead of being flat at zero (ReLU) or a straight line with slope 0.01 (leaky ReLU), it bends smoothly downwards.

What kind of problem is this actually solving? If I find the derivative of this function, on the negative side I get a smooth curve instead of a hard zero. So one major problem it solves is that you never have to deal with a derivative at a hard corner at zero: for negative inputs the derivative is small but always above zero, and as x approaches 0 from below the function smoothly approaches 0.

If we go down the slide: ELU is proposed to solve the problems of ReLU, and it has all the advantages of ReLU — and always remember, you have to look at both the function and its derivative. First, there are no dead-neuron issues, because in the derivative none of the values is exactly zero; the curve sits slightly above the line, so don't read that plot as showing a zero value. Second, the mean of the output is close to zero, i.e. it is roughly zero-centred, because the curve passes through zero and most of the output is spread around it. The one drawback is that it is slightly more computationally expensive than leaky ReLU. Remember, the negative branch is α(e^x − 1): when we do backpropagation and take its derivative, the exponential term takes more time to compute. Finding the derivative of x with respect to x takes very little time, because it is just a variable; finding the derivative of α·e^x with respect to x involves an exponential term and an α value, so that operation takes more time. That is why we say ELU requires more compute — it is computationally expensive.
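A minimal sketch of ELU, with α = 1.0 assumed as a default just for illustration: the negative side saturates smoothly towards −α instead of being flat, the derivative is never exactly zero, and the exponential is also why it costs more to compute than ReLU.

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (exp(x) - 1) for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_derivative(x, alpha=1.0):
    # 1 for positive inputs, alpha * exp(x) for negative inputs (always > 0)
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(elu(x))             # negative side approaches -alpha, no hard corner
print(elu_derivative(x))  # strictly positive everywhere
```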
Tanh is computationally expensive in the same way, because there you are finding the derivative of the tanh function itself. And always remember: this computational cost may not matter for just one weight update, but when you are building a neural network there will be so many hidden layers and so many weights that in that case you need significant compute. So whenever you apply the ELU activation function, the one major disadvantage is that it is computationally expensive. As for what problems it solves, we have already discussed that it solves the dying ReLU (dead activation) problem and that it keeps the outputs roughly zero-centred. Remember, whenever you want to explain an activation function, these are the things you should cover.

Now, we will discuss softmax shortly, but before that let's talk about PReLU. PReLU is an activation function called parametric ReLU. In parametric ReLU the basic shape is the same: whenever your x value is greater than zero it gives you x, and whenever your x value is less than or equal to zero it gives you α multiplied by x. Now here is a very nice concept. If your α value is 0.01, this becomes leaky ReLU, because leaky ReLU is nothing but 0.01·x on the negative side. If your α value is 0, it becomes the ReLU activation function, because 0 multiplied by x is 0, so the whole negative side becomes 0. And the third case: if you let α take any value, you can adapt to different kinds of problems — so this α can be considered a learnable parameter, which changes dynamically during training. This α is not fixed: for different hidden layers and different uses of the activation function, the α value may differ. It will not always be 0.01 (that would just be leaky ReLU) or 0 (that would just be ReLU); in other scenarios it is learned dynamically while the training is actually happening. I hope everybody is clear till here.
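A sketch of parametric ReLU with α passed in explicitly; in a real framework α would be a trainable parameter updated by backpropagation, but the minimal idea is just the function below with different α values.

```python
import numpy as np

def prelu(x, alpha):
    # x for positive inputs, alpha * x for negative inputs
    # alpha = 0.01 recovers leaky ReLU, alpha = 0 recovers plain ReLU,
    # and a trainable alpha gives parametric ReLU (PReLU)
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 3.0])
print(prelu(x, alpha=0.01))  # behaves like leaky ReLU
print(prelu(x, alpha=0.0))   # behaves like ReLU
print(prelu(x, alpha=0.25))  # some other learned slope
```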
So parametric ReLU covers both the leaky ReLU case and the normal ReLU case — again, these are all different variants of ReLU. You can see it on the slide as well: PReLU is an improved version of ReLU; in the negative region PReLU has a small slope, which also avoids the dead ReLU problem. When we multiply by α_i: if α_i is used as 0.01 it gives the leaky ReLU derivative, if α_i = 0 the function becomes ReLU and the derivative brings back the dead ReLU problem, and if α_i is treated as a learnable parameter the function becomes PReLU. As the slide puts it: if a_i = 0, f becomes ReLU; if a_i > 0 (a small fixed value), f becomes leaky ReLU; if a_i is a learnable parameter, f becomes PReLU. Very simple, very easy.

Now coming to the next one, which is called Swish. In this activation function the formula is just this much: x multiplied by sigmoid(x). This is called self-gating, and the same gating idea is used in LSTMs. It is quite compute-intensive, guys, and remember, this kind of activation function is mostly used only when your neural network is very deep — at a minimum more than about 40 layers; if you have fewer layers, it is generally not going to help much.

Now let's understand the math behind it — it is pretty simple. Remember, the curve in pink on the slide is the ReLU activation function, but in the case of Swish the function you create is x multiplied by sigmoid(x), where x is nothing but the weights multiplied by the input plus the bias. Because of this you get a smooth curve. This curve is pretty good: it is roughly zero-centred, because it passes through zero, and it also addresses the dead neuron (dead ReLU) problem, because when I find the derivative of x·sigmoid(x) with respect to x, I get a combination of terms — one coming from x and one from sigmoid(x) — and for most inputs the result is a little above zero rather than exactly zero. That is why it is called a self-gated function. Again, you can see that across all the activation functions we have discussed, the thing that is mostly being tuned is the behaviour on the negative side — that is also why in my previous session I discussed the vanishing gradient and exploding gradient problems, and how to fix them, in so much detail. So this is the self-gated function, which is nothing but the Swish function. Let's read what the slide says about it: the Swish design was inspired by the use of the sigmoid function for gating in LSTMs and highway networks; here the same value is used for gating, which simplifies the gating mechanism and is called self-gating. The advantage of self-gating is that it only requires a simple scalar input, while normal gating requires multiple scalar inputs; this enables self-gated functions to easily replace activation functions that take a single scalar as input. The slide also mentions properties like unboundedness, but the main practical point to remember is the one above: when you have a deep neural network, say more than 40 layers, that is when you would consider using the Swish self-gated function.
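A minimal NumPy sketch of the Swish (self-gated) function x·sigmoid(x) and its derivative (my own sketch, not from the video).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # self-gated: the input x is "gated" by its own sigmoid
    return x * sigmoid(x)

def swish_derivative(x):
    # d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(swish(x))
print(swish_derivative(x))
```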
Now coming to the next one, which is called softplus. I told you that in the real world the derivative at exactly zero is sometimes not possible to compute. Most of the time it will not happen, but there are rare scenarios where the x value coming into the activation is exactly zero, and in that case a plain ReLU will not serve the purpose, because you do not know the derivative at zero. So what we do is use a function called softplus. The softplus function looks like this: the natural log of (1 + e^x). Instead of applying max(0, x), the function applies this, and because of that you get a smooth curve — the green curve on the slide — and both problems get solved when you compute the derivative: in particular, you never need the derivative at a hard corner at zero. This is called the softplus activation function.
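A sketch of softplus, ln(1 + e^x): it is a smooth approximation of ReLU, and its derivative is exactly the sigmoid function, so there is no point where the derivative is undefined.

```python
import numpy as np

def softplus(x):
    # smooth approximation of ReLU: always positive, differentiable everywhere
    return np.log1p(np.exp(x))

def softplus_derivative(x):
    # the derivative of ln(1 + e^x) is the sigmoid function
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(softplus(x))             # softplus(0) is ln(2), no hard corner at 0
print(softplus_derivative(x))  # values between 0 and 1
```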
So those were the different activation functions we have discussed in today's session. There is also one more, called softmax, so let me define it. Suppose I have an input layer, a couple of hidden layers, and then the output layer. What does softmax say? Remember, whenever we have two output classes — a binary classification problem — we use the sigmoid activation function in the last layer, and sigmoid gives us a value between 0 and 1. Suppose my two outputs correspond to a dog and a cat, and I pass in an image: the sigmoid output might say 0.60 — 60 percent — that it is a dog, and 0.40 — 40 percent — that it is a cat. Remember, this is a classification problem and it works on probabilities. Whenever a value is greater than 0.5 we consider that class as the output. That is with respect to sigmoid, and note that when I sum those two values I get 1, that is, 100 percent.

But what if I have three output classes? I cannot use the sigmoid activation function; sigmoid only gives me a single value between 0 and 1. So whenever we have more than two output categories we use something called the softmax activation function — and not just three, it may be any number of outputs: 10 outputs, 20 outputs, whatever. Suppose I have four outputs and I want to check whether an image belongs to a dog, a cat, a chimpanzee or a human being. Internally I use softmax, so it gives me probabilities: say the first output is 0.6 for dog, the second is 0.2 for cat, the third is 0.1 for chimpanzee and the fourth is 0.1 for human. You get four different values, and always remember, the sum of all these values will be 1.

Now, if I really want to compute the softmax of these values, it is pretty simple. The formula looks like this: softmax(x_j) = e^(x_j) / Σ_{k=1}^{K} e^(x_k), for j = 1 to K, where x_j is nothing but the weights multiplied by the inputs plus the bias for that output node. The probabilities 0.6, 0.2, 0.1, 0.1 are the final outputs, but suppose that before applying softmax my four raw values were 40, 30, 10 and 5. Then the probability of the first class is e^40 / (e^40 + e^30 + e^10 + e^5); the second is e^30 divided by the same sum; the third is e^10 divided by that sum; and similarly e^5 divided by that sum for the fourth. That is exactly what the formula says: we take all the raw output values of that layer, compute each term like this, and finally get the four probabilities — whichever class has the highest probability, we take that as our final output. This whole thing is called the softmax activation function, and sigmoid and softmax are always kept in the last layer.
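A sketch of softmax using the example values 40, 30, 10 and 5 from above; subtracting the maximum before exponentiating is a standard numerical-stability trick (not mentioned in the video) and does not change the result. The probabilities sum to 1 and the largest raw value wins.

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability; the ratios are unchanged
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([40.0, 30.0, 10.0, 5.0])  # raw outputs w*x + b for 4 classes
probs = softmax(logits)
print(probs)             # the first class dominates because e^40 >> e^30
print(probs.sum())       # ~1.0 -- softmax outputs always sum to one
print(np.argmax(probs))  # 0 -- the class with the highest probability
```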
So I hope you liked this particular video, guys. These were all the important activation functions, and you can definitely play with all of them. In the next video I am also going to cover the optimizers, in particular the Adam optimizer — most probably that video will be uploaded tomorrow. So yes, this was all about this specific video. I hope you like it; please do subscribe to the channel if you have not already, and share it with all your friends who need this kind of help. I'll see you all in the next video. Have a great day. Thank you, bye.
Info
Channel: Krish Naik
Views: 42,591
Keywords: types of activation function in neural network, softmax activation function, relu activation function, sigmoid activation function, tanh activation function, swish activation function, activation function in neural network ppt, tensorflow activation functions
Id: qVLQ9Cqm-ec
Length: 38min 48sec (2328 seconds)
Published: Mon Nov 02 2020