The Sigmoid: Data Science Basics

Captions
Hey guys, welcome back. In this video we'll be looking at the sigmoid function, and whether you're here because you just need to learn the sigmoid function for a class, or because you want to really get into the heart of machine learning and data science, I want to start this discussion not from the mathematics or the formula that defines a sigmoid function, but rather from why we care about it at all. Why did people put so much thought into developing the sigmoid function if there are simpler things out there?

With that in mind, let's start with a real-world setup so that everything we say going forward is a little more applicable. Say you're a data scientist for the education system, and the thing you're currently working on is a model that predicts whether or not a high schooler will drop out of high school, as some function of all their past academic performance. The point of this video is not the model itself, so let's say you've built such a model, and in the end it outputs some score s between negative infinity and infinity, unbounded in both directions. What you do know is that the higher the score (the more towards positive infinity), the more evidence we have that this student will indeed drop out, and conversely, the lower the score (the more towards negative infinity), the more evidence we have that the student will not drop out.

Now, an unbounded score is not super helpful for comparing one student with another, so the goal throughout this video is to convert the score, again just some unbounded real number, into a probability p, which of course has to be bounded between 0 and 1. We want this mostly for interpretability: given the score the model outputs, we can map it to a probability that tells us the likelihood that this student will drop out. If that likelihood is very high, like 95%, we can send resources to that student to hopefully help them not drop out; if it's more like 2%, maybe we don't need to worry so much about that student.

The first natural thing we might do is just draw a straight line. While I did say the score is unbounded, if we consider some finite set of students when training our model, the scores will fall between two bounds; let's say our training scores are between -10 and 10. Our goal is then to map the range -10 to 10 onto the range 0 to 1, and the easiest way is a straight line: the student with a score of -10 gets mapped to 0, the student with a score of 10 gets mapped to 1, and anything in between gets mapped linearly. This is a very natural thing to do, and it has the advantage of simplicity, but it's not the best choice, and there are two big reasons why. The first reason is purely mathematical: say we use this model, and in the future a student comes through with a score of 20 or -15. Our linear model would map that out of bounds, to something less than 0 or greater than 1, which is no longer interpretable as a probability.
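To make that concrete, here's a minimal Python sketch (the -10 to 10 training bounds are the ones assumed above) showing the linear map and its out-of-bounds problem:

```python
def linear_prob(s, lo=-10.0, hi=10.0):
    """Map a score in [lo, hi] linearly onto [0, 1]."""
    return (s - lo) / (hi - lo)

print(linear_prob(-10))  # 0.0
print(linear_prob(0))    # 0.5
print(linear_prob(10))   # 1.0
print(linear_prob(20))   # 1.5   -> not a valid probability
print(linear_prob(-15))  # -0.25 -> not a valid probability
```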
Perhaps the bigger issue, issue number two, is that the rate of change of this linear function doesn't capture the heart of this probabilistic setup. Here's what I mean more concretely. The linear function increases at the same rate everywhere. Let's pause for a second and really think about what it means for the score to be around zero, versus something positive, versus something very negative. If the score is around zero, which maps to a probability of 1/2, we basically have a 50/50 shot at whether this student will drop out or not: we have essentially zero information. Now say the score goes from 0 up a little, to 1. That's a big deal, because going from a score of 0 to a score of 1 is a huge change in relative terms, so we would expect the probability to jump by a lot. The shape we're looking for is not a line: we want the probability to spike for that 0-to-1 change. Symmetrically, if the score goes from 0 to -1, that's again a huge change in relative terms, so we expect the probability to dip by a lot. The linear function doesn't capture that.

To round out the story, say the score happened to be 9 and then went from 9 to 10. That's not a huge change in relative terms. A more human way to put it: with a score of 9 I'm saying I have very, very high evidence that the student is going to drop out, and if the score were 10 instead, I'd still have pretty high evidence; things haven't changed all that much. Mathematically, out near the boundaries we expect the probability not to change by much for equal changes in score, so we expect the curve to be flat out there. The same story holds on the other side: jumping from a score of -9 to -10 really doesn't change our probability that the student drops out. Connecting all of these pieces gives an S-shaped curve, and that black curve is exactly what's called a sigmoid.

It solves both problems. It solves problem number one because the sigmoid is inherently bounded between 0 and 1: even if I send the score to a billion, the output is essentially 1, but the sigmoid never actually touches 1 or 0, so it stays bounded no matter what the score happens to be. More importantly, it solves problem number two, because it exactly captures the setup where, if we have no information (a score around zero), little changes in the score have big impacts on the probability, but if we already have a lot of information in either direction, the same change in score barely changes anything at all. That's why we use this thing called the sigmoid. Now that we have the motivation in our heads, let's jump to the mathematics.
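To put numbers on that intuition, here's a small sketch using the sigmoid formula introduced just below; an equal one-point step in score moves the probability a lot near zero and barely at all near the boundary:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Near zero, a one-point step in score moves the probability a lot...
print(sigmoid(1) - sigmoid(0))   # ~0.2311
# ...but out near the boundary, the same step barely moves it.
print(sigmoid(10) - sigmoid(9))  # ~0.000078
```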
It turns out that the curve I drew is described by this formula: the probability given the score is P(s) = 1 / (1 + e^(-s)), where e is Euler's number, 2.718.... Let's do a very quick sanity check that the curve I drew matches the function I wrote. If s goes to positive infinity, the bottom of the fraction goes to 1 + 0 = 1, so the whole fraction goes to 1, matching the fact that the sigmoid should approach 1 at positive infinity. If s goes to negative infinity, the denominator goes to infinity, so the entire fraction goes to 0, which is exactly what we want on that end. The last sanity check: plug in s = 0, a score of exactly zero, meaning we have no information one way or the other, so the probability should be one half. Indeed, with s = 0 the bottom becomes 1 + 1, so the entire fraction is 1/2. So that is the correct formulation for this curve.

To close out this video I want to touch on the derivative of the sigmoid, because it ends up being really important when we look at neural networks and other machine learning applications. If you're just here to get an idea of the sigmoid, you can simply follow the mathematics; if you're trying to get to the heart of machine learning, we care about the derivative of the sigmoid because it's used when computing the gradients of the loss functions that train our models. Either way, here is the derivative, dP/ds, with P the probability and s the score. You can work it out for yourself, but here are the high-level steps. Interpret the function as (1 + e^(-s))^(-1) and apply the power rule: the -1 comes out front (you'll see why it goes away) and the exponent drops to -2, which is why we get (1 + e^(-s))^2 on the bottom. Then the chain rule on the inner part gives a factor of e^(-s), and differentiating the -s is why we end up with a positive sign out front. So the first derivative of the sigmoid is dP/ds = e^(-s) / (1 + e^(-s))^2.

Can we reduce this to something easier to work with? Yes, and that's a key insight behind why people keep using the sigmoid: the really nice derivative property we're about to see. I can split this expression into the product of two terms: 1 / (1 + e^(-s)) and e^(-s) / (1 + e^(-s)). If you stare at the first term, it is exactly P(s) itself, and the second is actually 1 - P(s); a quick bit of algebra will confirm that for you. So the derivative of the sigmoid with respect to the score s is simply P(1 - P), which is a very nice form, and not just a nice form: it also has a great intuitive explanation. To close this video out, let's consider three cases. Say your probability is already very high, near 1, because your score is very high. Plug P = 1 into P(1 - P) and the derivative comes out to zero. Does that intuitively make sense?
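As a quick numerical check, here's a small sketch (the test scores and the finite-difference step h are just illustrative choices) confirming that the closed form e^(-s) / (1 + e^(-s))^2, the P(1 - P) form, and a numerical derivative all agree:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def sigmoid_deriv(s):
    p = sigmoid(s)
    return p * (1.0 - p)  # the P(1 - P) form

for s in (-2.0, 0.0, 3.0):
    closed = math.exp(-s) / (1.0 + math.exp(-s)) ** 2
    h = 1e-6
    numeric = (sigmoid(s + h) - sigmoid(s - h)) / (2 * h)  # central difference
    print(f"s={s:+}: P(1-P)={sigmoid_deriv(s):.6f}, "
          f"closed={closed:.6f}, numeric={numeric:.6f}")
```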
Yes: if your probability is already very high, which means your score is also very high, then a small change in your score changes your probability by basically zero. You already have so much evidence that the student is going to drop out that a little change in the score doesn't change anything. Same thing on the other side: say your probability is near zero, which means your score is very negative, which says you're very, very sure the student is not going to drop out. Changing the score a little doesn't change that story. The last case to consider before we close the video: what if the probability is 1/2, meaning we have exactly no information, and we're equally split on whether the student will or will not drop out? Then the derivative P(1 - P) achieves its highest value, 1/4, because 1/2 times 1/2 is 1/4. That makes sense, because that's exactly where the probability is changing the most. Another way to say it: if I have zero information about whether the student will drop out and you increase my score a little bit, that gives me a ton of information, which is the same story as the sigmoid rising fastest in the middle, so the derivative there is very high.

Hopefully that helps you understand the sigmoid deeply, not just at a surface numerical level, but at the level of why people choose to use it and why it's used so much in machine learning. If you have any questions at all, please feel free to leave them in the comments below. Like and subscribe for more videos just like this, and I will see you guys next time.
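Here's the same three-case story in a short sketch (the scores +9, -9, and 0 stand in for "very sure of dropout", "very sure of no dropout", and "no information"):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Three cases: sure of dropout, sure of no dropout, and no information.
for s in (9.0, -9.0, 0.0):
    p = sigmoid(s)
    print(f"s={s:+}: P={p:.5f}, dP/ds=P(1-P)={p * (1 - p):.5f}")

# s=+9.0: P=0.99988, dP/ds=0.00012  (already sure: curve is flat)
# s=-9.0: P=0.00012, dP/ds=0.00012  (already sure: curve is flat)
# s=+0.0: P=0.50000, dP/ds=0.25000  (no information: curve is steepest)
```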
Info
Channel: ritvikmath
Views: 12,369
Rating: 4.9819551 out of 5
Keywords: data science, math, sigmoid, neural networks, machine learning
Id: Aj7O9qRNJPY
Length: 11min 34sec (694 seconds)
Published: Mon May 25 2020