Loss Functions : Data Science Basics

Captions
Hey everybody, welcome back. Today we're going to be talking about loss functions in the context of machine learning.

First, let's think about where loss functions actually fit into the whole machine learning pipeline. In machine learning we typically have some kind of data set, we split it up into a training and a testing set, we use the training set to build some kind of model, and then we apply that model to the testing set to see how well it's doing. Loss functions fit in at the part where we are actually training the model on the training data. If you think about the various machine learning models we've studied on this channel, or that you might have studied on your own or in class, for example logistic regression or SVM, there's typically some kind of objective function that you're trying to minimize, and that is called the loss function.

At a higher level, what's really happening is that many machine learning algorithms proceed by saying: okay, I need to take this training data and fit some kind of model to it. The natural question is: how do I know which model, out of all the ones I can select, is going to be the best one on the training data? That's where we use the loss function. We supply the model a loss function and say: here is the metric you're going to use to see if you're doing a good job on the training data. The model says: okay, thanks for the metric, I'm going to see which combination of parameters gives me the lowest loss.

Notice that this is not a hard science. We can supply any loss function we want based on our particular needs and situation, and the model is going to optimize for that specific loss function. That means if we supply one loss function, we might get a completely different model in the end than if we had supplied a different one. Maybe that's intentional, maybe it's unintentional, but either way it's something we need to be aware of as data scientists.

Here's how we'll proceed in this video: we'll first spend a little time on loss functions in regression, where we'll just cover the biggest one, and then we'll spend the bulk of the video on loss functions in classification, because I think there's more to be said there.

First, regression. Let's give a small real-world example. Say you're trying to predict how many minutes it will take someone to finish running a marathon. You have three people in your training set, and they finish in 305, 330, and 407 minutes respectively. Your model, whatever it might be, is trying to produce predictions as close to these numbers as possible.

Now let's supply a loss function to the model. The most popular loss function in regression, which you've probably seen already, is the square loss. The square loss for an individual observation is simply the true value of that observation minus the predicted value (the value the model generates), squared: (y_i - ŷ_i)^2.

Although this video is about loss functions, and it's pretty easy to just put stuff into a function and get stuff out, I don't want that to be the focus. I want to talk about why these functions look the way they do mathematically and what they're trying to achieve. If you take away anything from this video, it shouldn't be the actual formulas on the board, but the spirit of the formulas, the intuition behind them.

So let's start with this one. The part inside the parentheses makes total sense, because we're going to measure our loss based on how far away we are from the truth. But why is it squared? Why not just take an absolute value, for example? The spirit of the squared loss is that if we're really far away from the truth, we want to give that a really big penalty. For example, say we are two units away from the truth, so the difference is 2. That means the loss for that observation is 2 squared, or 4. Now say we're twice as far away, 4 units from the truth. The loss for that observation is 4 squared, or 16. So although we only got twice as far from the truth, our loss increased by four times. That is intentional; it is built into the squared loss. It says: if you're really far from the truth, your penalty increases quadratically, not linearly, because being very far from the truth should be very heavily penalized.

Let's compute it, just for the sake of example. Say the model is deciding between two different sets of predictions for this training set, one set on the left and one on the right. All I did was compute the difference between each prediction and the truth: 315 minus 305 is 10, 325 minus 330 is negative 5, and 420 minus 407 is 13. Then we square each of these according to the square loss, so the three numbers you're seeing, 100, 25, and 169, are the square losses for each observation in our training set. We add them up and get 294 for the first set of predictions, and doing the same thing for the second set of predictions gives 266.
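To make this concrete, here's a minimal sketch in Python of the square-loss computation above. The first prediction set (315, 325, 420) is the one from the video; the second set isn't fully stated in the captions, so the values below are assumed, chosen only so that they total 266 like the second set in the video.

```python
def square_loss(y_true, y_pred):
    """Total square loss: sum of (truth - prediction)^2 over all observations."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

# True marathon finishing times (minutes) for the three runners
y_true = [305, 330, 407]

# First candidate set of predictions (from the video)
preds_a = [315, 325, 420]

# Second candidate set (assumed values; the video only states the total of 266)
preds_b = [316, 339, 415]

print(square_loss(y_true, preds_a))  # 10^2 + (-5)^2 + 13^2 = 294
print(square_loss(y_true, preds_b))  # 11^2 + 9^2 + 8^2 = 266
```

Under the square loss, the model prefers the second set of predictions because its total loss is lower.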
So, using the square loss, which again is the most popular one in regression, we are going to prefer the second model over the first one, because it gives us a lower square loss. Okay, that's regression.

Let's move over to the classification case, because things get a little more interesting there, and the story continues; we'll tie it back to this as well. To be explicit, a classification problem, in our case a binary classification problem, is one where you predict each observation as either class 1 or class negative 1. How do we determine loss in this case?

First, the truth. Say we have five examples with labels 1, -1, -1, 1, 1, and each of these is called y sub i, so y sub i is the truth about each observation in our training set. Again, let me put this in real-world terms so we can attach our minds to something. Say we're trying to predict whether someone will finish the marathon in above-average time or below-average time: above-average time is class 1, below-average time is class negative 1. Whatever model we build, we're going to put in information about the athlete, information about the course, and all of that information collectively is called x sub i. We put x sub i into our classifier and get out a score called f of x sub i. The higher the score (the closer to infinity, the more positive it is), the more strongly the classifier is saying this observation should be classified as a 1, that this person will finish the marathon in above-average time. Conversely, the more negative the score (the closer to negative infinity), the more confident the classifier is that this person will finish the marathon in below-average time.

Before we talk about three commonly used loss functions, let's get an intuitive idea of the relationship between these y sub i's and f of x sub i. If y sub i is equal to 1, so we know this person finished the marathon in above-average time, what would we like f of x sub i to be? We would like it to be really high, a very positive number, because that means the classifier is very confident this person finishes in above-average time, which is the truth in this case. Another way to say that is that we want the quantity y sub i times f of x sub i to be high. Why multiply by y sub i? In this case y sub i is 1, so it doesn't actually change anything.

Now think about the other case for a second: what do we want when y sub i is equal to negative 1? If we know this person finishes the marathon in below-average time, we would like the score to be very negative, because a very negative score means the model is very confident the class should be negative 1, which is the truth in this case. So we want f of x sub i to be a very negative number, and if we multiply that very negative number by negative 1, we again get a very high number.
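As a quick sketch, here's that quantity y sub i times f of x sub i (often called the margin) computed for the five labels above; the scores f(x_i) are made-up values for illustration.

```python
# Labels from the example above; the scores f(x_i) are assumed values for illustration
labels = [1, -1, -1, 1, 1]
scores = [2.5, -1.8, 0.4, -3.0, 1.1]

# The quantity y_i * f(x_i): large and positive means a confident, correct prediction;
# large and negative means a confident, wrong prediction
margins = [y * f for y, f in zip(labels, scores)]
print(margins)  # [2.5, 1.8, -0.4, -3.0, 1.1]
```

The fourth example here is the worst-case situation described next: the true class is 1, but the model confidently scored it as negative.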
The reason I did that multiplication is that we get the same exact result in both cases: we want the product y sub i times f of x sub i to be as high as possible. I went through this exercise before talking about the loss functions because it makes our interpretation and analysis a lot easier: now we have a definite goal in mind. No matter what the classifier outputs, and no matter what the actual labels of the classes are, our objective is always to get this quantity, y sub i times f of x sub i, as high as possible for any given observation. If we can get it high, that's a very good situation for us. Conversely, if this quantity is very negative, that's the worst-case scenario. Think about what that means: it could mean that the person actually finished the marathon in above-average time, so y sub i was equal to 1, but f of x sub i is an extremely negative number. The class was actually 1, but the model confidently thought it was negative 1, or the other way around. Keep this quantity in mind, because you're going to see it come up in each of the loss functions we're about to look at.

The easiest loss function to talk about is the zero-one loss. The zero-one loss is very forgiving, in the sense that it says: if this quantity, the one we just spent some time talking about, is above zero, then your loss is zero. We know this quantity should be as positive as possible, but the zero-one loss says that even if it's only a little bit positive, as long as it's greater than zero, your loss is zero for that example. However, if the quantity is less than zero at all, your loss is one. If we plot this, it's pretty easy: on the x axis we have y sub i times f of x sub i, the same quantity we've been talking about this whole time. If it's above zero, the red line has a loss of zero; if it's below zero, the red line jumps to one. It's just a step function, nothing too crazy, very easy to understand.

Now let's think about why this might not be a great idea, as a segue into the other two loss functions. Say we're dealing with two training examples. For one of them, this quantity is just a little bit below zero, so it's almost correctly classified. For the other one, the value is way out to the left, very negative. That second one is probably the one we want to prioritize: we want to give it a bigger loss, because it's a much bigger problem for us in the context of all of our data. The zero-one loss does not take that into account; it gives both of them a loss of 1, because it's just a step function. We would like some notion of: the lower your score is, the more negative it is, the bigger a problem you are as an example, and therefore the bigger the loss you should get, so that on the next iteration of our algorithm, however our machine learning algorithm works, that example can be prioritized and we can make sure to get it into the right class.
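Here's a minimal sketch of the zero-one loss as described, applied to a barely-wrong and a badly-wrong example (the margin values are the assumed ones from the earlier snippet).

```python
def zero_one_loss(margin):
    """Zero-one loss: 0 if the margin y_i * f(x_i) is positive, 1 otherwise."""
    return 0 if margin > 0 else 1

# A barely-wrong example and a badly-wrong one get the same loss of 1
print(zero_one_loss(-0.4))  # 1
print(zero_one_loss(-3.0))  # 1
print(zero_one_loss(2.5))   # 0
```

This shows the limitation just discussed: margins of -0.4 and -3.0 are treated identically, even though the second is a much bigger problem.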
That's where the exponential loss comes in. This is the cleanest one to write, and I think it's also pretty easy to understand. The exponential loss is literally e to the power of negative y sub i times f of x sub i. You see this quantity come up again, the same y sub i times f of x sub i we've been talking about, and the interpretation could not be simpler: it's literally this blue exponential curve. The bigger this quantity is (which, again, we know is preferable), the closer your loss gets to zero. And the more negative this quantity is (which is bad), the faster the loss grows; it goes up exponentially. That's the behavior we wanted, and it fixes the problem with the zero-one loss, which is why the exponential loss might be preferable in some cases. Just to give you a preview, the exponential loss is used, for example, in the AdaBoost algorithm.

Now, to finish the video, let's talk about one more: the hinge loss. The hinge loss is this green curve, and I'll explain why it's shaped like that. The hinge loss is given by a slightly funky-looking function, which you can write as max(0, 1 - y sub i times f of x sub i), and you see this quantity come up yet again; it keeps showing up over and over in all the different classification losses we're looking at. Although you can map the function to the green line, I just want to explain it graphically, because I think that's easier to understand. The hinge loss says: if your quantity y sub i times f of x sub i is bigger than 1, you get a loss of zero. But if that quantity is less than 1, your loss goes up linearly. This is a little different, or maybe a lot different, from the exponential loss we just looked at: that one says your loss goes up exponentially as this quantity gets more negative, while this one says your loss goes up just linearly. So although the blue curve and the green curve look sort of similar near the middle, as you get further out to the left the exponential picks up way faster than a linear function. This is called the hinge loss, and we see it most often in support vector machines, or SVMs.
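Here's a sketch of both losses as described, with the hinge written in its standard max(0, 1 - margin) form; the sample margins are the same assumed values used above, plus one mildly positive score.

```python
import math

def exponential_loss(margin):
    """Exponential loss: e^(-y_i * f(x_i)). Used in, e.g., AdaBoost."""
    return math.exp(-margin)

def hinge_loss(margin):
    """Hinge loss: max(0, 1 - y_i * f(x_i)). Used in SVMs."""
    return max(0.0, 1.0 - margin)

# Compare how each loss treats mildly wrong vs. badly wrong examples
for margin in [2.5, 0.5, -0.4, -3.0]:
    print(f"margin={margin:5.1f}  exp={exponential_loss(margin):8.3f}  hinge={hinge_loss(margin):5.2f}")
```

At a margin of -3.0 the exponential penalty (about 20.1) is already roughly five times the hinge penalty (4.0), which is exactly the trade-off discussed next.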
Now I just want to put a bow tie on this video. We've talked about several different loss functions in classification and one in regression, but I haven't yet answered a question many of you might have at this point: which loss function should I use? As I said at the beginning of this video, based on the loss function you choose, you're going to get a different model, because the model is optimizing whatever function you put in. So the natural question is, of course, which one should I use?

Well, this is where machine learning becomes more of an art than an exact science, because each of these has trade-offs. Let's compare the hinge loss and the exponential loss again, the green curve and the blue curve. Say we're getting more and more negative values of y sub i times f of x sub i, which we know is bad in every sense. The hinge loss gives a bigger and bigger penalty as you get more negative on this axis, but that penalty only goes up in a linear fashion. Compare that with the exponential loss: exponential functions grow incredibly fast. The story we're telling by consciously choosing the exponential loss is that as we get further and further down this axis, we give a harsher and harsher penalty to that example. That may not be something we want: we might want to put more attention on examples that are far out here, but not as much attention as the exponential function demands; maybe that's just a little too extreme.

So this is where, again, it becomes more of an art than a science. You need to really think about exactly what form of penalty you want to put on examples that are far away from the truth. Do I want a very big penalty that increases very fast? Do I want a more moderate penalty that doesn't increase as fast? Or do I want no increasing penalty at all, the same penalty no matter how far from the truth you are? These are all things to think about when choosing a loss function.

Okay, so hopefully that helped you understand loss functions in machine learning, at least the few we've looked at here. I didn't state this before, but there are many other loss functions out there; these are just some of the very commonly used ones I wanted to expose you to. People who do hardcore research in this field will come up with their own loss functions based on their own needs. Personally, I think loss functions sit right at the intersection of the science and art aspects of machine learning. I hope you liked this video. Please leave any questions in the comments below, like and subscribe for more videos just like this, and I'll see you next time.
Info
Channel: ritvikmath
Views: 7,026
Rating: 4.9765396 out of 5
Keywords: machine learning, big data, data science, ai
Id: eKIX8F6RP-g
Length: 16min 40sec (1000 seconds)
Published: Mon Nov 16 2020