Machine Learning Model Evaluation Metrics

Captions
As has been mentioned, I basically self-educated myself in machine learning and data science. Even though I graduated from an applied informatics program at my university, that was a long time ago, and most of my practical knowledge comes from courses, pet projects, competitions, that sort of thing. One of the things that was confusing for me was evaluation metrics: why are there so many, what's the point, what's the difference, what do they mean? Today I'm going to try to spare some of that confusion for those who are relatively new to machine learning.

So what's an evaluation metric? Just a quick recap: it's a way to quantify the performance of a machine learning model. It's basically a number that tells you whether the model is any good, and you can use this number to compare different models. It is not the same as a loss function, although it can be the same thing; it doesn't have to be. The difference is that a loss function is something you use while you're training your model, while you're optimizing, whereas an evaluation metric is used on an already trained model to see whether the result is any good. Today I'm only going to focus on supervised learning metrics. I only have 40 minutes, so I'm not going to cover them all, there are a lot of them, but these, I think, are the main ones to know and understand, and I'm going to give you an idea of what they are and how they differ.

We're going to start with classification metrics, and to make things a bit simpler, I'm going to start with a binary classification problem. The first metric you encounter on your machine learning journey, if you're doing classification, is of course accuracy. It is the number of correct predictions out of the total number of predictions. It's super easy to understand, it ranges from zero to 100% (or zero to one), and it's very intuitive. You can easily get it in scikit-learn for any classifier with the score method: the score method for estimators in scikit-learn gives you a default evaluation metric, and for classifiers it is accuracy. So I built a logistic regression on some sample data and got almost 96%. This looks amazing, but it may not be that good; we don't really know, because we don't know the context, and I'm going to show you why this is not necessarily a good number in this case.

I'm going to use the same data set to build a dummy classifier. A dummy classifier in scikit-learn is something that doesn't learn anything from the data; it follows a simple strategy. It can generate predictions uniformly at random, or it can, as in my case, just predict the most frequent class it has seen in the training data. And in this case I get 94% accuracy. This is because the data is highly class imbalanced: I made this artificial data set with 10,000 samples, and 95% of the examples are positive while only 5% are negative. So simply saying that everything is positive gets you about 95% accuracy; the 94% here is just down to the way the data got split.

So is 96% accuracy good? We don't know whether it's a good model or a bad model; if we don't know what our data looks like, we cannot say whether it's a good number or not. And even if the data were balanced and we wanted to improve, from this one number alone we cannot know what errors the model is making and what to do to improve. Luckily, there are a lot of other classification metrics and diagnostic tools we can use.
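To make this concrete, here is a minimal sketch (not the speaker's original notebook) of the same kind of experiment, assuming scikit-learn's make_classification to build a similarly imbalanced toy data set; the exact accuracies will differ from the numbers quoted in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Artificial, highly imbalanced data set: roughly 95% positive, 5% negative.
X, y = make_classification(n_samples=10_000, weights=[0.05, 0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A real model...
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("logistic regression accuracy:", clf.score(X_test, y_test))

# ...versus a baseline that always predicts the most frequent class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("dummy classifier accuracy:   ", dummy.score(X_test, y_test))
```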
We're going to start with the confusion matrix. A confusion matrix is a table, basically a matrix, that tells you how many samples your model classified correctly as what they are, and how many it mistook for something else. It's technically not a metric, it's more of a diagnostic tool, but it helps you get insight into the types of errors your model is making, and it also helps you understand other metrics that are derived from it; that's why we're going to look at it.

Here's another example: I built a basic random forest classifier and got 83% accuracy with it. To understand what errors my model is making, I'm going to build a confusion matrix. To do that, you can simply import confusion_matrix from sklearn.metrics; the convention is that you first pass the actual values to the metric and then you pass the predictions, and you get this beautiful array which, if you haven't seen a confusion matrix before, can leave you confused. To understand it better, confusion matrices are typically also represented as tables, like this one. By convention, the rows hold the actual values and the columns hold the predicted values. This is the convention used in scikit-learn and, for example, in TensorFlow, but be careful, because other tools may use a different convention and have it the other way around; even the Wikipedia page has it the other way around. So tutorials or articles on this can be misleading, and you need to pay attention to what is where.

We can learn a lot from this matrix. On the diagonal we have the true negatives and true positives: true negatives are the values that were predicted negative and really were negative, and true positives are the ones predicted positive that really were positive. We also have false negatives and false positives. By summing up the true negatives and true positives and dividing by the total number of predictions, we can get the accuracy, but I wouldn't be showing the confusion matrix just to demonstrate another way to get accuracy; there are other metrics we can derive from it.

Let's say we're building a spam filter or a recommendation system, something for a user where, if we say this is something you want and they don't really want it, they're going to be really annoyed, or if we put an important email in the spam folder when it's not spam, again users will be really upset. So we care about false positives: we don't want to say this is the thing you want when it's not. In this case, one metric that's going to be helpful for us is precision. Precision is calculated by dividing the true positives by the sum of true positives and false positives, so if we don't have any false positives this is going to be one, and if we have some false positives and then manage to reduce their number, we improve the metric and it gets a bit closer to one.

Another case: if we're doing something like medical diagnosis, we don't want to send people home saying they don't have a disease when they actually do. Here we care more about false negatives, and then we need recall. It's very similar to precision, but in this case we care about having as few false negatives as possible.
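A quick sketch of how these quantities come out of sklearn.metrics, using made-up labels rather than the data from the talk; it also unpacks the four cells so you can check that the precision and recall formulas match the library's output.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy actual labels and predictions, just to show the calls and the layout.
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0]

# Convention in scikit-learn: rows are the actual classes, columns the predictions.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Precision = TP / (TP + FP), recall = TP / (TP + FN).
print("precision:", precision_score(y_true, y_pred), tp / (tp + fp))
print("recall:   ", recall_score(y_true, y_pred), tp / (tp + fn))
```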
There is another way of summarizing the confusion matrix in one number that takes into account both precision and recall: the F1 score, which is just the harmonic mean of the two. So depending on your business problem, you might want to care not only about accuracy but also about precision, if you want to minimize false positives, or recall, if it's false negatives. Unlike the confusion matrix, these are actual numbers you can compare, and you can get them easily from sklearn.metrics the same way we got the confusion matrix, by passing the actuals and the predictions. We can also use them, for example, in grid search: if we want to choose a model that gives us better recall, we set the scoring parameter to recall, and grid search returns the estimator with the best recall among the possible combinations of hyperparameters.

Another way to summarize the confusion matrix is the Matthews correlation coefficient (MCC). What's important to notice about this formula is that it takes into account all four cells of the confusion matrix. This makes it different from the F1 score and gives it some nice properties, which I'm going to show with examples.

Let's say we have this data: a hundred samples, 95 of them positive, 5 negative, and we just use a dummy classifier on them, so we get 95% accuracy. Now we calculate the F1 score, and it's even better than the accuracy: about 0.97. And then MCC is undefined; scikit-learn returns 0 and gives you a huge red warning, because you were dividing by zero. And this is good, because it's the only metric so far that gave you any red flag, any indication that something fishy is going on with your model; and indeed it's a dummy classifier, there's nothing good about it.

Another example: we have the same data, but now we managed to classify one negative correctly, so the model is a little bit better. We get an F1 score of 0.952 and an MCC of 0.135. But now let's say that what we called positive we call negative instead; the data is the same, it's just a matter of which class you label positive and which negative. We use the same model, so we're basically just flipping the confusion matrix. Everything is the same, but the F1 score changes for this model and MCC doesn't. This is precisely because F1 takes into account true positives, false positives, and false negatives, but does not care about true negatives. So F1 is very sensitive to what you call the positive class and what you call the negative class, while MCC is not. I feel that if you want to summarize a confusion matrix of a binary problem in one number, MCC gives you a better feeling of what's going on; unfortunately, there is of course a downside, which is that it doesn't extend as well to multi-class problems.
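A small sketch of that flip experiment, using a hypothetical confusion matrix rather than the exact one from the slides (so the numbers won't match 0.952 and 0.135), just to show that scikit-learn's f1_score changes when you relabel the classes while matthews_corrcoef does not.

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

# Hypothetical imbalanced labels: 95 positives, 5 negatives,
# with one negative classified correctly and the rest predicted positive.
y_true = np.array([1] * 95 + [0] * 5)
y_pred = np.array([1] * 95 + [0] * 1 + [1] * 4)

print("F1 :", f1_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))

# Flip which class we call "positive": F1 changes, MCC stays the same.
print("F1  flipped:", f1_score(1 - y_true, 1 - y_pred))
print("MCC flipped:", matthews_corrcoef(1 - y_true, 1 - y_pred))
```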
So far we've only looked at metrics that take into account whether a prediction was correct or not, but a lot of classifiers generate probabilities of an example belonging to one class or the other, and there are of course metrics that take that into account. One of the popular ones is the ROC curve, which stands for receiver operating characteristic; to be honest, this is one of those cases where knowing what the abbreviation stands for doesn't really help a normal person understand what it is. It's basically a plot with the false positive rate on the x-axis and the true positive rate on the y-axis.

Why is it a line, why is it a curve and not just a single dot? This is where probability thresholds come into play. When the model generates a probability of an example belonging to one class or the other, there is by default a 50% decision threshold: if the probability is larger than 50%, we say it's the positive class. But you can actually change this threshold, you can move it around; you can say we're only going to call it positive if the probability is higher than 60, 70, 80 percent. And if you move this threshold, the numbers of true positives and true negatives change, because if something was classified as positive with 60% probability, then when you move the threshold to 80% it becomes a false negative. That's exactly what's going on here: we move the threshold, at each threshold we calculate the true positive rate and the false positive rate for the model, plot each as a dot, and connect the dots into a curve.

The next question is whether it's a good curve or a bad curve. For me it was helpful to think of the best-case scenario, the ideal unicorn sort of model that can perfectly separate the two classes without making any mistakes. For such a model the true positive rate will be one and the false positive rate will be zero, because there are no false negatives and no false positives; as we move the threshold one way the curve stays really close to the upper border of the plot, and moving it the other way it runs along the left border, so it really hugs that top-left corner. What we want from our model is a curve that sits as tightly in that corner as possible. But again, it's a plot, and it's difficult to compare models automatically based on a plot; that's why there is a metric called area under the curve (AUC), which simply calculates the fraction of the plot that lies under the curve. Once you understand what the curve means, it's a lot easier to understand this metric.

This is not the only curve you can use. There is also the precision-recall curve, which follows exactly the same principle: you move the threshold, but in this case you plot precision and recall, you get a curve, and you can calculate the area under it. Why would you use one curve rather than the other? Again, it makes sense to look at the data and see whether you have class imbalance. Using exactly the same imbalanced data set as before, you get the probabilities and plot the curves, and you can see that the ROC AUC is 0.92 while the precision-recall AUC is only 0.57, almost 0.58. So a lot depends on what kind of data you are dealing with.

There is also log loss, another way of assessing the performance of a machine learning model that generates probabilities as well as predictions; it's also often used as a loss function, of course. It takes into account the uncertainty of the model's predictions. In the binary case you have the true label y, which is zero or one, multiplied by the logarithm of the predicted probability, so per sample the loss is -(y log p + (1 - y) log(1 - p)). The minus sign in front of the formula is there to make it a positive number, because the logarithm of something smaller than one is negative, and it would be difficult to compare models with negative numbers.
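A minimal sketch of how you would compute the three probability-based metrics just mentioned with scikit-learn; it reuses the same kind of imbalanced toy data as the earlier snippet (an assumption, not the talk's data set), so the AUC and log-loss values will not match the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 95% positive, 5% negative.
X, y = make_classification(n_samples=10_000, weights=[0.05, 0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

print("ROC AUC:                   ", roc_auc_score(y_test, proba))
print("PR AUC (average precision):", average_precision_score(y_test, proba))
print("log loss:                  ", log_loss(y_test, proba))
```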
The intuition for log loss is easier to plot than to explain in words: if the true label is one and the predicted probability of belonging to class one is high, you get a very small log loss; but the more wrong your predictions are, and the more confident your model is in those wrong predictions, the more the log loss skyrockets. So you care about log loss when you care not only about the accuracy of your predictions but also about how confident your model is in the predictions it makes.

We've talked about some metrics for binary problems; let's see how they can be extended to multi-class classification problems. You can plot the confusion matrix for a multi-class problem; in this case it's the digits data set, handwritten digits 0 through 9 and the classes they represent. It's the same story, with the true labels in the rows and the predictions in the columns. The diagonal represents the correctly classified examples, but you can also diagnose things: you can see that an eight is mistaken for a one, or a two is mistaken for a three for some reason, and you can make adjustments based on what you know about the data. In some cases the confusion will be expected; in other cases it's an indication that something is really going wrong with the model.

You can also get precision, recall, and F1 score, but for a multi-class problem the notions of true positive, true negative, and so on don't apply directly. These metrics can still be extended to a multi-class problem by calculating them per label and then averaging them. There are more than three ways of averaging, but I find the most commonly used are macro averaging, micro averaging, and weighted averaging. I'm going to show an example of how they're calculated for precision, because the other metrics are averaged following the same principle.

We take this toy example, build its confusion matrix, and then calculate the metric for each class, treating it as a one-versus-all problem. For example, for the class bird: the true positives for bird are the cell where we said it was a bird and it really was a bird; the false positives are the rest of the bird column, where we predicted bird but it was not one, and we sum those up; and the false negatives are where it actually was a bird but we said it wasn't. We do the same thing for all classes, and we calculate the totals as well.

Now, to get the micro, macro, and weighted averages, we add the number of samples per class as an extra column. Micro precision is calculated using just the totals, so every sample contributes equally to the average, every sample is equally represented. For macro precision, we calculate the precision per class (for example, for bird it's the bird true positives divided by everything predicted as bird; for cat it's 4 divided by 4 plus 1, which is 0.8) and then we just take the average of that column; this way every class, regardless of its size, contributes equally to the macro precision. Weighted precision is similar: precision is again calculated per class, but then it is weighted by the number of samples, by how the classes are represented in the data.
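Here is a small sketch of the averaging options using scikit-learn's precision_score with a made-up bird/cat/dog example (not the slide's numbers); the average parameter switches between the three strategies described above.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical multi-class labels (bird / cat / dog), not the talk's slide data.
y_true = ["bird", "bird", "cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog"]
y_pred = ["bird", "cat",  "cat", "cat", "cat", "cat", "dog", "dog", "dog", "cat"]

# Rows: actual classes, columns: predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["bird", "cat", "dog"]))

for avg in ("micro", "macro", "weighted"):
    print(f"{avg:>8} precision:", precision_score(y_true, y_pred, average=avg))
```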
So, to recap: micro averaging, like I said before, makes sure that all samples contribute equally to the average, macro averaging lets all classes contribute equally to the average, and weighted averaging weights each class's contribution by its size. When would you want which? That largely depends on your data, and again a bit on the business problem. If you have class-imbalanced data and one class is underrepresented but you really want to get that one right, you may want to use macro averaging to make sure that class's contribution is amplified and is on the same level as the other classes. The scikit-learn documentation recommends micro averaging for multi-label problems; I personally haven't done much with multi-label problems, but I'll trust them on this.

Multi-class log loss is actually the more general case of binary log loss, and the intuition is exactly the same. The formula looks a little different because it is the general case, but it's essentially a sum where we go through every sample and every possible label, and each time we take y, a binary indicator of whether this label is the correct one for this sample, and multiply it by the logarithm of the predicted probability of this label for this sample. For me, the easiest way to understand a metric like this is to just try to code it. In my case I used for loops to make it intuitive; of course in scikit-learn it's done with vectorized operations and is a lot more optimized, but it helps me to understand it this way.

Now I'm going to switch to regression metrics, and I find regression metrics a little easier because you're not dealing with probabilities: you have a continuous target and a continuous prediction, you subtract one from the other and you get residuals, and the question is how to evaluate a model based on all those individual residuals.

If you remember, for all classifiers scikit-learn gives you a default evaluation metric, accuracy; for regressors it is R squared, also called the coefficient of determination, and you get it the same way, with the score method. R squared shows you how well the model's predictions approximate the true values: it is one for a perfect fit and zero for a dummy regressor that just predicts the average. Why is that? It stems from the formula, R² = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²: in the numerator we have the sum of squared residuals, the differences between the actual and the predicted values, and in the denominator we have the squared distances between the actual values and their mean. If you think about it, a model that just predicts the mean makes the top and bottom parts exactly the same, so it's 1 minus 1, which is zero; and if your model is doing even slightly better than predicting the average, then the distances between the actuals and the predictions are shorter than the distances between the actuals and the mean, and the whole metric is going to be somewhat closer to one. It can of course also go negative: if you predicted, say, infinity for everything, you'd get one minus infinity. Technically it can be negative, but that means something is very, very wrong with the model.
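A minimal sketch (assuming a toy data set from make_regression, not anything from the talk) showing that a regressor's score method returns R squared, and that a mean-predicting dummy regressor lands at roughly zero.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy regression data.
X, y = make_regression(n_samples=1000, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# For regressors, .score() returns R squared.
model = LinearRegression().fit(X_train, y_train)
print("linear regression R^2:", model.score(X_test, y_test))

# A dummy regressor that always predicts the training mean scores close to 0.
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)
print("dummy (mean) R^2:     ", dummy.score(X_test, y_test))
```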
The good thing about this metric is that it has an intuitive scale, kind of like accuracy: you get a percentage-like value, and it doesn't depend on your target units, which can be a good thing. But it also doesn't give you any information about the prediction error, about how far your predictions actually are from the actuals, and there are a lot of metrics that give you exactly that, because often you care about the error the model is making.

The most intuitive one is the mean absolute error (MAE): it's simply the average of the absolute values of the residuals, and again you can get it from sklearn.metrics. Another way of summarizing the residuals is to average their squared values instead of the absolute values, but then the metric is no longer in the units of the target; it's in the squared units of the target. That's why you more commonly see RMSE being used, which is just the square root of the mean squared error.

MAE and RMSE have a lot in common: they range from zero to infinity, they have the same units as the target values, so you can see the error your model is making in the units of the target, and they don't care about the direction of the errors, because the residuals are either squared or taken as absolute values. You want these metrics to be as low as possible. But they differ, of course: RMSE gives a relatively high weight to large errors, because the residuals are squared before contributing to the average, and this makes the mean absolute error more robust to outliers, because there the residuals are not squared. RMSE is often used as a loss function because it's differentiable, but for an evaluation metric that doesn't matter as much.

There is a lot of debate here: you'll see a lot of tutorials saying you should use mean absolute error instead of RMSE as an evaluation metric because RMSE is inappropriate and misrepresents the error. Those articles are mainly based on a paper published in 2005 by Willmott and Matsuura. However, a later paper by Chai and Draxler (2014) argued that this is not the case, that RMSE is a perfectly fine metric and in fact sometimes better than the mean absolute error, especially if you expect the error distribution to be Gaussian, which is often the case. It's also important to know that neither metric is going to be reliable on a small test set, when you have fewer than 100 examples, but in practice most of the time you'll have more examples, so that's okay. In practice I find RMSE a completely fine metric to use, and if you really want to downplay the outliers you can use mean absolute error as a second metric; but for most cases RMSE seems to do well.

There is one variant of it that is also quite often used: root mean squared logarithmic error (RMSLE), which is very similar to RMSE except that instead of y you take the logarithm of y (with a small technicality: it's log(y + 1)). The good thing about this metric is that it shows a relative error, and this is important in cases where your targets span a wide range or grow exponentially. For example, if you're predicting prices, a five-dollar error on a fifty-dollar item is quite a large error, whereas a five-dollar error on a fifteen-thousand-dollar price is not a big deal; the logarithm helps you level these things out. It also has an interesting side effect: it penalizes under-predicted estimates more than over-predicted ones, which is also sometimes useful.
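A short sketch of these error metrics with made-up numbers; note that scikit-learn's mean_squared_log_error already applies the log(1 + y) transformation, and the last two lines illustrate the under- versus over-prediction asymmetry mentioned above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_log_error

# Made-up actuals and predictions spanning a wide price range, just to show the calls.
y_true = np.array([50.0, 120.0, 80.0, 15000.0])
y_pred = np.array([55.0, 100.0, 90.0, 15005.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))       # same units as the target
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))  # uses log(1 + y) internally

print(f"MAE:   {mae:.2f}")
print(f"RMSE:  {rmse:.2f}")
print(f"RMSLE: {rmsle:.4f}")

# RMSLE is asymmetric: under-predicting by 50 hurts more than over-predicting by 50.
under = np.sqrt(mean_squared_log_error([100.0], [50.0]))
over = np.sqrt(mean_squared_log_error([100.0], [150.0]))
print(f"under-prediction RMSLE: {under:.3f}, over-prediction RMSLE: {over:.3f}")
```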
So I've rushed through a lot of metrics, both for classification problems and for regression problems, and I don't expect you to understand them all. What I want you to take away from this talk is that there is a reason for this variety, and there is no metric that fits all cases. You need to get to know your data: you need to understand how many outliers you have and whether it's class-imbalanced data or not; without understanding your data, your metric may not make sense. And more importantly, you have to know your business problem and understand what your model should care about most, because in some cases one metric will help you more to achieve your final goal, and in other cases it will be another. So you always need to start with knowing your data and knowing your problem. Thank you. I've put some possibly helpful links here. [Applause]
Info
Channel: Anaconda, Inc.
Views: 21,870
Rating: 4.9703703 out of 5
Keywords: anaconda, anacondacon, open source, ai
Id: wpQiEHYkBys
Length: 34min 2sec (2042 seconds)
Published: Wed May 08 2019