Outliers : Data Science Basics

Captions
Hey everyone, how's it going? It's going to be a pretty brief video today. We're going to talk about the role of outliers in machine learning algorithms, then about the ways people typically deal with outliers, as well as some of the shortcomings of those methods.

In general, I think outliers are an interesting topic for two main reasons. One is that even if you don't study stats, the word "outlier" has become commonplace in the media and in everyday speech; we'll say "oh, that baseball game was an outlier," and we just mean it was different from a typical baseball game. The other reason is that even with all the statistical tools we have, there's no set way to deal with outliers. It really depends on the problem, and different people will take different approaches, which highlights the fact that math is not a yes-or-no, set-in-stone kind of process; it really depends on the situation.

We'll go about this video pretty simply: we'll go through four very popular machine learning algorithms and talk about the impact a couple of outliers can have on the results, and then, as I said, we'll talk about some common ways people deal with outliers. By the way, I got this really cool lobster hat for Christmas; hope you like it.

Let's begin by visiting our very first friend in machine learning and stats, which was linear regression. In linear regression you just have an x and a y variable (keeping it real simple), and you draw a line of best fit through all of your data points. Say all these black x's were your initial data points and you drew this green line of best fit through them; it fits pretty well, no real issues there. Now here's the problem: let's introduce a couple of outliers. These red x's down here, with the exclamation point next to them, are outliers because they are very different from the typical black x's we have up here. As you might have learned, the line we fit in linear regression is very affected by outliers: the slope of the line, beta 1 hat in the form over here, shifts down, so the new line we get is this red line. The biggest issue with this red line is what it's trying to do: it's trying to compromise between the original data and the outliers, but in doing so it doesn't really capture either one well. So we definitely have an issue with outliers in linear regression.
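To see that effect numerically rather than just in a picture, here's a minimal sketch in Python (my own illustration, not code from the video, using made-up data): it fits an ordinary least squares line with numpy on clean data, then again after adding two low outliers on the right, and compares the slopes.

```python
# Minimal sketch: a couple of outliers drag the fitted OLS slope down.
import numpy as np

rng = np.random.default_rng(0)

# "Black x's": a clean linear trend with a little noise
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.shape)
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# "Red x's": two outliers far below the trend, on the right
x_out = np.append(x, [9.5, 10.0])
y_out = np.append(y, [-15.0, -18.0])
slope_out, intercept_out = np.polyfit(x_out, y_out, deg=1)

print(f"slope without outliers: {slope_clean:.2f}")  # close to 2.0
print(f"slope with outliers:    {slope_out:.2f}")    # noticeably smaller
```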
It turns out we can frame logistic regression, which might have been the first classification technique you learned, in exactly the same way. Quick recap: logistic regression models the logit of the probability of an example being class 1 with the same form, beta naught plus beta 1 x, just like we had up here. So let's say our initial data is these black x's, which are all class 0, and these black x's, which are all class 1, and we're asked to draw the sigmoid that predicts the probability of each example being in class 1.

Without any outliers this would be pretty simple: we just draw this black sigmoid so that all of these get correctly classified as class 1, because their probabilities are above 0.5, and all these guys get correctly classified as class 0, because their probabilities are below 0.5. Now again we introduce just a couple of outliers: these three red x's, with the red arrow indicating that they are way over here in the x direction, far to the right with very large values. First, think about what happens to the sigmoid mathematically. The same thing as before: we are again modeling the logit as beta naught plus beta 1 x, so beta 1 hat again goes down, and the effect of a lower beta 1 hat is that the sigmoid flattens and stretches out, so it now looks like this red sigmoid. More intuitively, because these three red x's have very large values of x yet are labeled class 0, the sigmoid tries its best to fold them into class 0, which it attempts by stretching out so much that they fall into the lower part of the curve. But in doing so it damages the fit on the rest of the data. Look at this sigmoid now: everything below 0.5 is still correctly classified, and above 0.5 we get these three right, but these three or four x's here are now predicted as class 0 even though they actually belong to class 1. Maybe the worst part is that although the sigmoid tries so hard to pull those three outliers into class 0, it never succeeds; they still get classified as class 1. So we make a bunch of mistakes in logistic regression because of these outliers. Again, a problematic scenario.

Let's look at k nearest neighbors, another friendly face. With k nearest neighbors, if we have two nice-looking clouds of data, the blue triangles and the green circles, we can draw a pretty nice-looking decision boundary. Any point on this side of the boundary, above it, gets classified as a green circle, because if we use, say, k equals 3 and ask "who are my three closest neighbors," they'll always be circles on this side and always triangles on the other side. No issues there. Now let's see how the story changes if we place just two extra green circles in the wrong spot, so they're outliers. Here we've put the two green circles in with the main pack of blue triangles, and the decision boundary changes in the following way: this whole area is unaffected, this whole area is unaffected, but around where we introduced the outliers we get a very funky-looking boundary. The reason is that if you try to predict some new data point here, where the x is, and ask who its three closest neighbors are, they'll be these two outliers plus this one blue triangle, so the majority class is green circle, and this whole region of the decision space gets allocated to the circles as well. So in k nearest neighbors, introducing just a couple of outliers near a different class can have a big impact on the decision boundary.
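Here's a minimal sketch of that k-nearest-neighbors effect in Python (my own illustration, assuming scikit-learn is installed; the coordinates are made up): two points with the "wrong" label planted inside the other class's cloud flip the prediction for a nearby query point.

```python
# Minimal sketch: two mislabeled "outliers" flip nearby k-NN predictions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Class 0 ("blue triangles") clustered on the left,
# class 1 ("green circles") clustered on the right.
X = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.8], [1.8, 1.5],
              [6.0, 6.0], [6.5, 6.2], [7.0, 5.8], [6.2, 6.8], [6.8, 6.5]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# A query point sitting squarely inside the class-0 cloud
query = np.array([[1.6, 1.4]])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("prediction without outliers:", knn.predict(query))   # -> [0]

# Plant two class-1 "outliers" right next to the query point
X_out = np.vstack([X, [[1.55, 1.35], [1.65, 1.45]]])
y_out = np.append(y, [1, 1])

knn_out = KNeighborsClassifier(n_neighbors=3).fit(X_out, y_out)
print("prediction with outliers:   ", knn_out.predict(query))  # -> [1]
```

With k = 3, any query point near the two planted circles now gets voted into the circle class, which is exactly the funky pocket in the decision boundary described above.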
So far we've talked about machine learning methods that are very affected by outliers; now let's talk about one that is not, our old friend the decision tree. We have a decision tree here on a single variable, and in general low values of this variable correspond to the triangles and higher values correspond to the circles, but there are two outliers where the variable is very high yet the points are labeled as triangles. Recall how decision trees work: the tree scans the entire range of this variable and picks a split such that on one side we have mostly triangles and on the other side mostly circles. Say at first it chooses this split here, this black line I've drawn. On the left-hand side it gets 100% correct, because it says those are triangles and they are indeed triangles. On the right-hand side it gets most of them correct, but it misclassifies the two outliers. The natural question is: is there a different split that would give an even better outcome? The answer is no. For example, suppose hypothetically it split over here instead; if we entertained the idea that the decision tree could be swayed by outliers, we might think the boundary would get pulled in their direction. Let's check whether that actually makes sense for a decision tree. With this new split, if we say everything on the left-hand side is a triangle, we still get all of these correct but now make an extra mistake on this green circle; and if we say everything on the right-hand side is a circle, we still get these three circles correct but still get the two outlying triangles wrong. So all we've done by moving the split is introduce one more mistake, which means the decision tree would never actually choose it. So even if you have outliers like these two triangles, and no matter how far out they sit in that direction, it doesn't matter, whereas in logistic regression, the further out those outliers were, the more the sigmoid got stretched and the more mistakes we made. That's why, when you hear that decision trees and everything built from them, like random forests, bagging, and boosting, are somehow resilient or robust to outliers, this is the kind of behavior people are talking about.
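A minimal sketch of that robustness in Python (my own illustration, assuming scikit-learn; toy data): a depth-1 decision tree's chosen split point stays put when two far-away points with the "wrong" label are added, because no alternative split would reduce the number of mistakes.

```python
# Minimal sketch: a decision stump's split is not dragged by distant outliers.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One feature: low values are class 0 ("triangles"), high values class 1 ("circles")
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0],
              [11.0], [12.0], [13.0], [14.0], [15.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(x, y)
print("split without outliers:", stump.tree_.threshold[0])   # midpoint, 8.0

# Two class-0 "outliers" far to the right, among the class-1 values
x_out = np.vstack([x, [[40.0], [45.0]]])
y_out = np.append(y, [0, 0])

stump_out = DecisionTreeClassifier(max_depth=1, random_state=0).fit(x_out, y_out)
print("split with outliers:   ", stump_out.tree_.threshold[0])  # still 8.0
```

Moving those two outliers even further out (to 400 or 4000) still doesn't budge the split, which is exactly the resilience described above.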
Now, to close out the video, let's talk about two very common strategies people use to deal with outliers, their pros and cons, and then the general downside of doing anything at all to your outliers.

The first strategy is called trimming, and it's probably the one you're more familiar with. Say this is our data: some variable, and you're looking at a histogram of that variable. Trimming operates under the assumption that any very low values or any abnormally high values of that variable should simply be deleted. For example, if we choose our thresholds as the 5th and 95th percentiles, anything below the 5th and anything above the 95th just gets thrown away. A natural question is what that does to the histogram: we go from this histogram to this one, so you can see the tails have been chopped off, and the rest of the distribution gets raised slightly. An intuitive way to think about it is that we take the probability mass in the tails away, but the curve still has to integrate to one, has to add up to 100% probability, so the mass we just deleted gets reallocated to the rest of the curve, and the rest of the curve shifts up. That is trimming. The downside of trimming, as you've probably noticed, is that we are literally just throwing away data, and in cases where you don't have a ton of data to begin with, that can be a problem.

That's where the second strategy comes in; it's related, but has a very different step at the end. It's called winsorizing, named after the statistician Charles P. Winsor. The first part is the same: we still pick low and high thresholds, say the 5th and 95th percentiles again. The big difference is that we don't delete the data beyond the thresholds; instead, we take everything below the 5th percentile and set it equal to the 5th percentile. The intuition is that anything below the 5th percentile is in some sense abnormal or unexpected, so we take all those values and set them to the most reasonable value that actually exists in the data set and that we still consider normal, which is the 5th percentile. We do the same thing on the other side: everything above the 95th percentile gets set equal to the 95th percentile. Now ask the same question: what does that do to the histogram? We haven't deleted any data this time; we still have the same number of observations. The only change is that the bars at exactly the 5th and 95th percentiles get boosted, because we now have a lot more observations at exactly those values. So the advantage is that we're not throwing away data like we are in trimming, but the disadvantage is that we could end up with a lot of samples that are exactly the same, exactly the 5th or exactly the 95th percentile, so we might be artificially reducing the variance of our data. Those are the trade-offs between the two methods.
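Here's a minimal numpy sketch of both strategies (my own illustration with made-up data; scipy also ships a `winsorize` helper in `scipy.stats.mstats` if you'd rather not roll your own):

```python
# Minimal sketch: trimming vs. winsorizing at the 5th and 95th percentiles.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=50.0, scale=10.0, size=1000)   # made-up variable

low, high = np.percentile(x, [5, 95])

# Trimming: delete everything outside the thresholds (we lose observations)
x_trimmed = x[(x >= low) & (x <= high)]

# Winsorizing: keep every observation, but clamp the tails to the thresholds
x_winsorized = np.clip(x, low, high)

print(len(x), len(x_trimmed), len(x_winsorized))      # 1000, ~900, 1000
print(x.std(), x_trimmed.std(), x_winsorized.std())   # spread shrinks either way
```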
Now, just to close, I want to talk about the general downside of applying any kind of default method to your outliers. There are a lot of programming languages and a lot of packages out there for dealing with outliers: you can trim, you can winsorize, you can probably do even more complex transformations of your outliers into something else. But you have to stop and think: any time you take an outlier and transform it into something else, you're inherently assuming that you're never going to see that type of outlier again, for example in your testing data. We're saying these values are abnormal, we probably won't see them again, they're a one-off thing, so we'll do something reasonable to them. But that may not be the case. It may be that when you look at your testing data, the things you're actually trying to predict, you still see outliers just like these, and if you haven't done anything to address the root cause or mechanism that produced them, you're never going to get those points right in the testing data, because you haven't built anything to deal with them. Even worse, you'll probably get them very wrong, because in the training data you treated them as just regular examples instead of anything special, so you're likely to see big errors on your test set.

A lot of times students come to me and ask what the right way to deal with outliers is, and they're usually trying to choose between some of these out-of-the-box techniques. I'd say that none of these is the right way to deal with outliers; the out-of-the-box techniques are the fast, easy way to deal with outliers if you want to make quick progress on your project. The right way is to stop, take some time, and think about what mechanism produced these outliers and whether that mechanism could still exist in the testing data. If so, we should do something more intelligent: look at how the outliers differ from the rest of the data and perhaps build a separate model or treat them in a different way.

So hopefully you learned about outliers in the context of machine learning: which models are and are not typically affected by outliers, some out-of-the-box techniques for dealing with them and the trade-offs between those techniques, and most importantly, the philosophy of what it means to do anything to an outlier at all and what consequences that can have for your entire data project. Okay, if you enjoyed this video, please like and subscribe for more just like this, and I'll see you next time.
Info
Channel: ritvikmath
Views: 12,171
Keywords: outliers, outlier, data science, statistics, big data, ai, machine learning, decision tree, regression, classification
Id: 7KeITQajazo
Length: 13min 6sec (786 seconds)
Published: Mon Mar 01 2021