The Science Behind InterpretML: SHAP

Captions
>> On this special Build edition of the AI Show, we get to hear from Scott Lundberg, Senior Researcher on the Microsoft Research team. It is definitely important to debug and explain your machine learning models. In this video, Scott will explain the science behind SHAP values and how they can be used to explain and debug your models. Make sure you tune in. [MUSIC]

>> Hi, my name is Scott Lundberg. I'm a Senior Researcher here at Microsoft Research AI, and I look forward today to diving into the background behind SHAP, which is a tool and a line of research designed to use Shapley values from game theory to explain machine learning models.

To understand how this works, let's start with a model that we're going to explain. This model, let's imagine, works at a bank: it takes information about customers like John and outputs predictions about their likelihood of having repayment problems if the bank were to give them a loan. In this case, the predicted risk of repayment problems is a bit high, so the bank is unlikely to give John a loan.

As the data scientist responsible for building this model, you may have used a whole variety of different packages in the model development process, anything from scikit-learn linear models to trees, gradient boosted decision trees, and deep networks, all in pursuit of producing a good, accurate, high-quality model. But a lot of these models are very complicated and very opaque, which means that in order to debug them, you need to be able to interpret them. That's one huge motivation for interpretability and explainability: the ability to debug and understand your model. It's not just for data scientists, though; it's also for customers. There are even legal requirements in the finance domain, and in many areas you need to be able to communicate to a customer why a model is making a decision about them. For businesses that depend on these models, understanding how they work, and hence when they can break, is extremely important in order to manage the risk those businesses take on by relying on models. All of these motivate how important it is to have interpretability and explainability.

So how does SHAP help with this? Well, if we go back to John, it's important to understand that whenever a model makes a prediction, it always has some prior in mind, in the sense that there's always some base rate we would have predicted if we knew nothing about John. It could be the average over the training dataset of defaults, or over some test dataset we have in mind, or over some particular group of people, like all the people who got accepted. Whatever that background prior knowledge is, that's where we start when we don't know anything about the person we're predicting for. But John didn't get predicted the base rate, which in this case is 16 percent, the expected value of our model's output over our dataset. He got predicted 22 percent. So what SHAP says is, "Look, we don't need to explain how we got to 22 percent from zero, because zero is just an arbitrary number. What we need to explain is how we got from the base rate, where we knew nothing about John, to the current prediction for John, which is 22 percent."

How do we go about doing this? Essentially, we start with the expectation of the model's prediction over our training dataset, and then we fill out John's application one field at a time. In this case, we start by filling out that his income is not verified.
Now, what that does is bump the expected value of the model up by 2.2 percent, so we can say that 2.2 percent must be attributable to the fact that John didn't have his income verified; relative to the training dataset average, this increases his risk. If we do the same thing for his debt-to-income ratio, we see that it is at 30, which bumps him up to 21 percent. Then we see that he had a delinquent payment 10 months ago, which further increases his risk, up to 22.5 percent. Again, we're filling out his application one entry at a time. Then we fill in the fact that he had no recent account openings, and this drops his risk significantly, because not applying for credit is a good sign. But finally, we fill in the fact that he has 46 years of credit history, which you would think would be a really good thing, but ironically, in this case it turns out that it hurts him significantly and bumps his risk up to 22 percent. So now we've filled out his entire application and arrived at the prediction of the model, but we've done it piece by piece, so we can attribute each piece to a feature and hence explain how we got from knowing nothing about John to the model's final prediction.

Now, let's back up and see how this works for a simple linear regression model from scikit-learn, for example. Here's a linear model trained on this lending dataset, and I'm showing you a straight line that represents the partial dependence plot of that model. On the x-axis we have the feature I'm explaining, which in this case happens to be annual income. On the y-axis we have the partial dependence, which is the expected value of the model's output as we change that one feature; for a linear model that is a straight line. What I want to highlight here is how easy it is, for a linear model, to read the SHAP value off the partial dependence plot. The gray line here is just the average output of the model, which is the prior base rate we were talking about. The SHAP value is just the difference between that average and the partial dependence plot at the value of the feature we're interested in. For a linear model, we can simply look and say, "John makes $140,000; that puts him at a certain point on the partial dependence plot." Then we just measure the height of that point from the mean value: if it's higher than the average, the SHAP value is positive, and if it's lower, it's negative.

We can do the exact same thing for more complicated models like generalized additive models, such as the EBMs in the InterpretML package. There you no longer have a straight line; you can have a much more flexible partial dependence plot. But because there are no interaction effects going on, we still have an additive model, so the SHAP values are again exactly the difference between the height of the partial dependence plot and the expected value of the model. If you were to plot the SHAP values for many different individuals, you would get essentially a line, and that line would be exactly the mean-centered partial dependence plot.

More complicated models, though, are of course where people are most interested in this kind of explanation. In that case, you can't just use a single ordering and introduce features one at a time, because it turns out the order in which you introduce features matters.
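Before turning to interaction effects, here is a minimal sketch of the linear case just described. The data, feature count, and coefficients below are made up for illustration (nothing comes from the lending model in the talk); the point is only that for a linear model the SHAP value of a feature is its coefficient times the distance of the feature's value from the feature's mean, which is exactly the height of the partial dependence line above or below the model's average output, and that the base rate plus the SHAP values recovers the prediction.

```python
# Minimal sketch with made-up data: for a linear model, the SHAP value of a
# feature is its coefficient times (feature value - feature mean), which is the
# height of the partial dependence line above or below the average prediction.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # stand-in for a small lending dataset
y = 0.3 * X[:, 0] - 0.2 * X[:, 1] + 0.05 * X[:, 2] + rng.normal(0, 0.01, size=1000)

model = LinearRegression().fit(X, y)

base_rate = model.predict(X).mean()     # the prior / expected model output
john = X[0]                             # one individual to explain

# SHAP values for a linear model, treating features as independent
shap_values = model.coef_ * (john - X.mean(axis=0))

# The base rate plus the SHAP values recovers John's prediction exactly
print(base_rate + shap_values.sum())
print(model.predict(john[None, :])[0])
```

Because the model is additive, the order in which features are introduced makes no difference here; the same numbers should come out of shap.LinearExplainer with an independent-features background.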
If there's an AND function or an OR function, the first or the second feature you introduce will get all the credit. Here's an example on a real dataset, with no recent account openings and 46 years of credit history, where we've filled out account openings first and then credit history. What if, in filling out this application, we had first filled out credit history and then account openings? It turns out that makes a huge difference, which means there's a strong interaction effect between credit history and account openings. That's where SHAP comes in: it tries to fairly distribute the effects that come from these higher-order interactions.

You might say, "How on earth are we going to do this?" Well, it turns out we can go back to the 1950s and rely on some very solid theory from game theory that is all about how to do this fairly in complicated games with lots of interacting players and higher-order interaction effects. How can we share those interaction effects fairly among all of the players such that a set of basic axioms is satisfied? It turns out there's only one way to do it, and it gives the values that are now called Shapley values, after Lloyd Shapley. Lloyd Shapley did a lot of great work in game theory and allocation, and actually got a Nobel Prize in 2012. So this is based on some solid math.

Going back to our data scientist, you might say, "That's great, I'm really convinced by this. I think I should use these values. How do I compute them?" Well, it turns out they result from the averaging we just talked about, using a single ordering, except we have to do it over all orderings. That's computationally intractable, and it's even worse because it's NP-hard, if you know what that means. So that's where the real challenge with these values lies: how to compute them efficiently. I'm not going to go into the algorithms that allow us to do that, but they're at the heart of what is in the SHAP package and the research behind it, which is designed to let us compute these very well-justified values efficiently and effectively on real datasets. If we do that for XGBoost, we can actually solve it exactly, in polynomial time, very quickly.

Now we see that the SHAP values no longer exactly match the partial dependence plot, because they're accounting for these interaction effects. When you look at a partial dependence plot, you're losing all the higher-order interaction information about the ANDs and ORs your model may be doing, but the SHAP values account for that and drop that credit down onto each feature. So you'll see vertical dispersion when you plot many people's SHAP values for a feature.

Let's do this for a particular feature and dive into credit history, because remember, it was a bit surprising that credit history hurt John's credit score. If we plot credit history versus the SHAP value for credit history, we get a dot for every person, with, again, a little bit of vertical dispersion from the interaction effects. If we look at John, we'll see he's in the tail here at the end, where he's got a really, really long credit history. It doesn't take too long before you realize that debugging was super important here, because this model was actually identifying retirement-age individuals based on their long credit histories and increasing the predicted risk of default for them. This is a big problem, because age is a protected class.
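To make the "average over all orderings" idea concrete, here is a brute-force sketch using a made-up two-feature AND model and background data. It is exponential in the number of features, so it illustrates only the definition, not the efficient algorithms in the SHAP package, and it handles "missing" features by averaging the model over a background dataset, which is one common choice of value function.

```python
# Brute-force Shapley values: average each feature's marginal contribution over
# every possible ordering of the features. Exponential in the number of features.
from itertools import permutations
import numpy as np

def expected_output(model_fn, x, known, background):
    """Average model output over background rows, with the features in `known`
    fixed to x's values (one common way to handle 'missing' features)."""
    X = background.copy()
    if known:
        idx = list(known)
        X[:, idx] = x[idx]
    return model_fn(X).mean()

def brute_force_shapley(model_fn, x, background):
    n = x.shape[0]
    phi = np.zeros(n)
    orderings = list(permutations(range(n)))
    for order in orderings:
        known = []
        prev = expected_output(model_fn, x, known, background)  # base rate
        for i in order:
            known.append(i)
            curr = expected_output(model_fn, x, known, background)
            phi[i] += curr - prev                               # marginal contribution
            prev = curr
    return phi / len(orderings)

def model_fn(X):
    # Made-up AND-style interaction: risk is high only when both features are "on"
    return (X[:, 0] * X[:, 1]).astype(float)

background = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
x = np.array([1.0, 1.0])
print(brute_force_shapley(model_fn, x, background))  # credit split evenly: [0.5 0.5]
```

Run on this AND example, the credit is split evenly between the two features, instead of all of it landing on whichever feature a single ordering happened to introduce last.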
So this analysis essentially found that a complicated model was able to pull out credit history and use it as a proxy for age. That's why it's really important to explain and debug your models.

We've talked about just one example of explainable AI in practice, used for debugging and model exploration, but I'd like to highlight that there are many other ways to use this type of interpretability in your workflow. You can monitor models by explaining their errors over time. You can encode prior beliefs about models and then use explanations to actually control your model training process. You can support customer retention, for example by helping call centers explain why a churn model made the prediction it did. We've applied this in decision support for medical settings, and there are a lot of places where human risk oversight of machine learning models is enhanced by explanations. In regulatory compliance, there's a lot of need for this type of transparency and for consumer explanations. It can help you better reason about anti-discrimination, as we just showed an example of, and of course risk management, where you're understanding what your model will do when economic conditions change, which is all the more important right now. Even in scientific discovery, explanations can help you do better population subtyping, which is extremely helpful for pattern discovery, and even signal recovery inside things like DNA. All of these are downstream applications of interpretable ML supported by these types of tools and research, and I hope this has given you a bit of a taste of and excitement for what can be done here. Thanks. [MUSIC]
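To round out the walkthrough, here is a minimal, hypothetical sketch of how the kind of credit-history dependence plot described in the talk could be produced with the shap package and an XGBoost model. The dataset, feature names, and label rule below are invented for illustration; only the package calls (shap.TreeExplainer, shap.plots.scatter) come from the real library.

```python
# Hypothetical sketch: train a gradient boosted model, compute SHAP values with
# TreeExplainer, and plot one feature's dependence (one dot per person, with
# vertical dispersion reflecting interaction effects).
import numpy as np
import pandas as pd
import xgboost
import shap

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "credit_history_years": rng.uniform(0, 50, size=2000),
    "debt_to_income": rng.uniform(0, 60, size=2000),
    "recent_account_openings": rng.integers(0, 5, size=2000),
})
# Made-up labels with an interaction, just so the model has something to learn
y = ((X["debt_to_income"] > 35) & (X["recent_account_openings"] >= 2)).astype(int)

model = xgboost.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)   # exact SHAP values for tree ensembles
shap_values = explainer(X)

# Feature value on x, SHAP value on y, one dot per person
shap.plots.scatter(shap_values[:, "credit_history_years"])
```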
Info
Channel: Microsoft Developer
Views: 12,137
Rating: 4.9457626 out of 5
Keywords: Microsoft, Developer, build2020, responsibleml
Id: -taOhqkiuIo
Length: 11min 46sec (706 seconds)
Published: Sat May 16 2020