SHAP with Python (Code and Explanations)

Captions
Do you want to understand how your machine learning models work, how each model feature has contributed to a prediction, or even what trends the model is using to make predictions in general? Well then, look no further than SHAP. It is the most powerful Python package for understanding and debugging your models.

Today we'll be walking through SHAP code as well as explanations for all of the SHAP plots. These include the waterfall plot, force plots, the absolute mean SHAP plot, the beeswarm plot, and dependence plots. You will see how, with a few lines of code, we're able to create eye-catching and insightful visualizations.

If you want this code, then check out the companion article linked in the description. We don't discuss it in this video, but there's also a section in the article that looks at interpreting SHAP values when you have a binary target variable. Really, the plots that we see in this video are just the tip of the iceberg. If you want to get your SHAP skills in shipshape, then wait until the end of the video, where I explain how you can get access to a Python course. For now, let's jump to the tutorial notebook.

We start by importing all the necessary Python packages. We have some standard packages like pandas, NumPy, Matplotlib, and seaborn. We'll be using XGBoost to build our model, and finally we import the shap package. We also have to initialize this package; this just allows us to display some of the SHAP plots in the notebook.

Next we're going to load our dataset. We'll be using the abalone dataset, and you can see that if we print out the length, we have 4,177 observations, or abalone, in this dataset. Abalone are a type of shellfish delicacy, and we want to use this dataset to try to predict their age or, more specifically, the number of rings in the abalone shell. We'll be using features like the length of the shell, the diameter of the shell, as well as the whole weight of the abalone. If you want more details on the fields, then just check out the link to the dataset at the
top of this notebook.

Now, before we jump to the SHAP values, it's worth exploring this dataset. This is to build some intuition and also help you understand what you see in the SHAP plots. We start by looking at one of the features, and we display a scatter plot of whole weight and rings. Whole weight is the weight of the entire abalone, and this includes the shell and the meat inside the shell. Looking at the scatter plot, we can see that the number of rings tends to increase as whole weight increases. This makes sense, as we would expect older abalone to be larger and weigh more.

We can also visualize the sex of the abalone. This is a categorical feature, where abalone can be labeled as infant (I), male (M), or female (F). For each of these labels, we create a box plot of the number of rings. Looking at this box plot, we can see that infants tend to have a lower number of rings, and when we compare males and females, there's really not that much difference. I wanted to point this feature out as the SHAP values for categorical features can get a little weird. After one-hot encoding, all the individual binary features will have their own SHAP values. This makes it difficult to understand the overall contribution of the original categorical feature. We won't go into details yet, but there's a useful article linked in the description.

For our last bit of data exploration, we're going to create a correlation matrix for all the continuous features in our dataset. We then visualize this using a heat map. You can see that we're dealing with some highly correlated features. For example, length and diameter are perfectly correlated. Whole weight is also highly correlated with some of the other weight measurements, for example shucked weight, which is the weight of the meat inside the abalone shell, and shell weight, which is the weight of just the shell excluding the meat.

Now we're almost ready to build our model. We're going to use our data exploration to help inform some feature
engineering. The first thing we're going to do is drop diameter and whole weight from our list of features. This is just because we saw that these were highly correlated with some of the other features. We also saw that sex was a categorical feature, so before we can use it in a model we need to transform it into three dummy variables. We then drop the original feature from the dataset. Here we display a snapshot of our X feature matrix, and you can see that we have eight model features in total.

We can now train a model to predict the number of rings in an abalone shell. As our target variable is continuous, we're going to be using the XGBoost regressor, and we train a model on the entire feature set. At this point our model should be good enough to demonstrate the SHAP package. We can see this by evaluating it. First we get the predictions on the entire training set, and we create a scatter plot of these predictions versus the actual number of rings. We also add a red line, which gives the perfect predictions. We can see the model is doing an alright job: the predictions are not too far away from the red line. We haven't put too much effort into this model, and unless you're using SHAP for data exploration, you should always use best practices, for example using a train-test split. The better your model, the more reliable your SHAP analysis will be.

So that evaluation told us how well the model was making predictions. We can now use SHAP to tell us how it is making those predictions. To do this, we pass our model into the SHAP Explainer function. This creates an explainer object. We then use this to calculate SHAP values for every observation in the feature matrix. And that's it, it's as simple as that. With a few lines of code you can calculate the SHAP values and gain an incredible insight into how your model is working. Just one note: this step can take a long time to run if your feature matrix is large. You can save time by only passing a subset of observations. For
example, if we use this line instead, it would return the SHAP values for the first 100 observations. We won't go into detail on this SHAP values object, but for now we're just going to look at the shape of its values component. This tells us that there are eight SHAP values for each of the 4,177 observations. Remember, we had eight model features, so in other words we have one SHAP value for each feature in our model.

We can use the SHAP waterfall function to visualize these SHAP values, and here we display the waterfall plot for the first observation. There's a lot going on here, so let's break down what each of these figures means. Firstly, E[f(x)] is the average predicted number of rings across all 4,177 abalone. f(x) is the predicted number of rings for this particular abalone. The SHAP values are all these values in between. They tell us how each model feature has contributed to the difference between the prediction and the average prediction. For example, shucked weight has increased the predicted number of rings by 1.68. Lastly, all these numbers on the left are the actual feature values. For example, we can see that this abalone is male because sex.M equals one.

Another way to visualize this information is to use the force plot. You can think of this as a condensed waterfall plot. You can see that we have the same base value as before, and you can see how each feature has contributed to the final prediction of 13.04. So the waterfall plot and force plots are useful for understanding how the model has made individual predictions. Now let's see how we can understand the trends the model is using to make predictions in general.

We can also combine multiple force plots together to create a stacked force plot. Here we have passed the first 100 observations to the force plot function. How this works is that each of the individual force plots has been rotated 90 degrees and then stacked side by side. You can see that this plot is
interactive. We can choose which feature to order the force plots by, so for example let's click shell weight. We can also choose which SHAP values we want to display, so again let's click shell weight. From this plot we can see that as shell weight increases, the SHAP values also increase. In other words, older abalone tend to have heavier shells. So the force plots are useful if you want to quickly explore some of the relationships captured by the model. The next plot we'll look at can tell us which features are most important to the model.

The next built-in function is the bar function, and this gives us the absolute mean SHAP plot. Each of these bars gives the absolute mean SHAP value for that feature. Remember, for each observation there will be a SHAP value for each of the eight model features. We have taken the absolute value of these and calculated the average across all 4,177 observations. We take the absolute value as we do not want positive and negative SHAP values to offset each other. Features that have made large positive or negative contributions will have a large mean SHAP value. In other words, these are the features that have made a significant contribution to the model's predictions. In this sense, the mean SHAP plot can be used as a measure of feature importance.

Next, in my opinion, we have the single most useful SHAP plot, and that's the beeswarm plot. This is a visualization of all of the SHAP values. On the y-axis we have the values grouped by the different features, and the color of the points is determined by the feature values, so higher values are redder. On the x-axis we have the SHAP values. Like with mean SHAP, the beeswarm can be used to highlight important relationships. We can see which features have large positive or large negative SHAP values. In fact, these features have been ordered in the same order as the mean SHAP plot. We can also use this plot to start to understand the nature of these relationships.
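
The absolute mean SHAP calculation described above can be sketched in a few lines of plain Python. This is only an illustration: the feature names are borrowed from the video, but the SHAP values are made up, and the helper `column_means` is a hypothetical name rather than anything from the shap package. It shows why the absolute value is taken before averaging: without it, positive and negative contributions cancel out.

```python
# Made-up SHAP values for illustration only (not output from the abalone model).
# One row per observation, one column per feature.
feature_names = ["shell weight", "shucked weight", "length"]
shap_rows = [
    [1.5, -1.0, 0.2],
    [-1.5, -0.5, 0.1],
    [2.0, 1.5, -0.3],
    [-2.0, 0.0, 0.0],
]


def column_means(rows, transform=lambda value: value):
    """Average each column after applying `transform` to every entry."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return [sum(transform(row[j]) for row in rows) / n_rows for j in range(n_cols)]


plain_mean = column_means(shap_rows)     # positive and negative values cancel out
mean_abs = column_means(shap_rows, abs)  # the quantity the bar plot is built from

# Rank features from largest to smallest mean |SHAP|.
ranking = sorted(zip(feature_names, mean_abs), key=lambda pair: pair[1], reverse=True)

for name, value in ranking:
    print(f"{name}: {value:.2f}")
```

In this toy data the plain mean of every column is roughly zero, so it would suggest that no feature matters, while the mean of the absolute values still separates the features clearly.
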
For shell weight, notice that as the feature values increase, the SHAP values also increase. We saw a similar relationship in the stacked force plot. You may also notice that the relationship for shucked weight is the opposite. Looking at the beeswarm plot, you can see that large values for this feature are associated with smaller SHAP values. Let's use dependence plots to try to understand what's going on here.

A dependence plot is just a scatter plot of the SHAP values versus the feature values for a single feature, and they are particularly useful if the feature has a non-linear relationship with the target variable. For example, we have the dependence plot for shell weight. Looking at the beeswarm plot, we might have assumed that this relationship was linear, but looking at the dependence plot we can see that it's not perfectly linear. We can also use the values for a second feature to color the scatter plot. We have the same plot as before, but now the larger the shucked weight value, the redder the points, and the SHAP values are large when both shell weight and shucked weight are large.

Finally, we also have the dependence plot for shucked weight, and we can use this plot to confirm what we saw in the beeswarm plot: the SHAP values do decrease as the shucked weight increases. This relationship seems strange. Wouldn't we expect older abalone to be larger and have more meat? Well, this is in fact a result of an interaction between shell weight and shucked weight. We can actually use SHAP interaction values to identify relationships like these. This is an extension of standard SHAP values.

If you want to learn more, you can get free access to my Python SHAP course by signing up to the newsletter in the description. Along with SHAP interaction values, you'll learn all the theory behind SHAP as well as how to build your own custom SHAP plots. There is even a lesson on working with image data and a deep learning model.
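
Two ideas from the transcript can be checked on a toy example: that the SHAP values fill the gap between E[f(x)] and f(x), and that an interaction makes one feature's SHAP value depend on another feature. The sketch below computes exact Shapley values by brute force over feature coalitions. Everything in it is an assumption for illustration: the `model` function is a hypothetical two-feature formula with an interaction term (not the XGBoost model from the video), the background rows are made up, and this is not how the shap package computes values for tree models.

```python
from itertools import combinations
from math import factorial


# Hypothetical stand-in model with an interaction between the two features.
def model(x):
    shell, shucked = x
    return 10.0 + 4.0 * shell - 2.0 * shucked + 3.0 * shell * shucked


# Made-up background rows used to average out features that are "missing".
background = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.9), (0.2, 0.1)]


def coalition_value(x, present):
    """Mean prediction with features in `present` fixed to x, the rest taken from background rows."""
    total = 0.0
    for row in background:
        mixed = tuple(x[j] if j in present else row[j] for j in range(len(x)))
        total += model(mixed)
    return total / len(background)


def shapley_values(x):
    """Exact Shapley values by enumerating every coalition of the other features."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (coalition_value(x, s | {i}) - coalition_value(x, s))
    return phi


x = (0.4, 0.6)
base_value = coalition_value(x, set())  # plays the role of E[f(x)]
phi = shapley_values(x)
prediction = model(x)                   # plays the role of f(x)

print("base value:", base_value)
print("SHAP values:", phi)
print("base + sum of SHAP values:", base_value + sum(phi))
print("prediction:", prediction)

# Same shucked weight, different shell weight: the shucked-weight value changes.
phi_low_shell = shapley_values((0.1, 0.6))
print("shucked-weight value at low shell weight:", phi_low_shell[1])
```

The base value plus the sum of the SHAP values matches the prediction up to floating-point rounding, which is the relationship the waterfall plot draws. Because of the interaction term, the value attributed to shucked weight shifts when only shell weight changes, which is the kind of effect the colored dependence plot is meant to reveal.
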
Info
Channel: A Data Odyssey
Views: 42,788
Id: L8_sVRhBDLU
Length: 15min 41sec (941 seconds)
Published: Mon Mar 20 2023