ML Interpretability: SHAP/LIME

Captions
Welcome back to Machine Learning Dojo. I am here with my partner in crime, Connor Tan. Hey Tim, thank you for having me on again, it's an absolute pleasure. We decided to do a series of bite-sized videos on interpretability. We had the main man Christoph Molnar on Machine Learning Street Talk last week, and to be honest it's almost overwhelming: when you go through his Interpretable Machine Learning book there are so many different approaches, and Connor and I believe that Shapley values are probably the main place to start. When we're applying machine learning models we often want to understand what is really going on in the world, we don't just want to get a prediction. Sometimes we want to understand how a model works overall, but sometimes we want to explain the results of an individual prediction. Maybe your application for a credit card was denied and you want to find out why. Maybe you want to understand the uncertainty associated with the model, and maybe you're going to take a real-world decision based on your prediction. That's where Shapley values come in. A prediction can be explained by assuming that each feature value of an instance is a player in a game where the prediction is the payout. Shapley values, a method from coalitional game theory, tell us how to fairly distribute the payout amongst the features. For example, suppose you have a model that predicts house prices and the average prediction for all apartments is 310,000 euros. There are trees nearby, you're allowed cats in the apartment, you have 50 square metres of floor space and it's on the second floor. How much has each feature value contributed to the prediction versus the average prediction? The answer is simple for linear regression models: the effect of each feature is the weight of the feature times the feature value. Shapley values are the only possible explanation that obeys four key properties: efficiency, symmetry, additivity and dummy. This means that the Shapley values of each feature are in the same units as the prediction, and you can just add up the Shapley values to explain an individual prediction. Shapley values are the grand unification of feature attribution methods, unifying previous methods like LIME, DeepLIFT and layer-wise relevance propagation. Enjoy the show, folks. There are four special properties that an additive feature explanation could have, and Shapley values are the only possible ones that meet all of these criteria. Let's start with the efficiency property. When you sum up the Shapley values over all of the features for a particular example, it should be the same as the difference between the average prediction and this prediction. This is really important because it means that your Shapley values are in the same units as the prediction, so if you're predicting something like a price, your Shapley values for every feature are going to be in units of dollars. The next important property is symmetry: when a particular feature walks into the room, if it has the same effect for all of the possible rooms, the features are effectively interchangeable. The symmetry property essentially means that we're going to give the same credit to two features if those features are completely interchangeable. The third principle is dummy. It simply means that if adding a feature to the coalition does nothing, then it should be assigned a blame of zero. The final property is additivity: if your model prediction is the sum of two component models, then the Shapley values for your model should be the sum of the Shapley values of your component models.
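As a concrete illustration of the linear-regression case and the efficiency and dummy properties described above, here is a minimal numpy sketch (not from the video; the toy data and weights are illustrative): for a linear model, the Shapley value of feature j is the weight times the difference between the feature's value and its average.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # toy training data, 4 features
w = np.array([3.0, -2.0, 0.5, 0.0])      # model weights; the last feature is a "dummy"
b = 10.0
predict = lambda X: X @ w + b            # the linear model we want to explain

x = X[0]                                 # the instance to explain
phi = w * (x - X.mean(axis=0))           # Shapley values, one per feature

print(phi)                               # dummy: the zero-weight feature gets exactly 0
# efficiency: contributions sum to (this prediction - average prediction)
print(phi.sum(), predict(x[None])[0] - predict(X).mean())
```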
This additivity property is incredibly useful if you're analyzing something like an ensemble model, a random forest, because it means that you can calculate the Shapley values of every component decision tree and then just add them up if you want to analyze the behaviour of the overall model. It's just the most useful technique, I think, out of the whole book of interpretability methods. It's such a useful one: it's got all these amazing theoretical properties and it will work on any ML algorithm, right? It doesn't need to be an intrinsically interpretable model; Shapley values will work on any ML model that you throw at it. Okay, well, how does it work? Shapley values themselves come from this guy Lloyd Shapley, who's a Nobel-prize-winning economist, and Shapley values try to answer the question of how you fairly distribute a payout amongst players. So it comes from game theory; this idea didn't even start with machine learning, it came from a completely different area of academia. In our context the players are the features that contribute to making a machine learning prediction, and the payout is how the prediction on a particular example differs from the average. So that's what we're looking at: we're trying to explain why one particular prediction is different from the average prediction across all your data. And it's a really interesting word there, fairly. What do you mean by fair? Well, all these properties describe what makes a payout fair. There's symmetry, which says that two features that behave the same get an equal payout. There are properties like the dummy property, so if you have a feature that does nothing it really shouldn't get any credit, and additivity. But the really cool one is this thing called efficiency, and that means that the contributions of every feature must add up to the overall payout. That sounds kind of obvious, right? But the cool thing about that is it means Shapley values are going to be in the same units as your prediction, so if you're predicting something like a price then the Shapley values are going to be in dollars, so you can assign a dollar value to every single feature. And graphically, what that looks like, here in the shap library, you might have seen this before: if you have an average prediction, a base value, you can add up the SHAP values of every single feature, and if you add them all up you get the prediction, the actual prediction on your example of interest. So that's just really awesome, that efficiency property, and that's what I think makes Shapley values so useful.
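For reference, here is a hedged sketch of that force-plot workflow, assuming the shap and scikit-learn packages are installed. The video's figure uses the Boston housing data, which recent scikit-learn versions no longer ship, so this sketch substitutes the California housing data and an illustrative random forest.

```python
import numpy as np
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])
base = float(np.ravel(explainer.expected_value)[0])   # the average prediction

# efficiency in action: base value + per-feature SHAP values = the model's prediction
i = 0
print(base + shap_values[i].sum(), model.predict(X.iloc[[i]])[0])

# the additive breakdown plot discussed above
shap.force_plot(base, shap_values[i], X.iloc[i], matplotlib=True)
```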
So, to explain a couple of things here: you said that the Shapley values are in the same kind of scale as the actual thing that we're predicting, but in the image that you've got there, there's PTRATIO and LSTAT, and these are separate features, so how can they all be on the same scale as the thing that we're predicting? Well, let's get into how these features are actually calculated. I think what makes it meaningful, as you say, is that they are on the scale of the label space. The label space here corresponds to, I'm not even sure what the prediction is here, but that 22, 23, 24 is meaningful for the thing that we're predicting, and all of the Shapley values, the contributions for the various different features, have been scaled to be meaningful on that label space. Exactly. I think this case is the Boston house price dataset, so RM is the number of rooms in a house, and RM has a SHAP value, or a Shapley value, of 6.5. What that means is that because this particular house has quite a large number of rooms, the house price has gone up by six point five dollars, I think it's actually six point five thousand dollars. And LSTAT, for example, is actually taking away, let's say, four point nine eight thousand dollars. From the dreaded Boston housing price dataset, my god, that brings back nightmares from my PhD days, but it just makes it so much more meaningful. But just to rewind a tiny bit: we are considering coalitions of the features, so lots of different combinations of how the features can work together and what effect those coalitions will have on the prediction, right? But we're not retraining the model, are we? That's right. To actually go and calculate how you assign this blame to the different features, you have to consider what if some of the features weren't in the model and what if they were in the model, and to approximate that we normally just impute an average value for a missing feature, whereas theoretically I guess we should be retraining a model, but we don't do that in practice. How does that work, though? So we're not computing an average, we are somehow extrapolating over the features which we are removing from the coalition? All right, let's break it down. Let's take linear regression as a great example; we all understand how linear regression works, right? For every feature you multiply it by a coefficient, then you add those up and that's your prediction. So if you want to consider, for a particular data point, what is the contribution of a particular feature, the jth feature, it's very easy to see how that works: you look at that term there, the coefficient times the feature, and you can compare that to the average value of that term, so the coefficient times the average value of that feature. That difference between the feature effect and the average feature effect explains the contribution of that particular feature to the linear regression model. So for a given data point, its contribution is the difference from the average, or the expectation over all of the other data points? Yeah, and that's an interesting point: you actually need to have knowledge of all of the other training data points. This makes intuitive sense, right? If you have an equation like this and you're trying to understand why a particular house is quite expensive, you can just look at these terms and say which term is a lot bigger than it is normally. How do you calculate Shapley values for any model? You have to sum over all possible coalitions, where coalitions are subsets of features. So here's the theoretical equation for how you calculate exactly what the Shapley value is. Ignoring the normalising coefficient at the front, what we're comparing is a coalition that includes your feature against a coalition not including your feature. So, for example, it might be that you have all of the features but one, and then you add that one feature back in, and you can see how adding in that feature changes your prediction. But you have to sum that over every possible subset of features: you have to consider the empty set, where your feature is the first one to be included, when you've got half the features, when you've got all but one, and so that's exponential in the number of features, and for anything but the most simple case you can't actually calculate it exactly. That makes sense, because what we're essentially considering is a binary mask of features which corresponds to a coalition.
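To pin that equation down, here is a from-scratch sketch (illustrative, not the video's code) of the exact Shapley sum over all coalitions, approximating a "missing" feature by imputing its training mean rather than retraining the model; it is exponential in the number of features, so it is only feasible for a handful of them.

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_value(predict, x, X_background, j):
    """Exact Shapley value of feature j for instance x, with mean-imputation for absent features."""
    M = len(x)
    baseline = X_background.mean(axis=0)
    others = [k for k in range(M) if k != j]
    phi = 0.0
    for size in range(M):
        for S in combinations(others, size):
            z = baseline.copy()
            z[list(S)] = x[list(S)]               # coalition S keeps its real values
            without_j = predict(z[None])[0]       # v(S)
            z[j] = x[j]                           # add feature j to the coalition
            with_j = predict(z[None])[0]          # v(S union {j})
            # normalising coefficient |S|! (M - |S| - 1)! / M!
            weight = factorial(size) * factorial(M - size - 1) / factorial(M)
            phi += weight * (with_j - without_j)
    return phi

# toy usage on a small linear model
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
predict = lambda X: X @ np.array([3.0, -2.0, 0.5, 0.0]) + 10.0
print([round(shapley_value(predict, X[0], X, j), 3) for j in range(4)])
```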
And there is an exponential number of permutations of those binary masks. But I'm trying to relate this to what we were just talking about, because you were saying you can compute the feature effect by removing the kind of average effect of the thing that was removed, but here we're talking about coalitions, so how does the feature effect come into the coalition? Because as I understand it, what we're essentially doing is comparing the model that has all of the features enabled versus a model that has some of the features enabled, but it's not like we remove those features, so what are we replacing them with, is it the average feature? Yeah, so what you can do in practice is impute a value for that missing feature: you'd substitute in a value taken from a different example in your training dataset. So is that what we're doing, are we substituting it with a different value, or for the features that are removed are we extrapolating, in a sense, by replacing them with the expected feature value? Great question. Well, in the implementations that we use in machine learning, you actually use the value of a feature from a different example in the training dataset, and that example might be a completely different example, and together you might come up with this kind of Frankenstein example that's got cobbled-together values of features combined from other examples, and it might not really make much sense as an example; when you're predicting on this it might be out of the distribution of your data. I see, so actually what we're talking about here is a general mathematical intuition, or framework, for how you do these coalitions, but the reality is that depending on the particular type of machine learning algorithm you'll have to come up with what it means to remove these particular features from this particular coalition. Exactly. I think in the mathematical framework we ought to be retraining a model based on just that coalition subset and seeing what that would do, but in reality you don't have enough computational resources. Probably the best way would be to kind of brute-force retrain the model for every different permutation of this binary mask of features, but that would just take forever. Exactly, so instead what you can do, rather than retraining, is run the model in inference mode, in forward mode only, and impute some value for that feature using a different example. So this is the first approximation you make, but then you have to start making a few more approximations. This is exponential in the number of features, so in classic Shapley value estimation you do some sampling estimation: rather than summing over every possible coalition, every possible subset, you would just randomly sample some of these coalitions. But still, that's not very fast. Shapley values weren't very commonly used historically because you need to do a large number of samples to get a good estimate of the Shapley values, and that means running your model in inference mode so many times, and who's got time for that? This blows up really quickly: if it's exponential in the number of features, it means that if you have more than a handful of features it's going to take a very, very long time. So you can use a Monte Carlo method, drawing a new example from the training data each time.
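A hedged sketch of that sampling idea (following the permutation-sampling approximation described in Molnar's book, not code from the video): for each sample, draw a random permutation of the features and a random "donor" row from the training data, then build two hybrid instances that differ only in feature j.

```python
import numpy as np

def shapley_monte_carlo(predict, x, X_background, j, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    M = len(x)
    total = 0.0
    for _ in range(n_samples):
        donor = X_background[rng.integers(len(X_background))]   # the "Frankenstein" source
        order = rng.permutation(M)
        pos = np.where(order == j)[0][0]
        x_plus_j, x_minus_j = donor.copy(), donor.copy()
        x_plus_j[order[:pos + 1]] = x[order[:pos + 1]]   # features up to and including j taken from x
        x_minus_j[order[:pos]] = x[order[:pos]]          # same, but feature j stays the donor's value
        total += predict(x_plus_j[None])[0] - predict(x_minus_j[None])[0]
    return total / n_samples                             # averages towards the Shapley value
```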
I just wanted to close the loop on this interpolation, or extrapolation, question. It might mean that you have, let's say, a baby who earns a hundred thousand dollars a year; I think that was the example that Christoph came up with. Because you might have this baby feature, and now we're saying that the salary is no longer part of the coalition, so we need to extrapolate the salary, so we average over all of the salaries, and it might create this fictitious data point that is a baby that earns 100,000 a year, which would be ridiculous, it's out of the distribution. Exactly, these Frankenstein examples don't really seem to be very sensible; it doesn't really give you much confidence in the algorithm if you're computing things based on examples that don't really make sense. But I think there's a more fundamental question there: can you even answer it in theory when you have two very highly correlated features? If you're trying to unpick the model and figure out which feature the model is actually using, you have to consider a high value in feature A with a low value in feature B, and a high value in feature B with a low value in feature A; that's the only way you can figure out which feature the model is using. But that's not going to make much sense in practice if A and B are always highly correlated, they always go together. Well, we said there's a better way, right, than this Monte Carlo sampling, and this is where SHAP comes in. SHAP is different to Shapley values; SHAP is SHapley Additive exPlanations, and it is this awesome unification of all these different feature attribution methods: there's LIME, there's DeepLIFT, layer-wise relevance propagation, and the classic Shapley values we just went through. What the authors of SHAP say is, ah well, these are all the same kind of thing, they're all additive feature attribution methods. What that means is you're trying to come up with a single number that you assign to every feature, and you can add up those numbers and that should give you the prediction. And they say, well, there's only one winner here: you can use the Shapley framework and think about these four different mathematical principles, and there's only one of these techniques that obeys them all, Shapley values. So they kind of say you can forget all these other methods, Shapley values is the one winner. Right, and that symmetry is about fairness of features: it means that when you have two features that both have the same influence on the model, they both have the same effect, maybe they're highly correlated, they should get the same payout, they should get the same value; your algorithm shouldn't just arbitrarily pick one of them and give it a high value. Ah, that's really interesting, because that's a good thing and a bad thing: if I had features that had a lot of shared information, in fact if I had a duplicate feature, if I put salary in there twice, it would assign the same Shapley value to both of them, but it would also mean that my kind of currency, my Shapley value, would be divided between them, so now I've got two small Shapley values for salary instead of having one big one. Exactly. So sometimes that's a good thing, right, if you want to really understand how the model is working, that it's actually paying equal attention to these two salary features. Sometimes, though, you might want a more sparse explanation: you might want to explain the model predictions in as few numbers as possible, and then you really don't want to be double dipping and including all of these salary features.
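A tiny illustrative check of that duplicate-feature point, reusing the closed-form linear case from earlier (the toy numbers and the averaging model are assumptions chosen for illustration, not anything from the video): each copy of the salary column gets the same value, and the credit is split between them.

```python
import numpy as np

rng = np.random.default_rng(0)
salary = rng.normal(50_000, 10_000, size=1000)
X_dup = np.column_stack([salary, salary])              # the same feature included twice
predict = lambda X: 0.5 * X[:, 0] + 0.5 * X[:, 1]      # a model that averages the two copies

x = X_dup[0]
phi = np.array([0.5, 0.5]) * (x - X_dup.mean(axis=0))  # closed-form Shapley values for a linear model

print(phi)                                             # symmetry: both copies get identical credit
print(phi.sum(), predict(x[None])[0] - predict(X_dup).mean())   # efficiency still holds
```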
And theoretically, why is there a symmetry property? Is it because we have all of these different permutations of the features, we're doing the extrapolation over the missing features, and the way that we measure that must be symmetric for this property to hold? Exactly, and it depends on exactly how your interpretability model works whether you have this symmetry property. Something like a lasso regression, if you use a lasso regression as part of your interpretability method, as you often do in LIME for example, breaks this symmetry property, because you're going to find a sparse explanation that will only use a small number of features to explain the prediction. Explain why that is. Well, a lasso regression is a regression that has an L1 penalty rather than an L2 penalty, so rather than looking at Euclidean distances you're looking at Manhattan distances, and one property of that is that you tend to make a model that uses as few features as possible, it sets many of its coefficients to zero. So when you fit a local interpretable model, as you do in LIME, you often choose something like a lasso regression, and it means you come up with an explanation that tries to explain the model prediction using as few input features as possible. That's good in the sense that it's quite human-interpretable, it's easy to get your head around, but it breaks this property of symmetry. Well, in a sense it's less interpretable: if you had, let's say, a lasso model, an L1 model, as a surrogate to explain Boston housing, it might tell you that LSTAT was dominating the explanation whereas there was actually shared information between LSTAT and, let's say, PTRATIO, but you wouldn't see PTRATIO. Exactly. In fact, if you do that, if you fit a surrogate lasso model to try to explain a complex model on the typical Boston house price dataset, you can approximate it pretty well just by considering RM and LSTAT. So if you want to get your head around what's going on in the Boston house prices dataset, you can pretty much ignore all these other things like PTRATIO and NOX and AGE; you can get most of the way there by just considering RM and LSTAT. But as you say, that's a fundamental trade-off, because you might want to consider the contribution of all of the features, or you might just be interested in the features that dominate. I suppose with lasso what concerns me is what exactly tips the balance to LSTAT dominating the explanation: is it the magnitude, is it the number of examples, is it the order that the examples are seen in? I think the thing that tips the balance is how accurate your local model can be using a small number of features while still being a good approximation to the complex model, which uses all of the features, that you're trying to explain. So when you use a simple model like a lasso model you'll have some metric like the mean squared error, and you'll compare what's the best score you can get approximating your complex model with a simple model using just one or two features, and it will turn out that with LSTAT and RM together you can explain 95% of the variance, or something like that.
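Here is a hedged sketch of that sparse-surrogate idea, assuming scikit-learn (the dataset and alpha values are illustrative): fit an L1-penalised lasso to the predictions of a complex model and see how few features are needed to approximate it well.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)
complex_model = GradientBoostingRegressor(random_state=0).fit(X, y)
target = complex_model.predict(X)          # the surrogate mimics the model, not the labels

Xs = StandardScaler().fit_transform(X)
for alpha in [0.5, 0.1, 0.01]:             # stronger penalty -> sparser explanation
    surrogate = Lasso(alpha=alpha).fit(Xs, target)
    n_used = int((surrogate.coef_ != 0).sum())
    print(alpha, n_used, "features, R^2 vs complex model:",
          round(r2_score(target, surrogate.predict(Xs)), 3))
```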
Yeah, additivity is really cool. What that means is that you can add the Shapley values between different games, in the game theory context, or different models, in the context of machine learning. Why is that really useful? It's really useful because in something like a random forest it means you can do the Shapley value analysis on every tree, and then when you combine all these together you can just add the Shapley values. So this grand unification of SHAP, it's quite cool I think. It's like a framework that says all of these previous techniques for investigating models are all the same kind of thing, and ours is the only one that has all these properties. But also, the authors of SHAP came up with some better methods to actually calculate these values, a big improvement on that sampling method, the Monte Carlo sampling method, so we have Kernel SHAP and TreeSHAP. So what was the problem with the Monte Carlo method, was it just the fact that there was a uniform prior, essentially? Because if you think about it, you've got all of these different permutations of features and we're not really using any domain knowledge on how to increase the sample efficiency. Yeah, classic Shapley value Monte Carlo sampling is not very principled: you're just randomly picking coalitions to sample in inference mode, and that's not the most efficient use of your computational power. Okay, so how can we do it better? Well, this thing Kernel SHAP, it's a kernel-based approximation and it's actually quite similar to LIME, it doesn't look anything like Shapley values. What you do is take your initial sample to explain and generate some coalitions, which are different subsets of features, some of them missing, some of them present, and you run the model in forward mode, in inference mode, on all of these coalitions, and then you fit a weighted linear model on these permuted samples. So just like in LIME, you fit a local model that is trying to approximate your complex model. The crazy thing is, if you pick the weights in the right way, the coefficients of your linear model are the Shapley values. So it's as simple as that: you just have to use the right weights for your linear model. Ah, so does the kernel bit essentially mean, because a kernel just means a distance function I suppose, that the weight is a function of some distance function that you create? Exactly, so the kernel here tells you how to weight every single coalition. Remember, in LIME you're fitting a local model but you do a weighted fit so that the model is locally accurate, and so you give more attention to samples that are very close to your initial sample. In SHAP it's something quite different. You have this coalition vector, this z vector, which is a binary mask of your features, which ones are present and which ones are absent. You might consider a mask like one zero one one one one one one, say you've got eight features and only one is absent, or you could take away more features, half of them absent, or you could consider a coalition where all but one are missing. Now in LIME you'd say the further away you get from your initial sample, the less weight you give it, but what the Kernel SHAP weighting does is give most weight to small coalitions and large coalitions. So this example here would get a lot of weight, this one here not so much, and this one comes full circle to having a lot of weight again. And that makes intuitive sense, right? What we're saying is that when you consider just one feature by itself, as the first feature to put into the model, that tells you a lot about how that feature is impacting the model. Yeah, that makes sense. So almost from an information-theoretic point of view, small coalitions are more salient, they actually give you more information gain, so it's about ignoring the permutations where the information gain is low.
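A from-scratch sketch of that Kernel SHAP idea (an illustration under simplifying assumptions, not the shap library's implementation): sample binary coalition masks, impute absent features with background means, weight each coalition with the Shapley kernel, and fit a weighted linear model whose coefficients approximate the Shapley values.

```python
import numpy as np
from math import comb
from sklearn.linear_model import LinearRegression

def kernel_shap(predict, x, X_background, n_coalitions=2048, seed=0):
    rng = np.random.default_rng(seed)
    M = len(x)
    baseline = X_background.mean(axis=0)

    Z = rng.integers(0, 2, size=(n_coalitions, M))      # random binary coalition masks
    sizes = Z.sum(axis=1)
    keep = (sizes > 0) & (sizes < M)                    # empty/full coalitions get infinite weight
    Z, sizes = Z[keep], sizes[keep]

    # Shapley kernel: (M - 1) / (C(M, |z|) * |z| * (M - |z|))
    weights = (M - 1) / (np.array([comb(M, int(s)) for s in sizes]) * sizes * (M - sizes))

    X_masked = np.where(Z == 1, x, baseline)            # impute the absent features
    y_masked = predict(X_masked)                        # run the model in inference mode only

    lin = LinearRegression().fit(Z, y_masked, sample_weight=weights)
    return lin.coef_                                    # approximate Shapley values

# toy usage on a linear model, where the answer is known in closed form
rng = np.random.default_rng(1)
X_bg = rng.normal(size=(200, 6))
w = np.arange(1.0, 7.0)
print(kernel_shap(lambda X: X @ w, X_bg[0] + 1.0, X_bg))
```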
But what interests me about this, though, is that the selection is based on the binary mask rather than the actual input space. Yes, exactly, so it's a bit more principled in that sense, because you don't have to come up with some arbitrary weighting function to explain how close two samples are; the domain knowledge might be different for text data or for image data. With SHAP it's kind of fixed in stone: you can derive it from these properties like efficiency and symmetry and so on, and it tells you exactly what weight to use. Interesting, because this is in contrast to something like LIME, where there's a weighting over how far away the permuted sample is. Yes, exactly, so in LIME you have to come up with some way of describing, for every different kind of data type, like tabular data or image data, some kind of distance function. Why don't we just quickly touch on LIME? So LIME is where you have a locally interpretable surrogate model, right? Exactly, so with Shapley values there isn't a surrogate model, I guess unless you use the Kernel SHAP method, where you use this local linear model to calculate the Shapley values, so Kernel SHAP looks very similar to LIME, I guess. Okay, so in the original Shapley values we are just considering all the mathematical permutations of the features, we're not building a surrogate model, but then in the kind of principled and optimised versions of Shapley values, in SHAP basically, we are building a surrogate model. So in LIME there are different flavours of it depending on the type of predictive architecture, so whether you're using tabular data or image data or language, for example. Yeah, so in this particular example we have a candidate input, which is the yellow dot, and then we have a model which has been trained to, let's say, classify something, and what you can do is permute around that particular input example, so you can come up with lots of other similar examples in the neighbourhood, and as you can see the classifier boundary is in that space, so some of those permuted examples will give you a light blue classification and some of them will give you a grey classification. Okay, so we're saying that the colour here is the prediction of the model that we want to explain, whether it's light or dark, and these black dots are the permuted samples around the point of interest, is that right? Exactly that. And I suppose the idea here, first of all with LIME, is that we want to permute samples in the neighbourhood of an input sample, so that is already a little bit hacky, but in the case of tabular data we can just generate a uniform random ball in the neighbourhood. Exactly, I think you can do something in LIME where you just draw from the distribution of the data, so in this case it looks kind of uniformly distributed in all the features, so you can sample from a really large area, but then crucially you have to weight your local model to be most accurate on samples that are near, and that means you have to define what you mean by near. Is it some kind of Euclidean distance, maybe, and what's the length scale? I think in tabular data by default it uses an exponential smoothing kernel, and the length scale of that kernel is 0.75 times the square root of the number of features. 0.75, where did that come from? Oh wow, that sounds like a wonderfully arbitrary parameter.
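A hedged usage sketch of tabular LIME, assuming the lime and scikit-learn packages are installed (the dataset, model and num_features are illustrative); when left unset, the tabular explainer's kernel width defaults to 0.75 times the square root of the number of features, as discussed above.

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

data = fetch_california_housing()
X, y = data.data, data.target
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    mode="regression",
    feature_names=data.feature_names,
    # kernel_width defaults to 0.75 * sqrt(n_features) when not specified
)
exp = explainer.explain_instance(X[0], model.predict, num_features=4)
print(exp.as_list())   # a sparse local explanation: a handful of (feature, weight) pairs
```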
Okay, so that means in LIME we're building a surrogate linear model over the permutations of a particular input example to describe a pre-built black-box model that we already have. So here we're building a linear model over all of these different permutations, and we're describing how those permutations flip the switch on this other classifier that we've already built, but we are also weighting the contribution of those permuted samples by some arbitrary kind of distance function. Exactly, and so we get this proxy model, this surrogate model here, that is doing a pretty decent job in the neighbourhood of this yellow point at recreating the complex model, but it's obviously not going to be very accurate far away. Okay, so that's on tabular data. What would LIME look like on, let's say, NLP or even vision? I think that's a pretty cool example. In vision, I guess it's a little bit hacky, isn't it, because you need to describe what it means to be in the neighbourhood of something, and when you're in a very high-dimensional space like an image you're going to have to come up with something fairly kludgy. I think what Marco Ribeiro came up with here was to create a segmentation map, it's called a superpixel segmentation, where you essentially trace boundaries around all of the edges in the image, and you can use that to create a whole bunch of regions in the image, and then you can create a binary mask of permutations, turning on and off regions in the image, so some of the regions, when they're turned off, could be grey, and then you can build a linear model over the permutations of the regions which have been disabled. Yes, and I guess you can then look at the predictions on those permuted samples and see how the predictions have changed, and so if you turn off a really important area of the image and the proxy model, the surrogate model, has vastly changed its prediction, you know that's a really important area of the image. Yeah, absolutely fascinating. But again we have the problem, don't we, that there is an explosion of possible combinations of masking these images, so how do we fix that problem, do we just sample from it? So again, yeah, you sample it; you have to come up with some way of permuting them, which might be random coalitions of which pixels are on and which are off, or which superpixels are on and which are off, and then you have to come up with a distance measure that describes how far away each permuted image is from the original, because remember you're not trying to train a model that will handle any image, you're only trying to make a local model that is appropriate in the neighbourhood of your original image. I think in LIME you do something like take the superpixel mask, this binary vector of ones and zeros, and look at a Euclidean distance, or actually a cosine distance, between the original and the permuted one. Oh I see, so Kernel SHAP had quite an information-theoretic, principled approach to this, but with LIME there's no principled method to say which particular binary mask is better than any other binary mask; there is just the cosine distance between the binary masks, so we're still just permuting the space of all possible binary masks. Exactly, so I guess with this cosine distance it means that if you only turn off one superpixel then it's quite a close sample and it gets quite a high weight, and if you turn off lots of superpixels it gets quite a low weight.
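A from-scratch sketch of that superpixel-masking idea (illustrative; the real lime_image implementation differs in its details). It assumes scikit-image and scikit-learn, an RGB image array, and some black-box predict_proba(batch_of_images) function, all of which are placeholders here.

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.linear_model import Ridge

def lime_image_sketch(image, predict_proba, target_class, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    segments = slic(image, n_segments=50, start_label=0)     # superpixel labels 0..n_seg-1
    n_seg = segments.max() + 1

    masks = rng.integers(0, 2, size=(n_samples, n_seg))      # which superpixels stay "on"
    masks[0] = 1                                             # keep the unperturbed image too
    preds, weights = [], []
    for z in masks:
        perturbed = image.copy()
        perturbed[~np.isin(segments, np.where(z == 1)[0])] = 0   # blank out the "off" regions
        preds.append(predict_proba(perturbed[None])[0, target_class])
        # cosine similarity between the mask and the all-ones mask -> proximity weight
        cos = z.sum() / (np.sqrt(n_seg) * np.sqrt(z.sum()) + 1e-9)
        weights.append(np.exp(-((1 - cos) ** 2) / 0.25))     # illustrative kernel width
    surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    return segments, surrogate.coef_                         # one weight per superpixel
```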
I think one thing it misses out on compared to SHAP, though, is that if you were to turn off everything apart from one area of the image, then that would get a really low weight, because you've turned off so many things, but from an information-theoretic point of view it's actually quite a useful thing to consider, because you're considering just that remaining bit of the image, so maybe it should get a high weight. Yeah, absolutely. And I suppose the other thing that makes me think this is quite arbitrary is this superpixel mask, right? That wouldn't always give you good results. It would work very well when you have images with transients and gradients and edges, but for example if the image was quite noisy the superpixel segmentation might look pretty crappy. I suppose the question is, why do a superpixel segmentation, why not just do a uniform grid? I guess the answer is that you have to use your domain expertise to permute these samples in an appropriate way, right? If you were to do some kind of grid-based permutation it might not look very realistic. Cool, okay, well let's touch on the text version quickly, because again I think this beautifully illustrates that you need a lot of domain-specific knowledge and you have to come up with ways of interpreting how to use these methods. So with NLP, the permutations in the neighbourhood of a particular input example are now over the word tokens, so you can just permute the word tokens and kind of turn them off. Here we've got a token, edu, and you can just create a permutation of this example with edu either removed or replaced with some dummy token. And it's a bit like the issue you mentioned earlier with correlated features, or dependent features: if you were to remove them and not put anything back in, you might get some quite unrealistic text, just words missing, that doesn't make any grammatical sense, so your whole interpretability model is based on running examples with missing words that might make no sense at all. Yeah, and that is the peril of it; it's a little bit like we were saying with the extrapolation thing earlier, that you create examples that are outside of the original distribution and don't make sense any more, but without a more principled way of doing that, we're a bit stuck.
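A hedged from-scratch sketch of that token-masking idea (illustrative, not the lime library's code; predict_proba is assumed to accept a list of strings and return class probabilities): drop random subsets of tokens, query the black-box classifier, and fit a weighted linear model over the binary "token present" masks.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_text_sketch(tokens, predict_proba, target_class, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    n_tok = len(tokens)
    masks = rng.integers(0, 2, size=(n_samples, n_tok))    # which tokens are kept
    masks[0] = 1                                           # include the unperturbed text
    texts = [" ".join(t for t, keep in zip(tokens, z) if keep) for z in masks]
    preds = predict_proba(texts)[:, target_class]          # black-box model on perturbed texts

    # cosine similarity between each mask and the all-ones mask -> proximity weight
    on = masks.sum(axis=1)
    cos = on / (np.sqrt(n_tok) * np.sqrt(on) + 1e-9)
    weights = np.exp(-((1 - cos) ** 2) / 0.25)             # illustrative kernel width

    surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    return dict(zip(tokens, surrogate.coef_))              # per-token contribution
```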
So bringing it back to Shapley values, we have this Kernel SHAP that's much faster than the previous unprincipled way of Monte Carlo sampling: you use the Shapley kernel to compute the weights, and then you can just read off the coefficients and those are your Shapley values. There are also some other model-specific methods that the shap library comes with. So with Kernel SHAP, we've moved from this factorial number of permutations down to what? Well, with TreeSHAP your computational complexity is no longer related to the number of features, which is really cool, but it does depend on the max depth of the tree. With Kernel SHAP I think you are still, in the worst case, exponential in the number of features, but you're just making much better use of your sampling, you're sampling in a really principled way. Yeah, because my intuition on Kernel SHAP is that there's a kind of Pareto distribution and almost all of the information gain is in the fat end of that tail, and you're deliberately, in a principled way, sampling from the fat end of the tail. Exactly, you're sampling from these coalitions that have maybe only one feature missing, and then you sample from those coalitions that have only one feature present, and then once you've exhausted all of those you can start sampling from those that have two present, and you start to understand a bit about feature interactions. TreeSHAP? Yes, so Kernel SHAP is really cool, but TreeSHAP actually completely solves this issue of being exponential in the number of features. With TreeSHAP you're exploiting your knowledge of how decision trees and regression trees work, and you get an exact solution that's no longer exponential in the number of features, though it does depend on the maximum depth of the tree, and there are some caveats: depending on some further approximations you can get some good results, but in general it really opens the door to using Shapley values in practice. And as a friend of mine, Andrew, mentioned recently, NVIDIA have done some work and have migrated this to NVIDIA RAPIDS, so you can run this process on GPUs much, much faster, and so TreeSHAP is really where it's at for efficient estimation of SHAP values. I'm just trying to get an intuition on how TreeSHAP works. We're talking about coalitions of features; is it rebuilding the tree, or is it still somehow disabling features that are not in a coalition on one pre-built model? That's it, it disables features, I believe. If you imagine a decision tree, going down the tree, if you want to turn off a feature you could turn off that node, or go down that node probabilistically as it were, and so you can run it in inference mode, and just by not allowing the tree to make any splits based on your feature of interest you can simulate what would happen if you turned off that feature. I see, but from an information flow point of view, I suppose there are different ways of building trees, you can have one tree or there's random forests and so on, but I can see lots of situations where, if that feature had a high information gain, it would be quite high up in the tree, so if you disabled it then the tree would be useless. Yeah, I think what you have to do is still go down that split, but rather than using that feature to make the split, it might be, for example, that before you disabled it three quarters of the examples went down the right side of the tree and one quarter went left, so you might now just flip a coin, or flip two coins, and send three quarters to the right and one quarter to the left rather than actually looking at that feature. And that's really intuitive, because that's kind of similar to what we were saying before about replacing this decision junction with the kind of average for that dataset, so in this case it might just be, as you say, a coin flip, and 50% of the time you go down the right-hand side. Exactly, and this is such a great example of this additivity property that we know Shapley values have: values can be added across games, or across models, which means that you can do this analysis for every tree in a decision forest, and then to work out the Shapley values for the whole forest you can just add them up. I see, that makes sense. So actually TreeSHAP isn't anything different, it's just an interpretation of the SHAP methodology for trees, in the same sense that we were talking about interpretations of LIME for tabular data and image data; this is just how we use something like SHAP on the substrate of tree models. Exactly.
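A hedged check of that additivity point, assuming the shap and scikit-learn packages (dataset and model sizes are illustrative): explain each tree of a random forest separately, combine the per-tree Shapley values by averaging (since a forest averages its trees' predictions), and compare with explaining the whole forest at once; the difference should be near zero, up to floating point.

```python
import numpy as np
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True)
forest = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0).fit(X, y)
X_sample = X[:50]

forest_sv = shap.TreeExplainer(forest).shap_values(X_sample)

# explain every component tree on its own, then combine the attributions
per_tree_sv = np.mean(
    [shap.TreeExplainer(tree).shap_values(X_sample) for tree in forest.estimators_],
    axis=0,
)
print(np.abs(forest_sv - per_tree_sv).max())   # expected to be approximately 0
```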
Beautiful. Well, Connor, this has been an absolute pleasure, thank you very much indeed. Oh, it's my pleasure, it's such an interesting topic as well. Amazing. Well, if you like this, folks, let us know in the comments. Connor and I are actually quite interested in doing one of these for every single interpretability method, so if that's something that interests you, let us know, and we'll see you back very soon. Peace out. Shapley values are the only possible explanation that obeys four key properties: efficiency, symmetry... [Music] There are four components of SHAP values... I can't handle it, Tim, your eyebrows, the shape your eyebrows made, it was like there were four, it was just too cool, I've lost my chill already from sentence one. Absolutely, for an additive feature explanation method to have this property of symmetry, it needs to give them the same, the same shame, the shame of property... I've turned into Sean Connery; I might have a whiskey and then come back and film, I think it'd be way better. This is so useful in the context, of the context, because it means that you can calculate the Shapley values for every, every... ah man, I can't explain this, linear regression turns out to be quite complex.
Info
Channel: Machine Learning Dojo with Tim Scarfe
Views: 1,890
Rating: 5 out of 5
Id: jhopjN08lTM
Length: 39min 44sec (2384 seconds)
Published: Fri Mar 19 2021