Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data Scientist, Aviva

Captions
My name is Kasia and I'm going to talk to you about interpretable machine learning, using the specific use case of the LIME framework. Just out of curiosity, how many of you deal with machine learning and build machine learning models on a daily basis? OK, quite a few of you. And how many of you have heard about LIME? OK, not bad. That's great, because if you haven't heard about it, you'll learn about it today.

Before I start, a little about myself. I come from Poland and my background is actually in biology: I did my master's degree in biology in Krakow, after which I moved to Sweden, to Uppsala, where I did a PhD in evolutionary biology. My degree took me to different weird and wonderful places: I spent a lot of time in the forest on a Swedish island, did a lot of lab work in Lund, a Swedish academic town, and did a lot of analytics work as well. I also spent a year at Imperial College working on malaria transmission. Eventually I moved to London in 2014, which became my home, and ever since I've been working in analytics and data science roles; currently I'm a data scientist at Aviva.

There's one more thing I want to tell you about myself. Data is a big part of my life but, unlike some of the presenters who were here before me, it's not the only part. When it comes to data, I try to teach myself new methods and data science applications through online courses and meetups, and I'm a quite engaged Twitter user, especially in the R community; I'm a co-organizer of the R-Ladies meetup and I also run a blog on data science using R. But as I said, I do that when I have time, and these two little creatures don't leave me with a lot of it, so this is really how I spend my time, usually away from a computer screen: if I'm not spending time with them, I try to hike, do jigsaws, bike. In that sense I'm not the stereotypical geeky data scientist.

Anyway, back to interpretable machine learning. Whenever we talk about interpretable machine learning, we're really talking about black box algorithms, so what do I mean by that? For me, a black box is a system that applies some behaviour to the input data we feed into it and gives us an output, but we don't really understand how we got that output. Especially recently there has been huge hype around black box algorithms, because these are the algorithms that tend to be the most accurate, deep learning being the obvious example. But how did we get there, and is it really that black-boxy?

Let's say we're trying to answer the question: will this loan default? We receive an application and we try to judge it. This could be any classification question: is this a cancerous cell, will this prisoner commit a crime if given parole, will this machine break down? It's just a matter of which classification problem we're dealing with. Let's stick with the question about the loan. We have certain information about the applicant, and this information alone is quite useless; to really make a judgement we need some historical data, to understand how different applicants with different features ended up in the past.
If we have very few variables and not lots and lots of data, we can actually draw conclusions that are very easy to interpret: we can run a simple linear model, draw a line between the two classes, and see that by following certain rules about the predictors we get a certain outcome, so it's fully interpretable. What about non-linear relationships? That's still not so bad, because we can use a decision tree; again we can follow the rules contained in the tree, select groups of cases, and see where they came from and why they were classified that way, so it's still interpretable.

The problem starts with big data. Big data creates more dimensions, so more relationships between the predictors and the outcome variable, but also a more complex landscape, if you will, of those relationships. How do you create a model that gives you information about all those intricate relationships? Even here we're not completely helpless. We can use feature importance plots, which tell us how important a feature was in predicting the outcome; with random forests, for example, the more important the feature, the higher up, on average, we find it in the trees making the decisions about the outcome. That is something, but it's not ideal, because it doesn't give us any indication of the direction of the relationship, or whether the relationship is linear or not, so it still leaves us with a lot of work to do in order to understand how the features relate to the outcome.

It gets better if we use partial dependence plots: those plots show the marginal effect that each predictor has on the response variable. So we can, say, run the feature importance plots first, pick the top important variables, and see how they relate to the outcome variable; even if the relationship is non-linear, for instance it starts negative and then becomes positive, we can still see that and interpret it. That's great, but we are not able to produce those partial dependence plots for every classifier we have; for instance we can't do it for neural networks.

Finally, just to give you one more example of the tools we have, you could go Bayesian and create a Bayesian network, which I would say is a bit of an outlier among machine learning models because it moves away from the assumption that we have one response variable and a set of predictors: it shows you the dependencies between all the variables you take into account, including the outcome variable, and presents them with weights, so it gives you an idea of how the variables depend on each other and how strong those relationships are. However, Bayesian networks tend not to be as accurate when it comes to predictions as, say, neural nets.
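As a quick illustration of the global tools just mentioned (this is not code from the talk), here is a minimal R sketch using the built-in iris data and the randomForest package; the dataset and variable choices are mine, purely for illustration:

```r
# Global interpretability tools: feature importance and partial dependence
# (illustrative sketch only, not the speaker's code)
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Feature importance: how much each predictor contributed,
# but nothing about the direction or shape of the relationship
importance(rf)
varImpPlot(rf)

# Partial dependence: the marginal effect of one predictor on one class
partialPlot(rf, pred.data = iris, x.var = "Petal.Width",
            which.class = "setosa",
            main = "Partial dependence of P(setosa) on Petal.Width")
```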
Altogether, though, all those tools have one definite downside: they won't be able to tell you why a particular observation was classified one way or another. They can give you the global behaviour of the model, but they won't tell you why this particular observation was classified this way. You can imagine that even if, somewhere in this landscape, there is a lump of observations that tend to be classified in a certain way, you won't be able to see that in those plots.

So what can you do? Before thinking about what you can do, think about the accuracy versus interpretability trade-off. I think it's commonly agreed in the machine learning field that such a trade-off exists, and some people will draw a line reflecting that relationship: models that are highly interpretable are usually less accurate than those we don't really understand. So the question is, what about models we could place up here, highly interpretable and highly accurate at the same time? This is where LIME comes into play.

LIME is the abbreviation for Local Interpretable Model-agnostic Explanations. By interpretable explanations I mean that we are able to understand what the model does, which features it picks up on in order to develop its predictions. By model-agnostic I mean that it can be applied to any black box model, any black box model we know today and also models that may be developed in the future; in essence the assumption is that every algorithm is a black box, so even if you apply LIME to a linear regression, which is a highly interpretable model, it will still treat it as a black box. And finally, local means that it is observation-specific: it gives you an explanation for every single observation you have in your data set.

This is new work: the paper on LIME came out only last year, from a research team based at the University of Washington, and even though it's quite a recent invention, the great thing is that it's available in both Python and R.

Maybe let's talk a little more about what I mean by local. If this is the landscape of relationships between predictors and the outcome variable, we don't really get much from a global view of it, or at least in many businesses and industries you don't. What LIME does is ignore that global view and go down to a very local level: it picks an observation, or sometimes a class of very similar observations, and gets to the point where it is so local that you can actually use a linear model to understand the relationship, and that linear model will be highly accurate at that local level.

Forgive me if you're not interested in the technical details, but I've noticed that in many talks on LIME the details of how it actually works are skipped, and I think they're vital to understanding why it's so powerful. Essentially, LIME takes an observation and creates fake data for it: it takes all the data you have for that point, your predictors, and permutes them in different ways, so it creates a fake data set for that observation. Then it calculates a distance metric, a similarity score, between the fake data and the original observation, so we know how similar the data we created is to the original. Then it takes your black box algorithm, say a neural net, and makes predictions on that new data set. Next it tries different combinations of predictors, say m of them, a small number, to figure out the minimum number of features that gives the maximum likelihood of the class predicted by the black box; this is important because if, say, 300 features went into your model, you end up with a handful that are most informative in driving that prediction. Finally it takes those m features and, together with the similarity scores, fits a simple model, say a linear model, to derive weights or coefficients, and those coefficients serve as the explanation for that particular observation on the local scale. Hopefully that makes sense.
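To make those steps concrete, here is a deliberately simplified R sketch of the idea for a single tabular observation; the kernel, the feature-selection shortcut and all the names are my own simplifications, not how the lime package implements it internally:

```r
# Toy version of the LIME recipe for one observation (assumes numeric predictors).
# x0: a one-row data frame (the observation to explain), X: the training predictors,
# predict_fun: the black box, returning the probability of the class of interest.
explain_locally <- function(x0, X, predict_fun, n_perm = 5000, n_features = 4) {
  # 1. Create fake data by permuting each predictor independently
  fake <- as.data.frame(lapply(X, function(col) sample(col, n_perm, replace = TRUE)))

  # 2. Similarity between each fake point and the original observation
  #    (Gaussian kernel on scaled Euclidean distance)
  Xs  <- scale(rbind(x0, fake))
  d   <- sqrt(rowSums((Xs[-1, , drop = FALSE] -
                       matrix(Xs[1, ], n_perm, ncol(Xs), byrow = TRUE))^2))
  sim <- exp(-(d^2) / (2 * median(d)^2))

  # 3. Ask the black box for predictions on the fake data
  p_hat <- predict_fun(fake)

  # 4. Keep the m features most informative for that prediction
  #    (here, crudely, the ones most correlated with the predicted probability)
  cors <- sapply(fake, function(col) abs(cor(as.numeric(col), p_hat)))
  keep <- names(sort(cors, decreasing = TRUE))[seq_len(n_features)]

  # 5. Fit a similarity-weighted linear model; its coefficients are the explanation
  fit <- lm(p_hat ~ ., data = fake[, keep, drop = FALSE], weights = sim)
  coef(fit)
}
```

In practice you would not write this yourself: the lime package's lime() and explain() functions do all of the above, handle categorical features, and offer proper feature-selection methods.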
Also, what is very good about LIME is that wherever I put a star on this diagram, these are points at which you can make decisions and optimize your explanation: you can decide how you permute your data, you can decide what your similarity measure is, you can decide how many features you end up with, and you can choose the simple model that will be fitted to those data. So it's quite flexible.

OK, so this is LIME, and someone might say: do I really need it? If I have a model that performs really well, a model with very high accuracy, do I really need LIME, or can I trust the model based on accuracy alone? Well, you can't. The reason is that without knowing what the model picks up on, you don't know whether it's a genuine signal, or noise, or some spurious correlation, something completely irrelevant. To give you an example, let's say we have a model that distinguishes between huskies and wolves, quite a challenging task, and this is the model's performance: it did very well, it was correct in five out of six cases and only one case went wrong. So it performs well and I would be quite happy with it. But when you actually look at the explanations for this model, you realize it's rubbish. Here on the left-hand side is the original picture, and on the right-hand side you see which parts of the picture were informative to the model when making predictions. For predicting huskies it usually uses elements of a husky, elements of a snout, of a face, so that's sensible behaviour; but for wolves, all it picks up on is snow. We just built a great snow detector, so no matter how accurate it is, it's a bad model, we can't trust it, and you wouldn't know that without having the explanation in the first place.

Another great thing about LIME is that you can apply it to NLP models, to text analytics. Let's say you try to distinguish between articles or posts written either by atheists or by Christians, and you run a model that, in the original example, I think was 94% accurate. Amazing, right? But then you check which hints the model picks up on: OK, maybe if your email address contains '.edu' it indicates that you're educated, and perhaps that means you're less likely to be Christian, but then there are things that don't make sense at all. The model is clearly not sensible here; it picks up on things that don't make sense to us, so we can't trust it.

I used LIME in a very simple example, for understanding a model that classified cancer cells as malignant or benign. Hopefully I can show you my code very quickly; I published it on my blog, so no live coding, but just to give you an idea of how it can perform in a simple classification problem. For this example I used the Wisconsin breast cancer data set, which is perfect for demo presentations: it's quite small, only about 700 observations, and it's well known and well understood.
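Her post is in R; one hedged way to get that data locally is the BreastCancer set shipped with the mlbench package (whether this matches her exact source, and the cleaning below, are my assumptions, not her code):

```r
# Load and lightly clean the Wisconsin breast cancer data (sketch only)
library(mlbench)

data(BreastCancer)                      # 699 rows: 9 cell features + Class
bc <- BreastCancer[, -1]                # drop the Id column
bc <- bc[complete.cases(bc), ]          # a handful of rows have missing Bare.nuclei

# The predictors arrive as (ordered) factors; treat them as numeric 1-10 scores
bc[, 1:9] <- lapply(bc[, 1:9], function(col) as.numeric(as.character(col)))
str(bc)                                 # 9 numeric predictors + Class (benign / malignant)
```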
After doing some very simple data cleaning, getting the column names right, fixing a few typos and dropping redundant factor levels, I start a local h2o instance and set up my data science pipeline: I split the data set into training, validation and test sets, and I run h2o's excellent AutoML function, which basically optimizes and chooses the best classifier for me. After only 60 seconds I already get excellent results, and no surprise here, these were the top performers: the top two were neural nets, followed by gradient boosted models, again no surprise. Then I simply divide the test set into observations that were correctly classified and observations that were incorrectly classified.
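A rough R sketch of that pipeline follows; it assumes the cleaned bc data frame from above, and it glosses over the extra glue she mentions later for making the h2o model palatable to lime (recent lime versions understand h2o classifiers directly, but the interop details may differ from her original post):

```r
# Sketch of the AutoML + lime pipeline described above (not the speaker's exact code)
library(h2o)
library(lime)

h2o.init()                                            # start a local h2o instance

bc_h2o <- as.h2o(bc)
splits <- h2o.splitFrame(bc_h2o, ratios = c(0.7, 0.15), seed = 42)
train  <- splits[[1]]; valid <- splits[[2]]; test <- splits[[3]]

# Let AutoML choose and tune a classifier (60 seconds, as in the talk)
aml    <- h2o.automl(y = "Class", training_frame = train,
                     validation_frame = valid, max_runtime_secs = 60)
leader <- aml@leader                                  # best model on the leaderboard

# Explain individual test-set predictions with lime
feats       <- setdiff(names(bc), "Class")
explainer   <- lime(as.data.frame(train)[, feats], leader)
explanation <- explain(as.data.frame(test)[1:4, feats],  # a few cases for illustration
                       explainer, n_labels = 1, n_features = 4)
plot_features(explanation)                            # per-observation explanation plots
```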
The rest is just performance analysis, which is not really important here; what you probably want to see is this: I ran LIME on the correctly predicted observations, and for the benign cases it picks up on things that make sense to us, and that would make sense to a medical doctor. The cell size was small, the shape was regular, the bare nuclei, I believe, weren't visible (I'm not sure exactly what that means medically, but some of these features are quite intuitive), and the number of mitoses was low. It all makes sense; if I saw a model behaving like this I would say it's sensible. Similarly with the malignant cases: the number of mitoses is high, the cell shape is irregular, the cell size is big, and something called clump thickness is high. Again it makes perfect sense, and if I showed this model to a doctor and asked whether they would use it in their practice, the doctor would probably trust it. When we go to the incorrectly predicted observations, you can see, for example, that this one was predicted to be benign: its size was small and its shape was regular, but other things, such as the bare nuclei value, were high. What that tells me is that this observation had certain features that are typical of a normal cell, and that's why the model misclassified it. So even though the model was wrong, I still have trust in it, because its behaviour was sensible.

So that's the example; now, why is this important? I would say for two kinds of reasons, and the first is technical: developing trust in our model, being able to predict how the model will behave on unseen data, and knowing how to improve the model based on the explanations. In all three of those areas LIME has been shown in research to do really well. On trust, the researchers showed that people who are not familiar with LIME, or with machine learning at all, can interpret the explanations very easily and, based on them, can tell good models from bad models. People who have access to explanations can also predict how a model will behave on unseen data better than people who don't. And finally, when it comes to improving the model, and this is absolutely awesome, if you take non-experts who have access to explanations and ask them to do feature engineering, to chuck out features they think are unimportant and maybe engineer ones that may matter, they improve the model more than machine learning experts who don't have access to those explanations.

So from a technical point of view, having access to those explanations is absolutely critical, but there are more important reasons than the purely technical ones. On the legal side, starting from May 2018 we will be obliged to comply with GDPR, and one aspect of GDPR is that customers have a right to an explanation: companies that use automated decision making will have to be able to explain why a particular customer was refused a loan or was not recruited. On a slightly different level, there is what Cathy O'Neil calls weapons of math destruction: models or algorithms with three properties: they are opaque, so we don't really understand them; they work at scale, they are ubiquitous; and they do damage. Whether it's assessing someone's performance at work or deciding parole by predicting whether a prisoner will commit another crime, this matters, and if we can't understand or access those models we risk harming vulnerable people.

You might still think those hidden models aren't really everywhere, that they affect only certain demographics. But think about something Zeynep Tufekci talked about in her TED talk. In the old days of marketing, if we wanted to sell someone a trip to Las Vegas, we would target demographics we knew about: say, men between 20 and 30 years old, or people with a certain financial profile, and we would know exactly who we were targeting and why. Now imagine we develop an algorithm that simply sells trips to Vegas to whoever is most likely to buy them, and, without us knowing it, the algorithm ends up targeting, say, people with bipolar disorder who are about to enter a manic phase and are therefore more likely to overspend and to engage in gambling. We wouldn't even know about it. And this is just marketing, just personalization; it is something that will affect, or is already affecting, us right now.

I don't want to leave you depressed about this. The message I want to leave you with is that this is everyone's responsibility. It's not only policymakers' job to make sure we understand our models; it's the responsibility of data scientists and machine learning researchers. We live in amazing times, being able to build these models, but at the same time they carry a real risk of harming vulnerable people, so developing tools and frameworks for understanding our models should be priority number one for us, and the LIME framework is a great place to start. Thank you very much.

[Audience Q&A]

...and before a model is productionized, before it's deployed, it has to be checked and confirmed that it complies with GDPR: essentially that it doesn't rely on any biases we have, that it isn't being racist or sexist, that predictions are not being made based on your gender or your race. That's something LIME could help you with.

[Audience question about what to do when LIME reveals problems] So yes: you run your model, you test it with LIME, and you see that the underlying logic of the model doesn't make sense and probably would not generalize well; how do we address it? That's exactly the point: for instance, with the atheists and Christians example, if you see that your model picks up on parts of the data that aren't really meaningful text, you can easily
modify your data so that it contains only the text you actually care about. With images, I'm not a deep learning expert, so I'm not sure exactly what you would do; perhaps you could crop the pictures so that the background is ignored. But it's definitely the first step to understanding what goes wrong with the model, and then it opens up ways for you to decide how to modify your data set, or the features that go into it, to make the model more trustworthy.

[Audience question: can LIME take apart ensemble models and explain them with exactly the same technique?] Thanks, good question. If you build an ensemble model that relies on a single data set, I can see LIME being applied to it just like to any other classifier. I'm not sure how it would behave, with what we know today, if you built a stacked model that relies on different models trained on different data sets, say one model trained for one thing and another on something else, with a final model built on top of those two. I'm not sure, but LIME is out there in open source Python and R packages, so there's plenty of room for research.

[Audience question: how do you reduce the number of features in the first place?] If I'm dealing with hundreds and hundreds of features I would use PCA. For anything up to about a hundred features it depends on the model; with random forests, for instance, I would use feature importance, and there's a very useful function in the randomForest package that essentially shows you the performance of the model given the number of top important features, so whether you should focus on the top 5, 10, 15, whatever. I would use that to guide my decision on how many features I can get rid of without losing performance.
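I believe the randomForest function she has in mind here is rfcv(), which cross-validates the model over shrinking sets of the top-ranked predictors; a hedged sketch, reusing the bc data frame from earlier (my illustration, not her code):

```r
# How much performance do we keep as we drop all but the top-ranked features?
library(randomForest)

cv <- rfcv(trainx = bc[, setdiff(names(bc), "Class")],
           trainy = bc$Class,
           cv.fold = 5)

with(cv, plot(n.var, error.cv, type = "b", log = "x",
              xlab = "Number of top features kept",
              ylab = "Cross-validated error"))
```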
[Audience question: could you briefly go through the process, from starting with the data set and preparing it to getting the explanations?] The steps really aren't that many; you essentially follow your pipeline or framework as you normally would, and there's nothing new about creating the model. I have my data, I split it into training, validation and test sets, I run the model (here I used h2o's AutoML, but use whatever you have), I use it to predict my test set, and then I have all the usual performance analysis: the confusion matrix, precision, recall and so on. At that point most people would stop: they have the performance metrics they are either happy or unhappy with. What's a bit different here is some fiddling to transform the h2o model so that it's palatable to lime, but even if you didn't use h2o you could do it directly: you create the explainer, you define your training set and your model (here, the AutoML leader), and that's it; then you apply the explainer. Here I split the test set into two subsets, but that's purely for demonstration purposes, to show how we understand the model on correct predictions and on incorrect predictions; overall you just apply the explainer to your test set and that's it, so it's one additional function call.

[Audience question, partly garbled in the captions, about why you need LIME when feature importance already tells you which variables matter] That's what really surprised me about LIME when I first started using it, because I was used to feature importance plots and thought: I know which features are important, I know what to expect. The more I work with it and try to understand it, the more it makes me realize that you don't have one single pattern of relationships between predictors and outcome: you can go to one region and understand why a point is classified a certain way there, and it may differ from why a point is classified that way somewhere else. That's what these local explanations really give you.

[Audience question about getting a global view from LIME] Let me try to show you one slide; it comes from a presentation originally made by one of the creators of LIME, and I haven't used this functionality personally, but I know that in theory it's possible. Essentially, LIME has an algorithm that picks a set of representative observations that guide your view of what the global model looks like. As I said, I haven't used it myself, but it comes from the people who actually invented LIME, so in theory it should be possible to focus on the most informative points and see how the relationships change.
Info
Channel: H2O.ai
Views: 46,770
Rating: 4.8830409 out of 5
Keywords: machine learning, Meetup
Id: CY3t11vuuOM
Length: 37min 17sec (2237 seconds)
Published: Mon Dec 18 2017