Natalie Hockham: Machine learning with imbalanced data sets

Video Statistics and Information

Captions
Thanks for sticking around — I probably wouldn't have. Recently my colleague Angus and myself have been working on designing and implementing a new fraud detection system at GoCardless. I found out that this chap from Stripe is doing a similar talk on a fraud detection system, so I decided to talk more about the imbalanced data aspect and a little bit less about the fraud detection, but that's also included as a secondary thing.

We'll start with some of the boring stuff, sorry. What is an imbalanced dataset? A dataset is imbalanced when the class of interest, or the minority class, is much rarer than the other class or classes, which are called the majority classes. What happens is that the classification rules that predict the small classes tend to be fewer and weaker than those that predict the prevalent classes, and what you end up with is test samples belonging to the small class being misclassified more often than those belonging to the prevalent classes.

Why is this bad? Often the cost of missing the minority class is much higher than that of missing the majority class. As an example, take cancer diagnosis. If you misclassify someone who doesn't have cancer as having cancer, the worst that can happen is that they go on to have further tests. If you misclassify someone who does have cancer as not having cancer, well, it's pretty shitty really.

The performance of machine learning algorithms is typically evaluated using predictive accuracy — that's the default in a lot of the scikit-learn classifier modules. But if you've got a base rate of 99% for the majority class and you're trying to optimize based on accuracy, really the most intelligent thing to do is to label everything as the majority class. When we were doing the fraud work we were using the cross-validation module in scikit-learn, and there's a really nice scoring input parameter on the cross_val_score function. The default is accuracy, but you can change that to recall, precision, F-measure, G-mean, et cetera, so that's definitely something to remember.
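A minimal sketch of swapping that scoring parameter, assuming a synthetic imbalanced dataset and the current sklearn.model_selection import path (the talk used the older cross_validation module):

```python
# Sketch: evaluating a classifier on an imbalanced dataset with metrics
# other than the default accuracy. The dataset here is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# ~2% minority class, similar in spirit to a fraud base rate
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)

clf = LogisticRegression(max_iter=1000)

# Default scoring is accuracy, which a majority-class-only model can maximize
print("accuracy :", cross_val_score(clf, X, y, cv=5).mean())

# Metrics that actually reflect performance on the minority class
print("recall   :", cross_val_score(clf, X, y, cv=5, scoring="recall").mean())
print("precision:", cross_val_score(clf, X, y, cv=5, scoring="precision").mean())
print("f1       :", cross_val_score(clf, X, y, cv=5, scoring="f1").mean())
```

Note that there is no built-in G-mean scorer; that one would need a custom scorer built with sklearn.metrics.make_scorer.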
Now I'm going to launch right into some more interesting examples. This is a satellite radar image of the sea surface, and these images are typically used to detect oil spills. These oil spills originate mainly from ships, and the oil slicks are actually less reflective of radar than the average ocean surface, so what you end up with is these black regions on the images — the one on the right is the oil slick. You also get a load of look-alikes, and these derive from natural phenomena like algae, rain and wind shadows. Unfortunately, or fortunately, the look-alikes occur way more frequently than the oil slicks, and you can end up with a hell of a lot of false positives when you're working with these datasets. That's a really cool paper, by the way.

OK, another example — we've already touched on this — the diagnosis of cancer. I actually found it quite difficult to find a good image for this slide, but an example that I did find was using histopathological data to generate features for machine learning. This particular paper approximates the mapping between the data in the original feature space and the transformed data in the PCA embedding space, but one of their big problems is the fact that the original dataset is massively imbalanced.

OK, now my third example is fraud detection, and this is the one that's most pertinent to us. There are a few papers out there on machine learning and telecommunications fraud, employee fraud and payment fraud, and this has sparked quite a lot of interest, because it makes a lot of sense financially to pull out all of these fraudulent cases and stop them at source, or as soon as possible, because it costs money — and boy, does it cost us money.

OK, so what is GoCardless? We're an online direct debit provider, and we serve around 800,000 people in the UK and abroad. I'm sure everyone here uses direct debit, but you probably aren't aware of the mechanism that's behind this payment type. What happens is that a merchant will establish a relationship with a customer, and the customer wants to pay the merchant. Direct debit is a really cheap payment option for merchants, with fees that are a hell of a lot lower than other payment methods. The customer pays GoCardless — transfers money into the GoCardless account — we pay out to the merchant, and the whole process from start to finish takes between four and five days. So it's actually a fairly bad way to make money if you're a fraudster, because we've got a hell of a lot of time to catch you.

We're a startup, we're fairly young — about four years old — and up until recently we've been using a fairly basic, clunky system. The ratio of fraudsters to non-fraudsters in our dataset is around two percent. Formerly we used a discrete set of rules that we chose using Bayesian logic; over time we've added more and more rules, which has created a system that's actually really quite messy. We've had to employ a team of fraud analysts who go through all of the data manually, and we have fraud alerts which are created automatically, lots of which are redundant, but we're trying to cover all bases.

So this is how we're doing, and actually we're not doing too badly at detecting the fraudsters. Our aim is to have a recall of one, and we almost do have a recall of one. You could liken the current method that we've got in place to an ensemble method, but without the machine learning: there are different stages at which we can catch a fraudster, but we always catch them eventually at some point or another. So at the moment we've got a recall of around 100%, the precision is 8%, and we thought we could do better, so we decided to build a classifier. We used the obvious Python libraries; we also used Seaborn, and this was the first time I'd used Seaborn — it's pretty cool for data visualization, especially when you're presenting to the rest of the company, so it's something maybe worth checking out.

To start off, as always, a quick and dirty model. On the left-hand side you can see a basic outline of the pipeline that we used for this dirty model. We tried a few different classifiers — we've got a framework in place that makes it easy to test out all these different classifiers — and the random forest classifier actually performed the best. We were able to get a precision of eight percent, so that didn't change, but the recall was significantly less than one. Over here I've actually normalized the merchant base to 1,000 to give you a better sense of the numbers.
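A sketch of what such a quick-and-dirty baseline might look like — the preprocessing steps and candidate classifiers are illustrative assumptions, not the actual fraud pipeline:

```python
# Sketch of a quick-and-dirty baseline: the same preprocessing pipeline
# wrapped around a handful of candidate classifiers, compared on recall.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=100),
}

for name, clf in candidates.items():
    pipe = Pipeline([("scale", StandardScaler()), ("clf", clf)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="recall")
    print(f"{name:20s} recall = {scores.mean():.2f}")
```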
So where do we go from here? Know your data — I cannot stress this enough. Angus and myself sat down for three days going through all of the data, and it turned out that a lot of the merchants that were classified as fraudsters weren't actually fraudsters: they were bad-news cases. What's a bad-news case? For example, chaps that sell boiler insurance to old people. They have legitimately been granted access to the account of the person who set up the direct debit, but it's not such a noble way to make money, because those people don't need the service. We don't want to serve those people, but really they're not fraudsters, so they shouldn't be included in our model. We sat down for three days, Angus and myself, and he likened it to the job that YouTube reviewers have to do when they look through graphic content, because it really was that grim for him.

OK, so once we'd cleaned the data, we decided to resample it. This is a data-level approach to addressing the class imbalance, and there are a few ways that you can resample the dataset. Undersampling is an obvious one — it's very easy: you effectively remove random instances of the majority class until your two classes are sufficiently balanced. The problem with undersampling, we found, is that it gave great model statistics, but we lost a hell of a lot of information and we ended up with a model that didn't generalize very well. So we changed tack and went down the route of oversampling. It's a bit more difficult to show oversampling on a graph, so I've made the data points thicker to show that we've overlaid some on top of each other; with oversampling you duplicate, triplicate, replicate the minority class until you get the class balance that you're looking for.

Then, briefly, I want to go through another technique that we came across. We actually haven't gone down this route, but it's worth explaining; it's fairly new, and it's called the synthetic minority oversampling technique (SMOTE). It effectively involves creating synthetic examples of the minority class, so I'm going to quickly go through it. This is the original dataset, and I'm just going to remove the majority class to make it easier to see. Let's assume here that the amount of oversampling we want is 200%. We choose a sample, and for each sample we select the k nearest neighbors to it — you have to decide on k, and I've decided upon five in this instance. We want to oversample by 200%, so we randomly choose two neighbors from the k nearest neighbors, and then we generate one sample in the direction of each of the two randomly selected nearest neighbors — you generate that sample along the vector joining the original sample with the neighbor. What I've got here is the oversampling graph that we actually went with, and the synthetic minority oversampling graph on the right-hand side; you can see that SMOTE actually forces the decision region of the minority class to become a lot more general, and it's been shown to give successful results in a few studies.
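A from-scratch sketch of that SMOTE walkthrough — 200% oversampling, k = 5 nearest neighbors, and one synthetic point along the vector to each of two randomly chosen neighbors. The helper name and the toy points are made up for illustration; in practice the third-party imbalanced-learn package ships a full SMOTE implementation:

```python
# Sketch of SMOTE for 200% oversampling of the minority class: for each
# minority sample, find its 5 nearest minority neighbors, pick 2 of them at
# random, and generate one synthetic point along the line to each.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, n_per_sample=2, k=5, random_state=0):
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)  # +1: the point itself
    _, idx = nn.kneighbors(X_minority)
    synthetic = []
    for i, x in enumerate(X_minority):
        neighbors = idx[i, 1:]                       # drop the point itself
        for j in rng.choice(neighbors, size=n_per_sample, replace=False):
            gap = rng.uniform()                      # random position along the vector
            synthetic.append(x + gap * (X_minority[j] - x))
    return np.array(synthetic)

# Tiny demo on a handful of 2-D minority points
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [1.1, 1.3], [0.9, 0.7], [1.3, 1.2]])
print(smote(X_min).shape)   # (12, 2): 200% oversampling
```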
Right, so back to fraud. It's all well and good saying that we need to oversample or resample the dataset, but how do we go about doing it? scikit-learn is a babe here, because in scikit-learn 0.14 we kind of struggled with altering the class weights, but in scikit-learn 0.16 a lot of classifiers now have class_weight as an input parameter on the classifier itself — I think all excluding Gaussian Naive Bayes — and this saves you a hell of a lot of the hassle of having to manually upsample and downsample. So we added a few more steps to our pipeline — obviously this pipeline should be a closed loop, it's an iterative procedure — we added a grid search and feature selection, and in the end we came up with a logistic regression model. The recall was very close to one, and the precision we were able to increase to 40%; that's a five-fold increase, and it's something that our fraud team is really excited about. This is an overly simplistic explanation of the system that we've implemented: like I said before, there are different stages at which we can catch a fraudster — pre-payment, pre-payout, and so on — and there are different models that we have to implement for each stage.

OK, so we've covered a few data-level approaches to addressing the class imbalance, and it's worth mentioning some algorithm-level approaches. Cost-sensitive learning is one of them. The goal here is to minimize the cost of misclassification, and the implementation is actually very similar to the resampling approach: you still use the class_weight parameter, but you choose the class weights manually, based on a cost matrix which you've come up with together with the business team.

I want to go through another technique — I guess it's also an algorithm-level approach — called AdaBoost, or adaptive boosting. We actually didn't use this on the fraud stuff; we've been using it on some other work that we've been doing, but we've generated some good results with it, so I thought it was worth going through. AdaBoost is short for adaptive boosting, and it's effectively a meta-estimator that you can use with a whole range of classifiers. It's adaptive in the sense that you fit a classifier, and once that has been fitted, you fit subsequent weak learners that are tweaked in favour of those instances which were misclassified by the previous classifier. That sounds like a bit of a mouthful, so I'm going to quickly go through the procedure. You begin by fitting a classifier on the original dataset — that's the decision boundary — and you see that there were three data points which were misclassified. We want to reweight those incorrectly classified points more heavily for the next classifier that we fit. There we fit the second classifier, and you can repeat this process any number of times, depending on how much time you have; there we have a third weak classifier. The final classifier is a weighted combination of all the weak classifiers, where the alphas are chosen to minimize the overall training error. There are lots of variants on adaptive boosting — you should look at the ensemble methods in scikit-learn.

And finally, while I was doing my research for this talk, I came across a couple of other interesting techniques which people are using to clean their data. With Tomek link removal, the goal is to remove borderline examples. Tomek links are pairs of points that are each other's closest neighbors but which don't share the same class label — you can see them circled in green on the left-hand side — and what you do is remove them, which makes the border between the two different classes a lot more distinct. Then there's condensed nearest neighbor: the goal here is to remove instances from the majority class that are very distant from the decision border, and effectively what you're doing is selecting a subset of samples from the training data such that one-nearest-neighbor with the subset can classify the examples almost as well as one-nearest-neighbor with the whole dataset.
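As a rough sketch of the class_weight route described above, combined with the grid search and feature selection steps and the manual, cost-matrix-style weights used in cost-sensitive learning — the pipeline, grid and weight values here are illustrative assumptions, not the actual GoCardless model:

```python
# Sketch: letting the classifier handle the imbalance via class_weight,
# inside a grid-searched pipeline scored on recall.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "select__k": [5, 10, 20],
    "clf__C": [0.01, 0.1, 1.0],
    # 'balanced' reweights inversely to class frequency; a manual dict such as
    # {0: 1, 1: 50} is the cost-sensitive variant, with the weights taken from
    # a cost matrix agreed with the business team.
    "clf__class_weight": ["balanced", {0: 1, 1: 50}],
}

search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```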
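And a minimal sketch of AdaBoost as described above, using scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-one decision stump; the dataset is synthetic:

```python
# Sketch: AdaBoost as a meta-estimator that repeatedly refits a weak learner,
# upweighting the points the previous round misclassified. The final model is
# a weighted vote of all the weak learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)

# Each boosting round reweights the training samples before refitting the
# default depth-1 decision tree ("decision stump").
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(ada, X, y, cv=5, scoring="recall").mean())
```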
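For the two cleaning techniques just described, the talk doesn't name an implementation; the third-party imbalanced-learn package (an assumption on my part, not mentioned in the talk) provides both, roughly as follows:

```python
# Sketch: Tomek link removal and condensed nearest neighbor via the
# imbalanced-learn package, on a synthetic imbalanced dataset.
from imblearn.under_sampling import CondensedNearestNeighbour, TomekLinks
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Drop the majority-class member of each Tomek link (a cross-class pair of
# mutual nearest neighbors), making the class border more distinct.
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# Condensed nearest neighbor: keep a subset of majority points such that 1-NN
# on the subset classifies the data almost as well as 1-NN on everything.
X_cnn, y_cnn = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)

print(len(y), len(y_tl), len(y_cnn))
```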
OK, so just to finish, I want to briefly talk about what our project on fraud detection means for the company, because that's where all of this work on addressing the problem of imbalanced datasets came from. We're introducing the new fraud system at the moment — it's a gradual process, obviously, as we don't want to introduce any breaking changes — but based on the feature importances that we've derived from the work we've done so far, we've already switched off certain fraud alerts, and that's already reduced the burden on our team of fraud analysts. They're really happy that the best model was the logistic regression, because it's something that is very easy to explain to the rest of the company. What we intend to do is provide a fraud score to the fraud analysts, on a scale of nought to one, so that they can read from the dashboard which features are contributing to a high fraud score when it occurs.

Finally, there are certainly other considerations — other things that we need to consider which we haven't yet. Stationarity is a common assumption in many time series techniques, but we know that the mean, the variance and the autocorrelation structure of the fraudulent behavior are likely to change over time, so it's something that we're monitoring. And finally, I think it was Ian that mentioned earlier about keeping your model as simple as possible. Lots of people have suggested to me using neural networks and other nonlinear algorithms, but for us the main challenges have been cleaning the data and engineering more features — that takes a hell of a lot of time — and through just doing that and oversampling, we've been able to get a model which is significantly better than the one that we're currently using.
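As a rough illustration of the kind of nought-to-one fraud score and per-feature dashboard read-out mentioned above — the feature names and model here are invented for the sketch, not GoCardless's actual system:

```python
# Sketch: a 0-to-1 fraud score from a fitted logistic regression, plus a rough
# per-feature contribution read-out for an analysts' dashboard.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=4, n_informative=4,
                           n_redundant=0, weights=[0.98, 0.02], random_state=0)
feature_names = ["email_reuse", "payment_velocity", "account_age", "amount"]

scaler = StandardScaler().fit(X)
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(scaler.transform(X), y)

merchant = scaler.transform(X[:1])                 # one merchant's feature row
score = clf.predict_proba(merchant)[0, 1]          # fraud score in [0, 1]
contributions = clf.coef_[0] * merchant[0]         # per-feature log-odds contribution
print(f"fraud score = {score:.2f}")
for name, c in sorted(zip(feature_names, contributions), key=lambda t: -abs(t[1])):
    print(f"  {name:18s} {c:+.2f}")
```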
So I finished a bit early, but if you have any questions...

[Question: how do you define a fraudster?] With direct debit it's slightly different: a merchant will actually sign up to our company — sign up to our service — and what they'll have done is acquire a load of customer details, stolen customer details, maliciously, and they create direct debits on behalf of those customers. Direct debit is essentially a pull mechanism, so they pull the money out of those customers' accounts, and then the customer gets in touch with us because on their bank statement it says GoCardless. They charge back to us, and we're left chasing money from the merchants. So it's a little bit different to Stripe, in that the money switches hands via GoCardless — I don't know if you went to the fraud talk yesterday. Exactly, exactly. But we do try: if a merchant has an email address which was previously used by a fraudster, for example, we flag that as fraud before they defraud the customer. Obviously we want to catch them as early as possible, not after they've defrauded the customer, because we have a five-day lag between the payment being initiated and actually being carried out, so it's beneficial to all parties if we act as soon as we can.

Yeah, sorry, that's a good question. We've only recently — we actually have been using an AdaBoost model on a different project, not on the fraud project itself, and we're getting good results with it, but we're at an early stage of that project, so we haven't been looking a lot into the feature importances. One thing that was interesting, actually, just on the subject of feature importances: when we initially ran our models without balancing the dataset, random forests came out as the best model, and when we looked at the feature importances that were generated, they just didn't make sense intuitively. The accuracy was obviously very high, because the base rate is so high, but it was only once we'd done the resampling that the results actually matched our intuition, and that's huge, because machine learning is working off the experiences of humans, right? And that's something that we perhaps have to explain to people in the company, because there'll be a new type of fraudulent behaviour, they go and test it, and the model says "not fraud"; they say "oh, you're wrong", and we say "well, next time it will catch it — we'll just retrain the model."

OK, so the question is: when we evaluated the model, what false positive rate and true positive rate were we aiming for? Well, I always think in terms of recall and precision, but with us at GoCardless a lot of the risk is on us if we don't catch the fraudster, so we aim for a recall as close to one as possible — we want the true positive rate to be 100%. As far as the false positive rate is concerned, it's very much a trade-off between the cost of hiring more fraud analysts and the cost of having a huge chargeback that we have to cover. So we base our model's performance on the true positive rate, and any improvement that we can get in precision, or the false positive rate, is a bonus to us. At the moment we've got a precision of forty percent; when merchants are detected as fraud, we have the fraud analysts actually go through them — and all of the other ones too, not only the true positives but also what turn out to be false positives — and then they manually mark them. At the moment there are a lot of clever people out there trying to defraud us, and it's taking up a lot of time; that's what spurred on this project, that's what motivated it, and it's already generating some good results.
So the question is what transformations we applied to the feature space. To be honest, I don't want to go into too much detail on the features that we're using, as it's a fairly sensitive topic — things like emails, and data that we acquire from the merchants when they sign up. We also looked at their payment activity, and dynamic features as well as static features, so their activity using GoCardless within the previous five days. Angus, can you remember any other weird ones? On the features themselves, in terms of engineering, I can't think of a good example off-hand; I think it was choosing the features that was the bigger part of the process, as opposed to engineering them or combining them once we've got them.

Yes — so, like a one-class classification? Yeah, exactly, that's another algorithm-level approach to addressing the data imbalance problem. We've considered it, and we kind of went down the easy route first, probably. It's something that we might consider looking at in the future, but right now we don't need to. We have been using anomaly detection on something else recently at GoCardless, which is looking at fraud more on a payment level than on a merchant level. Thank you, thanks.
Info
Channel: PyData
Views: 32,584
Rating: 4.7963638 out of 5
Id: X9MZtvvQDR4
Length: 27min 45sec (1665 seconds)
Published: Tue Oct 06 2015