Hello everyone, and thanks to the organisers for the opportunity to share what we do at Zopa with you today. Today I'm going to talk about how we use machine learning in financial credit risk assessment.

My name is Soledad Galli. Just to say a few words about me: I have a PhD in molecular biology, then I did around eight years of research in neuroscience at University College London, and I finally joined Zopa as part of the data science team, where I contribute to the generation of insight through data analytics and machine learning.

So you may wonder who Zopa are; I wondered who Zopa were before I joined them. Zopa is a digital finance company, and it was the first peer-to-peer lending platform in the UK, founded in 2004. Just so you get an idea of the scale we're working at: at Zopa we have lent more than 2 billion pounds since foundation, we have lent money to more than 240,000 people, and many people choose to actively invest their money with us.

So why did I join Zopa? What attracted me to this company is that machine learning is at the heart of what we do, and we use state-of-the-art machine learning to produce our models; basically, we can use every single algorithm that is out there to try and build the most accurate models. We also work in an agile manner, so we work fast and we deploy fast, and we write our own software, which is good at the time of deployment because it allows us to build and deploy quite quickly. Another thing that is very important, at least for me, is the end-to-end inter-team collaboration: the data scientists work in close proximity with the risk analysts, who provide some of the domain knowledge, and also in close proximity with the developers, who are the ones that are going to take our research to life.

So, the philosophy and lending model at Zopa: what we do is put investors who want to invest their money in contact with borrowers who are looking for a low-rate loan. Basically, when an investor puts money into Zopa, we match it to a borrower who is looking for a low-rate loan, say to improve the house or to buy a car or whatever they want, and when this borrower returns the money to Zopa with the interest, the interest is passed directly on to the investor. By allowing this circulation of money and by managing risk, we are able to provide low rates to the borrowers and higher returns to the investors.

But in order for this peer-to-peer lending model to work, at Zopa we need to handle all types of financial risk. This includes credit risk, and it also includes fraud prevention, and we use machine learning for that. We also use machine learning, for example, to detect whether a document has been modified, so if someone is trying to upload a document that is false we use machine learning for that, and we also use machine learning, for example, to determine the success of our marketing campaigns, among other things. But today I'm going to talk about the journey that we had at Zopa to build our credit assessment model from beginning to end.

So what is credit risk? Credit risk is the risk of default on a debt acquired by a person. Put in lay terms, credit risk is the risk that a person to whom we have lent money will not be able to pay it back. We want to assess and mitigate this risk for three main reasons.
The first one is that we want to lend responsibly: we don't want to lend money to a person if that is going to hurt their financial situation because they cannot pay it back. We also owe it to our investors, so we don't want them to lose their money. And finally, we want to keep our reputation as a responsible and prudent lender. That's why we need to manage credit risk.

So hopefully today in this talk I'm going to take you through the process of creating a machine learning algorithm to target credit risk. This involves the very first steps, which are data gathering and creating the target definition, which is basically: what do we call risk? After we have the data and the target, we have to clean the data and pre-process the variables in a way that lets us feed them into machine learning algorithms. When we have done that, we want to select the variables that are most predictive, and once we've done that we build the machine learning models, and then we can add an extra layer of complexity through model stacking; I'm going to say a few words about that. And then, once we've done all of this, how do we actually take it to life?

The first thing you need to build a machine learning model is data, so how do we get the data? Zopa gets data from credit agencies, which are agencies and institutions that gather and store financial information about people in general. The information they provide covers mortgages, credit cards and current accounts: how much money people have in those accounts, how many accounts they have, for how long they have held them, have they paid, have they missed payments. All this information we get from credit agencies. We also use information that the applicant provides to Zopa at the time of application, for example: what do they want the loan for, do they have a job, what kind of job do they have, what is their salary, etcetera. Bringing these two data sets together, just so you have an idea, we end up with around 3,000 characteristics per person, and we have the data for several hundred thousand borrowers.

Once we have the data set, we need to define the target. The target is what we want our machine learning models to predict, and we want to predict risk. So how do we build a target that predicts risk? Basically, we want a target vector that categorises people according to whether they are in financial difficulty or not. The way we do that is that we monitor the financial accounts of customers for several months and we look for signs of financial difficulty. Financial difficulty may involve default, for example they just couldn't pay the debt back, and we also look for signs of missed payments in any one month. If we see any signs of financial difficulty, in mathematical terms we assign that borrower a zero, and if we don't see any sign of financial difficulty we assign that borrower a one. What we end up with at the end of this exercise is a data set with several hundred thousand rows, which are the borrowers, with three thousand characteristics, which are the columns, and a single target vector of ones and zeros determining whether the customer is in financial difficulty or not.
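To make the target definition concrete, here is a minimal sketch of how such a label could be built with pandas; the column names and the tiny monitoring table are hypothetical, not Zopa's actual schema.

```python
import pandas as pd

# Hypothetical monthly performance data: one row per customer per monitored month,
# with flags for missed payments and defaults observed in that month.
performance = pd.DataFrame({
    "customer_id":    [1, 1, 1, 2, 2, 2],
    "month":          [1, 2, 3, 1, 2, 3],
    "missed_payment": [0, 0, 0, 0, 1, 0],
    "defaulted":      [0, 0, 0, 0, 0, 0],
})

# A customer shows financial difficulty if, in any monitored month,
# they missed a payment or defaulted.
difficulty = (
    performance.assign(bad=lambda d: d["missed_payment"] | d["defaulted"])
    .groupby("customer_id")["bad"]
    .max()
)

# Convention used in the talk: 1 = no signs of difficulty, 0 = financial difficulty.
target = (1 - difficulty).rename("target")
print(target)
```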
The next step, now that we have our data set, is to pre-process the variables. Variables usually don't come in a form that you can use straight away, so we have to do something about them, and we pre-process categorical variables and numerical variables differently.

Categorical variables are variables that have labels instead of numbers; for example, for gender, instead of a number you would have female or male. So what can we do with them? Categorical variables tend to have a lot of null values; a null value is when we actually don't have a value for a customer for that variable. There are a few options. You can choose to throw the data away, but then you would be throwing away a lot of data, so you don't really want to do that; what you want to do is replace those null values with something else. How can we do that? In the traditional way, you can choose to replace those values with a random sample of the variable: wherever you have a null value, you select a random sample and fill the gap. That would be one way. The other way is to replace the null value with the most frequent category: if you have a category that appears a lot, you fill all the gaps with that category. We don't like to do that at Zopa, because we think that if the value is missing, it is missing for a reason; if we don't have financial information for that person, there has to be a reason. So we choose to fill our data differently and we create an additional category that we call Missing, so wherever there was a null value, now there is a Missing label.

Then there are also rare values that you can have in your categories. For example, if your variable is city, you can imagine that cities like London will have a lot of representation in the variable because the population is huge, but smaller cities or small towns will have much lower representation. When you have rare categories, what happens is that the model tends to overfit, because it will think that everybody coming from that city behaves like the only borrower it saw from that city, and we don't want that to happen, so we want to remove the rare values. How can we do that? Again, you can take a random sample of the variable and replace the rare values with that, or you can replace them with the most frequent category, but we don't think that this is the way to go. What we do at Zopa is generate a category called Rare, so now every time your variable contains a rarely seen city, it gets a label that says Rare.

Once your labels are replaced and your variable is full, we need to transform it into numbers so that we can feed it to a machine learning model. How can we do that? Again, there are a variety of methods. You can replace each label by its frequency, so if you have 1,000 customers from London, you replace London by 1,000, or by 1,000 divided by the total number of borrowers. You can also assign ordinal values: if you suspect that your variable is ordinal, say days of the week, you can put a one on Monday and a seven on Sunday. You can do one-hot encoding; I don't think I have time to describe that here. What we do at Zopa is also assign a number, but before we assign it we impose an order on the labels: for each label in the variable we measure the value of the target, we order the labels decreasingly by that value, and then we assign the numbers. Say, for example, I take London and I see that the default rate there is 60 percent, and that is my highest default rate, so I assign it the number one; then I evaluate the next city, and the next, and say for some small town the default rate is zero, so that one gets the lowest number. So I order the labels according to the default rate, from the highest to the lowest, the label with the highest rate gets number one, the next label gets number two, and so on. Basically, we are assigning an order first and then giving the numbers according to that order, and that already allows me to capture some information about the target in the variable at the time of the transformation.
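Here is a minimal sketch of those three steps in pandas; the city variable, the toy data and the 20 percent rare-label threshold are made up for illustration, and in practice the mappings would be learned on the training set only and then applied to new data.

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["London", "London", "Leeds", None, "Smalltown", "London", "Leeds", None],
    "target": [0, 1, 1, 0, 1, 0, 1, 1],   # 1 = good payer, 0 = financial difficulty
})

# 1) Replace null values with an explicit "Missing" category.
df["city"] = df["city"].fillna("Missing")

# 2) Group infrequent labels (here, seen in fewer than 20% of rows) under "Rare".
freq = df["city"].value_counts(normalize=True)
rare = freq[freq < 0.20].index
df["city"] = df["city"].where(~df["city"].isin(rare), "Rare")

# 3) Ordered ordinal encoding: compute the default rate per label (fraction of
#    bad customers), order labels from highest to lowest default rate, and
#    assign 1 to the riskiest label, 2 to the next, and so on.
default_rate = (1 - df["target"]).groupby(df["city"]).mean()
ordering = default_rate.sort_values(ascending=False)
mapping = {label: rank for rank, label in enumerate(ordering.index, start=1)}
df["city_encoded"] = df["city"].map(mapping)
print(df)
```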
So now we have all the categorical variables in the form of numbers. What do I do with the numerical variables? Again, numerical variables can have null values, which is the lack of information for a given borrower in that variable, and they can also have outliers. An outlier is more or less the same idea as a rare value: it is a number that sits very far out in the distribution of the values present in the variable, so while most of your customers look one way, this one looks very different, and it tends to make models, in particular linear models, overfit. So you want to do something with those values, and the way we treat null values and outliers is actually very similar. Again, you can replace them with a random sample of the variable, or in this case with the median or the mean, basically the most typical value. What we actually do at Zopa is replace them with a number that is very far out in the distribution but still within it: if you imagine a normal distribution with the outlier way out there, we bring it to the end of the distribution, so it stays within the population but is still different from the rest of the customers.

So that is for cleaning the data, but at the moment of using these variables in machine learning models there is still some more pre-processing to do, depending on the model you want to use. For linear models like logistic regression, and also for neural networks, the performance of the model benefits from scaling the data, from squeezing the values of each variable into a comparable range, say between minus one and one, so this is what we do. Here, specifically, we standardise: we subtract the mean from the variable and divide by the standard deviation; there are other ways of scaling as well, if you choose.

Another important thing is that linear methods like logistic regression assume a linear relationship between your variable and the target, and that is more often than not not the case, so it is a good idea to try a transformation that produces a linear association between the variable and the target. The most typical of these transformations is probably the logarithmic transformation; you can also take square roots, or you can use basically any function you like that creates a linear association between the variable and the target. What we do at Zopa is use decision trees to discretise the variables: basically, we let a tree find the cuts in the variable. How does this work? For each single variable, say salary, I build a classification tree, a very shallow one, with one or two splits, three at most. What the classification tree does is find, within that variable, the cuts that make the best separation between people at risk and people not at risk, which in a sense is a way of capturing the association between the variable and the target. So I can find three or four cuts within the variable that make it more linearly related to the target than it was before, and after I've built the tree, we replace the value of the variable with the probability output by that single classification and regression tree, if that makes sense.
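A minimal sketch of this tree-based transformation with scikit-learn, on made-up salary data; in practice the tree would be fit on the training set only and its depth chosen with cross-validation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Hypothetical single variable (e.g. salary) and binary target (1 = good payer).
salary = rng.normal(30_000, 10_000, size=1_000).reshape(-1, 1)
prob_good = 1 / (1 + np.exp(-(salary.ravel() - 30_000) / 10_000))
target = (rng.uniform(size=1_000) < prob_good).astype(int)

# A very shallow tree (two splits here, three at most) finds the cut points in
# the variable that best separate good payers from customers in difficulty.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(salary, target)

# Replace the raw variable with the probability output by the tree; the result
# takes one value per leaf and is monotonically related to the target.
salary_transformed = tree.predict_proba(salary)[:, 1]
print(np.unique(salary_transformed))   # up to 4 distinct values with depth 2
```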
So now we have a data set that is ready to be fed into the machine learning models. Something very important, particularly when you're working in a commercial environment, is feature or variable selection, and here I specifically mean reducing the number of features from a few thousand to a few tens. Why is this important? For a variety of reasons. First, it's easier to implement: when a developer has to write the code that calls the information about the customers, it is much faster if they only have to call 50 variables than if they have to call a thousand. Second, it's faster: a model that is trained and predicts using tens of variables is much faster than one that uses three thousand, and we want to provide a smooth experience for the person who comes to the platform. And finally, it's more reliable: as you saw, we rely on credit agencies' information, so the more variables we use from them, the more room there is for a mistake to happen; if something breaks on the credit agency side, we are left with limited information to work with. So we want to reduce the feature set to a few tens.

How do we do that? It is a massive drop, from a few thousand to a few tens, and we do it in two stages. In the first stage we build a classification tree for each variable: I take variable number one, I build a classification tree of that variable versus the target, and I look at the accuracy, or the ROC AUC, of that model; if the variable is predictive I keep it, and if it is not predictive at all I just drop it. At the same time, I build a random forest using all 3,000 variables; a random forest is able to provide information about the importance of each feature, ranking them from the most important to the least important, so I look at the importances it gave and remove all the variables with zero importance. By doing these two things I'm able to go from 3,000 variables to a couple of hundred.

Once we have a couple of hundred, I think it's a good idea to let each machine learning model decide what is best for it, and we do that with a process called recursive feature elimination; we do recursive feature elimination for each machine learning model. How does it work? You build one model, say a random forest again as an example, on your 500 variables, and you measure the performance of the model. Then you remove one feature and measure the performance again: if the performance dropped, the feature was important, so you put it back; if the performance didn't drop, the feature was unnecessary, so you discard it. Then you do the same with the next feature, and the next, and the next, until you find the smallest set of features that gives the highest performance for that machine learning model, in this case the random forest. We do the same for logistic regression, and we do the same for XGBoost, and then it is up to the data scientist to scan all those models and find the one that works best, so this last stage is a little bit of an art, if you want.
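A rough sketch of this two-stage selection with scikit-learn, assuming a pandas DataFrame X of features and a binary target y; the thresholds are illustrative, and the final per-model step is approximated with scikit-learn's RFECV, which eliminates features by model importance, rather than the exact remove-and-check loop described in the talk.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def two_stage_screen(X: pd.DataFrame, y: pd.Series, auc_threshold: float = 0.55) -> list:
    # Stage 1a: one shallow tree per variable; drop variables whose
    # cross-validated ROC AUC against the target is close to random (0.5).
    keep = []
    for col in X.columns:
        auc = cross_val_score(
            DecisionTreeClassifier(max_depth=2, random_state=0),
            X[[col]], y, cv=3, scoring="roc_auc",
        ).mean()
        if auc > auc_threshold:
            keep.append(col)
    # Stage 1b: a random forest on all variables; drop anything with zero importance.
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    return [c for c in keep if importances[c] > 0]

def per_model_selection(X: pd.DataFrame, y: pd.Series) -> list:
    # Stage 2: recursive feature elimination, here for a logistic regression.
    selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=3, scoring="roc_auc")
    selector.fit(X, y)
    return X.columns[selector.support_].tolist()

# Example usage on synthetic data.
X_arr, y_arr = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"var_{i}" for i in range(20)])
y = pd.Series(y_arr)
kept = two_stage_screen(X, y)
final = per_model_selection(X[kept], y)
print(final)
```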
Now that we have the variables pre-processed and we have selected the ones we want to use, the next step is to build the final machine learning models and to optimise them. What we do, on the tens of variables that we selected, is build a battery of machine learning models. Just to give you an idea, we built linear models, logistic regression and multivariate adaptive regression splines; we built tree-based models, including random forests and gradient boosted trees; and we also built a neural network. We hope that these models will give a high probability to customers who can actually repay the loan with ease, and a low probability to customers who may encounter some financial difficulty at the time of paying the loan.

So now we have a variety of models, say seven models, and we want to test their performance; we want to know which one is the best, which one we are going to use. There are several ways of assessing the performance of the models; the one that we use at Zopa is the ROC AUC. The ROC curve tells you, for each probability threshold, how much success you have and what the trade-off is. Let's say I choose a threshold of 0.5, and I say that if the probability is higher than 0.5 I will determine that this customer is a healthy customer. The ROC curve gives you, for that threshold, an indication of how many healthy customers the model was able to pick, and, for that number of healthy customers, how many false positives it got, that is, how many times it said that a customer would be able to repay when they actually couldn't. So it gives me an indication of the success of the model, and ideally I want the model that captures the greatest number of healthy customers with the lowest number of false positives, the lowest number of times saying that a customer could pay when they actually couldn't.

For those of you familiar with tree models, you will know that these are ensemble algorithms. What are ensemble models? Take a classification and regression tree: in a random forest, or in gradient boosted trees, there is not just one tree making the prediction, there are several trees making the prediction. How are they built? You take your data, you select a sample of it and you build a classification tree; you extract another sample and build a second classification tree; you extract another sample and build a third classification tree, and so on. You end up with multiple classification trees, typically a hundred, each one slightly different from the others: very similar, but slightly different. Then we ask them, in a way, to vote: we take the average probability across all of them, and this was shown, back in the 1990s, to improve the accuracy of the prediction quite dramatically. As an example, imagine you want to sell your house: would you go and ask only one estate agent, or would you ask ten estate agents and take the average price? The model here is doing the same. This is intrinsically the way random forests work, and gradient boosted trees as well, but we can actually do the same with logistic regression, in a process called bagging of the predictors, which was also described in the 1990s.

Bagging of the predictors works very similarly: we extract one sample of the data set and build a logistic regression on it, then we extract another sample and build another model, and so on, until we end up with, say, 100 models that are similar but not equal. Then we ask all of them for a probability and take the average probability of those hundred models to predict the probability of repayment. We have observed, for our data, that this improves the ROC AUC in the second decimal, which is quite a big improvement.
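This sample-and-average idea for logistic regression maps directly onto scikit-learn's BaggingClassifier; here is a minimal sketch on synthetic data (the 100 estimators and the 0.8 sampling fraction are illustrative choices, not Zopa's settings).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2_000, n_features=30, n_informative=10, random_state=0)

# A single logistic regression...
single = LogisticRegression(max_iter=1000)

# ...versus 100 logistic regressions, each trained on a bootstrap sample of the
# data; their predicted probabilities are averaged to give the final score.
# (On scikit-learn < 1.2 the argument is named base_estimator instead of estimator.)
bagged = BaggingClassifier(
    estimator=LogisticRegression(max_iter=1000),
    n_estimators=100,
    max_samples=0.8,
    random_state=0,
)

for name, model in [("single", single), ("bagged", bagged)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:>7}: ROC AUC = {auc:.3f}")
```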
So now we have all these models and hopefully have selected the one that works best. At Zopa we have also played a little bit with model stacking, and I want to tell you what this is. Model stacking is a technique that allows you to build another machine learning model on top of the individual machine learning models. You have all the models that you built in the first stage, each one taking the tens of features into account and outputting a probability, and then you can feed these probabilities into a second machine learning model and ask it to predict the probability of repayment, now based on the probabilities instead of on the credit agencies' variables. This typically improves the performance of your learning algorithm, because it is able to capture where each of those models is doing well and to discount where each of them is doing badly. Meta-modelling can be as simple as averaging the probabilities: you can take an average of all the probabilities, and that is your meta-model, or you can take the average of just two or three of them, it's up to you. You can also build another machine learning model, so here you could have a logistic regression using those probabilities as input, or gradient boosted trees using those probabilities as input. It is up to you; it's very creative, and there is a lot of room to play.

So, what have we seen at Zopa? This is our experience: when evaluating the individual models, perhaps unsurprisingly, we found that gradient boosted trees and neural networks are the ones that individually perform the best, so those are the ones able to classify the greatest number of healthy people with the fewest false positives. And when we take the probabilities of these two together and use them to make the final probability, we find that this increases the performance even more. We have played with a lot of probability combinations, and we have also built several machine learning models using those probabilities as inputs, but what we have observed is that the boost in performance, given our models, is not big enough to justify the expense of resources, both computational and human, of implementing another machine learning model on top. So we decided not to go this way, at least not for this exercise.

So, if you have borne with me for the whole talk, what we have at the end is a data set with tens of variables, we have a target, we built several models, we selected the two that work best, and we averaged their probabilities; that average probability of the best-performing models is the one we actually use to determine the probability that a person who comes to the platform will repay the loan, if we were to give it to them.
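To make the blending step concrete, here is a minimal sketch that averages the out-of-fold probabilities of a gradient boosted model and a neural network, and then fits a logistic regression on top as a meta-model; the model choices and hyperparameters are illustrative, with scikit-learn's GradientBoostingClassifier and MLPClassifier standing in for the XGBoost and Keras models used in the talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2_000, n_features=30, n_informative=10, random_state=0)

gbt = GradientBoostingClassifier(random_state=0)
nn = make_pipeline(StandardScaler(),
                   MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0))

# Out-of-fold probabilities from each base model (avoids leaking the target
# into the meta-model).
p_gbt = cross_val_predict(gbt, X, y, cv=5, method="predict_proba")[:, 1]
p_nn = cross_val_predict(nn, X, y, cv=5, method="predict_proba")[:, 1]

# Simplest meta-model: just average the two probabilities.
p_avg = (p_gbt + p_nn) / 2
print("GBT    :", round(roc_auc_score(y, p_gbt), 3))
print("NN     :", round(roc_auc_score(y, p_nn), 3))
print("Average:", round(roc_auc_score(y, p_avg), 3))

# Alternatively, a second-level model trained on the probabilities themselves.
# (For illustration only; in practice the meta-model is evaluated on held-out data.)
meta_X = np.column_stack([p_gbt, p_nn])
meta = LogisticRegression().fit(meta_X, y)
print("Stacked:", round(roc_auc_score(y, meta.predict_proba(meta_X)[:, 1]), 3))
```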
So what do we do next? We need to bring this to life: for each customer who comes to the platform, we need to score them quickly with this model. How do we do that? Well, at Zopa we have developed a toolkit that we, very creatively, have named Predictor. Predictor sits on top of Python; it uses NumPy and pandas to pre-process the variables, to remove null values and to make all the variable transformations, and it uses scikit-learn for the traditional machine learning models and for recursive feature elimination. We use py-earth to build the multivariate adaptive regression splines, we use XGBoost for the gradient boosted trees, we use Keras for the neural networks, and we use matplotlib and seaborn for data visualisation. The data science team does most of its research in Jupyter notebooks, which allow for easy visualisation of your data and your results.

So we do our research using Predictor, which means that we don't have to write code from scratch every time we want to pre-process data and build the machine learning algorithms, because all of that is in the functionality of Predictor itself, and we also use it to deploy. Basically, when a customer comes to us, what is assessing that customer is Python software, Predictor. This decreases the overhead between research and development, because our developers don't have to rewrite everything from scratch in another language; they just use the tools available in Python.
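To illustrate this research-to-deployment pattern in general terms, here is a hypothetical sketch, not Zopa's actual Predictor code: the fitted pre-processing and model are persisted as a single scikit-learn Pipeline, and the scoring script loads it and applies it to each incoming applicant. Every name here, the features, the file credit_pipeline.joblib, the score_applicant helper, is made up for illustration.

```python
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# --- research side: fit pre-processing + model once, then persist it ---
numeric = ["salary", "existing_debt"]
categorical = ["employment_status"]

pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])
# pipeline.fit(X_train, y_train)
# joblib.dump(pipeline, "credit_pipeline.joblib")

# --- deployment side: the same code scores each incoming applicant ---
def score_applicant(applicant: dict, model_path: str = "credit_pipeline.joblib") -> float:
    """Return the predicted probability of repayment for one applicant.
    (In a real service the model would be loaded once at startup, not per call.)"""
    model = joblib.load(model_path)
    row = pd.DataFrame([applicant])
    return float(model.predict_proba(row)[0, 1])
```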
Predictor is also flexible, which means we can keep improving it: for example, we started with logistic regression and MARS, in a second exercise we added XGBoost, and now we have added Keras, and we have also added an in-house recursive feature elimination using XGBoost. Basically, you can keep adding complexity on top of it as you progress in your research. It is proprietary, but we are actually thinking of open-sourcing it.

So, I hope I could give you a flavour of what we do at Zopa and of what it takes to build an end-to-end machine learning process, and I would like to thank you for your attention.

[Audience question, partly inaudible, about whether the model re-scores the entire loan book or just the individual customer who applies, and about univariate feature selection.]

Okay, it is a fair question. I think the question is: we do univariate feature selection, and that is a risk because a variable may have interactions with other variables; do we take care of that? The answer is yes. First I build a tree and find the ROC AUC for that variable versus the target; if the ROC AUC is around 0.5, which basically means it is random, we remove it, and it is very unlikely that, if you put it back, it will add anything. In any case, in a later stage we go over all the variables that we removed, one by one, in a recursive feature addition, and if I find that a feature improves performance above a certain threshold, I put it back.

[Audience question about whether the process produces a two-class classifier, lend or don't lend, or an actual probability measure, for example because banks have regulatory obligations to produce scenario analyses.]

Okay, so I think the question is whether we output a classifier, just lend or not lend, or whether we output a probability. Our models, all of them, output a probability of repayment, and what we do with the probability comes afterwards. If the customer has a very high probability of repayment, which means they are a financially very healthy customer, we put them in the market with our lowest interest rate; and if there is someone we find is a little bit more risky, we are still happy to lend if they are going to pay a higher rate. So this is the way we work with the probability. On top of what I showed there is, of course, an entire layer of business metrics, to satisfy compliance and regulations and all of that, but I didn't have time to talk about that here.

[Audience question about how important it is to be able to explain why one customer is riskier than another, given that models like neural networks do not give an explicit answer.]

Fortunately, at Zopa the stakeholders, including the CEO and the product managers, are very acquainted with machine learning, so they have an idea of how trees work and how neural networks work, and they understand that these are kind of black boxes, if you want. The problem is when we have to talk with our investors, because they don't necessarily have that background, so we have to bring it down to lay terms. When we can, we try to explain what a tree does, but very likely that is not enough because they may not understand, so we show them the numbers, how our models are performing.

[Audience question about regulation of the algorithms and the variables they use.]

Not when it comes to the machine learning algorithms themselves, as far as I'm aware, but there is legislation about what variables you can use, so we cannot use gender, for example, and we cannot use age. The first thing is that we have to comply with the regulation, so variables like age and gender we cannot use and we don't pass them to the model; it is simply illegal, so we don't do it. I have at some point looked for correlations between age and probability of repayment, and I don't particularly find that age is a great predictor. And we are blind to race to begin with, so I couldn't tell you if there is any correlation between race and address; information that is illegal to use, we are blind to, so I cannot infer anything from it to begin with.

[Audience question about class imbalance.]

For this exercise we actually don't have a balancing problem: around 40 percent of the people could be labelled as having missed payments or defaulted on accounts other than loans, which is what we lend. But we do have that problem for fraud, for example; the amount of fraud is very, very tiny. What we do in that case is rebalance the data set: basically, we over-sample from the fraud class to over-represent it in the data set, and then we adjust the probability accordingly.
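One common way to do that adjustment, not necessarily the one used at Zopa, is a prior-correction formula that rescales the predicted odds by the ratio of the true class prior to the training-set prior; a minimal sketch below, with made-up fraud rates, assuming the only thing that changed between training and reality is the class prior.

```python
import numpy as np

def correct_for_resampling(p_train: np.ndarray, prior_train: float, prior_true: float) -> np.ndarray:
    """Re-adjust probabilities from a model trained on a re-balanced data set
    so that they reflect the true (rare) class prior.

    p_train     : probabilities of the rare class output by the model
    prior_train : fraction of the rare class in the over-sampled training set
    prior_true  : fraction of the rare class in the real population
    """
    # Standard prior-shift correction: rescale the odds by the ratio of priors.
    num = p_train * prior_true / prior_train
    den = num + (1 - p_train) * (1 - prior_true) / (1 - prior_train)
    return num / den

# Example: model trained on data over-sampled to 20% fraud, true rate around 0.5%.
p = np.array([0.10, 0.50, 0.90])
print(correct_for_resampling(p, prior_train=0.20, prior_true=0.005))
```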
[Audience question about how the lending threshold is chosen.]

The beauty of a model like this is that it allows you to play. It comes down to the business needs and the economic situation of the moment, and you can decrease or increase the threshold. If we are in a very stable economic situation and we have a lot of money to lend, then we can lend a lot, so we will allow more false positives; if the situation changes, we prefer false negatives, so we do exactly the opposite. You can play with that threshold depending on the situation; it's not a set rule.

[Audience question about applicants with little or no credit history and about alternative data sources.]

It could be done, and it's particularly important for cases where people are new to the UK, for example, because then you don't have financial history, since they have not been transacting here. We have started to do some research, but we haven't gone deeply into that. There are agencies that gather social data, for example, and you could work with that; we haven't done that at the moment, and it's also quite regulated, so it's not very easy to go there.

[Audience question about how much the more complex models gain over logistic regression.]

The logistic regression that we do is already quite developed, because we do bagging of logistic regressions, so I don't have one model, I have 100 models, and I have already linearised the variables in quite an elaborate way, so the logistic regression that we use at Zopa is already quite predictive. If I have to give you numbers, from a simple logistic regression to bagging of the predictors I would say the gain is in the order of the second decimal, so it would go from 0.90 to 0.93 or something like that, and then gradient boosting and neural networks are even better. And it's not just that, because that would be benchmarking against ourselves: we also benchmark against the predictive scores that are out there, because the credit agencies provide predictive variables, and we benchmark against those as well. I don't think I'm allowed to share those numbers, unfortunately.

Can you cope with traumatic life events that change the nature of somebody's risk? Let's say, for example, somebody used to be married to a spendthrift partner and they have just completed a divorce, so their credit looks bad for the past but might actually be quite reasonable for the future. Yes, but we don't do that from the machine learning perspective; we have a dedicated team that tackles individual situations like this.

[Audience question about how we know the discarded features are not important.]

I know that they're not important because I built the tree versus the target and the ROC AUC was 0.5, which is random, and the importance given by the random forest was zero, so that tells me some information. But then, going back to the question that the other person asked me, once we have our final data set I just want to be absolutely sure that I didn't miss anything important, so I put the features back one by one, and if one is important and I had missed it originally, I keep it, basically.

[Applause]