Using A.I. for Credit Risk Analysis

Video Statistics and Information

Captions
We don't want to use the microphone. I also don't have any slides; I'm not a big believer in PowerPoint, so you can stare at the screen as much as you want, it's not going to change. The other thing I wanted to mention is that I'm not a college lecturer or a priest doing a sermon, so if you have a question you can just ask while I'm talking. You don't have to wait until the end.

I wanted to talk a little bit today about the use of AI in finance and give some use cases that we've been working on. To give you a little bit of background, I've been working in the AI space since 1980. I started working in speech recognition and natural language processing, then went through expert systems, neural nets v1, and case-based reasoning, and then I worked on neural nets v2. We're currently at neural nets v3, which is deep learning. A lot of the work that I've done has been in areas around measurement of risk in finance.

A few years ago I was doing some work in the UK, and I came back and co-founded a company that was a marketplace for equipment leasing. It wasn't terribly interesting, but after a few years I was able to sell my shares in it and had a little bit of money lying around. That gave me some breathing room, in that I didn't need to figure out what I was going to do next for a little while, and I started working on some interesting problems.

One of the problems that I found very interesting is prostate cancer diagnosis, because of all the major forms of cancer diagnosis it's the one with the lowest accuracy rate when human beings attempt it. Skilled pathologists have about a 70% accuracy rate in diagnosing prostate cancer. I had this idea: what if I looked at an RNA microarray, which gave me the gene expression values for all the genes in the sample? Could I run that through a machine learning algorithm and
predict the likelihood of cancer nearby, not cancer in the cell sample? In order to do that, what I discovered is that there weren't very many algorithms at the time that you could feed a vector of 25,000 values into. So the big problem was: how do you deal with these very, very wide and very shallow datasets? The typical cancer study has about 20 cases in it, but it has 25,000 inputs. Serious problem. When you look at proteins, the vector grows to 465,000. I can't currently do that; I can do the 25,000.

When I built those models, I found that we could accurately predict cancer outcome with samples of healthy tissue: we could tell whether someone had cancer somewhere else in the prostate with 98% accuracy. But when I looked at monetizing that, I realized that, in the U.S. at least, it takes about seven years and a hundred million dollars to get started with an FDA approval. So I started looking at what else we could apply this algorithm to that we could monetize. Because I had worked a lot in credit and finance, it was kind of natural: let's try this same approach, but instead of genomics let's try it on subprime credit. And I found that, lo and behold, the same algorithm works. We'd already solved for 25,000 attributes, so since we didn't need to look at more than 3,000 attributes in credit, the hard problem was solved.

So I implemented a beta test with a lender in the States. They were a deep subprime lender: they lent between $300 and $1,000 for six to ten months at very high interest rates, what used to be called payday lenders. Very, very high default rates. Their big problem was they had a 32.8%, no, I'm sorry, 38.2% first payment default rate, so 38% of their loans never made the first payment. When you're charging something like 700 percent interest, okay, but they weren't making money. And so we took
their bureau best-practice model, which was based on logistic regression and primarily used a credit score, and we replaced it with a machine-learning-based model that, instead of looking at four attributes (FICO score, debt-to-income ratio, trade lines, and inquiry counts), looked at over 2,000. That model, when we just put it in initially, derived an FPD (first payment default rate) of 22%, so in the first month we went from 38.2% to 22%. But by making the model dynamic, so that every time someone paid or didn't pay a loan we injected that into the model and retrained it, it went from 22 to 18 the next month, then from 18 to 15, then from 15 to 12, then from 12 to 9, and eventually 7.8.

One of the things you have to be careful about in doing this sort of thing is that it's very easy to derive a zero percent default rate: just don't lend money out. So you can over-optimize the model to where you reduce the profitability of the portfolio. While we were doing this, we were measuring how much profit was derived from the portfolio, and what we discovered was that 7.8% FPD is over-optimized, because it reduces the yield on the portfolio. Nine percent, in their particular case, is the sweet spot. But you now have the mathematical levers to simply say we're going to go from 7.8 to 9 and lock in 9, and that's transformative to the business. We ran like that, locked in at nine, for two and a half years, and that's where they are today.

So that's a case where you can replace the traditional logistic regression methodology with something much more powerful. And the key to it is this, when you deal with the credit universe, and this is what we're doing specifically at add to: the traditional methodology of rating credit and determining risk, which is to apply logistic regression primarily with FICO scores, actually works. It works for the section of the cohort that is at FICO 700 and above, which is why banks lend to people at FICO 700 and above.
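
The trade-off described in this part of the talk, where the lowest achievable first-payment default rate is not the most profitable one, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's model: the `Loan` record, the toy portfolio, and the candidate cutoffs are all hypothetical, and it assumes each historical loan carries a model score plus a realized profit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Loan:
    p_default: float  # hypothetical model score: predicted probability of first-payment default
    defaulted: bool   # observed outcome
    profit: float     # realized profit; negative when the loan defaulted


def evaluate_cutoff(loans, cutoff):
    """Approve every loan scored at or below `cutoff` and summarize the result."""
    approved = [loan for loan in loans if loan.p_default <= cutoff]
    if not approved:
        # Lending nothing gives a 0% default rate and zero profit.
        return {"cutoff": cutoff, "approved": 0, "fpd_rate": 0.0, "profit": 0.0}
    fpd_rate = sum(loan.defaulted for loan in approved) / len(approved)
    profit = sum(loan.profit for loan in approved)
    return {"cutoff": cutoff, "approved": len(approved),
            "fpd_rate": fpd_rate, "profit": profit}


def best_cutoff(loans, cutoffs):
    """Pick the cutoff that maximizes portfolio profit, not the one that minimizes defaults."""
    return max((evaluate_cutoff(loans, c) for c in cutoffs), key=lambda r: r["profit"])


# Made-up portfolio: repaid loans earn 100, defaulted loans lose 250.
LOANS = [
    Loan(0.05, False, 100.0), Loan(0.10, False, 100.0), Loan(0.15, False, 100.0),
    Loan(0.20, True, -250.0), Loan(0.25, False, 100.0), Loan(0.30, False, 100.0),
    Loan(0.35, False, 100.0), Loan(0.40, True, -250.0), Loan(0.50, True, -250.0),
    Loan(0.60, True, -250.0),
]
CUTOFFS = [0.0, 0.15, 0.35, 1.0]

for c in CUTOFFS:
    print(evaluate_cutoff(LOANS, c))
print("most profitable:", best_cutoff(LOANS, CUTOFFS))
```

In this toy data the cutoff with zero defaults approves only three loans, while a looser cutoff accepts one default and still earns more, which is the same shape as the 7.8% versus 9% observation in the talk.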
It's not because they like them better; it's because they understand the risk profile of that population. Inherently, consumer credit, like weather, is a nonlinear problem. But like most complex nonlinear problems, there's a portion of it that's linearly addressable, and that's what makes the problem deceptive. You can accurately predict weather for three days; you can't accurately predict weather for six months, it's too chaotic and nonlinear. Credit is the same way. You can accurately predict FICO 700 and above using the traditional methodologies. If you want to predict accurately as you move away from FICO 700, you have to use more data and nonlinear models.

When you get to people who have no FICO scores, the traditional financial institution just says: you don't have a score, you can't borrow from us. We made an interesting discovery in auditing lenders. We found a lender that was testing into their unscored population, so basically everyone who didn't have a FICO score, whose identity they validated, they loaned to. This lender overall had a 10% charge-off rate. Their most expensive loans charged off at 24%, pretty typical. The population of unscored people charged off at 14%, so four percent more than the average of people with FICO scores. The lesson to the lender was: if you simply took all those people who didn't have scores and gave them your most expensive loan, they would outperform all the people you're loaning to at your most expensive rate.

That, to me, is one of the most important findings we've seen, because the fact that people don't have credit scores does not make them high risk; it makes them unknown risk. If you can look at data like we're looking at (utility payments, rental payments, cellphone payments), you can look at all kinds of things that are outside the traditional credit spectrum and get a good idea of risk. So what we're doing a tool right now is building up the sufficient level of data to where
we have an understanding of the Canadian market, so that when somebody who's never been approved for a credit card or a bank loan comes to us, we have an understanding of what we can look at to accurately determine the risk of loaning to them.

From a technical perspective, when we're looking at something like a Kaggle competition, we're really simplifying the problem. You have a credit file, and the credit file has a lot of data in it, and you extract the data and you create a vector out of it. What we're trying to do is simply establish a correlation between the vector that's input and a dependent variable. In machine learning terms this is a classification problem: can we determine which loans are good or bad? But that alone doesn't help you with pricing. The probability that a loan is good or bad tells you how to price it, because a loan that has a 92% probability of being profitable is worth more than a loan that has an 82% probability of being profitable. So in essence all the problems become classification problems, and what you're really measuring is a binary outcome, but the question is what the probability of that binary outcome is.

Another example where we're using AI at ad 2 is using deep learning to validate ID. When we ask somebody for an ID, we want them to take a picture of their driver's license and a picture of themselves, and then we use a deep learning algorithm to determine the probability that the person in the ID and the person in the selfie are the same. Even though they may have grown a beard and started wearing glasses, the algorithm is going to tell us what the confidence is that this is the same person. The next step is: how confident are we that the information typed on the ID is also the information entered into the application? That's another thing deep learning allows us to do. The traditional methodology is to create a template of a driver's
license and say, okay, we've centered the driver's license, and the name is over in this block, and we do OCR on it. The problem with that is what happens when people aren't very good at taking the picture of the license: it's angled and it's tilted, and suddenly the template doesn't work very well. Deep learning is much more forgiving of that type of problem than a templated approach.

We also use a tool called Hello Vera, which we didn't write internally, which is an AI-based bot to provide support. I'm sure you've seen these a million times: you type in a question and generally it says somebody's going to get back to you tomorrow. The idea of Hello Vera is: can we derive an understanding of the question, and can we provide an answer so they don't have to wait until tomorrow? It's very much a work in progress. The technology is not quite there yet, but the guy who created Hello Vera was one of the lead designers on IBM's Watson, so I think they're going to get there. So that's an example of some of the things we're working on.

In terms of what we were watching here, I will say that feature engineering is kind of interesting, because when you look at the analysis that was being done, which was very well represented, you realize that this is stuff that computers are really good at and people aren't. There are some guys in California who figured that out, at H2O.ai, and what they did was hire a bunch of Kaggle grandmasters and start recreating their process in software. So H2O.ai gives you an interface where basically you feed the data in and it does all the feature engineering, does the testing, builds the ensemble, and gives you the results. I think that's really the future of this type of work: leveraging much greater automation to perform the AI tasks.

The other thing... how are we on time? Oh, sorry, go ahead. I'm not hearing what... oh, well,
yeah, I mean, fundamentally it is. The basis of algorithmic trading is essentially machine learning algorithms, although many of them are based on linear models, but there's no reason why you can't use a nonlinear model. I'd say that the first large example of AI in finance is algorithmic trading of equities and fixed income assets, and you still see that going on.

Yeah. Generally when you do ID verification you do some form of liveness testing as well. For instance, you give a task to the person, and that prevents the use of a static photo. Also, if the photos match, it's going to generate a one hundred percent confidence, and we're not going to buy into a hundred percent confidence, because it's not going to happen. Generally, for liveness, you ask a person to turn to the left and then come back to center while taking a video of themselves, and then you extract a key frame out of the video. That's the way you do it. There are all kinds of different ways to do it.

Yeah, wait, hold on, let him get that microphone.

Yeah, when I first started doing this, what I did was essentially build two models. I built a model for good times and I built a model for bad times, and the bad-times model was built to track the performance of Lending Club and Prosper assets from 2007 to 2009. That gives you a good example of what's going to happen if the sky falls in. Now, the reality is you really don't know. One of the reasons we build dynamic models is that we want to get ahead of the market. We build a model based upon the past, but the past includes yesterday. You keep feeding data in and you're continuously retraining, and the little blips that occur don't move the model, because you've got a large amount of data, but you start to see the trend, and you start to see that there's a problem brewing.

One of the things that I noticed with lenders, because I was working with lenders
during the last financial crisis, is that the bigger banks have a model, it's a fairly simplistic one, and they changed it twice a year. The world ended in September with the collapse of Lehman Brothers, from a financial perspective, but they weren't scheduled to change their model until March, so the model didn't change. And I'm going, guys, do you not get that everything's just broken? So I wanted to build models that were adaptive to what's going on in the real world.

Yes, you had a question?

Right. I think the repeatability and the explainability are actually more important than the accuracy, particularly in consumer lending. We need to be able to fully explain the outcome of a model. For instance, in the States, and now in Canada and in Europe, it's really not legal to use deep learning for credit decisioning, because deep learning is inherently non-explainable. Gradient boosting, on the other hand, is quite explainable, so we focus more on gradient boosting ensembles. And if you look at Kaggle competitions over the past few years, you'll find that the overwhelming winner of competitions is gradient boosting. It's really the best balance to deal with what you're dealing with.

You measure the performance of an algorithm by looking at the AUC or the Gini score, you've got different methodologies for doing it, but I've found that those can also be misleading, because the best model isn't necessarily the one with the highest AUC score. A good example is that SVMs perform quite well, but SVMs are purely binary. SVMs give you a yes or no; they don't give you a 90% yes, they just give you yes or no, so all yeses are equal. Well, that's not terribly useful in the real world. So you have to look beyond the AUC and look at the real-world outcome.

Yes? Well, none of them come from the credit scores. None are derived;
they're all raw inputs from a credit bureau. What we do is take a credit file. In a credit file, everybody looks at the credit score, and then they put it down and say, oh well, we've got the information we need. The real value of the credit report is not the credit score. The credit score is just a fairly simplistic algorithm, and it's not terribly predictive. The real value is in the raw data.

I'll give you a really good example. Lenders typically look at four things from a credit file. They look at the FICO or Beacon score. They look at the debt-to-income ratio, inclusive of the new loan. They look at the trade line data, which is: did this person have any 30-day lates, 60-day lates, 90-day lates? They have knockouts, such as was there a bankruptcy or a lien or a judgment, and those aren't really factored into the model; they're just kicked out at that point. And the last thing they look at is inquiry count. Lenders have a belief that if you have more than four to six inquiries in a six-month period, you're a higher risk. What we discovered was that's not actually right. I mean, it's kind of right, but what's really predictive is the rate of increase of inquiries over time. If you build a time series out of the inquiries, you gain much more knowledge. For instance, if you have six inquiries over six months, one per month, you have one risk profile. If you have no inquiries and you dump six of them this month, you have a really different risk profile. So you have to look beyond the summarized data and get into the raw data, and that's where we look at thousands of attributes.

I'll give you one other example. We work with a large subprime lender in the UK, and they had a very different problem than the lenders we generally deal with in the US, because they are an auto lender and they have completely nailed their business. They have a 3% collection, a 3% repo rate and a 6% collection rate in subprime auto. It's a
remarkably good result. The problem is it took them two to three days to make a decision on a loan, because it was being made by human underwriters, so they were able to process about a hundred loan applications a day. The business is doing great, highly profitable. Now along came two web-based used-car sites that could generate a thousand applications a day. The value of those applications was such that most of them are not very good, a lot of them are junk, so they can't just hire more people to solve the problem. So they came to me and said, could you replicate our human process in an algorithm and reduce the two to three days to under a second? That's what we built, and it's been running for about eight months now. We took them from a hundred applications a day to a thousand applications a day without increasing their staff, and I think that's a transformative effect.

Yeah. Right, you have to... I mean... yeah, how do we determine, what's often referred to as feature engineering, how do we determine what the inputs are that we're going to look at and how we order them and structure them, is that it? When you're dealing with a credit report, the credit report is time-based. It tends to go back five or seven years, so that's the limitation right there. And yes, you do provide some weighting, in that more current is more valuable than older. One of the things we do in the dynamic modeling is give more weight to newer data that comes in than to old data. Because there's a higher volume of old data, the new data gets more weight, but you have to be careful not to give it too much weight or you're introducing fluctuations into the model.

Yeah. Oh, sorry. Oh, you're welcome.

Well, we're really looking at their behavior and not their motivation. Now, it's often assumed in the market that there are two basic motivational areas. One is the willingness to repay, and the
other is the ability to repay. There are some people who might have the money in the account and just don't want to pay you; they want to keep it. There are other people who absolutely intend to honor their obligations, but they're just really bad with money and they're never going to have it. These are things you generally see in the credit file. Now, these are extrapolations and approximations, they're never completely accurate, but what we tend to look at in credit modeling is that past behavior is fairly indicative.

One of the things we look at in assessing somebody who either has a pretty bad credit file or no real credit file is their bank account. We ask them to log into their online banking, we tell them exactly what we're going to do, and we get a copy of their last 90 days. We can look and see: are they bouncing a lot of checks? Do they make more money than they spend? What's the minimum balance on their account? We can make determinations that are just as good as what we would find in a credit file about the probability that they're going to be in a position to repay.

Oh, overpay, in terms of, you mean paying more than they are... or, yes, they're the same thing.

Well, the algorithm that I wrote three and a half years ago was an AutoML algorithm, to apply all different types of algorithms to the problem and then build ensembles, test them against each other, and determine which one works. We found after about three years that we were always picking the same one, so we simplified that somewhat. But the basic methodology is: you have a string of numbers and you have an outcome, and you train a model to correlate the outcome to the string of numbers. The model doesn't care what it is that you're predicting; they're just numbers. So we're predicting the probability, and you can think of the probability as the confidence score of
the algorithm. We're simply predicting whatever dependent variable someone thinks is important. It could be default, it could be profit, it could be conversion: what is the likelihood that this person, if offered a loan, will take one? And it can be profit, in that we can take the historical view and say this loan generated a hundred-dollar profit, this loan generated a fifty-dollar profit, and then correlate, so that when a new application comes in we can say this one would likely generate a hundred dollars.

Now, the way we deal with pricing is simply based upon probability. What we're doing is saying the probability that this loan will be profitable is X; if X is higher than Y, then X gets a better rate. And it can be as granular as you want. We deal with floating point numbers. We generally use three significant digits, but it doesn't really matter, we can use six, and you can develop as many tiers as you want.

I mean, I'll stay; my flight is tomorrow at 9:55, so... [Music]

Yeah. Lending is heavily regulated, and you have to be very careful about using anything like demographic data or macroeconomic data. For instance, in our models we exclude first name and last name, because names are indicative of race, indicative of gender, indicative of ethnicity. We exclude birthdays, obviously. We exclude zip codes, because in the US, and I'm sure in Canada as well, a zip code or a postal code is a proxy for race. It's illegal to use those things in most advanced cultures. So we're very aware that just because something may be predictive doesn't mean that it's legal or ethical to use.

One of the things a lot of people have looked at is social media data. There are people who have worked on the idea that, well, I know Joe, and Joe's a crook, therefore I'm less trustworthy. You look at the social graph of a person. Or the idea that, oh, I know a bunch of really honest, wealthy people,
so I can repay my loan. Well, that's ridiculous. When I look at my linkages on a social network, would I loan most of them money? No. They're people who connected. So social media data has not proven valuable.

We did a study in Kenya that was really interesting, in that the lender had the Facebook graph of all their applicants and they were absolutely sure that the count of friends would be predictive of outcome. What we found was, actually, it is: if they have no friends, they're probably not a real person, or they're locked in their home and they're not going to pay you. If they had one or more friends, they were real, and there's no difference between having one friend and five hundred friends in the likelihood that you're going to repay. So it has some value in fraud mitigation; it has no value in lending. Also, I certainly would not want to explain to a regulator that I denied someone a loan because they know some shady people.

Somebody over here had... yeah.

Interesting that you bring that up. We're working on recording our loans in an Ethereum blockchain. The value of it really has nothing to do with AI; it's just a matter of being able to record things in a ledger that is permanent and immutable and that everybody can see. We want to put the loan data out there so that the consumer, the borrower, can see it, the regulator can see it, the investors can see it. Everybody has access to it, we can control it, and that's what blockchain does, so we will use it for that purpose.

Blockchain and AI, it's got a lot of problems at this point. If you know how Ethereum works, everything's got a delay: you run an update across the network, and so it's extraordinarily slow in terms of the kind of processing that we do. There's no way we can do the kind of stuff that we're doing in a blockchain as a distributed application. So you have to pick the purpose of it, but I think as a
ledger it's terrific.

Yes, sir? [Music]

I actually built a number of models from that Lending Club data set. The issue with Kaggle is, how to put this kindly, the model with Kaggle is that you find a bunch of really smart people around the world to form teams and solve problems, and they get a prize. I'm in the business of solving these problems for a fee, not a prize. So the problem with Kaggle is that Kaggle owns the competition and owns the work product, which now means that Google owns the work product, and I don't work for Google. But yes, it is very interesting to do the competitions and see how you score. I will say that we have some tools right now that out of the box will win a Kaggle competition. You can take H2O, for instance, and you can win a Kaggle competition with maybe another day's work put into optimizing the algorithm that it produced.

So, yeah. Okay, is that it? Oh, yeah, sorry.

I'll be really honest: at this point it's really not very important. The goal is to capture an understanding of why people need to borrow money, and then later on see how it correlates. We actually get really good information there. We had a woman who explained that she sprained her ankle and wasn't able to work for two weeks, and she lives paycheck to paycheck, and her boyfriend was covering her expenses, but the rent was due and there just wasn't enough money to go around. She had another week before she could go back to work, but she'd lined up an additional job. So here's somebody who on the surface looks like a really poor candidate, but when you get some more information about them you realize, well, no, they're not a poor candidate at all; this is someone who's worth taking a risk on. So we'll take the information that people put in, and eventually we'll have enough of that information, and we'll run some algorithms against it. We'll do sentiment analysis and topic extraction, and then we'll
have kind of a machine version of understanding, and we'll be able to determine what are good reasons for borrowing money from the standpoint of repayment.

Yeah, okay. Yes, we look at two things: credit files and bank accounts. In the absence of a credit file we can look at a bank account. In the absence of a bank account we can look at a credit card. In the absence of anything, we're not really set up to do a face-to-face analysis; we're not loaning enough money and we don't have the staff. Now, I work with lenders in Latin America, particularly in Mexico, where that's a very common situation. Somebody comes into a store and wants to buy a refrigerator, and they don't have a bank account and they certainly don't have a credit file. What the lender does is go to their house and interview the people, and they look at the house where the refrigerator is going to go, and they look at what they do for a living, and they go to work with them, and they make a determination. In a lot of the third world, that's the way it works. But Canada is a very heavily banked population, more so than the US. Pretty much everybody's got a bank account. They may not have a credit card, they may not have a credit file, but people here tend to have bank accounts.

All right. We have some interest on the part of an investor to put up the seed money. We need about 1.4 million to do an initial investigation, and that will qualify us for a Department of Health grant, so we could then get phase one and phase two grants, and that would carry us to a point where we could probably get something to market. Right now I've been focusing on lending, but I'm hoping to get back to the cancer genomics stuff.

Thank you. All right, thank you all very much. [Applause]
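
One point from the Q&A lends itself to a small worked example: the observation that two applicants with the same six-month inquiry count can have very different risk profiles depending on how those inquiries are spread over time. The sketch below is a hypothetical illustration in Python, not the speaker's feature set; the function name `inquiry_features` and the two derived features (a least-squares trend and the share of inquiries in the latest month) are assumptions chosen to make the contrast visible.

```python
def inquiry_features(monthly_counts):
    """Summarize a per-month inquiry series, ordered oldest to newest.

    Returns the plain total (what a summary count would show) alongside
    two time-series features: the least-squares slope of count against
    month index, and the share of all inquiries made in the latest month.
    """
    n = len(monthly_counts)
    if n == 0:
        raise ValueError("monthly_counts must not be empty")
    total = sum(monthly_counts)
    mean_x = (n - 1) / 2
    mean_y = total / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (c - mean_y) for i, c in enumerate(monthly_counts))
    slope = sxy / sxx if sxx else 0.0
    recent_share = monthly_counts[-1] / total if total else 0.0
    return {"total": total, "slope": slope, "recent_share": recent_share}


# The two cases from the talk: one inquiry per month versus six in the latest month.
steady = inquiry_features([1, 1, 1, 1, 1, 1])
burst = inquiry_features([0, 0, 0, 0, 0, 6])

print("steady:", steady)
print("burst: ", burst)
```

Both series have the same total, so a model fed only the count cannot tell them apart, while the slope and recent-share values differ sharply. Whether such features are actually predictive is an empirical question that would have to be tested on real repayment data.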
Info
Channel: Victor Panlilio Photography
Views: 4,493
Rating: 4.909091 out of 5
Keywords: credit risk, artificial intelligence, A.I.
Id: 13uomwJ0Fr8
Length: 43min 11sec (2591 seconds)
Published: Mon Jul 02 2018