Data Science - Part XV - MARS, Logistic Regression, & Survival Analysis

Video Statistics and Information

Captions
Hello, and thank you for joining. My name is Derek Kane, and today we're going to get into the topics of multivariate adaptive regression splines (MARS), logistic regression, and survival analysis. This lecture is one in a broader series diving into data science, machine learning, statistics, and predictive analytics, so if you like it, please feel free to check out some of my other materials. The overview of topics for today includes a discussion of the extension of regression techniques, then the MARS algorithm, followed by an introduction to logistic regression, then survival analysis, and finally a practical application example where we do some crime prediction in the US. That practical example comes from my personal consulting work, so I'll get not only into the statistical and machine learning models, but also into some considerations in actually applying them in the real world.

In previous lectures we extensively reviewed the mechanics of multiple linear regression, including the various OLS assumptions, and we built on that framework by extending regression into ridge, lasso, and elastic net techniques. There are even more variations of linear regression, such as polynomial representations and other variants, which we have not fully explored. The purpose of this presentation is to expose us to further advanced regression topics, both linear and nonlinear, and continue to build on our regression knowledge base. When we talk about polynomial regression, the chart on the right-hand side shows that different polynomial functions give us curves of different shapes. These are still linear regression models, but how we arrange the formula can give us different insights. We didn't dive into polynomial regression much in my lectures; I just want to show you a different form.

In statistics, multivariate adaptive regression splines (MARS) is a form of regression analysis introduced by Jerome Friedman in 1991. MARS is a nonparametric regression technique and can be seen as an extension of linear models that automatically models nonlinearities and interactions between variables. The term "MARS" is trademarked and licensed to Salford Systems, so in order to avoid trademark infringement, many open-source implementations are called "earth" or some variation playing off the planet idea; for the purposes of this lecture we'll refer to the technique as MARS. Why use MARS models? MARS is ideal for users who prefer results in a form similar to traditional regression while capturing essential nonlinearities and interactions. The MARS approach to regression modeling effectively uncovers important data patterns and relationships that are difficult, if not impossible, for other regression methods to reveal. MARS builds its model by piecing together a series of straight lines, each allowed its own slope, which permits MARS to trace out any pattern detected in the data. The MARS model is designed to predict continuous numeric outcomes, such as the average monthly bill of a mobile phone customer or the amount that a shopper is expected to spend in a website visit.
MARS is also capable of producing high-quality probability models for a yes/no outcome. MARS performs variable selection, variable transformation, interaction detection, and self-testing, all automatically and at high speed. There are areas where MARS has exhibited high-performance results, including forecasting electricity demand for power generating companies, relating customer satisfaction scores to the engineering specifications of products, and presence/absence modeling in geographical information systems. MARS is a highly versatile regression technique and an indispensable tool in our data science toolkit.

This section introduces MARS using a few examples. We start with a data matrix of variables X and a vector of responses y, with a response for each row in X. For example, the data could look like the data set on the left, with a plot of the data on the right-hand side. Here there is only one independent variable, so the X matrix is just a single column. Given these measurements, we would like to build a model which predicts the expected y for a given x, so we build a simple linear regression, and the linear model for this data is y = -37 + 5.1x.

Before we dive into the mechanics of MARS and the other regressions, I want to take a quick moment for a refresher on OLS regression. If you want a more thorough breakdown of these techniques and the diagnostics used to measure the effectiveness of the models, please refer back to some of my previous tutorials, most notably the EDA lecture and the introduction to regression techniques. On the left-hand side we have a chart with various data points and a line fitting through them, represented by the formula below the graph. Not all the points touch the line; each observation has a specific error related to it, and we can see the error for a particular point in the graph on the right-hand side. If we square each error value and sum the totals, we reach the sum of squared errors. One way to think about it: squaring each error produces a geometric square, and if you add up the areas of all those squares, you reach the sum of squares. The OLS approach tries to minimize this error by producing the line with the lowest total sum of squared errors, the line whose total area of squares, if you will, is smallest.

The basic idea is similar with the MARS algorithm. Take the following chart as an example: we have data that is not spread in a truly linear way; there are bends and kinks in it. If we fit a simple linear regression model, it will create the line with the smallest squared error, and it might look something like this, with the basic error shown in the chart below. However, if we fit a straight line on only a small part of the data, a spline, we can measure the error related to that specific spline and that portion of the data. In the example on the right-hand side, just a small section is being modeled with a straight line, and the errors related to those data points are somewhat small because the line is fitting only a smaller section of the data.
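Before we continue stitching these segments together, here is a minimal sketch in R of the comparison being set up: one straight line fit by ordinary least squares to data with a bend in it, versus a line fit to only one section of that data. The data are made up purely for illustration; they are not from the lecture slides.

```r
# One OLS line over everything versus a line over just one section
set.seed(1)
x <- seq(0, 10, length.out = 60)
y <- ifelse(x < 5, 2 * x, 10 + 0.3 * (x - 5)) + rnorm(60, sd = 0.6)   # a kink at x = 5

fit_all  <- lm(y ~ x)                      # single straight line for all the data
sse_all  <- sum(residuals(fit_all)^2)      # the sum of squared errors OLS minimizes

left     <- x < 5                          # fit only the left-hand section
fit_left <- lm(y[left] ~ x[left])
sse_left <- sum(residuals(fit_left)^2)     # much smaller error on that section

c(sse_all = sse_all, sse_left = sse_left)
```

The single line's total squared error is much larger than the error of the line fit to just the left-hand section, and that gap is exactly what the MARS knots exploit.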
We then create a pivot point and add a second straight line connecting to the first, creating a knot in the model, and we do this for the next section of the model, and so on. Each of these lines has its own errors associated with it, and the resulting model will have a lower overall error term and can match the pattern of the data in a much more elegant manner. When we look at this particular chart and trace the shape of the line, we see that it fits the data very well, and when we compare it to the linear regression technique we find that it has a lower overall error.

In order to truly understand MARS models, we first have to understand the concept of a hinge function, which is the key building block of the MARS model. A hinge function marks the point where one linear regression line is shifted into a different linear regression line, and there are two forms: the hinge function itself and its mirrored counterpart. Hinge functions are the expressions starting with max, where max(a, b) is a if a is greater than b, else b. Hinge functions are also called hockey stick or rectifier functions, primarily due to the characteristic hockey-stick shape they take. A hinge function takes the form max(0, x - c) or max(0, c - x), where c is a constant called the knot; the figure on the right shows a mirrored pair of hinge functions with a knot at 3.1. One might assume that only piecewise linear functions can be formed from hinge functions, but hinge functions can be multiplied together to form nonlinear functions. A hinge function is zero for part of its range, so it can be used to partition the data into disjoint regions, each of which can be treated independently. For example, the mirrored pair of hinge functions in the expression below creates the piecewise linear graph shown for the simple MARS model on the left-hand side.

When we turn to MARS to automatically build a model that takes nonlinearities into account, the MARS software constructs a model from the given x and y as follows. In general there will be multiple independent variables, and the relationship between y and these variables will be unclear and not easily visible by plotting; we can use MARS to discover that nonlinear relationship. An example MARS expression with multiple variables is shown below. This expression models air pollution, where the ozone level is a function of the temperature and a few other variables. The figure on the right plots the predicted ozone as wind and visibility vary, with the other variables fixed at their median values; the figure shows that wind does not affect the ozone level unless visibility is low. We can see that MARS can build quite flexible regression surfaces by combining hinge functions.

It is useful to compare MARS to recursive partitioning, which is also commonly called regression trees, decision trees, or CART models; we did a series of lectures earlier on these classification and regression tree models. Some pros of MARS: MARS models are more flexible than linear regression models; they are simple to understand and interpret (compare the equation for ozone concentration to, say, the innards of a trained neural network, a random forest, or a support vector machine); and MARS can handle both continuous and categorical data.
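Before finishing the pros and cons, here is a minimal sketch of the hinge functions just described, along with a small MARS fit using the open-source earth package mentioned earlier. The built-in trees data set is used only as a stand-in example, not data from the lecture.

```r
# Hinge functions: a mirrored pair with a knot at 3.1, then a MARS fit via earth
h_right <- function(x, knot) pmax(0, x - knot)   # max(0, x - c)
h_left  <- function(x, knot) pmax(0, knot - x)   # max(0, c - x), the mirror image

x <- seq(0, 6, by = 0.1)
plot(x, h_right(x, 3.1), type = "l", ylab = "hinge value")   # hockey-stick shape
lines(x, h_left(x, 3.1), lty = 2)                            # mirrored pair

library(earth)                                   # install.packages("earth") if needed
fit <- earth(Volume ~ Girth + Height, data = trees)   # built-in trees data set
summary(fit)    # shows the selected hinge terms and their coefficients
```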
MARS also tends to be better than recursive partitioning for numeric data, because hinges are more appropriate for numeric variables than the piecewise constant segmentation used by recursive partitioning. Building MARS models often requires little or no data preparation; the hinge functions automatically partition the input data, so the effect of outliers is contained, and MARS models tend to have a good bias-variance tradeoff. There are also some cons worth noting with MARS models. Recursive partitioning is much faster than MARS. With MARS models, as with any nonparametric regression, parameter confidence intervals and other checks on the model cannot be calculated directly, unlike linear regression models, so cross-validation and related techniques must be used to validate the model instead. MARS models do not give as good a fit as boosted trees, but they can be built much more quickly and are more interpretable. Also, the earth, mda, and polspline implementations do not allow missing values in predictors, whereas free implementations of regression trees such as rpart and party in R do allow missing values using a technique called surrogate splits.

Now we're going to shift gears and move into the topic of logistic regression. When we revisit the classic OLS regression example from before, we see that the value of the regression line is continuous in nature; in other words, the regression line can range from negative infinity to positive infinity and is what we call unbounded. To introduce the idea behind logistic regression, take the following example: what if we wanted to use a linear regression model on a variable that is not continuous in nature? Say we want to predict a yes-or-no variable that we've encoded into a binary response, where a zero represents no and a one represents yes. When I personally think about binary response variables like yes or no, I immediately think about probabilities, even though this is subconscious. If you had talked to me 10 or 15 years ago, before I was really getting into the topics of data science, I would have thought about yes-or-no scenarios without thinking about the probabilities associated with them; but now that I'm thinking in more of a statistical mindset, a yes-or-no question is really just a spectrum of probabilities.

To build on this idea, take the following example. Imagine you ask yourself: would I like to get some pizza for lunch? Well, I really would like some pizza; this is common, and I think about it every day around 11:30 or noon. I then ponder how badly I would like pizza and begin to associate a probability that I'll order a slice for lunch. If I'm really hungry and craving mozzarella, the probability will be much higher and I'll say yes, the number 1 in this case. However, if I just ate a large sandwich, I'll be full and the probability will be much lower, so I'll probably say no, I don't want pizza, a value of zero. For example, after I eat my large sandwich, the probability that I will get pizza is less than 10%. At some point we have to say whether or not we will get up and purchase the pizza; there comes that critical decision time where we say, okay, I'm going to get it or not. Another way to put it: at what probability threshold do I go from not getting the pizza, a value of zero, to actually getting the pizza, a value of one? The general idea is that if the probability is 0.5 (50%) or below, we won't get the pizza, and if it's greater than 0.5, then let's get the pizza.
I use the example here of Two-Face, a supervillain in the DC Comics Batman universe. For all of his critical decisions he flips a two-faced coin, one side scratched and the other a normal face, which gives a probability of 0.5 of heads or tails, and depending on the outcome he says yes or no to whatever decision he's confronted with. He's a fascinating character, and the kinds of decisions Two-Face makes in the comic books are really remarkable; if you've seen Christopher Nolan's Dark Knight trilogy and watched Two-Face on screen, you can certainly see how this comes into play. This cutoff probability, this 0.5 threshold if you will, is a very interesting idea, and it's the basis for a lot of statistical theory, including the logit and probit models we'll get into in the upcoming slides.

If we were to create a sample plot with the outcome variable (yes or no) on the y-axis and the probabilities on the x-axis, we would see something like this chart: with probabilities ranging from zero to one on the x-axis, we see a lot of no decisions (responses of zero), and then at a certain threshold a shift to yes decisions. The blue line represents the probability of 0.5, the exact point where we shift from no to yes; this point is sometimes referred to as an activation function. If we construct a line to encapsulate this idea, it would look something like this. (I drew this shape in Excel and my lines were a little finicky, so there really shouldn't be that many curves on the edges; please bear with me.) The shape we're seeing is generally referred to as a sigmoid curve, which resembles an S pattern, and this S-curve pattern is a central concept in logistic regression. There are a couple of different forms of the sigmoid curve in logistic regression, called logit and probit models, although other link functions exist. We can see the subtle difference in the sigmoid shapes on the right-hand side; the difference in how they function is fairly minor, and a skilled statistician will know when to use one variant over the other. Both follow the S-shaped curve, but one is more pronounced than the other, and this shape is important when trying to decide the outcome of a binary response variable such as a yes or no. We will focus our efforts on the logit model in this presentation.

Logistic regression is used to predict the odds of being a case based on the values of the independent variables, or predictors. Because of the dichotomous nature (the zero or one) of the dependent variable y, a multiple linear regression model must be transformed in order to avoid violating statistical modeling assumptions. Here we see the transformation of a linear regression into a logistic regression; we're going to break this formula down in detail in the upcoming slides and get a little into the math for those who are interested.
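As a quick sketch of the two sigmoid shapes just mentioned: the logit link corresponds to the logistic CDF and the probit link to the standard normal CDF, and both can be drawn with base R.

```r
# Logit versus probit sigmoid curves
z <- seq(-4, 4, by = 0.05)
plot(z, plogis(z), type = "l", xlab = "linear predictor", ylab = "probability")
lines(z, pnorm(z), lty = 2)              # probit: standard normal CDF
abline(h = 0.5, col = "blue")            # the 0.5 decision threshold discussed above
legend("topleft", legend = c("logit", "probit"), lty = c(1, 2))
```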
It is necessary that logistic regression take the natural logarithm of the odds of the dependent variable being a case (referred to as the logit, or log odds) to create a continuous criterion as a transformed version of the dependent variable; the logit transformation is therefore referred to as the link function in logistic regression. Although the dependent variable in logistic regression is binomial, the logit is the continuous criterion upon which linear regression is conducted: the logit of success is fit to the predictors using linear regression analysis, and the predicted value of the logit is converted back into predicted odds via the inverse of the natural logarithm, namely the exponential function. Therefore, although the observed dependent variable in logistic regression is a zero-or-one variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a success, a case, or a one. In some applications, the odds are all that is needed.

The basic approach uses the following regression model. The odds that the event E occurs is defined in terms of p, where p takes a value between 0 and 1 and is the probability of the event; we can then define the odds function as shown, and the logit function expresses the mathematical relationship between the probability and the odds, as depicted on the right. The concepts of odds and odds ratios can sometimes be misinterpreted by non-statisticians, so to provide additional clarity we will use the following definitions; I think it's important to have a really firm understanding of what odds and odds ratios actually are. The odds is the ratio of the expected number of times an event will occur to the expected number of times it will not occur. The odds ratio for a binary variable (0 or 1) is the ratio of the odds for the outcome equal to 1 divided by the odds for the outcome equal to 0. The logit function is the log of the odds function, namely logit(E) = ln(odds(E)), or as follows. Based on the logistic model described before, we have the following formula, and it then follows that we can solve for the probability of an event. For our purposes we take the event E to be that the dependent variable Y has the value one; if Y takes only the values zero or one, we can think of E as success and the complement E' of E as failure. The odds ratio between two data elements in the sample is defined as follows, and using the notation P(x) for the probability of x, the log odds ratio of the estimates is defined as follows.

Although the logistic regression model logit(y) = α + βx looks similar to a simple linear regression model, the underlying distribution is binomial, and the parameters α and β cannot be estimated in the same way as for simple linear regression. Instead, the parameters are usually estimated using the method of maximum likelihood of observing the sample values: maximum likelihood provides the values of α and β which maximize the probability of obtaining the data set. The maximum likelihood estimate is the value of the parameter that makes the observed data most likely. Define P_i as the probability of observing whatever value of Y was actually observed for a given observation; for example, if the predicted probability of the event occurring for case i was 0.7 and the event did occur, then P_i = 0.7, whereas if the event did not occur, then P_i = 0.3. If the observations are independent, the likelihood equation is as follows.
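The slides show these expressions only as figures, so for reference, the standard forms of the quantities being described (the odds, the logit, the inverse transformation back to a probability, and the likelihood and log likelihood) are:

```latex
\mathrm{odds}(E) = \frac{p}{1-p}, \qquad
\mathrm{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \alpha + \beta x, \qquad
p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}

L(\alpha,\beta) = \prod_{i=1}^{n} P_i = \prod_{i=1}^{n} p_i^{\,y_i} (1-p_i)^{\,1-y_i}, \qquad
\ln L(\alpha,\beta) = \sum_{i=1}^{n} \bigl[\, y_i \ln p_i + (1-y_i)\ln(1-p_i) \,\bigr]
```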
The likelihood tends to be an incredibly small number, and it is generally easier to work with the log likelihood; taking logs, we obtain the log likelihood equation. The maximum likelihood estimates are those values of the parameters that make the observed data most likely, that is, the values which produce the largest value of the likelihood equation (as close to one as possible), which is equivalent to getting the log likelihood as close to zero as possible.

Now that we've talked a little about maximum likelihood estimators, I want to dive into some of their properties. The ML estimator is consistent: as the sample size grows large, the probability that the ML estimator differs from the true parameter by an arbitrarily small amount tends toward zero. The ML estimator is asymptotically efficient, which means that its variance is the smallest possible among consistent estimators. And the ML estimator is asymptotically normally distributed, which justifies various statistical tests. The picture on the right-hand side is of Ronald Fisher, who introduced the method of maximum likelihood to the statistics world in 1922. Fisher is one of the pioneers of many of the methods we use today, so he truly is a giant among statisticians.

Now we're going to shift gears and move into the topic of survival analysis. Survival analysis is a set of techniques for studying events of interest where the outcome variable is the time until the occurrence of the event. Effectively, survival analysis is the study of time, not in the sense of time series forecasting, but in the sense that the variable of interest is time itself. Survival analysis attempts to answer questions such as: what proportion of a population will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival?

Let's now get into a very brief history of survival analysis. The events of interest in the medical field typically represent the mortality rate for experimental drugs and medicines; the analysis generally produces a time frame until death, which is why the technique is referred to as survival analysis. An experimental medicine, among other considerations, is generally seen to be effective when survival is extended for the experimental group beyond the control group, and survival analysis is the technique which allows this determination to be made. So when you think of survival analysis, think of the effectiveness of experimental medicines: for cancer treatment drugs, how long will a person survive while on them? If you think of it that way, you will always understand the context of survival analysis. There are many other applications as well: reliability analysis in engineering, duration modeling in economics, event history analysis in sociology, criminological analysis, business applications such as customer lifetime and churn modeling, actuarial science and risk modeling for insurance applications, and uses in biology such as botany and zoology.
In order to successfully prepare a survival analysis, we must first introduce the concept of censoring. Let's imagine for a minute that we're trying to understand factors which can cause the lifespan of a human to be shortened. In our mock example we're going to collect data from 1980 all the way out to 2050, depicted on the right-hand side by a very simple graph with an arrow of time moving to the right and dashed blue lines marking the start and end of data collection for this cohort. When imagining our data set, we might think about variables that influence the lifespan of humans, such as diet, exercise, socioeconomic conditions, and so on. However, we also have to ask whether we have data for the full lifespan of a person: are we collecting data from the date of birth to the date of death for a particular subject? What if the subject is still alive after the study has concluded? This concept is called censoring, and we're going to get into it over the next couple of slides.

If we are collecting random samples of people to include in our study, it is entirely possible that we will find individuals who do not fit neatly within the walls of 1980 to 2050. Let's showcase this idea by representing the lives of individual subjects with arrows. Looking at the chart on the left-hand side, with walls at 1980 and 2050 and each arrow representing a unique individual, we can see that some people were born well before 1980 and died shortly after 1980, some were born around 2010 and will live well beyond 2050, and some were born in, say, 1983 and died in 2025, and so on. Individuals who fit within these walls are shown with a green arrow, and those that do not, the censored observations, are shown with either a red or an orange arrow. We can therefore think about censoring as a form of a missing data problem.

There are two types of censoring we should be aware of: right censoring and left censoring. Right censoring occurs for those subjects whose birth date is known but who are still alive when the study ends, the orange arrows on the right-hand side. If the subject's lifetime is known to be less than a certain duration, the lifetime is said to be left censored, the red arrows: they lived before the study began and do not make it through the entire study, so they die within the study window even though their origin predates the data collection. The data scientist will need to formulate a plan for how to treat censored observations during the exploratory data analysis: do we omit them from the study, decreasing our sample? How do we treat these observations? It's something we have to consider as we're collecting our data and moving into model building.

The survival function is the probability that the time of death, or of the event, is greater than some specified time; here is an example of a very simple survival function. A survival function is composed of the following: an underlying hazard function, which describes how the risk of death per unit time changes over time at baseline covariates, and the effect parameters, which describe how the hazard varies in response to the covariates.
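For reference, the standard definitions behind these components, with T denoting the time of the event, are:

```latex
S(t) = P(T > t), \qquad
\lambda(t) = \lim_{\Delta t \to 0} \frac{P\bigl(t \le T < t + \Delta t \mid T \ge t\bigr)}{\Delta t}, \qquad
S(t) = \exp\!\left(-\int_0^{t} \lambda(u)\, du\right)
```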
Usually one assumes S(0) = 1, or one hundred percent, although it could be less than one if there is a possibility of immediate death or failure. Essentially this says that at the start of the study, when no time has elapsed, one hundred percent of the population is still alive, or the event of interest has not yet occurred. The hazard function, conventionally denoted lambda, is defined as the event rate at time t conditional on survival until time t or later (that is, T ≥ t). The hazard function must be non-negative, and its integral over [0, ∞) must be infinite, but it is not otherwise constrained: it may be increasing or decreasing, non-monotonic, or discontinuous, as depicted on the right. The hazard function can take many shapes, and one very well-known shape is the bathtub curve, familiar in engineering circles, which is the blue line at the top of this chart. Hazard and survival functions are mathematically linked: by modeling the hazard, we obtain the survival function.

Here is an example of survival functions for individuals with different types of cancer. The x-axis is length of time and the y-axis is the survival probability, represented as a percentage or proportion. In this example all of the subjects were alive at the start of the study, so the survival function at time zero equals one hundred percent, as we discussed earlier. At year ten, approximately 20% of individuals with colon cancer survived, 40% survived with prostate cancer, and 65% survived with breast cancer. When I was first learning about survival analysis in grad school, I had a moment of clarity where the analysis started to make sense to me: when evaluating charts like the ones on the right-hand side for analytical insights, look for the separation between the various curves. If you see significant separation between the red and the blue line, or the blue and the green line, that's a clue that something is driving that difference. So when looking at survival functions in graphical form, take my advice and look for the separation; it will lead you in a good direction and may lead to key insights and ultimately a breakthrough.

We'll now get into the topic of the Cox proportional hazards model. Sir David Cox observed that if the proportional hazards assumption holds, or is assumed to hold, then it is possible to estimate the effect parameters without any consideration of the hazard function. Let Y_i denote the observed time (either censoring time or event time) for subject i, and let C_i be the indicator that the time corresponds to an event; then we have the expression below, which gives the hazard at time t for an individual with covariate vector (explanatory variables) X. Based on this hazard function, a partial likelihood can be constructed from the data set as follows, and the corresponding log partial likelihood is denoted below. This function can be maximized over beta to produce maximum partial likelihood estimates of the model parameters. The Cox proportional hazards model may also be specialized by changing the underlying baseline function, if a reason exists to assume that the baseline hazard follows a particular form.
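Here is a minimal sketch of both ideas in R using the survival package and its bundled lung cancer data set (a stand-in example, not the data behind the lecture's figures): Kaplan-Meier survival curves you can inspect for separation, and a Cox proportional hazards fit.

```r
# Survival curves and a Cox model on the survival package's lung data
library(survival)

# Kaplan-Meier survival curves, stratified by sex; look for separation
km <- survfit(Surv(time, status) ~ sex, data = lung)
plot(km, lty = 1:2, xlab = "days", ylab = "survival probability")

# Cox proportional hazards model: effect estimates without specifying
# the baseline hazard
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(cox)     # hazard ratios appear as exp(coef)
cox.zph(cox)     # check the proportional hazards assumption
```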
An example would be substituting the Weibull distribution; when we do this, we call the result the Weibull hazard function. What's interesting here is that we are now borrowing some of the concepts of the maximum likelihood estimation we used in logistic regression: these are partial likelihood estimates, but the same underlying concept that drives logistic regression is essentially driving survival analysis as well. The Cox proportional hazards model is the most common model used to determine the effects of covariates on survival. It is a semi-parametric model: the baseline hazard function is unspecified, the effects of the covariates are multiplicative, and it doesn't make arbitrary assumptions about the shape or form of the baseline hazard function. The model also has some key assumptions: the covariates multiply the hazard by some constant (for example, a drug may halve the subject's risk of death at any time), and that effect is the same at any point in time. Violating these assumptions can seriously invalidate your model, so you have to be very careful when building survival models to take these key assumptions into consideration.

This concludes our introduction to MARS, logistic regression, and survival analysis. We will now shift gears and focus on a practical example where we will be predicting crime in the United States. Being able to reasonably predict the individuals who are most likely to commit crimes is of significant value to the criminal justice system, specifically when the individual is in custody and a judge needs to decide whether or not they're at risk for recidivism. What recidivism means is this: I have committed a crime and I'm in custody; the judge is deciding whether to keep me in jail or release me back into society; if the judge releases me and I go back to my old ways and commit another crime, that second crime is considered an act of recidivism. Data science can help shed light on the underlying factors and, when used appropriately, can aid the judicial system in decision-making, particularly in pretrial. This improvement in decision-making reduces the cost of inmate housing and has a societal benefit by keeping low-risk offenders out of prison and high-risk offenders behind bars.

This case study is one from my consulting career which I personally prepared, from the data munging stage to the final predictive model. I had my hands on all steps throughout the process: I was in the trenches with very raw and dirty data, I had to get it into a shape I could work with, and then I built the models and derived the results from them. In addition to writing up my findings in an official report, I presented the techniques to a major US city's leadership committee, where they used the results in the following ways: to build a risk assessment system that is leveraged in pretrial decision-making for recidivism, and to refine the build plan for a multi-million dollar prison facility based upon the predictive model's results. Due to the sensitive nature of the work, I will be randomizing the data and results to fit the tutorial. In most of the tutorials I have shown, I share the R code or provide the data sets behind the scenes; in this case the data I'm showing isn't the actual data we used, but the results are fairly similar, and the approach is really what matters here. I will, however, go through some of the real-world considerations I had throughout the process and showcase the preliminary results of our logistic regression model.
When we talk about data science techniques, there's a lot of theoretical use of the models, but I want to highlight how we take these models into everyday use and some of the challenges and considerations I had throughout that process. The odds are high, pun intended, that if you are reviewing this material you have a strong interest in data science and know all of the latest and greatest machine learning techniques. However, I suspect that a substantial number of us data scientists could not explain most of these algorithms to business decision makers in an intuitive manner. For example, support vector machines with kernel functions and higher dimensions, or neural networks whose architecture simulates how the brain works in a mechanical sense, are very difficult and abstract concepts, and I think a large number of us would not be able to explain them in a way that a business decision maker can act on.

When I first started on this project, I was working under a former deputy mayor of major US cities, a gifted attorney and a Harvard professor. He was the project lead and I was the analytical support, the data scientist. He understood the value of advanced analytics and launched a series of big data initiatives during his tenure as deputy mayor in New York City. While preparing my plan with him, I was fairly shocked to realize the difference in understanding between the public and private sectors in regards to statistical modeling. What I mean by that is that in the private sector, most of the businesses I worked with were constantly innovating, driving toward the latest and greatest techniques, and very quick to adopt those technologies into practice, whereas in the public sector there are more layers within the hierarchies, which makes decision-making a little slower. When there is a revolutionary technique, it takes a while before it's fully integrated within a government entity. I had never been exposed to this before, because most of my work was in the private sector, so I was a little shocked when I came to understand how day-to-day business actually works in the public sector. The reality is that the government sector is slower to adopt technological advancements in machine learning.

He expressed that if we were to be successful in establishing the value of these techniques within the judicial system, we would need to emphasize interpretability of the algorithm over predictive performance. This was initially somewhat counterintuitive to me, because as data scientists we're trained to believe that we have to maximize predictive performance, but now I understand the rationale: if you can't explain what the algorithm is doing in layman's terms, the users, judges in this case, will become confused and begin to doubt the value of the technique. Once these judges and users are comfortable with the nuances of the machine learning algorithms and they are integrated into the judicial process, then we can expand the predictive framework, emphasizing predictive accuracy. So the idea is that we slowly introduce a model, make sure everybody is comfortable with how it actually works, and then, once they're comfortable with it, we begin to shift toward a model with higher predictive accuracy.
The trade-off in this case is that interpretability goes down: a neural network might be more accurate, but our ability to explain it in an intuitive manner suffers. This is ultimately why we settled on using a logistic regression to demonstrate how the tool can be built, and then on using a random forest model for higher predictive performance at the expense of interpretability. The entire approach was built around the strengths of logistic regression.

Before we dive into the details, let's take a moment to understand the data we're working with. The final cohort was constructed from various databases in the prison system, judicial system, and bail bond system, and my sample in this case was about 4,600. The data set was also pared down to ensure that there were no instances of right or left censoring, so the 4,600 reflects the removal of right- and left-censored observations. This was important to ensure that our sample was representative of the population at large; incidentally, it also means the data set would allow us to perform a proper survival analysis. Some variables were constructed around the expert opinions of criminologists and an extensive literature review: when deciding how to build up this data set, we performed an extensive review of the literature on the variables of interest related to criminal behavior. What we were able to find and validate is that there are certain risk factors supported by numerous studies, some of which are shown on the right: whether you've had a prior FTA, whether you've had a prior conviction, whether your present charge is a felony, whether you're unemployed, whether you have a history of drug abuse, and whether you have a pending case. These are all risk factors for an act of recidivism.

I also spent a large portion of my time just trying to understand how to connect the pieces of data across the various disparate information systems: the prison system, the courts, and the bail bond systems. I'd like to believe that I'm fairly proficient in SQL from my time in the business world building business intelligence technologies, and I'm very comfortable in a SQL Server environment. However, getting the data into model-ready form posed significant hurdles that business users rarely encounter. A lot of times when we're working with data sets, we can very easily connect the pieces: link two fields together and you're good to go. But these data sets, and the logic of how they were constructed, were very tricky to work with, and that surprised me. I thought I'd get extracts from these systems, work my SQL magic, and boom, I'd have the data in analysis-ready form in a day. The reality is that the data munging process took about three months of intensive on-site work; I had to keep asking the government employees what each piece of information was and how it related to the others, and it took a very long time just to get the data constructed. So for those of you feeling very comfortable and confident about working with data, I just wanted to share a little bit of wisdom from somebody who's been working with data for a long time: I wasn't ready for how difficult it actually turned out to be.
There were a total of 15 variables included as potential predictors, with the response variable being FTA, or failure to appear, a proxy for recidivism. To give a little context on the FTA: you were arrested for a crime and brought before a judge, and the judge decides to release you into society, saying, I don't think you're a risk, but your court date is in three months and you have to be there. Three months go by and the person who is supposed to come to court does not appear; that is a failure to appear, or FTA. We are treating this failure to appear as a proxy for recidivism; even though it's not a direct criminal act in and of itself (I would actually argue that it is), the fact that they didn't appear in court functions as a proxy for recidivism.

Here is the listing of the final variables used for the analysis and a brief description, in case you're interested in how these variables were constructed and some of the encoding methodology. The FTA definition we used is a dichotomous categorical variable, formally defined as follows: a zero is an individual who has not had a failure to appear in court between the time the defendant was released from jail and their initial court trial in the county, and a one is an individual who has had a failure to appear in that same window. The breakdown of the personal, property, and drug arrest types is as follows. Police departments use a whole series of codes when categorizing criminal activity, so what we're calling "personal" covers codes related to assault, battery, kidnapping, homicide, and sexual assault; "property" consists of larceny, robbery, burglary, arson, embezzlement, forgery, and receipt of stolen goods; and "drug" covers dealing, possession, prescription, and various drug types.

Now we'll move into some exploratory data analysis, just to get a feel for the data. The first thing we'll do is construct a correlation matrix, which provides a visual indication of the relative strength of the correlation between each pair of variables: a large red bubble depicts a more significant negative correlation, whereas a large blue bubble signifies a stronger positive correlation. Referencing the FTA row, this correlation matrix shows that the felony indicator, drug charge, and personal charge variables, all of which have red bubbles associated with them, may be potential predictor variables for our model. In general, there do not appear to be extremely strong correlations when we visually inspect the data. The next step in the overall approach involved automated variable selection procedures such as forward selection, backward selection, and stepwise selection; we covered these topics in the earlier EDA lecture, so if you're looking for a refresher on how the techniques work, please feel free to go back and check out those lectures.
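A rough sketch of these EDA steps in R, assuming a hypothetical data frame named court with the variables listed above and a binary fta response (both names are placeholders, not the real study data):

```r
# Correlation "bubble" plot and AIC-based variable selection on hypothetical data
library(corrplot)                                 # install.packages("corrplot") if needed
num_vars <- court[sapply(court, is.numeric)]      # numeric columns only
corrplot(cor(num_vars), method = "circle")        # red = negative, blue = positive

# Automated variable selection by AIC; direction can be "forward",
# "backward", or "both" (stepwise)
null_fit <- glm(fta ~ 1, data = court, family = binomial)
full_fit <- glm(fta ~ ., data = court, family = binomial)
forward  <- step(null_fit, scope = formula(full_fit), direction = "forward")
```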
The forward selection procedure produced the following output in R. Only variables which contributed to the reduction of the AIC statistic (lower is better in this case) were incorporated into the final model. The variables months at address, employment, non-white, and FTA ever were found to be statistically insignificant and did not improve the model's performance, and thus were removed from the model. A key point worth noting is that the final model we presented did include the employment and FTA-ever variables, despite the fact that they missed the p < 0.05 threshold here. When we think about model building, we're taught that if the p-value is less than 0.05 we can use the variable and otherwise it's disregarded; in this case, because we had subject matter expertise and knew that certain variables were historically important, we loosened these thresholds to accommodate that expertise within our model. So, statistical theory as taught in the classroom versus what is actually applied in the real world: you have to be willing to bend a little bit in order to address all of the subtleties of this particular problem. We're talking about human behavior, which is an inherently more random thing to evaluate, so any formal literature review that can aid in the discovery of these variables is logic we want to incorporate as much as possible.

The next step was to produce a binomial logistic regression model from the results; we utilized the glm function within base R. The diagnostics the model produced indicate that each of the variables included is statistically significant at a confidence level of p < 0.05. Remember, this isn't the final model we presented; it is just the result of our automated selection procedure. The charge class variable has been retained in the model due to its overall contribution, even though some of the classes (type B) are not statistically significant. A box plot shows that the Pearson residuals for the charge class variable are consistent, which indicates that the variable does not suffer from heteroscedasticity, where the variability of the FTA would be unequal across the different charge classes; this is an indication that the charge class variable can be retained in the model.

Now that we've spent some time building the model, let's take a look at the results. In our final algorithm, the probability of FTA equals the following, and if you remember from earlier in our logistic regression walkthrough, we can rearrange the formula to solve for the probability of an event, as shown in the image on the right-hand side; we're just restructuring the formula to emphasize the probability. There are a number of variables in this case, which I've encoded with some shorthand to make them a little easier to interpret. The probability of FTA represents the probability that an offender will have a failure to appear, based on the model's parameters. The default threshold is a probability of 0.5, or 50%, where a probability less than 0.5 produces a 0, no FTA, and a probability greater than or equal to 0.5 produces a 1, will commit an FTA. A discussion around calibrating the cutoff threshold will be addressed later.
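A minimal sketch of the glm step just described, again using the hypothetical court data frame; the variable names on the right-hand side are illustrative shorthand rather than the actual model specification:

```r
# Binomial logistic regression with base R's glm, then the default 0.5 cutoff
fit <- glm(fta ~ gender + age + charge_class + drug_charge + misdemeanor_count,
           family = binomial(link = "logit"), data = court)
summary(fit)                               # coefficient estimates and Wald tests

# Convert the fitted log-odds back to probabilities and apply the 0.5 threshold
pred_prob  <- predict(fit, type = "response")
pred_class <- ifelse(pred_prob >= 0.5, 1, 0)
```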
This probability of 0.5 that we just talked about is the trigger between a 0 and a 1, and we can flex it a little bit and play around with it. The probability calculation can also be evaluated as a potential risk score for a failure-to-appear outcome: if the probability generated for a particular case is 0.05, this implies that there is a 5% risk of an FTA outcome based upon the specified inputs, and in a moment we will play around with this idea as well.

The interpretations of the coefficients are not intuitive with a logistic regression model and require further explanation, but there is an alternative representation of the variables which can be used to help drive decision-making. This approach involves transforming the variables into odds ratios, which can then be interpreted in a more intuitive fashion; odds ratios are our friends here. Odds ratios are used to compare the relative odds of the occurrence of an FTA given exposure to the variable of interest, and the odds ratio can also be used to determine whether a particular exposure is a risk factor for a particular outcome. An odds ratio of one implies that the variable does not affect the odds of an FTA; an odds ratio greater than one indicates that the associated variable carries higher odds of an FTA; and an odds ratio less than one indicates that the associated variable carries lower odds of an FTA. The odds ratios can also be used to compare the magnitude of the various risk factors for that outcome.

These odds ratios can be interpreted in the following manner. Gender: an individual who is male has odds of a failure to appear that are 0.754 times those of a female, controlling for the other variables (holding their values constant); in other words, a female is more likely to have a failure to appear than a male. Age: for each year an individual ages, there is a 0.6 percent decrease in the odds that they will have a failure to appear, controlling for the other variables; the odds value being so close to one implies that age should not have a considerable impact in determining risk. Charge class of D: an individual whose case's most serious offense is a charge class of D has odds of failure to appear that are 2.8 times higher than those who do not, controlling for the other variables; so if you're arrested and your charge class is a D, which represents a particular type of crime, your odds of an FTA are 2.8 times higher. Drug charge: an individual with a drug charge has odds of an FTA that are 0.624 times those of someone who does not, controlling for the other variables; so if you are arrested on drug charges, what we are saying is that you're probably going to show up to court.

The odds ratios can be seen as indicators of an underlying risk of failure to appear. With this understanding, the variables which indicate the greatest risk of FTA include whether the most serious offense is a misdemeanor (with stronger odds than a Class C charge) and whether the individual committed a Class D felony, so we're able to glean this type of information from our model. For example, consider the Class D felony specifically: there is a substantial increase in the odds, 2.8 times, of a failure to appear when the offender has this lower classification of crime. This insight can be used by a judge when considering release of the offender and the costs associated with a failure to appear. If I'm the judge and somebody sitting in front of me has a Class D felony, then when I'm deciding whether to release this person back into society or retain them, the fact that the odds of a failure to appear are 2.8 times higher for this particular crime should be weighed in the general decision-making process.
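These odds ratios are simply the exponentiated coefficients of the fitted model; a quick sketch using the hypothetical fit object from the earlier glm block:

```r
# Odds ratios and Wald confidence intervals from the fitted logistic model
odds_ratios <- exp(cbind(OR = coef(fit), confint.default(fit)))
round(odds_ratios, 3)   # e.g. an OR of 2.8 means 2.8 times the odds of an FTA
```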
One additional insight is that for each case ID in which the most serious offense is a misdemeanor, the odds of a failure to appear increase by 2.1 percent; so if you have misdemeanors, your odds of a failure to appear increase slightly. Each case ID represents a specific instance of a crime, so one individual could have multiple crimes, and each crime would have its own unique case ID.

The model we specified with a cutoff of 0.5 correctly classified 69.07% of the instances and incorrectly classified 30.93%. The confusion matrix shows the various classification errors: this initial model has a specificity of 0.295 and a sensitivity of 0.901, values which describe the type 1 and type 2 errors. If you're looking for more background on confusion matrices and ROC charts, please refer back to some of the previous lectures. The predictive accuracy of the model can be visually represented through the ROC chart on the right-hand side; the area under the curve for the model is 0.685, which indicates that the model is 18.5 percent better at predicting failure to appear than randomly guessing the outcome. So by utilizing this model we get an 18.5 percent better prediction than just randomly guessing.

We also need to discuss the cost of an error. There is an underlying concern when producing predictive analytics models related to the cost of a misclassification, particularly with crime. If the model predicts with 69 percent accuracy, then it must be incorrect 31 percent of the time, and these classification errors represent real costs to the municipality. They could be the difference between releasing an individual back into society and then having an FTA, which costs the courts time and money, or, even worse, the released person commits a serious crime like murder. The other type of error would be that the county places an individual into the jail system when they would not have had an FTA in the first place; this also costs the county time and resources and needs to be considered as well. If the intention is to more uniformly balance the classification performance, the following approach can be utilized: we adjust the cutoff threshold from 0.5 (remember, below 0.5 is a zero and at or above 0.5 is a one) to a different probability threshold in order to balance the classification performance and the type 1 and type 2 errors. Additional calibration work from subject matter specialists was necessary before this technique could be used in practice. What is the cost of these classification errors, and how do we find a more cost-effective threshold? If we're going to be wrong in a prediction, do we want more people to be in the prison system, where the cost to the county is high, or more people to be on the streets? We have to weigh the real dollar costs associated with both outcomes. These questions became catalysts for additional follow-up research; they are questions criminologists are actively pursuing, and we subcontracted a pair of brilliant criminologists who specialize in this particular field to help calibrate our models.
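The accuracy, sensitivity, specificity, and AUC figures quoted above can be reproduced along these lines, assuming pred_prob and pred_class from the earlier sketch and the hypothetical court$fta response; the pROC package is one common choice for the ROC curve:

```r
# Confusion matrix at the 0.5 cutoff, plus the ROC curve and AUC
conf_mat <- table(observed = court$fta, predicted = pred_class)
conf_mat
accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)
sensitivity <- conf_mat["1", "1"] / sum(conf_mat["1", ])   # true positive rate
specificity <- conf_mat["0", "0"] / sum(conf_mat["0", ])   # true negative rate

library(pROC)                        # install.packages("pROC") if needed
roc_obj <- roc(court$fta, pred_prob)
plot(roc_obj)
auc(roc_obj)                         # e.g. roughly 0.685 in the lecture's model
```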
I would now like to take a moment to show you some techniques you can use when trying to rebalance this probability threshold, because I think they have some value. We explored two different techniques to further balance the performance, based upon the cost matrix being identical for each misclassification. Typically, when you have a cost matrix and you're balancing the types of errors, one error is weighted more punitively than the other; in this case we're saying the costs of the errors are identical, although in practice we actually used something different. The chart identifies the ideal balancing point for calibration of the probability threshold. The minimum difference threshold, or MDT, approach indicated adjusting the probability threshold from 0.5 to 0.71. This results in a decrease in correctly classified instances from the initial model and an increase in incorrectly classified instances: the MDT-balanced model has a specificity of 0.6443 and a sensitivity of 0.6429, and the confusion matrix is shown here. By changing the probability threshold from 0.5 to 0.71 we found that the overall predictive performance was lowered, because we're now predicting more zeros and fewer ones, but the rebalanced confusion matrix on the right-hand side shows a better balance of specificity and sensitivity, both hovering around 0.64, at the expense of overall predictive performance. This highlights some of the trade-offs we see when calibrating these types of models.

The maximized sum threshold, or MST, approach indicated adjusting the probability threshold from 0.5 to 0.691. This also results in a decrease in correctly classified instances from the initial model and an increase in incorrectly classified instances: the MST-balanced model has a specificity of 0.61 and a sensitivity of 0.67. The MST approach indicates a reduction in predictive performance from the initial model, but higher performance than the MDT-based approach. The change in the probability threshold balanced the specificity and sensitivity ratios more evenly for both the MST and MDT approaches than the initial model, and that's important. We've built our logistic regression and we feel comfortable with our parameter estimates, and when we're looking at ways to further improve performance, considering the cost of the errors in the overall approach matters; by balancing these thresholds we're able to mitigate some of these type 1 and type 2 errors.

This now takes us to the final results. If the intent is to use this model and the opportunity cost of a false positive is equal to the cost of a false negative classification, the MST approach provides a stronger mechanism to draw from the analytics for decision-making. The MST-calibrated model has a 65.6 percent predictive accuracy while more evenly distributing the classification errors. This model indicates that an additional 966 individuals would be predicted not to have an FTA and 613 would be classified as having an FTA, based upon the algorithm and the classification errors. This calibrated predictive model could be leveraged to potentially reduce the influx of defendants awaiting trial within the jail population by approximately 7.5 percent.
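To make the MDT and MST ideas concrete, here is one plausible reading of them in R: sweep candidate cutoffs, compute sensitivity and specificity at each, and pick the cutoff that either minimizes the gap between the two (MDT) or maximizes their sum (MST). The exact rules used in the consulting work are not shown in the lecture, so treat this purely as a sketch built on the earlier hypothetical objects.

```r
# Threshold sweep over candidate cutoffs, using pred_prob and court$fta
cutoffs <- seq(0.05, 0.95, by = 0.01)
perf <- t(sapply(cutoffs, function(k) {
  pred <- factor(as.integer(pred_prob >= k), levels = 0:1)
  obs  <- factor(court$fta, levels = 0:1)
  tab  <- table(obs, pred)
  c(cutoff      = k,
    sensitivity = tab["1", "1"] / sum(tab["1", ]),
    specificity = tab["0", "0"] / sum(tab["0", ]))
}))

# MDT: sensitivity and specificity closest together
perf[which.min(abs(perf[, "sensitivity"] - perf[, "specificity"])), ]
# MST: largest sum of sensitivity and specificity
perf[which.max(perf[, "sensitivity"] + perf[, "specificity"]), ]
```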
This model can then be used to extrapolate the total number of inmates within the prison complex, provided the logistic model is being used exclusively. The predictions were then applied to the existing prison expansion plan and used to drive down the cost of the new facility by reducing the number of beds by 130, saving the county 1.4 to 2.8 million dollars annually.

Let's now talk about how we can expand on this idea even further. The next phase of the predictive model deployment plan would be to introduce a computer-based risk assessment system. The basic concept, taking a 20,000-foot view of the overall approach, is as follows: a judge enters the defendant's ID into the system when deciding whether or not to keep the person in jail, when an FTA or recidivism is a concern. The system takes into consideration demographic variables (age, sex, education) as well as behavioral variables (the number of prior felonies, time at address, and so on), runs the algorithm behind the scenes, and produces a probability. The software assigns that probability to a specific risk category, and the risk category is shown to the judge. If we look at the table on the right-hand side, we see the probability ranging from 0 all the way to 100%, or 1. We can use this as a cutoff to trigger a 0 or a 1 in our logistic regression, but we can also think of this spread as a spectrum indicating risk: if the probability is on the lower end of the spectrum, there's a lower risk of an FTA, and if it's on the higher end, the risk is higher. So the idea is that the judge plugs some information into the system, clicks a button, and out pops a message: this person has a very high risk of an act of recidivism or an FTA. The judge then utilizes this information, together with their professional judgment, when deciding whether or not to release the individual into the population at large.

I can't stress enough that we developed these predictive analytics models to aid in decision-making. We're talking about human behavioral characteristics, and until the accuracy of these models reaches a certain threshold, there are soft considerations that have to be brought in, and that's what makes the judge so important to the overall process. What we're doing is providing an aid to the judge to help keep them on a certain course. They can completely disregard the algorithm if that's what they feel is right, but if they're looking to substantiate some of their gut feelings, we can draw from logistic regression and these types of techniques to indicate relative risk levels.
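As a final sketch of the risk-category idea, predicted probabilities can be binned into labeled risk bands rather than forced into a hard 0/1 call; the band boundaries below are illustrative only, not the ones used in the actual system:

```r
# Map predicted probabilities into labeled risk bands for the judge-facing view
risk_band <- cut(pred_prob,
                 breaks = c(0, 0.25, 0.5, 0.75, 1),
                 labels = c("low", "moderate", "elevated", "high"),
                 include.lowest = TRUE)
table(risk_band)
```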
Info
Channel: Derek Kane
Views: 17,332
Rating: 4.9741936 out of 5
Keywords: Logistic Regression, Statistics (Field Of Study), Regression Analysis, MARS, Splines, Survival Analysis, Machine Learning, Data Science, Predictive Analytics, MLE, Recidivism, Odds, data mining
Id: 17QbQF__9XM
Length: 82min 43sec (4963 seconds)
Published: Tue Jun 30 2015