Statistics 101: Logistic Regression, An Introduction

Video Statistics and Information

Captions
- [Brandon] Hello, and welcome. Brandon here. Thanks for choosing my video. If you like the video please give it a thumbs up. If you think someone you know can also benefit by watching, please share. And as always, please subscribe. I appreciate it very much. So, let's go ahead and get started. So here we are in logistic regression, a very useful if, in my opinion, underutilized statistical procedure that is not all that intuitive, which is maybe why it's underutilized. Now as with many of my other videos we're gonna start out with an actual problem. Now this problem is one that I made up, so I made up the text and the data for it. So just keep that in mind going forward. However, I do think it has the side benefit of being potentially useful in your everyday life. So let's go ahead and take a look at it. So we'll call it First-Time Home Buyer. So as a first-time home buyer you are busy organizing your financial records so you can apply for a home mortgage. As part of this process you order a copy of your credit report to check for errors and gauge your credit score, which can range, at least here in the US, from 300 to 850. Now lenders will factor in your credit score when deciding to approve or not approve you for a mortgage. They will factor in other things like your income, how long you've been at your job and other things. But your credit score is definitely an important part of their decision. Now it turns out your credit score is 720 on that scale of 300 to 850. Now while you're doing your research, which you are dutifully doing as a potential home buyer, you find some raw data online. So there's data floating around the web everywhere. So you were lucky enough to come across this data set that has 1000 applicant credit scores and whether or not the application was approved, so yes or no, for the home mortgage. Now using the data you found, you would like to do the following. 
Number one, develop a model that will provide the probability and the odds of being approved for any given credit score. And again we will do all of these as we progress throughout the video series. Number two, discover approximately what credit score is associated with a probability of 50%. So the odds are even for being approved. If I walk into the bank with a certain credit score it's basically like flipping a coin. My probability is 50% of being approved, which is the same way of saying the odds are even. We want to know what credit score that is on our scale. Number three. Input your score of 720 into the model to determine the probability and the odds of you being approved for a mortgage, which of course is very important to you. And finally, determine how improving your credit score from 720 to 750 would affect your probability and odds for being approved for the mortgage. So let's say your score is 720. You find that out. You're gonna wait a little bit and see if you can get your credit score a bit higher up to 750 by paying down some debt, maybe you know that you're gonna get a promotion and a higher salary sometime soon, or something like that. And you think that your score may improve. And you want to know how that improvement in your score would affect your probability and odds of being approved for the mortgage. So here is just a little chunk of that 1000-observation data set. So there are only 15 here. But I just wanted you to see how it's organized. So we have the credit score on the left and approved on the right. So again N is 1000. The credit score is the applicant's credit score from 300 all the way up to 850. Now approved is coded as a one for approved and zero for not approved. So it is binary. It is a dichotomous variable, and it is mutually exclusive. So you're either approved or you're not approved. There's no in-between. 
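As a rough sketch of the four tasks listed above, the Python below shows how a fitted logistic model could answer each one. The video has not fitted a model at this point, so the coefficients `B0` and `B1` are hypothetical values chosen only for illustration, and the function names are made up for this example. The results do not reflect the 1000-observation data set.

```python
import math

# Hypothetical coefficients, assumed purely for illustration.
# They are NOT estimates from the video's data set.
B0 = -9.0     # intercept (assumed)
B1 = 0.0135   # change in log-odds per credit-score point (assumed)


def approval_probability(score, b0=B0, b1=B1):
    """Task 1: probability of approval for a given credit score."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * score)))


def odds(p):
    """Task 1: convert a probability into odds in favor of the event."""
    return p / (1.0 - p)


def even_odds_score(b0=B0, b1=B1):
    """Task 2: score where the log-odds are zero, so the probability is 50%."""
    return -b0 / b1


# Task 3: plug in the score of 720.
p720 = approval_probability(720)
# Task 4: compare with an improved score of 750.
p750 = approval_probability(750)

print(f"even-odds score: {even_odds_score():.0f}")
print(f"720: probability={p720:.3f}, odds={odds(p720):.2f}")
print(f"750: probability={p750:.3f}, odds={odds(p750):.2f}")
```

With real data, `B0` and `B1` would come from fitting the model to the credit scores and the zero/one approved column, and everything else would stay the same.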
Now as a good analyst and a good stat student or whatever it is you might be, you could create a scatterplot of your 1000 observations. But it looks like this. Now what is this? So if you look on the left-hand side we have approved, and we have zero at the bottom. So that means the application was not approved. At the top we have a one. So that means the application was approved. But we have the data points in two lines. So the credit score is along the bottom, and FICOscore is just a certain type of credit score that's widely used. So if the dot is on the bottom, that means for whatever credit score that was, the application was not approved. If it's at the top it means it was approved. Now how can we put a best-fit regression line on a scatterplot that looks like this? It doesn't make any sense to do it how we would usually do it in normal linear regression. So obviously we're gonna have to come up with some other technique. And that's what logistic regression allows us to do. Now that we have set the stage with the problem we're gonna look at, what is logistic regression? Now logistic regression seeks to do the following, among other things. It seeks to model the probability of an event occurring depending on the values of the independent variables, which can be categorical or numerical. In this case that's credit score. It seeks to estimate the probability that an event occurs for a randomly selected observation versus the probability that the event does not occur. So for a random observation in the data, or some other observation that we would want to predict, we want to estimate the probability that the event occurs versus that it does not occur. It seeks to predict the effect of a series of variables on a binary response variable. So in this case we only have one independent variable, credit score. But we could have more. 
So logistic regression can work a lot like multiple regression with several independent variables and the one dependent variable that is binary, so zero or one. You can also seek to classify observations by estimating the probability that an observation is in a particular category. In this case the applicant is either in the approved category or they're in the not approved category. So we can classify observations. So, model, estimate, predict and classify. Let's try to understand and visualize the problem we're working with. So in this case we have a bunch of credit scores. So an applicant walks into the bank and may have some sort of credit score. Now the bank or other lending institution feeds that into their lending model. Their credit score goes into the model, and then when it comes out it's either approved or it's not approved. So this black box in the middle is what we're trying to understand. So we could ask, what is the probability that an application having a credit score of 670 would be approved? So it would end up in the approved category up here on the top. So credit scores get put into some model, a decision model by the bank or other lender. And then the bank or the lender puts that application into the approved or non-approved categories. That's basically what we're trying to model in this logistic-regression problem. Now I am kind of making the assumption that if you're studying logistic regression you have to some extent studied simple linear regression and multiple regression. Now if you studied those you might have a very good question. Why can't I use one of those for this type of problem? Well, here's why. Number one, simple linear regression is one quantitative variable predicting another quantitative variable. Now in this case we have a dichotomous dependent variable. So approve or not approve is one or zero. It's not a quantitative variable. Now multiple regression is just simple regression with more independent variables. 
So those are basically the same type of problem. Then we have nonlinear regression. That's still two quantitative variables, but the data is curvilinear. Now if we ignore those warnings and run a typical linear regression on this type of data, we run into some major problems. Now binary data, in this case approve or not approve, does not have a normal distribution, which is a condition needed for most other types of regression, and you can see that by looking at the scatterplot. The predicted values of a dependent variable can be beyond zero and one in those other types of regression. So remember, in logistic regression we're dealing with probabilities. And the rule of probability is that it has to be between zero and one. If we use the other types of regression, the values can be beyond zero and one, which obviously is not going to work. And probabilities are often not linear. They can take shapes such as a U, where the probability is very low or very high at the extremes of the X values. So you can probably think of different examples. So one example could be the probability of contracting the flu. So the probability of getting the flu is higher if you're younger, so a baby or infant or toddler, and if you're older, so say in your 60s, 70s and 80s. The probability is higher in the extremes than it is in the middle. So probabilities often have different shapes in their distribution along the X variables. So now that we have set the stage by introducing our problem and going over the basic conceptual foundation of what logistic regression is, let's talk about where we're going in the next video. So in the next video we will do the following. We will review basic probability. So we won't go into much depth. We'll just go over the basics, because obviously understanding probability is central to learning about logistic regression. We will learn about what odds are and what the odds ratio is, because again that's central to understanding logistic regression. 
We will briefly discuss how to interpret the odds ratio in a logistic-regression context. And finally, we will note things we have to keep in mind when interpreting the odds ratio. So the odds ratio is related to probability, of course, but there are some dangers in how we interpret it. And we'll definitely discuss that in the next video. So let's go ahead and wrap up this video, and I will see you in the next one.
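To make the earlier point about predicted values going beyond zero and one concrete, here is a minimal sketch that fits an ordinary least-squares line to zero/one approval data. The ten observations are made up for this example and are not the video's data set, and the helper `fit_line` is a hypothetical name.

```python
# Made-up toy data for illustration only (not the video's 1000 observations).
scores = [400, 450, 500, 550, 600, 650, 700, 750, 800, 850]
approved = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]  # 1 = approved, 0 = not approved


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(scores, approved)

# Predict at the low end, the middle, and the high end of the score range.
for score in (300, 575, 850):
    predicted = intercept + slope * score
    print(f"score {score}: linear prediction = {predicted:.3f}")
```

On this toy data the straight line predicts a value below zero at a score of 300 and a value above one at a score of 850, neither of which can be read as a probability.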
Info
Channel: Brandon Foltz
Views: 559,953
Rating: 4.9253268 out of 5
Keywords: statistics 101 logistic regression, logistic regression basics, logistic regression tutorial, logistic regression brandon foltz, logit regression, logistic regression analysis, logistic regression model, logistics regression, simple logistic regression, logistic regression explained, logistic regression, binary logistic regression, logistic regression for dummies, logistic regression for beginners, What is logistic regression, logistic regression example, Regression analysis
Id: zAULhNrnuL4
Length: 11min 25sec (685 seconds)
Published: Sun Mar 08 2015