- [Brandon] Hello, and welcome. Brandon here. Thanks for choosing my video. If you like the video
please give it a thumbs up. If you think someone you
know can also benefit from watching, please share. And as always, please subscribe. I appreciate it very much. So, let's go ahead and get started. So here we are in logistic regression, a very useful but, in my opinion, underutilized statistical procedure that is not all that intuitive, which is maybe why it's underutilized. Now, as with many of my other videos, we're gonna start out
with an actual problem. Now this problem is one that I made up, so I made up the text and the data for it. So just keep that in mind going forward. However, I do think it
has the side benefit of being potentially useful
in your everyday life. So let's go ahead and take a look at it. So we'll call it First-Time Home Buyer. So as a first-time home buyer you are busy organizing
your financial records so you can apply for a home mortgage. As part of this process you order a copy of your credit report to check for errors and gauge your credit score, which can range, at least here in the US, from 300 to 850. Now lenders will factor
in your credit score when deciding to approve
or not approve you for a mortgage. They will also factor in
things like your income, how long you've been at
your job and so on. But your credit score is
definitely an important part of their decision. Now it turns out your credit score is 720 on that scale of 300 to 850. Now while you're doing your research, which you are dutifully doing
as a potential home buyer, you find some raw data online. So there's data floating
around the web everywhere. And you are lucky enough to
come across a data set that has 1000 applicant credit scores and whether or not each
application for a home mortgage was approved, so yes or no. Now using the data you found, you would like to do the following. Number one, develop a
model that will provide the probability and the
odds of being approved for any given credit score. And again we will do all
of these as we progress throughout the video series. Number two, discover
approximately what credit score is associated with a 50% probability of being approved, so the odds are even. If I walk into the bank
with that credit score, it's basically like flipping a coin. My probability of being approved is 50%, which is another way of
saying the odds are even, or 1 to 1. We want to know what credit
score that is on our scale. Number three. Input your score of 720 into the model to determine the probability and the odds of you being approved for a mortgage, which of course is very important to you. And finally, determine how
improving your credit score from 720 to 750 would affect your probability and odds for being approved for the mortgage. So let's say your score is 720. You find that out. You're gonna wait a little bit and see if you can get your
credit score a bit higher, up to 750, by paying down some debt. Or maybe you know that you're
gonna get a promotion and a higher salary sometime
soon, or something like that. And you think that your score may improve. And you want to know how that
improvement in your score would affect your probability and odds of being approved for the mortgage. So here is just a little chunk of that 1000-observation data set. So there are only 15 here. But I just wanted you to
see how it's organized. So we have the credit score on the left and approved on the right. So again N is 1000. The credit score is the
applicant's credit score from 300 all the way up to 850. Now approved is coded
as a one for approved and zero for not approved. So it is binary. It is a dichotomous variable, and the two outcomes are mutually exclusive. So you're either approved
or you're not approved. There's no in-between. Now as a good analyst
and a good stat student or whatever it is you might be, you could create a scatterplot of your 1000 observations. But it looks like this. Now what is this? So if you look on the left-hand side we have approved and we
have zero at the bottom. So that means the
application was not approved. At the top we have a one. So that means the
application was approved. So all the data points fall on two horizontal lines. The credit score runs along the bottom. A FICO score is just a
certain type of credit score that's widely used. So if the dot is on the bottom that means for whatever
credit score that was, the application was not approved. If it's at the top it
means it was approved. Now how can we put a
best-fit regression line on a scatterplot that looks like this? It doesn't make any sense to do it the way we usually would in ordinary linear regression. So obviously we're gonna have to come up with some other technique. And that's what logistic
regression allows us to do. Now that we have set the
stage with the problem we're gonna look at, what
is logistic regression? Now logistic regression
seeks to do the following, among other things. It seeks to model the
probability of an event occurring depending on the values of
the independent variables, which can be categorical or numerical. In this case the only independent variable is credit score, which is numerical. So, model the probability
of an event occurring depending on the
independent variables. It seeks to estimate the probability that an event occurs for a
randomly selected observation versus the probability that
the event does not occur. So for a random observation in the data or some other observation
that we would want to predict we want to estimate the probability that the event occurs versus
that it does not occur. It seeks to predict the effect of a series of variables on a binary
response variable. So in this case we only have
one independent variable, credit score. But we could have more. So logistic regression can work a lot like multiple regression, with several independent variables and one dependent
variable that is binary, so zero or one. It also seeks to classify observations by estimating the probability
that an observation is in a particular category. In this case the applicant
is either in the approved category or they're in
the not approved category. So we can classify observations. So, model, estimate, predict and classify. Let's try to understand and visualize the problem we're working with. So in this case we have
a bunch of credit scores. So an applicant walks into the bank and may have some sort of credit score. Now the bank or other lending institution feeds that into their lending model. Their credit score goes into the model, and then when it comes
out it's either approved or it's not approved. So this black box in the middle is what we're trying to understand. So we could ask, what is the probability that an application having
a credit score of 670 would be approved? So it would end up in
the approved category up here on the top. So credit scores get put into some model, a decision model used by the
bank or other lender. And then the bank or the lender puts each application into either the approved or the not approved category. That's basically what
we're trying to model in this logistic-regression problem. Now I am kind of making the assumption that if you're studying
logistic regression you have to some extent studied
simple linear regression and multiple regression. Now if you studied those you might have a very good question. Why can't I use one of those
for this type of problem? Well, here's why. Number one, simple linear regression is one quantitative variable predicting another quantitative variable. Now in this case we have a
dichotomous dependent variable. So approve or not approve is one or zero. It's not a quantitative variable. Now multiple regression
is just simple regression with more independent variables. So those are basically
the same type of problem. Then we have nonlinear regression. That's still two quantitative variables, but the relationship between them is curvilinear. Now if we ignore those warnings and run a typical linear regression on this type of data anyway, we run into some major problems. First, binary data, in this
case approved or not approved, does not have a normal distribution, which is a condition
needed for most other types of regression, and you can see that by looking at the scatterplot. Second, the predicted values
of the dependent variable can fall below zero or above
one in those other types of regression. So remember, in logistic regression we're dealing with probabilities. And the rule of probability is that it has to be between zero and one. If we use the other types of regression, the predicted values can fall outside that range, which obviously is not going to work. And third, probabilities are often not linear. They can follow other shapes, such as a U shape, where
the probability is higher at both
extremes of the X values than it is in the middle. So you can probably think
of different examples. So one example could be the probability of contracting the flu. So the probability of
getting the flu is higher if you're younger, so a
baby or infant or toddler, and if you're older, so say in your 60s, 70s and 80s. The probability is higher at the extremes than it is in the middle. So probabilities often
take different shapes
across the values of the X variable. So now that we have set
the stage by introducing our problem and going over the basic conceptual foundation of
what logistic regression is, let's talk about where we're
going in the next video. So in the next video we
will do the following. We will review basic probability. So we won't go into much depth. We'll just go over the basics. Because obviously
understanding probability is central to learning
about logistic regression. We will learn about what odds are and what the odds ratio is. Because again that's
central to understanding logistic regression. We will briefly discuss how to interpret the odds ratio in a
logistic-regression context. And finally, we will note
things we have to keep in mind when interpreting the odds ratio. So the odds ratio is related
to probability, of course, but there are some dangers
in how we interpret it. And we'll definitely discuss
that in the next video. So let's go ahead and wrap up this video, and I will see you in the next one.
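
As a rough preview of the four tasks in the home-buyer problem, here is a minimal Python sketch of how a logistic model turns a credit score into a probability and into odds. Everything named here is an assumption made for illustration: the coefficients `B0` and `B1` are made up, not estimated from the 1000-observation data set, and the function names are hypothetical. The point is only the shape of the calculation: the logistic function keeps every prediction between zero and one, and even odds correspond to a probability of 50%.

```python
import math

# HYPOTHETICAL coefficients, chosen only for illustration.
# They are NOT estimated from the video's 1000-observation data set.
B0 = -13.0  # intercept (assumed)
B1 = 0.02   # change in log-odds per credit-score point (assumed)


def approval_probability(credit_score: float) -> float:
    """Logistic function: maps the log-odds to a probability between 0 and 1."""
    log_odds = B0 + B1 * credit_score
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_from_probability(p: float) -> float:
    """Odds = probability the event occurs / probability it does not occur."""
    return p / (1.0 - p)


def fifty_percent_score() -> float:
    """Score at which the log-odds are zero, i.e. probability 50% and even odds."""
    return -B0 / B1


if __name__ == "__main__":
    # Task 2: the score where approval is like flipping a coin.
    print(f"Even-odds score: {fifty_percent_score():.0f}")
    # Tasks 3 and 4: compare a score of 720 with an improved score of 750.
    for score in (720, 750):
        p = approval_probability(score)
        print(f"score={score}  probability={p:.3f}  odds={odds_from_probability(p):.2f}")
```

With real data the two coefficients would be estimated by fitting the model rather than typed in by hand, so the numbers this sketch prints say nothing about actual mortgage approvals.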