Lecture60 (Data2Decision) Generalized Linear Modeling in R

Captions
Hello and welcome to Lecture 60 of my course, From Data to Decisions. I'm Chris Mack, your instructor, and in this lecture we are going to look at generalized linear modeling in R. To do generalized linear modeling we'll use a function called glm, an extension of the linear modeling function we've been using for a long time.

To start, let's pick some data, data that I've used before when demonstrating ordinary least-squares regression. I will set up a specific set of 11 numbers and plot them for you. Here's a graph of those numbers, and you can see that they are randomly spread around a linear trend. We've actually seen this data before: it's part of the Anscombe data sets that we've used to demonstrate some issues related to ordinary least-squares regression. You know how to model this in R using OLS. We use the lm function, linear model, with y as a function of x; in this case there's only one predictor variable. We run the model and print the summary. The summary shows me that the intercept is 3 and the slope is 0.5, and both of those parameters are significant from a t-test perspective.

Now, this is all review. You've seen all that a bunch of times; if you've used R for linear modeling, you know exactly what's going on. Let's look at a new routine called generalized linear modeling. This is what we talked about in our previous lecture. In generalized linear modeling you can specify one of a number of link functions and one of a number of probability distributions to describe the distribution of the residuals. Those distributions have some constraints: they have to be from the exponential family of distributions, but there's a lot that properly falls into that category.

What kinds of link functions and probability distributions are available? Here you specify the family, the distribution family, first and then the link function to go with it. Here are a number of family distributions that are possibilities. We're going to see that the binomial family is the one we'll use a lot with logistic regression, but we can have a good old Gaussian or normal distribution, a gamma distribution, a Poisson distribution, and others. For each distribution family there is a specific default link function, so if you don't specify the link function it reverts to this default. It's a kind of natural link function to use with that distribution, but it's not the only link function you can specify.

The Gaussian distribution is what we assume for an ordinary least-squares regression, and guess what, its link function is the identity function. So if we use generalized linear modeling with a Gaussian family and the identity link, we are doing exactly ordinary least-squares regression. Let's do that, just using the glm function rather than the lm function. You specify family equals gaussian and link equals identity. We run that model and ask for the summary. Let's look at the summary: we get that intercept of 3 and the slope of 0.5, and the standard errors are identical to what they were before. Let's double-check one: the standard error of the slope is 0.1179. Let's go up to our OLS; there it is, 0.1179. So the generalized linear modeling routine finds the maximum likelihood estimator, and the maximum likelihood estimator for a Gaussian distribution of residuals and an identity link function is nothing more than ordinary least-squares regression. That tells us a little bit about the mechanics, but of course we have no reason to use the glm function if we're going to use a Gaussian family and an identity link.

However, let's pick a different data set and try out a different kind of regression, in particular a logistic regression. The data set we're going to use is survival on the ship the Titanic. Everyone knows about the Titanic: it sank in 1912 and a large percentage of the people died. Something like 38 percent of the passengers survived; everyone else died. It was a very tragic accident, and a lot of people have studied this tragedy to try to understand what impacted survival. This data set is in a package, which I've already installed but you will need to install, called titanic. Once you install that package, you load the library, and it has data that we can try to understand. I'm going to use the data set called titanic_train. This is training data; there is also some testing data that you can compare to. We'll just look at the training data, and I'll stick all that data into a variable called data.raw.

Let's look at data.raw. The first column is called PassengerId, and it's nothing more than the passenger ID; we're not going to need that. Survived is an indicator variable, 0 or 1: 1 means that person survived and 0 means they didn't. Every row is a different person; in fact, this fourth column over here is the name of the person. We are of course not going to need the names, but we are going to need some of these other variables. Pclass is the passenger class: first class, second class, or third class. Sex is male or female. Age is the age of the passenger. Notice that here's an NA value. NA means not available: we don't know what that person's age is, so we mark it in a specific way, called NA. SibSp is how many siblings or spouses of yours are on that same ship at the same time; siblings and spouses are people of the same generation who are related to you. You see some people have zero, some people have three, etc. Parch is parent or child: do you have your parents on the same ship, or your children on the same ship, and if so how many? So zero, one, two, et cetera. Ticket is an identifier for the ticket number; we're not going to need that at all. Fare is how much they paid for the ticket. Cabin is what cabin they were in. That might be useful if we had some map of where all the cabins were, but we don't, so we're not going to use that. And then Embarked is which city you embarked from: Q is Queenstown, S is Southampton, and C is Cherbourg. We have these three different cities in which they joined the boat. That's the data in our Titanic data set.

The question is, can we develop a model to predict the probability of survival based on what passenger class you were in, whether you're male or female, how old you are, whether you had siblings or spouses on the boat, etc.? Now, one of the problems is that some of the data has blanks in it. Look at the cabin column here: we find that some of the cells are marked with NA, and other cells are not marked with NA, they're just blank. We want to make sure that every blank cell is marked in a way that R can recognize, and that's the NA marking. R has lots of routines and options for what to do with NA, but a blank is just another data point as far as R is concerned.

So let's first clean up our data. This is a good general lesson for modeling: the first thing you have to do is look at your data, and probably you have to clean it up. It's rare that you're going to get a data set that is absolutely perfect, exactly in the right shape to use to generate models, and here's one example. We're going to use this command to put the R value NA in place of a blank. We use data.raw == "", that is, equal equal blank. That looks through every single cell, and every time it finds a cell of data.raw that is identical to blank, this logical test is true. This is a logical equals: it's asking, is data.raw logically equal to blank? Whenever that's true, it executes this command, which is take an NA and stick it in that particular cell. As soon as I run this and go back and look at my data, you see that every place that was a blank here in the cabin column is now marked with an NA. We don't have any blanks left in our data; we only have not-available values, NAs, wherever no data is available.

To further our cleanup, let's find out how many not-available data points we have. We can do that in a couple of ways. The first way is with sapply, which applies a function to the data. I take data.raw, then a function of x, that is, a function of the columns, and I sum is.na(x). is.na asks the question, is it NA, not available? So I look through every single data point and ask, is this data point not available? If it is, then is.na becomes TRUE. The sum of that adds up all the TRUEs, it counts the number of TRUEs, so we can use sum as a way of counting when a particular condition inside the parentheses is true. So the function I apply is: count the number of NAs in every column, over the entire data. Let's run that and see what we get.

We have all the columns, and underneath every column it tells me how many NAs it found. PassengerId 0, Survived 0, Pclass 0, Name 0, Sex 0, Age 177. The next command shows that there are 891 data points; that's the length of one of the columns. So for 177 of those 891 passengers we don't have age information. That's going to be important, because if we want to use age as a variable in our regression model, we're basically going to throw away 177 passengers from the data set, and that's a choice we're going to have to decide whether we're willing to make. For cabin there are 687 NAs; we don't have a whole lot of information about the cabins, so we're definitely going to ignore cabin. And for embarked we have two passengers where we don't know where they embarked. We could probably live with that.

Now there's a choice for us: taking advantage of what we've just seen, decide which of the columns to use. I'm going to create a new data set, let me call it just plain old data, as a subset of the original. I use the command subset. It takes the original data, data.raw (it's not original anymore, since I've modified it by changing all the blanks to NAs), and then I select specific columns: the second, third, fifth, sixth, seventh, eighth, tenth, and twelfth columns. Those are all the columns I care about. You can go look at the data and see that this means I'm getting rid of the passenger ID, the name, the ticket, and the cabin. Those are all things that I know I'm not going to need to do my modeling work, so let me get rid of them. If I look at the data that results, you see I have everything but those things I threw away.

Now let's do some modeling. What kind of modeling should we do? Our output is binary data: either you survived, 1, or you didn't survive, 0. That means we're trying to predict the probability of survival. When we're trying to model a probability, one of the most popular ways of doing that is with logistic regression, as we've already talked about in our class. So let's use glm to do a logistic regression. To do that, set the family to be binomial. Binomial is the probability distribution of a binary variable, zeros or ones, and logit is the link function for a logistic regression. Suppose we wanted to look at the impact of sex, male or female, on your probability of surviving. Let's run the model. Done. Let's look at a summary of my model. You see I have an intercept and then... oh, hold on a second. Sex is a categorical variable; it's either male or female. How do we do a regression on that?

We need to convert our categorical variable into an indicator variable, and R will do that for you automatically. If you give glm a column of data that has categories like male and female, it will automatically convert those categories into the proper number of indicator variables. In this case I only have two different possibilities. If you want to make sure, use a command called levels. If I do levels of data.raw$Sex and ask how many levels there are, it doesn't like that, because this is not a factor. So I have to first convert it to a factor. What's the difference? As it is set up here, these are character strings. You can't really tell from looking at it, but it turns out that this is a character string called male and this is a character string called female, and when I give that to R for modeling it will automatically convert it into a factor with levels male and female. So if I use this factor command, it converts the character strings male and female into a factor, and then it counts the number of levels. Now that I've converted to a factor I can count the levels, and it says there are two levels: one's called female, the other male. That's pretty obvious when it comes to sex, but for other things, like say embarked, it may not be so obvious. How many levels of embarked are there? We can ask, and see that there's C, Q, and S; there are three levels for the variable embarked.

So when we do the modeling, it automatically converts male and female into an indicator variable, and it gives it the name Sexmale. R orders the factor levels alphabetically, so female became the reference level, 0, and male became 1. Then it takes that indicator variable called Sexmale and finds the best-fit coefficient, or slope, to be minus 2.5. Let's interpret what that means. Remember this is a logistic regression, so the response variable that we're predicting here is the log of pi over 1 minus pi, that is, the log of the probability of survival divided by the probability of dying on the Titanic. The probability of survival divided by the probability of dying is called the odds, the odds of surviving, and so the output of a logistic regression is the log odds. This coefficient for Sexmale of minus 2.5 means the log odds is decreased by 2.5 if you're a man compared to if you're a woman. In other words, you have a lower survival probability if you're a man compared to a woman.

If I want to convert that to just the odds, I take the exponential of that value. For example, I can do exp of model$coefficients, and it will exponentiate both of these coefficients, the intercept and Sexmale. Now the odds for a male are a factor of 0.08 times the odds for a female. This is what we call the odds ratio, the ratio of the odds for a man compared to the odds for a woman, and it's, you know, about a factor of 12 smaller for a male compared to a female. That's the kind of thing we can do with logistic regression, and that's the kind of thing we can do with generalized linear modeling. In our next lecture we're going to dive into logistic regression a lot more, and in fact dive into this Titanic data set in a lot more detail, to see some other interesting aspects of doing logistic regression. Till then.
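The first comparison in the lecture, lm against glm with a Gaussian family and identity link, can be sketched roughly as follows. The captions do not show the code or say which of the four Anscombe sets was used, so this sketch assumes the first one (columns x1 and y1 of the anscombe data frame that ships with base R); the variable names ols and gl are illustrative only.

```r
# Assumption: the 11 numbers in the lecture are Anscombe's first data set,
# available in base R as anscombe$x1 and anscombe$y1.
x <- anscombe$x1
y <- anscombe$y1

# Ordinary least-squares fit.
ols <- lm(y ~ x)

# Same model through glm: Gaussian family with the identity link.
gl <- glm(y ~ x, family = gaussian(link = "identity"))

# Coefficient tables (estimate, standard error, t value, p value) side by side.
print(summary(ols)$coefficients)
print(summary(gl)$coefficients)
```

Both tables should show an intercept near 3, a slope near 0.5, and a slope standard error near 0.1179, which is the agreement the lecture checks by eye.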
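The cleanup steps from the lecture (mark blanks as NA, count the NAs per column, keep a subset of columns) might look something like the sketch below. To keep it runnable without the titanic package, it uses a tiny made-up data frame whose values are invented purely for illustration; only the column names and the three commands follow the lecture, and the column positions passed to subset differ from the lecture's because this toy frame has fewer columns.

```r
# Hypothetical miniature stand-in for titanic_train (values invented).
data.raw <- data.frame(
  PassengerId = 1:6,
  Survived    = c(0, 1, 1, 1, 0, 0),
  Sex         = c("male", "female", "female", "female", "male", "male"),
  Age         = c(22, 38, 26, NA, 35, NA),
  Cabin       = c("", "C85", "", "C123", "", ""),
  Embarked    = c("S", "C", "S", "S", "", "Q"),
  stringsAsFactors = FALSE
)

# Mark every blank cell as NA so R treats it as missing.
data.raw[data.raw == ""] <- NA

# Count the NAs in each column: is.na() gives TRUE/FALSE, sum() counts the TRUEs.
na.counts <- sapply(data.raw, function(x) sum(is.na(x)))
print(na.counts)

# Keep only the columns wanted for modeling, selected by position
# (here: Survived, Sex, Age, Embarked).
data <- subset(data.raw, select = c(2, 3, 4, 6))
print(names(data))
```

With the real data the same sapply call is what reports 177 missing ages and 687 missing cabins in the lecture.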
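The logistic regression on sex and the conversion to an odds ratio can be sketched as below. So that it runs without the titanic package, the data frame is rebuilt from survival counts by sex; those counts (233 of 314 women and 109 of 577 men surviving) are an assumption about titanic_train that is worth checking against the package with table(), not something stated in the captions. The names model and odds are illustrative.

```r
# Assumed survival counts by sex for titanic_train (verify with the package):
# female: 233 survived, 81 died; male: 109 survived, 468 died.
data <- data.frame(
  Survived = c(rep(1, 233), rep(0, 81), rep(1, 109), rep(0, 468)),
  Sex      = c(rep("female", 314), rep("male", 577)),
  stringsAsFactors = FALSE
)

# A character column has no levels until it is converted to a factor.
levels(data$Sex)          # NULL
levels(factor(data$Sex))  # "female" "male" (alphabetical, so female is the reference)

# Logistic regression: binomial family with the logit link.
model <- glm(Survived ~ Sex, data = data, family = binomial(link = "logit"))
print(summary(model)$coefficients)

# Exponentiate the log-odds coefficients: the intercept becomes the odds of
# survival for women, and Sexmale becomes the male-to-female odds ratio.
odds <- exp(model$coefficients)
print(odds)
```

Under these assumed counts the Sexmale coefficient comes out near minus 2.5 and its exponential near 0.08, in line with the numbers quoted in the lecture.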
Info
Channel: Chris Mack
Views: 41,366
Rating: 4.9389977 out of 5
Keywords: statistics, data analysis, linear regression
Id: kffIgjHxdpw
Length: 21min 14sec (1274 seconds)
Published: Mon Nov 07 2016