Poisson Regression | Modelling Count Data | Statistical Models

Video Statistics and Information

Captions
Hi, in this video I am going to talk about what Poisson regression is and how to build a Poisson regression model. This technique is used for modelling count data, so we will see how Poisson regression can be used to model count data. If you haven't subscribed to our channel, please do.

In this session we'll first see the theory behind the Poisson regression model, and also where Poisson regression models are used: you can apply this modelling technique to a variety of real-world problems, and we will see which problems you can actually solve with it. Then we'll take a use case in R to understand how to use the technique in practice.

Let's start with the introduction. Poisson regression models are used to model count data; as I said on the first slide, we are interested in scenarios where the dependent variable, the target, is the count of something. What are examples of such cases? The number of visits to a website by a user; the count of awards won by a student; the count of deaths in a hospital in a year; the number of accidents; the number of calls to a call centre; the number of flu or typhoid cases in an area; the number of cars crossing a particular bridge or expressway. These are typical examples of a count-data dependent variable.

What are the features of a count-data dependent variable? First, a count is always non-negative: the count of something cannot be negative. Second, a count can be zero, and in this type of data we often have many zeros in the target variable. That is interesting, because if the dependent variable contains many zeros you need a special technique to model it. The count of deaths in a hospital, for instance, could be zero in a given time period; there could be no deaths at all, and in many cases that happens quite often. In a typical linear regression you would not expect so many zeros in the dependent variable. So the count has a lower bound of zero and can grow without limit, but it can never be negative. Keep this in mind while building the model, and every time you face a scenario like this.

We assume here that the dependent variable follows a Poisson distribution. You may be familiar with what a Poisson distribution is; if not, we will have a brief overview on the next slide. The motivation behind the Poisson regression model comes from the Poisson distribution: the general idea is that the target variable follows a Poisson distribution. One feature to keep in mind is that this distribution applies to discrete, non-negative data, and it is useful for modelling the count or the rate of something, meaning something happening at a given rate over time, say a count per month, or a number of calls per hour. Whenever you have a rate of something, you can use the Poisson distribution in those cases. I have given quite a number of examples already; others include modelling traffic and incident rates, and so on.
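The two features just described, that counts are never negative and that zeros can be common, are easy to see by simulation. Below is a small stdlib-only Python sketch (the video itself works in R; `poisson_sample` is my own illustrative helper using Knuth's multiplication method, not code from the video):

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(42)
draws = [poisson_sample(0.6, rng) for _ in range(10_000)]

# Counts are never negative, and with a small rate a large share of them is zero.
print(min(draws))  # 0
print(sum(d == 0 for d in draws) / len(draws))  # close to exp(-0.6), about 0.55
```

With a small rate such as 0.6 events per period, more than half of the simulated periods contain zero events, which is exactly the zero-heavy shape the lecturer describes.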
It is very useful in the finance industry, where you might model the rate of defaults in a given period of time, or other financial events, say the number of financial crises happening in a given period. It is also used in supply chain and operations management and in manufacturing, for instance to learn how many orders are coming in over a given period, and so on. So it is heavily application-oriented and has a lot of uses in real-world scenarios.

The probability mass function is the discrete version of the probability density function; when the distribution is discrete we call it a probability mass function, and it is given as follows. If X is a random variable that takes the count of something, say the count of deaths in a hospital, it can be 0, 1, 2, and so on, but never a negative number, and it is distributed as

P(X = x) = e^(−λ) λ^x / x!,  x = 0, 1, 2, …

I'll talk about what λ is in a moment. Remember, x is the count of something, the number of deaths or the number of calls, and it is always greater than or equal to zero; the factorial of a negative number is not defined, and that makes sense, because a count can never be negative.

The parameter we are interested in is λ (lambda). λ is the average number of occurrences of an event, here we are taking deaths but it could be any other event, in a specified time. In the deaths example, λ is the average number of deaths in a given time; similarly, the average number of calls in a given time is also a λ. The expected number of events over a duration can be found by multiplying λ by Δt, the duration under consideration. This is the basic theory behind the Poisson probability distribution and how it relates to modelling count data. I am not getting into the detailed theory, just a brief overview of what it is all about; we'll move ahead with an example and understand how to use it.

Here is the graphical view of the probability mass function. λ can take any positive value, and as you increase it the shape changes: for λ = 1 we have the red curve, for λ = 4 the green one, and for the largest λ the blue one. You can see that as you increase the value of λ, the Poisson distribution becomes more and more like a normal distribution; for large λ it approaches a normal distribution.

A few more features of the Poisson distribution can be listed. The expected value of X, the random variable taking the count, is the rate parameter λ; the variance is also exactly λ; hence mean equals variance. This is an important feature of the Poisson distribution: if in a dataset you see that the mean and the variance are more or less similar, you can think of using a Poisson distribution model.
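The PMF and the mean-equals-variance property described above can be checked numerically. Here is a minimal Python sketch (the function name `poisson_pmf` is my own; the video shows no code for this):

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam**x / x!  for a Poisson(lam) count."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 4.0
xs = range(60)  # truncate the infinite support; the tail beyond 60 is negligible here
probs = [poisson_pmf(x, lam) for x in xs]

mean = sum(x * p for x, p in zip(xs, probs))
var = sum((x - mean) ** 2 * p for x, p in zip(xs, probs))

print(round(sum(probs), 9))           # close to 1.0
print(round(mean, 6), round(var, 6))  # both close to 4.0, so mean equals variance
```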
You can use a Poisson distribution model instead of a normal distribution model in that case. There are of course other tests we perform before concluding that the data is Poisson-distributed, but this will give you a first indication: the idea is to check whether the mean and the variance are the same in the given data. One more thing to remember is that the Poisson distribution tends to become a normal distribution as λ becomes large.

So now we know what the Poisson distribution is, and that is exactly the motivation for the Poisson regression model. Any regression model has two ingredients: a target variable and a set of independent, or explanatory, variables. Basically, any regression model looks like this: the dependent variable is a function of several independent variables, and we are interested in finding the parameters of the function that relates the dependent variable to the set of independent variables.

Now here the dependent variable, as I have said, is the count of something, and a count cannot be negative; however, the explanatory side can take any value, including negative values. That is the typical problem: there is a restriction on the dependent variable, it cannot take negative values, but there is no restriction as such on the independent variables. That will create problems if you use ordinary least squares as in multiple linear regression: if you use it assuming that the error terms are normally distributed, the output is going to be problematic, because the assumptions are simply different here.

Hence we do one thing: we take the logarithm of the mean of y instead of y itself, and model that as a function of X. How does that help? Well, the logarithm of a positive number can take negative values. If I take the base-10 log of 10 to the power minus 2, then 10 to the power minus 2 is not a negative number, but its log is minus 2, which is a negative number; similarly, 0.002 is not a negative number, whereas the log of 0.002 is. So by taking the log on the dependent side we are getting rid of the restriction that was there before, the restriction that it cannot take a negative number. Now our left-hand side can take negative values, and of course there is no restriction on the independent variables. That is exactly why we use a log-linear model instead of a linear model: we have a logarithm taken on the left-hand side of the model.

Taking the logarithm of the mean of the dependent variable puts this in the class of generalized linear models. I am sure you are familiar with other generalized linear models; multiple linear regression and logistic regression are also part of this family, where the main difference between the models is in the assumed distribution; otherwise the estimation is more or less the same, and we use maximum likelihood for the estimation.

There are a few assumptions; I'll just introduce a few of them. The y values are counts; otherwise there is no reason to use Poisson regression, if it is not the count of something. The count must be non-negative, so you cannot have negative numbers. And it is a discrete distribution; that is one assumption.
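The point about the logarithm removing the sign restriction can be made concrete: whatever real value the linear predictor takes, the implied mean exp(α + βx) is strictly positive. A tiny Python illustration, with α and β as made-up values rather than estimates from any data:

```python
import math

# Illustrative coefficients for log(mu) = alpha + beta * x; these are made up,
# not estimates from any real dataset.
alpha, beta = -1.0, 0.5

def predicted_mean(x):
    return math.exp(alpha + beta * x)

# The linear predictor alpha + beta * x ranges over all real numbers,
# but the implied count mean is always strictly positive.
for x in (-10, -1, 0, 1, 10):
    assert predicted_mean(x) > 0
print(predicted_mean(-10))  # tiny, but still positive
```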
The count has to be a whole number, zero or greater. And the count should follow a Poisson distribution; that is one test you can do before even going ahead with the model, either briefly, by taking the mean and the variance, or by doing a statistical test of whether your data follows a Poisson distribution or not. So just take the mean and the variance and see whether they are very close to each other; if that is the case, you are good to go with the Poisson regression model.

A bit more theory. This is how the regression model looks:

log(μ) = α + βx

We have the log of the count's mean (you can call the response y, or whatever name you want to give it) as a function of a number of x variables, the independent variables, and we are interested in finding the parameters α and β, pretty much like in your linear regression models. But remember, this is not like a typical log-linear model in a multiple linear regression; the distribution itself is totally different, so don't confuse the two.

If you take the exponential to the right-hand side, you get μ, the mean of y:

μ = exp(α + βx) = exp(α) · exp(βx)

When you take the exponential of a sum you can write it this way, exp(α) multiplied by exp(βx); that is what we learned in mathematics: when values are added in the power, the result is the multiplication of the separate exponentials. Now this is interesting, because the normal way of interpreting the parameters will not be the same; what we learned in multiple linear regression about interpreting the β parameters is going to be totally different here.

So what is α, the intercept, and how do we interpret it? exp(α) is the effect on the mean of y. Remember this concept: it is the mean of y, not y itself; we are modelling the average occurrences. When x equals 0, exp(βx) becomes 1, which means the expected value of y is exactly e to the power α. That is the interpretation of the intercept.

How do we interpret exp(β), the exponential of the β associated with x? With every unit increase in x, the predictor has a multiplicative effect: the predicted mean is multiplied by exp(β). That means the effect is not only driven by the value of exp(β); it is also affected by the value of α. That is not the case in linear regression: if you have y = α + βx, then when x increases by one unit the corresponding increase in y is just β, with no relation to α. But here there is a relationship, because you are multiplying exp(β) by exp(α), so the marginal effect is not just dependent on the value of β; it also depends on the value of α. That is why it is different; in multiple linear regression it is only the β values that matter when explaining the marginal effect.

A few more interpretations; I won't get into the details, but briefly: if β equals 0, then exp(β) is 1, so the expected count of y and x are not related. Every time you have a β of 0, the model itself is of no use, because the explanatory variable and your dependent or target variable are not related at all, so there is no point in going ahead with the model.
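The multiplicative reading of exp(β) can be verified directly: the ratio of predicted means for any one-unit step in x always equals exp(β), wherever the step starts. A short Python sketch with illustrative parameter values:

```python
import math

# Illustrative parameters, not estimates from any fitted model.
alpha, beta = 0.2, 0.3

def mu(x):
    return math.exp(alpha + beta * x)

# A one-unit increase in x multiplies the expected count by exp(beta),
# no matter where the step starts: the effect is multiplicative, not additive.
ratio_low = mu(1) / mu(0)
ratio_high = mu(6) / mu(5)
print(math.isclose(ratio_low, math.exp(beta)) and
      math.isclose(ratio_high, math.exp(beta)))  # True
```

Note that the additive change mu(1) - mu(0) differs from mu(6) - mu(5), which is exactly why the linear-regression reading of β does not carry over.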
However, β can be greater than 0 or less than 0, and how do we interpret those cases? If exp(β) is greater than 1 (remember, we are talking about e to the power β being greater than 1), the expected count, the average of the dependent variable, is exp(β) times larger than when you have x equal to 0. It is slightly tricky to understand, but if you think for a while you can easily get it: we are comparing with the first condition, the one with x equal to 0; conditions two and three basically compare the effect when x is not 0 against that base case. Similarly, you can work out the case where exp(β) is less than 1. Slightly complicated, but if you think for a while you can always get it; otherwise there is no issue. One thing to remember is that there is a multiplicative effect: α and β together impact the dependent variable, in contrast to multiple linear regression, where they act additively and separately.

All right, so we'll take an implementation: we'll use a problem and solve it. Poisson regression is estimated using maximum likelihood, pretty much like most of the GLM models; maximum likelihood is the one technique used to find the estimates. We'll use the glm() function in R; it is available in the default stats package, so you can use it to build the model.

Here is an example that is on one of the websites; the link will be in the description. This is a case where we try to model the number of awards won by students given certain attributes. The two attributes given are the program to which the student has been admitted, and the student's score in mathematics. So our dependent variable y is the number of awards won by the student; the first explanatory variable, x1, is the program, which takes the values one, two, three (general, academic, and vocational); and the second explanatory variable is the score in mathematics. How do these two explanatory variables impact the average number of awards won by students? That is what we are interested in.

You import the dataset; it is hosted by the University of California, Los Angeles (UCLA) and is publicly available, so you can import it into your R session. There are three variables: the dependent one, the number of awards, and two independent variables, the score in math and the program. Math is a continuous predictor, which can take any value in a range, and the program the student is enrolled in is a categorical one; there are three categories, and we have coded 1 as general, 2 as academic, and 3 as vocational. You can code it any way you want; it is just a matter of how you define the variables. So that is how we have done the coding for the categories.

First you read the data; that is the first step in any modelling exercise. Then there are several steps you will follow: find the mean, the variance, the standard deviation, the minimum and maximum, and so on. One thing you might have seen in this data is that the number of awards, the dependent variable, is mostly zero, because most students would not have won a single award.
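For the categorical program variable, R's factor() handling amounts to dummy coding against one reference level. Here is a Python sketch of that coding with "general" as the base; the level names follow the example, but the helper function itself is mine:

```python
# Dummy-code the three program categories with "general" as the base level,
# mirroring what R's factor() plus glm() do behind the scenes.
LEVELS = ("general", "academic", "vocational")

def dummy_code(prog):
    """Return (is_academic, is_vocational); 'general' is the reference level."""
    if prog not in LEVELS:
        raise ValueError(f"unknown program: {prog}")
    return (1 if prog == "academic" else 0,
            1 if prog == "vocational" else 0)

print(dummy_code("general"))     # (0, 0)
print(dummy_code("academic"))    # (1, 0)
print(dummy_code("vocational"))  # (0, 1)
```

With three categories there are only two indicator columns; the base level is the case where both indicators are zero, which is why the fitted model reports two program coefficients rather than three.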
There are a few students who have won one or even several awards, though, so the data is very skewed; you won't find a bell curve in this case, you'll never find that, it is very skewed. Technically we call this the problem of dispersion; we'll talk about it at a later point in this session.

So we start with importing the data, and then we do exploratory data analysis to understand the distribution of the data, whether it actually fits a Poisson distribution: just look at the mean and the variance, and if we see that they are similar, that assures you that you are good to go with a Poisson regression model. Given that we have a categorical variable, we also do the factoring; it's pretty simple, much like you might have done in other R use cases. For those not familiar, use the factor() function and then do the leveling, just to make sure R understands your labelling properly, so that you can interpret the results well.

Once you have done all the work of analysing your data and are sure that you are going ahead with a Poisson regression model, call the glm() function and use the distribution family poisson (this is important) and provide the dataset. By using the poisson family of distributions, you tell R that although it is still the glm() function, we are trying to fit a Poisson regression model; this will fit a Poisson regression model to our data, and then we summarise it. You can do it in two steps; I have done it here in one step.

An important thing to remember, and I am not going to get into the details, is that the standard errors you automatically get from fitting the glm() model are actually not correct; that is a known issue, and there is an academic reason behind it. It may not be as important in practice, in a more applied setting, but theoretically, in an academic setup, there are issues with the standard errors when the estimation is done by maximum likelihood; it has to do with the assumed distribution, and I am not going to go into the details of it.

Hence you have to find what are known as robust standard errors, because the normal standard errors from the model will not be very suitable for Poisson regression. There are a couple of steps to follow, and you can always read more about what robust standard errors are; I am not going to explain them here, but you might have come across robust standard errors in other non-OLS models, and in fact on my channel I have talked about them quite a few times.

These are the steps; I have given you the code. Through it you can find the robust standard errors, and the lower and upper limits of each of these estimates; this code should be followed. I recommend that you actually go ahead and run through these steps in R to get a better feel for it.

So we have the results now: we run the model and we also find the robust standard errors. Let me explain them. This is the intercept; this is the coefficient for the math score; and these are the coefficients for the vocational and the academic programs. We had three factor levels, general, academic, and vocational, and estimates for academic and vocational are reported.
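The maximum-likelihood fit that glm(..., family = poisson) performs can be sketched from scratch. Below is a stdlib-only Python version for a single predictor, using Newton-Raphson with step-halving on the Poisson log-likelihood. This is an illustrative re-implementation on synthetic data, not the video's R code and not the UCLA awards dataset:

```python
import math

def log_lik(a, b, xs, ys):
    """Poisson log-likelihood of log(mu) = a + b*x, up to the constant -sum(log y!)."""
    return sum(y * (a + b * x) - math.exp(a + b * x) for x, y in zip(xs, ys))

def fit_poisson(xs, ys, iters=100):
    """Fit log(mu) = a + b*x by Newton-Raphson with step-halving."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        mus = [math.exp(a + b * x) for x in xs]
        ga = sum(y - m for y, m in zip(ys, mus))               # d loglik / d a
        gb = sum(x * (y - m) for x, y, m in zip(xs, ys, mus))  # d loglik / d b
        # Entries of the negative Hessian (2x2, symmetric).
        haa = sum(mus)
        hab = sum(x * m for x, m in zip(xs, mus))
        hbb = sum(x * x * m for x, m in zip(xs, mus))
        det = haa * hbb - hab * hab
        da = (hbb * ga - hab * gb) / det   # Newton direction: H^-1 * gradient
        db = (haa * gb - hab * ga) / det
        step, base = 1.0, log_lik(a, b, xs, ys)
        for _ in range(30):  # halve the step until the likelihood does not decrease
            if log_lik(a + step * da, b + step * db, xs, ys) >= base:
                break
            step /= 2
        a += step * da
        b += step * db
    return a, b

# Synthetic data built from known parameters a = 0.5, b = 0.2; the response is
# rounded to whole counts to keep this sketch deterministic and dependency-free.
xs = [x / 2 for x in range(20)]
ys = [round(math.exp(0.5 + 0.2 * x)) for x in xs]
a_hat, b_hat = fit_poisson(xs, ys)
print(round(a_hat, 2), round(b_hat, 2))  # roughly 0.5 and 0.2
```

Because the log-link is the canonical link for the Poisson family, this Newton iteration is the same as the iteratively reweighted least squares that glm() runs internally; the step-halving is just a safeguard against overshooting early on.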
These are given to us compared with the general level: as in any factoring with three categories, there will be two estimates and one base; the base here is general, and academic and vocational have been compared to that. So we have the intercept, the α and β values, the standard error, and the lower limit and upper limit of each estimate.

How do we interpret this, now that we have the results in place: the intercept, the slope coefficients for vocational and academic (the categorical variable levels), and the estimate for the continuous independent variable, the score in math? The coefficient of math is 0.07. What does it mean, and how do we interpret it? In a typical multiple linear regression we would have said that if the score in math is increased by one unit, the corresponding increase in the dependent variable is 0.07. That is not the case here. The way we interpret this is: if the score in math is increased by one unit, the corresponding increase in the expected log count (the log of the count of awards, not the count itself; remember, we are modelling the count of awards won by the students) is 0.07. It is slightly trickier compared to a multiple linear regression, and I highly recommend reading this particular statement carefully: the coefficient of math is 0.07, which means the expected log count of awards increases by 0.07 for a one-unit increase in math score.

Then there are the indicator variables for the programs, academic and vocational, with general as the comparison. Similarly, you can explain those values the way we interpret categorical variables: take one level as the base and compare the other estimates to the base estimate. So we compare how somebody enrolled in an academic program does in winning awards compared to the general program, and similarly for a vocational program compared to a general program. That is the way you interpret it.

One important thing: in multiple linear regression we always find what is known as R-squared. We have not talked about R-squared here, because R-squared tells you what percentage of the variation in the dependent variable is explained by the set of independent variables, and that is not available here: there is no R-squared-like quantity in a Poisson regression model, because the theory behind R-squared assumes a different distribution, which is not the case in Poisson regression. Yes, some people use a pseudo R-squared, pretty much like the pseudo R-squared in logistic regression, but it can in no way be explained the way you explain R-squared or adjusted R-squared in multiple linear regression.

So how do you measure the goodness of fit, the efficiency of the model? Since we are doing a predictive model, find the root mean squared error: that is one of the best ways. Build the model on training data and test it on a validation, test, or hold-out sample; that is one way of ensuring that your model is performing well or not. Also, when it comes to interpretability, look at the estimates, the signs of the estimates, and the strengths of the estimates, and you will get a feel for whether the model is actually performing well. So I would highly recommend looking at the root mean squared error as well.
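The hold-out RMSE idea mentioned above is straightforward to compute. A Python sketch with invented hold-out counts and predictions:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, a simple hold-out metric when R-squared is unavailable."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Toy hold-out counts versus model predictions (illustrative numbers only).
y_test = [0, 0, 1, 2, 0, 3]
y_pred = [0.2, 0.5, 1.1, 1.8, 0.4, 2.5]
print(round(rmse(y_test, y_pred), 3))  # 0.354
```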
A few important points to remember while building a Poisson regression model. Always check for over-dispersion, which means the variance is larger than the mean; it is often related to cases where you have excess zeros. Most of the students in our data sample have not won any awards, so there are a lot of zeros, and the data itself is so skewed that it is dominated by a single value. Hence you should always ensure that you have a sufficient set of independent variables that actually explain the outcome; you should not have omitted-variable bias, that is, you should not leave out important independent variables. You always need a large sample size, because the estimation you are doing is MLE, maximum likelihood estimation, and for that you need a large sample. The interpretation is very tricky, as we have already seen, both for interpreting the estimates and for interpreting the pseudo R-squared; it is different from what we normally do in multiple linear regression, so be careful about that, and remember that it cannot be interpreted in the same way as in an ordinary least squares multiple regression. And it is very useful in many real-world problems, as I have already talked about at the beginning of the session: there are a number of industrial applications, whether in operations, in finance, in supply chain, or in manufacturing, where you actually come across cases where your dependent variable is the count of something, and that is exactly where you should model it with a Poisson regression model. Thank you so much; please do subscribe to our channel, and you can also visit our website, the link is there in the description section. Thank you.
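A quick over-dispersion screen along the lines described in the closing points, comparing the sample variance to the sample mean, can be done as follows. The counts below are invented to mimic a zero-heavy sample, and the "well above 1" cutoff is only a rule of thumb:

```python
from statistics import mean, pvariance

def dispersion_ratio(counts):
    """Variance-to-mean ratio; values well above 1 hint at over-dispersion."""
    return pvariance(counts) / mean(counts)

# A zero-heavy sample like the awards data: variance far exceeds the mean.
counts = [0] * 12 + [1, 1, 2, 3, 5, 8]
print(round(dispersion_ratio(counts), 2))  # 4.09
```

A ratio of about 4 is well above the value of 1 that a Poisson variable would show, which is the signal to consider remedies such as quasi-Poisson or negative binomial models.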
Info
Channel: Analytics University
Views: 18,800
Keywords: poisson regression, robust regression, linear regression, non linear regression, multiple regression, bayesian regression, lasso regression, segmented regression, ridge regression, quantile regression, analytics, data science, supervised learning, data mining, spline regression, decision tree, clustring, count data model, panel data model, time series regression, principal component analysis, log linear model
Id: BknajRmh99I
Length: 38min 27sec (2307 seconds)
Published: Mon Jun 05 2017