Logistic Regression in Python | Gradient Descent | Data Science Interview Machine Learning Interview

Captions
Hey guys, welcome back to my channel. In today's video we will focus on a fundamental machine learning model: logistic regression. It appears in interviews frequently. Sometimes the interviewer asks you to write down the loss function or the likelihood function, and sometimes the interviewer may even ask you to code it up from scratch. Actually, the first algorithm an interviewer ever asked me to write down the equations for was logistic regression. In this video I will first give you an overview of the algorithm, followed by a detailed implementation. Then I will explain one optimization, mini-batch gradient descent, that is often used in practice. Now let's jump right into it.

Let's start with a fun fact about logistic regression: logistic regression is actually not a regression, it's a classification. Specifically, it's a binary classification. It is used to predict a binary outcome from a linear combination of variables, so there are two possible outcomes representing two different classes, class 0 and class 1. For example, we can use logistic regression to classify whether an email is spam or not based on the subject and the body of that email.

Now you know what logistic regression can do; the next question might be how it does it. To put it simply, there are only two steps. The first step is to get the probability of classifying a data point as class 1. It uses a function to project the linear combination of all features into a score between 0 and 1, representing the probability of being in class 1. This function is called the logistic function, or sigmoid function. It looks like a big S and maps any value into the range 0 to 1: positive numbers become high probabilities and negative numbers become low ones. The second step is to predict the class based on the probability we got in the previous step and the threshold we set. If the probability is larger than the threshold, the prediction is class 1; otherwise it's class 0. For example, if we set the threshold at 0.5, then whenever the probability is over 0.5 the prediction is class 1. I hope these two steps make sense to you.

Now let's go over it again with more details and some equations. We can use p(x; beta) to represent the probability that y, the dependent variable, is class 1 given x and beta. x represents the independent variables; in machine learning we call them features. The betas are the parameters. p(x; beta) equals the logistic function g applied to a linear combination of the features, so we can write p(x; beta) = g(beta_0 + beta_1 x_1 + ... + beta_n x_n), where the logistic function is g(z) = 1 / (1 + e^(-z)).

You see that we have introduced a bunch of betas. We need them to predict the outcome, but the betas are unknown. How do we get them? Typically we use a training data set to obtain the betas. Say there are a total of m training data points; each data point has n independent variables, from x_1 to x_n, and an observed class y. So there will be n + 1 betas, from beta_0 to beta_n, and we use the training process to obtain the values of all of them. A method called maximum likelihood estimation is often used to get the betas. Specifically, we use the betas, x, and y to formulate the likelihood of getting the observed classes, then we choose the betas that maximize that likelihood. In other words, we want to select the betas that maximize the probability of observing the data we actually observe. So let's first write the likelihood of getting the observed class. For each training data point we have a vector of features x_i and an observed class y_i. The probability of that class is either p(x_i) if y_i = 1, or 1 - p(x_i) if y_i = 0.
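To make the two steps concrete, here is a minimal sketch in Python of the sigmoid function and the two prediction steps described above. The function names and the NumPy usage are my own illustration, not code from the video.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_probability(x, beta0, betas):
    """Step 1: probability of class 1 from the linear combination of features."""
    return sigmoid(beta0 + np.dot(x, betas))

def predict_class(x, beta0, betas, threshold=0.5):
    """Step 2: compare the probability against the threshold."""
    return 1 if predict_probability(x, beta0, betas) > threshold else 0
```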
Recall that in a sequence of Bernoulli trials y_1 to y_m, each trial has its own success probability p_i. The likelihood function is then the product over all data points of p_i^(y_i) * (1 - p_i)^(1 - y_i). Note that this form involves the powers y_i and 1 - y_i, so we typically take the log of it to simplify the calculation: the log likelihood turns products into sums. Now that we have a function for the log likelihood, we simply need to choose the values of the betas that maximize it. Unfortunately, if we try to set the derivative equal to zero we will get frustrated, because there is no closed form for the maximum. So we take a different approach: using gradient descent to minimize the log loss function. Let me first show you the log loss function; it is actually just the negative of the log likelihood function. We then want to obtain the betas that minimize the log loss.

Typically we use gradient descent to reduce the log loss over multiple iterations. Intuitively speaking, we start with a random guess of the betas, then we compute the log loss associated with them. Next we get the gradient at each parameter, and these gradients are used to update the values of the betas. The gradient at a particular parameter is the partial derivative of the loss function with respect to that parameter. We repeat this step until the loss reaches its minimum value, in other words until the loss converges.

Now I will give you the form of the gradient at each parameter; for the purpose of this video we will not derive it step by step. If you are interested in learning how to derive it, feel free to check the link in the description for a detailed explanation. The gradient at beta_j equals (p_i minus y_i) times x_ij, averaged over all data points, where p_i is the predicted probability of class 1 and y_i is the observed class. You see we have introduced i and j here, but don't be confused: both of them are just indexes. i runs from 1 to m and indexes the data points, while j runs from 1 to n and indexes the features.

Okay, I have just given you an overview of logistic regression; let's see how to implement it. For the implementation we follow the main-plus-helper-function approach: the main function contains the main logic of the algorithm and leaves the details to be handled by helper functions. If you have watched other machine learning videos on my channel, you might already be familiar with the benefits of this approach. In general the code is clean and organized, so it is easy for others to understand; also, the code is modular, so changing one helper function will not have any impact on the other functions.

Now let's start with the main function. First we initialize the parameters. We differentiate beta_0 from the other betas because beta_0 has a different form of gradient than the other betas. We then use a helper function to compute the gradients at each beta. Lastly, we use another helper function to update the beta values using the gradients. We repeat these steps for the number of iterations we have specified. If you have watched the video on linear regression, you may notice that the main functions of logistic regression and linear regression are almost the same; if you know how to implement one of them, it is easy to figure out the other.
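Based on that description, a minimal Python sketch of the main function might look like the following. The helper names initialize_params, compute_gradients, and update_params are my own, matching the roles described in the video (they are sketched after the helper-function walkthrough below), and the learning rate and iteration count are assumed hyperparameters.

```python
def logistic_regression(X, y, learning_rate=0.01, iterations=1000):
    """Train logistic regression with batch gradient descent.

    X is an (m, n) array of features, y an (m,) array of 0/1 labels.
    Relies on the helper functions sketched further below.
    """
    m, n = X.shape
    beta0, betas = initialize_params(n)
    for _ in range(iterations):
        # Compute the gradient at beta0 and at every other beta, then take one step.
        grad_beta0, grad_betas = compute_gradients(X, y, beta0, betas)
        beta0, betas = update_params(beta0, betas, grad_beta0, grad_betas, learning_rate)
    return beta0, betas
```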
Now let's go over the helper functions. The first helper function initializes the parameters. It's very straightforward: we simply set the starting value of beta_0 to zero and the starting values of the other betas to random values.

The next function computes the gradients. We have already looked at the equation for the gradient at each parameter. We initialize the gradient at beta_0 to zero and the gradients at the other betas to a vector of zeros, then we loop through all the data points to accumulate the gradient contributed by each one. Inside the for loop, we first get the prediction using the logistic function, then we obtain the gradient at beta_0, which is simply the prediction minus y_i. The gradients at the other betas are the prediction minus y_i, times the j-th feature of the i-th data point. We accumulate the gradients from all the data points and normalize them by the number of data points m.

The last helper function updates the values of the betas based on the gradients we have obtained. One thing I'd like you to pay attention to is the sign when we apply the changes to the betas; it depends on how we calculate the loss. In this function we subtract the observation from the prediction, so if the prediction is overestimated the gradient is positive and we need to subtract the gradient from the betas. If we did it the other way, i.e. subtracting p_i from y_i, we would need to add the gradients to the betas.

Finally, let's look at the complexity of the implementation. Both the initialize-params and update-params functions loop through all the betas once, so both are O(n). The compute-gradients function goes through each data point and each feature, so its time complexity is O(mn). If the number of iterations is i, the overall time complexity is O(mni). In terms of space complexity, the only intermediate variable we create is the one storing the gradients, so it's O(n).

Now we have finished the implementation, and you could use it to do a classification task. But there is one potential problem I want to point out: the implementation can be inefficient for a big data set, and there is one optimization we can apply to make the gradient descent process more efficient. The gradient descent we have just implemented is batch gradient descent: it loops through the entire data set in order to make one step towards the target. This can be very slow when the data set is large, which can easily happen in real-world conditions where we need to deal with millions or even billions of records. In that case it takes a long time to loop through all the data points, or you might not even be able to fit the entire data set in the memory of a single machine, so we need to make the gradient descent process more efficient.

So how do we improve it? One method is called mini-batch gradient descent. It takes a random mini-batch from the entire data set and computes the gradients from it. In this way the data is much smaller, so it can fit into memory and computing the gradients becomes faster. The downside is that gradients obtained from a mini-batch tend to be noisier than the gradients from the whole data set; if a data set contains outliers, the noisy data may steer the gradient away from the optimal direction. But overall the parameters will still gradually progress towards the target through many iterations, so in practice mini-batch gradient descent is very commonly used: it is more practical to have an approximately optimal solution that can be computed in a relatively short amount of time.

Once you understand the concept, the implementation is pretty straightforward; in fact, there are only small changes we need to make to use mini-batch gradient descent. In the compute-gradients function we add another input parameter, the batch size, and each time we randomly sample a number of data points equal to the batch size; the gradients are then computed as the mean of the gradients contributed by those data points.

Awesome, guys, you have just learned how to implement logistic regression and how to use mini-batch gradient descent to make the training process more efficient.
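Here is a minimal sketch of the helper functions just described, plus the mini-batch variant of the gradient computation. It assumes the same hypothetical names used in the main-function sketch above, and the batch sampling is done with NumPy's random choice as one possible way to draw a random mini-batch.

```python
import numpy as np

def initialize_params(n):
    """beta0 starts at zero; the other betas start at random values."""
    return 0.0, np.random.rand(n)

def compute_gradients(X, y, beta0, betas):
    """Batch gradients: (prediction - y_i), times x_ij for beta_j, averaged over all m points."""
    m, n = X.shape
    grad_beta0, grad_betas = 0.0, np.zeros(n)
    for i in range(m):
        pred = 1.0 / (1.0 + np.exp(-(beta0 + np.dot(X[i], betas))))  # logistic function
        error = pred - y[i]              # prediction minus observation
        grad_beta0 += error              # gradient contribution at beta0
        grad_betas += error * X[i]       # gradient contribution at each other beta_j
    return grad_beta0 / m, grad_betas / m  # normalize by the number of data points

def update_params(beta0, betas, grad_beta0, grad_betas, learning_rate):
    """Subtract the gradients, because they were computed as prediction minus observation."""
    return beta0 - learning_rate * grad_beta0, betas - learning_rate * grad_betas

def compute_gradients_minibatch(X, y, beta0, betas, batch_size):
    """Mini-batch variant: sample batch_size points at random and average their gradients."""
    idx = np.random.choice(X.shape[0], size=batch_size, replace=False)
    return compute_gradients(X[idx], y[idx], beta0, betas)
```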
I hope you have learned something new in this video. Let me know if you have any questions. As always, guys, I appreciate you taking the time to watch this video, and I will see you in the next one.
Info
Channel: Data Interview Pro
Views: 5,720
Keywords: logistic regression, logistic regression in python, machine learning interview, machine learning interview questions, data science interview, data science interview questions, logistic regression implemention, logistic regression implementation in python
Id: gN79XvB7vTo
Length: 12min 50sec (770 seconds)
Published: Thu Apr 29 2021