Linear Regression in Python | Gradient Descent | Data Science Interview Machine Learning Interview

Captions
Hey guys, welcome back to my channel. In today's video I want to talk about a machine learning model, or a statistical model, you might be very familiar with, and that is linear regression. The reason I chose to talk about linear regression is that it appears in machine learning interviews very frequently, where the interviewer asks you to implement it from scratch. So in this video we will mainly focus on the implementation part: I will explain what gradient descent is and how to implement it correctly. Once you finish watching this video you will not only have a good understanding of linear regression but also be able to code it up from scratch. If you are interested in learning, then keep on watching.

I guess you already have some basic knowledge of linear regression; in case you don't, I am going to give you a brief overview. Feel free to skip this part if you want to dive into the implementation directly. So what exactly is linear regression? At a high level, you can think of linear regression as the task of fitting a straight line through a set of points. In the simplest case we consider a point as two-dimensional: it has x and y values. The straight line is important here; that's what "linear" in linear regression refers to. It means that we assume the true underlying relationship between x and y is linear. This is an important assumption: if the underlying relationship is non-linear, we should not use linear regression to model the data. Now let me show you two examples. In the left plot it's clear that the relationship between x and y can be considered linear, while in the right plot assuming x and y have a linear relationship would not make much sense. So we need to make sure our assumption is correct before using it. In reality, linear regression is very commonly used due to its simplicity and interpretability; for example, we can use it to predict the demand for a certain product based on its price.

Now that you understand what linear regression does and its important underlying assumption, let's look at its equation. For the simplest case, we can use y hat equals beta0 plus beta1 multiplied by x to model the data. y is the variable we want to predict; it's called the dependent variable, and y hat is the estimated value of y. x is a variable we can use to predict y; it's called an independent variable. beta0 is called the intercept and beta1 is the slope. We can easily extend the equation with more independent variables: instead of using one x, we could have x1 to xn, each representing an independent variable.

Now you probably ask: where do the betas come from, or how do we get the values of the betas? That's a great question, because it is the core of understanding the exact relationship between x and y. One commonly used method is to calculate the average of the squared differences between the observed values and the predicted values; this error is also called the mean squared error. Here y_i refers to the observed value and y_i hat is its estimate. We want to obtain betas that minimize this error, so it becomes an optimization problem. This method is very commonly used and it actually has a name: ordinary least squares. Put more formally, ordinary least squares is a method used in linear regression that estimates the parameters by minimizing the squared distances between the observed and predicted values. I hope everything so far makes sense: the objective is to find the optimal betas so that the error is minimal. But how do we do that? Now I introduce another concept to you called gradient descent; we use it to reduce the error over multiple iterations.
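Written out explicitly (the notation below is mine; the captions only describe it in words), the model and the error being discussed are:

```latex
% Simplest case: one independent variable
\hat{y} = \beta_0 + \beta_1 x

% With n independent variables
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n

% Mean squared error over m observed points (y_i observed, \hat{y}_i predicted)
\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2
```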
Intuitively speaking, we start with a random guess of the betas, then we compute the mean squared error associated with them. Next we compute the gradient at each parameter, which will be used to update the values of the betas. Don't worry if you don't understand what a gradient is; I will explain the calculation in detail later. We repeat this step until the error reaches its minimum value, in other words until the error converges.

Putting everything together, let me show you a visualization of how it works. We have some data points with x and y values. At the beginning we take a random guess of the betas and plot x and the predicted y using this red line. You can see that the line does not seem to reflect the relationship between x and y accurately, and we have a large error. So we compute the gradients and use them to update the values of the betas, then we plot the line again. We see the error drop and the predictions become more accurate, so the red line models the data better. After a few iterations the error becomes smaller and smaller, and finally we reach the minimum error; the betas we obtain are optimal.

Now that you understand how it works conceptually, let me show you how to compute the gradient. First of all, what exactly is a gradient? It's just the derivative of the error with respect to a particular parameter, so the gradient at beta0 is the derivative of the error with respect to beta0, and we can decompose it using the chain rule. A quick recap of the chain rule: the derivative of y with respect to x equals the derivative of y with respect to u multiplied by the derivative of u with respect to x. So the gradient at beta0 equals the derivative of the error with respect to y hat multiplied by the derivative of y hat with respect to beta0. We have defined the error before, which is the mean squared error, so we can get the exact form of the gradient at beta0; note that the derivative of y hat with respect to beta0 is one. Similarly, we can get the gradients for the other betas; here I use beta i to represent any beta other than beta0, and they all have the same form. (The resulting expressions are written out below.)

Now that you understand not only the gradient descent algorithm but also how to compute the gradient for each parameter, let's dive into the most interesting part: the implementation of linear regression. I'm going to use the main-plus-helper-function approach to implement it, which I consider the best way to implement a machine learning algorithm: we start with a main function that contains only the high-level logic of the algorithm, then we use a few helper functions to handle the more detailed implementation. Using this approach helps you keep the code organized and makes it easier for the interviewer to follow during interviews. It's also good coding practice, because helper functions modularize your code, so one component can be changed easily without impacting the others.

So let's start with the main function. The main function follows exactly what we have talked about previously. First, we initialize the parameters based on the dimensions of the input data. You can think of the input data as a table: we have m data points in total, each one has n columns, and each column represents an independent variable, so there are n plus one betas, from beta0 to beta n. All betas are one-dimensional; we can say each of them is a scalar. Secondly, we compute the gradients of the betas, then we use the gradients to update the value of each beta. We repeat this process multiple times. Now we are done with the main function.
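For reference before the helper-function walkthrough, here are the per-parameter quantities from the chain-rule derivation above, written out in my own notation (g_0, g_j, x_{ij}, m, and n are my symbols), following the video's convention of two times the observation minus the prediction:

```latex
% Chain rule: dMSE/d beta = (dMSE/d y_hat) * (d y_hat / d beta),
% with d y_hat / d beta_0 = 1 and d y_hat / d beta_j = x_j.
% Quantities accumulated per parameter, averaged over the m data points:
g_0 = \frac{1}{m} \sum_{i=1}^{m} 2 \left( y_i - \hat{y}_i \right),
\qquad
g_j = \frac{1}{m} \sum_{i=1}^{m} 2 \left( y_i - \hat{y}_i \right) x_{ij},
\quad j = 1, \dots, n
```

Note that these g's are the negatives of the textbook derivatives of the MSE with respect to the betas, which is exactly why the update step later adds them to the betas instead of subtracting.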
We also need three helper functions; let's go through them one by one. (I collect a complete code sketch at the end of this section.)

The first helper function handles parameter initialization. We can simply initialize beta0 as zero. For the other betas, we can use a vector to hold all of them; the vector has the same size as the number of independent variables, which is n, and each beta is initialized randomly.

Next comes the core of the algorithm: computing the gradients. In this function we compute the gradients for all the betas, and all of them start at zero. Again, we treat beta0 separately because its gradient has a different form than the others. We loop through all the data points and add the gradient contributed by each data point to those accumulator variables. Inside the for loop, for each data point i we obtain the prediction y_i hat, and then we take the difference between the prediction y_i hat and the observation y_i. From that we obtain the derivative term of the error, which is two times the difference between the observation and the prediction. Finally, we accumulate the gradient contributions for the betas; each data point's contribution is divided by m, so the gradient computed at the end is the average over all data points.

The next step is to update the betas using the gradients we just obtained. Note that we don't simply add the raw gradients to the betas: we scale each gradient by multiplying it by the learning rate. The learning rate controls how fast the parameters move during gradient descent, and we don't want it to be too high or too low: setting it too high makes gradient descent unstable, while setting it too low makes it slow to converge, so we need to tune it to find a good value. Now I want you to pay attention to the positive sign when we update the betas. During interviews you want to explain clearly the sign, or direction, of the gradient applied to the betas. It is determined by how the gradients were computed in the previous step: there, we took the derivative term of the error as the observed value minus the estimate, so if y is overestimated, that term is negative, and adding a negative value to the betas pushes the prediction back down. That's why we add the gradient to the betas. During interviews it's worth explaining this reasoning so that the interviewer is convinced you truly understand it.

Okay, now that the implementation is done, let's take some time to analyze its time and space complexity. Suppose we have n independent variables and the total number of data points is m. In each iteration it takes O(mn) to compute the gradients, because of the double loop, and it takes O(n) to update the betas, so the bottleneck is clearly computing the gradients. If we end up updating the betas i times, that is, there are i iterations, the total time complexity is O(mni). For the space complexity, we check what intermediate variables we have created: a variable for beta0 and a list of n elements to hold the other beta values. The former is O(1) and the latter is O(n), so the space complexity is O(n).

All right guys, that's how we can implement linear regression. I hope it helps. Let me know if you have any questions or feedback. As always, I want to thank you for taking the time to watch my video. I will see you in the next video.
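Putting the whole walkthrough together, here is a minimal NumPy sketch of the main-plus-helper structure described in the video. All names (fit_linear_regression, initialize_params, compute_gradients, update_params, learning_rate, num_iterations) and the toy data are my own illustrative choices, and the plus sign in the update follows the sign convention explained above.

```python
import numpy as np

def initialize_params(n):
    """beta0 starts at zero; the other n betas are initialized randomly."""
    beta0 = 0.0
    betas = np.random.randn(n)
    return beta0, betas

def compute_gradients(X, y, beta0, betas):
    """Average gradient over all m data points, using the video's sign
    convention 2 * (observed - predicted), i.e. the negative of d(MSE)/d(beta)."""
    m, n = X.shape
    grad_beta0 = 0.0
    grad_betas = np.zeros(n)
    for i in range(m):                      # loop over the m data points
        y_hat_i = beta0 + X[i] @ betas      # prediction for point i
        d_error = 2.0 * (y[i] - y_hat_i)    # derivative term for point i
        grad_beta0 += d_error / m           # beta0's contribution (d y_hat / d beta0 = 1)
        grad_betas += d_error * X[i] / m    # other betas (d y_hat / d beta_j = x_ij)
    return grad_beta0, grad_betas

def update_params(beta0, betas, grad_beta0, grad_betas, learning_rate):
    """Scale the gradients by the learning rate and ADD them (see the sign note above)."""
    beta0 += learning_rate * grad_beta0
    betas += learning_rate * grad_betas
    return beta0, betas

def fit_linear_regression(X, y, learning_rate=0.01, num_iterations=1000):
    """Main function: initialize, then repeat compute-gradients / update-params."""
    _, n = X.shape
    beta0, betas = initialize_params(n)
    for _ in range(num_iterations):
        grad_beta0, grad_betas = compute_gradients(X, y, beta0, betas)
        beta0, betas = update_params(beta0, betas, grad_beta0, grad_betas, learning_rate)
    return beta0, betas

# Tiny usage example: recover y ~ 1 + 2*x1 - 3*x2 from noisy synthetic data.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = 1.0 + X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=200)
    beta0, betas = fit_linear_regression(X, y, learning_rate=0.05, num_iterations=2000)
    print(beta0, betas)
```

Each call to compute_gradients does O(mn) work and each update does O(n), matching the complexity analysis above; on the small synthetic example in the __main__ block the recovered betas should land close to the true values of 1, 2, and -3.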
Info
Channel: Data Interview Pro
Views: 5,892
Keywords: linear regression, data science interview, machine learning interview, machine learning algorithm, gradient descend
Id: RIg3iuen7MY
Length: 11min 58sec (718 seconds)
Published: Fri Apr 16 2021