The Mathematics of Machine Learning

Video Statistics and Information

Captions
This video is sponsored by Coursera.

There are various problems and tasks in the world that are very hard to solve by just programming a computer through traditional methods and explicit instructions. Making computer games, iPhone apps, or desktop applications is very doable through normal means, whereas making a machine that can beat the best human in a complex game, making a car that can drive itself, or having a computer recognize objects is not so simple. These are not things you can easily just tell a computer to do. One way around this is to teach the computer how to learn, and have it figure out how to get better through lots of practice, or in a computer's case, lots of data. This is machine learning. Machine learning helps Amazon suggest relevant products for you. It's used to read handwritten addresses when you send a letter. It's used to figure out whether an email is spam or not. It's used to demonetize your favorite videos across this entire platform. And it's used for a few other things. In order to make this possible, we need to combine a lot of math with a lot of programming, so let's see how all that works.

The main types of math you're going to see in an intro machine learning course are these right here, but even if you don't know any of them, you'll still be fine for this video. Let's say you want to make an algorithm for a real estate site that can estimate the market price of a house. First we're going to need some data, maybe from another realty site, so the algorithm can learn from it. For simplicity, we are only going to include one variable as an input parameter, the size of the house in square feet in our case, and the goal is that if we give it a new input value, it will output an accurate price. Let's plot this data to see what it looks like; I'm just going to treat the axes as one-digit numbers to avoid any large values here. Now, obviously this has a linear look to it, which means we can estimate it with a best-fit line. I'm sure this is not anything new to most of you guys, but where we're going with this probably isn't exactly what you're thinking.

First off, let's guess a best-fit line of, let's say, y = x, and by the way, we're going to fix this line at the origin for this entire part, so all we can change is the slope. Of course this first guess is not the best-fit line, but why is that? Well, we see at a size of 1,000 square feet our line guesses $100,000 when it should be $150,000, therefore the error is 50k, but with reduced values you see it's just 0.5, which I'll put up here. When the size is 2,000 square feet, our line has a prediction error of 2, and we can figure this out for everything. Before moving on, what we're going to do is square all of those errors. This is simply because we don't care about whether our prediction is too low or too high; error is error, so we just want the positive values to see how truly, quote, "off" our line is. (We could do the absolute value, I guess, but that's not what's used in practice.) If we add up all those errors, we get 26.125. So if we had a plot of the error versus that slope value we guessed, then we can plot that: a slope of 1, while fixing the origin, gets us an error of 26.125. If we increase the slope to 1.25 to get a closer match, we can again calculate those errors, add them up, and get a new and lower total error value, and we can then put that on our error plot. A slope of 1.68 gives us a total error of 1.327, and we can do this for several other slopes to create a parabolic error function.
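To make that concrete, here is a minimal Octave sketch of the error curve; the video never lists its exact data points, so the (size, price) values below are made up, scaled to one-digit numbers as in the video.

    % A minimal sketch of the squared-error curve, with hypothetical scaled data
    % (the video's exact points aren't shown, so these values are made up).
    x = [1 2 3 4];                 % house sizes, scaled to one-digit numbers
    y = [1.5 4 4.5 7];             % market prices, same scaling (hypothetical)
    slopes = 0:0.05:3;             % candidate slopes m for a line y = m*x fixed at the origin
    errors = arrayfun(@(m) sum((m*x - y).^2), slopes);   % total squared error per slope
    [best_err, i] = min(errors);
    printf("slope %.2f gives the minimum total squared error %.3f\n", slopes(i), best_err);
    plot(slopes, errors); xlabel("slope"); ylabel("total squared error");   % the parabola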
Now, this tells us that right here is the slope of our best-fit line, because it yields the minimum error. Let's go back to that first guess we made. We obviously knew it underestimated everything and the slope needed to be increased, but how would a computer know which direction to go? Well, here's an algorithm to figure it out: the new slope will be the old slope minus a constant times the derivative of the error function at that point. For those who don't know calculus, that's okay. Right now we're guessing a slope of 1 for the best-fit line, so if we pull up our error function, we remember that a slope of 1 gave us an error of about 26 when we added everything up. At this value we need the slope of the tangent line. It's clearly negative, and I'll just make up a number and say it's -10. So the algorithm says the new slope equals the current one we're guessing of 1, minus a constant, which I'll just call 0.01, times that slope or derivative. This yields a value of 1.1, and if we move our dot to that slope, we are now a little closer to that best-fit line of minimum error; we're slowly stepping toward our goal. We then repeat the process: the tangent line here may have a slope of -9, and as the algorithm tells us, the new slope is the current one minus that constant times the derivative. Again we have inched closer to that minimum value. If we continue this, we will find that minimum error, and that's the slope of our best-fit line. This is called gradient descent, an optimization method to find the minimum of a function.
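In code, that update rule is just a loop. Here is a hedged sketch in Octave using the same hypothetical data as above; instead of eyeballing the tangent line, it computes the derivative of the total squared error E(m) = sum_i (m*x_i - y_i)^2 analytically, which is 2 * sum_i (m*x_i - y_i) * x_i.

    % Gradient descent on the slope alone: new slope = old slope - constant * derivative.
    % Same hypothetical data as the sketch above.
    x = [1 2 3 4];  y = [1.5 4 4.5 7];
    m = 1;                                % the initial guess from the video
    alpha = 0.01;                         % the small constant (learning rate)
    for step = 1:500
      dEdm = 2 * sum((m*x - y) .* x);     % derivative of the total squared error w.r.t. m
      m = m - alpha * dEdm;               % take a small step downhill
    end
    printf("slope after gradient descent: %.3f\n", m);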
Real quick, I know plenty of you are thinking, "Wait, there are known equations out there that calculate a best-fit line, why not just throw that into the program?" And, well, you're right: this is just linear regression, which you learn in many algebra classes, and that's probably better for this. We have to introduce gradient descent first because it gives you an idea of a program learning and fixing its error without explicitly being given a formula. While that's not needed so much now, soon students come across methods in which there's no formula that will help, and no way a person could just look and figure out how an initial guess and parameters should be changed.

Now, we've only done the math with a single parameter, just changing the slope of our best-fit line, but adding more really does not change much. Let's say now we want to make a real best-fit line and include the y-intercept parameter. Well, we could make a guess for both, plot that against our original data, calculate all those errors, and make another plot of the error. The only difference is now the error would be a function of the slope we guessed and the y-intercept we guessed, meaning we would have a 3D plot of the error. If our error started here, the goal is again to walk down the curve until you reach a minimum and get a corresponding slope and y-intercept, a.k.a. a best-fit line. Remember before how the algorithm was finding the derivative and taking a step in the right direction to go a little downhill? The only difference now is we take a partial derivative, since there are more variables. This will step our slope toward that of the best-fit line; then we do the same thing for the y-intercept, and this will step that value toward that of the best-fit line. For those who don't know this yet, that's okay: if this is our 3D error function plotted against slope and y-intercept, what we're really doing is slicing the curve with a sheet parallel to one of the axes. This tells us which way to step in just that x direction, a.k.a. whether to increase or decrease the slope alone; we're really turning this back into a 2D problem. Then we do the same with a slice in the y direction to step our y-intercept value toward a minimum. So essentially we work with both variables and keep altering them together, resulting in a walk down the hill toward a minimum.

Now, in the real world, in order to predict the market price of a house, there's much more to it. You may have inputs like the size of the house, number of bedrooms, bathrooms, its age, and plenty more. In order to make a best-fit linear model of this, we now need more parameters, but again we just take a guess, then run all those partial derivatives, or take those slices in higher dimensions, and step each one a little downhill toward the minimum. Even though this would yield a six-dimensional error plot, we still break it down to a two-dimensional problem for each parameter. If you go back to having one input parameter but want a more complex equation, like with quadratic or cubic terms, you just make new inputs that include the original input squared, cubed, and so on, but the math and algorithm work the same from here.

So that covers making a best-fit curve to approximate data, regardless of how many inputs we have. But now let's mix it up: what if we want to predict an output that can only assume two binary values, like whether someone passed or failed a test, whether a structure will break or not, or whether someone will develop a certain disease or not? Let's say we have some data on how many hours various students studied and whether they passed a certain test; we'll say a 1 is passing and a 0 is failing. So maybe someone studied zero hours and failed, another studied one hour and failed, two hours and passed, and we'll put a few more at 3, 4, and 5 hours. Now the question is, how do we make a program where we tell it someone studied maybe 3.5 hours or whatever, and it tells us the probability of that person passing?

Well, what if we made a best-fit line? There's actually some validity to doing this. We could say when the line reaches a value of 0.5, that's a 50/50 chance they pass or fail, and this could be our boundary between likely passing and likely failing. But the problem is, if we add another data point way over here, that should not exactly change much; we'd expect someone who studied this long would pass. But when included, it really affects our best-fit line, making that boundary higher than it should be. So let's try something else. We know the probability of passing or failing will only be between zero and one, or zero and a hundred percent, so let's look at a function that only exists between zero and one. This is known as the sigmoid function, and this is its equation: sigma(z) = 1 / (1 + e^(-z)).
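Since the equation only appears on screen in the video, here it is as a one-line Octave sketch, applied to the line described in the next paragraph (intercept 0, slope 0.2):

    % The sigmoid function, which only outputs values between 0 and 1.
    sigmoid = @(z) 1 ./ (1 + exp(-z));   % elementwise, so z can be a vector
    % The line's y-values at 0..5 hours of studying become the sigmoid's inputs:
    hours = 0:5;
    disp(sigmoid(0.2 * hours));          % 0.50  0.55  0.60  0.65  0.69  0.73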
Now, although our line from before has values way above 1, if we plug those y-values into the sigmoid function, it will output only numbers between 0 and 1. So the x-axis here will be the y-values from our prediction line, and our real goal now is again to find a line, but this time one that minimizes the error between the actual output of 1 or 0 versus what the sigmoid function spits out when we plug in the y-values of that line. Here's what I mean: at an input of zero hours, this line has a y-value of 0, so that goes on our x-axis. At one hour, the line has a value of 0.2, so we put that on the x-axis at x = 0.2. At two hours, the y-value is 0.4, which we put on our graph, and we do the same thing for three, four, and five hours. Those are what we input into the sigmoid function. If we put in x = 0, then it'll spit out 0.5, which is evident from the graph. At zero hours of studying we know the person failed, though, so the actual output is 0, which I'll also graph. When we input 0.2, we just go up to the sigmoid function, which has an output value of 0.55, and the actual output is still 0 because this person failed. And 0.4 goes up to a value of 0.6, which was actually a pass, and we do the same thing for three, four, and five. The difference between the sigmoid output and the actual output is our error, and this is what we're trying to minimize.

Now, notice that if we just changed the y-intercept of our line, all its y-values would go up by the same amount. This means on our sigmoid graph all the x-values would increase by the same amount, moving all the dots along the curve to the right, decreasing certain errors but increasing others. If I just change the slope instead, then the y-value at x = 0 does not change at all; the y-value at x = 1 goes up a little, which shifts that dot just a little along our curve; the y-value at x = 2 changes a little more, meaning that dot shifts further up our curve. Each term after that changes a little more than the previous, so they all move up the curve, but everything gets a little further apart. So we can move everything the same amount to the right or left by changing our line's intercept, or we can move the dots at different intervals along the curve, bunching or separating them, which happens when we alter the slope. There is some line, then, where we reach the sweet spot of having that minimum error.

Using similar techniques to what we saw earlier, we find the line that accomplishes this has this equation. Now, if someone studies 3.5 hours, we just plug that in for x, and the equation outputs 1.2139. This doesn't mean much numerically, but the point of all this was getting a line such that when we plug the y-values of the line into the sigmoid function, we get a reasonable percent output. If we plug this value into the sigmoid function, it spits out 0.77, which means we can say that if you study 3.5 hours, then you have a 77% chance of passing the test, based on the results of past students. Hopefully this gives you the tiniest hint at why things like Netflix give a percent match for TV shows and movies, or why doctors may say there's some percent chance of a patient having a certain disease.

And just for fun, let's run through a quick example with lots of inputs, where we definitely need a program to do the work for us. Let's say whether you get into college depends on two exams that are both out of 100, so those are our inputs. The goal is, based on other students' scores and whether they got accepted, to predict the odds of us getting in with a certain set of scores. So I have all the data here, which I'm going to import into Octave and graph. Okay, here we have a plot of all the data, where you can see exam 1 scores here and exam 2 scores here; people who got admitted are represented by pluses, and those not admitted by these yellow circles. So this is just the graph, but now the question is: let's say someone gets a 45 on exam 1 and a 70 on exam 2, what's their percent chance of getting accepted into this college based on how everyone else did? Well, now I have to implement the algorithm to find the curve that yields the minimum error based on that sigmoid function. ...And that took three hours, I think; I'll just edit out all the mistakes I made. Now I've got what I need, but first I want to show you that this plot is technically a three-dimensional one, which I won't show exactly, but what I can plot is a contour line here, and this contour line shows, in this case, exactly when someone has a 50% chance of getting in. So if you land on this line, like if you got a 90 on exam 1 but only a 30 on exam 2, then this means you have a 50% chance of getting in. But now if I want to see someone's odds of getting in based on the scores they got, I just type that in and it will spit out the percent chance. So now you can see, if someone got a 45 on the first exam and a 70 on the second one, they have about a 14.5% chance of getting into the school based on how other people did. And I'll type in just a few more data points so you can see all the numbers. Here you can see the first exam scores and the second ones, and then what their chance of getting in based on those scores is; you can really just ignore these ones here.
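The prediction step typed into Octave would look something like the sketch below. The fitted parameters are illustrative stand-ins, not the video's actual values (which are never shown), chosen so that a (45, 70) score pair lands near the ~14.5% figure quoted above.

    % Predicting the chance of admission with already-fitted logistic-regression
    % parameters. These theta values are hypothetical, not the video's exact ones.
    sigmoid = @(z) 1 ./ (1 + exp(-z));
    theta  = [-25.16; 0.206; 0.201];   % hypothetical [intercept; exam 1 weight; exam 2 weight]
    scores = [1, 45, 70];              % the leading 1 multiplies the intercept
    p = sigmoid(scores * theta);       % predicted probability of admission
    printf("chance of admission: %.1f%%\n", 100 * p);   % roughly 14%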
Now, we've seen a lot of linear models, but if you're a math person, the word "linear" really doesn't elicit much interest unless it's preceded by "non," and that's where neural networks come in. These are a very nonlinear and powerful technique within machine learning. A neural network may look like this, and as always, you have to give it information: maybe that's the amount of hours you studied, the amount of sleep you got the night before a test, what you got on the previous test, and so on. The goal is to use those to predict how you'll do on the next test, which will be the output. Well, we give the program some input values from previous data, and then weights are assigned to all those connections. Those weights multiply the inputs, and the results are all added together to create new values here. Those are then multiplied by another set of weights, giving us what is supposed to be a final score of 0 to 100, let's say, the predicted value of a test. When that guess is wrong, as it most likely will be, you then backpropagate through the network and change the weights just a little to fix the error, and this is how the network learns. These weights are represented through various matrices, and those are what are altered throughout the program (a rough sketch of one forward pass appears at the end of this transcript). There's much more to these, but due to time, I'm not going to dive into it in this video. I might do a part 2 video, but you can learn all of this through a machine learning course on Coursera, where I got all the information for this video. In this course, put on by Stanford, you'll learn the math and theory behind machine learning, and you'll also be putting it all into practice in MATLAB or Octave, which they'll show you how to do. By the end of week three you'll program that algorithm I showed you, and by week four you'll make a neural network. They cover so much more than what I went over, and you don't even need to know calculus or linear algebra to get started; they'll explain everything you need. Coursera has thousands of other courses put on by industry leaders, so if you're looking to get ahead for the next semester, if you're trying to learn a new skill for your job, or you just want to learn something new, then I highly recommend them, and you can get started for free right now; links are in the description below.

And with that, I will end the video there. If you guys enjoyed, be sure to like and subscribe, don't forget to follow me on Twitter, and join the MajorPrep Facebook group for updates on everything. I'll see you all in the next video.
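As referenced above, here is a rough Octave sketch of one forward pass through such a network. The layer sizes and random weights are made up for illustration; real training would fit them by backpropagation.

    % One forward pass through a tiny made-up network: 3 inputs -> 4 hidden units -> 1 output.
    sigmoid = @(z) 1 ./ (1 + exp(-z));
    x  = [6; 7; 85];                  % hours studied, hours of sleep, previous test score
    W1 = 0.01 * randn(4, 3);          % hypothetical weights on the input connections
    W2 = 0.01 * randn(1, 4);          % hypothetical weights on the hidden connections
    hidden = sigmoid(W1 * x);         % weighted sums of the inputs, squashed nonlinearly
    score  = 100 * sigmoid(W2 * hidden);   % final predicted test score on a 0-100 scale
    printf("predicted score: %.1f\n", score);
    % Training would compare this prediction to the real score and backpropagate,
    % nudging W1 and W2 a little downhill on the error, just like gradient descent above.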
Info
Channel: Zach Star
Views: 374,015
Keywords: majorprep, major prep, machine learning, neural networks, machine learning math, math needed for machine learning, math used in machine learning, gradient descent, what is a neural network, linear regression, logistic regression, octave, matlab, programming, machine learning algorithm, gradient descent algorithm, what is machine learning, how machine learning works, applications of machine learning, machine learning course, coursera, how computers learn, how machines learn
Id: Rt6beTKDtqY
Length: 16min 34sec (994 seconds)
Published: Fri Nov 30 2018