A Friendly Introduction to Machine Learning

Video Statistics and Information

Captions
Hi, and welcome to A Friendly Introduction to Machine Learning, with Udacity. What we're going to talk about today is: what is machine learning? Well, this is the world, and in the world we have humans and we have computers. One of the main differences between humans and computers is that humans learn from past experience, whereas computers need to be told what to do; they need to be programmed, and they follow instructions. Now the question is: can we get computers to learn from experience too? The answer is yes, we can, and that's precisely what machine learning is. Of course, for computers, past experiences have a name: data. So in the next few minutes I'm going to show you a few examples in which we can teach a computer how to learn from previous data, and most importantly, I'm going to show you that these algorithms are actually pretty easy, and that machine learning is really nothing to fear.

So let's go to the first example. Say we're studying the housing market, and our task is to predict the price of a house given its size. We have a small house that costs $70,000, we have a big house that costs $160,000, and we'd like to estimate the price of this medium-sized house here. How do we do it? Well, first we put them on a grid, where the x-axis represents the size of the house in square feet and the y-axis represents the price of the house in dollars. To help us out, we have collected some previous data in the form of these blue dots: these are other houses that we've looked at, and we've recorded their prices with respect to their size. In this graph we can see that the small house is priced at $70,000 and the big house is priced at $160,000.

So now it's time for a small quiz. What do you think is the best guess for the price of the medium house, given this data? Would it be $80,000, $120,000, or $190,000? Well, to help us out, we can see that these blue points
kind of form a line, so we can draw the line that best fits the data. On this line, our best guess for the price of the house is this point over here, which corresponds to $120,000. So if you said $120,000, that is correct. This method is known as linear regression.

Now you may ask, how do we find this line? Well, let's look at a simple example with three points. We're going to try to find the best line that fits through those three points. Obviously "best line" is subjective, but we'll try to find a line that works well. Since we're teaching the computer how to do it, and a computer can't really eyeball a line, we have it draw a random line and then see how bad this line is. To see how bad the line is, we calculate the error: we look at the lengths of the distances from the line to the three points, and we simply say that the error of this line is the sum of those three red lengths. Now what we're going to do is move the line around and see if we can reduce this error. Say we move it in this direction and calculate the error; it's given by the yellow distances, we add them up, and we realize that we've increased the error, so that's not a good direction to go. Let's try moving in the other direction: we move the line here, calculate the error, now given by the sum of these three green distances, and we see that the error is smaller; we've actually reduced it. So we take that step, and we're a little closer to our solution. If we continue this procedure several times, always decreasing the error, we'll finally arrive at a good solution in the form of this line. This general procedure is known as gradient descent.

Now, in real life we don't want to deal with negative distances, which correspond to a point being on one side of the line or the other. So what we do to solve this is add the square of the distance from the point to the
line instead, and this procedure is called least squares.

The name gradient descent comes from the idea of descending from a mountain. This is our mountain, Mount Everest, and on this mountain, the higher we are, the larger the error is, so descending means reducing the error. So what do we do in gradient descent? Well, we look at our surroundings and try to figure out in which direction we can descend the most. For example, here we can go in two directions, to the right or to the left. If we go to the left, then we're going up, so our error is increasing; this is equivalent to moving the line downwards and getting farther from the three points. But if we go to the right instead, then we're actually descending, which means our error is decreasing; this is equivalent to moving the line upwards and getting closer to the three points. So we decide to take a step towards the right. Then we can start this procedure again, and again, and again, until we successfully descend from the mountain. This is equivalent to reducing the error until we find its minimum value, which gives us the best line fit. So you can think of linear regression as a painter who will look at your data and draw the best-fitting line.

Now, this method is actually much stronger. If the data doesn't form a line, then with a very similar method we can draw a circle through it, or a parabola, or even a higher-degree curve; for the data here, we can actually fit a cubic polynomial.

Okay, so let's move to the next example. In this example we're going to build an email spam detection classifier, something that will tell us whether an email is spam or not. And how do we do this? We do it by looking at previous data. The previous data is 100 emails that we've already looked at; out of these 100 emails, we have flagged 25 of them as spam and 75 of them as not spam. Now let's try to think of features that spam emails are likely to display, and analyze those features. One feature could be containing the word "cheap": it seems reasonable to think that an email containing the word cheap
is likely to be spam. So let's analyze this claim. We look for the word cheap in all 100 emails and find that 20 of the spam ones and 5 of the non-spam ones contain that word. So we can forget about all the rest of the emails and focus only on the ones that contain the word cheap.

Okay, so time for a quiz. Here's the question: based on our data, if an email contains the word cheap, what is the probability of this email being spam? Is it 40%, 60%, or 80%? Well, to help us out, we can see that out of the 25 emails with the word cheap, 20 of them are spam while 5 of them are not, so these form an 80/20 split. So the correct answer is 80%; if you said 80%, you were correct. So from analyzing the data we can conclude a rule. The rule says: if an email contains the word cheap, then we're going to say the probability of it being spam is 80%. We then associate this feature with the probability 80%, and we're going to use it to flag future messages as spam or not spam.

We can also look at other features and find their associated probabilities. Say we look at emails containing a spelling mistake, and we find that the probability of an email containing a spelling mistake being spam is 70%; or say we look at emails that are missing a title, and we find that the probability of those being spam is 95%; etc., etc. So now, when future emails come in, we can combine these features to guess whether they're spam or not. This algorithm is known as the naive Bayes algorithm.

Okay, so now another example. We are the App Store, or Google Play, and our goal is to recommend apps to users: to each user we're going to recommend the app that they are most likely to download. We have gathered a table of data that we're going to use to make the rules. The table contains six people; for each one of those six people we have recorded their gender, their age, and the app they downloaded. So, for example, the first person is a 15-year-old female, and she downloaded Pokemon Go. So here's a small quiz: between gender and age,
which one seems like the more decisive feature for predicting which app a user will download? Well, to help us out, let's first look at gender. If we split by gender, the females downloaded Pokemon Go and WhatsApp, whereas the males downloaded Pokemon Go and Snapchat, so there's not much of a split here. On the other hand, if we look at age, we realize that everybody who's under 20 years old downloaded Pokemon Go, whereas everybody who is 20 or older didn't. That's a nice split. So the feature that best splits the data is age; therefore, if you said age, that was correct.

So what we're going to do is add a question here. The question is: are you younger than 20? If yes, then we'll recommend Pokemon Go to you; if not, then we'll see. So what happens if you're 20 or older? Then we look at gender. It seems that if you're a female, you downloaded WhatsApp, whereas if you're a male, you downloaded Snapchat. So we add another question here: are you female or male? If you're female we recommend WhatsApp, and if you're male we recommend Snapchat.

So what we end up with here is a decision tree, where the decisions are given by the questions we asked. This decision tree was built from the data, and now whenever we have a new user, we can run them through the decision tree and recommend whatever app the tree suggests. For example, if we have a young person, we recommend them Pokemon Go; if we have an older person, we check their gender: if it's a female we recommend them WhatsApp, and if it's a male we recommend them Snapchat. Obviously there won't always be a tree that perfectly fits our data, but in this class we're going to learn an algorithm that will help us find the tree that best fits your table of data.

Okay, so let's go to the next example. Now say we're the admissions office at a university, and we're trying to figure out which students to admit. We're going to admit them or reject them based on two pieces of information: one is an
entrance exam that we give them, the test, and the other is their grades from school. So, for example, here we have student 1, with scores of 9 out of 10 in the test and 8 out of 10 in the grades, and that student got accepted. We also have student 2, with a score of 3 in the test and 4 in the grades, and that student did not get accepted. And then a new student comes in, student 3; this person has scores of 7 and 6, and the question is, should we accept them or not?

So let's first put them on a grid, where the x-axis represents their score on the test and the y-axis represents their grades. Here we can see that student 1 lies over here, at the point with coordinates (9, 8), since their scores were 9 and 8, and student 2 lies right here, at the point with coordinates (3, 4), since their scores were 3 and 4. In order to see whether we should accept or reject student 3, we should try to find a trend in the data. So we look at the previous data, in the form of all the students we've already accepted or rejected, and it turns out that the previous data looks like this: the green dots represent students that we've previously accepted, and the red dots represent students that we've previously rejected.

So, time for a quiz: based on the previous data, do we think student 3 gets accepted, yes or no? To answer this question, let's look closely at the data. The red and green dots seem to be nicely separated by a line. Here's the line: most of the points over it are green, and most of the points under it are red, with some exceptions. This makes sense, since the students who got high scores are over the line and they got accepted, and the students with lower scores are under the line and they didn't get accepted. So we're going to say that this line is our model, and now every time we get a new student, we check their scores and plot them on this graph; if they end up over the line we predict that they'll get accepted, and if they end up below the line we predict that they'll get rejected.
So since student 3 has scores of 7 and 6, this person ends up here, at the point (7, 6), which is over the line, so we conclude that this student gets accepted. If you said yes, that's the correct answer. This method is known as logistic regression.

Now another question is, how do we find the line that best cuts the data in two? So let's look at a simple example with six points, three green and three red, and try to draw the line that best separates the green points from the red points. Again, a computer can't really eyeball a line, so we can just start by drawing a random line like this one, and given this line, let's just say that we label the region over the line as green and the region under the line as red. Just like with linear regression, we're going to try to see how bad this first line is, and the measure of how bad the line is will be how many points we are misclassifying; we're going to call the number of misclassified points the error. This line, for example, misclassifies two points, one red and one green, so we'll say that it has two errors. So again, as with linear regression, what we'll do is move the line around and try to minimize the number of errors using gradient descent. If we move the line a bit in this direction, we can see that we start correctly classifying one of the points, bringing the number of errors down to one, and if we move it a little more, we correctly classify the other point, bringing the number of errors down to zero.

In reality, since we use calculus for the gradient descent method, it turns out that the number of errors is not what we need to minimize, but instead something that captures the number of errors, called the log loss function. The idea behind the log loss function is that it's a function which assigns a large value to misclassified points and a small value to correctly classified points.

Okay, so let's look more carefully at this model for accepting or rejecting students. Say we have a student 4, who got a nine in the test and
a one in the grades. So this student gets accepted according to our model, since they land over here, on top of the line. But that seems wrong, since a student who got very low grades shouldn't get accepted no matter what their test score was. So maybe it's too simplistic to think this data can be separated by just one line, right? Maybe the real data looks more like this, where these students over here, who got a low test score or low grades, don't get accepted. So now it seems like one line won't cut the data in two. So what's the next thing after a line? Maybe a circle; a circle could work. Maybe two lines; that could work too. Actually, it looks like that works better, so let's go with that.

Now the question is, how do we find these two lines? Again, we can do it using gradient descent to minimize a similar log loss function, and this is called a neural network. Now, why is it called a neural network? Well, let's see. We have this green area here, bounded by the two lines. This area can be constructed as an intersection, namely the intersection between the green area on top of one of the lines and the green area to the right of the other line. So we're going to graph it like this: we have two nodes, and each node is a line that separates the plane into two regions; from the two nodes we get the intersection, which is the desired area. The reason this is called a neural network is that it mimics the behavior of the brain. In the brain we have neurons, which connect to each other and either fire electricity or not; they resemble the nodes in our graph, which split the plane into regions, fire if a given point belongs to one of those regions, and don't fire if it doesn't. So we can think of logistic regression as a ninja who will look at your data and cut it in half based on the labels, and we can think of a neural network as a team of ninjas who will look at your data and cut it into regions based on the labels.

Okay, so let's dive a bit deeper into the art of splitting data in two. We
can look at these points, three green and three red, and there seem to be many lines that can split them. For example, there is this yellow line, and there is this purple line. So, quiz: which of these two lines do you think cuts the data better, the purple one or the yellow one? Well, if we look at the yellow line, it seems that it's close to failing: it's too close to two of the points, so if we were to wiggle it a little bit, we would misclassify some of the points. The purple one, on the other hand, seems to be nicely spaced, as far as it can be from all the points. So it seems like the best line is the purple one.

Now the question is, how do we find the purple line? Well, the first observation is that we don't really need to worry about these points, because they're too far from the boundary; we can forget about them and only focus on the points that are close. And now what we're going to use is not gradient descent; we're going to use linear optimization to find the line that maximizes the distance from the boundary points. This method is called a support vector machine. So you can think of a support vector machine as a surgeon who will look at your data and cut it, but before cutting she will carefully look at the best way to separate the data in two, and then make the cut.

Okay, so now say we have these four points arranged like this, and we want to split them. It seems like a line won't do the job, since the red ones are on the sides and the green ones are in the middle. So we need to think outside the box. One way to think outside the box is to use a curve like this to split them. Another is to actually think outside the plane, and to think of the points as lying in a three-dimensional space. So here are the points on the plane, and here we add an extra axis, the z-axis, for the third dimension. If we can find a way to lift the two green points, then we'd be able to separate them with a plane. So which seems like a better solution, the curve over here or the plane over
here? Well, it turns out that these two are actually the same method. Don't worry if that seems confusing; we'll get into a little more detail later. This method is called the kernel trick, and it's widely used in support vector machines.

So let's study one of them in more detail; let's start with the curve trick. We start by putting coordinates on the points: this one is the point (0, 3), this one is (1, 2), this one is (2, 1), and this one is (3, 0). What we need is a way to separate the green points from the red points. If a point's coordinates are x and y, then we need an equation in the variables x and y that gives us large values for the green points and small values for the red points, or vice versa. So, quiz: which of the following equations could come to our rescue: the sum x + y, the product xy, or x², the first coordinate squared? This is not an easy question, so let's actually make a table with the values of these equations at each of the four points. Here's our table: we have the four points on the top row, and each of the other rows is one of the functions. For the sum x + y, we fill in the first row as follows: 0 + 3 is 3, 1 + 2 is 3, 2 + 1 is 3, and 3 + 0 is 3. For the second row we take the products: 0 × 3 is 0, 1 × 2 is 2, 2 × 1 is 2, and 3 × 0 is 0. And for the third row, x² is the first coordinate squared, so 0² is 0, 1² is 1, 2² is 4, and 3² is 9.

So let's think about which of these equations separates the green and the red points. We look at the sum x + y, and that gives us 3 at every point, so it doesn't separate the points at all. We can look at x², and that gives us different values for every point, but we get 0 and 9 for the red points and 1 and 4 for the green ones, so this one doesn't separate them either. But now we look at the product xy, and that gives us 0 for the red points and 2 for the green ones. So that one seems to do the job: it's a function that can
tell them apart, so that's the equation we're going to use. You can see the products here: for the red points (x, y) we have that the product xy equals 0, and for the green points we have that the product xy equals 2. And what separates a 0 and a 2? Well, a 1. So the equation xy = 1 will separate them. And what is xy = 1? It's the same as y = 1/x, and the graph of y = 1/x is precisely this hyperbola over here; that is the curve we wanted. So that is the kernel trick.

Now we can also see it in 3D. Here we have the points (0, 3), (1, 2), (2, 1), and (3, 0), and we're going to consider them in 3-space: we take the map that sends the point (x, y) to (x, y, xy). So where does (0, 3) go? It goes to (0, 3, 0), since the product of 0 and 3 is 0. (1, 2) goes to (1, 2, 2), so it goes all the way up, since the third coordinate is the height. The point (2, 1) also goes up, to (2, 1, 2), and the point (3, 0) goes to (3, 0, 0). So there we go: we can split them using a plane. So you can think of a support vector machine with a kernel method as a surgeon who is slightly confused, trying to split some apples and oranges; all of a sudden she comes up with a great idea, which consists of moving the apples up and the oranges down, and then successfully cutting a line between them.

Okay, so let's move to the next example. Say we have a chain of pizza parlors, and we want to put three of them in this city. So we make a study and find that the people who eat pizza the most live in these locations, and we need to know the optimal places to put our three pizza parlors. Well, it seems like the houses are nicely split into three groups: the red, the blue, and the yellow. So it makes sense to put one pizza parlor in each of the three clusters. But we're teaching a computer how to do this, and a computer can't just eyeball the three clusters; we need an algorithm. So here's one algorithm that will work. Let's start by choosing three random
locations for the pizza parlors; they're here, where the stars are located: red, blue, and yellow. Now it makes sense to say that each house should go to the pizza parlor that is closest to it. In that case, we can color the map like this, where the yellow houses go to the yellow pizza parlor, the blue houses go to the blue pizza parlor, and the red houses go to the red pizza parlor. But now look at where the yellow houses are located: it would make a lot of sense to move the yellow pizza parlor to the center of those houses, and the same for the blue houses and the red houses. So let's do that: let's move every pizza parlor to the center of the houses that it serves, as follows. But now look at these blue points; they're a lot closer to the yellow pizza parlor than to the blue one, so we might as well color them yellow. And look at these red points; they're closer to the blue pizza parlor than to the red one, so let's color them blue. Now let's do the step again that sends each pizza parlor to the center of the houses it is serving, in this way. But then again, look at these red houses: they're so much closer to the blue pizza parlor, so let's turn them blue. And then again, let's move every pizza parlor to the center of the houses it serves. Now we've reached an optimal solution: starting with random points and iterating this process helped us reach the best locations for the pizza parlors. This algorithm is called k-means clustering.

But now let's just say we don't want to specify the number of clusters to begin with; we want a different way to group the houses. Say they're arranged like this. It would make sense to say the following: if two houses are close, they should be served by the same pizza parlor. So if we go by this rule, let's try to group the houses. Let's look at which houses are the closest to each other: it's these two over here, so we group them. Now what are the next two closest houses? It's these two over here, so we group them. The next two closest houses are these two,
so again we group them. The next two closest are these, so we unite their groups. Now the next two closest houses are right here, so we group them. The next two closest clusters are here, so we join the groups. The next two closest houses are here, but now let's just say that's too big. So all we need to do is specify a distance and say that this distance is too far: when you reach this distance, stop clustering the houses. And now we get our clusters. This algorithm is called hierarchical clustering.

So, congratulations! In this video we've learned many of the main algorithms of machine learning. We learned to find house prices using linear regression; we learned to detect spam email using naive Bayes; we learned to recommend apps using decision trees; we learned to create a model for an admissions office using logistic regression; we learned how to improve it using neural networks; and we learned how to improve it even more using support vector machines. And finally, we learned how to locate pizza parlors around the city using clustering algorithms.

So many questions may arise in your head, such as: are there more algorithms? The answer is yes. Which ones do we use? That's not easy: given a dataset, how do we know which algorithm to pick? How do we compare and evaluate them: given two algorithms, how do we know which one is better than the other on a dataset, considering their running time, their accuracy, etc.? Are there example projects with real data that I can get my hands dirty with? The answers to all these questions, and more, are in the Udacity machine learning nanodegree, so if this interests you, you should take a look at it. Thank you!
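As a companion to the housing example above, here is a minimal Python sketch of fitting a line by gradient descent on the squared error (the least-squares criterion the transcript describes). The dataset, learning rate, and step count are illustrative assumptions, not values from the video.

```python
# A minimal sketch of the line-fitting procedure: start from an arbitrary
# line y = m*x + b, then repeatedly nudge m and b downhill along the
# gradient of the mean squared error (the sum of squared distances idea).
def fit_line(xs, ys, lr=0.01, steps=10000):
    m, b = 0.0, 0.0  # an arbitrary starting line
    n = len(xs)
    for _ in range(steps):
        # gradients of the mean squared error with respect to m and b
        grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= lr * grad_m  # take a small step downhill
        b -= lr * grad_b
    return m, b

# Hypothetical houses: sizes in thousands of square feet, prices in $1000s
sizes = [1.0, 2.0, 3.0]
prices = [70.0, 120.0, 160.0]
m, b = fit_line(sizes, prices)
print(round(m * 2.0 + b))  # the best-fit line's price for the medium house → 117
```

With these three points the least-squares line is roughly y = 45x + 26.7, so the prediction for the 2,000-square-foot house is about $117,000; with a richer cloud of blue dots like the one in the video, the same procedure would land near $120,000.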
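The spam probability in the naive Bayes example can be checked directly from the counts in the transcript, written out here via Bayes' rule:

```python
# Counts from the transcript: 100 previous emails, 25 spam and 75 not;
# the word "cheap" appears in 20 of the spam emails and 5 of the others.
total, spam = 100, 25
cheap_in_spam, cheap_in_ham = 20, 5

# Bayes' rule: P(spam | cheap) = P(cheap | spam) * P(spam) / P(cheap)
p_cheap_given_spam = cheap_in_spam / spam          # 20/25
p_spam = spam / total                              # 25/100
p_cheap = (cheap_in_spam + cheap_in_ham) / total   # 25/100
p_spam_given_cheap = p_cheap_given_spam * p_spam / p_cheap
print(p_spam_given_cheap)  # → 0.8
```

This is exactly the 80/20 split read off the table: 20 spam emails out of the 25 that contain the word.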
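The app-recommendation decision tree built above is just two nested questions, which can be written directly as code:

```python
# The decision tree from the app example: age first (the more decisive
# feature), then gender for the users who are 20 or older.
def recommend(age, gender):
    if age < 20:
        return "Pokemon Go"
    return "WhatsApp" if gender == "female" else "Snapchat"

print(recommend(15, "female"))  # → Pokemon Go
print(recommend(27, "male"))    # → Snapchat
print(recommend(32, "female"))  # → WhatsApp
```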
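The log loss mentioned in the logistic regression example can be sketched as follows: a line over the grid is scored by turning each student's signed distance into a probability and penalizing confident mistakes heavily. The students, lines, and sigmoid scoring are illustrative assumptions.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

# Log loss of the line a*x + b*y + c = 0 over students (test, grades);
# labels: 1 = accepted, 0 = rejected. Misclassified points contribute
# large values, correctly classified points contribute small ones.
def log_loss(params, points, labels):
    a, b, c = params
    total = 0.0
    for (x, y), label in zip(points, labels):
        p = sigmoid(a * x + b * y + c)  # predicted probability of acceptance
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(points)

# Hypothetical students: the first two were accepted, the last two rejected
points = [(9, 8), (8, 7), (3, 4), (2, 3)]
labels = [1, 1, 0, 0]
good_line = (1.0, 1.0, -11.0)   # roughly separates the two groups
bad_line = (-1.0, -1.0, 11.0)   # the same line, oriented the wrong way
print(log_loss(good_line, points, labels) < log_loss(bad_line, points, labels))  # → True
```

Gradient descent on this quantity, rather than on the raw count of misclassified points, is what makes the calculus work.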
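The neural network picture, where the accepted region is the intersection of two half-planes, can be sketched as a tiny network of threshold nodes. The two particular lines here are hypothetical, chosen only to mirror the "low grades reject you regardless of test score" idea.

```python
# Two hidden nodes each check one line; the output node fires only when
# both inputs fire, producing the intersection of the two green regions.
def step(t):
    return 1 if t >= 0 else 0

def accepts(test, grades):
    node1 = step(test - 5)    # hypothetical line: test score at least 5
    node2 = step(grades - 2)  # hypothetical line: grades at least 2
    return step(node1 + node2 - 2)  # AND: fire only if both nodes fire

print(accepts(9, 8))  # → 1  (good test and grades: inside both regions)
print(accepts(9, 1))  # → 0  (grades too low, rejected despite the high test score)
```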
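The support-vector idea above, preferring the line that sits farthest from the nearest points, can be illustrated by comparing the margin of two candidate separators. The points and lines are hypothetical stand-ins for the yellow and purple lines in the transcript.

```python
import math

# Margin of the line a*x + b*y + c = 0: its distance to the closest point.
def margin(a, b, c, points):
    return min(abs(a * x + b * y + c) / math.hypot(a, b) for x, y in points)

points = [(1, 1), (2, 2), (5, 5), (6, 6)]  # two greens, two reds (hypothetical)
yellow = (1, 1, -5)   # x + y = 5: separates them, but hugs the point (2, 2)
purple = (1, 1, -7)   # x + y = 7: sits midway between (2, 2) and (5, 5)
print(margin(*yellow, points) < margin(*purple, points))  # → True
```

A real support vector machine finds the maximum-margin line by optimization rather than by comparing a few candidates, but the quantity being maximized is this margin.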
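The kernel-trick table from the transcript can be reproduced in a few lines: the product xy gives 0 for the red points and 2 for the green ones, which is the same as lifting each point (x, y) to the height z = xy in 3D.

```python
# The four points from the curve-trick example, with their colors.
points = {(0, 3): "red", (1, 2): "green", (2, 1): "green", (3, 0): "red"}

for (x, y), color in points.items():
    # The lift (x, y) -> (x, y, x*y): red points stay at height 0,
    # green points rise to height 2, so the plane z = 1 splits them,
    # just as the hyperbola xy = 1 does in the plane.
    print((x, y), color, "-> height", x * y)
```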
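The pizza-parlor procedure, assign each house to its nearest parlor and then move each parlor to the center of its houses, is k-means clustering, and it fits in a short sketch. The house locations and starting guesses are hypothetical.

```python
import math

# k-means: repeat (1) send each house to its closest parlor,
# (2) move each parlor to the mean position of the houses it serves.
def k_means(houses, parlors, steps=10):
    parlors = list(parlors)
    for _ in range(steps):
        clusters = [[] for _ in parlors]
        for house in houses:  # step 1: nearest-parlor assignment
            i = min(range(len(parlors)), key=lambda j: math.dist(house, parlors[j]))
            clusters[i].append(house)
        for i, cluster in enumerate(clusters):  # step 2: recenter each parlor
            if cluster:
                parlors[i] = (sum(x for x, _ in cluster) / len(cluster),
                              sum(y for _, y in cluster) / len(cluster))
    return parlors

# Hypothetical houses in three visible groups, with rough starting guesses
houses = [(0, 0), (2, 0), (0, 2), (10, 10), (12, 10), (10, 12), (20, 0), (22, 0)]
starts = [(1, 1), (9, 9), (19, 1)]
print(k_means(houses, starts))  # one parlor settles at the center of each group
```

In the video the starting locations are random; here they are fixed so the run is reproducible. With unlucky random starts, k-means can settle into a worse arrangement, which is why it is often restarted several times.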
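Finally, the "keep merging the closest houses until the distance gets too big" rule is hierarchical clustering; here is a single-linkage sketch of it, with hypothetical houses and stopping distance.

```python
import math

# Single-linkage hierarchical clustering: every house starts in its own
# group; repeatedly merge the two closest groups, stopping once the
# closest pair of groups is farther apart than stop_distance.
def hierarchical(houses, stop_distance):
    clusters = [[h] for h in houses]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > stop_distance:  # "this distance is too far" -- stop merging
            break
        clusters[i] += clusters.pop(j)
    return clusters

houses = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
print(len(hierarchical(houses, stop_distance=3)))  # → 2 groups of houses
```

Unlike k-means, no number of clusters is chosen up front; the stopping distance determines how many groups come out.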
Info
Channel: Luis Serrano
Views: 868,313
Rating: 4.9391785 out of 5
Keywords: machine learning, data science, math, statistics, probability, programming, data, artificial intelligence, mathematics, linear regression, logistic regression, neural network, decision tree, naive bayes, support vector machines, clustering, supervised machine learning, unsupervised machine learning
Id: IpGxLWOIZy4
Length: 30min 53sec (1853 seconds)
Published: Fri Sep 09 2016