Lecture 6 - Support Vector Machines | Stanford CS229: Machine Learning (Autumn 2018)

Captions
All right. Hey, everyone. Morning and welcome back. So what I'd like to do today is continue our discussion of Naive Bayes. In particular, we've described how to use Naive Bayes, a generative learning algorithm, to build a spam classifier that will almost work, and today you'll see how Laplace smoothing is one other idea you need to add to the Naive Bayes algorithm we described on Monday to really make it work for, say, email spam classification, or text classification generally. Then we'll talk about a different version of Naive Bayes that's even better than the one we've been discussing so far, and talk a little bit about advice for applying machine learning algorithms. This should be useful as you get started on your CS229 class projects as well: a strategy for how to choose an algorithm, what to do first, what to do second. And then we'll start with an intro to support vector machines. Okay? So to recap, the Naive Bayes algorithm is a generative learning algorithm in which, given a piece of email, or a Twitter message, or some piece of text, you take a dictionary and put in zeros and ones depending on whether different words appear in a particular email, and this becomes your feature representation for, say, an email that you're trying to classify as spam or not spam. Using indicator function notation, x_j (I've been using the subscript j to index into the features and the superscript i to index into the training examples, though you'll see I'm not always consistent with that) is the indicator for whether word j appears in an email. And to build a generative model for this, we need to model these two terms, p of x given y and p of y. Gaussian discriminant analysis models these two terms with a Gaussian and a Bernoulli respectively, and Naive Bayes uses a different model. With Naive Bayes in particular, p of x given y is modeled as a product of the conditional probabilities of the individual features given the class label y. And so the parameters of the Naive Bayes model are: phi subscript y, the class prior, which is the chance that y is equal to 1 before you've seen any features; phi subscript j given y equals 0, which is the chance of word j appearing in a non-spam email; and phi subscript j given y equals 1, which is the chance of word j appearing in a spam email. Okay? And if you derive the maximum likelihood estimates, you'll find that the maximum likelihood estimate of phi y is just the fraction of training examples that were spam, and the maximum likelihood estimate of this (this is just indicator function notation, a way of writing: look at all of your emails with label y equals 0 and count, in what fraction of them did this word x_j appear?). Right? And then finally, at prediction time, you calculate p of y equals 1 given x according to Bayes' rule. Okay? All right. So it turns out this algorithm will almost work, and here's where it breaks down. Actually, every year there are some CS229 students, some machine learning students, who will do a class project, and some of you will end up submitting this to an academic conference.
Some CS229 class projects get submitted as conference papers pretty much every year. One of the top machine learning conferences is NIPS; NIPS stands for Neural Information Processing Systems. Let's say that you have 10,000 words in your dictionary, and that the word NIPS corresponds to word number 6017 in your 10,000-word dictionary. Up until now, presumably you've not had a lot of emails from your friends asking, "Hey, do you want to submit a paper to the NIPS conference or not?" And so if you use your current set of emails to find these maximum likelihood estimates of the parameters, you will probably estimate that the probability of seeing this word, given that it's a spam email, is zero: zero over the number of examples that you've labeled as spam. If you train up this model using your personal email, probably none of the emails you've received so far had the word NIPS in it. So if you plug in this formula for the maximum likelihood estimate, the numerator is 0, and so your estimate of this is 0. And similarly, this is also 0 over the number of non-spam emails; that's just this formula, right? And statistically it's just a bad idea to say that the chance of something is 0 just because you haven't seen it yet. Where this will cause the Naive Bayes algorithm to break down is if you use these as estimates of the parameters: this is your estimate of the parameter phi subscript 6017 given y equals 1, and this is phi subscript 6017 given y equals 0. And if you ever calculate this probability, it's equal to a product from j equals 1 through n (let's say you have 10,000 words) of p of x_j given y, right? So if you train your spam classifier on the emails you've gotten up until today, and then after CS229 your project teammates start sending you emails saying, "Hey, we like the class project, shall we consider submitting it to the NIPS conference?" The NIPS conference deadline is usually in May or June most years, so you finish your class project this December, work on it some more in January, February, March, April, and then maybe submit it to the conference in May or June of 2019. When you start getting emails from your friends saying, "Let's submit our paper to the NIPS conference," and you start to see the word NIPS in your email maybe in March of next year, this product of probabilities will have a 0 in it. So this thing that I've just circled will evaluate to 0, because you're multiplying a lot of numbers, one of which is 0. And in the same way this is also 0, and this is also 0, because there'll be that one term in that product over there. So what that means is, if you train a spam classifier today using all the data you have in your email inbox so far, and if tomorrow, or two months from now, whenever, you get the first email from your teammates that has the word NIPS in it, your spam classifier will estimate this probability as 0 over 0 plus 0. Now, apart from the divide-by-zero error, it turns out that statistically it's just a bad idea to estimate the probability of something as 0 just because you have not seen it even once yet.
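For reference, written out in the notation of the lecture notes, the estimates and the failure mode being described on the board are:

```latex
\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1,\ y^{(i)}=1\}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}}, \qquad
\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1,\ y^{(i)}=0\}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}}

\text{If word 6017 never appears in the training set, then } \phi_{6017|y=1} = 0 \text{ and } \phi_{6017|y=0} = 0, \text{ so}

p(y=1 \mid x) = \frac{\prod_j p(x_j \mid y=1)\, p(y=1)}{\prod_j p(x_j \mid y=1)\, p(y=1) + \prod_j p(x_j \mid y=0)\, p(y=0)} = \frac{0}{0+0}.
```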
So what I want to do is describe Laplace smoothing, which is a technique that helps address this problem. In order to motivate Laplace smoothing, let me use a different example for now. Let me put aside Naive Bayes for a moment; I want to talk about Laplace smoothing, and we'll come back to apply Laplace smoothing to Naive Bayes. Several years ago, I was tracking the progress of the Stanford football team. That year, on 9/12, our football team played Wake Forest, and we did not win that game. Then on 10/10, we played Oregon State and we did not win that game. Arizona, we did not win that game. We played Caltech, we did not win that game. [LAUGHTER] These were almost all the away games, the out-of-state games, we played that year. And so suppose you're the Stanford football team's biggest fan; you followed them to every single out-of-state game and watched all these games. The question is, after this unfortunate streak, when there's another game on New Year's Eve and you follow them to that game, what's your estimate of the chances of their winning or losing? If you use maximum likelihood, so let's say this is the variable x, you would estimate the probability of their winning as: count up the number of wins and divide that by the number of wins plus the number of losses. And in this case you'd estimate this as 0 divided by 0 plus 4, which is equal to 0. Okay? That's kind of mean, right? [LAUGHTER] They lost 4 games, but you say, no, the chance of their winning is 0, with absolute certainty. And statistically, this is just not a good idea. So with Laplace smoothing, what we're going to do is imagine that we saw one more of each outcome than we actually did: add 1 to the number of wins we actually saw, and also add 1 to the number of losses. So if you actually saw 0 wins, pretend you saw 1, and if you saw 4 losses, pretend you saw 5. With Laplace smoothing, you end up adding 1 to the numerator and 2 to the denominator, and so this ends up being 1 over 6, and that's maybe a more reasonable estimate for the chance of them winning or losing the next game.
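To make that arithmetic explicit:

```latex
\text{Maximum likelihood:}\quad \hat{p}(\text{win}) = \frac{\#\text{wins}}{\#\text{wins}+\#\text{losses}} = \frac{0}{0+4} = 0

\text{Laplace smoothing:}\quad \hat{p}(\text{win}) = \frac{\#\text{wins}+1}{(\#\text{wins}+1)+(\#\text{losses}+1)} = \frac{0+1}{1+5} = \frac{1}{6}
```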
And there's actually a certain set of circumstances under which this is the optimal estimate; I didn't just make this up out of thin air. Laplace was an ancient, well-known, very influential mathematician, and he actually tried to estimate the chance of the sun rising the next day. His reasoning was, well, we've seen the sun rise all these times, but that doesn't mean we should be absolutely certain the sun will still rise tomorrow: we've seen the sun rise 10,000 times, so we can be really confident it will rise again tomorrow, but maybe not absolutely certain, because maybe something will go wrong, who knows what will happen in this galaxy? And so he derived the optimal way of estimating the chance the sun will rise tomorrow. This is actually an optimal estimate under certain assumptions, which we don't need to worry about. But it turns out that if you are Bayesian, with a uniform Bayesian prior on the chance of the sun rising tomorrow, so if the chance of the sun rising tomorrow is uniformly distributed in the unit interval, anywhere from 0 to 1, then after a set of observations of this coin toss of whether the sun rises, this is actually the Bayesian optimal estimate of the chance of the sun rising tomorrow, okay? If you didn't understand what I just said in the last 30 seconds, don't worry about it; it's taught in advanced Bayesian statistics classes. But mechanically, what you should do is take this formula and add 1 to the count you actually saw for each of the possible outcomes. More generally, if you're estimating probabilities for a k-way random variable, then the maximum likelihood estimate of the chance that x equals i is this, and with Laplace smoothing you add 1 to the numerator and you add k to the denominator. Okay? So for Naive Bayes, the way this modifies your parameter estimates is this; I'm just going to copy over the formula from above. That's the maximum likelihood estimate, and with Laplace smoothing you add 1 to the numerator and add 2 to the denominator, and this means that these estimated probabilities are never exactly 0 or exactly 1, which takes away that problem of 0 over 0. Okay. If you implement this algorithm, it's not a great spam classifier, but it's not terrible either. And one nice thing about this algorithm is it's so simple: estimating the parameters is just counting, which can be done very efficiently, and classification time is just multiplying a bunch of probabilities together, so it's a very efficient algorithm. All right, any questions about this? Yeah. [inaudible]? Oh sorry, this should be y. Yes, thank you. Oh, by the way, I actually was following the Stanford football team that year, so, you know, they lost. [LAUGHTER] I love our football team. They're doing much better right now; that was a few years ago. [LAUGHTER] All right. So in the examples we've talked about so far, the features were binary valued. One quick generalization: when the features are multinomial valued, that is, when they take on several discrete values, here's one example. We talked about predicting housing prices, right? That was our very first machine learning example. Let's say you have a classification problem instead: you're listing a house you want to sell, and you want to know the chance of this house being sold within the next 30 days. So it's a classification problem.
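Before turning to that housing example, here is a minimal NumPy sketch of the Laplace-smoothed Naive Bayes training and prediction just described; the function and variable names are my own, not anything from the lecture, and X is assumed to be the binary word-indicator matrix.

```python
import numpy as np

def train_bernoulli_nb(X, y):
    """X: (m, n) binary word-indicator matrix; y: (m,) integer labels in {0, 1}.
    Returns Laplace-smoothed parameter estimates."""
    phi_y = y.mean()                                       # class prior p(y = 1)
    # Laplace smoothing: add 1 to each numerator count, 2 to each denominator
    # (two possible outcomes per binary feature).
    phi_j_given_1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    phi_j_given_0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_y, phi_j_given_1, phi_j_given_0

def predict_bernoulli_nb(x, phi_y, phi1, phi0):
    """Posterior p(y = 1 | x) by Bayes' rule, computed in log space for stability."""
    log_p1 = np.log(phi_y)     + np.sum(x * np.log(phi1) + (1 - x) * np.log(1 - phi1))
    log_p0 = np.log(1 - phi_y) + np.sum(x * np.log(phi0) + (1 - x) * np.log(1 - phi0))
    return 1.0 / (1.0 + np.exp(log_p0 - log_p1))   # = p1 / (p1 + p0)
```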
So if one of the features is the size of the house x, then one way to turn that into a discrete feature would be to choose a few buckets: say, the size is less than 400 square feet, versus 400 to 800, or 800 to 1,200, or greater than 1,200 square feet. Then you can set the feature x_j to one of four values. That's how you discretize a continuous-valued feature into a discrete-valued feature. And if you want to apply Naive Bayes to this problem, then the probability of x given y is just the same as before, a product from j equals 1 through n of p of x_j given y, where now this can be a multinomial probability. If x_j takes on one of four values, then instead of a Bernoulli distribution over two possible outcomes, this can be estimated as a multinomial probability mass function over four possible outcomes, if you discretize the size of a house into four values. And if you ever discretize variables, a typical rule of thumb in machine learning is to discretize into 10 values, 10 buckets; that often seems to work well enough. I drew 4 here so I don't have to write out all 10 buckets, but if you ever discretize variables, most people will start off with discretizing things into 10 values. All right. So this is how you can apply Naive Bayes to other problems as well, including classifying, for example, whether a house is likely to be sold in the next 30 days. Now, there's a different variation on Naive Bayes that I want to describe to you that is actually much better for the specific problem of text classification. Our feature representation for x so far was the following, with a dictionary: a, aardvark, ..., buy, and so on. So let's say you get a very spammy email that says "Drugs, buy drugs now." [LAUGHTER] This is meant as an illustrative example; I'm not telling any of you to buy drugs. [LAUGHTER] If you have a dictionary of 10,000 words, let's say a is word 1 and aardvark is word 2, just to make this example concrete; let's say the word buy is word 800, drugs is word 1,600, and now is word 6,200 in your sorted dictionary of 10,000 words. Then the representation for x will be 0, 0, and so on, with a 1 there, and a 1 there, and a 1 there. Okay? Now, one interesting thing about Naive Bayes is that it throws away the fact that the word drugs appeared twice, so that's losing a little bit of information. In this feature representation each feature is either 0 or 1, and that's part of why it throws away the information that the word drugs appeared twice and maybe should be given more weight in your classifier. There's a different representation, which is specific to text, and text data has the property that documents can be very long or very short; you can have a five-word email or a 1,000-word email, and somehow you're taking very short or very long emails and mapping them to a feature vector that's always the same length. So here's a different representation: for that email that says "Drugs, buy drugs now," we're going to represent it as a four-dimensional feature vector, right?
And so this is going to be n-dimensional for an email of length n. So rather than a 10,000-dimensional feature vector, we now have a four-dimensional feature vector, but now x_j is an index from 1 to 10,000 instead of just being 0 or 1. Okay? And n varies by training example, so n_i is the length of email i: the longer the email, the longer this feature vector x will be, and the shorter the email, the shorter this feature vector will be. So, just to give names to the algorithms we're going to develop, and these are really quite confusing names, but this is what the community calls them: the model we've talked about so far is sometimes called the Multivariate Bernoulli event model. Bernoulli means coin tosses, and multivariate means there are 10,000 Bernoulli random variables in this model; the "event model" terminology comes from statistics, I guess. And the new representation we're going to talk about is called the Multinomial event model. These two names are frankly quite confusing, but as far as I know, one of my friends, Andrew McCallum, wrote the paper that named these two algorithms, and these are the names we seem to use. So with this new model, we're going to build a generative model, and because it's a generative model, we model p of x, y, which can be factored as follows: using the Naive Bayes assumption, we're going to assume that p of x given y is a product from j equals 1 through n of p of x_j given y, and then times p of y; that's the second term. Now, one of the reasons these two models were frankly very confusing to the machine learning community is that this is exactly the equation you saw on Monday when we described Naive Bayes for the first time, that p of x given y is a product of probabilities. So this equation looks cosmetically identical, but with this second model, the confusingly named Multinomial event model, the definitions of x_j and of n are very different. Instead of a product from 1 through 10,000, there's a product from 1 through the number of words in the email, and this is now a multinomial probability rather than a binary or Bernoulli probability. Okay? And it turns out that with this model the parameters are the same as before: phi y is the probability of y equals 1, and the other parameters of this model are phi subscript k given y equals 0, which is the chance that x_j equals k given y equals 0. And just to make sure you understand the notation, see if this makes sense: this probability is the chance of word blank being blank if the label y equals 0. So what goes into those two blanks? Actually, what goes in the second blank? Yeah? [inaudible] Yes, right. So it's the chance of, say, the third word in the email being the word drugs, or the chance of the second word in the email being buy, or whatever. And one thing we implicitly assume, and part of why this is tricky, is that this probability doesn't depend on j, right?
That is, for every position in the email, the chance of the first word being drugs is the same as the chance of the second word being drugs, which is the same as the chance of the third word being drugs, and that's why j doesn't actually appear on the left-hand side. Makes sense? Any questions about this? No? Okay. All right. And so, given a new email, a test email, the way you would calculate this probability is by plugging the parameters that you estimate from the data into this formula. Okay? And then the other set of parameters is this, the same thing with y equals 1 instead of y equals 0. And then for the maximum likelihood estimates of the parameters, I'll just write out one of them: your estimate of the chance of a given word, in any position, being word k. What's the chance of some word in a non-spam email being the word drugs, say? The chance of that is equal to this; it's indicator function notation and it looks complex, so let me say in a second what it actually means. If you work out the English meaning of this complicated formula, it basically says: look at all the words in all of your non-spam emails, all the emails with y equals 0, and of all of those words, what fraction of them is the word drugs? That's your estimate of the chance of the word drugs appearing in some position in a non-spam email. So in math, the denominator is a sum over your training set of the indicator for not spam, times the number of words in that email, so the denominator ends up being the total number of words in all of your non-spam emails in your training set. And the numerator is a sum over your training set, sum from i equals 1 through m, of the indicator of y_i equals 0, so you count only the non-spam emails, and for each non-spam email, a sum over j equals 1 through n_i: go over the words in that email and count how many of those words are word k. And so if in your training set you have, say, 100,000 words in your non-spam emails and 200 of them are the word drugs, then this ratio will be 200 over 100,000. Okay? And then lastly, to implement Laplace smoothing with this, you would add 1 to the numerator as usual, and then, let's see, what would you add to the denominator? Wait, but what is k? Not k, right? k is a variable here; k indexes into the words. What do we have? [Student: About 10,000.] 10,000, cool. How come? Why 10,000? [inaudible] Cool. Yeah. Right. Oh, I think I just realized why you said k; I think I'm overloading notation. When defining Laplace smoothing earlier, I used k as the number of possible outcomes, but here k is an index. Right? So you add 1 to the numerator and add the number of possible outcomes to the denominator, which in this case is 10,000. So this is the probability of x_j being equal to the value k, where k ranges from 1 to 10,000 if you have a dictionary of 10,000 words that you're modeling. And since the number of possible values for x_j is 10,000, you add 10,000 to the denominator.
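Here's a minimal sketch of those counts in code, assuming each email has already been converted to a list of word indices; the names and the 0-based indexing are my own choices, not from the lecture.

```python
import numpy as np

def train_multinomial_nb(emails, y, vocab_size=10_000):
    """emails: list of arrays/lists of word indices in {0, ..., vocab_size - 1};
    y: (m,) integer labels in {0, 1}.
    Returns Laplace-smoothed estimates of phi_y, phi_{k|y=1}, phi_{k|y=0}."""
    counts = np.ones((2, vocab_size))             # Laplace: start every word count at 1 ...
    totals = np.full(2, vocab_size, dtype=float)  # ... so each denominator starts at vocab_size
    for words, label in zip(emails, y):
        lab = int(label)
        np.add.at(counts[lab], words, 1)          # count occurrences of each word index
        totals[lab] += len(words)                 # total word positions with this label
    phi_k_given_1 = counts[1] / totals[1]
    phi_k_given_0 = counts[0] / totals[0]
    phi_y = float(np.mean(y))
    return phi_y, phi_k_given_1, phi_k_given_0
```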
Makes sense? Cool. Yeah, question? [inaudible] Oh, what do you do if a word's not in your dictionary? There are two approaches to that. One is to just throw it away; ignore it, disregard it. The second approach is to take the rare words and map them to a special token, which traditionally is denoted UNK, for unknown word. So if in your training set you decide to take just the top 10,000 words into your dictionary, then everything that's not in the top 10,000 words maps to the unknown word token, the unknown word special symbol. Yeah. [inaudible] Oh, why did I write the 1 before? Oh, this is indicator function notation. So the indicator of "2 equals 1 plus 1" is 1, because that's true, and the indicator of "3 equals 5" is 0, because that's false. So this is a little formula that's either true or false depending on whether y_i is 0. And I guess since y_i is 0 or 1, the indicator of y_i equals 0 is the same as not y_i, so 1 minus y_i would also work, yeah. Cool. Okay, great. So I think both of these models, including the details of the maximum likelihood estimates, are written out in more detail in the lecture notes. So, when would you use the Naive Bayes algorithm? It turns out the Naive Bayes algorithm is actually not very competitive with other learning algorithms; for most problems you'll find that logistic regression will work better in terms of delivering higher accuracy than Naive Bayes. But the advantages of Naive Bayes are, first, it's computationally very efficient, and second, it's relatively quick to implement. It doesn't require an iterative gradient descent procedure, and the number of lines of code needed to implement Naive Bayes is relatively small. So if you're facing a problem where the way to go is to implement something quick and dirty, then Naive Bayes is maybe a reasonable choice. And as you work on your class projects, some of you, probably a minority, will try to invent a new machine learning algorithm and write a research paper. Inventing a machine learning algorithm is a great thing to do; it helps a lot of people on a lot of different applications, so that's one path. The majority of class projects in CS229 will try to apply a learning algorithm to a problem that you care about: apply it to a research project you're working on somewhere at Stanford, or to a fun application you want to build, or to a business application for some of you taking this remotely on SCPD. And if your goal is not to invent a brand new learning algorithm, but to take existing algorithms and apply them, then the rule of thumb I'd suggest is this: when you get started on a machine learning project, start by implementing something quick and dirty, rather than the most complicated possible learning algorithm. Start by implementing something quickly, train the algorithm, look at how it performs, and then use that to debug the algorithm, and keep iterating on that. Here at Stanford, we're very good at coming up with very complicated algorithms.
But if your goal is to make something work for an application, rather than inventing a new learning algorithm and publishing a paper on a new technical contribution; if your main goal is an application, whether that's understanding news better, or improving the environment, or estimating prices, or whatever, and your primary objective is just to make an algorithm work, then rather than building a very complicated algorithm at the onset, I would recommend implementing something quickly, so that you can better understand how it's performing, and then do error analysis, which we'll talk about later, and use that to drive your development. One analogy I sometimes make is this: if you're writing a new computer program with 10,000 lines of code, one approach is to write all 10,000 lines of code first and then try compiling it for the first time. That's clearly a bad idea, right? You should write small modules, run them, do unit testing, and build up the program incrementally, rather than writing 10,000 lines of code and only then starting to see what syntax errors you're getting. And I think it's similar for machine learning. Instead of building a very complicated algorithm from the get-go, you build a simpler algorithm, test it, see what it's doing wrong, and improve from there. You often end up getting to a better-performing algorithm faster. So here's one example. This is actually something I used to work on; I actually started a conference on email and anti-spam, and my students worked on spam classification many years ago. It turns out that when you're starting out on a new application problem, it's hard to know what the hardest part of the problem is. If you want to build an anti-spam classifier, there are lots of things you could work on. For example, spammers will deliberately misspell words. There's a lot of mortgage spam, "refinance your mortgage" or whatever, but instead of writing the word mortgage, spammers will write M-0-R-T-G-A-G-E with a zero, or replace some of the letters with symbols like slashes. All of us as people have no trouble reading this as the word mortgage, but this will trip up a spam filter; it might map the word to an unknown word token, because it's off by just a letter and the filter hasn't seen it before, and that lets it slip by the spam filter. So that's one idea for improving a spam filter; actually, one of our PhD students [inaudible] wrote a paper on mapping these back to the real words, so the spam filter can see the words the way that humans see them. That's one idea. Another observation is that a lot of spam email spoofs email headers: spammers often try to hide where the email truly came from by spoofing the email header, the address and other information. Another thing you might do is try to fetch the URLs that are referred to in the email and then analyze the web pages they lead to. So there are a lot of things you could do to improve a spam filter, and any one of these topics could easily be three or six months of research.
But when you're building, say, a new spam filter for the first time, how do you actually know which of these is the best investment of your time? So my advice to those of you working on projects, if your primary goal is just to get the thing to work, is to not somewhat arbitrarily dive in and spend six months on improving one of these, or six months trying to analyze email headers. Instead, implement a more basic algorithm first; in other words, implement something quick and dirty, and then look at the examples that your learning algorithm is still misclassifying. If, after you've implemented a quick and dirty algorithm, you find that your anti-spam algorithm is misclassifying a lot of examples with these deliberately misspelled words, it's only then that you have evidence that it's worth spending a bunch of time solving the deliberately-misspelled-words problem. And if you implement a spam filter and see that it's not misclassifying a lot of examples with misspelled words, then I would say don't bother; go work on something else instead, or at least treat that as a low priority. Okay. So one of the uses of GDA, Gaussian discriminant analysis, as well as Naive Bayes, is this: they're not going to be the most accurate algorithms. If you want the highest classification accuracy, there are other algorithms, like logistic regression, or SVMs, which we'll talk about, or neural networks, which we'll talk about later, that will almost always give you higher classification accuracy than these algorithms. But the advantage of Gaussian discriminant analysis and Naive Bayes is that they are very quick to train, and training is non-iterative: Naive Bayes is just counting, and GDA is just computing means and covariances, so they're very computationally efficient, and they're also simple to implement. So they can help you implement that quick and dirty thing that helps you get going more quickly. And so for your projects as well, I would advise most of you, as you start working on your project, not to spend weeks designing exactly what you're going to do. If you have an applied project, get a dataset and apply something simple: start with logistic regression, not a neural network or something more complicated, or start with Naive Bayes, see how that performs, and then go from there. Okay? All right. So that's it for Naive Bayes and generative learning algorithms. The next thing I want to do is move on to a different type of classifier, which is the support vector machine. Let me just check, any questions about this before I move on? Yeah. [inaudible] Sorry, can you use logistic regression with [OVERLAPPING] discrete variables? [inaudible] Oh, I see. Yeah, right. So one of the weaknesses of the Naive Bayes algorithm is that it treats all of the words as completely separate from each other. Even though the words one and two are quite similar, and words like mother and father are quite similar, with this feature representation it doesn't know the relationship between these words.
So in machine learning there are other ways of representing words. There's a technique called word embeddings, in which you choose a feature representation that encodes the fact that the words one and two are quite similar to each other, the words mother and father are quite similar to each other, and the words London and Tokyo are quite similar to each other because they're both city names. This is a technique I was not planning to teach here, but it is taught in CS230, the neural networks class; you can also read up on word embeddings or look at some of the videos and resources from CS230 if you want to learn about that. Word embedding techniques, which really come from neural networks, will reduce the number of training examples you need, so they give you a good text classifier because more knowledge comes baked in. Cool. Anything else? Cool. By the way, I do this in the other classes too; in some of the other classes, somebody asks a question and I go, no, we don't do that, we just covered that in CS229. [LAUGHTER] Actually, CS224N, the NLP class, I'm sure also covers this. Okay, so: support vector machines, SVMs. Let's say you have a classification problem where the dataset looks like this, and you want an algorithm to find a nonlinear decision boundary. The support vector machine will be an algorithm to help us find potentially very, very non-linear decision boundaries like this. Now, one way to build a classifier like this would be to use logistic regression. But if this is x_1 and this is x_2, logistic regression will fit a straight-line decision boundary to the data, and Gaussian discriminant analysis will also end up with a straight-line decision boundary. So one way to apply logistic regression to a problem like this would be to take your feature vector x_1, x_2 and map it to a higher-dimensional feature vector with x_1, x_2, x_1 squared, x_2 squared, x_1 times x_2, maybe x_1 cubed, x_2 cubed, and so on, and have a new feature vector, which we'll call phi of x, that has these higher-dimensional features. It turns out that if you do this and then apply logistic regression to the augmented feature vector, logistic regression can learn non-linear decision boundaries; with these additional features, it can learn a decision boundary that has, say, the shape of an ellipse. But manually choosing these features is a little bit of a pain. I actually don't know what set of features would get you a decision boundary like that one, something more complex than just an ellipse. What we will see with support vector machines is that we'll be able to derive an algorithm that can take input features x_1, x_2, map them to a much higher dimensional set of features, and then apply a linear classifier, in a way similar to logistic regression but different in details, that allows you to learn very non-linear decision boundaries. Okay.
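As a small sketch, this is the kind of hand-built feature mapping being described; the function phi and the particular monomials chosen here are just an illustrative assumption, not a prescribed recipe from the lecture.

```python
import numpy as np

def phi(x):
    """Map a 2-d feature vector [x1, x2] to a higher-dimensional vector of monomials,
    so a linear classifier in phi(x) corresponds to a nonlinear (e.g. elliptical)
    decision boundary in the original x space."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2, x1**3, x2**3])

# e.g. train logistic regression on phi(x) instead of x to get an ellipse-shaped boundary.
print(phi(np.array([1.5, -2.0])))
```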
And I think one of the reasons support vector machines are used today is that the SVM is a relatively turn-key algorithm, and what I mean by that is it doesn't have too many parameters to fiddle with. Even for logistic regression or linear regression, you might have to tune the learning rate alpha, and that's just one more thing to fiddle with: you try a few values and hope you didn't mess up how you set it. Support vector machines today have very robust, very mature software packages that you can just download and train on a problem, and the algorithm will pretty much converge without you having to worry too much about the details. In the grand scheme of things today, I would say support vector machines are not as effective as neural networks for many problems, but one great property of support vector machines is that they're turn-key: you kind of just turn the key and it works, and there aren't as many parameters, like the learning rate and other things, that you have to fiddle with. Okay, so the road map is that we're going to develop the following set of ideas. We'll talk about the optimal margin classifier today, and we'll start with the separable case, meaning we'll start off with datasets that we assume look like this, that are linearly separable. The optimal margin classifier is the basic building block of the support vector machine, and we'll first derive an algorithm that has some similarities to logistic regression but differs in important ways, to find a linear classifier for training sets like this, which we assume for now can be linearly separated. So we'll do that today. And then what you'll see on Wednesday, excuse me, next Monday, is an idea called kernels. The kernel idea is one of the most powerful ideas in machine learning: how do you take a feature vector x, maybe in R^2, and map it to a much higher dimensional set of features? In our example over there, that was R^5, and then you train an algorithm on this high dimensional set of features. The cool thing about kernels is that this high dimensional set of features may not be R^5; it might be R^100,000, or it might even be infinite dimensional. With the kernel formulation, we're going to take the original set of features you're given, for the houses you're trying to sell or the medical conditions you're trying to predict, and map this two-dimensional feature vector into a maybe infinite dimensional set of features. What this does is relieve us of a lot of the burden of manually picking features, like deciding whether you want square root of x_1, or maybe x_1 times x_2 to the power of two-thirds; you just don't have to fiddle with these features too much, because kernels will allow you to work with an infinitely large set of features. Okay, and then finally we'll talk about the inseparable case. So we'll do the first part today and the rest next Monday, okay. And by the way, the machine learning world has become a little bit funny. If you read the news, the media talks a lot about machine learning, and the media talks about neural networks all the time, right? And you'll hear about neural networks and deep learning a little bit later in this class. But if you look at what actually happens in practice in machine learning,
the set of algorithms actually used in practice is much wider than neural networks and deep learning. We do not live in a neural-networks-only world; we actually use many, many tools in machine learning. It's just that deep learning attracts the attention of the media in a way that's quite disproportionate to what I find useful. I work on that stuff and I love it, but it's not the only thing in the world. In fact, late last night I was talking to an engineer about factor analysis, which we'll learn about later in CS229; it's an unsupervised learning algorithm, and there's an application one of my teams is working on in manufacturing where we're going to use factor analysis or something very similar to it, which is totally not a neural network technique. So all these other techniques, including support vector machines and Naive Bayes, do get used and are important. All right, so let's start developing the optimal margin classifier. First, let me define the functional margin, which, informally, measures how confidently and accurately you classify an example. Here's what I mean. We're going to look at binary classification, and let's start by motivating this with logistic regression. The classifier is h of theta of x equals the logistic function g applied to theta transpose x. And if you turn this into a binary classifier, if you have the algorithm predict not a probability but 0 or 1, then what this classifier will do is predict 1 if theta transpose x is greater than 0, and predict 0 otherwise. That's because theta transpose x greater than 0 means that g of theta transpose x is greater than 0.5; you can use greater than or greater than or equal to, and when it's exactly 0.5 it doesn't really matter what you do. So you predict 1 if theta transpose x is greater than or equal to 0, meaning that the estimated probability of the class being 1 is better than 50/50, and if theta transpose x is less than 0, then you predict that the class is 0. Okay? This is what happens if you have logistic regression output 1 or 0 rather than a probability. In other words, this means that if y_i is equal to 1, then we hope, or we want, that theta transpose x_i is much greater than 0; this double greater-than sign means much greater. Because if the true label is 1, then if the algorithm is doing well, hopefully theta transpose x_i will be a large positive number, so the output probability is very, very close to 1. And if indeed theta transpose x is much greater than 0, then g of theta transpose x will be very close to 1, which means it's giving a very confident and correct prediction that y equals 1. And if y_i is equal to 0, then what we want, or what we hope, is that theta transpose x_i is much less than 0, because if this is true, then the algorithm is doing very well on this example. Okay.
So the functional margin, which we'll define in a second, captures this idea: if a classifier has a large functional margin, it means that these two statements are true. And looking ahead a little bit, there's a different thing we'll define in a second called the geometric margin, which is the following; for now, let's assume the data is linearly separable. Okay, so let's say that's the dataset. Now, that green line seems like a pretty good decision boundary for separating the positive and negative examples. Here's another decision boundary, in red, that also separates the positive and negative examples, but somehow the green line looks much better than the red line. Why is that? Well, the red line comes really close to a few of the training examples, whereas the green line has a much bigger separation, a much bigger distance from the positive and negative examples. So even though the red line and the green line both perfectly separate the positive and negative examples, the green line has a much bigger separation, which is called the geometric margin; it has a much bigger geometric margin, meaning a physical separation from the training examples, even as it separates them. Okay. So what I'd like to do in the next 20 minutes or so is formalize the definition of the functional margin, formalize the definition of the geometric margin, and then pose the optimal margin classifier, which is basically the algorithm that tries to maximize the geometric margin. What the rudimentary SVM, the SVM in low-dimensional spaces, also called the optimal margin classifier, will do is pose an optimization problem to try to find the green line to classify these examples, okay? Now, in order to develop SVMs, I'm going to change the notation a little bit again. Because these algorithms have different properties, using slightly different notation to describe them makes the math a little bit easier. So when developing SVMs, we're going to use minus 1 and plus 1 to denote the class labels, and rather than having the hypothesis output a probability like you saw in logistic regression, the support vector machine will output either minus 1 or plus 1. So g of z outputs 1 if z is greater than or equal to 0, and minus 1 otherwise. Instead of a smooth transition from 0 to 1, we have a hard, abrupt transition from minus 1 to plus 1. And finally, whereas previously for logistic regression we had h of theta of x equals g of theta transpose x, where x was in R n plus 1 with x_0 equals 1, for the SVM the parameters will be w and b, and the hypothesis applied to x will be g of w transpose x plus b; I'll just write this out. We're dropping the x_0 equals 1 convention and separating out w and b. This is the standard notation used to develop support vector machines. One way to think about it is that if the parameters used to be theta_0, theta_1, theta_2, theta_3, then theta_0 is the new b and the rest is the new w; you just separate out theta_0, which was previously multiplying x_0. And so this term here becomes the sum from i equals 1 through n of w_i x_i, plus b, since we've gotten rid of x_0.
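In symbols, the notational setup just described is:

```latex
y \in \{-1, +1\}, \qquad
h_{w,b}(x) = g(w^T x + b), \qquad
g(z) = \begin{cases} +1 & z \ge 0 \\ -1 & z < 0 \end{cases}

w^T x + b = \sum_{i=1}^{n} w_i x_i + b
\quad\text{(so } b \text{ plays the role of } \theta_0 \text{ and } w \text{ of } [\theta_1, \dots, \theta_n]^T\text{)}.
```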
All right. So let me formalize the definition of the functional margin. The parameters w and b define a linear classifier; the formulas we just wrote down really define a hyperplane, that is, a line, or in higher dimensions a plane or hyperplane, separating the positive and negative examples. And so we're going to define the functional margin of the hyperplane defined by w, b with respect to one training example as this; hyperplane just means a straight line, but in higher dimensions. If you compare this with the equations we had up there, if y equals 1 we hope for that, and if y equals 0 we hope for that; really, what we hope for is for our classifier to achieve a large functional margin. So if y_i equals 1, then what we want, or what we hope for, is that w transpose x_i plus b is much greater than 0, and if the label is equal to minus 1, then we want, or we hope, that it is much smaller than 0. If you combine these two statements, if you take y_i and multiply it with that quantity, then these two statements together are basically saying that you hope gamma hat i is much greater than 0, because y_i is now plus 1 or minus 1: if y_i is 1, you want w transpose x_i plus b to be a very large positive number, and if y_i is negative 1, you want it to be a very large negative number, and either way you hope the product is very large. So we just hope for that. As an aside, one property of this is that so long as gamma hat i is greater than 0, the algorithm's prediction on that example is equal to y_i; so long as this functional margin gamma hat i is greater than 0, it means that either this is bigger than 0 or this is less than 0, matching the sign of the label, and the algorithm gets this one example correct, at least. In the logistic regression analogy, if the margin is just a little above 0, the predicted probability is just a little above or below 0.5, so it at least gets the example right; and if it is much greater than 0 or much less than 0, then the output probability in the logistic regression case is either very close to 1 or very close to 0. One other definition: I'm going to define the functional margin with respect to the training set to be gamma hat equals the min over i of gamma hat i, where i ranges over your training examples. Okay? So this is a worst-case notion. On the left we defined the functional margin with respect to a single training example, which is how well you're doing on that one training example, and we define the functional margin with respect to the entire training set as how well you're doing on the worst example in your training set.
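Written out:

```latex
\hat{\gamma}^{(i)} = y^{(i)}\left(w^T x^{(i)} + b\right), \qquad
\hat{\gamma} = \min_{i = 1, \dots, m} \hat{\gamma}^{(i)}
```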
Now, this is a somewhat brittle, worst-case notion, and for now, for today, we're assuming that the training set is linearly separable. So we're going to assume that the training set looks like this and that you can separate it with a straight line; we'll relax this later. But because we're assuming, just for today, that the training set is linearly separable, we'll use this worst-case notion and define the functional margin to be the functional margin of the worst training example. Okay? Now, one thing about the definition of the functional margin is that it's actually really easy to cheat and increase it. One thing you can do with this formula is take w and multiply it by 2, and take b and multiply it by 2; then everything here just multiplies by 2, and you've doubled the functional margin, but you haven't actually changed anything meaningful. So one way to cheat on the functional margin is just to scale the parameters by 2, or instead of 2, maybe you multiply all your parameters by 10, and then you've increased the functional margin on your training examples by 10x, but this doesn't actually change the decision boundary; it doesn't change any classification to multiply all of your parameters by a factor of 10. So one thing you could do would be to normalize the length of your parameters. For example, hypothetically, you could impose the constraint that the norm of w is equal to 1, or another way to do that would be to take w and replace it with w over the norm of w, and replace b with b over the norm of w; just divide the parameters by the magnitude, by the Euclidean length, of the parameter vector w. This doesn't change any classifications, it's just rescaling the parameters, but it prevents this kind of cheating on the functional margin. In fact, more generally, you could scale w and b by any value you want and it doesn't matter: you could replace them with w over 17 and b over 17, or any other number, and the classification stays the same. We'll come back and use this property in a little bit. Okay. All right, so having defined the functional margin, let's define the geometric margin, and you'll see in a second how the geometric and functional margins relate to each other. Let's define the geometric margin with respect to a single example. Say you have a classifier: given parameters w and b that define a linear classifier, the equation w transpose x plus b equals 0 defines a straight line. The axes here are x_1 and x_2; in this half of the plane you'll have w transpose x plus b greater than 0, in this half you'll have w transpose x plus b less than 0, and in between is the straight line given by the equation w transpose x plus b equals 0. So given parameters w and b, the upper-right region is where your classifier will predict y equals 1, and the lower-left region is where it will predict y equals negative 1, okay? Now, let's say you have one training example here; that's a training example x_i, y_i, and let's say it's a positive example, okay?
And so your classifier classifies this example correctly, because in this upper-right half plane w transpose x plus b is greater than 0, and so in this upper-right region your classifier is predicting plus 1, whereas in the lower-left region it would predict h of x equals negative 1. That's why this straight line, where it switches from predicting negative to positive, is the decision boundary. So what we're going to do is define this distance, the Euclidean distance from the example to the decision boundary, to be the geometric margin of this training example. Let me just write down what that is. The geometric margin of the hyperplane defined by w, b with respect to one example x_i, y_i is going to be gamma i equals w transpose x_i plus b over the norm of w. I'm not proving why this is the case; the proof is given in the lecture notes, which show why this is the right formula for the Euclidean distance I just drew in the picture up there, between that example and the decision boundary. And this is for a positive example, I guess. More generally, I'm going to define the geometric margin to be equal to this, with the label included, and this definition applies to both positive and negative examples. So the relationship between the geometric margin and the functional margin is that the geometric margin is equal to the functional margin divided by the norm of w. Finally, the geometric margin with respect to the training set again uses the worst-case notion: look through all your training examples and take the worst possible one, and that is your geometric margin on the training set. And I hope the notation is clear: gamma hat is the functional margin and gamma is the geometric margin, okay? So what the optimal margin classifier does is choose the parameters w and b to maximize the geometric margin. In other words, the optimal margin classifier is the baby SVM; it's an SVM for linearly separable data, at least for today. The optimal margin classifier will choose that green straight line, because that line maximizes the distance, maximizes the geometric margin, to all of these examples, okay? Now, as for how you pose this mathematically, there are a few steps of this derivation I don't want to do, so I'll just describe the first step and the last step and leave the in-between steps to the lecture notes. It turns out that one way to pose this problem is: maximize, over gamma, w, and b, the value gamma, subject to the constraint that every training example must have geometric margin greater than or equal to gamma. So you want gamma to be as big as possible, subject to every single training example having at least that geometric margin. This causes you to maximize the worst-case geometric margin. And it turns out that in this form, this is not a convex optimization problem.
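In symbols:

```latex
\gamma^{(i)} = \frac{y^{(i)}\left(w^T x^{(i)} + b\right)}{\lVert w \rVert} = \frac{\hat{\gamma}^{(i)}}{\lVert w \rVert},
\qquad \gamma = \min_{i} \gamma^{(i)}

\text{First formulation:}\quad
\max_{\gamma,\, w,\, b}\ \gamma
\quad \text{s.t.} \quad \frac{y^{(i)}\left(w^T x^{(i)} + b\right)}{\lVert w \rVert} \ \ge\ \gamma, \quad i = 1, \dots, m
```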
So it's not something you can just solve with, say, gradient descent without worrying about local optima and so on. But it turns out that via a few steps of rewriting, you can reformulate this problem into an equivalent problem, which is minimizing the norm of w subject to a constraint on the margin of every training example. I hope the first problem makes sense: solve for w and b so that every example has geometric margin greater than or equal to gamma, and make gamma as big as possible; that's the way to formulate an optimization problem that says, "maximize the geometric margin." And what we show in the lecture notes is that through a few steps, you can rewrite this optimization problem into the following equivalent form, which is to minimize the norm of w subject to this constraint. Maybe one piece of intuition to take away is that the smaller the norm of w is, the bigger the geometric margin, because there's less of a normalization division effect; but the details are in the lecture notes, okay? This turns out to be a convex optimization problem, and there are very good numerical optimization packages to solve it. If you optimize this, then you have the optimal margin classifier, and if you give this a dataset, then, assuming your data is separable, and we'll fix that assumption when we convene next week, you get the optimal margin classifier, which is really a baby SVM; when we add kernels to it, you'll have the full SVM algorithm, okay? All right, let's break for today. See you guys next Monday.
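For concreteness, the equivalent convex form worked out in the lecture notes is: minimize (1/2) times the norm of w squared, subject to y^(i) (w^T x^(i) + b) >= 1 for every training example. Below is a minimal sketch of solving that quadratic program on a toy linearly separable dataset; it assumes the cvxpy package is available, and the data and variable names are made up purely for illustration.

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data: X is (m, n), y holds labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# Convex reformulation: minimize (1/2)||w||^2 s.t. y_i (w^T x_i + b) >= 1 for all i.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # parameters of the optimal margin classifier on this toy set
```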