Naive Bayes classifier: A friendly approach

Captions
I am Luis Serrano, and this video is about the naive Bayes classifier. Now, Bayes' theorem is one of the most important things in probability, and it's very useful in machine learning. You may have seen it as a complicated formula regarding some ratios of probabilities. I like to take this a little further, and I like to think of it as: what is the probability of something happening, given that we know that something else happened? Naive Bayes is an extension of this, which basically says: OK, once I have too many events and I don't know how to handle them, are there any naive assumptions that I can make on them to make the math easier? So this is what we're gonna see today.

So let's start with an example. Let's say we want to build a spam detector, because we are tired of seeing a lot of spam email in our inbox and we want to sort it properly. So how do we build it? We build it with previous data. Let's say our previous data is a set of a hundred emails, and when we look at them carefully, there are 25 of them that are spam and 75 that are not spam. So what we're gonna do is try to pick properties of the emails that we think may correlate with them being spam or not spam.

So let's pick one. Let's say we're gonna study the appearance of the word "buy". We think that emails that contain the word "buy" are more likely to be spam than not spam, so let's study that. Let's see how many emails that are spam have the word "buy": it turns out there are 20 of them. And let's see how many emails that are not spam have the word "buy" in them: there are five. So let's forget about all the others and just look at the emails that contain the word "buy".

Here's a quiz. The quiz says: if an email contains the word "buy", then what is the probability that this email is spam, given the data that we have? The options are 40%, 60%, 80% and 100%. Feel free to pause the video and think about it yourself. Given the data that we have, what is the probability that if an email contains the word "buy", then it is spam? This is a conditional probability.

So I'll tell you the answer. If we look at the emails that contain the word "buy", there are 20 that are spam and five that are not, so that makes an 80/20 split. From this data we can see that, of the emails that contain the word "buy", 80% are spam. So we conclude, again just from this data, that the probability is gonna be 80 percent that it's spam if it contains the word "buy". Therefore we associate the condition "contains the word buy" with the probability 80 percent, and that is exactly what Bayes' theorem is. You may have seen it in a different way, as a formula, but this is really what it is.

Just for fun, let's do it for a different property, for a different word. Let's say that we think that the word "cheap" may also be a good way to tell if an email is spam. So let's study this word. We count how many times the word "cheap" appears in spam emails: that's gonna be in 15 of them. And from the non-spam emails, ten of them have the word "cheap". So we forget about the rest, and quiz again: if an email contains the word "cheap", what's the probability that it's spam? 40, 60, 80 or 100? Again, feel free to pause the video. The answer is 60%, because if you look at the split, there are 15 spam and 10 non-spam among the ones that contain the word "cheap". That's a 60/40 split, and therefore the solution is 60%. So we applied Bayes' theorem for two words and obtained 80 and 60.

Now here's where things get complicated. What if we want to apply it for both words at the same time? We want to see what's the probability of an email being spam if it contains both the word "buy" and the word "cheap". Well, we can do the same thing, right? We can count how many emails contain the word "buy", then look at how many contain the word "cheap", and then look at the overlap. There are actually 12 spam emails that contain the words "buy" and "cheap", so that's some good data. Then let's look among the non-spam emails. Let's say that these five contain the word "buy" and these ten contain the word "cheap", so actually there are none that contain both words. But that's OK, we're gonna do the same thing as before. We have 12 spam emails and zero non-spam emails that contain the words "buy" and "cheap".

So, easiest quiz in the world: if an email contains the words "buy" and "cheap", what's the probability that it's spam? 40, 60, 80 or 100? This should be easy, right? Because there are 12 spam emails that contain both words and zero non-spam emails that contain both words, and this is a 100%/0% split. So the answer is 100%, and we are done, right?

Well, maybe you're being skeptical like me. It seems like that's a little too much; any classifier that tells you something is 100% is too strong. So where does the problem lie? The problem lies here: we had 12 spam emails that contained both words, "buy" and "cheap", and that's not bad, but among the non-spam emails there are zero emails that contain the words "buy" and "cheap". That's just unfortunate. In our data we don't find the two words together, but it's possible that these two words could appear, right? So we can't restrict ourselves to not having a classifier with the words "buy" and "cheap" just because in our small dataset the words don't appear.

So what could we do? Well, one solution could be to collect more data: go through a lot more emails until we find the words "buy" and "cheap", and then do Bayes' theorem on those. But what if we just can't? What if we can't collect more data and we have to make do with the data that we have? So let's think: what would you do if you had this situation and you had to sort of imagine how many emails would contain the words "buy" and "cheap"? What we're gonna do is try to guess the number, try to come up with a sensible amount of emails that would contain the words "buy" and "cheap", even if we found none.

Let's look at a slightly larger dataset. Let's say we have a hundred emails, so this is a different set than the first one. Let's say that five contain the word "buy" and ten contain the word "cheap", and they don't overlap. What do you think would be a sensible number of emails that would contain the words "buy" and "cheap"? Let's think: 5 out of 100 is 5%, so 5% of the emails contain the word "buy", and 10 out of 100 is 10 percent, so 10 percent of the emails contain the word "cheap". So in an ideal world where everything was pretty, how many emails would contain the words "buy" and "cheap"? Well, what is 10 percent of 5 percent? It's 0.5 percent. So why don't we just assume that 0.5% of the emails contain the words "buy" and "cheap"? We can sort of imagine that there is half an email that contains the words "buy" and "cheap". Since all we're doing is math, it doesn't really matter that there's half an email; this will work out in our formulas.

What we did is make an assumption: we assumed that the words "buy" and "cheap" are independent. They may not be, right? It could be that containing the word "buy" makes it easier to contain the word "cheap", because you're talking about a product and they say "buy cheap something". Or it could be the opposite, that if one appears, that sort of forces the other one to not appear, or to be less likely to appear. So it's quite a strong assumption. As a matter of fact, many people would say that's a naive assumption: assuming that two variables are independent when they may not be is very naive. However, that's what our algorithm is based on, because it turns out that if we make these assumptions, things still work well, and it makes our math much, much easier. Now we don't have to collect thousands of emails. We can collect these 100, and from the number of appearances of "buy" and the number of appearances of "cheap" we can sort of cook up the number of appearances of "buy" and "cheap" together.

So let's do that. Let's go back to our data. We had 25 spam emails; 20 of them had the word "buy", and that's four-fifths, and 15 of them had the word "cheap", that's three-fifths. So we can imagine that the product of these is 12 divided by 25, so we could assume that on average 12 out of 25 emails here would contain the words "buy" and "cheap". In order to find the actual number we multiply by 25, and we get that 12 emails have the words "buy" and "cheap". So that was kind of lucky, that we actually did find 12.

We're not gonna be that lucky in the other case, but we can still do it, right? We have 75 emails. Five of them have the word "buy", that's 1/15 of them, and ten of them have the word "cheap", that's 2/15 of them. The product of these two fractions, again assuming they're independent, is 2 divided by 225. So that's the fraction of emails that contain the words "buy" and "cheap". To find the actual number, we multiply it by 75 and we get 2/3. So here we have 2/3 of an email that contains the words "buy" and "cheap", and that's fair, let's work with that.

So we go back to our data. On the left we have 12 emails that contain the words "buy" and "cheap", and on the right we have 2/3 of an email that contains the words "buy" and "cheap". We can do math with these, because now the quiz says: if an email contains the words "buy" and "cheap", what is the probability that it is spam? So let's do some math. What is the split between 12 and 2/3? Well, let's take the spam ones, that's 12, and let's take the total number of emails that contain "buy" and "cheap", and that's 12 plus 2/3, because there are 12 spam and 2/3 that are not spam. So we can find the ratio between these. By the way, if you've seen the formula for Bayes' theorem, there's a ratio in it, and it's precisely this one. So what do we do with this fraction? Well, multiplying top and bottom by 3, it's 36 over 38, or 94.737 percent, because the split is 94.737 and 5.263. Therefore our final answer is that the words "buy" and "cheap" give us a probability of 94.737 percent of being spam. That means if we have an email with both of those words, it is 94.737 percent likely to be spam.
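
The arithmetic up to this point can be checked with a few lines of Python. This is only a minimal sketch, not code shown in the video: the variable names are made up for illustration, and the counts are the ones quoted in the captions (25 spam and 75 non-spam emails, "buy" in 20 and 5 of them, "cheap" in 15 and 10 of them). Exact fractions are used so that "2/3 of an email" stays exact.

```python
from fractions import Fraction

# Counts quoted in the captions (variable names are illustrative only).
n_spam, n_ham = 25, 75
buy_spam, buy_ham = 20, 5
cheap_spam, cheap_ham = 15, 10

# Single word: of the emails containing the word, what fraction is spam?
p_spam_given_buy = Fraction(buy_spam, buy_spam + buy_ham)
p_spam_given_cheap = Fraction(cheap_spam, cheap_spam + cheap_ham)

# Naive assumption: within each class, "buy" and "cheap" are independent,
# so the expected number of emails containing both words is
# (fraction with buy) * (fraction with cheap) * (class size).
both_spam = Fraction(buy_spam, n_spam) * Fraction(cheap_spam, n_spam) * n_spam
both_ham = Fraction(buy_ham, n_ham) * Fraction(cheap_ham, n_ham) * n_ham

# Split between the estimated spam and non-spam counts.
p_spam_given_both = both_spam / (both_spam + both_ham)

print(p_spam_given_buy)     # 4/5
print(p_spam_given_cheap)   # 3/5
print(both_spam, both_ham)  # 12 2/3
print(p_spam_given_both)    # 18/19, roughly 94.737 percent
```

The last value, 18/19, is the same ratio as the 36 over 38 in the captions, written in lowest terms.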
And that is precisely the naive Bayes classifier. The naive Bayes classifier is basically a combination of Bayes' theorem and the naive assumption that two events are going to be independent when they may not be, but that naive assumption makes the math much, much easier.

So let's do a little summary. What we're really doing is filling out this table, and in some places of the table we can't really fill in the data, so we'll fill them in from other places in the table. Let's look at spam and non-spam emails: the total was 25 spam emails and 75 non-spam emails in our dataset. In the next row we're gonna count how many of them have the word "buy". So 20 of the 25 spam emails have the word "buy", that's four-fifths, and five of the 75 that are not spam have the word "buy", that's 1/15, because it's 5 divided by 75. Now we're gonna fill in the next row. Out of the spam emails, 15 of them contain the word "cheap", that's three-fifths, because it's 15 divided by 25, and ten of the 75 non-spam emails contain the word "cheap", that's 2/15, because it's 10 divided by 75.

Now we would love to fill in the last row with data for the words "buy" and "cheap" together, but unfortunately this dataset is not big enough to handle an event that is so sparse, like the words "buy" and "cheap" both appearing. You can imagine that if there were more words it would be even harder. So we have to cook up this row from the previous ones. What we're gonna do is make the naive assumption that the words "buy" and "cheap" are independent, so that one doesn't imply or push the other one to appear, or stop it from appearing. If we make this assumption, then we're gonna say that the product of these two is the probability of the words "buy" and "cheap" both appearing. That's 12 divided by 25, the product of 4/5 and 3/5, so that's gonna be our probability. Now, if this is the probability of "buy" and "cheap" appearing, how many emails contain "buy" and "cheap"? We have to multiply it by the total number, which is 25. So 25 times 12 over 25 is 12, and we conclude that 12 emails should contain the words "buy" and "cheap". Even if there are actually 12 or 14 or 10 or none, logically, if we make that assumption, there should be 12.

Now let's look at the other two boxes. Again we make the assumption that the words "buy" and "cheap" are independent of each other, so the product of these two, which is 2 divided by 225, is gonna be the probability of the words "buy" and "cheap" appearing in an email that is not spam. So how many emails that are not spam contain the words "buy" and "cheap"? Well, it's the product of the probability times the total number. How much is 2 over 225 times 75? That's actually two-thirds. So we have 12 spam emails, and two-thirds of an email that is not spam, that contain the words "buy" and "cheap". Now we have to normalize, right? We have to see what the split is: what percentage are spam among the total ones. The total is 12 plus two-thirds; that's all of our emails that contain the words "buy" and "cheap". So we divide 12, the spam ones, by the total, which is 12 plus two-thirds, and we get 36 over 38, which is 94.737 percent.

Now notice that naive Bayes extends to many, many more properties. The point is that if we have 50 properties and we can't check when they all appear at the same time, we can check when each one appears and then multiply things. So let's add an extra row to this table. Let's say we looked at the word "work", and we're wondering if the word "work" helps us in our classifier. Let's study how much it appears. Let's say that it appears five times in our spam emails and 30 times in our non-spam emails. It doesn't look like it's gonna help us that much; it looks almost like a word that's more correlated with not spam. But let's just study it. 5 out of 25 is 1/5, so one-fifth of the spam emails contain the word "work", and 6/15 of the non-spam emails contain the word "work", because 30 divided by 75 is 6 over 15.

So again we make the naive assumption that the words "buy", "cheap" and "work" are all independent. Therefore the probability that the three of them appear in a spam email is the product of these three numbers, which is 12 divided by 125. And again, if we want to estimate the number of emails that are spam and contain those three words, we multiply the probability times the total, 25, and we get 12 divided by 5. So roughly 12 divided by 5, which is a little over two emails, will be spam and contain the words "buy", "cheap" and "work".

Now let's do it over here. We assume again that the three words are independent of each other. We take the product of the probabilities, and that's gonna be the probability that the words "buy", "cheap" and "work" all appear in an email at the same time when the email is not spam; that's 12 divided by 3375. In order to find the total number of emails that are not spam and contain the words "buy", "cheap" and "work", we multiply the probability that they appear times the total number of non-spam emails, and we get that 4/15 of an email is not spam and contains the words "buy", "cheap" and "work", because 75 times 12 divided by 3375 is 4/15.

So in summary, out of the emails that contain the words "buy", "cheap" and "work", 12 over 5 are spam and 4 over 15 are ham. So how many are spam, divided by the total? Well, we take 12 over 5, the number that are spam, divided by the total, which is 12 over 5 plus 4 over 15, and that is gonna be 36 over 40, or 90%. So that's how we combine the three words. Notice that 90 is less than 94.7, because the word "work" actually decreases the probability that an email is spam: as you can see, "work" appears a lot more in non-spam emails. So it does make sense, because it's not a word that one would correlate with spam. Some of these properties may increase the probability and some of them would decrease it, but the fact is that naive Bayes helps us combine a bunch of different features into a model that calculates the probability that something is spam. These features get combined in a nice way, because we don't have to wait until we find an email with all these features; we can actually cook up probabilities without having emails that satisfy all of them.

So if you like formulas, this is really what happened in the background. This is the formula of Bayes' theorem. The letter S stands for being spam, the letter H stands for ham, which is what they call emails that are not spam, and the red letter B stands for "buy". So, probability of S given B: when you see that vertical bar, that is a conditional probability. What the left side says is the probability of spam if the word "buy" appears, and that's a ratio, because most probabilities are ratios. On the top we have the probability of B given S, so out of the spam emails, how many of them contain the word "buy"? That was 20 out of 25. Then the probability of S is the probability that an email is spam regardless of any words that it contains; that was 25 over 100, because if we remember, there were 25 spam emails out of 100 total. In the bottom goes everything, the total. So that's the same thing, 20 over 25 times 25 over 100, plus the ham ones. So we have the probability of the word "buy" appearing if the email is ham, that's 5 over 75, because out of 75 ham emails five of them have the word "buy", and the probability of an email being ham, which is 75 over 100. If you do that whole formula you get 80%. But the interesting thing is that if you look at what we did, it was exactly that.

And then what happens with naive Bayes is that we make this assumption: the probability of the word "buy" and the word "cheap" both appearing is the product of the probability of the word "buy" appearing and the probability of the word "cheap" appearing. Again, this is not supposed to happen. The words "buy" and "cheap" may be either correlated or inversely correlated; maybe one implies the other one, maybe one stops the other from appearing. But we're gonna assume naively that the product of the probabilities is the probability of both appearing, which is saying this: the probability of some event B intersection some event C is the product of the probabilities of B and C appearing. Again, it's a naive assumption, but we're gonna make it because it makes our math easier.

The full formula for naive Bayes, this is for two events but you can generalize it to many more events, is: the probability of spam given that the words "buy" and "cheap" appear is that formula. If we look at all the probabilities here, we say: probability of spam if it contains the words "buy" and "cheap". Well, it's a ratio. On the top, we know these probabilities: it's 20 out of 25 for the probability of "buy" given that it's spam; the probability of "cheap" given that it's spam is 15 over 25, if you remember correctly, there were 15 spam emails containing the word "cheap"; and then again 25 over 100 for the probability that an email is spam. In the bottom we have the same thing, plus 5 over 75, the probability that a ham email contains the word "buy", times 10 over 75, the probability that a ham email contains the word "cheap", and then the probability that an email is ham, which is 75 over 100. You do this math and you get 94.737 percent. But I challenge you: if it doesn't look super clear, look at this slide, go back to what we did in naive Bayes, and convince yourself this is exactly what we did, and that this whole video was nothing different than calculating probabilities by dividing one thing by another.

So thank you very much, that's it for naive Bayes. As usual, if you liked it, please subscribe for more videos coming up, please hit like, share it with your friends, and feel free to comment to ask any questions or give any suggestions for this or any other videos you'd like to see. My Twitter handle is luis likes math. So thank you very much for your attention, and see you in the next video.
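
The formula described in the last part of the captions can be written as a short function. The following is a hedged sketch rather than anything shown in the video: the function name `naive_bayes_spam_probability` and its parameters are hypothetical, chosen only for this illustration, and the counts for "buy", "cheap" and "work" are the ones quoted in the captions.

```python
from fractions import Fraction
from math import prod

def naive_bayes_spam_probability(word_counts, n_spam, n_ham):
    """P(spam | all words appear), under the naive independence assumption.

    word_counts: list of (count_in_spam, count_in_ham) pairs, one per word.
    This helper is hypothetical; it is not part of any library.
    """
    total = n_spam + n_ham
    # Product of P(word | spam) and of P(word | ham) over all words.
    likelihood_spam = prod(Fraction(s, n_spam) for s, _ in word_counts)
    likelihood_ham = prod(Fraction(h, n_ham) for _, h in word_counts)
    # Numerator of Bayes' theorem: P(words | spam) * P(spam).
    spam_term = likelihood_spam * Fraction(n_spam, total)
    # Same product for ham; the denominator is the sum of both terms.
    ham_term = likelihood_ham * Fraction(n_ham, total)
    return spam_term / (spam_term + ham_term)

# Counts from the captions: (emails in spam, emails in ham) per word.
buy, cheap, work = (20, 5), (15, 10), (5, 30)

print(naive_bayes_spam_probability([buy], 25, 75))               # 4/5
print(naive_bayes_spam_probability([buy, cheap], 25, 75))        # 18/19
print(naive_bayes_spam_probability([buy, cheap, work], 25, 75))  # 9/10
```

The three results correspond to the 80 percent, 94.737 percent and 90 percent worked out in the video. Adding the word "work" lowers the estimate because it appears in a larger fraction of the ham emails than of the spam emails.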
Info
Channel: Serrano.Academy
Views: 95,793
Keywords: naive bayes, bayes theorem, probability, conditional probability, machine learning, artificial intelligence, ai, mathematics, math
Id: Q8l0Vip5YUw
Length: 20min 29sec (1229 seconds)
Published: Sun Feb 10 2019