How Bayes Theorem works

Video Statistics and Information

Captions
Bayesian inference is a way to make guesses about what your data mean, sometimes based on very little data. The way it works is tricky, but it's not magic; it's definitely something you can wrap your head around. My goal is that by the time we're done talking, you'll have a pretty crisp picture of how it works. Bayesian inference is just guessing in the style of Thomas Bayes, who was a nonconformist Presbyterian minister. He wrote a couple of books, one about religion and one about probability, so Bayesian inference is just guessing in the style of Bayes.

To illustrate it, imagine that you're at the movies and someone drops a ticket. You pick it up, and you can see them from behind: you know they have long hair, but you don't know whether they're a man or a woman, so you have to make a guess. Based on what you know about the attendees at your movie theater, you might say, "Excuse me, ma'am, is this your ticket?" Now imagine instead that this person is standing in line for the men's restroom. Knowing this extra piece of information, you might make a different guess. Bayesian inference is a way to capture this common-sense knowledge about the situation and help you make better guesses.

To put numbers to this dilemma at the movie theater, let's assume that out of 100 women at the movies, 50 have short hair and 50 have long, and out of 100 men at the movies, 96 have short hair and 4 have long. In this case we can see that there are definitely more women with long hair than men with long hair, so it's a safe bet to assume this person is a woman. But we just made a subtle assumption: that there are about the same number of men and women at the movies. That assumption no longer holds when we move to the men's restroom line. Here, let's say there are two women out of every 100 people and 98 men; maybe the women are keeping their partners company. There's still one woman with short hair and one with long hair, still half and half long and short, but now there are four times as
many men with long hair as women with long hair in this group. Now the safe money is to bet that this person is a man.

To draw this a little differently: out of a hundred people at the movies overall, we'll make the assumption explicit that fifty of them are women and fifty of them are men. That's how the different categories break down; in the line for the men's restroom, they break down a little differently. To translate this to math, the probability that a person is a woman is the total number of women divided by the total number of people: 50%. Similarly for men. Moving to the men's restroom line, the probability that someone is a woman is two percent, and ninety-eight percent for men.

Now, Bayes' theorem is a little bit tricky, so to be very precise we're going to have to talk math. If you bear with me for just three probability concepts, we'll lay the foundation for presenting Bayes' theorem.

The first concept is conditional probabilities. If I know that a person is a woman (that's the condition), what's the probability that that person has long hair? It's written as the probability of long hair given that the person is a woman. To get this, we just divide the number of women with long hair by the total number of women: 50%. And this doesn't change whether there are 50 women in your group or two; if we know that a person is a woman, the probability that they have long hair is still 50%. We do the same thing with men: the probability that someone has long hair given that they're a man is 4%. So conditional probabilities say: if I know that B is the case, what's the probability that A is also the case? Note that you can't reverse A and B and have this still be true. For instance, if I know that the thing I'm holding is a puppy, what's the probability that it's cute? The probability is very high. If I know that the thing I'm holding is cute, what's the probability that it's a puppy? Well, it might be a puppy, it might be a kitten, it might be a hedgehog, it might be a small human. There's lots of
things that it could be, so the probability there is lower. These things are not interchangeable in conditional probabilities.

Now, concept two: joint probabilities. What's the probability that a person is both a woman and has short hair? To calculate a joint probability, you first find the conditional probability (if I know they're a woman, what's the probability that they have short hair?) and then multiply that by the probability that they are a woman. In this case, 0.5 times 0.5 gives 0.25, which is exactly what we knew it would be. The same is true for all of our different conditions, and we can do this for the men's restroom too: the probability that someone is a man and has long hair is 4%; that someone is a woman and has long hair, 1%. Joint probabilities are different from conditional probabilities. Here, the probability that A and B is the case is the same as the probability that B and A is the case: the probability that I'm having a jelly donut with my milk is the same as the probability that I'm having milk with my jelly donut. These two situations are identical, so we can swap the order.

And finally, concept three: marginal probabilities. If I wanted to figure out, say, the probability that someone has long hair, I just add up all the different ways someone can have long hair: they can be a woman with long hair or a man with long hair. In the men's restroom line, that's a 1 percent probability plus a 4 percent probability, or a 5 percent probability overall. You can do the same thing for short hair: 95 percent.

This last concept finishes our foundation, and we can get to what we really care about. We know that this person has long hair; what's the probability that they are a man or a woman? This is a conditional probability, but it's the reverse of the one that we know, and we don't know how to answer it yet. This is where Thomas Bayes comes in. He noticed something really cool: you can calculate the joint probability that
someone is a man and has long hair using the formula we saw before, and you can also calculate the joint probability that someone has long hair and is a man. These are different formulas, but we remember that joint probabilities don't care about the order, so these two things are equal, which means the expressions they're equal to on the other side are also equal to each other. With a little algebraic sleight of hand, we have a formula for what we want to know: if someone has long hair, what's the probability that they're a man? Generalizing that expression with A's and B's gives P(A | B) = P(B | A) × P(A) / P(B), and now we have Bayes' theorem, one of the top 10 most popular math tattoos of all time.

Using this formula we can go back to the movie theater and plug in what we know. We know the probability that someone is a man, we know the probability that if they're a man they have long hair, and we know the marginal probability that someone has long hair, which is just the probability that someone is a woman with long hair plus the probability that someone is a man with long hair. We plug all that in, and we can say that if someone has long hair at the movie theater, there is a 7% chance that they are a man, and similarly a 93% chance that they are a woman. Now, if they are in line for the men's restroom, some of those probabilities change, so that conditional probability changes too: if someone is in line for the men's restroom and has long hair, there's an 80% chance that they are a man. This is consistent with what we saw before. We know that there are four men and one woman with long hair for every hundred people in line for the men's restroom, so four out of five long-haired people are men: 80%. It all adds up.

This example shows the mechanics of how to get Bayes' theorem and how it works. In practice it's usually used a little differently, so to show this we'll have to take a little detour and first talk about probability distributions. You can
think of probability like a pot with just one cup of coffee in it. If you just have one cup to fill, you can fill it all the way to the top, but if you have more than one cup you have to share the coffee around, or distribute it, and you can share it in any proportion you want. For instance, if we're representing the number of men and women at the movies, we could share it 50/50, but it will always add up to a hundred percent. We could even share it further into different categories; here we see the joint probabilities of all four categories we were working with, and you can see that this is just another representation of the distribution we were looking at before. Usually when we look at this, the categories sit side by side, with probability instead of percentage, shown in a histogram.

It can be really helpful to think of these as beliefs. For instance, if I flip a coin and hide the result from you, you might half believe it's heads and half believe it's tails until I tell you what it is. If I roll a die and hide the result, you might believe about one-sixth that it's a one, or a two, or a three, or a four, or a five, or a six, until I show you the result. These are what you believe: probabilities can represent what you believe about something before you measure it. Similarly for Powerball tickets, and even for things that are more complicated, like, say, the height of adults in the United States in centimeters. You might believe there's a very small chance they'll be over 210 centimeters and a small chance they're less than 150 centimeters, and then assign various amounts of this belief to all of the ranges in between. It still adds up to one, like a bunch of cups of coffee lined up in a row where you put a little bit in each one; the cups in the middle have more. It shows how your belief about some individual is distributed before you actually measure them. Now you can take these bins and chop
them more and more finely, again and again, and if you keep doing this you can get to something that's actually perfectly smooth. It's as if you had an infinite number of very tiny cups and you put an infinitesimal amount of coffee in each one. At this point it's probably no longer helpful to think of it in those terms; just think of it as a continuous distribution showing, for all these heights, where am I placing my bets, what do I believe, and how much. Once you have your beliefs, you can use them to answer questions about typical heights: the average, the median value, or the most common value, the mode.

Now we'll use this in weighing my dog. I have a dog named Reign of Terror. She's a little mischievous, and when we go to the veterinarian, Reign squirms on the scale, so every time we weigh her we get a different weight. This last time we got 13.9 pounds, 17.5 pounds, and 14.1 pounds. What we want to know is how much Reign weighs, and this will be the basis for a decision. This is important: if her weight has gone up, her food intake will have to go down, and for her this is a matter of life and death, so we don't want to make the wrong assumption and draw the wrong conclusion.

If you've ever taken a statistics class, you know you can take these measurements, add them up, and get the average: 15.2 pounds. You can calculate the standard deviation of these three measurements, and from that the standard error, and come up with a 1.16-pound standard error. Shown graphically, this red curve is the belief that results from those three measurements: the peak of that hill is at 15.2 pounds, and one standard deviation on that curve is our standard error of about 1.2 pounds. Looking at this you can see that yes, it's most likely that she's 15.2 pounds, but a lot of that curve sits outside the range of fourteen to sixteen. So yeah, she's probably
between fourteen and sixteen pounds, most likely between 13 and 17 pounds, but she might even be lower than 12 or higher than 18. That is a really wide range, and it's not a great basis for making a decision. You can see the three measurements there, those three white vertical lines, and you can see why our belief is so dispersed: those three measurements are pretty dispersed, and it's hard to capture all that in one distribution.

So let's try it again with Bayes' theorem. Instead of A and B we'll substitute in W for her actual weight and M for the measurements that we took: P(W | M) = P(M | W) × P(W) / P(M). The probability distribution of the actual weight, P(W), is our prior: what we believe about her weight before we put her on the scale. The probability, given a weight, of getting certain measurements, P(M | W), is the likelihood associated with those measurements. And the posterior, P(W | M), is what we believe about her weight given those measurements. You can think of it this way: we start with a belief, we take some measurements, we update the belief, and we have a new belief when we're done. The term on the bottom we're going to ignore for the most part; it'll be a constant, but it's called the marginal likelihood.

We're going to start by not assuming anything about her weight. She could be one pound, 10 pounds, 20 pounds, 100 pounds; we're going to let the prior be uniform and let the data speak. Now our formula simplifies, and what we want to calculate is the probability of our measurements occurring given a weight, for each of the possible weights. Then we'll end up with a new distribution, which is our belief: the probability of each of those weights given the measurements. So let's start, for instance, by assuming: what if she weighed 17 pounds in reality? Because our measurement process is very noisy, as we saw, if she weighed
17 pounds, we would expect our measurements to be distributed as shown here: some would be up above 18 pounds, some down around 14 pounds where we actually measured, but not very many of them would be. To calculate the probability of our measurements occurring given this weight, we find the probability of each individual measurement occurring and multiply them together. Two of these are pretty small, and when you multiply small things together they make something very small, so the likelihood of her being at 17 pounds is pretty small. We shift our candidate over and ask: well, what if she were 16.5 pounds? What if she were 16 pounds? We recalculate each time, multiplying those probabilities together, and by the time we're done we have, for each candidate weight, the likelihood of our measurements occurring. You can see that the maximum occurs at 15.2 pounds; this is commonly called the maximum likelihood estimate, where you start with a uniform assumption about the weight. And it just so happens that the standard error on this is exactly what we calculated before. A very cool connection here: when you take the average and calculate the standard deviation and standard error, it gives you the same likelihood that you would get by doing Bayes' method with a uniform prior, not assuming anything about what the result is going to be.

We've already established, though, that that's a really broad result and not very helpful, so we need to start over, this time with what we know. Some background information: Reign was 14.2 pounds the last time we went to the vet, and she doesn't seem noticeably heavier to me. My arm is not that well calibrated, but let's assume that she's within about a pound of where she was before. I take that prior, and this is the form it takes: a normal distribution centered on 14.2 pounds, and you can see that most of that bulk is within plus
or minus a pound. It extends a little further out to allow for the possibility that she's actually gained or lost a lot of weight, but probably she's pretty close. This is what I believe before I even put her on the scale: the prior, the probability of her being at a given weight.

This time we're not neglecting the prior term or setting it constant; we're going to use it. The way this plays out now is: we assume, okay, what if she were 17 pounds? We multiply by the prior probability of her weighing 17 pounds, which is quite small, and then by the three probabilities of our measurements occurring. So we have something small times something very small times something very small, and we get a very small result for the probability that she actually weighs 17 pounds. We repeat this process at 16.5 pounds, and 16 pounds, and 15.5 pounds, and 15 pounds, all the way through, and by the time we're done we tally all of those up and get a new posterior distribution. It's normally distributed at about 14.1 pounds, and it has a standard error of less than a pound; you'll notice it's even narrower than our original prior. We've taken our original belief and been able to sharpen it up a bit. Incidentally, the peak of this curve is called the maximum a posteriori estimate; if we have to choose one value to represent our belief, that's not a bad one to choose.

Now we compare this with our original estimate. It's labeled non-Bayesian here, but more accurately it could be called Bayesian with a uniform prior. You can see that it is much broader, and the peak of that curve is in an entirely different place. The answer we got is more confident, because it's more concentrated, and based on what we know it's probably closer to being correct. This is how Bayes' theorem is used most often in data science or in analysis:
it's a prior that you then update based on your measurements to sharpen up into a revised set of beliefs. There are a lot of times when it makes sense to use Bayesian inference. Sometimes we just know things: if we're measuring age, we know that everyone is more than 0 years old, so we can build that information in and get sharper estimates with fewer measurements.

Now, if you're paying attention, this should make you a little bit nervous. We humans are not always aware of what we believe, and putting it into a mathematical distribution can be very tricky. More importantly, the reason we're measuring something is that we want to learn about it; we want to be able to be surprised by our data. If we believe something that's not true, it can make it harder or impossible to learn from our data. I like how Mark Twain phrased this: "It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so." The way to avoid this pitfall is to always believe things that we think are impossible, at least just a little bit. By leaving room for something to be possible, we can do as Sherlock Holmes says: once you've excluded the impossible, then whatever remains, however improbable, must be the truth. We don't want to exclude the improbable out of hand, because then we're left with nothing. Alice, in a conversation with the Red Queen, summed it up well too: "There's no use in trying," she said, "one can't believe impossible things." "I daresay you haven't had much practice," said the Queen. "When I was younger, I always did it for half an hour a day. Why, sometimes I believed as many as six impossible things before breakfast."

So the secret to using Bayesian inference well is to keep believing impossible things. Thanks for your attention. Here's how you can get in touch with me if you'd like to carry on the conversation; I look forward to talking with you again soon.
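The whole pipeline described in the talk, Bayes' theorem on the movie-theater numbers and then a grid-based prior-times-likelihood update for the dog's weight, can be sketched in a few lines of Python. The head counts and measurements come from the talk itself; the prior width (0.5 lb), the measurement noise (1 lb), and the weight grid are my own illustrative assumptions, not the speaker's numbers, so the posterior peak printed here will not exactly match the 14.1 lb quoted in the video.

```python
import math

# --- Bayes' theorem with the movie-theater numbers from the talk ---
def posterior_man(p_man, p_long_given_man, p_long_given_woman):
    """P(man | long hair) = P(long | man) * P(man) / P(long)."""
    p_woman = 1.0 - p_man
    p_long = p_long_given_man * p_man + p_long_given_woman * p_woman  # marginal
    return p_long_given_man * p_man / p_long

# Theater overall: 50/50 men and women; 4% of men, 50% of women have long hair.
print(round(posterior_man(0.50, 0.04, 0.50), 3))   # -> 0.074, about 7%

# Men's restroom line: 98 men and 2 women per 100 people.
print(round(posterior_man(0.98, 4 / 98, 0.50), 3)) # -> 0.8, the 80% from the talk

# --- Grid-based posterior for the dog's weight (prior and noise are assumed) ---
measurements = [13.9, 17.5, 14.1]          # the three scale readings from the talk

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

weights = [10 + 0.01 * i for i in range(1001)]  # candidate weights, 10 to 20 lb
prior_mu, prior_sigma = 14.2, 0.5               # "within about a pound" of last visit
noise_sigma = 1.0                               # assumed scale-squirming noise

posterior = []
for w in weights:
    like = 1.0
    for m in measurements:
        like *= normal_pdf(m, w, noise_sigma)   # P(measurements | weight)
    posterior.append(normal_pdf(w, prior_mu, prior_sigma) * like)  # prior * likelihood

total = sum(posterior)                          # the marginal likelihood, up to grid scale
posterior = [p / total for p in posterior]      # normalize so beliefs sum to 1
map_weight = weights[posterior.index(max(posterior))]  # maximum a posteriori estimate
print(round(map_weight, 2))
```

With these assumed sigmas the posterior peaks between the 14.2 lb prior and the 15.2 lb maximum likelihood estimate, and its spread is narrower than the prior's, which is the qualitative behavior the talk describes: the prior pulls the noisy measurements back toward what we already believed, and the extra information tightens the belief.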
Info
Channel: Brandon Rohrer
Views: 422,232
Rating: 4.9303932 out of 5
Keywords: Bayes theorem, Bayesian inference, Bayes, Bayesian, statistics, probability, tutorial, lecture, teaching, how-to, Normal, distribution, statistician, How it works, Bayes law, Bayesian statistics, Bayes rule, Bayesian probability
Id: 5NMxiOGL39M
Length: 25min 9sec (1509 seconds)
Published: Tue Nov 01 2016