Why “probability of 0” does not mean “impossible” | Probabilities of probabilities, part 2

Video Statistics and Information

Reddit Comments

Unfortunate that /u/sleeps_with_crazy is no longer here.

👍︎︎ 167 👤︎︎ u/nebulaq 📅︎︎ Apr 12 2020 🗫︎ replies

Summary: It's about probability densities.

👍︎︎ 38 👤︎︎ u/Bromskloss 📅︎︎ Apr 12 2020 🗫︎ replies

Nice motivation for measure theory at the end there. The fact that simple concepts like Dirac measures make the weird PDF near the end of the video trivial to work with is both exciting for potential learners and, best of all, true!

👍︎︎ 8 👤︎︎ u/seanziewonzie 📅︎︎ Apr 12 2020 🗫︎ replies

/u/sleeps_with_crazy isn't here any more, but has done a pretty solid job of debunking this sort of thing in the past.

👍︎︎ 56 👤︎︎ u/elseifian 📅︎︎ Apr 12 2020 🗫︎ replies

That video was a letdown. I was hoping to start my day with Borel σ-algebras, and all I got was explanations of intervals.

👍︎︎ 6 👤︎︎ u/MycroftTnetennba 📅︎︎ Apr 13 2020 🗫︎ replies

So you're saying there's a chance...

👍︎︎ 9 👤︎︎ u/everything_is_bad 📅︎︎ Apr 12 2020 🗫︎ replies

If one wanted to assign a value to the probability of a singleton, it would obviously have to be infinitesimal. This can be done rigorously; see, for instance, Infinitesimal Probabilities (2018).

EDIT: I found another very nice paper, Fair infinite lotteries (2013).

👍︎︎ 12 👤︎︎ u/M4mb0 📅︎︎ Apr 12 2020 🗫︎ replies

This is why I think that probability theory should work with hyperreal numbers rather than real numbers. A probability of 0 should mean "cannot happen," and a probability of 1 should mean "must happen." With hyperreal numbers we can then say that something happens with probability 1/ω. For instance, for some uniform random process on [0, 1] the probability of getting 0.5 is not 0, but rather 1/ω.

👍︎︎ 7 👤︎︎ u/alcanthro 📅︎︎ Apr 12 2020 🗫︎ replies

Uhh by definition doesn’t probability 0 mean impossible?

Like for any continuous random variable X,

P(X < x) = P(X ≤ x) for any value x

which can only be true if P(X = x) = 0.

👍︎︎ 9 👤︎︎ u/henzhou 📅︎︎ Apr 12 2020 🗫︎ replies
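To spell out the step in that last comment: {X ≤ x} is the disjoint union of {X < x} and {X = x}, so additivity gives P(X ≤ x) = P(X < x) + P(X = x). Continuity of the distribution means P(X < x) = P(X ≤ x), which forces P(X = x) = 0. The video's whole point, though, is that this 0 marks a range of width 0, not an impossible outcome.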
Captions
Imagine you have a weighted coin. So the probability of flipping heads might not be 50/50 exactly. It could be 20%. Or maybe 90%. Or 0%. Or 31.41592%. The point is that you just don't know. But imagine that you flipped this coin 10 different times, and 7 of those times it comes up heads. Do you think that the underlying weight of this coin is such that each flip has a 70% chance of coming up heads? If I were to ask you, "Hey, what's the probability that the true probability of flipping heads is 0.7?" what would you say?

This is a pretty weird question, and for two reasons. First of all, it's asking about a probability of a probability, as in, the value we don't know is itself some kind of long-run frequency for a random event, which, frankly, is hard to think about. But the more pressing weirdness comes from asking about probabilities in the setting of continuous values.

Let's give this unknown probability of flipping heads some kind of name, like h. Keep in mind that h could be any real number from 0 up to 1, ranging from a coin that always flips tails up to one that always flips heads, and everything in between. So if I ask, "Hey, what's the probability that h is precisely 0.7, as opposed to, say, 0.70000001, or any other nearby value?" well, there's gonna be a strong possibility for paradox if we're not careful. It feels like no matter how small the answer to this question, it just wouldn't be small enough.

If every specific value within some range, all uncountably infinitely many of them, has a non-zero probability, well, even if that probability were minuscule, adding them all up to get the total probability of any one of these values will blow up to infinity. On the other hand, though, if all of these probabilities are 0, aside from the fact that that now gives you no useful information about the coin, the total sum of those probabilities would be 0, when it should be 1. After all, this weight of the coin h is *something*, so the probability of it being any one of these values *should* add up to 1. So, if these values can't all be non-zero, and they can't all be zero, what do you do?

Where we're going with this, by the way, is that I'd like to talk about the very practical question of using data to create meaningful answers to these sorts of 'probabilities of probabilities' questions. But for this video, let's take a moment to appreciate how to work with probabilities over continuous values, and resolve this apparent paradox.

The key is not to focus on individual values, but ranges of values. For example, we might make these buckets to represent the probability that h is between, say, 0.8 and 0.85. Also, and this is more important than it might seem, rather than thinking of the *height* of each of these bars as representing the probability, think of the *area* of each one as representing that probability. Where exactly those areas come from is something that we'll answer later. For right now, just know that in principle, there's *some* answer to the probability of h sitting inside one of these ranges. Our task right now is to take the answers to these very coarse-grained questions, and to get a more exact understanding of the distribution at the level of each individual input.

The natural thing to do would be to consider finer and finer buckets. And when you do, the smaller probability of falling into any one of them is accounted for in the thinner *width* of each of these bars, while the heights are gonna stay roughly the same.
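To make that buckets-to-density limit concrete, here is a minimal Python sketch (my illustration, not anything from the video), using a uniform random stand-in for h: as the buckets get finer, the probability of any single bucket shrinks toward 0, while probability divided by width, i.e. the bar height, stays put.

import random

random.seed(0)
N = 200_000
samples = [random.random() for _ in range(N)]  # uniform stand-in for the unknown weight h

for n_buckets in (10, 100, 1000):
    width = 1.0 / n_buckets
    lo, hi = 0.5, 0.5 + width          # the single bucket just to the right of 0.5
    prob = sum(lo <= s < hi for s in samples) / N
    print(f"width={width:<6g}  P(bucket)={prob:.5f}  density=P/width={prob / width:.2f}")

For a uniform variable the density hovers around 1 at every refinement level; for a non-uniform one, the heights would trace out the smooth curve described next.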
That's important, because it means that as you take this process to the limit, you approach some kind of smooth curve. So even though all of the individual probabilities of falling into any one particular bucket will approach 0, the overall shape of the distribution is preserved, and even refined, in this limit. If, on the other hand, we had let the *heights* of the bars represent probabilities, everything would've gone to 0. So in the limit, we would've just had a flat line giving no information about the overall shape of the distribution.

So, wonderful! Letting area represent probability helps solve this problem. But let me ask you, if the y-axis no longer represents probability, what exactly are the units here? Since probability sits in the area of these bars, or width times height, the height represents a kind of probability per unit in the x direction, what's known in the business as a "probability density". The other thing to keep in mind is that the total area of all these bars has to equal 1 at every level of the process. That's something that has to be true for any valid probability distribution.

The idea of probability density is actually really clever when you step back to think about it. As you take things to the limit, even if there's all sorts of paradoxes associated with assigning a probability to each of these uncountably infinitely many values of h between 0 and 1, there's no problem if we associate a probability *density* to each one of them, giving what's known as a "probability density function", or PDF for short. Any time you see a PDF in the wild, the way to interpret it is that the probability of your random variable lying *between* 2 values equals the area under this curve between those values. So, for example, what's the probability of getting any one very specific number, like 0.7? Well, the area of an infinitely thin slice is 0, so it's 0. What's the probability of all of them put together? Well, the area under the full curve is 1. You see? Paradox sidestepped.

And the way that it's been sidestepped is a bit subtle. In normal, finite settings, like rolling a die or drawing a card, the probability that a random value falls into a given collection of possibilities is simply the sum of the probabilities of being any one of them. This feels very intuitive. It's even true in a countably infinite context. But to deal with a continuum, the rules themselves have shifted. The probability of falling into a range of values is no longer the sum of the probabilities of each individual value. Instead, probabilities associated with ranges are the fundamental primitive objects. And the only sense in which it's meaningful to talk about an individual value here is to think of it as a range of width 0.

If the idea of the rules changing between a finite setting and a continuous one feels unsettling, well, you'll be happy to know that mathematicians are way ahead of you. There's a field of math called 'measure theory', which helps to unite these two settings, and makes rigorous the idea of associating numbers like probabilities to various subsets of all possibilities in a way that combines and distributes nicely. For example, let's say you're in a setting where you have a random number that equals 0 with 50% probability, and the rest of the time, it's some positive number according to a distribution that looks like half of a bell curve.
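That half-point-mass, half-bell-curve example is easy to simulate. A minimal sketch (again my own illustration, assuming the 50/50 mix just described):

import random

random.seed(1)

def sample():
    # half the time exactly 0; otherwise the magnitude of a standard normal draw
    return 0.0 if random.random() < 0.5 else abs(random.gauss(0.0, 1.0))

draws = [sample() for _ in range(100_000)]
n = len(draws)

print("P(X = 0)      ~", sum(d == 0.0 for d in draws) / n)        # point mass: about 0.5
print("P(X = 1)      ~", sum(d == 1.0 for d in draws) / n)        # single continuous value: 0
print("P(0 < X <= 1) ~", sum(0.0 < d <= 1.0 for d in draws) / n)  # area under the half bell: ~0.34

The single value 0 keeps a genuinely non-zero probability near 0.5, any other single value comes out at 0, and ranges pick up their probability from the density part: exactly the mixed behavior that measure theory is built to handle.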
This is an awkward middle ground between a finite context, where a single value has a non-zero probability, and a continuous one, where probabilities are found according to areas under the appropriate density function. This is the sort of thing that measure theory handles very smoothly. I mentioned this mainly for the especially curious viewer, and you can find more reading material in the description.

It's a pretty common rule of thumb that if you find yourself using a sum in a discrete context, then you use an integral in the continuous context, which is the tool from calculus that we use to find areas under curves. In fact, you could argue this video would be way shorter if I just said that at the front and called it good. For my part, though, I've always found it a little unsatisfying to do this blindly, without thinking through what it really means. And, in fact, if you really dig into the theoretical underpinnings of integrals, what you'd find is that in addition to the way that it's defined in a typical intro calculus class, there is a separate, more powerful definition that's based on measure theory, this formal foundation of probability.

If I look back to when I first learned probability, I definitely remember grappling with this weird idea that in continuous settings, like random variables that are real numbers, or throwing a dart at a dartboard, you have a bunch of outcomes that are possible, and yet each one has a probability of 0. And somehow, altogether, they have a probability of 1. Now, one step of coming to terms with this is to realise that possibility is better tied to probability density than probability, but just swapping out sums of the one for integrals of the other never quite scratched the itch for me. I remember that it only really clicked when I realised that the rules for combining probabilities of different sets were not quite what I thought they were, and there was simply a different axiom system underlying it all.

But anyway, steering away from the theory, somewhere back in the loose direction of application, look back to our original question about the coin with an unknown weight. What we've learned here is that the right question to ask is: what's the probability density function that describes this value h after seeing the outcomes of a few tosses? If you can find that PDF, you can use it to answer questions like 'What's the probability that the true probability of flipping heads falls between 0.6 and 0.8?'. To find that PDF, join me in the next part.
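And once that PDF is in hand, the question at the end is just an area computation. A minimal sketch, assuming an illustrative placeholder density f(h) = 6h(1 − h) rather than the actual coin PDF (which the next part derives):

def f(h):
    return 6.0 * h * (1.0 - h)   # a valid density on [0, 1]: non-negative, total area 1

def area(f, a, b, n=100_000):
    # midpoint Riemann sum standing in for the integral of f from a to b
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

print(area(f, 0.0, 1.0))  # ~1.0   total probability
print(area(f, 0.6, 0.8))  # ~0.248 P(0.6 <= h <= 0.8)
print(area(f, 0.7, 0.7))  # 0.0    a single value is a range of width 0

The full interval carries area 1, the 0.6-to-0.8 range gets a genuine non-zero probability, and a single point, being a range of width 0, contributes nothing: the range-first rules described above, in executable form.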
Info
Channel: 3Blue1Brown
Views: 1,581,402
Rating: 4.922101 out of 5
Keywords: Mathematics, three blue one brown, 3 blue 1 brown, 3b1b, 3brown1blue, 3 brown 1 blue, three brown one blue
Id: ZA4JkHKZM50
Length: 10min 0sec (600 seconds)
Published: Sun Apr 12 2020