Imagine you have a weighted coin. So the probability of flipping heads,
it might not be exactly 50/50. It could be 20%. Or maybe 90%. Or 0%. Or 31.41592%. The point is that you just don't know. But imagine that you flipped this coin 10 different times, and 7 of those times, it came up heads. Do you think that the underlying weight of this coin is such that each flip has a 70% chance
of coming up heads? If I were to ask you, "Hey, what's the probability that the true probability of flipping heads is 0.7?" what would you say?
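To make the setup concrete, here's a minimal Python sketch of the experiment (my own illustration, not from the video; the names TRUE_H and NUM_FLIPS are hypothetical):

```python
import random

# Simulate a weighted coin whose true weight h is hidden from us.
# TRUE_H is an assumed value for illustration; in the real problem,
# all we ever see is the sequence of flips.
TRUE_H = 0.7
NUM_FLIPS = 10

flips = [random.random() < TRUE_H for _ in range(NUM_FLIPS)]
heads = sum(flips)
print(f"Observed {heads} heads in {NUM_FLIPS} flips")
```

Seeing 7 heads out of 10 is consistent with h = 0.7, but it's also consistent with plenty of other weights, which is exactly what makes the question tricky.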
This is a pretty weird question, for two reasons. First of all, it's asking about a probability of a probability, as in, the value we don't know is itself some kind of long-run frequency for a random event, which, frankly, is hard to think about. But the more pressing weirdness comes from asking about probabilities in the setting of continuous values. Let's give this unknown probability of flipping heads some kind of name, like h. Keep in mind that h could be any real number,
from 0 up to 1, ranging from a coin that always flips tails up to one that always flips heads, and everything in between. So if I ask, "Hey, what's the probability that h is
precisely 0.7, as opposed to, say, 0.70000001,
or any other nearby value?" well, there's gonna be a strong possibility for paradox
if we're not careful. It feels like no matter how small the answer to this question, it just wouldn't be small enough. If every specific value within some range, all uncountably infinitely many of them, has a non-zero probability, well, even if that probability were minuscule, adding them all up to get the total probability of
any one of these values will blow up to infinity. On the other hand, though,
if all of these probabilities are 0, aside from the fact that that now gives you
no useful information about the coin, the total sum of those probabilities would be 0,
when it should be 1. After all, this weight of the coin h is *something*, so the probability of it being any one of these values *should* add up to 1. So, if these values can't all be non-zero,
and they can't all be zero, what do you do? Where we're going with this, by the way, is that
I'd like to talk about the very practical question of using data to create meaningful answers to these sorts of 'probabilities of probabilities' questions. But for this video, let's take a moment to appreciate
how to work with probabilities over continuous values, and resolve this apparent paradox. The key is not to focus on individual values,
but ranges of values. For example, we might make these buckets to represent the probability that h is between, say, 0.8 and 0.85. Also, and this is more important than it might seem, rather than thinking of the *height* of each of these bars as representing the probability, think of the *area* of each one
as representing that probability. Where exactly those areas come from
is something that we'll answer later. For right now, just know that in principle, there's *some* answer to the probability of h
sitting inside one of these ranges. Our task right now is to take the answers to these
very coarse-grained questions, and to get a more exact understanding of the distribution at the level of each individual input. The natural thing to do would be to consider
finer and finer buckets. And when you do, the smaller probability
of falling into any one of them is accounted for in the thinner *width*
of each of these bars, while the heights are gonna stay roughly the same. That's important because it means that as you
take this process to the limit, you approach some kind of smooth curve. So even though all of the individual probabilities of falling into any one particular bucket approach 0, the overall shape of the distribution is preserved, and even refined, in this limit. If, on the other hand, we had let the *heights* of the bars represent probabilities, everything would've gone to 0. So in the limit, we would've just had a flat line giving no information about the overall shape of the distribution. So, wonderful! Letting area represent probability helps solve this problem.
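Here's a rough sketch of that limiting process (my own example, using an assumed density f(h) = 6h(1-h), which has total area 1 on [0, 1]); watch what happens to the bucket at h = 0.7 as the buckets get thinner:

```python
def f(h):
    # An assumed example density on [0, 1]; its total area is 1.
    return 6 * h * (1 - h)

def bucket_probability(lo, hi, steps=10_000):
    # Approximate the area under f between lo and hi with a Riemann sum.
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) * width for i in range(steps))

for width in [0.1, 0.01, 0.001]:
    p = bucket_probability(0.7, 0.7 + width)
    print(f"width={width:<6} probability={p:.6f} height={p / width:.4f}")
```

The bucket's probability shrinks toward 0, but its height, the probability divided by the width, settles down near f(0.7) = 1.26 instead of vanishing.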
But let me ask you: if the y-axis no longer represents probability, what exactly are the units here? Since probability sits in the area of these bars,
or width times height, the height represents a kind of probability per unit in the x direction, what's known in the business as a "probability density". The other thing to keep in mind is that
the total area of all these bars has to equal 1 at every level of the process.
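A quick check of that claim, reusing the same assumed density as before: however finely you chop up [0, 1], the bar areas still total (approximately) 1.

```python
def f(h):
    return 6 * h * (1 - h)  # same assumed example density as above

for n_buckets in [10, 100, 1000]:
    width = 1 / n_buckets
    total = sum(f((i + 0.5) * width) * width for i in range(n_buckets))
    print(f"{n_buckets:>4} buckets: total area = {total:.6f}")
```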
That's something that has to be true for any valid probability distribution. The idea of probability density is actually really clever when you step back to think about it. As you take things to the limit, even if there are all sorts of paradoxes
associated with assigning a probability to each of these uncountably infinitely many values of h between 0 and 1, there's no problem if we associate a probability density to each one of them, giving what's known as a "probability density function",
or PDF for short. Any time you see a PDF in the wild,
the way to interpret it is that the probability of your random variable
lying *between* two values equals the area under this curve between those values. So, for example, what's the probability of getting any one very specific number, like 0.7? Well, the area of an infinitely thin slice is 0, so it's 0. What's the probability of all of them put together? Well, the area under the full curve is 1. You see? Paradox sidestepped.
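In code, all three of those statements fall straight out of the antiderivative of the assumed density f(h) = 6h(1-h) from earlier, which is F(h) = 3h² - 2h³ (a worked example of mine, not the video's):

```python
def F(h):
    # Antiderivative of f(h) = 6h(1 - h): area under f from 0 up to h.
    return 3 * h**2 - 2 * h**3

print(F(0.8) - F(0.6))  # P(0.6 <= h <= 0.8): a genuine positive area, 0.248
print(F(0.7) - F(0.7))  # P(h = 0.7): a width-0 slice, exactly 0
print(F(1.0) - F(0.0))  # total probability: the full area, exactly 1
```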
And the way that it's been sidestepped is a bit subtle. In normal, finite settings, like rolling a die or drawing a card, the probability that a random value
falls into a given collection of possibilities is simply the sum of the probabilities
of being any one of them. This feels very intuitive.
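For instance, here's that finite rule for a fair die (a trivial sketch of my own):

```python
from fractions import Fraction

# For a fair die, the probability of a set of outcomes is just the
# sum of the probabilities of its individual elements.
p = {face: Fraction(1, 6) for face in range(1, 7)}
print(sum(p[face] for face in (2, 4, 6)))  # P(even) = 1/2
```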
It's even true in a countably infinite context. But to deal with a continuum, the rules themselves have shifted. The probability of falling into a range of values is no longer the sum of the probabilities of each individual value. Instead, probabilities associated with ranges are the fundamental primitive objects, and the only sense in which it's meaningful to talk about an individual value here is to think of it as a range of width 0. If the idea of the rules changing between a finite setting and a continuous one feels unsettling, well, you'll be happy to know that mathematicians are way ahead of you. There's a field of math called 'measure theory', which helps to unite these two settings and make rigorous the idea of
associating numbers like probabilities to various subsets of all possibilities
in a way that combines and distributes nicely. For example, let's say you're in a setting where you have a random number that equals 0 with 50% probability, and the rest of the time, it's some positive number according to a distribution that looks like half of a bell curve. This is an awkward middle-ground between a finite context, where a single value has a non-zero probability, and a continuous one,
where probabilities are found according to areas
under the appropriate density function. This is the sort of thing that measure theory handles very smoothly.
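To see why this middle ground is awkward, here's a sketch of sampling from that mixed distribution (my own construction; I'm taking the "half of a bell curve" to be |N(0, 1)|):

```python
import random

def sample():
    # With probability 1/2, a point mass at exactly 0...
    if random.random() < 0.5:
        return 0.0
    # ...otherwise the positive half of a standard bell curve.
    return abs(random.gauss(0.0, 1.0))

draws = [sample() for _ in range(100_000)]
print(sum(x == 0.0 for x in draws) / len(draws))   # ~0.5: one value with real probability
print(sum(0 < x < 1 for x in draws) / len(draws))  # ~0.34: an area-under-a-curve probability
```

A single value carrying probability 1/2 sits side by side with probabilities that can only be read off as areas, so neither a plain sum nor a plain integral covers the whole distribution on its own.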
I mention this mainly for the especially curious viewer; you can find more reading material in the description. It's a pretty common rule of thumb that if you find yourself using a sum in a discrete context, you use an integral in the continuous context, which is the tool from calculus that we use to find areas under curves. In fact, you could argue this video would be way shorter
if I just said that at the front and called it good. For my part though,
I always found it a little unsatisfying to do this blindly
without thinking through what it really means. And, in fact, if you really dig in
to the theoretical underpinnings of integrals, what you'd find is that in addition to the way that it's defined in a typical intro calculus class, there is a separate, more powerful definition
that's based on measure theory, this formal foundation of probability. If I look back to when I first learned probability, I definitely remember grappling with this weird idea
that in continuous settings, like random variables that are real numbers,
or throwing a dart at a dart board, you have a bunch of outcomes that are possible,
and yet each one has a probability of 0. And somehow, altogether, they have a probability of 1. Now, one step of coming to terms with this
is to realise that possibility is better tied to probability density
than probability, but just swapping out sums of the one for integrals of the other has never quite scratched the itch for me. I remember that it only really clicked when I realised that the rules for combining probabilities of different sets were not quite what I thought they were, and that there was simply a different axiom system underlying it all. But anyway, steering away from the theory,
somewhere back in the loose direction of application, look back to our original question
about the coin with an unknown weight. What we've learned here
is that the right question to ask is what's the probability density function that describes this value h after seeing the outcomes of a few tosses? If you can find that PDF,
you can use it to answer questions like 'What's the probability that the true probability
of flipping heads falls between 0.6 and 0.8?' To find that PDF, join me in the next part.
Summary: It's about probability densities.
Nice motivation for measure theory at the end there. The fact that simple concepts like Dirac measures make the weird PDF near the end of the video trivial to work with is both exciting for potential learners and, best of all, true!
That video was a letdown. I was hoping to start my day with Borel σ-algebras, and all I got was explanations of intervals.
So you're saying there's a chance...
If one wanted to assign a value to the probability of a singleton, it would obviously have to be infinitesimal. This can be done rigorously; see for instance "Infinitesimal Probabilities" (2018).
EDIT: I found another very nice paper, "Fair infinite lotteries" (2013).
This is why I think that probability theory should work with hyperreal numbers rather than real numbers. A probability of 0 should mean "cannot happen", and a probability of 1 should mean "must happen". With hyperreal numbers we can then say that something happens with probability 1/ω. For instance, for a uniform random process on [0, 1], the probability of getting 0.5 would not be 0, but rather 1/ω.
Uhh, by definition doesn't probability 0 mean impossible?
Like for any continuous random variable X,
P(X < x) = P(X ≤ x) for any value x,
which can only be true if P(X = x) = 0.