So we're talking about
joint distributions, right? And there's a lot more to do with that,
so to just continue. So last time, we calculated the expected
distance between two iid uniforms, okay? So I wanted to do this analogous
problem for the normal. Because I think that's another nice
related example that has a different approach that makes it easier, okay? So last time,
we did expected absolute difference. This is just an example, but
I think it's a nice example. Expected absolute difference
between two uniforms, and what if we wanna do the same
thing with normals? So we wanna find
the expected value of say, let's call them Z1 and Z2. So, we did this with uniform last time,
now assume these are iid standard normal. Okay, so last time we did this for
uniform, using the 2D version of LOTUS, right? Completely analogous to LOTUS, except we had a double integral
instead of a single integral. So these are iid standard normal. So, we could write down the 2D LOTUS here,
and try to do that integral. And because they're iid,
the joint PDF of Z1 and Z2 is just the product of
the two marginal PDFs. And well,
we could just try to do that integral, and we could probably get it with some effort. But that's not a good way to do this
problem, it's better to stop and think about the structure of the problem,
okay? So in the case of the uniforms,
we never particularly studied the properties of the difference of two uniforms. On the other hand, the difference of normals is
something we've talked about before. So instead of jumping right into
this two-dimensional thing, let's see if we can actually
simplify the problem first. So in fact, we've mentioned before that
the sum of independent normals is normal. We haven't proven that yet, but we have all the tools to
be able to prove that now. So let's just do that quickly
to verify what I said before about the sum of normals,
so just a little theorem. This is gonna be easy now,
because we know MGFs. The sum of normals, so
we stated this before. If X is, let's say N(mu 1,
sigma 1 squared), and Y is N(mu 2, sigma 2 squared) and
they're independent, X has to be independent of Y,
otherwise this won't work. Then the sum, we talked about this before, by linearity the means just add, and also the variances add. And we talked about the fact
that if we took a difference, we would take the difference of means. But we would still add the variances,
not subtract. Because if this were -Y, you would
just think of it as plus -Y, okay? So anyway, let's just prove this fact now,
which we haven't done yet, and this is just an easy MGF calculation. So we just use the MGFs. So let's get the MGF of X + Y. Since they're independent, we talked about the fact that we can just multiply the MGF of X times the MGF of Y. The MGF of a normal, well, we derived
the MGF of a standard normal before. But it's very easy to get from
a standard normal to any normal, right? If we do this thing, mu + sigma z, we can
immediately get the MGF of any normal. And that's just gonna be e to the mu 1 t plus one-half sigma 1 squared t squared, that's the MGF of X. We multiply by the MGF of Y, which is the same thing, you just change the subscripts: e to the mu 2 t plus one-half sigma 2 squared t squared. Now let's just write this as one exponential and factor. So that's e to the (mu 1 + mu 2) t, just factor out the t, plus one-half (sigma 1 squared + sigma 2 squared) t squared, this is all up in the exponent. Okay well, I ran out of space on this
board, but that's the end of the proof. Because all we have to do is just say,
look, that's the MGF. I have a little more space,
that's the MGF of N(mu 1 + mu 2, sigma 1 squared + sigma 2 squared),
All right, so since the MGF determines the distribution, then that's the end,
we don't have to do anything else. So, it's a very easy calculation using MGFs.
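Just to record the board computation in one place (nothing new here, this is the same argument):

```latex
M_{X+Y}(t) = M_X(t)\,M_Y(t)
           = e^{\mu_1 t + \frac{1}{2}\sigma_1^2 t^2}\, e^{\mu_2 t + \frac{1}{2}\sigma_2^2 t^2}
           = e^{(\mu_1+\mu_2)t + \frac{1}{2}(\sigma_1^2+\sigma_2^2)t^2},
```

and the right-hand side is exactly the N(mu 1 + mu 2, sigma 1 squared + sigma 2 squared) MGF.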
Okay, so now that we've proven that fact, when we see this thing Z1 - Z2, rather than jumping into the 2D LOTUS,
let's just say, what is that? Well note that Z1- Z2 is N(0, 2), just add the variances. So really all we're asking is for
the expected value of the absolute value of an N(0, 2). Now when we say N(0, 2), let's once again think about that
as location and scale, right? We could take a standard normal, and
multiply by the square root of 2, and that would give us variance 2. So the easiest way to think
of this is as square root of 2 times Z,
where Z is standard normal, right? That's just the scale,
that gives it variance 2. Now this is just square root of 2 times E|Z|. Now it's just a one-dimensional LOTUS. And this is a LOTUS that
you've actually seen. If you studied strategic practice five,
we did this. But whether you remember
ever looking at that or not doesn't matter, this is an easy LOTUS. Whereas here, you have to do a double
integral, here I just write down LOTUS. So I'll do this quickly,
cuz it's on the strategic practice, it's just writing down LOTUS: the integral from minus infinity to infinity of |z| times 1 over root 2 pi e to the -z squared over 2 dz. And notice that this is an even function. That is, if we replace z by -z,
we get the same thing. So we can just multiply by 2 and
go from 0 to infinity. And once we go from zero to infinity,
we can drop the absolute values. Then it's just z e to
the minus z squared over 2. That's a really easy
u-substitution integral, right, cuz you can just let u equals z squared,
or u equals z squared over 2 if you like. And then you get exactly what you want,
so that's then an easy integral. And if you simplify it,
you get square root 2 over pi, which should be an easy calculation. It's also on the strategic practice, so I
won't write out more of that calculation. So then that becomes just
a simple one-dimensional LOTUS, that's a much better way to think of it. All right, so just an example that you
don't always have to jump into the 2D LOTUS, just cuz you have this
function of two variables. Okay, so that's a continuous example. I wanted to do some more discrete stuff.
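As a quick aside (not something we did in lecture), you can sanity-check that answer numerically: the claim is E|Z1 - Z2| = square root of 2 times square root of 2 over pi, which is 2 over square root of pi. A minimal simulation sketch, with an arbitrary seed and sample size:

```python
import numpy as np

# Simulate E|Z1 - Z2| for iid standard normals and compare to the exact value.
rng = np.random.default_rng(0)          # arbitrary seed
z1 = rng.standard_normal(10**6)
z2 = rng.standard_normal(10**6)

print(np.mean(np.abs(z1 - z2)))         # simulation estimate
print(2 / np.sqrt(np.pi))               # exact value, sqrt(2) * sqrt(2/pi)
```

Both numbers should come out around 1.13.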
In particular, I wanted to introduce the multinomial distribution, which is by far the most important discrete multivariate distribution, and I'll tell you what multivariate distribution means. So this is gonna be
called the multinomial. A multivariate distribution just
means that's a joint distribution for more than one random variable, right? So we have all these normals and
Poisson and geometric, and so on. Those are all univariate distributions,
cuz we have one random variable. Now we're working with more than
one random variable at once. And for this course, there's really only
two multivariate distributions that you need to know by name. One is the multinomial,
which we are about to do. The other one is the multivariate normal, which is the generalization of the normal distribution to higher dimensions, and we'll get to that one later, okay? So the multinomial, as the name might suggest, is a generalization of the binomial, right? Bi becomes multi, okay? So it's like a higher dimensional
version of the binomial, and let's just introduce it by its story. So this is the definition and
story, Of the multinomial, which I'll sometimes just
abbreviate to Mult(n, p). It has two parameters,
n and p like the binomial, except in this case,
this p is actually a vector. So p is a vector, let's say (p1, ..., pk), where we assume that that's a probability vector. And by probability vector, all I mean is that these are nonnegative and add up to 1. Cuz we're gonna think of them as
probabilities for disjoint cases, so that encompasses all possibilities. So we want pj greater than or equal to 0, and the sum of all pj's = 1. That's the assumption, okay? So the binomial would just be
if this is one dimensional and then we just have binomial np,
but now we have k of them. So the intuition is that in the binomial, we just talked about success and
failure, right? There are two possible outcomes,
there are two categories. Multinomial means instead of two
categories, we have k categories, okay? So it's a natural extension, right? And binomial, we have to classify
everything as either success or failure for each trial. Here we have more than two possible results, okay? So we say that X is Multinomial(n, p). We think of that as saying that, in this case, X is also a vector; this is a multivariate distribution, so X = (X1, ..., Xk). So like in the binomial,
we have n independent trials. But I'll just call them objects
instead of trials and each object, objects could be people, could be trials,
could be anything, so just very general. We have n objects that we
are categorizing, okay? We have n objects, which we are
independently putting into k categories. So there are k possible categories, and
the binomial is just success or failure, but now we have k categories. And for each object, it's independently determined which category it falls into, okay? Just like in the binomial,
we had independent Bernoulli trials. And if Pj is the probability of category j, by probability of category j, I mean that any one of these objects is in category j with probability Pj. And we interpret Xj as just the count, the number of objects in category j. All right, so that was a lot of writing,
but the concept is really simple. We just have n things that we're breaking into categories, and then we just see how many objects
are in each category, right? So it's very natural, you can make up
as many examples of this as you want, really easily, right? Just anytime you're classifying things into categories. It's very, very general.
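As a little aside, here's a sketch of that story in code (not from the lecture; the numbers for n and p are made up): put n objects independently into k categories with probabilities p, then count how many land in each category. That categorize-and-count procedure is exactly what a multinomial sampler does directly.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
n, p = 20, np.array([0.5, 0.3, 0.2])            # k = 3 categories, made-up probabilities

labels = rng.choice(len(p), size=n, p=p)        # which category each object falls into
counts = np.bincount(labels, minlength=len(p))  # X = (X1, ..., Xk), the category counts
print(counts, counts.sum())                     # the counts add up to n

print(rng.multinomial(n, p))                    # same distribution, sampled directly
```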
Okay, so let's find the PMF. This is gonna be a joint PMF, cuz it's a joint distribution. So we want the probability that X1 = n1, blah, blah, blah, Xk = nk, right? That's a joint PMF, we just need to say
what's the probability that there are n1 objects in the first category and
n2 in the second category and so on? And we can immediately write
down the answer just by thinking back to how we derived the binomial PMF. All we have to do is imagine
any particular sequence, it's gonna be P1 to the n1, P2 to the n2,
blah, blah, blah, Pk to the nk. Just to have a little
intuitive example in mind, let's just suppose this is very
similar to how we did the binomial. But just to quickly review and
generalize that. Suppose we just have three categories, just to have a little
mental picture in mind. We have three categories, and let's just say our sequence, let's just write 1, 2, 3, where 1 means category one and so on. So we might have a sequence like 23311112,
for example, okay? So let's put a couple more 2s, so that there are four 2s, two 3s, and four 1s, for example. This says that the first
object is category 2, right? We're just categorizing the objects one by one. So for any particular sequence like this, the probability would be P1, the probability of category one, raised to the power of how many 1s there are, right? I need to put another one there. Then P2 to the power of the number of 2s and
so on. That will be the probability
of any specific sequence that has the desired counts,
right? But then we can permute
this however we want, then it's just going back
to those counting problems. How many ways are there to permute
the letters in the word pepper, or the letters in the word Mississippi or
something like that. Where you start with n factorial, but that overcounts because the twos
could have been in any order. The threes could have been in any order, the ones could have been in any order,
and so on. So you have to adjust for
that overcounting. Exactly like we did for the binomial,
so we just divide by n1 factorial, n2 factorial, blah, blah,
blah n k factorial to account for all the ways you could permute the 3s,
permute the 1s, permute the 2s. Of course, there's a constraint here, this is if n1 plus blah,
blah, blah plus nk = n. Otherwise, it doesn't make sense, right? Cuz we have n objects. We're assuming that every object
is in exactly one category. So it wouldn't make sense if
we added up these counts and they had too many or
too few, makes no sense. So it's 0, otherwise. That is if the sum of
these n's is not this n, then it's impossible, so it's 0. So that didn't require a calculation. It just required thinking about an example like that and counting the different ways to permute things, right. So that's the joint PMF.
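Written out compactly, the joint PMF we just derived is:

```latex
P(X_1 = n_1, \dots, X_k = n_k) \;=\; \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k},
\qquad \text{if } n_1 + \cdots + n_k = n,
```

and 0 otherwise.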
It looks a lot like the binomial, all right. So it's a generalization of the binomial
when you have more than two categories. So we'll come back to some other
properties of the multinomial later, but just to do a couple quick
properties to think about. We could ask about the marginal
distribution, conditional distribution, things like that. So let's think about
the marginal distribution first. Okay, so we're letting X be Multinomial(n, p). Sometimes I'll subscript a k,
just to indicate what the dimension is, so the number of categories. And suppose we want the marginal,
find the marginal distribution of just one of these components,
let's say Xj. So Xj is just how many people or
how many objects are in category j. We want its marginal distribution. What do you think that is? Yeah, binomial, why did you say binomial? Exactly, each object is either in that category or it isn't. So I mean, if
you look at your notes, how do you get from joint distribution
to marginal distribution? I would say if you take this thing and do k- 1 sigma sign sum over all
possible things, do a lot of algebra. But that's not thinking about it, right? To marginalize we'd sum up the joint or
we integrate in the continuous case. We sum in the discrete case,
sum of everything we don't want, okay? But instead let's just think about
the story, think about what it means. As you just said, each of these objects,
either it's in category j or it isn't. We're assuming they're
all independent trials. So if we define success to
mean being in category j, the probability of success is pj for each object. So that's just immediate. I didn't write a justification for this, but it just follows from the story; it's a completely valid proof to just say it's binomial because we have independent Bernoulli trials, and that's the probability of success, okay?
particular that also gives us the mean and the variance without having to
do a calculation: E(Xj) = npj. And the variance, because we derived the variance of the binomial before, we don't need to re-derive that. We already know the variance of a binomial is np(1 - p), so this is npj(1 - pj), no additional work needed because we know it's binomial. Okay, so that's just immediate from thinking about what this means. So that's one property, that's the marginals.
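As a quick numerical aside (not from the lecture), you could check the marginal this way; n, p, the category index, and the number of simulation draws below are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
n, p = 50, np.array([0.2, 0.5, 0.3])           # made-up parameters
draws = rng.multinomial(n, p, size=10**5)      # each row is one multinomial vector

xj = draws[:, 1]                               # counts in category j = 2 (index 1)
print(xj.mean(), n * p[1])                     # should match n * p_j
print(xj.var(),  n * p[1] * (1 - p[1]))        # should match n * p_j * (1 - p_j)
```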
Now let's think about something kind of similar. Let's call this, well, I call this the lumping property. The question is,
we have all these categories, well what happens if we decide to merge
certain categories together, right? Okay, so just to have an example in mind, let's let K = 10, so
we're thinking of X as a vector. X1 through X10 and
just to have a concrete example in mind. So this is multinomial, let's say this
is multinomial, and, P1 through P10. And to have a concrete example in mind, well let's imagine we're in a country
that has ten political parties. Okay, and you take n people and
assume that the people are independent of each other, and
you wanna know how many people are in each party, and assume that everyone in this country is
a member of one of these ten parties. Okay, and then you take all
these people and you say, okay. Ask each person which party they're in. X1 is the number of people in
the first political party, X2 is the number in the second one,
and so on, right? So that that would be multinomial if these are the probabilities of the
different party memberships, all right? So now, what I call the lumping property
is what if it's a country where
all the other parties are much smaller? And so it might be kind of unwieldy to
deal with this ten dimensional vector. Maybe we wanna compress all the third parties, so suppose that the first two are kind
of the two dominant major parties and the rest of them are kind of minor, so
we may wanna just lump them together. So that's why I call it
the lumping property, lump all the other parties together. So what if we considered,
let's see, let Y = (X1, X2, X3 + blah, blah, blah + X10), where we group all these other ones together, so I'll just add them up, right. So this would be like party one,
party two, and then the other parties grouped together. Without doing any calculation or
algebra whatsoever, we can immediately write
down the distribution of Y. Y is just gonna be multinomial. Same n. And then all we've done is group
these categories together, but then it's the same problem again, it's just that the last category has a larger probability; you just lump together all those p's. Okay, so this should be
obvious from the story, right. It's the same problem again. So just like we emphasized with
the binomial we can define success and failure however we want. Here we can rearrange the categories and
whatever, the only thing we need to make sure of is that each
object is in exactly one category. So it would not work if you could
be in more than one category or be in no categories. But if you define your
categories such that it's true that each object is in exactly one, then you get a multinomial. We didn't need to do any algebra or
calculus to show that. So that's pretty nice.
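Here's a small simulation sketch of the lumping property, again just as an aside; the party probabilities are made up. Lumping the sampled counts and sampling from the lumped multinomial directly should behave the same way.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
n = 100
p = np.array([0.35, 0.30, 0.10, 0.08, 0.06, 0.04, 0.03, 0.02, 0.01, 0.01])  # made up

draws = rng.multinomial(n, p, size=10**5)       # Mult_10(n, p) samples
lumped = np.column_stack([draws[:, 0], draws[:, 1], draws[:, 2:].sum(axis=1)])

direct = rng.multinomial(n, [p[0], p[1], p[2:].sum()], size=10**5)  # Mult_3 directly
print(lumped.mean(axis=0))                      # both should be about n * (p1, p2, p3+...+p10)
print(direct.mean(axis=0))
```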
Similarly, let's get the conditional distribution. So again, X is multinomial. What if we want the conditional distribution where we get to learn what X1 is, and
of the rest given that we now know X1. So we want a conditional, you might call it a conditional
joint PMF because you're given X1. Let's say that we're given that X1 = n1,
okay? And then we want the conditional joint
distribution of everything else. So we know exactly how many
people are in the first category. But we don't know about the rest of them. Well, given that X1 = n1, we want the joint PMF of
the rest X2 through Xk. Still gonna be multinomial, but we have to be a little bit careful
with getting the parameters right. So now this is gonna be k- 1 dimensional, cuz we know how many people
are in the first category, but we're looking at the remaining
k - 1 categories. And the number of people, well, n1 have been allocated into the first party, okay? So we have n - n1 people left. And then we just have to get
the probability vector, right? Now if we just wrote p2 through pk,
that would be a common mistake, but it should be easy to see that that's
a mistake because those don't add up to 1. So it can't just be p2 through pk, right? I'm imagining that I've taken, and
it doesn't matter which people. I can imagine,
I'm conditioning on the count. But then I could further condition on
which specific people are in category one, and then use symmetry. So I guess, so I may as well just assume
that the first n1 people are in category one, okay, but to get these ps, well, then
we have to think conditionally, right? So let's call this vector, let's call it p2 prime through
pk prime where somehow we have to figure out what's p2 prime and
so on. Because without the primes, it doesn't add up to one, so it makes no sense. So let's find p2 prime, for example.
be proportional to P2, right? Cuz I know how many people
are in the first party, but that shouldn't kind of affect the relative
distribution of the rest of the parties. So basically you just have to renormalize. If I want to write that out mathematically, I would say p2 prime equals the probability of a random object being in category 2, given that it's not in category 1. Because we've already thrown out
the ones that are in category 1. So just by the definition of conditional
probability, being in category 2 I take the intersection of this and this,
but once you say you're in category 2, you know you're not in category 1,
so that's redundant. So the numerator is just P2. And the denominator is 1- P1, that is just the probability
of not being in category 1. Or we could also write it as P2
over P2 + blah, blah, blah + Pk. And similarly for the other ones, Pj prime equals Pj over P2 + blah, blah, blah + Pk. All this says is that we've renormalized, and it's still multinomial.
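Putting that conditional statement in one line:

```latex
(X_2, \dots, X_k) \mid X_1 = n_1 \;\sim\; \mathrm{Mult}_{k-1}\!\left(n - n_1,\; (p_2', \dots, p_k')\right),
\qquad p_j' = \frac{p_j}{1 - p_1} = \frac{p_j}{p_2 + \cdots + p_k}.
```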
Okay, so multinomials have really nice properties like this, and you can see these things just by thinking about
what it means without doing a calculation. So that's a very useful distribution for
lots of applications. Okay, so we'll say more about
the multinomial in the next lecture or the lecture after. But I wanna do one more
continuous example as well, an example where we actually
do need to do a calculation. And this is another kind of famous one, a good example of how we work with joint PDFs, which I think we need more practice with
or at least you need more practice with, then I'll try to help with that,
so this is a good example. I call this the Cauchy Interview Problem. I call the Cauchy Interview Problem
not because Cauchy asked this as an interview problem,
call it that than the Cauchy Problem. But actually for some reason, this doesn't seem like it should
be a common interview problem, but I've actually seen this on several
occasions asked as an interview problem, just I think to test whether you can work with joint PDFs and things like that. So okay, it is an interview problem, though it sort of
shouldn't be in some sense. Anyway, I have to tell you what
the Cauchy, I mean Cauchy was a famous mathematician, but in this context, Cauchy
is referring to a specific distribution. The Cauchy distribution. It's a famous distribution
that has a lot of kind of weird, scary properties. I just got some of these distribution
plushies that I found online. I might bring them, but they're
a little bit small to show you here. But if you come to my office hours,
you can see them in my office. But little pillows illustrating different
distributions, I have them in my office. The Cauchy is called the evil Cauchy and it looks pretty evil. And so let me first tell you
what the distribution is, and then tell you a little
bit about why is it evil. And then we'll try to find its PDF,
which as I said, has been a common interview problem,
find the PDF of a Cauchy. That's the problem, okay? So the Cauchy is
the distribution of let's say, X over Y with x and y iid standard normal. So it's a simple definition, just take a ratio of two iid standard
normals and we call that a Cauchy. And you can see why that could
be a useful distribution for a lot of different applications where
ratios are a pretty natural thing. So that's a Cauchy, and
the problem is find the PDF. Of this random variable, okay? Let's call this thing T. Find PDF of T. So we're defining T to be
the ratio of iid standard normals. We want to find its PDF. All right, so that doesn't yet
answer why this is evil. Well, some properties of the Cauchy
that we're not gonna prove right now, but just to kind of foreshadow
why is this thing so evil. First of all,
it does not have an expected value. If you try to compute e,
expected value, it'll blow up. No, no, no, that's not that evil. There are a lot of distributions where if
you try to compute the expected value, it blows up. So it does not have a mean,
it doesn't have a variance. The thing that's really evil about the Cauchy is that if you take iid Cauchys, so let's say we don't just have T, we have T1 through Tn, which are just iid ratios of normals. When we get to the law of large numbers later in the course, we'll see that when we average a bunch of iid random variables, the average should be close to their mean, right? You average a lot of iid things
that should be close to the mean. In this case there is no mean. But the weird fact is that if you average all these iid Cauchys, the distribution of that average is still Cauchy; averaging doesn't change the distribution. You can average a million iid Cauchys and it's still gonna be Cauchy. So in some sense that's kind of evil: you're hoping that as you collect more and more data, you converge to the truth in some sense. In this case, if all you do is average, then you're just not getting anywhere, the distribution doesn't change. Now, if you had Cauchy data, there
are other ways to work with it. It would be a bad idea to just naively average everything; there are other things you could do. Okay, so
anyway, that's the Cauchy distribution. Now let's find the PDF, just for
practice with our joint distributions. And there are several
ways we could do this. One way would be to use
the law of total probability, and condition on y to make things easier. And that's a perfectly good way to do it. But I think I wanna just start by
practicing just more directly how to just directly get the CDF. Let's find the CDF, and
take the derivative and get the PDF. So with the CDF we could use
the law of total probability, but let's just directly write down. It's going to be a double
integral because we have an X and a Y and let's just write down that double
integral and see if we can do it, okay? So let's find the CDF. So the probability that
x over y is less than or equal to some number, t,
that's what we need for the CDF. This is an event, it's an event that the ratio
is less than or equal to t. We want to find some probability of
an event where it's based on x and y, so unless we can think
of some clever trick for simplifying this we basically
have to do a double integral. Or else, we can use the law of total
probability and do a single integral, but I actually don't think
that's any easier here. So my first impulse would be to
multiply both sides by y here. But you have to be careful in doing
that because y could be negative, so we can simplify this
a little bit by using symmetry first, and putting absolute values. This follows from the symmetry
of the normal. And you can think through for yourself
exactly how I'm using symmetry here, but the basic idea is with the normal. If I have a standard normal and multiply
it by minus 1 it's still standard normal, if I multiply it, if I randomly chose say
with probably one half multiply it by minus 1, probably one half do
nothing It's still standard normal. Have the same symmetry in the denominator,
so sort of have two symmetric things. And we might as well just kind
of absorb the plusses and minuses and write it this way,
follows from symmetry. The reason I wanted to do that is just so
that I could write this as x less than or equal to t absolute y, without having to flip the inequality or worrying
about whether the inequality flips. Now let's just write this down
as a double integral, okay? We can do either dx dy or dy dx, but let's suppose that we are doing dx dy. And to get a probability, well, what we do is we integrate the joint PDF over whatever region we want, okay? So y goes from minus infinity to infinity. And the main thing, again, to be careful
about, is the limits of integration. For x, the inner limits can depend on y, and we're looking at the region
that goes up to t absolute y. So x goes from minus infinity
to t absolute y, and then what we're integrating
is just the joint PDF, right? So the joint PDF is 1 over root 2 pi,
e to the minus x squared over 2. And then same thing for the y,
1 over root 2 pi e to the minus y squared over 2, because they're iid standard normal. So the other term, e to the minus y
squared over 2, doesn't depend on x. So I could write it here but
I could immediately then pull it out here. So I may as well write it here so
that it's not interfering with this part. So it's e to the minus y squared over 2, and there's another 1 over root 2 pi, which just sticks over there. So all I did is write down the normal PDF for x and the normal PDF for y, and I pulled out the y part cuz that doesn't depend on x. That looks pretty ugly so
let's see if we can do it, well, one thing that we could simplify is just
recognizing what do we actually have here. So we have this integral,
minus infinity to infinity, e to the minus y squared over 2, and
then we have this inner integral. Okay?
Now in one sense we can't do this integral. Because that's the normal PDF and you can
prove that you can't do that integral. And in another sense,
not only can you do that integral, you already know what
that integral is right? That's just capital phi evaluated here,
that's just the normal CDF. So actually it's just phi,
so depending on whether you consider that doing the integral or
not, it's just that, dy. That's just the definition of
the standard, normal CDF, okay? Now these absolute value signs
are a little bit annoying. So, let's notice that we
have an even function, because y squared, absolute value y,
this is an even function. So we may as well go from 0 to
infinity instead, and multiply by 2. So then we'd have a square
root of 2 over pi. I just multiplied by 2, and then we're going from 0 to infinity, e to the -y squared over
two capital phi of ty dy. All right, and then you know, the clock
is ticking on our job interview, and we've gotten to this point and it's sort of possible we start to panic. And that capital phi is
an intractable integral, that's why we call it capital phi,
it's cuz we couldn't do it. Now, you are being asked in
your interview to integrate an integral that you can't do, which sounds pretty bad. However, one thing that might help
is that on the interview, we were asked to find the PDF,
not to find the CDF, that's the CDF. And we know that the PDF is
the derivative of the CDF, so the PDF is the derivative of the integral
of an integral that we can't do. So somehow maybe that will save us. So let's take the derivative. So here's the PDF,
PDF is the derivative of the CDF. This thing is capital F(t),
if we call the CDF capital F. The PDF is the derivative, F'(t). So we're taking the derivative
with respect to t, not with respect to y
which would make no sense. Notice that this y is a dummy variable,
okay? This is a function of t we're taking
the derivative with respect to t. Okay, now there's a theorem
in calculus that says, under some pretty mild conditions, if you
have a reasonably well-behaved thing that you're integrating, you can exchange
the derivative and the integral. This is a very,
very well behaved function. Capital Phi is just a continuous
differentiable thing between 0 and 1. And e to the -y squared over 2, that's infinitely differentiable. It decays to 0 very fast, so
this is a very, very nice function. So there's gonna be no technical problem
whatsoever with swapping the derivative and the integral. We're gonna take the derivative
of this with respect to t, and then we're gonna try to simplify it. So we take the derivative,
bring the derivative inside, okay? So we have the integral 0 to infinity,
e to the -y squared over 2. We're differentiating with respect to t,
we're bringing in a d/dt. So we're treating e to the -y squared over 2 as a constant, since we're differentiating with respect to t. Then we take the derivative of
capital Phi of ty, by the chain rule, y is gonna come out, because we
are differentiating with respect to t. So y is going to come out
from the chain rule, y. And then we just need
the derivative of this, but the derivative of
the standard normal CDF is the standard normal PDF,
which is 1 over root 2pi, e to the -z squared over 2
in general where z is ty. So it's e to the -t squared,
y squared over 2; I just squared this thing and divided by 2, dy. Now let's see if we can do it. So the square root of 2 here
cancels this square root of 2. We have square root of pi, square root
of pi, so we're gonna get 1 over pi. And then we just need to integrate from 0 to infinity of y e to the -t squared y squared over 2 dy. Now this looks like an integral we can do. >> [INAUDIBLE]
>> The other what? >> [INAUDIBLE]
>> Say that again. >> [INAUDIBLE]
>> There's another e to the -y squared over 2. Yeah, I forgot that one, thank you. There's another,
we'll just combine that one with this one. So that would be 1. Uh-oh, I guess I don't get hired, that's sad. There's another e to the -y squared over 2 that I forgot. But now, thank you, I put it back, okay? I haven't interviewed for any jobs since I came here five years ago, so I'm kinda rusty. So I put back the e to the -y squared over 2 that you helped me with, and now that should be okay, right? Now, this is an integral we can do, because we know that the derivative
of y squared is gonna be 2y, and that's gonna be taken care of there,
now it's an easy u-substitution again. So we can just let u = (1 + t squared) y squared over 2. Just make that substitution, so then this just becomes e to the -u, okay? So du = (1 + t squared) times, now we're treating t as a constant again, we're changing the variable y, transforming it to u. So the derivative of y squared over 2 is y, so we have y times (1 + t squared) dy. So we have the y dy, we're just
missing the 1 + t squared, okay? So I'll just multiply and
divide by 1 + t squared. Then we're just integrating e to the -u
du, which is a very, very easy integral. We know that that's 1,
either just by doing it or because it's the integral of
the exponential PDF again. Okay, so then we immediately have the answer, 1 over pi (1 + t squared), for all t. So that's the PDF.
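Putting the whole calculation together in one line (with the term we initially forgot included):

```latex
f(t) = F'(t)
     = \sqrt{\tfrac{2}{\pi}} \int_0^\infty e^{-y^2/2}\; y\; \tfrac{1}{\sqrt{2\pi}}\, e^{-t^2 y^2 / 2}\, dy
     = \frac{1}{\pi} \int_0^\infty y\, e^{-(1+t^2) y^2 / 2}\, dy
     = \frac{1}{\pi (1 + t^2)}.
```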
If we wanted the CDF, all we would have to do is integrate this, and then it's gonna be some arctangent thing. All right, so that's the Cauchy.
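As an aside (not part of the lecture), here's a quick simulation sketch, with an arbitrary seed and sample size, checking that the ratio of two iid standard normals really behaves this way; the CDF corresponding to this PDF is the arctangent thing, F(t) = 1/2 + arctan(t)/pi.

```python
import numpy as np

rng = np.random.default_rng(0)              # arbitrary seed
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)
t = x / y                                   # T = X/Y, which should be Cauchy

for c in [-2.0, 0.0, 1.0, 3.0]:
    empirical = np.mean(t <= c)             # P(T <= c) from the simulation
    exact = 0.5 + np.arctan(c) / np.pi      # integrating 1/(pi*(1 + t^2)) up to c
    print(c, empirical, exact)
```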
And let me just quickly show you how you would start the other method, which uses the law of total probability. I'm not gonna do the whole thing,
because at some point, that's just gonna reduce
back to this method. But just to show you
what it would look like. Just as a quick alternative without
going through the whole thing, cuz it's gonna be similar. But it's useful to have both methods. So this would be the method using the
double integral, okay, and that's the PDF. Which by the way,
we should check that that's a valid PDF. Does it integrate to 1? Well, if you integrate that thing,
you'll get an arctangent thing. And you can check that when you
evaluate the arctangent thing, you will get 1, okay? So just quickly, the alternative
using the law of total probability. x less than or
equal to t absolute value of y. You kind of just think to yourself,
what do we wish that we knew here? We could decide to condition on x, or
we could decide to condition on y. This is gonna be the integral,
let's say we condition on y. The probability x less than or equal to t, absolute value of Y,
given, let's say, Y = y. This would be the law of total
probability, right, just conditioning. We can choose whether to condition on x or
to condition on y, but I think I wanna condition on y. Okay, law of total probability,
remember in the discrete case we've seen, we sum over all cases, P of A given B times P of B, whatever, where we have a partition. And in this case, we're integrating instead of summing, so we're conditioning on y and then we're multiplying by phi of y, where lowercase phi is the standard normal PDF. All right, well,
let's see if this helps at all. This is saying to treat Y as just
being known to equal little y, okay? So I can plug in little y there. And then the tricky part here is that we need to use the fact
that X is independent of Y. Because if X were not independent of Y,
you could plug this thing in, but then you still have this condition, okay? But since they're independent,
you can plug in Y = y and then get rid of the condition,
because they're independent. So when we do that,
that's just gonna be phi. The probability that x is less than or
equal to t absolute value of y, is just capital Phi of t absolute value of y, just by definition, right? Because we're plugging in y, that's just the probability of X less than some constant. It's just the standard normal CDF evaluated there, times phi of y, dy, which I think is
the same as the integral we had. Does that look the same? Yeah, so over there, I just wrote out
what this is, but it's the same thing. And then proceed in the same way. So that would be a second way to do this. We'll see a third way later on, just because this is
a common interview question. So it's good to have three, or at least more than two, ways to do it. All right, so I'll stop for now. I'll see you Wednesday.