17 Probabilistic Graphical Models and Bayesian Networks

Captions
In this video we're going to start talking about graphical models, or probabilistic graphical models. Specifically we'll talk about Bayesian networks, which are one type of graphical model. We'll cover the general idea behind graphical models, I'll define what a Bayesian network is, and then we'll look at examples of graphical models we've already seen, namely how naive Bayes and logistic regression can be viewed as Bayes nets. Then we'll talk about how to do inference in Bayes nets.

A probabilistic graphical model is a way of representing a probability distribution, encoded so that we can read off its conditional independence structure from a graph. I'll explain how that works, but the idea is that we want to encode which variables are independent of which other variables, given other variables, in a graph, so that we can use graph algorithms to do inference and learning. It's all about abstracting the idea of conditional independence, which is a statistical, probabilistic idea, in a way that lets us use computer science to reason efficiently about probabilities.

Here's an example: a Bayesian network that describes a conditional independence structure. What makes it a Bayesian network is that the distribution is represented by a collection of variables, which are the nodes in the network, and directed edges between the variables (they have arrows attached, so they have a directionality) that describe the conditional dependencies. Again, what we're trying to describe is a full joint probability distribution, in this case P(L, R, W): winning the lottery, it raining, and the ground being wet. What this Bayesian network structure tells us is that the joint probability factorizes as P(L, R, W) = P(L) P(R) P(W | R). Specifically, because the lottery variable sits by itself in the graph with no connections to any other node, it is completely independent of everything else, while whether the ground is wet is related to whether it's raining. This example is obviously a cartoon, yet an understandable one: winning the lottery has nothing to do with the weather, but whether the ground is wet certainly depends on the weather. The idea is that there's some probability of rain, and then some probability of the ground being wet given the rain.

We can grow this model by including other variables. For instance, if the ground is wet we might expect some probability of slipping, in which case the factorized joint becomes P(L, R, W, S) = P(L) P(R) P(W | R) P(S | W). So we have the full joint probability, and we can factorize it into a collection of prior probabilities and conditional probabilities. We again have the same structure as before: P(L) by itself, P(R) more or less by itself, then a P(W | R) term that depends on R, and a P(S | W) term that depends on W.
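
As a concrete illustration, here is a minimal sketch of the lottery/rain/wet-ground factorization in Python. Only the structure comes from the example above; the probability numbers are made up for illustration.

```python
# Hypothetical conditional probability tables for the L (lottery), R (rain),
# W (wet ground) example; the numbers are invented for illustration.
P_L = {True: 0.001, False: 0.999}            # P(L)
P_R = {True: 0.3,   False: 0.7}              # P(R)
P_W_given_R = {                              # P(W | R), indexed by R then W
    True:  {True: 0.9, False: 0.1},
    False: {True: 0.2, False: 0.8},
}

def joint(l, r, w):
    """P(L=l, R=r, W=w) under the factorization P(L) * P(R) * P(W | R)."""
    return P_L[l] * P_R[r] * P_W_given_R[r][w]

# Sanity check: the joint sums to 1 over all eight assignments.
print(sum(joint(l, r, w)
          for l in (True, False)
          for r in (True, False)
          for w in (True, False)))
```
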
One of the interesting things here is that the Bayesian network structure shows us we can ignore certain dependencies. We could just as easily have written a factor P(S | W, R), because in some sense R determines which conditional probability distribution we use for W, and W then determines which distribution we use for S; but if we already know the value of W, we don't need to consider R anymore, so we can cross it out. To summarize this definition-by-example: a Bayesian network describes a probability distribution among a set of variables by putting directed edges between the variable nodes, where the edges correspond to the conditional probability factors in our factorized distribution. It's important to note that the factorization the network implies restricts which distributions can fit it: you can't express all possible joint distributions if you're restricted to these factorizations.

A few interesting things happen when you look at these factorizations. Consider a new network, a slight variant of what we had previously, where now there's another possibility for why the ground may be wet. The factorized distribution corresponding to this Bayesian network is the following: we have P(R) by itself, we have P(C) by itself, C being a car wash (so either it's going to rain or there's going to be a car wash), and now we have the probability of wet ground given car wash and rain, P(W | C, R). It's not really an "or": it's the two variables considered together, and remember these variables can take multiple states. This is just a conditional probability table indexed by the states of C and R, so we're not saying that C and R are both true; whether they are (true, true), (true, false), (false, true), or (false, false) determines the probability of W.

Another way to characterize how a Bayesian network translates into a factorization of the joint probability is to say that, in general, the probability of a variable X is conditioned on the parents of X, where "parents" is meant in the graph-theoretic sense: the nodes that have directed edges coming into that variable. So if wet ground is our node, what are the parents of wet ground? Rain and car wash are the parents, and these are the variables that determine which conditional probability we're going to use. You can do the same thing with slip, whose only parent is wet ground, and with rain and car wash, which have no parents and therefore just contribute P(R) and P(C). You can probably start to imagine how this type of organization enables more efficient inference, that is, figuring out the various conditional probabilities you might want to derive from the full joint probability.
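
Here is a small sketch of that general rule, the joint as a product of P(X | parents(X)), written out for the rain / car-wash / wet-ground / slip network. The dictionary layout and all probability values are my own illustrative assumptions.

```python
# Graph structure: each variable maps to its tuple of parents.
parents = {"R": (), "C": (), "W": ("C", "R"), "S": ("W",)}

# Conditional probability tables, keyed by (parent values, own value).
# All numbers are invented for illustration.
cpt = {
    "R": {((), True): 0.3,  ((), False): 0.7},
    "C": {((), True): 0.1,  ((), False): 0.9},
    "W": {((True, True), True): 0.99,  ((True, True), False): 0.01,
          ((True, False), True): 0.90, ((True, False), False): 0.10,
          ((False, True), True): 0.90, ((False, True), False): 0.10,
          ((False, False), True): 0.05, ((False, False), False): 0.95},
    "S": {((True,), True): 0.4,  ((True,), False): 0.6,
          ((False,), True): 0.01, ((False,), False): 0.99},
}

def network_joint(assignment):
    """P(assignment) = product over nodes X of P(X = x | parents(X))."""
    p = 1.0
    for var, pars in parents.items():
        parent_values = tuple(assignment[q] for q in pars)
        p *= cpt[var][(parent_values, assignment[var])]
    return p

print(network_joint({"R": True, "C": False, "W": True, "S": False}))
```
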
But before we get into that, let's look at some distributions we've seen before and ask how they can be viewed as Bayesian networks. Here are two Bayesian networks, and I want you to think about what they look like. What is the joint probability of the top one? Again, for each node we look at which variables are its parents, and we get a probability factor that is the node conditioned on its parents. In this case each of the nodes X1 through X5 has Y as its only parent, which means the distribution contains all these terms of the form P(X_i | Y). That should seem a little familiar. On the bottom we have the opposite: the arrows come out of the X's toward Y, which means the X's have no parents, but Y has all the X's as its parents.

These are the distributions corresponding to these Bayesian networks, and they should look familiar. At the top we have the naive Bayes factorized distribution: it's essentially the conditional independence assumption about the features given the label. On the bottom we have something that looks kind of like logistic regression. It's not exactly the same thing, because we also have the input likelihood P(X) here, but if you divide out P(X) you end up with P(Y | X), where X is the full input across all of its dimensions.

Now, especially in the case of naive Bayes, we know we were able to compute things pretty efficiently: we didn't have to worry about some gigantic probability table where, with d dimensions, we'd have k^d possible probabilities for Y. That would be the worst-case scenario if we had to consider P(Y | X) for an arbitrary joint distribution. It simplified to the point where we just add up the log-likelihoods for each dimension, and that's really an instance of the power of Bayesian networks, which generalize beyond naive Bayes. On the bottom we have the opposite effect: if we implemented it in the most general way, the probability table for P(Y | X1, X2, X3, X4, X5) would be huge. Exactly as I just described, we'd have to consider k^5 different possibilities for P(Y) given whatever the states of X1 through X5 are. What we did instead was parameterize it: we didn't consider all possible probabilities, just the probabilities that come out of a logistic squashing of a linear combination of the X variables. That's a big restriction, one that strongly restricts the space of possible distributions we might be interested in representing. From these examples you should be able to see that naive Bayes has, in some sense, a "nice" probability distribution in terms of efficiency, whereas the bottom one, which we've shown we can parameterize to get logistic regression, is in general not nice, in the sense that it's expensive to reason about.
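
To make the contrast concrete, here's a minimal sketch of the two parameterizations: the naive Bayes log-joint as a sum of per-dimension log-likelihoods, and a logistic (sigmoid) parameterization of P(Y = 1 | X). The parameter names and data structures are illustrative assumptions, not the video's notation.

```python
import math

def naive_bayes_log_joint(y, x, prior, likelihood):
    """log P(y) + sum_i log P(x_i | y): the top (naive Bayes) factorization.
    Assumes prior[y] = P(y) and likelihood[i][y][x_i] = P(x_i | y)."""
    return math.log(prior[y]) + sum(math.log(likelihood[i][y][x_i])
                                    for i, x_i in enumerate(x))

def logistic_p_y_given_x(x, weights, bias):
    """The bottom network with P(y | x1..xd) parameterized by a logistic
    squashing of a linear combination, instead of a k**d-entry table."""
    z = bias + sum(w * x_i for w, x_i in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))    # P(y = 1 | x)
```
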
So, looking at the graphs, what is it about the top graph versus the bottom graph that makes one more or less efficient, more or less expensive, to reason about? That gets us to the idea of independence. What the naive Bayes graph really takes advantage of is conditional independence, which was the fundamental assumption, the first thing we talked about, that enabled it to be inexpensive to work with. In a general Bayes net, which isn't necessarily a one-layered structure like the naive Bayes graph, the idea of independence is that each variable is conditionally independent of its non-descendants given its parents. That's a lot to think about, so let's look at some examples.

Consider the Bayesian network on the right and look at the variable C. C has a parent A, and suppose we observe A. At this point, which variables are independent of C? We know E still depends on C, because E is C's descendant, more specifically its child, but according to the rule, B and D on the left are independent of C. You can write out the probability distribution described by this Bayesian network and work this out to see why it's true. Furthermore, consider the variable E and suppose we observe its parents C and D. Is E dependent on any of the other variables above it? The answer is no: according to the rule, C and D are its parents, and since E has no descendants, everything else is a non-descendant, so E becomes independent of A and B. Again, you can work out the math and see why this is the case if you write out the distribution described by this network.

Another rule for independence in Bayesian networks is that each variable is conditionally independent of every other variable given its Markov blanket. That's probably a new term for most of you: a Markov blanket is the set of all parents, children, and children's parents of a node. In family-tree nomenclature, that's the parents, the children, and the children's other parents, which have a slightly odd interpretation but can be thought of as spouses or co-guardians of the children. In this example, node C has a Markov blanket that includes A, E, and D: A is a parent, E is a child, and D is that child's other parent. If this were a family tree, C and D might be spouses or co-guardians of E; the analogy breaks down a little, but the idea is that once you observe all the members of a variable's Markov blanket, that variable becomes completely independent of every other variable in the network, and that makes things a lot easier and more efficient to reason about.
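
As a small sketch, here is one way to compute a node's Markov blanket from an edge list. The edge list below is my guess at the example graph (A → C, B → D, C → E, D → E), so treat it as an assumption.

```python
# Hypothetical edge list for the example network discussed above.
edges = [("A", "C"), ("B", "D"), ("C", "E"), ("D", "E")]

def markov_blanket(node, edges):
    """Parents, children, and the children's other parents of `node`."""
    node_parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    co_parents = {u for u, v in edges if v in children and u != node}
    return node_parents | children | co_parents

print(markov_blanket("C", edges))   # {'A', 'E', 'D'}
```
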
Let's think about how we can use this independence, especially for the task of inference. Given a Bayesian network describing a joint distribution over X, Y, and Z, one question we might ask, which is a problem of inference, is: what is P(Y)? That is, what is the marginal probability of one variable? Computing it essentially requires us to enumerate, or sum over, all possible settings of X and Z, all the remaining variables.

The first approach is enumeration, so let's look at what that would look like. Take the same distribution we looked at before, the rain, wet ground, slip, and car wash example, with its Bayesian network structure: P(R) and P(C) have no parents, W has C and R as parents, and S has W as a parent. Say we're trying to figure out P(R | S), the probability that it's raining given that we slipped. We walk down the street, we slip on the ground, and now we're wondering whether it's raining. It's a kind of silly question for a person, but you can imagine it happening to a robot: a robot rolling down some street on a mystery planet it's exploring slips, and it has no weather sensors, so it's trying to figure out from its local motion sensors whether it's raining.

The way we'd write that out is that we want to sum up the joint probability of all the variables over all possible settings of W and C, the variables we don't care about, and then divide out the probability of S. Instead of dividing, let's just say we normalize at the end, so P(R | S) is proportional to summing out all the variables that are not involved in our target distribution; we're looking for P(R | S), so we sum out the other variables, C and W. Here I've written the factorized form of the joint distribution, but in general this is just the idea of summing out the joint probability, and that is expensive: doing it the way it's written means enumerating over all possible settings of W and C. That's already bad in this example, but in general, if there are n variables whose joint settings you have to consider, the cost is on the order of k^n, which is really huge. So we want to think about how we can use the Bayesian network structure to alleviate this problem. In this example you can see the simplification right on the screen, it's right there in the algebra, but in general the Bayesian networks we get as input will describe distributions that are much harder for a human to simplify by eye.

Let's try to simplify this case anyway. We have a sum over W and a sum over C, and essentially the trick is to pull out terms that are not affected by the variable being summed over. For example, P(R) is not affected by either W or C, so it gets pulled all the way out to the top, and P(S | W) is not affected by C, so it gets pulled out of the C summation. That's as far as we can get, so now we're down to a slightly simplified form, but it's still pretty expensive: we eventually have to sum over the distribution P(W | C, R), a case where we have to look up a probability distribution conditioned on two variables, and in general if you have n binary variables in the conditional, there are 2^n different probability entries you need to be able to look up. That's expensive, and we want to avoid it, but for now this is how you do enumeration.
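
Here is a minimal sketch of inference by enumeration for P(R | S = true). It reuses the hypothetical `parents`, `cpt`, and `network_joint` definitions from the earlier car-wash sketch, so the numbers are still made up; the point is the sum over every setting of the hidden variables W and C, followed by normalization.

```python
from itertools import product

def enumerate_r_given_s(s_value=True):
    """P(R | S = s_value) by brute-force enumeration over W and C."""
    unnormalized = {}
    for r in (True, False):
        unnormalized[r] = sum(
            network_joint({"R": r, "C": c, "W": w, "S": s_value})
            for w, c in product((True, False), repeat=2))
    z = sum(unnormalized.values())       # the normalizer, i.e. P(S = s_value)
    return {r: p / z for r, p in unnormalized.items()}

print(enumerate_r_given_s(True))
```
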
You can also view this as a tree, which is sometimes a helpful way to think about it. We want to consider all possible settings of W and C, and because these are binary variables that are either true or false, we can draw it out nicely as a tree: P(R) sits by itself at the top, then we sum over the two subtrees for P(S | ¬W) and P(S | W), with ¬W on the left and W on the right, and at the next level we sum over the possibilities of C, ¬C or C. Maybe this is a useful way of breaking it down, but overall enumeration is the wrong idea, because it doesn't fully take advantage of the independence here. It's still useful to think about, because it's the goal we're going to try to accomplish, just in a more efficient way.

Instead, we're going to look at the technique of variable elimination. At the beginning it looks almost the same as enumeration: we have the same summation over the joint probability distribution described by the Bayesian network. But because the Bayesian network has split everything into conditionally independent terms, we can try to isolate one variable at a time. In this example, let's look at the C variable first. These are the terms that depend on C, so let's make a function f_C that takes in W (the other non-query variable those terms involve) and outputs the sum over C of those factors. What do we gain? Think about the cost of computing this function: it has to be computed by summing over all possible values of C, so there is some cost, but that's only two terms (or k in general), since C in our example is binary. At the end we have a function with two settings, because it takes W as input, which can be either true or false, so we have to compute it twice, and that's not too bad. Once we have that function, we can rewrite the quantity we're interested in (or the thing proportional to it) as the same summation over all the terms that didn't depend on C, together with this f_C function. We've replaced the summation over C with a function.

Let's run through a more general example. Here we have a probability distribution over four variables, W, X, Y, and Z, and say we're trying to find P(Y). The brute-force way of writing this is to sum over all the variables that are not in our target probability, so we sum over W, X, and Z. Now let's run variable elimination. The idea is that we pick a variable and eliminate it by replacing the terms that involve that variable with a function. Let's do W first: which terms in this factorized joint distribution depend on W? They are P(W) and P(X | W), and those are the only terms that depend on W.
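
A sketch of that single elimination step, under the assumption that the chain's CPTs look like the tables below (the numbers are invented): summing W out of P(W) P(X | W) produces a new factor f_W(x).

```python
# Hypothetical CPTs for the start of the chain W -> X -> Y -> Z.
P_W = {True: 0.6, False: 0.4}
P_X_given_W = {True:  {True: 0.7, False: 0.3},
               False: {True: 0.2, False: 0.8}}

# Eliminate W: f_W(x) = sum_w P(w) * P(x | w).
f_W = {x: sum(P_W[w] * P_X_given_W[w][x] for w in (True, False))
       for x in (True, False)}
print(f_W)   # this factor replaces every mention of W downstream
```
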
To decide what the input to this new function should be, we look at the other variables appearing in the terms affected by W, namely P(W) and P(X | W), and in this case the only other variable is X. By replacing the summation over W with a function f_W(X), we've eliminated W from the equation. In the new equation there's no more summation over W, and in fact no mention of W at all; it appears in the subscript, but that's just an index saying the factor came from eliminating the variable W, not a state of W we have to consider. So we've eliminated the W variable, and we can continue. Let's pick X next. Which terms in this expression depend on X? Here they are, and we can see that the only other variable appearing in those terms is Y, so we get a new factor f_X(Y). Finally we plug that back in, and now we've eliminated X as well (having eliminated W before that). All we're left with is Z, which we can just sum up, and we're done: we have computed the probability table for P(Y), not just a value for one particular input y, but the full marginal probability of Y.

That was an example of variable elimination where I simply gave you the order in which to eliminate. In general you want to be smart about the order, because it can strongly affect the computational cost of variable elimination. Technically you can use any order, and the math will work out for any order, but some orders eliminate large tables quickly while others keep them around longer, and a lot of that reasoning comes from looking at the structure. If you look at the structure of this Bayesian network and draw out all the conditional probabilities at the top, you can see it's a chain-structured Bayesian network: W points to X, which points to Y, which points to Z.

Let me summarize variable elimination on one slide. It's based on the idea that every variable that is not an ancestor of a query variable or an evidence variable is irrelevant to the query. A query variable is the variable whose probability you want to output, and an evidence variable is something you might condition on; in both cases we want to leave those alone and not eliminate them, but everything else should be eliminated. Here's the process, which you repeat iteratively: choose a variable to eliminate from the set of non-query, non-evidence variables; sum all the terms that are relevant to that variable; and generate a new factor, which takes as input any other variables appearing in those terms. Keep doing that until there are no more variables to eliminate. Now, there's no free lunch here: exact inference with variable elimination is in general #P-hard, which is worse than NP-hard.
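
Continuing the sketch above (and reusing its made-up `P_W`, `P_X_given_W`, and `f_W`), here is the rest of the elimination down the chain W → X → Y → Z to produce the marginal P(Y); the remaining CPTs are again illustrative assumptions.

```python
# Hypothetical CPTs for the rest of the chain.
P_Y_given_X = {True:  {True: 0.9, False: 0.1},
               False: {True: 0.3, False: 0.7}}
P_Z_given_Y = {True:  {True: 0.5, False: 0.5},
               False: {True: 0.4, False: 0.6}}

# Eliminate X: the factors mentioning X are f_W(x) and P(y | x).
f_X = {y: sum(f_W[x] * P_Y_given_X[x][y] for x in (True, False))
       for y in (True, False)}

# Eliminate Z: P(z | y) sums to 1 over z, so it contributes a factor of 1.
P_Y = {y: f_X[y] * sum(P_Z_given_Y[y][z] for z in (True, False))
       for y in (True, False)}
print(P_Y)   # the marginal P(Y); the values sum to 1
```
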
But there are certain types of Bayesian networks for which we can prove that, if you use the right variable elimination order, you get polynomial time, and in even better news, in certain cases we can get linear time. For a tree-structured Bayesian network we can get linear-time variable elimination, and we're going to use that specifically in chain-structured Bayesian networks, where we can figure out the order very easily: we essentially just go in the order of the chain, and we're able to do variable elimination, that is exact inference, in linear time. So this is one of the foundational methods for inference in a Bayesian network.

Now what about the question of learning? How do you do learning in a Bayesian network? Assume we have a fully observed situation, where we have a bunch of training examples that tell us the state of every single variable in our distribution. In that case, learning is super easy, and we already saw this with naive Bayes: it was just counting. We estimate each conditional probability just like we did for naive Bayes: all we do is count the probability of each variable given its parents from our training set, and then we might add some priors, as we did for naive Bayes, to prevent overfitting or to keep things from going off the rails when we don't have enough data. Overall this is really easy. It is technically an optimization, as we showed in our derivations, but it's such a simple optimization that you just end up counting and dividing by the number of examples, and that's really good news. The slightly less good news is that there are also ways to do this when we don't fully observe everything, and we'll talk about that in the next video. Of course, you should already have a hint as to how we're going to solve it, because we solved that problem when we had unobserved or latent variables in the Gaussian EM example.
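
For the fully observed case, here is a minimal sketch of estimating a single conditional probability table by counting, with a pseudocount playing the role of the prior; the data, variable names, and function signature are all illustrative assumptions.

```python
from collections import Counter

def estimate_cpt(data, child, parent_vars, values=(True, False), pseudocount=1.0):
    """Estimate P(child | parents) from fully observed rows by counting,
    smoothed with a pseudocount so unseen combinations don't get probability 0."""
    joint_counts = Counter((tuple(row[p] for p in parent_vars), row[child])
                           for row in data)
    parent_counts = Counter(tuple(row[p] for p in parent_vars) for row in data)
    cpt = {}
    for pv in parent_counts:
        denom = parent_counts[pv] + pseudocount * len(values)
        for v in values:
            cpt[(pv, v)] = (joint_counts[(pv, v)] + pseudocount) / denom
    return cpt

# Tiny made-up dataset: estimate P(W | R) from four observations.
data = [{"R": True, "W": True}, {"R": True, "W": False},
        {"R": False, "W": False}, {"R": True, "W": True}]
print(estimate_cpt(data, child="W", parent_vars=("R",)))
```
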
Info
Channel: Bert Huang
Views: 62,201
Rating: 4.8977637 out of 5
Id: zCWRTKnOYYg
Length: 30min 3sec (1803 seconds)
Published: Tue Oct 27 2015