Bayesian Networks

Captions
So in this video we're going to go over more about Bayesian networks. I'll briefly review what we talked about in class — it's been a while since we had one of these videos — and then we'll talk about inference algorithms for Bayesian networks. The plan is to briefly review probability, then look at inference done in a naive, brute-force way, which will actually reveal a smarter inference algorithm called variable elimination.

So let's get to the review. With probability, a lot of very interesting math comes from a fairly intuitive set of identities, so it's good to review them and have them in our back pocket, ready whenever we need to think through any tricky probabilistic calculation. First, some notation: as much as possible we'll write random variables with capital letters. This is a convention that separates random variables from their values. In this tiny example, the capital letter A could represent a random variable — something like the weather today — and the lowercase a could be a value like rainy, sunny, or windy, so that lowercase a represents one of the many possibilities for capital A. Often we'll just write lowercase a as shorthand, with the understanding that it's one of the settings of the random variable in its capital form.

One of the key identities in probability is the definition of conditional probability, which is what we've written here: the probability of a given b — in the long form, the probability of A = a given B = b — is equal to the joint probability of a and b divided by the probability of b by itself. To make this concrete, let's look at an example. Lowercase a could be the event that the weather is cloudy, and lowercase b could be the event that the season is spring. Then P(a | b) asks: what's the probability that it's cloudy given that it's spring? The definition of conditional probability tells us this equals the probability that it's cloudy and it's spring, divided by the probability that it's spring by itself.

If you move some terms around with basic algebra, you get the next identity that we should all have pretty much memorized: the product rule. It says that the joint probability — the probability of both events happening, say of it being cloudy and it being spring — is equal to the probability of a given b times the probability of b by itself. This is intuitive: if we want the probability of a and b together, we can figure out the probability of b by itself, then the probability of a given that value of b, and multiply the two together to get the joint probability.

This finally gives you the famous Bayes rule, which says we can essentially flip the roles of the variables in a conditional probability: starting from P(a | b), Bayes rule lets us figure out P(b | a). It's a very useful formula, but it really just comes from manipulating the product rule above with some basic algebra.
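The three identities above can be checked numerically. This is a minimal sketch of the cloudy/spring example; all of the probability values are made up for illustration.

```python
# A small numeric check of the identities above, using made-up numbers
# for the cloudy/spring example (the probabilities are illustrative only).

p_spring = 0.25               # P(b): probability the season is spring
p_cloudy_given_spring = 0.6   # P(a | b)
p_cloudy = 0.45               # P(a): overall probability it's cloudy

# Product rule: P(a, b) = P(a | b) * P(b)
p_cloudy_and_spring = p_cloudy_given_spring * p_spring

# Definition of conditional probability: P(a | b) = P(a, b) / P(b)
assert abs(p_cloudy_and_spring / p_spring - p_cloudy_given_spring) < 1e-12

# Bayes rule: P(b | a) = P(a | b) * P(b) / P(a)
p_spring_given_cloudy = p_cloudy_given_spring * p_spring / p_cloudy
print(p_spring_given_cloudy)  # ≈ 0.333: observing clouds raises P(spring) above 0.25
```

Note how the flipped conditional comes out of nothing but the product rule and a division.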
Okay, so what the heck is a Bayesian network — what's on the screen? It's basically a way of representing the structure of a probability distribution, and just like with logic, it might not represent the real world; instead it's a way for us to encode knowledge in a representation that a computer can navigate. So what is the knowledge we're trying to represent here? On the screen I've only shown the structure of the Bayesian network; I haven't talked about the actual probability tables that go into it — we'll deal with those a little later. For now, let's think about what the structure tells us. Each of the nodes — the ovals on the screen — represents a random variable that, in these examples, can take on a true or false value: it's either true or false that we're going to win the lottery today, it's either true or false that it's going to rain today, and it's either true or false that the ground is wet today.

What we're trying to express with this Bayesian network is the joint probability distribution: the probability of winning the lottery, and it being rainy, and the ground being wet, all together. What the structure of the Bayesian network tells us is that we can write this joint probability as above: the probability of winning the lottery, times the probability of it raining, times the probability of the ground being wet given whether it rained. There are a couple of things here. The first is that, intuitively — and as encoded in the Bayesian network — winning the lottery has nothing to do with the weather, presumably; at least that would make sense. The win-lottery node has no connection to the rain and wet-ground variables; it just lives on its own. That's also reflected in the mathematical representation: in the equation on the left, the probability of L is by itself — it doesn't interact with anything and doesn't depend on anything — and if you want to compute the joint probability of all the variables, you just multiply the rest of the terms by the probability of L.

The second thing is where it gets a little more interesting. In the picture, rain has an arrow coming out of it that points to the wet-ground variable, and this tells us that the probability of the ground being wet depends on rain. Again, this is more or less based on reality: the ground is going to get wet if it's raining, and if it's not raining the ground is probably not going to be wet — though there could be other reasons the ground gets wet, and we'll get to that in a second. In the equation you have exactly that written out in probability notation: the probability of rain lives on its own, but once you know whether it rains, you know a lot about what the probability of the ground being wet is.

We can additionally add more variables to the Bayesian network to make it slightly more realistic, or at least tell a more interesting story. Maybe we're wondering about the probability that we'll slip while walking on the pavement. This extra node means the joint probability of all four variables — winning the lottery, whether it rains, whether the ground is wet, and whether we slip — now looks like this: the probability of the lottery, times the probability of rain, times the probability of the ground being wet given that it rains, times the probability of us slipping given that the ground is wet. You'll notice that I didn't write the last term as the probability of us slipping given that the ground is wet and that it's raining.
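The factorization P(L, R, W, S) = P(L) P(R) P(W | R) P(S | W) can be stored as one small table per node, and any entry of the joint is just a product of table lookups. The numbers below are made up for illustration.

```python
# One table per node of the lottery/rain/wet-ground/slip network.
# All probability values here are hypothetical.

p_L = {True: 1e-7, False: 1 - 1e-7}            # P(win lottery)
p_R = {True: 0.3, False: 0.7}                  # P(rain)
p_W = {True:  {True: 0.9,  False: 0.1},        # P(wet ground | rain); outer key = rain
       False: {True: 0.15, False: 0.85}}
p_S = {True:  {True: 0.2,  False: 0.8},        # P(slip | wet ground); outer key = wet
       False: {True: 0.01, False: 0.99}}

def joint(l, r, w, s):
    """One entry of the joint, read directly off the network structure."""
    return p_L[l] * p_R[r] * p_W[r][w] * p_S[w][s]

# The full joint sums to 1, even though we only stored a handful of numbers.
total = sum(joint(l, r, w, s)
            for l in (True, False) for r in (True, False)
            for w in (True, False) for s in (True, False))
print(total)  # 1.0 (up to floating point)
```

Notice that the lottery table never references rain, wet ground, or slip — the disconnection in the graph shows up directly as a factor that stands alone.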
Intuitively there are connections here: the rain causes the ground to be wet, and because it rained we might slip. But the key is that there's a chain of cause and effect. The rain causes the ground to be wet, and then it doesn't really matter why the ground was wet — all that matters is that because the ground is wet, it's slippery, and we might slip on it. So we don't want to write it the other way.

We can go a little further and look at examples where there are multiple precursors to one of the variables — in this case, the wet-ground variable. Whether the ground is wet could depend on whether it rains, but also on whether someone was washing their car: if someone's washing their car, the ground may get wet. Going further down the Bayesian network, regardless of whether it was raining or someone was washing their car, if the ground is wet we might slip on it. Again we get a joint probability distribution — I've removed the lottery variable because it was trivially disconnected, and hopefully that got the point across that if a node isn't connected to the rest of the nodes, it can be treated completely independently. Here all the nodes depend on each other in some way, so we have the joint probability of whether it's raining, whether the ground is wet, whether we slip, and whether someone's washing their car.

The way the Bayesian network dictates that this joint probability breaks down is: the probability of it raining, independently of anything; the probability of there being a car wash, independently of everything (there are some unrealistic assumptions going on here, but that's okay — we're saying the probability of a car wash just lives on its own); then the probability of the ground being wet given whether someone's washing their car and whether it's raining — and that's the key difference. The last term is the same as before: the probability of slipping given that the ground is wet.

Let's think more about this. The key is that the wet-ground variable depends on the state of both of its parent variables, and you can think through the dependency logically: if it's raining or someone's washing their car, the ground is going to be wet. That's the logical approximation of something often better modeled as a probability distribution. The correct English way of saying it would be: if either it's raining or someone is washing their car, there's a high probability the ground will be wet; and if both of those variables are false, there's a low probability of the ground being wet — or there could be another reason the ground gets wet. We already saw one example where we added a variable to account for another cause, but so far we've only had two, and in reality there are many, many possibilities.

The general rule that comes out of this example is that the term we put into the joint probability for a node in the Bayesian network is always the probability of the variable at that node — X, in the figure at the top right — given its parents in the directed-graph representation. The parents of the wet-ground variable are rain and car wash, and that's why that term is written as P(W | C, R).
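The general rule means each node stores a table for P(X | parents(X)) — one distribution per setting of the parents. Here is a sketch of such a table for the wet-ground variable with its two parents, rain (R) and car wash (C); the numbers are hypothetical, chosen to match the "either cause makes the ground wet with high probability" story above.

```python
# P(W = true | R, C): one number per setting of the parents (R, C).
# The values are made up for illustration.
p_wet_given = {
    (True,  True):  0.99,  # raining AND someone washing a car
    (True,  False): 0.90,  # raining only
    (False, True):  0.80,  # car wash only
    (False, False): 0.05,  # neither: the ground is rarely wet
}

def p_W(w, r, c):
    """P(W = w | R = r, C = c), read off the table."""
    p = p_wet_given[(r, c)]
    return p if w else 1.0 - p

# Each conditional distribution sums to 1 over the values of W:
for r in (True, False):
    for c in (True, False):
        assert abs(p_W(True, r, c) + p_W(False, r, c) - 1.0) < 1e-12
```

With k binary parents, such a table has 2^k rows — which is why factoring the joint into per-node tables is so much cheaper than storing the full joint over all variables.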
So with this definition, let's talk about inference. A Bayesian network, like we've seen so far, is a nice way to represent a probability distribution compactly, and with this compact representation we would hope to be able to do inference efficiently. The punchline you're going to see after we go through this is that we can't do exact inference efficiently in general. We can do it efficiently for certain Bayesian network structures, but in general exact inference remains a really expensive procedure in the worst case. That's okay — we can stay in better-than-worst-case territory if we're designing these networks ourselves.

The first approach we'll look at is the brute-force approach, called inference by enumeration. It's really not something we'd want to do in practice, but it's something we can use to build intuition about how to do inference more intelligently. Consider the probability distribution we saw before — the probability of it raining, the ground being wet, us slipping, and someone washing their car — which broke down into the prior probabilities of rain and car wash, the conditional probability of the ground being wet given the car wash and the rain, and the conditional probability of slipping given that the ground is wet. Let's imagine we want to do inference where we're trying to predict the probability of it raining given that we slipped. Or, for a more realistic framing: we look out our window and see somebody slip. We don't know if it's raining, we don't know if someone's washing their car — we just saw somebody slip, and we want to figure out whether it's raining.

To compute that, we use the definition of conditional probability together with marginalization — summing extra variables out of a joint distribution. We sum out the variables we don't care about, the wet-ground and car-wash variables, over all their possible settings — here each is binary, with a true and a false setting — and we divide by the probability of somebody slipping. All we've done is write the definition of conditional probability and take advantage of the definition of the joint probability. One thing that can save us some computation is that we don't actually have to compute the probability of slipping, the denominator. We can leave it out and just sum the joint probabilities, which I've written here using the proportional-to symbol. It basically means we'll compute this for rain = true and rain = false and then normalize the two numbers — a shortcut to getting the denominator. In some settings it might make sense to actually compute the denominator P(s), but since we're talking about a binary, boolean variable, it's probably easier to use this normalization technique.

So let's try to compute the summation. Looking at it, the first thing you should notice is that there are terms we can move outside of the sums. Rather than spending time recomputing, say, P(r) for all the values of c and w, we can just pull it out. Moving terms outside the summations they don't participate in, we end up with something that looks like this: the prior probability of r, times the sum over w of the conditional probability of s given w — for the s we're taking as input — times a sum over c of all the terms that involve c. It's good that we did this; we save computation by moving these things outside the summations, because otherwise we would have looked up the table entries for those terms over and over. Instead, we look up the table entry for P(r) once and then compute the rest.

The bad news is that even with this technique, for n free variables that we're summing over we're always going to have on the order of 2^n terms — or k^n if there are k possible settings per variable — and that's unavoidable with enumeration, because we have these nested summations: the summation over w, and inside it the summation over c. In this example it's not too bad — just two variables, so we pay 2^2 — but in general, once we have lots of variables, we're not really doing much to exploit the structure of the Bayesian network; we're doing something not much less expensive than the full summation at the top.

Let's examine these nested summations. You can actually draw them out as a computation tree — something the textbook does as well, with a different example. At the top we have P(r), multiplied by a sum involving P(s | w) and P(s | ¬w), each multiplied by further sums over c, and you can see all the terms you have to consider when you compute the summations exactly as they're written on the screen.
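The enumeration procedure just described — sum the joint over the hidden variables for each value of the query variable, then normalize — can be sketched as follows, with made-up probability tables standing in for the ones in the lecture.

```python
# Inference by enumeration for the query P(R | S = true).
# All probability values are hypothetical.

p_R = {True: 0.3, False: 0.7}                      # prior on rain
p_C = {True: 0.1, False: 0.9}                      # prior on car wash
p_W = {(True, True): 0.99, (True, False): 0.90,    # P(W = true | R, C)
       (False, True): 0.80, (False, False): 0.05}
p_S = {True: 0.2, False: 0.01}                     # P(S = true | W)

def p_w(w, r, c):
    return p_W[(r, c)] if w else 1.0 - p_W[(r, c)]

def p_s(s, w):
    return p_S[w] if s else 1.0 - p_S[w]

s = True  # the evidence: we saw somebody slip
unnormalized = {}
for r in (True, False):
    # Sum the joint over the hidden variables W and C (2^2 = 4 terms each).
    unnormalized[r] = p_R[r] * sum(
        p_C[c] * p_w(w, r, c) * p_s(s, w)
        for w in (True, False) for c in (True, False))

z = sum(unnormalized.values())  # this normalizer is exactly P(S = true)
posterior = {r: v / z for r, v in unnormalized.items()}
print(posterior[True])  # P(rain | slip) — higher than the prior P(rain) = 0.3
```

The nested generator over w and c is exactly the 2^n blow-up in the text: with n hidden binary variables the inner sum has 2^n terms, no matter how cleverly we reorder them.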
You'll notice that right in the middle of the screen there are a few redundant nodes: the prior probabilities over c and ¬c occur on both sides of the tree, on the left and on the right. That's happening because, evaluating these nested summations naively, we're going to look up P(c) twice and look up P(¬c) twice. That's one small example of where we can save computation, but in general the trick to doing this more efficiently — to taking better advantage of the structure — is to precompute some of these summations and products and store them in tables, so that if we ever need them again in another part of the summation it's just a lookup: we don't have to redo a product or another nested sum. There are a number of approaches to doing this, but the one we'll talk about is called variable elimination, and it's pretty much what it sounds like: you look at all the variables you need to sum over and you eliminate them one at a time.

Let's look at how we go about doing that. Here's the full summation we had originally, and let's say we want to eliminate the variable C: we want to do the sum over c once, store that value, and never have to think about doing the sum over c again. How would we do that? We would ask: in this factorized Bayesian network representation, which terms depend on C? The answer is fairly obvious — it's these two terms: the prior over C, and the conditional probability over W given C and R. These terms depend on C, but they also depend on other variables, in this case W and R. We can more or less ignore R because R is fixed: in this inference, where we're trying to find the probability of the specific value lowercase r, we can just say R is fixed. So we do the summation over c for just the terms that depend on C, and what we end up with is a new function — a new factor — that no longer depends on C but still depends on everything else those terms depend on. That looks like this: a factor subscripted with C to indicate that it was generated by eliminating the variable C, but written as a function of W; it doesn't have C anymore. I could have written it as a function of W and R, and since R is a fixed value that would be fine, but for brevity I wrote it just as a function of W.

So now what we can do is plug f_C(W) in for the terms I've highlighted in blue, and we get the updated summation with C eliminated — there's no longer a summation over c. What we have now is a summation over w, so we still have to continue, and it's pretty straightforward at this point: when there's only one summation left you don't have to create a factor at all; you just do the sum and you're done. That was a very simple example, so let's do another. We'll start with a probability distribution of this form: the probability of W, times the probability of X given W, times the probability of Y given X, times the probability of Z given Y. Before moving forward, let's look at this equation and think about what the Bayesian network structure behind it is — it's a very common structure that we'll see in a lot of applications, and I'll show it to you in a second, but let's just look at the equations first.

Say we want to infer the probability distribution over Y — the prior, or what's called the marginal probability of Y — ignoring all the other variables, treating them as if we don't know what they are. What does that look like, given the definition of the joint probability at the top — what is the expression for the probability of Y? The answer is that you sum out all the irrelevant variables: you sum out W, you sum out X, and you sum out Z, and it looks like this. We've replaced the big random variables with small settings of those variables: we sum over all the possible settings of capital W, which we denote with lowercase w, together with a sum over all the possible settings of X and a sum over all the possible settings of Z. I'm leaving Y in its capital form because what this computes is a table: we do the summation and we get a representation of the distribution over the settings of Y. It's a little abstract, because most of the math we do is with scalars, but in this case the result is a vector; sometimes we end up with matrices or tensors, depending on how many variables we're working with in our distribution, or something even more abstract — with a continuous variable, the distribution is some function we have to think about. But in this case it's simple: let's say for the sake of simplicity that Y is a boolean variable, so it's just a table of two numbers.

Let's think about how we do variable elimination in this case, and eliminate the variable W. What are the terms that depend on W? When we pull them out, we get something that looks like this. The process is: find the terms that depend on W, then ask which other variables those factors depend on. Here, P(W) doesn't depend on anything else, but P(X | W) depends on X, so we end up with a factor that is a function of X and no longer a function of W. I write f_W just to indicate that it came from eliminating W, but we don't actually have to care where it came from; W is now gone, and it's just a function of X. We can plug that right into the original summation above, and now there are just three pieces. Next we eliminate X: the terms that depend on X are f_W(X), the factor we just created, and P(Y | X). When we sum those out, we get that P(Y) is now just a summation over z.

One of the interesting things that can happen here is that sometimes variable elimination gives you a factor that doesn't matter. If we continue and try to eliminate Z, we sum P(z | Y) over all values of z — and that's going to be 1, because it's a probability distribution over Z and we're just summing over all its possible states. It doesn't affect the value at all, so when we plug it back in, it simply goes away.

We can now visualize this. The question I asked at the beginning of the slide was what this distribution looks like as a Bayesian network — it looks like this: it's a chain, a Markov chain, something we're going to see a lot of examples of. One of the more common uses of Bayesian networks is as a chain representing a time series: you can think of W as a snapshot in time that occurs before X, which is another step in time, with Y a step after that and Z a step after that, and you can think of the dependencies as transition probabilities, just like with Markov decision processes but with no decisions — like an MDP with no actions. The key to the curious thing that happened — Z having no effect on P(Y) — is that Z is a child of Y, not an ancestor of Y, and Z wasn't observed. If Z were observed, it could affect our P(Y), but unobserved it has no effect. There's a general rule you can pull out of this: every variable that is not an ancestor of a query variable or an evidence variable is irrelevant to the query. We had evidence variables in the first example but not in this last one — an evidence variable is one whose value you know. In this case Z wasn't an ancestor of the query variable Y, and it wasn't an evidence variable or an ancestor of one, so it's irrelevant to the query.
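The chain elimination just described can be sketched end to end: eliminate W into a factor over X, eliminate X into a factor over Y, and observe that eliminating Z contributes nothing because a conditional distribution sums to 1. The tables are made up for illustration.

```python
# Variable elimination on the chain W -> X -> Y -> Z, computing P(Y).
# All probability values are hypothetical.

p_W = {True: 0.6, False: 0.4}
p_X = {True: {True: 0.7,  False: 0.3},  False: {True: 0.2,  False: 0.8}}   # P(X | W)
p_Y = {True: {True: 0.9,  False: 0.1},  False: {True: 0.35, False: 0.65}}  # P(Y | X)
p_Z = {True: {True: 0.5,  False: 0.5},  False: {True: 0.25, False: 0.75}}  # P(Z | Y)

vals = (True, False)

# Eliminate W: f_W(x) = sum_w P(w) P(x | w)
f_W = {x: sum(p_W[w] * p_X[w][x] for w in vals) for x in vals}

# Eliminate X: f_X(y) = sum_x f_W(x) P(y | x)
f_X = {y: sum(f_W[x] * p_Y[x][y] for x in vals) for y in vals}

# Eliminating Z just sums a conditional distribution, which is 1 for every y:
for y in vals:
    assert abs(sum(p_Z[y][z] for z in vals) - 1.0) < 1e-12

p_Y_marginal = f_X  # so P(Y) is exactly f_X

# Sanity check against brute-force enumeration of the full joint:
brute = {y: sum(p_W[w] * p_X[w][x] * p_Y[x][y] * p_Z[y][z]
                for w in vals for x in vals for z in vals) for y in vals}
for y in vals:
    assert abs(p_Y_marginal[y] - brute[y]) < 1e-12
print(p_Y_marginal)
```

Each elimination step touches only the factors mentioning that variable, which is why on a chain the work stays linear in the number of nodes, while the brute-force sum at the bottom grows exponentially.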
So, to summarize, variable elimination works like this. You iterate: each iteration, you choose a variable to eliminate, you sum the terms that are relevant to that variable, and you generate a new factor. You plug the new factor into the original summation, replacing the original terms, and you repeat until no more variables need to be eliminated.

We just ran through two examples where variable elimination was nice and fast, and pretty clearly faster than the brute-force method — although only a little, because in both cases the examples were small enough that the enumeration method didn't look that bad. But you can imagine that in bigger Bayes nets, enumeration becomes impractical pretty quickly. Now, the bad news is that exact inference is still #P-hard — still something we can't do in polynomial time unless P = NP — because you can always build worst-case Bayesian networks. But a specific case of Bayesian networks, which is what we're going to focus on mostly in this class and what we often stick to, is tree-structured Bayesian networks, where inference is linear time: if you have a tree-structured Bayesian network, as long as you do your variable ordering the right way, you get a linear-time algorithm for inference.

Another important skill with Bayes nets is identifying independence: you should be able to look at a Bayes net and figure out which variables are independent of other variables, given what we observe. Whenever nodes are connected in a Bayes net there is some possibility of a dependency, but the magic of Bayes nets — the reason they exist — is conditional independence: how things become independent once you observe certain variables along the way. So there are two rules of thumb you should be pretty comfortable with once you've gone through them a few times. The first is that each variable is conditionally independent of its non-descendants given its parents. To visualize this, here's a simple Bayes net: A and B have prior probabilities, C depends on A, D depends on B, and E depends on both C and D together. If we observe A and ask what's independent of C given that observation, the answer from this rule is that all of C's non-descendants — B and D — are independent of C. That's actually not that interesting an example, but here's another: consider E, and say you observe the parents of E, which are C and D. When you observe C and D, A and B become independent of E. This is the power of the representation: right now A and B aren't that complicated, but you can imagine A and B being connected to a really, really complex other part of a Bayes net, and the idea is that if you observe C and D, you don't have to consider any of that.

Extending this concept, the further generalization is that each variable is conditionally independent of any other variable given what's called its Markov blanket. The Markov blanket in a Bayes net consists of a node's parents, its children, and its children's parents. For example, consider C again, and say we observe C's Markov blanket, which includes its parent A, its child E, and its child's other parent D. Once we observe A, E, and D, C becomes conditionally independent of B.
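The Markov blanket rule can be verified numerically on the small network above — A → C, B → D, and E depending on both C and D. The blanket of C is {A (parent), E (child), D (child's other parent)}; given those three, C should be independent of B. The CPT numbers below are arbitrary.

```python
# Numeric check: C is independent of B given C's Markov blanket {A, D, E}.
# Network: A -> C, B -> D, (C, D) -> E. All probabilities are made up.

vals = (True, False)
p_A = {True: 0.3, False: 0.7}
p_B = {True: 0.6, False: 0.4}
p_C = {True: {True: 0.8,  False: 0.2},  False: {True: 0.1,  False: 0.9}}   # P(C | A)
p_D = {True: {True: 0.55, False: 0.45}, False: {True: 0.25, False: 0.75}}  # P(D | B)
p_E = {(True, True): 0.95, (True, False): 0.6,    # P(E = true | C, D)
       (False, True): 0.4, (False, False): 0.05}

def joint(a, b, c, d, e):
    pe = p_E[(c, d)] if e else 1.0 - p_E[(c, d)]
    return p_A[a] * p_B[b] * p_C[a][c] * p_D[b][d] * pe

def p_c_given(a, d, e, b):
    """P(C = true | A=a, D=d, E=e, B=b), by summing the joint over C."""
    num = joint(a, b, True, d, e)
    den = num + joint(a, b, False, d, e)
    return num / den

# For every setting of the blanket {A, D, E}, flipping B changes nothing:
for a in vals:
    for d in vals:
        for e in vals:
            assert abs(p_c_given(a, d, e, True) - p_c_given(a, d, e, False)) < 1e-12
print("C is independent of B given its Markov blanket")
```

The independence is exact here because the B-dependent factors P(B) and P(D | B) cancel in the numerator and denominator once D is fixed — the same cancellation the blanket rule captures in general.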
And furthermore, anything else attached to B, or attached anywhere else in the network, is independent of C. Okay, so these rules will probably help you with the written homework, but not so much with the programming homework, so let's talk about the programming for a bit. In the programming homework you're given a pretty simple Bayes net setup: a kind of classical hidden Markov model Bayes net, which is something we're going to look at more closely in class as a way of representing a time series. Basically, as part of the homework exercise, you get APIs to query two types of probability distributions: the probability of a ghost location given the previous ghost location, and the probability of a noisy distance measurement given a ghost location. What happens is that you're Pac-Man, you're hunting ghosts, and at every turn you get a noisy estimate of how distant each ghost is, and we're going to use this Bayes net to try to figure out the probability distribution of where each ghost could be. So the goal is to take this information, the probability of the ghost location given a previous ghost location and the probability of the noisy distance measurement given a location, and infer the ghost's location from the noisy distance measurements that you're seeing. At each turn you're going to want to compute a probability distribution over all ghost locations: there will be some probability for pretty much every possible state, but hopefully the probability will be high close to where the ghost actually is. In order to compute this, you're going to need the probability of the ghost location given all the previous evidence.
This actually relates strongly to variable elimination, because basically you're always going to maintain an estimate of the probability of the current ghost location; that's the quantity the recursion carries forward. Then at the next step, using that estimate, you compute the new, updated probability distribution over ghost locations. The wrong way to do this would be to go back and look at the full set of unknowns, which includes all the ghost locations from the beginning of the game to the current time, and compute a huge summation over all those possible states. That would be enormous: Pac-Man lives on an n-by-n grid, so it would be an n-squared-to-the-number-of-time-steps sort of summation for each ghost, which is far too big. Instead, you compute a factor, a probability distribution over the ghost locations at the previous time step, and that's all the information you need to compute the next time step, in the same way that variable elimination removes the need to think about the full nested summations. Okay, we can talk about this in class and go over your ideas about how to implement it. For the reading, have a look at Chapter 13, which covers the basics of probability that I went over in class and at the very beginning of this video. Chapter 14 through 14.2 covers exact inference and properties of Bayesian networks, and 14.4 through 14.4.3 is a little fine-grained; some sections of that chapter get a bit too detailed, but they're good to reference if you ever need to implement one of these things. For this class, though, I wouldn't want you to get bogged down in those details. Okay, so I'll see you in class on Monday.
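The one-step recursion described for the ghost-tracking homework, push the previous belief through the transition model, then reweight by the evidence, can be sketched as a generic discrete Bayes filter. The function and variable names here are illustrative assumptions, not the homework's actual API:

```python
# Generic one-step Bayes filter over a discrete state space, sketching
# the ghost-tracking recursion. transition(new, old) plays the role of
# P(ghost_t | ghost_{t-1}); emission(e, s) plays the role of
# P(noisy distance | ghost location). Both names are assumptions.

def elapse_time(belief, transition):
    """Push the belief through the transition model. This is the big
    summation, but only over the previous time step, never the whole
    history of the game."""
    states = list(belief)
    return {new: sum(transition(new, old) * belief[old] for old in states)
            for new in states}

def observe(belief, emission, evidence):
    """Weight each state by the evidence likelihood and renormalize."""
    weighted = {s: emission(evidence, s) * p for s, p in belief.items()}
    z = sum(weighted.values())
    return {s: p / z for s, p in weighted.items()}

# Toy two-cell world: the ghost stays put with prob 0.8, moves with 0.2.
def transition(new, old):
    return 0.8 if new == old else 0.2

# Sensor: reading 0 is likely when the ghost is in cell 0, unlikely in cell 1.
def emission(evidence, state):
    table = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.8}
    return table[(evidence, state)]

belief = {0: 1.0, 1: 0.0}                  # start knowing the ghost is in cell 0
belief = elapse_time(belief, transition)   # {0: 0.8, 1: 0.2}
belief = observe(belief, emission, 0)      # evidence pulls mass back to cell 0
```

The key point from the lecture is visible in `elapse_time`: the previous belief is the only factor carried forward, so the cost per turn stays proportional to the number of grid cells rather than growing with the length of the game.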
Info
Channel: Bert Huang
Views: 215,775
Rating: 4.8800998 out of 5
Id: TuGDMj43ehw
Length: 39min 56sec (2396 seconds)
Published: Wed Mar 25 2015