Graphical Models 2 - Christopher Bishop - MLSS 2013 Tübingen

Captions
Thanks for that. So first of all, are there any questions from this morning, any points of clarification, anything you'd like me to go over again? Good. Okay, we're going to move on to talk about conditional independence. Before I do, though, I'm going to take a small liberty, if the organisers will forgive me, and show just one slide: a picture of our wonderful new building, for those of you who have visited Microsoft Research in Cambridge but not in the last six months. The new building is 60% bigger than the last one, which means we're hiring: we have internships for people partway through their PhD, postdoc positions if you've just graduated, and permanent positions for researchers and developers. If anybody's interested, please come and talk to me over coffee or a beer.

Okay, conditional independence. Let's start with the idea of plain independence. If I have two variables a and b, and I have some distribution over these two variables, then we say that a and b are independent — statistically independent — if that joint distribution factorises into the product of the two marginal distributions. You'll know from this morning that we can write that as a very simple directed graph: two nodes, a and b, with no link between them, and we say they're independent.

What does that mean? If we look at the distribution of a conditioned on b, then from the product rule of probability that's just the joint distribution divided by p(b). If I substitute in the factorised form — so if p(a, b) is such that it factorises — then the factor p(b) cancels and the result is just p(a). So what that says is: if the distribution over the two variables factorises into the product of the two marginals, then the distribution of a given b is just p(a), which does not depend on b. The conditional distribution of a given b is independent of the value of b; or, to put it another way, if you know p(a) and I tell you the value of b, you learn nothing about a. That's what we mean by independent. Going back to the culprit-and-weapon example: if the distribution of weapon were independent of the culprit — in other words, if we had two butlers, they both had guns, the distribution over weapons was the same for the two people, so the conditional distribution of the weapon didn't depend on who the culprit was — and you had some prior notion of which culprit was more likely to be the murderer, then if I tell you the weapon, intuitively you learn nothing about the culprit, because it's the same for both. That's the idea of independence.

Now let's extend that idea to a conditional distribution. Consider two variables a and b, but consider their distribution conditioned on some third variable c. I can apply the product rule again: p(a, b | c) = p(a | b, c) p(b | c). That's true in general; I've made no assumptions. But now, going the other way around this time, suppose that the distribution p(a | b, c) doesn't depend on b: telling you the value of b doesn't give you any more information about the distribution of a given c. If that holds, I can write this as p(a | c) p(b | c).
That's exactly the same structure as before: p(a, b) factorising into p(a) times p(b), the product of marginals, was the idea of independence, and the idea of conditional independence is exactly the same thing except that we condition everything on c. So p(a, b | c) = p(a | c) p(b | c) is a property of that conditional distribution — independence for that conditional distribution. There's a notation you'll sometimes see: if that holds, you write a ⫫ b | c and say that a is independent of b given c. That's notation due to Phil Dawid.

The key point here — and if we have time we'll return to this in the third lecture, when we look at slightly more advanced topics — is that this must hold for every value of c. So c may be a discrete or a continuous variable, but it has multiple possible values, and in order to say that a and b are conditionally independent given c, the definition requires that this holds for every possible value of c. If it fails for even one value of c, then we don't say they're independent. You might ask what happens if the independence depends on the value of c; we may get onto that in the third lecture, but for everything I'm going to say now, when you hear about conditional independence it is conditioned on all possible settings of the conditioning variables — it must hold true for all of them.

This morning we introduced directed graphs as specifying a factorisation of a joint distribution: we saw how to take a joint distribution over some variables and express it as a product of conditional distributions according to the graph, and you learned how to do that. What we're going to do now is view graphs in a different way: as specifying a set of conditional independence properties. Given that graphs define a factorisation, and given that we have a definition of independence, we can ask what sort of independencies are implied by a particular graph. I'm not going to derive everything in this lecture — you can go and read about it afterwards — but the end result is that there is a simple graphical procedure for determining conditional independence. This is the first example of the real power of graphs: we've already seen that graphs are good for visualising models; now we'll see how you can avoid doing lots of hairy maths by just looking at the picture. I'm going to — not exactly derive in a formal sense, but motivate and hopefully help you understand — the idea of looking at a graph and simply reading off the conditional independence properties satisfied by the family of distributions represented by that graph.

To do that we're going to look at three little examples, each involving just three nodes. Here is the first one: a graph over three variables a, b and c. This graph defines a family of joint distributions over the three variables which factorise as follows — remember, we need a conditional distribution for each variable conditioned on its parents — so we have p(c), which has no parents, p(a | c), and p(b | c). The graph says that the joint distribution factorises into that product of conditional distributions.
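Written out, the two definitions and the factorisation this first graph asserts are:

$$ a \perp\!\!\!\perp b \;\;\Longleftrightarrow\;\; p(a,b) = p(a)\,p(b), $$

$$ a \perp\!\!\!\perp b \mid c \;\;\Longleftrightarrow\;\; p(a,b \mid c) = p(a \mid c)\,p(b \mid c) \quad \text{for every value of } c, $$

$$ p(a,b,c) = p(c)\,p(a \mid c)\,p(b \mid c). $$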
So the first question we can ask is: are a and b independent? We're asking whether a is independent of b given no observations at all, and we can answer that using the sum and product rules of probability together with our definition of independence. We want to know: does p(a, b) factorise into p(a) times p(b)? Let's just compute p(a, b). From the sum rule we marginalise out the unwanted variable: p(a, b) is the sum over c of p(a, b, c), and if I substitute in our factorisation that's the sum over c of p(c) p(a | c) p(b | c), and in general this is not equal to p(a) p(b). So in general that factorisation doesn't hold, and a and b are not independent when c is unobserved. Now, remember that a graph represents a whole family of distributions. I could choose a very particular distribution — make very particular choices for these conditional distributions — such that the property did hold for that specific example, but there will be other members of the family for which it doesn't hold. So it's not the case that every member of the family of distributions described by that graph factorises in that way, and that's what I mean when I say it doesn't follow from the graph: it's not true for all members of the family, even though there can be specific examples where it holds because of particular numerical choices. And do stop me the moment you think you're not understanding a hundred percent of this — it's not hard once you're used to it, but the first time you see it, it's a little strange.

Okay, let's take the same graph, but this time imagine we've observed c. So c is no longer a random variable; we're going to fix it to some specific value, and we're going to ask whether a and b are independent given c. So we look at p(a, b | c). From the product rule — always go back to the joint distribution — that's p(a, b, c) divided by p(c), and now I substitute in the factorised form specified by the graph: p(c) p(a | c) p(b | c), divided by p(c). The p(c) factors cancel, and we're left with p(a | c) times p(b | c). So in this case a is independent of b given c, whereas in the previous case a was not independent of b given nothing.

I derived those two results by grinding through the algebra, but we can also interpret them in terms of a graphical property. Think of the two variables a and b we're interested in and the paths which connect them in the graph; in this case there's only one path, the path via c. The little graphical test we apply is whether that path is blocked or not, and the definition of blocked is very simple: if there's a node on the path that's observed, the path is blocked. Is this node observed? No, it's not, so the path is unblocked and the variables are not independent: they can talk to each other, because they're connected by a path which isn't blocked. Observing c, on the other hand, we can think of as blocking the path from a to b: it stops the two variables from knowing about each other, and so they become independent. a is independent of b when we condition on c, when we observe c.
So I think of c as blocking that path from a to b. Now, this node c which is doing the blocking is connected to the tails of the two arrows, so we call it a tail-to-tail node: it sees the tails of both arrows. As far as this path is concerned — the path from a via c to b — this node is tail-to-tail on that path. In a big graph there could be lots of other arrows coming in and going out, and there could be many different paths, but for that particular path that particular node is called tail-to-tail, and a tail-to-tail node has the property that observing it blocks the path. If this is new to you, just let it wash over you; once you've drawn a few graphs you'll begin to get a feel for it.

Here's another example, and this one I'm going to let you work out for yourself. I'll give you a little clue, though: remember that for this graph, p(a, b, c) is, by definition, p(a) p(c | a) p(b | c). If you go through the same little exercise we've just been through for the previous graph, what you'll discover is that a is not independent of b given nothing, given no observations. Again we can think of that in terms of a path from a to b via c; this time the path is what we call head-to-tail, because c sees the head of one arrow and the tail of the other. If c is unobserved, these nodes can communicate and they are not independent of each other. If you do the same exercise conditioning on c — and if you're not already familiar with this, I do encourage you to do it during the coffee break — what you'll discover is that a is independent of b given c. In other words, graphically, when we observe c it blocks the path from a to b and they become independent of each other.

So this all looks like it's going to be dead easy: it looks like if I give you a big complicated graph and ask whether node a is independent of node w given nodes z and g, you'll just look at the paths and check whether they're blocked. It looks very easy, but it turns out there's a little subtlety, and the subtlety relates to a third kind of node. Here are three variables again, and if you look at the path between a and b, the middle node has two arrows coming into it: we call this a head-to-head node. Let's look at the behaviour of a head-to-head node. Again we can write down the factorisation of the joint distribution: p(a, b, c) is p(a) given its parents — it doesn't have any — times p(b) given its parents — it doesn't have any either — times p(c) given its parents, which are a and b. So p(a, b, c) = p(a) p(b) p(c | a, b).

First of all, suppose we don't observe anything, and ask: is a independent of b given nothing? We need to write down the joint distribution of a and b. From the sum rule I take the joint distribution and marginalise out the variable c. Substituting in the factorisation: when I sum over c, the p(a) pulls out of the summation, because the summation is over c, and likewise for p(b), and I'm left with the sum over c of p(c | a, b). But that's a conditional distribution, so it's normalised: when I sum over c I get one. So this is equal to p(a) p(b), and I end up with the result that a is independent of b given nothing. Oh dear — this is the opposite of the previous behaviour.
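For reference, that marginalisation written out in one line is:

$$ p(a,b) \;=\; \sum_{c} p(a)\,p(b)\,p(c \mid a,b) \;=\; p(a)\,p(b)\,\underbrace{\sum_{c} p(c \mid a,b)}_{=\,1} \;=\; p(a)\,p(b). $$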
I've not observed this node, and yet somehow the path is blocked — blocked in the sense that a and b are independent of each other; they don't communicate, because the node is unobserved. It's the other way round from the previous graphs. If this sounds abstract, I'm going to show you a little example involving flipping coins in a bit, which I hope will make it really clear.

What about if I observe c? Here we go again: this time I'm interested in p(a, b | c). That's p(a, b, c) divided by p(c), which is p(a) p(b) p(c | a, b) divided by p(c), and in general this does not equal p(a | c) p(b | c). Again, it might do for a particular member of the family, for which I've carefully chosen the numerical values so that it holds, but it doesn't hold for all members of the family described by the graph. So the result is that a is not independent of b given c. Somehow observing c has unblocked the path: it's as if a and b couldn't talk to each other before, but now that we've observed c they can — they now know about each other; they've become statistically correlated. I'll give you an example in a moment.

Let me just recap what we've learned. We're going to consider paths connecting nodes, and we have three elementary examples. The first is a tail-to-tail path between a and b: when the middle node is unobserved the variables are dependent, but when we observe it they become independent — it blocks the path between them. Likewise for the second one, which you're going to prove for yourselves over coffee: if I don't observe c the variables are dependent in general, but they become independent if I observe it — it blocks the path — and that's a head-to-tail node. The final example is a path which is head-to-head, which has the opposite behaviour: a and b are independent if I don't observe c, but when I observe c they become dependent. And — something I'm not going to prove — even if I hadn't observed c, observing some descendant of c, or a descendant of its descendants, has the same effect. I'm not deriving that now.

So all I've done is show you three little examples, by way of motivation, for a theorem called the d-separation theorem. D-separation is a way of looking at a graph and answering, just from the picture, whether a particular conditional independence property holds or not. If you're wondering why you should care about this: you'll see in particular applications that these independence properties are part of the structure of your models; they're the assumptions you're making about the world. Your model should capture the assumptions you're making about the world, so what you need is a very quick way of looking at your model, expressed as a graph, and asking what assumptions it's making — in other words, what independencies it is assuming. The way we do this is to use the d-separation criterion. Again I won't prove anything; I'll just show you two simple examples.
Consider the graph on the left, and let me ask a question: is a independent of b given c? To answer that we do the following: we consider every possible path connecting a and b, and we ask whether each path is blocked. If all paths are blocked, the nodes are independent; if one or more paths isn't blocked, they're not necessarily independent. In this case there's only one path. On that path there's a head-to-head node; it's not observed, but one of its descendants is observed — and remember, a head-to-head node whose value or whose descendant is observed unblocks the path. The head-to-head node is the peculiar one that works the wrong way around, if you like: it unblocks the path. So this part of the path is unblocked, because a descendant of the head-to-head node is observed. The other node on the path is tail-to-tail, so it behaves the normal way, and it's unobserved, so it doesn't block the path either. We therefore have a path from a to b that isn't blocked, and so in general the conditional independence property does not hold.

Now the second example. In this graph let's ask: is a independent of b given, not c this time, but f? Again we look at every possible path; again there's only one. What about the head-to-head node on it? It's not observed, and none of its descendants is observed either, so — remembering that the head-to-head node is back to front — it blocks the path. That path is blocked, and I could stop there, because every possible path is blocked. But just to reinforce the point, look at the node f as well: this is a regular tail-to-tail node, and it is observed, so it also blocks the path. The path is blocked here and blocked there; every path is blocked, and therefore the conditional independence property does hold.

So that's the d-separation criterion. I'm not deriving or proving it; I'm just trying to give you a flavour of it. The point is this: imagine a very complicated graph, and I ask you a complicated question about whether several nodes up here are independent of several other nodes down there, given some complicated pattern of conditioning. You can answer that by writing down the joint distribution, writing down its factorisation according to the graph, and then manipulating it using the sum and product rules of probability until you get the joint distribution of the two sets conditioned on the conditioning set, marginalising out all the other variables — and twenty-five pages of paper later, if you've not made any mistakes, you'll get the answer. Or you can do it by inspection of the graph, looking at the paths. The latter is a lot easier and less error-prone.
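Here is a minimal sketch of the brute-force route for the head-to-head example, so you can check numerically what d-separation tells you by inspection. The numbers in the conditional probability tables are arbitrary, made-up values; any valid choice would do.

```python
import numpy as np

# Hypothetical tables for the head-to-head graph p(a, b, c) = p(a) p(b) p(c | a, b)
p_a = np.array([0.3, 0.7])
p_b = np.array([0.6, 0.4])
p_c_ab = np.array([[[0.9, 0.1], [0.2, 0.8]],
                   [[0.5, 0.5], [0.3, 0.7]]])          # indexed [a, b, c]

joint = p_a[:, None, None] * p_b[None, :, None] * p_c_ab   # full table p(a, b, c)

# Marginal independence: is p(a, b) = p(a) p(b)?
p_ab = joint.sum(axis=2)
print(np.allclose(p_ab, np.outer(p_a, p_b)))               # True: a is independent of b

# Conditional independence: is p(a, b | c) = p(a | c) p(b | c) for every value of c?
p_c = joint.sum(axis=(0, 1))
p_ab_given_c = joint / p_c                                  # shape (a, b, c)
p_a_given_c = joint.sum(axis=1) / p_c                       # shape (a, c)
p_b_given_c = joint.sum(axis=0) / p_c                       # shape (b, c)
print(np.allclose(p_ab_given_c,
                  p_a_given_c[:, None, :] * p_b_given_c[None, :, :]))  # False in general
```

The two printed results match the graphical answer: marginally independent, conditionally dependent once c is observed.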
What I want to do now is go back to a really toy example, involving flipping coins, with that little head-to-head node in it, so you can see this in action on something concrete, and then we'll take any questions.

Let's consider two coins. These are not bent coins; they're ordinary fair coins, a fifty-fifty chance of heads or tails. So we have two random variables, the two coin flips — call them coin one and coin two. Coin one can be either heads or tails, with probability of heads one half and probability of tails one half, and likewise for coin two. Now let's introduce another variable, which I'll call "both heads". Both-heads is a Boolean variable, either true or false: it's true if coin one is heads and coin two is heads, and otherwise it's false.

The first thing we can do is think of this generatively and ask: what's the probability of two heads — in other words, what's the probability that the variable both-heads takes the value true? I could imagine generating millions of examples, and you know what they would look like, because there are four possibilities and they occur with equal probability: coin one is either tails or heads, coin two is either tails or heads, and the four combinations are equally probable. We can think of that as the generative model: flip coin one, flip coin two — independent flips — then compute both-heads, which is just a deterministic function of coin one and coin two; repeat many times and look at the fraction of times that both-heads is true. That's obviously one in four, a quarter. So far, I think, pretty trivial.

What we want to do now is reason backwards. Suppose we've observed the value of both-heads, and suppose it's false. Just as we did before, we can solve a little inference problem by crossing out all the states of the world that are inconsistent with the evidence. We know this variable is false, so we can rule out the column where both coins are heads, and we're left with three states, which of course are equally probable. So what's the probability that coin one is heads? You can read it off: coin one is heads in one of the three remaining states of the world and tails in two, so the probability is a third. I've just solved my inference problem, and of course by symmetry the same is true of coin two. Notice what has happened: the coin flips started out independent — before I observed this variable they were independent flips — but now I've observed it. This is a head-to-head node, unblocked by the observation, and the coins are no longer independent.

Just to drive that home, let's continue the inference and suppose that coin one is tails. Again I cross out the states of the world that are inconsistent with the evidence: both-heads is false, so that column goes, and coin one is tails, so it can't be heads, and that goes too. That leaves two possible states of the world, equally probable, and so the probability of coin two being heads is just a half: half the time it's tails, half the time it's heads. So it started at a half; then I observed both-heads to be false and it went down to a third; now I've observed coin one to be tails and it's gone back up to a half. Conversely, suppose coin one, instead of being tails, is heads. I do the same thing again — cross out the states of the world inconsistent with my evidence, both-heads false and coin one heads — and what do I learn? I learn that coin two is definitely tails: the probability of heads is zero.

So observing the value of both-heads to be false correlates the values of coin one and coin two. They are independent, but making this observation that they're not both heads — intuitively fairly obvious, isn't it — means the state of one is correlated with the state of the other. They've become conditionally dependent, conditioned on this observation. Any questions — anybody want me to go over it again? No? Happy? Okay, good.
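Here is a minimal sketch of that calculation: enumerate the four equally likely worlds and condition by discarding the ones inconsistent with the evidence. The helper names are purely illustrative.

```python
from itertools import product

# The four equally likely worlds, plus the deterministic both_heads variable
worlds = [{"coin1": c1, "coin2": c2, "both_heads": (c1 == "H" and c2 == "H")}
          for c1, c2 in product("HT", repeat=2)]

def prob(event, evidence):
    """P(event | evidence), by discarding worlds inconsistent with the evidence."""
    kept = [w for w in worlds if all(w[k] == v for k, v in evidence.items())]
    return sum(event(w) for w in kept) / len(kept)

print(prob(lambda w: w["coin1"] == "H", {}))                                   # 0.5
print(prob(lambda w: w["coin1"] == "H", {"both_heads": False}))                # 1/3
print(prob(lambda w: w["coin2"] == "H", {"both_heads": False, "coin1": "T"}))  # 0.5
print(prob(lambda w: w["coin2"] == "H", {"both_heads": False, "coin1": "H"}))  # 0.0
```

The printed values reproduce the half, third, half, zero sequence described above.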
This is sometimes called explaining away. In the book chapter there's a different example, I think involving fuel gauges and batteries, but essentially it's called explaining away for the following reason: if there's some event in the world that you can observe, and there are two possible causes of that event, and they're independent, then observing the event correlates them. It correlates them because, if you then discover that one of the causes did occur, it explains away the observation, and you no longer need the other cause.

I suppose the classic example is this. You wake up one day — I guess it would be in a country that has lawn sprinklers — and your lawn is wet; the grass is wet. It could be that it rained during the night, or it could be that a lawn sprinkler came on during the night. You have some prior, and we'll imagine that raining and the sprinkler coming on are independent events — they may not be in a real situation, but imagine they are. You observe that the lawn is wet, so you know that one or the other, or both, must have happened. If you now discover that it rained last night, you no longer need the explanation of the sprinkler having come on. So to start with there may be a small chance of the sprinkler having come on; observing that the lawn is wet increases the probability that the sprinkler came on; then you find out that it actually rained during the night, and the probability of the sprinkler having come on goes back down again. It's non-monotonic, just as we saw with the coins.

So that's the power of graphs to help you reason with probabilities. You could do everything analytically with the sum and product rules, but the graphs are very helpful for visualising and computing these properties. Any questions about that brief introduction to conditional independence in directed graphs?

Okay, so let's have a brief look at undirected graphs. I'm not going to spend very long on this, because most of the examples I'll discuss involve directed graphs, but the other main class of graphs is undirected graphs, and in certain situations it's easier to express a model using an undirected graph than a directed one; I'll show you an example. One way of thinking about undirected graphs is to start from the other end: think first about the conditional independence properties, and then about what the factorisation might look like. Suppose you said: this idea of paths being blocked is very simple, but those head-to-head nodes with the arrows spoil everything, because they work back to front and make everything more complicated — what if we had graphs without any arrows? Could we have a simple conditional independence semantics which is just path blocking? In other words, could you imagine a semantics for graphs without arrows such that, if I consider two variables — or in this case two sets of variables, set A and set B — and ask whether A and B are independent given some other set C, that conditional independence follows simply from graph separation, simply from blocking: every path from A to B passes through an observed node, which blocks that path. Or, to put it another way, imagine removing all the observed nodes along with the links that connect to them; if there are no paths left between A and B, then they are independent. That is the conditional independence property of undirected graphs.
You can see that if two variables are connected by a link, they will always be connected: no matter what else we observe in the graph, there will always be a path between them that can never be blocked. So that leads us to think about sets of connected nodes. By the way, undirected graphs are sometimes called Markov random fields — that's another piece of terminology for them.

So what does the factorisation look like? We have to think about sets of connected variables. Think of this graph: it has four nodes — just look at the red lines for the moment — four variables x1, x2, x3 and x4 connected by four links, plus a fifth link along a diagonal. Variables x1 and x2 are said to form a clique: a clique is a set of variables such that every variable is connected to every other variable, and here there are only two variables and they're connected, so they're a clique. A maximal clique is a set of nodes that are fully connected such that no other node in the graph can be added while it remains a clique. This graph has two maximal cliques: x1, x2, x3, and x2, x3, x4. In each there's a link between every pair of nodes, but if I add x1 to the second it ceases to be a clique, because there's no link between x1 and x4. So x2, x3, x4 is a clique, and it's the largest clique I can form containing those variables, hence a maximal clique, and the other maximal clique is x1, x2, x3.

The factorisation property for undirected graphs is very simple: it's just a product over the maximal cliques. The joint distribution over all the variables — x here is the vector of all the variables in the graph — is a product, over the maximal cliques, of non-negative functions of the variables in that clique. In this case the joint distribution is some function of x1, x2, x3 times some other function of x2, x3, x4. That's the factorisation property.

There are two little technicalities. The first — this is a bit technical — is that the factorisation and the conditional independence properties are only equivalent, by something called the Hammersley–Clifford theorem, if I make one more small restriction: the potentials must not merely be non-negative but strictly positive; they must not be zero anywhere. The other little wrinkle is normalisation. With directed graphs I never spoke about a normalisation constant; they were automatically normalised, because the normalisation just follows from the product of conditional distributions. Nothing here says this product of clique functions is normalised, so in general we need a normalisation constant to ensure that summing over all possible states of x gives one. By definition, Z is just the sum of this product over all values of x, so that dividing by it gives one.
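Written out, with C ranging over the maximal cliques and ψ_C denoting the (strictly positive) potentials, the factorisation is:

$$ p(\mathbf{x}) \;=\; \frac{1}{Z}\prod_{C}\psi_C(\mathbf{x}_C), \qquad Z \;=\; \sum_{\mathbf{x}}\prod_{C}\psi_C(\mathbf{x}_C), $$

and for the four-node example above:

$$ p(x_1,x_2,x_3,x_4) \;=\; \frac{1}{Z}\,\psi_{123}(x_1,x_2,x_3)\,\psi_{234}(x_2,x_3,x_4). $$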
This is one of the downsides. A nice thing about undirected graphs is this very simple conditional independence property; a nasty thing about them is the normalisation constant, and it's nasty because the sum runs over every possible setting of all the variables in the graph. If these were discrete variables, I'd have to take every combination of settings: if each of the four variables here has K states, there are K to the power four settings — K ways to choose variable one, K ways to choose variable two, so K squared combinations; add a third variable and there are K cubed; with M variables there are K to the power M. The number of terms in the summation grows exponentially with the number of nodes in the graph. One of the big application areas for this is image processing, where you might have a million pixels in your image, so you'd have a sum over two to the power of a million terms — obviously you're not going to do that explicitly. So that's the downside of undirected graphs.

Let me make this a bit more concrete with an example of an undirected graph. It's a very simple, fairly standard problem: image denoising. Imagine I've got an image, and imagine it's just a binary image: each pixel is black or white, say. Now somebody has come along and corrupted the image: they've picked a small subset of pixels at random and flipped them from white to black or black to white. They've added this noise, and my job is to get rid of it. We can do that using an undirected graph, and the undirected graph is going to encode a piece of knowledge we have about the world, namely that neighbouring pixels in an image tend to have the same state. If I look at the world I see a big area of grey, a big area of white, and so on; if I take a photograph, neighbouring pixels will very probably have the same colour, the same state. If you remember, in one of this morning's talks a digit nine was shown as a pixelated image — white around the edge and all the black pixels of the digit next to each other — and then the same image with all the pixels scrambled up, and you couldn't tell what it was. In real images it matters which pixels are next to each other and which are far apart, and one of the basic properties of natural images is that pixels don't alternate black, white, black, white; they tend to come in big areas that are the same. We can express that using an undirected graph, and this is the graph.

The idea is the following. Each of these nodes x_i is, if you like, the true state of the image: the pixel label for the uncorrupted, unknown image. What we observe is not x_i directly but something, y_i, which is potentially flipped — flipped with a small probability. So if the true pixel is black, the observed one is probably black, with a small chance of being white, and vice versa; that's the noise model, and the link between x_i and y_i captures that effect. The links between neighbouring pixels capture the fact that two adjacent pixels are much more likely to have the same value than different values — they only differ at an edge, and edges are rare in images.

Remember that these potential functions, these ψ factors, must be non-negative — strictly positive, in fact — so we can conveniently write them as exponentials of real-valued quantities, and by analogy with physics we call the quantity in the exponent the energy. The probability is e to the minus energy, and we can write down a very simple energy function for this undirected graph. We're going to choose the x_i to be either minus one or plus one; those are the two possible states of a pixel.
Now let's look at the various terms. The pairwise term is a potential over a clique: two neighbouring variables x_i and x_j are connected, and there's no other variable I can add that keeps the set fully connected, so these pairs of variables are the maximal cliques, and this term corresponds to this clique here. If x_i and x_j are the same — plus one times plus one is plus one, and minus one times minus one is plus one — then this energy term is negative; if one is plus and one is minus, the product is minus one and the energy contribution is positive. So we have a high energy if neighbouring pixels have different states and a low energy if they have the same state, and since probability is e to the minus energy, high energy means low probability. What this is really saying is that there's a high probability that neighbouring pixels have the same state and a low probability that they differ. Then there's the same kind of term between x_i and y_i, but with a different weighting, a different coefficient eta: this says that the observed pixel is very likely to have the same state as the true, non-noise-corrupted pixel, but there's a chance it might be different. The larger the value of eta, the bigger that discrepancy in energy, and hence the bigger the discrepancy in probability.

[Question from the audience: are you minimising the energy of the image?] In principle what I'd like is the posterior distribution over all possible non-noise-corrupted images. In practice the user may not want a posterior distribution over two to the power of a million images; they might prefer just your best guess, in which case, yes, you would maximise the probability, which is the same as minimising the energy. But in principle we could get a whole posterior distribution out of it. Okay, other questions?

So here's the example. Here's an image in which each pixel is either yellow or blue; that's the true image. What I've then done is take ten percent of the pixels at random and flip them from blue to yellow or yellow to blue. This is the noise-corrupted image; that's the observed data, those are the observed y variables. What the model now defines is a posterior distribution — a joint posterior distribution over possible noise-free images — but, as the question anticipated, what we often really want is our best guess, the most probable image. A simple algorithm for getting that is to take each pixel in turn and maximise its probability conditioned on your current guess for all the other pixels. That's called iterated conditional modes — a very simple heuristic hack — and if you run it for a while you get this: it's done not a bad job of removing the noise. There's a better algorithm called graph cuts, which is a tractable algorithm that in this case returns an exact solution; this is the true most probable non-noise-corrupted image, and it's done a much better job.

[Question about how long this takes to run.] That's a great question, and I don't know, but what I can do is show you a real example running in real time on my laptop, and you can time it if you like; I can't really give you an answer beyond that.
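Here is a minimal sketch of iterated conditional modes for the denoising model just described, assuming the PRML-style energy E(x, y) = h Σ_i x_i − β Σ_{i∼j} x_i x_j − η Σ_i x_i y_i with x_i, y_i in {−1, +1}. The parameter values and the helper name are illustrative, not tuned.

```python
import numpy as np

def icm_denoise(y, h=0.0, beta=1.0, eta=2.0, n_sweeps=10):
    """Greedy ICM: repeatedly set each pixel to the state of lowest local energy."""
    x = y.copy()                        # initialise the guess with the noisy image
    rows, cols = y.shape
    for _ in range(n_sweeps):
        for i in range(rows):
            for j in range(cols):
                # Sum over the (up to four) neighbours of pixel (i, j)
                nbr = 0.0
                if i > 0:        nbr += x[i - 1, j]
                if i < rows - 1: nbr += x[i + 1, j]
                if j > 0:        nbr += x[i, j - 1]
                if j < cols - 1: nbr += x[i, j + 1]
                # Local energy of setting this pixel to +1 or -1; keep the lower one
                e_plus  = h * (+1) - beta * (+1) * nbr - eta * (+1) * y[i, j]
                e_minus = h * (-1) - beta * (-1) * nbr - eta * (-1) * y[i, j]
                x[i, j] = 1 if e_plus < e_minus else -1
    return x

# Hypothetical usage on a random +/-1 image:
noisy = np.where(np.random.rand(50, 50) < 0.5, -1, 1)
restored = icm_denoise(noisy)
```

Each sweep is a single pass over the pixels conditioned on the current guess for all the others, exactly the heuristic described above; graph cuts would replace this loop with an exact optimisation for this family of energies.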
Because of the particular structure of the energy function this is a relatively fast algorithm — certainly a lot faster than exhaustive exploration, which would be exact but would take longer than the age of the universe — but no, I can't put numbers on it.

[Question about the h term in the energy.] Right, yes — this is another ψ in the factorisation, a unary term just over the x's, and it's simply a bias. If h is zero, it says each pixel is equally likely to be on or off; but it might be that in my example most of the pixels are yellow and only a few are blue, and that would be captured by this unary term. It looks at each pixel independently and says what the probability is of that pixel being on or off, independently of the other pixels — just a unary potential. So it biases the solution towards being mostly on, or mostly off, or, if h is zero, fifty-fifty in the middle.

[Question about how the terms trade off against the data.] Right — the model is trying to find a compromise. When you've got big areas that are all the same and there isn't too much noise, the pairwise potential will win out; it's balancing the pairwise term against what the data is saying, and the two may be in conflict. That's the tension being balanced, both in general and in this example.

[Question asking for more context on the energy function.] The terminology is borrowed from physics — energy is a physical property of the world — and there are nice analogies, which Zoubin Ghahramani's inference tutorial will probably probe more deeply: there's a very nice connection with the second law of thermodynamics and the notions of entropy and free energy, which can be used to do approximate inference in complex models, and I'm sure Zoubin will talk about that tomorrow in detail. But essentially, one way to think about it is that these are just words. The probability is always a strictly positive quantity, so I can always write it as the exponential of some real quantity, and that's convenient because, when you multiply all of these factors together, in the exponent you're just adding them, so you end up with an energy function which is a sum of lots of different terms. That's why it's often easier to work in terms of energies, but energy and probability are deterministically related to each other.

[Question: how did you come up with the energy, and what has it got to do with the graph?] I could just write down the energy function, and the graph is a way of expressing pictorially what the energy function — or, if you like, the choice of probabilistic model — is saying. For example, I can read off conditional independence properties: imagine this grid goes on for a long way and look at one node; it has four neighbours, it's only connected to four other nodes, and if I observe those four nodes then the state of this node is independent of everything else in the graph. So the graph tells me that any long-range effects have to be propagated by local influence, from one node to the next.
It's very much a localised structure. This isn't the most compelling example by any means; we'll see some graphs later that I think are much more compelling in helping you understand what a model is assuming and how it works.

[Question: with directed graphs we had a nice generative story; here there are no arrows — do we still have a generative story?] Great question. Yes, we do still have a generative story; what we don't have is quite such a causal story. I hesitate to step on the landmine of causality, because it's a big field with lots of subtleties, but directed graphs are often quite good at describing a sort of causal process: first flip the coins, and afterwards compute whether the coins are both heads, and so on. Given a directed graph with no observations, running it generatively is trivial: I start with the top nodes and sample those; then I move to the next nodes down — I know the values of their parents, so I just draw from those distributions — and so I can sample the entire graph, each step sampling from the local distributions. Very simple. There's also a generative story here: this defines a joint distribution over all of these variables, and in principle I can sample from it, but doing so is much harder, because there's no natural sequence in which to sample. So yes, I could run this generatively — I could generate sample images; I haven't, though I suppose I could have — but it's a lot more computational effort: it can't be done in a single pass, it's more iterative in nature. I could still draw lots and lots of images that were all samples from the distribution; it's just harder work. Some things are easier in directed graphs and some in undirected graphs; that's life, isn't it?

[Question about normalisation — I'll repeat it for the video.] In the directed graph everything was automatically normalised, because each of the factors is a normalised distribution; in an undirected graph it isn't, and I have to compute this normalisation factor, which is this horrendous sum — one of the disadvantages of undirected graphs. It matters because of the following. Suppose I said: these h, beta and eta are actually parameters — how do I know what to set them to? What I could do is get a whole lot of example images, with and without noise say, and learn the parameters from images. The problem is that Z depends on those parameters: once I sum (or integrate) out x and y, I'm left with a function of the parameters. If I want to do learning — say I've got a training set of images and want to tune the parameters so my model gives the highest probability to that training set, using some optimisation technique, whatever your favourite method is, taking gradients or whatever — then I need to know this quantity Z, which is sometimes called the partition function (another piece of terminology), in order to differentiate it, in order to maximise.
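Writing θ for the collection of parameters (h, β, η), the point being made is:

$$ p(\mathbf{x},\mathbf{y}\mid\theta) \;=\; \frac{1}{Z(\theta)}\exp\{-E(\mathbf{x},\mathbf{y};\theta)\}, \qquad Z(\theta) \;=\; \sum_{\mathbf{x},\mathbf{y}}\exp\{-E(\mathbf{x},\mathbf{y};\theta)\}, $$

$$ \ln p(\mathbf{x},\mathbf{y}\mid\theta) \;=\; -E(\mathbf{x},\mathbf{y};\theta) \;-\; \ln Z(\theta), $$

so a gradient-based maximisation over θ needs the derivatives of ln Z(θ), whereas for fixed θ the most probable configuration of x does not involve Z at all.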
And we just said that computing it requires this exponentially large sum; so a major downside of undirected graphs is learning the parameters.

[Question: when do you actually need the normalisation constant?] If you're just trying to maximise the distribution with respect to the x's and y's, Z doesn't matter, because it's independent of x and y — they've been summed out, so it's just a function of the parameters. For given parameters, if I only want the most probable configuration, the normalisation constant is irrelevant. But if I want to learn the parameters, then I need to take it into account, and that's difficult to do exactly, because the exact partition function — the exact normalisation constant — involves this exponentially big sum that I can't compute directly.

A question from the observation room: "This isn't just about images. Here you considered a black-and-white image with just two pixel values. If I have a greyscale image, say an 8-bit image with 256 grey levels, how do you define the energy function — how plausible is it? The coloured image also had only two values, yellow and blue. For a multi-level system, how do you solve this?" Okay — this is a toy example; real images aren't like this, and obviously I may have to deal with red, green and blue channels, with continuous values, and so on, and the answer really is: through more complex choices of potential function. I don't want to dive into a deep discussion of image modelling; I'll just show you one brief example and then move on. But in general, since the whole theme of these three lectures is modelling your particular distribution, questions of the form "how do I deal with this, how do I deal with that" often almost answer themselves: if instead of two states you've got six states, you need a variable with six states; if there are 256, you need 256. And if you ask what you know about the problem — well, you know that two neighbouring pixels are likely to be similar in a particular way — then you build a model that captures what you believe about the colour distributions of pixels, and so on. In a sense, all of this modelling is about describing your domain in terms of these factors, these potentials.

What I did want to do is show you one example on a real image. Here's an image — here we don't just have a binary image, we have RGB, and obviously those are continuous values — but this is solved using essentially the same technology. This is actually PowerPoint; let me come out of show mode into edit mode. If you've got a copy of Office, go into any of the Office applications — this is PowerPoint — select an image and you get a set of picture tools, one of which is called Remove Background. If I select Remove Background, it puts down a bounding box — just a wild guess of a bounding box —
and then it segments out the foreground and background. It isn't quite right, of course — it's missed a little bit of the leaves out there — but I can just do Keep Changes, go back again, capture a little bit more... there we go, Keep Changes. The point is that I wanted to segment out that object and paste it into another image. I could have spent hours going around the edge with electronic scissors, but instead here's a simple tool that just lets me segment it out. You'll also see some extra little tools: for instance, if it makes a few small mistakes, I can mark areas to remove or areas to keep, it fixes those to match what I've selected, and then it reruns inference. This is using a Markov random field type model, running in real time on megapixel-size images on my laptop, using essentially a more sophisticated version of the model I've just described.

I want to end this part with a comment about directed versus undirected graphs. We've actually already seen quite a few of the differences. A directed graph specifies both a set of conditional independencies and a factorisation, and the same is true of an undirected graph, but the two are genuinely different. Here's an example: the graph over three variables in which, you'll recall, a and b are marginally independent but become conditionally dependent when I observe c. Those are the conditional independence properties of this graph, and there is no undirected graph over three variables with the same set of conditional independence properties; to capture them I need a directed graph. Conversely, the simplest example of an undirected graph whose conditional independence properties can't be expressed precisely by a directed graph is this graph over four nodes. Its conditional independencies are simple: if I observe two opposite nodes, say c and d, then a and b become independent, and similarly if I observe a and b then c and d become independent; but otherwise — if one node or no nodes are observed — the remaining nodes are dependent. That's an example of a set of conditional independence properties that can't be captured precisely by a directed graph. So the two kinds of graph are complementary. If we want to go beyond them, there are graphs with both directed and undirected links, which are more general than either; they're called chain graphs, and that takes us off into a rather complex space.

All right — before we move on, any questions about what we've just covered? We've spent a while looking at conditional independence properties and reading them off graphs. [Question: can we view undirected graphs as directed graphs with arrows on both ends of the links?] You can certainly invent new kinds of graphical notation — you can invent your own, and I'm going to show you something new and interesting a bit later, in the next lecture, if we get a chance — but the question is always whether it has a consistent semantics, whether it's useful, whether it's valuable in some way. A lot of things have been invented: we've had directed graphs with loops and all sorts of things.
I'll show you one example of an extended semantics that is actually very useful when we come to do inference. [Quick question: can we combine directed and undirected links?] Yes — there is a framework that includes both directed and undirected links; those are the chain graphs I referred to earlier. They're more complex, but they include directed and undirected graphs as special cases. If you read Steffen Lauritzen's book on graphical models, he has a chapter on chain graphs.

Okay, let me press on. I want to talk about another kind of graph which is very useful for inference: it's called the factor graph, and we've actually encountered the idea briefly already. Remember that directed graphs specify a factorisation in terms of conditional distributions, and so do undirected graphs; so we can subsume both by considering them simply as specifying factorisations. The key thing is that they specify a factorisation, and so we can express that factorisation explicitly. That's the idea of a factor graph. A factor graph is more verbose — more explicit — than either a directed or an undirected graph, because we make not only the variables explicit but also the factors. In a factor graph we again have a node for each variable, in this case three variables, but we also have a solid square node for every factor. Here's an example of a distribution over three variables which factorises into a product of four factors. Consider the first factor, f_a: it depends only on variables x1 and x2, so there's a link between the factor f_a and variable x1, and a link from f_a to x2. You write down a circle for every variable and a little solid square for every factor, and then you connect each factor to every variable on which it depends. f_b also depends just on x1 and x2, but it's shown as a separate factor — we could absorb the two together into a new factor, but showing them separately is being much more explicit about the factorisation. f_c depends on x2 and x3, and there's another factor, a unary factor, that depends only on x3; again, in principle we could absorb it into f_c, but we can also keep it explicit. So this is a more explicit, but therefore more cluttered, graphical notation, and the joint distribution is simply the product of these factors.

If you're given a directed graph, you can convert it to a factor graph. Here's an example with three variables; we know how to factorise the distribution according to that graph — that's written there, and you're familiar with it now. To represent it as a factor graph we again have the three nodes x1, x2, x3, and now we have three factors f_a, f_b and f_c, which correspond to the factors in the directed graph: f_a is just p(x1), and so on. So this is really just a slight change of notation. The other thing we sometimes do is put little arrows on these links, making it a directed factor graph; that's just remembering that it came from a directed graph — remembering those arrows. What it's really saying is that the factor f_a is a normalised distribution over x1, and f_c is a normalised distribution over x3 which depends on x1 and x2. So that's just a little bit of additional notation.
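One way to see how explicit a factor graph is about the factorisation is to write it down as data. Here is a minimal sketch mirroring the directed example just converted — fa = p(x1), fb = p(x2), fc = p(x3 | x1, x2) — where each factor is just a table over the variables in its scope; the numbers are hypothetical.

```python
import numpy as np

factors = {
    "fa": (("x1",), np.array([0.4, 0.6])),                      # p(x1)
    "fb": (("x2",), np.array([0.7, 0.3])),                      # p(x2)
    "fc": (("x1", "x2", "x3"),                                  # p(x3 | x1, x2)
           np.array([[[0.9, 0.1], [0.5, 0.5]],
                     [[0.2, 0.8], [0.6, 0.4]]])),
}

def joint(assignment):
    """Evaluate the joint at a full assignment, e.g. {'x1': 0, 'x2': 1, 'x3': 0},
    as the product of the factors."""
    value = 1.0
    for scope, table in factors.values():
        value *= table[tuple(assignment[v] for v in scope)]
    return value

print(joint({"x1": 0, "x2": 1, "x3": 0}))   # 0.4 * 0.3 * 0.5 = 0.06
```

The graph drawn on the slide is exactly this data structure in picture form: one circle per variable, one square per entry of the factor dictionary, and a link wherever a variable appears in a factor's scope.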
So the thing I want to talk about next is inference. Now, inference is just sort of Bayes' theorem, but in more complex situations. There are three whole lectures on inference given by Zoubin; I have coordinated these lectures with Zoubin, and I'm going to give you a brief introduction to inference, then he's going to go over some of the same ground again but in much more detail. First of all it's quite useful, I think, to hear the same thing from two different people; also I need to talk a little bit about inference because I need to introduce you to the idea of message passing, so that we can use it a little bit later on, and that will give you a lot of insight, I think, into these models. So I'm going to give you a brief introduction to inference using graphical models, and particularly using factor graphs. In principle you already know how to do inference, because you know the sum rule and the product rule, and in principle that's all you need. If I give you a model which is a joint distribution over, let's say, a million variables, and I say the following variables have been observed and I give you their values, that's our data, and I ask you for posterior distributions over other variables, then you all know how to compute them: you write down the joint distribution, you set the observed variables equal to their observed values, and then you just sum out all the other things that you don't care about, and you're left with the variables you're interested in. The problem is, imagine all my variables are just binary: if there are a million variables, there are two to the power of a million terms in the joint distribution, so just storing it or computing with it is two to the power of a million, and there are only about two to the power of ninety electrons in the universe or something, so it's a big number. This is completely intractable. So what we're going to do is use the fact that our joint distribution is not a general distribution; it's a factorized distribution in which each of the factors depends only on a few local variables, and I'm going to use that to do inference which is much more efficient: instead of being exponential it'll be much less costly, it might even be linear in the size of the graph. So this is the key idea. We've got to do lots and lots of sums over thousands or millions of variables potentially, and we could do it naively, which would be very expensive, but instead we're going to exploit the factorization. Imagine we've got two variables x and y, each with two states, and we're going to consider the function x times y, and we're going to sum it over all the states of x and all the states of y. Well, that's easy to do, right: here are the terms, x takes the states x1 or x2, and similarly for y, so there are four possible terms in the sum. That, if you like, is the naive way of doing it, and we could do that for any function of x and y. But this is not just any old function: it factorizes into a function of x times a function of y, so we're going to exploit that factorization by interchanging the multiplication and the summation, and I can write it in this form, which is now a factorized form. Analytically this is exactly the same as before, but I've now exploited the factorization.
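As a tiny illustration of that interchange (my own sketch, not from the slides), the naive double sum and the factorized form give exactly the same number, with different amounts of work; the particular state values are made up.

```python
# Two two-state variables; the "states" here are just illustrative numbers.
xs = [0.3, 0.7]   # x1, x2
ys = [0.4, 0.6]   # y1, y2

# Naive: enumerate all four terms x*y (4 multiplications, then 3 additions).
naive = sum(x * y for x in xs for y in ys)

# Factorized: interchange summation and multiplication
# (2 additions, then 1 multiplication).
factorized = sum(xs) * sum(ys)

print(naive, factorized)   # the two values are identical
```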
Now computationally it's a very different thing, because to evaluate the first form I need four multiplications and then three additions, so I need seven operations, whereas here, by doing the summations first, I need two additions and then one multiplication, which is three operations. So analytically these two things are the same, but computationally the second one is much cheaper: I've exploited the factorization to compute something that's analytically the same but computationally much faster to compute. And that's the basic trick that we're going to employ, but now we're going to apply it to factor graphs. We're going to use the factorization to go from naive, exponential, intractable inference to something much more efficient and much more tractable, but still exact; I'm still doing exact inference. So what I'm going to do now is sort of "derive", in inverted commas, because it's not a formal proof, just a sketch to help you understand and to motivate this (you can go away and read a more formal derivation afterwards), a way of doing exact inference on graphs, and we'll use factor graphs for doing it. It's exact inference on factor graphs that exploits the factorization property to do efficient exact inference, and it's called the sum-product algorithm. So here's a simple sort of graph on which we can derive this: a factor graph over five variables involving four factors. The joint distribution, let me go to the next slide, the joint distribution for this factor graph over the five variables u, w, x, y and z is just equal to the product of four factors: f1, which depends on u and w, times f2 of w and x, times f3 of x and y, times f4 of y and z. So we've got five variables, but we're not dealing with all possible joint distributions over five variables; we're dealing with a relatively very small subset, those distributions which factorize into the product of four factors of that form. Now let me ask an inference question. The question is: can you please tell me the marginal distribution of w? Okay, let's try to compute the marginal distribution of w. Well, you all know how to do that, because you all know the sum rule: we take the joint distribution, and from the sum rule we just need to sum out all the other variables, so we need to compute the sum over u, the sum over x, the sum over y and the sum over z of p of u, w, x, y and z. Now let's suppose that each of these variables has K possible states. This joint distribution, remember the table I had, the big table of culprit and weapon, a three-by-two table with six possible states of the world; if I have five variables, each with K states, this is a table that's K times K times K times K times K, so simply to write down the joint distribution as a big table would be of order K to the power five. If there are M variables, it would be K to the power M, so this is growing exponentially in the size of the graph, and that's bad news: to write it down or to compute with it is going to be exponential. But I'm not working with all possible distributions, only with a special subset that factorizes, so what I can do is take that factorization and substitute it into the sum, and then use the fact that each of those little local factors depends only on a subset of the variables to interchange the summations and multiplications, just as we did in the x-y example on the previous slide.
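As a point of comparison for what follows, here is a brute-force version of that marginal (a sketch of mine): loop over every one of the K to the power five joint assignments and sum. The factor tables are random and made up, and the variable ordering u, w, x, y, z follows the chain in the lecture.

```python
import numpy as np

K = 4                         # number of states per variable
rng = np.random.default_rng(0)

# Made-up non-negative factor tables for the chain u - w - x - y - z.
f1 = rng.random((K, K))       # f1(u, w)
f2 = rng.random((K, K))       # f2(w, x)
f3 = rng.random((K, K))       # f3(x, y)
f4 = rng.random((K, K))       # f4(y, z)

# Naive marginal of w: sum the joint over u, x, y, z (K**5 terms in total).
p_w = np.zeros(K)
for u in range(K):
    for w in range(K):
        for x in range(K):
            for y in range(K):
                for z in range(K):
                    p_w[w] += f1[u, w] * f2[w, x] * f3[x, y] * f4[y, z]

p_w /= p_w.sum()              # normalise to get the marginal distribution
print(p_w)
```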
So when I do that, I can write it as follows. We start with the sum on u; well, the only place that u appears is in f1 of u and w, and then I'm left with the sum on x, the sum on y, the sum on z of f2 of w and x, f3 of x and y, f4 of, y and z, thank you, okay, I will make deliberate mistakes from time to time, so do call out. All right, the thing in brackets here is a function only of w, and what I'm going to do is give it a graphical interpretation: I'm going to interpret it as a message which is sent from the factor node f1 to the variable node w. So this thing in square brackets I consider to be a message that f1 sends to w, and it's a function only of w because I've summed out the variable u. It's multiplied by this other quantity, and this other quantity is again a function only of w, and I'm going to interpret that as a message which f2 sends to w, again a function of w. Now let's consider the message from f1: how costly is that to compute? Well, I only need this one factor, and that's a K-by-K matrix, so it's K squared numbers to store, and the summation has K terms; I have to evaluate the sum for every one of the K settings of w, so there are K calculations to do, each of which involves summing K terms, so again that's quadratic, that's K squared. So I can compute this message, which is a set of K numbers, with order K squared storage and order K squared computation; it's quadratic. Now, let's assume I've also managed to compute the other message efficiently; then this multiplication is just a pointwise multiplication for each value of w to get the marginal, so that's an order K operation. So everything is either linear or quadratic so far. What I want to do now is continue to exploit the factorization in order to compute that second message recursively. So I want to compute the message which f2 sends to w, which is a function of w. Let's have a look at this expression: if I look at the sums over y and z, this factor f2 of w and x doesn't depend on y and z, so I can pull it out of those sums. Let me do that: I can write this as a sum on x of f2 of w and x, and I'm left with the sum on y, the sum on z of f3 of x and y times f4 of y and z. And now I'm again going to interpret this as a message: the thing in square brackets I interpret as a message which x sends to f2, and it's a function of x; that's this message. So let's just summarize what we've got so far. It says that to compute the marginal distribution at this node w, I simply multiply the two messages sent by the neighbouring factors, and in general, if I have several factors, I just multiply all the messages: to compute the marginal at a node I multiply all the incoming messages from neighbouring factors. How do I get those messages? Well, consider this message: I first take the incoming message, this thing which is a function of x, which is like an incoming message from the variable node to the factor node, I multiply it by the factor, and then I marginalize over the corresponding variable. So I take the factor at the factor node, I take the incoming message from the variable node and multiply it in, and I then marginalize out that variable to generate the outgoing message. Now I need to compute this remaining message; how do I do that?
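For instance, the first of those messages is just a column sum of a factor table. A small sketch of mine, using the same made-up f1 table as in the brute-force snippet above:

```python
import numpy as np

K = 4
rng = np.random.default_rng(0)
f1 = rng.random((K, K))          # f1(u, w), the same made-up K x K table as before

# Message from factor f1 to variable w: mu_{f1->w}(w) = sum_u f1(u, w).
# K additions for each of the K settings of w, so O(K^2) work on a K x K table.
msg_f1_to_w = f1.sum(axis=0)     # a vector of K numbers, one per state of w
print(msg_f1_to_w)

# Once the message from f2 to w is also available, the (unnormalised) marginal
# of w is just the pointwise product of the two incoming messages, an O(K) step.
```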
Well, again, this factor f3 of x and y doesn't depend on z, so I can swap those summations and write it as a sum on y of f3 of x and y, times the sum on z of f4 of y and z, and again I can interpret these as messages from factor nodes to variable nodes. So this is the message which factor node f3 sends to x, a function of x, and this is the message which f4 sends to y, a function of y. So what this says is that to compute the outgoing message from a variable node, I just multiply the incoming messages from the other factor nodes. Again, you'll notice that each of these messages involves just a K-by-K matrix, and I have to sum over K terms for each of the K values of the other variable, so each of these is quadratic. So instead of being order K to the fifth, this is now order K squared times, roughly, call it the length of the graph, roughly the number of nodes in the graph; you can imagine recursing this all day long for some great long chain of nodes. So it's gone from being exponential in the size of the graph to being linear in the size of the graph, and yet it's still exact, just by exploiting the factorization. So what I've done is derive a very specific calculation: how to compute the marginal of this particular variable on this particular graph, and now, by a sort of inductive step, I hope to convince you; I'll show you the sum-product algorithm in all its glory and you'll see that what you've just seen is a special case of it. The sum-product algorithm has three equations. It says that to compute the marginal at a variable node x, you multiply together all the incoming messages from neighbouring factor nodes; remember this is always a bipartite graph, factor nodes only connect to variable nodes and variable nodes only connect to factor nodes. So for a given variable node, with factor nodes around it in some complicated graph, I just take the product of all the incoming messages, and that gives me the marginal at that node. To get the message which a factor node sends to a variable node, I take all the incoming messages from the other variable nodes, that is, these guys, I multiply them by the factor at the factor node, and then I sum over all the other variables except the outgoing variable. So this is x1 and this is f, and the message which f sends to x1 is just the product of all the incoming messages from the other variable nodes, times the factor, and then summed. Those are the messages which a factor node sends to a variable node. And the message which a variable node sends to a factor node is very simple: to get this message here, with lots of factors connecting to the variable, the message from a variable to a factor is just the product of all the other incoming messages. Now, there's one little caveat here, if you think about it for a moment. Let's go back to this problem: I asked for the marginal of this variable, and I wrote it as a product of stuff which depended on everything out here to the left, times stuff which depended on things to the right. Effectively I could do that because this is a tree-structured graph, so when I consider one variable it breaks the graph in half, and I've got a bunch of stuff over here involving some variables and a bunch of stuff over there involving other variables, but there's no variable that's common to both of them.
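Putting those rules together for the little chain gives the following sketch (mine, not the lecturer's); with the same made-up tables and seed as the brute-force snippet earlier, it reproduces the same marginal of w while doing only O(K squared) work per message instead of K to the power five overall.

```python
import numpy as np

K = 4
rng = np.random.default_rng(0)
# Made-up factor tables for the chain u - f1 - w - f2 - x - f3 - y - f4 - z.
f1, f2, f3, f4 = (rng.random((K, K)) for _ in range(4))

# Messages flowing towards w from the right-hand end of the chain.
msg_f4_to_y = f4.sum(axis=1)      # sum_z f4(y, z)
msg_y_to_f3 = msg_f4_to_y         # a variable simply forwards its only other incoming message
msg_f3_to_x = f3 @ msg_y_to_f3    # sum_y f3(x, y) * msg(y)
msg_x_to_f2 = msg_f3_to_x
msg_f2_to_w = f2 @ msg_x_to_f2    # sum_x f2(w, x) * msg(x)

# Message flowing towards w from the left-hand end.
msg_f1_to_w = f1.sum(axis=0)      # sum_u f1(u, w)

# Marginal of w: pointwise product of incoming messages, then normalise.
p_w = msg_f1_to_w * msg_f2_to_w
p_w /= p_w.sum()
print(p_w)   # matches the brute-force K**5 enumeration above (same seed, same tables)
```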
Okay, and that's a property of a tree, the definition of a tree, really. If the graph had some loop, if there was some other factor that, you know, connected these together like this, that would no longer be possible, that factorization wouldn't be possible, and so what I've derived, this sum-product algorithm, only applies to tree-structured graphs. Something else as well, a bit like that story about Feynman, I guess, and his diagrams: supposing I wanted something else. Let's suppose I now wanted the marginal for x. I've calculated the marginal for w, and now somebody says, can you get me the marginal for x? I could do the same thing again, propagate messages in from here, messages in from there, messages in from there, but most of that work I've already done, right: those first messages are exactly the same as when I was computing the marginal for w, so I've already done that work and I don't need to repeat those messages; once they've been computed they can be cached. So in fact, if I asked for all of the marginals, if I wanted the marginal for every variable in the graph, I would only have to do twice as much work. If you think about it, to get the marginal at any node I propagate messages in from all of the leaves back to that node, which I can think of as a sort of root node. So pick a node, call it the root node, it doesn't matter which one, propagate messages in from the leaves to the root and from the root back out to the leaves again, so that every link has had a message going in each direction. That's twice as much computation, and I can now get the marginal for any node, because at any variable node I've got the incoming messages from all directions. So with only twice as much computation I can get all of the marginals. So we have a sort of message schedule: we pick a root node, we send messages from the leaves to the root and from the root back out to the leaves, and now we're done, we can get all the marginals; we've had one message passed in each direction on each link. And that's exact inference for directed graphs and for undirected graphs, because they can all be expressed as factor graphs, provided they're trees, tree-structured graphs. Okay, what if the graph isn't a tree? What can we do? One technique is called cut-set conditioning. Supposing, let's go back to, here we go, here's a graph which isn't a tree. I could condition on one of these variables: let's say w has K possible states, I could just fix it to one of those states, and now what I'm left with is a tree; we've broken the dependency and I'm left with a tree. I can run my sum-product algorithm and compute my marginals for that value of w, then I set it to the next value and do the same, and so on, so I run it K times and I've got the K marginals conditioned on w. I also have the prior for w, so I've got the thing I want given w, times p of w, and I can easily reconstruct the thing I want. That's called cut-set conditioning, and it's great if, by conditioning on one variable, I can turn the whole thing into a tree. But if you think about that image problem, a whole grid of nodes, there's no way, you know, to turn that into a tree with one node; I'd have to condition on lots of nodes, and I'd then have to run the computation for every possible combination of settings of the cut set. So sometimes cut-set conditioning can work and be very efficient, but in general it's not a good way to go.
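Here is a sketch of mine of the two-sweep schedule described above, on the same made-up chain: one pass from left to right and one from right to left puts a message in each direction on every link, after which every marginal is a pointwise product.

```python
import numpy as np

K = 4
rng = np.random.default_rng(0)
# Same made-up chain u - f1 - w - f2 - x - f3 - y - f4 - z as before.
fs = [rng.random((K, K)) for _ in range(4)]   # fs[i] couples variable i and variable i+1
n = len(fs) + 1                               # 5 variables: u, w, x, y, z

fwd = [np.ones(K)]                            # fwd[i]: message arriving at variable i from the left
for f in fs:
    fwd.append(fwd[-1] @ f)                   # sum_prev f(prev, next) * fwd(prev)

bwd = [np.ones(K)]                            # built from the right-hand end
for f in reversed(fs):
    bwd.append(f @ bwd[-1])                   # sum_next f(prev, next) * bwd(next)
bwd = bwd[::-1]                               # bwd[i]: message arriving at variable i from the right

names = ["u", "w", "x", "y", "z"]
for i in range(n):
    marg = fwd[i] * bwd[i]
    print(names[i], marg / marg.sum())
# With the same seed, the row for w matches the single-marginal computations above,
# and the whole set of marginals costs only about twice one root-directed pass.
```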
The other thing I can do is use some graphical machinery, something called the junction tree algorithm, which would be, you know, an entire lecture in itself. But essentially I take the graph and transform it through a series of purely graphical operations until I end up with something which does have a tree structure, but whose nodes are not raw variables; they're composite nodes that contain many variables of the original graph. Now I can run sum-product on that tree, and the cost will be linear in the size of the tree but exponential in the number of variables within one of those composite nodes, and in the worst case my junction tree is just one node containing all of the variables, which is exponential in the size of the graph. So the junction tree is a general algorithm for doing exact inference on graphs with loops, but it may or may not be tractable in practice, depending on the exact structure of the graph. Another thing I can do is simply ignore the fact that the graph is not a tree, while recognizing that the result will no longer be exact, and just iterate: I initialize my messages, I start propagating them according to some schedule, and I send them around and around the graph; after a while I stop, I see what I've got, and I see whether it's useful. And it turns out to be remarkably effective. It also turns out, people discovered after the event, that the state-of-the-art error-decoding algorithm, which was developed by a completely different methodology, is exactly the same algorithm. So that's called loopy belief propagation: you've got loops and you're passing messages around the loops, just running the sum-product algorithm, but iteratively, for some period of time, hoping you get a good answer, and very often you do. Finally then, what if the messages themselves are intractable? The messages I've considered so far are obtained by taking these little matrices and summing; they were discrete variables, so I do little summations. What if I've got continuous variables, so I now have to do integrations, and those integrals are intractable? Imagine my true distribution is something very complicated with lots of humps and bumps, and I need to do some integral along this axis, and of course this might be a million-dimensional space rather than a one-dimensional one, and I can't do these integrals exactly, so I can't even compute the messages. What am I going to do? Well, there's a very general, very powerful technique called Monte Carlo, which involves drawing samples from the distribution. There's a highly developed field, an entire industry of people and algorithms for doing Monte Carlo sampling. It's very general, it applies to a very broad class of models, and it very often has the nice property that with an infinite amount of compute power you could come arbitrarily close to the exact answer, which is a really nice property. In practice, Monte Carlo methods generally speaking don't scale very well to large problems, so if you're working with, you know, a hundred million users or billions of data points, if you're working at internet scale, Monte Carlo methods are not very scalable and generally don't apply. So one of, I think, the most interesting developments in machine learning in the last decade or so has been the development of approximate inference algorithms that do scale, and there are a whole bunch of these: variational message passing, expectation propagation and so on, and these will, I'm sure, make up a good part of the lectures by Zoubin Ghahramani.
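As a toy illustration of the Monte Carlo idea (my own sketch, not tied to any particular message): estimate an expectation by averaging over samples, and watch the estimate home in on the exact answer as the number of samples grows. The integrand and the sampling distribution are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Suppose we need E[g(x)] under some distribution p(x) but cannot do the
# integral in closed form.  If we can draw samples from p, the sample
# average converges to the exact value as the number of samples grows.
def g(x):
    return np.cos(x) ** 2            # an arbitrary illustrative integrand

for n in (10, 1_000, 100_000):
    samples = rng.normal(loc=0.0, scale=1.0, size=n)   # draws from p = N(0, 1)
    print(n, g(samples).mean())

# For this particular p and g the exact value is (1 + exp(-2)) / 2, roughly
# 0.5677, so the estimates approach that number as n increases.
```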
But roughly speaking, instead of drawing samples, what these methods do is approximate the true but complicated distribution with a family of simpler distributions, and then explore that family to find the member which best represents the true distribution. So those methods are approximate even with infinite compute power; you see, this multi-modal thing will never be perfectly approximated by a, you know, unimodal family of approximations, so they'll never be exact, but on the other hand they often scale, and so they're very effective for large-scale applications. And finally, one last point, about learning and inference. We have learning, we have inference, and we tend to think of inference as figuring out the unobserved variables, whereas learning is tuning the parameters. Well, in this probabilistic view of the world, this Bayesian view if you like, there actually isn't much of a difference. Here's a model: it might be a model of an image, or a speech recognition model; here's a sequence of things we observe, I don't know, these might be bases in a DNA sequence, and these might be variables saying whether each one is part of a gene or not. I've modelled these as a sort of chain, but I've also got parameters, and I don't know the parameters, so I describe those by distributions, and therefore they correspond to nodes in the graph. Figuring out these guys you might call an inference problem, and tuning these parameters you might call a learning problem, but really, viewed as a graph, they're just unobserved variables. So learning and inference are the same thing in this framework: we've just got some unobserved variables, and we're going to run inference to get their posterior distributions. That's a good place to stop, so, any questions? Yes; and, no, you can still read off the conditional independence properties from the factor graph, so yes, you can still use it. The disadvantage of the factor graph really is just that it's more cluttered, because it contains more information than either of the other graphs. In a sense, the reason why you might not always use it, and what typically happens, is that you're standing at a whiteboard with somebody, you know, trying to build some new application, and at first you're just trying to sketch out different models, so you generally represent those as directed graphs, because you're interested in what the variables in the model are and their relationships, which ones depend on which ones, and directed graphs are a nice quick way of sketching that out. Then you say, okay, we've got our directed graph, this is the kind of structure we want, and now we need to say exactly what these distributions are, precisely how this variable depends on those guys; you're specifying, you know, the fact that this is a Gaussian distribution whose mean is a certain function of this other variable, and so on, and at that point you tend to go to the factor graph, and the factor graph is also what you use to run inference. Just a little terminology question: variational message passing, is that the same as variational inference? It's a great question, and the terminology is a little confusing. So, variational: think of ordinary calculus as optimizing things with respect to one variable or several variables, but now consider something which depends on a whole function; something that takes a function as input and produces a number is called a functional.
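To pin down the word "functional", here is a small sketch of my own, not from the talk: a map that takes an entire function as input and returns a single number, evaluated across a family of functions so you can watch that number change.

```python
import numpy as np

xs = np.linspace(-5.0, 5.0, 2001)       # grid for crude numerical integration
dx = xs[1] - xs[0]

def entropy(f):
    """A functional: it takes a whole function f (an unnormalised density)
    and maps it to a single number, here the differential entropy."""
    p = f(xs)
    p = p / (p.sum() * dx)               # normalise on the grid
    return float(-(p * np.log(p + 1e-300)).sum() * dx)

# Explore a family of functions, Gaussians of different widths, and see how
# the output of the functional varies as we move through the family.
for sigma in (0.5, 1.0, 2.0):
    print(sigma, entropy(lambda x, s=sigma: np.exp(-x**2 / (2 * s**2))))

# The exact entropy of N(0, sigma^2) is 0.5*log(2*pi*e*sigma^2), about
# 0.73, 1.42 and 2.11 for these widths, so the grid values come out close.
```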
And you can imagine adjusting or exploring the whole family of functions and seeing how the output varies; that's a functional, and that's the calculus of variations. So what we're doing in this deterministic approximation framework is taking the true distribution, which is intractable, and considering a family of simpler distributions, let's say Gaussian distributions for example, and we're exploring the family of distributions by varying the mean and the covariance and so on, so we're varying an entire function, an entire distribution; that's a variational procedure. So variational inference is this whole family, expectation propagation and power EP and so on and so on, and within that space of variational methods there's one particular algorithm which depends on optimizing a particular divergence measure, and that is what's known as variational message passing. So it's a little unfortunate, I guess, but variational message passing, VMP, refers to a specific algorithm which is distinct from expectation propagation and various other algorithms. They all define a measure of the discrepancy between the true distribution and your family of approximations, and they then optimize that measure of divergence by propagating local messages on the graph, so they're rather like loopy belief propagation, except that the messages and the scheduling and the way they're computed vary from one specific algorithm to another. Just picking up on that: you talked about the KL divergence; in statistics we have different measures of divergence that we might want to minimize, any distance between two probability distributions. How do we actually deal with that situation? Because I don't know which one is the best; people pick up the KL divergence every time and try to use it, but that's a fundamental question: how do we actually choose between two divergences? Okay, first of all, that's way beyond the scope of my talk and into the scope of Zoubin's talk, but I'll answer the question briefly. Yes, there are various different choices for that divergence, and the two most popular are the KL divergence for variational message passing and the reversed version for expectation propagation. Tom Minka, who's running the practical session on inference in graphical models, who's going to be here I think this weekend or next week, and who is actually the creator of expectation propagation, that was his PhD thesis, wrote a very nice report a few years ago showing that all of these different methods, power EP and so on, are all special cases of message passing according to a family of divergences called the alpha family. So alpha is a parameter going from minus infinity to plus infinity, and different choices of alpha correspond to different divergences: the KL divergence and the reversed version are two particular choices of alpha, power EP corresponds to other choices of alpha, and so on. So he showed that there's this whole family of divergences, the alpha family, and the alpha family has particularly nice properties as a family of divergences, and depending on your choice of alpha you get different inference algorithms. Those algorithms have different properties, and depending on what you're trying to do you may want to use variational message passing, or you may want to use EP, or you may want to use something else.
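As a rough numerical illustration of why the choice of divergence matters (my own sketch, not taken from Minka's report): fit a single Gaussian to a two-bump target on a grid, once by minimizing KL(q||p) and once by minimizing KL(p||q). The target, the grids and the search ranges are all invented for illustration.

```python
import numpy as np

xs = np.linspace(-8, 8, 4001)
dx = xs[1] - xs[0]

def normal(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# A "complicated" two-bump target distribution p(x).
p = 0.5 * normal(xs, -2.0, 0.5) + 0.5 * normal(xs, 2.0, 0.5)

def kl(a, b):
    """Grid approximation to KL(a || b) for densities tabulated on xs."""
    return float(np.sum(a * np.log((a + 1e-300) / (b + 1e-300))) * dx)

best = {"KL(q||p)": (np.inf, None), "KL(p||q)": (np.inf, None)}
for mu in np.linspace(-3, 3, 61):
    for sigma in np.linspace(0.3, 3.0, 28):
        q = normal(xs, mu, sigma)
        for name, val in (("KL(q||p)", kl(q, p)), ("KL(p||q)", kl(p, q))):
            if val < best[name][0]:
                best[name] = (val, (round(mu, 2), round(sigma, 2)))

print(best)
# Minimizing KL(q||p) typically locks onto a single mode (mu near +/-2, small
# sigma), while minimizing KL(p||q) spreads across both modes (mu near 0,
# large sigma): the same family, two divergences, two different answers.
```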
And that very quickly gets us to the point where we hit the frontiers of research, because the general problem of doing inference efficiently at scale in complex models is very much an open research problem. There are lots of challenges, it's a very active and fascinating research area, and one where, if you can make progress, it immediately translates into all kinds of practical problems, because increasingly we want to do inference at scale and efficiently: you want good inference, but it's got to be fast, and it's really only these types of methods that, at the moment, can deliver on that sort of promise.
Info
Channel: Max Planck Institute for Intelligent Systems
Views: 35,414
Rating: 4.9723182 out of 5
Keywords: Machine Learning (Field Of Study), Graphical Models
Id: c0AWH5UFyOk
Length: 95min 47sec (5747 seconds)
Published: Sat Dec 28 2013