Iain Murray: "Introduction to MCMC for Deep Learning"

Captions
If you've seen earlier video lectures of mine, you might recognize some of this; I've tried to rejig it to be relevant to this meeting and deep learning. If you don't know much about Monte Carlo or MCMC, then this first talk is for you: I'm going to give a very basic introduction to Monte Carlo and things like Gibbs sampling, so if you're lost at any point, this talk is for you and you should stop me. I'm probably not going to get on to the auxiliary variable methods before the break, so I'm going to continue this talk after the break and tell you about hybrid Monte Carlo, Hamiltonian Monte Carlo, and more advanced things.

I was looking on the internet to see what was going on at this meeting before I got here, and this was something I saw one of you guys write on Google+, summarizing the feeling from some of the talks in the first week. You've heard a lot about sparse modeling, and not everybody in this community values using probabilistic models and a probabilistic framework, and there's really no point using Monte Carlo methods unless you're using probabilistic modeling. So I don't want to start a fight, but in both of my talks I'm going to give a little bit of motivation for why I find it useful to look at probabilities, and this is a conversation we can continue in the breaks and the panel discussion.

OK, so there's a really basic reason I like probabilistic models, and that's that I can draw fantasies from the model and see what the model believes about data. That lets me understand the models; we can also use it as part of our learning; and it's also a way of communicating high-dimensional objects to people. It's very hard to summarize a distribution over high-dimensional objects, but if you've got samples you can give those samples to people, or to another model, and that's a compact summary of the sorts of things the model believes. If you pick up any responsible Bayesian modeling paper, and I've just plucked a random one, this was in UAI in 2005, you'll often see samples drawn from the model. This is a robot mapping problem, but it doesn't really matter what it is. They've got sonar data in the bottom left, and what they've done in the top is draw from the priors of their models: they said, we've got this model of what we think rooms might look like before we get data, and these are typical samples of what we think rooms might look like. You can see that no real building looks like that, and so in the posterior they get all sorts of jaggedy artifacts as a result, and so you can understand what's going on; you could go in and maybe try to come up with a better prior over rooms.

In the deep learning community there's a lot of this going on. You've seen in Marc'Aurelio's talks that people regularly draw samples of patches of natural images from models, and they can do this criticism of saying: are we doing a good job, is our model fitting? These figures have been in previous talks, from work quite a few years ago. These are real natural image patches, and these ones were drawn from a model that was state-of-the-art four years ago. If you look closely you can see problems, and if you make them bigger you can see problems. Marc'Aurelio has been fixing up some of those problems, but if you look at his patches and they're big enough, they still suck to some extent; there's still a lot of work to do, which is a good thing, because otherwise we'd all go home. So I find that useful: drawing fantasies is really useful for human learning, because from a few examples we're really good at coming up with really rich hypotheses about what might help and what's going on.
But we're interested in doing machine learning too, so here's an example of using fantasy data to help machine learning. Most of you have probably seen the Microsoft Kinect: it's a camera that you can attach to an Xbox, and it can see depth, it has a special camera that sees depth, so it can see people like this, the gray figures. One of the things the Kinect wants to do is understand the pose of people, so that you can use your whole body as a controller in a game. So they needed to write code that would actually solve that task for arbitrary people: for me, and for some enormous fat guy, and for a toddler, and it would work even if I was wearing baggy clothing or a swimsuit. This isn't the hardest task in the world, but if you want a robust system that you can ship in a games console and that actually works, then I think that's a challenge. In some ways it's a fairly standard machine learning problem: you get a whole load of training data of depth maps and labels, and you're basically learning a classifier with some structure. But the question is where you get all that data from. You could imagine paying a hundred volunteers to stand in front of a Kinect, and it would take quite a long time, and you're not going to span the whole space of different poses of people, and then you're going to get all your volunteers to dress up in all sorts of different clothes, and then you're going to label all of those images; it's a nightmare. So what they actually did was this: all of the images on the left aren't real data, they're generated from a model of human pose, completely synthetic. Then they can get as much data as they want, over as many different types of people as they want. And when you've got infinite data it doesn't really matter what you use to fit it, so they used random forests; they could have used other things, I'm sure, and it would work. So generating data is useful: if you really want to train some part of the system and nail it down, you can get as much data as you want. I think the interesting future direction is how to make this thing deep: if you want to learn something much less constrained than this Kinect problem, how do you go in and learn separately about pose and lighting and so on by generating the right data? This is an open issue that I know people at Microsoft are thinking about.

OK, so we've seen a lot of vision, so I wanted to give you one other example of the type of probabilistic modeling that I'm excited about. This is a graphical model that was made by Rob Fergus, who gave a talk at this meeting, and it was from a grant application that he made with David Hogg, a collaborator of his at NYU; David Hogg is an astronomer. This is a model of the stuff in this node at the bottom, the pixels which have been observed in telescopes, and a lot of stuff went into deciding what hit the CCD: the orientation of the telescope, a whole load of calibration stuff, what's actually in the sky, the stars and the galaxies, and we have a whole load of detailed models of where those things came from. So I don't think we can really deny that this thing is a pretty deep model of the pixels in these images, and this is a model that we've genuinely learnt from just seeing photons come to us on Earth, or very close to Earth.
And we've constructed this whole thing, so some human learning has gone on here, and we've done this very impressive task. We're not at the stage where we could throw all of these pixels into a generic learning system and it would reproduce this thing unsupervised, but that doesn't mean that deep learning and feature learning aren't useful to these guys. If you want to fit parts of this model, you have a pretty hairy data pipeline, and neural networks are for sure useful in bits of that pipeline. But these guys, in this grant application, want to be pretty Bayesian about the whole thing: if you're a scientist, you don't want to just fit something, you really want to say responsibly what you can and cannot infer from your data, how sure you are about things; you want to do model comparison of theories. I think if you guys want to sell a neural network to people like this, then it really needs to be outputting probabilities, and you need to be able to calibrate those.

OK, so that's my spiel about why I like probabilities; we can argue about that later. One of the things that we want to do is just draw samples from models and look at them, as in the Kinect problem, so I'm going to start with that, and later on I'll get on to the fancy methods that we'll need if we're dealing with things like Boltzmann machines. If you've got a model of a pose that says you first pick your shoulder somewhere and then arrange the joints according to some distributions, then it's usually pretty easy to just simulate from a feed-forward model like that. Here's a book that's online that tells you how to sample from all sorts of standard probability distributions, a very nice book, but the reality is that most of you will never need to read it, because all the stuff in books like this has been implemented in SciPy and the other libraries you use: if you want a sample from some standard probability distribution, you just call a library routine and it will do it. I'm going to explain a little bit about how some of those library routines work, because we'll need to understand it.

If you've got a real number and some distribution over that real number then, as most of you will already know, a way to sample from that distribution is to draw points uniformly at random in the white area underneath the blue curve, and then once you've got those points you just read off the locations, and those are the samples. This is sort of obviously correct, because the probability of being in some little region is proportional to the height. So how do you draw points uniformly under a curve when the curve is complicated and you don't know how to do it analytically? There's this thing called rejection sampling, where you get some other distribution, in red, that you do know how to sample from, and you scale it up so that it's always above the distribution you want to sample from. Because you've got a simple distribution you know how to sample from, you can sample lots of points uniformly in the area underneath the red curve; you then just throw away all the points that don't happen to lie underneath the blue curve, and what's left, the black dots, are uniformly drawn under the blue curve: you can read off your locations and you've got samples. So rejection sampling, whether you know it or not, you've used, because, and I didn't really believe this when I first saw it, the fastest way to draw from a Gaussian distribution is a very specialized rejection sampler, and that's what MATLAB uses: if you've ever called randn in MATLAB, you've been doing rejection sampling.
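As a concrete illustration (not from the talk, whose demos are in MATLAB), here is a minimal rejection sampler in Python. The target and proposal densities are made-up examples; the only requirement is a constant k with k·q(x) ≥ p(x) everywhere:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def target_pdf(x):                       # "blue curve": a two-bump mixture
    return 0.5 * gauss_pdf(x, -1, 0.5) + 0.5 * gauss_pdf(x, 1, 0.5)

def proposal_pdf(x):                     # "red curve": a wide Gaussian we can sample
    return gauss_pdf(x, 0, 2.0)

k = 5.0                                  # chosen so k * proposal_pdf >= target_pdf

samples = []
while len(samples) < 1000:
    x = 2.0 * rng.normal()               # point drawn under the scaled red curve...
    u = rng.uniform(0, k * proposal_pdf(x))
    if u < target_pdf(x):                # ...kept only if it lies under the blue curve
        samples.append(x)
```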
OK, so once you've got your samples, however you've got them, you might just be looking at them and saying, aren't they pretty, but often the reason we're doing Monte Carlo is to compute integrals. You've got a rich object x, which could be the pose of a person, or an image, or something, and if you're interested in some function f of that object, then you might be interested in integrals involving that function. For example, what I've drawn on the slide: if the summary I was interested in is the distance between my left thumb and my nose, and I wanted to know the average of that distance, then that's just some statistic I could measure from my model, which I could use to see if the model fits well. The average distance between my nose and my left thumb is this integral; it's an expectation. If you wanted to solve that integral numerically, it would be a bit of a pain: x is a vector describing my whole pose, that's a lot of numbers, and if you wanted to grid up that space and use quadrature to solve the integral, it won't scale well, it would use a lot of time, and for high-dimensional problems like poses it would definitely be impossible. But by Monte Carlo this integral is easy: all you do is draw a bunch of random poses, maybe a hundred, look empirically in those hundred samples at how far my thumb was from my nose, and take the average. This simple Monte Carlo procedure is unbiased, and it's clear that if I draw a bunch of samples I'll get a pretty good estimate of this thing quite quickly: the function is bounded, because there's only so far I can stretch my thumb away from my nose, and the variance of the estimate falls with the number of samples. So for this sort of problem Monte Carlo is a really obvious way to go.

There's one other neat thing about Monte Carlo, which seems kind of obvious, but I'm going to mention it a couple of times, so I'll go through it slowly. This function, the distance from my thumb to my nose, is actually a function of only a small part of the pose: it doesn't matter where my legs are, that doesn't change the distance between my thumb and my nose. So the integral I'm interested in is an integral over the subset of the variables that are actually relevant to the function. If I were doing numerical integration, this would make things easier, because it's now a lower-dimensional integral, except that we don't necessarily know the marginal probability, the probability of a subset of the variables. If you look at the whole literature on approximate inference, much of it is about how, given a model, we compute marginals, because that's a really hard task. So marginalization is often seen as difficult, but in the Monte Carlo world it's a non-issue: what you do is sample the whole pose anyway, even though you don't need it all, and then you just throw away the variables you don't need. Marginalization in Monte Carlo is throwing variables away; in any other world, in deterministic inference, it's a hard thing to do.

Right, so this is the standard Monte Carlo setup, and it's nice if you know how to sample from the model: if I'm interested in properties of a simple feed-forward model, I know I can just sample from P. If I can't do that, I need a trick called importance sampling. I've got this integral I'm interested in, and I can't sample from P, so what I do is rewrite it as an integral involving a different distribution Q. Q could be something like a Gaussian, something standard I can sample from.
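To make the shape of that computation concrete, here is a toy version in Python; the "pose model" and the thumb/nose coordinates are hypothetical stand-ins, and the point is the two-line structure of simple Monte Carlo, plus marginalization-by-ignoring:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pose():
    # Stand-in for a feed-forward pose model: 20 numbers describe a "pose".
    return rng.normal(size=20)

def thumb_nose_distance(pose):
    # f(x) only touches part of x (say, thumb and nose positions);
    # the "legs" (the rest of the vector) are simply ignored --
    # marginalization in Monte Carlo is just throwing variables away.
    thumb, nose = pose[:3], pose[3:6]
    return np.linalg.norm(thumb - nose)

poses = [sample_pose() for _ in range(100)]
estimate = np.mean([thumb_nose_distance(x) for x in poses])
print(estimate)   # unbiased estimate of E[f(x)]; error falls like 1/sqrt(N)
```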
I've divided and multiplied by Q, so I've just multiplied by one, and as long as I didn't divide by zero, this is something I can always do. Once I've rewritten the integral, this is an identity, and it's just an expectation under Q, so it's something I can apply Monte Carlo to: I can draw samples from Q, and I got to pick Q, so I can definitely do that, and then I've got the standard Monte Carlo estimate. Now, instead of computing the average value of my function, I've weighted the values by how important they are. Where in rejection sampling, if Q is much larger than P, I reject the sample, here I just down-weight it: that sample is less important, because I'm over-sampling it, Q is putting too much mass there.

Naively, importance sampling looks as if it solves all our problems, because you can do this to any integral; in fact, if you look at these equations, there doesn't really need to be a distribution P there at all: I could do this to any integral, just multiply and divide by Q and turn it into an expectation. The estimate is unbiased, and it has the same property that as I draw more samples the error bars get tighter. So it looks good. But if you were to sit down and implement it, you'd hit a couple of problems immediately. One of them is that you might not be able to evaluate P here, so you might not be able to compute this equation: if you've got something like a Boltzmann machine, or the posterior in a Bayesian inference problem, you only know this thing up to a constant, and so you don't actually know how to compute that equation. So there's a version of importance sampling that can deal with unnormalized distributions. You say: I know what P* is, but I don't know the Z that would turn it into a distribution. In this method, in what's now a messy-looking equation for importance sampling, I've pulled out the normalizing constants, and you approximate those separately using the same samples that you drew. It turns out that when you go through the math, you get the unnormalized weights, which just ignore the normalizing constant you don't know; you normalize them so they add up to one, and everything drops out. You get an estimator which is consistent: when you draw enough samples, you get the right answer. I don't want to go through this line by line, because that's no fun, but if you haven't seen importance sampling before, your homework exercise, which I really strongly recommend you do, is to prove that a good estimator of the normalizing constant of P is just the average of the importance weights. If you can prove this fact at the bottom of the slide, you'll know everything there is to know about simple importance sampling; so if you haven't seen it before, do do that, and if you can't, come and find me.

OK, so the simple Monte Carlo setup is nice, and if we can't sample from P we can rewrite things so that we sample from Q instead, but when we turn to things like Boltzmann machines it's not obvious how to proceed. There is a rejection sampling algorithm for RBMs, and it was explained in Geoff's 2002 paper on products of experts; if you haven't read that paper, "Training Products of Experts by Minimizing Contrastive Divergence", I highly recommend it, it's very readable, and it actually had this idea of stacking Boltzmann machines in it, back in the tech report in 2000. There's some code on my website to reproduce a figure from that paper, which I also recommend you go through. OK, so back to sampling.
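Here is a small Python sketch of that unnormalized scheme (my own toy example, not the talk's): the target is an unnormalized Gaussian, so we can check both the self-normalized estimate and the homework fact that the mean importance weight estimates Z_P:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    # Unnormalized target P*(x) = exp(-x^2/2); its true Z_P is sqrt(2*pi).
    return np.exp(-0.5 * x**2)

def q_pdf(x, s=2.0):
    # Normalized proposal Q: a wider Gaussian that we know how to sample.
    return np.exp(-0.5 * (x / s)**2) / (s * np.sqrt(2 * np.pi))

N = 100_000
x = 2.0 * rng.normal(size=N)             # samples from Q
w = p_star(x) / q_pdf(x)                 # unnormalized importance weights

f = x**2                                 # any statistic f(x); E_P[x^2] = 1 here
estimate = np.sum(w * f) / np.sum(w)     # self-normalized importance estimate
Z_estimate = np.mean(w)                  # homework fact: mean weight -> Z_P
print(estimate, Z_estimate, np.sqrt(2 * np.pi))
```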
The rejection sampling algorithm for RBMs goes like this. You've got a product of experts, where each hidden unit is an expert, which is a mixture between two distributions over binary images, and there's an expert for the visible biases; the mixture is between the uniform distribution and a specialized distribution. So this is an expert from a hidden unit that decided to draw a uniformly drawn image, which is random noise; this expert decided to draw an image from its specialized distribution, which is something blobby; this is some other specialized distribution; and these experts drew uniform images. The rejection sampling algorithm is: you do this procedure of drawing a bunch of images, one per expert, and if they all happen to be identical, you can accept that image as a draw from the model. That doesn't happen very often: you're going to wait quite a long time before all of those images are coincidentally a nicely hand-drawn four, this was trained on MNIST. So it's a nice idea, and it does help with the description in the paper, but it's very hard to come up with a rejection sampling algorithm that ever actually accepts for high-dimensional images.

Here's a little toy example that begins to explain what your problems in high dimensions are. This is now real-valued: suppose the distribution you're interested in is Gaussian, or close to Gaussian, and you can't sample from it directly, so you want to use a proxy for it, Q, which in this example is another Gaussian with a different width. You could ask: how well would importance sampling work, how well would rejection sampling work, if instead of sampling from P you sample from Q? It turns out that in rejection sampling the acceptance rate disappears exponentially quickly with dimension, so rejection sampling just doesn't work in high dimensions unless P and Q are basically exactly the same. In importance sampling you don't reject any samples, but you do reweight them with this importance sampling ratio, and in this example the variance of the importance weights blows up exponentially with dimensionality; it's possible, if you do things wrong, to even get infinite variance. So again, if you have things that are even close to Gaussian, these simple distributions, then unless your approximation Q is basically exactly right, which is never going to happen, these algorithms blow up.

As an aside: when I first made this slide, I did a quick calculation on the back of an envelope to work out this expression; the details of the expression don't matter, it's just the blow-up that matters. But if you're preparing slides for a talk, it's kind of embarrassing if your equations are wrong, and the first time I gave the talk I just had to fess up and say this might be wrong; the second time I gave the talk, I had the sense to actually check the expression using Monte Carlo, and so I'm pretty sure it's right. So this is another point: whatever you're doing, if you're doing maths, you can often check it with Monte Carlo, and you'd be stupid not to, even though it took me a couple of goes at giving these talks to remember.
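You can see this blow-up numerically with a few lines of Python (my own illustration): take P and Q to be spherical Gaussians whose widths differ slightly, and watch the effective sample size of the importance weights collapse as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, s):
    # Log-density of a spherical Gaussian with width s, evaluated row-wise.
    d = x.shape[-1]
    return -0.5 * np.sum((x / s)**2, axis=-1) - d * np.log(s * np.sqrt(2 * np.pi))

s_q = 1.2                                    # Q only slightly wider than P (width 1)
for d in (1, 10, 100, 1000):
    x = s_q * rng.normal(size=(100_000, d))  # samples from Q
    log_w = log_gauss(x, 1.0) - log_gauss(x, s_q)
    w = np.exp(log_w - log_w.max())          # rescaled weights, for stability
    ess = w.sum()**2 / (w**2).sum()          # effective sample size
    print(d, round(ess, 1))                  # ESS collapses as d grows
```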
OK, so I want to give you one more insight into importance sampling. This importance weight, remember, is an estimate of the normalizing constant of P; that's your homework problem, to work out why that's true. Often people say: well, the variance of an importance sampler can be infinite, or high, so you really can't say anything about the normalizing constant; and that's not really true. So let's look at what happens. We've got a distribution over the importance weight: this is a random algorithm, every time I draw a sample from Q I get a new importance weight, so these importance weights have some distribution. The lowest an importance weight can be is zero; they're positive quantities, because they're ratios of probabilities, and so these distributions are often skewed: when you have a firm lower bound it changes the shape of the distribution, it's not going to look Gaussian, and the mean of the distribution is often off to the right of the peak. But the mean of this distribution is the normalizing constant. So I want to estimate the location of this red line, and when I draw importance weights, these magenta crosses, they'll be scattered around, and on average their position will tend to the true answer, the mean. This is the nice situation in importance sampling, where you're looking for the average value of the importance weights, which tells you what the normalizing constant is.

Now, the problem is that this distribution goes out to infinity: the importance weights in many problems are potentially unbounded, and if I just add a little bit of mass to the tail of the distribution, I can put this red bar, the mean of the distribution, wherever I like. So we could have the situation on the bottom; this is plotted in MATLAB, it's a real distribution, and this is where the mean is; the distribution just carries on out to infinity. If you have a problem with this situation, where there's some probability that the weight is very large, then what will happen 99.999% of the time is that all your samples will be in the bulk of the distribution, and they'll all be lower than the value you're trying to estimate. So this is officially an unbiased estimator; the average value of those magenta crosses does tend to this red line, but only because very occasionally you'll get an enormous weight way over here somewhere. 99.999% of the time you are going to underestimate the normalizing constant. So saying that importance sampling is unbiased is kind of meaningless: often, if you're estimating normalizing constants, you're going to underestimate, and if you've got a whole bunch of different importance sampling estimators and they give you, say, three different numbers, you usually just want to go with the biggest one.
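This typical-underestimate effect is easy to reproduce; a sketch in Python (again my own toy numbers: a narrow Gaussian proposal under a wider unnormalized Gaussian target, so the weights are heavy-tailed):

```python
import numpy as np

rng = np.random.default_rng(0)
Z_true = np.sqrt(2 * np.pi)        # normalizer of p_star(x) = exp(-x^2/2)

def z_estimate(n, s=0.72):
    # Importance-sample Z with a too-narrow proposal Q = N(0, s^2):
    # the weight distribution is skewed with a heavy right tail.
    x = s * rng.normal(size=n)
    q = np.exp(-0.5 * (x / s)**2) / (s * Z_true)   # since s*sqrt(2*pi) = s*Z_true
    w = np.exp(-0.5 * x**2) / q
    return w.mean()

runs = np.array([z_estimate(20) for _ in range(10_000)])
print(runs.mean(), np.median(runs), Z_true)
# The average over all runs is close to Z_true (unbiased), but the
# median run comes out below the true value: typical runs underestimate.
```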
OK, so that's all there is to simple Monte Carlo, really: this idea that we want to draw samples, that they're useful to look at and to compute averages with, and that we can shoehorn Monte Carlo into any integral by doing importance sampling. There haven't been any questions that I've seen, but do stop me if that's not clear. And we've seen that this simple idea isn't going to work for Boltzmann machines, and it's not going to work for high-dimensional distributions unless we know how to decompose them into lots of low-dimensional distributions. That's why we need this thing MCMC, Markov chain Monte Carlo. Here are the rules of the game: we're going to have a distribution on x, which is some rich high-dimensional object like an image, and we're going to know the energy, or the log probability, up to a constant. So we know this thing, but we don't know the normalizing constant; maybe we'll get to, in the next lecture, how we could estimate that offline, but it's not something we can use within the inner loop of an algorithm. And we want to draw fantasy images from this model.

As a running example, we could be drawing an object like this, an image patch, and we could have a model that says we prefer images where things look smooth. The reason that rejection sampling and importance sampling are doomed is that they throw down random images and ask, does this random image look good? A lot of the time they'll propose images that upset this model, and eventually you might find an image like this that's good, but then you just carry on generating random junk again. Anyone sensible would say: I found a good image, maybe I should learn from that, and use something about the good image I eventually found to work out what good samples look like. So an idea is that once you've got an image that's a reasonable candidate under your model, you could look at lots of little local perturbations of that image, and maybe a bunch of those will be plausible under the model too. You could construct some Q which doesn't draw independent images, but is conditioned on a known good image, and generates new images x'. So that's an idea: this distribution might be useful, but you can't use it inside importance sampling, because it puts a load of mass on some good images but very low mass on a lot of other good images, ones that are good under the model but a long way away from the one we found; and if you have Qs that put low probability on good images, you're going to get very high variance.

So we need another idea, and the other idea is Markov chain Monte Carlo. What we're going to do is construct a random walk, or a Markov chain, where we start with an image and change it slightly to get another image; we change that slightly to get another image; we keep going, and after many steps you might have moved quite a long way, you'll get a very different looking image, but it happened through tiny changes, so you didn't have to understand a lot about what the model likes: you just needed to make small enough changes that there was some chance of finding good ones. This process is a Markov chain because it's defined by a transition operator, where the way you generate the image at time t only depends on the image at the previous step; that's what Markov means. The property we're looking for, the goal of this algorithm, is that if we simulate this chain for a long time, and I'll say how we're going to do this in a minute, then the marginal probability over one of the images, so I run this process and just look at what I got at the end, the probability distribution over where I end up, should be close to the target model I'm interested in. That's what we're trying to achieve, and there's a consistency condition we're going to have to satisfy if we have any hope of having an algorithm with this flavor. We want all of the images, after a long time, to each look as if they come from the model. So let's say I run this process for a long time and I get this image in the top right: if, marginally, that image's distribution is the model I care about, then if I take one more step of my Markov chain Monte Carlo algorithm, I don't want to wander off somewhere crazy; I want the next image also to be a good image from my model. Mathematically, the condition is: if you draw a sample x and take one step to x', then, averaged over what the original image was, the distribution over the next image should be your target distribution P. This is really the main equation of all of MCMC: your methods have to satisfy this condition, and it turns out you don't have to do much more than this.
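In symbols (my rendering, using the arrow notation on the slides), the condition is that the target distribution is left invariant, or stationary, under the transition operator:

```latex
p(x') \;=\; \sum_{x} T(x' \leftarrow x)\, p(x)
\qquad\text{(an integral rather than a sum for continuous $x$)}
```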
There is a trivial way of satisfying this condition, and that's if this thing T is the do-nothing operator. If Marc'Aurelio works really hard to generate a sample from his model and gives it to me, I could move it around and try to get a different image, but I could also just copy it and give it back to him and say, here's a sample from your model; and that's true, it's just not very useful: I haven't done anything extra, I haven't told him anything he didn't know. So if this is the identity, where x' is equal to x, then this equation is satisfied; but you need transition operators that actually change the images, with some hope of exploring the whole image space. I don't want to get too mathematical, so I'll just loosely say that, roughly, that's all you need: as long as your transition operators do move around a bit, and can eventually explore the whole image space, and you satisfy this one equation, then you satisfy the goal I've been talking about. You've got a process you can simulate, and when you pluck an image off the end of it, it will roughly be from your distribution.

OK, so I've got a little cartoon I can talk about. It's very hard to sketch realistic cartoons of high-dimensional spaces, and however you do it, it's going to be slightly wrong, but this is my version, which tries to capture the fact that in high dimensions things look kind of spiky: there are many different dimensions you can wander off into and find corners of your model. So I've drawn something spiky in 2D, and I can start my Markov chain somewhere and run it along; the process can hop over the gaps in my density, and I'll end up somewhere, this red blob, and my hope is that this red blob is a sample from my distribution. Now, if this process were as long as I've drawn it, and I showed you just this red blob, you wouldn't be able to say where I started if you didn't see this trace. If you did a little inference problem, where did I start this Markov chain, and I told you I'd run it for a few hundred steps, you wouldn't know where it came from. And that's the intuition behind where the proofs that this method is valid come from: if you started at an image drawn from the model, then, because of the consistency condition, the final position would also be drawn from the model; but because it looks as if it doesn't matter where you start, because you forget where you started, you can actually initialize these Markov chains arbitrarily, as long as you run them in for long enough.

There's one other intuition I want to pull out from this figure. I started at this point in the top middle of the diagram, and I used some random numbers to simulate this process, and ended up here. If I started here again but used a different random seed, different random numbers, I would wander around and end up somewhere else: I'd probably end up down here, or maybe over here. As long as the process is long enough, the distribution over where I end up would be spread over the support of this distribution; it would be my target distribution. So I'm claiming that in this cartoon, the Markov chain I've drawn is long enough to pretty closely draw a sample from my target distribution.
And what I want you to notice is what I haven't done: when I simulated this process, I haven't wandered down to the end and back of every single little frond of this distribution, I haven't gone over the entire support of the distribution or visited every mode. It's a common misconception that there's no way you can possibly sample from a distribution unless you've had time to visit all the modes, and that's not true at all; you just need there to be some possibility that, with some random numbers, you would visit a given mode. I didn't visit this mode down here, but you can imagine that on another walk of this length I might have done. And if I really did have to enumerate the whole state space in the one walk I chose to do, there'd be no point in Monte Carlo: if you could enumerate the whole state space, you could just solve the integrals directly. So this is one important point: you're not going to visit every point in your model when you do MCMC, which is fortunate, because if you have something like a Boltzmann machine, there's no way you're going to do that.

OK, so the big picture again: we can initialize arbitrarily, and we run some random walk. After the first hundred steps I've stopped showing the actual trajectory and just used blue dots to show where the Markov chain visited, and those points we're going to use as samples from the distribution. Those points are dependent; they're strongly correlated, for sure, because adjacent ones in the time series are right next to each other, but it turns out that doesn't matter in terms of Monte Carlo: you can throw all of these samples, without bothering to thin them out, into the standard Monte Carlo estimator of an integral, and that's a valid thing to do; your estimator will be consistent and tend to the right answer. It can be hard to compute what the error bars should be without understanding the dependence in the chain, but even though these samples are dependent, you can use them in simple Monte Carlo. So again, stop me if the big picture isn't clear, but the question now is where we get these T distributions from: this is what we want to happen, but what's an algorithm we can write down that actually satisfies these properties?

The one you'll definitely have seen in the last couple of weeks is Gibbs sampling. This is the one where you've got this high-dimensional object and you go to one of the variables; you can go through the variables in order, or pick them at random, it's valid either way, though I would scan through them systematically, and that will often work better. You take that variable, throw away its value, and resample it: in an image, you take one of the pixels, chuck away the value, and resample a plausible pixel given the neighborhood; or in a continuous-valued problem, you do an update along one of the axes at a time, resampling from the conditional distribution along that direction. So that defines a transition operator T: there's a delta function saying that almost all the components stay the same, and then there's this conditional probability of the one component changing to whatever it changed to.
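A minimal Gibbs sampler in Python, for a toy target where the conditionals are available exactly (a two-variable Gaussian with correlation rho; my example, not the talk's):

```python
import numpy as np

rng = np.random.default_rng(0)

rho = 0.95                              # target: zero-mean Gaussian, unit variances,
x = np.zeros(2)                         # correlation rho between the two variables
samples = []
for t in range(10_000):
    for i in (0, 1):                    # systematic scan over the components
        j = 1 - i
        cond_mean = rho * x[j]          # p(x_i | x_j) is a 1-D Gaussian
        cond_std = np.sqrt(1 - rho**2)
        x[i] = rng.normal(cond_mean, cond_std)   # resample just this component
    samples.append(x.copy())

print(np.cov(np.array(samples).T))      # approaches [[1, rho], [rho, 1]]
```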
You could do a bunch of math to try to show that this distribution T satisfies the properties I stated, but the thing is kind of obviously correct, so let's spin a story to see why Gibbs sampling works. Forget we're doing MCMC for a minute, and just imagine I want to sample from a distribution P(x). Any distribution P(x) I can split into a product; this is the product rule: the marginal of all but one of the components, times the conditional of the remaining component given that choice. This factorization implies that a way of simulating from the distribution is to first draw from the marginal, drawing all but one of the pixels somehow; then, once we've done that difficult feat, we've only got one pixel left to draw, which we can draw from its conditional distribution, and that bit's easy. If we can do all that, we've got a valid sample; the question is how to implement the difficult step of getting all but one of the pixels. Well, I'm going to draw the image with one pixel missing by asking Marc'Aurelio to very generously sample an entire image for me, and then I'm going to throw away the pixel I don't want. We've already discussed that marginalization in Monte Carlo is trivial: you just throw away what you don't want. So if I have an oracle that can give me a whole image, I can use it to get a sample from this marginal, and then I can draw pixel i myself. What we've seen is that if Marc'Aurelio gives me an image from the distribution, I can come up with a slightly different image which is clearly also from the distribution; and those images are clearly very dependent on each other, all but one of the pixels are exactly the same. But now I could take the image this process generated, give it to someone else, and they could do the same thing with a different pixel j, and they'd get a third image that's slightly different but also from the right distribution. So no complicated maths is required to prove Gibbs sampling is right.

OK, I do want to do a little bit of math now, to give some intuition into what other Markov chain Monte Carlo algorithms we could possibly imagine. Gibbs sampling is a very popular one, it's very natural, but what other algorithms are there? To get at that, I'm just going to play around a bit with the maths of the Markov chains and see what we can prove is definitely necessary and sufficient for these things to work. So I want us to imagine playing a little inference game. I'm going to draw an image from P, then take that image, apply one step of Markov chain Monte Carlo, and generate a new image x'. I'm going to give you x', not tell you what image I started with, and ask you to think about what image I might have started with: you've got this image, one Markov chain step away from the one I started with, and you've got to guess what I started with. The way to express your beliefs is Bayes' rule: my beliefs about what image you started with, given the image you've just shown me, are proportional to my prior beliefs about images times the likelihood, because I know the corruption process you used on the image. So this is Bayes' rule, a conditional belief about where you came from given where you are now, and you normalize probabilities in Bayes' rule by summing over the variable you're interested in; and this sum here is the stationarity condition, so we know what that sum is. Now, this little construction I can always do, for any valid transition operator, which means this condition has to hold for some distribution R; and I can make it symmetric by multiplying both sides by this probability, so the symmetric version is on the bottom.
So this balance condition I've written down has to hold, for every pair of images, for some distribution R, if you're claiming that your Markov chain Monte Carlo algorithm T is valid. That's two too many equations for my liking, so let's draw a picture. What we're saying is: if you simulate a very long Markov chain using your Markov chain Monte Carlo algorithm, pluck out a pair of adjacent states at random, and look at the joint probability of generating the first state and then transitioning to the second, then there is some reverse process R that, with the same probability, would generate exactly the same sequence in the opposite order; if you pluck out a random pair from that reversed sequence, you'd see it with the same probability. So this has to be true for your algorithm: you can't come up with a Markov chain Monte Carlo algorithm that doesn't have this property. It's always possible, if you're running a Markov chain, to imagine undoing it and stepping backwards. A very common idea, when people first see Monte Carlo, is to come up with really clever moves, maybe using optimizers: they say, oh well, I understand how to come up with really good images under an MRF, so I'm going to run an optimizer to move to a good image. But it's often very hard to undo the effect of that: if you've gone from some rubbish image and used a sophisticated optimizer to go right to the mode of the distribution, you sort of forget where you've come from, and you can't plausibly imagine a process R that would take you back; and then there's no way you'll ever be able to prove that the algorithm is valid. So this idea of reversibility is very core to MCMC. In fact, this condition is not just necessary: if you can show it holds for all pairs of images, it's sufficient to prove the stationarity condition. If I just sum this equation on both sides over x, out drops the one condition I actually need, at the bottom of the slide.
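Written out, the balance condition looks like this (my rendering of the slide's equations):

```latex
p(x)\, T(x' \leftarrow x) \;=\; p(x')\, R(x \leftarrow x')
\quad\text{for some valid reverse operator } R.
```

Summing both sides over x recovers the stationarity condition, since R's probabilities over x sum to one.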
OK, so fortunately there's a very generic way of coming up with new algorithms that satisfy a version of this balance condition. If R is actually equal to T, the condition is called detailed balance, which many of you will have heard of. Detailed balance isn't a necessary condition, but it's a convenient one: the process is its own inverse. The algorithm, again, that most of you will have heard of that satisfies detailed balance is Metropolis-Hastings. Here you start with an arbitrary mutation process, an arbitrary way of mutating or corrupting an image x into a new image x', by maybe moving a few of the pixels or doing something clever. That Q, that proposal, won't satisfy detailed balance: if you were to simulate a Markov chain using Q alone, it might wander off to infinity, it might not do anything useful. What Metropolis-Hastings does is just reject some of those steps, it stops some of those moves from happening, and that's enough to make the process satisfy detailed balance and sample from the correct distribution. In the cartoon here, the green walk is our Metropolis-Hastings Markov chain, and at each point Q is just a Gaussian distribution centered on the current point, so it tries to do a Gaussian diffusion; the little red offshoots are places this Q distribution tried to go, where the Metropolis-Hastings algorithm said, no, don't go there, that doesn't look good, and so the chain stayed where it was for a step and then carried on moving. So the transition operator under Metropolis-Hastings, the probability of going from an image x to a new image x', is the probability of proposing that particular move, multiplied by the probability of accepting the move. This min(1, ratio) is the core bit of the algorithm; it's the probability that you accept a move: if the new image looks good, if it has high probability, you're more likely to accept the move, but if this is the sort of proposal you make all the time, then maybe you should down-weight making this move. This is the transition operator if the images are different; if the images are the same, the expression is more complicated, but it doesn't matter: staying in the same place is always a valid option in a Markov chain Monte Carlo operator, so you don't have to worry about it.

I've put into the slide set, which you can download, more text that goes through this in detail and proves the detailed balance condition, and I've also put in complete MATLAB code, so you can read through that for yourself; I'm not going to go through it line by line. What I want to show you is the results of running this code. You can apply it to any distribution; I'm going to apply it to just a unit Gaussian. I've passed the code a function handle saying this is my energy, it's a unit Gaussian distribution, and then I'm plotting, against time or iteration number, where the Markov chain goes. This Metropolis algorithm has a choice of what the Q is, what I'm going to use to perturb the current state, and I can set the width of that proposal, a step size. If I set the step size to one, which is the same width as the distribution, I make pretty rapid progress, and if I were to crush this time series into a histogram and look at it, I'd get a nice bell curve: I'd be sampling from the distribution. I am rejecting a bunch of the moves, about a third, and that's because my Q isn't the same as the target distribution: if I were sampling from the target distribution, I'd be sampling from a unit Gaussian centered at the origin, but I'm sampling from a unit Gaussian centered wherever I currently am in my Markov chain, which means I'm sampling from the wrong distribution, and I reject some of the moves. If I were to propose from a much wider distribution, so my Q now has a width of a hundred, that's what the bottom plot shows, then most of my proposals would just be in crazy places. The algorithm is still valid, it will eventually give me the right answer, but most of the moves get rejected, because the algorithm doesn't want me to wander off to infinity, and so in this time series of a hundred steps I only moved five times. Although the algorithm is valid, it's clearly working less well: if I plot a histogram of these locations, I'm not going to get a nice bell curve. Now, if rejections are bad, and I got more rejections when I set a really wide width for my proposal, it's tempting to say: well, I should just set a really narrow perturbation; if I don't move my current state very much, then I'm not going to mess it up, and the move is more likely to be accepted. The top plot shows the trajectory if I use a very small step size, and now I'm only rejecting two moves in the entire time series, and that's a disaster. In rejection sampling you want to accept all the time; in Markov chain Monte Carlo and Metropolis, the only way you're seeing the target distribution is by seeing rejections; otherwise you're just sampling from an arbitrary stochastic process defined by Q. This trajectory is almost indistinguishable from a Brownian diffusion with step size 0.1.
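The talk's demo code is MATLAB; here is an equivalent sketch in Python, a random-walk Metropolis sampler applied to a unit Gaussian energy, so you can reproduce the step-size behavior just described:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_p, x0, step_size, n_steps):
    """Random-walk Metropolis with a Gaussian proposal of the given width."""
    x, lp = x0, log_p(x0)
    chain, n_accept = [], 0
    for _ in range(n_steps):
        x_prop = x + step_size * rng.normal()      # symmetric Gaussian proposal
        lp_prop = log_p(x_prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept with prob min(1, p'/p)
            x, lp = x_prop, lp_prop
            n_accept += 1
        chain.append(x)                            # on rejection, x is repeated
    return np.array(chain), n_accept / n_steps

log_p = lambda x: -0.5 * x**2        # unit Gaussian, up to the unknown constant
for step in (0.1, 1.0, 100.0):
    chain, rate = metropolis(log_p, 0.0, step, 1000)
    print(step, rate)   # tiny steps accept nearly always but diffuse slowly;
                        # huge steps almost always reject; ~1.0 works well here
```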
So you have to have the rejections in the algorithm to actually sample from the distribution you're interested in. David MacKay's information theory book argues that, because the rejections are what's giving you information about the target distribution, you want to see rejections about half the time, to maximize the information. That argument is slightly dodgy and heuristic; the correct answer in this particular situation is that the optimal acceptance rate is 0.234 or something, but it doesn't really matter: this thing is valid for any rejection rate, and if you reject about half your samples, the thing is probably working reasonably well.

OK, so this was in 1D, and of course we're interested in sampling from high-dimensional distributions, so let's look at what Metropolis does in higher dimensions, or two dimensions. Here P, this long ellipse, is my target distribution, and I don't know the sensible direction to move in, which is along the length of the ellipse, so I'm just perturbing my current state with a dumb spherical perturbation, because I don't have enough of a handle on the problem to make more intelligent moves, and I can set some width for that spherical distribution. If I set the width really large, I'm going to propose off the manifold of the distribution, and so I'll reject a lot. Now here's another example of low-dimensional pictures being misleading: in 2D, the right answer actually is to make the spherical distribution Q large, because occasionally you'll make a big hop to the other end of the distribution, and that works well; but in high dimensions there are many, many directions you can go in off this manifold that will be wrong, and if you set this distribution wide, the algorithm won't work at all. So you're constrained to set the width of Q towards the narrower dimensions of your target distribution; and again, if you set the width too small, you're just doing a Gaussian diffusion that has nothing to do with the distribution. If you set it roughly as I've drawn it, you're going to do a slow diffusion along this direction: you have to travel a distance L in steps of width epsilon, and this very painful random-walk diffusion takes a number of steps quadratic in that ratio. So when you've got strong correlations that you don't understand, and you're not able to bake them into your proposal mechanism, this diffusive behavior, just taking a random walk and accepting or rejecting steps, means things work very slowly. If this sort of algorithm had been used for big Bayesian neural networks in the '90s, Bayesian neural networks wouldn't have worked; I'm going to get on to the fancier versions that actually work for big neural networks, possibly after the break.

OK, so what's the strategy for MCMC? It looks like we want good proposals: if we don't get our proposals right, we're setting ourselves up for a very slow walk, so potentially you want to be really clever about how you come up with your proposal; you could imagine approximating your distribution using some fancy technology to come up with Q. And there's something really nice about MCMC, which is that you don't have to get lucky and come up with one good approximation: you can combine many transition operators, and the performance of all of them put together is often better than any of them individually.
If I start at some image and use transition operator A, then where I end up will be drawn from the right distribution if the starting image was; if I then use a different transition operator from that image to get a second image, that one will also be drawn from the right distribution, and by induction all of these images, even though I'm using a different transition operator each time, will come from the right distribution, as long as they're all valid transition operators. Some of them could suck; some could be the do-nothing operator that just copies; but that doesn't introduce any problems into the algorithm. Whereas if you try to combine importance samplers and one of them has infinite variance, your combination has infinite variance too. So the great thing about MCMC is that if you've got a whole bunch of different ideas, you can throw them all into the bag, and hopefully one of them will work well for the particular bit of the state space you're in at the moment.

Right, so when I first learned about MCMC I got a bit depressed, because I thought the only thing in MCMC research would be coming up with good proposals Q, and that sounded like a fairly boring task, where I'd have to look at each new model, come up with approximations, come up with Q, and that would only ever be one algorithm. So the next bit of the talk, and some of this will have to go after the break, is about how we shift the goalposts to make MCMC research a bit more interesting. It looks like, if you want to satisfy detailed balance, Metropolis is basically the only game in town; but what you can do is change which distribution you're sampling from, and that's the idea behind auxiliary variable methods. So again, the point of MCMC, what we're trying to do, is approximate integrals. This is the integral I'm interested in, and I can rewrite it by introducing some irrelevant variables V and then integrating them out again; here X and V can depend on each other in some arbitrary way, but as long as the marginal of their joint distribution is this distribution, this is an identity. Given this equation, it's tempting to just say: well, I can see that V drops out and is irrelevant, so I should go back to where I started. The stupid-sounding idea in auxiliary variable methods is to say: no, I'm going to take this equation seriously, draw both X and V even though I don't need V, then throw V away and compute my Monte Carlo estimate. So what we're looking for are joint distributions over X and V where, for some reason, it's easier to draw both X and V than just X.

You've seen some examples of this. If you've got a restricted Boltzmann machine, which has visible pixels and hidden units, it's possible to analytically marginalize out all the hiddens, so you could run Markov chain Monte Carlo just on the pixels, in image space, doing single-site Gibbs updates, and that works. But most people don't do that; most people don't integrate out the hiddens, because it's very easy to sample all of the visible pixels given the hiddens, and all of the hiddens given the pixels, and so it's more convenient to define a sampler that way. And there's a nice paper by James Martens and Ilya Sutskever at AISTATS 2010 where they take distributions that weren't RBMs and turn them into RBMs by introducing more variables, just so that the sampling is more convenient.
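A sketch of that block-Gibbs sampler for a binary RBM in Python (random, made-up parameters; in practice W and the biases would come from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_vis, n_hid = 20, 10
W = 0.1 * rng.normal(size=(n_vis, n_hid))    # hypothetical weights and biases
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)

v = (rng.uniform(size=n_vis) < 0.5).astype(float)   # arbitrary start
for t in range(1000):
    # Alternate the two easy conditionals: all hiddens given visibles,
    # then all visibles given hiddens.
    h = (rng.uniform(size=n_hid) < sigmoid(v @ W + b_hid)).astype(float)
    v = (rng.uniform(size=n_vis) < sigmoid(W @ h + b_vis)).astype(float)

# v is (approximately) a fantasy from the RBM's marginal over visibles;
# the h's are auxiliary variables we simply throw away at the end.
```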
What I'm going to go through now are a couple of examples of places where people have introduced extra variables to make navigation easier. The first real use of this trick was by Swendsen and Wang in the late '80s, and it's an algorithm for Ising models, for spin systems. I've just drawn it differently, but this is an image of binary pixels, or spins, and they introduced these extra variables, the black bonds, which sit on the edges where there are weights in the model. These bonds are very easy to draw: they're drawn independently, and they can only be put down where the colors are the same, so it's not possible to put a bond joining a blue and a red dot; but you don't put bonds down in every place you possibly could, there's some rule for that. After the bonds have been drawn, you find the connected components: here's a connected component, all of these blue things are connected through some route between them, and for each connected component the algorithm independently chooses whether to recolor it or not. So this blue connected component at the top stayed blue in the final image, but this whole red connected component in the bottom left changed color all at once: the algorithm is able to flip all of those spins together. If we zoom out, there's this very large sea of blue in the top right of the original image, and that's turned into a very large sea of red in the final image. So one iteration of this algorithm makes these massive changes to the image, and the computational cost is similar, slightly more but similar, to sweeping over the pixels once with Gibbs sampling; it would probably take hundreds of iterations of Gibbs sampling to make moves as dramatic as this. Swendsen-Wang in the '80s was this amazing new piece of technology that really shook up what was possible to do with Ising models, and there's code for it on my website.

But I'll tell you now that it doesn't work at all on images modeled with restricted Boltzmann machines. Boltzmann machines are just Ising models, so the algorithm applies, but how well it actually performs depends a lot on the details of what your weights are. What happens when you apply Boltzmann machines to images is that the algorithm forms these connected components, and it says: I can either flip all the pixels in this component or leave them as they are; and it forms very large components. Given that we're modeling images, it's very unnatural to take a whole chunk of an image and just invert it, so it decides not to. So when you apply Swendsen-Wang to real image models, it just doesn't move at all: it becomes the do-nothing operator, which is a valid algorithm, and completely useless. So why did I go through this if it's not useful to you guys? It's a bit of a warning to be careful when talking to physicists: they're very clever guys, but they're often interested in completely different systems from yours, and the systems might mathematically be the same, but in practice they're completely different.

Fortunately, do you mind if I go like three minutes over? OK. Fortunately, there's a bit of technology from physics which is actually really useful to us; it was used in Bayesian neural networks in the '90s by Radford Neal, and it continues to be useful in energy-based models and beyond, and that's Hamiltonian Monte Carlo, originally called hybrid Monte Carlo. Again, it's an auxiliary variable method: we have a joint distribution over X, the variables we actually care about, and V, which are completely irrelevant, we're just sampling them anyway for the sake of it. In this distribution the X's and the V's are actually independent, so there's a product of two terms, one for the X's and one for the V's.
The first term is the distribution we care about, we put our energy function in there, and the second term is just a Gaussian distribution on the V's. So all we've done is add some extra independent, irrelevant variables V, which come from about the most boring distribution on real variables you can think of: a spherical Gaussian. The sum of the two energies, an energy for the X's plus an energy for the V's, is called the Hamiltonian. The physical analogy people invoke is that E is like a potential energy: you've got a ball with position x which has some gravitational potential energy, some locations are at the top of hills and some are at the bottom of hills, and the ball also has a velocity v; the ball's total energy, gravitational potential energy plus kinetic energy, is the Hamiltonian. But the physics story doesn't necessarily matter: I'm free to write down this joint distribution, and I'm also free to apply any Markov chain Monte Carlo algorithm I like to it, then throw away the V's and keep the X's.

So what Markov chain Monte Carlo algorithms could I use? Well, Gibbs sampling is one of my favorites, so I could choose to Gibbs sample the velocity components; the V's are independent of the X's, so their conditional distribution is just a Gaussian, so I can take the velocity of my ball and just kick it, by resampling it from a Gaussian distribution. And then the second move is kind of funky: it's a Metropolis move, a proposal that you end up accepting or rejecting, but the proposal is deterministic, and it's the result of simulating some physics. You pretend this is a ball and you simulate where it would roll, a high-dimensional ball on a hilly surface, and that changes both the position and the velocity: if you start off slowly at the top of a hill, you can roll down to the bottom and end up really fast, so you change both x and v, trading off kinetic energy against potential energy. But we know from physics that total energy is conserved, so the Hamiltonian stays the same, which means the probability of the final state is equal to the probability of the initial state. So under this ideal algorithm, which we can't actually implement, but if we could: energy is conserved, probability is conserved, and when you work out the acceptance probability, it's one. You do the simulation and just keep the state where it ends up; then you kick the ball by resampling its velocity, and you simulate again.
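In practice the dynamics are simulated with a discretized (leapfrog) integrator, which only approximately conserves energy, so a real implementation keeps the accept/reject step. A compact Python sketch, demonstrated on a toy Gaussian energy (this is the generic textbook scheme, not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def hmc_step(x, E, grad_E, eps=0.1, n_leap=20):
    """One HMC update: resample velocity, leapfrog, then accept/reject."""
    v = rng.normal(size=x.shape)               # Gibbs-resample the velocity
    x_new, v_new = x.copy(), v.copy()
    v_new -= 0.5 * eps * grad_E(x_new)         # leapfrog: half step in v,
    for _ in range(n_leap - 1):
        x_new += eps * v_new                   # ...full steps in x and v...
        v_new -= eps * grad_E(x_new)
    x_new += eps * v_new
    v_new -= 0.5 * eps * grad_E(x_new)         # ...final half step in v
    # H = E(x) + |v|^2/2 is only approximately conserved by the
    # discretization, so accept with probability min(1, exp(-dH)).
    dH = (E(x_new) + 0.5 * v_new @ v_new) - (E(x) + 0.5 * v @ v)
    return x_new if np.log(rng.uniform()) < -dH else x

E = lambda x: 0.5 * x @ x                      # unit Gaussian energy (toy target)
grad_E = lambda x: x
x = np.zeros(5)
samples = []
for _ in range(1000):
    x = hmc_step(x, E, grad_E)
    samples.append(x)
```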
Now, the details of exactly why this is all valid, and exactly how you would run it, are a bit involved; I could probably give a half-hour talk just on that. But if you're going to go away and read about this, there's one part of it which often seems very mysterious: you see these discussions saying that Hamiltonian dynamics preserve phase-space volume, and it's often not very clear what they're talking about and why they care. So I want to go through one toy example that explains why you have to be a bit careful when you do this sort of update.

OK, this is simpler than Hamiltonian dynamics. I've got a real-valued variable that takes on values between zero and ten; the dash at one is just for my purposes, because I'm interested in two segments: the segment on the right, between one and ten, and the segment between zero and one. I'm going to be clever and come up with a proposal that takes me between those two regions. The rule is a deterministic proposal: if I'm currently at a position x in the first segment, I multiply my value by nine and then add one, so if I start out here at 0.5, the update rule takes me to 5.5. The second part of the rule applies if I'm between one and ten: I subtract one and divide by nine, so from 5.5 I'd end up back at 0.5. So this rule is reversible: if you start at one point and apply the rule repeatedly, you just bounce back and forth between two points. You'd have to combine this with other Markov chain Monte Carlo operators to explore the whole space, but it's a proposal we could use in Metropolis, amongst other proposals. And here's an argument for how we'd do it: here's the Metropolis-Hastings acceptance rule, in the bottom left; we evaluate the probabilities under the stationary distribution, and what are the Q's? Those are the proposal probabilities. The wrong thing to say is: well, I started out at 0.5 and I'm proposing a move to 5.5; I always do that, so that has probability one, and if I were at 5.5 I'd have probability one of going back to 0.5; these Q's would cancel, and I'd end up with this acceptance rule. Now, this acceptance rule is clearly wrong, because imagine we had a uniform distribution here between zero and ten. If I drew a sample perfectly from that distribution, 0.9 of the time I'd be between one and ten; I apply the rule once, and all of those points move to the first segment. So after one step of this algorithm, 0.9 of the probability mass is between zero and one, and yet only 0.1 of the mass should be there if my target distribution is uniform. OK, so this algorithm is wrong, and as it's just after ten I'll leave it there; you can figure out why in the break, if you haven't already seen it, and we'll reconvene. Thanks.
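(The talk leaves the resolution as an exercise; for reference, the standard fix, not spelled out by the speaker here, is that a deterministic, invertible proposal x' = g(x) needs a Jacobian factor in the acceptance probability:

```latex
\min\!\left(1,\; \frac{p(x')}{p(x)}\,\left|\frac{\partial x'}{\partial x}\right|\right)
```

Here x' = 9x + 1 stretches lengths by a factor of nine, so moves from [0, 1] to [1, 10] carry a factor 9 and reverse moves a factor 1/9; with a uniform target, mass 0.1 × 1 flows right while 0.9 × 1/9 flows left, and the uniform distribution stays put.)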
Info
Channel: Institute for Pure & Applied Mathematics (IPAM)
Views: 18,492
Rating: 4.9748425 out of 5
Keywords: Markov Chain Monte Carlo, Deep Learning (Field Of Study), Computer Science (Field Of Study), Machine Learning (Software Genre), Mathematics (Field Of Study), Iain Murray
Id: Em6mQQy4wYA
Length: 61min 49sec (3709 seconds)
Published: Wed Aug 26 2015