Markov Chain Monte Carlo and the Metropolis Algorithm

Video Statistics and Information

Reddit Comments

The Monte Carlo Casino in Las Vegas opened in 1996 (time point 3:01). However, the Monte Carlo Casino in Monaco was established in 1850. I'm not sure where this video is from, but this character only has a master's and starts off with misinformation. If I didn't know MCMC already, I would pass and look for reputable sources.

👍 2 · 👤 u/Homersteiner · 📅 Dec 29 2015 · 🗫 replies
Captions
All right, so today we are going to be talking about Markov chain Monte Carlo and the Metropolis algorithm. Markov chain Monte Carlo, or MCMC as it is commonly referred to, is definitely one of the most powerful tools in modern statistics, and the Metropolis algorithm is the archetypal example. In fact, the Metropolis algorithm was named one of the top ten most important algorithms of the 20th century by the journal Computing in Science and Engineering. MCMC comes from the family of techniques known as Monte Carlo simulations, and these trace their origin back to the 1940s and 1950s, to a group of physicists working at Los Alamos National Laboratory in New Mexico. One of them, Stanislaw Ulam, was recovering from surgery, and while doing this he was killing some time playing solitaire, and he started wondering what the chances are of actually being able to win any given game of solitaire, because as anyone who has played solitaire knows, sometimes you deal out the cards and it's just not a winnable configuration. So he sat there trying to figure this out combinatorially, and it pretty quickly became obvious that working this out analytically is not feasible. I mean, if you think about the number of configurations a 52-card deck can take, that is 52 factorial, which if you work it out is on the order of ten to the sixty-seventh; that's the same order of magnitude as what scientists have estimated the number of atoms in the Milky Way to be. So just computing this out, especially in a day and age when computers were only starting to be built, was not going to happen. So he got this idea: why not just deal out a bunch of games, record the results, and use that data to build a picture of what the probability distribution looks like? So when he goes back to Los Alamos he's talking to all his co-workers, talking to
von Neumann, Fermi, and Metropolis, some of the same people who had actually worked on the Manhattan Project, and they all quickly begin to see that this concept of using a discrete statistical sample of some given system is an excellent way to approximate its probability distribution, or really any given feature of it. And so Monte Carlo simulation was born. The name actually comes not from one of the scientists but from the Monte Carlo Casino in Monaco, and you can kind of understand why they named it after a casino: this whole idea of chance and probability, rolling the dice, you might say, to find some sample from a system. It also might have had something to do with the fact that Ulam's uncle lost a pretty significant amount of money in that casino. And here is some artist's interpretation of what Ulam might have looked like back in the 1940s, flipping his coin and leaning on his giant pair of dice. All right, so let's say we've got a probability distribution, possibly in a very high dimension, or it's really complicated, and we want to sample from it or find its expected value, and doing so by traditional means just doesn't make sense. We can do this using Markov chain Monte Carlo. The basic idea is that we perform a random walk through the distribution, favoring values with higher probabilities. What I mean by this is that once we have a starting point, we randomly pick a point nearby and evaluate that point's probability. If that point's probability is higher than the point we started on, we move there and accept that point; otherwise, we have predefined some function which gives us a probability that we will accept that point even though it has a lower probability, and if we don't accept it, we just stay where we are and try again. The really amazing thing about this is that if we iterate through this scheme enough, we will actually visit every single point in the probability distribution in proportion to how probable that point is.
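The accept/reject random walk just described can be sketched in a few lines of Python. This is an illustrative sketch, not the speaker's code: the target here is an unnormalized standard normal, and the proposal width of 0.5 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):
    # Unnormalized target density, a standard normal; the normalizing
    # constant cancels in the acceptance ratio, so we can drop it.
    return np.exp(-0.5 * x * x)

def metropolis(n_steps, x0=0.5, step_sd=0.5):
    """Random walk through the distribution, favoring more probable values."""
    xs = np.empty(n_steps)
    x = x0
    for i in range(n_steps):
        x_star = x + rng.normal(0.0, step_sd)   # propose a nearby point
        # Accept if more probable; otherwise accept with probability p(x*)/p(x).
        if rng.uniform() < p(x_star) / p(x):
            x = x_star
        xs[i] = x  # a rejected move repeats the old point
    return xs

samples = metropolis(50_000)
print(samples.mean(), samples.std())  # should be close to 0 and 1
```

Note that rejected proposals still record the current point; that repetition is what makes the visit counts proportional to probability.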
Here, let's do a really simple example of MCMC, the simplest one we can think of, and that would be a Gaussian centered at zero with a standard deviation of one. Obviously in real life you would not need to use MCMC for this; it's very simple and well understood what this probability density function looks like and how to sample from it. So here's just a plot of your general Gaussian, and then I ran some code to perform MCMC on it, and here's the histogram of all the points I sampled after 250,000 iterations, which isn't really all that much, and we can see this shows the shape of the normal very well and gives us our expected value right at 0. Actually, it looks like it might be a little left of 0, but I expect that's just an artifact of the binning I did here. So now I am going to give you a once-in-a-lifetime experience: to actually peer behind the curtain and see what that code on the previous slide was doing. You start right here; this is our starting guess. Let's say I had no idea what a Gaussian should look like and was thinking the maximum likelihood is probably around 0.5, so I start here, and then we start iterating through this algorithm. We take our next guess and move down here; it's less likely now, but apparently it passed the acceptance test, so we moved there. But this time we jump up here, and it's like, oh, this is much more probable, let's hang out here. As you can see, we slowly trace a path through this distribution function. We do get down low over here, but really we like to hang out near the top; that's where most of our points are going to be. The important thing is that yes, we do actually explore these lower-probability areas, but in the end most of our points are going to be clustered near the top. Let's move on to a more realistic example of MCMC, and that is going to be the one from the famous paper Equation of State Calculations by Fast
Computing Machines, which was written by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller in 1953, and this is really the paper that started it all; this was the origin of MCMC and the Metropolis algorithm. And here's Metropolis right here; this is actually a picture from his ID badge when he was working at Los Alamos. The goal of this paper was to calculate the properties of a system comprised of interacting molecules using Monte Carlo methodology. Here's what the algorithm looks like: we've got this system of molecules, and we move one of them slightly, and then we calculate the change in energy of the system, Delta E. This is based on the Boltzmann equations, which can actually tell us the energy of a given system as determined by the temperature and the distances between the molecules; this is the potential energy. Then, if the energy is less than in the previous state, meaning the energy drops when we move this one molecule, we accept the change; otherwise, we accept the change with a probability of e to the negative Delta E over kT, where Delta E is the change in energy, k is the Boltzmann constant, and T is the temperature of the system. That allows us to sometimes move to configurations of molecules that actually increase the energy of the system slightly and are less probable, and what this does is let us explore all of the accessible states of the system at a given temperature and get a snapshot of which ones are really the most likely. Here's what this would look like. Let's say we've got this two-dimensional box with a bunch of molecules in it. We start out with this one right here and put a box around it; this isn't actually a physical box, it just tells us the range in which we are going to allow this molecule to move. So we randomly move it somewhere in this box, let's say right up here, and once we do that we are going to
calculate the change in energy of the entire system. Let's say we've calculated the change in energy, and by moving this molecule here, the energy of the system has actually dropped; therefore the thermodynamics of this system likes this configuration a little bit more, so we'll accept that and let this molecule stay here. Then we move on to another molecule and again put a box around it, letting us see where we're going to let it move, and let's say we move this one just a little bit down here. Well, this time this actually increases the energy of the system a little bit, so we're not just going to reject it outright; we will give it some probability of moving there. That means we give it a probability of e to the negative Delta E over kT, like we said on the previous slide, and then we choose some random number between zero and one, drawn from a uniform distribution, and if that number is less than e to the negative Delta E over kT, we accept the change. So let's say we did that, and oh no, it didn't pass, so we move it back to where it was and move on to the next molecule and do the same thing again, and the next molecule, and we do this for all of them and then just repeat, going through this many, many times. At the end we have this wonderful data set of all the configurations the system has taken, and we know that it's more likely to take configurations where the energy has dropped, which are therefore more natural configurations, and so by doing this we can understand, for a given temperature, what the possible configurations of these molecules will look like. This was done using some of the first computers developed at Los Alamos, actually a computer named MANIAC, and each iteration, meaning each time we move one of these molecules in its box and compute the change in energy, took about three minutes, so the overall experiment the paper was published on took days to run.
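The move/accept loop just described can be sketched as follows. This is a toy stand-in, not the 1953 paper's actual computation: the soft 1/r² pairwise repulsion, the box size, the temperature, and the clipping of moves to the box are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-D version of the scheme: N particles in an L-by-L box at temperature
# kT (in natural units), each proposal confined to a small box of half-width
# `step` around the particle's current position.
N, L, kT, step = 16, 10.0, 1.0, 0.5
pos = rng.uniform(0.0, L, size=(N, 2))

def particle_energy(i, pts):
    # Potential energy of particle i against all the others,
    # using a stand-in soft repulsive 1/r^2 pair potential.
    d = np.delete(pts, i, axis=0) - pts[i]
    r2 = (d * d).sum(axis=1)
    return (1.0 / r2).sum()

accepted, n_sweeps = 0, 200
for _ in range(n_sweeps):
    for i in range(N):
        old = pos[i].copy()
        e_old = particle_energy(i, pos)
        # Move particle i somewhere in the small box around its current spot.
        pos[i] = np.clip(pos[i] + rng.uniform(-step, step, 2), 0.0, L)
        dE = particle_energy(i, pos) - e_old
        # Accept if the energy drops, else with probability exp(-dE / kT).
        if dE <= 0 or rng.uniform() < np.exp(-dE / kT):
            accepted += 1
        else:
            pos[i] = old  # rejected: put it back and move on
rate = accepted / (n_sweeps * N)
print(rate)  # fraction of proposed moves accepted
```

Keeping the rejected configuration (rather than discarding it) is essential: the chain of configurations, repeats included, is what samples the Boltzmann distribution.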
Of course, today with our technology doing this would take a fraction of a second, so you can really see why, with the advent of computers, this became such an important method. I've explained to you now what Monte Carlo is all about, but you might be wondering about these Markov chains, because after all we are talking about Markov chain Monte Carlo. A Markov chain is any sequence of numbers or events which follows this graphical model, and what this graphical model is telling you is that each event in the sequence depends only on the event that occurred directly before it. We can write that out as: the probability of x5 given x4, x3, x2, and x1 is simply the probability of x5 given x4. So each event is determined probabilistically only by the event that happened before it. All right, here is a discrete example of a Markov chain. We can describe a Markov chain using a transition graph or this transition matrix; these both describe the same system. We have these three states right here, x1, x2, and x3, and here's what the arrows mean. Let's say we're at x1, and we've got this one arrow right here labeled 1; what this tells us is that if we're at x1, there is a probability of 1 that we will move to x2. Then from x2 there is a probability of 0.1 that we will just stay at x2 and a probability of 0.9 that we will then move to x3. From x3 there's a probability of 0.4 that we'll move to x2 and a probability of 0.6 that we'll move to x1. So given any initial starting point, you will move through the system: move over here, back to x2, stay at x2, on to x3, to x1, to x2, and so over time you'll just explore this system. We can also describe this with the transition matrix, which says that T_ij is the probability of moving from x_i to x_j. So t11 is the probability of moving to x1 from x1, and that's obviously 0, and moving from x1 to x3 is also 0, but then you've
got this probability of 1 of moving to x2, and from x2 you've got this probability of 0.1 of just staying at x2 and this probability of 0.9 of moving to x3. So let's actually explore this system. Let's take a starting point x0 which says there's a 0.5 probability of being at x1, a 0.2 probability of being at x2, and a 0.3 probability of being at x3. Then, to move to the next step, what we do is compute x1 = x0 times T; that is, we multiply this row vector x from the right by our transition matrix T. What that tells us is that in the next step there is a probability of 0.18 that we will be at x1, a probability of 0.64 that we will be at x2, and a probability of 0.18 that we will be at x3, given the state before it. We can keep doing this, and with every iteration these numbers evolve, and eventually we converge to roughly (0.22, 0.41, 0.37). In fact, no matter what we choose as a starting point, any x0 right here, we will converge to this. So really, when we're performing MCMC, what we're doing is creating a Markov chain; our sequence of samples is a Markov chain, though obviously not a simple discrete Markov chain like this but one of infinite dimension if we are in a continuous space like the examples we've seen before. Now, the chain we've been looking at has some really important properties as far as MCMC is concerned; in fact, for MCMC to work we need it to have these properties. The two properties are that it is irreducible and that it is aperiodic. Irreducible means that from every state there is a positive probability of moving to every other state. That means, for example, that if we're at x2 there's a positive probability that through some sequence of events we can reach x3 and we can reach x1, and that's very obvious because from x2 we can get to x3 and from x3 we can get to x1, so from x2 there is a positive probability that you can move to
any other state; the same goes for x3 and for x1. The other important property is that it is aperiodic, and what that means is that you're not going to get trapped in cycles. So let's say that this branch right here, this probability of moving to x1 from x3, is gone, and let's also take out this 0.1 right here; then we are pretty much going to get trapped in a cycle between x2 and x3 forever: from x2 we move to x3, and from x3 we move back to x2, and this just continues, and we will never reach x1 again. Given these two properties, we can say that the chain is ergodic, and what that means is that it can and will explore every possible state in the system, and that's really important for MCMC, because what we said about this method is that it explores the entire state space, and does so in proportion to the probability of each area. So that is a brief primer on Markov chains. Now, here it is, what we've all been waiting for: the general form of the Metropolis algorithm. Actually, what we're looking at here is the Metropolis-Hastings algorithm, because of this term right here, but I'll get to that in a second. What we do is start out with some initial guess and then iterate over this scheme, and every iteration takes one step in our domain. The first thing we do is sample from what's called the proposal distribution, this function Q. What that is is a way to start with whatever point we're at and generate some new random point somewhere near it; it's a random walk. For example, when we were talking about the molecules in a box, that box we drew around each molecule before moving it inside, that is our Q right here. The proposal distribution generates a proposal point x*, which depends only on where we've been, this x_i, whatever the previous iteration landed on. So you can see why this is a Markov chain: every new point we choose depends solely on where we were
beforehand. So we have that, and then we take it into this test right here. What we're doing here is comparing the probability of our proposal point to the probability of the point we were already at. Then this term over here is the proposal function: it's the probability of the point we're at given our proposal point, over the probability of moving to the proposal point given that we were at our previous point. This is the Metropolis-Hastings algorithm, but in the Metropolis algorithm this term disappears, and that's because in the Metropolis algorithm we choose Q such that it creates a symmetric random walk. What we mean by a symmetric random walk is that Q of x* given x_i is the exact same thing as Q of x_i given x*; moving between two points, the probability of the move depends only on the distance. That means these two terms cancel out, and we're only looking at the ratio of the probability of our proposal point to that of the point we're at. So what this is saying is that we have this value u, which we sample from a uniform distribution between 0 and 1, and we check whether it is less than this quantity right here. Let's say our proposal point x* is more probable than where we're at; that means this ratio is greater than 1, so when we take this min we just get 1, and u is always going to be less than 1, so we accept the proposal and move to that point. But let's say this ratio is less than 1, meaning the probability of the point we're starting at is greater than the probability of our proposal point; then there is still some probability that our value u is going to be less than the ratio, and in that case we accept the move as well. So no matter what, there is some chance that we will move to our proposal point. But let's say we fail this test; then we just stay where we're at and continue with the iterations. Also notice that this P doesn't actually
have to be the full probability density function, because if there's some kind of Bayes denominator, it's obviously just going to cancel out, so as long as you have some function here which is proportional to the full PDF, we are good to go. We're going to see in some later examples that that is actually really useful. OK, so let's return to that example we did way back at the beginning, the really trivial one with a Gaussian centered at zero with a standard deviation of one. Here, for the proposal function Q, we're just going to choose a Gaussian centered at x_i with a standard deviation of 0.05. That just means that every time, say we're starting right here on the curve, we're going to take a normal distribution around that point with a standard deviation of 0.05 and step somewhere within it, and that's this line right here in the code; that's our Q. So let's walk through this code. We have 250,000, and that's going to be the number of iterations we're going to take, and I'm just initializing our output right here, you know, good coding practice, and our initial guess. We're going to say our initial guess is 0.5. Actually, in truth, that's not the worst initial guess; obviously, since we have better knowledge of what the PDF we're simulating looks like, we could do better than that, but it is within one standard deviation, so it's not bad, and we expect it to converge pretty quickly. All right, here's the loop with the real Metropolis algorithm built in. Here's our Q, our proposal distribution: we draw from a normal distribution with a standard deviation of 0.05 centered at x_i, our previous point, and that gives our proposal x. Then, if some random number drawn uniformly between zero and one is less than the minimum of 1 and the ratio built from the PDF of a normal centered at 0 with standard deviation 1, which is the PDF we're trying to simulate, we take that probability of the proposal point over the
probability of the previous point. If that passes, meaning that the probability of our candidate point, the proposal point, is greater than the probability of the point we're starting at, then we just accept it and set the next point in our chain to that proposal point. If that's not the case and it's a little bit less probable, this random number can still be less than the ratio and we can accept it; but if that number is greater than the ratio, then we're just going to stay at the point we're at and continue on. This runs through 250,000 times, and here's that output again: this is the histogram of all the points in the vector x that we get from running MCMC. Now we are going to move on to a much more practical, real-life example, one from climatology and atmospheric science. Right here is a diagram of what's known as the energy balance model, and what this describes is the balance of radiation coming in from the Sun and going back out into space, combined with the greenhouse effect of radiation coming from the Earth's surface being trapped here. I'll walk you through this real quick. We've got radiation coming in from the Sun, and some of that is going to be reflected by clouds and aerosols and things like that back into space, so it never reaches the actual Earth. Some of it is actually absorbed by the atmosphere, so it's not reflected into space and it's not absorbed by the surface, but it's warming the atmosphere. Then there is some fraction, actually a pretty large fraction as you can see, that is absorbed by the surface; but also some of it is reflected off the surface, and that's what's known as albedo, the fraction of the total energy hitting the surface which just bounces off and goes back into space. Then, of the energy that's absorbed by the Earth, it can leave in a few different ways: it can be radiated
back out, or it can be turned into sensible heat, warming the atmosphere and rising up, or into evapotranspiration, latent heat, where it is used to actually evaporate water and put it into the atmosphere. Then you've got your greenhouse effect right here, where some gases in the atmosphere absorb this long-wave radiation being emitted by the Earth, warming the atmosphere, and then shoot some of that right back down to the Earth, so you get this cycle right here. That's really important for actually keeping energy in this system and keeping a nice livable temperature at the Earth's surface. And also, since the atmosphere is warm, it is emitting some radiation back out into space. So this whole model right here is actually really useful for studying the climate of the Earth, but it is a huge simplification of a really complicated system; there are some major parameterizations in it and some uncertain parameters, and one way to figure out some good choices for these parameters would be MCMC. So now we're going to use the Metropolis algorithm to fit uncertain parameters in the energy balance model. Using this model, we are going to calculate the average surface air temperature of the Earth under various parameter configurations, and then we're going to compare that to actual observed data and see how good a fit those parameters are to approximating the real world. The parameters we are going to be looking at are planetary albedo, sulfate aerosol forcing, the strength of water vapor feedback, and the depth of the ocean affected by atmospheric warming. As for what these things are: planetary albedo is just the fraction of radiation coming in from the Sun that gets reflected back into space by the surface of the Earth, and sulfate aerosol forcing is the amount of solar radiation scattered by sulfate aerosols in the atmosphere; that's solar radiation coming into the atmosphere and being scattered off, so it no longer
reaches the surface of the Earth. Then there is the strength of water vapor feedback. This is actually something that's really important to the Earth's climate, because water vapor is a really strong greenhouse gas, and as you heat the Earth's surface, more water is going to be evaporated and put into the atmosphere; but since water vapor is a really strong greenhouse gas, that's going to heat up the Earth's surface, which in turn is going to evaporate more water, so you can see you get this positive feedback loop which just keeps warming the Earth. The strength of this feedback is something that we're going to toggle as well. And then the depth of the ocean affected by atmospheric warming: the ocean acts as a really great place to store heat, so as heat comes into the Earth system, a lot of it can just be absorbed by the ocean, and the ocean varies a lot less in temperature than the land, so it's a huge regulating force in the Earth's climate; but only a certain fraction of it is actually affected by the air temperature of the Earth, because once you get down to a certain depth, it's just cold water that does not care what's going on in the atmosphere. All right, so what do we need to implement the Metropolis algorithm here? We need an initial guess for the parameters, a choice for P of x, and a choice for our proposal distribution Q. The initial guess can really be anything in the domain, but we're better off having a good starting guess, something close to what we think the maximum likelihood would be, and in a problem like this, where similar science has been done before, we can find something in the literature which will tell us where we should start, so it'll converge much faster. In the previous example, where we were talking about the Gaussian distribution, the P of x we were trying to estimate was really clear; it was obviously just the Gaussian. This time, though, our model just outputs a value, the surface air temperature, and not necessarily some probability; but then, using
observed data, we can construct a likelihood function which is maximized by how well the model output fits the observational data. A really easy choice for this is just an exponential function, e to the negative cost, where the cost is the squared distance between the model output and the observed data. Then for Q of x, we'll simply choose a random step drawn uniformly from a range of values around whatever point we're moving from, so this is going to be very similar to the box around the molecule we used for the state-space example. All right, so now here is the code we would use. I've got two functions defined here, step_param and EBM_model. step_param is a function that we give the current parameters, and it randomly moves those parameters around, subject to the constraints we put in place, and gives us some candidate parameters, some proposal parameters; so this is Q right here. And EBM_model is our energy balance model: we give it a set of parameters and it outputs the cost, that is, how close the result is to actual observed data. Then here is our test: we take some random number from the uniform distribution between zero and one and see if it's less than e to the negative difference in cost, that is, how much the cost, the difference from observed temperatures, changes when we move to this new candidate configuration. You can see this is actually that likelihood function: e to the negative cost of our candidate divided by e to the negative cost of where we were before. If this test passes, we accept the parameters, and we also record the cost for the next iteration to compare against right here; but if it fails, if this random number is greater than the ratio, then we just stay where we're at. And then we iterate over this: we randomly step our parameters, we step the model, we check to see whether or not we want to move to this new point, and as always, even if this new configuration of parameters is worse than where we're at, there is some
chance we will accept it. We could put a min of 1 in here like before, but I chose to leave it out, because if the ratio is greater than one, we would just be choosing the min of one anyway, and no matter what, this random number between zero and one will be less than that. So if we accept it, we record the values; if not, we stay where we are. Here's the output from actually running this. We can see that for the albedo, the parameter for the amount of solar radiation reflected off the Earth's surface, we get a really good idea of where that value should be, right here. For the sulfate amplitude, same thing; it's not quite as steep a curve, but we get a good idea of what value we should use for this model to fit the observed data. For the water vapor feedback and the ocean depth, it's not quite as clear. It looks like there might be some MLE, some really likely value, over here, and maybe over here, but it's kind of hard to tell, so this would mean that maybe we want to run some more iterations of the algorithm; and also, since we don't see a clear peak here, this parameter is slightly less interesting: moving it around doesn't do quite as much to alter the output of the function. The same goes for ocean depth, more or less; we do see a probable maximum around here, and you wouldn't want to choose some parameter over here, but through MCMC we have been able to sample this parameter space and find the parameters which give us the best results for getting this energy balance model to fit real-world observations. So today we learned about Markov chain Monte Carlo and the Metropolis algorithm. This is a way to sample from a probability distribution and even calculate some of its expected values, and this is really useful when you have some complicated function in high dimensions, because if you've got some simple function, you can get the same results using something like rejection sampling with a proposal distribution, or an inverse transform, but those require that you
have some idea of how to approximate the function using a simpler function, or that you can actually compute its CDF, and that's not always possible. If you've got a really complicated function, or if it's in really high dimensions, MCMC is super useful, and I hope after today you've gotten a good intuitive idea of how this algorithm works and how to actually apply it to real-world problems. We only talked about the Metropolis algorithm, which is by far the simplest MCMC method, but now you should have a base to go off and learn about more complicated, much more powerful and efficient Markov chain Monte Carlo methods. All right, I hope you've enjoyed.
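The parameter-fitting loop from the climate example can be sketched like this. The video's actual step_param and energy balance model code are not shown on this page, so both are replaced here by hypothetical stand-ins: step_param proposes uniformly in a small box around the current parameters, and the cost is a simple quadratic with its minimum at made-up "true" values, so the sampler should concentrate near them.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "true" parameters (stand-ins for albedo, sulfate forcing, ...)
BEST = np.array([0.3, 0.5])

def step_param(theta, width=0.05):
    # Q: propose new parameters uniformly in a small box around the
    # current ones, clipped to stay in the allowed range [0, 1].
    return np.clip(theta + rng.uniform(-width, width, theta.size), 0.0, 1.0)

def cost(theta):
    # Stand-in for EBM_model: squared distance between hypothetical model
    # output and observations, scaled so the likelihood is reasonably sharp.
    return ((theta - BEST) ** 2).sum() / 0.01

theta = np.array([0.5, 0.5])  # initial guess
c = cost(theta)
chain = []
for _ in range(20_000):
    cand = step_param(theta)
    c_new = cost(cand)
    # Likelihood ratio exp(-cost_new) / exp(-cost_old) = exp(c - c_new);
    # the min with 1 is implicit, as in the video, since u <= 1 always.
    if rng.uniform() < np.exp(c - c_new):
        theta, c = cand, c_new
    chain.append(theta)  # rejected steps repeat the old parameters
chain = np.array(chain)
print(chain.mean(axis=0))  # should land near BEST
```

Histogramming each column of `chain` would reproduce the kind of per-parameter plots discussed above: a sharp peak means the data constrain that parameter well, a flat spread means it matters less.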
Info
Channel: Jeff Picton
Views: 195,703
Rating: 4.87 out of 5
Keywords: MCMC, statistics, markov chain monte carlo, metropolis, markov chain
Id: h1NOS_wxgGg
Length: 35min 34sec (2134 seconds)
Published: Fri May 04 2012