Machine learning - Importance sampling and MCMC I

Captions
the Metropolis algorithm, which, for problems like approximating the volume of a convex body in high dimensions, is the only approximate polynomial-time algorithm we know. It's a remarkable algorithm; it was voted one of the top ten algorithms of the last century, so it will be very useful to know it. Variations of it include things like Gibbs sampling, a term you've probably come across if you do deep learning, or biology, or statistics, and it will also lead us to hybrid Monte Carlo, a very popular algorithm that combines Monte Carlo sampling with gradient information in order to do Bayesian inference in neural networks. So this will allow us to attack problems like neural networks from a completely different perspective. Where the Bayesian methods are very good is in getting confidence intervals and so on: if you want to do a robust analysis, where you put in all your prior assumptions and then verify that those priors, those intuitions, are right by looking at data and doing predictive tests, then the Bayesian scheme is ideal. The problem is that it requires solving integrals that we can't do by hand, and fortunately, once we can sample, we will be able to solve those integrals. That's what sampling is going to give us: the machinery for doing Bayesian inference. Sampling, that is Monte Carlo, is to Bayesian inference what optimization is to maximum likelihood or penalized likelihood. You either think of the problem as an optimization one, where you have a likelihood and a regularizer and you crank an optimizer, or you think of it as an integration one, where you also have a likelihood and a prior, but where you need a normalization constant in order to get the posterior, and that normalization constant is an intractable integral, so we resort to sampling. OK, so let's review some of this material.
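In symbols, the two routes just contrasted look like this (standard formulas, not copied from the slides):

```latex
% Optimization view: one point estimate
\hat{\theta}_{\mathrm{MAP}}
  = \arg\min_{\theta}\Big[-\log p(y \mid X, \theta) - \log p(\theta)\Big]

% Integration view: a full posterior, requiring the normalization constant Z
p(\theta \mid X, y) = \frac{p(y \mid X, \theta)\, p(\theta)}{Z},
\qquad
Z = \int p(y \mid X, \theta)\, p(\theta)\, d\theta
```

The optimizer needs only gradients of the bracketed objective; the integral for Z is the quantity Monte Carlo will approximate.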
In the last class, on logistic regression, we went over what the likelihood is for logistic regression: it's Bernoulli, because essentially we're trying to decide between binary labels, 0 or 1. The success probability φ_i is the probability that y_i equals 1, and 1 − φ_i is the probability of it being 0. In Bayesian inference we assign a prior to the weights. If I put a Gaussian prior, and you take the negative log of that prior, you'll see that a Gaussian prior is essentially just a ridge regularizer in the optimization context. If, instead of putting θ squared in the exponent of the Gaussian, I had the sum of the absolute values of the thetas, then I would get an L1 regularizer. I could even have a prior that matches what we did for autoencoders; those priors are usually called spike-and-slab priors. They're a bit more advanced and we're not going to go into them, but they're there. For simplicity we're going to put a Gaussian prior, typically with μ equal to zero, so essentially what we're saying is that we want the thetas to be small, or that we believe thetas near zero are most probable, because a Gaussian prior centered at zero is just telling you there is higher probability at zero. So in order to compute the posterior, all we do is multiply the likelihood times the prior and normalize over θ, and the reason we normalize over θ is that the posterior is the distribution of θ given X and y, which are the data in this case. Now, solving that integral for the particular likelihood on the screen and that particular prior is intractable; it's something we can't do by hand, so we have to resort to numerical computing; we have to figure out a smart way of
solving that integral. So this is the setup for all Monte Carlo problems in Bayesian statistics: you have a likelihood, you have a prior, and you don't know Z. Where Monte Carlo always comes in is to give us an approximation of Z, or, as we will see later, without having to approximate Z it will allow us to approximate interesting properties of the posterior, which will let us do Bayesian prediction, which is a sort of ensemble-type prediction. OK, questions at this point? Because the next ten slides are extensions of this slide. [Student: if we're just maximizing the posterior, that's equivalent to maximizing the regularized likelihood, so I take it we're not maximizing the posterior; we're really integrating over it?] Right; what we're going to try to do is build a histogram approximation of the posterior. If you think about the negative log posterior, the negative log of p(θ | X, y), in terms of optimization, Z essentially gives us a constant, because it does not depend on θ for these particular models. There are models in machine learning where that's not the case, like Boltzmann machines and restricted Boltzmann machines, which some of you might have encountered before deep learning, but here we're not concerned with those models. Z is a constant in terms of θ; Z is not a function of θ, and the reason it doesn't depend on θ is that we're integrating over θ. If we integrate over θ, we're just removing the effect of θ; that's marginalization. So we get a constant, plus the negative log likelihood, minus log p(y | X, θ), minus the log of the prior, log p(θ). The negative log likelihood is just minus the sum from i = 1
to n of, essentially, the cross-entropy: y_i log π_i plus (1 − y_i) log(1 − π_i). And then in addition, if the mean is 0, we have this regularizer, which is 1 over 2σ² times the norm of θ squared. That's what we get if we take the negative log of the posterior. So, as a function of θ, what we're saying is: if we want to do logistic regression by optimizing, we do gradient descent on exactly that objective function. If you just consider the cross-entropy, as you did in your homework, you know how to compute its gradient. If you want to regularize the model, saying: I want some of the features to be small, because if some inputs to the neural network are not too influential I want their weights to go to zero, then you just add that norm of θ squared. The parameter σ here plays essentially the role that λ played when we discussed ridge regression. So by adding a Gaussian prior you're just adding a ridge regularizer to logistic regression, and the same would work for neural networks. [Student: when you sample from the posterior, how do we do that?] We haven't done that yet; that's what this class will be about. At this point I'm just trying to make the connection, making sure you understand what the prior is here, what the likelihood is, why we're setting a Gaussian prior, and I'm arguing that the Gaussian prior makes sense because it just says you want your features to be small: you're putting a ridge regularizer on your problem. There are two ways to approach the problem. The way we know so far is: take the gradient of that objective function J(θ), do gradient descent, and find one single θ; that θ is our answer. The Bayesian will not believe in a single θ; the Bayesian will find many thetas. There isn't going to be a single answer; there are going to be many answers,
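The optimization route just described can be sketched in a few lines. This is my own toy example (assumed data, not code from the lecture): gradient descent on J(θ), the cross-entropy plus the ridge term induced by the Gaussian prior.

```python
import numpy as np

# Toy sketch: MAP estimation for logistic regression, i.e. gradient descent on
#   J(theta) = -sum_i [ y_i log(pi_i) + (1 - y_i) log(1 - pi_i) ]
#              + ||theta||^2 / (2 sigma^2)
# where pi_i = sigmoid(x_i . theta) and the second term comes from the
# Gaussian prior N(0, sigma^2 I).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_logistic(X, y, sigma2=10.0, lr=0.01, n_steps=5000):
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        pi = sigmoid(X @ theta)
        grad = X.T @ (pi - y) + theta / sigma2  # gradient of J(theta)
        theta -= lr * grad
    return theta

# synthetic data generated from the model with true theta = (2, -1)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (rng.random(200) < sigmoid(X @ np.array([2.0, -1.0]))).astype(float)
theta_map = map_logistic(X, y)  # one single theta: the mode of the posterior
```

This returns the single θ the optimizer believes in; the Bayesian alternative below replaces this point estimate with many weighted thetas.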
very much like when we do a forest: we take many trees, each tree is an answer, and then we combine them. It's going to be that way. All right, so let me move on to the sampling. Key to this is the fact that we need to compute the normalization constant, which is the integral of the likelihood times the prior. The trick we're going to follow is to multiply and divide by a distribution q(θ). That distribution could be, for example, a Gaussian with a huge variance; typically you want something with heavy tails, but let's just say it's a wide Gaussian. This q(θ) you get to choose: as a user of a Monte Carlo method, you design the q(θ). I will give you some insight next into how to design this q(θ), but for the time being assume you can pick one. Assuming we have a q(θ), what we do next is rename quantities: this ratio here we're going to call w(θ). Now, you can evaluate w(θ): q(θ) is a Gaussian, so given a θ you can evaluate it; the likelihood is Bernoulli, so you can plug in a θ and get a value of the likelihood for that θ; and the prior is also a Gaussian, so you can plug in θ and see what value you get. So we can evaluate w(θ), and Z is the integral of w(θ) with respect to q(θ) dθ. What we do next is use our computer to draw samples. You go to your computer and you say: give me capital N samples, with mean zero and a variance σ² of 1,000. Any computer program can generate samples from a Gaussian distribution, and we've seen before how to do that using the cumulative distribution of the Gaussian. In math we just say we're drawing samples from q(θ) for i = 1 all the way to N, using Python-style notation. What I do next, then, is replace an expectation with respect to q of
theta, because this Z is the expectation of w(θ) with respect to q(θ), by a sample average. We're used to flipping coins and using counts, that is, sample averages, to estimate the expectation of getting heads; here we're going to go the other way around: we take an integral and replace it by a sample average. In other words, I draw N samples, and I replace the expectation with a sample average. That, essentially, is what Monte Carlo is in terms of solving integrals: any sample average converges to the corresponding expectation, and that's just a consequence of the strong law of large numbers. The law of large numbers says an average converges to an expectation; it's a very classical result, more than 100 years old, and it's written up on Wikipedia. [Student: does the choice of distribution for q affect anything in this method?] Yes, it does indeed. Some q's will be very good and useful; some will be very bad. I'm going to tell you what makes a good q very soon, but for now let's assume we do have a good q; if we do, we'll be fine. I have a few plots where I show different q's, and that will let us see the effect of q. So let's look at this in a bit more detail. What we're essentially doing is coming up with a bin approximation of the posterior, so let me bring up this picture here, on Bayesian importance sampling. Here, in 1-D, I can actually do the numerical computation very precisely, so I can actually compute the posterior: I took some data and then I used quadrature to compute the posterior distribution, and I plotted it. Note that the posterior is not Gaussian; the right-hand side is a bit higher than the left-hand side, and that's a consequence of the likelihood being a Bernoulli likelihood. The likelihood in logistic regression is not Gaussian; it's a product of Bernoullis, and it
has that particular shape shown there. If I were doing maximum likelihood, the answer would be the θ where the likelihood is maximum; this is a function of θ. Now let's say I'm doing maximum likelihood with a ridge prior; what would the answer be? Good: it's the location of the maximum of the posterior, which is, say, around two. That's the estimate in this case. I also generated the data from the model, so I know the true θ, and the true θ is actually two. Neither the maximum likelihood estimate nor the posterior maximum gave me exactly the right θ. This value over here is what we call θ MAP, for maximum a posteriori. Neither of these two gave me the true θ that generated the data, which was two. What would have to happen for these methods to converge to the true θ? What do we know about maximum likelihood; when does it converge? Let me rephrase: why do we bother doing maximum likelihood; why does maximum likelihood make sense? [Student: when the data is Gaussian?] But in logistic regression the data is binary, and I argued, and gave a homework asking you to compute a likelihood, that it works there too; it works for other distributions. [Student: with infinite samples, if you kept recomputing the posterior, using the posterior as the prior?] Samples of data, you mean? Yes, that's the result. By the way, we have an exam coming up in less than two weeks; make sure you revise your slides. The result for maximum likelihood, the reason we do maximum likelihood, is that as n goes to infinity, as the data grows, if I have a bigger data set, ML would concentrate on two. But here I have a small data set, and when you have a small data set, the best maximum likelihood answer is not necessarily the true mechanism that generated the data. I
need to stress there's variance in this: the likelihood does put some probability on other values. Maximum likelihood isn't saying there is only one right value; the whole likelihood curve is telling me how much each θ is worth as a solution to the problem, and we just pick the single one with the highest value. It's only as the data grows that maximum likelihood converges to the true solution. Maximum likelihood also makes sense because, as I argued, in many settings it is equivalent to matching the statistics of the model with the statistics of the world; that's another good reason to do maximum likelihood: you're just trying to match the model to reality. The θ MAP takes into consideration the likelihood: it takes this curve here, which is the likelihood, multiplies it by this one, which is the prior, and that gives us the posterior, and then you normalize by the area under the posterior to make sure it integrates to one; that's how I compute the posterior. Now, as the proposal distribution I decided to use an arbitrary distribution q(θ), and that's the distribution I'm showing there with the dashed line: just a q that is very flat. When I do Monte Carlo estimation, what I'm able to get is an approximation of the posterior. That approximation is not based on a single θ; it's actually a full histogram, as you can see here with these yellow bars, a whole approximation to the posterior. That's the Bayesian solution. And why does it make sense? It makes sense because, precisely as in this example, there is very little distinction between θ equal to two and θ equal to three in terms of the quality of what they can offer you. It's important to keep in mind that there is uncertainty: the posterior, through its volume, through its variance, is keeping track of the fact that there is uncertainty in your
estimate. Now, how do I write down what we're after, that yellow histogram, which approximates the posterior distribution? How do I write down a histogram in math? One way to do it is the following. I'm going to claim that the posterior of θ given the data, which in logistic regression are the x's and the y's, is approximately the sum over i = 1 to N of 1/N times the weight w(θ_i), times a function, the delta function, which counts whether the sample θ_i fell in the interval dθ. So let's write that down: δ_{θ_i}(dθ) counts the number of samples θ_i in the interval dθ. The picture to have in mind is this: assume θ is one-dimensional and we have a posterior that looks like this. The posterior need not be Gaussian; it could be any distribution, and for a neural network it's more likely to look like that than Gaussian, even if the likelihood is Gaussian, because the nonlinearity in the likelihood, combined with the prior, gives rise to a non-Gaussian posterior. Now I'm going to make an important concept precise: what we mean by density, what we mean by distribution, what we mean by probability as opposed to the density function. I'm going to say that this curve here, p(θ | data), is what we call the density, or probability density function. Next, consider an interval whose width is dθ. This area here is the probability: probability is defined as the area under the density. If you have a Gaussian distribution and you look at the area under the first half of it, what is that area equal to? One half, right: the probability of a Gaussian random variable with mean 0 being less than zero is 1/2. So probability is the area under the curve of the density. In other words,
probability, which I'm going to write as P(dθ), and now I will be a bit more precise and put a dθ there, because we measure probability on intervals; we measure probability on areas. That's why, two weeks ago, I asked you to consider the darts experiment, where you were throwing darts and I asked what the probability is of hitting the dashed area, or the square, or the circle, or the wiggly region. We always had to measure areas, because probability in continuous spaces is assigned to areas; in this 1-D case it's assigned to lengths, to intervals dθ. The probability, using a very loose approximation here, is essentially the height of the interval, which is p(θ), and I should condition all of this on the data, given the data, times the interval length. So the area is the probability, the height is the density, and the width is the length of the interval, which we also call the measure. What this is doing, then, is approximating the probability by counting: if these are the samples, we count how many samples fall in each interval, and the height of each bar is dictated by how many samples fell in that interval. That's how I compute those yellow bars: I draw a bunch of samples, build a histogram of those samples, and that gives me an approximation of the posterior distribution, just counting how many samples fall in each interval. I do need to choose the interval width to do this, and that's essentially how we approximate a distribution in Monte Carlo. Now, in high dimensions we typically try not to do this, because if I were trying to approximate the posterior in high dimensions, in 2-D I would need a grid of N² points, in 3-D an N³ discretization, and in D dimensions N to the power D. So doing this is subject to the curse of dimensionality, if what I'm interested in is
approximating the posterior distribution itself. But most of the time we don't actually care about approximating the posterior distribution; what we care about is making predictions, and we will be able to beat the curse of dimensionality that way, so we're not going to build this histogram. Who here has done measure theory, or a math undergrad? For you, this measure would be the Lebesgue measure. In this class I'm treating these integrals as Riemann integrals, and for this class that's OK: think of them as the Riemann integrals you learned in calculus. But this notation P(dθ) appears quite often, and it actually means something slightly different: it's the Lebesgue integral, which is like the Riemann integral except that instead of summing over intervals like this, you sum over sets. There is a reason for that: you can easily construct functions for which the Riemann integral doesn't converge, so the integral is not well defined, whereas the Lebesgue integral is well defined. We're not going to go into those details; we're simply going to assume we're in Euclidean space, where everything is nice, and in practice, in machine learning, you will usually be in Euclidean spaces or discrete spaces. OK. The other thing we can do, then, to beat the curse of dimensionality, is note that what we usually care about is making a new prediction given a new input and a data set y_1 to y_T and x_1 to x_T. So if you have T labels and T inputs in your training set, given a new input x_{T+1}, you want to make a prediction. In order to make a prediction according to Bayesian inference, we need to integrate over all the possible values θ could take under the posterior, so this becomes the integral of p(y_{T+1} | x_{T+1}, θ) times the posterior probability P(dθ | x_{1:T}, y_{1:T}), which, because of the relation we just learned, is what
we typically write as the density times the measure. When you're reading papers you will often find that notation, or sometimes dP(θ); when you come across that notation, be aware that the authors are just making precise what's the density and what's the probability. Now, this integral essentially removes the effect of θ, because Bayesians don't like the idea of a single θ; they want to integrate over all the possible values θ could take; they want a weighted prediction. What I can do now is approximate this integral, and to approximate it I'm going to use the histogram formula, which says the probability of falling in an interval is the number of times you landed in that interval, weighted by the ratio of the likelihood times the prior divided by q. The reason you need the weight is that the samples are not coming from p; the samples are coming from this other distribution q, so you need to correct, with that weight, for the fact that your samples didn't come from p. [Student: why is it that we don't divide by Z?] Because here we've constructed the weights so that the estimator is appropriately normalized, so we don't need to divide by Z. I'll make this even clearer when I do normalized importance sampling, and I'll come back to your question on that slide. For now, let's assume we have that histogram estimator. If we have it, then we can substitute it in here, the sum from i = 1 to N of the expression we had, and then I do a manipulation: I take the 1/N outside the integral and I take the sum outside the integral. Now, what this integral is saying is: integrate over θ, but only consider the interval dθ; only
look inside this infinitesimally small interval dθ. When you integrate, the delta is essentially the Dirac measure: it's a spike at θ_i, and if I multiply a distribution of whatever shape by a spike, I'm just evaluating the distribution at the spike; that's a result many of us have seen and studied before. So this is essentially p(y_{T+1} | x_{T+1}, θ_i) times w(θ_i). In this case I'm able to compute a predictive distribution, which is essentially the likelihood, because this term is nothing but the likelihood, weighted by how probable each θ_i is; it's weighted by w. If a sample θ_i has high posterior probability, the contribution of its likelihood term will be high. That's essentially the Bayesian solution: take an average of all the predictions, where the predictions are the likelihoods, weighted by w(θ_i). For a Bayesian, you don't predict a single value; you predict a distribution, because when you have a distribution you know the variance, you know how likely each θ is, not just the one θ that you chose. In bioinformatics this is very important: if you're trying to make an observation about a patient, or a fact about biology, you don't want to just publish that fact without taking into consideration that you might have a very flat distribution and actually not know what you're talking about. That's what it boils down to: you need to know the uncertainty when you're stating any scientific fact, and that's what makes scientists scientists, that we can make informed arguments and decisions. So it's important in bioinformatics to do this type of analysis. Now, coming back to the question that was asked before: most of the time in Bayesian inference we don't know the normalization constant of p(θ | data), and so in logistic regression we actually have to do something slightly different.
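Before tackling that, here is an end-to-end sketch of everything so far on a 1-D toy problem of my own construction (not the lecture's data): draw thetas from a broad Gaussian q, weight each by likelihood × prior / q, and form the predictive as the weight-normalized average of per-sample likelihoods. The weight normalization anticipates the normalized scheme derived next.

```python
import numpy as np

# Importance sampling for 1-D Bayesian logistic regression (toy example).
# Each sample theta_i ~ q gets weight w = likelihood * prior / q; the
# predictive p(y=1 | x_new, data) is the normalized-weight average of
# sigmoid(x_new * theta_i).

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy training data generated from the model with true theta = 2
x = rng.normal(size=100)
y = (rng.random(100) < sigmoid(2.0 * x)).astype(float)

# proposal q = N(0, 5^2): broad, flat, covering the posterior's support
q_sigma = 5.0
theta = rng.normal(0.0, q_sigma, 20_000)

# log weights: Bernoulli log-likelihood + Gaussian prior N(0, 10) - log q
phi = np.clip(sigmoid(np.outer(theta, x)), 1e-12, 1 - 1e-12)  # (N, n)
loglik = (y * np.log(phi) + (1 - y) * np.log(1 - phi)).sum(axis=1)
logprior = -0.5 * theta**2 / 10.0
logq = -0.5 * (theta / q_sigma) ** 2 - np.log(q_sigma * np.sqrt(2 * np.pi))

logw = loglik + logprior - logq
w = np.exp(logw - logw.max())      # subtract max for numerical stability
w_tilde = w / w.sum()              # normalized weights, sum to 1

theta_mean = np.sum(w_tilde * theta)                 # posterior mean, near 2
x_new = 1.0
p_pred = np.sum(w_tilde * sigmoid(x_new * theta))    # Bayesian predictive
```

Working in log space and subtracting the maximum before exponentiating is essential in practice: raw products of a hundred Bernoulli terms underflow long before the ratio itself is extreme.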
I've been cheating a bit up to now, because in logistic regression we actually don't know the normalization constant, so it's hard to approximate the posterior. If I knew the normalization constant of the posterior, I could have used the estimates we had so far. In practice, what I have is my posterior density p(θ | D), and I'm just going to use capital D for the data, to write less. That's equal to (1/Z) times p(D | θ), the likelihood, times the prior p(θ), say a Gaussian. And that's also equal to p(D | θ) times p(θ) divided by the normalization constant, which is the integral of the numerator. So let's assume I want to compute an expectation; the predictive distribution is just one expectation. Let's do the predictive distribution p(y_{T+1} | x_{1:T+1}, y_{1:T}), and I'm going to write it in a slightly different way. The predictive distribution asks: what's the new y, given the new x and given the data we had? It doesn't use a single θ, because we're integrating over θ. So it's the integral of the likelihood times the posterior, and the posterior is p(D | θ) times p(θ) over Z. Z does not depend on θ, so I can take it out of the integral. Now I can rewrite this as the integral of p(y_{T+1} | x_{T+1}, θ) times p(D | θ) times p(θ), and I'm going to divide and multiply by q(θ), which changes nothing. Then, since Z was taken out, I'm going to expand it: Z is just the integral of p(D | θ) times p(θ) dθ, and I'm going to do one more thing on this line, the same trick as above: multiply and divide by q(θ). In other words, this whole thing is equal to the integral of p(y_{T+1} | x_{T+1}, θ)
times w(θ) times q(θ) dθ, divided by the integral of w(θ) times q(θ) dθ. This derivation is the most important one: normalized importance sampling for Bayesian statistics. This is the one you need to know; this is the most important page, because in Bayesian statistics you never know Z (plain, unnormalized importance sampling is used a lot in, say, computer graphics, where the normalizer is available). So what we're going to be doing is approximating two integrals at the same time: the integral for Z, and, since we're doing the predictive distribution, the integral over the posterior as well, so we're actually approximating a ratio of two integrals. Then we apply Monte Carlo to each: we apply Monte Carlo to the top and replace it by a sample average, and we apply Monte Carlo to the bottom as well and use a sample average there, because each is an expectation with respect to q(θ). We're just replacing the two expectations with respect to q(θ) by sample averages, where the samples come from the distribution q(θ). And we can simplify this a little more: the 1/N cancels between numerator and denominator. Now, and this answers the question that was asked before: the sum of the w(θ_j) in the denominator, where we can change the index to j because it's just an arbitrary summation index, involves the same w as on top, so what we're actually doing is normalizing the weights so that they sum to one, so that we have a properly weighted prediction with respect to this empirical distribution. We typically write this as the sum from i = 1 to N of w̃(θ_i) times p(y_{T+1} | x_{T+1}, θ_i), where w̃_i is essentially w_i divided by the sum over j of the w_j. So in this
case it's clear where Z has disappeared, and Z has disappeared because I'm ensuring that in my histogram, where I'm replacing the posterior by bins of width dθ, each height, say w_1, w_2, and so on, is normalized so that the sum of the w's is 1. So this histogram has area 1; it's a valid probability distribution. If you have a histogram over a discrete quantity, all you have to do is divide all the heights by the sum of the heights and you get something that sums to one. Does that make sense now? That is normalized importance sampling, and you could use it to do logistic regression with five or ten parameters. If you have models in bioinformatics with only a few free parameters, say five or six, this is a reasonable technique to try out, provided you have the right q. Now, what would be a bad q? Did anyone think of an example of a bad q? [Student: a uniform distribution, if the true θ doesn't fall inside its range.] That's correct, and you've identified an important fact: if my q is, say, uniform between 0 and 1, it can only propose thetas between 0 and 1, and if my true θ happens to be 2, I will never propose it. So the proposal distribution must include the support of the posterior: wherever the posterior has positive probability, q must have positive probability. Another thing: all these estimators do converge to the truth as the number of samples goes to infinity; they're consistent, even the normalized one, as N goes to infinity. However, they can have high variance, so there is a bias-variance tradeoff here: you'd get the right answer on average, but the estimate will oscillate about the right answer, and it can take a long time to quench those oscillations; a better q quenches the oscillations quicker. One of the things we can do in practice, if
you're using a q that is Gaussian, you could for example choose a q that has a mean, let's say a scalar μ — or a vector μ — and a parameter σ². Let's assume that you're doing this in ten dimensions. What would be a way of optimizing those parameters μ and σ based on optimization? Someone else this time: what could be an objective function — what would you be trying to minimize in order to estimate μ and σ? This is a harder question. [Student: the difference between the posterior and q.] So that's a good, reasonable strategy: you would want to make your q be very close to the posterior. The closer q is to the posterior, the better chance you have of approximating the posterior. The catch there is that you don't know the posterior — that's the quantity we're trying to estimate in the first place. But you could do maximum likelihood — just optimization — to figure out where the peak of the posterior is, and then you could try to fit the Gaussian there, and then you would use that q, and some folks do that.

But essentially, how do we test — someone else — how do we test if we have the right model? When you're doing ridge regression, how do we know that we have the right model, the right complexity, the right regularizer λ? Cross-validation, right. So if we have test data — we're still doing logistic regression here, don't forget: p(D|θ) is just the product of a bunch of Bernoullis, p(θ) is a Gaussian — you have training data and what you're trying to do is make good predictions. Now if you have an extra data set — maybe you have one for testing and one for validation — you keep this extra data set that you use just to tune μ and σ². So you adjust μ and σ² so that your predictive distribution works well on this extra data set. So you can do
cross-validation — that's one reasonable way to do this when you have a model where you're trying to do prediction. That's sort of the simple thing to do, and most often in machine learning what we care about is prediction, so never forget about cross-validation: it's your biggest friend in the world of prediction, because when you don't know what something should be, you create a test data set and you make sure that you can predict on that test set. [Student: so, given enough computation time, could you find a good μ and σ — if you pick an N that's large enough, would it work?] Oh yeah, if you have infinite computation and so on, even a σ that's not as good will work.

There's another thing you can do that I haven't covered here — for those of you who want a bit more material, I wrote a tutorial years ago called "An Introduction to MCMC for Machine Learning"; I recommend you have a look at that for more details on importance sampling. One of the things we can also do is actually compute an expression for the variance of the Monte Carlo estimator — I describe that in my tutorial — and since what we're trying to do is minimize variance, you can, using Bayesian optimization or gradients, adjust μ and σ to minimize the variance of the Monte Carlo estimator. OK, so there are lots of cool things you can do in order to get a better q: use your data, use cross-validation, or use something you know about the problem — if you know the estimator has high variance, your objective is to minimize variance. And typically what people do for q is just pick one that's heavy-tailed, to make sure that wherever p is positive, q is positive too — so a t-distribution, for example, works well. But again, importance sampling is one of those techniques that only works in low dimensions — up to, say, six dimensions this can work quite well. If you're interested in doing tracking, like
image tracking, or if you want to work in radar defence, or in estimating the volatility of a market if you're working in finance, then importance sampling is one of the key techniques used there, and often it's implemented in a sequential setting. When you implement importance sampling in a sequential setting, the algorithm is known as a particle filter, or sequential importance sampling. So this is very useful. I could go in that direction now and basically teach you how to estimate volatilities in markets and so on, but instead I'm going to stick with more machine learning topics, and we're going to study a technique that will work better — that will allow us to attack Bayesian learning in high dimensions.

Intuitively, the reason why importance sampling fails is that the q has to have enough support everywhere the posterior is high: wherever the posterior has a high value, you need q to have a high value there. But beforehand, before you know what the posterior is, if you are in a thousand dimensions, basically your challenge is that you want to make everywhere in those thousand dimensions have high probability, and a thousand dimensions is a vast space — huge — and so you don't have enough probability mass in your q to provide enough support for the posterior everywhere, before you know where the important parts of the posterior are. And so the technique fails. The alternative we're going to learn about now and in the next lecture is Markov chain Monte Carlo, and it will be a much smarter way of navigating that high-dimensional space, so that you can actually navigate the maze of parameters, and that will allow us to attack neural networks. So if you have a neural network with thousands of parameters, we will be able to use Monte Carlo to compute a posterior distribution, and the consequence of that is that the estimates — the Monte Carlo sample estimates — will
tend to be a lot more accurate than the maximum likelihood estimates, and in addition we'll know the full posterior, so we will know the confidence — we will know a lot of things. [Student: what does "unnormalized" refer to?] Unnormalized refers to this guy: the posterior is unnormalized in the sense that we don't know Z. If we knew how to compute Z — if you knew how to compute that integral by hand — then this wouldn't be so challenging. There are cases where you do know the normalization constant and you still want to do importance sampling the way we did before in these slides; that's important, say, in engineering communications — estimating error rates in channels and so on — and there are a few other problems out there: in computer graphics, when you want to simulate light in a room, we do importance sampling where you do know the normalizer. But in learning and statistics, typically the real challenge is that we don't know the normalization constant: we can write the likelihood, we can write the prior, but we can't compute the integral of the likelihood times the prior. This trick gets around that and will allow us to solve this problem. MCMC will be just another trick to get rid of Z that will allow us to compute a sample-average approximation.

Now, MCMC is based on a combination of two MCs. The first MC is Markov chain; the second MC is Monte Carlo — Monte Carlo basically means approximating an integral by an empirical average. The Markov chain will be the mechanism by which we're going to generate the samples: instead of using this proposal distribution q and reweighting, what we're going to do is simply start navigating the space, and we're going to navigate the space using a Markov chain. Now, what's a Markov chain? The best way to introduce Markov chains is to use an example that we're all familiar with, which is the web. Let's think of a web that has three pages: page 1, page 2,
and page 3, and let's just say that page 1 points to page 2, so it has a link — and by the way, what we're about to describe is the algorithm that Google uses to rank web pages. Page 3 points to page 1 with probability 0.6, and so on. Now, I've normalized these so that when you add all the entries of one row of this matrix you get one. The way to think of this matrix is: the rows and columns are labelled x_1, x_2, x_3, and what we're saying is that x_1 goes to x_2 with probability 1, and it never transitions to itself — x_1 to x_1 never happens, and x_1 to x_3 doesn't happen either. So with the matrix we're describing how we transition from the previous value, or the previous state — that is, the previous page — to the next page, and we're going to assume that the way we transition from one page to another is probabilistic. On the web, what you do is go to a website, look at all the out-links of that website, and then flip a coin: if there were two out-links to my web page and one out-link to his web page, you flip a coin, and with probability 2/3 you come to my page and with probability 1/3 you go to his. And that's pretty much what Google's PageRank algorithm does.

A Markov chain is such that the probability that you are at a specific page only depends on which page you were at before. So if I know that I'm on page 2, the probability that I'm going to be at page 3 next is 0.9 — I don't need to know what the past was; I can just read it off the graph. So a Markov process essentially doesn't have memory beyond the immediate past: to know the value at time t, you just need to know the value at t − 1. Now, there is a property that I teach in 340 — I think it's lecture 11 in my 340 class, and you can actually go through the proof there; it's a very simple application of linear algebra — which shows that if the graph has some properties, which I will soon make clear for you, then if you take any vector whose entries sum to one, and you
multiply it by that transition matrix T — the transition from one state to the next — and you take that matrix to some power, and that power goes to infinity — in other words, equivalently, you start with an initial distribution, that's the vector μ, and you keep multiplying it by the matrix T forever — then you will converge to a single vector π, regardless of what that μ is. It doesn't matter what it is, whether it's (0, 1) or (1, 0) or whatever vector you have; provided its entries sum to 1, it will always converge to the same vector π. On the web, that vector is the PageRank: it's the importance of each page. In 340 people are asked to implement this algorithm, but that's essentially what it is: just matrix–vector multiplication.

The properties the graph needs to have are these. First, the graph has to be connected, so each node has to be reachable from each other node — in other words, the graph must not be reducible to separate subgraphs, and we call that property irreducibility. The reason is that if you have two separate subgraphs, then if you start randomly travelling in one subgraph, you will never be able to visit the other subgraph. And randomly travelling — flipping a coin to decide where you go next, and keeping going with such a trajectory — is the same as doing matrix–vector multiplication; it's a randomized version of that operation. The other property: you don't want the graph to have cycles. Cycles are bad because, as this picture illustrates for this particular transition matrix, it's got this cyclical behaviour: from x_2 you go to x_1, from x_1 you go to x_2, so when you multiply any vector, let's say π = (1/3, 2/3), by that T, all it does is swap the positions of the 1/3 and the 2/3, so that vector never converges — guess what, it oscillates. How can we get rid of the oscillation? That's an easy trick: you could add a small constant, so you can make this, in other words,
an epsilon — maybe you make this one 1 − ε and that one ε — and that's equivalent to just essentially adding a self-link, so that the node either stays or goes on to the next. When you go to x_2 you have to flip a coin, and the moment you're flipping a coin you're breaking the cycle, because it's going to be random now — you're no longer trapped in a cycle. That solves the problem. So the Google guys, essentially what they do is: first they construct the matrix — a huge, extremely sparse matrix, mostly zeros, with ones indicating the links where you can go next — and then they add a small epsilon to every entry, to ensure, one, that the graph is fully connected, so it's irreducible, and two, that it's aperiodic. Once they have that matrix, the rest is just matrix–vector multiplication — except you do it in a smart way — and that's how Google ranks results for you. There are a few other heuristics that go into it, but the key important thing is indeed PageRank. There's a very nice package for building a search engine: Apache has free software — you can download Lucene and Solr; this software is available to all of you, and you can easily build a search engine using it that can index a million pages or so — except there's little reason to build a search engine these days, as you can imagine.

All right, so the consequence of the fact that any vector x_t eventually converges to a π is that once you're in the limit — if you have a distribution π and you multiply it by this matrix T — you'll still get π. Another way to say this is that you have an eigenvector equation, where you essentially say that 1 is the eigenvalue in the eigenvector equation — your Ax = λx, except that here A is T and x is π. So π is the probability vector indicating the probability of each webpage, and T is just the matrix of transition probabilities from one page to
the other. And this is also saying that the PageRank value of a page, according to Google, is essentially the eigenvector of the web graph. Writing the matrix–vector multiplication component-wise: if you had three pages, so you had a vector π = (π_1, π_2, π_3), and you had a transition matrix that was 3 × 3, just like in our example with three nodes in the graph, then we can write things in vector notation, where you have the vector πᵀ = (π_1, π_2, π_3) such that the entries of π sum to 1. And in the continuous world, that matrix–vector multiplication — your vector π, instead of being over discrete values, think of it over bins, an infinite number of bins, so your vector becomes a function, just as with Gaussian processes: we're considering a vector with many, many entries and it eventually becomes a function. And your transition kernel, which is just the probability of one value given the previous value, is just a conditional probability p(y|x). So this property of Google PageRank — that you take the vector times the transition matrix and you get the vector back — is the same as this property, which is essentially marginalization: in other words, if you start with an initial distribution and you marginalize with respect to some transition kernel p(y|x), you'll eventually converge to the distribution π.

Now, when we do Markov chain Monte Carlo we use all these facts that apply to Markov chains, but we flip the problem completely the opposite way: what we're going to do is assume that we have π, and we will try to construct a T that will give samples from π. In other words, imagine we know the PageRank of each web page — we know the probability of each page. How would you draw samples from the graph? By doing random walks on it. That's essentially the idea of Metropolis. Now, the proof I have sounds very
convoluted, but the algorithm itself is only going to be about six lines, and I'll present it in more detail on Friday. To conclude: the property that we will have to satisfy is this matrix–vector multiplication, and in order to ensure this, all I'm going to do is introduce a condition called detailed balance: that if I'm at x_t and I move to x_{t+1}, the process is reversible — it's the same as if I were at x_{t+1} and moved to x_t. The reason why this condition is enough is that if I integrate both sides with respect to x_t, this guy will disappear, and so I will get what I need: the stationarity condition of the Markov chain.

The algorithm itself will be extremely simple; basically, what I will be doing is the following. I'm going to have an answer to the problem, and then I'm going to ask you: can someone give me an answer to my problem? You will give me an answer. If it's a really good answer, I will take your answer and drop the one that I have. If you give me an answer that I don't like, then I just stay with what I have and I reject your answer. And if you give me something and I'm doubtful as to whether your answer is better than mine, I will flip a coin to decide whether I take yours or whether I stick with mine. That's it — that's the algorithm. Now, this algorithm — the process that takes you from x_t to x_{t+1} — we will be able to interpret as the probability of x_{t+1} given x_t, and in fact that algorithm is a transition kernel, a Markov chain, on an infinite space, a function space. Without going into the theory, the actual algorithm is quite simple — it's just about six lines of code — and it will work a lot better than importance sampling. Right, that's
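To make the normalized importance sampling recipe from the lecture concrete, here is a minimal sketch. The unnormalized target is an invented toy (7 times an N(2, 1) kernel, standing in for likelihood × prior with unknown Z), and the proposal is a wide Gaussian; neither comes from the lecture's slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalised target p~: we can evaluate likelihood x prior, but not Z.
# Purely for illustration, it is 7 times a N(2, 1) density kernel.
def p_tilde(theta):
    return 7.0 * np.exp(-0.5 * (theta - 2.0) ** 2)

# Proposal q = N(0, 3^2): wide enough to cover the target's support.
theta = rng.normal(0.0, 3.0, size=100_000)
q_pdf = np.exp(-0.5 * (theta / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))

w = p_tilde(theta) / q_pdf        # unnormalised importance weights
w_tilde = w / w.sum()             # normalise so the weights sum to one

posterior_mean = np.sum(w_tilde * theta)   # approx 2.0; Z never needed
```

Because the same weights appear in the numerator and the denominator, the constant 7.0 in front of the target cancels out — that is exactly why Z disappears.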
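The support condition from the lecture — q must be positive wherever the posterior is — can be seen failing in a toy run. This is again a hypothetical example: the target is an unnormalized Gaussian kernel centred at θ = 2, and the proposal is Uniform(0, 1), which can never propose a θ near 2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unnormalised target centred at theta = 2 (standard deviation 0.5).
def p_bad(theta):
    return np.exp(-0.5 * ((theta - 2.0) / 0.5) ** 2)

# Bad proposal: Uniform(0, 1) can never propose a theta near 2.
theta = rng.uniform(0.0, 1.0, size=50_000)
w = p_bad(theta) / 1.0            # the uniform density is 1 on [0, 1]
w_tilde = w / w.sum()

estimate_bad = np.sum(w_tilde * theta)
# The estimate is trapped in [0, 1]; the true mean, 2.0, is unreachable.
```

No amount of extra samples fixes this: the estimator converges, but to the mean of the target restricted to [0, 1], not to 2.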
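The heavy-tailed-proposal advice — pick a q, such as a t-distribution, whose tails dominate the target's so the weights stay bounded — can be sketched like this. The target is again an invented unnormalized Gaussian kernel, and the Student-t density is written out by hand so the example needs only NumPy and the standard library.

```python
import numpy as np
from math import gamma, sqrt, pi

rng = np.random.default_rng(2)
nu = 3  # low degrees of freedom -> heavy tails

def t_pdf(x, nu):
    """Density of the standard Student-t distribution with nu dof."""
    c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
    return c * (1.0 + x ** 2 / nu) ** (-(nu + 1) / 2)

# Unnormalised target: an N(1, 1) kernel. The t proposal's polynomial
# tails dominate the Gaussian tails, so the weights stay bounded.
def p_target(x):
    return np.exp(-0.5 * (x - 1.0) ** 2)

theta = rng.standard_t(nu, size=200_000)
w = p_target(theta) / t_pdf(theta, nu)
w_tilde = w / w.sum()

estimate_t = np.sum(w_tilde * theta)   # approx 1.0
```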
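The matrix–vector iteration described for PageRank can be run directly. The 3 × 3 transition matrix below is a made-up one consistent with the lecture's description (page 1 always goes to page 2, page 2 goes to page 3 with probability 0.9, page 3 goes to page 1 with probability 0.6), not the exact matrix on the slides.

```python
import numpy as np

# Hypothetical transition matrix for a 3-page web (each row sums to 1).
T = np.array([[0.0, 1.0, 0.0],    # page 1 always links to page 2
              [0.1, 0.0, 0.9],    # page 2: to page 1 w.p. 0.1, page 3 w.p. 0.9
              [0.6, 0.4, 0.0]])   # page 3: to page 1 w.p. 0.6, page 2 w.p. 0.4

def power_iterate(mu, T, steps=200):
    """Repeatedly apply mu <- mu T; converges to the stationary vector pi."""
    for _ in range(steps):
        mu = mu @ T
    return mu

pi_a = power_iterate(np.array([1.0, 0.0, 0.0]), T)
pi_b = power_iterate(np.array([0.2, 0.3, 0.5]), T)
# Both starts give the same pi (approx [0.252, 0.394, 0.354]): the PageRank.
```

This graph is irreducible and aperiodic (it has cycles of length 2 and 3, whose gcd is 1), so the iteration converges for any starting vector whose entries sum to 1.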
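The ε self-link trick for breaking cycles shows up clearly on the two-state chain from the lecture that just oscillates; a small self-link makes it aperiodic and the iteration converges.

```python
import numpy as np

# A periodic two-state chain: all the mass swaps back and forth forever.
T_cyclic = np.array([[0.0, 1.0],
                     [1.0, 0.0]])

# The fix: shave epsilon off each move and give it to a self-link.
eps = 0.01
T_fixed = (1 - eps) * T_cyclic + eps * np.eye(2)

def run(T, mu, steps):
    for _ in range(steps):
        mu = mu @ T
    return mu

mu0 = np.array([1 / 3, 2 / 3])
osc = run(T_cyclic, mu0, 999)   # odd number of steps: still (2/3, 1/3)
conv = run(T_fixed, mu0, 999)   # converges to the uniform vector (0.5, 0.5)
```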
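Finally, the accept/reject story at the end — take a proposed answer if it's better, otherwise flip a coin — is, for a symmetric proposal, the Metropolis algorithm. Here is a minimal sketch on a made-up one-dimensional unnormalized target (an N(2, 1) kernel); note that the algorithm only ever uses ratios of the unnormalized target, so Z is never needed.

```python
import numpy as np

rng = np.random.default_rng(3)

# Unnormalised target (Z unknown): an invented N(2, 1) kernel.
def p_tilde(x):
    return np.exp(-0.5 * (x - 2.0) ** 2)

x = 0.0                   # my current "answer"
samples = []
for _ in range(50_000):
    prop = x + rng.normal(0.0, 1.0)        # a symmetric random-walk proposal
    # Accept a better answer outright; accept a worse one with
    # probability p~(prop) / p~(x) -- the coin flip. Z cancels in the ratio.
    if rng.uniform() < p_tilde(prop) / p_tilde(x):
        x = prop
    samples.append(x)                      # record the state either way

samples = np.array(samples[5_000:])        # drop the burn-in
mcmc_mean = samples.mean()                 # approx 2.0
```

Unlike importance sampling, no proposal has to cover the whole space at once: the chain only needs local moves, which is why this scales to high dimensions.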
Info
Channel: Nando de Freitas
Views: 75,410
Rating: 4.8374557 out of 5
Keywords: nando de freitas, machine learning, mcmc, monte carlo, importance sampling, pagerank, markov chains, Bayesian inference, logistic regression, probability, statistics, Bayesian statistics, power method
Id: TNZk8lo4e-Q
Length: 76min 18sec (4578 seconds)
Published: Fri Mar 22 2013