A Primer on Gaussian Processes for Regression Analysis || Chris Fonnesbeck

Captions
All right, it's three o'clock, everybody, so I'll try to stay on time. My name is Chris Fonnesbeck. I work in baseball operations and player development for the New York Yankees, though I live in Nashville. We are hiring, by the way, particularly if you're a data engineer, so we'd love to hear from you. I'm going to talk today, obviously, about Gaussian processes.

First, some boring technical stuff: if you haven't seen the repository already, here it is. Don't worry if you haven't; you don't have to do everything hands-on, but you can if you want. It's github.com/fonnesbeck/gp_regression. There are instructions for installing a conda environment and so on on the README page. If you did install it already, you might want to look at it again, particularly if you're on macOS Catalina; I've run into problems with the new version of macOS and some of the GP stuff. If all else fails, there's the happy little "launch binder" button, so you can spin up a machine remotely wherever Binder lives and run it that way. Or you can just sit back, relax, and treat this like a lecture — it's up to you. I won't move this repository; it will always be there for you.

So, how many here have used Gaussian processes before in their own work? Okay, good. How many people have used Bayesian methods in their own work? How many people have used PyMC3? Okay, this is perfect. All right, let's get going. My original plan was to do a whole bunch of notebooks, and I changed course at the last minute, so I have a sort of hybrid approach: a lot of this material is expository — didactic, is that the right word? — and I think slides are better for that sort of thing, and then we'll switch over to particular notebooks when we need them. The idea is to give you a little background and then some hands-on practice building GPs on your own. I probably have more examples than we'll get through, which gives us some flexibility in case we run short or long on time. So please interrupt me and ask questions. GPs aren't the easiest topic in the world to understand the first time around, even for very smart people; it took me five or six passes before I really got it. So even if you leave today not knowing quite what they're all about, don't worry — it's not you, and it may not even be me; it's just a hard topic.

Okay, what we want to do here is give you some methodological practice and some implementation practice. This isn't a theory course, even though you will see math, and I will explain all the math, but do ask questions if anything is unclear. The reason we're talking about Gaussian processes is that they're very useful. Like Bayesian methods in general, they're a very flexible method for doing nonlinear regression and classification — though we won't talk much about classification here — in a whole bunch of settings, and they are a class of Bayesian nonparametric models. Like I said, it's a large topic, and we're not going to cover it all in an hour and a half; this is just a quick introduction. So again, the motivation is their utility: there's a lot of messy data out there. I've spent most of my life dealing with large observational data sets, which means they're noisy, they're nonlinear, there's missing data, and the default statistical approaches are often not suited
for that messiness. If you look at a textbook example of regression — this one I found in a textbook — this is what you get: the points are already lined up for you and you just have to draw a line through them, simple as that. This one is regressing, it looks like, reported weight on measured weight — people lie about their weight, or something. Those are the sorts of examples you get; you feel good coming out of class, and then you go look at real data and nothing works, and it makes you sad. Real data looks more like this. Some of these are baseball examples: here's the swing speed of a bat against the path angle of the bat — not linear. This one is some factor over time; I'm not sure where that one came from. These are marathon data — having just had the New York Marathon, here's the Boston Marathon, a histogram of finishing times. And another baseball example that is highly, highly nonlinear. So real data looks like that, and we want methods for dealing with it.

I'm going to focus on applications to regression analysis. I assume a lot of you have seen regression before, so just one or two slides on what regression really is. In its most general form, a regression model makes predictions about some outcome y conditional on a predictor variable, or set of predictor variables, x, and the function g relating them can be almost anything. The simplest form is the linear regression model: X is a matrix of predictors times some regression coefficients — an intercept plus a bunch of slopes. That's like the example I showed you before, the very simplest one, univariate linear regression. These are the default assumptions; the default approach to analyzing data is to start with a linear model, linear relationships among the variables. The other important assumption is parametric distributions for the data — also that y is normally distributed: you make a prediction based on the predictor variable, and the difference between the prediction and the observation has a normal distribution centered on zero. Sometimes that works, sometimes it doesn't. These things generalize. This is an example we're going to see in one of the notebooks — spawning salmon and the number of recruits they produce — and linearity holds in some places and not others.

But we can generalize linear models. The general form is generalized linear models, where g is some function other than the identity, and maybe the parametric form of the data distribution is not normal. A common one is logistic regression: if you've got failures as zeros and successes as ones and you try to fit a straight line, you get this weird line that's not very good at predicting anything. So we transform it — we apply the logit transformation. This is the inverse logit: it takes the linear part and puts it on the scale of 0 to 1, so you're modeling the probability of a 1 instead. It's still a linear model, actually, even though the curve is nonlinear, because it's still linear in the parameters. Another way of dealing with nonlinearity without leaving generalized linear models is polynomial regression.
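Written out in standard notation (my notation, not the slides'), the cases just described are:

```latex
\text{linear: } y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)
\qquad
\text{logistic: } \Pr(y_i = 1) = \text{logit}^{-1}(\beta_0 + \beta_1 x_i) = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 x_i)]}
\qquad
\text{quadratic: } y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i
```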
With the spawning salmon, for instance, even with one predictor we could take the number of spawners squared, add that polynomial term, and make the model quadratic; we'll actually do this in the hands-on example. And you can go crazy: the more polynomial terms you add, the closer the fit to the data used to fit the model, and we call that overfitting. Implicit in this is another step you have to do, which is model selection: you've got to pick the right size of polynomial, one that neither underfits, like the linear example, nor overfits — that 15-degree polynomial on the other side almost certainly overfits.

Whether you use linear models, nonlinear models, or generalized linear models, you have another choice to make: what set of methods, what paradigm, do I use to fit these models? We're going to use Bayesian methods here. This is where probabilistic programming comes in, which is one of the newer buzzwords in data science; it really just refers to numerical, computational methods for doing Bayesian inference. The reason it's called probabilistic programming is that Bayesian methods make inferences and predictions based on probabilities: everything in Bayes is a probability distribution, and you manipulate probability distributions to get answers. Almost any model you can build in a classical, frequentist sense — we probably all took classical statistics in college, if we took any; unless you went to school very recently, you were probably learning frequentist methods, t-tests and ANOVAs and regression — you can do as a Bayesian. There are Bayesian versions of almost everything.

So I'm going to quickly describe Bayesian methods, just to make sure everybody has the same baseline. This is part of Bayes' formula — the interesting parts of Bayes' formula — and what you should notice is that the thing we are estimating over here is a probability. We're saying the probability of theta given y, where theta is anything in your model that's unknown: it could be regression parameters, if we're building a regression model like we're about to; it could be hypotheses, missing data, predictions — anything you don't know, you can put into a Bayesian model and estimate it. And y is the data. So our inference is: what do we know about theta after having observed y? The way we do that is by combining what we knew about theta before we observed the data with information from the data, so there are two pieces here, and they have names. We call this one the prior distribution, because it's what we had before we saw the data, and this one is the likelihood function. Bayesian inference obeys the likelihood principle, which says that all information from the data relevant to the unknown parameters in your model is contained in the likelihood, and that is combined with the prior to get the posterior distribution.

But you'll notice there's no equals sign here. I didn't write an equals sign; I wrote the symbol for "proportional to" — it's equal up to a constant. What's missing is the normalizing constant in the denominator that guarantees this is a probability: probabilities go in, but the product on its own is not a probability, so we have to scale it. The problem is that that's the hard part. It's the least interesting part, but it's the hard part, and it's the part that has made Bayesian inference really, really difficult.
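For reference, the formula being described — the unnormalized version on the slide, plus the full version with the normalizing constant — is:

```latex
\underbrace{p(\theta \mid y)}_{\text{posterior}} \;\propto\; \underbrace{p(y \mid \theta)}_{\text{likelihood}}\;\underbrace{p(\theta)}_{\text{prior}},
\qquad
p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{\int p(y \mid \theta)\,p(\theta)\,d\theta}
```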
The denominator is just the numerator integrated over all values of the unknowns theta — you integrate out your unknowns — so we call it the marginal likelihood. The reason it's difficult is that theta can be pretty big: there can be lots of parameters in your model. Even the simple regression model we're about to fit has three parameters — an intercept, a slope, and a variance (or standard deviation) of the error — so that's a three-dimensional integration. I can't do that. I can do one dimension, sometimes two if you give me the right easy formula; three, forget about it. Most of the models I build have dozens to hundreds of variables, so you just can't do this in closed form except in very special cases. Any questions about Bayes so far? I know some of you have done this before, so this is boring review, but does anything need clarification? Okay.

That's really all I'm going to say about Bayes: the whole point is calculating these posterior distributions, and happily, today we have lots of ways of doing that — lots of ways even just in the context of Python, since this is a PyData conference. In fact, there's a Stan tutorial going on right now. We'll use PyMC for a couple of reasons: one, I'm part of the development team; two, it's a really easy way to learn Bayesian inference, because it's a very high-level probabilistic programming language. It hides the complexity of the Bayesian methods, and it certainly hides the numerical methods used to fit them. So we're not going to concentrate on what MCMC is and all that; we'll take it as given for the moment and concentrate on how to specify Gaussian process models. The project was started back in 2003, when I was a postdoc, and it's based on Theano — pour one out for Theano, which doesn't really exist anymore. Well, it's still there, you can still find it, but the project is no longer active. It's a framework for doing deep learning, essentially, and what we've done is hijack that deep-learning engine to do probabilistic programming, because (a) it's a very powerful computational engine, and (b) it does automatic differentiation for you — it calculates derivatives automatically. The reason we need that is that in PyMC we implement what I would call next-generation Bayesian inference methods, and those methods rely on gradient information to give you a better idea of what the posterior distribution looks like. So we're going to be using methods called Hamiltonian Monte Carlo and NUTS, which are gradient-based MCMC methods. PyMC is a NumFOCUS sponsored project, and you should all have it installed on your computer.

So let's do the first hands-on. If you're doing the hands-on, I'll do it up here too; if you're not playing along, you can just watch. One of the notebooks in the repository is the spawning salmon notebook (.ipynb) — let me open that up. I've opened it here already; I think I saved them to the repository unrendered, but I will clear everything out. Can you see that at the back? (What is the repository? It's github.com/fonnesbeck/gp_regression, and there's a Binder badge on there, so you can actually run it in Binder if you don't want to do the whole installation.) Okay, so this is the example that I showed you already.
This is the spawning salmon data, and the goal here is twofold. We're not doing GPs quite yet; we're going to build a regular regression model, but with a Bayesian approach, using PyMC3, so it gives you exposure both to regression and to PyMC. Building any Bayesian model, whether it's parametric or nonparametric, is a three-step process: one, specify your model; two, calculate the posterior distribution; three, check your model. The first step, specifying the model, is really here: we write it down and we turn it into code. This is the same regression we saw earlier — linear regression with normal errors — and what you have to do is figure out what your priors are and what your likelihood is. Remember that unnormalized Bayes formula: we have to specify the right-hand side, the likelihood and the priors. In this model we have three unknowns — beta-zero, beta-one, and sigma — and one likelihood, the normal distribution, which is where the data goes in.

What I'm going to do is specify zero-mean normals with a standard deviation of 50 for the betas, and a half-normal with a standard deviation of 50 for sigma. Why did I choose those? Because I like those numbers — they're just reasonably large, not too big, not too small. Part of model checking is going back and seeing whether those were good assumptions. The idea is to ask how big these quantities could plausibly be, given the biological constraints of salmon spawning, and how small. These priors allow values out to roughly plus or minus 150, and we're pretty sure the true values don't go beyond that. And the half-normal is just the absolute value of a normal, essentially. So those are my priors, here's my data, and then the PyMC model looks like this. I'll quickly give you the tour. PyMC3 uses the notion of a Python context manager. Context managers are usually used to manage network connections and files and things — they take care of the business of opening and closing connections. We use them to build our models: rather than instantiating a model and saying add this parameter, add that parameter, add another parameter — which is fine, just a little verbose — any PyMC object created inside the context manager is automatically added to the model. (Yikes, a little hiccup with the display — hopefully it's coming back; I didn't touch anything, I swear. Okay, there it is.) The nice thing about PyMC, I think, is that what you write down in code looks a lot like what you write down on the whiteboard, or in Markdown in this case. So here are my priors: beta is a normal with mean 0 and standard deviation 50, with shape=2 — I could have made beta-zero and beta-one separate normals; this is just more compact. Then my half-normal, the same sort of thing. Then I can directly calculate the deterministic part: mu is just a deterministic linear combination of the data and the parameters. And then here's my likelihood. Likelihoods look a lot like priors, don't they? It's very much the same thing; the only thing that distinguishes it is the observed argument. These values have been observed, so they're fixed. Remember, we condition on y: this is our conditioning statement, saying we've observed this node, so don't change its value.
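Here's a minimal sketch of the model just described, assuming NumPy arrays `spawners` and `recruits` for the data (the notebook's variable names may differ):

```python
import pymc3 as pm

with pm.Model() as linear_model:
    # Priors: weakly informative, as discussed above
    beta = pm.Normal('beta', mu=0, sd=50, shape=2)   # intercept and slope
    sigma = pm.HalfNormal('sigma', sd=50)            # error standard deviation

    # Deterministic expected value: a linear combination of data and parameters
    mu = beta[0] + beta[1] * spawners

    # Likelihood: the observed argument is what conditions the model on the data
    recruits_obs = pm.Normal('recruits', mu=mu, sd=sigma, observed=recruits)
```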
Observed things don't change — they're no longer random — while everything else is treated as random. That's it, that's the model; I've already run it. Once we've done that, we can press the inference button, as one of the PyMC developers, Thomas Wiecki, likes to say: this is where we hide everything from you. We fit the model using Markov chain Monte Carlo just by calling sample. NUTS is the algorithm being used; it's an auto-tuned version of Hamiltonian Monte Carlo, but don't worry about that. I'm going to sample two chains — you should always run at least two chains, for model-checking purposes — and I ran it for a thousand iterations, tuning for two thousand, which I probably didn't even need; Hamiltonian Monte Carlo is very efficient. Then we can look at the output: there's the estimate of beta-zero, about sixty, a slope of about 0.53 or so, and then sigma.

The nice thing is that we can now sample from this. Everything here represents the posterior uncertainty in those parameters: if we had collected millions of observations of spawning salmon, these would essentially be points with no uncertainty around them. As we get more information, these distributions collapse toward a point. So it's not that we're saying these quantities are random — that they can be 40 on some days and 80 on others; each is a particular value, we just don't know what it is, and we're using probability to express our uncertainty. Then we can sample from that — meaning sample candidate "true" regression lines — but this model is forced to be linear. If you squint, it looks a little nonlinear because of where the data sit, but these are just straight lines. As we said, we can generalize this to a polynomial model; we'll use a quadratic. With a quadratic we're just adding another term: we square the spawners and add it to the model. So this is exactly the same as the previous model with two changes: we have three regression parameters instead of two, and here is where we add the x-squared term. Now we have a quadratic model, and we run it just the same. It has a little more trouble with this one — there, now it's found it; it does a bunch of tuning, so sometimes it can be slow initially. Now we have a fit that's somewhat better than the last one. We might try a cubic term or something like that, or you could use a totally different model structure if you wanted. Any questions about PyMC model building or regression? Okay, we're going to go back to the slides for the moment; I can leave this open because we're going to use this data again. (Oh, sorry — how did I type the Greek letters? That's the magic of Jupyter: just like LaTeX, you type backslash alpha, a-l-p-h-a, hit Tab, and Jupyter turns it into Unicode. They're just LaTeX character names.)

Okay, so that's regression and Bayes. Now on to why we came here, which is GPs. What we're going to do is build flexible nonlinear models with Gaussian distributions — with normal distributions — and at first this doesn't seem like a very good idea, because we've seen what normal distributions look like: they're symmetric, they have narrow tails, everything is really close to the mean. It doesn't seem particularly flexible for modeling messy data.
Well, it gets better when you have more than one of them. The normal generalizes to the multivariate Gaussian, the multivariate normal. This is the bivariate case, where we have two elements, and rather than a mean and a variance we have a mean vector and a covariance matrix. The diagonal entries are just the marginal variances for y1 and y2, but the important things are the off-diagonal elements — the covariance, which includes this correlation term rho. It's the correlation between y1 and y2, and it determines how much we know about y1, say, given the value of y2 — how closely related they are to one another. Rho goes between zero and one, where zero is completely uncorrelated and one is completely correlated.

There are two important properties of Gaussians that make them very handy for building nonparametric models. The first is that it's really easy to marginalize. What we mean by marginalizing is: when you have a big multivariate Gaussian with hundreds of elements and you only want a few of them, it's really easy to get rid of the ones you don't want — to marginalize them out. Say x and y are not just two values but whole arrays of values, and we only want half of them — we want to get rid of y. In general you'd have to integrate out y, and I just finished saying how hard multivariate integration is. But if it's a normal distribution, it's really easy: you just cover up the y parts, pull out the top quadrant of the covariance matrix, and that Sigma_x is your new covariance matrix — no math to be done whatsoever. Similarly, it's really easy to condition, and it should be clear why we need conditioning: given one value in our multivariate normal, what do we expect some unobserved value, or the next value, to be? It's not quite as easy as marginalizing — it's not just a matter of pulling out individual entries — but you can write it down in closed form; there's an analytic solution. The conditional mean is the marginal mean plus some term, and the conditional covariance is the marginal covariance minus some term, both of which we can easily calculate. So we can condition and marginalize really easily.

The reason we care is that we're going to generalize the multivariate normal to contain not just many, many elements but infinitely many elements. What we get is a thing called a Gaussian process, which, rather than having a mean vector and a covariance matrix, has a mean function and a covariance function. You can think of functions as generalizations of arrays: if you had an infinite number of objects, you wouldn't want to store them and access them like a dictionary; you'd want a function that generates the one you need when you need it, because you can't store infinity on a computer. The textbook definition of a Gaussian process is an infinite collection of random variables, any finite subset of which has a Gaussian distribution. It's an infinitely large Gaussian, and any time you take a subset of it, you just have a regular multivariate normal of whatever that size is. You can see where marginalization comes in: you'll never want to deal with all infinity of them; you'll just want to deal with the ones you've observed or the ones you're trying to predict, never all of them at once.
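Since we lean on these two properties constantly, here they are written out for a partitioned multivariate normal (standard results, not taken from the slides):

```latex
\begin{pmatrix} x \\ y \end{pmatrix} \sim N\!\left(
\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix},
\begin{pmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{xy}^{\top} & \Sigma_y \end{pmatrix}\right)
\;\Rightarrow\;
x \sim N(\mu_x, \Sigma_x)
\quad\text{and}\quad
x \mid y \sim N\!\left(\mu_x + \Sigma_{xy}\Sigma_y^{-1}(y - \mu_y),\;
\Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{xy}^{\top}\right)
```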
Once again, statisticians have named things poorly: "nonparametric" seems to imply no parameters, but really we have an infinite number of parameters. A better way of thinking about it is that the effective number of parameters scales with your data: more data lets you build a bigger model, less data a smaller one. That's usually a good thing — you don't want to fit a small model to large data, and you don't want to fit a model that's too big to a small data set; it's the bias-variance tradeoff, more or less.

A couple of things to notice. What we're doing here is modeling functions: every realization of a Gaussian process is not a set of numbers, it's an entire function. So we're going to model the underlying function — of spawning salmon, or marathon finishing times, or the age effects of some disease — directly, not just its parameters. In particular, for the covariance function, a realization is a covariance matrix: you pass the function a set of input values and it returns the covariance matrix corresponding to those values. The most important decision in building a Gaussian process is choosing the covariance function, and depending on which one you choose, you can fit different types of functions. For example, there's the exponentiated quadratic — there are lots of names for this one; sometimes it's just called the Gaussian covariance function, which is a bit misleading; ExpQuad is what we call it in PyMC3. It has one main parameter, a length scale, which determines how far apart two points have to be before they become more or less independent, and you get these smooth-ish lines. This is a prior — no data — this is sampling from a Gaussian process prior with an exponentiated quadratic covariance function. Another popular one is the Matérn; it's a little more flexible, a little more jagged, as you can see. It also has a length scale, but it has an additional parameter that's usually fixed to something like one-half, three-halves, or five-halves, and that determines how rough the function is. So if you have something that can change very quickly over small scales, this can be a good choice. You can also model things that are periodic — there's a cosine covariance function — and there's a whole suite of them; any of the packages that do GPs will come with a library of covariance functions. These covariance functions are basically just kernels, if you're familiar with kernel methods: they're functions of distance, because that's the whole game here. Remember the covariance between the two components in that first bivariate normal — that's where the power is: what does this point tell me about that point?

Then there's the mean function. We don't pay a lot of attention to the mean function; it's a function that generates mean vectors, and we'll usually just set it to something simple like zero or a constant — maybe a linear function, for the growth term in that salmon spawning example.
The reason is that it turns out the posterior Gaussian process doesn't depend on the mean function very much, which seems a little weird, but it's true. What it is useful for is prior information: you can think of the mean function as a prior guess at what the underlying function might be, because wherever there's no data, what does a Bayesian model do? It shrinks toward the mean — that's the only information you've got. With no information in your data, everything is based on your prior assumptions. So for something like a growth curve, you could assume that, in the absence of information, the function just keeps growing.

All right, what I'm going to do now is draw from a prior Gaussian process. These are priors over functions — another definition of a Gaussian process is a distribution over functions — so rather than a distribution over values, every realization is a function rather than a number, and sometimes this presentation is a little confusing because of that. I'm going to generate functions from this prior point by point, though you don't have to do it that way. Here's the prior: it has a zero mean and an exponentiated quadratic covariance function with length scale one, and it looks like this — that's just a two-standard-deviation band. What I'm going to do is pick an arbitrary point, say x = 1, and draw a value. This draw is unconditional, because no points exist yet. The code is down here: we go into NumPy, draw a random normal with whatever our scale is at that point, and we get some number — obviously it will be different every time. We're going to assume for the moment — this is a prior — that there's no observation noise. That's the first point of this drawn function. The next point I draw has to be conditional on the first, so I draw from a conditional distribution — and remember, conditioning is easy for normals. There's a conditional function here that I've implemented; it's just the conditional formula I showed before, no magic. I pick, arbitrarily, the point negative 0.7, draw a conditional normal, and there it is. You can see we're building a link sausage here: we're carving away at the space, and you can already see what this line is going to look like, based on what our covariance function is — it'll look more or less like that red line. In principle we could take infinitely many points if we had the time; we don't need to in order to characterize it. That's just one random draw; here are four more from the same prior. So these are draws from the prior on whatever it is we're trying to model, and we convert that into a posterior once we've seen some data — and we usually need software to do that.

Questions about the GP prior? I'd be surprised if there weren't any. [Audience question.] No, it's not even uniform — it's a normal, an infinite normal, with a mean of 0 and this covariance function that describes the covariance matrix between any set of points. I chose the points arbitrarily — I just picked them.
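To make the point-by-point demo above concrete, here's a rough sketch of drawing from a zero-mean GP prior and then conditioning on the first draw. The helper functions are my own, not the notebook's, and assume an exponentiated quadratic covariance:

```python
import numpy as np

def exp_quad(x, x_prime, ls=1.0):
    """Exponentiated quadratic covariance between two sets of 1-D points."""
    return np.exp(-0.5 * np.subtract.outer(x, x_prime) ** 2 / ls ** 2)

def conditional(x_new, x_obs, f_obs, ls=1.0):
    """Mean and covariance of f(x_new) given function values f_obs at x_obs (zero-mean prior)."""
    B = exp_quad(x_new, x_obs, ls)
    C = exp_quad(x_obs, x_obs, ls)
    A = exp_quad(x_new, x_new, ls)
    mu = B @ np.linalg.solve(C, f_obs)
    cov = A - B @ np.linalg.solve(C, B.T)
    return mu, cov

# First, an unconditional draw at x = 1.0 from the prior (unit variance at a single point)
x_obs = np.array([1.0])
f_obs = np.random.normal(loc=0.0, scale=1.0, size=1)

# Next, a draw at x = -0.7, conditioned on the point we just drew
mu, cov = conditional(np.array([-0.7]), x_obs, f_obs)
f_new = np.random.multivariate_normal(mu, cov)
```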
As for how I'd generate a whole bunch of points: I'd do numpy.linspace(0, 100) and draw the whole line at once rather than point by point; I did it point by point for illustration's sake. Each point constrains the possibilities for the points next to it, but not so much the points far away — correct, unless I had a really big length scale. If the length scale is really big, things are highly correlated even far apart, and you'd get big sweeping curves instead; you still eventually get something that represents a line. There'd be no reason to pick points randomly; we're just doing conditioning. [Question: could you model Brownian motion this way?] We don't know — this is just a prior — but you wouldn't use an exponentiated quadratic to model Brownian motion; the Matérn would be closer, because Brownian motion is very jagged and the functions I was drawing were not very jagged. Any other questions? Yes — is it right to think of it as just evaluating the function at every point you need? Yes, because you can marginalize or condition any way you want: the infinite part is out there, and any point you want access to is always there for you to calculate, but in practice all you're interested in is the points where you've observed data, the ones you're trying to impute, or the ones you're extrapolating to. Because you've got a function, you can always ask for a point; you just don't hold them all in memory.

Okay, there are lots of ways of doing GPs really easily in Python, and they range from high-level to low-level interfaces. TensorFlow Probability just added them, and they're very low-level — if you like writing TensorFlow, have at it. PyStan is also very low-level; you kind of have to roll your own. PyMC and scikit-learn are very high-level — we'll see how we do it in PyMC, but in scikit-learn it's just a class, like random forests or anything else. Then there are GPy, GPflow, and GPyTorch, which are sort of in the middle: you have to write classes, which some people like to do and others do not. So you have a wide range of approaches.

All right, let's go back to our salmon recruitment. How would we do a GP for the salmon data we just fit linear models to? The data look like that again, and the cool thing is we don't have to say whether the relationship is polynomial or exponential or log-linear or whatever; all we have to do is the usual Bayesian stuff. It works just like any other Bayesian model: specify your model — specify your probabilities — calculate posteriors, do model checking. We have to specify a prior and a likelihood, and in this case the prior is going to be a GP — the thing we were just manipulating — and our data are going to be Gaussian. You can argue about whether this data is really Gaussian — they're counts, numbers of spawning offspring, so maybe something else would be better — but it works as Gaussian and we'll treat it that way for now. The nice thing is that a Gaussian process prior plus a Gaussian likelihood gives you a Gaussian process posterior in closed form; the relationship is known as conjugacy, and there are a whole bunch of examples of conjugate models in Bayes — this is one of them. So it gives us a closed-form way of doing this. So in PyMC, what we've got to do now is, just as before, specify priors for our unknowns.
Except this time our unknowns are the hyperparameters of the covariance function. We're going to use the exponentiated quadratic again; this is that length scale I told you about, and we've added another parameter here, an amplitude — a signal variance. The length scale determines how things vary along the x axis — it compresses and expands in that direction — while the amplitude is a value you multiply by, so it expands things on the y axis. So we've got two unknowns here. We don't have to include the amplitude, by the way; it just makes the model more flexible — any time you add a parameter, you're adding flexibility. I'm going to use half-Cauchy priors. What's a half-Cauchy? A Cauchy distribution is kind of like a normal distribution with really, really long, fat tails; it allows the possibility of extreme values. Remember the normal has the property that about 68% of values fall within one standard deviation of the mean and about 95% within two. Not so for a Cauchy: it's very peaked, and then it has really broad tails. And this is just the positive half, because these are scale parameters, so they always have to be positive — we don't want negative values. It's a good prior because it says we're pretty confident the value is relatively close to zero, but it leaves open the possibility of really extreme values, so you're not boxing yourself into a corner.

So here's our exponentiated quadratic: there's rho, there's eta. And here's my mean function — remember I said we could set the mean to anything, and here's where I'm putting a linear function; I'm just taking the rise over the run of the data, which is kind of cheating for Bayes, because I'm using the data to specify the prior, but it really won't matter — you could easily just put 0 there. Then what we're going to do is use a marginal Gaussian process. It's "marginal" because of that closed-form relationship: Gaussian data with a Gaussian process prior gives a Gaussian process posterior, and all we have to do is optimize the marginal likelihood — that denominator — which we now have in closed form. There it is, the marginal likelihood. And just as in any model with data, there's noise: this is the observation noise, because not everything is going to be described by the underlying process. So here's the marginal likelihood; it's the equivalent of the regular likelihood in a Bayesian model, except it's the marginal one, and we pass it X — the grid of points where we have observations, because we don't want to deal with infinite points — then the outcomes y, and there's my noise. Then all I'm going to do is optimize. The cool thing is that you don't have to do the whole MCMC thing, although you can; we can just optimize. find_MAP stands for "find the maximum a posteriori", the maximum of the posterior distribution, so it just hill-climbs to the maximum value. If we're interested in the uncertainty of our hyperparameters, we can do MCMC instead — we'll have an example where we do that — but you don't have to, so officially this isn't, you know, fully Bayesian.
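Putting the pieces together, the marginal GP model just described looks roughly like this (a sketch assuming 1-D arrays x and y for spawners and recruits; the prior scales are illustrative, not the notebook's exact values):

```python
import numpy as np
import pymc3 as pm

X = x[:, None]  # PyMC3's GP classes expect a column matrix of inputs

with pm.Model() as salmon_gp:
    # Hyperparameter priors: positive, peaked near zero, but heavy-tailed
    rho = pm.HalfCauchy('rho', beta=5)      # length scale
    eta = pm.HalfCauchy('eta', beta=5)      # amplitude (signal standard deviation)
    sigma = pm.HalfCauchy('sigma', beta=5)  # observation noise

    # Covariance and (linear) mean functions
    K = eta**2 * pm.gp.cov.ExpQuad(1, ls=rho)
    M = pm.gp.mean.Linear(coeffs=(y.max() - y.min()) / (x.max() - x.min()))

    # Marginal GP: Gaussian prior + Gaussian likelihood, so the marginal likelihood is closed form
    gp = pm.gp.Marginal(mean_func=M, cov_func=K)
    y_obs = gp.marginal_likelihood('y_obs', X=X, y=y, noise=sigma)

    mp = pm.find_MAP()  # hill-climb to the maximum a posteriori estimate
```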
Okay, and then how do we make predictions? Once we have estimates where we've observed data, how do we go back and get other values — say, if we wanted to draw the full line? This is no different from any other Bayesian setup: we use the posterior predictive distribution. The posterior predictive distribution is how you make predictions with a Bayesian model, and it involves two components. There's the posterior, the thing we've just been estimating — that's our epistemic uncertainty, the uncertainty about the unknowns — and then there's aleatoric uncertainty: things like observation noise. You've got to incorporate the stochasticity of the data-generating process — essentially the likelihood — as well as the uncertainty in the parameters. The good news is, don't be afraid of the integral sign: we get this essentially for free by sampling from the model. All we do is pass in the new points we're interested in and it gives us predicted values after the model has been fit, just like in scikit-learn: you fit your boosted tree model, then you call predict and it generates predicted values. Same thing here. In PyMC it looks like this: we create a conditional — we're now doing conditional prediction over some range of spawners, say everything from 0 to 500; that's where linspace comes in, we're just going to predict from 0 to 500. And — oops, this slide is out of date: we renamed sample_ppc to sample_posterior_predictive, because it's more descriptive — then we pass it the fitted model and the variables we're interested in, which is this conditional one, and I'm going to draw three samples, which we'll see if we go back to our notebook, which we'll do now.

Any questions about specifying the model? We're just about to do it by hand. So here it is — here's the GP, exactly what I had on the slides — and feel free, this is your hands-on time: if you want to play with different length scales or different priors, go ahead. So here I'm actually going to do MCMC; I thought I had done maximization — we can do both, actually. The difference is that with find_MAP I just get single values for each hyperparameter, the estimated maximum; with MCMC I get distributions of values, whereas find_MAP would have given me that value and that value and probably something close to zero here. Then here's the conditional piece: now I'm going to draw just three candidate lines. Each of these could potentially be the true underlying process, and it's kind of interesting, because the whole conundrum with nonlinear models is: do you believe there's this weird dip around 350 spawners? Probably not, if you know anything about the biology — there's no reason it should happen — but the data are the data, and some of the lines think it's there and some don't. You can generate a whole bunch of these, so I'll go back and generate a hundred, and you can see that basically every line is like a vote for the true posterior.
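For reference, the prediction step shown on the slide looks roughly like this, continuing the sketch above (setting pred_noise=True would add the observation noise back in, which comes up again below):

```python
# Grid of new spawner values to predict over
X_pred = np.linspace(0, 500, 100)[:, None]

with salmon_gp:
    # Latent function values at the new points (no observation noise by default)
    f_pred = gp.conditional('f_pred', X_pred)

    # Posterior predictive draws, here conditioned on the MAP point estimate
    pred_samples = pm.sample_posterior_predictive([mp], vars=[f_pred], samples=3)
```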
Most of those lines pass close to the data, but there's still uncertainty there; if we had collected millions of points, it would basically be a single line through the data. These are draws from the posterior without any observation noise — this is what we think the true underlying line is. We can also draw ones with prediction noise, including the noisy bit: that's what you'd use to predict new observations, whereas here we're predicting the underlying line, which isn't subject to that stochasticity. With the noise included, you should see most of the points fall within the band you get. So if you want to generate new predicted observations, include the noise; if you're interested in the latent process, so to speak, leave the noise out. I'll skip this exercise and leave it for you.

What happens if we want to project outwards — what happens at six hundred, eight hundred spawners and beyond? Well, if we know something about the biology, the curve should start to bend over, because there's density dependence: too many salmon, it gets crowded, and they can't reproduce as well. But just like anything else, you have to be careful about extrapolation with Gaussian processes, because once you get far enough away from the data, what does the model rely on? The prior. It goes back to the prior, and remember the prior mean here was just a straight line, so it'll say "salmon all the way up." How far you can trust an extrapolation depends on how big your length scale is: if it's pretty long, you might be able to trust a bit of extrapolation, but otherwise, no. There's an example in the PyMC documentation with the CO2 data — concentrations going up and up and up, and when you extrapolate, the curve comes back down again, and you'd say "oh, we're saved, it's all going to go down in a few years" — but that's just because it used a zero prior mean, and once there's no data left, it goes back to the mean. So you have to be very careful about extrapolating.

How are we doing on time? Half an hour — cool, pretty good. Questions? Yes. [Question about including the noise versus not.] You may be interested in two different things. One is: what do I think the underlying reproduction curve looks like — the expected value? But there's noise: these values down here are probably due to noise, not to the underlying process. If you don't include the noise, all you're getting is the estimated posterior of that underlying function you're trying to estimate. If you add the noise, you're including how far away from that expected value observations could be, because we've also estimated sigma — essentially our white-noise process. That's better for predicting observed values. In other words, if we observed a value way down here, then something's wrong — either we've underestimated the noise or we've estimated the wrong model — but all of these points sit, maybe not comfortably, but well inside this envelope of uncertainty, which includes stochastic uncertainty as well as parametric uncertainty. So: are we interested in estimating the function, or in predicting new values? This version, with the noise, is for predicting new values.
Okay, all right. So we saw this already — oh, this one's on the log scale; sometimes it's nicer to look at on the log scale — that's the same thing with the stochastic noise included.

All right, on to the next example. You could accuse me of being all textbook on you — a normal prior and normal data — so let me give you something different. (Sorry about that — these mics are actually our own that we brought for the event, because the building didn't have A/V for these rooms. That looks like a green light at least — are you hearing me okay? Turn it off and on again; works every time, doesn't it.) What was I saying? Oh yes — counts of disasters. Every year there are mining collapses and the like, and these are count data. Typically you don't model count data with normal distributions: one, counts are discrete rather than continuous, and two, there's usually a mean-variance relationship — as the mean goes up, the variance usually goes up. What I've just described are characteristics of a different distribution, the Poisson, so you'll often see Poisson distributions used for count data. Our likelihood is going to be different now: we assume our data are Poisson distributed, and our prior is still a Gaussian process. Gaussian prior plus Poisson likelihood means we can't use our nice marginal GP anymore; we have some sort of transformed Gaussian process, so we've given ourselves a slightly bigger challenge. For that we use a slightly different kind of Gaussian process, which we call a latent GP. "Latent" means there is a Gaussian thing in there somewhere, but we're going to transform it so that it becomes the mean of a Poisson — and the mean of a Poisson has to be positive, because it models counts, and you can't have negative counts.

Everything looks very similar — it's the same up to here, except I used exponential priors, inexplicably, rather than half-Cauchy; it probably doesn't matter, and as an exercise you could swap them and see that it doesn't. Other than that it's the same, except that instead of Marginal it says Latent. Latent just means there's another step — a different posterior distribution; you can't use the closed form from before. The transformation we make is to take this prior — the GP prior, which is Gaussian and can take negative values — and make it positive. How do you make something that can be negative positive? Take the exponential of it — the inverse of a log transform. The exponential of whatever f is guarantees all your values are positive, so we've done away with that boundary. In PyMC those sorts of transformations get wrapped in Deterministic nodes — this is just an arbitrary deterministic transformation — and then we use a Poisson likelihood rather than a normal: the likelihood is Poisson, with expected value lambda, here are the disasters, and here is the year. It expects a 2-D array, so we have to add an empty dimension. So that's the big change: rather than Marginal we have Latent, with a prior. Then, same as before, we call MCMC on it, and let's give it a try. This is the mining disasters data, a small data set, so it's right there in one of the cells — oops, let me restart the kernel, I must have been messing around with this before. All right, there's the data, and this part of the model is the same except for the latent part.
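The latent GP model described on the slide looks roughly like this (a sketch assuming arrays years and disasters; the notebook's prior scales and names may differ):

```python
import pymc3 as pm

with pm.Model() as disaster_model:
    # Hyperparameter priors (exponential here, as on the slide; half-Cauchy would work too)
    rho = pm.Exponential('rho', 1)
    eta = pm.Exponential('eta', 1)
    K = eta**2 * pm.gp.cov.ExpQuad(1, ls=rho)

    # Latent GP: no closed-form marginal likelihood, so we sample the function values directly
    gp = pm.gp.Latent(cov_func=K)
    f = gp.prior('f', X=years[:, None])   # add the empty dimension: X must be 2-D

    # Transform the (possibly negative) latent function into a positive Poisson rate
    lam = pm.Deterministic('lam', pm.math.exp(f))
    obs = pm.Poisson('disasters', mu=lam, observed=disasters)

    trace = pm.sample(1000, tune=1000, cores=2)
```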
So there's our gp.Latent, there's the prior, and there's the Poisson. PyMC has most of the important NumPy functions, but they're buried inside pm.math; that's because they're actually Theano functions, and you want to use Theano functions whenever you can. Then we sample. I don't think I actually need this argument — by default PyMC will use half as many cores as your machine has, so four here; I'm telling it to use two so I don't hog the machine, although as many chains as you can manage is best for model checking. Nothing like watching MCMC run. The next example we're not even going to do hands-on, because it takes a long time. I got warnings — divergences. I told you about gradient-based MCMC: it tries to follow the gradient of the posterior, and sometimes it diverges and goes off into a weird place, and it reports how many divergences you get. If there are just a couple it's usually okay; there are diagnostics in the ArviZ package, which is what's being used to plot these, that you can check. These are probably okay — I've run this model enough times to know.

What we did in the example used in the PyMC documentation is treat this as a change-point model: there's an early mean, some regulatory change happened in here — improved safety or something — and then there's a lower mean afterwards. That's a very simplified, piecewise-constant regression model. This model, by contrast, allows arbitrary changes in the underlying risk of a disaster, and you can see there was a step somewhere in here, but the rate could be non-constant elsewhere too — who knows. [Question: is that just overfitting noise? Is there a way to change the prior?] Because we've done MCMC, that uncertainty has been taken into account: these are the posteriors, and here's the length scale. If the length scale truly is up around 11, the function will be smooth — if you zoomed in you'd see some of those lines are almost straight, changing only over a very long period — whereas others change faster. That's the nice thing about this: there's no real overfitting here, because a realization is a function; you can't add or subtract parameters from an infinitely parametric model. It's not that you can't overfit a GP, but it's pretty hard. [Question: why can't you use the marginal version here?] You can only use the marginal GP when you have that conjugate normal-normal setup, because that's what makes the posterior closed form: the big integral at the bottom of Bayes' formula is available in closed form and you can just maximize it to get your hyperparameter values, which is a lot easier. Here there's no conjugacy, so you've got to build a latent GP, transform it, and use the Poisson likelihood — you can't specify it in closed form. The model takes a bit longer, it's a little more costly computationally, and structurally they're different: the thing being optimized in a marginal model is different from a latent model. [Question about where the Gaussian prior sits.] The Gaussian process prior is on the log of the rate — if you go back up to the model, we're modeling, essentially, the log rate.
We're modeling the Poisson rate — the rate of accidents — and the Gaussian process describes the log of that rate. If you take that positive value and take its logarithm, you get something on the real line, and you can say that that is normally distributed. So it's the latent thing that's Gaussian, not the observed thing. [Question from an attendee, Alan: what about overdispersed counts — would you use something like a negative binomial?] Yes — hold on, while I'm talking I'll change the priors and see what difference it makes. There are still choices to be made about the likelihood and the priors. What Alan is suggesting is that the Poisson may still not be suitable for counts, because sometimes counts are overdispersed: the Poisson has only one parameter, lambda, which describes both the mean and the variance, but sometimes the mean and the variance aren't the same, and the negative binomial distribution is a generalization that allows for overdispersion. That's a slightly more complicated model; you could try it and do model selection to see which is best. I usually start with the simpler model, check the goodness of fit, and if the fit is okay I leave it alone; if the fit is poor, I move to the more complicated model. It's running even slower with the half-Cauchy — all I'm going to do is make another plot of the posteriors and see if there are fewer divergences. It looks like the length scale is a little more sensitive, which makes sense: the Cauchy prior has a longer tail. Let's see what difference that makes — there, this one is a little flatter, so it looks a little smoother. You can do goodness-of-fit checks to see which of these is preferred, but it's not overly sensitive. Back to the — pardon me? The red lines are draws, posterior draws; I haven't taken any mean — they're posterior draws of the latent process, the estimate of lambda over time. Plus the hyperparameter estimates, which we just saw.

Okay, another non-normal model, which I'll go through quickly — this one won't have a hands-on component. This is from my own work: I used GPs for this in practice before I was in baseball, when I was a biostatistics professor at Vanderbilt studying infectious disease dynamics. One of the things about measles is that people show up with symptoms of measles but sometimes don't have measles — they have rubella or something else — so there's a detection bias: you can overestimate or underestimate the number of cases, and here we're talking mostly about overestimation, a confirmation bias. You test somebody in the field and give a quick response based on that test, but then you lab-confirm it to make sure, and some of the lab confirmations come back negative. To model the outbreak properly you want to correct for that. This is exactly the same setup as the other one — a latent GP with exponential priors — except now I'm using a binomial likelihood, because the question is: out of how many potential cases, how many were actual measles cases? So now I'm estimating the confirmation bias in measles, and it looks like this, plotted against age.
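A sketch of that kind of model — heavily hedged, since the original code isn't shown in the talk. Assume hypothetical arrays ages, n_tested (suspected cases per age group), and n_confirmed (lab-confirmed cases), with a latent GP on the logit of the confirmation probability:

```python
import pymc3 as pm

with pm.Model() as measles_model:
    # Hyperparameter priors, as in the disasters example
    rho = pm.Exponential('rho', 0.1)
    eta = pm.Exponential('eta', 1)
    K = eta**2 * pm.gp.cov.ExpQuad(1, ls=rho)

    # Latent GP over age (ages, n_tested, n_confirmed are hypothetical placeholder arrays)
    gp = pm.gp.Latent(cov_func=K)
    f = gp.prior('f', X=ages[:, None])

    # Squash the latent function onto (0, 1) to get an age-specific confirmation probability
    p = pm.Deterministic('p', pm.math.invlogit(f))

    # Binomial likelihood: confirmed cases out of suspected cases in each age group
    confirmed = pm.Binomial('confirmed', n=n_tested, p=p, observed=n_confirmed)

    trace = pm.sample(1000, tune=1000)
```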
This is probably the most frequent application of GPs for me: these sorts of age-specific effects, because age is notoriously nonlinear for a lot of outcomes. You can see the bias goes down for kids and then back up for some adults, and the nice thing is that you get more uncertainty out here for elderly folks because there are far fewer cases; the size of the points represents the number in each age category, so tiny points mean few cases and bigger points mean more. Very useful. And often you can roll this into another model: this is just a component of a larger model of measles dynamics, a correction for an age-specific bias, and it's important to take that into account. Again, how would you model that parametrically? It would be difficult; you would have to do some polynomial thing.

All right, about 15 minutes left. I've shown you the good news, the things that Gaussian processes can do; now the bad news. We're in the age of big data, and Gaussian processes are typically not good for large data sets. The reason is in the posterior of the Gaussian process: K-star here is the posterior covariance function, and it has a nice closed form like we saw before, but if you look closely, you take the covariance function evaluated at X, the data, do something to it, and then invert it. So you're inverting an n-by-n matrix: with a thousand data points that's a thousand-by-thousand matrix inversion, which very quickly becomes hard to do. It's cubic in compute time and quadratic in memory, so really large models take a long, long time and often croak.

So what we do instead is approximate. Rather than build the full prior on all the observed data, we pick representative points and use them as what are called inducing points, which gives a sparse approximation to our prior. These pseudo-inputs are of size m, which we presume to be much smaller than n; for that measles example you might pick five or so. Essentially it reduces the number of matrix eigenvectors used to fit the data, so it reduces the expressiveness of the GP, but sometimes that doesn't matter.

The example I'm going to use is the Boston Marathon data set that I briefly flashed up at the beginning: 2015, over 26,000 finishers. It's crazy that many people are running marathons these days, and these are the really fast ones too, since you have to qualify for Boston. To fit a regular GP to this data you'd have to invert a 26,000-by-26,000 matrix, and good luck with that. That's what the data look like, and what we're after is the expected finishing time across all of the ages. All we do is add one additional piece, a set of inducing points we'll call Xu. You can specify them yourself; for data this dense it doesn't matter very much, and you could probably just say 1, 2, 3, 4, 5 and so on, but PyMC3 has a nice utility function that picks them using k-means. It just does clustering along the line; they're not useful clusters in the sense of cluster analysis, but it finds good places to situate the points, which matters more when the data are actually clumped in different places. Other than that, the model is the same as before.
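For reference, the closed form being described above is the standard GP posterior predictive at new inputs $X_*$; this is reconstructed from the textbook result, not transcribed from the slide:

$$
\begin{aligned}
\mu_* &= K(X_*, X)\left[K(X, X) + \sigma^2 I\right]^{-1} y \\
K_*   &= K(X_*, X_*) - K(X_*, X)\left[K(X, X) + \sigma^2 I\right]^{-1} K(X, X_*)
\end{aligned}
$$

The $\left[K(X, X) + \sigma^2 I\right]^{-1}$ term is the n-by-n inversion in question, which is what makes exact GP regression $\mathcal{O}(n^3)$ in time and $\mathcal{O}(n^2)$ in memory, and what the inducing-point approximation just described avoids.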
The only change in the model code is that now we use MarginalSparse. Unfortunately this does not work, at least in PyMC3, in the latent case, so if you've got large data that is not Gaussian you're somewhat out of luck; it's not that you can't do it, we just don't have high-level classes to take care of it for you. I'm going to pick 20 inducing points (it says 20, but I actually used 10), and then everything else is the same. The marginal likelihood looks exactly the same except that we're also passing Xu, and any time the GP itself is MarginalSparse it will expect you to say where your inducing points are. There are a lot of different ways of handling these inducing points; FITC is the default, which I just wrote out to be explicit, and it stands for fully independent conditional. They're all just different methods for reducing that eigenvector count, and different methods work better in different situations, so it's always best to fit a couple of them and see which turns out better. And ta-da, it looks like that, and it doesn't take too long to run. I don't have a hands-on for this one, because our last example, which we'll go into now, is also going to be sparse, so we'll get a chance to do it by hand there.

All right, the last generalization we'll do, and this is a really useful bit that I use a lot, is multi-dimensional Gaussian processes. I mean multi-dimensional on the input side; the process itself is of course infinite-dimensional. Up till now we've been modeling univariate time series mostly, or age series; sometimes we want to do things like spatial models, and I use these a lot in baseball applications, but the example I'll give you is from a geology textbook, Applied Geostatistics, from back in the late '80s. These are samples from various locations on a landscape of three substances; the textbook doesn't even say what the substances are, it just calls them V, U, and T, with measurements of each at X-Y locations. So our inputs are going to be X and Y, and we're going to model one of the substances. The data look like this: it looks like what they did was sample on a grid and then, wherever a measurement exceeded some threshold, go in and sample more densely around it, a sort of search strategy. So there could be pockets over here where there's yellow, but they didn't find any. It's relatively sparse, two-dimensional spatial data, and what we're going to do is estimate the true underlying distribution of the substance. It's remarkably easy to do. This looks very similar to what we did before; I'm using a slightly different covariance function, the Matérn, remember, the jagged one, because I figured with something like this there could be a little roughness in the underlying function. The important thing, which you may have noticed before and chosen to ignore, is that there is always a 1 as the first argument of the covariance function: that specifies the input dimension. If you want to make it two-dimensional you just put a 2 there, and the only constraint is that it now expects two-dimensional inputs, kind of like declaring a layer in TensorFlow where you have to give the input dimension. Everything else is the same: there's our MarginalSparse again, with its inducing points.
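Here is a hedged sketch of what that two-dimensional sparse model looks like in PyMC3. The coordinates, measurements, prior choices, and variable names are placeholders rather than the notebook's actual code, and the prediction-on-a-grid step at the end is the piece the talk turns to next.

```python
import numpy as np
import pymc3 as pm

# Placeholder spatial data: (x, y) sample locations and one measured substance
XY = np.random.uniform(0.0, 100.0, size=(150, 2))  # stand-in coordinates
z = np.random.normal(size=150)                      # stand-in measurements

with pm.Model() as spatial_model:
    # Hyperparameter priors (illustrative choices)
    ls = pm.Gamma("ls", alpha=2.0, beta=0.1)
    eta = pm.HalfNormal("eta", 1.0)
    sigma = pm.HalfNormal("sigma", 1.0)

    # First argument is the input dimension: 2 means two-column inputs
    cov = eta ** 2 * pm.gp.cov.Matern52(2, ls=ls)

    # Sparse marginal GP with the FITC approximation
    gp = pm.gp.MarginalSparse(cov_func=cov, approx="FITC")

    # Inducing points placed by k-means over the observed locations
    Xu = pm.gp.util.kmeans_inducing_points(20, XY)

    z_ = gp.marginal_likelihood("z", X=XY, Xu=Xu, y=z, noise=sigma)

    # Optimize instead of sampling: point estimates of the hyperparameters
    mp = pm.find_MAP()

# Evaluate the fitted surface on a regular 30 x 30 grid for a heat map
g = np.linspace(0.0, 100.0, 30)
grid = np.dstack(np.meshgrid(g, g)).reshape(-1, 2)

with spatial_model:
    z_pred = gp.conditional("z_pred", Xnew=grid)
    pred = pm.sample_posterior_predictive([mp], var_names=["z_pred"], samples=100)
```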
The inducing points come from k-means, and then I'm not going to do MCMC, since that would take a little while; instead I optimize, and after I'm done I want to evaluate the model on a full grid. I take a regular 30-by-30 grid and create a heat map at the end, so I just pass that big grid of new points in. Okay, so let's do the last example, the multi-dimensional GP; the marathon one is in the repository too, by the way, and if it isn't I'll make sure it's uploaded after I'm done. Clear all outputs. Okay, it looks like we're modeling substance V. And I didn't actually show you find_MAP before, but there it is: it's just doing optimization now, because this is a marginal sparse model, so we won't have distributions for the unknown parameters, just their maximizing point values. You are sacrificing uncertainty in the hyperparameter values; we are still sampling from the posterior predictive, but it's conditional on a single value of the hyperparameters. I did a 30-by-30 grid; you could do 100 by 100 and get a nicer, smoother surface, whatever you need. And this is the same as kriging: if you've done geostatistics before and you've done kriging, you've already done Gaussian processes, because they're the same thing.

All right, almost at the end, so I'll wrap up. There's plenty more where this came from; this was just to whet your appetite and give you the basics. As I said before, you can do classification with GPs; for that you use a latent GP with a Bernoulli or binomial likelihood, where you're classifying something as 1 or 0. The kernels, the covariance functions, are additive and multiplicative, so if you have something that is, say, periodic but also trending, like the CO2 levels in the atmosphere, you can combine a linear kernel with a cosine kernel and model the components separately. There's a really good example in Gelman's textbook Bayesian Data Analysis where they model births on different days throughout the year with a seasonal component, a component for holidays, and a component for weekends. Automatic relevance determination is a sort of automatic variable selection: if you've got a lot of input variables, you give each one its own length scale. Bayesian optimization: if any of you are machine learners, one of the best ways to select hyperparameters is to use Gaussian processes, because there's some underlying surface that represents how well your deep learning model performs over different values of your hyperparameters, and what you want to do is optimize that surface, so you can model it directly with a GP. And you can show that GPs are related to other methods: they're related to support vector machines, and if you take a neural network with one hidden layer with an infinite number of hidden nodes and put normal priors on the weights, it's a Gaussian process. So there are links to other things; you can make deep Gaussian processes and so on. The advantages I hope I've made clear: GPs are flexible, you get probabilistic predictions, the hyperparameters are reasonably interpretable (the length scale shows over what scale things vary), and it's relatively easy to automate. The limitation, as we saw, is that it scales poorly with large data.
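To make the kernel-algebra and ARD remarks above concrete, here is a small hedged sketch. The priors, the assumed period of 12, and the three-column ARD input are illustrative choices, and it uses PyMC3's Periodic covariance as the seasonal piece rather than a strict cosine kernel.

```python
import pymc3 as pm

with pm.Model() as kernel_examples:
    # Additive combination: a linear trend plus a periodic seasonal piece,
    # the kind of decomposition used for CO2-style series
    ls_seasonal = pm.Gamma("ls_seasonal", alpha=2.0, beta=1.0)
    period = 12.0  # assumed known period, e.g. 12 months
    cov_trend = pm.gp.cov.Linear(1, c=0.0)
    cov_seasonal = pm.gp.cov.Periodic(1, period=period, ls=ls_seasonal)
    cov_combined = cov_trend + cov_seasonal   # covariances add (and multiply)

    # Automatic relevance determination: one length scale per input column,
    # so inputs whose length scale grows very large contribute little
    ls_ard = pm.Gamma("ls_ard", alpha=2.0, beta=0.1, shape=3)
    cov_ard = pm.gp.cov.ExpQuad(3, ls=ls_ard)

    # Either covariance would be passed to pm.gp.Marginal(cov_func=...)
    # or pm.gp.Latent(cov_func=...) exactly as in the earlier examples.
```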
As for the scaling limitation, if you look at the GPyTorch package, its creators have come up with a method called black-box matrix-matrix (BBMM) inference, which computes exact GPs for very, very large data sets. What it does is use GPUs and parallelize the big matrix operations, and it's very fast; I use that a lot too. So you may want to stray away from PyMC3 if you've got larger data sets.

Other than that, I want to thank the PyMC3 team. This is roughly the current active membership; the actual number of contributors is a lot larger than this. In particular, Bill Engels pretty much single-handedly developed the GP capabilities of PyMC3 over two Google Summer of Code summers, absolutely amazing stuff, and that's why it was so easy for me to do this for you today without having to dig too deeply into the internals of PyMC3. With that I'll wrap up. I've got a minute or two for questions, three minutes in fact, and then I'm going to the airport. So thank you very much, and I'm happy to take questions if there are any.

[Question about how the inducing-point approximations work.] Yeah, so you pick those, say, ten points, and then you essentially have ten eigenvectors: it collapses the information in the data onto those ten points, so you're inverting a 10-by-10 matrix instead of a million-by-million one. The different approximations, FITC and the rest, all do something like that; each does it in a slightly different way. I didn't want to get into the methodological details, but there's a good paper that summarizes them; if you just search for "sparse approximation GP" there's one that covers each of them and how they go about it. Some will underestimate the noise, and some will bias things on the hyperparameter side, so there are always trade-offs when you take these shortcuts, and it depends on which trade-off you're willing to make.

[Question about missing data.] No, I never throw out data points, and Bayes makes it really easy to impute them because you have the posterior predictive distribution; you can do it automatically without having to throw out any data. Anything you don't observe you can include as a variable in your Bayesian model, and because it's a GP, if you have an unobserved X it will compute a value for you if you just ask for it in the conditional after you fit the data. I wouldn't do it way outside the data, like I said with the extrapolation problem, but anywhere near the data you should get a really good imputation. That's really what kriging is, and it's what I was doing with that spatial surface: interpolating values using the estimate of this unknown function. Okay, thanks everybody. [Applause]