Neil Lawrence: Introduction to GPs - GPSS 2016

Captions
The first of these schools was here in Sheffield in 2013, so this is the fourth Gaussian Process Summer School, and welcome to Sheffield. The tradition at these schools is to do two introductions: typically I start by giving one, and then we have a different introduction, because, certainly when I learnt Gaussian processes, I found it helpful hearing as many people introduce them as possible. It's a slightly different way of modelling (it's the correct way of modelling; unfortunately you've been indoctrinated with other, bad ways of modelling, and as we know bad habits take a while to overcome), so that's the reason for doing it. It's also evolved into something quite nice, because I'm always doing one of them and I tend to say the same thing. Who's watched videos of me introducing Gaussian processes online? There you go, you can be utterly bored. Because of that I keep saying slightly different things, and I also know we've got excellent expert lectures today, Jeremy Oakley among them, to tidy up all the bits I've missed.

So, just to say a little about these models. People normally date this sort of thing to, I suppose, Kalman in 1960 with Kalman filters, though Kalman smoothers come even after that. But this is wonderful. This isn't the original copy of the book, but I had this printed: "Sur la compensation de quelques erreurs quasi-systématiques par la méthode des moindres carrés", which I think means "On the compensation of some quasi-systematic errors by the method of least squares". It's by a guy called Thorvald Thiele. It was written originally in Danish, but I felt I had more chance with the French; there's a French copy you can find on Google Books, and I just got this printout made. It's actually the first description of Gaussian smoothing, and it's from 1880, eighty years before Kalman. The example it uses is: perhaps you've got observations with a telescope and there's something systematically wrong with the telescope, so you've got a constant distortion rather than independent errors. That's what he uses to motivate it. So that's a wonderful book.

An even more wonderful book is this one, which I bought online. I got a copy out of the library and then bought a copy, from Amazon I think, and rather nicely inside it says "Statistical Laboratory, University of Manchester", so I apologise to the University of Manchester if this was obtained by illicit means; I suspect they sold off their library stock or something. It's called Random Processes in Automatic Control, it's by a guy called Hal Laning, and it's a book filled with Gaussian processes. The fascinating thing is that it's from 1956, I think, it's filled with Gaussian processes, and it's about control systems engineering. Hal Laning at the time was at the Instrumentation Laboratory at the Massachusetts Institute of Technology, and he actually turned out to be the person who designed the first real-time operating system for the lunar landing module. He built a priority operating system, the one that ensured that when they had the error during the landing it ran the correct program, the program that was going to land the module, rather than some other irrelevant program. He designed that.
So he became very well known as an expert in digital control, but this book is all about continuous systems and Gaussian processes, and it's fascinating because there's a load of ideas in here that we thought we'd invented within the last ten years. Of course, what happened was that continuous systems took second place to digital systems. 1960 is Kalman, and for the actual lunar landing, the navigation to the moon, they used Kalman filters to get there; they actually used something called square root propagation. Has anyone heard of that? It was invented because they had sixteen bits of precision: square root propagation was invented to get the thing to the moon given that lack of precision. Sixteen bits is still pretty good, better than what we had twenty years later with our eight-bit computers, but they needed square root propagation to do it. So, with going to the moon, people moved very much towards digital signal processing and digital computers, and there was a big forgetting about these continuous systems within the signal processing community.

There's someone who's really good on this, although he doesn't cover this book much; he does a lot of work on Gaussian processes nowadays: Simo Särkkä. He's really been talking about smoothing and filtering systems, and his book, Bayesian Filtering and Smoothing, is excellent if you want to know about those systems and how they interrelate. It's from the filtering perspective rather than a Gaussian process perspective, but Simo's work in general is really taking us back to the relations between these filtering systems and these continuous systems. And why? Because we have much faster computers now, so we can do linear algebra very quickly. In one of the previous introductions Philipp Hennig did a really nice job of explaining that one motivation for Gaussian processes is that they are entirely linear algebra, and now I repeat that every time because it's a really nice way of putting it. Linear algebra is the modern thing we can do fast on computers; that's what GPUs do fast for us, that's what processors are designed for, LAPACK and BLAS, we've got libraries to run linear algebra for us, and inference in Gaussian processes is purely linear algebra. So, just as previous generations did (this is like Hal Laning, actually; I've looked up whether the HAL 9000 is named after him, and it's not clear: some references say it might be, but probably it wasn't), this guy Hal Laning moved towards digital control because it was the practical mathematical, algorithmic approach for the time. Nowadays I would say Gaussian processes are very practical, and we're trying to make them even more practical. That's my long-winded introduction.

I'm trying to give you a bit of a historical perspective and also a hint about what I won't talk about in great detail. Sometimes we have Simo talk; there are talks online from Simo relating Gaussian processes to filters, and I really love that area, but I'm not an expert in it. Richard Turner is also doing work in that area; I think he spoke at the last school on related things, if I remember correctly, from the machine learning perspective. And you're going to hear a statistics perspective later from Jeremy, so I won't say too much about that. The people who really introduced Gaussian processes to machine learners were Carl Rasmussen and Chris Williams with this book; there was a community of us working on them, but they were popularized in machine learning by this book.
It's still, I would say, one of the best references if you're coming from a machine learning background, although publishers keep asking people to write new books on Gaussian processes, so maybe someone will say yes one day. I would advise taking a look at that book.

What I'm going to do is try to introduce these things by looking at the Gaussian density, and then at basis function models, which is one way of looking at Gaussian processes. It always strikes me that when I was at school I hated the normal distribution, as I suppose we called it. It seemed so uninteresting; I really didn't like statistics, because it just seemed to be the easiest bit of maths. The mean formula was quite easy to remember, the standard deviation had a few complications in it, but all you had to do was remember the formula; we were given little tables of numbers and you had to punch them into your calculator, and it just seemed like, well, when are we going to do some maths? So I think it's ironic that the bit of maths I hated the most is the one I spend all my time doing. The reason, I think, is that most people, when they think of a Gaussian, are obsessed with the mean. The mean can be included, and it often is in some applications, but the main thing from a Gaussian process perspective that I'm going to try to communicate is the covariance, and multivariate Gaussians: that's where they get very interesting, when you start thinking about the covariance rather than the mean.

So it's a standard sort of PDF, and it has this nice bell-shaped curve. What I love is that this has become so predominant that it's now expected that if you produce exam results, your students should conform to a bell-shaped curve. That's actually very easy to do: all you have to do is ask them lots and lots of questions and it will naturally come out as a bell-shaped curve. I know, because I tried it: my course went from being bimodal to being bell-shaped. It was a lot of work on the marking, but it totally turns what is often a bimodal distribution into a bell-shaped curve. If you look at any individual assignment you get bimodality, some students doing better than others, but if you add all those things up, then very quickly, after what, seven assignments, we recovered a bell-shaped curve. Of course the students at the top of the curve are still the best ones, because they're the ones consistently scoring well, the skew at the bottom is the ones consistently scoring badly, and the ones in the middle are getting a hundred percent on some assignments and nothing on others. So that's the way to get things bell-shaped: you just assess an enormous amount. It doesn't do wonders for your feedback scores, I'd say, but it makes them happy at the exam board.

What I'm really talking about there, of course, is a very important property of Gaussian variables: the central limit theorem. In the case of Gaussians, if you sum Gaussian random variables they stay Gaussian; they don't become something else. This is something that, at school or at university, I never thought of as unusual; it was presented as a fait accompli: here it is, if you add these random variables together they stay in the same class of distributions.
That seems quite natural; why wouldn't they? But of course most distributions don't: most distributions tend towards Gaussian, so for most things, if you add random variables together, they move towards Gaussian. It's a very unusual characteristic, this sum of Gaussian variates being Gaussian. The sum works like this: if we have individual variables, each sampled with its own mean and variance, then the sum of those variables is distributed as a Gaussian with mean equal to the sum of the means and variance equal to the sum of the variances. This turns out to be super useful and important, and it actually turns out to be pretty important that it's the sum of the variances and not the sum of the standard deviations, because that's what you need mathematically to show that taking averages of noisy variables really does improve things. I won't go too much into that, but it's one of the main properties of the Gaussian.

There's another property: as you sum more and more non-Gaussian variables (they have to be finite-variance variables), you also get a Gaussian. That's a reason why Gaussian processes are sometimes important. Like this morning: I was running a bit late coming here because everything was going wrong. My bicycle broke on Sunday morning, so I was going to take another bicycle; I was getting the kids ready for school, and as I rode the other bicycle out it had a flat tyre, so I had to bring it back in; I have another bike, so I switched over the tyres. With all these things going wrong I was thinking, well, the sum of those variables ought to be Gaussian, so why am I running particularly late? I decided it's because my wife is in Portugal today, and that's a discrete variable that makes a major change, because I'm doing a lot more with the kids; she normally does a whole load of stuff, and of course that affects things. If you've got a major single change you don't get Gaussian results; you just come late. But if you have lots of small things adding up and going wrong, then typically you would still expect to see Gaussian noise, and that's what the central limit theorem tells us. Who was it who proved the first version of the central limit theorem? Anyone know? I think it was Laplace, wasn't it. Laplace: the answer to every question about who did what in maths is Laplace. Who invented the Gaussian distribution? Laplace. Actually it was also invented separately as an approximation to the binomial, but Laplace didn't know about that, and Gauss credits it to Laplace.

This other property, though, is less unique to Gaussians: scaling a Gaussian also leads to a Gaussian. That's not unique to the Gaussian; scaling a lot of variables keeps you within the same distribution family. But those two properties are the important ones I wanted to highlight, because they are what means that playing with Gaussians is playing with linear algebra. Linear algebra is just multiplying and summing, so if you do a linear algebraic operation on a Gaussian random variable (it's just a matrix multiplication, it just involves scaling and then summing), it all stays within the same family, and that's why inference in Gaussian processes is so trivial.
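To make those two properties concrete, here is a minimal numpy sketch (my own, not from the talk; the particular means and variances are made-up numbers) checking empirically that the sum of independent Gaussians has mean equal to the sum of the means and variance equal to the sum of the variances, and that a scaled Gaussian is still Gaussian with the obvious mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Two independent Gaussian variables with chosen means and variances.
y1 = rng.normal(loc=1.0, scale=np.sqrt(2.0), size=n)   # N(1.0, 2.0)
y2 = rng.normal(loc=-0.5, scale=np.sqrt(0.5), size=n)  # N(-0.5, 0.5)

s = y1 + y2
print(s.mean(), s.var())   # roughly 0.5 and 2.5: means add, variances add

w = 3.0 * y1               # scaling stays Gaussian: N(3 * 1.0, 3**2 * 2.0)
print(w.mean(), w.var())   # roughly 3.0 and 18.0
```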
One of the things that was never communicated to me as a student, and which I sort of evolved into teaching myself (so maybe it was taught somewhere other than by me, I don't know), is that once you've got these multivariate properties, models like this become very easy. We're going to use a lot of linear algebraic representation. If I've got a vector y and I say y is equal to a matrix W times x, and I say x is drawn from a multivariate Gaussian, which I haven't yet introduced, with a spherical covariance, then I immediately know that y is drawn from a multivariate Gaussian. That's because x is a set of independent Gaussian variables, and all I'm doing with a matrix multiplication is scaling them and then adding, scaling and then adding, so I end up with y being distributed as a Gaussian as well. I won't really prove this, but you can go through it yourself: in this case it's immediately possible to write down that y is drawn from a Gaussian with zero mean, because the mean should be W times the mean of x, which is zero here, and with covariance W Sigma W transpose, where Sigma denotes the covariance of x; for the spherical case that's just W W transpose times the variance. That's a one-liner.

Now, in Bayesian inference, what I'm actually doing here is saying I've got a likelihood, the probability of y given x and W, given by this, and then I'm saying x is drawn from a Gaussian, so that's a p(x), and I'm then deriving p(y), the marginal probability of y. That's the marginalization that requires the sum rule: the integral of p(y given x) times p(x) over x, a multivariate, high-dimensional integral. This is the bane of Bayesian inference, because these integrals are difficult to solve; people spend their lives solving them. But in the Gaussian case they're trivial, and this is why Gaussians are very nice to play with. Of course they're not always the right model, but they're very analytic, because of that very unusual property, which derives from these two.
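As a quick sanity check on that one-liner (again my own sketch, not from the lecture, with an arbitrary choice of W and of the prior variance): draw x from a spherical Gaussian, map it through a fixed matrix W, and compare the empirical covariance of y = Wx with the analytic result.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.5                            # spherical prior variance for x
W = np.array([[1.0, -2.0, 0.5],
              [0.3,  1.5, -1.0]])       # fixed 2x3 matrix

X = rng.normal(scale=np.sqrt(sigma2), size=(3, 100_000))  # x ~ N(0, sigma2 * I)
Y = W @ X                                                  # y = W x, one column per sample

print(np.cov(Y))          # empirical covariance of y
print(sigma2 * W @ W.T)   # analytic covariance: sigma2 * W W^T
```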
So the set of things we want to do with a Gaussian process turns out to include things like regression modelling. Now, who here is from a statistics background? And who's from a machine learning background? And who's from miscellaneous other backgrounds: formal methods, robotics, theoretical physics? You see, this makes notation and philosophy quite difficult, because the maths is the same for everyone, but the language and the motivations differ, so some things become quite confusing and you can end up with a lot of arguments about why things are done in different ways. Regression is a term from which field? Statistics. And who came up with that term? Francis Galton. It's really a term about a phenomenon; a lot of fields might call this curve fitting, or something else. It's a bit weird that it's such a strange term, but I'm using it here to mean fitting lines and curves to data. That's a standard thing we might do with a linear function, and I like to use this example of the set of things we commonly want to do in any of these fields: we want to take data and replace the data with a function that fits it. In fact, I should have said this in the introduction: this school is focused on uncertainty quantification, and uncertainty quantification is exactly that art, plus quantifying the uncertainty as well. You've got some sort of system, and you want to replace that system with a surrogate model which also gives you additional information about uncertainty.

Some examples of regression in, say, machine learning: prediction of the quality of some meat given some spectral information. An example from Sheffield (it was Tony O'Hagan and other people) is the calibration of the carbon-14 isotope curve. Radiocarbon dating is based on assumptions about the level of carbon-14 in the atmosphere that aren't quite correct, but you can find the correct values by lining up tree rings over a long period of time, so you've got a ground truth; work done in Sheffield using Gaussian processes actually produced the C14 calibration curve, taking the carbon-14 age and mapping it to the real age. And Go: I've had this slide for a long time, and I suppose I should update it because they didn't really do this in AlphaGo, but predicting the quality of different game moves, in backgammon or Go, given expert-rated training data, is a regression problem. They did use expert-rated training data, but I don't think they used a regression model; maybe they did at some point.

I like to think of the most simplistic thing we can do, which is the formula thirteen-year-olds end up studying: y = mx + c. It has some characteristics: we've got some response variable that is related to some covariate, and it's got two parameters, and the parameters have some meaning, a gradient and an offset. The interesting thing is that if you go back to, say, Laplace, this is the sort of question he was worried about. It was a great time, because they were looking at astronomy and at how they could use maths with astronomy, trying to find the orbits of planets or comets or whatever, and it's very close to today in many respects: they've got data and they've got models, and they want to combine them. One of the things they came across was the idea that if you've got two points you can fit this straight line right through them; I think I've even got it on the next slide: you can compute what m is and what c is given the two points. It's interesting to realise that for them it wasn't obvious what you should do when you've got a third point. What do we do when we've got a third point? What do you learn at high school to do? Least squares, yes. It's interesting: "Sur la compensation de quelques erreurs quasi-systématiques par la méthode des moindres carrés"; moindres carrés is French for least squares, and I guess it's Legendre who coined that term. You do a least squares fit. But Laplace had worked on the problem a long time before that, and he didn't think that was the answer. The reason given for least squares, if you read the paper, isn't particularly well motivated from a modelling perspective, but Laplace did think about this carefully from a modelling perspective.
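A tiny illustration of that shift, with hypothetical data of my own (not the slide's): two points determine m and c exactly, while a third point makes the system overdetermined, and least squares then gives the compromise fit.

```python
import numpy as np

# Two points: y = m*x + c is determined exactly.
x2, y2 = np.array([1.0, 3.0]), np.array([2.0, 5.0])
A2 = np.column_stack([x2, np.ones_like(x2)])
m, c = np.linalg.solve(A2, y2)
print(m, c)                        # the unique line through both points

# Three points: overdetermined, so solve in the least squares sense instead.
x3, y3 = np.array([1.0, 2.0, 3.0]), np.array([2.0, 3.2, 5.0])
A3 = np.column_stack([x3, np.ones_like(x3)])
(m, c), residual, *_ = np.linalg.lstsq(A3, y3, rcond=None)
print(m, c, residual)              # the compromise fit and its squared error
```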
Actually, just as a side remark, and I quite like this one: one of the things they tried doing (I read Stigler on this, and he talks about it; he's an interesting read if you're interested in the details) is that they fitted all the different possible curves, or lines. Here I've done lines, but in what they were often doing, trying to fit planetary orbits, they had basis functions, so all the possible curves were fitted to the data, and then they tried to select the one that summarized them. They sat and asked: which one of these is the correct fit? They had a sense that one of them would be correct and the others might be incorrect. They didn't naturally want to average, which is what least squares actually leads to; instead they would do things like try to find the median fit: line up all the fits, then define some form of median fit (it's difficult to get an ordering in multiple dimensions), and extract that fit. Which is sort of interesting; it does seem sensible to try to do things like that.

The reason this is difficult is, of course, that it's an over-determined system: we've got three equations and only two parameters that are missing. The person, as I said before, who addressed this is Laplace. I think he's great; he's my favourite. I've read loads of stuff by Laplace, and I can't read French very well, but he was really impressive: he worked in so many fields, he survived into his eighties through three revolutions and became a count. He must have been an amazing guy. My favourite thing that he said is actually not the thing he's famous for saying. This is Laplace's demon; who's heard of Laplace's demon? It's a very ironic thing. It's a bit too small here, but I'm going to try my French. Who here is French? You'll love my accent: "Nous devons donc envisager l'état présent de l'univers comme l'effet de son état antérieur et comme la cause de celui qui va suivre." Probably sounded a bit French-Canadian or something; no disagreement there, OK. I've got the English translation here: "We ought then to regard the present state of the universe as the effect of its anterior state and as the cause of the one which is to follow." Now, David Applebaum disagreed with me when I said this, but I would claim that's the Markov property; he's a probabilist, so he would disagree, but from an engineer's perspective that's the Markov property, and he's writing this, I think, in the early nineteenth century, so that's about sixty years before Markov. It goes on: "Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situation of the beings who compose it, an intelligence sufficiently vast to submit these data to analysis, it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom." Be careful here: he's interesting because Boltzmann died extremely frustrated that no one ever accepted his model of the atom, so this is before people believed in atoms. You have to be careful when you read these things, because you tend to project modern thinking onto them, but it's quite a fun thing to do, because it's amazingly prescient: "for it, nothing would be uncertain and the future, as the past, would be present to its eyes." And then, what's often not quoted: "The human mind offers, in the perfection which it has been able to give to astronomy, a feeble idea of this intelligence." This is known as Laplace's demon; it's discussed by philosophers as the mechanistic model of the universe, and everyone says: ah, how foolish, we could never do this.
In the end, quantum aspects, I suppose, mean that in theoretical physics it doesn't happen; but from my engineer's perspective those don't come into it, and from my perspective this is true, actually. I'm with Laplace. But I'm also with Laplace on another point that everyone fails to point out. There are all sorts of things in here. There's a computational challenge, which is something that, as we get into probabilistic numerics and uncertainty quantification, we're worrying about today: that intelligence "sufficiently vast to submit these data to analysis". But the ironic thing is that this is page four of a book called A Philosophical Essay on Probabilities, and what he says here is that, for this intelligence, nothing would be uncertain and the future, as the past, would be present to its eyes. It's a bit strange that people quote someone who's writing a book on probabilities and the main quote is about determinism; that's something you should be a little bit suspicious about. And what you will never see written, which I find bizarre (I'm the only one I've heard say it, and I hear myself say it a lot), is what he goes on to say two pages later: "The curve described by a simple molecule of air or vapour is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance. Probability is relative, in part to this ignorance, in part to our knowledge." This is his point: you cannot know everything. You can't know the curve of a molecule (he doesn't actually know molecules exist; he's speculating). He's talking about the unknown, about there being unknown things that you cannot deal with, and probability being the mechanism for dealing with that. And that's what I utterly believe. This is not written by Bayes; it's a much nicer framing of the idea than Bayes gives, and actually Richard Price wrote most of the stuff that is comprehensible in the introduction to Bayes's essay. This is an idea coming from Laplace: that you should use probability to deal with these lacks and uncertainties.

And this is how Laplace proposed dealing with the challenge. Laplace said: what's going on is that you've got uncertainty about the world. You've got a model of the way you think things happen, but your model is an abstraction, a simplification of the world. y = mx + c: no one believes in that. Even Hooke's law in physics: for Hooke's law you have to buy special springs. When you did Hooke's law at school, putting weights on and measuring the displacement, if you try that with a normal spring it doesn't work; they're very, very expensive springs, made specially to give a linear response. Very few things are as simple as that. So what Laplace said is: what's going on is that there's a difference, an unobserved difference, between what I see and what my model says. In statistics we would call that the residual, though of course he didn't think of it that way; in other areas we would call it the noise. There are different names for this (in numerical analysis they call something else noise, which can be very confusing). This was Laplace's fundamental idea: you introduce what in some areas you might call a slack variable, you introduce a new variable, and then you make an assumption about how that variable is distributed; you make a Gaussian assumption for that variable.
Now, the interesting thing is that Laplace did all the work on that. Thinking about it, he also did a load of work on what I think is Laplace's principle of indifference, or something like that, which is like a maximum entropy principle: trying to put as little information into that variable as possible, because you haven't modelled it. But it was actually left to Gauss to say: try using the normal distribution for these, and then you get least squares. And Gauss claims priority over Legendre, because he claims to have used it to predict Ceres, which is another story. What I find interesting is that his motivation is very much: I've got these sorts of things, something's missing, and that thing should be knowable. You can see from the text that he believes it's a deterministic thing. In fact, in astronomy a lot of that is going to be driven by the movement of molecules of air, on a larger scale than a single molecule, because turbulence in the atmosphere and everything else is going to give you observation error. He's saying there's going to be some distribution over that, and then Gauss says, well, that should be a Gaussian density, though he didn't call it a Gaussian density; in fact he credits it to Laplace.

The beautiful thing about this, to my mind (if you're a statistician there have been generations of arguments about whether you should do this, but it just seems natural when you start modelling), is the converse situation of two unknowns and one observation. You've got some observation y, which you're relating to some other observation x, and you've got two parameters to fit. You can only fit this if you make an assumption, perhaps about what c might be, because you can compute m given c: if I set c equal to 1.75, I can compute m as follows. So if I make a conditioning assumption on c, I can compute m. But the really nice thing is that if I assume a distribution for c, just as Gauss was assuming, and as Laplace was suggesting you assume a distribution for those missing slack variables, then you get a distribution over your solutions. Gauss's idea leads you to one set of parameters, a single point estimate; this idea leads you to a distribution over possible values for m, and, if you like, a joint distribution over values for m and c. As far as I can understand, they saw no inconsistency in that: they did not distinguish, as some people do, between genuinely stochastic variables and things that you should simply know. That's a Fisherian perspective. I like to use that term and say: if you like this perspective, you're just normal; if you don't like this perspective, you're a Fisherian, which is a weird cult of people who believe that something such as stochasticity exists.
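Here is a small sketch of that idea. The c = 1.75 comes from the slide; the single data point and the prior variance are my own made-up numbers. Conditioning on a value of c gives a single m; putting a Gaussian over c instead gives a Gaussian over m, because m = (y - c)/x is linear in c.

```python
import numpy as np

x_obs, y_obs = 2.0, 4.0            # one assumed observation (x, y)

# Conditioning: fix c, solve for the single compatible gradient m.
c = 1.75
m = (y_obs - c) / x_obs
print(m)                            # a single point estimate of m

# Treat c as uncertain instead: c ~ N(mu_c, var_c).
mu_c, var_c = 1.75, 0.5
# m = (y - c)/x is linear in c, so m is Gaussian too:
mu_m = (y_obs - mu_c) / x_obs
var_m = var_c / x_obs**2
print(mu_m, var_m)                  # a whole distribution over m

# Check by sampling values of c and propagating them through to m.
cs = np.random.default_rng(2).normal(mu_c, np.sqrt(var_c), size=100_000)
ms = (y_obs - cs) / x_obs
print(ms.mean(), ms.var())
```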
The nice thing about uncertainty quantification: what we're trying to do here is build models that include the uncertainty, including the uncertainty about the parameters. In uncertainty quantification, very often we can't do this in very complex models, because they don't have these nice Gaussian properties, but we can sometimes replace those complex models with a surrogate model, and a very commonly used surrogate is a Gaussian process, which is why we're bringing these two things together; hopefully that will make sense as we go through the next few days. We're using the function we're getting out of here, a Gaussian process, albeit a simple one, as a surrogate for something else we're interested in.

So, how am I doing? Going too slowly? Probably not too slowly; going fast, but saying too much. OK. One of the things we tend to do in statistics or machine learning is build linear models like this, multivariate linear models; these are the foundational models for a lot of areas. In statistics this parameter w would be denoted beta, and I think that distinction is really useful. Initially I thought it was just a stupid difference of notation, but actually statisticians think very differently about these parameters. In machine learning we use w because it's a weight, a neural network weight, which is not a great motivation for naming a parameter, but in stats they use beta. What I think is a good convention is: go ahead and use beta if you care about the value of that parameter, if it's something you want to know, like the rate of increase of disease with socioeconomic status; that's the sort of thing a statistician will want, a parameter that has an interpretation. If your parameter has an interpretation and you care about it, call it beta; if your parameter is just part of a predictive model and you don't really care what value it takes, call it w. That's very often the case in machine learning, and it's a big difference between machine learning and statistics, motivationally. In statistics you normally care about the value; you're normally fitting a model because you care about the value, and that leads to a bunch of differences in how statisticians use these models versus machine learning people. In machine learning we only care about what the prediction is.

What we do need, if we're going to do the same thing here, is a prior over w: we've got a bunch of missing values, and that introduces the multivariate Gaussian density. I don't like the sum notation; I like inner-product, matrix notation, so I tend to write these models like this, because it makes the Gaussian inference much easier when you use the sort of tricks I talked about earlier. Typically (and I'm going to assume some background knowledge here) we're looking to apply Bayes' rule in these places. Bayes' rule is just... not really a rule. It's just the product rule; calling it a rule gives it an authority that isn't merited. If I move that term to the other side, it's the product rule, the product rule plus the ability to do algebra. Its implications are massive, but it's not really a rule. Sometimes you need the sum rule as well, of course, and that turns out to be the difficult bit: the sum rule, computing this p(y), the thing I wrote down trivially earlier because of Gaussian properties; getting the marginalization constant turns out to be the difficult part. So yes, I'm Bayesian, but I'm anti-Bayes; I don't like the term Bayesian. What we're typically trying to do in Bayesian inference is take these priors and combine them with a likelihood. In this case I've got one observation of y given m, I've got a prior over c, and I combine the two by multiplication; very complicated. The prior says that c is around here, the likelihood, the data, says that c should be around there, the two multiply, and I get a posterior, my posterior distribution for c. In general that could be a complicated object, but for a Gaussian it's trivial and very amenable to computation.
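To see that multiplication in the simplest setting, here is a one-dimensional sketch with assumed numbers (not the slide's): a Gaussian prior over c times a Gaussian likelihood in c gives another exponentiated quadratic, so the posterior is Gaussian, with the precisions adding and the means combining precision-weighted.

```python
import numpy as np

# Prior over c, and a Gaussian likelihood centred on what the data says c should be.
mu_prior, var_prior = 0.0, 1.0      # assumed prior
mu_like,  var_like  = 1.2, 0.3      # assumed likelihood centre and width

# Product of two Gaussians in c: precisions add, means combine precision-weighted.
var_post = 1.0 / (1.0 / var_prior + 1.0 / var_like)
mu_post = var_post * (mu_prior / var_prior + mu_like / var_like)
print(mu_post, var_post)            # the posterior is Gaussian again

# Numerical check: multiply the two densities on a grid and renormalize.
c = np.linspace(-4.0, 4.0, 2001)
dc = c[1] - c[0]
post = np.exp(-0.5 * (c - mu_prior)**2 / var_prior) * \
       np.exp(-0.5 * (c - mu_like)**2 / var_like)
post /= post.sum() * dc
print((c * post).sum() * dc)        # matches mu_post
```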
The reason it's trivial is that the multiplication of the prior and the likelihood, both being Gaussian, leads to this form here: we end up with an exponentiated quadratic, the quadratic form of a Gaussian. Because Gaussians have this form, they're amenable to doing all these sorts of things, and that's one of the reasons Gaussian processes are so convenient.

What I want to do now is talk about how we do that in the multivariate case, and make a couple of points about Gaussians. Let's assume we've got a distribution over people's heights and weights in this room, Gaussian distributions for heights and for weights. What I'm going to do is sample from that distribution: I sample a height, and then I sample independently a weight. That's a valid way of getting something multivariate, and that's what I'm showing you here, pulling out those samples; but those samples are not correlated, you just sample from one and then sample from the other, and that's not a very interesting way of building a multivariate distribution. I want to motivate the multivariate Gaussian by starting from that point, though. In reality (oh, what's the guy's name, one of the earlier statisticians, he predates mathematical statistics: Quetelet) Quetelet invented the idea of the body mass index, which tries to get an index that is less dependent on height, so it divides weight by the square of height: the taller you are, the heavier. It's got some quadratic in it, which is weird, because you'd think it would be cubic; you can read lots about whether it's a good measure or not. In reality, what you would expect is a correlation between height and weight. The first Gaussian distribution is independent, so it actually looks like a ball, or an axis-aligned sausage. For a correlated Gaussian (and these are the things that really interest me; correlated Gaussians are really interesting) you expect lighter people to be shorter and heavier people to be taller on average. It's not universal, because some people are, you know, a little bit big-boned.

Now, if you look at the marginal distributions: what I've done here is actually set the marginals (I won't put numbers on it) to have the same width as before, looking in each direction. The first thing to think about physically with the Gaussian, to give the intuition (I should have said this is one contour from a Gaussian density in two dimensions, so it's supposed to be a hill coming out of the plane), is that the marginals come from looking down an axis at that hill. The marginal distribution over h comes from projecting this thing onto this axis here, and the other one from projecting onto this axis here. So these two marginals are the same as before, but the conditionals won't be the same, because of this correlation: if one variable is particularly high or particularly low, the samples of the other come out in a different place.

The way I like to think of the generation of a Gaussian distribution is to say: let's assume to start with that we've got a multivariate distribution where the two things are independent. If you multiply two Gaussian distributions together, because they have this exponentiated quadratic form, you get the summing inside the exponent and the product of the normalizations outside; because you're multiplying two exponentials, you get the sum of the two quadratics.
Of course, that is then clearly a Gaussian, because it's an exponentiated quadratic itself, but what I want to do is rewrite it in this little matrix form. I've rewritten it so that w and h form a vector, mu_1 and mu_2 form a vector, and then we've got a diagonal matrix of the variances, inverted to make sure that we're dividing. That is starting to look a little bit like a multivariate Gaussian, so we'll rewrite it with w and h as a vector y, the means as a vector mu, and that diagonal matrix of variances sigma-squared as a diagonal matrix D.

We can get a correlated version by rotating that space. Conceptually, what if our axes were aligned along the principal axes of the ellipse? If our axes were aligned that way, we would be back to the independent case. So another way of thinking about it is that a correlated Gaussian is a rotated independent Gaussian. If we do that, rotating y and rotating mu, the rotation doesn't actually affect the normalization, because I'm just spinning the ellipse around conceptually, so I've left that as before; a rotation of the space then allows me to pull this R in here. You teach this over a few years and you start trying to think about what the important lessons are, and I guess one important lesson to get across is different ways of thinking of a covariance matrix, or a Gaussian density, and so forth, to get the intuition for what you're trying to do; that's how I'm trying to introduce things. So we're going to define this matrix, the diagonal matrix rotated on either side, to be a covariance matrix: C inverse is given as R times D inverse times R transpose, which means C itself is R D R transpose. Why is this so? Because R is a rotation matrix: R transpose R is the identity, D times D inverse is the identity, and R R transpose is also the identity, which gives C that form. It also turns out that the determinant of this rotated matrix is just the determinant of D, because a rotation doesn't change the determinant; that's a property of determinants. So you get the multivariate Gaussian in this form, a vector times an inverted matrix times a vector, by conceptually thinking of two independent axes that you rotate. Correlated Gaussians have this property, and that's the way I like to think of them conceptually.

That has the consequence I talked about earlier. The multivariate consequence of playing with multivariate Gaussians (and indeed this is what we just did, we applied a rotation to an axis-aligned independent Gaussian) is that if the new variable y is equal to Wx, the thing I mentioned at the beginning, then y is drawn from a Gaussian with W mu as the mean and W Sigma W transpose as the covariance. Knowing that rule allows you to derive loads of models that look much more complex when you first see them derived, and it's a rule that applies to Gaussians because the sum of Gaussian variables is Gaussian, as is the sum of scaled Gaussian variables. So now I want to introduce Gaussian processes, and we've introduced the multivariate Gaussian.
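A minimal sketch of that construction, with my own choice of rotation angle and variances: start from an axis-aligned Gaussian with diagonal covariance D, rotate it by R, and check that the samples have covariance C = R D R-transpose and that the determinant is unchanged by the rotation.

```python
import numpy as np

theta = np.pi / 4                            # assumed rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
D = np.diag([2.0, 0.1])                      # independent, axis-aligned variances

C = R @ D @ R.T                              # covariance of the rotated Gaussian
print(np.linalg.det(C), np.linalg.det(D))    # the rotation leaves the determinant alone

rng = np.random.default_rng(3)
z = rng.multivariate_normal(mean=np.zeros(2), cov=D, size=200_000)
y = z @ R.T                                  # rotate each axis-aligned sample
print(np.cov(y.T))                           # empirical covariance, roughly R D R^T
```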
I tend to introduce Gaussian processes... and actually this introduction is strongly motivated by a talk I saw where David MacKay introduced Gaussian processes at a school I organized about ten years ago at Bletchley Park. Who's seen David MacKay's introduction to Gaussian processes? OK, the rest of you have to watch it; it's amazing. Unfortunately he died in April, which is very, very sad. He was only five years older than me, but he always seemed a lot older than that, because he had an amazing way of introducing things. I watched that video again shortly after he died. If you search "David MacKay introduction to Gaussian processes" you'll find it; it's on videolectures, from a school we organized on Gaussian processes about ten years ago. It's a great introduction, and my overview is very much inspired by it.

So I've been introducing multivariate Gaussians as things that correlate, say, w and h, and you can think of them as somehow rotating the system, so that you're introducing these correlations onto an independent system. That's very much the way I think about it, partly because it connects to principal components: this turns out to be the principal component decomposition, the eigenvalue decomposition of the covariance matrix. In some sense, if we go back, the principal axes of this ellipse are the principal components, as defined by Hotelling, not Pearson; Pearson never talked about principal components. The principal axes of the ellipse are the principal components, so they involve a rotation and a scaling: the scalings are the eigenvalues, which are the sigmas I spoke about, and R is the rotation. I like to think about the distribution in that way.

What I'm going to show you here is a sample from a multivariate Gaussian, and this is important; I think you have to try this yourself to really believe that it works. Who has sampled from a multivariate Gaussian in this way? It's an incredible experience, I find, personally; it's one of the nicest things I've ever done in maths. You take a multivariate Gaussian with a covariance structure of a particular form, which we'll talk about in a bit, and you take one sample from it. This is important: this is one sample from a multivariate Gaussian distribution. It's not 25 samples; because it's a vector, it's a 25-dimensional sample, but it's one sample taken from this Gaussian. What I've set up is that this Gaussian has strong correlations in the covariance (the covariance is high if two things are related), and because of that the neighbouring points are very strongly correlated, so what you tend to get is that any two neighbouring points here are pairwise close together. It's also saying something about how that correlation decays with distance: this point is correlated with this one, and this one, and this one; actually it's correlated with everything along here, but the level of correlation is dropping off towards zero. What you actually see is a curvy line. It's odd, because most people expect that when you sample from Gaussians you see something jagged, and this looks smooth. It's not a line, because it's a discrete sample; it's only got 25 discrete points in it. This is a Gaussian distribution, not a process, but it's the starting point conceptually for the process. I'm not going to say yet where that covariance matrix is going to come from.
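Here is roughly how a figure like that can be made. It's a sketch: the transcript doesn't say which covariance is on the slide, so I'm assuming an exponentiated quadratic form with an arbitrary lengthscale, which is the usual choice. Build a 25 by 25 covariance in which nearby points are highly correlated, take one 25-dimensional sample, and the values vary smoothly along the input.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 25)                        # 25 input locations
# Exponentiated quadratic covariance: near one for nearby x, decaying towards zero.
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.3**2)
K += 1e-8 * np.eye(25)                                # jitter for numerical stability

rng = np.random.default_rng(4)
f = rng.multivariate_normal(mean=np.zeros(25), cov=K)  # ONE 25-dimensional sample

for xi, fi in zip(x, f):
    print(f"{xi:+.2f}  {fi:+.3f}")                     # neighbouring values stay close
```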
Very importantly, and sometimes I forget to emphasize this, the mean of the multivariate distribution I'm sampling from is zero. The mean is zero. It's one sample, and of course it doesn't lie along the zero line, because there's essentially no chance in a high-dimensional Gaussian of getting a sample near the mean; you should know that the higher the dimension, the more the samples tend to lie away from the mean. So you will never see a sample at zero; you'll see things off zero (on average these samples will be zero), but you start getting this curve forming here.

To explore that further, I want to go back to looking at marginals again, because we can plot marginals. Instead of weight and height I'm going to plot the marginal of this one and this one, f1 and f2 we'll call them, and they're very strongly correlated. The marginal distribution of a Gaussian is actually really easy to compute; it's another amazing property. Normally, if I wanted the marginal distribution, I would have to integrate out f3 to f25, and that's the headache again: high-dimensional integrals are a headache, they're the bane of Bayesian inference. But it turns out that for the Gaussian, if I represent it in the form of the covariance... and this comes up in the school as well: why do you always use covariances? I think the best answer (maybe you could give several different answers, and I have in the past, but my favourite answer, which I think is indisputable) is that we use covariances for two reasons. One is this trick I'm about to show you, the marginalization property; the other is that sum property, that the sum of Gaussian variables turns out to require the sum of the covariances, which means that when we combine Gaussian processes together we do it through the covariance. There may be more interpretable ways of looking at certain Gaussian processes; you can always argue that this or that is more interpretable, but that's a subjective thing. What is totally objective is that the way to combine two different Gaussian processes, if you want to add them together, one saying one thing and one saying another, is to represent them in terms of their covariances at some point to bring them together. So I like to say it's like the API of the Gaussian process. Who here doesn't know what an API is? An API is an application programming interface: the definition, for a computer program, of how you interact with that program, how you make calls to it. From a computer scientist's perspective, the covariance function is that. It's the same with the mean function (mean functions add together), but we're going to ignore mean functions for a large part; Jeremy may say more about them, because he knew I was going to ignore them. The covariance function is how you combine two Gaussian processes: you ask one for its covariance function, ask the other for its covariance function, and add them together. That's a major reason why the covariance function turns out to be so useful; in this case it's a covariance matrix rather than a function, or rather one that was generated from a function.

So it turns out that the marginal distribution of two of the Gaussian variables is found simply by zooming in on the covariance for those two. This does not work for the inverse covariance, and it doesn't hold for other distributions in general.
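Continuing the sketch from above (same assumed covariance; not the slide's actual numbers): to get the marginal over f1 and f2 you don't integrate anything, you just read off the corresponding 2 by 2 block of K.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 25)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.3**2)   # same assumed covariance as above

# Marginal over (f1, f2): no integration needed, just zoom in on rows/columns 0 and 1.
K_marg = K[np.ix_([0, 1], [0, 1])]
print(K_marg)   # off-diagonal is about 0.96 for neighbouring points with this lengthscale
# Note: this shortcut holds for the covariance, NOT for the inverse covariance.
```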
The colour here is slightly different: as we'll see in a moment, I think this value is 0.96, and this is one on our colour scale. The covariance between these two gives us the marginal distribution. Of course it makes sense that that's the case, or else you wouldn't be able to talk about correlation without thinking about all the other things that might be influencing it; that's why correlation is so fundamental. The relationship between covariance and correlation is that the correlation is just the normalized covariance; it happens that in this case I've defined a covariance that is pre-normalized, so this is also a correlation matrix. The covariance is a nice way of talking about things if you want to know how correlated two aspects are, but it also gives us the marginal distribution.

That is the form of the covariance matrix, and going back to our picture from before, what it gives us is a principal axis that is almost at 45 degrees between the two variables. This is what it's saying about what we expect these joint variables, f1 and f2, to be: it's a very strong correlation, 0.96, but notice there's still some spread off the axis. If the correlation were 1, what would we expect to see here, by the way? A straight line, yes; it would be a degenerate Gaussian. As the correlation drives towards one, the determinant goes towards zero, and that degeneracy is something that happens to Gaussian covariances; it can happen numerically even when you know it's a full Gaussian. So as you take the limit of this going to one, you would get a straight line here, just a Gaussian slice sticking out of the plane, a sort of razor-sharp Gaussian.

So let's make an observation: say we observe that f1 is about minus 0.3. Machine learning, or data-driven analysis of any form, is about making one observation and then making a prediction about something else, and that's what we're about to do. Given this covariance between the two, what I've got is a joint distribution over f1 and f2, a model that tells me how f1 relates to f2, so if I make an observation of f1 I can now talk about what f2 might be. For me, that is the gold-standard way all modelling should be done. In practice you often can't, because of intractabilities, so you do simpler things, deep neural networks and other crazy stuff, but what you really want to do is build a joint model over everything you're interested in, condition on the things you've observed, and get a joint prediction over the things you've not observed. That's the ultimate; few people would dispute that, I think, but of course there are many reasons why it's impractical, starting with where your model comes from. Here, though, we've got our model, so we're going to do it: we've got a piece of training data, f1, and we're going to make a prediction of f2.

So f2 comes from the conditional distribution. I said before that marginal distributions for Gaussians can be thought of as looking at the Gaussian from two directions. I always think of Uluru, or Ayers Rock, in Australia: that big lump that always has a member of the royal family standing in front of it in the pictures on the front page of the Daily Mail.
Or there's a hill over there: we used to have a ski village on top of it, Europe's largest, but not anymore. If you look at that hill you're looking at a sort of marginal distribution from this side, and if someone looks from right angles they're looking at a different marginal; you're looking at something bulging out of the plane. The conditional, if you want the intuition in that analogy, comes from slicing through the hill: we take a slice through the hill, along f2, at our observed point. You can't do this to the actual Uluru, they'll arrest you, but if you were to do it you would get a profile of what the hill looks like at that point, and if you renormalize that profile you recover the conditional distribution. This is p(f2 given f1), in red; this is the observation f1; and that was the original joint distribution. This is the whole process of learning in a nutshell, to me, although we just think of it as probabilistic inference, which is why, in the world of non-Fisherians, those two terms, learning and inference, are conflated. What you have is a model that you've defined, which includes your assumptions about the world, your equivalent of y = mx + c. It may have come from saying y = mx + c and then putting a distribution over m and c, but it doesn't have to; that's the classical likelihood-plus-prior way of doing it, and it may have come that way, but it doesn't have to. It's giving you the relationship you expect between f1 and f2; you make an observation of f1, and then you make a prediction of f2 given f1. It's simple, apart from the complications in practice, but it's quite simple for Gaussians, fortunately.

The conditional distribution says that f2 should be somewhere in the negative quadrant, because of the correlation, but it also recognizes that it might go either side. If the correlation were a hundred percent, this conditioning would just lead to a delta function on the same observed value, which makes a lot of sense, doesn't it; but as the correlation decreases, you get an increase in the variance. Something that's perhaps slightly counterintuitive is how quickly that variance goes up, and I'll show you that, but first let's look at the maths of how it comes about. I'm not deriving anything I'm showing you, because I'm trying to give you the intuitions. Was it Hopfield who said at his NIPS talk that maths should be like sex: it should be done in private amongst consenting adults? I quite like that. Of course we have to show some maths, but we're not going to do the proofs. I make my poor master's students do proofs, because I have to ask them questions in the exam and if you don't show them the proofs they complain, but in a lecture like this I'll just give you the result. It turns out that the conditional distribution, importantly, is also Gaussian. This is very, very weird: you've got the joint distribution, all its conditionals are Gaussian, all its marginals are Gaussian; this doesn't happen for other distributions. The conditional distribution has the form of a Gaussian (this is my notation for a Gaussian, which lots of people use: the Gaussian distribution over f2; sometimes we just drop what it's over, because it's obvious, but here I've got it over f2). This term here is the mean, k12 divided by k11 times f1, and this term here is the variance, k22 minus k12 squared over k11. Let me rewrite that a little.
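The same computation as a couple of lines of code, using the 0.96 covariance and the observed f1 = -0.313 from the slides; the formulas are the ones just stated.

```python
import numpy as np

k11, k22, k12 = 1.0, 1.0, 0.96     # covariance (here also the correlation) of f1 and f2
f1 = -0.313                        # the observed value

mean_f2 = k12 / k11 * f1           # conditional mean:     k12 * k11^-1 * f1
var_f2 = k22 - k12**2 / k11        # conditional variance: k22 - k12^2 * k11^-1
print(mean_f2, var_f2, np.sqrt(var_f2))   # about -0.30, 0.078 and 0.28

# A weaker covariance of 0.57 (used below for f1 and f5) gives a much broader conditional:
print(1.0 - 0.57**2 / 1.0)                # about 0.68
```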
this is the mean of the resulting Gaussian, and this is the variance. It's quite intuitive, actually. The covariance is in our case a correlation, because I've set k11 and k22 to 1, which makes the maths a little easier. The mean of the conditional is the correlation, 0.96, times f1, so the mean is 0.96 times minus 0.313, which was the observation. The variance is 1, because we set k22 to 1, minus 0.96 squared over 1. If the correlation were 1, the mean would be 1 over 1 times f1, which is just f1, and the variance would be 1 minus 1 squared over 1, which is 0: the delta function. Here instead we've got 0.96 squared, which is about 0.92, over 1, so the variance is 1 minus 0.92, about 0.08, and the standard deviation is the square root of that.

I should stop and ask whether there are any questions, actually, because I don't stop enough to do that; do feel free to interrupt with questions. What I want to do, though, is go back to the matrix. I should have spent longer making you ask a question, but I'll come back to that in a moment. What about the correlation between points 1 and 5? Now we just need row one and row five to get that distribution, and I'm going to do the same thing for 1 and 5, so imagine we're predicting f5 instead of f2 from our observation of f1. The correlation between f1 and f5 turns out to be 0.57. Notice how quickly that Gaussian has ballooned out, and some of you should already be able to think about what that's going to do to the variance of the conditional. When we slice through and look at f5 given f1, we get something that's much broader. Although this correlation is quite good in some sense (if we were doing some sort of social science study and found a correlation like that between two variables, we'd get quite excited, and it might well look statistically significant), it's actually very weak in terms of interrelating these variables. Indeed, in the plot from before, when we sampled and actually observed what f5 was (which is what these red crosses are representing), it isn't even in the same half-space as f1; it's quite a long way from f1. In fact you'd be hard pushed to distinguish this conditional from a standard normal: how many observations would you need to tell the two distributions apart? Quite a few, I suspect. I'm not an expert in statistical power and hypothesis testing, but this distribution is very broad.

So one thing that is typically not in people's intuitions, and maybe why covariances aren't always as intuitive as you might expect, is that as these correlations drop off between variables you actually get quite a loose relationship, which is why, when you sample from Gaussian distributions, you often don't see very tight structure; all the interesting stuff happens when the correlations are very high, it seems to me.
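To put numbers on the two conditionals just described, here is a minimal sketch, assuming the values quoted in the talk (an observation of -0.313 and correlations of 0.96 and 0.57); it is not code from the lecture.

    import numpy as np

    def conditional(f1, rho):
        # Zero-mean bivariate Gaussian with unit variances and correlation rho,
        # conditioned on an observation of the first variable.
        mean = rho * f1          # k21 * k11^{-1} * f1
        var = 1.0 - rho ** 2     # k22 - k21^2 / k11
        return mean, var

    print(conditional(-0.313, 0.96))  # mean ~ -0.300, variance ~ 0.078
    print(conditional(-0.313, 0.57))  # mean ~ -0.178, variance ~ 0.675: much broader

The second line shows how quickly the conditional fattens out as the correlation drops, which is the counterintuitive point made above.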
Now, in general, I would love it if the world gave us datasets where you had a model, made one observation and had to make one prediction over something else, but generally we don't get that; we have to extend to the full multivariate Gaussian. And I haven't yet said where the covariance comes from; I'll probably only just get to that, so as not to overrun. What we generally have is the same sort of formula, but now the conditional distribution is over a set of variables. Suppose we have a test set, the variables where we want to make predictions, which in machine learning we might call f*, and we're given a set of variables f as our training set, the things we want to condition on. Then in general the multivariate Gaussian conditional has this form, split into the mean and the covariance, and these are the formulae for Gaussian process learning: the key formulae that underpin everything. Once you're given K, the covariance or kernel matrix, which we'll look at next, it's very clear what you do, because you've made a joint Gaussian assumption. The mean of your prediction is the cross-covariance between your test data and your training data, times the inverse of the training covariance, times f, your set of observations: mu* = K*f Kff^-1 f. The covariance of your prediction starts from the prior covariance and then subtracts off a term involving that cross-covariance again: C* = K** - K*f Kff^-1 Kf*. (These two formulae are written out as a short sketch at the end of this answer.) So it all works in the multivariate case too; it's just less interpretable what's going on. This is what you get your predictions from: if you've made observations f, then mu gives your mean prediction (what prediction you should actually make depends on your objective function, on what your costs are, but for a mean prediction you use this formula), and the covariance formula tells you how things vary around that mean, what the uncertainties are.

Let me pause there, because I'm not sure I explained that bit very well, and take questions. Let me go back to this plot and see if there were questions people had about what I'm doing here and what it led to. Gary will explain it all more clearly later, but any questions? I can sense questions.

[Question: what are the axes of this plot showing?] On this plot the horizontal axis is showing the 25 dimensions, the matrix here is showing the correlation between them, and this is showing the values sampled from the Gaussian. So I've got a vector, these are the dimensions of the vector, I've ordered them in this way, and that ordering dictates how the plot looks; these values here are the values I get from a single sample. In MATLAB you would build your covariance, put it in C, and then what would the command be for sampling from a Gaussian? randn: but if you use randn you have to give it dimension 25, then do a Cholesky decomposition of C and multiply by the Cholesky factor, and there's no mean to add.
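Going back to the prediction formulae read off the slide a moment ago, here is a generic NumPy rendering of them rather than the lecture's own code; the function name gp_conditional is mine, and a linear solve is used instead of forming the inverse explicitly.

    import numpy as np

    def gp_conditional(K_ff, K_sf, K_ss, f):
        # Posterior mean and covariance of f* given observations f, for a
        # zero-mean joint Gaussian with blocks K_ff (train/train),
        # K_sf (test/train) and K_ss (test/test).
        Kinv_f = np.linalg.solve(K_ff, f)      # K_ff^{-1} f
        mu = K_sf @ Kinv_f                     # K_*f K_ff^{-1} f
        V = np.linalg.solve(K_ff, K_sf.T)      # K_ff^{-1} K_f*
        cov = K_ss - K_sf @ V                  # K_** - K_*f K_ff^{-1} K_f*
        return mu, cov

    # The two-variable example from before: one training point, one test point.
    mu, cov = gp_conditional(np.array([[1.0]]), np.array([[0.96]]),
                             np.array([[1.0]]), np.array([-0.313]))
    print(mu, cov)   # roughly [-0.300] and [[0.078]]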
The MATLAB route isn't the most intuitive way of doing it. It goes back to what we were saying earlier, that you build the Gaussian covariance by rotating the space: the way you get multivariate Gaussian samples is to sample independently with randn and then multiply by that transformation, which is why you do the Cholesky. In Python I think it's multivariate_normal, written out in full, and then you say mean equals a vector of zeros, covariance equals C, and then, importantly, you ask for one sample: only one, not 25 samples. You then get back a 25-dimensional vector, or whatever dimension the mean and covariance were, with this covariance and zero mean, and we're just plotting that vector along this line. Other questions?

[Question, roughly: if you know there is a linear trend in the data, shouldn't you put it in the model?] I've had some really interesting discussions about this, and it has led me to a view that I think would be consistent between what Jeremy would do and what I would do; to explain it I'm going to go all the way back to here. If you're a statistician, very often you believe in this model and you want to know m, because you want to go to a doctor and say we're seeing this increase in heart disease due to smoking. The key difference between statisticians and machine learners is that statisticians typically have to explain where their answers come from, because someone is going to make a decision that people's lives depend upon; that's the whole point of mathematical statistics. But in some parts of mathematical statistics, in spatial modelling, remember the corruption we talked about here: pretty much a whole class of spatial models can be seen as taking this model and then saying that the corruption is drawn from a spatial Gaussian process, which is not independent. Least squares comes from assuming independence across these points, but if you're building a spatial model, trying to work out disease across Sheffield or something like that, then typically you get noise that is spatially correlated, and you model it as spatially correlated noise. So some classes of statisticians, spatial and geostatisticians in particular, very often come across Gaussian processes with the process being a spatially distributed nuisance parameter that you're trying to get rid of, while all the modelling you care about is done in m and c.

Now, I think you have to be very cautious about claiming to know about linear trends, linear ones in particular, because trends that look linear tend to saturate outside the range of the data. If your model is purely predictive, which would be the machine learner's perspective, and you've forced a linear trend in there, that trend is going to extrapolate into regions where you've never seen data before. So the machine learner's perspective would be: don't put the linear trend in, because as you move away from the data it's going to cause you to make crazy predictions you have no right to make. It's like Hooke's law on a spring: it's like saying the spring will continue to be linear as I keep pulling it, which we all know it won't; the spring will break and have your eye out, I suppose.
But you don't get data in that region, so the machine learner would say: no, don't do that; put all the modelling in here, don't force linear trends in, and let the data speak. This is a difference in philosophy, and both are correct according to what they're doing and what they want out of the model. A statistician is very often trying to say that this parameter is meaningful in some way. Would you say that's true, Jeremy, or partly true? It's partially true; if you're not a Fisherian, things can only ever be partially true, because of the sustained uncertainty. Okay, that was a bad idea about PDF animations. Other questions?

[Question: are these samples sums of Gaussians?] No, that's a different thing. A Gaussian mixture model arises when I have two Gaussians, and its density looks like a sum of Gaussians: that's the sum of the PDFs. Where the confusion comes in is this: if I add the PDFs of Gaussians I have a mixture model, but that is not what the sum of samples from those distributions looks like. Mixture models come about from a probabilistic way of switching between two Gaussians. I give one Gaussian to each of you at the front, and then I flip a coin as to who I ask; I never add the things you say, I only take one or the other. So I'm now sampling here, now sampling there: the left component is this one, the right component is that one, and I either get something from one or something from the other, but I never add them (there's a small sketch of this distinction a little further down). The PDF of that is of the form of a sum of two Gaussians, but that is not the same as summing samples. In fact one horrible nightmare about probability distributions is that the distribution of a sum of samples is not easily resolved: you have to use characteristic functions and convolutions and so on, so it's never trivial. Well, it is trivial in the Gaussian case, but it's not normally trivial to talk about the distribution of the sum of two variables, which is why the central limit theorem is so important, and why a lot of classical statistics is about making sure you're measuring something that provably leads you to the central limit theorem. Is that clear? Other questions?

[Question about the correlations that appear to be zero far from the diagonal.] That's really interesting, actually. They are basically zero numerically here, but in actuality they're non-zero, because of the way I happen to have computed them, which I haven't talked about yet. It just means that as you get far enough away you get independence, the sort of thing you rely on for a Markov property. Strictly speaking this example is not Markovian: every single point is correlated with every other, because this doesn't quite go to zero. But you can build models which are Markovian as well, as we'll look at when I next speak. It just means that a point down here is pretty much independent of this one. We even saw for the fifth data point that they're not looking that close; once you get down to here, these things are essentially independent. Of course it happens that they've landed in the same place, but that's coincidence; the correlation says they're nearly independent. Yes, that's where it's easy to get confused, and that's the transition you need to go through.
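Picking up the mixture-model question from a moment ago, here is a tiny sketch of the difference between switching between two Gaussians and adding their samples; the means, variances and mixing weight are toy values of my own, not anything from the talk.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 10000
    a = rng.normal(-2.0, 1.0, n)   # samples from Gaussian 1 (toy values)
    b = rng.normal(3.0, 1.0, n)    # samples from Gaussian 2

    # Mixture: flip a coin per draw and keep ONE of the two values;
    # its density is the (weighted) sum of the two pdfs and is bimodal.
    keep_first = rng.random(n) < 0.5
    mixture = np.where(keep_first, a, b)

    # Adding the samples is a different operation entirely; for Gaussians the
    # sum happens to be Gaussian again (mean 1, variance 2), which is the
    # special tractable case mentioned above.
    summed = a + b
    print(mixture.mean(), summed.mean(), summed.var())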
Normally we don't think about models in that correlated way: normally we make an iid assumption and say that, given some parameters, things are independent. That's why it's a little bit hard, I think, when you first see Gaussian processes, because you're not saying that: you're saying there's correlation within the data, not within the features as when we fit Gaussians, but within the data. It certainly took me a while to understand the implications of that when I was first looking at GPs. I guess I heard about GPs in about '96, because I was at Aston with Chris Williams, and I remember David Barber talking about them quite early on; he gave a really good intuitive account. I saw Chris Williams speak about them, I saw David MacKay speak about them, but I think it's actually only when you start playing around with them and implementing them that you really start to understand. You think, oh, that's nice, but it's a little bit alien. So that probably didn't help, but yes, you should be confused; when you hear Jeremy you'll fully understand it. Any other questions?

Okay, so what we haven't said is how we do regression problems: they have inputs, and we haven't said where the inputs go. Of course, the inputs actually go into the covariance function. So the next thing I want to say is: how does x come in? I call this f, and the reason I call it f is that we're modelling a function f(x). So much of machine learning can be boiled down to modelling a function, and if you're doing surrogate modelling, that's very often what you're trying to do. The sort of thing Jeremy does in computer code modelling is: you run a large-scale simulation of the climate, you set its parameters according to this, the output is two degrees warmer in 100 years, and then you emulate that with this function; that's what we're trying to estimate here. That's why I use f, because it's the function; I would use y if I've corrupted that function by adding noise to it. So we've got the f. Where's the x gone? The x can really only be in the covariance, because I'm not using mean functions (Jeremy will). So this is a covariance matrix, which is discrete, and f was a vector, which is discrete.

Now, from an engineer's perspective, if you just make the vector infinitely long, and index its elements not by a discrete number, one, two, three, up to twenty-five, but by a continuous variable, that's a function. I say an engineer's perspective because my undergraduate degree is in mechanical engineering. f as a vector is just a series of numbers, 2.2, 2.3, 2.4, and that's a discrete function: element 1, element 2, element 3, element 4. It's a lookup table; it's not a function of a real-valued input, because it has discrete indices. The engineer's perspective is that continuity doesn't really exist anyway, it's a mathematical abstraction, so if I make the gaps between these indices smaller and smaller and increase the length of the vector towards infinity, then the vector becomes a function. Conceptually that's what you're doing with a Gaussian process. If you actually want to prove things about that, you have to go to martingales and Kolmogorov consistency and all this machinery to show it's consistent. I'm not going to do that, because I'm not sure it helps you use these models in practice, but it's certainly interesting mathematics if you want to look at it.
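To make the vector-becomes-a-function picture concrete, here is a small sketch, not from the talk: one correlated sample is drawn on finer and finer index grids, using for concreteness a smooth covariance of the kind constructed next (its exact form here is my assumption).

    import numpy as np

    def eq_cov(x, alpha=1.0, lengthscale=2.0):
        # A smooth covariance of the exponentiated-quadratic kind described
        # in the next part of the talk; the precise scaling is assumed.
        d2 = (x[:, None] - x[None, :]) ** 2
        return alpha * np.exp(-d2 / (2.0 * lengthscale ** 2))

    rng = np.random.default_rng(3)
    for n in (25, 100, 400):
        x = np.linspace(0.0, 25.0, n)
        K = eq_cov(x) + 1e-8 * np.eye(n)     # small jitter for stability
        L = np.linalg.cholesky(K)
        f = L @ rng.standard_normal(n)       # ONE n-dimensional sample
        print(n, f[:3])
    # Each f is a single correlated sample; as n grows, the vector becomes a
    # finer and finer lookup table for the same kind of smooth function.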
The same goes for the covariance matrix: it's discrete, a bivariate function with two inputs, k(1,1), k(1,2), k(2,1), k(2,2) and so forth, a symmetric bivariate function. If we want it to be a covariance function, what we need is some way of generating its entries from our continuous index, and that is what I did to generate this. I just took discrete indices x, set from 1 to 25, and computed the squared Euclidean distance between each pair of indices: if it's 1 and 1, that's 0; for 1 and 2 it's 1 minus 2, which is minus 1, squared, so 1. Then I put in a length scale. The length scale declares how far these correlations extend: the longer the length scale, the longer-range the correlations. So this value stays small if l is large, or if the distance between x and x', two separate inputs, is small, and then I take the negative and exponentiate. If the distance were 0, this would be alpha times 1. For the matrix above I set alpha to 1; I can't remember why I set l to 2. So when the two inputs are the same you're getting ones along the diagonal, and the length scale is set in such a way that the correlation falls off as you move away from it.

Let me just pause there and check whether there's a question, since I'm going to have to stop in a minute. [Question: what does alpha do?] Good question. Alpha is the scale of the function: if I set alpha to 1 then two standard deviations is plus two to minus two; it's giving me the marginal values here. If I want to know marginally where I expect the function to be, without any other knowledge, alpha sets that, so we often call it the variance parameter, because for stationary covariances it's giving you the variance of the function. So one parameter scales the functions vertically and the other scales them horizontally. Other questions?

I'll just briefly give you an overview of how we then get a matrix out of that, because I think it's important. In practice we just get data: we're given a set of data points. Say we've got three potential places where we're going to observe the function (maybe training, maybe test, it doesn't matter): x1 is minus 3, x2 is 1.2 and x3 is 1.4, and I've set l to be 2 and alpha to 1. If I want the elements of that matrix, this covariance function is my machine for creating it, and I implement it in a bit of Python or R or whatever I like. Oh dear, I seem to have lost all my follow-on slides from this point, which is really weird, so I won't dig them out now, since we've only got a couple of minutes; I can always go through it in the next lecture if you want. What I do is substitute in x1 and x2 to fill in the matrix. Normally you've got a matrix, let me try to draw it on the board, which here would be 3 by 3, three rows and three columns, and I've got this function. To fill in the matrix I've got x1, which is minus 3, so I look at the squared distance between x1 and itself, which is 0, and I'm putting a 1 in here; to get this next element I need to compute the distance between x1 and x2 inside the function.
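Since the follow-on slides were lost, here is a minimal sketch of what that machine for creating the matrix might look like, using the three inputs and settings just quoted (x = -3, 1.2, 1.4, l = 2, alpha = 1); the factor of 2 in the exponent's denominator is an assumption, as the talk only says the negative squared distance is scaled by a length scale.

    import numpy as np

    def eq_cov(x, xprime, alpha=1.0, lengthscale=2.0):
        # alpha * exp(-squared distance / (2 * lengthscale^2));
        # one common convention for the exponentiated quadratic.
        return alpha * np.exp(-((x - xprime) ** 2) / (2.0 * lengthscale ** 2))

    X = np.array([-3.0, 1.2, 1.4])        # the three inputs from the talk
    K = eq_cov(X[:, None], X[None, :])    # fill in the 3 x 3 matrix
    print(np.round(K, 3))
    # Ones on the diagonal; x2 and x3 are close, so their entry is near 1,
    # while x1 is far from both, so its off-diagonal entries are small.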
Normally we would have a matrix where we've just got some indices, and the mapping between those indices and the continuous idea of the function comes from the fact that every data point comes with an index. The mapping between a data point and its index is arbitrary; we set it up as we like. Before, I was just substituting the index in to get this formula out, but normally we would have the value of the input at which the observation was made, and that's what gives us the covariance matrix. From that I can compute a matrix like this: that's my input. It won't be exactly like this, and the whole point of the next bit was to show you what those matrices look like in practice: they wouldn't look so structured, they'd be determined by the data you have. And then I've got a K that I can combine to make predictions over f. So my x comes in to define the Gaussian process, through the covariance function, and my f is the output from it. I don't think I was super clear on that, so we'll take a few questions, and I may come back to it when I talk again, but Jeremy may clear it all up anyway. Questions? I don't know what happened to my slides.

[Question: are these the samples themselves?] So, just to repeat the question: it's not 25 different samples, it's one sample that's 25-dimensional. That's really important. I keep saying that, but no one ever listens; and it's not you, it's just the sort of thing people can't believe you'd do. Why would you do that? That's the transition: the function is a correlated sample; every function is one sample, one full sample from a Gaussian process. That's really, really important, and you're not used to it, because you're not used to seeing people talk about data in that way. You're used to people saying iid; what you're used to seeing people talk about is sampling noise, and it's so uninteresting. The main thing iid means is noise, that epsilon we had at the beginning, the one Laplace described. Normally you say the noise is iid and then you write down a likelihood where, conditioned on the parameters, your data are independent. But think about what you're saying: if you were genuinely saying your data were independent, the whole thing would be pointless, because you could know nothing about one data point given another. We never think about this properly, because I think it's taught this way for simplicity, but we never think about what it means to say iid: it is not an assumption about your model, it's an assumption about the difference between your model and the real world. That's really important. When you do Gaussian processes, you don't make that assumption; you assume that everything is correlated together, so you can no longer just sum over all these data points, which gives you some headaches. Other questions?

So, the way that correlation is formed depends on the covariance function, which I'm afraid I didn't describe very well because of the weird loss of slides. This is your model; this is the fundamentals of your model for this case, and what we're going to hear about later on from Nikolai is different covariance functions you can use. The mechanics of what I've just told you stay the same, and in a sense those mechanics are kind of uninteresting, because if you give me a model and you give me some data, that's how I make predictions. You need to know that and have intuitions about it.
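Putting the pieces together, here is a hypothetical end-to-end sketch, again standing in for the lost slides: it treats x1 = -3 as a training input with an assumed observed value of -0.313 and predicts f at the other two inputs using the conditioning formulae from earlier; everything beyond the inputs, length scale and amplitude is my own illustration, not the talk's.

    import numpy as np

    def eq_cov(a, b, alpha=1.0, lengthscale=2.0):
        # Same assumed exponentiated-quadratic covariance as above.
        return alpha * np.exp(-((a[:, None] - b[None, :]) ** 2)
                              / (2.0 * lengthscale ** 2))

    x_train, f_train = np.array([-3.0]), np.array([-0.313])   # assumed value
    x_test = np.array([1.2, 1.4])

    K_ff = eq_cov(x_train, x_train)   # training covariance
    K_sf = eq_cov(x_test, x_train)    # cross-covariance test/train
    K_ss = eq_cov(x_test, x_test)     # prior covariance at the test inputs

    mu = K_sf @ np.linalg.solve(K_ff, f_train)
    cov = K_ss - K_sf @ np.linalg.solve(K_ff, K_sf.T)
    print(mu)            # posterior mean at the test inputs
    print(np.diag(cov))  # marginal variances around that mean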
But the whole point in modelling, whether you're a data scientist, whether you're in uncertainty quantification, or whatever else, is understanding what assumptions you're putting into the data and where those assumptions come in. So far, where are the modelling assumptions coming in? Someone tell me what assumptions we've made. The kernel function: that's one. Here we're making assumptions about how the correlations between points fall off according to a given length scale, and about what the overall scale of the function is. What other assumptions have we made? The multivariate Gaussian: we've assumed that all our observations can be modelled with a joint Gaussian distribution. Someone give me an example of a case where that's going to be totally rubbish. Yes: financial data. Frustratingly, most of the time it is Gaussian: they make profits for ten years and then you get a financial collapse and it isn't Gaussian; people do stupid things like leaving their economic community and everything drops off the scale. Financial data is irritating because day to day it is pretty Gaussian, or log-Gaussian in some sense. But what's the characteristic that causes it to be non-Gaussian? Fat tails; fat tails is a big one, and discreteness. You can do things with Gaussian processes where you deal with a discontinuity like that if you know it's coming, but people typically don't know. If it drops like that and then goes up again, that would be non-Gaussian, because if you look at the joint distribution between aspects of this function, it might be Gaussian in this region here, but overall it's non-Gaussian, and then you have to look at things like Levy processes to deal with it.

Now, why isn't this all about Levy processes then? Someone explain why we're not talking about Levy processes. Yes: specifically, you can't write down their likelihood. The Gaussian process is a special case of a Levy process for which you can write down the likelihood. We talked about sums of distributions before; Levy processes turn out to be the sum of two processes, one of which is a Gaussian process and one of which is a jump process (I think there's another term in there I'm forgetting as well), and just by saying I've got a Gaussian process plus a jump process as my observation, everything becomes intractable. It's a nightmare. People do study them, they're quite well studied, but they're not that well deployed, because they're difficult to implement on a computer, and a lot of what we're trying to do is get something that is easy to implement on a computer. Your modelling assumptions in real life are compromised by your need to implement the model on a computer, by your ability to express it in a simple enough mathematical form that you can do the relevant inference. Gaussian processes really stick out in the zone of being a very powerful model for which inference is almost trivial, and that's why they're interesting, not because they're always correct. You can sit in meeting after meeting with people saying, well, you know, not everything is always Gaussian, and it's true, but it's much easier to start with a Gaussian process very often and then try to develop it, because you get so much power with the GP in the first place.
But there are many, many situations where they don't apply, so I'll sum up there, and Gary is going to cover everything else. A final question before coffee?

[Question about count data.] Say that again: count data? It depends how big the counts are. If counts go high enough, what does the sum of a set of Poissons converge to? It actually stays Poisson, but it starts looking Gaussian, and I think they're related to the binomial as well. So if I've got a series of Poisson rates and I'm adding them all together, then if the counts are high it looks very Gaussian in terms of the likelihood. It's still wrong, but you can approximate it well by a Gaussian. For count data in sequencing, for example, you might be looking at things like negative binomials; if the counts are low then it's a little bit tricky and you have to start using approximations.

So Gaussian processes are trivial and elegant for many different reasons, but the headaches are as follows. Number one: you have to do a matrix inverse, or even if it were only a matrix multiplication it would be the same sort of difficulty, growing with the size of your training data (there's a small note on the standard implementation trick just below). If that size is n, and n is a million, then you have to invert a million by million matrix; you won't even fit that in your computer, it won't happen. So in the most general form you can't do it, and you have to either use approximations or build specific classes of Gaussian process models, like Kalman smoothers, for which that inverse is tractable for reasons you can't see in this form. That's one area: massive data. The other area is when things become non-Gaussian, like count data, and depending on how severe that is, the classical solution is to build generalised linear models with Gaussian processes underpinning them: to use link functions and likelihoods to map the data into a domain where you've got a Gaussian process again. Even that sometimes isn't sufficient. That's one type of non-Gaussianity; another type is when the actual process itself has heavy tails. This is the, what's the guy who wrote the Black Swan book? Nassim Taleb, yes. It's like the Black Swan thing. I've read little bits of what he says, and I just think it's not true that people don't take that into account; people do, but it's hard to deal with. This is your Black Swan moment; in machine learning we would say it's just a nonparametric discontinuity. It's not a Black Swan moment, it's a nonparametric discontinuity, and the challenge is combining a nonparametric discontinuity with an underlying Gaussian process in a way that's tractable. We can envisage these models; whether we can do inference in them or not is the real question, so sometimes we look to approximate such a thing with a Gaussian process. So there are massive limitations on the model, but the limitations are far fewer than for many other models, and it's a very flexible framework. Okay, let's really stop there for coffee.
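As a footnote to the first headache above, a common implementation trick, not something discussed in the talk, is to factorise the training covariance once with a Cholesky decomposition rather than ever forming its inverse; the cost is still cubic in the number of training points, which is exactly the scaling problem described.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def gp_mean(K_ff, K_sf, f, jitter=1e-8):
        # Factorise K_ff once (O(n^3) time, O(n^2) memory in the number of
        # training points) and reuse the factor instead of computing K_ff^{-1}.
        factor = cho_factor(K_ff + jitter * np.eye(len(f)))
        return K_sf @ cho_solve(factor, f)

    # e.g. the earlier two-variable case:
    print(gp_mean(np.array([[1.0]]), np.array([[0.96]]), np.array([-0.313])))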
Info
Channel: Open Data Science Initiative
Views: 7,025
Rating: 4.9148936 out of 5
Keywords: machine learning, Gaussian process
Id: jjRoN-j8Rj0
Length: 92min 13sec (5533 seconds)
Published: Tue Sep 13 2016