The Kriging Model : Data Science Concepts

Video Statistics and Information

Captions
[Music] Everyone, in this video we're going to start talking about spatial statistics, and more specifically about a very widely used model in spatial stats called the kriging model. Let me give you a bit of context so we're not just doing math and formulas for the rest of this video.

Let's say you're a very famous biologist and your whole career has been focused on exploring this big island. You've gone back to the island several times and you've taken measurements at each of these red dots on the island, so assume there are many, many measurements. At each measurement you have two pieces of information. The first is x_i, the location of that point: it could be a latitude and longitude, it could be some (x, y) pair; it basically just gives you a geographic pinpoint of where that location is. The other piece of information is y_i, the elevation (in feet or meters or whatever) at that point on the island.

Now, as many data points as you have, sometimes in your research you need the elevation at a point on the island where you did not explicitly collect data. For example, let's say you care about the elevation at this black dot right here for some reason, but you never collected data there. You also don't have the time or resources to fly all the way back to the island and collect an elevation measurement there, so you're going to have to do the next best thing, which is to use the existing data you have to make a good prediction of the elevation at that point.

As with many things in stats, we make predictions about unknown quantities based on things that are similar to the unknown quantity. We're going to do the same thing in this spatial context, except here similarity will mean literal geographic similarity. So what I've done here: you'll notice there are five closest red dots to this unknown point, and I've zoomed that picture in
down here. In the middle we have y_new, which is the unknown elevation, and then we have y_1 through y_5, which are the elevations of the five closest neighbors geographically to our unknown point. The kriging model basically says that our prediction of the elevation at the unknown point is going to be some linear combination of the elevations at my five closest neighbors. Makes sense?

So let me go ahead and write the model out. We're going to say that y_new, the elevation at the unknown point, is w^T y. That's the matrix form; written out fully, it's

y_new = w_1 y_1 + w_2 y_2 + ... + w_5 y_5 + epsilon_new,

where the part I left out of the matrix form was the error, epsilon_new. How you read this formula in English: if I want to predict the elevation at my unknown point, I'm going to say it's some linear combination of y_1, y_2, up to y_5, where each of these elevations has a weight, w_1 through w_5, because I might care about some of my neighbors more than others. Then of course I add an error, because I'm never going to be exactly correct; I'm going to be off by, hopefully, a little bit. In a nutshell, this is the kriging model.

Now, the only unknowns here are the weights. If I knew w_1 through w_5, I could just plug in the weights and my five known elevations and get a pretty good prediction of what my new elevation should be. But how am I going to get at those weights? The only other piece of information I have is the distances from my unknown point to my five closest neighbors. Maybe, just crudely, the closer one of my neighbors is to me, the more weight I should give it, and the farther away it is, the less weight I should give it. But how do I formalize this idea? This is where we use this thing called a variogram in spatial stats. I wanted to
make sure to include the variogram in this video because it's such a widely used tool, such a widely used graph, in spatial stats. So I wanted to first give you some exposure to it, and then we'll see how it's used in the context of the kriging model.

The first thing to understand about the variogram is this function, gamma, of two points x_i and x_j. Remember, x is the literal spatial placement of a point, where it is geographically. If I put in two points x_i and x_j, then gamma is simply given by

gamma(x_i, x_j) = 1/2 (y_i - y_j)^2,

and the y's, remember, are the elevations. The story this equation tells is that if I put in any two points, let's say this guy and this guy, so this is my x_i and this is my x_j, then the function is one half of the squared difference in elevation between those two points.

What would we expect from this function? We'd expect that the closer two points are in space, the smaller gamma is. If I'm here and I take one step to my right, that's a very small change in distance, so my elevation should not be that different either; whereas if I'm here and I walk a mile away, I'd expect my elevation to potentially change by a lot. So the closer x_i and x_j are to each other, the smaller y_i - y_j will be, and therefore the smaller gamma should be.

Using those intuitions, let's draw a graph of what we expect h versus gamma(h) to look like, where h is the distance between two points. As the distance goes up, we expect gamma, this function of the difference in elevations, to also go up. So maybe it starts like this: it's going up, it's going up, but we expect it to plateau at some point. Why? Here's the story: whether your two points are 100 miles away or whether
they're 200 miles away, you're not expecting your gamma to change by all that much. Basically, if I'm here and I walk 100 miles versus if I'm here and I walk 200 miles, I'm not expecting this function to be all that different. That's why we're expecting this curve to plateau.

Now there are a couple of important points on this graph that I want to tell you about, because they have very specific terminology in spatial statistics. The first one has the funniest name: it's called the nugget. The nugget is the y-value here where the graph begins. Some of you are thinking: shouldn't the nugget always be zero? If my distance is zero, I'm expecting the difference in elevation to also be zero. Well, theoretically yes, but of course this is a theoretical variogram. Let me actually write the name: this graph is called a variogram, and this one is a theoretical variogram.

But if you look at this island, let's say we take two pairs of points that are each a distance h apart: this point and this point are a distance h apart, and this point and this point are a distance h apart, but they might have different elevation differences. So, for example, if that h were here, one of the pairs might have gamma this and the other pair might have gamma this. So a true variogram looks more like a cloud: it's not a single line but rather a kind of shape. It's just that the variogram we fit to the data is the best fit, and that's what we end up calling the theoretical variogram, versus the actual variogram.

Okay, so that's why we have a nugget. The higher the nugget is, the noisier our data, because for all pairs of points that are distance zero, or almost zero, from each other, we should have an elevation difference of zero. The higher that elevation difference is, the more noise there is in our data. So that's the first one. The
second one is called the sill. The sill is the ceiling of the variogram, where it can no longer get higher than a certain value. And the point where it first hits the sill, let's say it's around here, is called the range: the range is the h value, the distance, at which the variogram hits the sill. So these are three important terms in spatial stats.

So why did we go through this whole discussion of the variogram? Now we're going to tie the variogram back to our kriging model. I thought about going through the math behind the kriging model, and I even made that video, but it got a little convoluted, so I want to give you the watered-down version. If you still want me to go through all of it, go ahead and leave a comment below.

What it boils down to is that solving the kriging model, solving for these weights w, ends up basically just being solving a matrix equation: a matrix A times our weights w (so w is just the vector of w_1 through w_5) is equal to some other vector b, that is, A w = b. A is composed of variogram functions gamma(x_i, x_j), where the x_i and x_j are the five points that are neighbors of the unknown point. b is also composed of variogram functions, but between the x_i, my five neighbors, and x_new, the new point that I would like data for. So to solve this I would just take the inverse: w = A^{-1} b. Call that w-hat, maybe; those are my predicted weights, or my best weights. I take my best weights, plug them into my kriging model, and get a prediction for the elevation at my unknown point. That's why the variogram is important: we need it to fit the kriging model and find these weights w.

The last thing I want to talk about in this video is when you should use the kriging model. We just spent the last several minutes talking about this really cool model in spatial stats called the kriging model, but as
with all models, there are certain situations where you should and should not use it. I wanted to leave the assumptions until the end because they rely on some stuff we just learned. If you look online you'll see more formalized versions of these assumptions; I just want to give you the in-a-nutshell version.

In a nutshell, there are two assumptions. The first is stationarity. Stationarity is something we saw in the time series videos, so maybe you're more familiar with it in that context, but it really has the same definition in spatial stats. Stationarity says that if I look at any small chunk of the island, I should get the same attributes, such as the same mean of the elevations and the same volatility, or standard deviation, of the elevations. So there should be no chunk of the island where the elevation is changing a lot more rapidly than in other chunks of the island. That's the idea behind stationarity.

The other important one is a constant variogram. Now that we've talked about the variogram, we can look at what this means. The variogram is basically a relation between the distance between two points and the difference in their elevations. We're saying that no matter which little chunk of the island we look at, our variogram should look pretty much the same. So whether I look at this chunk of the island, this chunk, or this chunk, the variogram, the change in elevation based on the change in distance between two locations, should be about the same.

If these assumptions are met, if the data is stationary and has a constant variogram, then we're safe to use the kriging model. Even if these assumptions are not met, we can always do some kind of transformation to try to get our data to be more stationary or to have a more constant variogram. Some basic ones would be taking the log of the data or taking the square root of the data. There's more complicated stuff you can
do in stats to coerce data to these conditions, but whatever you had to do to get it there, once it's there you're safe to use this really cool thing called the kriging model.

Actually, the last thing I want to do is talk about the pros and cons of the kriging model. One of the biggest pros is that the kriging model gives us not just an estimate of, say, elevation at a certain point; it also has a built-in feature, which we didn't talk about, that gives you the error at that point. So you can say: this is my prediction, and this is on average how far off I think I am. It gives you this nice error estimate built into the model.

One of the biggest cons of the kriging model is the computational intensity, or how long it takes to get these weights. If you notice, the weights we found by solving this linear equation are very specific to these five points, because A is constructed using the gamma function between pairs of these five points, and b is constructed using the gamma function of our new point with each of these five points. So if I decide I want to predict the elevation somewhere else and use a different set of five points, then my entire linear system has to change, which means I have to redo this matrix inversion, which can be very costly. Basically, every set of weights is really only good for predicting one point, so if I'm predicting hundreds of points, I have to do hundreds of matrix inversions, which could be very expensive.

So those are the pros and cons of the kriging model, these are the assumptions of the kriging model, and this is the kriging model itself. Hopefully that helped you get a good, gentle introduction to spatial stats and the kriging model, and especially the variogram, because that's something that's used not just in the kriging model but in a lot of areas of spatial stats. Again, if you have any questions I'll be happy to answer them; just put them in the comments below,
go ahead and like and subscribe for more videos like this, and I'll see you next time.
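
The two computations described in the captions, the gamma function between pairs of points and the linear system for the weights, can be sketched in Python roughly as follows. This is a minimal illustration and not the video's own code: the coordinates, the elevations, the exponential variogram model and its `sill` and `scale` values are all made-up assumptions. The system solved here is the standard ordinary kriging form, which augments the A w = b described above with one extra row and column so that the weights sum to one; the video's watered-down version omits that constraint.

```python
import numpy as np

def semivariance_cloud(X, y):
    """Distance h and gamma = 1/2 (y_i - y_j)^2 for every pair of points."""
    i, j = np.triu_indices(len(X), k=1)
    h = np.linalg.norm(X[i] - X[j], axis=1)
    g = 0.5 * (y[i] - y[j]) ** 2
    return h, g

def variogram(h, sill=1.0, scale=2.0):
    """Assumed exponential model: starts at 0 (no nugget), plateaus at `sill`."""
    return sill * (1.0 - np.exp(-np.asarray(h) / scale))

def kriging_weights(X, x_new, gamma=variogram):
    """Weights for the n neighbors in X when predicting at x_new."""
    n = len(X)
    # Top-left block: gamma between every pair of neighbors.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gamma(d)
    A[n, n] = 0.0
    # Right-hand side: gamma between each neighbor and the new point,
    # plus a final 1 that enforces sum(w) == 1.
    b = np.ones(n + 1)
    b[:n] = gamma(np.linalg.norm(X - x_new, axis=1))
    solution = np.linalg.solve(A, b)
    return solution[:n]  # drop the Lagrange multiplier

# Hypothetical data: five neighbor locations and their elevations.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.5, 1.5], [-1.0, 0.5]])
y = np.array([10.0, 12.0, 11.0, 15.0, 9.0])
x_new = np.array([0.5, 0.5])

w = kriging_weights(X, x_new)
y_hat = w @ y  # the linear combination w_1 y_1 + ... + w_5 y_5
print("weights:", w)
print("predicted elevation:", y_hat)
```

Note that `kriging_weights` has to be called again for every new location, which is the computational cost mentioned in the pros and cons. In practice the variogram model would be fitted to the cloud returned by `semivariance_cloud` rather than fixed by hand.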
Info
Channel: ritvikmath
Views: 40,251
Keywords: data science, kriging, spatial stats, data, statistics, modeling, prediction
Id: J-IB4_QL7Oc
Length: 14min 35sec (875 seconds)
Published: Wed Jan 15 2020