Introduction to monte carlo simulations using R - The absolute basics

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

welcome to the screencast on teaching simulations the absolute basics using R so the goal for today's screencast is to really just describe the basic ways and basic idea of doing simulations of doing Monte Carlo simulations in particular that we will then use in later screencasts and in later lectures for doing statistical inference ie generating Monte Carlo confidence intervals and generating sampling distributions under various mill models as well as for doing simulations for power analyses for complex models which which there may be no closed-form approach for and we're using R to do this both because it's quite convenient for these sorts of simulations indeed it's quite powerful for these sorts of simulations and of course it has many statistical tools which we can rely on later so the first place to think is well what do I mean by simulate what am I mean by this all I all I mean but this is that we're generating some form of random data based on some knowledge of some kind of distribution so for instance we'll start with in this tutorial using a normally distributed variable so we assume we're sampling say from a population that's normally distributed with a particular mean and standard deviation we're but taking out a fixed number of samples in each time we do that we can estimate something like the mean and the variance like we would otherwise and if we wish we can repeat that over and over again and use those repeated measures of the means and variances and whatever slope intercepts doesn't really matter for making inferences later on so here's a simple quick example we're going to create this variable called norm dot simulated and what we're doing here is we're going to take 100 we're going to extract essentially a hundred samples from a distribution from a population that's a normally distributed population with a mean of five and a standard deviation of two we know it's a normal distribution because it's using norm our norm means we're generating these random variables variance from from this distribution so we run that line and we can actually look at it just go normed on simulated and indeed we get a hundred numbers that are from this distribution with mean of five and standard deviation of two now of course none of them are going to be exactly equal to five because they're being sampled from this distribution so we can look at this in a couple of ways we can plot it we're going to plot first just the random numbers here and so the first random number is some fairly low number looks like it's pretty close to one or so and then that bounces around you say bouncing around because each time it's sampling independently from this to distribution so for each of these hundred numbers they're independent of all the other draws so there's no relationship here so there's no sort of Markov process here we're not relating each sample is completely independent all right then we can look at surveys to gram of this what we'd expect and indeed what we we do see is that this distribution is centered near five which we expect the mean that we simulated from from this you know this hypothetical population had a mean of five and we see that the mean of this distribution is close to five with a standard deviation well look at that a little bit more closely we're going to look at the histogram again just in such a way so that we can actually plot it against whatever our theoretical expectations would be so the in this bottom plot we've just altered so we're looking at the density instead of the frequencies but otherwise it's going to look identical but what you see in red is the theoretical distribution for normal distributed variable with mean of five and standard deviation of two and you can see that our histogram we could have done this as a density plot instead but you see that our histogram overlays our simulated data reasonably nicely which we would expect this is simulated from that distribution after all okay so we can look at some summary statistics from the simulation also we'll just look at the mean and standard deviation nothing particularly exciting here and we get a mean of 5.0 - which is pretty close to five and a standard deviation of one point seven six which is not too far from two it's certainly not spot-on but not not too far at all okay so we do this once well let's repeat this experiment over and over again so we're repeatedly going to go back to that original population and randomly sample 100 individuals and each of those hundred individuals from that population is an independent draw so that doesn't doesn't depend on previous draws or future processing any way so we do that just four times it was called normed simulated one to four so I'm just going to copy and paste those and let's look at the histograms for all four of those draws now what you see here is that we have these four histograms with each of the samples and they're all centered close to five which we expect because each other they were drawn from the same distribution with a mean of five and standard deviation or two but of course these are different theories are not the same draw so each time we've drawn these hundred numbers we've drawn a different hundred numbers essentially for them each of them is an independent draw and in fact in this particular case it would have been equivalent to just draw four hundred numbers from this distribution and divide it into four groups of 100 that would not for this particular instance have made a difference or you know as compared to what we did here which is draw one hundred numbers four times because they are independent draws from this okay so if we compare the means of each of these things we expect to and indeed see the means are all going to be reasonably close to five in this case they're all slightly lower than five but they're all quite close to close to but not exactly equal to five great so this is the basic idea of the simulation we're just generally given some ideas of the distribution and some of the properties like mean variance or whatever the parameters are for that distribution this case for normal its mean variance but it could be say shape and scale for a gamma distribution or size and probability for binomial distribution we can repeat this process and generate these random samples as if we were going into this you know hypothetical population and drawing samples in this case drawing 100 samples at a time so for each iteration or round of the simulation we've effectively sampled 100 observations from the population right if we wanted to change the number of observations per iteration we simply change that lowercase n so we instead of making N equals 100 we could make n equals thousand to make as if we're sampling a thousand individuals at a time if we wanted to say oh no we generally our experiments are very small we only sample 10 individuals at a time we can make N equals 10 similarly we can change values to the mean or make it to some function like a line you can change standard deviation and of course you can change the type of probability distribution you're sampling from as well any of these things are possible to do these of course we don't want to have to write this right this line over and over and over again if we want to do this more than three or four times that becomes quite painful so we want to find an efficient way of performing these different iterations and there's a number of ways to do it we'll go through two in this tutorial one way to do it in R is to use the replicate function which just repeats the function call n times now I want to make a point there's two because unfortunately of the naming there's two lowercase ends here one is the lowercase n associated with the replicate function in that case this is N equals four we're going to repeat this four times and that just refers to how many times to repeat the call to the function of the function and the function in this case is the our known function the N equal hundred is like we saw above it's basically saying within the call to the R norm functions how many samples are being drawn from that population so despite that being a little bit confusing in both having lower case n I hope it's clear what those two different ends mean so we run that line and of what we can do is we can immediately look at all four of those together and bring this over the naming is not quite as as nice but the same basic idea again these four draws are all similar to each other in that they're centered near five but they're four random draws and are different than the four random draws we saw before similarly we can look at the means and standard deviations just like we did before and apparently gave some error here X is missing oh that's my bad there we go sorry with that let's just clear that and rewrite that I don't know I had that mistaken there we go so and again here again we have for each of the four simulations they have a mean close to but not exactly five interestingly I don't know why but all four of these just also all have a mean that's below five that just by random chance and similarly the standard deviations are all close to but not equal to two okay so let's do this again but repeat the sampling process the number of iterations from four times to 2,000 times and this is the real power of using this not to do it for four times or do it for thousands of times so we can really get a good idea of what the distribution looks like and so here are the only thing that we've changed is go to N equals 2000 we do it may take a second or two to run not very long at all and now we can ask say about the spread of the means from our simulated trials so how much each time we go through this we get a thousand dollar or a hundred observations we calculate the mean we're going to get two thousand different means how much variation is there among those mean so let's calculate that standard deviation among the sample means and that standard deviation among sample means about point two and that's useful to think about because we can compare this to the expected standard error so if we think about our distribution here that we are sampling from our distribution had a standard deviation of two and every time we sample we're sampling a hundred numbers now if you remember your standard error of the mean is simply the standard deviation divided by the square root of the sample size well that's just going to be two divided by the square root of 100 which is not surprisingly about point two it's because that's really approximating the same idea of the sampling of the standard deviation of the sampling distribution which is what we want to get which is just another I find more intuitive oi think about what the standard error of the mean is and we can we can actually plot this simulated sampling distribution for the means and take a take a quick look at it and I want make it clear that the spread here is going to be much less because we're not looking at the individual observations from each sample but instead we're calculating for the draw of 100 observations from a sample we calculate the mean from those 100 and then we repeat that thousands of times this case two thousand times and so we're looking at the distribution of the sample means okay we can do this same process again but in this case we can use the for loop and I just want to make sure it's clear because we will go back and forth sometimes one is more convenient than the other and this for loops can definitely be much more efficient computationally in terms of speed in particular when the number of simulations or the number of parameters being estimated gets very large so first we have to ask well how many iterations simulations we want we're going to call this big case n here so we'll just say we'll do it like before we want N equals 2,000 so we're going to then after we've done that we're going to initialize a variable to store the means that we're going to generate I forgot to write store the store the means and we'll just call that simulated means and we're going to use the repeat function which is just going to repeat n/a which is essentially missing data and big end time so that's mm or we should get from running this a vector of length mm and in each element we're just looking the first six there's just missing data so basically here's a vector length mm but essentially it has nothing in it we're going to now put stuff in it and we're going to use the for loop to do this we're going to repeat this from I equals 1 so we're going to do the first iteration up to I equals n big n where that's mm and we'll just excuse me repeat this all those times and then we're doing exactly like we before before we simulate the data from that distribution calculate the means this line just to remove that simulated data that the basic point here is we just don't want to leave that SCIM underscore data variable in there after we're finished we just remove it each time you actually don't need to do that it's just a good clean practice to clean up after yourself so we just run that takes a second again it's very quick and we can look at the histogram and essentially we've done exactly the same simulation we've done before so the distribution should look pretty similar again these are random so it'll be a little bit different it'd be pretty similar and we look at the standard deviation again of this and it's about point two and this is a standard deviation among the simulated mean so it's going to be close to point two so like our standard error okay now we can repeat this experiment this process with a lower sample size so let's do it again for we'll do two thousand iterations but instead of drawing sample of 100 we're just going to draw sample of 25 observations each time so we do that and I won't wear you let you guys do the histogram yourself but here that standard deviation of the of the sample means about 0.4 so we lowered the sample size and the effect of standard error of the mean in other words the amount of sampling variation has increased a lot how about if we have a much larger sample size now in this case we'll have a sample size of a thousand instead of we started with a hundred or did 25 just a moment ago but same idea let's do that well what what do we expect to happen here we do it and when we run it and we get a much much smaller standard here point oh six so do we see a pattern here hopefully you do for these three different ones for sample size of 25 relatively modest sample size oops that looks like I mixed these numbers up I apologize that should be 4 and 5 my apologies let's change that and run that again there we go so if we look here which is the lowest sample size again our standard essentially standard means about 0.4 standard error of the mean for sample size of 100 is about point one six and then standard error of the mean for a sample size of a thousand is point-of-sales and this reinforces the basic idea of sample size influencing yes and standard error right it's this that's why we have the square root of sample size being the the denominator in the standard of the mean approximation okay so we've now clear this up we've now learned about the basics of Monte Carlo methods we've now generated the Monte Carlo samples generated some value very interesting case even if it's just the mean and then repeated and made some gotten some sampling distributions which in the future we can use to make some kind of inferences ok but let's do something a little bit more interesting let's let's do this for a regression so we have this regression which is just wise again why is distribute normally where the mean is a plus B times X in other words the deterministic part the mean is the line the deterministic part of the line and then the stochastic good part which is our standard deviation so well intercept will be five our our slope will be 0.7 we're going to make our X values a sequence of numbers from 2 to 20 we can think about this as some kind of dose response curve where we're dosing our organisms with different doses of a drug and Y's say how big they get or some some how that influences how long it takes for them to develop something like that and so what we get with is the the equation for that line is that deterministic part of our model and we can plug and of course it's going to be a perfect fit because there's no variation right now which is fine but our response is going to be randomly sampled it's conditioned on X but it still needs B remember examples so we need to incorporate that random variation so let's say the standard deviation is 2 don't worry about where that's come from for the moment we'll talk about 4 inferentially soon but for whatever minute for whatever reason we expect it's to I guess 2.5 actually looks like so we all we do now is we're going to use our norm just like before here length is X so we're saying well how many general want how many samples do we need to generate for here well we'll just use we just one as many as we have the observations of X so we just use length of X here you could specify exactly what it was but this is just a more general way to do it and our mean all we have to say because we think about the equation we looked at before it's just the deterministic part of our model the Y enters were fixed is what we call them just the a plus B times X the equation for the line and then we have our standard deviation so that's great so we can run that line and then we're going to plot the simulated data and we will actually plot so here we've got that simulated data the black line here is that line that we know we've known these parameters because we get to play God here for Thermo but we can ask the question okay but maybe if we you know one of us gets you know if this is two people one person knows the parameters and the other one just gets this random random draws and has to estimate them we can say with this random draw our new hundred observations that we called Y dot sim valve one we can ask let's fit a regression using those new wide variables Y observations why not some no one as a function of our original X I will just run the linear model I won't run through this summary and confidence you can look at that yourself and see how similar they are I just want to actually plot the line the best custom ated line for those simulated data and what you should hopefully observe here is again it's fitting this red line to these dots here not to the ones up here because it has no knowledge of that it only has knowledge of these Y observations down here and so this red line is the best estimate for those points which is not too far from our known values but it's not exact again because this is a simulation and we simulated those Y's okay I don't want to go down there so that's the basic idea and we can clearly use this as an approach and what I would like you to do is an exercise is write for loop that will run the same model the same thing that we just did with this regression say 200 times and plot the lines from each of the simulations onto the same plot you don't actually need a plot to simulated data points just look the actual lines in fact the easiest way of doing this is to and all when you're using the plot function use type people equals n and it suppresses the plot of the points and I'll post the answers for this on Monday thank you and this ends the screencast for the basics of teaching simulations in our

Info

Channel: Ian Dworkin

Views: 28,173

Rating: 4.9111109 out of 5

Keywords: Simulations, R programming language, monte carlo simulations

Id: T_igE6bb6hU

Channel Id: undefined

Length: 19min 29sec (1169 seconds)

Published: Thu Sep 29 2016