Advanced Research Methods with JMP (Webinar, 04/04/2017)

Captions
Well, welcome everyone. I'm really glad you could make it for this webinar. Before we get into the content, a couple of quick reminders. Our academic landing page, which I'll refer you to several times during this broadcast, is jmp.com/academic; it will take you to the local version if you're in a different country. It's a great place to find any of the resources I mention today and to get the recording of this webinar once we finish at the end of the day. If you scroll down there's an Academic Webinar Library link that takes you to our on-demand webinar library. We have lots of recorded webinars there, and this one will be recorded as well, so don't feel like you need to follow along in JMP while I'm presenting; you can always go back and try it on your own.

So what will we cover today? Advanced research methods with JMP. That's a huge category, and depending on your discipline it can mean many different things, so to make sure I cover things of universal interest I've chosen a set of topics that fall into three big domains of the research process: the design stage, the data preparation phase, and some analysis tools that I think cut across the different disciplines. In each of these I'll cover some of my favorite tools, ones I've used in my own research and that I've seen people find valuable, and I'll point you to places where you can learn more.

A quick note on my background: I'm not a statistician, I'm a research psychologist. Before JMP I taught statistics in a psych department and did research in psychological science. I came to use JMP about 12 years ago, for research, for understanding more from data, and JMP, as many of you probably know, is amazing for that. I want to show you some of the things I love most and hopefully give you some new tools and techniques that will be useful in whatever discipline you're in.

Starting with design. What I mean by design is, before you collect data, how do you organize your observations so that you can make the inferences you want to make from the data you're collecting? This won't be an in-depth DOE webinar; we have a number of on-demand DOE webinars in the Mastering JMP section (on our website under Learn JMP, or under Events, in the Mastering JMP on-demand webinar section) that go into design of experiments in JMP in detail, and JMP is world class when it comes to DOE. I didn't use the DOE menu for a long time because I didn't really know what those methods were, but once I finally started using design of experiments it really was a great way to increase the power of my designs and my methodology.

What I want to talk about in this section is the Custom Designer, which is really an application of the modern approach to design of experiments: you specify what you're looking to estimate, and it searches the design space to find the types of observations you need to support those inferences. The Custom Designer is a great way to do this, and I'm going to use it in the service of talking about a new feature in JMP 13, the general-purpose Simulate tool.
Used in conjunction with the Custom Designer, Simulate lets you do a priori power analysis for any type of design you can think of. This is really useful, especially if you're writing grants or trying to determine whether you're collecting enough observations to have a good chance of finding a statistically significant result. For any design, this general-purpose tool lets you calculate power via a simulation approach. Let me show you what I mean.

We're going to set up a custom design for a hypothetical experiment where we're interested in whether people make different decisions when tasting a wine based on the type of closure it has, a cork or a screw top, and whether that interacts with some other variables: wine type (red versus white) and whether the taster is a novice or an expert. Rather than launching straight into a saved design, I'll build this DOE interactively with you in case you've never seen the Custom Designer. So let me start with Custom Design under the DOE menu.

This may look intimidating at first; it seems like there's a lot to fill in, but we only really need a few things. The top section is about the responses we're collecting. The only reason I'll rename Y is so that the table that gets made for me has that column labeled; I'll call it Rating, because what we're really interested in is how people rate these wines. Just for our purposes I'll give it a lower limit of 0 and an upper limit of 100. That won't affect anything, but it's the scale we're imagining we'll collect on.

In this study, individuals will come in, taste some wines, and give us ratings, and we'll define factors that distinguish both the types of wines we're providing and the types of people we're collecting data on. In psychology we call this a between-subject versus a within-subject factor; in other disciplines you might call it a whole plot versus a split plot. Let's talk through these. I'll start by adding a factor, closure. Under Add Factor you'll see different types of factors; the kind I need is categorical, because closure distinguishes things qualitatively rather than quantitatively, and it has two levels: wines can be cork or screw top. I'll name it Closure and call the levels Cork and Screw Top. If you drink wine, or have friends who do, you know some people get very particular about this: some dislike screw tops, others understand there's a cork shortage, or just like wines that are corked. So this is a variable that may distinguish people's ratings on average; maybe there's some general preference out there.

The second factor is wine type: again categorical with two levels, which I'll call Wine Type with levels Red and White.
The reason to think wine type might interact with, or affect, how people judge closures is that red wines maybe carry a certain prestige, so the effect of closure could be bigger for reds than for whites, where people may not care as much about the closure type. That's why we'd be interested in an interaction there.

Finally, let's add an expertise factor, again categorical with two levels. Here I'm referring to the expertise of the individuals making the ratings: novice versus expert. We'll just let people self-ascribe the label (in a good study we might actually test them on something). Notice that this factor is different from the others: it isn't an attribute of the wines but of the individuals, and it isn't something we can change within the experiment. In fact, under Changes I'll set this factor to Hard. As far as the Custom Designer is concerned, this tells JMP that it isn't something we can manipulate within runs; we can't make a person a novice for a couple of ratings and then an expert for a couple of ratings. You can imagine ways you might train somebody, but let's treat this as a between-subject factor, something on which individuals simply differ. By doing that, when I click Continue, JMP knows to set up this design with what it calls whole plots. In this terminology a whole plot is an individual. The term goes back to agriculture: a whole plot of soil can have different plants within it, but the plot itself has attributes, say different mineral densities, that you can't change within a single plot of land but that vary across plots. For our purposes, a whole plot is a person.

Before we get to design generation, let's talk about the model, because the Custom Designer is asking us to specify which terms in the model we need to be able to estimate. That's what's powerful about a custom design: it asks what you want to estimate so it can determine how many observations, and which observations specifically, you need to make those terms estimable. For our purposes, and this is often true in psychology, we'll fit the model with all the interactions, so I'll request terms up to the third order, and JMP adds the closure-by-wine-type, closure-by-expertise, and wine-type-by-expertise interactions, plus the closure-by-wine-type-by-expertise interaction. We're saying we want everything to be estimable. I'll minimize the Factor Constraints section; you can look at other advanced DOE webinars for the amazing things you can do there.

Because I want to keep this simple, I'll say I have ten subjects, ten people available, and I want each of them to rate each of the wines: two corked and two screw top, and within each closure one red and one white. It's a nice clean factorial design where everyone drinks everything, so I've just specified 40 ratings in total.
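If you wanted to mock up that same fully crossed layout outside the Custom Designer, here is a minimal sketch in Python; the column names and the even split of experts and novices are my own assumptions, not something JMP produces.

```python
# Sketch: the fully crossed within-subject layout, 10 subjects x 2 closures x 2 wine types
# (40 ratings). Column names and the expertise assignment are illustrative, not JMP output.
import itertools
import pandas as pd

subjects = range(1, 11)                      # 10 whole plots (people)
closures = ["Cork", "Screw Top"]
wine_types = ["Red", "White"]

runs = pd.DataFrame(
    list(itertools.product(subjects, closures, wine_types)),
    columns=["subject", "closure", "wine_type"],
)
# Expertise is an attribute of the person (whole plot), so assign it per subject.
runs["expertise"] = runs["subject"].map(lambda s: "Expert" if s <= 5 else "Novice")
print(len(runs))   # 40 rows, 4 ratings per subject
```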
Now I'll click Make Design and JMP will search through the design space. It does this whether the design is easy or hard; it wants to optimize the design to a criterion, and you can read about criteria for optimal designs elsewhere (I have it set to D-optimal, for anybody who's interested). For now, just understand that it's working out which observations I should make on each run. For whole plot 1, which is person 1, these are the wines this person will rate: this happens to be an expert, and they'll rate a cork with a white wine, a screw top with a red, a cork with a red, and a screw top with a white, so all the combinations we'd expect.

When I scroll down I can click Make Table, which makes a table I could fill out, but I'm going to do something special here, because remember why we're doing this: I want to show the Custom Designer as a way to do a power calculation, a way to figure out how likely we are to find a statistically significant result if there really is some effect in the population for the parameters we'll be fitting. To do that I'll go to the red triangle. If you're new to JMP (hopefully most of you have seen JMP before, but if not, don't worry, I'll talk through these things), this red triangle menu, which you'll see all over JMP, gives us additional options, and the one I want to turn on is Simulate Responses. Now when I make the table, JMP not only makes the table I had before, it also gives me a little control panel that lets me simulate the responses for the outcome variable, a Rating Simulated column built from a column formula. You can see it's a formula because of the little plus sign; if I click on it, the formula is drawing from random normal distributions. Don't worry about reading that right now; the panel controls the generation of those simulated responses.

I'll click Reset Coefficients and set the errors all to zero, and then let's do something that lets us see the consequences of changing each of these numbers. These are the parameters in the model, but don't worry if you haven't seen model parameterization in a while; we're going to look at it graphically. I'll go to Graph Builder under the Graph menu, put Rating Simulated on Y, Closure on the X axis, and Wine Type on Overlay. I have a choice of how to display this, bars or lines; I'll use lines, and you'll see why. Factorial plots in experiments often use lines even though there isn't really a continuum between a cork and a screw top (there's no midpoint there), but lines make the patterns easy to see. The reason you see just a flat line, with no difference between red and white, is that we haven't put any separation between these things in the population yet.

Let's start by changing closure: I'll give it a value of 1 and click Apply.
What this says is that, among wines with a cork, ratings are one unit above the grand mean, which is still 0 because we haven't changed the intercept, and screw tops are one unit below; one parameter defines that two-level factor. If I do the same for wine type and give it a 1, notice we now have separation: a more commonly seen factorial plot, two separate lines over the two points on the x-axis, with red wines higher on average than white wines. Let's right-click the red level and actually make it red; white won't show up very well, so let's give the whites something like an orange-yellow.

Remember, what we're defining here is a population. Let's make the intercept something more reasonable than 0; our ratings are supposed to run 0 to 100, so say 50. Now 50 is the midpoint: if you averaged everything together you'd get about 50, and we have an effect of closure and an effect of wine type. We haven't done anything with expertise, so let's say there's an effect there of 1, meaning experts are one unit lower in their ratings than novices. You don't see that in the plot because we haven't involved expertise yet, so let me put Expertise on Group X. Now you can see experts giving slightly lower ratings than novices on average. Maybe that's something we expect to be true; past research shows experts are a little more discerning, so they give lower ratings. Maybe I want that to be an even bigger effect, so experts are giving much lower ratings, way down here. As I said, we can change these parameters to reflect what we think reality is, based on expectations from previous results.

Now the interactions. I'll set one of these to 1, and I want you to see why. Imagine the state of the world is that there's a closure-by-wine-type interaction, like I described before: for red wines there's a big difference between cork and screw top, but for white wines there's essentially no difference on average (there's still a difference for expertise, but cork versus screw top doesn't matter; people just don't care for whites). So we've defined a state of the world, with no error so far, because right now we're defining a population, just to see what we think the world out there looks like. That's what will let us calculate power, because power is always the probability of detecting a true effect in the population.

Now let's talk about the errors down here, because these are our measure of how much variability in measurement we'll get. Every time people rate these wines, they won't give us the exact population values; they'll vary around them, and in a model like this we have errors within plots and errors among plots. If I set the within-plot error to one, watch what happens when I click Apply: there's movement around the means we specified, and every time I click Apply there's a slightly different instantiation of those measurements. What about the whole-plot errors? Let me set that to one. There's movement again, but for a reason that isn't very obvious in this plot right now.
So let me go back and show you something really neat. I'm going to take Expertise out and put Whole Plots into Wrap. Now each panel is one of my 10 subjects with their ratings. Notice that when I have the within-plot error set to one, points move around within each little panel, because we're saying there's measurement error within a person. If I set that error to zero and set the whole-plot error to one, notice what happens: within a panel the pattern doesn't change; instead, the intercept, the average rating for a person, is changing. That's what whole-plot error is: individuals don't show a different pattern of the effect within themselves; instead they have a different set point. If I set this very high, say ten, we may have some person who just gives really low ratings on average, like subject 2 here, who just tends to dislike wines, while subject 6 really likes wines. They show the exact same pattern, but one has a much higher rating on average. That's what whole-plot error is, and that's what's cool about the mixed model we're about to fit: it can take those differences in how people make ratings into account. Let's give each of these errors a value of 3, so we have fluctuations within a person and fluctuations in where people start; our measurements now vary because of two different sources of error.

OK, I'll close Graph Builder, because I want to remind us where we're going: we're trying to calculate power for the hypothetical state of the world we've just designed. When you generate a design, JMP writes a model script to the table that's appropriate for analyzing that design, and I'll use the simulated rating as the response. Clicking the script's run button brings up Fit Model with Rating Simulated as the continuous response, and without getting too far into the weeds, this mixed model is simply accounting for the variation due to individuals and their starting points, that is, how much individuals differ on average from the grand mean; that was the whole-plot error. When I hit Run, JMP produces the output for this mixed model. I'll minimize some sections, but I want you to look at the Fixed Effect Tests. For this one instantiation, the one time I clicked Apply, these happen to be the results: statistically significant effects for closure, wine type, expertise, and the closure-by-wine-type interaction. Some of them weren't overwhelmingly significant; they pass a criterion of, say, 0.05, but if we were stricter with these data they wouldn't pass a 0.01 criterion.

That was a single click of Apply. What would happen if I clicked Apply a thousand times and took note of how often I rejected the null for each of the sources that really do have an effect? We have real effects in the population for certain terms and not for others; for the ones with no effect we shouldn't expect statistically significant results more than five percent of the time, or one percent, whatever our alpha criterion is. That's what right-click Simulate does, and it's the new feature in JMP 13 Pro that I think is so neat.
Here's what I can do: I right-click on the p-values, go to Simulate, and a little panel comes up asking what I want to do. I'll switch out the rating column I used for that first model and switch in the simulated rating column, and what I want is essentially to hit Apply a thousand times, just like I was doing when I clicked that button over and over and we saw different versions of the Graph Builder plot with different patterns of results. It's doing essentially the same thing, but algorithmically: it goes through, clicks that Apply button, and each time it takes the p-values it sees in this column, one for each source, and records them to a table.

Think about the power of this (literally, the power): we'll be able to look at the distribution of p-values and see how often we reject the null hypothesis, that is, get a p-value below a criterion we choose, like 0.05, for each source in the model. We shouldn't expect to reject the null for the terms where we didn't design in an effect, but we should expect to reject it as often as possible for the sources that did have an effect, and that is the power of the study. We're not guaranteed to reject the null even when there really is an effect; remember we had that whole-plot variation, the error among individuals in where they start, and the within-plot variation, the differences you'd see if a single person rated the same wine over and over. That error in estimation will sometimes mean we don't get a result that lets us reject the null.

So here's what we get: a table of 1,001 results, the first one we actually measured and then a thousand simulations of the p-values, and a little script saved to the table, the power analysis. When I run it (let me minimize one of the sections) and look at simulated power for a criterion of 0.05, we rejected the null for closure, remember that's cork versus screw top, only 522 out of a thousand times, which is a power of 0.522. So if we really ran this study, got 10 people to come in and rate four wines apiece, and the effects were truly of the size we specified, we'd have roughly a 50-50 chance of getting a result that lets us reject that null. That's not very good; we'd probably want to collect more individuals if we were really going to run this study, because that's a pretty low power. We can look down the different sources we know to be real in the population (the question of significance is just whether they're true effects or not), see where we have and have not rejected the null, and read our power calculation off the table.

To step back and see the value of this: for any type of model, and I just did a mixed model with a basic setup, but your model can be as complicated or as specific to your situation as you like, you can right-click a report table, simulate the responses, and see what your power is for estimating those effects or for finding a statistically significant result.
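If you want the same idea outside JMP, here is a rough Monte Carlo sketch in Python. It is not JMP's implementation: a random-intercept mixed model stands in for the whole-plot/within-plot structure, the effect sizes and error SDs mirror the ones set in the demo, and the ±1 coding and the expert/novice split are my assumptions, so the resulting power estimate will not match the 0.522 above exactly.

```python
# Sketch of a priori power by simulation for the wine design:
# fixed effects roughly as set in the demo, a random intercept per subject
# (whole-plot error), and residual (within-plot) error. Counts how often the
# closure effect is detected at alpha = 0.05. Slow, but it mirrors clicking
# Apply a thousand times.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subjects, n_sims, alpha = 10, 1000, 0.05
whole_plot_sd, within_sd = 3.0, 3.0          # the two error sliders in the demo

def simulate_once():
    rows = []
    for s in range(n_subjects):
        expertise = 1 if s < n_subjects // 2 else -1      # half experts, half novices (assumed)
        subj_err = rng.normal(0, whole_plot_sd)           # whole-plot (subject) error
        for closure in (1, -1):                           # cork = +1, screw top = -1 (assumed coding)
            for wine in (1, -1):                          # red = +1, white = -1
                mu = (50 + 1 * closure + 1 * wine - 1 * expertise
                      + 1 * closure * wine)               # closure-by-wine-type interaction
                rows.append(dict(subject=s, closure=closure, wine=wine,
                                 expertise=expertise,
                                 rating=mu + subj_err + rng.normal(0, within_sd)))
    return pd.DataFrame(rows)

hits = 0
for _ in range(n_sims):
    df = simulate_once()
    fit = smf.mixedlm("rating ~ closure * wine * expertise", df,
                      groups=df["subject"]).fit()
    hits += fit.pvalues["closure"] < alpha

print("simulated power for closure:", hits / n_sims)
```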
So I think that's a really neat application of right-click Simulate, and I'll show you another application a little later, where we use it for randomization tests: the same idea, but once we've collected data we look at how often we would reject the null hypothesis, using the data themselves to estimate the sampling distribution of the difference, or of whatever test statistic we're interested in. So that's hopefully a design feature of JMP that applies to whatever research domain you're in.

Now I want to move on to data preparation, another thing we all have to do whenever we collect data; no data come in pretty. I'm going to use some sample data to talk through a few kinds of preparation we all work through: finding outliers and deciding how to deal with them; constructing derived variables, that is, using formulas to make new variables; subsetting and merging data (combining sources or breaking them apart); and reshaping and restructuring, which is sometimes tabulation or aggregation and sometimes splitting or stacking. Let's look at some tools for these.

I'll start with some sample data: different cereals measured on a number of characteristics. If you want to try this later, the sample data are under Help > Sample Data; they're all built into JMP and you can use them to explore features of the software. Starting here, let's look at what I mean by outliers. Outlier detection is one of those things you have to do, or at least be concerned with, but it isn't a matter of a strict cutoff. Outlier detection, and what you do with outliers, is a little bit art, a little bit statistics, and a lot of your domain expertise; that's what decides whether something is or isn't an outlier and what you should do with it. What I really want to show you are methods for detecting potential outliers. Don't think of any of these tools as definitively saying something is or isn't good data; they're methods to help you along the journey of deciding what is and isn't good data in your sample.

The first type of outlier I'll talk about is the univariate outlier, by which I mean an observation that is strange with respect to its own distribution. A great way to look for these in JMP is the Distribution platform. To give you a quick look (I'm sure most of you are familiar with it), under Analyze > Distribution I'll put our quantitative columns into Y and click OK. I have mine set to stack by default (that's under the red triangle at the top; without Stack turned on you'll see the histograms vertically), but let's keep them stacked. A useful way to screen for univariate outliers is simply to scroll through and visually inspect the distributions of your variables; points that are strange with respect to their own distribution tend to pop out. In the outlier box plot, the points shown individually are those falling more than 1.5 times the interquartile range beyond the quartiles, outside the fences, so these are points that are unusual in the Tukey sense.
If you select them, you'll notice another aspect of JMP that hopefully you've seen before, which is that selection is ubiquitous: if I select an observation in any distribution or any plot, it also gets selected in the data table. When we talk about how to manage these outliers, that's very useful. I can right-click the points and give them a marker, say a star, or right-click and exclude them; I can do operations that apply back to the table, which gives us great functionality.

When I scroll through these different variables, you'll notice that the outliers I select in one aren't necessarily the same points that are strange in another. In terms of fiber, there are a couple of observations that weren't strange in any other distribution but are strange for fiber. Jumping ahead to tools for managing this: setting row states is one option, right-clicking and excluding every time I see something strange, but that's not a great method, especially when you're looking across different distributions. A tool I really like is named selections. Let me go back up to fat; these points I think are high with respect to fat, so I'll right-click on them and choose Name Selection in Column. That brings up a little dialog asking what to label the column you're about to create (I'll say High Fat) and what values to give the selected and unselected observations; I'll use 1 for selected and 0 for unselected. Now my table has a new column with ones for the selected rows and zeros for the rest. If I ever want to bring that selection back, I right-click on the ones and choose Select Matching Cells, which selects every row with a one in that column, a really useful feature. And notice what else I can do: scrolling down to fiber, those two were different observations, so I'll right-click, Name Selection in Column, and call it High Fiber (they're not high fat, they're high fiber), again with one and zero. Now I have two columns holding different, non-overlapping selection sets, which gives me a simple vehicle for re-selecting the high-fiber or high-fat rows whenever I need them. That Name Selection in Column option is very useful, and we'll come across it in several more places.

From a univariate standpoint, that's one way to handle outlier selection, or at least to look for outliers visually. For qualitative variables (going back to Distribution, let's pull in hot/cold or manufacturer), you also get a very quick check of whether something has been miscoded, and it gives you a sense of things like having very few hot cereals in the sample; that's probably not going to be a useful variable. It's not really a question of outliers, but getting to know your data is the first thing you should do with any new data set, and the best method is to go to Distribution, put everything into Y, hit OK, and go through each of your columns one by one, making sure everything looks right.
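For comparison, here is a small sketch of the same univariate screening outside JMP, using Tukey's 1.5 × IQR fences and storing the flag as a 0/1 indicator column, analogous to Name Selection in Column. The file and column names are illustrative, and note the fence rule is two-sided, whereas the selection in the demo was only the high side.

```python
# Sketch: flag univariate outliers with Tukey's 1.5 x IQR fences and store the flag
# as a 0/1 indicator column. File and column names are illustrative.
import pandas as pd

def tukey_flag(series, k=1.5):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return ((series < q1 - k * iqr) | (series > q3 + k * iqr)).astype(int)

cereal = pd.read_csv("cereal.csv")           # hypothetical export of the sample table
cereal["high_fat"] = tukey_flag(cereal["Fat"])
cereal["high_fiber"] = tukey_flag(cereal["Fiber"])
print(cereal.loc[cereal["high_fat"] == 1, ["Fat", "high_fat"]])
```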
I can't tell you how many projects I've collaborated on where people never looked at their histograms or univariate distributions and jumped straight to a complicated analysis that was then completely wrong because something was miscoded. Distribution is the first place you should go.

Now let's say you've done that and you want a more in-depth search for outliers; a next step might be a multivariate outlier detection technique, by which I mean a technique that takes the relationships among the variables into account. The platform we'll use is under Analyze > Multivariate Methods > Multivariate. Multivariate is like a generalization of Distribution: it's a way of looking at observations, but now in a space of more than one dimension, more than one variable. I'll take just the quantitative variables, Calories through Potassium, click them into Y, and click OK. Multivariate starts with the correlation matrix, which is great for scanning for correlations; it's color-coded by correlation strength, with stronger correlations more saturated and values near zero lighter. I'll minimize that. The scatterplot matrix is also very nice; you'll see the points I marked before, because marking points is an action that applies to the table, and all graphs reflect the table, so it's ubiquitous as well.

What I mean by a multivariate outlier is a point that is strange with respect to the relationships among the variables. Focus on one relationship, say calories versus protein: the ellipse shows where we expect data to fall under a joint normal distribution, and there are points that are extreme, almost beyond the ellipse, but everything is reasonably well contained. Now look at calories versus fat: most of the data sit nicely inside the ellipse, and then there are some points, the ones I marked before, that fall outside that relationship between calories and fat. In a bivariate sense those points are strange; they were strange in a univariate sense too, but now they're strange with respect to a relationship.

I put all ten of these variables in, so in ten-dimensional space, are there points that are strange? Before we get to ten, let's look at three dimensions. From the red triangle I'll choose Ellipsoid 3D and pick the first three variables. This is a three-dimensional representation, with the ellipsoid as the generalization of the ellipse into another dimension. Points can be strange with respect to the three-dimensional relationship: the ellipsoid has a major axis running along a component of variation among these three variables, and two minor axes reflecting the strength of the relationships, and just visually you can see points that sit well beyond the ellipsoid.
What if we went to ten-dimensional space? Obviously we don't have a great visual for that, but conceptually, think about generalizing a distance measure: the distance from each point to a center in ten dimensions, which we'll call the multivariate mean, taking the relationships into account just as the ellipsoid did in 3D (these variables are related, so the ellipsoid runs through a certain region of the space). In ten dimensions we can't show it visually, but we can still compute the distance, and that's what's under Outlier Analysis. It's called a Mahalanobis distance, and it's just what I described: for each row in the table (the x-axis here is simply row number, first row to last), a distance from that point to the multivariate mean. You'll notice that a point we marked before has the highest Mahalanobis distance; it's the point most distant from the multivariate mean. If we go back to the table to see what it is, it's the 100% Natural Bran Oats and Honey cereal, so by this measure that's the point that makes the least sense, the one farthest from the center, sitting where it shouldn't be. That's a neat way of gathering these points, and what's nice is that you can grab the points above the reference line, which reflects roughly how far points should fall under a joint multivariate normal, as a way of collecting the ones that are strange. Back in the scatterplot matrix, the points I grabbed are, in a lot of the bivariate plots, about where they should be, but in some they're in pretty odd places; for example, in fiber versus total carbohydrates they're really far out in fiber, and you'd expect them to have more carbs given how much fiber they have. The Mahalanobis distance is an aggregate measure of that.

And just like before, my favorite follow-up is Name Selection in Column; this time we'll call it, not High Fiber, but High Mahalanobis, with one and zero, so we can mark those points. The reason to mark them, and I'll show you why in a second, is that we can then filter on whether rows are or are not high on that measure and see whether the analyses we care about are affected by any of these flags: high fat, high fiber, or high Mahalanobis distance. OK, so that's a multivariate approach.
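The distance itself is easy to sketch outside JMP if you want to see the mechanics; here is a rough Python version, with an approximate chi-square reference line standing in for the control limit in the report. The file name, column list, and the 0.975 quantile are my own choices.

```python
# Sketch: Mahalanobis distance of each row to the multivariate mean, with a rough
# chi-square-based cutoff. File name and column list are illustrative.
import numpy as np
import pandas as pd
from scipy.stats import chi2

cereal = pd.read_csv("cereal.csv")
X = cereal[["Calories", "Protein", "Fat", "Sodium", "Fiber",
            "Carbohydrates", "Sugars", "Potassium"]].dropna().to_numpy(float)

center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
md = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))   # distance per row

cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))   # rough reference line
print(np.where(md > cutoff)[0])                    # rows far from the multivariate mean
```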
Now let me switch gears and talk about a model-based approach. So far we haven't really imposed a model; well, in a sense we did in the multivariate approach, since we assumed the variables were jointly normally distributed and looked at distances scaled by the covariances, and that's a model of a kind. But let's say we're actually interested in something specific, like complex carbohydrates as a function of sodium. This is output I pulled up in Fit Model; let me do it interactively so you can see it: in Fit Model we're predicting complex carbohydrates from sodium, and let's say we really care about this relationship, so I'll click Run. In the output (I'd minimized some sections before; I'll do it again) we have parameter estimates, and sodium does have a statistically significant relationship to complex carbohydrates: the cereals with higher sodium were also the cereals with higher complex carbs. Nothing necessarily causal here, but there is a statistical trend between the two. And if you look at the plot, you'll immediately notice something a little funky: these points here didn't come up as high Mahalanobis distance, high fiber, or high fat, but they don't fit the relationship very well.

So we might want a measure of outlyingness with respect to how much a point influences the regression relationship, that is, how much a point changes the parameters of interest, specifically this slope: how much does a point change, in aggregate, our estimate of the effect of sodium, the relationship between sodium and complex carbs? A great measure for this is Cook's distance. Under the red triangle, under Row Diagnostics, or rather in this case under Save Columns, you'll see an option for the Cook's D Influence statistic, and I'll save it, which writes a Cook's D value to the data set for each individual observation. Let's look at the distribution of these Cook's D values; I'll grab the hand tool and drag up to make the bins narrower. Most of the points have a pretty low Cook's distance; in fact, if I select the lowest bar, those are points that barely influence the regression at all (for those of you who know regression well, they sit near the mean of X, and points at the mean of X influence the slope very little). But some points have a Cook's D that is really far out, like this one, and this one, and this one. If I select the top four and look at the graph on the left, those are the four points that are really driving this regression relationship, and their Cook's distances are very high; in the table they're values like 0.11 and 0.30. A nice rule of thumb is to take 4 divided by the number of observations, here 4/76, which gives a rough cutoff of about 0.05; you don't want influence values much above that for this number of observations. So 4/n is one of those handy rules of thumb. What this says, in essence, is that we have some points with a lot of undue influence on the regression relationship, and those are points we might again want to flag: right-click, Name Selection in Column, and I'll call it High Cooks, Complex Carbs by Sodium, because remember we'd get a different Cook's D if we had different terms in our regression.
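A rough equivalent outside JMP, if it helps to see the calculation: fit the same simple regression with statsmodels, pull Cook's D from the influence diagnostics, and flag rows above the 4/n rule of thumb. The file and column names are illustrative.

```python
# Sketch: Cook's distance for a regression of complex carbs on sodium, flagging
# points above the common 4/n rule of thumb. File and column names are illustrative.
import pandas as pd
import statsmodels.formula.api as smf

cereal = pd.read_csv("cereal.csv")
fit = smf.ols("Q('Complex Carbohydrates') ~ Sodium", data=cereal).fit()

cooks_d = fit.get_influence().cooks_distance[0]     # one value per observation
threshold = 4 / len(cereal)                         # 4 / n rule of thumb
cereal["high_cooks"] = (cooks_d > threshold).astype(int)
print(cereal.loc[cereal["high_cooks"] == 1, ["Sodium"]])
```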
Let me show you how we might use that flag, and this applies any time you're doing an analysis or a graph; it's an architectural feature of JMP called the local data filter. The local data filter asks which columns I want to filter on for this particular analysis or graph; it filters the data going into the report and recalculates immediately based on that filtering. So I'll choose the high-Cook's-distance column for complex carbs, click Add, and the 0 and 1 levels of that column appear. If I click 0, only the rows coded 0 are included, so the flagged points drop out; if I click 1, only the flagged points are included. With 0 selected I can shift-click 1 to add those rows back, or control-click (command on the Mac, control on the PC) to toggle them off, so I can add and remove them instantly. By toggling their inclusion you can see how much they've influenced the relationship: without them, the effect of sodium looks considerably more reliable than when they're included.

That's not to say you should do this to make your analyses more statistically significant; that's not science. If you have the answer before you ask the question, you're not conducting science. But you should make sure that individual points don't substantively change your interpretations. If they do, you'll have to talk about it when you write it up, and you'll have to be very careful in how you interpret the results if a few influential points can massively change the relationship. Even more consequentially, if a couple of points being included had undone the relationship, we'd better have very good reasons before we talk about removing them. So be careful with that, but the local data filter is a great way to see that influence and really understand it.
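In code, that toggle amounts to a sensitivity check: refit with and without the flagged rows and compare the slope. A quick sketch, again with illustrative names, and assuming the high_cooks flag from the previous sketch is already in the table.

```python
# Sketch: refit the regression with and without the flagged high-influence rows
# and compare the sodium slope, the code analogue of toggling the local data filter.
import pandas as pd
import statsmodels.formula.api as smf

cereal = pd.read_csv("cereal.csv")      # assumes a high_cooks 0/1 column already exists
full = smf.ols("Q('Complex Carbohydrates') ~ Sodium", data=cereal).fit()
trimmed = smf.ols("Q('Complex Carbohydrates') ~ Sodium",
                  data=cereal[cereal["high_cooks"] == 0]).fit()

print("sodium slope, all rows:      ", round(full.params["Sodium"], 4))
print("sodium slope, filtered rows: ", round(trimmed.params["Sodium"], 4))
```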
All right, that's Cook's distance and how we might use it. Now let me shift gears a little toward subject-level outlier procedures, because we started today with repeated-measures experiments, experiments with subjects as a level, or as a whole plot, and I've found these procedures really valuable. This is actually an experiment I conducted; let me give you a sense of what it looked like. I was studying how presentations of data influence people's judgments about whether something on Amazon is worth buying. This was several years ago, so this is how Amazon used to look, but I wrote a program that simulated Amazon shopping experiences. As a subject you would never see all four of these pages; you'd see one instantiation of each product, with a specific manipulation of how the data were presented: one might use a 2D bar chart, one a table, one a terrible 3D bar chart graphic, and so on, all ostensibly related to the product. Importantly, every page also had the same data written out in the text; the question was simply whether the way the data were presented consequentially influenced the ratings people gave about the products. This one is a vitamin C product, which people tend to believe in to begin with, but we had lots of products that may not work, testosterone supplements and other things you might be suspicious of. Principally the question was, within a person, for judgments like credibility, belief in the product claims, how well written the page was (some things we wouldn't expect to change), and, the one we really cared about, whether they would recommend the product. So we had a number of subjects come in and give us those measures on scales of one to seven.

When you do studies like this you really have to ask whether people are taking the experiment seriously at all, and here's something I highly recommend. I have two columns hidden here: the number of seconds individuals spent viewing each page, actually working through the Amazon pages, and the number of seconds they spent making the ratings on the measures I just showed you. If you use Qualtrics, this is something you can incorporate; I wrote this in PHP, but whatever you use, if you can collect timing data it's invaluable. Let me show you why. Looking across the roughly 3,100 trial-level observations, we can see, for every trial of the experiment, how long people viewed and how long they rated, and there are some clear issues. One person spent 5,663 seconds viewing a page; they almost certainly walked away from their computer. Same with the ratings: somebody spent 1,670 seconds on one ratings page, so again they obviously walked away. On the other side, there are people who spent far fewer seconds viewing a page than I would have liked; what about those who spent less than two and a half seconds on a page? Those are individual trials, and just as before, we might mark them or apply exclusion criteria; you certainly want to choose the exclusion criteria before you run the study.

But since we have subject-level data, each individual did this 15 times, viewing 15 different pages with different graphical treatments assigned via a Latin square, I can also look at aggregate, per-subject measures. Let me show you a really useful item under the Tables menu. We could do this through Tabulate, but I'll use Tables > Summary, because what I want are aggregated summary characteristics for each subject. I'll put Subject in as a Group variable, and let me keep the dialog open so I can show you what happens with nothing else specified: with Subject as a Group, subjects become the rows of the new table, and with nothing else defined all we get is the number of rows per subject. These are unique identifiers, so for each of the 213 people in the experiment I have 15 observations. Now let's see how to use this: I'll take Time View and Time Rate, ask for the mean of each by subject, and click Create. Now we're getting somewhere: for each subject I can see, on average, how much time they spent viewing and how much time they spent rating.
Let's do a sort: I'll right-click the column and Sort Ascending. I have some subjects who, across their 15 trials, spent 2.4 seconds on average viewing the page; this one actually spent a decent amount of time on the rating pages, but that viewing time is pretty suspicious, they really clicked through the pages quickly. What about time spent rating? That might be even worse if they spent no time rating, so let's sort that ascending too: we have people who spent seven seconds answering seven questions about these pages, which is not thinking about it very much. Of course our judgments here are helped by looking at the distribution of each of these measures across all individuals, which gives a good sense of where people lie; this is a subject-level view of what is or isn't potentially a problem with how they did the study.

And I want you to see something important. Say I'm concerned about these first three, the ones who rated in fewer than 11 seconds. If I go to my original table, you'll notice something special about the summary table: the tables are linked. When I select rows in the summary table, those individuals get selected in the original table (Rows > Next Selected will even take me to them), which means I can grab the potentially problematic subjects and find all their rows. I can name the selection if I want, or exclude them right away; in fact, any time you have a selection you can go to the Rows menu, Row Selection, and Name Selection in Column, so from anywhere in JMP, if you have something selected, you can name it. That's a nice way to handle it.

Let me show you one more thing you can do with subject-level data that I think is really valuable. Instead of looking at the time to rate, let's look at the ratings themselves, say credibility, or belief in the product claims; these turned out to be very highly correlated, but I was interested in how credible people judged the person who wrote the page to be, since that's what the question asked. In the mixed models we would fit, we'd be modeling the mean structure, how different things are on average, but for this question we want to know how variable each person's responses were, and let me show you why: I'll ask Summary for the standard deviation of the credibility ratings within each subject. Just to make sure we all know where we are: for this credibility measure, we're asking, within each subject, how spread out the values they gave are. Then let me right-click and sort ascending again, because I want to find the people who gave pretty much exactly the same rating for every single page. The distribution of these standard deviations across people (it will have roughly a chi-squared-like shape with 14 degrees of freedom) shows how variable people are in the variability of their responses. To give you a sense of what this means and why it's valuable, take the individual whose standard deviation, the standard deviation of their ratings across their 15 ratings, was 0.35. If I find them in the original table (Rows > Next Selected) and look at their credibility ratings, it's 3, 2, 3, 3, 2, 3, 3, 3. They just used the center of the scale and gave basically the same response to everything. Maybe that's real, maybe they truly believed everything had exactly the same credibility, but it's odd relative to all the other individuals, who showed much more variability in their ratings. That isn't damning for this person, I should say, but it's something worth being aware of. So with subject-level data you can do these interesting things, looking at a different summary of the measure than we're normally used to, such as the variability of the ratings, and Tables > Summary gives you some nice subject-level tools for it.
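Here is a small pandas sketch of the same subject-level summary, mean times plus the per-subject standard deviation of the credibility ratings, sorted to surface the speeders and the flat-liners. The file and column names are my own, not the ones in the original study.

```python
# Sketch: subject-level summaries like Tables > Summary; mean time per subject and
# the standard deviation of each subject's credibility ratings. Names are illustrative.
import pandas as pd

trials = pd.read_csv("amazon_trials.csv")    # one row per subject x page (15 per subject)
by_subject = trials.groupby("subject").agg(
    mean_time_view=("time_view", "mean"),
    mean_time_rate=("time_rate", "mean"),
    sd_credibility=("credibility", "std"),
)

# Sort ascending to surface speeders and flat-liners, as in the demo.
print(by_subject.sort_values("mean_time_rate").head())
print(by_subject.sort_values("sd_credibility").head())
```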
All right, as far as cleaning goes, I'll cover this quickly because it's a great tool. I'll pull up another sample data set, and like I said before, with every new data set, go to Analyze > Distribution, put every column into Y, and click OK; I did that quickly just so you can see how fast it is, so you can do it every single time. What you should really do is look through all your columns, because you'll spot things like this. This is just an example I made, but it's very common: these are tables at a restaurant, and for credit card use some people entered N, some spelled out No, some entered Y, and some spelled out Yes. As far as a human is concerned those mean the same thing, but if you're fitting a model, say a oneway ANOVA under Fit Y by X with credit card as the factor and bill amount as the response, then as far as JMP is concerned those are different categories, so even if you do the ANOVA correctly it's being done on data that are wrong. That's a problem, so we have to clean up those categories, and I want you to see a really useful tool under the Cols menu called Recode.

The way Recode works is that it shows your old values and your new values and gives you options for combining them. If I grab the No's and the N's I can click Group, and it picks whichever of the two occurs first (or most often) as the label; I can relabel it, maybe I want it capitalized as well. Then I take the Y's and the Yeses, right-click, and notice I can group to one of the values, so let's group to Yes. Once the recoding is done and you click Done, there are some options: you can recode In Place, but don't ever do that, it writes over the original values, unless you're absolutely sure and it's a trivial recoding; I would almost always choose New Column or Formula Column. New Column, which I'll pick for this one, just writes the new values to the table; it isn't doing it with a formula, it simply stores the new values.
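For the same kind of cleanup outside JMP, a rough pandas sketch: normalize the inconsistent yes/no labels through an explicit mapping, so the recoding "schema" stays visible in the code. File and column names are illustrative.

```python
# Sketch: collapse inconsistent category labels ("N", "no", "Y", "yes", ...) into
# canonical values via an explicit mapping. File and column names are illustrative.
import pandas as pd

tips = pd.read_csv("restaurant_tips.csv")

credit_map = {"n": "No", "no": "No", "y": "Yes", "yes": "Yes"}
tips["credit_card_clean"] = (
    tips["credit_card"].str.strip().str.lower().map(credit_map)
)

# Values that don't match the mapping come back as NaN, which is a handy way
# to catch categories you haven't accounted for yet.
print(tips[["credit_card", "credit_card_clean"]].drop_duplicates())
```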
All right, so rather quickly, because I want to finish up the section on preparation — I probably won't get to subsetting, merging, and reshaping in this one; we have another webinar on data preparation — but I do want to show you calculating new variables, which is very simple. We've already looked at formulas that were stored to the table, but what if you want to make a formula yourself? There are a couple of ways I want to show you that are really powerful. If you right-click a column in the table, there's this New Formula Column section with some very commonly used formulas. For instance, under Distributional, what if you wanted to standardize — that is, z-score — that column? When you click it, it just writes a new, standardized column for you. Or what if you wanted calories per cup? Right now we have Cups per Serving as a column here, and not everything has the same number of cups, so if I select those two columns I can right-click, do New Formula Column, Combine, and maybe I want the Ratio: calories divided by the number of cups, so calories per cup. By having these functions in here, JMP writes a formula to the table, and if you right-click the column you'll see there's a formula stored here — that's what it did, it wrote the formula Calories divided by Cups per Serving to the table. You could have made that yourself: with a blank formula it's as easy as clicking Calories, divide, Cups per Serving, and then you click OK. So formulas are very powerful. Those were the instant formulas; you can also do temporary formulas. Whenever you're analyzing data you can use the same formulas temporarily if you want, so I can transform, take the log of Sodium, and look at that in Distribution instead of the original Sodium. That's a temporary variable until you right-click and add it to the table if you like, and so temporary and instant formulas are great ways of exploring all these transformations.
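Again as a rough outside-JMP translation, a minimal sketch of the same kinds of derived columns — a z-score, a ratio, and a log transform used as a temporary exploratory variable — in Python/pandas; the cereal-style column names and values here are hypothetical.

```python
import numpy as np
import pandas as pd

cereal = pd.DataFrame({
    "calories": [110, 120, 70, 160],
    "cups_per_serving": [0.75, 1.0, 0.5, 1.33],
    "sodium": [130, 200, 1, 290],
})

# Standardize (z-score) a column: the analogue of New Formula Column > Distributional > Standardize.
cereal["calories_std"] = (cereal["calories"] - cereal["calories"].mean()) / cereal["calories"].std()

# Ratio of two columns: the analogue of New Formula Column > Combine > Ratio.
cereal["calories_per_cup"] = cereal["calories"] / cereal["cups_per_serving"]

# A log transform kept as a temporary exploratory variable rather than a stored column.
log_sodium = np.log(cereal["sodium"])

print(cereal)
print(log_sodium.describe())
```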
All right, so we're going to stop there, because I really want to save at least five minutes for questions. I will mention we have some other webinars that look at the advanced features, the randomization tests, and some of the bootstrapping, but let me pause now and see, Ruth, whether you've gotten any questions about the sections I did cover here in design and preparation.

Yeah, that is a perfect time to do it. Yes — so, randomization tests. Just like we did with the right-click Simulate earlier, the Simulate tool lets us do some pretty powerful things here. Let's pull up a simpler data file; I'll use Restaurant Tips. Just to give a preface: let's assume we don't want to assume normality for the sampling distribution of our comparisons, or we don't know something about the population shapes. Simulation-based methods for inference give us the ability to set aside certain assumptions of standard parametric models. For instance, let's use Fit Y by X, and say we're using servers to predict the tip amounts — we want to know whether our servers differ in how much they get in tips. If I fit an ANOVA, the p-value I get from this analysis of variance is based on several assumptions of the ANOVA, and if we don't want to make those assumptions, there are ways we can, via simulation, reassign the servers to the values in the table and recalculate a statistic over and over. That's a simulation-based method for testing a difference.

To show you how this works, it's again going to be a right-click Simulate — right-clicking on an F value here to simulate — but we need a variable to swap in; that's how Simulate works, it needs to rerun a formula each time. So we're going to go to the server column, right-click to get a new formula, and under Random there's the option to sample with or without replacement. A permutation test is the approach I'll take here: I'm going to sample without replacement, meaning every row in the table will just be shuffled, so we have exactly the same number of servers A, B, and C — we've just reassigned where they occur in the table. This, in a sense, constructs the null hypothesis: when we shuffle the group labels, the difference between servers in tip amount is by definition zero. Every time we reshuffle it will be a different instantiation of a difference, but on average those differences will be zero. So when I right-click this F ratio, let's go to Simulate: I'm going to switch in the shuffled server for my original server, and I'm going to say do this a thousand times. Every time it does the analysis, it will rerun that shuffled column — remember, shuffling is just randomly reassigning where the values in the table occur, but doing it without replacement; that's why it's called a shuffle. Out of the thousand times, it's going to collect observations like the F statistic, and the F statistic here, 1.7919, is — notice — exactly the same as this one; that's the original analysis.
So the question becomes: how extreme, relative to the thousand times we resampled or reshuffled, was this original difference? Is it unlikely to occur by chance? That is really the question of a hypothesis test. If we minimize these other ones and just look at the F statistic, it's drawn on the plot where that original F value was, and we have an empirical p-value — how far out in the tail this really was, how unlikely this observation was given the resampling. That's 0.1690, which is literally the proportion of this distribution beyond the original value: notice, 169 observations out of a thousand, and that's 0.169. So that's a simulation-based method for doing a hypothesis test, and what's valuable about it is that the generation of this empirical p-value isn't based on an approximation to a parametric sampling distribution; instead it's based on how often, in this randomization or simulation, we got a result more extreme than the original. And this generalizes to anything: we just did it for an ANOVA, but I could have done it for a median test, or for a mixed model. Simulation allows you — just like we did for power — to simulate the outcomes of something; here it lets us simulate the outcomes and look at how extreme our original outcome was. That's why it becomes a hypothesis test: we're testing how different our original result was from what we would get by chance alone, and that's how we can apply this to inference. So it's a really powerful feature.

And again, there are assumptions to simulation-based hypothesis testing: we have weak exchangeability under the null, and we're assuming essentially the same things about the variance structure — servers having the same variance in their tip amounts, which is what we were testing. But we don't have to make assumptions about the distribution of the observations in the population, which in this case is long-tailed, maybe not normal, and we have a small sample size here, so we can't trust the central limit theorem to help us out with the distribution of the differences.
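A minimal sketch of this permutation test in Python with numpy and scipy, for readers without JMP. The server labels and tip values below are made-up stand-ins for the Restaurant Tips data; the empirical p-value is simply the proportion of shuffled F statistics at or beyond the observed one, mirroring the 169-out-of-1000 calculation described above.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Hypothetical data: 30 tips from three servers, drawn long-tailed on purpose.
servers = np.array(["A"] * 10 + ["B"] * 10 + ["C"] * 10)
tips = np.concatenate([rng.gamma(2.0, 2.0, 10) + 2.0,
                       rng.gamma(2.0, 2.0, 10) + 3.0,
                       rng.gamma(2.0, 2.0, 10) + 2.5])

def f_statistic(labels, values):
    """One-way ANOVA F statistic for values grouped by labels."""
    groups = [values[labels == g] for g in np.unique(labels)]
    return f_oneway(*groups).statistic

observed_f = f_statistic(servers, tips)

# Shuffle the server labels (sampling without replacement), recompute the
# F statistic each time, and see how often chance alone matches or beats
# the original difference.
n_sim = 1000
sim_f = np.array([f_statistic(rng.permutation(servers), tips)
                  for _ in range(n_sim)])

empirical_p = np.mean(sim_f >= observed_f)
print(f"observed F = {observed_f:.4f}, empirical p = {empirical_p:.3f}")
```

Because the labels are only shuffled, every group keeps its original size, and the distribution of sim_f is built under the null hypothesis that the server label carries no information about the tip amount.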
Info
Channel: Julian Parris
Views: 2,542
Rating: 5 out of 5
Keywords: JMP (software), Research, Data Cleaning, Data Analysis, Recoding, Simulation, Power
Id: wMpWjJ188Xg
Length: 58min 56sec (3536 seconds)
Published: Tue Apr 04 2017