Advanced Research Methods in JMP (10/19/2016)

Video Statistics and Information

Captions
Wonderful. Well, thanks everyone for joining us for one of our academic series webcasts; I'm really happy to have you. I'm going to tell you a little about the plan for today and step through my outline. As Gail pointed out, we have a lot of really fantastic webinars on our jmp.com events page, so if you like what you see here and want to see more, certainly visit that — I'll show you where to get to it at the end.

Today I want to talk about advanced research methods with JMP, which covers two sides of things: the methodological or design aspect of your experiments or research, and also the analysis and preparation parts. A lot of what we do when we're analyzing data is really preparing the data for analysis, so I want to show you some tools in JMP that are well suited for that. Before I launch in, a little about my background: I'm not a statistician — I'm actually a research psychologist — and I started learning JMP for the purpose of analyzing research data. JMP makes that process a lot more straightforward on both the design and preparation sides. All programs can analyze data, but there are parts of JMP that really shine for the design and preparation aspects.

Starting with the design aspect: what I mean here is, before you actually collect your data, how you decide what observations you have to collect in order to make the types of inferences you want to make. I'm only going to cover a little of this; there's a lot of depth to JMP's design of experiments — DOE is a menu you may have never even clicked before. I want to point out that we have a live webcast coming up on Tuesday, October 25th, led by my colleague Volker Kraft, looking specifically at engineering and DOE, and we also have a great many on-demand and live webinars from the Mastering JMP series, which is a really excellent webcast series — the first one there is Advanced Design of Experiments. If you've never explored the DOE tools in JMP, those are worth watching to get up to speed. What I want to talk about today is an introduction to some modern design of experiments using the Custom Designer, plus something new in JMP 13 that I think is really special: the general-purpose Simulate tool, which has a great many applications, but for our purposes today lets us run power simulations for complicated designs — specifically, a mixed model.

So, without getting ahead of ourselves, let's step through the Custom Designer and look at how we might design a study ahead of time. I'm going to go to the DOE menu — note that I'm using JMP 13, the new version that came out a few weeks ago, and I'm using it on a Mac; of course JMP runs on both Mac and PC — and launch the Custom Design tool. The way Custom Design works is that we set up what responses we're looking to collect and what factors we're looking to vary. Of course we need a context first, and I like to use this example when I'm teaching: suppose we're running a wine tasting study, so what we're collecting from our subjects — the people we're asking to do this study — is a wine rating. Assume the wine rating has limits: we're asking on a 0-to-100 scale.
Then the question is what factors affect this wine rating. The study I always wanted to run asks how the type of closure of a wine — whether it's a cork or a screw top — impacts people's ratings, irrespective of the type of wine. If we were to control for the type of wine — really the brand or varietal — how do we think the closure affects people's ratings?

So let's design a study real quick. I'm going to say we'll have four factors: I'll click Add Factor > Categorical > 2 Level four times, so I'm asking for four 2-level factors, and let me start naming them. Closure, of course, is the thing I'm most interested in: cork or screw top. But there are other things I have to vary here, and some I might be interested in. Wine type — red or white — might play a role. My intuition is that for white wines no one really cares whether it's a cork or a screw top; it's a white wine. But if it's a red wine, maybe our intuitions about the prestige of red wines play into this, where we don't like the screw-top versions but we do like the corked versions. And there may be other things at play — things that aren't as easy to change. Notice that with closure and wine type, I could ask a single person to rate four wines: cork and screw top crossed with red and white. But some things are hard to change about a person: maybe expertise — whether the person is a self-described expert or novice — and maybe something like gender, male or female. Those factors aren't easy to change within a person, and there's a setting here for how the changes are made, so I can click on it and say these are Hard to change.

What I'm telling JMP is to design me a split-plot experiment. When I click Continue, notice that JMP says we need a certain number of whole plots. If you're not familiar with that language: we have some number of individuals, within which we can vary the other factors. The Custom Designer then asks what types of effects we want to measure. The most basic model we could estimate has just main effects of each factor — that language simply means: is there some average effect of closure (screw top versus cork), wine type (red versus white), expertise, and gender? But in this model so far we haven't asked an important question that I want to know about: do we think the effect of closure — cork versus screw top — varies across the levels of wine type, red and white? That is, do we think there's the same effect of closure for each wine type? So we have to tell JMP we want to fit some kind of interaction, and this helps JMP know what types of observations we need to make. By the end of this we're going to get a table — simply a table we can fill out with our ratings so we can run the model. I'm going to tell JMP to fit interactions all the way up to 4th order — that is, fit every interaction possible among our four factors.
You may not have seen it because I was clicking up here, but what JMP did was change the number of whole plots necessary and the minimum number of runs necessary to estimate this type of model. Now let's think about it — I want to design this just as an example. If we have, say, 12 people in our study, really what we're asking for is 48 observations: for each of the 12 people, we ask them about each of the four combinations — cork as red and white, and screw top as red and white — so 12 times 4. I'll ask for that and have JMP make the design.

When I click Make Design, JMP searches, in terms of its optimality criterion, for the observations we need — it's deciding, for each of these people, each of the whole plots, what observations we need in order to estimate the parameters I asked for — and it comes up with its design. Notice that within each of the whole plots we have four rows: for each person, we're going to ask about each of the four combinations. And expertise and gender don't vary within a whole plot — a person's expertise and gender aren't changing while we ask them about each of these wines; that would be a difficult thing to change. So we've essentially designed a repeated-measures experiment, and a mixed model is what we'll use to fit it. What JMP has done is found the design that makes sense for the type of analysis we want to run.

Now, the reason I'm doing this is that I want to show you the new general-purpose Simulate tool in JMP 13. After I've made this custom design, I'm going to do something kind of special: I'll go to the red triangle and tell JMP that when it makes the table for me, I want it to simulate some responses. That is, I want to pretend I know something about how big these effects are so that I can do some power calculations. Do I actually have reasonable power to run this experiment? It's going to take some money — I'm going to have to buy a bunch of wine — and I want some reasonable expectation of actually detecting the effects I want to detect. When I click Make Table, notice that JMP first writes out a table for me: the whole plots and the closure, wine type, expertise, and gender levels. I could fill out this wine rating column with what people actually say, and that would be my data. But I also got a new column, the simulated wine rating, and that column responds to this little control panel, which lets me set the parameters of this model.

If you've never looked at the parameters of a model before, this might be a little unclear, so here's what I'll do. I'll click Reset Coefficients and go to Graph Builder — a nice way to examine what this is doing. I'll put the simulated wine rating into Y, and let's look at the effect I'm most interested in: how does closure differ across wine type? I'll drag closure to Overlay and wine type to the x-axis, and because factorial plots are often shown this way, I'll click on Line. In general that's not the best choice for displaying data — what is the midpoint between red and white? It doesn't make sense in most cases to plot categorical data with lines —
but for factorial plots it is useful, because we can look for the presence of interactions based on whether the slopes of the lines are the same. I'm also going to set the error over here to 0, and I'll tell you why in a moment. When I click Apply, JMP shows me, based on the terms I've specified, what the result is on average. What you're seeing is that there's no effect of red versus white — the mean here is exactly 0 and here exactly 0 — and no effect of closure, because cork and screw top are right on top of each other. We can't see the other terms in the model yet because I haven't included them in the graph.

Let's start adding effects. For instance, let's add a 1 here for wine type, and watch the response. When I click Apply, suddenly the red wines are rated 2 higher than the white wines. That's what this parameter is really saying. These are called effect-coded offsets: the coefficient is how different from the average the red wines are — red being the first level as I entered it into the design. Maybe this is an effect we think is out there: on average, in the population our sample is drawn from, do people rate the red wines better than the whites? Probably — let's just say that's true. Now, ratings of 1 and -1 don't make much sense for our scale, so we need an intercept to make this sensible. Let's say the grand mean of everybody rating wines is 50, right at the midpoint; I click Apply and the scale changes. We're building out a prediction about the population our sample will come from, and because I've specified no error, we're seeing this as if it were the true average.

What if there's an effect of closure? If I go down to closure, add a 1, and click Apply, the cork line is offset 2 above the screw-top line: the midpoint of this line is 2 above the midpoint of that one. So we've added a closure effect — on average, people rate the corks better than the screw tops. But remember, I had a prediction: perhaps people don't actually care about the type of closure when it's a white wine. So if I go to the interaction term, closure by wine type, add a 1 there, and click Apply, notice what that does: for the white wines there's now no difference in the population between cork and screw top, but there still is for red. So I've designed a population.
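To make the effect coding concrete, here is the population model those coefficients define, as a worked check (assuming, as in the demo, that red and cork are the +1 levels of the ±1 effect codes):

    \mathbb{E}[\text{rating}] = 50 + 1\,x_{\text{type}} + 1\,x_{\text{closure}} + 1\,x_{\text{type}}\,x_{\text{closure}}, \qquad x \in \{-1, +1\}

which gives the four cell means:

    white/cork:   50 - 1 + 1 - 1 = 49
    white/screw:  50 - 1 - 1 + 1 = 49
    red/cork:     50 + 1 + 1 + 1 = 53
    red/screw:    50 + 1 - 1 - 1 = 49

So the interaction coefficient of 1 exactly cancels the closure effect for the whites and doubles it (to 4 points) for the reds — the pattern visible in the Graph Builder plot.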
Remember the reason we're doing this: I want to simulate power. How likely is it that I'll detect these differences — the interaction, and maybe the main effects — given that I'm sampling? I don't measure the population; I just get a sample of 12 people. That's where error comes in, and notice there are two types of error in the simulated responses. There's the standard sigma: within a person, how much variability do we expect from repeated observations — if I asked you multiple times to rate the same wine, how much error around that grand mean of 50 would we observe? And there's the whole-plot error: how much different people differ on average — my average rating of wines versus yours versus somebody else's; maybe some of us are more or less critical about wines than others. Let's just make up some values here — when you're doing a priori power you'll have to come up with reasonable estimates, usually based on prior research — but let's say 4 for the individual (residual) error and 7 for the whole-plot error.

Now notice that when I click Apply, every time I draw a sample — which is what clicking Apply is doing — I get different estimates for these terms. It's still the same population, but the population has error: when I take measurements I'm getting sampling variability, so I won't always resolve the actual effect clearly. The question for power is: how often will I get a pattern of results in my sample that lets me detect — find statistically significant — the terms that actually were true in the population? That's where the Simulate tool comes in.

So I'll copy my simulated wine rating into my wine rating column and fit a model to that wine rating column; when I use Simulate, it will in essence click that Apply button for me as many times as I want and record the results — that is, record how often I get a result that is statistically significant. This is a mixed model, and we can run it as such. I'm going to use an add-in — the Repeated Measures Full Factorial Mixed Model add-in, which is on our user community; I'll show you where that is — and it just makes it easier to set up this particular design. Wine rating is our Y; closure and wine type are what are called within-subject effects, the factors that change within a person; expertise and gender are between-subject effects, the factors that change across people; and Subject ID is whatever identifies the person — the whole plot, in that notation — so that goes right there. When I click Run, JMP runs this as a mixed model. If you've never seen mixed-model output, there are some special components to it, like the REML variance component estimates, but in essence this is like an ANOVA: a model looking for categorical differences, and we get a fixed effects table with each of the terms we included in our model.

Notice that in the one simulation we just ran, we found an effect of expertise, which actually wasn't there in the population — that's a false alarm. We found the effect of closure, which was there in the population — remember I had a term for that. But we didn't find the closure-by-wine-type term — let's find it here: p = 0.09. So in this one sample we missed one of the terms, false alarmed on one, and got one actual hit. The question for power is: how often will this happen? How often will we detect the things we want to detect, and how often will we reject the null on things we shouldn't?

This is where Simulate comes in. We can right-click this table, and — brand new in JMP 13 Pro — there's an option called Simulate. When I click it, it asks which column in this analysis I want to switch in for a simulated column, and JMP has already written that simulated column — that's what Simulate Responses did. The Number of Samples is how many times I want JMP to click that Apply button for me.
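Before running it, here is roughly the model being refit on every simulated sample. The add-in builds it for you, but the same model can be launched from Fit Model directly; a minimal JSL sketch, using the column names from the design above (the add-in fits the full factorial, while this sketch lists only the terms discussed):

    // REML mixed model for the split-plot design: fixed effects for the
    // factors of interest, a random effect for person (the whole plot)
    Fit Model(
        Y( :Wine Rating ),
        Effects(
            :Closure, :Wine Type, :Expertise, :Gender,
            :Closure * :Wine Type        // the interaction we care about
        ),
        Random Effects( :Whole Plots ),  // person-to-person variability
        Personality( "Standard Least Squares" ),
        Method( "REML" ),
        Run
    );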
If you're doing a simulation for power, 2,500 samples is a great starting point; since this is a mixed model and each fit takes some time, I'm going to do a hundred, just to keep the webinar moving. What this does, in essence, is click that Apply button, run the model, and record the results — then click it again, run the model, record the results — and a hundred times later we end up with a table, made for us in JMP, of all hundred times that model was run on different samples from the population we set up: the simulated, or predicted, population.

The resulting table looks like this. The first row is excluded because it's from our original analysis, and then we have a hundred more rows. What were rows in the fixed effects report are now columns in the new table: we have the p-values that resulted from those hundred little experiments we ran, given the population we specified — again, for a priori power we need reasonable estimates for these things in order to trust the results. These are all p-values, and remember, we reject the null if we get a p-value lower than some criterion, some alpha. So we can look at the distributions of these p-values for each of the terms we care about to figure out what our power was and what our false alarm rates were.

I'm going to click this preloaded Power Analysis script, which runs Distribution and, for each of the effect sources in our model, shows the distribution of p-values. Let's compare against what the model really was in the population. Remember closure had an effect — it was 1 — and our p-values have the distribution you'd get when there really is an effect. For our rejection counts at alpha 0.05, we rejected the null exactly 36 times, so 0.36 is our rejection rate — our power here would be 0.36. Not particularly great if that's an effect we really want to detect. Closure by wine type was an effect we wanted to detect as well — that was the interaction term — and we rejected that null only 26 times, a rejection rate of 0.26. That's pretty terrible power. If we're going to spend all this money to have all these people drink wine, we probably want to collect more subjects, because this is not looking good if these really are the magnitudes of the effects we're looking for. And let's look at a term that didn't actually have an effect: the simulated power for expertise, where we put in no effect, shows a rejection rate of 0.10 here; it should be 0.05, and if we ran more simulations it would normalize to that. For any term that doesn't have a real effect, we expect about five percent of the simulations — five out of the hundred — to reject the null; that's just the price we pay for taking a sample.

Notice that this was not a terribly complicated model, but it was a mixed model — something it is not typically easy to do power calculations for. And really, no matter what the design from the Custom Designer was, we can use this simulation process to calculate — to simulate — power. The Custom Designer and the modern approach to DOE give us a lot of tools to make that table, and the Simulate tool lets us do a lot with it.
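To demystify what each click of Apply is doing, here is a from-scratch JSL sketch that draws one sample from the population specified above. The coefficients and error SDs are the demo's values; the table layout is an illustrative assumption, not the Simulate tool's internals:

    // Draw one simulated sample: 12 whole plots x 4 within-person runs
    intercept    = 50;   // grand mean
    bType        = 1;    // effect-coded offset: red (+1) vs. white (-1)
    bClosure     = 1;    // cork (+1) vs. screw top (-1)
    bInteraction = 1;    // closure-by-wine-type
    sigmaResid   = 4;    // within-person (residual) error SD
    sigmaWhole   = 7;    // whole-plot (person-to-person) error SD

    dt = New Table( "Simulated Ratings",
        New Column( "Subject", Numeric ),
        New Column( "Wine Type", Numeric ),   // +1 = red, -1 = white
        New Column( "Closure", Numeric ),     // +1 = cork, -1 = screw top
        New Column( "Rating", Numeric )
    );
    For( s = 1, s <= 12, s++,
        personErr = Random Normal( 0, sigmaWhole );  // one draw per person
        For( w = -1, w <= 1, w += 2,
            For( c = -1, c <= 1, c += 2,
                dt << Add Rows( 1 );
                r = N Rows( dt );
                dt:Subject[r]   = s;
                dt:Wine Type[r] = w;
                dt:Closure[r]   = c;
                dt:Rating[r]    = intercept + bType * w + bClosure * c +
                    bInteraction * w * c + personErr +
                    Random Normal( 0, sigmaResid );
            )
        )
    );

Simulate, in effect, repeats a draw like this, refits the mixed model each time, and records every term's p-value.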
There are many more things we can do with the Simulate tool, which some of our other webinars will look at. So again, check out our other DOE webcasts if you want to see more of the Custom Designer — that was a very quick and simple example, but it's a very powerful tool for designing your experiments.

All right, the next thing I want to talk about is a really time-consuming part of most research: preparing your data for the analysis. The analysis itself, it turns out, mostly doesn't have to be that complicated — mixed models are complex in a sense, but the tools for running them are not that complicated to use. Preparing data, though, requires a little bit of skill and a little bit of art. The first thing I want to talk about is detecting outliers. I'm going to open some sample data — these are available under the Help menu, under Sample Data — a whole set of cereals. I like this example because we have a lot of columns and we have intuitions about what values we should see for cereals, so it's an easy one to work with.

When I say outliers, I mean points extreme enough that we don't trust they were entered correctly, or points so extreme that we think they'll unduly influence our model. We have to be very careful with outlier detection: do it before you fit models, not after you've seen whether your hypotheses are confirmed. We have to be careful not to find only the results we want to find, so outlier detection should always be done before you've determined whether your data support your hypotheses. We can think about outlier detection in several ways — univariate, multivariate, and subject-level — and I'll talk about each of these in turn.

By univariate outliers I mean outliers that are strange with respect to the column they're in. If Apple Cinnamon Cheerios listed 20 as its protein level, we'd be able to see that easily just by looking at the table. But often outliers are hard to spot in the table, and that's where Distribution as a platform becomes very useful. My standard procedure with any new data table is to grab all the columns I'm interested in, put them into Y, click OK, and go through one by one, checking that each distribution looks as I expect. As I scroll down, it becomes very obvious when there are points that are very strange with respect to the remainder of the distribution. For instance, for fat, if I hover over this point — 100% Natural Bran Oats and Honey — it's very high on fat relative to the rest of the distribution. The reason points like this are drawn separately on the box plot is that they're beyond the fence: more than one and a half times the interquartile range above the third quartile. Those points are selectable, and — a really useful feature if you're new to JMP — everything gets selected when you select points, including the rows in the data table, which means I can operate on them as if I'd clicked them in the table. For instance, I can right-click and exclude them, or give them a marker — maybe a star — so I can identify them; notice the table updates with that marker. Or I can do my favorite: Name Selection in Column.
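If you'd rather script that scan, here is a minimal JSL sketch of the same workflow — launch Distribution on several columns at once and write a fence-based flag column. The table is the shipped sample; the specific column names are assumptions, so adjust them to what you see:

    // Scan several columns at once
    dt = Open( "$SAMPLE_DATA/Cereal.jmp" );
    Distribution( Y( :Fat, :Protein, :Fiber ) );

    // A hand-rolled version of "Name Selection in Column" for the fat
    // outliers: flag rows beyond Q3 + 1.5 * IQR
    dt << New Column( "High Fat", Numeric, Nominal,
        Formula(
            :Fat > Col Quantile( :Fat, 0.75 ) +
                1.5 * (Col Quantile( :Fat, 0.75 ) - Col Quantile( :Fat, 0.25 ))
        )
    );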
If I choose that, JMP asks me what the new column should be labeled — I'll say High Fat — and selected rows get a 1 while unselected rows get a 0. When I scroll through my table I now have this new column, which lets me recover the selection later: I can right-click on a 1, Select Matching Cells, and I have my selection back. So what I like to do is go through each of the columns and find points I think are a little suspicious. This isn't me excluding them; this is me giving myself data for later about points that are a little bit weird. Fiber and complex carbs look okay — so this is a nice univariate approach to looking for outliers.

When you have a great number of columns, there's also a screening approach. It used to live under the Cols > Utilities section — we moved the menus around in JMP 13 — and it's now under Analyze > Screening > Explore Outliers. Especially when you're working with very large data sets, that's an easier, larger-scale way to look for outliers across many columns.

But before getting to that, let's look at a different approach — not this univariate way, with respect to a single distribution — which I call multivariate outlier detection. Multivariate outliers are strange not with respect to their own distributions but with respect to where they fall relative to a bunch of other variables and the relationships among those variables. Let's go to Analyze > Multivariate Methods > Multivariate. The Multivariate platform is a sort of generalization of Distribution, and I really like it. Let's grab some columns and click them into Y — these have to be numeric, because the basic idea of Multivariate is to compute the covariances and relationships among the variables. You start with a correlation matrix, and you get the scatterplot matrix as well. My interest, for multivariate outliers, is the points that are strange with respect to the relationships among the variables. For instance, we have a strong relationship between potassium on the x-axis and fiber on the y; points outside this little ellipse are strange with respect to that relationship — a little too high on fiber relative to how much potassium they have. Notice that the points I marked before actually fall outside the relationship in this bivariate space. Points like that give you cause for concern in a different way: not that they're extreme in any one dimension, but that they're extreme in two or more dimensions. If we wanted to capture extremeness in the full n-dimensional space — all the variables at once — we'd love to plot them all together and see which points fall outside some ellipsoid, but it's very hard to plot more than three dimensions. So we can use something called the Mahalanobis distance, under the red triangle: Outlier Analysis > Mahalanobis Distances.
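For reference, the quantity being plotted: with sample mean vector $\bar{x}$ and sample covariance matrix $S$, the Mahalanobis distance of row $x_i$ is

    d_i = \sqrt{ (x_i - \bar{x})^{\top} S^{-1} (x_i - \bar{x}) }

Scaling by the covariance structure is what makes it a "covariance-scaled" distance: a point can be moderate on every single variable yet have a large $d_i$ because its combination of values cuts across the correlations.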
The Mahalanobis distance is a type of covariance-scaled distance: it looks for points that are not just far from the centroid — the multivariate mean — but far in a way that doesn't make sense given the relationships, like a point that's really high on fiber but not high on potassium, or the opposite; points outside the general relationship. When I turn on Mahalanobis distances, we get the distance plot. The x-axis is simply the row number — literally the first row in the data set, the second row, the third — and the y-axis is the Mahalanobis distance, that scaled numeric distance measure; again, scaled in terms of the relationships, that is, the covariance structure. Points above this upper control limit — above this 4.16, a number that changes based on how many rows you have — are too far to be expected: further than we would expect from random sampling, under some assumptions about how these things are distributed. Again, these aren't necessarily outliers; they're just points that are a little strange relative to the relationships among all the variables at once. Even though you don't see them stand out in the two-dimensional plots, remember, this takes all the variables into account. And again, I can grab these points, right-click, Name Selection in Column, and maybe call it High Mahalanobis — something I can look into later, to see whether they influence important relationships. The best thing you can find out about your outliers is that, included or not, your important effects don't change. What you don't want to find out is that when they're included your effect is there and when they're excluded it isn't — I'd say be really careful about those situations.

All right, that's a great way to look for multivariate outliers. Now there's a different type of outlier I want to talk about, one that's often harder to find: subject-level outliers. To motivate this, I'll bring up some data I collected for an experiment a while ago, looking at how the presentation of data influences people's judgments of the credibility of products when they're shopping online. I'll show you what the design looked like. Suppose you're shopping for health supplements on Amazon, and on these pages you sometimes see a two-dimensional graph showing the data, a table showing the same data, a terrible three-dimensional plot showing the data, or just an image related to the information on the page. If you saw all four of these pages you'd realize the manipulation, but across people: do people who see certain instantiations of the data make different judgments about the credibility of the information on the page? These were all pretty well controlled — the data was already on the page, so it was just a matter of re-presenting it. The question in the study was whether the instantiation of the data actually has an effect, or whether it's just the fact that data is presented. We measured a number of different things, across many different products, and we actually wrote a script to spoof the Amazon page.
We had undergraduates come in thinking they were shopping — research is sometimes like that. The question here is: do I have subjects that are what I'd call bad subjects — subjects who didn't take the task seriously? I had some additional information that became very useful: how many seconds it took them to view the Amazon page, and how many seconds it took them to rate that page on the seven criteria.

I want to show you something I found really useful when I was doing this research and writing it up, and that's Tables > Summary. If you haven't used the Summary tool, it's a great way to do quick summarization, and you can summarize on the basis of some grouping — in this case, Subject: I want statistics for each subject. Specifically, let me take the viewing and rating times and ask for the mean: across the 25 products each subject rated, what was their average time to view and average time to rate? When I click OK, JMP goes through my table and calculates, for each of my subjects, the average view time and average rating time, which gives me some sense of their attention.

Here's what I like to do: right-click and Sort Ascending. Notice I have a subject here averaging only two seconds looking at the pages in my study. If I go to Distribution and look at the distributions of time viewing and time rating — just as I did for univariate outliers — I can quickly see the weird ones. Here's somebody who on average took two thousand seconds to make their ratings: this is someone who just let the page time out during the study. And we had someone who took 433 seconds on average to view the pages. This is very interesting — it tells me about subject attentiveness and what people were doing in the study. That first subject clearly isn't going to be good: an average of 2,000 seconds to rate and 2 seconds to view. Because tables made with Tables > Summary are linked to the source table, selecting this subject's row here also selects that subject's rows in the original table. So if I go to my original table and use Next Selected, that's this person's data, and looking across it, clearly they didn't respond to anything — they just let it time out. So I can right-click, Hide and Exclude: I don't want those data in any analyses.
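Scripted, that per-subject screen is a single Summary message; a minimal sketch (the table and column names are assumptions standing in for this study's):

    // Per-subject means of viewing and rating time, linked to the source table
    dt = Data Table( "Credibility Study" );   // hypothetical table name
    dtSum = dt << Summary(
        Group( :Subject ),
        Mean( :Time Viewing ),
        Mean( :Time Rating )
    );
    // Sort so the least attentive subjects rise to the top
    dtSum << Sort(
        By( :Name( "Mean(Time Viewing)" ) ),
        Order( Ascending ),
        Replace Table
    );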
Now let me bring the data back and show you another way to do this, one that may not be as obvious. I'll go back to Tables > Summary and again put Subject as my grouping variable. Let's say credibility was my most important measure — the one I wanted to see moved around by these different graphical elements, the instantiation of the data. Instead of the mean rating, which of course is what my models will look at, I'm going to ask for the standard deviation: across the 25 pages each subject rated, how variable were their credibility ratings? Because when I click OK and sort ascending, if there are people with zero variability — they gave exactly the same response every time, or no responses at all — that's a problem. And people with very small variability — they rated every single thing nearly the same — are probably also a problem; it means they just clicked through saying 3, 3, 3, 3. Again, looking at the distributions tells us how much variability to expect. The distribution of the standard deviations — this would be roughly chi-square shaped — has a center: on average people are about 2.13 in their standard deviation across measurements, and we have some limits. So I can come up with a good criterion for when to exclude observations — maybe if they're in the lowest half percent of variability, or the lowest 2.5 percent — and in a sense Winsorize, or trim the sample down to the people who actually gave me meaningful information rather than rating every page the same.

This generalizes pretty well. Let me go back to my original table and back to Summary; this time let's take all the dependent measures, ask for the standard deviation of each, and again put Subject as my group. Then I can get an average measure, across all the measures, of how variable each subject is. I'll select those columns and right-click — these are instant formulas in JMP, and I'll talk about them more in the section on making new variables — New Formula Column > Combine > Average. Now I get (and this gets a little tricky) the average of the standard deviations of the measurements each subject made. Right-click, Sort Ascending, and now I can find the people who on average had the lowest variability across all the measures; again, looking at the distribution makes it much clearer who is giving very similar responses on every measure. That's the subject level: looking at the variability, and certainly at means of things like time-to-rate and time-to-view, becomes very important. And note that if you use something like Qualtrics, which is a great survey tool, you can embed timers — that's how I collected some of these — so you can time how long subjects spend on particular questions within a block or a page. All right, that's subject-level data.

Now, calculating new variables is something that's very important. Let me go back to the cereals: sometimes we don't care about certain variables until we do calculations on them. Notice we have cups per serving as a column. If I'm looking at the number of calories, maybe I really want calories per cup — raw calories aren't comparable when the servings differ in cups. In JMP there's a Formula Editor: make a new column, right-click, and choose Formula — hopefully you've found this if you're a JMP user — and you can define a formula for that column. I can take calories, divided by — hit the divide key — cups per serving, hit Apply, and now I have that scaled measure. The Formula Editor is really fantastic: it can take advantage of all of JSL, the scripting language, so you can write code within a formula to evaluate in the column — conditionals, comparisons, and the statistical procedures are all in there. So certainly take advantage of the Formula Editor when defining formulas.
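The scripted equivalent of that Formula Editor step, as a sketch (the column names are my guesses at the cereal table's naming, so adjust to what you see):

    // A stored formula, not static values: it recalculates if inputs change
    dt = Open( "$SAMPLE_DATA/Cereal.jmp" );
    dt << New Column( "Calories per Cup", Numeric, Continuous,
        Formula( :Calories / :Name( "Cups per Serving" ) )
    );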
There are also instant formulas, which are very useful when you want to do these calculations quickly. For instance, take cups per serving and calories — let me grab those two columns, the ones I want to involve in a formula. With both selected, you can right-click the first one and use New Formula Column — the same thing I used to make the average before. There are a lot of ways to combine them: the ratio (the first divided by the second, which is what I want here), the reverse order, averages, differences, and so on. If I click Ratio, JMP makes that column, and it does it properly: if I open the formula, it has actually written the formula into the column, not just the values. That's a very useful way to make these formulas quickly.

We also have options like standardizing, which I like quite a bit. These columns are all on very different scales; if I select all of them and right-click, New Formula Column > Distributional, notice we can center them (subtract off the mean), standardize (subtract the mean and divide by the standard deviation), range them from 0 to 1, or do transformations like the Johnson normalizing transformation. I'll do Standardize, and notice it writes z-scored columns for each — so these are now on the same scale, and if I want to compare extremeness across them, I can.

It's worth noting you can make temporary formulas as well. If I go into, say, Graph Builder, or any launch window, and there's a formula I want just for the purposes of this graph — maybe the log of calories — I can right-click the column, go to Transform, and click Log. JMP writes this italicized column, which is temporary: I can drag it in and use it in the analysis, but it doesn't live in my table — it lives just in this window right now. If I right-click it, I have the option to add it to the data table, or rename it, and do other things with it. So that's a nice temporary way to make a formula while you're inside a platform, and it works all across JMP.

Now, one other thing you'll often find yourself doing before running analyses is creating new variables that bin, either by intervals or by percentiles, and there are a number of ways to make these binning variables. Maybe we just want a high-versus-low calories column — not something categorical like we already have, but something that actually bins the values. Under Cols > Utilities you have the option to make a binning formula, which is rather nice: you can set specifically where you want the bins and what labels you want for the different values. With a bin width of, say, 150 and an offset of 0 — the offset sets where the first bin starts, and the width sets how wide each bin is — maybe that gives us the two bins we want.
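A binning formula of that kind is ultimately just a conditional; a hand-rolled sketch of the sort of thing Make Binning Formula writes (the labels and the 300-calorie upper end are assumptions for this example):

    // Width 150, offset 0: two labeled bins
    dt << New Column( "Calorie Bin", Character, Nominal,
        Formula( If( :Calories < 150, "0-150", "150-300" ) )
    );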
Since we had a nice range, we get two bins, and if we create this as a formula column we now have a categorical distinction: 0 to 150, or 150 to 300. But that's not always the way you want to do your binning — sometimes you'd rather bin based on percentiles, or on cut points that you design. For that there's an Interactive Binning add-in, and I'll show you where it lives in the community in just a second. Interactive Binning is really quite nice: you specify the variable you want to work with, and it lets you draw cut points interactively and shows you what proportion you're cutting off. If I want cut points at particular places, I can simply click Add Cutpoint, and I can even give each bin a name. Under the red triangle you can save these cut points to a grouping column, or do things like set the cut points by percentiles: maybe I want four cuts, so I set it at 0.25, and JMP does its best to put that proportion into each of the cuts. So Interactive Binning is a really handy tool to have installed — go to the JMP User Community (community.jmp.com) and just search for "interactive binning"; that's also where you'll find the repeated measures add-in I showed before.

All right, there are some additional things we can do: subsetting and merging. This is something we often face when first combining data — when we get data and need to bring it together into a final data table. I always like to show this US Demographics table in the sample data. It has columns that tell you nobody simply found this table online somewhere; it had to be created by bringing together measures from multiple places — there's no table online that has household income and also, say, the number of smokers in each state. So tables have essentially been joined together. Imagine we'd started off with a bunch of individual tables: attributes of the states (gross state product, vegetable consumption proportion, things like that), basics of the state (IQ, region, population), educational characteristics of the state, and average household income for each state. Before JMP 13, we'd pull these together using Join, a very powerful platform: we specify that we're joining the income table with another, and ask whether there are columns that match — State matches in both, so we can specify that and set up the join. It would be an iterative process: join each table into the result until all the columns are together.

But I want to show you something new in JMP 13: Query Builder for JMP tables, a really fantastic tool under the Tables menu. Query Builder was originally built for querying SQL databases, which is something you can do in JMP, but here it lets me specify a first table — it picked up income as my primary table, since that's what I opened first — and add the others as secondary tables.
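For the pre-JMP-13 route, one step of that iterative join looks like this in JSL (a sketch; the table names are stand-ins for the individual source tables):

    // Classic join: match Income and Basics on the State key
    dtIncome = Data Table( "Income" );
    dtBasics = Data Table( "Basics" );
    dtJoined = dtIncome << Join(
        With( dtBasics ),
        By Matching Columns( :State = :State ),
        Drop Multiples( 0, 0 ),      // keep one row per match on each side
        Include Nonmatches( 0, 0 )   // inner join: matched rows only
    );
    // Repeat with Education, Attributes, ... until all columns are together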
So: education, basics, and attributes. JMP then searches through those tables and does its best to find a column that matches between them, and it has found the key: it knows State is a common key among them all. If I click on the table snapshot, you can see it's going to do a great job combining these — that's the original table it's going to reassemble once it pulls in education, basics, and the attributes. When I click Build Query, JMP goes to a next section where I can say which columns I want in the final data set — I'll click Add All — and set up its characteristics. It even writes the SQL code, so if I wanted to run this against a database somewhere else, that's the code it's using — quite handy if you're teaching SQL. I could set up filters here too, but I'm just going to click Run Query — meaning: pull those data tables together into one. It keeps the identifier from each, so we have State, State 2, and State 3 here; we don't really need them all since they match, but JMP keeps them so you know where each table's columns came from. So Query Builder is a great way to pull together multiple tables.

Now, also under Tables, I want to show you something else that's new in JMP 13, which is actually quite impressive and gives you a lot of flexibility when you're working with large lookup tables. Suppose we have our basics table, but we want to bring in some of the attributes from the other table. What's really powerful is that I can specify a Link ID in one table, then right-click and specify a Link Reference in another table, pointing it at that reference. Notice what happened: I got a little key icon attached here. In the first table I told JMP which column identifies the rows, and in my final table I said: look up values from that other table, the basics table. It added this little section saying: here are columns referenced from the other table. They don't actually exist in my final table, and yet when I go to, say, Graph Builder and want to make a map colored by the states' IQ — that column is referenced; it doesn't exist in my final table — I'm still able to use the data from it. That becomes very powerful when you have lookups: suppose you have an item number and a huge database of items; you don't want to join all of that against your final data table — you can just link the items. That's useful in research situations, especially when you have lookups for particular characteristics. Again, the way it works: in the table you want to reference, you set the column as a Link ID; in the table you want to bring the values into, you set the matching column as a Link Reference to that table. And this can be nested — maybe region references something else, another table linked to it — so we can eventually form whole chains of links across tables. All right, those are tools for subsetting and merging.
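Scripted, a virtual join is just two column properties. A sketch — I'm recalling the property names here, so treat the exact syntax as an assumption; if in doubt, set them interactively once and read the table's saved script:

    // In the lookup table, declare the key...
    Column( Data Table( "Basics" ), "State" ) <<
        Set Property( "Link ID", 1 );
    // ...and in the analysis table, point the matching column at that table
    Column( Data Table( "Attributes" ), "State" ) <<
        Set Property( "Link Reference", Reference Table( "Basics.jmp" ) );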
Now, for reshaping and restructuring there are a number of great tools in JMP that I want to point out quickly. First, stacking tables. Sometimes you have a table like this, where the observations for each unit of sampling run across the columns; if we're going to analyze this with a mixed model, we need those observations down the rows. Under Tables we have the options Stack and Split. I'll show Stack quickly (there's a little JSL sketch of this step at the end of this section): you simply select the columns that make up the different observations — those are your stack columns — and I don't have to change anything else. I click OK, and JMP stacks the columns. The original column titles — hist 0, 1, 3, and 5 — are now in a new Label column, and the data from each — 0.4, 0.2, 0.1, 0.08, the original observations for this first dog — are stacked in the data column. We've taken a table that had 16 rows and made it 64, because for each of the dogs there are now four rows, one per original column. Stacking is how you get your table into that long format; the inverse is Split, which takes a stacked table and reverses it.

There are also times when you don't really need to stack or split — you really just need to tabulate. Here's an example I usually go to: a situation where you have eight observations for each state — SAT scores measured across different years. If you want to run a regression of, say, the salary in a state against the average SAT score, and you don't want to do it in some specialized way, you really just want to average those scores over the eight years. Under Analyze, the Tabulate platform is a really nice way to do this. Much like Summary, we can make tabulations from our data, but I like this graphical version because we can simply drag and drop columns. If I drag State here, JMP builds a table of how many rows there are for each state — okay, eight. Say we want to look at expenditure: I drop that in the center and ask for the mean by dragging Mean on top of Sum, so now I have the mean expenditure across those eight years. I also want an SAT total — remember the instant formulas I showed you before: right-click, New Formula Column > Combine > Sum, so SAT Verbal plus SAT Math. To add it to the tabulation I drag it not on top of expenditure but just to its right — it doesn't want to add there; I'll just do the one for now. Once you've made that tabulation, you can click the red triangle and choose Make Into Data Table, and JMP writes it out as a table you can analyze, with 51 rows: the 50 states plus the District of Columbia. So this isn't so much a restructuring as a tabulation that was necessary before the analysis — be aware that Tabulate is sometimes something you use before you get to the final analysis, especially when you're working with repeated observations. All right, that's tabulation.
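Here is the stacking step from above as a JSL sketch (assuming the dogs table, with the four measurement columns named hist0 through hist5 — adjust the names to your table):

    // Wide to long: one row per dog-by-occasion instead of one row per dog
    dtLong = Data Table( "Dogs" ) << Stack(
        Columns( :hist0, :hist1, :hist3, :hist5 ),
        Output Table( "Dogs Stacked" ),
        Stacked Data Column( "Histamine" ),   // the 0.4, 0.2, 0.1, 0.08 values
        Source Label Column( "Time" )         // the original column titles
    );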
Now, in the time we have, I want to point out a couple of specialty things under Analyze: advanced Fit Model options. There are times when we're running specialized multiple comparisons — we've already looked at mixed models — so I want to show some multiple-comparison options rather quickly. This is a data set based on real data I collected: I used to time how long it took me to get to campus when I was teaching. I didn't have this many observations — I made this version for a class — but it has the factorial combination of the time of morning I left, the route to campus I took, and the day of week. Let's fit a model in Fit Model rather quickly, because I want to show you some tools you may not have known were there. I'll put my time-to-campus in seconds in as Y and take each of these variables. Under model effects — if you've never used Fit Model, this is where you design which effects to include — this is a full factorial: each of these factors is crossed, so I want to fit all the possible crosses.

When I click Run, JMP produces the basic Fit Model output. I'm going to hide some of the sections, because you may have looked at the Effect Tests section before, but maybe you've never expanded Effect Details. For each of the terms — the sources in the model — Effect Details shows the means for the different levels, and under the red triangle gives a number of options. Maybe you've turned on Least Squares Means Plots before — that gives you the plot across, say, the days of the week of how many seconds it took me — but maybe you've never explored the other options. There are two I want to point out today.

Least Squares Means Contrast brings up a panel that can be a little intimidating at first — if you don't know what it's doing, it's not entirely clear — but it becomes a very powerful tool. What you're entering are little indicators saying which of these five means you want to compare against which others: you give a plus to anything on one side of the comparison and a minus to anything on the other. So Monday versus Tuesday looks like this: 1 versus -1. The 1 and -1 are not predictions of the means; they're weights in a linear contrast. Without getting into the details: the weights sum to zero, and their absolute values sum to two, so the two sides are balanced, left versus right. When I click Done, JMP produces that test — Monday versus Tuesday.

But we can do more complicated things with these. Let me do another Least Squares Means Contrast on the same term, day of week. What if we wanted Monday versus the average of the other days? I click minus for each of the other days, and notice they all get the same weight, -0.25, so the sum of all the weights is still 0 and the sum of the absolute values is 2 — one on the left side, one on the right. When I click Done, that gives me the test. We're basically testing weighted means — Monday versus all the others — in the context of this model, and that becomes a really powerful tool when you want to make specific comparisons.
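In symbols, the two contrasts just built, over the day-of-week least squares means $\mu_{Mon}, \ldots, \mu_{Fri}$:

    L_1 = (1)\mu_{Mon} + (-1)\mu_{Tue}                                      (Monday vs. Tuesday)
    L_2 = (1)\mu_{Mon} - 0.25(\mu_{Tue} + \mu_{Wed} + \mu_{Thu} + \mu_{Fri})  (Monday vs. the rest)

In both cases the weights sum to 0 and their absolute values sum to 2, and JMP tests $H_0: L = 0$.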
Another thing you can do with linear contrasts is what's called a slice, and I'll cover this rather quickly. Let me make the day-of-week-by-route plot: for each day of week — those are the different colors — there's a line for each of the different routes. Say I want to test a sub-design analysis of variance: is there a difference between the days of week for each route? Within Genesee, compare these means; within Gilman Dr, these; within La Jolla Village Dr, these; within Nobel Dr, these. That's four sub-design analyses of variance, and we also have five more sub-designs the other way: within Monday, is there a difference between the routes? Within Tuesday? And so on. Under the red triangle, the Test Slices option is exactly that, and when I run it we get slices for Monday through Friday, and also slices for Genesee, Gilman, and the others. Let's interpret one. The slice for Monday asks: do the routes differ on Monday — at that slice, that part of the design, is there a difference on the other factor? Similarly, the slice for Genesee asks: considering only Genesee, is there a difference across the days of the week? These sub-design analyses give me the ability to break up tests of smaller things within a bigger model, so Test Slices is certainly something to utilize. These are all actually linear contrasts in their own right; if you expand how a slice is formed — you don't have to unless you're really familiar with these — you'll see that's essentially what it's doing: for each section of the design, a test across the other variable. All right, that's testing slices.

We won't have time to get into it much, but bootstrapping is an option in JMP: anywhere you see something you want bootstrap confidence limits for, you can simply right-click and you'll have a Bootstrap option. It's available in JMP Pro, like Simulate, but it takes bootstrap samples — repeated resamplings from the same data — to give you estimates of confidence limits. And of course JMP has many other options — validation, model comparison, partial least squares, generalized regression — very specialized tools that some of our more advanced webinars will cover.

Okay, I said I definitely wanted to leave some time for questions, and I know we just went through a lot — remember this is recorded and I'll send out the recordings. I'm going to pause and see if Gail has any questions that have come in.

That's a great question — I'm really glad you asked. The question was whether these comparisons are corrected for multiple testing. With slices and those types of comparisons, they aren't — that was actually a section I didn't have time to get to, alpha-corrected methods. Let me bring this back up so everyone knows what was asked. Let's go back to a slice where we had a lot of comparisons — day of week by route was a good example. If I right-click and make these into a combined table, you can see we had nine of them, nine slices, and these are uncorrected. Now, there are some things I would do here. You can't get the same types of tests by running something like a Tukey HSD — that compares every single mean against every other mean, a pairwise comparison that would be alpha-controlled — but in this case the slices are what we want. So there's a great tool: a false discovery rate correction, also on the user community as the False Discovery Rate PValue add-in.
The way it works is you give the platform launch the effect — in this case the Term and Level columns, which identify what the actual comparison is — and the p-value column, Prob > F. When I click OK, JMP makes a new table with the FDR-adjusted p-values for each comparison. Here they were so extreme that it didn't change much, but for anything borderline it's really worth considering. Multiplicity becomes a huge issue, especially in research where you're not running a pilot study — where this is the primary, confirmatory research. The place to find this: go to community.jmp.com and search for "false discovery rate"; you'll get the False Discovery Rate PValue add-in. It works on any table that has p-values in it, so this would be my general-purpose p-value correction: you don't necessarily have to do corrections with something like a Tukey; you can take your table of p-values and correct them given the domain of inquiry you're in — experiment-wise, or within particular measures. I hope that helps.
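For the curious, the adjustment that add-in applies is the standard Benjamini-Hochberg step-up; here's a from-scratch JSL sketch on a made-up list of p-values:

    // Benjamini-Hochberg FDR-adjusted p-values:
    // adj[i] = min over j >= i of p[j] * m / j, for sorted p[1] <= ... <= p[m]
    p = Sort Ascending( {0.001, 0.004, 0.03, 0.09, 0.36} );  // hypothetical
    m = N Items( p );
    adj = J( m, 1, . );
    runningMin = 1;
    For( i = m, i >= 1, i--,
        runningMin = Min( runningMin, p[i] * m / i );
        adj[i] = runningMin;
    );
    Show( adj );  // prints [0.005, 0.01, 0.05, 0.1125, 0.36]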
Info
Channel: Julian Parris
Views: 1,158
Rating: 5 out of 5
Keywords: jmp, power
Id: ZxVA1LO5EBI
Length: 56min 36sec (3396 seconds)
Published: Wed Oct 19 2016