Multivariate Analysis and Advanced Visualization in JMP (12/2017)

Well, thanks, everyone, for joining us for this webinar, Advanced Multivariate Visualization and Analysis. I want to start by taking you through a little bit of the outline that I have for today, to give you a sense of where we're going and what sorts of things we're going to be seeing.

Just a note at the start: I assume you've seen a little bit of JMP before, so I won't be covering the basics of the interface. Although, as I go along, even if you are a brand-new user, you'll probably see that JMP makes data analysis and visualization so easy that, even with no introduction, you'll be able to do many of the things that I'll be doing today. With that said, I am hoping that you've seen JMP before, so I can point out some places in JMP where you can do a bit more advanced visualization and analysis, specifically for multivariate data. What I mean by multivariate is when we have more than two variables: univariate is one variable, bivariate is two variables, and multivariate, then, is more than two.

So what I'm going to do is, in the context of several multivariate analysis types, take you through visualizations that would accompany them, with an eye on how you might communicate the meaning of those analyses, especially when you're trying to communicate the meaning to new students. This is a webinar in our academic series, so I expect that many of you will be teaching these concepts, and I definitely want to put an emphasis on visualizations that are great for explaining the concept behind analyses.

So let's take a look at which analyses I mean. I'm going to be talking about analyses that live under the Multivariate platform. If you're new to JMP, that's going to be under Analyze > Multivariate Methods, and that will encompass a number of things such as correlations; in fact, even principal components we can get to from inside of Multivariate. We'll also see visualizations come about just by
using the Multivariate platform. We're going to look at the Scatterplot Matrix. It's actually under the Graph menu, and it gives us an option of producing scatterplot matrices in a specific way that I think is quite handy.

We're going to look at multivariate models. This could be things like a multiple regression; it could be a multivariate analysis of variance. Many of these will live under Analyze > Fit Model, and that's a place we go in JMP often when we're working with more than two variables. Those of you who are familiar with JMP know that Fit Y by X is our bivariate analysis platform and Distribution is our univariate analysis platform, so Fit Model is a place we'll go for a lot of multivariate analyses. We even have some specialized multivariate methods that live under their own menu: things like principal components, discriminant, partial least squares, and even clustering, so methods of unsupervised learning where we can cluster rows in our data set.

So we'll look at those analyses, and we're going to look at certain visualizations that come about by using them. I'm going to show you how to create more advanced visuals using Graph Builder, a wonderful drag-and-drop graphing platform under the Graph menu. We're going to look at the Prediction Profiler, the one that comes built in with Analyze > Fit Model analyses, as well as profilers that we can get to directly under the Graph menu. There's actually a bundle of them here: the straight Profiler, the Contour, the Mixture, and the Custom. We'll look at a few of those. I definitely want to show you the 3D scatter and surface plots. You know, 3D gets a bad rap, but in JMP 3D scatterplots are interactive, which really lends something to their ability to convey the meaning behind the data points. So we're going to look at those, and then we're going to look at a really fun and really useful multivariate visualization technique called bubble plots. So those are the visualizations we'll
see. As I'm going along, I'm going to point out things that I think are very valuable for presenting and sharing your visualizations and your analyses. Specifically, we're going to combine some of our outputs together to make dashboards, and I'll point out how we can save or export particular visualizations and analyses, because of course we're not simply creating these visuals and analyses for ourselves; usually we need to share them with others. At the very end I'll point you back to resources, and even at the start I'll mention jmp.com/teach, a great place to go with all of our teaching materials, and jmp.com/webinar, which is the place you'll go to find the recording of this webinar and many other webinars.

So let's get started. What I'm going to use is a sample data set that I quite like in JMP; it's called Cars 1993. If you're new to JMP, you can get to all the sample data by going to the Help menu and going to Sample Data. You can also go to the Sample Data Library, which will open the folder that all the sample data lives in on your computer.

So let's orient you to the data set first. Cars 1993 is a data set from Robin Lock's paper in statistics education, and these are new cars in 1993 with many attributes. We have things like manufacturer, model, and the vehicle category. We have several numeric columns: city mileage, horsepower, fuel tank. We have the minimum and maximum price of cars in that category, so for an Acura Integra, for instance, which is a small car, you can see that the minimum price, in thousands, was 12.9 thousand dollars, and 18.8 thousand for the maximum. So that's in the model class. We have a good number of numeric variables that we'll be able to use when we enter into these multivariate analyses, and I'm going to emphasize numeric today, although of course multivariate analyses can also incorporate qualitative variables, what we'd say in JMP are nominal or ordinal scaled. But we're going to start with a platform
that explicitly requires numeric variables: variables that are measured using numbers, numbers that actually mean something on a numeric scale.

So let's start off with the Multivariate platform. I'll go to Analyze > Multivariate Methods > Multivariate. You'll notice that I haven't even started out with any real predictions about these data. I'm not trying to model the outcome of anything in these data. The Multivariate platform is really about looking at the multivariate relationships among the variables, so not with an explicit eye on how to predict one thing in particular, but on how the variables relate to each other. Multivariate is an excellent place to start when you're investigating the predictor space in a data set. If we did have something we wanted to predict, we could use it here as well, but especially if we're interested in how our variables just relate together in their own space, Multivariate is a great entry point.

I'm going to start with the first five variables here: the mileage in mpg that it gets within the city, horsepower, fuel tank, passenger capacity, and weight. To enter these into a role, as many of you know, we can click Y, Columns; I could have also dragged them over. We have additional options we're going to come back to in a minute: how we can estimate these multivariate relationships, especially in the presence of missing data, and how we want the matrix organized, which is how the scatterplot matrix will be shown. I'll leave everything on the default and click OK.

JMP starts out by showing us the correlation matrix and the scatterplot matrix. Let's take a minute to look at our defaults, and then we're going to look at other options we can get under the red triangles, one for Scatterplot Matrix and one for Multivariate. So first, the correlation matrix. If you haven't seen a correlation matrix before, the diagonal is showing the correlation of each variable with itself; obviously it's going to be positive one. The off-diagonal points are the
correlation at that intersection. So negative 0.6726 here is the correlation between city mileage and the maximum horsepower: a negative correlation, indicating that increases in one variable tend to be associated with decreases in the other. So we can say that, among these cars we've sampled, those cars that have more horsepower tend to get worse city mileage. OK, not a mind-blowing thing to find, of course, but that is the correlation coefficient for that. You'll notice that same correlation coefficient, negative 0.6726, is here as well. That's not an error on JMP's part; it's just the fact that the intersection also occurs between horsepower and city mileage on this column. Basically, one side of the off-diagonals is a mirror of the other, and that's just how correlation matrices read, so if you've never seen them before, that may have caught you off guard.

You'll notice that the values are colored by the magnitude of their correlation. Positive correlations are blue, negative correlations are red, and the degree that they are dimmed is relative to how close they are to zero. A zero correlation means no linear association between the variables, at least within the sample we have.

All right, so that's our correlation matrix. I'm going to minimize this temporarily, which I can do with the gray triangles, and let's look at our scatterplot matrix. As I had in the correlation matrix, the diagonal here is one variable with itself. There's actually nothing shown in that diagonal. If you like, you can turn on something there, which I quite like, which is under the red triangle: showing histograms. I'll do a vertical histogram. Vertical histograms in JMP show, along the axis on the Y, those observations and where they occur, so you can see the city mileage is sort of skewed positively: most cars have a city mileage around 15 to 25, and a few are quite far up. So for that diagonal, since it would only show
one variable with itself, this is a nice way of making use of that space.

Let's check out the off-diagonals. Just like we had for the correlation matrix, the variables will show in two places, the same relationship, except it's going to swap the axes. So let's look at this right here. City mileage is what's on the x-axis here, that's the scale right here at the bottom, and the y-axis is maximum horsepower, so this is the city mileage by maximum horsepower scatterplot. Up here, even though these are the same data, notice the axes have been switched: city mileage is now on the y-axis, that's what these points are if you map them over to the Y, and horsepower is on the x-axis. So, like in the correlation matrix, we have somewhat redundant cells; however, in a scatterplot matrix they're mirror images, in that they're flipped in terms of the axes. I find this to be quite useful: sometimes it makes more sense to think of one variable on the x-axis than the other.

Notice that we also have density ellipses. Just like the correlation, they're showing the magnitude of linear association between these two variables. As the density ellipses get tighter, so as the minor diagonal gets smaller relative to the major diagonal, this is a larger correlation, either more positive or more negative. Remember, negative correlations can be just as strong, but in the negative direction. So these give you a very quick way to assess the correlation between these variables, and in fact, sometimes when you have lots of variables, the points don't even add much, so I can uncheck Show Points.

As you're thinking about this, if you're building a visual, especially one that you may want to publish, there may be things you want to change. Notice I'm hovering my hand and moving the axes just to get all these ellipses to fit on the screen. Our customizations here can play a big part in how we present our data, so think about using these
visuals, even though they're part of an analysis platform, as a way of communicating relationships in the data.

All right, so that's just the basic scatterplot output. Let's look again under the red triangle; there are some other things you may like. Fitting lines: if I fit lines for each of these and turn off the density ellipses, this is the linear regression of the Y variable onto the X variable. I could even turn the points back on and have those lines, so you can see they go through the cloud of points. You can show the correlations on each of these; this makes for another nice visual. If I drag out the size of these boxes, remember, they're all going to stay equal size, so I can drag any one of them. I like to make them square, just to make sure that we have a nice square output here as well. This gets us to the point where this is turning out to be a pretty useful visual, one we may want to show people, especially if we're reporting in a paper or showing in a class the relationships among variables.

So while we're keeping this open, before we do anything else, I'm just going to hop ahead and talk quickly about how we might get this out of JMP, because of course this isn't the only place we might want to show this. We can do File > Export on the Mac; File > Save As is what you would choose on the PC. This will give us lots of options for how to get this particular output directly out of JMP. We can even save it as interactive HTML if we like, or simply as an image, and you have different image types you can export as.

Another option, under your Tools menu, is to use the selection tool, and this is one of my favorites. You can simply click on a particular outline box, or even the overall output, and that'll let you use Edit > Copy. When you do Edit > Copy, this gets copied to the clipboard. If I were to go to Adobe Acrobat, or even on the Mac I can go to the Preview app and do File > New from Clipboard, this will actually open
this up in my Preview app. So this is something I can save out. In fact, this has even saved out as a vector graphic: I can zoom in and notice that there are no pixels to be found. These are actual little objects, objects that can be edited if you have a vector graphics program. I'll just save this to my desktop. As you're thinking about creating visuals, remember that once you've created something in JMP, that's not the last place you can use it. So I'll open this up in Adobe Acrobat; you can use whatever your favorite editor is. Let me just go to Tools here, click Edit PDF, and notice that these are all little objects. If I didn't like that it put that correlation there, I can simply move it over. Every one of these is infinitely editable. So keep in mind as you're creating your own visuals: software, whether it's JMP or any other software, will only get you so far. If you are customizing a graphic for a very special publication or a very important purpose, remember to go and make edits to customize it exactly the way you want it to be. And again, just to backstep to how I got to that: using the selection tool in JMP, I simply copied that out from JMP using Edit > Copy, and JMP's smart enough to take that out as a vector graphic.

So, going backwards, the Multivariate platform so far gave us an insight into how these variables relate to each other, and we already saw one kind of expected relationship: cars with more horsepower tend to get worse mileage. Similarly, cars with bigger fuel tanks tend to get worse city mileage. Probably that's because cars with bigger fuel tanks have more horsepower; they have to make up for the fact that they get poorer gas mileage. And maybe we see some things where we didn't expect a relationship: cars with more passenger capacity don't tend to have more or less horsepower, although there are a few points here of cars like the Chevy Corvette, which has a
very small passenger capacity and a very large amount of horsepower.

All right, so I'm going to just minimize this for now; we may use that in a dashboard later. Let's explore a Scatterplot Matrix version of this. Scatterplot Matrix is a graphing tool; it's under the Graph menu. It's not going to give us as many numeric options as the Multivariate platform, but I want to show you how it compares to what we just made in Multivariate. I'll take the same variables we used, these first five, and click them into Y, Columns. We'll come back and look at what X and Group do, but notice that by default there's the lower triangle; that's the matrix format that it's defaulted to.

When I click OK, I want you to see what this does, and I'll bring back my Multivariate output from the previous time we did this. You'll notice there's a slight difference in how it's orienting the scatterplot matrix: instead of having the entire square, we're eliminating the on-diagonals and everything to the other side of them, so we're basically looking at just this lower section here on the left-hand side. Now, that ends up being just fine. The redundant cells, for many people, are just that, redundant, so if what you care about is seeing those particular scatterplots just once, the Scatterplot Matrix gives you a nice way of doing that. It turns out that so does Multivariate. If you remember, when I invoked Multivariate Methods > Multivariate, at the very bottom (let me hit Recall; that'll bring back the variables) there is a matrix format option here to use lower triangle as well. Obviously upper triangle will do the opposite. I'll do lower triangle just so you can see, and that ends up being really the same exact output.

Now, there is something important, though, the reason I wanted to show Scatterplot Matrix: there is a different thing you can do in Scatterplot Matrix than you can do with Multivariate. If I go to Multivariate and bring in those same variables we had before, let's imagine we wanted to
bring in one additional variable: is this a domestic manufacturer, or is it not a domestic manufacturer? If I try to click that in, Multivariate doesn't quite like that: one of the variables in the list was not the right analysis type or was the wrong data type. That's because that is a categorical variable. Scatterplot Matrix, on the other hand, under Graph > Scatterplot Matrix, will be OK with that. I can add in my variables, and I'll take Domestic Manufacturer and also add that in, so notice it is in the Y, Columns role. The way it handles that is to make little jittered dot plots, and this actually works pretty well. You may notice in this particular case there's not a huge difference for each of our variables between whether they're domestic or not, but you are noticing, maybe in city mileage, there's a clump of them under domestic one that get much worse gas mileage, and a few that are not domestic that get quite good gas mileage. So this gives you a way of getting dot plots from variables that are not actually quantitative, and so Scatterplot Matrix has one little advantage there.

The reason why Multivariate disallows this is that Multivariate requires the variables to be numeric to do the rest of the numeric calculations. Notice we got correlations; we can get confidence intervals on the correlations, inverse correlations, partial correlations. All of these require that we have numeric data in order to perform those calculations. Since Scatterplot Matrix under the Graph menu doesn't perform those operations, it doesn't much care whether the variables entered are fully numeric.

There are a few additional options that you can get under Scatterplot Matrix. There's also the option for nonparametric densities. This is a pretty nice output: if I turn off the points, you'll see it creates little heat maps of where most of the observations are. So this is a nice way, without
imposing any sort of model like the lines are doing, of just seeing where the observations tend to fall. So be aware that Scatterplot Matrix is an option, and something that you can keep in your arsenal when you're creating visualizations.

All right, while we're still in Multivariate, I want to point out a few other things you might find useful. I'll minimize the scatterplot matrix, and let's go back to the top red triangle. When you have a large number of variables, your scatterplot matrices tend to be a little overwhelming. You have so many of those little sections with scatterplots that they stop being very useful for assessing which variables relate to which other variables. What ends up being very useful are the color maps. If I select color map on correlations, what we get is a color map where positive correlations are red, very negative correlations are blue, and it's centered at 0. This is actually picking up on my default color gradient. Under the Preferences menu in JMP, I can set what the default color gradient is, and that's my continuous color theme, so you can change that if you like, but this ends up working pretty well. The on-diagonals all have a positive correlation of 1, since a variable with itself is positively correlated at 1, and you can see where the off-diagonals are. At a glance, you can see very quickly where there are very large positive or very large negative correlations. We can see that city mileage is very negatively correlated with weight in pounds, and that kind of makes sense: very heavy cars probably aren't going to get very good mileage. Maybe that's because weight is related to horsepower, but that's something we'll be able to investigate with a proper multiple-variable analysis in just a minute.

So the color map on correlations is an excellent option to use. You can also color map on the p-values and cluster on the correlations. Cluster on correlations gives us a slightly different variant of this, where it's basically stuck all the
correlations that tend to be close into the same place, so it's reordering the rows here.

Additional options that I like: parallel coordinate plots. We're going to come back to these in just a minute, but know that you can get to them from Multivariate. I'll just produce one here. If you've never seen a parallel coordinate plot before, don't worry, we'll step through it in much more detail later, but just to give you a sense, I'll click on one of these lines. That line, you'll notice, is selected in the data set: the Chevy Astro van. What a parallel coordinate plot is showing is the profile, for these five variables, of a Chevy Astro van. It tends to have low city mileage relative to the rest of the cars. It has medium horsepower, a large fuel tank, a medium-high, actually the largest, passenger capacity, and almost the highest weight. So what we're looking at is, in a kind of standardized coordinate space, the profile of that car, and any car we pick will have a profile. If we find the one with the best city mileage, there it is: the one with the best city mileage tends to have the lowest horsepower and fuel tank. Which car is that? This was the Geo Metro. In 1993, if you all remember, the Geo Metro was a car. So the parallel coordinate plot is another high-dimensional visual. We'll see a different way of making it in just a minute, but this is a very nice thing you can get to directly from the Multivariate platform.

Finally, there are some additional methods we're not going to cover here, but some of my favorites live under the Outlier Analysis section. If you're performing analyses that use a lot of variables in the X space, looking at whether there are outliers in that X space is really important, and Mahalanobis distance measures are really great for that. You can also launch directly into principal components or item reliability, or even get 3D plots here. Ellipsoid 3D lets you select three of your variables and allows you to look at those variables in three
dimensions, and the ellipsoid is a generalization of that ellipse we saw in the correlation scatterplot matrix. This is showing the ellipsoid in three dimensions now.

So that's Multivariate: a lot of options, even beyond just the basic ones, and a lot of ways that give you nice ideas for thinking through which variables are really relating to each other in your data set. Before we start fitting any models, let's think about how we might display some of the relationships we've seen, but using other graphing platforms, and maybe invoking things you haven't seen before in JMP.

I'm going to start with Graph Builder, probably my favorite graphing platform, and one that, although very simple, gives you the ability to construct some very advanced graphics that tell a great story about relationships you see in your data. Graph Builder is available under the Graph menu. I'll invoke Graph Builder here, and I'm going to keep the Multivariate platform open, because let's look at some of these relationships, but maybe see how they relate to each other. I'll start by looking at city mileage. That was the first variable that I put in, and so maybe one we're interested in understanding.

Now, the way Graph Builder works, if you've never seen it before, is that it's built around drop zones. There are drop zones for data, like Y and X, and drop zones that group your visual: Group X, Group Y, and Wrap. Then there are drop zones that modify the points, like coloring and sizing, or that put multiple visuals on top of each other, which is what Overlay will do, based on the levels of a particular variable. This is all made more clear once you see it happen. I'll take city mileage. The drop zones where I could drop it are highlighted. If I hover over Y, JMP will add it to the y-axis. One brilliant thing about Graph Builder is that I don't have to drop a variable; I can simply move it to different spaces and see the effect it
has on the visual. I'm actually going to drop it in the Y drop zone, so I'm going to treat this as my outcome variable. JMP is jittering the points by default; I can uncheck Jitter. The control panel here gives me controls over the elements that are in the visual. The only element here right now is the points. I can toggle between different elements, like a box plot, bar chart, or contour plot; these are all elements that can display data in one dimension. So I have left jitter on; I have left the points on.

Let's now define an x-axis. Let's consider horsepower. We know, based on the scatterplot over here, that there's a negative relationship between horsepower and city mileage, but let's see if we can clarify how that happens. I'm going to drag horsepower down to the x-axis. Again, JMP creates a visual as I drop the variable, or even hover the variable over that axis role.

We have a number of things we could put on this visual. I could put on a smoother; this is actually the default if you were to drop those variables in at the same time. The smoother is sort of like a moving average. As you move lambda up (the stiffness of the line, rather), you get towards the linear regression of Y onto X. As you move to the left, you're allowing the line to be more smooth, to pick up more irregularities relative to a linear fit. There's sort of a point where you're probably fitting noise, and a point where you're probably imposing too much linearity, so the smoother gives you an ability to play around with how the data are fit. I'll leave the smoother on. We could have had a linear fit if we liked, or we could have connected the points if there was some reason for that based on the row order, but I'll put the points back on with the smoother.

All right. Remember, we noticed that the weight of a car was related to city mileage as well, and when I used the Scatterplot Matrix we even looked at things like whether a car is domestic or not, and, we didn't look at it yet,
but whether a car is small, medium, or sporty: how it's classified. We can invoke those other variables in a number of different ways. Domestic Manufacturer is a simple one to start with. What if we wanted to look at the relationship here, city mileage and horsepower, but across domestic and foreign cars? If I drag domestic and foreign to this Group X section, notice what JMP does. It's now split up the x-axis, where on the left-hand side we have just the non-domestic cars, and on the right-hand side the domestic cars. Group Y does a similar operation: it breaks up the y-axis. Wrap seems to do the same thing as Group X; notice I can move between the two of them. The reason why that's not differing is that I only have two levels here. You'll see the difference when we work with a variable that has more than two; perhaps I use number of cylinders. If I do number of cylinders, notice that breaks it up as a trellis plot. I'll go back to Domestic Manufacturer. So Group X and Wrap, with two levels of a variable, will do the same.

And then, finally, Overlay does something I quite like too. Instead of having the axis split across the levels of Domestic Manufacturer, Overlay takes whatever visual we've defined, in this case the smoother, and displays it separately for the levels of that variable. So we have Domestic Manufacturer zero and Domestic Manufacturer one as separate points and separate lines, and not only are the points colored, but the lines here are different colors as well. And you'll notice something: there's sort of a mean shift here. The non-domestic cars are just getting higher mileage overall, even for the same horsepower. Even for a car at 150 horsepower, we're getting better mileage if we happen to have a foreign car.

You may notice it's maybe a little hard to read without any gridlines. Let me make some modifications to this graph, because maybe we want to
make this easier for our visual users. I just double-click the axis and I can turn on gridlines. I'm going to turn on gridlines with three minor ticks; I just happen to like three. So there's our graph with three minor ticks for horsepower. I'll do the same thing for mileage: let me double-click the axis, turn on gridlines, and turn on three minor ticks. So now it looks a little bit like graph paper, which I always loved, and we have a graph where we can look at a particular mileage here or a particular horsepower. So notice that by using Overlay we were able to break apart that visual and really see how those differences fleshed out.

I'm going to take out Domestic Manufacturer, and let's try a different way of involving another variable. Suppose we wanted to look at the horsepower differences by the type of car, so vehicle category. Vehicle category is small, compact, van, sporty, midsize, or large. I could use this in an Overlay role; it's going to be pretty messy, and notice that what it's doing is showing the differences in the relationship. But what if I only want to show differences in horsepower among these different classes? Well, if I don't want to involve mileage, I can use one of the drop zones above or below mileage. This is a way of doing a separate plot, but adding it into the current plot. So I'm going to drop it right above mileage.

Now, JMP starts off by using the same element, that is, the points, that I had used before. JMP is smart enough to know that a smoother doesn't really make sense for a categorical variable; we're not trying to connect the lines between these. But maybe I want something different than the points. Maybe I want to turn on the box plots or a bar chart. An easy way to do this, and the way I typically do it, is to right-click, hover over the Points section, and say change that element, for this subgraph, into, let's do a box plot. That's a great way of looking at these. So we have box plots now, looking at just the differences
among horsepower for the different vehicle categories so we're looking at the medians the quartiles and the fences and we can make lots of customizations I can right click on the box plot in the legend maybe I want to fill it with a color I quite like box plots that have a little fill we can resize maybe I don't need to use half the real estate here if I hover right below compact notice I can grab just that section and so I can make this a little bit smaller so I'll do it like this alright so now we have a visual that maybe tells a bit more of a story and I can change all the axes just by dragging if I don't want the legend I can go to the red triangle turn off that legend it probably isn't adding too much additional since we already know what each of the variables are and which sections they're in and notice when I don't need all the controls anymore I can click done and it closes all the controls you can still make changes to things like the titles the axes those are all still editable but if you need to change which variables are used go back to your red triangle and turn back on your control panel and that'll give you your graph builder controls back but I'll click done and leave this this is a graph maybe I want to save for later speaking of saving maybe this is a graph you'll want to remake multiple times and if you add more data that took me a number of steps so maybe I want to save it in a way that lets me recreate it and so I want to touch briefly on saving to a script under your red triangle for any graph or any analysis there's a save script section and so since this is one maybe I'm gonna make again I'm gonna select save to data table and I'll give this a reasonable title that's actually fine it just picked it up from the title of my graph I'll hit OK and notice that in my data table now there's this little script saved over here I can click this play button and JMP will recreate that graph using whatever new data I have and so saving a
script is a really valuable way of saving your analyses not just for recreating but when you're presenting I highly encourage you whether it's in your classes or during research presentations have your data set open and work with the live data there's nothing more impressive than actually being able to select points to create graphs to make modifications in real time and JMP gives you a great ability to do that but saving your scripts means you don't have to subject your audience to the two or three minutes it took to set up this graph so remember under the red triangle save script alright so that's a nice graph I'll minimize that I want to show you a useful trick in graph builder if you wanted to create something like this notice this graph really is just adding multiple variables to each axis and then using the same variables on each axis over and over just to prove that point to you let's go back to graph builder let's do it for just two of the levels if I wanted city mileage on the Y and then I wanted maximum horsepower on the Y and then I put city mileage on the X and maximum horsepower to the right on the X notice I'm dropping them in to the right hand side or to the above sections we are pretty close we have to right-click in one side change that smoother I'm gonna change it to a histogram and I'll turn the points off and I'll change this one to a histogram and turn the points off so I basically just recreated a scatter plot matrix with histograms on the diagonal but that took too long even though it took a few seconds in JMP time that's still too long so the trick is and this is a very useful thing to know you can do in graph builder click on the dialog button and there actually is a dialog version of graph builder I quite love the interactive version but if you're creating certain visuals dialog works very well to do this I'll actually grab those four variables I'll click them into both Y and X and then I'll tell JMP I want to create
down here a graph matrix and I'll turn off the smoothers even I'll just say none so I'm only gonna look at the points and I'm asking for a graph matrix JMP will when I click OK create that graph matrix now there's a lot of coloration here that I'm not totally fond of an easy way to fix this is double click on one of the points select all of the points here and all I'm gonna do is right click and I'm gonna say make them all black and knowing that you can double click an axis to get to the axis settings is really important again I just double clicked one of the points select everything command A works if you're on a Mac or ctrl A if you're on a PC click the point and then set the color so maybe you don't want it to be black maybe you just want dark gray alright so that's a nice way of offsetting those all and remember I turned off the legend before if I do that now and click done I've gotten pretty close to the visual I made before and so that's an easy way to use graph builder to even make your scatterplot matrices so a very useful thing that you can do now just like we were able to using scatter plot matrix we can do even more advanced things using graph builder again because graph builder knows how to operate on not just continuous variables but also on categorical variables let's go into that dialog I'll put in those four variables again but actually this time let's add one additional remember vehicle category let's actually I'll even just remove these let's just add them all at once I'll take vehicle category including the rest of those and click them into the Y and X now if I ask for a graph matrix now and I'll turn off the smoother JMP will do something very similar to what it did using scatter plot matrix it grabs those points and sticks them sort of in their locations jittered but like we did before I can right click and change those points to something else I want to change them now to let's use a box plot I think that'll look nice for each of these and so I'll do
points change to the box plot I'll do all those rather quickly just to make the point that especially when you get used to JMP you can do all these things rather quickly and create visuals that tell quite a bit in a small amount of space and just remember that you know graph builder is one of these platforms and we have other webinars that'll go through a lot of this one of these platforms that once you've used quite a bit and I'll just change the colors to black here and click done and turn off that legend you know and now we have kind of a neat visual that I really love so involving a categorical variable gives us the histogram for that plus also the box plots as it relates to every other variable so we get a lot of information density here rather quickly and again if this is something you're gonna make again go to the red triangle save script and drop that to your data table and so a lot of flexibility in graph builder ok so let's look at a different way I'm going to minimize the multivariate we're not going to use that anymore just to show you another way graph builder lets you invoke multiple variables of course this is a visualization for multivariate data talk so we want to look at that another way we can do it let's start off with something categorical perhaps let's use manufacturer on the X and so this is the type of car we have and let's say we're interested in city mileage and highway mileage how those two relate to each other so if I put city mileage in the Y and let's turn on let's just do bars here to keep this low-density if I put in highway mileage let's look at different places I can drop this if I drop this above city mileage although I can make a comparison I can read off where Acura is on each it doesn't do a great job of showing me at an instant whether a manufacturer has a wide variety between their highway and city mileage all right so dropping the variable there I'm just gonna drag it out isn't particularly useful to me but
there is a drop zone right inside of the graph notice what happens when I drop inside I'm adding this variable keeping the existing variable but adding it inside and so for each of the manufacturers I actually now have one of each so what it is for highway and what it is for city and under the bar style is how I control how this is displayed what we're actually looking at here is side by side there's a great option called nested which I quite like because notice what nested will do is take one of those variables in this case the first one we entered so the mean for the city mileage and it's going to nest it inside of highway mileage and that ends up being exactly what we want because city mileage is always lower than highway mileage on average we can look across the different manufacturers and actually see that difference and we can do some clever things with this you may not notice it but if you right click there's an order by option so I can actually order these manufacturers by one of the variables in the data set perhaps I wanted to order by city mileage ascending and so we're looking at the blue bars will always be ascending and we can see how that changes with the highway mileage you know maybe we also want to let's order by something even more clever what about the difference between city and highway mileage on average so which manufacturer has the biggest difference between those two so I can actually right inside of graph builder right-click I'll do combine and make a difference so I'm making a difference score between city mileage and highway mileage and now I'm going to drop this into the drop zone just below the graph so notice there's places I can drop this it's manufacturer ordered by that ascending and so we're actually looking at here which are the ones with the largest difference and so Saturns for some reason they may do well on city but they do poorly on highway whereas Suzuki and Mitsubishi tend to have a very small difference
between those and so notice that ordering gives us even the ability to tell a story nice and conveniently with our data here all right so I'll hit done let me minimize this alright so let's look at a couple multivariate ways we might work with two numeric variables and I'll start this with graph builder but we're very quickly going to go into a multivariate platform called fit model and to preface this let's think back to a relationship we saw we saw city mileage versus horsepower and we saw that there was this relationship where cars with higher horsepower tended to have lower city mileage and there's a bit of curvature to that this is telling me probably that a transformation might linearize this so looking at horsepower maybe the log of horsepower might be interesting to us we don't have to get too advanced with that yet but let's start off with the idea of thinking how this relationship maybe relates to the weight of a car and we know that as we get cars with larger weights their mileage went down as well in fact I can double plot that here I'll just put it to the left so cars that are heavier have worse city mileage and cars that have more horsepower have worse city mileage but what's gonna happen if I use this variable the weight variable in one of the other roles the roles that break up the relationship so I can do Group X and notice what JMP does is break up weight it's a numeric classification it breaks it up into five sections for us so five sections with an equal number of points I can do Group Y it breaks it up that way too I can do wrap and I can do overlay so I'm gonna stop here on overlay for a second and I want you to notice something if I click on the lowest weight class it looks like there's a pretty sizable negative relationship for very light cars a little bit of extra horsepower is really hitting their gas mileage which maybe makes sense if your car is very light you're gonna get a really
great gas mileage if you have very little horsepower there's not much drawing on that but even if you add a little bit it's going to really dig into that but with city mileage as I get to very heavy cars there's almost no relationship among the heaviest cars horsepower didn't seem to relate at all to the city mileage they're already so heavy they're so worthless for mileage that your horsepower doesn't really matter too much so we're looking at an interaction here between two quantitative variables but I want to show you something as far as visualizing this because I think this visual does not tell that story clearly at all there's too much noise happening what's actually quite useful in JMP let me go to the Group X you may not have known this even if you're a JMP user if you right-click that Group X axis you can tell JMP how many levels you want it to use for that grouping so it's creating the groupings for us and I'm gonna tell JMP just to make two so I want a low and a high so the low weight goes from 1695 to 3040 and then the high weight goes 3040 to 4105 if I now drag that variable to overlay JMP honors the fact that I told it to use only two levels to now have a two-level grouping and so weight in pounds without me having to make my own binning formula which is actually very straightforward to do as well but just doing it graphically I can create this right inside of graph builder and the same relationship that we just observed before but now shown more clearly among the heaviest cars very little relationship between horsepower and mileage and among the lighter cars quite a bit of relationship now this distinction we can look at even as linear regressions so I can turn on the linear regression line again this is doing it on a subset of each of the variables and so this gives us a nice way of exploring that relationship without having to make the binnings or do the proper modeling but let's do the proper modeling because this becomes a very
interesting way to teach multivariate regression and to talk about interactions between variables which is actually a very difficult thing traditionally to visualize especially the interaction between quantitative variables so a bilinear interaction term is one of the more mysterious things especially when teaching new students it's a hard thing to grasp so let's start by talking about how we would get into that analysis and that's gonna take us into the fit model platform and fit model is useful anytime we're working with more than two variables it's fine with just two but it is a platform that's designed for working with more than two variables so let's define those variables I'm gonna put city mileage as my Y so what I'm trying to predict and I'm going to use horsepower and weight as my X so I've selected them both on the left hand side and the reason I selected them both before entering them into the model effects is because I want to use a macro I want to use a macro that says create full factorial effects using these two variables which means create main effects terms so how much does maximum horsepower on average holding everything else constant affect city mileage how much does weight on average holding everything else constant change city mileage and then this bilinear interaction term which basically says how much do increases in maximum horsepower attenuate or strengthen the relationship between city mileage and weight or said differently how much do changes in weight attenuate or strengthen the relationship between city mileage and horsepower so how does one variable affect the relationship between the other two that's what an interaction term is but it's often a very hard interaction term to visualize or to think about but something that JMP does quite well so I'm gonna click run and before we create a visual I'm just gonna clean up our output I'm just going to minimize some sections let's just look at our parameter estimates section here
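Editor's sketch, not part of the webinar: the full factorial model described above, two main effects plus one cross-product term, can be written out in plain Python. Everything here is an assumption for illustration: the data are synthetic, the coefficient values are made up and are not the estimates shown on screen, and the function names are hypothetical. The predictors are centered before fitting, which leaves the two slopes and the cross-product coefficient unchanged and only shifts the intercept.

```python
from statistics import mean


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_full_factorial(x1, x2, y):
    """Least-squares fit of y on centered x1, centered x2, and their product."""
    m1, m2 = mean(x1), mean(x2)
    rows = [[1.0, a - m1, b - m2, (a - m1) * (b - m2)] for a, b in zip(x1, x2)]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty), m1, m2


# Synthetic grid of cars: every horsepower value paired with every weight.
HP_LEVELS = [60.0, 100.0, 150.0, 200.0, 250.0, 300.0]
WT_LEVELS = [1700.0, 2500.0, 3000.0, 3500.0, 4100.0]
hp = [h for h in HP_LEVELS for _ in WT_LEVELS]
wt = [w for _ in HP_LEVELS for w in WT_LEVELS]

# Made-up "true" coefficients: intercept, horsepower, weight, cross product.
TRUE = (22.0, -0.05, -0.006, 0.00002)
hp_mean, wt_mean = mean(hp), mean(wt)
mpg = [
    TRUE[0]
    + TRUE[1] * (h - hp_mean)
    + TRUE[2] * (w - wt_mean)
    + TRUE[3] * (h - hp_mean) * (w - wt_mean)
    for h, w in zip(hp, wt)
]

coef, m1, m2 = fit_full_factorial(hp, wt, mpg)


def horsepower_slope(weight):
    """Slope of predicted mileage on horsepower at a fixed weight."""
    return coef[1] + coef[3] * (weight - m2)


slope_light = horsepower_slope(1700.0)
slope_heavy = horsepower_slope(4100.0)
print("fitted coefficients:", [round(c, 6) for c in coef])
print("horsepower slope at 1700 lb:", round(slope_light, 4))
print("horsepower slope at 4100 lb:", round(slope_heavy, 4))
```

With these made-up numbers the horsepower slope is steeper at the light end of the weight range than at the heavy end, which is the pattern the two-level weight split in graph builder showed. The single cross-product coefficient is what makes the slope for one variable depend on the value of the other.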
and if we're looking at the P values in this data set everything is statistically significant not because it's fake data just because these are very obvious relationships but we are seeing the relationships we expected as you have cars with more horsepower they tend to be cars with lower gas mileage as you have cars that are heavier they also tend to be cars with lower gas mileage and then you have this estimate for the interaction term now it's beyond the scope of today to talk about why there's the minus here just know that that's because the variables are centered before the cross product is taken but what we're seeing here is a negative coefficient for this interaction which basically means an attenuating relationship as you increase one variable the relationship between y and the other variable will get less steep how on earth would we visualize this and so this takes us to a couple visuals that I quite love and the first probably one of the things that first made me fall in love with JMP was this prediction profiler and the prediction profiler under the red triangle in fit model is under factor profiling and it's right here called the profiler and the profiler is one of these magical things that lets you look at what the terms in this model actually mean and I'm just gonna drag up some axes just to clean things up before I start moving things around and what you're gonna see when I move things around is two things happen first the prediction on the Y so we're looking at predicting city mileage based on how much of each of the input variables we have and so as I drag and you can drag things around here and change how much a car weighs or how much horsepower it has you're gonna see the city mileage estimate change so cars that have 200 horsepower and are 3,500 pounds tend to get eighteen miles to the gallon on average and there's the confidence interval but as I did that moving you probably spotted it because it's very obvious and I'll just point it out right now watch
the coefficient or watch the slope between mileage and horsepower so keep your eyes where my mouse is here but I'm gonna move how much our car weighs so among cars that are very light there's a very large relationship a steep negative slope between city mileage and horsepower but as I get to cars that are heavier that relationship flattens out and remember that's what we saw in graph builder when we just made the two level split among cars that are heavy there's a pretty flat relationship between horsepower and city mileage but among cars that are light there's a strong relationship so we're now profiling this interactively that is the continuous regressor form of that interaction and that's what that negative coefficient means as I move literally above the mean of one variable that's what the minuses are all about those are the means of each variable as I move away from the mean in the positive direction we're taking a little bit of the relationship this amount away from the slope between city mileage and horsepower and there only needs to be one coefficient here because that interaction is actually symmetric and this is another place where JMP's visualizations help you see why what I'm looking at here is a two-dimensional representation of the three-dimensional response plane fit through the three-dimensional points of these three variables so there is in three dimensions a cloud of points I'll go to graph scatterplot 3D just to show you those quickly so I'll take city mileage horsepower and weight so in three dimensions there is this cloud of points and what we do in multiple regression is fit a response plane through it just like we do with two dimensional points and fit a line so let's look at that response plane under factor profiling there's the surface profiler and the surface profiler lets us see the curvature of the response plane notice that that is not a flat plane going in three dimensions the curvature is this estimate here that
interaction term that curvature is responsible for why as we get higher in weight there's a flat relationship between city mileage and horsepower or similarly and symmetrically as we get to a high amount of horsepower there's a flat relationship between weight and city mileage and as we get to very low horsepower there's a strong relationship in the negative direction so the curvature of the response plane is what's accounting for that and this is a way we can see it and in fact if I click under appearance and turn on the surface plus residuals I'll just move this up so I can make it a little bit larger for you let's make this nice and big and I recommend this if you're ever teaching multiple regression so being able to interact with the response plane that is the plane that minimizes the sum of the squared residuals from the points to the plane the points to the model that's what a residual is and so what we're looking at here is that response plane and that relationship that we observed in the profiler is really there in the response plane so what's another way of seeing this and this takes us to the contour profiler another thing I love to see so under the red triangle under factor profiling there's also this contour profiler which is a two dimensional way of seeing the three-dimensional response plane and so I'll actually minimize a couple of these just so we can keep our screen somewhat clear there's so much fun stuff to look at so here what we're gonna do I'm gonna pick a particular mileage let's say what do we have to have in terms of horsepower and weight to get 30 miles to the gallon this is the contour notice it's not flat across there's an interaction so you're not going to get straight lines here you have to be somewhere on this curve to get 30 miles to the gallon if you want to get 20 miles to the gallon you have to be along this curve and notice that each variable there's some place or some amount of each that you can get that'll give you 20 miles to
the gallon but you always get 20 if you're along this line so there's a fun thing we can do here I like turning on the contour grid if I do this it'll give me a low value and a high value this is for city mileage and the increments and let's do it in five mile increments and I'll go up to 45 click OK and notice what we get here are for each of these different mileages what we have to have of each and I'll just drag out the axis here now it's gonna look a little strange that you have these curves like this turns out that that's not that strange those are also places where you can get in this case 20 miles to the gallon so these contour lines just like we're looking at topography give you a way of seeing the response plane but in two dimensions and so once you get used to seeing these they become an excellent way of visualizing the three-dimensional relationships now I'll mention under your red triangle for the fit model under save columns you can save a prediction formula and if you do this the prediction formula actually has enough information what it is if I go over to it this is actually I right-click and go to formula this is the formula for predicting observations it's basically the linear regression line that we saw before or I should say the response plane we saw before that prediction formula lets you invoke under the graph menu those profilers directly so there actually is a profiler under the graph menu that only needs a prediction formula when I click it in there's your profiler the one we saw before and so you can profile any formula as long as you have those columns in your data set similarly we can go to the contour profiler I can click that in with my predicted Y score and this is the same contour profiler so these profilers exist for any models you have so it doesn't have to simply come from the fit model platform anytime you have a model that you can profile and you want to understand the surface of similarly the surface plot the same
thing we saw before you can simply access directly from the menu so these are great ways of understanding those multivariate relationships and certainly if you've never seen a contour plot before or even the surface plot they might look a little mysterious but I certainly invite you to play with them because especially the prediction profiler gives you an incredible way of understanding these relationships now again we're not always doing this just for ourselves we want to share these things and we want to make them available to others I always like to mention exporting your fit model output or using file save as on the PC you can save this out as interactive HTML now of course the 3D plots aren't going to be able to be saved as interactive unfortunately those are very difficult to render on a web browser but your profilers are and so this is a profiler that you can share especially if you're publishing a journal article or if you simply need to share it with those people who don't have JMP if you need to explain what an interaction term is in a multiple regression I highly encourage the use of that profiler it's probably the single best way of communicating that result so be aware that that is there so there were a few other things we didn't get to unfortunately we're coming up on the end of time and I want to make sure I stop and take some questions but I'll just mention another way of visualizing high dimensional data and we're not going to be able to talk through how to set it up but just to whet your appetite I'll pull open what a bubble plot is these are looking at SAT scores by year and so these are for the different regions of the country bubble plots basically let you look at changes over time and so they're a great option under graph bubble plot and so we have other webinars which cover some of those things but certainly be aware that there's beautiful visuals in JMP well beyond the ones that are simply shown when you
do analysis output so I'm going to pause here and see if Mia any questions have come in with anything I've shown thus far hey there can you hear me yeah great great job just a quick question on other multivariate procedures for multivariate analysis so in the dimension reduction sort of realm can you just show like PCA absolutely so under multivariate methods the principal components is the second option PCA is very useful when you have a space of variables that share correlations and so I'll take those variables that we used before and of course they're correlated what PCA lets us do is reduce the dimensionality to independent principal components and so what we're looking at very briefly here is the proportion of the total variation in those five variables that gets captured in the first component and the proportion in the second component and the way you might use these there's plenty of options you can see is saving your principal components and let's say I want to save two to the data set and you can see what these literally are are for each row in the table the value on each component and the component is simply a linear combination of the input variables and as I mentioned these are actual independent linear components so if I take them into fit Y by X they actually won't have any relationship straight across zero so independent components of variation now you can also go into factor analysis that's under consumer research factor analysis factor analysis like principal components works on extracting variation and does it on the basis of shared variance and the idea with factor analysis is typically trying to understand or extract the underlying factors that contribute to these observations so not simply reducing the dimensionality but finding something latent in the structure and so lots of different options there and as far as before you go oh yeah where you go out of clustering so I encourage you if you have any questions Julian will address
the question about clustering in a moment please record in either the chat panel or the Q&A panel yep next one and so clustering one of the other ones I was going to show is a slightly different approach so this is taking the rows in your table and finding out which ones you believe are similar on the basis of some criterion and so there's a number of different clustering methods I'll just show one so you get the idea we'll start off with hierarchical clustering and we provide which variables we wish to identify similarity among the rows on the basis of I'll click this into Y I'll click OK the rows in the table now are all here and the dendrogram is showing the closeness based on that iterative clustering and so if I take some number of clusters let's say five and I did that by just dragging this little diamond around I'm gonna color the rows by their cluster and so there's a big group of them here that are similar on the basis of those variables so you might ask you know what is similar on the basis of so in graph builder here's a great way of visualizing that similarity you remember those parallel plots I showed before we can make them in graph builder just by dragging the variables we want to the x-axis and clicking on the final option the parallel coordinates and just like I showed you before each one of these lines is a particular car so that's the Chevy Astro again and we're looking at how high or low each car is on those variables but now if I grab a cluster notice they share the profile and so if I grab one of those clusters here you know you can identify what the attributes are that's clustering that's cool okay so we do have a couple of other questions actually one question is how do you include categorical variables in the multivariate platform good question so multivariate right here in multivariate methods is only appropriate for numeric so it's not going to be appropriate for categorical variables now if there that isn't to
say there aren't analyses in JMP that are multivariate analyses that are suited for categorical outcomes and so under clustering even the hierarchical is a good example that can take any variable type or latent class analysis which is a type of clustering that specifically is about looking at clusters based on categorical outcomes but that said multivariate is not built for that type of analysis now that isn't to say you can't trick it sometimes if you notice in my data set domestic manufacturer even though we were treating it as categories it actually is a numeric variable so if I click on that red triangle and tell JMP even though this is categorical treat that as continuous I can then use it in the output so I'll take some of my variables and click it in so you can trick JMP I don't recommend typically doing that sometimes things will be okay based on sort of symmetries among analyses but I wouldn't recommend doing that and for an ordinal variable so there's a question by good very good moderate there are other correlations that are appropriate under the top triangle correct that's right so you'll notice there is the nonparametric correlations here right that's right alright so I know we're running low on time we've got one more question can you once again very quickly show how to save the model result as an interactive HTML absolutely yeah so I'll just recall what I had done before and I'll produce that factor profiling profiler so on the Mac it's gonna be file export on the PC you'll do file save as once you invoke either of those there's the interactive HTML with data option so when you select that and save it what that saves from JMP is an interactive HTML it actually does contain your data so we always just like to caution you if there are sensitive data that you'd be sharing with others be aware that for this to be interactive the JavaScript here has to access the data and the data is actually a part of this page
source it's cryptic and it's probably not easy for people to recover but it's in there so just be aware if it's very sensitive interactive HTML is not the right method for you great hey can we take this a step further just quickly talk about saving a web report absolutely yeah so you know we had a number of things that we created here even and so what if I wanted to package these together in a single place and so under the View menu there is this create web report option and what this asks is for us to grab whatever outputs we created those are all the reports including multivariate and least squares and what we're gonna do is build a package and I'm gonna tell JMP to stick that package directly on my desktop I'll click Next JMP gives me the option to title these specific things I'll just leave them as the defaults lots of customizations in terms of how you want to do it I'll click build report and what JMP does is assemble all of these outputs together all the interactive HTML versions of them and makes a web report and so I can click through this title slide and everything just like we've seen before is still interactive and so I love these multivariate outputs when they're interactive online so the web report is a really great way of capturing all of your work together and that's just because I kept those windows open and so again under the View menu go to create web report
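Editor's sketch, not part of the webinar: the PCA answer in the Q&A says that each component is a linear combination of the input variables and that saved component scores show no relationship to one another. A minimal two-variable version in plain Python follows. It assumes the analysis runs on the correlation matrix, so both inputs are standardized first; the mileage values are made up and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def correlation(a, b):
    """Pearson correlation of two equal-length lists."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)


def pca_two_variables(a, b):
    """PCA on the 2x2 correlation matrix [[1, r], [r, 1]].

    Its eigenvectors are (1, 1)/sqrt(2) and (1, -1)/sqrt(2), with
    eigenvalues 1 + r and 1 - r. Returns (eigenvalue, scores) pairs,
    largest eigenvalue first.
    """
    za, zb = standardize(a), standardize(b)
    r = correlation(a, b)
    sum_scores = [(x + y) / sqrt(2) for x, y in zip(za, zb)]
    diff_scores = [(x - y) / sqrt(2) for x, y in zip(za, zb)]
    components = [(1 + r, sum_scores), (1 - r, diff_scores)]
    components.sort(key=lambda pair: pair[0], reverse=True)
    return components


# Made-up city and highway mileage for eight hypothetical cars.
city = [17.0, 19.0, 22.0, 25.0, 29.0, 31.0, 18.0, 24.0]
highway = [25.0, 26.0, 30.0, 33.0, 37.0, 40.0, 24.0, 31.0]

(eig1, pc1), (eig2, pc2) = pca_two_variables(city, highway)
share1 = eig1 / (eig1 + eig2)  # proportion of total variation in component 1
between = correlation(pc1, pc2)  # should be zero up to rounding
print("proportion of variation:", round(share1, 3), round(1 - share1, 3))
print("correlation between component scores:", round(between, 6))
```

The check here is for zero correlation between the two sets of scores, which is what a flat line in fit Y by X would show. With more than two variables the eigenvectors no longer have this closed form and a numerical eigen decomposition would be needed.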
Info
Channel: Julian Parris
Views: 53,492
Rating: 4.9637189 out of 5
Id: bQWCgJCea20
Length: 63min 47sec (3827 seconds)
Published: Wed Dec 06 2017