Data Preparation and Modeling 08Nov2016

Video Statistics and Information

Captions
So today we're going to talk about data preparation and modeling with JMP, and I'll primarily be using JMP Pro for this webinar. We'll see a variety of tools involved in preparing data for modeling, so we'll talk about some of the key activities that take place, some of the things we need to do when we're preparing data for modeling, and we'll see the tools in JMP for data preparation. We'll use an example for this, the Equity data, which can be found in the sample data directory; the sample data in JMP is under Help, and the Sample Data Library contains many useful data sets.

We'll see many tools for predictive modeling in JMP. We'll see how to use Fit Model for building least squares regression models and also for stepwise regression and logistic modeling. We'll use the Partition platform for classification and regression trees. We'll see how to build neural networks in JMP, and Generalized Regression, a relatively new platform that is only available in JMP Pro, provides a number of modeling options including maximum likelihood and penalized regression methods, so we'll see how to use Generalized Regression for building predictive models. We'll also see how to use the built-in Model Comparison platform for comparing competing models, and then in JMP 13 Pro there's a new feature, the Formula Depot, that allows you to collect the formulas for the different models that you've built, compare those models, and deploy those models using different types of scoring code. If time permits, we'll see how to use the new Text Explorer platform for exploring unstructured data and preparing it for modeling.

For everything that we're going to cover today, I want to point out that we have a variety of resources available on our academic community, or academic website, at jmp.com/teach, so I'll quickly point these out. At the bottom of our academic website there are links to all the resources that our group maintains and has collected over the years. Under Learning JMP you'll see links to getting-started videos; if you're brand new to JMP, I'd recommend that you watch some of these videos because they're really useful in helping with basic navigation and use of JMP. There's a variety of additional webinars being provided this fall, and we're organizing our webinars for the spring semester. The Learning Library (the short link is jmp.com/learn) contains short guides that help answer the question "how do I do X in JMP?" So for example, if you want to build a classification tree or a regression tree, or use clustering, each one of these guides is one page: it describes a data set to use and step-by-step keystrokes within JMP, and it also provides help with interpreting the results. For each of these you'll also see a short one-to-three-minute video. The Case Study Library provides some really nice comprehensive case studies on a variety of topics, including case studies in multiple and logistic regression, and there's also a series of case studies for analytics and predictive modeling. Finally, I'll point out Books with JMP: there are several books that integrate JMP in the field of predictive modeling and analytics.

So with that I'll return to the journal; I am using a journal throughout this webinar. A couple more links that I'll point out: under the Help menu you'll find JMP Help, which is searchable documentation grouped into books, and most of what I'll be
covering today you can find under Fitting Linear Models or Predictive and Specialized Modeling, and I will introduce a few multivariate techniques as well. Back in the Help menu, one more link that I'll point out is Books: if you'd like PDF versions of the documentation and you'd like to search locally, the books all ship with the software. And finally there's a JMP User Community that has links to additional information and resources, and I'll talk about a couple of these during this webinar.

So let's get started with data preparation. Data preparation is one of the most time-consuming aspects of predictive modeling or analytics. There are a number of things that have to be done to prepare data for modeling, so I'll highlight some of these here. We need to pull our data together: compile, combine, and structure data so it's in an appropriate format for analysis. We need to explore our data: understand what's in the data set, what's missing, and what additional information we need. We need to assess the quality of the data: are there missing values, do we have a lot of outliers, do we have unruly data, do we have a lot of text data or categorical variables with many categories that need to be addressed. We need to clean and transform our data, and we may need to define new features in our data, which I'll talk about in a moment. We may have data with a lot of columns, so-called wide data, and we may need to employ tools to reduce the dimensionality of our data set, particularly if many of our columns are correlated with one another or are redundant. And finally, in predictive modeling we're interested in developing models that have very high accuracy, or a very low error rate, so a general technique is to train our model on one subset of our data, check the model on a second subset that we often call the validation data, and then, for a selected model, see how well it works on data that wasn't used in building the model. We call these subsets training, validation, and test, and we call the general approach model validation. We'll see how to do this in JMP in a moment.

I'll provide a highlight of the tools for data preparation in JMP, and then we'll use an example, the Equity data, to illustrate. For compiling and structuring data there are a lot of really nice tools in JMP for pulling data in and then massaging it so it's in a structure that is useful for analysis and modeling, so I'll point some of these out. Under File > Open we can open data in a variety of different formats. This includes an Excel Import Wizard that allows you to look at your data first and adjust some parameters to make sure the data comes in in a format that is easy to model or use in analysis. You can also import text data, SAS data sets, and data sets in a variety of other formats. Under Database there's a relatively new Query Builder that allows you to write a SQL query directly from JMP, so you can connect to a database and pull data directly into JMP that way. You can also use connectivity to SAS to browse data and pull SAS data directly into JMP, or you can write a SAS query to pull a subset of a SAS data set directly into JMP. And you can pull data in directly from the internet.
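(As an aside, the same compiling step outside of JMP might be sketched in Python with pandas; the file names and the connection string below are hypothetical examples, not files from this webinar.)

```python
# A minimal sketch (not JMP) of pulling data together from several sources
# with Python/pandas. File names and the database are hypothetical.
import pandas as pd
import sqlite3

excel_df = pd.read_excel("loans.xlsx", sheet_name=0)   # spreadsheet import
text_df  = pd.read_csv("loans.csv")                    # delimited text import
sas_df   = pd.read_sas("loans.sas7bdat")               # SAS data set import

# Pulling a subset of a table with a SQL query, analogous to Query Builder
with sqlite3.connect("loans.db") as conn:
    db_df = pd.read_sql("SELECT * FROM loans WHERE amount > 10000", conn)
```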
Now, if you have data directly in JMP, there's a new feature under the Tables menu, and I should mention that I'm using JMP 13 Pro and doing this webinar on a Mac; JMP runs the same on Mac and Windows. JMP 13 came out in September, and there are a few new features that I'll highlight today. One of those new features is the JMP Query Builder. If you've got several JMP data tables that all have information you'd like to be able to access from one data set, this allows you to write a query directly in JMP to pull data from those different data sets without opening them. It's a really nice, useful tool.

Under Tables you'll also see Sort, Stack, Split, and more; I don't have time to go into these today, but they are all useful for restructuring your data. If your data comes in with values in separate columns, you may need to stack those values so they're all in one column, or you may need to transpose your data; the features under Tables allow you to rearrange your data so it's in a format appropriate for analysis.

The final tool I'll talk about under compiling and restructuring data is a feature called virtual join. JMP generally assumes that all of your data is in one data set. Virtual join (I'll go ahead and open up a couple of data sets) allows you to access data that's stored in another file from one file that's linked to that other file. In this example I've got some United States state information, and this data set has a column called Region, but there's a second file, and that file has populations for each of the regions. If I'd like to do an analysis where I'm using the information in the US state abbreviations file but I also want to access the population file, then instead of joining these two data tables together, I'll right-click on Region in the population file and select Link ID; this defines Region as the link variable. Then, within the file that I'm interested in working in, I'll right-click and select Link Reference and point to that file. Now Region has an icon that tells us it is linked to the Region column in the other file, and I'm able to pull in the information from that other file without having it open. If I'm interested in doing an analysis, I can select the variables and it will pull in those variables from the other file. This is a really nice way of working with data where several components reside in other files, so we don't have to build one very large data set to encompass all of this information. We call this virtual join.
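(The same idea outside of JMP, looking up values from a second table by a shared key instead of maintaining one giant table, is an ordinary keyed join; a pandas sketch with made-up state and region values:)

```python
import pandas as pd

# Main table: one row per state, with a Region key (values are illustrative)
states = pd.DataFrame({
    "State": ["MA", "NY", "TX", "CA"],
    "Region": ["Northeast", "Northeast", "South", "West"],
})

# Lookup table: one row per region, like the linked population file
region_pop = pd.DataFrame({
    "Region": ["Northeast", "South", "West"],
    "RegionPopulation": [56_000_000, 128_000_000, 78_000_000],  # illustrative
})

# A left join on the Region key pulls the population in for each state
linked = states.merge(region_pop, on="Region", how="left")
print(linked)
```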
Now, in data exploration we use a lot of the tools that we would typically use in any analysis when we're getting to know our data. Our goal here is really to understand what's in the data set and what kinds of problems we might have: are we missing a lot of values, do we have a lot of skewed distributions, do we have categorical variables with a lot of levels? We'll use the built-in tools for data visualization to get a better understanding of the quality of our data set. In the example that follows I'll use Columns Viewer, which gives us a high-level summary of all of the variables in our data set; Distribution, the first option under the Analyze menu, which shows the univariate distributions for all of our variables, all linked together, so it gives us a feel for potential bivariate relationships as well as the shape, spread, and centering of the distributions; and Graph Builder, which allows us to look at several variables at a time and is also a really nice launching point if we realize that we need to transform variables. Again, we'll see this in an example in a moment.

Deriving new variables can include transforming variables and recoding data. If we have categorical data that is messy, with trailing spaces or capitalization issues, Recode allows us to clean it up. We may have continuous variables with messy or poorly behaved distributions, so we can use the Make Binning Formula utility, or we can use the formula editor to create a formula to bin continuous variables. I'll also point out, although we won't show it, that there is an add-in on our JMP community called Interactive Binning that allows you to bin continuous data into buckets to make it a little more manageable for analysis. We'll see a few of these features as we go along.

There are some nice tools for dealing with missing values. Under Analyze > Screening we've added a few of these: there's a really nice utility for exploring outliers and another for exploring missing values. To get a feel for the pattern of missingness, under Tables we'll use Missing Data Pattern. There's also the Informative Missing column property, and again we can use Recode to deal with missing values. There are nice tools for dealing with outliers, including Explore Outliers and some robust methods. And although I won't have a lot of time to talk about multivariate procedures, there are lots of nice built-in tools for reducing the dimensionality of a data set: the Predictor Screening option uses bootstrap forests to take a really wide data set and narrow the list of potential predictors for modeling down to a subset of the larger set, and principal components and clustering are available, both commonly used dimension-reduction procedures.

So let's go right into an example. These data are the Equity data, and we're going to take a look at the data and see some of these built-in tools for dealing with data quality issues that we identify in the data set. Again, this is data from the sample data directory under the Help menu. The data table has 14 columns, and, making the window a little bit longer, we have almost 6,000 rows. Any time you see a dot in JMP for a continuous variable, it tells us that we're missing values. And if you're relatively new to JMP, the icon next to the variable name in the Columns panel tells us what type of data we're dealing with: red bars mean categorical data, blue triangles mean continuous data. As I hold my mouse on top of these variables, a little note pops up. The column information for these variables (in fact, I'll right-click on this variable and select Column Info) is where we tell JMP how to handle the variable for analysis, and we can also add notes; the note here was added to give us a little information about this particular variable.

While we're here, I'll mention there are a number of column properties that can be set to aid in preparing data, and I'll point out a couple of these. Missing Value Codes is important: if you pull in data from another source and, say, 999 is used to code data as missing, then you can assign a missing value code here. If you're dealing with categorical data that comes in, for example, as 0/1 or 0/1/2/3 and you'd like to show labels instead, you can add Value Labels.
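(The equivalent cleanup outside of JMP is just recoding sentinel values to missing and mapping codes to labels; a small pandas sketch, with a hypothetical column coded with 999 for missing and a 0/1 outcome:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"INCOME": [42000, 999, 38500], "BAD": [0, 1, 0]})

# Missing value code: treat the sentinel 999 as missing
df["INCOME"] = df["INCOME"].replace(999, np.nan)

# Value labels: show "Good Risk"/"Bad Risk" instead of 0/1
df["BAD"] = df["BAD"].map({0: "Good Risk", 1: "Bad Risk"})
print(df)
```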
Value Ordering is useful if, for example, you'd like to change the order in which a categorical variable, or its labels, displays on a graph. For example, if I pull in data with the values small, medium, and large, the data are going to plot in alphanumeric order, so we can use Value Ordering so that the data are actually ordered small, medium, large, which is what we'd like. Anyway, these column properties are there, and there are a few others; I'll talk about Informative Missing in a few moments, which is used to tell JMP how to handle missing data when I'm building predictive models. I'll hit Cancel here.

To get a first peek at these data and understand what kinds of problems we might have, I'll go to Cols and then Columns Viewer, which allows us to get familiar with our data. I'll select every variable in the data set except Validation, and I'd like to see the quartiles, so I'll select Show Quartiles and then Show Summary, and I'll hit Clear Select to deselect the variables. This shows me basic summary statistics for all of the variables in the data set. I can see the number of observations and the number missing, and notice that some of these are missing quite a few values. For categorical data I see the number of categories, so Bad has two categories, Reason is 2, Job is 6.

If I take a step back and look at the data, here's a little background: in these data we're looking at a loan, and the loan can be a good or bad risk. The data are actually coded as 0/1, and this little asterisk tells us that we've used value labels, where 0 is coded as Good Risk and 1 is coded as Bad Risk. We're interested in building a predictive model to predict bad-risk loans, and we've got a number of pieces of information for each customer: some categorical variables and several continuous variables. We've got the minimum and maximum values, which is useful for understanding the range; for example, Value is the value of the loan, with a minimum of 8,000 and a maximum of 855,000, so we've got quite a range there. If I look at the median and the mean, I'm generally interested in whether there's a big difference between them, which might indicate skewness. For some of these variables I see that there are a lot of zeros; if I look at Delinquent, for example, the median is 0, the lower quartile is 0, the upper quartile is 0, and the interquartile range is 0. This tells us that at least 75% of these values are 0, and the maximum value is 15, so this is likely to be count data; in fact, any one of these that starts at 0 like this is likely a count variable.
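(For reference, the same kind of high-level summary, n, number missing, number of levels, and quartiles, can be sketched outside of JMP with pandas; df is assumed to be a data frame holding the loan data:)

```python
import pandas as pd

# df is assumed to hold the loan data
summary = pd.DataFrame({
    "n": df.count(),                 # non-missing observations per column
    "n_missing": df.isna().sum(),    # missing values per column
    "n_levels": df.nunique(),        # number of distinct values / categories
})
print(summary)

# Quartiles, min, max, and mean for the continuous columns
print(df.describe())
```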
Now, if I'm interested in taking this a step further, I can select the variables of interest here and select Distribution, or I can launch Distribution directly from the Analyze menu, and I'll go ahead and do that here: Analyze > Distribution. Selecting variables in JMP is a matter of clicking and dragging, or selecting one variable, holding down the Shift key, and clicking another to select all of the variables in between. I'll click OK. By default the view is vertical; I'll use Stack to convert it to a horizontal layout. I can see that about 20 percent of the customers were a bad risk, and I can see the shapes of the distributions. For continuous variables I'm interested in whether I've got a lot of outliers and in the shapes of the distributions, and whether I've got missing values. For the categorical data I'm interested in the number of categories and whether some categories are relatively small; I can see that Sales and Self have relatively few values compared to Other and some of the other job categories. For all of these I'm checking whether the distributions make sense: if this is years on the job, it makes sense that it starts at 0 with a tail that quickly declines. For things like Derogatory and Delinquent, I can see that I basically get data in buckets, with a few values at 2, 3, 4, 5, but most of the values are at 0; in fact at least 75% of the values are 0, and the same is true for Delinquent. By doing this I'm getting a feel for what I might need to address in order to model effectively with these data.

So let's talk about some of the things I might do with this data set. I'll start with the variable Job. If I go back and look at the distribution of Job, I can see that Job has several categories, it's missing 279 values, and some of these categories are relatively small in terms of counts. One way of addressing the missing values and also dealing with the low counts is to use Recode. With Job selected, I'll go to Cols and then Recode, and under the red triangle we'll see lots of options for cleaning up potential data quality issues: convert to title case, upper case, or lower case, trim white space, and so on. I can group similar values; in this case I might assign "Missing" where there are no values, or I might group two values together, so I'll select Sales and Self, group them, and rename the group Sales/Self. With relatively low counts I may not need to do this if I were really modeling with these data, but for illustration this is a nice way of combining buckets of data. Now, under the Done button there are some options. If I click In Place, it will replace the data with the recoded data. What I generally like to do is use Formula Column, which gives us a new column with the data recoded while storing the logic for recoding the data. This is the formula editor, and basically it has set up a conditional If, or Match, statement: if Job is missing, call it Missing; if Job is Sales or Self, combine these into Sales/Self; otherwise populate the new column with whatever was in the Job column. Recode is really nice for cleaning up categorical data.

I can also bin values. For example, I'll go ahead and create a distribution of Derogatory. Binning values is a matter of converting this to discrete buckets, so I might have a bucket that is zero, a bucket that's one or more, and, for the missing values, another bucket called Missing. To do this I'll use an If statement: I'll create a new column by double-clicking, then right-click and select Formula, and from the formula editor I'll use the functions on the side and select Conditional > If, which populates the formula with some expressions. If DEROG, and then I'll use a comparison, is less than or equal to zero, then in the Then clause, which is what I'm going to group it into, I'll call this "None". With the None argument highlighted, I'll click the caret and insert another argument: for the second expression, again if DEROG, this time is greater than 0, again in double quotes I'll call this "One or more". And for the last one I'm going to add one more argument: again DEROG, and now I'll use Is Missing, so if Derogatory is missing I'm simply going to call it "Missing". This is a really nice way of binning continuous data with a really unruly distribution into discrete buckets. There is a Make Binning Formula utility under Cols > Utilities, but I like the ability to use the formula editor to create these manually.
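(The same three-way binning logic can be written outside of JMP as a conditional; a sketch with numpy's select, assuming DEROG is a numeric column in a data frame df with NaN for missing values:)

```python
import numpy as np
import pandas as pd

# DEROG is assumed to be a column of counts in df, with NaN for missing values
conditions = [
    df["DEROG"].isna(),       # missing values get their own bucket
    df["DEROG"] <= 0,         # zero derogatory reports
    df["DEROG"] > 0,          # one or more derogatory reports
]
labels = ["Missing", "None", "One or more"]

df["DEROG Binned"] = np.select(conditions, labels)
```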
So what are some other things we might do? We might take a variable like Value and decide that we need to transform it. If I go to Graph > Graph Builder, I'll look at the distribution graphically this time instead of using Distribution, and I'll just look at the distribution of Value. I'll use the little grabber tool to get a better understanding of the shape of the distribution, and I can see that it's right-skewed with some relatively extreme values. If I decide that I need to use this in a model and it needs to be transformed toward normality, then within any analysis platform, and also within the data table, you can right-click on a column and select a transformation; there are different distributional transformations and different random functions. Under Transform I'll select Log, which creates a temporary variable at the bottom of the list, but this variable is active. If I want to explore it to see whether it does a good job of normalizing my data, I'll drag it and drop it on top of the x-axis, and it replaces the original variable; now the distribution looks much more normally distributed. If I want to retain this variable, since it's temporary and not actually in my data set yet, I'll right-click and select Add to Data Table, and back in our data table you'll see a new column. Any time in JMP you see a plus sign next to a column, that indicates a stored formula, so JMP basically just wrote this formula for us; we could have done it manually using one of the transcendental functions if we know what transformation is required.

We can also do this directly from the data table. If I right-click on a column and select New Formula Column, this allows me to apply all sorts of transformations. We can also select multiple columns; say, for example, we want a ratio or percentage. If I select two columns and right-click, New Formula Column now gives us additional options, so I might want the sum, the difference, a ratio, or an average, or I might want to aggregate these in some way. So this is really useful for preparing data and creating new derived variables.
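(A comparable step outside of JMP is simply writing the derived columns yourself; a short numpy/pandas sketch, assuming a data frame df where VALUE, LOAN, and MORTDUE are illustrative column names:)

```python
import numpy as np
import pandas as pd

# df is assumed to hold the loan data; LOAN and MORTDUE are illustrative columns
df["Log Value"] = np.log(df["VALUE"])              # log transform of a skewed column
df["Loan to Value"] = df["LOAN"] / df["VALUE"]     # ratio of two columns
df["Debt Total"] = df["MORTDUE"] + df["LOAN"]      # sum of two columns
```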
For missing data in a data set like this, I might explore whether the data are missing at random or whether something systematic is going on. A really nice feature for doing this is called Missing Data Pattern, which gives us a snapshot of the nature of the missingness in our data set. I'll select all of the variables and add the columns, and the resulting table gives us a summary of the columns that are missing values. In this first row I see 0 0 0 0; wherever there's a 1, it means the data is missing for that particular variable. So only 3,364 rows are not missing any values, 883 are missing just the debt-to-income ratio, and some of these, we see, are missing a lot of values; there are 19 rows that are missing all of the variables. At the end there are a couple of options, and I won't go too far into these, but they allow you to visualize the nature of your missingness: anywhere you see red, we're missing a lot of values, and in this box here and this box here we're missing zero or only one value. This gives us a really good feel for whether we're missing data across all of our variables or just within particular variables. I'll go ahead and close this out.

Now, there are different ways of dealing with missingness. Under Analyze > Screening there's an Explore Missing Values utility; if I select just the continuous variables here (I'll go ahead and select all of them, and it will bump a couple of those out), this gives us some options for exploring and imputing missing values. I can also impute values from the Multivariate platform. If I know that I'm missing values and I'd like JMP to include this missingness in my analysis, then I can use the Informative Missing column property, and most of the modeling platforms in JMP also have the ability to do this. For example, if I take Value and go to the Column Info window for Value, there's a column property toward the bottom called Informative Missing. This tells JMP, any time I fit a model, to impute the mean for the missing values and also to add a second column that indicates whether the value was missing or not. We can do this directly within a modeling platform, but we can also do it for particular variables if we know there are issues. I'll hit Cancel here.

The last thing I want to talk about is how to make a validation column, as if I were going to build a predictive model for Bad as a function of these other variables, or of the cleaned variables. I'll go ahead and open the clean data set; by the way, we'll make this journal available, and these data sets are embedded in the journal, so you can click on the links to open them. In this clean data set I've got the original variables plus a lot of the transformations and cleanup work that I've done. Let's say I want to create a validation column because I'm going to use these data for modeling. I'd like to partition the data into subsets, and the best way to do this is to go to Analyze > Predictive Modeling > Make Validation Column. (If you're in JMP 12, this option is under the Cols menu, under Modeling Utilities; this is a JMP Pro feature.) This lets me tell JMP what percentage of my values I'll use to train, or build, my models, and I'll put 60% of my data toward training; what percent I'll use to validate, or check, the model (in many modeling platforms this is also used to stop model growth); and what percent of my data isn't used in the model-building process at all but is held out to check the model afterward, to see how well it performs on data not used in modeling. So I'm going to partition this data set into three sets, and I'll call the column Validation. There are different ways to do this: you can partition the data completely at random, or use a fixed random scheme with a random seed, and if you're teaching or you want repeatability, a random seed is useful. For example, if I plug in a random seed of 123 and somebody else creates a validation column using the same data and the same feature, they get the same partitioning of the data. In cases where we have an unbalanced response, and here we've got 20% in our target category, stratified random is useful, and you can stratify on your response but also on predictors; if I select Bad here, it will sample good and bad evenly into each of the training, validation, and test sets. I'll go ahead and select just Fixed Random here. The way JMP Pro handles validation is through this column: JMP recognizes that wherever the value is Training, the corresponding observations are used to train the model; Validation is used to check the model; and Test is held completely out of the model.
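(For comparison, a seeded, stratified, three-way partition outside of JMP might be sketched with scikit-learn as below. The 60% training share and seed 123 follow the example above; splitting the remainder 20/20 between validation and test is an assumption.)

```python
from sklearn.model_selection import train_test_split

# df is assumed to hold the cleaned loan data with a "BAD" target column.
# 60% training; the remaining 40% is split evenly into validation and test
# (the 20/20 allocation is an assumption; only the 60% is from the example).
train_df, holdout_df = train_test_split(
    df, train_size=0.60, stratify=df["BAD"], random_state=123)
valid_df, test_df = train_test_split(
    holdout_df, train_size=0.50, stratify=holdout_df["BAD"], random_state=123)

print(len(train_df), len(valid_df), len(test_df))
```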
I'm going to close this data set, and let's talk about predictive modeling. There's a lot of work involved in preparing data for modeling; in fact it's one of the most time-consuming aspects of modeling, but we're going to skip ahead and talk about some of the predictive modeling tools. I'll start with a data set that has a continuous response. This is data that I downloaded from the City of Boston website: assessment values for homes in a neighborhood in Boston, East Boston. I'm interested in looking at the total value of these homes as a function of a lot of characteristics, and this is data that I've already cleaned up; I've done a little bit of recoding of building style and some of the other variables to make them easier to work with. What I'm interested in doing is predicting total value, so let's take a look at this data.

I'm going to start by graphing my data; any time we do any sort of predictive modeling we want to start by getting a really good firsthand picture of our data. This data set has latitude and longitude, so by dragging those variables on, it plots my points, and each point is a house, plotted geographically. I'm going to right-click and add a background map, and if I select Street Map Service, this connects to a server that can draw the latitude and longitude on a map, which lets me zoom in and look at the street-level view. Each point is a home, and I can hold my mouse over a point and see the value of the home and its characteristics. I'd like to develop a predictive model for property values in this neighborhood, and there are a few other things I can do here; for example, I can color the homes by total value, and I can see that the homes in this part of the neighborhood are a little more expensive than some of the homes in other areas. I could drag and drop and explore different variables to see which might be indicative of home prices. This gives me a feel for the neighborhood; in fact, this airport here is Boston Logan International Airport. Let me close this.

If I take a look at the predictor variables (I'll go ahead and stack several of them), we can see lot square footage, with some large lots and some small lots; year built, with some very old homes and some that are not so old; living area; and floors. These are all characteristics of the homes in the neighborhood.
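(If you were doing the same first look outside of JMP, a geographic scatter colored by the response is a few lines of matplotlib; a sketch where the column names Latitude, Longitude, and TotalValue are assumed:)

```python
import matplotlib.pyplot as plt

# df is assumed to hold the assessment data with coordinate and value columns
plt.scatter(df["Longitude"], df["Latitude"],
            c=df["TotalValue"], cmap="viridis", s=10)
plt.colorbar(label="Total value")   # color encodes the response
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()
```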
As I scroll through, Remodel and Building Style have a lot of categories, so I might want to combine some of these; the same with Roof Type and Finish Type, where we see some categories that are very low in terms of counts. As I scroll through, there's nothing indicating the geographic location of the properties; the only thing in the data set that does this is latitude and longitude. So if I look at latitude and longitude (I'm going to stack this again and recreate that geographic map; I'll just run this little script), in building this model it might make sense to create variables that tell us something about location. For example, if I click and drag to highlight some of these values, I can see that there really are some distinct geographic areas within this neighborhood. One way of dealing with this is to apply a multivariate procedure called clustering, or hierarchical clustering, so I'm going to create a derived feature using clustering. I'll go to the clustering platform under the multivariate methods, choose hierarchical clustering, and simply use latitude and longitude. What this does is group homes that are close together, and the longer the lines in the dendrogram, the further apart the clusters are from one another. If I click and drag, we start where we've got individual homes, and if you keep an eye on the map in the background, notice that it's grouping these homes into sub-neighborhoods. I can use this to define a geographic, or location, effect; for example, if I stop here where I've got five clusters, I can see that there are five regions in the data, and if I save this out to the data table under Save Clusters, it becomes a variable that I can use for predictive modeling. I've already done that in the background.
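(The same location feature can be derived outside of JMP with any hierarchical clustering routine; a scikit-learn sketch with five clusters as in the example, where Ward linkage and the column names are assumptions:)

```python
from sklearn.cluster import AgglomerativeClustering

# df is assumed to hold the assessment data with Latitude/Longitude columns
coords = df[["Latitude", "Longitude"]]

# Group nearby homes into five sub-neighborhoods and store the label as a feature
hc = AgglomerativeClustering(n_clusters=5, linkage="ward")
df["Cluster"] = hc.fit_predict(coords)
```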
Now let's go ahead and talk about modeling. There are other things I might do first; for example, I might look at different variables, see what's important, and do some transformations, but I'm going to skip over that. I would probably also look at the nature of the correlation between variables; this is under Multivariate, and I can see that some of the variables are highly correlated with one another. This is all work I would do before doing any sort of modeling with these data.

Let's go straight into building a regression model. Analyze > Fit Model is our multi-purpose modeling platform. I'm modeling Total Value as a function of all of these predictors; I've got 17 predictors, and I'll include all of them in the model. I could also include interactions: if I select all of the continuous variables and choose Macros > Factorial to Degree, this adds the two-way interactions for all of those continuous variables. Since I divided my data into training, validation, and test sets earlier, I'll add Validation here, so the model is built on the rows marked Training in the validation column, and JMP will report validation statistics for the validation set. I'm going to add one more variable here: Cluster. From this platform the default personality is standard least squares regression; we can also do stepwise regression for model selection, and generalized regression, which we'll see in a few moments. For now I'll go ahead and click Run.

The first thing you see is an Actual by Predicted plot; you can think of this as a residual-style plot where the blue line is the overall mean, the values our model predicts run along the bottom, and the values we actually observed are on the y-axis. We've got one value way up here that looks like it might be an outlier, but this gives us an overall feel for the fit of the model. There's an Effect Summary table with p-values for all of the terms in the model, which makes it easy to reduce the model: if I select terms at the bottom that are not significant and hit Remove, and continue slowly removing terms, this is equivalent to backward stepwise selection, and as I reduce the model all of the displayed statistics update automatically. I'll tuck some of these other reports away and remove all the terms that are not significant. I could also use the Stepwise personality to do model selection instead of doing this manually, and normally I would be looking at residuals and other diagnostics as well, but for the sake of time I'll just slowly reduce the model by removing things that are not significant. I'm not removing any of the terms that have the caret, because the caret tells us the term is involved in an interaction higher up in the list, and anything involved in an interaction needs to stay in the model. I'll stop there. What I've just done is build a model, and we can see the things that are most significant: lot square footage, and here's that Cluster variable, so the location effect is really important in predicting total value, plus things like finish, whether it's been remodeled, whether it has AC, and half baths.

From here we can explore this model. By default in JMP 13 you'll see the Prediction Profiler, and any time you fit a model in JMP there are a lot of options tucked away under the top red triangle. Under Regression Reports you can ask for the ANOVA table or the high-level Summary of Fit. Under Estimates, Show Prediction Expression shows the formula for the model. For categorical variables JMP uses effect coding, the -1/+1 coding scheme; if you select Indicator Parameterization Estimates, the estimates use 0/1 dummy, or indicator, coding, which matches what you would get in SAS. The red-triangle options provide a lot more: the Profiler is one of the options under Factor Profiling; if you're interested in model diagnostics, Save Columns has options for residuals and studentized residuals; you can save the formula out to the data table; and you can get confidence or prediction intervals. A new feature in JMP 13 is Publish Prediction Formula, which publishes the formula out (I'll go ahead and select it now) to what we call the Formula Depot. The Formula Depot allows you to collect all of your models, and from there you can work with the models or write scoring code in C, Python, or JavaScript.
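(The backward-removal step described above, refitting and dropping the least significant term until everything remaining is significant, can be sketched outside of JMP with statsmodels. This simplified loop ignores the interaction-hierarchy rule enforced by the caret, and assumes a numeric predictor matrix X and response y:)

```python
import statsmodels.api as sm

# X is assumed to be a DataFrame of numeric predictors, y the Total Value response
X = sm.add_constant(X)

# Repeatedly drop the least significant predictor until all p-values are <= 0.05
while True:
    fit = sm.OLS(y, X).fit()
    pvals = fit.pvalues.drop("const")     # never drop the intercept
    worst = pvals.idxmax()
    if pvals[worst] <= 0.05:
        break
    X = X.drop(columns=[worst])

print(fit.summary())
```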
Any one of these will write out the code; in this case I selected C, so let's write out the code to be able to deploy this model throughout the organization. I'll go ahead and close that, and I'll close the Formula Depot for now.

What the Profiler does is allow us to explore our coefficients. The steeper the slope, the more significant a term is; an increasing slope tells us that the predicted response increases as we increase that particular predictor. So if I change lot square footage from the low value to the high value and don't change anything else, we see how the predicted response changes, and this value here is a confidence interval for the response. Some of these slopes are relatively steep, and some, like gross area, are relatively flat, but notice that as I change gross area, the slope for living area increases; this is an indication of an interaction between gross area and living area. Since we added a validation column, JMP is producing statistics for the training set, the set used to build the model (RASE is the same thing as root mean square error), and we also see values for the validation set. The training set has an R-square of 0.8394, and we hope the statistics for the validation set are fairly close; if the validation R-square is substantially lower, that's an indication that we've overfit the model. We'd also like the root mean square error, or RASE, to be very similar. The model was built using just the training set, and it's then applied to the validation set to see how well it actually works.

Now, under the red triangle for the Prediction Profiler (and by the way, the Profiler is available from any modeling platform in JMP), if you've got a big model like this with a lot of terms, it's difficult to see which variables are most important when you control for all the others. A very nice tool here is Assess Variable Importance, which runs a simulation that isolates the effects of each of the individual terms and produces marginal model plots. We can see that, independent of all the other variables, as lot square footage increases so does total value, and the terms are sorted in order of importance: the most important is lot square footage, followed by rooms and gross area, and Cluster actually drops down the list. Also from this red triangle, next to the Profiler, is a Simulator that lets you do Monte Carlo simulations. I'll save this formula out to the data table, and any time you save the formula out you'll see new columns appear. When I right-click on the formula, it's basically writing out all of the parameter estimates in the model; we didn't actually look at the estimates, but it's writing out the formula for our linear model, and if we add new rows to the data table it will make predictions for those new rows. So that was a quick peek at multiple regression.
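(The overfitting check described here, comparing fit statistics on the training rows against the held-back validation rows, looks roughly like this outside of JMP. A sketch: `model` is assumed to be any fitted regressor with a predict method, and the train/validation splits are assumed to come from the validation-column partition.)

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# X_train/y_train and X_valid/y_valid are assumed to come from the earlier split
for name, X_, y_ in [("Training", X_train, y_train), ("Validation", X_valid, y_valid)]:
    pred = model.predict(X_)
    rmse = np.sqrt(mean_squared_error(y_, pred))   # comparable to JMP's RASE
    print(f"{name}: R² = {r2_score(y_, pred):.3f}, RMSE = {rmse:.0f}")
```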
Let's talk about some of the other modeling tools. Under Analyze > Predictive Modeling you can see Neural and Partition; Partition is classification and regression trees. Let's take a quick look at Partition, and I'll set this up the same way: Total Value as the Y, all of these predictors plus Cluster, and I'll add Validation. From Method we can select Decision Tree and a few other options that we might want to explore for this data set. A Decision Tree builds a tree by doing a series of binary splits, and we'll do this in a moment. Bootstrap Forest also builds trees: a number of small trees that are combined to make a forest. Boosted Tree also builds trees, but it builds a series of individual trees one on top of the other. We'll talk about these very quickly as we go along.

I'm going to click OK, and the graph at the top tells us the overall average total value, with each point simply scattered relative to that mean: this value is 526 thousand, this one is 509 thousand, and the values below the line are less than the mean. We get starting values, and we see that the overall mean is 241 with a standard deviation of 77. Since we have a continuous response, tree methods will fit a regression tree (with a categorical response we would fit a classification tree). They look for variables such that, if we split on that variable to form two subsets, those subsets have means that are as far apart as possible. It looks at every variable, and every value of every variable, to find a cut point that forms two subsets whose means are as far apart as possible; with categorical data we'd be looking at probabilities or proportions rather than means. So it sorts through everything and computes a p-value for every place it might possibly split the data, and instead of reporting a p-value it reports the LogWorth statistic; the higher the LogWorth, the lower the p-value. It reports the LogWorth and a cut point, and wherever you see the little asterisk, that's the variable and cut point that will be the first split.

If I click Split, the data are split. The relative width here for lot square footage tells us that there are a lot more values less than 3,450 than greater, and we also see lines drawn at the means for each of those subsets. We're basically building a logic statement: if lot square footage is less than 3,450 then the mean is 203,000; if it's greater than or equal to that, it's 314,000. The process continues: with two candidate nodes, it looks through all the possible splits for this node and all the possible splits for that node, and the second split is on Year Built. For a relatively small lot built before 1991 the average price is 199,000; otherwise it's 320,000. And since I have validation, if I click Go, JMP automates this process: it keeps splitting and splitting, tracking the validation R-square, which is this red line, and it tries to find the point where the validation R-square stops improving. It found that point here at 22 splits; it splits an additional 10 times after that, and since the validation R-square doesn't improve any further it prunes back to this point, so our final model has 22 splits. Let's see what that looks like: if I click on the red triangle we can see the splits up above, and a Small Tree View gives a summary of the splits that were made, but with 22 splits it's a little bit difficult to see.
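(A rough outside-of-JMP analogue of "keep splitting while the validation R-square improves" is to grow trees of increasing size and keep the one that scores best on the validation rows. A scikit-learn sketch; it won't reproduce the exact splits JMP found, and X_train, y_train, X_valid, y_valid are assumed:)

```python
from sklearn.tree import DecisionTreeRegressor

best_tree, best_r2 = None, -1.0
# Try trees with 2..40 terminal nodes and keep the best validation R²
for leaves in range(2, 41):
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=123)
    tree.fit(X_train, y_train)
    r2 = tree.score(X_valid, y_valid)      # R² on the validation set
    if r2 > best_r2:
        best_tree, best_r2 = tree, r2

print(best_tree.get_n_leaves(), "leaves, validation R² =", round(best_r2, 3))
```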
So we can look at Column Contributions, which gives us a feel for the most important variables, the ones involved in the most splits, and we see a similar theme to what we saw in regression: lot square footage, living area, year built, gross area, Cluster, finish. So we've basically built a model that is a series of if-then statements, and we can save that formula out to the data table.

Now, an alternative to a straight classification or regression tree is Bootstrap Forest or Boosted Tree, and very quickly, here is what these are. I'll click Recall to pull in what I did last time. A Bootstrap Forest builds a series of very small trees and then combines them, so the final model is an average of the predictions from the individual trees. I'll click OK, and there are some controls: we can specify how many trees to build (we're building a forest, so the default is a hundred trees, all relatively small), we can change these settings to explore different options, and we can set a random seed, because there is randomness involved. If I click OK you'll see what happens; I'll go ahead and show the trees, and by the way, we're getting statistics here so we can compare these different models in terms of performance. In Column Contributions you'll see that more variables have entered the model: when you run a straight decision tree, or classic regression tree, the very first split controls all of the other splits that are made, so Bootstrap Forest allows other variables to enter the model, and that's why we see additional variables here. What we've essentially done (I'll show the trees) is build a forest comprised of as many trees as we specified, although it stopped early here, so there's an early-stopping rule. We've got a relatively small tree, and another small tree, and another, and the final model is an aggregate of all of those trees.

The final option here is Boosted Tree, and I'll go ahead and hit Recall. What Boosted Tree does is build a small tree, calculate the scaled residuals, or error, from that tree, and then build another tree on top of that, and another, and another, so it's a layering sort of process. Here we can randomly sample from our rows and also from our columns (we can do the same in Bootstrap Forest). I'll click OK, and again we get overall statistics, we can see which variables entered the model, and we can save the formula out. I'm going relatively quickly here, but those are the tree-based methods.
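(The analogous ensembles outside of JMP are a random forest, many trees averaged, and gradient boosting, trees fit sequentially to residuals. A scikit-learn sketch under the same assumed train/validation split; the settings shown are illustrative, not JMP's defaults:)

```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Averages the predictions of many bootstrap-sampled trees
forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=123)
forest.fit(X_train, y_train)

# Builds shallow trees one on top of another, each fit to the previous residuals
boosted = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                    max_depth=3, random_state=123)
boosted.fit(X_train, y_train)

print("Forest validation R²: ", round(forest.score(X_valid, y_valid), 3))
print("Boosted validation R²:", round(boosted.score(X_valid, y_valid), 3))
```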
A neural network is a form of nonlinear model. I'll populate this the same as before, with Cluster in again, and add Validation; what we're going to do is build a model that can have multiple layers and multiple functions, and I'll go ahead and click Go so we can create a model and look at the diagram to see what we've done. Basically we've got a series of inputs, and then a hidden layer, with three nodes in the hidden layer by default. Those nodes follow an activation function, and there are three possible activation functions: TanH, which is sigmoidal, S-shaped like a logistic function; linear; and Gaussian. The final model is a linear combination of the models from each of the nodes in the hidden layer. This is a nonlinear model that tends to do a really, really good job of predicting, but it tends to be a bit of a black box, and it's a highly parameterized model. If I look at the estimates for this model, there are a lot of things being estimated: all of these parameters for the hidden layer, plus the things being estimated in the output layer. So it's a very nice model for making predictions, but not a very good model if you're trying to understand which variables are important.

The last type of model I'll show is under Fit Model (I'll go ahead and hit Recall), under Personality: Generalized Regression. Generalized Regression is a modern approach to generalized linear modeling where you can specify the response distribution, and you can also handle censoring. When I click Run, by default it performs a maximum likelihood analysis, which is the same as standard least squares; if I had a categorical response, for example good or bad, or pass or fail, it would allow me to do logistic regression. There are several different approaches to building a model here: maximum likelihood is the default, and there are also forward selection and penalized methods such as the Lasso, Elastic Net, and Ridge. I'll tuck this away and apply one of these penalized methods so we can see what happens; here I've selected the Lasso, and I'll click Go. Basically, JMP applies a penalty: our parameter estimates can be inflated when there's correlation among the predictors, and applying the penalty shrinks them. So here we've got a full model, and as the penalty is applied the parameter estimates are shrunk, so we're actually introducing a bias. The panel on the side tells me at what point I have my best model: the top line is the negative log-likelihood for my validation set, and the bottom line is for my training set, so it's shrinking the parameter estimates with a penalty while tracking that validation performance. The end result is basically a linear model, so if I were to save this out, we would have a model that is highly interpretable.
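(The Lasso step sketched outside of JMP: fit the penalized path and pick the penalty that does best on held-out data. scikit-learn's LassoCV uses cross-validation rather than a single validation column, so this is an analogue, not the same procedure; X_train, y_train, X_valid, y_valid are assumed as before.)

```python
from sklearn.linear_model import LassoCV

# Cross-validation over a grid of penalties stands in for the validation search
lasso = LassoCV(cv=5, random_state=123).fit(X_train, y_train)

print("Chosen penalty (alpha):", round(lasso.alpha_, 4))
print("Nonzero coefficients:  ", (lasso.coef_ != 0).sum())
print("Validation R²:         ", round(lasso.score(X_valid, y_valid), 3))
```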
So let's take a look at these different models. I've actually created several of them, and I'm going to open the Formula Depot, where I've stored all of these models: a least squares model, a regression tree, a bootstrap forest, a boosted tree, a couple of neural nets, and several different generalized regression models. If I want to compare these models, under the red triangle in the Formula Depot there is a Model Comparison option. This option is also available if you don't use the Formula Depot: in JMP Pro, under Predictive Modeling, there's Model Comparison, so if I save the prediction columns out to the data table I can access Model Comparison that way. Basically, this lets me look at fit statistics for all of the different models I've fit, and if I compare these statistics on just the validation set, I can see that the best model in terms of RASE is actually this one called two-stage forward selection. This is a generalized regression procedure where I'm fitting a model with interactions; it first checks which two-way interactions are significant, and then it finds which main effects are significant. So my best model is actually this bottom model here, and if I wanted to deploy that model, I can use the Formula Depot and write out scoring code for it in any one of these formats.

Now, this whole example was based on a continuous response. What if I had a categorical response? For example, a data set (I'll go back to my journal here) like Titanic Passengers, where the response is yes or no with several different predictors, or the Equity data we saw earlier, where we're looking at good or bad. The same sort of thing applies: we can use the same modeling approaches, except that instead of least squares regression we do nominal logistic regression, and I can still do classification trees (instead of regression trees), bootstrap forests, and boosted trees, and I can still use the Model Comparison platform. The mechanics are all very similar, but instead of looking at root mean square error or R-square, I'm looking at the misclassification rate, so I can compare the misclassification rate for several competing models to pick the best one.

I think I'm out of time. I didn't have a chance to talk about Text Explorer, but there are some nice webinars out there, and I'll open it up for questions in a moment. I do want to remind you that there's additional information in our Learning Library: if you go to jmp.com/teach you'll find links to live and on-demand webinars, a lot of really good information in our help files, and several books on predictive modeling and analytics, all available from jmp.com/teach.
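(To mirror that closing comparison for a categorical response outside of JMP, you can score each competing classifier on the validation rows and compare misclassification rates; a sketch where the fitted classifiers and the validation split are assumed placeholders:)

```python
from sklearn.metrics import accuracy_score

# logistic_model, tree_model, forest_model are assumed to be fitted classifiers
models = {"Logistic": logistic_model, "Tree": tree_model, "Forest": forest_model}

for name, m in models.items():
    misclass = 1 - accuracy_score(y_valid, m.predict(X_valid))
    print(f"{name}: validation misclassification rate = {misclass:.3f}")
```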
Info
Channel: Mia Stephens
Views: 1,474
Rating: 5 out of 5
Id: 55oth3LBkV8
Length: 59min 45sec (3585 seconds)
Published: Wed Dec 07 2016