Impute, Transform, Regression & Neural Models | Getting Started with SAS Enterprise Miner

Video Statistics and Information

Captions

Getting Started with SAS Enterprise Miner: imputing and transforming data, building a regression model, and building a neural network. This is the fourth in a series of six Getting Started with SAS Enterprise Miner instructional videos. Together, the videos present a stepwise approach to a real-world data mining problem using SAS Enterprise Miner 13.2 on SAS 9.4.

In the previous video segment, you built three different decision tree models in order to search for a robust nonparametric predictive model. First, you automatically trained a full decision tree and pruned it to size, selecting split rules that maximize the logworth splitting criterion. Next, you interactively trained a decision tree, selecting the best candidate split rules from a list. Finally, you used a gradient boosting approach to form a single predictive model that evolved from a set of decision trees.

In this video segment, you will use parametric methods to model the data in order to compare performance to the nonparametric decision trees that you created earlier. You will do the following: impute values to use as replacements for missing values in the input data (we replace missing data because regression and neural network models ignore observations that contain missing values); transform the input variables to better suit the input data for regression analysis; create and add a logistic regression model; and finally, create and add a neural network model.

You might be wondering: why are we replacing missing values for regression and neural network models when we did not replace missing values for the decision trees we built? That's a good question. Missing values are not problematic for decision trees: you can use surrogate splitting rules to select other variable values when splitting variable values are missing. In SAS Enterprise Miner, regression and neural network models ignore observations that contain missing values. This reduces the size of the training data set, which can weaken the predictive power of those types of
models. To overcome the obstacle of missing data, you can impute missing values before you train the models. It is a good idea to impute missing values before you train a model that ignores observations with missing values. This is especially true if you plan to compare the model to a decision tree model, a model that does not ignore observations with missing values. Model comparison is most appropriate between models that were fitted using the same set of training observations.

Now we will add an Impute node to our process flow diagram. To start, open the process flow diagram that you created in the previous video segment, Building Decision Trees. Your process flow diagram should resemble the one shown at right. Drag an Impute node from the Modify tab of the toolbar to your diagram workspace. Connect the new Impute node to the Control Point node as shown.

With the Impute node selected, we can configure the node properties. When performing imputation, the SAS Enterprise Miner Default Input Method property specifies which statistic to use for missing values. In the Class Variables section of the Impute node Train properties, click Default Input Method and select Tree Surrogate from the list. This property configuration means that missing class variable values in observations will be imputed using predicted values from a decision tree. In the Interval Variables section of the Impute node Train properties, click Default Input Method and select Median from the list. The values of missing interval variables are replaced by the median of the nonmissing values. The median statistic is less sensitive to extreme values than the mean or midrange statistic. The median is also useful for replacing values in skewed distributions.

Now we are ready to run the Impute node in our process flow diagram so we can generate replacement variable values. Right-click the Impute node in your diagram and select Run. Then click Yes in the confirmation window. After the Impute node processing completes, click OK. The Impute node
exports new values by creating new variables that contain replacements for missing values. The Impute node does not overwrite observation values in the original data set. Instead, the Impute node creates new variables that contain the imputed values. Imputed variables in SAS results are identified by the prefix IMP_ concatenated with the original variable name.

After imputing and replacing missing values, we should consider transforming the input data before we submit it to the regression and neural network modeling nodes. Transforming the data can improve model response. Transforming the data tends to stabilize variance, remove nonlinearity, improve additivity, and counter non-normality. For many models, transforming input data leads to better model fits. These transformations can be a function of one or more variables.

Drag the Transform Variables node from the Modify tab to your diagram and then connect it to the Impute node. Now configure the properties for the Transform Variables node. In the Train group of the Transform Variables properties, click the ellipsis button next to Formulas. Selecting the ellipsis next to the Formulas property opens the Formulas window for the Transform Variables node. In the Formulas window, we can browse the input variable distributions before specifying the type of transformation we want to perform. Select the Role column in the variables table to sort the variables by variable role. You can select any row in the table to see a histogram of the variable values. Look at the histograms of all the input variables. Notice that several variables have skewed distributions. The common log transformation is often used to control skewness. Click Cancel to close the Formulas window.

You must select the variables that you want Enterprise Miner to transform, and you must specify the transformation method that you want to use for each variable. In the Train group of the Transform Variables properties, click the ellipsis button to the right of the Variables
property. The Variables - Trans window opens. Here we will select some of the variables that we examined in the Formulas window, and we will specify transformation methods for the selected variables.

The common log transformation is often used to control skewness. We will select four variables that had skewed distributions for a common log transformation. You can hold down the Control key to select multiple variables and then specify one transformation method for all the selected variables using a single Method column assignment. Select the following four variable rows in the Variables - Trans table: File Average Gift, Last Gift Amount, Lifetime Average Gift Amount, and Lifetime Gift Amount. When all four variable rows are selected, you can click the Method column in any highlighted row and then specify the transformation that you want applied to the group of variables. For the selected variables, choose the Log 10 transformation. Notice that when you specify the transformation type, all the selected variables in the table are updated to display the new transformation method.

Next, we will examine transforming data to create interval variables. Some of the variables in our table might be useful as interval variables. The Optimal Binning transformation helps choose good interval boundaries for such data. Use the Control key to simultaneously select the following seven variables: Lifetime Card Prom, Lifetime Gift Count, Median Home Value, Median Household Income, Per Capita Income, Recent Response Prop, and Recent Star Status. Then click the Method column and choose Optimal Binning as the transformation method for the group of highlighted variables. Then click OK.

Right-click the Transform Variables node in your diagram. On the drop-down menu, click Run. At the confirmation prompt, click Yes. When the process flow diagram run completes, click OK. If you open the Transform Variables results browser and scan through the SAS log, there are some interesting analytic results. The data exported by the Transform Variables
node contains new variables that were created for each transformation. Original variables are preserved but are set to a Rejected role. Transformed variables are preceded with an identifier for the transformation type that was used: for example, LG10_varname for variables that were created using the log 10 transformation, and OPT_varname for class variables created using optimal binning. This arrangement preserves the original data and provides easily identified data for the imputed and replacement variable values.

After imputing values for missing variables and transforming the input data for parametric modeling, the process flow diagram should now resemble the following. We can use a Regression node to examine the imputed and transformed input variables. Drag a Regression node from the Model tab of the toolbar onto your diagram. Connect it to the Transform Variables node. The default Regression node property settings are okay.

To view histograms of the updated variables, right-click the Regression node and select Update on the drop-down menu. Update connects the Transform Variables output data to the Regression node. Now you can use the Regression node variables viewer to display histograms of the imputed and transformed data. Go to the Train group of the Regression properties and click the ellipsis to the right of Variables. This action opens the Variables - Reg window. Click the Name column to sort the variables table by name. Now select all the variables that have the prefix LG10_. After you select all of the log 10 transformed variables, click the Explore button to open the Explore window. Within the Explore window, you can select a bar in any histogram. The observations from that bar are highlighted in the EMWS Trans Train data set window as well as in the other histograms. After interactively exploring the histograms of the input variables that you find interesting, close the Variables - Reg window.

In the Model Selection group of the Regression node properties,
click the Selection Model property and select Stepwise from the drop-down list. This configures the Regression node to use stepwise variable selection to build the logistic regression model. The Regression node automatically performs logistic regression if the target variable is a class variable with a binary outcome. For continuous targets, the Regression node performs linear regression by default.

In the diagram workspace, right-click the Regression node and click Run. Then click Yes in the confirmation window. When processing completes, click Results. In the Regression results browser, examine the Output window. The Output window displays the stepwise variable selection process, including the fit statistics for each iteration. Examine the Score Rankings Overlay window. Use the plot selector in the upper left corner to choose total computed profit. The total computed profit chart opens. The plot data is ordered by expected profit, calculated by using the profit matrix that you defined earlier. The plot shows the total predicted profit from the candidate model's selection algorithm on the training and validation data. According to the validation data, the total computed profit from the solicitation using the regression model would be $2,495. Close the Regression results browser.

Next, we will create a neural network model to compare to the regression model. Neural networks are a class of parametric models that can handle a wider variety of nonlinear relationships between a set of predictors and the target variable, often better than regression models. It is a good idea to reduce the number of input variables submitted to neural network models. We will use the Variable Selection node to perform this task. Drag a Variable Selection node from the Explore tab of the toolbar and place it next to the Regression node in your process flow diagram. Connect the Transform Variables node to your Variable Selection node as shown. We will use the default
settings from the Variable Selection node. As configured, variables that have low R-square values are rejected. Right-click the Variable Selection node and click Run. Click Yes in the confirmation window that appears. When variable selection processing completes, click Results. In the Variable Selection results browser, expand the Variable Selection window. Click the Role column to sort the variables by role type. Variables that were not selected display a role type of Rejected. The rejected variables have low R-square values. As configured, the Variable Selection node chooses the variables that have the highest R-square values. There are significantly fewer input variables with the low R-square variables removed. Rejecting the input variables that have low R-square scores reduces the number of input variables to analyze to eight. The adjusted input data is now more suitable for neural network modeling.

Now we can add a Neural Network node to the diagram. From the Model tab on the toolbar, drag a Neural Network node onto your diagram. Place it under the Variable Selection node. Connect the Neural Network node to the Variable Selection node as shown. Here is a view of the full process flow diagram that you should have at this point. As long as the connections between nodes are exactly as shown, it does not matter if the nodes' relative positions vary from the display.

In the Neural Network Train property group, click the ellipsis button to the right of Network. This action opens the Network window. We will use the Network window to configure our neural network settings. Ensure that Direct Connection is set to Yes. This enables the network to connect input and output units directly, in addition to the connections made via hidden units. Set the value of Number of Hidden Units to 5. Our multilayer perceptron neural network will train with five units in the hidden layer. Our Neural Network node is now configured for model training. Click OK to close the Network window. Then right-click the
Neural Network node, click Run, and then click Yes in the confirmation window. When the neural network run completes, click the Results button to open the Neural Network results browser. Maximize the Score Rankings Overlay window. On the drop-down menu, select total computed profit. According to the validation data, if you solicited individuals selected by the neural network model, the total computed profit would be about $1,059.

How did the preliminary regression and neural network modeling results compare? Using the total computed profit plots and the validation data as a preliminary measure, the regression model appears to outperform the neural network model. Comparing the respective models' total computed profit plots, the solicited donors would be expected to contribute $2,495 via the logistic regression model versus $1,059 via the neural network model.

In this video segment, you used parametric methods to model the data in order to compare performance to the nonparametric decision trees that you created in a prior video. You completed the following tasks: imputed values to replace missing values in the input data (you replaced missing data because the regression and neural network models ignore observations that contain missing values); viewed and transformed the input variables to better suit the input data for regression analysis; created and added a logistic regression model; and finally, created and added a neural network model.

This completes the fourth segment of the Getting Started with SAS Enterprise Miner video series. To continue the tutorial, see the next segment, Getting Started with SAS Enterprise Miner: Comparing Models. SAS Enterprise Miner automatically saves your project work. When you close the software, your project work will be saved and available the next time you open a SAS Enterprise Miner session.
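
As a rough illustration of two steps described in the captions (median imputation into a new IMP_-prefixed variable, and a log 10 transformation into a new LG10_-prefixed variable), here is a minimal Python sketch. It is not Enterprise Miner code and does not reproduce how the Impute or Transform Variables nodes work internally. The helper names `impute_median` and `log10_transform`, the column name `LAST_GIFT_AMT`, and the sample values are all assumptions made for the example, and the transform simply assumes strictly positive values.

```python
import math
import statistics


def impute_median(rows, column):
    """Return copies of rows with an added IMP_<column> field.

    Missing values (None) are replaced by the median of the nonmissing
    values. The original column is left untouched, mirroring the idea
    that imputation creates a new variable instead of overwriting data.
    """
    present = [r[column] for r in rows if r[column] is not None]
    med = statistics.median(present)
    return [
        {**r, f"IMP_{column}": med if r[column] is None else r[column]}
        for r in rows
    ]


def log10_transform(rows, column):
    """Return copies of rows with an added LG10_<column> field.

    Assumption: every value is strictly positive. Any offset handling a
    real tool might apply for zeros is deliberately left out.
    """
    return [{**r, f"LG10_{column}": math.log10(r[column])} for r in rows]


# Hypothetical sample data with one missing gift amount.
donors = [
    {"LAST_GIFT_AMT": 10.0},
    {"LAST_GIFT_AMT": None},
    {"LAST_GIFT_AMT": 100.0},
    {"LAST_GIFT_AMT": 1000.0},
]

imputed = impute_median(donors, "LAST_GIFT_AMT")
transformed = log10_transform(imputed, "IMP_LAST_GIFT_AMT")

for row in transformed:
    print(row)
```

The median is used here for the same reason the captions give: it is less sensitive to extreme values than the mean, which matters for skewed gift amounts like the hypothetical ones above.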

Info

Channel: SAS Software

Views: 60,936

Rating: 4.8896551 out of 5

Keywords: SAS, SAS Enterprise Miner, Enterprise Miner, Getting Started with SAS Enterprise Miner, Data Mining (Software Genre), data mining software, imputing data, transforming data, building neural network, building regression model, SAS 9.4, Chip Robie, SAS tutorial, SAS Enterprise Miner 13.2

Id: TnWRJQb5z4c

Length: 19min 34sec (1174 seconds)

Published: Tue Feb 10 2015