Overview of Discover Best Model (Binary Response)

Note

This command is available with the Predictive Analytics Module.

Usually, the easiest way to determine which type of model makes the best predictions for a specific data set is to build all of the models and compare their performance. Use Discover Best Model (Binary Response) to compare the performance of 4 common types of models: Fit Binary Logistic Model, Fit Model for TreeNet® Classification, Random Forests® Classification, and CART® Classification. All 4 analyses model a binary response with many categorical and continuous predictor variables. For example, a market researcher wants to identify customers who have higher response rates to specific initiatives and to predict those response rates. The researcher compares the performance of the different types of models to decide which type gives the most accurate predictions.
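The idea of building several candidate models and comparing their performance can be sketched in a few lines of Python. This is only an illustration of the comparison, not the procedure that the command uses: the two candidate "models" (`fit_majority` and `fit_threshold`), the round-robin fold assignment, and the use of accuracy as the criterion are all assumptions made for the example.

```python
def k_fold_accuracy(fit, X, y, k=5):
    """Average held-out accuracy of one model type over k folds."""
    n = len(X)
    correct = 0
    for fold in range(k):
        # Round-robin fold assignment (an assumption for this sketch).
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(X[i]) == y[i] for i in test_idx)
    return correct / n


def fit_majority(X_train, y_train):
    """Hypothetical baseline: always predict the most common class."""
    majority = int(sum(y_train) * 2 >= len(y_train))
    return lambda x: majority


def fit_threshold(X_train, y_train):
    """Hypothetical rule: predict 1 above the midpoint of the class means."""
    ones = [x for x, label in zip(X_train, y_train) if label == 1]
    zeros = [x for x, label in zip(X_train, y_train) if label == 0]
    cut = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
    return lambda x: int(x > cut)


# Made-up data: one continuous predictor and a binary response.
X = [1.0, 1.5, 2.0, 2.5, 3.0, 6.0, 6.5, 7.0, 7.5, 8.0]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

candidates = {"majority": fit_majority, "threshold": fit_threshold}
scores = {name: k_fold_accuracy(fit, X, y) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Every candidate is scored on observations that it did not see during fitting, so the comparison reflects predictive performance rather than fit to the training data.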

The 4 model types fall into 2 more general categories: binary logistic regression models and tree-based models. Fit Binary Logistic Model makes binary logistic regression models. The other 3 commands make tree-based models. The model fitting methods for the 2 general types are very different, yet they complement each other.

A binary logistic regression model assumes that the event probability of a binary response is a parametric function of the predictors. The model uses the maximum likelihood criterion to estimate the parameters for a data set. If the parametric function adequately represents the relationship between the event probability of a response and its predictors, then the model can estimate the event probability well. In that case, the fitted model is likely to predict the response levels correctly for new observations. A binary logistic regression model also simplifies the identification of optimal settings for the predictors. A good fit also means that the fitted parameters and standard errors are useful for statistical inference, such as the estimation of confidence intervals for the predicted event probabilities.
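To make the maximum likelihood criterion concrete, the following sketch fits a logistic model with one predictor by plain gradient ascent on the log-likelihood. It rests on several assumptions: a logit link, a single continuous predictor, made-up data, and a fixed step size. Statistical software typically uses a faster iterative algorithm, so this is not a description of how the command itself estimates the parameters.

```python
import math


def event_probability(b0, b1, x):
    """Parametric form assumed here: a logistic function of b0 + b1 * x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def log_likelihood(b0, b1, xs, ys):
    """Sum of log P(observed response) over all observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = event_probability(b0, b1, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit_logistic(xs, ys, step=0.1, iterations=5000):
    """Gradient ascent on the average log-likelihood (illustrative only)."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(iterations):
        residuals = [y - event_probability(b0, b1, x) for x, y in zip(xs, ys)]
        b0 += step * sum(residuals) / n
        b1 += step * sum(r * x for r, x in zip(residuals, xs)) / n
    return b0, b1


# Made-up data in which the two classes overlap, so finite estimates exist.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(b0, b1, log_likelihood(b0, b1, xs, ys))
```

At the maximum, the residuals sum to approximately zero, which is the condition that the estimates have to satisfy under this parameterization.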

Sometimes, the binary logistic regression model does not fit a data set well, or characteristics of the data prevent the construction of a binary logistic regression model. The following are common cases in which a binary logistic regression model fits poorly or cannot be constructed:
  1. The relationship between the event probability of a binary response and the predictors does not follow a parametric function.
  2. For certain data sets, the maximum likelihood estimation algorithm fails to converge to unique parameter estimates.
  3. When the number of predictors is large, the data do not have enough observations to estimate the parameters in the event probability expression.
  4. The predictors are random variables.
  5. The predictors contain many missing values.

In such cases, tree-based models are good alternatives to consider.
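One well-known situation behind case 2 is complete separation, where a predictor splits the two response levels perfectly. The text above does not name this cause, so treat the example as an illustration under that assumption. With the made-up data below and a logistic model without an intercept, the log-likelihood keeps increasing as the slope grows, so no finite maximum likelihood estimate exists.

```python
import math


def log_likelihood(b1, xs, ys):
    """Log-likelihood of a no-intercept logistic model with slope b1."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = b1 * x
        # log(p) = -log(1 + exp(-z)) and log(1 - p) = -log(1 + exp(z))
        total += -math.log1p(math.exp(-z)) if y == 1 else -math.log1p(math.exp(z))
    return total


# Completely separated data: every x below 0 has y = 0, every x above 0 has y = 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

slopes = [1, 2, 4, 8, 16]
values = [log_likelihood(b, xs, ys) for b in slopes]
for b, v in zip(slopes, values):
    print(b, v)
```

Each larger slope gives a log-likelihood closer to 0 without ever reaching it, which is why an iterative estimation algorithm does not settle on unique estimates for data like these.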

Among the tree-based models, CART uses a single decision tree. A single decision tree starts from the entire data set as the first parent node. Then, the tree splits the data into 2 more homogeneous child nodes using the node-splitting criterion. This step repeats until all unsplit nodes meet the criteria to be a terminal node. After that, cross-validation or validation with a separate test set is used to prune the tree to obtain the optimal tree, which is the CART model. Single decision trees are easy to understand and can fit data sets with a wide variety of characteristics.
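A single splitting step can be sketched as follows. The example assumes one continuous predictor, made-up data, and Gini impurity as the node-splitting criterion. Gini is one common choice, and the text above does not say which criterion is in use. The sketch covers only the search for one split, not the repeated splitting or the pruning.

```python
def gini(labels):
    """Gini impurity of a node with a binary (0/1) response."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(xs, ys):
    """Return the cut point that gives the lowest weighted child impurity."""
    best_cut, best_impurity = None, gini(ys)
    values = sorted(set(xs))
    # Candidate cuts are the midpoints between adjacent distinct values.
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if weighted < best_impurity:
            best_cut, best_impurity = cut, weighted
    return best_cut, best_impurity


# Made-up parent node: low x values are class 0, high x values are class 1.
xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

cut, impurity = best_split(xs, ys)
print(cut, impurity)
```

Growing a full tree applies the same search to each child node in turn until the stopping criteria are met.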

Single decision trees can be less robust and less powerful than the other 2 tree-based methods. For example, a small change in the predictor values in a data set can lead to a very different CART model. The TreeNet® and Random Forests® methods use sets of individual trees to create models that are more robust and more accurate than models from single decision trees.
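The benefit of combining many trees can be illustrated with a small bagging sketch: each one-split tree (a "stump") is fit to a bootstrap resample of the data, and the ensemble predicts by majority vote. This is a simplified illustration and not the TreeNet® or Random Forests® algorithm. In particular, it omits the random selection of predictors that a random forest uses at each split and the sequential fitting that boosting methods use. The data, the number of trees, and all function names are assumptions made for the example.

```python
import random


def fit_stump(xs, ys):
    """Pick the cut with the fewest training errors for 'predict 1 if x > cut'."""
    values = sorted(set(xs))
    cuts = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]

    def errors(cut):
        return sum(int(x > cut) != y for x, y in zip(xs, ys))

    return min(cuts, key=errors)


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Fit one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    cuts = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        cuts.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return cuts


def predict(cuts, x):
    """Majority vote over the individual stumps."""
    votes = sum(int(x > cut) for cut in cuts)
    return int(votes * 2 > len(cuts))


# Made-up noisy data: the classes overlap in the middle of the range.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

cuts = fit_bagged_stumps(xs, ys)
print(sorted(set(cuts)))
print([predict(cuts, x) for x in xs])
```

Because each stump sees a slightly different resample, the individual cut points can differ, which mirrors the sensitivity of a single tree to small changes in the data. The vote averages over that variation.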

For more information on each model type, use the following links: