Types of predictive analytics models in Minitab Statistical Software

Models from predictive analytics provide insights for a wide range of applications, including manufacturing quality control, drug discovery, fraud detection, credit scoring, and churn prediction. Use the results to identify important variables, to find groups in the data with desirable characteristics, and to predict response values for new observations. For example, a market researcher can use a predictive analytics model to identify customers who have higher response rates to specific initiatives and to predict those response rates.

In many applications, an important step in model construction is to consider several types of models. Analysts find the best model type for the application at hand, find the optimal version of that model, and use the model to generate the most accurate predictions possible. To assist in the comparison of model types, Minitab Statistical Software provides the capability to compare different model types in a single analysis when the response variable is continuous or binary.
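The following sketch illustrates the general idea of comparing candidate model types on the same data. It uses the open-source scikit-learn library with simulated data rather than Minitab commands, and the variable names and settings are assumptions chosen only for illustration: each candidate model is scored with the same cross-validation folds, and the type with the best average performance is a reasonable starting point for further tuning.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    # Simulated data with one linear effect and one non-linear effect.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(300, 4))
    y = 2.0 * X[:, 0] + 3.0 * np.sin(X[:, 1]) + rng.normal(scale=0.5, size=300)

    candidates = {
        "linear regression": LinearRegression(),
        "single decision tree": DecisionTreeRegressor(random_state=0),
        "random forest": RandomForestRegressor(random_state=0),
        "gradient boosting": GradientBoostingRegressor(random_state=0),
    }

    # Score every candidate with the same cross-validation folds.
    for name, model in candidates.items():
        score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(f"{name}: mean cross-validated R-squared = {score:.2f}")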

If you have a categorical response variable with more than 2 categories, create models one-by-one.

Linear regression models

A linear regression model assumes that the average response is a parametric function of the predictors. The model uses the least-squares criterion to estimate the parameters from a data set. If a parametric regression model fits the relationship between the response and its predictors, then the model predicts the response values for new observations accurately. For example, Hooke's Law in physics says that the force required to extend a spring is proportional to the distance of the extension, so a linear regression model fits the relationship very well.
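As an illustration of the least-squares idea, the following sketch fits a straight line to simulated spring data. The numbers and variable names are hypothetical, and the sketch uses open-source NumPy rather than Minitab commands; it only shows the estimation of an intercept and a slope (the spring constant) by least squares.

    import numpy as np

    # Hypothetical spring data: extension (m) and measured force (N).
    extension = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30])
    force = np.array([2.4, 5.1, 7.4, 9.9, 12.6, 14.8])

    # Least-squares estimate of force = intercept + k * extension.
    X = np.column_stack([np.ones_like(extension), extension])
    (intercept, k), *_ = np.linalg.lstsq(X, force, rcond=None)
    print(f"estimated spring constant k = {k:.1f} N/m")

    # Predict the force for a new extension value.
    new_extension = 0.18
    print(f"predicted force at {new_extension} m: {intercept + k * new_extension:.2f} N")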

A linear regression model simplifies the identification of optimal settings for the predictors. When the model fits well, the estimated parameters and their standard errors are also useful for statistical inference, such as the estimation of confidence intervals for the predicted response values.
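As a sketch of that kind of inference, the following example uses the open-source statsmodels library (not a Minitab command) on simulated data; the model form, variable names, and new observation values are assumptions chosen for illustration. It shows confidence intervals for the estimated parameters and for the mean response at new predictor values.

    import numpy as np
    import statsmodels.api as sm

    # Simulated data that follows a straight-line relationship.
    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 40)
    y = 3.0 + 1.5 * x + rng.normal(scale=1.0, size=40)

    # Fit the linear regression model by ordinary least squares.
    X = sm.add_constant(x)
    results = sm.OLS(y, X).fit()

    # 95% confidence intervals for the intercept and slope.
    print(results.conf_int())

    # Confidence intervals for the mean response at new observations.
    new_X = sm.add_constant(np.array([2.5, 7.0]), has_constant="add")
    prediction = results.get_prediction(new_X)
    print(prediction.summary_frame(alpha=0.05)[["mean", "mean_ci_lower", "mean_ci_upper"]])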

Linear regression models are flexible and often fit the true form of relationships in the data. Even so, sometimes a linear regression model does not fit a data set well, or characteristics of the data prevent the construction of a linear regression model. The following cases are common reasons why a linear regression model fits poorly:
  • The relationship between the response and the predictors does not follow a form that a linear regression model can fit.
  • The data set has too few observations to estimate all of the parameters of a linear regression model that fits well.
  • The predictors are random variables.
  • The predictors contain many missing values.

In such cases, tree-based models are good alternative models to consider.
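To make the missing-value case concrete, the following sketch uses a tree-based method from the open-source scikit-learn library (not a Minitab command) on simulated data with missing predictor values; the data, variable names, and settings are assumptions for illustration only. Histogram-based boosted trees can use such observations directly, whereas an ordinary least-squares fit would require the missing values to be removed or imputed first.

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor

    # Simulated data with a curved relationship.
    rng = np.random.default_rng(11)
    X = rng.uniform(0, 10, size=(200, 2))
    y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=200)

    # Knock out about 30% of the values in the first predictor.
    missing = rng.random(200) < 0.3
    X[missing, 0] = np.nan

    # The tree-based model accepts the missing predictor values as-is.
    model = HistGradientBoostingRegressor(random_state=0).fit(X, y)
    print("training R-squared with missing predictors:", round(model.score(X, y), 2))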

In the Predictive Analytics Module, Minitab Statistical Software fits linear regression models to continuous and binary response variables with the Discover Best Model commands. For a list of other linear regression models in Minitab Statistical Software, go to Which regression and correlation analyses are included in Minitab?.

Tree-based models

CART®, TreeNet®, and Random Forests® are 3 tree-based methods. Among the tree-based models, CART® is the easiest to understand because CART® uses a single decision tree. A single decision tree starts from the entire data set as the first parent node. Then, the tree splits the data into 2 more homogeneous child nodes according to a node-splitting criterion. This step repeats until every unsplit node meets the criteria to become a terminal node. After that, cross-validation or validation with a separate test set is used to trim the tree to obtain the optimal tree, which is the CART® model. Single decision trees are easy to understand and can fit data sets with a wide variety of characteristics.
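The following sketch shows the same two-stage idea (grow a large tree, then use cross-validation to trim it back) with a decision tree from the open-source scikit-learn library rather than Minitab's CART® command; the data and settings are assumptions for illustration, and scikit-learn expresses the trimming step through a cost-complexity pruning parameter.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import GridSearchCV

    # Simulated data with one curved effect and one linear effect.
    rng = np.random.default_rng(3)
    X = rng.uniform(0, 10, size=(300, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

    # Grow a large tree, then collect the candidate pruning strengths.
    full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
    alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas

    # Use cross-validation to choose how far to trim the tree back.
    search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                          param_grid={"ccp_alpha": alphas},
                          cv=5, scoring="neg_mean_squared_error")
    search.fit(X, y)
    print("terminal nodes in the trimmed tree:",
          search.best_estimator_.get_n_leaves())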

Single decision trees can be less robust and less powerful than the other 2 tree-based methods. For example, a small change in the predictor values in a data set can lead to a very different CART® model. The TreeNet® and Random Forests® methods use sets of individual trees to create models that are more robust and more accurate than models from single decision trees.
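The sketch below illustrates the ensemble idea with two open-source analogues from scikit-learn rather than Minitab's TreeNet® and Random Forests® commands: a random forest averages many trees grown on resampled data, and gradient boosting adds many small trees in sequence. The data and settings are assumptions for illustration only.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    # Simulated data with an interaction between two predictors.
    rng = np.random.default_rng(5)
    X = rng.uniform(0, 10, size=(400, 3))
    y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.5, size=400)

    # Both ensembles combine many individual trees into one model.
    ensembles = [("random forest", RandomForestRegressor(n_estimators=200, random_state=0)),
                 ("boosted trees", GradientBoostingRegressor(random_state=0))]
    for name, model in ensembles:
        score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(f"{name}: mean cross-validated R-squared = {score:.2f}")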

Minitab Statistical Software fits tree-based models to continuous response variables, binary response variables, and nominal response variables. To see an example of each model in Minitab Statistical Software, select a model type:

MARS® Regression models

MARS® Regression first constructs an extensive set of basis functions that fit the data as well as possible. After forming the extensive model, the analysis reduces the risk of overfitting by searching for an optimal subset of the basis functions. The reduced model remains adaptable to various non-linear dependencies in the data. The resulting model is a linear regression model in the space of these basis functions. Like tree-based models, MARS® Regression searches in a stepwise fashion for different fits in different regions of the data. Because of the tree-based characteristics, MARS® Regression provides some of the same advantages:
  • Automatic detection of the model form
  • Automatic handling of missing values
  • Automatic selection of the most relevant predictors
Like linear regression models, MARS® Regression expresses the final model as a regression equation. Because of the linear regression characteristics, MARS® Regression also provides some of the advantages of this model type:
  • A regression equation makes the effects of the variables easy to understand.
  • The continuous function means that small changes in the predictors result in small changes in the predictions.
  • Even for small models, different values of the predictors yield different predictions.
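The following sketch shows the structure of such a model: hinge basis functions of the form max(0, x - knot) combined by an ordinary least-squares fit. It is a hand-built illustration in open-source NumPy with fixed knots, whereas MARS® Regression searches for the knots and prunes the basis functions automatically; the data and knot locations are assumptions chosen for illustration.

    import numpy as np

    def hinge(x, knot):
        # Hinge basis function: max(0, x - knot).
        return np.maximum(0.0, x - knot)

    # Simulated piecewise-linear response with a change in slope near x = 4.
    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, 200)
    y = 1.0 + 0.5 * x + 2.0 * hinge(x, 4.0) + rng.normal(scale=0.4, size=200)

    # Build a small basis of hinge functions and fit an ordinary
    # least-squares regression in that basis.
    knots = [2.0, 4.0, 6.0, 8.0]
    basis = np.column_stack([np.ones_like(x), x] + [hinge(x, k) for k in knots])
    coefficients, *_ = np.linalg.lstsq(basis, y, rcond=None)
    print("fitted coefficients:", np.round(coefficients, 2))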
The flexible models from MARS® Regression give accurate predictions and can provide insights into the form of the model that improve the fit of other types of models. Minitab Statistical Software fits MARS® Regression models to continuous response variables. To see an example of MARS® Regression in Minitab Statistical Software, go to Example of MARS® Regression.