Models from predictive analytics provide insights for a wide range of applications,
including manufacturing quality control, drug discovery, fraud detection, credit
scoring, and churn prediction. Use the results to identify important variables, to
identify groups in the data with desirable characteristics, and to predict response
values for new observations. For example, a market researcher can use a predictive
analytics model to identify customers who have higher response rates to specific
initiatives and to predict those response rates.
In many applications, an important step in model construction is to consider various
types of models. Analysts find the best model type for an application, find the
optimal version of that model, and use the model to generate the most accurate
predictions possible. To assist in this comparison, Minitab Statistical Software can
compare different model types in a single analysis if you have a continuous response
variable or a binary response variable.
If you have a categorical response variable with more than 2 categories, create the
models one at a time.
Linear regression models
A linear regression model assumes that the average response is a parametric function
of the predictors. The model uses the least-squares criterion to estimate the
parameters from a data set. If a parametric regression model fits the relationship
between the response and its predictors, then the model accurately predicts the
response values for new observations. For example, Hooke's Law in physics says that
the force required to extend a spring has a linear relationship with the distance of
extension, so a regression model fits the relationship very well.
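As an illustration, the following sketch fits the least-squares line to made-up Hooke's Law measurements. The data values and the use of Python with NumPy are assumptions for illustration, not Minitab output or Minitab's implementation.

```python
# A minimal sketch: fit a straight line by least squares to illustrate
# Hooke's Law, force = k * extension. The measurements below are made up.
import numpy as np

extension = np.array([0.0, 0.5, 1.0, 1.5, 2.0])  # distance the spring stretches
force = np.array([0.1, 2.6, 4.9, 7.6, 9.9])      # measured force, roughly 5 * extension

# Least-squares estimates of the slope (spring constant) and intercept.
slope, intercept = np.polyfit(extension, force, deg=1)
print(f"estimated spring constant: {slope:.2f}")
print(f"estimated intercept:       {intercept:.2f}")

# Predict the force for a new extension value.
new_extension = 1.2
print(f"predicted force at {new_extension}: {intercept + slope * new_extension:.2f}")
```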
A linear regression model simplifies the identification of optimal settings for the
predictors. The effective fit also means that the fitted parameters and standard
errors are useful for statistical inference, such as the estimation of confidence
intervals for the predicted response values.
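A sketch of this kind of inference, assuming the statsmodels library and simulated data (neither comes from Minitab; Minitab computes such intervals through its own regression commands):

```python
# Confidence intervals for the mean response at new predictor values,
# using statsmodels OLS on simulated data (an illustrative assumption).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=x.size)  # simulated response

X = sm.add_constant(x)       # add the intercept column
model = sm.OLS(y, X).fit()   # least-squares fit

# 95% confidence intervals for the mean response at two new x values.
new_x = sm.add_constant(np.array([2.5, 7.5]))
pred = model.get_prediction(new_x)
print(pred.summary_frame(alpha=0.05)[["mean", "mean_ci_lower", "mean_ci_upper"]])
```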
Linear regression models are flexible and often fit the true form of relationships in
the data. Even so, sometimes a linear regression model does not fit a data set well,
or characteristics of the data prevent the construction of a linear regression
model. The following are common cases where a linear regression model fits poorly:
- The relationship between the response and the predictors does not follow a
form that a linear regression model can fit.
- The data do not have enough observations to estimate the parameters of a
linear regression model that fits well.
- The predictors are random variables.
- The predictors contain many missing values.
In such cases, tree-based models are good alternatives to consider.
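For instance, tree-based methods can fit data whose predictors contain many missing values. The following sketch uses scikit-learn's HistGradientBoostingRegressor as a stand-in (an assumption; this is not Minitab's implementation), with made-up data in which 20% of the predictor values are missing:

```python
# A tree-based ensemble fits data with missing predictor values, a case
# where an ordinary least-squares fit cannot be constructed directly.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Randomly blank out 20% of the predictor values.
mask = rng.random(X.shape) < 0.2
X[mask] = np.nan

# Histogram-based gradient boosting handles NaN predictors natively.
model = HistGradientBoostingRegressor(random_state=0).fit(X, y)
print("training R^2:", round(model.score(X, y), 3))
```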
In the Predictive Analytics Module, Minitab Statistical Software fits linear
regression models to continuous and binary response variables with the Discover Best
Model commands. For a list of other linear regression models in Minitab Statistical
Software, go to Which regression and correlation analyses are included in Minitab?.
Tree-based models
CART®, TreeNet®, and Random Forests® are 3
tree-based methods. Among the tree-based models, CART® is the easiest to
understand because CART® uses a single decision tree. A single decision
tree starts from the entire data set as the first parent node. Then, the tree splits
the data into 2 more homogeneous child nodes using the node-splitting criterion. This
step repeats iteratively until every unsplit node meets the criteria to become a
terminal node. After that, cross-validation or validation with a separate test set
prunes the tree to obtain the optimal tree, which is the CART® model.
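A hedged sketch of this grow-then-prune procedure, using scikit-learn's DecisionTreeRegressor with cost-complexity pruning as an analogue for CART® (the data and the scikit-learn API are illustrative assumptions, not Minitab's implementation):

```python
# Grow a full tree, then use 5-fold cross-validation to choose the
# cost-complexity pruning level that yields the optimal pruned tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.where(X[:, 0] > 0, 2.0, -2.0) + rng.normal(scale=0.5, size=300)

# Candidate pruning strengths from the cost-complexity pruning path.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = 0.0, -np.inf
for alpha in np.clip(path.ccp_alphas, 0, None):
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()  # mean 5-fold R^2
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print("chosen alpha:", round(best_alpha, 4), "leaves:", pruned.get_n_leaves())
```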
Single decision trees are easy to understand and can fit data sets with a wide
variety of characteristics.
Single decision trees can be less robust and less powerful than the other 2
tree-based methods. For example, a small change in the predictor values in a data
set can lead to a very different CART® model. The TreeNet® and
Random Forests® methods use sets of individual trees to create models
that are more robust and more accurate than models from single decision trees.
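As a rough analogue, the following sketch compares a single tree with two scikit-learn ensembles, a random forest and gradient boosting, on simulated data. The libraries and data are assumptions; these are not Minitab's TreeNet® or Random Forests® implementations.

```python
# Cross-validated comparison of a single tree against two tree ensembles.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.uniform(0, 6, size=(400, 3))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.3, size=400)

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()  # mean 5-fold R^2
    print(f"{name:18s} mean R^2 = {score:.3f}")
```

The ensembles typically score higher and their predictions change less when individual observations are perturbed, which is the robustness advantage described above.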
Minitab Statistical Software fits tree-based models to continuous response variables,
binary response variables, and nominal response variables. To see an example of each
model in Minitab Statistical Software, select a model type:
MARS® Regression models
MARS® Regression first constructs an extensive set of basis functions that fit the
data as well as possible. After forming the extensive model, the analysis reduces the
risk of overfitting by searching for an optimal subset of the basis functions. The
reduced model remains adaptable to various non-linear dependencies in the data. The
resulting model is a linear regression model in the space of these basis functions.
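To make the basis-function idea concrete, the following sketch hand-builds two hinge basis functions, max(0, x - t) and max(0, t - x), at an assumed knot t, and fits an ordinary least-squares regression in that basis. The knot, the data, and the NumPy usage are illustrative assumptions rather than Minitab's MARS® search, which chooses knots and basis functions automatically.

```python
# Fit a linear regression in the space of two hinge basis functions,
# the building blocks of a MARS-style piecewise-linear model.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
# Piecewise-linear truth: slope 1 below the knot at 5, slope 3 above it.
y = np.where(x < 5, x, 5.0 + 3.0 * (x - 5)) + rng.normal(scale=0.5, size=200)

knot = 5.0  # assumed knot location for illustration
basis = np.column_stack([
    np.ones_like(x),           # intercept
    np.maximum(0, x - knot),   # hinge rising to the right of the knot
    np.maximum(0, knot - x),   # hinge rising to the left of the knot
])

# Least-squares fit in the basis-function space.
coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
print("coefficients:", np.round(coef, 2))
```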
The characteristic of searching for different fits for different regions of the data
in a stepwise fashion connects MARS® Regression to tree-based models. Because of
these tree-based characteristics, MARS® Regression provides some of the same
advantages:
- Automatic detection of the model form
- Automatic handling of missing values
- Automatic selection of the most relevant predictors
The use of an equation connects MARS® Regression to linear regression models.
Because of the linear regression characteristics, MARS® Regression also provides
some of the advantages of this model type:
- A regression equation makes the effects of the variables easy to
understand.
- The continuous function means that small changes in the predictors result in
small changes in the predictions.
- Even for small models, different values of the predictors yield different
predictions.
The flexible models from MARS® Regression provide accurate predictions and can
provide insights into the form of the model that improve the fit of other types of
models. Minitab Statistical Software fits MARS® Regression models to continuous
response variables. To see an example of MARS® Regression in Minitab Statistical
Software, go to Example of MARS® Regression.