Stepwise regression is an automated tool used in the exploratory stages of model building to identify a useful subset of predictors. The process systematically adds the most significant variable or removes the least significant variable during each step.
For example, a housing market consulting company collects data on home sales for the previous year with the goal of predicting future sales prices. With more than 100 predictor variables, finding a model can be a time-consuming task. Minitab's stepwise regression feature automatically identifies a sequence of models to consider. Statistics such as AICc, BIC, test R², R², adjusted R², predicted R², S, and Mallows' Cp help you to compare models. Minitab displays complete results for the model that is best according to the stepwise procedure that you use.
Exercise caution when using variable selection procedures such as best subsets and stepwise regression. One problem is that these procedures cannot consider special knowledge the analyst might have about the data. The procedure cannot consider the practical importance of any of the predictors.
A related problem is that when two predictors are highly correlated, the procedure can select only one of the two even though either can be important. For example, the procedure can remove a predictor that is cheap and easy to measure in favor of a correlated predictor that is difficult and expensive to measure. The analyst has to use their knowledge of the data to make judgments about criteria that the procedure cannot consider.
Another problem with stepwise procedures is that different criteria can favor different models. For example, the model with the highest adjusted R² value is not necessarily the model with the highest test R² value. The analyst has to weigh the different criteria to select a final model.
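As one illustration of how criteria can disagree, the following sketch compares R² with adjusted R² using the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p is the number of predictors. The function name and the numeric values are hypothetical and chosen only for illustration.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for a model with p predictors plus an intercept, fit to n rows."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical values: adding a sixth predictor nudges R-squared up slightly.
smaller = adjusted_r2(0.800, n=30, p=5)
larger = adjusted_r2(0.802, n=30, p=6)

# R-squared favors the larger model; adjusted R-squared favors the smaller one.
print(round(smaller, 4), round(larger, 4))
```

Here the larger model has the higher R² but the lower adjusted R², so the two criteria point to different models.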
To ensure that your model doesn't just fit one specific data set, you should verify the model found by the selection procedure on a new set of data. You can also take the original data set, randomly divide it into two parts, use one part to select a model, and then verify the fit on the second part. This procedure helps ensure that the model you select will apply to other data sets. Go to the section on stepwise procedures with automatic validation to learn about commands that can partition your data automatically and calculate validation statistics.
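The manual split described above can be sketched in a few lines. This is a minimal illustration rather than Minitab's own partitioning; the function name, the 30% test fraction, and the seed are assumptions.

```python
import random

def split_rows(rows, test_fraction=0.3, seed=0):
    """Randomly divide rows into a training part and a test part."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Select the model on `train`, then check its fit on `test`.
train, test = split_rows(range(100), test_fraction=0.3, seed=1)
print(len(train), len(test))
```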
All of the analyses in Minitab that include automatic stepwise procedures offer the same basic selection methods. These methods let you quickly evaluate a large number of different models in terms of their model summary statistics for the data that you use to build the model.
The stepwise procedure that Minitab can automatically perform with a test data set is called forward selection with validation with a test data set. In this procedure, the initial model is empty or includes model terms that you specifically select. Then, at each step, Minitab adds the potential term with the smallest p-value. At each step, Minitab also calculates the test R², which is the R² value of that step's model on the test data set. The model results that Minitab presents are for the model with the maximum test R².
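The procedure can be sketched in plain Python as follows. This is an illustrative approximation, not Minitab's implementation. It assumes every candidate term adds one degree of freedom, so the candidate with the smallest p-value is the one that gives the smallest training error sum of squares. It also keeps adding terms until none remain and computes test R² with the test-set mean; Minitab's stopping rule and exact formula may differ. All function names and the simulated data are hypothetical.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def fit(X, y, cols):
    """Least-squares coefficients (intercept first) using only the columns in cols."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    k = len(cols) + 1
    A = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    b = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(k)]
    return solve(A, b)

def sse(beta, X, y, cols):
    """Sum of squared prediction errors of the fitted model on (X, y)."""
    return sum((yi - beta[0] - sum(bj * row[j] for bj, j in zip(beta[1:], cols))) ** 2
               for row, yi in zip(X, y))

def r_squared(beta, X, y, cols):
    """R-squared of the fitted model on (X, y), using the mean of y as the baseline."""
    mean = sum(y) / len(y)
    return 1.0 - sse(beta, X, y, cols) / sum((yi - mean) ** 2 for yi in y)

def forward_select(X_train, y_train, X_test, y_test):
    """Return [(terms, test R-squared)] for each step of a forward selection."""
    selected, remaining, path = [], list(range(len(X_train[0]))), []
    while remaining:
        # Smallest training SSE corresponds to the smallest p-value among 1-df candidates.
        best = min(remaining, key=lambda j: sse(fit(X_train, y_train, selected + [j]),
                                                 X_train, y_train, selected + [j]))
        selected.append(best)
        remaining.remove(best)
        beta = fit(X_train, y_train, selected)
        path.append((list(selected), r_squared(beta, X_test, y_test, selected)))
    return path

# Simulated data: only columns 0 and 2 affect the response.
rng = random.Random(0)
def make(n):
    X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n)]
    y = [3 + 2 * r[0] - 1.5 * r[2] + rng.gauss(0, 0.5) for r in X]
    return X, y

X_train, y_train = make(60)
X_test, y_test = make(40)
path = forward_select(X_train, y_train, X_test, y_test)
best_terms, best_test_r2 = max(path, key=lambda step: step[1])
print(best_terms, round(best_test_r2, 3))
```

The final line reports the step whose model has the maximum test R², which mirrors the model that the procedure presents.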
For Fit Regression Model, you can choose a second validation technique to perform with stepwise selection, called forward selection with k-fold cross-validation. In k-fold cross-validation, Minitab divides the data set into k subsets. These subsets are called folds. Most often, validation uses 10 folds, but other numbers are possible. The folds have as close to equal numbers of observations as possible. Minitab performs forward selection k times. For each forward selection, k–1 folds are the training data set and the remaining fold is the test data set. As in other forward selection procedures, the initial model is empty or includes model terms that you specifically select. Then, at each step, Minitab adds the potential term with the smallest p-value. For each step, Minitab calculates the k-fold stepwise R² value by combining the information from the k forward selection procedures.
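The fold construction can be sketched as follows. The example shows only how rows might be divided into near-equal folds and how each fold takes a turn as the test data set; it does not reproduce how Minitab combines the k selections into the k-fold stepwise R². The function name and the seed are assumptions.

```python
import random

def make_folds(n_rows, k=10, seed=0):
    """Assign row indices to k folds whose sizes differ by at most one."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Taking every k-th index spreads any leftover rows over the first folds.
    return [indices[i::k] for i in range(k)]

folds = make_folds(103, k=10)
for i, test_fold in enumerate(folds):
    train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
    # A forward selection would run here on train_rows and be scored on test_fold.
    assert len(train_rows) + len(test_fold) == 103

print(sorted(len(fold) for fold in folds))
```

With 103 rows and 10 folds, seven folds hold 10 rows and three folds hold 11.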
A hierarchical model is a model where, for each term in the model, all lower-order terms contained in it must also be in the model. For example, suppose there is a model with four factors: A, B, C, and D. If the term A*B*C is in the model, then the terms A, B, C, A*B, A*C, and B*C must also be in the model, though any terms with D do not have to be in the model.
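The lower-order terms that a given interaction requires can be listed mechanically. The following sketch assumes terms are written as factor names joined by `*` and handles only interactions of distinct factors, not powers such as A*A. The function name is hypothetical.

```python
from itertools import combinations

def lower_order_terms(term):
    """All lower-order terms contained in an interaction such as 'A*B*C'."""
    factors = term.split("*")
    return ["*".join(combo)
            for size in range(1, len(factors))
            for combo in combinations(factors, size)]

print(lower_order_terms("A*B*C"))
# ['A', 'B', 'C', 'A*B', 'A*C', 'B*C']
```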
The terms that enter or leave a model at a step depend on the specifications for hierarchy. By default, Minitab Statistical Software requires a hierarchical model at each step, requires hierarchy for all terms, and allows only one term to enter the model at each step. These settings limit the terms that Minitab considers at each step. For example, a two-way interaction cannot enter the model unless both of the lower-order terms in the interaction are already in the model. You can adjust these settings by clicking Hierarchy when you select a stepwise method.
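Under the default settings, the effect on the candidate list at a step can be sketched as a simple filter. This is an illustration of the rule and not Minitab's code. It assumes that interaction terms are written with factors in a consistent order, and the function name is hypothetical.

```python
from itertools import combinations

def can_enter(term, model_terms):
    """True if every lower-order term contained in `term` is already in the model."""
    factors = term.split("*")
    required = {"*".join(combo)
                for size in range(1, len(factors))
                for combo in combinations(factors, size)}
    return required <= set(model_terms)

model = ["A", "B", "C"]
candidates = ["D", "A*B", "A*D", "A*B*C"]
print([term for term in candidates if can_enter(term, model)])
# ['D', 'A*B']
```

A*D is excluded because D is not yet in the model, and A*B*C is excluded because the two-way interactions it contains have not entered.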
Best subsets regression is an automated tool used in the exploratory stages of model building to identify a useful subset of predictors. The procedure displays model summary results for the number of models that you request for each size: models with one predictor, models with two predictors, and so on. The displayed models have the highest values of R² among the possible models of that size. To use best subsets regression in Minitab, choose .
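The enumerate-and-rank logic behind best subsets can be sketched as follows. The regression fit is abstracted behind a hypothetical `r2_of` callback. The `toy_r2` scorer and its predictor names and values are made up so that the sketch runs; they are not real results.

```python
from itertools import combinations

def best_subsets(predictors, r2_of, models_per_size=2):
    """For each subset size, keep the subsets with the highest R-squared."""
    results = {}
    for size in range(1, len(predictors) + 1):
        scored = [(r2_of(subset), subset) for subset in combinations(predictors, size)]
        scored.sort(reverse=True)
        results[size] = scored[:models_per_size]
    return results

# Stand-in scorer with made-up contributions, used only so the sketch runs.
# A real r2_of would fit a regression on the subset and return its R-squared.
toy_gain = {"sqft": 0.50, "age": 0.20, "lot": 0.10, "baths": 0.05}

def toy_r2(subset):
    return round(sum(toy_gain[name] for name in subset), 2)

results = best_subsets(list(toy_gain), toy_r2)
for size, models in results.items():
    print(size, models)
```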
As an automatic selection procedure, best subsets regression shares many problems with stepwise regression. The procedure cannot use specialized knowledge that an analyst has, nor is there any guarantee that different criteria identify the same model. Correlations among the predictors can make the identification of the best models more difficult. Validation of the model with new data increases the confidence you can have in the performance of the model.
Best subsets is an analysis in Minitab Statistical Software. Stepwise regression is an option in several analyses. Both of these automated model selection techniques provide information about the fit of several different models. From the different models, you can identify any models that deserve further exploration.
| Characteristic | Best Subsets Regression | Stepwise Regression |
| --- | --- | --- |
| Models considered | All possible models for the predictors. | A sequence of models chosen by the statistical significance of the terms. |
| Number of predictors to consider | Up to 31 free predictors, plus any predictors that you require in every model. | No set limit. |
| Types of predictors | Numeric columns in the worksheet. | Text or numeric columns plus interaction terms and other higher-order terms. |
| Types of response variables | One numeric column. | Different analyses in Minitab can analyze different types of response variables. For stepwise regression, you can choose an analysis for a continuous response variable, a binary response variable, or a Poisson response variable. |
| Results | The results include model summary statistics that explore the fit of the data. To view full regression results, such as residual plots, explore your chosen model in an analysis like Fit Regression Model. | The analysis displays full regression results for the optimal model according to a criterion that you select. You can also choose to look at model summary statistics for each step in the procedure. |
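To get a feel for the difference in the number of models considered, the following sketch counts the candidate models for 31 free predictors. It assumes each free predictor is simply in or out of a model, and that a pure forward selection adds one term per step until none remain. The variable names are hypothetical.

```python
from math import comb

free_predictors = 31

# Best subsets: one model per subset of the free predictors, grouped by size.
by_size = [comb(free_predictors, size) for size in range(free_predictors + 1)]
total_subsets = sum(by_size)
print(total_subsets)  # 2147483648, which equals 2**31 (includes the empty model)

# Forward selection: at most 31 candidate fits at step 1, 30 at step 2, and so on.
max_forward_fits = sum(range(1, free_predictors + 1))
print(max_forward_fits)  # 496
```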