Best Subsets Regression

Summary

Provides a method of evaluating multiple process inputs without the use of a designed experiment. Best subsets regression is a highly automated "black-box" solution that determines which inputs provide the best predictive model for the output.

Answers the questions:
  • Which process inputs have the largest effects on the process output (which inputs are the key inputs)?
  • Do any important interactions exist between process inputs?
  • How much of the variation in the process output can be explained by varying the process inputs?

When to Use: Mid-project
Purpose: In projects where testing of multiple inputs is not done using a designed experiment, use best subsets regression to determine which inputs are the key inputs.

Data

Continuous Y, numeric X's (Note: You can convert categorical X's into indicator variables.)
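
As a rough illustration of the note above, a categorical X can be turned into 0/1 indicator columns. The sketch below is plain Python; the helper name `to_indicators`, the `is_` column prefix, and the `shift` data are hypothetical and not part of Minitab. It drops one category as the reference level, so a variable with k categories yields k - 1 indicator columns.

```python
def to_indicators(values, reference=None):
    """Convert category labels into 0/1 indicator columns.

    One column is built per category except the reference level,
    which is left out so the columns are not redundant with the intercept.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: first level alphabetically is the baseline
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }


# Hypothetical categorical input with three levels
shift = ["day", "night", "swing", "day", "night"]
indicators = to_indicators(shift)
for name, column in indicators.items():
    print(name, column)
```

Each resulting column can then be entered as a numeric X alongside the other predictors.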

How-To

  1. Verify that the measurement systems for the Y data and the process inputs are adequate.
  2. Develop a data collection strategy (who will collect the data, where and when, the required precision of the data, how to record the data, and so on).
  3. Enter the Y (response) data into a single column.
  4. In other columns, enter the input (X, or predictor) data, one column for each X.
  5. If you want to include squared terms in the model, you must manually create the squared terms by multiplying an X variable by itself and storing the result in a new column.
  6. If you want to include interactions between X variables in the model, you must manually create them by multiplying the appropriate X variables and storing the result in a new column. Repeat this step for each desired interaction.
  7. In Minitab, use Stat > Regression > Best Subsets.
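
To make steps 3 through 7 concrete, here is a minimal sketch of the idea outside Minitab, in plain Python. It is a brute-force enumeration written for illustration and is not Minitab's algorithm. The data are made up, the names `solve`, `sse`, and `best_subsets` are hypothetical, and `keep=2` is chosen to mirror the "best two models per size" behavior described in the guidelines. The response is built without noise from `x1` and the squared term `x2_sq`, so that pair should fit exactly.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def sse(columns, y):
    """Residual sum of squares for a least-squares fit of y on columns plus an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def best_subsets(terms, y, keep=2):
    """For each subset size, return the `keep` subsets with the highest R-squared."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    names = list(terms)
    results = {}
    for size in range(1, len(names) + 1):
        scored = []
        for subset in combinations(names, size):
            r2 = 1 - sse([terms[name] for name in subset], y) / sst
            scored.append((r2, subset))
        scored.sort(reverse=True)
        results[size] = scored[:keep]
    return results


# Hypothetical worksheet: one column per X, plus manually created terms (steps 5 and 6)
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [3, 1, 4, 1, 5, 9, 2, 6]
terms = {
    "x1": x1,
    "x2": x2,
    "x2_sq": [v * v for v in x2],               # squared term
    "x1_x2": [a * b for a, b in zip(x1, x2)],   # interaction term
}
y = [3 + 2 * a + 0.5 * b * b for a, b in zip(x1, x2)]  # noise-free response for the demo

best = best_subsets(terms, y)
for size, models in best.items():
    for r2, subset in models:
        print(size, subset, round(r2, 4))
```

Exhaustive enumeration fits 2^k - 1 models for k candidate terms, so this approach is practical only for a small number of terms.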

Guidelines

  • Samples should be taken across the entire inference space.
  • Do not extrapolate; in other words, do not use the equation to predict Y values outside the range of sampled X's.
  • The residuals must be independent, reasonably normal, and have reasonably equal variances. To check these assumptions, you must manually fit the model that you select from the best subsets output using the multiple regression tool.
  • The best subsets method presents many competing regression models. The r-squared value, adjusted r-squared value, standard deviation, and Mallows' Cp statistic are used to evaluate these competing models. The value of this method is that it shows the analyst how much explanatory power is gained or lost when an alternative solution is chosen (either a model with fewer terms or a model with different terms).
  • The default is to provide the best two 1-variable solutions, the best two 2-variable solutions, and so on, up to and including the evaluation of the model with all terms included. The best subsets method does not provide the regression coefficients or identify outliers; therefore, you must run a standard multiple regression on the selected model.
  • When you compare models with different numbers of terms, use the adjusted r-squared value for comparison rather than the r-squared value.
  • Generally, the best subsets method is not as effective in handling large numbers of highly correlated factors as stepwise regression.
  • To evaluate interactions or squared terms, you must manually create them in the Minitab worksheet using Calc > Calculator.
  • When you convert a categorical variable to indicator variables, you create one indicator for each category. To properly model differences between categories you should use all but one of these indicator variables. If you use best subsets regression with indicator variables, the best subsets algorithm might not include the right number of indicator variables in many of the solutions presented. If the algorithm omits some indicator variables, you will not have complete information about the categorical variables. In that case, you should manually run the regression using the multiple regression tool.
  • If you have discrete numeric data from which you can obtain every equally spaced value, and you have measured at least 10 possible values, the data can often be evaluated as though they are continuous.
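
The comparison statistics named in the guidelines can be computed from sums of squares. The sketch below uses the common textbook definitions, which are assumed here to correspond to the values that Minitab reports; the function name `model_stats` and all of the numbers are hypothetical. Here `p` is the number of predictors in the candidate model (excluding the intercept), and `p_full` is the number of predictors in the model with all terms.

```python
from math import sqrt


def model_stats(sse, sst, n, p, sse_full, p_full):
    """Return r-squared, adjusted r-squared, S, and Mallows' Cp for a candidate model.

    sse, sst  : error and total sums of squares for the candidate model
    n         : number of observations
    p         : number of predictors in the candidate model (intercept excluded)
    sse_full  : error sum of squares of the model with all terms
    p_full    : number of predictors in the model with all terms
    """
    df_error = n - p - 1
    mse_full = sse_full / (n - p_full - 1)
    return {
        "r_sq": 1 - sse / sst,
        "adj_r_sq": 1 - (sse / df_error) / (sst / (n - 1)),
        "s": sqrt(sse / df_error),
        "cp": sse / mse_full - (n - 2 * (p + 1)),
    }


# Made-up sums of squares: 20 observations, 4 candidate terms in total
small = model_stats(sse=12.0, sst=100.0, n=20, p=2, sse_full=10.0, p_full=4)
full = model_stats(sse=10.0, sst=100.0, n=20, p=4, sse_full=10.0, p_full=4)
print(small)
print(full)
```

With these definitions, Cp for the model with all terms always equals its number of parameters (p_full + 1), so Cp is mainly informative for the smaller models, where a value close to p + 1 suggests little bias from the omitted terms.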