# Regression analysis

Use regression analysis to describe the statistical relationship between one or more predictors and the response variable and to predict new observations.

## Best subsets regression

Use best subsets regression to evaluate multiple process inputs without a designed experiment. Best subsets regression is a highly automated, "black-box" method that identifies which combinations of inputs provide the best predictive models for the output. It helps answer the following questions:

• Which process inputs have the largest effects on the process output (which inputs are the key inputs)?
• Do any important interactions exist between process inputs?
• How much of the variation in the process output can be explained by varying the process inputs?

| When to Use | Purpose |
| --- | --- |
| Mid-project | In projects where testing of multiple inputs is not done using a designed experiment, use best subsets regression to determine which inputs are the key inputs. |

### Data

Y must be continuous, and the Xs must be numeric. You can convert categorical Xs into indicator variables.
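The conversion of a categorical X into indicator variables can be sketched outside Minitab as follows. This is a minimal Python illustration, not part of the Minitab workflow; the function name `indicator_columns`, the `is_` column prefix, and the shift data are hypothetical. It drops one category as the baseline, so the model uses all but one of the indicators.

```python
def indicator_columns(values, drop_first=True):
    """Return {column_name: list of 0/1 flags} for one categorical column."""
    categories = sorted(set(values))
    # Drop one category (the baseline) so that the indicators are not
    # perfectly collinear with the constant term of the regression.
    kept = categories[1:] if drop_first else categories
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in kept}


shift = ["day", "night", "swing", "day", "swing"]  # hypothetical categorical X
cols = indicator_columns(shift)
for name, column in cols.items():
    print(name, column)
# "day" is the baseline: a row with all indicators at 0 is a day-shift row.
```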

### Guidelines

• Samples should be taken across the entire inference space.
• Do not extrapolate; in other words, do not use the equation to predict Y values outside the range of sampled X values.
• The residuals must be independent, reasonably normal, and have reasonably equal variances. To check these assumptions, you must manually run the selected model identified by the best subsets regression using the multiple regression tool.
• The best subsets method presents many competing regression models. Use the r-squared value, adjusted r-squared value, standard deviation of the residuals (S), and Mallows' Cp statistic to evaluate these competing models. The value of this method is that it shows the analyst how much explanatory power is gained or lost when an alternative solution is chosen (either a model with fewer terms or a model with different terms).
• The default is to provide the best two 1-variable solutions, the best two 2-variable solutions, and so on, up to and including the evaluation of the model with all terms included. The best subsets method does not provide the regression coefficients or identify outliers; therefore, you must run a standard multiple regression on the selected model.
• When you compare models with different numbers of terms, use the adjusted r-squared value for comparison rather than the r-squared value.
• Generally, best subsets regression is less effective than stepwise regression at handling large numbers of highly correlated factors.
• To evaluate interactions or squared terms, use the Minitab calculator to create them.
• When you convert a categorical variable to indicator variables, you create one indicator for each category. To properly model differences between categories you should use all but one of these indicator variables. If you use best subsets regression with indicator variables, the best subsets algorithm might not include the right number of indicator variables in many of the solutions presented. If the algorithm omits some indicator variables, you will not have complete information about the categorical variables. In that case, you should manually run the regression using the multiple regression tool.
• If you have discrete numeric data from which you can obtain every equally spaced value, and you have measured at least 10 possible values, your data often are evaluated as though they are continuous.
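To make the comparison statistics in these guidelines concrete, the following Python sketch enumerates every subset of predictors, fits each by ordinary least squares, and keeps the best two models of each size by r-squared. It is an illustration under stated assumptions, not Minitab's algorithm: the helper names (`solve`, `sse`, `best_subsets`) and the data are hypothetical, and the fit uses plain normal equations, which is adequate only for small, well-conditioned examples.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def sse(columns, y):
    """Residual sum of squares for y regressed on a constant plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def best_subsets(predictors, y, keep=2):
    """Return (terms, r2, adj_r2, cp, s) for the best `keep` models per size."""
    n = len(y)
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    names = list(predictors)
    # Mallows' Cp is scaled by the mean square error of the full model.
    mse_full = sse([predictors[nm] for nm in names], y) / (n - len(names) - 1)
    results = []
    for size in range(1, len(names) + 1):
        fits = []
        for terms in combinations(names, size):
            e = sse([predictors[nm] for nm in terms], y)
            p = size + 1  # number of coefficients, including the constant
            r2 = 1 - e / sst
            adj_r2 = 1 - (e / (n - p)) / (sst / (n - 1))
            cp = e / mse_full - n + 2 * p
            s = (e / (n - p)) ** 0.5
            fits.append((terms, r2, adj_r2, cp, s))
        fits.sort(key=lambda f: f[1], reverse=True)
        results.extend(fits[:keep])
    return results


# Hypothetical data: y depends on x1 and x2 plus noise; x3 is unrelated.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
    "x3": [3.0, 4.0, 8.0, 7.0, 1.0, 2.0, 6.0, 5.0],
}
y = [6.1, 7.3, 11.15, 12.45, 16.2, 17.4, 20.85, 22.55]

results = best_subsets(predictors, y)
for terms, r2, adj_r2, cp, s in results:
    print(f"{len(terms)}  {r2:.4f}  {adj_r2:.4f}  {cp:6.2f}  {s:.4f}  {', '.join(terms)}")
```

In this made-up data set, the two-term model with x1 and x2 has a higher adjusted r-squared than the full three-term model, which is the kind of trade-off the adjusted r-squared guideline is meant to expose. Like the tool itself, the sketch reports no coefficients or outliers, so the chosen model still needs a standard multiple regression run.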

### How-to

1. Verify that the measurement systems for the Y data and the process inputs are adequate.
2. Develop a data collection strategy (who will collect the data, as well as where and when; the required precision of the data; how to record the data; and so on).
3. Enter the Y (response) data into a single column.
4. In other columns, enter the input (X, or predictor) data, one column for each X.
5. If you want to include squared terms in the model, you must manually create the squared terms by multiplying an X variable by itself and storing the result in a new column.
6. If you want to include interactions between X variables in the model, you must manually create them by multiplying the appropriate X variables and storing the result in a new column. Repeat this step for each desired interaction.
7. Perform a regression analysis in Minitab.
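Steps 5 and 6 amount to element-wise multiplication of columns. As a minimal sketch of the same arithmetic outside the Minitab calculator (the column names and values are hypothetical):

```python
temperature = [150.0, 160.0, 170.0, 180.0]  # hypothetical X1 column
pressure = [2.0, 2.5, 2.0, 2.5]             # hypothetical X2 column

# Step 5: squared term, X1 multiplied by itself, stored as a new column.
temperature_sq = [t * t for t in temperature]

# Step 6: interaction term, X1 multiplied by X2, stored as a new column.
temp_x_pressure = [t * p for t, p in zip(temperature, pressure)]

print(temperature_sq)
print(temp_x_pressure)
```

Each derived column then enters the analysis as one more predictor alongside the original X columns.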

## Fitted line plot

Use the fitted line plot to evaluate linear, quadratic, or cubic relationships by plotting one Y versus one X. The best-fit regression line is displayed with optional confidence and prediction intervals around it. The confidence interval applies to the regression equation (the mean response), and the prediction interval applies to individual points. The fitted line plot helps answer the following questions:

• Is any relationship between two variables, Y and X, apparent?
• How strong is the relationship between Y and X?
• What value of a process input X results in the optimal process output Y?

| When to Use | Purpose |
| --- | --- |
| Start of project | Compare a proposed or existing gage to a highly qualified device (such as a certified lab). A high r-squared (R-Sq) value provides reasonable certainty of the test gage matching the reference gage. |
| Start of project | Use to help develop alternative measurement systems for cases in which a variable is difficult or expensive to measure. Use highly correlated and logically linked alternative variables as substitute variables. |
| Mid-project | Investigate the relationship between a process input and the process output to help decide to either keep the input as a potential leverage variable or set it aside as most likely not important. |
| Mid-project | Evaluate two inputs to identify whether they duplicate the same information. For example, inputs of Degree Obtained and Years of School are likely to explain the same variation of the output, so one of them can be eliminated. This is used primarily in multiple regression analysis with many variables. |
| End of project | Verify the measurement system. If you use a fitted line plot earlier as part of the validation of the measurement system, create another one with the improved process to again validate the measurement system. |

### Data

Y must be continuous, and you must have one continuous or discrete X (with multiple levels).

### Guidelines

• Samples should be taken across the entire inference space. Do not extrapolate by using the equation to predict Y values outside the range of sampled X's. The graphical output also helps identify outliers.
• The residuals must be independent, reasonably normal, and have reasonably equal variances. The fitted line plot uses regression, which is quite robust to nonnormality. Analyze the residuals using a histogram, normal probability plot, plot versus fits, and plot versus order, which can be run at one time using the Four-in-one option.
• If you have discrete numeric data from which you can obtain every equally spaced value and you have measured at least 10 possible values, you can evaluate these data as if they are continuous.

### How-to

1. Collect data from your process over its expected range of input values. Enter the input values into one column and the output into a second column.
2. Select a linear, quadratic, or cubic analysis.
3. Optionally, add confidence and/or prediction intervals around the regression line.
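For the linear fit, the two optional intervals in step 3 have standard closed forms. They are given here as a reference sketch, using notation that does not appear in the original text: $n$ observations, $\hat{y}_0$ the fitted value at $x_0$, $s$ the residual standard deviation, and $t_{\alpha/2,\,n-2}$ the Student's t critical value. The quadratic and cubic fits use the matrix analogue of the same expressions.

```latex
\begin{aligned}
\text{Confidence interval (mean response):}\quad
  & \hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\,
    \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}} \\
\text{Prediction interval (single new observation):}\quad
  & \hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\,
    \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}} \\
\text{where}\quad
  & S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2, \qquad
    s = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-2}}
\end{aligned}
```

The extra 1 under the square root makes the prediction interval wider than the confidence interval at every $x_0$. Both intervals widen as $x_0$ moves away from $\bar{x}$, which is one reason predictions near or beyond the edge of the sampled range are less reliable.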

## Multiple regression

Use multiple regression to evaluate multiple process inputs without a designed experiment. Regression is a mathematical method for establishing the best-fit relationship between a process output Y and multiple process inputs (X's, also called predictors). Multiple regression enables you to predict the output Y for any combination of input values (X's) within the sampled range. It helps answer the following questions:

• Which process inputs have the largest effects on the process output (which inputs are the key inputs)?
• Do any important interactions exist between process inputs?
• How much variation in the process output can be explained by varying the process inputs?
• What is the equation (Y = f(X)) relating the process output to the settings of the inputs?
• What settings of the key inputs result in the optimal process output?

| When to Use | Purpose |
| --- | --- |
| Mid-project | In projects where testing of multiple inputs is not done using a designed experiment, use multiple regression to determine which inputs are the key inputs, develop a predictive model using the key inputs, and find the optimal settings of the key inputs. |

### Data

Y must be continuous, and the Xs must be numeric. You can convert categorical Xs into indicator variables.

### Guidelines

• Samples should be taken across the entire inference space.
• Do not extrapolate; that is, do not use the equation to predict Y values outside the range of sampled X's. Check for possible outliers in the unusual observations table (Session window output).
• The residuals must be independent, reasonably normal, and have reasonably equal variances. Multiple regression is quite robust to nonnormality. For multiple regression, the residuals are usually analyzed by a histogram, normal probability plot, residuals versus fits, and residuals versus order. You can display these graphs at one time using the Four in one option.
• When comparing models with different numbers of terms, use the r-squared (adj) for comparison, not the r-squared.
• It is generally good practice to look at all pairwise relationships of the X's for possible multicollinearity with scatterplots, a matrix plot, or fitted line plots.
• Manually reducing to a final multiple regression model can be complex and, in the case of many inputs with high degrees of multicollinearity, can easily result in analysis error. In these cases, you may want to use stepwise or best subsets regression analysis.
• To evaluate interactions or squared terms, use the Minitab calculator to create them.
• When you convert a categorical variable to indicator variables, you create one indicator for each category. To properly model differences between categories, you should use all but one of these indicator variables.
• You can use one of three common methods to evaluate multiple regression results:
  • Manually analyze multiple regression using a combination of statistical measures (p-values to test statistical significance and variance inflation factor (VIF) values to check for multicollinearity) and graphical analysis of the correlation between the variables. Manually reducing to a final multiple regression model provides more understanding of the model and the relationships of the various X's.
  • Best subsets (separate tool – highly automated).
  • Stepwise (separate tool – highly automated).
• If you have discrete numeric data from which you can obtain every equally spaced value and you have measured at least 10 possible values, you can evaluate these data as if they are continuous.
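The pairwise check for multicollinearity recommended above is usually done graphically, but the same screening can be sketched numerically. The following Python illustration computes the Pearson correlation for every pair of X's and flags the strongly correlated pairs. The function names, the data, and the 0.8 cutoff are assumptions made for this example, not values taken from Minitab.

```python
from itertools import combinations
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / sqrt(saa * sbb)


def correlated_pairs(predictors, threshold=0.8):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold."""
    flagged = []
    for a, b in combinations(predictors, 2):
        r = pearson_r(predictors[a], predictors[b])
        if abs(r) >= threshold:  # threshold is an arbitrary screening cutoff
            flagged.append((a, b, r))
    return flagged


# Hypothetical predictor columns; engine_size is exactly 2 * car_weight.
predictors = {
    "car_weight": [1.0, 2.0, 3.0, 4.0, 5.0],
    "engine_size": [2.0, 4.0, 6.0, 8.0, 10.0],
    "tire_age": [3.0, 1.0, 4.0, 1.0, 5.0],
}
flagged = correlated_pairs(predictors)
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r:.3f}")
```

A flagged pair is a prompt to look at the scatterplot and decide which of the two X's makes more practical sense to keep. It is not a substitute for the VIF values from the fitted model, because pairwise correlations cannot detect an X that is explained by a combination of several other X's.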

### How-to

1. Verify that the measurement systems for the Y data and the process inputs are adequate.
2. Develop a data collection strategy (who should collect the data, as well as where and when; how many data values are needed; the required precision of the data; how to record the data; and so on).
3. Enter the Y data into a single column. These are the response data.
4. In other columns, enter the input (X) data, one column for each X. These are the predictor data.
5. To include squared terms in the model, you must manually create the squared terms by multiplying an X-variable by itself and storing the result in a new column.
6. To include interactions between X-variables in the model, you must manually create them by multiplying the appropriate X-variables and storing the result in a new column. Repeat this step for each desired interaction.
7. Perform a regression analysis in Minitab.
8. Reduce the model using p-values, variance inflation factor (VIF) values, and graphical analysis.
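The two statistics used most often when reducing a model, the adjusted r-squared and the VIF, have standard definitions. They are sketched here for reference, using notation that does not appear in the original text: $n$ observations, $p$ predictors (not counting the constant), $SSE$ the residual sum of squares, $SST$ the total sum of squares, and $R_j^2$ the r-squared obtained by regressing $X_j$ on all the other predictors.

```latex
\begin{aligned}
R^2 &= 1 - \frac{SSE}{SST}, \qquad
R^2_{\text{adj}} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}, \qquad
\mathrm{VIF}_j = \frac{1}{1-R_j^2}
\end{aligned}
```

Adding a term can never lower $R^2$, but it lowers $R^2_{\text{adj}}$ unless the drop in $SSE$ outweighs the lost degree of freedom, which is why the adjusted value is the one to compare across models of different sizes. A VIF of 1 means that $X_j$ is uncorrelated with the other predictors. Commonly cited rules of thumb treat values above roughly 5 to 10 as a sign of problematic multicollinearity, although the appropriate cutoff depends on the application.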

## Simple regression

Use simple regression to establish the best-fit, straight-line equation relating a process output Y to a process input X. Simple regression allows you to predict the value of the output Y for any value of the input X within the sampled range. It helps answer the following questions:

• Does a linear relationship exist between two variables (usually a process output Y and a process input X)?
• What is the equation (Y = f(X)) for the relationship between Y and X?
• How much of the variation in the output Y can be explained by varying the input X?
• What value of the process input X results in the optimal process output Y?

| When to Use | Purpose |
| --- | --- |
| Start of project | Can be very useful for comparing a proposed or existing gage to a highly qualified device (such as a certified lab). A high r-squared value provides reasonable certainty that the test gage matches the reference gage. |
| Start of project | Assists in developing alternative measurement systems in cases when a variable is difficult or expensive to measure; highly correlated and logically linked alternative variables can be used as substitute variables. |
| Mid-project | Investigate the relationship between a process input and the process output, either keeping the input as a potential leverage variable or setting it aside as most likely not important. |
| Mid-project | Find the optimal setting of the input variable. |
| End of project | If used earlier as part of the validation of the measurement system, it should be reapplied to the improved process to again validate the measurement system. |

### Data

Y must be continuous, and X must be numeric.

### Guidelines

• Take samples across the entire inference space.
• Do not extrapolate; do not use the equation to predict Y values outside the range of sampled X's.
• Check for possible outliers in the unusual observations table (Session Window output).
• The residuals must be independent, be reasonably normal, and have reasonably equal variances. Simple regression is quite robust to nonnormality. For simple regression, the residuals are usually analyzed by a histogram, normal probability plot, residuals versus fits, and residuals versus order, which can be run at one time using the Four in one option.
• When you evaluate only two variables, consider the fitted line plot as an alternative: it provides graphical output (from which you can easily identify outliers) and lets you quickly model quadratic and cubic relationships.
• If you have discrete numeric data from which you can obtain every equally spaced value and you have measured at least 10 possible values, you can evaluate these data as if they are continuous.

### How-to

1. Verify that the measurement systems for the Y data and the input X are adequate.
2. Develop a data collection strategy (who should collect the data, as well as where and when; how many data values are needed; the required precision of the data; how to record the data; and so on).
3. Enter the Y data into a single column. These are the response data.
4. In a second column, enter the input (X) data. These are the predictor data.
5. Perform a regression analysis in Minitab.
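The calculation behind step 5 can be sketched in a few lines of Python. This is an illustration of the ordinary least-squares formulas for a straight line, not Minitab's implementation; the function names and the data are hypothetical. The `predict` helper also enforces the guideline against extrapolation by refusing X values outside the sampled range.

```python
def simple_regression(x, y):
    """Return (intercept, slope, r_squared) of the least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sst = sum((yi - mean_y) ** 2 for yi in y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, 1 - sse / sst


def predict(x_new, intercept, slope, x_sampled):
    """Predict Y at x_new, refusing to extrapolate beyond the sampled X's."""
    if not min(x_sampled) <= x_new <= max(x_sampled):
        raise ValueError("x_new is outside the sampled range of X")
    return intercept + slope * x_new


x = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical process input
y = [5.0, 7.0, 9.0, 11.0, 13.0]   # hypothetical output, exactly 3 + 2x
intercept, slope, r_sq = simple_regression(x, y)
print(intercept, slope, r_sq)
print(predict(2.5, intercept, slope, x))
```

Real data will not fall exactly on a line as this example does, so the r-squared value will be below 1 and the residuals still need to be checked as described in the guidelines.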

## Stepwise regression

Use stepwise regression to evaluate multiple process inputs without a designed experiment. Stepwise regression is a highly automated, "black-box" method that determines which inputs should be included in a predictive model for the output. It also allows you to predict the value of the output Y for any combination of values of the inputs (X's) within the sampled range. It helps answer the following questions:

• Which process inputs have the largest effects on the process output (which inputs are the key inputs)?
• Do any important interactions exist between process inputs?
• How much of the variation in the process output can be explained by varying the process inputs?
• What is the equation (Y = f(X)) relating the process output to the settings of the inputs?
• What settings of the key inputs result in the optimal process output?

| When to Use | Purpose |
| --- | --- |
| Mid-project | In projects wherein you do not test multiple inputs using a designed experiment, you can use stepwise regression to determine which inputs are the key inputs, develop a predictive model using the key inputs, and find the optimal settings of the key inputs. |

### Data

Y must be continuous, and the Xs must be numeric. You can convert categorical Xs into indicator variables.

### Guidelines

• Take samples across the entire inference space.
• Do not extrapolate; do not use the equation to predict Y values outside the range of sampled X's.
• The residuals must be independent, be reasonably normal, and have reasonably equal variances. To check these assumptions, run your selected model identified by stepwise regression manually using the multiple regression tool.
• When comparing models with different numbers of terms, use the r-squared (adjusted) value for comparison rather than the r-squared value.
• Stepwise regression does not identify outliers.
• The (default) stepwise method first determines which single X explains the most variation in Y. With that X in the model, stepwise regression searches for the next best X to add as a second variable. It repeats this step until it can find no more X's that statistically add value. Then, stepwise regression does a backwards sweep as a check. The value of this method lies in handling data sets with large numbers of inputs and high degrees of multicollinearity.
• Stepwise regression has two important drawbacks:
  • Given two highly correlated X's, it may eliminate the X that a practical observer would have kept.
  • It may produce a slightly suboptimal solution.
• You should generally run best subsets regression as a check-and-balance method to evaluate alternative solutions and to ensure that a practical equation has been selected. For example, suppose that Car Weight and Engine Size are two highly correlated X's. Stepwise regression eliminates one of them, and its algorithm may choose to keep Car Weight. From a logical viewpoint, the analyst may recognize that Engine Size is the better term for understanding and applies more universally within the inference space. Best subsets regression can then show whether replacing Car Weight with Engine Size has only a minimal effect on the r-squared value.
• To evaluate interactions or squared terms, use the Minitab calculator to create them.
• When you convert a categorical variable to indicator variables, you create one indicator for each category. To properly model differences between categories, you should use all but one of these indicator variables. If you use stepwise regression with indicator variables, you should be aware that the stepwise algorithm may not include the right number of indicator variables. If some are left out, you will not have complete information about the categorical variables. In that case, you should manually run the regression using the multiple regression tool.
• If you have discrete numeric data from which you can obtain every equally spaced value and you have measured at least 10 possible values, you can evaluate these data as if they are continuous.
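The forward part of the stepwise procedure described in these guidelines can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Minitab's algorithm: it uses a partial F statistic with an assumed entry threshold (`f_to_enter=4.0`), it omits the backwards sweep, and the helper names and data are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def sse(columns, y):
    """Residual sum of squares for y regressed on a constant plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def forward_stepwise(predictors, y, f_to_enter=4.0):
    """Add one X at a time while the best candidate's partial F is large enough."""
    n = len(y)
    selected = []
    current_sse = sse([], y)  # constant-only model
    while len(selected) < len(predictors):
        candidates = [nm for nm in predictors if nm not in selected]
        # Residual sum of squares after adding each remaining candidate.
        trial = {nm: sse([predictors[s] for s in selected + [nm]], y)
                 for nm in candidates}
        best = min(trial, key=trial.get)
        df_resid = n - (len(selected) + 2)  # constant + selected + new term
        if df_resid <= 0:
            break
        if trial[best] == 0:
            f_stat = float("inf")  # perfect fit
        else:
            f_stat = (current_sse - trial[best]) / (trial[best] / df_resid)
        if f_stat < f_to_enter:
            break  # no remaining X adds enough value
        selected.append(best)
        current_sse = trial[best]
    return selected


# Hypothetical data: y depends on x1 and x2 plus noise; x3 is unrelated.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
    "x3": [3.0, 4.0, 8.0, 7.0, 1.0, 2.0, 6.0, 5.0],
}
y = [6.1, 7.3, 11.15, 12.45, 16.2, 17.4, 20.85, 22.55]

selected = forward_stepwise(predictors, y)
print(selected)
```

Because each step keeps only the single best candidate, a highly correlated alternative never appears in the result, which is the first drawback listed above and the reason a best subsets run is a useful cross-check.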

### How-to

1. Verify that the measurement systems for the Y data and the process inputs are adequate.
2. Develop a data collection strategy (who should collect the data, as well as where and when; how many data values are needed; the required precision of the data; how to record the data; and so on).
3. Enter the Y data into a single column. These are your response data.
4. In other columns, enter the input (X) data, one column for each X. These are the predictor data.
5. To include squared terms in the model, you must manually create the squared terms by multiplying an X-variable by itself and storing the result in a new column.
6. To include interactions between X-variables in the model, you must manually create them by multiplying the appropriate X-variables and storing the result in a new column. Repeat this for each desired interaction.
7. Perform a regression analysis in Minitab.