What is multicollinearity?

Multicollinearity in regression occurs when some predictor variables in the model are correlated with other predictor variables. Severe multicollinearity is problematic because it can inflate the variance of the regression coefficients, making them unstable. Unstable coefficients have the following consequences:
  • Coefficients can seem to be insignificant even when a significant relationship exists between the predictor and the response.
  • Coefficients for highly correlated predictors will vary widely from sample to sample.
  • Removing any highly correlated terms from the model will greatly affect the estimated coefficients of the other highly correlated terms. Coefficients of the highly correlated terms can even have the wrong sign.
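The instability is easy to see in a small simulation. The sketch below (invented data with numpy, not Minitab output) repeatedly fits the same model to random samples in which two predictors are nearly identical; the estimated coefficient for the first predictor swings widely from sample to sample even though its true value never changes:

  # Simulation of coefficient instability under severe multicollinearity.
  # All data are invented for illustration.
  import numpy as np

  rng = np.random.default_rng(1)
  n = 100
  coefs = []
  for _ in range(1000):
      x1 = rng.normal(size=n)
      x2 = x1 + rng.normal(scale=0.05, size=n)   # x2 is nearly identical to x1
      y = 2 * x1 + 3 * x2 + rng.normal(size=n)
      X = np.column_stack([np.ones(n), x1, x2])
      b, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
      coefs.append(b[1])

  # The x1 coefficient varies widely around its true value of 2,
  # and in some samples it even comes out with the wrong sign.
  print(np.mean(coefs), np.std(coefs))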

To measure multicollinearity, you can examine the correlation structure of the predictor variables. You can also examine the variance inflation factor (VIF), which measures how much the variance of an estimated regression coefficient increases when the predictors are correlated. For a given predictor, VIF = 1 / (1 - R2), where R2 is from the regression of that predictor on all of the other predictors. If the VIF = 1, there is no multicollinearity, but if the VIF is > 1, the predictors are correlated. When the VIF is > 5, the regression coefficients are not estimated well. Usually, you should remove highly correlated predictors from the model. Because the predictors supply redundant information, removing them often does not drastically reduce the R2.
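For a concrete check, here is a minimal numpy sketch (not Minitab code; vif is an invented helper name) that computes the VIF for each column of a predictor matrix directly from the definition above:

  import numpy as np

  def vif(X):
      """Return VIF = 1 / (1 - R2) for each column of the predictor matrix X."""
      n, p = X.shape
      out = []
      for j in range(p):
          others = np.delete(X, j, axis=1)
          A = np.column_stack([np.ones(n), others])   # intercept + other predictors
          coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
          resid = X[:, j] - A @ coef
          r2 = 1 - resid.var() / X[:, j].var()        # R2 of predictor j on the others
          out.append(1 / (1 - r2))
      return np.array(out)

Values near 1 indicate little multicollinearity; values above 5 flag coefficients that are not estimated well.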

If the correlation of a predictor with other predictors is nearly perfect, Minitab displays a message that the term cannot be estimated. The VIF values for terms that cannot be estimated typically exceed one billion.

Note

Multicollinearity does not affect the goodness of fit or the goodness of prediction.

Ways to correct multicollinearity

Possible solutions to severe multicollinearity:
  • If you are fitting a quadratic or cubic model in simple regression, subtract the mean of the predictor from the predictor values.
  • Instead of multiple linear regression, use partial least squares regression or principal components analysis. These methods reduce the predictors to a smaller set of uncorrelated components. Minitab Statistical Software includes both methods.
  • In multiple linear regression, consider whether to remove highly correlated predictors from the model. When the predictors supply redundant information, R2 does not decrease drastically when you remove correlated predictors. Consider using stepwise regression, best subsets regression, or specialized knowledge of the data set to remove these predictors.
For example, a toy manufacturer wants to predict customer satisfaction and includes "lack of breakage" as a predictor variable in the regression model. The investigator determines that the relationship of this variable to customer satisfaction is curved, so the investigator fits a cubic model. The VIF values for the terms in the cubic model all exceed 5,000, so the investigator worries that multicollinearity affects the results. The investigator follows these steps in Minitab Express to subtract the mean of the predictor from the predictor values:
  1. Open the standardize dialog box.
    • Mac: Data > Standardize
    • PC: DATA > Standardize
  2. In Standardize the following columns, enter lack of breakage.
  3. In Method, select Subtract the mean.
  4. Click OK.

After subtracting the mean, the investigator repeats the analysis with the centered predictor. The VIF values fall below 10. Although the VIF values are still large, the investigator is more confident in the results now that the multicollinearity is lower.
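The effect the investigator sees can be reproduced outside Minitab. The sketch below invents its own positive-valued predictor and reuses the vif helper defined earlier; the raw terms x, x2, and x3 are almost perfectly correlated, while the centered terms are much less so:

  # Assumes the vif helper sketched earlier is defined; the data are invented.
  rng = np.random.default_rng(2)
  x = rng.uniform(5, 10, size=200)              # raw predictor, strictly positive
  raw = np.column_stack([x, x**2, x**3])

  cx = x - x.mean()                             # Minitab Express: Subtract the mean
  centered = np.column_stack([cx, cx**2, cx**3])

  print(vif(raw))       # very large VIFs for the raw polynomial terms
  print(vif(centered))  # far smaller VIFs after centering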

How Minitab identifies and removes highly correlated predictors from the regression equation

To remove highly correlated predictors from a regression equation, Minitab performs the following steps:
  1. Minitab performs a QR decomposition on the X-matrix.
    Note

    Using the QR decomposition to calculate R2 is quicker than using least-squares regression.

  2. Minitab regresses a predictor on all other predictors and calculates the R2 value. If 1 - R2 < 4 * 2.22e-16 (four times the machine epsilon), the predictor fails the test and is removed from the model.
  3. Minitab repeats steps 1 and 2 for the remaining predictors.
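The rule can be sketched in a few lines of numpy (an illustration of the logic described above, not Minitab's actual implementation; remove_collinear is an invented name):

  import numpy as np

  def remove_collinear(X, names):
      """Drop predictors whose 1 - R2 against the others is below 4 * machine epsilon."""
      eps = np.finfo(float).eps               # about 2.22e-16
      keep = list(range(X.shape[1]))
      for j in reversed(range(X.shape[1])):   # test the last predictor first
          others = [k for k in keep if k != j]
          A = np.column_stack([np.ones(len(X))] + [X[:, k] for k in others])
          coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
          resid = X[:, j] - A @ coef
          r2 = 1 - resid.var() / X[:, j].var()
          if 1 - r2 < 4 * eps:                # nearly perfect correlation: fails the test
              keep.remove(j)
      return [names[k] for k in keep]

In the five-predictor example that follows, this kind of check is applied to X5 first, then X4, and so on.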

Example

Suppose a model contains the predictors X1, X2, X3, X4, and X5, and the response Y. Minitab does the following:
  1. Minitab regresses X5 on X1-X4. Suppose 1 - R2 for this regression is greater than 4 * 2.22e-16. X5 passes the test and remains in the equation.
  2. Minitab regresses X4 on X1, X2, X3, and X5. Suppose 1 - R2 for this regression is also greater than 4 * 2.22e-16. X4 passes the test and remains in the equation.
  3. Minitab regresses X3 on X1, X2, X4, and X5 and calculates the R2 value. Because 1 - R2 is less than 4 * 2.22e-16, X3 fails the test and is removed from the equation.
  4. Minitab performs a new QR decomposition on the X-matrix and regresses X2 on the remaining predictors, X1, X4, and X5. X2 passes the test.
  5. Minitab regresses X1 on X2, X4, and X5. X1 fails the test and is removed from the equation.

Minitab regresses Y on X2, X4, and X5. The results include a message saying that predictors X1 and X3 cannot be estimated and were removed from the model.
