What is multicollinearity?

Multicollinearity in regression occurs when some predictor variables in the model are correlated with other predictor variables. Severe multicollinearity is problematic because it can increase the variance of the regression coefficients, making them unstable. The following are some of the consequences of unstable coefficients:
  • Coefficients can seem to be insignificant even when a significant relationship exists between the predictor and the response.
  • Coefficients for highly correlated predictors will vary widely from sample to sample (illustrated in the sketch after this list).
  • Removing any highly correlated terms from the model will greatly affect the estimated coefficients of the other highly correlated terms. Coefficients of the highly correlated terms can even have the wrong sign.
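
As a small illustration of the second consequence, the following Python sketch (synthetic data, not Minitab output) repeatedly generates two nearly identical predictors and refits the model. The true coefficients are 2 and 3 throughout, yet the estimates swing widely from sample to sample.

    import numpy as np

    # Synthetic illustration: x2 is almost a copy of x1, so the two
    # coefficients are estimated very imprecisely even though their sum is not.
    rng = np.random.default_rng(1)
    for _ in range(5):
        n = 50
        x1 = rng.normal(size=n)
        x2 = x1 + rng.normal(scale=0.05, size=n)   # x2 nearly duplicates x1
        y = 2 * x1 + 3 * x2 + rng.normal(size=n)   # true coefficients: 2 and 3
        X = np.column_stack([np.ones(n), x1, x2])
        b = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary least squares
        print(np.round(b[1:], 2))                  # estimates for x1 and x2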

To measure multicollinearity, you can examine the correlation structure of the predictor variables. You can also examine the variance inflation factors (VIF). The VIFs measure how much the variance of an estimated regression coefficient is inflated because the predictors are correlated. For a given predictor, VIF = 1 / (1 - R2), where R2 comes from regressing that predictor on all of the other predictors. If all of the VIFs are 1, there is no multicollinearity; if some VIFs are greater than 1, the predictors are correlated. When a VIF is greater than 5, the regression coefficient for that term is not estimated well. If the correlation of a predictor with the other predictors is nearly perfect, Minitab displays a message that the term cannot be estimated. The VIF values for terms that cannot be estimated typically exceed one billion.
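
As a rough sketch (using made-up data, not Minitab), you can compute VIFs directly from the definition above: regress each predictor on the others and take 1 / (1 - R2).

    import numpy as np

    def vifs(X):
        """Return the VIF for each column of the predictor matrix X."""
        n, p = X.shape
        out = []
        for j in range(p):
            y = X[:, j]
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            yhat = others @ np.linalg.lstsq(others, y, rcond=None)[0]
            r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
            out.append(1.0 / (1.0 - r2))
        return np.array(out)

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.1, size=100)   # strongly correlated with x1
    x3 = rng.normal(size=100)                   # unrelated to the others
    print(np.round(vifs(np.column_stack([x1, x2, x3])), 1))
    # x1 and x2 get large VIFs; x3 stays near 1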

Multicollinearity does not affect the goodness of fit or the goodness of prediction. The coefficients (in discriminant analysis, the linear discriminant function) cannot be interpreted reliably, but the fitted (classified) values are not affected.

Note

Multicollinearity has the same effect in discriminant analysis as in regression.

How Minitab removes highly correlated predictors from the regression equation

To remove highly correlated predictors from a regression equation, Minitab does the following steps:
  1. Minitab performs a QR decomposition on the X-matrix.
    Note

    Using the QR decomposition to calculate R2 is quicker than using least-squares regression.

  2. Minitab regresses a predictor on all other predictors and calculates the R2 value. If 1 - R2 < 4 * 2.22e-016 (four times the machine epsilon for double-precision arithmetic), the predictor fails the test and is removed from the model. A sketch of this check appears after these steps.
  3. Minitab repeats steps 1 and 2 for the remaining predictors.
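
The following Python sketch mimics this screening rule on made-up data; it is an illustration of the rule described above, not Minitab's internal code. Each remaining predictor is regressed on the others through a QR decomposition, and a predictor is dropped when 1 - R2 falls below 4 * 2.22e-016.

    import numpy as np

    def screen_predictors(X, names):
        """Drop predictors whose 1 - R2 (against the rest) is below 4 * machine epsilon."""
        eps = np.finfo(float).eps                   # 2.22e-16 in double precision
        keep = list(range(X.shape[1]))
        removed = []
        for j in reversed(range(X.shape[1])):       # check the last predictor first
            others = [k for k in keep if k != j]
            A = np.column_stack([np.ones(len(X)), X[:, others]])
            q, _ = np.linalg.qr(A)                  # QR decomposition of the X-matrix
            y = X[:, j]
            yhat = q @ (q.T @ y)                    # fit of X_j on the remaining predictors
            r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
            if 1 - r2 < 4 * eps:                    # fails the tolerance test
                keep.remove(j)
                removed.append(names[j])
        return [names[k] for k in keep], removed

    rng = np.random.default_rng(2)
    x1, x2, x4, x5 = rng.normal(size=(4, 30))
    x3 = x1 + x2                                    # X3 is an exact linear combination
    X = np.column_stack([x1, x2, x3, x4, x5])
    print(screen_predictors(X, ["X1", "X2", "X3", "X4", "X5"]))
    # -> (['X1', 'X2', 'X4', 'X5'], ['X3'])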

Example

Suppose a model contains the predictors X1, X2, X3, X4, and X5, and the response Y. Minitab does the following:
  1. Minitab regresses X5 on X1-X4 and calculates the R2 value. Because 1 - R2 is greater than 4 * 2.22e-016, X5 passes the test and remains in the equation.
  2. Minitab regresses X4 on X1, X2, X3, and X5. Suppose 1 - R2 for this regression is also greater than 4 * 2.22e-016; X4 passes the test and remains in the equation.
  3. Minitab regresses X3 on X1, X2, X4, and X5 and calculates the R2 value. X3 fails the test and is removed from the equation.
  4. Minitab performs a new QR decomposition on the X-matrix and regresses X2 on the remaining predictors, X1, X4, and X5. X2 passes the test and remains in the equation.
  5. Minitab regresses X1 on X2, X4, and X5. X1 fails the test and is removed from the equation.

Minitab regresses Y on X2, X4, and X5. The results include a message saying that predictors X1 and X3 cannot be estimated and were removed from the model.

Note

You can use the TOLERANCE subcommand with the REGRESS session command to force Minitab to keep in the model a predictor that is highly correlated with another predictor. However, lowering the tolerance can be dangerous, possibly producing numerically inaccurate results.

Ways to correct multicollinearity

Possible solutions to severe multicollinearity:
  • If you are fitting polynomials, subtract the mean of the predictor from the predictor values before you create the polynomial terms (this is called centering; see the sketch that follows this list).
  • Remove highly correlated predictors from the model. Because they supply redundant information, removing them often does not drastically reduce the R2. Consider using stepwise regression, best subsets regression, or specialized knowledge of the data set to remove these variables.
  • Use Partial Least Squares or Principal Components Analysis. These methods reduce the number of predictors to a smaller set of uncorrelated components.
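
For the first bullet, a small numerical sketch (made-up data) shows why centering helps when you fit polynomial terms: subtracting the mean of the predictor makes x and its square nearly uncorrelated.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(10, 20, size=200)                # predictor far from zero
    xc = x - x.mean()                                # centered predictor

    print(round(np.corrcoef(x, x ** 2)[0, 1], 3))    # nearly 1: x and x^2 are collinear
    print(round(np.corrcoef(xc, xc ** 2)[0, 1], 3))  # near 0 after centering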

For example, a toy manufacturer wants to predict customer satisfaction and includes "strength" and "lack of breakage" as predictor variables in the regression model. The investigator determines that these two variables are strongly negatively correlated and have a VIF greater than 5. At this point, the investigator could try removing either variable. The investigator could also use Partial Least Squares or Principal Components Analysis to use these related variables to create a "durability" component.
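
A hedged sketch of the last idea, with invented data and variable names: the two correlated ratings are centered and combined into a single principal-component score that could play the role of the "durability" component.

    import numpy as np

    rng = np.random.default_rng(4)
    strength = rng.normal(size=100)
    lack_of_breakage = -strength + rng.normal(scale=0.2, size=100)  # strong negative correlation

    Z = np.column_stack([strength, lack_of_breakage])
    Z = Z - Z.mean(axis=0)                     # center the predictors
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    durability = Z @ vt[0]                     # score on the first principal component

    print(round(np.corrcoef(strength, lack_of_breakage)[0, 1], 2))  # about -0.98
    print(np.round(durability[:5], 2))         # one "durability" score per toy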
