Multicollinearity in regression

Multicollinearity in regression is a condition that occurs when some predictor variables in the model are correlated with other predictor variables.

What is multicollinearity?

Severe multicollinearity is problematic because it can increase the variance of the regression coefficients, making them unstable. The following are some of the consequences of unstable coefficients:
  • Coefficients can seem to be insignificant even when a significant relationship exists between the predictor and the response.
  • Coefficients for highly correlated predictors will vary widely from sample to sample.
  • Removing one of the highly correlated terms from the model greatly affects the estimated coefficients of the other highly correlated terms. Coefficients of the highly correlated terms can even have the wrong sign.

To measure multicollinearity, you can examine the correlation structure of the predictor variables. You can also examine the variance inflation factors (VIFs) of the regression coefficients in the model. A VIF measures how much the variance of an estimated regression coefficient is inflated compared to a model in which the predictors are uncorrelated. If all of the VIFs are 1, there is no multicollinearity; VIFs greater than 1 indicate that the predictors are correlated. When a VIF is greater than 5, the regression coefficient for that term is not estimated well.
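
As a rough sketch of this check, the VIFs and the correlation matrix can be computed in Python with pandas and statsmodels; the column names and values below are hypothetical stand-ins for your own predictors.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Hypothetical predictor data; replace with your own columns.
    X = pd.DataFrame({
        "strength": [44, 50, 38, 62, 55, 47, 58, 41],
        "breakage": [9, 7, 12, 4, 6, 10, 5, 11],
        "price":    [20, 25, 18, 30, 27, 22, 29, 19],
    })

    # Add the intercept column, because the usual VIF calculation assumes
    # the model contains a constant term.
    X_const = sm.add_constant(X)

    # VIF for each predictor (the constant itself is skipped).
    vifs = {col: variance_inflation_factor(X_const.values, i)
            for i, col in enumerate(X_const.columns) if col != "const"}
    print(vifs)

    # The correlation structure of the predictors is also worth examining.
    print(X.corr())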

Another measure of multicollinearity is the condition number. Minitab provides the condition number in the expanded table for Best Subsets Regression. The condition number assesses the multicollinearity for an entire model rather than individual terms. The larger the condition number, the more multicollinear the terms in the model are. Montgomery, Peck, and Vining¹ suggest that a condition number larger than 100 indicates moderate multicollinearity. When the multicollinearity is moderate or worse, you should use the VIFs and the correlation structure of the data to investigate the relationships among the terms in the model.
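
A condition number can also be computed directly from the predictor data. The sketch below follows one common convention from Montgomery, Peck, and Vining (scale the centered predictor columns to unit length and take the ratio of the largest to smallest eigenvalue of X'X); Minitab's exact computation may differ, and the data here are hypothetical.

    import numpy as np

    # Hypothetical predictor columns; replace with your own data.
    X = np.array([
        [44, 9, 20],
        [50, 7, 25],
        [38, 12, 18],
        [62, 4, 30],
        [55, 6, 27],
        [47, 10, 22],
        [58, 5, 29],
        [41, 11, 19],
    ], dtype=float)

    # Center each column and scale it to unit length so that X'X is in
    # correlation form.
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.linalg.norm(Xc, axis=0)

    # Condition number: ratio of the largest to smallest eigenvalue of X'X.
    eigvals = np.linalg.eigvalsh(Xs.T @ Xs)
    print(eigvals.max() / eigvals.min())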

If the correlation of a predictor with other predictors is nearly perfect, Minitab displays a message that the term cannot be estimated. The VIF values for terms that cannot be estimated typically exceed one billion.

Multicollinearity does not affect the goodness of fit or the goodness of prediction. The coefficients (the linear discriminant function) cannot be interpreted reliably, but the fitted (classified) values are not affected.

Note

Multicollinearity has the same effect in discriminant analysis as in regression.

Ways to correct multicollinearity

Possible solutions to severe multicollinearity:
  • If you are fitting polynomials, subtract the mean of the predictor from the predictor values (that is, center the predictor) before creating the higher-order terms; see the sketch after this list.
  • Remove highly correlated predictors from the model. Because they supply redundant information, removing them often does not drastically reduce the R². Consider using stepwise regression, best subsets regression, or specialized knowledge of the data set to remove these variables.
  • Use Partial Least Squares or Principal Components Analysis. These methods reduce the number of predictors to a smaller set of uncorrelated components.
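
To illustrate the first item, the sketch below uses hypothetical data to show how centering a predictor before squaring it removes most of the correlation between the linear and quadratic terms.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(10, 20, size=100)   # hypothetical predictor

    # The raw quadratic term is almost perfectly correlated with x.
    print(np.corrcoef(x, x ** 2)[0, 1])

    # Centering x before squaring removes most of that correlation.
    xc = x - x.mean()
    print(np.corrcoef(xc, xc ** 2)[0, 1])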

For example, a toy manufacturer wants to predict customer satisfaction and includes "strength" and "lack of breakage" as predictor variables in the regression model. The investigator determines that these two variables are strongly negatively correlated and have VIFs greater than 5. At this point, the investigator could try removing either variable. The investigator could also use Partial Least Squares or Principal Components Analysis to combine these related variables into a single "durability" component.
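
A minimal sketch of that last step, assuming hypothetical survey data and using scikit-learn's principal components analysis to combine the two correlated predictors into a single "durability" score:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)

    # Hypothetical data: "strength" and "lack of breakage" are strongly
    # negatively correlated, mirroring the example above.
    strength = rng.normal(50, 5, size=200)
    lack_of_breakage = 100 - strength + rng.normal(0, 1, size=200)
    X = np.column_stack([strength, lack_of_breakage])

    # One principal component stands in for both predictors as a
    # combined "durability" score.
    pca = PCA(n_components=1)
    durability = pca.fit_transform(X)

    print(pca.explained_variance_ratio_)  # share of variance captured
    print(durability[:5])                 # first few component scores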

¹ Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to linear regression analysis (5th ed.). Hoboken, NJ: Wiley.