Ways to identify outliers in regression and ANOVA

In the context of model-fitting analyses, outliers are observations whose response or predictor values are unusual compared to the rest of the data. Minitab provides several ways to identify outliers, including residual plots and three stored statistics: leverages, Cook's distance, and DFITS. It is important to identify outliers because they can significantly affect your model and produce misleading or incorrect results. If you identify an outlier in your data, you should examine the observation to understand why it is unusual and identify an appropriate remedy.

Hi (leverage)

A leverage (Hi) measures the distance from an observation's x-value to the average of the x-values for all observations in a data set. Use leverages to identify observations that have unusual predictor values compared to the remaining data.

Observations with large leverage can have a large effect on the fitted value, and thus the regression model. For example, an observation that has a large leverage can cause a significant coefficient to seem insignificant. However, not every observation with large leverage is influential: a high-leverage point that follows the trend of the remaining data changes the fit very little.

Investigate observations with leverage values greater than 3p/n, where p is the number of model terms (including the constant) and n is the number of observations. Minitab identifies observations with leverage values greater than 3p/n or 0.99, whichever is smaller, with an X in the table of unusual observations.
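The leverage rule above can be sketched in a few lines of Python. This is a minimal illustration, not Minitab's implementation: it assumes a simple linear regression (a constant plus one predictor, so p = 2), for which the leverage reduces to 1/n plus the squared distance from the mean of x divided by the sum of squared distances. The function names and the sample data are hypothetical.

```python
def leverages(x):
    """Leverage of each observation in a simple linear regression
    (constant plus one predictor)."""
    n = len(x)
    x_mean = sum(x) / n
    # Sum of squared distances of the x-values from their mean
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]


def flag_high_leverage(x, p=2):
    """Return the indices whose leverage exceeds min(3p/n, 0.99), and the cutoff."""
    n = len(x)
    cutoff = min(3.0 * p / n, 0.99)
    h = leverages(x)
    flagged = [i for i, hi in enumerate(h) if hi > cutoff]
    return flagged, cutoff


# Hypothetical data: nine evenly spaced x-values and one far from the rest
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
flagged, cutoff = flag_high_leverage(x)
print("cutoff:", cutoff)
print("flagged indices:", flagged)
```

In this made-up data set only the last observation lies far from the mean of the x-values, so it is the only one that exceeds the cutoff. The leverages always sum to p, which is a quick sanity check on the calculation.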

Cook’s distance (D)

Geometrically, Cook's distance is a measure of the distance between the fitted values calculated with and without the ith observation. Use Cook's distance to identify observations that have unusual predictor values compared to the remaining data and observations that the model does not fit well. Observations with large Cook's distance values can have a large effect on the fitted value, and thus the regression model.

Investigate observations where D is greater than F(0.5, p, n-p), the median of an F-distribution, where p is the number of model terms (including the constant) and n is the number of observations. A different way to examine distance values is to compare distance values to each other graphically, using a line plot. Observations with large distance values relative to other observations can be influential.
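As a rough illustration of the F-median rule above, the following Python sketch computes Cook's distance from the residuals and leverages of an ordinary least squares fit and compares each value to F(0.5, p, n-p). It assumes NumPy and SciPy are available, uses the standard closed-form expression D = e² / (p · MSE) · h / (1 - h)², and is not Minitab's own code. The function name and the data are hypothetical.

```python
import numpy as np
from scipy import stats


def cooks_distance(X, y):
    """Cook's distance for each row. X must already contain a column of
    ones for the constant, so p = X.shape[1]."""
    n, p = X.shape
    # Leverages are the diagonal of the hat matrix, taken from a reduced QR
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q ** 2, axis=1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    mse = resid @ resid / (n - p)
    return resid ** 2 / (p * mse) * h / (1 - h) ** 2


# Hypothetical data: the last point has an unusual x-value and is off the trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
y = np.array([2.6, 2.9, 3.6, 3.9, 4.6, 4.9, 5.6, 5.9, 6.6, 5.0])
X = np.column_stack([np.ones_like(x), x])

n, p = X.shape
d = cooks_distance(X, y)
cutoff = stats.f.ppf(0.5, p, n - p)  # median of the F(p, n-p) distribution
flagged = np.flatnonzero(d > cutoff)
print("cutoff:", round(cutoff, 4))
print("flagged indices:", flagged.tolist())
```

Plotting the values in d against the observation index gives the line plot described above, which can reveal observations that stand out from the rest even when none exceeds the cutoff.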

DFITS

DFITS represents approximately the number of standard deviations that the fitted value changes when each observation is removed from the data set and the model is refit. Use DFITS to identify observations that have unusual predictor values compared to the remaining data and observations that the model does not fit well. Observations with large DFITS values can have a large effect on the fitted value, and thus the regression model.

Investigate observations whose DFITS values are greater in absolute value than 2*sqrt(p / n), where p is the number of model terms (including the constant) and n is the number of observations. DFITS can be negative, so compare its magnitude to the cutoff. A different way to examine DFITS values is to compare DFITS values to each other graphically, using a time series plot or a line plot. Observations with large DFITS values relative to other observations can be influential.
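The DFITS guideline can be sketched in Python as follows. This is an illustration under stated assumptions rather than Minitab's implementation: it assumes NumPy is available and uses the standard identity that DFITS equals the deleted (externally studentized) residual multiplied by sqrt(h / (1 - h)), which avoids refitting the model n times. The function name and the data are hypothetical.

```python
import numpy as np


def dfits(X, y):
    """DFITS for each row. X must already contain a column of ones
    for the constant, so p = X.shape[1]."""
    n, p = X.shape
    # Leverages from a reduced QR decomposition of the design matrix
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q ** 2, axis=1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = resid @ resid
    # Error variance estimated with observation i left out
    s2_deleted = (sse - resid ** 2 / (1 - h)) / (n - p - 1)
    # Deleted (externally studentized) residuals
    t = resid / np.sqrt(s2_deleted * (1 - h))
    return t * np.sqrt(h / (1 - h))


# Hypothetical data: the last point has an unusual x-value and is off the trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
y = np.array([2.6, 2.9, 3.6, 3.9, 4.6, 4.9, 5.6, 5.9, 6.6, 5.0])
X = np.column_stack([np.ones_like(x), x])

n, p = X.shape
values = dfits(X, y)
cutoff = 2 * np.sqrt(p / n)
# Compare magnitudes, because DFITS is negative when the point pulls the fit down
flagged = np.flatnonzero(np.abs(values) > cutoff)
print("cutoff:", round(cutoff, 4))
print("flagged indices:", flagged.tolist())
```

In this made-up example the flagged observation has a negative DFITS value, which is why the comparison uses the absolute value.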

To determine how much effect the unusual observation has, you can fit the model with and without the observation and compare the coefficients, p-values, R², and other model information. If the model changes significantly when you remove the unusual observation, first, determine whether the observation is a data entry or measurement error. If not, determine whether you omitted an important term (for example, an interaction term) or variable, or have incorrectly specified the model. You might need to collect more data to determine a resolution.
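The with-and-without comparison described above might look like the following in Python. This is a minimal sketch that assumes NumPy is available and a simple linear regression; it compares only the coefficients and R², and leaves out p-values. The function name, the data, and the index of the suspect observation are hypothetical.

```python
import numpy as np


def fit_summary(x, y):
    """Fit y = b0 + b1 * x by least squares; return the coefficients and R-squared."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, r2


# Hypothetical data: the last observation is the one under suspicion
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
y = np.array([2.6, 2.9, 3.6, 3.9, 4.6, 4.9, 5.6, 5.9, 6.6, 5.0])
suspect = 9

keep = np.arange(len(x)) != suspect
beta_all, r2_all = fit_summary(x, y)
beta_drop, r2_drop = fit_summary(x[keep], y[keep])

print("with observation:    coefficients", beta_all.round(3), "R-sq", round(r2_all, 3))
print("without observation: coefficients", beta_drop.round(3), "R-sq", round(r2_drop, 3))
```

A large shift in the slope or in R² between the two fits, as in this made-up data set, indicates that the single observation is driving the model and deserves the follow-up checks described above.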