Multiple case Cook's distance

This macro computes the multiple case extension of Cook's single case distance measure. Depending on the data set size, the distance measure can be computed for all case pairs and all case triples. In addition, the distance measure can be computed for user-selected subsets of up to ten cases. The graphs produced include a plot of Cook's distance for single cases against case number, an ID plot of influential case pairs, and fixed-pair effect plots, which show the effect (the change in Cook's distance) of adding a third case to a fixed pair of cases. The same functionality is available for models with no constant term.
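As a rough illustration of what a multiple case distance measures, the sketch below uses Python rather than the macro language. It assumes the standard Cook and Weisberg form, D_I = (b - b_(I))' X'X (b - b_(I)) / (p s^2), where b_(I) is the coefficient vector after deleting the cases in I, p is the number of parameters, and s^2 is the full-data MSE. The macro's internal computations are not shown on this page, so the function names and the small data set are made up for illustration only.

```python
from itertools import combinations


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("X'X is singular; its inverse does not exist")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit(X, y):
    """Return the least squares coefficients and X'X for design matrix X."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty), xtx


def cooks_distance(X, y, cases):
    """Cook's distance for deleting `cases` (1-based case numbers) jointly."""
    n, p = len(X), len(X[0])
    beta, xtx = fit(X, y)
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    mse = sum(e * e for e in resid) / (n - p)
    keep = [i for i in range(n) if i + 1 not in cases]
    beta_del, _ = fit([X[i] for i in keep], [y[i] for i in keep])
    d = [b - bd for b, bd in zip(beta, beta_del)]
    quad = sum(d[i] * xtx[i][j] * d[j] for i in range(p) for j in range(p))
    return quad / (p * mse)


# Made-up illustrative data: a constant column plus one predictor.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 12.0]
y = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 8.0]
X = [[1.0, xi] for xi in x]

# Distance for every case pair, as the macro does for small data sets.
pairs = {pair: cooks_distance(X, y, pair) for pair in combinations(range(1, 8), 2)}
worst = max(pairs, key=pairs.get)
print("most influential pair:", worst, round(pairs[worst], 2))
```

For a model with no constant term, the same sketch applies with the column of ones left out of X.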

Download the Macro

Be sure that Minitab knows where to find your downloaded macro. Choose File > Options > General. Under Macro location, browse to the folder where you save macro files.

Important

If you use an older web browser, the file may open in QuickTime when you click the Download button, because QuickTime shares the .mac file extension with Minitab macros. To save the macro, right-click the Download button and choose Save target as.

Required Inputs

  • One column of response values
  • Multiple columns of predictor values

Optional Inputs

HOLD
Use to specify a pair of cases from which to create fixed-pair effect plots.
NOCONSTANT
Use if you want no constant term in the model. This subcommand is especially useful if you are analyzing a mixture model, in which case the constant term is omitted from the model to avoid rank deficiency in the X'X matrix.
NOPAIR
Use if you do not want to calculate distance values for all case pairs. If you use this subcommand, you must also request computation for all triples, computation for one or more selected subsets, or the HOLD subcommand.
NOPLOTS
Use if you don't want to show diagnostic plots.
REPORTALL
Use to report all the computed distance values. Selecting this subcommand eliminates comparisons with the threshold value since all the distance values are reported. If you choose this subcommand, the threshold value will still be shown on graphs as a visual aid.
SPAIRS C C C
Use this subcommand to store all the distance values for case pairs in the worksheet. Specify three columns; the first two for the indices and the third for the distance values.
STRIPLES C C C C
Use to store all the distance values for case triples in the worksheet. Specify four columns; the first three for the indices and the fourth for the distance values.
SUB1 K…K
Use this subcommand if you want to calculate the distance value for a selected subset of up to ten cases (K). This subcommand is especially useful for subsets of more than three cases. You may specify up to five subsets by using the subcommands SUB1, SUB2, SUB3, SUB4, and SUB5.
THRESHOLD K
Use to specify a threshold value. By default, the threshold value is 1.00. The output will show all computed results that are greater than or equal to this value. The specified threshold must be a positive numerical value.
TRIPLE
Use this subcommand if you want the macro to calculate Cook's distance for all case triples and compare the values with the default or specified threshold value.

Running the Macro

The syntax used to run the macro varies slightly depending on the version you are using.

The following example uses the sample data, the "Modified Data on Wood Specific Gravity" data set of twenty cases and five predictors in Rousseeuw and Leroy (1987). The computational results for the five selected case subsets match those given in Seaver, Triantis, and Reeves (1999).

Suppose the values of the response Y, specific gravity, are in C1 and the values of the five predictors, X1-X5, are in C2-C6. Five case subsets were selected.

To run the macro, choose View > Command Line/History and type:
%MULTDIST C1-C6;
SUB1 5;
SUB2 8 19;
SUB3 6 8 19;
SUB4 4 8 19;
SUB5 4 6 8 19.

Click Run.

Output

Here is what the macro will produce.

Multiple Case Cook's Distance

Model Information
------------------------
Response:     Y

Predictors:   X1 , X2 , X3 , X4 , X5                                            

Parameters:    6
 
Threshold value:    1.00
------------------------
 
*** Cook's Distance for Case Pairs ***
 
     Cases        Cook's Distance

     7 , 11             1.03

 
*** Cook's Distance for a Subset ***

     Cases:  5   Cook's Distance:  0.06                                              


     Cases:   8 , 19   Cook's Distance:  0.33                                        


     Cases:   6 ,  8 , 19   Cook's Distance:  1.99                                   


     Cases:   4 ,  8 , 19   Cook's Distance:  0.49                                   


     Cases:   4 ,  6 ,  8 , 19   Cook's Distance:  53.93 

Note

Graph output not shown.
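The fixed-pair effect described earlier can be read off the subset output above. The sketch below, in Python rather than the macro language, assumes the effect is the simple difference between the distance for a triple and the distance for the fixed pair it contains. The function name is hypothetical, and the distances are copied from the output above.

```python
# Cook's distances copied from the subset output above.
reported = {
    (8, 19): 0.33,
    (6, 8, 19): 1.99,
    (4, 8, 19): 0.49,
}


def added_case_effects(distances, fixed_pair):
    """Change in Cook's distance when one extra case joins `fixed_pair`."""
    base = distances[fixed_pair]
    effects = {}
    for cases, distance in distances.items():
        if len(cases) == 3 and set(fixed_pair) <= set(cases):
            (added,) = set(cases) - set(fixed_pair)
            effects[added] = round(distance - base, 2)
    return effects


effects = added_case_effects(reported, (8, 19))
for case, change in sorted(effects.items()):
    print("adding case", case, "changes the distance by", change)
```

On this reading, adding case 6 to the pair 8, 19 changes the distance far more than adding case 4 does.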

More Information

Data set size

The data set size limits for computing Cook's distance for all case pairs and for all case triples are 60 and 30 cases, respectively. The data set size limit for case subset computations is 500 cases. You can change the limits for case pairs and triples within the macro: go to the section of the macro code labeled "MSE check, triple, nopair" and change 60 and 30 to the sizes you want. Note that computing time increases as the data set size increases, especially when computing all triples.
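To see why the triples limit is lower than the pairs limit, it helps to count how many subsets must be evaluated. The short Python sketch below is not part of the macro, and the function name is hypothetical. It counts the pairs and triples at the default limits.

```python
from math import comb


def subsets_to_evaluate(n_cases, subset_size):
    """Number of distinct case subsets of the given size."""
    return comb(n_cases, subset_size)


for n_cases in (30, 60):
    print(
        n_cases, "cases:",
        subsets_to_evaluate(n_cases, 2), "pairs,",
        subsets_to_evaluate(n_cases, 3), "triples",
    )
```

At the default limits this gives 1770 pairs for 60 cases and 4060 triples for 30 cases. Raising the triples limit to 60 cases would require 34220 triples.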

Inverse does not exist

If you analyze a mixture model, you must specify the NOCONSTANT subcommand. If you do not, you will get an error message indicating that the inverse of the X'X matrix does not exist. More generally, you will get this error message whenever any predictors are perfectly, or nearly perfectly, correlated.
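The reason a mixture model needs no constant term is that the component proportions in each row sum to 1, so a constant column is an exact linear combination of the predictor columns. The Python sketch below is an illustration only, not macro code. It uses a made-up three-component design and exact fractions to compare the rank of X'X with and without the constant column.

```python
from fractions import Fraction as F


def rank(matrix):
    """Rank of a matrix by exact Gaussian elimination over fractions."""
    m = [[F(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        piv = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if piv is None:
            continue  # no pivot in this column
        m[r], m[piv] = m[piv], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][col] / m[r][col]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def xtx(X):
    """Cross-product matrix X'X."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]


# Hypothetical three-component mixture design: every row sums to 1.
blends = [
    (F(1), F(0), F(0)),
    (F(0), F(1), F(0)),
    (F(0), F(0), F(1)),
    (F(1, 2), F(1, 2), F(0)),
    (F(1, 2), F(0), F(1, 2)),
    (F(0), F(1, 2), F(1, 2)),
    (F(1, 3), F(1, 3), F(1, 3)),
]
with_constant = [(F(1),) + b for b in blends]  # constant column added
no_constant = list(blends)                     # what NOCONSTANT corresponds to

print("with constant: rank", rank(xtx(with_constant)), "of", len(with_constant[0]))
print("no constant:   rank", rank(xtx(no_constant)), "of", len(no_constant[0]))
```

With the constant column, X'X has four columns but lower rank, so it cannot be inverted. Without it, X'X has full rank.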

Missing values

The macro handles missing data by removing any row that contains a missing value. The removal is reflected in the output and in the graphs.
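Row removal of this kind can be sketched in Python as follows. This is an illustration only, not the macro's own code. It assumes missing values are represented as None, whereas Minitab displays them as an asterisk in the worksheet. The function name and the small worksheet are made up.

```python
def drop_incomplete_rows(rows):
    """Return (complete_rows, removed_row_numbers); row numbers are 1-based."""
    kept, removed = [], []
    for number, row in enumerate(rows, start=1):
        if any(value is None for value in row):
            removed.append(number)
        else:
            kept.append(row)
    return kept, removed


# Made-up worksheet: response in the first position, two predictors after it.
worksheet = [
    (0.53, 1.2, 4.0),
    (0.41, None, 3.1),
    (0.48, 1.0, 3.7),
    (None, 1.4, 4.4),
]
complete, removed = drop_incomplete_rows(worksheet)
print(len(complete), "rows kept; rows removed:", removed)
```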

References

Rousseeuw, P. J. and Leroy, A. M. (1987), Robust Regression and Outlier Detection, John Wiley & Sons, Inc.

Seaver, B., Triantis, K., and Reeves, C. (1999), The Identification of Influential Subsets in Regression Using a Fuzzy Clustering Strategy, Technometrics, 41, 340-351.