Multiple comparisons let you assess the statistical significance of differences between means using a set of confidence intervals, a set of hypothesis tests or both. As usual, the null hypothesis of no difference between means is rejected if and only if zero is not contained in the confidence interval.

The selection of the appropriate multiple comparison method depends on the inference that you want. It is inefficient to use the Tukey all-pairwise approach when Dunnett or MCB is suitable, because the Tukey confidence intervals will be wider and the hypothesis tests less powerful for a particular family error rate. For the same reasons, MCB is superior to Dunnett if you want to eliminate factor levels that are not the best and to identify those that are best or close to the best. The choice of Tukey versus Fisher's LSD methods depends on which error rate, family or individual, you want to specify.
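The efficiency argument above can be illustrated with a small Monte Carlo sketch. It is not Minitab's calculation: it assumes equal group sizes and a known common variance (so normal rather than t-based quantities are used), and the function name `simulated_critical_values` and its parameters are hypothetical. It estimates the critical value needed to hold a 5% family error rate when comparing all pairs (Tukey-style) versus comparing each group only to a control (Dunnett-style).

```python
import random
from itertools import combinations


def simulated_critical_values(k=5, family_alpha=0.05, reps=20000, seed=7):
    """Estimate family-wise critical values under the null hypothesis.

    Assumption: each group mean is standardized to N(0, 1), so the
    difference of two means has standard deviation sqrt(2).
    Returns (all_pairs_critical, vs_control_critical).
    """
    rng = random.Random(seed)
    all_pairs_max = []
    vs_control_max = []
    for _ in range(reps):
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]
        # Largest standardized difference over every pair of groups.
        all_pairs_max.append(
            max(abs(a - b) for a, b in combinations(z, 2)) / 2 ** 0.5
        )
        # Largest standardized difference against group 0 (the control).
        vs_control_max.append(
            max(abs(x - z[0]) for x in z[1:]) / 2 ** 0.5
        )
    # Empirical (1 - family_alpha) quantile of each maximum statistic.
    idx = int((1 - family_alpha) * reps)
    return sorted(all_pairs_max)[idx], sorted(vs_control_max)[idx]
```

Calling `simulated_critical_values()` returns two numbers; the all-pairs value should come out larger than the versus-control value, which corresponds to wider intervals and less powerful tests at the same family error rate.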

The characteristics and advantages of each method are summarized in the following table:

Method | Normal Data | Strength | Comparison with a Control | Pairwise Comparison |
---|---|---|---|---|
Tukey | Yes | Most powerful test when doing all pairwise comparisons. | No | Yes |
Dunnett | Yes | Most powerful test when comparing to a control. | Yes | No |
Hsu's MCB method | Yes | Most powerful test when you compare the group with the highest or lowest mean to the other groups. | No | Yes |

One-Way ANOVA also offers Fisher's LSD method for individual confidence intervals. Fisher's LSD is not a multiple comparison method, but instead contrasts the individual confidence intervals for the pairwise differences between means using an individual error rate. Because the individual error rate applies to each comparison separately, Fisher's LSD method inflates the family error rate, which is displayed in the output.
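To get a rough feel for how quickly the family error rate can grow when each comparison uses an individual error rate, the sketch below counts the pairwise comparisons among `k` groups and reports two simple figures. Both are assumptions made for illustration: the first treats the comparisons as independent, which pairwise comparisons are not because they share group means, and the second is the Bonferroni upper bound. Neither is the value Minitab displays, and the function name `fisher_family_error_sketch` is hypothetical.

```python
from math import comb


def fisher_family_error_sketch(k, individual_alpha=0.05):
    """Rough family error figures for k groups compared pairwise.

    Returns (number_of_comparisons, independence_approximation,
    bonferroni_upper_bound). Illustration only, not Minitab's method.
    """
    c = comb(k, 2)  # number of pairwise comparisons among k groups
    # Chance of at least one false positive if the c tests were independent.
    independent_approx = 1 - (1 - individual_alpha) ** c
    # Bonferroni inequality: the family rate cannot exceed c * alpha.
    bonferroni_bound = min(1.0, c * individual_alpha)
    return c, independent_approx, bonferroni_bound
```

For five groups there are ten pairwise comparisons, so an individual error rate of 0.05 gives an independence approximation of about 0.40 and a Bonferroni bound of 0.50. Because the comparisons are positively correlated, the actual family error rate is lower than the independence figure, but still well above 0.05.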

It is important to consider which means to compare when using multiple comparisons; a bad choice can result in confidence levels that are not what you think. Issues that should be considered when making this choice might include:

- How deep into the design should you compare means: only within each factor, within each combination of first-level interactions, or across combinations of higher-level interactions?
- Should you compare the means for only those terms with a significant F-test or for those sets of means for which differences seem to be large?

How deep within the design should you compare means? There is a trade-off: if you compare means at all two-factor combinations and higher-order interactions turn out to be significant, then the means that you compare might be a combination of effects; if you compare means at too deep a level, you lose power because the sample sizes become smaller and the number of comparisons becomes larger. You might decide to compare means only for factor level combinations for which you believe the interactions are meaningful.

Minitab lets you compare means only for fixed terms or interactions between fixed terms. Nesting is considered to be a form of interaction.

Usually, you should decide which means you will compare before you collect your data. If you compare only those means with differences that seem to be large, which is called data snooping, then you are increasing the likelihood that the results indicate a real difference where no difference exists. If you condition the application of multiple comparisons on achieving a significant F-test, then you increase the probability that differences exist among the groups but you do not detect them. Because the multiple comparison methods already protect against the detection of a difference that does not exist, you do not need the F-test to guard against this probability.
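The effect of data snooping can be illustrated with a small simulation. The sketch below is built on simplifying assumptions that are not from Minitab: all group means are truly equal, the common standard deviation is known (so a z-test is used instead of a t-test), and the function name `snooping_simulation` and its parameters are hypothetical. It compares the false positive rate of one pair chosen before seeing the data with the rate obtained by testing whichever pair shows the largest observed difference, both at an unadjusted 0.05 level.

```python
import random
from statistics import NormalDist, fmean


def snooping_simulation(k=5, n=10, sigma=1.0, alpha=0.05, reps=4000, seed=1):
    """Return (planned_rate, snooped_rate) of false positives under the null.

    planned: always test group 0 against group 1.
    snooped: test the pair with the largest observed difference.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se_diff = sigma * (2 / n) ** 0.5  # std. error of a difference of two means
    planned_hits = 0
    snooped_hits = 0
    for _ in range(reps):
        # Every group has the same true mean, so any "difference" is noise.
        means = [
            fmean([rng.gauss(0.0, sigma) for _ in range(n)]) for _ in range(k)
        ]
        if abs(means[0] - means[1]) / se_diff > z_crit:
            planned_hits += 1
        if (max(means) - min(means)) / se_diff > z_crit:
            snooped_hits += 1
    return planned_hits / reps, snooped_hits / reps
```

Under these assumptions, the planned comparison should reject at close to the nominal 5% rate, while the snooped comparison rejects several times as often even though no real difference exists.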

However, many people commonly use F-tests to guide the choice of which means to compare. The ANOVA F-tests and multiple comparisons are not entirely separate assessments. For example, if the p-value of an F-test is 0.9, you probably will not discover statistically significant differences between means by multiple comparisons.

The p-value in the ANOVA table and the multiple comparison results are based on different methodologies and can occasionally produce contradictory results. For example, the ANOVA p-value can indicate that there are no differences between the means while the multiple comparisons output indicates that some means are different. In this case, you can generally trust the multiple comparisons output.

You do not need to rely on a significant p-value in the ANOVA table to reduce the chance of detecting a difference that doesn't exist. This protection is already incorporated in the Tukey, Dunnett, and MCB tests (and Fisher's test when the means are equal).