In Variables or distance matrix, enter either the columns that contain measurement data or a stored distance matrix that contains the distances between all pairs of variables.

If you enter a stored distance matrix, Minitab cannot calculate statistics for the final partition.

For measurement data, you must have two or more numeric columns, and each column must represent a different measurement. Delete rows that have missing data from the worksheet before you perform this analysis. If you have many rows of data, you may want to subset your worksheet to exclude the rows with missing values. For more information, go to Overview for Subset Worksheet.

You cannot enter a categorical variable for this analysis. If you have a categorical variable, you must first convert the text values to a numerical scale, or you must perform a separate analysis for each level of the categorical variable. For more information, go to Data considerations for Cluster Variables.

For the stored distance matrix, the entry in row i and column j of distance matrix D is the distance between variables i and j. For information on creating and using stored matrices in Minitab, go to Overview for Matrices.
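As a rough illustration of the layout described above, the following Python sketch checks that a candidate matrix has the shape a distance matrix would be expected to have. The function name `is_valid_distance_matrix`, the tolerance, and the sample values are hypothetical. The symmetry and zero-diagonal checks are assumptions that follow from reading entry (i, j) as the distance between variables i and j; they are not requirements stated by Minitab.

```python
def is_valid_distance_matrix(d, tol=1e-9):
    """Return True if d is square, symmetric, non-negative, with a zero diagonal."""
    n = len(d)
    # Every row must have as many entries as there are rows.
    if any(len(row) != n for row in d):
        return False
    for i in range(n):
        # Assumption: the distance from a variable to itself is 0.
        if abs(d[i][i]) > tol:
            return False
        for j in range(i + 1, n):
            # Assumption: d[i][j] and d[j][i] describe the same pair of variables.
            if d[i][j] < 0 or abs(d[i][j] - d[j][i]) > tol:
                return False
    return True


# Hypothetical distances among three variables (not taken from any worksheet).
D = [
    [0.0, 0.2, 1.4],
    [0.2, 0.0, 1.1],
    [1.4, 1.1, 0.0],
]
print(is_valid_distance_matrix(D))
```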

In this worksheet, each column contains measurements of a different variable, recorded in different cities around the world, that may be associated with college admission rates. The variables are the number of newspaper copies, radios, and television sets per 1,000 people, the literacy rate, and whether the city has a university. Investigators hope to reduce the number of variables by combining variables that have similar characteristics.

C1 | C2 | C3 | C4 | C5 |
---|---|---|---|---|
Newspaper | Radio | TV Sets | Literacy Rate | University |
279 | 267 | 227 | 0.98 | 1 |
143 | 112 | 332 | 0.94 | 1 |
9 | 113 | 7 | 0.25 | 0 |
391 | 314 | 566 | 0.99 | 1 |
112 | 48 | 423 | 0.82 | 1 |
67 | 66 | 134 | 0.45 | 0 |

From Linkage method, select a method to specify how the distance between two clusters is defined. You might want to try several linkage methods to see which method provides the most useful results for your data.

For Cluster Observations, distance refers to the distance between observations, and linkage refers to the distance between the clusters of observations. For Cluster Variables, distance refers to the distance between variables, and linkage refers to the distance between the clusters of variables.

- Average: The distance between two clusters is the mean distance between an observation (or variable) in one cluster and an observation (or variable) in the other cluster. Whereas the single and complete linkage methods are based on single pair distances, the average linkage method uses a more central measure of location.
- Centroid: The distance between two clusters is the distance between the cluster centroids or means. Like the average linkage method, this method is an averaging technique.
- Complete: The distance between two clusters is the maximum distance between an observation (or variable) in one cluster and an observation (or variable) in the other cluster. This method, also called the furthest neighbor method, ensures that all observations (or variables) in a cluster are within a maximum distance of each other and tends to produce clusters that have similar diameters. However, the results are greatly affected by outliers.
- McQuitty: When two clusters are joined, the distance from the new cluster to any other cluster is the average of the distances from each of the two joined clusters to that other cluster. For example, if clusters 1 and 3 are to be joined into a new cluster, say 1*, then the distance from 1* to cluster 4 is the average of the distances from 1 to 4 and 3 to 4. For this method, the distance depends on a combination of clusters instead of on individual observations (or variables) in the clusters.
- Median: The distance between two clusters is the median distance between an observation (or variable) in one cluster and an observation (or variable) in the other cluster. Because this averaging technique uses the median instead of the mean, it reduces the effect of outliers.
- Single: The distance between two clusters is the minimum distance between an observation (or variable) in one cluster and an observation (or variable) in the other cluster. This method, also called the nearest neighbor method, is a good choice when clusters are obviously separated. When observations (or variables) lie close together, the single linkage method tends to identify long chain-like clusters, with relatively large distances separating the observations (or variables) at either end of the chain.
- Ward: The distance between two clusters is the sum of squared deviations from points to centroids. The goal of Ward's linkage method is to minimize the within-cluster sum of squares. This method tends to produce clusters that have similar numbers of observations (or variables), but it is sensitive to outliers. Also, the distance between two clusters can sometimes be larger than dmax, which is the maximum value in the original distance matrix. When this occurs, the similarity value is negative.
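To make a few of these definitions concrete, the following Python sketch computes the single, complete, and average linkage distances between two clusters from a small distance matrix, and applies the McQuitty update from the example in the list. It is an illustration of the definitions only, not Minitab's implementation. The matrix values, the function names, and the 0-based indices are hypothetical.

```python
from itertools import product

# Hypothetical distances among four variables (0-based indices 0..3).
D = [
    [0.0, 0.2, 0.9, 1.2],
    [0.2, 0.0, 0.7, 1.0],
    [0.9, 0.7, 0.0, 0.4],
    [1.2, 1.0, 0.4, 0.0],
]


def single_linkage(d, a, b):
    """Minimum distance between a member of cluster a and a member of cluster b."""
    return min(d[i][j] for i, j in product(a, b))


def complete_linkage(d, a, b):
    """Maximum distance between a member of cluster a and a member of cluster b."""
    return max(d[i][j] for i, j in product(a, b))


def average_linkage(d, a, b):
    """Mean distance over all pairs with one member in a and one in b."""
    return sum(d[i][j] for i, j in product(a, b)) / (len(a) * len(b))


def mcquitty_update(d_ik, d_jk):
    """Distance from a newly joined cluster (i, j) to another cluster k."""
    return (d_ik + d_jk) / 2.0


cluster_a, cluster_b = [0, 1], [2, 3]
print("single:  ", single_linkage(D, cluster_a, cluster_b))
print("complete:", complete_linkage(D, cluster_a, cluster_b))
print("average: ", average_linkage(D, cluster_a, cluster_b))

# Mirrors the text's example: variables 1 and 3 (indices 0 and 2) are joined
# into 1*, and the distance from 1* to variable 4 (index 3) is updated.
print("McQuitty:", mcquitty_update(D[0][3], D[2][3]))
```

Running the same pair of clusters through several linkage functions shows how much the choice of method alone can change the reported distance, which is why trying more than one method can be worthwhile.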

From Distance measure, select the method for calculating the distance between variables.

- Correlation: The correlation method gives distances between 0 and 1 for positive correlations, and between 1 and 2 for negative correlations. If it makes sense to consider negatively correlated data to be farther apart than positively correlated data, use the correlation method.
- Absolute correlation: The absolute correlation method gives distances between 0 and 1. If you think that the strength of the relationship is important in considering distance and not the sign, then use the absolute correlation method.
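The ranges described above are consistent with a distance of 1 - r for the correlation method and 1 - |r| for the absolute correlation method, where r is the Pearson correlation between two variables. The following Python sketch applies those formulas, which are an assumption based on the stated ranges, to the sample worksheet columns. The function names are hypothetical, and the results are not claimed to match Minitab's output.

```python
import math

# Columns from the sample worksheet.
worksheet = {
    "Newspaper": [279, 143, 9, 391, 112, 67],
    "Radio": [267, 112, 113, 314, 48, 66],
    "TV Sets": [227, 332, 7, 566, 423, 134],
    "Literacy Rate": [0.98, 0.94, 0.25, 0.99, 0.82, 0.45],
    "University": [1, 1, 0, 1, 1, 0],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def variable_distance(r, absolute=False):
    """Assumed formulas: 1 - r (range 0 to 2) or 1 - |r| (range 0 to 1)."""
    return 1.0 - abs(r) if absolute else 1.0 - r


def distance_matrix(columns, absolute=False):
    """Pairwise distances between all columns, in the order the columns are given."""
    names = list(columns)
    return [
        [variable_distance(pearson(columns[a], columns[b]), absolute) for b in names]
        for a in names
    ]


D_corr = distance_matrix(worksheet)
D_abs = distance_matrix(worksheet, absolute=True)

for name, row in zip(worksheet, D_corr):
    print(f"{name:14s}", " ".join(f"{value:5.2f}" for value in row))
```

Under these formulas, two perfectly negatively correlated variables are as far apart as possible (distance 2) with the correlation method, but are treated as identical (distance 0) with the absolute correlation method.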

Indicate the criteria that you want to use to determine the final groupings.

- Number of clusters: Select to enter the number of clusters for the final partition.
- Similarity level: Select to enter the similarity level for the clusters in the final partition.

For the best results, be flexible with the criteria. For example, if you define the final partition using the number of clusters, you should also consider how the similarity level changes from step to step. A precipitous drop in similarity when two clusters are joined might prompt you to specify the final partition from the step before that joining. Conversely, if you define the final partition using the similarity level, you might find that similarity levels do not change much over a range of clusters, and for the sake of simplicity you may choose the step with the fewest clusters.

If you do not know what value to enter to specify the final partition, first perform the analysis using the default setting (1 cluster in the final partition). Minitab displays the results for all possible numbers of clusters. Use the results to determine a value to enter for the final partition. Then repeat the analysis and specify the final partition that you determined. For more information, go to Determine the final grouping of clusters.
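One way to review the results of a first run is to look for the step where the similarity level drops the most. The following Python sketch does this for a hypothetical set of amalgamation steps. It assumes that similarity is computed as 100 × (1 − d / dmax), which is consistent with the note that similarity becomes negative when a distance exceeds dmax but is not a formula given here. The step distances, the value of `d_max`, and the function names are all made up for illustration.

```python
def similarity(distance, d_max):
    """Assumed similarity formula: 100 * (1 - d / d_max)."""
    return 100.0 * (1.0 - distance / d_max)


def clusters_before_largest_drop(steps, d_max):
    """Return the number of clusters at the step just before the largest similarity drop.

    steps: list of (number_of_clusters, distance_level) in amalgamation order,
    that is, with the number of clusters decreasing from step to step.
    """
    levels = [(n, similarity(d, d_max)) for n, d in steps]
    best_n, best_drop = levels[0][0], float("-inf")
    for (prev_n, prev_s), (_, next_s) in zip(levels, levels[1:]):
        drop = prev_s - next_s
        if drop > best_drop:
            best_n, best_drop = prev_n, drop
    return best_n


# Hypothetical amalgamation steps for five variables: (clusters, distance level).
steps = [(4, 0.10), (3, 0.25), (2, 0.90), (1, 1.30)]
d_max = 2.0  # assumption: largest distance used to scale similarity

for n, d in steps:
    print(f"{n} clusters: distance {d:.2f}, similarity {similarity(d, d_max):.1f}")

print("Suggested number of clusters:", clusters_before_largest_drop(steps, d_max))
```

A rule like this is only a starting point. The dendrogram and subject-matter knowledge should still guide the final choice.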

Select the dendrogram option to display a tree diagram (the dendrogram) that shows how clusters were formed at each step in the amalgamation procedure. The dendrogram allows you to view the similarity (or distance) values for the clusters at each step.

To change the default display of the dendrogram, click Customize.