Enter your data for Cluster Observations

Stat > Multivariate > Cluster Observations

Specify the data for your analysis, select the linkage and distance methods, indicate whether to standardize the variables, specify the final partition, and select the graph options.

In This Topic

Enter your data
Specify the linkage method
Specify the distance measure
Standardize variables
Specify the final partition
Show dendrogram

Enter your data

In Variables or distance matrix, enter either the columns that contain measurement data or a stored distance matrix that contains the distances between all pairs of observations.

Note

If you enter a stored distance matrix, Minitab cannot calculate statistics for the final partition.

For measurement data, you must have two or more numeric columns, and each column must represent a different measurement. Delete rows that have missing data from the worksheet before you perform this analysis. If you have many rows of data, you may want to subset your worksheet to exclude the rows that have missing values. For more information, go to Overview for Subset Worksheet.

You cannot enter a categorical variable for this analysis. If you have a categorical variable, you must first convert the text values to a numerical scale, or you must perform a separate analysis for each level of the categorical variable. For more information, go to Data considerations for Cluster Observations.

For a stored distance matrix, the entry in row i and column j of distance matrix D is the distance between observations i and j. For information on creating and using stored matrices in Minitab, go to Overview for Matrices.
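As an illustration of that layout, the following minimal Python sketch builds a distance matrix D from measurement data so that D[i][j] is the distance between observations i and j. It is not Minitab code. The function names are hypothetical, the choice of Euclidean distance is an assumption, and the values are the Height and Weight entries of the first four rows of the sample worksheet in this topic.

```python
import math

# Height and Weight for the first four rows of the sample worksheet.
observations = [
    (67, 155),
    (74, 193),
    (68, 152),
    (70, 172),
]

def euclidean(a, b):
    """Square root of the sum of squared differences between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(rows, metric=euclidean):
    """Return D, where D[i][j] is the distance between rows i and j."""
    n = len(rows)
    return [[metric(rows[i], rows[j]) for j in range(n)] for i in range(n)]

D = distance_matrix(observations)

# Print the matrix rounded to two decimals, one row per line.
for row in D:
    print(["%.2f" % d for d in row])
```

The resulting matrix is square and symmetric, with zeros on the diagonal, because the distance from an observation to itself is 0 and the distance from i to j equals the distance from j to i.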

In this worksheet, each column contains different measurements on athletes.

C1	C2	C3	C4
Gender	Height	Weight	Handedness
2	67	155	1
1	74	193	1
2	68	152	1
1	70	172	0
1	72	169	1
2	66	134	0

Specify the linkage method

From Linkage method, select a method to specify how the distance between two clusters is defined. You might want to try several linkage methods to see which method provides the most useful results for your data.

Note

For Cluster Observations, distance refers to the distance between observations, and linkage refers to the distance between the clusters of observations. For Cluster Variables, distance refers to the distance between variables, and linkage refers to the distance between the clusters of variables.

Average: The distance between two clusters is the mean distance between an observation (or variable) in one cluster and an observation (or variable) in the other cluster. Whereas the single and complete linkage methods are based on single pair distances, the average linkage method uses a more central measure of location.
Centroid: The distance between two clusters is the distance between the cluster centroids or means. Like the average linkage method, this method is also an averaging technique.
Complete: The distance between two clusters is the maximum distance between an observation (or variable) in one cluster and an observation (or variable) in the other cluster. This method, also called the furthest neighbor method, ensures that all observations (or variables) in a cluster are within a maximum distance and tends to produce clusters that have similar diameters. However, the results are greatly affected by outliers.
McQuitty: The distance from a newly formed cluster to any other cluster is the average of the distances from each of the two joined clusters to that other cluster. For example, if clusters 1 and 3 are joined into a new cluster, say 1*, then the distance from 1* to cluster 4 is the average of the distances from 1 to 4 and from 3 to 4. For this method, the distance depends on a combination of clusters instead of on individual observations (or variables) in the clusters.
Median: The distance between two clusters is the median distance between an observation (or variable) in one cluster and an observation (or variable) in the other cluster. Because this averaging technique uses the median instead of the mean, it reduces the effect of outliers.
Single: The distance between two clusters is the minimum distance between an observation (or variable) in one cluster and an observation (or variable) in the other cluster. This method, also called the nearest neighbor method, is a good choice when clusters are obviously separated. When observations (or variables) lie close together, the single linkage method tends to identify long chain-like clusters, with relatively large distances separating observations at either end of the chain.
Ward: The distance between two clusters is the sum of squared deviations from points to centroids. The goal of Ward's linkage method is to minimize the within-cluster sum of squares. This method tends to produce clusters that have similar numbers of observations (or variables), but it is sensitive to outliers. Also, the distance between two clusters can sometimes be larger than dmax, which is the maximum value in the original distance matrix. When this occurs, the similarity value is negative.
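To make the definitions above concrete, here is a small Python sketch of how several of the linkage methods turn pairwise distances into a between-cluster distance. It is an illustration under stated assumptions, not Minitab's implementation: the function names and the two toy clusters are hypothetical, and Euclidean distance is used throughout for simplicity.

```python
import math
from itertools import product

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pair_distances(cluster_a, cluster_b):
    """Distances between every point in one cluster and every point in the other."""
    return [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    """Minimum pairwise distance (nearest neighbor)."""
    return min(pair_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Maximum pairwise distance (furthest neighbor)."""
    return max(pair_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Mean of all pairwise distances."""
    d = pair_distances(cluster_a, cluster_b)
    return sum(d) / len(d)

def centroid(cluster):
    """Column-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def centroid_linkage(cluster_a, cluster_b):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(cluster_a), centroid(cluster_b))

def mcquitty_update(d_1_to_4, d_3_to_4):
    """Distance from new cluster 1* (clusters 1 and 3 joined) to cluster 4."""
    return (d_1_to_4 + d_3_to_4) / 2

# Hypothetical toy clusters in two dimensions.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]

print("single:  ", single_linkage(cluster_a, cluster_b))
print("complete:", complete_linkage(cluster_a, cluster_b))
print("average: ", average_linkage(cluster_a, cluster_b))
print("centroid:", centroid_linkage(cluster_a, cluster_b))
```

For any pair of clusters, the single linkage distance is never larger than the average linkage distance, which in turn is never larger than the complete linkage distance. This is one reason the methods can produce different groupings from the same data.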

Specify the distance measure

From Distance measure, select the method for calculating the distance between pairs of observations.

Euclidean: The most common distance measure, which calculates the square root of the sum of squared differences.
Squared Euclidean: The square of the distance that is calculated using the Euclidean method. This method gives more weight to outliers.
Pearson: The square root of the sum of squared differences, with each squared difference divided by the variance of the corresponding variable. This method makes the variances the same and is used for standardizing.
Squared Pearson: The square of the distance that is calculated using the Pearson method. This method gives more weight to outliers and makes the variances the same.
Manhattan: The sum of absolute differences. This method gives less weight to outliers.

Tip

If you selected Average, Centroid, Median, or Ward as the linkage method, you should usually use one of the squared distance measures.
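The five distance measures can be sketched in Python as follows. This is an illustrative sketch rather than Minitab's code: the function names are hypothetical, and the use of the sample variance of each column for the Pearson measures is an assumption. The data are the Height and Weight columns of the sample worksheet in this topic.

```python
import math
import statistics

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pearson(a, b, variances):
    """Square root of the sum of squared differences, each divided by its variable's variance."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))

def squared_pearson(a, b, variances):
    """Sum of squared differences, each divided by its variable's variance."""
    return sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances))

def manhattan(a, b):
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

heights = [67, 74, 68, 70, 72, 66]
weights = [155, 193, 152, 172, 169, 134]

# Assumption: the variance of each variable is the sample variance of its column.
variances = [statistics.variance(heights), statistics.variance(weights)]

a = (heights[0], weights[0])  # first observation
b = (heights[1], weights[1])  # second observation

print("Euclidean:        ", euclidean(a, b))
print("Squared Euclidean:", squared_euclidean(a, b))
print("Pearson:          ", pearson(a, b, variances))
print("Squared Pearson:  ", squared_pearson(a, b, variances))
print("Manhattan:        ", manhattan(a, b))
```

In the unscaled measures, the Weight difference (38) dominates the Height difference (7). Dividing by the variances in the Pearson measures reduces that imbalance.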

Standardize variables

Select Standardize variables to have Minitab weight all the variables equally. Standardizing is good practice in most cases, and is particularly important when the variables use different scales. Suppose variable A is on a scale in dollars from $0 to $10,000,000, and variable B is a ratio on a scale from 0.0 to 1.0. If the variables are not standardized, then the cluster observations procedure places much more weight on variable A than on variable B due to the larger values of its scale, which is probably not the desired result. Therefore, the variables should be standardized.

When you standardize the variables, Minitab makes all the means equal to 0 and all the variances equal to 1. To make only the variances equal, do not select the standardize option, but instead select either Pearson or Squared Pearson under Distance measure.
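The effect of standardizing can be checked with a short Python sketch. It assumes that standardizing means subtracting the column mean and dividing by the sample standard deviation; the function name is hypothetical, and the data are the Height and Weight columns of the sample worksheet in this topic.

```python
import statistics

def standardize(column):
    """Rescale a column to mean 0 and variance 1 (assumes the sample standard deviation)."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]

heights = [67, 74, 68, 70, 72, 66]
weights = [155, 193, 152, 172, 169, 134]

z_heights = standardize(heights)
z_weights = standardize(weights)

# After standardizing, both columns are on the same scale.
for name, z in (("Height", z_heights), ("Weight", z_weights)):
    print(name, "mean:", round(statistics.mean(z), 6),
          "variance:", round(statistics.variance(z), 6))
```

Because both standardized columns have the same variance, neither variable dominates the distance calculation only because of its units.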

Specify the final partition

Indicate the criteria that you want to use to determine the final groupings.

Number of clusters: Select to enter the number of clusters for the final partition.
Similarity level: Select to enter the similarity level for the clusters in the final partition.

For the best results, you should be flexible with the criteria. For example, if you define the final partition using the number of clusters, you should also consider changes in the similarity level. A precipitous drop in similarity at the step that joins two specific clusters might prompt you to specify the final partition before this grouping. Conversely, if you define the final partition using the similarity level, you might determine that similarity levels do not change much over a range of steps, and for the sake of simplicity you may choose the step with the fewest clusters.
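One way to look for such a drop is sketched below in Python. The sketch assumes that the similarity level at a step is 100 * (1 - d / dmax), where d is the distance between the clusters joined at that step and dmax is the maximum value in the original distance matrix. The joining distances, the dmax value, and the function names are hypothetical and are not output from Minitab.

```python
def similarity(distance, d_max):
    """Similarity level, assuming the form 100 * (1 - d / dmax)."""
    return 100.0 * (1.0 - distance / d_max)

def largest_drop(levels):
    """Return the 0-based index of the step where similarity falls most from the previous step."""
    drops = [levels[i - 1] - levels[i] for i in range(1, len(levels))]
    return drops.index(max(drops)) + 1

# Hypothetical joining distances for 6 observations (5 amalgamation steps).
n_observations = 6
d_max = 10.0
merge_distances = [1.0, 1.5, 2.0, 6.5, 8.0]

levels = [similarity(d, d_max) for d in merge_distances]
step = largest_drop(levels)

# Each step reduces the number of clusters by one, so the number of
# clusters that exist just before the step with the largest drop is:
clusters_before_drop = n_observations - step

print("Similarity levels:", [round(s, 2) for s in levels])
print("Largest drop at step", step + 1)
print("Candidate number of clusters:", clusters_before_drop)
```

Under the same assumed formula, a joining distance larger than dmax gives a negative similarity, which is the behavior described for Ward's method.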

Note

If you do not know what value to enter to specify the final partition, first perform the analysis using the default setting (1 cluster in the final partition). Minitab displays the results for all possible numbers of clusters. Use the results to determine a value to enter for the final partition. Then repeat the analysis and specify the final partition that you determined. For more information, go to Determine the final grouping of clusters.

Show dendrogram

Select to display a tree diagram that shows how clusters were formed at each step in the amalgamation procedure. The dendrogram allows you to view the similarity (or distance) values for the clusters at each step.

To change the default display of the dendrogram, click Customize.