K-means clustering starts with an initial grouping of observations into a predefined number of clusters, and then refines that grouping as follows:
- Minitab assesses each observation, moving it into the nearest cluster. The nearest cluster is the one whose centroid has the smallest Euclidean distance from the observation.
- When a cluster changes, by losing or gaining an observation, Minitab recalculates the cluster centroid.
- This process repeats until no more observations can be moved into a different cluster. At this point, all observations are in their nearest cluster by the previous criterion.
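The steps above can be sketched in a few lines of Python. This is a minimal illustration of the described procedure, not Minitab's implementation: the function names `centroid` and `kmeans` are hypothetical, cluster labels are 0-based, every cluster is assumed to start non-empty, and the rule that a cluster's last member is never moved is an assumption added to keep centroids defined.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(data, labels):
    """Refine an initial grouping until every observation is in its nearest cluster.

    data:   list of coordinate tuples, one per observation
    labels: initial cluster index (0..k-1) for each observation
    """
    labels = list(labels)
    k = max(labels) + 1
    centroids = [centroid([p for p, g in zip(data, labels) if g == c])
                 for c in range(k)]
    moved = True
    while moved:  # repeat until a full pass moves no observation
        moved = False
        for i, point in enumerate(data):
            old = labels[i]
            # Assumption: never empty a cluster by moving its last member.
            if labels.count(old) == 1:
                continue
            # Nearest cluster = smallest Euclidean distance to its centroid.
            new = min(range(k), key=lambda c: math.dist(point, centroids[c]))
            if new != old and math.dist(point, centroids[new]) < math.dist(point, centroids[old]):
                labels[i] = new
                moved = True
                # Both clusters changed, so recalculate both centroids.
                for c in (old, new):
                    centroids[c] = centroid([p for p, g in zip(data, labels) if g == c])
    return labels, centroids

data = [(0, 0), (0, 1), (10, 10), (10, 11), (0, 2)]
labels, centroids = kmeans(data, [0, 0, 1, 1, 1])
print(labels)     # [0, 0, 1, 1, 0]
print(centroids)  # [(0.0, 1.0), (10.0, 10.5)]
```

In this small example the last observation starts in the wrong group and is moved on the first pass; the second pass moves nothing, so the process stops.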
Unlike in hierarchical clustering of observations, two observations that have been joined in the same cluster can later be split into separate clusters.
K-means procedures work best when you provide good initial points for clusters. There are two ways to start the clustering process: specifying a number of clusters or supplying an initial partition column that contains group codes.
You might be able to start the process even when you do not have complete information to initially partition the data. Suppose you know that the final partition should have three groups, and that observations 2, 5, and 9 each belong in a different one of those groups. How you continue from here depends on whether you specify the number of clusters or supply an initial partition column.
- If you specify the number of clusters, you must rearrange your data in the worksheet to move observations 2, 5 and 9 to the top of the worksheet, and then specify 3 for Number of clusters.
- If you enter an initial partition column, you do not need to rearrange your data in the worksheet. In the initial partition worksheet column, enter group numbers 1, 2, and 3, for observations 2, 5, and 9, respectively, and enter 0 for the other observations.
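Both options can be prepared outside the worksheet. The following sketch assumes a 10-row worksheet and uses hypothetical helper names (`initial_partition_column`, `move_seeds_to_top`); it only builds the column of group codes or the row order, and does not call any Minitab function.

```python
def initial_partition_column(n_observations, seeds):
    """Build a column of group codes: the seed's group number, 0 elsewhere.

    seeds maps a 1-based observation number to its group number.
    """
    column = [0] * n_observations
    for observation, group in seeds.items():
        column[observation - 1] = group
    return column

def move_seeds_to_top(n_observations, seed_observations):
    """Return a 1-based row order with the seed observations listed first."""
    rest = [i for i in range(1, n_observations + 1) if i not in seed_observations]
    return list(seed_observations) + rest

# Option 1: reorder the rows, then specify 3 for Number of clusters.
print(move_seeds_to_top(10, [2, 5, 9]))
# [2, 5, 9, 1, 3, 4, 6, 7, 8, 10]

# Option 2: keep the row order and supply an initial partition column.
print(initial_partition_column(10, {2: 1, 5: 2, 9: 3}))
# [0, 1, 0, 0, 2, 0, 0, 0, 3, 0]
```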
The final partition depends to some extent on the initial partition that Minitab uses, so you might try several different initial partitions. According to Milligan, K-means procedures might not perform as well when the initial partition is chosen arbitrarily. However, if you provide good initial points, K-means clustering can be quite robust.
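When you try several initial partitions, you need some way to compare the final partitions they lead to. One common choice, assumed here rather than taken from the text above, is the total within-cluster sum of squares, where a smaller value indicates tighter clusters. The function name `within_cluster_ss` is hypothetical.

```python
def within_cluster_ss(data, labels):
    """Total squared Euclidean distance from each observation to its cluster centroid."""
    total = 0.0
    for group in set(labels):
        members = [p for p, g in zip(data, labels) if g == group]
        center = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum((x - m) ** 2 for p in members for x, m in zip(p, center))
    return total

data = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(within_cluster_ss(data, [1, 1, 2, 2]))  # 4.0
print(within_cluster_ss(data, [1, 2, 1, 2]))  # 100.0
```

Here the first partition groups nearby observations together and scores far lower than the second, so it would be preferred under this criterion.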