Enter your data for Cluster K-Means

Stat > Multivariate > Cluster K-Means

Enter your data

In Variables, enter the columns that contain the measurement data.

You must have two or more numeric columns, with each column representing a different measurement. You must delete rows with missing data from the worksheet before using this procedure. When you have a large data set with many missing values, it may be more convenient to subset your worksheet to exclude the rows with missing values, rather than delete each row manually. For more information, go to Overview for Subset Worksheet.
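Outside of Minitab, the same row filtering can be sketched in a few lines of Python. This is only an illustration of the idea, not how Subset Worksheet works internally. The function name drop_incomplete_rows and the missing values shown (None and NaN) are assumptions made for the example.

```python
import math


def drop_incomplete_rows(rows):
    """Return only the rows in which every value is present."""
    def is_missing(value):
        # Treat None and floating-point NaN as missing (an assumption for this sketch).
        return value is None or (isinstance(value, float) and math.isnan(value))

    return [row for row in rows if not any(is_missing(v) for v in row)]


# Hypothetical worksheet rows; the missing entries are invented for illustration.
rows = [
    (150, 13.5, 50400200, 18),
    (98, None, 45665230, 12),
    (79, 12.0, 19800800, 7),
    (122, 11.4, float("nan"), 13),
]
complete_rows = drop_incomplete_rows(rows)
print(len(complete_rows), "of", len(rows), "rows are complete")
```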

In this worksheet, columns C1 through C4 contain measurements for each variable that describes a characteristic of a company. The Initial column indicates the initial cluster membership for the observations. Notice that only the non-zero values in the Initial column are used to define each initial cluster (1, 2, and 3). The remaining observations with an initial value of 0 are not assigned to an initial cluster but instead are assigned to a cluster during the clustering algorithm process, based on the cluster centroid they are closest to.
C1       C2              C3        C4     C5
Clients  Rate of Return  Sales     Years  Initial
150      13.5            50400200  18     1
98       11.7            45665230  12     2
79       12.0            19800800  7      0
122      11.4            42560000  13     0
143      12.4            47635980  15     0
49       9.8             22342600  6      3
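As a rough illustration of how the Initial column is used, the following Python sketch builds one starting centroid per non-zero label and then gives each 0 row the label of its nearest centroid. It is a single pass with Euclidean distance on the raw values, written under those assumptions. It is not Minitab's full algorithm, which continues to update the clusters. The function names are hypothetical.

```python
import math

# Worksheet rows from the example above: Clients, Rate of Return, Sales, Years.
rows = [
    (150, 13.5, 50400200, 18),
    (98, 11.7, 45665230, 12),
    (79, 12.0, 19800800, 7),
    (122, 11.4, 42560000, 13),
    (143, 12.4, 47635980, 15),
    (49, 9.8, 22342600, 6),
]
initial = [1, 2, 0, 0, 0, 3]  # the Initial column (C5)


def seed_centroids(rows, initial):
    """Average the rows that share each non-zero initial label."""
    groups = {}
    for row, label in zip(rows, initial):
        if label != 0:
            groups.setdefault(label, []).append(row)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*members))
        for label, members in groups.items()
    }


def assign_unlabeled(rows, initial, centroids):
    """Keep non-zero labels; give each 0 row the label of its nearest centroid."""
    result = []
    for row, label in zip(rows, initial):
        if label == 0:
            label = min(centroids, key=lambda c: math.dist(row, centroids[c]))
        result.append(label)
    return result


centroids = seed_centroids(rows, initial)
assignments = assign_unlabeled(rows, initial, centroids)
print(assignments)
```

On these raw values, the three rows with an initial value of 0 join clusters 3, 2, and 2. The Sales column decides the outcome almost entirely, because its values are far larger than those of the other variables.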

Specify the initial partition

Indicate the starting cluster designations. K-means procedures work best when you provide good starting points for clusters. Base the initial clustering on practical and/or engineering knowledge about the observations being clustered. For more information, go to How the cluster K-means process starts.

  • Number of clusters: Select if you have no a priori knowledge of initial clusters. Enter a value to specify the number of clusters to form. The initial clusters are the first rows of data in the worksheet. For example, if you enter 3, then the first three rows of data are the initial cluster centroids.
  • Initial partition column: Select to specify a column that contains the initial cluster membership. Use positive integers for the observations that define the initial clusters and use zeroes for the remaining observations.
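The two options can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the check that at least one observation has a positive label is an assumption of the sketch, not a documented Minitab rule.

```python
def seeds_from_first_rows(rows, k):
    """Number of clusters option: the first k rows are the starting centroids."""
    if not 1 <= k <= len(rows):
        raise ValueError("k must be between 1 and the number of rows")
    return [tuple(row) for row in rows[:k]]


def check_initial_column(initial):
    """Initial partition column option: allow only zeros and positive integers.

    Returns the sorted list of distinct cluster labels found in the column.
    """
    for value in initial:
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"invalid cluster label: {value!r}")
    labels = sorted({v for v in initial if v > 0})
    if not labels:
        raise ValueError("at least one observation needs a positive label")
    return labels


# Small made-up rows, used only to show the calls.
example_rows = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]
first_three = seeds_from_first_rows(example_rows, 3)
labels = check_initial_column([1, 2, 0, 0, 0, 3])
print(first_three, labels)
```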

Standardize variables

Select Standardize variables to have Minitab weight all the variables equally. Standardizing is good practice in most cases, and is particularly important when the variables use different scales. Suppose variable A is on a scale in dollars from $0 to $10,000,000, and variable B is a ratio on a scale from 0.0 to 1.0. If the variables are not standardized, then the cluster procedure places much more weight on variable A than on variable B due to the larger values of its scale, which is probably not the desired result. Therefore, the variables should be standardized.
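To see how strongly the larger scale can dominate, consider this small Python sketch. The two observations are hypothetical, and the sketch assumes Euclidean distance. It measures what fraction of the squared distance comes from variable B.

```python
import math

# Hypothetical observations: (variable A in dollars, variable B as a ratio).
p = (2_000_000.0, 0.10)
q = (2_500_000.0, 0.90)

# Squared difference contributed by each variable.
squared_diffs = [(a - b) ** 2 for a, b in zip(p, q)]
distance = math.sqrt(sum(squared_diffs))

# Fraction of the squared distance that comes from variable B.
share_b = squared_diffs[1] / sum(squared_diffs)
print(f"distance = {distance:.2f}, share from B = {share_b:.2e}")
```

Although B differs by 0.8 on a scale from 0.0 to 1.0, it contributes almost nothing to the distance, so the unstandardized clusters would reflect variable A alone.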

Minitab standardizes each variable by subtracting its mean and dividing by its standard deviation before calculating the distance matrix. When you standardize variables, every variable has a mean of 0, so the grand centroid is 0 for all variables.
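The calculation can be sketched in Python with the worksheet values from this topic. The sketch assumes the sample standard deviation (divisor n - 1), which this topic does not specify, and the function name standardize is hypothetical.

```python
import statistics

# Worksheet rows from this topic: Clients, Rate of Return, Sales, Years.
rows = [
    (150, 13.5, 50400200, 18),
    (98, 11.7, 45665230, 12),
    (79, 12.0, 19800800, 7),
    (122, 11.4, 42560000, 13),
    (143, 12.4, 47635980, 15),
    (49, 9.8, 22342600, 6),
]


def standardize(rows):
    """Subtract each column's mean and divide by its sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # assumption: divisor n - 1
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, sds))
        for row in rows
    ]


z_rows = standardize(rows)

# The grand centroid is the mean of every column over all observations.
grand_centroid = [statistics.mean(col) for col in zip(*z_rows)]
print([round(abs(c), 6) for c in grand_centroid])
```

Each entry of the grand centroid is 0 up to floating-point rounding, and every standardized column has a standard deviation of 1, so no variable outweighs another because of its units.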