Data considerations for Cluster Observations

To ensure that your results are valid, consider the following guidelines when you collect data, perform the analysis, and interpret your results.

You can have raw data or a matrix of distances

Usually, you use raw data for this analysis. Each row contains measurements on a single item or subject. You must have two or more numeric columns, with each column representing a different measurement. You must delete rows that have missing data from the worksheet before you use this analysis.
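The requirement to remove rows with missing data can be illustrated outside Minitab. The following is a minimal Python sketch, not Minitab's own procedure. The `rows` data and the helper names `is_missing` and `drop_incomplete_rows` are hypothetical and used only for illustration.

```python
import math

# Hypothetical worksheet: each row is one item, each column is one measurement.
# None or NaN marks a missing value.
rows = [
    [5.1, 3.5, 1.4],
    [4.9, None, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, float("nan")],
]

def is_missing(value):
    """Return True if the value is None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_incomplete_rows(data):
    """Keep only the rows in which every measurement is present."""
    return [row for row in data if not any(is_missing(v) for v in row)]

complete_rows = drop_incomplete_rows(rows)
print(complete_rows)
```

Only the first and third rows remain, because the other two rows each contain a missing measurement.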

If you store an n x n distance matrix, where n is the number of observations, then you can use the matrix for the analysis. The (i, j) entry in the matrix is the distance between observations i and j. If you use a distance matrix, then Minitab cannot calculate statistics for the final partition.
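To make the layout of a distance matrix concrete, the following Python sketch builds one from raw data. It assumes Euclidean distance, which is only one possible distance measure, and the `observations` data and function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two observations of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(data):
    """Build the n x n matrix whose (i, j) entry is the distance
    between observation i and observation j."""
    n = len(data)
    return [[euclidean(data[i], data[j]) for j in range(n)] for i in range(n)]

# Hypothetical raw data: three observations, two measurements each.
observations = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(observations)

for row in D:
    print(row)
```

The resulting matrix is symmetric and has zeros on the diagonal, because each observation is at distance 0 from itself.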

The data must be numeric

To form the clusters, this analysis calculates the distance between observations, which cannot be measured between levels of a categorical variable. To use a categorical variable in the analysis, you must first convert the text values to a numerical scale. For example, an analyst measures customer satisfaction using the categories "Very satisfied", "Satisfied", "Unsatisfied", and "Very unsatisfied". To perform cluster observations, the analyst recodes these categories as +2, +1, −1, and −2. The distances between observations can now be calculated for the analysis. Alternatively, you could split the worksheet into separate worksheets for each level of the categorical variable and cluster the observations at each level. For more information on splitting the worksheet, go to Overview for Split Worksheet.
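The recoding in the customer satisfaction example can be sketched in Python as follows. The mapping uses the scale from the example. The `responses` list and the variable names are hypothetical, and the sketch does not show how Minitab recodes data.

```python
# Numerical scale taken from the customer satisfaction example.
satisfaction_scale = {
    "Very satisfied": 2,
    "Satisfied": 1,
    "Unsatisfied": -1,
    "Very unsatisfied": -2,
}

# Hypothetical survey responses, one per customer.
responses = ["Satisfied", "Very unsatisfied", "Very satisfied", "Unsatisfied"]

# Replace each text category with its numeric code.
recoded = [satisfaction_scale[r] for r in responses]
print(recoded)

# After recoding, a distance between two customers can be calculated,
# here as the absolute difference on this single variable.
distance_first_two = abs(recoded[0] - recoded[1])
print(distance_first_two)
```

Any numerical scale implies a spacing between the categories, so the chosen codes affect the distances that the analysis calculates.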