Data considerations for Cluster Variables

To ensure that your results are valid, consider the following guidelines when you collect data, perform the analysis, and interpret your results.

You can have raw data or a matrix of distances

Usually, you use raw data for this analysis. Each row contains measurements on a single item or subject. You must have two or more numeric columns, with each column representing a different measurement. You must delete rows with missing data from the worksheet before using this analysis.
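The raw-data requirements above can be checked before the data reach the worksheet. The following is a minimal sketch in Python, not a Minitab feature: the `worksheet` values, the use of `None` to mark a missing value, and the helper names `drop_incomplete_rows` and `is_numeric` are all assumptions made for illustration.

```python
# Hypothetical worksheet: each inner list is one row (one item or subject),
# and each position is one column (one measurement).
# None marks a missing value; this convention is an assumption of the sketch.
worksheet = [
    [5.1, 3.5, 1.4],
    [4.9, None, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, None],
    [5.5, 2.3, 4.0],
]


def drop_incomplete_rows(rows):
    """Return only the rows in which every value is present."""
    return [row for row in rows if all(value is not None for value in row)]


def is_numeric(value):
    """True for int or float values; bool is excluded on purpose."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


complete_rows = drop_incomplete_rows(worksheet)

# The analysis needs two or more numeric columns.
column_count = len(complete_rows[0]) if complete_rows else 0
all_numeric = all(is_numeric(v) for row in complete_rows for v in row)
ready_for_clustering = column_count >= 2 and all_numeric
```

In this sketch, a row is dropped whenever any of its values is missing, which mirrors the instruction to delete rows with missing data before running the analysis.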

Alternatively, if you have a stored p × p distance matrix, where p is the number of variables, you can use the matrix instead of raw data. The (i, j) entry in the matrix is the distance between variables i and j. If you use a distance matrix, Minitab cannot calculate statistics for the final partition.
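To illustrate the shape of such a matrix, the sketch below builds a p × p matrix of distances between variables in Python. The distance used here, 1 minus the Pearson correlation, is an assumption chosen for the example; the text above does not specify a distance measure. The column values and the function names `pearson` and `distance_matrix` are likewise hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric columns.

    Raises ZeroDivisionError if either column is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def distance_matrix(columns):
    """Build a p x p matrix whose (i, j) entry is the distance between
    variables i and j. Distance here is 1 - correlation (an assumption)."""
    p = len(columns)
    return [
        [0.0 if i == j else 1.0 - pearson(columns[i], columns[j])
         for j in range(p)]
        for i in range(p)
    ]


# Hypothetical data: three variables (columns), four observations each.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
]
distances = distance_matrix(columns)
```

The result has one row and one column per variable, a zero diagonal, and is symmetric, because the distance between variables i and j equals the distance between j and i.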

The data must be numeric

To form the clusters, this analysis calculates the distance between variables. Distance cannot be measured between levels of a categorical variable, so to use a categorical variable in the analysis, you must first convert the text values to a numerical scale. For example, an analyst measures customer satisfaction using the categories "Very satisfied", "Satisfied", "Unsatisfied", and "Very unsatisfied". Before clustering the variables, the analyst recodes these categories as +2, +1, −1, and −2. The distances between variables can now be calculated for the analysis. Alternatively, you could split the worksheet into separate worksheets for each level of the categorical variable and cluster variables at each level. For more information on splitting the worksheet, go to Overview for Split Worksheet.
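The recoding in the customer satisfaction example can be sketched in Python as follows. The category labels and numeric codes come from the example above; the `responses` data and the names `SATISFACTION_SCALE` and `recode` are hypothetical and are not part of Minitab.

```python
# Numeric codes for the satisfaction categories, as in the example above.
SATISFACTION_SCALE = {
    "Very satisfied": 2,
    "Satisfied": 1,
    "Unsatisfied": -1,
    "Very unsatisfied": -2,
}


def recode(responses, scale):
    """Convert text categories to numbers, failing loudly on unknown labels."""
    unknown = sorted({r for r in responses if r not in scale})
    if unknown:
        raise ValueError(f"No numeric code defined for: {unknown}")
    return [scale[r] for r in responses]


# Hypothetical survey responses, one per customer.
responses = ["Satisfied", "Very unsatisfied", "Very satisfied", "Unsatisfied"]
satisfaction_scores = recode(responses, SATISFACTION_SCALE)
```

Raising an error for an unrecognized label, rather than silently skipping it, avoids introducing missing values into a column that the analysis requires to be complete.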