Data considerations for Cluster K-Means

To ensure that your results are valid, consider the following guidelines when you collect data, perform the analysis, and interpret your results.

You must use raw data
Each row contains measurements on a single item or subject. You must have two or more numeric columns, with each column representing a different measurement. You must delete rows with missing data from the worksheet before using this analysis.
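The requirement to remove rows with missing data can be illustrated outside the software with a minimal Python sketch. It assumes the worksheet has been exported as a list of rows and that missing cells appear as None or NaN; the names worksheet, is_missing, and drop_incomplete_rows are hypothetical and are not part of the product.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing; any other value counts as present."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete_rows(rows):
    """Return only the rows in which every measurement is present."""
    return [row for row in rows if not any(is_missing(v) for v in row)]


# Hypothetical worksheet: each row is one item, each column one measurement.
worksheet = [
    [5.1, 3.5, 1.4],
    [4.9, None, 1.4],          # missing second measurement
    [6.2, 2.9, float("nan")],  # missing third measurement
    [5.9, 3.0, 5.1],
]

complete_rows = drop_incomplete_rows(worksheet)
print(len(worksheet) - len(complete_rows), "row(s) removed")
```

Only the first and last rows remain in complete_rows, because each of the other two rows lacks at least one measurement.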
The clustering process works best when you base the initial clustering on practical or engineering knowledge
However, if you have no a priori knowledge of initial clusters, you can perform the analysis without initializing the process by indicating only the number of clusters to form. For more information, go to Enter your data for Cluster K-Means.
To initialize the clustering process using a data column, you must have a column of values to indicate cluster membership
The initialization column must contain only zeros and positive, consecutive integers, and it cannot consist entirely of zeros. Initially, each observation is assigned to the cluster identified by the corresponding value in this column. A value of zero means that the observation is not initially assigned to any cluster. The number of distinct positive integers in the initialization column equals the number of clusters in the final partition.
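The rules for the initialization column can be checked before running the analysis. The following Python sketch is one possible reading of those rules, not the software's own validation: it interprets "consecutive" as meaning no gaps between the smallest and largest positive label, which is an assumption, and the function name check_initial_partition is hypothetical.

```python
def check_initial_partition(labels):
    """Validate an initialization column and return the implied cluster count.

    Assumed rules: every value is zero or a positive integer, at least one
    value is positive, and the positive values have no gaps between them.
    """
    for value in labels:
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError("values must be zeros or positive integers")

    positive = sorted({value for value in labels if value > 0})
    if not positive:
        raise ValueError("the column must not contain all zeros")
    if positive[-1] - positive[0] + 1 != len(positive):
        raise ValueError("positive values must be consecutive")

    # The number of distinct positive integers is the number of clusters.
    return len(positive)


# Hypothetical column: zeros mark observations that start out unassigned.
initial_column = [1, 1, 0, 2, 3, 0, 2]
n_clusters = check_initial_partition(initial_column)
print(n_clusters)
```

In this example the column uses the labels 1, 2, and 3, so the final partition would contain three clusters.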
Outliers can significantly influence the results
Outliers, which are unusually large or small values in your data, can affect the clustering results. The clusters are often larger when outliers are not removed, and the resulting solution may not seem logical. Investigate outliers and remove any values that are due to measurement or recording errors. Extreme outliers may also indicate observations that are fundamentally different from all the other observations in your data set, perhaps because of a special cause. If there are practical reasons to exclude extreme outliers, consider rerunning the analysis without them to see how they influence the results.
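One way to find values that deserve investigation is to screen each measurement column before clustering. The Python sketch below uses the common 1.5 x IQR rule, which is an assumption made here for illustration and is not a method prescribed by this guideline; the function name flag_outliers and the sample data are hypothetical. Flagged values are candidates to investigate, not values to delete automatically.

```python
import statistics


def flag_outliers(values, multiplier=1.5):
    """Return the indices of values outside the IQR fences for one column.

    The 1.5 x IQR rule is an assumed screening method, not the software's.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    lower = q1 - multiplier * spread
    upper = q3 + multiplier * spread
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical measurement column with one suspicious reading.
column = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2]
flagged = flag_outliers(column)
print("Rows to investigate:", flagged)
```

Here the reading of 55.0 is flagged. If the investigation shows that it is a recording error, the row can be removed; otherwise, running the analysis with and without the row shows how much the value influences the clusters.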